
On the use of translation corpora in contrastive linguistics: A case study of
impersonalization in English and German

Article  in  Languages in Contrast · April 2015


DOI: 10.1075/lic.15.1.02gas


Volker Gast
Friedrich Schiller University Jena


John Benjamins Publishing Company

This is a contribution from Languages in Contrast 15:1


© 2015. John Benjamins Publishing Company
On the use of translation corpora
in contrastive linguistics
A case study of impersonalization in English
and German*

Volker Gast
Friedrich Schiller University Jena

This article argues for a type of corpus-based contrastive research that is item-
specific, predictive and hypothesis-driven. It reports on a programmatic study
of the ways in which impersonalization is expressed in English and German.
Impersonalization is taken to be epitomized by human impersonal pronouns like
German man (e.g. Man lebt nur einmal ‘You/one only live(s) once’). English does
not have a specialized impersonal pronoun like Germ. man and uses a variety
of strategies instead. The question arises what determines the choice of a given
impersonalization strategy in English. Drawing on relevant theoretical work and
using data from a translation corpus (Europarl), variables potentially affecting
the distribution of impersonalization strategies in English are identified, and
their influence on the choice of a strategy is determined. By testing hypotheses
derived from theoretical work and using multivariate quantitative methods of
analysis, the study is intended to illustrate how bridges can be built between
fine-grained semantic analyses, on the one hand, and more coarse-grained, but
empirically valid, corpus research, on the other.

Keywords: impersonal, generalizing, veridical, non-veridical, modal, English/German

*  The present study grew out of a project on human impersonal pronouns, funded by the
German Science Foundation (Ga/1288–6). The financial support from this institution is grate-
fully acknowledged. Some of the data used were coded by members of this project, especially by
Lisa Deringer. I am also greatly indebted to the other members of the project, Florian Haas and
Olga Rudolf, for their cooperation and input. Moreover, I wish to thank the audience of the 7th
International Contrastive Linguistics Conference (Ghent, July 2013) as well as two anonymous
reviewers and the editor of this issue for helpful comments and criticism.

Languages in Contrast 15:1 (2015), 4–33.  doi 10.1075/lic.15.1.02gas


issn 1387–6759 / e-issn 1569–9897 © John Benjamins Publishing Company

1. Introduction

1.1 Quantitative corpus-based research in contrastive linguistics

There is a growing consensus that contrastive linguistics, just like language-specific
research, can profit considerably from the use of natural language corpora
(cf. Johansson 1998, 2000, Granger et al. 2003, Schmied 2008, Gast 2012, among
many others). While considerable progress has been made in the development of
relevant resources (such as the Oslo Multilingual Corpus),1 the methodological
branch of corpus-based contrastive linguistics is still in its infancy. The present article
is intended to contribute to the development of a methodology for an
empirical contrastive linguistics by arguing for a research design that is geared towards the
exploitation of translation corpora. It has the following features:
– It aims at an item-specific (rather than corpus-level) analysis.
– It aims at making predictions (beyond describing the data).
– It is hypothesis-driven (rather than data-driven).
Item-specificity concerns the comparanda, the things to be compared. While com-
parable corpora (corpora that have been independently compiled using similar
sampling criteria) are most naturally used for aggregative, text-level comparisons,
translation corpora (additionally) allow us to compare specific sentences, or even
sub-sentential components, across languages. We can not only determine how often
a given linguistic structure x is found in corpus A, in comparison to a linguistic
structure y in corpus B; we can, ideally, also determine what type of structure
in corpus B corresponds to structure x in corpus A, and under what
conditions such correspondences are found.
Item-specificity is a precondition for the second property mentioned above,
i.e., the aim of moving from observation to prediction. Given a pair of translation
equivalents S1 and S2, we can try and predict — on the basis of specific features
of S1 — what type of structure we will find in S2. This is of course an ambitious
goal, and it is certainly only partially attainable; but even predictions that are only
partially correct are potentially valuable, e.g. insofar as they allow us to compare
alternative theories in terms of their predictive power. This point relates directly
to the third design feature mentioned above, i.e., a hypothesis-driven, rather than
data-driven approach. The study presented in this article is intended to illustrate
this type of research by deriving hypotheses from insights gained in theoretical
work on the topic of inquiry (impersonalization), and testing them on a sample of
manually annotated examples from a translation corpus.

1.  The Oslo Multilingual Corpus was developed at the University of Oslo under the direction of
S. Johansson and C. Fabricius-Hansen between 1999 and 2008.

1.2 A corpus-based study of impersonalization

Given the ambitiousness of the undertaking, the present study can only be pro-
grammatic. In view of the wide range of factors influencing the choice of a given
structure (by a given speaker) to render a specific meaning, predictions concern-
ing the linguistic matter used in any given case can only be approximations. Let
us consider an example in order to illustrate the idea of an item-specific, predictive
and hypothesis-driven problem of contrastive corpus linguistics. Consider the
German sentence (1) with its interlinear glosses (an English translation will be
provided below), which has been taken from the Europarl corpus (cf. Section 2 be-
low). It is a German sentence that was translated from Spanish.2
(1) Herr Präsident!
Mr president
Es ist zweifellos nichts Neues, wenn man sagt,
it is without a doubt nothing new if imps says
dass sich die Europäische Union
that refl the European Union
an einem Scheideweg befindet.
at a crossroads stands  [Europarl, v7]

The corpus also contains an English version of this sentence. Given this informa-
tion, what kind of prediction can we make with respect to the form of the English
sentence in the corpus?
As the possibilities of translating a sentence from one language into another
are manifold, this is of course an ambitious question to ask. We can be more mod-
est, however. Let us focus on the impersonal pronoun man, glossed as ‘imps’ in
(1). Man is used in a broad range of contexts. We will return to its distribution in
Section 4 below. For the time being, suffice it to say that there is no specific English
pronoun corresponding to Germ. man (for contrastive studies of impersonal pro-
nouns in Germanic languages, see for instance Johansson 2004, van der Auwera et
al. 2012). How can the meaning of (1) be rendered in English, then?
Let us rephrase this question from a theoretical point of view: What kind of
information do we need in order to determine the type of structure, or at least
a set of possible structures, to be used in an English equivalent of (1)? The most
important prerequisite seems to be some representation of its sentence meaning.
This is less trivial than it may appear to be, as pronouns like man are highly
context-sensitive in their interpretation. Sometimes they correspond to a universal
quantifier (e.g. Man lebt nur einmal ‘You/everyone only lives once’), sometimes
to an existential quantifier (Man hat geklopft ‘They knocked at the door’). We will
return to this question in Section 4.1 (cf. also van der Auwera et al. 2012, Gast and
van der Auwera 2013 and references cited there).

2.  “Señor Presidente, decir que la Unión Europea se encuentra en una encrucijada no es, sin
duda alguna, afirmar nada nuevo.”

A second variable that needs to be considered when predicting cross-linguis-
tic correspondences is the syntactic context. In (1), the pronoun man occurs in a
conditional clause, and this may well have an influence on the type of structure
found in English. As the data used for this study show, such sentences often seem
to be rendered using a participle in English. Finally, in a translation corpus it is
likely that there will be translation effects, and the fact that the sentence is a trans-
lation from Spanish might have a bearing.
What we minimally need in order to predict the type of English sentence cor-
responding to (1) in the Europarl corpus is, thus, a theory of the interpretation
and distribution of the German impersonal pronoun man. On the basis of such a
theory we can then formulate hypotheses about expected cross-linguistic corre-
spondences in the corpus. By testing these hypotheses we can furthermore assess
the accuracy of the theories themselves.
While the present article falls short of these goals in several respects, it is in-
tended to make a modest contribution to such a predictive, hypothesis-driven
contrastive corpus linguistics. Before we proceed, we should, however, provide the
English counterpart of (1) actually found in the Europarl corpus. It is given in (2).
The pronoun man is not rendered at all, and an infinitive is used instead.
(2) Mr President, to say that the European Union is at a crossroads is no doubt
nothing new.  [Europarl, v7]

As will be seen, sentences with the German impersonal pronoun man are only
rarely rendered in English using a similar pronoun (like one, for instance). What
we typically find are non-finite, sometimes nominal or adjectival constructions,
or passives (cf. Sect. 3.1). Rather than speaking about ‘impersonal pronouns’, we
should therefore use the functional concept of ‘impersonalization’ as a tertium
comparationis. It is defined in (3) (from Gast and van der Auwera 2013).
(3) Impersonalization is the process of filling an argument position of a
predicate with a variable ranging over sets of human participants without
establishing a referential link to any entity from the universe of discourse.
(Gast and van der Auwera 2013: 136)

Under the assumption that Germ. man epitomizes impersonalization as defined in
(3), and assuming that argument structure is basically preserved in (high quality)
translation — as it concerns the question of who does what to whom, and thus the
most central pieces of information, at least in political speech — we can use Germ.
man as a ‘filter’ for identifying sentences expressing impersonalization as defined
in (3). The (rather difficult) question “How does English express impersonaliza-
tion?” can accordingly be operationalized by asking (the much simpler question)
“What types of structures does English use to express a propositional content that
is expressed using the pronoun man in German?”. German man is thus primarily
used as a ‘methodological anchor’, a formal correlate of the functional operation
of impersonalization.
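
The use of man as a 'filter' can be illustrated with a small selection step over sentence-aligned data. The following Python sketch is purely illustrative: the function name `select_man_pairs`, the representation of aligned sentences as (German, English) tuples and the second toy pair are assumptions made for the sake of the example, not part of the Europarl distribution or of the extraction procedure used for this study. The pattern matches man only as a whole word, so that forms like Mann are not selected; any hits would still have to be checked manually.

```python
import re

# Whole-word match for the pronoun "man"; "Mann" or "manche" must not match.
# IGNORECASE also covers sentence-initial "Man".
MAN_PATTERN = re.compile(r"\bman\b", re.IGNORECASE)


def select_man_pairs(pairs):
    """Return the (German, English) pairs whose German side contains 'man'."""
    return [(ge, en) for ge, en in pairs if MAN_PATTERN.search(ge)]


# Toy input: examples (1)/(2) from the text plus one hypothetical pair
# without the pronoun (assumption, not corpus data).
pairs = [
    (
        "Es ist zweifellos nichts Neues, wenn man sagt, dass sich die "
        "Europäische Union an einem Scheideweg befindet.",
        "To say that the European Union is at a crossroads is no doubt "
        "nothing new.",
    ),
    ("Der Mann hat geklopft.", "The man knocked."),
]

selected = select_man_pairs(pairs)
```

Only the first pair is retained here; its English side is then the heterophrase whose structure is to be classified.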

1.3 Overview of the article

Following this introductory section, some remarks about the corpus used for the
present study (Europarl) are made in Section 2. Section 3 provides a descriptive
overview of the data. In Section 4, the variables for the empirical study are de-
scribed and explained. Section 5 contains a multifactorial analysis using multino-
mial logistic regression. A model is built stepwise (bottom-up) (Sections 5.1–5.3),
and the predictions that the model makes are exemplified (Section 5.4). Section 6
contains the conclusions.

2. Using a translation corpus in contrastive linguistics: The Europarl corpus

The Europarl corpus contains the proceedings of the European Parliament since
1996 (cf. Koehn 2005, Cartoni and Meyer 2012, Cartoni et al. 2013). The number
of languages rose from eleven in 1996 to 24 in 2014, reflecting the expansion of the
European Union during that time. As far as corpus design is concerned, it is impor-
tant to note that the texts were mostly translated manually and directly (i.e., n-to-n)
from 1996 to 2003. Since 2003, many texts have been translated into English first,
and then into the other languages of the European Union (cf. Cartoni et al. 2013: 35).
The corpus is freely accessible on the internet (http://www.statmt.org/europarl/).
The Europarl corpus contains planned spoken (political) speech and conse-
quently represents a highly formal register. In order to estimate the degree of reg-
ister specificity of the structures under investigation, I will therefore use some ad-
ditional data from the OpenSubtitles corpus (cf. http://opus.lingfil.uu.se), which
contains movie subtitles from 59 languages with a total number of 1,211 bitexts
(in April 2014).

For the present study, only data from English and German will be relevant.
Three types of pairs of texts (or sentences) can be distinguished, as far as the pa-
rameter ‘direction of translation’ is concerned: (i) English sentences and their
translations into German, (ii) German sentences and their translations into
English, and (iii) pairs of English and German sentences that were both translated
from some third language. For a contrastive study we obviously need to be able
to keep these cases apart, which requires some terminology. As I am not
aware of an established term covering all three cases, I will coin a neologism: any
sentence s corresponding to some other sentence s’ from the translation corpus will
be called a heterophrase of s’. Heterophrase is a relational term, i.e., a sentence s_i is
always a heterophrase of some other sentence s_j. Heterophrases are pairs or sets of
sentences that are intended to render (approximately) the same meaning, in the same
context, irrespective of the source and direction of translation.
When a sentence s_1 is an original, and another sentence s_2 is a translation of s_1,
s_1 will be said to be the ‘original’ of s_2, and s_2 the ‘translation’ of s_1. If two
heterophrases are translations from a third language, they will be metaphorically called
‘sisters’. Two sentences s_i and s_j are sisters if they are heterophrases that were
not translated directly from one language into the other.
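
The three relations between heterophrases can be made explicit in a short sketch. The Python code below is a minimal illustration under stated assumptions: the class `Heterophrase`, its fields and the function `relation` are hypothetical names introduced here, and the language codes are arbitrary labels. The sketch covers only the three cases distinguished above and does not model indirect translation via a pivot language.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Heterophrase:
    text: str
    language: str         # language of this version, e.g. "de" or "en"
    source_language: str  # language in which the sentence was first uttered


def relation(s_i, s_j):
    """Return the role of s_i relative to its heterophrase s_j.

    'original'    : s_i is the original, s_j a translation of it
    'translation' : s_j is the original, s_i a translation of it
    'sister'      : both were translated from a third language
    """
    if s_i.source_language != s_j.source_language:
        raise ValueError("not heterophrases: different source sentences")
    if s_i.language == s_i.source_language:
        return "original"
    if s_j.language == s_j.source_language:
        return "translation"
    return "sister"


# Examples (1) and (2): both versions were translated from Spanish.
ge = Heterophrase("Es ist zweifellos nichts Neues, wenn man sagt, ...", "de", "es")
en = Heterophrase("Mr President, to say that the European Union ...", "en", "es")
pair_relation = relation(ge, en)
```

For the pair in (1) and (2), the function returns 'sister', since neither version is in the source language.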

3. Impersonalization in English and German as reflected in the Europarl corpus

The notion of ‘impersonalization’ was introduced in (3) above, and it was pointed
out that English does not have a dedicated impersonal pronoun corresponding to
German man (cf. also Johansson 2004, van der Auwera et al. 2012). The question
arises what types of strategies we find in English corresponding to this pronoun.
The present study addresses this question on the basis of a random sample of 399
examples of man (both originals and translations) with their English heterophras-
es.3 This section provides a descriptive overview of the English strategies of ex-
pressing impersonalization, starting with the more general level of heterophrases
in Section 3.1, and moving on to translation effects in Section 3.2.

3.  The sample has been made available at: http://www.uni-jena.de/~mu65qev/data/.


3.1 Heterophrases of man

Let us first get an idea of the types of strategies that we find in the English versions
from the sample. Their frequencies are shown in Figure 1.4
We will not discuss every single strategy in the following but focus on the
most common types.5 As Figure 1 shows, the most frequent strategy is the passive,
illustrated in (4).
(4) man → passive
Ge: Ich gratuliere Ihnen zu dem, was man im deutschen Parlamentarismus
in Ihrem Fall unzulässigerweise eine Jungfernrede nennt.
En: I would like to congratulate you on what is referred to in German
parliamentary-speak, inappropriately in your case, as a maiden speech.
 [Europarl, v7]

In second to fourth position, we find the first person plural pronoun we, nominal
constructions and infinitives. Examples of we and nominal expressions are given
in (5) and (6); an example of an infinitive was already given in (2) above.

[Figure 1: bar chart; y-axis scale 0–50; x-axis categories: psv, we, nom, inf, one, ptcp, you, rphr, they, people, I, 3sg, anyone, imp, some people]
Figure 1.  English heterophrases of sentences with impersonal man in German (Europarl).

4.  All diagrams were generated with the software environment R (R Core Team 2013). Some
of them were edited manually.

5.  Here is a list of some perhaps not self-explanatory abbreviations used in Figure 1: ptcp: par-
ticiple; rphr: complete rephrasing; imp: imperative.


(5) man → we
Ge: Je lockerer man mit den Sitten umgeht, desto stärker hält man die Hand
auf der Brieftasche.
En: Basically, the more we lose our grip on morals, the more we tighten our
grip on our wallet.  [Europarl, v7]
(6) man → nominal expression
Außerdem konzentrierte man sich in diesen vergangenen Jahren
grundsätzlich auf die Sicherheit im Passagierverkehr.
‘During these years, the emphasis has mainly been on the safety of passenger
transport.’  [Europarl, v7]

The English impersonal pronoun one is found in a relatively low number of cases
(cf. (7) for an example), considering the (high) register of the corpus. Impersonal
you is much more common in spoken English (cf. also below for the data from
the OpenSubtitles corpus), but it is also sporadically found in the Europarl cor-
pus (cf. (8)).
(7) man → one
Ge: Zur Beseitigung dieser Ursachen muß man sich die Situation in den
betreffenden Ländern vergegenwärtigen.
En: In order to remove these causes, one has to look at what is happening in
the countries in question. [Europarl, v7]
(8) man → you
Ge: Man kann sich nicht vorbereiten, wenn man hier eine Erklärung hört
und gar nicht weiß, was Inhalt einer solchen Erklärung ist.
En: You cannot prepare if you hear a statement in this House and have no
idea of its content. [Europarl, v7]

Finally, we also find participial constructions and gerunds. (9) is an example of a
present participle, (10) contains a past passive participle.
(9) man → present participle
Ge: Herr Präsident, Frau Kommissarin, liebe Kolleginnen und Kollegen!
Wenn man die Flugblätter sieht, die in den letzten Wochen verteilt
wurden, dann denkt man, wir diskutieren hier über den Öko-Gau oder
über den Tod der Automobilindustrie in Europa.
En: Mr President, Commissioner, reading the leaflets which have been
distributed in recent weeks, one would think that we were talking about
the ultimate environmental catastrophe or the death of the car industry
in Europe. [Europarl, v7]

(10) man → past participle
Ge: Alle Informationen, die man in den letzten sechs Monaten sowohl
von Einzelpersonen als auch von Gruppen erhalten habe, um eine
Untersuchung der NATO-Aktionen während des Kosovo-Konflikts zu
beantragen, wurden vom Ankläger registriert.
En: All information received during the past six months, either from
individuals or from groups requesting an investigation of NATO’s
actions during the Kosovo conflict, have been recorded by the
Prosecutor. [Europarl, v7]

The question of what determines and conditions the use of any given strategy is
the central topic of this article, and we will turn to it in Sections 4 and 5. Before
going into some more detail, let us compare the frequencies of impersonalization
strategies found in Europarl to those in the OpenSubtitles corpus, to get an idea of
the register specificity of the distribution shown in Figure 1. Figure 2 summarizes
the frequencies of impersonalization strategies in the OpenSubtitles corpus, based
on a randomly extracted sample of 500 examples.
In the OpenSubtitles corpus, you is by far the most frequent strategy, followed
by passives and the third person plural pronoun they. Examples of impersonal you
and they are given in (11) and (12).
(11) Ge: Ich weiß, dass man die Flügel nicht wieder ankleben kann.
En: I know you just can’t glue the wings back on.  [OpenSubtitles]
(12) Ge: So etwas kann man noch nicht bauen!
En: They cannot make things like that yet.  [OpenSubtitles]

[Figure 2: bar chart; y-axis scale 0–200; x-axis categories (as extracted): you, psv, they, inf, nom, we, zero, one, I, pst.ptcp, guy, he, anybody, it, that, man, who, prs.ptcp, people, imp, a man, anyone]
Figure 2.  Heterophrases of man (OpenSubtitles).

[Figure 3: association plot; rows: Europarl, OpenSubtitles; columns: any-x, I, imp, inf, it, nom, one, people, prs.ptcp, pst.ptcp, psv, they, we, who, you, zero; Pearson residuals from –8.14 to 4.00; p-value < 2.22e-16]
Figure 3.  A comparison of Europarl and OpenSubtitles (Cohen-Friendly association plot).

Figure 3 displays a comparison of the frequencies in the two corpora in the form
of a Cohen-Friendly association plot (cf. Cohen 1980, Friendly 1992). It shows the
observed frequencies in relation to the expected frequencies, on the assumption of
statistical independence of the variables. When a box is located above the baseline,
it corresponds to an overrepresented combination of variables; boxes underneath
the baseline indicate underrepresented cases. The size of a box is proportional to
the deviation from statistical independence in any given case. Levels of statistical
significance are indicated by shading.
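
The quantities underlying such a plot can be computed in a few lines. The following Python sketch shows how expected frequencies under independence and the corresponding Pearson residuals, (observed - expected) / sqrt(expected), are obtained from a contingency table. The function name `pearson_residuals` is an assumption, and the counts are hypothetical toy values chosen for easy checking, not the frequencies of the present study. The plots in this article were generated with R, not with this code.

```python
import math


def pearson_residuals(table):
    """Return (expected, residuals) for a contingency table (list of rows).

    Expected counts assume statistical independence of rows and columns:
    expected[i][j] = row_total[i] * col_total[j] / n.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    expected = [[r * c / n for c in col_totals] for r in row_totals]
    residuals = [
        [(o - e) / math.sqrt(e) for o, e in zip(obs_row, exp_row)]
        for obs_row, exp_row in zip(table, expected)
    ]
    return expected, residuals


# Hypothetical counts (NOT the study's data):
# rows = two corpora, columns = two strategies.
toy = [[10, 30],
       [30, 10]]
expected, residuals = pearson_residuals(toy)
```

A positive residual corresponds to a box above the baseline (overrepresentation), a negative one to a box below it. The sketch assumes that no row or column total is zero.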
Figure 3 shows that the most important difference between the Europarl and
the OpenSubtitles corpus consists in the fact that you is, comparatively speaking,
massively underrepresented in the former corpus and overrepresented in the latter.
The same applies to the (informal) impersonal use of third person plural pronouns
(impersonal they). All other (major) strategies are overrepresented in the
Europarl corpus, in comparison to the OpenSubtitles corpus, in particular we, passives,
participles or gerunds and impersonal one. Given our assumptions about the
formal vs. informal character of the two corpora under discussion, this certainly
does not come as a great surprise.

3.2 Translating from and into English

Let us now have a closer look at the distribution of strategies, taking into account
whether a sentence is an English original (which was translated into German us-
ing man), an English translation of German man, or a translation from a third
language (where the German and English sentences are ‘sisters’, using the termi-
nology established in Section 2). The data is represented in the form of an associa-
tion plot in Figure 4 (note that the labels on the left indicate the original language;
English → original, German → translation from German, other → translation
from other language).

[Figure 4: association plot; rows (original language): English, German, other; columns (as extracted): psv, adj, I, imp, inf, it, nom, one, people, pst.ptcp, they, we, who, you, zero, anyone, prs.ptcp, they(anph), some people; Pearson residuals from –3.04 to 2.38; p-value = 0.023452]
Figure 4.  English originals, translations from German and from other languages.

The plot contains three shaded boxes. Passives are significantly overrepre-
sented in English originals, relative to translations. We is massively underrepre-
sented here. In sentences that were translated from German into English, you is
significantly overrepresented (i.e., man is relatively often translated into English
using you, in comparison to the other impersonalization strategies). The other dif-
ferences in the distributions are not statistically significant, but this is probably
at least partly due to the limited sample size, and by adding more data we could
certainly detect further interesting asymmetries. At this point we will not go into
any further detail with respect to translation effects, however. We will return to
the relation between heterophrases in the quantitative interpretation of the data.

4. Investigating the distribution of impersonalization strategies in the Europarl corpus

The main question to be addressed in this section can be put as follows:
(13) To what extent can we predict — for any given occurrence of Germ.
man — what type of structure we will find in the corresponding English
heterophrase?

Finding an answer to the question in (13) is, first and foremost, a theoretical chal-
lenge. It implies that we understand the factors that determine the distribution of
German man as well as its English counterparts. In other words, it implies that we
are able to answer the question in (14).

(14) What determines the interpretation and distribution of impersonal
pronouns or impersonalization strategies in English and in German?

For a corpus-based study, we thus have to identify variables that potentially
have an impact on the interpretation and distribution of the relevant
strategies. For each variable, we will formulate and test a hypothesis concerning its
impact. Building upon the results obtained in this section, we can then try and fit
a multinomial logistic regression model in Section 5. The independent variables
to be tested are the following:
– the type of quantification expressed (Section 4.1)
– semantic properties of the sentential context:
  – generalizing vs. specific (Section 4.2)
  – veridical vs. non-veridical (Section 4.3)
  – modal(ized) vs. non-modal(ized) (Section 4.4)
– the clause type found in any given instance (Section 4.5)
– the source language (Section 4.6)
The semantic variables — the type of quantification expressed and the properties
of the sentential context — as well as the clause type were coded manually. The
information about the speaker was extracted automatically from the relevant tags
of the Europarl corpus. The dependent variable is the impersonalization strategy.
Given that some of the strategies are very rare and moreover have similar
distributional and semantic properties, I distinguish only eight levels for this variable
(unlike in Figure 1 above):
– a passivized predicate (psv), cf. (4)
– a nominal or adjectival construction (nom), cf. (6)
– a non-finite verbal construction (infinitives, imperatives) (nfin), cf. (2)
– the impersonal pronoun one, cf. (7)
– impersonal uses of the personal pronoun you, cf. (8)
– the first person plural pronoun we, cf. (5)
– pronouns other than you, one or we, e.g. anyone, they (pron), cf. the English
version of (12)
– a complete rephrasing of the sentence (rphr)
Examples have been provided for all the strategies except the last one (rephrasing).
It is illustrated in (15).6
(15) rephrasing of man-sentence
Ge: Man macht sich Sorgen um die “Verbuchung der Eigenmittel”, um ihre
“Bereitstellung” oder um die “Kontrolle” der der Kommission zur
Verfügung gestellten Beträge.
En: It shows a concern with the accounting involving own resources, the
process of making them available or with the monitoring of declared
amounts made available to the Commission. [Europarl, v7]

6.  A reviewer points out that (15) could also be classified as ‘nominal’. I have chosen the
‘rephrasal’ option because the two sentences contain considerably different main predicates.

4.1 The type of quantification expressed

One of the most important determinants of the distribution of impersonalization
strategies is, obviously, the interpretation of the impersonalized argument. Two
major groups of uses can be distinguished (cf. van der Auwera et al. 2012, Gast and
van der Auwera 2013): Impersonalization strategies can have either a universal or
an existential interpretation. In a universal interpretation, they are roughly equiv-
alent to universal quantifiers like all or every. In an existential interpretation, they
correspond more or less to an existential pronoun like someone, though they differ
in systematic ways from such pronouns. Examples illustrating these paraphrases
are given in (16) and (17).
(16) Universal
Man lebt nur einmal. ≡ Everybody only lives once.
(17) Existential
Man hat geklopft. ≡ Someone has knocked.

Given that the type of interpretation is an obvious candidate for being a parameter
that influences the choice of an impersonalization strategy in English, we can for-
mulate the hypothesis in (18) and test the corresponding null hypothesis in (19):
(18) H1: The type of quantification expressed by German man has an impact on
the use of impersonalization strategies in the English heterophrases.
(19) H0: The type of quantification expressed by German man has no impact on
the use of impersonalization strategies in the English heterophrases.

For the (manual) coding of the data we used a substitution test: Depending on
whether substitution with somebody or everybody led to approximately equivalent
sentences, the examples were classified as existential or universal. A χ²-test shows
that the null hypothesis in (19) can be rejected (χ² = 20.5, df = 7, p = 0.005). The
data are shown in the form of an association plot in Figure 5.
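
The reported test values can be checked for internal consistency. The Python sketch below is a minimal illustration rather than the procedure used for this study (the analyses were carried out in R): it derives the degrees of freedom of a two-by-eight table (two types of quantification, eight strategies) and computes the upper tail probability of the χ² distribution from the rounded statistic, using the standard recurrence for the regularized incomplete gamma function. The function names `chi2_df` and `chi2_sf` are assumptions.

```python
import math


def chi2_df(n_rows, n_cols):
    """Degrees of freedom of a chi-squared test of independence."""
    return (n_rows - 1) * (n_cols - 1)


def chi2_sf(x, df):
    """Upper tail probability P(X >= x) for a chi-squared variable, df >= 1.

    Uses Q(a + 1, z) = Q(a, z) + z**a * exp(-z) / Gamma(a + 1) with z = x / 2,
    starting from Q(1/2, z) = erfc(sqrt(z)) for odd df and
    Q(1, z) = exp(-z) for even df.
    """
    z = x / 2.0
    if df % 2 == 1:
        a, q = 0.5, math.erfc(math.sqrt(z))
    else:
        a, q = 1.0, math.exp(-z)
    while a < df / 2.0:
        q += z ** a * math.exp(-z) / math.gamma(a + 1.0)
        a += 1.0
    return q


df = chi2_df(2, 8)     # quantification (2 levels) x strategy (8 levels)
p = chi2_sf(20.5, df)  # statistic as reported in the text (rounded)
```

With the rounded statistic, p evaluates to about 0.0046, which is in line with the value of p = 0.005 given above.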

[Figure 5: association plot; rows: quantification (levels 0 and 1, axis label ‘uni’); columns (strat.n): psv, nom, pron, non-fin, one, you, rephr, we; Pearson residuals from –2.12 to 2.16; p-value = 0.004546]
Figure 5.  Impersonalization strategy × quantification (exst/univ): An association plot.

In spite of the varying interpretations illustrated in (16) and (17), impersonalization
strategies are often regarded as forming a natural class. The reason is that the
interpretation of pronouns like man — universal or existential — appears to be
a function of the context. There are basically two ways of explaining this type of
context sensitivity. We can refer to the alternative theories as the ‘theory of para-
sitic binding’ and the ‘theory of non-veridicality’. These theories will be briefly
explained in the following sections, and the variables corresponding to them will
be introduced in the course of the discussion.

4.2 Generalizing vs. specific sentences

The first way of explaining the context sensitivity that characterizes impersonal
pronouns is to assume that their interpretive variability is similar to, or even an
instance of, what has been called ‘quantification variability effects’ since Lewis
(1975). An example is given in (20). The indefinite
NP a dog is here interpreted as being universally quantified over, as it occurs in the
scope of the adverb always.
(20) A dog is always smart. ≈ All dogs are smart.

One way of analysing this phenomenon is to assume that indefinites (of this type)
basically introduce variables, and that these variables are bound by some operator
in the sentential environment. A similar assumption can be made for impersonal
pronouns (see for instance Alonso-Ovalle 2002, Moltmann 2006). Accordingly,
the interpretation of man as an existential or universal quantifier can be derived
from the assumption that the quantifier binding the event variable also binds the
(otherwise) free variable introduced by man (or similar expressions).


(21) Universal quantification
Man lebt nur einmal. (‘One only lives once.’)
∀e [live(x,e) → once(e)]
‘For any event e, if e is an event of living (of any individual x), e only happens
once.’
(22) Existential quantification
Man hat geklopft. (‘They have/someone has knocked.’)
∃e[knock(x,e)]
‘There is an event e such that e is a knocking event (by some x)’

An example of a generalizing sentence is given in (23), and a specific (non-generalizing)
sentence is illustrated in (24).
(23) Ge: Das kritisiere ich nicht; es kommt immer mal vor, daß man sich
vertreten läßt.
En: I am not criticising this; it happens from time to time that people send
someone to represent them. [Europarl, v7]
(24) Ge: Wir verlangen, daß man dabei größte Transparenz walten läßt …
En: We demand that the greatest possible transparency should be in place…
 [Europarl, v7]

The analysis of impersonal pronouns in analogy to indefinites is supported by the
fact that impersonal pronouns can often (with a certain effect of specification)
be replaced with an indefinite noun phrase. For instance, instead of one in (25)
we could use the noun phrase an honourable man, as in (26) (which is actually the
original, taken from the title of a chapter from a volume on ‘Witchcraft narratives
in Germany’).
(25) One should not talk about that which one cannot prove.
(26) An honourable man should not talk about that which he cannot prove.

We can thus formulate the following hypothesis H1, and the complementary null
hypothesis H0:
(27) H1: The status of a sentence as generalizing or specific has an impact on
the distribution of impersonalization strategies in English heterophrases of
German sentences with man.
(28) H0: The status of a sentence as generalizing or specific has no impact on
the distribution of impersonalization strategies in English heterophrases of
German sentences with man.


As an operational test, we checked whether the adverbial generally could be added
salva veritate, i.e., without changing the truth value of the sentence. The data show
that the null hypothesis H0 cannot be rejected (χ2 = 11.0, df = 7, p = 0.14). Somewhat
surprisingly perhaps, the parameter ‘generalizing’ does not, in itself, seem to have a
significant impact on the choice of an impersonalization strategy in the dataset used
for the present study. Note that this is probably not primarily due to scarcity of data.
The data are simply relatively evenly distributed, specifically in the most frequent
strategies, i.e., ‘passive’, ‘nominal’ and ‘non-finite’. The frequencies are shown in
Table 1 and visualized in Figure 6.

Table 1.  Contingency table for strategy x generalizing/specific.

gen   psv   nom   pron   nfin   one   you   rphr   we
0      46    50     13     18     7     5     17   42
1      47    44     11     23    14    14      8   40
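
As a rough cross-check, the test statistic and the Pearson residuals can be recomputed from the counts in Table 1. The Python sketch below is an illustration only and not part of the original analysis; the function name `chi_square` and the variable names are hypothetical, and the 5% critical value for 7 degrees of freedom (14.07) is taken from standard χ2 tables rather than from the article.

```python
# Observed counts from Table 1 (rows: gen = 0 and gen = 1).
STRATEGIES = ["psv", "nom", "pron", "nfin", "one", "you", "rphr", "we"]
OBSERVED = [
    [46, 50, 13, 18, 7, 5, 17, 42],
    [47, 44, 11, 23, 14, 14, 8, 40],
]


def chi_square(observed):
    """Return (statistic, df, Pearson residuals) for a two-way table."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    statistic = 0.0
    residuals = []
    for i, row in enumerate(observed):
        res_row = []
        for j, obs in enumerate(row):
            # Expected count under independence of rows and columns.
            expected = row_totals[i] * col_totals[j] / n
            res = (obs - expected) / expected ** 0.5
            res_row.append(res)
            statistic += res ** 2
        residuals.append(res_row)
    df = (len(observed) - 1) * (len(col_totals) - 1)
    return statistic, df, residuals


statistic, df, residuals = chi_square(OBSERVED)
CRITICAL_05_DF7 = 14.07  # tabulated 5% critical value, chi-square with 7 df

print(f"chi2 = {statistic:.1f}, df = {df}")
print("reject H0 at the 5% level:", statistic > CRITICAL_05_DF7)
for label, row in zip(("gen=0", "gen=1"), residuals):
    print(label, " ".join(f"{s}:{r:+.2f}" for s, r in zip(STRATEGIES, row)))
```

Under these assumptions the sketch arrives at the reported χ2 = 11.0 with df = 7, and the largest and smallest residuals (both in the column ‘you’) correspond to the extremes of the residual scale in Figure 6.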

[Association plot not reproduced. Columns: strat.n (psv, nom, pron, n-fin, one, you, rephr, we); rows: gen; Pearson residuals from –1.44 to 1.43; p-value = 0.13716.]

Figure 6.  Impersonalization strategy x generalizing/specific: An association plot.

4.3 Veridical vs. non-veridical sentences

An alternative to the ‘theory of parasitic binding’ is what I have called the
‘theory of non-veridicality’ above. It is based on the assumption that,
within a specific domain, non-veridical clauses (whose propositional content
is not implied to be true) license universal readings of impersonal pronouns
while veridical clauses do not. The notion of veridicality was originally coined by
Montague (1969), and was introduced into formal semantics by Zwarts (1995)
and Giannakidou (1998), among others. Abstracting away from some complica-
tions, “[t]he intuitive idea behind veridicality and nonveridicality is very simple:
a linguistic item L is veridical if it expresses certainty about, or commitment to,
the truth of a sentence; and L is nonveridical if it doesn’t express commitment”
(Giannakidou 2011: 1675).
Veridicality is known to play an important role in the licensing of negative po-
larity items (cf. Giannakidou 2011 for an overview of the discussion). For example,
the adverb ever can occur in questions, conditionals, under negation etc., but not
in simple declarative clauses. Put differently, it can occur in non-veridical contexts
(cf. (29)), but not in veridical ones (cf. (30)).
(29) Have you ever been there? (↛ You have been there.)
(30) * I have ever been there.

The role of (non-)veridicality in the licensing of impersonal pronouns cannot be
the same as in the case of negative polarity items. Specific impersonal pronouns
(e.g. Eng. one, you) are licensed in (specific) non-veridical contexts, but not in an-
tiveridical ones. This means that they are not licensed under negation. (31) is non-
veridical — it does not imply that everyone is always polite — and the sentence
is fine. (32) would be veridical if it were grammatical (i.e., the context is veridical
and the impersonal pronoun is therefore not licensed). (33), in turn, is antiveridi-
cal (negated), but it is not grammatical, either. (33) is the type of context licensing
negative polarity items while not licensing impersonal pronouns.
(31) One should always be polite.
(32) * One had breakfast yesterday.
(33) * One did not have breakfast yesterday.

In order to account for the distribution of impersonal pronouns, the notion of
(non-)veridicality needs to be relativized to a specific domain, i.e., the ‘minimal
clause’, consisting of the predicate plus its arguments and any (internal) negation.
Under this premise, (33) counts as veridical, as it implies ‘x did not have breakfast’.
The negation is part of the context of evaluation, here represented by a subscript ‘V’.
(34) Yesterday [V not [ x have breakfast ]] → x did not have breakfast

Let us consider two attested examples from the corpus. (35) is classified as non-
veridical, (4) (repeated here) as veridical.


(35) Ge: In der Tat, wenn man die beiden Mitglieder, die sich gemeldet haben
hinzuzählt, dann ergibt sich als Ergebnis …
En: Indeed, if we add the two Members who have declared themselves, then
the result of the vote would be … [Europarl, v7]
(4) Ge: Ich gratuliere Ihnen zu dem, was man im deutschen Parlamentarismus
— in Ihrem Fall unzulässigerweise — eine Jungfernrede nennt.
En: I would like to congratulate you on what is referred to in German
parliamentary-speak, inappropriately in your case, as a maiden speech.
 [Europarl, v7]

Note that the theory of non-veridicality seems to be incompatible with the fact
that one is licensed in generalizing contexts, even though it often sounds stilted
in such contexts (e.g. One only lives once). The present approach does not aim at
postulating categorical rules, but at finding parameters which, in interaction with
other factors, contribute to the grammaticality of some structure and determine
its distribution. Note, however, that generic sentences could in fact be argued to be
non-veridical relative to the minimal clause, as they can be regarded as conveying
an implicit conditional (cf. the representation of One only lives once in (21)).
We can, again, formulate two complementary hypotheses and test the null
hypothesis.
(36) H1: The status of the minimal clause containing an impersonalized
argument position as veridical or non-veridical has an impact on the
distribution of impersonalization strategies in English heterophrases of
German sentences with man.
(37) H0: The status of the minimal clause containing an impersonalized
argument position as veridical or non-veridical has no impact on the
distribution of impersonalization strategies in English heterophrases of
German sentences with man.

The variable ‘veridicality’ was regarded as a property of propositions, not of
operators (as is suggested by the quotation from Giannakidou 2011 above), more
specifically, as a property of the proposition expressed in the ‘minimal clause’ con-
taining an impersonalized predicate. An example was classified as ‘veridical’ if the
minimal clause was true just in case the host proposition was true (cf. (35) and (4)
above for illustration).
Figure 7 visualizes the data. According to a χ2-test, the null hypothesis can be
rejected (χ2 = 16.5, df = 7, p = 0.02). Veridicality — within the domain of the mini-
mal clause as defined above — does seem to have an impact on the choice of an
impersonalization strategy.


[Association plot not reproduced. Columns: strat.n (psv, nom, pron, n-fin, one, you, rephr, we); rows: ver; Pearson residuals from –1.69 to 1.93; p-value = 0.021054.]

Figure 7.  Impersonalization strategy x (non-)veridical: An association plot.

4.4 Modal vs. non-modal sentences

A sentence was classified as ‘modal’ if it contained a modal auxiliary. Modal
auxiliaries form a subset of the class of non-veridical operators. The category ‘modal’
is therefore a special case of the category of ‘non-veridicality’. Modals are known
to provide a particularly fertile ground for the use of impersonals (as in You/one
shouldn’t drink and drive; see for instance van der Auwera et al. 2012, Gast and
van der Auwera 2013). It therefore seems worthwhile considering what influence
modals have on the choice of an impersonalization strategy. We will test the fol-
lowing hypotheses:
(38) H1: The presence or absence of a modal in the clause has an impact on the
choice of an impersonalization strategy.
(39) H0: The presence or absence of a modal in the clause has no impact on the
choice of an impersonalization strategy.

On the basis of the dataset used for the present study, the null hypothesis in (39)
can be rejected (χ2 = 16.7, df = 7, p = 0.02). The data are visualized in Figure 8.


[Association plot not reproduced. Columns: strat.n (psv, nom, pron, n-fin, one, you, rephr, we); rows: mod; Pearson residuals from –1.68 to 1.92; p-value = 0.019309.]

Figure 8.  Impersonalization strategy x (non-)modal: An association plot.

4.5 The clause type

So far, we have only considered semantic variables: the type of quantification ex-
pressed in a sentence, the property of expressing a generalization vs. a particu-
lar statement, the question of (non-)veridicality and the presence vs. absence of a
modal. In this section we will consider a formal, basically syntactic, variable, i.e.
the type of clause in which man occurs.
The clause type is potentially relevant because the entire ‘ecology’ of the gram-
matical system of English is different from that of German. English has a category
that German lacks — the gerund (cf. (40)) — and it makes ample use of participial
constructions (cf. König and Gast 2012: Ch.13 for a contrastive overview). It is
therefore to be expected that in many cases where German uses man (in a finite
clause), English will use some non-finite structure. (41) is an example where (the
first occurrence of) German man was translated into English using a participle
(the second occurrence was translated with one).
(40) Ge: Liest man beide Kommissionsdokumente, so war 1998 ein Jahr, in dem
die 1997 eingeleiteten Modernisierungsvorhaben fortgeführt und in
Teilen noch abgeschlossen wurden; …
En: On reading both Commission documents, one learns that 1998 was the
year in which the modernisation proposals introduced in 1997 were
pursued and even partially completed, … [Europarl, v7]


(41) Ge: Herr Präsident, Frau Kommissarin, liebe Kolleginnen und Kollegen!
Wenn man die Flugblätter sieht, die in den letzten Wochen verteilt
wurden, dann denkt man, wir diskutieren hier über den Öko-Gau oder
über den Tod der Automobilindustrie in Europa.
En: Mr President, Commissioner, reading the leaflets which have been
distributed in recent weeks, one would think that we were talking about
the ultimate environmental catastrophe or the death of the car industry
in Europe. [Europarl, v7]

We can, again, formulate two complementary hypotheses and test the null
hypothesis:
(42) H1: The type of clause in which man occurs has an impact on the type of
impersonalization strategy found in English.
(43) H0: The type of clause in which man occurs has no impact on the type of
impersonalization strategy found in English.

I have classified the clause types into four categories: main/declarative clause
(m.dec), (direct or indirect) questions (q), adjunct clauses, i.e., adverbial clauses
and relative clauses (adj), and complement clauses (comp). This classification is
relatively coarse-grained, and perhaps the classes have not been chosen in an ideal
way. A more fine-grained classification would have implied a high number of cells
with low counts, which compromises statistical analyses. On the basis of this four-
way classification the null hypothesis cannot be rejected. In fact, the p-value is
rather high (χ2 = 27.37, df = 21, p = 0.16). The data are visualized in Figure 9.

[Association plot not reproduced. Columns: strat.n (psv, nom, pron, n-fin, one, you, rephr, we); rows: typ.n (main/decl, question, adjunct, comp); Pearson residual scale from –2.23 to 2.00; p-value = 0.15907.]

Figure 9.  Impersonalization strategy x clause type: An association plot.


4.6 The source type

Finally, it is obvious that in a study of translated text we need to take the ‘source
type’ into account — i.e. we need to know whether a sentence is an original or a
translation. Moreover, I distinguish two types of translations, those from German
and those from other languages. This distinction is made because the entire study
is based on German man as a ‘flag’ or ‘anchor’ for the argument structure opera-
tion of impersonalization.
The hypotheses concerning the source type can be phrased as follows:
(44) H1: The type of source of any given example (original, translation) has an
impact on the choice of an impersonalization strategy in English.
(45) H0: The type of source of any given example (original, translation) has no
impact on the choice of an impersonalization strategy in English.

The data are visualized in Figure 10. The null hypothesis can be rejected (χ2 = 43.97,
df = 18, p < 0.001).

[Association plot not reproduced. Columns: strat.n (psv, nom, pron, n-fin, one, you, rephr, we); rows: langrel (translation, original, sisters); Pearson residuals from –3.04 to 2.38; p-value = 0.0015869.]

Figure 10.  Impersonalization strategy x source type: An association plot.

4.7 Summary

The statistics for the variables explored in the present section are summarized
in Table 2, ordered by the p-values. For the multifactorial analysis presented in
Section 5 below, only those variables will be taken into account for which the null
hypothesis of statistical independence could be rejected, i.e., veridical, modal,
quantification and source type.


Table 2.  Main statistics for the variables under consideration in the present study.

variable         χ2-value   df   p       significance
clause type      27.37      21   0.16
generalizing     11.0        7   0.14
veridical        16.5        7   0.02    *
modal            16.7        7   0.02    *
quantification   20.5        7   0.005   **
source type      43.97      18   <.001   **

5. Towards a multifactorial analysis

In Section 4, we investigated the impact of some variables on the choice of an
impersonalization strategy in English individually. The question arises how the
variables that were found to have an impact on the distribution of impersonalization
strategies in English conspire or compete in any given case. This question will be
addressed in the present section by fitting a multinomial logistic regression model.
We will build the model bottom-up, starting with one-variable models and
adding variables stepwise, rather than proceeding top-down, stepwise removing
predictors from a maximal model. The reason is that the various models are intend-
ed to be connected to hypotheses and assumptions about the theory of imperson-
alization. The aim of this section is not to randomly determine the model with the
best fit; it is to fit a model that mirrors more or less what we (think we) know about
the theory of impersonalization. The hypothesis-driven nature of the approach
taken in the present study is more compatible with such a bottom-up procedure.
We will start with models with one variable in Section 5.1 and then move on to
models with two, three and four variables in Sections 5.2–5.3.
Section 5.4 provides an example of a prediction made by the model.

5.1 Models with one predictor

I used the multinom-function from the nnet-package and the mlogit-function of
the mlogit-package for R (R Core Team 2013) to fit multinomial logistic regression
models with the response variable ‘strategy’ (with eight levels, as described
in Section 4). The multinom-function delivers values for the residual deviance of a
model and the Akaike Information Criterion (AIC), two indicators of goodness
of fit. Roughly speaking, residual deviance is an indicator of the amount of data
not accounted for by a model. AIC is sensitive to the number of parameters and
increases with the number of variables used in a model, or with their levels. For
both scores it holds that ‘the lower the better’.
The mlogit-package provides the (negative) log-likelihood value and McFadden’s
R2 for each model (cf. Smith and McKenna 2013).7 The absolute value of the negative
log-likelihood value, multiplied by two, is identical to the residual deviance, so it
will not be indicated in the following. McFadden’s R2 increases with goodness of fit.
In addition to these statistics I will indicate the number of cases that are cor-
rectly predicted by the model — the CPO-value, as I will call it (for ‘correctly pre-
dicted outcome’). The notion of ‘prediction’ is strictly speaking inaccurate here, as
the models predict the data to which they were fitted. Still, the CPO-value provides
a simple and intuitively accessible — though in itself, insufficient — measure of
goodness of fit. The CPO-value will be illustrated with examples in Section 5.4.
Table 3 summarizes the statistics for one-variable models, ordered in terms of
their residual deviances, for the variables ‘(non-)veridicality’, ‘modality’, ‘quantifi-
cation’ and ‘source’.

Table 3.  Models with one variable.

variable   Res. deviance   AIC      McFadden’s R2   CPO
ver        1484.9          1512.9   0.011            98
mod        1484.7          1512.7   0.011           107
quant      1476.5          1504.5   0.016            98
source     1459.4          1501.4   0.028           104
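
The relation between residual deviance, AIC and McFadden’s R2 can be made concrete with a small Python sketch. It is an illustration under stated assumptions, not the original R code: the helper names are hypothetical, the overall strategy frequencies are taken to be the column totals of Table 1 (N = 399), AIC is computed as residual deviance plus twice the number of estimated parameters, and a binary predictor is assumed to contribute one coefficient for each of the seven non-reference strategies.

```python
import math

# Overall strategy frequencies: column totals of Table 1 (assumed N = 399).
STRATEGY_COUNTS = {"psv": 93, "nom": 94, "pron": 24, "nfin": 41,
                   "one": 21, "you": 19, "rphr": 25, "we": 82}


def null_log_likelihood(counts):
    """Log-likelihood of an intercept-only multinomial model."""
    n = sum(counts.values())
    return sum(c * math.log(c / n) for c in counts.values())


def aic(residual_deviance, n_parameters):
    """AIC as residual deviance plus two times the parameter count."""
    return residual_deviance + 2 * n_parameters


def mcfadden_r2(residual_deviance, null_ll):
    """One minus the ratio of model to intercept-only log-likelihood."""
    model_ll = -residual_deviance / 2
    return 1 - model_ll / null_ll


null_ll = null_log_likelihood(STRATEGY_COUNTS)
# Binary predictor: 7 intercepts + 7 slopes (one per non-reference strategy).
n_params_ver = 2 * (len(STRATEGY_COUNTS) - 1)

print(f"null deviance    = {-2 * null_ll:.1f}")
print(f"AIC (ver)        = {aic(1484.9, n_params_ver):.1f}")
print(f"McFadden R2 (ver) = {mcfadden_r2(1484.9, null_ll):.3f}")
```

With the residual deviance of the ‘ver’ model from Table 3, these assumptions reproduce the tabulated AIC and McFadden’s R2, which suggests that the reading of the two statistics given here is at least compatible with the reported values.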

The model with the best statistics is the one with the variable ‘source type’. As
Table 3 shows, in 104 cases the model assigns the highest probability to the strat-
egy that was actually used, i.e. in 26% of cases. This is a considerable improvement
in comparison with a model that simply chooses the most frequent strategy (pas-
sives), which would make a correct prediction in 23.3% of cases. The source type-
model simply selects the strategy ‘passive’ for English originals and for translations
from languages other than German, and ‘nominal’ for translations from German,
and this is the right choice in 26% of cases.
A one-variable model with the variable ‘modality’ correctly predicts an even
higher number of cases, though the goodness-of-fit indicators are slightly worse.
This model chooses the strategy ‘nominalization’ for sentences with a modal, and
‘passive’ for sentences without a modal. In this way it selects the correct outcome
in 26.8% of cases.
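
How a one-variable model arrives at its CPO-value can be sketched in a few lines of Python. A multinomial model with a single categorical predictor reproduces the relative frequencies observed at each level of that predictor, so its top-ranked outcome for a given level is simply the most frequent strategy at that level. Since the cross-tabulations for ‘source type’ and ‘modality’ are not printed here, the sketch uses the counts for ‘generalizing’ from Table 1, a variable that was not entered into the regression models; the function name is hypothetical and the resulting figure is purely illustrative.

```python
STRATEGIES = ["psv", "nom", "pron", "nfin", "one", "you", "rphr", "we"]

# Counts from Table 1, keyed by the level of the predictor 'gen'.
COUNTS_BY_LEVEL = {
    0: [46, 50, 13, 18, 7, 5, 17, 42],
    1: [47, 44, 11, 23, 14, 14, 8, 40],
}


def top_strategy_per_level(counts_by_level):
    """Map each predictor level to (most frequent strategy, its count)."""
    result = {}
    for level, counts in counts_by_level.items():
        best = max(range(len(counts)), key=counts.__getitem__)
        result[level] = (STRATEGIES[best], counts[best])
    return result


predictions = top_strategy_per_level(COUNTS_BY_LEVEL)
n_cases = sum(sum(counts) for counts in COUNTS_BY_LEVEL.values())
# CPO: cases in which the top-ranked strategy is the one actually used.
cpo = sum(count for _, count in predictions.values())

for level, (strategy, count) in predictions.items():
    print(f"gen={level}: predict '{strategy}' ({count} hits)")
print(f"CPO = {cpo} of {n_cases} ({100 * cpo / n_cases:.1f}%)")
```

The sketch makes the limitation mentioned above visible: the CPO-value only registers which strategy is ranked first at each level and ignores how the remaining probability mass is distributed.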

7.  McFadden’s R2, “(sometimes referred to as ‘deviance R2’), is one minus the ratio of the full-
model log-likelihood to the intercept-only log-likelihood … ” (Smith and McKenna 2013: 18).


As these examples illustrate, one-variable models are very simple and do not
have much explanatory force. Still, they can give us a rough idea of the role that
each variable plays in determining the distribution of the English impersonaliza-
tion strategies.

5.2 Models with two predictors

There can be little doubt that the final model will contain the variable ‘source type’,
which was shown to be a relatively good predictor in Section 5.1, and which is
totally independent of the other (semantic) variables. We will therefore start with
combinations of the semantic variables, (non-)veridicality, modality and quantification.
We will focus on combinations of quantification with (non-)veridicality
and modality. The statistics for these models (both with and without interactions)
are shown in Table 4.

Table 4.  Models with two semantic variables.

variables     Res. deviance   AIC      McFadden’s R2   CPO
quant + ver   1466.7          1508.7   0.023           101
quant * ver   1462.2          1518.2   0.026           101
quant + mod   1461.8          1503.8   0.026           111
quant * mod   1438.3          1494.3   0.042           115

The models with quantification and modality fare much better in all respects than
those with quantification and veridicality. As is shown by a likelihood ratio test,
quantification and modality interact, however. The model with interactions
(represented in the last row) is significantly better than the model without
interactions (LogLikq+m = −730.9, LogLikq*m = −719.13, df = 7, χ2 = 23.53, p = 0.001).
Both models are significantly better than the one-variable model with ‘quantification’
(for the comparison with the interaction model: LogLikq = −738.3, LogLikq*m = −719.13,
df = 14, χ2 = 38.29, p < 0.001). The quant*mod-model makes correct predictions for
28.8% of cases (= 115/399).
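
The likelihood ratio test used for these comparisons can be sketched in Python as follows. The sketch is an illustration rather than the original R procedure: the test statistic is twice the difference between the log-likelihoods of the larger and the smaller model, and the p-value is the upper tail of a χ2 distribution with as many degrees of freedom as parameters were added. The function names are hypothetical, and the tail probability is computed with a standard series expansion of the incomplete gamma function so that only the standard library is needed.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability of a chi-square distribution with df degrees
    of freedom, via the series for the regularized lower incomplete gamma."""
    a, z = df / 2, x / 2
    if z <= 0:
        return 1.0
    term = 1.0 / a
    total = term
    n = 0
    while term > total * 1e-15:
        n += 1
        term *= z / (a + n)
        total += term
    lower = total * math.exp(-z + a * math.log(z) - math.lgamma(a))
    return 1.0 - lower


def likelihood_ratio_test(loglik_small, loglik_large, df):
    """Return (test statistic, p-value) for two nested models."""
    statistic = 2 * (loglik_large - loglik_small)
    return statistic, chi2_sf(statistic, df)


# quant + mod versus quant * mod (log-likelihoods as reported in the text).
statistic, p_value = likelihood_ratio_test(-730.9, -719.13, df=7)
print(f"chi2 = {statistic:.2f}, df = 7, p = {p_value:.4f}")
```

The statistic obtained from the rounded log-likelihoods differs from the reported value only in the second decimal place, and the p-value falls between 0.001 and 0.005, in line with the rounded p = 0.001 given above.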
Let us now combine the variable ‘source type’ with each of the three semantic
variables that turned out to be significant. As likelihood ratio tests have revealed
no significant interactions between the variables shown in Table 5, only the mod-
els without interactions are shown.
The model with the variables ‘source type’ and ‘quantification’ is the one that
fares best in terms of the model statistics so far, even though it only predicts 113
outcomes correctly (in comparison to the 115 outcomes correctly predicted by the
quant*mod-model). The number of correctly predicted outcomes is not, however,
a very reliable indicator of goodness of fit, as was pointed out above.


Table 5.  Combinations of ‘source type’ with the three semantic variables.

variables             Res. deviance   AIC      McFadden’s R2   CPO
source type + ver     1442.3          1498.3   0.039           115
source type + mod     1441.5          1497.5   0.04            117
source type + quant   1433.2          1489.2   0.046           113

5.3 Models with three and four predictors

We do not have many options to fit a model with three variables. The two linguis-
tically most reasonable models are shown in Table 6 (translation effects will be
integrated below). Note that there is an interaction between the variables ‘quantifi-
cation’ and ‘modality’, but not between ‘quantification’ and ‘veridicality’.

Table 6.  Models with three variables.

variables                   Res. deviance   AIC      McFadden’s R2   CPO
source type + quant + ver   1423.0          1493.9   0.052           116
source type + quant * mod   1395            1479     0.071           115

The model with source type, quantification and modality, as well as the interaction
between the latter variables, is the best model so far and it fits the data significantly
better than any two-variable model (LogLiks + q = −716.59, LogLiks + q*m = −697.47,
df = 14, χ2 = 38.23, p < 0.001). While the model statistics have continually improved
in the course of ‘model building’ (within a generally rather modest goodness of
fit), the number of correctly predicted outcomes has not changed significantly and
even dropped somewhat from the two-variable models to the three-variable models.
A final improvement can be achieved by fitting a four-variable model with
the predictors ‘source’, ‘quantification’, ‘modality’ (interacting with quantification)
and ‘veridicality’. While the improvement over the three-variable model is not
significant at the five percent level, it is not far from significance, with p = 0.08
(LogLiks + q*m = −697.5, LogLiks + q*m + v = −691.13, df = 7, χ2 = 12.68). The number of
correctly predicted outcomes increases considerably, to 123 (i.e., 30.8%). The
statistics for the four-variable model are summarized in Table 7.

Table 7.  A model with four variables.

variables                         Res. deviance   AIC       McFadden’s R2   CPO
source type + quant * mod + ver   1382.26         1480.26   0.079           123


5.4 How the prediction works

As the discussion in the previous sections has shown, a multinomial logistic
regression model can be used to make predictions about impersonalization strategies
used in English. The accuracy of the models is of course quite limited, as can
be seen from the model statistics. However, most of the improvements resulting from
the stepwise addition of variables were significant and, in my view, constitute an
interesting result, specifically insofar as they were tied to predictions made on the
basis of theoretical considerations, which thus received some support.
As was mentioned above, a model assigns probabilities to each outcome. Let
us consider one example in order to understand how this works. The German sen-
tence in (46) has the features in (47) (it is a translation from Dutch).
(46) Herr Präsident! Über die Bedeutung der Finanzhilfen für Zypern und Malta
ist man sich weitgehend einig. [Europarl, v7]
(47) Source: sister
Mod: 0
Ver: 1
Quant: univ

On the basis of this information, the four-variable model assigns the following
probabilities to each of the eight possible outcomes/strategies:
(48) psv   nom   pron   nfin   one   you   rphr   we
     19%   27%   10%    7%     6%    5%    4%     22%

The strategy ‘nominal’ has the highest percentage assigned by the model and is
thus correctly predicted in this case, as is shown by the English heterophrase of
(46):
(49) Mr President, there is a general consensus of opinion that financial aid to
Cyprus and Malta is important. [Europarl, v7]

This example shows why the CPO-value is of limited use as an indicator of good-
ness of fit: It only assesses the strategy with the highest probability. The model con-
tains information about the probabilities of the other strategies as well, however.
Given that the other indicators of goodness of fit are hard to interpret, linguisti-
cally speaking, I have decided to provide a CPO-value in each case anyway.
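
The difference between the CPO-value and a likelihood-based measure can be illustrated with the probabilities in (48). The following Python sketch is an illustration only; the variable names are hypothetical, and the probabilities are simply the percentages from (48) divided by 100.

```python
import math

# Probabilities assigned by the four-variable model in (48).
PREDICTED = {"psv": 0.19, "nom": 0.27, "pron": 0.10, "nfin": 0.07,
             "one": 0.06, "you": 0.05, "rphr": 0.04, "we": 0.22}
OBSERVED = "nom"  # strategy used in the English heterophrase (49)

# CPO only asks whether the top-ranked strategy is the observed one.
top_strategy = max(PREDICTED, key=PREDICTED.get)
is_cpo_hit = top_strategy == OBSERVED

# The log-likelihood uses the probability of the observed strategy itself:
# this case contributes -log(p) to the negative log-likelihood, which
# shrinks as the model becomes more confident in the observed strategy.
neg_log_lik = -math.log(PREDICTED[OBSERVED])

# The full ranking shows the information that the CPO-value discards.
ranking = sorted(PREDICTED, key=PREDICTED.get, reverse=True)

print("top-ranked strategy:", top_strategy, "| CPO hit:", is_cpo_hit)
print(f"contribution to negative log-likelihood: {neg_log_lik:.3f}")
print("ranking:", " > ".join(ranking))
```

A model assigning 90% to ‘nominal’ and one assigning 27% would both count as a hit for the CPO-value, whereas the likelihood-based measures would distinguish between them.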


6. Summary and conclusions

In the present article I have argued for a corpus-based contrastive research design
that is based on the comparison of individual sentences in a translation corpus.
Such studies allow us to make much more specific predictions than could be made
on the basis of a text-level comparison of comparable corpora. While it is true, of
course, that translated language differs in systematic ways from original language
(as was also confirmed by the present study), the possibility of comparing
individual pairs of sentences is an invaluable advantage of translation
corpora. Obviously, both types of corpora are needed and can be used for
different types of research questions.
On a descriptive level, the study has shown that in the corpus under investi-
gation, the distribution of impersonalization strategies is most prominently in-
fluenced by the source type, i.e., the question of whether a sentence is an English
original, or whether it was translated from either German or some other language.
Among the semantic variables, ‘type of quantification’ (existential/universal),
(non-)veridicality and modality are the most useful ones. The variable ‘(non-)gen-
eralizing’ has not been found to be a powerful predictor. Similarly, no significant
impact on the distribution of impersonalization strategies has been determined
for the variable ‘clause type’. It needs to be mentioned, however, that only four
levels were distinguished for this variable. More fine-grained studies, and studies
based on larger samples of examples, may show this variable to be a significant
contributor to the distribution of impersonalization strategies as well.
This disclaimer basically applies to the entire study, which is programmatic in
many respects. The model statistics indicated a rather modest goodness of fit, to
put it mildly. This is not surprising in view of the fact that linguistic choices are
multi-dimensional decisions, and that the present study has focused on matters of
sentence semantics and syntax. It is likely that information structure plays a role,
and that there are speaker-specific preferences as well as lexical effects. Integrating
such variables into the model would certainly be a worthwhile undertaking. Even
on the basis of a broader range of data, we have to reckon with a certain remnant
of random variation, however.
Even though the aim of predicting translational equivalents has only partially
been achieved, I hope to have shown that an item-specific, predictive and hypothe-
sis-driven approach to contrastive linguistics can lead to results that would be hard
or impossible to achieve otherwise. In particular, this type of research allows us to
bridge the gap between fine-grained theoretical studies and quantitative corpus
research. The ambitiousness of the project has been pointed out repeatedly, and
it should be obvious that there is a lot of room for improvement, on all accounts.
Most importantly, topics like impersonalization, which interact with various levels
of linguistic organization, could be better investigated if (translation) corpora
annotated at various levels of linguistic description (minimally, syntax, sentence
semantics and information structure) were available. The creation of such corpora
is, in my view, one of the challenges for empirically oriented contrastive linguistics
in the next few years.

References

Alonso-Ovalle, Luis. 2002. Arbitrary Pronouns are not that Indefinite. In C. Beyssade, R. Bok-
Bennema, F. Drijkoningen, and P. Monachesi (eds.), Romance Languages and Linguistic
Theory 2000, 1–14. Amsterdam: John Benjamins. DOI: 10.1075/cilt.232.02alo
van der Auwera, Johan, Volker Gast and Jeroen Vanderbiesen. 2012. Human Impersonal
Pronouns in English, Dutch and German. Leuvense Bijdragen 98: 27–64.
Cartoni, Bruno and Thomas Meyer. 2012. Extracting Directional and Comparable Corpora
from a Multilingual Corpus for Translation Studies. In Proceedings 8th International
Conference on Language Resources and Evaluation (LREC). Istanbul, Turkey.
Cartoni, Bruno, Sandrine Zufferey and Thomas Meyer. 2013. Using the Europarl Corpus for
Cross-linguistic Research. Belgian Journal of Linguistics 27. 23–42. 
DOI: 10.1075/bjl.27.02car
Cohen, A. 1980. On the Graphical Display of the Significant Components in a Two-way
Contingency Table. Communications in Statistics – Theory and Methods A. 1025–1041.
DOI: 10.1080/03610928008827940
Friendly, M. 1992. Graphical Methods for Categorical Data. In SAS User Group International
Conference Proceedings. 190–200.
Gast, Volker. 2012. Contrastive Linguistics: Theories and Methods. In B. Kortmann (ed.),
Dictionary of Linguistics and Communication Science: Linguistics Theory and Methodology.
Berlin: de Gruyter Mouton.
Gast, Volker and Johan van der Auwera. 2013. Towards a Distributional Typology of Human
Impersonal Pronouns, Based on Data from European Languages. In D. Bakker and M.
Haspelmath (eds), Languages across Boundaries. Studies in Memory of Anna Siewierska,
31–56. Berlin: de Gruyter Mouton.
Giannakidou, Anastasia. 1998. Polarity Sensitivity as (Non)Veridical Dependency. Amsterdam:
John Benjamins. DOI: 10.1075/la.23
Giannakidou, Anastasia. 2011. Negative and Positive Polarity Items. In Klaus von Heusinger
and Claudia Maienborn (eds), Semantics: An International Handbook, volume 33.2 of
Handbücher der Sprach- und Kommunikationswissenschaften, 1660–1712. Berlin: de
Gruyter Mouton.
Granger, Sylviane, Jacques Lerot and Stephanie Petch-Tyson (eds). 2003. Corpus-Based
Approaches to Contrastive Linguistics and Translation Studies. Amsterdam: Rodopi.
Johansson, Stig. 1998. On the Role of Corpora in Cross-linguistic Research. In S. Johansson and
S. Oksefjell (eds.), Corpora and Cross-linguistic Research: Theory, Method, and Case Studies,
3–24. Amsterdam/Atlanta: Rodopi.
Johansson, Stig. 2000. Contrastive Linguistics and Corpora. SPRIKreports, Reports from the
project ‘Languages in Contrast’. University of Oslo. Available at
http://www.hf.uio.no/ilos/forskning/prosjekter/sprik/pdf/sj/johansson2.pdf [last accessed December 2014].
Johansson, Stig. 2004. Viewing Languages through Multilingual Corpora, with Special Reference
to the Generic Person in English, German and Norwegian. Languages in Contrast 4. 261–280.
DOI: 10.1075/lic.4.2.05joh
Koehn, Philipp. 2005. Europarl: A Parallel Corpus for Statistical Machine Translation. Phuket:
MT Summit X.
Lewis, David. 1975. Adverbs of Quantification. In E. Keenan (ed.), Formal Semantics of Natural
Language, 3–15. Cambridge: Cambridge University Press. 
DOI: 10.1017/CBO9780511897696.003
Moltmann, Friederike. 2006. Generic one, Arbitrary PRO, and the First Person. Natural
Language Semantics 14. 257–281. DOI: 10.1007/s11050-006-9002-7
Montague, Richard. 1969. On the Nature of Certain Philosophical Entities. The Monist 53.
159–194. DOI: 10.5840/monist19695327
R Core Team. 2013. R: A Language and Environment for Statistical Computing. R Foundation
for Statistical Computing, Vienna, Austria. Available at http://www.R-project.org/
[last accessed December 2014].
Schmied, Josef. 2008. Contrastive Corpus Studies. In A. Lüdeling and M. Kytö (eds), Corpus
Linguistics: An International Handbook, 1140–1159. Berlin: de Gruyter Mouton.
Smith, Thomas J. and Cornelius M. McKenna. 2013. A Comparison of Logistic Regression
Pseudo R² Indices. Multiple Linear Regression Viewpoints 39. 17–26.
Zwarts, Frans. 1995. Nonveridical Contexts. Linguistic Analysis 25. 286–312.

Author’s address
Volker Gast
English Department
Friedrich Schiller University of Jena
Ernst-Abbe-Platz 8
07743 Jena
Germany
volker.gast@uni-jena.de
