
How Comparable Can
'Comparable Corpora' Be?

Sara Laviosa
University of Birmingham & UMIST
Abstract: The development of a coherent methodology for corpus-based work in
translation studies is essential for the evolution of this new field of research into a
fully-fledged paradigm within the discipline. The design of a monolingual,
multi-source-language comparable corpus of English as a resource for the systematic
study of the nature of translated text can be regarded as an important step
towards the development of such a methodology. This paper deals with a crucial
and problematic aspect of the design of a monolingual comparable corpus,
namely the achievement of an adequate level of comparability between its
translational and non-translational components.
Résumé: Une méthodologie cohérente pour l'étude de corpus de traductions
s'impose si l'on entend conférer à celle-ci la valeur d'un paradigme au sein de la
traductologie. L'étude systématique du texte traduit qui s'étaye sur un corpus
comparable d'anglais à plusieurs langues-sources peut être considérée comme
une étape importante de ce processus. Le présent article étudie un aspect essen-
tiel mais problématique de la constitution de corpus monolingues comparables, à
savoir la mise au point d'un niveau approprié de comparabilité entre ses parties
traduites et non-traduites.
Target 9:2 (1997), 289-319. DOI 10.1075/target.9.2.05lav
ISSN 0924-1884 / E-ISSN 1569-9986 © John Benjamins Publishing Company

1. Introduction
When I first read the manuscript of Mona Baker's article "Corpora in
Translation Studies: An Overview and Some Suggestions for Future Research"
(1995), I was inspired by the challenge of working towards developing a
coherent methodology for corpus-based translation studies, because I believed
then (and still do) that this is an essential step for realising the potential
envisaged in this new field of research. In October 1994, I began working on
the creation of a monolingual comparable corpus of English, which was
conceived as a resource to be made available to the academic community for
the systematic study of the linguistic nature of translated text. The present size
of the corpus is 2,000,000 words and it is now in the process of being made
accessible to translation scholars through the network.
There have been three main phases involved in the overall process of
developing a corpus-based methodology for the empirical and large-scale
study of translation: (a) the elaboration of the criteria for designing an English
Comparable Corpus (ECC), (b) the application of these principles to the
creation of two sub-sections of the ECC, namely newspaper articles and
narrative prose, and (c) the investigation of the sub-sections in comparison to
each other, as a way of testing the viability of the proposed methodology (see
Laviosa-Braithwaite 1996). In this paper, I will focus on one of the key issues
tackled in the design of the ECC, namely the level of comparability that can be
achieved between its translated and non-translated texts. More specifically, I
will analyse and evaluate how the original notion of 'comparable corpus' has
been developed in the first and second phase of the design of two subcorpora
of the ECC: Newspapers and Narrative Prose. Awareness of the main practical
and theoretical issues involved in seeking a reasonable level of comparability
has important implications for the type of research questions that can be posed
to the corpus and for the interpretation of the final results.
2. The Design of the English Comparable Corpus (ECC)
2.1. Initial Definition of Comparable Corpus
In translation studies the term 'comparable' corpus has been proposed by
Baker to denote a corpus consisting of two sets of texts in the same language:
translations and originals. The two collections of texts, she says, "should
cover a similar domain, variety of language and time span, and be of
comparable length. The translation corpus should be representative in terms of the
range of original authors and of translators" (Baker 1995: 234). One of the key
features of this type of corpus is the comparability between its translational
and non-translational components, which should be similar in as many
respects as possible in an attempt to ensure that any linguistic differences found
between them can be reliably attributed to their different status as translation
vs. non-translation, rather than to confounding variables.
The English Comparable Corpus (ECC) consists of two computerised
collections of processed texts: one, which I refer to as the Translational
English Corpus (TEC), comprises translations into English from a variety of
source languages; the other, which I have named the Non-Translational
English Corpus (NON-TEC), includes original English texts which are
comparable to TEC texts in ways that are discussed in detail in this paper.
2.2. Corpus Typology and Classification of the ECC
In order to categorise both the corpus as a whole and each of its components in
a way that is consistent with existing corpus typologies I have adapted and
expanded the groups of contrastive parameters proposed by Atkins et al.
(1992) in their typology of corpora.
The typology presented below is organised along four hierarchical levels.
The first level consists of six sets of contrastive parameters that relate to the
most general features of a text corpus. Subsequent levels concern increasingly
more specific groups of parameters relevant to the corpus type designed in the
present study. The proposed typology is not intended to be exhaustive. Its
function is to provide a common framework within which the English Compa-
rable Corpus can be described in relation to other corpus types. It also
constitutes the first stage in the design process, during which the general
features of the corpus were established.
1) Level I:
Corpus Types: FULL-TEXT
SAMPLE
MIXED (FULL-TEXT AND SAMPLE)
MONITOR
A full-text corpus contains unabridged texts, while a sample corpus is made
up of portions of texts selected according to stated design principles concern-
ing size, location of the sample within the full text and method of selection. A
mixed corpus contains both unabridged texts and portions of others. Finally,
a monitor corpus is made up of full texts scanned continuously and passed
through a filter to keep the data on a given language up-to-date.
Corpus Types: SYNCHRONIC
DIACHRONIC
A synchronic corpus contains texts produced within a restricted period of
time, while a diachronic corpus is made of texts produced over a long period.
Corpus Types: GENERAL
TERMINOLOGICAL
A general corpus is made up of texts assumed to be representative of
everyday, non-specialised language. A terminological corpus includes texts
originating within specialised subject fields, e.g. chemistry or geology, which
are characterised by heavy use of recurring terms.
Corpus Types: MONOLINGUAL
BILINGUAL
MULTILINGUAL
A monolingual corpus contains texts produced in one language. A
bi/multilingual corpus is made up of texts produced in two or more languages,
selected according to similar criteria.
Corpus Types: LANGUAGE(S) OF CORPUS
For example a corpus of English, French, German etc.
Corpus Types: WRITTEN
SPOKEN
MIXED (WRITTEN AND SPOKEN)
A written corpus is made up entirely of written texts, both written-to-be-read
and written-to-be-spoken.1 A spoken corpus consists of recorded spoken
texts, including those that are spoken-to-be-written.
2) Level II
Monolingual Corpus Types: SINGLE
COMPARABLE
A single monolingual corpus consists of one set of texts all in the same lan-
guage. A monolingual comparable corpus is made up of two single mono-
lingual corpora: one translational, the other non-translational. The two corpora
are set up according to similar design criteria. In corpus linguistics the term
'comparable corpus' is generally used to refer to a bi/multilingual corpus
made up of two or more sets of texts from the same subject domain(s) (Sinclair
1991b; Teubert 1994; Peters and Picchi 1996), while the term 'parallel corpus'
refers to a corpus of original texts in language A and their translations into
language B. However, in translation studies and contrastive linguistics the
terminology is not always consistent; some scholars use 'parallel corpus' to
cover both types of bilingual corpora (Johansson and Holland 1994;
Hartmann 1994; Gellerstam 1996), while others follow the traditional termi-
nology of contrastive analysis (Aijmer et al. 1996; Granger 1996) and differ-
entiate between a 'translation corpus' (original texts in language A and their
translations in language B) and a 'parallel corpus' (original texts in language
A and B). The terminology I employ in this study overlaps with existing
categorisations in corpus linguistics in order to facilitate the comparison of my
research results with those of other studies in this field. I think it would be
useful to aim towards adopting a consistent terminology in corpus-based
translation studies because this would aid the systematic accumulation of new
data and facts about translation and translating.
3) Level IIIa
Single Corpus Types: TRANSLATIONAL
NON-TRANSLATIONAL
A translational corpus is made up of texts which are known to have been
translated into a given language. A non-translational corpus consists of
original texts in a given language.
4) Level IIIb
Comparable Corpus Types: TRANSLATION-DEPENDENT
NON-TRANSLATION-DEPENDENT
INDEPENDENT
A translation-dependent comparable corpus is one in which the non-
translational component is modelled on the composition of the translational
set. A non-translation-dependent comparable corpus is one where the
composition of the translational set is modelled on the non-translational
corpus. In an independent comparable corpus the two components are
designed separately and subsequently linked on the basis of independently
established criteria of comparability.
On the basis of the corpus typology so far delineated, ECC is classified as
a monolingual, mixed full-text and sample, synchronic, translation-dependent,
comparable corpus of written general English. TEC is generally described at
this stage as a monolingual, single, full-text, synchronic corpus of written
general translational English. NON-TEC is categorised as a monolingual,
single, mixed full-text and sample, synchronic corpus of written general non-
translational English.
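By way of illustration, the classification just given can be read as a record holding one value per contrastive parameter. The following is a minimal Python sketch under that reading; the class `CorpusProfile`, its field names and the helper `differing_parameters` are hypothetical labels introduced here and are not part of the typology itself.

```python
from dataclasses import dataclass, fields
from typing import List, Optional


@dataclass(frozen=True)
class CorpusProfile:
    """One value per contrastive parameter of Levels I to III (field names are illustrative)."""
    name: str
    extent: str                       # FULL-TEXT, SAMPLE, MIXED or MONITOR
    time_span: str                    # SYNCHRONIC or DIACHRONIC
    domain: str                       # GENERAL or TERMINOLOGICAL
    linguality: str                   # MONOLINGUAL, BILINGUAL or MULTILINGUAL
    language: str                     # language(s) of the corpus
    medium: str                       # WRITTEN, SPOKEN or MIXED
    monolingual_type: str             # Level II: SINGLE or COMPARABLE
    status: Optional[str] = None      # Level IIIa (single corpora only)
    dependency: Optional[str] = None  # Level IIIb (comparable corpora only)


# Values taken from the classification of ECC, TEC and NON-TEC given in the text.
TEC = CorpusProfile("TEC", "FULL-TEXT", "SYNCHRONIC", "GENERAL", "MONOLINGUAL",
                    "English", "WRITTEN", "SINGLE", status="TRANSLATIONAL")
NON_TEC = CorpusProfile("NON-TEC", "MIXED", "SYNCHRONIC", "GENERAL", "MONOLINGUAL",
                        "English", "WRITTEN", "SINGLE", status="NON-TRANSLATIONAL")
ECC = CorpusProfile("ECC", "MIXED", "SYNCHRONIC", "GENERAL", "MONOLINGUAL",
                    "English", "WRITTEN", "COMPARABLE",
                    dependency="TRANSLATION-DEPENDENT")


def differing_parameters(a: CorpusProfile, b: CorpusProfile) -> List[str]:
    """Return the parameters (other than the name) on which two profiles differ."""
    return [f.name for f in fields(a)
            if f.name != "name" and getattr(a, f.name) != getattr(b, f.name)]


print(differing_parameters(TEC, NON_TEC))  # ['extent', 'status']
```

Comparing the two profiles makes explicit which parameters are held constant between TEC and NON-TEC and which are not: apart from translational status, they differ only in the full-text versus mixed composition.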
The order in which I present the two components of ECC, TEC first and
NON-TEC second, is not arbitrary, but rather reflects an important aspect of the
methodology, namely the priority given to the design of TEC, since it is
precisely its composition that determines the dimensions of comparability and
the resulting design of NON-TEC. This methodological aspect in turn
depends on having established that the purpose of the ECC design is the study of
translational language versus original language and not the other way around.
The three-level corpus typology presented above is hopefully adequate for
classifying the main general features of ECC and NON-TEC. However, it is
insufficient for categorising the principal characteristics of the translational
corpus which, by its very nature, has more dimensions than a monolingual
single corpus of original language. I have therefore added a fourth level in the
corpus typology. This is made up of seven sets of contrastive parameters that
will be combined with the previous general categories for the final classifica-
tion of texts in TEC.
5) Level IV
Translational Corpus Types: MONO-SOURCE-LANGUAGE
BI-SOURCE-LANGUAGE
MULTI-SOURCE-LANGUAGE
A mono-source-language corpus is made up of texts translated from one
initial source language. A bi-/multi-source-language corpus is made up of
texts translated from two or more initial source languages.
Translational Corpus Types: MONO-TRANSLATING-MODE
BI-TRANSLATING-MODE
MULTI-TRANSLATING-MODE
A mono-translating-mode corpus contains texts translated in one of the
following modes:
written
  translating in writing from a written source text
  translating in writing from a transcribed oral source text
  translating in writing from a spoken source text
oral
  translating orally from a written source text
  interpreting
    simultaneous
    consecutive
    liaison (where an interpreter mediates between two or more individuals by
    alternating with the speakers [source language] in turn in relatively short
    segments [target language])
A bi-/multi-translating-mode corpus is made up of texts translated in two or
more of these modes.
Translational Corpus Types: MONO-TRANSLATION-METHOD
BI-TRANSLATION-METHOD
MULTI-TRANSLATION-METHOD
A mono-translation-method corpus comprises texts translated through one
of the following methods:
human translation
machine translation (MT)
computer-assisted translation (CAT)
A bi-/multi-translation-method corpus is made up of different groups of
translations, each of them characterised by one translation method.
Translational Corpus Types: INTO-MOTHER-TONGUE
INTO-FOREIGN-LANGUAGE
INTO-LANGUAGE-OF-HABITUAL-USE
MIXED-TARGET-LANGUAGE-STATUS
The corpus contains texts that have been translated into the translator's mother
tongue, foreign language, language of habitual use or a mixture of the three.
This parameter is established by direct questioning about the translator's
nationality at birth and the current one.
Translational Corpus Types: PROFESSIONAL
STUDENT
A professional corpus consists of translations carried out by professional
translators, who translate on a regular basis as part of their main occupation,
regardless of whether they have received any formal training, while a student
corpus contains translation assignments produced by students of translation
and interpreting.
Translational Corpus Types: PUBLISHED
UNPUBLISHED
A published corpus consists of translations which have been published and
are widely available for sale to the public, while an unpublished corpus
contains unpublished works, such as translations which are being proposed to
prospective publishers.
2.3. The Theoretical and Practical Motivation for the TEC Design
The decision to create a corpus of general language has been taken mainly
because this is considered more representative of the translational language
population, particularly in terms of reception.2 Moreover, since it was to be a
resource for the systematic study of translated text, I have assumed3 that a
general rather than a specialised corpus would interest a larger community of
scholars. The exclusion of technical texts has led to the creation of a mono-
translation-method corpus (human translation) since both Machine Transla-
tion and Computer Assisted Translation systems are used mainly in
specialised subject fields (Sager 1994: 300; 303). Preference has been given to
a multi-source-language corpus because this makes it possible to test and
develop a methodology for the identification of features of translational
language which are assumed to be independent of the influence of the specific
source language involved in the translation process. Preference has also been
given to published translations carried out by professional translators since
these can be regarded as more representative of translational language, given
that one can assume that they are read by a wider audience. Full, rather than
sample texts have been included in order to obtain reliable data on the various
measures of lexical simplification selected for the investigation of the corpus,
such as type/token ratio and lexical density. These may in fact vary from one
section of the text to another (Jeremy Clear, personal communication, 1996),
consistently with Sinclair's claim that few linguistic features of a book-length
text are distributed evenly throughout (Sinclair 1991a: 19). Moreover, a full-
text corpus is a much more useful resource since it permits a greater variety of
linguistic analyses, such as the investigation of large patterns of text and of the
development of characters in a novel (Baker 1995: 240). A full-text corpus is
also invaluable to the researcher who wishes to compare a particular transla-
tion with its source text by creating a parallel corpus alongside the initial
comparable one. The reason why a synchronic rather than a diachronic corpus
was created lies in the nature of the hypotheses being tested on the corpus,
which do not concern the development of linguistic patterns over time, but the
regularities of current linguistic behaviour in translation (Laviosa forthcom-
ing; Laviosa-Braithwaite forthcoming a; b). The mode of the text and the
translating mode are both written for entirely practical reasons, namely greater
and more varied availability of texts, less time and lower costs with regard to
both acquiring the translations and converting them into computer-readable
form. It is also comparatively easier to establish who holds the copyright for a
written translation than for an interpreted text. Finally, English has been
chosen as the language of the corpus mainly because of the existence of a very
large corpus of general language — the British National Corpus — which has
been made available to the academic community since March 1995 and from
which suitable texts can be extracted for the design of NON-TEC. English is
also the world's best described language (Sinclair 1991a); it is therefore
reasonable to assume that it would attract more scholars to the emerging
corpus-based approach in translation studies. The choice of English has in turn
led to the exclusion of mediated translations since this phenomenon is rare in
current translational English, given the hegemonic status of the English lan-
guage world-wide.
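The argument for full texts rests on the observation that measures such as type/token ratio and lexical density may vary from one section of a text to another. A minimal Python sketch of that point follows. It assumes the common definitions (distinct word forms over running words, and lexical words over running words); the tokeniser and the very short `FUNCTION_WORDS` list are illustrative assumptions and not the procedure used for the ECC.

```python
import re
from typing import Iterator, List, Tuple

# Tiny illustrative list of grammatical words (an assumption); a real study
# would rely on a fuller, documented list.
FUNCTION_WORDS = {"the", "a", "an", "of", "in", "on", "at", "to", "and", "or",
                  "but", "it", "is", "was", "from", "for", "that", "when"}


def tokenize(text: str) -> List[str]:
    """Lower-case the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())


def type_token_ratio(tokens: List[str]) -> float:
    """Distinct word forms divided by running words."""
    return len(set(tokens)) / len(tokens) if tokens else 0.0


def lexical_density(tokens: List[str]) -> float:
    """Share of running words that are not in the function-word list."""
    if not tokens:
        return 0.0
    lexical = [t for t in tokens if t not in FUNCTION_WORDS]
    return len(lexical) / len(tokens)


def by_section(tokens: List[str], size: int) -> Iterator[Tuple[float, float]]:
    """Yield (type/token ratio, lexical density) for consecutive sections."""
    for start in range(0, len(tokens), size):
        chunk = tokens[start:start + size]
        yield type_token_ratio(chunk), lexical_density(chunk)


tokens = tokenize("The cat saw the dog")
print(type_token_ratio(tokens))  # 0.8
print(lexical_density(tokens))   # 0.6
```

Running `by_section` over a whole book with a fixed section size gives one pair of values per section, which is one way of checking how far a sample taken from a single location would misrepresent the full text.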
By taking into account the additional dimensions outlined in the previous
section, TEC is now categorised also, if somewhat arbitrarily, as a
multi-source-language, mono-translating-mode (written-to-be-read mode),
mono-translation-method (human translation), largely into-mother-tongue,
professional, published corpus. These characteristics of TEC are
the result of choices made on the basis of both theoretical and practical
considerations.
2.4. The Identification of the Text Categories4 of TEC
Once the general features of TEC have been established, the next step in the
design process consists of identifying the text categories which best fit these
characteristics.
Moreover, the choice of suitable text genres is partly governed by the
likelihood of their representing a variety of female and male translators and
authors. At this stage, as in the previous one in which the general features of
TEC were established, both a priori criteria and practical considerations are
taken into account. In case of conflict, priority is generally given to theoretical
principles.
The a priori criteria for identifying the text genres of TEC are as follows:
• General English (not restricted to any particular regional variety)
• Full-texts
• Synchronic: produced within the last 15 years
• Written
• Published
• Translated from a variety of source languages
• Translated without any computer-assisted translation software
• Translated through the written translating mode
• Translated by professional translators (as defined in 2.2)
• Translated into the translators' mother tongue
• Variety of female and male translators
• Variety of female and male authors
The main practical considerations are:
• large size and availability in electronic form
• the ease with which permission from copyright holders can be ob-
tained
On the basis of these external criteria, the following text genres have been
chosen for inclusion in the translational corpus of ECC:
BIOGRAPHY
BUSINESS GUIDES
COURSEBOOKS
CUSTOMS AND FOLKLORE
FICTION (GENERAL AND SHORT STORIES)
FOOD AND DRINK
GENERAL KNOWLEDGE
GUIDES (TOURIST)
LEAFLETS (PROMOTIONAL)
MAGAZINES (IN-FLIGHT)
NEWSPAPERS
OFFICIAL REPORTS (PUBLISHED AND/OR PUBLIC DOMAIN) 5
WRITTEN SPEECHES 6
TRAVEL (GENERAL)
2.5. Selecting Suitable Texts for TEC
2.5.1. Selection Criteria

The same external criteria adopted to identify suitable text genres have been
used for selecting the actual collections and the individual texts in each
category. Moreover, an attempt has been made to characterise the target
audience of both the source and the translated texts as both female and male,
literate, intellectual adults. Whenever possible, particularly in the case of
literary translations, I have given preference to those translators and authors
who are highly reputed and distinguished professionals because their works
can be assumed to be read by a wider audience, hence are more apt to
represent the general language of translation. The latter features have been
used as additional principles to guide the selection procedure and have led to
deliberately excluding both tabloid press and pulp fiction. Highly experimen-
tal works of fiction, poetry and children's literature have also been excluded
because they are arguably less typical and less representative of general
original and translational language. In addition, they tend to have restricted
audiences in both source and target languages.
2.5.2. Selection Procedures

The selection of suitable texts to include in the translational corpus and the
concomitant process of assigning a particular publication to the appropriate
text genre have been carried out on the basis of information gathered from a
variety of sources. This is a list of the main sources:
• Leading national newspapers (e.g. The Guardian and The European)
• Whitaker's Bookbank
• The Arts Council of England
• Publishers
• Literary and non-literary translators
• Translation scholars
• Editors
• Translation Agencies
• The Translation Service of the European Commission
• The Welsh Language Board
• The Italian Consulate
With the exception of the Whitaker's Bookbank, about which more below, all
of the other sources were approached either via personal communication or
with a standard letter outlining the research aims and requesting information
about suitable translated material. After the relevant data had been collected,
an initial list of suitable texts, organised according to text genre, was drawn
up.
Information about translated texts contained in the Whitaker's Bookbank
(updated May 1995) was retrieved in a systematic way. A list of all the
Whitaker's text categories was first retrieved. Of these, six were chosen on the
basis of their assumed representativeness of general translational language:
Biography, Customs and Folklore, Fiction (General), Fiction (Short Stories),
General Knowledge and Travel (General). The second step involved
retrieving all the translations belonging to each subject domain. Further searches
regarding the date of publication and the price were then applied as detailed
below:
TEXT CATEGORY            TOTAL     FURTHER SEARCHES       FINAL
                         ENTRIES   Date       Price (£)   ENTRIES
Biography                1023      >01/95     <20.00      47
Customs and Folklore     148       >01/92⁷    <20.00      14
Fiction (General)        2161      >01/95     <9.99       94
Fiction (Short Stories)  513       >01/95     <20.00      21
General Knowledge        13        -          <20.00      10
Travel (General)         348       >01/95     <20.00      15
The reasons for carrying out these additional searches were both practical —
reduction of the initial lists to a manageable number of texts which are not
exceedingly expensive — and theoretical — the most recent books should be
more representative of current translational language and the lower price is
taken as an indicator of wider reception.
As regards General Fiction, which covers the largest number of texts, a
further manual selection was carried out to exclude books whose price is less
than £4.95 in an attempt to weed out pulp fiction. The entire selection stage
ended with the compilation of a single list of suitable texts derived from both
the Whitaker's Bookbank and the other sources.
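The date and price searches described above amount to a per-category filter. The Python sketch below is a hypothetical reconstruction for illustration only: the `Entry` and `Criteria` records, the reading of ">01/95" as "dated later than January 1995", and the reading of the Customs and Folklore date as 01/92 are all assumptions, and nothing here reproduces the actual Whitaker's Bookbank search interface.

```python
from dataclasses import dataclass
from decimal import Decimal
from typing import Dict, Iterable, List, Optional, Tuple


@dataclass(frozen=True)
class Entry:
    title: str
    category: str
    published: Tuple[int, int]  # (year, month)
    price: Decimal              # pounds sterling


@dataclass(frozen=True)
class Criteria:
    after: Optional[Tuple[int, int]]   # keep entries dated later than this, if set
    below: Decimal                     # keep entries priced below this
    not_below: Optional[Decimal] = None  # manual floor applied to general fiction


TWENTY = Decimal("20.00")
CRITERIA: Dict[str, Criteria] = {
    "Biography": Criteria((1995, 1), TWENTY),
    "Customs and Folklore": Criteria((1992, 1), TWENTY),
    "Fiction (General)": Criteria((1995, 1), Decimal("9.99"), Decimal("4.95")),
    "Fiction (Short Stories)": Criteria((1995, 1), TWENTY),
    "General Knowledge": Criteria(None, TWENTY),
    "Travel (General)": Criteria((1995, 1), TWENTY),
}


def select(entries: Iterable[Entry]) -> List[Entry]:
    """Keep the entries that satisfy the date and price limits of their category."""
    kept = []
    for entry in entries:
        limits = CRITERIA.get(entry.category)
        if limits is None:
            continue  # category not chosen for the corpus
        if limits.after is not None and entry.published <= limits.after:
            continue  # not recent enough
        if entry.price >= limits.below:
            continue  # too expensive
        if limits.not_below is not None and entry.price < limits.not_below:
            continue  # priced below the floor used to weed out pulp fiction
        kept.append(entry)
    return kept
```

Keeping the limits in a single table makes the selection reproducible and allows the thresholds to be varied when testing how sensitive the final list is to them.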
2.5.3. Copyright Permissions

Before a selected publication could be included in the corpus, permission had
to be obtained from the copyright owner. A standard letter which outlines the
aims of the research and asks for permission to reproduce in electronic form
the texts concerned was sent to either the publisher or the translator, or both.
2.5.4. The Acquisition of Texts for TEC

Once permission was granted, the texts were acquired in a variety of ways. In
the case of books or official reports, they were bought or were generously
given free of charge by the publishers, the translators or the translation
agencies. Newspaper articles were photocopied from back numbers held in
public libraries or downloaded either from The Guardian on CD-ROM or
from on-line services such as Campus 2000, both of which offer full-text
access to leading daily and weekly UK newspapers.
2.5.5. From a Printed or Electronic Text to a TEC Text

Two methods of text conversion from print to electronic form were adopted:
scanning by the use of an OCR reader, and keyboarding. The first method was
used extensively for books, while keyboarding was limited to newspaper
articles. Each individual text was checked for spelling and for typing and
recognition errors, saved as a separate ASCII file, and then minimally marked
up for underlying structural text features, such as chapter and title, by means
of a start-tag (<...>) marking the beginning of an element and an end-tag
(</...>) marking its end. All tags were enclosed in angle brackets. The average
time taken to convert an entire book into a TEC text is 20 hours. Example (1)
is an extract from the first story of The Siren by Dino Buzzati, translated by
Lawrence Venuti. It shows the markup for the title of the book and that of the
first story in the collection.
(1) <book title>The Siren</book title> <story title>Barnabo of the
Mountains</story title> No one remembers when the house was
built for the foresters from the village of San Nicola. Also known as
Casa dei Marden, it was located in Valle delle Grave, at the foot of
the mountains. Five paths emanated from that point and entered the
forest. One descended into the valley toward San Nicola, gradually
turning into a true road. The other four rose amid the trees,
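A start-tag and end-tag pair of the kind shown in Example (1) can be recovered with a short regular expression. This is a minimal Python sketch, assuming that element names may contain spaces (as in "book title") and that elements are not nested; the function name `structural_elements` is introduced here for illustration.

```python
import re
from typing import List, Tuple

# A start-tag <name>, the element content, and the matching end-tag </name>.
ELEMENT = re.compile(r"<(?P<name>[^<>/]+)>(?P<content>.*?)</(?P=name)>", re.DOTALL)


def structural_elements(text: str) -> List[Tuple[str, str]]:
    """Return (element name, content) pairs, with line breaks in the content collapsed."""
    return [(m.group("name"), " ".join(m.group("content").split()))
            for m in ELEMENT.finditer(text)]


extract = ("<book title>The Siren</book title> <story title>Barnabo of the\n"
           "Mountains</story title> No one remembers when the house was built")
print(structural_elements(extract))
# [('book title', 'The Siren'), ('story title', 'Barnabo of the Mountains')]
```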
Example (2) is an extract from an article in The Guardian collection.
The title, the synopsis, and the words accompanying the pictures of newspaper
articles are not translated; they are therefore enclosed in angle brackets and
marked accordingly (e.g. "non-te", which stands for "non-translational
English"), so that they are excluded from the analyses. The name of the
source-text newspaper is also excluded from the text by means of angle brackets.
Tables, diagrams and pictures are marked up in the same way. This ensures
that what is eventually analysed is in fact the translated portion of the article,
and not the editor's additions.
(2) <omit desc=picture & text reason=non-te extent=12><omit
desc=title — THE SEASON IN HELL — reason=not-te><omit
desc=synopsis — As the French troops in Rwanda carve out a new
role, the man leading the Red Cross delegation in Kigali has
decided to return to France. He says there is nothing more his team
can do. Frederic Fischer reports — reason=not-te extent=39><omit
desc=stnewspaper — Le Monde -> SMALL and slim, almost
skinny, Philippe Gaillard is not exactly the kind of man you would
mistake for Rambo. Yet he has been running a delegation of the
International Committee of the Red Cross in Kigali almost every
day for a year. Yesterday, he left Rwanda for good. He has decided
never to return.
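Because the omitted material sits inside the angle brackets, excluding it from analysis comes down to deleting every omit span. The Python sketch below illustrates this under two assumptions: that each span opens with "<omit" and closes at the next ">", and that no ">" occurs inside the omitted material. The function names are hypothetical, and the sample string is a shortened version of Example (2) with plain hyphens.

```python
import re
from typing import List

# Everything from "<omit" up to the next ">" is treated as non-translated material.
OMIT = re.compile(r"<omit\b[^>]*>")


def translated_text(marked_up: str) -> str:
    """Delete all omit spans and normalise the remaining whitespace."""
    return " ".join(OMIT.sub(" ", marked_up).split())


def omitted_spans(marked_up: str) -> List[str]:
    """Return the omit spans themselves, e.g. to inspect their desc and reason values."""
    return OMIT.findall(marked_up)


article = ("<omit desc=title - THE SEASON IN HELL - reason=non-te>"
           "<omit desc=stnewspaper - Le Monde -> SMALL and slim, almost skinny")
print(translated_text(article))      # SMALL and slim, almost skinny
print(len(omitted_spans(article)))   # 2
```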
2.5.6. The File Structure of TEC

TEC is a subdirectory of the ECC. TEC in turn is made up of subdirectories
which correspond to the text categories or subcorpora of the corpus and are
therefore identified with the same name: BIOGRAPHY, FICTION, NEWS-
PAPERS, etc. Each subdirectory consists of ASCII text files which corre-
spond to the individual texts of the corpus. The identification of what
constitutes a separate text was guided by the criterial characteristics proposed
by Atkins et al. (1992: 2). These are: discursiveness — a text is assumed to
consist of coherent sentences and paragraphs; integrality — that is having a
beginning, middle and end, and being considered to be complete in itself;
conscious production of a unified authorial effort, and stylistic homogene-
ity. A book consisting of stories written by the same author is therefore
regarded as one text and saved as a separate file in the Fiction subdirectory
(e.g. The Siren by Dino Buzzati), whereas a book consisting of stories each
written by a different author is considered a group of texts and structured as a
separate subdirectory which has the same name as the title of the book; it
consists of files corresponding to the individual stories and is saved within the
Fiction subdirectory (e.g. The Dedalus Book of Surrealism, by different
authors, edited by Michael Richardson). Similarly, a group of articles selected
from the same newspaper (e.g. The Guardian or The European) is organised
as a separate subdirectory which has the same name as the newspaper; it is
made up of files which correspond to the individual articles and is saved
within the Newspaper subdirectory. Diagram 1 below outlines the file struc-
ture of TEC at the time of writing.
Diagram 1. File Structure of TEC
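The directory layout described above (subcorpus directories holding either text files directly or collection subdirectories of text files) can be traversed as follows. This is a minimal Python sketch using only the standard library; the function name `iter_texts` is hypothetical, and the directory and file names in the usage note are examples rather than the actual contents of TEC.

```python
from pathlib import Path
from typing import Iterator, Optional, Tuple


def iter_texts(tec_root) -> Iterator[Tuple[str, Optional[str], Path]]:
    """Yield (subcorpus, collection or None, file path) for every text under tec_root."""
    root = Path(tec_root)
    for subcorpus in sorted(p for p in root.iterdir() if p.is_dir()):
        for item in sorted(subcorpus.iterdir()):
            if item.is_file():
                # A single-author book saved directly in the subcorpus directory.
                yield subcorpus.name, None, item
            elif item.is_dir():
                # A collection: one file per story or newspaper article.
                for text in sorted(p for p in item.iterdir() if p.is_file()):
                    yield subcorpus.name, item.name, text
```

Given a root containing, say, a FICTION directory with one file for a single-author book and one subdirectory for an anthology, the function reports the book with no collection and each anthology story under the anthology's name, which mirrors the distinction between a text and a group of texts drawn in this section.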
2.5.7. The Extra-Textual Attributes of TEC Texts

Rationale. The importance of clearly identifying, describing and documenting
the component texts of a corpus has been extensively emphasised by corpus
linguists (Atkins et al. 1992; Sinclair 1992) and systematically applied in the
design of existing corpora. There are essentially two interdependent reasons
for recording this information: one concerns the work of the lexicographer or
the descriptive linguist, which consists of analysing inductively the associa-
tion between language patterns and extra-textual features such as genre,
regional variety, gender and so on; the other is methodological, and concerns
the issue of representativeness and balance of a corpus, which one can strive to
achieve only if information on both linguistic and extralinguistic textual
characteristics is available and correlated.
I propose that, when designing a translational corpus as a resource for the
systematic study of the linguistic nature of translated text, the rationale for
selecting and recording a given set of extralinguistic data includes at least two
interrelated additional factors. The first is the role of many of these
extralinguistic features as variables that can be manipulated in order to create
tailor-made subcorpora and to test theory- and/or data-driven hypotheses. The
second factor is their intrinsic value as objects of study in themselves and
sources of information about the "preliminary norms", and especially aspects
of "translation policy" (Toury 1995: 58) prevailing at a given time in a given
socio-cultural milieu. For example, the Newspaper Subcorpus of TEC, which
includes all the translations published in two national newspapers during a
randomly selected period, is composed largely of texts translated from Ro-
mance languages, such as French, Italian, Spanish and Portuguese. From this
finding it may be inferred that the prevailing translation policy for the two
newspapers represented in the corpus is to give preference to source articles
written in Romance languages. Moreover, in the specific case of a translation-
dependent comparable corpus, some of the extralinguistic features recorded
contribute to determining the main criteria for establishing comparability with
a non-translational corpus. I therefore suggest that the extralinguistic data of a
corpus designed to meet the needs of translation studies are more productive
and theoretically loaded than the corresponding attributes recorded in corpora
conceived within the discipline of corpus linguistics, where they have a more
descriptive and instrumental role.
List of Attributes Recorded for Each TEC Text.8 The extralinguistic attributes
recorded for TEC texts concern the translator, the translation, the translation
process and the source text. The list below includes the attribute, an explana-
tory note where necessary, and the alternative values of each attribute.
TRANSLATOR(S)
Attribute: Name
Values: Full name
Attribute: Gender
Values: Female, Male
Attribute: Sexual orientation (only for literary texts)
Values: Lesbian, Gay, Heterosexual, Immaterial, Unknown
Attribute: Age
Values: The actual age at the time the translation was carried out
Attribute: Employment status
Values: Freelance, In-house, Employee of Translation Agency, Di-
rector of Translation Agency, Other (to specify ad hoc)
Attribute: Translation workload
Note: This refers to the amount of time normally devoted by the
translator to translating activities
Values: Full-time, Part-time

Attribute: Nationality at birth
Attribute: Current nationality
Attribute: Domicile (at the time of translating)
Values: Source Language Country, Target Language Country,
Other Country
TRANSLATION
Attribute: Text category
Note: In the present study this feature is also referred to as text
genre, institutional text type, subject domain and subject
field. Its identification is based on consensus. This means
that the text categories used in the present classification of
translational texts are drawn from institutional groupings
such as those used by the Whitaker's Bookbank, publishing
companies, public libraries, bookshops, or in everyday lan-
guage use. All the texts that are assigned, on the basis of
consensual criteria, to a particular text category form a
subcorpus of the main corpus, with the same denomination
as the text category.
Values: Biography, Fiction, Newspapers, etc.
Attribute: Collection
Note: This refers to a publication composed of texts written by
different authors (e.g. a newspaper, a book containing a
collection of stories or of academic articles)
Values: The actual name or title of the publication (e.g. The Guard-
ian, The Book of Surrealism)
Attribute: Text title
Note: This is the title of each individual textual unit in the corpus.
This may be either a text that is part of a collection or a
single published work (e.g. a novel or a collection of stories
by the same author)
Values: The actual title of the individual textual unit
Attribute: Text extent
Values: Full, Sample
Attribute: Mode
Note: This is the mode in which the textual content is delivered
Values: Written-to-be-read, Written-to-be-spoken, Spoken,
Spoken-to-be-written
Attribute: Word-count
Values: The actual number of orthographic words counted by the
word-count facility of WordSmith Tools (Scott 1996)
Attribute: Special features
Values: Pictures and/or Diagrams and/or Tables and/or Other (to
specify ad hoc)
Attribute: Date of publication
Values: The date printed in the published translation
Attribute: Place of publication
Values: Country where the translation is published
Attribute: Publisher
Values: The name of the publisher of the translation
Attribute: Publication of the name of the translator(s)
Note: This refers to whether or not the name of the translator is
visible anywhere in the text
Values: Yes, No
Attribute: Copyright
Note: Who holds the copyright for the translated text
Values: Translator, Publisher of the translation, Author, Publisher
of the source text, Other (specify ad hoc)
TRANSLATION PROCESS
Attribute: Relation between translation and source text
Note: This concerns the final status of the target-language text in
relation to that of the source-language text
Values: Complete, Excerpt,9 Direct, Indirect (Mediated)
Attribute: Direction of translation
Note: This is inferred from the data regarding the nationality of
the translator at birth and her/his current nationality. If the
nationality is that of an English-speaking country both at
birth and currently, it is inferred that the translator is a
native speaker of English. If both nationalities are of a
foreign country, it is inferred that English is a foreign
language. If only the nationality at birth is that of a foreign
country, it is assumed that English is the language of ha-
bitual use10
Values: Into mother tongue, Out of mother tongue, Into language of
habitual use, Team
Attribute: Written translating mode
Note: This refers to the specific mode of writing in which the
translation process has been performed
Values: Translating in writing from a written source text
Translating in writing from a transcribed oral source text
Translating in writing from a spoken source text
Attribute: Commissioner
Note: This refers to the initiator of the translating process
Values: Translator, Author, Publisher, Series Editor, Other (to
specify ad hoc)
Attribute: Subcommissioner
Note: This refers to the person or agency who commissions a
translation from a translator on behalf of the initial commis-
sioner
Values: Translation Agency, Series Editor, Other (to specify ad
hoc)
Attribute: Editing
Note: This concerns the type of editing that is carried out by a
person other than the translator(s)
Values: Light copy-editing, Heavy editing, Cooperative editing
(liaising between any two of: author, translator, publisher),
None
Attribute: Time lag
Note: This is the time elapsing between the commissioning and
the publication of the translation
Values: The actual time in days
SOURCE TEXT
Attribute: Language
Attribute: Status
Values: Original, Translation, Excerpt
Attribute: Name of the author(s)
Values: Full name(s)
Attribute: Gender of the author(s)
Values: Female, Male
Attribute: Sexual Orientation of the author(s) (only for literary
texts)
Values: Lesbian, Gay, Heterosexual, Immaterial, Unknown
Attribute: Date of publication
Attribute: Place of publication
Values: Country where the source text was published or produced
Attribute: Publisher
Values: Name of the publisher
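The attribute list above can be read as a record schema. As a rough illustration only, the sketch below encodes a few of the translator attributes together with the inference rule given in the note on Direction of translation. The class name, the field names and the set of English-speaking countries are assumptions made for this example and are not taken from the TEC database. The case of a translator born in an English-speaking country who later acquired a foreign nationality is not covered by the note and is therefore returned as unknown.

```python
from dataclasses import dataclass
from typing import Optional

# Assumption for illustration: which countries count as English-speaking.
ENGLISH_SPEAKING = {
    "United Kingdom", "Ireland", "United States", "Australia", "New Zealand",
}


@dataclass
class TranslatorRecord:
    """Hypothetical subset of the TRANSLATOR(S) attributes."""
    name: str
    nationality_at_birth: str
    current_nationality: str


def infer_direction(t: TranslatorRecord) -> Optional[str]:
    """Apply the rule from the 'Direction of translation' note.

    TEC texts are translations into English, so a native speaker of
    English is taken to translate into the mother tongue.
    """
    born_english = t.nationality_at_birth in ENGLISH_SPEAKING
    now_english = t.current_nationality in ENGLISH_SPEAKING
    if born_english and now_english:
        return "Into mother tongue"
    if not born_english and not now_english:
        return "Out of mother tongue"
    if not born_english and now_english:
        return "Into language of habitual use"
    # Born in an English-speaking country, foreign nationality now:
    # the note does not say what to infer, so nothing is inferred.
    return None
```

The value "Team" is not derived here, since it depends on the number of translators rather than on nationality.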
The examination of the actual data relating to these attributes, providing the
corpus is fairly large and representative, may reveal extralinguistic patterning
from which various interrelated external features of current translational
practice in the English-speaking culture may be inferred. For example, they
may throw light on aspects of "translation policy" (Toury 1995: 58), such as
the preference for certain text categories, or they may reveal trends in the
application of copyright laws. Patterns may also emerge as regards the transla-
tion process, for example the type of editing that is generally carried out or the
procedures underlying commissioning, and so on. Such information based on
external evidence may then feed into the theoretical branch of translation
studies and either be integrated into existing models or give rise to new
specific hypotheses which can then be tested with linguistic analyses carried
out with a corpus-based methodology. At the same time, the attributes re-
corded for each TEC text constitute independent variables which have been
derived from existing theories — for example Sager's communicative theory
of translation (Sager 1994), or Toury's conditional laws of translational
behaviour (Toury 1995: 259-279) — and can be used to test the validity of
some aspects of these models. This illustration of the possible uses of text
attributes within the proposed corpus-based methodology highlights an essen-
tial feature of the corpus-based approach to translation studies; namely, the
interrelationship between description, testing of a priori hypotheses and meth-
odology.
Acquiring and Recording the Extralinguistic Features of TEC Texts. The
features of TEC texts are recorded in a database file. Information about the
actual values of the attributes is collected from three sources:11
• a questionnaire sent with a standard explanatory letter to one of the
following:
the translator of an individual text or group of texts
the editor of a collection of texts
the translation agency subcommissioning a collection of texts
• inspection of the relevant sections of the printed or electronic copies of
the texts
• direct questioning of the professionals involved in the translation
process.
Special mention needs to be made in respect of the problematic acquisition of
the data regarding the sexual orientation of both the translator and the author
of literary translations. Although this is considered valuable information that
may interest scholars concerned with the study of translational language and
gender, it is also recognised that it is a highly sensitive and controversial area
of research. Despite these challenges, I have attempted to develop a method
for eliciting and collecting this data. This involves drafting a different letter
for literary translators which contains a short note explaining the rationale for
eliciting data on various extra-textual features of the translation and an invita-
tion to comment on their own sexual preferences and those of the author if
they feel they can or want to do so.12 Other sources consist of published and/or
publicly available information such as statements made by the professionals
concerned, editorial comments, public speeches and interviews. The latter
method has been employed in the case of two autobiographies by Juan
Goytisolo: Realms of Strife and Forbidden Territory, both translated by Peter
Bush. In this instance the information about the homosexuality of the author is
published on the cover of the books and in the texts themselves, as well as
having been made publicly known by the translator in the course of a talk
delivered at a recent conference on translation studies (Bush 1995).
2.6. Design of NON-TEC
2.6.1. The Dimensions of Comparability

In a translation-dependent comparable corpus the compilation of the non-
translational component begins when the creation of the translational corpus is
completed. This is because the dimensions of comparability — which also act
as criteria for the selection of suitable texts — are established largely on the
basis of the main characteristics of the structure and composition of the
translational component. The other two factors influencing the identification
of these dimensions are the availability of texts in electronic form and permis-
sion to use them for research purposes. In the case of biographical and literary
texts, which are extracted from the sample British National Corpus, these
considerations have meant the exclusion of the dimensions of text extent,
number of separate texts, average word count and proportion of USA and UK
publishers. In principle it would be desirable to include all these parameters,
but it has not been possible to do so in the present study.
Two sets of dimensions of comparability have been set up: a common one
for all three subcorpora and a specific one for newspapers. Both sets have been
identified on the basis of the structure and composition of TEC at the time of
writing. The two groups of dimensions are as follows:
Common to all three subcorpora:
• institutional text category
• time span
• distribution of female and male authors
• distribution of single and team authorship
• overall size of the subcorpus
• target-audience age, gender and level
Thanks to the availability of both translational and non-translational full texts
within each of the two newspapers selected for TEC and the possibility of
generally assessing the topic of the articles from the title, the following
additional dimensions have been established for each collection of the News-
paper subcorpus:
• newspaper
• newspaper section
• topic
• number of articles
• word count
• text extent
This means that a relatively higher level of comparability has been sought and
established in the Newspaper subcorpora, where the collections consist of the
same number of articles; the translational and non-translational texts come
from the same newspaper and newspaper section; they deal with similar topics
and they are both full texts of a similar size.
2.6.2. Selection, Acquisition and Markup of NON-TEC Texts

The following selection criteria are specific to Biography and Fiction and have
been based both on the classifications available in the BNC and on the
common dimensions of comparability put forward for the three NON-TEC
subcorpora.
Selection criteria for biographical and literary texts:
• institutional text category (biography and fiction)
• time span (1983-1993)
• sample size greater than 30,000 words (in an attempt to reach a
comparable overall size with the minimum discrepancy in the actual
number of texts in each corpus)
• similar proportion of female and male authors in the TEC and NON-
TEC subcorpora
• similar proportion of single and team authorship
• target audience age: adults
• target audience gender: mixed (the assumed readership of each text is
mixed gender)
• target audience level: mostly high (this is established in the BNC on
the basis of "a subjective assessment of the text's technicality or
difficulty", User Reference Guide for the British National Corpus,
1995: 14)
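The criteria listed above fall into two groups: those that can be checked for each candidate text (category, time span, sample size, audience age and gender) and those that only make sense for the subcorpus as a whole (the proportions of female and male authors and of single and team authorship, and the "mostly high" audience level). The following sketch, under that reading, filters candidates on the per-text criteria and reports the share of high-level texts separately. The record type, field names and value spellings are hypothetical and do not reproduce the BNC header format.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class CandidateText:
    """Hypothetical metadata record for one candidate text sample."""
    title: str
    category: str         # "biography" or "fiction"
    year: int             # year of publication
    word_count: int       # size of the sample in words
    audience_age: str     # e.g. "adult"
    audience_gender: str  # e.g. "mixed"
    audience_level: str   # "high", "medium" or "low"


def meets_per_text_criteria(t: CandidateText) -> bool:
    """Check the criteria that apply to an individual text."""
    return (
        t.category in {"biography", "fiction"}
        and 1983 <= t.year <= 1993
        and t.word_count > 30_000
        and t.audience_age == "adult"
        and t.audience_gender == "mixed"
    )


def share_high_level(texts: List[CandidateText]) -> float:
    """Proportion of texts with a high-level target audience."""
    if not texts:
        return 0.0
    return sum(t.audience_level == "high" for t in texts) / len(texts)
```

Keeping the audience level out of the per-text filter reflects the wording "mostly high"; a stricter reading would simply add it as a further condition.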
The selection criteria identified for the Newspaper subcorpus are based on
some of the general dimensions of comparability and also on the parameters
that have been specifically identified for each collection, as reported below.
Selection criteria for the Newspaper subcorpus:

• newspaper: The Guardian and The European
• newspaper section: Home News for The Guardian and Presswatch and
Sevenday for The European
• topic: political affairs
• number of articles: 102 for The Guardian and 64 each for Presswatch
and Sevenday for The European
• average word count: 638 for The Guardian and 150 for both collec-
tions of The European
• overall size: 65151 for The Guardian and 9640 for each collection in
The European
• text extent: full
• time span: 19.5-15.12.1994 for The Guardian and 27.5.1993-
30.11.1995 for each collection in The European13
• similar proportion of female and male authors14
• similar proportion of single and team authorship15
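Several of the criteria above are quantitative (number of articles, average word count, overall size), so the match between a translational collection and its non-translational counterpart can be expressed numerically. The sketch below is one possible way of doing this and is not the procedure used in the study: it summarises each collection from a list of per-article word counts and reports, for each measure, the gap between the two collections as a fraction of the larger value. The function names are invented for the example.

```python
from statistics import mean
from typing import Dict, List


def summarise(word_counts: List[int]) -> Dict[str, float]:
    """Number of articles, overall size and average word count."""
    return {
        "articles": len(word_counts),
        "words": sum(word_counts),
        "mean_words": mean(word_counts),
    }


def relative_gap(a: float, b: float) -> float:
    """Absolute difference as a fraction of the larger value (0.0 = identical)."""
    largest = max(a, b)
    return 0.0 if largest == 0 else abs(a - b) / largest


def compare(tec_counts: List[int], non_tec_counts: List[int]) -> Dict[str, float]:
    """Relative gap between two non-empty collections on each measure."""
    tec, non = summarise(tec_counts), summarise(non_tec_counts)
    return {measure: relative_gap(tec[measure], non[measure]) for measure in tec}
```

A gap of 0.0 on "articles" corresponds to the requirement that paired collections contain the same number of articles; what counts as an acceptable gap on the other measures remains a judgement for the corpus compiler.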
The procedure followed for the selection of comparable texts differs for each
subcorpus. Both the biographical and the literary texts have been extracted
from the British National Corpus. However, since the BNC classification does
not list biographies under a separate category, the search for suitable texts was
carried out manually. This involved examining all the individual titles listed in
the User Reference Guide and drawing up a list which was subsequently
checked against the Whitaker's Bookbank. From the final group, three titles
were selected on the basis of the selection criteria established for both biogra-
phy and fiction.
The selection of literary texts was carried out by means of an automatic
search of the "Imaginative" texts contained in the BNC. This involved the
application of systematic queries corresponding to the selection criteria pro-
posed for biography and fiction, and resulted in the identification of 15 text
samples which have a high-level target audience16 and a ratio of 4 female to 11
male authors.
The NON-TEC articles of The Guardian newspaper have been selected
and downloaded from The Guardian on CD-ROM, while articles of The
European have been partly examined manually and keyboarded from back
numbers and partly chosen and downloaded from Campus 2000. Consistently
with the markup carried out for TEC, the titles and subtitles have been
enclosed in angle brackets.
The synopses, on the other hand, have been included in the text, because,
unlike those that form part of TEC articles, there does not seem to be, on the
whole, a clear division between them and the main body of the article.
The conversion procedure from electronic text to NON-TEC text is the
same as the one adopted for TEC.
2.6.3. The File Structure of NON-TEC

This mirrors the file structure of TEC as shown in Diagram 2.
Diagram 2. File Structure of NON-TEC
2.6.4. The Extra-Textual Attributes of NON-TEC Texts

These consist of the Text Category, Collection, Text Title, Status of the text,
Text Extent, Mode, Word-count, the Author's name, gender and sexual orien-
tation (only for literary texts), the Date and Place of publication and the
Publisher. The data relating to most of these attributes were collected by
inspecting the relevant sections of the printed or electronic copies of the
texts.17 Since the methodology proposed to acquire information about the
author's sexual orientation is still experimental, I decided to limit its applica-
tion to TEC narrative works as a way of testing its feasibility. I have therefore
made no attempt to elicit these details for the NON-TEC literary texts at the
present stage.
The function of these attributes is descriptive and operational. By using
the query facility of the database, the researcher can extract information about
the corpus composition, check the extent to which it is comparable to its
translational counterpart and create ad hoc ECC subcorpora where these
extralinguistic features can be controlled for testing specific hypotheses.
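The two uses of the database mentioned here, inspecting corpus composition and extracting ad hoc subcorpora, can be illustrated with a small relational sketch. The database software used for the ECC is not specified in this section, so SQLite is used purely as a stand-in; the table layout, column names and rows are invented toy values. Following note 11, an unknown value is stored as "X".

```python
import sqlite3

# In-memory stand-in for the attribute database (hypothetical schema).
conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE texts (
           component     TEXT,     -- 'TEC' or 'NON-TEC'
           category      TEXT,     -- text category, e.g. 'Fiction'
           title         TEXT,
           author_gender TEXT,     -- 'Female', 'Male' or 'X' (unknown)
           word_count    INTEGER
       )"""
)
# Toy rows, not real corpus data.
conn.executemany(
    "INSERT INTO texts VALUES (?, ?, ?, ?, ?)",
    [
        ("TEC", "Fiction", "toy-tec-1", "Female", 40000),
        ("TEC", "Fiction", "toy-tec-2", "Male", 50000),
        ("NON-TEC", "Fiction", "toy-non-1", "Female", 45000),
        ("NON-TEC", "Fiction", "toy-non-2", "X", 35000),
    ],
)


def composition(db):
    """Number of texts and overall size per component and text category."""
    return db.execute(
        "SELECT component, category, COUNT(*), SUM(word_count) "
        "FROM texts GROUP BY component, category "
        "ORDER BY component, category"
    ).fetchall()


def subcorpus(db, category, author_gender):
    """Titles of an ad hoc subcorpus with category and author gender held constant."""
    rows = db.execute(
        "SELECT title FROM texts "
        "WHERE category = ? AND author_gender = ? ORDER BY title",
        (category, author_gender),
    )
    return [title for (title,) in rows]
```

Because unknown values are stored as "X", texts with missing information drop out of any subcorpus that controls for that attribute.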
3. Evaluation of Comparability in the Design of ECC
There are differences between Narrative works (i.e. Biography and Fiction) on
the one hand and Newspapers on the other, with regard to the dimensions and
consequent level of comparability. This is partly because of intrinsic differ-
ences between the text genres (for example it is arguably more difficult to
identify a unified topic in narrative works), and partly because, in the case of
NON-TEC narrative publications, I had to rely on texts that were already
available in machine-readable form. The combination of these two factors has
resulted in my proposing only a minimal set of dimensions for Biography and
Fiction which concern mainly global, extra-textual features. One dimension
involves a degree of subjective evaluation. This is the target-audience level
which is assessed for NON-TEC narrative texts by the designers of the BNC
on the basis of the perceived difficulty of the text, while for the TEC texts, it is
established by myself on the basis of my own reading of these publications.
Given these constraints, my attempt to seek an adequate level of similarity
between TEC and NON-TEC narrative texts has proved to be highly problem-
atic, particularly in the present initial stages of corpus design when the
methodology is still experimental.
The dimensions of comparability put forward for newspapers are, on the
other hand, greater in number and relatively less problematic to apply, given
the greater availability of translational and non-translational texts within the
same newspaper and the possibility of identifying the topic of each article
from the titles and subtitles with a reasonable degree of accuracy. The level of
comparability pursued with this text genre can therefore be considered reason-
ably adequate. There are however differences between the two collections
within the Newspaper subcorpus. The translational and non-translational
Guardian articles are on the whole more similar to each other than those
selected from The European, particularly with regard
to the average word-count, distribution of female and male authorship and
time span. Discrepancies in the case of The European are caused by restric-
tions on the availability of machine-readable texts at the time when the articles
were being selected and downloaded. In future studies, providing permission
is granted by the publishers, more texts could be extracted from existing on-
line services and a more adequate level of comparability could be sought for
The European collections.
The discrepancy in comparability between Biography and Fiction, on the
one hand, and Newspapers, on the other, could be partly reduced if one had
access to a very large, full-text corpus of general English, which included a
large portion of fiction and biography published not only in the UK but also in
the USA. This would ensure comparability on two additional dimensions: text
extent and place of publication. Moreover, the selection of suitable texts could
be refined if the corpus compiler read the original works initially earmarked
according to external categorisations, and then assessed their comparability on
the basis of general criteria, such as the level of difficulty of the language,
intended target audience, and style, in order to supplement, with her/his own
impressions, the classification provided by the team responsible for the cre-
ation of the parent corpus. The application of these criteria would, in my view,
ensure a more accurate evaluation of the individual texts and increase the
comparability of the translated and non-translated components of the corpus.
4. Use of the English Comparable Corpus
As a way of testing the viability of an ECC-based methodology for a study of
the linguistic nature of translation I carried out a series of comparative
analyses between the TEC and NON-TEC Newspaper subcorpora and the
TEC and NON-TEC narrative subcorpora. My results reveal four consistent
patterns of lexical use in translated English vis-à-vis original English, inde-
pendently of text category. These regular features are:
• lower proportion of lexical (or content) words versus grammatical
words
• greater use of the high frequency words of English
• greater repetition of the 108 most frequent words (or list heads) used in
each subcorpus
• use of a smaller number of lemmas in the list heads.
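Two of the measures behind these patterns can be sketched in a few lines. The code below is a simplified illustration and not the procedure used in the study, which relied on WordSmith Tools: it computes the proportion of lexical words among all running words, and the share of running words accounted for by the most frequent word forms (108 by default, following the size of the list heads mentioned above). The short list of grammatical words is an assumption made for the example, and the lemma-based measure is left out because it would require a lemmatiser.

```python
from collections import Counter
from typing import Iterable

# Assumption: a tiny illustrative set of grammatical (function) words.
GRAMMATICAL = {
    "the", "a", "an", "of", "and", "to", "in", "is", "was",
    "it", "that", "on", "for", "with",
}


def lexical_density(tokens: Iterable[str]) -> float:
    """Share of lexical (content) words among all running words."""
    words = [t.lower() for t in tokens]
    if not words:
        return 0.0
    lexical = sum(1 for w in words if w not in GRAMMATICAL)
    return lexical / len(words)


def list_head_coverage(tokens: Iterable[str], head_size: int = 108) -> float:
    """Share of running words covered by the head_size most frequent forms."""
    counts = Counter(t.lower() for t in tokens)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    head_total = sum(freq for _, freq in counts.most_common(head_size))
    return head_total / total
```

Computed separately for a translational and a non-translational subcorpus, a lower lexical density and a higher list-head coverage in the translational component would be in line with the first and third patterns.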
I have called these features "core patterns of lexical use" (Laviosa-Braithwaite
1996: 157) and have put forward the hypothesis that these may not be
restricted to two text genres, but may prove typical of translated text in
general.
It follows from the present analysis of the design of a monolingual multi-
source-language comparable corpus of English that the strength of any future
evidence which may or may not confirm the existing findings will depend to a
significant extent on the level of comparability that the researcher will have
established during the two crucial phases of corpus design.
Author's address:
Sara Laviosa • Department of Language Engineering • UMIST • PO Box 88 •
MANCHESTER M60 1QD • United Kingdom
Notes
1. As Atkins et al. point out (1992: 7), texts written to be spoken overlap with spoken
texts. The written-to-be-spoken mode could therefore be considered a spoken mode or
regarded as a separate class altogether.
2. According to UNESCO statistics (van Slype et al. 1983 quoted in Sager 1994: 297)
scientific, industrial and legislative translation represents 20%, 21% and 9% of the entire
volume of translation activities. The remaining amount is distributed as follows: commer-
cial (35%), press and current affairs (3.5%), audio-visual (3.5%), educational (1.5%),
literary (0.3%), miscellaneous (6.7%). In the UK, the largest share of translation produc-
tion — circa 90% — is technical and scientific (Francis Sutcliffe, ALPNET, personal
communication, 1996). However, I have assumed that the readership of these specialised
publications is rather narrow, compared with that of general translation.
3. This assumption is based largely on the general impression derived from participating, in
the last two years, in several national and international conferences on the subject of
translation, where the papers presented dealt mainly with literary and general language.
4. The terms subject domain, subject field, institutional text type, text category and text
genre are used interchangeably in this study. They all refer to groups of texts considered
similar by the corpus compiler or by general consensus on the basis of their
extralinguistic features. I have deliberately chosen to avoid the words 'genre' and 'text
type' on their own, because these are used by Biber and Finegan (1986; 1991) to refer to
two different notions. According to these scholars, 'genres' are "the text categories
readily distinguished by speakers of English (e.g. novels, newspaper articles, public
speeches)" (Biber and Finegan 1991: 213). The notion of 'genre' is therefore used to
characterise texts on the basis of external criteria (Biber and Finegan 1986: 20). 'Text
types', on the other hand, are defined in terms of the linguistic characteristics of the texts
themselves. They represent sets of texts "that are similar with respect to their linguistic
form, irrespective of genre categories" (Biber and Finegan 1986: 20). Their identification
depends therefore on the analysis of the predominant linguistic features of the texts,
which, in the case of Biber's studies, is carried out through Factor Analysis. Nakamura
(1989; 1991; 1994) makes the same distinction between 'genre' and 'text type', and uses
a statistical method called "Extended HAYASHI's Quantification Method Type III" to
describe text types in large corpora (Nakamura 1994: 141).
5. Examples of published reports are those commissioned by The European Commission
(Draft Guidelines as to the Form and Content of Schemes) and those published by the
Welsh Language Board (Discussion Document: A Strategy for the Re-Introduction of
Welsh Second Language at Key Stage 4). Examples of reports which are unpublished, but
are available to the public upon request, are the reports produced by UNIDO — Regula-
tions on drugs, Reports on Women's conditions.
6. Examples of speeches are: From Poem to Novel, From Novel to Poem by José Saramago,
translated by Giovanni Pontiero, and Introduction: Elytis in His Own Words by the Nobel
Prizewinner Greek Poet Elytis, translated by David Connolly.
7. For Customs and Folklore it was not necessary to restrict the search to publications dated
from January 1995 onwards because the initial total number of translations in print was
reduced to a manageable amount by reducing the time span from January 1992 and the
price to less than £20.00. For General Knowledge the initial total number was reduced by
just applying the search that identified all those publications with a price of less than
£20.00.
8. The layout of this list has been adapted from the taxonomy of text attributes proposed by
Atkins et al. (1992: 5-6).
9. The value 'excerpt' covers any form of incomplete text.
10. This type of information is inferred, rather than collected directly, because of the general
reluctance on the part of translators, translation agencies and publishers to reveal whether
their translations have been produced out of the mother tongue. This in turn originates
from their concern about possible negative evaluations of their work.
11. Unknown data is recorded with an "X" in the corresponding database field.
12. I am particularly grateful to Carol Maier, Peter Bush and Luise von Flotow for their
insightful comments on this issue.
13. The time span for Sevenday is much narrower (5.10-23.11.1995) than intended. This is
due to the unavailability of texts in electronic form.
14. For one of The European collections (Sevenday) this criterion could not be applied
since the name of the author is not revealed.
15. For the same reason explained in note 14, this criterion could not be applied in the case of
the Sevenday collection for The European.
16. Three levels of target audience are identified in the BNC: high, medium and low. They
are established on the basis of an assessment of a text's technicality, which in turn
depends on the perceived difficulty of the text.
17. Word-count is calculated using WordSmith Tools (Scott 1996). The status of the newspa-
per articles has been established via direct questioning of the editors of the weekly
supplement Guardian Europe and Presswatch respectively.
References
Aijmer, Karin, Bengt Altenberg and Mats Johansson, eds. 1996. Languages in Contrast:
Papers from the Lund Symposium on Text-Based Cross-Linguistic Studies, 4-5 March
1994. Lund: Lund University Press.
Atkins, Sue, Jeremy Clear and Nicholas Ostler. 1992. "Corpus Design Criteria". Literary
and Linguistic Computing 7:1. 1-16.
Baker, Mona. 1995. "Corpora in Translation Studies: An Overview and Some Suggestions
for Future Research". Target 7:2. 223-243.
Biber, Douglas and Edward Finegan. 1986. "An Initial Typology of English Text Types".
Jan Aarts and Willem Meijs, eds. Corpus Linguistics II: New Studies in the Analysis and
Exploitation of Computer Corpora. Amsterdam: Rodopi, 1986. 19-46.
Biber, Douglas and Edward Finegan. 1991. "On the Exploitation of Computerized Corpora
in Variation Studies". Karin Aijmer and Bengt Altenberg, eds. English Corpus Linguis-
tics. London and New York: Longman, 1991. 204-220.
British National Corpus (BNC). 1995. User Reference Guide for the British National
Corpus: Version 1.0. Oxford: Oxford University Computing Services.
Bush, Peter. 1995. "Translator Activism and Translation Theory: A Cuban Case Study".
Talk given at the Conference on the Linguistic Foundations of Translation. University
of Liverpool, 15-17 September 1995.
Gellerstam, Martin. 1996. "Translations as a Source for Cross-Linguistic Studies". Aijmer
et al. 1996: 53-63.
Granger, Sylviane. 1996. "From CA to CIA and Back: An Integrated Approach to Comput-
erized Bilingual and Learner Corpora". Aijmer et al. 1996: 37-52.
Hartmann, R.R.K. 1994. "The Use of Parallel Text Corpora in the Generation of Translation
Equivalents for Bilingual Lexicography". Paper presented at the EURALEX Congress,
Amsterdam 1994.
Johansson, Stig and Knut Hofland. 1994. "Towards an English-Norwegian Parallel Cor-
pus". Udo Fries, Gunnel Tottie and Peter Schneider, eds. Creating and Using English
Language Corpora: Papers from the Fourteenth International Conference on English
Language Research and Computerized Corpora, Zurich 1993. Amsterdam and Atlanta:
Rodopi, 1994. 25-37.
Laviosa, Sara, forthcoming. "Core Patterns of Lexical Use in a Comparable Corpus of
English Narrative Prose". Sara Laviosa, ed. The Corpus-Based Approach: A New
Paradigm in Translation Studies.
Laviosa-Braithwaite, Sara. 1996. The English Comparable Corpus (ECC): A Resource and
a Methodology for the Empirical Study of Translation. Manchester: UMIST. [PhD
Thesis.]
Laviosa-Braithwaite, Sara, forthcoming a. "Analysing Comparable Translational and Non-
Translational Texts with Tools of Corpus Linguistics". The Proceedings of the Second
International Conference on Current Trends in Studies of Translation and Interpreting,
Budapest, 5-7 September, 1996.
Laviosa-Braithwaite, Sara, forthcoming b. "The English Comparable Corpus: A Resource
and a Methodology". The Proceedings of the Conference on Translation Studies: "Unity
in Diversity?" Dublin City University, 9-11 May, 1996.
Nakamura, Junsaku. 1989. "A Quantitative Study on the Use of Personal Pronouns in the
Brown Corpus". Jacet Bulletin 20. 51-71.
Nakamura, Junsaku. 1991. "The Relationships among Genres in the LOB Corpus Based
upon the Distribution of Grammatical Tags". Jacet Bulletin 22. 55-74.
Nakamura, Junsaku. 1994. "Extended HAYASHI's Quantification Method Type III and Its
Applications in Corpus Linguistics". Journal of Language and Literature 1. 141-192.
Peters, Carol and Eugenio Picchi. 1996. "Bilingual Reference Corpora for Translators and
Translation Studies". Paper presented at the Conference on Translation Studies: "Unity
in Diversity?" Dublin City University, 9-11 May, 1996.
Sager, Juan C. 1994. Language Engineering and Translation: Consequences of Automa-
tion. Amsterdam and Philadelphia: John Benjamins.
Scott, Michael. 1996. WordSmith Tools. Oxford: Oxford University Press.
Sinclair, John. 1991a. Corpus Concordance Collocation. Oxford: Oxford University Press.
Sinclair, John. 1991b. Council of Europe Multilingual Lexicography Project. Report
submitted to the Council of Europe under contract no. 57/89.
Sinclair, John. 1992. Lexicographers' Needs. Pisa Workshop on Text Corpora, January
1992.
Teubert, Wolfgang. 1994. "Parallel Corpora and Multilingual Lexicography". Unpublished
manuscript provided by the author.
Toury, Gideon. 1995. Descriptive Translation Studies and beyond. Amsterdam and Phila-
delphia: John Benjamins.
van Slype, G. et al. 1983. Better Translation for Better Communication. Oxford: Oxford
University Press.


2.5.5. From a Printed or Electronic Text to a TEC Text

2.5.6. The File Structure of TEC

Diagram 1. File Structure of TEC

2.5.7. The Extra-Textual Attributes of TEC Texts

of "translation policy" (Toury 1995: 58) prevailing at a given time in a given

Values: Full-time, Part-time

birth and currently, it is inferred that the translator is a

2.6. Design of NON-TEC

2.6.1. The Dimensions of Comparability

2.6.2. Selection, Acquisition and Markup of NON-TEC Texts

Selection criteria for the Newspaper subcorpus:

2.6.3. The File Structure of NON-TEC

Diagram 2. File Structure of NON-TEC

2.6.4. The Extra-Textual Attributes of NON-TEC Texts

The function of these attributes is descriptive and operational. By using

3. Evaluation of Comparability in the Design of ECC

4. Use of the English Comparable Corpus

As a way of testing the viability of an ECC-based methodology for a study of

You might also like