Professional Documents
Culture Documents
Jun Pan
Hong Kong Baptist University / 224 Waterloo Road, Kowloon Tong, Hong Kong SAR, China
janicepan@hkbu.edu.hk
82
Temnikova, I, Orăsan, C., Corpas Pastor, G., and Mitkov, R. (eds.) (2019) Proceedings of the 2nd Workshop on
Human-Informed Translation and Interpreting Technology (HiT-IT 2019), Varna, Bulgaria, September 5 - 6, pages 82–88.
https://doi.org/10.26615/issn.2683-0078.2019_010
political settings. In particular, the EPIC Language subsets Word tokens Types
Chinese 2,578,911 83,312
(European Parliament Interpretation Corpus; Cantonese 1,072,368 61,837
http://catalog.elra.info/en-us/ Putonghua 1,506,541 30,320
repository/browse/ELRA-S0323/) also English 3,815,083 32,748
Total 6,393,994 116,060
included annotation of paralinguistic features,
which are of interest to interpreting researchers. Table 1: The composition of the CEPIC by lan-
The EPTIC and the CEPIC are very similar to guage.
a great extent since both included simultaneous
interpreting of parliamentary speeches, yet the
transcription and annotation, especially for inter-
CEPIC also collected data of consecutive in-
preting with the language combination of Chinese
terpreting, which is often employed at bilateral
(Cantonese and Putonghua) and English.
meetings or questions and answers at press confer-
ences in political settings. In addition, the EPTIC 2 About the CEPIC1
only included languages translated and interpreted
at the European Parliament, while a comparison 2.1 General Information
with those translated and interpreted in other The CEPIC is currently over 6 million word to-
regions/continents would provide interesting kens in size. It consists of transcripts of speeches
perspectives on political translation/interpreting at delivered by top political figures (e.g. govern-
large. ment leaders) from Hong Kong, Beijing, Wash-
In this regard, the WAW corpus (http:// ington DC and London, as well as their trans-
alt.qcri.org/resources/wawcorpus/ lated/interpreted texts2 . The speeches were deliv-
provides a very interesting perspective by cover- ered by native speakers (otherwise coded as code-
ing conference interpreting between English and mixing) and interpreted into the B language of the
Arabic in Qatar. However, the data were collected interpreters (usually government interpreters), a
from international conferences rather than from phenomenon common in political setting at which
political settings. the Chinese and English languages are concerned
Many other corpora that involve the Chinese (Pan and Wong, forthcoming). Both directions of
and English language interpreting in similar set- Chinese-English and English-Chinese interpreting
tings, including the CEIPPC (Chinese-English were covered. Table 1 shows some basic statistics
Interpreting for Premier Press Conferences, see of the CEPIC.
Wang (2012); also introduced by Setton (2011) The main speech types of CEPIC include the
and Bendazzoli (2018)) and the CECIC (Chinese- reading of government reports such as policy ad-
English Conference Interpreting Corpus, see Hu dresses and budget speeches, questions and an-
(2013); also introduced by Setton (2011) and Ben- swers at press conferences, parliamentary debates,
dazzoli (2018)), are unfortunately not open to as well as remarks delivered at bilateral meet-
public access. In addition, although Cantonese ings, most of which were done and collected on a
to Putonghua and English simultaneous interpret- yearly basis, except for remarks at bilateral meet-
ing has been performed at the Legislative Council ings when it depends on if such meetings were
(LegCo) of Hong Kong SAR for over two decades, held in a specific year. Some of the speeches were
there has seen no existing publicly available cor- interpreted in a consecutive mode, and some in si-
pus designed specifically for the study of interpret- multaneous, which were coded in the metadata.
ing of such speeches, especially one that included In particular, speeches in the Hong Kong sub-
paralinguistic features such as the EPIC, although set were mainly interpreted from Cantonese into
part of the official transcripts are archived regu- Putonghua and English, and those in the Beijing
larly online (on government or LegCo websites). subset from Putonghua to English. The other two
1
The CEPIC, therefore, aims to provide an open Some of the information in this section is also accessible
via the CEPIC website (Pan, 2019).
access corpus covering Chinese and English lan- 2
Speeches collected in the corpus, in particular those pro-
guage political interpreting, also in the hope of vided on the official government websites, are considered
offering a possible solution to future collection translations instead of interpreting, as they are translated be-
fore interpreting or revised based on the interpreted version,
of interpreting corpora by providing templates of which, with spoken language features (e.g. spoken words and
metadata collection and solutions to spoken data particles) deleted, read more like written language.
83
subsets, i.e. Washington DC and London, mainly omitted in official transcripts provided on govern-
included English speeches delivered in similar set- ment websites). Text and audio/video links were
tings (which can be regarded as monolingual ref- also included at the end of each text for those who
erence subsets to the interpreted English speeches) may be interested in the sources of the speeches
and whenever applicable, their interpreted ver- (Figure 1).
sions in Chinese (usually only at bilateral meetings
or joint press conferences).
2.3 Speech Transcription & Annotation • English Annotated: [er] So [that] that is the
big difference [er] in our approach and the
Data of CEPIC were collected in two ways:
approach [er] that [er] I think [er] might have
• Speech transcripts and their translations col- been debated about. (Press Conference of US
lected from government websites (Raw); Budget Speech, 1997-02-06)
• A revised or newly transcribed version As can be seen from the above examples, the
(when there are no readily available tran- annotated version features annotations of different
scripts) of these speeches and their inter- prosodic and paralinguistic features (e.g. fillers,
preted texts based on audios/videos col- repetitions and self-repair, etc.) that are of con-
lected from government websites and TV cern to the study of spoken language as well as
programme archives (Annotated). In partic- interpreting.
ular, the annotated version of the CEPIC was
3 Main Functions of the CEPIC3
transcribed and annotated in a way that re-
flects features of spoken language data. The CEPIC features a user-friendly interface with
three main functions.4
Texts of the CEPIC were manually revised
or transcribed based on audios/videos with the 3.1 Keyword Search
speeches and their interpreting, if any. Whenever Users can input a keyword in English or (Sim-
possible, existing official transcripts provided on plified/Traditional) Chinese in the corpus. The
government websites and transcripts generated by corpus has a lexical associative function. There-
voice recognition software were used as basis for fore, when characters/letters are keyed in the
transcription to help speed up the process. The search box, the associative results will au-
transcription of CEPIC follows a standardised pro- tomatically display beneath the search box.
cess and aims to represent the spoken text as close
3
as it was delivered. In addition, all Cantonese texts A full user manual including graphics of examples can
be accessed from the CEPIC website (Pan, 2019).
were transcribed in a way to capture spoken Can- 4
Examples and data listed in this paper were generated
tonese features (including particles that are usually using the CEPIC online search engine (Pan, 2019).
84
Parameters Value Parameters Value
{Keyword} Interesting {Keyword} Interesting
{Speaker Role} Member of Parliament (UK) {Location} Hong Kong
{Time} 1997 to 2017 {Time} 1997 to 2017
{Subset} Annotated {Subset} Raw
Table 2: Parameters used for a sample Keyword Table 3: Parameters used for a sample Expanded
Search. Keyword in Context.
A prosodic/paralinguistic feature can also be If users click on one of the collocates, the con-
searched when choosing the annotated version of cordance lines that included both the search term
the corpus. and the collocate will appear under Keyword in
Apart from choosing either the raw or anno- Context.
tated subset of the corpus for searching, users
can adjust parameters including Part of Speech,
Location, Speaker Name, Speaker Role, Speaker
Gender, Speaker Language, Delivery Mode, Inter-
preter Gender, Interpreter Language, Interpreting
Mode, and Time Span, to refine a search.
The search results can be arranged by Year, Lo-
cation, or Speaker Name, and downloaded in excel
format.
For instance, if the parameters listed in Table 2
are selected, a total of 8 instances can be found in
the CEPIC (Figure 2).
Figure 3: Word Collocation of ”interesting”
85
Figure 4: Results of a Keyword Search of ”inter-
esting” (2)
86
automatic POS enhancement (especially for Can- Kaibao Hu and Qing Tao. 2013. The chinese-english
tonese). In particular, its data can be used to train conference interpreting corpus: Uses and limita-
tions. Meta 58(3):626–642.
machine translation systems (for political texts) or
automatic speech recognition and speech-to-text Christopher D. Manning, Mihai Surdeanu, John
transcription systems (of English, Cantonese and Bauer, Jenny Finkel, Steven J. Bethard, and David
Putonghua). McClosky. 2014. The stanford corenlp natu-
ral language processing toolkit. In Proceed-
ings of the 52nd Annual Meeting of the As-
Acknowledgements sociation for Computational Linguistics: System
Demonstrations.. Baltimore, Maryland: Associ-
The CEPIC is developed with the funding and sup- ation for Computational Linguistics, pages 55–
port of the Early Career Scheme (ECS) of Hong 60. https://aclweb.org/anthology/papers/P/P14/P14-
Kong SAR’s Research Grants Council (Project 5010/.
No.: 22608716), and the Digital Scholarship
Jun Pan. 2007. Two styles of interpretation: Reflection
Grant and the Faculty Research Grant of the Hong
on the influence of oriental and western thought pat-
Kong Baptist University (Project No.: FRG2/17- terns on the relationship between the speaker and the
18/046). interpreter. Foreign Language and Culture Studies,
I would like to thank my colleagues Dr. Billy 6:677–688.
Tak Ming WONG and Ms. Rebekah WONG
Jun Pan. 2019. The Chinese/English Political In-
for their support and advice, and all the re- terpreting Corpus (CEPIC). Hong Kong Baptist
search assistants, student helpers and library col- University Library [Retrieved on 19 June 2019].
leagues who contributed to the project. Please re- https://digital.lib.hkbu.edu.hk/cepic.
fer to https://digital.lib.hkbu.edu.
Jun Pan, Fernando Gabarron Barrios, and Haoshen He.
hk/cepic/about.php#project for a list
forthcoming. Part-of-speech (pos) tagging enhance-
of the team members. ment for the chinese/english political interpreting
corpus (cepic). In Translation Studies in East
Asia: Tradition, Translation and Transcendence.
References http://www.cbs.polyu.edu.hk/2019east/index.php.
Claudio Bendazzoli. 2018. Corpus-based interpreting Jun Pan and Billy T.M. Wong. forthcoming. Prag-
studies: Past, present and future developments of a matic competence in chineseenglish retour inter-
(wired) cottage industry. In Making Way in Corpus- preting of political speeches: A corpus-driven ex-
based Interpreting Studies (New Frontiers in Trans- ploratory study of pragmatic markers. Intralinea
lation Studies Series). Singapore: Springer, pages 1– https://www.intralinea.org.
19.
Mariachiara Russo, Claudio Bendazzoli, and Bart De-
Claudio Bendazzoli and Annalisa Sandrelli. 2009. francq (Eds.). 2018. Making Way in Corpus-based
Corpus-based interpreting studies: Early work Interpreting Studies. Singapore: Springer.
and future prospects. Revista Tradumtica:
L’aplicaci dels corpus lingstics a la traducci (7). Beatrice Santorini. 1990. Part-of-Speech Tag-
https://revistes.uab.cat/tradumatica. ging Guidelines for the Penn Treebank Project
(3rd revision, 2nd printing). Department
Silvia Bernardini, Adriano Ferraresi, Mariachiara of Linguistics, University of Pennsylvania.
Russo, Camille Collard, and Bart Defrancq. 2018. https://catalog.ldc.upenn.edu/docs/LDC99T42/tagg
Building interpreting and intermodal corpora: A uid1.pdf.
how-to for a formidable task. In Making Way in
Corpus-based Interpreting Studies (New Frontiers Robin Setton. 2011. Corpus-based interpreting stud-
in Translation Studies Series). Singapore: Springer, ies (cis): Overview and prospects. In Corpus-
pages 21–42. based Translation Studies. Research and Applica-
tions. London: Continuum, pages 33–75.
Maria Rosaria Buri. 2015. Interpreting in diplomatic
settings. https://aiic.net/page/7349/interpreting-in- Miriam Shlesinger. 1998. Corpus-based interpreting
diplomatic-settings/lang/. studies as an offshoot of corpus-based translation
studies. Meta: Journal des traducteurs 43(4):486–
Daniel Gile. 1991. A communication-oriented analysis 493. https://doi.org/10.7202/004136ar.
of quality in nonliterary translation and interpreta-
tion. In Translation: Theory and Practice, Tension Francesco Straniero Sergio and Caterina Falbo. 2012.
and Interdependence. Amsterdam: John Benjamins Breaking Ground in Corpus-based Interpreting
Publishing Company, pages 188–200. Studies. Bern: Peter Lang.
87
Binhua Wang. 2012. A descriptive study of norms in
interpreting : based on the chinese-english consec-
utive interpreting corpus of chinese premier press
conferences. Meta 57(1):198–212.
Fei Xia. 2000. The part-of-speech tagging
guidelines for the penn chinese treebank
(3.0). RCS Technical Reports Series (38).
http://repository.upenn.edu/ircsr eports/38.
A Supplemental Material
Examples and data listed in this paper are gener-
ated using the CEPIC online search engine, which
should be cited as:
88