Corpus Linguistics

Developing a PolyU Language Bank
Sherman Lee, egslee@inet.polyu.edu.hk
PI: Grahame Bilbow
Thanks to: Chris Greaves, Raymond Cheung, Li Lan

Outline

Background
  • Goals of corpus linguistics
  • Types of corpora
  • Applications of corpus analysis

As an illustration
  • Exploring units of meaning
  • Case study

Developing a PolyU Language Bank
  • Aims and objectives of project
  • Similar existing projects
  • Procedures

The PolyU Language Bank
  • Current status
  • Sample corpora
  • Sample search

Goals of corpus linguistics

Chomskyan linguistics
  • ‘Langue’ (competence)
  • Ideal speaker/hearer
  • Language = innate mental faculty
  • Intuitive evidence
  • Universals
  • Grammar

Corpus linguistics
  • ‘Parole’ (performance)
  • Complexity/variation
  • Language = social phenomenon
  • Empirical evidence
  • Differences
  • Meaning

Basic tools

  • Corpus: a systematic collection of speech or writing that is built according to explicit design criteria for a specific purpose
    • cf. EAGLES’ broad definition: “A corpus can potentially contain any text type, incl. word lists, dictionaries, etc.”
  • Concordancer: search engine (e.g. WordSmith; SARA)
  • Concordance: occurrences of search item, displayed in list with immediate context shown
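To make the idea of a concordance concrete, here is a minimal keyword-in-context (KWIC) sketch. It is not how WordSmith or SARA work internally; the function name `concordance`, the `width` parameter and the sample sentences are assumptions made up for illustration.

```python
import re

def concordance(text, keyword, width=30):
    """Return one keyword-in-context line per match of keyword in text."""
    # Whole-word, case-insensitive match of the search item.
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    lines = []
    for m in pattern.finditer(text):
        left = text[max(0, m.start() - width):m.start()]
        right = text[m.end():m.end() + width]
        # Right-align the left context so the search items line up in a column.
        lines.append(f"{left:>{width}} [{m.group(0)}] {right:<{width}}")
    return lines

# Hypothetical sample text, not taken from any real corpus.
sample = ("The troops came under friendly fire at dawn. "
          "Reports of friendly fire were later confirmed.")

for line in concordance(sample, "friendly fire"):
    print(line)
```

Each output line shows the search item in brackets with up to `width` characters of context on either side, which is the list layout described above.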

Types of corpora
  • Written vs Spoken
  • General vs Specialised
    • e.g. ESP, Learner corpora
  • Monolingual vs Multilingual
    • e.g. Parallel, Comparable
  • Synchronic vs Diachronic
    • e.g. Monitor
  • Annotated vs Unannotated

Written corpora

Specialised corpora

Other examples of available corpora
  • First generation: Brown Corpus (1960s)

Some applications of corpus analysis
  • Language teaching & learning
    • Empirical teaching data – authentic examples of language use
    • Reference source – answering learners’ questions or explaining learner errors:
      • “What’s the difference between ‘at last’ and ‘in the end’?”
      • “How is ‘hardly’ used?”
    • Preparation of teaching materials – e.g. vocabulary lists, CLOZE tests
    • CALL – e.g. concordancing and data-driven learning
  • Translation
    • Using parallel texts to find suitable translation equivalents
    • Creation of translation databases or glossaries for domain-specific terminology, e.g. business, law, science
  • Linguistics and language research
    • Exploring units of meaning in texts
    • Lexicography & lexical studies – e.g. relative word frequency
    • Language variation – e.g. linguistic features across registers
    • Grammar – corpora used as data to test hypotheses, e.g. syntactic theory
    • Pragmatics & discourse – e.g. CA of discourse features in spoken (conversational) data
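Relative word frequency, mentioned under lexical studies, can be sketched in a few lines. The tokeniser here is a deliberately crude assumption (lower-cased runs of letters and apostrophes), and `relative_frequencies` and the sample sentence are illustrative names and data, not part of any corpus tool.

```python
import re
from collections import Counter

def relative_frequencies(text, per=1_000_000):
    """Map each word to its frequency per `per` running words."""
    # Crude tokenisation: an assumption for this sketch only.
    tokens = re.findall(r"[a-z']+", text.lower())
    total = len(tokens)
    if total == 0:
        return {}
    counts = Counter(tokens)
    return {word: count * per / total for word, count in counts.items()}

# Hypothetical sample sentence.
sample = "at last the bus came and in the end we got home at last"
freqs = relative_frequencies(sample)
for word in ("last", "end"):
    print(word, round(freqs[word]))
```

Normalising to a fixed base (here one million words) is what allows counts from corpora of different sizes to be compared.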

Exploring meaning, units of meaning
  • Focus on meaning because:
    • People interested in the meanings of texts, in how language is actually used in discourse
    • Meaning is a key problem for translation, language learning, information management…
  • What are basic units of meaning?
    • Language teaching (TEFL): vocabulary often introduced in the form of new single words
    • Words considered to be basic units of meaning
  • Is the word an ideal unit of meaning?
    • “… If you dog a dog during the dog days of summer, you’ll be a dog tired dog catcher…”
    • “… Can I sit down? My dogs are barking…”
    • Most lexical errors made by language learners result from failure to deal with ambiguities of single words

‘Unambiguous Units of Meaning’
  • Notion of an ‘Unambiguous Unit of Meaning’ necessary for understanding meaning
  • UUoM = keyword and all words in the context that contribute to making the word unambiguous
  • Compounds, multi-word units, collocations, idioms, set phrases
  • Often determined by a syntactic pattern
    • Adj + N: friendly fire, closing remarks
    • V + N: invite proposals, draw conclusions
    • Adv + A: politically correct, environmentally friendly
    • N + of + N: cause of death, code of practice, proof of identity, duty of care
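The four syntactic patterns above can be searched for mechanically once a text carries part-of-speech tags. The sketch below assumes a hand-tagged token list with a simplified, hypothetical tagset (ADJ, N, V, ADV, PREP, DET); `PATTERNS`, `matches` and `find_candidates` are illustrative names. A pattern hit is only a candidate unit of meaning, not proof of one.

```python
# Upper-case items are tags; lower-case items are literal words.
PATTERNS = {
    "Adj + N": ["ADJ", "N"],
    "V + N": ["V", "N"],
    "Adv + A": ["ADV", "ADJ"],
    "N + of + N": ["N", "of", "N"],
}

def matches(item, token):
    """Check one pattern item against one (word, tag) token."""
    word, tag = token
    return word.lower() == item if item.islower() else tag == item

def find_candidates(tagged):
    """Return (pattern name, phrase) for every window that fits a pattern."""
    found = []
    for name, pattern in PATTERNS.items():
        n = len(pattern)
        for i in range(len(tagged) - n + 1):
            window = tagged[i:i + n]
            if all(matches(p, t) for p, t in zip(pattern, window)):
                found.append((name, " ".join(w for w, _ in window)))
    return found

# Hypothetical hand-tagged sentence.
tagged = [("The", "DET"), ("cause", "N"), ("of", "PREP"), ("death", "N"),
          ("was", "V"), ("friendly", "ADJ"), ("fire", "N")]

for name, phrase in find_candidates(tagged):
    print(f"{name}: {phrase}")
```

Candidates found this way would still need checking against concordance lines, since many Adj + N or V + N sequences are free combinations rather than fixed units.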

Case study
  • Search for units of meaning in online dictionaries and corpora
    • friendly fire
    • environmentally friendly
  • Corpora from 1990s
    • British National Corpus (BNC)
      • 100,000,000+ words
      • Written (90%): extracts from regional/national newspapers, specialist periodicals, academic books, popular fiction, un/published letters, memos, school/university essays
      • Spoken (10%): informal conversation, formal meetings (business, government), radio shows, phone-ins
    • The Times (1995, Jan – March)
      • 10,220,367 words
      • Written: business, home news, readers’ letters, reviews
  • Corpora from 1960s – 1970s
    • Brown corpus / LOB corpus
      • Each 1 million words
      • Written, balanced corpora of 15 genres of text
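A search of this kind amounts to counting a multi-word item in each corpus separately. The sketch below uses two invented toy texts as stand-ins; they are not excerpts from the BNC, The Times, Brown or LOB, and the counts say nothing about those corpora. `phrase_hits` and the corpus labels are assumed names.

```python
import re

def phrase_hits(corpora, phrase):
    """Count case-insensitive occurrences of a phrase in each named corpus."""
    # Allow any whitespace (including line breaks) between the words.
    words = r"\s+".join(re.escape(w) for w in phrase.split())
    pattern = re.compile(r"\b" + words + r"\b", re.IGNORECASE)
    return {name: len(pattern.findall(text)) for name, text in corpora.items()}

# Invented toy texts standing in for an older and a newer corpus.
corpora = {
    "toy_older": "The fire was friendly and warm. A friendly dog sat by the fire.",
    "toy_newer": "Two soldiers were killed by friendly fire. Friendly\nfire is rare.",
}

print(phrase_hits(corpora, "friendly fire"))
```

Matching the words as an adjacent sequence, rather than counting each word on its own, is what distinguishes a search for a unit of meaning from a single-word search.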

Search results

What the results show
  • ‘friendly fire’, ‘environmentally friendly’
    • Represent fairly new concepts
    • Occur in the newer corpora (1990s) as units of meaning
    • Occur as entries in some of the online dictionaries only (not bilingual dictionaries)
  • New terminology and terms of common usage not always recorded in dictionaries and termbanks
  • One way of using corpora for learning and translation:
    • Use corpus evidence to help students recognise units of meaning, introduce notion of units of meaning into language learning

Aims of PULB project
  • To design and build an archive of language corpora = ‘language bank’
    • To be used by staff and students in the department
    • For teaching, language learning and research purposes
  • To provide a user-friendly platform
    • A WWW interface via which users can freely access the language bank
    • With browse, search and concordance facilities

Ingredients of PULB
  • Sources: standard corpora, departmental collections
  • Medium: written texts, transcribed spoken data
  • Language types: native speaker, learner corpora
  • Languages: English, Chinese, Japanese, French, German
  • Genres: business, law, media, academia, social, literature
  • Target Size: 30 million words (European) / characters (Asian)

Why a language bank? “What’s in it for us”
  • Free and simple shared access to a collection of language corpora
    • That you can utilise for your teaching
      • Authentic examples of language use at your fingertips
      • Empirical teaching data covering different specialisms (ESP, EAP)
    • That you can utilise for your research
      • A ready-made collection of data waiting for you to work on
      • Saving on time and resources
  • Way of incorporating new methods and information technology into the department’s teaching and research activities
    • Increase students’ awareness of this rapidly developing methodology / branch of language studies (corpus linguistics, corpora studies)
    • Way of integrating theory with technology in the classroom
    • Train students to be more computer-literate
    • All of the above can
      • Motivate students to become active learners
      • Help students to more effectively learn the target language (cf. goals of DDL)

Similar existing projects
  • W3 Corpora Project (Essex)
    • http://clwww.essex.ac.uk/w3c/
    • Access to corpora (Gutenberg texts, LOB, LOB-tagged)
    • Web interface for performing searches
    • Online tutorial and info on corpus linguistics
  • Web Concordancer (VLC, PolyU)
    • http://vlc.polyu.edu.hk/concordance/
    • Access to variety of corpora and texts (bilingual/parallel corpora, news, Bible, works of fiction)
    • Web interface for performing searches

Directions for PULB
  • Build a language bank with features that parallel those of similar sites
    • ~ VLC
    • ~ Essex
      • Bring together corpora and texts of various types and genres, of different languages
      • Make available different facilities for different categories of users (cf. legal considerations)
      • Provide on-site tutorial, corpora-based info
  • Include extra features
    • Allow searches in multiple texts / corpora simultaneously
    • Some form of parallel concordancing
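One simple form of parallel concordancing is to search the source side of sentence-aligned text pairs and display each hit with its aligned translation. The sketch below assumes the alignment already exists; `parallel_concordance` and the English-French pairs are invented for illustration and do not describe how PULB implements the feature.

```python
def parallel_concordance(aligned_pairs, keyword):
    """Return (source, target) pairs whose source sentence contains keyword."""
    key = keyword.lower()
    # Plain substring match on the source side; a fuller tool would
    # tokenise and might also search the target side.
    return [(src, tgt) for src, tgt in aligned_pairs if key in src.lower()]

# Hypothetical sentence-aligned English-French pairs.
aligned = [
    ("Please provide proof of identity.", "Veuillez fournir une pièce d'identité."),
    ("The office is closed today.", "Le bureau est fermé aujourd'hui."),
]

for src, tgt in parallel_concordance(aligned, "proof of identity"):
    print(src)
    print("   ", tgt)
```

Seeing the aligned sentence next to each hit is what lets a translator or learner read off a candidate translation equivalent for the search item.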

Target composition of PULB
[Diagram: PolyU Language Bank]
  • English
    • General corpora: BNC, ICE, BROWN
    • Learner corpora
    • Specialised corpora
    • Spoken corpora
    • Components named in the diagram: Business English (PUBC), Legal English, Academic English, English Literature, HK spoken corpus, Conference speeches, Academic presentations, Workplace English
    • Further labels in the diagram: Business writing, Teaching reflections, Social interactions, Student work
  • Chinese: Business Chinese, Legal Chinese
  • French
  • German
  • Japanese: Business Japanese, Japanese Literature

Procedures (i)
  • Collate, sort, categorise data from various sources
    • Commercially available data
    • Departmental collections, incl.
      • PolyU Business Corpus (Li and Bilbow)
      • Bilingual corpora (Xu)
      • ESP / EAP corpora (Forey)
      • Learner corpora (Sengupta) …

Procedures (ii)
  • For the departmental collections:
  • Decide how to present each collection
    • E.g. Sub-categories, macro categories
  • Clean up texts
    • E.g. Structural features (headings, typographic features)
    • E.g. Duplications of text samples
    • E.g. Personal information found in data
      • To protect anonymity or privacy of authors and speakers
  • Annotate texts
    • Provide descriptive information about each corpus
    • Provide descriptive information about the texts
      • Number, size, genre of subtexts
      • Bibliographic info (written text)
      • Ethnographic info (spoken data)
      • Compiler, time of compilation, type of collection…
    • Provide structural information for texts if necessary
      • Mark texts for paragraph boundaries etc…
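Two of the clean-up steps, removing duplicated text samples and masking personal information, can be sketched as follows. This is an assumed, simplified approach: duplicates are detected only after whitespace and case normalisation, and only e-mail addresses are masked. Names, phone numbers and other identifying details would need separate handling. `clean_collection`, `normalise` and the sample texts are hypothetical.

```python
import re

# Rough e-mail pattern; an assumption for this sketch, not a full validator.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")

def normalise(text):
    """Lower-case and collapse whitespace so near-identical copies compare equal."""
    return " ".join(text.lower().split())

def clean_collection(texts):
    """Drop duplicate samples and mask e-mail addresses in the rest."""
    seen = set()
    cleaned = []
    for text in texts:
        key = normalise(text)
        if key in seen:
            continue  # duplicate of an earlier sample
        seen.add(key)
        cleaned.append(EMAIL.sub("<EMAIL>", text))
    return cleaned

# Hypothetical samples: the second duplicates the first apart from spacing and case.
samples = [
    "Dear Sir, please reply to jane.doe@example.com by Friday.",
    "Dear  Sir, please reply to JANE.DOE@example.com by Friday.",
    "Minutes of the meeting held on Monday.",
]

for text in clean_collection(samples):
    print(text)
```

Replacing the address with a placeholder such as `<EMAIL>`, rather than deleting it, keeps the sentence structure intact for later concordancing.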

Procedures (iii)
  • Put corpora together on platform, set up search and support facilities:
    • ‘PULB map’
    • Browse facility
    • Search and concordance facilities
    • Tutorial / general information
  • Transplant PULB onto dept website for use by staff and students
  • Promote PULB among corpora community
    • Data provider to data archives / distribution sites, e.g. OLAC, ICAME

The PolyU Language Bank
  • Current status
    • Range of corpora totalling 12M+ words
    • Individual corpus descriptions
    • Index of corpora
    • Simple to use built-in concordancer
    • Available at http://langbank.engl.polyu.edu.hk/

The PolyU Language Bank
  • Some of the currently available corpora
    • PolyU Business Corpus (Eng, Chi, Jap)
    • BNC Sampler Corpus (Spoken, Written)
    • Corpus of Multilingual Texts
    • Corpus of Nursing and Health Science Texts
    • Learner Corpus of Essays and Reports
    • HK Bilingual Corpus of Legal and Documentary Texts
    • ...

How you can contribute
  • Talk to us about your ideas
    • What would you like to see being incorporated into PULB?
      • In terms of corpora
      • In terms of search facilities and supplementary information
    • Can you think of other ways in which PULB can be organised and structured?
    • How likely are you to make use of PULB in your teaching and research?
    • Do you have any suggestions for corpus studies based on available or potentially available corpora from PULB?
    • Do you know of similar projects being undertaken elsewhere that we can learn from?
  • Talk to us about your collections / corpora
    • Do you have collections of language data from past research projects that are (could be) presented as a corpus (corpora)?
    • Can we help you put your collections to good use?
    • Can we work together to incorporate your collections into PULB?

Concluding remarks
  • Corpora represent a valuable but under-exploited resource for teaching and research
  • PULB aims to bring together various corpora under a single departmental archive, accessible via WWW
  • You can help us by contributing your ideas and/or your language collections
  • Please visit and test the PULB website at http://langbank.engl.polyu.edu.hk/ and provide us with feedback using the online evaluation form

Thank you very much

Social grooming

CLOZE

PolyU Business Corpus
  • Compiled in 1999-2000 (Li & Bilbow)
  • Multilingual comparable corpora:
    • English, Chinese and Japanese components, each of c. 1.1 to 1.3 M words
    • Business texts from: newspapers, government reports, company reports and brochures…
  • Has been used for creating a bilingual English-Chinese business lexicon

PolyU Business Lexicon

Duplication