
Corpus and Lexicography:

Construction of ‘Word List’ of Pakistani English

Submitted by: Safdar Hussain

Submitted to: Prof. Dr. Zafar Iqbal


Abstract:

Corpus-based lexicography does not have a long history, but because corpora
provide reliable empirical data, most modern dictionaries of English depend upon
word lists created by corpus analysis. Of the dictionaries of English produced
in Pakistan, however, none is based on corpus data analysis. The aim of this
small research project is to present a corpus-based ‘word list’ for a learners’
dictionary of Pakistani English. An attempt has been made to use an objective
frequency criterion to select the words and multiword units to be described in
the dictionary. To this end, an automated analysis of an approximately 10.5
million word corpus of Pakistani English newspapers has been used. The paper
also discusses issues related to data collection and data analysis for the
Pakistani English corpus, and the problems faced in this project.

Introduction:

In linguistics, a collection of texts is generally referred to as a corpus.
Francis (1992:17) defines a corpus as “a collection of text assumed to be
representative of a given language, dialect, or other subset of a language to be
used for linguistic analysis”. Corpus linguistics is concerned with the
compilation and investigation of such corpora. It is a relatively new
discipline, originating in the second half of the 20th century, when the first
machine-readable corpora were compiled. The function of a corpus may vary
according to the needs and demands of the research project.

Corpora have played an important role in modern dictionary construction,
because a corpus-based dictionary draws on large collections of electronically
stored text of many kinds, such as newspapers and books. The example sentence
for each word is extracted directly from the corpus and so shows how the word is
really used. With the advent of the computational lexicon, the corpus has become
more and more important.

We are interested in contributing a small, publicly available corpus of written

text of Pakistani English. In pursuit of natural language processing research in Urdu,

we could not find a publicly available Urdu corpus with which to work, so we

had to start our own to train and test machine learning algorithms.

Objectives of the Study:


The English spoken in Pakistan differs from that spoken in other regions of the
world, and it is regarded as a distinct variety, called Pakistani English.

The aim of the present study is, in the first stage, to produce a corpus of
Pakistani English and then to extract a ‘word list’ of the most frequent words
of Pakistani English based on the corpus data.

Construction of Corpus:

Important questions that arise when building a corpus are:

• Authenticity of the language data
• Electronic/machine-readable form
• Design and collection according to sampling procedures
• Representativeness of the given language

Authentic language data means data obtained from a real language source without
tampering. For corpus analysis, data is usually stored in plain-text (Notepad)
or XML form. Making the corpus representative of the given language and of
appropriate size is also important, and both issues are tackled by proper
sampling. For Pakistani English there is currently no project like the British
National Corpus (BNC 2000) or the Bank of English (BOE 2000); therefore we must
rely on texts that are freely accessible. Data for the present study was
collected from online English newspapers published in Pakistan, which resolves
the issue of authenticity. The data collected from the archives of newspaper
websites was converted into plain-text (Notepad) format for use in corpus
analysis. Making the corpus representative of Pakistani English was very hard to
achieve because of the limited resources and time available for this research
project. Corpora of spoken language are generally much smaller than corpora of
written language because of the time required to transcribe speech. Given the
small scale of this project, it was also hard to collect data from scanned
copies of textbooks using OCR (optical character recognition). Data from
Internet newspaper websites, by contrast, is easily available from their
archives sections.

In order to cover language as it is actually used, we chose the September 2008
issues of six newspapers: the Daily Times, the Dawn, the Frontier Post, the
Nation, the News and the Post. The archives offer no clear-cut classification of
the articles, so exporting by date was the only way to export all articles. In
the context of this study, the corpus was compiled in machine-readable form. It
is a corpus of approximately 10.5 million words of Pakistani English newspapers.
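The conversion of downloaded archive pages into plain text can be illustrated with a short sketch. This is not the procedure actually used in the study, which is not documented in detail; it only assumes that each article is available as an HTML string and uses Python's standard `html.parser` module. The names `TextExtractor` and `html_to_text` are hypothetical.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the content of <script> and <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0   # > 0 while inside a skipped element
        self._chunks = []      # pieces of visible text, in document order

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        # Join the pieces with spaces and collapse runs of whitespace.
        # Note: a word split by inline markup would be separated here.
        return " ".join(" ".join(self._chunks).split())


def html_to_text(html):
    """Return the visible text of an HTML string (hypothetical helper)."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


if __name__ == "__main__":
    page = ("<html><head><style>p{color:red}</style></head>"
            "<body><h1>Headline</h1><p>First &amp; second paragraph.</p>"
            "</body></html>")
    print(html_to_text(page))
```

The resulting string can then be written to a text file, one file per article or per day, depending on how the corpus is organised.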

Corpus compilation for the Research:

This section describes the corpus created for this study. The main method of
preparing data for entry into the corpus was adaptation of material in
electronic format. Such material was not readily available other than from the
Internet. Therefore, Internet sources were used; appropriate data was downloaded
and stripped of its HTML formatting by saving it as a text file. All embedded
computer instructional text was removed. The data was then catalogued, arranged
chronologically and entered into the corpus in the form of the corresponding
text files. All the daily newspapers included in the corpus have offline printed
versions. Despite some difficulties in the development of Internet technologies
in Pakistan, there has been considerable progress in the evolution of Pakistani
online media. Because online versions of the weekly and monthly magazines were
unavailable, only daily newspapers have been taken into account for this study.
The websites from which the texts were downloaded are listed in the following
table.

Details of the Sample from Different Newspapers:

Newspaper       Website                   Tokens*     Types/Raw   Types/Tag
Daily Times     www.dailytimes.com.pk     2,017,137   60,303      66,594
Dawn            www.dawn.com.pk           2,709,703   78,858      86,842
Frontier Post   www.frontierpost.com.pk   1,095,201   43,574      48,087
Nation          www.nation.com.pk         1,259,076   44,463      49,127
News            www.news.com.pk           2,496,829   65,295      72,410
The Post        www.thepost.com.pk        2,120,733   60,452      66,965

*Token: Total number of words in a corpus
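The distinction between tokens (running words) and types (distinct word forms) can be made concrete with a minimal sketch. The tokenisation rule below (alphabetic strings, optionally joined by hyphens or apostrophes, compared in lower case) is an assumption made for illustration; the source does not specify how its token and type counts, or the difference between Types/Raw and Types/Tag, were computed. The function name `count_tokens_and_types` is hypothetical.

```python
import re

# Assumed tokenisation rule: letters, optionally joined by "-" or "'".
TOKEN_RE = re.compile(r"[A-Za-z]+(?:[-'][A-Za-z]+)*")


def count_tokens_and_types(text):
    """Return (number of tokens, number of distinct lower-cased types)."""
    tokens = [t.lower() for t in TOKEN_RE.findall(text)]
    return len(tokens), len(set(tokens))


if __name__ == "__main__":
    sample = "The court heard the case. The case was adjourned."
    tokens, types = count_tokens_and_types(sample)
    print(f"{tokens} tokens, {types} types")
```

In the sample sentence pair, "the" and "case" are repeated, so the number of types is smaller than the number of tokens.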

Percentage of Tokens from Different Newspapers:

[Chart: share of tokens per newspaper. The Post 18%; Daily Times 17%; News 21%; Dawn 24%; Nation 11%; Frontier Post 9%]

Number of Token Types from Different Newspapers:

[Chart: types (Types/Tag) per newspaper. The Post 66,965; Daily Times 66,594; News 72,410; Dawn 86,842; Nation 49,127; Frontier Post 48,087]

Number of Types/Raw from Different Newspapers:

[Chart: Types/Raw per newspaper. The Post 60,452; Daily Times 60,303; News 65,295; Dawn 78,858; Nation 44,463; Frontier Post 43,574]

This corpus is smaller than the major English corpora, but it seems large
enough given our objective of creating a word list of Pakistani English,
together with its collocations and grammatical structures. Nor is it claimed
that the corpus is perfectly balanced, but it is made up of the kind of texts
that the potential users of our dictionary will have to deal with. Corpora are
used to derive empirical knowledge about language, which can supplement, and
frequently supplant, information from reference sources and introspection
(Leech 1991).


Data Analysis:

The whole corpus was then tagged and lemmatised with AntConc. The result of
this analysis was processed in order to restore the appearance of the original
texts. We submitted the entire lemmatised corpus (51 845 143 words) to AntConc,
a well-known text analysis tool, which was used to create a frequency list for
the whole corpus. This frequency list has been corrected on some minor points.
For example, frequent words written with hyphens that were split up during the
lemmatisation process have been extracted from the original corpus and added to
the list. Some lemmatisation errors have also been corrected.
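The frequency list itself was produced with AntConc. Purely as an illustration of the underlying idea, the sketch below builds a frequency list in Python while keeping hyphenated words whole, which avoids the splitting problem mentioned above. The regular expression and the function name `frequency_list` are assumptions made for this example and do not reproduce AntConc's own token definition.

```python
import re
from collections import Counter

# Assumed rule: a word is a run of letters, optionally joined by hyphens,
# so "vice-chancellor" is counted once rather than as "vice" + "chancellor".
WORD_RE = re.compile(r"[a-z]+(?:-[a-z]+)*")


def frequency_list(text):
    """Return a Counter mapping each lower-cased word form to its frequency."""
    return Counter(WORD_RE.findall(text.lower()))


if __name__ == "__main__":
    sample = "The vice-chancellor met the vice-chancellor of another university."
    freq = frequency_list(sample)
    # most_common() lists word forms from most to least frequent.
    for word, count in freq.most_common(3):
        print(word, count)
```

A list built this way still contains inflected forms as separate entries; grouping them under a lemma would require a separate lemmatisation step.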

Results:

The data analysis in AntConc produced a word list of 89,843 word types out of
10,406,212 word tokens. A list of the top 1,000 most frequent word types is
given below.

Limitations:

There are several limitations to the study regarding the statistics of the
corpus research. Urdu words occur quite frequently and regularly in Pakistani
English newspapers, and it is not possible to take all of them into account in a
10.5 million word corpus. Words that appear fewer than 10 times have been
disregarded, keeping in mind the scope and nature of this study.
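The frequency cut-off described above can be sketched as a simple filter over a frequency list. The function name `apply_frequency_cutoff` and the sample words and counts are hypothetical and serve only to show the threshold at work; they are not taken from the corpus.

```python
from collections import Counter


def apply_frequency_cutoff(freq, min_count=10):
    """Keep only the word forms that occur at least `min_count` times."""
    return Counter({word: n for word, n in freq.items() if n >= min_count})


if __name__ == "__main__":
    # Hypothetical counts, for illustration only.
    freq = Counter({"government": 120, "sahib": 10, "lathi": 9})
    kept = apply_frequency_cutoff(freq)
    print(sorted(kept))
```

With a threshold of 10, a word occurring exactly 10 times is retained and a word occurring 9 times is dropped, matching the rule that words appearing fewer than 10 times are disregarded.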

Conclusion:

Frequency is a powerful tool in the lexicographer's arsenal of resources,

allowing her to make informed linguistic decisions about how to frame the entry and

analyse the lexical patterns associated with words in a more objective and consistent

way. However, in dictionary making, editorial judgment is of paramount
importance, because blindly following the corpus, no matter how carefully it may
be constructed to represent the target language type accurately, can lead to
oddities. We expect our motto, 'Corpus-based, but not corpus-bound', to hold
good for many years to come.

References:

Francis, W. N. (1992) Language Corpora B.C. In J. Svartvik (ed.), Directions in Corpus Linguistics: Proceedings of Nobel Symposium 82, Stockholm, 4-8 August 1991. Berlin and New York: Mouton de Gruyter. 17-32.

Kennedy, G. (1998) An Introduction to Corpus Linguistics. Harlow: Longman, section 4.1.2.

Leech, G. (1991) The State of the Art in Corpus Linguistics. In K. Aijmer and B.

Tognini Bonelli, E. (2001) Corpus Linguistics at Work. Amsterdam: John Benjamins.
