
Introduction to Language Technologies:

Challenges and Applications


Sameh Alansary

s.alansary@alexu.edu.eg

Prof. of Computational Linguistics


Head of Phonetics and Linguistics Dept.
Faculty of Arts, Alexandria University

Lecture 5
Language Resources
Corpus linguistics

• Corpus linguistics is the study of language based on large collections of "real-life" language use stored in corpora, computerized databases created for linguistic research. It is also known as corpus-based studies.


Existing Corpora: English Corpora

Corpus                                       Size in Words                       Collection Date
Brown Corpus                                 1 million                           1960
London-Lund Corpus of Spoken English (LLC)   500,000                             1960, 1975-81, 1985-88
Bank of English (Collins Cobuild)            More than 450 million               1980
International Corpus of English (ICE)        1 million for each English country  1990
British National Corpus (BNC)                100 million                         1991-1995
Arabic corpora: previous trials

Corpus                                    Source
Arabic Newswire Corpus                    Linguistic Data Consortium (LDC)
Arabic Gigaword                           Linguistic Data Consortium (LDC)
CLARA                                     Charles University, Prague
General Scientific Arabic Corpus (GSAC)   University of Manchester Institute of Science and Technology (UK)
Classical Arabic Corpus (CAC)             University of Manchester Institute of Science and Technology (UK)
Corpus of Contemporary Arabic (CCA)       University of Leeds (Latifa Al-Sulaiti & Eric Atwell)
Al-Hayat Corpus                           University of Essex
An-Nahar Corpus                           An-Nahar newspaper, Lebanon
Other Languages Corpora

Corpus                       Language
Corpaix spoken corpus        French
Korpus 2000                  Danish
NEGRA Corpus                 German
METU Corpus                  Turkish
LIVAC Synchronous Corpus     Chinese
Goteborgsposten Corpus       Swedish
Oslo Corpus                  Bosnian
CSL Corpus                   Serbian
Concordance lines: Key word in context (KWIC)
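
As a rough illustration of what a concordance tool produces, the sketch below builds KWIC lines for a keyword: each hit is shown with a few words of left and right context. The function name `kwic`, the `width` parameter, and the sample sentence are assumptions made for this example; they do not come from any particular concordancer.

```python
import re

def kwic(text, keyword, width=3):
    """Return (left context, keyword, right context) for every hit of keyword."""
    # Simple tokenization: lowercase the text and keep runs of word characters.
    tokens = re.findall(r"\w+", text.lower())
    target = keyword.lower()
    lines = []
    for i, tok in enumerate(tokens):
        if tok == target:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, tok, right))
    return lines

# Hypothetical sample text, used only to demonstrate the output layout.
sample = ("A corpus is a collection of texts. "
          "Researchers search the corpus for patterns, "
          "and each corpus has its own design.")

# Right-align the left context so the keyword lines up in one column.
for left, key, right in kwic(sample, "corpus"):
    print(f"{left:>25}  [{key}]  {right}")
```

Aligning the keyword in a single column is what makes recurring patterns to its left and right easy to spot when scanning many lines.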
From corpus analysis to big data analytics
• Big Data is today's hottest buzzword, and with the amount of data generated every minute by consumers and businesses worldwide, there is huge value to be found in Big Data analytics.

• In today's world, Big Data analytics fuels what we do online in every industry.
International Corpus of Arabic (ICA)

http://www.bibalex.org/ica/ar/
The ICA is designed to:

✓ Cover all varieties of Arabic as used across the Arab world.
✓ Provide a strong Arabic resource to support linguistic research in general and natural language processing (NLP) in particular.
✓ Provide authentic information about the Arabic language.
✓ Be morphologically, syntactically, and semantically analyzed.
✓ Be available for free.
Percentage of sources
[Pie chart. Source categories: Press, Net Articles, Books, Academics. Shares shown, in no particular order: 43%, 29%, 20%, 8%.]
Genres

Genre                Words        Share
Miscellaneous        20,187,245   25%
Religion             11,090,317   14%
Strategic Sciences   10,588,754   13%
Social Sciences      10,127,407   12%
Literature            9,427,694   12%
Humanities            6,452,027    8%
Sports                5,824,524    7%
Art & Culture         3,698,550    5%
Biography             1,412,953    2%
Natural Sciences      1,056,042    1%
Applied Sciences      1,030,342    1%
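
The genre percentages can be checked against the word counts with a few lines of arithmetic. The sketch below uses the counts from the slide and a largest-remainder rounding scheme, which keeps the whole-number shares summing to exactly 100. That rounding scheme is an assumption chosen for this example, not a documented ICA procedure, and the names `GENRE_WORDS` and `percentage_shares` are hypothetical.

```python
# Word counts per genre, as given on the slide.
GENRE_WORDS = {
    "Miscellaneous": 20187245,
    "Humanities": 6452027,
    "Strategic Sciences": 10588754,
    "Social Sciences": 10127407,
    "Sports": 5824524,
    "Literature": 9427694,
    "Natural Sciences": 1056042,
    "Applied Sciences": 1030342,
    "Biography": 1412953,
    "Art & Culture": 3698550,
    "Religion": 11090317,
}

def percentage_shares(counts):
    """Return (total, whole-number shares) using largest-remainder rounding."""
    total = sum(counts.values())
    # Start from the floor of each exact percentage (integer arithmetic only).
    shares = {genre: n * 100 // total for genre, n in counts.items()}
    # Hand the leftover points to the genres with the largest remainders.
    leftover = 100 - sum(shares.values())
    by_remainder = sorted(counts, key=lambda g: (counts[g] * 100) % total,
                          reverse=True)
    for genre in by_remainder[:leftover]:
        shares[genre] += 1
    return total, shares

total, shares = percentage_shares(GENRE_WORDS)
print(f"Total words: {total:,}")
for genre, pct in sorted(shares.items(), key=lambda kv: -kv[1]):
    print(f"{genre:<20}{pct:>3}%")
```

Plain rounding to the nearest whole number would push the shares to a sum above 100, which is why a scheme that redistributes the remainders is used here.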
ICA Sub-Genres
[Bar chart: share of the corpus per sub-genre. Vertical axis from 0.00% to 9.00%; the bars range from 2.35% to 7.79%.]
Countries
[Chart: number of words per country. Countries shown: Egypt, Saudi Arabia, Syria, Oman, Mauritania, Sudan, Qatar, Morocco, Jordan, Libya, Iraq, Kuwait, Bahrain, Palestine, Algeria, Yemen, Lebanon, Tunisia, UAE, and sources outside the Arab world. Counts range from 1,000,000 to 13,000,000 words.]
Searching the Corpus
The corpus content can be searched using four main options:
+ Exact match search.
+ Lemma Based Search.
+ Root Based Search.
+ Stem Based Search.
More options can be used:
+ Word Class and Sub Class.
+ Stem Pattern and its type.
+ Number.
+ Definiteness.
+ Gender.
+ Country.
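
To make the difference between the four search options concrete, here is a minimal sketch over a handful of toy annotated tokens. It is not the ICA interface: the `Token` fields, the `search` function, the transliterated words, and their analyses are all assumptions invented for illustration.

```python
from dataclasses import dataclass

@dataclass
class Token:
    surface: str      # word form as it appears in the text
    lemma: str        # dictionary form
    root: str         # consonantal root
    stem: str         # form with affixes stripped
    word_class: str
    country: str

# Hypothetical annotated tokens (transliterated, illustrative analyses only).
TOKENS = [
    Token("kataba", "kataba", "ktb", "katab", "verb", "Egypt"),
    Token("kitaab", "kitaab", "ktb", "kitaab", "noun", "Egypt"),
    Token("kutub", "kitaab", "ktb", "kutub", "noun", "Syria"),
    Token("maktaba", "maktaba", "ktb", "maktab", "noun", "Jordan"),
    Token("darasa", "darasa", "drs", "daras", "verb", "Syria"),
]

# Each search mode compares the query against a different annotation layer.
MODE_FIELD = {"exact": "surface", "lemma": "lemma", "root": "root", "stem": "stem"}

def search(tokens, query, mode="exact", **filters):
    """Return tokens matching query in the chosen mode and all extra filters."""
    field = MODE_FIELD[mode]
    hits = [t for t in tokens if getattr(t, field) == query]
    # Extra options (e.g. word_class, country) narrow the result further.
    for name, value in filters.items():
        hits = [t for t in hits if getattr(t, name) == value]
    return hits

print([t.surface for t in search(TOKENS, "kitaab", mode="exact")])
print([t.surface for t in search(TOKENS, "kitaab", mode="lemma")])
print([t.surface for t in search(TOKENS, "ktb", mode="root")])
print([t.surface for t in search(TOKENS, "ktb", mode="root", word_class="noun")])
```

In this toy data, an exact match returns only the identical word form, a lemma search also returns the plural, and a root search returns every word derived from the same root, so the result set widens from one option to the next.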
