s.alansary@alexu.edu.eg
Lecture 5
Language Resources
Corpus linguistics
+ Arabic Newswire Corpus: Linguistic Data Consortium (LDC)
+ Arabic Gigaword: Linguistic Data Consortium (LDC)
+ CLARA: Charles University, Prague
+ General Scientific Arabic Corpus (GSAC): University of Manchester, Institute of Science and Technology (UK)
+ Classical Arabic Corpus (CAC): University of Manchester, Institute of Science and Technology (UK)
+ Corpus of Contemporary Arabic (CCA): University of Leeds, Latifa Al-Sulaiti & Eric Atwell
+ Al-Hayat Corpus: University of Essex, from the Al-Hayat newspaper
+ An-Nahar Corpus: from the An-Nahar newspaper, Lebanon
Other Languages Corpora
+ Corpaix spoken corpus: French
+ Korpus 2000: Danish
http://www.bibalex.org/ica/ar/
International Corpus of Arabic (ICA)
✓ Cover all varieties of Arabic as used across the Arab world.
✓ Present a strong Arabic resource to support linguistic research in
general and natural language processing (NLP) in particular.
✓ Provide authentic information about the Arabic language.
✓ Be morphologically, syntactically, and semantically analyzed.
✓ Be available for free.
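Percentage of sources
[Pie chart over four source types: Press, Net Articles, Books, Academics. Slice labels: 8%, 29%, 43%, 20%; the extracted labels do not show which share belongs to which source.]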
Genres
Genre                Count        Share
Miscellaneous        20,187,245   25%
Religion             11,090,317   14%
Strategic Sciences   10,588,754   13%
Social Sciences      10,127,407   12%
Literature            9,427,694   12%
Humanities            6,452,027    8%
Sports                5,824,524    7%
Art & Culture         3,698,550    5%
Biography             1,412,953    2%
Natural Sciences      1,056,042    1%
Applied Sciences      1,030,342    1%
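The shares in the Genres chart follow from the counts by dividing each genre's count by the total. The sketch below only illustrates that arithmetic; the names `GENRE_COUNTS` and `genre_shares` are made up for this example and are not part of any ICA tool. Shares are reported to one decimal place, so a value can differ by a point from the whole-number label on the chart.

```python
# Genre counts copied from the Genres chart.
GENRE_COUNTS = {
    "Miscellaneous": 20187245,
    "Humanities": 6452027,
    "Strategic Sciences": 10588754,
    "Social Sciences": 10127407,
    "Sports": 5824524,
    "Literature": 9427694,
    "Natural Sciences": 1056042,
    "Applied Sciences": 1030342,
    "Biography": 1412953,
    "Art & Culture": 3698550,
    "Religion": 11090317,
}


def genre_shares(counts):
    """Return each genre's share of the total, in percent, to one decimal."""
    total = sum(counts.values())
    return {genre: round(100 * n / total, 1) for genre, n in counts.items()}


if __name__ == "__main__":
    shares = genre_shares(GENRE_COUNTS)
    # Print genres from largest to smallest share.
    for genre, share in sorted(shares.items(), key=lambda kv: -kv[1]):
        print(f"{genre:<20}{share:5.1f}%")
```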
ICA Sub-Genres
[Bar chart: share of the corpus per sub-genre, with values ranging from 2.35% to 7.79%.]
Countries
Country              Count
Egypt                13,000,000
Saudi Arabia          8,000,000
Kuwait                5,000,000
Palestine             5,000,000
UAE                   4,500,000
Morocco               3,500,000
Jordan                3,500,000
Libya                 3,000,000
Bahrain               3,000,000
Iraq                  2,500,000
Lebanon               2,500,000
Qatar                 2,000,000
Algeria               2,000,000
Tunisia               2,000,000
Yemen                 1,500,000
Outside Arab World    1,000,000
Syria, Oman, Mauritania, Sudan: 6,000,000; 4,500,000; 2,500,000; 2,000,000 (the pairing of these four labels and values is not clear from the chart).
Searching the Corpus
The corpus content can be searched using four main options:
+ Exact match search.
+ Lemma-based search.
+ Root-based search.
+ Stem-based search.
Additional options are available:
+ Word class and subclass.
+ Stem pattern and its type.
+ Number.
+ Definiteness.
+ Gender.
+ Country.
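
The four search options differ only in which annotation layer the query is compared against, and the additional options act as filters on the matches. The sketch below illustrates this on a toy token list. It is not the ICA search interface: the `Token` fields, the `search` function, the ASCII transliterations, and the country assignments are all assumptions made for the example, and subclass and stem pattern are left out for brevity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    """One annotated corpus token (hypothetical layout, transliterated)."""
    surface: str        # word form as it appears in the text
    lemma: str
    root: str
    stem: str
    word_class: str
    number: str
    definiteness: str
    gender: str
    country: str


# Toy data; the country values are invented for illustration only.
TOKENS = [
    Token("al-kitab", "kitab", "ktb", "kitab", "noun", "singular", "definite", "masculine", "Egypt"),
    Token("kutub", "kitab", "ktb", "kutub", "noun", "plural", "indefinite", "masculine", "Syria"),
    Token("katib", "katib", "ktb", "katib", "noun", "singular", "indefinite", "masculine", "Egypt"),
    Token("al-madrasa", "madrasa", "drs", "madrasa", "noun", "singular", "definite", "feminine", "Jordan"),
]

# Each search mode compares the query against a different annotation layer.
SEARCH_FIELDS = {"exact": "surface", "lemma": "lemma", "root": "root", "stem": "stem"}


def search(tokens, query, mode="exact", **filters):
    """Return tokens whose chosen layer equals `query` and that pass all filters."""
    field = SEARCH_FIELDS[mode]
    hits = []
    for tok in tokens:
        if getattr(tok, field) != query:
            continue
        # Filters are attribute=value pairs, e.g. country="Egypt".
        if all(getattr(tok, name) == value for name, value in filters.items()):
            hits.append(tok)
    return hits


if __name__ == "__main__":
    for tok in search(TOKENS, "ktb", mode="root", country="Egypt"):
        print(tok.surface, tok.lemma, tok.definiteness)
```

In this toy data a root-based search for `ktb` matches three different word forms, a lemma-based search for `kitab` matches the singular and the plural, and an exact match search for `kitab` matches nothing, because that bare form never occurs as a surface token.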