
Corpora and Corpus-Analysis Tools

A corpus is simply a collection of texts or utterances that is used as a basis for conducting some type of
linguistic investigation.

More recently, the term corpus has come to refer to a large collection of electronic texts that have been gathered according to explicit criteria.

Types of corpora:

*Monolingual
*Parallel: bilingual, multilingual
*Comparable: monolingual, bilingual, multilingual

Corpus analysis tools:

1-Word frequency lists:

*The list can be sorted in different orders (e.g. alphabetical order).
*The user can discover the number of tokens (all the words, including repeated words) and types (only the different words).
#Lemmatized lists:
*They group related words together rather than listing each individual word form separately.
*A lemma is a form used to describe a word that includes and represents all its related forms.
*A problem when lemmatizing a word list automatically is homographs: two words that have the same spelling
but different parts of speech. The computer puts them under one lemma even though they are two different words.
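
The grouping described above can be sketched in a few lines of Python. This is only an illustration: `LEMMA_TABLE` is a hypothetical, hand-made lookup table and not the output of any real lemmatizer. It also shows the homograph problem, because "books" (noun) and "booked" (verb) end up under the same lemma.

```python
from collections import Counter

# Hypothetical, hand-made table mapping each word form to its lemma.
LEMMA_TABLE = {
    "go": "go", "goes": "go", "going": "go", "went": "go", "gone": "go",
    "book": "book", "books": "book", "booked": "book",
}

def lemmatized_list(tokens):
    """Count tokens under their lemma; unknown forms are counted as themselves."""
    return Counter(LEMMA_TABLE.get(token, token) for token in tokens)

tokens = ["went", "goes", "going", "books", "booked", "book"]
counts = lemmatized_list(tokens)

# Noun and verb uses of "book" are merged, because the table only sees spelling.
print(counts)
```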
#A stop list:
*A list of any items that the user wants the computer to ignore (e.g. articles).
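
A minimal sketch of a word frequency list with a stop list, assuming a very simple tokenizer (lowercase letters and apostrophes only). The function names `tokenize` and `frequency_list` and the sample sentence are invented for illustration.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def frequency_list(text, stop_list=()):
    """Return word frequencies, ignoring any item on the stop list."""
    stop = set(stop_list)
    return Counter(t for t in tokenize(text) if t not in stop)

text = "The cat sat on the mat. The dog sat on the cat."
tokens = tokenize(text)
freq = frequency_list(text)

print("tokens:", len(tokens))   # all running words, repeats included
print("types:", len(freq))      # different words only

by_frequency = freq.most_common()    # most frequent word first
alphabetical = sorted(freq.items())  # alphabetical order

# The same list with a small stop list applied.
filtered = frequency_list(text, stop_list=["the", "on"])
print(filtered.most_common())
```

Keeping the stop list as a parameter means the same text can be counted with or without function words.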

2-Concordances:
-Monolingual -Bilingual
*The results are displayed after a search has been conducted.
*The most common format is KWIC (key word in context): all occurrences of the search term are lined up in the center.
*Contexts can be sorted in a variety of ways (e.g. order of appearance, alphabetically).
*Search patterns (types): exact-string, case-sensitive, wildcard, Boolean operators, context search.
*A parallel corpus is a corpus that contains a collection of source texts (ST) in language A aligned with their
translations into language B.
*Alignment is the process whereby sections of the ST are linked up with their corresponding translations.
*The aligned sections are typically displayed either alongside each other or one above the other; they can also
be sorted (e.g. alphabetically).
*Most bilingual concordancers are bidirectional (the search can be entered in either language A or
language B).
*Statistical measures are a more sophisticated feature: the concordancer tries to identify potential equivalents
specifically, and may display them in separate windows. The advantage of this type of search is that the ST and
TT (target text) can be sorted independently to reveal patterns in both languages.
*A bilingual query is a type of search in which the user can specify a search term in both languages; it is useful for
checking whether or not a given translation is attested.
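
The KWIC display and the wildcard and case-sensitive search types mentioned above can be sketched as follows. This is a simplified monolingual illustration, not how any particular concordancer works: the function `kwic`, its parameters and the sample text are assumptions made for the example.

```python
import re

def kwic(text, pattern, width=30, case_sensitive=False):
    """Return (left context, keyword, right context) for every match.

    A '*' in the pattern is treated as a wildcard for any run of letters.
    """
    flags = 0 if case_sensitive else re.IGNORECASE
    regex = re.escape(pattern).replace(r"\*", r"\w*")
    lines = []
    for m in re.finditer(r"\b" + regex + r"\b", text, flags):
        left = text[max(0, m.start() - width):m.start()]
        right = text[m.end():m.end() + width]
        # Pad the left context so that all keywords line up in one column.
        lines.append((left.rjust(width), m.group(), right))
    return lines

text = ("Translators use corpora. A translation is checked against "
        "the corpus. Good translators revise.")

hits = kwic(text, "translat*")

# Order of appearance.
for left, keyword, right in hits:
    print(f"{left} [{keyword}] {right}")

# Sorted alphabetically by the context to the right of the keyword.
for left, keyword, right in sorted(hits, key=lambda h: h[2].lower()):
    print(f"{left} [{keyword}] {right}")
```

Sorting on the right or left context is what makes recurring patterns around the search term visible.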

3-Collocations:
Two words are regarded as collocates when they tend to go together.
Mutual Information (MI) is a formula for determining whether or not two words are collocates.
If two words are strongly connected they will have a high MI score; if not, they will have a low MI score.
Its drawbacks:
1-It assumes that the different words occur as completely independent events, whereas languages are
actually full of dependencies.
2-MI usually requires about 5 co-occurrences within a corpus in order to be valid.
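
The notes do not give the formula itself. A commonly used form in corpus linguistics is MI = log2((f(x,y) × N) / (f(x) × f(y))), where f(x,y) is the number of co-occurrences, f(x) and f(y) are the frequencies of the two words, and N is the corpus size. The sketch below assumes that form; the counts are invented for illustration and the cut-off of 5 co-occurrences is taken from the drawback listed above.

```python
import math

def mutual_information(pair_count, count_x, count_y, corpus_size, min_pairs=5):
    """Return the MI score, or None when there are too few co-occurrences.

    MI compares the observed co-occurrence count with the count expected
    if the two words occurred independently of each other.
    """
    if pair_count < min_pairs:
        return None
    return math.log2(pair_count * corpus_size / (count_x * count_y))

# Invented counts: the pair occurs far more often than chance predicts.
strong = mutual_information(pair_count=20, count_x=100, count_y=50,
                            corpus_size=1_000_000)

# Invented counts: the pair occurs exactly as often as chance predicts.
chance = mutual_information(pair_count=10, count_x=1000, count_y=1000,
                            corpus_size=100_000)

# Too few co-occurrences for the score to be trusted.
too_rare = mutual_information(pair_count=3, count_x=100, count_y=50,
                              corpus_size=1_000_000)

print(strong, chance, too_rare)
```

A score near zero means the words co-occur about as often as chance would predict; higher scores point to a stronger association.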

Annotation and mark-up:
Whether to annotate or mark up a corpus depends on the project at hand.
It requires a greater initial investment of time, but subsequently allows more specific searching.

#Linguistic (annotation):
*The advantage is that it allows users to focus their research more narrowly.
1-Syntactic
*Each word in the corpus has its part of speech specified with a tag (there is no standard tag set).
*Tagger programs add part-of-speech tags to a corpus automatically (the output needs to be edited by a human).
2-Semantic
*It distinguishes between multiple meanings of a word, such as homonyms (same spelling, different meaning).
#Non-linguistic (mark-up):
*A way of adding non-linguistic information to a corpus.
*It is possible to ask the computer to retrieve only those occurrences that match specific information (such as
the title of a text or its publication date).
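
As a rough sketch of how annotation and mark-up narrow a search, the example below stores each text with non-linguistic information (title, year) and each word with a part-of-speech tag. The tag names, the data layout and the `search` function are all hypothetical, since there is no standard tag set.

```python
# Hypothetical mini-corpus: mark-up (title, year) plus part-of-speech tags.
documents = [
    {"title": "Doc A", "year": 2001,
     "tokens": [("I", "PRON"), ("book", "VERB"), ("a", "DET"), ("flight", "NOUN")]},
    {"title": "Doc B", "year": 2010,
     "tokens": [("the", "DET"), ("book", "NOUN"), ("is", "VERB"), ("long", "ADJ")]},
]

def search(docs, word, tag=None, min_year=None):
    """Find a word, optionally restricted by part-of-speech tag and year."""
    hits = []
    for doc in docs:
        if min_year is not None and doc["year"] < min_year:
            continue  # mark-up filter: skip texts published too early
        for position, (form, pos) in enumerate(doc["tokens"]):
            if form.lower() == word and (tag is None or pos == tag):
                hits.append((doc["title"], position, pos))
    return hits

print(search(documents, "book"))                 # both uses of "book"
print(search(documents, "book", tag="NOUN"))     # only the noun
print(search(documents, "book", min_year=2005))  # only texts from 2005 onward
```

Without the tags, a plain string search could not separate the noun from the verb.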
