1-Word Frequency Lists:: Monolingual Parallel Comparable
A corpus is simply a collection of texts or utterances that is used as a basis for conducting some type of
linguistic investigation.
More recently, the term corpus has come to refer to a large collection of electronic texts gathered according to explicit criteria.
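A word frequency list, the tool named in this section's heading, can be built from such a collection of electronic text. The following is a minimal sketch, assuming a simple tokenization rule (lowercased runs of letters and apostrophes); the function name `frequency_list` and the sample text are hypothetical and only serve as illustration.

```python
import re
from collections import Counter

def frequency_list(text):
    """Return (word, count) pairs, most frequent first."""
    # Assumption: a token is a lowercased run of letters or apostrophes.
    tokens = re.findall(r"[a-z']+", text.lower())
    # Sort by descending count, then alphabetically so ties are deterministic.
    return sorted(Counter(tokens).items(), key=lambda kv: (-kv[1], kv[0]))

# Hypothetical toy corpus.
sample_corpus = "The cat sat on the mat. The dog sat too."
freq = frequency_list(sample_corpus)
for word, count in freq[:3]:
    print(word, count)
```

A real corpus tool would typically offer further options (e.g. stop lists or lemmatization), which this sketch leaves out.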
Types of corpora: monolingual, parallel, comparable.
2-Concordances:
-Monolingual -Bilingual
*The results are displayed after a search is conducted.
*The most common format is KWIC (key word in context), in which all occurrences of the search term are lined up in the center.
*Contexts can be sorted in a variety of ways (e.g. by order of appearance or alphabetically).
*Search patterns (types): exact-string, case-sensitive, wildcard, Boolean operators, context search.
*A parallel corpus is a corpus that contains a collection of source texts (ST) in language A aligned with their translations into
language B.
*Alignment is the process whereby sections of the ST are linked up with their corresponding translations.
*The aligned sections are typically displayed either alongside each other or one above the other. They can
also be sorted (e.g. alphabetically).
*Most bilingual concordancers are bidirectional (the search can be entered in either language A or
language B).
*Statistical measures are a sophisticated feature that tries specifically to identify potential equivalents
(displayed in separate windows). The advantage of this type of search is that the ST and TT can be sorted independently to
reveal patterns in both languages.
*A bilingual query is a type of search in which the user can specify a search term in both languages. It is useful for
checking whether or not a given translation is attested.
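The KWIC display and the bilingual query described above can be sketched roughly as follows. This is an illustration under stated assumptions, not how any particular concordancer works: the function names `kwic` and `bilingual_query`, the context width, and the English-French sample data are all hypothetical, and the parallel corpus is assumed to be already aligned into (ST, TT) segment pairs.

```python
import re

def kwic(text, term, width=20, case_sensitive=False):
    """Return (left, keyword, right) tuples for every match of `term`."""
    flags = 0 if case_sensitive else re.IGNORECASE
    lines = []
    for m in re.finditer(r"\b" + re.escape(term) + r"\b", text, flags):
        left = text[max(0, m.start() - width):m.start()]
        right = text[m.end():m.end() + width]
        # Pad the left context so every keyword starts in the same column.
        lines.append((left.rjust(width), m.group(0), right))
    return lines  # order of appearance

def bilingual_query(aligned_pairs, source_term, target_term):
    """Return aligned pairs whose ST contains source_term and whose TT contains target_term."""
    return [(st, tt) for st, tt in aligned_pairs
            if source_term.lower() in st.lower()
            and target_term.lower() in tt.lower()]

# Hypothetical monolingual text.
text = "The printer is out of paper. Turn the printer off. A new Printer arrived."
hits = kwic(text, "printer")
# Sort alphabetically by right-hand context instead of order of appearance.
for left, key, right in sorted(hits, key=lambda h: h[2].lower()):
    print(f"{left} [{key}] {right}")

# Hypothetical aligned ST/TT segments.
pairs = [
    ("The printer is out of paper.", "L'imprimante n'a plus de papier."),
    ("Turn the printer off.", "Éteignez l'imprimante."),
    ("Save the file first.", "Enregistrez d'abord le fichier."),
]
attested = bilingual_query(pairs, "printer", "imprimante")
not_attested = bilingual_query(pairs, "file", "dossier")
```

An empty result, as for "file" and "dossier" here, only means the translation is not attested in this particular corpus.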
3-Collocations:
Words are regarded as collocates when they tend to go together.
Mutual information (MI) is a formula for determining whether or not two words are collocates.
If two words are strongly connected, they will have a high MI score; if not, they will have a low MI score.
Its drawbacks:
1-It assumes that the different words occur as completely independent events, whereas languages
are actually full of dependencies.
2-MI usually requires about 5 co-occurrences within a corpus in order to be valid.
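The notes do not give the MI formula itself. A commonly used form is MI(x, y) = log2( f(x, y) * N / (f(x) * f(y)) ), where f(x, y) is the number of co-occurrences, f(x) and f(y) are the individual word frequencies, and N is the corpus size in tokens. The sketch below assumes this form, counts only directly adjacent word pairs as co-occurrences, and uses hypothetical names (`mi_scores`, `min_cooccurrences`) and a toy token list.

```python
import math
from collections import Counter

def mi_scores(tokens, min_cooccurrences=5):
    """Score adjacent word pairs by mutual information."""
    n = len(tokens)
    word_freq = Counter(tokens)
    # Assumption: a co-occurrence is two words directly next to each other.
    pair_freq = Counter(zip(tokens, tokens[1:]))
    scores = {}
    for (x, y), f_xy in pair_freq.items():
        # Skip pairs seen too rarely for the score to be meaningful.
        if f_xy < min_cooccurrences:
            continue
        scores[(x, y)] = math.log2(f_xy * n / (word_freq[x] * word_freq[y]))
    return scores

# Hypothetical toy corpus: "strong tea" and "the tea" each occur 5 times,
# but "the" also combines freely with many other words.
tokens = (["strong", "tea"] * 5 + ["the", "tea"] * 5
          + ["the", "cup", "the", "pot", "the", "sky", "the", "sea", "the", "map"])
scores = mi_scores(tokens)
for pair, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(pair, round(score, 3))
```

In this toy data "strong tea" scores higher than "the tea" even though both pairs occur equally often, because "the" is frequent on its own. Pairs seen fewer than 5 times, such as "the cup", are not scored at all.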
Annotation and mark-up:
Whether to annotate a corpus depends on the project at hand.
Annotation requires a greater initial investment of time, but subsequently allows more specific searching.
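As an illustration of how annotation allows more specific searching, the sketch below assumes a corpus whose tokens have already been annotated with part-of-speech tags. The tag names, the sample tokens, and the function `search` are hypothetical.

```python
# Hypothetical annotated corpus: each token is a (word, part-of-speech tag) pair.
annotated = [
    ("turn", "VERB"), ("on", "ADP"), ("the", "DET"), ("light", "NOUN"),
    ("they", "PRON"), ("light", "VERB"), ("the", "DET"), ("fire", "NOUN"),
    ("a", "DET"), ("light", "ADJ"), ("bag", "NOUN"),
]

def search(tokens, word, tag=None):
    """Return positions of `word`; if `tag` is given, keep only tokens with that tag."""
    return [i for i, (w, t) in enumerate(tokens)
            if w == word and (tag is None or t == tag)]

all_hits = search(annotated, "light")           # every occurrence of the word form
noun_hits = search(annotated, "light", "NOUN")  # only occurrences tagged as a noun
print(all_hits, noun_hits)
```

Without the tags, a search for "light" cannot separate the noun from the verb and the adjective; with them, the query can be narrowed to one use.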