are perhaps the most frequent: cross-lingual information retrieval, translingual information
retrieval, multilingual information retrieval. The term "multilingual information retrieval" refers
more generally both to technology for retrieving multilingual collections and to technology
that has been adapted from handling material in one language to handling another. Multilingual
Information Retrieval (MLIR) is the study of systems that accept queries for information
in various languages and return objects (text and other media) in various languages, translated
into the user's language. Cross-language information retrieval (CLIR) refers more specifically to the use
case where users formulate their information need in one language and the system retrieves
relevant documents in another. To do so, most CLIR systems use various translation techniques.
CLIR techniques can be classified into different categories based on the translation
resources they use:[2]
Lemmatization considers the context and converts a word to its meaningful base
form, which is called the lemma. Sometimes the same word can have multiple different
lemmas, depending on how it is used.
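As a rough illustration of how one surface word can map to different lemmas, here is a minimal Python sketch. It assumes the context is reduced to a part-of-speech tag, and it uses a tiny hand-written lookup table (`LEMMA_TABLE`, a hypothetical name with made-up entries) instead of a real lexicon, so it shows the idea only and is not a usable lemmatizer.

```python
# Toy lemma lookup keyed by (word, part of speech).
# The entries are hypothetical examples, not taken from a real lexicon.
LEMMA_TABLE = {
    ("saw", "VERB"): "see",     # "I saw the film"
    ("saw", "NOUN"): "saw",     # "a rusty saw"
    ("better", "ADJ"): "good",
    ("running", "VERB"): "run",
}


def lemmatize(word, pos):
    """Return the lemma for (word, pos), or the lowercased word if unknown."""
    key = (word.lower(), pos)
    return LEMMA_TABLE.get(key, word.lower())


print(lemmatize("saw", "VERB"))  # see
print(lemmatize("saw", "NOUN"))  # saw
```

The same string "saw" yields two different lemmas because the lookup depends on the tag as well as the word, which is the sense in which lemmatization takes context into account.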
There are two aspects to show their differences:
Thesaurus
The thesaurus is one of the most widely used knowledge organisation tools. Thomas’s (2004) article is a
good starting point for learning about the thesaurus; it defined the thesaurus, explained the
procedures is as follows.
order to localize the word limits and the results are stored in a
TF for Document 1
Term Frequency 1 2 2 1 1 1 1 1
TF for Document 2
Term Frequency 1 1 1 1 1 1 1
TF for Document 3
Term Frequency 1 1 1
In practice, documents differ in size. In a large document, term frequencies
tend to be much higher than in a smaller one. Hence we need to normalize the term frequency
by document size. A simple approach is to divide the term frequency by the total number of terms in the document. For
example, in Document 1 the term "game" occurs two times. The total number of terms in the
document is 10. Hence the normalized term frequency is 2 / 10 = 0.2. Given below are the
normalized term frequencies for all the documents.
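The normalization step can be sketched in a few lines of Python. This is a minimal sketch, not the original author's code: the function name `normalized_tf` is hypothetical, and the sample sentence is an assumption chosen only so that it contains 10 terms with "game" occurring twice, matching the Document 1 figures quoted above.

```python
from collections import Counter


def normalized_tf(tokens):
    """Return {term: count / total_terms} for one tokenized document."""
    total = len(tokens)
    if total == 0:
        return {}  # an empty document has no term frequencies
    counts = Counter(tokens)
    return {term: count / total for term, count in counts.items()}


# Assumed sample text: 10 terms, with "game" appearing twice.
doc1 = "the game of life is a game of everlasting learning".split()
tf1 = normalized_tf(doc1)

print(tf1["game"])  # 2 / 10 = 0.2
```

Because every count is divided by the same document length, the normalized values for one document sum to 1, which makes frequencies comparable across documents of different sizes.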