
B, Cross-language information retrieval (CLIR) is a subfield of information retrieval dealing
with retrieving information written in a language different from the language of the user's
query.[1] The term "cross-language information retrieval" has many synonyms, of which the
following are perhaps the most frequent: cross-lingual information retrieval, translingual
information retrieval, multilingual information retrieval. The term "multilingual information
retrieval" refers more generally both to technology for retrieval of multilingual collections
and to technology that has been adapted from handling material in one language to handling
another. Multilingual Information Retrieval (MLIR) involves the study of systems that accept
queries for information in various languages and return objects (text and other media) in
various languages, translated into the user's language. Cross-language information retrieval
refers more specifically to the use case where users formulate their information need in one
language and the system retrieves relevant documents in another. To do so, most CLIR systems
use various translation techniques. CLIR techniques can be classified into different
categories based on the translation resources they use:[2]

• Dictionary-based CLIR techniques
• Parallel corpora based CLIR techniques
• Comparable corpora based CLIR techniques
• Machine translator based CLIR techniques
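As an illustration of the first category, dictionary-based CLIR can be sketched as translating each query term with a bilingual dictionary and then matching the translated terms against documents in the target language. This is only a minimal sketch: the Spanish-to-English dictionary, the documents, and the function names `translate_query` and `search` are hypothetical, and a real system would add tokenisation, term weighting and disambiguation.

```python
# Toy dictionary-based CLIR: translate the query terms, then match documents.
# The bilingual dictionary and the documents are hypothetical illustration data.

SPANISH_TO_ENGLISH = {
    "recuperación": ["retrieval", "recovery"],
    "información": ["information"],
    "documentos": ["documents"],
}

DOCUMENTS = {
    "d1": "information retrieval finds relevant documents",
    "d2": "disaster recovery plans for data centres",
    "d3": "cooking recipes for beginners",
}


def translate_query(query, dictionary):
    """Replace each query term with all of its dictionary translations."""
    translated = []
    for term in query.lower().split():
        # Terms missing from the dictionary are kept untranslated.
        translated.extend(dictionary.get(term, [term]))
    return translated


def search(query, dictionary, documents):
    """Rank documents by how many translated query terms they contain."""
    terms = set(translate_query(query, dictionary))
    scores = {}
    for doc_id, text in documents.items():
        overlap = terms & set(text.lower().split())
        if overlap:
            scores[doc_id] = len(overlap)
    # Highest score first; ties are broken by document id.
    return sorted(scores, key=lambda d: (-scores[d], d))


results = search("recuperación información", SPANISH_TO_ENGLISH, DOCUMENTS)
print(results)
```

In this toy data the ambiguous term "recuperación" also pulls in the document about disaster recovery, which shows why translation ambiguity is a central problem for dictionary-based approaches.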
C, Lemmatisation is closely related to stemming. The difference is that a stemmer
operates on a single word without knowledge of the context, and therefore cannot
discriminate between words which have different meanings depending on part of
speech. However, stemmers are typically easier to implement and run faster, and the
reduced accuracy may not matter for some applications.
Stemming simply strips the last few characters of a word, which often produces
incorrect meanings and spellings.

Lemmatization considers the context and converts the word to its meaningful base
form, which is called the lemma. Sometimes the same word can have several different
lemmas.
Two aspects show the difference:

1. A stemmer returns the stem of a word, which need not be identical to the
morphological root of the word. It is usually sufficient that related words map to the
same stem, even if the stem is not in itself a valid root. A lemmatiser, by contrast,
returns the dictionary form of a word, which must be a valid word.
2. In lemmatisation, the part of speech of a word is determined first, and the
normalisation rules differ for different parts of speech. A stemmer
operates on a single word without knowledge of the context, and
therefore cannot discriminate between words which have different meanings
depending on part of speech.
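The two points above can be illustrated with a toy sketch. It is not a real stemming algorithm such as Porter's, and it is not a real lemmatiser: the suffix list, the lemma table and the function names `stem` and `lemmatise` are assumptions made up for this example.

```python
# Toy contrast between stemming and lemmatisation.
# The suffix list and the lemma table are illustrative only.

SUFFIXES = ("ing", "ed", "es", "s")

# (word, part of speech) -> lemma; a real lemmatiser would use a full lexicon.
LEMMA_TABLE = {
    ("meeting", "noun"): "meeting",
    ("meeting", "verb"): "meet",
    ("studies", "noun"): "study",
    ("studies", "verb"): "study",
    ("better", "adj"): "good",
}


def stem(word):
    """Strip the first matching suffix, ignoring context and part of speech."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def lemmatise(word, pos):
    """Look up the dictionary form of a word, given its part of speech."""
    return LEMMA_TABLE.get((word, pos), word)


print(stem("meeting"))               # meet
print(lemmatise("meeting", "noun"))  # meeting
print(lemmatise("meeting", "verb"))  # meet
print(stem("studies"))               # studi
print(lemmatise("studies", "verb"))  # study
```

The stemmer maps "studies" to "studi", which is not a valid word, while the lemma lookup returns "study". It also gives "meeting" the same stem whether it is a noun or a verb, whereas the lookup distinguishes the two once the part of speech is known.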
D, The first books and standards on thesaurus construction, written in the 1970s, were
addressed to professional information workers, many of whom had at least some
understanding of the principles of classification but now needed to understand how to build
tools with greater specificity and how to use effectively the retrieval power offered by
post-coordinate systems. Then came the revolution: distributed systems and full-text searching
using a new breed of search engines, culminating in the arrival of the World Wide Web.

Thesaurus
The thesaurus is one of the most used knowledge organisation tools. Thomas’s (2004) article is a
good starting point for learning about the thesaurus: it defines the thesaurus and explains the
procedure of thesaurus construction and the rationale for self-instruction in thesaurus-making.


E, Multimedia Information Retrieval (MIR) is an organic system made up of Text Retrieval (TR),
Visual Retrieval (VR), Video Retrieval (VDR) and Audio Retrieval (AR) systems. So that each
type of digital document may be analysed and searched by the elements of language appropriate
to its nature, search criteria must be extended. Such an approach is known as Content-Based
Information Retrieval (CBIR), and is the core of MIR. This novel content-based concept of
information handling needs to be integrated with more traditional semantics. Multimedia
Information Retrieval focuses on the tools of processing and searching applicable to the
content-based management of new multimedia documents. Translated from Italian by Giles Smith,
the book is divided into two parts. Part one discusses MIR and related theories, and puts
forward new methodologies; part two reviews various experimental and operating MIR systems,
and presents technical and practical conclusions.

F, The Document Image Retrieval System (DIRS)

The overall structure of the proposed system is presented in [...]. It consists of two
different parts: the offline and the online procedure. An analytical description of the
different stages of both procedures is as follows.

In the offline operation, the document images are analyzed in order to localize the word
limits and the results are stored in a database. This procedure consists of three main stages.
Initially, the document images pass the preprocessing stage,


3, Step 1: Term Frequency (TF)

TF for Document 1

Document1        the  game  of  life  is  a  everlasting  learning
Term Frequency   1    2     2   1     1   1  1            1

TF for Document 2

Document2        the  unexamined  life  is  not  worth  living
Term Frequency   1    1           1     1   1    1      1

TF for Document 3

Document3        never  stop  learning
Term Frequency   1      1     1

In reality, documents differ in size. In a large document, term frequencies will be much
higher than in a small one, so we need to normalize the term frequency by document size. A
simple trick is to divide the term frequency by the total number of terms in the document. For
example, in Document 1 the term game occurs two times and the total number of terms in the
document is 10, so the normalized term frequency is 2 / 10 = 0.2. Given below are the
normalized term frequencies for all the documents.

Normalized TF for Document 1

Document1        the  game  of   life  is   a    everlasting  learning
Normalized TF    0.1  0.2   0.2  0.1   0.1  0.1  0.1          0.1

Normalized TF for Document 2

Document2        the       unexamined  life      is        not       worth     living
Normalized TF    0.142857  0.142857    0.142857  0.142857  0.142857  0.142857  0.142857

Normalized TF for Document 3

Document3        never     stop      learning
Normalized TF    0.333333  0.333333  0.333333
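The normalization step can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions: the tables list each distinct term once, so the full text of Document 1 used here is a reconstruction that matches the counts given (game and of twice, 10 terms in total), and its word order is assumed. The function name `normalized_tf` is likewise made up for this example.

```python
from collections import Counter

DOCUMENTS = {
    # Word order for Document 1 is assumed; only its term counts come from the tables.
    "Document1": "the game of life is a game of everlasting learning",
    "Document2": "the unexamined life is not worth living",
    "Document3": "never stop learning",
}


def normalized_tf(text):
    """Return {term: count / total number of terms} for one document."""
    tokens = text.lower().split()
    counts = Counter(tokens)
    total = len(tokens)
    return {term: count / total for term, count in counts.items()}


for name, text in DOCUMENTS.items():
    tf = normalized_tf(text)
    # Round only for display, to match the six decimals used in the tables.
    print(name, {term: round(value, 6) for term, value in tf.items()})
```

Because each document's values are divided by that document's own length, the normalized frequencies of every document sum to 1, which makes documents of different sizes comparable.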
