Multilingual Information Retrieval

by T.Mehbub Basha

Overview: Introduction, Document Preprocessing, Monolingual Information Retrieval

Introduction

Information Retrieval (IR) is concerned with satisfying the information needs of users. The constantly increasing number of information items (e.g. documents, World Wide Web (WWW) websites) requires efficient approaches to retrieve relevant subsets for specific information needs. This requires adapting the retrieval techniques applied to Web search to these new scenarios.

Why do we need multilingual retrieval?

Websites, social networks or personal emails are written in different languages. People from different nations and languages are connected in social networks. Internet usage statistics as presented in Figure 1.1 show that only one fourth of the Internet users are native English speakers (27.3% of English Internet users, last accessed November 16, 2010).

Figure 1.1: Statistics of the number of Internet users by language.

Many information retrieval approaches are based on Machine Translation (MT) systems. However, these systems still have high error rates (for example in grammar and meaning). This motivates the development of multilingual retrieval methods that do not depend on MT, or that are at least able to compensate for errors introduced by the translation systems.

II. DEFINITION OF INFORMATION RETRIEVAL

Given a collection D containing information items di and a keyword query q representing an information need, IR is defined as the task of retrieving a ranked list of information items d1, d2, ... sorted by their relevance with respect to the specified information need.

The overall search process is visualized in Figure II.1. This process consists of two parts:
1. Indexing part
2. Search part


The indexing part processes the entire document collection to build index structures. Each document is thereby preprocessed and mapped to a vector representation. The search part is based on the same preprocessing step, which is also applied to the query. Using the vector representation of the query, the matching algorithm determines relevant documents, which are then returned as ranked results.
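The indexing and search parts can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the method of the slides: the slides do not name a weighting scheme or a matching function, so term-frequency vectors and cosine similarity are assumed here, and the names `preprocess`, `to_vector`, `build_index` and `search` are hypothetical.

```python
import math
from collections import Counter


def preprocess(text):
    """Toy preprocessing (assumption): lowercase and split on whitespace."""
    return text.lower().split()


def to_vector(text):
    """Map a text to a sparse term-frequency vector (term -> count)."""
    return Counter(preprocess(text))


def cosine(u, v):
    """Cosine similarity between two sparse vectors (assumed matching function)."""
    dot = sum(weight * v[term] for term, weight in u.items() if term in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def build_index(collection):
    """Indexing part: preprocess every document once and store its vector."""
    return {doc_id: to_vector(text) for doc_id, text in collection.items()}


def search(index, query):
    """Search part: apply the same preprocessing to the query, then rank."""
    query_vector = to_vector(query)
    scored = [(doc_id, cosine(query_vector, vec)) for doc_id, vec in index.items()]
    matches = [(doc_id, score) for doc_id, score in scored if score > 0]
    return sorted(matches, key=lambda pair: pair[1], reverse=True)


# Made-up example collection, for illustration only.
collection = {
    "d1": "information retrieval on the web",
    "d2": "machine translation systems",
    "d3": "multilingual information retrieval",
}
index = build_index(collection)
results = search(index, "multilingual retrieval")
```

The query goes through the same `to_vector` function as the documents, which mirrors the statement that the search part reuses the preprocessing step of the indexing part.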

Cross-lingual and Multilingual IR

In the monolingual case, the content of the information items di and the keyword query q are written in the same language. In the cross-lingual and multilingual cases, the information need and the corresponding query of the user may be formulated in languages other than the one in which the documents are written.

Document preprocessing takes a set of raw documents as input and produces a set of tokens as output. Depending on language, script and other factors, the process for identifying terms can differ substantially. For Western European languages, terms used in IR systems are often defined by the words of these languages. In Chinese, however, words are not separated by whitespace. Using character sequences as terms avoids the problem of detecting word borders.
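The difference between the two ways of identifying terms can be sketched in Python. This is only an illustration: the slides speak of "character sequences" without fixing a length, so the overlapping bigrams used here (n = 2) are an assumption, and the function names `word_tokens` and `char_ngrams` are hypothetical.

```python
def word_tokens(text):
    """Whitespace tokenization, usable where words are separated by spaces."""
    return text.split()


def char_ngrams(text, n=2):
    """Overlapping character n-grams; no word-border detection is needed.

    The default n=2 is an assumption made for this sketch.
    """
    chars = [c for c in text if not c.isspace()]
    if len(chars) < n:
        return ["".join(chars)] if chars else []
    return ["".join(chars[i:i + n]) for i in range(len(chars) - n + 1)]


english_terms = word_tokens("multilingual information retrieval")
chinese_terms = char_ngrams("信息检索", n=2)
```

For the four-character Chinese string, `char_ngrams` yields three overlapping bigrams without ever deciding where one word ends and the next begins.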

Common techniques used for document preprocessing: document syntax, encoding, tokenization and normalization of tokens.
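A few of these steps can be sketched in Python with the standard library. The sketch covers encoding, tokenization and normalization only; handling of document syntax (for example stripping markup) is left out. The concrete choices are assumptions, not taken from the slides: UTF-8 as default encoding, whitespace tokenization, Unicode NFKC normalization and case folding. The function names are hypothetical.

```python
import unicodedata


def decode_document(raw_bytes, encoding="utf-8"):
    """Encoding step: turn raw bytes into text, replacing undecodable bytes."""
    return raw_bytes.decode(encoding, errors="replace")


def normalize_token(token):
    """Normalization step: unify Unicode forms, then fold letter case."""
    return unicodedata.normalize("NFKC", token).casefold()


def preprocess(raw_bytes, encoding="utf-8"):
    """Decode, split on whitespace (assumed tokenizer), normalize each token."""
    text = decode_document(raw_bytes, encoding)
    return [normalize_token(tok) for tok in text.split()]


# The first token starts with a fullwidth "Ｗ" (U+FF37), which NFKC maps to "W".
tokens = preprocess("Ｗeb Search CAFÉ".encode("utf-8"))
```

Normalizing tokens this way lets differently encoded or differently cased spellings of the same word map to a single index term.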

