Computational Linguistics

Published by: Abdulredha Shuli on Nov 17, 2012
Copyright:Attribution Non-commercial


Information retrieval systems (IRS) are designed to search for rele-
vant information in large documentary databases. This information
can be of various kinds, with the queries ranging from “Find all the
documents containing the word conjugar” to “Find information on
the conjugation of Spanish verbs”. Accordingly, various systems
use different methods of search.
The earliest IRSs were developed to search for scientific articles
on a specific topic. Usually, the scientists supply their papers with a
set of keywords, i.e., the terms they consider most important and
relevant for the topic of the paper. For example, español, verbos,
might be the keyword set of the article “On means of
expressing unreal conditions” in a Spanish scientific journal.
These sets of keywords are attached to the document in the bib-
liographic database of the IRS, being physically kept together with
the corresponding documents or separately from them. In the sim-
plest case, the query should explicitly contain one or more of such
keywords as the condition under which the article can be found and
retrieved from the database. Here is an example of a query: “Find the
documents on verbos and español”. In a more elaborate system, a
query can be a longer logical expression with the operators and, or,
not, e.g.: “Find the documents on (sustantivos or adjetivos) and
(not inglés)”.
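
The two example queries above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the four-document database, the document ids, and the helper `search` are hypothetical and are not taken from any particular IRS.

```python
# Hypothetical bibliographic database: document id -> set of author keywords.
DATABASE = {
    "doc1": {"español", "verbos"},
    "doc2": {"inglés", "sustantivos"},
    "doc3": {"español", "adjetivos"},
    "doc4": {"español", "sustantivos", "adjetivos"},
}


def search(predicate, database=DATABASE):
    """Return the sorted ids of documents whose keyword set satisfies predicate."""
    return sorted(doc_id for doc_id, kws in database.items() if predicate(kws))


# "Find the documents on verbos and español"
simple = search(lambda kws: "verbos" in kws and "español" in kws)

# "Find the documents on (sustantivos or adjetivos) and (not inglés)"
elaborate = search(
    lambda kws: ("sustantivos" in kws or "adjetivos" in kws) and "inglés" not in kws
)

print(simple)     # ['doc1']
print(elaborate)  # ['doc3', 'doc4']
```

Here each query is written directly as a Python predicate over the keyword set; a real system would parse the query text into such an expression.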

Nowadays, a simple but powerful approach to the format of the
query is becoming popular in IRSs for non-professional users: the
query is still a set of words; the system first tries to find the docu-
ments containing all of these words, then all but one, etc., and fi-
nally those containing only one of the words. Thus, the set of key-
words is considered in a step-by-step transition from conjunction to
disjunction of their occurrences. The results are ordered by degree
of relevance, which can be measured by the number of relevant
keywords found in the document. The documents containing more
keywords are presented to the user first.
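
A minimal sketch of this ordering follows, assuming that relevance is simply the count of query keywords found in a document. The function name `rank_by_matches`, the sample database, and the tie-breaking by document id are assumptions made for the illustration.

```python
def rank_by_matches(query, database):
    """Order documents by how many query keywords they contain, most first."""
    query = set(query)
    scored = []
    for doc_id, keywords in database.items():
        matches = len(query & set(keywords))
        if matches > 0:  # documents with no query keyword are not returned
            scored.append((matches, doc_id))
    # Descending match count; ties are broken by document id (an assumption).
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [(doc_id, matches) for matches, doc_id in scored]


# Hypothetical database: document id -> set of keywords.
database = {
    "a": {"español", "verbos", "morfología"},
    "b": {"español", "sustantivos"},
    "c": {"inglés", "verbos"},
    "d": {"fonética"},
}

ranking = rank_by_matches(["español", "verbos", "morfología"], database)
print(ranking)  # [('a', 3), ('b', 1), ('c', 1)]
```

Sorting by match count produces the same order as the step-by-step transition from conjunction to disjunction: documents with all keywords come first, those with only one come last.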
In some systems the user can manually set a threshold for the
number of the keywords present in the documents, i.e., to search for
“at least m of n” keywords. With m = n, often too few documents, if
any, are retrieved and many relevant documents are not found; with
m = 1, too many unrelated ones are retrieved because of a high rate
of false alarms.
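
The “at least m of n” threshold can be sketched as follows. The function name `at_least_m_of_n` and the sample database are hypothetical; the sketch only shows how the result set grows as m decreases.

```python
def at_least_m_of_n(query, database, m):
    """Return the ids of documents containing at least m of the query keywords."""
    query = set(query)
    if not 1 <= m <= len(query):
        raise ValueError("m must be between 1 and the number of query keywords")
    return sorted(
        doc_id
        for doc_id, keywords in database.items()
        if len(query & set(keywords)) >= m
    )


# Hypothetical database: document id -> set of keywords.
database = {
    "a": {"español", "verbos", "morfología"},
    "b": {"español", "verbos"},
    "c": {"inglés", "verbos"},
    "d": {"fonética"},
}
query = ["español", "verbos", "morfología"]

strict = at_least_m_of_n(query, database, m=3)  # m = n: only full matches
middle = at_least_m_of_n(query, database, m=2)
loose = at_least_m_of_n(query, database, m=1)   # m = 1: any single keyword

print(strict, middle, loose)  # ['a'] ['a', 'b'] ['a', 'b', 'c']
```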



Usually, recall and precision are considered the main characteristics
of IRSs. Recall is the ratio of the number of relevant documents
found to the total number of relevant documents in the database.
Precision is the ratio of the number of relevant documents found
to the total number of documents found.
It is easy to see that these characteristics are contradictory in the
general case, i.e., the greater one of them, the lesser the other, so that
it is necessary to keep a proper balance between them.
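
Both measures can be computed from two sets: the documents the system found and the documents that are actually relevant. The sketch below is illustrative; the function name, the document ids, and the convention of returning 0.0 for an empty denominator are assumptions.

```python
def recall_and_precision(retrieved, relevant):
    """Return (recall, precision) for a set of retrieved document ids."""
    retrieved, relevant = set(retrieved), set(relevant)
    relevant_found = retrieved & relevant
    # Returning 0.0 for an empty denominator is a convention chosen here.
    recall = len(relevant_found) / len(relevant) if relevant else 0.0
    precision = len(relevant_found) / len(retrieved) if retrieved else 0.0
    return recall, precision


# Hypothetical database in which four documents are relevant to the query.
relevant = {"d1", "d2", "d3", "d4"}

# A narrow search finds one document, and it is relevant.
narrow = recall_and_precision({"d1"}, relevant)

# A broad search finds all relevant documents plus four irrelevant ones.
broad = recall_and_precision(
    {"d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8"}, relevant
)

print(narrow)  # (0.25, 1.0)
print(broad)   # (1.0, 0.5)
```

The two calls show the trade-off on a toy scale: the narrow search has perfect precision but low recall, while the broad search has perfect recall but lower precision.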
In a specialized IRS, there usually exists an automated indexing
subsystem, which works before the searches are executed. Given a
set of keywords, it adds, using the or operator, other related key-
words, based on a hierarchical system of the scientific, technical or
business terms. This kind of hierarchical system is usually called
a thesaurus in the literature on IRSs and it can be an integral part of
the IRS. For instance, given the query “Find the documents on
conjugación,” such a system could add the word morfología to both the
query and the set of keywords in the example above, and hence find
the requested article in this way.
Thus, a sufficiently sophisticated IRS first enriches the set of
keywords given in the query, and then compares this set with the
previously enriched sets of keywords attached to each document in
the database. Such comparison is performed according to any of the
criteria mentioned above. After the enrichment, the average recall of the
IRS usually increases.
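
The enrichment step can be sketched as follows. The thesaurus entries, the query term conjugación, and the function names are assumptions made for the illustration; the sketch adds only directly related terms, whereas a real thesaurus is hierarchical and may be followed through several levels.

```python
# Hypothetical thesaurus: term -> broader or related terms.
THESAURUS = {
    "conjugación": {"morfología"},
    "verbos": {"morfología"},
}


def enrich(keywords, thesaurus=THESAURUS):
    """Return the keywords together with their directly related terms."""
    enriched = set(keywords)
    for term in keywords:
        enriched |= thesaurus.get(term, set())
    return enriched


def matches(query, doc_keywords, thesaurus=THESAURUS):
    """True if the enriched query shares a keyword with the enriched document set."""
    return bool(enrich(query, thesaurus) & enrich(doc_keywords, thesaurus))


article = {"español", "verbos"}  # keywords attached to a hypothetical article
query = {"conjugación"}

plain_hit = bool(query & article)       # comparison without enrichment
enriched_hit = matches(query, article)  # comparison after enrichment

print(plain_hit, enriched_hit)  # False True
```

Without enrichment the query and the article share no keyword; after enrichment both contain morfología, so the article is found, which is how enrichment raises recall.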
Recently, systems have been created that can automatically build
sets of keywords given just the full text of the document. Such sys-
tems do not require the authors of the documents to specifically
provide the keywords. Some of the modern Internet search engines
are essentially based on this idea.
Three decades ago, the problem of automatic extraction of key-
words was called automatic abstracting. The problem is not simple,
even when it is solved by purely statistical methods. Indeed, the
most frequent words in any business, scientific or technical texts are
purely auxiliary, like prepositions or auxiliary verbs. They do not
reflect the essence of the text and are not usually taken for
abstracting. However, the border between auxiliary and meaningful words
cannot be strictly defined. Moreover, there exist many term-forming
words like system, device, etc., which can seldom be used for in-
formation retrieval because their meaning is too general. Therefore,
they are not useful for abstracts.
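
A purely statistical extraction of this kind might be sketched as below. The two word lists, the sample text, and the function name are assumptions for the illustration; the lists are far from complete, and the choice of which words count as auxiliary or too general is exactly the part that cannot be strictly defined.

```python
import re
from collections import Counter

# Illustrative word lists (assumptions, far from complete).
AUXILIARY = {"the", "of", "a", "an", "is", "are", "and", "in", "to", "for", "by", "on"}
TOO_GENERAL = {"system", "device"}
STOP_WORDS = AUXILIARY | TOO_GENERAL


def extract_keywords(text, top_n=3):
    """Return the top_n most frequent words that are not in the stop lists."""
    words = re.findall(r"[^\W\d_]+", text.lower())  # runs of letters only
    counts = Counter(w for w in words if w not in STOP_WORDS)
    # Descending frequency; ties are broken alphabetically (an assumption).
    ordered = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [word for word, _ in ordered[:top_n]]


text = (
    "The retrieval system searches the index. "
    "The index of the system stores keywords. "
    "Retrieval of keywords is fast, and the index is small."
)

# Without filtering, the most frequent word is an auxiliary one.
most_frequent_raw = Counter(re.findall(r"[^\W\d_]+", text.lower())).most_common(1)[0][0]
keywords = extract_keywords(text)

print(most_frequent_raw)  # the
print(keywords)           # ['index', 'keywords', 'retrieval']
```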
The multitude of IRSs is now considered an important class
of applied software and, specifically, of applied linguistic systems.
The period when they used only individual words as keys has
passed. Developers now try to use word combinations and phrases,
as well as more complicated strategies of search. The limiting fac-
tors for the more sophisticated techniques turned out to be the same
as those for grammar and style checkers: the absence of complete
grammatical and semantic analysis of the text of documents. The
methods used now even in the most sophisticated Internet search
engines are not efficient for accurate information retrieval. This
leads to a high level of information noise, i.e., delivery of irrelevant
documents, as well as to the frequent missing of relevant ones.
The results of retrieval operations directly depend on the quality
and performance of the indexing and comparing subsystems, on the
content of the terminological system or the thesaurus, and other data
and knowledge used by the system. Obviously, the main tools and
data sets used by an IRS are linguistic in nature.
