
Addis Ababa University
School of Information Science
Department of Information Science

IR Assignment I
Literature Review on Information Extraction

By: Biniam Worku (ID No. GSE/6722/13)

Submission Date: 21/9/2021


Information extraction

1. Overview

Information extraction (IE) is the automated retrieval of specific information related to a selected topic from a body or bodies of text.

Information extraction tools make it possible to pull information from text documents, databases, websites or multiple sources. IE may extract information from unstructured, semi-structured or structured, machine-readable text. Usually, however, IE is used in natural language processing (NLP) to extract structured information from unstructured text.

Information extraction depends on named entity recognition (NER), a subtask used to find targeted information to extract. NER first recognizes each entity as belonging to one of several categories, such as location (LOC), person (PER) or organization (ORG). Once the information category is recognized, an information extraction utility extracts the named entity's related information and constructs a machine-readable document from it, which algorithms can further process to extract meaning. IE derives meaning by way of other subtasks, including coreference resolution, relationship extraction, language and vocabulary analysis and sometimes audio extraction.
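To make the NER-then-extract step concrete, the category lookup can be sketched with a minimal gazetteer-based recognizer in Python. The entity lists, function name and example sentence below are illustrative assumptions, not a description of any particular system:

```python
# A minimal gazetteer-based NER sketch: entities are matched against small
# hand-built lists and tagged with their category (PER, LOC, ORG).
GAZETTEER = {
    "PER": {"Marie Curie", "Alan Turing"},
    "LOC": {"Addis Ababa", "Paris"},
    "ORG": {"Reuters", "Carnegie Mellon University"},
}

def recognize_entities(text):
    """Return sorted (surface form, category) pairs found in the text."""
    found = []
    for category, names in GAZETTEER.items():
        for name in names:
            if name in text:
                found.append((name, category))
    return sorted(found)

entities = recognize_entities("Alan Turing visited Paris before joining Reuters.")
# Each recognized entity is labelled with its category, ready for the
# downstream extraction step described above.
```

A production recognizer would of course use statistical or neural models rather than fixed lists, but the interface — text in, labelled entities out — is the same.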

IE dates back to the early days of natural language processing in the 1970s. JASPER, an IE system built for Reuters by Carnegie Mellon University, is an early example. Current efforts in multimedia document processing, such as automatic annotation and content recognition and extraction from images and video, can be seen as IE as well. Because of the complexity of language, high-quality IE is a challenging task for artificial intelligence (AI) systems.

Apart from the definition of information extraction, the difference between information extraction and information retrieval must be explained, since the two techniques are often confused with each other. Information retrieval can be characterized as the operation preceding information extraction within the text mining framework. The aim of information retrieval is to filter the available documents and find those which correspond to the queries representing the user's information need. After this process, information extraction derives names and events from the texts provided by the information retrieval mechanism. Another way to distinguish between the two techniques is to look at their output. In the case of information retrieval, the output is the collection of documents relevant to the user's information need, although the user must then read them in order to obtain precise information; whereas after information extraction, the user has a collection of records with the entities, relations and events that have been derived from those documents (Cowie and Lehnert, 1996; Wilks, 1997; Appelt and Israel, 1999; Ben-Dov and Feldman, 2005).
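The contrast between the two outputs can be illustrated with a toy Python sketch, where retrieval returns whole documents and extraction returns structured records. The document collection and the regular-expression pattern are invented for the example:

```python
import re

DOCUMENTS = [
    "Acme Corp. appointed Jane Doe as CEO in 2020.",
    "The weather in Oslo was mild this spring.",
    "Jane Doe resigned from Acme Corp. in 2023.",
]

def retrieve(query, docs):
    # Information retrieval: the output is the relevant documents themselves,
    # which the user would still have to read.
    return [d for d in docs if query.lower() in d.lower()]

# A toy appointment pattern; a real system would use far richer grammars.
APPOINTMENT = re.compile(
    r"(?P<org>\w+ Corp\.) appointed (?P<person>\w+ \w+) as (?P<role>\w+) in (?P<year>\d{4})"
)

def extract(doc):
    # Information extraction: the output is a structured record of the
    # entities and the event derived from the document.
    m = APPOINTMENT.search(doc)
    return m.groupdict() if m else None

relevant = retrieve("Jane Doe", DOCUMENTS)            # two whole documents
records = [r for d in relevant if (r := extract(d))]  # one structured record
```

Retrieval hands back two full documents mentioning "Jane Doe"; extraction reduces the matching one to a single record with explicit organization, person, role and year fields.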

2. Pros & Cons of Information Extraction Systems

There are two approaches to building extraction systems, and the pros and cons of a system depend on the approach used to build it. These approaches are the knowledge engineering approach and automatically trainable systems.

Knowledge engineering approach:

• Grammars are constructed by hand

• Domain patterns are discovered by a human expert through introspection and inspection of a corpus

• Much laborious tuning and "hill climbing"

Advantages

• With skill and experience, well-performing systems are conceptually not hard to develop

• The best-performing systems have been hand-crafted (still true for scenario patterns)

Disadvantages

• Very laborious development process

• Domain adaptation might require re-configuration

• Needs experts who have both linguistic and domain expertise

Automatically trainable systems:

• Use statistical methods when possible

• Learn rules from annotated corpora

• Learn rules from interaction with the user

Advantages

• Domain portability is relatively straightforward

• System expertise is not required for customization

• Data-driven rule acquisition ensures full coverage of examples

Disadvantages

• Training data may not exist, and may be very expensive to acquire

• A large volume of training data may be required

• Changes to specifications may require re-annotation of large quantities of training data
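The two approaches can be contrasted in a deliberately tiny Python sketch: a hand-written trigger rule versus a rule induced from a small annotated corpus. The corpus, labels and frequency threshold are invented for illustration:

```python
from collections import Counter

# Knowledge engineering: an expert hard-codes the trigger word by hand.
def handcrafted_extract(tokens):
    """Return the token immediately after the hand-picked trigger 'Mr.'."""
    return [tokens[i + 1] for i, t in enumerate(tokens[:-1]) if t == "Mr."]

# Automatically trainable: induce trigger words from an annotated corpus,
# where each sentence is a list of (token, label) pairs.
def learn_triggers(annotated_sentences, min_count=2):
    """Count which word immediately precedes a PER entity; keep frequent ones."""
    triggers = Counter()
    for sent in annotated_sentences:
        for (prev, _), (_, label) in zip(sent, sent[1:]):
            if label == "PER":
                triggers[prev] += 1
    return {w for w, c in triggers.items() if c >= min_count}

corpus = [
    [("Dr.", "O"), ("Smith", "PER"), ("spoke", "O")],
    [("Dr.", "O"), ("Jones", "PER"), ("left", "O")],
    [("the", "O"), ("city", "O")],
]
triggers = learn_triggers(corpus)  # the rule is induced, not hand-written
```

The trade-offs listed above show up even at this scale: the hand-crafted rule needs an expert to know the trigger, while the learned rule needs annotated data and enough examples to clear the frequency threshold.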

3. Information Extraction System Architecture

An information extraction system is composed of a series of modules that process text by applying rules. Since information extraction involves selecting pieces of data, an extraction system processes a text by creating computer data structures for the relevant sections of the text while eliminating irrelevant sections from processing. Although there will be variations among systems, the functions of the following set of modules will generally be performed somewhere in the processing.

The initial module, the Text Zoner, takes a text as input and separates it into identifiable segments. The Preprocessor module then takes the segments that contain text (as opposed to formatted information) and, for the individual words within each sentence in those segments, accesses a lexicon (i.e., a dictionary) and associates properties such as part-of-speech and meaning with each word. To reduce the amount of information to be processed, the Filter module subsequently eliminates sentences that do not contain any information relevant to the application.

The following modules, including the optional Pre-parser and Fragment Combiner modules, are geared toward analyzing the grammatical relationships among the words to create data structures from which sentence meaning can be interpreted. Because of the difficulty of analyzing these relationships, more and more systems include a Pre-parser module here to identify sequences or combinations of words that form phrases. Accessing grammar rules, the next module, the Parser, analyzes the sequences of words and phrases and tries to determine the grammatical relationships among the constituents. The output is either a successfully analyzed (parsed) sentence with the relationships among the sentence constituents labelled, or a partially analyzed sentence with some constituent relationships labelled and other constituents left as unattached fragments. It is these unattached fragments that bring the Fragment Combiner module into play, to try to turn a partially labelled sentence with fragments into a completely labelled one.

With the grammatical relationships identified, either a fully analyzed sentence or a partially analyzed sentence containing fragments is then processed by the Semantic Interpreter. This module interprets the labelled grammatical relationships and generates a representation of the sentence meaning in some form. The next module, the Lexical Disambiguation module, replaces the representation of any ambiguous words within the sentence with a specific, unambiguous representation. The next step, the Coreference Resolution module, takes the meaning representations for the sentences within a text (from the Semantic Interpreter) and identifies which of the events or entities occurring in the data structures of the individual sentences actually refer to the same entity or event in the real world, a critical step for avoiding database duplication. The final module is the Template Generator, in which the information output by the Semantic Interpreter and Coreference Resolution modules is turned into template fills in the desired format.
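The pipeline described above can be sketched schematically in Python. The module names follow the text, but the data structures, toy lexicon and single "acquisition" event type are illustrative assumptions rather than any real system's design:

```python
def text_zoner(raw):
    """Split the raw text into identifiable segments."""
    return [s.strip() for s in raw.split("\n\n") if s.strip()]

def preprocessor(segment):
    """Naive tokenizer plus a toy lexicon lookup for part-of-speech tags."""
    lexicon = {"acquired": "VERB", "Acme": "PROPN", "Globex": "PROPN"}
    return [(tok, lexicon.get(tok, "UNK")) for tok in segment.rstrip(".").split()]

def filter_module(sentences):
    """Drop sentences with no relevant trigger for the application."""
    return [s for s in sentences if any(pos == "VERB" for _, pos in s)]

def semantic_interpreter(sentence):
    """Stand-in for parsing plus interpretation: build a meaning record."""
    words = [w for w, _ in sentence]
    i = words.index("acquired")
    return {"event": "ACQUISITION", "buyer": words[i - 1], "target": words[i + 1]}

def template_generator(records):
    """Turn meaning records into template fills in the desired format."""
    return [f"{r['event']}: {r['buyer']} -> {r['target']}" for r in records]

raw = "Acme acquired Globex.\n\nThe sky was blue."
segments = text_zoner(raw)
sentences = [preprocessor(seg) for seg in segments]
relevant = filter_module(sentences)
templates = template_generator([semantic_interpreter(s) for s in relevant])
```

Each module consumes the previous module's output, so the system as a whole is a composition of simple stages; the irrelevant weather sentence is discarded by the Filter stage exactly as the text describes.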

4. Conclusion

The exponential growth of multifaceted unstructured big data is creating challenges in context-aware analytics, data-driven decision-making, and data management. IE techniques are important for extracting useful information from unstructured data and improve the effectiveness of data analytics. In this regard, a structured review was conducted to investigate the limitations of existing IE techniques for unstructured big data analysis. For this reason, state-of-the-art IE subtasks and their techniques for different types of data (text, images, audio, and video) have been discussed briefly. This review also investigated the effectiveness of existing IE techniques for unstructured big data. It has been observed that the variety of big data is creating numerous challenges for traditional IE systems in terms of accuracy, scalability, generalizability, and usability, which leads us to a new era of advanced IE approaches with new opportunities and challenges. It has also been concluded that hybrid approaches (i.e., combinations of learning-based and rule-based methods) achieve better performance in IE, while the quality of the data has a significant impact on the effectiveness of the results. However, many challenges are still associated with these hybrid approaches, such as language barriers, domain issues, and appropriate method selection for the task at hand. These challenges are specific to the IE process, but scalability, quality, heterogeneity, and interoperability are also critical factors associated with IE from unstructured big data. To overcome the limitations of existing IE techniques, more advanced and adaptive pre-processing techniques are required to remove the quality and usability issues. Further, some suggestions have also been provided in this review by critically analyzing the literature and the limitations of existing solutions. Our analysis finds that there is significant potential to improve the analysis process in terms of context-aware analytics systems. Advanced techniques and methods for IE systems, particularly for multifaceted unstructured big data sets, are urgently required. Existing approaches and methods do not apply to all domains and varying types of data, even within a single data type. There is a need to develop new techniques, and to refine existing ones, for the pre-processing stage of data, which can help to significantly reduce the problems in data sets later used for IE, knowledge discovery, and decision-making.

References

1. Mary Ellen Okurowski. "Information Extraction Overview." Department of Defense, 9800 Savage Road, Fort Meade, MD 20755.
2. Kiran Adnan and Rehan Akbar (2019). "Limitations of information extraction methods and techniques for heterogeneous unstructured big data." International Journal of Engineering Business Management, 11: 1–23.
3. Sagnik Ray Choudhury and C. Lee Giles (2011). "An Architecture for Information Extraction from Figures in Digital Libraries." Information Sciences and Technology, Pennsylvania State University.
