By: Biniam Worku (ID No. GSE/6722/13)
1. Overview
Information extraction (IE) tools make it possible to pull information from text
documents, databases, websites, or multiple sources at once. IE may extract
information from unstructured, semi-structured, or structured machine-readable text.
Usually, however, IE is used in natural language processing (NLP) to extract
structured information from unstructured text.
IE dates back to the early days of natural language processing in the 1970s. JASPER,
an IE system built for Reuters by Carnegie Mellon University, is an early example.
Current efforts in multimedia document processing, such as automatic annotation and
content recognition and extraction from images and video, can be seen as IE as well.
Because of the complexity of language, high-quality IE is a challenging task for
artificial intelligence (AI) systems.
Apart from the definition of information extraction, the difference between information
extraction and information retrieval must be explained, since the two techniques are
often confused with one another. Information retrieval can be characterized as the step
preceding information extraction within the text mining framework. The aim of
information retrieval is to filter the available documents and find those which correspond
to the queries representing the user’s information need. After this process information
extraction derives names and events from the texts provided by the information retrieval
mechanism. Another way to distinguish between these two techniques is to look at their
output. In the case of information retrieval, the output is the collection of
documents relevant to the user's information need, which the user must then read in
order to obtain precise information; after information extraction, by contrast, the user
has a collection of records of the entities, relations, and events that have been
derived from those documents (Cowie and Lehnert, 1996; Wilks, 1997; Appelt and
Israel, 1999; Ben-Dov and Feldman, 2005).
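The contrast between the two outputs can be sketched in code. This is a deliberately toy illustration, not a real IE system: the corpus, the query function, and the regular-expression "extractor" below are all invented for the example.

```python
import re

# A hypothetical toy corpus; the sentences and patterns are illustrative only.
documents = [
    "Acme Corp hired Jane Doe as CTO in 2021.",
    "The weather was mild throughout the spring.",
    "Globex Inc appointed John Smith as CEO in 2019.",
]

def retrieve(docs, query):
    """Information retrieval: return whole documents matching the query.
    The user must still read them to find the precise information."""
    return [d for d in docs if query.lower() in d.lower()]

def extract(docs):
    """Information extraction: return structured records derived from the text."""
    pattern = re.compile(r"(.+?) (?:hired|appointed) (.+?) as (\w+) in (\d{4})")
    records = []
    for d in docs:
        m = pattern.search(d)
        if m:
            company, person, role, year = m.groups()
            records.append({"company": company, "person": person,
                            "role": role, "year": int(year)})
    return records

print(retrieve(documents, "CEO"))  # whole documents, still to be read
print(extract(documents))          # ready-to-use structured records
```

The retrieval step narrows the document collection; the extraction step turns the surviving text into records of entities, relations, and events.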
There are two approaches to building extraction systems, and the strengths and
weaknesses of a system depend on the approach used to build it. These approaches are
the knowledge engineering approach and automatically trainable systems.
Each approach has its own advantages and disadvantages. Notable disadvantages of
automatically trainable systems are that training data may not exist and may be very
expensive to acquire, and that a large volume of training data may be required.
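The difference between the two approaches can be sketched as follows. Both functions below are invented minimal examples: the hand-written date rule stands in for an expert's knowledge-engineered grammar, and the trigger-word learner stands in for a trainable system that needs labeled examples instead of hand-written rules.

```python
import re

# Knowledge engineering approach: an expert hand-writes extraction rules.
# This date rule is an illustrative hand-crafted pattern, not a full grammar.
DATE_RULE = re.compile(
    r"\b(\d{1,2}) (January|February|March|April|May|June|"
    r"July|August|September|October|November|December) (\d{4})\b")

def extract_dates_rule_based(text):
    """Apply the hand-written rule; no training data needed."""
    return [" ".join(groups) for groups in DATE_RULE.findall(text)]

# Automatically trainable approach (sketch): learn trigger words for an
# event type from labeled examples rather than hand-writing them.
def train_triggers(labeled_examples):
    """labeled_examples: list of (sentence, is_event) pairs.
    Keeps words that appear in positive examples but never in negatives."""
    triggers = set()
    for sentence, is_event in labeled_examples:
        if is_event:
            triggers.update(w.lower() for w in sentence.split())
    for sentence, is_event in labeled_examples:
        if not is_event:
            triggers.difference_update(w.lower() for w in sentence.split())
    return triggers
```

The rule-based extractor works immediately but requires expert effort to write and maintain; the trainable one adapts from data but, as noted above, only performs well when enough labeled examples exist.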
The initial module, a Text Zoner, takes a text as input and separates the text into
identifiable segments. The Preprocessor module then takes the segments that
contain text (as opposed to formatted information) and, for individual words within
each sentence in those segments, accesses a lexicon (i.e., dictionary) and associates
properties like part-of-speech and meaning with each word. To reduce the amount of
information to be processed, the Filter module subsequently eliminates sentences that
do not contain any relevant information for the application.
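These first three stages can be sketched in a few lines. The tiny lexicon, the regular expressions, and the set of "relevant" senses below are illustrative stand-ins for the real resources such a system would use.

```python
import re

# Toy lexicon mapping words to properties, as the Preprocessor's dictionary might.
LEXICON = {
    "acme":  {"pos": "NNP", "sense": "organization"},
    "hired": {"pos": "VBD", "sense": "employment-event"},
    "jane":  {"pos": "NNP", "sense": "person"},
    "rain":  {"pos": "NN",  "sense": "weather"},
}
RELEVANT_SENSES = {"employment-event"}  # what this application cares about

def text_zoner(text):
    """Text Zoner: separate raw text into sentence-like segments."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def preprocessor(segment):
    """Preprocessor: attach lexicon properties (POS, sense) to each word."""
    words = re.findall(r"[A-Za-z]+", segment)
    return [(w, LEXICON.get(w.lower(), {"pos": "UNK", "sense": None}))
            for w in words]

def filter_module(tagged_segments):
    """Filter: drop sentences with no relevant information for the application."""
    return [seg for seg in tagged_segments
            if any(props["sense"] in RELEVANT_SENSES for _, props in seg)]

text = "Acme hired Jane. Rain fell all day."
segments = text_zoner(text)
tagged = [preprocessor(s) for s in segments]
kept = filter_module(tagged)  # the weather sentence is filtered out
```

Only the sentence containing a relevant sense survives the Filter, so the later, more expensive modules see less input.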
The following modules, including the optional Pre-parser and Fragment Combiner
modules, are geared toward analyzing the grammatical relationships among the
words to create data structures from which sentence meaning can be
interpreted. Because of the difficulty of analyzing these relationships, many
systems include a Pre-parser module here to identify sequences
or combinations of words that form phrases. Accessing grammar rules, the next
module, the Parser, analyzes the sequences of words and phrases and tries to
understand the grammatical relationships among the constituents. The output
is either a successfully analyzed (parsed) sentence with relationships among the
sentence constituents labelled or a partially analyzed sentence with some constituent
relationships labelled and other constituents left as unattached fragments. It
is these unattached fragments that bring the Fragment Combiner module into play to try
to turn a partially labelled sentence with fragments into a completely labelled
one. With the grammatical relationships identified, either a fully analyzed sentence
or a partially analyzed sentence containing fragments is then processed by the
Semantic Interpreter. This module interprets the labelled grammatical relationships and
generates a representation of the sentence meaning in some form. The next module,
the Lexical Disambiguation module, replaces the representation of any ambiguous
words within the sentence with a specific, unambiguous representation. The next
step, the Coreference Resolution module, takes the meaning representation for the
sentences within a text (from the Semantic Interpreter) and identifies which of the
events or entities that occur in the data structures of the individual sentences actually
refer to the same entity or event in the real world, a critical step to avoid database
duplication. The final module is the Template Generator in which information output
by the Semantic Interpreter and Coreference Resolution modules is turned into
template fills in the desired format.
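The final two stages can be sketched as follows. The alias table below is an invented stand-in for a real coreference model, and the per-sentence meaning records imitate what the Semantic Interpreter might emit for two sentences about the same company and person.

```python
# Hypothetical coreference knowledge: which mentions name the same entity.
ALIASES = {"the company": "Acme Corp", "she": "Jane Doe"}

# Per-sentence meaning representations from the Semantic Interpreter (invented).
sentence_meanings = [
    {"event": "hire",    "agent": "Acme Corp",   "patient": "Jane Doe"},
    {"event": "promote", "agent": "the company", "patient": "she"},
]

def resolve_coreference(meanings):
    """Coreference Resolution: replace aliased mentions with canonical names,
    so both events are attributed to the same real-world entities."""
    return [{k: ALIASES.get(v, v) for k, v in m.items()} for m in meanings]

def template_generator(meanings):
    """Template Generator: turn resolved events into flat template fills."""
    return [f"EVENT: {m['event']} | ORG: {m['agent']} | PERSON: {m['patient']}"
            for m in meanings]

fills = template_generator(resolve_coreference(sentence_meanings))
for fill in fills:
    print(fill)
```

Without the coreference step, "the company" and "Acme Corp" would produce two separate database entries for the same organization, which is exactly the duplication the module exists to prevent.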
4. Conclusion
The exponential growth of multifaceted unstructured big data is creating challenges in
context-aware analytics, data-driven decision-making, and data management. IE
techniques are important for extracting useful information from unstructured data and
thereby improving the effectiveness of data analytics. In this regard, a structured review was
conducted to investigate the limitations of existing IE techniques for unstructured big
data analysis. For this reason, state-of-the-art IE subtasks and their techniques for
different types of data (text, images, audio, and video) have been discussed briefly. This
review also investigated the effectiveness of existing IE techniques for unstructured big
data. It has been observed that the variety of big data is creating numerous challenges for
traditional IE systems in terms of accuracy, scalability, generalizability, and usability,
which leads us to a new era of advanced IE approaches with new opportunities and
challenges. It has also been concluded that hybrid approaches (i.e., combinations of
learning-based and rule-based methods) achieve better performance in IE, whereas the
quality of data has a significant impact on the effectiveness of the results. However, many
challenges are still associated with these hybrid approaches, such as language barriers,
domain issues, and appropriate method selection for the task at hand. These
challenges are specific to the IE process, but scalability, quality, heterogeneity, and
interoperability are also critical factors in IE from unstructured big data. To
overcome the limitations of existing IE techniques, more advanced and adaptive
pre-processing techniques are required to resolve quality and usability issues. Further,
some suggestions have also been provided in this review by critically analyzing the
literature and limitations of existing solutions. Our analysis finds that there is a
significant potential to improve the analysis process in terms of context-aware analytics
systems. Advanced techniques and methods for IE systems, particularly for multifaceted
unstructured big data sets, are urgently required. Existing approaches and
methods do not apply to all domains and all types of data, or even to every variant
within a single data type. There is a need to develop new techniques and refine existing
ones for the data pre-processing stage, which can significantly reduce problems in data
sets later used for IE, knowledge discovery, and decision-making.
References