FIG. 1.3.1 The great divide. (Diagram labels: professional documents, culture documents, corporate data, unstructured, repetitive, nonrepetitive.)
Repetitive unstructured data are data that occur very often and whose records
are almost identical in terms of structure and content. There are many examples
of repetitive unstructured data—telephone call records, metered data, analog
data, and so forth.
Nonrepetitive unstructured data are data that consist of records of data where
the records are not similar, in terms of either structure or content. There are
many examples of nonrepetitive unstructured data—e-mails, call center conver-
sations, warranty claims, and so forth.
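The distinction above can be made concrete in a few lines of code. The sketch below is purely illustrative (the record layouts and the similarity rule are invented for this example, not taken from any real system): it treats records as repetitive when they all share the same field structure, which is the property telephone call records have and e-mails lack.

```python
# Illustrative sketch: classifying records as repetitive or nonrepetitive
# by comparing their field structure. Record layouts are hypothetical.

def structure_of(record: dict) -> frozenset:
    """A record's 'structure' here is simply its set of field names."""
    return frozenset(record.keys())

def is_repetitive(records: list) -> bool:
    """Records are treated as repetitive when they all share one structure."""
    return len({structure_of(r) for r in records}) == 1

# Telephone call records: the same fields over and over -> repetitive.
calls = [
    {"from": "555-0101", "to": "555-0199", "seconds": 42},
    {"from": "555-0102", "to": "555-0150", "seconds": 317},
]

# E-mails: free-form, dissimilar records -> nonrepetitive.
emails = [
    {"subject": "Warranty claim", "body": "My toaster caught fire."},
    {"thread": ["re: lunch?", "sure, noon"], "attachments": 2},
]

print(is_repetitive(calls))   # True
print(is_repetitive(emails))  # False
```

In practice the test for similarity would also look at content, not just field names, but the one-structure-versus-many contrast is the heart of the divide.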
FIG. 1.3.2 Different types of unstructured data: repetitive unstructured data are Hadoop centric; nonrepetitive unstructured data are textual disambiguation centric.
Hadoop
The repetitive unstructured data are said to be “Hadoop” centric. Being
“Hadoop” centric means that processing of repetitive unstructured data
revolves around processing and managing the Hadoop/big data environment.
The centricity of the repetitive unstructured data is seen in Fig. 1.3.3.
FIG. 1.3.3 Hadoop centric unstructured data.

The center of the Hadoop environment, naturally enough, is Hadoop. Hadoop is one of the technologies by which very large amounts of data can be managed. Hadoop is at the center of what is known as "big data."
Repetitive Unstructured Data
FIG. 1.3.4 Hadoop.

FIG. 1.3.5 Services needed by big data: load, access, delete, analyze, visualize, statistical analysis, and data management.
Hadoop is one of the primary storage mechanisms for big data. The essential
characteristics of Hadoop are that Hadoop
- is capable of managing very large volumes of data,
- manages data on less expensive storage,
- manages data by the “Roman census” method,
- stores data in an unstructured manner.
Because of these operating characteristics of Hadoop, very large volumes of data
can be managed. Hadoop is capable of managing volumes of data significantly
larger than standard relational database management systems.
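The "Roman census" method named in the list above refers to sending the processing to where the data reside, rather than shipping the data to a central processor — just as the Roman census counted people where they lived. The sketch below is a minimal simulation of that idea, with each list standing in for a block of data on a separate node; the partition contents are invented for illustration.

```python
# A minimal sketch of the "Roman census" method that underlies
# Hadoop-style processing: the computation travels to each data
# partition, and only small local summaries travel back.
from collections import Counter

# Pretend each list is a block of data living on a separate node.
partitions = [
    ["error", "ok", "ok", "error"],
    ["ok", "ok", "timeout"],
    ["error", "timeout", "ok"],
]

def local_count(partition):
    """Runs *where the data lives*; only the small Counter is shipped back."""
    return Counter(partition)

def merge(counters):
    """Central step: combine the small per-node summaries."""
    total = Counter()
    for c in counters:
        total += c
    return total

totals = merge(local_count(p) for p in partitions)
print(totals["ok"])     # 5
print(totals["error"])  # 3
```

The key property is that the raw records never leave their partitions; only the counts do. That is what lets the method scale to very large volumes of data on inexpensive storage.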
The big data technology of Hadoop is depicted in Fig. 1.3.4.
But Hadoop/big data is a raw technology. In order to be useful, Hadoop/big
data requires its own unique infrastructure.
The technologies that surround Hadoop/big data serve to manage the data and
to access and analyze the data found in Hadoop. The infrastructure services that
surround Hadoop are seen in Fig. 1.3.5.
The services that surround Hadoop/big data are familiar to anyone who has ever
used a standard DBMS. The difference is that in a standard DBMS, the services
are found in the DBMS itself, while in Hadoop, many of the services have to be
done externally. A second major difference is that throughout the Hadoop/big
data environment, there is the need to service huge volumes of data. The devel-
oper in the Hadoop/big data environment must be prepared to manage and
handle extremely large volumes of data. This means that many infrastructure
tasks can be handled only in the Hadoop/big data environment itself.
Indeed, the Hadoop environment is permeated by the need to be able to handle
extraordinarily large amounts of data. The need to handle large amounts of
data—indeed, almost unlimited amounts of data—is seen in Fig. 1.3.6.
There is then an emphasis on doing the normal tasks of data management in
the Hadoop environment where the process must be able to handle very large
amounts of data.
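One practical consequence of this volume discipline is that even routine tasks must stream data rather than load it into memory at once. The sketch below illustrates that idea in plain Python (the file contents and chunk size are invented for the demonstration; a real Hadoop job would read blocks from HDFS instead of a local file):

```python
# Sketch of the volume discipline the text describes: process records
# in fixed-size chunks so memory use stays constant no matter how
# large the file is. File contents and chunk size are illustrative.
import os
import tempfile

def stream_records(path, chunk_size=64):
    """Yield one record (line) at a time, reading fixed-size chunks."""
    with open(path, "r") as f:
        leftover = ""
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                if leftover:          # emit the final partial line
                    yield leftover
                return
            # Split on newlines; the last fragment may be incomplete,
            # so carry it over into the next chunk.
            *lines, leftover = (leftover + chunk).split("\n")
            for line in lines:
                yield line

# Demo: write a small file, then count records without loading it whole.
with tempfile.NamedTemporaryFile("w", delete=False, suffix=".log") as f:
    f.write("\n".join(f"record-{i}" for i in range(1000)))
    path = f.name

count = sum(1 for _ in stream_records(path))
os.remove(path)
print(count)  # 1000
```

The same generator works unchanged whether the file holds a thousand records or a billion, which is the point: in the Hadoop/big data environment, every infrastructure task has to be written this way.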
CHAPTER 1.3: The "Great Divide"
FIG. 1.3.6
An infinite amount of data.
FIG. 1.3.7 Textual disambiguation centric unstructured data.
Nonrepetitive Unstructured Data
FIG. 1.3.8 From unstructured to structured data: textual ETL transforms unstructured data into a structured form.
In truth, there are many more facets to the process of textual disambiguation
than those shown. Some of the more important facets of textual disambigua-
tion are shown in Fig. 1.3.9.
There is a concern regarding the volume of data that is managed by textual
disambiguation. But the volume of data that can be processed is secondary to the
need to contextualize the data in the first place.
FIG. 1.3.9 Some of the services needed to turn unstructured into structured data: inline patterns, taxonomy/ontology, proximity, homograph resolution, stemming, subdoc processing, stop word processing, acronym resolution, and associative resolution.
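A few of the services named in Fig. 1.3.9 can be illustrated with a toy pipeline. The sketch below is not Textual ETL itself — the stop word list, the acronym table, and the crude suffix-stripping stemmer are all invented for this example — but it shows how stop word processing, acronym resolution, and stemming combine to turn raw text into normalized, context-bearing tokens:

```python
# Toy illustration of three textual-disambiguation services from
# Fig. 1.3.9: stop word processing, acronym resolution, and stemming.
# The word lists and acronym table are hypothetical examples.

STOP_WORDS = {"the", "a", "an", "of", "and", "to", "was", "for"}
ACRONYMS = {"cabg": "coronary artery bypass graft"}  # illustrative entry

def crude_stem(word):
    """Very rough stemming: strip one common suffix."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def textual_etl(raw_text):
    tokens = raw_text.lower().split()
    resolved = []
    for t in tokens:
        # Acronym resolution: expand known acronyms into their meaning.
        expanded = ACRONYMS.get(t, t).split()
        # Stop word processing: drop words that carry no context,
        # then stem what remains.
        resolved += [crude_stem(w) for w in expanded if w not in STOP_WORDS]
    return resolved

print(textual_etl("The patient was scheduled for a CABG"))
# ['patient', 'schedul', 'coronary', 'artery', 'bypas', 'graft']
```

Even this toy version shows why the real services are hard: a production stemmer, acronym table, and stop word list must be tuned to the vocabulary of each document source.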
FIG. 1.3.10 Transformation.
DIFFERENT WORLDS
The worlds of repetitive unstructured data and nonrepetitive unstructured data are very different. This difference is seen in Fig. 1.3.11.
FIG. 1.3.11 Transforming big data.
Part of the reason for the difference between repetitive unstructured data and
nonrepetitive unstructured data lies in the very data themselves. With repetitive
unstructured data, there is not much of a need to discover the context of the
data. With repetitive unstructured data, data occur so frequently and so repeat-
edly that the context of that data is fairly obvious or fairly easy to ascertain. In
addition, repetitive unstructured data typically carry little contextual
information to begin with. Therefore, the emphasis is almost
entirely on the need to manage volumes of data.
But with nonrepetitive unstructured data, there is a great need to derive the con-
text of the data. Before the data can be used analytically, the data need to be
contextualized. And with nonrepetitive unstructured data, deriving the context
of the data is a very complex thing to do. To be sure, there is a need to manage
volumes of data when it comes to nonrepetitive unstructured data. But the
primary need is to contextualize the data in the first place.
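A small example makes the contextualization problem concrete. One of the services in Fig. 1.3.9 is homograph resolution: the same token can mean entirely different things depending on its source. The mapping below is a made-up illustration (the token, its meanings, and the source labels are invented), showing how document origin can supply the missing context:

```python
# Sketch of why contextualization is the hard part: a homograph such
# as "ha" means different things in different documents. The token,
# its meanings, and the source labels are hypothetical examples.
HOMOGRAPHS = {
    "ha": {"cardiology": "heart attack", "agriculture": "hectare"},
}

def contextualize(token, doc_source):
    """Resolve a token using the document's origin as its context."""
    meanings = HOMOGRAPHS.get(token.lower())
    if meanings and doc_source in meanings:
        return meanings[doc_source]
    return token  # no context available: leave the token as-is

print(contextualize("ha", "cardiology"))   # heart attack
print(contextualize("ha", "agriculture"))  # hectare
```

With repetitive data such a lookup is rarely needed, because the context is implicit in the record structure; with nonrepetitive data, every token may require it. That asymmetry is the great divide in miniature.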
For these reasons, there is a “great divide” when it comes to managing and deal-
ing with the different forms of unstructured data.