
CHAPTER 1.3

The “Great Divide”

CLASSIFYING CORPORATE DATA


Corporate data can be classified in many different ways. One of the major clas-
sifications is by structured versus unstructured data. And unstructured data can
be further broken into two categories—repetitive unstructured data and nonre-
petitive unstructured data. This division of data is shown in Fig. 1.3.1.

FIG. 1.3.1
The great divide (corporate data are divided into structured and unstructured data; unstructured data are divided into repetitive and nonrepetitive data).

Repetitive unstructured data are data that occur very often and whose records
are almost identical in terms of structure and content. There are many examples
of repetitive unstructured data—telephone call records, metered data, analog
data, and so forth.
Nonrepetitive unstructured data are data whose records are not similar to one
another, in terms of either structure or content. There are
many examples of nonrepetitive unstructured data—e-mails, call center conver-
sations, warranty claims, and so forth.
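The contrast can be sketched with a few toy records; the field layouts below are hypothetical and are not drawn from any real system:

```python
# Repetitive unstructured data: records are nearly identical in structure,
# e.g. telephone call detail records (illustrative layout only)
call_records = [
    "2019-03-01 08:14:22 from=555-0101 to=555-0199 duration=312",
    "2019-03-01 08:15:03 from=555-0147 to=555-0102 duration=88",
    "2019-03-01 08:15:41 from=555-0123 to=555-0150 duration=1045",
]

# Nonrepetitive unstructured data: records share no common layout,
# e.g. e-mails and call center notes
nonrepetitive_records = [
    "Hi Bob, can we move the meeting to Tuesday? Thanks, Alice",
    "Customer called re: warranty claim; dishwasher leaks at door seal.",
    "FWD: Q3 budget spreadsheet attached, please review before Friday.",
]

# Every repetitive record parses with the same simple rule...
durations = []
for rec in call_records:
    fields = dict(part.split("=") for part in rec.split()[2:])
    durations.append(fields["duration"])
print(durations)  # ['312', '88', '1045']

# ...while the nonrepetitive records have no shared layout to parse.
```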

Data Architecture. https://doi.org/10.1016/B978-0-12-816916-2.00003-6
© 2019 Elsevier Inc. All rights reserved.

THE “GREAT DIVIDE”


Between the two types of unstructured data is what can be termed the
“great divide.”
The “great divide” is the demarcation of repetitive and nonrepetitive records, as
seen in the figure. At first glance, it does not appear that there should be a massive
difference between repetitive unstructured records and nonrepetitive unstructured
records of data. But such is not the case at all. There indeed is a HUGE difference
between repetitive unstructured data and nonrepetitive unstructured data.
The primary distinction between the two types of unstructured data is one of
focus: repetitive unstructured data center on the management of data in the
Hadoop/big data environment, whereas nonrepetitive unstructured data center
on the textual disambiguation of data.
And as shall be seen, this difference in focus makes a huge difference in how
the data are perceived, how the data are used, and how the data are managed.
This difference—the “great divide”—is shown in Fig. 1.3.2.
It is seen then that there is a very different focus between the two types of
unstructured data.

FIG. 1.3.2
Different types of unstructured data (repetitive data are Hadoop centric; nonrepetitive data are textual disambiguation centric).

REPETITIVE UNSTRUCTURED DATA

The repetitive unstructured data are said to be "Hadoop" centric. Being
"Hadoop" centric means that processing of repetitive unstructured data
revolves around processing and managing the Hadoop/big data environment.
The centricity of the repetitive unstructured data is seen in Fig. 1.3.3.

FIG. 1.3.3
Hadoop centric unstructured data.

The center of the Hadoop environment, naturally enough, is Hadoop. Hadoop is
one of the technologies by which very large amounts of data can be managed.
Hadoop is at the center of what is known as "big data."
FIG. 1.3.4
Hadoop.

FIG. 1.3.5
Services needed by big data (load, delete, access, analyze, visualize, statistical analysis, data management).

Hadoop is one of the primary storage mechanisms for big data. The essential
characteristics of Hadoop are that Hadoop
- is capable of managing very large volumes of data,
- manages data on less expensive storage,
- manages data by the “Roman census” method,
- stores data in an unstructured manner.
Because of these operating characteristics of Hadoop, very large volumes of data
can be managed. Hadoop is capable of managing volumes of data significantly
larger than standard relational database management systems.
The big data technology of Hadoop is depicted in Fig. 1.3.4.
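The "Roman census" method named above (each node tallies its own data where the data reside and ships only the small summaries to a central point, rather than moving the raw data) can be sketched as follows; the node names and record values are invented for illustration:

```python
from collections import Counter

# Hypothetical shards of call records, each residing on a different node
node_shards = {
    "node-1": ["call", "text", "call", "call"],
    "node-2": ["text", "text", "call"],
    "node-3": ["call", "data", "data"],
}

def local_census(records):
    """Runs on each node: summarize locally, return only the small tally."""
    return Counter(records)

# Only the compact per-node tallies travel to the central point and are merged;
# the raw records never leave their nodes.
total = Counter()
for node, records in node_shards.items():
    total += local_census(records)

print(dict(total))  # {'call': 5, 'text': 3, 'data': 2}
```

The design point is that the work is brought to the data rather than the data to the work, which is what lets the approach scale to very large volumes.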
But Hadoop/big data is a raw technology. In order to be useful, Hadoop/big
data requires its own unique infrastructure.
The technologies that surround Hadoop/big data serve to manage the data and
to access and analyze the data found in Hadoop. The infrastructure services that
surround Hadoop are seen in Fig. 1.3.5.
The services that surround Hadoop/big data are familiar to anyone who has ever
used a standard DBMS. The difference is that in a standard DBMS, the services
are found in the DBMS itself, while in Hadoop, many of the services have to be
done externally. A second major difference is that throughout the Hadoop/big
data environment, there is the need to service huge volumes of data. The devel-
oper in the Hadoop/big data environment must be prepared to manage and
handle extremely large volumes of data. This means that many infrastructure
tasks can be handled only in the Hadoop/big data environment itself.
Indeed, the Hadoop environment is permeated by the need to be able to handle
extraordinarily large amounts of data. The need to handle large amounts of
data—indeed, almost unlimited amounts of data—is seen in Fig. 1.3.6.
There is then an emphasis on doing the normal tasks of data management in
the Hadoop environment where the process must be able to handle very large
amounts of data.

FIG. 1.3.6
An infinite amount of data.

NONREPETITIVE UNSTRUCTURED DATA


The emphasis in the nonrepetitive unstructured environment is quite different
from the emphasis on the management of the Hadoop big data technology. In
the nonrepetitive unstructured environment, there is an emphasis on “textual
disambiguation” (or on “textual ETL”). This emphasis is shown in Fig. 1.3.7.
Textual disambiguation is the process of taking nonrepetitive unstructured data
and manipulating it into a format that can be analyzed by standard analytic
software. There are many facets to textual disambiguation, but perhaps the
most important functionality is one that can be called “contextualization.”
Contextualization is the process by which text is read and analyzed and the con-
text of the text is derived. Once the context of the text is derived, the text is then
reformatted into a standard database format where the text can be read and ana-
lyzed by standard “business intelligence” software.
The process of textual disambiguation is shown in Fig. 1.3.8.
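As a rough sketch of what contextualization produces (not any particular product's implementation), the fragment below maps recognized words in free-form text to rows in a uniform, database-ready format; the taxonomy and field names are assumptions made for the example:

```python
import re

# Hypothetical taxonomy mapping raw words to a business context
taxonomy = {
    "honda": ("vehicle", "make"),
    "civic": ("vehicle", "model"),
    "leak": ("complaint", "symptom"),
    "brake": ("component", "part"),
}

def contextualize(doc_id, text):
    """Read text, derive context for recognized words, emit uniform rows."""
    rows = []
    for offset, word in enumerate(re.findall(r"[a-z]+", text.lower())):
        if word in taxonomy:
            category, role = taxonomy[word]
            rows.append({"doc": doc_id, "offset": offset,
                         "word": word, "category": category, "role": role})
    return rows

rows = contextualize("claim-001",
                     "Customer reports brake fluid leak on Honda Civic.")
for r in rows:
    print(r)
```

The output rows all share one schema, so standard business intelligence tools can query them even though the input text had no structure at all.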

FIG. 1.3.7
Textual disambiguation centric unstructured data.
FIG. 1.3.8
From unstructured to structured data.

There are many facets to textual disambiguation. Textual disambiguation is
completely free from the limitations of natural language processing (NLP).
In textual disambiguation, there is a multifaceted approach to the
identification and derivation of context.
Some of the techniques used to derive context include the following:
- The integration of external taxonomies and ontologies
- Proximity analysis
- Homographic resolution
- Subdocument processing
- Associative text resolution
- Acronym resolution
- Simple stop word processing
- Simple word stemming
- Inline pattern recognition
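Three of the simpler techniques above (stop word processing, word stemming, and acronym resolution) can be sketched as follows; the word lists are tiny, hand-made stand-ins for the much larger ones a production textual ETL tool would use:

```python
stop_words = {"the", "a", "of", "and", "to"}
acronyms = {"po": "purchase order", "ucc": "uniform commercial code"}

def remove_stop_words(words):
    """Simple stop word processing: drop words that carry no context."""
    return [w for w in words if w not in stop_words]

def stem(word):
    """Simple word stemming: crude suffix stripping (illustrative only)."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def resolve_acronyms(words):
    """Acronym resolution: expand known acronyms to their literal meaning."""
    return [acronyms.get(w, w) for w in words]

words = "the po covers shipping and handling of parts".split()
words = remove_stop_words(words)   # drops 'the', 'and', 'of'
words = resolve_acronyms(words)    # 'po' -> 'purchase order'
print([stem(w) for w in words])
# ['purchase order', 'cover', 'shipp', 'handl', 'part']
```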

In truth, there are many more facets to the process of textual disambiguation
than those shown. Some of the more important facets of textual disambigua-
tion are shown in Fig. 1.3.9.
There is a concern regarding the volume of data that is managed by textual
disambiguation. But the volume of data that can be processed is secondary to
the transformation of data that occurs during the process. Simply stated, it
doesn't matter how fast you can process data if you cannot understand what it
is that you are processing. The fact that textual disambiguation is dominated
by transformation is depicted in Fig. 1.3.10.

FIG. 1.3.9
Some of the services needed to turn unstructured into structured data (taxonomy/ontology, proximity analysis, inline patterns, homograph resolution, stemming, subdocument processing, stop word processing, acronym resolution, associative resolution).

FIG. 1.3.10
Transformation.
There is then a completely different emphasis on the processing that occurs in
the repetitive unstructured world versus the processing that occurs in the non-
repetitive unstructured world.

DIFFERENT WORLDS
This difference is seen in Fig. 1.3.11.

FIG. 1.3.11
Transforming big data.

Part of the reason for the difference between repetitive unstructured data and
nonrepetitive unstructured data lies in the very data themselves. With repetitive
unstructured data, there is not much of a need to discover the context of the
data. With repetitive unstructured data, data occur so frequently and so repeat-
edly that the context of that data is fairly obvious or fairly easy to ascertain. In
addition, there typically are not much contextual data to begin with when it
comes to repetitive unstructured data. Therefore, the emphasis is almost
entirely on the need to manage volumes of data.
But with nonrepetitive unstructured data, there is a great need to derive the con-
text of the data. Before the data can be used analytically, the data need to be
contextualized. And with nonrepetitive unstructured data, deriving the context
of the data is a very complex thing to do. For sure, there is a need to manage
volumes of data when it comes to nonrepetitive unstructured data. But the pri-
mary need is the need to contextualize the data in the first place.
For these reasons, there is a “great divide” when it comes to managing and deal-
ing with the different forms of unstructured data.
