
Structured vs. Unstructured Data
By Christine Taylor, Posted March 28, 2018

Structured data is far easier for Big Data programs to digest, while the
myriad formats of unstructured data create a greater challenge. Yet
both types of data play a key role in effective data analysis.

Structured data vs. unstructured data: structured data consists of clearly defined
data types whose pattern makes them easily searchable, while unstructured data
(“everything else”) consists of data that is usually not as easily searchable,
including formats like audio, video, and social media postings.

Unstructured data vs. structured data does not denote any real conflict between the
two. Customers select one or the other not based on their data structure, but on the
applications that use them: relational databases for structured data, and almost any
other type of application for unstructured data.


However, there is a growing tension between the ease of analysis on structured data
and the more challenging analysis on unstructured data. Structured data analytics is a
mature process and technology. Unstructured data analytics is a nascent industry that
attracts a lot of new R&D investment, but it is not yet a mature technology. For
corporations, the structured vs. unstructured data question is whether to invest in
analytics for unstructured data, and whether the two can be aggregated into better
business intelligence.

What is Structured Data?


Structured data usually resides in relational databases (RDBMS). Fields store
length-delineated data such as phone numbers, Social Security numbers, or ZIP codes.
Even text strings of variable length, like names, are contained in records, which makes
them simple to search. Data may be human- or machine-generated as long as it is created
within an RDBMS structure. This format is eminently searchable, both with
human-generated queries and via algorithms that use field names and data types such as
alphabetic or numeric, currency or date.

Common relational database applications with structured data include airline
reservation systems, inventory control, sales transactions, and ATM activity. Structured
Query Language (SQL) enables queries on this type of structured data within relational
databases.
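
As a minimal sketch of such a query, the following uses Python's built-in sqlite3 module with an in-memory database. The `sales` table, its columns, and the sample rows are hypothetical, chosen only to show how typed, predefined fields make the data easy to search and aggregate.

```python
import sqlite3

# Hypothetical sample rows: (customer, zip_code, sale_date, amount).
SAMPLE_ROWS = [
    ("Ada", "94105", "2018-03-01", 120.0),
    ("Ben", "10001", "2018-03-02", 75.5),
    ("Cy", "94105", "2018-03-05", 30.0),
]


def total_sales_by_zip(rows):
    """Load rows into a typed table and return total sales per ZIP code."""
    conn = sqlite3.connect(":memory:")
    # Every column has a declared name and type: this is the predefined schema.
    conn.execute(
        "CREATE TABLE sales ("
        " customer TEXT NOT NULL,"
        " zip_code TEXT NOT NULL,"
        " sale_date TEXT NOT NULL,"
        " amount REAL NOT NULL)"
    )
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?, ?)", rows)
    result = conn.execute(
        "SELECT zip_code, SUM(amount) FROM sales "
        "GROUP BY zip_code ORDER BY zip_code"
    ).fetchall()
    conn.close()
    return result


if __name__ == "__main__":
    for zip_code, total in total_sales_by_zip(SAMPLE_ROWS):
        print(zip_code, total)
```

Because the schema names each field in advance, the query can group and sum by column without inspecting the content of any individual record.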

Some relational databases, such as those behind customer relationship management (CRM)
applications, do store or point to unstructured data. The integration can be awkward at
best, since memo fields do not lend themselves to traditional database queries. Still,
most of the CRM data is structured.

What is Unstructured Data?


Unstructured data is essentially everything else. Unstructured data has internal
structure but is not organized by a pre-defined data model or schema. It may be textual
or non-textual, and human- or machine-generated. It may also be stored within a
non-relational (NoSQL) database.
Typical human-generated unstructured data includes:

• Text files: Word processing, spreadsheets, presentations, email, logs.

• Email: Email has some internal structure thanks to its metadata, and we
sometimes refer to it as semi-structured. However, its message field is unstructured
and traditional analytics tools cannot parse it.

• Social Media: Data from Facebook, Twitter, LinkedIn.

• Website: YouTube, Instagram, photo sharing sites.

• Mobile data: Text messages, locations.

• Communications: Chat, IM, phone recordings, collaboration software.

• Media: MP3, digital photos, audio and video files.

• Business applications: MS Office documents, productivity applications.

Typical machine-generated unstructured data includes:

• Satellite imagery: Weather data, land forms, military movements.

• Scientific data: Oil and gas exploration, space exploration, seismic imagery,
atmospheric data.

• Digital surveillance: Surveillance photos and video.

• Sensor data: Traffic, weather, oceanographic sensors.

Structured vs. Unstructured Data: What’s the Difference?

Besides the obvious difference between storing in a relational database and storing
outside of one, the biggest difference is the ease of analyzing structured data vs.
unstructured data. Mature analytics tools exist for structured data, but analytics tools
for mining unstructured data are nascent and developing.

Users can run simple content searches across textual unstructured data. But its lack of
orderly internal structure defeats traditional data mining tools, and the
enterprise gets little value from potentially valuable data sources like rich media,
network or web logs, customer interactions, and social media data. Even though
unstructured data analytics tools are in the marketplace, no one vendor or toolset is a
clear winner. And many customers are reluctant to invest in analytics tools with
uncertain development roadmaps.

On top of this, there is simply much more unstructured data than structured.
Unstructured data makes up 80% or more of enterprise data, and is growing at a rate
of 55% to 65% per year. Without the tools to analyze this mass of data,
organizations are leaving vast amounts of valuable data on the business intelligence
table.

Structured data is traditionally easier for Big Data applications to digest, yet today's
data analytics solutions are making great strides in handling unstructured data.
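
To put the growth figures above in perspective, the arithmetic can be sketched in a few lines. The calculation assumes steady compound growth at the quoted annual rates, which is a simplifying assumption rather than a claim made by the figures themselves.

```python
import math


def doubling_time_years(annual_growth_rate):
    """Years for a quantity to double under steady compound growth."""
    return math.log(2) / math.log(1 + annual_growth_rate)


def growth_factor(annual_growth_rate, years):
    """How many times larger the quantity is after the given number of years."""
    return (1 + annual_growth_rate) ** years


if __name__ == "__main__":
    for rate in (0.55, 0.65):
        print(
            f"{rate:.0%} per year: doubles in about "
            f"{doubling_time_years(rate):.2f} years, "
            f"grows {growth_factor(rate, 3):.1f}x in 3 years"
        )
```

Under that assumption, data growing 55% to 65% a year doubles roughly every 1.4 to 1.6 years.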

How Semi-Structured Data Fits with Structured and Unstructured Data

Semi-structured data maintains internal tags and markings that identify separate data
elements, which enables information grouping and hierarchies. Both documents and
databases can be semi-structured. This type of data represents only about 5-10% of the
structured/semi-structured/unstructured data pie, but has critical business use
cases.

Email is a very common example of a semi-structured data type. Although more
advanced analysis tools are necessary for thread tracking, near-dedupe, and concept
searching, email’s native metadata enables classification and keyword searching
without any additional tools.

Email is a huge use case, but most semi-structured development centers on easing data
transport issues. Sharing sensor data is a growing use case, as are Web-based data
sharing and transport: electronic data interchange (EDI), many social media platforms,
document markup languages, and NoSQL databases.
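
The split between email's structured metadata and its unstructured body can be illustrated with Python's standard `email` package. This is a minimal sketch: the message, addresses, and the `split_email` helper are made up for illustration.

```python
from email import message_from_string
from email.policy import default

# Hypothetical message: headers (metadata) first, then a free-text body.
RAW_EMAIL = """\
From: alice@example.com
To: bob@example.com
Subject: Q1 invoice question
Date: Wed, 28 Mar 2018 09:30:00 +0000

Hi Bob, the invoice total looks off. Can you check it before Friday?
"""


def split_email(raw):
    """Return (metadata dict, body text) for a simple plain-text email."""
    msg = message_from_string(raw, policy=default)
    # Headers are named fields, so they can be read like columns in a record.
    metadata = {name: str(msg[name]) for name in ("From", "To", "Subject", "Date")}
    # The body is free text: nothing in the format says what it means.
    body = msg.get_content().strip()
    return metadata, body


if __name__ == "__main__":
    metadata, body = split_email(RAW_EMAIL)
    print(metadata["Subject"])
    print(body)
```

Filtering on `metadata["Subject"]` or `metadata["From"]` needs no extra tooling, while understanding what the body is about requires text analytics.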

Examples of Semi-structured Data

• Markup language XML: This is a semi-structured document language. XML is a
set of document encoding rules that defines a human- and machine-readable format.
(Although saying that XML is human-readable doesn’t pack a big punch: anyone
trying to read an XML document has better things to do with their time.) Its value is
that its tag-driven structure is highly flexible, and coders can adapt it to universalize
data structure, storage, and transport on the Web.

• Open standard JSON (JavaScript Object Notation): JSON is another semi-structured
data interchange format. JavaScript is in the name, but the format is
language-independent and parsers exist for most programming languages. Its structure
consists of a collection of name/value pairs (an object, hash table, etc.) and an
ordered list of values (an array, sequence, or list). Since the structure is
interchangeable among languages, JSON excels at transmitting data between web
applications and servers.

• NoSQL: Semi-structured data is also an important element of many NoSQL (“not
only SQL”) databases. NoSQL databases differ from relational databases because
they do not separate the organization (schema) from the data. This makes NoSQL a
better choice to store information that does not easily fit into the record and table
format, such as text with varying lengths. It also allows for easier data exchange
between databases. Some newer NoSQL databases like MongoDB and Couchbase also
incorporate semi-structured documents by natively storing them in the JSON format.
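
To make the XML and JSON descriptions above concrete, here is a minimal sketch that parses the same record in both formats using only the Python standard library. The `customer` record and its field names are hypothetical.

```python
import json
import xml.etree.ElementTree as ET

# The same hypothetical customer record, once as XML and once as JSON.
XML_DOC = """\
<customer id="42">
  <name>Ada Lovelace</name>
  <phones>
    <phone>555-0100</phone>
    <phone>555-0101</phone>
  </phones>
</customer>
"""

JSON_DOC = '{"id": 42, "name": "Ada Lovelace", "phones": ["555-0100", "555-0101"]}'


def parse_xml(doc):
    """Walk the tag hierarchy and build a plain dictionary."""
    root = ET.fromstring(doc)
    return {
        "id": int(root.get("id")),
        "name": root.findtext("name"),
        "phones": [phone.text for phone in root.findall("phones/phone")],
    }


def parse_json(doc):
    """JSON objects and arrays map directly onto dicts and lists."""
    return json.loads(doc)


if __name__ == "__main__":
    print(parse_xml(XML_DOC))
    print(parse_json(JSON_DOC))
```

In both cases the tags or names travel with the data, so a program can locate each element without an external schema.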
In big data environments, NoSQL does not require admins to separate operational and
analytics databases into separate deployments. NoSQL is the operational database and
hosts native analytics tools for business intelligence. In Hadoop environments, NoSQL
databases ingest and manage incoming data and serve up analytic results.

These databases are common in big data infrastructure and real-time Web applications
like LinkedIn. On LinkedIn, hundreds of millions of business users freely share job titles,
locations, skills, and more, and LinkedIn captures this mass of data in a semi-structured
format. When job-seeking users create a search, LinkedIn matches the query to its
massive semi-structured data stores, cross-references data to hiring trends, and shares
the resulting recommendations with job seekers. The same process operates with sales
and marketing queries in LinkedIn's premium services. Amazon also
bases its reader recommendations on semi-structured databases.
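
The idea of matching a query against semi-structured records can be sketched with plain Python dictionaries. This is an illustration only: the profiles, field names, and `find_by_skill` function are hypothetical and do not describe how LinkedIn or any other service implements search.

```python
# Hypothetical profiles: each document carries its own field names, and not
# every document has the same fields (no fixed table schema).
PROFILES = [
    {"name": "Ada", "title": "Data Engineer", "skills": ["python", "sql"],
     "location": "London"},
    {"name": "Ben", "skills": ["sales", "crm"]},  # no title or location
    {"name": "Cy", "title": "Analyst", "skills": ["sql", "excel"],
     "languages": ["en", "fr"]},  # extra field the others lack
]


def find_by_skill(profiles, skill):
    """Return names of profiles that list the given skill (case-insensitive)."""
    wanted = skill.lower()
    return [
        profile["name"]
        for profile in profiles
        # .get() tolerates documents that omit the "skills" field entirely.
        if wanted in (s.lower() for s in profile.get("skills", []))
    ]


if __name__ == "__main__":
    print(find_by_skill(PROFILES, "SQL"))
```

Records with missing or extra fields are handled without altering a table definition, which is the flexibility the semi-structured format provides.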

Structured vs. Unstructured Data: Next Gen Tools are Game Changers

New tools are available to analyze unstructured data, particularly given specific use
case parameters. Most of these tools are based on machine learning. Structured data
analytics can use machine learning as well, but the massive volume and many different
types of unstructured data require it.

A few years ago, analysts using keywords and key phrases could search unstructured
data and get a decent idea of what the data involved. eDiscovery was (and is) a prime
example of this approach. However, unstructured data has grown so dramatically that
users need to employ analytics that not only work at compute speeds, but also
automatically learn from their activity and user decisions. Natural Language Processing
(NLP), pattern sensing and classification, and text-mining algorithms are all common
examples, as are document relevance analytics, sentiment analysis, and filter-driven
Web harvesting. Unstructured data analytics with machine-learning intelligence allows
organizations to:

• Analyze digital communications for compliance. Failed compliance can cost
companies millions of dollars in fees, litigation, and lost business. Pattern recognition
and email threading analysis software searches massive amounts of email and chat
data for potential noncompliance. A recent example is Volkswagen, which
might have avoided huge fines and reputational hits by using analytics to
monitor communications for suspicious messages.

• Track high-volume customer conversations in social media. Text analytics and
sentiment analysis let analysts review positive and negative results of marketing
campaigns, or even identify online threats. This level of analytics is far more
sophisticated than simple keyword search, which can only report basics like how often
posters mentioned the company name during a new campaign. New analytics also
include context: was the mention positive or negative? Were posters reacting to each
other? What was the tone of reactions to executive announcements? The automotive
industry, for example, is heavily involved in analyzing social media, since car buyers
often turn to other posters to gauge their car buying experience. Analysts use a
combination of text mining and sentiment analysis to track auto-related user posts on
Twitter and Facebook.

• Gain new marketing intelligence. Machine-learning analytics tools quickly work
on massive amounts of documents to analyze customer behavior. A major magazine
publisher applied text mining to hundreds of thousands of articles, analyzing each
separate publication by the popularity of major subtopics. Then they extended
analytics across all their content properties to see which overall topics got the most
attention by customer demographic. The analytics ran across hundreds of thousands
of pieces of content across all publications, and cross-referenced hot topic results by
segments. The result was a rich education on which topics were most interesting to
distinct customers, and which marketing messages resonated most strongly with
them.
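
The gap between keyword search and sentiment analysis described in the list above can be shown with a toy example. The tools discussed here rely on machine learning; this sketch substitutes a tiny hand-written word list, so it is a stand-in for the idea rather than a working sentiment model. The brand name "Acme", the posts, and the word lists are all made up.

```python
import re

# Hypothetical, deliberately tiny sentiment lexicons.
POSITIVE = {"love", "great", "reliable"}
NEGATIVE = {"hate", "broken", "terrible"}

POSTS = [
    "I love my new Acme sedan, great mileage",
    "Acme service was terrible and my car is still broken",
    "Thinking about test driving an Acme this weekend",
]


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def mention_count(posts, keyword):
    """Keyword search: how many posts mention the keyword at all."""
    wanted = keyword.lower()
    return sum(1 for post in posts if wanted in tokenize(post))


def sentiment_score(post):
    """Lexicon score: positive words minus negative words in one post."""
    tokens = tokenize(post)
    positives = sum(token in POSITIVE for token in tokens)
    negatives = sum(token in NEGATIVE for token in tokens)
    return positives - negatives


if __name__ == "__main__":
    print("mentions:", mention_count(POSTS, "Acme"))
    for post in POSTS:
        print(sentiment_score(post), post)
```

The keyword count treats all three posts alike, while even this crude score separates the enthusiastic post from the complaint and the neutral one.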

In eDiscovery, data scientists use keywords to search unstructured data and get a
reasonable idea of the data involved.

No matter what your business specifics are, today’s goal is to tap business value
whether the data is structured or unstructured. Both types of data potentially hold a
great deal of value, and newer tools can aggregate, query, analyze, and leverage all data
types for deep business insight across the universe of corporate data.
