Beyond Volume, Variety and Velocity is the Issue of Big Data Veracity

September 12, 2015 by Kevin Normandeau


We have all heard of the 3Vs of big data: Volume, Variety and Velocity.
Yet Inderpal Bhandar, Chief Data Officer at Express Scripts, noted in his
presentation at the Big Data Innovation Summit in Boston that there are
additional Vs that IT, business and data scientists need to be concerned with,
most notably big data Veracity. Other big data Vs getting attention at the
summit are validity and volatility. Here is an overview of the 6Vs of big data.
Volume
Big data implies enormous volumes of data. It used to be that employees created
data. Now that data is generated by machines, networks and human interaction
on systems like social media, the volume of data to be analyzed is massive. Yet
Inderpal states that the volume of data is not as much the problem as other Vs
like veracity.
Variety
Variety refers to the many sources and types of data, both structured and
unstructured. We used to store data from sources like spreadsheets and
databases. Now data comes in the form of emails, photos, videos, monitoring
devices, PDFs, audio, etc. This variety of unstructured data creates problems for
storing, mining and analyzing data. Jeff Veis, VP Solutions at HP Autonomy,
presented how HP is helping organizations deal with big challenges
including data variety.
Velocity
Big Data Velocity deals with the pace at which data flows in from sources like
business processes, machines, networks and human interaction with things like
social media sites, mobile devices, etc. The flow of data is massive and
continuous. This real-time data can help researchers and businesses make
valuable decisions that provide strategic competitive advantages and ROI, if you
are able to handle the velocity. Inderpal suggests that sampling data can help
deal with issues like volume and velocity.
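
To make the sampling suggestion concrete, here is a minimal sketch of one common technique, reservoir sampling, which keeps a fixed-size uniform random sample from a stream whose length is not known in advance. The presentation did not name a specific method, so the choice of technique, the function name `reservoir_sample` and its parameters are our own assumptions for illustration.

```python
import random
from typing import Iterable, List, Optional, TypeVar

T = TypeVar("T")


def reservoir_sample(stream: Iterable[T], k: int, seed: Optional[int] = None) -> List[T]:
    """Keep a uniform random sample of k items from a stream of unknown length.

    Hypothetical helper for illustration; only k items are held in memory.
    """
    rng = random.Random(seed)
    reservoir: List[T] = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Pick a slot in [0, i]; the new item survives with probability k / (i + 1).
            j = rng.randint(0, i)
            if j < k:
                reservoir[j] = item
    return reservoir


# Example: sample 100 readings from a stream of 100,000 without storing them all.
sample = reservoir_sample(range(100_000), k=100, seed=42)
```

Because the sample size is fixed, memory use stays constant no matter how fast or how long the data keeps arriving, which is why this approach addresses both volume and velocity.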
Veracity
Big Data Veracity refers to the biases, noise and abnormality in data. Is the data
that is being stored and mined meaningful to the problem being analyzed?
Inderpal feels veracity in data analysis is the biggest challenge when compared to
things like volume and velocity. In scoping out your big data strategy you need
to have your team and partners work to help keep your data clean, and to put
processes in place to keep dirty data from accumulating in your systems.
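
One way to keep dirty data from accumulating is to check records at the point of ingestion and set aside the ones that fail. The sketch below is only an illustration: the field names `customer_id` and `amount`, the rules, and the helper names are hypothetical, and a real pipeline would use rules specific to its own data.

```python
from typing import Any, Dict, List, Tuple

Record = Dict[str, Any]


def find_problems(record: Record) -> List[str]:
    """Return a list of quality problems found in one record (hypothetical rules)."""
    problems: List[str] = []
    if not record.get("customer_id"):
        problems.append("missing customer_id")
    amount = record.get("amount")
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(amount, bool) or not isinstance(amount, (int, float)):
        problems.append("amount is not a number")
    elif amount < 0:
        problems.append("amount is negative")
    return problems


def split_clean_dirty(records: List[Record]) -> Tuple[List[Record], List[Tuple[Record, List[str]]]]:
    """Separate records that pass every check from those that fail at least one."""
    clean: List[Record] = []
    dirty: List[Tuple[Record, List[str]]] = []
    for record in records:
        problems = find_problems(record)
        if problems:
            dirty.append((record, problems))
        else:
            clean.append(record)
    return clean, dirty


records = [
    {"customer_id": "c1", "amount": 19.99},
    {"customer_id": "", "amount": 5},
    {"customer_id": "c3", "amount": -2},
    {"customer_id": "c4", "amount": "n/a"},
]
clean, dirty = split_clean_dirty(records)
```

Keeping the rejected records together with the reasons they failed, rather than silently dropping them, lets the team see where dirty data is coming from.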
Validity
Closely related to big data veracity is the issue of validity: is the data
correct and accurate for the intended use? Clearly, valid data is key to making
the right decisions. Phil Francisco, VP of Product Management at IBM, spoke
about IBM's big data strategy and the tools they offer to help with data
veracity and validity.
Volatility
Big data volatility refers to how long data is valid and how long it should be
stored. In this world of real-time data you need to determine at what point
data is no longer relevant to the current analysis.
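
As a rough illustration of volatility, the sketch below drops records older than a chosen maximum age before analysis. The `observed_at` field, the one-hour window and the function name are assumptions made for the example; the right cutoff depends entirely on the analysis at hand.

```python
from datetime import datetime, timedelta, timezone
from typing import Any, Dict, List, Optional

Record = Dict[str, Any]


def still_relevant(records: List[Record], max_age: timedelta,
                   now: Optional[datetime] = None) -> List[Record]:
    """Keep only records observed within max_age of now (hypothetical helper)."""
    if now is None:
        now = datetime.now(timezone.utc)
    cutoff = now - max_age
    return [r for r in records if r["observed_at"] >= cutoff]


now = datetime(2015, 9, 12, 12, 0, tzinfo=timezone.utc)
readings = [
    {"sensor": "a", "observed_at": now - timedelta(minutes=5)},
    {"sensor": "b", "observed_at": now - timedelta(hours=2)},
]
fresh = still_relevant(readings, max_age=timedelta(hours=1), now=now)
```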
Big data clearly deals with issues beyond volume, variety and velocity to other
concerns like veracity, validity and volatility.

Summary: We've scoured the literature to bring you a complete listing of
possible definitions of Big Data, with the goal of being able to determine what's
a Big Data opportunity and what's not. Our conclusion is that Volume, Variety,
and Velocity still make the best definitions, but none of these stands on its own
in distinguishing Big Data from not-so-big data. Understanding these
characteristics will help you analyze whether an opportunity calls for a Big
Data solution, but the key is to understand that this is really about breakthrough
changes in the technology of storing, retrieving, and analyzing data, and then
finding the opportunities that can best take advantage.

What Business Users Want to Know


Conversations with business users invariably start with the question "What is
Big Data?" Implicit in the question is that if it can be defined, then they can
understand where it currently exists, where the opportunities to be exploited
may lie, and when and how the business user will need to deal with it.
Sounds simple enough, but as we observed in a prior posting there are many
different characteristics of Big Data on which data scientists agree, but none
which by itself can be used to say that this example is Big Data and that
one is not. Happily, almost everyone who has weighed in on this conversation
has chosen descriptors that begin with V, hence the name of this article. Most
commonly you will hear Volume, Variety, and Velocity. These may be the most
common but are by no means the only descriptors that have been used.
You would think this would be settled by now, but a scan of the literature says
otherwise. In fact, we were able to find eight, count them, eight different
characteristics claimed for Big Data.
Volume
Volume always seems to head each list. There is general agreement that if
volume is in the gigabytes it is probably not Big Data, but at the terabyte and
petabyte level and beyond it may very well be. Volume is a key contributor to
the problem of why traditional relational database management systems
(RDBMS, data warehouses as we know them today) fail to handle Big Data.
Underlying that failure are more complex issues of cost, reliability, long query
times, and their inability to handle new sources of unstructured or
semi-structured data like text.
Big companies are no strangers to Big Data. As early as the 1980s UPS began
to capture and track data on package movements that now number 16.3 million
packages per day while responding to 39.5 million tracking requests per day,
now storing over 16 petabytes of data.[i] Wal-Mart records more than 1 million
customer transactions per hour, generating more than 2.5 petabytes of data.[ii]
And in one survey 17% of companies report currently managing more than a
petabyte of data with an additional 22% reporting hundreds of terabytes.[iii]
So if close to 40% of companies report already managing terabytes of data or
more, what's changed? What's changed is the desire to unleash the knowledge
contained in transactional stores and external data sources through analysis, and
when that happens the new NoSQL storage and retrieval architectures and tools
become important.
Variety
Different Types: Variety describes different formats of data that do not lend
themselves to storage in structured relational database systems. These include a
long list of data such as documents, emails, social media text messages, video,
still images, audio, graphs, and the output from all types of machine-generated
data from sensors, devices, RFID tags, machine logs, cell phone GPS signals,
DNA analysis devices, and more. This type of data is characterized as
unstructured or semi-structured and has existed all along. In fact, it's estimated
by some studies to account for 90% or more of the data in organizations.

Different Sources: Variety is also used to mean data from many different
sources, both inside and outside of the company. What's changed is the
realization that through analysis it can yield new and valuable insights not
previously available.
There are two primary challenges here. First, storing and retrieving these data
types quickly and cost efficiently. Second, during analysis, blending or aligning
data types from different sources so that all types of data describing a single
event can be extracted and analyzed together.
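
To illustrate the second challenge, the sketch below groups records from several sources by a shared event identifier so that everything describing one event can be pulled out together. It assumes each record already carries a common key, here called `event_id`; in practice, establishing that shared key is often the hard part. The source names and helper name are hypothetical.

```python
from typing import Any, Dict, List

Record = Dict[str, Any]


def blend_by_event(sources: Dict[str, List[Record]],
                   key: str = "event_id") -> Dict[Any, Dict[str, List[Record]]]:
    """Group records from every source under the event they describe."""
    blended: Dict[Any, Dict[str, List[Record]]] = {}
    for source_name, records in sources.items():
        for record in records:
            per_event = blended.setdefault(record[key], {})
            # A source may report more than once on the same event, so keep a list.
            per_event.setdefault(source_name, []).append(record)
    return blended


sources = {
    "web_log": [
        {"event_id": "e1", "page": "/checkout"},
        {"event_id": "e2", "page": "/home"},
    ],
    "sensor": [
        {"event_id": "e1", "temp_c": 21.5},
    ],
}
blended = blend_by_event(sources)
```

Here `blended["e1"]` holds both the web log entry and the sensor reading for the same event, while `blended["e2"]` holds only the web log entry.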
Then there is the interaction of variety with volume. Unstructured data is
growing much more rapidly than structured data. Gartner estimates that
unstructured data doubles every three months and offers the example that there
are seven million web pages added each day.
In terms of opportunity, Variety is seen by business users as the major focus of
new Big Data initiatives. Companies have been handling large volumes of data
for many years and view that process as incremental and business as usual.
But the new and unique opportunity to add unstructured data to the analytic mix
is seen by many as a game changer.[iv]
Velocity
Data-In-Motion: Data scientists like to talk about data-at-rest and
data-in-motion. One meaning of Velocity is to describe data-in-motion, for example,
the stream of readings taken from a sensor or the web log history of page visits
and clicks by each visitor to a web site. This can be thought of as a fire hose of
incoming data that needs to be captured, stored, and analyzed. Consistency and
completeness of fast-moving streams of data are one concern. Matching them to
specific outcome events, a challenge raised under Variety, is another. Velocity
also incorporates the characteristics of timeliness or latency: is the data being
captured at a rate or with a lag time that makes it useful?
Lifetime of Data Utility: A second dimension of Velocity is how long the data
will be valuable. Is it permanently valuable, or does it rapidly age and lose its
meaning and importance? Understanding this dimension of Velocity in the data
you choose to store will be important in discarding data that is no longer
meaningful and in fact may mislead.
Real Time Big Data Analytics: The third dimension of Velocity is the speed
with which it must be stored and retrieved. This is one of the major
determinants of NoSQL storage, retrieval, analysis, and deployment architecture
that companies must work through today. When you visit a sophisticated
content web site such as Yahoo or the Huffington Post, those ads that pop up
have been selected specifically for you based on the capture, storage, and
analysis of your current web visit, your prior web site visits, and a mash-up of
external data stored in a NoSQL database or a Hadoop cluster and added to the
analytics. When you sign on to Amazon or Netflix and see recommended purchases
or views just for you, the same process has taken place. The architecture of
capture, analysis, and deployment must support real-time turnaround (in this
case fractions of a second) and must do this consistently over thousands of new
visitors each minute. Real Time Big Data Analytics (RTBDA) is one of the
main frontiers of development in Big Data today.
What's changed? The data was always there, but the ability to capture, analyze,
and act on it in (near) real time is indeed a brand new feature of Big Data
technology.
Value
Although Value is frequently shown as the fourth leg of the Big Data stool,
Value does not differentiate Big Data from not-so-big data. It is equally true of
both big and little data that if we are making the effort to store and analyze it
then it must be perceived to have value.
Big Data, however, is perceived as having incremental value to the organization
and many users quote having found actionable relationships in Big Data stores
that they could not find in small stores. Certainly it is true that if in the past we
were storing data about groups of customers and are now storing data about
each customer individually then the granularity of our findings is much finer
and we approach that desired end-goal of offering each customer a
personalization-of-one in their experience with us.
Another take on Value is that Big Data tends to have low value density, meaning
that you have to store a lot of it to extract findings.[v] This is likely true but
since new Big Data storage and retrieval technologies are so much less
expensive than previous ones, low value density should not be a hurdle that
prevents us from searching for those valuable kernels.
Finally, there is at least one reviewer who goes to philosophical extremes,
quoting Sartre: "existence precedes essence." By this he means that we may
choose to store Big Data before even understanding exactly what use we have
for it.[vi] We're not entirely sure about this. We still encourage business users
to work backwards from the desired outcome before deciding exactly what Big
Data to capture.

There are at least four additional characteristics that pop up in the literature
from time to time. All of these share the same definitional problem as Value.
That is, they may be descriptors of data but not uniquely of Big Data.
Veracity: What is the provenance of the data? Does it come from a reliable
source? Is it accurate and, by extension, complete?
Variability: There are several potential meanings for Variability. Is the data
consistent in terms of availability or interval of reporting? Does it accurately
portray the event reported? When data contains many extreme values it
presents a statistical problem to determine what to do with these outlier values
and whether they contain a new and important signal or are just noisy data.
Viscosity: This term is sometimes used to describe the latency or lag time in
the data relative to the event being described. We found that this is just as easily
understood as an element of Velocity.
Virality: Defined by some users as the rate at which the data spreads; how
often it is picked up and repeated by other users or events.
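
The outlier question raised under Variability can be made a little more concrete. The sketch below flags extreme values using the common interquartile-range rule (Tukey's fences). This is one conventional choice among many, not a method taken from the literature surveyed here, and it only flags candidates: deciding whether a flagged value is a new signal or just noise still takes domain judgment. The function name and sample readings are hypothetical.

```python
from statistics import quantiles
from typing import List, Sequence


def flag_outliers(values: Sequence[float], k: float = 1.5) -> List[float]:
    """Return values lying more than k interquartile ranges outside the quartiles."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


# Made-up sensor readings with one extreme value.
readings = [10, 12, 11, 13, 12, 11, 95, 10, 12]
suspects = flag_outliers(readings)
```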
I've been working with the US Department of Commerce National Institute of
Standards and Technology (NIST) working group developing a standardized
"Big Data Roadmap" since the summer of 2013. Reaching a common definition
of Big Data was one of the first tasks we tackled. That grand qualifier from
our college philosophy classes, "is the characteristic BOTH necessary and
sufficient," turns out to be extremely useful.
In fact, we elected to stick with Volume, Variety, and Velocity and kicked the
last five out of the Big Data definition as broadly applicable to all types of data.
Unfortunately, as you may know if you've grappled with explaining this
yourself, Volume, Variety, and Velocity do pass the necessary and sufficient test,
but not all Big Data opportunities demonstrate all three characteristics. One
suggestion was to call it Big Data if it met two out of three, but even that didn't
completely pass muster.
Variety comes close when speaking narrowly of unstructured data since storage
and retrieval techniques for these data types have really been revolutionized by
new NoSQL tools and techniques including blending these with traditional
structured data. Likewise, Velocity comes close when talking about Real Time
Big Data Analytics for the same reason.
We argued in a previous post that Big Data is not so much about the data itself
as it is about a whole new NoSQL / NewSQL technology. Big Data is about
this new set of tools and techniques in search of appropriate problems to solve.

Each business application may be different, and it is becoming apparent that real
solutions in real companies are frequently hybrids of NoSQL and traditional
RDBMS and analytic tools. These definitions may help sort opportunities
at a high level, but before proceeding, each opportunity needs to be carefully
analyzed for realistic business value and realistic technology applications.