
Big Data and Science: Myths and Reality

Muhammad Umar
Department of Computer Science & Information Technology, University of Sargodha
University Road, Sargodha, Punjab, Pakistan
Email:

Abstract

As Big Data steadily draws attention from every part of society, it has also suffered from many characterizations that are simply incorrect. This article explores common myths about Big Data and exposes the truths underlying them. Big Data affects nearly every aspect of modern society, including business, government, health care, and research in almost every discipline: the life sciences, engineering, the natural sciences, the arts, and the humanities. As the term has drawn attention and become economically important, many have developed preferred angles on the interpretation of Big Data. At the same time, because many people have encountered the term with little prior knowledge of computing or technology, they are easily swayed by the "experts." The term Big Data is widely used in ways that are inappropriate but self-serving, and in many cases these erroneous interpretations have been taken up and amplified by others, including even technically sophisticated people. In this article I discuss some of the more common myths.

Introduction

Big Data is data of very large size. The term describes a collection of data that is huge in volume and keeps growing with time. In short, it is data so large and complex that none of the traditional data management tools can store or process it efficiently. For example, the New York Stock Exchange generates about one terabyte of new trade data per day. Likewise, statistics show that more than 500 terabytes of new data are ingested into the databases of the social media site Facebook every day, mainly in the form of photo and video uploads, message exchanges, likes, and comments.

Types of Big Data

1. Structured
2. Unstructured
3. Semi-Structured

Structured Data

Any data that can be stored, accessed, and processed in a fixed format is termed structured data.

Unstructured Data

Data whose structure is unknown is classified as unstructured data.

Semi-Structured Data

Semi-structured data can contain both forms at once, that is, structured and unstructured elements in the same data set.
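To make these three categories concrete, here is a minimal Python sketch (illustrative only: the records, names, and field values are all invented) showing one record of each kind:

```python
import json

# Structured: a fixed schema, e.g. one row of a relational "employees" table.
structured_row = ("E-1001", "Aisha Khan", "Finance", 54000)

# Semi-structured: self-describing tags but no fixed schema (JSON here);
# one record may carry fields that another record lacks.
semi_structured = json.loads('{"id": "E-1002", "name": "Bilal", "skills": ["SQL", "Python"]}')

# Unstructured: no predefined model at all (free text, images, video, ...).
unstructured = "Meeting notes: discussed Q3 trade volumes; follow up next week."

print(structured_row[3])          # meaning comes from the schema's column order
print(semi_structured["skills"])  # fields are discovered from the data itself
```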
Myths:

1. Big Data Myth: Size is all that matters

The word "Big" refers to size, so it is easy to assume that sheer size is Big Data's defining dimension. We have all heard statements about how many phone books' worth of data can now easily be stored on a single disk drive. So it is no surprise that, for most lay people, Big Data is about size. One would think that people working in technology would know better. Unfortunately, size also lends itself to easy measurement. It is straightforward to count the number of bytes in a data store, and it is relatively easy to plot a sequence of such measurements in a chart showing steep growth. In fact, such charts have become so common that even the most casual reader gets the message. Among other things, this leads to people seriously wondering whether they have a Big Data problem when they have only a few hundred gigabytes of data. This is sad, because we turn away many people we should be able to help. Beyond this, I believe our understanding of Big Data would be better were it not for the economics of the IT industry. Today we have a great ecosystem of Big Data systems. These systems are genuinely innovative: collectively, they establish a new paradigm for scaling out. There are many problems that need this scale and benefit from these new systems. These facts have led to the creation of a new industry segment, with many benefits, all of which are good. But the great progress made in this space has also, as it were, sucked the oxygen out of the air for everything else. The industry wants to talk about Volume, for economic reasons. And money talks.

Several years ago, the Gartner group saw this unwavering attention focused on size, and suggested the now popular "3 Vs" of Big Data [5]. IBM then urged the addition of a fourth V [6], and this was adopted by the majority. So, of course, most people in technology will tell you that Big Data raises the issues of Volume, Velocity, Variety, and Veracity (or at least the first three of these); yet they will immediately go on to discuss how many petabytes some problem involves. I discussed above why Volume (or size) gets the wrong kind of attention. Let me turn now to why I think Variety and Veracity do not get the attention they deserve. The main reason for this disregard is that there is no well-accepted measure of either. When there is no measurement, it is difficult to track progress. If I run a company and build a new system that can handle larger volumes than my competitors' systems, I can support this claim with benchmark measurements. If I am a student and develop an algorithm that scales better than a competing algorithm, I know exactly how to compare my algorithm against the competition and persuade skeptical reviewers. In contrast, consider Variety. If I have a product that makes heterogeneous data easier to manage, what technical claim can I make that does not sound like advertising hype? When I write a paper about a data model that handles diversity better than the current state of the art, I have to think hard about how to compare it against the competition and demonstrate the merit of my idea. Progress is hard on things you cannot measure, both in industry and in academia. Variety and Veracity may be the hardest of the 4 Vs to talk about, but they are the ones people most need to talk about. Veracity faces many of the same problems as Variety. Under the simplest models, we can at least begin to measure some things, estimating probabilities and distributions, and so on. But everyone recognizes that such measures rest on idealized models, for example ones that assume independence when we know it does not hold. So these measures are taken with a grain of salt; still, measurement is much easier for Veracity than for Variety. To conclude, Volume and Velocity are the easy Vs; Variety and, to a lesser extent, Veracity are the really challenging ones.
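The asymmetry described above, where Volume is trivially measurable while Variety has no agreed-upon metric, can be illustrated with a small Python sketch. The file names and the "distinct header" proxy below are invented for illustration; they are not an accepted measure of Variety, which is precisely the point:

```python
import os

files = ["trades_2023.csv", "trades_2024.csv"]  # hypothetical data store

# Volume: a universally agreed metric exists. Just count the bytes.
volume = sum(os.path.getsize(f) for f in files if os.path.exists(f))
print(f"Volume: {volume} bytes")

# Variety: no accepted measure exists. Counting distinct header rows is one
# crude proxy, but it ignores format, vocabulary, and semantic differences.
headers = set()
for f in files:
    if os.path.exists(f):
        with open(f) as fh:
            headers.add(fh.readline().strip())
print(f"Distinct headers (a crude Variety proxy): {len(headers)}")
```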
2. Big Data Myth: The central challenge with Big Data is devising new computing algorithms and architectures

If we think about Big Data in terms of the 4 Vs, we immediately face the question of where the threshold lies for calling something "Big." For Variety, of course, we know this is not an answerable question, because we have no measure in the first place. So let us just look at Volume and Velocity. The threshold, by some definitions, is the limit of what we can currently handle. Obviously, this is a moving target, but it has the benefit of being aspirational. The sad (in my opinion) drawback is that it limits the size of the club to one: there is only one largest deployment in the world at any given time (modulo ties). Pushing the limit of this largest deployment is a worthwhile challenge, but not one around which we can build a whole industry or develop a comprehensive educational field. The threshold, by other definitions, is frozen at the prevailing architecture at some point in time, say 2010. A data set then qualifies, on Volume grounds, as Big Data if it is larger than could be hosted using the "standard" architectures in use at the beginning of the Big Data era. With the ever-increasing popularity of MapReduce-style computation, and of the software and tools in the "Big Data ecosystem," we get a clear, albeit circular and self-serving, definition: a Big Data problem is one that is best addressed using the tools of the Big Data toolbox. There is general agreement on which tools belong in the Big Data toolbox, even as individual tool-makers seek to distinguish themselves. This definition is circular because it does not actually define what goes into the Big Data toolbox. It is self-serving because it anoints that tool set, and that style of system building, as the solution to the Big Data problem. And it is incorrect because almost everything in the Big Data toolbox is focused on Volume (often in conjunction with Velocity), with very little attention given to the Variety and Veracity challenges. I believe that the cloud, and what is today considered the "Big Data Ecosystem," have their place where the right technology has been built, but they are neither a complete solution in themselves nor a necessary piece of every solution.

My own Big Data threshold is different (along any of the four axes): you have a Big Data problem when you have more data than you can handle in your context. A scientist (or manager) faces a Big Data problem when she has so much data that she can no longer process it using the spreadsheet system she knows. The solution may be as simple as moving to a database. But even such a seemingly simple migration can hide many problems: the current spreadsheet layout may not map cleanly to a relational table (for example, a new column may be added every month), it may be relied upon by other elements of a particular workflow, and so on. Identifying and removing such barriers is a legitimate Big Data task. See, for example, the National Academies report "Frontiers in Massive Data Analysis" [4]. It is also worth noting that while we can buy larger systems, more machines, faster CPUs, and larger disks, human abilities do not scale. Moreover, sizes that challenge humans tend to be tiny for computers. For example, consider a graph with just 40 nodes and 200 edges: try drawing it on screen with your favorite graph-layout tool and looking for patterns. Such a small graph may already be at the limit of what a human can handle, even with today's technology. Big Data poses great challenges for human interaction, and many of the most interesting problems in the Big Data area deal with easing this human interaction.
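As a concrete illustration of the spreadsheet-to-database migration discussed above, here is a minimal Python sketch, assuming a hypothetical monthly CSV export (the file and column names are invented). It loads each month's export into SQLite and evolves the schema when a new column appears:

```python
import csv
import sqlite3

def load_monthly_csv(db: sqlite3.Connection, path: str) -> None:
    """Load one month's spreadsheet export, adding any new columns first."""
    with open(path, newline="") as fh:
        reader = csv.DictReader(fh)
        cols = reader.fieldnames or []
        db.execute("CREATE TABLE IF NOT EXISTS metrics (month TEXT)")
        existing = {row[1] for row in db.execute("PRAGMA table_info(metrics)")}
        for col in cols:
            if col not in existing:
                # The layout changed (e.g. a new column added this month):
                # evolve the schema instead of breaking the workflow.
                db.execute(f'ALTER TABLE metrics ADD COLUMN "{col}" TEXT')
        for row in reader:
            placeholders = ", ".join("?" for _ in cols)
            names = ", ".join(f'"{c}"' for c in cols)
            db.execute(f"INSERT INTO metrics ({names}) VALUES ({placeholders})",
                       [row[c] for c in cols])
    db.commit()

db = sqlite3.connect(":memory:")
# load_monthly_csv(db, "metrics_2024_01.csv")  # hypothetical export
```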
3. Big Data Myth: Analytics is the central problem with Big Data

It is completely understandable that many people picture a Big Data system as a magic piece of software that takes Big Data as input and produces deep insights as output. Unfortunately, this simplistic picture resonates all too well with many companies, and even with some professionals. It lets the person building a Big Data system (in the sense described above) create the illusion of solving the whole problem, soup to nuts, while focusing on just one piece of it. The same goes for the person who devises a novel analysis algorithm. But Big Data is definitely not just machine learning over MapReduce. A group of leading researchers from across America wrote a white paper to correct this misconception [1]. A short version, making the same key points, appeared in CACM in July 2014 [2]. Figure 1 is reproduced from this white paper. The main point is that there are many steps in the Big Data analysis pipeline, important decisions to be made at each step, and many challenges to face in each one. The first decision is what data to record or acquire, and how best to deal with incomplete data. After that, decisions have to be made about how to represent the data for analysis, possibly after extraction, cleaning, and integration with other data sources. Even the analysis phase, which has received the most attention, has under-appreciated issues, such as managing the contention that appears when several user programs run on shared clusters at the same time. The final, interpretive step is perhaps the most important, since it cannot be delegated: someone has to make decisions based on the results of the data analysis, and that individual must first understand and trust those results. Gaining this confidence may require explanation and interpretation, it may require visualization, and it may require sensitivity analyses of various kinds. All of this has to be orchestrated, and done successfully, for a Big Data analysis to produce any real value.

Phases of the Big Data life cycle

Figure 1: The Big Data analysis pipeline. The major steps in a Big Data analysis are shown in the top half of the figure; note the possible feedback at every stage. The bottom half of the figure shows the Big Data characteristics that make these steps challenging.
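To underline that analysis is only one stage of the pipeline in Figure 1, here is a schematic Python sketch. Every function body is a trivial stand-in invented for illustration; only the stage names follow the figure:

```python
def acquire(source):             # decide what to record; deal with incomplete data
    return [r for r in source if r is not None]

def extract_and_clean(records):  # pull out usable fields, fix obvious errors
    return [str(r).strip().lower() for r in records]

def integrate(records, other):   # align with other data sources
    return records + other

def analyze(records):            # the stage that gets most of the attention
    return {"count": len(records)}

def interpret(result):           # cannot be delegated: a human must trust it
    print(f"analysis saw {result['count']} records; inspect before acting")

interpret(analyze(integrate(extract_and_clean(acquire(["A", None, "b "])), ["c"])))
```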
4. Big Data Myth: Data reuse is low-hanging fruit

If we have collected data for some purpose, we should be able to use it for a different purpose, thereby eliminating the huge cost of collecting the data a second time. In fact, reuse may be the only option in many cases: if a second analysis is done much later, there is no opportunity to go back in time and collect the historical data again. While this is a tantalizing opportunity, exploiting it requires meeting a number of challenges.

First, the original data set must be discoverable at the desired time of reuse. It is easy to tag data sets (or to use existing labels in a data set, such as attribute and table names) and then find the data sets tagged with a topic of interest. However, in a vast universe of data sets, there may be hundreds that are somehow related to the topic of interest, but only a few that carry information on the relationship of interest, measured under the conditions of interest. We are only now starting to think about how to describe data sets so as to enable such discovery.

Second, the data set must be understandable and interpretable if it is to be used. Obviously, this requires sufficient metadata. Unfortunately, the word "sufficient" in the previous sentence is often overlooked. Knowing the creator and the date, along with a schema declaration, counts as enough metadata in many cases. But it may also be important to know under what conditions the data were obtained, using what instruments, and after what sample preparation. There has been useful work on metadata standards in many communities, and following these standards will undoubtedly move us forward. However, we also need to address the issue of incentives, at least in the scientific community: why should a scientist spend time recording metadata carefully, rather than doing just the minimum required by the publication venue or funding agency? Furthermore, there is enough variability even among closely related studies that most metadata standards cannot mandate every piece of information that may be relevant to a specific situation. Efforts to establish a culture of careful data documentation are essential to addressing these issues.

Third, the data sets you find are usually not in exactly the form you need. Sometimes this is just a question of format mapping, but more often it is harder to resolve. One issue I am working on right now has to do with administrative data, which is often reported aggregated by the administrative units that collected it. When such data is reused, it may need to be compared (or combined) with data organized along a different administrative hierarchy. If the two hierarchies differ, such a comparison cannot happen directly. For example, it is not straightforward to compare data reported by school district with data reported by county. Our approach to this problem is to develop new translation methods.
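The school-district versus county mismatch can be made concrete with a small Python sketch of one possible translation method: a crosswalk table that apportions each district's figure across the counties it overlaps. The district names, county names, overlap weights, and enrollment figures are all invented:

```python
# Hypothetical crosswalk: fraction of each school district lying in each county.
crosswalk = {
    "District A": {"County X": 1.0},
    "District B": {"County X": 0.4, "County Y": 0.6},
}

district_enrollment = {"District A": 5000, "District B": 8000}  # invented figures

county_enrollment = {}
for district, total in district_enrollment.items():
    for county, share in crosswalk[district].items():
        # Apportion the district's count by geographic overlap: one simple
        # translation method; real reuse would need better apportionment bases.
        county_enrollment[county] = county_enrollment.get(county, 0.0) + total * share

print(county_enrollment)  # {'County X': 8200.0, 'County Y': 4800.0}
```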
Data reuse is an important promise of Big Data, and it promises much. But it also raises a number of challenging questions that have largely been overlooked until now.

5. Big Data Myth: Data Science is the same as Big Data

The ability to collect and analyze large amounts of data is revolutionizing the way scientific research is done [3]. The Sloan Digital Sky Survey [9] transformed astronomy from a field in which taking photographs of the sky was a large part of an astronomer's work into one focused on finding interesting objects and phenomena in data. In biological science, there is now a well-established tradition of depositing scientific data into public archives, and of building databases for other scientists to use. The size and number of experimental data sets in many applications is increasing dramatically. Consider, for example, the advent of Next Generation Sequencing (NGS) [7]. The throughput of current NGS methods is growing faster than CPU benchmark (SPECint) performance, which tracks the increase in computational power due to Moore's law. Both the volume and the velocity of this data require new methods of data management and analysis. For example, the raw image data sets produced by NGS are so large that it is impractical today to store and transmit them.
Instead, these images are analyzed in transit to produce sequence data.
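This in-transit analysis, reducing a raw instrument stream to derived sequence data without ever storing the raw stream, can be sketched in Python as follows. The generator standing in for an instrument feed and the trivial base-calling function are both invented for illustration:

```python
def instrument_feed():
    """Stand-in for a raw NGS image stream far too large to retain."""
    for i in range(100_000):
        yield f"raw-image-{i}"

def call_bases(image: str) -> str:
    """Trivial stand-in for image analysis that emits sequence data."""
    return "ACGT"[hash(image) % 4]

# Only the small derived result is kept; raw images are discarded as they pass.
sequence = "".join(call_bases(img) for img in instrument_feed())
print(len(sequence))
```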
Many people use the terms "Data Science" and "Big Data" interchangeably, applying both to all the examples listed above. This conflation is wrong. The first difference between the two terms is one of perspective: "Big Data" starts from the data itself (and works upward), while "Data Science" starts from the use of the data (and works downward). However, their formal definitions differ by more than mere perspective. The National Consortium for Data Science, an industry and academic partnership founded at UNC Chapel Hill in 2013, describes data science as "a systematic study of digital data using scientific techniques of observation, theory development, systematic analysis, hypothesis testing, and validation." The main purpose of data science [8] is to use data to describe, explain, and predict ecological and social phenomena by creating knowledge about the properties of large and dynamic data sets.

When we compare this definition of Data Science with the Gartner definition of Big Data we saw earlier, we see immediately that it is possible to do Data Science without doing Big Data, and vice versa. Of course, nothing stops Data Science from including Big Data, and it often does. However, dwelling on the connection between these two concepts is of limited value. Another point to note is that Data Science activities often involve data analysis by a domain expert with limited computing expertise. For such a domain expert to succeed, the data systems must be easy to use. Unfortunately, data systems today are very difficult to use. There is even an urban myth that some vendors deliberately keep their systems hard to use because they make so much money from consulting and support fees. In addition to the systems themselves, there are the analytical tasks: too often we require users to make unsupported assumptions about the data at hand, e.g., regarding independence, bias, or how representative a data set is. If we do not help people use their data intelligently, they will get burned, and disenchanted with all the good our technology can bring them. Usability research for databases and data analytics is important.
6. Big Data Myth: Big Data is all hype

People have been analyzing data, and reducing data to summaries, for a very long time. So what has changed? Why is now the time for excitement about Big Data? Whether the excitement is simply cooked up by journalists eager for attention is a question worth asking. But the fact is that data collection is cheap today, thanks to ubiquitous connectivity, automated business processes, the web, and sensor networks, in a way never seen before. Data storage is also cheaper, in both latency and cost, thanks to media pricing. As a result, almost every field of endeavor is changing from "data poor" to "data rich." So it is no surprise that all around us people are asking about the potential of Big Data. At the same time, public understanding of the effects of Big Data is only beginning to grow. We are just starting to get serious about data privacy, and our appreciation of the principles of data analysis is also in its infancy. Errors and excesses in these matters could provoke a backlash that shuts many things down. But barring such errors, it is safe to say that Big Data may be hyped, yet there is far too much substance for it to be dismissed.

References
[1] Challenges and Opportunities with Big Data, a community white paper, available at http://cra.org/ccc/docs/init/bigdatawhitepaper.pdf
[2] H.V. Jagadish, Johannes Gehrke, Alexandros Labrinidis, Yannis Papakonstantinou, Jignesh M. Patel, Raghu Ramakrishnan, Cyrus Shahabi, Big Data and its technical challenges, Commun. ACM 57 (7) (July 2014), pp. 86-94, doi:10.1145/2611567.
[3] Advancing Discovery in Science and Engineering, Computing Community Consortium (Spring 2011).
[4] Frontiers in Massive Data Analysis, National Academies Press (2013).
[5] Pattern-Based Strategy: Getting Value from Big Data, Gartner Group press release, available at http://www.gartner.com/it/page.jsp?id=1731916 (July 2011).
[6] The 4 V's of Big Data, http://www.ibmbigdatahub.com/tag/587
[7] Scott D. Kahn, On the future of genomic data, Science (11 February 2011), pp. 728-729.
[8] Establishing a National Consortium for Data Science, available at http://data2discovery.org/dev/wp-content/uploads/2012/09/NCDS-Consortium-Roadmap_July.pdf (2012).
[9] SDSS-III: Massive Spectroscopic Surveys of the Distant Universe, the Milky Way Galaxy, and Extra-Solar Planetary Systems, available at http://www.sdss3.org/collaboration/description.pdf (Jan. 2008).
