Big Data Final PR-3

Published by: Carl Tanner on Apr 26, 2012

While Hadoop is fast becoming the most popular platform for big data, there are other
options out there we think are worth mentioning.

DataStax provides an enterprise-grade distribution of the Apache Cassandra NoSQL
database. Cassandra is used primarily as a high-scale transactional (OLTP) database.
Like other NoSQL databases, Cassandra does not impose a predefined schema, so new
data types can be added at will. It is a great complement to Hadoop for real-time data.
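The schema flexibility described above can be illustrated with a toy model. This is plain Python, not the actual Cassandra API (the `WideColumnFamily` class and its methods are invented for illustration); it shows only the idea that rows in one column family may carry different columns, so new data types can appear without a schema migration.

```python
class WideColumnFamily:
    """Toy model of a schema-less, wide-column store."""

    def __init__(self):
        self.rows = {}  # row key -> {column name: value}

    def insert(self, row_key, columns):
        # Any columns may be supplied; no predefined schema is enforced.
        self.rows.setdefault(row_key, {}).update(columns)

    def get(self, row_key):
        return self.rows.get(row_key, {})


events = WideColumnFamily()
# Early rows track only a timestamp and a URL.
events.insert("user:1", {"ts": "2012-03-01T12:00", "url": "/home"})
# A later row adds new columns (referrer, device) with no migration step.
events.insert("user:2", {"ts": "2012-03-01T12:05", "url": "/cart",
                         "referrer": "/home", "device": "mobile"})

print(sorted(events.get("user:2")))
```

In a real deployment the same flexibility would come from Cassandra's own data model rather than an in-memory dictionary; the sketch only mirrors the shape of the idea.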

LexisNexis offers a big data product called HPCC that uses its Enterprise Control
Language (ECL) instead of Hadoop’s MapReduce for writing parallel-processing
workflows. ECL is a declarative, data-centric language that abstracts away much of the
work required in MapReduce. For certain tasks that take thousands of lines of code in
MapReduce, LexisNexis claims ECL requires only 99 lines. Furthermore, HPCC is
written in C++, which the company says makes it inherently faster than the Java-based
Hadoop.

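The declarative-versus-imperative contrast behind that conciseness claim can be sketched without ECL itself. The following is a rough Python analogy, not ECL or Hadoop code: a word count written as explicit map and reduce phases next to a declarative one-liner that states only what to compute.

```python
from collections import Counter
from itertools import chain

docs = ["big data big insight", "big platforms"]

# Imperative, MapReduce-style: spell out the map and reduce phases.
def map_phase(doc):
    return [(word, 1) for word in doc.split()]

def reduce_phase(pairs):
    counts = {}
    for word, n in pairs:
        counts[word] = counts.get(word, 0) + n
    return counts

mapped = chain.from_iterable(map_phase(d) for d in docs)
mr_counts = reduce_phase(mapped)

# Declarative: state the result wanted, not how to shuffle and combine.
decl_counts = Counter(chain.from_iterable(d.split() for d in docs))

assert mr_counts == dict(decl_counts)
print(mr_counts["big"])  # 3
```

The gap here is small; on real distributed workflows the boilerplate a declarative language hides (partitioning, shuffling, combiner logic) is what drives the line-count difference LexisNexis points to.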
VoltDB is perhaps more famous for the man behind the company than for its
technology. VoltDB founder Dr. Michael Stonebraker has been a pioneer of database
research and technology for more than 25 years. He was the main architect of the
Ingres relational database and of POSTGRES, the object-relational database that
evolved into PostgreSQL. These prototypes were developed at the University of
California, Berkeley, where Stonebraker was a professor of computer science for 25
years. VoltDB makes a scalable SQL database, which is likely to do well: SQL faces
well-known problems at scale, and thousands of database administrators are looking
to extend their SQL skills into the next generation of database technology.

Our gratitude to . . .


A near-term outlook for big data

- 40 -

March 2012

Religious wars in technology are rife, and it’s easy to get sucked into one side versus
the other, especially when the key players are selling commercial distributions of
open-source software — in this case Hadoop — and are constantly trying to stay
ahead of the core codebase. We would like to thank the following people for their
unbiased and thoughtful input:

Michael Franklin, Professor of Computer Science, University of California, Berkeley
Peter Skomoroch, Principal Data Scientist, LinkedIn
Theo Vassilakis, Senior Software Engineer, Google
Anand Babu Periasamy, Office of the CTO, Red Hat
Derrick Harris, Writer and GigaOM Pro Analyst



Considering information-quality drivers for big data
analytics — by David Loshin

Years, if not decades, of information-management systems have contributed to the
monumental growth of managed data sets. And data volumes are expected to continue
growing: A 2010 article suggests that data volume will continue to expand at a healthy
rate, noting that “the size of the largest data warehouse . . . triples approximately every
two years.”[9] Increased numbers of transactions can contribute much to this data
growth: An example is retailer Wal-Mart, which executes more than 1 million
customer transactions every hour, feeding databases of sizes estimated at more than
2.5 petabytes.[10]

But transaction processing is rapidly becoming just one source of information that can
be subjected to analysis. Another is unstructured data: A report suggests that by 2013,
the amount of traffic flowing over the Internet annually will reach 667 exabytes.[11]

The expansion of blogging, wikis, collaboration sites and especially social-networking
environments (Facebook, with its 845 million members; Twitter; Yelp) has become a
major source of data suited for analytics, with phenomenal data growth. By the end of
2010, the amount of digital information was estimated to have grown to almost 1.2
million petabytes, with 1.8 million petabytes anticipated by the end of 2011.[12]
Further, the amount of digital data was expected to balloon to 35 zettabytes
(1 zettabyte = 1 trillion gigabytes) by 2020. Competitive organizations are striving to
make sense of what we call “big data,” with a correspondingly increased demand for
scalable computing platforms, algorithms and applications that can consume and
analyze massive amounts of data with varying degrees of structure.


[9] Adrian, Merv. “Exploring the Extremes of Database Growth.” IBM Data Management, Issue 1, 2010.
[10] “Data, data everywhere.” The Economist, Feb. 21, 2010.
[11] “Data, data everywhere.” The Economist, Feb. 21, 2010.
[12] Gantz, John F. “The 2011 IDC Digital Universe Study: Extracting Value from Chaos.” June 2011.



But what is the impact on big data analytics if there are questions about the quality of
the data? Data quality often centers on specifying expectations about data accuracy,
currency and completeness, as well as a raft of other dimensions used to articulate
suitability for a variety of uses. Data-quality problems (e.g., inconsistencies among
data sets, incomplete records, inaccuracies) existed even when companies controlled
the flow of information into the data warehouse. Yet with essentially no constraints
placed on tweets, yelps, wall posts or other data streams, questions about the
dependence on high-quality information must be asked.
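Expectations like these can be made concrete as record-level checks. The sketch below is a minimal Python illustration of three of the dimensions named above: completeness (required fields present), accuracy (here approximated by a simple validity rule) and currency (record age). The field names, threshold and reference date are all invented for the example, not drawn from any particular system.

```python
from datetime import date

REQUIRED = {"customer_id", "amount", "updated"}

def quality_flags(record, today=date(2012, 3, 1), max_age_days=90):
    """Return a list of data-quality flags for one record."""
    flags = []
    # Completeness: every required field must be present.
    if not REQUIRED <= record.keys():
        flags.append("incomplete")
    # Validity proxy for accuracy: a transaction amount cannot be negative.
    amount = record.get("amount")
    if amount is not None and amount < 0:
        flags.append("invalid_amount")
    # Currency: the record must have been updated recently enough.
    updated = record.get("updated")
    if updated is not None and (today - updated).days > max_age_days:
        flags.append("stale")
    return flags


records = [
    {"customer_id": 1, "amount": 19.99, "updated": date(2012, 2, 20)},
    {"customer_id": 2, "amount": -5.00, "updated": date(2011, 6, 1)},
    {"customer_id": 3},  # missing fields
]
for r in records:
    print(r.get("customer_id"), quality_flags(r))
```

Checks of this sort are straightforward for warehouse feeds; the article's point is that streams of tweets, yelps and wall posts arrive with no such constraints attached, so any equivalent rules must be imposed by the consumer after the fact.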

As the excitement builds around the big data phenomenon, business managers and
developers alike must make themselves aware of some of the potential problems
linked to big data analytics. This article presents some of the underlying data-quality
issues associated with the creation of big data sets, their integration into analytical
platforms and, most importantly, the consumption of the results of big data analytics.