
Big Data and Business Analytics

Big data is a broad term for data sets so large or complex that
traditional data processing applications are inadequate.
Challenges include analysis, capture, data curation, search,
sharing, storage, transfer, visualization, and information privacy.
The term often refers simply to the use of predictive analytics or
certain other advanced methods to extract value from data, and
seldom to a particular size of data set. Accuracy in big data may
lead to more confident decision making, and better decisions can
mean greater operational efficiency, cost reductions, and reduced
risk. Analysis of data sets can find new correlations to "spot business
trends, prevent diseases, combat crime and so on." Scientists,
practitioners of media and advertising and governments alike
regularly meet difficulties with large data sets in areas including
Internet search, finance and business informatics. Scientists
encounter limitations in e-Science work, including meteorology,
genomics, connectomics, complex physics simulations, and
biological and environmental research.
Data sets grow in size in part because they are increasingly being
gathered by cheap and numerous information-sensing mobile
devices, aerial (remote sensing), software logs, cameras,
microphones, radio-frequency identification (RFID) readers, and
wireless sensor networks. The world's technological per-capita
capacity to store information has roughly doubled every 40
months since the 1980s; as of 2012, 2.5 exabytes (2.5×10^18
bytes) of data were created every day. The challenge for large
enterprises is determining who should own big data initiatives
that straddle the entire organization.
Work with truly big data remains relatively uncommon; most
analysis is of "PC-size" data that a desktop PC or notebook can
handle. Relational database management systems and
desktop statistics and visualization packages often have difficulty
handling big data. The work instead requires "massively parallel
software running on tens, hundreds, or even thousands of
servers". What is considered "big data" varies depending on the
capabilities of the users and their tools, and expanding
capabilities make big data a moving target. Thus, what is
considered "big" in one year will become ordinary in later
years. "For some organizations, facing hundreds of gigabytes of
data for the first time may trigger a need to reconsider data
management options. For others, it may take tens or hundreds of
terabytes before data size becomes a significant consideration."


The amount of data in our world has been exploding. Companies
capture trillions of bytes of information about their customers,
suppliers, and operations, and millions of networked sensors are
being embedded in the physical world in devices
such as mobile phones and automobiles, sensing, creating, and
communicating data. Multimedia and individuals with
smartphones and on social network sites will continue to fuel
exponential growth. Big data, meaning large pools of data that
can be captured, communicated, aggregated, stored, and
analyzed, is now part of every sector and function of the global
economy. Like
other essential factors of production such as hard assets and
human capital, it is increasingly the case that much of modern
economic activity, innovation, and growth simply couldn't take
place without data.
The problem is simple: While the storage capacities of hard drives
have increased
massively over the years, access speeds (the rate at which data
can be read from drives) have not kept up.
Hadoop: In a nutshell, this is what Hadoop provides: a reliable,
shared storage and analysis system. Storage is provided by HDFS
and analysis by MapReduce. There are other parts to Hadoop, but
these capabilities are its kernel.
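The division of labor between storage and analysis can be illustrated with the MapReduce model itself. The following is a minimal, single-process Python sketch of the map, shuffle, and reduce phases of a word count; it is an illustration of the programming model, not actual Hadoop code, which would run these phases in parallel across a cluster.

```python
from collections import defaultdict

def map_phase(document):
    # Map: emit a (word, 1) pair for every word in the input split.
    return [(word, 1) for word in document.split()]

def shuffle(pairs):
    # Shuffle: group all values by key, as the framework does
    # between the map and reduce phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: sum the counts collected for each word.
    return {word: sum(counts) for word, counts in groups.items()}

docs = ["big data big analytics", "big data"]
pairs = [p for d in docs for p in map_phase(d)]
counts = reduce_phase(shuffle(pairs))
# counts == {'big': 3, 'data': 2, 'analytics': 1}
```

On a real cluster, each map task would read one HDFS block, and the shuffle would move data over the network between map and reduce nodes.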


The rapid growth in data volumes has transformed today's
business: online users create content such as blog posts, tweets,
social networking interactions, and photos, and the
servers continuously log messages about what online users are
doing. IBM estimates that every day 2.5 quintillion bytes of data
are created, so much that 90% of the data in the world today has
been created in the last two years.


Big data typically refers to the following types of data:

1. Traditional enterprise data includes customer information from
CRM systems, transactional ERP data, web store transactions, and
general ledger data.
2. Machine-generated/sensor data includes Call Detail Records
(CDR), weblogs, smart meters, manufacturing sensors,
equipment logs (often referred to as digital exhaust), trading
systems data.
3. Social data includes customer feedback streams, micro-
blogging sites like Twitter, and social media platforms like
Facebook.


1. Volume: Machine-generated data is produced in much larger
quantities than non-traditional data.
2. Velocity: Social media data streams produce a large influx of
opinions and relationships valuable to customer relationship
management.
3. Variety: As new services are added, new sensors deployed, or
new marketing campaigns executed, new data types are needed
to capture the resultant information.
4. Value: The economic value of different data varies significantly.
To make the most of big data, enterprises must evolve their IT
infrastructures to handle the rapid rate of delivery of extreme
volumes of data, with varying data types, which can then be
integrated with an organization's other enterprise data and
analyzed.


Big data refers to datasets whose size is beyond the ability of
typical database software tools to capture, store, manage, and
analyze. This definition is intentionally subjective and
incorporates a moving definition of how big a dataset needs to be
in order to be considered big data; i.e., we don't define big data
in terms of being larger than a certain number of terabytes
(thousands of gigabytes). We assume that, as technology
advances over time, the size of datasets that qualify as big data
will also increase. Also note that the definition can vary by sector,
depending on what kinds of software tools are commonly
available and what sizes of datasets are common in a particular
industry. With those caveats, big data in many sectors today will
range from a few dozen terabytes to multiple petabytes
(thousands of terabytes). The ability to store, aggregate, and
combine data and then use the results to perform deep analyses
has become ever more accessible as trends such as Moore's Law
in computing, its equivalent in digital storage, and cloud
computing continue to lower costs and other technology barriers.
The means to extract insight from data are also markedly
improving as software
available to apply increasingly sophisticated techniques combines
with growing computing horsepower. Further, the ability to
generate, communicate, share, and access data has been
revolutionized by the increasing number of people, devices, and
sensors that are now connected by digital networks. In 2010,
more than 4 billion people, or 60 percent of the world's
population, were using mobile phones, and about 12 percent of
those people had smartphones, whose penetration is growing at
more than 20 percent a year. More than 30 million networked
sensor nodes are now present in the transportation, automotive,
industrial, utilities, and retail sectors. The number of these
sensors is increasing at a rate of more than 30
percent a year. There are many ways that big data can be used to
create value across sectors of the global economy. Indeed, the
world is on the cusp of a tremendous wave of innovation,
productivity, and
growth, as well as new modes of competition and value capture
all driven by big data as consumers, companies, and economic
sectors exploit its potential.

NOTE: Big data is any attribute, size being one of them, that
challenges the constraints of a system's capabilities or a business
need.

But why should this be the case now? Haven't data always been
part of the impact of information and communication technology?
Yes, but research suggests that the scale and scope of changes
that big data are bringing about are at an inflection point, set to
expand greatly, as a series of technology trends accelerate and
converge. We are already seeing visible changes in the economic
landscape as a result of this convergence.

Many pioneering companies are already using big data to create
value, and others need to explore how they can do the same if
they are to compete.

Challenges include the need to ensure that the right infrastructure
is in place and that incentives and competition are in place to
encourage continued innovation; that the economic benefits to
users, organizations, and the economy are properly understood;
and that safeguards are in place to address public concerns about
big data.
To understand these challenges one has to understand the state
of digital data, how different domains can use large datasets to
create value, the potential value across stakeholders, and the
implications for the leaders of private sector companies and
public sector organizations, as well as for policy makers.


One not-so-secret aspect of big data is that it is fueled by the
cloud; the extensive use of cloud computing is a major enabler of
big data.


The generation of big data may be growing exponentially and
advancing technology may allow the global economy to store and
process ever greater quantities of data, but there may be limits to
our innate human ability, our sensory and cognitive faculties, to
process this data torrent. It is said that the mind can handle about
seven pieces of information in its short-term memory. Roger
Bohn and James Short at the University of California at San Diego
discovered that the rate of growth in data consumed by
consumers, through various types of media, was a relatively
modest 2.8 percent in bytes per hour between 1980 and 2008.
We should note that one of the reasons for this slow growth was
the relatively fixed number of bytes delivered through television
before the widespread adoption of high-definition digital video.
The topic of information overload has been widely studied by
academics from neuroscientists to economists. Economist Herbert
Simon once said, "A wealth of information creates a poverty of
attention and a need to allocate that attention efficiently among
the overabundance of information sources that might consume
it." Despite these apparent limits, there are ways to help
organizations and individuals to process, visualize, and synthesize
meaning from big data. For instance, more sophisticated
visualization techniques and algorithms, including automated
algorithms, can enable people to see patterns in large amounts of
data and help them to unearth the most pertinent insights (see
chapter 2 for examples of visualization). Advancing collaboration
technology also allows a large number of individuals, each of
whom may possess understanding of a special area of
information, to come together in order to create a whole picture
to tackle interdisciplinary problems.
MGI estimates that enterprises globally stored more than 7
exabytes of new data on disk drives in 2010, while consumers
stored more than 6 exabytes of new data on devices such as PCs
and notebooks. One exabyte of data is the equivalent of more
than 4,000 times the information stored in the US Library of
Congress. Indeed, we are generating so much data today that it
is physically impossible to store it all.
We have identified five broadly applicable ways to leverage big
data that offer transformational potential to create value and
have implications for how organizations will have to be designed,
organized, and managed.
The use of big data is becoming a key way for leading companies
to outperform their peers. For example, we estimate that a retailer
embracing big data has the potential to increase its operating
margin by more than 60 percent.
We have identified big data levers that will, in our view,
underpin substantial productivity growth (Exhibit 1). These
opportunities have the
potential to improve efficiency and effectiveness, enabling
organizations both to do more with less and to produce higher-
quality outputs, i.e., increase the value-added content of products
and services.
Differences among sectors emerge if we compare the historical
productivity of sectors in the United States with the potential of
these sectors to capture value from big data (using an index that
combines several quantitative metrics).
A significant constraint on realizing value from big data will be a
shortage of talent, particularly of people with deep expertise in
statistics and machine learning, and the managers and analysts
who know how to operate companies by using insights from big data.



Data policies: As an ever larger amount of data is digitized and
travels across organizational boundaries, there is a set of policy
issues that will become increasingly important, including, but not
limited to, privacy, security, intellectual property, and liability.
Data security: Clearly, privacy is an issue whose importance,
particularly to consumers, is growing as the value of big data
becomes more apparent. Personal data such as health and
financial records are often those that can offer the most
significant human benefits, such as helping to pinpoint the right
medical treatment or the most appropriate financial product.

When big data is distilled and analyzed in combination with
traditional enterprise data, enterprises can develop a more
thorough and insightful understanding of their business, which
can lead to enhanced productivity, a stronger competitive
position, and greater innovation, all of which can have a
significant impact on the bottom line. For example, in the delivery
of healthcare services, management of chronic or long-term
conditions is expensive. Use of in-home monitoring devices to
measure vital signs and monitor progress is just one way that
sensor data can be used to improve patient health and reduce
both office visits and hospital admissions.
Manufacturing companies deploy sensors in their products to
return a stream of telemetry. Sometimes this is used to deliver
services such as OnStar, which delivers communications, security,
and navigation services. Perhaps more
importantly, this telemetry also reveals usage patterns, failure
rates and other opportunities for product improvement that can
reduce development and assembly costs.
The proliferation of smart phones and other GPS devices offers
advertisers an
opportunity to target consumers when they are in close proximity
to a store, a coffee shop or a restaurant. This opens up new
revenue for service providers and offers many businesses a
chance to target new customers.
Retailers usually know who buys their products. Use of social
media and web log files from their ecommerce sites can help
them understand who didn't buy and why they chose not to,
information not available to them today. This can enable much
more effective micro customer segmentation and targeted
marketing campaigns, as well as improved supply chain planning.
Finally, social media sites like Facebook and LinkedIn simply
wouldn't exist without big data. Their business model requires a
personalized experience on the web, which can only be delivered
by capturing and using all the available data about each user.



As with data warehousing, web stores or any IT platform, an
infrastructure for big data has unique requirements. In
considering all the components of a big data platform, it is
important to remember that the end goal is to easily integrate
your big data with your enterprise data to allow you to conduct
deep analytics on the combined data set.
Infrastructure Requirements: The requirements in a big data
infrastructure span data acquisition, data organization, and data
analysis.
Acquiring Big Data: The acquisition phase is one of the major
changes in infrastructure from the days before big data. Because
big data refers to data streams of higher velocity and higher
variety, the infrastructure required to support the acquisition of
big data must deliver low, predictable latency in both capturing
data and in executing short, simple queries; be able to handle
very high transaction volumes, often in a distributed environment;
and support flexible, dynamic data structures. NoSQL databases
are frequently used to acquire and store big data. They are well
suited for dynamic data structures and are highly scalable. The
data stored in a NoSQL database is typically of a high variety
because the systems are intended to simply capture all data
without categorizing and parsing the data.
For example, NoSQL databases are often used to collect and store
social media data.
While customer facing applications frequently change, underlying
storage structures are kept simple. Instead of designing a schema
with relationships between entities, these simple structures often
just contain a major key to identify the data point, and then a
content container holding the relevant data. This simple and
dynamic structure allows changes to take place without costly
reorganizations at the storage layer.
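A record under this kind of key-plus-container layout can be sketched as follows in Python. The key format and field names are invented for illustration and do not come from any particular NoSQL product.

```python
# A minimal key-value record: a major key identifies the data
# point, and a schemaless content container holds the payload.
record = {
    "key": "user:1001:post:42",          # major key used for lookup
    "content": {                          # free-form content container
        "text": "Loving the new phone!",
        "platform": "twitter",
        "likes": 17,
    },
}

# Because there is no fixed schema, a new field can be added to
# any record without reorganizing the storage layer.
record["content"]["hashtags"] = ["#phone"]
```

Contrast this with a relational design, where adding the hashtags field would require altering a table schema.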
Organizing Big Data:
In classical data warehousing terms, organizing data is called data

integration. Because there is such a high volume of big data,
there is a tendency to organize data at its original storage
location, thus saving both time and money by not moving around
large volumes of data. The infrastructure required for organizing
big data must be able to process and manipulate data in the
original storage location; support very high throughput (often in
batch) to deal with large data processing steps; and handle a
large variety of data formats, from unstructured to structured.
Apache Hadoop is a new technology that allows large data
volumes to be organized and processed while keeping the data on
the original data storage cluster. Hadoop Distributed File System
(HDFS) is the long-term storage system for web logs for example.
These weblogs are turned into browsing behavior (sessions) by
running MapReduce programs on the cluster and generating
aggregated results on the same cluster. These aggregated results
are then loaded into a Relational DBMS system.
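The weblog-to-sessions pipeline described above can be sketched as a MapReduce-style job. This is a simplified, single-process Python illustration, not actual Hadoop MapReduce code; the log format and the 30-minute session gap are assumptions made for the example.

```python
from collections import defaultdict

SESSION_GAP = 30 * 60  # assumed session timeout: 30 minutes

def map_log_line(line):
    # Map: parse "user_id timestamp url" and emit (user_id, timestamp).
    user, ts, _url = line.split()
    return user, int(ts)

def reduce_sessions(timestamps):
    # Reduce: count sessions by splitting the sorted timestamps
    # wherever the gap between requests exceeds the timeout.
    timestamps = sorted(timestamps)
    sessions = 1
    for prev, cur in zip(timestamps, timestamps[1:]):
        if cur - prev > SESSION_GAP:
            sessions += 1
    return sessions

logs = [
    "alice 1000 /home",
    "alice 1300 /cart",
    "alice 9000 /home",   # more than 30 minutes later: a new session
    "bob 2000 /home",
]
grouped = defaultdict(list)
for line in logs:
    user, ts = map_log_line(line)
    grouped[user].append(ts)
session_counts = {u: reduce_sessions(ts) for u, ts in grouped.items()}
# session_counts == {'alice': 2, 'bob': 1}
```

In the Hadoop version, the grouping step is the framework's shuffle, and the aggregated per-user results would be what gets loaded into the relational DBMS.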
Analyzing Big Data:
Since data is not always moved during the organization phase,
the analysis may also be done in a distributed environment,
where some data will stay where it was originally stored and be
transparently accessed from a data warehouse. The infrastructure
required for analyzing big data must be able to support deeper
analytics such as statistical analysis and data mining, on a wider
variety of data types stored in diverse systems; scale to extreme
data volumes; deliver faster response times driven by changes in
behavior; and automate decisions based on analytical models.
Most importantly, the infrastructure must be able to integrate
analysis on the combination of big data and traditional enterprise
data. New insight comes not just from analyzing new data, but
from analyzing it within the context of the old to provide new
perspectives on old problems. For example, analyzing inventory
data from a smart vending machine in combination with the
events calendar for the venue in which the vending machine is
located, will dictate the optimal product mix and replenishment
schedule for the vending machine.
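The vending-machine example amounts to a simple join between sensor data and contextual enterprise data. The sketch below is illustrative Python only; the data values and the naive rule of doubling target stock on event days are invented for the example, not a real analytical model.

```python
# Inventory readings from the smart vending machine (sensor data)
inventory = {"2024-06-01": {"cola": 40, "water": 10}}

# Events calendar for the venue (traditional enterprise data)
events = {"2024-06-01": "stadium concert"}

def replenishment_plan(date, base_level=50, event_boost=2.0):
    # Combine both sources: stock more of everything on event days.
    target = base_level * (event_boost if date in events else 1.0)
    stock = inventory.get(date, {})
    return {item: max(0, int(target) - qty) for item, qty in stock.items()}

plan = replenishment_plan("2024-06-01")
# plan == {'cola': 60, 'water': 90}
```

The insight comes from the combination: neither the inventory stream nor the events calendar alone dictates the replenishment schedule.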
Solution Spectrum
Many new technologies have emerged to address the IT
infrastructure requirements outlined above. At last count, there
were over 120 open source key-value databases for acquiring and
storing big data, with Hadoop emerging as the primary system for
organizing big data and relational databases expanding their
reach into less structured data sets to analyze big data. These
new systems have created a divided solutions spectrum
comprised of:
Not Only SQL (NoSQL) solutions: developer-centric, specialized
systems
SQL solutions: the world typically equated with the
manageability, security, and trusted nature of relational
database management systems (RDBMS)
NoSQL systems are designed to capture all data without
categorizing and parsing it upon entry into the system, and
therefore the data is highly varied. SQL systems, on the other
hand, typically place data in well-defined structures and impose
metadata on the data captured to ensure consistency and
validate data types.


Hadoop is a rapidly evolving ecosystem of components for
implementing the Google MapReduce algorithms in a scalable
fashion on commodity hardware. Hadoop enables users to store
and process large volumes of data and analyze it in ways not
previously possible with less scalable solutions or standard SQL-
based approaches. As an evolving technology solution, Hadoop
design considerations are new to most users and not common
knowledge. As part of the Dell | Hadoop solution, Dell has
developed a series of best practices and architectural
considerations to use when designing and implementing Hadoop
solutions. Hadoop is a highly scalable compute and storage
platform.
While most users will not initially deploy servers numbered in the
hundreds or thousands, Dell recommends following the design
principles that drive large, hyper-scale deployments. This ensures
that as you start with a small Hadoop environment, you can easily
scale that environment without rework to existing servers,
software, deployment strategies, and network connectivity. The
Apache Hadoop project develops open-source software for
reliable, scalable,
distributed computing. The Apache Hadoop software library is a
framework that allows for the distributed processing of large data
sets across clusters of computers using simple programming
models.
It is designed to scale up from single servers to thousands of
machines, each offering local computation and storage. Rather
than rely on hardware to deliver high-availability, the library itself
is designed to detect and handle failures at the application layer,
so delivering a highly available service on top of a cluster of
computers, each of which may be prone to failures.
The project includes these modules:
Hadoop Distributed File System (HDFS): A distributed file system
that provides high-throughput access to application data.
Hadoop MapReduce: A YARN-based system for parallel
processing of large datasets.


Hadoop was created by Doug Cutting, the creator of Apache
Lucene, the widely used text search library. Hadoop has its origins
in Apache Nutch, an open source web search engine, itself a part
of the Lucene project.


The name Hadoop is not an acronym; it is a made-up name. The
project's creator, Doug Cutting, explains that Hadoop is named
after his kid's stuffed yellow elephant. Subprojects and "contrib" modules
in Hadoop also tend to have names that are unrelated to their
function, often with an elephant or other animal theme (Pig, for
example). Smaller components are given more descriptive (and
therefore more mundane) names. This is a good principle, as it
means you can generally work out what something does from its
name. For example, the jobtracker keeps track of MapReduce
jobs. Building a web search engine from scratch was an ambitious
goal, for not only is the software required to crawl and index
websites complex to write, but it is also a challenge to run without
a dedicated operations team, since there are so many moving
parts. It's expensive, too: Mike Cafarella and Doug Cutting
estimated a system supporting a 1-billion-page index would cost
around half a million dollars in hardware, with a monthly running
cost of $30,000. Nevertheless, they believed it was a worthy goal,
as it would open up and ultimately democratize search engine
algorithms.
Nutch was started in 2002, and a working crawler and search
system quickly emerged. However, they realized that their
architecture wouldn't scale to the billions of pages on the Web.
Help was at hand with the publication of a paper in 2003 that
described the architecture of Google's distributed filesystem,
called GFS, which was being used in production at Google. GFS, or
something like it, would solve their storage needs for the very
large files generated as a part of the web crawl and indexing
process. In particular, GFS would free up time being spent on
administrative tasks such as managing storage nodes. In 2004,
they set about writing an open source implementation, the Nutch
Distributed Filesystem (NDFS). (The lowercase form, jobtracker, is
used to denote the entity when it's being referred to generally,
and the CamelCase form JobTracker denotes the Java class that
implements it.)

In 2004, Google published the paper that introduced MapReduce
to the world. Early in 2005, the Nutch developers had a working
MapReduce implementation in Nutch, and by the middle of that
year all the major Nutch algorithms had been ported to run using
MapReduce and NDFS.
NDFS and the MapReduce implementation in Nutch were
applicable beyond the realm of search, and in February 2006 they
moved out of Nutch to form an independent subproject of Lucene
called Hadoop. At around the same time, Doug Cutting joined
Yahoo!, which provided a dedicated team and the resources to
turn Hadoop into a system that ran at web scale (see sidebar).
This was demonstrated in February 2008
when Yahoo! announced that its production search index was
being generated by a 10,000-core Hadoop cluster.
In January 2008, Hadoop was made its own top-level project at
Apache, confirming its success and its diverse, active community.
By this time, Hadoop was being used by many other companies
besides Yahoo!, such as Facebook and the New York Times.
Some applications are covered in the case studies in Chapter 16
and on the Hadoop wiki. In one well-publicized feat, the New York
Times used Amazon's EC2 compute cloud to crunch through four
terabytes of scanned archives from the paper, converting them to
PDFs for the Web. The processing took less than 24 hours to run
using 100 machines, and the project probably wouldn't have been
embarked on without the combination of Amazon's pay-by-the-
hour model (which allowed the NYT to access a large number of
machines for a short period) and Hadoop's easy-to-use parallel
programming model.
In April 2008, Hadoop broke a world record to become the fastest
system to sort a terabyte of data. Running on a 910-node cluster,
Hadoop sorted one terabyte in 209 seconds (just under 3
minutes), beating the previous year's winner of 297 seconds. In
November of the same year, Google reported that its MapReduce
implementation sorted one terabyte in 68 seconds. As the first
edition of this book was going to press (May 2009), it was
announced that a team at Yahoo! used Hadoop to sort one
terabyte in 62 seconds.



Figure depicts the Dell representation of the Hadoop ecosystem.
This model does not include the applications and end user
presentation components, but does enable those to be built in a
standard way and scaled as your needs grow and your Hadoop
environment is expanded. The representation is broken down into
the Hadoop use cases from above: Compute, Storage, and
Database workloads. Each workload has specific characteristics
for operations, deployment, architecture, and management.
Although Hadoop is best known for MapReduce and its distributed
filesystem (HDFS, renamed from NDFS), the term is also used for
a family of related projects that fall under the umbrella of
infrastructure for distributed computing and large-scale data
processing.
Most of the core projects are hosted by the Apache Software
Foundation, which provides support for a community of open
source software projects, including the original HTTP Server from
which it gets its name. As the Hadoop ecosystem grows, more
projects are appearing, not necessarily hosted at Apache, which
provide complementary services to Hadoop, or build on the core
to add higher-level abstractions.
The entire Apache Hadoop platform is now commonly
considered to consist of the Hadoop kernel, MapReduce and
Hadoop Distributed File System (HDFS), as well as a number of
related projects including Apache Hive, Apache HBase, and
others. Hadoop is written in the Java programming language and
is a top-level Apache project being built and used by a global
community of contributors. Hadoop and its related projects (Hive,
HBase, ZooKeeper, and so on) have many contributors from across
the ecosystem.

MapReduce: A distributed data processing model and
execution environment that runs on large clusters of
commodity machines.
HDFS: A distributed filesystem that runs on large clusters of
commodity machines.
Pig: A data flow language and execution environment for
exploring very large datasets. Pig runs on HDFS and
MapReduce clusters.
Hive: A distributed data warehouse. Hive manages data
stored in HDFS and provides a query language based on SQL
(which is translated by the runtime engine to MapReduce jobs)
for querying the data.
HBase: A distributed, column-oriented database. HBase uses
HDFS for its underlying storage, and supports both batch-
style computations using MapReduce and point queries
(random reads).
ZooKeeper: A distributed, highly available coordination
service. ZooKeeper provides primitives such as distributed
locks that can be used for building distributed applications.
Sqoop: A tool to move data efficiently between relational
databases and HDFS.

HDFS is a distributed, scalable, and portable file system written in
Java for the Hadoop framework. Each node in a Hadoop instance
typically has a single namenode; a cluster of datanodes form the
HDFS cluster. The situation is typical because each node does not
require a datanode to be present. Each datanode serves up blocks
of data over the network using a block protocol specific to HDFS.
The file system uses the TCP/IP layer for communication. Clients
use RPC to communicate between each other. HDFS stores large
files (an ideal file size is a multiple of 64 MB), across multiple
machines. It achieves reliability by replicating the data across
multiple hosts, and hence does not require RAID storage on hosts.
With the default replication value, 3, data is stored on three
nodes: two on the same rack, and one on a different rack. Data
nodes can talk to each other to rebalance data, to move copies
around, and to keep the replication of data high.
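The default placement policy described above can be sketched as follows. This is an illustrative Python simplification of the rack-aware logic, not HDFS source code; the rack map and node names are invented for the example.

```python
def place_replicas(nodes_by_rack, replication=3):
    # Follow the policy described above: with replication 3, put
    # two replicas on one rack and one replica on a different rack,
    # so a whole-rack failure cannot destroy all copies of a block.
    racks = sorted(nodes_by_rack)
    first_rack, second_rack = racks[0], racks[1]
    placement = nodes_by_rack[first_rack][:2]      # two on the same rack
    placement += nodes_by_rack[second_rack][:1]    # one on another rack
    return placement[:replication]

cluster = {
    "rack1": ["node-a", "node-b", "node-c"],
    "rack2": ["node-d", "node-e"],
}
placement = place_replicas(cluster)
# placement == ['node-a', 'node-b', 'node-d']
```

Keeping two replicas on one rack limits cross-rack write traffic, while the third replica on another rack protects against losing an entire rack.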
HDFS is not fully POSIX compliant, because the requirements for a
POSIX file system differ from the target goals for a Hadoop
application. The tradeoff of not having a fully POSIX compliant file
system is increased performance for data throughput. HDFS was
designed to handle very large files. HDFS has recently added
high-availability capabilities, allowing the main metadata server
(the namenode) to be failed over manually to a backup in the
event of failure. Automatic fail-over is being developed as well.
Additionally, the file system includes what is
called a secondary namenode, which misleads some people into
thinking that when the primary namenode goes offline, the
secondary namenode takes over. In fact, the secondary
namenode regularly connects with the primary namenode and
builds snapshots of the primary namenode's directory
information, which is then saved to local or remote directories.
These checkpointed images can be used to restart a failed
primary namenode without having to replay the entire journal of
file-system actions and then edit the log to create an up-to-date
directory structure. Because the namenode is the single point for
storage and management of metadata, it can be a bottleneck for
supporting a huge number of files, especially a large number of
small files. HDFS Federation, a newer addition, aims to tackle this
problem to a certain extent by allowing multiple namespaces to be
served by separate namenodes.

An advantage of using HDFS is data awareness between the job
tracker and the task trackers: the job tracker schedules map or
reduce tasks to task trackers with an awareness of the data
location. For example, if node A contains data (x, y, z) and node B
contains data (a, b, c), the job tracker will schedule node B to
perform map or reduce tasks on (a, b, c) and node A to perform map
or reduce tasks on (x, y, z). This reduces the amount of traffic
that goes over the network and prevents unnecessary data transfer.
When Hadoop is used with other file systems this advantage is not
always available, which can have a significant impact on
job-completion times, as has been demonstrated when running
data-intensive jobs.[12]
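The locality-aware scheduling just described can be sketched in a few lines of Python. The node and block names mirror the example in the text; the "first node" fallback is a simplifying assumption, not Hadoop's actual scheduler, which also considers rack-locality and load:

```python
def schedule_tasks(block_locations, nodes):
    """Assign each map task to a node that already holds its block
    (data-local execution), falling back to any node otherwise."""
    assignment = {}
    for block, holders in block_locations.items():
        local = [n for n in holders if n in nodes]
        assignment[block] = local[0] if local else nodes[0]
    return assignment

# Node A holds (x, y, z); node B holds (a, b, c), as in the example above.
locations = {"x": ["A"], "y": ["A"], "z": ["A"],
             "a": ["B"], "b": ["B"], "c": ["B"]}
print(schedule_tasks(locations, ["A", "B"]))
```

Tasks for (a, b, c) land on node B and tasks for (x, y, z) on node A, so no block crosses the network.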
Another limitation of HDFS is that it cannot be mounted
directly by an existing operating system. Getting data
into and out of the HDFS file system, an action that often
needs to be performed before and after executing a job,
can be inconvenient.

A Filesystem in Userspace (FUSE) virtual file system has been
developed to address this problem, at least for Linux and some
other Unix systems.
File access can be achieved through the native Java API; the Thrift
API, to generate a client in the language of the user's choosing
(C++, Java, Python, PHP, Ruby, Erlang, Perl, Haskell, C#, Cocoa,
Smalltalk, and OCaml); the command-line interface; or by browsing
through the HDFS web UI over HTTP.
Other Filesystems
By May 2011, the list of supported filesystems included:
1. HDFS: Hadoop's own rack-aware filesystem.[13] This is
designed to scale to tens of petabytes of storage and runs on top
of the filesystems of the underlying operating systems.
2. Amazon S3 filesystem. This is targeted at clusters hosted on
the Amazon Elastic
Compute Cloud server-on-demand infrastructure. There is no rack-
awareness in this filesystem, as it is all remote.
3. CloudStore (previously Kosmos Distributed File System), which
is rack-aware.


4. FTP Filesystem: this stores all its data on remotely accessible
FTP servers.
5. Read-only HTTP and HTTPS file systems.
Hadoop can work directly with any distributed file system that can
be mounted by the underlying operating system simply by using a
file:// URL; however, this comes at a price: the loss of locality. To
reduce network traffic, Hadoop needs to know which servers are
closest to the data; this is information that Hadoop-specific
filesystem bridges can provide.


We will focus on Hadoop MapReduce, which is the most popular
open source implementation of the MapReduce framework
proposed by Google.
A Hadoop MapReduce job mainly consists of two user-defined
functions: map and reduce.
The input of a Hadoop MapReduce job is a set of key-value pairs
(k, v) and the map function is called for each of these pairs. The
map function produces zero or more intermediate key-value pairs
(k', v'). Then, the Hadoop MapReduce framework groups these
intermediate key-value pairs by intermediate key k' and calls the
reduce function for each group. Finally, the reduce function
produces zero or more aggregated results. The beauty of Hadoop
MapReduce is that users usually only have to define the map and
reduce functions. The framework takes care of everything else
such as parallelization and failover. The Hadoop MapReduce
framework utilises a distributed file system to read and write its
data. Typically, Hadoop MapReduce uses the Hadoop Distributed
File System (HDFS), which is the open source counterpart of the
Google File System. Therefore, the I/O performance of a Hadoop
MapReduce job strongly depends on HDFS.
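The map/group/reduce contract described above can be illustrated with a small single-machine Python sketch (a word count, the canonical example). It imitates the dataflow only; everything the real framework provides, such as parallelization and failover, is omitted:

```python
from itertools import groupby

def map_fn(key, value):
    # Emit an intermediate pair (word, 1) for every word in the line.
    for word in value.split():
        yield word, 1

def reduce_fn(key, values):
    # Aggregate all counts for one intermediate key.
    yield key, sum(values)

def run_job(records, map_fn, reduce_fn):
    """Single-machine imitation of the MapReduce dataflow:
    map every record, group by intermediate key, reduce each group."""
    intermediate = [kv for k, v in records for kv in map_fn(k, v)]
    intermediate.sort(key=lambda kv: kv[0])  # group by intermediate key
    results = []
    for k, group in groupby(intermediate, key=lambda kv: kv[0]):
        results.extend(reduce_fn(k, (v for _, v in group)))
    return results

print(run_job([(0, "big data big analytics")], map_fn, reduce_fn))
# [('analytics', 1), ('big', 2), ('data', 1)]
```

Note that the user only wrote map_fn and reduce_fn; the grouping and iteration belong to the framework, which is exactly the division of labor the text describes.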


One of the major advantages of Hadoop MapReduce is that it
allows non-expert users to easily run analytical tasks over big
data. Hadoop MapReduce gives users full control over how input
datasets are processed. Users code their queries using Java rather
than SQL. This makes Hadoop MapReduce easy to use for a larger
number of developers: no background in databases is required;
only a basic knowledge in Java is required. However, Hadoop
MapReduce jobs are far behind parallel databases in their query
processing efficiency. Hadoop MapReduce jobs achieve decent
performance through scaling out to very large computing clusters.
However, this results in high costs in terms of hardware and
power consumption. Therefore, researchers have carried out much
work to effectively adapt the query processing techniques found in
parallel databases to the context of Hadoop MapReduce.


One of the main performance problems with Hadoop MapReduce
is its physical data organization, including data layouts and
indexes. Data layouts: Hadoop MapReduce jobs often suffer from
a row-oriented layout. The disadvantages of row layouts have
been thoroughly researched in the context of column stores.
However, in a distributed system, a pure column store has severe
drawbacks as the data for different columns may reside on
different nodes leading to high network costs. Thus, whenever a
query references more than one attribute, columns have to be
sent through the network in order to merge different attribute
values into a row (tuple reconstruction). This can significantly
decrease the performance of Hadoop MapReduce jobs. Therefore,
other, more effective data layouts have been proposed in the
literature for Hadoop MapReduce.
Indexes: Hadoop MapReduce jobs often also suffer from the lack of
appropriate indexes. A number of indexing techniques have been
proposed recently. In the third part of this tutorial, we will discuss
the above data layouts and indexing techniques in detail.
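The row-versus-column trade-off can be made concrete with a toy Python sketch: to answer a single-column aggregate, a row layout touches every field of every tuple, while a column layout touches only the referenced column. The table contents and sizes here are invented for illustration:

```python
# Toy table with columns (a, b, c), stored both ways.
rows = [(i, f"name{i}", i * 2) for i in range(1000)]

row_store = rows                            # row layout: whole tuples together
col_store = {"a": [r[0] for r in rows],     # column layout: each column apart
             "b": [r[1] for r in rows],
             "c": [r[2] for r in rows]}

# Query: sum of column c.
# The row layout must touch every field of every row (3 x 1000 values);
# the column layout touches only column c (1000 values).
touched_row = sum(len(r) for r in row_store)
touched_col = len(col_store["c"])
print(touched_row, touched_col)
```

In a distributed setting the picture reverses for multi-attribute queries, since columns living on different nodes must be shipped over the network for tuple reconstruction, which is the drawback the text notes.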


The MapReduce steps are as follows:
Input step: Loads the data into HDFS by splitting it into blocks
and distributing them to the data nodes of the cluster. The
blocks are replicated for availability in case of failures. The
namenode keeps track of the blocks and the data nodes.
Job step: Submits the MapReduce job and its details to the Job Tracker.
Job Init Step: The Job Tracker interacts with Task Tracker on
each data node to schedule Map Reduce tasks.
Map step: Mappers process the data blocks and generate a
list of key-value pairs.
Sort step: The mapper sorts the list of key-value pairs.
Shuffle step: Transfers the mapped output to the reducers in
a sorted fashion.
Reduce step: Reducers merge the lists of key-value pairs to
generate the final result.
Finally, the results are stored in HDFS and replicated as per the
configuration. The results are then read from HDFS by the client.
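The shuffle step above routes each intermediate key to exactly one reducer, by default via hash partitioning. The following Python sketch imitates that behavior; CRC32 stands in for Java's hashCode() purely so the illustration is stable, which is an assumption, not Hadoop's exact function:

```python
from zlib import crc32

def partition(key, num_reducers):
    """Route an intermediate key to one reducer. Hadoop's default
    HashPartitioner does the analogous computation on key.hashCode()."""
    return crc32(key.encode()) % num_reducers

def shuffle(mapped, num_reducers):
    # Group each (key, value) pair into the partition its reducer pulls.
    partitions = [[] for _ in range(num_reducers)]
    for key, value in mapped:
        partitions[partition(key, num_reducers)].append((key, value))
    for p in partitions:
        p.sort()  # each reducer receives its input in sorted key order
    return partitions

mapped = [("big", 1), ("data", 1), ("big", 1), ("analytics", 1)]
print(shuffle(mapped, 2))
```

Because the partition function is deterministic, every copy of a given key lands in the same partition, so one reducer sees the complete group for that key.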




Every two years, the amount of data in the world doubles; by 2015, the total
data on Earth was estimated to reach 7.9 zettabytes. Unstructured data, such as
text and images, accounts for 90% of this amount. From here on, it is highly
anticipated that this massive amount of data will be used in business analytics
to improve operations and offer innovative services.


So how do we get from where we are today to where we want to be? It's clear
many companies lack the basic measures to manage big data, but see huge
potential benefits if they can learn to leverage it effectively. Businesses must
employ a holistic approach to data management (a new approach for many), one
that focuses on the following stages in the data lifecycle:
Identify: Today, every business is a digital company, and every customer or
employee is a content producer. The first step to creating business value is to
prepare the enterprise to quickly accommodate new data sources, then to
understand where the data is coming from, who is creating it and where the
content lives.
Filter: The second step is to determine what information is important and
what does not matter, and to provide tools and data management policies that
enable staff to quickly filter information for relevance. It is vital to
consider how the company will use the data. Next,
companies must identify what filters to apply, how to categorize the data and
then, establish processes so that producers (all employees and customers) are
more accountable for the information they are creating.
Distribute: Because different information is intended for different levels,
locations and business units, companies must utilize a distribution
mechanism that is both automated and intelligent.
Apply: Overall, businesses must evolve from data analysis to insight to
prediction. Applying the right data in the right case is crucial to this
evolution. Some organizations may even find opportunities to monetize data
in new ways and create competitive differentiation.
Underpinning all of the steps of the data lifecycle is search. Search is a critical
foundation to tackle the big data problem and companies must introduce
comprehensive tools that allow employees to find the right real-time data from
across structured and unstructured sources. Companies must develop a data
culture where executives, employees and strategic partners are active participants
in managing a meaningful data lifecycle. Tomorrow's successful companies will be
equipped to harness new sources of information and take responsibility over
accurate data creation and maintenance. This will enable businesses to turn data
from information into business insights.


Until now, businesses were limited to utilizing customer and business information
contained within an in-house system. However, performance improvements in
hardware and the emergence of new services in recent years have shown an
explosive growth in the types of information available for use. Typical examples
are sensing data and lifelog data (Figure 1). The use of sensors and smart
devices is expanding, and more detailed data about people and things is becoming
easier to acquire. We are also seeing a rapid increase of individuals disseminating
information via social networking services and blogs. With big data, it is necessary
to keep an eye on not only the volume, but also the variety and velocity of the data.
Rather than just single-source numerical data, it is necessary to process
unstructured data such as text and images acquired from multiple sources. Data
that was previously acquired over minutes or hours is now acquired in extremely
small units of time, such as every second or every few hundred milliseconds.
Management or processing of data with these
characteristics will inevitably present problems for current database and
processing technology.


How should we utilize such big data? Business intelligence (BI) has developed
along with visualization in the business environment; however, to utilize big
data, visualization is not enough. Incorporating business analytics (BA), which
includes prediction and optimization, is the key to success.
There are three major types of business analytics.
Type 1 is to find the relationship and regularity between data sets. For example,
good customers can be determined based on an analysis of the causal relationship
between their attributes and purchasing history.
Type 2 is to find an optimal solution under a specified set of constraints. This type
is valid for problems where limited resources are used effectively, for instance,
when optimizing order quantity or scheduling shift workers.
Type 3 is to anticipate future trends by understanding customer behaviors.
Attentive services and functions that are ahead of the curve can be offered by
this type. Examples include financial services detecting fraud or anomalies, and
offering recommendations. In order to realize BA, many IT service providers already offer
solutions for large-scale distributed processing (Hadoop), and streaming data
processing (CEP: Complex Event Processing). Some providers have even
established specialized teams for analytics in-company, and continue to advance in
these activities.
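As a minimal illustration of Type 2 (optimization under constraints), the classic economic order quantity (EOQ) formula Q* = sqrt(2DS/H) picks the order size that minimizes combined ordering and holding costs. The figures below are invented, and a real order-optimization system would add the practical constraints mentioned above:

```python
from math import sqrt

def economic_order_quantity(annual_demand, order_cost, holding_cost):
    """Classic EOQ: the order size minimizing ordering plus holding cost.
    annual_demand (D) in units/year, order_cost (S) per order,
    holding_cost (H) per unit per year."""
    return sqrt(2 * annual_demand * order_cost / holding_cost)

# Hypothetical retailer: 12,000 units/year, $50 per order, $2.40/unit/year.
q = economic_order_quantity(annual_demand=12000, order_cost=50.0, holding_cost=2.4)
print(round(q))  # optimal order size in units
```

Real shift-scheduling and order problems are solved with richer constrained-optimization models, but the principle is the same: a best solution under limited resources.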


How far has technology related to big data and business analytics actually
advanced? We believe that advancement is progressing along two axes: the growth
and diversity of information (big data aspects), and analytic sophistication
(analytic aspects).



Use and analysis of large amounts of numerical data from sensors continue to
progress. From here on, in aspects of big data, progression of diversified data such
as unstructured data will lead toward data fusion where data is fused from multiple
sources. For example, in the field of transportation, there is currently an effort to
integrate traffic information as text expressions, and weather information as
graphical expressions to analyze traffic congestion. The important technical point
is how to supplement and overlay data which differ in spatial granularity and
acquired timing. From an analytical aspect, analytical technology for diverse data
continues to progress, leading toward even more accurate future predictions and
control of the real world. For example, a retailer anticipates demand based on
sales and, by automatically calculating the appropriate number of products to
order, determines the actual order amount. For retailers, the need for
optimization of order amount is large because opportunity loss resulting from
inventory shortages and excess inventory is a risk too large to be ignored. Five to
ten years down the line, with an integration of these technologies, comprehensive
decision making, which is currently possible only for humans, will partially be
done by machines. A predecessor to this is the Open Source Indicators project at
the Intelligence Advanced Research Projects Activity (IARPA) of the U.S. Office
of the Director of National Intelligence. This initiative uses Twitter posts,
search engine queries, and street
corner surveillance webcams, integrating all kinds of data for automatic analysis to
assist with identifying revolutionary changes and other important social incidents.
Although it is currently in the research stage, it may help prevent terrorism and
large-scale crimes in the near future.


We should perceive the extensive use of big data as an extension of business
intelligence. Traditional business intelligence was based on aggregate analysis, and
stopped at the point of visualization. However, visualization alone has limits to
how high a degree of knowledge can be derived from data. We can perceive a
wider application of business intelligence including visualization and business
analytics, and categorize Business Intelligence into four categories based on our
data analysis consulting experience in various fields of business. Aggregate
Analysis Business Intelligence immediately aggregates and analyzes all data.
Discovery Business Intelligence analyzes variations to match data granularity and
discovers rules. WHAT-IF Business Intelligence uses simulations and searches
for optimal solutions to optimize business operations. And Proactive Business
Intelligence analyzes data in real time to offer future services that are ahead
of the curve. Of these categories, Discovery BI, WHAT-IF BI, and Proactive BI
are analogs of business analytics, equivalent to the previously described Type 1,
Type 2, and Type 3, respectively. The data analysis methodology BICLAVIS
(developed by NTT DATA) has been developed around
the axis of the previously mentioned four classes. At the core of this lies analysis
scenarios classified into analysis patterns by objective. Based on these scenarios,
BICLAVIS uses a template for efficient analysis. For data warehouses, initial
process assistance is offered in the form of support for proof-of-concept design,
selection of tools for core products, and development of demo systems tailored to
customer requirements.




This section introduces some examples of these initiatives.
(1) Bridge monitoring [Anomaly Detection]
Bridge deterioration involves large maintenance costs. We have built a
demonstration system to detect anomalies denoting distortion in the data sent
from sensors installed on a bridge. (This test case is detailed in a separate
report.)
(2) Optimized supply chain management for CPFR
[Prediction and control]
CPFR (Collaborative Planning, Forecasting, and Replenishment) is a cooperative
initiative between manufacturers and retailers to create sales plans, prevent defects,
and reduce inventory. This effort attempts to predict demand for products over the
short term, and mid/long term, and then create an ordering model based on that
information (how much to order at what timing). Additionally, to evaluate the
effectiveness of this effort, simulations are used that include realistic constraints
such as defective items and delivery dates. Usually, real changes in inventory
ordering formats involve great risk, and pre-evaluation through simulations is of
great value.
(3) Shift scheduling for BPO [Prediction and control]


BPO (Business Process Outsourcing) involves outsourcing all but core aspects of a
business, radically revising business processes and resources. For this
initiative, we execute shift scheduling for offices that handle multiple types
of duties, estimating the work volume of each task while considering the time
limit for completing it, the personnel skills needed, and other real workplace
constraints. With this system, we can automatically generate an optimum
schedule, maximizing BPO effectiveness.
(4) Medical cost reduction policy for health insurance organizations
The increase in insured persons who become seriously ill due to lifestyle diseases
has caused a problem of increased costs for health insurance organizations. From
insured persons' current state of health, we identified those at high risk for
serious diseases and gave health counseling at an early stage to prevent such
increased costs. By looking at past insurance claims, data mining can help identify
patterns that lead toward lifestyle diseases or complications, allowing organizations
to make a list of high-risk patients and to provide health counseling. The result can
lead to prevention of health risks in insured persons, and curtailing extra costs for
the health insurance organization overall.
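The anomaly-detection idea in example (1) can be sketched with a simple rule that flags sensor readings far from the mean. This z-score threshold is a stand-in for whatever method the demonstration system actually uses, and the strain values below are invented:

```python
from statistics import mean, stdev

def anomalies(readings, threshold=2.0):
    """Flag readings further than `threshold` standard deviations from
    the mean -- a toy stand-in for the bridge-sensor detection system."""
    m, s = mean(readings), stdev(readings)
    return [x for x in readings if abs(x - m) > threshold * s]

# Hypothetical strain-gauge readings; the last one denotes distortion.
strain = [1.0, 1.1, 0.9, 1.0, 1.2, 0.95, 9.0]
print(anomalies(strain))
```

Production systems would use streaming statistics or learned models over the sensor feed, but the core question is the same: which readings deviate from normal behavior.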


1. ClearStory Data
Analyzing complex business intelligence doesn't have to be rocket
science. ClearStory Data offers advanced data mining and analytics tools that also
present information in a simple, easy to understand way. ClearStory Data works by
combining your business's internal data with publicly available information to help
you make better business decisions. These insights are displayed using the
StoryBoard feature, which lets you create graphs, story lines and interactive visuals
right from the ClearStory dashboard. It also comes with collaboration features that
enable team discussion, for instance, by commenting on individual StoryBoards,
much like you would on social media. In addition to business data, ClearStory can
also provide department-specific data, including marketing, sales, operations and
customer analytics. The platform also covers a wide range of industries, such as
retail, food and beverage, media and entertainment, financial services,
manufacturing, consumer packaged goods, healthcare, pharmaceutical and more.

2. Kissmetrics
Looking to increase your marketing ROI? Kissmetrics, a popular customer
intelligence and web analytics platform, could be your best friend. The
platform aims to help businesses optimize their digital marketing by
identifying their best customers and increasing conversions. Unlike traditional
web analytics tools, Kissmetrics goes beyond tracking basic metrics like
pageviews, referrals and demographic information. Kissmetrics specifically
tracks visitors, particularly for insights that can be used for better
segmentation and more successful marketing campaigns. Kissmetrics also
offers engagement tools to help increase sales, such as the ability to create
triggers and design styles that make the most out of customer behaviors.
All of this means more conversions, fewer churning customers (those who
quickly leave your site) and, ultimately, higher ROI. In addition,
Kissmetrics offers educational resources to help businesses improve
marketing campaigns, such as marketing webinars, how-to guides, articles
and infographics.
3. InsightSquared
The tools you already use provide another rich source of data. This doesn't
mean you have to waste time mining your own data and arduously
analyzing it using one spreadsheet after another. Instead,
InsightSquared connects to popular business solutions you probably
already use, such as Salesforce, QuickBooks, ShoreTel Sky, Google
Analytics and Zendesk, to automatically gather data and extract
actionable information. For instance, using data from customer relationship
management (CRM) software, InsightSquared can provide a wealth of small business
sales intelligence, such as pipeline forecasting, lead generation and
tracking, profitability analysis, and activity monitoring. It can also help
businesses discover trends, strengths and weaknesses, sales team wins
and losses, and more. In addition to sales tools, InsightSquared's suite of
products also includes marketing, financial, staffing and support analytics
tools. InsightSquared starts at $65 per user per month.

4. Google Analytics
You don't need fancy, expensive software to begin gathering data. It can
start with an asset you already have: your website. Google Analytics,
Google's free web-traffic-monitoring tool, provides all types of data about
website visitors, using a multitude of metrics and traffic sources.
With Google Analytics, you can extract long-term data to reveal trends and
other valuable information, so you can make wise, data-driven decisions.
For instance, by tracking and analyzing visitor behavior, such as where
traffic is coming from, how audiences engage, how long visitors stay, and
what share leave after a single page view (the bounce rate), you can make
better decisions when

striving to meet your website's or online store's goals. Another example is
analyzing social media traffic, which will allow you to make changes to your
social media marketing campaigns based on what is and isn't working.
Studying mobile visitors can also help you extract information about
customers browsing your site using their mobile devices, so you can
provide a better mobile experience.
5. IBM's Watson Analytics
While many Big Data solutions are built for extremely knowledgeable data
scientists and analysts, IBM's Watson Analytics makes advanced and
predictive business analytics easily accessible to small businesses. The
platform doesn't require any special skills in complex data mining
and analysis systems; it automates the process instead. This self-service
analytics solution includes a suite of data access, data refinement and data
warehousing services, giving you all the tools you need to prepare and
present data yourself in a simple and actionable way to guide decision-
making. Unlike other analytics solutions that focus on one area of business,
Watson Analytics unifies all your data analysis projects into a single
platform; it can be used for all types of data analysis, from marketing to
sales, finance, human resources and other parts of your operations. Its
"natural language" technology helps businesses identify problems,
recognize patterns and gain meaningful insights to answer key questions
like what ultimately drives sales, which deals are likely to close, how to make
employees happy and more.
6. Canopy Labs
Big Data won't just help you make better business decisions; it can help
you predict the future, too. Canopy Labs, a customer analytics platform,
uses customer behavior, sales trends and predictive behavioral models to
extract valuable information for future marketing campaigns and to help you
discover the most opportune product recommendations. One of Canopy
Labs' standout features is the 360-degree Customer View, which shows
comprehensive data about each individual customer. Its purpose is two-fold:
first, it reveals each customer's standing, such as lifetime value, loyalty
and engagement level, as well as purchase history, email behavior and other
metrics, showing which customers are profitable and worth reaching out to.
Second, with this information, businesses can better create
personalized offers, track customer responses and launch improved
outreach campaigns. Canopy Labs handles the complex, technical side of
Big Data, so all you have to focus on are your customers.

7. Tranzlogic
It's no secret that credit card transactions are chock full of invaluable data.
Although access was once limited to companies with significant resources,
customer intelligence company Tranzlogic makes this information available
to small businesses without the big business budget. Tranzlogic works
with merchants and payment systems to extract and analyze proprietary
data from credit card purchases. This information can then be used to
measure sales performance, evaluate customers and customer segments,
improve promotions and loyalty programs, launch more-effective marketing
campaigns, write better business plans, and perform other tasks that lead
to smarter business decisions. Moreover, Tranzlogic requires no tech
smarts to get started: it is a turnkey program, meaning there is no
installation or programming required. Simply log in to access your merchant
account.
8. Qualtrics
If you don't currently have any rich sources for data, conducting research
may be the answer. Qualtrics lets businesses conduct a wide range of
studies and surveys to gain quality insights to guide data-driven decision
making. Qualtrics offers three types of real-time insights: customer, market
and employee insights. To gain customer insight, use Qualtrics' survey
software for customer satisfaction, customer experience and website
feedback surveys. To study the market, Qualtrics also offers advertising
testing, concept testing and market research programs. And when it comes
to your team, Qualtrics can help conduct employee surveys, exit interviews
and reviews. Other options include online samples, academic research and
mobile surveys.




Big data is a disruptive force that will affect organizations across industries, sectors
and economies. Not only will enterprise IT architectures need to change to
accommodate it, but almost every department within a company will undergo
adjustments to allow big data to inform and reveal. Data analysis will change,
becoming part of a business process instead of a distinct function performed only
by trained specialists. Big data productivity will come as a result of giving users
across the organization the power to work with diverse data sets through self-
service tools.

And that's just the beginning. Once companies begin leveraging big data for
insight, the action they take based on that insight has the potential to revamp
business as it is known today. If a marketing department can gain immediate
feedback on a new branding campaign by analyzing blog comments and social
networking conversations, do focus groups and customer surveys become
obsolete? Nimble new companies that understand the value of big data will not
only challenge existing competitors, but may also begin defining the way business
is done in their industries. Customer relationships will undergo transformation as
companies strive to quickly understand concepts that previously couldn't be
captured, such as sentiment and brand perception.

Achieving the vast potential of big data calls for a thoughtful, holistic approach to
data management, analysis and information intelligence. Across industries,
organizations that get ahead of big data will create new operational efficiencies,
new revenue streams, differentiated competitive advantage and entirely new
business models. Business leaders should begin thinking strategically about
how to prepare their organizations for big data, and big opportunities.



[1] J. Dean and S. Ghemawat. MapReduce: Simplified Data Processing on
Large Clusters. In OSDI, San Francisco, CA, December 2004.
[2] D. Abadi et al. Column-Oriented Database Systems. PVLDB,
2(2):1664-1665, 2009.
[3] F. N. Afrati and J. D. Ullman. Optimizing Joins in a Map-Reduce
Environment. In EDBT, pages 99-110, 2010.
[4] S. Babu. Towards Automatic Optimization of MapReduce Programs.
In SOCC, pages 137-142, 2010.
[5] S. Blanas et al. A Comparison of Join Algorithms for Log
Processing in MapReduce. In SIGMOD, pages 975-986, 2010.
[6] J. Dean and S. Ghemawat. MapReduce: A Flexible Data
Processing Tool. CACM, 53(1):72-77, 2010.
[7] J. Dittrich, J.-A. Quiane-Ruiz, A. Jindal, Y. Kargin, V. Setty, and J.
Schad. Hadoop++: Making a Yellow Elephant Run Like a Cheetah
(Without It Even Noticing). PVLDB, 3(1):519-529, 2010.
[8] J. Dittrich, J.-A. Quiane-Ruiz, S. Richter, S. Schuh, A. Jindal, and
J. Schad. Only Aggressive Elephants are Fast Elephants. PVLDB.
[9] A. Floratou et al. Column-Oriented Storage Techniques for
MapReduce. PVLDB, 4(7):419-429, 2011.
[10] J. Lin et al. Full-Text Indexing for Optimizing Selection
Operations in Large-Scale Data Analytics. MapReduce Workshop.
[12] H. Herodotou and S. Babu. Profiling, What-if Analysis, and
Cost-based Optimization of MapReduce Programs. PVLDB,
(11):1111-1122, 2011.
[13] E. Jahani, M. J. Cafarella, and C. Re. Automatic Optimization
for MapReduce Programs. PVLDB, 4(6):385-396, 2011.
[15] D. Jiang et al. The Performance of MapReduce: An In-depth
Study. PVLDB, 3(1-2):472-483, 2010.