
Big Data Explained, Analysed, Solved


What you will learn

This eBook gives an overview of what big data is and its growing importance. It talks about some of the different kinds of big data, as well as some of the different things you would do with it.

The functional section of this book discusses applications, tools, managed services and clouds, used together or separately, that will help you benefit most from big data.

You can skip directly to any section and focus on what’s most important to you, or read the book straight through.

Canonical is involved in Big Data

Canonical, the company behind Ubuntu, works closely with its partners on all aspects of infrastructure and partner solutions to support storing, managing, and analysing big data.


The Author

Bill Bauman, Strategy & Content, Canonical, began his technology career in processor
development and has worked in systems
engineering, sales, business development,
and marketing roles. He holds patents
on memory virtualization technologies
and is published in the field of processor
performance. Bill has a passion for emerging
technologies and explaining how things work.
He loves helping others benefit from modern

Bill Bauman
Strategy & Content, Canonical



Overview

• What is Big Data, a general overview
• The increasing importance of Big Data
• Different types of Big Data
• Big Data analysis and action
• Do I need a cloud for Big Data?

Functional

• Design, deploy, and package Big Data solutions
• Juju Big Data Charms
• Juju Big Data Frameworks
• Ubuntu for Big Data systems
• OpenStack is a Big Data warehouse
• BootStack for Big Data
• Ubuntu Advantage Storage for Big Data
• Machine Containers for Big Data

Partnership

• Canonical as a strategic partner for Big Data
• Conclusion
• About Canonical


Big Data general overview

Big Data refers to extremely large sets of data that aren’t easily stored or analysed by traditional methods. Typically the data is too large, too varied in nature, or moves too fast for traditional database systems to handle. These three properties are often referred to as volume, variety, and velocity.

Traditional Data

To understand big data, consider some examples of traditional data. Traditional data may be a database of clients, with their associated contact information. It could be a database of cars, years, makes, and models. This sort of data will usually grow gradually in size, and the types of data stored rarely change.

Traditional data is generally well-structured and fits predefined or predictable categories.

Traditional data at a glance: structured database; predefined datasets; incremental, predictable growth.


Big Data

When we look at Big Data, typically the data is not so neatly organised. Some big data examples could be random spots on a map, documents, images, huge lists of named or unnamed individuals who happened to be in the same general area at a given time, or the millions of clicks on a web page in a given week.

Big Data can be structured or unstructured, but generally the database and analysis tools are specially designed for a given purpose and to handle the tremendously large scale, size, velocity, and variety that most big data datasets represent.

Big data at a glance: often unstructured; purpose-specific toolset; rapid growth.

The general purpose

The reason that these gigantic data sets are being compiled and stored is so that we can analyse the data. Analysis includes pattern recognition, trends, associations, etc. The outcome of analysis is a set of actions that would not be possible without big data.

In the next section of this eBook, we go into further detail about big data analysis - why we do it, why it’s important, and the sort of information we’re looking for.

The increasing importance of Big Data

Collect

Organisations of all sizes and functions are gathering more and more information about their interactions and transactions. They are also looking to third parties to provide additional data. Regardless of how they gather data, the types and quantities are increasing. In a modern, data-driven world, an organisation that isn’t taking advantage of big data collection, analytics, and action is likely to become uncompetitive with those that are.

Analyse

The analysis of big data can have big returns. The ability to understand the types of data that are collected, to correlate one type of data with another, to observe trends, to identify outliers, and to perform many other analytic functions is increasingly valuable in organisations of all types.

Without thorough analysis using modern big data analytics tools, it can be easy to miss or overlook important trends, shifts in perspective, or subtle changes in customer interaction. Through analysis, you can learn patterns, predict actions before they occur, and even begin to direct them via the actions discussed in the Act section.

Act

The ability to do something with the data that is collected and analysed is the most compelling part of big data. Corporations can offer more compelling products and solutions. Governments can better predict and serve the needs of citizens. Even small businesses can identify short and long term trends in their sales and in their interactions with customers, as well as with other businesses. All of these outcomes are about improved efficiencies and experiences for everyone involved, from the provider to the consumer.


Types of Big Data

Big data can be structured or unstructured. New tools and datasets are blurring the lines that separate the two. Below are some common examples of big data types.

Structured big data

Remember, just because it is structured does not mean it isn’t big data. Structured big data could be compiled from millions or billions of data points, daily or even hourly.

User Input

This is data created when a user responds to a prompt or requested action. This could be a ratings system, a survey, a loyalty program, or any other prompt for the user to input specific data in specific fields that are then stored in a structured manner.

Compilations

Compiled big data is the result of merging existing or otherwise disparate databases into a single dataset. For example, the data could include names, locations, demographics, account balances, credit scores, etc, all combined into a compiled big data dataset.

Transactions

Transactional big data is everything having to do with a transaction, including whether the transaction was even completed. The data could include what was purchased, how long it took, whether it was online or in-store, and whether other items were typically purchased together.
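
As a rough illustration of what structured transactional data makes possible, the sketch below counts how often pairs of items are bought together and how many transactions were completed. The records, field names, and values are invented for the example; a real dataset would be far larger and would normally be processed with one of the big data tools covered later in this eBook.

```python
from collections import Counter
from itertools import combinations

# Hypothetical transaction records; field names and values are assumptions.
transactions = [
    {"id": 1, "channel": "online", "completed": True, "items": ["laptop", "mouse", "bag"]},
    {"id": 2, "channel": "in-store", "completed": True, "items": ["laptop", "mouse"]},
    {"id": 3, "channel": "online", "completed": False, "items": ["bag"]},
    {"id": 4, "channel": "online", "completed": True, "items": ["mouse", "bag"]},
]


def co_purchase_counts(records):
    """Count how often each unordered pair of items shares a completed transaction."""
    counts = Counter()
    for record in records:
        if not record["completed"]:
            continue
        # set() drops duplicate items; sorted() gives each pair one canonical order.
        for pair in combinations(sorted(set(record["items"])), 2):
            counts[pair] += 1
    return counts


pair_counts = co_purchase_counts(transactions)
completion_rate = sum(r["completed"] for r in transactions) / len(transactions)

print(pair_counts.most_common(2))
print(f"completed: {completion_rate:.0%}")
```

The same counting idea scales out naturally: each record can be processed independently and the per-pair counts summed afterwards.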


Unstructured Big Data

Big data is most commonly associated with unstructured data. Unstructured data, like photos and IoT datasets, was largely the genesis of modern big data.

User-generated Content

Every day, millions of Internet users post pictures, videos, short messages, audio, and more. Much of this data is not associated with any category or field. Essentially, it is completely unstructured, and it is the function of targeted big data applications to aggregate, cull, present, and analyse these datasets.

Passive Data

This is generally the data that is generated without specific intent or interaction from users. For example, cell phones are perpetually updating the GPS coordinates of their users’ locations. Logistics information, bar code scans, and delivery information are all data that are passively updated but can provide valuable insights when analysed.


Big Data analysis and action

Predictive Analytics

Probably the most common type of analysis, using past patterns or performance to determine future actions is one of the best known uses of big data. It’s important to analyse data from a multitude of different perspectives and to include cross-referenced, sometimes loosely associated data, to establish the most comprehensive patterns and future predictions. Predictive analytics can also be bolstered by machine learning, whereby, over time, the system builds its own intelligence profile on a given subject, individual, or topic.

Descriptive Analytics

The focus here is on metrics, a summary of what has happened. This could be views, clicks, counts, posts, etc. While descriptive metrics are not necessarily very useful on their own, they are the underlying data points that feed more advanced analysis and actions. Descriptive analytics have been used for many years now, and are the foundation of the many graphs and charts we see on the Internet and in presentations today.

Prescriptive Analytics

Largely an intelligent evolution of predictive analytics, a prescriptive approach uses data analysis to determine recommended actions. Where predictive analytics looks at patterns and makes recommendations, prescriptive analytics looks at patterns, associates them with additional datasets, determines where individual data points coincide or where there are recurring common descriptors or activities, and then prescribes a potential course of action or solution. Prescriptive analytics are generally underutilised but offer great potential to reduce time to market for solutions or assessment times for individuals in various fields.
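
To make the three kinds of analytics concrete, here is a minimal sketch on a tiny, made-up series of daily page clicks. It is only an illustration under stated assumptions: the numbers are invented, the prediction is a plain least-squares trend line rather than machine learning, and the capacity threshold and the "add capacity" rule are hypothetical.

```python
from statistics import mean

# Hypothetical daily click counts for one web page (invented numbers).
daily_clicks = [120, 135, 150, 160, 180, 195, 210]


def describe(values):
    """Descriptive: summarise what has already happened."""
    return {"total": sum(values), "mean": mean(values), "max": max(values)}


def predict_next(values):
    """Predictive: fit a least-squares line through the values and extrapolate one step."""
    n = len(values)
    xs = range(n)
    x_bar, y_bar = mean(xs), mean(values)
    slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, values)) / sum(
        (x - x_bar) ** 2 for x in xs
    )
    intercept = y_bar - slope * x_bar
    return intercept + slope * n


def prescribe(forecast, capacity):
    """Prescriptive: turn the forecast into a recommended action (toy rule)."""
    return "add capacity" if forecast > capacity else "no change"


summary = describe(daily_clicks)
forecast = predict_next(daily_clicks)
action = prescribe(forecast, capacity=200)  # 200 is an assumed threshold

print(summary)
print(round(forecast, 1), action)
```

Each step feeds the next: the descriptive metrics are the input to the prediction, and the prediction is the input to the recommended action.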


Do I need a cloud for Big Data?

Even though big data was born in the cloud, it doesn’t mean you need a cloud to take advantage of big data solutions or to act on the data. The most important aspects of working with big data are that you have chosen the right tools and the right applications for your solution. Canonical can help you with both.

Canonical has created an open source solution for system design and service modelling called Juju. Juju simplifies the process of designing your solution, then configuring, associating, and deploying the applications in it. Having a tool like Juju means that selecting the right big data applications for your needs is the most important remaining factor.

For more information on Juju, see the section Design, deploy, package Big Data solutions.

Although it isn’t necessary, a cloud can be tremendously beneficial to big data processing. The nature of big data is that it is constantly changing, and the purpose of that data, the analysis of that data, and the storage of that data can change just as quickly. A tool like Juju can help you keep up with the change in usage by deploying new big data charmed solutions. But Juju can’t do it all.

For system scalability and the ability to easily access different types of storage for different needs, a cloud is recommended. Juju can talk directly to both public and private cloud solutions, like AWS and Canonical OpenStack.

For more on building your own private cloud, see the sections OpenStack is a Big Data warehouse and BootStack for Big Data later in this eBook.


Choosing the right applications

There are many ways to go about application selection. Some people already know which big data processing solutions they want to use. Others are looking for advice, or looking to explore potential new solutions.

In the Juju Big Data Charms section of this book, we outline many big data software solutions that are available, and give a brief description of their purpose. This is a great starting point to see what’s out there, and Juju makes it easy to try them all.

Additionally, in the BootStack for Big Data section of this book, we go into detail on how a BootStack cloud helps to start processing big data quickly and efficiently.

Juju is the game-changing service modelling tool that lets you build entire cloud environments with only a few commands.

BootStack is your OpenStack private cloud, running on your hardware, in your choice of datacentre, with Canonical’s experts responsible for design, deployment and availability.


Design, deploy, package Big Data solutions

Whether in a cloud or on a dedicated system, managing all the applications in a big data solution is best handled by a tool that does more than static configuration management or orchestration deployment.

Juju is a service modelling product from Canonical that gives you a blank canvas on which you can visually lay out all of your big data apps. Communications and data paths are defined as relationships between the applications by connecting the apps on your canvas. The visual solution design and all the application relationships can be deployed immediately, and exported and saved as a bundle for future use.

Juju, Charms, and Bundles

The use of Charms is what gives Juju its incredible capabilities to manage applications in complex infrastructures. Charms are intelligent scripts wrapped around big data applications that allow them to be dynamically configured and deployed without manual configuration.

The abstraction of application relationship management by Juju’s Charms is what allows big data solutions to be rapidly deployed and seamlessly scaled. Without the application abstraction that Juju provides, big data system services require manual intervention or iteration of inflexible, static configuration scripts any time the solution design needs to be updated or changed.

Evolving the solution

When it comes to big data processing, the solution is rarely static. Big data deployments evolve over time, and that often involves adding or removing component services. The same tool that you used to design and deploy the solution can be used to dynamically add and remove components within it.

Juju’s service modelling approach lets you evolve your solution and keep pace with the rapidly changing big data market.
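
The idea of a service model, applications plus the relationships between them, can be sketched in a few lines. The class below is a conceptual toy written for this illustration only: it is not Juju’s API, its CLI, or its bundle format, and the application names and unit counts are arbitrary examples.

```python
class ServiceModel:
    """Toy model of applications and the relationships between them (not Juju itself)."""

    def __init__(self):
        self.applications = {}  # application name -> number of units
        self.relations = set()  # each relation is a frozenset of two application names

    def add_application(self, name, units=1):
        self.applications[name] = units

    def relate(self, app_a, app_b):
        for app in (app_a, app_b):
            if app not in self.applications:
                raise KeyError(f"unknown application: {app}")
        self.relations.add(frozenset((app_a, app_b)))

    def remove_application(self, name):
        # Removing a component also removes every relationship that involves it.
        del self.applications[name]
        self.relations = {r for r in self.relations if name not in r}

    def export(self):
        """Return a plain, serialisable description of the current design."""
        return {
            "applications": dict(self.applications),
            "relations": sorted(sorted(r) for r in self.relations),
        }


model = ServiceModel()
model.add_application("kafka", units=3)
model.add_application("spark", units=2)
model.add_application("cassandra")
model.relate("kafka", "spark")
model.relate("spark", "cassandra")

saved = model.export()                # snapshot of the design for later reuse
model.remove_application("cassandra")  # evolve the solution
print(model.export())
```

The point of the sketch is the shape of the data: because relationships are recorded in the model rather than buried in per-application configuration scripts, adding or removing a component is a change to the model, not a rewrite of the configuration.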


Juju Big Data Charms

Ingest & Messaging

• Message Processing: Flume, Kafka, Storm
• Message Queues: RabbitMQ, ZeroMQ

Structured Data

• MySQL
• PostgreSQL
• Percona Cluster
• MariaDB

Scale Out Storage

• Ceph
• Swift

noSQL

• Stack: ElasticSearch, LogStash, Kibana
• Document Databases: MongoDB, CouchDB, Couchbase
• Column & KV: Cassandra / DSE, quasardb, memcached, Redis

Analytics / Search / Visualisation

• SpagoBI
• Saiku
• Spark
• Datafari (ManifoldCF, SolR)
• Zeppelin
• iPython Notebook

As discussed on the Design, deploy, package Big Data solutions page, these are a sample of the Charms available for big data. With Juju, you can readily deploy any combination of these Charms and define their configurations and data paths, all from a graphical interface, CLI, or API.

Juju Big Data Frameworks

Big data frameworks are available for deployment in Juju. You can deploy an entire Hadoop cluster with a Juju Charm bundle, or Spark, Docker, or Kubernetes, for example. The Charms listed on the Juju Big Data Charms page can all be associated with the frameworks listed here, as appropriate.

All of these frameworks benefit from Juju’s ability to automatically configure application data paths and relationships.

Hadoop

• Hadoop Flavours: Apache Hadoop, Cloudera Hadoop
• Hive
• Mahout
• HBase
• Pig
• ZooKeeper
• Flume
• Kafka
• Tez
• Storm
• Hue

Spark

• Spark
• Spark Streaming
• Spark SQL
• SparkML
• GraphX

Container Ecosystem & Orchestration

• Docker
• LXD / LXC
• Kubernetes
• Mesos

Ubuntu for Big Data systems

Ubuntu Server is the most popular cloud operating system in use. There are many reasons why Ubuntu is so popular, but one of the primary reasons is that Canonical started to focus on OS scalability many years ago. When you’re working with big data, you need a cloud-ready platform, like Ubuntu, that is designed for scalability and reliability.

Ubuntu Server can be used as a traditional operating system. There are also optimised variants for low latency and other task-specific solutions, like big data processing.

Where Ubuntu runs:

• On-premises, in your own cloud
• In an external, private cloud
• On public clouds, like AWS, Azure, Rackspace, Google Cloud Platform, IBM, and many others (see the Ubuntu Certified Public Cloud page for more)

Ubuntu allows you to process your big data anywhere. Keep sensitive information in-house, leverage the public cloud for unpredictable workloads, and use trusted private cloud partners for both.


How Ubuntu runs:

The flexibility of Ubuntu to run anywhere on almost any architecture makes it the ideal platform choice to execute big data workloads.

• Bare metal server on x86, ARM, POWER, or z Mainframe
• Virtual machine on KVM, VMware, Hyper-V, and other hypervisors
• Public cloud guest instance
• Private cloud guest instance
• Container on bare metal
• Container as a virtual machine
• Container as a cloud instance


OpenStack is a Big Data warehouse

The section Do I need a cloud for Big Data? in this book addresses some of the benefits of clouds for big data. Specifically, an OpenStack cloud is the most popular private cloud solution for big data.

OpenStack is a community-based private cloud solution. It is not a single product, but a collection of individual projects designed to seamlessly interact to create a functional cloud. Canonical OpenStack is a production-ready, supported OpenStack distribution, and more.

The best way to build an OpenStack cloud is using Autopilot. Autopilot is a graphical installation tool that allows you to select the components of OpenStack you would like to install and deploys them for you. It can even deploy them with high availability.

Autopilot is designed to work with an extended tool set beyond just OpenStack. MAAS, Metal as a Service, automates the configuration of the physical nodes in your OpenStack environment. Juju, discussed further in the Design, deploy, package Big Data solutions section of this eBook, allows you to automatically deploy applications and their respective relationships within your OpenStack cloud. Landscape manages the Autopilot experience, as well as the cloud itself, and the guest instances within it.

The comprehensive tool set that comes with a Canonical OpenStack cloud makes deploying big data solutions easier, faster, and more robust - from the bare metal, to the platform operating system, to the applications themselves.

The base platform of Canonical OpenStack is Ubuntu. Ubuntu is not only the most popular cloud operating system, it is also the most popular OpenStack infrastructure operating system. Ubuntu runs on the OpenStack physical nodes, providing critical services like compute, networking, and storage. It is also the platform for your guest instances, whether they are LXD machine containers or virtual machines, where you run your big data applications.

Combining OpenStack with Canonical’s feature-rich tools and Ubuntu creates a scalable, reliable, automated platform for deploying and managing big data solutions for any type of analytics, monitoring, and more. Canonical even guarantees the upgradability of your OpenStack Big Data cloud.


Big Data Cloud, quick and easy

BootStack is a unique, managed Canonical OpenStack offering. It is unique in that you may choose to run the solution in your own datacentre, on your own hardware, or in a 3rd-party hosted facility, like IBM SoftLayer, an Ubuntu Certified Public Cloud partner.

Canonical’s engineers have years of OpenStack experience. With BootStack, you can leverage their know-how and best practices and have a Canonical OpenStack cloud ready for big data processing in days.

With BootStack, you focus on the data, and Canonical takes care of the infrastructure. Additionally, when you want, Canonical can transfer total control of your OpenStack environment to you.

All of the tools that make Canonical OpenStack the platform of choice for big data are included in BootStack. Even better, they can be preconfigured for you and ready for use. As soon as your BootStack cloud is ready, you can start using all the big data solutions in the Juju Charm Store. You’ll find the core big data solutions you expect and can even start discovering new big data solutions from all our Charm partners.

BootStack is billed on a pay for use model. The model is similar to that of Ubuntu Advantage Storage. These unique and innovative price models are part of the initiative to make private cloud usage and consumption as easy to calculate and predict as that of public clouds.

Whether you just want to try it out, don’t have the in-house skills, or want to get up and running quickly, BootStack can provide the answer to a big data cloud. To learn more about BootStack, and to use the BootStack calculator to calculate potential savings, visit the BootStack managed cloud page.


Ubuntu Advantage Storage

Ubuntu Advantage Storage is a unique and ideal storage solution for big data storage and real-time processing. It is based on Software Defined Storage (SDS) solutions, allowing for flexibility and modern data management approaches.

Choose the right technology

Ceph, NexentaEdge, Swift and SwiftStack are all supported by Ubuntu Advantage Storage. That means you choose the right technology for your solution, and it is all directly supported by Canonical. The hardware you choose to run the solution on is just as important, and Canonical’s partners and engineers can help you with that, as well.

Pay for what you use

Another unique feature of Ubuntu Advantage Storage is its pay for use, metered model. As opposed to paying for all the storage in your datacentre, you just pay for the storage that’s actively in use. Additionally, you don’t pay for replicas or online backups. The cost savings compared to other SDS-based and managed storage solutions can be 2x to 3x, or even more.

The pay for use model of Ubuntu Advantage Storage is similar to that of our managed OpenStack solution, BootStack. These unique and innovative price models are part of the initiative to make private cloud usage and consumption as easy to calculate and predict as that of public clouds.

Figure: total capacity divided into content storage, redundant data, and unused capacity, with “what you pay for” marked in each case. Captions: “Grow your capacity, without growing your bill” and “Increase your redundancy, pay the same!”
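
The capacity accounting behind a pay for use model can be worked through with a short sketch. All figures below are hypothetical and chosen only to make the arithmetic easy to follow; the sketch covers capacity, not actual pricing, and the ratios it prints are not a statement about real cost savings.

```python
def billable_tb(content_tb):
    """Pay for use: only the content actively stored is metered."""
    return content_tb


def raw_tb_required(content_tb, replicas):
    """Raw disk consumed when every object is stored `replicas` times."""
    return content_tb * replicas


# Hypothetical figures, for illustration only.
content_tb = 100         # data actively in use
replicas = 3             # copies kept for redundancy
total_capacity_tb = 500  # raw capacity installed in the datacentre

raw_used = raw_tb_required(content_tb, replicas)
unused = total_capacity_tb - raw_used
metered = billable_tb(content_tb)

print(f"raw capacity in use: {raw_used} TB, unused: {unused} TB")
print(f"metered capacity:    {metered} TB")
print(f"raw in use / metered:    {raw_used / metered:.1f}x")
print(f"total capacity / metered: {total_capacity_tb / metered:.1f}x")
```

Under these assumptions, adding more disks or increasing the replica count changes `raw_used` and `unused` but leaves the metered figure untouched, which is the behaviour the two figure captions describe.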


Machine Containers for Big Data

Machine containers are a relatively new technology in the virtualisation ecosystem. Delivered by Ubuntu as a technology called LXD, they provide the management of traditional virtual machines without the system overhead.

Many big data solutions execute optimally when run at bare metal speed. That can limit the use of virtualisation, though, and restrict system placement. By using LXD, multiple services can share a single system and all have direct hardware access.

LXD isn’t just about performance. There are big data workloads that run in public clouds as guest instances. Almost all of those instances are virtual machines. One of the benefits of LXD machine containers is that they provide process isolation and application mobility (live migration) to running processes. That means increased manageability for public cloud instances, as well as bare metal and private cloud solutions.


Canonical as a strategic partner for Big Data

Working with Canonical as your valued partner will maximise your success with big data. Some attributes to keep in mind, all of which Canonical delivers, are:

• Scalability
• Application catalog
• Prebuilt, integrated
• Time to solution
• 24/7 Support
• Existing expertise
• Managed offerings
• ...and more

Your strategic big data partner should understand and have experience designing, building, deploying, and managing scalable infrastructures and big data applications. Ideally that partner brings with it an entire ecosystem of additional big data partners. Canonical works closely with a multitude of big data software and platform providers to ensure choice in solutions while maintaining quality and integrity in the overall stack.



Conclusion

There are many kinds of big data. There are many big data applications, services, and solutions. Canonical has domain expertise, understands big data, has strong industry partnerships, and can provide a scalable, supported solution. Your best next step is to contact Canonical today.

Your data is important. You need to know how to store, process, and act on your data. The overview, explanations, and solutions outlined in this book will get you started or accelerate your journey to maximising the benefits of the data you have and the new data you will start collecting.

If you’re excited to hear more and talk to us directly, you can reach us on our Contact Us page.

To learn more about a managed solution for big data, download the paper BootStack Your Big Data Cloud.

If you want to start trying things out immediately, we highly encourage you to visit Juju solutions for big data.


About Canonical

At Canonical, we are passionate about the potential of open source software to transform business. For over a decade, we have supported the development of Ubuntu and promoted its adoption in the enterprise.

By providing custom engineering, support contracts and training, we help clients in the telecoms and IT services industries to cut costs, improve efficiency and tighten security with Ubuntu and OpenStack. We work with hardware manufacturers like HP, Dell and Intel, to ensure the software we create can be delivered on the world’s most popular devices. And we contribute thousands of man-hours every year to projects like OpenStack, to ensure that the world’s best open source software continues to fulfil its potential.
