
Introduction to Big Data and Data Analytics


Big Data Ecosystem

© Copyright IBM Corporation 2019


Course materials may not be reproduced in whole or in part without the written permission of IBM.
U n i t 1 I n t r o d u c t i o n t o B i g D a t a a n d D a t a A n a l yt i c s


Unit objectives
• Develop an understanding of the complete open-source Hadoop
ecosystem and its near-term future directions
• Compare and evaluate the major Hadoop distributions and their
ecosystem components, including their strengths and limitations
• Gain hands-on experience with key components of the big data
ecosystem and their roles in building a complete big data solution to
common business problems
• Learn the tools that will enable you to continue your big data education
after the course
▪ This learning is a life-long technical journey that starts here and continues
throughout your business career

Introduction to Big Data and Data Analytics © Copyright IBM Corporation 2019


Introduction to Big Data


• A tsunami of Big Data
• The Vs of Big Data (3 Vs, 4 Vs, 5 Vs, …)
▪ The count depends on who does the counting
• The Ecosystem
▪ Apache open source
▪ The distributions
▪ The add-ons
▪ Open Data Platform initiative (ODPi.org)
• Some basic terminology



Big Data - a tsunami that is hitting us already


• We are witnessing a tsunami of data:
▪ Huge volumes
▪ Data of different types and formats
▪ Impacting the business at new and ever increasing speeds
• The challenges:
▪ Capturing, transporting, and moving the data
▪ Managing - the data, the hardware involved, and the software
(open source and not)
▪ Processing - from munging the raw data to programming to provide insight
into the data
▪ Storing - safeguarding and securing
− “Big Data refers to non-conventional strategies and innovative technologies used
by businesses and organizations to capture, manage, process, and make sense of
a large volume of data”
• The industries involved
• The future

Big Data - a tsunami that is hitting us already


We are witnessing a tsunami of data: huge volumes, of different types and formats,
that make managing, processing, storing (safeguarding and securing), and
transporting it a real challenge.
Big Data refers to non-conventional strategies and innovative technologies used by
businesses and organizations to capture, manage, process, and make sense of a large
volume of data. Ref.: Jeff Reed (2017), Data Analytics: Applicable Data Analysis to
Advance Any Business Using the Power of Data Driven Analytics
The analogies:
• Elephant (hence the logo of Hadoop)
• Humongous (the word underlying the name MongoDB)
• Streams, data lakes, oceans of data


Data has an intrinsic property…it grows and grows

• 90% of the world's data was created in the last two years
• 80% of the world's data today is unstructured
• 20% of available data can be processed by traditional systems
• 1 in 2 business leaders don't have access to the data they need
• 83% of CIOs cited BI and analytics as part of their visionary plans
• 5.4X more likely that top performers use business analytics



Growing interconnected & instrumented world


Growing interconnected & instrumented world


Gary’s Social Media Count – webpage
• http://www.personalizemedia.com/garys-social-media-count/

Passive RFID grows by 1.12 billion tags in 2014 to 6.9 billion: More than five years later
than the industry had expected, market research and events firm IDTechEx finds that the
passive RFID tag market is now seeing tremendous volume growth.
• http://www.idtechex.com/research/articles/passive-rfid-grows-by-1-12-billion-tags-in-2014-to-6-9-billion-00007031.asp
• http://www.statista.com/statistics/299966/size-of-the-global-rfid-market/
• https://en.wikipedia.org/wiki/Radio-frequency_identification

It is Predicted This Year That 1.3 Billion RFID Tags Will Be Sold (in 2006), but This
Number is Expected to Grow Rapidly over the Next Ten Years
• http://www.businesswire.com/news/home/20060124005042/en/Predicted-Year-1.3-Billion-RFID-Tags-Sold


The number of internet users has increased tenfold from 1999 to 2013. The
first billion was reached in 2005. The second billion in 2010. The third billion in 2014.
• http://www.internetlivestats.com/internet-users/

Growth in the smart electric meter market has cooled during the last year, even though
strong market drivers remain in place on which utilities can build a business
case. Although the direct operational and societal benefits of smart meters and the
broader benefits of smart grids that are enabled by smart meters continue to be
debated among policymakers, utilities, regulators, and consumers, penetration rates
continue to climb, reaching nearly 39 percent in North America in 2012. According to a
recent report from Navigant Research, the worldwide installed base of smart meters will
grow from 313 million in 2013 to nearly 1.1 billion in 2022.
• https://www.navigantresearch.com/newsroom/the-installed-base-of-smart-meters-will-surpass-1-billion-by-2022

Number of monthly active Facebook users worldwide as of 1st quarter 2016 (in millions)
• http://www.statista.com/statistics/264810/number-of-monthly-active-facebook-users-worldwide/

The Top 20 Valuable Facebook Statistics - Updated April 2016


• https://zephoria.com/top-15-valuable-facebook-statistics/

Number of mobile phones to exceed world population by 2014. The ITU expects the
number of cell phone accounts to rise from 6 billion now to 7.3 billion in 2014, compared
with a global population of 7 billion.
In over 100 countries, the number of cell phone accounts exceeds the population.
• http://www.digitaltrends.com/mobile/mobile-phone-world-population-2014/
• http://www.siliconindia.com/magazine_articles/World_to_have_more_cell_phone_accounts_than_people_by_2014-DASD767476836.html


Growth in Internet traffic (PCs, smartphones, IoT,…)


Growth in Internet traffic (PCs, smartphones, IoT,…)


A Cisco report says smartphone traffic will exceed PC traffic by 2020. IP traffic will
grow in a massive way as 10 billion new devices come online over the next five years.
Advancements in the Internet of Things (IoT) are continuing to drive IP traffic and
tangible growth in the market.
Graphic: NetworkWorld, 7 Jun 2016
Reference:
• http://www.networkworld.com/article/3080001/lan-wan/cisco-ip-traffic-will-surpass-the-zettabyte-level-in-2016.html
Applications such as video surveillance, smart meters, digital health monitors and a
host of other Machine-to-Machine services are creating new network requirements and
incremental traffic increases.
Annual global IP traffic will surpass the zettabyte (ZB; 1000 exabytes) threshold in
2016 and will reach 2.3 ZB by 2020. Global IP traffic will reach 1.1 ZB per year, or 88.7
EB (one EB is one billion gigabytes [GB]) per month, in 2016. By 2020, global IP traffic
will reach 2.3 ZB per year, or 194 EB per month.
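As a quick sanity check on the unit arithmetic above (decimal prefixes: 1 ZB = 1000 EB, 1 EB = one billion GB), the yearly-to-monthly conversion can be sketched as:

```python
# Convert annual IP traffic in zettabytes to monthly traffic in exabytes.
# Decimal prefixes: 1 ZB = 1000 EB.

def zb_per_year_to_eb_per_month(zb_per_year):
    return zb_per_year * 1000 / 12

# 1.1 ZB/year works out to ~91.7 EB/month and 2.3 ZB/year to ~191.7 EB/month;
# the Cisco figures of 88.7 and 194 EB/month reflect their own rounding of
# the underlying traffic estimates.
print(round(zb_per_year_to_eb_per_month(1.1), 1))
print(round(zb_per_year_to_eb_per_month(2.3), 1))
```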


The number of devices connected to IP networks will be three times as high as the
global population in 2020. There will be 3.4 networked devices per capita by 2020, up
from 2.2 networked devices per capita in 2015. Accelerated in part by the increase in
devices and the capabilities of those devices, IP traffic per capita will reach 25 GB per
capita by 2020, up from 10 GB per capita in 2015.


Some examples of Big Data


• Science
▪ Astronomy
▪ Atmospheric science
▪ Genomics
▪ Biogeochemical
▪ Biological
▪ Other complex / interdisciplinary scientific research
• Social
▪ Social networks
▪ Social data
− Person to person (P2P, C2C): Wish Lists on Amazon.com, Craigslist
− Person to world (P2W, C2W): Twitter, Facebook, LinkedIn
• Medical records
• Commercial
▪ Web / event / database logs
▪ "Digital exhaust" - the result of human interaction with the Internet
• Sensor networks
• RFID
• Internet text and documents
• Internet search indexing
• Call detail records (CDR)
• Photographic archives
• Video / audio archives
• Large scale eCommerce
• Regular government business and commerce needs
• Military and homeland security surveillance



The four classic dimensions of Big Data (4 Vs)

…and a 5th V - Value - that is the real purpose of working with Big Data:
to obtain business insight


The four classic dimensions of Big Data (4 Vs)


https://www.promptcloud.com/blog/The-4-Vs-of-Big-Data-for-Yielding-Invaluable-Gems-of-Information
Volume
As the name suggests, the main characteristic of big data is its huge volume, collected
through various sources. We are used to measuring data in gigabytes or terabytes.
However, according to various studies, the volume of big data created so far is in
zettabytes, where one zettabyte is equivalent to a trillion gigabytes. One zettabyte is
equivalent to approximately 3 million galaxies of stars. This gives you an idea of the
colossal volume of data available for business research and analysis.
Take any sector and you can see that it is flooded with data. Travel, education,
entertainment, health, banking, shopping - each and every sector can benefit
immensely from the big data advantage. Data is collected from diverse sources, which
include business transactions, social media, sensors, surfing history, etc.


With every passing day, data is growing exponentially. Thousands of terabytes worth of
data are created every minute worldwide via Facebook, tweets, instant messages, email,
internet usage, mobile usage, product reviews, etc. Every minute, hundreds of Twitter
accounts are created, thousands of applications are downloaded, and thousands of
new posts and ads are posted. According to experts, the amount of big data in the
world is likely to double every two years. This will provide immense data in the coming
years and also calls for smarter data management.
As the volume of data grows at tremendous speed, traditional database technology
will not suffice for efficient data management, i.e. storage and analysis. The need of
the hour is large-scale adoption of new-age tools like Hadoop and MongoDB. These
use distributed systems to facilitate storage and analysis of this enormous big data
across various databases. This information explosion has opened new doors of
opportunity in the modern age.
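The divide-and-recombine idea behind such distributed tools can be sketched on a single machine with a toy word count; the partition/merge split below only mimics, in miniature, what Hadoop MapReduce does across a cluster (the function names are illustrative, not a real Hadoop API):

```python
# Toy, single-machine illustration of the split/partial-aggregate/merge
# pattern that distributed engines such as Hadoop MapReduce apply at
# cluster scale.
from collections import Counter

def map_partition(lines):
    """Count words locally within one partition (the 'map' side)."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return counts

def reduce_counts(partials):
    """Merge per-partition counts into a global result (the 'reduce' side)."""
    total = Counter()
    for partial in partials:
        total += partial
    return total

# Two "partitions" of input text, counted independently, then merged.
partitions = [["big data big"], ["data data"]]
print(reduce_counts(map_partition(p) for p in partitions))
```

In a real cluster the partitions live on different machines and only the small partial counts travel over the network, which is what makes the pattern scale.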
Variety
Big data is collected and created in various formats and sources. It includes structured
data as well as unstructured data like text, multimedia, social media, business reports
etc.
Structured data such as bank records, demographic data, inventory databases,
business data, product data feeds have a defined structure and can be stored and
analyzed using traditional data management and analysis methods.
Unstructured data includes captured data like images, tweets or Facebook status
updates, instant messenger conversations, blogs, video uploads, voice recordings, and
sensor data. These types of data do not have any defined pattern. Unstructured data is
most of the time a reflection of human thoughts, emotions, and feelings, which can
sometimes be difficult to express in exact words.
As the saying goes “A picture paints a thousand words”, one image or video which is
shared on social networking sites and applauded by millions of users can help in
deriving some crucial inferences. Hence, it is the need of the hour to understand this
non-verbal language to unlock some secrets of market trends.
One of the main objectives of big data is to collect all this unstructured data and analyze
it using the appropriate technology. Data crawling, also known as web crawling, is a
popular technology used for systematically browsing the web pages. There are
algorithms designed to reach the maximum depth of a page and extract useful data
worth analyzing.
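A minimal sketch of that systematic-browsing idea is a breadth-first, depth-limited crawl; here `fetch_page` is a hypothetical stand-in for the real HTTP-fetch-and-extract-links step:

```python
# Breadth-first, depth-limited crawl skeleton. fetch_page is a stand-in
# for a real HTTP fetch plus link extraction: fetch_page(url) must
# return (page_text, [linked_urls]).
from collections import deque

def crawl(start_url, fetch_page, max_depth=2):
    """Visit pages reachable from start_url; return {url: page_text}."""
    seen = {start_url}
    results = {}
    queue = deque([(start_url, 0)])
    while queue:
        url, depth = queue.popleft()
        text, links = fetch_page(url)
        results[url] = text
        if depth < max_depth:            # stop following links at max depth
            for link in links:
                if link not in seen:     # never fetch the same page twice
                    seen.add(link)
                    queue.append((link, depth + 1))
    return results
```

A production crawler would add politeness delays, robots.txt handling, and error recovery on top of this skeleton.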
Variety of data definitely helps to get insights from different sets of samples, users, and
demographics. It helps bring different perspectives to the same information. It also
allows analyzing and understanding the impact of different forms and sources of data
collection from a 'larger picture' point of view.


For instance, in order to understand the performance of a brand, traditional surveys are
one of the forms of data collection. This is done by selecting a sample, mostly from
panels. The advantage of this approach is that you get direct answers to the questions.
However, we can obtain real time feedback through various other forms like Facebook
activity, product review blogs, and updates posted by customers on merchant websites
like Flipkart, Amazon, and Snapdeal. A combination of these two forms of data definitely
gives a data-backed, clearer perspective to your business decision making process.
Velocity
In today’s fast paced world, speed is one of the key drivers for success in your business
as time is equivalent to money. Fast turn-around is one of the pre-requisites to stay
alive in this fierce competition. Expectations of quick results and quick deliverables are
pressing to a great extent. In such scenarios, it becomes vital to collect and analyze
vast amounts of disparate data swiftly, in order to make well-informed decisions in real
time. Low velocity of even high-quality data may hinder the decision making of a
business.
The general definition of Velocity is ‘speed in a specific direction’. In big data, Velocity is
the speed or frequency at which data is collected in various forms and from different
sources for processing. The frequency of specific data collected via various sources
defines the velocity of that data. In other terms, it is data in motion to be captured and
explored. It ranges from batch updates, to periodic to real-time flow of the data.
The frequency of Facebook status updates shared, and messages tweeted every
second, videos uploaded and/or downloaded every minute, or the online/offline bank
transactions recorded every hour, determine the velocity of the data. You can relate
velocity with the amount of trade information captured during each trading session in a
stock exchange. Imagine a video or an image going viral in the blink of an eye,
reaching millions of users across the world. Big data technology allows you to process
real-time data, sometimes without even capturing it in a database.
Streams of data are processed and databases are updated in real-time, using parallel
processing of live streams of data. Data streaming helps extract valuable insights from
incessant and rapid flow of data records. A streaming application like Amazon Web
Services (AWS) Kinesis is an example of an application that handles the velocity of
data.
The higher the frequency of data collection into your big data platform in a stipulated
time period, the more likely you are to make accurate decisions at the right time.
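The "data in motion" idea can be sketched as a simple fixed-window aggregation: events are counted per time window as they arrive, and only the running counts are kept rather than the raw stream. This is a toy stand-in for what engines like Kinesis or Spark Streaming do at scale:

```python
# Fixed-window stream aggregation: bucket each event by the time window
# it falls in and keep only the counts, not the raw records.
from collections import defaultdict

def window_counts(events, window_seconds=60):
    """events: iterable of (timestamp_in_seconds, payload) pairs.
    Returns {window_index: event_count}."""
    counts = defaultdict(int)
    for ts, _payload in events:
        counts[ts // window_seconds] += 1   # integer division picks the window
    return dict(counts)

stream = [(5, "tweet"), (30, "tweet"), (70, "upload"), (125, "trade")]
print(window_counts(stream))   # {0: 2, 1: 1, 2: 1}
```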


Veracity
The fascinating trio of volume, variety, and velocity of data brings along a mixed bag of
information. It is quite possible that such huge data may have some uncertainty
associated with it. You will need to filter out clean and relevant data from big data, to
provide insights that power up your business. In order to make accurate decisions, the
data you have used as an input should be appropriately compiled, conformed,
validated, and made uniform.
There are various causes of data contamination, like data entry errors or typos (mostly
in structured data), wrong references or links, junk data, pseudo data, etc. The
enormous volume, wide variety, and high velocity, in conjunction with high-end
technology, hold no significance if the data collected or reported is incorrect. Hence,
data trustworthiness (in other words, data quality) holds the highest importance in the
big data world.
In an automated data collection, analysis, report generation, and decision making
process, it is essential to have a foolproof system in place to avoid any lapses. Even
the most minor slippage at any stage of the big data extraction process can cause an
immense blunder.
Any reports generated based on a certain type of data from a certain source must be
validated for accuracy and reliability. It is always advisable to have two different
methods and sources to validate the credibility and consistency of the data, to avoid
any bias. It is not only about accuracy after data collection: determining the right
source and form of the data, the required amount or size of the data, and the right
method of analysis all play a vital role in procuring impeccable results. Integrity, in any
field of business or personal life, holds the highest significance, and hence proper
measures must be put in place to take care of this crucial aspect. It will allow you to
position yourself in the market as a reliable authority and help you attain greater
heights of success.
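The two-source validation advice above can be sketched as a simple consistency check; the dict-based records and the 5% relative tolerance are illustrative assumptions, not a prescribed method:

```python
# Cross-check the same metric reported by two independent sources and
# flag records whose values disagree beyond a relative tolerance.

def cross_validate(source_a, source_b, tolerance=0.05):
    """source_a, source_b: dicts of record_id -> value.
    Returns the sorted ids whose values differ by more than `tolerance`
    (relative), i.e. the records that need manual review."""
    suspect = []
    for key in source_a.keys() & source_b.keys():   # only shared records
        a, b = source_a[key], source_b[key]
        denom = max(abs(a), abs(b)) or 1.0          # avoid division by zero
        if abs(a - b) / denom > tolerance:
            suspect.append(key)
    return sorted(suspect)

survey = {"r1": 100.0, "r2": 50.0, "r3": 7.0}
web_feedback = {"r1": 101.0, "r2": 80.0, "r3": 7.0}
print(cross_validate(survey, web_feedback))   # ['r2']
```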
Parting thoughts
These 4 Vs are like four pillars lending stability to the giant structure of big data, and
adding a precious 5th "V" - value - to the information procured serves the whole
purpose of big data: smart decision making.


The 4 Vs of Big Data - IBM Infographic

Infographic: http://www.ibmbigdatahub.com/sites/default/files/infographic_file/4-Vs-of-big-data.jpg


The 4 Vs of Big Data - IBM Infographic


Infographic available at:
http://www.ibmbigdatahub.com/sites/default/files/infographic_file/4-Vs-of-big-data.jpg
Sources of the data shown: McKinsey Global Institute, Twitter, Cisco, Gartner, EMC,
IBM, MEPTEC, QAS
From traffic patterns and music downloads to web history and medical records, data is
recorded, stored, and analyzed to enable the technology and services that the world
relies on every day. But what exactly is big data, and how can these massive amounts
of data be used?
We can break big data down into four dimensions: Volume, Velocity, Variety, and
Veracity.
Depending on the industry and organization, big data encompasses information from
multiple internal and external sources such as transactions, social media, enterprise
content, sensors, and mobile devices. Companies can leverage data to adapt their
products and services to better meet customer needs, optimize operations and
infrastructure, and find new sources of revenue.
By 2015, 4.4 million IT jobs will be created globally to support big data with 1.9 million in
the United States.


4 dimensions: Volume, Velocity, Variety, & Veracity


• Depending on the industry and organization, big data encompasses
information from multiple internal and external sources such as
− Transactions
− Social media
− Enterprise content
− Sensors
− Mobile devices
• Companies can leverage data to
− adapt their products and services to
better meet customer needs
− optimize operations and infrastructure, and
− find new sources of revenue.
• This decade, over 5 million IT jobs will be created globally to support
big data with 2+ million in the United States.


4 dimensions: Volume, Velocity, Variety, & Veracity


References:
http://www.ibmbigdatahub.com/tag/587
Data is being produced at astronomical rates. In fact, 90% of the data in the world
today was created in the last two years! The term “big data” can be defined as data
that becomes so large that it cannot be processed using conventional methods. The
size at which data can be considered Big Data is constantly varying, and newer tools
are continuously being developed to handle it. It is
changing our world completely and shows no signs of being a passing fad that will
disappear anytime in the near future.
In order to make sense of this overwhelming amount of data, it is often broken down
using the four V's listed below:
Volume: refers to the humungous amounts of data generated each second from social
media, cell phones, cars, credit cards, M2M sensors, photographs, video, etc. These
vast amounts of data have become so large in fact that we can no longer store and
analyze data using traditional database technology. We now use distributed systems,
where parts of the data are stored in different locations and brought together by software.


Velocity: refers to the speed at which vast amounts of data are being generated,
collected, and analyzed. Every day, the number of emails, Twitter messages, photos,
video clips, etc. increases at lightning speed around the world. Every second of every
day, data is increasing. Not only must it be analyzed, but the speed of transmission
and access to the data must also remain instantaneous, to allow for real-time access
to websites, credit card verification, and instant messaging. Big data technology now
allows us to analyze the data while it is being generated, without ever putting it into
databases.
Variety: defined as the different types of data we can now use. Data today looks very
different than data from the past. We no longer just have structured data (name, phone
number, address, financial info, etc) that fits nice and neatly into a data table. Today’s
data is unstructured. In fact, 80% of all the world’s data fits into this category, including
photos, video sequences, social media updates, etc. New and innovative big data
technology is now allowing structured and unstructured data to be harvested, stored,
and used simultaneously.
Veracity: last here, but certainly not the least. Veracity is the quality or trustworthiness
of the data. Just how accurate is all this data? For example, think about all the Twitter
posts with hash tags, abbreviations, misspellings, typos, etc., and the reliability and
accuracy of all that content. Working with tons and tons of data is of no use if its quality
or trustworthiness is suspect. Traditionally, mainframe data was considered the most
trustworthy, but what about this new data? Quality, accuracy, and precision are needed.


Volume & Velocity of Big Data

• Volume - scale of data
▪ 40 ZB (40 trillion gigabytes) of data will be created by 2020, an increase of
300 times from 2005
▪ 2.5 quintillion bytes (trillion GB) of data are created each day
▪ 100 terabytes (100,000 GB): most companies in the US have at least 100 TB
of data stored
▪ 6 billion cell phones (world population: 7 billion)
▪ Internet of Things (IoT)
• Velocity - analysis of streaming data
▪ 1 TB of trade information: the New York Stock Exchange captures 1 TB of
trade information during each trading session
▪ 18.9 billion network connections: by 2016, it is projected there will be almost
2.5 connections per person on earth
▪ 100 sensors: modern cars have close to 100 sensors that monitor items such
as fuel level and tire pressure

Volume & Velocity of Big Data


The numbers shown here are representative, but are growing and changing rapidly in
an ever changing data world.
INFOBESITY VS. SMART DATA
Today we generate a staggering amount of data. Eighty percent of the world’s data has
been created in the past two years. In fact, the International Data Corporation predicts
that by 2020 some 1.7 megabytes of new information will be created every second for
every human being on the planet. To put that in perspective, in 1969, astronauts flew to
the moon and back using computers with only 2 kilobytes (0.002 megabytes) of
memory.
All this data has in some ways been a giant leap for mankind, but “infobesity” can
paralyze. Separating the signals from the noise can be transformative. Indeed,
advances in data science and database technology have made it possible to unlock
insights from these vast troves of data, creating opportunities for a wide range of
industries. E-commerce platforms like Amazon, Alibaba, and eBay use powerful
algorithms to predict purchase preferences and make timely product recommendations
with precision. Spotify can suggest artists, albums and songs by constantly
analyzing what music you - and people like you - listen to. More than half of the
programs watched by Netflix’s 70 million-plus users start with a system-generated
recommendation.


The list goes on. If you aren’t using big data, you have big problems.
See: Classrooms of the Future, http://www.ozy.com/pov/classrooms-of-the-future/66311


Variety & Veracity of Big Data

• Variety - different forms of data
▪ Healthcare - 150 exabytes: as of 2011, the global size of data in healthcare
was estimated to be 150 EB (billion gigabytes)
▪ 420 million wearables: by 2014, it's anticipated there will be 420 million
wearable, wireless health monitors
▪ Social data:
− 30 billion pieces of content are shared on Facebook every month
− 400 million tweets/day by about 200 million monthly-active users
− 4+ billion hours/month of video are watched on YouTube
• Veracity - uncertainty of data
▪ 1 in 3 business leaders don't trust the information they use to make decisions
▪ 27% of respondents in one survey were unsure of how much of their data was
inaccurate
▪ $3.1 trillion a year: poor data quality costs the US economy around
$3.1 trillion a year



And, of course, some people have more than 4 Vs

http://blogs.systweak.com/2017/03/big-data-vs-represents-characteristics-or-challenges-of-big-data

And, of course, some people have more than 4 Vs


Reference:
http://blogs.systweak.com/2017/03/big-data-vs-represents-characteristics-or-challenges-of-big-data/
The article also adds:
• Viability
• Volatility
• Vulnerability


Or even more…
• Volume - how much data is there?
• Velocity - how quickly is the data being created, moved, or accessed?
• Variety - how many different types of sources are there?
• Veracity - can we trust the data?
• Validity - is the data accurate and correct?
• Viability - is the data relevant to the use case at hand?
• Volatility - how often does the data change?
• Vulnerability - can we keep the data secure?
• Visualization - how can the data be presented to the user?
• Value - can this data produce a meaningful return on investment?

https://healthitanalytics.com/news/understanding-the-many-vs-of-healthcare-big-data-analytics


Or even more…
Extracting actionable insights from big data analytics - and perhaps especially
healthcare big data analytics - is one of the most complex challenges that organizations
can face in the modern technological world.
In the healthcare realm, big data has quickly become essential for nearly every
operational and clinical task, including population health management, quality
benchmarking, revenue cycle management, predictive analytics, and clinical decision
support.
The complexity of big data analytics is hard to break down into bite-sized pieces, but
the dictionary has done a good job of providing pundits with some adequate
terminology.
Data scientists and tech journalists both love patterns, and few are more pleasing to
both professions than the alliterative properties of the many V’s of big data.
Originally, there were only the big three - volume, velocity, and variety - introduced by
Gartner analyst Doug Laney all the way back in 2001, long before “big data” became a
mainstream buzzword.
As enterprises started to collect more and more types of data, some of which were
incomplete or poorly architected, IBM was instrumental in adding the fourth V, veracity,
to the mix.


Subsequent linguistic leaps have resulted in even more terms being added to the litany.
Value, visualization, viability, vulnerability, volatility, and validity have all been proposed
as candidates for the list.
Each term describes a specific property of big data that organizations must understand
and address in order to succeed with its chosen initiatives. The article applies this list
specifically to healthcare analytics.


Types of Big Data


• Structured
▪ Data that can be stored and processed in a fixed format, i.e. a schema
• Semi-structured
▪ Data that does not have the formal structure of a data model, i.e. a table
definition in a relational DBMS, but nevertheless has organizational
properties such as tags and other markers to separate semantic elements,
which makes it easier to analyze, e.g. XML or JSON
• Unstructured
▪ Data that has an unknown form, cannot be stored in an RDBMS, and cannot
be analyzed unless it is first transformed into a structured format
▪ Text files and multimedia content such as images, audio, and video are
examples of unstructured data - it is growing faster than the other types,
and experts estimate that 80 percent of the data in an organization is
unstructured

Types of Big Data


Structured: Data that can be stored and processed in a fixed format is called
structured data. Data stored in a relational database management system (RDBMS) is
one example of structured data. It is easy to process structured data because it has a
fixed schema; Structured Query Language (SQL) is often used to manage such data.
Semi-structured: Semi-structured data does not have the formal structure of a data
model, i.e. a table definition in a relational DBMS, but nevertheless has organizational
properties such as tags and other markers to separate semantic elements, which
makes it easier to analyze. XML files and JSON documents are examples of
semi-structured data.
Unstructured: Data that has an unknown form, cannot be stored in an RDBMS, and
cannot be analyzed unless it is transformed into a structured format is called
unstructured data. Text files and multimedia content such as images, audio, and video
are examples of unstructured data. Unstructured data is growing faster than the other
types; experts estimate that 80 percent of the data in an organization is unstructured.


Big Data Analytic Techniques


An Insight into
26 Big Data Analytic
Techniques


Big Data Analytic Techniques


Reference:
• An Insight into 26 Big Data Analytic Techniques: Part 1
http://blogs.systweak.com/2016/11/an-insight-into-26-big-data-analytic-
techniques-part-1/
• An Insight into 26 Big Data Analytic Techniques: Part 2
http://blogs.systweak.com/2016/11/an-insight-into-26-big-data-analytic-
techniques-part-2/


Five key Big Data Use Cases


Five key Big Data Use Cases


Common Use Cases applied to Big Data


• Extract / Transform / Load (ETL)
− Common to business intelligence & data warehousing
− But with Big Data we will see these as Extract / Load / Transform (ELT)
• Text mining
• Index building
• Graph creation and analysis
• Pattern recognition
• Collaborative filtering
• Predictive models
• Sentiment analysis
• Risk assessment…

And what do these workloads have in common?
The nature of the data…
• Volume
• Velocity
• Variety
…the Vs that we have met
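The ETL-versus-ELT point above can be sketched in a few lines of Python (a toy illustration, not any particular product's pipeline; the data and function names are invented): ETL transforms and validates before loading, while ELT loads everything raw and applies the transform later, when the data is read.

```python
# Toy contrast between ETL and ELT (all names here are illustrative).
raw_events = ["2019-01-01,click,9", "2019-01-02,view,3", "bad-row"]

def transform(line):
    date, action, count = line.split(",")
    return {"date": date, "action": action, "count": int(count)}

# ETL: transform first, load only clean rows into the target store.
etl_store = []
for line in raw_events:
    try:
        etl_store.append(transform(line))
    except ValueError:
        pass  # rejected before loading

# ELT: load everything as-is; transform when the data is read.
elt_store = list(raw_events)
clean = []
for line in elt_store:
    try:
        clean.append(transform(line))
    except ValueError:
        pass  # bad rows stay in the raw store for later inspection

print(len(etl_store), len(elt_store), len(clean))  # 2 3 2
```

With big data volumes, deferring the transform (ELT) keeps the raw record available for reprocessing when requirements change - a key reason the order flips.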


Common Use Cases applied to Big Data


“Data is the New Oil”


“Data is the New Oil”


“Data is the new oil.” Coined in 2006 by Clive Humby, a British data commercialization
entrepreneur. This now-famous phrase was embraced by the World Economic Forum in
a 2011 report, which considered data to be an economic asset like oil.
In this infographic, famed graphic designer Nigel Holmes explores our burgeoning
digital universe.
“Information is the oil of the 21st century, and analytics is the combustion engine”
- Peter Sondergaard


Some statistics: Facebook


• 955 million accounts using 70 languages
▪ 10% of the population of the world is connected on FB
• 140 billion photos uploaded
▪ 300 million per day
▪ The collection is 10,000 times larger than that of the Library of Congress
• 30 billion pieces of information posted per day
• 2.7 billion likes and comments per day
• 125 billion friend connections
▪ On average, users have 234 friends
• Now just 4 degrees of separation


Some statistics: Facebook


Statistics from:
The Human Face of Big Data, 2012

Problems:
1. Invades our privacy
2. Substitutes phony relationships for real ones


fivethirtyeight.com - Nate Silver


• FiveThirtyEight's mission
▪ Originally to help New York Times readers cut through the clutter of this
data-rich world - now owned by ESPN
▪ 538 is the number of electors in the United States electoral college
(presidential elections)
• Blog: Politics, Sports, Science & Health, Economics, Culture
▪ The blog is devoted to rigorous analysis of politics, polling, public affairs,
sports, science and culture, largely through statistical means
• Nate Silver
▪ Nate Silver built an innovative system for predicting baseball performance,
predicted the 2008 election within a hair’s breadth, and became a
national sensation as a blogger - all by the time he was thirty
▪ He solidified his standing as the nation's foremost political forecaster with
his near perfect prediction of the 2012 election - less so in 2016


fivethirtyeight.com - Nate Silver


Nate Silver books:
• Baseball Between the Numbers (2006)
• The Signal and the Noise (2012)
• The Best American Infographics (2014)

References:
• https://en.wikipedia.org/wiki/FiveThirtyEight
• http://fivethirtyeight.com/


The challenges of sensor data


• Every day, we create 2.5 quintillion bytes (2.5 EB, 2.5 × 10^18) of different
types of data generated in a variety of ways such as social media, cell
phone GPS signals, digital media, and purchase transaction records
• Sensors are one of the biggest contributors of big data, enabling new
applications across industries:
▪ Telemetrics for auto insurance and vehicle monitoring & service
▪ Smart metering for energy & utilities organizations
▪ Inventory management and asset tracking in the retail and manufacturing
sectors
▪ Fleet management in logistics and transportation organizations
• Applications that rely on sensor-generated data have unique big data
requirements to efficiently collect, store, and analyze the data to take
advantage of its value


The challenges of sensor data


Some additional references:
• Internet of Things Tutorial (PDF Version) -- TutorialsPoint
https://www.tutorialspoint.com/internet_of_things/internet_of_things_tutorial.pdf
• How has the IoT progressed over the last four years?
The Economist Intelligence Unit’s (EIU) Internet of Things (IoT) Business Index
2017 Study, commissioned by Arm and IBM
https://pages.arm.com/economist-iot-report.html
• Internet of Things (IoT)
https://www.ibm.com/developerworks/learn/iot/
• IoT Technical Library - Articles
https://www.ibm.com/developerworks/views/iot/libraryview.jsp


Engine data - aircraft, diesel trucks & generators,…


• Each Airbus A380 engine - there are 4 engines - generates 1 PB of
data on a flight from London (LHR) to Singapore (SIN)
▪ GE, one of the world’s largest manufacturers, is using big data analytics
with data generated from machine sensors to predict maintenance needs
− GE is using the analysis
to provide services tied to
its products, designed to
minimize downtime caused
by parts failures
− Real-time analytics also
enables machines to adapt
continually and improve
efficiency


Engine data - aircraft, diesel trucks & generators,...


Both Qantas and Singapore Airlines use Rolls-Royce Trent 900 engines in their A380
aircraft.
Qantas Flight 32 was a Qantas scheduled passenger flight that suffered an
uncontained engine failure on 4 November 2010 and made an emergency
landing at Singapore Changi Airport. The failure was the first of its kind for the
Airbus A380, the world's largest passenger aircraft. It marked the first aviation
occurrence involving an Airbus A380. On inspection it was found that a turbine
disc in the aircraft's No. 2 Rolls-Royce Trent 900 engine (on the port side nearest
the fuselage) had disintegrated. The aircraft had also suffered damage to the
nacelle, wing, fuel system, landing gear, flight controls, the controls for engine No.
1 and an undetected fire in the left inner wing fuel tank that eventually self-
extinguished. The failure was determined to have been caused by the breaking
of a stub oil pipe which had been manufactured improperly.
GE manufactures jet engines, turbines and medical scanners. It is using operational
data from sensors on its machinery and engines for pattern analysis.


References:
http://www.computerweekly.com/news/2240176248/GE-uses-big-data-to-power-
machine-services-business
• “The airline industry spends $200bn on fuel per year so a 2% saving is $4bn. GE
provides software that enables airline pilots to manage fuel efficiency.”
• “Another product, Movement Planner, is a cruise control system for train drivers.
The technology assesses the terrain and the location of the train to calculate the
optimal speed to run the locomotive for fuel economy.”
http://www.infoworld.com/article/2616433/big-data/general-electric-lays-out-big-plans-
for-big-data.html
• “As one of the world's largest companies, GE is a major manufacturer of systems
in aviation, rail, mining, energy, healthcare, and more. In recognition of the
importance of big data to GE, CEO Jeff Immelt launched a new initiative called
the “industrial Internet,” which aims to help customers increase efficiency and to
create new revenue opportunities for GE through analytics.”
• “The industrial Internet is GE's spin on “the Internet of things,” where Internet-
connected sensors collect vast quantities of data for analysis. According to
Immelt, sensors have already been embedded in 250,000 "intelligent machines"
manufactured by GE, including jet engines, power turbines, medical devices, and
so on. Harvesting and analyzing the data generated by those sensors holds
enormous potential for optimization across a broad range of industrial operations.”


Large Hadron Collider (LHC)


• The Large Hadron Collider (LHC) data center processes about one
petabyte of data every day. The center hosts 11,000 servers with
100,000 processor cores. Some 6,000 changes in the database are
performed every second.
▪ A global collaboration of computer
centers distributes and stores LHC
data, giving real-time access to
physicists around the world
▪ The Grid runs more than two million
jobs per day - at peak rates, 10 GB
of data may be transferred / sec.
− This year, 300 terabytes of LHC data
covering experiments were published,
allowing physicists around the world to
examine aspects of particle collisions
that CERN's own researchers had not
had time to cover


Large Hadron Collider (LHC)


References:
http://www.popularmechanics.com/technology/a20540/300-tb-cern-data-large-hadron-
collider/ (April 2016)
• “The most complex machine in mankind's history just put a gargantuan data trove
online for anyone to parse. You think you've got the analytical chops to glean
insights about the nature of the cosmos or God or simply the tendencies of
muons? Go ahead, man, dig through the 300 terabytes of data that CERN, the
European Organization for Nuclear Research, just dropped onto the cloud.”
• …“But it ain't nothing compared to what the National Security Agency works
with. Going by 2013 figures the agency released, the NSA's
various activities ‘touch’ 300 TB of data every 15 minutes or so.”


http://www.theverge.com/2016/4/25/11501078/cern-300-tb-lhc-data-open-access
• “If you ever wanted to take a look at raw data produced by the Large Hadron
Collider, but are missing the necessary physics PhD, here's your chance: CERN
has published more than 300 terabytes of LHC data online for free. The data
covers roughly half the experiments run by the LHC's CMS detector during 2011,
with a press release from CERN explaining that this includes about 2.5 inverse
femtobarns of data - around 250 trillion particle collisions. Best not to download
this on a mobile connection then.”


Largest Radio Telescope (Square Kilometer Array)


• The Square Kilometer Array (SKA) is a large multi-telescope radio
astronomy project to be built in Australia and South Africa. When finished,
it will have a total collecting area of about one square kilometer - it
will operate over a wide range of frequencies, and its size will make it
50 times more sensitive than any other radio instrument, able to survey
the sky 10,000 times faster than ever before…
▪ The SKA will produce
20,000 PB / day in 2020
(compared with the current
internet volume of 300 PB / day)
▪ This could also go up by a factor
of ten when fully operational in
2028 - the various data centers
will process 100 PB / day each
− Incidentally,IBM is developing
hardware specifically to process
this astronomical information

Largest Radio Telescope (Square Kilometer Array)


With receiving stations extending out to a distance of at least 3,000 kilometers (1,900 mi)
from a concentrated central core, it will exploit radio astronomy's ability to provide the
highest resolution images in all astronomy. The SKA will be built in the southern
hemisphere, in sub-Saharan states with cores in South Africa and Australia, where the
view of the Milky Way Galaxy is best and radio interference least.
Construction of the SKA is scheduled to begin in 2018 for initial observations by 2020.
The SKA will be built in two phases, with Phase 1 (2018-2023) representing about 10%
of the capability of the whole telescope. Phase 1 of the SKA was cost-capped at 650
million euros in 2013, while Phase 2's cost has not yet been established. The
headquarters of the project are located at the Jodrell Bank Observatory, in the UK.
The data collected by the SKA in a single day would take nearly two million years to
play back on an iPod.
The SKA will be so sensitive that it will be able to detect an airport radar on a planet
tens of light years away.
The SKA central computer will have the processing power of about one hundred million
PCs.
The dishes of the SKA will produce 10 times the global internet traffic.
The SKA will use enough optical fiber to wrap twice around the Earth.


The aperture arrays in the SKA could produce more than 100 times the global internet
traffic.
References:
• https://en.wikipedia.org/wiki/Square_Kilometre_Array
• https://www.skatelescope.org/ (SKA homepage)
• http://www.ska.gov.au/About/Pages/default.aspx
• http://www.ska.gov.au/NewZealandSKA/Pages/default.aspx


Wikibon Big Data Software, Hardware & Professional Services Projection 2014-2026 ($B)


Wikibon Big Data Software, Hardware & Professional Services Projection 2014-2026 ($B)
Executive Summary from the Wikibon Report
The big data market grew 23.5% in 2015, led by Hadoop platform revenues. We
believe the market will grow from $18.3B in 2014 to $92.2B in 2026 - a strong 14.4%
CAGR. Growth throughout the next decade will take place in three successive and
overlapping waves of application patterns - Data Lakes, Intelligent Systems of
Engagement, and Self-Tuning Systems of Intelligence. Increasing amounts of data
generated by sensors from the Internet of Things will drive each application pattern. Big
data tool integration, administrative simplicity, and developer adoption are keys to
growth rates. The adoption of streaming technologies, which address a number of
Hadoop limitations, also will be a factor. Ultimately, the market growth will depend on
enterprises: Will doers take the steps required to transform business with big data
systems?


2017 Big Data and Analytics forecast - 03-Mar-17


2017 Big Data and Analytics Forecast - 03-Mar-17


Big Data Market Evolution Is Accelerating
Wikibon identifies three broad waves of usage scenarios driving big data and machine
learning adoption. All require orders of magnitude greater scalability and lower price
points than what preceded them. The usage scenarios are:
Data Lake applications that greatly expand on prior generation data warehouses
Massively scalable Web and mobile applications that anticipate and influence users’
digital experiences
Autonomous applications that manage an ecosystem of smart, connected devices (IoT)
Each of these usage scenarios represents a wave of adoption that depends on the
maturity of a set of underlying technologies which we analyze independently. All
categories but applications - those segments that collectively comprise infrastructure -
slow from 27% growth in 2017 to single digits in 2023. Open source options and
integrated stacks of services from cloud providers combine to commoditize
infrastructure technologies.


Wikibon forecasts the following categories:


Application databases accrue functionality of analytic databases. Analytics
increasingly will inform human and machine decisions in real-time. The category totals
$2.6bn in 2016 and growth slowly tapers off from 30% and spend peaks at $8.5bn in
2024.
Analytic databases evolve beyond data lakes. MPP SQL databases, the backbone
for data lakes, continue to evolve and eventually will become the platform for large-
scale advanced, offline analytics. The category totals $2.5bn in 2016 with growth at half
the level of application databases, and its size peaks at $3.8bn in 2023.
Application infrastructure for core batch compute slows. This category, which
includes offerings like Spark, Splunk, and AWS EMR, totals $1.7bn in 2016 with 35%
growth in 2017 but slows to single digits in 2023 as continuous processing captures
more growth.
Continuous processing infrastructure is boosted by IoT applications. The
category will be the basis for emerging microservice-based big data applications,
including much of the intelligent systems wave. It totals $200M in 2016 with 45% growth
in 2017 and only slows below 30% in 2025.
The data science tool chain is evolving into models with APIs. Today, data science
tool chains require dedicated specialists to architect, administer, and operate. However,
complex data science toolchains, including those for machine learning, are transforming
into live, pre-trained models accessible through developer APIs. This cottage industry of
tools today totals $200M, growing at 45% in 2017, but growth only dips below 30% in
2025, when the category totals $1.8bn.
Machine learning applications are mostly custom-built today. They will become
more pervasive in existing enterprise applications in addition to spawning new
specialized vendors. The market today totals $900M, growing at 50% in 2017, and
reaches almost $18bn in 2027.


Big Data Analytics Segment Analysis


As users gain experience with big data technology and use cases, the focus of the
market will inevitably turn from hardware and infrastructure software to applications.
That process is well underway in the big data market, as hardware accounts for an ever
smaller share of total spend (see Figure 2 below). Infrastructure software is facing twin
challenges. First, open source options are undermining prices. Second, more vendors,
especially those with open source products, have adopted pricing models around
helping operate their products on-prem or in the cloud. While there is room for vendors
to deliver value with this approach, there is a major limitation. Given the relatively
immature state of the market, customers typically run products from many vendors. And
it’s the interactions between these products from multiple vendors that creates most of
the operational complexity for customers. As a result, market growth for infrastructure
software will slow to single digits in 2023 from 26% in 2017 (see Figure 1). Public cloud
vendors will likely solve the multi-vendor management problem. Before more fully
packaged applications can take off, customers will require a heavy mix of professional
services since the reusable building blocks are so low-level. By 2027, applications will
account for 40% of software spend, up from 11% in 2016, and professional services will
taper off to 32% of all big data spend in 2027, down from 40% in 2016.
To summarize:
• Open source and public cloud continue to pressure pricing.
• Applications today require heavy professional services spend.
• Business objectives are driving software to support ever lower latency decisions.
Reference: http://wikibon.com/2017-big-data-and-analytics-forecast/ (subscription
required)


Evolution of the Internet of Things (IoT)

Source:
Internet of Things (IoT): The Next Cyber Security Target (Webinar)
Praveen Kumar Gandi, Head Information Security Services,
https://www.slideserve.com/ClicTest/webinar-on-internet-of-things-iot-the-next-cyber-security-target


Evolution of the Internet of Things (IoT)


For relevant images:
Google search: evolution of the internet of things images
The image shown above: https://twitter.com/fisher85m/status/926360908900773889
See also: Webinar on Internet of Things(IoT): The Next Cyber Security Target -
PowerPoint PPT Presentation on Slide Serve at
https://www.slideserve.com/ClicTest/webinar-on-internet-of-things-iot-the-next-cyber-
security-target (downloadable PPT slide presentation)


Meaning of real-time when applied to Big Data


• Sub-second Response
When engineers say “real time,” they are usually referring to sub-second response time. In this
kind of real-time data processing, nanoseconds count. Extreme levels of performance are key to success.

• Human Comfortable Response Time


“Thou shalt not bore or frustrate the users.” The performance
requirement for this kind of processing is usually a couple of seconds.

• Event-Driven
If when you say “real-time,” you mean the opposite of scheduled, then you mean event-driven. Instead of
happening in a particular time interval, event-driven data processing happens when a certain action or
condition triggers it. The performance requirement for this is generally before another event happens.

• Streaming Data Processing


If when you say “real-time,” you mean the opposite of batch processing, then you mean streaming data
processing. In batch processing, data is gathered together, and all records or other data units are
processed in one big bundle until they’re all done. In streaming data processing, the data is processed as
it flows in, one unit at a time. And once the data starts coming in, it generally doesn’t end.

http://blog.syncsort.com/2016/03/big-data/four-really-real-meanings-of-real-time


Meaning of real-time when applied to Big Data


Real-time processing of Big Data is the topic of Streaming Data.
Here we are just defining some terms to distinguish between:
Data at Rest: e.g., “oceans of data” and the newer term “data lakes” - the data has
already arrived and is stored
Data in Motion: streaming data
The question is: how do we define the processing of data in real time? Does it mean
milliseconds, seconds, minutes, or something else?
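The batch-versus-streaming distinction can be made concrete with a toy Python sketch (illustrative only, not any particular streaming engine; the event values are invented): batch processing waits for the whole data set, while streaming maintains a running result updated per arriving event.

```python
# Toy contrast: batch processing waits for the whole data set;
# streaming processing handles each record as it arrives.
events = [3, 1, 4, 1, 5]

# Batch: gather everything, then process in one pass.
batch_total = sum(events)

# Streaming: maintain a running result, updated per event.
running = 0
stream_totals = []
for e in events:          # in a real system this loop never ends
    running += e
    stream_totals.append(running)

print(batch_total)        # 14
print(stream_totals[-1])  # 14 -- same answer, available incrementally
```

Both approaches reach the same final answer; the difference is latency - the streaming result is available after every event, which is what "real time" usually buys.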


More comments on real-time


• Real time is not a concept that’s woven into the fabric of the universe -
it’s a very human construct.
▪ Essentially, real time refers to lags in data arrival that are either below the
threshold of perception or, if they exist, are so short that they don't pose a
barrier to immediate action
• Decisions have various tolerances for protracted data arrival
• Data latencies vs. decision latencies


More comments on real-time


https://www.linkedin.com/pulse/real-time-isnt-youve-been-led-believe-james-kobielus
http://bigdatapage.com/4-really-real-meanings-of-real-time


Traditional vs. Big Data approaches to using data


Traditional vs. Big Data approaches to using data


Source: IBM


There are many parts to the Hadoop Ecosystem


• Constantly changing
▪ New additions
▪ Updated versions


There are many parts to the Hadoop Ecosystem


Illustration from: Gollapudi, S. (2016). Practical machine learning: Tackle the real-world
complexities of modern machine learning. Birmingham, UK & Mumbai, India: Packt
Publishing, p. 68.


We will start with the Core Components


• Hadoop
▪ Hadoop Distributed
File System (HDFS)
▪ MapReduce
▪ Common
• And we will work our
way through the
history of Hadoop
and its versions
▪ Classic (v1)
▪ Current (v2) / YARN
▪ Futures (v3)
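As a preview of the MapReduce model listed above, here is a word-count sketch in plain Python - an illustration of the map-shuffle-reduce idea only, not the Hadoop API; the sample documents are invented for the example.

```python
from collections import defaultdict

# Word count in the MapReduce style, in plain Python.
documents = ["big data big ideas", "data lakes hold big data"]

# Map: emit (key, value) pairs from each input record.
mapped = [(word, 1) for doc in documents for word in doc.split()]

# Shuffle: group values by key (done by the framework in Hadoop).
groups = defaultdict(list)
for word, count in mapped:
    groups[word].append(count)

# Reduce: combine each key's values into a final result.
counts = {word: sum(vals) for word, vals in groups.items()}
print(counts["big"], counts["data"])  # 3 3
```

In Hadoop the same three phases run in parallel across the cluster, with HDFS holding the input splits and the framework handling the shuffle between map and reduce tasks.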


We will start with the Core Components


Appendix A


Appendix A
Self-study materials from Module 1.


Example business sectors currently using Big Data


• Healthcare
• Financial
• Industry
• Agriculture
…and many others


Example business sectors currently using Big Data


Links
Which industries benefit most from big data?
• http://hortonworks.com/big-data-insights/industries-benefit-big-data
Explore new ways to work with IBM Predictive Industry Solutions
• http://www.ibm.com/analytics/us/en/industry/index.html
www.ibmbigdatahub.com/use-cases


Big Data adoption study (the 4 Es)

• From a 2012 Big Data @ Work Study surveying 1144 business and IT
professionals in 95 countries
• Gartner Sept. 2014 report: 13% of surveyed organizations have deployed
big data solutions, while 73% have invested in big data or plan to do so


Big Data adoption study (the 4 Es)


Before we look at individual industries, let’s look at the Big Data Adoption Process.
IBM's John Graeme Noseworthy on the four E’s of Big Data -
http://www.exchange4media.com/marketing/videoibms-john-graeme-noseworthy-on-
the-four-es-of-big-data_51412.html
Given numerous challenges for leveraging data optimally for marketers today, Graeme
shares few tips for marketers that will help them in their journey to Big Data market
analytics in the form of what he calls the ‘Four Es of Data and Analytics’:
Educate - A proper understanding of what Big Data and analytics mean to your
organization and how you define it, how the marketplace in and around your industry
and your competition are either applying or not applying analytics for operations.
Explore - Exploring possibilities encapsulates thinking big and understanding all the
possibilities you can explore with Big Data.
Engage - Moving from learning to applying. Find a small starting point; it is critical that
marketers talk to their partners in technology. Everybody needs to collaborate for
mutual success, which brings you to the next area of the ability to execute.
Execute - This is the phase where the organization actually starts delivering smarter
customer experiences.
See more at: http://www.exchange4media.com/marketing/videoibms-john-graeme-
noseworthy-on-the-four-es-of-big-data_51412.html#sthash.BIeieEB8.dpuf


Use cases for Big Data: Healthcare and Life Sciences


• Problem:
▪ Vast quantities of real-time information
are starting to come from wireless
monitoring devices that postoperative
patients and those with chronic
diseases are wearing at home and in
their daily lives.
• How big data analytics can help:
▪ Epidemic early warning
▪ Intensive Care Unit and remote
monitoring


Use cases for Big Data: Healthcare and Life Sciences


Healthcare

• How Big Data Is Quietly Fighting Diseases and Illnesses
▪ http://dataconomy.com/how-big-data-is-quietly-fighting-diseases-and-illnesses
• The Data Is In: 3 Ways Analytics Will Improve Healthcare
▪ http://dataconomy.com/the-data-is-in-3-ways-analytics-will-improve-healthcare


Healthcare


Big Data and complexity in healthcare


• Medical information is doubling every 5 years, much of which is unstructured
• 81% of physicians report spending 5 hours or less per month reading medical journals


Big Data and complexity in healthcare


“Medicine has become too complex (and only) about 20 percent of the knowledge
clinicians use today is evidence-based” - Steven Shapiro, Chief Medical and Scientific
Officer, UPMC
“…to keep up with the state of the art, a doctor would have to devote 160 hours a week
to perusing papers …” - The Economist, Feb 14th 2013


Precision Medicine Initiative (PMI) & Big Data


• Precision Medicine
▪ A medical model that proposes the
customization of healthcare, with medical
decisions, practices, and/or products being
tailored to the individual patient - Wikipedia
▪ Diagnostic testing is often employed for
selecting appropriate and optimal therapies
based on the context of a patient’s genetic
content or other molecular or cellular analysis
▪ Tools employed in PM can include molecular
diagnostics, imaging, and analytics/software
• The Precision Medicine Initiative (PMI)
▪ A $215 million investment in President Obama’s
Fiscal Year 2016 Budget to accelerate
biomedical research and provide clinicians with
new tools to select the therapies that will work
best in individual patients

Precision Medicine Initiative (PMI) & Big Data


Graphic on the right (from the National Cancer Institute) illustrates the use of precision
medicine in cancer treatment. “Discovering unique therapies that treat an individual’s
cancer based on the specific genetic abnormalities of that person’s tumor.”
Precision Medicine Initiative (PMI) references:
• Obama’s Precision Medicine Initiative Is The Ultimate Big-Data Project - “Curing
both rare diseases and common cancers doesn't just require new research, but
also linking all the data that researchers already have”
http://www.fastcompany.com/3057177/obamas-precision-medicine-initiative-is-
the-ultimate-big-data-project
• White House
https://www.whitehouse.gov/precision-medicine
• Obama: Precision Medicine Initiative Is First Step to Revolutionizing Medicine -
"We may be able to accelerate the process of discovering cures in ways we've
never seen before," the president stated
http://www.usnews.com/news/articles/2016-02-25/obama-precision-medicine-
initiative-is-first-step-to-revolutionizing-medicine
• National Institutes of Health
https://www.nih.gov/precision-medicine-initiative-cohort-program


• National Cancer Institute


http://www.cancer.gov/research/key-initiatives/precision-medicine
• Precision Medicine (Wikipedia)
https://en.wikipedia.org/wiki/Precision_medicine


Use cases for Big Data: Financial Services


• Problem:
▪ Manage several petabytes of data, growing at 40-100% per year, under
increasing pressure to prevent fraud and to address complaints
to regulators
• How big data analytics can help:
▪ Fraud detection
▪ Credit issuance
▪ Risk management
▪ 360 view of the Customer


Use cases for Big Data: Financial Services


Financial marketplace example: Visa


• Problem
▪ Credit card fraud costs up to 7 cents per 100 dollars –
billions of dollars per year
▪ Fraud schemes are constantly changing
▪ Understanding the fraud pattern months after
the fact is only partially helpful - fraud detection
models need to evolve faster
• If only Visa could …
▪ Reinvent how to detect the fraud patterns
▪ Stop new fraud patterns before they can
rack-up significant losses
• Solution
▪ Revolutionize the speed of detection
▪ Visa loaded two years of test records, or 73 billion transactions, amounting
to 36 terabytes of data into Hadoop - the processing time fell from one
month with traditional methods to a mere 13 minutes

Financial marketplace example: Visa
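The speed-up claimed in the Visa example can be sanity-checked with back-of-the-envelope arithmetic. The input figures come from the slide; the derived rates are illustrative only:

```python
# Figures from the Visa example above
data_tb = 36                  # terabytes loaded into Hadoop
transactions = 73e9           # two years of test records
minutes_hadoop = 13           # reported Hadoop processing time
minutes_month = 30 * 24 * 60  # ~one month with traditional methods

# Derived, illustrative rates
tb_per_min_hadoop = data_tb / minutes_hadoop        # ~2.8 TB scanned per minute
speedup = minutes_month / minutes_hadoop            # ~3300x faster than one month
txn_per_sec = transactions / (minutes_hadoop * 60)  # ~94 million transactions/s
print(round(speedup))
```

The point of the arithmetic is less the exact numbers than the order of magnitude: moving from a monthly batch cycle to minutes changes what kinds of fraud models are even worth building.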


Financial

• Big data is overhauling credit scores
▪ http://dataconomy.com/big-data-overhauling-credit-scores-2
• Top 10 Big Data Trends in 2016 for Financial Services
▪ https://www.mapr.com/blog/top-10-big-data-trends-2016-financial-services


Financial


Use cases for Big Data: Telecommunications Services


• Problem:
▪ Legacy systems used to gain insights from internally
generated data face high storage costs, long data-loading
times, and long administrative processing times
• How big data analytics can help:
▪ CDR processing
▪ Combat fraud
▪ Churn prediction
▪ Geomapping / marketing
▪ Network monitoring


Use cases for Big Data: Telecommunications Services


Survey: 90% of telco service providers believe Hadoop is best platform to combat fraud
(18 May 2016)
http://www.fiercebigdata.com/story/survey-90-telco-service-providers-believe-hadoop-
best-platform-combat-fraud/2016-05-18
A recently released survey by Cloudera and Argyle Data found that 90 percent of telcos
believe Hadoop is the most effective platform to combat revenue fraud scams, losses
from which are now estimated at U.S. $38 billion.
"Across all areas of industry, people are trying to figure out the most effective use cases
for Hadoop," said Vijay Raja, solutions marketing manager at Cloudera, in the
announcement.
"Fraud prevention is a textbook use case for Hadoop-based analytics because the ROI
is immediately visible."
While the two companies tout the capabilities of machine learning in tackling security
issues, of which fraud is one, most applications of machine learning are not yet ready to
replace human security analysts.


"Although security solutions based on unsupervised machine learning do exist, relying


entirely on artificial intelligence to spot cyberattacks isn't totally practical because such
systems yield a large number of false positives," explains Ben Dickson, a software
engineer, in his post in The Daily Dot.
"You eventually need the help of human experts to find evidence of security breaches
and make critical decisions."
However, machine learning often proves to be an exceptional tool in leveraging human
talent beyond what the human alone can do. It's simply too much data for a human to
mentally process and react to fast enough to do any good.
Further, there's a major shortage in skilled cyber security analysts at the moment. That
means that most companies are running big on security problems but short on talent to
address it.
So, while cybersecurity pros are safely employed for the foreseeable future, any who do
not use machine learning for a much-needed assist will soon find themselves replaced
by those that do.
It's not a question of machine or man, but a mandate of man and machine.


Use cases for Big Data: Transportation Services


• Problem:
▪ Traffic congestion has been increasing
worldwide as a result of urbanization
and population growth, reducing the
efficiency of transportation infrastructure
and increasing travel time
and fuel consumption.
• How big data analytics can help:
▪ Urban planning & monitoring
▪ Real-time analysis of weather
and traffic congestion data streams
to identify traffic patterns and
reduce transportation costs.


Use cases for Big Data: Transportation Services


Use cases for Big Data: Retailers & Social Media


• Problem:
▪ Savvy retailers want to use “big data” to
predict trends, prepare for demand, pinpoint
customers, optimize pricing & promotions,
and monitor real-time analytics & results
- by combining data from web browsing
patterns, social media, industry forecasts,
existing customer records, etc.
• How big data analytics can help:
▪ Access social media to gain insight
▪ Federate data between Big Data and RDBMSs
▪ Apply graph analysis to the available data
▪ Work to understand demand and engage
customers


Use cases for Big Data: Retailers & Social Media


IBM Presentation - Big Data in Retail: Examples in Action
http://www.ibmbigdatahub.com/presentation/big-data-retail-examples-action
http://www.ibm.com/analytics/us/en/industry/retail-analytics/
One example:
Graph Analytics - http://www.ibmbigdatahub.com/blog/what-graph-analytics
A growing number of businesses and industries are finding innovative ways to apply
graph analytics to a variety of use-case scenarios because it affords a unique
perspective on the analysis of networked entities and their relationships. Gain an
understanding of how four different types of graph analytics can be applied.
IBM Institute for Business Value (IIBV) -
http://www.ibm.com/services/uk/gbs/thoughtleadership/ - Register to download the
complete IBM Institute for Business Value study.


Graph analytics
• Path analysis
• Connectivity analysis
• Community analysis
• Centrality analysis

www.ibmbigdatahub.com/blog/what-graph-analytics

Graph analytics
Bringing this concept to the real world, nodes or vertices can be people, such as
customers and employees; affinity groups such as LinkedIn or meet-up groups; and
companies and institutions. They can also be places such as airports, buildings, cities
and towns, distribution centers, houses, landmarks, retail stores, shipping ports and so
on. Vertices can also be things such as assets, bank accounts, computer servers,
customer accounts, devices, grids, molecules, policies, products, twitter handles, URLs,
web pages and so on.
Edges can be elements that represent relationships such as emails, likes and dislikes,
payment transactions, phone calls, social networking and so on. Edges can be directed,
that is, they have a one-way direction arrow to represent a relationship from one node
to another-for example, Mike made a payment to Bill; Mike follows Corinne on Twitter.
They can also be non-directed-for example, M1 links London and Leeds-and weighted-
for example, the number of payments between these two accounts is high. The time
between two locations is also an example of a weighted relationship.
http://www.ibmbigdatahub.com/blog/what-graph-analytics
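The node-and-edge vocabulary above can be made concrete in a few lines of plain Python, using the Mike/Bill/Corinne payment relationships mentioned in the notes (the edge weights here are invented for illustration):

```python
# Directed, weighted edges: (payer, payee) -> number of payments between them
edges = {
    ("Mike", "Bill"): 5,
    ("Mike", "Corinne"): 1,
    ("Corinne", "Bill"): 3,
}

def out_degree(node):
    # Connectivity analysis: count a node's outgoing relationships
    return sum(1 for (src, _dst) in edges if src == node)

def weighted_in_degree(node):
    # A crude centrality measure: total weight of incoming edges
    return sum(w for (_src, dst), w in edges.items() if dst == node)

print(out_degree("Mike"), weighted_in_degree("Bill"))  # 2 8
```

Real graph analytics engines generalize exactly this idea to billions of vertices and edges, adding path, community, and centrality algorithms on top of the same node/edge/weight model.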


Behavioral segmentation & analytics


Behavioral segmentation & analytics


Data 3.0 - my view on the data landscape


• The Eras of Data
▪ 0 Flat files
▪ 1 Relational Databases (RDBMSs) - 1970s - OLTP (Online Transaction Processing)
▪ 2 Data Warehouses - 1990s - OLAP (Online Analytical processing) or
DSS (Decision Support Systems) workloads
▪ 3 Big Data - 2000s - Batch, with a movement towards Real-time
• Some terminology of Big Data
▪ Oceans of data (data at rest) vs. Streams of data (data in motion)
▪ Data Lake (a large storage repository and processing engine)
• James Dixon of Pentaho used the term initially to contrast with “data mart,” which is a
smaller repository of interesting attributes extracted from the raw data. He wrote: "If you
think of a datamart as a store of bottled water - cleansed and packaged and structured
for easy consumption - the data lake is a large body of water in a more natural state. The
contents of the data lake stream in from a source to fill the lake, and various users of the
lake can come to examine, dive in, or take samples." (Wikipedia, Data Lake)
▪ NoSQL (“not only SQL”)


Data 3.0 - my view on the data landscape


Real-time is a definite future direction for analytics


• IBM System S was a precursor of the future
IBM InfoSphere Streams and now the open-source competitors such
as Apache Storm, Apache Kafka, etc.


Real-time is a definite future direction for analytics


Internet of Things (IoT)


• The number of connected devices that can share data is exploding,
with estimates of 50-200 billion devices being connected to the Internet
by 2020 - a transformative change for our industrial society
• With a dramatic growth in connections:
▪ new devices
▪ legacy infrastructures
…triggering an unprecedented spike in data volumes, devices & data
• That data represents
▪ untapped production efficiencies
▪ competitive business insights
▪ new, brand-differentiating services
* but only if the data can be effectively
analyzed, and its value unlocked

Photo: Siemens

Internet of Things (IoT)


Machine-generated data is becoming a major data resource and will continue to do so.
Wikibon has forecast that the market value of the industrial Internet (a term coined by
Frost & Sullivan to refer to the integration of complex physical machinery with
networked sensors and software) will be approximately $540 billion in 2020.
Cielen, D., Meysman, A. D. B., & Ali, M. (2016). Introducing data science: Big
data, machine learning, and more, using Python tools. Shelter Island, NY:
Manning Publications, p. 6.

http://www.datanami.com/2015/10/30/hortonworks-preps-for-coming-iot-device-storm/
The masses may not be getting flying cars or personal robots anytime soon, but thanks
to connected devices, the future of technology is definitely bright. From smart
refrigerators and wireless BBQ sensors to connected turbines and semi-trucks, the
Internet of Things (IoT) will have a huge impact on consumer and industrial tech.
However, managing hundreds of billions of devices-not to mention the data they
generate-will not be easy, which is why Hadoop distributor Hortonworks is taking pains
to prepare for the coming device storm.


According to Intel, the number of connected devices will explode in the near future,
growing from about 15 billion devices in 2015 to more than 200 billion by 2020.
Anybody who’s watching the rise of wearables like the Fitbit, the popularity of smart
thermostats like Nest, or the presence of drone aircraft can tell you that this
phenomenon is real and accelerating.
Getting these wireless IoT devices integrated with command and control systems will
not be an easy task. While we now have enough Internet addresses to go around,
there’s a lot of other messy details that need to be worked out to ensure that devices
aren’t stepping on each other’s toes, that they can’t be co-opted by others, that the data
is safe and secure. Right now, it’s an ad hoc, Wild West IoT world, but that simply won’t
scale.


The future of IoT and the connected world


The future of IoT and the connected world


International Data Corporation (IDC) has estimated that there will be 26 times more
connected things than people in 2020 - perhaps 200 billion. This network is commonly
referred to as the internet of things (IoT).
Free O’Reilly book (37 pp., PDF) - Architecting for the Internet of Things:
https://voltdb.com/sm/16/IoTOReilly


Sensors & software for the automobile


Sensors & software for the automobile


Disrupting the Car Industry
https://whatsthebigdata.com/2016/06/14/disrupting-the-car-industry/
While self-driving tech - which is included in the infographic above - receives the lion’s
share of media attention, a host of less-heralded startups are targeting specific pieces
of automotive infrastructure or components. For example, companies including
Quanergy (https://www.cbinsights.com/company/quanergy-systems), LeddarTech
(https://www.cbinsights.com/company/leddartech), and TriLumina
(https://www.cbinsights.com/company/trilumina) are seeking to capitalize on the self-
driving revolution by dramatically lowering the cost of expensive LiDAR sensors.
Startups such as Veniam (https://www.cbinsights.com/company/veniam) and Savari
(https://www.cbinsights.com/company/savari) are developing vehicle-to-vehicle (V2V)
and vehicle-to-anything (V2X) communications, another autonomous-adjacent
technology field. With increasing automotive connectivity seemingly inevitable, vehicle
cybersecurity has also begun to emerge as a focus, with newer players like Karamba
Security (https://www.cbinsights.com/company/karamba) joining others such as Argus
Cyber Security (https://www.cbinsights.com/company/argus-cyber-security).


An example of a big data platform in practice (IBM)


Diagram (summarized): the platform is organized into five zones:
• Ingestion and Real-time Analytic Zone - streaming data, connectors, and documents in a variety of formats
• Landing and Analytics Sandbox Zone - Hadoop (MapReduce, Hive/HBase)
• Warehousing Zone - Enterprise Warehouse, Data Marts, Column Stores
• Analytics and Reporting Zone - BI & Reporting, Predictive Analytics, Visualization & Discovery
• Metadata and Governance Zone - ETL, MDM, Data Governance


An example of a big data platform in practice (IBM)


We will see that other vendors (e.g., Hortonworks) offer their own Big Data Platforms
that handle both real-time (streaming) and batch (traditional Hadoop v1) processing
environments.
We should note that the traditional data processing environments are not going to go
away:
• Relational databases
• Data Warehouses

…but those will be transformed.


Big Data & Analytics architecture - A broader picture


Big Data & Analytics Architecture - A broader picture


This slide illustration adds a series of new / enhanced applications and is more specific
as to the nature of the types of data sources.


Big Data scenarios span many industries

• Multi-channel customer sentiment and experience analysis
• Detect life-threatening conditions at hospitals in time to intervene
• Predict weather patterns to plan optimal wind turbine usage, and optimize capital expenditure on asset placement
• Make risk decisions based on real-time transactional data
• Identify criminals and threats from disparate video, audio, and data feeds


Big Data scenarios span many industries


Graphic: IBM


Big Data adoption (emphasis on Hadoop)


Big Data adoption (emphasis on Hadoop)


Factors driving interest in Big Data Analysis


Factors driving interest in Big Data Analysis


Impact of Big Data analytics in the next 5 years


Impact of Big Data analytics in the next 5 years


Big Data: Outcomes & data sources


Big Data: Outcomes & data sources


Data & Data Science


• The direction of this course is to head towards a study of Data Science
- and that means we will have to study the nature of data, including
how it is found, what formats it occurs in, wrangling it, …
• During the journey we will study:
▪ The basics of the technology: Hadoop & HDFS, MapReduce & YARN,
Spark
▪ Data formats & data movement
▪ The role & work of the Data Scientist
▪ Programming for Big Data
▪ The Hadoop ecosystem - both open source and proprietary approaches
▪ Data governance & data security


Data & Data Science


The sections of this course are:
• Orientation & Introduction to Big Data
• Hadoop & HDFS
• MapReduce & YARN
• Spark
• Data Formats & Data Movement
• The Role & Work of the Data Scientist
• Programming for Big Data
• The Hadoop Ecosystem
• Data Governance & Data Security


Facets of data
In data science and big data, you will come across many different types
of data, and each of them require different tools and techniques. The
main categories of data are:
• Structured
• Unstructured
• Natural language
• Machine-generated
• Graph-based
• Audio, video, and image
• Streaming

Cielen, D., Meysman, A. D. B., & Ali, M. (2016). Introducing data science: Big data, machine
learning, and more, using Python tools. Shelter Island, NY: Manning Publications, pp. 4-8.
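A minimal sketch of why these facets call for different tools: structured data splits cleanly into named fields, while machine-generated log lines need pattern matching before they are usable. The record formats below are invented for illustration:

```python
import csv
import io
import re

# Structured: a CSV row parses directly into named fields
structured = "id,amount\n42,19.99\n"
rows = list(csv.DictReader(io.StringIO(structured)))
print(rows[0]["amount"])  # 19.99

# Machine-generated: a log line must be pattern-matched before use
log_line = "2019-03-01T12:00:00 WARN disk=87% host=node7"
match = re.search(r"disk=(\d+)%", log_line)
print(match.group(1))  # 87
```

Natural language, graph-based, audio/video, and streaming data each push further still, requiring NLP pipelines, graph engines, signal processing, and stream processors respectively.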


Facets of data
We live in the data age. It’s not easy to measure the total volume of data stored
electronically, but an IDC estimate put the size of the “digital universe” at 4.4 zettabytes
in 2013 and is forecasting a tenfold growth by 2020 to 44 zettabytes. A zettabyte is
10^21 bytes, or equivalently one thousand exabytes, one million petabytes, or one billion
terabytes. That’s more than one disk drive for every person in the world.
White, T. (2015). Hadoop: The definitive guide (4th, revised & updated ed.).
Sebastopol, CA: O'Reilly Media, p. 1.
For the meaning of this terminology (terabyte, petabyte, exabyte, zettabyte), see the
next slide and its notes.
This flood of data is coming from many sources. Consider the following:
• The New York Stock Exchange generates about 4−5 terabytes of data per day.
• Facebook hosts more than 240 billion photos, growing at 7 petabytes per month.
• Ancestry.com, the genealogy site, stores around 10 petabytes of data.
• The Internet Archive stores around 18.5 petabytes of data.


(All figures are from 2013 or 2014. For more information, see Tom Groenfeldt, “At
NYSE, The Data Deluge Overwhelms Traditional Databases”; Rich Miller, “Facebook
Builds Exabyte Data Centers for Cold Storage”; Ancestry.com’s “Company Facts”;
Archive.org’s “Petabox”; and the Worldwide LHC Computing Grid project’s welcome
page.)

System of Units / Binary System of Units


System of Units / Binary System of Units


When we talk about Big Data, we need to have some measuring units. Accuracy in
terminology is important.
Please note that the international symbol for kilo is a small “k” (and not a capital “K”) -
and thus kB. This is a common mistake in usage.
The easiest way to remember some of this: 1 petabyte = 1000 terabytes, 1 exabyte = 1
billion gigabytes, 1 zettabyte = 1 billion terabytes.
You need to be aware of the two nomenclatures for sizing disk and storage media: the
official International System of Units (SI) and the now deprecated binary usage, based
on powers of 2.
Unfortunately, both are in use in all the literature, and usually without distinction.
From an operating-systems perspective, Linux and OS X compute in powers of 10
(1 kB = 1000 bytes), while Windows (even Windows 10) uses powers of 2 (1 KB = 1024 bytes).
Disk and tape storage is universally sold in powers of 10, but labeled GB, TB, etc.
Thus a purchased “4 TB” disk drive truly holds 4 × 10^12 bytes and will appear as about
3.64 TB on Windows (really about 3.64 TiB, since 1 TiB = 2^40 bytes).
Note also that the correct formulation is kB (kilo is lowercase “k” in SI terminology), etc.
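The SI-versus-binary arithmetic above can be sketched in a few lines of Python (the 4 TB drive size is just an illustrative value):

```python
# Sketch of the gap between SI units (powers of 10, how disks are sold)
# and binary units (powers of 2, how Windows reports sizes).
TB = 10**12   # SI terabyte
TiB = 2**40   # binary tebibyte

drive_bytes = 4 * TB             # a drive marketed as "4 TB"
reported = drive_bytes / TiB     # what the OS shows as "TB"
print(f"4 TB drive reported as {reported:.2f} 'TB'")  # ~3.64
```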

Be very careful when talking about network speed, as this is fairly standard:
1 Mbps = one million bits per second
1 MBps = one million bytes per second
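A one-line Python illustration of the factor-of-8 difference (the 100 Mbps link speed is an assumed example value):

```python
# Converting a link speed in megabits/s to a transfer rate in megabytes/s.
link_mbps = 100                  # example: a 100 Mbps network link
transfer_MBps = link_mbps / 8    # 8 bits per byte
print(f"{link_mbps} Mbps = {transfer_MBps} MBps")  # 12.5 MBps
```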
By the way, even the term “byte” is a little ambiguous. The generally accepted meaning
these days is an octet (i.e., eight bits).
The de facto standard of eight bits is a convenient power of two permitting the values
0 through 255 for one byte. The international standard IEC 80000-13 codified this
common meaning. Many types of applications use information representable in eight
or fewer bits and processor designers optimize for this common usage. The
popularity of major commercial computing architectures has aided in the ubiquitous
acceptance of the 8-bit size. The unit octet was defined to explicitly denote a
sequence of 8 bits because of the ambiguity associated at the time with the byte.
https://en.wikipedia.org/wiki/Byte
Unicode UTF-8 encoding is variable-length and uses 8-bit code units. It was
designed for backward compatibility with ASCII and to avoid the complications of
endianness and byte order marks in the alternative UTF-16 and UTF-32 encodings.
The name is derived from: Universal Coded Character Set + Transformation Format
- 8-bit. UTF-8 is the dominant character encoding for the World Wide Web,
accounting for 86.9% of all Web pages in May 2016. The Internet Mail Consortium
(IMC) recommends that all e-mail programs be able to display and create mail using
UTF-8, and the W3C recommends UTF-8 as the default encoding in XML and
HTML. UTF-8 encodes each of the 1,112,064 valid code points in the Unicode code
space (1,114,112 code points minus 2,048 surrogate code points) using one to four
8-bit bytes (a group of 8 bits is known as an octet in the Unicode Standard).
https://en.wikipedia.org/wiki/UTF-8
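UTF-8’s one-to-four-byte behavior can be checked directly with Python’s built-in str.encode (the sample characters are illustrative):

```python
# Each character below encodes to a different number of UTF-8 bytes.
for ch in ["A", "é", "€", "😀"]:
    print(repr(ch), "->", len(ch.encode("utf-8")), "byte(s)")
# 'A' -> 1, 'é' -> 2, '€' -> 3, '😀' -> 4
```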

Introduction to Hadoop & the Hadoop Ecosystem


• Why? When? Where?
▪ Origins / History
▪ The Why of Hadoop
▪ The When of Hadoop
▪ The Where of Hadoop
• Hadoop Basics
▪ Comparison with RDBMS
• Hadoop architecture
▪ MapReduce
▪ HDFS
▪ Hadoop Common


Introduction to Hadoop & the Hadoop Ecosystem


So let’s dive in now to the topic of Hadoop and the Hadoop Ecosystem.
We will cover the big picture first, diving in just far enough - in this unit - to get our
feet wet. Later on, in the follow-up units, we will explore how Hadoop fits into a broader IT
environment that could involve data warehouses, relational DBMSs, streaming engines,
etc. Then I’ll wrap up by explaining what IBM can do to help you get started with this
technology.
https://gigaom.com/2013/03/04/the-history-of-hadoop-from-4-nodes-to-the-future-of-
data/2/

Hardware improvements over the years…


• CPU Speeds
▪ 1990 - 44 MIPS at 40 MHz
▪ 2010 - 147,600 MIPS at 3.3 GHz
• RAM Memory
▪ 1990 - 640K conventional memory
(256K extended memory recommended)
▪ 2010 - 8-32GB (and more)
• Disk Capacity
▪ 1990 - 20MB
▪ 2010 - 1TB
• Disk Latency (speed of reads and
writes) - not much improvement in the last
7-10 years, currently around 80 MB/sec


Hardware improvements over the years …


Before we talk about what is Hadoop, let’s set up the context as to why Hadoop
technology is so important.
Moore’s law has held for a long time, but no matter how many more transistors are
added to CPUs, and how powerful they become, disk latency remains the bottleneck.
Here our intent is to make one specific point: Scaling up (more powerful computers
with more powerful CPUs) is not the answer to all problems, since disk latency is the main
issue. Scaling out (to a cluster of computers) is the better approach, and the only
approach at extreme scale.
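A back-of-the-envelope Python sketch makes the point, using the ~80 MB/sec figure from the slide; the 100-disk cluster size is an assumed example:

```python
# Why scale-out: time to read 1 TB at ~80 MB/s, serially vs. in parallel.
data_bytes = 10**12            # 1 TB
disk_bps = 80 * 10**6          # ~80 MB/s per disk

one_disk_hours = data_bytes / disk_bps / 3600
parallel_minutes = data_bytes / (100 * disk_bps) / 60   # 100 disks at once

print(f"1 disk:    {one_disk_hours:.1f} hours")      # ~3.5 hours
print(f"100 disks: {parallel_minutes:.1f} minutes")  # ~2.1 minutes
```

This is exactly the trade Hadoop makes: many cheap disks read in parallel, instead of one fast disk read serially.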

Parallel data processing


Different approaches:
▪ GRID computing - spreads processing load - “CPU scavenging”
▪ Distributed workload - hard to manage applications, overhead on developer
▪ Parallel databases - Db2 DPF, Teradata, Netezza, etc. (distribute the data)
• Distributed computing:
▪ Multiple computers appear as one supercomputer, communicate with each other by
message passing, and operate together to achieve a common goal
• Challenges:
▪ Heterogeneity, Openness, Security, Scalability, Concurrency, Fault tolerance,
Transparency

“In pioneer days they used oxen for heavy pulling, and when one ox couldn’t budge a log,
they didn’t try to grow a larger ox. We shouldn’t be trying for bigger computers, but for
more systems of computers.”
- Grace Hopper


Parallel data processing


Grace Hopper (December 9, 1906 - January 1, 1992), was an American computer
scientist and United States Navy Rear Admiral. She was one of the first programmers of
the Harvard Mark I computer in 1944, invented the first compiler for a computer
programming language, and was one of those who popularized the idea of machine-
independent programming languages which led to the development of COBOL, one of
the first high-level programming languages.
https://en.wikipedia.org/wiki/Grace_Hopper
The quote is found at: White, T. (2015). Hadoop: The definitive guide (4th, revised &
updated ed.). Sebastopol, CA: O'Reilly Media, p. 1.

What is Hadoop?
• Apache open source software framework for reliable, scalable,
distributed computing of massive amount of data
▪ Hides underlying system details and complexities from user
▪ Developed in Java
• Consists of 3 sub projects:
▪ MapReduce
▪ Hadoop Distributed File System (aka. HDFS)
▪ Hadoop Common
• Has a large ecosystem with both open-source & proprietary Hadoop-
related projects
▪ HBase / ZooKeeper / Avro / etc.
• Meant for heterogeneous commodity hardware


What is Hadoop?
Hadoop is an open source project of the Apache Foundation.
It is a framework written in Java originally developed by Doug Cutting who named it
after his son's toy elephant.
Hadoop uses Google’s MapReduce and Google File System (GFS) technologies
concepts at its foundation.
It is optimized to handle massive amounts of data which could be structured,
unstructured or semi-structured, using commodity hardware, that is, relatively
inexpensive computers.
This massive parallel processing is done with great performance. In its initial
conception, which we will study first, it is a batch operation handling massive amounts
of data, so the response time is not immediate.
Hadoop replicates its data across different computers, so that if one goes down, the
data is processed on one of the replicated computers.
You may be familiar with OLTP (Online Transaction Processing) workloads, where
structured data such as a relational database is accessed randomly - for example, when
you access your bank account.

You may also be familiar with OLAP (Online Analytical Processing) or DSS (Decision
Support System) workloads, where structured data such as a relational database is
accessed sequentially to generate reports that provide business intelligence.
Now, you may not be that familiar with the concept of “Big Data”. Big Data is a term
used to describe large collections of data (also known as datasets) that may be
unstructured and grow so large and so quickly that they are difficult to manage with
regular database or statistics tools.
Hadoop is used neither for OLTP nor for OLAP, but for Big Data. It complements these
two in managing data. So Hadoop is NOT a replacement for an RDBMS.

A large (and growing) Ecosystem


A large (and growing) Ecosystem


Some references:
https://www.coursera.org/learn/hadoop/lecture/E87sw/hadoop-ecosystem-major-
components
https://hadoopecosystemtable.github.io/

Who uses Hadoop?


Who uses Hadoop?


Reference:
https://wiki.apache.org/hadoop/PoweredBy

Why & where Hadoop is used / not used


• What Hadoop is good for:
▪ Massive amounts of data through
parallelism
▪ A variety of data (structured, unstructured,
semi-structured)
▪ Inexpensive commodity hardware
• What Hadoop is not good for:
▪ Processing transactions (random access)
▪ Work that cannot be parallelized
▪ Low-latency data access
▪ Processing lots of small files
▪ Intensive calculations with little data


Why & where Hadoop is used / not used

Hadoop / MapReduce timeline


Hadoop / MapReduce timeline


Hadoop was created by Doug Cutting, the creator of Apache Lucene, the widely used
text search library. Hadoop has its origins in Apache Nutch, an open source web search
engine, itself a part of the Lucene project.
Following are the major events that led to the creation of the stable version of Hadoop
that's available.
• 2002 - Doug Cutting and Mike Cafarella launch the Nutch project to handle
billions of searches and index millions of web pages.
• Oct 2003 - Google releases its paper on GFS (the Google File System)
• Dec 2004 - Google releases its paper on MapReduce
• 2005 - Nutch used GFS and MapReduce to perform operations
• 2006 - Yahoo! created Hadoop based on GFS and MapReduce (with Doug
Cutting and team)
• 2007 - Yahoo started using Hadoop on a 1000 node cluster
• Jan 2008 - Apache took over Hadoop
• Jul 2008 - Tested a 4000 node cluster with Hadoop successfully
• 2009 - Hadoop successfully sorted a petabyte of data in less than 17 hours.

• 2011 - IBM releases BigInsights based on Hadoop 0.23


• 2011 June - Hortonworks founded in Santa Clara, CA
• Dec 2011 - Hadoop releases version 1.0
• Aug 2013 - Version 2.0.6 is available
A more detailed history can be found at:
https://gigaom.com/2013/03/04/the-history-of-hadoop-from-4-nodes-to-the-future-of-
data/
When the seeds of Hadoop were first planted in 2002, the world just wanted a better
open-source search engine. So then-Internet Archive search director Doug Cutting and
University of Washington graduate student Mike Cafarella set out to build it. They called
their project Nutch and it was designed with that era’s web in mind.
Looking back on it today, early iterations of Nutch were kind of laughable. About a year
into their work on it, Cutting and Cafarella thought things were going pretty well
because Nutch was already able to crawl and index hundreds of millions of pages. “At
the time, when we started, we were sort of thinking that a web search engine was
around a billion pages,” Cutting explained to me, “so we were getting up there.”
There are now about 700 million web sites and, according to Wired’s Kevin Kelly, well
over a trillion web pages.
But getting Nutch to work wasn’t easy. It could only run across a handful of machines,
and someone had to watch it around the clock to make sure it didn’t fall down.
“I remember working on it for several months, being quite proud of what we had been
doing, and then the Google File System paper came out and I realized ‘Oh, that’s a
much better way of doing it. We should do it that way,'” reminisced Cafarella. “Then, by
the time we had a first working version, the MapReduce paper came out and that
seemed like a pretty good idea, too.”
Google released the Google File System paper (http://research.google.com/
archive/gfs.html) in October 2003 and the MapReduce paper
(http://research.google.com/archive/mapreduce.html) in December 2004. The latter
would prove especially revelatory to the two engineers building Nutch.
“What they spent a lot of time doing was generalizing this into a framework that
automated all these steps that we were doing manually,” Cutting explained.
References:
• Google’s Paper on Big Table: http://research.google.com/archive/bigtable.html
• Google’s Paper on MapReduce:
http://research.google.com/archive/mapreduce.html

Many contributors to Hadoop (e.g., 2006-2011)


Many contributors to Hadoop (e.g., 2006-2011)


The big fear is a fragmentation of Hadoop similar to what happened to the Unix
operating system in the 1990s. Already, Cloudera, Hortonworks, EMC
Greenplum, MapR, Intel and IBM (and, in a way, Amazon Web Services) sell
foundational-level Hadoop distributions that vary from one another and from Apache
Hadoop in significant ways. Facebook, Twitter, Quantcast and other web companies
also do their own open source work and often release it into open source via Github.
The companies involved haven’t always done their best to quell these fears.
Hortonworks and Cloudera have engaged in very public disputes over who the real
open source champion is and whose employees have been more integral to the
development of Apache Hadoop. And both of those companies are quick to point the
finger at other companies they claim are doing Hadoop a disservice by developing
proprietary software at the MapReduce and HDFS levels.

The two key components of Hadoop


• Hadoop Distributed File System = HDFS
▪ Where Hadoop stores data
▪ A file system that spans all the nodes in a Hadoop cluster
▪ It links together the file systems on many local nodes to make them into one
big file system
• MapReduce framework
▪ How Hadoop understands and assigns work to the nodes (machines)


The two key components of Hadoop


There are two key components or aspects of Hadoop that are important to understand:
1. MapReduce is a software framework introduced by Google to support distributed
computing on large data sets of clusters of computers.
2. The Hadoop Distributed File System (HDFS) is where Hadoop stores its data.
This file system spans all the nodes in a cluster. Effectively, HDFS links together
the data that resides on many local nodes, making the data part of one big file
system. You can use other file systems with Hadoop (e.g., MapR’s MapRFS,
IBM’s Spectrum Scale [formerly known as GPFS = General Parallel File
System]), but HDFS is quite common.
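To make the MapReduce idea concrete, here is a minimal single-process Python sketch of the canonical word-count example (an illustration of the concept only, not the Hadoop Java API):

```python
from collections import defaultdict

def map_phase(lines):
    # Map: emit a (word, 1) pair for every word in the input.
    for line in lines:
        for word in line.split():
            yield word.lower(), 1

def shuffle(pairs):
    # Shuffle: group all values by key, as the framework does between phases.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    # Reduce: sum the counts for each word.
    return {word: sum(counts) for word, counts in grouped.items()}

lines = ["Hadoop stores data", "Hadoop processes data"]
print(reduce_phase(shuffle(map_phase(lines))))
# {'hadoop': 2, 'stores': 1, 'data': 2, 'processes': 1}
```

In real Hadoop, the map and reduce functions run on many nodes, and HDFS plus the framework handle distribution, shuffling, and fault tolerance.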
References:
• http://info.mapr.com/Hadoop_Buyers_Guide_2015.html
• http://www.redbooks.ibm.com/abstracts/sg248254.html

Think differently
As we start to work with Hadoop, we need to think differently:
• Different processing paradigms
• Different approaches to storing data
• Think ELT (extract-load-transform)
rather than ETL (extract-transform-load)

…and to understand the Hadoop Ecosystem is to embark on a
continuing learning process…self-education is an ongoing requirement


Think differently

Core Hadoop concepts


• Applications are written in high-level language code
• Work is performed in a cluster of commodity machines
▪ Nodes talk to each other as little as possible
• Data is distributed in advance
▪ Bring the computation to the data
• Data is replicated for increased availability and reliability
• Hadoop is fully scalable and fault-tolerant


Core Hadoop concepts

Differences between RDBMS and Hadoop/HDFS


Differences between RDBMS and Hadoop/HDFS

Requirements for this new approach


• Partial Failure Support
• Data Recoverability
• Component Recovery
• Consistency
• Scalability

• Hadoop is based on work done by Google in the late 1990s/early 2000s:


Specifically, on papers describing the Google File System (GFS)
(published in 2003), and MapReduce (published in 2004)


Requirements for this new approach


Partial Failure Support
Failure of a component should result in a graceful degradation of application
performance, not complete failure of the entire system
Data Recoverability
If a component of the system fails, its workload should be assumed by still-functioning
units in the system - failure should not result in the loss of any data
Component Recovery
If a component of the system fails and then recovers, it should be able to rejoin the
system, without requiring a full restart of the entire system
Consistency
Component failures during execution of a job should not affect the outcome of the job
Scalability
Adding load to the system should result in a graceful decline in performance of
individual jobs, not failure of the system
Increasing resources should support a proportional increase in load capacity

References:
• Google File System (2003): http://research.google.com/archive/gfs.html
o The Google File System
• MapReduce (2004): http://research.google.com/archive/mapreduce.html
o MapReduce: Simplified Data Processing on Large Clusters
• BigTable (2006): http://research.google.com/archive/bigtable.html
o BigTable: A Distributed Storage System for Structured Data

Checkpoint
1. What are the 4Vs of Big Data?
What are some of the additional Vs that some add to the basic four?
2. What are the three types of Big Data?
3. Name some of the industry sectors that are using Big Data and Data
Analytics to manage their business.


Checkpoint

Checkpoint solution
1. What are the 4Vs of Big Data?
What are some of the additional Vs that some add to the basic four?
▪ Velocity, Variety, Volume, Veracity
▪ Value, Validity, Viability, Volatility, Vulnerability, Visualization
2. What are the three types of Big Data?
▪ Structured, Semi-structured, Unstructured
▪ Secondary types: Natural Language, Machine-Generated, Graph-based, Audio / Video / Image,
Streaming
3. Name some of the industry sectors that are using Big Data and Data
Analytics to manage their business
▪ Healthcare, Telecommunications, Utilities, Banking / Finance, Insurance, Agriculture, Travel,
Retail - This list, of course, is not closed or exclusive, but merely representative of the examples
used in this course - other industries are also valid


Checkpoint solution

Unit summary
• Develop an understanding of the complete open-source Hadoop
ecosystem and its near-term future directions
• Be able to compare and evaluate the major Hadoop distributions and
their ecosystem components, both their strengths and their limitations
• Gain hands-on experience with key components of various big data
ecosystem components and their roles in building a complete big data
solution to common business problems
• Learning the tools that will enable you to continue your big data
education after the course
▪ This learning is going to be a life-long technical journey that you will start
here and continue throughout your business career


Unit summary
