Introduction to Big Data and Data Analytics
Unit objectives
• Develop an understanding of the complete open-source Hadoop
ecosystem and its near-term future directions
• Be able to compare and evaluate the major Hadoop distributions and
their ecosystem components, both their strengths and their limitations
• Gain hands-on experience with key components of the big data ecosystem and their roles in building complete big data solutions to common business problems
• Learn the tools that will enable you to continue your big data education after the course
▪ This learning is a lifelong technical journey that starts here and continues throughout your business career
Introduction to Big Data and Data Analytics © Copyright IBM Corporation 2019
• 1 in 2 business leaders don't have access to the data they need
• 83% of CIOs cited BI and analytics as part of their visionary plans
• Top performers are 5.4X more likely to use business analytics
Passive RFID grows from 1.12 billion tags in 2014 to 6.9 billion: more than five years later
than the industry had expected, market research and events firm IDTechEx finds that the
passive RFID tag market is now seeing tremendous volume growth.
• http://www.idtechex.com/research/articles/passive-rfid-grows-by-1-12-billion-tags-
in-2014-to-6-9-billion-00007031.asp
• http://www.statista.com/statistics/299966/size-of-the-global-rfid-market/
• https://en.wikipedia.org/wiki/Radio-frequency_identification
It is Predicted This Year That 1.3 Billion RFID Tags Will Be Sold (in 2006), but This
Number is Expected to Grow Rapidly over the Next Ten Years
• http://www.businesswire.com/news/home/20060124005042/en/Predicted-Year-
1.3-Billion-RFID-Tags-Sold
The number of internet users has increased tenfold from 1999 to 2013. The
first billion was reached in 2005. The second billion in 2010. The third billion in 2014.
• http://www.internetlivestats.com/internet-users/
Growth in the smart electric meter market has cooled during the last year, even though
strong market drivers remain in place on which utilities can build a business
case. Although the direct operational and societal benefits of smart meters and the
broader benefits of smart grids that are enabled by smart meters continue to be
debated among policymakers, utilities, regulators, and consumers, penetration rates
continue to climb, reaching nearly 39 percent in North America in 2012. According to a
recent report from Navigant Research, the worldwide installed base of smart meters will
grow from 313 million in 2013 to nearly 1.1 billion in 2022.
• https://www.navigantresearch.com/newsroom/the-installed-base-of-smart-meters-
will-surpass-1-billion-by-2022
Number of monthly active Facebook users worldwide as of 1st quarter 2016 (in millions)
• http://www.statista.com/statistics/264810/number-of-monthly-active-facebook-
users-worldwide/
Number of mobile phones to exceed world population by 2014. The ITU expects the
number of cell phone accounts to rise from 6 billion now to 7.3 billion in 2014, compared
with a global population of 7 billion.
Over 100 countries have the number of cell phone accounts exceeding their
population.
• http://www.digitaltrends.com/mobile/mobile-phone-world-population-2014/
• http://www.siliconindia.com/magazine_articles/World_to_have_more_cell_phone_
accounts_than_people_by_2014-DASD767476836.html
The number of devices connected to IP networks will be three times as high as the
global population in 2020. There will be 3.4 networked devices per capita by 2020, up
from 2.2 networked devices per capita in 2015. Accelerated in part by the increase in
devices and the capabilities of those devices, IP traffic per capita will reach 25 GB per
capita by 2020, up from 10 GB per capita in 2015.
With every passing day, data grows exponentially. Thousands of terabytes of data are
created every minute worldwide via Facebook, tweets, instant messages, email,
internet usage, mobile usage, product reviews, and so on. Every minute, hundreds of Twitter
accounts are created, thousands of applications are downloaded, and thousands of
new posts and ads are published. According to experts, the amount of big data in the
world is likely to double every two years. This will provide immense data in the coming
years and calls for smarter data management.
As the volume of data grows at this pace, traditional database technology will not meet
the need for efficient data management, that is, storage and analysis. The need of the
hour is large-scale adoption of new-age tools like Hadoop and MongoDB, which use
distributed systems to facilitate the storage and analysis of this enormous data. This
information explosion has opened new doors of opportunity in the modern age.
Variety
Big data is collected and created in various formats and from various sources. It includes
structured data as well as unstructured data such as text, multimedia, social media, and
business reports.
Structured data, such as bank records, demographic data, inventory databases,
business data, and product data feeds, has a defined structure and can be stored and
analyzed using traditional data management and analysis methods.
Unstructured data includes captured data such as images, tweets or Facebook status
updates, instant messenger conversations, blogs, video uploads, voice recordings, and
sensor data. These types of data do not have any defined pattern. Unstructured data is
most often a reflection of human thoughts, emotions, and feelings, which can be difficult
to express in exact words.
As the saying goes, "A picture paints a thousand words": one image or video that is
shared on social networking sites and applauded by millions of users can help in
deriving crucial inferences. Hence, it is the need of the hour to understand this
non-verbal language to unlock the secrets of market trends.
One of the main objectives of big data is to collect all this unstructured data and analyze
it using the appropriate technology. Data crawling, also known as web crawling, is a
popular technique for systematically browsing web pages. Crawling algorithms are
designed to descend to the maximum depth of a page and extract the data worth
analyzing.
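As a rough illustration of the crawling step just described, and not any particular crawler, the sketch below extracts the links a crawler would follow next using only Python's standard html.parser; the page content and URLs are made up:

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collect the href attribute of every <a> tag seen in a page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def extract_links(html):
    """Return the outgoing links of one page; a crawler would queue these."""
    parser = LinkExtractor()
    parser.feed(html)
    return parser.links

# A made-up page standing in for one fetched from the web:
page = '<html><body><a href="/about">About</a> <a href="/blog">Blog</a></body></html>'
print(extract_links(page))  # ['/about', '/blog']
```

A real crawler repeats this step page after page, tracking visited URLs and a depth limit.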
Variety of data helps you get insights from different sets of samples, users, and
demographics. It brings different perspectives to the same information, and it allows
you to analyze and understand the impact of different forms and sources of data
collection from a 'larger picture' point of view.
For instance, to understand the performance of a brand, traditional surveys are one
form of data collection. This is done by selecting a sample, mostly from panels. The
advantage of this approach is that you get direct answers to your questions. However,
you can obtain real-time feedback through various other channels, such as Facebook
activity, product review blogs, and updates posted by customers on merchant websites
like Flipkart, Amazon, and Snapdeal. A combination of these two forms of data gives a
clearer, data-backed perspective to your business decision-making process.
Velocity
In today's fast-paced world, speed is one of the key drivers of business success, as
time is equivalent to money. Fast turnaround is a prerequisite for staying alive amid
fierce competition. Expectations of quick results and quick deliverables press hard on
every business. In such scenarios, it becomes vital to collect and analyze vast amounts
of disparate data swiftly, in order to make well-informed decisions in real time. Even
high-quality data may hinder business decision making if it arrives with low velocity.
The general definition of velocity is 'speed in a specific direction'. In big data, velocity is
the speed or frequency at which data is collected, in various forms and from different
sources, for processing. The frequency of specific data collected via various sources
defines the velocity of that data. In other terms, it is data in motion, to be captured and
explored. It ranges from batch updates, to periodic updates, to real-time flows of data.
The number of Facebook status updates shared and messages tweeted every second,
videos uploaded or downloaded every minute, or online and offline bank transactions
recorded every hour determines the velocity of that data. You can compare velocity with
the amount of trade information captured during each trading session on a stock
exchange. Imagine a video or an image going viral in the blink of an eye to reach
millions of users across the world. Big data technology allows you to process real-time
data, sometimes without even capturing it in a database.
Streams of data are processed and databases are updated in real time, using parallel
processing of live data streams. Data streaming helps extract valuable insights from an
incessant and rapid flow of data records. Amazon Web Services (AWS) Kinesis is an
example of an application built to handle the velocity of data.
The higher the frequency of data collection into your big data platform in a given time
period, the more likely you are to make accurate decisions at the right time.
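As a minimal sketch of the streaming idea (not AWS Kinesis itself), the generator below counts events per fixed-size window while holding only one window of state, the way a streaming job avoids capturing the full stream in a database; the event stream is invented:

```python
from collections import Counter

def windowed_counts(events, window_size):
    """Consume a stream of event keys and yield a per-window count,
    never holding more than one window's worth of state."""
    window = Counter()
    for i, key in enumerate(events, start=1):
        window[key] += 1
        if i % window_size == 0:
            yield dict(window)   # emit the finished window downstream
            window.clear()       # and forget it: the stream flows on

# A made-up click stream standing in for a live feed:
stream = iter(["login", "click", "click", "login", "buy", "click"])
for snapshot in windowed_counts(stream, window_size=3):
    print(snapshot)
```

The same shape scales to time-based windows and parallel partitions in real streaming engines.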
Veracity
The fascinating trio of volume, variety, and velocity brings along a mixed bag of
information. It is quite possible that such huge data carries some uncertainty. You will
need to filter clean, relevant data out of big data to produce insights that power your
business. To make accurate decisions, the data you use as input should be
appropriately compiled, conformed, validated, and made uniform.
There are various reasons for data contamination, such as data entry errors or typos
(mostly in structured data), wrong references or links, junk data, and pseudo data.
Enormous volume, wide variety, and high velocity, even in conjunction with high-end
technology, hold no significance if the data collected or reported is incorrect. Hence,
data trustworthiness (in other words, data quality) holds the highest importance in the
big data world.
In an automated data collection, analysis, report generation, and decision-making
process, it is essential to have a foolproof system in place to avoid any lapses. Even a
minor slip at any stage of the big data extraction process can compound into a major
blunder.
Any report generated from a certain type of data from a certain source must be
validated for accuracy and reliability. It is advisable to use two different methods and
sources to validate the credibility and consistency of the data, to avoid any bias.
Veracity is not only about accuracy after data collection: determining the right source
and form of the data, the required amount or size of the data, and the right method of
analysis also plays a vital role in procuring impeccable results. Integrity holds the
highest significance in any field of business or personal life, so proper measures must
be put in place to take care of this crucial aspect. Doing so will allow you to position
yourself in the market as a reliable authority and help you attain greater success.
Parting thoughts
These 4 V's are like four pillars lending stability to the giant structure of big data, and
adding a precious fifth V, value, to the information procured serves the whole purpose
of big data: smart decision making.
Infographic: http://www.ibmbigdatahub.com/sites/default/files/infographic_file/4-Vs-of-big-data.jpg
Velocity: refers to the speed at which vast amounts of data are generated, collected,
and analyzed. Every day the number of emails, Twitter messages, photos, video clips,
and so on increases at lightning speed around the world. Every second of every day,
data is increasing. Not only must it be analyzed, but the speed of transmission and
access to the data must also remain near-instantaneous to allow real-time access to
websites, credit card verification, and instant messaging. Big data technology now
allows us to analyze the data while it is being generated, without ever putting it into
databases.
Variety: defined as the different types of data we can now use. Data today looks very
different from data of the past. We no longer have just structured data (name, phone
number, address, financial information, etc.) that fits nicely and neatly into a data table.
Much of today's data is unstructured. In fact, about 80% of all the world's data fits into
this category, including photos, video sequences, social media updates, etc. New and
innovative big data technology now allows structured and unstructured data to be
harvested, stored, and used simultaneously.
Veracity: last here, but certainly not the least. Veracity is the quality or trustworthiness
of the data. Just how accurate is all this data? For example, think about all the Twitter
posts with hashtags, abbreviations, spelling errors, typos, etc., and the reliability and
accuracy of all that content. Working with tons and tons of data is of no use if its quality
or trustworthiness is suspect. Traditionally, mainframe data was considered the most
trustworthy; but what about this new data? Quality, accuracy, and precision are needed.
The list goes on. If you aren’t using big data, you have big problems.
See: Classrooms of the Future, http://www.ozy.com/pov/classrooms-of-the-
future/66311
http://blogs.systweak.com/2017/03/big-data-vs-represents-characteristics-or-challenges-of-big-data
Or even more…
• Volume - how much data is there?
• Velocity - how quickly is the data being created, moved, or accessed?
• Variety - how many different types of sources are there?
• Veracity - can we trust the data?
• Validity - is the data accurate and correct?
• Viability - is the data relevant to the use case at hand?
• Volatility - how often does the data change?
• Vulnerability - can we keep the data secure?
• Visualization - how can the data be presented to the user?
• Value - can this data produce a meaningful return on investment?
https://healthitanalytics.com/news/understanding-the-many-vs-of-healthcare-big-data-analytics
Extracting actionable insights from big data analytics - and perhaps especially
healthcare big data analytics - is one of the most complex challenges that organizations
can face in the modern technological world.
In the healthcare realm, big data has quickly become essential for nearly every
operational and clinical task, including population health management, quality
benchmarking, revenue cycle management, predictive analytics, and clinical decision
support.
The complexity of big data analytics is hard to break down into bite-sized pieces, but
the dictionary has done a good job of providing pundits with some adequate
terminology.
Data scientists and tech journalists both love patterns, and few are more pleasing to
both professions than the alliterative properties of the many V’s of big data.
Originally, there were only the big three - volume, velocity, and variety - introduced by
Gartner analyst Doug Laney all the way back in 2001, long before “big data” became a
mainstream buzzword.
As enterprises started to collect more and more types of data, some of which were
incomplete or poorly architected, IBM was instrumental in adding the fourth V, veracity,
to the mix.
Subsequent linguistic leaps have resulted in even more terms being added to the litany.
Value, visualization, viability, vulnerability, volatility, and validity have all been proposed
as candidates for the list.
Each term describes a specific property of big data that organizations must understand
and address in order to succeed with their chosen initiatives. The article applies this list
specifically to healthcare analytics.
Problems:
1. Invades our privacy
2. Substitutes phony relationships for real ones
References:
• https://en.wikipedia.org/wiki/FiveThirtyEight
• http://fivethirtyeight.com/
References:
http://www.computerweekly.com/news/2240176248/GE-uses-big-data-to-power-
machine-services-business
• "The airline industry spends $200bn on fuel per year so a 2% saving is $4bn. GE
provides software that enables airline pilots to manage fuel efficiency."
• "Another product, Movement Planner, is a cruise control system for train drivers.
The technology assesses the terrain and the location of the train to calculate the
optimal speed to run the locomotive for fuel economy."
http://www.infoworld.com/article/2616433/big-data/general-electric-lays-out-big-plans-
for-big-data.html
• “As one of the world's largest companies, GE is a major manufacturer of systems
in aviation, rail, mining, energy, healthcare, and more. In recognition of the
importance of big data to GE, CEO Jeff Immelt launched a new initiative called
the “industrial Internet,” which aims to help customers increase efficiency and to
create new revenue opportunities for GE through analytics.”
• “The industrial Internet is GE's spin on “the Internet of things,” where Internet-
connected sensors collect vast quantities of data for analysis. According to
Immelt, sensors have already been embedded in 250,000 "intelligent machines"
manufactured by GE, including jet engines, power turbines, medical devices, and
so on. Harvesting and analyzing the data generated by those sensors holds
enormous potential for optimization across a broad range of industrial operations.”
http://www.theverge.com/2016/4/25/11501078/cern-300-tb-lhc-data-open-access
• “If you ever wanted to take a look at raw data produced by the Large Hadron
Collider, but are missing the necessary physics PhD, here's your chance: CERN
has published more than 300 terabytes of LHC data online for free. The data
covers roughly half the experiments run by the LHC's CMS detector during 2011,
with a press release from CERN explaining that this includes about 2.5 inverse
femtobarns of data - around 250 trillion particle collisions. Best not to download
this on a mobile connection then.”
The aperture arrays in the SKA could produce more than 100 times the global internet
traffic.
References:
• https://en.wikipedia.org/wiki/Square_Kilometre_Array
• https://www.skatelescope.org/ (SKA homepage)
• http://www.ska.gov.au/About/Pages/default.aspx
• http://www.ska.gov.au/NewZealandSKA/Pages/default.aspx
Wikibon Big Data Software, Hardware & Professional Services Projection 2014-2026 ($B)
Executive Summary from the Wikibon Report
The big data market grew 23.5% in 2015, led by Hadoop platform revenues. We
believe the market will grow from $18.3B in 2014 to $92.2B in 2026 - a strong 14.4%
CAGR. Growth throughout the next decade will take place in three successive and
overlapping waves of application patterns - Data Lakes, Intelligent Systems of
Engagement, and Self-Tuning Systems of Intelligence. Increasing amounts of data
generated by sensors from the Internet of Things will drive each application pattern. Big
data tool integration, administrative simplicity, and developer adoption are keys to
growth rates. The adoption of streaming technologies, which address a number of
Hadoop limitations, also will be a factor. Ultimately, the market growth will depend on
enterprises: Will doers take the steps required to transform business with big data
systems?
Source:
Internet of Things (IoT): The Next Cyber Security Target (Webinar)
Praveen Kumar Gandi, Head Information Security Services,
https://www.slideserve.com/ClicTest/webinar-on-internet-of-things-iot-the-next-cyber-security-target
• Event-Driven
If, when you say "real-time," you mean the opposite of scheduled, then you mean event-driven. Instead of
happening at a particular time interval, event-driven data processing happens when a certain action or
condition triggers it. The performance requirement here is generally "before another event happens."
http://blog.syncsort.com/2016/03/big-data/four-really-real-meanings-of-real-time
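A minimal sketch of the event-driven pattern, in Python purely for illustration: nothing runs on a schedule; a handler fires only when a matching event arrives. The event name and threshold below are invented:

```python
# Registry mapping an event type to the functions that react to it.
handlers = {}

def on(event_type):
    """Register a function to run whenever `event_type` is fired."""
    def register(fn):
        handlers.setdefault(event_type, []).append(fn)
        return fn
    return register

def fire(event_type, payload):
    """Deliver one event: every matching handler runs immediately."""
    for fn in handlers.get(event_type, []):
        fn(payload)

alerts = []

@on("sensor_reading")
def check_threshold(reading):
    # The condition, not the clock, decides whether anything happens.
    if reading > 100:
        alerts.append(reading)

for reading in [42, 130, 7]:        # simulated incoming stream
    fire("sensor_reading", reading)

print(alerts)  # [130]
```

Real event-driven platforms add queues, retries, and delivery guarantees around this same trigger-and-react shape.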
Appendix A
Self-study materials from Module 1.
• From a 2012 Big Data @ Work Study surveying 1144 business and IT
professionals in 95 countries
• Gartner Sept. 2014 report: 13% of surveyed organizations have deployed
big data solutions, while 73% have invested in big data or plan to do so
Healthcare
Financial
Graph analytics
• Path analysis
• Connectivity analysis
• Community analysis
• Centrality analysis
www.ibmbigdatahub.com/blog/what-graph-analytics
Bringing this concept to the real world, nodes or vertices can be people, such as
customers and employees; affinity groups, such as LinkedIn or meet-up groups; and
companies and institutions. They can also be places such as airports, buildings, cities
and towns, distribution centers, houses, landmarks, retail stores, shipping ports, and so
on. Vertices can also be things such as assets, bank accounts, computer servers,
customer accounts, devices, grids, molecules, policies, products, Twitter handles, URLs,
web pages, and so on.
Edges can be elements that represent relationships such as emails, likes and dislikes,
payment transactions, phone calls, social networking, and so on. Edges can be directed,
that is, they have a one-way direction arrow to represent a relationship from one node
to another; for example, Mike made a payment to Bill, or Mike follows Corinne on
Twitter. They can also be non-directed (for example, the M1 links London and Leeds)
and weighted (for example, the number of payments between two accounts is high).
The travel time between two locations is also an example of a weighted relationship.
http://www.ibmbigdatahub.com/blog/what-graph-analytics
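As a small illustrative sketch (not from the article), a directed, weighted payment graph like the one described above can be modeled with a plain adjacency map; the names reuse the hypothetical Mike, Bill, and Corinne from the text:

```python
from collections import defaultdict

# graph[src][dst] -> number of payments (the edge weight)
graph = defaultdict(dict)

def add_payment(src, dst):
    """Directed, weighted edge: src paid dst one more time."""
    graph[src][dst] = graph[src].get(dst, 0) + 1

add_payment("Mike", "Bill")
add_payment("Mike", "Bill")      # repeated payments raise the weight
add_payment("Bill", "Corinne")

# A centrality-style question: total payments received per node.
received = defaultdict(int)
for src, edges in graph.items():
    for dst, weight in edges.items():
        received[dst] += weight

print(graph["Mike"]["Bill"])     # 2  (weighted, directed edge)
print(dict(received))            # {'Bill': 2, 'Corinne': 1}
```

Dedicated graph engines add indexing and parallel traversal, but the node-edge-weight model is the same.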
Photo: Siemens
http://www.datanami.com/2015/10/30/hortonworks-preps-for-coming-iot-device-storm/
The masses may not be getting flying cars or personal robots anytime soon, but thanks
to connected devices, the future of technology is definitely bright. From smart
refrigerators and wireless BBQ sensors to connected turbines and semi-trucks, the
Internet of Things (IoT) will have a huge impact on consumer and industrial tech.
However, managing hundreds of billions of devices, not to mention the data they
generate, will not be easy, which is why Hadoop distributor Hortonworks is taking pains
to prepare for the coming device storm.
According to Intel, the number of connected devices will explode in the near future,
growing from about 15 billion devices in 2015 to more than 200 billion by 2020.
Anybody who’s watching the rise of wearables like the Fitbit, the popularity of smart
thermostats like Nest, or the presence of drone aircraft can tell you that this
phenomenon is real and accelerating.
Getting these wireless IoT devices integrated with command and control systems will
not be an easy task. While we now have enough Internet addresses to go around,
there are many other messy details to work out to ensure that devices aren't stepping
on each other's toes, that they can't be co-opted by others, and that the data is safe
and secure. Right now, it's an ad hoc, Wild West IoT world, but that simply won't scale.
[Architecture diagram: a warehousing zone in which connectors link Hadoop, holding
documents in a variety of formats, to an enterprise warehouse that feeds BI &
reporting and predictive analytics, with ETL, MDM, and data governance spanning the
stack.]
• Multi-channel customer sentiment and experience analysis
• Detect life-threatening conditions at hospitals in time to intervene
Facets of data
In data science and big data, you will come across many different types
of data, and each of them requires different tools and techniques. The
main categories of data are:
• Structured
• Unstructured
• Natural language
• Machine-generated
• Graph-based
• Audio, video, and image
• Streaming
Cielen, D., Meysman, A. D. B., & Ali, M. (2016). Introducing data science: Big data, machine
learning, and more, using Python tools. Shelter Island, NY: Manning Publications, pp. 4-8.
We live in the data age. It’s not easy to measure the total volume of data stored
electronically, but an IDC estimate put the size of the “digital universe” at 4.4 zettabytes
in 2013 and forecast a tenfold growth by 2020, to 44 zettabytes. A zettabyte is 10^21
bytes, or equivalently one thousand exabytes, one million petabytes, or one billion
terabytes. That’s more than one disk drive for every person in the world.
White, T. (2015). Hadoop: The definitive guide (4th, revised & updated ed.).
Sebastopol, CA: O'Reilly Media, p. 1.
For the meaning of this terminology (terabyte, petabyte, exabyte, zettabyte), see the
next slide and its notes.
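The decimal ladder behind those unit names can be checked with plain arithmetic; a short sketch:

```python
# The decimal (SI) ladder of storage units: each step multiplies by 10**3.
units = ["kilobyte", "megabyte", "gigabyte", "terabyte",
         "petabyte", "exabyte", "zettabyte"]
for power, name in enumerate(units, start=1):
    print(f"1 {name} = 10**{3 * power} bytes")

zettabyte = 10**21
assert zettabyte == 10**3 * 10**18   # one thousand exabytes
assert zettabyte == 10**6 * 10**15   # one million petabytes
assert zettabyte == 10**9 * 10**12   # one billion terabytes
```

Note these are the decimal units; the binary units (kibibyte, mebibyte, and so on) step by 2**10 instead.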
This flood of data is coming from many sources. Consider the following:
• The New York Stock Exchange generates about 4−5 terabytes of data per day.
• Facebook hosts more than 240 billion photos, growing at 7 petabytes per month.
• Ancestry.com, the genealogy site, stores around 10 petabytes of data.
• The Internet Archive stores around 18.5 petabytes of data.
(All figures are from 2013 or 2014. For more information, see Tom Groenfeldt, “At
NYSE, The Data Deluge Overwhelms Traditional Databases”; Rich Miller, “Facebook
Builds Exabyte Data Centers for Cold Storage”; Ancestry.com’s “Company Facts”;
Archive.org’s “Petabox”; and the Worldwide LHC Computing Grid project’s welcome
page.)
Be very careful when talking about network speed, as this is fairly standard:
1 Mbps = one million bits per second
1 MBps = one million bytes per second
By the way, even the term “byte” is a little ambiguous. The generally accepted meaning
these days is an octet (i.e., eight bits).
The de facto standard of eight bits is a convenient power of two permitting the values
0 through 255 for one byte. The international standard IEC 80000-13 codified this
common meaning. Many types of applications use information representable in eight
or fewer bits and processor designers optimize for this common usage. The
popularity of major commercial computing architectures has aided in the ubiquitous
acceptance of the 8-bit size. The unit octet was defined to explicitly denote a
sequence of 8 bits because of the ambiguity associated at the time with the byte.
https://en.wikipedia.org/wiki/Byte
Unicode UTF-8 encoding is variable-length and uses 8-bit code units. It was
designed for backward compatibility with ASCII and to avoid the complications of
endianness and byte order marks in the alternative UTF-16 and UTF-32 encodings.
The name is derived from: Universal Coded Character Set + Transformation Format
- 8-bit. UTF-8 is the dominant character encoding for the World Wide Web,
accounting for 86.9% of all Web pages in May 2016. The Internet Mail Consortium
(IMC) recommends that all e-mail programs be able to display and create mail using
UTF-8, and the W3C recommends UTF-8 as the default encoding in XML and
HTML. UTF-8 encodes each of the 1,112,064 valid code points in the Unicode code
space (1,114,112 code points minus 2,048 surrogate code points) using one to four
8-bit bytes (a group of 8 bits is known as an octet in the Unicode Standard).
https://en.wikipedia.org/wiki/UTF-8
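Both points, the eight-fold gap between bits and bytes and UTF-8's variable length, are easy to demonstrate in a few lines of Python; the link speed is an arbitrary example figure:

```python
# Network speeds are quoted in bits; file sizes in bytes: an 8x gap.
link_mbps = 100                      # a "100 Mbps" connection
megabytes_per_second = link_mbps / 8
print(megabytes_per_second)          # 12.5

# UTF-8 is variable length: one to four 8-bit bytes (octets) per code point.
for ch in ["A", "é", "€", "😀"]:
    print(repr(ch), "->", len(ch.encode("utf-8")), "byte(s)")
```

The ASCII range stays one byte each, which is exactly the backward compatibility the Wikipedia excerpt describes.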
What is Hadoop?
• Apache open source software framework for reliable, scalable,
distributed computing of massive amounts of data
▪ Hides underlying system details and complexities from user
▪ Developed in Java
• Consists of 3 sub projects:
▪ MapReduce
▪ Hadoop Distributed File System (aka. HDFS)
▪ Hadoop Common
• Has a large ecosystem with both open-source & proprietary Hadoop-
related projects
▪ HBase / ZooKeeper / Avro / etc.
• Meant for heterogeneous commodity hardware
Hadoop is an open source project of the Apache Foundation.
It is a framework written in Java, originally developed by Doug Cutting, who named it
after his son's toy elephant.
Hadoop builds on concepts from Google's MapReduce and Google File System (GFS)
papers at its foundation.
It is optimized to handle massive amounts of data, which may be structured,
semi-structured, or unstructured, using commodity hardware, that is, relatively
inexpensive computers.
This massively parallel processing is done with high performance. In its initial
conception, which we will study first, Hadoop is a batch system handling massive
amounts of data, so the response time is not immediate.
Hadoop replicates its data across different computers, so that if one goes down, the
data can be processed from one of the replicas.
You may be familiar with OLTP (Online Transaction Processing) workloads, where
structured data, such as a relational database, is accessed randomly; for example,
when you access your bank account.
You may also be familiar with OLAP (Online Analytical Processing) or DSS (Decision
Support Systems) workloads, where structured data is accessed sequentially to
generate reports that provide business intelligence.
Now, you may not be as familiar with the concept of "Big Data". Big Data is a term
used to describe large collections of data (also known as datasets) that may be
unstructured and that grow so large, and so quickly, that they are difficult to
manage with conventional database or statistics tools.
Hadoop is used for neither OLTP nor OLAP, but for Big Data. It complements those two
approaches to managing data, so Hadoop is NOT a replacement for an RDBMS.
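Since MapReduce is one of Hadoop's core subprojects, a minimal sketch of its programming model may help before we go further. The following is a plain-Python simulation of the classic word-count example, showing the map, shuffle/sort, and reduce phases conceptually; it is not the actual Hadoop Java API, and the input lines are invented for illustration:

```python
# Illustrative simulation of the MapReduce word-count flow in plain Python.
from itertools import groupby
from operator import itemgetter

def map_phase(lines):
    # Map: emit a (word, 1) pair for every word in the input split
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)

def reduce_phase(pairs):
    # Shuffle/sort groups pairs by key; reduce sums the counts per word
    for word, group in groupby(sorted(pairs), key=itemgetter(0)):
        yield (word, sum(count for _, count in group))

lines = ["Hadoop stores data", "Hadoop processes data in batch"]
print(dict(reduce_phase(map_phase(lines))))
# {'batch': 1, 'data': 2, 'hadoop': 2, 'in': 1, 'processes': 1, 'stores': 1}
```

In real Hadoop, the map and reduce functions run in parallel across the cluster, and the framework handles the shuffle/sort, data locality, and failure recovery that this sketch glosses over.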
Think differently
As we start to work with Hadoop, we need to think differently:
• Different processing paradigms
• Different approaches to storing data
• Think ELT (extract-load-transform)
rather than ETL (extract-transform-load)
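The ELT-versus-ETL distinction on this slide can be sketched in a few lines. The following is an illustrative Python contrast (the records and helper names are invented for the example, not course code): ETL cleans data before loading, whereas ELT lands the raw data first, Hadoop-style, and transforms it on read:

```python
# Illustrative contrast: ETL transforms before loading;
# ELT loads raw data first and transforms inside the target system.
raw_records = ["  Alice,42 ", "Bob,17", "  Carol,99"]

def transform(record):
    # Parse "name,value" into a cleaned dictionary
    name, value = record.strip().split(",")
    return {"name": name, "value": int(value)}

# ETL: transform first, load only the cleaned result
etl_store = [transform(r) for r in raw_records]

# ELT: load the raw records as-is, transform later, on demand
elt_store = list(raw_records)                 # raw landing zone (think HDFS)
elt_view = [transform(r) for r in elt_store]  # schema-on-read transformation

assert etl_store == elt_view
```

The ELT approach keeps the original raw data available, so new transformations can be applied later without re-extracting from the source, which is one reason it fits Hadoop's storage model.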
References:
• Google File System (2003): http://research.google.com/archive/gfs.html
o The Google File System
• MapReduce (2004): http://research.google.com/archive/mapreduce.html
o MapReduce: Simplified Data Processing on Large Clusters
Checkpoint
1. What are the 4Vs of Big Data?
What are some of the additional Vs that some add to the basic four?
2. What are the three types of Big Data?
3. Name some of the industry sectors that are using Big Data and Data
Analytics to manage their business.
Checkpoint solution
1. What are the 4Vs of Big Data?
What are some of the additional Vs that some add to the basic four?
▪ Volume, Velocity, Variety, Veracity
▪ Value, Validity, Viability, Volatility, Vulnerability, Visualization
2. What are the three types of Big Data?
▪ Structured, Semi-structured, Unstructured
▪ Secondary types: Natural Language, Machine-Generated, Graph-based, Audio / Video / Image,
Streaming
3. Name some of the industry sectors that are using Big Data and Data
Analytics to manage their business.
▪ Healthcare, Telecommunications, Utilities, Banking / Finance, Insurance, Agriculture, Travel,
Retail. This list is not exhaustive, merely representative of the examples used in this
course; other industries are also valid.
Unit summary
• Develop an understanding of the complete open-source Hadoop
ecosystem and its near-term future directions
• Be able to compare and evaluate the major Hadoop distributions and
their ecosystem components, both their strengths and their limitations
• Gain hands-on experience with key components of various big data
ecosystem components and their roles in building a complete big data
solution to common business problems
• Learn the tools that will enable you to continue your big data
education after the course
▪ This learning is going to be a life-long technical journey that you will start
here and continue throughout your business career