
Introduction to Data Science (CCPS521)

Session 2:
Big Data
What is Big Data
A Bit of History on Data (00:30 – 9:51): https://www.youtube.com/watch?v=gq_T7EgQXkI
Big data exceeds the reach of commonly used hardware environments and software tools to capture, manage
and process it within a tolerable elapsed time for its user population.
Merv Adrian Article in Teradata Magazine Q1/2011
Source: http://docshare04.docshare.tips/files/20905/209055375.pdf
• Big Data was first mentioned by NASA in a 1997 article:
https://www.nas.nasa.gov/assets/pdf/techreports/1997/nas-97-010.pdf
• The term was first picked up by the media in 2005: A Short History Of Big Data:
https://datafloq.com/read/big-data-history/239
• Big data is a term for data sets that are so large or complex that traditional data processing application
software is inadequate to deal with them. Big data challenges include capturing data, data storage, data
analysis, search, sharing, transfer, visualization, querying, updating and information privacy.
• Lately, the term "big data" tends to refer to the use of predictive analytics, user behavior analytics, or certain
other advanced data analytics methods that extract value from data, and seldom to a particular size of data
set. "There is little doubt that the quantities of data now available are indeed large, but that’s not the most
relevant characteristic of this new data ecosystem."[2] Analysis of data sets can find new correlations to "spot
business trends, prevent diseases, combat crime and so on."[3] Scientists, business executives, practitioners
of medicine, advertising and governments alike regularly meet difficulties with large data-sets in areas
including Internet search, fintech, urban informatics, and business informatics. Scientists encounter
limitations in e-Science work, including meteorology, genomics,[4] connectomics, complex physics
simulations, biology and environmental research.[5]
Source: https://en.wikipedia.org/wiki/Big_data
What is Big Data
The trend is for every individual’s data footprint to grow, but perhaps
more significantly, the amount of data generated by machines as a part
of the Internet of Things will be even greater than that generated by
people. Machine logs, RFID readers, sensor networks, vehicle GPS
traces, retail transactions—all of these contribute to the growing
mountain of data. The volume of data being made publicly available
increases every year, too. Organizations no longer have to merely
manage their own data; success in the future will be dictated to a large
extent by their ability to extract value from other organizations’ data.
Source: Hadoop, The Definitive Guide by Tom White, 4th edition (2015), page 4
What is Big Data
1 Byte = 8 bits; each bit holds 0 or 1 (On/Off, Yes/No), etc.
Kilobyte = 1,024 bytes = 2^10 (≈10^3)
Megabyte = 1,024 KB = 2^20 (≈10^6)
Gigabyte = 1,024 MB = 2^30 (≈10^9)
Terabyte = 1,024 GB = 2^40 (≈10^12)
Petabyte = 1,024 TB = 2^50 (≈10^15)
Exabyte = 1,024 PB = 2^60 (≈10^18)
Zettabyte = 1,024 EB = 2^70 (≈10^21)
Yottabyte = 1,024 ZB = 2^80 (≈10^24)
Brontobyte (informal) = 1,024 YB = 2^90 (≈10^27)
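The binary prefixes above can be verified with a short script. This is a minimal sketch that prints each unit's size and its drift from the nearby decimal (power-of-ten) prefix; the unit names and exponents are taken from the table above:

```python
# Each binary prefix multiplies by 1024 (2^10); the nearby decimal
# prefix multiplies by 1000 (10^3), so the two drift apart as units grow.
units = ["Kilobyte", "Megabyte", "Gigabyte", "Terabyte",
         "Petabyte", "Exabyte", "Zettabyte", "Yottabyte"]

for i, name in enumerate(units, start=1):
    binary = 1024 ** i        # 2^(10*i) bytes
    decimal = 1000 ** i       # 10^(3*i) bytes
    print(f"{name}: 2^{10 * i} = {binary} bytes "
          f"(binary/decimal ratio {binary / decimal:.4f})")
```

The drift matters in practice: a disk advertised as one "terabyte" (10^12 bytes) holds roughly 9% less than 2^40 bytes.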

Source: https://en.wikipedia.org/wiki/Exabyte
Big Data Systems
The six Vs
Volume:
Big data implies enormous volumes of data. Data used
to be created mainly by employees. Now that data is
generated by machines, networks and human
interaction on systems like social media, the volume
of data to be analyzed is massive. Still, Inderpal
notes that the volume of data is not as much of a
problem as other V's, such as veracity.
Big Data Systems
The six Vs
Variety:
Variety refers to the many sources and types of data both
structured and unstructured. We used to store data from
sources like spreadsheets and databases. Now data
comes in the form of emails, photos, videos, monitoring
devices, PDFs, audio, etc. This variety of unstructured
data creates problems for storage, mining and analyzing
data.
Big Data Systems
The six Vs
Velocity:
Big Data Velocity deals with the pace at which data flows in
from sources like business processes, machines, networks
and human interaction with things like social media sites,
mobile devices, etc. The flow of data is massive and
continuous. This real-time data can help researchers and
businesses make valuable decisions that provide strategic
competitive advantages and ROI if you are able to handle the
velocity. Inderpal suggests that sampling data can help deal
with issues like volume and velocity.
Big Data Systems
The six Vs
Veracity:
Big Data Veracity refers to the biases, noise and abnormality in
data. Is the data being stored and mined meaningful to the
problem being analyzed? Inderpal feels veracity is the biggest
challenge in data analysis compared to things like volume and
velocity. In scoping out your big data strategy, you need to have
your team and partners work to keep your data clean, and put
processes in place to keep 'dirty data' from accumulating in
your systems.
Big Data Systems

The six Vs
Validity:
Closely related to veracity, validity asks whether the data is correct and
accurate for the intended use. Clearly, valid data is key to making the right
decisions.
Volatility:
Big data volatility refers to how long data remains valid and how long it should be
stored. In this world of real-time data, you need to determine at what point data is
no longer relevant to the current analysis.
Big Data Systems
The six Vs
Resources:
The 10 Vs of Big Data: https://tdwi.org/articles/2017/02/08/10-vs-of-big-data.aspx
https://insidebigdata.com/2013/09/12/beyond-volume-variety-velocity-issue-big-data-veracity/
https://www.ibm.com/analytics/hadoop/big-data-analytics
Big Data Facilitation, Utilization, and Monetization: Exploring the 3Vs in a New Product Development Process by Jeff S. Johnson, Scott B. Friend, and Hannah S. Lee
http://onlinelibrary.wiley.com.ezproxy.lib.ryerson.ca/doi/10.1111/jpim.12397/pdf

Ishwarappa, J. Anuradha,
A Brief Introduction on Big Data 5Vs Characteristics and Hadoop Technology,
Procedia Computer Science,
Volume 48, 2015, Pages 319-324, ISSN 1877-0509,
https://doi.org/10.1016/j.procs.2015.04.188.
(https://www.sciencedirect.com/science/article/pii/S1877050915006973)
What is Data Lake
What is a Data Lake? https://www.youtube.com/watch?v=aC9_fDoMH6M ,
https://www.youtube.com/watch?v=LxcH6z8TFpI
A data lake is a method of storing data within a system or repository, in its
natural format,[1] that facilitates the collocation of data in various schemata
and structural forms, usually object blobs or files. The idea of a data lake is to
have a single store of all data in the enterprise, ranging from raw data (an
exact copy of source-system data) to transformed data that is used
for various tasks including reporting, visualization, analytics and machine
learning. The data lake includes structured data from relational databases
(rows and columns), semi-structured data (CSV, logs, XML, JSON),
unstructured data (emails, documents, PDFs) and even binary data (images,
audio, video) thus creating a centralized data store accommodating all forms
of data.
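The collocation of structured, semi-structured and unstructured data described above can be illustrated with a tiny local sketch. The directory name and file contents here are made up for illustration; a real data lake would sit on object storage or HDFS, not a local folder:

```python
import csv
import json
import pathlib

# Hypothetical "raw" landing zone of the lake.
lake = pathlib.Path("data_lake/raw")
lake.mkdir(parents=True, exist_ok=True)

# Structured: rows and columns, as exported from a relational database.
with open(lake / "sales.csv", "w", newline="") as f:
    csv.writer(f).writerows([["id", "amount"], [1, 9.99], [2, 4.50]])

# Semi-structured: self-describing records whose schema can vary per record.
(lake / "events.json").write_text(
    json.dumps([{"user": "a", "action": "click"},
                {"user": "b", "action": "view", "device": "mobile"}]))

# Unstructured: stored as-is, in its natural format, and interpreted
# only when read ("schema on read") rather than when loaded.
(lake / "note.txt").write_text("Customer emailed about a late delivery.")

print(sorted(p.name for p in lake.iterdir()))
```

The key design point the sketch mirrors is that nothing is transformed on the way in; each file keeps its natural format, and any schema is applied later by whatever tool reads it.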
Sources:
https://en.wikipedia.org/wiki/Data_lake
https://docs.microsoft.com/en-us/azure/architecture/data-guide/scenarios/data-lake
Internet of Things (IoT)
https://www.youtube.com/watch?v=PXncS2_63o4
Input Devices
The Internet of Things (IoT)
The Internet of Things (IoT) is the ultimate change agent, enabling
commerce and industry to connect, measure, and manage products,
information, operations, and the enterprise. Big Data is getting bigger
due to IoT, and this highly distributed, unstructured data is generated
by a wide range of sensors, beacons, applications, websites, social
media, weather data, computers, smartphones and more. An important
question is: how do we monetize IoT data? Analytics and management of
Big Data is one of the most promising IoT opportunities for revenue
growth and value creation today. Without the ability to leverage IoT
analytics, organizations and governments will be left with basic sensor
data and unable to truly harness the tangible value of IoT. The course
includes real-world case scenarios where IoT adoption has huge
potential to deliver monetization, efficiency, productivity, profitability,
competitive advantage and economic prosperity.
Hadoop
Doug Cutting: The Origins of Hadoop https://www.youtube.com/watch?v=ebgXN7VaIZA
Doug Cutting: The Name of Hadoop https://www.youtube.com/watch?v=irK7xHUmkUA
What Is Hadoop? https://www.youtube.com/watch?v=OoEpfb6yga8

The Origin of the Name “Hadoop”


The name Hadoop is not an acronym; it’s a made-up name. The project’s creator, Doug Cutting,
explains how the name came about:
The name my kid gave a stuffed yellow elephant. Short, relatively easy to spell and pronounce,
meaningless, and not used elsewhere: those are my naming criteria. Kids are good at generating
such. Googol is a kid’s term.
Projects in the Hadoop ecosystem also tend to have names that are unrelated to their function,
often with an elephant or other animal theme (“Pig,” for example). Smaller components are given
more descriptive (and therefore more mundane) names. This is a good principle, as it means you
can generally work out what something does from its name. For example, the namenode manages
the file-system namespace.

Source: Hadoop, The Definitive Guide by Tom White, 4th edition (2015), page 12
Hadoop
History:
In 2006, Cutting went to work with Yahoo, which was equally
impressed by the Google File System and MapReduce papers and
wanted to build open source technologies based on them. They spun
out the storage and processing parts of Nutch to form Hadoop
(named after Cutting’s son’s stuffed elephant) as an open-source
Apache Software Foundation project
https://gigaom.com/2013/03/04/the-history-of-hadoop-from-4-nodes-to-the-future-of-data/
Google File System:
https://static.googleusercontent.com/media/research.google.com/en//archive/gfs-sosp2003.pdf
Hadoop
What is it?
Apache Hadoop is an open-source software framework used for distributed storage and processing of big data
sets using the MapReduce programming model. It consists of computer clusters built from commodity hardware. All
the modules in Hadoop are designed with a fundamental assumption that hardware failures are common occurrences
and should be automatically handled by the framework. The core of Apache Hadoop consists of a storage part, known
as Hadoop Distributed File System (HDFS), and a processing part which is a MapReduce programming model.
Source: https://en.wikipedia.org/wiki/Apache_Hadoop
What Is Apache Hadoop?
• The Apache™ Hadoop® project develops open-source software for reliable, scalable, distributed
computing.
• The Apache Hadoop software library is a framework that allows for the distributed processing of large
data sets across clusters of computers using simple programming models. It is designed to scale up
from single servers to thousands of machines, each offering local computation and storage. Rather
than rely on hardware to deliver high-availability, the library itself is designed to detect and handle
failures at the application layer, so delivering a highly-available service on top of a cluster of computers,
each of which may be prone to failures. Source: http://hadoop.apache.org/
• Hortonworks Hadoop Tutorial (4 – 7:50 then 9:35 – 16:30) https://www.youtube.com/watch?v=6UtD53BzDNk
Hadoop Components
• Hadoop project includes these modules:
• Hadoop Common: The common utilities that support the other Hadoop
modules. https://www.techopedia.com/definition/30427/hadoop-common
• Hadoop Distributed File System (HDFS™): A distributed file system that
provides high-throughput access to application data.
https://www.youtube.com/watch?v=1_ly9dZnmWc (0 – 3:08)
• Hadoop YARN: A framework for job scheduling and cluster resource
management. https://www.youtube.com/watch?v=9Bm6EBQ8t8U
• Hadoop MapReduce: A YARN-based system for parallel processing of large
data sets. (3:30 – 5:50) https://www.youtube.com/watch?v=ht3dNvdNDzI
Source: http://hadoop.apache.org/
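The MapReduce model behind the module listed above can be illustrated without a cluster. The sketch below simulates the map, shuffle and reduce phases of the classic word-count job in plain Python; it is an in-memory illustration of the programming model, not actual Hadoop API code:

```python
from collections import defaultdict

def map_phase(line):
    # Mapper: emit a (key, value) pair for every word in the input split.
    for word in line.lower().split():
        yield (word, 1)

def shuffle(pairs):
    # Shuffle/sort: group all values by key, as the framework does
    # between the map and reduce phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    # Reducer: combine all values for one key into the final count.
    return (key, sum(values))

lines = ["big data is big", "data lakes store big data"]
pairs = [pair for line in lines for pair in map_phase(line)]
result = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
print(result)  # {'big': 3, 'data': 3, 'is': 1, 'lakes': 1, 'store': 1}
```

In real Hadoop, each phase runs in parallel across machines: mappers process separate input splits, the framework shuffles intermediate pairs over the network, and reducers each handle a disjoint set of keys.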
Hadoop Other Components
Other Hadoop-related projects at Apache include:
• Ambari™: A web-based tool for provisioning, managing, and monitoring Apache Hadoop clusters, with support for Hadoop HDFS, Hadoop
MapReduce, Hive, HCatalog, HBase, ZooKeeper, Oozie, Pig and Sqoop. Ambari also provides a dashboard for viewing cluster health, such as heatmaps,
and the ability to view MapReduce, Pig and Hive applications visually, along with features to diagnose their performance characteristics in a
user-friendly manner.
• Avro™: A data serialization system.
• Cassandra™: A scalable multi-master database with no single points of failure.
• Chukwa™: A data collection system for managing large distributed systems.
• HBase™: A scalable, distributed database that supports structured data storage for large tables.
• Hive™: A data warehouse infrastructure that provides data summarization and ad hoc querying.
• Mahout™: A scalable machine learning and data mining library.
• Pig™: A high-level data-flow language and execution framework for parallel computation.
• Spark™: A fast and general compute engine for Hadoop data. Spark provides a simple and expressive programming model that supports a wide range
of applications, including ETL, machine learning, stream processing, and graph computation.
• Tez™: A generalized data-flow programming framework, built on Hadoop YARN, which provides a powerful and flexible engine to execute an arbitrary
DAG of tasks to process data for both batch and interactive use-cases. Tez is being adopted by Hive™, Pig™ and other frameworks in the Hadoop
ecosystem, and also by other commercial software (e.g. ETL tools), to replace Hadoop™ MapReduce as the underlying execution engine.
• ZooKeeper™: A high-performance coordination service for distributed applications.

Source: http://hadoop.apache.org/
Some History, Big Data & Hadoop
A Brief History of Computers: https://www.youtube.com/watch?v=iK0PT5q7GlE
Punched Card Machines: http://www.suomentietokonemuseo.fi/vanhat/eng/laite_eng.htm
Computer Punch Cards - Historical Overview - https://www.youtube.com/watch?v=YXE6HjN8heg

Big Data - Tim Smith: https://www.youtube.com/watch?v=j-0cUmUyb-Y


A Bit of History on Data: https://www.youtube.com/watch?v=gq_T7EgQXkI
Doug Cutting: The Origins of Hadoop https://www.youtube.com/watch?v=ebgXN7VaIZA
Doug Cutting: The Name of Hadoop https://www.youtube.com/watch?v=irK7xHUmkUA
What Is Hadoop? https://www.youtube.com/watch?v=OoEpfb6yga8
What is MapReduce? https://www.youtube.com/watch?v=43fqzaSH0CQ
Learn MapReduce with Playing Cards: https://www.youtube.com/watch?v=bcjSe0xCHbE
Map Reduce – Example: https://www.youtube.com/watch?v=iaHCvhwA8p4
