APDS03 Big Data Day 2
Bytes ++
Every two days we now produce more data than was created from the beginning of time until just three or four years ago.
Over 5,000 exabytes of data sit in the cloud. If printed and stacked as books, the pile would stretch from Earth to Pluto roughly 90 times.
Every minute we send ~400 million emails, register ~2.5 million Facebook likes, post ~350,000 tweets,
and upload ~300,000 photos to Facebook.
Burned to DVDs and stacked, all the data created so far would reach the moon three times over (up and back).
Decoding the human genome originally took 10 years to process; now it can be achieved in one week.
Walmart handles more than 1 million customer transactions every hour, all of which are imported into
databases estimated to contain more than 2.5 petabytes of data.
The Gartner Hype Cycle for Big Data 2013
The hype comparison over the last 3 years
https://www.youtube.com/watch?v=SQipnBNVjv0
Big Opportunities !
Reference Terminology
Astronomy
LSST - The Large Synoptic Survey Telescope (LSST) is a wide-field survey reflecting telescope with an 8.4-
meter primary mirror
PAN-STARRS - The Panoramic Survey Telescope and Rapid Response System (Pan-STARRS; code: F51
and F52) located at Haleakala Observatory, Hawaii, consists of astronomical cameras, telescopes and a
computing facility that is surveying the sky for moving objects on a continual basis, including accurate
astrometry and photometry of already detected objects
SDSS - The Sloan Digital Sky Survey or SDSS is a major multi-filter imaging and spectroscopic redshift
survey using a dedicated 2.5-m wide-angle optical telescope at Apache Point Observatory in New Mexico,
United States. The project was named after the Alfred P. Sloan Foundation, which contributed significant
funding.
n-body-SIMS - In physics and astronomy, an N-body simulation is a simulation of a dynamical system of
particles, usually under the influence of physical forces, such as gravity (see n-body problem). N-body
simulations are widely used tools in astrophysics, from investigating the dynamics of few-body systems like
the Earth-Moon-Sun system to understanding the evolution of the large-scale structure of the universe. In
physical cosmology, N-body simulations are used to study processes of non-linear structure formation such
as galaxy filaments and galaxy halos from the influence of dark matter. Direct N-body simulations are used
to study the dynamical evolution of star clusters.
Reference Terminology
Ocean Sciences
Patrick Meier is an internationally recognized expert and consultant on Humanitarian Technology and Innovation. His book,
Digital Humanitarians, has been praised by Harvard, MIT, Stanford, Oxford, UN, Red Cross, World Bank, USAID and
others.
https://irevolutions.org/bio/
Deb Roy is a tenured professor at MIT and served as Chief Media Scientist of Twitter from 2013 to 2017.
A native of Winnipeg, Manitoba, Canada, Roy received his PhD in Media Arts and Sciences from MIT.
MIT researcher Deb Roy wanted to understand how his infant son learned language -- so he wired up his house with video
cameras to catch every moment (with exceptions) of his son's life, then parsed 90,000 hours of home video to watch "gaaaa"
slowly turn into "water."
https://dkroy.media.mit.edu/
https://www.ted.com/talks/deb_roy_the_birth_of_a_word?language=en
Big Opportunities ! Wordscapes (Deb Roy MIT)
https://www.ted.com/talks/deb_roy_the_birth_of_a_word?language=en
Big Opportunities ! AI in Rock Climbing
Stanford Paper
- http://cs229.stanford.edu/proj2017/final-reports/5232206.pdf
Challenges & Traditional Solutions..
Big Data Vs Traditional DW
Split a 1 TB file into 100 equal-size blocks and read them in
parallel.
[Diagram: a 1 TB file divided into blocks 1, 2, 3, 4, 5 … 98, 99, 100]
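The split-and-read-in-parallel idea above can be sketched in a few lines. This is a minimal, hypothetical illustration using ordinary local files and a thread pool, not HDFS itself; the block count and worker count are arbitrary.

```python
import os
from concurrent.futures import ThreadPoolExecutor

def read_block(path, offset, size):
    """Read one block of the file starting at `offset`."""
    with open(path, "rb") as f:
        f.seek(offset)
        return f.read(size)

def parallel_read(path, num_blocks=100):
    """Split the file into `num_blocks` roughly equal blocks and read them in parallel."""
    total = os.path.getsize(path)
    block = -(-total // num_blocks)          # ceiling division: size of each block
    offsets = range(0, total, block)
    with ThreadPoolExecutor(max_workers=8) as pool:
        parts = pool.map(lambda off: read_block(path, off, block), offsets)
    return b"".join(parts)
```

On a single machine the disk is still the bottleneck; the point of Hadoop is to place the 100 blocks on 100 different machines so the reads are truly parallel, which is exactly where the network-congestion problem below comes from.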
Network Congestion
The Problem !!
The Solution !!
Hadoop – What it is NOT..
Not a Database substitute
Hadoop is not meant to store and manage data the way an RDBMS does.
Not a real time data processing engine
Hadoop is a batch-processing system. Using Hadoop with the expectation of analyzing data
as soon as it is generated is inappropriate. If data must be analyzed at the point of
generation, without time lag, look for alternative technologies.
Not an analytic engine
Hadoop by itself does not provide any built-in analytic capabilities; you write MapReduce
programs for each data-processing requirement.
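To make the "write MapReduce programs" point concrete, here is a minimal word-count sketch in plain Python that mimics the map → shuffle → reduce phases. It is an illustration of the programming model only, not the actual Hadoop API.

```python
from collections import defaultdict
from itertools import chain

def mapper(line):
    # Map phase: emit a (word, 1) pair for every word in the input line.
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    # Shuffle phase: group all values by key, as the framework does between map and reduce.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reducer(key, values):
    # Reduce phase: sum the counts emitted for one word.
    return key, sum(values)

def word_count(lines):
    mapped = chain.from_iterable(mapper(line) for line in lines)
    return dict(reducer(k, v) for k, v in shuffle(mapped).items())
```

In real Hadoop the mapper and reducer run on many nodes and the shuffle moves data over the network; the logic per record, however, is exactly this simple.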
A brief History – The Journey from 2002
Big Data – Ecosystem
Distributions
Cloudera
Hortonworks
MapR (Fastest)
Why Another File System..
Let us understand Blocks..
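A small sketch of how a file maps onto blocks, assuming the common HDFS default block size of 128 MB: every block is full-size except possibly the last.

```python
DEFAULT_BLOCK_SIZE = 128 * 1024 * 1024   # common HDFS default block size (128 MB)

def split_into_blocks(file_size, block_size=DEFAULT_BLOCK_SIZE):
    """Return (offset, length) pairs; all blocks are full-size except possibly the last."""
    blocks = []
    offset = 0
    while offset < file_size:
        length = min(block_size, file_size - offset)
        blocks.append((offset, length))
        offset += length
    return blocks
```

For example, a 300 MB file becomes two 128 MB blocks plus one 44 MB block; unlike a disk file system, the last block occupies only the space it needs.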
HDFS..
Data is split into blocks and distributed across multiple nodes in the cluster.
Suitable for applications that require high throughput access to large data sets.
Hardware failure
An HDFS instance consists of hundreds of machines each of which can fail, key goal of HDFS
architecture is to support detection of such faults and recovery.
Data Locality
Achieves greater efficiency by moving computation to the data. Since files are spread across the distributed file system as chunks,
each compute process running on a node operates on a subset of the data. Which data a node operates on is chosen based on its
locality to the node: most data is read from the local disk straight into the CPU, easing the strain on network bandwidth and
preventing unnecessary network transfers.
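The "move computation to the data" idea can be sketched as a toy scheduler: given the block-to-replica map, assign each block's task to a node that actually stores a replica of that block. This is an illustration of the principle, not the real Hadoop scheduler.

```python
def schedule_tasks(block_locations, node_load):
    """Assign each block's task to the least-loaded node that stores a replica of it."""
    assignment = {}
    for block, nodes in block_locations.items():
        best = min(nodes, key=lambda n: node_load[n])  # only consider nodes holding the data
        assignment[block] = best
        node_load[best] += 1
    return assignment
```

Because every task lands on a node that already holds its block, the input is read from local disk and no block crosses the network.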
Portability
Designed to be portable from one platform to another facilitating wider adoption.
Economy
Hadoop doesn’t require expensive, highly reliable hardware to run on. It’s designed to run on clusters of commodity hardware
(commonly available hardware from multiple vendors) for which the chance of node failure across the cluster is high,
at least for large clusters.
When NOT to use HDFS..
An HDFS cluster consists of a single NameNode, a master server that manages the file system
namespace and regulates access to files by clients.
In addition, there are a number of DataNodes, usually one per node in the cluster, which manage
storage attached to the nodes that they run on.
HDFS exposes a file system namespace and allows user data to be stored in files.
Internally, a file is split into one or more blocks and these blocks are stored in a set of DataNodes.
NameNode..
The NameNode (master) executes file system namespace operations like opening, closing, and renaming files and directories. It also
determines the mapping of blocks to DataNodes.
The NameNode maintains the file system tree and metadata for all files and directories in the tree. Any change to the file system
namespace or its properties is recorded by the NameNode.
This information is stored persistently on the local disk in the form of two files: the FsImage and the Edit log.
The NameNode uses a transaction log called the EditLog to persistently record every change that occurs to file system metadata. The
NameNode uses a file in its local host OS file system to store the EditLog.
o The entire file system namespace, including the mapping of blocks to files and file system properties, is stored in a file called the FsImage
(stored in local file system).
o The NameNode keeps an image of the entire file system namespace and file Blockmap in memory.
o When the NameNode starts up, it reads the FsImage and EditLog from disk, applies all the transactions from the EditLog to the in-memory
representation of the FsImage, and flushes out this new version into a new FsImage on disk.
o It then truncates the old EditLog. This process is called a checkpoint (it generally occurs during startup).
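The checkpoint described above — replay the EditLog on top of the FsImage to rebuild the namespace — can be sketched as follows. The namespace is modeled as a simple path-to-metadata dict and the operation names are hypothetical; the real formats are binary.

```python
def load_namespace(fsimage, edit_log):
    """Replay the EditLog on top of the FsImage, as the NameNode does at startup."""
    namespace = dict(fsimage)              # last checkpointed state from disk
    for op, path, value in edit_log:       # each logged metadata change, in order
        if op == "create":
            namespace[path] = value
        elif op == "rename":
            namespace[value] = namespace.pop(path)
        elif op == "delete":
            del namespace[path]
    return namespace                       # flushed to disk as the new FsImage
```

After this replay the NameNode writes the merged state out as a new FsImage and truncates the EditLog, so the next startup has fewer transactions to apply.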
The namenode also knows the datanodes on which all the blocks are stored for a given file.
It does not store block locations persistently, since this information is reconstructed from datanodes when the system starts.
A client accesses the filesystem on behalf of the user by communicating with the namenode and datanodes.
The client presents a filesystem interface, so the user code does not need to know about the namenode and datanode to function.
DataNode..
The DataNodes are responsible for serving read and write requests from the file system’s clients.
The DataNodes also perform block creation, deletion, and replication upon instruction from the
NameNode.
DataNode periodically sends a Heartbeat and a Blockreport to the NameNode in the cluster.
Receipt of a Heartbeat implies that the DataNode is functioning properly. A Blockreport contains a
list of all blocks on a DataNode.
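A sketch of how the NameNode uses heartbeats to decide liveness. The timing constants below reflect commonly cited Hadoop defaults (a heartbeat every 3 seconds; a DataNode declared dead after roughly 10.5 minutes of silence) and are assumptions, not configuration read from any cluster.

```python
HEARTBEAT_INTERVAL = 3       # seconds between DataNode heartbeats (typical default)
DEAD_AFTER = 10 * 60 + 30    # NameNode marks a DataNode dead after ~10.5 min of silence

def live_datanodes(last_heartbeat, now):
    """Return the DataNodes whose most recent heartbeat is recent enough."""
    return {node for node, ts in last_heartbeat.items() if now - ts <= DEAD_AFTER}
```

When a node drops out of this set, the NameNode stops routing client requests to it and schedules re-replication of the blocks it held, using the Blockreports to know which blocks those were.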
Replication facts..
HDFS is designed to reliably store very large files across machines in a large cluster.
It stores each file as a sequence of blocks; all blocks in a file except the last block are the same size.
Files in HDFS are write-once and have strictly one writer at any time.
In most cases, network bandwidth between machines in the same rack is greater than network
bandwidth between machines in different racks.
The NameNode determines the rack ID each DataNode belongs to via a process called Hadoop
Rack Awareness.
The default replica placement policy (first replica on the writer's node, the other two on nodes in a single remote rack) cuts the inter-rack write traffic, which generally improves write performance.
The chance of rack failure is far less than that of node failure, so this policy does not compromise data
reliability and availability guarantees.
However, it does reduce the aggregate network bandwidth used when reading data since a block is
placed in only two unique racks rather than three.
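The placement policy above can be sketched directly: one replica on the writer's node, and two on distinct nodes in a single remote rack, so the block spans exactly two racks. A simplified illustration (it assumes at least one remote rack with two or more nodes, and ignores load and disk-space considerations the real policy takes into account):

```python
import random

def place_replicas(writer_node, racks):
    """racks: dict of rack_id -> list of node names. Returns 3 nodes for one block."""
    local_rack = next(r for r, nodes in racks.items() if writer_node in nodes)
    first = writer_node                                   # replica 1: the writer's node
    remote_rack = random.choice([r for r in racks if r != local_rack])
    second, third = random.sample(racks[remote_rack], 2)  # replicas 2 and 3: same remote rack
    return [first, second, third]
```

Placing replicas 2 and 3 in the same remote rack is what trades a little read bandwidth (two racks instead of three) for much cheaper writes (one inter-rack transfer instead of two).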
Hands on - HDFS Sample Commands practice
Etc…
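For the hands-on, a few of the commonly used HDFS shell commands, assuming a running cluster and an existing `/user/hadoop` home directory (paths and file names here are placeholders):

```shell
# list the contents of an HDFS directory
hdfs dfs -ls /user/hadoop

# create a directory and copy a local file into HDFS
hdfs dfs -mkdir -p /user/hadoop/input
hdfs dfs -put localfile.txt /user/hadoop/input/

# read a file back, and copy it to the local file system
hdfs dfs -cat /user/hadoop/input/localfile.txt
hdfs dfs -get /user/hadoop/input/localfile.txt copy.txt

# check replication and block information for a file
hdfs fsck /user/hadoop/input/localfile.txt -files -blocks

# remove a file
hdfs dfs -rm /user/hadoop/input/localfile.txt
```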
HDFS Read Write – In Detail..
Hands on - HDFS Java Programs for Read and Write..
A video showing the athletic power of quadcopters and the data they generate for processing
- https://www.youtube.com/watch?v=w2itwFJCgFQ
- https://www.youtube.com/watch?v=RCXGpEmFbOw
Big Data – Data Visualization Showcase 2
WorldBank Data
World Bank Demo Terminology
Terms
Fertility Rate - Number of live births per 1000 women between the ages of 15 and 44 years
Life Expectancy - Life expectancy equals the average number of years a person born in a given country is
expected to live if mortality rates at each age were to remain steady in the future
GDP Per Capita - GDP per capita is a measure of a country's economic output that accounts for population.
It divides the country's gross domestic product by its total population. That makes it the best measurement
of a country's standard of living. It tells you how prosperous a country feels to each of its citizens.
Why the Largest Economies Aren't the Richest per Capita - GDP per capita lets you compare the
prosperity of countries with different population sizes. For example, U.S. GDP was $18.56 trillion in 2016.
But one reason America's total output is so large is that it has so many people; it is the third most
populous country after China and India.
The United States must spread its wealth among 324 million people. As a result, its GDP per capita is only
$57,300. That makes it the 18th most prosperous country per person.
China has the largest GDP in the world, producing $21.2 trillion in 2016. But its GDP per capita was only
$15,400 because it has four times the number of people as the United States.
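The arithmetic behind the figures quoted above is just a division, which the demo data makes easy to check:

```python
def gdp_per_capita(gdp, population):
    """GDP per capita is simply total GDP divided by population."""
    return gdp / population

# Figures quoted above: U.S. GDP $18.56 trillion, population 324 million
us = gdp_per_capita(18.56e12, 324e6)   # ≈ $57,300 per person
```

The same function applied to China's roughly four-times-larger population is why its per-capita figure lands near $15,400 despite the larger total GDP.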