
Hadoop

An Overview

Cyrus Lentin
Introduction

Hadoop - Cyrus Lentin 1


What Is Big Data

▪ Big Data Is An Evolving Term That Describes Any Voluminous Amount Of Structured, Semi-Structured
And Unstructured Data That Has The Potential To Be Mined For Information
▪ Big Data Was Originally Characterized By 3 Vs:
• Volume: The Extreme Volume Of Data
• Variety: The Wide Variety Of Types Of Data
• Velocity: The Speed At Which The Data Is Generated
▪ Then We Talked Of 5 Vs, Adding:
• Veracity: The Trustworthiness (Or Uncertainty) Of Data
• Value: The Value Of Data
▪ Today We Talk Of 8 Vs, Adding:
• Visualization: Making Sense Of Data At A Glance
• Viscosity: Is The Data Useful Or Important Enough To Keep?
• Virality: Could The Data Go Viral Or Be Reused, For Example In A Post?
▪ Although Big Data Doesn't Refer To Any Specific Quantity, The Term Is Often Used When Speaking
About Petabytes (10^15 Bytes) And Exabytes (10^18 Bytes) Of Data, Roughly 10^9 To 10^12 Times The Size Of A Megabyte-Scale File
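To give a sense of the scale involved, the sketch below converts storage units to bytes and compares them with an "ordinary" file. It assumes decimal (SI) units and a hypothetical 1 MB file as the baseline; the names `UNITS`, `to_bytes` and `ratio` are illustrative only.

```python
# Decimal (SI) storage units expressed in bytes.
UNITS = {
    "KB": 10**3,
    "MB": 10**6,
    "GB": 10**9,
    "TB": 10**12,
    "PB": 10**15,
    "EB": 10**18,
}


def to_bytes(amount, unit):
    """Convert an amount in the given unit to bytes."""
    return amount * UNITS[unit]


def ratio(big_unit, small_unit):
    """Return how many times larger one unit is than another."""
    return UNITS[big_unit] // UNITS[small_unit]


if __name__ == "__main__":
    # Assumption: an "ordinary" file is about 1 MB.
    print("1 PB in bytes:", to_bytes(1, "PB"))
    print("1 PB / 1 MB:", ratio("PB", "MB"))
    print("1 EB / 1 MB:", ratio("EB", "MB"))
```

Under that assumption, a petabyte is a billion times the size of an ordinary file, and an exabyte is a trillion times its size.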

Business Analytics – Cyrus Lentin 2




Variety Of Data

▪ Structured
• Data containing a defined data type, format, structure
• Example: Transaction data and OLAP
▪ Semi-Structured
• Textual data files with a discernible pattern, enabling parsing
• Example: XML data files that are self-describing and defined by an XML schema
▪ "Quasi" Structured
• Textual data with erratic data formats, can be formatted with effort, tools, and time
• Example: Web clickstream data that may contain some inconsistencies in data values and formats
▪ Unstructured
• Data that has no inherent structure and is usually stored as different types of files
• Example: Text documents, PDFs, images and video
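The difference between these categories shows up in how much work it takes to read the data. The following is a minimal sketch, assuming made-up sample records; the field names (`id`, `amount`, `user`, `page`) and all function names are hypothetical and not taken from the slides.

```python
import csv
import io
import re
import xml.etree.ElementTree as ET


def parse_structured(text):
    """Structured: fixed columns with known types (CSV transaction rows)."""
    reader = csv.DictReader(io.StringIO(text))
    return [{"id": int(r["id"]), "amount": float(r["amount"])} for r in reader]


def parse_semi_structured(text):
    """Semi-structured: self-describing XML, the tags name the fields."""
    root = ET.fromstring(text)
    return [
        {"id": int(order.get("id")), "amount": float(order.findtext("amount"))}
        for order in root.iter("order")
    ]


# Quasi-structured: clickstream-like lines whose format varies from line to
# line, so a tolerant pattern is needed and some lines may still be skipped.
CLICK = re.compile(r"user=(?P<user>\w+)\W+(?:page|url)=(?P<page>\S+)", re.IGNORECASE)


def parse_quasi_structured(lines):
    """Return (user, page) pairs for the lines that match the pattern."""
    matches = (CLICK.search(line) for line in lines)
    return [(m.group("user"), m.group("page")) for m in matches if m]


def describe_unstructured(text):
    """Unstructured: no fields to extract, only generic measures such as length."""
    return {"characters": len(text), "words": len(text.split())}
```

Structured and semi-structured inputs can be read with standard parsers, the quasi-structured input needs a hand-written pattern, and the unstructured input yields only generic measures without further analysis.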



What Happens In An Internet Minute – 2013 / 2014!



What Happens In An Internet Minute – 2015!



What Happens In An Internet Minute – 2016!



What Happens In An Internet Minute – 2017!



What Happens In An Internet Minute – 2018!



Social Media Active Users





Key Enablers

▪ Availability Of Data
▪ Increase In Processing Power
▪ Increase In Storage Capabilities



Challenges

▪ The Problem Is Not Getting Big Data
▪ The Problem Is How To
• Store The Data
• Process The Data
• Analyze The Data



Hadoop



Solution

▪ Google introduced GFS (Google File System) and MapReduce
▪ Hadoop followed as an open-source implementation of both ideas: HDFS for storage and MapReduce (MR) for processing
▪ Hadoop is an Apache Software Foundation project
▪ Hadoop is used by Facebook, Yahoo, Google, Twitter, LinkedIn, Rackspace



Hadoop Trivia

▪ Creator – Doug Cutting


In December 2004, Google Labs published a paper on the MapReduce algorithm, which allows very large scale
computations to be trivially parallelized across large clusters of servers. Cutting, realizing the importance of this
paper to extending Lucene into the realm of extremely large search problems, created the open-source Hadoop
framework that allows applications based on the MapReduce paradigm to be run on large clusters of commodity
hardware. Cutting was an employee of Yahoo!, where he led the Hadoop project full-time; he has since moved on
to Cloudera.
In July 2009, Doug Cutting was elected to the board of directors of the Apache Software Foundation, and in
September 2010, he was elected its chairman.
“Hadoop can fill a myriad of different roles within an enterprise. The trick to getting real value from it, however, is
to start with just one task. If You Want To Succeed With Big Data, Start Small !!!” – Doug Cutting
▪ Name History
The name "Hadoop" was given by one of Doug Cutting's sons to that son's toy elephant. Doug used the name for
his open source project because it was easy to pronounce and to Google.


What Is Hadoop?

▪ an open-source framework of software
▪ for storage & large-scale processing
▪ of massive data-sets
▪ on clusters
▪ of commodity hardware



Hadoop As The Solution

▪ Storage -> Hadoop Distributed File System (HDFS)
▪ Process -> MapReduce paradigm
▪ Analyze -> Hive, Pig, Impala, etc.
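The MapReduce paradigm named above can be illustrated with the classic word-count task. This is a minimal single-process sketch in plain Python, not Hadoop's actual API: the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are hypothetical, and a real cluster would run the map and reduce steps in parallel on many nodes.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle and reduce in sequence over the input lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["big data", "Big Hadoop"]))
```

Because each call to `map_phase` depends on one line only, and each call to `reduce_phase` on one key only, the work can be split across machines, which is the property that makes the paradigm suitable for clusters.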



Hadoop for Scalability

▪ Hadoop can scale to many nodes (1,500–2,000) in a single cluster
▪ Adding nodes requires only configuration changes



What Hadoop Is Not

▪ It is not OLAP (online analytical processing); it is batch / offline oriented
▪ It is not a database



Hadoop Versions & Distributions

▪ Versions
• OS
• Hadoop
• Distribution
▪ Developed & maintained by
• Apache
▪ Packaged & distributed by
• Cloudera
• Hortonworks
• MapR



Hadoop Applications

▪ Predictive Analysis
▪ Sentiment Analysis
▪ Customer Intelligence
▪ Fraud & Security Intelligence
▪ High-Performance Analytics
▪ Risk Management
▪ Operational Analysis



To Whom Does Big Data Matter Most

▪ Any Organization / Entity That Has A Web Server Or An Information Logging Utility
▪ Utilize Its Logs To Find Out Information Relevant For Its Business
▪ Get Insights Into Data To Increase Revenue And / Or Decrease Costs
▪ Functions
• Marketing
• Operations
• Research
• Finance
▪ In Sectors Such As
• Healthcare
• Customer Service
• Insurance
• Banking
• Social Media
• IoT
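As an illustration of using web server logs to find business-relevant information, the sketch below counts requested pages and HTTP status codes. It assumes log lines in a common access-log layout with a quoted request followed by a status code; the pattern `LOG_LINE`, the function `summarize` and the sample lines are hypothetical.

```python
import re
from collections import Counter

# Matches the quoted request and the status code, e.g. "GET /home HTTP/1.1" 200
LOG_LINE = re.compile(r'"(?P<method>[A-Z]+) (?P<path>\S+) [^"]*" (?P<status>\d{3})')


def summarize(lines):
    """Count requests per path and per HTTP status, skipping unparsable lines."""
    paths, statuses = Counter(), Counter()
    for line in lines:
        match = LOG_LINE.search(line)
        if match is None:
            continue
        paths[match.group("path")] += 1
        statuses[match.group("status")] += 1
    return paths, statuses


if __name__ == "__main__":
    sample = [
        '10.0.0.1 - - [01/Jan/2018:00:00:01 +0000] "GET /home HTTP/1.1" 200 512',
        '10.0.0.2 - - [01/Jan/2018:00:00:02 +0000] "GET /cart HTTP/1.1" 500 0',
        '10.0.0.1 - - [01/Jan/2018:00:00:03 +0000] "GET /home HTTP/1.1" 200 512',
    ]
    paths, statuses = summarize(sample)
    print(paths.most_common(1))
    print(dict(statuses))
```

A single machine handles a few lines like these easily. The same counting becomes a storage and processing problem once the logs grow to the volumes discussed earlier.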



Thank you!
Contact:
Cyrus Lentin
cyrus@lentins.co.in
+91-98200-94236
