
Essentials of Big Data Programming

Introduction
Syllabus
UNIT I : INTRODUCTION TO BIG DATA AND HADOOP
Types of Digital Data : Structured, Semi-Structured and
Unstructured data. Characteristics of Data, Evolution of Big Data,
Why Big Data, What is Big Data Analytics, Big Data Challenges,
Features of Hadoop, Evolution of Hadoop, Introduction to
Hadoop Eco System.
UNIT II : HDFS (Hadoop Distributed File System)
The Design of HDFS, HDFS Concepts, Command Line Interface,
Hadoop File System Interfaces, Data Flow, Data Ingest with Flume
and Sqoop, Concepts of Hadoop I/O : Data Integrity,
Compression, Serialization and File-Based Data Structures.
UNIT III : Map Reduce
Anatomy of a Map Reduce Job Run, Failures, Job Scheduling,
Shuffle and Sort, Task Execution, Map Reduce Types and Formats,
Map Reduce Features.
Syllabus
Unit IV : Hadoop Eco System
Pig : Introduction to Pig, Execution Modes of Pig, Comparison of
Pig with Databases, Pig Latin, Data Processing Operators, and
User Defined Functions in Pig
Structured Data Queries with Hive : The Hive Command Line
Interface (CLI), Hive Query Language (HQL), Data Analysis with
Hive.
NoSQL Database HBase : CAP theorem, NoSQL Databases,
Column-Oriented Databases, Real Time analytics with HBase.
UNIT V : Apache Spark
In-Memory Computing with Spark : Spark Basics, Interactive Spark
Using PySpark, Writing Spark Applications with PySpark.
Scalable Machine Learning with Spark : Collaborative Filtering,
Classification, Clustering.
Text book :
The content of this presentation has been taken from the book
"Fundamentals of Business Analytics"
by R. N. Prasad and Seema Acharya,
published by Wiley India Pvt. Ltd.
Data
• Data is a collection of facts.
• Data is precious to any organization .. Why?
– Data is present internal as well as external to an
organization .. How?
– Data comes from homogeneous as well as
heterogeneous sources.
Digital Data
Digital data is data that is stored in digital form; it may be
in the form of a picture, a document, a video, etc.
Digital data can be classified into three forms:
• Structured
• Semi-Structured
• Unstructured
Usually, data is in an unstructured format, which makes
extracting information from it difficult.
• According to Merrill Lynch, 80–90% of business data is
either unstructured or semi-structured.
• Gartner also estimates that unstructured data constitutes
80% of all enterprise data.
Formats of Digital Data
[Figure: percentage distribution of the three forms of digital data]
Data Forms Defined
Unstructured data : This is data which does not conform to a
data model or is not in a form which can be used easily by a
computer program. About 80–90% of an organization's data is in
this format; for example, memos, chat rooms, PowerPoint
presentations, images, videos, letters, research papers, white papers,
the body of an email, etc.
Semi-structured data : This is data which does not conform to a
data model but has some structure. However, it is not in a form
which can be used easily by a computer program; for example,
emails, XML, markup languages like HTML, etc. Metadata for this
data is available but is not sufficient.
Structured data : This is data which is in an organized form
(e.g., in rows and columns) and can be easily used by a computer
program. Relationships exist between entities of data, such as
classes and their objects. Data stored in databases is an example of
structured data.
Structured Data
• Structured Data
– Data is in an organized form
– Data is stored in databases
– Stored in rows and columns
– Easily accessed by a program (see the sketch below)
• Insert/update/delete
• Security
• Indexing
• Transaction Processing
• Scalability
– Data collected from day-to-day business activities
– Sources
• Databases such as Oracle (Oracle Corp.), DB2 (IBM),
Teradata, SQL Server (Microsoft), open-source MySQL and
PostgreSQL, Greenplum (EMC), etc.
• Spreadsheets
• OLTP systems
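To make "easily accessed by a program" concrete, here is a minimal, illustrative Python sketch using only the standard library's sqlite3 module; the table and values are made up for the example:

    import sqlite3

    # An in-memory relational table: structured data stored in rows and columns
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, city TEXT)")
    conn.execute("INSERT INTO customers VALUES (1, 'Asha', 'Chennai')")  # insert
    conn.execute("UPDATE customers SET city = 'Delhi' WHERE id = 1")     # update
    conn.execute("CREATE INDEX idx_city ON customers (city)")            # indexing
    conn.commit()                                                        # transaction commit
    for row in conn.execute("SELECT * FROM customers"):
        print(row)  # (1, 'Asha', 'Delhi')

Because the schema is fixed and known in advance, each of these operations is a single statement; this ease of programmatic access is what distinguishes structured data.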
Semi-structured Data
• Referred to as having a self-describing structure.
• Does not conform to a data model but has some structure.
• Uses tags to segregate semantic elements.
– Tags are used to enforce hierarchies of records and fields within the data.
• No separation between data and schema; schema
information is often blended with the data values.
• Data objects may have different attributes not known
beforehand.
• Not in a form which is easily usable by a computer program.
– Examples: emails, XML, HTML, JSON, etc.
Where does Semi-structured Data Come from?
• E-mail
• XML
• TCP/IP packets
• Zipped files
• Binary executables
• Mark-up languages
• Integration of data from heterogeneous sources
Semi - Structured
• Sample JSON document:
{
  "_id": 9,
  "BookTitle": "Fundamentals of BDA",
  "AuthorName": "Seema Acharya",
  "Publisher": "Wiley India",
  "YearofPublication": "2001"
}
• Sample XML document:
<breakfast_menu>
  <food>
    <name>Belgian Waffles</name>
    <price>$5.95</price>
    <description>
      Two of our famous Belgian Waffles with plenty of real maple syrup
    </description>
    <calories>650</calories>
  </food>
  <food>
    <name>Dosa</name>
    <price>$5</price>
  </food>
</breakfast_menu>
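Because the keys and tags make the structure self-describing, a program can navigate such documents without a predefined schema. Here is a minimal Python sketch (standard library only) that parses documents shaped like the two samples above:

    import json
    import xml.etree.ElementTree as ET

    # Parse the JSON book record; the keys double as a self-describing schema
    book = json.loads('{"BookTitle": "Fundamentals of BDA", '
                      '"Publisher": "Wiley India", "YearofPublication": "2001"}')
    print(book["BookTitle"])  # Fundamentals of BDA

    # Parse the XML menu; tags segregate the semantic elements
    menu = ET.fromstring(
        "<breakfast_menu><food><name>Belgian Waffles</name>"
        "<price>$5.95</price></food></breakfast_menu>")
    for food in menu.findall("food"):
        print(food.find("name").text, food.find("price").text)

Note that nothing outside the document tells the program what fields exist; the schema information is blended with the data values, which is exactly what makes this data semi-structured.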
Semi-Structured
• Some types of data that appear to be unstructured but are actually
semi-structured include:
• Text : XML, email, or electronic data interchange (EDI) messages.
– These lack formal structure but do contain tags or a known structure that
separate semantic elements. Most social media sources, a hot topic for
analysis today, fall in this category. Facebook, Twitter, and others offer
data access through an application programming interface (API).
• Web Server Logs and Search Patterns :
– An individual's journey through a web site, whether searching, consuming
content, or shopping, is recorded in detail in electronic web server logs.
• Sensor Data :
– There is a huge explosion in the number of sensors producing streams of
data all around us. In addition to monitoring mechanical systems, sensors
increasingly monitor consumer behavior. In-store sensors monitor
consumer shopping behavior, and your cell phone puts out a constant
stream of signals that are captured for location-based marketing.
• We have been refining our use of structured data for the past 10 or 20 years.
Opportunity lies in understanding how adding unstructured and semi-
structured data to the mix creates competitive advantage.
Un-structured
• Data does not conform to a structure.
• Cannot easily be used by a computer program.
• 80–90% of an organization's data is in this format.
– Examples: web pages, images, free-form text, audio, video, bodies of
emails, text messages, memos, chats, social media data,
letters, research/white papers.
• Dealing with unstructured data involves techniques such as:
– Data Mining : Association Rule Mining, Regression,
Collaborative Filtering
– Text Analytics/Text Mining
– Natural Language Processing
– Noisy Text Analytics
– Manual Tagging with Metadata
Where does Unstructured Data Come from?
• Web pages
• Memos
• Videos (MPEG, etc.)
• Images (JPEG, GIF, etc.)
• Body of an e-mail
• Word documents
• PowerPoint presentations
• Chats
• Reports
• Whitepapers
• Surveys
Where does Unstructured Data Come from?
Broadly speaking, anything in a non-database form is
unstructured data. It can be classified into two broad categories:
▪ Bitmap Objects : For example, image, video, or audio files.
▪ Textual Objects : For example, Microsoft Word documents,
emails, or Microsoft Excel spreadsheets.
Even though email messages are organized in databases such as
Microsoft Exchange or Lotus Notes, the body of an email is
essentially raw data, i.e., free-form text without any structure.
Unstructured Data – Getting to Know
Dr. Ben, Dr. Stanley, and Dr. Mark work at the medical facility of “GoodLife”. Over the past
few days, Dr. Ben and Dr. Stanley had been exchanging long emails about a particular case
of intestinal problem. Dr. Stanley has chanced upon a particular combination of drugs that
has cured gastro-intestinal disorders in his patients. He has written an email about this
combination of drugs to Dr. Ben.
Dr. Mark has a patient in the “GoodLife” emergency unit with quite a similar case of gastro-
intestinal disorder whose cure Dr. Stanley has chanced upon. Dr. Mark has already tried
regular drugs but with no positive results so far. He quickly searches the organization's
database for answers, but with no luck. The information he wants is tucked away in the
email conversation between two other “GoodLife” doctors, Dr. Ben and Dr. Stanley. Dr.
Mark would have accessed the solution with a few mouse clicks had the storage and analysis
of unstructured data been undertaken by "GoodLife".
As is the case at "GoodLife", 80–85% of data in any organization is unstructured, and it is
growing at an alarming rate. An enormous amount of knowledge is buried in this data. In the
above scenario, Dr. Stanley's email to Dr. Ben had not been captured in the medical system
because it was in an unstructured format.
Unstructured data, thus, is data which cannot be stored in the form of rows and columns as in a
database and does not conform to any data model, i.e., it is difficult to determine the
meaning of the data. It does not follow any rules or semantics. It can be of any type and is
hence unpredictable.
Characteristics of Data
• Composition of Data : Deals with structure
– Sources of data
– Types of data
– Nature of data (static / real-time streaming)
– Granularity
• Condition of Data : Can we use the data as-is, or does it
require preprocessing for analysis?
• Context of Data : Deals with the purpose of data
generation
– Where was the data generated?
– Why was the data generated?
– How sensitive is the data?
– What events are associated with the data? And so on.
What is BIG DATA? How is it different ?
• Big data is the term for a collection of datasets so large
and complex that it becomes difficult to process using
on-hand database management tools or traditional data
processing applications.
• The challenges include capture, curation, storage,
search, sharing, transfer, analysis, and visualization.
• The trend to larger data sets is due to the additional
information derivable from analysis of a single large set
of related data, as compared to separate smaller sets
with the same total amount of data, allowing
correlations to be found to “spot business trends,
determine quality of research, prevent diseases, link
legal citations, combat crime, and determine real time
roadway traffic conditions”.
Why Big Data
More Data
→ More Accurate Analysis
→ Greater Confidence in Decision Making
→ Greater Operational Efficiencies for:
▪ Cost Reduction
▪ Time Reduction
▪ New Product Development
▪ Optimized Offerings
Facts and Figures
Walmart handles 1 million customer transactions/hour.
Facebook handles 40 billion photos from its user base!
Facebook inserts 500 terabytes of new data every day.
Facebook stores, accesses, and analyzes 30 petabytes of user-
generated data.
A flight generates 240 terabytes of flight data in 6-8 hours of flight.
More than 5 billion people are calling, texting, tweeting and
browsing on mobile phones worldwide.
Decoding the human genome originally took 10 years to process;
now it can be achieved in one week.
The largest AT&T database boasts titles including the largest volume
of data in one unique database (312 terabytes) and the second
largest number of rows in a unique database (1.9 trillion), which
comprises AT&T's extensive calling records.
An Insight
• 2.5 petabytes
– Memory capacity of the human brain
• 13 petabytes
– Amount that could be downloaded from the Internet in two minutes
if every American (300M) got on a computer at the same time
• 463 exabytes
– Amount of data expected to be generated every day by people as of 2025
• 4.75 exabytes
– Total genome sequences of all people on the Earth
• 422 exabytes
– Total digital data created in 2008
• 59 zettabytes
– World's current (2020) digital storage capacity
• 180 zettabytes
– Total digital data expected to be created in 2025
What’s Making so much Data?
• Sources : people, machines, organizations
• Ubiquitous computing
• More people carrying data-generating devices
(mobile phones with Facebook, GPS, cameras,
etc.)
• Data on the Internet, web server logs and Internet
clickstream data – Log Data
• Social media content and social network activity
reports, Facebook data – Social Media Data
• Machine data captured by sensors connected to
the Internet of Things – Sensor Data
• Audio, video, images, podcasts, etc. – Media Data
• Text from customer emails and survey responses
– Text Data
Sources of Data Generation
Big Data In Use
• The first organizations to embrace it were online and startup firms
like Google, eBay, LinkedIn, and Facebook.
• Facebook handles 40 billion photos from its user base.
• Walmart handles more than 1 million customer transactions every hour.
• Decoding the human genome originally took 10 years to process; now it can
be achieved in one week.
Some examples of Big Data
Commercial :
• Web / event / database logs
• "Digital exhaust" (result of human interaction with the Internet)
• Sensor networks
• RFID
• Internet text and documents
• Internet search indexing
• Call detail records (CDR)
• Medical records
• Photographic archives
• Video / audio archives
• Large-scale eCommerce
Government :
• Regular government business & commerce needs
• Military & homeland security surveillance
Science :
• Astronomy
• Atmospheric science
• Genomics
• Biogeochemical and biological research
• ... and other complex and/or interdisciplinary scientific research
Social :
• Social networks
• Social data
– Person to person (P2P, C2C): Wish Lists on Amazon.com, Craig's List
– Person to world (P2W, C2W): Twitter, Facebook, LinkedIn
Evolution of Big Data
• 1970s and before – Data Generation and Storage :
Primitive & structured data; basic data storage in files
on mainframes.
• 1980s-1990s – Data Utilization : Complex and
relational data; RDBMS and data-intensive
applications.
• 2000s and beyond – Data Driven : Complex and
unstructured data; structured, unstructured,
multimedia, etc.
Database
• Used for storing data from one or a limited
number of applications or sources.
• Pros: Processing digital transactions,
established technology
• Cons: Reporting, visualization, and analysis
cannot be performed across a very large
integrated set of data sources and streams
Data warehouse
• Used for aggregating data from many different
data sources, and making that data available for
visualization, reporting, and analysis. Purpose-
built for analysis.
• Pros: Better support for reporting, analysis, big
data, data retrieval, and visualization, designed to
store data from any number of data sources
• Cons: Costly compared to a single database,
preparation/configuration of data prior to
ingestion, (for cloud data warehouses) less
control over access and security configuration
Where is the Problem?
• Traditional RDBMS queries aren't sufficient to get
useful information out of the huge volume of
data.
• Searching it with traditional tools to find out if a
particular topic was trending would take so long
that the result would be meaningless by the time
it was computed.
• Big data technologies store this data in novel
ways to make it more accessible, and provide
new methods of performing analysis on it.
The Dimensions of Big Data
• Volume: Large volumes of data
• Velocity: Quickly moving data
• Variety: structured, unstructured, semi-
structured.
• Veracity: Trust and integrity of data are a challenge
and are as important for big data as for traditional
relational DBs
Big Data – 3V’s
Volume : Enterprises are awash with ever-growing data of
all types, easily amassing terabytes, even petabytes, of
information.
Data volume is increasing exponentially:
• 44x increase from 2009 to 2020, from 0.8 zettabytes to 35 zettabytes
• Turn 12 terabytes of Tweets created each day into
improved product sentiment analysis.
• Convert 350 billion annual meter readings to better
predict power consumption.
• Sensors embedded into everyday objects enable
better-monitored activities.
Velocity (speed)
Velocity : Sometimes 2 minutes is too late. For time-
sensitive processes such as catching fraud, big data must
be used as it streams into your enterprise in order to
maximize its value.
• Scrutinize 5 million trade events created each day to identify
potential fraud.
• Analyze 500 million daily call detail records in real time to
predict customer churn faster.
• Data is being generated fast and needs to be processed fast.
• Online Data Analytics : late decisions → missed opportunities
• Examples
• E-Promotions : Based on your current location, your purchase history and
what you like → send promotions right now for the store next to you
• Healthcare monitoring : Sensors monitoring your activities and body →
any abnormal measurement requires immediate reaction.
Real-time/Fast Data
• Progress and innovation are no longer hindered by the
ability to collect data,
• but by the ability to manage, analyze, summarize, visualize,
and discover knowledge from the collected data in a timely
manner and in a scalable fashion.
Real-Time Analytics/Decision Requirement
Analyzing consumer behaviour in real time enables:
• Learning why customers switch to competitors and their
offers, in time to counter.
• Product recommendations that are relevant & compelling.
• Friend invitations to join a game or activity that expands
business.
• Improving the marketing effectiveness of a promotion while
it is still in play.
• Preventing fraud as it is occurring & preventing more
proactively.
Variety (Complexity)
Variety : Big data comes as any type of data
• Structured Data (example: tabular data)
• Unstructured Data : text, sensor data, audio, video
• Semi-structured Data : web data, log files
• Relational Data (Tables / Transactions / Legacy Data)
• Text Data (Web)
• Semi-structured Data (XML)
• Graph Data
• Social networks, Semantic Web (RDF), ...
• Streaming Data
• You can only scan the data once
• A single application can generate/collect many types of data
• Big Public Data (online, weather, finance, etc.)
• To extract knowledge → all these types of data need to be linked
together
The 3 Big V’s (+1) • Plus many more
• Big 3V’s • Veracity
• Validity
• Volume
• Variability
• Velocity • Viscosity & Volatility
• Variety • Viability,
• Plus 1 • Venue,
• Veracity • Vocabulary, Vagueness, ….
V’s
• Valence refers to the connectedness of big data. Such as in
the form of graph networks

• Validity : Accuracy and correctness of the data relative to a


particular use. Example: Gauging storm intensity From
Satellite imagery vs social media posts
Prediction quality vs human impact
• Veracity refers to the biases, noise and abnormality in data,
trustworthiness of data.
• 1 in 3 business leaders don’t trust the information they use to
make decisions.
• How can you act upon information if you don’t trust it?
• Establishing trust in big data presents a huge challenge as the variety
and number of sources grows.
Other V’s
• Variability : How the meaning of the data changes
over time
• Language evolution
• Data availability
• Sampling processes
• Changes in characteristics of the data source.

Both related to velocity


Viscosity: data velocity relative to timescale of event
being studied
Volatility: rate of data loss and stable lifetime of data
Scientific data often has practically unlimited lifespan, but
social / business data may evaporate in finite time
More V’s
Viability
Which data has meaningful relations to questions of
interest?
Venue
Where does the data live and how do you get it?
Vocabulary
Metadata describing structure, content &
provenance
Schemas, semantics, ontologies, taxonomies,
vocabularies
Vagueness
Confusion about what “Big Data” means
Big Data V’s

Value

(Quality)
Big Data Definition
• Big data is high-volume, high-velocity and high-variety
information assets that demand cost-effective,
innovative forms of information processing for
enhanced insight and decision making – Gartner
• Big data is a collection of data sets so large and complex
that it becomes difficult to process using on-hand
database management tools. The challenges include
Capture, Curation, Storage, Search, Sharing, Analysis,
and Visualization.
• Big data is the realization of greater business
intelligence by storing, processing, and analyzing data
that was previously ignored due to the limitations of
traditional data management technologies
Big Data Definition - Gartner
• Big Data is high-volume, high-velocity and high-
variety
– Voluminous, varied data requires speed for storage,
preparation, processing and analysis
• Big Data demands cost-effective, innovative forms of
information processing
– (refers to technologies to capture, store, process,
persist, integrate and visualize big data)
• Big Data is for enhanced insights and decision
making
– Deriving deeper, richer and more meaningful insights
• Data → Information → Actionable Intelligence →
Better Decisions → Enhanced Business Value
Big Data Analytics
• Big data analytics is the process of examining large data
sets containing a variety of data types – i.e., big data –
to uncover hidden patterns, unknown correlations,
market trends, customer preferences and other useful
business information.
• The analytical findings can lead to more effective
marketing, new revenue opportunities, better
customer service, improved operational efficiency,
competitive advantages over rival organizations and
other business benefits.
• The primary goal of big data analytics is to help
companies make more informed business decisions
Analytics
• Basic Analytics : Slicing and dicing of data for
basic business insights, e.g., reporting on
historical data, visualization, etc.
• Operationalized Analytics : Analytics woven
into the enterprise's business processes.
• Advanced Analytics : Forecasting the
future by way of predictive and prescriptive
modeling.
• Monetized Analytics : Deriving direct
business revenue from analytics.
Analytics 1.0, 2.0 and 3.0
• Descriptive Analytics : What happened? (Hindsight)
• Diagnostic Analytics : Why did it happen? (Insight)
• Predictive Analytics : What will happen? (Foresight)
• Prescriptive Analytics : How can we make it happen?
Analytics 1.0
Era : 1950s to 2009
• Descriptive
• What happened ?
• Data from CRM, ERP and 3rd-party
applications
• Small and structured data sources, EDW or
data marts
• Data was Internally Sourced
• RDBMS
Analytics 2.0
• Era : 2005 to 2012
• Descriptive + Predictive
• What will happen? Why will it happen?
• Big data
• Data mainly unstructured and arriving with velocity
• Data often stored and processed in parallel
• Data often externally sourced
• Database appliances, Hadoop clusters, SQL-on-
Hadoop environments
Analytics 3.0
• Era : 2012 onwards
• Descriptive + Predictive + Prescriptive
• What will happen, when will it happen, why will it
happen, and what action should be taken to take
advantage of what will happen?
• A blend of big data, legacy systems, CRM, ERP and 3rd-
party tools
• A blend of big data and traditional analytics to yield
insights and offerings with speed and impact
• Data is both internally and externally sourced
• In-memory analytics, in-database processing, agile
analytics methods, machine learning techniques, etc.
Categories of Big Data Analytics
• Descriptive Analytics: These tools tell companies what happened.
They create simple reports and visualizations that show what
occurred at a particular point in time or over a period of time.
These are the least advanced analytics tools.
• Diagnostic Analytics: Diagnostic tools explain why something
happened. More advanced than descriptive reporting tools, they
allow analysts to dive deep into the data and determine root causes
for a given situation.
• Predictive Analytics: Among the most popular big data analytics
tools available today, predictive analytics tools use highly advanced
algorithms to forecast what might happen next. Often these tools
make use of artificial intelligence and machine learning technology.
• Prescriptive Analytics: A step above predictive analytics,
prescriptive analytics tell organizations what they should do in
order to achieve a desired result. These tools require very advanced
machine learning capabilities, and few solutions on the market
today offer true prescriptive capabilities.
Where is Big Data Analytics Used?
• Smarter Healthcare
• Multi-channel Sales
• Homeland Security
• Telecom
• Trading Analytics
• Traffic Control
• Search Quality
• Manufacturing
Benefits of Big Data Analytics
• Business Transformation : Executives believe that big data analytics
offers tremendous potential to revolutionize their organizations. The
collection and analysis of big data could fundamentally change the
way their companies do business.
• Competitive Advantage : In the MIT Sloan Management Review
Research Report Analytics as a Source of Business Innovation, 57
percent of enterprises surveyed said their use of analytics was
helping them achieve competitive advantage.
• Innovation : Big data analytics can help companies develop
products and services that appeal to their customers, as well as
helping them identify new opportunities for revenue generation.
Also in the MIT Sloan Management survey, 68 percent of
respondents agreed that analytics has helped their company
innovate.
• Lower Costs : In the NewVantage Partners Big Data Executive Survey
2017, 49.2 percent of companies surveyed said that they had
successfully decreased expenses as a result of a big data project.
Benefits of Big Data Analytics
• Improved Customer Service : Organizations often use big data
analytics to examine social media, customer service, sales and
marketing data. This can help them better gauge customer
sentiment and respond to customers in real time.
• Increased Security : Another key area for big data analytics is IT
security. Security software creates an enormous amount of log
data. By applying big data analytics techniques to this data,
organizations can sometimes identify and thwart cyberattacks that
would otherwise have gone unnoticed.
Big Data Use Cases
• 360° View of the Customer :
– Many enterprises use big data to build a dashboard
application that provides a 360° view of the customer.
These dashboards pull together data from a variety of
internal and external sources, analyze it and present it to
customer service, sales and/or marketing personnel in a
way that helps them do their jobs.
• Fraud Prevention
– Thanks to big data analytics and machine learning, today’s
fraud prevention systems are orders of magnitude better
at detecting criminal activity and preventing false
positives.
– For example, if a credit card were used to rent a car in
Chennai, but the customer lived in Delhi, a customer
service agent might call to confirm that the cardholder was
on vacation and that someone hadn't stolen the card.
Big Data Use Cases
• Data Warehouse Offload
– It is common for organizations to have a data warehouse to
facilitate their business intelligence (BI) efforts. DW
technology tends to be very costly to purchase and run.
– Companies have begun using big data tools to remove some of
the burden from their data warehouses, replacing or
complementing the DW with Hadoop. Hadoop-based
solutions often provide much faster performance while
reducing licensing fees and other costs.
• Price Optimization
– Both business-to-consumer (B2C) and business-to-
business (B2B) enterprises use big data analytics to optimize
their prices. They use their big data solutions to segment their
customer base and build models that show how much
different types of customers will be willing to pay under
different circumstances.
The Structure of Big Data
❖ Structured : Most traditional data sources
❖ Semi-structured : Many sources of big data
❖ Unstructured : Video data, audio data
Big Data Challenges
• Capture
• Curation
• Storage
• Search
• Sharing
• Transfer
• Analysis and
• Visualization
Big Data Challenges
• Dealing with Data Growth (Scale) : Addressing
the storage needs that can best withstand the
onslaught of the high volume, velocity and variety
of big data is a challenge.
• Security : Security is a big concern for
organizations with big data stores. Most
NoSQL big data platforms have poor security
mechanisms, and some big data stores can be
attractive targets for hackers or advanced
persistent threats (APTs).
• Schema : Rigid schemas have no place; the
technology should fit the big data, not vice
versa. Hence, support for dynamic schemas is
another challenge.
Big Data Challenges
• Generating insights in a timely manner : The
most common goals associated with big data
projects include the following:
– Decreasing expenses through operational cost
efficiencies
– Establishing a data-driven culture
– Creating new avenues for innovation and disruption
– Accelerating the speed with which new capabilities
and services are deployed
– Launching new product and service offerings
Big Data Challenges
• Partition Tolerance : How to build partition-tolerant
systems that can take care of both hardware and
software failures.
• Continuous Availability of Data : How to provide
24/7 support.
• Data Quality : How to maintain data quality in
terms of data accuracy, completeness and
timeliness – data with veracity.
India – Big Data
• Gaining traction
• Huge market opportunities for IT services (82.9% of
revenues) and analytics firms (17.1%)
• The global big data and business analytics market
was valued at 169 billion U.S. dollars in 2018 and is
expected to grow to 274 billion U.S. dollars in 2022.
• The opportunity for Indian service providers lies in
offering services around big data implementation
and analytics for global multinationals.
What is Hadoop?
• Apache open-source software framework for reliable, scalable,
distributed computing of massive amounts of data
– Developed in Java
– Hides underlying system details and complexities from the user
– Well-suited for batch-oriented, read-intensive applications
• A flexible and highly available architecture for large-scale
computation and data processing on a network of commodity
hardware
• Enables applications to work with thousands of nodes and
petabytes of data in a highly parallel, cost-effective manner
– CPU + disks = "node"; nodes can be combined into clusters
– New nodes can be added as needed
Hadoop : Driving Principles
• It has two basic parts:
– HDFS
– MapReduce
• The Hadoop Distributed File System (HDFS) is the storage
system of Hadoop, which splits big data and
distributes it across many nodes in a cluster
– Scaling out of H/W resources
– Fault tolerant
• MapReduce : A programming model for processing
large data sets with a parallel, distributed algorithm
on a cluster of computers; it simplifies parallel
programming
– Map → apply()
– Reduce → summarize()
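To make the Map → apply() and Reduce → summarize() analogy concrete, here is a minimal single-machine Python sketch of the MapReduce word-count pattern. It is illustrative only: a real Hadoop job is written against the MapReduce API, and the framework performs the shuffle step across the cluster.

    from collections import defaultdict

    # Map phase: emit a (word, 1) pair for every word in every input line
    def map_phase(lines):
        for line in lines:
            for word in line.split():
                yield (word.lower(), 1)

    # Shuffle phase: group values by key (Hadoop does this between map and reduce)
    def shuffle_phase(pairs):
        groups = defaultdict(list)
        for key, value in pairs:
            groups[key].append(value)
        return groups

    # Reduce phase: summarize the grouped values for each key
    def reduce_phase(groups):
        return {word: sum(counts) for word, counts in groups.items()}

    lines = ["big data needs big storage", "big data needs fast processing"]
    print(reduce_phase(shuffle_phase(map_phase(lines))))
    # {'big': 3, 'data': 2, 'needs': 2, 'storage': 1, 'fast': 1, 'processing': 1}

Because each map call and each reduce call is independent, the framework can run thousands of them in parallel across the nodes of a cluster.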
Hadoop Evolution
• Hadoop was created by Doug Cutting and Mike Cafarella. Doug
Cutting is the creator of Apache Lucene, the widely used text
search library.
• Hadoop has its origins in Apache Nutch, an open source web
search engine and a part of the Lucene project.
• Nutch, started in 2002, was a web crawler and search system
meant to index billions of web pages. However, Mike Cafarella and
Doug Cutting realized that their architecture would not scale
to the billions of pages on the Web.
• A paper published by Google in 2003 described the architecture
of the Google File System (GFS), which would free up time being
spent on administrative tasks such as managing storage
nodes. They realized that GFS could solve their problem of
storing the very large files generated by web
crawling and indexing.
Hadoop Evolution Conti..
• In 2004, they set about writing an open source implementation,
the Nutch Distributed Filesystem (NDFS).
• In 2004, Google published a paper on MapReduce, a
solution for processing large datasets.
• Early in 2005, the Nutch developers had a working MapReduce
implementation in Nutch, and by the middle of that year all the
major Nutch algorithms had been ported to run using Map Reduce
and NDFS.
• In 2005, they realized that the engineering task in the Nutch
project was much bigger than they had thought. Doug Cutting
went in search of a company interested in investing in their
efforts and joined Yahoo! in 2006.
• In 2006, Hadoop was formed as an independent subproject of
Lucene, and Cutting continued the GFS and MapReduce work
under Hadoop.
Hadoop Evolution Conti..
• In January 2008, Hadoop was made its own top-level project at
Apache, confirming its success and its diverse, active community.
By this time, Hadoop was being used by many other companies
besides Yahoo!, such as Last.fm, Facebook, and the New
York Times. Yahoo Released Hadoop to ASF (Apache Software
Foundation).
• In July 2008, ASF successfully tested a 4000 node cluster with
Hadoop.
• In 2009, Hadoop was successfully tested to sort a PB (PetaByte) of
data in less than 17 hours for handling billions of searches and
indexing millions of web pages.
• Doug Cutting left Yahoo! and joined Cloudera to take on the
challenge of spreading Hadoop to other industries.
• In Dec. 2011, ASF released Apache Hadoop version 1.0. Later, in
Aug. 2013, came version 2.0.6. Currently, we have Apache Hadoop
version 3.0, which was released in December 2017.
Features
• Open Source : Free to use. Its source code is available for inspection,
modification, and analysis, which allows enterprises to modify the code as
per their requirements.
• Cost-effective : Hadoop uses commodity hardware, which provides a
cost-efficient model, unlike traditional relational databases that require
expensive hardware and high-end processors to deal with big data.
• Scalable : Nodes can be added to a Hadoop cluster on the fly, making it a
scalable framework.
• Fault Tolerant : Data is replicated by default in a Hadoop cluster. In case
of a crash, the data can still be read from the other DataNodes where the
data is replicated (by default, 3 copies are kept).
• Flexible : It can process any kind of data (SQL, CSV, XML, JSON, audio,
video), which makes it highly flexible.
• Fast Processing : Hadoop stores data in a distributed fashion, which
allows data to be processed in parallel on a cluster of nodes. This
gives the Hadoop framework lightning-fast processing capability.
Hadoop Ecosystems
• Hadoop was initially introduced in 2007 as an open
source implementation of the MapReduce
processing engine linked with a distributed file
system.
• It has since grown into a vast list of projects related
to every step of a big data workflow, including data
collection, storage, processing, and much more.
• For this reason, we often hear reference to
the Hadoop Ecosystem instead, which contains these
related projects and products.
Hadoop Ecosystem
• HDFS – Distributed File System
– A file system designed for storing very large files with streaming data
access patterns, running on clusters of commodity (simple)
hardware.
• MapReduce – Distributed Processing
– A programming model for processing large data sets with a parallel,
distributed algorithm on a cluster.
• Spark – Computing Framework
– Provides an alternative to Hadoop MapReduce and offers
performance up to 10 times faster.
Hadoop Ecosystem
• The processing layer is where the actual analysis takes place.
• PIG (Data Flow) – Data Processing
– A high-level language that generates MapReduce programs.
– Framework for analyzing large unstructured and semi-structured data on
top of Hadoop.
– Significantly reduces the amount of code the programmer has to write.
• HIVE (Data Warehouse) – Data Processing
– A data warehouse system for Hadoop.
– Typically manages and queries structured data built on top of Hadoop.
– Uses an SQL-like language (HiveQL) to manage data.
– Can be used to do ad-hoc queries, summarization and data analysis
(see the sketch after this list).
• Flume : Data Ingestion
– Flume is a distributed, reliable service for efficiently collecting,
aggregating, and moving large amounts of log data to a centralized
location.
• Sqoop : Data Ingestion
– Transfers bulk data between HDFS and RDBMS.
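As a hedged illustration of querying structured data with SQL on Hadoop, here is a minimal PySpark sketch (PySpark is covered in Unit V) that runs a HiveQL-style query through a Hive-enabled Spark session; the web_logs table and its columns are hypothetical:

    from pyspark.sql import SparkSession

    # Start a Spark session with Hive support so existing Hive tables are visible
    spark = (SparkSession.builder
             .appName("hive-demo")
             .enableHiveSupport()
             .getOrCreate())

    # Ad-hoc HiveQL-style summarization; 'web_logs' is a made-up table name
    result = spark.sql("""
        SELECT status, COUNT(*) AS hits
        FROM web_logs
        GROUP BY status
        ORDER BY hits DESC
    """)
    result.show()

The point of Hive (and of SQL-on-Hadoop generally) is exactly this: the analyst writes a declarative query, and the engine turns it into distributed jobs over data stored in HDFS.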
Hadoop Ecosystems
• HBase (Distributed Table Store) – Database
– Non-relational, distributed database
– NoSQL column-oriented database for Hadoop
– Provides random read/write operations
– Stores billions of rows and millions of columns
Hadoop Eco System
• Ambari : A web-based tool for provisioning, managing, and
monitoring Apache Hadoop clusters which includes support for
Hadoop HDFS, Hadoop MapReduce, Hive, HCatalog, HBase,
ZooKeeper, Oozie, Pig, and Sqoop.
• Zookeeper : A centralized service for maintaining configuration
information, naming, providing distributed synchronization, and
providing group services.
• Oozie :
– A Workflow Scheduler for Hadoop.
– For complex workflows which require multiple jobs and tools, it
specifies a sequence of actions and coordinates between them to
complete the tasks.
– It also facilitates scheduling of jobs which need to run on regular
intervals.
Hadoop Ecosystem - Machine Learning
• Mahout : A scalable machine learning and data mining library.
– Support for four use cases:
• Classification : Learning to assign a category (or predict a value) for
new data from labeled examples, e.g., estimating the selling price of
a house based on the pricing of other houses for sale in the
neighborhood.
• Clustering : Grouping related objects. Google News collects
tens of thousands of news stories and automatically clusters them
together, so that news stories with the same content are
displayed together.
• Association : Also referred to as Market-Basket Analysis. Used to
determine co-occurrence relationships among activities performed
by individuals or groups.
• Recommendation Systems : On Amazon/Flipkart, the moment we
add Book-A to the shopping cart, a popup appears with a
recommendation stating that customers who bought Book-A also
bought Book-B (see the sketch below).
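To illustrate the co-occurrence idea behind Market-Basket Analysis and the "customers who bought Book-A also bought Book-B" recommendation, here is a minimal single-machine Python sketch; Mahout performs this kind of computation at scale on Hadoop, and the baskets below are made-up data:

    from collections import defaultdict
    from itertools import combinations

    # Made-up purchase baskets; each set is one customer's order
    baskets = [
        {"Book-A", "Book-B"},
        {"Book-A", "Book-B", "Book-C"},
        {"Book-A", "Book-C"},
        {"Book-A", "Book-B"},
        {"Book-B", "Book-D"},
    ]

    # Count how often each pair of items is bought together
    co_counts = defaultdict(int)
    for basket in baskets:
        for x, y in combinations(sorted(basket), 2):
            co_counts[(x, y)] += 1

    def recommend(item):
        # Rank the items most often co-purchased with 'item'
        scores = defaultdict(int)
        for (x, y), n in co_counts.items():
            if x == item:
                scores[y] += n
            elif y == item:
                scores[x] += n
        return sorted(scores, key=scores.get, reverse=True)

    print(recommend("Book-A"))  # ['Book-B', 'Book-C']: Book-B co-occurs most often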
Competing Hadoop Distribution Vendors
Focus Mainly on Infrastructure
Cloudera
– “Cloudera makes it easy to run open source Hadoop in production”
– “Focus on deriving business value from all your data instead of worrying about
managing Hadoop”
Hortonworks
– “Make Hadoop easier to consume for enterprises and technology vendors”
– “Provide expert support by the leading contributors to the Apache Hadoop
open source projects”
Pivotal HD (formerly EMC Greenplum)
– "The world's most powerful Hadoop distribution"
– "Provides a complete platform including installation, training, global support,
and value-add beyond simple packaging of the Apache Hadoop distribution"
MapR
– "High-performance Hadoop, up to 2-5 times faster performance than Apache-
based distributions"
– "The first distribution to provide true high availability at all levels, making it
more dependable"
Competing Hadoop Distribution Vendors
Amazon Elastic MapReduce
– “Amazon Elastic MapReduce lets you focus on crunching or analyzing your
data without having to worry about time-consuming set-up, management or
tuning of Hadoop clusters or the compute capacity upon which they sit”.
Intel
– "Many organizations are looking for a large, stable company to establish that
foundation so that they can bet long term to innovate around a stable
platform."
– Intel aims to "protect, nurture and drive a common open source foundation
for Apache Hadoop."
WANdisco
– WANdisco Distro (WDD), available for free download, is the first 100% open
source, commercially supported, fully-tested, production-ready version of
Apache Hadoop 2 based on the most recent release, including the latest fixes.