Introduction
Syllabus
UNIT I : INTRODUCTION TO BIG DATA AND HADOOP
Types of Digital Data : Structured, Semi-Structured and
Unstructured data. Characteristics of Data, Evolution of Big Data,
Why Big Data, What is Big Data Analytics, Big Data Challenges,
Features of Hadoop, Evolution of Hadoop, Introduction to
Hadoop Eco System.
UNIT II : HDFS(Hadoop Distributed File System)
The Design of HDFS, HDFS Concepts, Command Line Interface,
Hadoop file system interfaces, Data flow, Data Ingest with Flume
and Sqoop, Concepts of Hadoop I/O : Data Integration,
Compression, Serialization and File-Based Data structures.
UNIT III : Map Reduce
Anatomy of a Map Reduce Job Run, Failures, Job Scheduling,
Shuffle and Sort, Task Execution, Map Reduce Types and Formats,
Map Reduce Features.
Unit IV : Hadoop Eco System
Pig : Introduction to Pig, Execution Modes of Pig, Comparison of
Pig with Databases, Pig Latin, Data Processing Operators, and
User Defined Functions in Pig
Structured Data Queries with Hive : The Hive Command Line
Interface(CLI), Hive Query Language(HQL), Data Analysis with
Hive.
NoSQL Database HBase : CAP theorem, NoSQL Databases,
Column-Oriented Databases, Real Time analytics with HBase.
UNIT V : Apache Spark
In-Memory Computing with Spark : Spark Basics, Interactive Spark
Using PySpark, Writing Spark Applications with PySpark.
Scalable Machine Learning with Spark : Collaborative Filtering,
Classification, Clustering.
Text books :
Content of this presentation
has been taken from Book
“Fundamentals of Business Analytics”
RN Prasad and Seema Acharya
Published by Wiley India Pvt. Ltd.
Data
• Data is a collection of facts.
• Data is precious to any organization .. Why?
– Data is present internal as well as external to an
organization .. How?
– Data comes from homogeneous as well as
heterogeneous sources.
Digital Data
It is defined as data that is stored in digital form; it may be
in the form of a picture, document, video, etc.
Digital data can be classified into three forms:
• Structured
• Semi-Structured
• Un-Structured
Usually, data is in an unstructured format, which makes
extracting information from it difficult.
• According to Merrill Lynch, 80–90% of business data is
either unstructured or semi-structured.
• Gartner also estimates that unstructured data constitutes
80% of the whole enterprise data
Formats of Digital Data
(Figure: percentage distribution of the three forms of data)
Data Forms Defined
Unstructured data: This is the data which does not conform to a
data model or is not in a form which can be used easily by a
computer program. About 80-90% of an organization's data is in
this format; for example, memos, chat rooms, PowerPoint
presentations, images, videos, letters, research papers, white
papers, the body of an email, etc.
Semi-structured data: This is the data which does not conform to a
data model but has some structure. However, it is not in a form
which can be used easily by a computer program; for example,
emails, XML, markup languages like HTML, etc. Metadata for this
data is available but is not sufficient.
Structured data: This is the data which is in an organized form
(e.g., in rows and columns) and can be easily used by a computer
program. Relationships exist between entities of data, such as
classes and their objects. Data stored in databases is an example of
structured data.
Structured Data
• Structured Data
– Data is in Organized form
– Data stored in Databases
– Stored in rows and columns.
– Easily accessed by a program
• Insert/update/delete
• Security
• Indexing
• Transaction Processing
• Scalability
– Data Collected by day-to-day Business Activities
– Sources
• Databases such as Oracle's Oracle Database, IBM DB2,
Teradata, Microsoft SQL Server, open-source MySQL and
PostgreSQL, EMC Greenplum, etc.
• Spreadsheets
• OLTP
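The rows-and-columns model described above can be illustrated with a minimal sketch using Python's built-in sqlite3 module; the table name, customers, and amounts below are invented for illustration:

```python
import sqlite3

# In-memory database; the schema and rows are illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders (customer, amount) VALUES (?, ?)",
    [("Asha", 250.0), ("Ravi", 120.5), ("Asha", 75.0)],
)

# Because the schema is fixed, a program can query the data directly.
rows = conn.execute(
    "SELECT customer, SUM(amount) FROM orders GROUP BY customer ORDER BY customer"
).fetchall()
print(rows)  # [('Asha', 325.0), ('Ravi', 120.5)]
conn.close()
```

The predefined schema is what makes insert/update/delete, indexing, and transaction processing straightforward for structured data.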
Semi-structured Data
• Referred to as a self-describing structure.
• Does not Conform to a data model but has a structure
• It uses tags to segregate semantic elements.
– Tags are used to enforce hierarchies of records and fields within data.
• No separation between data and schema. Often schema
information is blended with data values.
• Data Objects may have different attributes not known
beforehand.
• Not in the form which is easily usable by a program
– Emails, XML, HTML, JSON etc.
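The "self-describing" property can be sketched in Python with the standard json module; the records below are hypothetical. Note how the second object carries attributes the first lacks, and the attribute set is discovered from the data itself:

```python
import json

# Hypothetical semi-structured records: schema information is blended
# with the values, and attributes are not known beforehand.
raw = '''
[
  {"name": "Asha", "email": "asha@example.com"},
  {"name": "Ravi", "email": "ravi@example.com", "phone": "98400-00000", "city": "Chennai"}
]
'''

records = json.loads(raw)
# Discover the attribute set by inspecting the data itself.
all_keys = sorted({key for rec in records for key in rec})
print(all_keys)  # ['city', 'email', 'name', 'phone']
```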
Where does Semi-structured Data Come from?
Typical sources of semi-structured data include:
• XML
• TCP/IP packets
• Zipped files
• Binary executables
• Mark-up languages
• Web pages
Where does Unstructured Data Come from?
Broadly speaking, anything in a non-database form is
unstructured data. Typical sources include memos, the body of an
e-mail, PowerPoint presentations, chats, reports, whitepapers,
and surveys. It can be classified into two broad categories:
▪ Bitmap Objects : For example, image, video, or audio files.
▪ Textual Objects : For example, Microsoft Word documents,
emails, or Microsoft Excel spreadsheets.
Even though email messages are organized in databases such as
Microsoft Exchange or Lotus Notes, the body of the email is
essentially raw data, i.e. free-form text without any structure.
Unstructured Data – Getting to Know
Dr. Ben, Dr. Stanley, and Dr. Mark work at the medical facility of “GoodLife”. Over the past
few days, Dr. Ben and Dr. Stanley had been exchanging long emails about a particular case
of intestinal problem. Dr. Stanley has chanced upon a particular combination of drugs that
has cured gastro-intestinal disorders in his patients. He has written an email about this
combination of drugs to Dr. Ben.
Dr. Mark has a patient in the “GoodLife” emergency unit with quite a similar case of gastro-
intestinal disorder whose cure Dr. Stanley has chanced upon. Dr. Mark has already tried
regular drugs but with no positive results so far. He quickly searches the organization's
database for answers, but with no luck. The information he wants is tucked away in the
email conversation between two other “GoodLife” doctors, Dr. Ben and Dr. Stanley. Dr.
Mark would have accessed the solution with few mouse clicks had the storage and analysis
of unstructured data been undertaken by “GoodLife”.
As is the case at “GoodLife”, 80-85% of data in any organization is unstructured, and it
is growing at an alarming rate. An enormous amount of knowledge is buried in this data. In
the above scenario, Dr. Stanley's email to Dr. Ben had not been captured in the medical
system because it was in an unstructured format.
Unstructured data, thus, is data which cannot be stored in the form of rows and columns
as in a database and does not conform to any data model, i.e. it is difficult to determine
the meaning of the data. It does not follow any rules or semantics, can be of any type,
and is hence unpredictable.
Characteristics of Data
• Composition of Data : Deals with Structure
– Sources of data
– Types of data
– Nature of data (static / real-time streaming)
– Granularity
• Condition of Data : Can we use data or does it
require preprocessing for analysis.
• Context of Data : Deals with Purpose of Data
Generation
– Where has the data been generated
– Why was the data generated
– How sensitive is the data
– What are the events associated with data, so on
What is BIG DATA? How is it different ?
• Big data is the term for a collection of datasets so large
and complex that it becomes difficult to process them using
on-hand database management tools or traditional data
processing applications.
• The challenges include capture, curation, storage,
search, sharing, transfer, analysis, and visualization.
• The trend to larger data sets is due to the additional
information derivable from analysis of a single large set
of related data, as compared to separate smaller sets
with the same total amount of data, allowing
correlations to be found to “spot business trends,
determine quality of research, prevent diseases, link
legal citations, combat crime, and determine real time
roadway traffic conditions”.
Why Big Data
More Data
↓
More Accurate Analysis
↓
Greater Confidence in Decision
Making
↓
Greater Operational Efficiencies
for
▪ Cost Reduction
▪ Time Reduction
▪ New Product Development
▪ Optimized Offerings
Facts and Figures
Walmart handles 1 million customer transactions/hour
Facebook handles 40 billion photos from its user base!
Facebook inserts 500 terabytes of new data every day
Facebook stores, accesses, and analyzes 30 Petabytes of user
generated data
A flight generates 240 terabytes of flight data in 6-8 hours of flight
More than 5 billion people are calling, texting, tweeting and
browsing on mobile phones worldwide
Decoding the human genome originally took 10 years to process;
now it can be achieved in one week
The largest AT&T database boasts titles including the largest volume
of data in one unique database (312 terabytes) and the second
largest number of rows in a unique database (1.9 trillion), which
comprises AT&T’s extensive calling records
An Insight
• 2.5 petabytes
– Memory capacity of the human brain
• 13 petabytes
– Amount that could be downloaded from the internet in two minutes, if
every American (300M) got on a computer at the same time
• 463 exabytes
– Amount of data generated every day by people, as of 2025
• 1 exabyte
• 4.75 exabytes
– Total genome sequences of all people on the Earth
• 422 exabytes
– Total digital data created in 2008
• 59 zettabytes
– World’s current (2020) digital storage capacity
• 180 zettabytes
– Total digital data expected to be created in 2025
What’s Making so much Data?
• Sources : People, machine, organization
Ubiquitous computing
• More people carrying data-generating devices
(mobile phones with Facebook, GPS, cameras,
etc.)
• Data on Internet, Web server logs and Internet
clickstream data – Log Data
• Social media content and Social network activity
reports, Facebook data – Social Media Data
• Machine data captured by sensors connected to
the Internet of Things – Sensor Data.
• Audio, Video, Image, Podcasts, etc – Media Data
• Text from customer emails and survey responses.
– Text Data
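Log data of the kind listed above is typically semi-structured text that must be parsed before analysis. A minimal Python sketch, using an invented log line in the common Apache access-log format (the field names are illustrative):

```python
import re

# A made-up log line in the common Apache access-log format.
line = '203.0.113.7 - - [10/Oct/2023:13:55:36 +0000] "GET /products/42 HTTP/1.1" 200 2326'

# Named groups turn the raw line into structured fields.
pattern = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) \S+" (?P<status>\d+) (?P<bytes>\d+)'
)

m = pattern.match(line)
event = m.groupdict()
print(event["ip"], event["method"], event["path"], event["status"])
```

At big-data scale, this per-line parsing is exactly the kind of work that gets distributed across a cluster rather than run on one machine.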
Sources of Data Generation
Big Data In Use
• The first organizations to embrace it were online and startup firms,
such as Google, eBay, LinkedIn, and Facebook.
• Facebook handles 40 billion photos from its user base.
• Walmart handles more than 1 million customer transactions every hour.
• Decoding the human genome originally took 10 years to process; now it can
be achieved in one week.
Some examples of Big Data Science
Commercial • Astronomy
• Web / event / database logs • Atmospheric science
• “Digital exhaust” (result of human • Genomics
interaction with the Internet) • Biogeochemical
• Sensor networks • Biological
• RFID • ... and other complex and/or
• Internet text and documents interdisciplinary scientific research
• Internet search indexing Social
• Call detail records (CDR) • Social networks
• Medical records • Social data
• Photographic archives – Person to person (P2P, C2C):
• Video / audio archives • Wish Lists on Amazon.com
• Large scale eCommerce • Craig’s List
Government – Person to world (P2W, C2W):
• Regular government business & • Twitter
commerce needs • Facebook
• Military & homeland security • LinkedIn
surveillance
Evolution of Big Data
The evolution proceeded in three stages:
• 1970s and before – Data Generation and Storage: primitive and
structured data; basic data storage in files on mainframes.
• 1980s-1990s – Data Utilization: complex and relational data;
RDBMS and data-intensive applications.
• 2000s and beyond – Data Driven: complex and unstructured data;
structured, unstructured, multimedia, etc.
Database
• Used for storing data from one or a limited
number of applications or sources.
• Pros: Processing digital transactions,
established technology
• Cons: Reporting, visualization, and analysis
cannot be performed across a very large
integrated set of data sources and streams
Data warehouse
• Used for aggregating data from many different
data sources and making that data available for
visualization, reporting, and analysis. Purpose-built
for analysis.
• Pros: Better support for reporting, analysis, big
data, data retrieval, and visualization, designed to
store data from any number of data sources
• Cons: Costly compared to a single database,
preparation/configuration of data prior to
ingestion, (for cloud data warehouses) less
control over access and security configuration
Where is the Problem?
• Traditional RDBMS queries aren't sufficient to
extract useful information from the huge volume of
data.
• Searching it with traditional tools to find out whether a
particular topic was trending would take so long
that the result would be meaningless by the time
it was computed.
• Big data technologies therefore store this data in
novel ways to make it more accessible, and
provide new methods of performing analysis on it.
The Dimensions of Big Data
• Volume: Large volumes of data
• Velocity: Quickly moving data
• Variety: structured, unstructured, semi-
structured.
• Veracity: Trust and integrity is a challenge and
is important for big data just as for traditional
relational DBs
Big Data – 3V’s
Volume : Enterprises are awash with ever growing data of
all types, easily amassing terabytes even Petabytes of
information.
Data volume is increasing exponentially:
• 44x increase from 2009 to 2020
• From 0.8 zettabytes to 35 zettabytes
• Turn 12 terabytes of Tweets created each day into
improved product sentiment analysis.
• Convert 350 billion annual meter readings to better
predict power consumption.
• Sensors embedded into everyday objects enable
better monitored activities.
Velocity (speed)
Velocity : Sometimes 2 minutes is too late. For time-
sensitive processes such as catching fraud, big data must
be used as it streams into your enterprise in order to
maximize its value.
• Scrutinize 5 million trade events created each day to identify
potential fraud.
• Analyze 500 million daily call detail records in real time to
predict customer churn faster.
• Data is being generated fast and needs to be processed fast.
• Online Data Analytics : Late decisions → missed opportunities
• Examples
• E-Promotions: Based on your current location, your purchase history,
and what you like → send promotions right now for the store next to you
• Healthcare monitoring : Sensors monitoring your activities and body →
any abnormal measurements require immediate reaction.
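The healthcare-monitoring example above can be sketched as a toy Python generator that reacts to each reading the moment it arrives, rather than waiting for a batch; the readings and thresholds are invented for illustration:

```python
# A toy sketch of velocity: process readings as they stream in.
def monitor(readings, low=60, high=100):
    """Yield an alert the moment a heart-rate reading leaves the normal range."""
    for timestamp, bpm in readings:
        if not (low <= bpm <= high):
            yield (timestamp, bpm)  # react immediately; don't wait for a batch

# Invented stream of (timestamp, beats-per-minute) readings.
stream = [("09:00", 72), ("09:01", 75), ("09:02", 131), ("09:03", 80)]
alerts = list(monitor(stream))
print(alerts)  # [('09:02', 131)]
```

Real streaming platforms (e.g. Spark Streaming, covered in Unit V) apply this same per-event logic, but distributed and fault-tolerant.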
Real-time/Fast Data
Examples of value (quality) from fast data:
• Product recommendations that are relevant & compelling
• Friend invitations to join a game or activity that expands business
• Insights from consumer behaviour
Big Data Definition
• Big data is high-volume, high-velocity and high-variety
information assets that demand cost-effective,
innovative forms of information processing for
enhanced insight and decision making – Gartner
• Big data is a collection of data sets so large and complex
that it becomes difficult to process using on-hand
database management tools. The challenges include
Capture, Curation, Storage, Search, Sharing, Analysis,
and Visualization.
• Big data is the realization of greater business
intelligence by storing, processing, and analyzing data
that was previously ignored due to the limitations of
traditional data management technologies
Big Data Definition - Gartner
• Big Data is high volume, high Velocity and high
variety
– Voluminous and varied data requires speed for storage,
preparation, processing and analysis
• Big Data is cost-effective, innovative forms of
Information Processing
– (talks about technologies to capture, store, process,
persist, integrate and visualize Big data)
• Big Data for Enhanced insights and decision
making
– Deriving deeper, richer and meaningful insights
• Data -> Information-> Actionable Intelligence ->
Better Decision -> Enhanced Business Value
Big Data Analytics
• Big data analytics is the process of examining large data
sets containing a variety of data types, i.e., big data,
to uncover hidden patterns, unknown correlations,
market trends, customer preferences and other useful
business information.
• The analytical findings can lead to more effective
marketing, new revenue opportunities, better
customer service, improved operational efficiency,
competitive advantages over rival organizations and
other business benefits.
• The primary goal of big data analytics is to help
companies make more informed business decisions
Analytics
• Basic Analytics : Slicing and Dicing of Data for
basic business insights. Ex. Reporting on
historical data, Visualization etc.,
• Operationalized Analytics : Analytics woven into
the enterprise's business processes
• Advanced Analytics : About Forecasting for the
future by way of Predictive and Prescriptive
modeling.
• Monetized Analytics : About deriving Direct
Business Revenue.
Analytics 1.0, 2.0 and 3.0
The four types of analytics form a maturity ladder from hindsight to foresight:
• Descriptive Analytics – What happened? (Hindsight)
• Diagnostic Analytics – Why did it happen? (Insight)
• Predictive Analytics – What will happen? (Foresight)
• Prescriptive Analytics – How can we make it happen?
Analytics 1.0
Era : 1950s to 2009
• Descriptive
• What happened ?
• Data from CRM, ERP and 3rd party
applications
• Small and structured data sources, EDW or
data marts
• Data was Internally Sourced
• RDBMS
Analytics 2.0
Era : 2005 to 2012
• Descriptive + Predictive
• What will happen? Why will it happen?
• Big data
• Data Mainly unstructured and with velocity
• Stored and processed often parallel
• Data often externally sourced
• Database appliances, Hadoop clusters, SQL-on-Hadoop
environments
Analytics 3.0
2012 onwards
• Descriptive + Predictive + Prescriptive
• What will happen, when will it happen, why will it
happen, what should be the action taken to take the
advantage of what will happen?
• A blend of Big data, Legacy Systems, CRM, ERP and 3rd
party tools
• A blend of Big data and Traditional Analytics to yield
insights and offering with speed and impact
• Data is both being internally and externally sourced
• In-memory analytics, in-database processing, agile
analytics methods, machine learning techniques, etc.
Categories of Big Data Analytics
• Descriptive Analytics: These tools tell companies what happened.
They create simple reports and visualizations that show what
occurred at a particular point in time or over a period of time.
These are the least advanced analytics tools.
• Diagnostic Analytics: Diagnostic tools explain why something
happened. More advanced than descriptive reporting tools, they
allow analysts to dive deep into the data and determine root causes
for a given situation.
• Predictive Analytics: Among the most popular big data analytics
tools available today, predictive analytics tools use highly advanced
algorithms to forecast what might happen next. Often these tools
make use of artificial intelligence and machine learning technology.
• Prescriptive Analytics: A step above predictive analytics,
prescriptive analytics tell organizations what they should do in
order to achieve a desired result. These tools require very advanced
machine learning capabilities, and few solutions on the market
today offer true prescriptive capabilities.
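Descriptive analytics, the simplest of the four categories, can be sketched with Python's standard statistics module; the daily sales figures below are invented for illustration. Note there is no forecasting here, only a summary of what happened:

```python
import statistics

# Invented daily sales figures for one week.
daily_sales = [1200, 1350, 980, 1500, 1420, 1100, 1275]

# A descriptive "report": simple aggregates over historical data.
report = {
    "total": sum(daily_sales),
    "mean": round(statistics.mean(daily_sales), 2),
    "best_day": max(daily_sales),
    "worst_day": min(daily_sales),
}
print(report)
```

Diagnostic, predictive, and prescriptive tools build on top of summaries like this, adding root-cause analysis, forecasting models, and recommended actions respectively.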
Where is Big Data Analytics Used?
• Smarter Healthcare
• Multi-channel Sales
• Homeland Security
• Telecom
• Trading Analytics
• Traffic Control
• Search Quality
• Manufacturing
Benefits of Big Data Analytics
• Business Transformation : Executives believe that big data analytics
offers tremendous potential to revolutionize their organizations. The
collection and analysis of big data could fundamentally change the
way their companies do business.
• Competitive Advantage : In the MIT Sloan Management Review
Research Report Analytics as a Source of Business Innovation, 57
percent of enterprises surveyed said their use of analytics was
helping them achieve competitive advantage.
• Innovation : Big data analytics can help companies develop
products and services that appeal to their customers, as well as
helping them identify new opportunities for revenue generation.
Also in the MIT Sloan Management survey, 68 percent of
respondents agreed that analytics has helped their company
innovate.
• Lower Costs : In the New Vantage Partners Big Data Executive Survey
2017, 49.2 percent of companies surveyed said that they had
successfully decreased expenses as a result of a big data project
Benefits of Big Data Analytics
• Improved Customer Service : Organizations often use big data
analytics to examine social media, customer service, sales and
marketing data. This can help them better gauge customer
sentiment and respond to customers in real time.
• Increased Security : Another key area for big data analytics is IT
security. Security software creates an enormous amount of log
data. By applying big data analytics techniques to this data,
organizations can sometimes identify and thwart cyberattacks that
would otherwise have gone unnoticed.
Big Data Use Cases
• 360° View of the Customer :
– Many enterprises use big data to build a dashboard
application that provides a 360° view of the customer.
These dashboards pull together data from a variety of
internal and external sources, analyze it and present it to
customer service, sales and/or marketing personnel in a
way that helps them do their jobs.
• Fraud Prevention
– Thanks to big data analytics and machine learning, today’s
fraud prevention systems are orders of magnitude better
at detecting criminal activity and preventing false
positives.
– For example, if a credit card were used to rent a car in
Chennai, but the customer lived in Delhi, a customer
service agent might call to confirm that the cardholder was
on vacation and that someone hadn’t stolen the card
Big Data Use Cases
• Data Warehouse Offload
– It is common for organizations to have a data warehouse to
facilitate their business intelligence (BI) efforts. DW
technology tends to be very costly to purchase and run.
– Companies have begun using big data tools to remove some of
the burden from their data warehouses, replacing or
complementing the DW with Hadoop. Hadoop-based
solutions often provide much faster performance while
reducing licensing fees and other costs.
• Price Optimization
– Both business-to-consumer (B2C) and business-to-business
(B2B) enterprises use big data analytics to optimize
prices. They use their big data solutions to segment their
customer base and build models that show how much
different types of customers will be willing to pay under
different circumstances
The Structure of Big Data
❖Structured
• Most traditional data
sources
❖Semi-structured
• Many sources of big
data
❖Unstructured
• Video data, audio data
Big Data Challenges
• Capture
• Curation
• Storage
• Search
• Sharing
• Transfer
• Analysis and
• Visualization
Big Data Challenges
• Dealing with Data growth(Scale) : Addressing
the Storage needs that can best withstand the
onslaught of high volume, velocity and variety
of big data is a challenge.
• Security : Security is a big concern for
organizations with big data stores. Most NoSQL
big data platforms have poor security
mechanisms. Some big data stores can be
attractive targets for hackers or advanced
persistent threats (APTs).
• Schema : Rigid schemas have no place; the
technology should fit the big data, not vice
versa. Hence dynamic schemas are another issue.
Big Data Challenges
• Generating insights in a timely manner : The
most common goals associated with big data
projects included the following:
– Decreasing expenses through operational cost
efficiencies
– Establishing a data-driven culture
– Creating new avenues for innovation and disruption
– Accelerating the speed with which new capabilities
and services are deployed
– Launching new product and service offerings
Big Data Challenges
• Partition Tolerance : How to build partition-tolerant
systems that can take care of both hardware and
software failures.
• Continuous availability of Data : How to provide
24/7 support.
• Data Quality : How to maintain data quality in
terms of data accuracy, completeness and
timeliness – data with veracity
India – Big Data
• Gaining traction
• Huge market opportunities for IT services (82.9% of
revenues) and analytics firms (17.1 % )
• The global big data and business analytics market
was valued at 169 billion U.S. dollars in 2018 and is
expected to grow to 274 billion U.S. dollars in 2022.
• The opportunity for Indian service providers lies in
offering services around Big Data implementation
and analytics for global multinationals
What is Hadoop?
• Apache open source software framework for reliable, scalable,
distributed computing of massive amount of data
– Developed in Java
– Hides underlying system details and complexities from user
– Well-suited for batch-oriented, read-intensive applications
WANdisco
– WANdisco Distro (WDD), available for free download, is the first 100% open
source, commercially supported, fully-tested, production-ready version of
Apache Hadoop 2 based on the most recent release, including the latest fixes.
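Hadoop's core processing model, MapReduce (covered in Unit III), follows a map / shuffle / reduce flow. This in-process Python word count is only a sketch of the idea, not how Hadoop itself is programmed; the documents are invented:

```python
from collections import defaultdict

# A miniature word count -- the canonical MapReduce example -- run
# in-process to illustrate the flow Hadoop distributes across a cluster.
documents = ["big data needs hadoop", "hadoop processes big data"]

# Map phase: emit a (word, 1) pair for every word in every document.
mapped = [(word, 1) for doc in documents for word in doc.split()]

# Shuffle phase: group the emitted values by key.
groups = defaultdict(list)
for word, count in mapped:
    groups[word].append(count)

# Reduce phase: sum the counts for each word.
counts = {word: sum(values) for word, values in groups.items()}
print(counts)  # {'big': 2, 'data': 2, 'needs': 1, 'hadoop': 2, 'processes': 1}
```

In real Hadoop, the map and reduce functions run on different cluster nodes and the shuffle moves data between them over the network, which is what makes the model scale to massive datasets.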