
Big Data

Introduction to Big Data, Hadoop and Spark
Agenda
• About speaker and itversity
• Categorization of enterprise applications
• Typical data integration architecture
• Challenges with conventional technologies
• Big Data eco system
• Data Storage - Distributed file system and databases
• Data Processing - Distributed computing frameworks
• Data Ingestion - Open source tools
• Data Visualization - BI tools as well as custom
• Role of Apache and other companies
• Different Certifications in Big Data
• Resources for deep dive into Big Data
• Job roles in Big Data
About me
• IT Professional with 13+ years of experience in a vast array of technologies including
Oracle, Java/J2EE, Data Warehousing, ETL, BI as well as Big Data

• Website: http://www.itversity.com
• YouTube: https://www.youtube.com/c/itversityin
• LinkedIn Profile: https://in.linkedin.com/in/durga0gadiraju
• Facebook: https://www.facebook.com/itversity
• Twitter: https://twitter.com/itversity
• Google Plus: https://plus.google.com/+itversityin
• Github: https://github.com/dgadiraju
• Meetup:
– https://www.meetup.com/Hyderabad-Technology-Meetup/ (local Hyderabad meetup)
– https://www.meetup.com/itversityin/ (typically online)
Categorization of Enterprise
applications
[Diagram: Enterprise applications categorized into Operational, Decision Support and Customer Analytics]
Traditionally, enterprise applications can be broadly categorized into Operational and
Decision Support systems. Lately, a new set of applications such as Customer Analytics is
gaining momentum (eg: a YouTube channel for different categories of users)
Enterprise Applications
• Most of the traditional applications are considered
monoliths (n-tier architecture)
– Monoliths are typically built on RDBMS databases such as
Oracle
• Modern applications use microservices and are
considered to be polyglot

* Now we have a choice of different types of databases
(we will see later)
Enterprise applications
Take an example use case (eCommerce platform)
•Operational
– Transactional – check out
– Not transactional – recommendation engine
•Decision support – sales trends
•Customer analytics – categories in which customers have spent
money
* Data from transactional systems needs to be integrated with non
transactional, decision support, customer analytics systems etc
Data Integration
• Data integration can be categorized into
– Real time
– Batch
• Traditionally, when there were no customer analytics
and recommendation engines, we used to have
– ODS (compliance and single source of truth)
– EDW (Facts and dimensions to support reports)
– ETL (to perform transformations and load data into EDW)
– BI (to visualize and publish reports)
Data Integration
(Current Architecture)

[Diagram: Sources (OLTP, closed mainframes, XML, external apps) feed Data Integration (ETL/Real Time) into an ODS and a target system (eg: EDW), which drive EDW/ODS reporting, visualization/reporting and decision support]
Data Integration - Technologies
• Batch ETL – Informatica, DataStage etc
• Real time data integration – GoldenGate,
SharePlex etc
• ODS – Oracle, MySQL etc
• EDW/MPP – Teradata, Greenplum etc
• BI Tools – Cognos, Business Objects etc
Current Scenario - Challenges
• Almost all operational systems are using relational databases (RDBMS like Oracle).
– RDBMS were originally designed for operational and transactional workloads.
• Not linearly scalable.
– Transactions
– Data integrity
• Expensive
• Predefined Schema
• Data processing does not happen where data is stored (storage layer) – no data
locality
– Some processing happens at database server level (SQL)
– Some processing happens at application server level (Java/.net)
– Some processing happens at client/browser level (Java Script)
• Almost all Data Warehouse appliances are expensive and not very flexible for
customer analytics and recommendation engines
Evolution of Databases
Now we have many choices of Databases
•Relational Databases (Oracle, Informix, Sybase, MySQL etc)
•Data warehouse and MPP appliances (Teradata, Greenplum etc)
•NoSQL Databases (Cassandra, HBase, MongoDB etc)
•In-memory Databases (Gemfire, Coherence etc)
•Search-based Databases (Elasticsearch, Solr etc)
•Batch processing frameworks (Map Reduce, Spark etc)
•Graph Databases (Neo4j)
* Modern applications need to be polyglot (different modules need different
category of databases)
Big Data eco system – History
• Started with Google search engine
• Google’s use case is different from enterprises
– Crawl web pages
– Index based on key words
– Return search results
• As conventional database technologies did not scale, they
implemented
– GFS (Distributed file system)
– Google Map Reduce (Distributed processing engine)
– Google Big Table (Distributed indexed table)
Big Data eco system - myths
• Big Data is Hadoop
• Big Data eco system can only solve problems with very large data sets
• Big Data is cheap
• Big Data provides a variety of tools and can solve problems quickly
• Big Data is a technology
• Big Data is Data Science
– Data Scientists need to have specialized mathematical skills
– Domain knowledge
– Minimal technology orientation
– Data Science itself is a separate domain - if required, Big Data technologies can be used

* Often people have unrealistic expectations of Big Data technologies


Big Data eco system –
Characteristics
• Distributed storage
– Fault tolerance (RAID is replaced by replication)
• Distributed computing/processing
– Data locality (code goes to data)
• Scalability (almost linear)
• Low cost hardware (commodity)
• Low licensing costs
* Low cost hardware and software do not mean that
Big Data is cheap for enterprises
Big Data eco system
Big Data eco system of tools can be categorized into
•Distributed file systems
– HDFS
– Cloud storage (S3/Azure Blob)
•Distributed processing engines
– Map Reduce
– Spark
•Distributed databases (operational)
– NoSQL databases (HBase, Cassandra)
– Search databases (Elasticsearch)
Big Data eco system - Evolution
After successfully building its search engine on these new
technologies, Google published white papers
•Distributed file system – GFS
•Distributed processing Engine – Map Reduce
•Distributed database – Big Table
* Development of Big Data technologies such as
Hadoop started with these white papers
Data Storage
• Data storage options in Big Data eco systems
• Distributed file systems (streaming and batch access)
– HDFS
– Cloud storage
• Distributed Databases (random access - distributed
indexed tables)
– Cassandra
– HBase
– MongoDB
– Solr
Data Ingestion
Data ingestion strategies are defined by sources from which data is pulled and
sinks where data is stored
•Sources
– Relational Databases
– Non relational Databases
– Streaming web logs
– Flat files
•Sinks
– HDFS
– Relational or Non relational Databases
– Data processing frameworks
Data Ingestion
• Sqoop is used to get data from relational
databases
• Flume and/or Kafka are used to read data from
web logs
• Spark Streaming, Storm, Flink etc are used to
process data from Flume and/or Kafka before
loading data into sinks (see the sketch below)
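A minimal PySpark Streaming sketch of this pattern, assuming web log events arriving on a Kafka topic; the broker address, topic name and HDFS path are hypothetical, and the spark-streaming-kafka package is assumed to be available on the classpath.

```python
# Sketch: consume web log events from Kafka with Spark Streaming
# and persist each micro-batch to HDFS. Names below are made up.
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils  # needs the spark-streaming-kafka package

sc = SparkContext(appName="WebLogIngestion")
ssc = StreamingContext(sc, batchDuration=10)  # 10-second micro-batches

# Direct stream from a Kafka topic (assumed broker and topic)
stream = KafkaUtils.createDirectStream(
    ssc, ["weblogs"], {"metadata.broker.list": "kafkabroker:9092"})

# Keep only the message value and write each micro-batch to HDFS
stream.map(lambda kv: kv[1]) \
      .saveAsTextFiles("hdfs:///user/demo/weblogs/batch")

ssc.start()
ssc.awaitTermination()
```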
Data processing - Batch
• I/O based
– Map Reduce
– Hive and Pig are wrappers on top of Map Reduce
• In memory
– Spark
– Spark Data Frames are a wrapper on top of core Spark
• As part of data processing we typically focus on
transformations such as the following (see the sketch below)
– Aggregations
– Joins
– Sorting
– Ranking
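A minimal sketch of these transformations using Spark Data Frames (assuming Spark 2.x with PySpark); the order and customer data below are made up for illustration.

```python
# Sketch: aggregation, join, sorting and ranking with Spark Data Frames.
from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("BatchTransformations").getOrCreate()

orders = spark.createDataFrame(
    [(1, "2016-07-01", 100, 29.99), (2, "2016-07-01", 200, 9.99),
     (3, "2016-07-02", 100, 49.99)],
    ["order_id", "order_date", "customer_id", "amount"])
customers = spark.createDataFrame(
    [(100, "Alice"), (200, "Bob")], ["customer_id", "name"])

# Aggregation: revenue per customer
revenue = orders.groupBy("customer_id").agg(F.sum("amount").alias("revenue"))

# Join with the customer dimension and sort by revenue
report = revenue.join(customers, "customer_id").orderBy(F.desc("revenue"))

# Ranking: rank customers by revenue using a window function
ranked = report.withColumn("rank", F.rank().over(Window.orderBy(F.desc("revenue"))))
ranked.show()
```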
Data processing - Operational
• Data is typically stored in distributed databases
• Supports CRUD operations (see the sketch below)
• Data is typically distributed
• Data is typically sorted by key
• Fast and scalable random reads
• NoSQL
– HBase
– Cassandra
– MongoDB
• Search databases
– Elasticsearch
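A minimal sketch of CRUD-style random access against HBase using the happybase Python client (one of several possible clients), assuming an HBase Thrift server is reachable; the host, table and column family names are hypothetical.

```python
# Sketch: CRUD operations against an HBase table via the Thrift gateway.
import happybase

connection = happybase.Connection("hbase-thrift-host")  # assumed host
table = connection.table("customer_profiles")           # assumed table

# Create/Update: put a row keyed by customer id
table.put(b"customer:100", {b"info:name": b"Alice", b"info:city": b"Hyderabad"})

# Read: fast random read by row key
print(table.row(b"customer:100"))

# Read: scan a small key range (rows are sorted by key)
for key, data in table.scan(row_start=b"customer:", limit=10):
    print(key, data)

# Delete: remove the row
table.delete(b"customer:100")

connection.close()
```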
Data Analysis or Visualization
Processed data is analyzed or visualized using
•BI Tools
•Custom visualization frameworks (d3js)
•Ad hoc query tools
Let us recollect how data integration is
typically done in enterprises (eg: Data
Warehousing)
Data Integration
(Current Architecture)

[Diagram: Sources (OLTP, closed mainframes, XML, external apps) feed Data Integration (ETL/Real Time) into an ODS and a target system (eg: EDW), which drive EDW/ODS reporting, visualization/reporting and decision support]
Use Case – EDW
(Current Architecture)
• Enterprise Data Warehouse is built for Enterprise reporting for selected
audience in Executive Management, hence user base who view the reports
will be typically in tens or hundreds
• Data Integration
– ODS (Operational Data Store)
• Sources – Disparate
• Real time – Tools/custom (GoldenGate, SharePlex etc)
• Batch – Tools/custom
• Uses – Compliance, data lineage, reports etc
– Enterprise Data Warehouse
• Sources – ODS or other sources
• ETL – Tools/custom (Informatica, Ab Initio, Talend)
• Reporting/Visualization
– ODS (Compliance related reporting)
– Enterprise Data Warehouse
– Tools (Cognos, Business Objects, MicroStrategy, Tableau etc)
Now we will see how data integration can
be done using Big Data eco system
Data Ingestion
• Apache Sqoop to get data from relational
databases into Hadoop
• Apache Flume or Kafka to get data from
streaming logs
• If some processing needs to be done before
loading into databases or HDFS, data is
processed through streaming technologies
such as Flink, Storm etc
Data Processing
• There are 2 engines to apply transformation rules at
scale
– Map Reduce (uses I/O)
• Hive is the most popular map reduce based tool
• Map Reduce works well to process huge amounts of data in a few
hours
– Spark (in memory)
* We will see the disadvantages of Map Reduce and why
Spark, with new programming languages such as
Scala and Python, is gaining momentum
Disadvantages of Map Reduce
• Disadvantages of Map Reduce based solutions
– Designed for batch, not meant for interactive and
ad hoc reporting
– I/O bound and processing of micro batches can be
an issue
– Too many tools/technologies (Map Reduce, Hive,
Pig, Sqoop, Flume etc.) to build applications
– Not suitable for enterprise hardware where
storage is typically network mounted
Apache Spark
• Spark can work with any file system including HDFS
• Processing is done in memory – hence I/O is minimized
• Suitable for ad hoc or interactive querying or reporting
• Streaming jobs can be done much faster than map reduce
• Applications can be developed using Scala, Python, Java etc
• Choose one programming language and perform (see the sketch below)
– Data integration from RDBMS using JDBC (no need for Sqoop)
– Stream data using Spark Streaming
– Leverage Data Frames and SQL embedded in the programming language
• As processing is done in memory, Spark works well with enterprise
hardware with a network file system
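A minimal sketch of this approach in PySpark (assuming Spark 2.x): read from an RDBMS over JDBC, run SQL on a Data Frame and write the result to HDFS; the JDBC URL, credentials, table and output path are hypothetical, and the matching JDBC driver jar is assumed to be available.

```python
# Sketch: Spark-based data integration from an RDBMS over JDBC, no Sqoop.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("JdbcIntegration").getOrCreate()

# Read a table from an RDBMS directly over JDBC (assumed connection details)
orders = (spark.read.format("jdbc")
          .option("url", "jdbc:mysql://dbhost:3306/retail_db")
          .option("dbtable", "orders")
          .option("user", "retail_user")
          .option("password", "secret")
          .load())

# SQL embedded in the programming language
orders.createOrReplaceTempView("orders")
daily = spark.sql("""
    SELECT order_date, count(*) AS order_count
    FROM orders
    GROUP BY order_date
    ORDER BY order_date
""")

# Persist the result to HDFS (hypothetical path)
daily.write.mode("overwrite").parquet("hdfs:///user/demo/daily_order_counts")
```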
Data Integration
(Big Data eco system)

[Diagram: Sources (OLTP, closed mainframes, XML, external apps) feed a Big Data cluster (EDW/ODS) via real time/batch ingestion with no separate ETL tier; cluster nodes handle ETL and an optional reporting database, driving visualization/reporting and decision support]
Big Data eco system
[Diagram: Big Data technologies – Data Ingestion (Kafka, Flume, Sqoop), Streaming Analytics (Storm, Flink), Hadoop eco system (Oozie, Pig, Hive, Mahout), Ad-hoc querying tools (Impala, Presto), NoSQL (HBase, Cassandra) and In Memory processing (Spark) – layered over Map Reduce and non Map Reduce engines on top of the Distributed File System (HDFS) core components]
Big Data eco system
[Diagram: Hadoop eco system roles – Oozie for custom workflows, Map Reduce for E, T and L, Hive for T and L plus batch reporting, Sqoop for E and L, Impala for interactive/ad-hoc reporting, HBase for real time data integration or reporting – layered over Map Reduce and non Map Reduce engines on top of the Distributed File System (HDFS) Hadoop core components]

Role of Apache
• Each of these is a separate project incubated under Apache
– HDFS and MapReduce/YARN
– Hive
– Pig
– Sqoop
– HBase
– etc
Installation (plain vanilla)
• In plain vanilla mode, depending upon the architecture, each
tool/technology needs to be manually downloaded, installed
and configured.
• Typically people use Puppet or Chef to set up clusters using
plain vanilla tools
• Advantages
– You can set up your cluster with latest versions from Apache directly
• Disadvantages
– Installation is tedious and error prone
– Need to integrate with monitoring tools
Hadoop Distributions
• Different vendors pre-package apache suite of big data tools into their
distribution to facilitate
– Easier installation/upgrade using wizards
– Better monitoring
– Easier maintenance
– and many more
• Leading distributions include, but are not limited to
– Cloudera
– Hortonworks
– MapR
– AWS EMR
– IBM Big Insights
– and many more
Hadoop Distributions
[Diagram: Apache Foundation projects (HDFS/YARN/MR, Hive, HBase, Zookeeper, Pig, Tez, Impala, Sqoop, Spark, Ganglia, Flume) packaged into distributions by Cloudera, Hortonworks, MapR and AWS]
Certifications
• Why certify?
– To promote skills
– Demonstrate industry recognized validation for your expertise.
– Meet global standards required to ensure compatibility between Spark and Hadoop
– Stay up to date with the latest advances in Big Data technologies such as Spark and
Hadoop
• Take certifications only from vendors like
– Cloudera
– Hortonworks
– MapR
– Databricks (O'Reilly)
– http://www.itversity.com/2016/07/05/hadoop-and-spark-developer-certifications-faqs/
– http://www.itversity.com/2016/07/02/hadoop-certifications/
Resources
• Resources to learn Big Data with hands on practice
• YouTube Channel: www.YouTube.com/itversityin (please subscribe)
• 900+ videos
• 100+ playlists
• 6 Certification courses
• www.itversity.com - launched recently
• A few courses added
• Other courses will be added over time
• Courses will be either role based or certification based
• Will be working on blogging platform for IT content
Job Roles
• Go through this blog - http://www.itversity.com/2016/07/02/hadoop-certifications/

Job Role | Experience required | Desired Skills
Hadoop Developer | 0-7 Years | Hadoop, programming using Java, Spark, Hive, Pig, Sqoop etc
Hadoop Administrator | 0-10 Years | Linux, Hadoop administration using distributions
Big Data Engineer | 3-15 Years | Data Warehousing, ETL, Hadoop, Hive, Pig, Sqoop, Spark etc
Big Data Solutions Architect | 12-18 Years | Deep understanding of Big Data eco system such as Hadoop, NoSQL etc
Infrastructure Architect | 12-18 Years | Deep understanding of infrastructure as well as Big Data eco system
