• Website: http://www.itversity.com
• YouTube: https://www.youtube.com/c/itversityin
• LinkedIn Profile: https://in.linkedin.com/in/durga0gadiraju
• Facebook: https://www.facebook.com/itversity
• Twitter: https://twitter.com/itversity
• Google Plus: https://plus.google.com/+itversityin
• Github: https://github.com/dgadiraju
• Meetup:
– https://www.meetup.com/Hyderabad-Technology-Meetup/ (local Hyderabad meetup)
– https://www.meetup.com/itversityin/ (typically online)
Categorization of Enterprise Applications
[Diagram: enterprise applications categorized into Operational, Decision Support, and Customer Analytics]
[Diagram: current architecture – sources (OLTP, Main Frames, XML) pass through Data Integration (ETL/Real Time) into target systems such as ODS and external apps (e.g. EDW), which support Reporting, Visualization and Decision Support; some closed reporting happens directly on OLTP/ODS]
Data Integration - Technologies
• Batch ETL – Informatica, DataStage etc.
• Real time data integration – GoldenGate, SharePlex etc.
• ODS – Oracle, MySQL etc.
• EDW/MPP – Teradata, Greenplum etc.
• BI Tools – Cognos, Business Objects etc.
Current Scenario - Challenges
• Almost all operational systems use relational databases (RDBMS like Oracle).
– RDBMSs were originally designed for operational and transactional workloads.
• Not linearly scalable.
– Transactions
– Data integrity
• Expensive
• Predefined schema
• Data processing does not happen where the data is stored (storage layer) – no data locality
– Some processing happens at the database server level (SQL)
– Some processing happens at the application server level (Java/.NET)
– Some processing happens at the client/browser level (JavaScript)
• Almost all Data Warehouse appliances are expensive and not very flexible for customer analytics and recommendation engines
Evolution of Databases
Now we have many choices of databases:
• Relational databases (Oracle, Informix, Sybase, MySQL etc.)
• Data warehouse and MPP appliances (Teradata, Greenplum etc.)
• NoSQL databases (Cassandra, HBase, MongoDB etc.)
• In-memory databases (GemFire, Coherence etc.)
• Search-based databases (Elasticsearch, Solr etc.)
• Batch processing frameworks (Map Reduce, Spark etc.)
• Graph databases (Neo4j)
* Modern applications need to be polyglot (different modules need different categories of databases)
Big Data eco system – History
• Started with Google search engine
• Google’s use case is different from enterprises’
– Crawl web pages
– Index based on key words
– Return search results
• As conventional database technologies did not scale, they implemented
– GFS (distributed file system)
– Google Map Reduce (distributed processing engine)
– Google Big Table (distributed indexed table)
Big Data eco system - myths
• Big Data is Hadoop
• Big Data eco system can only solve problems with very large data sets
• Big Data is cheap
• Big Data provides a variety of tools and can solve problems quickly
• Big Data is a technology
• Big Data is Data Science
– Data Scientists need specialized mathematical skills
– Domain knowledge
– Minimal technology orientation
– Data Science itself is a separate domain – if required, Big Data technologies can be used
Use Case – EDW
(Current Architecture)
• The Enterprise Data Warehouse is built for enterprise reporting for a selected audience in executive management, hence the user base viewing the reports is typically in the tens or hundreds
• Data Integration
– ODS (Operational Data Store)
• Sources – Disparate
• Real time – Tools/custom (GoldenGate, SharePlex etc.)
• Batch – Tools/custom
• Uses – Compliance, data lineage, reports etc.
– Enterprise Data Warehouse
• Sources – ODS or other sources
• ETL – Tools/custom (Informatica, Ab Initio, Talend)
• Reporting/Visualization
– ODS (Compliance related reporting)
– Enterprise Data Warehouse
– Tools (Cognos, Business Objects, MicroStrategy, Tableau etc.)
Now we will see how data integration can be done using the Big Data eco system
Data Ingestion
• Apache Sqoop to get data from relational databases into Hadoop (a sketch follows this list)
• Apache Flume or Kafka to get data from streaming logs
• If some processing needs to be done before loading into databases or HDFS, data is processed through streaming technologies such as Flink, Storm etc.
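Below is a minimal sketch of such an ingestion step: a Sqoop import wrapped in Python so it can be scheduled from a script. It assumes Sqoop is installed and on the PATH; the host, database, credentials, table and HDFS target directory are hypothetical placeholders.

```python
# Minimal Sqoop import sketch (assumes Sqoop is installed and on the PATH).
# Host, database, credentials, table and target directory are hypothetical.
import subprocess

subprocess.run([
    "sqoop", "import",
    "--connect", "jdbc:mysql://mysql-host:3306/retail_db",  # hypothetical source DB
    "--username", "retail_user",
    "--password", "retail_pass",
    "--table", "orders",                            # table to pull into Hadoop
    "--target-dir", "/user/hive/warehouse/orders",  # HDFS destination
    "--num-mappers", "4",                           # parallel map tasks
], check=True)
```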
Data Processing
• There are 2 engines to apply transformation rules at scale
– Map Reduce (uses I/O)
• Hive is the most popular Map Reduce based tool
• Map Reduce works well to process huge amounts of data in a few hours (a word-count sketch follows)
– Spark (in memory)
* We will see the disadvantages of Map Reduce and why Spark, with newer programming languages such as Scala and Python, is gaining momentum
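To make the batch model concrete, here is a classic word-count sketch for Hadoop Streaming (illustrative only; file names and paths are hypothetical). Every stage reads from and writes to disk, which is exactly why Map Reduce is I/O bound.

```python
#!/usr/bin/env python
# mapper.py – reads raw text on stdin, emits tab-separated (word, 1) pairs.
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print("%s\t%d" % (word, 1))
```

```python
#!/usr/bin/env python
# reducer.py – sums counts per word; Hadoop sorts mapper output by key first.
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t")
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print("%s\t%d" % (current_word, current_count))
        current_word, current_count = word, int(count)
if current_word is not None:
    print("%s\t%d" % (current_word, current_count))
```

These would typically be submitted via the Hadoop Streaming jar shipped with the distribution, along the lines of hadoop jar hadoop-streaming.jar -input <in> -output <out> -mapper mapper.py -reducer reducer.py (the exact jar path varies by distribution).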
Disadvantages of Map Reduce
• Disadvantages of Map Reduce based solutions
– Designed for batch; not meant for interactive and ad hoc reporting
– I/O bound; processing of micro batches can be an issue
– Too many tools/technologies (Map Reduce, Hive, Pig, Sqoop, Flume etc.) to build applications
– Not suitable for enterprise hardware where storage is typically network mounted
Apache Spark
• Spark can work with any file system including HDFS
• Processing is done in memory – hence I/O is minimized
• Suitable for ad hoc or interactive querying or reporting
• Streaming jobs can run much faster than Map Reduce
• Applications can be developed using Scala, Python, Java etc.
• Choose one programming language and perform (see the sketch after this list)
– Data integration from RDBMS using JDBC (no need for Sqoop)
– Stream data using Spark Streaming
– Leverage data frames and SQL embedded in the programming language
– As processing is done in memory, Spark works well with enterprise hardware with network file systems
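A minimal PySpark sketch of the JDBC-based integration described above. The MySQL host, database, credentials and table are hypothetical, and the MySQL JDBC driver is assumed to be available on the Spark classpath.

```python
# Minimal PySpark sketch: pull a table over JDBC (no Sqoop needed),
# then mix data frames with embedded SQL in one program.
# Host, database, credentials and table below are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("jdbc-integration").getOrCreate()

orders = (spark.read
          .format("jdbc")
          .option("url", "jdbc:mysql://mysql-host:3306/retail_db")
          .option("dbtable", "orders")
          .option("user", "retail_user")
          .option("password", "retail_pass")
          .load())

# Register the data frame and query it with SQL embedded in Python.
orders.createOrReplaceTempView("orders")
daily = spark.sql("""
    SELECT order_date, count(*) AS order_count
    FROM orders
    GROUP BY order_date
""")
daily.show()
```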
Data Integration (Big Data eco system)
[Diagram: source systems (OLTP etc.) feed the Big Data eco system through real time or batch data integration (E and L / T and L paths), with an optional reporting database on top for reporting/visualization]
Big Data eco system
[Diagram: Big Data technologies grouped into data ingestion (Sqoop, Kafka, Flume), the Hadoop eco system (Oozie, Pig, Hive, HBase), in-memory processing (Spark) and ad-hoc querying tools (Impala, Presto)]
Role of Apache
• Each of these is a separate project incubated under Apache
– HDFS and MapReduce/YARN
– Hive
– Pig
– Sqoop
– HBase
– etc.
Installation (plain vanilla)
• In plain vanilla mode, depending upon the architecture, each tool/technology needs to be manually downloaded, installed and configured.
• Typically people use Puppet or Chef to set up clusters using plain vanilla tools
• Advantages
– You can set up your cluster with the latest versions from Apache directly
• Disadvantages
– Installation is tedious and error prone
– Need to integrate with monitoring tools
Hadoop Distributions
• Different vendors pre-package the Apache suite of big data tools into their distributions to facilitate
– Easier installation/upgrade using wizards
– Better monitoring
– Easier maintenance
– and many more
• Leading distributions include, but are not limited to
– Cloudera
– Hortonworks
– MapR
– AWS EMR
– IBM Big Insights
– and many more
Hadoop Distributions
[Diagram: vendor distributions (Cloudera, Hortonworks, MapR, AWS) packaging Apache Foundation projects such as HDFS/YARN/MR, Hive, Pig, Sqoop, Flume, HBase, Zookeeper, Spark, Tez, Impala and Ganglia]
Certifications
• Why certify?
– To promote your skills
– Demonstrate industry recognized validation of your expertise
– Meet global standards required to ensure compatibility between Spark and Hadoop
– Stay up to date with the latest advances in Big Data technologies such as Spark and Hadoop
• Take certifications only from vendors such as
– Cloudera
– Hortonworks
– MapR
– Databricks (O'Reilly)
– http://www.itversity.com/2016/07/05/hadoop-and-spark-developer-certifications-faqs/
– http://www.itversity.com/2016/07/02/hadoop-certifications/
Resources
• Resources to learn Big Data with hands-on practice
• YouTube channel: www.YouTube.com/itversityin (please subscribe)
– 900+ videos
– 100+ playlists
– 6 certification courses
• www.itversity.com – launched recently
– A few courses added
– Other courses will be added over time
– Courses will be either role based or certification based
– Will be working on a blogging platform for IT content
Job Roles
• Go through this blog – http://www.itversity.com/2016/07/02/hadoop-certifications/

Job Role                     | Experience required | Desired Skills
Hadoop Developer             | 0-7 Years           | Hadoop; programming using Java, Spark, Hive, Pig, Sqoop etc.
Hadoop Administrator         | 0-10 Years          | Linux; Hadoop administration using distributions
Big Data Engineer            | 3-15 Years          | Data Warehousing, ETL, Hadoop, Hive, Pig, Sqoop, Spark etc.
Big Data Solutions Architect | 12-18 Years         | Deep understanding of Big Data eco system such as Hadoop, NoSQL etc.
Infrastructure Architect     | 12-18 Years         | Deep understanding of infrastructure as well as Big Data eco system