3A BD Ch3 HDP
Houda Benbrahim
Chapter III: Hortonworks Data Platform (HDP)
2023 - 2024
HDP Features
• Apache Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of streaming event data.
• Flume helps you aggregate data from many sources, manipulate the data, and then add the data into your Hadoop environment.
• Its functionality is now superseded by HDF / Apache NiFi.

• Apache Kafka is a messaging system used for real-time data pipelines.
• It is used to build real-time streaming data pipelines that move data between systems or applications.
• It works with a variety of Hadoop tools for various applications.
• Examples of use cases are:
  • Website activity tracking: capturing user site activities for real-time tracking/monitoring
  • Log aggregation: collecting logs from various sources to a central location for processing
  • Stream processing: article recommendations based on user activity
  • Event sourcing: state changes in applications are logged as a time-ordered sequence of records
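The event sourcing use case listed for Kafka can be illustrated with a minimal in-memory sketch. It does not use Kafka itself: `EventLog`, `append`, and `replay` are hypothetical names chosen for illustration, and a real pipeline would write these records to a Kafka topic rather than to a Python list.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class EventLog:
    """Append-only, ordered log of state changes (a stand-in for a topic, not Kafka)."""

    records: List[Tuple[int, str, int]] = field(default_factory=list)

    def append(self, account: str, delta: int) -> int:
        # The offset is the record's position in the log, so ordering is preserved.
        offset = len(self.records)
        self.records.append((offset, account, delta))
        return offset

    def replay(self) -> Dict[str, int]:
        # The current state is rebuilt by applying every change in log order.
        balances: Dict[str, int] = {}
        for _offset, account, delta in self.records:
            balances[account] = balances.get(account, 0) + delta
        return balances


log = EventLog()
log.append("alice", 100)
log.append("bob", 50)
log.append("alice", -30)
state = log.replay()
print(state)
```

Because the log is never modified in place, the state at any earlier point can be recovered by replaying only a prefix of the records.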
• Apache Pig is a platform for analyzing large data sets.
• Pig was designed for scripting a long series of data operations (good for ETL).
• Pig consists of a high-level language called Pig Latin, which was designed to simplify MapReduce programming.
• Pig's infrastructure layer consists of a compiler that produces sequences of MapReduce programs from the Pig Latin code that you write.
• The system is able to optimize your code and "translate" it into MapReduce, allowing you to focus on semantics rather than efficiency.

• Apache HBase is a distributed, scalable, big data store.
• Use Apache HBase when you need random, real-time read/write access to your Big Data.
• The goal of the HBase project is to handle very large tables of data running on clusters of commodity hardware.
• HBase is modeled after Google's BigTable and provides BigTable-like capabilities on top of Hadoop and HDFS.
• HBase is a NoSQL datastore.
• HBase is not designed for transactional processing.
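To make the Pig points more concrete, here is a rough sketch of the kind of map, shuffle, and reduce stages that a group-and-count Pig Latin script would be translated into. This is a plain Python stand-in, not the output of Pig's compiler; the function names and the sample data are assumptions made for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def map_phase(lines: Iterable[str]) -> List[Tuple[str, int]]:
    # Emit a (word, 1) pair for every word, as a mapper would.
    return [(word.lower(), 1) for line in lines for word in line.split()]


def shuffle_phase(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    # Group all values by key; in MapReduce the framework does this between stages.
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(groups)


def reduce_phase(groups: Dict[str, List[int]]) -> Dict[str, int]:
    # Aggregate each group, roughly what a GROUP ... BY followed by COUNT expresses.
    return {key: sum(values) for key, values in groups.items()}


lines = ["big data big tables", "data pipelines"]
counts = reduce_phase(shuffle_phase(map_phase(lines)))
print(counts)
```

In Pig Latin the same logic is a few declarative statements, and the three stages above are what the user no longer has to write by hand.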
• Apache Accumulo is a sorted, distributed key/value store that provides robust, scalable data storage and retrieval.
• It is based on Google's BigTable and runs on YARN
• Think of it as a "highly secure HBase"
• Features:
  • Server-side programming
  • Designed to scale
  • Cell-based access control
  • Stable

• Apache Phoenix enables OLTP and operational analytics in Hadoop for low latency applications by combining the best of both worlds:
  ▪ The power of standard SQL and JDBC APIs with full ACID transaction capabilities.
  ▪ The flexibility of late-bound, schema-on-read capabilities from the NoSQL world by leveraging HBase as its backing store.
• Essentially this is SQL for NoSQL
• Fully integrated with other Hadoop products such as Spark, Hive, Pig, Flume, and MapReduce
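The cell-based access control feature listed for Accumulo can be sketched as follows. This is a simplified illustration and not the Accumulo API: the `Cell` layout, the `scan` function, and the sample labels are hypothetical, and the sketch only requires that the reader holds every label on a cell (a pure AND rule), whereas Accumulo's visibility expressions are richer.

```python
from typing import FrozenSet, List, Set, Tuple

# (row, column, labels required to read the cell, value); a hypothetical layout
Cell = Tuple[str, str, FrozenSet[str], str]


def scan(cells: List[Cell], authorizations: Set[str]) -> List[Tuple[str, str, str]]:
    # Return only the cells whose required labels are all held by the reader.
    return [
        (row, column, value)
        for row, column, labels, value in cells
        if labels <= authorizations
    ]


cells: List[Cell] = [
    ("patient1", "name", frozenset(), "Ada"),
    ("patient1", "diagnosis", frozenset({"medical"}), "flu"),
    ("patient1", "invoice", frozenset({"medical", "billing"}), "paid"),
]

visible_to_nurse = scan(cells, {"medical"})
print(visible_to_nurse)
```

The point of the design is that filtering happens per cell at read time, so two users scanning the same row can legitimately see different columns.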
• Apache Storm is an open source distributed real-time computation system.
  ▪ Fast
  ▪ Scalable
  ▪ Fault-tolerant
• Used to process large volumes of high-velocity data
• Useful when milliseconds of latency matter and Spark isn't fast enough
  ▪ Has been benchmarked at over a million tuples processed per second per node

• Apache Solr is a fast, open source enterprise search platform built on the Apache Lucene Java search library
• Full-text indexing and search
  ▪ REST-like HTTP/XML and JSON APIs make it easy to use with a variety of programming languages
• Highly reliable, scalable and fault tolerant, providing distributed indexing, replication and load-balanced querying, automated failover and recovery, centralized configuration and more
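Full-text indexing and search, as offered by Solr through Lucene, rests on the idea of an inverted index. The following is a minimal sketch of that idea only; it is not Solr or Lucene code, and `build_index`, `search`, and the sample documents are assumptions made for illustration (real engines also add analysis, ranking, and distribution).

```python
from collections import defaultdict
from typing import Dict, Set


def build_index(docs: Dict[int, str]) -> Dict[str, Set[int]]:
    # Map each term to the set of document ids that contain it.
    index: Dict[str, Set[int]] = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return dict(index)


def search(index: Dict[str, Set[int]], query: str) -> Set[int]:
    # AND semantics: a document must contain every query term.
    terms = query.lower().split()
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


docs = {
    1: "Hadoop cluster search",
    2: "Enterprise search platform",
    3: "Hadoop data platform",
}
index = build_index(docs)
print(search(index, "search platform"))
```

Looking up a term is a dictionary access rather than a scan over every document, which is why this structure suits search workloads.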
• Apache Spark is a fast and general engine for large-scale in-memory data processing.
• It has a number of built-in libraries that sit on top of the Spark core and take advantage of all its capabilities: Spark ML, GraphX, Spark Streaming, and Spark SQL and DataFrames.
• Spark has a variety of advantages including:
  • Speed: Run programs faster than MapReduce
  • Easy to use: Write apps quickly with Java, Scala, Python, R
  • Generality: Can combine SQL, streaming, and complex analytics
  • Runs on a variety of environments and can access diverse data sources:
    − Hadoop, Mesos, standalone, cloud...
    − HDFS, HBase,…

• Apache Druid is a high-performance, column-oriented, distributed data store.
• It has a unique architecture that enables rapid multi-dimensional filtering, ad-hoc attribute groupings, and extremely fast aggregations
• It supports real-time streams
  − Lock-free ingestion to allow for simultaneous ingestion and querying of high dimensional, high volume data sets
  − Explore events immediately after they occur
• It is a datastore designed for business intelligence (OLAP) queries.
• It integrates with Apache Hive to build OLAP cubes and run sub-second queries.
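The combination of column-oriented storage, multi-dimensional filtering, grouping, and aggregation described for Druid can be sketched in a few lines. This is only an in-memory illustration of the query shape, not Druid's storage format or API; `filter_group_sum`, the column names, and the sample values are all made up for the example.

```python
from collections import defaultdict
from typing import Dict, List

# Column-oriented layout: one list per column, aligned by row position.
columns: Dict[str, List] = {
    "country": ["FR", "MA", "FR", "MA", "US"],
    "device": ["mobile", "mobile", "desktop", "mobile", "desktop"],
    "clicks": [3, 5, 2, 7, 4],
}


def filter_group_sum(
    columns: Dict[str, List],
    filter_dim: str,
    filter_value: str,
    group_dim: str,
    metric: str,
) -> Dict[str, int]:
    # Only the three columns involved in the query are read.
    totals: Dict[str, int] = defaultdict(int)
    rows = zip(columns[filter_dim], columns[group_dim], columns[metric])
    for dim_value, group_value, amount in rows:
        if dim_value == filter_value:
            totals[group_value] += amount
    return dict(totals)


mobile_clicks = filter_group_sum(columns, "device", "mobile", "country", "clicks")
print(mobile_clicks)
```

Storing data by column means a query touches only the dimensions and metrics it names, which is one reason this layout suits OLAP-style aggregations.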
HDP, Data life cycle and governance
Falcon
• A tool for provisioning and managing Apache Hadoop clusters in the cloud
• Policy-based autoscaling on the major cloud infrastructure platforms, including:
  ▪ Microsoft Azure
  ▪ Amazon Web Services
  ▪ Google Cloud Platform
  ▪ OpenStack
  ▪ …

• Apache ZooKeeper is a centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services
• All of these kinds of services are used in some form or another by distributed applications
• Saves time so you don't have to develop your own
• It is fast, reliable, simple and ordered
• Distributed applications can use ZooKeeper to store and mediate updates to important configuration information
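One way a coordination service can mediate updates to shared configuration is to version every entry and reject writes based on a stale version. ZooKeeper keeps a version on each znode for this purpose. The sketch below is a single-process, in-memory stand-in, not the ZooKeeper client API: `ConfigStore`, its methods, and the sample path are hypothetical.

```python
from typing import Dict, Tuple


class ConfigStore:
    """In-memory stand-in for a versioned configuration store."""

    def __init__(self) -> None:
        self._nodes: Dict[str, Tuple[str, int]] = {}

    def create(self, path: str, value: str) -> None:
        if path in self._nodes:
            raise KeyError(f"{path} already exists")
        self._nodes[path] = (value, 0)

    def get(self, path: str) -> Tuple[str, int]:
        # Returns the value together with its current version.
        return self._nodes[path]

    def set(self, path: str, value: str, expected_version: int) -> bool:
        # The write succeeds only if nobody else updated the entry in between.
        _current, version = self._nodes[path]
        if version != expected_version:
            return False
        self._nodes[path] = (value, version + 1)
        return True


store = ConfigStore()
store.create("/app/db_url", "db1:5432")
_value, seen_version = store.get("/app/db_url")
first_write = store.set("/app/db_url", "db2:5432", seen_version)
stale_write = store.set("/app/db_url", "db3:5432", seen_version)
print(first_write, stale_write, store.get("/app/db_url"))
```

The second writer still holds the old version, so its update is refused and it has to re-read the entry before trying again.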
HDP, Tools
Zeppelin