
Hadoop

DATA INGESTION

INTRODUCTION
 Definition
 Data ingestion is the process of obtaining and importing data for
immediate use or storage in a database. To ingest something is to "take
something in or absorb something."
 Data can be streamed in real time or ingested in batches.
 When data is ingested in real time, each data item is imported as it is
emitted by the source.
 When data is ingested in batches, data items are imported in discrete chunks
at periodic intervals of time.
 Note
 An effective data ingestion process begins by prioritizing data sources,
validating individual files and routing data items to the correct destination.
INTRODUCTION CONTD.

 When numerous big data sources exist in diverse formats (the sources may often
number in the hundreds and the formats in the dozens), it can be challenging for
businesses to ingest data at a reasonable speed and process it efficiently in order to
maintain a competitive advantage.
 To that end, vendors offer software programs that are tailored to specific
computing environments or software applications.
 When data ingestion is automated, the software used to carry out the process may also
include data preparation features to structure and organize data so it can be analyzed on the
fly or at a later time by business intelligence (BI) and business analytics (BA) programs.

BIG DATA INGESTION PATTERNS

•  A common pattern that a lot of companies use to populate a Hadoop-based data lake is to
get data from pre-existing relational databases and data warehouses.
•  When planning to ingest data into the data lake, one of the key considerations is
to determine how to organize data and enable consumers to access the data.
•  Hive and Impala provide a data infrastructure on top of Hadoop – commonly referred to as
SQL on Hadoop – that provides structure to the data and the ability to query the data using a
SQL-like language.
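As a minimal sketch of how a consumer might query such a structure, the Java snippet below connects to HiveServer2 over JDBC and runs an aggregate query. The host, port, credentials, and the web_logs table are hypothetical placeholders, not part of the original material.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQuerySketch {
    public static void main(String[] args) throws Exception {
        // HiveServer2 JDBC driver; assumes the server is reachable on the default port.
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        try (Connection con = DriverManager.getConnection(
                     "jdbc:hive2://localhost:10000/default", "hive", "");
             Statement stmt = con.createStatement();
             // "web_logs" is a hypothetical table in the data lake.
             ResultSet rs = stmt.executeQuery(
                     "SELECT status, COUNT(*) FROM web_logs GROUP BY status")) {
            while (rs.next()) {
                System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
            }
        }
    }
}
```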

KEY ASPECTS TO CONSIDER

 Before you start to populate data into, say, Hive databases/schemas and tables, the two
key aspects one would need to consider are:
 Which data storage formats to use when storing data? (HDFS supports a number of file
formats such as SequenceFile, RCFile, ORC File, Avro, Parquet, and
others.)
 What are the optimal compression options for files stored on HDFS? (Examples
include gzip, LZO, Snappy and others.)
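To make these two choices concrete, here is a minimal sketch that writes a SequenceFile to HDFS with block-level Snappy compression using the Hadoop SequenceFile.Writer API. The HDFS path and key/value types are illustrative only, and Snappy assumes the native codec libraries are available on the cluster.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.compress.SnappyCodec;

public class SequenceFileSnappyWriter {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path path = new Path("/data/landing/events.seq");   // hypothetical HDFS path

        // Choose the storage format (SequenceFile) and the compression option
        // (block-compressed Snappy) at write time.
        try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(path),
                SequenceFile.Writer.keyClass(IntWritable.class),
                SequenceFile.Writer.valueClass(Text.class),
                SequenceFile.Writer.compression(
                        SequenceFile.CompressionType.BLOCK, new SnappyCodec()))) {
            writer.append(new IntWritable(1), new Text("first record"));
            writer.append(new IntWritable(2), new Text("second record"));
        }
    }
}
```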

HADOOP DATA INGESTION

•  Today, most data is generated and stored outside of Hadoop, e.g. in relational databases,
plain files, etc. Therefore, data ingestion is the first step to utilize the power of Hadoop.
Various utilities have been developed to move data into Hadoop.

BATCH DATA INGESTION

 The File System Shell includes various shell-like commands, including copyFromLocal and
copyToLocal, that directly interact with HDFS as well as other file systems that Hadoop
supports. Most of the commands in the File System Shell behave like the corresponding Unix
commands. When the data files are ready in the local file system, the shell is a great tool to
ingest data into HDFS in batch. In order to stream data into Hadoop for real-time analytics,
however, we need more advanced tools, e.g. Apache Flume and Apache Chukwa.
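For batch ingestion from code rather than the shell, the Hadoop FileSystem API offers the same operation as copyFromLocal. The sketch below is a minimal example; the local and HDFS paths are hypothetical, and the Configuration is assumed to pick up the cluster's core-site.xml/hdfs-site.xml from the classpath.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BatchCopyToHdfs {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // reads core-site.xml / hdfs-site.xml
        try (FileSystem fs = FileSystem.get(conf)) {
            // Programmatic equivalent of: hadoop fs -copyFromLocal <local> <hdfs>
            fs.copyFromLocalFile(new Path("/data/in/events.log"),   // local source (hypothetical)
                                 new Path("/landing/events.log"));  // HDFS destination (hypothetical)
        }
    }
}
```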

STREAMING DATA INGESTION
 Apache Flume is a distributed, reliable, and available service for efficiently collecting,
aggregating, and moving large amounts of log data into HDFS.
 It has a simple and flexible architecture based on streaming data flows, and it is robust and
fault tolerant, with tunable reliability mechanisms and many failover and recovery mechanisms.
 It uses a simple extensible data model that allows for online analytic applications.
 Flume employs the familiar producer-consumer model. Source is the entity through which data
enters into Flume. Sources either actively poll for data or passively wait for data to be delivered to
them. On the other hand, Sink is the entity that delivers the data to the destination. Flume has
many built-in sources (e.g. log4j and syslogs) and sinks (e.g. HDFS and HBase). Channel is the
conduit between the Source and the Sink.
Sources ingest events into the channel and the sinks drain the channel. Channels allow decoupling of
ingestion rate from drain rate. When data is generated faster than the destination can
handle, the channel size increases.
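As one concrete illustration of the source-channel-sink model, the sketch below uses Flume's embedded agent API, in which the application itself acts as the producer, a memory channel buffers events, and an Avro sink drains them toward a downstream collector. The agent name, collector host, and port are hypothetical; a full standalone agent would normally be wired up through a flume.conf properties file instead.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

import org.apache.flume.agent.embedded.EmbeddedAgent;
import org.apache.flume.event.EventBuilder;

public class FlumeEmbeddedAgentSketch {
    public static void main(String[] args) throws Exception {
        // Channel and sink wiring expressed as properties. The embedded agent
        // supports a memory or file channel and Avro sinks.
        Map<String, String> properties = new HashMap<>();
        properties.put("channel.type", "memory");
        properties.put("channel.capacity", "10000");
        properties.put("sinks", "sink1");
        properties.put("sink1.type", "avro");
        properties.put("sink1.hostname", "collector.example.com"); // hypothetical collector host
        properties.put("sink1.port", "4141");
        properties.put("processor.type", "default");

        EmbeddedAgent agent = new EmbeddedAgent("ingest-agent");
        agent.configure(properties);
        agent.start();

        // The application is the producer: the event enters the channel here,
        // and the Avro sink drains it toward the collector.
        agent.put(EventBuilder.withBody("sample log line", StandardCharsets.UTF_8));

        agent.stop();
    }
}
```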

STREAMING DATA INGESTION

 Apache Chukwa is devoted to large-scale log collection and analysis, built on top of the
MapReduce framework. Beyond data ingestion, Chukwa also includes a flexible and powerful
toolkit for displaying, monitoring, and analyzing results. Unlike Flume, Chukwa is not a
continuous stream processing system but a mini-batch system.
 Apache Kafka and Apache Storm may also be used to ingest streaming data into Hadoop although
they are mainly designed to solve different problems. Kafka is a distributed publish-subscribe
messaging system. It is designed to provide high throughput persistent messaging that’s scalable
and allows for parallel data loads into Hadoop. Storm is a distributed real-time computation
system for use cases such as real-time analytics, online machine learning, continuous computation, etc.
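To show the Kafka side of such a pipeline, here is a minimal Java producer sketch that publishes a log line to a topic. The broker address and the weblogs topic are hypothetical, and a Hadoop-side consumer (for example a Storm topology or a connector) would read the topic and land the data in HDFS.

```java
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class LogEventProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker1:9092");   // hypothetical broker
        props.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Publish one log line to the "weblogs" topic, keyed by source host.
            producer.send(new ProducerRecord<>("weblogs", "host-01", "sample log line"));
        }
    }
}
```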

STRUCTURED DATA INGESTION

 Apache Sqoop is a tool designed to efficiently transfer data between Hadoop and relational
databases. We can use Sqoop to import data from a relational database table into HDFS. The
import process is performed in parallel and thus generates multiple files in the format of
delimited text, Avro, or SequenceFile. In addition, Sqoop generates a Java class that encapsulates
one row of the imported table, which can be used in subsequent MapReduce processing of the
data. Moreover, Sqoop can export the data (e.g. the results of MapReduce processing) back to
the relational database for consumption by external applications or users.
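A hedged sketch of such an import is shown below, assuming Sqoop 1.x is on the classpath and invoking its Java entry point (Sqoop.runTool) with the same arguments the command line would take. The JDBC URL, credentials file, table name, and target directory are all hypothetical.

```java
import org.apache.sqoop.Sqoop;

public class SqoopImportSketch {
    public static void main(String[] args) {
        // Equivalent to running `sqoop import ...` on the command line.
        String[] importArgs = {
                "import",
                "--connect", "jdbc:mysql://dbhost/sales",   // hypothetical source database
                "--username", "etl_user",
                "--password-file", "/user/etl/.db-password",
                "--table", "orders",                        // hypothetical table
                "--target-dir", "/landing/orders",
                "--as-avrodatafile",                        // imported file format
                "--num-mappers", "4"                        // parallel import tasks
        };
        int exitCode = Sqoop.runTool(importArgs);
        System.exit(exitCode);
    }
}
```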

DATA INGESTION TOOLS

 Apache Hive
 Apache Flume
 Apache NiFi
 Apache Sqoop
 Apache Kafka

APACHE FLUME
 A service for streaming logs into Hadoop
 Flume lets Hadoop users ingest high-volume streaming data into HDFS
for storage.
 Specifically, Flume allows users to:
 Stream data
 Ingest streaming data from multiple sources into Hadoop for storage and analysis
 Insulate systems
 Buffer the storage platform from transient spikes, when the rate of incoming data exceeds
the rate at which data can be written to the destination
 Guarantee data delivery
 Flume NG uses channel-based transactions to guarantee reliable message delivery. When
a message moves from one agent to another, two transactions are started, one on the
agent that delivers the event and the other on the agent that receives the event. This
ensures guaranteed delivery semantics
 Scale horizontally
 To ingest new data streams and additional volume as needed
APACHE FLUME

 Enterprises use Flume's powerful streaming capabilities to land data from high-throughput streams
in the Hadoop Distributed File System (HDFS). Typical sources of these streams are application
logs, sensor and machine data, geo-location data and social media. These different types of data
can be landed in Hadoop for future analysis using interactive queries in Apache Hive, or they can
feed business dashboards with ongoing data served by Apache HBase.

EXAMPLE OF FLUME

 Flume is used to log manufacturing operations. When one run of product comes off the line, it
generates a log file about that run. Even if this occurs hundreds or thousands of times per day, the
large volume of log file data can stream through Flume into a tool for same-day analysis with
Apache Storm, or months or years of production runs can be stored in HDFS and analyzed by a
quality assurance engineer using Apache Hive.

FLUME ILLUSTRATION

HOW FLUME WORKS

 Flume’s high-level architecture is built on a streamlined codebase that is easy to use and
extend. The project is highly reliable, without the risk of data loss. Flume also supports
dynamic reconfiguration without the need for a restart, which reduces downtime for its
agents.

COMPONENTS OF FLUME
 Event
 A singular unit of data that is transported by Flume (typically a single log entry)
 Source
 The entity through which data enters into Flume. Sources either actively poll for data or
passively wait for data to be delivered to them. A variety of sources allow data to be collected,
such as log4j logs and syslogs.
 Sink
 The entity that delivers the data to the destination. A variety of sinks allow data to be streamed to
a range of destinations. One example is the HDFS sink that writes events to HDFS.
 Channel
 The conduit between the Source and the Sink. Sources ingest events into the channel and the sinks
drain the channel.
 Agent
 Any physical Java virtual machine running Flume. It is a collection of sources, sinks and
channels.
 Client
 The entity that produces and transmits the Event to the Source operating within the Agent.
COMPONENT INTERACTION
 A flow in Flume starts from the Client.
 The Client transmits the Event to a Source operating within the Agent.
 The Source receiving this Event then delivers it to one or more Channels.
 One or more Sinks operating within the same Agent drain these Channels.
 Channels decouple the ingestion rate from drain rate using the familiar
producer-consumer model of data exchange.
 When spikes in client-side activity cause data to be generated faster than the provisioned
destination capacity can handle, the Channel size increases. This allows Sources to continue
normal operation for the duration of the spike.
 The Sink of one Agent can be chained to the Source of another Agent.
This chaining enables the creation of complex data flow topologies.
 Note
 Flume’s distributed architecture requires no central coordination point. Each agent runs
independently of the others with no inherent single point of failure, and Flume can easily
scale horizontally.
APACHE NIFI

 Apache NiFi is a secure integrated platform for real-time data collection, simple event
processing, transport and delivery from source to storage. It is useful for moving distributed data
to and from your Hadoop cluster. NiFi has extensive distributed processing capability to help
reduce processing cost, get real-time insights from many different data sources across many
large systems, and aggregate that data into a single place or many different places.
 NiFi lets users get the most value from their data. Specifically, NiFi allows users to:
 Stream data from multiple sources
 Collect high volumes of data in real time
 Guarantee delivery of data
 Scale horizontally across many machines
HOW NIFI WORKS
 NiFi’s high-level architecture is focused on delivering a streamlined interface
that is easy to use and easy to set up.
 Basic Terminology
 Processor: Processors in NiFi are what make the data move. Processors can help
generate data, run commands, move data, convert data, and much more.
NiFi’s architecture and feature set are designed to be extended through these processors.
They are at the very core of NiFi’s functionality.
 Processing Group: When data flows get very complex, it can be very useful to
group together different parts which perform certain functions. NiFi abstracts this
concept and calls these groups processing groups.
 FlowFile: A FlowFile in NiFi represents a single piece of data. It is made up of
two parts: attributes and content. Attributes are key-value pairs that give the data
context. Typically there are three attributes which are present on all FlowFiles:
uuid, filename, and path.
 Connections and Relationships: NiFi allows users to simply drag and drop
connections between processors, which control how the data will flow. Each
connection is assigned one or more relationships for the FlowFiles
(such as successful processing, or a failure to process).
WORKING

•  A FlowFile can originate from a processor in NiFi. Processors can also receive FlowFiles
and transmit them to many other processors. These processors can then drop the data in the
FlowFile into various places depending on the function of the processor.
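To ground the Processor, FlowFile, and Relationship terminology, below is a minimal custom-processor sketch against the NiFi processor API. The processor name and the ingest.timestamp attribute are hypothetical, and a real processor would also declare tags, documentation, and property descriptors.

```java
import java.util.Collections;
import java.util.Set;

import org.apache.nifi.flowfile.FlowFile;
import org.apache.nifi.processor.AbstractProcessor;
import org.apache.nifi.processor.ProcessContext;
import org.apache.nifi.processor.ProcessSession;
import org.apache.nifi.processor.Relationship;
import org.apache.nifi.processor.exception.ProcessException;

public class AddIngestTimestamp extends AbstractProcessor {

    // Relationship that downstream connections can subscribe to.
    public static final Relationship REL_SUCCESS = new Relationship.Builder()
            .name("success")
            .description("FlowFiles tagged with an ingest timestamp")
            .build();

    @Override
    public Set<Relationship> getRelationships() {
        return Collections.singleton(REL_SUCCESS);
    }

    @Override
    public void onTrigger(ProcessContext context, ProcessSession session) throws ProcessException {
        // Pull the next FlowFile queued on an incoming connection, if any.
        FlowFile flowFile = session.get();
        if (flowFile == null) {
            return;
        }
        // Attributes are key-value pairs that give the data context.
        flowFile = session.putAttribute(flowFile, "ingest.timestamp",
                String.valueOf(System.currentTimeMillis()));
        // Route the FlowFile to the "success" relationship.
        session.transfer(flowFile, REL_SUCCESS);
    }
}
```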

WHAT YOU NEED
 Oracle VirtualBox virtual machine (VM).

 ODBC driver that matches the version of Excel you are using (32-bit or 64-bit).
 Power View feature in Excel 2013 to visualize the server log data.
 Power View is currently only available in Microsoft Office Professional Plus and
Microsoft Office 365 Professional Plus.

 Hortonworks DataFlow (HDF) installed on the Sandbox, so you’ll need to download the
latest HDF release.

WHAT NIFI LOOKS LIKE

IMPORT FLOW IN NIFI

THE FLOW LOOKS LIKE THIS

VERIFYING THE IMPORT

Thank You
