Assignment 1
4. Discuss the role and need of Apache Flume in ingesting Big Data into Hadoop.
5. Discuss 3 components of Flume agent.
Solution 1:-
ETL is a process that extracts the data from different source systems, then transforms the
data (like applying calculations, concatenations, etc.) and finally loads the data into the Data
Warehouse system. Full form of ETL is Extract, Transform and Load.
The ETL process requires active inputs from various stakeholders including developers,
analysts, testers, top executives and is technically challenging. ETL is a recurring activity
(daily, weekly, monthly) of a Data warehouse system and needs to be agile, automated, and
well documented.
Need of ETL:
- It helps companies analyze their business data for taking critical business decisions.
- ETL can answer complex business questions.
- ETL provides a method of moving data from various sources into a data warehouse.
- A well-designed and documented ETL system is almost essential to the success of a Data Warehouse project.
- It allows verification of data transformation, aggregation and calculation rules.
- The ETL process allows sample data comparison between the source and the target system.
- The ETL process can perform complex transformations, and requires an extra area (staging) to store the data.
- ETL helps migrate data into a Data Warehouse, converting various formats and types to adhere to one consistent system.
- ETL is a predefined process for accessing and manipulating source data into the target database.
- ETL offers deep historical context for the business.
- It helps improve productivity because it codifies and reuses data-movement logic without requiring deep technical skills.
ETL Process in Data Warehouses
Step 1) Extraction:
In this step, data is extracted from the source system into the staging area.
Transformations, if any, are done in the staging area so that the performance of the source
system is not degraded. Also, if corrupted data were copied directly from the source into the
Data warehouse database, rollback would be a challenge.
Staging area gives an opportunity to validate extracted data before it moves into the Data
warehouse.
Data warehouses typically need to consolidate data from source systems that differ in:
1. DBMS,
2. Hardware,
3. Operating Systems and Communication Protocols.
Sources could include legacy applications like mainframes, customized applications,
point-of-contact devices like ATMs and call switches, text files, spreadsheets, ERP systems,
and data from vendors and partners, amongst others.
Hence one needs a logical data map before data is extracted and loaded physically. This data
map describes the relationship between source and target data.
Three Data Extraction methods:
1. Full Extraction
2. Partial Extraction - without update notification
3. Partial Extraction - with update notification
Irrespective of the method used, extraction should not affect the performance and response
time of the source systems, which are live production databases. Any slowdown or locking
could affect the company's bottom line.
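The extraction strategies listed above can be sketched with a tiny in-memory database; the table and column names here (`orders`, `updated_at`) are hypothetical, and a real partial extraction would track its watermark persistently:

```python
import sqlite3

# Hypothetical source table with a last-modified column acting as a watermark.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, amount REAL, updated_at TEXT)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, 10.0, "2024-01-01"), (2, 25.0, "2024-01-03"), (3, 40.0, "2024-01-05")],
)

def full_extraction(conn):
    """Full Extraction: pull every row from the source table."""
    return conn.execute("SELECT id, amount, updated_at FROM orders").fetchall()

def partial_extraction(conn, last_extracted):
    """Partial Extraction: pull only rows changed since the last run."""
    return conn.execute(
        "SELECT id, amount, updated_at FROM orders WHERE updated_at > ?",
        (last_extracted,),
    ).fetchall()

print(len(full_extraction(conn)))                   # 3
print(len(partial_extraction(conn, "2024-01-02")))  # 2
```

Partial extraction with update notification works the same way, except the source system pushes the changed rows instead of being polled.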
Some validations (for example, record-count reconciliation and data type checks) are also done during Extraction.
Step 2) Transformation:
Data extracted from the source server is raw and not usable in its original form. Therefore,
it needs to be cleansed, mapped and transformed. In fact, this is the key step where the ETL
process adds value and changes data so that insightful Business Intelligence reports can be
generated.
In this step, you apply a set of functions to the extracted data. Data that does not require
any transformation is called direct move or pass-through data.
In transformation step, you can perform customized operations on data.
For instance, if the user wants sum-of-sales revenue which is not in the database or if the
first name and the last name in a table are in different columns, it is possible to concatenate
them before loading.
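The two transformations just mentioned can be sketched in plain Python; the record layout and column names below are hypothetical:

```python
# Hypothetical extracted rows with split name columns and a sales measure.
rows = [
    {"first_name": "Asha", "last_name": "Rao", "sales": 120.0},
    {"first_name": "Ravi", "last_name": "Iyer", "sales": 80.5},
]

# Transformation 1: concatenate first and last name before loading.
transformed = [
    {"full_name": f"{r['first_name']} {r['last_name']}", "sales": r["sales"]}
    for r in rows
]

# Transformation 2: derive sum-of-sales revenue, which is not in the source.
sum_of_sales = sum(r["sales"] for r in transformed)

print(transformed[0]["full_name"])  # Asha Rao
print(sum_of_sales)                 # 200.5
```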
Step 3) Loading
Types of Loading:
1. Initial Load - populating all the Data Warehouse tables for the first time.
2. Incremental Load - applying ongoing changes periodically.
3. Full Refresh - erasing the contents of one or more tables and reloading fresh data.
Load verification:
- Ensure that the key field data is neither missing nor null.
- Test modelling views based on the target tables.
- Check that combined values and calculated measures are correct.
- Check data in the dimension table as well as the history table.
- Check the BI reports on the loaded fact and dimension tables.
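Two of these load checks can be sketched in Python; the record layout and the pre-computed summary measure below are illustrative only:

```python
# Hypothetical loaded fact rows and a pre-computed summary measure.
loaded = [
    {"order_id": 1, "region": "east", "amount": 50.0},
    {"order_id": 2, "region": "west", "amount": 30.0},
]
summary = {"total_amount": 80.0}

def verify_load(rows, summary):
    # Key field must be neither missing nor null.
    assert all(r.get("order_id") is not None for r in rows), "null/missing key"
    # Calculated measure must agree with the combined detail values.
    assert abs(sum(r["amount"] for r in rows) - summary["total_amount"]) < 1e-9
    return True

print(verify_load(loaded, summary))  # True
```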
Solution 2:-
A data warehouse is a database designed to enable business intelligence activities: it exists
to help users understand and enhance their organization's performance.
It is designed for query and analysis rather than for transaction processing, and usually
contains historical data derived from transaction data, but can include data from other
sources. Data warehouses separate analysis workload from transaction workload and
enable an organization to consolidate data from several sources. This helps in:
1) Maintaining historical records
2) Analyzing the data to gain a better understanding of the business and to improve
the business
The data warehouse acts as an underlying engine used by middleware business intelligence
environments that serve reports, dashboards and other interfaces to end users.
Key characteristics of a data warehouse:
Subject-Oriented ->
Ability to define a data warehouse by subject matter
Integrated ->
Resolve such problems as naming conflicts and inconsistencies
Non-volatile ->
Once entered into the data warehouse, data should not change
Time Variant ->
A data warehouse's focus on change over time
In general, fast query performance with high data throughput is the key to a successful data
warehouse.
Data Warehouse Architectures
Data warehouses and their architectures vary depending upon the specifics of an
organization's situation. Three common architectures are: basic (data warehouse only),
data warehouse with a staging area, and data warehouse with a staging area and data marts.
The figure below shows a simple architecture for a data warehouse. End users directly
access data derived from several source systems through the data warehouse.
In this basic architecture, the metadata and raw data of a traditional OLTP system are
present, as is an additional type of data: summary data. Summaries are a mechanism to
pre-compute common, expensive, long-running operations for sub-second data retrieval. For
example, a typical data warehouse query is to retrieve something such as August sales. In an
Oracle database, a summary is called a materialized view.
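The idea of a summary as a pre-computed result can be sketched with SQLite standing in for the warehouse; the table names are illustrative, and this is a plain summary table rather than Oracle's materialized-view syntax:

```python
import sqlite3

# Hypothetical detail table of sales by month.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (month TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("2024-08", 100.0), ("2024-08", 150.0), ("2024-07", 90.0)])

# The expensive aggregation is computed once and stored as a summary...
conn.execute("""CREATE TABLE sales_summary AS
                SELECT month, SUM(amount) AS total FROM sales GROUP BY month""")

# ...so a typical query such as "August sales" becomes a cheap lookup.
row = conn.execute(
    "SELECT total FROM sales_summary WHERE month = '2024-08'").fetchone()
print(row[0])  # 250.0
```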
The consolidated storage of the raw data as the centre of data warehousing architecture is
often referred to as an Enterprise Data Warehouse (EDW). An EDW provides a 360-degree
view into the business of an organization by holding all relevant business information in the
most detailed format.
One must clean and process the operational data before putting it into the warehouse, as
shown below. One can do this programmatically, although most data warehouses use a
staging area instead. A staging area simplifies data cleansing and consolidation for
operational data coming from multiple source systems, especially for enterprise data
warehouses where all relevant information of an enterprise is consolidated. Figure below
illustrates this typical architecture.
Architecture of a Data Warehouse with a Staging Area
Although the architecture of a data warehouse with Staging Area is quite common, you may
want to customize your warehouse's architecture for different groups within your
organization. You can do this by adding data marts, which are systems designed for a
particular line of business. The figure below illustrates an example where purchasing,
sales, and inventories are separated, and where a financial analyst might want to analyze
historical data for purchases and sales or mine historical data to make predictions about
customer behaviour.
Operational Data Store (ODS):
- The ODS data is cleaned and validated, but it is not historically deep; it may be just the
data for the current day.
- Rather than support the historically rich queries that a data warehouse can handle, the
ODS gives data warehouses a place to access the most current data, which has not yet been
loaded into the data warehouse.
- The ODS may also be used as a source to load the data warehouse.
- As data warehouse loading techniques have become more advanced, data warehouses may have
less need for the ODS as a source for loading data. Instead, constant trickle-feed systems
can load the data warehouse in near real time.
- An ODS is a database designed for queries on transactional data. It is often an interim
or staging area for a data warehouse, but differs in that its contents are updated in the
course of business, whereas a data warehouse contains static data.
- An ODS is designed for performance and numerous queries on small amounts of data, such as
an account balance. A data warehouse is generally designed for elaborate queries on large
amounts of data.
Solution 4:-
Apache Flume is a distributed, reliable, and available service for efficiently collecting,
aggregating, and ingesting high volumes of streaming data into HDFS, HBase, or other storage
systems. It has a simple and flexible architecture based on streaming data flows, and it is
robust and fault tolerant, with tuneable reliability mechanisms for failover and recovery.
These systems act as a buffer between the data producers and the final destination.
Apache Flume writes data to Apache Hadoop and Apache HBase in a reliable and scalable
fashion, and the data it delivers can then be processed by tools such as MapReduce, Hive,
Pig, and Impala.
Need of Flume:
- We could write data directly into HDFS or HBase, but normally there is so much data
processed on a Hadoop cluster that hundreds to thousands of servers are producing it.
- Writing such huge volumes of data directly into HDFS can cause serious problems.
- In HDFS, thousands of files would be written at a time, and blocks for all of them are
allocated by a single NameNode.
- This stresses the servers, introduces network latency, and can sometimes even lose data.
- Scaling Flume is also easier than scaling HDFS or HBase clusters, so we use Flume to
collect the data and send it to HDFS.
- Systems similar to Flume include Apache Kafka and Facebook Scribe.
Event: A byte payload with optional string headers that represents the unit of data Flume
can transport from its point of origination to its final destination.
Flow: Movement of events from the point of origin to their final destination is considered a
data flow, or simply flow. This is not a rigorous definition and is used only at a high level for
description purposes.
Client: An interface implementation that operates at the point of origin of events and
delivers them to a Flume agent. Clients typically operate in the process space of the
application they are consuming data from. For example, Flume Log4j Appender is a client.
Agent: An independent process that hosts Flume components such as sources, channels
and sinks, and thus has the ability to receive, store and forward events to their next-hop
destination.
Solution 5:-
Each Flume agent has three components: source, channel and sink.
- The source gets events into the Flume agent.
- The sink removes events from the agent and forwards them to the next agent in the
topology, or to a terminal destination such as HDFS or HBase.
- The channel acts as a buffer that stores data the source has received until a sink has
successfully written that data out to the next hop or the eventual destination.
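The three components are wired together in an agent's configuration file. The sketch below is a minimal, hypothetical `flume.conf`; the agent name (`a1`), component names (`r1`, `c1`, `k1`), port, and HDFS path are illustrative choices, not fixed values:

```properties
# One agent (a1) with one source, one memory channel, and one HDFS sink.
a1.sources = r1
a1.channels = c1
a1.sinks = k1

# Source: listen on a network port (netcat source, for illustration).
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444
a1.sources.r1.channels = c1

# Channel: in-memory buffer between source and sink.
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000

# Sink: write events out to HDFS.
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = /flume/events
a1.sinks.k1.channel = c1
```

Note that a source lists the channels it writes to, while a sink is attached to exactly one channel.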
Source:
Sources are components that receive data from other applications. There are also sources
that can produce data on their own, but such data is typically used only for testing.
- Sources can listen on one or more network ports to receive data, or can read data from
the local file system.
- A source can write to one or more channels, and it can replicate events to all or some of
those channels.
Channel:
Channels are components that receive data from sources and buffer it until sinks consume it.
- Multiple sources can write data to the same channel, and multiple sinks can read from it.
- However, each event in a channel is consumed by exactly one sink.
Sink:
Sinks read events from a channel and remove them from it.
- They push events to the next hop or the final destination.
- Once the data is safely placed at the next hop or destination, the sink informs the
channel via a transaction commit, and those events are then deleted from the channel.
Figure: Flume agent with multiple flows (event processors)
Sources write data into channels using channel processors, interceptors and channel
selectors.
- Every source has its own channel processor, which takes the events produced by the source
and passes them to one or more interceptors.
- Interceptors read each event and modify or drop it based on some criteria, such as a
regex match.
- Multiple interceptors can be chained; they are called in the order in which they are
defined. This is the chain-of-responsibility design pattern.
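The interceptor chain described above can be sketched in Python; the class names and dictionary-based event representation are illustrative, not Flume's actual Java API:

```python
import re

class RegexFilterInterceptor:
    """Drop events whose body does not match a pattern (returns None to drop)."""
    def __init__(self, pattern):
        self.pattern = re.compile(pattern)
    def intercept(self, event):
        return event if self.pattern.search(event["body"]) else None

class HeaderInterceptor:
    """Modify an event by stamping a header onto it."""
    def __init__(self, key, value):
        self.key, self.value = key, value
    def intercept(self, event):
        event["headers"][self.key] = self.value
        return event

def run_chain(interceptors, events):
    """Call interceptors in the order defined; a None result drops the event."""
    surviving = []
    for event in events:
        for ic in interceptors:
            event = ic.intercept(event)
            if event is None:
                break
        if event is not None:
            surviving.append(event)
    return surviving

chain = [RegexFilterInterceptor(r"ERROR"), HeaderInterceptor("env", "prod")]
events = [{"body": "ERROR disk full", "headers": {}},
          {"body": "INFO all good", "headers": {}}]
kept = run_chain(chain, events)
print(len(kept))                  # 1
print(kept[0]["headers"]["env"])  # prod
```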
- The list of events that survives the interceptor chain is then passed to the channel
selector.
- The selector decides, for each event, which of the channels attached to this source the
event should be written to.
- It applies filtering-like criteria that designate channels as required or optional.
- A failure when writing to a required channel causes the channel processor to throw a
ChannelException, which tells the source to retry the event. A failure on an optional
channel is ignored.
- Once all events are written successfully, the processor indicates success to the source;
the source then acknowledges the client that sent the events and continues to accept more.
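The required-versus-optional channel semantics can be sketched as follows; the class names are illustrative, not Flume's Java API:

```python
class ChannelException(Exception):
    """Raised when an event cannot be written to a channel."""

class Channel:
    def __init__(self, fail=False):
        self.events, self.fail = [], fail
    def put(self, event):
        if self.fail:
            raise ChannelException("channel unavailable")
        self.events.append(event)

def process_event(event, required, optional):
    """Write one event to all selected channels with Flume-like semantics."""
    for ch in required:
        ch.put(event)          # propagates ChannelException: source must retry
    for ch in optional:
        try:
            ch.put(event)
        except ChannelException:
            pass               # failures on optional channels are ignored
    return True                # success is indicated back to the source

req, opt = Channel(), Channel(fail=True)
print(process_event({"body": "e1"}, [req], [opt]))  # True
print(len(req.events))                              # 1
```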