
Medi-Caps University, Indore

Assignment 1

Big Data Engg (CS3ED04) Elective-4

B.Tech. CSE, 6th sem

1. Describe ETL operations with 3 stages of ETL.

2. Describe Data warehouse with its architecture.


3. Differentiate between data warehouse, data mart and operational data store.

4. Discuss the role and need of Apache Flume in the ingestion of Big Data into Hadoop.
5. Discuss the 3 components of a Flume agent.

Solution 1:-
ETL is a process that extracts data from different source systems, transforms the data (for example, by applying calculations and concatenations) and finally loads it into the Data Warehouse system. ETL stands for Extract, Transform and Load.

The ETL process requires active inputs from various stakeholders, including developers, analysts, testers and top executives, and is technically challenging. ETL is a recurring activity (daily, weekly, monthly) of a Data warehouse system and needs to be agile, automated, and well documented.

There are many reasons for adopting ETL in the organization:

 It helps companies to analyze their business data for taking critical business decisions.
 ETL can answer complex business questions.
 ETL provides a method of moving data from various sources into a data warehouse.
 A well-designed and documented ETL system is almost essential to the success of a Data Warehouse project.
 It allows verification of data transformation, aggregation and calculation rules.
 The ETL process allows sample data comparison between the source and the target system.
 The ETL process can perform complex transformations and requires an extra area to store the data.
 ETL helps to migrate data into a Data Warehouse and convert it to various formats and types so that it adheres to one consistent system.
 ETL is a predefined process for accessing and manipulating source data into the target database.
 ETL offers deep historical context for the business.
 It helps to improve productivity because it codifies and reuses processes without a need for technical skills.
ETL Process in Data Warehouses

ETL is a 3-step process ->

Step 1) Extraction:

In this step, data is extracted from the source system into the staging area. Transformations, if any, are done in the staging area so that the performance of the source system is not degraded. Also, if corrupted data were copied directly from the source into the Data warehouse database, rollback would be a challenge.
The staging area gives an opportunity to validate extracted data before it moves into the Data warehouse.

A data warehouse needs to integrate systems that have different:

1. DBMS,
2. Hardware,
3. Operating Systems and Communication Protocols.

Sources could include legacy applications like mainframes, customized applications, point-of-contact devices like ATMs, call switches, text files, spreadsheets, ERP, and data from vendors and partners, amongst others.
Hence one needs a logical data map before data is extracted and loaded physically. This data map describes the relationship between source and target data.
Three Data Extraction methods:

1. Full Extraction
2. Partial Extraction - without update notification
3. Partial Extraction - with update notification

Irrespective of the method used, extraction should not affect the performance and response time of the source systems, which are live production databases. Any slowdown or locking could affect the company's bottom line.
 
Some validations are done during Extraction (a minimal sketch follows this list):

 Reconcile records with the source data
 Make sure that no spam/unwanted data is loaded
 Data type check
 Remove all types of duplicate/fragmented data
 Check whether all the keys are in place or not
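
A minimal, illustrative sketch of this extraction step is given below, assuming a SQLite source with a hypothetical src_orders table; it copies rows into a staging table while applying a few of the validations above (key check, duplicate removal, data type check, count reconciliation). It is only a sketch of the idea, not a production ETL tool.

```python
# Illustrative sketch only: extract rows from a hypothetical source table into a
# staging table, applying some of the validations listed above.
import sqlite3

def extract_to_staging(source_db, staging_db):
    src = sqlite3.connect(source_db)
    stg = sqlite3.connect(staging_db)
    rows = src.execute("SELECT order_id, customer, amount FROM src_orders").fetchall()

    seen, clean = set(), []
    for order_id, customer, amount in rows:
        if order_id is None or order_id in seen:      # key check + duplicate removal
            continue
        if not isinstance(amount, (int, float)):      # simple data type check
            continue
        seen.add(order_id)
        clean.append((order_id, customer, amount))

    stg.execute("CREATE TABLE IF NOT EXISTS stg_orders (order_id, customer, amount)")
    stg.executemany("INSERT INTO stg_orders VALUES (?, ?, ?)", clean)
    stg.commit()
    return len(rows), len(clean)       # reconcile counts against the source
```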

Step 2) Transformation:

Data extracted from the source server is raw and not usable in its original form. Therefore it needs to be cleansed, mapped and transformed. In fact, this is the key step where the ETL process adds value and changes the data such that insightful Business Intelligence reports can be generated.
In this step, you apply a set of functions to the extracted data. Data that does not require any transformation is called direct move or pass-through data.
In the transformation step, you can perform customized operations on data. For instance, if the user wants a sum-of-sales revenue which is not in the database, or if the first name and the last name in a table are in different columns, it is possible to concatenate them before loading.

Following are common Data Integrity Problems:

1. Different spellings of the same person, like Jon, John, etc.
2. There are multiple ways to denote a company name, like Google, Google Inc.
3. Use of different names, like Cleaveland, Cleveland.
4. There may be a case where different account numbers are generated by various applications for the same customer.
5. In some data, required fields remain blank.
6. Invalid product data collected at POS, as manual entry can lead to mistakes.
Validations done during this stage (a sketch follows this list):

 Filtering – select only certain columns to load
 Using rules and lookup tables for data standardization
 Character set conversion and encoding handling
 Conversion of units of measurement, like date-time conversion, currency conversions, numerical conversions, etc.
 Data threshold validation check. For example, age cannot be more than two digits.
 Data flow validation from the staging area to the intermediate tables.
 Required fields should not be left blank.
 Cleaning (for example, mapping NULL to 0, or Gender Male to "M" and Female to "F", etc.)
 Splitting a column into multiple columns and merging multiple columns into a single column.
 Transposing rows and columns.
 Using lookups to merge data.
 Applying any complex data validation (e.g., if the first two columns in a row are empty then automatically reject the row from processing).
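
Below is a minimal, illustrative sketch of a few of these transformation rules applied to one extracted record. The field names (customer_id, first_name, gender, sales, age) are hypothetical and chosen only to mirror the examples above.

```python
# Illustrative sketch only: apply a handful of the transformation rules above.
def transform(record):
    # filtering: keep only the columns the warehouse needs
    out = {k: record.get(k) for k in
           ("customer_id", "first_name", "last_name", "gender", "sales", "age")}

    # complex validation: reject the row if the first two columns are empty
    if not out["customer_id"] and not out["first_name"]:
        return None

    # cleaning: map NULL to 0 and standardize gender codes to "M"/"F"
    out["sales"] = out["sales"] or 0
    out["gender"] = {"Male": "M", "Female": "F"}.get(out["gender"], out["gender"])

    # threshold check: age cannot be more than two digits
    if out["age"] is not None and out["age"] > 99:
        return None

    # merging multiple columns into a single column
    out["full_name"] = "{} {}".format(out.pop("first_name"), out.pop("last_name"))
    return out

print(transform({"customer_id": 7, "first_name": "Jon", "last_name": "Doe",
                 "gender": "Male", "sales": None, "age": 34}))
```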

 
Step 3) Loading

In this step, data is loaded into the target data warehouse database.

In a typical Data warehouse, a huge volume of data needs to be loaded in a relatively short period (nights). Hence, the load process should be optimized for performance.
In case of load failure, recovery mechanisms should be configured to restart from the point of failure without loss of data integrity. Data Warehouse admins need to monitor, resume or cancel loads according to prevailing server performance.

Types of Loading (contrasted in the sketch after this list):

 Initial Load — populating all the Data Warehouse tables for the first time.
 Incremental Load — applying ongoing changes as and when needed, periodically.
 Full Refresh — erasing the contents of one or more tables and reloading them with fresh data.
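
A minimal sketch of how these three load types differ is given below, using an in-memory SQLite table; the table and column names (dw_sales, sale_id, load_date) are hypothetical.

```python
# Illustrative sketch only: initial load vs. incremental load vs. full refresh.
import sqlite3

dw = sqlite3.connect(":memory:")
dw.execute("CREATE TABLE dw_sales (sale_id INTEGER PRIMARY KEY, amount, load_date)")

def initial_load(rows):
    # populate the empty warehouse table once
    dw.executemany("INSERT INTO dw_sales VALUES (?, ?, ?)", rows)

def incremental_load(rows):
    # apply only the rows that arrived or changed since the last run
    dw.executemany("INSERT OR REPLACE INTO dw_sales VALUES (?, ?, ?)", rows)

def full_refresh(rows):
    # erase the table contents and reload with fresh data
    dw.execute("DELETE FROM dw_sales")
    dw.executemany("INSERT INTO dw_sales VALUES (?, ?, ?)", rows)

initial_load([(1, 100.0, "2024-01-01"), (2, 250.0, "2024-01-01")])
incremental_load([(3, 75.0, "2024-01-02")])
print(dw.execute("SELECT COUNT(*) FROM dw_sales").fetchone())   # -> (3,)
```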

Load verification

 Ensure that the key field data is neither missing nor null.
 Test modelling views based on the target tables.
 Check that combined values and calculated measures are correct.
 Run data checks on the dimension tables as well as the history tables.
 Check the BI reports built on the loaded fact and dimension tables.
Solution 2:-
A data warehouse is a database designed to enable business intelligence activities: it exists
to help users understand and enhance their organization's performance. 

It is designed for query and analysis rather than for transaction processing, and usually
contains historical data derived from transaction data, but can include data from other
sources. Data warehouses separate analysis workload from transaction workload and
enable an organization to consolidate data from several sources. This helps in:

1. Maintaining historical records
2. Analyzing the data to gain a better understanding of the business and to improve the business

In addition to a relational database, a data warehouse environment can include an extraction, transportation, transformation, and loading (ETL) solution, statistical analysis, reporting, data mining capabilities, client analysis tools, and other applications that manage the process of gathering data, transforming it into useful, actionable information, and delivering it to business users.

The data warehouse acts as an underlying engine used by middleware business intelligence
environments that serve reports, dashboards and other interfaces to end users.

Characteristics of a data warehouse as set forth by William Inmon:

 Subject-Oriented ->
Ability to define a data warehouse by subject matter
 Integrated ->
Resolve such problems as naming conflicts and inconsistencies
 Non-volatile ->
Once entered into the data warehouse, data should not change
 Time Variant ->
A data warehouse's focus on change over time

Key Characteristics of a Data Warehouse

The key characteristics of a data warehouse are as follows:

 Data is structured for simplicity of access and high-speed query performance.
 End users are time-sensitive and desire speed-of-thought response times.
 Large amounts of historical data are used.
 Queries often retrieve large amounts of data, perhaps many thousands of rows.
 Both predefined and ad hoc queries are common.
 The data load involves multiple sources and transformations.

In general, fast query performance with high data throughput is the key to a successful data
warehouse.
Data Warehouse Architectures
Data warehouses and their architectures vary depending upon the specifics of an
organization's situation. Three common architectures are:

Data Warehouse Architecture: Basic

The figure below shows a simple architecture for a data warehouse. End users directly
access data derived from several source systems through the data warehouse.

Architecture of a Data Warehouse

In this architecture, the metadata and raw data of a traditional OLTP system are present, as is an additional type of data, summary data. Summaries are a mechanism to pre-compute common, expensive, long-running operations for sub-second data retrieval. For example, a typical data warehouse query is to retrieve something such as August sales. A summary in an Oracle database is called a materialized view.

The consolidated storage of the raw data as the centre of data warehousing architecture is
often referred to as an Enterprise Data Warehouse (EDW). An EDW provides a 360-degree
view into the business of an organization by holding all relevant business information in the
most detailed format.
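
As a rough sketch of the summary-data idea (not Oracle's materialized view feature itself), the example below pre-computes a monthly sales summary table from a hypothetical fact_sales detail table, so that an "August sales" query reads a small pre-aggregated table instead of scanning the raw data.

```python
# Illustrative sketch only: simulating a summary / materialized-view refresh with
# a plain pre-aggregated table. Table names (fact_sales, sales_by_month) are hypothetical.
import sqlite3

dw = sqlite3.connect(":memory:")
dw.execute("CREATE TABLE fact_sales (sale_date TEXT, amount REAL)")
dw.executemany("INSERT INTO fact_sales VALUES (?, ?)",
               [("2024-08-01", 120.0), ("2024-08-15", 80.0), ("2024-09-02", 50.0)])

# refresh step: recompute the summary from the raw detail data
dw.execute("""CREATE TABLE sales_by_month AS
              SELECT substr(sale_date, 1, 7) AS month, SUM(amount) AS total_sales
              FROM fact_sales
              GROUP BY substr(sale_date, 1, 7)""")

# the "August sales" query now hits the small summary table
print(dw.execute("SELECT total_sales FROM sales_by_month WHERE month = '2024-08'").fetchone())
```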

Data Warehouse Architecture: with a Staging Area

One must clean and process the operational data before putting it into the warehouse, as
shown below. One can do this programmatically, although most data warehouses use a
staging area instead. A staging area simplifies data cleansing and consolidation for
operational data coming from multiple source systems, especially for enterprise data
warehouses where all relevant information of an enterprise is consolidated. The figure below illustrates this typical architecture.
Architecture of a Data Warehouse with a Staging Area

Data Warehouse Architecture: with a Staging Area and Data Marts

Although the architecture of a data warehouse with a staging area is quite common, you may want to customize your warehouse's architecture for different groups within your organization. You can do this by adding data marts, which are systems designed for a particular line of business. The figure below illustrates an example where purchasing, sales, and inventories are separated into data marts; a financial analyst might want to analyze historical data for purchases and sales, or mine historical data to make predictions about customer behaviour.

Architecture of a Data Warehouse with a Staging Area and Data Marts


Solution 3:-

Data Warehouse vs. Data Mart –


A data mart serves the same role as a data warehouse, but it is intentionally limited in scope.

 It may serve one particular department or line of business.
 The advantage of a data mart versus a data warehouse is that it can be created much faster due to its limited coverage.
 However, data marts also create problems with inconsistency. It takes tight discipline to keep data and calculation definitions consistent across data marts.
 A data warehouse follows a top-down, data-oriented model, while a data mart follows a bottom-up, project-oriented model.
 A data warehouse is a centralised system with light de-normalization, whereas a data mart is a decentralised system with heavy de-normalization.
 Data warehouses are flexible and use a fact constellation schema, while data marts are quite rigid and support snowflake/star schemas.
 Data is contained in detailed form in a DW, while in a DM it is contained in summarised form.

Data Warehouse vs. Operational Data Store –

Operational data stores (ODS) exist to support daily operations.

 The ODS data is cleaned and validated, but it is not historically deep; it may be just the data for the current day.
 Rather than support the historically rich queries that a data warehouse can handle, the ODS gives data warehouses a place to access the most current data, which has not yet been loaded into the data warehouse.
 The ODS may also be used as a source to load the data warehouse.
 As data warehouse loading techniques have become more advanced, data warehouses may have less need for an ODS as a source for loading data. Instead, constant trickle-feed systems can load the data warehouse in near real time.
 An ODS is a database designed for queries on transactional data. It is often an interim or staging area for a data warehouse, but differs in that its contents are updated in the course of business, whereas a data warehouse contains static data.
 An ODS is designed for performance and numerous queries on small amounts of data, such as an account balance. A data warehouse is generally designed for elaborate queries on large amounts of data.

Solution 4:-
Apache Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and ingesting high volumes of streaming data into HDFS, HBase or other storage systems. It has a simple and flexible architecture based on streaming data flows, and is robust and fault tolerant, with tuneable reliability mechanisms for failover and recovery. Such systems act as a buffer between the data producers and the final destination.

Apache Flume is used to write data to Apache Hadoop and Apache HBase in a reliable and scalable fashion; once the data has landed there, it can be processed in any format by tools such as MapReduce, Hive, Pig and Impala.
 
Need of Flume:

 We could write the data directly into HDFS or HBase, but normally there is so much data processed on a Hadoop cluster that there will be hundreds to thousands of large servers producing it.
 Trying to write such huge volumes of data directly into HDFS can cause serious problems.
 In HDFS, thousands of files would be written at a time, and many blocks would have to be allocated by a single NameNode.
 This stresses the servers, introduces network latency and can sometimes even lead to data loss.
 Scaling Flume is also easier than scaling HDFS or HBase clusters, so we use Flume to collect the data and send it to HDFS.

Systems similar to Flume include Apache Kafka and Facebook Scribe.

Core Concepts of Flume NG:

Event: A byte payload with optional string headers that represents the unit of data Flume can transport from its point of origination to its final destination.

Flow: Movement of events from the point of origin to their final destination is considered a
data flow, or simply flow. This is not a rigorous definition and is used only at a high level for
description purposes.
Client: An interface implementation that operates at the point of origin of events and
delivers them to a Flume agent. Clients typically operate in the process space of the
application they are consuming data from. For example, Flume Log4j Appender is a client.

Agent: An independent process that hosts Flume components such as sources, channels and sinks, and thus has the ability to receive, store and forward events to their next-hop destination.
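
As a rough illustration of the Event concept only (this is not the real Flume Java API), an event can be pictured as a byte payload plus optional string headers:

```python
# Illustrative sketch, not the actual Flume API: an event is a byte payload
# plus optional string headers.
from dataclasses import dataclass, field

@dataclass
class Event:
    body: bytes                                    # the byte payload Flume transports
    headers: dict = field(default_factory=dict)    # optional string headers

# a client (for example, a Log4j appender) would hand events like this to an agent's source
evt = Event(body=b"GET /index.html 200", headers={"host": "web01", "type": "access-log"})
print(evt.headers["host"], len(evt.body))
```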
Solution 5:-
Each Flume agent has three components: source, channel and sink.

 The source is used for getting events into the Flume agent.
 The sink is used for removing events from the agent and forwarding them to the next agent in the topology, or to HDFS, HBase, etc.
 The channel acts like a buffer which stores the data that the source has received, until a sink has successfully written that data out to the next hop or the eventual destination.

Source:
Sources are components which receive data from other applications. A few sources can also produce data on their own, but such data is only used for testing purposes.
 They can listen on one or more network ports to receive data, or can read data from the local file system. A source can write to one or more channels, and it can also replicate an event to all or only some of those channels.

Channel:
Channels are components which receive data from sources and buffer it until sinks remove it.
 Multiple sources can write data to the same channel, and multiple sinks can read data from it. However, each sink reads from exactly one channel, and each event is removed from the channel by exactly one sink.

Sink:
Sinks read events from the channel and remove them from it.
 They push the events to the next hop or final destination.
 Once the data is safely placed in the next hop or destination, the sink informs the channel via a transaction commit, and only then are those events deleted from the channel (a rough sketch of the three components follows).
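
Below is a minimal, illustrative sketch of the source/channel/sink roles (not the real Flume API), with a bounded in-memory queue standing in for the channel:

```python
# Illustrative sketch, not the actual Flume API: a queue plays the role of the
# channel, buffering events between a toy source and a toy sink.
import queue

channel = queue.Queue(maxsize=1000)   # buffers events until a sink drains them

def source(events):
    # the source receives events and puts them onto its channel
    for e in events:
        channel.put(e)

def sink(write_out):
    # the sink takes events from its single channel and pushes them to the next
    # hop or final destination; in real Flume the removal is committed only after
    # a successful write
    while not channel.empty():
        e = channel.get()
        write_out(e)
        channel.task_done()

source([b"event-1", b"event-2"])
sink(lambda e: print("delivered", e))
```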
Flume Agent with multiple flows (Event processing)

 Sources write data into channels using channel processors, interceptors and selectors.
 Each source has its own channel processor, which takes the events produced by the source and passes them to one or more interceptors.
 Interceptors read an event and modify or drop it based on some criteria, such as a regex.
 We can have multiple interceptors, which are called in the order in which they are defined. This follows the chain-of-responsibility design pattern.
 The list of events produced by the interceptor chain is then passed to the channel selector. The selector decides which of the channels attached to this source each event should be written to.
 It applies filtering criteria that mark channels as required or optional.
 A failure when writing to a required channel causes the processor to throw a ChannelException, which tells the source to retry the event. A failure on an optional channel is simply ignored.

Once all the events are written successfully, the processor indicates success to the source; the source then sends an acknowledgement to whatever sent it the events and continues to accept more events (a rough sketch of this flow follows).
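
The sketch below illustrates this interceptor-chain and channel-selector flow in plain Python (it is not the real Flume API); the interceptor behaviour, header values and channel names (hdfs_channel, analytics_channel) are hypothetical.

```python
# Illustrative sketch, not the actual Flume API: an interceptor chain followed by
# a channel selector with required and optional channels.
import re

def host_interceptor(event):
    event["headers"]["host"] = "web01"          # enrich the event with a header
    return event

def regex_filter_interceptor(event):
    # drop events whose body does not match a pattern
    return event if re.search(rb"ERROR|WARN", event["body"]) else None

# interceptors are called in the order they are defined (chain of responsibility)
interceptors = [host_interceptor, regex_filter_interceptor]

def channel_selector(event):
    # decide which channels this event goes to: one required, one optional
    required = ["hdfs_channel"]
    optional = ["analytics_channel"] if b"ERROR" in event["body"] else []
    return required, optional

def channel_processor(events, write):
    for event in events:
        for interceptor in interceptors:
            event = interceptor(event)
            if event is None:                   # an interceptor dropped the event
                break
        if event is None:
            continue
        required, optional = channel_selector(event)
        for ch in required:
            write(ch, event)                    # a failure here would make the source retry
        for ch in optional:
            try:
                write(ch, event)
            except Exception:
                pass                            # failures on optional channels are ignored

channel_processor([{"headers": {}, "body": b"ERROR disk full"}],
                  lambda ch, e: print(ch, e["body"]))
```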
