HGrid247 Data Engineering

By Solechoel Arifin
Agenda
- The Importance of Data Engineering

- What is HGrid247 ?

- Why Use HGrid247 ?

- HGrid247 Features

- HGrid247 Implementation
The Importance of Data Engineering
1. Data engineers design and build pipelines that transform and transport data into a format wherein, by the time
it reaches the Data Scientists or other end users, it is in a highly usable state. These pipelines must take data
from many disparate sources and collect them into a single warehouse that represents the data uniformly as a
single source of truth. (Nathan Black - QuantHub )
2. Without data engineering, there would be no data as such, which would bring machine learning and AI to an
end, because these technologies use algorithms that require a lot of data to build. (DataEngi)
3. Data Engineering is the Backbone of Data Science. Data engineers are on the front lines of data strategy.
They are the first people to tackle the influx of structured and unstructured data that enters a company’s
systems. They are the foundation of any data strategy. Without Lego blocks, after all, you can’t build a Lego
castle. (DataQuest)
4. In F1, the driver would be useless without a whole range of engineers and mechanics. If your business only
has BI (Business Intelligence) and MI (Management Information) analysts or Data Scientists, you are asking
the driver to win an F1 race with a Morris Minor – you need a Data Engineer. (Holly Rourke - LinkedIn)
5. Data Engineering Is Critical to Drive Data and Analytics Success. Organizations have heavily invested in hiring
data scientists and business analysts, but without data engineers they struggle to curate a data pipeline or
move data to production. Data engineers make the appropriate data accessible and available to the right users
at the right time. (Gartner report 10/2020)
What is HGrid247 ?
HGrid247 is a multi-platform, drag-and-drop big data engineering
ETL tool for batch and stream processing.
HGrid247 helps users design a data engineering pipeline
(workflow) through a drag-and-drop interface.
From a workflow, HGrid247 can generate code that runs on the
following distributed massive data processing frameworks:

1. MapReduce
2. Spark 1.6 RDD
3. Spark 2.x RDD
4. Spark 2.x Dataset
5. Flink 1.11.0 (Batch & Stream)
6. Apache Beam 2.29.0 (Batch & Stream)
7. Tez 0.8.2
Why Use HGrid247 ?
Easy
HGrid247 offers a drag-and-drop interface. Users design and program ETL workflows visually,
which saves time and reduces the complexity of data integration.
No Coding
Because HGrid247 is a visual tool, users do not need strong Java programming skills or a deep
understanding of distributed programming paradigms. HGrid247 generates the code automatically. This
results in greater team productivity and compatibility.
Robust
All functionality in HGrid247 has been thoroughly tested and is used in production environments.
Simple
Additional functionality can be added easily as a UDF written in Java as a POJO (Plain Old Java Object).
Knowledge of other programming languages or scripting is largely unnecessary.
Supports Many Platforms
Design once, run on multiple distributed data processing platforms.
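To make the "Simple" point concrete, here is a minimal sketch of what a UDF written as a plain Java object might look like. The slides do not show the interface HGrid247 actually expects, so the class name `PhoneNumberUdf`, the method `normalize`, and its signature are hypothetical and serve only as an illustration.

```java
// Hypothetical UDF written as a plain Java object (POJO).
// The class name, method name, and signature are illustrative assumptions;
// the slides do not show the interface HGrid247 actually expects.
class PhoneNumberUdf {

    // Keeps digits only and rewrites a leading "0" as the given country code.
    static String normalize(String raw, String countryCode) {
        if (raw == null) {
            return null;
        }
        String digits = raw.replaceAll("[^0-9]", "");
        if (digits.startsWith("0")) {
            return countryCode + digits.substring(1);
        }
        return digits;
    }

    public static void main(String[] args) {
        System.out.println(normalize("0812-345 678", "62")); // prints 62812345678
    }
}
```

The logic has no dependency on any processing framework, which is what would let the same class be reused unchanged across the generated targets.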
HGrid247 Features
- Data Sources

- Processing Components

- Data Sinks

- Other Features
HGrid247 Features
Data Sources

- File
- RDBMS
- Hive
- HBase
- Solr
- Kafka
HGrid247 Features
Data Source : File

- Semi-Structured Flat File
- Unstructured Flat File
- Avro
- Parquet
- ORC (Optimized Row Columnar)
- XML
- JSON
- ASN.1 (Abstract Syntax Notation One)
- MS Word
- MS Excel
- MDB
- SAS
- PDF
- Image
- Shapefile (shp)
HGrid247 Features
Processing Components

- Transformation
- Filtering
- Grouping
- Aggregation
- Merging
- Profiling
- Combiner (speeds up grouping and aggregation)
- Buffer (processing of a grouped and sorted collection)
- Record Duplication Check
- Join (with shuffling)
- Reference Join (join without shuffling)
HGrid247 Features
Processing Component : Transformation

- String (Text) Processing
- Number Processing (Math)
- Date Processing
- String (Text) Similarity
- Spatial Processing
- Image Processing
- Transpose Operation
- Security of Data
- JSON and XML Operation
- Lookup Operation
HGrid247 Features
Processing Component : Filtering

- Single Collection Output Filter
- Two Collections Output Filter
- Simple Clause Filter
- Complex Clause Filter
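
The slides do not define these filter variants. One plausible reading is that a two-collections output filter routes each record to either a "matched" or a "rejected" collection, and that a complex clause is several simple clauses combined. The sketch below follows that assumption; `TwoOutputFilterSketch`, `Split`, and `split` are hypothetical names, not HGrid247 APIs.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

// Sketch of a filter with two output collections (assumed semantics).
class TwoOutputFilterSketch {

    // Holds both outputs: records that satisfy the clause and those that do not.
    static class Split<T> {
        final List<T> matched = new ArrayList<>();
        final List<T> rejected = new ArrayList<>();
    }

    // Routes every input record to exactly one of the two outputs.
    static <T> Split<T> split(List<T> input, Predicate<T> clause) {
        Split<T> out = new Split<>();
        for (T record : input) {
            if (clause.test(record)) {
                out.matched.add(record);
            } else {
                out.rejected.add(record);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // A "complex clause" built from two simple clauses.
        Predicate<Integer> big = x -> x > 5;
        Predicate<Integer> even = x -> x % 2 == 0;
        Split<Integer> result = split(Arrays.asList(3, 12, 8, 20), big.and(even));
        System.out.println(result.matched + " / " + result.rejected);
    }
}
```

A single collection output filter would correspond to keeping only `matched` and discarding the rest.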
HGrid247 Features
Processing Component : Grouping

- Grouping Single or Multiple Fields

- Sorting Single or Multiple Fields (Ascending or Descending)


HGrid247 Features
Processing Component : Aggregation

- Basic Aggregation Operation

- Condition Aggregation Operation

- Calculation Aggregation Operation


HGrid247 Features
Processing Component : Join (with shuffling)

- Inner Join

- Left Join

- Right Join

- Outer Join
HGrid247 Features
Processing Component : Reference Join
(join without shuffling)
- Inner Join

- Left Join
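
A common way to join without shuffling is to keep a small reference table in memory on every worker and look each record up locally, so the large collection never has to be repartitioned by key. Whether HGrid247 implements Reference Join this way is an assumption; the sketch below only illustrates the idea, and `ReferenceJoinSketch` and `join` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of a join against an in-memory reference table (no shuffle needed).
class ReferenceJoinSketch {

    // Each record is {key, value}; the reference maps key -> description.
    // leftJoin = false: inner join, unmatched records are dropped.
    // leftJoin = true:  left join, unmatched records keep an empty last field.
    static List<String> join(List<String[]> records,
                             Map<String, String> reference,
                             boolean leftJoin) {
        List<String> out = new ArrayList<>();
        for (String[] rec : records) {
            String ref = reference.get(rec[0]);
            if (ref != null) {
                out.add(rec[0] + "," + rec[1] + "," + ref);
            } else if (leftJoin) {
                out.add(rec[0] + "," + rec[1] + ",");
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, String> reference = new HashMap<>();
        reference.put("A", "Alpha");
        List<String[]> records = new ArrayList<>();
        records.add(new String[] {"A", "1"});
        records.add(new String[] {"B", "2"});
        System.out.println(join(records, reference, false));
        System.out.println(join(records, reference, true));
    }
}
```

Only inner and left variants fit this scheme naturally, because a right or outer join would have to report reference rows that no record matched, which a single worker cannot know on its own. That is consistent with the shorter list of join types on this slide.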
HGrid247 Features
Processing Component : Merging

- Merging two or more Collections


HGrid247 Features
Processing Component : Profiling

- Using Basic Aggregation

- Using Condition Aggregation


HGrid247 Features
Processing Component : Combiner

- Basic Combiner Operation

- Condition Combiner Operation

- Calculation Combiner Operation
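
In MapReduce-style engines, a combiner speeds up grouping and aggregation by pre-aggregating each partition locally, so that only one partial result per key is sent across the network. The sketch below shows that idea for a simple count. It is not HGrid247 code, and the names `CombinerSketch`, `combine`, and `reduce` are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of combiner-style partial aggregation followed by a final merge.
class CombinerSketch {

    // Runs locally on one partition: counts keys before anything is shuffled.
    static Map<String, Long> combine(List<String> partitionKeys) {
        Map<String, Long> partial = new HashMap<>();
        for (String key : partitionKeys) {
            partial.merge(key, 1L, Long::sum);
        }
        return partial;
    }

    // Runs after the shuffle: merges the per-partition partial counts.
    static Map<String, Long> reduce(List<Map<String, Long>> partials) {
        Map<String, Long> total = new HashMap<>();
        for (Map<String, Long> partial : partials) {
            partial.forEach((key, count) -> total.merge(key, count, Long::sum));
        }
        return total;
    }

    public static void main(String[] args) {
        List<Map<String, Long>> partials = new ArrayList<>();
        partials.add(combine(Arrays.asList("a", "a", "b")));
        partials.add(combine(Arrays.asList("a", "c")));
        System.out.println(reduce(partials));
    }
}
```

This only works for aggregations whose partial results can be merged, such as count, sum, min, and max; an average would need to carry a sum and a count separately.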


HGrid247 Features
Processing Component : Buffer
- Basic Buffer Aggregation Operation
- Ranking Operation
- Getting First Record
- Getting Last Record
- Getting FirstNextChange Record
- Getting Route
- Getting Pareto
- Getting Distance From Previous Record
- Getting Value from Previous Record
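
Buffer operations work on one group at a time after its records have been sorted, which is what makes rank and "previous record" lookups possible. The following is a minimal sketch of two of the operations listed above (ranking and getting the value from the previous record), assuming a single group of integer values sorted in descending order. `BufferSketch` and `rankWithPrevious` are hypothetical names, and the output layout is chosen only for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Sketch of buffer-style processing over one grouped and sorted collection.
class BufferSketch {

    // Emits "value,rank,previousValue" for each record of the group.
    // Rank is the 1-based position after sorting, so ties get distinct ranks.
    // The first record has no predecessor, so its last field is empty.
    static List<String> rankWithPrevious(List<Integer> group) {
        List<Integer> sorted = new ArrayList<>(group);
        sorted.sort(Comparator.reverseOrder());
        List<String> out = new ArrayList<>();
        Integer previous = null;
        for (int i = 0; i < sorted.size(); i++) {
            Integer value = sorted.get(i);
            String prevField = (previous == null) ? "" : previous.toString();
            out.add(value + "," + (i + 1) + "," + prevField);
            previous = value;
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(rankWithPrevious(Arrays.asList(5, 9, 7)));
        // prints [9,1,, 7,2,9, 5,3,7]
    }
}
```

Getting the first or last record, or the distance from the previous record, follows the same single pass over the sorted group.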
HGrid247 Features
Processing Component: Record Duplication Check

- Using Hash Member Check Method

- Using Bloom Filter Member Check Method
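
The two methods trade accuracy for memory. A hash-set check is exact but stores every key, while a Bloom filter stores only a fixed-size bit array: it never misses a real duplicate, but it can occasionally flag a new key as a duplicate. The sketch below illustrates both under simplifying assumptions (string keys, hash positions derived from `String.hashCode`). It is not HGrid247's implementation, and all class names are hypothetical.

```java
import java.util.BitSet;
import java.util.HashSet;
import java.util.Set;

// Sketch of two membership-check strategies for duplicate detection.
class DuplicateCheckSketch {

    // Exact check: remembers every key, so memory grows with the data.
    static class HashCheck {
        private final Set<String> seen = new HashSet<>();

        boolean isDuplicate(String key) {
            return !seen.add(key); // add() returns false if the key was present
        }
    }

    // Approximate check: fixed memory, possible false positives,
    // but no false negatives.
    static class BloomCheck {
        private final BitSet bits;
        private final int size;
        private final int hashes;

        BloomCheck(int size, int hashes) {
            this.bits = new BitSet(size);
            this.size = size;
            this.hashes = hashes;
        }

        boolean isDuplicate(String key) {
            int h1 = key.hashCode();
            int h2 = (h1 >>> 16) | 1; // odd step derived from the same hash
            boolean allSet = true;
            for (int i = 0; i < hashes; i++) {
                int index = Math.floorMod(h1 + i * h2, size);
                if (!bits.get(index)) {
                    allSet = false;
                    bits.set(index); // record the key while checking it
                }
            }
            return allSet;
        }
    }

    public static void main(String[] args) {
        BloomCheck bloom = new BloomCheck(1024, 3);
        System.out.println(bloom.isDuplicate("order-1"));
        System.out.println(bloom.isDuplicate("order-1"));
    }
}
```

A production filter would size the bit array and the number of hash functions from the expected record count and the acceptable false-positive rate, and would use stronger hash functions than this sketch.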


HGrid247 Features
Data Sinks

- Flat File (text file)
- Avro
- ORC (Optimized Row Columnar)
- Kafka
- HBase
- XML
- Parquet
- RDBMS
- Hive
- Image
HGrid247 Features
Other Features

- Enabling Data Lineage on Atlas for MapReduce Engine

- Creating Narrative Documentation of data processing pipeline (workflow)
HGrid247 Implementation

- Telecommunication Industry

- Government

- Banking

- Education Institution
Thank You