Data Architecture and Integration

Maestría en Tecnología Informática y de
Comunicaciones (TIC) - UADE
October 2021
Module Organization - Lectures
1. Introduction to Big Data (22/10)

2. Minimum Viable Products (23/10) Start of the Final Evaluation Work
3. Data Architecture & Integration (29/10)

4. Data Viz + Data Science with Simulation Model (30/10) Lecture by Nahuel Romera
5. Data Science and Machine Learning (05/11) Lecture by Dr. Tomas Tecce
6. Case/Guided discussion (06/11) Final Evaluation Work Cont'd
7. The Chief Data Officer: strategy and governance (12/11) Final Evaluation Work Cont'd
8. Effective Communications: Data-Driven Storytelling (13/11) Lecture by Mg. Giuliana Marsili
9. Closing discussion and Final Evaluation (19/11)
GonzaloZarza.me
Agenda
1. A few initial words about data practices

Roles, teams and modern dilemmas
2. Deep dive into Data Architecture

Processing
Storage
Deployment
3. Introduction to Data Integration

How does it work?
Common Scenarios and Tools
4. Open Discussion and Questions
5. References
Introduction to Data
Practices & Profiles
Back to Module Objective - Goals
Data: Strategic business asset
Tools: Architecture & Tech Ecosystem
Skills: How to profit from data
● Understand each building block, basic concepts and tools.
● Develop a data-driven mindset to nurture more data-educated decision making.
● Review common pitfalls when dealing with data, and how to avoid them.
Teams and Roles
The data cycle:
1. Generation
2. Loading, Staging
3. Storage
4. Processing
5. Exploitation, Modelling
6. Decision, Action
[Figure: the six-stage data cycle (Generation; Loading, Staging; Storage; Processing; Exploitation, Modelling; Decision, Action) surrounded by the roles Visualization Specialist, Integration Specialist, Strategist, Engineer, Scientist and Architect.]
[Figure: matrix of roles (Engineer, Architect, Scientist, Integration Specialist, Visualization Specialist, plus Strategist and Chief Data Officer) against the six stages (1 Generation, 2 Loading/Staging, 3 Storage, 4 Processing, 5 Exploitation/Modelling, 6 Decision/Action). Legend: Awareness, Understanding, Expertise.]
Deep dive into
Data Architecture
Back to the (Big) Data Evolution
First Speciation: SQL Platforms
Second Speciation: Batch Processing
Third Speciation: NRT Processing
Fourth Speciation: AI Platforms
*Source: "La Ingeniería del Big Data, Cómo trabajar con datos", Juan José López Murphy, Gonzalo Zarza; Editorial UOC, 2017.
Data Architecture Practice - Modern Dilemmas
Batch Processing vs Stream Processing
On Premise vs Cloud
Intro to Data Architecture
Processing
Data Architecture - Batch Processing
"A batch processing system takes a large amount of input data, runs a job to process it, and produces some output data. Jobs often take a while (from a few minutes to several days), so there normally isn’t a user waiting for the job to finish. Instead, batch jobs are often scheduled to run periodically (for example, once a day). The primary performance measure of a batch job is usually throughput [...]"
- Martin Kleppmann, Designing Data-Intensive Applications (2017)
*Quote source: "Designing data-intensive applications: The big ideas behind reliable, scalable, and maintainable systems", Kleppmann, M. O'Reilly Media, 2017.
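As a minimal sketch of the batch model described in the quote, assuming a hypothetical day of sales records held in memory: the whole bounded input is processed in a single job, and throughput (records per second) is reported as the performance measure. The names `run_batch_job` and `records` are illustrative and do not come from any particular framework.

```python
import time
from collections import defaultdict


def run_batch_job(records):
    """Process a complete, bounded input in one go; return (totals, throughput)."""
    start = time.perf_counter()
    totals = defaultdict(float)
    for store, amount in records:
        totals[store] += amount
    elapsed = time.perf_counter() - start
    # Throughput (records per second) is the usual performance measure of a batch job.
    throughput = len(records) / elapsed if elapsed > 0 else float("inf")
    return dict(totals), throughput


# Hypothetical input: one day's worth of (store, amount) sales records.
records = [("north", 10.0), ("south", 5.5), ("north", 2.5), ("east", 7.0)]
totals, throughput = run_batch_job(records)
print(totals)
```

In a real deployment a scheduler would typically trigger such a job periodically (for example, once a day), and nobody would wait interactively for the result.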
*Image source: https://atomiv.org/knowledgebase/big-data/batch-processing

Data Architecture - Batch Processing - Hadoop
Initial Google's PageRank algorithm
As of 2021-10-27, the current stable release of Apache Hadoop is 3.3.1, released on 2021-06-15.
Divide and Conquer
Hadoop Ecosystem
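The divide-and-conquer idea behind Hadoop's MapReduce model can be illustrated with the classic word count. This is a rough single-process sketch in plain Python, not Hadoop's API: the function names `map_phase`, `shuffle` and `reduce_phase` and the sample `chunks` are assumptions made for illustration.

```python
from collections import defaultdict


def map_phase(chunk):
    """Map: emit a (word, 1) pair for every word in one chunk of the input."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: combine the values of each key into a single result."""
    return {key: sum(values) for key, values in groups.items()}


# Divide: the input is split into independent chunks (hypothetical sample data).
chunks = ["big data big ideas", "data pipelines", "big pipelines"]

# Conquer: each chunk is mapped independently, then the partial results are merged.
mapped = [pair for chunk in chunks for pair in map_phase(chunk)]
word_counts = reduce_phase(shuffle(mapped))
print(word_counts)
```

Because each chunk is mapped without looking at the others, the map step could run on different machines in parallel; only the shuffle needs to bring related keys together.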
Data Architecture - NRT (Stream) Processing
"[...] we can run the processing more frequently, say, processing a second’s worth of data at the end of every second, or even continuously, abandoning the fixed time slices entirely and simply processing every event as it happens. That is the idea behind stream processing.
In general, a “stream” refers to data that is incrementally made available over time."
- Martin Kleppmann, Designing Data-Intensive Applications (2017)
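To make the contrast with batch processing concrete, here is a small sketch that handles every event as it arrives and emits an updated result immediately. It assumes a hypothetical sensor feed; `event_source` uses a short finite list as a stand-in for an unbounded stream, and both function names are illustrative.

```python
from typing import Iterable, Iterator, Tuple


def event_source() -> Iterator[Tuple[str, float]]:
    """Hypothetical source; a short finite list stands in for an unbounded stream."""
    yield from [("sensor-a", 20.0), ("sensor-b", 22.0), ("sensor-a", 24.0)]


def running_average(events: Iterable[Tuple[str, float]]) -> Iterator[Tuple[str, float]]:
    """Handle every event as it arrives and emit an updated average right away."""
    count = {}
    total = {}
    for key, value in events:
        count[key] = count.get(key, 0) + 1
        total[key] = total.get(key, 0.0) + value
        yield key, total[key] / count[key]


results = list(running_average(event_source()))
for key, average in results:
    print(key, average)
```

The generator never needs the complete input: state is kept per key and a result is available after each event, which is the property that matters for near-real-time use.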
*Image source: https://atomiv.org/knowledgebase/big-data/stream-processing

*Images source: "La Ingeniería del Big Data, Cómo trabajar con datos", Juan José López Murphy, Gonzalo Zarza; Editorial UOC, 2017.
Data Architecture - NRT (Stream) Processing - Spark
*Images Source: https://databricks.com/spark/about

● A DAG is a graph in which all the edges are directed, such that it is impossible to find a node and follow a sequence of edges that eventually loops back to the same node.
● [...] the nodes of a DAG can be put into a linear sequence with the nodes given an “ordering”.
● DAGs are used in project management to plan, design, and implement complex projects or tasks.
*Quote source: https://www.capgemini.com/gb-en/2020/10/introducing-directed-acyclic-graphs-and-their-use-cases/

*Image source: https://databricks.com/blog/2015/06/22/understanding-your-spark-application-through-visualization.html
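The two DAG properties quoted above (no path loops back to its start, and the nodes admit a linear ordering) can be checked with Python's standard `graphlib` module (Python 3.9+). The `stages` graph is a made-up example of a processing job and `has_cycle` is a hypothetical helper; none of this is Spark's API.

```python
from graphlib import CycleError, TopologicalSorter  # standard library, Python 3.9+

# Hypothetical job graph: each stage maps to the set of stages it depends on.
stages = {
    "load": set(),
    "clean": {"load"},
    "aggregate": {"clean"},
    "report": {"aggregate", "clean"},
}

# Because the graph is acyclic, its nodes can be put into a linear order
# in which every stage appears after all of its dependencies.
order = list(TopologicalSorter(stages).static_order())
print(order)


def has_cycle(graph):
    """Return True when the graph is not a DAG (some path loops back)."""
    try:
        list(TopologicalSorter(graph).static_order())
        return False
    except CycleError:
        return True
```

A scheduler can run stages in that order, or run in parallel any stages whose dependencies are already complete.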
Specialized tools
Storage
Data Architecture - Storage - NoSQL
We define NoSQL databases as all modern databases that cannot be, or currently are not, used through a relational schema.
[Figure: NoSQL store types and their typical use cases. Recoverable labels: agile development; flexible data models; "Too many types. E.g.: corporate areas"; relations between entities (social graphs); complex models; flexible business logic; semi-structured data; high volumes; OLAP; analytics; not for updates.]
*Image source: https://docs.microsoft.com/es-es/dotnet/architecture/cloud-native/relational-vs-nosql-data

Deployment
Data Architecture - Deployment
On Premise vs Cloud
Data Architecture - On Premise
Data Architecture - On Premise - Bonus Track
https://top500.org/
https://top500.org/lists/top500/2021/06/
https://top500.org/lists/green500/2021/06/
The total number of nodes in Fugaku is 158,976. A single CPU makes up a node, and two CPUs (two nodes) are mounted on a board called the CPU Memory Unit (CMU). Eight CMUs make up a "Bunch of Blades (BoB)," which means each BoB has 16 nodes. Three BoBs make up a Shelf, and therefore each Shelf has 48 nodes. Eight Shelves (384 nodes) are installed in a computer rack (some racks have 192 nodes). Fugaku is made up of 432 racks, of which 396 racks have 384 nodes, and 36 racks have 192 nodes. This makes a total of 158,976 nodes.
*Images source: https://www.r-ccs.riken.jp/en/fugaku/
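The node count can be verified step by step from the figures in the description. The constant names below are illustrative; the numbers are the ones quoted above.

```python
# Figures taken from the Fugaku description above.
NODES_PER_CMU = 2      # two CPUs (nodes) per CPU Memory Unit
CMUS_PER_BOB = 8       # eight CMUs per Bunch of Blades
BOBS_PER_SHELF = 3     # three BoBs per Shelf
SHELVES_PER_RACK = 8   # eight Shelves per fully populated rack

nodes_per_bob = NODES_PER_CMU * CMUS_PER_BOB
nodes_per_shelf = nodes_per_bob * BOBS_PER_SHELF
nodes_per_full_rack = nodes_per_shelf * SHELVES_PER_RACK

FULL_RACKS = 396            # racks with 384 nodes
SMALL_RACKS = 36            # racks with 192 nodes
NODES_PER_SMALL_RACK = 192

total_nodes = FULL_RACKS * nodes_per_full_rack + SMALL_RACKS * NODES_PER_SMALL_RACK
print(nodes_per_bob, nodes_per_shelf, nodes_per_full_rack, total_nodes)
```

The intermediate values come out as 16, 48 and 384 nodes, and the rack totals (152,064 plus 6,912) add up to the stated 158,976.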

Data Architecture - Cloud
IaaS - PaaS - SaaS
Data Architecture - Cloud - Bonus Track
*Images source: https://aws.amazon.com/about-aws/global-infrastructure/global_network/ - https://www.statista.com/chart/18819/worldwide-market-share-of-leading-cloud-infrastructure-service-providers/

Data Architecture - On Premise vs Cloud
*Source: "IaaS vs. PaaS vs. SaaS Cloud Models (Differences & Examples)" - https://www.hostingadvice.com/how-to/iaas-vs-paas-vs-saas/
IaaS: Cloud infrastructure services, known as Infrastructure as a Service (IaaS), are made of highly scalable and automated compute resources.
PaaS: Cloud platform services, or PaaS, provide cloud components to certain software while being used mainly for applications. PaaS delivers a framework for developers that they can build upon and use to create customized applications.
SaaS: Also known as cloud application services, SaaS represents the most commonly utilized option for businesses in the cloud market. SaaS utilizes the internet to deliver applications, which are managed by a third-party vendor, to its users.
Useful Tips
Data Architecture - Common Scenarios and Useful Tips
● Do we need to process several data sources?

● Do we have time restrictions to provide results (solutions)?
● Do we need to deliver Business (Near) Real Time solutions?
● What about high-availability? Fault-tolerance?
● How do we plan to store and access our data? Is it going to be read or write intensive? Both?
● Do we need to design a long-term data storage strategy?
Open discussion
& Questions
Introduction to
Data Integration
Data Integration
How does it work?
Data Integration - Common Scenarios and Tools
● Are we going to deal with legacy (storage) systems?

● Do we need to buffer data (remember the Velocity dimension from the Big Data definition)?
● Are there different QoS that should be considered or handled?
● Is the input data quality good enough? Is the data format correct?
● Do we want (or need) to transform the data before starting the processing stages?
Main (types of) tools:
● Pentaho provides big data tools to extract, prepare and blend your data, plus the visualizations and analytics that will change the way you run your business.
● Apache Sqoop is a tool designed for efficiently transferring bulk data between Apache Hadoop and structured datastores such as relational databases.
● [Old] Apache Kafka is a fast, scalable, durable, and fault-tolerant publish-subscribe messaging system. Kafka is often used in place of traditional message brokers like JMS and AMQP because of its higher throughput, reliability and replication.
● [New] Apache Kafka is a distributed streaming platform.
● Logstash is a tool for managing events and logs. You can use it to collect logs, parse them, and store them for later use (like, for searching). If you store them in Elasticsearch, you can view and analyze them with Kibana.
Intro to Data Integration
Kafka
Data Integration - Kafka
Kafka is used for building real-time data pipelines and streaming apps. It is horizontally scalable, fault-tolerant, wicked fast, and runs in production in thousands of companies.
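The publish-subscribe idea behind Kafka can be illustrated with a toy in-memory log. This is only a conceptual sketch, not the Kafka client API: the class `TopicLog`, its methods and the consumer names are hypothetical. It shows producers appending to a log while each consumer tracks its own read position.

```python
from collections import defaultdict


class TopicLog:
    """Toy in-memory stand-in for one topic: an append-only list of messages."""

    def __init__(self):
        self._messages = []
        self._offsets = defaultdict(int)  # next position to read, per consumer

    def publish(self, message):
        """Producer side: append a message and return its offset."""
        self._messages.append(message)
        return len(self._messages) - 1

    def poll(self, consumer_id):
        """Consumer side: return unread messages and advance this consumer's offset."""
        start = self._offsets[consumer_id]
        batch = self._messages[start:]
        self._offsets[consumer_id] = len(self._messages)
        return batch


clicks = TopicLog()
clicks.publish({"user": "u1", "page": "/home"})
clicks.publish({"user": "u2", "page": "/pricing"})

# Two independent subscribers each receive every message.
analytics_batch = clicks.poll("analytics")
audit_batch = clicks.poll("audit")

clicks.publish({"user": "u1", "page": "/checkout"})
analytics_next = clicks.poll("analytics")  # only the new message
```

Keeping the messages in a log and the offsets per consumer is what lets slow and fast consumers read the same data independently, which also covers the buffering question raised earlier.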
Pentaho
Data Integration - Pentaho
Pentaho tightly couples data integration with business analytics in a modern platform that brings together
IT and business users to easily access, visualize and explore all data that impacts business results.
Pentaho Data Integration, codenamed Kettle, consists of a core data integration (ETL) engine, and GUI
applications that allow the user to define data integration jobs and transformations. It supports
deployment on single node computers as well as on a cloud, or cluster.
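To make the ETL idea concrete, here is a minimal extract-transform-load sketch using only the Python standard library. It is not Pentaho or Kettle, where jobs and transformations are defined through the GUI tools; the CSV content, the `sales` table and the function names are all assumptions made for illustration.

```python
import csv
import io
import sqlite3

# Hypothetical source data; in practice this would be a file or a legacy system.
RAW_CSV = """customer,amount,currency
alice, 10.50 ,usd
bob,,usd
carol,7.25,USD
"""


def extract(text):
    """Extract: read rows from the CSV source."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: drop rows without an amount and normalize the rest."""
    clean = []
    for row in rows:
        amount = row["amount"].strip()
        if not amount:
            continue  # input quality check: skip incomplete records
        clean.append((row["customer"].strip(), float(amount), row["currency"].strip().upper()))
    return clean


def load(rows, connection):
    """Load: write the transformed rows into the target table."""
    connection.execute("CREATE TABLE sales (customer TEXT, amount REAL, currency TEXT)")
    connection.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)
    connection.commit()


connection = sqlite3.connect(":memory:")
load(transform(extract(RAW_CSV)), connection)
loaded = connection.execute(
    "SELECT customer, amount, currency FROM sales ORDER BY customer"
).fetchall()
print(loaded)
```

Keeping the three steps as separate functions mirrors how integration tools chain independent transformations, so each step can be tested or replaced on its own.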
Logstash
Data Integration - Logstash
Open source data collection engine with real-time pipelining capabilities. Logstash can dynamically unify data from disparate sources and normalize the data into destinations of your choice. Any type of event can be enriched and transformed with a broad array of input, filter, and output plugins, with many native codecs further simplifying the ingestion process.
*Image source: https://www.elastic.co/logstash/

Data Integration - Logstash
Logstash supports a variety of inputs that pull in events from a multitude of common sources, all at the same time.
It dynamically transforms and prepares input data regardless of format or complexity:
● Derive structure from unstructured data with grok
● Decipher geo coordinates from IP addresses
● Anonymize PII data, exclude sensitive fields completely
● Ease overall processing, independent of the data source, format, or schema.
It has a variety of outputs that let us route data where needed, giving us the flexibility to unlock a slew of downstream use cases. Logstash has a pluggable framework featuring over 200 plugins.
*Image source: https://www.elastic.co/logstash/
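The input, filter and output stages can be imitated in a few lines of Python. This is a sketch of the concept only, not Logstash configuration syntax: the log format, the regular expression (a stand-in for a grok pattern) and the function names are assumptions.

```python
import json
import re

# Grok-like pattern: named groups pull structure out of an unstructured line.
# The log format is a hypothetical example.
LINE_PATTERN = re.compile(
    r"(?P<client_ip>\d{1,3}(?:\.\d{1,3}){3}) "
    r"(?P<user_email>\S+@\S+) "
    r"(?P<method>[A-Z]+) "
    r"(?P<path>\S+) "
    r"(?P<status>\d{3})"
)


def parse(line):
    """Filter step 1: derive structured fields from a raw log line."""
    match = LINE_PATTERN.fullmatch(line.strip())
    return match.groupdict() if match else None


def anonymize(event):
    """Filter step 2: mask part of the IP and drop the e-mail field completely."""
    cleaned = dict(event)
    cleaned["client_ip"] = cleaned["client_ip"].rsplit(".", 1)[0] + ".0"
    cleaned.pop("user_email", None)
    cleaned["status"] = int(cleaned["status"])
    return cleaned


def pipeline(lines):
    """Input -> filter -> output: yield one JSON document per valid line."""
    for line in lines:
        event = parse(line)
        if event is not None:
            yield json.dumps(anonymize(event), sort_keys=True)


raw_lines = [
    "203.0.113.42 ana@example.com GET /index.html 200",
    "not a valid log line",
]
output = list(pipeline(raw_lines))
print(output)
```

Lines that do not match the pattern are skipped here for brevity; a production pipeline would more likely tag them and route them to a separate destination for inspection.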

References
● La Ingeniería del Big Data, Cómo trabajar con datos. Juan José López Murphy, Gonzalo Zarza. Editorial UOC, 2017.
● Designing data-intensive applications: The big ideas behind reliable, scalable, and maintainable systems. Kleppmann, M. O'Reilly Media, 2017.
● Hadoop, The Definitive Guide, 4th Edition. Tom White. O’Reilly Media / Yahoo Press, 2015.
● Hadoop in Action, 2nd Edition. Chuck Lam. Manning, 2016.
Any questions?
Thank you!
Gonzalo Zarza
gzarza@uade.edu.ar
