
Data Architecture and Integration
Maestría en Tecnología Informática y de
Comunicaciones (TIC) - UADE
October 2021
Module Organization - Lectures

1. Introduction to Big Data (22/10)


2. Minimum Viable Products (23/10) Start of the Final Evaluation Work

3. Data Architecture & Integration (29/10)


4. Data Viz + Data Science with Simulation Model (30/10) Lecture by Nahuel Romera

5. Data Science and Machine Learning (05/11) Lecture by Dr. Tomas Tecce

6. Case/Guided discussion (06/11) Final Evaluation Work Cont'd

7. The Chief Data Officer: strategy and governance (12/11) Final Evaluation Work Cont'd

8. Effective Communications: Data-Driven Storytelling (13/11) Lecture by Mg. Giuliana Marsili

9. Closing discussion and Final Evaluation (19/11)

GonzaloZarza.me
Agenda

1. A few initial words about data practices


Roles, teams and modern dilemmas

2. Deep dive into Data Architecture


Processing
Storage
Deployment

3. Introduction to Data Integration


How does it work?
Common Scenarios and Tools

4. Open Discussion and Questions

5. References

Introduction to Data
Practices & Profiles
Back to Module Objective - Goals

Data: Strategic business asset
Tools: Architecture & Tech Ecosystem
Skills: How to profit from data

● Understand each building block, basic concepts, and tools.


● Develop a data-driven mindset to nurture a more data-educated decision making.
● Review common pitfalls when dealing with data, and how to avoid them.

Teams and Roles Skills

[Diagram: six stages arranged as a cycle around DATA]

1. Generation
2. Loading, Staging
3. Storage
4. Processing
5. Exploitation, Modelling
6. Decision, Action
Teams and Roles Skills

[Diagram: the same six-stage cycle around DATA, surrounded by the roles involved]

Roles: Visualization Specialist, Integration Specialist, Strategist, Engineer, Scientist, Architect
Teams and Roles Skills

[Table: level of involvement of each role in each of the six stages]

Roles (columns): Engineer, Architect, Scientist, Integration Specialist, Visualization Specialist
Roles shown alongside the table: Strategist, Chief Data Officer
Stages (rows): 1 Generation, 2 Loading/Staging, 3 Storage, 4 Processing, 5 Exploitation/Modelling, 6 Decision/Action
Legend (levels): Awareness, Understanding, Expertise
Deep dive into
Data Architecture
Back to the (Big) Data Evolution Tools

First Speciation: SQL Platforms
Second Speciation: Batch Processing
Third Speciation: NRT (Near Real-Time) Processing
Fourth Speciation: AI Platforms

*Source: "La Ingeniería del Big Data, Cómo trabajar con datos", Juan José López Murphy, Gonzalo Zarza; Editorial UOC, 2017.
Data Architecture Practice - Modern Dilemmas Tools

Batch Processing vs Stream Processing

On Premise vs Cloud
Intro to Data Architecture
Processing
Data Architecture - Batch Processing

A batch processing system takes a large amount of input data, runs a job to process
it, and produces some output data. Jobs often take a while (from a few minutes to
several days), so there normally isn’t a user waiting for the job to finish. Instead,
batch jobs are often scheduled to run periodically (for example, once a day). The
primary performance measure of a batch job is usually throughput [...]
- Martin Kleppmann, Designing Data-Intensive Applications (2017)

*Quote source: "Designing data-intensive applications: The big ideas behind reliable, scalable, and maintainable systems", Kleppmann, M. O'Reilly Media, 2017.
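
As a rough illustration of the quote, here is a minimal Python sketch of a batch job: it takes a bounded input, processes all of it in one run, and reports throughput. The sales records and function names are hypothetical and only serve the example; a real batch job would read from files or a distributed store and be launched by a scheduler.

```python
import time
from collections import defaultdict

# Hypothetical input: one day's worth of (customer, amount) sales records.
SALES = [
    ("alice", 30.0),
    ("bob", 12.5),
    ("alice", 7.5),
    ("carol", 50.0),
]


def run_batch_job(records):
    """Process the whole bounded input in one run and return the output."""
    totals = defaultdict(float)
    for customer, amount in records:
        totals[customer] += amount
    return dict(totals)


def measure_throughput(records, job):
    """Run the job and return (output, records processed per second)."""
    start = time.perf_counter()
    output = job(records)
    elapsed = time.perf_counter() - start
    if elapsed > 0:
        rate = len(records) / elapsed
    else:
        rate = float("inf")
    return output, rate


daily_totals, throughput = measure_throughput(SALES, run_batch_job)
print(daily_totals)
print(f"{throughput:.0f} records/second")
```

Nobody waits on the result interactively; in practice a scheduler would trigger the job periodically (for example, once a day).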
Data Architecture - Batch Processing

[Figure: batch processing overview]

*Image source: https://atomiv.org/knowledgebase/big-data/batch-processing
Data Architecture - Batch Processing - Hadoop

Google's initial PageRank algorithm

As of 2021-10-27, the current stable release of Apache Hadoop is 3.3.1, released on 2021-06-15.
Data Architecture - Batch Processing - Hadoop

Divide Conquer
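
"Divide" and "conquer" correspond to the map and reduce phases of MapReduce, the processing model Hadoop started with. The sketch below imitates those phases in a single Python process; it is not the Hadoop API, and the documents and function names are made up for the example.

```python
from collections import defaultdict


def map_phase(document):
    """Divide: emit a (word, 1) pair for every word in one document."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group all emitted values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Conquer: combine all values of one key into a single result."""
    return key, sum(values)


def word_count(documents):
    mapped = (pair for doc in documents for pair in map_phase(doc))
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


DOCS = ["big data big ideas", "data pipelines move data"]
counts = word_count(DOCS)
print(counts)
```

In a real cluster each document (or file block) would be mapped on a different node, and the shuffle would move data across the network.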

Data Architecture - Batch Processing

Hadoop Ecosystem
Data Architecture - NRT (Stream) Processing

[...] we can run the processing more frequently - say, processing a second’s worth of
data at the end of every second - or even continuously, abandoning the fixed time
slices entirely and simply processing every event as it happens. That is the idea
behind stream processing.
In general, a “stream” refers to data that is incrementally made available over time.
- Martin Kleppmann, Designing Data-Intensive Applications (2017)

*Quote source: "Designing data-intensive applications: The big ideas behind reliable, scalable, and maintainable systems", Kleppmann, M. O'Reilly Media, 2017.
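
The two options in the quote can be sketched in plain Python: one function closes a window every second (very small batches), the other updates its result on every single event. The events, their format, and the function names are hypothetical, and the sketch assumes events arrive ordered by timestamp.

```python
from collections import Counter

# Hypothetical events: (timestamp in seconds, page visited).
EVENTS = [(0.2, "home"), (0.7, "cart"), (1.1, "home"), (1.5, "home"), (2.4, "cart")]


def tumbling_window_counts(events, window_seconds=1.0):
    """Emit (window_start, counts) each time a fixed time slice closes."""
    current_window = None
    counts = Counter()
    for timestamp, key in events:
        window = int(timestamp // window_seconds)
        if current_window is not None and window != current_window:
            yield current_window * window_seconds, dict(counts)
            counts = Counter()
        current_window = window
        counts[key] += 1
    # Flush the last window; only reached because this demo input is finite.
    if current_window is not None:
        yield current_window * window_seconds, dict(counts)


def running_counts(events):
    """Process every event as it happens and emit the updated count."""
    counts = Counter()
    for _, key in events:
        counts[key] += 1
        yield key, counts[key]


windows = list(tumbling_window_counts(EVENTS))
updates = list(running_counts(EVENTS))
print(windows)
print(updates)
```

Both functions are generators, so they consume the input incrementally instead of waiting for the whole dataset, which is the key difference from a batch job.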
Data Architecture - NRT (Stream) Processing

[Figure: stream processing overview]

*Image source: https://atomiv.org/knowledgebase/big-data/stream-processing
Data Architecture - NRT (Stream) Processing - Spark

*Images source: https://databricks.com/spark/about
Data Architecture - NRT (Stream) Processing - Spark

● A DAG is a graph in which all the edges are directed, such that it is impossible to find a node and follow a sequence of edges that eventually loops back to the same node
● [...] the nodes of a DAG can be put into a linear sequence with the nodes given an “ordering”
● DAGs are used in project management to plan, design, and implement complex projects or tasks.

*Quote source: https://www.capgemini.com/gb-en/2020/10/introducing-directed-acyclic-graphs-and-their-use-cases/
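
The "linear sequence" property is what lets an engine such as Spark decide in which order to run the stages of a job. A minimal sketch with Python's standard `graphlib` module (Python 3.9+) follows; the pipeline stages are hypothetical and this is not how Spark itself is invoked.

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical pipeline: each stage maps to the set of stages it depends on.
PIPELINE = {
    "read": set(),
    "clean": {"read"},
    "aggregate": {"clean"},
    "join": {"clean"},
    "write": {"aggregate", "join"},
}


def execution_order(dependencies):
    """Return the stages in an order that respects every dependency."""
    return list(TopologicalSorter(dependencies).static_order())


def is_dag(dependencies):
    """A dependency graph is a DAG only if it contains no cycle."""
    try:
        execution_order(dependencies)
        return True
    except CycleError:
        return False


order = execution_order(PIPELINE)
print(order)
print(is_dag({"a": {"b"}, "b": {"a"}}))
```

Several valid orders may exist ("aggregate" and "join" are independent of each other), which is also what allows those stages to run in parallel.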


*Image source: https://databricks.com/blog/2015/06/22/understanding-your-spark-application-through-visualization.html
Data Architecture - NRT (Stream) Processing

Specialized tools
Intro to Data Architecture
Storage
Data Architecture - Storage - NoSQL
We define NoSQL databases as all modern databases that cannot be, or currently are not, used through a relational schema.

Typical use cases, one group per database family shown in the image:
● Agile development; flexible data models; too many types (e.g., corporate areas)
● Relations between entities (social graphs)
● Complex models; flexible business logic; semi-structured data; high volumes
● OLAP; analytics; not for updates!

*Image source: https://docs.microsoft.com/es-es/dotnet/architecture/cloud-native/relational-vs-nosql-data
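
To make "not used through a relational schema" concrete, the sketch below keeps semi-structured records in a tiny in-memory collection where documents do not need to share the same fields. It is a toy stand-in written for this example, not the API of any particular NoSQL product; all names and records are hypothetical.

```python
import json

# Hypothetical in-memory "collection": documents need not share the same fields.
collection = []


def insert(document):
    """Store a JSON-safe copy of the document; no schema is enforced."""
    collection.append(json.loads(json.dumps(document)))


def find(**criteria):
    """Return the documents whose top-level fields match all criteria."""
    return [
        doc for doc in collection
        if all(doc.get(field) == value for field, value in criteria.items())
    ]


insert({"id": 1, "name": "Ana", "email": "ana@example.com"})
insert({"id": 2, "name": "Luis", "follows": [1], "tags": ["admin"]})
insert({"id": 3, "name": "Eva", "address": {"city": "Buenos Aires"}})

admins = [doc["name"] for doc in collection if "admin" in doc.get("tags", [])]
print(admins)
print(find(name="Eva"))
```

In a relational design each new field would require a schema change (or extra tables); here the cost moves to the reader, who must cope with fields that may be missing.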


Intro to Data Architecture
Deployment
Data Architecture - Deployment

On Premise vs Cloud

Data Architecture - On Premise
Data Architecture - On Premise - Bonus Track

https://top500.org/
https://top500.org/lists/top500/2021/06/
https://top500.org/lists/green500/2021/06/
Data Architecture - On Premise - Bonus Track

The total number of nodes in Fugaku is 158,976.

A single CPU makes up a node, and two CPUs (two nodes) are mounted on a board called the CPU Memory Unit (CMU). Eight CMUs make up a "Bunch of Blades (BoB)," which means each BoB has 16 nodes. Three BoBs make up a Shelf, and therefore each Shelf has 48 nodes. Eight Shelves (384 nodes) are installed in a computer rack (some racks have 192 nodes). Fugaku is made up of 432 racks, of which 396 racks have 384 nodes, and 36 racks have 192 nodes. This makes a total of 158,976 nodes.

*Images source: https://www.r-ccs.riken.jp/en/fugaku/
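
The node count can be checked by following the hierarchy step by step. The short Python sketch below only redoes the arithmetic with the figures quoted above; the variable names are ours.

```python
# Figures taken from the description above; names are illustrative.
NODES_PER_CMU = 2        # two CPUs (nodes) per CPU Memory Unit
CMUS_PER_BOB = 8         # eight CMUs per Bunch of Blades
BOBS_PER_SHELF = 3       # three BoBs per Shelf
SHELVES_PER_RACK = 8     # eight Shelves per full rack

FULL_RACKS = 396         # racks with all eight Shelves
HALF_RACKS = 36          # racks with 192 nodes

nodes_per_bob = NODES_PER_CMU * CMUS_PER_BOB
nodes_per_shelf = nodes_per_bob * BOBS_PER_SHELF
nodes_per_rack = nodes_per_shelf * SHELVES_PER_RACK
nodes_per_half_rack = nodes_per_rack // 2

total_nodes = FULL_RACKS * nodes_per_rack + HALF_RACKS * nodes_per_half_rack

print(nodes_per_bob, nodes_per_shelf, nodes_per_rack)
print(FULL_RACKS + HALF_RACKS, total_nodes)
```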


Data Architecture - Cloud

IaaS - PaaS - SaaS
Data Architecture - Cloud - Bonus Track

*Images source:
https://aws.amazon.com/about-aws/global-infrastructure/global_network/
https://www.statista.com/chart/18819/worldwide-market-share-of-leading-cloud-infrastructure-service-providers/
Data Architecture - On Premise vs Cloud

*Source: "IaaS vs. PaaS vs. SaaS Cloud Models (Differences & Examples)" - https://www.hostingadvice.com/how-to/iaas-vs-paas-vs-saas/
IaaS: Cloud infrastructure services, known as Infrastructure as a Service (IaaS), are made of highly scalable and automated compute resources.

PaaS: Cloud platform services, or PaaS, provide cloud components to certain software while being used mainly for applications. PaaS delivers a framework for developers that they can build upon and use to create customized applications.

SaaS: Also known as cloud application services, SaaS represents the most commonly utilized option for businesses in the cloud market. SaaS utilizes the internet to deliver applications, which are managed by a third-party vendor, to its users.
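
One common way to compare on premise, IaaS, PaaS and SaaS is to ask who manages each layer of the stack. The Python sketch below encodes a widely used simplification of that split; the layer names and the exact boundaries are assumptions for illustration, and real offerings vary by provider.

```python
# Stack layers from top (closest to the user) to bottom (closest to hardware).
# This layering is an assumed, commonly used simplification.
LAYERS = [
    "applications", "data", "runtime", "middleware", "operating system",
    "virtualization", "servers", "storage", "networking",
]

# How many layers, counted from the top, the customer still manages.
CUSTOMER_MANAGED = {"on premise": 9, "iaas": 5, "paas": 2, "saas": 0}


def responsibilities(model):
    """Split the stack into customer-managed and provider-managed layers."""
    split = CUSTOMER_MANAGED[model.lower()]
    return {"customer": LAYERS[:split], "provider": LAYERS[split:]}


for model in ("On Premise", "IaaS", "PaaS", "SaaS"):
    print(model, responsibilities(model)["customer"])
```

The further right we move (IaaS, PaaS, SaaS), the fewer layers remain our responsibility, at the price of less control over them.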
Intro to Data Architecture
Useful Tips
Data Architecture - Common Scenarios and Useful Tips

● Do we need to process several data sources?
● Do we have time restrictions to provide results (solutions)?
● Do we need to deliver Business (Near) Real Time solutions?
● What about high-availability? Fault-tolerance?
● How do we plan to store and access our data? Is it going to be read or write intensive? Both?
● Do we need to design a long-term data storage strategy?
Open discussion
& Questions
Introduction to
Data Integration
Data Integration
How does it work?
Data Integration - Common Scenarios and Tools

● Are we going to deal with legacy (storage) systems?
● Do we need to buffer data (remember the Velocity dimension from the Big Data definition)?
● Are there different quality-of-service (QoS) levels that should be considered or handled?
● Is the input data quality good enough? Is the data format correct?
● Do we want (or need) to transform the data before starting the processing stages?

Main (types of) tools:

● Pentaho provides big data tools to extract, prepare and blend your data, plus the visualizations and analytics that will change the way you run your business.
● Apache Sqoop is a tool designed for efficiently transferring bulk data between Apache Hadoop and structured datastores such as relational databases.
● [Old] Apache Kafka is a fast, scalable, durable, and fault-tolerant publish-subscribe messaging system. Kafka is often used in place of traditional message brokers like JMS and AMQP because of its higher throughput, reliability and replication.
● [New] Apache Kafka is a distributed streaming platform.
● Logstash is a tool for managing events and logs. You can use it to collect logs, parse them, and store them for later use (like, for searching). If you store them in Elasticsearch, you can view and analyze them with Kibana.
Intro to Data Integration
Kafka
Data Integration - Kafka

Kafka is used for building real-time data pipelines and streaming apps. It is horizontally scalable, fault-tolerant, wicked fast, and runs in production in thousands of companies.
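
A useful mental model for Kafka is an append-only log that producers write to and that each consumer group reads at its own pace by tracking an offset. The toy Python class below illustrates only that idea; it is not the Kafka client API and leaves out partitions, replication and persistence. All names are hypothetical.

```python
class Topic:
    """Toy in-memory topic: an append-only log plus per-group read offsets."""

    def __init__(self):
        self._log = []        # messages in arrival order, never modified
        self._offsets = {}    # consumer group -> next offset to read

    def publish(self, message):
        """Append a message and return its offset in the log."""
        self._log.append(message)
        return len(self._log) - 1

    def poll(self, group, max_messages=10):
        """Return the next unread messages for this group and advance its offset."""
        start = self._offsets.get(group, 0)
        batch = self._log[start:start + max_messages]
        self._offsets[group] = start + len(batch)
        return batch


orders = Topic()
orders.publish({"order": 1})
orders.publish({"order": 2})

first_billing_batch = orders.poll("billing")
orders.publish({"order": 3})
second_billing_batch = orders.poll("billing")
analytics_batch = orders.poll("analytics")   # a new group starts from offset 0

print(first_billing_batch, second_billing_batch, analytics_batch)
```

Because messages stay in the log after being read, the log also works as a buffer: a slow or newly added consumer can catch up later without affecting the producers.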
Intro to Data Integration
Pentaho
Data Integration - Pentaho

Pentaho tightly couples data integration with business analytics in a modern platform that brings together
IT and business users to easily access, visualize and explore all data that impacts business results.

Pentaho Data Integration, codenamed Kettle, consists of a core data integration (ETL) engine and GUI
applications that allow the user to define data integration jobs and transformations. It supports
deployment on single-node computers as well as on a cloud or a cluster.
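
In Pentaho these jobs and transformations are defined through the GUI. Purely to illustrate what an ETL (extract, transform, load) flow does, here is a small Python sketch using only the standard library; the CSV content, table name and cleaning rules are hypothetical and unrelated to Pentaho's own tooling.

```python
import csv
import io
import sqlite3

# Hypothetical raw input with typical quality problems (spaces, bad values, mixed case).
RAW_CSV = """customer,amount,currency
alice, 30.00 ,usd
bob,not-a-number,usd
carol,50.5,USD
"""


def extract(text):
    """Extract: read the raw rows from the source."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: clean the values and drop rows that fail the quality check."""
    clean = []
    for row in rows:
        try:
            amount = float(row["amount"].strip())
        except ValueError:
            continue  # skip rows whose amount is not a number
        clean.append((row["customer"].strip(), amount, row["currency"].strip().upper()))
    return clean


def load(rows, connection):
    """Load: write the cleaned rows into the target table."""
    connection.execute(
        "CREATE TABLE IF NOT EXISTS sales (customer TEXT, amount REAL, currency TEXT)"
    )
    connection.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)
    connection.commit()


connection = sqlite3.connect(":memory:")
load(transform(extract(RAW_CSV)), connection)
loaded = connection.execute(
    "SELECT customer, amount, currency FROM sales ORDER BY customer"
).fetchall()
print(loaded)
```

The transform step is where the data-quality and data-format questions get answered before the processing stages start.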
Intro to Data Integration
Logstash
Data Integration - Logstash

Open source data collection engine with real-time pipelining capabilities. Logstash can dynamically unify data from disparate sources and normalize the data into destinations of your choice. Any type of event can be enriched and transformed with a broad array of input, filter, and output plugins, with many native codecs further simplifying the ingestion process.

*Image source: https://www.elastic.co/logstash/


Data Integration - Logstash

Logstash supports a variety of inputs that pull in events from a multitude of common sources, all at the same time.

It dynamically transforms and prepares input data regardless of format or complexity:
● Derive structure from unstructured data with grok
● Decipher geo coordinates from IP addresses
● Anonymize PII data, exclude sensitive fields completely
● Ease overall processing, independent of the data source, format, or schema.

It has a variety of outputs that let us route data where needed, giving us the flexibility to unlock a slew of downstream use cases. Logstash has a pluggable framework featuring over 200 plugins.

*Image source: https://www.elastic.co/logstash/
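
Two of the filter ideas above, deriving structure from an unstructured line and anonymizing sensitive fields, can be imitated in a few lines of Python. The regular expression plays the role that a grok pattern plays in Logstash; the log format, field names and hashing choice are assumptions made for this example and are not Logstash configuration.

```python
import hashlib
import re

# Hypothetical access-log format: <ip> <user> "<METHOD> <path>" <status>
LOG_PATTERN = re.compile(
    r'(?P<client_ip>\d{1,3}(?:\.\d{1,3}){3}) '
    r'(?P<user>\S+) '
    r'"(?P<method>[A-Z]+) (?P<path>\S+)" '
    r'(?P<status>\d{3})'
)


def parse(line):
    """Derive a structured event (dict) from one unstructured log line."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    event = match.groupdict()
    event["status"] = int(event["status"])
    return event


def anonymize(event, hashed=("client_ip",), dropped=("user",)):
    """Drop sensitive fields completely and replace others with a short hash."""
    clean = {key: value for key, value in event.items() if key not in dropped}
    for field in hashed:
        if field in clean:
            digest = hashlib.sha256(clean[field].encode()).hexdigest()
            clean[field] = digest[:12]
    return clean


line = '203.0.113.7 jdoe "GET /cart" 200'
event = anonymize(parse(line))
print(event)
```

A truncated hash is used here only to keep the example short; real anonymization requirements (salting, key management, legal constraints) need a proper review.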


References

● La Ingeniería del Big Data, Cómo trabajar con datos. Juan José López Murphy, Gonzalo Zarza. Editorial UOC, 2017.
● Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems. Martin Kleppmann. O'Reilly Media, 2017.
● Hadoop: The Definitive Guide, 4th Edition. Tom White. O'Reilly Media / Yahoo Press, 2015.
● Hadoop in Action, 2nd Edition. Chuck Lam. Manning, 2016.
Any questions?
Thank you!

Gonzalo Zarza
gzarza@uade.edu.ar
