and Integration
Maestría en Tecnología Informática y de
Comunicaciones (TIC) - UADE
October 2021
Module Organization - Lectures
5. Data Science and Machine Learning (05/11) Lecture by Dr. Tomas Tecce
7. The Chief Data Officer: strategy and governance (12/11) Final Evaluation Work Cont'd
GonzaloZarza.me
Agenda
5. References
Introduction to Data
Practices & Profiles
Back to Module Objective - Goals
Teams and Roles Skills
DATA
1. GENERATION
2. LOADING, STAGING
3. STORAGE
4. PROCESSING
5. EXPLOITATION, MODELLING
6. DECISION, ACTION
Teams and Roles Skills
Roles: Visualization Specialist, Integration Specialist, Strategist, Engineer, Scientist, Architect
Stages: 1. GENERATION; 2. LOADING, STAGING; 3. STORAGE; 4. PROCESSING; 5. EXPLOITATION, MODELLING; 6. DECISION, ACTION
Teams and Roles Skills
Roles: Engineer, Architect, Scientist, Integration Specialist, Visualization Specialist, Strategist
Stage: 2. Loading/Staging
Deep dive into
Data Architecture
Back to the (Big) Data Evolution Tools
First Speciation: SQL Platforms
Second Speciation: Batch Processing
Third Speciation: NRT Processing
Fourth Speciation: AI Platforms
*Source: "La Ingeniería del Big Data, Cómo trabajar con datos", Juan José López Murphy, Gonzalo Zarza; Editorial UOC, 2017. GonzaloZarza.me
Data Architecture Practice - Modern Dilemmas Tools
Batch Processing vs. Stream Processing
Data Architecture Practice - Modern Dilemmas Tools
Batch Processing vs. Stream Processing
On Premise vs. Cloud
Intro to Data Architecture
Processing
Data Architecture - Batch Processing
A batch processing system takes a large amount of input data, runs a job to process
it, and produces some output data. Jobs often take a while (from a few minutes to
several days), so there normally isn’t a user waiting for the job to finish. Instead,
batch jobs are often scheduled to run periodically (for example, once a day). The
primary performance measure of a batch job is usually throughput [...]
—Martin Kleppmann, Designing Data-Intensive Applications (2017)
*Quote source: "Designing data-intensive applications: The big ideas behind reliable, scalable, and maintainable systems", Kleppmann, M. O'Reilly Media, 2017.
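The quoted definition can be illustrated with a minimal Python sketch. It assumes a hypothetical job that counts page views per user over a complete, bounded input and reports throughput as records per second. The function name, the input data, and the metric calculation are illustrative assumptions, not taken from the slides or the book.

```python
import time
from collections import Counter


def run_batch_job(records):
    """Process a complete, bounded input in one go and return (output, throughput)."""
    started = time.perf_counter()
    # The "job": count page views per user across the whole input.
    output = Counter(user for user, _page in records)
    elapsed = time.perf_counter() - started
    # Throughput: records processed per second of job run time.
    throughput = len(records) / elapsed if elapsed > 0 else float("inf")
    return dict(output), throughput


# Hypothetical input: (user, page) view events collected before the job starts.
daily_log = [("ana", "/home"), ("luis", "/cart"), ("ana", "/checkout")]
views_per_user, records_per_second = run_batch_job(daily_log)
```

Nobody waits interactively for `run_batch_job`; in practice a scheduler would start it periodically, for example once a day.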
Data Architecture - Batch Processing - Hadoop Batch
Divide and Conquer
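A rough sketch of the divide-and-conquer idea, assuming the slide refers to the MapReduce model that Hadoop popularized. The word-count task, the function names, and the chunk size are illustrative assumptions. Everything runs in a single process here, whereas a real cluster would spread the chunks across nodes.

```python
from collections import defaultdict


def map_phase(chunk):
    """Map: emit a (word, 1) pair for every word in one chunk of the input."""
    return [(word.lower(), 1) for line in chunk for word in line.split()]


def shuffle(mapped_chunks):
    """Shuffle: group all emitted values by key across every mapped chunk."""
    groups = defaultdict(list)
    for pairs in mapped_chunks:
        for key, value in pairs:
            groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: combine the values of each key into a single result."""
    return {key: sum(values) for key, values in groups.items()}


def word_count(lines, chunk_size=2):
    # Divide: split the input into independent chunks.
    chunks = [lines[i:i + chunk_size] for i in range(0, len(lines), chunk_size)]
    # Conquer: process each chunk on its own, then merge the partial results.
    mapped = [map_phase(chunk) for chunk in chunks]
    return reduce_phase(shuffle(mapped))


lines = ["big data big ideas", "data in batches", "big batches"]
counts = word_count(lines)
```

Because each chunk is mapped independently, the map step could be parallelized without changing the result.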
Data Architecture - Batch Processing
Hadoop Ecosystem
Data Architecture - NRT (Stream) Processing
[...] we can run the processing more frequently—say, processing a second’s worth of
data at the end of every second—or even continuously, abandoning the fixed time
slices entirely and simply processing every event as it happens. That is the idea
behind stream processing.
In general, a “stream” refers to data that is incrementally made available over time.
—Martin Kleppmann, Designing Data-Intensive Applications (2017)
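The two options in the quote, processing one second's worth of data at a time or processing every event as it happens, can be contrasted in a small Python sketch. The event source, its timestamps, and the summing logic are hypothetical and only meant to show the difference in shape.

```python
from itertools import groupby


def event_source():
    """Hypothetical stream: (timestamp_in_seconds, value) events, ordered by time."""
    yield from [(0.1, 5), (0.7, 3), (1.2, 4), (2.5, 1), (2.9, 2)]


def process_per_second(events):
    """Micro-batch: handle one second's worth of data at the end of every second."""
    for second, batch in groupby(events, key=lambda event: int(event[0])):
        yield second, sum(value for _ts, value in batch)


def process_per_event(events):
    """Continuous: update a running total as soon as each event arrives."""
    total = 0
    for _ts, value in events:
        total += value
        yield total


per_second_totals = list(process_per_second(event_source()))
running_totals = list(process_per_event(event_source()))
```

Both functions consume the data incrementally; they differ only in how often a result is emitted.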
Data Architecture - NRT (Stream) Processing - Spark Stream
Data Architecture - NRT (Stream) Processing
Specialized tools
Intro to Data Architecture
Storage
Data Architecture - Storage - NoSQL
We will define NoSQL databases as all modern databases that cannot be, or currently are not, used through a relational schema.
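To make the contrast with a relational schema concrete, here is a minimal sketch using only the Python standard library. SQLite stands in for a relational database, and a plain dictionary of JSON documents stands in for a schemaless store. Both are illustrative assumptions; document stores are only one family of NoSQL databases.

```python
import json
import sqlite3

# Relational: the schema is declared up front and every row must fit it.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
db.execute("INSERT INTO users (id, name) VALUES (?, ?)", (1, "Ana"))
try:
    # A field the schema does not know about is rejected.
    db.execute("INSERT INTO users (id, name, tags) VALUES (?, ?, ?)", (2, "Luis", "admin"))
    extra_column_accepted = True
except sqlite3.OperationalError:
    extra_column_accepted = False

# Schemaless (document-style): each record is a self-describing document under a key.
documents = {}
documents["user:1"] = json.dumps({"name": "Ana"})
documents["user:2"] = json.dumps(
    {"name": "Luis", "tags": ["admin"], "address": {"city": "Buenos Aires"}}
)

luis = json.loads(documents["user:2"])
```

The two documents have different shapes and no schema change was needed to store them, which is the flexibility the definition points at.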
Data Architecture - On Premise
Data Architecture - On Premise - Bonus Track
https://top500.org/
https://top500.org/lists/top500/2021/06/ https://top500.org/lists/green500/2021/06/
Data Architecture - Cloud
Data Architecture - Cloud - Bonus Track
*Source: "IaaS vs. PaaS vs. SaaS Cloud Models (Differences & Examples)" - https://www.hostingadvice.com/how-to/iaas-vs-paas-vs-saas/
Data Architecture - On Premise vs Cloud
IaaS: Cloud infrastructure services, known as Infrastructure as a Service (IaaS), are made of
highly scalable and automated compute resources.
Data Architecture - On Premise vs Cloud
PaaS: Cloud platform services, or PaaS, provide cloud components to certain software while being
used mainly for applications. PaaS delivers a framework for developers that they can build upon
and use to create customized applications.
Data Architecture - On Premise vs Cloud
SaaS: Also known as cloud application services, SaaS represents the most commonly utilized option
for businesses in the cloud market. SaaS utilizes the internet to deliver applications, which are
managed by a third-party vendor, to its users.
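One way to compare the three models is by who manages each layer of the stack. The sketch below assumes the commonly used nine-layer breakdown, which is not taken from the slides. The layer names and boundaries are a simplification and vary between providers.

```python
# Layers of a typical stack, bottom to top (a common textbook breakdown, assumed here).
STACK = ["networking", "storage", "servers", "virtualization",
         "operating system", "middleware", "runtime", "data", "applications"]

# How many layers, counted from the bottom, the provider manages in each model.
PROVIDER_MANAGED_LAYERS = {"on premise": 0, "iaas": 4, "paas": 7, "saas": len(STACK)}


def responsibilities(model):
    """Return (provider_managed, customer_managed) layers for a service model."""
    boundary = PROVIDER_MANAGED_LAYERS[model.lower()]
    return STACK[:boundary], STACK[boundary:]


provider, customer = responsibilities("PaaS")
```

Under these assumptions, moving from IaaS to PaaS to SaaS shifts the boundary upward, leaving fewer layers for the customer to run.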
Intro to Data Architecture
Useful Tips
Data Architecture - Common Scenarios and Useful Tips
Open discussion
& Questions
Introduction to
Data Integration
Data Integration
How does it work?
Data Integration - Common Scenarios and Tools
Pentaho provides big data tools to extract, prepare and blend your data, plus the visualizations
and analytics that will change the way you run your business.
Apache Sqoop is a tool designed for efficiently transferring bulk data between Apache Hadoop and
structured datastores such as relational databases.
[Old] Apache Kafka is a fast, scalable, durable, and fault-tolerant publish-subscribe messaging
system. Kafka is often used in place of traditional message brokers like JMS and AMQP because of
its higher throughput, reliability and replication.
Logstash is a tool for managing events and logs. You can use it to collect logs, parse them, and
store them for later use (like, for searching). If you store them in Elasticsearch, you can view
and analyze them with Kibana.
Intro to Data Integration
Kafka
Data Integration - Kafka
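Kafka was introduced earlier as a publish-subscribe messaging system. The following toy broker sketches that pattern in plain Python: an append-only log per topic and an independent read offset per subscriber. It is not the Kafka client API, and the class and method names are hypothetical. Partitions, persistence, and replication are deliberately left out.

```python
from collections import defaultdict


class InMemoryBroker:
    """Toy publish-subscribe broker: one append-only log per topic, one offset per subscriber."""

    def __init__(self):
        self._logs = defaultdict(list)    # topic -> ordered list of messages
        self._offsets = defaultdict(int)  # (topic, subscriber) -> next position to read

    def publish(self, topic, message):
        self._logs[topic].append(message)

    def poll(self, topic, subscriber):
        """Return the messages this subscriber has not seen yet and advance its offset."""
        log = self._logs[topic]
        start = self._offsets[(topic, subscriber)]
        self._offsets[(topic, subscriber)] = len(log)
        return log[start:]


broker = InMemoryBroker()
broker.publish("orders", {"id": 1})
broker.publish("orders", {"id": 2})

billing_first = broker.poll("orders", "billing")    # both messages published so far
broker.publish("orders", {"id": 3})
billing_second = broker.poll("orders", "billing")   # only the new message
analytics_all = broker.poll("orders", "analytics")  # a separate subscriber reads from the start
```

Keeping messages in the log after they are read is what lets a late subscriber such as `analytics` still receive everything.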
Intro to Data Integration
Pentaho
Data Integration - Pentaho
Pentaho tightly couples data integration with business analytics in a modern platform that brings together
IT and business users to easily access, visualize and explore all data that impacts business results.
Pentaho Data Integration, codenamed Kettle, consists of a core data integration (ETL) engine, and GUI
applications that allow the user to define data integration jobs and transformations. It supports
deployment on single node computers as well as on a cloud, or cluster.
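As a rough illustration of what an ETL transformation does, here is a minimal extract-transform-load sketch using only the Python standard library. It does not use Pentaho or Kettle. The CSV content, the table name, and the cleaning rule are hypothetical.

```python
import csv
import io
import sqlite3

# Hypothetical source data; a real job would read from files, databases, or APIs.
RAW_CSV = "customer,amount\nana, 10.50\nluis,abc\nana,4.50\n"


def extract(text):
    """Extract: read raw rows from the source."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: clean values and drop rows that cannot be parsed."""
    clean = []
    for row in rows:
        try:
            clean.append((row["customer"].strip(), float(row["amount"])))
        except ValueError:
            continue  # skip malformed amounts such as "abc"
    return clean


def load(rows, connection):
    """Load: write the cleaned rows into the target table."""
    connection.execute("CREATE TABLE IF NOT EXISTS sales (customer TEXT, amount REAL)")
    connection.executemany("INSERT INTO sales VALUES (?, ?)", rows)
    connection.commit()


target = sqlite3.connect(":memory:")
load(transform(extract(RAW_CSV)), target)
totals = dict(target.execute("SELECT customer, SUM(amount) FROM sales GROUP BY customer"))
```

A GUI tool lets the user assemble the same three kinds of steps visually instead of writing them as functions.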
Intro to Data Integration
Logstash
Data Integration - Logstash
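The collect, parse, and store steps described earlier for Logstash can be sketched in plain Python. This is not Logstash configuration. The log format, the regular expression, and the in-memory list standing in for a search index are assumptions made for illustration.

```python
import re

# Hypothetical log format: "<timestamp> <LEVEL> <message>".
LINE_PATTERN = re.compile(r"^(?P<timestamp>\S+) (?P<level>[A-Z]+) (?P<message>.*)$")


def collect():
    """Collect: stand-in for reading lines from files, sockets, or agents."""
    return [
        "2021-10-29T10:00:00Z INFO service started",
        "2021-10-29T10:00:05Z ERROR database unreachable",
        "not a structured line",
    ]


def parse(lines):
    """Parse: turn each matching line into a structured event, skipping the rest."""
    for line in lines:
        match = LINE_PATTERN.match(line)
        if match:
            yield match.groupdict()


def store(events, index):
    """Store: append events to a list standing in for a search index."""
    index.extend(events)


index = []
store(parse(collect()), index)
errors = [event["message"] for event in index if event["level"] == "ERROR"]
```

Once events are structured, searching them later (here, filtering by level) becomes a simple query rather than text matching.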
References
● La Ingeniería del Big Data, Cómo trabajar con datos. Juan José López Murphy, Gonzalo Zarza. Editorial UOC, 2017.
● Designing data-intensive applications: The big ideas behind reliable, scalable, and maintainable systems. Martin Kleppmann. O'Reilly Media, 2017.
● Hadoop, The Definitive Guide, 4th Edition. Tom White. O'Reilly Media / Yahoo Press, 2015.
● Hadoop in Action, 2nd Edition. Chuck Lam. Manning, 2016.
Any questions?
Thank you!
Gonzalo Zarza
gzarza@uade.edu.ar