Architecture of AI Systems - Engineering For Big Data and AI (Grokking)

Architectures of AI systems
Engineering for Big Data & AI
HCMC, Sep 6th 2019 Herve Roussel herve@quod.ai

What is Data Engineering?
Is this data engineering?
UploadData.java
upload_data.py
cat console.log | grep "ERROR" > errors.log
Data engineering?
Event data → Program → Transformed data

Backend vs Data?
Event data
Event data: cat console.log
Transform: | grep "ERROR"
Transformed data: > errors.log
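The shell pipeline above already has the source / transform / sink shape used in the rest of the talk. A minimal Python sketch of the same idea, assuming the log lines are available as an in-memory list (the function name `keep_errors` and the sample lines are hypothetical, not from the slides):

```python
from typing import Iterable, Iterator


def keep_errors(lines: Iterable[str]) -> Iterator[str]:
    """Transform step: keep only lines containing "ERROR", like grep."""
    for line in lines:
        if "ERROR" in line:
            yield line


# Source (event data): stands in for `cat console.log`.
console_log = [
    "INFO  server started",
    "ERROR disk full",
    "INFO  request served",
    "ERROR timeout calling database",
]

# Sink (transformed data): stands in for `> errors.log`.
errors_log = list(keep_errors(console_log))
```

The transform never needs to know where the lines come from or where they go, which is what lets the source and sink be swapped later.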
What is Big Data Engineering?
Where is Big Data?
How to query news feed?
SELECT
*
FROM posts
INNER JOIN friends
WHERE ...
ORDER BY
posts.timestamp DESC
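To make the query concrete, here is a small runnable sketch using Python's built-in `sqlite3`. The slide elides the join and filter conditions, so the table columns, the `ON` clause, and the `WHERE` clause below are assumptions chosen only for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Hypothetical schema; the slide does not define one.
    CREATE TABLE posts (id INTEGER PRIMARY KEY, author_id INTEGER,
                        content TEXT, timestamp INTEGER);
    CREATE TABLE friends (user_id INTEGER, friend_id INTEGER);
    INSERT INTO posts VALUES (1, 20, 'hello', 100),
                             (2, 30, 'world', 200),
                             (3, 40, 'not a friend', 300);
    INSERT INTO friends VALUES (10, 20), (10, 30);
    """
)


def news_feed(user_id):
    """Return (id, content) of posts by the user's friends, newest first."""
    rows = conn.execute(
        """
        SELECT posts.id, posts.content
        FROM posts
        INNER JOIN friends ON friends.friend_id = posts.author_id
        WHERE friends.user_id = ?
        ORDER BY posts.timestamp DESC
        """,
        (user_id,),
    )
    return rows.fetchall()


feed = news_feed(10)
```

A query like this only orders posts by time. It does not answer the ranking, visibility, moderation, or notification questions a real feed has to deal with.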
● Chris posted. Is that good or bad?
● Who can see this?
● Racist? Vulgar?
● Notify? Web, mobile?
● Anybody tagged?
● Is this a face? Who’s this?
● Friend? Celebrity?
● Courtney likes. Is that good or bad?
● What rank in feed?
● Paddy commented. Is that good or bad?
● Copyright violation?
Is Big Data just for big companies?
300K QPS [R] 1B+ QPM [P] 400M LOC [P]

6K QPS [W] 250M+ QPM [R] 1.8 TB per year [P]
As of JULY 8, 2013
Data Engineering
Event data → Program → Augmented data

Big Data Engineering + AI
Event data
Transform
Augmented data
Source (Event data)
Pipeline (Transform)
Sink (Augmented data)

What is a source?
Where is data coming from?
Main data
Synchronous (10-100 ms)
Asynchronous (3-5 s)
Event source
Why split?
What’s in an event data?
Post
{
  id: 12345,
  content: "hello world",
  created_at: …
  updated_at: …
  author_id: 67890,
  …
}
PostCreatedEvent
{
  story_id: 12345,
  type: "story_posted"
  …
}
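The difference between the two records can be sketched with Python dataclasses. This is an illustration under assumptions: the elided fields (shown as … on the slide) are left out, treating the event as immutable is an interpretation, and `on_post_created` is a hypothetical hook:

```python
from dataclasses import dataclass


@dataclass
class Post:
    """Main data: the current state of a post, updated in place."""
    id: int
    content: str
    author_id: int


@dataclass(frozen=True)
class PostCreatedEvent:
    """Event data: a record that something happened (frozen by assumption)."""
    story_id: int
    type: str


def on_post_created(post: Post) -> PostCreatedEvent:
    # Hypothetical hook: emit an event that points back to the main record.
    return PostCreatedEvent(story_id=post.id, type="story_posted")


post = Post(id=12345, content="hello world", author_id=67890)
event = on_post_created(post)
```

The event carries only a reference (`story_id`) and a type, so it stays small and can be written to a separate store from the main data.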
What’s batch processing?
Job 1
Scheduler
Job 2
Which DB for event source?
How to store events?
● Volume?
● Velocity? QPS reads? QPS writes?
● Latency?
● Cost? Storage & R/W
● How to write?
○ Integrity?
○ Consistency?
○ Durability?
○ Version?
● How to read?
○ Random access or sequential?
○ Full text search?
○ Geo distance?
How to store events?
                          MySQL   MongoDB   JSON on S3 (or GCS)
30 GB                     OK      Good      Very good
10K WPS                   OK      Good      Very good
1K RPS                    OK      Good      Very good
Range / sequential read   OK      Good      Very good
Cost                      $$      $$$       $
Who wants to become architect?
What’s the problem with batch?
LATENCY
Job 1
Scheduler
Job 2
How to process real-time?
Stream processing
How can 2 processes talk?
QUEUE
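A queue decouples the side that produces events from the side that consumes them. The sketch below uses two threads and the standard library's `queue.Queue` as a stand-in; a real deployment would put separate processes on either side of a broker. The event shape and the `STOP` sentinel are assumptions for illustration:

```python
import queue
import threading

events = queue.Queue()
processed = []
STOP = object()  # sentinel telling the consumer to finish


def producer():
    """Writes events without knowing who will read them."""
    for i in range(3):
        events.put({"id": i, "type": "story_posted"})
    events.put(STOP)


def consumer():
    """Reads events in the order they were written."""
    while True:
        event = events.get()
        if event is STOP:
            break
        processed.append(event["id"])


threads = [threading.Thread(target=producer), threading.Thread(target=consumer)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Neither side calls the other directly, so either can be slowed down, restarted, or replaced without changing the other.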
Why not use database?

Why not database?
                  Importance   MySQL              Kafka                Redis
10K WPS           1.0          5                  10                   10
1K RPS            1.0          5                  10                   10
Sequential read   1.0          10 (with B-TREE)   10                   10 (using Lists)
Order guarantee   0.2          10                 0                    10
Durability        0.1          10                 5 (but perf. hit)    0
Deployability     0.5          10                 5                    7.5
Score                          5.6 / 10           6.6 / 10             7.15 / 10
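The scores in the table can be reproduced as a weighted sum of the ratings. The slide's numbers match that sum divided by 5. The divisor is inferred from the arithmetic and is not explained in the source. A conventional weighted average would divide by the sum of the weights (3.8) instead. A short sketch computing both:

```python
weights = {
    "10K WPS": 1.0,
    "1K RPS": 1.0,
    "Sequential read": 1.0,
    "Order guarantee": 0.2,
    "Durability": 0.1,
    "Deployability": 0.5,
}

ratings = {
    "MySQL": {"10K WPS": 5, "1K RPS": 5, "Sequential read": 10,
              "Order guarantee": 10, "Durability": 10, "Deployability": 10},
    "Kafka": {"10K WPS": 10, "1K RPS": 10, "Sequential read": 10,
              "Order guarantee": 0, "Durability": 5, "Deployability": 5},
    "Redis": {"10K WPS": 10, "1K RPS": 10, "Sequential read": 10,
              "Order guarantee": 10, "Durability": 0, "Deployability": 7.5},
}


def weighted_sum(system):
    """Sum of importance * rating over all criteria."""
    return sum(weights[c] * ratings[system][c] for c in weights)


# Divisor of 5 is inferred from the slide's numbers, not stated in the source.
slide_scores = {s: round(weighted_sum(s) / 5, 2) for s in ratings}

# Conventional weighted average: divide by the total weight instead.
total_weight = sum(weights.values())
normalized_scores = {s: round(weighted_sum(s) / total_weight, 2) for s in ratings}
```

Either way the ranking is the same: Redis first, then Kafka, then MySQL.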

What is a transform?
Source
Transforms
Sink
Functional vs OOP
Functional: operations on things
OOP: things with operations
Add more things vs. add more operations
Functional: find(book), load_cover(book), remove(book), assign(book)
OOP: Librarian.startShift(), Books.create(), Catalog.open(), Library.close()
Functional vs OOP
Things with operations

Add more operations
generate_thumbnails(vid_uploaded)
find_similar(vid_uploaded)
transcribe_captions(vid_uploaded)
alert_subscribers(vid_uploaded)
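In this style each transform is a plain function of the event, so adding an operation means adding one function to a list. A minimal sketch follows. The function bodies are placeholders, and the `video_id` field and `run_pipeline` helper are hypothetical:

```python
def generate_thumbnails(vid_uploaded):
    return f"thumbnails:{vid_uploaded['video_id']}"


def find_similar(vid_uploaded):
    return f"similar:{vid_uploaded['video_id']}"


def transcribe_captions(vid_uploaded):
    return f"captions:{vid_uploaded['video_id']}"


def alert_subscribers(vid_uploaded):
    return f"alerts:{vid_uploaded['video_id']}"


# Adding a new operation only means appending another function here.
TRANSFORMS = [generate_thumbnails, find_similar,
              transcribe_captions, alert_subscribers]


def run_pipeline(event):
    """Apply every transform to the same event, keyed by function name."""
    return {fn.__name__: fn(event) for fn in TRANSFORMS}


results = run_pipeline({"video_id": "abc"})
```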
What’s supporting data?
Supporting data: Friends or city DB
Transform
event
{
  id: 12345,
  type: "story_posted",
  user_id: 67890,
  coordinates: [ 10.76, 106.66 ]
}
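A transform that uses supporting data looks up extra facts and attaches them to the event. The sketch below is only illustrative. The in-memory `FRIENDS_DB` and `CITIES` tables, their contents, and the crude nearest-city rule (squared difference in degrees, not a real geographic distance) are all assumptions:

```python
# Hypothetical supporting data; a real pipeline would query a database.
FRIENDS_DB = {67890: [111, 222]}
CITIES = {
    "Ho Chi Minh City": (10.76, 106.66),
    "Hanoi": (21.03, 105.85),
}


def nearest_city(coordinates):
    """Pick the closest known city by squared difference in degrees."""
    lat, lon = coordinates
    return min(
        CITIES,
        key=lambda name: (CITIES[name][0] - lat) ** 2 + (CITIES[name][1] - lon) ** 2,
    )


def augment(event):
    """Return a copy of the event with friend ids and a city attached."""
    return {
        **event,
        "friend_ids": FRIENDS_DB.get(event["user_id"], []),
        "city": nearest_city(event["coordinates"]),
    }


event = {
    "id": 12345,
    "type": "story_posted",
    "user_id": 67890,
    "coordinates": [10.76, 106.66],
}
augmented = augment(event)
```

The original event is left untouched, and the augmented copy is what would be written to the sink.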
Who uses ext. supporting data?
API vs Pipeline: availability?
API: requests in thread
Pipeline: long running

API vs Pipeline: performance?
API: 100 ms          Pipeline: 100 ms × 300,000 = 30,000 s ≈ 8.3 h
⇓                    ⇓
API: 10 ms           Pipeline: 10 ms × 300,000 = 3,000 s = 50 min
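The back-of-the-envelope arithmetic can be checked directly. This sketch assumes the 300,000 counts items processed one after another, each paying the full per-item latency. Exact values come out to about 8.3 hours and 50 minutes:

```python
def total_seconds(latency_ms, n_items=300_000):
    """Total sequential processing time in seconds."""
    return latency_ms * n_items / 1000


hours_at_100ms = total_seconds(100) / 3600   # about 8.33 hours
minutes_at_10ms = total_seconds(10) / 60     # 50.0 minutes
```

A latency that is unremarkable for a single API request becomes hours of wall-clock time once it is multiplied across a whole pipeline run.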

Where is the data coming from?
Is this a face? Who’s this?

Friend? Celebrity?
Data pipelines & AI
AI model Transform
How can 2 processes talk?
Transform
AI model
What is a sink?
Which DB to sink to?
What to do with the sink?
Write Read
Data scientist
Sales
What are the read use cases?
● Aggregation: Give me summary report of last month’s activity
● Full text search: Give me posts that contain the words Donald Trump, Trump or President
● Bulk data, filtered: Give me all posts by female, age 18-35

ACID
Denormalization: good or bad?
What is BCNF?
What’s distributed data systems?
Why re-run the pipeline?
AI model Transform
Transform v2
Idempotency & backfill
f(f(x)) = f(x)
POST "/BankAccount/AddFunds"
{ value: 1000, token: TX123 }
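One common way to make a request like this idempotent is to remember which tokens have already been applied, so that a retry with the same token changes nothing. A minimal in-memory sketch follows. The `BankAccount` class and its method are hypothetical, and a real service would persist the seen tokens:

```python
class BankAccount:
    def __init__(self):
        self.balance = 0
        self._seen_tokens = set()  # tokens of requests already applied

    def add_funds(self, value, token):
        """Apply the deposit once per token; repeats are no-ops."""
        if token in self._seen_tokens:
            return self.balance
        self._seen_tokens.add(token)
        self.balance += value
        return self.balance


account = BankAccount()
account.add_funds(1000, "TX123")
account.add_funds(1000, "TX123")  # retry of the same request
```

After both calls the balance is 1000, not 2000, which is the f(f(x)) = f(x) property: re-running a pipeline or replaying a request is safe.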
Another reason for backfill?
What if the AI model improves?
AI model v2 Transform
AI systems ≠ traditional systems?
93.2%
Deterministic Probabilistic
Store output of model v1 or v2?
AI Model v1
( accuracy: 83.1% )
AI Model v2
( accuracy: ?? )
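One possible approach, offered as an assumption and not as the talk's answer, is to store each model output tagged with the model version, so that v1 and v2 results can sit side by side and be compared after a backfill. The two scoring functions below are placeholders, not real models:

```python
def model_v1(text):
    """Placeholder scorer; a real model would return a learned probability."""
    return 0.9 if "spam" in text else 0.1


def model_v2(text):
    """Placeholder for an improved model (here: case-insensitive)."""
    return 0.95 if "spam" in text.lower() else 0.05


MODELS = {"v1": model_v1, "v2": model_v2}
sink = []


def score_and_store(event_id, text, version):
    """Idempotent write: one row per (event, model version)."""
    key = (event_id, version)
    sink[:] = [r for r in sink if (r["event_id"], r["model_version"]) != key]
    sink.append({"event_id": event_id, "model_version": version,
                 "score": MODELS[version](text)})


def backfill(events, version):
    for event_id, text in events.items():
        score_and_store(event_id, text, version)


events = {1: "Buy SPAM now", 2: "hello world"}
backfill(events, "v1")
backfill(events, "v2")
backfill(events, "v2")  # re-running does not duplicate rows
```

Keeping the version next to each score means a later model can be evaluated against the earlier one before readers of the sink are switched over.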
What have we learned?
[DE] Collect data
[DE] Process data
[DS] Build DL model
[DA] Validate DL model
[BE/FE] Use DL model in app
Source: Uber Engineering

Which NFR for Big Data?
• Scalability • Deployability
• Availability • Ease of Development
• Interoperability • Performance
• Portability • Security
• Modifiability • Localization
• Maintainability • Legal
• Testability • Reusability
• Usability • Supportability
• Buildability • Monitorability
What have we learned?
Main data
+
Materialized view
Event data
⇓
Pipeline
⇓
Augmented data
Want to learn more about AI & Big Data?
We’re hiring:
● Big Data Engineer, in training (Java)
● Big Data Engineer (Java)
● Data Scientist (Python)
http://bit.ly/quod-ai-join
Herve Roussel herve@quod.ai
