
Architectures of AI systems

Engineering for Big Data & AI

HCMC, Sep 6th 2019 Herve Roussel herve@quod.ai


What is Data Engineering?
Is this data engineering?

UploadData.java

upload_data.py
Is this data engineering?

cat console.log | grep "ERROR" > errors.log
Data engineering?

Event data → Program → Transformed data
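
The shell pipeline shown earlier can be read as exactly this kind of program. Below is a minimal Python sketch of the same idea, assuming log lines are plain strings; the function name `filter_errors` and the sample log lines are made up for illustration.

```python
from typing import Iterable, Iterator


def filter_errors(lines: Iterable[str]) -> Iterator[str]:
    """Program/transform: keep only lines containing ERROR, like grep "ERROR"."""
    for line in lines:
        if "ERROR" in line:
            yield line


# Event data: stands in for the contents of console.log (hypothetical sample lines).
console_log = [
    "INFO server started",
    "ERROR database unreachable",
    "INFO request served",
    "ERROR timeout calling payment service",
]

# Transformed data: what the shell version would write to errors.log.
errors_log = list(filter_errors(console_log))
```

The generator processes one line at a time, so the same function would work on a file handle as well as on a list.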


Backend vs Data?
Is this data engineering?

Event data:        cat console.log
Transform:         | grep "ERROR"
Transformed data:  > errors.log
What is Big Data Engineering?
Where is Big Data?
How to query news feed?
SELECT
*
FROM posts
INNER JOIN friends
WHERE ...
ORDER BY
posts.timestamp DESC
Chris posted. Is that good or bad?
Who can see this?
Racist? Vulgar?
Notify? Web, mobile?
Anybody tagged?
Is this a face? Who's this?
Friend? Celebrity?
Courtney likes. Is that good or bad?
What rank in feed?
Paddy commented. Is that good or bad?
Copyright violation?
Is Big Data just for big companies?

300K QPS [R] 1B+ QPM [P] 400M LOC [P]


6K QPS [W] 250M+ QPM [R] 1.8 TB per year [P]

As of JULY 8, 2013
Data Engineering

Event data → Program → Augmented data


Big Data Engineering + AI
Event data

Transform

Augmented data
Source (Event data)

Pipeline (Transform)

Sink (Augmented data)


What is a source?
Where is data coming from?
Main data

Synchronous (10-100 ms)

Asynchronous (3-5 s)

Event source

Why split?
What’s in an event data?
Post

{
  id: 12345,
  content: "hello world",
  created_at: …
  updated_at: …
  author_id: 67890,
}

PostCreatedEvent

{
  story_id: 12345,
  type: "story_posted"
  …
}
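
To make the split between main data and event data concrete, here is a small Python sketch. It assumes only the fields visible on the slide; the elided fields (the timestamps and the rest of the event) are left out rather than guessed, and the helper name `post_created` is hypothetical.

```python
from dataclasses import dataclass, asdict


@dataclass
class Post:
    """Main data: the current state of a post (mutable, updated in place)."""
    id: int
    content: str
    author_id: int


@dataclass(frozen=True)
class PostCreatedEvent:
    """Event data: an immutable record that something happened."""
    story_id: int
    type: str


def post_created(post: Post) -> PostCreatedEvent:
    # Hypothetical helper: derive the event from the freshly created post.
    return PostCreatedEvent(story_id=post.id, type="story_posted")


post = Post(id=12345, content="hello world", author_id=67890)
event = post_created(post)
```

Marking the event as frozen reflects the idea that events are appended and never edited, while the post itself can change later.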
What’s batch processing?

Job 1

Scheduler

Job 2
Which DB for event source?
How to store events?
● Volume?
● Velocity? QPS reads? QPS writes?
● Latency?
● Cost? Storage & R/W
● How to write?
○ Integrity?
○ Consistency?
○ Durability?
○ Version?
● How to read?
○ Random access or sequential?
○ Full text search?
○ Geo distance?
How to store events?
                         MySQL   MongoDB   JSON on S3 (or GCS)
30 GB                    OK      Good      Very good
10K WPS                  OK      Good      Very good
1K RPS                   OK      Good      Very good
Range / sequential read  OK      Good      Very good
Cost                     $$      $$$       $
Who wants to become architect?
What’s the problem with batch?

LATENCY
Job 1

Scheduler

Job 2
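
A rough way to see the latency problem: an event has to wait for the next scheduled run before any job looks at it. The sketch below assumes a fixed schedule and ignores job run time; the hourly interval and the arrival times are made-up values, and `batch_wait_seconds` is a hypothetical helper.

```python
import math


def batch_wait_seconds(arrival: float, interval: float) -> float:
    """Seconds an event waits until the next scheduled batch run starts."""
    next_run = math.ceil(arrival / interval) * interval
    return next_run - arrival


interval = 3600.0                    # assumed: the scheduler fires every hour
arrivals = [1.0, 1800.0, 3599.0]     # seconds after the previous run
waits = [batch_wait_seconds(t, interval) for t in arrivals]
```

An event that arrives one second after a run waits almost the whole interval, which is the gap stream processing tries to close.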
How to process real-time?

Stream processing
How can 2 processes talk?
QUEUE
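
A minimal sketch of the queue idea, using Python's standard `queue` and `threading` modules. Two threads in one process stand in for the two processes here; between real processes the queue would be an external system. The event fields and the `STOP` sentinel are assumptions for illustration.

```python
import queue
import threading

events: queue.Queue = queue.Queue()
STOP = object()      # sentinel telling the consumer there is nothing more to read
processed = []


def producer() -> None:
    # Writes events to the queue without knowing who reads them.
    for i in range(3):
        events.put({"id": i, "type": "story_posted"})
    events.put(STOP)


def consumer() -> None:
    # Reads events in the order they were written.
    while True:
        item = events.get()
        if item is STOP:
            break
        processed.append(item["id"])


threads = [threading.Thread(target=producer), threading.Thread(target=consumer)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

The producer and consumer never call each other directly, so either side can be slowed down, restarted, or replaced independently.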

Why not use a database?


Why not a database?
                 Importance  MySQL              Kafka               Redis
10K WPS          1.0         5                  10                  10
1K RPS           1.0         5                  10                  10
Sequential read  1.0         10 (with B-TREE)   10                  10 (using Lists)
Order guarantee  0.2         10                 0                   10
Durability       0.1         10                 5 (but perf. hit)   0
Deployability    0.5         10                 5                   7.5
Score                        5.6 / 10           6.6 / 10            7.15 / 10
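
The three scores are consistent with a weighted sum of each column divided by 5. That divisor is inferred from the numbers and is not stated on the slide, so treat the sketch below as one reading of the table rather than the author's exact method.

```python
# Importance weights and per-system ratings, copied from the table.
weights = {
    "10K WPS": 1.0,
    "1K RPS": 1.0,
    "Sequential read": 1.0,
    "Order guarantee": 0.2,
    "Durability": 0.1,
    "Deployability": 0.5,
}
ratings = {
    "MySQL": {"10K WPS": 5, "1K RPS": 5, "Sequential read": 10,
              "Order guarantee": 10, "Durability": 10, "Deployability": 10},
    "Kafka": {"10K WPS": 10, "1K RPS": 10, "Sequential read": 10,
              "Order guarantee": 0, "Durability": 5, "Deployability": 5},
    "Redis": {"10K WPS": 10, "1K RPS": 10, "Sequential read": 10,
              "Order guarantee": 10, "Durability": 0, "Deployability": 7.5},
}
DIVISOR = 5  # assumption: inferred so that the results match the slide


def score(system: str) -> float:
    """Weighted sum of ratings for one system, scaled by the assumed divisor."""
    total = sum(weights[c] * ratings[system][c] for c in weights)
    return total / DIVISOR


scores = {name: round(score(name), 2) for name in ratings}
```

Changing a weight, for example raising the importance of durability, is a quick way to see how sensitive the ranking is to the priorities chosen.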


What is a transform?
Source

Transforms

Sink
Functional vs OOP

Operations on things Things with operations


Add more things Add more operations

Functional:
find(book)
load_cover(book)
remove(book)
assign(book)

OOP:
Librarian.startShift()
Books.create()
Catalog.open()
Library.close()
Functional vs OOP

Things with operations


Add more operations

generate_thumbnails(vid_uploaded)

find_similar(vid_uploaded)

transcribe_captions(vid_uploaded)

alert_subscribers(vid_uploaded)
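
A small sketch of what adding more operations on the same event can look like. The function names come from the slide; their bodies are placeholders, and the `video_id` field and the `operations` list are assumptions made for illustration.

```python
from typing import Callable, Dict, List

Event = Dict[str, str]


def generate_thumbnails(vid_uploaded: Event) -> str:
    return f"thumbnails for {vid_uploaded['video_id']}"  # placeholder body


def find_similar(vid_uploaded: Event) -> str:
    return f"similar videos for {vid_uploaded['video_id']}"  # placeholder body


def transcribe_captions(vid_uploaded: Event) -> str:
    return f"captions for {vid_uploaded['video_id']}"  # placeholder body


# The event stays the same; the list of operations grows.
operations: List[Callable[[Event], str]] = [generate_thumbnails, find_similar]
operations.append(transcribe_captions)  # new operation, no change to the event

vid_uploaded = {"video_id": "v42", "type": "vid_uploaded"}
results = {op.__name__: op(vid_uploaded) for op in operations}
```

Each transform only reads the event, so a new one can be added without touching the existing ones.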
What’s supporting data?

Supporting data

event
{
  id: 12345,
  type: "story_posted",
  user_id: 67890,
  coordinates: [ 10.76, 106.66 ]
}

Transform

Friends or city DB
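
A sketch of a transform that joins the event with supporting data. The two lookup tables stand in for the friends and city databases; their contents are hypothetical, the city coordinates are approximate, and plain Euclidean distance on latitude/longitude is a crude stand-in for a real geo-distance query.

```python
import math
from typing import Dict, List, Tuple

# Hypothetical supporting data (would live in a friends DB and a city DB).
FRIENDS: Dict[int, List[int]] = {67890: [111, 222]}
CITIES: Dict[str, Tuple[float, float]] = {
    "Ho Chi Minh City": (10.78, 106.70),  # approximate coordinates
    "Hanoi": (21.03, 105.85),             # approximate coordinates
}


def nearest_city(lat: float, lon: float) -> str:
    """Return the city whose coordinates are closest to the given point."""
    return min(CITIES, key=lambda name: math.dist((lat, lon), CITIES[name]))


def augment(event: dict) -> dict:
    """Transform: copy the event and add fields looked up in supporting data."""
    lat, lon = event["coordinates"]
    return {
        **event,
        "friend_ids": FRIENDS.get(event["user_id"], []),
        "city": nearest_city(lat, lon),
    }


event = {"id": 12345, "type": "story_posted", "user_id": 67890,
         "coordinates": [10.76, 106.66]}
augmented = augment(event)
```

The original event is left untouched and the augmented copy carries the extra fields, which keeps the source data replayable.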
Who uses ext. supporting data?
API vs Pipeline: availability?

API: requests in thread          Pipeline: long running


API vs Pipeline: performance?

API        Pipeline

100 ms     100 ms × 300,000 = 30,000 s ≈ 8.3 h

⇓          ⇓

10 ms      10 ms × 300,000 = 3,000 s = 50 min


Where is the data coming from?

Is this a face? Who’s this?


Friend? Celebrity?
Data pipelines & AI

AI model Transform
How can 2 processes talk?

Transform

AI model
What is a sink?
Which DB to sink to?
What to do with the sink?
Write Read

Data scientist

Sales
What are the read use cases?

Aggregation: Give me summary report of last month's activity

Full text search: Give me posts that contain the words Donald Trump, Trump or President

Bulk data, filtered: Give me all posts by female, age 18-35
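
The three read patterns look quite different even on a toy data set. The sketch below runs them over an in-memory list; the sample posts and field names are invented, and the substring match is only a naive stand-in for a real full text index.

```python
from collections import Counter

# Hypothetical augmented posts sitting in the sink.
posts = [
    {"id": 1, "text": "President speaks today", "gender": "female", "age": 25, "month": "2019-08"},
    {"id": 2, "text": "Lunch in HCMC", "gender": "male", "age": 40, "month": "2019-08"},
    {"id": 3, "text": "Trump rally recap", "gender": "female", "age": 52, "month": "2019-07"},
]

# Aggregation: number of posts per month.
posts_per_month = Counter(p["month"] for p in posts)

# Full text search: posts mentioning any of the search terms.
terms = ("Donald Trump", "Trump", "President")
matching_ids = [
    p["id"] for p in posts
    if any(term.lower() in p["text"].lower() for term in terms)
]

# Bulk data, filtered: all posts by female authors aged 18-35.
filtered_ids = [
    p["id"] for p in posts
    if p["gender"] == "female" and 18 <= p["age"] <= 35
]
```

Each query scans the data in a different way, which is why one storage engine rarely serves all three well at scale.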


ACID
Denormalization: good or bad?
What is BCNF?
What are distributed data systems?
Why re-run the pipeline?

AI model Transform
Transform v2
Idempotency & backfill

f(f(x)) = f(x)

POST “/BankAccount/AddFunds”
{ value: 1000, token: TX123 }
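
One common way to make such a request idempotent is to remember the tokens already applied, so replaying the same request during a retry or a backfill does not add the funds twice. This is a minimal in-memory sketch; the `BankAccount` class and its method are hypothetical, and a real system would persist the seen tokens.

```python
class BankAccount:
    def __init__(self) -> None:
        self.balance = 0
        self._seen_tokens = set()   # tokens of requests already applied

    def add_funds(self, value: int, token: str) -> int:
        """Apply the deposit once per token; repeats leave the balance unchanged."""
        if token not in self._seen_tokens:
            self._seen_tokens.add(token)
            self.balance += value
        return self.balance


account = BankAccount()
first = account.add_funds(1000, "TX123")
replayed = account.add_funds(1000, "TX123")   # same request replayed
```

Applying the request twice gives the same state as applying it once, which is the f(f(x)) = f(x) property the pipeline needs before it can be safely re-run.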
Another reason for backfill?
What if the AI model improves?

AI model v2 Transform
AI systems ≠ traditional systems?

93.2%

Deterministic Probabilistic
Store output of model v1 or v2?
AI Model v1
( accuracy: 83.1% )

AI Model v2
( accuracy: ?? )
What have we learned?
[DE] Collect data
[DE] Process data
[DS] Build DL model
[DA] Validate DL model
[BE/FE] Use DL model in app

Source: Uber Engineering


Which NFR for Big Data?

• Scalability • Deployability
• Availability • Ease of Development
• Interoperability • Performance
• Portability • Security
• Modifiability • Localization
• Maintainability • Legal
• Testability • Reusability
• Usability • Supportability
• Buildability • Monitorability
What have we learned?

Main data
+
Materialized view
Event data

Pipeline

Augmented data
Want to learn more about AI & Big Data?
We’re hiring:
● Big Data Engineer, in training (Java)
● Big Data Engineer (Java)
● Data Scientist (Python)

http://bit.ly/quod-ai-join

Herve Roussel herve@quod.ai
