10/2017
Modern online services
[Diagram: Stream Aggregator → Stream Analytics System → Useful Information]
Modern online services
Tension: Low latency vs. Efficient resource utilization
Approximate computing
Approximate Computing
Many applications: Approximate output is good enough!
E.g.: Google Trends, “Spark SQL” vs “Spark Streaming” (Sep/2017 – Oct/2017)
[Figure: Google Trends chart on a 0 to 100 scale, showing the average and the trend over Sep 17, Sep 26, Oct 5]
Approximate Computing
[Diagram: Approximate computing: Take a sample → Compute → Approximate output ± Error bound]
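The sample-then-compute flow above can be illustrated with a minimal Python sketch. It assumes a simple random sample and a normal-approximation 95% confidence interval; the function name `approximate_mean` and the parameter `z` are hypothetical and are not taken from the talk.

```python
import math
import random
import statistics


def approximate_mean(data, sample_size, z=1.96, seed=None):
    """Estimate the mean of `data` from a simple random sample.

    Returns (estimate, error_bound), where error_bound is the half-width
    of a normal-approximation confidence interval (z=1.96 is roughly 95%).
    This is an illustrative assumption, not the estimator from the talk.
    """
    rng = random.Random(seed)
    sample = rng.sample(data, sample_size)      # sampling without replacement
    estimate = statistics.fmean(sample)
    std_error = statistics.stdev(sample) / math.sqrt(sample_size)
    return estimate, z * std_error


if __name__ == "__main__":
    values = [float(x) for x in range(1000)]
    estimate, bound = approximate_mean(values, sample_size=200, seed=1)
    print(f"approximate mean = {estimate:.2f} ± {bound:.2f}")
```

A larger sample narrows the error bound at the cost of more computation, which is the trade-off the slide describes.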
State-of-the-art systems
ApproxHadoop [ASPLOS’15]: Using multi-stage sampling
Not designed for stream analytics
StreamApprox: Design goals
Outline
• Motivation
• Design
• Evaluation
StreamApprox: Overview
[Diagram: sub-streams S1, S2, …, Sn → Stream aggregator (e.g., Kafka) → Data stream → StreamApprox → Approximate output ± error bound]
Query budget:
• Latency/throughput guarantees
• Desired computing resources for query processing
• Desired accuracy
Key idea: Sampling
Simple random sampling (SRS):
Key idea: Sampling
Reservoir sampling (RS):
With probability (k/i): replace a reservoir item by item i
Size of reservoir = k
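As a minimal sketch, classic reservoir sampling can be written in Python as follows. The function name `reservoir_sample` and the `rng` parameter are hypothetical; the replacement probability k/i is the textbook rule for a reservoir of size k, assumed here to be what the slide depicts.

```python
import random


def reservoir_sample(stream, k, rng=None):
    """Keep a uniform random sample of at most k items from a stream.

    The first k items fill the reservoir. Item i (1-based, i > k) then
    replaces a uniformly chosen reservoir slot with probability k / i.
    """
    rng = rng or random.Random()
    reservoir = []
    for i, item in enumerate(stream, start=1):
        if i <= k:
            reservoir.append(item)
        elif rng.random() < k / i:
            reservoir[rng.randrange(k)] = item
    return reservoir


if __name__ == "__main__":
    print(reservoir_sample(range(1, 101), k=4, rng=random.Random(42)))
```

The stream is read once and only k items are held in memory, so the stream length does not need to be known in advance.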
Spark-based Sampling
Spark-based Simple Random Sampling (Spark-based SRS)
[Diagram: each sub-stream is sampled by its own reservoir (RS: Reservoir Sampling); size of reservoir = k, k=4]
S1: Weight = #items/k = 8/4
S3: Weight = 1
Easy to parallelize, doesn't need any synchronization between workers
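The per-sub-stream reservoirs and weights in the figure can be sketched in Python as below. This is an illustration under stated assumptions, not the authors' implementation: the function name `stratified_reservoir_sample` is hypothetical, and the rule "weight = #items/k when a sub-stream has more than k items, otherwise 1" is inferred from the two weights shown (8/4 and 1).

```python
import random
from collections import defaultdict


def stratified_reservoir_sample(stream, k, rng=None):
    """Sample each sub-stream with its own reservoir of size k.

    `stream` yields (substream_id, value) pairs. Returns (samples, weights):
    samples maps each sub-stream to its reservoir, and weights maps each
    sub-stream to #items/k if it produced more than k items, else 1.0
    (assumed rule, inferred from the figure).
    """
    rng = rng or random.Random()
    reservoirs = defaultdict(list)
    counts = defaultdict(int)
    for source, value in stream:
        counts[source] += 1
        seen = counts[source]
        reservoir = reservoirs[source]
        if seen <= k:
            reservoir.append(value)
        elif rng.random() < k / seen:
            reservoir[rng.randrange(k)] = value
    weights = {s: (c / k if c > k else 1.0) for s, c in counts.items()}
    return dict(reservoirs), weights


if __name__ == "__main__":
    items = [("S1", v) for v in range(8)] + [("S3", v) for v in range(3)]
    samples, weights = stratified_reservoir_sample(items, k=4, rng=random.Random(7))
    print(samples, weights)
```

Because every sub-stream keeps its own reservoir and counter, a rare sub-stream is never crowded out by a frequent one, and no state is shared between samplers.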
StreamApprox: Core idea
[Diagram: each worker runs OASRS on its own; size of reservoir = 4]
Worker 1 (OASRS): Weight = 2, Weight = 1.5, Weight = 1
Worker 2 (OASRS): Weight = 1, Weight = 2, Weight = 1.5
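One way per-worker weighted samples could be merged into a single answer is sketched below for a sum query. It assumes each sampled item stands for `weight` original items, so a sub-stream contributes weight × (sum of its sample). The function name `estimate_sum` and the data layout are hypothetical.

```python
def estimate_sum(worker_results):
    """Combine weighted samples from independent workers into one sum estimate.

    `worker_results` is a list of (samples, weights) pairs, one per worker:
    samples maps sub-stream id -> list of sampled values, and weights maps
    sub-stream id -> weight. Workers are only combined at the end, so no
    synchronization is needed while sampling.
    """
    total = 0.0
    for samples, weights in worker_results:
        for source, values in samples.items():
            total += weights[source] * sum(values)
    return total


if __name__ == "__main__":
    worker_1 = (
        {"S1": [1, 1, 1, 1], "S2": [2, 2, 2, 2], "S3": [5]},
        {"S1": 2.0, "S2": 1.5, "S3": 1.0},
    )
    worker_2 = ({"S1": [1, 1]}, {"S1": 1.0})
    print(estimate_sum([worker_1, worker_2]))
```

In this example worker 1 contributes 2.0 × 4 + 1.5 × 8 + 1.0 × 5 and worker 2 contributes 1.0 × 2.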
Implementation
[Diagram: sub-streams S1, S2, …, Sn → Stream aggregator → Data stream → StreamApprox → Approximate output ± error bound]
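Implementation
[Diagram: sub-streams S1, S2, …, Sn → Stream aggregator → Sampling module → Batch generator → Batched RDDs → Spark computation engine → Error estimation module → Output ± error bound; the Error estimation module sends refined sampling parameters back to the Sampling module]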
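The feedback path from error estimation to the sampling module might look like the following Python sketch. Everything here is an assumption made for illustration: the function name `run_with_feedback`, the doubling policy, and the normal-approximation error bound are not taken from the talk, and plain Python lists stand in for batched RDDs.

```python
import math
import random
import statistics


def run_with_feedback(batches, initial_size, target_error, z=1.96, seed=0):
    """Process batches one by one, refining the sample size from the error.

    For each batch: sample, compute the mean, estimate the error bound, and
    double the sample size for the next batch if the bound exceeds
    `target_error` (hypothetical policy). Returns a list of
    (estimate, error_bound, sample_size_used) tuples.
    """
    rng = random.Random(seed)
    size = initial_size
    outputs = []
    for batch in batches:
        n = min(size, len(batch))
        sample = rng.sample(batch, n)
        estimate = statistics.fmean(sample)
        if n > 1:
            bound = z * statistics.stdev(sample) / math.sqrt(n)
        else:
            bound = float("inf")  # a single item gives no variance estimate
        outputs.append((estimate, bound, n))
        if bound > target_error:
            size *= 2  # refined sampling parameter for the next batch
    return outputs


if __name__ == "__main__":
    gen = random.Random(1)
    data = [[gen.gauss(50, 10) for _ in range(1000)] for _ in range(5)]
    for estimate, bound, used in run_with_feedback(data, 20, target_error=1.0):
        print(f"n={used:4d}  mean={estimate:6.2f} ± {bound:.2f}")
```

The sample size grows only while the estimated error is above the target, so the loop settles on the smallest doubling that meets the accuracy budget.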
Outline
• Motivation
• Design
• Evaluation
Experimental setup
• Evaluation questions
  • Throughput vs sample size
  • Throughput vs accuracy
  • End-to-end latency
  See the paper for more results!
• Testbed
  • Cluster: 17 nodes
  • Datasets:
    • Synthetic: Gaussian distribution, Poisson distribution datasets
    • CAIDA Network traffic traces; NYC Taxi ride records
Throughput
[Figure: Throughput (M) #items/s (higher the better) vs. Sampling fraction (%) at 10, 20, 40, 60, 80, for Flink-based StreamApprox, Spark-based StreamApprox, and Spark-based STS]
Throughput vs Accuracy
[Figure: Throughput (M) #items/s (higher the better) vs. Accuracy loss (%) at 0.5 and 1, for Flink-based StreamApprox, Spark-based StreamApprox, and Spark-based STS]
Latency
[Figure: Latency (Seconds), total processing time (lower the better), for Spark-based STS, Spark-based SRS, and StreamApprox on the Network Traffic Dataset and the NYC Taxi Dataset]
Conclusion
StreamApprox: Approximate computing for stream analytics
Thank you!
Details: StreamApprox [Middleware’17]
https://streamapprox.github.io