AWS
KD Singh
Solutions Architect
Amazon Web Services
[Title-slide diagram: devices and Kinesis-enabled apps feeding Amazon Kinesis, AWS Data Pipeline, Amazon CloudSearch, Amazon DynamoDB, Cassandra, Spark Streaming, Storm, and the Amazon Kinesis Kafka connector]
AWS Government, Education, and Nonprofit Symposium
Washington, DC | June 25-26, 2015
AWS big data portfolio
Collect / Ingest → Store → Process / Analyze → Visualize / Report

• Database
  – Database reads/writes
• File
  – Media files; log files
• Stream
  – Click-stream logs (sets of events)

[Diagram: apps, devices, and logging frameworks such as Fluentd write to cloud storage and stream storage; Amazon Elastic MapReduce reads from them]
AWS Partners for data ingest, load, and transformation
Flume, Sqoop

[Diagram: app/web tier feeding a data tier and a database/storage tier: Amazon ElastiCache, Amazon RDS, Amazon DynamoDB, Amazon CloudSearch, HDFS on Amazon EMR, Amazon S3, Amazon Glacier]
Store anything
Object storage: Amazon S3
• Scalable
• Designed for 99.999999999% durability
• Versioning; lifecycle policies
• Amazon Glacier integration
• Batch processing
  – Take a large amount (>100 TB) of cold data and ask questions
  – Takes minutes or hours to get answers back
  – Example: generating hourly, daily, and weekly reports
• Batch processing/analytics (MPP)
  – Amazon Redshift
  – Amazon EMR (Hadoop)
  – Spark, Hive/Tez, Pig, Impala, Presto, …
• Stream processing
  – Amazon Kinesis client and connector library
  – Spark Streaming
  – Storm (+Trident)
• Compute nodes (Amazon Redshift)
  – Local, columnar storage
  – Execute queries in parallel over 10 GigE (HPC)
  – Load, backup, restore via Amazon S3
  – Parallel load from Amazon DynamoDB
• Two hardware platforms
  – DS2 (dense storage): HDD; scale to 1.6 PB
  – DC1 (dense compute): SSD; scale to 256 TB
[Diagram: records created, read, updated, and deleted through a basic UI and a fancy UI, then merged by key with a per-source priority]

CSV
|26/12/2001|Germany|Germany|"ICE:Icing"|Accident|Fatal|8|Germany|Bremerhaven|D-IAAI|"BRITTEN NORMAN"||||"2 251 to 5 700 Kg"|Scheduled|Airplane|Take-off||
…
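Each record is a single pipe-delimited line, so splitting on `|` recovers the fields. A minimal sketch of that step, assuming field positions inferred from the sample above (the feed's schema is not documented on the slides):

```javascript
// One record from the sample feed, split on the "|" delimiter.
// Field positions are inferred from the sample, not a documented schema.
var record = '|26/12/2001|Germany|Germany|"ICE:Icing"|Accident|Fatal|8|Germany|Bremerhaven|D-IAAI|"BRITTEN NORMAN"||||"2 251 to 5 700 Kg"|Scheduled|Airplane|Take-off||'
var fields = record.split("|")

// Empty fields are preserved by split, so positions stay stable.
var date = fields[1]          // assumed: accident date
var registration = fields[10] // assumed: aircraft registration
```

The date and registration are the two fields the later map step combines into a record key.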
Collect
Use Linux crontab to schedule:

#!/bin/sh
# Fetch the source XML, strip newlines, then split the stream into
# one <Accident> record per line before uploading to Amazon S3.
wget "http://somexml" -qO- | tr -d "\n" | tr -d "\r" |
  sed "s#<Accident>#\n<Accident>#g" > tmp
aws s3 cp tmp s3://accidents/input/source1
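The slide only says to schedule the script with crontab; one plausible entry, assuming an hourly cadence and an illustrative script path, would be:

```shell
# Edit the schedule with: crontab -e
# Run the collect script at the top of every hour (path is illustrative).
0 * * * * /home/ec2-user/collect-source1.sh
```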
[Diagram: the cron job on Amazon EC2 writes per-source files to Amazon S3]
Map (Amazon Elastic MapReduce, reading from and writing to Amazon S3):

function source1(line)
{
  var data = xml2json(line)
  data.records.forEach(function(v) {
    var el = { Date: v.Date,
               Registration: v.Registration,
               Model: v.Model,
               Source: "Source1",
               Priority: 1
             }
    var key = el.Date + "#" + el.Registration
    process.stdout.write(key + "\t" + JSON.stringify(el) + "\n")
  })
}
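The mapper relies on an xml2json helper that is not shown on the slides; it is assumed to turn one <Accident> line into an object with a records array. A runnable sketch under that assumption (the stub, its faked record, and the injected-helper signature are illustrative, not the talk's actual code):

```javascript
// Hypothetical stand-in for the slide's xml2json helper; assumed to
// return { records: [ { Date, Registration, Model, ... } ] }.
function xml2jsonStub(line) {
  // A real implementation would parse the <Accident> XML in `line`;
  // here one record is faked for illustration.
  return { records: [{ Date: "26/12/2001", Registration: "D-IAAI", Model: "BRITTEN NORMAN" }] }
}

// Same shape as the slide's source1 mapper, but returning the emitted
// lines instead of writing to stdout, so it can be exercised directly.
function source1(line, xml2json) {
  var out = []
  xml2json(line).records.forEach(function (v) {
    var el = { Date: v.Date, Registration: v.Registration, Model: v.Model,
               Source: "Source1", Priority: 1 }
    // The key "Date#Registration" groups the same accident across sources.
    out.push(el.Date + "#" + el.Registration + "\t" + JSON.stringify(el))
  })
  return out
}

var lines = source1("<Accident>…</Accident>", xml2jsonStub)
```

Emitting `key \t json` per line matches the Hadoop Streaming contract, which sorts the mapper output by key before the reduce step.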
Reduce (Amazon Elastic MapReduce; input is the mapped and sorted data, output is the reduced result on Amazon S3):

#!/usr/bin/env node
var oldkey, key, data, array = []

function treatline(line) {
  key = line.split("\t")[0]
  data = JSON.parse(line.split("\t")[1])
  if ((key == oldkey) || !oldkey) {
    array.push(data)
  } else {
    treat(array)
    array = [data]
  }
  oldkey = key
  …
}

function treat(array)
{
  var el = {}
  array = array.sort(prioritysort)
  array.forEach(function(v) {
    el = updateresult(el, v)
  })
  process.stdout.write(JSON.stringify(el) + "\n")
}
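The reducer leans on two helpers the slides never define: prioritysort and updateresult. One plausible reading, sketched here purely as an assumption, is that records for the same key are sorted by ascending Priority and folded so that higher-priority sources win for each field:

```javascript
// Hypothetical implementations of the helpers used by treat(array);
// the slides do not define them, so these semantics are assumed.

// Sort ascending by Priority, so Priority 1 sources come first.
function prioritysort(a, b) { return a.Priority - b.Priority }

// Fold one record into the accumulated result: keep values already set
// by an earlier (higher-priority) record, fill gaps from later ones.
function updateresult(el, v) {
  Object.keys(v).forEach(function (k) {
    if (el[k] === undefined || el[k] === "") el[k] = v[k]
  })
  return el
}

// Two records for the same accident from different sources:
var merged = [
  { Date: "26/12/2001", Model: "", Priority: 2, Source: "Source2" },
  { Date: "26/12/2001", Model: "BRITTEN NORMAN", Priority: 1, Source: "Source1" }
].sort(prioritysort).reduce(updateresult, {})
```

With this reading, the Source1 record (Priority 1) supplies the Model field, and the lower-priority Source2 record only fills fields Source1 left empty.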
Real-time statistics
Amazon Elastic MapReduce