Big Data and Analytics On AWS: KD Singh Solutions Architect Amazon Web Services

Big Data and Analytics on
AWS
KD Singh
Solutions Architect
Amazon Web Services
AWS Government, Education, and Nonprofit Symposium

©2015, Amazon Web Services, Inc. or its affiliates. All rights reserved. Washington, DC I June 25-26, 2015
What is big data?
When your data sets become so large that you have to start
innovating around how to collect, store, organize, analyze, and
share it
• Velocity • Change Rate
– Rate of data flow in – How much is the data changing?
• Latency • Processing Requirements
– High or Low – How much computation?
• Volume • Durability
– High or Low – Preservation of source data?
• Variety • Availability
– Diversity of source data – Tolerance for downtime?
• Item Size • Growth Rate
– KB or MB – Rate of data growth?
• Request Rate
– Access patterns • Views
– The diversity of consumers?
Washington, DC I June 25-26, 2015
Plethora of tools
Amazon Amazon Amazon

EMR S3 DynamoDB
Amazon Amazon Amazon

Redshift Glacier RDS
AWS
Amazon Amazon
Data Pipeline
Kinesis CloudSearch
Cassandra

Simplify data analytics flow
Multiple stages
Storage decoupled from processing
Ingest Store Analyze Visualize

Data Answers
Time

Ingest Store Process/Analyze Visualize
App Server
Web Server Amazon

S3 Glacier
Devices
Cassandra
DynamoDB
Data Pipeline Amazon

EMR
Amazon Redshift
CloudSearch
RDS
Kinesis
enabled app
Spark Amazon Storm
Streaming Kinesis
Amazon Kinesis Kafka Connector
AWS big data portfolio
Collect / Ingest Store Process / Analyze Visualize / Report
Amazon Kinesis Amazon S3 Amazon EMR Amazon EC2
AWS Import/Export Amazon DynamoDB Amazon AWS

Redshift Data Pipeline
AWS Direct Connect Amazon Glacier
Amazon SQS Amazon RDS

Industries using AWS for data analysis
Oil and Gas Retail/Consumer Life Sciences Online Media
Mobile / Cable Financial Publishing Media
Telecom Industrial Entertainment Scientific
Services Advertising
Social Network
Manufacturing Hospitality Exploration Gaming

Ingest: The act of collecting and storing data

Types of data ingest
• Transactional
Apps
Database
– Database reads/writes
Devices
• File
– Media files; log files Cloud
Logging Frameworks
Storage
• Stream
– Click-stream logs (sets of Stream
events) Storage

Real-time processing of streaming data
High throughput
Elastic
Amazon
Kinesis Easy to use
Connectors for EMR, S3, Amazon Redshift,
DynamoDB

Sending and reading data from Amazon
Kinesis streams
Sending Reading
HTTP Post Get* APIs
AWS SDK Kinesis Client Library

+
Connector Library
LOG4J
Apache
Storm
Flume
Amazon Elastic
Fluentd MapReduce
Write Read
AWS Partners for data ingest, load, and
transformation
Flume, Sqoop
Hparser, Big Data Edition

Storage

Cloud database and storage tier anti-pattern
Client Tier
App/Web Tier
Database & Storage Tier

Cloud database and storage tier — use the right
tool for the job!
Client Tier
App/Web Tier
Data &Tier
Database Storage Tier
Cache SQL NoSQL Search
Blob Store Hadoop/HDFS

Cloud database and storage tier — use the right
tool for the job!
Database & Storage Tier
Amazon
ElastiCache Amazon RDS Amazon
Amazon CloudSearch
DynamoDB
Amazon
HDFS on Amazon EMR
Amazon S3 Glacier
Store anything
Object storage
Amazon
S3 Scalable
Designed for 99.999999999% durability

Aggregate all data in S3 surrounded by a collection of the right tools
• No limit on the number of objects

• Object size up to 5 TB Amazon Amazon AWS
EMR Kinesis Data Pipeline
• Central data storage for all
systems
• High bandwidth Amazon S3 Amazon
Redshift
Amazon
DynamoDB
Amazon RDS
• 99.999999999% durability
• Versioning; lifecycle policies
• Amazon Glacier integration Cassandra Storm Spark
Streaming

Fully managed NoSQL database service
Built on solid-state drives (SSDs)
Amazon Consistent low-latency performance

DynamoDB
Any throughput rate
No storage limits

DynamoDB: managed high availability and
durability
• Scaling without downtime

• Automatic sharding
• Security inspections, patches,
upgrades
• Automatic hardware failover
• Multi-AZ replication
• Hardware configuration
designed specifically for
DynamoDB
• Performance tuning
Relational databases
Fully managed; zero admin
Amazon MySQL, PostgreSQL, Oracle, SQL Server
RDS
Aurora

Process and analyze

Processing frameworks
• Batch processing
– Take large amount (>100 TB) of cold data and ask questions
– Takes minutes or hours to get answers back
– Example: Generating hourly, daily, weekly reports
• Stream processing (real-time)

– Take small amount of hot data and ask questions
– Takes short amount of time to get your answer back
– Example: 1 min metrics

Processing frameworks
MPP
MPP
• Batch processing/analytic
– Amazon Redshift
– Amazon EMR (Hadoop)
– Spark, Hive/Tez, Pig, Impala, Presto, ….
Hadoop
• Stream processing
– Amazon Kinesis client and connector library
Stream Processing
– Spark Streaming
– Storm (+Trident)

Columnar data warehouse
ANSI SQL compatible
Massively parallel
Amazon
Redshift Petabyte scale
Fully managed
Very cost-effective

Amazon Redshift architecture
• Leader Node
– SQL endpoint
– Stores metadata JDBC/ODBC
– Coordinates query execution
• Compute Nodes
– Local, columnar storage
– Execute queries in parallel 10 GigE
(HPC)
– Load, backup, restore via
Amazon S3
– Parallel load from Amazon DynamoDB
• Hardware optimized for data processing
Ingestion
• Two hardware platforms Backup
– DS2 (dense storage): HDD; scale to 1.6PB Restore
– DC1 (dense compute): SSD; scale to 256TB

Hadoop/HDFS clusters
Hive, Pig, Impala, HBase
Easy to use; fully managed
Amazon On-demand and spot pricing
Elastic
MapReduce
Tight integration with S3,
DynamoDB, and Amazon Kinesis

How does EMR work?
2. Choose: Hadoop distribution, #
of nodes, types of nodes, Hadoop
apps like Hive/Pig/HBase
1. Put the data EMR Cluster

into S3
S3
3. Launch the cluster using

the EMR console, CLI, SDK, or
APIs
4. Get the output
from S3

The Hadoop ecosystem works with EMR

Partners – advanced analytics

Visualize

AWS Partners for BI & data visualization

Putting it all together

Demo
Logs Amazon EMR Amazon Redshift – Visualization

as ETL Grid Production DWH
Traffic Statistics and Analysis

Demo

ICAO and Hadoop
Marco Merens
Chief (Acting) Integrated Analysis
International Civil Aviation Organization

ICAO in the cloud

Cloudability principles at ICAO
1. What comes from the cloud, can stay in the

cloud
2. What comes from in-house
A. should stay in-house if private, or
B. can be synced with the cloud if public

Data sync
Metrics
Data sync Data
Basic UI FancyUI
Create
Read Read
Update
Delete

EMR example: blended accident list
Collect Map Reduce Publish
Key Priority

Input format
XML
<?xml version="1.0" encoding="utf-8"?>
<root>
<ADREP><FilingInformation
State="XX"><ReportingOrganization>Ascend</ReportingOrganization>
<StateFileNumber>S1982045</StateFileNumber>
<Headline>MU-2, Collision with high ground, (near) Kelowna</Headline>
</FilingInformation>
….
</root>
CSV
|26/12/2001|Germany|Germany|"ICE:Icing"|Accident|Fatal|8|Germany|Bremerhav
en|D-IAAI|"BRITTEN NORMAN"||||"2 251 to 5 700 Kg"|Scheduled|Airplane|Take-
off||
…
Collect
Use linux
crontab
to schedule
#!/bin/sh
wget "http://somexml" -qO- | tr -d "\n" | tr -d "\r" |
sed "s#<Accident>#\n<Accident>#g" > tmp
aws s3 put tmp s3://accidents/input/source1
Amazon S3
…….
Amazon EC2
Make One XML

element per line for
EMR
EMR command line
elastic-mapreduce
--create
--bootstrap-action
s3://elasticmapreduce/samples/node/install-node-bin-
x86.sh
--instance-type m1.small --instance-count 3
--json job.json
--put /home/ec2-user/key/newtest.pem
--to /home/hadoop
--enable-debugging
Put ssh key to hadoop
if you need to remote
sh
EMR json config file
[{
"Name": "Make accident map",
"ActionOnFailure": "CANCEL_AND_WAIT",
"HadoopJarStep": {
"Jar":"/home/hadoop/contrib/streaming/hadoop-streaming.jar",
"Args": [
"-input",
"s3://accidentstats/input/*", …
Move the results
]},{ from S3 to
"Name": "Store in mongo", somewhere else
"ActionOnFailure": "CANCEL_AND_WAIT",
"HadoopJarStep": {
"Jar":"s3://elasticmapreduce/libs/script-runner/script-runner.jar" ,
"Args": [
"s3://edmscripts/uploadtomongo.sh",
"accidentstats/output",
"NEWACCIDENTLIST"
]}}
Map #!/usr/bin/env node
function treatline(line) {
If (line.indexOf(“<ADREP>”))
sourceX {
source1(line)
}
……
Function source1(line)
{
Amazon S3 Var data=xml2json(line)
data.records.forEach(function(v){
Var el={ Date:v.Date,
Registration:v.Registration,
Model:v.Model,
Source:”Source1”,
Priority:1
}
var key=el.Date+”#”+el.Registration
mapped process.stdout.write(key+”/t”+JSON.stringify(el)
Amazon Elastic )
})
MapReduce
}
Reduce #!/usr/bin/env node
Var oldkey,key,array=[]
function treatline(line) {
Mapped and key=line.split(“/t”)[0]
data=JSON.parse(line.split(“/t”)[1])
sorted If ((key==oldkey) || !oldkey)
{
array.push(data)}
Else {
treat(array)
array=[]}
oldkey=key
Amazon Elastic ……}
MapReduce
Function treat(array)
{
el={}
array=array.sort(prioritysort)
array.forEach(function(v){
Reduced el=updateresult(el,v)
})
Amazon S3 process.stdout.write(JSON.stringify(el)+”\n”)
} AWS Government, Education, and Nonprofit Symposium
Real-time statistics
Amazon Elastic
MapReduce

Thank You.
This presentation will be loaded to SlideShare the week following the Symposium.
http://www.slideshare.net/AmazonWebServices


Big Data and Analytics On AWS: KD Singh Solutions Architect Amazon Web Services

Uploaded by

Document Information

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Big Data and Analytics On AWS: KD Singh Solutions Architect Amazon Web Services

Uploaded by

Copyright:

Available Formats

Big Data and Analytics on

AWS Government, Education, and Nonprofit Symposium

Amazon Amazon Amazon

Amazon Amazon Amazon

AWS Government, Education, and Nonprofit Symposium

Ingest Store Analyze Visualize

AWS Government, Education, and Nonprofit Symposium

Web Server Amazon

Data Pipeline Amazon

Amazon Kinesis Amazon S3 Amazon EMR Amazon EC2

AWS Import/Export Amazon DynamoDB Amazon AWS

AWS Direct Connect Amazon Glacier

Amazon SQS Amazon RDS

AWS Government, Education, and Nonprofit Symposium

AWS Government, Education, and Nonprofit Symposium

AWS Government, Education, and Nonprofit Symposium

AWS Government, Education, and Nonprofit Symposium

AWS Government, Education, and Nonprofit Symposium

HTTP Post Get* APIs

AWS SDK Kinesis Client Library

Hparser, Big Data Edition

AWS Government, Education, and Nonprofit Symposium

AWS Government, Education, and Nonprofit Symposium

Database & Storage Tier

AWS Government, Education, and Nonprofit Symposium

Cache SQL NoSQL Search

Blob Store Hadoop/HDFS

AWS Government, Education, and Nonprofit Symposium

Database & Storage Tier

AWS Government, Education, and Nonprofit Symposium

• No limit on the number of objects

AWS Government, Education, and Nonprofit Symposium

Amazon Consistent low-latency performance

AWS Government, Education, and Nonprofit Symposium

• Scaling without downtime

AWS Government, Education, and Nonprofit Symposium

AWS Government, Education, and Nonprofit Symposium

• Stream processing (real-time)

AWS Government, Education, and Nonprofit Symposium

AWS Government, Education, and Nonprofit Symposium

AWS Government, Education, and Nonprofit Symposium

• Hardware optimized for data processing

AWS Government, Education, and Nonprofit Symposium

AWS Government, Education, and Nonprofit Symposium

1. Put the data EMR Cluster

3. Launch the cluster using

AWS Government, Education, and Nonprofit Symposium

AWS Government, Education, and Nonprofit Symposium

AWS Government, Education, and Nonprofit Symposium

AWS Government, Education, and Nonprofit Symposium

AWS Government, Education, and Nonprofit Symposium

AWS Government, Education, and Nonprofit Symposium

Logs Amazon EMR Amazon Redshift – Visualization

AWS Government, Education, and Nonprofit Symposium

AWS Government, Education, and Nonprofit Symposium

AWS Government, Education, and Nonprofit Symposium

AWS Government, Education, and Nonprofit Symposium

1. What comes from the cloud, can stay in the

AWS Government, Education, and Nonprofit Symposium

AWS Government, Education, and Nonprofit Symposium

Collect Map Reduce Publish

AWS Government, Education, and Nonprofit Symposium

Make One XML