You are on page 1of 49

Big Data and Analytics on

AWS
KD Singh
Solutions Architect
Amazon Web Services

AWS Government, Education, and Nonprofit Symposium


©2015, Amazon Web Services, Inc. or its affiliates. All rights reserved. Washington, DC I June 25-26, 2015
What is big data?
When your data sets become so large that you have to start
innovating around how to collect, store, organize, analyze, and
share it
• Velocity • Change Rate
– Rate of data flow in – How much is the data changing?
• Latency • Processing Requirements
– High or Low – How much computation?
• Volume • Durability
– High or Low – Preservation of source data?
• Variety • Availability
– Diversity of source data – Tolerance for downtime?
• Item Size • Growth Rate
– KB or MB – Rate of data growth?
• Request Rate
– Access patterns • Views
– The diversity of consumers?
AWS Government, Education, and Nonprofit Symposium
Washington, DC I June 25-26, 2015
Plethora of tools

Amazon Amazon Amazon


EMR S3 DynamoDB

Amazon Amazon Amazon


Redshift Glacier RDS

AWS
Amazon Amazon
Data Pipeline
Kinesis CloudSearch

Cassandra

AWS Government, Education, and Nonprofit Symposium


Washington, DC I June 25-26, 2015
Simplify data analytics flow
Multiple stages
Storage decoupled from processing

Ingest Store Analyze Visualize


Data Answers

Time

AWS Government, Education, and Nonprofit Symposium


Washington, DC I June 25-26, 2015
Ingest Store Process/Analyze Visualize
App Server

Web Server Amazon


S3 Glacier

Devices

Cassandra

DynamoDB

Data Pipeline Amazon


EMR
Amazon Redshift
CloudSearch
RDS

Kinesis
enabled app
Spark Amazon Storm
Streaming Kinesis
Amazon Kinesis Kafka Connector
AWS Government, Education, and Nonprofit Symposium
Washington, DC I June 25-26, 2015
AWS big data portfolio
Collect / Ingest Store Process / Analyze Visualize / Report

Amazon Kinesis Amazon S3 Amazon EMR Amazon EC2

AWS Import/Export Amazon DynamoDB Amazon AWS


Redshift Data Pipeline

AWS Direct Connect Amazon Glacier

Amazon SQS Amazon RDS

AWS Government, Education, and Nonprofit Symposium


Washington, DC I June 25-26, 2015
Industries using AWS for data analysis
Oil and Gas Retail/Consumer Life Sciences Online Media
Mobile / Cable Financial Publishing Media
Telecom Industrial Entertainment Scientific
Services Advertising
Social Network
Manufacturing Hospitality Exploration Gaming

AWS Government, Education, and Nonprofit Symposium


Washington, DC I June 25-26, 2015
Ingest: The act of collecting and storing data

AWS Government, Education, and Nonprofit Symposium


Washington, DC I June 25-26, 2015
Types of data ingest
• Transactional

Apps
Database
– Database reads/writes

Devices
• File
– Media files; log files Cloud

Logging Frameworks
Storage
• Stream
– Click-stream logs (sets of Stream
events) Storage

AWS Government, Education, and Nonprofit Symposium


Washington, DC I June 25-26, 2015
Real-time processing of streaming data
High throughput
Elastic
Amazon
Kinesis Easy to use
Connectors for EMR, S3, Amazon Redshift,
DynamoDB

AWS Government, Education, and Nonprofit Symposium


Washington, DC I June 25-26, 2015
Sending and reading data from Amazon
Kinesis streams
Sending Reading

HTTP Post Get* APIs

AWS SDK Kinesis Client Library


+
Connector Library
LOG4J
Apache
Storm
Flume

Amazon Elastic
Fluentd MapReduce

Write Read
AWS Government, Education, and Nonprofit Symposium
Washington, DC I June 25-26, 2015
AWS Partners for data ingest, load, and
transformation

Flume, Sqoop

Hparser, Big Data Edition

AWS Government, Education, and Nonprofit Symposium


Washington, DC I June 25-26, 2015
Storage

AWS Government, Education, and Nonprofit Symposium


Washington, DC I June 25-26, 2015
Cloud database and storage tier anti-pattern
Client Tier

App/Web Tier

Database & Storage Tier

AWS Government, Education, and Nonprofit Symposium


Washington, DC I June 25-26, 2015
Cloud database and storage tier — use the right
tool for the job!
Client Tier

App/Web Tier

Data &Tier
Database Storage Tier

Cache SQL NoSQL Search

Blob Store Hadoop/HDFS

AWS Government, Education, and Nonprofit Symposium


Washington, DC I June 25-26, 2015
Cloud database and storage tier — use the right
tool for the job!

Database & Storage Tier

Amazon
ElastiCache Amazon RDS Amazon
Amazon CloudSearch
DynamoDB

Amazon
HDFS on Amazon EMR
Amazon S3 Glacier
AWS Government, Education, and Nonprofit Symposium
Washington, DC I June 25-26, 2015
Store anything
Object storage
Amazon
S3 Scalable
Designed for 99.999999999% durability

AWS Government, Education, and Nonprofit Symposium


Washington, DC I June 25-26, 2015
Aggregate all data in S3 surrounded by a collection of the right tools

• No limit on the number of objects


• Object size up to 5 TB Amazon Amazon AWS
EMR Kinesis Data Pipeline
• Central data storage for all
systems
• High bandwidth Amazon S3 Amazon
Redshift
Amazon
DynamoDB
Amazon RDS

• 99.999999999% durability
• Versioning; lifecycle policies
• Amazon Glacier integration Cassandra Storm Spark
Streaming

AWS Government, Education, and Nonprofit Symposium


Washington, DC I June 25-26, 2015
Fully managed NoSQL database service
Built on solid-state drives (SSDs)

Amazon Consistent low-latency performance


DynamoDB
Any throughput rate
No storage limits

AWS Government, Education, and Nonprofit Symposium


Washington, DC I June 25-26, 2015
DynamoDB: managed high availability and
durability

• Scaling without downtime


• Automatic sharding
• Security inspections, patches,
upgrades
• Automatic hardware failover
• Multi-AZ replication
• Hardware configuration
designed specifically for
DynamoDB
• Performance tuning
AWS Government, Education, and Nonprofit Symposium
Washington, DC I June 25-26, 2015
Relational databases
Fully managed; zero admin
Amazon MySQL, PostgreSQL, Oracle, SQL Server
RDS
Aurora

AWS Government, Education, and Nonprofit Symposium


Washington, DC I June 25-26, 2015
Process and analyze

AWS Government, Education, and Nonprofit Symposium


Washington, DC I June 25-26, 2015
Processing frameworks

• Batch processing
– Take large amount (>100 TB) of cold data and ask questions
– Takes minutes or hours to get answers back
– Example: Generating hourly, daily, weekly reports

• Stream processing (real-time)


– Take small amount of hot data and ask questions
– Takes short amount of time to get your answer back
– Example: 1 min metrics

AWS Government, Education, and Nonprofit Symposium


Washington, DC I June 25-26, 2015
Processing frameworks

MPP
MPP
• Batch processing/analytic
– Amazon Redshift
– Amazon EMR (Hadoop)
– Spark, Hive/Tez, Pig, Impala, Presto, ….

Hadoop
• Stream processing
– Amazon Kinesis client and connector library

Stream Processing
– Spark Streaming
– Storm (+Trident)

AWS Government, Education, and Nonprofit Symposium


Washington, DC I June 25-26, 2015
Columnar data warehouse
ANSI SQL compatible
Massively parallel
Amazon
Redshift Petabyte scale
Fully managed
Very cost-effective

AWS Government, Education, and Nonprofit Symposium


Washington, DC I June 25-26, 2015
Amazon Redshift architecture
• Leader Node
– SQL endpoint
– Stores metadata JDBC/ODBC
– Coordinates query execution

• Compute Nodes
– Local, columnar storage
– Execute queries in parallel 10 GigE
(HPC)
– Load, backup, restore via
Amazon S3
– Parallel load from Amazon DynamoDB

• Hardware optimized for data processing

Ingestion
• Two hardware platforms Backup
– DS2 (dense storage): HDD; scale to 1.6PB Restore
– DC1 (dense compute): SSD; scale to 256TB

AWS Government, Education, and Nonprofit Symposium


Washington, DC I June 25-26, 2015
Hadoop/HDFS clusters
Hive, Pig, Impala, HBase
Easy to use; fully managed
Amazon On-demand and spot pricing
Elastic
MapReduce
Tight integration with S3,
DynamoDB, and Amazon Kinesis

AWS Government, Education, and Nonprofit Symposium


Washington, DC I June 25-26, 2015
How does EMR work?
2. Choose: Hadoop distribution, #
of nodes, types of nodes, Hadoop
apps like Hive/Pig/HBase

1. Put the data EMR Cluster


into S3
S3

3. Launch the cluster using


the EMR console, CLI, SDK, or
APIs
4. Get the output
from S3

AWS Government, Education, and Nonprofit Symposium


Washington, DC I June 25-26, 2015
The Hadoop ecosystem works with EMR

AWS Government, Education, and Nonprofit Symposium


Washington, DC I June 25-26, 2015
Partners – advanced analytics

AWS Government, Education, and Nonprofit Symposium


Washington, DC I June 25-26, 2015
Visualize

AWS Government, Education, and Nonprofit Symposium


Washington, DC I June 25-26, 2015
AWS Partners for BI & data visualization

AWS Government, Education, and Nonprofit Symposium


Washington, DC I June 25-26, 2015
Putting it all together

AWS Government, Education, and Nonprofit Symposium


Washington, DC I June 25-26, 2015
AWS Government, Education, and Nonprofit Symposium
Washington, DC I June 25-26, 2015
Demo

Logs Amazon EMR Amazon Redshift – Visualization


as ETL Grid Production DWH
Traffic Statistics and Analysis

AWS Government, Education, and Nonprofit Symposium


Washington, DC I June 25-26, 2015
Demo

AWS Government, Education, and Nonprofit Symposium


Washington, DC I June 25-26, 2015
ICAO and Hadoop
Marco Merens
Chief (Acting) Integrated Analysis
International Civil Aviation Organization

AWS Government, Education, and Nonprofit Symposium


Washington, DC I June 25-26, 2015
ICAO in the cloud

AWS Government, Education, and Nonprofit Symposium


Washington, DC I June 25-26, 2015
Cloudability principles at ICAO

1. What comes from the cloud, can stay in the


cloud
2. What comes from in-house
A. should stay in-house if private, or
B. can be synced with the cloud if public

AWS Government, Education, and Nonprofit Symposium


Washington, DC I June 25-26, 2015
Data sync
Metrics
Data sync Data

Basic UI FancyUI

Create
Read Read
Update
Delete

AWS Government, Education, and Nonprofit Symposium


Washington, DC I June 25-26, 2015
EMR example: blended accident list

Collect Map Reduce Publish

Key Priority

AWS Government, Education, and Nonprofit Symposium


Washington, DC I June 25-26, 2015
Input format
XML
<?xml version="1.0" encoding="utf-8"?>
<root>
<ADREP><FilingInformation
State="XX"><ReportingOrganization>Ascend</ReportingOrganization>
<StateFileNumber>S1982045</StateFileNumber>
<Headline>MU-2, Collision with high ground, (near) Kelowna</Headline>
</FilingInformation>
….
</root>

CSV
|26/12/2001|Germany|Germany|"ICE:Icing"|Accident|Fatal|8|Germany|Bremerhav
en|D-IAAI|"BRITTEN NORMAN"||||"2 251 to 5 700 Kg"|Scheduled|Airplane|Take-
off||

AWS Government, Education, and Nonprofit Symposium
Washington, DC I June 25-26, 2015
Collect
Use linux
crontab
to schedule

#!/bin/sh
wget "http://somexml" -qO- | tr -d "\n" | tr -d "\r" |
sed "s#<Accident>#\n<Accident>#g" > tmp
aws s3 put tmp s3://accidents/input/source1
Amazon S3
…….
Amazon EC2

Make One XML


element per line for
EMR
AWS Government, Education, and Nonprofit Symposium
Washington, DC I June 25-26, 2015
EMR command line
elastic-mapreduce
--create
--bootstrap-action
s3://elasticmapreduce/samples/node/install-node-bin-
x86.sh
--instance-type m1.small --instance-count 3
--json job.json
--put /home/ec2-user/key/newtest.pem
--to /home/hadoop
--enable-debugging
Put ssh key to hadoop
if you need to remote
sh
AWS Government, Education, and Nonprofit Symposium
Washington, DC I June 25-26, 2015
EMR json config file
[{
"Name": "Make accident map",
"ActionOnFailure": "CANCEL_AND_WAIT",
"HadoopJarStep": {
"Jar":"/home/hadoop/contrib/streaming/hadoop-streaming.jar",
"Args": [
"-input",
"s3://accidentstats/input/*", …
Move the results
]},{ from S3 to
"Name": "Store in mongo", somewhere else
"ActionOnFailure": "CANCEL_AND_WAIT",
"HadoopJarStep": {
"Jar":"s3://elasticmapreduce/libs/script-runner/script-runner.jar" ,
"Args": [
"s3://edmscripts/uploadtomongo.sh",
"accidentstats/output",
"NEWACCIDENTLIST"
]}}
AWS Government, Education, and Nonprofit Symposium
Washington, DC I June 25-26, 2015
Map #!/usr/bin/env node
function treatline(line) {
If (line.indexOf(“<ADREP>”))
sourceX {
source1(line)
}
……

Function source1(line)
{
Amazon S3 Var data=xml2json(line)
data.records.forEach(function(v){
Var el={ Date:v.Date,

Registration:v.Registration,
Model:v.Model,
Source:”Source1”,
Priority:1
}
var key=el.Date+”#”+el.Registration
mapped process.stdout.write(key+”/t”+JSON.stringify(el)
Amazon Elastic )
})
MapReduce
}
AWS Government, Education, and Nonprofit Symposium
Washington, DC I June 25-26, 2015
Reduce #!/usr/bin/env node
Var oldkey,key,array=[]
function treatline(line) {
Mapped and key=line.split(“/t”)[0]
data=JSON.parse(line.split(“/t”)[1])
sorted If ((key==oldkey) || !oldkey)
{
array.push(data)}
Else {
treat(array)
array=[]}
oldkey=key
Amazon Elastic ……}
MapReduce

Function treat(array)
{
el={}
array=array.sort(prioritysort)
array.forEach(function(v){
Reduced el=updateresult(el,v)
})
Amazon S3 process.stdout.write(JSON.stringify(el)+”\n”)
} AWS Government, Education, and Nonprofit Symposium
Washington, DC I June 25-26, 2015
Real-time statistics

Amazon Elastic
MapReduce

AWS Government, Education, and Nonprofit Symposium


Washington, DC I June 25-26, 2015
Thank You.
This presentation will be loaded to SlideShare the week following the Symposium.
http://www.slideshare.net/AmazonWebServices

AWS Government, Education, and Nonprofit Symposium


Washington, DC I June 25-26, 2015

You might also like