
Introduction to Amazon Redshift Spectrum


Anurag Gupta, Vice President
Amazon Athena, Amazon CloudSearch, AWS Data Pipeline, Amazon Elasticsearch Service, Amazon EMR,
Amazon Redshift, AWS Glue, Amazon Aurora, Amazon RDS for MariaDB, RDS for MySQL, RDS for PostgreSQL

April 19, 2017

© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
What is Big Data?

When your data sets become so large and diverse that you have to start innovating around how to collect, store, process, analyze and share them
It’s never been easier to generate vast amounts of data

[Diagram: pipeline stages: Generate, Collect & Store, Analyze, Collaborate & Act]

Generate: individual AWS customers generate over a PB/day


Amazon S3 lets you collect and store all this data

[Diagram: pipeline stages: Generate, Collect & Store, Analyze, Collaborate & Act]

Generate: individual AWS customers generating over a PB/day
Collect & Store: store exabytes of data in S3


But how do you analyze it?

[Diagram: pipeline stages: Generate, Collect & Store, Analyze, Collaborate & Act]

Generate: individual AWS customers generating over a PB/day
Collect & Store: store exabytes of data in S3
Analyze: highly constrained


The Dark Data Problem
Most generated data is unavailable for analysis
[Chart: data volume by year, 1990 to 2020, comparing generated data with data available for analysis]

Sources:
Gartner: User Survey Analysis: Key Trends Shaping the Future of Data Center Infrastructure Through 2011
IDC: Worldwide Business Analytics Software 2012–2016 Forecast and 2011 Vendor Shares
The tyranny of “OR”

Amazon EMR                            Amazon Redshift

Directly access data in S3            Super-fast local disk performance
Scale out to thousands of nodes       Sophisticated query optimization
Open data formats                     Join-optimized data formats
Popular big data frameworks           Query using standard SQL
Anything you can dream up and code    Optimized for data warehousing

But I don’t want to choose.

I shouldn’t have to choose

I want “all of the above”


I want
sophisticated query optimization and scale-out processing

super-fast performance and support for open formats

the throughput of local disk and the scale of S3


I want all this
From one data processing engine

With my data accessible from all data processing engines

Now and in the future


We’re told “you have to choose”

Pick small clusters for joins or large ones for scans
  Shuffles are expensive

Open formats can’t collocate data for joins
  They have to deal with variable cluster sizes

Query optimization requires statistics
  You can’t determine this for external data

“It’s just physics”
Amazon Redshift Spectrum
Run SQL queries directly against data in S3 using thousands of nodes

Fast @ exabyte scale
Elastic & highly available
On-demand, pay-per-query
High concurrency: Multiple clusters access same data
No ETL: Query data in-place using open file formats
Full Amazon Redshift SQL support

Life of a query

1. Query
SELECT COUNT(*)
FROM S3.EXT_TABLE
GROUP BY…

[Diagram components, top to bottom: JDBC/ODBC; Amazon Redshift; nodes 1, 2, 3, 4 ... N; Amazon S3 (exabyte-scale object storage) and Data Catalog (Apache Hive Metastore)]

Life of a query

2. Query is optimized and compiled at the leader node. Determine what gets run locally and what goes to Amazon Redshift Spectrum

Life of a query

3. Query plan is sent to all compute nodes

Life of a query

4. Compute nodes obtain partition info from Data Catalog; dynamically prune partitions

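
Partition pruning in step 4 can be pictured with a small toy model. The sketch below is not Redshift code and makes no claim about how the service implements pruning. It only illustrates the idea that a predicate on the partition key lets the engine skip whole S3 prefixes before any data is read. The `Partition` class, the `prune_partitions` function, the bucket name, and the `saledate` partition key are all hypothetical.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Partition:
    """One hypothetical catalog entry: a partition key value and its S3 prefix."""
    sale_date: date
    location: str


def prune_partitions(partitions, start, end):
    """Keep only partitions whose key lies in [start, end]; the rest are never scanned."""
    return [p for p in partitions if start <= p.sale_date <= end]


# Hypothetical catalog contents: one partition per day of January 2017.
catalog = [
    Partition(date(2017, 1, d), f"s3://example-bucket/sales/saledate=2017-01-{d:02d}/")
    for d in range(1, 32)
]

# Models a predicate such as: WHERE saledate BETWEEN '2017-01-10' AND '2017-01-12'
selected = prune_partitions(catalog, date(2017, 1, 10), date(2017, 1, 12))

for p in selected:
    print(p.location)
```

In this toy catalog, 3 of the 31 partitions survive pruning, so only those prefixes would need to be scanned.
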
Life of a query

5. Each compute node issues multiple requests to the Amazon Redshift Spectrum layer

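
The fan-out in step 5 can be sketched the same way. This is a toy model under stated assumptions, not the service's protocol: the slides do not say how work is divided among compute nodes or how requests are sized. Here the object keys are split round-robin across three simulated compute nodes, each node issues several concurrent requests, and the partial counts are summed. `scan_object`, `node_count`, `split_round_robin`, the object keys, and the fixed 1,000 rows per object are all hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def scan_object(key):
    """Stand-in for one request to the scan layer; returns a partial row count."""
    # Assumption for illustration only: every object holds 1,000 matching rows.
    return 1000


def node_count(keys, max_requests=4):
    """One simulated compute node issues several requests at once and sums the results."""
    with ThreadPoolExecutor(max_workers=max_requests) as pool:
        return sum(pool.map(scan_object, keys))


def split_round_robin(keys, n_nodes):
    """Assign object keys to nodes in round-robin order."""
    return [keys[i::n_nodes] for i in range(n_nodes)]


# Hypothetical list of S3 objects left after partition pruning.
object_keys = [f"sales/part-{i:04d}.parquet" for i in range(12)]

per_node = split_round_robin(object_keys, 3)
total = sum(node_count(keys) for keys in per_node)
print(total)
```

The point of the model is that the compute nodes only combine partial results, while the per-object work happens in the requests they issue.
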
Life of a query

6. Amazon Redshift Spectrum nodes scan your S3 data
