
Athena

=======

HDFS file (data) + Schema (metadata) = Tabular view

Data kept in S3 + Schema built on top of it = Tabular view

We can query it like a normal SQL table.

We can query S3 data using SQL-style syntax.
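For illustration, a minimal sketch of submitting such a query from Python with boto3; the table name trendytech.students and the S3 result location are assumptions, not from these notes:

import boto3

# Athena runs the SQL against the data in S3 and writes the result set
# to an S3 location of our choice.
athena = boto3.client("athena")

response = athena.start_query_execution(
    QueryString="SELECT * FROM trendytech.students LIMIT 10",
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/output/"},
)
print(response["QueryExecutionId"])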

EMR - we create a cluster of machines.

For Athena we do not have to create a cluster.

It is serverless - no infrastructure is required on our side.

In the case of EMR we pay based on nodes; each node has a cost associated per hour.

With Athena we are charged based on the amount of data scanned.

In S3 we can store our data in CSV, JSON, Parquet, ORC, etc.

Athena is perfect for infrequent queries.

We can focus only on our actual problem statement and do not have to worry about managing the infrastructure.

We pay per query, based on the amount of data scanned.


$5 per TB scanned.

Example: a query that scans 100 MB costs

100 * 5 / (1000 * 1000) = $0.0005

We are charged for a minimum of 10 MB per query, so anything smaller is still billed as 10 MB.
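As a quick check of that arithmetic, a small sketch in Python (the function is hypothetical and uses the 1,000,000 MB per TB approximation from these notes):

# $5 per TB scanned, with a minimum of 10 MB billed per query.
PRICE_PER_TB = 5.0
MIN_BILLED_MB = 10.0

def athena_query_cost(scanned_mb: float) -> float:
    billed_mb = max(scanned_mb, MIN_BILLED_MB)
    return billed_mb * PRICE_PER_TB / (1000 * 1000)

print(athena_query_cost(100))  # 100 MB scanned   -> 0.0005 dollars
print(athena_query_cost(2))    # below the minimum -> billed as 10 MB -> 0.00005 dollars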

There are no charges for DDL statements.

If a query fails, we are not charged.

What if we cancel the query?

If we cancel a query, we are charged for the data that was already scanned.

Problem 1: I want to find the sum of scores for each candidate.

For this we have to group the data by student and then apply SUM on the scores.

Problem 2: I want to calculate the average marks scored in Maths.
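Hedged sketches of both queries, run through boto3; the table trendytech.students and the column names student_name, subject and score are assumptions based on the columns used later in these notes:

import boto3

athena = boto3.client("athena")

# Problem 1: group by student, then SUM the scores.
sum_per_student = """
    SELECT student_name, SUM(score) AS total_score
    FROM trendytech.students
    GROUP BY student_name
"""

# Problem 2: filter down to Maths, then take the average score.
avg_maths = """
    SELECT AVG(score) AS average_marks
    FROM trendytech.students
    WHERE subject = 'Math'
"""

for query in (sum_per_student, avg_maths):
    athena.start_query_execution(
        QueryString=query,
        ResultConfiguration={"OutputLocation": "s3://my-athena-results/output/"},
    )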

When the data is stored in CSV format (non-partitioned), we have to go through a full data scan, which incurs more charges.

Our goal should be to organize the data in a way that we do not have to scan all of it.

We want to minimize the data scanned.


Athena session - 2
===================

We were scanning the entire data.

How to minimize the data scanning
=================================

1. Skip some of the columns from being scanned - column pruning.

When we use row-based file formats we cannot skip columns; we have to read all of them.

Column pruning is therefore not possible with row-based file formats.

Column-based file formats: Parquet, ORC, etc.

2. Skip some of the rows - partitioning.

With the data split into partitions, we can skip some of the partitions - partition pruning.

In our example we can partition on subject, so that we only have to scan the Math folder.

I will write a simple Spark program that takes the CSV as input and gives partitioned (by subject) data in Parquet format as output.
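A minimal PySpark sketch of that job, assuming hypothetical S3 paths and a header row in the CSV:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("csv_to_partitioned_parquet").getOrCreate()

# Read the raw CSV student data from S3.
students = (
    spark.read
    .option("header", "true")
    .option("inferSchema", "true")
    .csv("s3://my-bucket/students/raw_csv/")
)

# Write it back as Parquet, partitioned by subject, so that Athena can skip
# every subject=... folder except the one a query actually needs.
(
    students.write
    .mode("overwrite")
    .partitionBy("subject")
    .parquet("s3://my-bucket/students/partitioned_parquet/")
)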

When we run

SELECT year FROM trendytech.students_partitioned

we see that we are scanning just 3 KB.

First of all, we are skipping all the other columns.


On top of that, Parquet internally uses dictionary encoding, which reduces the data scanned even further.
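To verify how much data a query actually scanned, we can read the statistics back from Athena; a sketch with boto3, again with an assumed result location:

import time
import boto3

athena = boto3.client("athena")

query_id = athena.start_query_execution(
    QueryString="SELECT year FROM trendytech.students_partitioned",
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/output/"},
)["QueryExecutionId"]

# Poll until the query reaches a terminal state, then print the bytes scanned.
while True:
    execution = athena.get_query_execution(QueryExecutionId=query_id)["QueryExecution"]
    state = execution["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

print(state, execution["Statistics"]["DataScannedInBytes"])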

SELECT subject, AVG(score) AS average_marks FROM trendytech.students_partitioned GROUP BY subject HAVING subject = 'Math'

So in this case we are doing two levels of pruning:

column skipping - Parquet (columnar format)

partition skipping - we partitioned the data by subject

3. Use compression techniques like Snappy, Gzip, etc.
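Compression plugs into the same Spark write from the earlier sketch; for example, Snappy-compressed Parquet (the paths are still assumptions):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("compressed_partitioned_write").getOrCreate()
students = spark.read.option("header", "true").csv("s3://my-bucket/students/raw_csv/")

# Snappy-compressed Parquet files are smaller in S3, so Athena scans fewer bytes.
(
    students.write
    .mode("overwrite")
    .partitionBy("subject")
    .option("compression", "snappy")
    .parquet("s3://my-bucket/students/partitioned_parquet/")
)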

In this session we created the schema manually.

In the next session we will see how to infer the schema of the data using a crawler.

Athena session - 3
===================

Last time we created the schema manually.

Is there an automated way to infer the schema?

AWS Glue
=========

Glue acts as a metastore - the metadata is stored in the Glue catalog (the metastore).

Glue also provides a crawler, which can crawl your data and infer the schema from it.

AWS Glue is one service and S3 is another, so we have to give AWS Glue permission to access S3.
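A hedged sketch of creating and starting a crawler with boto3; the crawler name, database, S3 path and IAM role ARN are all assumptions (the role is what carries the S3 permission mentioned above):

import boto3

glue = boto3.client("glue")

# The crawler scans the S3 path, infers the schema, and registers a table
# definition in the Glue catalog (the metastore).
glue.create_crawler(
    Name="students_crawler",
    Role="arn:aws:iam::123456789012:role/glue-s3-access-role",
    DatabaseName="trendytech",
    Targets={"S3Targets": [{"Path": "s3://my-bucket/students/partitioned_parquet/"}]},
)

glue.start_crawler(Name="students_crawler")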

AWS Athena along with AWS Glue:

Athena gives you a panel to query your table.

The data of the table is in S3.

The metadata is stored in Glue (we have inferred the schema using Glue).

Athena
=======

How to create a normal table manually on CSV data residing in S3.

How to create a partitioned table on the Parquet files to get optimization, so that we scan less data.

How to infer the schema using Glue.

The Glue catalog holds the metadata - like the Hive Metastore.

S3 holds the actual data - like HDFS.
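A hedged recap in code of the manual table creation from the first two sessions, submitted through Athena with boto3; the table names, column names and S3 paths are assumptions kept consistent with the sketches above:

import boto3

athena = boto3.client("athena")

def run(query: str) -> None:
    athena.start_query_execution(
        QueryString=query,
        ResultConfiguration={"OutputLocation": "s3://my-athena-results/output/"},
    )

# Session 1: a normal table created manually on top of the CSV data in S3.
run("""
    CREATE EXTERNAL TABLE IF NOT EXISTS trendytech.students (
        student_name STRING,
        subject      STRING,
        score        INT,
        year         INT
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    LOCATION 's3://my-bucket/students/raw_csv/'
    TBLPROPERTIES ('skip.header.line.count' = '1')
""")

# Session 2: a partitioned table on the Parquet output, so queries can prune partitions.
run("""
    CREATE EXTERNAL TABLE IF NOT EXISTS trendytech.students_partitioned (
        student_name STRING,
        score        INT,
        year         INT
    )
    PARTITIONED BY (subject STRING)
    STORED AS PARQUET
    LOCATION 's3://my-bucket/students/partitioned_parquet/'
""")

# Register the subject=... folders written by Spark as partitions of the table.
run("MSCK REPAIR TABLE trendytech.students_partitioned")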
