
Amazon Data Analysis

MEMBERS: Vinay Gupta (3521)
         Yash Patil (3530)
         Yash Thakur (3544)
INDEX
1. Introduction
2. Background of R
3. AWS
4. Use Cases for R on AWS
    Big Data Processing
    Databases
    File Storage
5. Getting Started with AWS in R
6. Connecting to Databases
7. Extracting Text and Tables
8. Uploading Data to Database
INTRODUCTION

 R is a language and environment for statistical computing and graphics.
 It is similar to the S language and environment.
 It generally comes with a command-line interface.
 It provides a wide variety of statistical and graphical techniques and is highly extensible.
 One of R's strengths is the ease with which well-designed, publication-quality plots can be produced.
 R is available as free software in source code form, and it compiles and runs on a wide variety of UNIX platforms and similar systems.
Background of R

 R is used as a leading tool for machine learning, statistics, and data analysis.
 It is a platform-independent language.
 It is open source and free to use.
 R is not only a statistics package; it also integrates with other languages.
 Another important part of the R ecosystem is the development environment RStudio.
 One of the most popular sets of packages in the R ecosystem is the tidyverse.
 The tidyverse packages are designed to let users ingest, transform, and visualize data.
 The R programming language has a vast community of users, and it is growing day by day.
 R is currently one of the most requested programming languages.
AWS

 AWS (Amazon Web Services) is a comprehensive, evolving cloud computing platform.
 AWS spans the three main cloud service models: IaaS (Infrastructure as a Service), PaaS (Platform as a Service), and SaaS (Software as a Service).

 AWS services offer organizations tools such as compute power, database storage, and content delivery.
 AWS was launched in 2006, growing out of the internal infrastructure that Amazon.com built to handle its online retail operations.
 AWS offers many different tools and solutions for enterprises and software developers, which can be used in data centers in up to 190 countries.

How AWS Works

 AWS is separated into distinct services, which makes it easy to manage.
 Each service can be configured in different ways based on the user's needs, and users can view configuration options and individual server maps for an AWS service.
 More than 100 services make up the Amazon Web Services portfolio, including services for compute, databases, infrastructure management, application development, and security.
Use Cases For R On AWS

 Big Data Processing
 Databases
 File Storage
Big Data Processing

 For big data problems, R can be limited by locally available memory; high-memory instance types help here.
 R deals with data in memory by default, so using an instance with more memory can make a problem tractable without requiring changes to code.
 Many problems are also parallelizable; with R's parallel processing packages, modified code can take advantage of instance types with a large number of cores (a minimal sketch follows this list).
 Between Amazon EC2's R-type (memory-optimized) and C-type (compute-optimized) instances, developers can choose an instance type that closely matches their compute and memory workload needs.
 Often, data scientists deal with these big problems only part of the time, and running permanent Amazon EC2 instances or containers would not be cost-effective.
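
A minimal sketch of what that parallelization can look like in R, using the base parallel package; simulate_once() and the job count are hypothetical stand-ins for real work:

library(parallel)

# Hypothetical unit of work: one expensive, independent simulation
simulate_once <- function(i) {
  mean(rnorm(1e6))
}

n_cores <- detectCores()  # e.g. 96 on a large compute-optimized instance

# mclapply() forks one worker per core on UNIX-like systems;
# on Windows, use makeCluster() and parLapply() instead
results <- mclapply(1:1000, simulate_once, mc.cores = n_cores)

Because each call to simulate_once() is independent, the same code scales from a laptop to a many-core EC2 instance just by changing mc.cores.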
DATABASES
 Databases are a valuable resource for data science teams; they provide a single source of truth
for datasets and offer performant reads and writes.

 We can take advantage of popular databases like PostgreSQL through Amazon Relational
Database Service (Amazon RDS), while letting AWS take care of underlying instance and
database maintenance.

 In many cases, R can interact with these services with only small modifications; the tidyverse packages let you write your code irrespective of where it will run, and allow you to retarget the code to perform operations on data sourced from the database (see the sketch after this list).
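
As a sketch of that retargeting, the same dplyr pipeline can run against a database table instead of a local data frame; here con is an open DBI connection (see "Connecting to Databases" below) and sales is a hypothetical table name:

library(dplyr)
library(dbplyr)

# Lazy reference to a database table: no rows are pulled into R yet
sales_db <- tbl(con, "sales")

monthly <- sales_db %>%
  group_by(month) %>%
  summarise(total = sum(amount, na.rm = TRUE)) %>%
  collect()   # translate to SQL, run in the database, fetch the result

Until collect() is called, dbplyr translates the pipeline to SQL and pushes the computation to the database.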
FILE STORAGE

 Lastly, Amazon Simple Storage Service (Amazon S3) allows developers to store raw input files, results, reports, artifacts, and anything else they wouldn't want to store directly in a database (see the sketch below).

 Items stored in S3 are accessible online, which makes sharing resources with collaborators easy, but S3 also offers fine-grained resource permissions so that access is limited to only those who should have it.
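
A minimal sketch of S3 access from R with the Paws SDK; the bucket and key names are hypothetical, and credentials are assumed to be configured already (environment variables, a shared credentials file, or an IAM role):

library(paws)

svc <- s3()  # Amazon S3 client

# Upload a local file as raw bytes
svc$put_object(
  Bucket = "my-analysis-bucket",
  Key    = "reports/results.csv",
  Body   = readBin("results.csv", "raw", n = file.size("results.csv"))
)

# Download it again and write it back to disk
obj <- svc$get_object(Bucket = "my-analysis-bucket", Key = "reports/results.csv")
writeBin(obj$Body, "results-copy.csv")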
AWS Cost & Usage Data!
AWS Cost and Usage Reports can do the following:

 Deliver report files to your Amazon S3 bucket


 Update the report up to three times a day
 Create, retrieve, and delete your reports using the AWS CUR API
 The AWS Cost & Usage Report contains the most comprehensive set of AWS cost and usage data available,
including additional metadata about AWS services, pricing, credit, fees, taxes, discounts, cost categories,
Reserved Instances, and Savings Plans.

 The AWS Cost & Usage Report (CUR) itemizes usage at the account or organization level by product code, usage type, and operation. These costs can be further organized by cost allocation tags and cost categories.

 The AWS Cost & Usage Report is available at an hourly, daily, or monthly level of granularity, as well as at the management or member account level.

 With the right access, users can access the CUR at both the management and member account levels, which saves management account holders from having to generate CUR reports for member accounts (a short Paws sketch follows below).
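
As a hedged sketch, the CUR API can be reached from R through Paws' costandusagereportservice client, assuming credentials with the relevant permissions; this lists the report definitions configured for the account:

library(paws)

cur <- costandusagereportservice()

# DescribeReportDefinitions: list the CUR reports configured for this account
defs <- cur$describe_report_definitions()
str(defs$ReportDefinitions)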
Getting Started with AWS in R

 To use AWS in R, you can use the Paws AWS software development kit, an R package developed by Adam Banker and David Kretch (a minimal example follows below).

 Paws is an unofficial SDK, but it covers most of the same functionality as the official
SDKs for other languages.

 You can also use the official Python SDK, boto3, through the botor and reticulate packages, but you will also need to ensure Python is installed on your machine before using them.
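
A minimal first call with Paws, as a sketch; it assumes AWS credentials are already configured (environment variables, ~/.aws/credentials, or an IAM role):

install.packages("paws")
library(paws)

svc <- s3()           # create an Amazon S3 client
svc$list_buckets()    # sanity check: lists the buckets your credentials can see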
Connecting to Databases

 You can use databases in R by setting up a connection to the database.

 Then you can refer to tables in the database as if they were datasets in R.

 The dplyr package in the tidyverse and the dbplyr database backend are what provide this functionality (see the sketch after this list).
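
A sketch of such a connection, with hypothetical host, database, user, and table names; the DBI, RPostgres, and dplyr packages are assumed to be installed:

library(DBI)
library(RPostgres)
library(dplyr)

con <- dbConnect(
  RPostgres::Postgres(),
  host     = "mydb.example.us-east-1.rds.amazonaws.com",
  port     = 5432,
  dbname   = "analytics",
  user     = "analyst",
  password = Sys.getenv("PGPASSWORD")   # avoid hard-coding credentials
)

orders <- tbl(con, "orders")   # refer to the table as if it were a local dataset
head(orders)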
Extracting Text and Tables
 Here, we need to identify where the tables are, then reconstruct their rows and columns based on the position and spacing of the words or numbers on the page.
 To do this we use Amazon Textract, an AWS-managed AI service, to get data from images and PDFs (a sketch follows this list).
 With the Paws SDK for R, we can get a PDF document’s text using the operation
start_document_text_detection and get a document’s tables and forms using the operation
start_document_analysis.
 These are asynchronous operations, which means that they will initialize text detection and
document analysis jobs, returning an identifier for the specific jobs that we can poll to check
the completion status.
 Once the job is finished, we can then retrieve the result with a second operation,
get_document_text_detection and get_document_analysis respectively, by passing in the job
IDs.
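
A sketch of that asynchronous flow with the Paws Textract client; the bucket and document names are hypothetical, and the PDF must already be stored in S3:

library(paws)

svc <- textract()

# Start an asynchronous text detection job on a document in S3
job <- svc$start_document_text_detection(
  DocumentLocation = list(
    S3Object = list(Bucket = "my-docs-bucket", Name = "report.pdf")
  )
)

# Poll the job ID until the job leaves the IN_PROGRESS state
repeat {
  res <- svc$get_document_text_detection(JobId = job$JobId)
  if (res$JobStatus != "IN_PROGRESS") break
  Sys.sleep(5)
}

# LINE blocks hold the detected lines of text
lines <- Filter(function(b) b$BlockType == "LINE", res$Blocks)

start_document_analysis works the same way, with an additional FeatureTypes argument (e.g. list("TABLES", "FORMS")) and get_document_analysis to retrieve the result.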
Uploading Data to Database

 A suitably configured PostgreSQL server running on RDS supports authentication via IAM,
avoiding the need to store passwords.

 If we are using an IAM user or role with the appropriate permissions, we can then connect to
our PostgreSQL database from R using an IAM authentication token.

 The Paws package supports this feature as well; this functionality was developed with the support of the AWS Open Source program.

 We connect to our database using the token generated by build_auth_token from the Paws package (see the sketch below).
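
A sketch of IAM-based authentication, with hypothetical endpoint, region, and user names; the RDS instance must have IAM database authentication enabled:

library(paws)
library(DBI)
library(RPostgres)

# Generate a short-lived IAM authentication token (endpoint includes the port)
svc <- rds()
token <- svc$build_auth_token(
  endpoint = "mydb.example.us-east-1.rds.amazonaws.com:5432",
  region   = "us-east-1",
  user     = "analyst"
)

# Use the token in place of a password; IAM authentication requires SSL
con <- dbConnect(
  RPostgres::Postgres(),
  host     = "mydb.example.us-east-1.rds.amazonaws.com",
  port     = 5432,
  dbname   = "analytics",
  user     = "analyst",
  password = token,
  sslmode  = "require"
)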
THANK YOU!
