Advanced R
Data Programming and the Cloud
—
Matt Wiley
Joshua F. Wiley
Advanced R: Data Programming and the Cloud
Matt Wiley, Elkhart Group Ltd. & Victoria College, Columbia City, Indiana, USA
Joshua F. Wiley, Elkhart Group Ltd. & Victoria College, Columbia City, Indiana, USA
ISBN-13 (pbk): 978-1-4842-2076-4
ISBN-13 (electronic): 978-1-4842-2077-1
DOI 10.1007/978-1-4842-2077-1
Library of Congress Control Number: 2016959581
Copyright © 2016 by Matt Wiley and Joshua F. Wiley
This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the
material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation,
broadcasting, reproduction on microfilms or in any other physical way, and transmission or information
storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now
known or hereafter developed.
Trademarked names, logos, and images may appear in this book. Rather than use a trademark symbol with
every occurrence of a trademarked name, logo, or image, we use the names, logos, and images only in an
editorial fashion and to the benefit of the trademark owner, with no intention of infringement of the trademark.
The use in this publication of trade names, trademarks, service marks, and similar terms, even if they are
not identified as such, is not to be taken as an expression of opinion as to whether or not they are subject to
proprietary rights.
While the advice and information in this book are believed to be true and accurate at the date of publication,
neither the authors nor the editors nor the publisher can accept any legal responsibility for any errors or
omissions that may be made. The publisher makes no warranty, express or implied, with respect to the
material contained herein.
Managing Director: Welmoed Spahr
Lead Editor: Steve Anglin
Technical Reviewer: Andrew Moskowitz
Editorial Board: Steve Anglin, Pramila Balan, Laura Berendson, Aaron Black, Louise Corrigan,
Jonathan Gennick, Robert Hutchinson, Celestin Suresh John, Nikhil Karkal, James Markham,
Susan McDermott, Matthew Moodie, Natalie Pao, Gwenan Spearing
Coordinating Editor: Mark Powers
Copy Editor: Sharon Wilkey
Compositor: SPi Global
Indexer: SPi Global
Artist: SPi Global
Distributed to the book trade worldwide by Springer Science+Business Media New York, 233 Spring Street,
6th Floor, New York, NY 10013. Phone 1-800-SPRINGER, fax (201) 348-4505, e-mail orders-ny@springer-sbm.com,
or visit www.springeronline.com. Apress Media, LLC is a California LLC and the sole member (owner) is Springer
Science + Business Media Finance Inc (SSBM Finance Inc). SSBM Finance Inc is a Delaware corporation.
For information on translations, please e-mail rights@apress.com, or visit www.apress.com.
Apress and friends of ED books may be purchased in bulk for academic, corporate, or promotional use.
eBook versions and licenses are also available for most titles. For more information, reference our Special
Bulk Sales–eBook Licensing web page at www.apress.com/bulk-sales.
Any source code or other supplementary materials referenced by the author in this text are available to
readers at www.apress.com. For detailed information about how to locate your book’s source code, go to
www.apress.com/source-code/ . Readers can also access source code at SpringerLink in the Supplementary
Material section for each chapter.
Printed on acid-free paper
To Family.
Contents at a Glance
S4 System
  S4 Classes
  S4 Class Inheritance
  S4 Methods
Summary
Chapter 6: Writing a Package
Before You Get Started
  Version Control
Acknowledgments
We would like to profusely thank our technical reviewer, Andrew Moskowitz. Through direct comments in
chapters, e-mails about proper explanations, and Skype calls, Andrew gave us a lot of thoughtful feedback.
If our readers feel that any portion explains a technique well, that is thanks to his efforts; the errors of course
remain ours alone.
Mark Powers has been extraordinarily kind to us, and this book would not be here without his advocacy
and support. Steve Anglin also deserves thanks for working with us to start this project. Truly, if you look at
the very front of this book, there is an entire team at Apress who deserve rich and warm thanks.
Introduction
R has become one of the most popular programming languages in an era where data science is increasingly
prevalent. As R and data science have become more mainstream, there is a growing number of R users
without dedicated training in statistical computing or data science, and thus a growing demand for books
and resources to bridge the gap between applied users who may have only an introductory background
in statistics or programming and advanced and sophisticated data analytics. This book focuses on how to
use advanced programming in R to speed up everyday tasks in data analysis and data science. This book is
also unique in its coverage of how to set up R in the cloud and generate dynamic reports for analyses that
are regularly repeated, such as monthly analysis of company sales or quarterly analysis of student grades,
enrollment, and dropout numbers in schools with projections for future enrollment rates.
Chapters 1 through 6 focus on more advanced programming techniques than the Apress offering of
Beginning R.
Chapters 7–10 develop powerful data management measures including the exciting and
(comparatively) new data.table.
From here, we delve into the modern (and slightly edgy) world of cloud computing with R. From the
ground up, we walk you through getting R started on an Amazon cloud in Chapters 11–14.
Finally, Chapter 15 provides you with solid techniques in dynamic documents and reports.
CHAPTER 1
Programming Basics
As with most languages, more advanced usage requires delving into the underlying structure. This chapter
covers such programming basics, and this first section of the book (through Chapter 6) develops some
advanced programming techniques. We start with R’s basic building blocks, which create our foundation for
programming, data management, and cloud analytics.
Before we dig too deeply into R, some general principles to follow may well be in order. First,
experimentation is good. It is much more powerful to learn hands-on than it is simply to read. Download the
source files that come with this text, and try new things!
Second, it can help quite a bit to become familiar with the ? function. Simply type ? immediately
followed by text in your R console to call up help of some kind. We cover more on functions later, but this is
too useful to ignore until that time.
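As a brief illustration of the ? function, the following sketch asks for help in three equivalent or closely related ways. The choice of mean() and the + operator is ours, purely as an example; any function name works the same way.

```r
## Open the help page for a function; mean() is only an illustrative choice
?mean

## The same request written as an ordinary function call
help("mean")

## Operators and reserved words must be quoted
?"+"
```

In an interactive session, each of these lines opens the corresponding help page.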
Finally, just before we dive into the real reason you bought this book, a word of caution: this is an
applied text. There may be topics and areas of R we skip or ignore. While we, the authors, like to imagine this
is due to careful pruning of ideas, it may well be due to ignorance. There are likely other ways to perform
these tasks or additional good topics to learn. Our goal is to get you up and running as quickly as possible
toward some useful skills. Good luck!
Electronic supplementary material The online version of this chapter (doi: 10.1007/978-1-4842-2077-1) contains
supplementary material, which is available to authorized users.
You also need Java, if it is not already installed. We used Java Version 8 Update 91 (64-bit) in this
book. Java may be downloaded at www.oracle.com/technetwork/java/javase/; specifically, get the Java
Development Kit (JDK).
While these choices may have minor consequences, our goal is to provide universal guidance that
remains true enough regardless of environmental specifics. Nevertheless, some packages and prebuilt
functions on occasion have quirks. We turn our attention to ensuring that you can readily reproduce our
results.
Reproducing Results
One useful feature of R is the abundance of packages written by experts worldwide. This is also potentially
the Achilles’ heel of using R: from the version of R itself to the version of particular packages, lots of code
specifics are in flux. Your code has the potential to not work from day to day, let alone our code written
months before this book was published. To solve this, we use the Revolution Analytics checkpoint package
(Microsoft Corporation, 2016), which uses server-stored snapshots from the Comprehensive R Archive
Network (CRAN) to “lock” our code to a specific version and date. To learn the technical specifics of how
this is done, visit the link in the “References” section at the end of this chapter. We’ll get you started with the
basics.
For this book, we used R version 3.3.1 ("Bug in Your Hair") on Windows 10 Professional x64. Even after this
version moves from current to historical, CRAN keeps it in an archive of past releases. Thus, the
checkpoint package has ready access to previous versions of R, and indeed of all packages. Add the
following code to the top of your Chapter 1 R file in your project directory:
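A minimal sketch of such a file header follows. It assumes that the checkpoint package is already installed and that its checkpoint() function takes the snapshot date as its first argument. The authors' exact listing may differ, and running it requires access to a snapshot server.

```r
## Sketch only: assumes the checkpoint package is already installed
## (install.packages("checkpoint") would install it)
library(checkpoint)

## Lock package versions to the snapshot date named in the text
checkpoint("2016-09-04")

## Further library() calls for the chapter would follow here
```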
We place all library calls at the start of each chapter’s project file, after the call to the checkpoint library.
By including the date of September 4, 2016, we ensure that the latest version of all packages up to that cutoff
is installed and run by checkpoint. The first time it is run, after asking permission, checkpoint creates a
folder to host the needed versions of the packages used. Thus, as long as you start each chapter’s code file
with the correct library calls, you use the same versions of the packages we use.
Types of Objects
First of all, we need things to build our language, and in R, these are called objects. We start with five very
common types of objects.
Logical objects take on just two values: TRUE or FALSE. Computers are binary machines, and data often
may be recorded and modeled in an all-or-nothing world. Logical values are also convenient in arithmetic,
where TRUE is treated as 1 and FALSE as 0:
TRUE
[1] TRUE
FALSE
[1] FALSE
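To make the 1/0 behavior concrete, here is a small sketch using a made-up logical vector (the variable name passed and its values are hypothetical). Because TRUE counts as 1, summing counts the TRUE values and averaging gives their proportion.

```r
## Hypothetical data: did each of five students pass an exam?
passed <- c(TRUE, FALSE, TRUE, TRUE, FALSE)

as.integer(passed)  # coerces TRUE to 1 and FALSE to 0
sum(passed)         # number of TRUE values
mean(passed)        # proportion of TRUE values
```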
As you may remember from the quickly muttered comments of your algebra professor, there are many
types, or flavors, of numbers. Whole numbers, which include zero as well as negative values, are called
integers; in set notation, {…, -2, -1, 0, 1, 2, …}. These numbers are helpful for headcounts or other indexes
(as well as other things, naturally). In R, integers are written with a capital L suffix. If decimal numbers are
needed, then double numeric objects are in order. These are the numbers suited even for ratio-level data. Complex
numbers have useful properties as well and are understood precisely as you might expect, with an i suffix on
the imaginary portion. R is quite friendly in using all of these numbers, and you simply type in the desired
numbers (remember to add the L or i suffix as needed):
42L
[1] 42
1.5
[1] 1.5
2+3i
[1] 2+3i
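The printed values alone do not reveal how a number is stored. As a short sketch, base R's typeof() function reports the storage type and shows what the suffixes change:

```r
typeof(42L)     # "integer"
typeof(42)      # "double": without the L suffix, a number is stored as a double
typeof(1.5)     # "double"
typeof(2+3i)    # "complex"
is.integer(42)  # FALSE
```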
Nominal-level data may be stored via the character class and is designated with quotation marks:
"a" ## character
[1] "a"
Of course, numerical data may have missing values. A missing value takes the same type as the rest of
the data in its set (we discuss data storage shortly). Nevertheless, it can be helpful to know how to
hand-code logical, integer, double, character, or complex missing values:
NA
[1] NA
NA_integer_
[1] NA
NA_real_
[1] NA
NA_character_
[1] NA
NA_complex_
[1] NA
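All five missing values print identically, so the difference is easiest to see with typeof(). The sketch below also shows a plain NA taking on the type of the vector around it; the vector x is a made-up example.

```r
typeof(NA)             # "logical"
typeof(NA_integer_)    # "integer"
typeof(NA_real_)       # "double"
typeof(NA_character_)  # "character"
typeof(NA_complex_)    # "complex"

## A plain NA is coerced to match the vector it is placed in
x <- c(1.5, NA, 3)
typeof(x)    # "double"
is.na(x)     # FALSE TRUE FALSE
```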
Factors are a special kind of object, not so useful for general programming, but used a fair amount
in statistics. A factor variable indicates that a variable should be treated discretely. Factors are stored as
integers, with labels to indicate the original value:
factor(1:3)
[1] 1 2 3
Levels: 1 2 3
factor(c("a", "b", "c"))
[1] a b c
Levels: a b c
factor(letters[1:3])
[1] a b c
Levels: a b c
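The integer storage behind a factor can be inspected directly. In this sketch the ratings vector is hypothetical; note that, by default, the levels are the sorted unique values, so the integer codes follow that order rather than the order of appearance.

```r
## Hypothetical ratings; levels default to sorted unique values
f <- factor(c("low", "high", "low"))

levels(f)      # the labels: "high" "low"
as.integer(f)  # the underlying integer codes: 2 1 2
typeof(f)      # "integer"
```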
x2 %*% t(x2)
[,1] [,2] [,3]
[1,] 17 22 27
[2,] 22 29 36
[3,] 27 36 45
tcrossprod(x2)
[,1] [,2] [,3]
[1,] 17 22 27
[2,] 22 29 36
[3,] 27 36 45
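The definition of x2 does not appear in this passage. As an assumption, the sketch below uses matrix(1:6, nrow = 3), which is consistent with the 3-by-3 result printed above, and checks that the explicit product and tcrossprod() agree.

```r
## Assumed definition: the original x2 is not shown in this passage
x2 <- matrix(1:6, nrow = 3)

a <- x2 %*% t(x2)     # explicit transpose, then matrix multiplication
b <- tcrossprod(x2)   # same product computed in one call
all.equal(a, b)       # TRUE

## The companion function crossprod() computes t(x2) %*% x2
all.equal(crossprod(x2), t(x2) %*% x2)  # TRUE
```

The dedicated functions exist mainly for convenience and speed; the results match the explicit products.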
We end this chapter with some final thoughts. First, as you have just seen, it is common in R for
someone else to have done the heavy lifting by making a function that simply creates the desired outcome.
Of course, these friendly programmers’ work is subject only to the underlying constraints of R itself as
well as the ability to acquire a free GitHub account. Thus, it can be helpful to understand at least some of the
base commands and operators that make R work. Second, R runs on computers, and for those who have not
yet met computer logic, there are differences due to the hardware structure (and consequent software
implementation choices).
Next, let’s focus on understanding implementation nuances as well as quickly getting data in and out of R.
References
• https://mran.microsoft.com/open/
• https://cran.r-project.org/web/packages/checkpoint/vignettes/checkpoint.html