Author: Simon Fortier-Garceau
Supervisor: Gilles Lamothe
Translation from French to English: Gilles Lamothe
August 1, 2012
1 Introduction
  1.1 R Environment and Installation
  1.2 Guide structure and interaction with R
2 Binomial Distribution
  2.1 Probability mass function and Cumulative distribution function
  2.2 Example 6.2
  2.3 Example 6.3
3 Poisson Distribution
  3.1 Probability Mass Function and Cumulative Distribution Function
  3.2 Example 6.4
4 Normal Distribution
  4.1 Probability Density Function and Cumulative Distribution Function
  4.2 Example 7.1
  4.3 Example 7.2
5 Categorical Data
  5.1 Data Entry and Bar Charts
  5.2 Example 9.1
  5.3 Contingency Tables & Side-by-Side Bar Charts
  5.4 Example 9.2
6 Importing Data from a Text File
  6.1 Importing Clutch Sizes (Examples 9.3-9.4)
7 Numerical Variables
  7.1 Histogram
  7.2 Example 9.3
  7.3 Example 9.6
1 Introduction
There are many resources and introductory manuals for the statistical software R. However, a resource is useful only insofar as it introduces us effectively to the software, in full or in one aspect or another, depending on the particular circumstances of our work. In our case, the purpose of this guide is twofold:
1. Serve students and teachers as part of an introductory course in biostatistics;
2. Act as an extension to the textbook Expect the unexpected: A first course in Biostatistics [RBGL], by Raluca Balan and Gilles Lamothe.
The textbook [RBGL] discusses many theoretical aspects and allows us to put our knowledge into practice through a series of suggested problems. However, the nature of statistics requires us to work with data sets and to analyze large amounts of information, a task that is carried out far more effectively by a computer. This is where R, along with this guide, is useful: it offers us the chance to practice our knowledge of probability and statistics with the help of computing.
The idea behind this guide is mainly to introduce students to the basic features of R. At the same time, its structure makes it a handy reference. Each algorithm is presented in full detail and independently of the others. The table of contents is arranged to allow easy identification of relevant methods. This guide is therefore well suited to students and teachers seeking to save time when implementing a specific method, whether to compute statistics for a particular data set, calculate values associated with a certain probability distribution, or build a chart or graph for analysis.
Though the guide can be used independently of the textbook [RBGL] as a reference for the methods and algorithms, its primary function is to accompany the book in question. The progression of the sections of the guide is strongly linked to the sequence of chapters of [RBGL], and the examples discussed are taken from the latter. We therefore recommend using this guide in conjunction with [RBGL]; this way, both will be used to their full potential.
1.1 R Environment and Installation
A compromise was made with regard to the practicality of this guide. Strictly speaking, there are libraries that extend R and that would make some of the algorithms presented below more succinct. The compromise is that we will not make use of R libraries in the algorithms of this guide, so that the student learns to work more closely with the programming aspect of R as it is done without software extensions. However, we encourage the student (or reader) to explore the libraries that can complement the R experience. R is, after all, a system with an open architecture (that is to say, the community can freely participate in enriching and building the software environment). It is possible to download R libraries by choosing “Packages” in the main window. But the first step, of course, is to download R, and this can be done at the following link:
Click on the link “CRAN”, under “Download, Packages” in the window on the right-hand side. We are then led to the “Cran Mirror” page, where we select a server for downloading (e.g. Canada, USA). We must then select our operating system (Windows, Linux or MacOS X) under the title “Download & Install R”. Depending on the operating system, we may have to click on various other links to get to the download. Read the instructions to obtain the basic binary. Once the executable file is downloaded, the installation is initiated. Pay attention to the folder where the software is installed, because the default installation does not necessarily create a shortcut. You can open R by navigating to that folder and selecting the appropriate executable file. A window opens and you can then work with the base interface of R, that is to say the R Console.
1.2 Guide structure and interaction with R
The approach of this guide is simple. Each section contains generic statements associated with a variety of algorithms. Examples generally follow such statements to illustrate concrete implementations of algorithms in R.
Each algorithm specifies the parameters and commands to run in R. The parameters specified at the beginning of each algorithm are either numeric values or character strings in quotation marks, which we must substitute in the commands of the algorithm so that it runs correctly. For example, suppose that we have an algorithm with the parameters
x1, ..., xn : a random sample for a numerical variable
xname : a name for the variable (in quotation marks: ")
and that the algorithm gives the commands
> z <- c(x1, ..., xn)
> plot(z, xlab=xname)
Then, for the data set of values x1 = 5.35, x2 = 4, x3 = 8, x4 = 6.7, x5 = 1.1234 and the name “Weight (in g)”, we type (after the prompt symbol “>”)
z <- c(5.35, 4, 8, 6.7, 1.1234)
and press Enter afterwards.
At this point, the command assigning the values to z is executed implicitly (nothing is displayed), and a new prompt symbol “>” appears so that we can enter a new command. Then we type
plot(z, xlab="Weight (in g)")
and press Enter afterwards.
The execution of the latter command is more explicit: it opens a new window (R Graphics) and a graph of our data appears with the name of our variable along the horizontal axis. This is the end of the algorithm.
Note that the prompt symbols “>” appearing at the beginning of each command line in the statements of the algorithms do not have to be entered by the user in the console, as they are generated automatically by R after the execution of each command. Just enter the listed commands exactly as they are given in the algorithms, apart from the underlined strings, which must be replaced without exception. Parameters listed at the beginning of each algorithm in this guide are always underlined in the listed commands, and they MUST be replaced by a value of the type specified in the list. Otherwise, the algorithm will run poorly or not at all.
Some commands are quite long and cannot be entered at once in R; they should be written over several lines. When a command is not fully entered in R and we press the Enter key, R detects that the command is incomplete and begins a new line with the prompt symbol “+” (to indicate that the rest of the command will join the previous line). For example, if we have an algorithm with the following two commands (the second command is four lines):
> x <- 0:n
> plot(x, dbinom(x, n, p), xlab="Number of successes",
+ ylab="Probability",
+ main="Binomial Distribution: Number of trials=n, Probability of success=p",
+ type="h")
we must enter (let us suppose that p = 0.7 and n = 13)
x <- 0:13
and press the Enter key. The first command is executed implicitly (it constructs a vector that stores the integers from 0 to 13 in the variable x) and we can start typing the second command:
plot(x, dbinom(x, 13, 0.7), xlab="Number of successes",
and then press the Enter key. A new line begins with the prompt “+”, and we enter
ylab="Probability",
and then press the Enter key. Again, a new line is created automatically (with the prompt “+”) and we enter
main="Binomial Distribution: Number of trials=13, Probability of success=0.7",
followed by the Enter key. Finally, the last line begins, and we type
type="h")
When we press the Enter key this time, the second command is executed (a combination of four lines of code in total).
There is an alternative method to submit commands in R using the R editor. In the menu choose “File” and then “new script”. You can enter the R script (a series of R commands) into the R editor. Select the commands that you want to submit and press “CTRL-R”. The selected script will be submitted to the R console. Below we have a display of the selected script within the R editor.
We should now be able to implement all the algorithms presented in this guide without much difficulty. For those who have never worked with this software, it is strongly recommended to visit the website R Project [?] and to read sections 2 and 5 of the manual An Introduction to R (available through the link “Manuals” under the tab “Documentation” in the window on the right-hand side of the webpage). These sections discuss basic operations on vectors and matrices, and no doubt they will make it easier to work with R overall.
Another point concerning this guide is that we will need to import data sets for many of the examples presented in the text. These data sets are available as text files at:
http://www.worldscientific.com/page/7546-tabd [WScienPI]
The website in question is a directory managed by World Scientific and contains links to all data sets used in the textbook [RBGL]. This link will be very useful throughout this guide. To download a data set as a text file, right-click on the link and choose “Save Target As...”. The data set can then be imported into R through the command read.table(), but this will be discussed later in sections 5 and 7 of this guide.
Also, before we begin, we give some tips for the R console:
1. It is possible to recover a line of code previously entered into an R session by using the up arrow. This saves much time when we make a mistake in the middle of a command and we do not want to have to rewrite every line of the code.
2. You can clear the console window by typing “Ctrl + L” (or by using Clear console in the tab Edit).
3. White space generated by the space bar does not change the nature of the command most of the time. For example, dbinom(2,7,0.3) will execute exactly the same as dbinom( 2, 7, 0.3 ). Also, x<-(3+4)/2 executes exactly as x <- ( 3+ 4 ) / 2
4. The R interpreter is case sensitive. That is, the use of an upper case or a lower case letter must be exactly as given in the statement of the algorithm. For example, dbinom(2,7,0.3) is a valid command, but Dbinom(2,7,0.3) will not work. Similarly, if we declare a variable X, in upper case, it is not the same as the variable x, in lower case.
5. We can write an R script (a sequence of R commands) in the R Editor. To open the editor select “file” in the menu and then “new script”. Select the commands to be submitted to the R console and then press CTRL-R.
6. It is possible to save the command history of a session in R, or even the work environment of a session (these options are found under “file” in the main window of R). Loading a history of commands used during a previous R session (these commands are accessed by pressing the up arrow in the console) will not retrieve the values assigned to objects and variables in the previous sessions. To save objects and functions built during a session, you must save/load the workspace rather than the command history.
Finally, there are three appendices which will help us as we go through this guide. Appendix ?? explains how to change the names of the axes and the title of a graph when we use graphical methods in R. Appendix ?? outlines a series of file formats used for the storage of data sets. This appendix will be very useful when we import data sets for our algorithms. Appendix ?? presents alternate versions of some algorithms contained in the text. Specifically, Appendix ?? shows us how to build a function that automatically executes a series of commands given in advance. These alternatives can save considerable time when we intend to use the same algorithms often. We will present this procedure for three or four algorithms (related to confidence intervals). It is for the reader to conceive a way in which we can generalize this procedure to any algorithm. In truth, it is not very complicated to achieve.
On this note, we can begin to work with our first probability distribution in R, the binomial distribution.
2 Binomial Distribution
Within this guide we will work with two discrete distributions and two continuous distributions. They are the binomial distribution, the Poisson distribution, the normal distribution and Student’s t distribution, which will be discussed in section ??, a little further on. We start with the binomial distribution in this section. Refer to Chapter 6 of the manual [RBGL].
2.1 Probability mass function and Cumulative distribution function
Let X be a binomial random variable with parameters n = “number of trials” and p = “probability of success”. The following commands can be used to evaluate the probability mass function and the cumulative distribution function of X.
Algorithm 1. Computation of P(X = x), P(X ≤ x) and P(X > x), X ∼ binomial(n, p)
Parameters:
n : a positive integer (the number of trials)
p : a real number between 0 and 1 (the probability of success)
x : an integer between 0 and n (to evaluate the function at x)
Command and output for P(X = x):
> dbinom(x, n, p)
[1] “value of P(X = x)”
Command and output for P(X ≤ x):
> pbinom(x, n, p)
[1] “value of P(X ≤ x)”
Command and output for P(X > x):
> 1-pbinom(x, n, p)
[1] “value of P(X > x)”
Remark: We used the complement rule to obtain the value of P(X > x).
The procedure to evaluate fX(x) = P(X = x) and FX(x) = P(X ≤ x) in R is very simple. For example, if we enter dbinom(4,5,0.6) after the R prompt, then R will output the value 0.2592 in the R console, which is the value of P(X = 4), where X ∼ binomial(n = 5, p = 0.6). For a better overview of a certain binomial distribution, a graphical representation of fX and FX might be useful. We can construct the graphs in R with the following commands:
Algorithm 2. Graphs of fX and FX, X ∼ binomial(n, p)
Parameters:
n : a positive integer (the number of trials)
p : a real number between 0 and 1 (the probability of success)
Script for the graph of P(X = x):
x <- 0:n
plot(x, dbinom(x, n, p), type = "h", xlab = "Number of successes", ylab = "Probability",
main = "Binomial Distribution: Number of trials=n, Probability of success = p",
cex.axis=1.5, cex.lab=1.5, cex.main=1.5)
abline(h=0, col="gray")
The graph of fX is displayed in the R Graphics window.
Script for the graph of P(X ≤ x):
x <- 0:n
x <- rep(x, rep(2, length(x)))
plot(x[-1], pbinom(x, n, p)[-length(x)], type = "l", xlab = "Number of successes", ylab = "Cumulative Probability",
main = "Binomial Distribution: Number of trials=n, Probability of success = p",
cex.axis=1.5, cex.lab=1.5, cex.main=1.5)
abline(h=0, col="gray")
The graph of FX is displayed in the R Graphics window.
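The second script draws a staircase with type = "l" by repeating each value of x. The following is a minimal sketch of what the two vectors passed to plot() contain, assuming the illustrative values n = 3 and p = 0.5 (chosen here for readability, not taken from [RBGL]); the names xx, hx and hy are also hypothetical.

```r
# Illustrative values only (assumed, not from the textbook)
n <- 3
p <- 0.5

x  <- 0:n
xx <- rep(x, rep(2, length(x)))        # each value repeated twice: 0 0 1 1 2 2 3 3

hx <- xx[-1]                           # horizontal coordinates: first entry dropped
hy <- pbinom(xx, n, p)[-length(xx)]    # vertical coordinates: last entry dropped

# Consecutive (hx, hy) pairs alternate between a horizontal move
# (same height, next x) and a vertical jump (same x, next height).
print(cbind(hx, hy))
```

Joining these points with straight segments gives the step shape of a cumulative distribution function. The call rep(x, each = 2) would build the same repeated vector.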
Remarks:
1. The “x” found in the above commands is not listed among the parameters; it is an object constructed in R in order to obtain the values on the horizontal axis. To be more specific, it is a vector that contains the values in the domain of the function to be evaluated to produce the graph.
2. You can change the names in the quotation marks for xlab, ylab and main. We refer the reader to Appendix ?? for a discussion concerning changing the names and the title of a graph in R.
3. We use cex to manipulate the font size. We interpret 1.5 as 50% larger than the default.
4. The command abline() adds a line that displays the zero on the vertical axis.
We illustrate the implementation of Algorithms 1 and 2 through examples taken from [RBGL].
2.2 Example 6.2
For Example 6.2 in [RBGL], we use R to compute the probabilities for each of the values of the binomial random variable X with n = 5 and p = 0.6. We compute P(X = x) for x from 0 to 5, one by one.
Commands and output
> dbinom(0,5,0.6)
[1] 0.01024
> dbinom(1,5,0.6)
[1] 0.0768
> dbinom(2,5,0.6)
[1] 0.2304
> dbinom(3,5,0.6)
[1] 0.3456
> dbinom(4,5,0.6)
[1] 0.2592
> dbinom(5,5,0.6)
[1] 0.07776
We also compute P(X ≥ 3) = 1 − P(X ≤ 2).
Command and output
> 1-pbinom(2,5,0.6)
[1] 0.68256
We construct a stick diagram of the probability mass function fX; we will use n = 5 and p = 0.6.
Script and output
x <- 0:5
plot(x, dbinom(x, 5, 0.6), type="h", xlab="Number of successes", ylab="Probability",
main="Binomial Distribution: Number of trials=5, Probability of success=0.6",
cex.axis=1.5, cex.lab=1.5, cex.main=1.5)
abline(h=0, col="gray")
Similarly for the graph of the cumulative distribution function FX, here is the script, with the parameters p = 0.6 and n = 5.
Script and output
x <- 0:5
x <- rep(x, rep(2, length(x)))
plot(x[-1], pbinom(x, 5, 0.6)[-length(x)], type="l", xlab="Number of successes", ylab="Cumulative Probability",
main="Binomial Distribution: Number of trials=5, Probability of success=0.6",
cex.axis=1.5, cex.lab=1.5, cex.main=1.5)
abline(h=0, col="gray")
2.3 Example 6.3
In Example 6.3 of [RBGL], we are working with a binomial random variable X with n = 7 and p = 0.0003. With R, we compute the probability P(X ≥ 1) = 1 − P(X ≤ 0) = 0.002098111.
Command and output
> 1-pbinom(1-1,7,0.0003)
[1] 0.002098111
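The computations of Examples 6.2 and 6.3 can also be cross-checked in a few lines. The sketch below is only an illustration: it relies on the fact that dbinom() accepts a vector of values, and the variable names (probs, p_sum, p_comp, p_closed) are hypothetical.

```r
# All six probabilities of Example 6.2 in one call (n = 5, p = 0.6)
probs <- dbinom(0:5, 5, 0.6)
names(probs) <- 0:5
print(probs)
print(sum(probs))                      # the masses of a distribution add up to 1

# P(X >= 3): sum of the masses versus the complement rule
p_sum  <- sum(dbinom(3:5, 5, 0.6))
p_comp <- 1 - pbinom(2, 5, 0.6)
print(all.equal(p_sum, p_comp))

# Example 6.3: P(X >= 1) = 1 - P(X = 0) = 1 - (1 - p)^n
p_closed <- 1 - (1 - 0.0003)^7
print(all.equal(p_closed, 1 - pbinom(0, 7, 0.0003)))
```

Evaluating the mass function on a whole vector is often more convenient than entering the commands one by one, and comparing two routes to the same probability is a quick way to catch a typing error.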
3 Poisson Distribution
For this section, we are referring to Chapter 6 of [RBGL]. We will see a few examples of computations of probabilities and the construction of graphs.
3.1 Probability Mass Function and Cumulative Distribution Function
In this section, we will be working with a Poisson random variable X with mean λ. We compute the probability masses with the function dpois() in R, while the cumulative probabilities are computed with ppois(). The prefix “d” in R, which stands for density, gives the probability mass function (or probability density function). The prefix “p”, which stands for probability, gives the cumulative distribution function. There is also the prefix “q” to find quantiles, which we will see when working with the normal and T distributions. We now see how to evaluate the probability mass function and the cumulative distribution function for a Poisson distribution.
Algorithm 3. Computation of P(X = x), P(X ≤ x) and P(X > x), X ∼ Poisson(λ)
Parameters:
λ : a real value > 0 (the mean)
x : a non-negative integer 0, 1, 2, 3, ... (to evaluate at x)
Command and output for P(X = x):
> dpois(x, λ)
[1] “value of P(X = x)”
Command and output for P(X ≤ x):
> ppois(x, λ)
[1] “value of P(X ≤ x)”
Command and output for P(X > x):
> 1-ppois(x, λ)
[1] “value of P(X > x)”
For example, if we have a Poisson random variable with a mean of λ = 4.5 events and we want to compute the probability that 2 events will occur, we enter dpois(2,4.5) in the R console. The R output is 0.1124786, which is P(X = 2). We will see a more concrete example for dpois() and ppois(), but let us first see how to produce the graphs for fX and FX.
Algorithm 4. Graphs of fX and FX, X ∼ Poisson(λ)
Parameters:
λ : a real value > 0 (the mean)
t : a positive integer 1, 2, 3, 4, ... (an upper bound for the horizontal axis)
Script for the graph of P(X = x):
x <- 0:t
plot(x, dpois(x, λ), type = "h", xlab="x", ylab = "Probability",
main = "Poisson distribution: Mean=λ",
cex.axis=1.5, cex.lab=1.5, cex.main=1.5)
abline(h=0, col="gray")
The graph of fX is displayed in the R graphics window.
Script for the graph of P(X ≤ x):
x <- 0:t
x <- rep(x, rep(2, length(x)))
plot(x[-1], ppois(x, λ)[-length(x)], type = "l", xlab="x", ylab = "Cumulative Probability",
main = "Poisson distribution: Mean=λ",
cex.axis=1.5, cex.lab=1.5, cex.main=1.5)
abline(h=0, col = "gray")
The graph of FX is displayed in the R graphics window.
3.2 Example 6.4
For Example 6.4 of [RBGL], we have a random variable X ∼ Poisson(6). With R, we compute P(X ≤ 3) = 0.1512039.
Command and output
> ppois(3,6)
[1] 0.1512039
Now let us consider a Poisson random variable Y with a mean of 0.5. We compute P(Y ≥ 2) = 1 − P(Y ≤ 1) = 0.09020401.
Command and output
> 1-ppois(1,0.5)
[1] 0.09020401
We omit the construction of the graphs for a Poisson distribution because the method is virtually the same as for the binomial distribution.
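As a quick check of the commands of this section, the values returned by dpois() and ppois() can be compared with the Poisson mass function P(X = x) = e^(−λ) λ^x / x!. This is a minimal sketch; the variable names are hypothetical.

```r
# P(X = 2) for a mean of 4.5: Poisson formula versus dpois()
lambda     <- 4.5
by_formula <- exp(-lambda) * lambda^2 / factorial(2)
by_dpois   <- dpois(2, lambda)
print(all.equal(by_formula, by_dpois))

# Example 6.4: a cumulative probability is a running sum of the masses
cum_sum   <- sum(dpois(0:3, 6))
cum_ppois <- ppois(3, 6)
print(all.equal(cum_sum, cum_ppois))

# P(Y >= 2) for a mean of 0.5: complement rule versus lower.tail = FALSE
tail_comp  <- 1 - ppois(1, 0.5)
tail_upper <- ppois(1, 0.5, lower.tail = FALSE)
print(all.equal(tail_comp, tail_upper))
```

The argument lower.tail = FALSE asks ppois() for P(X > x) directly, which is the same quantity as the complement used in Algorithm 3.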
4 Normal Distribution
For this section, we will refer to Chapter 7 of the textbook [RBGL]. We will work with the probability density function, the cumulative distribution function and their graphs. We will also learn how to find quantiles of a normal distribution.
4.1 Probability Density Function and Cumulative Distribution Function
We use the notation X to represent a random variable with a normal distribution with mean µ and standard deviation σ. As with the binomial and Poisson distributions, evaluating fX and FX for a normal distribution is direct.
Algorithm 5. Computation of fX(x), FX(x) and P(X > x), X ∼ N(µ, σ²)
Parameters:
x : a real number (to evaluate fX (or FX) at x)
µ : a real number (the mean)
σ : a positive real number (the standard deviation)
Command and output for fX(x):
> dnorm(x, µ, σ)
[1] “value of fX(x)”
Command and output for FX(x):
> pnorm(x, µ, σ)
[1] “value of FX(x)”
Command and output for P(X > x):
> 1-pnorm(x, µ, σ)
[1] “value of P(X > x)”
Remark: If you do not give a value for µ or for σ, then R uses µ = 0 and σ = 1 by default for dnorm() and pnorm(). In other words, if we want to work with the standard normal, then it is sufficient to use dnorm(x) to evaluate the probability density function, or pnorm(x) for the cumulative distribution function.
We often would like to find a quantile from a normal distribution. We will find lower and upper quantiles. Of course, R provides methods for calculating quantiles from several distributions, but you will only see the method with the normal distribution (and the T distribution).
Algorithm 6. Computation of a quantile, X ∼ N(µ, σ²)
Parameters:
q : a real number between 0 and 1 (the order of the quantile)
µ : a real number (the mean)
σ : a positive real number (the standard deviation)
Command to find a lower quantile:
> qnorm(q, µ, σ)
[1] “the value x such that P(X < x) = q”
Command to find an upper quantile:
> qnorm(q, µ, σ, lower.tail=FALSE)
[1] “the value x such that P(X > x) = q”
Remark: R uses µ = 0 and σ = 1 by default for qnorm(). In other words, if we want quantiles from the standard normal then it suffices to use the command qnorm(q).
Finally, we produce graphs of fX or of FX in R as follows:
Algorithm 7. Graphs of fX and FX, X ∼ N(µ, σ²)
Parameters:
µ : a real number (the mean)
σ : a positive real number (the standard deviation)
Script for the graph of fX:
x <- seq(µ-3*σ, µ+3*σ, length.out=100)
plot(x, dnorm(x, µ, σ), type="l", xlab="x", ylab="Probability Density",
main="Normal Distribution: Mean=µ, Standard Deviation=σ",
cex.axis=1.5, cex.lab=1.5, cex.main=1.5)
abline(h=0, col="gray")
The graph of fX is displayed in the R graphics window.
Script for the graph of FX:
x <- seq(µ-3*σ, µ+3*σ, length.out=100)
plot(x, pnorm(x, µ, σ), type="l", xlab="x", ylab="Cumulative Probability",
main="Normal Distribution: Mean=µ, Standard Deviation=σ",
cex.axis=1.5, cex.lab=1.5, cex.main=1.5)
abline(h=0, col="gray")
The graph of FX is displayed in the R graphics window.
Remark: It is again possible to shorten the commands dnorm() and pnorm() inside the command plot() if we are working with the standard normal distribution. For example, for the probability density function, we could have
> plot(x, dnorm(x), xlab="x",
instead of
> plot(x, dnorm(x, 0, 1), xlab="x",
on the 2nd line. Also, for those who had not noticed it up to now, the star symbol “*” is used for multiplication in R.
4.2 Example 7.1
In Example 7.1 of [RBGL], we compute P(−1.25 < Z < 0.5) (which is equivalent to P(Z ≤ 0.5) − P(Z ≤ −1.25)), where Z has a standard normal distribution. With x = 0.5 (and x = −1.25), we obtain
Command and output
> pnorm(0.5,0,1) - pnorm(-1.25,0,1)
[1] 0.5858127
We can build the graph for the density of the standard normal. Since we are working with the standard normal distribution, we can omit the values of the parameters µ and σ.
Script and output
x <- seq(0-3*1, 0+3*1, length.out=100)
plot(x, dnorm(x), type="l", xlab="z", ylab="Probability Density",
main="Standard Normal Distribution",
cex.axis=1.5, cex.lab=1.5, cex.main=1.5)
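The probability of Example 7.1 is the area under the standard normal density between −1.25 and 0.5. As a hedged illustration of the link between dnorm() and pnorm(), the sketch below compares the difference of cumulative probabilities with a numerical integration of the density. It uses integrate() from base R, which is not part of the algorithms of this guide, and hypothetical variable names.

```r
# P(-1.25 < Z < 0.5) as a difference of two cumulative probabilities
p_cdf <- diff(pnorm(c(-1.25, 0.5)))

# The same probability as the area under the density curve,
# approximated by numerical integration (so only approximately equal)
p_area <- integrate(dnorm, lower = -1.25, upper = 0.5)$value

print(c(p_cdf, p_area))
print(all.equal(p_cdf, p_area, tolerance = 1e-6))
```

The two values agree up to the numerical error of the integration routine, which is why the comparison uses a tolerance rather than exact equality.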
4.3 Example 7.2
For Example 7.2 of the textbook [RBGL], we are working with a normal random variable X with mean µ = 40 and standard deviation σ = 3. For part (a), we compute P(X > 45). Using Algorithm 5 with x = 45, µ = 40 and σ = 3 in the R command, we get
Command and output
> 1 - pnorm(45,40,3)
[1] 0.04779035
For part (b), we use x = 43 and x = 37 to compute P(37 < X < 43) = 0.6826895.
Command and output
> pnorm(43,40,3) - pnorm(37,40,3)
[1] 0.6826895
For part (c), we build the graph of the cumulative distribution function using Algorithm 7, which gives
Script and output
x <- seq(40-3*3, 40+3*3, length.out = 100)
plot(x, pnorm(x,40,3), type="l", xlab="x", ylab="Cumulative Probability",
main="Normal Distribution: Mean=40, Standard Deviation=3",
cex.axis=1.5, cex.lab=1.5, cex.main=1.5)
We observe that the cumulative probability reaches 0.25 at around x = 38. More precisely, we obtain the 25th percentile (or lower quantile of order 25%), which is x = 37.97653, by using the command qnorm():
Command and output
> qnorm(0.25,40,3)
[1] 37.97653
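The relationships between pnorm() and qnorm() used in this section can be verified numerically. The following is a minimal sketch with the values of Example 7.2 (µ = 40, σ = 3); the variable names are hypothetical.

```r
mu    <- 40
sigma <- 3

# qnorm() undoes pnorm(): the 25th percentile has cumulative probability 0.25
q25 <- qnorm(0.25, mu, sigma)
print(pnorm(q25, mu, sigma))

# An upper quantile of order q is the lower quantile of order 1 - q
q_upper <- qnorm(0.25, mu, sigma, lower.tail = FALSE)
q_lower <- qnorm(0.75, mu, sigma)
print(all.equal(q_upper, q_lower))

# Standardizing: P(X > 45) equals P(Z > (45 - mu) / sigma)
p_x <- 1 - pnorm(45, mu, sigma)
p_z <- 1 - pnorm((45 - mu) / sigma)
print(all.equal(p_x, p_z))
```

The last comparison shows why tables of the standard normal distribution are sufficient for hand computations, while R lets us pass µ and σ directly.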
5 Categorical Data
This section, along with several others that follow, refers to Chapter 9 of [RBGL]. More specifically, the contents of this section refer to Section 9.1 (Random Sampling and Data Description) of Chapter 9. There are two types of variables: numerical and categorical. We consider categorical variables in this section. Numerical variables are discussed in the following section. For categorical variables, we will enter the data directly into R. However, for numerical variables, we will discuss two methods of data entry: a method for which the data entry is done directly in R, and a method for which data can be imported into R from a text file or an Excel worksheet.
5.1 Data Entry and Bar Charts
For a categorical variable, each observation represents a category. So a random sample of size n for a categorical variable contains n categorical values. For such a sample, we can construct a frequency (or relative frequency) distribution. In most of the examples from the textbook [RBGL], we are given the frequency distribution and not the raw data. So we will show the reader how to enter the frequency distribution into R. The method for the direct input of the frequency distribution consists essentially of the explicit construction of two vectors in R: one containing the names of the categories, the other containing the frequencies associated with these categories. Once the data is entered, we can quickly build a bar chart. One can also choose to display the frequencies or the relative frequencies.
Algorithm 8. Entry of a Frequency Distribution (Categorical Data) directly into R
Parameters:
y1, y2, y3, ..., ym : names of the categories (placed within quotes: ")
x1, x2, x3, ..., xm : the frequencies (xi is the frequency for the category yi)
Commands for the direct data entry:
y <- c(y1, y2, ..., ym)
x <- c(x1, x2, ..., xm)
The data are stored in the variables x and y.
Remarks:
1. We are using m to represent the number of categories; it corresponds to the length of the vector x (and of the vector y).
2. We must begin and end each name with quotation marks: ". For example, if we want to enter the categories Cat and Dog for y1 and y2, replace y1 by "Cat" and y2 by "Dog" in the command. Example 9.1 clarifies this syntax.
3. We can verify the contents of x and of y at any time, simply by typing x (or y) in the R console and pressing the Enter key.
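As a small illustration of Algorithm 8, the sketch below enters the two categories named in Remark 2. The frequencies 12 and 8 are made up for this illustration, and the name rel is hypothetical.

```r
# Category names come from Remark 2; the frequencies are invented for illustration
y <- c("Cat", "Dog")
x <- c(12, 8)

# Both vectors must have the same length m (here m = 2)
print(length(x) == length(y))

# Optional: label each frequency with its category before printing
names(x) <- y
print(x)

# Relative frequencies: each frequency divided by the sample size
rel <- x / sum(x)
print(rel)
```

Dividing by sum(x) is the same operation that turns a frequency bar chart into a relative frequency bar chart.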
The procedure is not complicated, and you can add a line or two to this algorithm to get a bar chart.
Algorithm 9. Direct Entry of a Frequency Distribution and Constructing a Bar Chart
Parameters:
y1, y2, y3, ..., ym : names of the categories (placed within quotes: ")
x1, x2, x3, ..., xm : the frequencies (xi is the frequency for the category yi)
Commands for the direct entry:
y <- c(y1, y2, ..., ym)
x <- c(x1, x2, ..., xm)
Command for a (Frequency) Bar Chart:
barplot(x, names.arg = y, ylab="Frequency", cex.names=1.5, cex.axis=1.5, cex.lab=1.5)
The bar chart is displayed in the R graphics window.
Script for a (Relative Frequency) Bar Chart:
t <- sum(x)
barplot(x/t, names.arg = y, ylab="Relative Frequency", cex.names=1.5, cex.axis=1.5, cex.lab=1.5)
The bar chart is displayed in the R graphics window.
The following example clarifies the procedures involving categorical variables.
5.2 Example 9.1
Based on Example 9.1 from the textbook [RBGL], we enter the data concerning the tumour variable using Algorithm 8.
Commands
y <- c("only liver", "liver and mouth", "only mouth", "no tumours")
x <- c(35, 3, 10, 75)
We continue with the algorithm to construct the bar chart.
Command and output
barplot(x, names.arg = y, ylab="Frequency", cex.names=1.5, cex.axis=1.5, cex.lab=1.5)
(If certain names are not displayed along the horizontal axis, increase the size of the window, and the problem should be fixed.)
Below we build a relative frequency bar chart for the tumour variable.
Commands and output
z = read.table(file.choose(), header=TRUE, sep="\t")
x <- as.numeric(z)
y <- colnames(z)
t <- sum(x)
barplot(x/t, names.arg=y, ylab="Relative Frequency", cex.names=1.5, cex.axis=1.5, cex.lab=1.5)
5.3 Contingency Tables & Side-by-Side Bar Charts
We will be working with two categorical variables X and Y. We will want to compare the distributions of X conditioned on the value of Y. We will enter the data as a contingency table and produce side-by-side bar charts.
Algorithm 10. Contingency Table
Enter data directly into R
Parameters:
c : number of categories for X (number of columns)
r : number of categories for Y (number of rows)
x1, x2, x3, ..., xc : the names of the categories for X
y1, y2, y3, ..., yr : the names of the categories for Y
zij : the frequency for the category xj conditional on Y = yi (z = (zij) is a matrix of size r × c; the columns vary according to the categories of X, and the rows vary according to the categories of Y)
Data Entry:
x <- c(x1, x2, ..., xc)
y <- c(y1, y2, ..., yr)
z <- c(z11, z12, ..., z1c, z21, z22, ..., z2c, ..., zr1, zr2, ..., zrc)
(For the matrix, we enter the data one row at a time until we have entered the whole matrix.)
Commands to produce the contingency table:
ct <- matrix(z, nrow=r, ncol=c, byrow=TRUE, dimnames=list(y,x))
ct <- as.table(as.matrix(ct))
ctTotal <- addmargins(ct)
colnames(ctTotal)[(length(x) + 1)] <- "Total"
rownames(ctTotal)[(length(y) + 1)] <- "Total"
ctTotal
The contingency table will be displayed in the console window.
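To make the data entry of Algorithm 10 concrete, here is a minimal sketch with invented data: the category names and the six counts below are hypothetical and are not taken from [RBGL]. The sketch also uses length(y) and length(x) in place of the parameters r and c, an assumption made here so that the R function c() is not masked by a variable of the same name.

```r
# Hypothetical categories and counts, for illustration only
x <- c("A", "B", "C")            # categories of X (columns)
y <- c("Group 1", "Group 2")     # categories of Y (rows)
z <- c(10, 20, 30,               # row 1: Y = "Group 1"
       15, 25,  5)               # row 2: Y = "Group 2"

# byrow = TRUE fills the matrix one row at a time, matching the order of z
ct <- matrix(z, nrow = length(y), ncol = length(x), byrow = TRUE,
             dimnames = list(y, x))
ct <- as.table(ct)

# Add the row and column totals, then rename the default label "Sum"
ctTotal <- addmargins(ct)
colnames(ctTotal)[length(x) + 1] <- "Total"
rownames(ctTotal)[length(y) + 1] <- "Total"
print(ctTotal)
```

The bottom-right cell of the printed table is the total sample size, which should equal sum(z).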
Commands to obtain the conditional relative frequency distributions:

    ct <- matrix(z, nrow=r, ncol=c, byrow=TRUE, dimnames=list(y,x))
    ct <- as.table(as.matrix(ct))
    condDist <- prop.table(ct,1)
    condDist

The table of conditional relative frequencies will be displayed in the console window.

Remarks :
1. By default, the command addmargins() computes the row totals and the column totals of the table. Furthermore, the name of the column and of the row with the totals is Sum. We have added two lines to the script to change the name Sum to Total.
2. We obtain the conditional relative frequency distributions of X (the columns) by conditioning on the values of Y (the rows). We can obtain the distributions of Y conditional on X by substituting the above command prop.table(ct,1) with prop.table(ct,2).

To construct side-by-side bar charts to display the distributions of one categorical variable conditional on another categorical variable from the contingency table of two categorical variables, we can use the following algorithm.
Algorithm 11. Side-by-side bar charts
Enter the contingency table into R

Parameters :
c : number of categories for X (number of columns)
r : number of categories for Y (number of rows)
x1, x2, ..., xc : the names of the categories for X
y1, y2, ..., yr : the names of the categories for Y
xname, yname : the names of the variables X and Y (between quotation marks : ")
zij : the frequency for the category xj conditional on Y = yi
(z = (zij) is a matrix of size r × c; the columns vary according to the categories of X, and the rows vary according to the categories of Y)

Data Entry :

    x <- c(x1, x2, x3, ..., xc)
    y <- c(y1, y2, y3, ..., yr)
    z <- c(z11, z12, ..., z1c, z21, z22, ..., z2c, ..., zr1, zr2, ..., zrc)

(For the matrix, we enter the data one row at a time until we have entered the whole matrix.)

Script for the side-by-side bar charts :

    ct <- matrix(z, nrow=r, ncol=c, byrow=TRUE)
    condDist <- prop.table(ct,1)
    barplot(condDist, beside=TRUE, names.arg=x, xlab=xname,
            ylab="Relative Frequency",
            cex.names=1.5, cex.axis=1.5, cex.lab=1.5)
    abline(h=0, col="gray")
    legend(locator(1), y, fill=gray.colors(length(y)), title=yname, cex=1.2)

(R will ask you to select the location for the legend (the list of names for Y) by clicking directly on the R graphics window.)

The graph and its legend are displayed in the R graphics window.

Remarks :
1. The command abline(h=0, col="gray") is not necessary; it is a matter of aesthetics. It adds a gray line along the horizontal axis.
2. The value of cex in the command legend() controls the size of the font in the legend. The value 1.2 means 20% larger than the default.

5.4 Example 9.2

We illustrate the construction of the contingency table and of the side-by-side bar charts via Example 9.2 of [RBGL]. The values of the parameters c and r are 4 and 2, respectively.
We use Algorithm 10 to enter the contingency table into R, to produce a contingency table with totals and to produce a table with the distributions of the tumour variable conditional on the river system.

Commands

    x <- c("only liver", "only mouth", "both", "no tumour")
    y <- c("1", "2")
    z <- c(35, 10, 3, 75, 15, 8, 2, 255)
    ct <- matrix(z, nrow=length(y), ncol=length(x), byrow=TRUE,
                 dimnames=list(y,x))
    ctTotal <- addmargins(ct)
    colnames(ctTotal)[(length(x)+1)] <- "Total"
    rownames(ctTotal)[(length(y)+1)] <- "Total"
    condDist <- prop.table(ct,1)
    ctTotal
    condDist

Output

    > ctTotal
          only liver only mouth both no tumour Total
    1             35         10    3        75   123
    2             15          8    2       255   280
    Total         50         18    5       330   403
    > condDist
      only liver only mouth        both no tumour
    1 0.28455285 0.08130081 0.024390244 0.6097561
    2 0.05357143 0.02857143 0.007142857 0.9107143

Below we build side-by-side bar charts.
Commands and output

    x <- c("only liver", "only mouth", "both", "no tumour")
    y <- c("1", "2")
    z <- c(35, 10, 3, 75, 15, 8, 2, 255)
    ct <- matrix(z, nrow=length(y), ncol=length(x), byrow=TRUE,
                 dimnames=list(y,x))
    condDist <- prop.table(ct,1)
    barplot(condDist, beside=TRUE, names.arg=x, xlab="Tumour Category",
            ylab="Relative Frequency",
            cex.names=1.5, cex.axis=1.5, cex.lab=1.5)
    abline(h=0, col="gray")
    legend(locator(1), y, fill=gray.colors(length(y)),
           title="River System", cex=1.2)
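As a check on Example 9.2, the sketch below verifies that each row of the conditional table sums to one and shows what changes when we condition on the columns instead. It is an illustration under two assumptions that are not part of the guide's script: the variable names condOnRiver and condOnTumour are introduced here, and the legend is placed with the keyword "topleft" instead of locator(1) so that the script runs without a mouse click.

```r
# Contingency table of Example 9.2 (rows: river system, columns: tumour category)
x <- c("only liver", "only mouth", "both", "no tumour")
y <- c("1", "2")
z <- c(35, 10, 3, 75, 15, 8, 2, 255)
ct <- matrix(z, nrow=length(y), ncol=length(x), byrow=TRUE,
             dimnames=list(y, x))

# Distribution of the tumour variable within each river system (margin 1 = rows)
condOnRiver <- prop.table(ct, 1)
print(rowSums(condOnRiver))    # each row sums to one

# Distribution of the river system within each tumour category (margin 2 = columns)
condOnTumour <- prop.table(ct, 2)
print(colSums(condOnTumour))   # each column sums to one

# Side-by-side bar chart with the legend at a fixed position
barplot(condOnRiver, beside=TRUE, xlab="Tumour Category",
        ylab="Relative Frequency")
abline(h=0, col="gray")
legend("topleft", legend=y, fill=gray.colors(length(y)),
       title="River System")
```

A fixed legend position is convenient when the script is run in batch mode; locator(1) remains the better choice when the legend would otherwise cover a bar.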
6 Importing Data from a Text File

It is convenient to enter data directly into R but too often, we will need to import data from a file. In particular, it would be very useful to know how to import data from a text file or an Excel file. Fortunately, an Excel file is easily converted into a text file, so we will only discuss importing data from a text file. However, it is necessary that the data within the text file follow a particular format (discussed in Appendix ??) so the data can be used within particular R commands. For this reason, we will always give an explicit indication concerning the format of the data within the text file in order to execute a particular algorithm. For all information regarding the formatting of the text file to import into R, we refer the reader to Appendix ??. A method for converting an Excel file to a text file is also given in this appendix.

The following algorithm can be used in general to import data from a text file.

Algorithm 12. Import data from a text file into R

Parameters : None
Format of the data within the text file : A format given in the Appendix ?? of this guide.
Command to import data:

    z = read.table(file.choose(), header=TRUE, sep="\t")

A window will open to help us find the location of the text file containing the data. We will be prompted to browse for the file.

Remarks:
• We can verify the contents of the variable z at any time; simply enter z at the prompt in the R console and press the Enter key.
• The argument header=TRUE tells R that the first row of the file should be interpreted as variable names.
• We will assume that the file is tab-delimited, that is, we are using a tab to separate the columns. To identify the columns in the text file, we will need to use a field separator (sep). To indicate to R that the field separator is a tab, use sep = "\t". By default, R uses sep = "", which uses any white space (spaces, tabs or newlines) as a separator.
• Another good field separator in English is a comma (use sep = ","), but it does not work very well in French, where the comma serves as the decimal mark.
• The command read.table() will create a data frame, that is, a list of columns. To access column 1 use z[,1], to access column 2 use z[,2], and so on.
• To access the names of the variables use names(z).

6.1 Importing Clutch Sizes (Examples 9.3-9.4)

In this section, we will refer to the data found in the text file CLUTCHSIZE.txt. It contains two columns that are tab-delimited. We will use Algorithm 12 to import the data.
Command

    z=read.table(file.choose(),header=TRUE,sep="\t")

To view the contents of z, we enter z in the R console at the prompt and hit ENTER.

Command and Output

    > z
       clutch.size.1 clutch.size.2
    1             11             6
    2             12             7
    3             11             8
    4              9             8
    5             11             8
    6              2             9
    7              6             9
    8              3             9
    9             17             9
    10             6             9
    11            10            10
    12            10            10
    13             8            10
    14            11            10
    15             5            10
    16            NA            10
    17            NA            10
    18            NA            11
    19            NA            12
    20            NA            14
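To try the import step without the data file at hand, the sketch below writes a small tab-delimited file to a temporary location and reads it back with the same arguments as Algorithm 12. The temporary file and the use of a path in place of file.choose() are assumptions made for illustration; the three data rows are copied from the output above.

```r
# Write a small tab-delimited file; the first line holds the variable names
path <- tempfile(fileext = ".txt")
writeLines(c("clutch.size.1\tclutch.size.2",
             "11\t6",
             "12\t7",
             "NA\t10"),
           path)

# Same call as in Algorithm 12, with a path in place of file.choose()
z <- read.table(path, header=TRUE, sep="\t")

print(z)
print(names(z))       # variable names taken from the header row
print(z[,1])          # first column as a vector
print(is.na(z[,1]))   # TRUE marks a missing value (NA)
```

Passing an explicit path makes a script repeatable, whereas file.choose() is more convenient for interactive work.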
1] or z[.2].e. . We then display the ﬁrst and second column.header=TRUE.1" "clutch.sep="\t") 44 . we import the data into R with Algorithm 12.table(file.size. .The names of the columns are in names(z) and to access each column we can z[. Afterwards. The second column can be a categorical variable that is used to identify the region. we entered the data into the ﬁrst two columns using the following format. some columns can be numerical and some can be categorical.1]  11 12 11 9 11 2 6 3 17 6 10 10 8 11 5 NA NA NA NA NA > z[.2" > z[. i. Command and Output > names(z)  "clutch.choose().2]  6 7 8 8 8 9 9 9 9 9 10 10 10 10 10 10 10 11 12 14 The data frame can contain variables of various types. . .size. As an example. In Excel. 12 Region 2 14 Region 2 Command to import the data z=read. We saved the data as a tab-delimited ﬁle. say that we have one column that contains the 35 clutch sizes. region clutch size 11 Region 1 12 Region 1 11 Region 1 . .
Commands and output to display the variables

    > names(z)
    [1] "clutch.size" "region"
    > z[,1]
     [1] 11 12 11  9 11  2  6  3 17  6 10 10  8 11  5  6  7  8  8  8
    [21]  9  9  9  9  9 10 10 10 10 10 10 10 11 12 14
    > z[,2]
     [1] Region 1 Region 1 Region 1 Region 1 Region 1 Region 1
     [7] Region 1 Region 1 Region 1 Region 1 Region 1 Region 1
    [13] Region 1 Region 1 Region 1 Region 2 Region 2 Region 2
    [19] Region 2 Region 2 Region 2 Region 2 Region 2 Region 2
    [25] Region 2 Region 2 Region 2 Region 2 Region 2 Region 2
    [31] Region 2 Region 2 Region 2 Region 2 Region 2
    Levels: Region 1 Region 2

Remarks:
• R identified the variable region as a categorical variable since there are non-numerical characters in the field. (In R 4.0.0 and later, read.table() keeps such a column as character by default; supply stringsAsFactors=TRUE to obtain the categorical output shown above.)
• If the names of the categories are numbers, R will consider the variable as numerical. It is possible to impose a data type on a vector with the commands as.numeric(name of variable) and as.character(name of variable).
• Below we use the subset command to get a subset of the data frame. To subset, we can use logical conditions to select the rows and use the argument select to select the columns. We selected the clutch sizes from region 1.
Commands and output to display the variables

    > names(z)
    [1] "clutch.size" "region"
    > x<-subset(z, region=="Region 1", select=c(clutch.size))
    > x
       clutch.size
    1           11
    2           12
    3           11
    4            9
    5           11
    6            2
    7            6
    8            3
    9           17
    10           6
    11          10
    12          10
    13           8
    14          11
    15           5
    > x<-x[,1]
    > x
     [1] 11 12 11  9 11  2  6  3 17  6 10 10  8 11  5

Remark: The result of the command x<-subset(z, region=="Region 1", select=c(clutch.size)) is a data frame, since we are taking a subset of a data frame. However, for most commands, the operand must be a vector, not a data frame. By using the command x<-x[,1], we are converting a data frame with one column into a vector.

We will end this section with a data type conversion. Suppose that we have a categorical variable x called "region", whose values are used to identify the region of the observation. The categories are 1, 2 and 3. Since all the values are numerical, R thinks that it is a numerical variable. We will build the vector x. We will ask R if the vector is numerical and also if the vector is categorical. We will then convert its data type to categorical (called character in R) and then we will convert its data type back to numerical.

Commands for data type conversion

    > x <- c(1,1,1,1,2,2,3,3,3)
    > x
    [1] 1 1 1 1 2 2 3 3 3
    > is.numeric(x)
    [1] TRUE
    > is.character(x)
    [1] FALSE
    > x <- as.character(x)
    > x
    [1] "1" "1" "1" "1" "2" "2" "3" "3" "3"
    > is.numeric(x)
    [1] FALSE
    > is.character(x)
    [1] TRUE
    > x <- as.numeric(x)
    > x
    [1] 1 1 1 1 2 2 3 3 3
    > is.numeric(x)
    [1] TRUE
    > is.character(x)
    [1] FALSE

7 Numerical Variables

We now consider the case of the construction of a histogram for a numerical variable. We will refer to Chapter 9 of [RBGL].
7.1 Histogram

We can again proceed by entering the data directly into R, but we will also discuss importing data into R from a text file. We start with the case of data entered directly into R.

Algorithm 13. Histogram
Constructing a histogram

Parameters :
x1, x2, ..., xn : n observations of the variable x
xname : the name of the variable (between quotation marks : ")

Data Entry :

    x <- c(x1, x2, x3, ..., xn)

Remark: The above is an example of direct data entry into R. We could alternatively import a data frame into R and assign the appropriate column to x.

Command for a (frequency) histogram:

    hist(x, xlab=xname, ylab="Frequency", main="Frequency Histogram",
         col="lightgray", cex.main=1.5, cex.axis=1.5, cex.lab=1.5)

This histogram is displayed in the R graphics window.
Remarks:
• You can add the optional argument breaks=b, where b represents the number of breaks. R treats a single number as a suggestion and may adjust it to obtain round break points. If b is a vector, then its components are the values of the breaks.
• The total area under a frequency histogram is not one: with bins of equal width, it equals the number of observations multiplied by the bin width. To obtain a total area of one, we should use a density histogram, by using the optional argument prob=TRUE.

Command for a (density) histogram:

    hist(x, prob=TRUE, xlab=xname, ylab="Probability Density",
         main="Density Histogram", col="lightgray",
         cex.main=1.5, cex.axis=1.5, cex.lab=1.5)

This histogram is displayed in the R graphics window. The total area equals one.

Remark: It is possible to obtain a relative frequency histogram, but not directly. We must force R to do so. The first step will be to build an object h that will be the frequency histogram. The vector of frequencies for the histogram object h is h$counts. We will modify the counts so that they represent the relative frequencies instead of the frequencies. Afterwards, we can plot the object h.

Command for a (relative frequency) histogram:

    h=hist(x)
    h$counts = h$counts/sum(h$counts)
    plot(h, xlab=xname, ylab="Relative Frequency",
         main="Relative Frequency Histogram", col="lightgray",
         cex.main=1.5, cex.axis=1.5, cex.lab=1.5)

This histogram is displayed in the R graphics window.
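The two claims above (a density histogram has total area one, and the rescaled counts are relative frequencies that sum to one) can be checked numerically. The sketch below is an illustration, not the guide's own script: it uses the region 1 clutch sizes listed in Section 6.1, and it adds plot=FALSE, an argument of hist() that builds the histogram object without drawing it.

```r
# Clutch sizes from region 1 (first column of CLUTCHSIZE.txt, Section 6.1)
x <- c(11, 12, 11, 9, 11, 2, 6, 3, 17, 6, 10, 10, 8, 11, 5)

# Build the histogram object without drawing it
h <- hist(x, plot=FALSE)

# Density histogram: area of a bar = density * bin width; the areas sum to one
areaDensity <- sum(h$density * diff(h$breaks))
print(areaDensity)

# Relative frequency histogram: rescale the counts, then plot the object h
h$counts <- h$counts / sum(h$counts)
print(sum(h$counts))
plot(h, xlab="Clutch Size", ylab="Relative Frequency",
     main="Relative Frequency Histogram", col="lightgray")
```

Note that the object passed to plot() is h, the modified histogram, and not the data vector x.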
7.2 Example 9.3

We enter the data directly into R and construct the histogram for the clutch size variable of Example 9.3 of ??.

Commands and the graph

    x <- c(11,12,11,9,11,2,6,3,17,6,10,10,8,11,5)
    hist(x, xlab="Clutch Size", ylab="Frequency", main="Frequency Histogram",
         col="lightgray", cex.main=1.5, cex.axis=1.5, cex.lab=1.5)

Remarks:
1. By adding the argument breaks=c(2,5,8,11,14,17), we can get the same intervals as used in Example 9.3. However, R uses inclusive upper bounds instead of inclusive lower bounds, that is, each interval contains its right endpoint rather than its left endpoint. So the graph will appear a bit different from the graph in the textbook.
2. By using the argument breaks=c(1.99,4.99,7.99,10.99,13.99,17), we can obtain the same histogram as in the textbook. Note that the lengths of the intervals are approximately the same, but they are not exactly the same. The last interval is of length 3.01, while the others are of length 3. In this case, R will by default construct a density histogram. To force R to display frequencies, we must add the argument prob=FALSE to obtain the frequency histogram.
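The effect of the inclusive bounds can be seen by comparing the bin counts directly. The sketch below is an illustration using the region 1 clutch sizes listed in Section 6.1. It also tries right=FALSE, an argument of hist() that the guide does not use; it is offered here only as a possible alternative to shifting the breaks, and the variable names are introduced for this example.

```r
# Clutch sizes from region 1 (Section 6.1)
x <- c(11, 12, 11, 9, 11, 2, 6, 3, 17, 6, 10, 10, 8, 11, 5)
b <- c(2, 5, 8, 11, 14, 17)

# Default: each interval includes its upper bound, (a, b]
countsRight <- hist(x, breaks=b, plot=FALSE)$counts

# right=FALSE: each interval includes its lower bound, [a, b)
countsLeft <- hist(x, breaks=b, right=FALSE, plot=FALSE)$counts

# Shifted breaks from Remark 2, with the default inclusive upper bounds
countsShift <- hist(x, breaks=c(1.99, 4.99, 7.99, 10.99, 13.99, 17),
                    plot=FALSE)$counts

print(countsRight)   # 3 3 7 1 1
print(countsLeft)    # 2 3 4 5 1
print(countsShift)   # 2 3 4 5 1
```

With right=FALSE the breaks stay equally spaced, so R keeps frequencies on the vertical axis without the argument prob=FALSE.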
In the above example we entered the data directly into R. We will now consider the case where we import a data frame into R. Consider the data in the text file CLUTCHSIZE.txt. The first column contains the clutch sizes from region 1, while the second column is for region 2. We will build a histogram for the clutch size variable for each of the two regions. We will display both histograms in the same graphics window.

Commands and the graphs

    z = read.table(file.choose(), header=TRUE, sep="\t")
    par(mfrow=c(2,1))
    hist(z[,1], xlab="Clutch Size (Region 1)", ylab="Frequency",
         main="Frequency Histogram", col="lightgray",
         cex.main=1.5, cex.axis=1.5, cex.lab=1.5)
    hist(z[,2], xlab="Clutch Size (Region 2)", ylab="Frequency",
         main="Frequency Histogram", col="lightgray",
         cex.main=1.5, cex.axis=1.5, cex.lab=1.5)
Remark: To display many graphs in the same window, we can use the command par(mfrow=c(r,c)), where r is the number of rows and c is the number of columns in the display. The cells of this layout within the graphics window are filled one by one, row by row, as we produce graphs. To revert to displaying one graph per window, enter the command par(mfrow=c(1,1)).

To end Example 9.3, suppose now that we are importing a data frame with two columns: the first column contains the clutch sizes and the second column is a categorical variable that is used to identify the region. We will produce side-by-side density histograms of the clutch sizes per region. To do so, we must subset the data frame to obtain the two samples of clutch sizes.

Commands and the graphs

    z = read.table(file.choose(), header=TRUE, sep="\t")
    x<-subset(z, region=="Region 1", select=c(clutch.size))
    x<-x[,1]
    y<-subset(z, region=="Region 2", select=c(clutch.size))
    y<-y[,1]
    par(mfrow=c(1,2))
    hist(x, prob=TRUE, xlab="Clutch Size (Region 1)",
         ylab="Probability Density", main="Density Histogram",
         col="lightgray", cex.main=1.5, cex.axis=1.5, cex.lab=1.5)
    hist(y, prob=TRUE, xlab="Clutch Size (Region 2)",
         ylab="Probability Density", main="Density Histogram",
         col="lightgray", cex.main=1.5, cex.axis=1.5, cex.lab=1.5)
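Because par(mfrow=...) stays in effect for later graphs, it can help to save the previous settings and restore them afterwards. The sketch below illustrates this with the two samples of clutch sizes listed in Section 6.1, entered directly so that it runs without the data file. The names op, layoutDuring and layoutAfter are introduced here for illustration.

```r
# Clutch sizes by region, as listed in Section 6.1 (missing values omitted)
x <- c(11, 12, 11, 9, 11, 2, 6, 3, 17, 6, 10, 10, 8, 11, 5)
y <- c(6, 7, 8, 8, 8, 9, 9, 9, 9, 9, 10, 10, 10, 10, 10, 10, 10, 11, 12, 14)

# par() returns the previous settings, so they can be restored later
op <- par(mfrow=c(1, 2))
layoutDuring <- par("mfrow")   # 1 row, 2 columns

hist(x, prob=TRUE, xlab="Clutch Size (Region 1)",
     ylab="Probability Density", main="Density Histogram", col="lightgray")
hist(y, prob=TRUE, xlab="Clutch Size (Region 2)",
     ylab="Probability Density", main="Density Histogram", col="lightgray")

# Restore the layout that was in effect before the call to par()
par(op)
layoutAfter <- par("mfrow")
```

On a fresh graphics device the restored layout is one graph per window, which is the same result as entering par(mfrow=c(1,1)) by hand.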
7.3 Example 9.6

We will build the frequency histogram by importing the file SURVIVALTIMES.txt.

Commands and the graph

    x = read.table(file.choose(), header=TRUE, sep="\t")
    hist(x[,1], xlab="Survival Times (in months)", ylab="Frequency",
         main="Frequency Histogram", col="lightgray",
         cex.main=1.5, cex.axis=1.5, cex.lab=1.5)

We will now produce the histogram for the logarithm of the survival time from Example 9.6. We are assuming that the data has already been imported and that it is in the data frame x.
Commands and output

    hist(log(x[,1]), xlab="Survival Times (in log(months))", ylab="Frequency",
         main="Frequency Histogram", col="lightgray",
         cex.main=1.5, cex.axis=1.5, cex.lab=1.5)

Remark: If x is a numerical vector in R, then log(x) is the natural logarithm of x evaluated component-wise. In other words, the ith component of log(x) is the logarithm of the ith component of x.

This concludes the section on histograms.
References

[RBGL] Raluca Balan and Gilles Lamothe. Expect the unexpected: A first course in Biostatistics. World Scientific Publishing Co., Singapore, 2011.

[WScienPI] World Scientific Publishing Co. Data sets for Expect the unexpected: A first course in Biostatistics. 2012. Available at: http://www.worldscientific.com/page/7546-tabd

[Rproj] Institute for Statistics and Mathematics of the Vienna University of Economics and Business. The R project for Statistical Computing. 2012. Available at: http://www.r-project.org/