
Fall 2005 STATISTICS 579 R Tutorial : Vectors, Matrices, and Arrays

1. Creating Matrices: Recall that a vector is an R object; matrix, array, and data frame are examples of other classes of R objects. The R function matrix(), as used below, creates a 3 × 2 matrix object m. Its arguments are, respectively, the data in the form of a vector, the number of rows, and the number of columns.

> m=matrix(c(1.2,3.5,4.7,1.8,-6.4,5.3),3,2)
> m
     [,1] [,2]
[1,]  1.2  1.8
[2,]  3.5 -6.4
[3,]  4.7  5.3

The arguments to an R function may be specified as named arguments, i.e., in the form name=value, or as positional arguments, i.e., by giving just the values in the same sequence as in the function specification. For example, using the named form, the arguments to the matrix function can be given in a different order from the one above and still produce the same result:

> matrix(c(1.2,3.5,4.7,1.8,-6.4,5.3),ncol=2,nrow=3)
     [,1] [,2]
[1,]  1.2  1.8
[2,]  3.5 -6.4
[3,]  4.7  5.3

Also recall that data from external files may be directly "scanned" into matrix() for creating data matrices. Earlier we used:

> insulin1=matrix(scan("insulin.data"),ncol=3,byrow=T)
Read 24 items

The functions dim() and dimnames() may be used to query or assign the corresponding attributes of matrix objects, as shown below. In a similar fashion, elements of a vector object may be assigned names using the names() function.

> dim(m)
[1] 3 2
> dimnames(m)
NULL
> dimnames(m)=list(paste("Row",1:3),paste("Col",c(1,2)))
> m
      Col 1 Col 2
Row 1   1.2   1.8
Row 2   3.5  -6.4
Row 3   4.7   5.3
> dimnames(m)
[[1]]
[1] "Row 1" "Row 2" "Row 3"

[[2]]
[1] "Col 1" "Col 2"
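As the first output above shows, matrix() fills the matrix one column at a time unless told otherwise. The following is a minimal sketch contrasting the default fill order with byrow=TRUE and supplying the row and column names at creation time through the dimnames argument. The object names v, by_col, by_row, and named are illustrative only and are not used elsewhere in this tutorial.

```r
# Same six values as in the text; the object names are illustrative.
v <- c(1.2, 3.5, 4.7, 1.8, -6.4, 5.3)

# Default: the values fill the matrix one column at a time.
by_col <- matrix(v, nrow = 3, ncol = 2)

# byrow = TRUE: the values fill the matrix one row at a time.
by_row <- matrix(v, nrow = 3, ncol = 2, byrow = TRUE)

# Row and column names can also be supplied when the matrix is created.
named <- matrix(v, nrow = 3, ncol = 2,
                dimnames = list(paste("Row", 1:3), paste("Col", 1:2)))

by_col[2, 1]              # 3.5
by_row[2, 1]              # 4.7
named["Row 2", "Col 2"]   # -6.4
```

Once names are attached, elements can be referenced by name as well as by position, which is used later in this tutorial with state.x77.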

> h
[1] 15.1 11.3  7.0  9.0
> names(h)=c("APE","BOX","CAT","DOG")
> h
 APE  BOX  CAT  DOG
15.1 11.3  7.0  9.0

There are several functions that help perform complex matrix operations. Two of these are introduced below; their uses will be illustrated in examples that appear later. The row() function, operating on a matrix, returns a matrix of integers giving the row number of each element of the matrix. The returned matrix therefore has the same dimensions as the argument. Similarly, the col() function returns a matrix of column numbers. The output below is for a 2 × 3 matrix m, such as the one created in Section 2 below; for the 3 × 2 matrix m created above, the results would be 3 × 2.

> row(m)
     [,1] [,2] [,3]
[1,]    1    1    1
[2,]    2    2    2
> col(m)
     [,1] [,2] [,3]
[1,]    1    2    3
[2,]    1    2    3

Note carefully that these would be the same for any 2 × 3 matrix.

2. Matrix Operations: A variety of operators (e.g., %*%) and functions (e.g., rbind(), solve(), diag(), etc.) are available to extract information from matrix objects or to perform computations involving matrix operations. The functions rbind() and cbind() allow appending of rows or columns to matrices:

> rm(m)
> mdata=c(1.2,3.5,4.7,1.8,-6.4,5.3)
> m=matrix(mdata,ncol=3,byrow=T);m
     [,1] [,2] [,3]
[1,]  1.2  3.5  4.7
[2,]  1.8 -6.4  5.3
> m1=rbind(1:3,m);m1
     [,1] [,2] [,3]
[1,]  1.0  2.0  3.0
[2,]  1.2  3.5  4.7
[3,]  1.8 -6.4  5.3
> m2=cbind(2,m1);m2
     [,1] [,2] [,3] [,4]
[1,]    2  1.0  2.0  3.0
[2,]    2  1.2  3.5  4.7
[3,]    2  1.8 -6.4  5.3

The operator %*% requires that the two matrices be conformable for matrix multiplication. The function t() transposes a matrix, and the function solve() may be used either to find the inverse of a matrix or to solve a set of linear equations, as illustrated below:


> mm=m2%*%t(m2)
> mm
     [,1]  [,2]  [,3]
[1,] 18.0 26.30  8.90
[2,] 26.3 39.78  8.67
[3,]  8.9  8.67 76.29
> m
     [,1] [,2] [,3]
[1,]  1.2  3.5  4.7
[2,]  1.8 -6.4  5.3
> m%*%c(1,-1)
Error in m %*% c(1, -1) : non-conformable arguments
> solve(mm)
            [,1]       [,2]        [,3]
[1,]  2.09544227 -1.3659267 -0.08922338
[2,] -1.36592672  0.9161643  0.05523140
[3,] -0.08922338  0.0552314  0.01723990
> solve(mm,c(10.3,24.5,36.7))
[1] -15.156647  10.403973   1.066873

The previous command solves the linear system:

18.0x1 + 26.30x2 +  8.90x3 = 10.3
26.3x1 + 39.78x2 +  8.67x3 = 24.5
 8.9x1 +  8.67x2 + 76.29x3 = 36.7

Several other linear algebra functions of interest are chol(), qr(), backsolve(), forwardsolve(), and ginv() (the last of these is in the MASS package). These are useful for performing many statistical computations and will be discussed in other courses. As an example, chol() factors a symmetric positive-definite matrix X into the form X = R'R, where R is an upper triangular matrix and R' denotes its transpose. As an application, we may use this factorization to generate random variables from the p-variate multivariate Normal distribution y ∼ N(µ, Σ) using the relationship y = µ + R'z, where z ∼ N(0, I) and Σ = R'R. It is easy to generate a sample from the p-variate multivariate Normal distribution N(0, I): just generate a random sample of size p from the univariate standard Normal distribution.

> R=chol(mm)
> R
         [,1]     [,2]      [,3]
[1,] 4.242641 6.198969  2.097750
[2,] 0.000000 1.163090 -3.726186
[3,] 0.000000 0.000000  7.616100
> t(R)%*%R
     [,1]  [,2]  [,3]
[1,] 18.0 26.30  8.90
[2,] 26.3 39.78  8.67
[3,]  8.9  8.67 76.29

The function det() calculates the determinant. The function eigen() operates on square matrices and returns a list with two components: values, containing the eigenvalues, and vectors, a matrix whose columns are the corresponding eigenvectors.

> det(mm)
[1] 1412.421
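Returning to the Cholesky factorization described above, the recipe for generating multivariate Normal samples can be sketched as follows. This is only an illustration under stated assumptions: the mean vector mu, the number of draws n, and the seed are arbitrary choices that do not come from the tutorial, and mm is rebuilt under the name Sigma so that the block runs on its own.

```r
set.seed(579)                       # arbitrary seed, only for reproducibility

# Rebuild the matrix mm from the text so that the block is self-contained.
m1 <- rbind(1:3, matrix(c(1.2, 3.5, 4.7, 1.8, -6.4, 5.3), ncol = 3, byrow = TRUE))
m2 <- cbind(2, m1)
Sigma <- m2 %*% t(m2)               # plays the role of the covariance matrix

mu <- c(1, 2, 3)                    # hypothetical mean vector
n <- 10000                          # number of draws (arbitrary)
p <- length(mu)

R <- chol(Sigma)                    # upper triangular, t(R) %*% R equals Sigma

Z <- matrix(rnorm(n * p), nrow = p) # each column is one z ~ N(0, I)
Y <- mu + t(R) %*% Z                # each column is one y = mu + R'z

rowMeans(Y)                         # should be close to mu
cov(t(Y))                           # should be close to Sigma
```

The transpose matters here: because chol() returns the upper triangular factor, it is t(R) %*% z, not R %*% z, that has covariance t(R) %*% R = Sigma.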

> eigen(mm)
$values
[1] 82.2943241 51.4420374  0.3336385

$vectors
          [,1]       [,2]        [,3]
[1,] 0.2668772 -0.4810885  0.83506315
[2,] 0.3483415 -0.7597540 -0.54902828
[3,] 0.8985737  0.4374103 -0.03517792

The function diag() performs several operations depending on whether its argument is a scalar, a vector, or a matrix. If the argument is a scalar, diag() returns an identity matrix of that dimension; if the argument is a vector, it returns a diagonal matrix with the elements of the vector as its diagonal elements. If the argument is a matrix, diag() returns a vector containing the diagonal elements of the matrix.

> diag(4)
     [,1] [,2] [,3] [,4]
[1,]    1    0    0    0
[2,]    0    1    0    0
[3,]    0    0    1    0
[4,]    0    0    0    1
> diag(h)
     [,1] [,2] [,3] [,4]
[1,] 15.1  0.0    0    0
[2,]  0.0 11.3    0    0
[3,]  0.0  0.0    7    0
[4,]  0.0  0.0    0    9
> diag(mm)
[1] 18.00 39.78 76.29
> m
     [,1] [,2] [,3]
[1,]  1.2  3.5  4.7
[2,]  1.8 -6.4  5.3
> diag(m)
[1]  1.2 -6.4

3. Subscripting Vectors: The elements of a vector can be extracted using an index vector enclosed in square brackets; the index vector is then said to be used as a subscript. The use of subscripts to reference or extract elements of vectors is illustrated below:

> hh
 [1] 15.1 11.3  7.0  9.0  0.0  0.0  0.0 15.1 11.3  7.0  9.0
> hh[1:5]
[1] 15.1 11.3  7.0  9.0  0.0
> hh[c(1,5,8)]
[1] 15.1  0.0 15.1
> hh[-c(1,5,8)]
[1] 11.3  7.0  9.0  0.0  0.0 11.3  7.0  9.0

The use of negative subscripts causes all values except those specified in the index vector to be extracted. The use of logical values as indices is perhaps the most useful of all operations involving subscripts. If an index vector consisting of TRUE and FALSE values is used as a subscript, the values in the vector for which the subscript is TRUE are extracted. Such index vectors are usually created by comparing the vector to a scalar using a comparison operator. For example:

> hh>0
 [1]  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE
> hh[hh>0]
[1] 15.1 11.3  7.0  9.0 15.1 11.3  7.0  9.0
> attach(chickwts)
> weight
 [1] 179 160 136 227 217 168 108 124 143 140 309 229 181 141 260 203 148 169 213
[20] 257 244 271 243 230 248 327 329 250 193 271 316 267 199 171 158 248 423 340
[39] 392 339 341 226 320 295 334 322 297 318 325 257 303 315 380 153 263 242 206
[58] 344 258 368 390 379 260 404 318 352 359 216 222 283 332
> wtmean=mean(weight)
> wtsd=sd(weight)
> outsiders=sum(weight<wtmean-2*wtsd|weight>wtmean+2*wtsd)
> outsiders
[1] 1
> weight[weight<wtmean-2*wtsd|weight>wtmean+2*wtsd]
[1] 423
> index=weight<wtmean-2*wtsd|weight>wtmean+2*wtsd
> index
 [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[13] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[25] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[37]  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[49] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[61] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
> weight[index]
[1] 423
> (1:length(weight))[index]
[1] 37
> seq(along=weight)[index]
[1] 37
> precip
     Mobile      Juneau     Phoenix Little Rock
       67.0        54.7         7.0        48.5
Los Angeles  Sacramento
       14.0        17.2
...............................
...............................
   Cheyenne    San Juan
       14.6        59.2
> seq(along=precip)[min(precip)==precip]
[1] 3
> names(precip)[seq(along=precip)[min(precip)==precip]]
[1] "Phoenix"
> names(precip)[min(precip)==precip]
[1] "Phoenix"
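The positions of the TRUE values, obtained above with seq(along=weight)[index], can also be obtained with the base R functions which() and which.min(). The sketch below repeats the two examples in that form. It reads the weights directly from the built-in chickwts data frame instead of using attach(), which is a stylistic choice made here and not part of the original example.

```r
# Take the weights straight from the built-in data frame (no attach() needed).
weight <- chickwts$weight
wtmean <- mean(weight)
wtsd <- sd(weight)

# Logical index: TRUE for weights more than 2 standard deviations from the mean.
index <- weight < wtmean - 2 * wtsd | weight > wtmean + 2 * wtsd

which(index)             # positions of the TRUE values
weight[which(index)]     # the same values as weight[index]

# which.min() gives the position (with its name) of the first minimum.
which.min(precip)
names(precip)[precip == min(precip)]   # all minima, as in the text
```

Note that which.min() reports only the first minimum, whereas the logical comparison precip == min(precip) picks up every element tied for the minimum.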

4. Subscripting Matrices: Subscripts or index vectors can be used to extract or replace elements, entire rows or columns, and submatrices of matrix objects:

> mdata=c(1.2,3.5,4.7,1.8,-6.4,5.4,-1.9,2.7,3.4,-2.0,7.2,4.5)
> m=matrix(mdata,3,4,byrow=T)
> m
     [,1] [,2] [,3] [,4]
[1,]  1.2  3.5  4.7  1.8
[2,] -6.4  5.4 -1.9  2.7
[3,]  3.4 -2.0  7.2  4.5
> m[,2]
[1]  3.5  5.4 -2.0
> m[2,3]
[1] -1.9
> m[2,2:3]
[1]  5.4 -1.9
> m[2:3,c(1,3)]
     [,1] [,2]
[1,] -6.4 -1.9
[2,]  3.4  7.2
> m[2,2]=5.5
> m
     [,1] [,2] [,3] [,4]
[1,]  1.2  3.5  4.7  1.8
[2,] -6.4  5.5 -1.9  2.7
[3,]  3.4 -2.0  7.2  4.5

Just as in the case of a vector, if a matrix is compared to a scalar, a matrix of logical values (TRUE or FALSE) is created. This matrix can be used as a subscript to index the elements of the matrix that correspond to the TRUE values. These elements may be extracted or changed to new values in place:

> m<0
      [,1]  [,2]  [,3]  [,4]
[1,] FALSE FALSE FALSE FALSE
[2,]  TRUE FALSE  TRUE FALSE
[3,] FALSE  TRUE FALSE FALSE
> m[m<0]
[1] -6.4 -2.0 -1.9
> row(m)[m<0]
[1] 2 3 2
> col(m)[m<0]
[1] 1 2 3
> m[m<0]=0
> m
     [,1] [,2] [,3] [,4]
[1,]  1.2  3.5  4.7  1.8
[2,]  0.0  5.5  0.0  2.7
[3,]  3.4  0.0  7.2  4.5

One very useful way of extracting information from a large matrix is to use the row or column name attributes of the matrix in logical expressions as subscripts. This is illustrated using the R built-in data set named state.x77, which is a matrix with 50 rows and 8 columns.

R currently contains several inter-related "state" data sets, all of which are loaded using the data(state) command.

> help("state")
> data(state)
> colnames(state.x77)
[1] "Population" "Income"     "Illiteracy" "Life Exp"   "Murder"
[6] "HS Grad"    "Frost"      "Area"

The technique used previously to extract a subset of a vector may be extended to extract the rows of a matrix that meet a specified condition. The first expression below extracts the rows of the matrix state.x77 (which correspond to states) for which the value in the column named Area (i.e., column 8) is greater than 80000. In the next expression, only the column named Income is printed from this subset of rows.

> state.x77[state.x77[,"Area"]>80000,]
           Population Income Illiteracy Life Exp Murder HS Grad Frost   Area
Alaska            365   6315        1.5    69.31   11.3    66.7   152 566432
Arizona          2212   4530        1.8    70.55    7.8    58.1    15 113417
California      21198   5114        1.1    71.71   10.3    62.6    20 156361
Colorado         2541   4884        0.7    72.06    6.8    63.9   166 103766
Idaho             813   4119        0.6    71.87    5.3    59.5   126  82677
Kansas           2280   4669        0.6    72.58    4.5    59.9   114  81787
Montana           746   4347        0.6    70.56    5.0    59.2   155 145587
Nevada            590   5149        0.5    69.03   11.5    65.2   188 109889
New Mexico       1144   3601        2.2    70.32    9.7    55.2   120 121412
Oregon           2284   4660        0.6    72.13    4.2    60.0    44  96184
Texas           12237   4188        2.2    70.90   12.2    47.4    35 262134
Utah             1203   4022        0.6    72.90    4.5    67.3   137  82096
Wyoming           376   4566        0.6    70.29    6.9    62.9   173  97203
> state.x77[state.x77[,"Area"]>80000,"Income"]
    Alaska    Arizona California   Colorado      Idaho     Kansas    Montana
      6315       4530       5114       4884       4119       4669       4347
    Nevada New Mexico     Oregon      Texas       Utah    Wyoming
      5149       3601       4660       4188       4022       4566

As we may have anticipated, there is an R function named subset() that performs this type of operation on vectors, matrices, and data frames.

> subset(state.x77,state.x77[,"Area"]>80000)
> help(Orange)
> Orange
> subset(Orange,circumference<80)
   Tree age circumference
1     1 118            30
2     1 484            58
8     2 118            33
.     .   .             .
.     .   .             .
29    5 118            30
30    5 484            49
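The correspondence between subset() and the logical subscripts used earlier can be checked directly. The sketch below also uses the select argument of subset() to keep only some of the columns. The object names big, big2, inc, and small are illustrative only.

```r
# Logical-subscript form used earlier in this section.
big <- state.x77[state.x77[, "Area"] > 80000, ]

# subset() on a matrix: the condition must be a logical vector.
big2 <- subset(state.x77, state.x77[, "Area"] > 80000)
isTRUE(all.equal(big, big2))     # the two forms select the same rows

# 'select' chooses columns. For a matrix the result stays a matrix
# (here with one column), unlike the named vector shown earlier.
inc <- subset(state.x77, state.x77[, "Area"] > 80000, select = "Income")
dim(inc)

# For a data frame, the condition and 'select' may use column names directly.
small <- subset(Orange, circumference < 80, select = c(Tree, age))
head(small)
```

For a matrix, column names cannot be used as bare variables in the condition, which is why state.x77[, "Area"] is written out in full; for a data frame such as Orange, the column name circumference can be used on its own.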


5. Creating Arrays: An array is a generalization of a matrix and may have one, two, three, or more dimensions. Thus, the dim attribute of an array may have more than two elements. The function array() may be used to create arrays. Below, we create an array a3 with 4 tiers (or faces, or slices) of 2 × 3 matrices from a vector a. In this example, note how the two sets of indices correspond:

a3[1,1,1]  <-----  a[1]
a3[2,1,1]  <-----  a[2]
a3[1,2,1]  <-----  a[3]
a3[2,2,1]  <-----  a[4]
...........
a3[1,1,2]  <-----  a[7]
...........
a3[2,3,4]  <-----  a[24]

That is, the first index of a3 moves the fastest and the last index moves the slowest.

> a=seq(1,24)
> a3=array(a,dim=c(2,3,4))
> a3
, , 1

     [,1] [,2] [,3]
[1,]    1    3    5
[2,]    2    4    6

, , 2

     [,1] [,2] [,3]
[1,]    7    9   11
[2,]    8   10   12

, , 3

     [,1] [,2] [,3]
[1,]   13   15   17
[2,]   14   16   18

, , 4

     [,1] [,2] [,3]
[1,]   19   21   23
[2,]   20   22   24
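The correspondence between the subscripts of a3 and the positions in a can be written as a small formula. The sketch below computes it with a helper named lin_index, which is a hypothetical function written for this illustration and not part of R, and then uses the base R function arrayInd() to go in the opposite direction. It also shows that leaving a subscript empty extracts a whole slice.

```r
a <- seq(1, 24)
a3 <- array(a, dim = c(2, 3, 4))
d <- dim(a3)

# Hypothetical helper: position in a of the element a3[i, j, k],
# with the first index moving fastest and the last index slowest.
lin_index <- function(i, j, k, d) i + (j - 1) * d[1] + (k - 1) * d[1] * d[2]

lin_index(1, 1, 2, d)    # 7, matching a3[1,1,2] <----- a[7]
lin_index(2, 3, 4, d)    # 24, matching a3[2,3,4] <----- a[24]

# arrayInd() converts a position in a back into the array subscripts.
arrayInd(7, d)

# Leaving a subscript empty extracts a whole slice.
a3[, , 2]                # the second 2 x 3 tier
a3[1, , ]                # first row of every tier, as a 3 x 4 matrix
```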

The following is a well-known multivariate data set, known as the "iris3" data, that is available among R's built-in data sets:

> data(iris3)
> iris3
