Sample data set

Introduction to R and Statistical Analysis
Data Manipulation
Violeta I. Bartolome
Senior Associate Scientist - Biometrics
Crop Research Informatics Laboratory
International Rice Research Institute

mydata[3,4]


Selecting Variables
• Select variable Y1
  o mydata["Y1"]
  o mydata[,3]
  o mydata[3]
  o mydata[c(FALSE, FALSE, TRUE, FALSE, FALSE, FALSE)]
  o mydata[as.logical(c(0,0,1,0,0,0))]
  o mydata[names(mydata)=="Y1"]
  o mydata$Y1
To create a data frame containing Y1
  myA <- mydata["Y1"]
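The forms above do not all return the same kind of object. A minimal sketch, assuming a made-up stand-in for the sample data set (the values are invented; only the column layout Site, Trt, Y1-Y4 follows the slides):

```r
# Hypothetical stand-in for the sample data set (values are made up).
mydata <- data.frame(
  Site = c("A", "A", "A", "B", "B", "B"),
  Trt  = c(1, 2, 3, 1, 2, 3),
  Y1   = c(5.1, 4.8, 6.0, 5.5, 5.9, 6.2),
  Y2   = c(3.2, 3.0, 3.9, 3.4, 3.6, 4.1),
  Y3   = c(7.7, 7.1, 8.3, 7.9, 8.0, 8.8),
  Y4   = c(2.1, 1.9, 2.6, 2.2, 2.4, 2.9)
)

as_frame  <- mydata["Y1"]    # list-style index: one-column data frame
as_vector <- mydata$Y1       # $ extracts the column as a plain vector
also_vec  <- mydata[, 3]     # matrix-style index drops to a vector by default
kept_df   <- mydata[, 3, drop = FALSE]  # drop = FALSE keeps the data frame

class(as_frame)   # "data.frame"
class(as_vector)  # "numeric"
```

Use mydata["Y1"] or drop = FALSE when the next step expects a data frame, and mydata$Y1 when it expects a vector.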

Selecting Variables
• Select variables Y1, Y2, Y3, Y4
  o mydata[c(3,4,5,6)]
  o mydata[3:6]
  o mydata[-c(1,2)]
  o mydata[-I(1:2)]   # I() keeps 1:2 "as is"; equivalent to mydata[-(1:2)]
  o mydata[c("Y1", "Y2", "Y3", "Y4")]
  o mydata[c(FALSE, FALSE, TRUE, TRUE, TRUE, TRUE)]
  o mydata[as.logical(c(0,0,1,1,1,1))]
To create a data frame containing Y1, Y2, Y3, Y4
  myB <- mydata[c(3,4,5,6)]

Selecting Variables
• Select variables Y1, Y2, Y3, Y4
  o myB <- data.frame(mydata$Y1, mydata$Y2, mydata$Y3, mydata$Y4)
    this is equivalent to
    attach(mydata)
    myB <- data.frame(Y1,Y2,Y3,Y4)
    detach(mydata)
  o myB <- subset(mydata, select=Y1:Y4)

Selecting Observations
• Select observation numbers 3 to 8
  o mydata[3:8, ]
  o mydata[-c(1,2), ]   # same result only if observation 8 is the last row
• Select observations of Site B
  o mydata[mydata$Site=="B", ]
  o subset(mydata, subset=Site=="B")
  o mydata[which(mydata$Site=="B"), ]
To create a data frame
  myC <- mydata[mydata$Site=="B", ]
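The three ways of selecting Site B agree when Site has no missing values, but they differ when it does. A small sketch on made-up data (the NA in Site is an assumption added for illustration):

```r
# Hypothetical data with one missing Site value (values are made up).
mydata <- data.frame(
  Site = c("A", "A", NA, "B", "B", "B"),
  Trt  = c(1, 2, 3, 1, 2, 3),
  Y1   = c(5.1, 4.8, 6.0, 5.5, 5.9, 6.2)
)

# A logical index containing NA returns an extra row filled with NA.
by_logical <- mydata[mydata$Site == "B", ]

# which() keeps only the positions that are TRUE, so the NA is dropped.
by_which <- mydata[which(mydata$Site == "B"), ]

# subset() also treats NA in the condition as FALSE.
by_subset <- subset(mydata, subset = Site == "B")

nrow(by_logical)  # 4
nrow(by_which)    # 3
nrow(by_subset)   # 3
```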

Selecting Observations
• Select observations of Sites A and B, and Trt 1 and 2
  o attach(mydata)
    mydata[(Site=="A" | Site=="B") & (Trt==1 | Trt==2), ]
    detach(mydata)
  o subset(mydata, subset=((Site=="A" | Site=="B") & (Trt==1 | Trt==2)))
  o mydata[which((mydata$Site=="A" | mydata$Site=="B") & (mydata$Trt==1 | mydata$Trt==2)), ]
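When a condition lists several allowed values, %in% is a shorter way to write the chain of == and | comparisons. A sketch on made-up data (the third site "C" is an assumption, added so that the filter actually removes rows):

```r
# Hypothetical data with three sites and three treatments (values are made up).
mydata <- data.frame(
  Site = rep(c("A", "B", "C"), each = 3),
  Trt  = rep(1:3, times = 3),
  Y1   = c(5.1, 4.8, 6.0, 5.5, 5.9, 6.2, 4.4, 4.9, 5.3)
)

# Long form, as on the slide.
long_form <- mydata[(mydata$Site == "A" | mydata$Site == "B") &
                    (mydata$Trt == 1 | mydata$Trt == 2), ]

# Same selection with %in%.
short_form <- mydata[mydata$Site %in% c("A", "B") &
                     mydata$Trt %in% c(1, 2), ]

identical(long_form, short_form)  # TRUE
```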

Selecting Both Variables and Observations
• Data frame containing Site B and Y1-Y4
  o myD <- mydata[4:6, 3:6]   # if Site B occupies rows 4 to 6
  o myD <- mydata[mydata$Site=="B", c("Y1","Y2","Y3","Y4")]
  o myD <- subset(mydata, subset=Site=="B", select=Y1:Y4)


Hands-on


Transforming/Creating New Variables
• Using Numerical Expressions
  o mydata$Y5 <- mydata$Y3
  o mydata$Y6 <- 0
• Using Mathematical Operations (+, -, *, /, **)
  o mydata$sum <- mydata$Y1+mydata$Y2+mydata$Y3+mydata$Y4
  o attach(mydata)
    mydata$sum <- Y1+Y2+Y3+Y4
    detach(mydata)
  o mydata <- transform(mydata, sum=Y1+Y2+Y3+Y4)
  o With more than one transformation
    mydata <- transform(mydata,
                        sum=Y1+Y2+Y3+Y4,
                        mean=(Y1+Y2+Y3+Y4)/4)
    Note: transform() evaluates every expression against the columns that already exist, so mean=sum/4 works only if mydata already has a sum column.
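One thing to watch with transform(): all expressions are evaluated against the columns the data frame already has, so an expression cannot use a column created in the same call. A sketch on made-up data, using the hypothetical column names total and avg to avoid clashing with the base functions sum() and mean():

```r
# Hypothetical data (values are made up).
mydata <- data.frame(
  Y1 = c(1, 2, 3),
  Y2 = c(4, 5, 6),
  Y3 = c(7, 8, 9),
  Y4 = c(10, 11, 12)
)

# Option 1: two transform() calls, so that total exists before avg uses it.
step1 <- transform(mydata, total = Y1 + Y2 + Y3 + Y4)
step2 <- transform(step1, avg = total / 4)

# Option 2: within() runs the statements in order, so later lines
# can use columns created by earlier ones.
one_call <- within(mydata, {
  total <- Y1 + Y2 + Y3 + Y4
  avg   <- total / 4
})

step2$avg  # 5.5 6.5 7.5
```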

Transforming/Creating New Variables
• Using functions
  o mydata$sqrtY3 <- sqrt(mydata$Y3)
  o mydata$Y4 <- log10(mydata$Y4)

Missing data: using the na.rm option
• Consider the statement
  o mydata$sumy <- mydata$Y1+mydata$Y2+mydata$Y3
  Note: if any of the Y's is missing, the sum will be missing
• To get the sum of non-missing observations
  o myYs <- subset(mydata, select=c(Y1,Y2,Y3))
  o mydata$sum <- rowSums(myYs, na.rm=TRUE)
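The difference between the two approaches is easy to see on a small example. A sketch with made-up values; the extra column n_used (a count of non-missing values per row) is an addition for illustration, not part of the slides:

```r
# Hypothetical data with missing values (values are made up).
mydata <- data.frame(
  Y1 = c(1, NA, 3),
  Y2 = c(4, 5, NA),
  Y3 = c(7, 8, 9)
)

# Plain addition: the result is NA wherever any Y is missing.
plain_sum <- mydata$Y1 + mydata$Y2 + mydata$Y3

# rowSums() with na.rm = TRUE adds up only the non-missing values.
myYs <- subset(mydata, select = c(Y1, Y2, Y3))
safe_sum <- rowSums(myYs, na.rm = TRUE)

# Number of non-missing values behind each sum.
n_used <- rowSums(!is.na(myYs))

plain_sum          # 12 NA NA
unname(safe_sum)   # 12 13 12
unname(n_used)     # 3 2 2
```

Keeping n_used alongside the sum shows which totals are based on fewer values.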

Missing data: using is.na()
• Selecting observations with at least one missing value
  o missing <- subset(mydata, subset=(is.na(Y1) | is.na(Y2) | is.na(Y3) | is.na(Y4)))
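With many variables the chain of is.na() calls gets long. A sketch of an alternative using complete.cases() from base R, on made-up data:

```r
# Hypothetical data with missing values (values are made up).
mydata <- data.frame(
  Site = c("A", "A", "B", "B"),
  Y1   = c(1, NA, 3, 4),
  Y2   = c(5, 6, 7, 8),
  Y3   = c(9, 10, NA, 12),
  Y4   = c(13, 14, 15, 16)
)

# complete.cases() is TRUE for rows with no missing value
# in the columns passed to it.
ys <- mydata[c("Y1", "Y2", "Y3", "Y4")]
missing  <- mydata[!complete.cases(ys), ]  # at least one Y missing
complete <- mydata[complete.cases(ys), ]   # no Y missing

rownames(missing)   # "2" "3"
rownames(complete)  # "1" "4"
```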


Keeping and Dropping Variables
• Create a copy of mydata
  mysubset <- mydata
• Drop Y3 and Y4 from mysubset
  mysubset$Y3 <- mysubset$Y4 <- NULL
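Dropping can also be written as keeping: select the columns to retain and assign the result to a new object. A sketch on made-up data (kept and dropped are hypothetical names):

```r
# Hypothetical data (values are made up).
mydata <- data.frame(
  Site = c("A", "B"),
  Trt  = c(1, 1),
  Y1   = c(5.1, 5.5),
  Y2   = c(3.2, 3.4),
  Y3   = c(7.7, 7.9),
  Y4   = c(2.1, 2.2)
)

# Keep by name: list the columns to retain.
kept <- mydata[c("Site", "Trt", "Y1", "Y2")]

# Drop by name: keep every column whose name is not in the drop list.
dropped <- mydata[!(names(mydata) %in% c("Y3", "Y4"))]

names(kept)     # "Site" "Trt" "Y1" "Y2"
names(dropped)  # "Site" "Trt" "Y1" "Y2"
```

Neither form modifies mydata itself.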

Renaming Variables
• Rename Y1-Y4 to X1-X4, respectively
  o library(reshape)
    mydata <- rename(mydata, c(Y1="X1"))
    mydata <- rename(mydata, c(Y2="X2"))
    mydata <- rename(mydata, c(Y3="X3"))
    mydata <- rename(mydata, c(Y4="X4"))
  o names(mydata) <- c("Site", "Trt", "X1", "X2", "X3", "X4")
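Renaming by name can also be done in base R, without an add-on package and without retyping the names of the columns that stay the same. A sketch on made-up data (old and new are hypothetical helper vectors):

```r
# Hypothetical data (values are made up).
mydata <- data.frame(
  Site = c("A", "B"),
  Trt  = c(1, 1),
  Y1   = c(5.1, 5.5),
  Y2   = c(3.2, 3.4),
  Y3   = c(7.7, 7.9),
  Y4   = c(2.1, 2.2)
)

old <- c("Y1", "Y2", "Y3", "Y4")
new <- c("X1", "X2", "X3", "X4")

# match() finds the position of each old name; only those positions change.
names(mydata)[match(old, names(mydata))] <- new

names(mydata)  # "Site" "Trt" "X1" "X2" "X3" "X4"
```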


Hands-on


Stacking/Concatenating Data Frames
• Data frame containing Site A only
  attach(mydata)
  A <- mydata[Site=="A", ]
• Data frame containing Site B only
  B <- mydata[Site=="B", ]
• Combine the two data frames
  both <- rbind(A,B)
  detach(mydata)

Merging Data Frames
• Data frame containing Y1 and Y2
  attach(mydata)
  left <- mydata[c("Site","Trt","Y1","Y2")]
• Data frame containing Y3 and Y4
  right <- mydata[c("Site","Trt","Y3","Y4")]
• Merge the two data frames
  both <- merge(left, right, by=c("Site","Trt"))
  detach(mydata)
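In the example above both data frames contain every Site and Trt combination. When they do not, merge() by default keeps only the combinations found in both; all=TRUE keeps everything and fills the gaps with NA. A sketch on made-up data (inner and outer are hypothetical names):

```r
# Hypothetical data frames that only partly overlap (values are made up).
left <- data.frame(
  Site = c("A", "A", "B"),
  Trt  = c(1, 2, 1),
  Y1   = c(5.1, 4.8, 5.5)
)
right <- data.frame(
  Site = c("A", "B", "B"),
  Trt  = c(1, 1, 2),
  Y3   = c(7.7, 7.9, 8.0)
)

# Default: only Site/Trt combinations present in both data frames.
inner <- merge(left, right, by = c("Site", "Trt"))

# all = TRUE: every combination from either data frame, NA where no match.
outer <- merge(left, right, by = c("Site", "Trt"), all = TRUE)

nrow(inner)  # 2
nrow(outer)  # 4
```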

Hands-on


Sorting Data Frames
• Sort by Trt and Site
  mydataSorted <- mydata[order(mydata$Trt, mydata$Site), ]
Note: Default is ascending order. Prefix a numeric variable with a minus sign to get descending order
  mydataSorted <- mydata[order(-mydata$Trt, mydata$Site), ]
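The minus sign works only for numeric variables. For a character variable such as Site, one option is to wrap it in xtfrm(), which returns numbers that sort in the same order. A sketch on made-up data (site_desc is a hypothetical name):

```r
# Hypothetical data (values are made up).
mydata <- data.frame(
  Site = c("B", "A", "B", "A"),
  Trt  = c(1, 2, 2, 1),
  Y1   = c(5.5, 4.8, 5.9, 5.1)
)

# Trt descending, Site ascending (as on the slide).
mydataSorted <- mydata[order(-mydata$Trt, mydata$Site), ]

# Site descending, Trt ascending: -mydata$Site would fail for character
# data, so convert Site to sortable numbers with xtfrm() first.
site_desc <- mydata[order(-xtfrm(mydata$Site), mydata$Trt), ]

rownames(mydataSorted)  # "2" "3" "4" "1"
rownames(site_desc)     # "1" "3" "4" "2"
```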
Hands-on

Parallel to Serial

data.serial <- reshape(mydata,                 # object to be reshaped
                       varying=list(3:6),      # if >1 variable -- list(3:4,5:6)
                       v.names="Y",            # if >1 variable -- v.names=c("Y","X")
                       idvar=c("Site","Trt"),  # to be used as row names
                       timevar="Rep",          # new variable to be created
                       times=c(1:4),           # values of new variable
                       direction="long")
data.serial
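To see what the call produces, here is a sketch on a made-up four-row data set with the same column layout as the slides (Site, Trt, Y1-Y4 in columns 3 to 6):

```r
# Hypothetical parallel (wide) data: one row per Site/Trt (values are made up).
mydata <- data.frame(
  Site = c("A", "A", "B", "B"),
  Trt  = c(1, 2, 1, 2),
  Y1   = c(5.1, 4.8, 5.5, 5.9),
  Y2   = c(3.2, 3.0, 3.4, 3.6),
  Y3   = c(7.7, 7.1, 7.9, 8.0),
  Y4   = c(2.1, 1.9, 2.2, 2.4)
)

# Stack columns 3 to 6 into a single column Y, with Rep recording
# which of the four columns each value came from.
data.serial <- reshape(mydata,
                       varying   = list(3:6),
                       v.names   = "Y",
                       idvar     = c("Site", "Trt"),
                       timevar   = "Rep",
                       times     = 1:4,
                       direction = "long")

dim(data.serial)    # 16 rows, 4 columns
names(data.serial)  # "Site" "Trt" "Rep" "Y"
```

Each of the 4 original rows contributes 4 rows to the serial data, one per Rep.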


• The idvar values are used as row names. To change the row names
  row.names(data.serial) <- 1:NROW(data.serial)
  data.serial
Hands-on

Serial to Parallel


data.parallel <- reshape(serialdata,               # object to be reshaped
                         v.names=c("yld","dm"),    # variables to be converted
                         idvar=c("plot","date"),   # variables to be retained
                         timevar="rep",            # values of which will be affixed to column names
                         drop=c("var1","var2"),    # variables to be removed from the reshaped data
                         direction="wide")
data.parallel

Remove "." from column names
  colnames(data.parallel) <- gsub("[.]", "", colnames(data.parallel))
  data.parallel
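The slides do not show serialdata itself, so the sketch below uses a made-up stand-in with the column names the call refers to (plot, date, rep, yld, dm, var1, var2):

```r
# Hypothetical serial (long) data: one row per plot and rep (values are made up).
serialdata <- data.frame(
  plot = rep(1:2, each = 2),
  date = "d1",
  rep  = rep(1:2, times = 2),
  yld  = c(4.1, 4.3, 3.8, 3.9),
  dm   = c(9.0, 9.4, 8.1, 8.5),
  var1 = 0,
  var2 = 0
)

# Spread yld and dm into one column per rep; var1 and var2 are dropped.
data.parallel <- reshape(serialdata,
                         v.names   = c("yld", "dm"),
                         idvar     = c("plot", "date"),
                         timevar   = "rep",
                         drop      = c("var1", "var2"),
                         direction = "wide")

names(data.parallel)  # plot, date, then yld.1, dm.1, yld.2, dm.2

# Remove the "." that reshape() puts between variable name and rep value.
colnames(data.parallel) <- gsub("[.]", "", colnames(data.parallel))
```

Note that gsub() removes every "." in every column name, so this step is safe only when the original names contain no dots of their own.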


Change row names
  row.names(data.parallel) <- 1:NROW(data.parallel)
  data.parallel
Hands-on

Aggregating data
• With only one response variable
  meanY <- aggregate(data.serial$Y,
                     by=list(data.serial$Site, data.serial$Trt),
                     FUN=mean,
                     na.rm=TRUE)   # gets statistics from nonmissing values
  meanY
[Output shown for na.rm=TRUE and na.rm=FALSE]

Aggregating data
• With more than one response variable
  Ys <- subset(mydata, select=Y1:Y4)    # data frame of numerical variables
  meanYs <- aggregate(Ys,
                      by=list(mydata$Site),   # grouping (subsetting) variables
                      FUN=mean,               # function to be performed
                      na.rm=TRUE)
  meanYs
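With an unnamed by list, aggregate() labels the grouping column Group.1. Naming the list element gives a readable column name instead. A sketch on made-up data with two response variables:

```r
# Hypothetical data with one missing value (values are made up).
mydata <- data.frame(
  Site = c("A", "A", "B", "B"),
  Trt  = c(1, 2, 1, 2),
  Y1   = c(1, 3, 5, NA),
  Y2   = c(2, 4, 6, 8)
)

Ys <- mydata[c("Y1", "Y2")]

# Naming the list element (Site = ...) sets the name of the grouping column.
meanYs <- aggregate(Ys,
                    by    = list(Site = mydata$Site),
                    FUN   = mean,
                    na.rm = TRUE)

names(meanYs)  # "Site" "Y1" "Y2"
meanYs$Y1      # 2 5  (the NA at Site B is ignored)
meanYs$Y2      # 3 7
```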


Hands-on


Please do the exercise. Thank You.

