Uploaded by Vivay Salazar

Data Management and Statistical Analysis - Data Manipulation

Attribution Non-Commercial (BY-NC)

Introduction to R: Data Manipulation and Statistical Analysis

Data Manipulation

Violeta I. Bartolome

Senior Associate Scientist-Biometrics

Crop Research Informatics Laboratory

International Rice Research Institute

Selecting Variables

• Select variable Y1
o mydata["Y1"]
o mydata[,3]
o mydata[3]
o mydata[c(FALSE, FALSE, TRUE, FALSE, FALSE, FALSE)]
o mydata[as.logical(c(0,0,1,0,0,0))]
o mydata[names(mydata)=="Y1"]
o mydata$Y1

To create a data frame containing Y1:
myA <- mydata["Y1"]

• Select variables Y1, Y2, Y3, Y4
o mydata[c(3,4,5,6)]
o mydata[3:6]
o mydata[-c(1,2)]
o mydata[c("Y1", "Y2", "Y3", "Y4")]
o mydata[c(FALSE, FALSE, TRUE, TRUE, TRUE, TRUE)]
o mydata[as.logical(c(0,0,1,1,1,1))]

To create a data frame containing Y1, Y2, Y3, Y4:
myB <- mydata[c(3,4,5,6)]
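The selection forms listed for variables can be checked against each other. The sketch below uses a small made-up data frame as a stand-in for the slides' sample dataset (the values, and the helper names such as by_name and same_block, are assumptions for illustration only). It also shows one difference worth knowing: the single-bracket forms return a data frame, while $ and [ ,3] return a plain vector.

```r
# Hypothetical stand-in for the sample dataset; all values are made up.
mydata <- data.frame(
  Site = rep(c("A", "B", "C"), each = 2),
  Trt  = rep(1:2, times = 3),
  Y1   = c(5.1, 4.8, 6.0, 5.7, 4.9, 5.3),
  Y2   = c(3.2, 3.0, 3.9, 3.6, 3.1, 3.4),
  Y3   = c(7.4, 7.1, 8.2, 8.0, 7.0, 7.6),
  Y4   = c(1.2, 1.1, 1.6, 1.5, 1.0, 1.3),
  stringsAsFactors = FALSE
)

by_name  <- mydata["Y1"]                    # one-column data frame, chosen by name
by_index <- mydata[3]                       # same column, chosen by position
by_flag  <- mydata[names(mydata) == "Y1"]   # same column, chosen by a logical vector

# The single-bracket forms all give the same data frame.
same_frames <- identical(by_name, by_index) && identical(by_name, by_flag)

# $ and [ , 3] give a plain numeric vector instead.
as_vector    <- mydata$Y1
same_vectors <- identical(as_vector, mydata[, 3])

# Several columns: positions, a range, negative positions and names agree.
myB <- mydata[c(3, 4, 5, 6)]
same_block <- identical(myB, mydata[3:6]) &&
  identical(myB, mydata[-c(1, 2)]) &&
  identical(myB, mydata[c("Y1", "Y2", "Y3", "Y4")])

print(c(same_frames, same_vectors, same_block))
print(is.data.frame(by_name))
print(is.data.frame(as_vector))
```

Selecting by name is usually the safer choice, since positions change whenever a column is added or dropped.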

• Select variables Y1, Y2, Y3, Y4 (continued)
o myB<-data.frame(mydata$Y1, mydata$Y2, mydata$Y3, mydata$Y4)
  this is equivalent to
  attach(mydata)
  myB<-data.frame(Y1,Y2,Y3,Y4)
  detach(mydata)
  (apart from the column names, which become mydata.Y1, ..., mydata.Y4 in the first form)
o myB<-subset(mydata, select=Y1:Y4)

Selecting Observations

• Select observation numbers 3 to 8
o mydata[3:8, ]
o mydata[-c(1,2), ]
• Select observations of Site B
o mydata[mydata$Site=="B", ]
o subset(mydata,subset=Site=="B")
o mydata[which(mydata$Site=="B"),]

To create a data frame containing the observations of Site B:
myC<- mydata[mydata$Site=="B", ]
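The three ways of selecting the Site B observations agree when Site has no missing values, but they differ when it does. The sketch below assumes a tiny made-up data frame in which one Site value is NA (the data and the names by_logical, by_which and by_subset are illustrative assumptions, not part of the slides).

```r
# Hypothetical data with one missing Site value; values are made up.
mydata <- data.frame(
  Site = c("A", "A", "B", "B", NA, "C"),
  Trt  = c(1, 2, 1, 2, 1, 1),
  Y1   = c(5.1, 4.8, 6.0, 5.7, 4.9, 5.3),
  stringsAsFactors = FALSE
)

# A logical index keeps the NA comparison as an extra row filled with NA.
by_logical <- mydata[mydata$Site == "B", ]

# which() and subset() both drop the NA comparison.
by_which  <- mydata[which(mydata$Site == "B"), ]
by_subset <- subset(mydata, subset = Site == "B")

print(nrow(by_logical))
print(nrow(by_which))
print(nrow(by_subset))
```

With complete data the choice is a matter of taste; with missing values, which() or subset() avoids the spurious all-NA row.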

• Select observations of Sites A and B, and Trt 1 and 2
o attach(mydata)
  mydata[(Site=="A" | Site=="B") & (Trt==1 | Trt==2), ]
  detach(mydata)
o subset(mydata,subset=((Site=="A" | Site=="B") & (Trt==1 | Trt==2)))
o mydata[which((mydata$Site=="A" | mydata$Site=="B") & (mydata$Trt==1 | mydata$Trt==2)),]

• Data frame containing Site B and Y1-Y4
o myD<-mydata[4:6, 3:6]
o myD<-mydata[mydata$Site=="B", c("Y1","Y2","Y3","Y4")]
o myD<-subset(mydata,subset=Site=="B",select=Y1:Y4)
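Compound conditions such as the one above can be written more compactly with %in%, which is a base R operator. The sketch below is an illustration on made-up data (the layout with three rows per site is an assumption chosen so that rows 4 to 6 are Site B); keep_long and keep_short are hypothetical names.

```r
# Hypothetical stand-in for the sample dataset; values are made up.
mydata <- data.frame(
  Site = rep(c("A", "B", "C"), each = 3),
  Trt  = rep(1:3, times = 3),
  Y1   = 11:19,
  Y2   = 21:29,
  Y3   = 31:39,
  Y4   = 41:49,
  stringsAsFactors = FALSE
)

# The condition as written with | and &.
keep_long <- (mydata$Site == "A" | mydata$Site == "B") &
  (mydata$Trt == 1 | mydata$Trt == 2)

# The same condition with %in%.
keep_short <- mydata$Site %in% c("A", "B") & mydata$Trt %in% c(1, 2)

same_rows <- identical(keep_long, keep_short)
selected  <- mydata[keep_short, ]

# Rows and columns selected in one step, three equivalent ways.
myD      <- mydata[mydata$Site == "B", c("Y1", "Y2", "Y3", "Y4")]
same_pos <- isTRUE(all.equal(myD, mydata[4:6, 3:6]))
same_sub <- isTRUE(all.equal(myD, subset(mydata, subset = Site == "B", select = Y1:Y4)))

print(same_rows)
print(selected)
print(c(same_pos, same_sub))
```

Selecting by position (4:6, 3:6) only matches the other two forms while the row and column order stays as assumed here.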

Transforming/Creating New Variables

• Using Numerical Expressions
o mydata$Y5 <- mydata$Y3
o mydata$Y6 <- 0

• Using Mathematical Operations (+, -, *, /, **)
o mydata$sum <- mydata$Y1+mydata$Y2+mydata$Y3+mydata$Y4
o attach(mydata)
  mydata$sum<-Y1+Y2+Y3+Y4
  detach(mydata)
o mydata<-transform(mydata, sum=Y1+Y2+Y3+Y4)
o With more than one transformation
  mydata<-transform(mydata,
  sum=Y1+Y2+Y3+Y4,
  mean=(Y1+Y2+Y3+Y4)/4)
  Note: transform() evaluates every expression against the columns that already exist in mydata, so mean=sum/4 works only if mydata already has a column named sum.
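transform() evaluates its expressions against the columns already present in the data frame, so a column created in a call is not visible to later expressions in that same call. The sketch below illustrates this on made-up data; the column names total and avg are hypothetical and chosen so they do not clash with the base functions sum() and mean().

```r
# Hypothetical stand-in for the sample dataset; values are made up.
mydata <- data.frame(
  Site = c("A", "A", "B", "B"),
  Trt  = c(1, 2, 1, 2),
  Y1   = c(1, 2, 3, 4),
  Y2   = c(2, 4, 6, 8),
  Y3   = c(3, 6, 9, 12),
  Y4   = c(4, 8, 12, 16),
  stringsAsFactors = FALSE
)

# Two calls: the second call can see the column made by the first.
step1 <- transform(mydata, total = Y1 + Y2 + Y3 + Y4)
step2 <- transform(step1, avg = total / 4)

# One call: each expression is written in terms of existing columns only.
one_call <- transform(mydata,
                      total = Y1 + Y2 + Y3 + Y4,
                      avg   = (Y1 + Y2 + Y3 + Y4) / 4)

same_result <- isTRUE(all.equal(step2, one_call))

# Referring to 'total' inside the call that creates it fails,
# because no such column (or object) exists yet.
attempt <- try(transform(mydata, total = Y1 + Y2 + Y3 + Y4, avg = total / 4),
               silent = TRUE)
failed <- inherits(attempt, "try-error")

print(same_result)
print(failed)
```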

• Using functions
o mydata$sqrtY3 <- sqrt(mydata$Y3)
o mydata$Y4 <- log10(mydata$Y4)

• Consider the statement
o mydata$sumy<-mydata$Y1+mydata$Y2+mydata$Y3
Note: if any of the Y's is missing, sumy will be missing
• To get the sum of non-missing observations
o myYs<-subset(mydata,select=c(Y1,Y2,Y3))
o mydata$sum<-rowSums(myYs,na.rm=TRUE)
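The difference between adding columns with + and using rowSums() with na.rm=TRUE shows up only when values are missing. The sketch below uses a three-row made-up data frame (an assumption for illustration; plain, robust and n_used are hypothetical names). It also shows a side effect to keep in mind: a row in which every value is missing gets a sum of 0, so counting the non-missing values alongside the sum can help.

```r
# Hypothetical data with missing values; values are made up.
mydata <- data.frame(
  Y1 = c(1, 2, NA),
  Y2 = c(10, NA, NA),
  Y3 = c(100, 200, NA)
)

# Adding with + gives NA as soon as one value in the row is missing.
plain <- mydata$Y1 + mydata$Y2 + mydata$Y3

# rowSums() with na.rm = TRUE adds only the non-missing values.
myYs   <- subset(mydata, select = c(Y1, Y2, Y3))
robust <- rowSums(myYs, na.rm = TRUE)

# Number of non-missing values that went into each sum.
n_used <- rowSums(!is.na(myYs))

print(plain)
print(robust)
print(n_used)
```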

Missing data: using is.na()

• Selecting observations with at least one missing value
o missing <- subset(mydata,subset=(is.na(Y1) | is.na(Y2) | is.na(Y3) | is.na(Y4)))
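is.na() already returns TRUE or FALSE, so the comparison with T is not needed. As a possible alternative when many columns are involved, base R's complete.cases() flags the rows that have no missing value. The sketch below compares the two on made-up data; the names missing2 and complete are hypothetical.

```r
# Hypothetical data with two incomplete rows; values are made up.
mydata <- data.frame(
  Site = c("A", "A", "B", "B"),
  Y1   = c(1, NA, 3, 4),
  Y2   = c(1, 2, 3, 4),
  Y3   = c(1, 2, NA, 4),
  Y4   = c(1, 2, 3, 4),
  stringsAsFactors = FALSE
)

# Rows with at least one missing Y, using is.na() directly.
missing <- subset(mydata, subset = is.na(Y1) | is.na(Y2) | is.na(Y3) | is.na(Y4))

# The same rows through complete.cases() on the Y columns.
Ys       <- mydata[c("Y1", "Y2", "Y3", "Y4")]
missing2 <- mydata[!complete.cases(Ys), ]

# The complementary set: rows with no missing Y.
complete <- mydata[complete.cases(Ys), ]

print(missing)
print(complete)
```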

• Create a copy of mydata
mysubset <- mydata
• Drop Y3 and Y4 from mysubset
mysubset$Y3 <- mysubset$Y4 <- NULL

• Rename Y1-Y4 to X1-X4, respectively
o library(reshape)
  mydata <- rename(mydata, c(Y1="X1"))
  mydata <- rename(mydata, c(Y2="X2"))
  mydata <- rename(mydata, c(Y3="X3"))
  mydata <- rename(mydata, c(Y4="X4"))
o names(mydata) <- c("Site", "Trt", "X1", "X2", "X3", "X4")
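Dropping and renaming can also be done with base R alone, without loading an extra package. The sketch below is one way to do it on made-up data; the object names one, renamed and dropped are hypothetical, and the sub() pattern assumes that Y1-Y4 are the only columns whose names start with "Y".

```r
# Hypothetical stand-in for the sample dataset; values are made up.
mydata <- data.frame(
  Site = c("A", "B"),
  Trt  = c(1, 2),
  Y1   = c(1, 2),
  Y2   = c(3, 4),
  Y3   = c(5, 6),
  Y4   = c(7, 8),
  stringsAsFactors = FALSE
)

# Drop Y3 and Y4 from a copy, as on the slide.
mysubset <- mydata
mysubset$Y3 <- mysubset$Y4 <- NULL

# Equivalent: keep every column whose name is not in the drop list.
dropped <- mydata[!(names(mydata) %in% c("Y3", "Y4"))]

# Rename one column by name, independent of its position.
one <- mydata
names(one)[names(one) == "Y1"] <- "X1"

# Rename all Y columns at once by replacing a leading "Y" with "X".
renamed <- mydata
names(renamed) <- sub("^Y", "X", names(renamed))

print(names(mysubset))
print(names(one))
print(names(renamed))
```

Renaming by name avoids the risk in names(mydata) <- c(...) of mislabelling columns if their order is not what was expected.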

Stacking/Concatenating Data Frames

attach(mydata)
• Data frame containing Site A only
A <- mydata[Site=="A", ]
• Data frame containing Site B only
B <- mydata[Site=="B", ]
• Combine the two data frames
both <- rbind(A,B)
detach(mydata)

Merging Data Frames

attach(mydata)
• Data frame containing Y1 and Y2
left <- mydata[c("Site","Trt","Y1","Y2")]
• Data frame containing Y3 and Y4
right <- mydata[c("Site","Trt","Y3","Y4")]
• Merge the two data frames
both <- merge(left, right, by=c("Site","Trt"))
detach(mydata)
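rbind() stacks data frames that share the same columns, while merge() joins them side by side on key variables. The sketch below runs both on made-up data in which each Site and Trt combination occurs once (an assumption; the slides do not show the sample dataset). The names both_rows, both_cols and by_site_only are hypothetical. The last step shows what happens when the key does not identify rows uniquely: merge() returns every pairing of matching rows.

```r
# Hypothetical data: each Site x Trt combination appears exactly once.
mydata <- data.frame(
  Site = rep(c("A", "B"), each = 2),
  Trt  = rep(1:2, times = 2),
  Y1   = c(1, 2, 3, 4),
  Y2   = c(5, 6, 7, 8),
  Y3   = c(9, 10, 11, 12),
  Y4   = c(13, 14, 15, 16),
  stringsAsFactors = FALSE
)

# Stacking: split by site, then put the pieces back together.
A <- mydata[mydata$Site == "A", ]
B <- mydata[mydata$Site == "B", ]
both_rows <- rbind(A, B)

# Merging: split by variables, then join on the keys Site and Trt.
left  <- mydata[c("Site", "Trt", "Y1", "Y2")]
right <- mydata[c("Site", "Trt", "Y3", "Y4")]
both_cols <- merge(left, right, by = c("Site", "Trt"))

# With a key that is not unique, every matching pair is returned:
# 2 rows x 2 rows per site, for 2 sites.
by_site_only <- merge(left, right, by = "Site")

print(both_rows)
print(both_cols)
print(nrow(by_site_only))
```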

Sorting Data Frames

• Sort by Trt and Site
mydataSorted <- mydata[order(mydata$Trt, mydata$Site), ]
• The default is ascending order. Prefix a variable with a minus sign to get descending order
mydataSorted <- mydata[order(-mydata$Trt, mydata$Site), ]

Parallel to Serial

data.serial <- reshape(mydata,
varying=list(3:6), # if >1 variable -- list(3:4,5:6)
v.names="Y", # if >1 variable -- v.names=c("Y","X")
idvar=c("Site","Trt"), # will be used as row names
timevar="Rep", # new variable to be created
times=c(1:4), # values of new variable
direction="long")
data.serial

In the output, the idvar values (combined with the value of timevar) are used as row names.
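The parallel-to-serial conversion can be tried on a small example. The sketch below applies the same reshape() arguments to a made-up data frame with four rows (the values are assumptions chosen so that each Y column is easy to recognise after reshaping).

```r
# Hypothetical stand-in for the sample dataset; values are made up.
mydata <- data.frame(
  Site = rep(c("A", "B"), each = 2),
  Trt  = rep(1:2, times = 2),
  Y1   = c(11, 12, 13, 14),
  Y2   = c(21, 22, 23, 24),
  Y3   = c(31, 32, 33, 34),
  Y4   = c(41, 42, 43, 44),
  stringsAsFactors = FALSE
)

# Columns 3 to 6 (Y1-Y4) are stacked into one column Y;
# Rep records which of the four columns each value came from.
data.serial <- reshape(mydata,
                       varying   = list(3:6),
                       v.names   = "Y",
                       idvar     = c("Site", "Trt"),
                       timevar   = "Rep",
                       times     = 1:4,
                       direction = "long")

print(dim(data.serial))
print(head(data.serial))
```

Each of the 4 original rows contributes 4 rows to the result, one per value of Rep, so the serial data frame has 16 rows.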

Serial to Parallel

data.parallel <- reshape(…, # data frame in serial format
v.names=c("yld","dm"), # variables to be converted
idvar=c("plot","date"), # variables to be retained
timevar="rep", # values of which will be affixed to column names
drop=c("var1","var2"), # variables to be removed from the reshaped data
direction="wide")
data.parallel

row.names(data.parallel) <- 1:NROW(data.parallel)
data.parallel

Remove "." from column names
colnames(data.parallel) <- gsub("[.]", "", colnames(data.parallel))
data.parallel
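The serial-to-parallel steps can be followed end to end on a small example. The sketch below builds a made-up serial data frame directly (Site, Trt, Rep and Y are assumed names and values for illustration, not the yld/dm dataset of the slide), converts it to wide form, then tidies the row and column names as described.

```r
# Hypothetical serial (long) data: 4 Site x Trt combinations, 2 reps each.
data.serial <- data.frame(
  Site = rep(c("A", "A", "B", "B"), times = 2),
  Trt  = rep(c(1, 2, 1, 2), times = 2),
  Rep  = rep(1:2, each = 4),
  Y    = c(11, 12, 13, 14, 21, 22, 23, 24),
  stringsAsFactors = FALSE
)

# One row per Site x Trt; the values of Rep are appended to the name Y.
data.parallel <- reshape(data.serial,
                         v.names   = "Y",
                         idvar     = c("Site", "Trt"),
                         timevar   = "Rep",
                         direction = "wide")

# Renumber the rows and remove "." from the new column names (Y.1 becomes Y1).
row.names(data.parallel) <- 1:NROW(data.parallel)
colnames(data.parallel)  <- gsub("[.]", "", colnames(data.parallel))

print(data.parallel)
```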

• With only one response variable
meanY <- aggregate(data.serial$Y,
by = list(data.serial$Site,data.serial$Trt),
FUN=mean,
na.rm=TRUE) # gets statistics from nonmissing values
meanY

• With more than one response variable
Ys <- subset(mydata,select=Y1:Y4) # data frame of numerical variables
meanYs <- aggregate(Ys,
by=list(mydata$Site), # grouping variables
FUN=mean, # function to be performed
na.rm=TRUE)
meanYs

[Output of meanYs with na.rm=TRUE and with na.rm=FALSE]
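The effect of na.rm on the group means can be seen on a small example. The sketch below uses made-up data with one missing value (an assumption for illustration; meanYs_na is a hypothetical name). Naming the element of the by list, here Site, labels the grouping column in the result instead of the default Group.1.

```r
# Hypothetical data with one missing Y1 value in Site B; values are made up.
mydata <- data.frame(
  Site = c("A", "A", "B", "B"),
  Trt  = c(1, 2, 1, 2),
  Y1   = c(1, 3, 5, NA),
  Y2   = c(2, 4, 6, 8),
  stringsAsFactors = FALSE
)

Ys <- subset(mydata, select = Y1:Y2)   # data frame of numerical variables

# Means per site computed from the non-missing values only.
meanYs <- aggregate(Ys, by = list(Site = mydata$Site), FUN = mean, na.rm = TRUE)

# Without na.rm = TRUE, a group containing a missing value has a missing mean.
meanYs_na <- aggregate(Ys, by = list(Site = mydata$Site), FUN = mean)

print(meanYs)
print(meanYs_na)
```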

Please do the exercise.

Thank You.
