# Introduction to R: Data Management and Statistical Analysis

Regression and Correlation Analysis
Leilani Nora
Assistant Scientist

CORRELATION ANALYSIS

DATAFRAME : corr.csv
• Consider the data for grain yield and N,P,K content of the plant taken from several samples.

DATA FRAME: GYNPK

         GY14    N    P    K
    1    1678  1.0  0.1  0.4
    2    4265  1.2  0.1  0.4
    3    2431  1.1  0.1  0.4
    4    2431  1.0  0.1  0.4
    5    4461  1.2  0.1  0.4
    . . .
    48   5483  1.7  0.2  0.3
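A minimal sketch of how a file such as corr.csv might be loaded into GYNPK with read.csv(). The temporary file and its contents (only the five rows visible on the slide) are stand-ins, since the real file's location and remaining rows are not given here.

```r
# Stand-in for corr.csv: only the five rows shown on the slide are written,
# to a temporary file (the real file's path is not given in the source).
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "GY14,N,P,K",
  "1678,1.0,0.1,0.4",
  "4265,1.2,0.1,0.4",
  "2431,1.1,0.1,0.4",
  "2431,1.0,0.1,0.4",
  "4461,1.2,0.1,0.4"
), csv_path)

# header = TRUE (the read.csv default) takes column names from the first line
GYNPK <- read.csv(csv_path, header = TRUE)

str(GYNPK)   # column names and types
head(GYNPK)  # first rows of the data frame
```

With the real file, the call would simply point at its path instead of `csv_path`.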

CORRELATION ANALYSIS : correlation()
• correlation() obtains the correlation coefficients and p-values between all pairs of variables in a data table. The results are similar to those from SAS.
• Required package: agricolae.

Usage

    > correlation(x, y=NULL, method="pearson", alternative="two.sided", ...)
    # x and y       – table, matrix or vector
    # method        – "pearson", "kendall", "spearman"
    # alternative   – "two.sided", "less", "greater"
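The same `method` and `alternative` choices exist in base R's cor.test(), which tests one pair of variables at a time. The sketch below uses made-up vectors (`yield` and `nitrogen` are illustrative names, not columns of corr.csv) to show how the two arguments change the test.

```r
# Illustrative data (made up for this sketch, not taken from corr.csv)
yield    <- c(1.8, 2.4, 2.1, 3.0, 3.4, 3.1, 4.0, 4.2)
nitrogen <- c(0.9, 1.0, 1.1, 1.2, 1.3, 1.2, 1.5, 1.6)

# Pearson correlation with a two-sided test (the defaults)
pearson <- cor.test(yield, nitrogen, method = "pearson",
                    alternative = "two.sided")

# Spearman rank correlation, testing only for a positive association;
# exact = FALSE avoids the warning about ties in the ranks
spearman <- cor.test(yield, nitrogen, method = "spearman",
                     alternative = "greater", exact = FALSE)

pearson$estimate   # coefficient r
pearson$p.value    # two-sided p-value
spearman$estimate  # coefficient rho
spearman$p.value   # one-sided p-value
```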

CORRELATION ANALYSIS : correlation()
    > library(agricolae)
    > corrGY <- correlation(GYNPK)
    > corrGY
    $correlation
          GY14     N     P     K
    GY14  1.00  0.72  0.38 -0.40
    N     0.72  1.00  0.02 -0.34
    P     0.38  0.02  1.00 -0.35
    K    -0.40 -0.34 -0.35  1.00

    $pvalue
                 GY14            N            P            K
    GY14 1.000000e+00 1.084596e-08 7.979778e-03 5.289776e-03
    N    1.084596e-08 1.000000e+00 8.686113e-01 1.798541e-02
    P    0.007979778  0.868611288  1.000000000  0.016134208
    K    0.005289776  0.017985414  0.016134208  1.000000000

    $n.obs
    [1] 48
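To make the structure of this result concrete, here is a rough base-R sketch that builds a similar list of a correlation matrix, a p-value matrix and the sample size. The helper `cor_with_p` is hypothetical (it is not part of agricolae), it assumes there are no missing values, and the built-in `mtcars` data stands in for GYNPK.

```r
# Hypothetical helper: pairwise correlations and p-values for all columns.
# Assumes at least two numeric columns and no missing values.
cor_with_p <- function(df, method = "pearson") {
  vars <- names(df)
  k <- length(vars)
  r <- diag(k)            # a variable's correlation with itself is 1
  p <- matrix(1, k, k)    # diagonal p-values left at 1, as in the printout
  dimnames(r) <- list(vars, vars)
  dimnames(p) <- list(vars, vars)
  for (i in seq_len(k - 1)) {
    for (j in (i + 1):k) {
      test <- cor.test(df[[i]], df[[j]], method = method)
      r[i, j] <- r[j, i] <- unname(test$estimate)
      p[i, j] <- p[j, i] <- test$p.value
    }
  }
  list(correlation = r, pvalue = p, n.obs = nrow(df))
}

# Stand-in data: three numeric columns of the built-in mtcars data set
res <- cor_with_p(mtcars[, c("mpg", "wt", "hp")])
round(res$correlation, 2)
res$pvalue
res$n.obs
```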

CORRELATION ANALYSIS : cor.matrix()
• The package ‘Deducer’ provides an intuitive graphical data analysis interface for use with JGR.
• JGR is a Java GUI for R: a cross-platform, universal and unified graphical user interface for R.
• This package was released on August 2, 2009 with 33 functions.
• One of the functions in Deducer is cor.matrix().

CORRELATION ANALYSIS : cor.matrix()
• cor.matrix() creates a correlation matrix and applies a test function to assess the significance of each correlation coefficient, r.

Usage

    > cor.matrix(variables, data, test=cor.test, method …)
    # variables – an expression denoting a set of variables
    # data      – a data frame
    # test      – a function to test the significance of the correlation coefficient
    # method    – "pearson", "kendall", "spearman"

CORRELATION ANALYSIS: cor.matrix()
    > library(Deducer)
    > corrGY2 <- cor.matrix(GY14:K, data=GYNPK)
    > corrGY2

    Pearson's product-moment correlation

                     GY14               N . . .  P . . .  K . . .
    GY14     cor     1
             N       48
             CI*
             stat**
             p-value
    ----------
    N        cor     0.7157
             N       48
             CI*     (0.5417,0.8309)
             stat**  6.95 (46)
             p-value 0.0000
    ----------
    P        cor     0.3785
             N       48
             CI*     (0.1058,0.5983)
             stat**  2.774 (46)
             p-value 0.0080
    ----------
    K        cor     -0.3964
             N       48
             CI*     (-0.6116,-0.1265)
             stat**  -2.928 (46)
             p-value 0.0053
    ----------
    HA: two.sided
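The stat and CI rows can be checked by hand from r and the sample size. The sketch below assumes the usual Pearson test (a t statistic on n - 2 degrees of freedom) and a 95% interval from Fisher's z transformation; it starts from the rounded r = 0.7157 for GY14 and N, so the last digits may differ slightly from the package output.

```r
r <- 0.7157   # correlation between GY14 and N, as printed (rounded)
n <- 48       # number of observations

df     <- n - 2
t_stat <- r * sqrt(df) / sqrt(1 - r^2)   # t statistic for H0: rho = 0
p_val  <- 2 * pt(-abs(t_stat), df)       # two-sided p-value

# 95% confidence interval via Fisher's z transformation
z    <- atanh(r)
half <- qnorm(0.975) / sqrt(n - 3)
ci   <- tanh(c(z - half, z + half))

round(t_stat, 2)   # compare with the stat row: 6.95 (46)
round(ci, 4)       # compare with the CI row: (0.5417,0.8309)
p_val              # compare with the p-value row
```

The same arithmetic with r = 0.3785 and r = -0.3964 gives t values of about 2.774 and -2.928, matching the P and K rows.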

CORRELATION ANALYSIS : print.cor.matrix()
• print.cor.matrix() prints an object of class "cor.matrix" in a readable layout.

Usage

    > print.cor.matrix(x, digits=4, N=TRUE, CI=TRUE, stat=TRUE, p.value=TRUE, ...)
    # x        – object of class "cor.matrix"
    # digits   – number of digits to round to
    # N        – logical, prints a row for sample size
    # CI       – logical, prints a row for CI if they exist
    # stat     – logical, prints a row for test statistics
    # p.value  – logical, prints a row for p-values

CORRELATION ANALYSIS: cor.matrix()
    > print.cor.matrix(corrGY2, digits=4, N=FALSE, CI=FALSE, stat=FALSE)

    Pearson's product-moment correlation

                     GY14      N         P         K
    GY14     cor     1         0.7157    0.3785    -0.3964
             p-value           0.0000    0.0080    0.0053
    ----------
    N        cor     0.7157    1         0.02452   -0.3402
             p-value 0.0000              0.8686    0.0180
    ----------
    P        cor     0.3785    0.02452   1         -0.3456
             p-value 0.0080    0.8686              0.0161
    ----------
    K        cor     -0.3964   -0.3402   -0.3456   1
             p-value 0.0053    0.0180    0.0161
    ----------
    HA: two.sided

REGRESSION ANALYSIS

DATAFRAME : SRATE.csv
• Consider grain yield data for six levels of seeding rate.

DATA FRAME: SRATE

      Seedrate  GYield
    1       25 5.30425
    2       50 5.12400
    3       75 5.07025
    4      100 4.84775
    5      125 4.70800
    6      150 4.70325
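For readers without SRATE.csv at hand, the six rows listed on the slide can be typed in directly. This is only a sketch; the commented read.csv() line assumes the file sits in the working directory, which the slides do not state.

```r
# SRATE rebuilt from the six rows listed on the slide
SRATE <- data.frame(
  Seedrate = seq(25, 150, by = 25),
  GYield   = c(5.30425, 5.12400, 5.07025, 4.84775, 4.70800, 4.70325)
)

# Alternative, assuming SRATE.csv is in the working directory:
# SRATE <- read.csv("SRATE.csv")

str(SRATE)              # two numeric columns, six rows
summary(SRATE$GYield)   # range and quartiles of the yields
```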

REGRESSION ANALYSIS : lm()
• lm(), which stands for Linear Model, fits linear models. It can be used to carry out simple and multiple linear regression, single-stratum ANOVA and ANCOVA.

Usage

    > lm(formula, data, na.action, model=TRUE, ...)
    # formula    – a model formula; a typical model has the form response ~ terms
    # data       – data frame containing the variables in the model
    # na.action  – what to do when the data contain NAs; the usual default is na.omit, and na.exclude can also be useful
    # model      – logical; if TRUE, the model frame is returned as a component of the fit
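A short sketch of the `formula` and `data` arguments working together. Writing the formula with bare column names and passing `data=` keeps the coefficient names short and lets predict() accept new data. The data frame is rebuilt from the six SRATE rows listed on the slide, so the estimates need not reproduce any output printed elsewhere in these slides, and the new seed rates 60 and 110 are arbitrary illustrative values.

```r
# SRATE rebuilt from the six rows listed on the slide
SRATE <- data.frame(
  Seedrate = seq(25, 150, by = 25),
  GYield   = c(5.30425, 5.12400, 5.07025, 4.84775, 4.70800, 4.70325)
)

# response ~ terms, with the variables looked up in `data`
fit <- lm(GYield ~ Seedrate, data = SRATE)

coef(fit)   # named vector: (Intercept) and Seedrate

# Predicted yield at seed rates that were not observed (illustrative values)
new_rates <- data.frame(Seedrate = c(60, 110))
predict(fit, newdata = new_rates)
```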

REGRESSION ANALYSIS : lm()
    > ModelGY <- lm(SRATE$GYield~SRATE$Seedrate)
    > ModelGY

    Call:
    lm(formula = SRATE$GYield ~ SRATE$Seedrate)

    Coefficients:
       (Intercept)  SRATE$Seedrate
          5.324283       -0.004168

• The result of lm() is a model object.

REGRESSION ANALYSIS : summary()
• The function summary() obtains and prints a summary of the fit: the residuals, the coefficient table, R-squared and the overall F-test of the regression ANOVA.

    > summary(ModelGY)

    Residuals:
            1         2         3         4         5         6
     0.292567 -0.096083 -0.045633 -0.059733 -0.095283  0.004167

    Coefficients:
                    Estimate Std. Error t value Pr(>|t|)
    (Intercept)     5.324283   0.154081  34.555 4.18e-06 ***
    SRATE$Seedrate -0.004168   0.001583  -2.634    0.058 .
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

    Residual standard error: 0.1655 on 4 degrees of freedom
    Multiple R-squared: 0.6342, Adjusted R-squared: 0.5428
    F-statistic: 6.936 on 1 and 4 DF, p-value: 0.05796
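Two relationships in this printout hold for any simple linear regression: R-squared is the squared correlation between x and y (here sqrt(0.6342) is about 0.7964), and the F statistic is the square of the slope's t value ((-2.634)^2 is about 6.94). The sketch below checks both on a model fitted to the six SRATE rows listed earlier; because it refits from those rows, its numbers need not equal the printed ones.

```r
# SRATE rebuilt from the six rows listed on the slide
SRATE <- data.frame(
  Seedrate = seq(25, 150, by = 25),
  GYield   = c(5.30425, 5.12400, 5.07025, 4.84775, 4.70800, 4.70325)
)
fit <- lm(GYield ~ Seedrate, data = SRATE)
s   <- summary(fit)

coef(s)        # matrix: Estimate, Std. Error, t value, Pr(>|t|)
s$r.squared    # multiple R-squared
s$fstatistic   # F value with numerator and denominator df

# R-squared equals the squared correlation in simple regression
r <- cor(SRATE$Seedrate, SRATE$GYield)
all.equal(r^2, s$r.squared)

# F equals the squared t value of the slope
t_slope <- coef(s)["Seedrate", "t value"]
all.equal(t_slope^2, unname(s$fstatistic["value"]))
```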

SCATTERPLOT : plot() and abline()
    > plot(SRATE$Seedrate, SRATE$GYield, main="ScatterPlot of Mean Yield", xlab="Seedrate", ylab="Mean Yield", col="Red")
    > abline(ModelGY, col="blue", lty=3)

• abline(lm.object) draws the fitted regression line, using the intercept (a) and slope (b) stored in the lm object.
• lm.object – a regression object whose first two coefficients are taken to be the intercept and slope.

SCATTERPLOT : mtext()
• mtext(text, side=3, ...) displays text in the margin at the top of the plot.

    # text – a character expression specifying the text to be written
    # side – the side of the plot on which to display the text:
    #        1 – bottom, 2 – left, 3 – top, 4 – right
    > mtext("GYield=(5.324-0.0042Seedrate) with r=-0.7964", side=3, cex=0.7)
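Typing the equation into mtext() by hand invites copy errors, so one option is to build the label from the fitted model with sprintf(). This is a sketch under assumptions: it refits the model to the six SRATE rows listed earlier, so the label it produces reflects those rows, and it draws on a null PDF device so that it also runs without a display.

```r
# SRATE rebuilt from the six rows listed on the slide
SRATE <- data.frame(
  Seedrate = seq(25, 150, by = 25),
  GYield   = c(5.30425, 5.12400, 5.07025, 4.84775, 4.70800, 4.70325)
)
fit <- lm(GYield ~ Seedrate, data = SRATE)

a <- coef(fit)[["(Intercept)"]]            # intercept
b <- coef(fit)[["Seedrate"]]               # slope
r <- cor(SRATE$Seedrate, SRATE$GYield)     # correlation coefficient

# %+.4f prints the slope with an explicit sign
label <- sprintf("GYield=(%.3f%+.4fSeedrate) with r=%.4f", a, b, r)

pdf(NULL)   # null graphics device: nothing is written to disk
plot(SRATE$Seedrate, SRATE$GYield, main = "ScatterPlot of Mean Yield",
     xlab = "Seedrate", ylab = "Mean Yield", col = "red")
abline(fit, col = "blue", lty = 3)         # fitted line from the lm object
mtext(label, side = 3, cex = 0.7)          # label in the top margin
invisible(dev.off())

label
```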

SCATTERPLOT : title() and mtext()
    > plot(…)   # same as previous slide
    > abline(…) # same as previous slide
    > mtext("GYield=(5.324-0.0042Seedrate) with r=-0.9773", side=3, cex=0.7)

SCATTERPLOT
[Figure: scatterplot titled "ScatterPlot of Mean Yield", with the margin text "GYield=(5.324-0.0042Seedrate) with r=-0.7964"; x-axis: Seedrate, y-axis: Mean of yield.]

RESIDUAL PLOT
    > plot(ModelGY$fitted.values, ModelGY$residuals, main="Residual Plot", xlab="Fitted", ylab="Residuals", col="red")
    > abline(h=0, col="blue", lty=3)
    # draws a horizontal blue dotted line at Y=0
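The same plot can be written with the accessor functions fitted() and resid() rather than `$` components. The sketch also checks a property worth knowing when reading a residual plot: with an intercept in the model the residuals sum to zero, as the six residuals printed by summary() do up to rounding. The model is refitted to the six SRATE rows listed earlier, and a null PDF device is assumed so the code runs without a display.

```r
# SRATE rebuilt from the six rows listed on the slide
SRATE <- data.frame(
  Seedrate = seq(25, 150, by = 25),
  GYield   = c(5.30425, 5.12400, 5.07025, 4.84775, 4.70800, 4.70325)
)
fit <- lm(GYield ~ Seedrate, data = SRATE)

fit_vals <- fitted(fit)   # fitted values, one per observation
res      <- resid(fit)    # residuals: observed minus fitted

sum(res)                  # numerically zero for a model with an intercept

pdf(NULL)                 # null graphics device: nothing is written to disk
plot(fit_vals, res, main = "Residual Plot",
     xlab = "Fitted", ylab = "Residuals", col = "red")
abline(h = 0, col = "blue", lty = 3)   # reference line at zero
invisible(dev.off())
```

Points scattered evenly around the dotted line suggest the straight-line model is adequate; a curve or funnel shape suggests otherwise.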


RESIDUAL PLOT

THANK YOU! ☺ Please do Exercise E