
Stats with R

Sample Exam - Part 1


Question 1
a) Population: the set of all data points relevant for the research question. In general, it is
impossible to obtain in full.
b) Sample: a subset of the population which is selected at random and is
representative enough to draw conclusions about the whole population.
c) Variance vs Standard deviation (give formulae)

Standard deviation: describes how much the data points in a sample or population deviate
from the mean, in the original units of measurement.

Sample sd: s = sqrt( Σ(xi - x̄)^2 / (n - 1) )        Population sd: σ = sqrt( Σ(xi - μ)^2 / N )

Variance: the average squared distance of each data point from the sample/population mean.

Sample var: s^2 = Σ(xi - x̄)^2 / (n - 1)        Population var: σ^2 = Σ(xi - μ)^2 / N

Obs: the sd is just the square root of the variance.
d) Standard error: describes how unsure we are about a measure/parameter one
is trying to estimate (ex: the population mean); it is the standard deviation of the
sampling distribution of that estimate.

e) p-value: the conditional probability of an observation at least as extreme as the
one obtained, given that the null hypothesis is true: pval = p(obs | H0 true).

f) Type I error: also known as a false positive, which means that one concludes
there is an interesting difference when in fact there isn't any (the null
hypothesis is rejected when it should not be).

g) Power: the probability of detecting an effect if there is one. It depends on the
sample size, the effect size and the significance level.
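
A small R sketch tying these definitions together (the example vector x and the power.t.test() settings are only illustrative):

x <- c(4.1, 5.3, 3.8, 6.0, 5.2)     # an arbitrary example sample
var(x)                              # sample variance
sd(x)                               # sample standard deviation
all.equal(sd(x), sqrt(var(x)))      # TRUE: sd is the square root of the variance
sd(x) / sqrt(length(x))             # standard error of the sample mean
power.t.test(n = 20, delta = 0.5, sd = 1, sig.level = 0.05)   # power depends on n, effect size, alpha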
Question 2: Ratings: 21 11 15 6 8 9 18
Ordered data: 6 8 9 11 15 18 21.

a) Median: splits the data into the 50% lowest and the 50% highest values.

Mean: (6 + 8 + 9 + 11 + 15 + 18 + 21) / 7 = 88/7 ≈ 12.57

1st quartile: splits the data into the 25% lowest and the 75% highest values. (In this case, the
1st quartile is 8.)

3rd quartile: splits the data into the 75% lowest and the 25% highest values. (In this case, the
3rd quartile is 18.)

b) Median of the dataset: 11

c) Compare mean and median: the mean (12.57) is greater than the median (11), because the mean is pulled up by the few high ratings.
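
These values can be checked in R; note that quantile()'s default interpolation may give slightly different quartiles than the hand method above, while type = 1 reproduces 8 and 18:

x <- c(21, 11, 15, 6, 8, 9, 18)
sort(x)                                       # 6 8 9 11 15 18 21
mean(x)                                       # 88/7 ≈ 12.57
median(x)                                     # 11
quantile(x, probs = c(0.25, 0.75), type = 1)  # 1st and 3rd quartile: 8 and 18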


Question 3: What statistical test should we use?
Why?
Unpaired 2-sample t-test.

A t-test because we have one categorical predictor (gender) and the response
variable is ratio (frequency).

Unpaired 2-sample because we have two groups (male and female) whose
frequency measurements are not related to each other.
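
A minimal sketch of this test in R, assuming a hypothetical data frame d with the response frequency and the two-level factor gender:

t.test(frequency ~ gender, data = d, paired = FALSE)   # unpaired 2-sample t-test
# R's default is the Welch version; add var.equal = TRUE for the classical t-test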
Question 4
a) H0: all auxiliaries have the same frequency. (f-hebben = f-zijn = f-zijnheb)

H1: not all auxiliaries have the same frequency (at least one differs from the others).

b)
χ2 = (212 - 95)^2/95 + (15 - 95)^2/95 + (58 - 95)^2/95 ≈ 225.9

c) Auxiliary has 3 levels, so df = 3 - 1 = 2. Use the chi-square distribution with df = 2.
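
Parts (b) and (c) can be reproduced in R from the observed frequencies given above:

obs <- c(hebben = 212, zijn = 15, zijnheb = 58)
exp <- rep(sum(obs) / 3, 3)        # 95 expected per auxiliary under H0
sum((obs - exp)^2 / exp)           # chi-square statistic ≈ 225.9
chisq.test(obs)                    # same test, df = 3 - 1 = 2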


d)

H0: no difference between groups (ex. f-irregular,hebben = f-regular,hebben)

H1: there is a difference between groups.

-> Calculating the expected number of observations for regular verbs with auxiliary
zijn:

E(regular, zijn) = 143 * 15 / 285 ≈ 7.53
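
A sketch of this computation in R, assuming (as the formula above implies) that 143 is the total number of regular verbs out of 285:

row_tot <- c(regular = 143, irregular = 285 - 143)
col_tot <- c(hebben = 212, zijn = 15, zijnheb = 58)
expected <- outer(row_tot, col_tot) / 285   # (row total * column total) / grand total
expected["regular", "zijn"]                 # 143 * 15 / 285 ≈ 7.53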


e) pchisq() = integral of dchisq()

Recap:

dchisq(x, df, ncp = 0, log = FALSE) # density function; to see what the χ2
distributions look like, we use dchisq().

pchisq(q, df, ncp = 0, lower.tail = TRUE, log.p = FALSE) # cumulative probability
for the χ2 distribution with df degrees of freedom.

Obs: qchisq() is the inverse of pchisq(): if pchisq(x) = y, then qchisq(y) = x.
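
The relationship between the three functions can be seen directly in R:

integrate(dchisq, lower = 0, upper = 4.13, df = 2)$value  # area under dchisq() up to 4.13
pchisq(4.13, df = 2)                                      # same value (≈ 0.87)
qchisq(pchisq(4.13, df = 2), df = 2)                      # 4.13 again: qchisq() inverts pchisq()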


f) pchisq() (or its inverse, qchisq()) -> used to find the rejection value for a given
alpha (no need to memorize it for each chi-square distribution).

Given alpha = 0.05 and df = 2, the rejection value is about 6.0 (qchisq(0.95, df = 2) ≈ 5.99).

Draw the rejection region under the dchisq() curve starting from that value.

dchisq() -> knowing the rejection value from pchisq()/qchisq() for a certain alpha value,
decide whether the chi-square value falls within or outside the rejection region.

For 0.54 and 4.13 -> outside rejection region, we do not reject the null hypothesis.

For 8.23 -> within the rejection region, reject the null hypothesis.
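
In R, the rejection value and the decisions above can be reproduced as follows:

crit <- qchisq(0.95, df = 2)       # ≈ 5.99, start of the rejection region for alpha = 0.05
c(0.54, 4.13, 8.23) > crit         # FALSE FALSE TRUE: only 8.23 falls in the rejection region
pchisq(c(0.54, 4.13, 8.23), df = 2, lower.tail = FALSE)   # corresponding p-values; only the last is < 0.05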
Question 5
a) Alpha = 0.10 - two-tailed (= 0.05 each side)

b) For alpha = 0.05 (rejection at roughly > 2 / < -2) and alpha = 0.10 (rejection at
roughly > 1.7 / < -1.7), the value of 1.62 would not lead to rejecting the null hypothesis.
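
As a rough sketch, the two-tailed cutoffs can be looked up in R; the exact reference distribution is not given here, so qnorm() is used below (qt() with the appropriate df would give the ≈2 and ≈1.7 cutoffs quoted above):

qnorm(1 - 0.05 / 2)                # ≈ 1.96, two-tailed cutoff for alpha = 0.05
qnorm(1 - 0.10 / 2)                # ≈ 1.64, two-tailed cutoff for alpha = 0.10
abs(1.62) > qnorm(1 - 0.10 / 2)    # FALSE: 1.62 does not reach either cutoff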
Question 6
a) - Homogeneity of variance:
Levene test: tests whether the groups have the same variance.
H0: no indication that the homogeneity of variance assumption is violated (retained if p > 0.05).
H1: homogeneity of variance is violated (p < 0.05).

- Normality of residuals:
Shapiro-Wilk test: tests whether the data at hand are normally distributed (bell shape).
H0: no indication that normality is violated (retained if p > 0.05).
H1: normality is violated (p < 0.05).
Alternatively, inspect a QQ-plot.
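
A sketch of these checks in R, assuming a hypothetical data frame d with the response score and the grouping factor group (leveneTest() comes from the car package):

library(car)
leveneTest(score ~ group, data = d)          # homogeneity of variance
m <- aov(score ~ group, data = d)
shapiro.test(residuals(m))                   # normality of the residuals
qqnorm(residuals(m)); qqline(residuals(m))   # or inspect a QQ-plot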
b)

c) As p < 0.05, we can tell that there is a difference between the group means (at least
one group mean differs from the others).
d) Run pairwise.t.test() in R (pairwise comparisons between the groups, with correction for multiple testing).

e) Eta-squared measures the effect size of an ANOVA, just as Cohen's d does for t-tests. It
refers to the proportion of the variability in the outcome variable that can be
explained in terms of the predictor.
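
A sketch for (d) and (e), reusing the hypothetical d, score and group from above; etaSquared() here is taken from the lsr package:

pairwise.t.test(d$score, d$group, p.adjust.method = "bonferroni")   # which groups differ?
library(lsr)
etaSquared(aov(score ~ group, data = d))                            # effect size of the ANOVA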
Question 7
a) Pearson correlation ≈ 0.61, a moderate positive correlation; however, the p-value >
0.05, which means that we retain H0 (Pearson correlation = 0: there is no
correlation between grade and hours slept).
b) If the experiment were repeated many times, 95% of the intervals constructed this
way would contain the true correlation; here the interval (-0.035, 0.89) includes 0,
consistent with retaining H0.
c) The intercept is the predicted grade for a student with zero hours of sleep; the
slope is how much the grade increases for each additional hour slept.
d) The residuals are the errors in the prediction of yi by yi-hat (residual = yi - yi-hat).
e) The second model is better, given that its R-squared and adjusted R-squared are
greater than the first model's.
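
A sketch of the analyses behind (a)-(e), assuming a hypothetical data frame sleep_df with the columns hours_slept and grade:

cor.test(sleep_df$hours_slept, sleep_df$grade)   # Pearson r, p-value and 95% confidence interval
m1 <- lm(grade ~ hours_slept, data = sleep_df)
summary(m1)                                      # intercept, slope, residuals, (adjusted) R-squared
residuals(m1)                                    # yi - yi-hat for each observation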
Question 8
a) Both tests compare the means of two groups/independent samples. The Welch test
is used when the homoscedasticity assumption does not hold (the population
standard deviation is not the same in both groups).
b) The chi-square statistic is a sum of squared standardized cell counts; each count is
binomially distributed and approximately normal when n*p and n*q > 5, meaning that the
frequencies have to be big enough for the approximation to hold.
If this does not hold, use Fisher's Exact Test.
Note that the observed values don't have to be larger than 5, but the expected
values need to be larger than 5 in each cell.
c) A non-parametric test does not rely on any assumption about the distribution of
the response variable.
Ex. Fisher's Exact Test, the Wilcoxon test, the Mann-Whitney test.
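
Hedged R sketches of the tests mentioned here, reusing the hypothetical d, frequency and gender from Question 3 (the 2x2 counts are made up for illustration only):

t.test(frequency ~ gender, data = d)             # Welch test: R's default does not assume equal variances
fisher.test(matrix(c(8, 2, 1, 5), nrow = 2))     # exact test when expected counts are too small
wilcox.test(frequency ~ gender, data = d)        # non-parametric Wilcoxon / Mann-Whitney test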
