PETZOLD
a) Explain in your own words what the “Principle of Parsimony” (= Occam's razor) tells us. (6 pts)
The principle of parsimony states that the simplest adequate explanation is normally preferable to a
more complicated one. In statistics it means choosing a suitable model with just enough variables and
tests to explain the situation correctly.
The difficulty is that environmental systems are usually complex, with a large number of variables, so
a model can easily be over-simplified.
Still, we normally do not need an over-complicated model or a model that maximally fits the data.
We need the model with the best compromise between goodness of fit and model complexity, i.e. the
smallest model that fits the data reasonably well.
2. What Are?
Random sampling means that the individual objects for measurement are selected at random from the
population.
For stratified sampling the population must be divided into sub-populations (strata). These are
sampled separately (at random), and the population characteristics are estimated by using weighted
mean values. (Therefore, it is essential to have valid information about the size of the strata to derive
the weighting coefficients.)
Random sampling. Random selection of sampling sites by means of numbered grid cells; random
placement of experimental units in a climate chamber; random order of treatments in time.
Stratified sampling. Election forecast; volumetric mean for a water body from measurements for different
layers; gut content of fish estimated from size classes.
*The advantage of stratified sampling is that it gives better estimates from smaller samples; it works only if
the weighting coefficients are correct.
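The weighted mean used in stratified sampling can be sketched in R as follows. This is only a minimal illustration: the lake layers, their volumes and the measured values are made-up numbers, not data from the exam.

```r
# Hypothetical lake with three depth layers (strata); all numbers are made up.
volume <- c(surface = 50, middle = 30, deep = 20)  # stratum sizes (e.g. volume)

# Random samples taken separately within each stratum
samples <- list(
  surface = c(8.1, 7.9, 8.3),
  middle  = c(6.0, 6.4),
  deep    = c(2.1, 1.9)
)

# Weighting coefficients: share of each stratum in the whole population
weights <- volume / sum(volume)

# Mean per stratum, then the weighted (stratified) mean
stratum_means   <- sapply(samples, mean)
stratified_mean <- sum(weights * stratum_means)

# Unweighted mean of all values, for comparison
naive_mean <- mean(unlist(samples))

print(stratified_mean)
print(naive_mean)
```

The two means differ because the deep layer is sampled as often as the middle layer although it represents a smaller part of the water body.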
3. Given
a. Which is normally distributed? (3 pts)
The left one, because its points lie closer to the line.
The right one is right-skewed (gamma or lognormal).
From the output, the p-value is > 0.05, implying that the distribution of the data is not significantly different
from a normal distribution.
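How the p-value of a normality test is read can be tried out with a small R sketch. The two samples are artificial (built from theoretical quantiles, so the result is reproducible); they are an assumption for illustration and not the exam data.

```r
# Artificial samples built from theoretical quantiles (not the exam data)
p <- ppoints(100)
x_normal <- qnorm(p, mean = 10, sd = 2)        # symmetric, normal shape
x_skewed <- qlnorm(p, meanlog = 0, sdlog = 1)  # right-skewed, lognormal shape

# Shapiro-Wilk test: H0 = the data come from a normal distribution
p_normal <- shapiro.test(x_normal)$p.value
p_skewed <- shapiro.test(x_skewed)$p.value

print(p_normal)  # > 0.05: no significant deviation from normality
print(p_skewed)  # < 0.05: normality is rejected

# The visual check from the question: points close to the line = normal
# qqnorm(x_normal); qqline(x_normal)
# qqnorm(x_skewed); qqline(x_skewed)
```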
se =2/5=0.4
5. Given: two samples x1 = 2, 3, 4, 4, 5 and x2 = 3, 7, 12, 15, 16 (procedure for comparing the mean values
of two samples)
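One possible procedure for these two samples, sketched in R. The choice of an F test for the variances followed by the Welch t-test is an assumption about what the course expects; the notes themselves only mention a t-test.

```r
x1 <- c(2, 3, 4, 4, 5)
x2 <- c(3, 7, 12, 15, 16)

# Step 1: descriptive statistics
print(c(mean_x1 = mean(x1), mean_x2 = mean(x2)))
print(c(var_x1 = var(x1), var_x2 = var(x2)))

# Step 2: check whether the variances are homogeneous (F test)
vt <- var.test(x1, x2)
print(vt$p.value)

# Step 3: compare the means; t.test() uses the Welch correction by default,
# which does not require equal variances
tt <- t.test(x1, x2)
print(tt$statistic)
print(tt$p.value)
```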
c. Explain why, and especially how, trend-stationary and difference-stationary series respectively can be
made level stationary. (2 pts)
A trend-stationary series can be made level stationary by subtracting the trend, and a difference-stationary
series can be made level stationary by differencing, so that the variance heterogeneity disappears.
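Both operations can be sketched in R with simulated series. The series, the seed and all parameter values are assumptions chosen for illustration.

```r
set.seed(1)
n <- 100
time_idx <- 1:n
e <- rnorm(n)                      # random noise

# Trend-stationary series: linear trend plus noise
x_trend <- 5 + 0.1 * time_idx + e
# Remove the trend by subtracting the fitted regression line
x_detrended <- residuals(lm(x_trend ~ time_idx))

# Difference-stationary series: random walk (cumulated noise)
x_walk <- cumsum(e)
# Remove the stochastic trend by differencing
x_differenced <- diff(x_walk)

# The detrended series has no remaining slope, and differencing the
# random walk recovers the original noise (except the first value)
print(coef(lm(x_detrended ~ time_idx))[2])
print(all.equal(x_differenced, e[-1]))
```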
Binary variables only have two states, 0/1, true or false, present or absent.
Metric variables can be measured continuously and are on an interval or ratio scale.
b. Give an example
An example of ORDINAL data is the trophic state of a lake: oligotrophic, mesotrophic, eutrophic, etc.
Example of METRIC data could be the temperature.
4. Which advantages do statistical measures like median, quartiles, minimum and maximum have in
comparison to arithmetic mean and standard deviation?
These statistical parameters are more robust, they convey a lot of information about the data, and
they let us check the skewness (symmetry) of the distribution and whether a normal distribution is
plausible.
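The robustness can be seen in a small R sketch. The second sample is a made-up variant of the first with one extreme value.

```r
x     <- c(2, 3, 4, 4, 5)
x_out <- c(2, 3, 4, 4, 50)   # same data with one hypothetical outlier

# Mean and standard deviation react strongly to the outlier
print(c(mean = mean(x), sd = sd(x)))
print(c(mean = mean(x_out), sd = sd(x_out)))

# Median and quartiles hardly change
print(quantile(x))
print(quantile(x_out))

# A mean far above the median indicates a right-skewed sample
print(mean(x_out) > median(x_out))
```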
6. Sketch density plots of (a) a normal distribution, (b) a lognormal distribution and (c) a uniform
distribution. Annotate them as follows:
Note: There are other distributions such as Poisson and binomial.
a) Normal: variable x, probability density f, mean, standard deviation.
d) How are the normal and the lognormal distribution related to each other?
A variable is lognormally distributed if the logarithm of its values is normally distributed;
taking the logarithm of lognormal data therefore gives a normal distribution.
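The relation can be checked in R with the quantile functions. The parameter values are arbitrary and only serve as an illustration.

```r
p <- c(0.1, 0.5, 0.9)

# Quantiles of a lognormal distribution (parameters on the log scale)
q_lognormal <- qlnorm(p, meanlog = 1, sdlog = 0.5)

# Quantiles of the normal distribution with the same parameters
q_normal <- qnorm(p, mean = 1, sd = 0.5)

# Taking the logarithm of the lognormal quantiles gives the normal ones
same <- isTRUE(all.equal(log(q_lognormal), q_normal))
print(same)
```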
e) Which distribution would you use as a first guess for discharge data of a stream similar to the
Elbe River?
A lognormal distribution, because discharge data are usually neither normally nor uniformly
distributed and typically are right-skewed: many low to moderate values and a few very high ones.
f) Does the result of the Shapiro-Wilk W test imply that a normal distribution can be assumed?
(2 pts.)
g) What does the Box-Cox plot for the same data set tell us? Would it be wise to transform the
data? Give an explanation and if yes, which transformation? (2 pts.)
10. Given is the following x-y- data set and an analysis with R.
a) Which type of the function can be used to fit a curve to the data?
A nonlinear regression (exponential)
b) With which statistical and practical approach would you do this?
11. The following time series of annual air temperature (Tair) data was taken from the German
weather service station near Dresden. The scientific question of the analysis was to test for a
significant temperature trend as a consequence of the climate change.
a) Reformulate the scientific question as a pair of hypotheses: a null hypothesis (H0) and an
alternative hypothesis (Ha) for the Mann-Kendall test.
H0: There is no trend.
Ha: There is a monotonic trend (increasing or decreasing); the test does not check specifically for a linear trend.
b) Explain which of the tests below answers our scientific questions best, and what can be
concluded for the temperature trend.
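The Mann-Kendall test: it shows a significant monotonic increase (tau = 0.449, p = 1.5855e-05). The size of
the increase, 0.07972 degrees per year, comes from the slope of the linear regression.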
c) How many degrees did the air temperature increase or decrease per year?
0.07972
d) Which statistical hypothesis is tested by the Mann-Kendall test? What are its advantages and
disadvantages compared to the linear regression model (lm) below?
e) What is the effect size, i.e. how many degrees did the air temperature increase or decrease per
year?
Slope of the regression: 0.07972
The KPSS test (Kwiatkowski-Phillips-Schmidt-Shin test) tests directly for stationarity or trend
stationarity
Common trend tests are suitable only for trend-stationary time series, but not for difference-stationary
time series, because in these the residuals are autocorrelated. This also holds true for the Mann-
Kendall test which is popular in environmental sciences. In its standard formulation it is applicable only
for trend-stationary time series.
> MannKendall(Tair)
tau = 0.449, 2-sided pvalue =1.5855e-05
> m <- lm(Tair ~ time(Tair))
> summary(m)
Call:
lm(formula = Tair ~ time(Tair))
Residuals:
    Min      1Q  Median      3Q     Max
-3.2733 -0.9369 -0.0816  0.6197  2.9386
Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept) -150.03413   31.88781  -4.705 2.64e-05 ***
time(Tair)     0.07972    0.01603   4.973 1.11e-05 ***
(Note: the estimate for time(Tair) is the change of T per year; Std. Error is its standard error.)
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.397 on 43 degrees of freedom
Multiple R-squared: 0.3651, Adjusted R-squared: 0.3504
F-statistic: 24.73 on 1 and 43 DF, p-value: 1.106e-05
> plot(Tair); abline(m, col="red")
> acf(residuals(m))
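The same workflow can be tried without extra packages. The sketch below uses a simulated temperature series (the Dresden data are not reproduced in these notes), and it replaces MannKendall() by Kendall's rank correlation of the series against time from base R, which tests the same monotonic association; both choices are assumptions for illustration.

```r
set.seed(42)
# Simulated annual series: 45 hypothetical years with a built-in trend
year <- 1960:2004
temp <- ts(8 + 0.08 * (year - 1960) + rnorm(45, sd = 1.4), start = 1960)

# Linear regression against time: the slope is the effect size per year
m_sim <- lm(temp ~ time(temp))
slope <- unname(coef(m_sim)[2])
print(slope)

# Kendall's rank correlation of the series against time
# (H0: no monotonic trend)
kt <- cor.test(as.numeric(temp), as.numeric(time(temp)), method = "kendall")
print(kt$estimate)
print(kt$p.value)

# Autocorrelation of the residuals; calling acf() with default settings
# would draw the plot used for the visual check
res_acf <- acf(residuals(m_sim), plot = FALSE)
print(res_acf$acf[2])   # lag-1 autocorrelation
```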
12. The following plot shows the time series of the discharge from River Nile. The purpose was to
figure out if there were structural breaks.
a) Reformulate the scientific question as a pair of hypotheses: a null hypothesis (H0) and an
alternative hypothesis (Ha).
Ho: There is no change of discharge over time
Ha: There is a change of the mean discharge (getting higher or lower)
b) Which test will give you information about the existence of break points in the time series?
The OLS-CUSUM test. How it works: we take the average over the whole time series,
subtract it from each individual value, and cumulate the differences. While the values
lie above the mean, the cumulated sum increases; while they lie below the mean, it
decreases. (Graph: OLS-CUSUM plot.)
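The cumulation step can be sketched by hand in base R with the built-in Nile data set. This is only the raw cumulated sum described above, not the scaled test statistic with significance boundaries that a dedicated package computes.

```r
# Built-in data set: annual flow of the River Nile
x <- as.numeric(Nile)

# Subtract the overall mean and cumulate the differences
cusum <- cumsum(x - mean(x))

# The year with the largest absolute cumulated sum is a candidate breakpoint
year_peak <- time(Nile)[which.max(abs(cusum))]
print(year_peak)

# By construction the cumulated sum returns to zero at the end
print(round(cusum[length(cusum)], 6))

# plot(as.numeric(time(Nile)), cusum, type = "l")  # shape of the CUSUM path
```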
c) Based on the given information below, which model and how many break points represent
your data best?
Answer 1. According to BIC there is one breakpoint, in the year 1898 (m = 1).
Answer 2. According to AIC there are two breakpoints, 1898 and 1953 (m = 2).
Perfect answer. BIC suggests one breakpoint and AIC suggests two. We do not know for sure
which of these models is better, because the AIC and BIC values of fm1 and fm2 differ by less
than 2, the unit of measurement of the AIC (each additional parameter is penalized with a
value of 2). To count as better, a model should improve the criterion by more than 2 units.
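How such a comparison works can be sketched in base R. The two models here (constant mean versus one break fixed at 1898) and their names are assumptions for illustration; they are not the fm1 and fm2 from the exam output, and fixing the break date in advance ignores that a breakpoint search also has to estimate the date.

```r
x  <- as.numeric(Nile)
yr <- as.numeric(time(Nile))

# Model without a break: one constant mean
m_const <- lm(x ~ 1)

# Model with one break fixed at 1898: separate means before and after
m_break <- lm(x ~ I(yr > 1898))

# Smaller AIC/BIC = better compromise between fit and complexity
aic_diff <- AIC(m_const) - AIC(m_break)
bic_diff <- BIC(m_const) - BIC(m_break)
print(c(aic_diff = aic_diff, bic_diff = bic_diff))

# Rule of thumb from the answer: only a difference of more than
# 2 units counts as a real improvement
print(aic_diff > 2)
```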
e) What are possible hydrological reason for the structural breaks in the river discharge time
series ?
Extreme events: drought or flood.
Perfect answer for c). The unit of measurement of the AIC is 2, and the difference between fm1 and fm2
is less than 2.
Test Trial
Distribution
Log normal distribution
Shapiro-Wilk, Kolmogorov-Smirnov
Accuracy of mean value
A) Standard error
B) s/√n
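As a quick check of the formula in R, using the sample x1 from question 5 as example data:

```r
x <- c(2, 3, 4, 4, 5)

# Standard error of the mean: standard deviation divided by sqrt(n)
s  <- sd(x)
n  <- length(x)
se <- s / sqrt(n)

print(se)
```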
Scales
Interval: temperature
c. Rank transformation
Linear Regression
a) Right.
b. the error variance must be homogeneous
c. R² = 74.8% for the linear model (coefficient of determination, also used for nonlinear regression);
shown as "Multiple R-squared" in RStudio.
R = 86.5% is the correlation between the variables (square root of the multiple R-squared). It applies
only to linear regression, not to nonlinear regression.
Fish Growth
a) T test
Trend
= slope ± 2 × standard error = 0.07972 ± 2 × 0.01603 = (0.04766, 0.11178)
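The interval can be reproduced in R from the values of the regression output (slope, standard error, 43 residual degrees of freedom). The second interval replaces the rule-of-thumb factor 2 by the exact t quantile, which is an addition for comparison.

```r
slope <- 0.07972   # estimate for time(Tair) from the lm output
se    <- 0.01603   # its standard error
df    <- 43        # residual degrees of freedom

# Rule of thumb: slope +/- 2 standard errors
ci_rule <- slope + c(-2, 2) * se
print(ci_rule)

# Exact 95 % interval with the t quantile (slightly wider for df = 43)
ci_exact <- slope + c(-1, 1) * qt(0.975, df) * se
print(ci_exact)

# The interval does not include 0, so the trend is significant
print(ci_rule[1] > 0)
```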
R nonlinear regression
Parameters:
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
b) Are the mean values of the two samples significantly different? Why? (2 pts.)
8. The following data from an experiment show the dependency of carbon uptake (V) by
microorganisms on substrate concentration (S). The plot shows a typical saturation at high substrate
concentrations.
a) What is the name of the model defined by function f in the R code (1 pt)?
b) Draw a sketch of this model and annotate variables (S, V) and how to get the parameters (Vm, K). (2
pts)
c) For nonlinear regression, start parameters have to be given. Please give rough estimates for good
start parameters for Vm = ::: and K = :::. (2 pts)
d) How can the regression line be fitted in a spreadsheet program like LibreOffice or Excel?
Give a short explanation and sketch a spreadsheet table showing how this can be done. (3 pts)
e) Which formula can be used to calculate the coefficient of determination for a nonlinear model? (2
additional points)
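A possible way to work through parts c) and e) in R is sketched below. The R code of the exam is not reproduced in these notes, so the saturation function f is assumed here to have the Michaelis-Menten (Monod) form V = Vm * S / (K + S); the data are made-up values with a small deterministic disturbance, and the start-value rules (Vm near the largest V, K near the S at half of that) are a common heuristic.

```r
# Made-up example data (not the exam data): true Vm = 10, K = 5
S <- c(1, 2, 4, 6, 10, 20, 40, 80)
V <- 10 * S / (5 + S) + c(0.2, -0.2, 0.1, -0.1, 0.2, -0.2, 0.1, -0.1)

# Assumed model: saturation curve of Michaelis-Menten type
f <- function(S, Vm, K) Vm * S / (K + S)

# Rough start parameters read from the data:
# Vm = highest observed uptake, K = substrate level at about Vm / 2
Vm0 <- max(V)
K0  <- S[which.min(abs(V - Vm0 / 2))]

# Nonlinear least-squares fit
fit <- nls(V ~ f(S, Vm, K), start = list(Vm = Vm0, K = K0))
print(coef(fit))

# Coefficient of determination for the nonlinear model:
# 1 - residual sum of squares / total sum of squares
r2 <- 1 - sum(residuals(fit)^2) / sum((V - mean(V))^2)
print(r2)
```

The same sum of squared residuals is what a spreadsheet solver would minimize by varying Vm and K.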