# Brandenburg University of Technology at Cottbus Dept.

of Ecosystems and Environmental Informatics

Statistical Modelling

Univ.-Prof. Dr. habil. Albrecht Gnauck

International Master Course of Study Hydroinformatics – EuroAquae

Winter term 2010/2011

Contents
1. 1.1 1.2 1.3 1.4 2. 2.1 2.2 2.3 2.4 3. 3.1 3.2 3.3 3.4 4. 4.1 4.2 4.3 4.4 5. 5.1 5.2 5.3 5.4 6. 6.1 6.2 6.3 6.4 7. 7.1 7.2 7.3 7.4 Events and data Analysis and control of aquatic ecosystems Statistical management of ecological data Sampling strategies Re-sampling and pre-treatment of data Probability functions and statistical measures Probability functions of ecological data Normal and skewed probability distribution functions Comparison of expectations Statistical measures Statistical test procedures Introduction General procedure of hypothesis testing Rules of decision Selected test procedures Linear regression and correlation analysis Steps of linear regression Confidence region of regression line The power of linear regression Empirical covariance and statistical measures of correlation Nonlinear regression analysis Polynomial regression Periodic regression Trend functions Comparison of regression functions Time series analysis Dynamic behaviour of time series Description of time series in the time and frequency domain Stationary processes Correlation and spectral functions Analysis of cycling processes Introduction Fourier analysis Digital data filter Wavelets

Literature

2

1. Events and data
Statistical modelling of hydrological systems is an important task to extract information from former and actual states of aquatic ecosystems (aquifers, freshwater ecosystems, marine ecosystems) by means of water quantity data and water quality data. Holism and reductionism are the two different approaches to study and model ecological processes and systems. Both approaches are needed for ecosystems modelling, simulation and management. Holism Aquatic ecosystems are complex systems with nonlinear interrelationships. Holism attempts to reveal the properties of ecosystems by studying the system as a whole. The system properties cannot be found by a study of the system components separately. It is required that the study be on the system level. This does imply that a study of the ecosystem components is not sufficient. The components of ecosystems are coordinated to such an extent that ecosystems work as indivisible unities. A study of ecosystem components level will never reveal the ecosystem properties. Reductionism To simplify the ecosystem study and to facilitate the interpretation of ecological processes the ecosystem components are separated from the system level. This method is useful to find governing relationships in real systems. This method has obvious shortcomings when the functioning of the entire ecosystem is to be revealed. As an example: A forest is more than the sum of all trees. The analysis and control of dynamic aquatic ecosystems such as ponds, lakes, reservoirs and river basins is often a complicated task because of the high number of system elements (or components) and interrelationships between system elements and between system elements and their environments. To solve management problems the system has to be decomposed and nonlinear interrelationships have to be linearised. Furthermore, the controllability of aquatic ecosystems has to refer to different and parallel working subsystems and system states. The quality of aquatic ecosystem analysis depends on the flexibility of statistical models used. The restricted information structure of complex aquatic ecosystems and aggregation of information lead to uncertainties of the modelling process and of the resulting models. Dynamic processes within 3

complex dynamic systems like hydrological systems (aquatic ecosystems) are characterised by three features (table 1). To solve practical problems such approaches are necessary which are compatible with the stochastic nature of the input variables and state equations. If more than one variable (or indicator) is considered the population is denoted as a multivariate one. state and output variables. g. robustness. A group of state variables under study is called a (statistical) population (e. which is an analytical function defined by a number of parameters. then it is possible to describe it by a probability density function or probability distribution function. In general. stability. Statistical procedures will be the adequate mathematical methods as long as the processes within the systems and their describing equations are 4 . Table 1: General characterisation of complex dynamic systems Feature High dimension Uncertainty Solving procedure Decomposition of system Analysis of dynamic characteristics (observability. For the study of aquatic ecosystems a subset of the population or a sample is used. controllability. They result in rapid changes of system states and output variables (non-autonomous control) or in low changes (autonomous control). and measures of their dispersion as variance or spanning width. If the frequency distribution of the attributes of a population is known. BOD data of a waste water treatment plant). salinity data of river water. data of water flow. reachability. Common univariate measures are averages as measures of location of centres of data clouds along an axis. Regression and correlation analysis belong to experimental statistical modelling of hydrological systems which is based on methods of the theory of probability. A population is denoted as univariate if only one variable (or water quality indicator) is considered. They are observations about characteristics and/or attributes of hydrological input. sensitivity) Aggregation of information Restricted information structure Statistical modelling of hydrological systems is based on data. perturbability.aquatic ecosystems are initiated by switching of input and state variables.

The power of natural and artificial (man-made) external as well as natural internal driving forces on hydrological indicators influence the quality of data to be obtained. variances. Administrations as well as industrial and agricultural companies use statistical data and results to plan their operations and economic developments. re-sampling. A distinction is made between two groups of methods depending on whether the variable time is included or not: Static methods (without consideration of time as variable) and dynamic methods (with consideration of the variable time). the topics of statistical modelling can be formulated by: 1. Therefore. 5 . 4. geostatistics. frequency analysis). 2. advanced statistical techniques. This question can be answered by a regression analysis which gives out the type of relationship between variables.unknown. Mostly. Data analysis to fulfil the requirements of different professional users (e. frequency distributions. Researchers use statistics mainly as a first step to derive new scientific results. significance tests). forestry) (Methods: Explanatory statistics. industry. only small sets of data of representative regularly sampled data are available. averages. Data analysis to fulfil the requirements of environmental administrations and associations (Methods: Descriptive statistics. error correction. plausibility checks. 3. time series analysis). multivariate statistics. Static procedures answer the question whether there is a relationship between two or more variables of an environmental system. Data sampling (Methods: Sampling design. Disturbances of statistical analysis of hydrological data are given by: 1.g. multivariate statistics. Simple and multiple linear and non-linear regression and correlation belong to static methods as well as multivariate statistical procedures. outlier correction). digital data filtering. Statistical modelling is done for different purposes. Basic research (Methods: Regression and correlation analysis. The latter one is often called time series analysis or dynamic statistics. agriculture. 2.

Spatial and temporal scales are not specified a-priori. 4. 5. 2.1 Analysis and control of aquatic ecosystems An aquatic ecosystem is a biotic and functional system or unit. Experimental studies where manipulations of a whole ecosystem are used to identify and elucidate ecological mechanisms. Hydrologic processes possess different rate constants. Ecosystems are often called complex systems. Comparative studies are presented to compare some structural and functional components for a range of ecosystem types. 3. Empirical studies collect bits of information. 1.3. Measured and/or observed data of hydrological indicators will be obtained by field samples and/or laboratory experiments. Classification of hydrological data Hydrological data may be classified by their origin: 1. 6 . They are directly observed (direct observations) or indirectly observed (due to calibration of analytical instruments or sensors). Mostly. Modelling and computer simulation studies to work out ecosystem management plans and to derive eco-technological tools for goal oriented control actions. but are entirely based upon the objectives of the ecosystem study. An attempt is made to integrate and assemble the studies into a complete picture. Simulated data will be obtained by simulation models. which is able to sustain life and includes all biological and non-biological variables in that unit. the a-priori process information on water quality indicators is low. Summary data will be derived from statistics or from restricted observable ecological. respective water quality indicators. Several approaches exist to study the behaviour of ecosystems. Cycling effects in hydrological data are induced by natural internal or external as well as by man-made external processes.

Information systems and decision support systems studies to support industrial, agricultural and administrative ecological decisions and to work out medium-term and long-term development plans for ecological management. Like many words for which people have an intuitive understanding, a “system” is difficult to define precisely. In relation to the physical and biological sciences, a system is an organised collection of interrelated physical components characterised by a boundary and functional unity. A system is a collection of communicating materials and processes that together perform some set of functions. A system is an interlocking complex of processes characterised by many reciprocal cause-effect pathways. A system is a set of interrelated objects (elements, parts) that have certain general properties: 1. It fulfils a certain function, i.e. it can be defined by a system purpose recognisable by an observer. 2. It has a characteristic constellation of essential system elements and an essential system structure which determine its function, purpose, and identity. 3. It loses its identity if it is destroyed. Analysis and control of aquatic ecosystems are often complicated because of the high number of system elements and interrelationships between system elements and between an ecosystem and its environment. Mostly, an ecosystem will be analysed as one unit. Dynamic processes within ecosystems are initiated by switching processes of input and state variables with different transfer time constants (fig. 1). If they are overlaid by external and internal disturbances it can not be distinguished which part of ecosystem response and its intensity stem from a single ecological element. For ecosystem analysis, the complex structure of an ecosystem requires its decomposition and linearization of nonlinear interrelationships. The controllability of ecosystems has to refer to different working elements (or subsystems) and system states. Therefore, the whole ecosystem will be divided into several subsystems with internal and external feedbacks. This leads to uncertain statements on the ecosystem behaviour. The quality of ecosystem analysis depends

7

on the flexibility of mathematical models used for computation. Restricted information structure and aggregation of information lead to model errors.

Figure 1: Switching processes within a freshwater ecosystem

Ecosystems are multidimensional systems with several input and output variables. They can be seen as black box, grey box or white box systems. In dependence of the numbers of input and output variables SIMO-, MIMO-, SISOand MISO-systems will be distinguished. Ecosystems can be considered as stochastic transfer systems described by its state variables and parameters. They are characterised by measurable inputs, immeasurable (stochastic) disturbances as well as by measurement errors. In the case of real systems, disturbances, input signals and measurement errors will be overlaid and produce disturbed (and unsure) output signals. Transfer functions are represented by 1. Pulse function x(t) = 0 for t < 0 and t > T, x(t) = x0 for 0 ≤ t ≤ T, 2. Jump function: x(t) = x0σ(t) with σ(t) = 0 for t < 0 and σ(t) = 1 for t ≥ 0, 3. Harmonic function: x(t) = x0 + cos(ωt+ϕ) for -∞ < t < + ∞ or x(t) = x0 ej(ωt+ϕ) = x0+ ejωt with x0+= x0 ejϕ, 8

4. White noise function. Other transfer functions are 1. Exponential function: x(t) = x0 e-t/T for 0 ≤ t < + ∞ or x(t) = x0+ ejωt eδt for 0 ≤ t < + ∞ and δ ≠ 0, 2. Periodic function: x(t) = a0/2 + Σi aicos(iω0t) + Σi bisin(iω0t) or x(t) = Σi ci ej(iω0t), 3. Dirac impulse: x(t) = 0 for t < 0 and t > T, x(t) = δ(t) with δ(t) = 0 for t ≠ 0 and ∫ δ(t)dt = 1, 4. Ramp function: x(t) = 0 for t < 0 and x(t) = at for t ≥ 0 or 5. Time discrete signal: x~(t) = ∑k x(kT)δ(t-kT) with k = 0, 1, 2, … and T ≤ 1/(2fmax) where fmax is the maximum frequency contained in the data serie. Feedback structures (or couplings) within ecosystems are given by simple feed-forward, feed-back self-tuning or complicated couplings between the ecosystem elements. 1.2 Statistical management of ecological data To handle and investigate hydrological data with sense they should be characterised by some relationships. The increase of information content of hydrological data analysis is expressed by the number of data operations. Four scales can be distinguished (fig. 2).

Increase of information content

Ratio Scale

Interval Scale

Ordinal Scale

Nominal Scale

Figure 2: Data scales in hydrological research

9

then the data are valuated as “comparable”. estimation of median and quartiles. statements on distances and differences between data are allowable. -. Statistical analysis depends on the a-priori information of essential hydrological variables considered. concentrations). Table 2: Comparison of data scales in hydrology Scale Ratio Scale Interval Scale Ordinal Scale Nominal Scale Arithmetic operation +. Ratio scale: It is an interval scale with a “natural” origin and allows statements on ratios (e. The information content (knowledge. water temperature). Hydrological variables and their rates of changes have different scales in time and space. EU water quality classes. Ordinal scale: Ranking of events or representations. g. antithesis of uncertainty) and the scale level should not be changed during sampling and/or statistical data analysis. Quartiles Frequencies only Nominal scale: No relationship between events. 10 . •. sometimes they are coded by numbers (e. classification of environmental indicators (e.).Transformations from one data scale to another serve as unificators of variables (tab. pie charts). Sources of uncertainty are characterised by 1. lottery. no arithmetic operation possible. 2. soil classes etc. g. Interval scale: Ordinal scale with equal intervals (e. 2). No “natural” origin (Zero point) exists. If there is no empiric equivalence scale. g. One of the most important characteristic of hydrological data is its uncertainty which can be characterised as a state or condition of incomplete or unreliable knowledge. none none Statistical measure Geometric mean Arithmetic mean Median. / +. g. ordinal comparisons are possible: Class I > Class II.

5 7 15 O2 (mg/l) 10 5 0 NO3 (mg/l) 0 15 10 5 0 20 J FM A M J J A S O N D time (month) 0 J FM AM J J A S O N D time (month) Figure 3: Data series of water quality samples The quality and usability of hydrological data are usually highly depending on the suitability of the sample and the adequacy of the sampling or monitoring program. on stationary or instationary external or internal effects as well as on random influences.5 0 40 8 0. a small set of representative data will be available. These estimates are called sample statistics and form a base to give prognoses on environmental developments in general. The strength of disturbances of the data observed leads to fuzzy effects of interpretations. Sampling frequency depends on hydrologic process dynamics. Different results may 11 .3. and on the type of substance.5 1 0.8 0. Mostly. Figure 3 shows different types of annual water quality data series which can be distinguished by their statistical measures: 30 TW (°C) 20 10 0 9 NO2 (mg/l) pH-value Lf (mS/cm) o-PO4-P (mg/l) DOC (mg/l) J FM AM J J A S O N D time (month) NH4 (mg/l) 10 5 0 1 1 0. but also on hydrological changes. If an investigation is based on samples then sampling statistics depends on the particular sampling environment. on the degree of water pollution. The goal of sampling is to get information about the frequency distributions of data indicating environmental states or about the distribution parameters. on the type of pollution.6 1.

2. The difference between a statistic and the true population value is called sampling error. Simulated data are obtained by simulation models. Different data treatment. Random sampling.3 Sampling strategies Ecological data are obtained by field samples and/or laboratory analysis. 12 . 3. 2. Summary data are derived from statistics or by restricted observable indicators. Sampling design is based on different procedures. 3. dispersion is approximately time constant.be obtained if different samples are selected. It increases if more random factors influence the sampling procedure. Sampling variance is a measure of the precision of the estimates. They are directly observed (direct observations) or indirectly observed (due to calibration of analytical instruments and sensors). 1.96. There is a margin of uncertainty expressed in terms of the sampling variance of the estimator. 4. daily. Different sample treatment. Comparison of hydrological data series: 1. Intrinsic factors between water samples. v = x*/s⋅100 and e(x*) = 10% allowable deviation from mean. 2. This variation in the data from sample to sample is called sampling variability. Average and dispersion are time dependent. 3. Systematic (periodically) sampling (yearly. Variability within data series is caused by: 1. dispersion is time dependent. Environmental influences or factors. Sample size for normal distributed data without trend and periodicities: n = ((t(95)⋅v)/e(x*))2 with t(95) = 1. Sampling based on the level of admissible fault of the annual mean. weekly. 4. Average is time dependent. The most common used designs are 1. monthly. Average is approximately time constant. and hourly).

2. frequency of sampling and spacing can be estimated either by preliminary sampling experiments. by conclusions from expert knowledge. The goals and needs for hydrological data collection should be formulated explicitly for each application before sampling is started. 3. Existing estimates may be sufficient if they were obtained by an unbiased sampling design. Interpretation (data analysis. 2. During sampling significant changes of external and internal driving forces should not take place. Disturbances of data analysis: • Only small sets of representative regular sampled data are available. or by statistical sampling design formulas and methods. 1. 5. The sampling procedure covers three parts. interpretation of results) Recommendations for hydrological sampling: 1. 3. Optimum number of samples. hydrobiological variables (life cycle of plants and organisms. microbiological variables. conversion of organic and inorganic substances). 13 . hydrophysical variables (considering internal and external driving forces). sampling design. Observation (sampling techniques. Sampling design in hydrology should cover the water budget (surface and groundwater). 4. Hypothesis (program purpose. Prior knowledge of factors that affect hydrological variables to be sampled should be given. sampling protocol. by practical experiences. hydrochemical variables (organic and inorganic substances.The sampling location in space and time can have a very real effect on the quality and usefulness of data in hydrology. Geostatistical methods can be helpful to determine optimal space distribution of sampling points. metabolites). analytical techniques). and other variables as required. formulation of questions). Site selection should be made primarily on the basis of the goal of the study as well as on the nature of the hydrologic process or phenomenon under consideration.

they often contain missing data or they are based on different sampling intervals in time and space. • • The a-priori process information on water quality (hydrological) indicators is low.• The power of external and internal driving forces on water quality (hydrological) indicators influences the quality of data to be obtained. 1. Water quality (hydrologic) processes possess different rate constants.4 Re-sampling and pre-treatment of data Series of measurements of hydrological data are time series of data recorded at discrete points in time often with unequal sampling intervals. The application of static and dynamic statistical methods for analysing such data sets requires equidistant data. To extract hydrologic process information from single data (events) the data series should be completed and based on a regular sampling grid. R a w h y d r o lo g ic a l d a ta In te r p o la tio n E q u id is ta n t d a ta A p p r o x im a tio n D ig ita l d a ta f ilte r in g S ta tic D y n a m ic Low pass H ig h p a s s F u n c tio n a l r e la tio n s h ip C o n s is te n t d a ta Figure 4: Interpolation. Figure 4 gives an overview on these procedures. in the case of noisy information. approximation and digital filtering of data 14 . data approximation. Re-sampling generally means data interpolation or. In practice.

neighbour 10 8 6 4 2 J F M A M J J A S O N D t (Month) Raw data 10 14 d. Table 3: Interpolation methods Method Nearest neighbour Linear Cubic Hermite polynomial Algorithm t < (t k + t k +1) / 2 ⎧x ~ x (t ) = ⎨ k ⎩ x k +1 t ≥ (t k + t k +1) / 2 − ~ x (t ) = xk +1 xk (t − t k ) + xk t k +1 − t k ~ = ak t 3 + b k t 2 + c k t 3 + d k x (t )| . linear NH4-N (mg/l) Figure 5: Results of interpolation for two-weekly sampled data The effectiveness of interpolation procedures can be evaluated by standard error estimations. Spree NH4-N (mg/l) NH4-N (mg/l) 10 8 6 4 2 J F M A M J J A S O N D t (Month) Raw data 10 14 d. tn] ~ ( x continuous dif~ x ∈ C(2)[t0. Table 3 contains some commonly used interpolation methods. x ’ continuous differentiable) ferentiable) Characteristics Cubic spline Results of interpolation of water quality data based on biweekly sampling intervals are presented in figure 5 and in figure 6 for monthly sampling intervals.t k +1] ~ x ∉ C(0)[t0.The goal of the application of interpolation and approximation methods onto incomplete time series is to fill the intervals between two grid points so that series of measurements with small unique sampling intervals are kept. Results for the biweekly data sets are presented in table 4. tn] ~ ( x continuous) ~ x ∈ C(1)[t0.t ] [t k k +1 ~ = ek t 3 + f k t 2 + g k t 3 + hk x (t )| [t k. tn] ~ ~ ( x . cubic 8 6 4 2 0 J F M A M J J A S O N D t (Month) 0 Raw data 14 d. tn] ~ ( x discontinuous) ~ x ∈ C(0)[t0. 15 NH4-N (mg/l) . spline 8 6 4 2 0 J F M A M J J A S O N D t (Month) 0 Raw data 14 d.

38 0.36 0.14 1.09 0.035 0.44 0.11 0.016 NO3-N 0.22 0.39 0.Table 4: Standard error of data series with biweekly sampling interval Year 1991 1991 1991 1991 1992 1992 1992 1992 1993 1993 1993 1993 1994 1994 1994 1994 1995 1995 1995 1995 Method neighbour linear spline cubic neighbour linear spline cubic neighbour linear spline cubic neighbour linear spline cubic neighbour linear spline cubic NH4-N 0.016 0.23 0.016 0.065 0.41 0.21 0.23 0.43 0.39 0.041 0.18 0.64 0.11 0.015 0.39 0.020 0.026 0.033 0.08 0.115 0.20 0.017 0.27 0. cubic 8 6 4 2 0 J F M A M J J A S O N D t (Month) 0 Raw data 28 d.019 0.23 0.35 0.026 DOC 1.41 0.43 0. linear NH4-N (mg/l) Figure 6: Results of interpolation for (nearly) monthly sampled data 16 NH4-N (mg/l) . neighbour 10 8 6 4 2 J F M A M J J A S O N D t (Month) Raw data 10 28 d.038 0.037 0.73 0.36 1.20 0.081 0. spline 8 6 4 2 0 J F M A M J J A S O N D t (Month) 0 Raw data 28 d.027 0.09 0.041 0.25 0.10 0.67 0.019 0.038 0.37 0.71 Figure 6 contains some interpolation results for monthly sampled data sets of water quality indicators of River Spree at Berlin.025 0.020 0.36 0.053 0.37 0.22 o-PO4-P 0.46 0.116 0.081 0.079 0.12 0.37 0.081 0.35 0.47 1.43 0.079 0.33 1.10 NO2-N 0.019 0.71 0.69 0.019 0.42 0.115 0. Spree NH4-N (mg/l) NH4-N (mg/l) 10 8 6 4 2 J F M A M J J A S O N D t (Month) Raw data 10 28 d.41 1.22 0.035 0.24 0.41 1.082 0.114 0.23 0.080 0.24 0.36 0.053 0.20 0.46 0.052 0.65 0.70 0.43 1.019 0.079 0.

119 0.49 1.54 2.30 0.43 0.37 0. Turbidity Conductivity Dissolved oxygen Spree linear linear linear linear.026 0.54 0.48 0.030 0.11 0.29 0.022 NO3-N 0.23 0.021 0.41 0.53 2.10 NO2-N 0.066 0.027 0.82 0. spline. polynomial linear The application of interpolation methods leads to equidistant data while approximation methods result in functional relationships which can be used as 17 .18 0.075 0.21 0.33 0.036 0.115 0.063 0.031 0.Standard error estimates for monthly sampled water quality data sets are given in table 5.37 0.084 0.058 0.020 0. polynomial linear linear linear spline. Table 5: Standard error of data series with monthly sampling interval Year 1991 1991 1991 1991 1992 1992 1992 1992 1993 1993 1993 1993 1994 1994 1994 1994 1995 1995 1995 1995 Method neighbour linear spline cubic neighbour linear spline cubic neighbour linear spline cubic neighbour linear spline cubic neighbour linear spline cubic NH4-N 0.35 0.48 1.066 0.39 0.29 0.029 DOC 1.24 0.081 0.37 0.47 1. spline linear linear.028 0.026 0. spline polynomial linear spline.035 0.74 0. Table 6: Interpolation methods for rivers with different hydraulic regime Variable Ammonia Nitrite Nitrate Phosphate DOC UV absorp.026 0.035 0.46 0.46 0.44 0.83 Table 6 contains some selected results of an interpolation study for rivers with different speed of flow.41 0.00 0.30 0. polynomial linear linear linear spline.025 0.42 0.49 0.57 0.081 0.028 0.44 0. polynomial linear Havel linear linear linear linear.57 1.10 0.96 0.081 0.082 0.084 0.115 0.079 0.081 0.41 2.115 0.83 0.083 0.49 0.52 0.83 0.21 0. polynomial linear Oder linear linear linear linear linear linear.15 2.32 0.026 0.022 0.029 0. spline.48 0.42 0.41 0.19 0.20 0. spline linear linear linear.55 0.083 0.025 0.19 0.15 0. polynomial linear Dahme linear. spline.89 0.078 0.40 0.09 0.29 o-PO4-P 0. polynomial linear linear.

the results of all three different types of procedures (interpolation. In each case.) is available. 18 . simulation and optimisation as well as decision making. approximation and digital data filtering) deliver data sets which can be used for modelling. The application of digital data filters gives out consistent data.estimations of reference functions (or estimated reference data) if no other reference (may be from literature or from former experience etc.

a probability distribution function of a random variable is defined by F(x) = P(X < x) where x takes all real values.….1 Probability functions of ecological data In general.2. A continuous random variable takes either each or any value of the region of definition. For any fixed value a: P(X = a) = 0. A probability distribution function of a discrete random variable X = {x1. A random variable X is called discrete if it takes finite or enumerable infinite values x1. In opposite of that are deterministic processes with well-known results of theoretical and practical experiments. 4. Two types of random variables have to be distinguished: discrete or continuous. x2.xn. x2. 5. A random event is an event which will occur under certain conditions. Probability distribution functions and statistical measures Relationships between random events will be analysed by probability calculus and mathematical statistics. 2.…. F(x) is a monotone non-decreasing continuous function of x which takes values between 0 and 1. but it doesn’t have to occur. A probability distribution function of a continuous random variable is defined by F(x) = P(X < x) = ∫f(x) dx for -∞ < xi < x. f(x) is called the probability density function where always f(x) ≥ 0.xn} with single probabilities P(X = xi) = pi (i = 1. f(x) = dF(x)/dx = F’(x). 19 . lim F(x) = F(+∞) = ∫f(x)dx = 1 with -∞ < x < +∞ and lim F(x) = F(-∞) = 0. if F(x) is continuous differentiable. Main characteristics of F(x) and f(x) are: 1. For x1 < x2: F(x2) – F(x1) = P(X < x2) – P(X < x1) = P(x1 ≤ X ≤ x2) = ∫f(x)dx for x1 < x < x2. 2. The formula defines the area under the function f(x) between the values x1 and x2 of abscissa.n. 3.….…) is given by F(x) = ∑P(X = xi) = ∑pi for all xi < x.

Its probability density is given by f(x. probability density 20 . Weibull probability distribution. µ.…. The dispersion of a discrete random variable X is defined by σ2 = D2X = E(X – EX)2 = ∑(xi . 2. Most regression methods and multivariate statistical procedures are based on the assumption that random variables to be analysed follow a Gaussian probability distribution function.2 Normal and skewed probability distribution functions The most important probability distribution of a random variable X is the Gaussian probability distribution. Special continuous probability distributions are: Continuous equal probability distribution. Often. For this reason. Poisson probability distribution. binomial probability distribution. Special discrete probability distributions are: Discrete equal probability distribution. geometrical probability distribution.µ2 for -∞ < x < +∞. exponential probability distribution.…. The coefficient of variation of a random variable X with µ ≠ 0 is defined by ν = σ/µ (%).To solve practical problems it is sometimes impossible to determine the probability distribution function of a random variable X. The expectation of a continuous random variable X with probability density f(x) is defined by µ = EX = ∫xf(x)dx for -∞ < x < +∞.∞. and dispersion (variance) D2X or σ2.µ)²/2σ²dx.µ)2pi for i = 1. The expectation of a discrete random variable X which takes values xi with probabilities pi belonging to it is defined by µ = EX = ∑xipi for i = 1. µ. logarithmic Gaussian probability distribution.µ)2 f(x)dx = ∫x2 f(x) dx . hypergeometrical probability distribution. The most important parameters are expectation (or average) of X: EX or µ. σ2) = (1/√2πσ2)e-(x .∞. The dispersion of a continuous random variable X with probability density f(x) is defined by σ2 = D2X = ∫(x . Gaussian (normal or bell shaped) probability distribution.µ)²/2σ² while the probability distribution function is given by F(x. σ2) = (1/√2πσ2)∫e-(x . a characterisation of the probability distribution function can be given by estimates of parameters of this function.

The upper panel contains the normal (Gaussian) probability distribution as well as Figure 8: Examples of frequency distributions 21 . 70 60 50 40 30 20 10 0 1.functions (frequency distributions) indicate skewed probability distribution (fig.5 3.0 2. 7).0 Figure 7: Skewed frequency distribution of a hydrological variable Figure 8 contains some probability density functions which differ in form and shape.5 2.5 4.0 1.0 3.

68 Fe 0.45 0.10 Skewness 0.03 Cu 20.47 6.04 0. mean 24.40 g.31 Cr 1.24 3.33 0.50 2.88 0. mode and median of a probability density function a simple test of normality can be carried out.28 0.67 22 .30 282. median and mode differ from each other (fig. dev.71 0.74 -0.77 0. 9).58 Median 28.96 19.74 3.50 62.04 3. median and mode In the case of a skewed probability distribution the arithmetic mean.79 5. No water quality variable will follow a normal (Gaussian) probability distribution.33 0.31 15.4 Comparisons of expectations When comparing the position of mean.01 8.13 384.95 Variance 204.20 Max 53.05 16.28 0.45 Excess -0. Figure 9: Comparison of mean. Table 7 contains a list of sample statistics of heavy metal concentrations which were observed in a freshwater lake. Statistical computations are carried out by means of SPSS.60 1.77 0. Table 7: Statistical measures of heavy metal concentrations in a freshwater lake Measure Al Mean 28.27 1.02 11.87 0.23 0.45 14.31 Std.31 1.52 Min 7.59 2.03 0.43 0.50 2.42 0.06 0. 14.76 0.40 2.04 0.94 Ni 0.10 15.18 2.06 0.40 Mode 20.71 8.10 5.14 0.00 21.77 1.00 70.74 7.18 0.04 0.30 1.19 0.60 0.15 Pb 1.86 Zn 24.35 17. the median and the mode are arranged at the same position on the abscissa.30 Range 46.37 0.10 1.64 Std.00 1.15 0.25 17.67 0.20 6.62 6. For a Gaussian distribution the arithmetic mean.00 0.76 Cd 0. error 4.79 1.54 7.17 4.17 0.91 3.40 1.00 2.32 68.13 0.2.70 2.

Empirical median: x~ 3. Statistical measures of expectation . Empirical variance: s2 3. Weighted geometric mean: lg x° Statistical measures of dispersion . Correlation measures will be given in chap. Weighted arithmetic mean: x*g 6.xmin 2. Geometric mean: x° 5. variances and correlation coefficients are often called statistical measures.averages 1.4. Range (spanning width): R = xmax . 4. Arithmetic mean: x* = 1/n⋅∑ xi 2. Empirical coefficient of variation: v = s/x*⋅100 (%) 23 .2. Empirical mode: M 4. In this chapter.4 Statistical measures Averages. measures of expectations and dispersions are presented. Empirical standard deviation: s = √s2 4.variances 1.

Statistical test procedures In sample statistics the characteristics of interest are often expressed in terms of sample parameters such as average µ or variance σ².3. The confidence coefficient ε is given by ε = 1 – α. 6. when this statistics falls into the region of rejection the Null hypothesis is rejected.05. 0. The region of rejection of the test statistic on the basis of its probability distribution and the significance level is determined. They may be expressed by the differences of averages. Hypothesis testing consists of comparing some statistical measures called test criteria (or statistics) deduced from data sample with the values of these criteria taken on the assumption that a given hypothesis is correct. The test statistic is chosen. …. The significance level α is selected. 5. In hypothesis testing one examines a Null hypothesis H0 against one or more alternative hypotheses H1. 4. 3. H2.1 Introduction A statistical hypothesis is a statement about the sample distribution of some random environmental variables.001). Other questions arise from comparing two or more samples. On the other hand. The probability of the test statistic falling in the region of rejection is equal to ε. 3. When this statistic falls into the range of acceptance. Decision: The Null hypothesis is rejected and the alternative hypothesis is accepted when the value of the test statistic falls into the region of re- 24 . To reach a decision about the hypothesis an arbitrary significance level α is selected which should be small (0. 2.2 General procedure of hypothesis testing 1. Test statistic is calculated from data set. For hypothesis testing the test criterion (or test statistics) is set up. Hn which are stated explicitly or implicitly. the Null hypothesis is not rejected. 3.01 or 0. The Null hypothesis H0 and an alternative hypothesis H1 have to be formulated. It is expressed in %-values.

3. 25 .K|/s ⋅√n. If the test statistic falls into the region of acceptance of the Null hypothesis. the stronger the confidence of the test. The Null hypothesis H0: m = K is tested against the alternative hypothesis H1: m ≠ K. Test statistic: tcalc = |x* . If t* < t(95). The test statistic is chosen: t = |m . then a significant difference exists between m and K. If t(99. x*. n – sample size.05 is selected. H0 cannot be rejected. then a difference between m and K cannot ascertained. table 8). Decision: Acceptance if tcalc < ttab. The Null hypothesis is accepted if the value of test statistic does not fall into the region of rejection. The power of the test depends on sample size n.sample mean. Rules of decision: 1. If t(95) ≤ t* < t(99). µ0 – expectation value of the ensemble. otherwise rejection (cf. A change of test procedure can lead to other (sharper) results of hypothesis testing. Prerequisite: n. µ0.µ0|/s⋅√n.jection. 3. that means tα/2 < t < t1-α/2 . 2. 3. The significance level α = 0. The bigger the sample size (more information is available). then there is probably a difference between m and K.4 Selected test procedures t – Test (Student – Test) Goal: Comparison of a sample average with a standard value. where x* .9) < t*. then a high significant difference exists between m and K. s. The rules of decision can be adapted to all test procedures.9). 4.3 Rules of decision From sampled data an average m was calculated and is now compared with a fixed number (standard value) K. s – standard deviation. If t(99) ≤ t* < t(99.

Decision: If tcalc < ttab.17 3.1/0.6|/0.60 4.96 5.92 5.29 After waste water input in a river DO measurements were carried out to check the water quality and to answer the question wether the river water fulfils the requirements for water quality class II after LAWA regulations. The sample average has to be accepted.9 mg/l.62 31.30 3.96 1.98 2.04 4.787.33 3.37 2.34 3.31 3.84 4.9) 636.14 4. When testing the average by t-test it turns out that the average of the sample differs not significantly from the standard value.39 3.9) = 3.86 5. Interpretation: The absolute value of average is smaller than the standard value of LAWA.78 4.59 4.58 P(99.71 3. then accept x*.85 2.6666.3×5 = 1. n = 25.58 2.15 2.18 2.23 2. the waste water input leads to a lower water quality but 26 .36 3.68 2.01 1. t(95) = 2.61 6.12 2.060.60 12. t(99.50 3.63 2. Comparison: tcalc and ttab for f = n – 1 = 24: tcalc = 1. t(99) = 2.03 3.60 2.97 1.59 2.3 The test statistic holds tcalc = |x* .92 8. x* = 5.57 2.96 P(99) 63.10 2.78 2.9 .45 2.31 2.71 4.88 2.67.08 2.06 2.18 2.59 2. µ0 = 6 mg/l.26 2.50 3.3⋅√25 = 0. s = 0.98 1.92 3.96 1.92 2. Of course.Table 8: Table of t – Test (according to Kaiser and Gottschalk 1974) f=n-1 1 2 3 4 5 6 7 8 9 10 12 14 16 18 20 50 100 200 300 500 700 ∞ Example: P(95) 12.25 3.02 3.µ0|/s⋅√n: tcalc = |5.85 3. This means the water quality standard of LAWA is not fulfilled.41 5.32 4.30 3.725.97 1.66 9.

otherwise rejection (cf.56 2. x** – second sample mean.96 4.12 2.41 4.91 4. table 9a – 9c).45 2.99 1. Decision: Acceptance if Fcalc < Ftab.61 2.there is no significant difference to the quality standard value.49 4.62 1.21 10 241.41 2.53 5.07 1.84 5 230.30 2.01 1.63 4.39 2.68 2. In the case of x* = 5.91 2.51 1.19 2.69 2.07 2. where x* .60 2.24 4.36 2.01 1. The test statistic holds: F = (s*/s**)2 ≥ 1. Comparison of variances (F – Test) Goal: Evaluation of standard deviations of two homogeneous data sets.61 4. n** – second sample size. otherwise rejection (cf.75 4. s* – first standard deviation.28 2.11 2.60 4.17 4.08 4.75 1.92 1. s** is the standard deviation of the second sample.00 3.57 ∞ 254. Comparison of means The test statistic holds: t = |x* .8 mg/l exists a significant difference between the average and the standard.50 8.37 2.33 3.00 27 .13 7.24 2.05 3. s** – second standard deviation.66 2.96 2. Table 9a: Table of F – Test (according to Kaiser and Gottschalk 1974) for P(95) f2\f1 1 2 3 4 5 10 12 14 16 18 20 22 25 30 40 60 ∞ 1 161.54 2.01 6. s2.93 1.39 8.77 2.26 5. Prerequisite: n1. n* – first sample size.16 2.51 10.x**|/sd ⋅√n*⋅n** / (n* + n**).first sample mean.53 2.71 2.35 2.85 2.35 4.30 4. n2.84 1.44 8.4 18.78 1.84 1.30 9.0 19. where s* is the standard deviation of the first sample. n-1 – degrees of freedom and sd = √((n*-1)s*² + (n**-1)s**²)/(n*+n**-2) Decision: Acceptance if tcalc < ttab. table 8).39 1.2 19.80 4.9 19. s1.13 2.71 6.71 1.83 20 248.49 2.74 5.66 5.77 2.54 2.3 19.30 2.

82 4.17 2.13 4.5 mg/l.6 51.10 7.91 8.86 8.2 48.95 20 6208 99.63 2.9) From laboratory analysis of water quality exist two small data sets of BOD data with x11 = 30.28 4.54 10.37 3. x14 = 30. x24 = 30.25 3.25 3.31 10 6.64 5.10 10 6056 99.51 3.97 2.87 20 6. 56 4. s* = 0.9 mg/l.71 15.12 15. x** = 30.5 mg/l.04 9.70 2.87 3.80 3.06 3.60 1.638 .55 4. x13 = 30.20 1.67 3. n** = 7. s** = 0. x27 = 30.94 2.91 3.33 8.69 3.61 11.31 7.45 26.04 4.34 3.13 2.38 5.Table 9b: Table of F – Test (according to Kaiser and Gottschalk 1974) for P(99) f2\f1 1 2 3 4 5 10 12 14 16 18 20 22 25 30 40 60 ∞ f2\f1 1 2 3 4 5 10 12 14 16 18 20 22 25 30 40 60 ∞ Example: 1 4052 98.25 4.99 3.29 8.8 mg/l.2.00 2.42 2.94 7.20 16.42 4.49 3.14 2.71 29.90 2.26 10.49 34.01 1.55 4.2×105 999.78 6.90 1.04 21.02 9.08 6.70 3.83 5 5764 99.81 5. x22 = 30.82 14.1×105 999.07 4.59 2.27 6.14 16.4 mg/l.5 126.51 3.231.59 4.19 5. x25 = 30.37 2.25 ∞ 6366 99.98 11.69 14.80 2.5 mg/l. n* = 5.83 2.05 3.81 2.97 10.50 26.05 23.14 47.08 2.86 3.51 3.38 3.40 27.3 134.80 6.70 4.56 7.4 129.60 4.4.5 74.46 9.25 28.41 3. x26 = 30.46 6.39 5.89 5.75 10.76 4. x23 = 30.5 123.86 3.75 7.77 7.81 6.48 8.31 2.1 mg/l.02 5 5.06 4.38 13.4×105 999.36 3.05 26.1 mg/l.5 44.04 18.00 ∞ 6.53 8.15 2.64 17.57 2.30 3.85 4.64 1 4.2 mg/l.8×105 999.89 13.53 2.40 5.02 3.4 mg/l.94 3. 28 .28 6.99 4.23 14.3 mg/l. x15 = 29.95 7.75 2.38 14.23 1.55 2.1×105 998.5 46. x12 = 30.40 7.12 21.16 25.89 7.2 mg/l and x21 = 30.23 3.00 Table 9c: Table of F – Test (according to Kaiser and Gottschalk 1974) for P(99.44 4.12 13.76 5. x* = 30.80 1.53 5.2 167.40 5.10 3.29 12.

715 1.78.553 2.895 1. s.329 2.x*)|/s)⋅√n/(n-1). F(99.051 2.256 2.903 1. otherwise rejection (cf.982 2.265 3.730 1.447 2.324 2.Test statistic: F = (s*/s**)2 ≥ 1: F = (0. Table 10: Table of r – test (according to Kaiser and Gottschalk 1974) f=n-1 1 2 3 4 5 6 7 8 9 10 12 14 16 18 20 50 100 200 300 500 700 ∞ P(95) 1.920 1.576 P(99.414 1.951 1.291 2.205 2. F(99) = 9.412 2.874 2.638/0.570 2.279 3.6281 ≥ 1. for f* = n1 – 1 = 3.730 2. degrees of freedom f = n – 2.76.385 2.812 2.960 P(99) 1.572 2.9) 1.291 29 . where x+ is to be expected as an outlier.540 2.937 1.926 1.231)2 = 7.409 1.529 2.294 2. F(95) = 4.956 1.166 3.414 1.870 1.885 1.208 2. The test statistic: r = (|(x+ . x* is the expectation of the sample. s is the standard deviation of the sample.566 2.178 2.918 2.958 1. 460 2.70.6281. Prerequisite: Data set n > 3.447 2. f** = n2 – 1 = 6: Fcalc = 7.283 3.848 1.05.935 1.432 2.678 2.271 3. table 10).227 3.564 2.9) = 23.757 1. Choice of significance level α = 0. x*.616 2. Decision: Acceptance if rcalc < rtab. Decision: If Fcalc < Ftab: No difference between standard deviations. and n – sample size. testing of homogeneity of data set.348 2. Outlier – Test (NALIMOV-Test) Goal: Detection of outlier within a data set.990 3.931 1.958 1.959 1. Comparison: Fcalc < Ftab.142 2.645 1.959 1.910 1.814 1. Interpretation: Both data sets can be combined.

r(95) = 1.9 mg/l.1 mg/l. then accept x5: 1. r(99) = 1.638 .752 Comparison: rcalc and rtab for f = n – 2 = 3: rcalc = 1. x* = 30.567×1.757.Example: From laboratory analysis of water quality exist a small data set of BOD data with x1 = 30. x3 = 30.638)√5/4= 1. Result and interpretation: The value x5 is not an outlier and belongs to the data set. The data set itself seems to be homogeneous. s = 0. The last value is expected to be an outlier.752 < 1.2.5 mg/l. r(99.2)|/0.752. n = 5 Test statistic: r = (|(29. Decision: If rcalc < rtab.2 – 30.2 mg/l.118 = 1.9) = 1. x2 = 30.757. x4 = 30. In the case that a value has been found as an outlier the average and variance have to be re-calculated and tested again. That would mean the data set is inhomogeneous.982.638)⋅√5/(5-1) = (1.4 mg/l. x5 = 29.918. 30 .0/0.

0 Temp Figure 10: Scatterplot of hydrological variables 2. yn) (or n-tupels of data) which can be considered as realisations of a two-dimensional (or n-dimensional) random vector (X.0 5.0 10.0 8. y1).0 7. (xn. Positive relationship: Increasing values of X and increasing values of Y. Goal of a simple or multiple linear regression analysis is the determination of a linear relationship between two or more measurable (or observable) variables or characteristics X and Y of a hydrological system.8 7. 31 . No relationship between X and Y (e.8 8. parallels to the axes).0 20. 10).0 30. g.4. (x2. 2.0 15. Step: Estimate the relationship (positive or negative) between variables. Y). Linear regression analysis is one of the best studied statistical methods. 4. 9. The measurement values of size n consist of n pairs of data (x1. 3.1 Steps of linear regression 1. y2). The relationships can be strong or weak.0 25.6 8.2 8. Linear regression and correlation analysis A regression analysis is required for problems in which stochastic dependencies (stochastic cause-effect relationships) have to be described by functions with one or more several variables.6 0. Step: Scatter-plot of variables of interest (fig.4 pH 8. Directions of relationships 1. Negative relationship: Increasing values of X and decreasing values of Y.….

By means of this formula (explanation see chapter 4. the confidence statements will be fuzzier. Using the confidence intervals of a and b a confidence region of the (mean) linear model EY = a + bx can be defined by gu < EY < go where gu = y* .0 20. Therefore.0 7.0 25.3 The power of linear regression The strength of a relationship is expressed by the empirical (linear) correlation coefficient: r = ∑(xi – x*)(yi – y*)/√∑(xi – x*)2⋅∑(yi – y*)2.0 10. 11): pH = 7.6 0.0 5.4) the next step of linear regression procedure is derived.0 30.8 7.0 observed linear 8.sy*t and go = y* + sy*t. The width of the confidence band L depend from sy* and can be calculated by L = 2 sy*t. The limits of confidence are symmetric hyperbolas around the linear regression model y* = a* + b*x. Step: Calculate the confidence region of the regression line. Step: Formulate the (linear) model equation (fig.2 8.0 15.3. They get their minimum for x = x* and increase with for other x – values.6 8.4 8. 32 .025 Temp Linear regression between Temp and pH 9.868 + 0.8 8.2 Confidence region of regression line 4. The general model of linear regression is given by y = a +bx.0 Temp Figure 11: Linear relationship between variables 4. 4.

4.5. r low. Combining data series of different water quantity or water quality variables referring to two or more measurable characteristics sets of pairs of data (x1. y1). b high.…). r high. b low. 12). Step: Calculate the power of relationship: r = 0.493 or B = r2 = 0. The calculation algorithm is presented in chapter 4. s high. Normal probability distribution of data pairs or data tupel is a (strong) prerequisite. Measures of correlation are the correlation coefficient r.or multi-dimensional stochastic vector (X.z. y2) . s low. To derive statistical characteristics of a linear regression model the following cases should be distinguished: 1. b low. the performance index B = r2 or the partial correlation coefficient rxy. s low.4. 4. 120 100 80 60 40 20 Y 0 -20 0 20 40 60 80 100 120 X Figure 12: Scatterplot of a bivariate relationship. 3. These sets of data can be seen as realisations of a two. (x2. yn) or n-tupel of data will be obtained (fig. s high. r low. (xn. 2. r very low. The power or intensity of such a relationship is expressed by correlation. b high. Y. 33 .243.4 Empirical Covariance and statistical measures of correlation A correlation analysis answers the question about the strength and direction of a linear (but not severe functional) relationship between two or more variables.….

the difference between arithmetic mean and xi will be positive. In principle. By normalisation of sxy with empirical standard deviations sx und sy one gets the empirical coefficient of correlation rxy: 34 . s xy = n − 1 i =1 n − 1 i =1 sxy can be positive or negative. A new data series with n pairs of data (xi. n is formed by two variables {X} and {Y}. the difference between arithmetic mean and xi will be negative. a negative covariance characterises a relationship where big values xi are connected with small values yi mostly and vice versa. …. The information content is high but cannot be extracted very clearly (fig. 120 100 80 Y 60 40 20 0 120 100 140 100120 60 80 20 40 80 60 40 20 X 0 0 Z Figure 13: 3-D scatterplot of variables Such relationships are characterised by statistical measures which are denoted as correlation measures. 13). yi). For big values of xi. i = 1. This is also valid for data yi. For this reason. For small values of xi. arithmetic means and empirical variances of data series are used: x* = 1/n ∑xi and y* = 1/n ∑yi sx2 = 1/(n-1) ∑(xi – x*)2 and sy2 = 1/(n-1) ∑(yi – y*)2. The empirical covariance sxy will be calculated as follows: 1 n 1 n ∑ ( x i − x ) ⋅ ( yi − y) = ∑ ( x i y i − nxy ) .A visualisation of a relationship between three variables is possible but in some cases not really helpful.

rxz⋅ryz)/√(1 – rxz2)(1 – ryz2) rxz. normal probability distribution not necessary) rS = 1− 6 ⋅ ∑ ( xi − y i ) n ⋅ (n − 1) 2 i =1 n 2 =1− 6 ⋅ ∑ Di i =1 2 n 2 n ⋅ (n − 1) Table 11 contains data and an explanation of the the ranking procedure for a SPEARMAN-test. Empirical bivariate correlation coefficient r = ∑(xi – x*)(yi – y*)/√∑(xi – x*)2⋅∑(yi – y*)2 Performance index (coefficient of determination) B = r2 Partial correlation coefficients rxy. 35 .rxy⋅ryz)/√(1 – rxy2)(1 – ryz2) ryz.z = (rxy .rxy⋅rxz)/√(1 – rxy2)(1 – rxz2) Multiple correlation coefficients x. Statistical measures of correlation between two or more hydrological variables are mainly based on the assumption that the data sets are subsets of Gaussian distributed data sets. ⋅ sy sx Because of sxy = syx also rxy = ryx is valid.x = (ryz . z) Rx. z → x = f(y.yz = (rxy2 + rxz2 – 2⋅rxy⋅rxz⋅ryz)/(1 .y = (rxz . The rank correlation procedure functions without assuming a normal probability distribution of the data set to be analysed.ryz2) Multiple performance index Bx. rxy is a measure of strength and direction of a linear relationship between hydrological variables X and Y. yz = √ rxy2 + rxz2 – 2⋅rxy⋅rxz⋅ryz)/(1 . y.r xy = s xy .ryz2) B SPEARMAN’s rank correlation (Valid for small sample size.

3 0.3 0.5 2 2 yi 4 6 2 10 8 12 5 3 9 11 R(yi) 3 5 1 8 6 10 4 2 7 9 Di 2. For n > 30 the table of standardised normal probability distribution should be used: rSTab(95) = 0. Result and interpretation: Between both data sets exists a relatively strong negative correlation. the example shows that for each significance level rS ≤ rStab is valid.5 9 -2.8 Comparison of rS and rStab (positive values only): For n ≤ 30 the table of probability values of rS has to be used.25 6.5 10 5. 36 .8 0.8 1.5515.5 0.25 25 49 ∑ 297 Result: rS = -0.Table 11: Data and procedure of rank correlation xi 0. Decision: If rS ≥ rStab.5 7.5 4 2 9 7. rSTab(99) = 0.8667.7333. rSTab(99.5 0.25 4 64 25 30.5 -2 -8 5 5.5 2.25 81 6.5 -5 -7 Di2 6. then reject rS.4 0.9 0.1 0.3 R(xi) 5.9) = 0.

238 0.2194 -1.6972 2.0 Temp Figure 14: Linear and nonlinear regression curves Table 12: Results of nonlinear regression models Model LIN LOG INV QUA CUB COM POW S GRO EXP LGS B 0. 13).1219 2.0 7.0 30.0030 0.0635 7.243 0.1270 37 b1 0.0 10.4 8.4486 7.8731 7. pH 9.194 0.6 composed power S-shaped growth exponential logistic 8.0030 0.0005 .8731 0.308 0.0310 0.238 0.1097 7.3568 8.0 15.2875 -0.0252 0.1544 0.9970 b2 b3 0.8683 7.0030 0.2 8.0 25.198 0.432 0.0022 -0. Step: Find out other model types if the linear model is insufficient (fig.8 7.2131 1. Figure 14 contains some standard nonlinear regression models computed by means of SPSS.238 0.6798 8.0 observed linear logarithmic 8. Nonlinear regression analysis In the case that a linear regression model is not valid or insufficient other regression models should be tested.0191 0.0 20.0262 -0.238 b0 7.158 0. From this statement the following step of (linear) regression procedure is derived: 6. The results are presented in table 12.0 5.6 0.5.156 0.8 invers squared cubic 8.

where n is called the order of the polynomial. i = 1. Figure 15: Examples of polynomial regression 38 . For 6th and 7th order polynomials the performance will be B = 1.2% only.When comparing the performance indexes of these standard models the “best” statistical model is the cubic one.1 Polynomial regression The basic model is given by y = a0 + ∑ ai xi.…. 5. The remaining 56.8% are not described by the model. But this model represents the data cloud by 43. Figure 15 shows polynomials of different order. As an overall outcome of this analysis all of these models should be rejected and other types of nonlinear models should be investigated. Each of the polynomials represents the given data set by a relatively high degree of performance.

DO(t) = a0 + a1TW + a2Q + a3BSB or DO(t) = a0+ a1TW + a2Q + a3BSB + a4TW² + a5Q² + a6BSB² + a7TW³) or models derived from control theory (e. A continuous dynamic process is described by a time discrete model applying the z-transformation on a difference equation. 25m) and the approximating graphs are presented. stochastic transfer method). In an extended form this method is called Fourier analysis (see chapter 7).By comparing the graphs different interpretations are possible. g. Disadvantages are the high number of coefficients and sometimes physically not realistic results. The equation represents the simplest form of periodic regression or so-called Fourier polynomial. Figure 16: Periodic regression of water temperature in a reservoir 39 . In figure 16 water temperature of a reservoir at three depth levels (0m. 10m. The advantage of polynomial regression is to get an algorithm for calculation of the existing nonlinear relationship between hydrological variables. g. The best models are not the ones where the graphs are joining all data points. G(z) = B(z-1)/A(z-1) +ξ(z) 5.2 Periodic regression The basic relationship is given by y = a + b1⋅sin x + b2⋅cos x. For the polynomial of 7th order the graph indicates negative values which do not exist. Other model types used in water quality management are multiple linear or nonlinear regression models (e.

the disadvantage is that the functions are valid for fixed cycling periods only.5374x + 2.3 Trend functions Medium-term and long-term temporal and spatial developments (trends) of hydrological variables can be estimated by simple.8938 TeK0030 SPK0010 SPK0020 sampling point Hv0190 Hv0200 Figure 18: Linear trend of phosphate phosphorus in a channel 40 .00 25014 y = 0.9501 SPK0010 SPK0020 sampling point Hv0190 Hv0200 Figure 17: Polynomial trend function for BOD in a river Other examples of linear and nonlinear trend functions are presented in figures 18 to 20. Parameter estimation is done by the method of least squares (MKQ). The advantage of this family of regression type functions is the visualisation of a cycling process.0166x + 0. 5.0908x 2 . explicitly given functions.05 0.0 BOD (mg/l) 1.5 1. Figure 17 shows the development of BOD in along a river stretch following a polynomial of 2nd order.0. 0.5 2.5 0.15 0.0 0.6386 R 2 = 0.It can clearly be seen that water temperature (and all other hydrological cycling variables) can be approximated very well by periodic functions.20 o-PO4-P (mg/l) 0. 2.0854 R2 = 0.10 0.0 TeK0030 y = 0.

809 TeK0030 SPK0010 SPK0020 Hv0190 Hv0200 sampling point Figure 19: Quadratic trend function of water flow On the other hand. 20). But the regression model is not able to compensate the positive jump in water flow because it works with fixed parameters (coefficients). The deviations of regression line from measurements are small. 70 60 50 flow (m3/s) 40 30 20 10 0 25014 y = 3. For the same river stretch.6 4 5 9 TeK0030 SPK0010 SPK0020 s a m p lin g p o in t H v0190 H v0200 Figure 20: Quadratic trend function of chlorophyll-a 41 .The linear function (also denoted as a polynomial of 1st order) is able to follow the increasing trend of phosphate phosphorus load due to waste water input in a low flow channel with acceptable accuracy.1 1 5 R 2 = 0 .053x + 51.3914x2 .18. The stationary or uniform flow conditions of the first part of the water body are disturbed now.4 6 2 7 x 2 .4 2 2 1 x + 6 7 . The reason for this are changing hydraulic conditions and increasing values of water flow. Therefore. for the same river stretch the trend of chlorophyll-a is expressed by a 2nd order polynomial again (fig.117 R2 = 0. Chlorophyll-a (µg/l) 80 60 40 20 0 25014 y = 1 . 19) shows stronger deviations after conjunction of the main river with a channel. another regression model should used.6 . Considering the performance index the graph should be acceptable. the approximating 2nd order polynomial of water flow (fig.

Table 13: Trend functions of water quality in the River Havel Water quality indicator Water flow Temperature Conductivity Chloride DO BOD CSV NH4-N NO2-N NO3-N O-PO4-P TP SiO2 Suspended matter Chlorophyll-a Inorg.3858 0. All polynomials are of 2nd order.1971 0.5669 0.8126 0.4264 0. matter Trend polynomial polynomial polynomial polynomial polynomial polynomial polynomial exponential exponential exponential exponential polynomial polynomial polynomial polynomial polynomial polynomial R² 0.8888 0. part of biomass Loss of org.0227 0. (a1) . Interpretations of trend functions can be given as follows: Linear trend: y(t) = a0 (t) + a1 (t) x(t). But the trend follows the computed polynomial.1418 P (95%) + + + + + + + + + + + + - As can be seen from table 13. (Interpretation of parameters: (a0) . The following table gives a survey on trend functions used to estimate the developments of water quality in a river (table 13). 19 for water flow because of some disturbances caused by hydrophysical phenomenon.8683 0.6177 0.mean rate of change.The performance index is lower than before in fig.6742 0.4879 0.6032 0. Taking into account the variations in chlorophyll measurements the trend polynomial is quite acceptable. (Interpretation of parameters: (a0) – mean initial value. The signs in the last column indicate significance on a 95% probability level.0382 0. (a2) – mean process acceleration) 42 . polynomial and exponential trend functions are sufficient to describe the changing water quality mathematically.0822 0.mean initial value.7611 0.4746 0. (a1) – mean rate of change) Squared trend: y(t) = a0 (t) + a1 (t) x(t) + a2 (t) x2 (t).

Exponential trend: x(t) = x(0) e . k – rate of change... 5. (Interpretation of parameters is mostly impossible).. E – random quota).4 Comparison of regression functions To describe one and the same data set different nonlinear models can be applied.Polynomial trend: y(t) = a0 (t) + a1 (t) x(t) + a2 (t) x2 (t) + . + an (t) xn (t).kt + E.. (Interpretation according to 1st order kinetics: x(0) – initial concentration value. Figure 21: Comparison of different regression functions for the same data set 43 .

y )² .y )2.y )² / ∑ ( y . As can be seen in part H.By comparing the initial and the final reach of regression functions the best functional relationship will be selected (fig. 44 . An evaluation of the quality of fit can be given by: Linear coefficient of determination (performance index): ˆ R2 = B = ∑ ( y . ˆ Residual sum of squares: SR = ∑(yi . or Residual dispersion: s2 = SR/(n – m – 1) (n – number of data.y )2 / (n-1) sy² ). the middle range of all computed models shows very small variations while the initial and the final part of the graphs show a spreading of curves. ˆ Nonlinear performance index: Bnl = 1 .(∑ (y . 21). Also the linear model seems to be suitable. m – number of parameters).

Time series analysis The distinction between discrete and continuous variables is not a clear dichotomy because continuous processes (seen from a physical point of view of understanding nature) will be observed at discrete time events. Therefore.0 0 10 20 30 40 50 60 70 Figure 23: Approximation of a time varying process by a function 45 .0754 t + 0.5 0. NO3-N raw data are described by a polynomial trend as follows: NO3N(t) = 1. water related processes are represented by time varying signals. An exact mathematical (or functional) description of random fluctuations is not possible.0028 t2 .0. In figure 23. The function describes more or less the mean behaviour of the process. 3. 22). x(t) Ғ y(t) Figure 22: Schematic diagram of a transfer process Therefore.8987 – 0.5 2.5 3.00003 t3.0 . 6.0 1.1 Dynamic behaviour of time series Freshwater ecosystems may be seen as switching networks where inputs are transformed into outputs by an operator Ғ which describes the transient behaviour of ecological processes (fig.0 2.6. The overall operator Ғ transforms input signals into output signals: y(t) = Ғ x(t) where the signals will be smoothed (damped).5 1. and there exists some redundancy between input and output signals. mostly random variables are observed.

not measurable (stochastic) disturbances as well as by measuring errors. Statistical characteristics of stationary random processes can be expressed by 1. Only some statistical statements on the future development of the process X(t) can be given: Prob(X(tn+1) ≤ x) ≡ P(x). stationary random processes can be investigated on different time intervals between .6. 6. 8 and 9). Spectral power density function Sxx(ω) A time varying process is expressed by a stochastic signal X(t). Probability density function p(x) of signals X(t). pp. Disturbances. transfer functions. Auto-correlation function Φxx(τ).∞ < t < + ∞. For each time stroke tn one measured value Xn(t) will be obtained. Therefore. by coherency functions as well as by wavelets.3 Stationary processes Because of time lags between input and output processes stationary processes will then be reached when all transient processes are decayed. 2. They are characterised by measurable inputs. When the process is described by an analytical (deterministic) function f(t) then the time behaviour can be predicted completely. Mathematical descriptions of hydrological time series can be represented by time domain functions (cf. If statistical characteristics do not change in time. some statistical characteristics of signals should only be grasped. In the frequency domain hydrological time series are represented by Fouriertransforms of correlation functions. 46 . Process averages and dispersions will not change so much in time. Therefore. input signals and measurement errors will be overlaid and will produce output signals. The further development of the process can be predicted only for a short time interval.2 Description of time series in time and in frequency domain Hydrological systems can be seen as stochastic transfer systems described by system state variables and parameters. 3. then these processes are called stationary processes.

It cannot be seen whether a process contains lower and/or higher frequencies. multiple probability distribution functions are necessary to describe the time varying process behaviour. Therefore. Important expectations are linear average: E(x) = ∫ x⋅p(x) dx and squared average: E(x2) = ∫ x2⋅p(x) dx. The auto-power spectrum Sxx(ω) of x(t) is the Fourier transform of the ACF: Sxx(ω) = Sxx(-ω) = 1/2π⋅∫Φxx(τ)⋅e-jωτ dτ. x2)⋅Δx1⋅Δx2. x2 < X(t2) ≤ x2 + Δx2) ≈ p(x1. The auto-power spectrum of an ecological process or signal is visualised by a periodogram.4 Correlation and spectral functions The probability density function gives an information about the probability of the process X(t) that the amplitude at time t lies between x and (x + Δx): Prob(x < X(t) ≤ x + Δx) ≈ p(x)⋅Δx. The auto-correlation function (ACF) gives information on the inner correlation between data with the distance τ on the time axes: ∫∫(x(t)⋅x(t + τ)⋅p[x(t). No statements on changes of X(t) within time intervals Δx are made. It gives the 47 . 6. y(t + τ)]dx(t)⋅dy(t + τ) ≡ Φxy(τ). The probability that X(t) at time t = t1 lies between x1 and x1 + Δx1 and at time t = t2 = t1 + τ between x2 + Δx2 (after τ time units) is approximately given by Prob(x1 < X(t1)≤ x1 + Δx1. x(t + τ)]dx(t)⋅dx(t + τ) ≡ Φxx(τ) The cross-correlation function (CCF) gives information on the statistical correlation of two different processes X(t) and Y(t): ∫∫(x(t)⋅y(t + τ)⋅p[x(t). The parameter τ gives information on the statistical coupling of data x(t1) and x(t1 + τ). It represents the dominant frequency of the process. By transforming the time correlation functions into the frequency domain one gets the auto-power spectrum or the cross-power spectrum.or Prob(a < X(t) ≤ b) = ∫ p(x)dx. The Gaussian distribution with a bell-shaped density is one of the most important probability density distributions where p(x) = 1/√2πσ⋅exp-(x-x*)2/2σ2.

spectrum of a stationary signal which is a distribution of the variance of the signal as a function of frequency. 48 . The auto-covariance function is the time domain counterpart of the periodogram. Small fluctuations are not dominant and can be neglected. Significant periodicity in the signal will induce a sharp peak in a periodogram. Each peak represents the part of the variance of the signal that is due to a cycle of a different period or length. The low frequency component is responsible for the general tendency of the indicator. The periodogram of water temperature (figure 24) shows a single distinct peak which indicates the major cyclic behaviour. Figure 24: Periodogram of water temperature of the Lower Havel River Figure 25: Periodogram of pH The periodogram of pH in figure 25 shows that the highest variance is displayed by a low frequency. The frequency components that account for the largest share of the variance are revealed.

Figure 26: Periodogram of dissolved oxygen.Only long term changes are responsible for the overall observed behaviour of the indicator. The periodogram represents low frequency components which exhibit the highest variances and some small fluctuation at higher frequencies. The periodogram of pH is similar to that of dissolved oxygen presented in figure 26. This means that the general tendency of this indicator is determined by long term changes. For the indicator of phytoplankton biomass the periodogram is shown in figure 27. High variances at low frequencies are observed. Figure 27: Periodogram of chlorophyll-a The cross-power spectrum Sxy(ω) of two stochastic ecological processes x(t) and y(t) is the Fourier transform of the CCF: 49 . Two distinct peaks reveal two cycles of different periods and amplitudes. They determine the long term behaviour of the indicator.

50 . It is calculated on the base of periodograms of both signals by Coxy(ω) = |Sxy(ω)|2/Sxx(ω)⋅Syy(ω). The coherency function Co(ω) is a measure of synchronicity of (two) signals.Sxy(ω) = 1/2π⋅∫Φxy(τ)⋅e-jωτ dτ It is a complex function. where H(ω) is the Fourier transform of h(τ) which distorts Sxy(ω) to ~Sxy(ω). where |Sxy(ω)| = √Re(Sxx(ω))2 + Re(Syy(ω))2 and for the phase shift between both signals ϕ(ω) = arc tan (Im(Sxy(ω))/Re(Sxy(ω))) is valid.α). The limitation of CCF is considered by what is called a window function h(τ): ~ Sxy(ω) = 1/2π⋅∫Φxy(τ)⋅h(τ)⋅e-jωτ dτ = Sxy(α)⋅H(ω .

In fig. State transitions take place on intervals (ai (t).7. Time delays in the courses of action of system components lead to retardations in the changes of system states and to redundancies in the data transfer. 51 . physiological parameters and others. Water quality processes are characterised by different time parameters such as time delay. Switching processes of input variables take place at certain different time events. hydrological variables vary often with high frequencies because of random changes of internal system states and/or fluctuations of variables. altering. bi (t)) with probability densities wi (t) of time delays of system variables and probabilities pi (t) for each realisation of a state transition: For ai (t) ≤ wi (t) ≤ bi (t): pi (t) = ∫ wi (t) dt. 28 some examples of cycling processes with different periods and frequencies are presented. They lay out different time and frequency behaviour of the water quality (hydrological) processes. threshold values. On the other hand. Analysis of cycling processes Cycling processes in hydrology are natural. 200 3 Q (m /s) 100 0 1000 600 200 30 Tw (°C ) E (μ /cm C S ) 15 0 20 10 0 10 pH O2 (m g/l) 8 6 1985 1987 1989 1991 time (a) 1993 1995 Figure 28: Cycling water quality indicators Such processes are caused mostly by natural external driving forces but also by natural internal driving forces.

Ti is the period of cycle. Table 14: Classification of hydrological systems Classification Characteristics of signals Modulation Quantification Adaptability of system adaptive Remark Change of amplitudes. wi (t). change of inputs and disturbances. 52 . Mostly. A harmonic process is described by a trigonometric function x(t) = x0cos(ωi + ϕi) with -∞ < t < +∞.1 Introduction Mathematical equations describe either the time dependency (function of time t) which is called description in the time domain or the frequency dependency (function of frequency ω or cycles per time unit) which is called description in the frequency domain”. frequencies and phases of signals Discretisation of time domain of amplitudes and duration interval of signals Change of systems states.A state transition can be characterised by a quadrupel Θi (t) = {ai (t). In the case of correct reproduction and forecast of a process it is called a deterministic one. ωi = 2π/Ti is the basic cycling frequency (circle frequency). pi (t)). cycling (or periodic) processes in hydrological context are caused by natural external driving forces. bi (t). and ϕi is the shift of phase. A classification of hydrological systems can be given by its characteristics of signals and by the type of change of dynamic properties (table 14). Otherwise it is called a non-deterministic or stochastic (random) process. Another distinction can be made by the ability to reproduce a time-varying process. aperiodic hydrological processes are mainly influenced by artificial (man-made) external driving forces. change of parameters. no change of ecosystem structure non-adaptive 7. On the other hand. Each deterministic process x(t) is characterised by its time development (or behaviour) x = x(t) with -∞ < t < +∞. change of system structure fixed parameters.

with . chemical and biological environmental or ecological variables respectively. ω0 = 2π/T0 – frequency of the basic cycle. Then.∞ ≤ i ≤ + ∞. Table 14 gives an example on the usefulness of this method for physical. Figure 29 shows a Fourier approximation of global radiation process.7. The amplitudes ai and bi are calculated as follows: ai = 1/T0 ∫x(t)⋅cos(iω0t)dt. The Fourier polynomial is an approximation which represents the minimum mean squared deviation of a cycling process. 2π] by ϕi = arc tan bi/ai. amplitude (f=1/352d) Figure 29: Fourier approximation of global radiation Fourier approximations can be used to explain the variance of a cycling process by its basic frequency. the amplitudes of the approximating function are given by Ai = √ ai2 + bi2. T0 – period of cycle. It can be seen that the approximation is shifted from the real frequencies due to a fixed frequency. This fact causes some error.2 Fourier analysis A periodic process with period T0 is described by a Fourier series of the form x(t) = a0/2 + ∑ai⋅cos(iω0t) + ∑bi⋅sin(iω0t). bi = 1/T0 ∫x(t)⋅sin(iω0t)dt and a0 = 1/2T0 ∫x(t)dt. The 3rd column of table 14 contains the values of total variance of the time series under consideration. Phase shifts are given in the interval [0. 500 450 400 global radiation (W/m ) 350 300 250 200 150 100 50 0 1996 1997 1998 1999 2000 time (a) 2001 2002 2003 2 raw data component with max. The last column contains the val53 .

7 5.6.4 Average 12.9 71.1 12.90 1. dev.5 10.0 10.25 2.35 34.0 . The best results will be obtained for physical variables.94 1.0 90.87 1.ues of variance which are explained by the dominant cycle contained in the timw series.81 1.43 33.650⋅cos((9π/180)t) .2 7.63 22. 44.6.8 35.0 + 1.7 Std.59 Variance of the yearly cycle (% of total variance) 84.74 22.35 92.41 56.6 76.415⋅sin((6π/180)t Reservoir Kličava TEMP(t) = 11.76 90.4 1.462⋅sin((6π/180)t Reservoir Neunzehnhain TEMP(t) = 11.9 11. The results of applying digital filters are consistent data series which can be used for modelling.073⋅cos((10π/180)t) .2 36.9 76.7 95.1 . Basic filter functions are derived from an ideal low pass filter: 54 . Table 14: Fourier analysis of water quality indicators Indicator TEMP Reservoir Saidenbach Neunzehnhain Kličava Slapy Saidenbach Neunzehnhain Kličava Slapy Saidenbach Neunzehnhain Total variance (%) 90.693⋅cos((6π/180)t) + 4.2 75.9 + 0.19 37.458⋅cos((6π/180)t) – 4.3 Digital data filter Digital filter function transfer sequences of input signals to sequences of output signals by compressing or decompressing noisy information contained in the measured signals of hydrological processes.88 32.98 4.37 8.0 11.0 95.62 74.80 52. followed by chemical variables.7.820⋅sin((9π/180)t) Reservoir Slapy TEMP(t) = 12. Insufficient results are obtained for biological variables.1 10.684⋅sin((10π/180)t) 7. simulation and optimisation in hydrological sciences.7.96 DO CHA Example: Approximation of water temperature of reservoirs (yearly dominant harmonic cycle): Reservoir Saidenbach TEMP(t) = 12.

To get an acceptable transfer behaviour filters of order 1 to 3 should be used only.1526. type 2 (inverse Chebyshev-Filter) |H (ω )| 2 = 1 1+ ε * ⋅c (ω ) 2 2 n with ε* = 2ε/(1-ε). ε = 0.Ideal low pass |H (ω )| 2 = 1 1+ F (ω ) 2 • Butterworth filter (power low pass) |H (ω )| 2 = 1 1+ ω 2n (Amplitude response should be as flat as possible in the pass band). type 1 |H (ω )| 2 = 1 1+ ε ⋅c (ω ) 2 2 n (ε . 55 .ripple factor (or eccentricity). Higher order filters show rippling transfer behaviours and cause nonlinear effects in the output sequences of signals. In the pass band a ripple is accepted.) • Elliptic filter (Cauer filter) |H (ω )| 2 = 1 1+ ε ⋅F * (ω ) 2 2 n (Ripples arise in the pass band and in the stop range. This leads to misinterpretations and unexplainable events within the data series. (In the stop band a ripple is accepted. Figures 30 to 33 represent the transfer behaviours of digital filters for different water quality time series. One gets the steepest transition between both frequency bands). • Tschebyshev filter. The transition from pass band to stop band is steeper than for the Butterworth filter). • Tschebyshev filter.

32 and 33) demonstrate the disturbances within the transfer process.3 0.5 2 1.2 0. 56 .5 3 standard error O2 (mg/l) 2.8 0.6 0. 30 and 31. 0.4 0.5 1 0. Tchebychev 1 filters for chlorophyll-a and for water temperature (figs.5 0 order 1 order 2 order 3 order 4 order 5 order 6 0 50 100 150 200 250 300 reciprocal of critical frequency (d) 350 Figure 30: Selection of filter order of a Butterworth filter for DO The higher order filters lead to changing (welling) transfer behaviour during the filtering process as can be seen in figs.1 0 order 1 order 2 order 3 order 4 order 5 order 6 0 50 100 150 200 250 300 reciprocal of critical frequency (d) 350 Figure 31: Selection of filter order of a Butterworth filter for pH They show this behaviour for a Butterworth filter.7 standard error pH-value 0.5 0.3.

035 0.005 0 order 1 order 2 order 3 order 4 order 5 order 6 0 50 100 150 200 250 300 reciprocal of critical frequency (d) 350 Figure 32: Tschebychev 1 filter for chlorophyll-a standard error water temperature (°C) 2.5 1 0. If the data series contains some gaps interpolation methods should be used to get a time series with equidistant data. Fig.01 0.5 2 order 1 order 2 order 3 order 4 order 5 order 6 1. 34 shows such a data series for the variable conductivity of the Oder River at Frankfurt.025 0. This is a strong prerequisite for all further steps.03 0.5 0 0 50 100 150 200 250 300 reciprocal of critical frequency (d) 350 Figure 33: Tschebychev 1 filter for water temperature The first step of digital data filtering procedures is the selection of a complete hydrological time series. 57 .02 0.015 0.standard error total chlorophyll-a (mg/l) 0.

35).4 0. 10 10 power density (conductivity) 10 10 10 10 10 6 5 power density upper bound (confidence interval 95%) lower bound (confidence interval 95%) 4 3 2 1 0 fg 0 0.confidence region should be selected. For the example a critical frequency fg = 0.1 0. 36). In case of the Oder River an elliptic filter was used to reconstruct the original time series and to get a consistent time series for modelling (fig.2 0.3 frequency (1/d) 0.5 Figure 35: Selection of critical frequency of the filter The last step consists of computation of the digital filter and reconstruction of the original data series. As confidence band the 95% .053 was used. 58 .1600 1400 1200 1000 800 600 400 1993 1994 1995 1996 1997 time (a) 1998 1999 2000 conductivity (μS/cm) Figure 34: Original data series of conductivity In the next step the critical frequency is calculated from spectral density function (fig.

NH4-N NO2-N NO3-N DO o-PO4-P pH Filter 1. The filters given in parenthesis indicate a 99% .041 1/d) 1980 1990 time (a) 2000 Figure 36: Reconstruction of a time series by an elliptic filter Table 15 shows an overview of the usage of digital data filters to produce consistent time series for modelling of freshwater ecosystems.1200 1100 1000 conductivity (μS/cm) 900 800 700 600 500 400 300 200 Raw Data Elliptic Filter (1.significance level. order (Elliptic) (Elliptic) Butterworth Elliptic Chebychev1 (Elliptic) (Elliptic) (Elliptic) Butterworth Elliptic Chebychev1 (Elliptic) Butterworth Elliptic Chebychev1 Chebychev2 Butterworth Elliptic Chebychev1 Butterworth Elliptic Chebychev1 Chebychev2 59 . All others are given on 95% significance level. order. f<0. Table 15: Digital data filters for water quality variables Indicator CHA DOC Conduct. order Elliptic Elliptic Butterworth Elliptic Chebychev1 Elliptic Elliptic Elliptic Butterworth Elliptic Chebychev1 Elliptic Butterworth Elliptic Chebychev1 Butterworth Chebychev1 Elliptic Butterworth Chebychev1 Elliptic Filter2. order (Elliptic) (Elliptic) Butterworth (Elliptic) (Elliptic) (Elliptic) Butterworth (Elliptic) Butterworth Chebychev2 Butterworth Chebychev1 Butterworth Chebychev1 Chebychev2 Elliptic Q TW Filter 3.

The following questions can be effectively answered with the help of wavelet analysis: 1. How are two indicators related on a scale by scale basis? How do they covary at different scales? The wavelet analysis imitates the windowed Fourier analysis by using basis functions (wavelets) that are better suited to capture local behaviour of nonstationary signals. It is a solution for the time scale analysis problem because it offers an effective approach to extract both the information on the time localization and the frequency content of the time series. Applying shifted and scaled versions of a wavelet function decomposes the signal into simpler components. It is the effect of the shifting and scaling proc- 60 . s) = ∫ X (t )Ψu . As a result.4 Wavelets Wavelet analysis has been proven quite useful for time scale based signal analysis. The coefficients that are obtained are a function of the location and scale parameters.7. It has the ability to decompose time series into several sub-series which may be associated with particular time scales. −∞ Ψu . s (t )dt. The wavelet transformation is a function of two variables W(u.s) obtained by projecting a signal X(t) on to a particular wavelet Ψ and is given by ∞ W (u. Are the statistical variations in the hydrological indicator homogenous across time? 4. What are the time dependent variations such as the presence of trends? 5. Are the variations from one day to the next more prominent than the variations from one week to the next? 3. What is the dominant scale of variation influencing the observed general tendency of the indicator? 2. the interpretation of features in hydrological time series may be facilitated by first applying an appropriate wavelet transform and subsequently interpreting each individual sub-series. s (t ) = 1 ⎛t −u ⎞ Ψ⎜ ⎟ s ⎝ s ⎠ which gives a translated and dilated version of the original wavelet function.

k = ∫ f (t )Ψ j .k is the father wavelet and Ψj.k = 2 −j 2 ⎛t −2jk Ψ⎜ ⎜ 2j ⎝ ⎞ ⎟ ⎟ ⎠ ∫ Ψ(t )dt = 0 where ΦJ. comprising two filters.k = ∫ f (t )Φ J .k being the coefficients for the father wavelet at a maximum scale of 2j (the smooth coefficients) and dj.k Ψ1.. The scaling filter known as the father wavelet is a low pass filter while the wavelet filter known as the mother wavelet is a high pass filter.k is the mother wavelet with the scale parameter “s” being restricted to the dyadic scale 2j.k Φ J .k (t ) k k k and can be equally represented by f(t)=Sj+ Dj + Dj-1+ … + Dj + …D1 61 . the function f(t) can be represented by f (t ) = ∑ S J .ess what makes this representation possible and is referred to as multiresolution analysis. The wavelet coefficients or detail coefficients are produced by the wavelet filter while the scaling filter gives rise to the smooth version of the signal used at the next scale.k (t ) + .k will be obtained with SJ. Given a signal X(t) of length n = 2j.k ΨJ . then d j .k = 2 −J 2 ⎛ t − 2J k Φ⎜ ⎜ 2J ⎝ ⎞ ⎟ ⎟ ⎠ ∫ Φ(t )dt = 1 and Ψ j ..k (t ) + ∑ d j . The wavelet transform is usually applied in the form of a filter bank.. + ∑ d1.k . Given the respective father and mother wavelets. Φ J . the filtering procedure can be performed a maximum of j time. Based on these coefficients. giving rise to j different wavelet scales.k being the detail coefficients from the mother wavelet at all scales from 1 to j. to the maximal scale. If a signal is projected onto a given basis function S J .

2 days. d5.where S J = ∑ S J . d4.k (t ) k and D j = ∑ d j . k Multiresolution decomposition (MRD) reveals the variations at different scales denoted by ‘‘d“. 4 days. Figure 37 shows the details of the multiresolution analysis of dissolved oxygen sampled at daily interval. The notations d1. d3. 62 . This progressive decomposition reveals the differences in fluctuations from one scale to another.k Ψ j .k (t ) . d6 and d7 reveal the variations occurring at one day. 8 days. 16 days 32 and 64 days respectively. Figure 37: Multiresolution analysis details of dissolved oxygen signal sampled at daily intervals The details reveal the high frequency variations present in the dissolved oxygen time series or provide an additive decomposition of the high frequency variation on a scale by scale basis. d2.k Φ J . It effectively shows that the lower scales are less important compared to the higher scales of variation.

the fluctuations are high throughout the year. Hence. However. This can be seen at level a7.Multiresolution analysis (MRA) filters information in the signal at different scales represented by ‘‘a“. the fluctuations are higher during the warmer months than during the colder months. At the lower scales. At the higher scales such as scale 32. the long term tendency observed in the dissolved oxygen time series is significantly influenced only by the fluctuations occurring at the higher scales and not the lower scales. Taking of from the original signal s all high frequent events the basic nature of the cycling process comes out. 38 an example of a MRA and MRD is given for longterm observations of dissolved oxygen in eutrophic freshwater ecosystem. 63 . Figure 38: Wavelet analysis of DO Figure 38 reveals that the variations occurring at a time scale of 1 day are equally of relatively low intensity and are not able to influence the general tendency observed in the dissolved oxygen signal. In fig. the fluctuations occurring at higher time scales such as scale 8 are strong enough to influence the long term behaviour of the signal. It is quite interesting to examine the variance at different scales to effectively quantify these variations.

the homogeneity of variations from one scale to the next.t ( S j ) = 1 var(w j . Table 15: Frequencies and scales of MRA and MRD MRA scale a1 a2 a3 a4 a5 a6 a7 a8 a9 a10 a11 a12 MRD scale d1 d2 d3 d4 d5 d6 d7 d8 d9 d10 d11 d12 Frequency 1 2 4 8 16 32 64 128 256 512 1024 2048 The variance of a signal can equally be decomposed by using this technique.An overview on the respective frequencies is given in table 15.t . 64 . For a signal Xt the time varying variance of the scale Sj of a wavelet coefficient Wj. This graphical representation of the wavelet variance enables the researcher to answer questions concerning the dominant scale of variation in the time series. the importance of the variations at one scale compared to the variations occurring at another scale.t ) . The wavelet variance shown in figure 39 reveals the intensity of variation from one scale to the next of the dissolved oxygen time series. the wavelet covariance decomposes the covariance between two signals on a scale by scale basis by ∞ ∑γ j =1 x ( S j ) = cov( x1.t can be calculated by σ x2.t ) . x x . 2S j Similar to the wavelet variance of a univariate signal.

00 0.50 0.2.05 U * L 1 2 4 8 16 32 U * L L U * L L U * U * U * Wavelet Scale Figures 39: Wavelet variance of dissolved oxygen with db4 65 .10 0.00 1.20 0.

S. Rubin. Brémaud. Morgan Kaufmann. J. 2004: An Introduction to the statistical filling of environmental data time series. McAleer (eds. H. Jakeman and M. J. Han. 1995: Interpolation of Irregularly Sampled Data Series . 73-82. R. Mitsch (eds. W. M. S. P. Rubin. Chichester. A. 1998: Introduction to Time Series and Forecasting.. E. Englewood Cliffs. E. P. Payne and J. 1-4. 66 . Jørgensen. New York. Latini.Literature Adorf. J. E. G. J. 2002: Mathematical Principles of Signal Processing. Hipel. I... Passerini (eds. Academic Press. Berlin. ASC. Little.). Wiley. 127. Wiley. H. and D. Academic Press. Brockwell. 1987: Statistical Analysis with Missing Data. R.). In: Wright. Springer.. 1988: Principles of Environmental Sampling. Box. Southampton. H. New York. 77. Gentili. J. and G. Amsterdam. G. R. WIT Press. In: Latini. Magnaterra. B. and M. WIT Press. 2004: Handling Missing Data..). Vol.): Modelling Change in Integrated Economic and Environmental Systems. 1983: Application of Ecological Modelling in Keith. Passerini. P. pp. London. 2006: Data Mining – Concepts and Techniques. and G. B. A.-M.A Survey. 1994: Time Series Modelling of Water Resources and Environmental Systems. Hayes (eds. und W. P. Prentice Hall. J. Jenkins and G. (ed. C. (ed. New York. Franses. and A. H. G. ASP Conference Series. Davis. Passerini (eds. A.): Handling Missing Data. S. L. Elsevier.): Statistical Methods and the Improvement of Data Quality. New York.. and D. Kamber.): Astronomical Data Analysis Software and Systems IV. In: Shaw. and R. G. ACS Professional Reference Book. In: Mahendrarajah. K. 3rd ed. E. Little. 2002. Reinsel.. Southampton. McLeod. Springer. A. T. L. Salem.. pp. pp. and G. 1999: Periodicity and Structural Breaks in Environmetric Time Series.. 1994: Time Series Analysis. 1983: Missing Data in Large Data Sets. J. A.

Springer. S. M. Woburn. H. und A.Mallat.. Müller. Reckhow. Springer.. Rebecca. Berlin. 1999: Interpolation of Spatial Data. Stein. D. Academic Press. Amsterdam. 1: Data Analysis and Empirical Modelling.. 2000: Time Series Analysis and Its Applications. S. 1995: Ecological Time Series.. H. und S. Chapra. New York. (1998): A wavelet tour of signal processing . Shumway. G. C. Springer. New York. K. Butterworth. and D. W. S. 1998: Spectral analysis of time-series data. and J. Stoffer. M. M. 1985: Freshwater Ecosystems – Modelling and Simulation. 67 . T. New York.). Elsevier. 1983: Engineering Approaches for Lake Management. R. 1999: A Handbook of Time-Series Analysis. Berlin. New York. Vol. Gnauck. H. L. M. Chapman & Hall. Pollock. Signal Processing and Dy Powell. Straškraba. 2001: Collecting Spatial Data. Steele (eds. Guilford Press. G.