
DATA AND ITS HANDLING AND PROCESSING
by
Dr. N.K. Goel,
Professor,
Department of Hydrology,
Indian Institute of Technology Roorkee,
Roorkee- 247667
Email: goelnfhy@iitr.ernet.in
goelhy@yahoo.com
Mobile: +91-9412393851
Contents
General about data handling, processing and
analysis
Plotting of Data
Computation of basic statistical parameters
Examples
Identification of trends and randomness
Interpolation techniques
General about data processing
What is processing
Necessity
Inventory of data
Classification of data
Plotting of data
Computation of basic statistical parameters
VARIOUS TYPES OF DATA
Space oriented data
Time oriented data
Relation oriented data

SPACE ORIENTED DATA

Catchment data
River data
Lake reservoir data
Station data

Further details and sources
CATCHMENT DATA
PHYSICAL (catchment area, river network), MORPHOLOGICAL CHARACTERISTICS
Topo-sheets (Survey of India)
Geological maps (Geological Survey of India)
Soil maps (NATMO)
RIVER DATA
X-SECTIONS, PROFILES, BED CHARACTERISTICS
LAKE/RESERVOIR DATA
ELEVATION-AREA-CAPACITY RELATIONSHIPS
Bed profile
STATION CHARACTERISTICS
CODE, NAME, DRAINAGE UNITS, GEOGRAPHIC
COORDINATES, ALTITUDE, CATCHMENT AREA ETC.

TIME ORIENTED DATA
Meteorological data
Hydrological data
Water quality data
METEOROLOGICAL DATA and instruments
Precipitation data - rain gauges and snow gauges
Pan evaporation data - evaporation pans (Class A pan, Colorado sunken pan, floating pans)
Evapo-transpiration data - lysimeters
Temperature data - thermometers (minimum, maximum, dry bulb, wet bulb)
Meteorological data - Contd.
Atmospheric pressure data - barometer
Humidity data - hair hygrograph
Wind speed and direction - anemometer, wind vane
Sunshine duration and intensity data - sunshine hour recorder and pyranometer

HYDROLOGICAL DATA - Instruments
Water level data - staff gauges and other automatic gauges
Ground water level data - water level recorders
Infiltration data - infiltrometer
Discharge - velocity by current meters, ADCPs (Acoustic Doppler Current Profilers)
WATER QUALITY DATA
Organic matter
Dissolved oxygen
Major and minor ions
Toxic metals
Nutrients
Biological properties
RELATION ORIENTED DATA
PURPOSE- to reduce storage space
Stage- discharge data
Rainfall-runoff data
Water quality and discharge data
Stage- discharge- sediment data

PROCESSING OF DATA
Preliminary scrutiny and checking
reasonableness of data
Storage of data
Quality control
Estimation of missing data
Internal consistency of data
Spatial consistency of data
Adjustment of data.
Conversion of data
Computation of basic statistical parameters
VALIDATION OF DATA
Plotting of data
Time series plot
Residual mass curve plot
Comparison plots
Comparison plots
Multi station single variable plots
Single station-multi variable plots
Plotting of data

Plotting helps in identification of
unit errors,
decimal errors,
outliers in the data,
basic characteristics of the data in terms of trends,
jumps and periodicities.



Various types of plots
Single station single variable plots,
Single station, multiple variable plots,
Multiple station single variable plots,
Residual series plots
Plots of annual time series for identifying the trends,
jumps etc.
Mass curve plots
Double mass curve plots
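
The ordinates behind the mass curve, residual mass curve and double mass curve plots listed above can be sketched as follows. This is a minimal illustration that assumes the usual definitions (mass curve = cumulative sum, residual mass curve = cumulative departure from the mean). The rainfall values and the names `rain_a` and `rain_b` are hypothetical and do not come from the slides; in practice a double mass curve is often drawn against the mean of several neighbouring stations.

```python
from itertools import accumulate


def mass_curve(values):
    """Cumulative sum of the series (ordinates of a mass curve)."""
    return list(accumulate(values))


def residual_mass_curve(values):
    """Cumulative sum of departures from the series mean."""
    mean = sum(values) / len(values)
    return list(accumulate(v - mean for v in values))


# Hypothetical annual rainfall totals (mm) at two stations, for illustration only.
rain_a = [820.0, 760.0, 910.0, 680.0, 830.0]
rain_b = [790.0, 700.0, 880.0, 650.0, 800.0]

mass_a = mass_curve(rain_a)
mass_b = mass_curve(rain_b)
residual_a = residual_mass_curve(rain_a)

# Double mass curve: cumulative values of one station against the other.
double_mass_points = list(zip(mass_b, mass_a))

print(mass_a)
print(residual_a)
print(double_mass_points)
```

A change of slope in the double mass points would suggest an inconsistency in one of the records; the residual mass curve always returns to zero at the end of the series.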
PRELIMINARY ANALYSIS OF
DATA
Computation of basic statistical
parameters
Checking the data for randomness
Identification of trends in the data
Identification of shift in the data


Computation of basic Statistical
parameters

Mean: Mean is a measure of central tendency. Other measures of central tendency are median and mode. Arithmetic mean is the most commonly used measure of central tendency and is given by

$$\bar{x} = \frac{1}{N}\sum_{i=1}^{N} x_i \qquad (1)$$

where $x_i$ is the ith variate and N is the total number of observations.
Standard Deviation: An unbiased estimate of standard deviation ($S_x$) is given by

$$S_x = \left[\frac{\sum_{i=1}^{N}(x_i-\bar{x})^2}{N-1}\right]^{0.5} \qquad (2)$$

Standard deviation is the measure of variability of a data set. The standard deviation divided by the mean is called the coefficient of variation ($C_v$), which is generally used as a regionalization parameter.
Coefficient of skewness ($C_s$): The coefficient of skewness measures the asymmetry of the frequency distribution of the data. An unbiased estimate of $C_s$ is given by

$$C_s = \frac{N\sum_{i=1}^{N}(x_i-\bar{x})^3}{(N-1)(N-2)\,S_x^3} \qquad (3)$$
Coefficient of kurtosis ($C_k$): The coefficient of kurtosis measures the peakedness or flatness of the frequency distribution near its center. An unbiased estimate of it is given by

$$C_k = \frac{N^2\sum_{i=1}^{N}(x_i-\bar{x})^4}{(N-1)(N-2)(N-3)\,S_x^4} \qquad (4)$$
Cross correlation coefficients: The coefficient of linear correlation between two series may be computed by

$$r_{X,Y} = \frac{\mathrm{Cov}(X,Y)}{S_X\,S_Y} \qquad (5)$$

In the case of serial correlation coefficients or autocorrelation coefficients, the Y series is the X series lagged by one, two, three or four steps.
Example 1
The annual water levels of well no. 250109D of Tumkur
district of Karnataka for 1975 to 2004 period are given in
Table 1. Compute the basic statistical parameters of these
water levels in original as well as logarithm domain.
Table 1. The annual water levels of well no. 250109D of
Tumkur district of Karnataka

Year                 1975  1976  1977  1978  1979  1980  1981  1982  1983  1984
Water level (m bgl)  4.20  5.67  6.21  5.91  6.36  6.36  6.72  6.63  7.51  8.28

Year                 1985  1986  1987  1988  1989  1990  1991  1992  1993  1994
Water level (m bgl)  8.54  7.14  4.95  5.35  6.72  5.73  6.37  4.21  5.14  5.68

Year                 1995  1996  1997  1998  1999  2000  2001  2002  2003  2004
Water level (m bgl)  8.50  9.44  9.29  8.85  4.56  4.75  5.06  5.46  8.67  10.57

Solution:
Statistical parameters

Mean: $\bar{x} = \frac{1}{N}\sum_{i=1}^{N} x_i = 6.63$

Standard deviation: $S_x = \left[\sum_{i=1}^{N}(x_i-\bar{x})^2/(N-1)\right]^{0.5} = 1.70$

Coefficient of skewness: $C_s = \dfrac{N\sum_{i=1}^{N}(x_i-\bar{x})^3}{(N-1)(N-2)\,S_x^3} = 0.588$

Coefficient of kurtosis: $C_k = \dfrac{N^2\sum_{i=1}^{N}(x_i-\bar{x})^4}{(N-1)(N-2)(N-3)\,S_x^4} = 2.173$
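
The formulas for the mean, standard deviation, coefficient of skewness and coefficient of kurtosis can be checked against the Table 1 water levels with a short script. This is a minimal sketch, not the procedure used to prepare the slides; the function name `basic_statistics` is an assumption introduced here for illustration. The printed values can be compared with those reported in the solution.

```python
import math


def basic_statistics(x):
    """Mean, standard deviation, Cv, Cs and Ck using the (N-1)-type estimators."""
    n = len(x)
    mean = sum(x) / n
    dev = [v - mean for v in x]
    s = math.sqrt(sum(d ** 2 for d in dev) / (n - 1))
    cs = n * sum(d ** 3 for d in dev) / ((n - 1) * (n - 2) * s ** 3)
    ck = n ** 2 * sum(d ** 4 for d in dev) / ((n - 1) * (n - 2) * (n - 3) * s ** 4)
    return {"mean": mean, "std": s, "cv": s / mean, "cs": cs, "ck": ck}


# Annual water levels (m bgl) of well no. 250109D, 1975-2004 (Table 1).
water_levels = [
    4.20, 5.67, 6.21, 5.91, 6.36, 6.36, 6.72, 6.63, 7.51, 8.28,
    8.54, 7.14, 4.95, 5.35, 6.72, 5.73, 6.37, 4.21, 5.14, 5.68,
    8.50, 9.44, 9.29, 8.85, 4.56, 4.75, 5.06, 5.46, 8.67, 10.57,
]

stats = basic_statistics(water_levels)
for name, value in stats.items():
    print(f"{name}: {value:.3f}")
```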
Table 2. The annual water levels (log domain) of well no.
250109D of Tumkur district of Karnataka
Year        1975   1976   1977   1978   1979   1980   1981   1982   1983   1984
Log series  0.624  0.753  0.793  0.772  0.803  0.804  0.827  0.821  0.875  0.918

Year        1985   1986   1987   1988   1989   1990   1991   1992   1993   1994
Log series  0.931  0.854  0.695  0.729  0.827  0.758  0.804  0.624  0.711  0.754

Year        1995   1996   1997   1998   1999   2000   2001   2002   2003   2004
Log series  0.929  0.975  0.968  0.947  0.659  0.677  0.704  0.737  0.938  1.024

Solution:
Statistical parameters of log series

Mean: $\bar{x} = \frac{1}{N}\sum_{i=1}^{N} x_i = 0.808$

Standard deviation: $S_x = \left[\sum_{i=1}^{N}(x_i-\bar{x})^2/(N-1)\right]^{0.5} = 0.109$

Coefficient of skewness: $C_s = \dfrac{N\sum_{i=1}^{N}(x_i-\bar{x})^3}{(N-1)(N-2)\,S_x^3} = 0.018$

Coefficient of kurtosis: $C_k = \dfrac{N^2\sum_{i=1}^{N}(x_i-\bar{x})^4}{(N-1)(N-2)(N-3)\,S_x^4} = 2.414$
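
The log-domain series can be reproduced by transforming the Table 1 values before computing the statistics. The sketch below assumes base-10 logarithms, which is consistent with the tabulated log series (for example, log10 of 4.20 is about 0.62); small differences from Table 2 in the third decimal can arise because Table 1 lists the levels to two decimals only.

```python
import math
import statistics

# Annual water levels (m bgl) of well no. 250109D, 1975-2004 (Table 1).
water_levels = [
    4.20, 5.67, 6.21, 5.91, 6.36, 6.36, 6.72, 6.63, 7.51, 8.28,
    8.54, 7.14, 4.95, 5.35, 6.72, 5.73, 6.37, 4.21, 5.14, 5.68,
    8.50, 9.44, 9.29, 8.85, 4.56, 4.75, 5.06, 5.46, 8.67, 10.57,
]

# Assumption: the log series uses base-10 logarithms.
log_levels = [math.log10(v) for v in water_levels]

log_mean = statistics.mean(log_levels)
log_std = statistics.stdev(log_levels)  # sample standard deviation (N - 1 divisor)

print(f"mean of log series: {log_mean:.3f}")
print(f"standard deviation of log series: {log_std:.3f}")
```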
Original Series lag 1 lag 2 lag 3
4.20
5.67 4.20
6.21 5.67 4.20
5.91 6.21 5.67 4.20
6.36 5.91 6.21 5.67
6.36 6.36 5.91 6.21
6.72 6.36 6.36 5.91
6.63 6.72 6.36 6.36
7.51 6.63 6.72 6.36
8.28 7.51 6.63 6.72
8.54 8.28 7.51 6.63
7.14 8.54 8.28 7.51
4.95 7.14 8.54 8.28
5.35 4.95 7.14 8.54
6.72 5.35 4.95 7.14
5.73 6.72 5.35 4.95
6.37 5.73 6.72 5.35
4.21 6.37 5.73 6.72
5.14 4.21 6.37 5.73
5.68 5.14 4.21 6.37
8.50 5.68 5.14 4.21
9.44 8.50 5.68 5.14
9.29 9.44 8.50 5.68
8.85 9.29 9.44 8.50
4.56 8.85 9.29 9.44
4.75 4.56 8.85 9.29
5.06 4.75 4.56 8.85
5.46 5.06 4.75 4.56
8.67 5.46 5.06 4.75
10.57 8.67 5.46 5.06
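Auto correlation Coefficients

The lag-k correlation coefficient between the original series (x) and the series lagged by k steps (y), computed over the N overlapping pairs, is

$$r_k = \frac{N\sum xy - \sum x \sum y}{\sqrt{\left[N\sum x^2 - \left(\sum x\right)^2\right]\left[N\sum y^2 - \left(\sum y\right)^2\right]}}$$

Calculation of $r_1$:
Total no of data = 29
Correlation coefficient of lag 1 series = $r_1$ = 0.587

Calculation of $r_2$:
Total no of data = 28
Correlation coefficient of lag 2 series = $r_2$ = 0.015

Calculation of $r_3$:
Total no of data = 27
Correlation coefficient of lag 3 series = $r_3$ = -0.4036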
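
The lagged-series construction in the table above can be expressed compactly in code. The following sketch pairs each value with the value k steps earlier and applies the product-moment correlation formula to the overlapping pairs; the function name `lag_correlation` is an assumption made for illustration.

```python
import math


def lag_correlation(series, k):
    """Correlation between the series and the same series lagged by k steps."""
    x = series[k:]                    # original series with the first k values dropped
    y = series[:len(series) - k]      # lagged series, aligned with x
    n = len(x)                        # number of overlapping pairs (N - k)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(a * b for a, b in zip(x, y))
    sum_xx = sum(a * a for a in x)
    sum_yy = sum(b * b for b in y)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt((n * sum_xx - sum_x ** 2) * (n * sum_yy - sum_y ** 2))
    return numerator / denominator


# Annual water levels (m bgl) of well no. 250109D, 1975-2004 (Table 1).
water_levels = [
    4.20, 5.67, 6.21, 5.91, 6.36, 6.36, 6.72, 6.63, 7.51, 8.28,
    8.54, 7.14, 4.95, 5.35, 6.72, 5.73, 6.37, 4.21, 5.14, 5.68,
    8.50, 9.44, 9.29, 8.85, 4.56, 4.75, 5.06, 5.46, 8.67, 10.57,
]

for k in (1, 2, 3):
    pairs = len(water_levels) - k
    print(f"lag {k}: {pairs} pairs, r = {lag_correlation(water_levels, k):.3f}")
```

Each additional lag loses one pair, which is why the number of data falls from 29 to 28 to 27.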
IDENTIFICATION OF TREND AND
RANDOMNESS
Trend
A steady and regular movement in a time series, through
which the values are on the average increasing or
decreasing is termed as trend.
The existence of trend in hydrological series may be due to
low frequency oscillatory movement induced by climatic
changes or through changes in land use and catchment
characteristics.
If a trend in a particular series is obvious, it can be described by fitting a polynomial to the original series.
There are a number of statistical tests to detect the presence of trend in a time series. Kendall's rank correlation test and the linear regression test can be used to check whether the time series is trend free or not.
An undesirable consequence of this type of trend removal is that artificial cycles may be induced into the data. This is known as the Slutzky-Yule effect (1937).
TESTS FOR RANDOMNESS AND TREND
In certain cases the presence of trend is quite obvious,
but often there is doubt whether any suspected
systematic effects are significant or not.
Turning point test - for checking the randomness of the series.
Kendall's rank correlation test - for trend identification.
Regression test for linear trend - to test whether the slope of the line representing the trend is significant or not.

TURNING POINT TEST

In an observed sequence $x_t$, t = 1, 2, 3, ..., N, a turning point occurs at time t = i if $x_i$ is either greater than both $x_{i-1}$ and $x_{i+1}$ or less than both adjacent values. Let p be the number of turning points in the series.
The expected number of turning points in a random series is E(p) = 2(N-2)/3 and the variance is Var(p) = (16N-29)/90.
Here N is the number of observations. Consequently p can be expressed as a standard measure, $Z = (p - E(p))/(\mathrm{Var}(p))^{1/2}$, which is treated approximately as a standard normal deviate. Too many or too few turning points indicate non-randomness of the series.


Example 2:
Test the randomness of the following Yearly Mean GW
data of Well No. 250001D of Tumkur district, Karnataka
at 5% significance level.
Sl no                        1      2      3      4      5      6      7      8      9      10
Annual water level (m bgl)   9.90   11.23  9.61   8.72   10.57  10.83  9.51   9.92   11.24  10.67

Sl no                        11     12     13     14     15     16     17     18     19     20
Annual water level (m bgl)   11.67  13.30  12.91  10.07  11.18  13.00  12.86  10.57  11.11  13.66

Sl no                        21     22     23     24     25     26     27     28
Annual water level (m bgl)   16.54  15.78  17.65  17.23  17.65  16.41  16.04  16.38
Solution:
There are 8 peaks and 8 troughs, making the number of turning points p = 16. The total number of data N is 28.
E(p) = 2(N-2)/3 = 17.33
Var(p) = (16N-29)/90 = 4.655

Fig 1 Time series plot of Yearly Mean GW data of Well No. 250001D (x-axis: time, 1975 to 2002; y-axis: water level (m), 0.00 to 20.00)
$Z = (p - E(p))/(\mathrm{Var}(p))^{1/2} = (16 - 17.33)/(4.655)^{1/2} = -0.618$
As |Z| < 1.96, the series is random at 5% significance level according to the turning point test.
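
The counting of peaks and troughs in Example 2 can be automated. The sketch below is one straightforward way to do it and assumes strict inequalities when identifying a turning point; the function name `turning_point_test` is introduced here for illustration only.

```python
import math


def turning_point_test(x):
    """Return (p, E(p), Var(p), Z) for the turning point test."""
    n = len(x)
    # A turning point is an interior value that is a strict peak or a strict trough.
    p = sum(
        1
        for i in range(1, n - 1)
        if (x[i] > x[i - 1] and x[i] > x[i + 1])
        or (x[i] < x[i - 1] and x[i] < x[i + 1])
    )
    expected = 2 * (n - 2) / 3
    variance = (16 * n - 29) / 90
    z = (p - expected) / math.sqrt(variance)
    return p, expected, variance, z


# Yearly mean GW levels (m bgl) of well no. 250001D (Example 2).
gw_levels = [
    9.90, 11.23, 9.61, 8.72, 10.57, 10.83, 9.51, 9.92, 11.24, 10.67,
    11.67, 13.30, 12.91, 10.07, 11.18, 13.00, 12.86, 10.57, 11.11, 13.66,
    16.54, 15.78, 17.65, 17.23, 17.65, 16.41, 16.04, 16.38,
]

p, expected, variance, z = turning_point_test(gw_levels)
is_random = abs(z) < 1.96  # two-sided test at 5% significance level
print(f"p = {p}, E(p) = {expected:.2f}, Var(p) = {variance:.3f}, Z = {z:.3f}")
print("series is random" if is_random else "series is not random")
```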

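KENDALL'S RANK CORRELATION TEST

This test, which is also known as the τ (tau) test, is based on the proportionate number of subsequent observations which exceed a particular value. For a sequence $x_1, x_2, \ldots, x_N$, the standard procedure is to determine the number of times, say p, in all pairs of observations ($x_i$, $x_j$; j > i) that $x_j$ is greater than $x_i$.
The test statistic is $\tau = 4p/(N(N-1)) - 1$. For a random series $E(\tau) = 0$ and $\mathrm{Var}(\tau) = 2(2N+5)/(9N(N-1))$, so the ratio $\tau/(\mathrm{Var}\,\tau)^{1/2}$ is compared with the standard normal deviate.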
Example 3: For the data given in Example 2, test whether the sequence (1975-2002) is trend free.

Solution: Here N = 28 and p = 24 + 15 + 23 + 24 + 19 + 17 + 21 + 20 + 14 + 16 + 13 + 9 + 10 + 14 + 11 + 9 + 9 + 10 + 9 + 8 + 3 + 6 + 0 + 1 + 0 + 0 + 1 + 0 = 306
$\tau = 4p/(N(N-1)) - 1 = 0.619$
$\mathrm{Var}(\tau) = 2(2N+5)/(9N(N-1)) = 0.0179$
$\tau/(\mathrm{Var}\,\tau)^{1/2} = 0.619/(0.0179)^{1/2} = 4.623$
Since 4.623 > 1.96, the hypothesis of no trend is rejected and a rising trend is accepted at 5% significance level.
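
The pair counting in Example 3 is tedious by hand and easy to get wrong, so a short script is a useful check. This sketch counts strictly greater later values for each observation and uses the standard variance expression 2(2N+5)/(9N(N-1)) for the statistic; the function name `kendall_rank_test` is an assumption introduced for illustration.

```python
import math


def kendall_rank_test(x):
    """Return (p, tau, Var(tau), Z) for Kendall's rank correlation test."""
    n = len(x)
    # p: number of pairs (i, j), j > i, in which the later value is larger.
    p = sum(1 for i in range(n - 1) for j in range(i + 1, n) if x[j] > x[i])
    tau = 4 * p / (n * (n - 1)) - 1
    var_tau = 2 * (2 * n + 5) / (9 * n * (n - 1))
    z = tau / math.sqrt(var_tau)
    return p, tau, var_tau, z


# Yearly mean GW levels (m bgl) of well no. 250001D (Example 2).
gw_levels = [
    9.90, 11.23, 9.61, 8.72, 10.57, 10.83, 9.51, 9.92, 11.24, 10.67,
    11.67, 13.30, 12.91, 10.07, 11.18, 13.00, 12.86, 10.57, 11.11, 13.66,
    16.54, 15.78, 17.65, 17.23, 17.65, 16.41, 16.04, 16.38,
]

p, tau, var_tau, z = kendall_rank_test(gw_levels)
has_trend = abs(z) > 1.96  # two-sided test at 5% significance level
print(f"p = {p}, tau = {tau:.3f}, Var(tau) = {var_tau:.4f}, Z = {z:.3f}")
```

A positive ratio indicates a rising sequence and a negative ratio a falling one.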







REGRESSION TEST FOR LINEAR TREND
A straight line is fitted to the data and it is tested statistically whether the slope of the line is significantly different from zero or not.
If a straight line of the form y = a + bx is fitted to the data, then the following statistics are computed:

$$b = \frac{\sum (x_i-\bar{x})(y_i-\bar{y})}{\sum (x_i-\bar{x})^2}$$

$$a = \bar{y} - b\,\bar{x}$$

$$\sum \varepsilon_i^2 = \sum (y_i-\bar{y})^2 - b^2 \sum (x_i-\bar{x})^2$$

$$S = \left[\sum \varepsilon_i^2/(N-2)\right]^{1/2}$$

$$S_b^2 = S^2 / \sum (x_i-\bar{x})^2$$

In the above equations $S_b$ is the standard error of b and $\sum \varepsilon_i^2$ is the sum of squares of residuals or errors.
The hypothesis to be tested in this case is b = 0. The first step is to estimate b and its variance using the above equations. The test statistic $t = b/S_b$ is then tested using Student's t-test. It is assumed here that the residuals are stationary, sequentially independent and normally distributed.

Example 4: For the data given in Example 2, test whether
there is a significant linear trend. Assume that the values in
the sequence can be represented by straight line
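Solution: For this case

$\sum (x_i-\bar{x})^2 = 1827.0$
$\sum (y_i-\bar{y})^2 = 218.64$
$\sum (x_i-\bar{x})(y_i-\bar{y}) = 541.95$
$\bar{X} = 14.5$
$\bar{Y} = 12.72$

b = 541.95/1827.0 = 0.297
a = 8.420
$\sum \varepsilon_i^2 = 218.64 - 0.297 \times 0.297 \times 1827.0 = 57.88$ (using the unrounded value of b)
S = 1.49
$S_b$ = 0.035
$t = b/S_b = 8.50$
$t > t_{1-\alpha/2,\,N-2}$ (2.056 for 26 degrees of freedom), which implies there is a trend at 5% significance level.
Fitted line: y = 0.296x + 8.420, R² = 0.735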
[Figure: time series plot of water level (m) against time, 1975 to 2002, well no - 250001D]
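
The regression test of Example 4 can be reproduced from the raw data. This is a minimal sketch which assumes that the independent variable is simply the serial number 1, 2, ..., N (consistent with the mean of 14.5 quoted in the solution). The function name `linear_trend_test` is hypothetical, and the critical value 2.056 is the tabulated Student's t value for 26 degrees of freedom at the two-sided 5% level.

```python
import math


def linear_trend_test(y):
    """Fit y = a + b*x with x = 1..N and return (a, b, S_b, t)."""
    n = len(y)
    x = list(range(1, n + 1))  # assumption: time index 1, 2, ..., N
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    syy = sum((yi - y_mean) ** 2 for yi in y)
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    b = sxy / sxx                      # slope
    a = y_mean - b * x_mean            # intercept
    sse = syy - b ** 2 * sxx           # sum of squares of residuals
    s = math.sqrt(sse / (n - 2))       # standard error of the residuals
    s_b = s / math.sqrt(sxx)           # standard error of the slope
    return a, b, s_b, b / s_b


# Yearly mean GW levels (m bgl) of well no. 250001D (Example 2).
gw_levels = [
    9.90, 11.23, 9.61, 8.72, 10.57, 10.83, 9.51, 9.92, 11.24, 10.67,
    11.67, 13.30, 12.91, 10.07, 11.18, 13.00, 12.86, 10.57, 11.11, 13.66,
    16.54, 15.78, 17.65, 17.23, 17.65, 16.41, 16.04, 16.38,
]

a, b, s_b, t = linear_trend_test(gw_levels)
T_CRITICAL = 2.056  # tabulated Student's t, 26 degrees of freedom, two-sided 5%
has_linear_trend = abs(t) > T_CRITICAL
print(f"a = {a:.3f}, b = {b:.3f}, S_b = {s_b:.3f}, t = {t:.2f}")
```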
TEST FOR DETECTING THE CHANGE IN MEAN (SHIFT AND JUMP)
Many times two segments of a time series may appear to be fluctuating around different means. An important test for detecting the presence of jumps in the series is given by Buishand (1977) using von Neumann's ratio method. The test is explained below:

1. Compute the lengths of the two segments, say $n_1$ and $n_2$.

2. Compute the means and standard deviations of the two segments as $\mu_1$ and $\sigma_1$, and $\mu_2$ and $\sigma_2$.

3. Compute Z as

$$Z = \frac{\mu_1 - \mu_2}{\left(\dfrac{\sigma_1^2}{n_1} + \dfrac{\sigma_2^2}{n_2}\right)^{1/2}}$$

If |Z| ≤ 1.96, the two means may be considered the same at 5% significance level.
Example 5: For the data given in Example 2, test whether
there is presence of jumps.

Solution: For this case the lengths of the two segments, $n_1$ and $n_2$, are taken as 18 (years 1975-92) and 10 (years 1993-2002).
The means and standard deviations of the two segments are $\mu_1$ = 10.99 and $\sigma_1$ = 1.33, and $\mu_2$ = 15.84 and $\sigma_2$ = 2.02.

[Figure: time series plot of water level (m) against time, 1975 to 2002, well no - 250001D]
$$Z = \frac{10.99 - 15.84}{\left(\dfrac{1.33^2}{18} + \dfrac{2.02^2}{10}\right)^{1/2}} = -6.821$$

Since |Z| > 1.96, the two means are not the same at 5% significance level.
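
The Z statistic of Example 5 follows directly from the segment means, standard deviations and lengths quoted in the solution. The sketch below takes those summary values as inputs; the function name `mean_shift_z` is introduced here for illustration. Because the inputs are rounded to two decimals, the result may differ from the slide value in the third decimal.

```python
import math


def mean_shift_z(mean1, sd1, n1, mean2, sd2, n2):
    """Z statistic for the difference between the means of two segments."""
    return (mean1 - mean2) / math.sqrt(sd1 ** 2 / n1 + sd2 ** 2 / n2)


# Segment statistics quoted in Example 5 (1975-92 and 1993-2002).
z = mean_shift_z(10.99, 1.33, 18, 15.84, 2.02, 10)
means_differ = abs(z) > 1.96  # two-sided test at 5% significance level

print(f"Z = {z:.3f}")
print("means differ" if means_differ else "means may be considered the same")
```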
Thank you
