Professional Documents
Culture Documents
Estimating Missing Weather Data for Agricultural Simulations Using Group Method of
Data Handling
M. C. ACOCK
Remote Sensing and Modeling Laboratory, U.S. Department of Agriculture Agricultural Research Service,
Beltsville, Maryland
YA. A. PACHEPSKY
Remote Sensing and Modeling Laboratory, U.S. Department of Agriculture Agricultural Research Service,
Beltsville, Maryland, and Duke University Phytotron, Duke University, Durham, North Carolina
ABSTRACT
Contacting weather stations via modems to obtain weather data for crop simulations has become a common
practice. Users sometimes encounter gaps in these data, and techniques are needed to estimate weather variables
for days when the data are absent. The authors hypothesized that such estimations can be made using data from
before and after the day with no data. Dependencies of weather variables of a particular day on weather variables
from several days before and after could be very complex. To find and to express these dependencies, group
method of data handling (GMDH), which is a tool for modeling complex ‘‘input–output’’ relationships by building
hierarchical polynomial regression networks, was used. Data on daily solar radiation, maximum and minimum
temperatures, and wind runs collected daily in Stoneville, Mississippi, during May–September of 1982–92 were
used. Fourteen-hundred sequential 7-day datasets from the database were extracted. For each dataset, the authors
assumed that weather variables on the fourth day were unknown and had to be found from the weather variables
of days 1, 2, 3, 5, 6, and 7. Seventy-five percent of these data were used to find the hierarchical polynomial
regression, and 25% were used to evaluate it. Correlation coefficients between calculated and actual parameters
were similar for training and evaluation datasets. Coefficients of determination (R 2 ) were about 0.88 for minimum
temperature, 0.80 for maximum temperature, and 0.80 for wind run. Accuracy of the solar radiation and pre-
cipitation estimates was lower, and R 2 was about 0.2–0.3 but improved to 0.5–0.6 for the training dataset and
0.3 for the validation dataset for both variables when an additional indicator variable that shows the presence
or absence of rain was included. The next day after the day with missing data gave the most essential information.
Increasing the number of missing days resulted in gradual deterioration of the accuracy for all variables but
wind run. GMDH can be a useful tool for filling gaps in weather data from weather stations installed in the
field.
be a viable alternative to using neighboring stations in 2d day of the 7-day group and the minimum temperature
reconstructing missing data. two days before the missing data day, and so on.
The reconstruction of missing weather data is differ- We also wanted to see how the accuracy of estimates
ent from weather forecasting because both data collected will change when we use different numbers of days after
before the gap and data collected after the gap can be the missing data day. For that purpose, we used the same
used. Still, the dependencies between the missing and 7-day groups and repeated estimations given the as-
available weather variables are expected to be complex, sumption that the weather variables were unknown on
and advanced data analysis tools are needed to find and the 5th, 6th, and 7th days. When the weather variables
to express these dependencies with sufficient accuracy. were unknown on the 7th day, the estimates were ac-
Group method of data handling (GMDH) is a term tually representing a weather forecast based on the
for the group of algorithms developed to simulate ob- weather of the preceding week.
served complex ‘‘input–output’’ dependencies. GMDH There may be cases when data are missing for more
provides an automated selection of essential input var- than 1 day. We built estimates of average weather var-
iables and builds hierarchical polynomial regressions of iables for the sequences of 2 and 3 missing days. Data
necessary complexity (Farrow 1984). GMDH algo- for these estimates were extracted from the same May–
rithms were applied successfully to predict seasonal pre- September weather records collected in Stoneville, Mis-
cipitation (Lebow et al. 1984), monthly mean temper- sissippi. We extracted all sequential 8-day groups and
ature (Lin et al. 1994; Setoodehnia et al. 1993), and 9-day groups and computed average weather variables
daily minimum and maximum temperature (Abdel-Aal for days 4 and 5 in 8-day groups and for days 4, 5, and
and Elhadidy 1994, 1995). GMDH algorithms are self 6 in 9-day groups. Then we assumed that these averages
organizing so that they do not presume any fixed func- are unknown and have to be found from the weather
tional dependence between inputs and outputs; this de- variables of days 1, 2, 3, 6, 7 and 8 and days 1, 2, 3,
pendence evolves as the modeling with GMDH pro- 7, 8 and 9 in 8-day groups and 9-day groups, respec-
ceeds. tively.
The objective of this paper was to explore the ap-
plicability of GMDH to reconstruct missing daily weath-
er variables when preceding and subsequent daily b. GMDH algorithm
weather variables were available. Historical weather re-
The idea of GMDH is to employ estimates of the
cords for a particular location in Mississippi were used
output variable obtained from simple regression equa-
to build and to test GMDH models. The location rep-
tions that include small subsets of input variables (Far-
resents the region for which the so-called GLYCIM crop
row 1984). Although the accuracy of such estimates is
simulator is used on farms having weather stations (Red-
low, these estimates may be better predictors of the
dy et al. 1995) at which gaps in data sometimes occur.
output variable than are some of the input variables.
The best of these estimates are included in the set of
2. Materials and methods input variables, and, again, small subsets of variables
from this set are used to build new estimates.
a. Design of the study The general functioning of the GMDH algorithm can
be understood from the following example. Let the orig-
We estimated key daily weather variables from data
inal data consist of the column of the observed values
collected 3 days before and 3 days after the day in
of y and of N columns of the observed values of the
question. Fourteen-hundred sequential groups of 7-day
independent variables x1 , x 2 , . . . , x N . The primeval
daily weather data were extracted from weather records
equations are quadratic polynomials:
collected during May–September in Stoneville, Missis-
sippi, for 10 consecutive years. We included data for z 5 A 1 Bx j 1 Cx k 1 Dx j2 1 Exk2 1 Fx j x k , (1)
the May–October period that covered the growing sea-
son for soybean and cotton crops. The daily weather Here A, B, C, D, E, and F are parameters, and x j and
variables were solar radiation (Rad), maximum and min- x k are two independent variables taken from the total
imum temperature (Tmax and Tmin), precipitation list of independent variables. Each iteration consists of
(Prec), and wind run (WndR). three steps. Step 1 consists of obtaining estimates of y
For each 7-day group, we assumed that weather var- using primeval equations. All independent variables x1 ,
iables on the 4th day were unknown and had to be found x 2 , . . . , x N are taken two at a time to become x j and x k
from the weather variables of days 1, 2, 3, 5, 6, and 7, in Eq. (1) and regression polynomials [Eq. (1)] are con-
and from the day of the year of the fourth day (IDAY). structed so that values of z best fit the dependent variable
Below, the day number in the 7-day group is attached y. The total number of these polynomials is N(N 2 1)/2.
to the symbol of the weather variable so that Rad5 de- The resulting columns of z m values, m 5 1, 2, . . . , N(N
notes the solar radiation for the 5th day of the group 2 1)/2, contain estimates of y from each polynomial
and the solar radiation for the day after the missing data and are interpreted as new improved variables that may
day, Tmin2 denotes the minimum temperature for the have better predictive power than the original x1 , x 2 ,
1178 JOURNAL OF APPLIED METEOROLOGY VOLUME 39
. . . , x N . The objective of the next step is to keep only 1. The original list of input variables included Rad1,
the best of these new variables. Rad2, Rad3, Rad5, Rad6, Rad7, Tmax1, Tmax2,
Step 2 consists in screening out the least-effective Tmax3, Tmax5, Tmax6, Tmax7, Tmin1, Tmin2, Tmin3,
new variables. There are several selection criteria (Far- Tmin5, Tmin6, Tmin7, Prec1, Prec2, Prec3, Prec5,
row 1984), all based on mean square absolute or relative Prec6, Prec7, WndR1, WndR2, WndR3, WndR5,
error of z m with respect to measured y, and often in- WndR6, WndR7, and IDAY for a total of 36 variables.
cluding a correction that ‘‘punishes’’ a network for ex- All of them were normalized. Only three variables,
cessive complexity. In some GMDH versions, columns Rad5, Tmin3, and Tmin5, were selected at the first it-
z m are retained that have the criterion value smaller than eration, and a new variable z1 was retained. At the sec-
a prescribed value. In other versions, a prescribed num- ond iteration, Rad6 and Tmin2 were selected as addi-
ber of the best z m are retained. tional variables that formed a primeval equation with
The list of input variables is modified at the end of the z1 variable, and a new variable z 2 was retained. At
this step. In some versions of the method, columns of the third iteration, Tmax3 and Wndr5 were selected as
x1 , x 2 , . . . , x N are merely replaced with the retained additional variables that formed a primeval equation
columns of z1 , z 2 , . . . , z K , where K is the total number with the z 2 variable, and a new variable z 3 was retained.
of retained columns. In other versions, the best K re- Last, at the fourth iteration, WndR2 and WndR7 were
tained columns are added to columns x1 , x 2 , . . . , x N to selected to be coupled with the variable z 3 , and new
form a new set of input variables. The total number N variable z 4 was retained. After the fourth iteration, the
of input variables changes to reflect the addition of z m number of coefficients in the model was balanced with
values or the replacement of old columns x N with z m . the model’s accuracy, and the iterations were stopped.
Step 3 tests whether the set of equations can be im- The variable z 4 was denormalized to calculate Tmin4
proved further. The smallest value of the selection cri- in its original range. All primeval equations were cubic
terion obtained at this iteration is compared with the polynomials.
smallest value obtained for a previous iteration. If an Performance of the GMDH method has been com-
improvement is achieved, steps 1 and 2 are repeated, pared with ‘‘climatology’’ and persistence estimates.
otherwise the iterations stop, and the network is built. Here climatology is defined to mean that estimates of
We have worked with a version of the algorithm cod- a weather variable for a particular day of a particular
ed in the commercial software package ModelQuest year were obtained as average values for the same day
(AbTech Corp., 1992–96). In this version, singlets, dou- across all years in the database except for the year for
blets, and triplets are used as primeval equations. which the estimate was sought. The persistence esti-
Singlet: z5 O ax. i i
mates of a weather variable were obtained as values of
the same variable measured the day before the day for
z5 O x x .
i,15
j
which the estimate was sought. A statistical comparison
Doublet: i
1 2 of accuracies of the methods was made using F ratios
z5 O x x x.
i1j,4
calculated as
j k
Triplet: i
1 2 3
sClim
2
s2
FClim 5 and FPers 5 2 Pers ,
i1j1k,4
FIG. 1. Structure and content of the GMDH network built to estimate daily minimum temperature at day 4 (Tmax4) from the values of
daily solar radiation Rad, daily minimum temperature Tmin, daily maximum temperature Tmax, and daily wind run WndR. See comments
in the text.
before and 3 days after the day in question. Root-mean- using weather variables from previous and following
square error of the estimates was about 1.38C. The slope days (Fig. 2). The evaluation dataset had values of R 2
of the regressions of estimated minimum temperatures near 0.3, significantly less than the value of 0.50 ob-
on measured ones was significantly less than 1. Hier- tained for the training data set. Mean square errors were
archical regressions built with GMDH tended to over- similar for training and evaluation datasets (Table 1).
estimate low values and underestimate high values of Polynomial networks to estimate missing weather
daily minimum temperature. variables are shown in Fig. 3. All of the networks have
Data on daily maximum temperature and daily wind four layers. Table 2 shows how the accuracy of the
run shown in Fig. 2 demonstrate similar accuracy levels. network progressed as more layers were added. The
The percentage of the variation explained by the GMDH main improvement in estimation of Tmin4, Tmax4, and
hierarchical regression is about 80%. Root-mean-square WndR4 occurred after the first triplet was found. There-
error of the maximum temperature and wind run esti-
fore, variables included in this triplet were expected to
mates is about 1.68C and 13 km, respectively. High
be the most influential predictors. Results from sensi-
values are underestimated and low values are overes-
timated. tivity analysis (not shown) concurred with this hypoth-
Estimates of daily radiation and precipitation had rel- esis. In all cases, the most influential predictors were
atively low accuracy. The coefficient of determination values of the same weather variables measured the day
(R 2 ) was significant but small, in the range 0.2–0.3. before and the day after the day of missing datum. The
Somewhat better results were obtained when we intro- third influential variable was the next-day solar radiation
duced an indicator variable IP. This variable was set to for the minimum temperature, the precipitation on the
zero when there was no precipitation on day 4 and was day before for the maximum temperature, and the next-
set to one when any precipitation occurred on day 4. day maximum temperature for the wind run. The ra-
With this variable, about 50% of variation in the daily diation and precipitation estimates gradually improved
solar radiation and in precipitation could be explained as new layers were included in the network (Table 2),
1180 JOURNAL OF APPLIED METEOROLOGY VOLUME 39
FIG. 2. Plots of 1:1 ratio diagrams and error distributions for GMDH estimates of the daily weather
variables: (a) and solid line—training dataset; (b) and broken line—validation dataset.
JULY 2000 ACOCK AND PACHEPSKY 1181
FPers
2.3
2.1
1.7
1.7
1.7
FClim
6.2
4.4
1.7
8.6
1.9
30.6
40.0
199.0
32.7
93.7
Max
Database
extremes
16.7
Min
5.6
2.9
6.9
0.0
TABLE 1. Accuracy of the GMDH networks to estimate missing daily weather variables as compared with climatology and persistence.
Clima- Persis-
tology tence
0.68
0.56
0.13
0.70
0.02
0.26
0.19
0.03
0.00
0.00
R2
0.88
0.77
0.30
0.79
0.31
b
GMDH
0.88
0.79
0.49
0.83
0.47
a
Persis-
2.11
2.44
tence
17.6
11.2
5.9
Root-mean-square error
Clima-
tology
3.11
3.14
32.6
6.1
9.0
1.25
1.67
13.6
4.9
6.5
b
GMDH
1.24
1.60
12.8
4.4
5.2
a
1.45
1.72
4.84
4.20
tence
12.5
4.82
4.45
28.1
3.7
9.6
3.2
GMDH
3.3
9.3
2.1
1 2 3 4
Max temperature, 8C
Min temperature, 8C
* Training dataset.
Variable
1.88
1.96
4.87
15.0
11.2
b
precipitation also are large. Correlation between mea-
0 days
sured and estimated-from-climatology weather variables
is very weak. The ratios of variances of climatology and
1.84
1.95
4.59
4.92
TABLE 3. Accuracy of missing daily weather variables estimates as affected by how many days after the day with missing data were used in the estimation.
14.3
GMDH estimation errors exceed the critical value F of
a
1.33 for p , 0.001 (Table 1). Therefore, greater than
0.999 probability exists that the errors in climatology
Root-mean-square error
estimates have larger spread than do the errors in
GMDH estimates.
1.35
1.69
5.28
6.87
12.6
b
Weather variables estimated from persistence are
1 day
closer to the actual values than those estimated from
climatology. All persistence estimates trail the GMDH
1.27
1.59
4.34
12.6
5.2
estimates in accuracy, however. For temperature, radi-
a
ation integral, and precipitation, the ratios of variances
of persistence and GMDH estimation errors exceed the
critical value F of 1.15 found for p , 0.01 (Table 1).
Therefore, greater than 0.99 probability exists that the
1.30
1.63
5.13
9.46
13.9
b
errors in persistence estimates have a larger spread than
2 days
do the errors in GMDH estimates.
Table 3 shows changes in the estimation accuracy as
1.28
1.59
4.26
12.5
5.2
related to number of days after the missing-data day.
a
Comparing data in this table with data in Table 1, one
can conclude that there is a very small loss in accuracy
as the number of days after the missing-day data de-
creases from 3 to 1. The accuracy of estimates, however,
1.36
1.46
3.54
3.65
11.0
b
becomes significantly worse when no data after the
0 days
missing data day are used, that is, when the estimates
represent a pure forecast. Using at least 1 day of data
1.29
1.44
3.52
1.89
10.6
after the missing day is essential in filling gaps in weath-
a
er data.
Results of estimating average weather variables for
Average absolute error
3.97
2.05
9.6
b
3.25
2.01
9.3
a
2.90
b**
3.8
9.7
2 days
4. Discussion
The performance of the GMDH model developed to
0.94
1.15
3.16
2.04
a*
9.3
* Evaluation dataset.
* Training dataset.
0.75
0.70
0.30
0.87
0.19
b
GMDH models tended to overestimate observations
3 days
with low values and to underestimate observations with
high values. This result may be related to the fact that
0.76
0.72
0.51
0.84
0.49
a
low-order polynomials are used as regression functions
in elements of the networks. These functions are not
R2
0.35
0.83
0.25
and Elhadidy (1995), who used GMDH for weather
b
2 days
0.48
0.83
5.05
10.4
3.9
b
1.56
1.58
3.60
3.84
2.98
5.90
12.0
3.80
4.15
11.8
a
variables the day before and the day after the day of
interest. Use of the indicator variable showing presence
or absence of rain is justified only if the user can de-
termine whether the day with missing datum was rainy.
Both the coefficients and the structure of GMDH
1.14
1.33
3.54
2.65
b
7.9
2.75
8.55
2.18
a
2.95
2.66
b**
2.80
2.00
a*
8.8
data is needed.
In general, our results show that GMDH can be a
useful tool in filling gaps in weather data from weather
stations installed in the field. Selection of the predictor
variables needs to be researched. The successful appli-
Daily solar radiation, MJ m22
* Evaluation dataset.
* Training dataset.
Variable
REFERENCES
Abdel-Aal, R. E., and M. A. Elhadidy, 1994: A machine-learning
approach to modeling and forecasting the minimum temperature
at Dhahran, Saudi Arabia. Energy Int. J., 19, 739–449.
1184 JOURNAL OF APPLIED METEOROLOGY VOLUME 39
, and , 1995: Modeling and forecasting the daily maximum Lebow, W. M., R. K. Mehra, H. Rice, and P. M. Tolgalagi, 1984:
temperature using abductive machine learning. Wea. Forecast- Forecasting applications in agricultural and meteorological time
ing, 10, 310–325. series. Self-Organizing Methods in Modeling: GMDH Type Al-
AbTech Corp., 1992–96: ModelQuest Version 4.0. User’s Manual. gorithms, S. J. Farrow, Ed., Marcel Dekker, 121–147.
[Available from AbTech Corporation, 1575 State Farm Blvd., Lin, Z.-S., J. Liu, and X.-D. He, 1994: The self-organizing methods
Charlottesville, VA 22911.] of long term forecasting. I. GMDH and GMPSC methods. Me-
Ashraf, M., J. C. Loftis, and K. G. Hubbard, 1997: Application of teor. Atmos. Phys., 53, 155–160.
geostatistics to evaluate partial weather station networks. Agric. Mott, P., T. W. Sammis, and G. M. Southward, 1994: Climate data
For. Meteor., 84, 255–264. estimation using climate information from surrounding climate
Barron, A. R., 1984: Predicted square error—A criterion for the au- stations. Appl. Eng. Agric., 10, 41–44.
tomated model selection. Self-Organizing Methods in Modeling: Parker, D. D., and D. Zilberman, 1994: The adoption and use of
GMDH Type Algorithms, S. J. Farrow, Ed., Marcel Dekker, 87– information services: The case of CIMIS. Working Paper 736,
103. 62 pp. [Available from University of California, Dept. of Ag-
Blackie, J. R., and T. K. M. Simpson, 1993: Climatic variability within riculture and Resource Economics, Berkeley, CA 94720.]
the Balquhidder catchments and its effect on Penman potential Press, W. H., S. A. Teukolsky, W. T. Vetterling, and B. P. Flannery,
evaporation. J. Hydrol., 145, 371–387. 1992: Numerical Recipes in FORTRAN: The Art of Scientific
Dinelli, D., 1995: What weather stations can do. Landscape Manage.,
Computing. 2d ed. Cambridge University Press, 992 pp.
34 (3), 6G.
Reddy, V. R., B. Acock, and F. D. Whisler, 1995: Crop management
Dumais, R. E., Jr., and K. C. Young, 1995: Using a self-learning
algorithm for single-station quantitive precipitation forecasting and input optimization with GLYCIM: Differing cultivars. Com-
in Germany. Wea. Forecasting, 10, 105–113. put. Electron. Agric., 13, 37–50.
Elwell, D. L., and J. C. Klink, 1993: Ongoing experience with Ohio’s Setoodehnia, A., J. Y. Cheung, and H. Li, 1993: Hypercube clustering
Automatic Weather Station Network. Appl. Eng. Agric., 9, 437– method: System modeling via modified group method of data
441. handling. Proc. 36th Midwest Symp. on Circuits and Systems,
Farrow, S. J., 1984: The GMDH algorithm. Self-Organizing Methods Vol. 1, Detroit, MI, Institute of Electrical and Electronics En-
in Modeling: GMDH Type Algorithms, S. J. Farrow, Ed., Marcel gineers, 422–425.
Dekker, 1–24. Snyder, R. L., P. W. Brown, K. G. Hubbard, and S. J. Meyer, 1996:
Fujioka, F. M., 1995: High resolution fire weather models. Fire Man- A guide to automated weather station networks in North Amer-
age. Notes, 57, 22–25. ica. Adv. Bioclimatol., 4, 1–61.