
Springer Proceedings in Mathematics & Statistics

Ratan Dasgupta Editor

Growth Curve
Models and
Applications
Indian Statistical Institute, Giridih, India,
March 28–29, 2016

Volume 204
Springer Proceedings in Mathematics & Statistics
This book series features volumes composed of selected contributions from
workshops and conferences in all areas of current research in mathematics and
statistics, including operation research and optimization. In addition to an overall
evaluation of the interest, scientific quality, and timeliness of each proposal at the
hands of the publisher, individual contributions are all refereed to the high quality
standards of leading journals in the field. Thus, this series provides the research
community with well-edited, authoritative reports on developments in the most
exciting areas of mathematical and statistical research today.

More information about this series at http://www.springer.com/series/10533


Editor
Ratan Dasgupta
Theoretical Statistics and Mathematics Unit
Indian Statistical Institute
Kolkata
India

ISSN 2194-1009 ISSN 2194-1017 (electronic)


Springer Proceedings in Mathematics & Statistics
ISBN 978-3-319-63885-0 ISBN 978-3-319-63886-7 (eBook)
DOI 10.1007/978-3-319-63886-7
Library of Congress Control Number: 2017947465

Mathematics Subject Classification (2010): 62-02

© Springer International Publishing AG 2017


This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part
of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations,
recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission
or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar
methodology now known or hereafter developed.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this
publication does not imply, even in the absence of a specific statement, that such names are exempt from
the relevant protective laws and regulations and therefore free for general use.
The publisher, the authors and the editors are safe to assume that the advice and information in this
book are believed to be true and accurate at the date of publication. Neither the publisher nor the
authors or the editors give a warranty, express or implied, with respect to the material contained herein or
for any errors or omissions that may have been made. The publisher remains neutral with regard to
jurisdictional claims in published maps and institutional affiliations.

Printed on acid-free paper

This Springer imprint is published by Springer Nature


The registered company is Springer International Publishing AG
The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland
Preface

A growth curve is an empirical model of the evolution of a quantity over time.


Growth curve models are now widely studied in different branches of science. The
present volume on Growth Curve Model (GCM) is a culmination of the talks given
at the workshop on the topic held during 28–29 March 2016 at Indian Statistical
Institute, Giridih.
This workshop proceedings volume, ‘Growth Curve Models and Applications, Indian
Statistical Institute, Giridih, India, March 28–29, 2016’, presents some of the
research, both theoretical and applied, on Growth Curve Models that scientists of
the Indian Statistical Institute have been carrying out in different branches of
science over the years. I am thankful to the readers, as the previous two volumes on
GCM, Advances in Growth Curve Models: Topics from the Indian Statistical
Institute (2013) and Growth Curve and Structural Equation Modeling: Topics from
the Indian Statistical Institute (2015), have been well accepted by the scientific
community, as reflected in the Book Performance Report. We invited contributions to
this workshop proceedings and further invited the participants of the workshop to
submit more than one paper, if possible. All the papers were peer
reviewed. The result is a compilation of 12 research papers presented in this volume.
Another workshop on GCM was conducted at Indian Statistical Institute, Giridih
during 23–24 February 2017.
The endeavor will be considered successful if it gives some idea about
solving theoretical and practical problems in the broad area of GCM, in which many
researchers in different branches of science are interested.

Kolkata, India Ratan Dasgupta


May 2017

Picture 1 Garlanding the statue of Professor P.C. Mahalanobis before the inaugural function of
GCM workshop, Giridih, 2016
Picture 2 Workshop participants and workers of Indian Statistical Institute, Giridih

Picture 3 Professor J.K. Ghosh giving a video message to the workshop participants
Picture 4 River Bidhyadhari, on the way to seed farm in Mannmathanagar, Sunderban

Picture 5 Mangrove forest near the bank of River Bidhyadhari. Famous Royal Bengal tigers’
residence is in deep forest
Picture 6 Coconut trees of dwarf variety grown on the bank of River Bidhyadhari, in seed farm in
Mannmathanagar

Picture 7 Elephant foot yam plantation near the seed farm office in Manmathnagar, Sunderban.
The leaves became yellowish after plot was subjected to water stagnation in rainy season
Picture 8 Cut seed corm of elephant foot yam for plantation in the second year’s growth
experiment in Sunderban

Picture 9 Preparation of land for plantation in the second year’s yam growth experiment in
Sunderban
Picture 10 Fully grown yam plants in the second year’s yam growth experiment in Sunderban

Picture 11 Field workers attending the yam plants at Giridih farm


Picture 12 Paddy cultivation in Giridih farm
Contents

Growth Curve of Elephant-Foot-Yam, Plant Stress and Mann-Whitney U-Statistics
Ratan Dasgupta
Protein Structure Modeling of Abnormal Genes Associated with
PARK 1 and PARK 8 Loci Related to Autosomal Dominant
Parkinson’s Disease and Docking the Protein(s) with Appropriate
Ligands
Sanchari Roy and T.S. Vasulu
Time Detection for Ovulation in a Cycle in Presence of Polycystic
Ovary Syndrome
Ratan Dasgupta
Growth Model for Micro-Particles Towards Indistinguishability
and Dirichlet Prior
Ratan Dasgupta
Coconut Plant Growth, Mahalanobis Distance, and Jeffreys’ Prior
Ratan Dasgupta
Growth Rate of Primary School Children in Kolkata, India
Susmita Bharati, Manoranjan Pal, Madhuparna Srivastava
and Premananda Bharati
Growth Curve Estimation of a Bulb Crop from Incomplete Data
Ratan Dasgupta
Tackling Poverty Through Balanced Growth: A Study on India
Sattwik Santra and Samarjit Das
Model Selection and Validation in Agricultural Context: Extended
Uniform Distribution and Some Characterization Theorems
Ratan Dasgupta
Longitudinal Growth Curve of Elephant Foot Yam Under Extreme
Stress and Plant Sensitivity II
Ratan Dasgupta
An In-Depth Analysis of Population Ageing for Selected States
in India in the Perspective of Economic Development
Prasanta Pathak
Growth Curve of Yam from Incomplete Data in Saline Soil
of Sunderban
Ratan Dasgupta
Contributors

Premananda Bharati Biological Anthropology Unit, Indian Statistical Institute,
Kolkata, India
Susmita Bharati Sociological Research Unit, Indian Statistical Institute, Kolkata,
India
Samarjit Das Economics Research Unit, Indian Statistical Institute, Kolkata, India
Ratan Dasgupta Theoretical Statistics and Mathematics Unit, Indian Statistical
Institute, Kolkata, India
Manoranjan Pal Economic Research Unit, Indian Statistical Institute, Kolkata,
India
Prasanta Pathak Population Studies Unit, Indian Statistical Institute, Kolkata,
India
Sanchari Roy Indian Statistical Institute, Kolkata, India
Sattwik Santra Centre for Training and Research in Public Finance and Policy,
Center for Studies in Social Sciences, Calcutta, Kolkata, India
Madhuparna Srivastava Economic Research Unit, Indian Statistical Institute,
Kolkata, India
T.S. Vasulu Indian Statistical Institute, Kolkata, India

Growth Curve of Elephant-Foot-Yam, Plant
Stress and Mann-Whitney U-Statistics

Ratan Dasgupta

Abstract Longitudinal growth of Elephant foot yam [Amorphophallus paeoniifolius
(Dennst.) Nicolson] is studied for different seed weights by taking plants off the
ground for interim growth measurements by the Archimedean principle, and then
replanting these to continue the growth experiment till maturity at the Indian
Statistical Institute, Giridih farm. In order to find the appropriate seed weight for
high yield and the appropriate time for harvest, twenty yam plants in each category of
seed weight 500, 650 and 800 g are considered in a growth experiment conducted in the
year 2013. The effect of severe plant stress on growth is also examined for a plant
whose underground yam was detached when the stems were pulled off the ground with a
jerk in the middle of the experiment, thus endangering plant survival during interim
growth recording. The injured plant, having only stem structure, survived when
replanted under stress, and deposited yam in its extended lifetime. Under stress,
canopy radius of a yam plant is a more stress-sensitive variable than perimeter on the
stem top. These variables may be modeled by normal distribution. A yam plant can
withstand severe stress, and over time may grow like a healthy plant when proper care
is taken. Deviations of observed data from the estimated growth curve are modeled by
an Ornstein-Uhlenbeck process, a Gaussian process with exponentially decaying
correlation function. Process parameters are estimated from the real data set, and
residual variability is compared over seed weights to ascertain assured yield for a
given seed weight. Under the assumption of symmetric error distribution, growth curves
are estimated, and the proposed new technique is compared with general nonparametric
regression. Among different seed weights, the growth curve of yam yield corresponding
to seed weight 650 g is seen to be superior from almost sure confidence bands. The
Mann-Whitney U test indicates the same; the test statistic is further used to compare
the plant stress induced by uprooting and replanting, which affects the slope change
in canopy radius around the time of intervention for interim growth recording of yam.
Error bounds for two-sample U-statistics from their projection under stringent moment
assumptions on the kernel are obtained to ascertain the adequacy of the test statistics.

R. Dasgupta (B)
Theoretical Statistics and Mathematics Unit, Indian Statistical Institute, 203 B T Road,
Kolkata 700108, India
e-mail: ratandasgupta@gmail.com; rdgupta@isical.ac.in

© Springer International Publishing AG 2017
R. Dasgupta (ed.), Growth Curve Models and Applications, Springer Proceedings
in Mathematics & Statistics 204, DOI 10.1007/978-3-319-63886-7_1

Keywords Elephant foot yam · Two-sample U-statistic · Ornstein-Uhlenbeck
process · Exponentially decaying correlation function

MS subject classification: 62P10 · 62G20

1 Introduction

Elephant foot yam is a staple food that can grow even in a harsh agro-climatic
environment. In order to find the optimal seed weight and harvest time, growth
experiments on yam for different seed weights are conducted in the Giridih farm of
the Indian Statistical Institute. We consider yam plantation of Bidhan Kusum, a
non-irritant high yielding variety. Yam corms were sown on 26 March 2013 and the
first sprouting of yam was observed on 11 April 2013.
Effect of plant stress on yield is of interest. In an earlier study, growth rate of yam
in a plant is seen to be increasing under stress (Dasgupta 2015a). In that experiment
with 6 plants during the years 2012–2013, one yam plant of seed weight 500 g was
moderately hurt during uprooting while taking interim growth reading. Growth rate
of yam was higher after plant injury, causing stress. Plant stress can be conveniently
used to increase yield, see e.g. Dasgupta (2016, 2017).
Plant girth on the stem top and canopy radius are two important growth measures
of plant over time. These may be modeled by normal distribution. Canopy radius is
affected more than stem girth due to plant stress. While taking interim growth reading
of underground yam by uprooting plants in the middle of growth experiment, and
measuring the yam volume by Archimedean principle of displaced water, stems of a
plant got detached at base from underground yam due to a sudden pull of stems with
jerk. The maximum diameter of leaf spread in injured plant showed steep decline in
slope thereafter, when stem structure attached with a few roots was replanted. Later
with progress of time the leaf spread of the wounded plant became steady compared
to other plants in the same category of seed weight, as the plant was healed. Residual
variability is measured by deviation of data points from the response curve. We
model the growth residuals over time by Ornstein-Uhlenbeck (O-U) process. Process
parameters are estimated from observed data set and a relationship of these with seed
weight is inferred by spline regression. From parametric modeling of the residuals,
we identify the superior growth curve. In previous studies seed weight 650 g was
recommended for high yield in Giridih. This is reconfirmed in the present study.
Lowess smoothing is a nonparametric regression technique that is used to estimate
the growth curve for different seed weights. An alternative procedure of estimating
the growth curve is also studied under the assumption that the error distribution is
symmetric. The two procedures are compared. The Mann-Whitney U test is considered
to compare the stress effect over different seed weights due to plant uprooting in the
middle of the experiment for interim growth recording. Residual variation of the
variables ‘girth at top of stem’ and ‘canopy radius’ is less for plants with seed weight
500 g. The growth curve corresponding to 650 g of seed weight is seen to outperform
the overall response curve and the other two curves, corresponding to seed weights
500 and 800 g. The same is reflected in almost sure confidence bands. In addition,
an asymptotic normal test to compare yield is investigated. A nonparametric two-sample
U-statistic test is also used for the same purpose. The stress effect of intervention,
in terms of canopy radius slope, due to uprooting for taking interim readings is seen
to be homogeneous across different seed weights of the plants. Canopy radius is
highly correlated with yam yield. Seed weight 500 g has less residual variation in the
growth curve of canopy radius, thus indicating the possibility of less residual
variation in yam yield as well. This is confirmed in the yam yield data analysis.
Two-sample U-statistics are used in analysing growth data. Error bounds for
two-sample U-statistics from their projection under suitable moment assumptions on
the kernel are also obtained.

2 Materials and Methods

A longitudinal growth experiment on sixty Elephant-foot-yam plants under stress is
conducted in the agricultural farm at the Indian Statistical Institute, Giridih,
with seed weights 500, 650 and 800 g of yam.
The experimental layout consists of six columns; in each column there are ten
equidistant pits at a distance of 1 m. The first two columns are for seed weight 500 g,
the next two are for seed weight 650 g, and the last two columns are for plants with
seed weight 800 g. Column to column distance is also 1 m; the plants are numbered 1–10
in the first column, 11–20 in the second, etc. The plants are uprooted sequentially
in the middle of the growth experiment and the interim yam deposition is measured
by the Archimedean principle of displaced water volume, by submerging the yam part
attached to the stem in a water container, before replanting these to continue the
experiment. An approximate interim yam weight is obtained by multiplying yam
volume by yam density ≈4 g per c.c. Care is taken to minimise the time of exposure
outside the pit for plants, so as to minimise external stress.
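The layout rule above can be made concrete with a small sketch. The function and constant names below are hypothetical and are not taken from the chapter; only the numbering scheme (ten pits per column, two columns per seed weight) and the volume-times-density rule come from the text. The density is left as an argument rather than hard-coded.

```python
# Hypothetical helper names; the layout rule itself is taken from the text.
PLANTS_PER_COLUMN = 10
SEED_WEIGHT_BY_COLUMN = {1: 500, 2: 500, 3: 650, 4: 650, 5: 800, 6: 800}  # grams


def plant_position(plant_no):
    """Return (column, pit), both 1-based, for a plant number in 1..60."""
    if not 1 <= plant_no <= 60:
        raise ValueError("plant number must be between 1 and 60")
    column, pit = divmod(plant_no - 1, PLANTS_PER_COLUMN)
    return column + 1, pit + 1


def seed_weight(plant_no):
    """Seed weight in grams of the column in which the plant stands."""
    column, _ = plant_position(plant_no)
    return SEED_WEIGHT_BY_COLUMN[column]


def interim_weight(volume_cc, density_g_per_cc):
    """Approximate interim yam weight (g) from displaced-water volume (c.c.)."""
    return volume_cc * density_g_per_cc


print(plant_position(53), seed_weight(53))
```

Under this numbering, plant no. 53 falls in the sixth column, which is consistent with its being one of the 800 g plants.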
We analyse resultant growth curves and a data set on yam plant that got detached
from underground yam by a field worker while taking interim reading during the
growth experiments in the years 2013–2014. The seriously wounded plant, detached
from its root structure and corm, survived when the stem structure is replanted, and
deposited a substantial amount of yam in its remaining lifetime when care for plant
survival is taken.
Yam plants have nearly circular leaf structures at the stem-top. The maximum leaf
radius or canopy radius can be measured by pulling part of the leaf structure gently
together upward to the top of stem and measuring the distance between upper point
of leaf from the topmost point of stem by a scale. Usually there are three segments
in the leaf structure on the top of a stem. The maximum reading of leaf lengths over
all the stems in a plant is then recorded as a growth measure.

Some of the causes affecting canopy radius readings over time are abortion of
stem under plant stress, growth of a new stem, partial dehydration affecting growth
in leaf structure etc.
The growth curves are estimated by non-parametric lowess regression and the
residuals are modeled by Ornstein-Uhlenbeck process. This stochastic process with
exponentially decaying correlation function is the only continuous process which
is Gaussian, strongly Markov and stationary. The growth characteristics are seen to
follow normal distributions. Growth status at a particular day may be interpreted as the
growth record of previous day plus an additional growth on that day. Growth of yam
is a continuous variable. These considerations along with a simplifying assumption
on stationary error distributions suggest O-U process as a candidate model.
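As an illustration of the O-U model for residuals, the sketch below simulates a zero-mean stationary O-U path at equally spaced times and recovers its parameters from the lag-one autocorrelation. This is a common moment-based approach and is only an assumption here; the chapter does not state that its estimates were obtained this way. The names `simulate_ou`, `estimate_ou`, `theta` (decay rate of the correlation function) and `stat_var` (stationary variance) are introduced for illustration.

```python
import math
import random


def simulate_ou(n, theta, stat_var, dt=1.0, seed=0):
    """Simulate a zero-mean stationary O-U path at n equally spaced times.

    The correlation between values h time units apart is exp(-theta * h).
    """
    rng = random.Random(seed)
    rho = math.exp(-theta * dt)                 # correlation of neighbours
    innov_sd = math.sqrt(stat_var * (1.0 - rho ** 2))
    x = [rng.gauss(0.0, math.sqrt(stat_var))]   # start in the stationary law
    for _ in range(n - 1):
        x.append(rho * x[-1] + rng.gauss(0.0, innov_sd))
    return x


def estimate_ou(x, dt=1.0):
    """Moment estimates (theta_hat, stationary_variance_hat) from residuals x."""
    n = len(x)
    mean = sum(x) / n
    c0 = sum((v - mean) ** 2 for v in x) / n
    c1 = sum((x[i] - mean) * (x[i + 1] - mean) for i in range(n - 1)) / n
    rho_hat = c1 / c0
    if rho_hat <= 0:
        raise ValueError("lag-one autocorrelation must be positive")
    return -math.log(rho_hat) / dt, c0


path = simulate_ou(20000, theta=0.5, stat_var=4.0, seed=1)
theta_hat, var_hat = estimate_ou(path)
print(round(theta_hat, 3), round(var_hat, 3))
```

Real growth residuals are observed at unequal time gaps, so a likelihood-based fit would be needed in practice; the equally spaced case is used here only to keep the sketch short.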
Process parameters are estimated from the real data set. Residual variability is
compared over seed weights to identify the response curve with less variation.
The growth curve in each category is obtained by computing the average response
(mean/median) at specific time points where at least one yam observation is available
in individual growth curves and then using nonparametric lowess/spline regression on
the averaged points, see Dasgupta (2015a, b). These growth curves explain the inherent
underground yam deposition scenario over plant lifetime.
Almost sure confidence bands are constructed to cover the growth curve with
probability one, i.e., with certainty; see Dasgupta (2015c) for a general exposition on
such bands. These nonparametric almost sure bands make a stronger assertion than
conventional model-based confidence bands with a fixed percentage coverage probability.
We compare yam growth scenarios from growth curve analysis, approximate
normal tests, and nonparametric tests based on the Mann-Whitney U statistic, which
is linearly related to the Wilcoxon two-sample rank-sum statistic. In earlier growth
studies seed weight 650 g of yam was seen to be appropriate for high yield in the
agro-climatic environment of Giridih.
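The linear relation between the two statistics can be checked numerically. In the sketch below, U counts the pairs in which an observation from the first sample exceeds one from the second (ties counted as one half), and W is the sum of the mid-ranks of the first sample in the pooled sample; then W = U + m(m + 1)/2, where m is the size of the first sample. The function names and the toy data are illustrative only.

```python
def mann_whitney_u(x, y):
    """U = number of pairs (a, b), a in x and b in y, with a > b; ties count 1/2."""
    return sum(1.0 if a > b else 0.5 if a == b else 0.0 for a in x for b in y)


def wilcoxon_rank_sum(x, y):
    """Sum of the mid-ranks of the x sample within the pooled sample."""
    pooled = list(x) + list(y)

    def midrank(v):
        below = sum(1 for p in pooled if p < v)
        equal = sum(1 for p in pooled if p == v)
        return below + (equal + 1) / 2.0

    return sum(midrank(v) for v in x)


x = [1.2, 3.4, 5.6]   # toy data, not from the experiment
y = [2.0, 4.0]
m = len(x)
print(mann_whitney_u(x, y), wilcoxon_rank_sum(x, y) - m * (m + 1) / 2)
```

Both printed numbers agree, which is the sense in which a test based on one statistic is equivalent to a test based on the other.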

3 Results

3.1 Estimation of Growth Curves

We analyse data on a yam plant that was detached from underground deposition
while taking interim reading during the growth experiments in the years 2013–2014.
The seriously wounded plant was detached from its root structure and corm. The
stem structure survived when replanted, and later deposited a substantial amount of
yam in the remaining lifetime when appropriate care for its survival is taken.
Fig. 1 Maximum leaf length of 20 yam plants with seed weight 800 g in calendar days (x-axis: time, 11 Apr to 06 Jan; y-axis: leaf length in cm)

The maximum diameter of leaf spread showed the steepest decline in slope after
wounding. Later, with the progress of time, the leaf spread of the wounded plant
became steady compared to other plants in the same category of 800 g seed
weight. The remaining lifetime of this plant is quite high (118 days, rank 18 out of
20 plants) when planted under shade to protect it from harsh summer, at a distance
from the experimental plot. The additional yam deposition (998 g) is also quite high,
with rank 16 for the wounded plant.
The rate of yam deposition (the ratio of deposition to remaining lifetime) is of
moderate rank 9, as the remaining lifetime is high.
The plant stopped aboveground vegetative growth to heal itself after being severely
wounded. Under due care, the plant survived long and the rate of yam deposition was
moderate (middle rank) over the remaining lifetime.
The growth characteristics of yam plants with seed weight 800 g are given in
Table 1.
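The deposition rate and its rank for the wounded plant can be reproduced from the lifetime and deposition columns of Table 1. The sketch below transcribes those two columns (the deposition of 998 g appears as 0.998, i.e. in kilograms) and ranks from smallest to largest; the dictionary and function names are illustrative, and the simple ranking assumes no tied values.

```python
# (lifetime after intervention in days, yam deposition after intervention in kg),
# keyed by plant number; values transcribed from Table 1.
TABLE_1 = {
    41: (68, -1.027652), 42: (79, 0.564),     43: (52, 0.85538),
    44: (52, 0.620336),  45: (79, 0.87588),   46: (79, 0.09672),
    47: (60, 0.503084),  48: (79, 1.14736),   49: (68, 1.150648),
    50: (52, 0.797176),  51: (142, 1.436472), 52: (94, 1.118152),
    53: (118, 0.998),    54: (52, 0.111),     55: (52, 0.59436),
    56: (110, 0.2316),   57: (142, 0.8869),   58: (79, 0.93326),
    59: (52, 0.60636),   60: (52, -0.22734),
}


def ranks(values):
    """Rank 1 = smallest value; assumes no ties among the values."""
    order = sorted(values, key=values.get)
    return {plant: i + 1 for i, plant in enumerate(order)}


deposition = {p: dep for p, (_, dep) in TABLE_1.items()}
rate = {p: dep / days for p, (days, dep) in TABLE_1.items()}   # kg per day
deposition_rank = ranks(deposition)
rate_rank = ranks(rate)

print(deposition_rank[53], round(rate[53], 9), rate_rank[53])
```

For plant no. 53 this gives deposition rank 16 and rate rank 9, matching the values quoted in the text.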
In Fig. 1, we plot maximum leaf radius (canopy radius) of 20 yam plants grown
from seed weight 800 g over calendar days in the years 2013–2014. The vertical
line in the middle of the graph indicates the time of intervention by uprooting the
plants for taking interim readings. The uprooted plants were then replanted for
further continuance of the growth experiment. The red line represents the
characteristics of wounded plant no. 53. Observe the sharp decrease of maximum leaf
radius immediately after intervention compared to other curves. The wounded plant
eventually makes a recovery towards the end of plant lifetime as seen from the latter
part of the canopy radius curve.

Table 1 Yam plant characteristics with seed weight 800 g after intervention of taking interim weight
Plant  Plant lifetime  Rank  Slope of max   Rank  Yam deposition  Rank  Yam deposition  Rank
no.    after                 leaf length          after                 slope after
       intervention          after                intervention          intervention
       (day)                 intervention         (kg)                  (kg/day)
                             (cm/day)
41     68              9     0              2     −1.027652       1     −0.015112529    1
42     79              11    0.466666667    11    0.564           7     0.007139241     7
43     52              1     0.166666667    7     0.85538         12    0.016449615     19
44     52              2     0.333333333    10    0.620336        10    0.011929538     16
45     79              12    0              3     0.87588         13    0.011087089     11
46     79              13    0.833333333    14    0.09672         3     0.001224304     3
47     60              8     1              17    0.503084        6     0.008384733     8
48     79              14    1.166666667    20    1.14736         18    0.014523544     17
49     68              10    0              4     1.150648        19    0.016921294     20
50     52              3     0              5     0.797176        11    0.015330308     18
51     142             19    1              18    1.436472        20    0.010116        10
52     94              16    1              19    1.118152        17    0.011895234     15
53     118             18    −0.321428571   1     0.998           16    0.008457627     9
54     52              4     0.166666667    8     0.111           4     0.002134615     5
55     52              5     0.166666667    9     0.59436         8     0.01143         12
56     110             17    0.5            12    0.2316          5     0.002105455     4
57     142             20    0.033333333    6     0.8869          14    0.006245775     6
58     79              15    0.833333333    15    0.93326         15    0.011813418     14
59     52              6     0.833333333    16    0.60636         9     0.011660769     13
60     52              7     0.666666667    13    −0.22734        2     −0.004371923    2

Fig. 2 Maximum leaf length of 20 yam plants with seed weight 650 g in calendar days (x-axis: time, 11 Apr to 06 Jan; y-axis: leaf length in cm)
Figure 2 plots the same characteristic of maximum leaf radius of 20 yam plants
grown from seed weight 650 g over calendar days. Decrease in canopy radius is seen
after a while from the time of intervention.
Curves for plant nos. 21 and 28 have a break in between, as the main stems
died and secondary stems sprouted after a gap.
Figure 3 plots the maximum leaf radius of 20 yam plants grown from seed weight
500 g over calendar days. Interim readings were taken over two days, unlike plants with
seed weight 650 and 800 g. The curves show a pattern similar to that of the earlier figures.
Note that the peaks of the curves in Figs. 1, 2 and 3 increase with increase in seed
weight.
Fig. 3 Maximum leaf length of 20 yam plants with seed weight 500 g in calendar days (x-axis: time, 11 Apr to 06 Jan; y-axis: leaf length in cm)

Figures 4, 5 and 6 plot the maximum girth perimeter at the top of stems
corresponding to seed weights 800, 650 and 500 g respectively. The curve corresponding
to the wounded plant is marked in red in Fig. 4. However, there are other curves with
slopes similar to the red curve near the time of intervention; as such, severe injury
of plant number 53 does not seem to affect the girth perimeter to that extent, unlike
canopy radius. The peaks of the curves in Figs. 4, 5 and 6 increase
with increase in seed weight, as in Figs. 1, 2 and 3.
In Figs. 7 and 8 we show normal quantile plots for maximum girth perimeter at
the top of stems corresponding to seed weight 800 g before and after intervention,
respectively. The correlation coefficients are 0.9688 and 0.9799 respectively. Normal
distribution seems to be a plausible model for these characteristics.
In Figs. 9 and 10 we show normal quantile plots for maximum leaf radius
corresponding to seed weight 800 g before and after intervention, respectively. The
correlation coefficients are 0.9684 and 0.9830 respectively. Normal distribution, once
again, seems to be a plausible model for these characteristics. Note the downward
change in position of the red point corresponding to the wounded plant in Fig. 10
compared to Fig. 9, indicating a sharp fall of leaf radius after injury; thus the maximum
leaf radius is a sensitive measure for severe plant injury.
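The correlation coefficient of a normal quantile plot measures how close the plotted points are to a straight line. A minimal sketch of one way to compute it is given below; the plotting positions (i − 0.375)/(n + 0.25) are an assumed convention, since the text does not say which positions were used, and the function name is illustrative.

```python
from statistics import NormalDist, mean


def normal_qq_correlation(sample):
    """Correlation between the sorted data and standard normal quantiles."""
    n = len(sample)
    ordered = sorted(sample)
    # Blom-type plotting positions (an assumed convention, not from the text).
    probs = [(i - 0.375) / (n + 0.25) for i in range(1, n + 1)]
    quantiles = [NormalDist().inv_cdf(p) for p in probs]
    mq, mv = mean(quantiles), mean(ordered)
    sqv = sum((q - mq) * (v - mv) for q, v in zip(quantiles, ordered))
    sqq = sum((q - mq) ** 2 for q in quantiles)
    svv = sum((v - mv) ** 2 for v in ordered)
    return sqv / (sqq * svv) ** 0.5


# Toy data: nine equal readings and one extreme value give a poor normal fit.
print(round(normal_qq_correlation([1, 1, 1, 1, 1, 1, 1, 1, 1, 100]), 3))
```

Values close to one, such as those reported for Figs. 7 to 10, indicate that the ordered observations are nearly linear in the normal quantiles.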
Fig. 4 Maximum girth on top of 20 yam plants with seed weight 800 g in calendar days (x-axis: time, 11 Apr to 06 Jan; y-axis: girth on top in cm)
Fig. 5 Maximum girth on top of 20 yam plants with seed weight 650 g in calendar days (x-axis: time, 11 Apr to 06 Jan; y-axis: girth on top in cm)
Fig. 6 Maximum girth on top of 20 yam plants with seed weight 500 g in calendar days (x-axis: time, 11 Apr to 06 Jan; y-axis: girth on top in cm)
Fig. 7 Normal quantile plot for top girth of seed weight 800 g before intervention (x-axis: normal quantiles; y-axis: girth on plant top)
Fig. 8 Normal quantile plot for top girth of seed weight 800 g after intervention (x-axis: normal quantiles; y-axis: girth on plant top)
Fig. 9 Normal quantile plot for maximum leaf length of seed weight 800 g before intervention (x-axis: normal quantiles; y-axis: maximum leaf length)
Fig. 10 Normal quantile plot for maximum leaf length of seed weight 800 g after intervention (x-axis: normal quantiles; y-axis: maximum leaf length)
Fig. 11 Individual growth curves of yam and median response curve (x-axis: life time in days; y-axis: yam yield in kg)
Fig. 12 Individual growth curves of yam and median response curve for seed weight 500 g (x-axis: life time in days; y-axis: yam yield in kg)


In Fig. 11 we show the 60 individual growth curves of underground yam with
linear interpolation in between growth readings in each plant. The median response
curve is computed from the median of y values of individual curves, with spacing of
a day on the x axis, over all plants. These medians, joined by lines, are shown in red.
A general increasing trend is seen in the curve, with indication of steep growth
towards the end.
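The construction of the median response curve can be sketched as follows. Each plant's readings are joined by straight lines, every curve is evaluated on a daily grid, and the median is taken over the plants at each day. How plants are treated outside the span of their own readings is not stated in the text; the sketch assumes they are simply left out for those days. Function names and the toy readings are illustrative.

```python
from statistics import median


def interpolate(times, values, t):
    """Linear interpolation of one plant's readings at time t; None outside range."""
    if t < times[0] or t > times[-1]:
        return None
    pairs = list(zip(times, values))
    for (t0, y0), (t1, y1) in zip(pairs, pairs[1:]):
        if t0 <= t <= t1:
            return y0 + (y1 - y0) * (t - t0) / (t1 - t0)
    return values[-1]  # only reached when the plant has a single reading


def median_response(curves, step=1):
    """curves: list of (times, values) per plant; times are increasing integer days.

    Returns a list of (day, median of interpolated values over plants covering that day).
    """
    start = min(times[0] for times, _ in curves)
    stop = max(times[-1] for times, _ in curves)
    result = []
    for day in range(start, stop + 1, step):
        ys = [interpolate(t, v, day) for t, v in curves]
        ys = [y for y in ys if y is not None]
        if ys:  # skip days on which no plant has interpolated data
            result.append((day, median(ys)))
    return result


toy = [([0, 2], [0.0, 2.0]), ([0, 2], [1.0, 1.0]), ([1, 2], [4.0, 6.0])]
print(median_response(toy))
```

Replacing `median` by the mean gives the alternative average response; the median is less affected by a single unusual plant.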
Figure 12 shows 20 individual growth curves corresponding to the seed weight
500 g. The median response curve in blue shows sharp upturn towards the end.
In Fig. 13 we show the yam growth curves for seed weight 650 g. The median
response curve shows similar upturn towards the end like that for seed weight 500 g.
Figure 14 shows the individual growth curves for seed weight 800 g. The growth curve
of the injured plant with interim yam detached is shown in red. It remains in the
upper part of Fig. 14. The upturn of the median curve in blue is slightly dampened
towards the end.
In Fig. 15 we show lowess smoothed median response curves with f = 2/3 for
different seed weights in the same frame for easy comparison. The overall response
curve of Fig. 11 is also shown as a dashed curve. The overall growth curve mimics
the curve corresponding to seed weight 650 g, remaining slightly below the latter.
The figure indicates that the growth curve corresponding to 650 g of seed weight
outperforms the other two curves and the overall response curve as well,
confirming the earlier finding that seed weight 650 g is appropriate for cultivation in
such lateritic alluvial soil as seen in Giridih.
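The role of the span f in lowess can be seen from a bare-bones version of the smoother. The sketch below fits a weighted straight line around each point, using the nearest ⌈f·n⌉ points and tricube weights. It is a simplification: the robustness iterations of the full lowess procedure are omitted, at least two points are assumed, and the function name is illustrative, so its output will not match a library implementation exactly.

```python
import math


def lowess(x, y, f=2.0 / 3.0):
    """Locally weighted linear fit at each x[i] (no robustness iterations)."""
    n = len(x)
    k = max(2, math.ceil(f * n))          # number of neighbours in each window
    fitted = []
    for x0 in x:
        dist = sorted(abs(xi - x0) for xi in x)
        # Bandwidth: distance to the k-th nearest point (x0 itself is the first).
        h = dist[k - 1] or 1.0
        w = [max(0.0, 1.0 - (abs(xi - x0) / h) ** 3) ** 3 for xi in x]  # tricube
        sw = sum(w)
        mx = sum(wi * xi for wi, xi in zip(w, x)) / sw
        my = sum(wi * yi for wi, yi in zip(w, y)) / sw
        sxx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x))
        sxy = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y))
        slope = sxy / sxx if sxx > 0 else 0.0
        fitted.append(my + slope * (x0 - mx))
    return fitted


days = [float(d) for d in range(0, 100, 10)]          # toy time grid
yield_kg = [0.5, 0.7, 0.6, 1.0, 1.4, 1.3, 1.9, 2.4, 2.3, 3.0]   # toy values
print([round(v, 2) for v in lowess(days, yield_kg)])
```

A larger f uses more neighbours in each local fit and gives a smoother curve; f = 2/3 means that about two thirds of the points contribute to the fit at each time.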
Fig. 13 Individual growth curves of yam and median response curve for seed weight 650 g (x-axis: life time in days; y-axis: yam yield in kg)
Fig. 14 Individual growth curves of yam and median response curve for seed weight 800 g (x-axis: life time in days; y-axis: yam yield in kg)
Fig. 15 Growth curves of yam (legend: 500 gm., 650 gm., 800 gm., median response; x-axis: life time in days; y-axis: yam yield in kg)
Fig. 16 Band of girth at top of the stem: 800 g seed weight (x-axis: time in days; y-axis: girth on top in cm)
Fig. 17 Stretched band of girth with central line as base: 800 g seed weight (x-axis: time in days; y-axis: residual in cm)

Table 2 Variation in yam growth
Figure no.  Upper area under curve   Max upper height
17          677.2 (cm × day)         6.25 (cm)
19          631.175 (cm × day)       6.0 (cm)
21          618.675 (cm × day)       5.5 (cm)
23          4577.55 (cm × day)       40.75 (cm)
25          4021.675 (cm × day)      36.25 (cm)
27          3644.6 (cm × day)        29.5 (cm)
29          250.3589 (kg × day)      1.565385 (kg)
31          337.1626 (kg × day)      2.191087 (kg)
33          232.5363 (kg × day)      2.199239 (kg)

An overall growth curve may also be drawn by tracing the midpoint of the range of y
values at each fixed time point on the x axis, from a collection of individual growth
curves of plants. The procedure is valid under the assumption of a symmetric error
component in growth curve estimation, as in the O-U process. Variation of the O-U
process has an almost sure bound, as explained in the appendix. The growth
characteristics and growth curves of underground yam are associated with suitable
bands in the following figures.

Fig. 18 Band of girth at top of the stem: 650 g seed weight (x-axis: time in days; y-axis: girth on top in cm)

Figure 16 shows the minimal band containing all the curves and having smallest
area for plant girth at the ‘top of stem’ with 800 g of seed weight; computed from
individual growth curves as shown in Fig. 4. The central line of the band is computed
from the maximum and minimum of y values for each x, and shown in red color.
For each fixed time x, the band has the minimum width in y.
Under the assumption of symmetry, the central curve is an estimate of growth of
plant girth over time.
Considering the red line as base, the deviations of the upper and lower curves in
band from central line may be viewed as maximum fluctuation of errors above and
below the central line. Such residuals of girth measurements for seed weight 800 g
are shown in Fig. 17. The lower curve is the mirror reflection of the upper curve.
The maximum height and the area under the curve may be interpreted as functions
of process parameters in O-U model, and these may be compared to assess residual
variation over different seed weights. Residual variations in these growth curves in
terms of height and area are given in Table 2.
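
The band construction described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names and the toy girth readings are hypothetical, all curves are assumed to be observed on a common time grid, and the trapezoidal rule is an assumed choice for the area, since the text does not say how the areas in Table 2 were computed.

```python
def minimal_band(curves):
    """Pointwise band of a family of growth curves on a common time grid.

    curves : list of curves, each a list of y values at the same time points.
    Returns (lower, upper, central), where central is the midpoint of the range.
    """
    lower = [min(ys) for ys in zip(*curves)]
    upper = [max(ys) for ys in zip(*curves)]
    central = [(lo + hi) / 2.0 for lo, hi in zip(lower, upper)]
    return lower, upper, central


def stretched_band_summary(times, lower, upper, central):
    """Residual band about the central line, its peak height and upper area."""
    upper_res = [hi - c for hi, c in zip(upper, central)]
    lower_res = [lo - c for lo, c in zip(lower, central)]  # mirror image of upper_res
    max_height = max(upper_res)
    # Trapezoidal rule (an assumption) for the area between upper_res and y = 0.
    area = sum(
        0.5 * (upper_res[k] + upper_res[k + 1]) * (times[k + 1] - times[k])
        for k in range(len(times) - 1)
    )
    return upper_res, lower_res, max_height, area


# Hypothetical girth readings (cm) of three plants at four time points (days).
times = [0, 15, 36, 56]
curves = [
    [0.0, 4.0, 8.0, 10.0],
    [0.0, 6.0, 9.0, 12.0],
    [0.0, 5.0, 11.0, 11.0],
]
lower, upper, central = minimal_band(curves)
upper_res, lower_res, max_height, area = stretched_band_summary(times, lower, upper, central)
print(central, max_height, area)
```

The two summaries returned at the end, peak height and upper area, play the role of the two columns of Table 2.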
Fig. 19 Stretched band of girth with central line as base: 650 g seed weight (horizontal axis: time in days; vertical axis: residual girth in cm)


Figures 18 and 19 show the same characteristics for seed weight 650 g; Figs. 20 and 21 correspond to seed weight 500 g. Residual variation of top girth seems to be smaller for seed weight 500 g.
A similar analysis may be made for the canopy radius of yam plants to check the growth status. For seed weight 800 g, Fig. 22 shows the upper, lower and central curves in the band of canopy radius. Figure 23 shows the residuals of canopy radius from the central line.
Figures 24 and 25 show the same for seed weight 650 g, and Figs. 26 and 27 correspond to seed weight 500 g. Here again, seed weight 500 g seems to have less residual variation in the canopy radius growth curve.
Figure 28 shows the growth curve of yam for seed weight 800 g computed from the minimal band criterion. The upper and lower curves are shown along with the mid-band curve in red as the central line, which may be interpreted as the yam growth curve under the assumption of a symmetric error distribution. Figure 29 shows the yam growth residuals as deviations of the upper and lower curves from the central line.
Figures 30 and 31 show similar features of the growth curve from the minimal band criterion and the growth residuals for yam seed weight 650 g. Figures 32 and 33 refer to the same for seed weight 500 g.
Fig. 20 Band of girth at top of the stem: 500 g seed weight (horizontal axis: time in days; vertical axis: girth on top in cm)


From Figs. 28, 30 and 32, harvest after 225 days from sprouting of plants may be
recommended, as relative stability of the growth is seen at the mature stage of yam
plants. See also Figs. 12, 13 and 14.
Figure 34 shows the lowess smoothed growth curves with f = 2/3, computed
from Figs. 28, 30 and 32. Figure 35 incorporates almost sure confidence band to the
growth curves.
One may compare these growth curves and their almost sure confidence bands with those given in Fig. 15, where lowess smoothed median response curves with
f = 2/3, computed without the assumption of symmetric error, are shown.
Figure 36 shows the earlier growth curves, computed without the assumption of
symmetric error, along with associated almost sure confidence bands. There are
some similarities among the curves of Figs. 35 and 36. Seed weight 650 g is seen to
be superior for higher yield.
Residuals of the yield data from the estimated growth curve may be modeled by
an O-U process. The diffusion and drift parameters of the process are of interest
for examining residual variability. Estimation of the parameters is explained in Sect. 3.3.
The following two figures show the relationship of these two parameters with yam
seed weight. The figures are drawn in SPlus with the program linesspline and smoothing
parameter n = 300.
Fig. 21 Stretched band of girth with central line as base: 500 g seed weight (horizontal axis: time in days; vertical axis: residual girth in cm)


Figure 37 shows the variation of the drift parameter α of the O-U model across seed weights, fitted by spline regression. The curve reaches a peak slightly above seed weight 650 g.
Figure 38 shows the variation of the diffusion parameter $\sigma^2$ of the O-U model with seed weight. Here the curve reaches a peak at seed weight 650 g.

3.2 Mann-Whitney U-Statistic and Other Tests

Next we test whether the intervention of uprooting to take interim readings has a homogeneous effect on canopy radius immediately after the intervention, for seed weights 500, 650 and 800 g. The values of the leaf radius slopes $x_{500}$, $x_{650}$, $x_{800}$ around the time of uprooting are given in Table 3. The Mann-Whitney U-statistic is linearly related to the Wilcoxon two-sample statistic, that is, the sum of ranks of the first (or second) sample when ranks in the combined sample are considered; as such, the two tests are equivalent. With the bounded kernels $I(x_{500} < x_{650})$, $I(x_{650} < x_{800})$ and $I(x_{500} < x_{800})$, two-sample U-statistics may be considered for testing the hypothesis of homogeneity of the stress effect due to uprooting over different seed weights.
Fig. 22 Band of canopy radius: 800 g seed weight (horizontal axis: time in days; vertical axis: canopy radius in cm)


For the null hypothesis that two populations are the same against an alternative hypothesis, especially that a particular population tends to have larger values than the other, the Wilcoxon test is an efficient nonparametric test; the null asymptotic distribution of the standardised U is N(0, 1).
The standardised statistic is
\[
U^{*} = \Bigl(U - \frac{n_1 n_2}{2}\Bigr) \Big/ \Bigl\{\frac{n_1 n_2 (n_1 + n_2 + 1)}{12}\Bigr\}^{1/2}.
\]
For sample sizes $n_1 = n_2 = 20$ and the above kernels, the values of $U^{*}$ are −0.108, 1.894 and 1.65 respectively, to be compared with a normal deviate. The values are insignificant; the middle one has p-value 0.0582 against a two-sided alternative. This indicates a plausibly homogeneous effect of the intervention, in terms of canopy radius slope, due to uprooting for taking interim readings across different seed weights of the plants.
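
As a minimal sketch of the computation, the standardised two-sample U-statistic with kernel I(x < y) and its normal tail probability may be coded as below. The function names and the toy samples are hypothetical; ties contribute zero to the kernel and no tie correction is applied to the variance, which is an assumption and not necessarily what was done for the values quoted above.

```python
import math


def standardized_u(first, second):
    """Two-sample U with kernel I(x < y), standardised under the null hypothesis."""
    n1, n2 = len(first), len(second)
    u = sum(1 for x in first for y in second if x < y)
    mean = n1 * n2 / 2.0
    sd = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12.0)
    return (u - mean) / sd


def normal_upper_tail(z):
    """P(Z > z) for a standard normal variable Z."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))


# Hypothetical samples, used only to exercise the function.
toy_first = [0.1, 0.4, 0.2]
toy_second = [0.3, 0.5, 0.6]
toy_u_star = standardized_u(toy_first, toy_second)

# Two-sided p-value for the middle statistic quoted in the text (U* = 1.894).
p_two_sided = 2.0 * normal_upper_tail(1.894)
print(toy_u_star, round(p_two_sided, 4))
```

The last two lines recompute the two-sided p-value attached to the standardised value 1.894.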
Growth curve analysis indicates the superiority of seed weight 650 g, based on
longitudinal analysis with interim and final yam weights. A conventional analysis based
on the final harvest may also be made. Final yields in kg for the different seed weights are
given below. For seed weight 500 g, the yam yields are as follows:
6.003, 2.528, 2.624, 3.591, 2.924, 3.104, 2.116, 2.37, 3.306, 2.458, 3.527, 4.54,
2.374, 3.383, 2.88, 4.366, 3.637, 2.45, 2.773, 1.792
Fig. 23 Stretched band of canopy radius with central line as base: 800 g seed weight (horizontal axis: time in days; vertical axis: residual canopy radius in cm)

For seed weight 650 g, these are


6.727, 2.961, 4.14, 4.554, 3.412, 3.333, 3.953, 4.135, 4.099, 2.806, 4.072,
4.162, 1.383, 2.403, 3.044, 2.795, 4.577, 3.224, 4.607, 3.978
Finally, for seed weight 800 g, final yields are
2.065, 3.417, 5.249, 4.295, 3.843, 2.265, 2.135, 3.658, 3.958, 3.673, 4.221,
5.158, 5.11, 2.964, 3.105, 2.514, 2.884, 2.588, 3.117, 1.998
We may test which seed weight is superior for higher yield by two-sample U-statistics with similar kernels $I(x_{500} < x_{650})$, $I(x_{650} < x_{800})$ and $I(x_{500} < x_{800})$, where x now represents yam yield at plant maturity, i.e., yield at final harvest, with the associated seed weight given in the suffix. The standardised values of
\[
U^{*} = \Bigl(U - \frac{n_1 n_2}{2}\Bigr) \Big/ \Bigl\{\frac{n_1 n_2 (n_1 + n_2 + 1)}{12}\Bigr\}^{1/2}
\]
for $n_1 = n_2 = 20$ are 2.164007, −0.9738032 and 1.000853 respectively. For a one-sided test, the p-value for the first $U^{*}$ is 0.015, which is low, indicating that seed weight 650 g is superior to 500 g for yam yield; the other two test statistic values are insignificant.
We may further check for equality of mean yield μ over different seed weights by an approximate normal test of the type
\[
z = (\bar{x} - \bar{y}) \big/ \{(s_x^2/m) + (s_y^2/n)\}^{1/2},
\]
the alternatives being $\mu_{500} < \mu_{650}$, $\mu_{650} < \mu_{800}$ and $\mu_{500} < \mu_{800}$. The values of the statistic for the above three comparisons are 1.818445, −0.9390781 and 0.8819223 respectively. The first z is significant at the 5% level, with associated p-value 0.0345, which is low; the other two z values are insignificant when compared with the normal table.
Fig. 24 Band of canopy radius: 650 g seed weight (horizontal axis: time in days; vertical axis: canopy radius in cm)


The z values may also be compared with Welch’s t test criterion, see Welch (1947). However, the degrees of freedom
\[
d = \{(s_x^2/m) + (s_y^2/n)\}^2 \big/ \{(s_x^2/m)^2/(m-1) + (s_y^2/n)^2/(n-1)\}
\]
in the t test turn out to be high, approximately 38 in all cases, suggesting that the normal test is a close approximation.
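
The z statistic and the Welch degrees of freedom can be sketched as follows. The function name and the toy samples are hypothetical, and the sample variances are taken with divisor (sample size − 1), which is an assumption about the convention used.

```python
import math
from statistics import mean, variance


def welch_z_and_df(x, y):
    """Approximate normal statistic z and Welch degrees of freedom d."""
    m, n = len(x), len(y)
    a = variance(x) / m  # s_x^2 / m, with divisor m - 1 inside variance()
    b = variance(y) / n  # s_y^2 / n
    z = (mean(x) - mean(y)) / math.sqrt(a + b)
    d = (a + b) ** 2 / (a ** 2 / (m - 1) + b ** 2 / (n - 1))
    return z, d


# Hypothetical samples with equal sizes and equal sample variances.
x = [2.0, 4.0, 6.0, 8.0]
y = [1.0, 3.0, 5.0, 7.0]
z, d = welch_z_and_df(x, y)
print(z, d)
```

With equal sample sizes and equal sample variances, d attains its largest possible value m + n − 2 (here 6). For m = n = 20 that upper limit is 38, which is consistent with the degrees of freedom reported above being close to 38.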

3.3 Comparison of Growth by Ornstein-Uhlenbeck Process

We first provide an introduction to the process and its salient properties, on which the comparison of yam growths is based. The Ornstein-Uhlenbeck process V(s) is a continuous Gaussian Markov process with constant mean and exponentially decaying covariance structure. It satisfies the following differential equation,

\[
dV(s) = -\alpha V(s)\,ds + \sigma\,dB(s), \quad \alpha > 0,\ \sigma > 0, \tag{3.3.1}
\]

where B(s) is the standard Brownian motion, α is the drift parameter, and αV(s) is a restoring force directed towards the origin, proportional to the distance V(s). Here V(s) represents the distance of the trajectory from the mean line y = 0 at time s (= x) in the x, y plane. Since the restoring force drifts the process towards the origin, α is called the drift parameter; σ is called the diffusion parameter, as it relates to the spread of the process.
The quantity V(s) may be used to model the distance at time s of an observed trajectory from the line of mean response. The process may be used to model material wastage in industrial production, e.g., see Dasgupta (2006a).

Fig. 25 Stretched band of canopy radius with central line as base: 650 g seed weight (horizontal axis: time in days; vertical axis: residual canopy radius in cm)

The successive increments of Brownian motion are independent and normally distributed with zero mean and variance proportional to the length of the increments. Therefore the m.l.e. of $\sigma^2$ based on the likelihood calculated at the grid points $jt2^{-n}$, $j = 0, 1, \ldots, 2^n$, of $[0, t]$ is the following:
\[
\hat{\sigma}^2 = \frac{1}{t} \sum_{j=1}^{2^n} [B(jt2^{-n}) - B((j-1)t2^{-n})]^2 .
\]
By transforming the process V(s) to the corresponding Brownian motion B(s), one may write, according to a result given in Lemma 4.2, page 212 of Basawa and Rao (1980), the following:
\[
\lim_{n \to \infty} \frac{1}{t} \sum_{j=1}^{2^n} [V(jt2^{-n}) - V((j-1)t2^{-n})]^2 = \sigma^2 \quad \text{a.s.} \tag{3.3.2}
\]
Fig. 26 Band of canopy radius: 500 g seed weight (horizontal axis: time in days; vertical axis: canopy radius in cm)


Therefore one may consider grids of finer length and then select that grid size for
which the estimate of σ 2 stabilizes.
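
The grid refinement can be sketched as below: the sum of squared increments divided by the time span is evaluated on successively finer sub-grids of one residual path, and one looks for the grid size at which the value settles. The function names, the step sizes and the toy path are hypothetical, and the path is assumed to be recorded at equally spaced time points.

```python
def sigma2_hat(path, total_time):
    """Sum of squared increments of an equally spaced path, divided by the time span."""
    return sum((b - a) ** 2 for a, b in zip(path, path[1:])) / total_time


def refine(path_on_finest_grid, total_time, steps=(8, 4, 2, 1)):
    """Estimates on successively finer sub-grids (coarse to fine) of one path."""
    return [sigma2_hat(path_on_finest_grid[::k], total_time) for k in steps]


# Hypothetical residual path (kg) on 9 equally spaced points spanning 40 days.
path = [0.00, 0.02, -0.01, 0.03, 0.01, -0.02, 0.00, 0.02, 0.01]
estimates = refine(path, 40.0)
for k, est in zip((8, 4, 2, 1), estimates):
    print(f"every {k} point(s): sigma^2 estimate = {est:.3e}")
```

A toy path this short cannot show stabilisation; the sketch only fixes the arithmetic of the estimator.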
We consider maximum likelihood estimation of the parameter of the O-U process based on a single realization. Following Example 5.4, pages 187–188 of Basawa and Rao (1980), the m.l.e. of α is the following:
\[
\hat{\alpha} = -\int_0^t V(s)\,dV(s) \Big/ \int_0^t V^2(s)\,ds
= \frac{1}{2} \Bigl[\int_0^t V^2(s)\,ds\Bigr]^{-1} [\sigma^2 t + V^2(0) - V^2(t)]. \tag{3.3.3}
\]

We shall see later that $\hat{\alpha} > 0$ with probability 1 as $t \to \infty$.
The asymptotic distribution of $\hat{\alpha} = \hat{\alpha}(t)$ is normal with mean α and variance $[\int_0^t V^2(s)\,ds]^{-1}$, i.e.,
\[
\Bigl[\int_0^t V^2(s)\,ds\Bigr]^{1/2} (\hat{\alpha}(t) - \alpha) \sim AN(0, 1), \tag{3.3.4}
\]
see e.g., Brown and Hewitt (1975). By the LIL of standard Brownian motion, e.g. see Chung (1948),
\[
\limsup_{t \to \infty}\, (2t \log\log t)^{-1/2} B(t) = 1 \quad \text{a.s.} \tag{3.3.5}
\]
and
\[
\limsup_{t \to \infty}\, (2t \log\log t)^{-1/2} \sup_{0 \le s \le t} |B(s)| = 1 \quad \text{a.s.} \tag{3.3.6}
\]
Using the relationship $V(s) = e^{-\alpha s} B[\sigma^2 (e^{2\alpha s} - 1)/2\alpha]$, see e.g., Karlin and Taylor (1981), one may write
\[
\limsup_{t \to \infty}\, \Bigl[\frac{\sigma^2}{\alpha} (1 + o(1)) \log t\Bigr]^{-1/2} V(t) = 1 \quad \text{a.s.} \tag{3.3.7}
\]
and
\[
\limsup_{t \to \infty}\, \Bigl[\frac{\sigma^2}{\alpha} (1 + o(1)) \log t\Bigr]^{-1/2} \sup_{0 \le s \le t} |V(s)| = 1 \quad \text{a.s.} \tag{3.3.8}
\]
One should indeed write the ≤ sign in (3.3.8) instead of equality, but the equality sign holds in (3.3.8) in view of the equality in (3.3.7). Observe further that $\sup_{0 \le s \le t} |V(s)| = O(\log t)^{1/2}$ a.s. Thus from (3.3.3), $\hat{\alpha} > 0$ with probability one as $t \to \infty$.

Fig. 27 Stretched band of canopy radius with central line as base: 500 g seed weight (horizontal axis: time in days; vertical axis: residual canopy radius in cm)
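
For a path recorded on an equally spaced grid, the closed form in (3.3.3) can be approximated as in the sketch below. The function name and the toy path are hypothetical, the diffusion parameter is supplied by the user, and the trapezoidal rule for the integral of V² is an assumed discretisation.

```python
def alpha_hat(path, dt, sigma2):
    """Discretised version of (3.3.3) for a path V(0), V(dt), ..., V(t)."""
    t = dt * (len(path) - 1)
    # Trapezoidal approximation (an assumption) of the integral of V^2 over [0, t].
    integral_v2 = sum(0.5 * (a * a + b * b) * dt for a, b in zip(path, path[1:]))
    return (sigma2 * t + path[0] ** 2 - path[-1] ** 2) / (2.0 * integral_v2)


# Hypothetical path on the grid 0, 1, 2, 3, 4 with sigma^2 = 0.5.
toy_path = [0.0, 1.0, 0.0, -1.0, 0.0]
estimate = alpha_hat(toy_path, dt=1.0, sigma2=0.5)
print(estimate)
```

Since V²(t) grows at most like log t while σ²t grows linearly, the numerator is eventually positive, which is the reason the estimate is positive for large t.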
Fig. 28 Band of yam yield: 800 g seed weight (horizontal axis: time in days; vertical axis: yam yield in kg)


Hence from (3.3.7) and (3.3.8), the fluctuation of the process depends on the parameter $\theta = \sigma/\alpha^{1/2}$. The higher the value of θ, the higher the uncertainty in the trajectory due to fluctuation; note that $\theta/\sqrt{2} = \sigma/(2\alpha)^{1/2}$ can also be interpreted as the standard deviation of the limiting distribution of V(s), s → ∞. Let $\theta_1$ and $\theta_2$ be the parameters of the two processes to be compared. Then the ratio $\theta_2/\theta_1$ is the ratio of the two asymptotic standard deviations of the processes.
Assume that the residual curves from the mean response of yield with seed weight 800 g, say, i.e., the deviations of 20 longitudinal yields from the central line in Fig. 28, are realisations of an O-U process that lie within the band of Fig. 29. Further, the a.s. fluctuation of the O-U process given in (3.3.7)–(3.3.8) helps to estimate the process parameters from observed bands.
From (3.3.7)–(3.3.8), the area above the line y = 0 over the range [0, t] for an O-U process may be approximated by
\[
\frac{1}{2} \int_0^t |V(s)|\,ds \approx \int_0^t \Bigl[\frac{\sigma^2}{\alpha} (1 + o(1)) \log s\Bigr]^{1/2} ds .
\]
Now use the approximation
\[
\int (\log s)^{1/2}\,ds = s\Bigl[(\log s)^{1/2} - \frac{1}{2} (\log s)^{-1/2} - \frac{1}{4} (\log s)^{-3/2} (1 + o(1))\Bigr].
\]
Fig. 29 Stretched band of yam yield with central line as base: 800 g seed weight (horizontal axis: time in days; vertical axis: residual yam yield in kg)


The time ranges of available data in the different figures are of the same order. Thus the ratio of upper areas in the odd-numbered Figs. 17 to 33 for different seed weights over a similar characteristic can also be interpreted as the ratio of standard deviations of the limiting distribution of V(s), s → ∞, corresponding to different seed weights over the same variable such as girth, canopy radius or yield.
From (3.3.7), a similar comparison of variability may be made from the ratio of peak heights in the odd-numbered Figs. 17 to 33 for different seed weights over a similar characteristic.
Consider the time segment of [75, 140] days in Fig. 12. All the 20 yam plants
with seed weight 500 g are alive in this time region. For yam yield of 20 plants,
one may obtain the 20 curves of deviation from lowess growth curve (computed
over the entire data set of 20 plants, then restricted to the focused segment with
linear interpolation) for mean/median yield computed at different time points; to be
modeled by O-U process. Twenty independent estimates of (α, σ 2 ) may be obtained
from the trajectories in the time zone [75, 140] days.
The 20 estimates of $\sigma^2$ obtained by (3.3.2) from squared differences over narrowly placed grid points, 5 days apart, in each of the 20 trajectories are
0.001664569, 0.000261755, 0.000308695, 0.000328442, 7.58E-05,
0.000171903, 0.00033702, 0.000118769, 9.78E-05, 0.000242651, 0.000637052,
0.000220903, 0.000136615, 0.000146311, 6.79E-05, 0.000324541,
0.000193116, 0.000176361, 3.10E-05, 0.00015151
Fig. 30 Band of yam yield: 650 g seed weight (horizontal axis: time in days; vertical axis: yam yield in kg)


We obtain the average estimate $\hat{\sigma}^2 = 0.000285$ kg² from the 20 values given above for the O-U process that models yam yield for seed weight 500 g.
In a similar fashion, we may consider the time range [79, 144] days for 650 g of seed weight. Excluding plant no. 34, the 19 estimates of $\sigma^2$ are
0.000582171, 0.00070874, 0.000150616, 0.000213152, 0.000738633,
7.55E-05, 0.003526466, 0.000238287, 0.004012515, 0.000173194,
0.002395853, 0.001078497, 0.001833376, 0.000287308, 0.000204145,
0.00481592, 9.24E-05, 0.000260316, 0.003854702
Then the average estimate is $\hat{\sigma}^2 = 0.001329$ kg², obtained from 19 plants with seed weight 650 g. This refers to an O-U process that models yam yield from seed weight 650 g. Plant no. 34 had a lifetime of 78 days at the time of the first interim yam growth recording; we considered taking individual growth curve characteristics from 79 days onward up to 140 days for seed weight 650 g. Plant no. 34 died at 130 days of lifetime, so it was deleted in the $\sigma^2$ calculations.
For seed weight 800 g, we consider the time range [83, 148] days; plant no. 46 has its first interim yam growth recording at 83 days of lifetime. To maintain a range of 65 days as taken for other plants, we took day 148 as the last point of the time range in this case. For 800 g seed weight, we estimate $\sigma^2$ based on 19 plants, excluding the wounded plant. The 19 estimates of $\sigma^2$ are
0.00187558, 0.000184949, 0.000466427, 0.000610181, 0.000127615,
0.000354622, 5.27E-05, 7.29E-05, 0.00026018, 0.000243256, 6.04E-05,
0.000759074, 0.000224042, 3.12E-05, 0.000224058, 3.67E-05, 0.000103171,
5.98E-05, 0.000824948
After averaging over the 19 plants, the estimate of $\sigma^2$ for 800 g of seed weight turns out to be $\hat{\sigma}^2 = 0.000346$ kg². For the wounded plant, $\hat{\sigma}^2 = 0.000678$ kg². This is almost double the averaged estimate of the other plants. The variation of yield is high due to the effect of the wound in the plant.

Fig. 31 Stretched band of yam yield with central line as base: 650 g seed weight (horizontal axis: time in days; vertical axis: residual yam yield in kg)

A common estimate of α for plants in each of seed weight is available from the
maximum fluctuation of growth curve computed from central response curve as given
in Table 2.
From (3.3.7), (3.3.8), Fig. 33 and Table 2 one may obtain an estimate of the drift parameter of the O-U process from
\[
\frac{\sigma^2}{\alpha} (1 + o(1)) \log t \approx (2.199239)^2 = 4.8367, \quad t = 65;
\]
thus $\hat{\alpha}_{500} = 0.000245976$ for seed weight 500 g.
In a similar fashion, $\hat{\alpha}_{650} = 0.001155576$ for seed weight 650 g. The restoring force towards the mean is of higher order for 650 g compared to 500 g of seed weight.
Computed in a similar manner, we obtain $\hat{\alpha}_{800} = 0.0005894222$ for 800 g of seed weight.

Fig. 32 Band of yam yield: 500 g seed weight (horizontal axis: time in days; vertical axis: yam yield in kg)

Relationship of the process parameters with seed weight is explained in Figs. 37
and 38 with spline regression in SPlus.
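
The drift estimates quoted above can be recomputed from the averaged diffusion estimates and the peak heights of Table 2, ignoring the o(1) term and taking t = 65 days with the natural logarithm. The function name is hypothetical, and pairing Figs. 31 and 29 with seed weights 650 g and 800 g is an inference from the phrase "in a similar fashion"; under that pairing the arithmetic agrees with the quoted values.

```python
import math


def alpha_from_peak(sigma2_hat, peak_height, t):
    """Solve (sigma^2 / alpha) * log(t) = peak_height^2 for alpha (o(1) term dropped)."""
    return sigma2_hat * math.log(t) / peak_height ** 2


T = 65.0  # length of the time range in days

# seed weight (g): (averaged sigma^2 estimate in kg^2, max upper height in kg from Table 2)
inputs = {
    500: (0.000285, 2.199239),  # Fig. 33
    650: (0.001329, 2.191087),  # Fig. 31
    800: (0.000346, 1.565385),  # Fig. 29
}

alpha = {w: alpha_from_peak(s2, h, T) for w, (s2, h) in inputs.items()}
# theta = sigma / alpha^(1/2) governs the size of the fluctuation.
theta = {w: math.sqrt(inputs[w][0] / alpha[w]) for w in inputs}

for w in sorted(alpha):
    print(w, alpha[w], theta[w])
```

By construction, θ here equals the peak height divided by (log 65)^{1/2}, so comparing θ across seed weights is the same as comparing peak heights.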
From (3.3.7), for m independent copies $\{V_i, 1 \le i \le m\}$ of the O-U process one may write
\[
\limsup_{t \to \infty} V_i(t) \sim \Bigl[\frac{\sigma^2}{\alpha} (1 + o(1)) \log t\Bigr]^{1/2} \quad \text{a.s.} \tag{3.3.9}
\]
Hence,
\[
\limsup_{t \to \infty} \max_{1 \le i \le m} V_i(t) \sim \Bigl[\frac{\sigma^2}{\alpha} (1 + o(1)) \log t\Bigr]^{1/2} \quad \text{a.s.} \tag{3.3.10}
\]
for otherwise, $V_i(t)$ in (3.3.9) for a particular i would be out of track infinitely many times, violating the assertion in (3.3.9), as m is finite.
Fig. 33 Stretched band of yam yield with central line as base: 500 g seed weight (horizontal axis: time in days; vertical axis: residual yam yield in kg)


Thus for a large t and h > 0,
\[
\int_h^{t+h} \max_{1 \le i \le m} V_i^2(s)\,ds \approx \frac{\sigma^2}{\alpha} (1 + o(1)) \int_h^{t+h} \log s\,ds . \tag{3.3.11}
\]

Since for different seed weights the time intervals are of approximately equal magnitude in the bands, the bounded areas, in units of (kg × day), in the curves shown in Figs. 29, 31 and 33 for seed weights 800, 650 and 500 g, being 261.495413, 352.5872116 and 185.1649622 respectively, are proportional, to a first approximation, to the asymptotic variances of the respective modeled O-U processes; the last value, 185.1649622, is the minimum.
Seed weight 500 g thus corresponds to the minimum variation in yield; this is in concordance with the minimum variation in canopy radius for seed weight 500 g.
Fig. 34 Growth curve of yam (mid band) (legend: 500 g, 650 g and 800 g seed weight; horizontal axis: time in days; vertical axis: yam yield in kg)

Appendix A: Two Sample U Statistics and Rates of Convergence

The two-sample U-statistic is a widely used nonparametric test statistic with nice optimal properties. Below we prove some results in a general set-up.
A.1 Two sample U-Statistic
Let $U_{n,m}$ be a two-sample U-statistic based on the independent but not necessarily identically distributed random variables $X_1, \ldots, X_n$ and $Y_1, \ldots, Y_m$ with kernel φ and degree (r, s), i.e.,
\[
U = \Bigl(\binom{n}{r} \binom{m}{s}\Bigr)^{-1} \sum_{\substack{1 \le i_1 < \cdots < i_r \le n \\ 1 \le j_1 < \cdots < j_s \le m}} \phi(X_{i_1}, \ldots, X_{i_r}; Y_{j_1}, \ldots, Y_{j_s}), \tag{1}
\]
where the kernel φ is symmetric in its arguments $X_i$'s and $Y_j$'s. Without loss of generality let
\[
E\phi(X_{i_1}, \ldots, X_{i_r}; Y_{j_1}, \ldots, Y_{j_s}) = 0, \quad \forall\ i_1 \ne \cdots \ne i_r,\ j_1 \ne \cdots \ne j_s. \tag{2}
\]
Fig. 35 Almost sure interval of yam growth curve (mid band) (legend: 500 g, 650 g and 800 g seed weight; horizontal axis: time in days; vertical axis: yam yield in kg)


An example of such a statistic is the Wilcoxon two-sample statistic. Nonuniform rates of convergence in the CLT for two-sample U-statistics are studied in Dasgupta (2008) when a finite (≥ 2) moment of the kernel φ exists. A Berry-Esseen bound for random index was also established therein.
We study the error of approximation of two-sample U-statistics by the corresponding Hajek's projection in the non-iid case, under certain conditions which ensure that all the moments of the kernel exist but the moment generating function of the kernel may not exist. Since the error bounds are sharp, the results obtained indicate that optimal results in the set-up of sums of independent random variables are possible for U-statistics. The error bounds are also relevant in computing the rates in almost sure convergence. Let $m = O_e(n)$ and let v be the number of arguments in φ. The assumed moment bound (9) for the kernel φ ensures that $E \exp[s\{\log_e(1 + |\phi|)\}^{\nu/(\nu-1)}] < \infty$, $\nu > 1$, $s > 0$.
Fig. 36 Almost sure interval of yam growth curve (legend: 500 g, 650 g and 800 g seed weight; horizontal axis: time in days; vertical axis: yam yield in kg)

Fig. 37 Drift parameter α of O-U process with respect to seed weight (horizontal axis: seed weight in g; vertical axis: alpha)

Fig. 38 Diffusion parameter $\sigma^2$ of O-U process with respect to seed weight (horizontal axis: seed weight in g; vertical axis: sigma^2)


A.2 Decomposition of 2 Sample U -Statistic and Estimate of Remainder


The steps of decomposition are shown in Dasgupta (2008) for r = 2 and s = 2; we adopt the notation used therein. It is possible to generalize to other values of (r, s), see also Ghosh and Dasgupta (1980), Dasgupta (2015e). For completeness, we provide the main steps.
With
\[
V_1 = \frac{2}{n} \sum_{i_1=1}^{n} \psi^{(1)}(X_{i_1}) + \frac{2}{m} \sum_{i_3=1}^{m} \psi^{(1)}(Y_{i_3}),
\]
to be considered as the main part of the U-statistic obtained via Hajek's projection, where ψ represents the average of conditional expectations of the kernel φ fixing some of its coordinates, one has
\[
U = V_1 + V_2 + V_3 + V_4 = V_1 + R_{n,m}, \tag{3}
\]
where $R_{n,m} = V_2 + V_3 + V_4$ is the remainder with associated components $V_2$, $V_3$ and $V_4$.

Table 3 Leaf canopy slope (cm/day) around the time of intervention

Plant no.             Seed wt 500 g    Seed wt 650 g    Seed wt 800 g
(in each category)
1                     0.3333333        0.09090909       0
2                     0.8333333        0.33333333       0.46666667
3                     −0.3333333       −0.66666667      0.16666667
4                     0                −2.16666667      0.33333333
5                     0.3333333        0.33333333       0
6                     0                0.16666667       0.83333333
7                     0                0                1
8                     0.6666667        0.16666667       1.16666667
9                     0.1666667        0.16666667       0
10                    0.4              0.16666667       0
11                    0                0.33333333       1
12                    0.3333333        0.5              1
13                    0                −1.5             −0.32142857
14                    0.6666667        0.16666667       0.16666667
15                    1                0.16666667       0.16666667
16                    0.1666667        0.83333333       0.5
17                    0.8333333        −0.66666667      0.03333333
18                    0.1666667        0.5              0.83333333
19                    0                0.66666667       0.83333333
20                    0.6666667        0.83333333       0.66666667

In the above representation $V_1$ is the main part, for which application of standard theory is possible. In fact, we use the set-up of triangular arrays of random variables, where the variables in each array are independent. The arrays may themselves be dependent.
The remaining parts in (3), viz. $V_2$, $V_3$ and $V_4$, are negligible compared with the main part $V_1$. The following moment estimates of $V_2$, $V_3$, $V_4$ and the remainder $R_{n,m}$ hold, see Dasgupta (2008). We shall require the following result to compute rates of convergence of two-sample U-statistics, used for testing purposes in the earlier sections.

Proposition A.1 Let (2) hold and, for an integer q ≥ 1,
\[
\delta_q = \sup_{m \ge 2,\, n \ge 2} \Bigl[\binom{n}{2} \binom{m}{2}\Bigr]^{-1} \sum_{\substack{1 \le i_1 < i_2 \le n \\ 1 \le i_3 < i_4 \le m}} E|\phi(X_{i_1}, X_{i_2}, Y_{i_3}, Y_{i_4})|^{2q} < \infty. \tag{4}
\]
Then,
\[
E(V_2)^{2q} \le (n^{-2q} + m^{-2q} + (mn)^{-q})\, L^q (2q)!\, \delta_q, \tag{5}
\]
\[
E(V_3)^{2q} \le (m^{-q} n^{-2q} + n^{-q} m^{-2q})\, L^q (3q)!\, \delta_q, \tag{6}
\]
\[
E(V_4)^{2q} \le (mn)^{-2q} L^q (4q)!\, \delta_q; \tag{7}
\]
where L (> 1) is a constant independent of m, n and q. Finally, from (5)–(7), for $R_{n,m}$ defined in (3), one has
\[
E R_{n,m}^{2q} \le n^{-2q} L^q (vq)!\, \delta_q, \tag{8}
\]
under the assumption $m = O_e(n)$, where v is the number of arguments in φ.

In the above decomposition, v = r + s = 4.


A.3 Rates of Convergence
The representation (3) permits us to compute the difference of two-sample U-statistics from a sum of independent random variables, to which standard theory applies. Consider $m = O_e(n)$. Then from (3),
\[
U = V_1 + R_{n,m}, \quad \text{where } V_1 = \frac{2}{n} \sum_{i=1}^{n} \psi^{(1)}(X_i) + \frac{2}{m} \sum_{i=1}^{m} \psi^{(1)}(Y_i)
\]
is the weighted sum of (m + n) independent random variables.
Assume that the kernel φ satisfies
\[
\sup_{n \ge 1,\, m \ge 1} \Bigl(\binom{n}{2} \binom{m}{2}\Bigr)^{-1} \sum_{\substack{1 \le i_1 < i_2 \le n \\ 1 \le j_1 < j_2 \le m}} E|\phi(X_{i_1}, X_{i_2}, Y_{j_1}, Y_{j_2})|^{2q} = \delta_q \le L e^{w_o q^{\nu}} \tag{9}
\]
∀ q > 1, and for some L > 1, where $w_o > 0$, ν > 1. Condition (9) is equivalent to the following:
\[
\sup_{n \ge 1,\, m \ge 1} \Bigl(\binom{n}{2} \binom{m}{2}\Bigr)^{-1} \sum_{\substack{1 \le i_1 < i_2 \le n \\ 1 \le j_1 < j_2 \le m}} E \exp[s\{\log_e(1 + |\phi|)\}^{\nu/(\nu-1)}] < \infty \tag{10}
\]
where $\phi = \phi(X_{i_1}, X_{i_2}, Y_{j_1}, Y_{j_2})$, and $s = w_o^{-1/(\nu-1)}$.
This ensures the existence of the m.g.f. for the ν/(ν − 1)th power of a logarithmic function of φ. The assumption implies that φ has a wider moment bound compared to a random variable with finite m.g.f. The bound on moment growth for φ is of such a high order that the ν/(ν − 1)th power of the tamed variable log(1 + |φ|), rather than the kernel φ itself, admits an m.g.f. Finiteness of all moments of φ is stringent compared to assuming existence of a fixed-order moment, but this assumption is milder than assuming a finite m.g.f. for φ.

Now write,
\[
U_n^{*} = U_{n,m}^{*} = [\mathrm{var}(V_1)]^{-1/2} U_{n,m} = [\mathrm{var}(V_1)]^{-1/2} \{V_1 + R_{n,m}\} = \hat{U}_n^{*} + R_{n,m}^{*}, \tag{11}
\]
where $V_1 = \frac{2}{n} \sum_{i=1}^{n} \psi^{(1)}(X_i) + \frac{2}{m} \sum_{i=1}^{m} \psi^{(1)}(Y_i)$; $R_{n,m}^{*} = [\mathrm{var}(V_1)]^{-1/2} R_{n,m}$,
\[
\sigma^{*2} = \mathrm{var}(V_1) = 4\Bigl(\sum_{i=1}^{n} E[\psi^{(1)}(X_i)]^2 / n^2 + \sum_{i=1}^{m} E[\psi^{(1)}(Y_i)]^2 / m^2\Bigr) = O_e\Bigl(\frac{1}{n+m}\Bigr),
\]
\[
\sigma_n^2 = \sigma_{n,m}^2 = (n+m)^2 \sigma^{*2} = (n+m)^2\, \mathrm{var}(V_1) = O_e(n+m) = O_e(n),
\]
provided
\[
\inf_{n \ge 1} n^{-1} \sum_{i=1}^{n} E[\psi^{(1)}(X_i)]^2 > 0, \quad \inf_{m \ge 1} m^{-1} \sum_{i=1}^{m} E[\psi^{(1)}(Y_i)]^2 > 0. \tag{12}
\]
Let L > 1 be a generic constant. The first term on the r.h.s. of (11) is then a standardized sum of independent random variables and the second term is a remainder with $E(R_{nm}^{*})^{2q} \le n^{-q} L^q (vq)!\, \delta_q$. Now, $e^{w^{*} q^{\nu}} \gg L^q (vq)!$, $w^{*} > 0$, ν > 1. Thus, one may write $U_{n,m} = V_1 + R_{n,m} = \hat{U}_{n,m} + R_{n,m}$ where
\[
E(R_{nm})^{2q} \le n^{-2q} L e^{w q^{\nu}}, \quad w > w_o, \tag{13}
\]
\[
E(R_{nm}^{*})^{2q} \le n^{-q} L e^{w q^{\nu}}, \quad w > w_o, \tag{14}
\]
for a different (large) choice of L. For simplicity take m = n. Note that $(R_{n,n}, \mathcal{F}_{n,n})$ is a reverse martingale, where $\mathcal{F}_{n,n}$ is the σ-algebra generated by $\{X_1, \ldots, X_n, Y_1, \ldots, Y_n\}$. Thus,
\[
P(\sup_{i \ge n} |U_{i,i} - \hat{U}_{i,i}| > t) \le t^{-2q} E|U_{n,n} - \hat{U}_{n,n}|^{2q} \le t^{-2q} n^{-2q} L e^{w q^{\nu}}. \tag{15}
\]
Differentiating the above with respect to q, we obtain the optimal bound for the above probability as $O(\exp(-(\nu - 1) w \{2(\log t + \log n)/(w\nu)\}^{\nu/(\nu-1)}))$. The condition m = n may be relaxed. One needs to assume that each co-ordinate of (m, n) is strictly increasing as m + n → ∞.
The result on maximum difference of U -statistic from its projection for all large
n is stated below.

Theorem A.1 Under the assumptions (2), (9)/(10) and (12), and m = n, one has
\[
P(\sup_{i \ge n} |U_{i,i} - \hat{U}_{i,i}| > t) = O(\exp(-(\nu - 1) w \{2(\log t + \log n)/(w\nu)\}^{\nu/(\nu-1)})). \tag{16}
\]
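
The exponent in (16) can be traced by a short calculus step, sketched below. Here q is treated as a continuous variable, tn > 1 is assumed, and the shorthand A = log t + log n and the function g are introduced only for this sketch.

```latex
\[
g(q) = \log\bigl(t^{-2q} n^{-2q} e^{w q^{\nu}}\bigr) = -2qA + w q^{\nu},
\qquad A = \log t + \log n .
\]
\[
g'(q) = -2A + w\nu q^{\nu-1} = 0
\;\Longrightarrow\;
q_0 = \Bigl\{\frac{2A}{w\nu}\Bigr\}^{1/(\nu-1)} .
\]
\[
g(q_0) = q_0\bigl(-2A + w q_0^{\nu-1}\bigr)
       = q_0\Bigl(-2A + \frac{2A}{\nu}\Bigr)
       = -(\nu-1)\, w\, \frac{2A}{w\nu}\, q_0
       = -(\nu-1)\, w \Bigl\{\frac{2A}{w\nu}\Bigr\}^{\nu/(\nu-1)} .
\]
```

Since g is convex in q for ν > 1, the stationary point is the minimiser, and the constant L is absorbed in the O(·) term.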
i≥n

A similar technique is adopted in proving Theorem 3.1, pp. 82–83 of Dasgupta (2015d), with a different type of moment bound on the kernel φ in the one-sample case. Similar results on two-sample U-statistics may be obtained under the moment bound on the kernel assumed in Dasgupta (2015e).
Specifically, we assume the moment bound (6)/(7), p. 38 of Dasgupta (2015e), i.e., we consider the bound
\[
\delta_q \le L^q e^{\nu q \log q} \tag{17}
\]
∀ q > 1, where L > 0, ν ∈ (0, 1). The above condition is implied by
\[
\sup_{n \ge 1,\, m \ge 1} \Bigl(\binom{n}{2} \binom{m}{2}\Bigr)^{-1} \sum_{\substack{1 \le i_1 < i_2 \le n \\ 1 \le j_1 < j_2 \le m}} E \exp(s|\phi|^{1/\nu}) < \infty, \tag{18}
\]
where $0 < s < s_o = \nu e^{-1} L^{-1/\nu}$ and $\phi = \phi(X_{i_1}, X_{i_2}, Y_{j_1}, Y_{j_2})$.


This follows along the lines of Dasgupta (2006b), see Proposition 2.1 and Remark
2.2 therein. Existence of m.g.f. for φ corresponds to the case ν = 1.

We may then have the following bounds:
\[
E R_{n,m}^{2q} \le n^{-2q} L^q e^{(v+\nu) q \log q},
\]
\[
E(R_{nm}^{*})^{2q} \le n^{-q} L^q (vq)!\, \delta_q \le n^{-q} L^q e^{(v+\nu) q \log q}, \quad \forall q > 1, \text{ under (17)}, \tag{19}
\]
where v = r + s is the number of arguments in φ; see (12), p. 39 of Dasgupta (2015e), $R^{*}$ being the remainder term of the standardised U from its projection.
Now proceeding as in (4.38)–(4.40), p. 73 of Dasgupta (2013), we obtain the following theorem.

Theorem A.2 Under the assumptions (2), (17)/(18) and (12), and m = n, one has
\[
P(\sup_{i \ge n} |U_{i,i} - \hat{U}_{i,i}| > t) \le \exp\{-(v+\nu)(t^2 n^2 L^{-1})^{1/(v+\nu)} e^{-1}\}. \tag{20}
\]


The bound in (20) is sharper than (16), the former is derived under more strin-
gent condition. The condition is satisfied for bounded kernel of indicator function,
considered in Wilcoxon statistics.
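
The exponent in (20) can be recovered by minimising the Markov-type bound over q, in the same way as for (16). The sketch below treats q as continuous and starts from the bound $t^{-2q} n^{-2q} L^q e^{(v+\nu) q \log q}$; the function h is introduced only for this sketch.

```latex
\[
h(q) = \log\bigl(t^{-2q} n^{-2q} L^{q} e^{(v+\nu) q \log q}\bigr)
     = q\bigl[\log(L t^{-2} n^{-2}) + (v+\nu)\log q\bigr].
\]
\[
h'(q) = \log(L t^{-2} n^{-2}) + (v+\nu)(\log q + 1) = 0
\;\Longrightarrow\;
q_0 = e^{-1}\,(t^2 n^2 L^{-1})^{1/(v+\nu)} .
\]
\[
h(q_0) = q_0\bigl[-(v+\nu)(\log q_0 + 1) + (v+\nu)\log q_0\bigr]
       = -(v+\nu)\, q_0
       = -(v+\nu)\,(t^2 n^2 L^{-1})^{1/(v+\nu)}\, e^{-1} .
\]
```

Exponentiating h(q_0) gives the right-hand side of (20).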

References

Basawa IV, Rao BLSP (1980) Statistical inference for stochastic processes. Academic press, London
Brown BM, Hewitt JI (1975) Asymptotic likelihood theory for diffusion processes. J Appl Prob
12:228–238
Chung KL (1948) On the maximum partial sum of sequences of independent random variables.
Trans Amer Math Soc 64:205–233
Dasgupta R (2006a) Modeling of material wastage by ornstein–uhlenbeck process. Calcutta Stat
Ass Bull 58:15–35
Dasgupta R (2006b) Nonuniform rates of convergence to normality. Sankhya 68:620–635

Dasgupta R (2008) Convergence rates of two sample U-statistics in non iid case. CSA Bull 60:81–97
Dasgupta R (2013) Non uniform rates of convergence to normality for two sample u-statistics in non
IID case with applications. Advances in growth curve models: topics from the Indian statistical
institute. In: Proceedings in mathematics & statistics, Chapter 4, vol 46. Springer, Berlin, pp
60–88
Dasgupta R (2015a) Plant sensitivity and growth curve analysis of elephant foot yam. In: Growth
curve and structural equation modeling, Springer, Berlin, pp 1–23. http://dx.doi.org/10.1007/
978-3-319-17329-0_1
Dasgupta R (2015b) Longitudinal growth of elephant foot yam and some characterisation theorems.
In: Growth curve and structural equation modeling, Springer, Berlin, pp 259–285. http://dx.doi.
org/10.1007/978-3-319-17329-0_14
Dasgupta R (2015c) Growth of tuber crops and almost sure band for quantiles, communications in
statistics–simulation and computation. http://dx.doi.org/10.1080/03610918.2014.990097
Dasgupta R (2015d) Growth curve of elephant foot yam, one sided estimation and confidence band.
In: Growth curve and structural equation modeling: topics from the Indian statistical institute,
Chapter 5. Springer, Berlin, pp 75–103
Dasgupta R (2015e) Rates of convergence in CLT for two sample U-statistics in non iid case and
multiphasic growth curve. In: Growth curve and structural equation modeling: topics from the
indian statistical institute, Chapter 3. Springer, Berlin, pp 35–58
Dasgupta R (2016) Growth curve of elephant foot yam under moderate to severe stress and plant
sensitivity. Int J Horticult 6(14):1–8. doi:10.5376/ijh.2016.06.0014
Dasgupta R (2017) Longitudinal growth curve of elephant foot yam under extreme stress and plant
sensitivity. Int J Horticult 7(13):104–114. doi:10.5376/ijh.2017.07.0013
Ghosh M, Dasgupta R (1980) Berry-Esseen theorem for U-statistics in non iid case. Colloquia
mathematica societatis janos bolyai. 32 non parametric statistical inference. Hungary, pp 293–
313
Karlin S, Taylor HM (1981) A second course in stochastic processes. Academic press, London
Welch BL (1947) The generalization of student’s problem when several different population vari-
ances are involved. Biometrika 34(12):28–35
Protein Structure Modeling of Abnormal
Genes Associated with PARK 1 and PARK 8
Loci Related to Autosomal Dominant
Parkinson’s Disease and Docking the
Protein(s) with Appropriate Ligands

Sanchari Roy and T.S. Vasulu

Abstract Parkinson’s disease (PD) is a common neurological disorder with a prevalence
of 1–2 per 1000 overall. PD is of two types, autosomal dominant and autosomal recessive;
the autosomal dominant forms are more harmful than the recessive forms, and a single copy
of their gene causes the disease. Of the five dominant loci involved in PD (PARK1,
PARK3, PARK4, PARK5 and PARK8), the two most predominant are PARK1 and
PARK8. Understanding and modeling the abnormal proteins encoded by these genes is
important, as it can help in drug design and in treating patients with PD. Of these
five loci, the 3-dimensional protein structure for the alpha-synuclein gene present in
the PARK1 locus is known, but the abnormal alpha-synuclein proteins causing PD are yet
to be modeled. For the LRRK2 gene present in the PARK8 locus, neither the 3-D structure
of the protein nor that of the abnormal protein coded by the gene is known. Suitable
ligands that can neutralize the effect of these proteins (Dardarin, coded by LRRK2, and
alpha-synuclein) in the human brain are also not available. We report modeling of the
abnormal proteins of the PARK1 and PARK8 loci.

Keywords Neurological disorder · Parkinsonism · Mutation · SNCA and LRRK2
genes · Alpha-synuclein and Dardarin · Open Reading Frame prediction · SOPMA
and GOR Algorithm · Secondary structure prediction · Protein 3-D structure
modeling: threading · Model validation · Receptor-ligand docking

MS subject classification: 92C40

1 Introduction

Parkinson’s disease (also known as PD) is a degenerative disorder of the central
nervous system that often impairs the sufferer’s motor skills and speech. It is
characterized by muscle rigidity, tremor, a slowing of physical movement termed
bradykinesia and, in extreme cases, a loss of physical movement called akinesia
(Allan 1937). It is a common neurological disorder with a prevalence of 1–2 per
1000 overall. However, the incidence rises after the age of 50, such that 1–2% of the
elderly (e.g., in the UK) are affected. The disease is due to the striatal deficiency of
dopamine following neuronal degeneration within the substantia nigra. Parkinson’s
disease belongs to a group of conditions called movement disorders. PD is both
chronic and progressive. PD is the most common cause of Parkinsonism, a group
of similar symptoms. PD is also called “primary parkinsonism” or “idiopathic PD”.
There are genetic predispositions and risk factors for the manifestation of PD. In recent
years, a number of specific mutations causing Parkinson’s disease have been
discovered. These account for a minority of cases of Parkinson’s disease. A patient
suffering from Parkinson’s disease is more likely to have relatives who also suffer
from the same disease. PD genes generally occur on the autosomal chromosomes
and not on the sex chromosomes. In autosomal dominant PD, inheritance of only a
single copy of the gene causes the disease, whereas in autosomal recessive PD,
two copies of the gene must be inherited to cause the disease. The identification
of a number of genes related to PD has opened the way to protein structure modeling
of the abnormal proteins that deviate from the wild types, thus paving the way for
drug design to find suitable molecules that can help in controlling PD.

S. Roy · T.S. Vasulu (B)
Indian Statistical Institute, Kolkata, India
e-mail: vasulut@hotmail.com

© Springer International Publishing AG 2017
R. Dasgupta (ed.), Growth Curve Models and Applications, Springer Proceedings
in Mathematics & Statistics 204, DOI 10.1007/978-3-319-63886-7_2

1.1 Discovery of Genes of Parkinson Disease

1.1.1 Autosomal Dominant, Recessive and Linkage Studies

The first gene discovered in PD was alpha-synuclein and the gene is located on the
long arm of chromosome 4 (Polymeropoulos et al. 1997), thus confirming the genetic
predisposition to at least one form of Parkinson’s disease. Studies have mapped the
NACP/synuclein gene to chromosome 4q21.3-q22 (Campion et al. 1995; Chen et al.
1995). Similarly, the SNCA gene was mapped to chromosome 4q21 (Shibasaki et al. 1995;
Spillantini et al. 1995). However, Campion et al. (1995) did not find mutations in
the NACP gene sequence of 26 patients with early-onset Alzheimer disease.
Polymeropoulos et al. (1996) performed a genome wide linkage scan in the large
Italian kindred previously reported by Golbe et al. (1990). Linkage to markers in
the 4q21-q23 region was found with a maximum lod score of 6.00 at recombination
fraction theta = 0.00 for marker D4S2380. The locus was designated PARK1. In 94
Caucasian families, Scott et al. (1997) could not demonstrate linkage to 4q21-q23.
They also found no linkage even when the 22 families from their study with at least 1
case of early-onset PD were examined separately. Ghosh et al. (2007) excluded link-
age in 13 multigenerational families with Parkinson disease, with the exception of 1
family for which they achieved a maximum multipoint lod score of 1.5 for genetic
markers in the 4q21-q23 region. Scott et al. (2001) described a genetic linkage study
conducted in 1995-2000 in which a complete genomic screen was performed in
174 families with multiple individuals diagnosed as having idiopathic PD, identified
through probands in 13 clinic populations in the continental United States and Aus-
tralia. Significant evidence for linkage was found in 5 distinct chromosomal regions:
chromosome 6 in the PARKIN gene (PRKN) in families with at least 1 individual with
PD onset at younger than 40 years (lod = 5.47), chromosomes 17q (lod = 2.62), 8p
(lod = 2.22), and 5q (lod = 1.50) overall and in families with late-onset PD, and 9q
(lod = 2.59) in families with both levodopa-responsive and levodopa-nonresponsive
patients. The data suggested that the PARKIN gene is important in early-onset PD
and that multiple genetic factors may be important in the development of idiopathic,
late-onset PD.
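The lod scores quoted in these linkage studies are base-10 logarithms of a likelihood ratio: the likelihood of the marker data at a recombination fraction theta, divided by the likelihood under free recombination (theta = 0.5). The following is a minimal sketch, assuming the idealised case of phase-known, fully informative meioses; this is an assumption made for illustration, since the cited studies used full pedigree likelihoods rather than this simple formula. The function name `lod_score` is hypothetical.

```python
import math


def lod_score(recombinants: int, meioses: int, theta: float) -> float:
    """Two-point lod score for phase-known, fully informative meioses.

    Illustrative assumption: each meiosis is scored independently as
    recombinant (probability theta) or non-recombinant (1 - theta).
    """
    if not 0.0 <= theta <= 0.5:
        raise ValueError("theta must lie between 0 and 0.5")
    non_recombinants = meioses - recombinants
    if theta == 0.0 and recombinants > 0:
        # A single observed recombinant rules out complete linkage.
        return float("-inf")
    log_linked = 0.0
    if recombinants:
        log_linked += recombinants * math.log10(theta)
    if non_recombinants:
        log_linked += non_recombinants * math.log10(1.0 - theta)
    # Under free recombination every meiosis has probability 0.5.
    log_unlinked = meioses * math.log10(0.5)
    return log_linked - log_unlinked


if __name__ == "__main__":
    # 20 informative meioses with no recombinants, evaluated at theta = 0
    print(round(lod_score(0, 20, 0.0), 2))  # 6.02
```

Under this idealisation each non-recombinant meiosis contributes log10(2), about 0.301, at theta = 0, so a lod score near 6 corresponds to roughly 20 informative meioses without a recombinant.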
Farrer et al. (1999) suggested that a locus on 4p is responsible for autosomal dom-
inant Lewy body Parkinsonism, and that postural tremor, consistent with essential
tremor, may be an alternate phenotype of the same pathogenic mutation that causes
Lewy body Parkinsonism. They studied the large family described by Waters and
Miller (1994) and Muenter et al. (1998) with levodopa-responsive
Lewy body Parkinsonism. After performing a genome screen, they identified a chro-
mosome 4p haplotype that segregated with the disorder; however, this haplotype
also occurred in individuals in the pedigree who did not have clinical Lewy body
Parkinsonism but rather suffered from postural tremor.
Day and Thompson (1987) cloned UCHL1 cDNA. The deduced protein, which
they called PGP9.5, contains 212 amino acids. By Northern blot analysis, Leroy
et al. (1998) detected a 1.3-kb transcript expressed only in brain. Examination of
specific brain regions revealed expression in all areas tested, particularly in the sub-
stantia nigra. Li et al. (2002) performed a genomic screen for age at onset (AAO) of
Parkinson disease, studying 174 families. Heritabilities between 40 and 60% were
found. Significant evidence was found for linkage of AAO in Parkinson disease on
1p (lod = 3.41).
Using 781 micro-satellite markers, Hicks et al. (2002) performed a genome wide
scan on 117 Icelandic patients with classic late-onset Parkinson disease (mean age
of onset 65.8 years) and 168 of their unaffected relatives from 51 families. They
found linkage to chromosome 1p32, and further analysis yielded a lod score of 4.9
near marker D1S2652 within a 7.6-cM segment. The authors designated this locus
PARK10.
Pankratz et al. (2002) reported linkage to 2q in a sample of sib pairs with Parkinson
disease. Pankratz et al. (2003) expanded the sample to include 150 families meeting
their strictest diagnostic definition of verified Parkinson disease. To delineate further
the chromosome 2q linkage, they performed analyses using only those pedigrees with
the strongest family history of PD. Linkage analyses in this subset of 65 pedigrees
generated a lod score of 5.1, which was obtained using an autosomal dominant model
of disease transmission. This result strongly suggested that variation in a gene on
2q36-q37 contributes to PD susceptibility.
By genomic sequence analysis, Gray et al. (2000) mapped the HTRA2 gene to 2p13.
Faccio et al. (2000) mapped the OMI gene to 2p12 by FISH. By radiation hybrid
mapping, Engelender et al. (1999) showed that the SNCAIP gene is located on
5q23.1-q23.3. In a large Italian family with autosomal recessive early-onset
Parkinsonism, Valente et al. (2001) identified a novel locus, PARK6, in a 12.5-cM region of
1p36-p35. The large Sicilian family, which the authors designated the Marsala kin-
dred, had 4 definitely affected members. The phenotype was characterized by early-
onset (range 32 to 48 years) parkinsonism with slow progression and sustained
response to levodopa. A maximum lod score of 4.01 at recombination fraction 0.00
was obtained for marker D1S19.
Autosomal recessive early-onset Parkinson disease is caused by mutation in the
DJ1 gene. Another form of early-onset Parkinson disease (PARK2) is caused by
mutation in the ‘PARKIN’ gene. In a family with early-onset Parkinsonism from a
genetically isolated community in the Netherlands, Van Duijn et al. (2001) found
linkage to chromosome 1p36. Using multiple markers spanning a disease haplotype
of 16 cM, they found a multipoint lod score of 4.3.
Pankratz et al. (2002) studied 160 multiplex families with Parkinson disease (PD)
in which there was no evidence of mutation in the PARKIN gene and used multi-
point nonparametric linkage analysis to identify PD susceptibility genes. For those
individuals with a more stringent diagnosis of verified PD, a lod score of 2.1 was
observed on the X chromosome. Analyses performed with all available sib pairs,
i.e., all examined individuals treated as affected regardless of their final diagnostic
classification, yielded even greater evidence of linkage to the X chromosome (lod
score equal to 2.7).
Pankratz et al. (2003) studied 754 affected individuals, comprising 425 sib pairs,
to identify PD susceptibility genes. They employed 2 diagnostic models for genome
wide nonparametric linkage analysis. Under the model representing a broader disease
definition, a lod score of 3.1 was achieved (genome wide P = 0.04). After removing
from the sample those 85 families with a strong history of PD, the genome screen
in the remaining 277 families resulted in a lod score of 3.2 on the X chromosome.
Pankratz et al. (2003) noted that Hicks et al. (2002) and Scott et al. (2001) also
reported linkage to this region of Xq21-q25.
Genetic analyses, epidemiologic studies, neuropathologic investigations, and new
experimental models of PD are providing important new insights into the pathogen-
esis of PD (Jun et al. 2007; Flower et al. 2007; Liu et al. 2007; Dufty et al. 2007). At
least 10 distinct loci are responsible for rare Mendelian forms of PD (Flower et al.
2007). Despite the genetic advances, PD is primarily a sporadic disorder with no
known cause (Jun et al. 2007).

1.2 Family Studies on PD and Identification of Genes

1.2.1 SNCA Gene

Polymeropoulos et al. (1996) demonstrated that the Parkinson disease phenotype in
a large family of Italian descent could be mapped to 4q21-q23. The clinical features
of Parkinson disease type 1 (PARK1) in this family were documented to be typical for
Parkinson disease, including Lewy bodies, with the exception of a relatively early age
of onset of illness at 46 ± 13 years. In this family, the penetrance of the gene was estimated to
be 85%. Since the SNCA gene maps to the same region, it was considered to be
an excellent candidate for the site of the mutation in PARK1. In the Italian family,
Polymeropoulos et al. (1997) found a G-to-A transition in nucleotide 209 of the
SNCA gene, which resulted in an ala53-to-thr substitution (A53T). The same A53T
mutation segregated with the Parkinson disease phenotype in 3 Greek kindreds. In
these families also, the onset of the disease occurred relatively early.
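The link between a single G-to-A transition and an alanine-to-threonine substitution follows directly from the standard genetic code: every alanine codon has the form GCN and every threonine codon the form ACN, so a G-to-A change at the first codon position always converts one into the other. The sketch below illustrates this; it does not use the actual SNCA sequence, and the helper name `substitute` is hypothetical.

```python
from typing import Tuple

BASES = "TCAG"
# Standard genetic code (NCBI translation table 1), with codons ordered
# TTT, TTC, TTA, TTG, TCT, ... using the base order T, C, A, G.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}


def substitute(codon: str, position: int, new_base: str) -> Tuple[str, str]:
    """Return (wild-type, mutant) one-letter amino acids for a single-base change.

    `position` is 0-based within the codon.
    """
    mutant = codon[:position] + new_base + codon[position + 1:]
    return CODON_TABLE[codon], CODON_TABLE[mutant]


if __name__ == "__main__":
    # All four alanine codons (GCN): a G-to-A change at the first position
    # gives a threonine codon (ACN) in every case.
    for third_base in BASES:
        codon = "GC" + third_base
        print(codon, substitute(codon, 0, "A"))
```

The nucleotide number 209 follows the cDNA numbering of the cited report and is not used in the code; only the position of the change within the codon matters here.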
In an in vitro study, Conway et al. (2000) compared the rates of disappearance
of monomeric alpha-synuclein and appearance of fibrillar alpha-synuclein for the
wildtype and 2 mutant proteins, A53T and A30P, as well as equimolar mixtures
that may model heterozygous Parkinson disease patients. Whereas A53T and an
equimolar mixture of A53T and wildtype fibrillized more rapidly than wildtype
alpha-synuclein, the A30P mutation and its corresponding equimolar mixture with
wildtype fibrillized more slowly. However, under conditions that ultimately pro-
duced fibrils, the A30P monomer was consumed at a comparable rate or slightly more
rapidly than the wildtype monomer, whereas A53T was consumed even more rapidly.
The difference between these trends suggested the existence of nonfibrillar alpha-
synuclein oligomers, some of which were separated from fibrillar and monomeric
alpha-synuclein by sedimentation followed by gel-filtration chromatography. Con-
way et al. (2000) concluded that drug candidates that inhibit alpha-synuclein fibril-
lization but do not block its oligomerization could mimic the A30P mutation and
may therefore accelerate disease progression.
In affected members of a Spanish family with autosomal dominant Lewy body
dementia (127750) and Parkinsonism, Zarranz et al. (2004) identified a 188G-A tran-
sition in the SNCA gene, resulting in a glu46-to-lys (E46K) substitution in the amino-
terminal region of the protein. The mutation showed complete segregation with the
disease phenotype and was absent in 276 Spanish healthy and disease controls. Choi
et al. (2004) found that the E46K SNCA mutation resulted in a significant increase
in alpha-synuclein binding to negatively charged phospholipid liposomes compared
to the wildtype, A53T and A30P mutant proteins. The A30P mutant had decreased
binding, and the A53T mutant had binding similar to wildtype. The mutated E46K
protein had an increased rate and amount of filament assembly compared to wild-
type and the A30P mutant. The E46K mutant filaments had a pronounced twisted
appearance with width varying between about 5 and 14 nm and a crossover spacing
of 43 nm, yielding arrays with a meshwork appearance. The A53T mutant had an
increased rate and amount of filament assembly, yielding a twisted appearance with
a width between 5 and 14 nm and a crossover spacing of approximately 100 nm. The
A30P mutant showed a slower rate of filament assembly compared to wild type, but
the total number of filaments formed was greater than wild type. The appearance of
the A30P filaments was similar to wild type, characterized by a 6 to 9-nm width.
The findings suggested a mechanism for the pathogenicity of E46K. Greenbaum
et al. (2005) also showed that the E46K mutation resulted in increased amyloid fibril
assembly compared to the wildtype protein, but the effect was not as strong as that
of the A53T mutation.
The identification by Polymeropoulos et al. (1997) of an Ala53Thr alteration in
the α-synuclein gene in persons with autosomal dominant Parkinson’s disease (PD)
provides support for the genetic basis of PD. Because the identical alteration was
found among four “unrelated” families [one Italian (Contursi) and three Greek kin-
dreds], Polymeropoulos et al. (1997) suggest that this genetic alteration is causative.
This mutation nevertheless appears to be rare in familial PD, as others have not
detected linkage to 4q21-q23 in sizable series of PD pedigrees, except for one (fam-
ily K), where it remains unclear whether or not family K is linked to 4q21-23.
Assuming that the linkage of the Contursi kindred to 4q21-q23 is valid, there is
concern that this molecular alteration may not be the disease-causing mutation, but
may instead represent a neutral variant in linkage disequilibrium with a neighboring PD disease
gene. Factors including selection, admixture, finite population size, migration and
mutation, co-ancestry, genetic hitchhiking, and growing population can affect link-
age disequilibrium. Contursi, in the Salerno province, lies close to the port of Naples
on the west coast of Italy. Close contact between Greece and Italy has occurred
through the port of Naples for centuries. Thus, it is possible that these four kin-
dreds are distantly related (co-ancestry) and that the Ala53Thr alteration represents
an α-synuclein polymorphism in allelic association with a neighboring PD disease
gene. Other neurological disorders, such as idiopathic torsion dystonia and Machado-
Joseph disease, demonstrate linkage disequilibrium between microsatellite markers
and the disease gene among different national populations. The mutated residue is
not evolutionarily conserved, in contrast with adjacent residues, which are conserved
between species. The “mutant” human sequence has a threonine at residue 53 like
the wild-type rodent sequence. Thus, the sequences are identical in this domain of
the protein. α-synuclein is found in Lewy bodies, the pathological hallmark of PD.
However, as many other proteins (for example, neurofilament, and ubiquitin) are
present in Lewy bodies, the presence of α-synuclein, although intriguing, does not
prove that α-synuclein is a candidate PD disease gene. The report by Polymeropou-
los et al. is a major step forward in PD research. Even if α-synuclein is not the PD
gene, the Ala53Thr alteration provides further localization of the PD disease gene
that may lie within one megabase of α-synuclein.
Kobayashi et al. (2006) noted that pyrroloquinoline quinone (PQQ) is a noncovalently
bound cofactor in the bacterial oxidative metabolism of alcohols. PQQ
also exists in plants and animals. Owing to its inherent chemical features, namely its
free-radical scavenging properties, PQQ has been drawing attention from both the
nutritional and the pharmacological viewpoints. Alpha-synuclein, a causative factor
of Parkinson’s disease (PD), has the propensity to oligomerize and form fibrils,
and this tendency may play a crucial role in its toxicity. The authors showed that PQQ
prevents the amyloid fibril formation and aggregation of alpha-synuclein in vitro in
a PQQ-concentration-dependent manner. Moreover, PQQ forms a conjugate with
alpha-synuclein, and this PQQ-conjugated alpha-synuclein is also able to prevent
alpha-synuclein amyloid fibril formation. They described this as the first study to
demonstrate the characteristics of PQQ as an anti-amyloid fibril-forming reagent.
Agents that prevent the formation of amyloid fibrils might allow a novel therapeutic
approach to PD. Therefore, together with further pharmacological approaches, PQQ
is a candidate for future anti-PD reagent compounds.
1.2.2 LRRK2 Gene

The leucine-rich repeat kinase 2 (LRRK2; also known as PARK8) gene has been iden-
tified to cause dominantly inherited Parkinson’s disease. LRRK2 is a large gene that
consists of 51 exons, and which encodes a 2,527-amino-acid protein named LRRK2
or Dardarin, with various conserved domains recognized in its primary amino-acid
sequence. To date, more than 40 variants have been reported in this gene, of which
at least 16 appear to be pathogenic. These variants, which include eight recurrent
mutations, occur in only 10 of the 51 exons of LRRK2. For the most frequent and
well-investigated mutation (c.6055G→A), a common founder has been suggested.
This single mutation has been reported in ∼1.5% of tested index cases (∼100 out of
6,500 cases) and in only 2 out of ∼12,000 healthy individuals. More recently, LRRK2
mutations have been detected in ∼1% of early-onset PD cases (Hedrich et al. 1997).
Post-mortem analysis of four patients from a family with one of the recurrent
mutations surprisingly revealed a broad spectrum of abnormalities: Lewy bodies
restricted to brainstem nuclei in the first patient; diffuse Lewy bodies in the second
patient; neurofibrillary tangles, but no Lewy bodies, in the third patient; and isolated
cell loss without neurofibrillary tangles or Lewy bodies in the fourth patient.
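The contrast between carrier frequencies in index cases and in healthy individuals can be summarised as an odds ratio. The sketch below uses the rounded counts quoted above (about 100 of 6,500 cases and 2 of about 12,000 healthy individuals); because these are approximate figures, the result is only an order-of-magnitude illustration and not an estimate from the original studies. The function name `odds_ratio` is hypothetical.

```python
def odds_ratio(case_carriers: int, cases: int,
               control_carriers: int, controls: int) -> float:
    """Odds ratio of carrying the mutation in cases versus controls."""
    case_noncarriers = cases - case_carriers
    control_noncarriers = controls - control_carriers
    return (case_carriers * control_noncarriers) / (case_noncarriers * control_carriers)


if __name__ == "__main__":
    # Rounded counts taken from the text; treated here as exact for illustration.
    cases, case_carriers = 6500, 100
    controls, control_carriers = 12000, 2

    print(f"carrier frequency in cases:    {100 * case_carriers / cases:.2f}%")
    print(f"carrier frequency in controls: {100 * control_carriers / controls:.3f}%")
    print(f"odds ratio: {odds_ratio(case_carriers, cases, control_carriers, controls):.1f}")
```

With so few carriers among the healthy individuals, the odds ratio is very sensitive to that count: a single additional carrier would change it substantially.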
Paisan-Ruiz et al. (2004) identified a putative disease-causing transcript
(DKFZp434H2111) within a 2.6-Mb region encompassing a locus for Parkinson
disease-8 (PARK8). The predicted transcript encodes a deduced 2,482-amino acid
protein with a leucine-rich repeat, a kinase domain, a RAS domain, and a WD40
domain. Northern blot analysis detected a 9-kb mRNA transcript in all tissues tested,
including brain. The authors named the protein product dardarin, derived from the
Basque word dardara, meaning tremor.
Zimprich et al. (2004) cloned LRRK2 from a human brain cDNA library and found
that it encodes a 2,527-amino acid protein with a molecular mass of approximately
250 kD. Northern blot analysis detected a major 9-kb transcript at low levels in most
brain regions. Highest transcript levels were obtained in the putamen, substantia
nigra, and lung. The appearance of smaller bands suggested alternative splicing. By
measuring the activity of LRRK2 against myelin basic protein as a test substrate,
West et al. (2005) determined that LRRK2 possesses mixed-lineage kinase activity.
LRRK2 also showed autophosphorylation activity. LRRK2 contains a ‘Ras of complex
proteins’ (ROC) domain that may act as a GTPase to regulate its kinase activity. Deng
et al. (2008) reported the crystal structure of the LRRK2 ROC domain in complex
with GDP-Mg(2+) at 2.0-angstrom resolution. The structure displayed a dimeric
fold generated by extensive domain swapping, resulting in a pair of active sites with
essential functional groups contributed from both monomers. Two residues mutated
in PARK8, arg1331 and ile1371, were located at the interface of the 2 monomers and
provided interactions to stabilize the ROC dimer. Deng et al. (2008) concluded that
PARK8-associated mutations in the ROC domain disrupt dimer formation, resulting
in decreased GTPase activity. They proposed that the ROC domain regulates LRRK2
kinase activity as a dimer, possibly via the COR domain acting as a molecular hinge.
In 7 affected members of an English family with Parkinson disease, Paisan-Ruiz
et al. (2004) identified a mutation in the LRRK2 gene that predicts a tyr1654-to-cys
substitution. Gasser (2005) noted that the correct numbering of this mutation is
tyr1699-to-cys (Y1699C).
In affected members of a family with autosomal dominant Parkinson disease
originally reported by Wszolek et al. (1997), Zimprich et al. (2004) identified het-
erozygosity for the Y1699C mutation resulting from a 5096A-G transition in the
LRRK2 gene.
In affected members of 4 of 61 (6.6%) unrelated families with autosomal dom-
inant Parkinson disease, Di Fonzo et al. (2005) identified a heterozygous 6055G-A
transition in exon 41 of the LRRK2 gene, resulting in a gly2019-to-ser (G2019S) sub-
stitution. Two families were from Italy, and 1 each were from Portugal and Brazil.
The gly2019 residue is highly conserved and is part of a 3-amino acid motif required
by all human kinase proteins. Gilks et al. (2005) identified the G2019S mutation in
8 of 482 (1.6%) unrelated patients with Parkinson disease. Five of the patients had
no family history of the disorder, suggesting either a de novo occurrence or reduced
penetrance. Nichols et al. (2005) identified the G2019S mutation in 20 of 358 (6%)
families with PD. In 1 family, 1 sib was heterozygous for the mutation and another
was homozygous; the homozygous individual did not differ in clinical presentation
from the sib and did not have early disease onset or more rapid progression.
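The G2019S frequencies reported in these studies come from samples of quite different sizes, so their uncertainty differs as well. As a minimal sketch, the proportions can be recomputed from the counts given above together with an approximate 95% Wilson score interval. The interval is an addition for illustration and was not reported in the cited studies; the function name `wilson_interval` is hypothetical.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Approximate Wilson score interval for a binomial proportion."""
    p = successes / n
    denom = 1.0 + z * z / n
    centre = (p + z * z / (2.0 * n)) / denom
    half_width = z * math.sqrt(p * (1.0 - p) / n + z * z / (4.0 * n * n)) / denom
    return centre - half_width, centre + half_width


if __name__ == "__main__":
    # (study, carriers, sample size) as quoted in the text
    studies = [
        ("Di Fonzo et al. (2005), families", 4, 61),
        ("Gilks et al. (2005), patients", 8, 482),
        ("Nichols et al. (2005), families", 20, 358),
    ]
    for name, carriers, n in studies:
        low, high = wilson_interval(carriers, n)
        print(f"{name}: {100 * carriers / n:.1f}% "
              f"(95% CI {100 * low:.1f}% to {100 * high:.1f}%)")
```

The small family series (4 of 61) gives a much wider interval than the larger samples, which is worth keeping in mind when comparing the reported percentages.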
By sequencing the LRRK2 gene in multiplex families showing linkage to the
PARK8 region, Kachergus et al. (2005) identified the G2019S mutation. The families
in which the mutation was found originated from the United States, Norway,
Ireland, and Poland. In patients with idiopathic Parkinson disease from the same
population, further screening identified 6 more patients with the LRRK2 G2019S
mutation; no mutations were found in matched control individuals. Subsequently,
42 family members of the 13 probands were examined; 22 had an LRRK2 G2019S
substitution, 7 of them with a diagnosis of PD. All patients shared an ancestral
haplotype indicative of a common founder, and within families LRRK2 G2019S segregated
with disease (multipoint lod score 2.41). Penetrance was age dependent, increasing
from 17% at age 50 to 85% at age 70 years. In all 19 affected members of the original
Japanese family with Parkinson disease-8 (Hasegawa and Kowa 1997), Funayama
et al. (2005) identified a heterozygous 6059T-C transition in exon 41 of the LRRK2
gene, resulting in an ile2020-to-thr (I2020T) substitution in a conserved region of
the kinase motif domain. The neuropathologic features in this family were notable
for absence of Lewy bodies. The mutation was also detected in 2 affected members
of another family with PARK8. In the second family, 3 unaffected members also carried
the mutation, but their ages (73, 58, and 56) were within the variation of age at
onset in that family (39 to 76 years). The mutation had previously been reported by
Zimprich et al. (2004). Recently, another new gene, TMEM, was discovered in
familial PD by Deng Han-Xiang et al. (2016). Details of the genes, the
corresponding loci and their chromosomal locations, and the modes of inheritance of PD
investigated by various authors are given in Table 1.
Table 1 Genes, mode of inheritance, loci and chromosomal location linked to familial PD or
implicated as genetic causes for PD

Gene                Mode of inheritance             Locus    Chromosomal location  Reference
alpha-synuclein     Autosomal dominant              PARK1    4q21-q23              Campion et al. (1995)
Parkin              Autosomal recessive             PARK2    6q25.2-27             Kitada et al. (1998)
Yet to be assigned  Autosomal dominant              PARK3    2p13                  Karamohamed et al. (2003)
alpha-synuclein     Autosomal dominant              PARK4    4q                    Singleton et al. (2003)
UchL1               Autosomal dominant              PARK5    4p14                  Day et al. (1987, 1990)
PINK1               Autosomal recessive             PARK6    1p35-p36              Valente et al. (2004)
DJ-1                Autosomal recessive             PARK7    1p36                  Bonifati et al. (2003)
LRRK2               Autosomal dominant              PARK8    12p11q13.1            Zimprich et al. (2004)
Yet to be assigned  Autosomal recessive             PARK9    1p36                  Najim Al-Din et al. (1994); Hampshire et al. (2001)
DYT/TAF1            X-linked                        PARK12   –                     Lee et al. (1991), Graeber and Muller (1992)
Yet to be assigned  Late-onset susceptibility gene  PARK10   1p32                  Li et al. (2002); Hicks et al. (2002)
NR4A2               Susceptibility gene             PARK11   2q22-23               Pankratz et al. (2002)
Synphilin-1         Susceptibility gene             PARK12   5q23.1-23.3           Engelender et al. (2000); Myhre et al. (2008)
Tau                 Susceptibility gene             PARK13   17q21                 Andreadis et al. (1992)
FBX07               Autosomal recessive             PARK15   –                     Di Fonzo et al. (2009); Paisan-Ruiz et al. (2010)

1.3 Proteins Concerned with the Candidate Genes

Alpha-synuclein, a presynaptic nerve terminal protein, was originally identified as the
precursor protein for the non-β amyloid component (NAC) of Alzheimer’s disease amyloid
plaques. Genotype analysis in the Italian PD kindred with additional genetic
markers showed recombination events. One recombination was observed for genetic
marker D4S2371 at the centromeric end of the PD interval and one recombination
was inferred for marker D4S2986 at the telomeric end of the interval. These recom-
binations redefined the location of the PD gene to an interval of approximately 6 cM
between markers D4S2371 and D4S2986. A minimal physical contig of yeast arti-
ficial chromosome (YAC) clones was constructed to span the interval from marker
D4S2371 to marker D4S2986. Using this contig, it was established that the
α-synuclein gene is located within the D4S2371-D4S2986 interval, just telomeric to
marker D4S2371. Thus, α-synuclein represented an excellent candidate gene for PD.
Sequence analysis of the fourth exon of the α-synuclein gene revealed a single base
pair change at position 209 from G to A (G209A), which results in an Ala to Thr
substitution at position 53 (Ala53Thr) and the creation of a novel Tsp45 I restriction
site. Mutation analysis for the G209A change in the Italian kindred showed com-
plete segregation with the PD phenotype with the exception of individual 30, who
was affected but not carrying this mutation. This individual apparently inherited a
different PD mutation from his father because it was seen that he shared a genetic
haplotype with his unaffected maternal uncle, individual 3, for genetic markers in the
PD linkage region. The frequency of this variation was studied in two general popula-
tion samples, one consisting of 120 chromosomes of the parents of the CEPH (Centre
d’Etude du Polymorphisme Humain) reference families, and the other consisting of
194 chromosomes of unrelated individuals from the blood bank in Salerno, Italy. Of
these 314 chromosomes, none was found to carry the G209A mutation. Fifty-two
patients of Italian descent with sporadic PD were also screened for the mutation,
along with five individuals who had been used to identify previously unpublished
Greek families.
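Finding no carrier among 314 control chromosomes does not show that the G209A change is absent from the general population; it only bounds how common it can be. As a minimal sketch, assuming the chromosomes behave as independent draws (an assumption made for illustration, not a statement from the cited report), the exact one-sided upper confidence bound for a frequency with zero observed events can be computed as below. The function name `upper_bound_zero_events` is hypothetical.

```python
def upper_bound_zero_events(n: int, confidence: float = 0.95) -> float:
    """Exact one-sided upper confidence bound when 0 of n samples carry the variant.

    Solves (1 - p) ** n = 1 - confidence for p.
    """
    return 1.0 - (1.0 - confidence) ** (1.0 / n)


if __name__ == "__main__":
    n_chromosomes = 120 + 194  # CEPH parents plus Salerno blood bank samples
    bound = upper_bound_zero_events(n_chromosomes)
    print(n_chromosomes)                            # 314
    print(f"95% upper bound: {100 * bound:.2f}%")
    print(f"rule of three:   {100 * 3 / n_chromosomes:.2f}%")
```

Both the exact bound and the familiar rule-of-three approximation (3/n) indicate that an allele frequency of up to roughly 1% would still be compatible with this control sample.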
Amplification by the polymerase chain reaction (PCR) of reverse-transcribed mRNA
(RT-PCR) demonstrated that the mutant allele is transcribed in the lymphoblast cell
line of an affected individual from the Italian kindred.
The Ala53Thr substitution was localized in a region of the protein whose secondary
structure is predicted to be α-helical, bounded by β sheets. Substitution
of the alanine with threonine is predicted to disrupt the α helix and extend the β sheet
structure. Beta pleated sheets are thought to be involved in the self-aggregation of
proteins, which could lead to the formation of amyloid-like structures.
Three members of the synuclein family have been characterized in the rat, with
SYN1 exhibiting 95% similarity to the human α-synuclein protein. SYN1 of the rat is
expressed in many regions of the brain, with high levels found in the olfactory bulb and
tract, the hippocampus, dentate gyrus, habenula, amygdala, and piriform cortex, and
intermediate levels in the granular layer of the cerebellum, substantia nigra, caudate-
putamen, and dorsal raphe. This pattern of expression coincides with the distribution
of the Lewy bodies found in brains of patients with Parkinson’s disease. Decreases
in olfaction often accompany the syndromic features of Parkinson’s disease, and it
was proposed that in many cases hyposmia (decreased sense of smell) is an early
sign of the illness.
In the zebra finch the homolog to α-synuclein, synelfin, is thought to be involved
in the process of song learning, suggesting a possible role for synuclein in memory
and learning. In contrast to humans, rats have a threonine at the same position in
their homologs to the human α-synuclein gene.
Omar M.A El-Agnafa et al. (1998) studied the effects of the mutations Ala30
to Pro and Ala53 to Thr on the physical and morphological properties of the α-synuclein
protein implicated in Parkinson’s disease. Alpha-synuclein (α-syn) protein has been
found in association with the pathological lesions of a number of neurodegenerative
diseases. Mutations in the α-syn gene have been reported in families susceptible
to an inherited form of Parkinson’s disease. Human wild-type α-syn, PD-linked
mutant α-syn (Ala30Pro) and mutant α-syn (Ala53Thr) proteins have been observed
to self-aggregate and form amyloid-like filaments. The mutant α-syn forms more
β-sheet and mature filaments than the wild-type protein. This accumulation of
α-syn as insoluble deposits of amyloid plays a major role in the pathogenesis of
these neurodegenerative diseases.
The SNCA gene (also known as PARK1) was the first gene to be associated
with dominantly inherited familial PD (Neurology, 2006). The SNCA protein is
abundantly expressed as a 140-residue cytosolic and lipid-binding phosphoprotein
in the vertebrate nervous system, where it is believed to participate in the maturation
of presynaptic vesicles and to function as a negative co-regulator of neurotransmitter
release. Fibril-forming, phosphorylated species of SNCA were found to be abundant
in insoluble inclusions (Lewy bodies and Lewy neurites). These ’synucleinopathy
disorders’ (a term coined by Trojanowski and Lee) primarily encompass sporadic
PD, SNCA-linked PD, dementia with Lewy bodies, and multiple-system atrophy, but
can also be variably found in other neurodegenerative syndromes.
Dawson and Dawson (2003) described the molecular pathways of neurodegeneration
in PD. Parkinson’s disease (PD) is a complex disorder with many different causes, yet
they intersect in common pathways, raising the possibility that neuroprotective agents
have broad applicability in the treatment of PD. Clinically, most patients present with
a motoric disorder and suffer from slowness of movement, rest tremor, rigidity, and
disturbances in balance. A number of patients also suffer from anxiety, depression,
autonomic disturbances, and dementia. Although there are effective symptomatic
therapies, there are no proven neuroprotective or neurorestorative therapies. Loss of
dopamine (DA) neurons in the substantia nigra pars compacta (SNC) leads to the
major clinical symptoms of PD, but there is widespread neuropathology and the
SNC only becomes involved toward the middle stages of the disease (Wassef et al.
2007). Lewy bodies (LBs) and dystrophic neurites (Lewy neurites) are a pathologic
hallmark of PD and classically are round eosinophilic inclusions composed of a halo
of radiating fibrils and a less defined core (Watabe et al. 2007). LBs are thought to
be a pathognomonic feature of PD, but recent studies suggest that some forms of
PD do not have LBs (Flower et al. 2007). Ultrastructurally, LBs are composed of
10- to 14-nm amyloid-like fibrils (Watabe et al. 2007) and α-synuclein, which can
polymerize into 10-nm fibrils in vitro and is the primary structural component of the
LB (Sredni et al. 2007). The list of proteins related to the candidate genes discovered
for Parkinson’s Disease is shown in Table 2.

1.4 Genes and Protein Structures Related to PD

Extensive molecular genetic studies have revealed the role of several genes
that are associated with Parkinson’s disease. Thirteen loci leading to Parkinson’s
disease are known, and four of them (PARK1, PARK3, PARK4 and PARK8) result in
autosomal dominant Parkinson’s disease. Three variants of PARK1 and around 40
variants of PARK8 have been studied to date. This wealth of information can be
further explored in structural bioinformatics studies of the abnormal protein
structures of the loci causing autosomal dominant Parkinson’s disease.

Table 2 Proteins concerned with candidate genes

Genes                             Locus    Proteins
SNCA                              PARK1    Alpha-synuclein
Autosomal recessive juvenile 2 /  PARK2    Parkin
  parkin
Yet to be assigned                PARK3    Yet to be assigned
Yet to be assigned                PARK4    Yet to be assigned
Ubiquitin carboxyl-terminal       PARK5    Ubiquitin carboxyl-terminal
  esterase L1 (ubiquitin                     esterase L1 (ubiquitin
  thiolesterase) or UCHL1 gene               thiolesterase)
Yet to be assigned                PARK6    Yet to be assigned
DJ-1                              PARK7    Protein DJ-1
Leucine-rich repeat kinase 2      PARK8    Dardarin
  (LRRK2)
ATPase type 13A2                  PARK9    ATP13A2
  (Kufor-Rakeb syndrome)
Yet to be assigned                PARK10   Yet to be assigned
Nuclear receptor subfamily 4,     PARK11   Nuclear receptor subfamily 4,
  group A, member 2 (NR4A2)                  group A, member 2
Alpha synuclein interacting       PARK12   Alpha synuclein interacting
  protein (SNCAIP)                           protein (SNCAIP)
Microtubule-associated protein    PARK13   Microtubule-associated protein
  tau (MAPT, TAU)                            tau (MAPT, TAU)

The dominant loci for PD are PARK1, PARK3, PARK4, PARK5 and PARK8. Of these
five loci, the most predominant are PARK1 and PARK8. The results of the study
indicate that:
1. The 3-D structure of the protein encoded by the gene at the PARK1 locus (SNCA)
is known, but the structures of the abnormal alpha-synuclein proteins causing PD
are not yet known.
2. No 3-D protein structure is available for the LRRK2 gene at the PARK8 locus,
and the structures of the abnormal proteins coded by the LRRK2 gene are also
unknown.
3. No suitable ligands are available that can neutralize the effect of these
proteins (dardarin, coded by LRRK2, and alpha-synuclein) inside the human brain.
4. The abnormal proteins of the PARK1 and PARK8 loci therefore need to be modeled.

1.4.1 Objectives of the Study

The study reports (a) protein structure modeling of the abnormal alpha-synuclein
proteins; (b) protein structure modeling of the wild-type LRRK2 protein and of the
abnormal proteins coded by the mutant LRRK2 gene; and (c) identification of
suitable ligands that can neutralize the effect of the above protein(s).

2 Materials and Methods

2.1 Data Source

The nucleotide sequences and the corresponding protein sequences of the two
genes, alpha-synuclein and LRRK2, were retrieved from the NCBI and Swiss-Prot
databases.

2.2 Methods

(i) Ideogram Study of Human Chromosomes
From the NCBI Map Viewer we obtained the ideogram maps of chromosome 4 and
chromosome 12, on which the two genes, alpha-synuclein and leucine-rich repeat
kinase 2 (LRRK2), are located.
(ii) Open Reading Frame Prediction
The region of the nucleotide sequences from the start codon (ATG) to the stop
codon is called the Open Reading frame. Depending on the starting point, there
are six possible ways (three on forward strand and three on complementary strand)
of translating any nucleotide sequence into amino acid sequence according to the
genetic code. These are called reading frames.
ORF Finder is a graphical analysis tool supported by NCBI which finds all open
reading frames of a selectable minimum size in a sequence already in the database.
This tool identifies all open reading frames using the standard or alternative genetic
codes. The reading frame determines which amino acids will be encoded by a gene.
Typically only one reading frame is used in translating a gene (in eukaryotes), and
this is often the longest open reading frame. ORF prediction helps to identify a
mutation in the coding sequence of the gene that changes the amino acid sequence
and so leads to a disrupted protein structure and abnormal protein function. The
ORF information for the alpha-synuclein and LRRK2 genes and the corresponding
amino acid sequences were obtained with NCBI tools.
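The six-frame scan described above can be sketched in a few lines of Python. This is a minimal illustration, not the NCBI ORF Finder implementation: it assumes the standard genetic code, upper-case DNA without ambiguity codes, and an ORF defined as a stretch from ATG to the first in-frame stop codon. The function names and the `min_codons` parameter are hypothetical.

```python
# Minimal six-frame ORF scan; an illustrative sketch, not NCBI ORF Finder.
COMPLEMENT = str.maketrans("ACGT", "TGCA")
STOP_CODONS = {"TAA", "TAG", "TGA"}  # stop codons of the standard genetic code


def reverse_complement(dna):
    """Return the reverse complement of an upper-case DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def orfs_in_frame(dna, offset, min_codons):
    """Yield (start, end) indices of ATG...stop stretches in one reading frame."""
    start = None
    for i in range(offset, len(dna) - 2, 3):
        codon = dna[i:i + 3]
        if start is None and codon == "ATG":
            start = i
        elif start is not None and codon in STOP_CODONS:
            # Length is counted in codons and includes the stop codon.
            if (i + 3 - start) // 3 >= min_codons:
                yield start, i + 3
            start = None


def find_orfs(dna, min_codons=2):
    """Scan three forward frames and three frames of the complementary strand."""
    dna = dna.upper()
    found = []
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            for start, end in orfs_in_frame(seq, offset, min_codons):
                found.append((strand, offset, seq[start:end]))
    return found


def longest_orf(dna, min_codons=2):
    """Return the longest ORF found, or None if there is none."""
    return max(find_orfs(dna, min_codons), key=lambda o: len(o[2]), default=None)
```

For a toy input such as `find_orfs("CCATGGCTTAAGG")`, the only ORF lies on the forward strand in the third frame (offset 2).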

3 Results

3.1 SNP Report

Open reading frame prediction gives information about the single nucleotide
polymorphisms (SNPs) in the coding sequences of the genes for alpha-synuclein
(SNCA) and leucine-rich repeat kinase 2 (LRRK2). The following SNP reports have
been generated from the ORF predictions for the gene sequences of SNCA and
LRRK2 (Tables 3 and 4).

3.2 Mutant Protein Sequences of the Wild Type Proteins

The mutant protein sequences of the wild type proteins have been obtained by insert-
ing a point mutation in the amino acid sequences of the wild type proteins.
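As a hedged sketch of this step, the snippet below applies a substitution written in the usual `A30P` notation to a protein sequence. The wild-type alpha-synuclein string is not printed in this section; it is reconstructed here by reverting the single substitution in the variant listings below, which is an assumption. The function name `apply_point_mutation` is hypothetical.

```python
import re

# Wild-type alpha-synuclein (140 residues). Assumption: reconstructed by
# reverting the single substitution in the variant sequences of this section.
SNCA_WT = (
    "MDVFMKGLSKAKEGVVAAAEKTKQGVAEAAGKTKEGVLYVGSKTKEGVVHGVATVAEKTKEQVTNVGGAV"
    "VTGVTAVAQKTVEGAGSIAAATGFVKKDQLGKNEEGAPQEGILEDMPVDPDNEAYEMPSEEGYQDYEPEA"
)


def apply_point_mutation(sequence, mutation):
    """Return `sequence` with a substitution such as 'A30P' applied.

    Positions are 1-based. The wild-type residue named in the mutation is
    checked against the sequence so that numbering errors are caught early.
    """
    match = re.fullmatch(r"([A-Z])(\d+)([A-Z])", mutation)
    if match is None:
        raise ValueError(f"unrecognized mutation: {mutation!r}")
    wild, position, mutant = match.group(1), int(match.group(2)), match.group(3)
    index = position - 1
    if not 0 <= index < len(sequence):
        raise ValueError(f"position {position} is outside the sequence")
    if sequence[index] != wild:
        raise ValueError(
            f"expected {wild} at position {position}, found {sequence[index]}"
        )
    return sequence[:index] + mutant + sequence[index + 1:]


# The three alpha-synuclein variants considered in this study.
VARIANTS = {name: apply_point_mutation(SNCA_WT, name)
            for name in ("A30P", "E46K", "A53T")}
```

Checking the wild-type residue before substituting is a deliberate design choice: a mismatch usually means the sequence and the mutation use different numbering.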

3.2.1 Alpha-Synuclein

VAR_007957 (A30P)
>VAR_007957|SYUA_HUMAN Alpha-synuclein - Homo sapiens (Human).
MDVFMKGLSKAKEGVVAAAEKTKQGVAEAPGKTKEGVLYVGSKTKEGVVHGVAT
VAEKTKEQVTNVGGAVVTGVTAVAQKTVEGAGSIAAATGFVKKDQLGKNEEGAPQEGILEDMPVDP
DNEAYEMPSEEGYQDYEPEA

Table 3 SNP report for alpha-synuclein

               Nucleotide (SNP)                    Amino acid (missense mutation)
Accession no.  Wild type  Base position  Mutant    Wild type   Residue position  Mutant
VAR_007957     Guanine    88             Cytosine  Alanine     30                Proline
VAR_022703     Guanine    188            Adenine   Glutamate   46                Lysine
VAR_007454     Guanine    209            Adenine   Alanine     53                Threonine

Table 4 SNP report for leucine-rich repeat kinase 2 (LRRK2)

               Nucleotide (SNP)                    Amino acid (missense mutation)
Accession no.  Wild type  Base position  Mutant    Wild type   Residue position  Mutant
VAR_024954     Adenine    5096           Guanine   Tyrosine    1699              Cysteine
VAR_024958     Guanine    6055           Adenine   Glycine     2019              Serine
VAR_024959     Thymine    6059           Cytosine  Isoleucine  2020              Threonine
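The relation between base position and residue position in Tables 3 and 4 can be checked with simple arithmetic. The sketch below assumes that base positions are 1-based and counted from the first base of the start codon; the tables do not state this. Under that assumption the three LRRK2 entries and the A30P entry map to the listed residues. Positions 188 and 209 map to codons 63 and 70 rather than 46 and 53, which suggests that those two entries follow a different reference numbering. That is an inference from the arithmetic, not something the source states.

```python
def codon_number(base_position):
    """Residue (codon) number containing a 1-based coding-sequence position."""
    return (base_position - 1) // 3 + 1


def codon_offset(base_position):
    """Position of the base within its codon (1, 2 or 3)."""
    return (base_position - 1) % 3 + 1


# (base position, residue position) pairs as listed in Tables 3 and 4.
TABLE_ENTRIES = {
    "A30P": (88, 30),
    "E46K": (188, 46),
    "A53T": (209, 53),
    "Y1699C": (5096, 1699),
    "G2019S": (6055, 2019),
    "I2020T": (6059, 2020),
}

# True where the listed residue agrees with coding-sequence numbering.
CONSISTENT = {name: codon_number(base) == residue
              for name, (base, residue) in TABLE_ENTRIES.items()}
```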

VAR_022703 (E46K)
>VAR_022703|SYUA_HUMAN Alpha-synuclein - Homo sapiens (Human).
MDVFMKGLSKAKEGVVAAAEKTKQGVAEAAGKTKEGVLYVGSKTKK
GVVHGVATVAEKTKEQVTNVGGAVVTGVTAVAQKTVEGAGSIAAATG
FVKKDQLGKNEEGAPQEGILEDMPVDPDNEAYEMPSEEGYQDYEPEA

VAR_007454 (A53T)
>VAR_007454|SYUA_HUMAN Alpha-synuclein - Homo sapiens (Human).
MDVFMKGLSKAKEGVVAAAEKTKQGVAEAAGKTKEGVLYVGSKTKE
GVVHGVTTVAEKTKEQVTNVGGAVVTGVTAVAQKTVEGAGSIAAATGFVKKDQ
LGKNEEGAPQEGILEDMPVDPDNEAYEMPSEEGYQDYEPEA

3.2.2 Leucine-Rich Repeat Kinase 2 (LRRK2)

Two PD-associated LRRK2 mutations, G2019S and I2020T, occur in the kinase domain
and increase autophosphorylation, suggesting a dominant gain-of-function
mechanism. The Y1699C mutation is the most frequently occurring mutation in PD
and causes the most significant effect. Hence, the segment spanning residues
1681–2040, which contains all three positions, has been taken from each variant
for further analysis.
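Because the variant names use full-length protein numbering while the listed sequences start at residue 1681, positions have to be shifted before they can be looked up in the segment. A minimal sketch of that conversion follows; the constant and function names are hypothetical, and the 19-residue prefix is copied from the wild-type segment shown in Sect. 3.4.2.

```python
SEGMENT_START = 1681  # first residue of the analysed LRRK2 segment
SEGMENT_END = 2040    # last residue of the analysed LRRK2 segment


def segment_index(residue_number, start=SEGMENT_START, end=SEGMENT_END):
    """Convert full-length residue numbering to a 0-based index in the segment."""
    if not start <= residue_number <= end:
        raise ValueError(f"residue {residue_number} is outside {start}-{end}")
    return residue_number - start


SEGMENT_LENGTH = SEGMENT_END - SEGMENT_START + 1

# First 19 residues of the wild-type segment; residue 1699 is the last of them.
WILD_PREFIX = "ELPHCENSEIIIRLYEMPY"
RESIDUE_1699 = WILD_PREFIX[segment_index(1699)]
```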
VAR_024954 (Y1699C)
>VAR_024954|LRRK2_HUMAN Leucine-rich repeat serine/threonine-protein kinase 2 - Homo sapiens (Human).
ELPHCENSEIIIRLYEMPCFPMGFWSRLINRLLEISPYMLSGRERALRPN
RMYWRQGIYLNWSPEAYCLVGSEVLDNHPESFLKITVPSCRKGCILLGQVVDHIDSLMEEWFP
GLLEIDICGEGETLLKKWALYSFNDGEEHQKILLDDLMKKAEEGDLLVNPDQPRLTIPIS
QIAPDLILADLPRNIMLNNDELEFEQAPEFLLGDGSFGSVYRAAYEGEEVAVKIFNKHT
SLRLLRQELVVLCHLHHPSLISLLAAGIRPRMLVMELASKGSLDRLLQQDKASLTRTLQH
RIALHVADGLRYLHSAMIIYRDLKPHNVLLFTLYPNAAIIAKIADYGIAQYCCRMGIKTSEGTPGFRA

VAR_024958 (G2019S)
>VAR_024958|LRRK2_HUMAN Leucine-rich repeat serine/threonine-protein kinase 2 - Homo sapiens (Human).
ELPHCENSEIIIRLYEMPYFPMGFWSRLINRLLEISPYMLSGRERALRPNRMYWRQGI
YLNWSPEAYCLVGSEVLDNHPESFLKITVPSCRKGCILLGQVVDHIDSLMEEWFPGLLEIDIC
GEGETLLKKWALYSFNDGEEHQKILLDDLMKKAEEGDLLVNPDQPRLTIPISQIAPDLILADLPRNIMLNND
ELEFEQAPEFLLGDGSFGSVYRAAYEGEEVAVKIFNKHTSLRLLRQELVVLCHLHHPSLISLLAAGIRPR
MLVMELASKGSLDRLLQQDKASLTRTLQHRIALHVADGLRYLHSAMIIYRDLKPHNVL
LFTLYPNAAIIAKIADYSIAQYCCRMGIKTSEGTPGFRA

VAR_024959 (I2020T)
>VAR_024959|LRRK2_HUMAN Leucine-rich repeat serine/threonine-protein kinase 2 - Homo sapiens (Human).
ELPHCENSEIIIRLYEMPYFPMGFWSRLINRLLEISPYMLSGRERALRP
NRMYWRQGIYLNWSPEAYCLVGSEVLDNHPESFLKITVPSCRKGCILLGQVVDHIDSLMEEWFPG
LLEIDICGEGETLLKKWALYSFNDGEEHQKILLDDLMKKAEEGDLLVNPDQPRLTIPI
SQIAPDLILADLPRNIMLNNDELEFEQAPEFLLGDGSFGSVYRAAYEGEEVAVKIFNKH
TSLRLLRQELVVLCHLHHPSLISLLAAGIRPRMLVMELASKGSLDRLLQQDKASLTRTLQHRIALHVAD
GLRYLHSAMIIYRDLKPHNVLLFTLYPNAAIIAKIADYGTAQYCCRMGIKTSEGTPGFRA

3.3 Secondary Structure Prediction

Secondary structure prediction is a set of techniques in bioinformatics that aim to
predict the local secondary structures of proteins and RNA sequences based only on
knowledge of their primary structure (amino acid or nucleotide sequence,
respectively). For proteins, a prediction consists of assigning regions of the amino
acid sequence as likely alpha helices, beta strands (often noted as “extended”
conformations), or turns.

3.3.1 Algorithms used for Prediction of Secondary Structure

(a) SOPMA
The self-optimized prediction method (SOPM) has been described to improve the
success rate in the prediction of the secondary structure of proteins. A further
improvement is obtained by predicting all the sequences of a set of aligned
proteins belonging to the same family. This improved SOPM method (SOPMA)
correctly predicts 69.5% of amino acids for a three-state description of the
secondary structure (alpha-helix, beta-sheet and coil) in a whole database
containing 126 chains of non-homologous (less than 25% identity) proteins.
(b) GOR
The GOR method, named for the three scientists who developed it (Garnier,
Osguthorpe, and Robson), is an information theory-based method developed not
long after Chou-Fasman that uses the more powerful probabilistic techniques of
Bayesian inference. The GOR method takes into account not only the probability of
each amino acid having a particular secondary structure, but also the conditional
probability of the amino acid assuming each structure given the contributions of
its neighbors. It is both more sensitive and more accurate than the Chou-Fasman
method, because amino acid structural propensities are only strong for a small
number of amino acids such as proline and glycine. The original GOR method is
roughly 65% accurate and is dramatically more successful in predicting alpha
helices than beta sheets, which it frequently mispredicts as loops or disorganized
regions.
The present version, GOR IV, uses all possible pair frequencies within a window
of 17 amino acid residues. After cross-validation on a database of 267 proteins,
GOR IV has a mean accuracy of 64.4% for a three-state prediction (Q3).
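The Q3 figure quoted above is the percentage of residues whose predicted state (helix, strand or coil) matches the observed one. The sketch below shows that calculation and the extraction of a 17-residue window centred on a residue. It does not implement GOR IV itself. The function names are hypothetical, and folding every state other than helix and strand (for example turns) into coil is an assumption made here for the three-state reduction.

```python
def to_three_state(states):
    """Reduce a state string to h (helix), e (strand) and c (everything else)."""
    return "".join(s if s in "he" else "c" for s in states.lower())


def q3_accuracy(predicted, observed):
    """Percentage of positions whose three-state assignment agrees."""
    if len(predicted) != len(observed):
        raise ValueError("state strings must have the same length")
    if not predicted:
        raise ValueError("state strings must not be empty")
    pred, obs = to_three_state(predicted), to_three_state(observed)
    correct = sum(p == o for p, o in zip(pred, obs))
    return 100.0 * correct / len(pred)


def residue_window(sequence, index, size=17):
    """Residues within a window of `size` centred on `index` (0-based).

    The window is truncated at the ends of the sequence.
    """
    half = size // 2
    return sequence[max(0, index - half):index + half + 1]
```

For example, `q3_accuracy("hhhhcceecc", "hhhccceeec")` compares ten positions, of which eight agree.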

3.4 Results of Secondary Structure Prediction

Each of the two prediction programs gives two outputs. The first is a readable
listing of the sequence with the predicted secondary structure in rows beneath it
(H = helix, E = extended or beta strand, C = coil); the second gives the
probability values for each secondary structure at each amino acid position. The
predicted secondary structure is the one of highest probability compatible with a
predicted helix segment of at least four residues and a predicted extended segment
of at least two residues. We obtained predictions for the three variants of
alpha-synuclein with each of the two methods. For LRRK2 we obtained the secondary
structure of the wild-type protein and of the three variants carrying the
identified mutations, again with both GOR and SOPMA. The results are listed in
the subsections that follow.

Fig. 1 Graph for the probability of occurrence of helix, sheet, turn and coil in the secondary
structure
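The selection rule stated above (highest probability, subject to minimum segment lengths) can be illustrated with a simplified sketch. It assumes the per-residue probabilities are available as dictionaries keyed by `h`, `e` and `c`. Helix or strand runs that are too short are demoted to coil; the prediction programs may resolve such cases differently, so this is an approximation of the rule, not their implementation.

```python
from itertools import groupby

# Minimum run lengths quoted in the text: helix >= 4, extended strand >= 2.
MIN_RUN = {"h": 4, "e": 2}


def assign_states(probabilities):
    """Pick the most probable state per residue, then enforce minimum runs.

    `probabilities` is a list of dicts with keys 'h', 'e' and 'c'.
    Simplification: runs shorter than the minimum are relabelled as coil.
    """
    # Ties are broken in the order h, e, c.
    states = [max("hec", key=lambda s: p[s]) for p in probabilities]
    pieces = []
    for state, run in groupby(states):
        length = len(list(run))
        if length < MIN_RUN.get(state, 1):
            state = "c"
        pieces.append(state * length)
    return "".join(pieces)
```

A three-residue helix or a single-residue strand in the raw per-residue maxima is therefore reported as coil, while a four-residue helix is kept.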

3.4.1 Variants of Alpha-Synuclein

1. VAR_007957
(a) GOR method (Fig. 1)
MDVFMKGLSKAKEGVVAAAEKTKQGVAEAPGKTKEGVLYVGSKTKEGVVHGVATVAEKTKEQVTNVGGAV
Cccccccchhhhhhhhhhhhhhhhhhhhcccccceeeeeecccccceeeeeeeehhhhceeeeeecccee

VTGVTAVAQKTVEGAGSIAAATGFVKKDQLGKNEEGAPQEGILEDMPVDPDNEAYEMPSEEGYQDYEPEA
Eeeeeehhhhhhhchhhhhhhhhhhhhhhcccccccccccceeecccccccccccccccccccccceeec

(b) SOPMA method (Fig. 2)


MDVFMKGLSKAKEGVVAAAEKTKQGVAEAPGKTKEGVLYVGSKTKEGVVHGVATVAEKTKEQVTNVGGAV
Hhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhccttteeeeecccctheeeeeeeehhcchhhhhhhhhhe

VTGVTAVAQKTVEGAGSIAAATGFVKKDQLGKNEEGAPQEGILEDMPVDPDNEAYEMPSEEGYQDYEPEA
Ehhhhhhhhhhhhhhhhhhhhhhhhhhcttcccccccchhhhhccccccccccchhccchhhhhcccthh

2. VAR_022703
(a) GOR method (Fig. 3)
MDVFMKGLSKAKEGVVAAAEKTKQGVAEAAGKTKEGVLYVGSKTKKGVVHGVATVAEKTKEQVTNVGGAV
Cccccccchhhhhhhhhhhhhhhhhhhhhhhcccceeeeecccccceeeeeeeehhhhceeeeeecccee

VTGVTAVAQKTVEGAGSIAAATGFVKKDQLGKNEEGAPQEGILEDMPVDPDNEAYEMPSEEGYQDYEPEA
Eeeeeehhhhhhhchhhhhhhhhhhhhhhcccccccccccceeecccccccccccccccccccccceeec

Fig. 2 Graph for the probability of occurrence of helix, sheet, turn and coil in the secondary
structure

Fig. 3 Graph for the probability of occurrence of helix, sheet, turn and coil in the secondary
structure

(b) SOPMA method (Fig. 4)


MDVFMKGLSKAKEGVVAAAEKTKQGVAEAAGKTKEGVLYVGSKTKKGVVHGVATVAEKTKEQVTNVGGAV
Hhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhcttteeeeecccctheeeeeeeehhcchhhhhhhhhhe

VTGVTAVAQKTVEGAGSIAAATGFVKKDQLGKNEEGAPQEGILEDMPVDPDNEAYEMPSEEGYQDYEPEA
ehhhhhhhhhhhhhhhhhhhhhhhhhhcttcccccccchhhhhccccccccccchhccchhhhhcccthh

3. VAR_007454

(a) GOR method (Fig. 5)


MDVFMKGLSKAKEGVVAAAEKTKQGVAEAAGKTKEGVLYVGSKTKEGVVHGVTTVAEKTKEQVTNVGGAV
Cccccccchhhhhhhhhhhhhhhhhhhhhhhcccceeeeecccccceeeeeeeeeeeccceeeeecccee

VTGVTAVAQKTVEGAGSIAAATGFVKKDQLGKNEEGAPQEGILEDMPVDPDNEAYEMPSEEGYQDYEPEA
Eeeeeehhhhhhhchhhhhhhhhhhhhhhcccccccccccceeecccccccccccccccccccccceeec

(b) SOPMA method (Fig. 6)


MDVFMKGLSKAKEGVVAAAEKTKQGVAEAAGKTKEGVLYVGSKTKEGVVHGVTTVAEKTKEQVTNVGGAV
Hhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhcttteeeeecccctteeeeeeeehhcchhhhhhhhhhe

VTGVTAVAQKTVEGAGSIAAATGFVKKDQLGKNEEGAPQEGILEDMPVDPDNEAYEMPSEEGYQDYEPEA
Ehhhhhhhhhhhhhhhhhhhhhhheccccccccccccchhhhhccccccccccchhccchhhhcccctth

Fig. 4 Graph for the probability of occurrence of helix, sheet, turn and coil in the secondary
structure

Fig. 5 Graph for the probability of occurrence of helix, sheet, turn and coil in the secondary
structure

Fig. 6 Graph for the probability of occurrence of helix, sheet, turn and coil in the secondary
structure

3.4.2 Leucine-Rich Repeat Kinase 2 (LRRK2) Protein Structure

(a) GOR method (Fig. 7)


ELPHCENSEIIIRLYEMPYFPMGFWSRLINRLLEISPYMLSGRERALRPNRMYWRQGIYLNWSPEAYCLV
Cccccccceeeeeeeccceecccchhhhhhhhhhccchhhhhhhhhhhcceeeeeeeeeeccccceeeec

GSEVLDNHPESFLKITVPSCRKGCILLGQVVDHIDSLMEEWFPGLLEIDICGEGETLLKKWALYSFNDGE
Ccccccccccceeeeeeccccccceeeccccchhhhhhhhhcccceeeeecccchhhhhhheeeeccccc

Fig. 7 Graph for the probability of occurrence of helix, sheet, turn and coil in the secondary
structure

Fig. 8 Graph for the probability of occurrence of helix, sheet, turn and coil in the secondary
structure

EHQKILLDDLMKKAEEGDLLVNPDQPRLTIPISQIAPDLILADLPRNIMLNNDELEFEQAPEFLLGDGSF
Hhhhhhhhhhhhhhhhcceecccccccceeccccccchhhhhccchhhhhccchhhhhhchhhhcccccc

GSVYRAAYEGEEVAVKIFNKHTSLRLLRQELVVLCHLHHPSLISLLAAGIRPRMLVMELASKGSLDRLLQ
Cchhhhhhcchhhhhhhhcccchhhhhhhhhhhhhhccccchhhhhhhchhhhhhhhhhhcccchhhhhh

QDKASLTRTLQHRIALHVADGLRYLHSAMIIYRDLKPHNVLLFTLYPNAAIIAKIADYGIAQYCCRMGIK
Hhhhhhhhhhhhhhhhhhhchhhhhhhhhhhhhcccccceeeeecccchhhhhhhhhceeeeeeeeeeec

TSEGTPGFRA
Ccccccceec

(b) SOPMA method (Fig. 8)


ELPHCENSEIIIRLYEMPYFPMGFWSRLINRLLEISPYMLSGRERALRPNRMYWRQGIYLNWSPEAYCLV
Cccccccchhhhheeeccccctthhhhhhhhhhhhhhhheetccccccttheeeetteeeeccttceeee

GSEVLDNHPESFLKITVPSCRKGCILLGQVVDHIDSLMEEWFPGLLEIDICGEGETLLKKWALYSFNDGE
Hhhhccccccceeeeeccccccchhhhhhhhhhhhhhhhhhcttceeecccccchhhhhcccceeecttc

EHQKILLDDLMKKAEEGDLLVNPDQPRLTIPISQIAPDLILADLPRNIMLNNDELEFEQAPEFLLGDGSF
Chhhhhhhhhhhhhhttceeeccccccccccccccchheeecccccceeechhhhhhhhccceeeccccc

GSVYRAAYEGEEVAVKIFNKHTSLRLLRQELVVLCHLHHPSLISLLAAGIRPRMLVMELASKGSLDRLLQ
Ceeeehccttcheeeeehhhcchhhhhhhhhhhhhhcccttheeeehhcccchheeeehccttchhhhhh

QDKASLTRTLQHRIALHVADGLRYLHSAMIIYRDLKPHNVLLFTLYPNAAIIAKIADYGIAQYCCRMGIK
Httcchhhhhhhhhhhhhhhhhhhhhhhheeeetccttceeeeecccthhheeeeccttcccccchttcc

TSEGTPGFRA
Ccccccccee

Fig. 9 Graph for the probability of occurrence of helix, sheet, turn and coil in the secondary
structure

Variants of LRRK 2
1. VAR_024954
(a) GOR method (Fig. 9)
ELPHCENSEIIIRLYEMPCFPMGFWSRLINRLLEISPYMLSGRERALRPNRMYWRQGIYLNWSPEAYCLV
Cccccccceeeeeeecccccccccccchhhhhhhccchhhhhhhhhhhcceeeeeeeeeeccccceeeec

GSEVLDNHPESFLKITVPSCRKGCILLGQVVDHIDSLMEEWFPGLLEIDICGEGETLLKKWALYSFNDGE
Ccccccccccceeeeeeccccccceeeccccchhhhhhhhhcccceeeeecccchhhhhhheeeeccccc

EHQKILLDDLMKKAEEGDLLVNPDQPRLTIPISQIAPDLILADLPRNIMLNNDELEFEQAPEFLLGDGSF
Hhhhhhhhhhhhhhhhcceecccccccceeccccccchhhhhccchhhhhccchhhhhhchhhhcccccc

GSVYRAAYEGEEVAVKIFNKHTSLRLLRQELVVLCHLHHPSLISLLAAGIRPRMLVMELASKGSLDRLLQ
Cchhhhhhcchhhhhhhhcccchhhhhhhhhhhhhhccccchhhhhhhchhhhhhhhhhhcccchhhhhh

QDKASLTRTLQHRIALHVADGLRYLHSAMIIYRDLKPHNVLLFTLYPNAAIIAKIADYGIAQYCCRMGIK
Hhhhhhhhhhhhhhhhhhhchhhhhhhhhhhhhcccccceeeeecccchhhhhhhhhceeeeeeeeeeec

TSEGTPGFRA
Ccccccceec

(b) SOPMA method (Fig. 10)


ELPHCENSEIIIRLYEMPCFPMGFWSRLINRLLEISPYMLSGRERALRPNRMYWRQGIYLNWSPEAYCLV
Cccccccchhhhheeeccccctthhhhhhhhhhhhhhhhhccccccccttheeeetteeeeccttceeee

GSEVLDNHPESFLKITVPSCRKGCILLGQVVDHIDSLMEEWFPGLLEIDICGEGETLLKKWALYSFNDGE
Ehhhccccccceeeeeccccccceeehhhhhhhhhhhhhhhcttheeecccccchhhhhccceeeecttc

EHQKILLDDLMKKAEEGDLLVNPDQPRLTIPISQIAPDLILADLPRNIMLNNDELEFEQAPEFLLGDGSF
Chhhhhhhhhhhhhhttceeecttcccccccccccchhheeeccccceeechhhhhhhhccceeecttcc

GSVYRAAYEGEEVAVKIFNKHTSLRLLRQELVVLCHLHHPSLISLLAAGIRPRMLVMELASKGSLDRLLQ
Ceeeehccttcheeeehhhhhhhhhhhhhhhhhhhhcccttheeeehhcccchheeeehccttchhhhhh

QDKASLTRTLQHRIALHVADGLRYLHSAMIIYRDLKPHNVLLFTLYPNAAIIAKIADYGIAQYCCRMGIK
Httcchhhhhhhhhhhhhhhhhhhhhhhheeeetccttceeeeeeccthheeeeeccttcccccchttcc

TSEGTPGFRA
Ccccccccee

Fig. 10 Graph for the probability of occurrence of helix, sheet, turn and coil in the secondary
structure

Fig. 11 Graph for the probability of occurrence of helix, sheet, turn and coil in the secondary
structure

2. VAR_024958
(a) GOR method (Fig. 11)
ELPHCENSEIIIRLYEMPYFPMGFWSRLINRLLEISPYMLSGRERALRPNRMYWRQGIYLNWSPEAYCLV
Cccccccceeeeeeeccceecccchhhhhhhhhhccchhhhhhhhhhhcceeeeeeeeeeccccceeeec

GSEVLDNHPESFLKITVPSCRKGCILLGQVVDHIDSLMEEWFPGLLEIDICGEGETLLKKWALYSFNDGE
Ccccccccccceeeeeeccccccceeeccccchhhhhhhhhcccceeeeecccchhhhhhheeeeccccc

EHQKILLDDLMKKAEEGDLLVNPDQPRLTIPISQIAPDLILADLPRNIMLNNDELEFEQAPEFLLGDGSF
Hhhhhhhhhhhhhhhhcceecccccccceeccccccchhhhhccchhhhhccchhhhhhchhhhcccccc

GSVYRAAYEGEEVAVKIFNKHTSLRLLRQELVVLCHLHHPSLISLLAAGIRPRMLVMELASKGSLDRLLQ
Cchhhhhhcchhhhhhhhcccchhhhhhhhhhhhhhccccchhhhhhhchhhhhhhhhhhcccchhhhhh

QDKASLTRTLQHRIALHVADGLRYLHSAMIIYRDLKPHNVLLFTLYPNAAIIAKIADYSIAQYCCRMGIK
Hhhhhhhhhhhhhhhhhhhchhhhhhhhhhhhhcccccceeeeecccchhhhhhhhhhhhhhhhhhhhhc

TSEGTPGFRA
Ccccccceec

(b) SOPMA method (Fig. 12)


ELPHCENSEIIIRLYEMPYFPMGFWSRLINRLLEISPYMLSGRERALRPNRMYWRQGIYLNWSPEAYCLV
Cccccccchhhhheeeccccctthhhhhhhhhhhhhhhheetccccccttheeeetteeeeccttheeee

GSEVLDNHPESFLKITVPSCRKGCILLGQVVDHIDSLMEEWFPGLLEIDICGEGETLLKKWALYSFNDGE
Hhhhccccccceeeeeccccccchhhhhhhhhhhhhhhhhhcttcceecccccchhhhhcccceeccccc

Fig. 12 Graph for the probability of occurrence of helix, sheet, turn and coil in the secondary
structure

EHQKILLDDLMKKAEEGDLLVNPDQPRLTIPISQIAPDLILADLPRNIMLNNDELEFEQAPEFLLGDGSF
Chhhhhhhhhhhhhctteeeecccccceeecccccccheeeeccctheeectthhhhhhccheeeccccc

GSVYRAAYEGEEVAVKIFNKHTSLRLLRQELVVLCHLHHPSLISLLAAGIRPRMLVMELASKGSLDRLLQ
Ceeehhhhttcheeeeeecccchhhhhhhhhhhhhhcccttheeeehhcccchheeeehccttchhhhhh

QDKASLTRTLQHRIALHVADGLRYLHSAMIIYRDLKPHNVLLFTLYPNAAIIAKIADYSIAQYCCRMGIK
Hhhhhhhhhhhhhhhhhhhhhhhhhhhteeeeetccttceeeeecccthhheeeeccttcccccchttcc

TSEGTPGFRA
Ccccccccee

3. VAR_024959
(a) GOR method (Fig. 13)
ELPHCENSEIIIRLYEMPYFPMGFWSRLINRLLEISPYMLSGRERALRPNRMYWRQGIYLNWSPEAYCLV
Cccccccceeeeeeeccceecccchhhhhhhhhhccchhhhhhhhhhhcceeeeeeeeeeccccceeeec

GSEVLDNHPESFLKITVPSCRKGCILLGQVVDHIDSLMEEWFPGLLEIDICGEGETLLKKWALYSFNDGE
Ccccccccccceeeeeeccccccceeeccccchhhhhhhhhcccceeeeecccchhhhhhheeeeccccc

EHQKILLDDLMKKAEEGDLLVNPDQPRLTIPISQIAPDLILADLPRNIMLNNDELEFEQAPEFLLGDGSF
Hhhhhhhhhhhhhhhhcceecccccccceeccccccchhhhhccchhhhhccchhhhhhchhhhcccccc

GSVYRAAYEGEEVAVKIFNKHTSLRLLRQELVVLCHLHHPSLISLLAAGIRPRMLVMELASKGSLDRLLQ
Cchhhhhhcchhhhhhhhcccchhhhhhhhhhhhhhccccchhhhhhhchhhhhhhhhhhcccchhhhhh

QDKASLTRTLQHRIALHVADGLRYLHSAMIIYRDLKPHNVLLFTLYPNAAIIAKIADYGTAQYCCRMGIK
Hhhhhhhhhhhhhhhhhhhchhhhhhhhhhhhhcccccceeeeecccchhhhhhhhcccceeeeeeeeec

TSEGTPGFRA
Ccccccceec

(b) SOPMA method (Fig. 14)


ELPHCENSEIIIRLYEMPYFPMGFWSRLINRLLEISPYMLSGRERALRPNRMYWRQGIYLNWSPEAYCLV
Cccccccchhhhheeeccccctthhhhhhhhhhhhhhhhhccccccccttheeeetteeeeccttheeee

GSEVLDNHPESFLKITVPSCRKGCILLGQVVDHIDSLMEEWFPGLLEIDICGEGETLLKKWALYSFNDGE
Ehhcccccccceeeeeccccccceeehhhhhhhhhhhhhhhctthceeccccchhhhhhccceeeecttc

EHQKILLDDLMKKAEEGDLLVNPDQPRLTIPISQIAPDLILADLPRNIMLNNDELEFEQAPEFLLGDGSF
Chhhhhhhhhhhhhhttceeecttcccceeehhhcccheeeecccthheechhhhhhhhccceeecttcc

GSVYRAAYEGEEVAVKIFNKHTSLRLLRQELVVLCHLHHPSLISLLAAGIRPRMLVMELASKGSLDRLLQ
Ceeeehhcttcheeeeeecccchhhhhhhhhhhhhhcccttheeeehhccccheeeeehccttchhhhhh

Fig. 13 Graph for the probability of occurrence of helix, sheet, turn and coil in the secondary
structure

Fig. 14 Graph for the probability of occurrence of helix, sheet, turn and coil in the secondary
structure

QDKASLTRTLQHRIALHVADGLRYLHSAMIIYRDLKPHNVLLFTLYPNAAIIAKIADYGTAQYCCRMGIK
Hhhhhhhhhhhhhhhhhhhhhhhhhhhhheeeetccttceeeeeeccthheeeeeccttcceeeccttcc

TSEGTPGFRA
Ccccccceee

3.5 Secondary Structure Prediction Report

The results obtained with the two methods for the three variants of each of
alpha-synuclein and leucine-rich repeat kinase 2 are summarized in the following
tables. The tables show the amino acid substitution (wild-type and mutant
residue), its position, and the secondary structure predicted at that position
(Tables 5 and 6).
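A summary of this kind can be read off the prediction strings by looking up the state letter at the mutated residue in the wild-type and mutant outputs. The sketch below is a hedged illustration. The mapping of `t` to "beta turn" follows the wording of the tables, the function names are hypothetical, and the strings in the usage note are toy data rather than the predictions listed above.

```python
STATE_NAMES = {
    "h": "alpha helix",
    "e": "extended strand",
    "c": "random coil",
    "t": "beta turn",
}


def state_at(prediction, residue_number, first_residue=1):
    """Name of the predicted state at a residue (protein numbering)."""
    return STATE_NAMES[prediction[residue_number - first_residue].lower()]


def compare_site(wild_pred, mutant_pred, residue_number, first_residue=1):
    """Predicted state at one residue in the wild-type and mutant outputs."""
    wild = state_at(wild_pred, residue_number, first_residue)
    mutant = state_at(mutant_pred, residue_number, first_residue)
    return {"wild": wild, "mutant": mutant, "changed": wild != mutant}


def changed_positions(wild_pred, mutant_pred, first_residue=1):
    """Residue numbers at which the two predictions differ."""
    pairs = zip(wild_pred.lower(), mutant_pred.lower())
    return [i + first_residue for i, (w, m) in enumerate(pairs) if w != m]
```

With the toy strings `"cchhhheec"` and `"cchhcceec"`, `compare_site(..., 5)` reports a change from alpha helix to random coil. Passing `first_residue=1681` makes the same functions work on the LRRK2 segment in full-length numbering.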

Table 5 Secondary structure prediction report of abnormal proteins of alpha-synuclein

Algorithm  Mutant form  Amino acid substitution         Predicted secondary structure
                        Wild (W)  Position  Mutant (M)  Wild             Mutant
SOPMA      Var_007957   A         30        P           Alpha helix      Alpha helix
SOPMA      Var_022703   E         46        K           Alpha helix      Beta turn
SOPMA      Var_007454   A         53        T           Alpha helix      Extended strand
GOR        Var_007957   A         30        P           Alpha helix      Random coil
GOR        Var_022703   E         46        K           Alpha helix      Random coil
GOR        Var_007454   A         53        T           Alpha helix      Extended strand

Table 6 Secondary structure prediction report of abnormal proteins of leucine-rich repeat kinase
2 (LRRK2)

Algorithm  Mutant form  Amino acid substitution         Predicted secondary structure
                        Wild (W)  Position  Mutant (M)  Wild             Mutant
SOPMA      Var_024954   Y         1699      C           Random coil      Random coil
SOPMA      Var_024958   G         2019      S           Beta turn        Beta turn
SOPMA      Var_024959   I         2020      T           Random coil      Extended strand
GOR        Var_024954   Y         1699      C           Extended strand  Random coil
GOR        Var_024958   G         2019      S           Extended strand  Alpha helix
GOR        Var_024959   I         2020      T           Extended strand  Random coil

3.6 Protein 3-D Structure Modeling

Based on Tables 5 and 6, which give the wild-type and mutant residues, their
positions and the secondary structure predicted at each position, we constructed
probable 3-D structures of the proteins concerned. The 3-D structure modeling was
performed by two methods:

(1) Modeling of the 3-D structure of the abnormal proteins of alpha-synuclein by
the homology modeling method, using Modeller 9v2.
(2) Modeling of the 3-D structure of the original protein and the variants of
leucine-rich repeat kinase 2 by the threading method, using Threader 3.

3.6.1 Homology Modeling of Variants of Alpha-Synuclein

The variants of alpha-synuclein are obtained by inserting one point mutation into
the primary protein sequence of wild-type SNCA. Hence, the alignment of each
alpha-synuclein variant with the wild-type protein showed 99% similarity (139 of
140 residues are identical), and the wild-type alpha-synuclein acts as the
template for modeling the 3-D structure of its variants. The 3-D protein
structures of the SNCA variants can thus be modeled by homology modeling, which
requires at least 25–30% target-template sequence similarity. The structures were
modeled using Modeller 9v2.
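The figures quoted here can be reproduced with a short calculation. The sketch below computes percent identity for two sequences of equal length (no gaps), which is all a single-substitution variant needs, and applies the similarity threshold as a rule of thumb. The 30% default is an assumption taken from the upper end of the 25–30% range in the text, and the function names are hypothetical.

```python
def percent_identity(a, b):
    """Percentage of identical positions in two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    if not a:
        raise ValueError("sequences must not be empty")
    matches = sum(x == y for x, y in zip(a, b))
    return 100.0 * matches / len(a)


def suggested_method(identity, threshold=30.0):
    """Rule of thumb: homology modeling above the threshold, else threading."""
    return "homology modeling" if identity >= threshold else "threading"
```

A 140-residue variant that differs from its template at one position gives 139/140, about 99.3%, well above the threshold. A template at 17%, as found for LRRK2, falls below it and points to threading.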
The four steps to homology modelling are:

1. Template Selection
The simplest method of template identification/selection relies on serial pairwise
sequence alignments given by database search techniques such as FASTA and
BLAST. More sensitive methods based on multiple sequence alignment, of which
PSI-BLAST is the most common example, iteratively update their position-specific
scoring matrix to successively identify more distantly related homologs. These
methods produce a larger number of potential templates. When performing a BLAST
search, a reliable first approach is to identify hits with a sufficiently low
E-value, which are considered close enough in evolution to make a reliable
homology model.

2. Target-Template Alignment
To align the target sequence with the template, we can use a pairwise alignment
program (e.g., BLASTZ or LALIGN) in the case of a single template and a multiple
alignment program (e.g., CLUSTALW or T-COFFEE) in the case of multiple templates.
When multiple templates are selected, a good strategy is to superimpose them
on each other first to obtain a multiple structure-based alignment. In the next
step, the target sequence is aligned with this multiple structure-based alignment.
The final target-template alignment is then obtained by aligning the two profiles.
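To make the pairwise case concrete, the sketch below implements a textbook global alignment by dynamic programming (Needleman-Wunsch). It is not the algorithm used by BLASTZ, LALIGN or CLUSTALW. The match, mismatch and gap scores are illustrative placeholders for a real substitution matrix and gap model.

```python
def global_align(a, b, match=1, mismatch=-1, gap=-2):
    """Global pairwise alignment; returns (score, aligned_a, aligned_b)."""
    n, m = len(a), len(b)

    def pair_score(x, y):
        return match if x == y else mismatch

    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + pair_score(a[i - 1], b[j - 1]),
                score[i - 1][j] + gap,   # gap in b
                score[i][j - 1] + gap,   # gap in a
            )

    # Trace back from the bottom-right corner to recover one best alignment.
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if (i > 0 and j > 0 and
                score[i][j] == score[i - 1][j - 1] + pair_score(a[i - 1], b[j - 1])):
            top.append(a[i - 1])
            bottom.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1])
            bottom.append("-")
            i -= 1
        else:
            top.append("-")
            bottom.append(b[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(top)), "".join(reversed(bottom))
```

For a target that differs from its template by a single substitution, as with the alpha-synuclein variants, the best alignment contains no gaps and the residues pair up position by position.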

3. Model Construction
Given a template and an alignment, the information contained therein must be used
to generate a three-dimensional structural model of the target, represented as a set
of Cartesian coordinates for each atom in the protein. Three major classes of model
generation methods have been proposed:

3.6.2 Fragment Assembly

The original method of homology modeling relied on the assembly of a complete
model from conserved structural fragments identified in closely related solved
structures. Thus unsolved proteins could be modeled by first constructing the
conserved core and then substituting variable regions from other proteins in the
set of solved structures.

3.6.3 Segment Matching

The segment-matching method divides the target into a series of short segments, each
of which is matched to its own template fitted from the Protein Data Bank. Thus,
sequence alignment is done over segments rather than over the entire protein. Selec-
tion of the template for each segment is based on sequence similarity, comparisons
of alpha carbon coordinates, and predicted steric conflicts arising from the van der
Waals radii of the divergent atoms between target and template.

3.6.4 Satisfaction of Spatial Restraints

The most common current homology modeling method takes its inspiration from
calculations required to construct a three-dimensional structure from data generated
by NMR spectroscopy. One or more target-template alignments are used to construct
a set of geometrical criteria that are then converted to probability density functions for
each restraint. Restraints applied to the main protein internal coordinates—protein
backbone distances and dihedral angles—serve as the basis for a global optimization
procedure that originally used conjugate gradient energy minimization to iteratively
refine the positions of all heavy atoms in the protein.
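The idea of turning restraints into probability density functions and optimizing over them can be illustrated with a toy objective. The sketch below assumes each restraint is a Gaussian on the distance between two atoms, so that its negative log-density (up to a constant) is a squared deviation scaled by the variance. This is a deliberate simplification for illustration and not the functional form used by any particular modeling program. All names and numbers are hypothetical.

```python
import math


def distance(p, q):
    """Euclidean distance between two 3-D points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def restraint_objective(coords, restraints):
    """Sum of negative log Gaussian densities, constants dropped.

    `coords` is a list of (x, y, z) tuples; each restraint is a tuple
    (i, j, mean, sigma) on the distance between atoms i and j.
    Lower values mean the coordinates satisfy the restraints better.
    """
    total = 0.0
    for i, j, mean, sigma in restraints:
        d = distance(coords[i], coords[j])
        total += (d - mean) ** 2 / (2.0 * sigma ** 2)
    return total


# Toy example: three atoms on a line, with restraints loosely resembling
# consecutive alpha-carbon spacing (values are made up).
TOY_RESTRAINTS = [(0, 1, 3.8, 0.1), (1, 2, 3.8, 0.1), (0, 2, 7.0, 0.5)]
STRETCHED = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0)]
```

An optimizer such as conjugate gradient would then move the coordinates to lower this sum. Here the fully stretched chain violates only the third restraint.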

3.7 Loop Modeling

Regions of the target sequence that are not aligned to a template are modeled by
loop modeling. They are the most susceptible to major modeling errors and occur
with higher frequency when the target and template have low sequence identity.
The coordinates of unmatched sections determined by loop modeling programs are
generally much less accurate than those obtained by simply copying the coordinates
of a known structure.
A commonly used server for homology modeling is SWISS-MODEL. The query (target)
sequence is submitted to the server, and the 3-D model of the target sequence is
returned by e-mail. The procedures described above are implemented on the server,
which performs all of them. Once the 3-D model is obtained, it can be viewed using
Swiss-PdbViewer or RasMol (software for viewing 3-dimensional structures).

3.7.1 Model Validation

Evaluating the model after building it is a necessary step. If the model has
errors, it has to be discarded and the process started again from the beginning.
The steps involved in homology modeling of the 3-D protein structures of the two
genes with respect to Parkinson’s disease are shown in the flow chart
(Appendix 3, Fig. 1).
Threading—Method for Protein 3-D Structure Modeling
No 3-D protein structure is available for the wild type leucine-rich repeat kinase
2 (LRRK2). The PSI-BLAST search returned only one template with an available 3-D
structure, and it has only 17% similarity with LRRK2. Thus, the threading method
has been used instead of homology modeling for modeling the 3-D protein structure
of LRRK2 and its variants. Threader 3 has been used for the protein 3-D structure
modeling.
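
As an illustration of how a figure such as 17% can be obtained from an alignment, the sketch below computes percent identity over the aligned, ungapped columns of a pairwise alignment. This is only one of several conventions, and it is an assumption that it corresponds to the similarity value reported by PSI-BLAST; the function name and the example strings are hypothetical.

```python
def percent_identity(aligned_target, aligned_template, gap="-"):
    """Percent of identical residues over columns where neither sequence has a gap."""
    if len(aligned_target) != len(aligned_template):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    identical = 0
    for a, b in zip(aligned_target, aligned_template):
        if a == gap or b == gap:
            continue  # skip gapped columns
        compared += 1
        if a == b:
            identical += 1
    return 100.0 * identical / compared if compared else 0.0

# Hypothetical toy alignment: 4 identical residues in 5 ungapped columns
identity = percent_identity("MKV-LA", "MRVGLA")
```

With identity this low, alignment-based template selection becomes unreliable, which is the usual motivation for switching from homology modeling to threading.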
Protein threading, or fold recognition, refers to a class of computational methods
for predicting the structure of a protein from its amino acid sequence.
The basic idea is that the target sequence (the protein sequence for which the
structure is being predicted) is threaded through the backbone structures of a
collection of template proteins, known as the fold library, and a “goodness of fit”
score is calculated for each sequence-structure alignment. This goodness of fit is
often derived in terms of an empirical energy function, based on statistics derived
from known protein structures, but many other scoring functions are also available.
The most useful scoring functions include both pairwise terms (interactions between
pairs of amino acids) and solvation terms. Threading methods share some of the
characteristics of both comparative modeling methods (the sequence alignment aspect)
and ab initio prediction methods (predicting structure based on identifying
low-energy conformations of the target protein).
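
A toy version of such a goodness-of-fit score is sketched below. It combines a pairwise term over template contacts with a solvation term over buried and exposed positions, and picks the template with the lowest energy. The energy values, the hydrophobic residue set and the two-entry fold library are hypothetical, and the sequence is placed on each template without gaps; real threading programs such as Threader use knowledge-based potentials and search over alignments.

```python
HYDROPHOBIC = set("AVLIMFWC")  # assumed hydrophobic residue classes for this toy example

def pair_energy(a, b):
    """Toy pairwise term: hydrophobic-hydrophobic contacts are favourable."""
    if a in HYDROPHOBIC and b in HYDROPHOBIC:
        return -1.0
    if (a in HYDROPHOBIC) != (b in HYDROPHOBIC):
        return 0.5
    return 0.0

def solvation_energy(residue, buried):
    """Toy solvation term: reward buried hydrophobic and exposed polar residues."""
    return -0.5 if (residue in HYDROPHOBIC) == buried else 0.5

def threading_score(sequence, template):
    """Energy of a gapless placement of the sequence on a template."""
    if len(sequence) != len(template["buried"]):
        raise ValueError("sequence and template must have the same length")
    energy = sum(pair_energy(sequence[i], sequence[j]) for i, j in template["contacts"])
    energy += sum(solvation_energy(r, b) for r, b in zip(sequence, template["buried"]))
    return energy

# Hypothetical fold library: contact pairs and burial state per template position
fold_library = {
    "fold_A": {"contacts": [(0, 3), (1, 4)],
               "buried": [True, True, False, True, True, False]},
    "fold_B": {"contacts": [(2, 5)],
               "buried": [False, False, True, False, False, True]},
}
target = "LVKIAE"
scores = {name: threading_score(target, t) for name, t in fold_library.items()}
best_fold = min(scores, key=scores.get)  # lowest energy = best fit
```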
Fold recognition methods can be broadly divided into two types:
(1) Methods that derive a 1-D profile for each structure in the fold library and align
the target sequence to these profiles.
(2) Methods that consider the full 3-D structure of the protein template.
In the 3-D representation, the structure is modeled as a set of inter-atomic distances,
i.e. the distances are calculated between some or all of the atom pairs in the structure.
This is a much richer and far more flexible description of the structure, but it is much
harder to use in calculating an alignment. The profile-based fold recognition approach
was first described by Bowie, Lüthy and Eisenberg in 1991. The term threading was
first coined by Jones, Taylor and Thornton in 1992, and originally referred specifically
to the use of a full 3-D structure atomic representation of the protein template in fold
recognition. Today, the terms threading and fold recognition are frequently (though
somewhat incorrectly) used interchangeably.

Fig. 15 The region of 1XQ8 susceptible to mutations

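
The inter-atomic distance representation described above can be sketched as a distance matrix computed from atomic coordinates, from which contacts between residues are read off. The coordinates below are hypothetical alpha carbon positions, and the 8 Å cutoff and minimum sequence separation of 3 are assumed values chosen only for illustration.

```python
import math

def distance_matrix(coords):
    """All pairwise distances between the given atoms."""
    n = len(coords)
    return [[math.dist(coords[i], coords[j]) for j in range(n)] for i in range(n)]

def find_contacts(coords, cutoff=8.0, min_separation=3):
    """Residue pairs closer than the cutoff and at least min_separation apart in sequence."""
    dm = distance_matrix(coords)
    n = len(coords)
    return [(i, j)
            for i in range(n)
            for j in range(i + min_separation, n)
            if dm[i][j] <= cutoff]

# Hypothetical alpha carbon coordinates (Å) for a six-residue hairpin-like trace
ca_coords = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0),
             (7.6, 3.8, 0.0), (3.8, 3.8, 0.0), (0.0, 3.8, 0.0)]
dm = distance_matrix(ca_coords)
contact_pairs = find_contacts(ca_coords)
```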
Fold recognition methods are widely used and effective because it is believed
that there is a strictly limited number of different protein folds in nature, mostly
as a result of evolution but also due to constraints imposed by the basic physics
and chemistry of polypeptide chains. There is, therefore, a good chance (currently
70-80%) that a protein which has a similar fold to the target protein has already been
studied by X-ray crystallography or NMR spectroscopy and can be found in the PDB
(Protein Data Bank). Currently there are just over 1100 different protein folds known.
The protein structure for the gene sequence of alpha-synuclein is already available
from the NCBI site, and the 3-D model is shown below. The figure shows the three
positions where mutations have been observed and the corresponding amino acid
substitutions (Fig. 15).
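
The three substitutions can be expressed as simple edits to the amino acid sequence. The sketch below parses a mutation label such as A30P, checks that the wild type residue at that position matches, and returns the variant sequence. The 60-residue N-terminal fragment of human alpha-synuclein used here is quoted from memory as an assumption and should be verified against the database record before any real use; the function name is hypothetical.

```python
import re

# Assumed N-terminal residues 1-60 of human alpha-synuclein (verify against the database)
SNCA_N_TERM = "MDVFMKGLSKAKEGVVAAAEKTKQGVAEAAGKTKEGVLYVGSKTKEGVVHGVATVAEKTK"

def apply_substitution(sequence, mutation):
    """Apply a point substitution written as <wild type><position><mutant>, e.g. A30P."""
    match = re.fullmatch(r"([A-Z])(\d+)([A-Z])", mutation)
    if match is None:
        raise ValueError(f"unrecognised mutation label: {mutation}")
    wild, position, mutant = match.group(1), int(match.group(2)), match.group(3)
    if not 1 <= position <= len(sequence):
        raise ValueError(f"position {position} is outside the sequence")
    if sequence[position - 1] != wild:
        raise ValueError(f"expected {wild} at position {position}, "
                         f"found {sequence[position - 1]}")
    # Positions are 1-based in the label, 0-based in the string
    return sequence[:position - 1] + mutant + sequence[position:]

variants = {label: apply_substitution(SNCA_N_TERM, label)
            for label in ("A30P", "E46K", "A53T")}
```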

3.7.2 Results of Modeling

We have obtained 3-D protein structure models for the three mutations of each protein.
They are presented below, separately for the alpha-synuclein variants and for the
leucine-rich repeat kinase 2 variants. The figures show three probable models for the
three variants in the case of alpha-synuclein (Figs. 16, 17 and 18).

Fig. 16 VAR_007957 (A30P) (variant of alpha-synuclein)

Fig. 17 VAR_022703 (E46K) (variant of alpha-synuclein)

Models of Alpha-Synuclein Variants
(a) VAR_007957 (A30P) (Fig. 16)

(b) VAR_022703 (E46K) (Fig. 17)

(c) VAR_007454 (A53T) (Fig. 18)

Model of wild type protein of leucine-rich repeat kinase 2 (LRRK2) (Fig. 19)

Fig. 18 VAR_007454 (A53T) (variant of alpha-synuclein)

Fig. 19 Leucine-rich repeat kinase 2. Model of wild type protein of LRRK2



Fig. 20 VAR_024954 (variant of LRRK 2)

Fig. 21 VAR_024958 (variant of LRRK 2)

Models of Leucine-rich Repeat Kinase 2 Variants


(a) VAR_024954 (Y1699C) (Fig. 20)
(b) VAR_024958 (G2019S) (Fig. 21)
(c) VAR_024959 (I2020T) (Fig. 22)

Fig. 22 VAR_024959 (variant of LRRK 2)

3.7.3 Model Validation

The 3-D protein structures have been validated using three protein structure
validation servers: (1) Verify3D, (2) ERRAT and (3) ANOLEA.
The Verify3D Structure Evaluation server is a tool designed to help in the refine-
ment of crystallographic structures. It provides a visual analysis of the quality of a
putative crystal structure for a protein.
ERRAT is a protein structure verification algorithm that is especially well-suited
for evaluating the progress of crystallographic model building and refinement. The
program works by analyzing the statistics of non-bonded interactions between dif-
ferent atom types. A single output plot is produced that gives the value of the error
function vs. position of a 9-residue sliding window. By comparison with statistics
from highly refined structures, the error values have been calibrated to give confi-
dence limits.
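
The sliding-window idea can be illustrated as follows. This is not the ERRAT statistic itself, which is built from non-bonded atom-pair frequencies; the sketch simply averages hypothetical per-residue error values over a 9-residue window and flags windows whose value exceeds an assumed limit.

```python
def sliding_window_scores(values, window=9):
    """Mean of the values in each window of consecutive residues."""
    if window <= 0 or window > len(values):
        raise ValueError("window must be between 1 and the number of values")
    return [sum(values[start:start + window]) / window
            for start in range(len(values) - window + 1)]

def flag_windows(window_scores, limit):
    """Indices of windows whose score exceeds the given limit."""
    return [i for i, score in enumerate(window_scores) if score > limit]

# Hypothetical per-residue error values: a well-modelled stretch followed by a poor one
per_residue_error = [1.0] * 9 + [10.0] * 3
window_scores = sliding_window_scores(per_residue_error)
flagged = flag_windows(window_scores, limit=2.5)
```

Plotting the window scores against the window position gives a profile of the kind described above, in which poorly modelled regions appear as peaks above the confidence limit.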
ANOLEA (Atomic Non-Local Environment Assessment) is a server that performs
energy calculations on a protein chain, evaluating the “Non-Local Environment”
(NLE) of each heavy atom in the molecule. The energy of each pairwise interaction
in this non-local environment is taken from a distance-dependent knowledge-based
mean force potential that has been derived from a database of 147 non-redundant
protein chains with a sequence identity below 25%, solved by X-ray crystallography
with a resolution lower than 3 Å.

Table 7 Model validation report for the alpha-synuclein variants

Accession no.   Model     Structure validation report
                          Verify3D       ERRAT                             ANOLEA
VAR_007957      Model 1   Good           Good                              Good
                Model 2   Fair           Satisfactory                      Fair
                Model 3   Satisfactory   Below minimum interaction limit   Fair
VAR_022703      Model 1   Fair           Below minimum interaction limit   Satisfactory
                Model 2   Good           Good                              Good
                Model 3   Fair           Satisfactory                      Good
VAR_007454      Model 1   Satisfactory   Below minimum interaction limit   Satisfactory
                Model 2   Good           Good                              Good
                Model 3   Fair           Satisfactory                      Satisfactory

3.8 Model Validation Results

The different possible models of the 3-D protein structure obtained by threading
methods have been validated by the following three methods, viz. Verify3D, ERRAT and
ANOLEA. The comparative results obtained from the validation of the 3-D protein
structures for the two genes, alpha-synuclein and LRRK2, are shown in Tables 7 and 8.

3.8.1 Models of Variants of Alpha-Synuclein

The results of the validation methods shown in Table 7 suggest that Model 1 of
VAR_007957 for alpha-synuclein scores GOOD for all three methods followed.
In the case of VAR_022703 and VAR_007454, Model 2 scores GOOD for the three
methods used for the validation (Table 7). Based on the above scores, the three final
models of the 3-D protein structure for alpha-synuclein have been constructed and are
shown in Fig. 23.
Final models after validation: alpha-synuclein (Fig. 23)
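
The selection rule used above, i.e. keeping the model that is rated Good by all three servers, can be written down explicitly. The ratings in the sketch are transcribed from Table 7; the dictionary layout and the function name are assumptions made for illustration.

```python
# Ratings from Table 7 as (Verify3D, ERRAT, ANOLEA) for each candidate model
BELOW = "Below minimum interaction limit"
table7 = {
    "VAR_007957": {"Model 1": ("Good", "Good", "Good"),
                   "Model 2": ("Fair", "Satisfactory", "Fair"),
                   "Model 3": ("Satisfactory", BELOW, "Fair")},
    "VAR_022703": {"Model 1": ("Fair", BELOW, "Satisfactory"),
                   "Model 2": ("Good", "Good", "Good"),
                   "Model 3": ("Fair", "Satisfactory", "Good")},
    "VAR_007454": {"Model 1": ("Satisfactory", BELOW, "Satisfactory"),
                   "Model 2": ("Good", "Good", "Good"),
                   "Model 3": ("Fair", "Satisfactory", "Satisfactory")},
}

def models_rated_good(reports):
    """Names of the models rated Good by every validation server."""
    return [name for name, ratings in reports.items()
            if all(rating == "Good" for rating in ratings)]

final_models = {accession: models_rated_good(reports)
                for accession, reports in table7.items()}
```

For each variant exactly one model passes this rule, which is consistent with the choice of final models described in the text.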

3.8.2 Model Validation Results—LRRK2

The wild type protein structure of the LRRK2 gene is shown in Fig. 24. The results of
the validation methods shown in Table 8 suggest that Model 1 of VAR_024954 for
LRRK2 scores GOOD for all three methods followed. In the case of VAR_024958
and VAR_024959, Model 2 scores GOOD for the three methods used for the validation
(Table 8). Based on the above scores, the three final models of the 3-D protein structure
for LRRK2 have been constructed and are shown in Fig. 25.

Table 8 Model validation report for the leucine-rich repeat kinase 2 variants

Accession no.   Model     Structure validation report
                          Verify3D       ERRAT                             ANOLEA
VAR_007957      Model 1   Good           Good                              Good
                Model 2   Fair           Satisfactory                      Fair
                Model 3   Satisfactory   Below minimum interaction limit   Fair
VAR_022703      Model 1   Fair           Below minimum interaction limit   Satisfactory
                Model 2   Good           Good                              Good
                Model 3   Fair           Satisfactory                      Good
VAR_007454      Model 1   Satisfactory   Below minimum interaction limit   Satisfactory
                Model 2   Good           Good                              Good
                Model 3   Fair           Satisfactory                      Satisfactory

Fig. 23 Final models of alpha-synuclein based on Table 7: VAR_007957 (A30P), VAR_022703
(E46K) and VAR_007454 (A53T)

Fig. 24 Final model of the protein structure for wild type leucine-rich repeat kinase 2

Fig. 25 Protein structure models of LRRK2 gene after the validation results

3.9 Final Models After Validation—LRRK2

Leucine-rich repeat kinase 2 (LRRK 2) wild type protein structure (Fig. 24)

Final models of variants of LRRK 2: VAR_024954 (Y1699C), VAR_024958
(G2019S) and VAR_024959 (I2020T) (Fig. 25)

Fig. 26 Diagram showing receptor-ligand docking (induced-fit) (http://www.wikipedia.org/docking/)

4 Receptor-Ligand Docking

Molecular docking can be thought of as a “lock-and-key” problem, in which one is
interested in finding the correct relative orientation of the “key” that will open
the “lock” (where on the surface of the lock the keyhole is, which direction to turn
the key after it is inserted, etc.). Here, the protein can be thought of as the “lock”
and the ligand as the “key”. Molecular docking may be defined as an optimization
problem that describes the “best-fit” orientation of a ligand binding to a particular
protein of interest. However, since both the ligand and the protein are flexible, a
“hand-in-glove” analogy is more appropriate than “lock-and-key”. During the course
of the process, the ligand and the protein adjust their conformations to achieve an
overall “best fit”, and this kind of conformational adjustment resulting in the
overall binding is referred to as “induced fit” (Fig. 26).
The focus of molecular docking is to computationally simulate the molecular
recognition process. The aim of molecular docking is to achieve an optimized
conformation for both the protein and the ligand, and a relative orientation between
protein and ligand, such that the free energy of the overall system is minimized.
AutoDock 4 has been used for docking the three abnormal proteins of alpha-synuclein
with pyrroloquinoline quinone. AutoDock is a suite of automated docking
tools. It is designed to predict how small molecules, such as substrates or drug
candidates, bind to a receptor of known 3D structure. AutoDock consists of
two main programs: (1) AutoGrid pre-calculates a set of grids describing the target
protein, and (2) AutoDock performs the docking of the ligand to these grids.
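
The two-step workflow can be sketched as the pair of command lines that a script would issue, AutoGrid first and AutoDock second. The sketch only builds the commands and does not run them. It assumes that the `autogrid4` and `autodock4` executables are installed and that the grid and docking parameter files have already been prepared; the file stem `snca_a30p_pqq` is hypothetical.

```python
def docking_commands(stem):
    """Command lines for the grid calculation followed by the docking run."""
    autogrid = ["autogrid4", "-p", f"{stem}.gpf", "-l", f"{stem}.glg"]
    autodock = ["autodock4", "-p", f"{stem}.dpf", "-l", f"{stem}.dlg"]
    # Order matters: the docking run reads the grid maps written by AutoGrid
    return [autogrid, autodock]

# Hypothetical file stem for one variant-ligand pair
commands = docking_commands("snca_a30p_pqq")
command_lines = [" ".join(command) for command in commands]
```

Each command list could be passed to `subprocess.run` with `check=True`, so that a failed grid calculation stops the workflow before the docking step.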
AutoDock has applications in:
• X-ray crystallography
• Structure-based drug design
• Lead optimization
• Virtual screening (HTS)
• Combinatorial library design
• Protein-protein docking
• Chemical mechanism studies

Fig. 27 Chemical structure for pyrroloquinoline quinone

Fig. 28 3-D structure for pyrroloquinoline quinone

Kobayashi et al. (2006) reported that pyrroloquinoline quinone (PQQ) is a
non-covalently bound cofactor in the bacterial oxidative metabolism of alcohols. PQQ
also exists in plants and animals. Due to its inherent chemical feature, namely its
free-radical scavenging properties, PQQ has been drawing attention from both the
nutritional and the pharmacological viewpoint. Alpha-synuclein, a causative factor
of Parkinson’s disease (PD), has the propensity to oligomerize and form fibrils,
and this tendency may play a crucial role in its toxicity. PQQ prevents the amyloid
fibril formation and aggregation of alpha-synuclein in vitro in a
PQQ-concentration-dependent manner. Moreover, PQQ forms a conjugate with
alpha-synuclein, and this PQQ-conjugated alpha-synuclein is also able to prevent
alpha-synuclein amyloid fibril formation. Their study demonstrates the characteristics
of PQQ as an anti-amyloid fibril-forming reagent and, together with further
pharmacological approaches, suggests that PQQ is a candidate for future anti-PD
reagent compounds (Fig. 27).
Chemical name: 4,5-dioxo-1H-pyrrolo[5,4-f]quinoline-2,7,9-tricarboxylic acid
(Fig. 28)
The docking of pyrroloquinoline quinone with alpha-synuclein and its variants
using AutoDock 4 failed to give models with the lowest minimization energy; thus,
pyrroloquinoline quinone did not dock to alpha-synuclein or its variants in silico.
Hence, the docking of alpha-synuclein with the ligand was unsuccessful.

5 Conclusion

Parkinson’s disease is caused by mutations in 13 different types of gene loci
in Homo sapiens. These 13 different types are located in 4 different loci: PARK 1,
PARK 3, PARK 4 and PARK 8. Out of these four, information on the proteins of only
two loci, i.e. PARK 1 and PARK 8, is available in the biological databases. Thus,
this study has been done using only these two loci. Alpha-synuclein is the gene for
the PARK 1 locus and leucine-rich repeat kinase 2 (LRRK 2) is the gene for the PARK 8
locus. Analysis of the SNCA exon showed a G to C nucleotide substitution in base 88,
a G to A nucleotide substitution in base 188 and a G to A substitution in base 209 of
the SNCA gene, causing amino acid substitutions of Ala to Pro (A30P), Glu to Lys
(E46K) and Ala to Thr (A53T), respectively. Also, analysis of the leucine-rich repeat
kinase 2 (LRRK2) exon showed an A to G nucleotide substitution in base 1699, a G to A
nucleotide substitution in base 2019 and a T to C substitution in base 2020 of the
LRRK2 gene, causing amino acid substitutions of Tyr to Cys (Y1699C), Gly to Ser
(G2019S) and Ile to Thr (I2020T), respectively.
The secondary structure predictions have been done for the variants of both
alpha-synuclein and LRRK 2 using two algorithms, viz. SOPMA and GOR. The secondary
structure prediction using the SOPMA algorithm for the variants of alpha-synuclein
showed that alpha-helix is present for both mutant and wild type in the case of
VAR_007957 (A30P); in the case of VAR_022703 (E46K), beta strand is present for the
mutant and alpha-helix for the wild type; and in VAR_007454 (A53T), extended strand
is present for the mutant and alpha-helix for the wild type alpha-synuclein. The
secondary structure prediction using the GOR algorithm for the variants of
alpha-synuclein showed that random coil is present for the mutant and alpha-helix for
the wild type in the case of VAR_007957 (A30P); in the case of VAR_022703 (E46K),
random coil is present for the mutant and alpha-helix for the wild type; and in
VAR_007454 (A53T), extended strand is present for the mutant and alpha-helix for the
wild type alpha-synuclein.
The secondary structure prediction using the SOPMA algorithm for the variants of
leucine-rich repeat kinase 2 (LRRK 2) showed that random coil is present for
both mutant and wild type in the case of VAR_024954 (Y1699C); in the case of
VAR_024958 (G2019S), beta turn is present for both mutant and wild type; and
in VAR_024959 (I2020T), extended strand is present for the mutant and random coil
for the wild type LRRK 2. The secondary structure prediction using the GOR
algorithm for the variants of leucine-rich repeat kinase 2 (LRRK 2) showed that
random coil is present for both mutant and wild type in the case of VAR_024954
(Y1699C); in the case of VAR_024958 (G2019S), alpha-helix is present for the mutant
and extended strand for the wild type; and in VAR_024959 (I2020T), random coil is
present for the mutant and extended strand for the wild type LRRK 2.
The alpha-synuclein variants, and leucine-rich repeat kinase 2 (LRRK 2) and its
variants, have been modeled successfully. The variants of alpha-synuclein, i.e.
VAR_007957, VAR_022703 and VAR_007454, have been modeled using Modeller
9v2 with 99% similarity with the template, i.e. wild type alpha-synuclein. Three
models have been generated for each of VAR_007957 (A30P), VAR_022703 (E46K) and
VAR_007454 (A53T). Homology modeling of these three variants shows slight
differences in Ramachandran plot values at the corresponding mutated residues, which
provide valuable information about their structural backbone orientation. It was
observed that the A53T and E46K mutations have a significant effect on the structure
of the folded protein, whereas the A30P mutation may cause only a minor perturbation
in the helical structure around the site of the mutation.
Pyrroloquinoline quinone is the ligand which, when bound to these sites, may inhibit
the action of these mutations (Kobayashi et al. 2006). The docking of pyrroloquinoline
quinone with these three variants could inhibit the abnormal action of these variants
in the human brain by inhibiting the formation of inclusion bodies or aggregates.
However, pyrroloquinoline quinone was unable to bind to the active sites of the
variants of alpha-synuclein in silico, and hence the docking was unsuccessful. This
suggests that pyrroloquinoline quinone may not bind to the alpha-synuclein variants
in its present form, and that the ligand may need to be present as a dimer or polymer
to fit into the active sites of the variants, because the active site of
alpha-synuclein is too large for a single pyrroloquinoline quinone molecule to fit
into it.
No original model was available for the leucine-rich repeat kinase 2 (LRRK 2) protein,
i.e. dardarin. The 3-D protein structure for LRRK 2 has been modeled by the threading
method using Threader 3. The variants of leucine-rich repeat kinase 2 (LRRK 2),
i.e. VAR_024954, VAR_024958 and VAR_024959, have also been modeled using
Threader 3 with 17% similarity with the template. Three models have been generated
for each of VAR_024954 (Y1699C), VAR_024958 (G2019S) and VAR_024959
(I2020T). The protein models for the variants of the LRRK2 protein showed a significant
difference in Ramachandran plot values, which can provide valuable information
about the folding of the kinase domain and its backbone orientation.

References

Allan W (1937) Inheritance of shaking palsy. Arch Intern Med 60:424–436


Andreadis A, Brow MW, Kosik KS (1992) Structure and novel exons of the human τ gene. Bio-
chemistry 31:10626–10633
Belin AC, Westerlund M (2008) Parkinson’s disease: a genetic perspective. FEBS J 275(7):1377–
1383. doi:10.1111/j.1742-4658.2008.06301.x
Bonifati V, Rizzu P, van Baren MJ, Schaap O, Breedveld GJ, Krieger E, Dekker MC, Squitieri
F, Ibanez P, Joosse M, van Dongen JW, Vanacore N, van Swieten JC, Brice A, Meco G, van
Duijn CM, Oostra BA, Heutink P (2003) Mutations in DJ-1 gene associated autosomal recessive
early-onset parkinsonism. Science 299(5604):255–259
Bowie JU, Luthy R, Eisenberg D (1991) A method to identify protein sequences that fold into a
known three-dimensional structure. Science 253(5016):164–170
Brice A (2005) How much does dardarin contribute to Parkinson’s disease. Lancet 365(9457):
363–364

Campion D, Martin C, Heilig R, Charbonnier F, Moreau V, Flaman JM, Petit JL, Hannequin D,
Brice A, Frebourg T (1995) The NACP/synuclein gene: chromosomal assignment and screening
for alterations in Alzheimer disease. Genomics 26:254–257
Chen X, Rohan de Silva HA, Pettenati MJ, Rao PN, St. George-Hyslop P, Roses AD, Xia Y,
Horsburgh K, Ueda K, Saitoh, (1995) The human NACP/alpha-synuclein gene: chromosome
assignment to 4q21.3-q22 and TaqI RFLP analysis. Genomics 26:425–427
Choi HJ, Lee SY, Cho Y, Hwang O (2004) JNK activation by tetrahydrobiopterin: implications for
Parkinson’s disease. Neurosci Res 75(5):715–721
Christine K, Katja L-H (2007) Impact of recent genetic findings in Parkinson’s disease. Curr Opin
Neuro 20(4):453–464
Clarimon J, Xiromerisiou G, Eerola J, Gourbali V, Hellstrom O, Dardiotis E, Peuralinna T, Papadim-
itriou A, Hadjigeorgiou GM, Tienari P, Singleton AB (2005) Lack of evidence for genetic associ-
ation between FGF20 and Parkinson’s disease in Finnish and Greek patients. BMC Neurol 5:11.
doi:10.1186/1471-23775/5/11
Conway KA, Lee SJ, Rochet JC, Ding TT, Williamson RE, Lansbury PT Jr (2000) Acceleration of
oligomerization, not fibrillization, is a shared property of both alpha-synuclein mutations linked
to early-onset Parkinson’s disease: implications for pathogenesis and therapy. Proc Natl Acad Sci
97:571–576
Cookson MR (2015) LRRK2 pathways leading to neurodegeneration. Curr Neurol Neurosci Rep
15(7):564. doi:10.1007/s11910-015-0564-y
Dawson TM, Dawson VL (2003) Molecular pathways of neurodegeneration in Parkinson’s disease.
Science 302(5646):819–822. doi:10.1126/science.1087753
Day IN, Thompson RJ (1987) Molecular cloning of cDNA coding for human PGP 9.5 protein: a
novel cytoplasmic marker for neurones and neuroendocrine cells. Fedrat Europ Biochem Societ
Lett (FEBS) 210:157–160
Day IN, Hinks LJ, Thompson RJ (1990) The structure of the human gene encoding protein gene
product 9.5 (PGP9.5), a neuron-specific ubiquitin C-terminal hydrolase. Biochem J 268(2):
521–524. doi:10.1042/bj2680521
Dekker MCJ, Bonfati V, van Dujin CM (2003) Parkinson’s disease: piecing together a genetic
jigsaw. Brain 126:1722–1733
Deng H-X, Shi Y, Yang Y, Kreshnik B, Ahmeti etc. (2016) Identification of TMEM230 mutation
in familial Parkinson’s disease. Nat Genet 48(7):733–741
Deng J, Lewis PA, Greggio E, Sluch E, Beilina A, Cookson MR (2008) Structure of the ROC domain
from the Parkinson’s disease-associated leucine-rich repeat kinase 2 reveals a dimeric GTPase.
Proc Natl Acad Sci 105:1499–1504
Di Fonzo A, Rohe CF, Ferreira J, Chien HF, Vacca L, Stocchi F, Guedes L, Fabrizio E, Manfredi M,
Vanacore N, Goldwurm S, Breedveld G, Sampaio C, Meco G, Barbosa E, Oostra BA, Bonifati
V (2005) Italian Parkinson genetics network : a frequent LRRK2 gene mutation associated with
autosomal dominant Parkinson’s disease. Lancet 365:412–415
Di Fonzo A, Chien HF, Socal M, Giraudo S, Tassorelli C, Iliceto G, Fabbrini G, Marconi R, Fincati
E, Abbruzzese G, Marini P, Squitieri F et al (2007) ATP13A2 missense mutationin juvenile
parkinsonism and young onset Parkinson disease. Neurology 87:1557–1562
Di Fonzo A, Dekker MC, Montagna R, Baruzzi A, Yonova EH, Correia Guedes L, Szczerbinska
A, Zhao T, Dubbel-Hulsman LO, Wouters CH, de Graaff E, Oyen WJ, Simons EJ, Breedveld
GJ, Oostra BA, Horstink MW, Bonifati V (2009) FBX07 mutations cause autosomal recessive
early-onset parkinsonian-pyramidal syndrome. Neurology 72:240–245
Dufty BM, Warner LR, Hou ST, Jiang SX, Gomez-Isla T, Leenhouts KM, Oxford JT, Feany MB,
Masliah E, Rohn TT (2007) Calpain-Cleavage of alpha-synuclein: connecting proteolytic process-
ing to disease-linked aggregation. Am J Pathol 170:1725–1738
Engelender S, Kaminsky Z, Guo X, Sharp AH, Amaravi RK, Kleiderlein JJ, Margolis RL, Troncoso
JC, Lanahan AA, Worley PF, Dawson VL, Dawson TM, Ross CA (1999) Synphilin-1 associates
with alpha-synuclein and promotes the formation of cytosolic inclusions. Nat Genet 22:110–114

Engelender S, Wanner T, Kleiderlein JJ, Ashworth R, Wakabayashi K, Tsuji S, Takashi H, Margolis
RL, Ross CA (2000) Organization of the human synphilin-1 gene, a candidate for Parkinson’s
disease. Mamm Genome 11:763–766
Erusalimsky JD, Moncada S (2007) Nitric oxide and mitochondrial signaling: from physiology to
pathophysiology. Biology 27:2524–2531
Faccio L, Fusco C, Chen A, Martinotti S, Bonventre JV, Zervos AS (2000) Characterization of a
novel human serine protease that has extensive homology to bacterial heat shock endoprotease
HtrA and is regulated by kidney ischemia. J Biol Chem 275:2581–2588
Farrer M, Gwinn-Hardy K, Muenter M, DeVrieze FW, Crook R, Perez-Tur J, Lincoln S, Maraganore
D, Adler C, Newman S, Mac Elwee K, McCarthy P, Miller C, Waters C, Hardy J (1999) A
chromosome 4p haplotypes segregating with Parkinson’s disease and postural tremor. Hum Molec
Genet 8:81–85
Farrer M, Stone J, Mata IF, Lincoln S, Kachergus J, Hulihan M, Strain KJ, Marganore TM
(2005) LRRK2 mutations in Parkinson disease. Neurology 65:738–740. doi:10.1212/01.WNL.
0000169023.51764.b0:1526-632x
Flower TR, Clark-Dixon C, Metoyer C, Yang H, Shi R, Zhang Z, Witt SN (2007) YGR198w (YPP1)
targets A30P alpha-synuclein to the vacuole for degradation. J Cell Biol 177:1091–1104
Foround T (2005) LRRK2: both a cause and a risk factor for Parkinson ’s disease? Neurology
65:664–665. doi:10.1212/01.wnl.0000179342.58181.c9
Funayama M, Hasegawa K, Ohta E, Kawashima N, Komiyama M, Kowa H, Tsuji S, Obata F (2005)
An LRRK2 mutation as a cause for the parkinsonism in the original PARK8 family. Ann Neurol
57:918–921
Garnier J, Osguthorpe D, Robson B (1978) Analysis of the accuracy and implications of simple
methods for predicting the secondary structure of globular proteins. J Mol Biol 120:97–120
Garnier J. JF Gibrat and B Robson (1996) GOR Secondary structure prediction method version IV,
In: Doolittle RF (ed) Methods in enzymology, vol 266, pp 540–553
Gasser T (2005) Genetics of Parkinson’s disease. Curr Opin Neurol 18:363–369
Gasser T, Muller-Myhsok B, Durr Wszolek ZK, Vaughan A, Bonifati JR, Meco V, Bereznai G,
Oehlmann B, Agid R, Brice Y, Wood AN (1997) Genetic complexity and Parkinson’s disease.
Science 277:388–389
Geourjon C, Deleage G (1995) Significant improvement in protein secondary structure prediction
by consensus prediction from multiple alignments. Comput Appl Biosci 11(6):681–684
Ghosh A, Roy A, Liu X, Kordower JH, Mufson EJ, Hartley DM, Ghosh S, Mosley RL, Gendelman
HE, Pahan K (2007) Selective inhibition of NF-kappaB activation prevents dopaminergic neuronal
loss in a mouse model of Parkinson’s disease. Proc Natl Acad Sci USA 104:18754–18759
Gilks WP, Abou-Sleiman PM, Gandhi S, Jain S, Singleton A, Lees AJ, Shaw K, Bhatia KP, Bonifati
V, Quinn NP, Lynch J, Healy DG, Holton JL, Revesz T, Wood NW (2005) A common LRRK2
mutation in idiopathic Parkinson’s disease. Lancet 365:415–416
Golbe LI, Farrell TM, Davis PH (1990) Follow-up study of early-life protective and risk factors in
Parkinson’s disease. Mov Disord 5:66–70
Golbe LI, Di Lorio G, Sanges G, Lazzarini AM, La Sala S, Bonavita Duvoisin RC (1996) Clinical
genetic analysis of Parkinson’s disease in the Contursi kindred. Ann Neurol 40(5):767–75
Goldman JE, Yen S-H, Chiu FC, Peress NS (1983) Lewy bodies of Parkinson’s disease contain
neurofilament antigens. Science 83(221):1082–1084
Gowers WR, (1900) A manual of diseases of the nervous system. Vol. I. Diseases of the nerves and
spinal cord, 3rd edn. P. Blakiston’s Son & Co. pub, Philadelphia
Graeber MB, Muller U (1992) The X-linked dystonia-parkinsonism syndrome: clinical and mole-
cular genetic analysis. Brain Pathol 2:287–295
Gray CW, Ward RV, Karran E, Turconi S, Rowles A, Viglienghi D, Southan C, Barton A, Fantom
KG, West A, Savopoulos J, Hassan NJ, Clinkenbeard H, Hanning C, Amegadzie B, Davis JB,
Dingwall C, Livi GP, Creasy CL (2000) Characterization of human HtrA2, a novel serine protease
involved in the mammalian cellular stress response. Europ J Biochem 267:5699–5710

Greenbaum EA, Graves CL, Mishizen-Eberz AJ, Lupoli MA, Lynch DR, Englander SW, Axelsen
PH, Giasson BI (2005) The E46K mutation in alpha-synuclein increases amyloid fibril formation.
J Biol Chem 280:7800–7807
Hampshire DJ, Roberts E, Crow Y, Bond J, Mubaidin A, Wriekat AL, Al-Din A, Woods CG (2001)
Kufor-Rakeb syndrome, pallid-plyramidal degeneration with supranuclear upgaze paresis and
dementia, maps to 1p36. BMJ J Med Genet 38:680–682
Hardy Rideout J (ed) (2017) Leucine-rich repeat Kinase 2 (LRRK2). Springer. doi:10.1007/978-3-
319-49969-7
Hasegawa K, Kowa H (1997) Autosomal dominant familial Parkinson disease: older onset of age,
and good response to levodopa therapy. Europ Neurol 38:39–43
Health 2012 4(11A) Special issue on Parkinson’s disease
Hedrich K, Heintz N, Zoghbi H (1997) Alpha-synuclein–a link between Parkinson and Alzheimer
diseases. Nature Genet 16:325–327
Hedrich K, Winkler S, Hagenah J, Kabakci K, Kasten M, Schwinger E, Volkmann J, Pramstaller
PP, Kostic V, Viergge P, Klein C (2006) Recurrent LRRK2 (Park 8) mutations in early on set of
Parkinson’s disease. Movement Disorders 21:1506–1510. doi:10.1002/mds.20990
Hedrich K, Marder K, Harris J, Kann M, Lynch T, Mejia-Santana H, Pramstaller PP,
Schwinger E, Bressman SB, Fahn S, Klein C (2002) Evaluation of 50 probands
with early onset Parkinson’s disease for Parkin mutations. Neurology 58(8):1239–1246.
http://dx.doi.org/10.1212/WNL.58.8.1239
Hicks AA, Petursson H, Jonsson T, Stefansson H, Johannsdottir HS, Sainz J, Frigge ML, Kong
A, Gulcher JR, Stefansson K, Sveinbjornsdottir S (2002) A susceptibility gene for late-onset
idiopathic Parkinson’s disease. Ann Neurol 52:549–555
Hope AD, Myhre R, Kachergus J, Lincoln S, Bisceglio G, Hulihan M, Farrer MJ (2004) α-Synuclein
missense and multiplication mutations in autosomal dominant Parkinson’s disease. Neurosci Lett
367(1):97–100
Jin SM, Youle RJ (2012) PINK1-and Parkin-mediated mitophagy at a glance. J Cell Sci 125:
795–799. doi:10.1243/cs.093849
Jones DT, Taylor WR, Thornton JM (1992) A new approach to protein fold recognition. Nature
358:86–89. doi:10.1038/358086a0
Jun DJ, Kim J, Jung SY, Song R, Noh JH, Park YS, Ryu SH, Kim JH, Kong YY, Chung JM, Kim
KT (2007) Extracellular ATP mediates necrotic cell swelling in SN4741 dopaminergic neurons
through P2X7 receptors. J Biol Chem 282:37350–37358
Kachergus J, Mata IF, Hulihan M, Taylor JP, Lincoln S, Aasly J, Gibson JM, Ross OA, Lynch T,
Wiley J, Payami H, Nutt J, Maraganore DM, Czyzewski K, Styczynska M, Wszolek ZK, Farrer
MJ, Toft M (2005) Identification of a novel LRRK2 mutation linked to autosomal dominant
parkinsonism: evidence of a common founder across European populations. Am J Hum Genet
76:672–680
Karamohamed S, DeStefano AL, Wilk JB, Shoemaker CM, Golbe LL, Mark MH, Lazzarini AM,
Suchowersky O, Labelle N, Gurrman M, Currie LJ, Wooten GF, 22 others, (2003) A haplotypes
at the Park3 locus influences onset age for Parkinson’s disease: the gene PD study. Neurology
61:1557–1561
Kitada T, Asakawa S, Hattori N, Matsumine H, Yamamura Y, Minoshima S, Yokochi M, Mizuno Y,
Shimizu N (1998) Mutations in the parkin gene cause autosomal recessive juvenile parkinsonism.
Nature 392:605–608. doi:10.1038/33416
Kitada T, Pisani A, Porter DR, Yamaguchi H, Tscherter A, Martella G, Bonsi P, Zhang C, Pothos
EN, Shen J (2007) Impaired dopamine release and synaptic plasticity in the striatum of PINK1-
deficient mice. Proc Natl Acad Sci USA 104:11441–11446
Kitada T, Tong Y, Gautier CA, Shen J (2009) Absence of nigral degeneration in aged parkin/DJ-
1/PINK1 triple knockout mice. J Neurochem 111:696–702
Klein C, Schlossmacher MG (2006) The genetics of Parkinson’s disease: implications for neuro-
logical care. Nat Clin Pract Neurol 2:136–146. doi:10.1038/ncpneuro0126

Kobayashi M, Kim J, Kobayashi N, Han S, Nakamura C, Ikebukuro K, Sode K (2006)
Pyrroloquinoline quinone (PQQ) prevents fibril formation of alpha-synuclein. Biochem
Biophys Res Commun 349(3):1139–1144
Kontakos N, Stokes J (2000) Monograph series on aging-related diseases: XII. Parkinson’s disease-
recent developments and new directions. Chronic Diseases 20(3)
Kuzuhara S, Mori H, Izumiyama N, Yoshimura M, Ihara Y (1988) Lewy bodies are ubiquitinated.
Acta Neuropathol 75:345–353
Lee LV, Kupke KG, Caballar-Gonzanga F, Hebron-Ortiz M, Muller U (1991) The phenotype of
the X-linked dystonia-parkinsonism syndrome. An assessment of 42 cases in the Philippines.
Medicine 70:179–187
Lees AJ, Singleton AW (2007) Clinical heterogeneity of ATP13A2 linked disease (Kufor-Rakeb)
justifies a PARK designation. Neurology 68:1553–1554
Lennox G, Lowe J, Morrell K, Landon M, Meayer RJ (1989) Anti-ubiquitin immunocytochemistry
is more sensitive than conventional techniques in the detection of diffuse Lewy body disease.
J Neurol Neurosurg Psychiatry 52:67–71
Leroy E, Boyer R, Polymeropoulos MH (1998) Intron-exon structure of ubiquitin C-terminal
hydrolase-L1. DNA Res 5:397–400
Lesage S et al (2005) LRRK2 haplotype analyses in European and North African families with
Parkinson disease: a common founder for the G2019S mutation dating from the 13th century.
Am J Hum Genet 77:330–332
Levecque C, Elbaz A, Clavel J, Vidal JS, Amouyel P, Alperovitch A, Tzourio C, Chartier-Harlin MC,
(2017) Association of polymorphisms in the Tau and Saitohin genes with Parkinson’s Disease,
BMJ: J Neuro Neurosurg Pschiat 75(3):478–480. http://jnnp.bmj.com
Li Y, Scott J, Hedges WK, Zhang DJ, Gaskell F, Nance PC, Watts MA, Hubble RL, Koller JP,
Pahwa WC, Stern R, Hiner MB (2002) Age at onset in two common neurodegenerative diseases
is genetically controlled. Am J Hum Genet 70:985–993
Liu C, Fei E, Jia N, Wang H, Tao R, Iwata A, Nukina N, Zhou J, Wang G (2007) Assembly of lysine
63-linked ubiquitin conjugates by phosphorylated alpha-synuclein implies lewy body biogenesis.
J Biol Chem 282:14558–14566
Massano J, Bhatia KP (2012) Clinical approach to Parkinson’s disease: features, diagnosis and
principles of management. Cold Spring Harb Perspect Med 2:a008870. doi:10.1101/cshperspect.
a008870
Mata IF, Kachergus MJ, Taylor JP, Lincoln S, Aasly J, Lynch T, Hulihan M, Cobb SA, Wu RM,
Lu CS, Lahoz C, Wszolek ZK, Farrer JM (2005) LRRK2 pathogenic substitutions in Parkinson’s
disease. Neurogenetics 6:171–177
Muenter MD, Forno LS, Hornykiewicz O, Kish SJ, Maraganore DM, Casellli RJ, Peuraalinna T,
Dutra A, Nusbaum R, Lincoln S, Crawley A, 10 others (1998) Hereditary form of parkinsonism
dementia. Ann Neurol 43:768–781
Myhre R, Klungland H, Mathew JF, Aasly JO (2008) Genetic association study of synphilin-I in
idiopathic Parkinson’s disease. BMC Med Genet 9:19. doi:10.1186/1471-2330-9-19
Najim Al-Din AS, Wriekat A, Mubaidin A et al (1994) Pallidopyramidal degeneration, supranuclear
upgaze paresis and dementia: Kufor-Rakeb syndrome. Acta Neurol Scand 89:347–352
Newhouse Klintworth HK, Li T, Choi W-S, Faigle R, Xia Z (2007) Activation of c- Jun N-terminal
protein kinase is a common mechanism underlying Paraquat- and Rotenone-induced dopaminer-
gic cell apoptosis. Toxicol Sci 97:149–162
Nichols WC, Pankratz N, Hernandez D, Paisan-Ruiz C, Jain S, Halter CA, Michaels VE, Reed T,
Rudolph A, Shults CW, Singleton A, Foroud T (2005) Genetic screening for a single common
LRRK2 mutation in familial Parkinson’s disease. Lancet 365:410–412
Norris EH, Giasson BI, Lee VM (2004) α-synuclein: normal function and role in neurodegenerative
diseases. Curr Top Dev Biol 60:17–54
OMIM #168600, Online Mendelian Inheritance for Man
Paisan-Ruiz C, Jain S, Evans EW, Gilks WP, Simon J, van der Brug M, Lopez de Munain A, Aparicio
S, Martinez Gil A, Khan N, Johnson J, Martinez JR (2004) Cloning of the gene containing
mutations that cause PARK8-linked Parkinson’s disease. Neuron 44:595–600
Paisan-Ruiz C, Lang AE, Kawarai T, Sato C, Salehi-Rad S, Fisman GK, Al-Khairallah T, St P, Sin-
gleton George-Hyslop A, Rogaeva E (2005) LRRK2 gene in Parkinson disease: mutation analysis
and case control association study. Neurology 65:696–700. doi:10.1212/01.WNL.0000167552.
79769.b3:1526-623x
Paisan-Ruiz C, Guevara R, Federoff M, Hangasi H, Sina F, Elahi E, Schneider SA, Schwingenschuh
P, Bajaj N, Emre M, Singleton AB, Hardy J, Bhatia KP, Brandner S, Lees AJ, Houlden H (2010)
Early-onset L-dopa-responsive parkinsonism with pyramidal signs due to ATP13A2, PLA2G6,
FBXO7 and spatacism mutations. Mov Disord 25(1):791–800
Pankratz N, Nichols WC, Uniacke SK, Halter C, Rudolph A, Shults C, Conneally PM, Foroud T
(2002) The Parkinson Study group: genome screen to identify susceptibility genes for Parkinson
disease in a sample without parkin mutations. Am J Hum Genet 71:124–135
Pankratz N, Nichols WC, Uniacke SK, Halter C, Rudolph A, Shults C, Conneally PM, Foroud
T (2003) The Parkinson Study group: significant linkage of Parkinson disease to chromosome
2q36-37. Am J Hum Genet 72:1053–1057
Pankratz N, Foroud T (2007) Genetics of Parkinson’s Disease. Genet Med 9(12):801–811
Pickrell AM, Youle RJ (2015) The roles of PINK1, Parkin and mitochondrial fidelity in Parkinson’s
disease. Neuron 85:257–273
Polymeropoulos MH, Lavedan C, Leroy E, Ide SE, Dehejia A, Dutra A, Pike B, Root H, Rubenstein
J, Boyer R, Stenroos ES, Chandrasekharappa S, Athanassiadou A, Papapetropoulos T, Johnson
WG, Lazzarini AM, Duvoisin RC, Di Iorio G, Golbe LI, Nussbaum RL (1997) Mutations in the
α-synuclein gene identified in families with Parkinson’s disease. Science 276(5321):2045–2047
Polymeropoulos MH, Higgins JJ, Golbe LI, Johnson WG, Ide SE, Di Iorio G, Sanges Gm Stenroos
ES, Pho LT, Schaffer AA, Lazzarini AM, Nussbaum RL, Duvoisin RC (1996) Mapping of a gene
for Parkinson’s disease to chromosome 4q21-q23. Science 274:1197–1198
Polymeropoulos MH, Lavedan C, Leroy E, Ide SE, Dehejia A, Dutra A, Pike B, Root H, Rubenstein
J, Boyer R, Stenroos ES, Chandrasekharappa S, Athanassiadou A, Papepetropoulos T, Johnson
WG, Lazzarini AM, Duvoisin RC, Di Iorio G, Golbe LI, Nussbaum R (1997) Mutation in the
alpha-synuclein gene identified in families with Parkinson’s disease. Science 276:2045–2047
PUBMED. No.: WO/2003/076658, 2003
Quian L, Flood PM, Hong J-S (2010) Neuroinflammation is a key player in Parkinson’s disease and
a prime target for therapy. J Neural Transm 117(8):971–979
Risch N, de Leon D, Ozelius L, Kramer P, Almasy L, Singer B, Fahn S, Breakefield X, Bresman S
(1995) Genetic analysis of idiopathic dystonia in Ashkenazi Jews and their recent descent from a
small founder population. Nat Genet 9:152–159. doi:10.1038/ng0295-152
Ritchie CM, Thomas PJ (2012) Alpha-synuclein truncation and disease. Health 4(11A):1167–1177
Schmidt ML, Murray J, Lee VM-Y, Hill MD, Trojanowski JQ (1991) Epitope map of neurofilament
protein domains in cortical and peripheral nervous Lewy bodies. Am J Pathol 139:53–65
Scott WK, Stajich JM, Yamaoka LH, Speer MC, Vance JM, Roses AD, Pericak-Vance MA, Deane
Laboratory Parkinson Disease Research Group (1997) Genetic complexity and Parkinson’s dis-
ease. Science 277:387–388
Scott WK, Nance MA, Watts RL, Hubble JP, Koller WC, Lyons K, Pahwa R, Stern MB, Colcher
A, Hiner BC, Jankovic J (2001) Complete genomic screen in Parkinson disease: evidence for
multiple genes. JAMA 286:2239–2244
Scott L, Dawson VL, Dawson T (2017) Trumping neurodegeneration: targeting common pathways
regulated by autosomal recessive Parkinson’s disease genes. Exp Neurol (in press). doi:10.1016/
j.expneurol.2017.04.008
Shibasaki Y, Baillie DAM, St Clair D, Brookes AJ (1995) High-resolution mapping of SNCA
encoding a-synuclein, the non-A-beta component of Alzheimer’s disease amyloid precursor, to
human chromosome 4q21.3-q22 by fluorescence in situ hybridization. Cytogenet Cell Genet
71:54–55
Singleton AB, Farrer M, Johnson J, Singleton A, Hague S, Kachergus J, Hulihan M (2003) Alpha-
synuclein locus triplication causes Parkinson’s disease. Science 302:841. doi:10.1126/science.
1090278
Spillantini MG, Divane A, Goedert M (1995) Assignment of human alpha-synuclein (SNCA) and
beta-synuclein (SNCB) genes to chromosomes 4q21 and 5q35. Genomics 27:379–381
Spillantini MG, Schmidt ML, Lee VM-Y, Trojanowski JQ, Jakes R, Goedert M (1997) α-synuclein
in lewy bodies. Nature 388:839–840
Sredni B, Geffen-Aricha R, Duan W, Albeck M, Shalit F, Lander HM, Kinor N, Sagi O, Albeck
A, Yosef S et al (2007) Multifunctional tellurium molecule protects and restores dopaminergic
neurons in Parkinson’s disease models. FASEB J 21:1870–1883
Stevenin G, Cancel G, Didierjean O, Durr A, Abbas N, Cassa E, Feingold J, Agid Y, Brice A (1995)
Linkage disequilibrium at the Machado-Joseph disease/spinal cerebellar ataxia 3 locus: evidence
for a common founder effect in French and Portuguese-Brazilian families as a second ancestral
Portuguese-Azorean mutation. Am J Hum Genet 57:1247–1250
Tan LCS, Venketasubramanian N, Hong CY, Sahadevan S, Chin JJ, Krishnamoorthy ES, Tan AKY,
Saw SM (2004) Prevalence of Parkinson disease in Singapore: Chinese vs Malays vs Indians.
Neurology 62:1999–2004
Toft M, Mata IF, Kachergus JM, Ross OA, Farrer MJ (2005) LRRK2 mutations and Parkinsonism.
Lancet 365(9466):1229–30
Trenkwalder C, Schwarz J, Gebhard J, Ruland D, Trenkwalder P, Hense HW, Oertel WH (1995)
Starnberg trial on epidemiology of Parkinsonism and hypertension in the elderly: prevalence of
Parkinson’s disease and related disorders assessed by a door-to-door survey of inhabitants older
than 65 years. Arch Neurol 52:1017–1022
Ueffing M, Meitinger T, Gasser T, Farrer MJ et al (2008) Helmholtz Zentrum München. University
Clinic Tübingen, Mayo Clinic, BioVaria
Valente EM, Caputo Abou-Sleiman PM, V, Mugit MMK, Harvey K, et al (2004) Hereditary early-
onset Parkinson’s disease caused by mutations in PINK1. Science 304(1158):1160. doi:10.1126/
science.1096284
Valente EM, Bentivoglio AR, Dixon PH, Ferraris A, Ialongo T, Frontali M, Albanese A, Wood NW
(2001) Localization of a novel locus for autosomal recessive early-onset parkinsonism, PARK6,
on human chromosome 1p35-p36. Am J Hum Genet 68:895–900
Van Duijn CM, Dekker MCJ, Bonifati V, Galjaard RJ, Houwing-Duistermaat JJ, Snijders PJLM,
Testers L, Breedveld GJ, Horstink M, Sandkuijl LA, Van Swieten JC, Oostra BA, Heutink P
(2001) PARK7, a novel locus for autosomal recessive early-onset parkinsonism, on chromosome
1p36. Am J Hum Genet 69:629–634
Wassef R, Haenold R, Hansel A, Brot N, Heinemann SH, Hoshi T (2007) Methionine Sulfoxide
Reductase A and a Dietary Supplement S-Methyl-L- Cysteine Prevent Parkinson’s-Like Symp-
toms. J Neurosci 27:12808–12816
Watabe M, Nakaki T (2007) Mitochondrial Complex I Inhibitor Rotenone-Elicited Dopamine Redis-
tribution from Vesicles to Cytosol in Human Dopaminergic SH- SY5Y Cells. J Pharmacol Exp
Ther 323:499–507. doi:10.1124/jpet.107.128017
Waters CH, Miller CA (1994) Autosomal dominant Lewy body parkinsonism in a four generation
family. Ann Neurol 35:59–64
Wellenbrock CK Hedrich, N Schafer, M Kasten, H Jacob, E Schwinger, J Hagenah, PP Pramstaller,
P Vieregge, C Klein (2003) NR4A2 mutations are rare among European patients with familial
Parkinson’s disease. Ann Neurol 54(3):415-. PMID:12953278, doi:10.1002/ana.10738
West AB, Moore DJ, Biskup S, Bugayenko A, Smith WW, Ross CA, Dawson VL, Dawson TM
(2005) Parkinson’s disease-associated mutations in leucine-rich repeat kinase 2 augment kinase
activity. Proc Natl Acad Sci 102:16842–16847
Wirdefeldt K, Gatz M, Schalling M, Pedersen NL (2004) No evidence for heritability of Parkinson
disease in Swedish twins. Neurology 63:305–311
Wszolek ZK et al (2004) Autosomal dominant Parkinsonism associated with variable synuclein
and tau pathology. Neurology 62:1619–1622
Wszolek ZK, Vieregge P, Uitti RJ, Gasser T, Yasuhara O, McGeer P, Berry K, Calne DB, Vinger-
hoets FJG, Klein C, Pfeiffer RF (1997) German-Canadian family (family A) with parkinsonism,
amyotrophy, and dementia-longitudinal observations. Parkinsonism Relat Disord 3:125–139
Xiong M, Guo SW (1997) Fine-scale genetic mapping based on linkage disequilibrium: theory and
applications. Am J Hum Genet 60:1513–1531
Xiong Y, Dawson TM, Dawson VL (2017) Models of LRRK2-associated Parkinson’s disease. In:
Rideout HJ (ed) Advances in neurobiology, Leucine-Rich Repeat Kinase 2 (LRRK2). Springer,
pp 163–191
Zabetian CP, Samii A, Mosley AD, Roberts JW, Leis BC, Yearout D, Raskind WH, Griffith A (2005)
A clinic-based study of the LRRK2 gene in Parkinson disease yields new mutations. Neurology
65(5):741–744. doi:10.1212/01.WNL.0000172630.22804.73:1526-632x
Zarranz JJ, Alegre J, Gomez-Esteban JC, Lezcano E, Ros R, Ampuero I, Vidal L, Hoenicka J,
Rodriguez O, Atares B, Llorens V, Gomez Tortosa E, del Ser T, Munoz DG, de Yebenes JG
(2004) The new mutation, E46K, of alpha-synuclein causes parkinson and Lewy body dementia.
Ann Neurol 55:164–173
Zimprich A, Biskup S, Leitner P, Lichtner P, Farrer M, Lincoln S, Kachergus J, Hulihan M, Uitti RJ,
Calne DB, Stoessl AJ, Pfeiffer RF, Patenge N, Carballo Carbajal I, Vieregge P, Asmus F, Muller-
Myhsok B, Dickson DW, Meitinger T, Strom TM, Wszolek ZK, Gasser T (2004) Mutations
in LRRK2 cause autosomal-dominant parkinsonism with pleomorphic pathology. Neuron 44:
601–607

Web References

http://en.wikipedia.org/wiki/Secondary_prediction
http://www.ibcp.fr/predict.html
http://www.ncbi.nlm.nih.gov/
http://www.ncbi.nlm.nih.gov/omim/
http://en.wikipedia.org/wiki/
Time Detection for Ovulation in a Cycle in
Presence of Polycystic Ovary Syndrome

Ratan Dasgupta

Abstract We study the body temperature variation over four menstrual cycles of an
individual with Polycystic Ovary Syndrome (PCOS). From the recorded temperature
data, we identify the time of ovulation when the cycles are not regular. We
obtain the growth curve of body temperature by lowess regression. The proliferation rate
$\frac{d}{dt}\log y(t)$ of body temperature $y = y(t)$ at time $t$ attains its lowest value near the
time of ovulation. Temperature residuals from the growth curves are seen to follow
a correlated Gaussian process. Some convergence results for empirical distribution
functions used in this context are also discussed. Detecting the time of ovulation may
help the individual plan to conceive a child.

Keywords Menarche · PCOS · Lowess regression · Spline regression · Proliferation rate

MS subject classification: Primary: 62P10 · Secondary: 62G08

1 Introduction

Patients with polycystic ovary syndrome (PCOS) have an accumulation of multiple
cysts in the ovaries. This, together with high male hormone levels, chronic absence
of ovulation, and other metabolic disorders, makes a regular menstrual cycle
difficult to maintain. Excess facial and body hair, acne, obesity, irregular
menstrual cycles, and infertility are some of the symptoms of PCOS. Pre-pubertal
obesity and early menarche are among the possible factors for developing PCOS at a
later stage. Treatment is based on lifestyle changes such as weight loss and exercise.
Birth control pills may help improve the regularity of periods and reduce excess hair
growth and acne. PCOS is the most common endocrine disorder among women
of reproductive age. Females who reach menarche at an early age expose

R. Dasgupta (B)
Theoretical Statistics and Mathematics Unit, Indian Statistical Institute, 203 B T Road,
Kolkata 700108, India
e-mail: ratandasgupta@gmail.com; rdgupta@isical.ac.in

© Springer International Publishing AG 2017
R. Dasgupta (ed.), Growth Curve Models and Applications, Springer Proceedings
in Mathematics & Statistics 204, DOI 10.1007/978-3-319-63886-7_3

their reproductive organs to the female hormone estrogen at an earlier age. This,
combined with the late marriage prevalent at present, results in a long gap between
menarche and pregnancy, leading to other health problems. A tendency towards
central obesity and other symptoms associated with insulin resistance are seen in
patients. A girl whose menstruation is irregular from the start and remains so
into adulthood develops a hormonal imbalance, which often leads to problems
conceiving a child. Identifying the time of ovulation is of great help in conceiving
a child for individuals with PCOS.
The growth curve of body temperature, obtained by nonparametric lowess regression,
is marked by a distinct peak in each of the four cycles. Applying a result on convergence
of the empirical distribution function (see Dasgupta 2015a), the residual process of
temperature is seen to be Gaussian.

2 Data and the Results

Body temperature variation in a regular menstrual cycle of a PCOS patient reflects the
ovulation time, as in a normal individual, in any specific cycle in which the body
functions normally. Most women without PCOS show signs of
ovulation in the middle of their cycle. During the first half of the cycle (menstruation
to ovulation), body temperature is slightly lower. However, once ovulation has occurred,
there is an upturn in body temperature as progesterone is released, which prepares
the body for pregnancy. The increase in temperature over the second half of the cycle
is what usually signifies ovulation. We observed ovulation to have occurred slightly
towards the first part of the cycle, i.e., around days 12–14, in the presence of PCOS.
Combining data from several such normal cycles with other indications such as
body discharge, it is possible to infer the ovulation status of the patient and a
suitable time for planning to conceive a child.
We observe that the body temperature of the PCOS patient on day 13 of the first
cycle is 37.2 °C; it rises to a peak of 37.4 °C on day 14. The temperature comes down
to 36.8 °C on the next day, and then stabilizes gradually with small oscillations; see
Fig. 1. Ovulation of the individual seems to have occurred on day 14 in the first cycle.
In the second cycle the temperature is 36.4 °C on day 11; it rises to a peak of 37.2 °C
on day 12, then comes down to 36.9 °C on the next day. Ovulation of the individual
seems to have occurred on day 12 in the second cycle. An upward trend in temperature
is seen; see Fig. 2.
In the third cycle the reading is 36.8 °C on day 13; it rises to a peak of 37.7 °C
on day 14. The temperature comes down on the next day, i.e., day 15, to 36.7 °C
and then stabilizes gradually with small oscillations as time progresses in the cycle.
Ovulation of the individual seems to have occurred on day 14 in the third cycle. A
downward trend in temperature is evident; see Fig. 3.
In the fourth cycle, from a reading of 36.5 °C on day 11, the temperature rises to
a peak of 37.7 °C on day 12. The temperature comes down on the next day, i.e., on

Fig. 1 Body temperature over cycle 1. Body temperature of the PCOS patient in the first cycle is
shown in Fig. 1. From a reading of 37.2 °C on day 13, it rises to a peak of 37.4 °C for
cycle 1 on day 14. The temperature comes down on the next day, i.e., day 15, to 36.8 °C and
then stabilizes gradually with small oscillations as time progresses in the cycle. Ovulation of the
individual seems to have occurred on day 14 in the first cycle

Fig. 2 Body temperature over cycle 2. Body temperature of the PCOS patient in the second cycle
is shown in Fig. 2. From a reading of 36.4 °C on day 11, it rises to a peak of 37.2 °C for
cycle 2 on day 12. The temperature comes down on the next day, i.e., day 13, to 36.9 °C and
then stabilizes gradually with small oscillations as time progresses in the cycle. Ovulation of the
individual seems to have occurred on day 12 in the second cycle. An upward trend in temperature
is evident in Fig. 2

Fig. 3 Body temperature over cycle 3. Body temperature of the PCOS patient in the third cycle is
shown in Fig. 3. From a reading of 36.8 °C on day 13, it rises to a peak of 37.7 °C for
cycle 3 on day 14. The temperature then comes down on the next day, i.e., day 15, to 36.7 °C and
stabilizes gradually with small oscillations as time progresses in the cycle. Ovulation of the
individual seems to have occurred on day 14 in the third cycle. A downward trend in temperature
is evident in Fig. 3

Fig. 4 Body temperature over cycle 4. Body temperature of the PCOS patient in the fourth cycle
is shown in Fig. 4. From a reading of 36.5 °C on day 11, it rises to a peak of 37.7 °C for
cycle 4 on day 12. The temperature comes down on the next day, i.e., day 13, to 36.7 °C and
then stabilizes gradually with small oscillations as time progresses in the cycle. Ovulation of the
individual seems to have occurred on day 12 in the fourth cycle

day 13 at 36.7 °C and then stabilizes gradually. Ovulation of the individual seems to
have occurred on day 12 in the fourth cycle; see Fig. 4.
The cycles are not always adjacent to each other, as irregular menstruation is a
common symptom in presence of PCOS.
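
The peak-based reading of the four cycles can be sketched in a few lines of Python. This is only an illustration: it uses the three readings per cycle quoted in the text above (the full temperature series is not reproduced here), and the names `quoted_readings` and `peak_day` are hypothetical.

```python
# Readings quoted in the text around each peak: (day, temperature in deg C).
# Only three days per cycle are quoted; the full series is not available here.
quoted_readings = {
    1: [(13, 37.2), (14, 37.4), (15, 36.8)],
    2: [(11, 36.4), (12, 37.2), (13, 36.9)],
    3: [(13, 36.8), (14, 37.7), (15, 36.7)],
    4: [(11, 36.5), (12, 37.7), (13, 36.7)],
}


def peak_day(readings):
    """Return the day on which the highest temperature was recorded."""
    day, _temperature = max(readings, key=lambda pair: pair[1])
    return day


# Candidate ovulation day for each cycle, taken as the day of peak temperature.
peak_days = {cycle: peak_day(r) for cycle, r in quoted_readings.items()}
print(peak_days)
```

On a full series the same rule would pick the global maximum of the cycle, so in practice one would probably want to combine it with the other indications mentioned in the text.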

Fig. 5 Temperature on 4 cycles. Temperature readings for all the four cycles of the PCOS patient
are plotted on the same frame starting from day 1 to day 28 of cycles. This helps to examine
temperature variation and location of peaks indicating variation of ovulation days for the patient.
Ovulation occurs in the upper part of the cycle and body temperature seems to be symmetrically
distributed around the region of peaks, to a first approximation as seen in Fig. 5

Fig. 6 Temperature over 4 cycles with peak fixed at day zero. To study the variation of
temperature around the time of ovulation, the peaks of all the cycles are time-shifted to fall on a day
labelled zero. The temperature variation seems to be more or less symmetric around the peak, before
and after ovulation, as seen in Fig. 6

Fig. 7 Growth curve (Lowess) of temperature with shifted time point zero at peak. Growth curve
of body temperature by lowess regression with f = 5/107 is shown in Fig. 7. Apart from small oscillations
around the peak, the variation seems similar on both sides to a first approximation. However, on a
finer scale, a slight uplift on the right-hand side of the lowess curve is seen during the
post-ovulation period

Fig. 8 Residual plot of temperature with shifted time point zero at peak. Temperature residuals for
the four cycles, superimposed on each other over the range of a cycle, are shown. The peak temperature
in each cycle is shifted to day zero. Residuals to the left of day zero represent the status
before ovulation, and residuals to the right of zero represent the status after ovulation. A symmetric
pattern of temperature residuals from the lowess curve is apparent in Fig. 8

Fig. 9 Residual plot of temperature in four consecutive cycles. In this picture, we show the variation
of residuals in four consecutive cycles, side by side. The progress of time in days is counted
as if there were no gaps between the cycles. On a micro scale of 0.2 °C of temperature, the fluctuation
looks erratic in Fig. 9

In different cycles of temperature, the peaks can be distinctly identified in the
patient. These, along with other indications like body discharge, can help to
identify the ovulation time in a menstrual cycle in the presence of PCOS.
Ovulations occur in the upper part of a cycle starting from day 1 to day 28. Body
temperature seems to be symmetrically distributed around the region of peaks.
To study the variation of temperature around the time of ovulation, the peaks of all
the cycles are time-shifted to fall on a day labelled zero. The temperature variation
seems to be more or less symmetric around the peak, before and after ovulation.
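
The time shift that places every peak at day zero can be sketched as follows. This is a minimal illustration, assuming the peak is taken as the day of maximum temperature; it reuses only the readings quoted in the text, and `shift_to_peak` is a hypothetical helper name.

```python
# Readings quoted in the text around each peak: (day, temperature in deg C).
cycles = {
    1: [(13, 37.2), (14, 37.4), (15, 36.8)],
    2: [(11, 36.4), (12, 37.2), (13, 36.9)],
    3: [(13, 36.8), (14, 37.7), (15, 36.7)],
    4: [(11, 36.5), (12, 37.7), (13, 36.7)],
}


def shift_to_peak(readings):
    """Re-index days so that the day of maximum temperature becomes day 0."""
    peak = max(readings, key=lambda pair: pair[1])[0]
    return [(day - peak, temperature) for day, temperature in readings]


# Pool the shifted readings of all cycles on a common time axis centred at the peak.
pooled = sorted(pair for r in cycles.values() for pair in shift_to_peak(r))
for shifted_day, temperature in pooled:
    print(shifted_day, temperature)
```

After the shift, negative days correspond to the pre-ovulation phase and positive days to the post-ovulation phase, which is the layout used when the cycles are superimposed.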
We obtain the growth curve of temperature with lowess regression. A symmetric
pattern of temperature residuals from the lowess curve is apparent. On a micro scale of
0.2 °C of temperature, the fluctuation of residuals looks erratic. To check
whether the residuals are normally distributed, we plot the residuals in a normal QQ
plot; linearity seems to hold in regression with a high value of r, indicating a strong
possibility of normality. The correlated residuals of all the cycles together are also plotted in a normal
QQ plot. Such a plot for a correlated process may be justified by
a result of Dasgupta (2015a). With a high value of $R^2 = 0.9616$, normality of the
residuals seems to hold.
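
The lowess step and the residuals it produces can be sketched as below. This is a simplified pure-Python version under stated assumptions: a local linear fit with tricube weights and no robustness iterations, so it will not reproduce a library lowess routine exactly. The name `lowess_fit` and the straight-line test data are illustrative only; the argument `frac` plays the role of the fraction f = 5/107 quoted in Fig. 7.

```python
def tricube(u):
    """Tricube kernel on [0, 1); zero outside."""
    return (1 - u ** 3) ** 3 if u < 1 else 0.0


def lowess_fit(x, y, frac):
    """Local linear fit at each x using the nearest frac * n points.

    Simplified sketch: tricube weights, no robustness iterations.
    """
    n = len(x)
    k = max(2, min(n, round(frac * n)))  # number of neighbours per local fit
    fitted = []
    for x0 in x:
        dists = [abs(xi - x0) for xi in x]
        h = sorted(dists)[k - 1]  # bandwidth: distance to the k-th nearest point
        if h > 0:
            w = [tricube(d / h) for d in dists]
        else:  # all k nearest points share the same x value
            w = [1.0 if d == 0 else 0.0 for d in dists]
        sw = sum(w)
        mx = sum(wi * xi for wi, xi in zip(w, x)) / sw
        my = sum(wi * yi for wi, yi in zip(w, y)) / sw
        sxx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x))
        sxy = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y))
        slope = sxy / sxx if sxx > 1e-12 else 0.0  # fall back to a weighted mean
        fitted.append(my + slope * (x0 - mx))
    return fitted


# Sanity check on synthetic data: a local linear fit reproduces a straight line.
x = [float(i) for i in range(20)]
y = [2.0 * xi + 1.0 for xi in x]
fitted = lowess_fit(x, y, frac=0.5)
residuals = [yi - fi for yi, fi in zip(y, fitted)]
print(max(abs(r) for r in residuals))
```

With a fraction as small as 5/107 each local fit uses only a handful of neighbouring readings, so the fitted curve follows the data closely and the residuals are correspondingly small.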
We quote a result of Dasgupta (2015a) for a correlated process below. This justifies
the normal QQ plot of the four cycles merged, and validates the assertion that the errors are
(correlated) normal. The result is relevant in the present case as we are ignoring the
time gaps with missing observations in successive cycles, which may be irregular
in the presence of PCOS. The correlated residuals merged together for all the cycles
are seen to follow a normal distribution. From the slope and intercept of the fitted
least squares line we may compute the parameters of the limiting distribution under
Theorem 1. These are of help in hypothesis testing problems related to ovulation
time.
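
The normal QQ regression described above can be sketched as follows. Several choices here are assumptions rather than details given in the text: the plotting positions (i - 0.5)/n, the helper name `qq_line`, and the synthetic sample. Under the usual reading of a QQ plot, the slope estimates the standard deviation and the intercept the mean of the fitted normal distribution.

```python
from statistics import NormalDist, mean


def qq_line(residuals):
    """Return (slope, intercept, r_squared) of the normal QQ regression."""
    ordered = sorted(residuals)
    n = len(ordered)
    # Theoretical standard normal quantiles at plotting positions (i - 0.5) / n.
    theo = [NormalDist().inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    mt, mo = mean(theo), mean(ordered)
    stt = sum((t - mt) ** 2 for t in theo)
    soo = sum((o - mo) ** 2 for o in ordered)
    sto = sum((t - mt) * (o - mo) for t, o in zip(theo, ordered))
    slope = sto / stt
    intercept = mo - slope * mt
    return slope, intercept, sto ** 2 / (stt * soo)


# Synthetic check: exact quantiles of a normal law with mean 0.1 and s.d. 0.2.
sample = [0.1 + 0.2 * NormalDist().inv_cdf((i - 0.5) / 50) for i in range(1, 51)]
slope, intercept, r_squared = qq_line(sample)
print(slope, intercept, r_squared)
```

Applied to the merged residuals, the squared correlation returned here is the kind of quantity reported as $R^2$ in the text; the value 0.9616 itself cannot be reproduced without the original data.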

Fig. 10 Normal QQ plot of residuals (cycle 1). In order to check whether the residuals of cycle
1 are normal, we plot these in a normal QQ plot in Fig. 10. Linearity seems to hold in regression,
indicating a strong possibility of normal distribution

Fig. 11 Normal QQ plot of residuals (cycle 2). We plot the residuals of cycle 2 in a normal QQ plot
in Fig. 11. Linearity seems to hold in regression, although a few outliers are seen towards the top

Theorem 1 Consider a Gaussian process $X(t)$, $0 \le t \le T$, with mean $m(t)$ and
covariance kernel $\sigma(t, u) = \sigma(t)\sigma(u)\rho(t, u)$, where $m(t) \to 0$ and $\sigma(t) \to \sigma$ as $t \to \infty$.
Assume $X(t)$ has the weak limit denoted by $X(\infty)$ and the correlation function satisfies
$|\rho(t, u)| < K|t - u|^{-\beta}$, $K > 0$, $\beta > 0$. Consider the empirical distribution function
of the process based on the observations at time points $t_1, t_2, \ldots, t_n$, which are not
necessarily equispaced. Let the time interval $[0, T)$ of recording the observations be
subdivided into $k$ subintervals, and let the length of all except finitely many subintervals
and the number of observations in each subinterval, except finitely many, increase to
$\infty$. Also let the time gap between two consecutive observations within each subinterval
be homogeneous and the number $n^*$ of ‘isolated’ observations which do not
fall in any one of the homogeneous subintervals be negligible compared to $n$, i.e.,
$n^* = o(n)$. Then the empirical distribution function of the recorded observations

Fig. 12 Normal QQ plot of residuals (cycle 3). In this normal QQ plot, the residuals are close to
the straight line, indicating a strong possibility of normal distribution for residuals in Fig. 12

Fig. 13 Normal QQ plot of residuals (cycle 4). In this normal QQ plot, there are a few outliers
toward the top of graph, and other points are near to the straight line, indicating a possibility of
normal distribution for residuals in Fig. 13

from the process is a strongly consistent estimate of the distribution function of the
limiting variable $X(\infty)$, as $n \to \infty$.

The result ensures that even if there are missing observations in a cycle, and
even if the cycles are not always adjacent, as happens for individuals
with PCOS, the empirical c.d.f. of the correlated residuals converges strongly to the
limiting distribution under certain conditions.
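
As an informal numerical illustration of this kind of convergence, and not a verification of Theorem 1, one can simulate a correlated Gaussian sequence and compare its empirical c.d.f. with the limiting normal c.d.f. The sketch below assumes a stationary AR(1) sequence with unit variance, whose correlation decays geometrically and is therefore bounded by $K|t - u|^{-\beta}$ for $|t - u| \ge 1$; the equally spaced time points, the coefficient 0.5, the sample sizes, and the function names are all illustrative choices.

```python
import random
from statistics import NormalDist


def ar1_path(n, phi, rng):
    """Stationary Gaussian AR(1) path with zero mean and unit marginal variance."""
    scale = (1 - phi ** 2) ** 0.5
    x = rng.gauss(0.0, 1.0)
    path = []
    for _ in range(n):
        path.append(x)
        x = phi * x + scale * rng.gauss(0.0, 1.0)
    return path


def sup_distance_to_normal(sample):
    """Largest gap between the empirical c.d.f. and the N(0, 1) c.d.f."""
    ordered = sorted(sample)
    n = len(ordered)
    cdf = NormalDist().cdf
    return max(
        max(abs((i + 1) / n - cdf(v)), abs(i / n - cdf(v)))
        for i, v in enumerate(ordered)
    )


rng = random.Random(0)  # fixed seed so that the run is repeatable
distances = {n: sup_distance_to_normal(ar1_path(n, 0.5, rng)) for n in (200, 20000)}
print(distances)
```

The distance for the longer path is expected to be small, reflecting the consistency of the empirical c.d.f. despite the dependence between successive observations.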
Proliferation rate is a scaled version of the velocity of a time-dependent variable.
We compute the proliferation rate $\frac{d}{dt}\log y(t)$ of body temperature $y = y(t)$ from
observed data in each cycle by a technique described in Dasgupta (2013); see also
Dasgupta (2015b). The rate provides an insight into temperature variation, and it is
not dependent on the scale of measurement (Fig. 16).
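
The scale invariance of the proliferation rate can be checked in one line. The constant $c > 0$ below is a generic change of units introduced here for illustration; it is not notation from the chapter.

```latex
\frac{d}{dt}\log\{c\,y(t)\}
  = \frac{d}{dt}\{\log c + \log y(t)\}
  = \frac{d}{dt}\log y(t)
  = \frac{y'(t)}{y(t)}, \qquad c > 0 .
```

The identity covers multiplicative rescaling of $y(t)$. A shift of origin, as between the Celsius and Kelvin scales, changes the denominator $y(t)$ and is not covered by it.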

Fig. 14 Residuals of four cycles merged. We close the gaps of residual plots shown in Fig. 9. In the
merged picture we examine the features of residuals in a broad perspective. There is a fluctuation
of relatively high magnitude at the start. Residual fluctuations are erratic in Fig. 14

Fig. 15 Normal QQ plot of four cycles merged. The correlated residuals are plotted in a normal QQ
plot in Fig. 15. Such a plot for a correlated process may be justified by a result
of Dasgupta (2015a). With a high value of $R^2 = 0.9616$, normality of the residuals seems to hold

Individual crude estimates of slope are weighted by a normalised exponentially
decaying function with total weight 1. High weight is assigned to points near
the point at which the proliferation rate is computed, and low weights are assigned to distant points.
The median of these slope estimates is divided by y(t) to obtain a crude estimate of the
proliferation rate. Finally, the proliferation rates are smoothed in SPlus with smooth.spline
and spar = 0.0001 to obtain the proliferation rate curve. The curves seem to attain a
minimum around the time of ovulation.
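
A rough sketch of the crude proliferation-rate computation is given below. It is a simplification under stated assumptions and not the procedure of Dasgupta (2013): pairwise slopes are combined by a weighted mean rather than a median or trimmed mean, the final spline smoothing is omitted, and the decay constant 0.01 is taken from the weight exp(−0.01x) mentioned in the captions of Figs. 16 to 19. The function name and the synthetic series are hypothetical.

```python
import math


def proliferation_rates(t, y, decay=0.01):
    """Crude proliferation rate (1 / y) dy/dt at each time point.

    Slopes from point i to every other point j are averaged with weights
    proportional to exp(-decay * |t_j - t_i|), normalised to total weight 1.
    Assumes distinct time points.
    """
    rates = []
    for i, (ti, yi) in enumerate(zip(t, y)):
        pairs = [
            (math.exp(-decay * abs(tj - ti)), (yj - yi) / (tj - ti))
            for j, (tj, yj) in enumerate(zip(t, y))
            if j != i
        ]
        total = sum(w for w, _ in pairs)
        slope = sum(w * s for w, s in pairs) / total  # weighted mean of crude slopes
        rates.append(slope / yi)
    return rates


# Synthetic series with a known proliferation rate of 0.001 per day.
t = list(range(1, 29))
y = [36.5 * math.exp(0.001 * ti) for ti in t]
rates = proliferation_rates(t, y)
print(min(rates), max(rates))
```

On this synthetic exponential series the estimates stay close to the true rate 0.001; on real readings the raw rates are noisy, which is why a smoothing step follows in the text.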
The results from analysed data of the patient in presence of PCOS are further
explained in the figures.

[Plot for Fig. 16: proliferation rate of temperature/day (vertical axis) against time in days, 0 to 25 (horizontal axis)]

Fig. 16 Proliferation rate of body temperature (1): trimmed mean, wt. exp(−0.01x);spline. The
proliferation rate in cycle 1 sharply goes down to attain a minimum at day 14, as seen in Fig. 16.
This is also the time of ovulation in the first cycle

[Plot for Fig. 17: proliferation rate of temperature/day (vertical axis) against time in days, 0 to 25 (horizontal axis)]

Fig. 17 Proliferation rate of body temperature (2): trimmed mean, wt. exp(−0.01x);spline. The
proliferation rate in cycle 2 goes down to attain a minimum at day 13 in Fig. 17. This is near the
time of ovulation viz. day 12 in the second cycle

[Plot for Fig. 18: proliferation rate of temperature/day (vertical axis) against time in days, 0 to 25 (horizontal axis)]

Fig. 18 Proliferation rate of body temperature (3): trimmed mean, wt. exp(−0.01x);spline. The
proliferation rate in cycle 3 sharply goes down to attain a minimum at day 14 in Fig. 18. This is also
the time of ovulation in the third cycle

[Plot for Fig. 19: proliferation rate of temperature/day (vertical axis) against time in days, 0 to 25 (horizontal axis)]

Fig. 19 Proliferation rate of body temperature (4): trimmed mean, wt. exp(−0.01x);spline. The
proliferation rate in cycle 4 sharply goes down to attain a minimum at day 12 in Fig. 19. This is also
the time of ovulation in the fourth cycle

3 Discussions

About one third of females suffer from irregular menstruation, and PCOS is one of the
more complicated health problems among these, associated with hormonal imbalance. See
e.g., Wiweko et al. (2014) for anti-mullerian hormone as a diagnostic and prognostic
tool in PCOS. Phase change and BMI in PCOS patients are investigated in Dasgupta
and Pan (2015). PCOS may lead to other health complications; it is a potential
risk factor for nonalcoholic fatty liver disease, which can progress to nonalcoholic
steatohepatitis and even cirrhosis; see Kelley et al. (2014). Prevalence of the disease
in different groups of women is studied in Agrawal et al. (2004).
Among healthy women of reproductive age, body temperature goes up slightly at the
time of ovulation. Once ovulation occurs, which can be confirmed by a rise in body
temperature, a patient’s cycle behaves very much like that of a woman who does not have
PCOS. The post-ovulatory phases of the cycle are of a consistent number of days,
even for those who have highly irregular cycles. In the present study we investigate
the body temperature variation over four cycles in a PCOS patient, in order to detect
the time of ovulation in a cycle. In the presence of other indications like body discharge
at that time, this detection may help the patient to plan for conceiving a child.
We obtain the growth curve of body temperature in a cycle with nonparametric
lowess regression. With an application of a result of Dasgupta (2015a), we show that
the residual temperatures constitute a correlated Gaussian process. Proliferation rates
of body temperature over four cycles attain a minimum near the time of ovulation,
thus providing another way to detect a proper time for planning conception.

Acknowledgements Temperature data is collected by Ms. Anwesha Pan.

References

Agrawal R, Sharma S, Bekir J, Conway G, Bailey J, Balen AH, Prelevic G (2004) Prevalence of
polycystic ovaries and polycystic ovary syndrome in lesbian women compared with heterosexual
women. Fertil Steril 82:1352–1357
Dasgupta R (2013) Non uniform rates of convergence to normality for two sample U-statistics in non IID case with applications. Advances in growth curve models: topics from the Indian Statistical Institute. In: Proceedings in mathematics & statistics, Chapter 4, vol 46. Springer, Berlin, pp 60–88
Dasgupta R (2015a) Optimal choice of small regular shapes for accidentally damaged tessellation.
In: Growth curve and structural equation modeling: topics from the Indian statistical institute,
Chapter 15. Springer, Berlin, pp 287–299
Dasgupta R (2015b) Rates of convergence in CLT for two sample U-statistics in non iid case and multiphasic growth curve. Growth curve and structural equation modeling. Dasgupta R (ed) Proceedings in mathematics & statistics, Chapter 3, vol 132. Springer, Berlin, pp 35–58
Dasgupta R, Pan A (2015) Growth curve of phase change in presence of polycystic ovary syndrome.
In: Growth curve and structural equation modeling: topics from the Indian statistical institute,
Chapter 8. Springer, Berlin, pp 135–149
Kelley CE, Brown AJ, Diehl AM, Setji TL (2014) Review of nonalcoholic fatty liver disease in
women with polycystic ovary syndrome. World J Gastroenterol 20:14172–14184
Wiweko B, Maidarti M, Priangga MD, Shafira N, Fernando D, Sumapraja K, Natadisastra M,
Hestiantoro A (2014) Anti-mullerian hormone as a diagnostic and prognostic tool for PCOS
patients. J Assist Reprod Genet 31:1311–1316
Growth Model for Micro-Particles Towards
Indistinguishability and Dirichlet Prior

Ratan Dasgupta

Abstract We consider partial distinguishability of micro particles in terms of averaging the cell probabilities by a Dirichlet prior on the k dimensional unit simplex with an added prior perturbation, where k is the number of states. The perturbation of the uniform prior is such that the added term becomes negligible as time progresses, as the particles eventually decay to a lower mass. We compute Shannon's measure of entropy for the ensemble of micro particles over time, which converges to Shannon's entropy of Bose-Einstein statistics for indistinguishable particles. The remainder in the expression of the ensemble entropy of particles in the intermediate state, relative to Shannon's entropy of particles following Bose-Einstein (BE) statistics, is examined to assess the evolution of the modeled system from partial indistinguishability towards indistinguishability. The rate of such convergence is seen to be polynomially decaying in terms of a controlling parameter in the prior perturbation and the number of states k.

Keywords Partial indistinguishability · Dirichlet prior · Shannon's information · Perturbation of prior · Exciton · Cooper pair

MS subject classification: 62P35 · 62G20

1 Introduction

Bose-Einstein statistics and Maxwell-Boltzmann statistics are two probability models related to arrangements of indistinguishable and distinguishable particles, respectively. The Bose-Einstein statistics for micro-particles may be explained in terms of classical probability theory. The Dirichlet prior on the k dimensional unit simplex also has a role in the representation; here k is the number of possible states for the particles. We propose a growth model on a priori probabilities on the partitioned state space such that the parameters in the model regulate the degree of indistinguishability of particles assigned to different states, and the Shannon entropy for the particle assignment in the intermediate state converges quite fast to that for BE statistics. We obtain an estimate of the difference between the two, depending on the parameters introduced in the model. The intermediate state can be applied to explain composite-particle systems, and the rates of convergence are of interest for examining the nearness of intermediate statistics to BE statistics.

R. Dasgupta
Theoretical Statistics and Mathematics Unit, Indian Statistical Institute, 203 B T Road, Kolkata 700108, India
e-mail: ratandasgupta@gmail.com; rdgupta@isical.ac.in

© Springer International Publishing AG 2017
R. Dasgupta (ed.), Growth Curve Models and Applications, Springer Proceedings in Mathematics & Statistics 204, DOI 10.1007/978-3-319-63886-7_4

In the following, we briefly recapitulate the results related to Bose-Einstein (BE)
statistics in terms of Maxwell-Boltzmann (MB) statistics for distinguishable parti-
cles, see also Dasgupta and Roy (1990, 2008).
Let $W = (W_1, W_2, \ldots, W_k)$ be a random vector uniformly distributed in the region $\Delta$, a k dimensional simplex:

$$\Delta = \Big\{(W_1, W_2, \ldots, W_k) : W_i \ge 0,\ \sum_{i=1}^{k} W_i = 1\Big\} \tag{1}$$

Consider the following Dirichlet integral,

$$\int \cdots \int_{\Delta} w_1^{n_1-1} \cdots w_k^{n_k-1}\, dw = \frac{\Gamma(n_1) \cdots \Gamma(n_k)}{\Gamma(n_1 + n_2 + \cdots + n_k)} \tag{2}$$

where $n_1, \ldots, n_k > 0$, $w = (w_1, \ldots, w_k)$; $dw = dw_1 \cdots dw_k$.

For $n_1 = \cdots = n_k = 1$ the r.h.s. of (2) is $1/(k-1)!$. This provides the volume of the region of integration $\Delta$. The joint probability density of the prior W, when the density is a constant, is then given by

$$f(w) = \begin{cases} (k-1)! & \text{if } w_i \ge 0 \text{ and } \sum_{i=1}^{k} w_i = 1; \\ 0 & \text{otherwise.} \end{cases} \tag{3}$$

This is the uniform distribution on the probability simplex $\Delta$ with $w_1 + \cdots + w_k = 1$, where $w_i \ge 0$, ∀ i.
Consider $N = (N_1, N_2, \ldots, N_k)$ to be a random vector with non-negative integer valued coordinates such that, given W = w, the vector N has a multinomial distribution with parameters $n, w_1, w_2, \ldots, w_k$. In other words,

$$P\Big(N_1 = n_1, \ldots, N_k = n_k \,\Big|\, W = w, \sum_{i=1}^{k} n_i = n\Big) = \frac{n!}{n_1!\, n_2! \cdots n_k!}\, w_1^{n_1} w_2^{n_2} \cdots w_k^{n_k} \tag{4}$$

The multinomial distribution on the right hand side of (4), when integrated over the region $\Delta$ with respect to the uniform a priori distribution given in (3), provides the Bose-Einstein statistics. In other words,

$$P\Big(N = \mathbf{n} \,\Big|\, \sum_{i=1}^{k} n_i = n\Big) = \int_{\Delta} P\Big(N = \mathbf{n} \,\Big|\, W = w, \sum_{i=1}^{k} n_i = n\Big) f(w)\, dw = \binom{n+k-1}{k-1}^{-1} \tag{5}$$

see (2), where $\mathbf{n} = (n_1, \ldots, n_k)$; see also Tersoff and Bayer (1983). The Maxwell-Boltzmann statistics have $w_i = 1/k$, i = 1, . . . , k in (4); this refers to the assignment of distinguishable particles. Here the probabilities for a marble (particle) to be assigned to any cell (state) are equal, i.e., 1/k; whereas for Bose-Einstein statistics we assume equal expected probability with the uniform a priori distribution (3) on the region $\Delta$ given by (1), which implies $E W_i = 1/k$, i = 1, . . . , k. The coordinates of the vector $w = (w_1, \ldots, w_k)$ have an exchangeable distribution in (3).
The uniform prior f in (5), when a perturbation is added to it, affects the probability of the particle ensemble, and gives rise to cases where Bose-Einstein statistics is achieved in the limit.
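
The identity (5) can be checked numerically. The following is a minimal Python sketch, not part of the original derivation; the function names are illustrative. It evaluates the integral of the multinomial probability (4) against the uniform prior (3) in closed form through the Dirichlet integral (2), and compares the result with the Bose-Einstein value.

```python
from math import comb, exp, lgamma


def be_probability_from_prior(counts):
    """Integrate the multinomial (4) against the uniform prior (3) using (2).

    `counts` is the occupancy vector (n_1, ..., n_k); the name is illustrative.
    """
    n, k = sum(counts), len(counts)
    log_factorials = sum(lgamma(c + 1) for c in counts)
    # log of the multinomial coefficient n! / (n_1! ... n_k!)
    log_coefficient = lgamma(n + 1) - log_factorials
    # log of the Dirichlet integral (2) with exponents n_i + 1
    log_integral = log_factorials - lgamma(n + k)
    # log of the constant prior density (k - 1)! = Gamma(k)
    log_density = lgamma(k)
    return exp(log_coefficient + log_integral + log_density)


def be_probability(n, k):
    """Right-hand side of (5): reciprocal of the number of occupancy vectors."""
    return 1.0 / comb(n + k - 1, k - 1)


if __name__ == "__main__":
    for counts in [(3, 0, 2), (1, 1, 1, 1), (5, 0)]:
        n, k = sum(counts), len(counts)
        print(counts, be_probability_from_prior(counts), be_probability(n, k))
```

The two printed values agree for every occupancy vector, which reflects the fact that the compound probability does not depend on the individual counts.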

2 Particle Ensemble, Associated Correlation and Information

The uniform a priori distribution (3) is used in the characterization (5). In view of the relation $W_1 + W_2 + \cdots + W_k = 1$, the correlation structure of $W_1$ and $W_2$ is given by

$$\rho(W_1, W_2) = -\frac{1}{k-1} \tag{6}$$

where k is the number of cells. The random prior probabilities $W_i$ vary in a manner such that the cell probabilities add up to one. Increasing the value of one $W_i$ is likely to reduce the values of the other $W_j$'s, inducing a negative correlation structure amongst these random probabilities.
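
The correlation (6) can be illustrated by simulation. The sketch below is a hedged illustration in Python, not the author's computation; it relies on the standard fact that independent unit exponential variables divided by their sum are uniformly distributed on the simplex. The function names, sample size and seed are arbitrary choices.

```python
import random


def sample_uniform_simplex(k, rng):
    """Draw one point from the uniform distribution (3) on the simplex."""
    draws = [rng.expovariate(1.0) for _ in range(k)]
    total = sum(draws)
    return [d / total for d in draws]


def estimate_correlation(k, samples=50000, seed=0):
    """Monte Carlo estimate of the correlation between W_1 and W_2."""
    rng = random.Random(seed)
    first, second = [], []
    for _ in range(samples):
        w = sample_uniform_simplex(k, rng)
        first.append(w[0])
        second.append(w[1])
    mean_1 = sum(first) / samples
    mean_2 = sum(second) / samples
    cov = sum((a - mean_1) * (b - mean_2) for a, b in zip(first, second)) / samples
    var_1 = sum((a - mean_1) ** 2 for a in first) / samples
    var_2 = sum((b - mean_2) ** 2 for b in second) / samples
    return cov / (var_1 * var_2) ** 0.5


if __name__ == "__main__":
    for k in (2, 4, 10):
        print(k, estimate_correlation(k), -1.0 / (k - 1))
```

For k = 2 the estimate is essentially −1, since $W_2 = 1 - W_1$; for larger k the estimate fluctuates around −1/(k − 1) with Monte Carlo error.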
The general definition of Shannon entropy is the following. If a system can be in one of several possible states in S, but we know only the probabilities $p_i$ of its being in each state i in S, then the amount of information about the system is

$$I = \sum_{i \in S} p_i \ln p_i \tag{7}$$

This is the negative of the Shannon entropy.

If $p_i = p$ for each micro-state i, then $I = \ln p$, as $\sum_{i \in S} p_i = 1$. In general the index i may refer to a configuration of the system or a microstate of the system. Information of the configuration in Bose-Einstein statistics is $I_{BE} = -\ln \binom{n+k-1}{k-1}$. In Fermi-Dirac type indistinguishability, one state is occupied by at most one particle.
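
As a small worked illustration of (7), the Python sketch below computes the information for equiprobable configurations. The function names are illustrative, and the Maxwell-Boltzmann value assumes $k^n$ equally likely arrangements of n distinguishable particles in k cells, so that $I_{MB} = -n \ln k$.

```python
from math import comb, log


def shannon_information(probabilities):
    """Information (7): sum of p_i ln p_i, the negative of the Shannon entropy."""
    return sum(p * log(p) for p in probabilities if p > 0)


def information_be(n, k):
    """I_BE for n particles in k states: all occupancy vectors equally likely."""
    return -log(comb(n + k - 1, k - 1))


def information_mb(n, k):
    """I_MB, assuming k**n equally likely arrangements of distinguishable particles."""
    return -n * log(k)


if __name__ == "__main__":
    n, k = 3, 4
    count = comb(n + k - 1, k - 1)
    # For equal probabilities p, definition (7) gives I = ln p.
    print(shannon_information([1.0 / count] * count), information_be(n, k))
    print("I_BE - I_MB =", information_be(n, k) - information_mb(n, k))
```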

3 Smooth Priors, Partial Indistinguishability and Correlation Function

The correlation function $(I_{BE} - I_{MB})$ due to Bose-Einstein type indistinguishability, or $(I_{FD} - I_{MB})$ due to Fermi-Dirac type indistinguishability, is of a discrete nature depending on n and k. A change in the prior distribution results in a smooth change in the correlation function, which may not remain discrete in terms of the parameters introduced. The multinomial distribution (4) of distinguishable particles, when integrated with respect to a uniform a priori distribution over $\Delta$, provides Bose-Einstein statistics. The Bose-Einstein statistics for quantum micro particles of mass m( 0) may be seen in relation to a class of smoothly changing priors on $\Delta$ indexed by a parameter L = L(m) depending on the decreasing mass m = m(t) (↓ 0, as time t ↑ t0 ≤ ∞) of distinguishable objects. Although mass is measured on a discrete scale, the parameter L = L(m) may be considered a continuous variable while measuring mass over time.
One may then consider prior distributions which may converge to the uniform
prior associated to Bose-Einstein statistics in a continuous manner.
It may not be out of place to mention that in nature the opposite phenomenon
of mass accumulation is also prevalent e.g., for cloud formation, water requires a
non-gaseous surface to make the transition from a vapour to a liquid; the process
is called condensation. In the atmosphere tiny solid or liquid particles called cloud
condensation nuclei (CCN) like dust particles, sea salt from ocean wave spray, etc.
help the process. A growth model on particle condensation is relevant therein, see
e.g., Westervelt et al. (2013).
Instead of the uniform distribution over $\Delta$ that leads to Bose-Einstein statistics, consider the following a priori distribution for the probability vector W, where the first two cell probabilities affect the joint probability in a special manner. Let

$$f_L(w) = (k-1)!\Big(1 + \frac{w_1 - w_2}{Lk}\Big) \tag{8}$$

on $\Delta$, where L = L(m) > 1 is a large constant, thus causing a little perturbation of the uniform prior for small m = m(t). This distribution, when compounded with the multinomial distribution (4), gives rise to the following probability distribution of the particle arrangements,

$$P_L = P(N_1 = n_1, \ldots, N_k = n_k) = \frac{n!\,(k-1)!}{(n+k-1)!}\Big\{1 + \frac{n_1 - n_2}{(n+k)Lk}\Big\} \tag{9}$$

vide (2). Then in the limiting case, as k → ∞ or L → ∞, one regains the Bose-Einstein statistics. Parameter L → ∞ implies that the perturbation over the uniform a priori distribution (3) is negligible. The above type of prior was studied in Dasgupta and Roy (2008).
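
The perturbed distribution (9) can be examined by direct enumeration. The following Python sketch is illustrative only (the helper names are not from the source); it uses exact rational arithmetic to confirm that the probabilities in (9) add up to one over all occupancy vectors and that (9) coincides with the Bose-Einstein probability whenever $n_1 = n_2$.

```python
from fractions import Fraction
from math import comb


def compositions(n, k):
    """Yield all k-tuples of non-negative integers that sum to n."""
    if k == 1:
        yield (n,)
        return
    for first in range(n + 1):
        for rest in compositions(n - first, k - 1):
            yield (first,) + rest


def p_be(n, k):
    """Bose-Einstein probability of any single occupancy vector, see (5)."""
    return Fraction(1, comb(n + k - 1, k - 1))


def p_l(counts, L):
    """Probability (9) of the occupancy vector `counts` for an integer L > 1."""
    n, k = sum(counts), len(counts)
    perturbation = Fraction(counts[0] - counts[1], (n + k) * L * k)
    return p_be(n, k) * (1 + perturbation)


if __name__ == "__main__":
    n, k, L = 4, 3, 5
    total = sum(p_l(c, L) for c in compositions(n, k))
    print("sum of P_L over all arrangements:", total)
    print("P_L with n1 = n2:", p_l((2, 2, 0), L), "P_BE:", p_be(n, k))
```

Exact fractions are used so that the normalisation can be checked without rounding error; the positive and negative perturbation terms cancel by the symmetry between the first two cells.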
One may also consider a prior more general than (8),

$$f_L^*(w) = (k-1)!\Big(1 + \frac{w_1^{\alpha} - w_2^{\alpha}}{Lk}\Big) \tag{10}$$

where α ≥ 1 is an integer.¹ Then

$$P_L^* = P(N_1 = n_1, \ldots, N_k = n_k) = \frac{n!\,(k-1)!}{(n+k-1)!}\Big[1 + \frac{\{(n_1+\alpha) \cdots (n_1+1)\} - \{(n_2+\alpha) \cdots (n_2+1)\}}{\{(n+k+\alpha-1) \cdots (n+k)\}(Lk)}\Big] \tag{11}$$

For α = 1, $P_L^*$ reduces to $P_L$ of (9). Further note that, as α → ∞, the above probability converges to $P_{BE}$. The random components $W_1$, $W_2$ are less than 1 with certainty; thus a high value of α in the power diminishes the effect of these in the prior perturbation.
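
A short Python sketch of (11) follows, under the same illustrative conventions (exact fractions, integer L and α; the helper names are assumptions of this example). It checks that α = 1 recovers (9), that the probabilities remain normalised, and that the perturbation term shrinks for a large α in one example.

```python
from fractions import Fraction
from math import comb, prod


def compositions(n, k):
    """Yield all k-tuples of non-negative integers that sum to n."""
    if k == 1:
        yield (n,)
        return
    for first in range(n + 1):
        for rest in compositions(n - first, k - 1):
            yield (first,) + rest


def rising_product(m, alpha):
    """The product (m + 1)(m + 2)...(m + alpha)."""
    return prod(range(m + 1, m + alpha + 1))


def p_l_star(counts, L, alpha):
    """Probability (11) of the occupancy vector `counts`."""
    n, k = sum(counts), len(counts)
    numerator = rising_product(counts[0], alpha) - rising_product(counts[1], alpha)
    # (n + k)(n + k + 1)...(n + k + alpha - 1), multiplied by L k
    denominator = prod(range(n + k, n + k + alpha)) * L * k
    p_be = Fraction(1, comb(n + k - 1, k - 1))
    return p_be * (1 + Fraction(numerator, denominator))


if __name__ == "__main__":
    p_be = Fraction(1, comb(4, 1))
    for alpha in (1, 5, 50):
        gap = abs(p_l_star((3, 0), 2, alpha) - p_be)
        print("alpha =", alpha, "|P_L* - P_BE| =", float(gap))
```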
In (8) and (10), we considered a small perturbation of the uniform prior affecting only the prior probabilities of the first two cells. This perturbation uniformly affects the probability of particle arrangements in all the cells (states) via $n_1$ and $n_2$, the numbers of particles in the first and second states. When $n_1 = n_2$, (11) reduces to BE statistics. Such intermediate situations may arise when occupation of a state by a particle has an influence on the occupation of other states in a special manner; related to it is the screening type effect, where a cluster of nearby cells are noticeably correlated.
Although the elementary particles in nature are either bosons or fermions, one can generate a special mechanism of selection such that the resultant probability distribution is of the above intermediate type. This intermediate situation has many applications in transient cases. The intermediate statistics can be applied to explain composite-particle systems, e.g., the Cooper pair in the theory of superconductivity, the Fermi gas superfluid, the exciton,² etc. Intermediate statistics may then be used as an effective tool for studying these systems; see Shen et al. (2007) for relevant discussions and references.
In the second part of the prior (8) there is an odd function of w; some other odd function g(w) of w may also be considered, so that the total integral is 1. One may interpret (8)/(10) as follows: on certain restricted sets of $\Delta$, the restriction being on the first two coordinates, the particles are indistinguishable; e.g., when $w_1 = w_2$, i.e., the first two cells are of equal random probability, then $f_L(w) = (k-1)!$, i.e., f is uniform on $\Delta$ and (9) becomes the Bose-Einstein statistics with $n_1 = n_2$. One may interpret L = L(m) (↑ ∞ for m ↓ 0) of (8)/(10) as a degree of indistinguishability, since Bose-Einstein statistics of indistinguishable particles is regained when L → ∞.

1 In general α need not be an integer. The expression (11) is of nice form for integer α.
2 An exciton is a bound state of an electron and an electron hole which are attracted to each other by the electrostatic Coulomb force. This occurs when an electron is displaced from its position leaving a positively charged 'hole'. When a molecule absorbs a quantum of energy that corresponds to a transition from one molecular orbital to another molecular orbital, the resulting electronic excited state is also an exciton. This was proposed by Frenkel (1931), when he described the excitation of atoms in a lattice of insulators, and postulated that the excited state would be able to travel in a particle-like fashion through the lattice without the net transfer of charge. Molecular excitons are not stable and typically have short characteristic lifetimes, on the order of nanoseconds, after which the ground electronic state is restored and the molecule undergoes photon or phonon emission.

The Shannon's information $I_L$ of the probability distribution (9) is given in (12). It turns out to be the sum of two components; the first component is Shannon's information for Bose-Einstein statistics and the second component is a remainder with diminishing effect, as L → ∞.

$$\sum_i p_i \ln p_i = -\ln \binom{n+k-1}{k-1} + \binom{n+k-1}{k-1}^{-1} \sum_{A_n} \Big\{1 + \frac{n_1 - n_2}{(n+k)Lk}\Big\} \ln \Big\{1 + \frac{n_1 - n_2}{(n+k)Lk}\Big\} \tag{12}$$

where $A_n = \{(n_1, \ldots, n_k) : \sum_{i=1}^{k} n_i = n,\ n_i \ge 0\ \forall\, i = 1, \ldots, k\}$, L ≥ 1. A similar expression holds for the configuration probability given in (11). Note that (12) is a continuously differentiable function of L, unlike the discrete nature of $I_{BE/FD}$. Expanding the logarithm and using the variance covariance results of occupancy vectors of Bose-Einstein statistics, see e.g., Kunte (1977), one obtains for large L the following expression.

$$I_L = \sum_i p_i \ln p_i = -\ln \binom{n+k-1}{k-1} + \frac{1}{L^2}\, \frac{2n(nk + 2k - 1)}{k^4 (k+1)(n+k)^2} \{1 + o(1)\} \tag{13}$$

where the o(1) term goes to zero, as L → ∞.

From (3), (8), (9) and (13), it is interesting to observe that

$$\|f_L - f\| = O(L^{-1}) = \|P_L - P_{BE}\|, \quad \text{although} \quad |I_L - I_{BE}| = O(L^{-2}).$$
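
The contrast between the two orders in L can be seen numerically. The Python sketch below is an illustration under stated assumptions, not the computation behind (13): it enumerates all occupancy vectors for small n and k, evaluates the second term of (12) exactly as written, and compares the effect of doubling L on the largest probability gap and on the information gap. The helper names and the choice of the supremum norm are assumptions of this example.

```python
from math import comb, log1p


def compositions(n, k):
    """Yield all k-tuples of non-negative integers that sum to n."""
    if k == 1:
        yield (n,)
        return
    for first in range(n + 1):
        for rest in compositions(n - first, k - 1):
            yield (first,) + rest


def perturbation(counts, L):
    """The term (n_1 - n_2) / ((n + k) L k) appearing in (9) and (12)."""
    n, k = sum(counts), len(counts)
    return (counts[0] - counts[1]) / ((n + k) * L * k)


def sup_probability_gap(n, k, L):
    """Largest |P_L - P_BE| over all occupancy vectors."""
    p_be = 1.0 / comb(n + k - 1, k - 1)
    return max(abs(p_be * perturbation(c, L)) for c in compositions(n, k))


def information_gap(n, k, L):
    """Second term on the right-hand side of (12), i.e. I_L - I_BE."""
    p_be = 1.0 / comb(n + k - 1, k - 1)
    terms = (perturbation(c, L) for c in compositions(n, k))
    return p_be * sum((1.0 + x) * log1p(x) for x in terms)


if __name__ == "__main__":
    n, k = 4, 3
    for L in (50, 100, 200):
        print(L, sup_probability_gap(n, k, L), information_gap(n, k, L))
```

Doubling L halves the probability gap but divides the information gap by about four, in line with the orders $O(L^{-1})$ and $O(L^{-2})$ stated above.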

The correlation function for the partial type of indistinguishability (8) is

$$I_L - I_{MB} = n \ln k - \ln \binom{n+k-1}{k-1} + \frac{1}{L^2}\, \frac{2n(nk + 2k - 1)}{k^4 (k+1)(n+k)^2} \{1 + o(1)\} \tag{14}$$

Thus from (12) and (14), the probability of the arrangement of particles may smoothly change to Bose-Einstein type indistinguishability, e.g., when the distinguishable particles are of diminishing mass m → 0 and the a priori probabilities of the particles going to different cells are random variables of similar magnitude, and, e.g., when the first two cells have equal random probability of particle assignment.

4 Growth Rate Related to Shannon's Information in Micro Particles

The rate of convergence of Shannon's entropy under partial distinguishability, in which the occupancy numbers of the first two cells of the state space affect the other cells as specified by the priors (8)/(10), towards that of (5), corresponding to BE statistics, is polynomially decaying in L = L(t) and in k, the number of states; the orders of decay are $O(L^{-2})$ and $O(k^{-6})$, respectively.

Fig. 1 Values of $z = I_L - I_{BE}$ for n = 10. Here the residual drops nicely to zero when the number of particles is n = 10 and the number of states k and the controlling parameter L of (8) increase. For k > 60 and L > 700 the residuals are almost zero

Fig. 2 Values of $z = I_L - I_{BE}$ for n = 50. The drop of residuals is fast in the beginning, but these are non-zero for a larger region of k and L; the white region representing zero is shrinking towards the far end

Although the differences in the priors and in the resultant probabilities of the micro particles are of the order $L^{-1}$, the difference in Shannon's information decays faster, viz., $O(L^{-2})$. Shannon's information of the perturbed system is given in (13). The residual of this from $I_{BE}$ is

$$I_L - I_{BE} = I_L - \Big\{-\ln \binom{n+k-1}{k-1}\Big\} \approx \frac{2n(nk + 2k - 1)}{L^2 k^4 (k+1)(n+k)^2},$$

which remains positive as n → ∞ while the other parameters L, k remain fixed or bounded; the limiting value being $\frac{1}{L^2} \frac{2}{k^4} = O(L^{-2} k^{-4})$, which decreases quite fast for large L and k. The rate of convergence to zero is fast if L and/or k grow without bound for moderate values of n. The rate of convergence to zero of the residual $(I_L - I_{BE})$ also depends on the number of particles n, see Figs. 1, 2, 3, 4 and 5.

Fig. 3 Values of $z = I_L - I_{BE}$ for n = 100. The drop of residuals is faster in the beginning; the white region representing zero has shrunk further

Fig. 4 Values of $z = I_L - I_{BE}$ for n = 150. The drop of residuals is faster in the beginning compared to the previous figure; the white region representing zero has turned almost invisible, indicating an increase in the residual magnitude towards the end

The figures explain the fall in growth of the residuals $(I_L - I_{BE})$ as L and k increase for n = 10, 50, 100, 150, 200. It is observed that the graphs are gradually moving towards stability, which seems to have been achieved for n = 200; there is no apparent change in the figures when n > 200, as confirmed by computation.
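
The dependence of the residual on n, k and L can be explored for larger values without enumerating all arrangements. The Python sketch below is a hedged illustration and not the code used for the figures. It evaluates the remainder term of (12) directly, using the fact that this term depends only on $(N_1, N_2)$, whose joint law under BE statistics follows from counting the ways of placing the remaining particles in the other k − 2 cells (k ≥ 3 is assumed). The function names are illustrative.

```python
from math import comb, log1p


def joint_be_pmf(a, b, n, k):
    """P(N_1 = a, N_2 = b) under BE statistics, for k >= 3 states.

    The remaining n - a - b particles can occupy the other k - 2 cells
    in comb(n - a - b + k - 3, k - 3) ways.
    """
    rest = n - a - b
    return comb(rest + k - 3, k - 3) / comb(n + k - 1, k - 1)


def residual(n, k, L):
    """Remainder term of (12), I_L - I_BE, summed over values of (N_1, N_2)."""
    total = 0.0
    for a in range(n + 1):
        for b in range(n + 1 - a):
            x = (a - b) / ((n + k) * L * k)
            total += joint_be_pmf(a, b, n, k) * (1.0 + x) * log1p(x)
    return total


if __name__ == "__main__":
    for n in (10, 50, 200):
        for k, L in ((5, 10), (20, 100), (60, 700)):
            print("n =", n, "k =", k, "L =", L, "residual =", residual(n, k, L))
```

In this sketch the residual stays positive, decreases as k or L grows, and increases with n for fixed k and L, which is qualitatively the behaviour described for the figures.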

Fig. 5 Values of $z = I_L - I_{BE}$ for n = 200. The drop of residuals is faster in the beginning compared to the previous figure; the white region representing zero in the previous figure is now nil, indicating an increase in residual magnitude towards the far end with an increase in n, the number of particles. The pattern of the residuals remains the same for higher values of n, as verified by computation

5 Choice of Priors

In intermediate states, there are cases where the partial indistinguishability is of a mild type, in the sense that it deviates only a little from the indistinguishability of BE type or FD type. To take such cases into account, the priors and the parameters appearing therein should have the flexibility of nearness to the uniform prior on $\Delta$, giving rise to BE statistics. The priors considered here have such properties of fast convergence to the uniform prior, as is evident from Figs. 1, 2, 3, 4 and 5. Rates of fall to zero are sharp in the residuals for different values of the parameters. Large values of the parameter α in (10) also induce the above mentioned property in the prior we considered.

References

Dasgupta R, Roy S (1990) Quantum statistics, distinguishability and random trajectory. Phys Lett
A 149:63–66
Dasgupta R, Roy S (2008) Multinomial distribution, quantum statistics and Einstein-Podolsky-
Rosen like phenomena. Found Phys 38:384–394. doi:10.1007/s10701-008-9207-3
Frenkel J (1931) On the transformation of light into heat in solids. I. Phys Rev 37:17
Kunte S (1977) The multinomial distribution, Dirichlet integrals and Bose-Einstein statistics.
Sankhya 39A:305–308
Shen Y, Dai WS, Xie M (2007) Intermediate-statistics quantum bracket, coherent state, oscillator,
and representation of angular momentum [su(2)] algebra. Phys Rev A 75:042111
Tersoff J, Bayer D (1983) Quantum statistics for distinguishable particles. Phys Rev Lett 50:2038
Westervelt DM, Pierce JR, Riipinen I, Trivitayanurak W, Hamed A, Kulmala M, Laaksonen A,
Decesari S, Adams PJ (2013) Formation and growth of nucleated particles into cloud condensation
nuclei: model-measurement comparison. Atmos Chem Phys 13:7645–7663
Coconut Plant Growth, Mahalanobis
Distance, and Jeffreys’ Prior

Ratan Dasgupta

Abstract We study coconut plant growth in saline soil of Sunderban, West Bengal.
Two growth environments are compared by Mahalanobis distance. Jeffreys’ non-
informative prior and related matching priors are investigated in relation to cases
including bi-exponential distribution for first principal component in the analyzed
data. Fisher’s information I(θ ) is seen to be a measure of distribution sensitivity in
terms of chi-square distance, extending a result given in Rao (1974).

Keywords Information matrix · Distribution sensitivity · Chi-square distance · Principal component · Bi-exponential distribution

MS subject classification 62P10 · 62G20 · 62F15

1 Introduction and Genesis of the Problem

Growth experiment was initiated in the year 1987 to see the adaptability of various coconut cultivars in the saline land of Sunderban, West Bengal.
Selection of land was made in the District Seed Farm, Manmathanagar, near Gosaba, in Sunderban. The piece of land, given by the Farm on lease to the Indian Statistical Institute for coconut cultivation, was by the side of the river Bidyadhari, flowing near the farm boundary. The experimental plot was in a lowland area subject to water stagnation in the rainy season. Coconut trees were planted in several rows parallel to the river flow on a 3 ft elevated strip of farm land, i.e., at the ground level of land before elevation. The first two rows, most distant from the river, were planted with the dwarf variety, and in the remaining rows the tall variety of coconut was planted. The total number of plantations is 128. After several years, the Farm management started cultivating paddy in the land between strips, which in part damaged the coconut plants' fibrous root structure, which mostly propagates horizontally. Some palm trees were also planted towards the edge of the farm on the river bank. Salinity of river water near the sea is around 33 g/l. Probably a better selection of pit depth would have been 1–1.5 ft below the original ground level of farm land, to prevent damage to the root structure.

R. Dasgupta
Theoretical Statistics and Mathematics Unit, Indian Statistical Institute, 203 B T Road, Kolkata 700108, India
e-mail: ratandasgupta@gmail.com; rdgupta@isical.ac.in

© Springer International Publishing AG 2017
R. Dasgupta (ed.), Growth Curve Models and Applications, Springer Proceedings in Mathematics & Statistics 204, DOI 10.1007/978-3-319-63886-7_5

2 Present State of the Trees Planted

The coastal storm Aila and erosion of land due to river flow damaged the palm trees and coconut trees to a great extent. Further damage occurred recently from construction of a new dam inside the farm land at a distance from the river, built to resist further erosion of farm land. Effects of dehydration, damaged root structure due to paddy cultivation in between strips, severe pest infection, and lack of care and nourishment are prominently observed on the surviving 44 coconut trees.

3 Comparison with Other Coconut Trees of Same Age in Nearby Villages, Adjacent to Farm Land

In an adjacent village, the growth status of 10 coconut plants of comparable age, with a nominal cleaning of the trees once in two years at a cost of Rs. 20–25 per plant, shows that the land in Manmathanagar is suitable for coconut cultivation of the tall variety. The tender dwarf variety is susceptible to harsh environment and lack of care, compared to the tall variety, which is sturdy enough to resist these adverse conditions for growth. Average lifetime of coconut trees is 60–80 years.

4 Different Technique of Comparison and Results

Measurements on length of stem; girth at base, top and middle of stem length; and number of leaves in plants are taken via digital photographs of trees along with a marker of known length placed along the stem on the tree. Mahalanobis distances (MD) of these vector variables with p = 5 coordinates from the origin are computed. The group of plants with a high value of distance indicates superior growth, as increasing order in each coordinate variable is indicative of higher ranking in growth. For a quality index based on Mahalanobis $D^2$ statistics, see Dasgupta (2008).
For the first two rows consisting of 11 plants of dwarf variety, the Mahalanobis distance squared $D^2$ from the origin, with k = p + p(p − 1)/2 = 15 estimated parameters, is 419.3738. Thus the distance squared per degree of freedom is $D^2/p = 83.875$; per estimated parameter this is $D^2/k = 27.958$.
Euclidean distance of the mean of five coordinates from the origin, for the dwarf variety of coconut cultivated in the farm, is $d_1 = (20307.13)^{1/2} = 142.5031$. Euclidean distance for the tall variety cultivated in the adjacent village is $d_2 = (68746.86)^{1/2} = 262.1962$. The ratio of these two distances is $d_2/d_1 = 1.84$.
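
The arithmetic behind the quoted ratios can be retraced in a few lines. The Python sketch below only recomputes the derived quantities from the values reported above; the variable names are illustrative.

```python
from math import sqrt

p = 5                                # number of coordinates
k = p + p * (p - 1) // 2             # means plus covariances, as counted in the text
d_squared = 419.3738                 # reported Mahalanobis distance squared from origin

per_degree_of_freedom = d_squared / p
per_parameter = d_squared / k

d1 = sqrt(20307.13)                  # dwarf variety, farm
d2 = sqrt(68746.86)                  # tall variety, adjacent village
ratio = d2 / d1

print(k, round(per_degree_of_freedom, 3), round(per_parameter, 3))
print(round(d1, 4), round(d2, 4), round(ratio, 2))
```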

5 Further Comparisons

It appears that the growth characteristics of trees in the nearby village are dominant compared to the trees near the river bank. A visual grade on growth in the range (0, 10), in increasing order for higher growth, is next assigned to all the plants as the sixth variable in the analysis. See Tables 1 and 2.
For the first set of data near the riverbank, Fig. 1 represents Mahalanobis distance squared with spline regression when the value of the shape parameter is 1. The distance attains its maximum when the trees are neither too far from nor very near to the flow of the saline river Bidyadhari, see Table 3. This indicates a high growth pattern with moderate salinity of soil. Earlier investigation confirmed that coconut trees may adapt to irrigation with sea water twice a week, e.g., see Carr (2012).
Dispersion matrices of the six variables in the two sets, with 46 and 10 observations, are tested for equality. The pseudo-likelihood ratio test for high dimensions, as proposed in Bai et al. (2009), performs well even in small or moderate dimensions p. The value of the LRT statistic is 6.279, to be compared with a Chi-square variable with p(p + 1)/2 = 21 degrees of freedom. The computed value is insignificant; the p value is 0.9992.
We accept the hypothesis that the dispersion matrices of six variables in the two growth environments are equal. Now let the common value of dispersion be $\Sigma$.
The following is along the lines of (6.3)–(6.5) of Dasgupta (2013).
Let $x_1^{(1)}, \ldots, x_{n_1}^{(1)}$ be a sample of size $n_1$ from a population with mean vector and dispersion matrix $(\mu^{(1)}, \Sigma)$, and $x_1^{(2)}, \ldots, x_{n_2}^{(2)}$ be a sample of size $n_2$ from $(\mu^{(2)}, \Sigma)$. Then an estimate of $\mu^{(1)}$ is $\bar{x}^{(1)} = \sum_{i=1}^{n_1} x_i^{(1)} / n_1$, of $\mu^{(2)}$ is $\bar{x}^{(2)} = \sum_{i=1}^{n_2} x_i^{(2)} / n_2$, and an unbiased estimate S of the common dispersion matrix $\Sigma$ is given by

$$(n_1 + n_2 - 2) S = \sum_{i=1}^{n_1} (x_i^{(1)} - \bar{x}^{(1)})(x_i^{(1)} - \bar{x}^{(1)})' + \sum_{i=1}^{n_2} (x_i^{(2)} - \bar{x}^{(2)})(x_i^{(2)} - \bar{x}^{(2)})'$$

i.e., $n S = (n_1 - 1) S_1 + (n_2 - 1) S_2$, $n = n_1 + n_2 - 2$.

In the present case $n_1 = 46$ and $n_2 = 10$.

An estimate of the population Mahalanobis distance squared $\Delta^2$ is provided by the sample Mahalanobis distance squared,

$$D^2 = (\bar{x}^{(1)} - \bar{x}^{(2)})' S^{-1} (\bar{x}^{(1)} - \bar{x}^{(2)})$$


Table 1 Coconut plant characteristics: near the saline river side
Columns: Plant no.; Plant height (in); Girth at base (in); Girth at middle (in); Girth at top (in); No. of leaves; Growth grade (visual)
1 149.790411 11.9630137 7.41369863 6.739726027 20 5
2 97.84828496 9.08707124 6.81530343 5.192612137 7 3
3 124.6666667 11.5 7.166666667 5.333333333 16 4
4 174.3859416 18.92307692 8.809018568 6.851458886 19 6
5 170.5532787 12.93852459 7.057377049 5.881147541 13 5
6 144.2614286 13.00285714 7.028571429 6.15 14 3
7 154.7597015 16.33880597 8.444776119 6.241791045 18 5
8 112.3839286 11.71428571 7.6875 5.308035714 9 2
9 145.8381963 11.90848806 7.667108753 6.688328912 16 5
10 125.4956522 10.33913043 6.952173913 6.060869565 13 4
11 149.9956395 11.62063953 7.50872093 6.436046512 10 4
12 111.9928977 13.453125 7.6875 6.813920455 15 4
13 120.4087079 9.155898876 7.428370787 6.391853933 17 4
14 135.1684492 9.372994652 9.372994652 5.755347594 14 6
15 149.5364431 8.785714286 7.172011662 5.737609329 23 6
16 158.7268786 18.6632948 9.065028902 7.820809249 16 7
17 162.5724234 28.77994429 10.62116992 7.880222841 17 8
18 220.9580838 18.22904192 9.02245509 6.628742515 14 5
19 251.945 23.37 11.48 8.2 12 2
20 278.2004717 25.81839623 11.60377358 7.83254717 0 0
21 138.2148438 10.08984375 6.7265625 5.60546875 12 4
22 485.6597938 36.77319588 26.62886598 19.65463918 22 5
23 215.1389892 21.9801444 9.768953069 7.326714801 25 7
24 175.7697161 18.04258675 8.342271293 5.432176656 7 3
25 220.251506 21.85843373 9.076807229 4.445783133 0 0
26 185.5115132 18.61184211 10.11513158 7.080592105 16 4
27 164.2831492 16.98895028 9.683701657 6.965469613 19 7
28 162.2961039 22.36363636 10.54285714 7.028571429 16 5
29 189.1613692 24.9608802 11.57823961 7.668704156 11 4
30 100.419403 10.64776119 7.343283582 6.058208955 9 3
31 207.622093 20.73837209 9.534883721 7.627906977 15 5
32 285.3378378 23.5472973 10.80405405 8.587837838 21 7
33 226.2696246 17.00170648 7.136518771 8.815699659 22 6
34 142.3767123 20.05068493 9.604109589 6.739726027 16 5
35 208.7272727 20.5 10.64935065 9.318181818 19 5
36 183.6651584 19.20135747 9.461538462 8.904977376 22 5
37 248.3904594 17.6024735 7.606007067 5.650176678 24 8
38 208.8722222 20.5 9.338888889 6.833333333 20 6
39 280.4174312 22.28669725 9.027522936 7.052752294 22 7
40 215.4883721 20.26162791 8.819767442 7.151162791 14 4
41 285.3234201 19.20446097 8.687732342 5.715613383 21 6
42 234.9615385 20.76282051 8.147435897 5.782051282 20 7
43 179.7184987 14.34450402 7.749329759 6.26541555 19 5
44 286.421371 23.55846774 8.679435484 6.695564516 24 7
45 244.3815789 28.66917293 9.016917293 8.092105263 17 5
46 265.5120482 21.24096386 9.63253012 6.915662651 26 8

Table 2 Coconut plant characteristics: grown in nearby village
Columns: Plant no.; Plant height (in); Girth at base (in); Girth at middle (in); Girth at top (in); No. of leaves; Growth grade (visual)
1 177.2264151 25.00471698 9.504716981 6.580188679 17 7
2 225.5078864 19.36277603 8.214511041 7.432176656 23 6
3 245.6230032 20.79872204 9.50798722 6.932907348 18 5
4 283.539749 23.34728033 10.89539749 9.59832636 11 6
5 327.9051724 22.98275862 8.017241379 6.948275862 23 8
6 262.4511278 18.87969925 9.090225564 5.593984962 18 8
7 312.2604167 23.57291667 11.625 10.01041667 16 7
8 283.5090909 11.27272727 9.018181818 5.072727273 17 8
9 272.3760684 20.66666667 7.418803419 7.153846154 15 7
10 225.9555556 13.22666667 9.093333333 8.266666667 11 7

The estimated dispersion matrix of $(x^{(1)} - x^{(2)})$ is then $\left(\frac{1}{n_1} + \frac{1}{n_2}\right) S = S^{*}$, say.
A scaled version of $D^2$, viz., $D^{*2} = (x^{(1)} - x^{(2)})' S^{*-1} (x^{(1)} - x^{(2)})$, provides a test of equality of mean vectors $H_0 : \mu^{(1)} = \mu^{(2)}$. The statistic has asymptotically a Chi-square distribution with $p$ degrees of freedom, see (6.13)–(6.15) of Dasgupta (2013).
The computed value is $D^{*2} = 45.7830$, to be compared with a Chi-square variable with d.f. 6; this is highly significant, with p-value ≈ 0.
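The tail probability behind "p-value ≈ 0" can be checked with a short sketch. It assumes only the reported statistic 45.7830 and 6 degrees of freedom; the helper name `chi2_sf_even_df` is hypothetical, and it uses the closed-form upper tail that holds for even degrees of freedom rather than any statistics library.

```python
import math


def chi2_sf_even_df(x: float, df: int) -> float:
    """Upper-tail probability P(X > x) for a chi-square variable with even df.

    For df = 2m the tail equals exp(-x/2) * sum_{k=0}^{m-1} (x/2)^k / k!.
    """
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form needs a positive even df")
    half = x / 2.0
    partial_sum = sum(half ** k / math.factorial(k) for k in range(df // 2))
    return math.exp(-half) * partial_sum


d_star_sq = 45.7830  # computed statistic reported in the text
p_value = chi2_sf_even_df(d_star_sq, df=6)
print(f"p-value = {p_value:.3e}")
```

The result is a very small positive number, which is what the text rounds to zero.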


Thus the two distributions may have equal dispersion, but they definitely have distinct means.
The M.D. squared from the origin for the first set of data is $(x^{(1)})' S^{*-1} (x^{(1)}) = 19.5299$; for the second set it is 29.5815. Since the M.D. squared from zero for the second population is higher than that for the first population, the growth status of the plants inside the village is superior.

Table 3 Mahalanobis distance squared for six coconut plant groups near saline river
Columns: Plant group; No. of plants in the group; (MD)^2 from origin
1 11 491.7873
2 7 8953.672
3 7 20.20491
4 7 4847711.0
5 7 23140.03
6 7 322.5809
[Figure 1: plot of Mahalanobis distance squared (vertical axis, 0 to 5e+06) against group no. 1–6 (horizontal axis)]

Fig. 1 (MD)^2 for plant groups with increasing proximity towards saline river. The figure shows that the group of coconut plants not so far and not so near the saline river flow are of highest growth status with peak value of (MD)^2

We have seen that the first set of multivariate data on coconut growth is non-normal according to Mardia’s test. Chi-square for skewness for the first data set is 320.5989 with p-value 9.09 × 10^{-39}; the z value for kurtosis is 10.4458 with p-value ≈ 0. However, for the second set these values are 35.43994 (with p = 0.985) and −1.6356 (with p = 0.102) respectively, indicating the possibility of a multinormal distribution. It may so happen that the harsh environment and lack of care in the first group of plants made the distribution of variables different from that for the second set of coconut plants, grown with care.
Principal components are an uncorrelated set of variables, useful in dimension reduction and in obtaining best linear predictors. Between-groups comparison of principal components is studied in Krzanowski (1979). In the first data set, the first principal component of six variables, $0.996412481x_1 + 0.069772150x_2 + 0.031664418x_3 + 0.021833915x_4 + 0.027955590x_5 + 0.005757078x_6$, with first eigenvalue 4878.7766743, explains a major part (98.91%) of the variation in the data. The other five eigenvalues are 33.6201501, 14.0475550, 5.0017456, 0.7468274, and 0.3882655.
The normal Q-Q plot for the first principal component in Fig. 2 indicates the presence of an outlier.

[Figure 2: normal Q-Q plot; horizontal axis: Theoretical Quantiles; vertical axis: Sample Quantiles]

Fig. 2 Normal Q-Q plot for first principal component. The first principal component with 6 variables from 46 coconut trees near the river bank indicates that the point on the top is an outlier in the normal quantile plot
Deleting an extreme observation at the topmost corner on the right-hand side of the normal quantile plot in Fig. 2, a normal fit also seems plausible, see Fig. 3. However, if one is unwilling to sacrifice the information of a data point that is apparently an outlier in the normal plot, a three-phase Laplace distribution may be seen as a candidate model in the empirical cdf plot of Fig. 4, when the empirical cdf is centred with median 179.005 and scaled with mean deviation 15.5144.
In the second data set, the first principal component $0.999870975x_1 + 0.005453676x_2 + 0.003727087x_3 + 0.007780236x_4 + 0.009869993x_5 + 0.007513320x_6$, with first eigenvalue 1999.2227767, explains 97.96% of the variation. The other five eigenvalues are 22.1676962, 16.6634949, 1.5317403, 0.7902101, and 0.4627600.
In Fig. 5, this principal component is seen to be normally distributed, which seems to be appropriate in view of the plausible multinormality of the second data set.
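The shares of variation quoted for the two first principal components follow directly from the listed eigenvalues. A minimal sketch of that arithmetic, where the function name `first_pc_share` is an illustrative assumption and the eigenvalues are those reported in the text:

```python
def first_pc_share(eigenvalues):
    """Share of total variation carried by the largest eigenvalue."""
    return max(eigenvalues) / sum(eigenvalues)


# Eigenvalues reported in the text for the two data sets.
riverside = [4878.7766743, 33.6201501, 14.0475550,
             5.0017456, 0.7468274, 0.3882655]
village = [1999.2227767, 22.1676962, 16.6634949,
           1.5317403, 0.7902101, 0.4627600]

riverside_pct = 100 * first_pc_share(riverside)
village_pct = 100 * first_pc_share(village)
print(f"riverside: {riverside_pct:.2f}%  village: {village_pct:.2f}%")
```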
[Figure 3: normal Q-Q plot; horizontal axis: Theoretical Quantiles; vertical axis: Sample Quantiles]

Fig. 3 Normal plot for first principal component deleting the outlier. If the topmost point of the previous figure is deleted, normal fit for the first principal component in the first group of plants seems plausible

[Figure 4: horizontal axis: Sample distribution function; vertical axis: Theoretical distribution function]

Fig. 4 Laplace fit for first principal component. If the topmost point, apparently an outlier, is taken into account for data analysis, for maintaining full information in the data, then a three-phase bi-exponential distribution seems to be possible for the first principal component in the first group of coconut plants

[Figure 5: normal Q-Q plot; horizontal axis: Theoretical Quantiles; vertical axis: Sample Quantiles]

Fig. 5 Normal Q-Q plot for first principal component (2nd data set). We checked multinormality in second data set from 10 coconut plants grown inside the nearby village. The normal Q-Q plot in the above figure reconfirms the finding

Apart from comparison of growth scenarios by Mahalanobis distance, we may compare the equality of means for the first principal components in the two growth scenarios by a large sample test. From Tables 1 and 2, the means of the first principal components for riverside and in-village grown coconut trees are 194.519596 and 262.0205648 respectively, with variances given by the first eigenvalue divided by the sample size, i.e., 4878.776674/46 and 1999.222777/10 respectively for riverside and in-village grown coconut trees. By the CLT, a test of equality of the two means is provided by the test statistic $\tau = (262.0205648 - 194.519596)/(1999.222777/10 + 4878.776674/46)^{1/2} = 3.858883$. The calculated value is highly significant, with p-value for the two-sided test p ≈ 0.
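The arithmetic of this large sample test can be reproduced with a small sketch. It assumes only the means, eigenvalues and sample sizes quoted above; the function name `two_sample_z` is hypothetical, and the two-sided p-value is taken from the standard normal distribution via the complementary error function.

```python
import math


def two_sample_z(mean1, var1, n1, mean2, var2, n2):
    """Large-sample statistic for equality of two means and its two-sided p-value."""
    standard_error = math.sqrt(var1 / n1 + var2 / n2)
    z = (mean2 - mean1) / standard_error
    # For a standard normal Z, P(|Z| > |z|) = erfc(|z| / sqrt(2)).
    p_two_sided = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p_two_sided


# Riverside group first, in-village group second (values from the text).
tau, p = two_sample_z(194.519596, 4878.776674, 46,
                      262.0205648, 1999.222777, 10)
print(f"tau = {tau:.6f}, two-sided p = {p:.2e}")
```

The p-value is small but not literally zero, so "p ≈ 0" should be read as "well below the usual significance levels".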
The coconut plant growth scenario inside the village is completely different from that of riverside growth.

6 Distribution Sensitivity in Chi-Square Distance, Principal Component Analysis, Jeffreys’ Prior and Future Work

Distance between parameters has to be examined w.r.t. the change on the distribution
of r.v. induced by shift of parameters. Only one of the distributions corresponding to
the data collected from inside village seems to follow normal distribution, for which
standard theory is applicable. We consider following sensitivity results, in general
set-up that includes Laplace distribution.
The sensitivity of a distribution due to small change in parameter θ is seen to depend on Fisher’s information $I(\theta)$, see e.g., (Rao 1974, p. 332). In terms of Chi-square distance the same phenomenon holds. This distance sometimes results in noninformative priors that maximise the difference between prior and posterior, different from Jeffreys’ prior $|I(\theta)|^{1/2}$. For two densities $p$ and $q$ consider $\chi^2(p, q) = \int \frac{(p-q)^2}{q}\,dx$. Then for a density $f$ with small change $\delta\theta$ in parameter θ

$$\chi^2(f_{\theta+\delta\theta}, f_\theta) = \int \left[f_\theta^2 + 2(\delta\theta) f_\theta f'_\theta + (\delta\theta)^2 \{f''(\theta) f(\theta) + (f'(\theta))^2\}\right]/f_\theta \, dx - 1$$

$$= (\delta\theta)^2 \int \left\{\frac{f'(\theta)}{f(\theta)}\right\}^2 f(\theta)\,dx\,(1 + o(1)) = I(\theta)(\delta\theta)^2 (1 + o(1)) \qquad (1)$$

where primes denote derivatives with respect to θ. Thus $I(\theta)$, the Fisher’s information, is a sensitivity measure of distributions in Chi-square distance.
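Relation (1) can be checked numerically in a simple case. The sketch below assumes the N(θ, 1) location family, for which $I(\theta) = 1$; this family and the integration settings are illustrative choices, not taken from the paper. The Chi-square distance is approximated by a midpoint rule, and the ratio to $I(\theta)(\delta\theta)^2$ should be close to 1 for small $\delta\theta$.

```python
import math


def normal_pdf(x, mu):
    """Density of N(mu, 1) at x."""
    return math.exp(-0.5 * (x - mu) ** 2) / math.sqrt(2.0 * math.pi)


def chi_square_distance(p, q, lo=-12.0, hi=12.0, steps=24_000):
    """Midpoint-rule approximation of the integral of (p - q)^2 / q over [lo, hi]."""
    width = (hi - lo) / steps
    total = 0.0
    for i in range(steps):
        x = lo + (i + 0.5) * width
        total += (p(x) - q(x)) ** 2 / q(x)
    return total * width


theta, delta = 0.0, 0.01
fisher_info = 1.0  # I(theta) for the N(theta, 1) location family (assumed example)
dist = chi_square_distance(lambda x: normal_pdf(x, theta + delta),
                           lambda x: normal_pdf(x, theta))
ratio = dist / (fisher_info * delta ** 2)
print(f"chi-square distance = {dist:.6e}, ratio to I*delta^2 = {ratio:.6f}")
```

For this family the distance has the exact value $e^{(\delta\theta)^2} - 1$, which gives an independent check on the numerical integral.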
We have seen that the growth of coconut trees inside village environment is supe-
rior, by Mahalanobis distance.
The location and scale parameters (μ, σ ) of the normal distribution of first princi-
pal component are of interest for assessing growth pattern over time. Construction of
credible region of growth parameters in Bayesian set-up is possible. From the asymp-
totic normality of the m.l.e., a two sided credible interval for the parameter centered
at the posterior mean and scaled by the posterior standard deviation will have the
same asymptotic frequentist coverage probability as the one centered at the m.l.e. and
scaled by the square root of the reciprocal of Fisher information. Time component
in such situation is worth studying. Characteristics of the remaining plants surviving
on riverbank is planned to be assessed now for comparison with collected data under
analysis, to study the effect of time variation.
For a regular family of distributions, Bernstein and von Mises (see, e.g., Ferguson 1996, p. 141) proved the asymptotic normality of the posterior of a parameter vector, centered around the maximum likelihood estimator or the posterior mode, with variance equal to the inverse of the observed Fisher information matrix evaluated at the maximum likelihood estimator or the posterior mode. We may check the moment matching prior of the posterior distribution of (μ, σ) with the m.l.e. under a two parameter location scale model. These sometimes turn out to be Jeffreys’ prior. Such priors are useful in asymptotic bias or MSE reduction of the m.l.e. through some adjustment; the same adjustment applies directly to the posterior means.
Consider a general symmetric location-scale family of distributions with probability density function

$$f(x|\mu, \sigma) = \frac{1}{\sigma}\, p\!\left(\frac{x-\mu}{\sigma}\right) \qquad (2)$$

where $p(x) = p(-x)$. Denote $h(x) = \log p(x)$; then $h'(x) = -h'(-x)$, $h''(x) = h''(-x)$ and $h'''(x) = -h'''(-x)$. Following Ghosh and Liu (2011), the matching prior π is the solution of the equations

$$\partial \log \pi/\partial\mu = 0, \qquad \partial \log \pi/\partial\sigma = -\frac{c}{2\sigma} \qquad (3)$$

where

$$c = \frac{2\int h''(x) p(x)\,dx + \int x h'''(x) p(x)\,dx}{\int h''(x) p(x)\,dx} + \frac{2 + 6\int x h'(x) p(x)\,dx + 6\int x^2 h''(x) p(x)\,dx + \int x^3 h'''(x) p(x)\,dx}{1 + 2\int x h'(x) p(x)\,dx + \int x^2 h''(x) p(x)\,dx} \qquad (4)$$

For the Laplace distribution, also called the bi-exponential distribution, which is relevant for the distribution of the first principal component shown in Fig. 4, $p(x) = \frac{1}{2} e^{-|x|}$, $x \in (-\infty, \infty)$; $h(x) = \log p(x) = -\log 2 - |x|$, $h'(x) = -\mathrm{sgn}(x)$, $h'' = 0 = h'''$ for $x \neq 0$. Thus $c = 4$. The form of the matching prior is then $\pi(\mu, \sigma) \propto \sigma^{-\frac{1}{2}c} = \sigma^{-2}$, which is Jeffreys’ general rule prior.


Next, we investigate the matching prior adequate for a uniform pdf $f$ that has Lebesgue measure on R in the limit. Consider a distribution that assigns uniform mass to a large interval $[-k, k]$ and puts negligible mass beyond the interval, so that the resultant pdf is decaying smoothly almost everywhere in R. From (2)–(4) we have the form of the matching prior to be $\pi(\mu, \sigma) \propto \sigma^{-\frac{1}{2}c} = \sigma^{-1}$, which is Jeffreys’ independence prior.


In the uniparameter case, see e.g., Ghosh (2011), the moment matching prior turns out to be $\pi(\theta) = \exp\left[\frac{1}{2}\int g_3(t)/I(t)\,dt\right]$, $g_3(\theta) = E\,\partial^3 \log f(x, \theta)/\partial\theta^3$, and one gets Jeffreys’ prior as the solution when regularity conditions are satisfied.
Now suppose $-E(\partial^3 \log f(x, \theta)/\partial\theta^3) = g_3(t) \approx [I(t)]^r I'(t)$, $r > 0$. In that case the inner integral is $[I(\theta)]^r$ and $\pi(\theta) \approx \exp([I(\theta)]^r/2)$.
Such a relation may be possible only when regularity conditions are not satisfied.

References

Bai Z, Jiang D, Yao J, Zheng S (2009) Corrections to LRT on large dimensional covariance matrix by RMT. http://arxiv.org/pdf/0902.0552.pdf
Carr MKV (2012) Advances in irrigation agronomy: plantation crops. Cambridge University Press
Dasgupta R (2013) Optimal-time harvest of elephant foot yam and related theoretical issues. Advances in growth curve models: topics from the Indian Statistical Institute. In: Proceedings in mathematics & statistics, Chapter 6, vol 46. Springer, Berlin, pp 101–130
Dasgupta R (2008) Quality index and Mahalanobis D^2 statistics. Advances in multivariate statistical methods. In: Proceedings of ISI platinum jubilee conference, pp 367–382 (World Scientific)
Ferguson T (1996) A course in large sample theory. Chapman & Hall/CRC Press, Boca Raton, FL
Ghosh M (2011) Objective priors: an introduction for frequentists. Stat Sci 26(2):187–202. doi:10.1214/10-STS338
Ghosh M, Liu R (2011) Moment matching priors. Sankhya A 73:185–201
Krzanowski WJ (1979) Between-groups comparison of principal components. J Am Stat Assoc 74:703–707
Rao CR (1974) Linear statistical inference, 2nd edn. Wiley, New Delhi
Growth Rate of Primary School Children in Kolkata, India

Susmita Bharati, Manoranjan Pal, Madhuparna Srivastava and Premananda Bharati

Abstract It is known that the measurement of growth rate is ideal with time series
panel data. However, it is also possible to measure the growth rate with cross-section
data, provided the data are grouped appropriately. Along with calculating the growth
rate if one wants to find the factors associated with the growth rates then one needs to
group it more prudently. This paper illustrates how we can do so using data of primary
school going children of age group 6–10 years. The data has been taken from students
up to class four, from schools in Kolkata. We have taken Medium of instruction, Type
of school, Sex of children, Household size and Per-capita expenditures as grouping
criteria. Altogether we should have got 2^5, i.e., 32 combinations. But in our case we
have only 24 combinations, because the schools with the remaining 8 combinations
are not found in Kolkata. Thus, though, we have a large number of students as
sampled, we have essentially only 24 observations. We could have taken some more
variables to increase the number of observations, but in that case the number of
students in each combination (group) would have been very small and the mean
values would not have been stable. Growth rates of height, weight, Mid-Upper Arm
Circumference (MUAC) and body fat have been calculated. Childhood period is the
period when there is maximum growth. Our data also shows the same for both boys
and girls and for students when boys and girls are taken together. However, we do not
get much association of the growth rate with medium of instruction, type of school,
household size and per capita expenditure.

Keywords Growth rate · Primary school children · Socio-economic factors · Linear regression · India

Subject classification 62-07

S. Bharati (B)
Sociological Research Unit, Indian Statistical Institute, 203, B.T. Road, Kolkata 700108, India
e-mail: pbharati@gmail.com
M. Pal · M. Srivastava
Economic Research Unit, Indian Statistical Institute, 203, B.T. Road, Kolkata 700108, India
P. Bharati
Biological Anthropology Unit, Indian Statistical Institute, 203, B.T. Road, Kolkata 700108, India

© Springer International Publishing AG 2017
R. Dasgupta (ed.), Growth Curve Models and Applications, Springer Proceedings in Mathematics & Statistics 204, DOI 10.1007/978-3-319-63886-7_6

1 Introduction

India is home to more than one billion people, of which 42% are children. Children are considered the most important natural resource and biggest human investment for development in every community (Kaushik et al. 2012). Growth is a fundamental characteristic of all living organisms. It is associated with socio-economic and socio-cultural environment and is distinctly different in urban and rural areas (Bhandari et al. 1972; ICMR 1972; Indirabai et al. 1979; Kolekar and Sawant 2013; Mukerjee and Kaul 1970; Phadake 1968; Sahoo et al. 2011). The integrated nature of growth is determined by many factors such as heredity, nutrition, psychology and socio-economy. Except heredity, the other three factors, especially the nutrition factor and some of the socio-economic factors, can be controlled to some extent. Thus the study of socio-economic influence on child growth is important because we can then appropriately plan to control the factors towards balanced growth of children, if we are aware of the direction and relative amount of changes that these factors have on growth of children.
Average body weight of children at each age differs across socio-economic levels (DattaBanik 1982). Nutrition level possibly plays an important role. Nutritional environment may be somewhat homogeneous at home, but it differs from one household to the other due to the differences in the socioeconomic conditions. Children from good economic conditions have a higher growth rate than those from middle or lower economic groups of families. One of the reasons is that children of the upper economic stratum have a better scope to get proper nutrition or individual care and attention than other strata of socio-economic classes.
Socio-economic statuses have traditionally been taken to constitute mainly educa-
tion, income and occupation. Education is one of the most important socio-economic
components because it shapes future occupational opportunities and earning poten-
tial. It also provides knowledge and life skills that allow better educated persons to
gain more ready access to information and resources to promote health. Income is the
means for acquiring things towards health care. Higher incomes can provide better
nutrition, housing, schooling and recreation. Thus, it is necessary to investigate the
socio-economic conditions in order to understand the acceleration and retardation of
growth and nutrition.
Growth studies among the (6–10)-year children in Kolkata are few. The studies of its association with some socio-economic variables are even fewer. This is mainly due to the fact that time series data are rarely available. The measurement of growth rate can ideally be done with time series panel data. But, even with cross-section data, it is possible to measure the growth rate provided the data are grouped appropriately. If, along with calculating the growth rate, one wants to find the factors associated with the growth rates, then one needs to group the data more prudently. So the aim of the study is to show how this can be done using data of primary school going children of age group (6–10)-year, to find out the growth rate and its degree of association with some of the socio-economic variables.

2 Methodology

This study is a part of the project which was sanctioned from Planned Budget of
Indian Statistical Institute during 2013–2015. This is a micro level cross-sectional
study using multistage stratified sampling procedure. Our population consists of
6–10 years old children in Kolkata Corporation and its peripheral areas.
Due to obvious difficulty of identifying the children of proper age-group, we have
restricted our study to only school-going children, studying in class I–IV. We have
collected our sampled data from primary sections of Secondary schools.
Due to differentiation of socio-economic background, medium of school has been
considered as one of the criteria for stratification. The reason is that the upper class
or more economically affluent people prefer their children to be admitted to Eng-
lish medium schools rather than to Bengali medium schools (Banerjee 2016; Basu
1989). We have an exhaustive list of 513 Bengali medium and 303 English medium
schools in Kolkata and its periphery. Out of these 513 Bengali medium schools and
303 English medium schools, 12 Bengali medium schools and 8 English medium
schools respectively are selected randomly using Simple Random Sampling without
Replacement (SRSWOR). So a total of 20 schools are selected for our study.
It is known that the students of Government, semi-Government and private schools differ in their socio-economic status (Sunitha and Khadi 2007), because socio-economic status, in particular parental income, wealth, education and occupation, has long been known to be a major determinant of educational enrolment and achievement in both developing and developed countries (Evangelista de Carvalho Filho 2008; Mingat 2007; Shavit and Blossfeld 1993).
So we have grouped the schools into two categories: Government and semi-government schools are grouped as public schools and other schools are private schools. There are 525 Government and semi-government schools and 291 private schools in Kolkata and its periphery. Fortunately, out of the 20 sampled schools, 15 schools belong to the public school category and 5 schools belong to the private school category. Thus, though we have taken English and Bengali medium schools for stratification, there is scope for comparison between public and private schools.
The total number of children in the sample is 4270, of which 2260 are girls and 2010 are boys. The children are grouped, according to the household size (hhsize), into two categories: (i) hhsize ≤ 4 and (ii) hhsize > 4. Similarly, the grouping is also made, according to per capita expenditure (pcexp), as having household per capita income ≤ Rs. 2500 and > Rs. 2500.
Anthropometric measurements such as Height in centimeter (cm), Weight in Kilogram (kg) and Mid Upper Arm Circumference (MUAC) (cm) have been taken from all the students of the selected schools following standard techniques (Weiner and Lourie 1981). Body Mass Index (BMI) has been calculated by using the following formula.
BMI = Weight (kg)/Height (m)^2,
where weight is taken in kg and height is taken in meter (m) in the formula and thus the unit of BMI is kg/m^2. BMI for age is used to classify each child into different nutritional status like underweight, normal, overweight and obese by age and gender. Age and gender specific cut-points as per CDC (Center for Disease Control) are the 85th percentile for overweight and the 95th percentile for obesity and, on the other side, below the 15th percentile for undernourished (WHO 2006). Percentage of body fat has been calculated from BMI using the following formula (Deurenberg et al. 1991):
Child body fat % = (1.51 × BMI) − (0.70 × age) − (3.6 × sex) + 1.4.
In the above formula age is taken in completed years and sex is taken as 1 for males and 0 for females.
To calculate the growth rate of weight (kg), height (cm), MUAC (cm) and body fat (%) by age, different combinations of medium, type of school, sex, hhsize and per capita expenditure of the children have been taken (Vide Table A.1 in Appendix 1). The effective sample size is thus only 24. This obviously is a limitation of this kind of approach. We have seen that the standard deviation (σ) of height is about 6 cm. If we want to estimate mean height within 3 cm (d = allowable error) of the true average, then the formula for minimum sample size is

$$n = [z_{1-\alpha/2}\,\sigma/d]^2 = (1.96)(1.96)(6)(6)/(9) = 15.3664 \approx 15.37$$

We have taken the sample size as 24 which is more than 15.
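The minimum sample size calculation can be written as a short sketch. It uses the values given in the text (z = 1.96, σ = 6 cm, d = 3 cm); the function name `min_sample_size` is an illustrative assumption, and rounding up to the next whole unit is the usual convention rather than something stated in the paper.

```python
import math


def min_sample_size(z, sigma, d):
    """Minimum n so that the mean is estimated within d, using n = (z * sigma / d)^2."""
    return (z * sigma / d) ** 2


n_exact = min_sample_size(z=1.96, sigma=6.0, d=3.0)
n_required = math.ceil(n_exact)  # round up to a whole number of observations
print(f"n = {n_exact:.4f}, rounded up to {n_required}")
```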


The growth rates have been considered for weight, height, MUAC and body fat
by age. For each aspect, the growth rate measures the change and it is different
from the original variable for which the change is measured. If we consider the
height, its change over time is different from its measure at a particular point of time.
Thus, the sample size, calculated based on height may not guarantee the confidence
that one wants to get for measuring the growth rate in height. Another important
aspect that usually gets ignored in sampling is the variation in sizes of the schools
or in other words the variation in student enrolment from one school to the other.
The schools got equally treated and, thus, representativeness of the sample for the
considered population becomes questionable. To overcome the problem of choosing
appropriate sample size, one may use the results of other similar studies where
statistical parameters have been estimated for the growth rates of the considered
variables. In the absence of that, a rough procedure, following the Central Limit Theorem, is to ensure a sample of at least 25 in each cell of Table A.1 in the appendix. This necessitates taking a total sample of at least 3,000 students, and this is satisfied by the study with the sample size 4270. The sample design, however, has failed to ensure at least 25 students (actual size taken is 24) and it is a limitation.
The data, apart from the anthropometric measurements, were verified from school records as well as from the respective parents. The date of birth of each pupil has been taken from the school records and was cross-checked with their respective parents or guardian. Mobile numbers of their parents were collected from the students or from school records. Other queries, if any, were met from guardians of the children either directly or over phone.
Growth rates of height (cm), weight (kg), MUAC (cm) and body fat (%) of children for a given state of medium, type of school, sex, hhsize and pcexp have been calculated by using the same formula. This will be illustrated by taking height, say. Suppose $x_1$ is the average height of children for a given state of medium, type of school, sex, hhsize and pcexp at age 6. Similarly, $x_2$, $x_3$, $x_4$ and $x_5$ are the corresponding average heights at ages 7, 8, 9 and 10, respectively.
Let us now define the variable $y_1$ as

$y_1 = x_2 - x_1$ (i.e., growth/increase of height/weight/MUAC and body fat from 6 to 7)
$\;\;\;\, = \beta_1 x_1$

$\beta_1$ is nothing but the rate of growth of children from age 6 years to age 7 years.
Since we are taking different combinations of medium, type of school, sex, hhsize and pcexp, we write

$y_1 = \beta_1 x_1 + e_1,$

where $e_1$ is the error term associated with the regression equation, assuming that the growth rate remains the same for each combination. Similarly, growth rates from 7 to 8, from 8 to 9 and from 9 to 10 can be found by taking the following regressions,

$y_2 = \beta_2 x_2 + e_2$
$y_3 = \beta_3 x_3 + e_3$
$y_4 = \beta_4 x_4 + e_4$

by defining $y_2$, $y_3$ and $y_4$ in a similar manner. Thus, $\beta_1$, $\beta_2$, $\beta_3$ and $\beta_4$ are the rates of growth of population from ages 6 to 7, 7–8, 8–9 and 9–10 years respectively¹ for a given growth indicator.
The results are shown in the form of tables. p-values of significance by t-test have
been found for differences of means. The regressions were also carried out using
SPSS package. The significances of the regression coefficients using t-test are also
shown in the regression results. The tests of overall significance of the regression on
the basis of R2 has been done using F-test.

¹ If $x_{t+1} = x_t(1+\alpha)$, then α is called the growth rate. It may also be expressed as a percentage, taking (100 × α). It is clear that α is nothing but $(x_{t+1} - x_t)/x_t$.



We can now apply OLS separately to each of the above four regressions. It may be noted that each of these regressions is a regression without an intercept term and the number of observations is 24.
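For a regression without an intercept, the OLS estimate has the closed form β = Σ x·y / Σ x². The sketch below illustrates this for the first regression; the group means are hypothetical numbers invented for the illustration (the study itself has 24 groups and used SPSS), and the function name `slope_through_origin` is an assumption.

```python
def slope_through_origin(x, y):
    """Least-squares slope for y = beta * x + e with no intercept term."""
    if len(x) != len(y) or not x:
        raise ValueError("x and y must be non-empty and of equal length")
    sum_xx = sum(xi * xi for xi in x)
    if sum_xx == 0:
        raise ValueError("x must contain a non-zero value")
    return sum(xi * yi for xi, yi in zip(x, y)) / sum_xx


# Hypothetical group-mean heights (cm) at ages 6 and 7 for four groups.
height_age6 = [115.2, 116.8, 117.5, 118.9]
height_age7 = [119.6, 121.1, 122.0, 123.4]

# y1 is the increase from age 6 to age 7 within each group.
y1 = [later - earlier for earlier, later in zip(height_age6, height_age7)]
beta1 = slope_through_origin(height_age6, y1)
print(f"estimated growth rate from age 6 to 7: {beta1:.4f}")
```

The same function would be applied to the pairs for ages 7 to 8, 8 to 9 and 9 to 10 to obtain the remaining three rates.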

3 Results

Table 16 in the Appendix 1 shows the distribution of children by Age, Sex, Medium of
instructions, Type of School, Household size and Per capita income group. However,
since we are interested in the growth rates, we start with Table 1, that presents the
mean and standard deviation (sd) of height, weight, MUAC and percentage of body fat
of (6–10)-year children. Table 1 also shows the growth rate along with its significance status for all the four types of measurements. It is seen that there has been a positive trend for height and weight all along the age groups. MUAC shows a positive growth rate in the beginning, but slightly negative growth at the last phase, i.e., from age 9 years to 10 years. The growth rates, most of the time, are positive and statistically significant at 1% level (vide footnote of the table) except for the age groups from 9 to 10 years. For body fat, however, the growth pattern is haphazard and does not show any pattern.
Tables 2 and 3 show the same as given in Table 1, but separately for Bengali and
English medium schools in Kolkata. It is seen that for both Bengali and English
medium schools, height has an increasing pattern throughout from age 6 years to
10 years and the growth rates are statistically significant at 1% level. It is also interesting to note that the average height in English medium schools is always higher than that of Bengali medium schools. But in case of weight, MUAC and body fat, the differences between Bengali and English medium schools are not prominent and also not always in the same direction.
Tables 4 and 5 show the growth rates of height, weight, MUAC and body fat of
(6–10)-year children of Public and Private schools in Kolkata. The result shows that
there is a steady ascending tendency in the growth pattern of Height from 6 to 10 years
both in Public and Private school children. In case of weight and MUAC and body
fat, ascending tendency is noticed from 6 to 9 years but at the age 10 years either the
value is going downwards or is remaining in static position. In case of height, in both
types of school, growth rate is always significant at 1% level and the magnitude of
growth is higher among children of private school than the children of public school,
but in case of other variables, growth rate is generally insignificant or very few cases
are significant.
Tables 6 and 7 show the growth rates of height, weight, MUAC and body fat of (6–
10)-year children in Kolkata by sex. The result shows that there is a steady ascending
tendency in the growth pattern of Height, from 6 to 10 years in both sexes. In case of
weight and MUAC ascending tendency is noticed from (6–9) years but at the age 10
years either the value is going downwards or is remaining in static position. In case
of body fat, the pattern is zigzag. In case of height, in both the sexes, growth rate is
Table 1 Growth rates of height, weight, MUAC and percentage of body fat among (6–10)-year children in Kolkata
Columns: Age (year); N; then, for each of Height (cm), Weight (kg), Mid upper arm circumference (MUAC in cm) and Body fat (%): Mean, Sd, Growth rate, Sig.*
6 24 117.01 5.78 – – 23.02 6.11 – – 17.98 2.90 – – 29.45 5.13 – –
7 24 121.36 6.10 0.037 0.000 24.67 5.78 0.078 0.000 18.48 2.86 0.030 0.009 28.40 4.61 −0.035 0.020
8 24 126.97 5.98 0.046 0.000 27.92 6.86 0.136 0.000 19.41 3.08 0.052 0.000 28.61 5.05 0.012 0.335
9 23 131.12 7.41 0.032 0.000 30.53 8.20 0.093 0.002 19.93 3.24 0.024 0.090 28.48 5.19 0.001 0.925
10 21 135.04 4.74 0.025 0.000 30.71 7.88 0.005 0.816 19.65 3.08 −0.022 0.212 26.73 4.71 −0.064 0.002
*: Sig. means p-value of significance by t-test of difference of means
Table 2 Growth rates of height, weight, MUAC and percentage of body fat among (6–10)-year children of Bengali medium schools in Kolkata
Columns: Age (year); N; then, for each of Height (cm), Weight (kg), Mid upper arm circumference (MUAC in cm) and Body fat (%): Mean, Sd, Growth rate, Sig.*
6 24 116.47 5.94 – – 23.13 7.04 – – 17.83 3.13 – – 29.63 5.89 – –
7 24 121.18 5.67 0.040 0.006 24.45 5.33 0.129 0.000 18.37 2.72 0.034 0.207 28.61 4.37 −0.029 0.180
8 24 126.64 5.31 0.045 0.000 27.56 6.74 0.084 0.007 19.28 3.07 0.050 0.036 28.69 5.25 0.005 0.761
9 23 130.25 7.26 0.028 0.001 30.04 8.38 0.015 0.796 20.04 3.42 0.029 0.079 28.81 5.90 0.010 0.653
10 21 132.92 4.37 0.020 0.067 30.36 7.93 0.073 0.147 19.44 3.31 −0.030 0.332 27.67 4.62 −0.033 0.383
*: Sig. means p-value of significance by t-test of difference of means
Table 3 Growth rates of height, weight, MUAC and percentage of body fat among (6–10)-year children of English medium schools in Kolkata
Columns: Age (year); N; then, for each of Height (cm), Weight (kg), Mid upper arm circumference (MUAC in cm) and Body fat (%): Mean, Sd, Growth rate, Sig.*
6 24 117.27 5.73 – – 22.97 5.65 – – 18.05 2.79 – – 29.36 4.75 – –
7 24 121.45 6.31 0.035 0.00 24.78 6.00 0.140 0.000 18.53 2.93 0.028 0.017 28.29 4.73 −0.039 0.065
8 24 127.13 6.32 0.047 0.000 28.10 6.92 0.097 0.021 19.48 3.08 0.053 0.001 28.58 4.95 0.016 0.371
9 23 131.50 7.48 0.034 0.001 30.75 8.13 0.016 0.504 19.88 3.17 0.022 0.261 28.34 4.87 −0.007 0.751
10 21 136.10 4.95 0.027 0.004 30.88 7.84 0.081 0.000 19.76 2.94 −0.018 0.424 26.26 4.76 −0.079 0.001
*: Sig. means p-value of significance by t-test of difference of means
Table 4 Growth rates of height, weight, MUAC and percentage of body fat among (6–10)-year children of public schools in Kolkata
Columns: Age (year); N; then, for each of Height (cm), Weight (kg), Mid upper arm circumference (MUAC in cm) and Body fat (%): Mean, Sd, Growth rate, Sig.*
6 24 116.78 5.89 – – 22.75 6.06 – – 17.65 2.86 – – 29.23 5.08 – –
7 24 121.39 4.29 0.039 0.000 24.35 5.44 0.123 0.000 18.25 2.73 0.037 0.020 27.92 4.29 −0.044 0.036
8 24 126.84 4.90 0.045 0.000 27.37 6.60 0.104 0.019 19.13 3.03 0.048 0.001 27.98 4.90 0.054 0.689
9 23 130.56 5.12 0.029 0.005 30.18 8.31 0.009 0.757 19.71 3.15 0.026 0.197 28.11 5.12 0.010 0.640
10 21 135.46 4.82 0.030 0.004 31.09 8.28 0.080 0.005 19.63 3.05 −0.017 0.490 26.71 4.82 −0.536 0.034
*: Sig. means p-value of significance by t-test of difference of means
Table 5 Growth rates of height, weight, MUAC and percentage of body fat among (6–10)-year children of private schools in Kolkata
Columns: Age (year); N; then Mean, Sd, Growth rate and Sig.* for each of Height (cm), Weight (kg), Mid upper arm circumference (MUAC in cm) and Body fat (%)
6 24 117.46 5.61 – – 23.57 6.22 – – 18.64 3.00 – – 29.88 5.22 – –
7 24 121.30 6.54 0.039 0.000 25.28 6.47 0.157 0.005 18.93 3.12 0.016 0.271 29.35 5.24 −0.053 0.365
8 24 127.21 6.64 0.045 0.000 29.03 7.38 0.074 0.022 19.99 3.18 0.057 0.020 29.89 5.34 0.076 0.369
9 23 132.17 6.50 0.029 0.000 31.20 7.99 0.030 0.434 20.36 3.42 0.018 0.200 29.18 5.31 −0.050 0.210
10 21 134.35 6.90 0.030 0.031 30.08 7.37 −0.074 0.037 19.69 3.12 −0.031 0.255 26.78 4.56 −0.083 0.029
*: Sig. means p-value of significance by t-test of difference of means
Table 6 Growth rates of height, weight, MUAC and percentage of body fat among (6–10)-year male children in Kolkata
Columns: Age (year); N; then Mean, Sd, Growth rate and Sig.* for each of Height (cm), Weight (kg), Mid upper arm circumference (MUAC in cm) and Body fat (%)
6 24 117.93 6.17 – – 24.19 6.44 – – 18.36 3.12 – – 32.15 5.27 – –
7 24 122.46 6.22 0.038 0.000 26.05 6.22 0.081 0.000 18.85 3.10 0.030 0.128 31.49 5.04 −0.018 0.260
8 24 127.22 5.65 0.039 0.000 28.16 6.79 0.144 0.002 19.23 3.07 0.020 0.024 30.79 5.02 −0.021 0.019
9 23 132.86 7.68 0.043 0.000 32.27 9.35 0.043 0.242 20.21 3.61 0.045 0.036 31.37 5.98 0.018 0.408
10 21 135.03 5.14 0.016 0.078 31.10 6.93 −0.086 0.018 20.14 2.57 −0.009 0.753 29.08 4.37 −0.077 0.007
*: Sig. means p-value of significance by t-test of difference of means
Table 7 Growth rates of height, weight, MUAC and percentage of body fat among (6–10)-year female children in Kolkata
Columns: Age (year); N; then Mean, Sd, Growth rate and Sig.* for each of Height (cm), Weight (kg), Mid upper arm circumference (MUAC in cm) and Body fat (%)
6 24 116.08 5.42 – – 21.85 5.78 – – 17.59 2.69 – – 26.74 4.99 – –
7 24 120.25 5.98 0.036 0.000 23.28 5.34 0.191 0.000 18.09 2.62 0.031 0.023 25.31 4.18 −0.053 0.047
8 24 126.71 6.31 0.054 0.000 27.68 6.93 0.047 0.230 19.59 3.09 0.084 0.000 26.43 5.07 0.046 0.039
9 23 129.53 7.45 0.022 0.038 28.94 7.15 0.029 0.357 19.68 2.91 0.005 0.801 25.83 4.46 −0.019 0.431
10 21 126.08 4.42 −0.033 0.002 30.35 8.82 0.070 0.007 19.22 3.59 −0.034 0.118 24.60 5.04 −0.052 0.088
*: Sig. means p-value of significance by t-test of difference of means
Table 8 Growth rate of height, weight, MUAC and percentage of body fat among (6–10)-year children of household sizes (1–4) in Kolkata
Columns: Age (year); N; then Mean, Sd, Growth rate and Sig.* for each of Height (cm), Weight (kg), Mid upper arm circumference (MUAC in cm) and Body fat (%)
6 24 116.46 5.96 – – 22.46 5.59 – – 17.83 2.79 – – 29.09 4.51 – –
7 24 121.51 6.23 0.000 0.000 24.71 5.86 0.141 0.000 18.51 2.86 0.039 0.002 28.01 4.45 −0.039 0.129
8 24 127.36 6.32 0.000 0.000 28.10 7.14 0.107 0.032 19.44 3.04 0.052 0.001 28.31 5.10 0.015 0.306
9 23 130.88 7.31 0.000 0.000 31.05 9.15 0.035 0.273 20.13 3.49 0.039 0.084 28.85 5.95 0.020 0.411
10 21 134.28 4.76 0.000 0.000 30.68 7.58 −0.018 0.000 19.85 2.78 −0.029 0.245 26.91 4.88 −0.078 0.004
*: Sig. means p-value of significance by t-test of difference of means
Table 9 Growth rate of height, weight, MUAC and percentage of body fat among (6–10)-year children of household sizes (above 4) in Kolkata
Columns: Age (year); N; then Mean, Sd, Growth rate and Sig.* for each of Height (cm), Weight (kg), Mid upper arm circumference (MUAC in cm) and Body fat (%)
6 24 117.55 5.68 – – 23.59 6.63 – – 18.13 3.02 – – 29.81 5.75 – –
7 24 121.20 5.97 0.000 0.000 24.62 5.69 0.132 0.001 18.45 2.85 0.021 0.288 28.79 4.76 −0.032 0.082
8 24 126.57 5.64 0.000 0.000 27.74 6.58 0.078 0.031 19.38 3.12 0.052 0.013 28.92 4.99 0.008 0.679
9 23 131.38 7.53 0.000 0.000 29.97 7.16 0.027 0.449 19.65 2.97 0.007 0.672 28.08 4.35 −0.025 0.249
10 21 135.88 4.73 0.000 0.000 30.74 8.34 0.054 0.130 19.43 3.55 −0.015 0.587 26.54 4.42 −0.048 0.133
*: Sig. means p-value of significance by t-test of difference of means
Table 10 Growth rate of height, weight, MUAC and percentage of body fat among (6–10)-year children in Kolkata on the basis of per capita expenditure (Rs. 500–2500)
Columns: Age (year); N; then Mean, Sd, Growth rate and Sig.* for each of Height (cm), Weight (kg), Mid upper arm circumference (MUAC in cm) and Body fat (%)
6 24 116.33 5.48 – – 22.31 5.59 – – 17.77 2.87 – – 29.04 4.79 – –
7 24 120.68 6.30 0.037 0.000 24.49 6.10 0.126 0.001 18.44 2.93 0.038 0.006 28.79 4.80 −0.008 0.325
8 24 126.72 6.14 0.050 0.000 27.43 6.86 0.106 0.046 19.22 3.14 0.044 0.013 28.47 5.17 −0.006 0.760
9 23 130.26 7.27 0.028 0.015 30.26 8.86 0.018 0.569 19.86 3.58 0.035 0.155 29.01 5.87 0.019 0.483
10 21 135.10 5.78 0.028 0.008 30.41 7.95 0.099 0.000 19.71 2.73 −0.025 0.287 27.08 5.13 −0.076 0.011
*: Sig. means p-value of significance by t-test of difference of means
Table 11 Growth rate of height, weight, MUAC and percentage of body fat among (6–10)-year children in Kolkata on the basis of per capita expenditure (above Rs. 2500)
Columns: Age (year); N; then Mean, Sd, Growth rate and Sig.* for each of Height (cm), Weight (kg), Mid upper arm circumference (MUAC in cm) and Body fat (%)
6 24 117.67 6.11 – – 23.74 6.63 – – 18.18 2.94 – – 29.85 5.46 – –
7 24 122.03 5.89 0.037 0.000 24.84 5.46 0.146 0.000 18.52 2.78 0.022 0.245 28.01 4.41 −0.063 0.032
8 24 127.20 5.82 0.042 0.000 28.41 6.86 0.079 0.000 19.61 3.02 0.059 0.002 28.75 4.93 0.031 0.067
9 23 132.06 7.58 0.037 0.000 30.84 7.48 0.006 0.862 20.01 3.43 0.012 0.401 27.91 4.43 −0.024 0.126
10 21 134.98 3.72 0.022 0.026 30.98 7.80 0.057 0.117 19.59 2.88 −0.020 0.479 26.42 4.28 −0.052 0.075
*: Sig. means p-value of significance by t-test of difference of means

Table 12 Linear regression of growth rate of height with different independent variables among
(6–10)-year children in Kolkata
GrRateHt67 GrRateHt78 GrRateHt89 GrRateHt910 GrRateHt710
(Constant) 0.048 0.043 0.031 0.009 0.022
(0.000) (0.000) (0.043) (0.486) (0.000)
DMedium −0.002 −5.987E-005 −0.001 0.024 0.011
(0.823) (0.990) (0.923) (0.102) (0.021)
DType −0.006 0.004 0.009 −0.027 −0.008
(0.567) (0.424) (0.524) (0.059) (0.075)
DSex −0.003 0.015 −0.022 0.017 0.006
(0.766) (0.001) (0.075) (0.128) (0.088)
DHHSize −0.012 −0.004 0.012 0.015 0.003
(0.178) (0.347) (0.339) (0.179) (0.450)
DPCExp −0.00000747 −0.008 0.011 −0.010 −0.005
(0.999) (0.062) (0.362) (0.365) (0.138)
R2 0.133 0.534 0.251 0.410 0.449
(0.735) (0.011) (0.379) (0.124) (0.082)
Values in parentheses denote significance level.
DMedium: Medium of instruction in the school, DType: Type of school, DSex: Sex of the child,
DHHSize: Household size, DPCExp: Per capita expenditure
GrRateHt67: Growth rate of height of children from age 6 year to age 7 year. GrRateHt78,
GrRateHt89, etc. are defined in a similar manner

always significant at the 1% level and boys always have higher values than girls, but
for the other variables, growth rates are insignificant in most cases.
Tables 8 and 9 present the growth rates of height, weight, MUAC and body fat
of (6–10)-year children in Kolkata by household size, separately for children
from small families (household size 1–4, Table 8) and large families
(household size 5 or more, Table 9). The results are very similar to those of
Tables 6 and 7.
Tables 10 and 11 present the growth rates of height, weight, MUAC and percentage
of body fat of (6–10)-year children in Kolkata by per capita expenditure of the
household. Mean height and weight are higher in the higher per capita expenditure
group than in the lower one, except at age 10 years. Height and weight increase
with age, and the growth rate is statistically significant at the 1% level of
significance for height only.
Tables 12, 13, 14 and 15 describe the association of the growth rates of height, weight,
MUAC and body fat of (6–10)-year children with variables such as medium
of instruction in the school, type of school, sex of the child, household size and per
capita expenditure through linear regression. Each column of the tables shows
the result of the regression of a growth rate on the variables concerned, descriptions

Table 13 Linear regression of growth rate of weight with different independent variables among
(6–10)-year children in Kolkata
GrRatewt67 GrRatewt78 GrRatewt89 GrRatewt910 GrRatewt710
(Constant) 0.229 −0.081 0.318 −0.153 0.033
(0.086) (0.354) (0.086) (0.318) (0.578)
DMedium 0.014 −0.006 0.033 −0.018 0.015
(0.767) (0.855) (0.637) (0.782) (0.543)
DType −0.013 0.035 −0.047 −0.030 −0.026
(0.782) (0.297) (0.488) (0.625) (0.286)
DSex −0.016 0.110 −0.094 0.061 0.026
(0.690) (0.001) (0.105) (0.233) (0.196)
DHHSize −0.047 −0.009 −0.026 0.058 0.001
(0.244) (0.735) (0.645) (0.253) (0.950)
DPCExp −0.042 0.020 −0.025 0.058 −0.002
(0.297) (0.452) (0.660) (0.253) (0.910)
R2 0.138 0.519 0.193 0.231 0.173
(0.718) (0.014) (0.558) (0.507) (0.683)
Values in parentheses denote significance level.
DMedium: Medium of instruction in the school, DType: Type of school, DSex: Sex of the child,
DHHSize: Household size, DPCExp: Per capita expenditure
GrRatewt67: Growth rate of weight of children from age 6 year to age 7 year. GrRatewt78,
GrRatewt89, etc. are defined in a similar manner

Table 14 Linear regression of growth rate of MUAC with different independent variables among
(6–10)-year children in Kolkata
GrRateMUAC67 GrRateMUAC78 GrRateMUAC89 GrRateMUAC910 GrRateMUAC710
(Constant) 0.104 −0.075 0.174 −0.013 0.032
(0.175) (0.204) (0.067) (0.915) (0.459)
DMedium 0.005 −0.003 −0.002 0.024 0.010
(0.866) (0.893) (0.950) (0.637) (0.596)
DType −0.023 0.010 −0.007 −0.028 −0.013
(0.428) (0.634) (0.840) (0.566) (0.451)
DSex 0.001 0.064 −0.038 −0.026 −0.002
(0.979) (0.002) (0.190) (0.513) (0.866)
DHHSize −0.018 6.930E-5 −0.031 0.017 −0.003
(0.430) (0.997) (0.281) (0.664) (0.824)
DPCExp −0.016 0.015 −0.022 0.002 −0.004
(0.478) (0.408) (0.440) (0.952) (0.772)
R2 0.096 0.439 0.191 0.063 0.047
(0.855) (0.047) (0.563) (0.957) (0.978)
Values in parentheses denote significance level.
DMedium: Medium of instruction in the school, DType: Type of school, DSex: Sex of the child,
DHHSize: Household size, DPCExp: Per capita expenditure
GrRateMUAC67: Growth rate of MUAC of children from age 6 year to age 7 year. GrRateMUAC89,
GrRateMUAC78, etc. are defined in a similar manner

Table 15 Linear regression of growth rate of percentage of body fat with different independent
variables among (6–10)-year children in Kolkata
GrRateBF67 GrRateBF78 GrRateBF89 GrRateBF910 GrRateBF710
(Constant) 0.081 −0.162 0.223 −0.097 −0.003
(0.373) (0.029) (0.039) (0.408) (0.920)
DMedium −0.030 0.001 0.004 −0.054 −0.009
(0.390) (0.958) (0.918) (0.283) (0.549)
DType 0.041 0.020 −0.035 0.002 −0.008
(0.236) (0.453) (0.355) (0.961) (0.584)
DSex −0.035 0.068 −0.034 0.016 0.011
(0.224) (0.005) (0.289) (0.682) (0.317)
DHHSize 0.007 −0.007 −0.045 0.032 −0.005
(0.790) (0.750) (0.170) (0.411) (0.635)
DPCExp −0.055 0.037 −0.044 0.032 0.004
(0.061) (0.101) (0.180) (0.410) (0.738)
R2 0.287 0.440 0.277 0.172 0.173
(0.256) (0.047) (0.309) (0.684) (0.681)
Values in parentheses denote significance level.
DMedium: Medium of instruction in the school, DType: Type of school, DSex: Sex of the child,
DHHSize: Household size, DPCExp: Per capita expenditure
GrRateBF67: Growth rate of Percentage of Body Fat of children from age 6 year to age 7 year.
GrRateBF78, GrRateBF89, etc. are defined in a similar manner

of which are given as footnotes at the end of each table. From the tables, it appears
that there are significant differences in the growth rates of height, weight, MUAC
and body fat between boys and girls from 7 to 8 years. Per capita expenditure also
has a substantial effect on the growth rates.
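The regressions of Tables 12–15 can be sketched as an ordinary least squares fit of a group-level growth rate on five dummy variables. The sketch below is illustrative only: it assumes the 24 groups listed in Table 16, recodes the 1/2 codes to 0/1 (the choice of reference level is an assumption), and uses a synthetic response rather than the study data. The coefficient vector used to generate that response loosely echoes the GrRateHt78 column of Table 12; the helper names `build_design_matrix` and `fit_ols` are hypothetical.

```python
import itertools
import numpy as np

# The 24 groups of Table 16: Bengali medium occurs only with public schools,
# so there are 3 (medium, type) pairs x 2 sexes x 2 household sizes x 2 expenditure groups.
MEDIUM_TYPE = [(1, 1), (2, 1), (2, 2)]  # codes as in Table 16 (1 or 2)


def build_design_matrix():
    """Return a 24 x 6 matrix: intercept plus five 0/1 dummies."""
    rows = []
    for (medium, school_type), sex, hh, pce in itertools.product(
            MEDIUM_TYPE, (1, 2), (1, 2), (1, 2)):
        # Recode 1/2 to 0/1; which level is the reference is an assumption.
        rows.append([1, medium - 1, school_type - 1, sex - 1, hh - 1, pce - 1])
    return np.array(rows, dtype=float)


def fit_ols(X, y):
    """Ordinary least squares via numpy; returns coefficients and R^2."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    ss_res = float(resid @ resid)
    ss_tot = float(((y - y.mean()) ** 2).sum())
    return beta, 1.0 - ss_res / ss_tot


X = build_design_matrix()

# Synthetic, illustrative response (NOT the study data).
rng = np.random.default_rng(0)
true_beta = np.array([0.043, 0.0, 0.004, 0.015, -0.004, -0.008])
y = X @ true_beta + rng.normal(0.0, 0.005, size=X.shape[0])

beta_hat, r2 = fit_ols(X, y)
names = ["(Constant)", "DMedium", "DType", "DSex", "DHHSize", "DPCExp"]
for name, b in zip(names, beta_hat):
    print(f"{name:12s} {b: .4f}")
print(f"R2 = {r2:.3f}")
```

With only 24 observations and six parameters, individual coefficients are estimated with few degrees of freedom, which is in line with the small-sample limitation the authors note in the Discussions.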

4 Discussions

The study of growth among 6–10-year-old children is important because rapid growth
occurs at this stage, which therefore calls for more attention to and care for their
physical and mental health. The study of growth and development provides information
about the health of a normal child on the one hand and, on the other, about deviations
from normal growth. Physical growth, development, and well-being
are directly related to nutritional status (Manna et al. 2011). Our study gives
the growth pattern of height, weight, MUAC and body fat of 6–10-year-old children
in Kolkata and their association with related socio-economic factors. The above
results show a positive trend in height, weight and MUAC
over ages among (6–10)-year-old school children in Kolkata, though the magnitude
of change differs across age groups. In the case of body fat, the growth trend is
zigzag. Growth rates are computed for different combinations of medium of instruction,
type of school, gender, household size and per capita expenditure in order to find
the factors that influence the growth pattern of the children. In both
Bengali and English medium schools, height always increases from 6 to 10 years
and the rate of growth is statistically significant at the 1% level. It is interesting to note
that average height in English medium schools is always higher than the average
height in Bengali medium schools. This also holds when the comparison is made separately
for public and private schools or for English and Bengali medium schools. For
household size, the magnitudes are not higher in smaller households than in
larger ones. Mean height and weight are higher in the higher per capita
expenditure group than in the lower per capita expenditure group, except at age
10 years. It is also seen
that the growth rates are significantly affected by the sex of the children, because boys
always have higher values than girls. Per capita expenditure positively influences
growth values to some extent.
It is well known that socio-economic conditions have a great effect on human
growth. Among the many socio-economic factors, nutrition is certainly important.
Nutrition depends not only on economic status but also on other factors such as education
of mothers, awareness of hygienic conditions, food taboos etc. Conditions at home,
which consist of provision of regular meals, adequate sleep, exercise etc., are
nevertheless the most vital factors, because they reflect the intelligence and
personality of the parents more than the economic status of the family (Bose 2007).
In fact, Dreze and Sen
(1989) found an inverse association between child growth and economic status.
To summarize, nutrition is one of the most influential factors in the growth of
children. But nutrition depends not only on economic status; it also depends on other
associated factors such as education of mothers, awareness of hygienic conditions, food
taboos etc. Mother’s education is highly positively related to the growth of children
(Bharati et al. 2008). In this paper, however, we could not take all these factors into
consideration due to lack of data. Another limitation of this paper is the small sample
size. Since we have used cross-sectional data to find growth rates, we had to further
group the data by taking different combinations of medium, type of school, sex,
household size and per capita expenditure of the children. The sample size is thus reduced
from 4270 to only 24.

Appendix 1

Table 16 Number of children by Age-Sex and medium of school, type of school, household size
and per capita income groups
Medium Type Sex HHSize PCI Age in yrs. (N)
6 7 8 9 10 6–10
1 1 1 1 1 72 126 93 34 05 330
1 1 1 1 2 21 42 27 20 02 112
1 1 1 2 1 68 88 71 34 06 267
1 1 1 2 2 3 04 02 – – 9
1 1 2 1 1 79 159 170 133 30 571
1 1 2 1 2 32 54 57 42 04 189
1 1 2 2 1 89 161 128 100 25 503
1 1 2 2 2 6 14 12 05 02 39
2 1 1 1 1 27 25 18 03 02 75
2 1 1 1 2 94 90 89 41 06 320
2 1 1 2 1 34 22 20 13 – 89
2 1 1 2 2 43 35 31 09 01 119
2 1 2 1 1 13 09 12 02 – 36
2 1 2 1 2 42 64 67 34 04 211
2 1 2 2 1 15 14 12 12 01 54
2 1 2 2 2 23 30 24 09 01 87
2 2 1 1 1 6 09 06 14 06 41
2 2 1 1 2 59 66 79 77 32 313
2 2 1 2 1 22 29 34 23 13 121
2 2 1 2 2 29 57 53 51 24 214
2 2 2 1 1 10 29 33 31 16 119
2 2 2 1 2 25 69 61 39 10 204
2 2 2 2 1 35 38 24 36 19 152
2 2 2 2 2 20 23 20 29 03 95
Total 867 1257 1143 791 212 4270
[Medium of instruction: Bengali = 1, English = 2; Type of school: Public = 1, Private = 2;
Sex: M = 1, F = 2; Household size: 1–4 = 1, 5 or more = 2; Per capita expenditure (PCI): 500–2500 = 1, 2501 or more = 2]
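As a quick consistency check on Table 16, the counts can be tabulated and summed. The sketch below transcribes the age-wise counts in row order, entering a dash as 0, and recomputes the column totals. It also counts how many groups have at least one child at each age; that this count matches the N column of Tables 3–11 is an observation from the numbers, not something the source states explicitly.

```python
# Counts per age (6, 7, 8, 9, 10) for the 24 groups, in Table 16 row order.
# A dash in the table (no children) is entered as 0.
COUNTS = [
    (72, 126, 93, 34, 5),
    (21, 42, 27, 20, 2),
    (68, 88, 71, 34, 6),
    (3, 4, 2, 0, 0),
    (79, 159, 170, 133, 30),
    (32, 54, 57, 42, 4),
    (89, 161, 128, 100, 25),
    (6, 14, 12, 5, 2),
    (27, 25, 18, 3, 2),
    (94, 90, 89, 41, 6),
    (34, 22, 20, 13, 0),
    (43, 35, 31, 9, 1),
    (13, 9, 12, 2, 0),
    (42, 64, 67, 34, 4),
    (15, 14, 12, 12, 1),
    (23, 30, 24, 9, 1),
    (6, 9, 6, 14, 6),
    (59, 66, 79, 77, 32),
    (22, 29, 34, 23, 13),
    (29, 57, 53, 51, 24),
    (10, 29, 33, 31, 16),
    (25, 69, 61, 39, 10),
    (35, 38, 24, 36, 19),
    (20, 23, 20, 29, 3),
]
AGES = (6, 7, 8, 9, 10)

# Column totals: number of children at each age.
children_per_age = [sum(row[i] for row in COUNTS) for i in range(len(AGES))]

# Number of groups with at least one child at each age.
groups_per_age = [sum(1 for row in COUNTS if row[i] > 0) for i in range(len(AGES))]

total_children = sum(children_per_age)

print(dict(zip(AGES, children_per_age)))  # {6: 867, 7: 1257, 8: 1143, 9: 791, 10: 212}
print(dict(zip(AGES, groups_per_age)))    # {6: 24, 7: 24, 8: 24, 9: 23, 10: 21}
print(total_children)                     # 4270
```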

References

Banerjee S (2016) A study of current status quo of English as a second language in India study done
on West Bengal schools. Int J Sci Res Publ 6:478–482
Basu A (1989) Education in modern India. In: Sharma RK (ed) Problems and solutions of teaching
English. Commonwealth Publishers, New Delhi, pp 3–4
Bhandari B, Jain AM, Padma Karna, Mathur A, Sharma VK (1972) Nutritional anthropometry of
rural schoolchildren of Udaipur district. Indian J Paediatr 39:1–11
Bharati S, Pal M, Bharati P (2008) Determinants of growth and nutritional status of pre-school
children in India. J Biosoc Sci 40:801–814
Bose K (2007) Concept of human physical growth and development. http://nsdl.niscair.res.in/jspui/
handle/123456789/243
Datta Banik ND (1982) Semi-longitudinal growth evaluation of children from birth to 14 years in
different socioeconomic groups. Indian Paediatr 19:353–359
Deurenberg P, Weststrate JA, Seidell JC (1991) Body mass index as a measure of body fatness: age
and sex specific prediction formula. Br J Nutr 65:105–114
Dreze J, Sen A (1989) Hunger and public action. Clarendon Press, Oxford
Evangelista de Carvalho Filho IE (2008) Household income as a determinant of child labor and
school enrollment in Brazil. IMF Working Paper, WP/08/241
ICMR (1972) Growth and physical development of Indian infant and children, Technical Report
Series No. 18. ICMR, New Delhi
Indirabai K, Raghavaprasad SV, Ravi Kumar, Reddy CO (1979) Nutritional and anthropometric
profile of primary school children in rural Andhra Pradesh. Indian Paediatr 16:1085
Kaushik A, Raj R, Mishra CP, Singh SP (2012) Nutritional status of rural primary school children and
their socio-demographic correlates: a cross-sectional study from Varanasi. Indian J Community
Health 24(4):310–318
Kolekar SM, Sawant SU (2013) A comparative study of physical growth in urban and rural school
children from 5–13 years of age. Int J Recent Trends Sci Technol 6:89–93 ISSN 2277-2812
E-ISSN 2249-8109
Mingat A (2007) Social disparities in education in Sub-Saharan African countries. In: Teese R,
Lamb S, Duru-Bellat M (eds) International studies in educational inequality, theory and policy,
vol 1. Springer, Dordrecht
Mukerjee B, Kaul KK (1970) Anthropometric observations- urban school children. Indian J Med
Res 58:1257
Phadake MV (1968) Growth norms in Indian children. Indian J Med Res 56:851
Sahoo K, Hunshal S, Itagi S (2011) Physical growth of school girls from Dharwad and Khurda
districts. Karnataka J Agric Sci 24:221–226
Shavit Y, Blossfeld HP (1993) Persistent inequality: changing educational attainment in thirteen
countries. Westview Press, Boulder
Sunitha NH, Khadi PB (2007) Academic learning environment of students from English and Kannada
medium high schools. Karnataka J Agric Sci 20:827–830
Weiner JS, Lourie JA (1981) Practical human biology. Academic Press, New York
WHO (2006) WHO child growth standards: length, height for age, weight for age, weight for length,
weight for height and body mass index for age: methods and development. WHO, Geneva
Growth Curve Estimation of a Bulb Crop
from Incomplete Data

Ratan Dasgupta

Abstract We reconstruct original growth curve from a partial data set arising out of
substantially missing observations. Under the assumption that growth of an auxiliary
variable is of similar rate, we calibrate the growth curve based on available yield
data to infer about the growth based on full data. We further explore the relationship
between the auxiliary variable with the main variable yield of the crop by the method
of least squares and lowess regression under log-log transformation in the partial
data. The reconstructed missing data from the model is then used to obtain growth
curve from full data, partial plus reconstructed. The adopted procedure is validated
and found to be satisfactory for estimating original growth curve.

Keywords Bulb crop · Auxiliary variable · Log-log transformation · Lowess regression

MS subject classification: Primary: 62P10

1 Introduction and Genesis of the Problem

The bulb crop onion takes about 3–4 months from the sprouting stage to mature for
harvest. With all its plant parts edible, it is a familiar vegetable with many medicinal
properties. The bulbs are widely used as a seasoning or a vegetable in various dishes.
The crop is grown with a shallow and unbranched root system. Temperature has a role
in the rate of sprout growth for onion, see Brewester (1987). In Giridih, Jharkhand
both summer and winter are extreme.
Consider the problem of estimating growth curve of a bulb crop from incomplete
observations on final yield in an experimental plot in Giridih. Such incomplete data
may arise when the crop is damaged, missing or stolen in part at a mature stage from
the field.

R. Dasgupta (B)
Theoretical Statistics and Mathematics Unit, Indian Statistical Institute, 203 B T Road,
Kolkata 700108, India
e-mail: ratandasgupta@gmail.com; rdgupta@isical.ac.in

© Springer International Publishing AG 2017
R. Dasgupta (ed.), Growth Curve Models and Applications, Springer Proceedings
in Mathematics & Statistics 204, DOI 10.1007/978-3-319-63886-7_7

To compensate partially for the lack of information in such situations, available
auxiliary information, like growth of related variables, may be used under the
assumption that the auxiliary variable has growth proportionate to crop yield. Nonparametric
regression on associated variables may shed light on the main variable of crop yield,
and may be used to check the tenability of the assumptions made.
The present problem relates to partially missing growth data. One hundred onion
seedlings were transplanted in an experimental plot on 29 December 2014 in the farmland
of Indian Statistical Institute, Giridih. Auxiliary variables like plant height, number
of leaves, and thus average number of leaves in a plant over lifetime, are available. A
majority of the onions (56) were stolen or went missing from the experimental field at
the mature stage. The problem is to estimate the full growth curve from the incomplete yield
data on 44 plants.1
One may estimate the growth curve of yield versus plant lifetime based on incom-
plete data. However, this may not be representative of the total group, if data on
the better part of the crop is missing. Knowledge about the growth curve may be
supplemented by additional information on the pattern of associated variables.

2 Auxiliary Variable and an Assumption on Growth

We estimate the mean plant height from leaf lengths at a particular time point. Assum-
ing a linear or proportionate relationship of growth we associate mean plant height
with onion yield on the group of 44 plants at different lifetime. This is plausible as
lush vegetative growth is an indication of high tuber yield.
In essence we assume that the mean plant height curve of full set of 100 plants
and that for 44 plants remaining in field till the end are of same ratio or having a
linear relationship with corresponding yields for different plant lifetime. Three of
the variables being known/estimable, onion yield curve for the full set of 100 plants
is estimable. Such linear relationship is observed in studying growth of other tuber
crops as well. In Dasgupta (2013) pp. 8–9, regression of yam yield on maximum
plant height is seen to be linear, the relationship is more prominent in log-log scale
of the variables; see Figs. 6 and 7 therein. See also Dasgupta (2015) and Dasgupta
and Pan (2015) for problems related to growth curve reconstruction.
The real problem is slightly more complicated, as observations on plant
height were stopped at a mature stage of the onion plants, around 100 days of plant lifetime,
when above-ground biomass showed signs of growth decline. However, some more time
was allowed by farmers for the plants to remain in the field till the leaves became
pale, almost dry and/or damaged. Around this time, a major part of the crop was
stolen or went missing from the field. Although records on auxiliary variables are available
for all the plants till maturity, the yield data on a substantial part of the crop are missing,
which results in the present problem in growth curve estimation.

1 Plant numbers 5, 22, 23, 29 had low lifetime and/or low final weight; as such, these are not included in these 44 plants used in prediction.

Plant lifetime thus includes a period in which no observations on plant height are
available. The linear relationship for estimating yield in this extended period
is then obtained from the time of the last available data on the auxiliary
variable, viz. plant height, in the two groups.
Lifetime data on missing plants are available, computed as the time from
sprouting to the time the crop was stolen from the field at the mature stage.

3 Some Features of Collected Data and the Results

Data is collected on maximum height of leaves in each onion plant, number of leaves
over time, and final yield; see Tables 1 and 2.
Figure 1 shows longitudinal growth of height of 100 plants over plant lifetime
i.e., maximum height of leaves in an onion plant, recorded over plant lifetime. The
curves show a common pattern of a sudden increase in growth around 40 days. It
remains steady for some time and then a slight downward trend is observed beyond
80 days. Recall that a part of the crop went missing at mature stage only, when data
recording were frequent, thus full data on plant height is available.
The curve in red color represents the mean curve obtained by joining average
of plant heights at different points of plant lifetime. The curve shows an increasing
trend till 85 days approximately, after which growth retardation, which is usual in
mature stage of plants, is seen.
Figure 2 is based on the individual growth curves of plant height for the 44 plants
that were not stolen or missing, a subset of the 100 plants. The curves show more or
less similar features to those of Fig. 1.
Figure 3 shows the comparison of the mean curves in the two sets. The mean curve in blue,
from Fig. 2, lies below that of Fig. 1, indicating that the plants that remained in the
field are of lower growth status. Good plants, with higher yield, seem to be missing.
The curves run with a more or less constant shift over time from a little after the start;
the gap between the curves increases slightly towards the end of plant lifetime of 100 days.
In Fig. 4 we plot lowess growth curve of average number of leaves for 100 plants
with f = 2/3.
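To make the role of the span f concrete, the following is a minimal sketch of a lowess-type smoother, assuming f is the fraction of data points used in each local fit. It performs a single pass of local linear regression with tricube weights and omits the robustness iterations of the full lowess procedure, so it is a simplification rather than the routine used for Fig. 4. The function name `lowess_sketch` and the sample data are hypothetical.

```python
import math
import numpy as np


def lowess_sketch(x, y, f=2.0 / 3.0):
    """One-pass local linear smoother with tricube weights (no robustness iterations)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(x)
    k = max(2, math.ceil(f * n))  # number of neighbours used in each local fit
    fitted = np.empty(n)
    for i in range(n):
        dist = np.abs(x - x[i])
        h = np.sort(dist)[k - 1]  # bandwidth: distance to the k-th nearest point
        if h == 0.0:
            # All k nearest points coincide with x[i]; fall back to their mean.
            fitted[i] = y[dist == 0.0].mean()
            continue
        w = np.clip(1.0 - (dist / h) ** 3, 0.0, None) ** 3  # tricube weights
        sw = np.sqrt(w)
        # Weighted least squares for a local line centred at x[i].
        A = np.column_stack([np.ones(n), x - x[i]]) * sw[:, None]
        coef, *_ = np.linalg.lstsq(A, y * sw, rcond=None)
        fitted[i] = coef[0]  # local intercept is the fitted value at x[i]
    return fitted


# Hypothetical illustration: days of plant lifetime and average leaf counts.
days = np.array([30, 38, 45, 52, 59, 66, 73, 80, 87, 94], dtype=float)
leaves = np.array([4.1, 5.0, 5.9, 6.0, 6.1, 6.2, 6.0, 5.9, 5.6, 5.2])
smooth = lowess_sketch(days, leaves, f=2.0 / 3.0)
```

A larger f gives a smoother curve because more points enter every local fit; a smaller f follows local features more closely.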
In Fig. 5 we plot longitudinal growth curve of leaf count of 15 plants with sl. no.
1, 8, . . . , 99, only; as plotting discrete data on leaf count in longitudinal growth for
the full set of 100 plants becomes messy. The mean curve in red color in Fig. 5 is
obtained by joining the mean of leaf count at different time points. Average number
of leaves is 6 during [40, 80] days of a substantial growth period. Figure 5 is based
on longitudinal data, whereas Fig. 4 is based on cross-sectional data.
In Fig. 6, with f = 2/3 we plot the lowess growth curve of onion yield data based
on 44 plants. Observations with lifetime greater than 90 days are shown.
In Fig. 7 the predicted growth curve in blue is shown along with the original
growth curve in red. The calibration factor c is obtained as the ratio of heights in two
curves, by comparing the growth curves shown in Fig. 3 on mean plant heights.
For 90 days c = 1.051195, and for 100 days c = 1.063559.
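The calibration step can be sketched as follows. The two values of c are those reported above; the direction of the ratio (mean height of the full set divided by mean height of the 44 remaining plants) is inferred from c being greater than 1 and from the full-set curve lying above the subset curve in Fig. 3. Linear interpolation of c between 90 and 100 days and the yield value used in the example are assumptions made purely for illustration, and the function names are hypothetical.

```python
# Calibration factors reported in the text: days of plant lifetime -> c.
C_BY_DAY = {90: 1.051195, 100: 1.063559}


def calibration_factor(mean_height_full, mean_height_subset):
    """Ratio of mean plant heights, full set over remaining subset (assumed direction)."""
    return mean_height_full / mean_height_subset


def interpolate_c(day, c_by_day=C_BY_DAY):
    """Linearly interpolate c between reported days (an assumption); clamp outside the range."""
    days = sorted(c_by_day)
    if day <= days[0]:
        return c_by_day[days[0]]
    if day >= days[-1]:
        return c_by_day[days[-1]]
    for lo, hi in zip(days, days[1:]):
        if lo <= day <= hi:
            t = (day - lo) / (hi - lo)
            return c_by_day[lo] + t * (c_by_day[hi] - c_by_day[lo])


def calibrate_yield(yield_subset, day):
    """Scale a yield value from the 44-plant curve up to the full-set curve."""
    return interpolate_c(day) * yield_subset


# Hypothetical yield read off the 44-plant growth curve at 95 days.
predicted = calibrate_yield(50.0, 95)
print(round(predicted, 3))
```

Under the proportionality assumption of Sect. 2, multiplying the yield curve of the 44 plants by c at each lifetime gives the predicted curve for the full set of 100 plants.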

Table 1 Growth data on 100 onion plants
Observation dates: 28-01-15, 05-02-15, 12-02-15, 19-02-15
Columns: Plant no; No of leaf, Plant height (28-01-15); No of leaf, Plant height (05-02-15); Plant no; No of leaf, Plant height (12-02-15); No of leaf, Plant height (19-02-15)
2 6 22 6 22.5 2 7 26 7 31
3 4 22.2 6 22.5 3 7 23 7 24.6
4 4 19.5 5 19.5 4 6 21 6 26.5
5 2 20 3 19.5 5 2 12.5
6 3 22.5 4 22.5 6 5 23 6 26.5
7 4 19 5 17 7 7 20.1 7 25
8 5 24 6 24.5 8 6 28.5 6 31.4
9 4 17.6 5 17 9 6 21 7 28
10 5 19.5 5 20.2 10 7 28.4 6 33
11 4 25 5 25 11 7 26 7 28
12 4 19 5 19 12 6 20.6 6 24.5
13 4 18.5 4 18.5 13 6 14 6 21.5
14 4 18 4 18 14 5 14 5 22
15 3 16 3 14 15 4 15 4 16.5
16 5 23 5 23.5 16 6 25.5 6 30
17 4 25 5 25 17 7 31.2 7 34.5
18 4 21 5 21 18 6 22 6 24.5
19 4 15.5 5 14.6 19 6 23 6 30.5
20 3 18.5 5 18 20 4 19 5 19.9
21 4 25.6 4 25 21 6 26.4 6 30.5
22 2 17 3 7 22 2 13.5 1 14.5
23 2 17.5 2 12 23 3 7 2 7
24 5 23 6 23.5 24 7 28 7 32.1
25 5 21 5 21 25 6 24.1 6 28.5
26 4 19 5 19.2 26 6 22.1 6 28
27 2 15.5 1 8.5 27
28 4 18 5 19 28 7 21.5 7 29.5
29 2 17 3 15.5 29 3 15
30 3 16.2 4 16 30 5 16.5 5 14
31 5 21 6 20.2 31 6 26 7 32.1
32 4 14.5 5 14 32 5 21.5 6 26.5
33 5 25.5 6 25 33 6 26 7 28.5
34 2 18.5 3 18 34 4 18.5 4 19
35 5 24 6 24 35 7 25.1 7 33.5
36 4 21.5 5 22 36 7 27.5 7 30.5
37 2 15.5 3 14.5 37 5 15.5 5 16
38 4 23.5 5 21.5 38 6 23 6 27
39 4 18.5 5 24 39 7 35.5 7 40
40 4 22 6 22 40 7 32.6 7 37.5
41 5 18 5 18.5 41 6 27 6 31.6
42 5 18.5 4 14 42 5 17.5 5 23.6
43 5 17 6 18 43 6 24 7 26.5
44 5 23.5 6 23 44 7 26 8 27.6
45 4 21.5 5 23.5 45 5 32.1 6 38
46 3 17 2 9.5 46 3 10 4 12.5
47 5 18 5 16 47 5 18 5 23
48 4 17 5 16.5 48 6 29 7 34
49 5 21.5 6 23.5 49 7 35.2 8 38
50 4 15.6 5 20 50 6 24 5 27
51 4 16.8 6 14.5 51 5 19 3 22.2
52 5 21.8 5 20.5 52 7 22.7 8 30
53 3 18 5 17.8 53 6 20.7 5 27
54 4 23 5 21.5 54 6 25 6 30.5
55 4 23.5 5 24 55 6 30.1 6 35
56 4 23.5 5 23.5 56 7 32 6 34
57 4 20.6 5 21.5 57 6 26 6 27.5
58 4 20.5 5 17 58 5 18 4 18
59 4 17 6 22.5 59 6 24.5 6 32
60 5 22 6 25.5 60 6 32.6 7 38
61 5 19 6 20.5 61 7 26.5 7 29.5
62 4 19 5 15.5 62 6 16 6 18
63 5 24.5 6 23 63 6 27 7 31.5
64 5 22.5 6 17.5 64 6 23.5 5 28
65 3 18 3 18 65 4 19 2 18
66 5 21 6 21.5 66 7 29 7 34.5
67 5 16 6 24.5 67 7 32.6 7 37.5
68 4 21 5 16.5 68 6 25.6 4 31
69 4 24.5 5 20 69 6 24 6 30.5
70 3 22 5 14.5 70 6 17 6 25
71 3 22.1 5 18.2 71 6 21.5 6 28.5
72 5 24.5 6 24.6 72 7 31 6 37.5
73 6 22 6 28.5 73 8 30.7 6 35
74 5 22.1 6 22 74 7 27 6 35.5
75 5 24.5 6 24.5 75 7 35 7 39
76 5 21.5 5 24.6 76 6 32.6 6 39
77 4 21.5 5 19 77 7 26.5 7 32
78 5 17.5 6 21 78 6 28 6 31.1
79 4 19.5 5 18 79 6 24.6 5 30.5
80 5 22 5 22 80 7 28 6 31
81 4 21.5 5 18 81 6 23.5 4 29.6
82 4 18 5 18.5 82 6 22.7 7 27.5
83 4 23.5 5 24.5 83 6 27.7 6 28
84 4 18 5 18.5 84 7 27 7 32
85 4 15.5 5 17 85 6 29 5 37.5
86 5 16.5 6 17 86 7 25 7 30
87 5 18.8 6 19.5 87 6 32.6 6 38.4
88 4 17 6 14.5 88 5 20 5 23.2
89 5 24.5 5 27.5 89 7 33.2 6 34.5
90 5 18.5 6 18.5 90 6 23.5 6 24.5
91 4 18.5 5 19 91 7 23.5 5 27.5
92 5 22 5 21.5 92 6 24.6 6 29.6
93 4 20.1 5 20 93 7 21 7 22.5
94 4 16 5 17.5 94 6 23.2 5 25.6
95 4 16.5 5 15.5 95 6 25.1 6 32
96 5 19 6 24.5 96 7 37.5 6 43
97 5 20.5 5 28 97 6 35.2 7 40
98 5 21.5 5 26.5 98 7 33 7 35
99 5 22.2 6 32.5 99 7 43 7 44
100 5 17.2 6 17 100 6 20.5 6 23.6

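Table 1 (continued)
Observation dates: 26-02-15 and 05-03-15 (left block); 12-03-15 and 19-03-15 (right block)
Columns: Plant no; No of leaf and Plant height on 26-02-15; No of leaf and Plant height on 05-03-15;
Plant no; No of leaf and Plant height on 12-03-15; No of leaf and Plant height on 19-03-15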
1 6 25 7 26.2 1 7 27.5 7 27
2 7 35.5 8 34 2 8 34.5 8 34.5
3 5 24.8 4 25.5 3 4 19.5 4 19
4 5 28.5 5 27.5 4 6 27.5 6 27.5
5 5
6 5 29.5 5 28.5 6 6 29.5 4 29
7 7 31.5 6 32.5 7 7 36 7 36
8 6 31.5 5 27.5 8 5 27.5 5 37.5
9 6 33.5 6 35.5 9 7 36.5 7 36
10 6 32 6 33 10 6 33.5 6 33.5
11 7 28 5 26.5 11 6 27 6 26
12 6 25.5 7 26 12 7 26.2 6 26.2
13 5 21 5 23.5 13 6 26.5 5 26.5
14 4 26.5 5 30.5 14 6 31.4 6 31.5
15 4 16.5 3 18 15 4 19.5 4 20.5
16 7 30.5 7 29.5 16 7 29 6 29.6
17 4 32.5 5 32.5 17 5 32 5 30.5
18 5 24.5 6 24 18 5 22.5 6 22
19 6 35.5 6 37.5 19 6 34.5 7 35.5
20 5 16.5 5 18 20 5 21 4 22
21 6 31 5 28.5 21 6 30.5 6 30.5
22 22
23 23
24 7 32.5 7 35 24 7 36 8 36
25 7 29.5 7 26.5 25 7 24.5 7 24.5
26 6 31.5 6 31.5 26 7 31 7 31.5
27 27
28 7 34.5 8 36.5 28 7 37.5 7 36.5
29 29
30 3 17.5 4 21 30 4 22 4 22
31 7 34 7 34.5 31 8 36.5 8 36
32 5 26.5 5 27.5 32 6 27.5 7 28
33 7 32 6 34 33 7 34.5 7 35
34 3 18 3 15 34 3 15.5 3 15.5
35 7 34.5 7 34.5 35 6 35.2 6 34
36 7 33 5 35 36 5 35.5 5 35.6
37 4 16.5 4 19.5 37 5 21.5 4 21.5
38 5 29 5 31 38 6 31.5 5 32
39 6 42 7 38 39 7 39 7 37
40 8 39 7 39.5 40 7 40 7 40
41 5 32 7 29.5 41 6 27.5 6 28.5
42 5 24 5 25 42 6 25.5 5 26.5
43 6 29.5 7 30.5 43 8 30.2 7 30.6
44 8 27.5 7 29 44 8 29.5 7 29.7
45 5 38 6 40.5 45 7 41 6 40.2
46 5 16.5 5 19.5 46 5 22.5 5 23.5
47 5 25 5 26 47 7 25 6 25
48 7 36.5 6 37 48 7 40 8 40.5
49 7 38.5 6 37.5 49 6 38.5 6 39
50 5 27 6 29 50 5 29.5 5 28.6
51 3 22.5 4 32.5 51 5 33 3 33
52 5 30.5 7 34.2 52 6 34.5 7 34.5
53 6 31 6 34.5 53 6 34.6 6 35.5
54 6 32.5 7 32.2 54 7 33.5 6 33.5
55 7 37 7 35.5 55 7 37 6 36.5
56 6 34 6 34 56 7 29.5 7 29
57 5 28 6 32.5 57 6 33.5 6 34
58 3 19 3 19 58 3 15 2 15.5
59 7 32 7 30.5 59 7 28.5 6 28.5
60 6 38 6 34 60 5 32 5 35
61 6 29.5 6 33 61 5 34 6 34.5
62 5 19.5 4 18.5 62 4 18.5 5 18.5
63 5 30.5 6 30 63 6 32.5 6 30.6
64 5 28.5 6 30 64 5 31 4 31.5
65 2 15 65
66 8 38 7 39 66 7 44 8 44.5
67 7 37.5 8 40.5 67 7 40.5 8 41
68 5 34.5 4 37 68 4 38 5 39.5
69 5 35.5 6 34.5 69 6 36 6 36
70 4 30 5 32.2 70 6 32.5 5 33.5
71 5 31.5 4 32.5 71 5 33 4 33.5
72 6 37 6 44.5 72 7 45 6 45.2
73 6 46 6 48 73 6 49 7 49.5
74 6 40 6 41 74 7 41.5 7 42
75 7 39 8 39.5 75 8 40 8 40.5
76 7 40 7 37.5 76 7 39 8 38
77 5 37.5 7 41 77 7 41 7 41
78 6 31 5 29.5 78 7 30.5 6 29.5
79 6 32.5 6 36 79 7 37 7 36.7
80 7 32 7 30.5 80 8 30.5 8 29.5
81 5 33 4 38.5 81 6 38.5 6 38
82 6 34.5 5 42.5 82 6 43.5 5 43.5
83 5 36.2 5 49 83 5 49.5 5 49.5
84 6 34 6 36 84 7 36.5 7 37
85 5 37.5 6 41.5 85 6 43.5 6 38
86 6 33.5 7 37 86 6 37.5 7 37.5
87 6 38.5 7 34 87 8 35.5 7 36.5
88 4 26.5 6 30.6 88 6 32.5 5 32
89 6 35.5 7 37.5 89 7 36 6 36
90 6 28 7 30.5 90 7 30.5 7 31.5
91 5 28.5 6 30.5 91 5 30.5 5 30.5
92 6 34.5 6 34 92 6 37.5 6 38.5
93 5 19.5 5 26.5 93 6 27 5 27
94 6 25.6 6 29 94 7 29.5 6 30.5
95 6 32 7 36 95 8 36.5 7 37.2
96 6 43 7 42 96 8 42.5 9 43.5
97 7 40.5 7 34.5 97 8 39.5 9 40.2
98 7 35 8 33 98 9 33.5 9 33.5
99 5 42.5 7 44 99 9 44.5 9 44.5
100 6 24 6 25 100 7 25.5 7 26

Fig. 1 Growth curve of plant height (100 plants); axes: lifetime (days) vs plant height (cm)

Fig. 2 Growth curve of plant height (44 plants); axes: lifetime (days) vs plant height (cm)

Next we check the assumption of proportional growth of the auxiliary variable
‘maximum plant height’ with yield. In Fig. 8 we show the regression of onion yield
on maximum plant height.
In Fig. 9 we show the regression of log(onion yield) on log(maximum plant
height). The linear fit in log-log scale seems better than the fit shown in Fig. 8,
where the original variables are considered.
The growth curve may be reconstructed on the basis of a linear relationship between
the variables, using a least squares line fit in log-log scale, or using lowess regression
of the variables in log-log scale in a similar manner. These may provide more accurate
curves, since the value of r = 0.84 is higher in the log-log scale, as described in Fig. 9.
Figure 10 provides the lowess growth curve with f = 2/3 for the 44 onions having
complete data in the original records. Also shown are the lowess growth curves with
f = 2/3 for data in which the missing observations are reconstructed, either from the
least squares line fit (weight on plant height) in log-log scale or from the lowess
regression (weight on plant height) in log-log scale, after retransforming the variables
via antilog. In these two cases the lowess growth curve with f = 2/3 combines the
reconstructed data with the original data to obtain the full growth scenario, as if
the full data were available.

Table 2 Yield of onion (gm)

Plant no Weight (gm) Plant no Weight (gm) Plant no Weight (gm)
1 Nil 42 4 83 12
2 Nil 43 Nil 84 14
3 Nil 44 13 85 Nil
4 5 45 Nil 86 13
5 10 46 5 87 Nil
6 5 47 6 88 Nil
7 11 48 20 89 Nil
8 18 49 22 90 Nil
9 14 50 7 91 10
10 14 51 4 92 Nil
11 Nil 52 12 93 5
12 Nil 53 18 94 Nil
13 Nil 54 20 95 Nil
14 Nil 55 Nil 96 Nil
15 2 56 Nil 97 Nil
16 9 57 Nil 98 Nil
17 11 58 6 99 Nil
18 5 59 Nil 100 Nil
19 16 60 18
20 2 61 6
21 Nil 62 3
22 2 63 Nil
23 2 64 Nil
24 Nil 65 Nil
25 Nil 66 Nil
26 10 67 Nil
27 Nil 68 21
28 16 69 Nil
29 2 70 Nil
30 5 71 Nil
31 Nil 72 Nil
32 Nil 73 Nil
33 Nil 74 Nil
34 Nil 75 Nil
35 17 76 Nil
36 15 77 Nil
37 5 78 Nil
38 8 79 Nil
39 14 80 Nil
40 18 81 Nil
41 8 82 8

Fig. 3 Mean plant height for 100 and 44 plants; legend: 100 plants, 44 plants; axes: lifetime (days) vs plant height (cm)

Fig. 4 Lowess growth curve of average number of leaves (100 plants); axes: lifetime (days) vs average number of leaves

Fig. 5 Longitudinal growth curve of leaf count for 15 plants, 1(7)99; axes: lifetime (days) vs no of leaves

Fig. 6 Growth curve of onion in partial data (44 plants); axes: lifetime (days) vs onion weight (gm)

Fig. 7 Predicted growth curve and original growth curve; legend: Predicted, Original; axes: lifetime (days) vs onion weight (gm)

Fig. 8 Regression of onion yield on maximum plant height

Fig. 9 Regression of log(onion yield) on log(maximum plant height)



Fig. 10 Predicted growth and original growth under log calibration

4 Validation

To validate the procedure of estimating the growth curve by reconstructing the
missing data, we proceed as follows. We take the 44 onion plants on which complete
information is available and assign them new serial numbers 1, . . . , 44 in sequence.
These data are divided into two equal halves, one with odd serial numbers and one
with even serial numbers. Postulating that the first (odd numbered) set is missing, we
estimate the full growth curve on 44 plants from the information on the second set.
We adopt the same procedure as that used above for estimating the growth curve on
100 plants from data available on 44 plants. The goal is to see how close the
newly constructed growth curves are to the actual curve.
Consider the subset with even serial numbers from the 44 onion plants, i.e., the 22
even numbered onion plants that are available for prediction. Figure 11 shows
the height and weight of these plants in log-log scale as black points, the
lowess regression points with f = 2/3 in magenta, and the least squares line fit
with r = 0.79. The linear fit seems good.

Fig. 11 Regression of log(onion yield) on log(maximum plant height) for 22 plants

Fig. 12 Predicted growth and original curve for validation under log calibration
In Fig. 12 we take antilogs to obtain yield values and show the original growth curve
for 44 onions, together with the predicted growth curve from the least squares fitted
line and the growth curve from lowess regression with f = 2/3. The predicted curves
closely approximate the features of the original growth curve on a scale of 2 gm on the
y axis, indicating that the proposed procedure is efficient. The curve from lowess
regression is closer to the original curve.

5 Discussions

Lowess regression, a nonparametric technique, may be used to estimate a growth curve
from partial data with the help of available additional information. In this particular
problem, the relationship between maximum plant height and onion yield in log-log
scale helps us to infer the missing onion yield.
This, coupled with complete data on 44 plants, provides an estimate of the growth
curve from the reconstructed full data by lowess regression.
Denote y = log(yield) and x = log(max plant height) for the respective plants. For
least squares calculations, the regression of y on x is y = mx + c, where m = 2.1783
and c = −5.3026, obtained as the least squares fit on the 44 plants with fully recorded
data. For all plants with missing yield, the antilog of y estimates the respective onion
weight:
\[ \text{Estimated yield} = e^{-5.3026}\,(\text{max plant height})^{2.1783} \]
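The least squares calculation above may be sketched in Python as follows. This is only an illustration under stated assumptions: the function names `fit_loglog_line` and `estimated_yield` are hypothetical, natural logarithms are assumed (consistent with the antilog written as a power of e), and the default coefficients are the values of m and c quoted in the text.

```python
import math

# Coefficients quoted in the text for the fit on the 44 fully recorded plants.
M_SLOPE = 2.1783
C_INTERCEPT = -5.3026


def fit_loglog_line(heights, yields):
    """Least squares fit of log(yield) on log(height); returns (m, c).

    Hypothetical helper: plants with missing ('Nil') yield must be
    excluded by the caller, since log is undefined for them.
    """
    xs = [math.log(h) for h in heights]
    ys = [math.log(w) for w in yields]
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    m = sxy / sxx
    c = y_bar - m * x_bar
    return m, c


def estimated_yield(max_height, m=M_SLOPE, c=C_INTERCEPT):
    """Antilog of the fitted line: exp(c) * max_height ** m."""
    return math.exp(c) * max_height ** m
```

Calling `estimated_yield(h)` with the maximum height h (in cm) of a plant whose yield is missing gives the reconstructed weight in gm under the fitted line.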
For calculating yield via lowess regression, we use the relation
(1) log(weight) = c* log(height)
for the fully recorded data on plant height and onion weight of the 44 plants, evaluated
at the lowess predicted points with f = 2/3, and obtain c*, specific to a plant height
among the 44 plants.
We next calculate the log of onion weight of the missing plants from
(2) log(weight) = c* log(height), height = height(t), t = lifetime of the respective
plants. To calculate c* for a missing plant, we take the point among the 44 lowess
points (height, weight) whose height is closest to that of the plant with missing yield.
This essentially amounts to taking the weight from the lowess point whose height is
closest to that of the missing plant. Linear interpolation between lowess points
may also be used if the two heights are not close.
We thus get the respective weights with t = lifetime of the respective missing plants.
The original data on lifetime and weight of the 44 plants are merged at this stage to
obtain the overall lowess regression with f = 2/3.
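The nearest-point rule in relations (1) and (2) may be sketched as follows. The sketch assumes that the lowess points are already available as (height, weight) pairs in the original scale with heights above 1 cm, so that log(height) is nonzero; the function names are hypothetical, and the lowess smoothing itself is not reproduced here.

```python
import math


def c_star_for(height, lowess_points):
    """c* = log(weight) / log(height) at the lowess point nearest in height."""
    h, w = min(lowess_points, key=lambda p: abs(p[0] - height))
    return math.log(w) / math.log(h)


def predicted_weight(height, lowess_points):
    """Relation (2): log(weight) = c* log(height), returned after antilog."""
    return math.exp(c_star_for(height, lowess_points) * math.log(height))


def interpolated_weight(height, lowess_points):
    """Alternative: linear interpolation between the bracketing lowess points."""
    pts = sorted(lowess_points)
    if height <= pts[0][0]:
        return pts[0][1]
    if height >= pts[-1][0]:
        return pts[-1][1]
    for (h0, w0), (h1, w1) in zip(pts, pts[1:]):
        if h0 <= height <= h1:
            if h1 == h0:
                return (w0 + w1) / 2.0
            t = (height - h0) / (h1 - h0)
            return w0 + t * (w1 - w0)
    return pts[-1][1]
```

When the height of the missing plant coincides with a lowess point, `predicted_weight` returns the weight of that point, in line with the remark that the rule amounts to taking the weight from the closest lowess point.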
Validation of the proposed procedure indicates that growth curve estimation from
missing data by this technique is satisfactory, as explained below.
Data on 44 serially numbered plants are divided into two parts, even and odd
numbered. Odd numbered plants are presumed missing, and even numbered plants
are available for prediction of the ‘missing’ data.
The least squares line for y = log(yield) on x = log(max plant height), calculated
from 22 points, is of the form y = mx + c, where m = 1.930 and c = −4.446. Thus,
for the 22 plants with presumed ‘missing’ data,
\[ \text{Estimated yield} = e^{-4.446}\,(\text{max plant height})^{1.930} \]
as used in the second calculation for validation of the procedure.
Both the lowess regression and the least squares regression in log-log scale preserve
the main features of the growth curve under prediction.
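The odd and even split used for validation may be sketched as below, for the least squares variant only. The helper names are hypothetical, the inputs are assumed to be the heights and yields of the fully recorded plants listed in serial order, and natural logarithms are assumed.

```python
import math


def fit_loglog_line(heights, yields):
    """Least squares fit of log(yield) on log(height); returns (m, c)."""
    xs = [math.log(h) for h in heights]
    ys = [math.log(w) for w in yields]
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    m = sxy / sxx
    return m, y_bar - m * x_bar


def validate_by_odd_even_split(heights, yields):
    """Fit on even serial numbers and predict the odd numbered plants.

    Serial numbers start at 1. Returns (m, c, predictions), where
    predictions maps each odd serial number to its estimated yield.
    """
    even_idx = [k for k in range(len(heights)) if (k + 1) % 2 == 0]
    odd_idx = [k for k in range(len(heights)) if (k + 1) % 2 == 1]
    m, c = fit_loglog_line([heights[k] for k in even_idx],
                           [yields[k] for k in even_idx])
    predictions = {k + 1: math.exp(c) * heights[k] ** m for k in odd_idx}
    return m, c, predictions
```

Comparing `predictions` with the recorded yields of the odd numbered plants gives a direct check of how well the presumed ‘missing’ values are recovered.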

References

Brewester JL (1987) The effect of temperature on the rate of sprout growth and development within
stored onion bulbs. Ann Appl Biol 111:463–465
Dasgupta R (2013) Yam growth experiment and above-ground biomass as possible predictor.
Advances in growth curve models: topics from the Indian statistical institute. In: Proceedings
in mathematics & statistics, Chapter 1, vol 46. Springer, Berlin, pp 1–33
Dasgupta R (2015) Growth curve reconstruction in damaged experiment via nonlinear calibration.
In: Growth curve and structural equation modeling: topics from the Indian statistical institute,
Chapter 7. Springer, Cham, pp 119–134
Dasgupta R, Pan A (2015) Growth curve of phase change in presence of polycystic ovary syndrome.
In: Growth curve and structural equation modeling: topics from the Indian statistical institute,
Chapter 8. Springer, Cham, pp 135–149
Tackling Poverty Through Balanced Growth:
A Study on India

Sattwik Santra and Samarjit Das

Abstract The relationship between growth, poverty and inequality is one of the
most debated and controversial topics in development economics. India has shown
evidence of self-sustaining growth for more than two decades. The question often
arises whether this sustained positive growth implies a ‘shining India’.
Why is growth more pro-poor in some states than in others? We attempt to answer
this question using a novel approach that allows for inter-regional dependence of
poverty together with its major exogenous determinants, such as the average level
of income and income inequality. Based on a balanced panel data set of state level
consumption data, our study not only upholds the trivial observations that mean
per-capita consumption expenditure decreases the absolute poverty rate while the
degree of inequality increases it, but also corroborates a significant positive
inter-regional dependence of poverty, as well as a positive association of poverty with
the mean per-capita consumption expenditures of the ‘neighboring’ regions, thus
suggesting the importance of inter-regional inequality for poverty.

Keywords Growth, poverty and inequality nexus · Balanced growth · Inter-regional
dependence · Spatial dependence · Laspeyres price index

JEL Classification: C23 · O40

1 Introduction

The conventional wisdom is that continued growth reduces the incidence of
poverty and improves the living conditions of the poor. Yet, in reality, it is often
observed that continued growth upsets the wealth distribution, thus compromising
the very sustainability of growth. Examining the contribution of growth and income
distribution to changes in poverty is essential to designing poverty reduction policies.
India is one of the largest growing economies in the world. During the last two
decades, it has not only maintained sustained growth but has also reduced poverty
steadily. However, neither the growth rate nor poverty reduction is uniform across
regions in India. Indian states, and regions within states, are characterized by
geographic, demographic, and economic diversity. India is also vast; regions have
different resource bases, and rigidity in factor mobility leads to the possibility of
regional variation and disparities. Our objective in this article is to explore not only
the role of growth and income distribution in poverty but also the regional spill-over
effects of poverty, inequality and growth that endogenously determine the poverty
outlook of a region. To incorporate these “neighbour” effects on poverty reduction,
we consider an econometric model with spatial dependence. The results of our analysis
suggest that the incidence of poverty is determined by both local and global factors,
suggesting a neighbourhood effect. Significant spill-over effects are observed for both
poverty and the average level of income, while inequality within a region affects only
the extent of poverty of that region and thus shows no significant neighbourhood
effect.

S. Santra
Centre for Training and Research in Public Finance and Policy, Center for Studies in Social
Sciences, Calcutta, R–1, Baishnabghata Patuli Township, Kolkata 700 094, India
S. Das (✉)
Economics Research Unit, Indian Statistical Institute, 203 B T Road, Kolkata 700108, India
e-mail: samarjit@isical.ac.in

© Springer International Publishing AG 2017
R. Dasgupta (ed.), Growth Curve Models and Applications, Springer Proceedings
in Mathematics & Statistics 204, DOI 10.1007/978-3-319-63886-7_8

There is a voluminous empirical literature that deals with poverty and, within it,
a number of works address the interrelationship between poverty, growth
and inequality (examples include Datt and Ravallion (1992, 2002), Ravallion (1995,
1997), Adams (2004), Beck et al. (2007), Kalwij and Verschoor (2007), Ravallion
and Chen (1997, 2007), Chambers (2011) among others). In this strand of empirical
literature, a number of studies also deal with the issue of regional convergence
of poverty. Our present work adds to this area of empirical research and
introduces spatial econometric techniques to shed further light on this topic in the
context of India. In doing so, the work emphasizes the spatial endogeneity of poverty
together with its causative factors and highlights the distinctive role of inter-regional
inequality, as against intra-regional inequality, as a determinant of poverty.
The paper is organized as follows. The next section provides a detailed description
of the data, sources and construction of the different variables. That section also
describes the econometric model at length. Section 3 presents the empirical findings
together with some possible economic implications and explanations. Section 4
concludes the paper with some remarks from the viewpoint of policy prescriptions.

2 Data and Model

This study is based on the last six major rounds of the survey on ‘Household Consumer
Expenditure’ conducted by the National Sample Survey Organization (NSSO) of India.
The data cover 32 states and Union territories¹ of India and were collected
during 1987–1988 (43rd round), 1993–1994 (50th round), 1999–2000
(55th round), 2004–2005 (61st round), 2009–2010 (66th round) and 2011–2012
(68th round). The survey collects data on various socio-economic characteristics
of a household as well as its members. To name a few, these include the sampled
household's principal occupation, social group, religion, amount of land cultivated,
the amounts of various items consumed together with the associated expenditures
adjusted to a suitable reference period, and the age, sex and education of
each household member and their relationship with the household head. In addition,
data are provided on the location of the sampled households, which
includes the sector (rural or urban), state, region (a subdivision of each state based on
certain broad geographical features) and the district in which the household resides.
Apart from this major data source, we also use the “Report of the Expert Group to
Review the Methodology for the Estimation of Poverty” and the geospatial data
provided by the GADM database for our analysis.

1 The states and the Union territories are: Andaman & Nicobar Islands, Andhra Pradesh, Arunachal
Pradesh, Assam, Bihar, Chandigarh, Dadra & Nagar Haveli, Daman & Diu, Delhi, Goa, Gujarat,

In the present exercise, we are primarily interested in determining the nature of
the association of the level of absolute poverty in India with the other principal
components of the income distribution, namely the average level of income and income
inequality, in a suitably general empirical framework. For our purpose, we consider
the poverty gap index to measure the intensity of poverty, the mean monthly per-capita
real consumption expenditure (hereafter referred to as MMPCE) to serve as a
proxy for the mean income level and the 20:20 ratio of consumption expenditures² as
the index of inequality. In order to investigate empirically the dependence of the
poverty gap index on the MMPCE and the 20:20 ratio,
we construct a panel of observations on these indices from the furnished dataset for
the sixty four possible combinations of states and sectors (which constitute our cross
section units) spanning the six NSSO survey years.³ Note that each observation in this
panel, for each of the variables of interest, relates to the distribution of
consumption expenditures of the individuals of the sampled households belonging
to a particular combination of state, sector and year.
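The two indices may be illustrated with a short Python sketch. It makes several simplifying assumptions that are not taken from the text: observations are unweighted (survey weights are ignored), the poverty gap index is computed in its standard form as the mean proportional shortfall below the poverty line, and the inequality index follows the description in footnote 2 as the top-quintile share minus the bottom-quintile share, with quintile groups formed by simple sorting. The function names are hypothetical.

```python
def poverty_gap_index(expenditures, poverty_line):
    """Mean shortfall below the poverty line, as a fraction of the line.

    Individuals at or above the line contribute a gap of zero.
    """
    gaps = [max(0.0, (poverty_line - x) / poverty_line) for x in expenditures]
    return sum(gaps) / len(gaps)


def quintile_share_gap(expenditures):
    """Share of total expenditure of the top fifth minus that of the bottom fifth.

    Assumes at least five observations; each group holds len // 5 units.
    """
    ordered = sorted(expenditures)
    k = len(ordered) // 5
    total = sum(ordered)
    return (sum(ordered[-k:]) - sum(ordered[:k])) / total
```

For example, with expenditures of 50, 100, 150, 200 and 500 and a poverty line of 100, only the first individual is poor, with a proportional shortfall of 0.5, so the poverty gap index is 0.5/5 = 0.1.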
Various descriptive statistics on the distributions of the three principal
variables across the combinations of states and sectors, for each of the six NSSO
rounds, are tabulated in Table 1. Graphical representations of the changes in the
distribution patterns of these variables over the rounds are depicted in Fig. 1.
From these exercises it can be inferred that over time for both rural and urban sectors,
there has been a general increase in the average level of affluence as measured by the

(Footnote 1 continued)
Haryana, Himachal Pradesh, Jammu & Kashmir, Karnataka, Kerala, Lakshadweep, Madhya
Pradesh, Maharashtra, Manipur, Meghalaya, Mizoram, Nagaland, Orissa, Pondicherry, Punjab,
Rajasthan, Sikkim, Tamil Nadu, Tripura, Uttar Pradesh and West Bengal.
2 Given by the difference between the aggregate share of monthly consumption expenditures of the
households above the top (5th) consumption expenditure quintile and the aggregate share of monthly
consumption expenditures of the households below the bottom (1st) quintile.
3 Owing to some gaps in the availability of the data, though, our resulting panel falls short of the
expected number of 384 observations.



Table 1 Descriptive statistics. Source: Authors’ calculations from the data

Statistics over states and sectors, by survey year (rows). Columns: Year, Mean, Standard
deviation, 1st quantile, Median, 3rd quantile, Skewness, Kurtosis
Variable: Poverty gap index
Rural India
1987 0.2690 0.0842 0.1977 0.2856 0.3063 −0.4369 3.1480
1993 0.2560 0.0769 0.1836 0.2779 0.3349 −0.1095 3.6422
1999 0.1362 0.0457 0.1070 0.1282 0.1689 0.1226 3.9566
2004 0.0910 0.0317 0.0701 0.0916 0.1194 0.5465 3.3833
2009 0.1260 0.0706 0.0711 0.1143 0.1986 0.4167 3.9009
2011 0.1015 0.0721 0.0346 0.0935 0.1698 0.6634 4.0338
Urban India
1987 0.1575 0.0582 0.1307 0.1453 0.1625 0.6155 3.0598
1993 0.1514 0.0535 0.1152 0.1369 0.1599 0.8384 2.9525
1999 0.1021 0.0620 0.0632 0.0825 0.1032 2.3691 9.0904
2004 0.0576 0.0207 0.0409 0.0575 0.0648 0.3738 3.2221
2009 0.0663 0.0483 0.0358 0.0552 0.0727 1.9958 7.2279
2011 0.0517 0.0487 0.0191 0.0314 0.0563 1.5874 3.9517
Variable: Real MMPCE
Rural India
1987 393.5295 112.8214 328.2205 368.6754 432.2861 2.0349 8.3967
1993 394.1464 98.6705 315.9030 363.7263 447.3277 1.9095 9.4725
1999 492.9149 95.8629 438.8170 492.4627 507.5273 1.6985 7.6011
2004 586.2230 134.9324 539.2877 575.6547 601.6321 1.8174 6.4271
2009 553.9312 174.3572 407.6331 538.8271 606.3124 1.4280 4.7368
2011 600.2986 197.0054 425.1906 574.3254 722.5880 1.1394 3.8402
Urban India
1987 715.7302 187.8960 678.0958 718.5422 763.2227 2.1350 10.5304
1993 749.9787 172.2192 683.2816 753.5291 833.4937 1.2784 7.5931
1999 849.6941 198.6573 710.2099 883.4963 933.9968 0.3950 4.5354
2004 1108.5982 178.9727 944.5705 1158.9690 1228.4292 −0.0210 3.5571
2009 1150.8111 252.3730 968.5101 1240.3491 1325.6998 −0.3127 4.2534
2011 1203.3595 271.8219 1054.9183 1319.5826 1406.4446 −0.6303 2.6401
Variable: 20:20 ratio
Rural India
1987 0.8912 0.0047 0.8869 0.8911 0.8939 −0.0624 4.3299
1993 0.8959 0.0068 0.8921 0.8960 0.9006 −0.0516 3.9585
1999 0.8842 0.0080 0.8814 0.8861 0.8879 −5.4252 46.1875
2004 0.8882 0.0076 0.8836 0.8896 0.8900 0.0884 5.0706
2009 0.8868 0.0072 0.8838 0.8865 0.8880 −0.9441 24.8861
2011 0.8857 0.0067 0.8816 0.8833 0.8900 −0.4205 17.2466
Urban India
1987 0.9006 0.0072 0.8970 0.9025 0.9061 −2.3439 16.4798
1993 0.9037 0.0071 0.9000 0.9045 0.9099 −0.3402 2.3886
1999 0.9001 0.0079 0.8957 0.9018 0.9066 −2.4026 15.9164
2004 0.9032 0.0074 0.9010 0.9053 0.9073 −4.3346 45.5002
2009 0.9030 0.0117 0.8967 0.9041 0.9064 −0.6582 4.1788
2011 0.8993 0.0133 0.8883 0.9037 0.9084 −0.2063 2.0505

MMPCE figures together with an appreciable decline in the absolute poverty level
and a modest decrease in the level of inequality as measured by the 20:20 ratio.
In addition to the principal variables, to serve as controls in our empirical model,
we also include a number of ancillary variables: the per-capita land
available for cultivation, the proportion of households belonging to three broad
principal occupation types,⁴ the proportion of people belonging to four different classes
of education,⁵ and the proportion of people belonging to the various social⁶ and religious⁷
groups. Recall that all variables are computed for the particular combination of state
and sector in question for the respective years. In order to maintain comparability
within the data, we estimate the Laspeyres price index from the available data on
consumption and use this index to suitably deflate the concerned variables (which
include the per-capita mean consumption expenditure figures and the poverty lines
used to estimate the poverty index) to factor out potential changes in prices over the
years as well as across the cross sections.
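The deflation step may be illustrated with the textbook form of the Laspeyres price index, which values the base-period basket at current and at base prices. The sketch below is an assumption about the form of the calculation, not a reproduction of the authors’ computation; the function names and the dictionary layout keyed by item are hypothetical.

```python
def laspeyres_index(base_prices, base_quantities, current_prices):
    """Cost of the base-period basket at current prices over its cost at base prices."""
    items = list(base_quantities)
    base_cost = sum(base_prices[g] * base_quantities[g] for g in items)
    current_cost = sum(current_prices[g] * base_quantities[g] for g in items)
    return current_cost / base_cost


def deflate(nominal_value, index):
    """Express a nominal figure (expenditure or poverty line) in base-period prices."""
    return nominal_value / index
```

Applying `deflate` with the same index to both the consumption expenditure figures and the poverty line keeps the two comparable across years and cross sections.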
Apart from utilizing the panel structure of our data, to further generalize our
empirical model, we also introduce a spatial dimension to our
analysis. Spatial dependence in a collection of sample data implies that an
observation on a variable (whether endogenous or exogenous) associated with a particular
cross section unit labeled ‘i’ depends on other observations of the variable associated
4 Which are: professional, technical, administrative, executive, managerial and related workers
dubbed as occupation group 1; clerical, sales, service, farmers, fishermen, hunters, loggers,
production and related workers, transport equipment operators and laborers clubbed into occupation
group 2; and workers not classified by occupation, including unemployed laborers, grouped as 3.
5 Divided into: illiterate as education group 1, literate but below secondary level of education as
education group 2, secondary and higher secondary level of education as education group 3 and
above secondary level of education as education group 4.
6 Classified as: scheduled tribes as group 1, scheduled castes as group 2 and others as group 3.
7 Identified as: Hinduism and other religions excepting Islam and Christianity as group 1, Islam as
group 2 and Christianity as group 3.



Fig. 1 Density plots. Source Authors’ calculations from the data

with any other cross section unit ‘j’ with i ≠ j through some exogenously supplied
weighting scheme. Thus, the spatial aspect of our analysis calls for the construction
of spatial weights, and for this we consider three possible weighting schemes.
The first of these assumes that the contribution of any cross section unit ‘j’ to a
cross section unit ‘i’ is inversely related to the “economic” distance between the two
units, and is measured as the reciprocal of the absolute difference in the per-capita
consumption expenditure of the two cross sections for the period 2004–2005.⁸ The
idea is that the more any two cross sectional units resemble each other in
terms of their economic performance, as measured by their respective per-capita
consumption expenditures, the greater are their assigned mutual spatial contributions.
The second measure weighs according to the physical distance between
the cross sections. In this formulation, the contribution of one cross section unit to
another is assumed to vary inversely with the geographical distance between the
centroids of the two cross section units.⁹ The final weight specification combines
the above two weights and takes the form of a simple product of
these two weights. Note that in all of the above formulations the spatial contribution
of any cross section unit ‘j’ to a cross section unit ‘i’ is the same as the contribution of
cross section unit ‘i’ to cross section unit ‘j’.
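The three weighting schemes may be sketched as follows. The sketch rests on assumptions not stated in the text: per-capita consumption expenditures are assumed distinct across units (otherwise the reciprocal is undefined), centroids are treated as planar coordinates with Euclidean distance, no row normalization is applied (which preserves the symmetry noted above), and diagonal entries are set to zero. The function names are hypothetical.

```python
import math


def economic_weights(mpce):
    """w_ij = 1 / |m_i - m_j| for i != j, with a zero diagonal."""
    n = len(mpce)
    return [[0.0 if i == j else 1.0 / abs(mpce[i] - mpce[j])
             for j in range(n)] for i in range(n)]


def geographic_weights(centroids):
    """w_ij = 1 / distance between centroids i and j, with a zero diagonal."""
    n = len(centroids)
    w = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                dx = centroids[i][0] - centroids[j][0]
                dy = centroids[i][1] - centroids[j][1]
                w[i][j] = 1.0 / math.hypot(dx, dy)
    return w


def combined_weights(econ, geo):
    """Element-wise product of the economic and geographic weights."""
    n = len(econ)
    return [[econ[i][j] * geo[i][j] for j in range(n)] for i in range(n)]
```

Each of the three matrices is symmetric by construction, so the contribution of unit ‘j’ to unit ‘i’ equals that of unit ‘i’ to unit ‘j’.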
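To formally estimate the degree of dependence of poverty on income and inequality,
given the nature and scope of our data, we propose a fairly general spatial panel fixed
effects regression specification (see Durbin 1960; Anselin 2007) given by:

\[
P_{it} = \alpha_i + \lambda \sum_{j \neq i} W_{ij} P_{jt} + \beta_1 I_{it} + \beta_2 \ln(M_{it})
+ \rho_1 \sum_{j \neq i} W_{ij} I_{jt} + \rho_2 \sum_{j \neq i} W_{ij} \ln(M_{jt})
+ \hat{Z}_{it} \hat{\theta} + \varepsilon_{it} \tag{1}
\]

with $\varepsilon_{it} \sim N(0, \sigma_i^2)$.
The above equation may be written more compactly as:

\[
\hat{P}_t = \hat{\alpha} + \lambda \hat{\hat{W}} \hat{P}_t + \beta_1 \hat{I}_t + \beta_2 \ln(\hat{M}_t)
+ \rho_1 \hat{\hat{W}} \hat{I}_t + \rho_2 \hat{\hat{W}} \ln(\hat{M}_t)
+ \hat{\hat{Z}}_t \hat{\theta} + \hat{\varepsilon}_t
\]

with $\hat{\varepsilon}_t \sim N(\hat{0}, \hat{\hat{\Sigma}})$.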
In the above equations, where applicable, the subscripts i and j index the cross
section units whereas t indexes time, for the variables associated with the index
of poverty (P), real per-capita mean expenditure (M), the inequality index (I), the
vector of other control variates (Ẑ) and the weight (W) associated with the respective
cross section units. Also note that a single ‘hat’ is used to signify conversion of a
scalar to a vector, whereas a double ‘hat’ denotes the matrix representation of a
scalar or vector form. The other symbols that appear in the above equations represent
the parameters associated with the respective variables, which are to be estimated from
the data. In the following section, we discuss the estimated values of these
parameters.
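To make the structure of Eq. (1) concrete, the following Python sketch evaluates its right-hand side, without the error term, for a single time period and given parameter values. It is only an illustration of how the spatially weighted terms enter; it is not an estimation routine, and the function names and argument layout are hypothetical.

```python
import math


def spatial_lag(weights, values):
    """Sum over j != i of W_ij * x_j, for every cross section unit i."""
    n = len(values)
    return [sum(weights[i][j] * values[j] for j in range(n) if j != i)
            for i in range(n)]


def equation_one_rhs(alpha, lam, beta1, beta2, rho1, rho2, theta, W, P, I, M, Z):
    """Right-hand side of Eq. (1) without the error term, for one period.

    alpha, P, I, M are lists over units; Z is a list of control vectors,
    one per unit, each of the same length as theta.
    """
    log_m = [math.log(m) for m in M]
    w_p = spatial_lag(W, P)
    w_i = spatial_lag(W, I)
    w_m = spatial_lag(W, log_m)
    return [alpha[i] + lam * w_p[i] + beta1 * I[i] + beta2 * log_m[i]
            + rho1 * w_i[i] + rho2 * w_m[i]
            + sum(z * t for z, t in zip(Z[i], theta))
            for i in range(len(P))]
```

Because the poverty index of other units appears on the right-hand side, the specification cannot be fitted by evaluating this expression alone; the sketch only shows how each term is assembled from the data and the weights.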

8 The choice of this particular year facilitates computations, since the poverty line available in the
data is provided for this particular time point, and since this period lies almost halfway in the
available time series and is characterized by a relatively stable nationwide economic performance.
9 The distance between the rural and urban sectors of a particular state is taken to be the distance
between the two points that trisect the diagonal of a square having an area equal to that of the state
in question. In effect, the distance is proportional to the square root of the area of the state.

3 Empirical Findings

The results obtained from the above regression specification are tabulated in Tables 2
and 3. While Table 2 reports the results obtained for all the available observations, the
estimation results reported in Table 3 serve as robustness checks for our empirical
exercise and depict the results obtained when one drops the observations pertaining
to the states and union territories¹⁰ of Daman and Diu, Nagaland, Andaman
and Nicobar Islands, Dadra and Nagar Haveli, Goa and Lakshadweep.
Inspection of the tabulated values yields some unambiguous conclusions regarding
the impact of the mean per-capita consumption expenditure and inequality on
poverty. The significantly positive coefficient associated with the 20:20 ratio and the
significantly negative coefficient associated with the mean per-capita consumption
expenditure clearly indicate the rather trivial observation that, for any particular cross
section unit, either a rise in its mean per-capita expenditure level or a fall in its
expenditure inequality unequivocally decreases the level of absolute poverty of that cross
section unit. What is more interesting to note is that, for all our regression
specifications, the coefficient attached to the spatially weighted poverty
gap index (i.e. λ in Eq. 1) is significantly positive, as is the coefficient attached to
the spatially weighted mean per-capita consumption expenditure (i.e. ρ2 in
Eq. 1). This indicates that the level of poverty of any given cross section
unit ‘i’ is strongly positively correlated with both the poverty and the mean per-capita
consumption expenditures of other cross section units ‘j’ with i ≠ j, suggesting
that, on the one hand, poverty is characterized by significant regional
spillovers and, on the other hand, the degree of poverty is significantly dictated by
the inter-regional inequality of per-capita consumption.¹¹
Apart from these observations on the key variables of the model, it may also be
noted from the tables that the coefficients associated with the proportion of economically
active adults (i.e. with age between 18 and 62) belonging to the education group
above the secondary level of education (refer footnote 7) and with the per-capita amount of
land cultivated are statistically significant and positive. The result associated with
the education group follows once we consider the fact that areas inhabited by highly
paid white collar workers attract a large number of migrating poor workers who
take on various household related jobs as drivers, servants, maids, nannies etc. (a
host of other explanations, also applicable to India, are offered in Ravallion and
Chen 2007). A similar observation, that agricultural activities draw a class of poor
migratory short term laborers, may be put forward to explain the positive impact
of the per-capita amount of cultivated land on poverty.

10 For these, the number of available observations in the NSSO rounds is scant.
11 Although not by the degree of inequality within the other regions.
Tackling Poverty Through Balanced Growth: A Study on India 177

Table 2 Regression results obtained using all the available observations
Dependent variable: poverty gap index

Regression weights based on:                              Distance      MPCE          Distance and MPCE

Coefficient associated with right hand side variables of
Inequality (20-20 ratio) (β1 in Eq. 1)                    0.3398**      0.3962***     0.4370***
                                                          (0.1373)      (0.1359)      (0.1228)
MMPCE (in logs) (β2 in Eq. 1)                             −0.3606***    −0.3457***    −0.3542***
                                                          (0.0403)      (0.0349)      (0.0350)
Weighted poverty gap index (λ in Eq. 1)                   0.8607***     0.4685***     0.8117***
                                                          (0.1587)      (0.1207)      (0.1233)
Weighted inequality (20-20 ratio) (ρ1 in Eq. 1)           −0.0166       −0.0419       −0.0764
                                                          (0.2184)      (0.1976)      (0.1567)
Weighted MMPCE (in logs) (ρ2 in Eq. 1)                    0.2860***     0.1368***     0.2811***
                                                          (0.0521)      (0.0480)      (0.0513)
Per-capita cultivated land (components of θ̂ in Eq. 1)     0.0001**      0.0002***     0.0001***
                                                          (0.0001)      (0.0000)      (0.0000)

Proportion of economically active adults in education group (components of θ̂ in Eq. 1)
2 (refer footnote 7)                                      −0.0267       0.0534*       0.0264
                                                          (0.0323)      (0.0291)      (0.0270)
3 (refer footnote 7)                                      −0.1131*      −0.0806       −0.1133**
                                                          (0.0589)      (0.0554)      (0.0531)
4 (refer footnote 7)                                      0.4539***     0.4271***     0.3451***
                                                          (0.1099)      (0.1160)      (0.1071)
Proportion of male (components of θ̂ in Eq. 1)             0.1371        0.2275        0.2451*
                                                          (0.1357)      (0.1511)      (0.1327)

Proportion of population in social group (components of θ̂ in Eq. 1)
1 (refer footnote 8)                                      −0.0013       −0.0079       −0.0177
                                                          (0.0329)      (0.0350)      (0.0302)
2 (refer footnote 8)                                      0.0789**      0.0509        0.0507
                                                          (0.0375)      (0.0624)      (0.0487)

Proportion of population in religion group (components of θ̂ in Eq. 1)
2 (refer footnote 9)                                      0.0318        0.0432        0.0195
                                                          (0.0249)      (0.0327)      (0.0307)
3 (refer footnote 9)                                      −0.0407       −0.1062       −0.0265
                                                          (0.0511)      (0.0739)      (0.0620)

Proportion of population in occupation group (components of θ̂ in Eq. 1)
1 (refer footnote 6)                                      −0.1106**     −0.1885***    −0.1046**
                                                          (0.0534)      (0.0512)      (0.0518)
2 (refer footnote 6)                                      −0.1076**     −0.2026***    −0.0944**
                                                          (0.0453)      (0.0515)      (0.0464)
Constant                                                  0.2234        1.1699***     0.1307
                                                          (0.3919)      (0.3773)      (0.3718)
Number of observations (NxT)                              320 (64 × 5)  320 (64 × 5)  320 (64 × 5)
Note: *, ** and *** denote significance at 10, 5 and 1% respectively

4 Concluding Remarks

The paper investigates the empirical relationship between growth, poverty and
inequality based on balanced panel data with thirty two states over a period of
twenty five years. We incorporate spatial dependence in both the poverty and the income
variables. Our model estimates support the intuitively straightforward
result that income has a negative impact on poverty and that income inequality
has a positive impact on poverty. However, we also find the more interesting
and non-trivial result that the extent of poverty of a region is positively affected by
the poverty rates of neighboring regions as well as by the average income levels of the
neighboring regions, implying that inter-regional inequality aggravates poverty.
Our study thus yields a clear policy prescription: both the central and state governments
should undertake coordinated redistribution policies that aim to reduce
inequality both within and across the geographic and economic regions, so that the
lower end of the income distribution becomes better off.

Table 3 Regression results obtained by dropping observations pertaining to the states and union
territories namely Daman and Diu, Nagaland, Andaman and Nicobar Islands, Dadra and Nagar
Haveli, Goa and Lakshadweep
Dependent variable: poverty gap index

Regression weights based on:                              Distance      MPCE          Distance and MPCE

Coefficient associated with right hand side variables of
Inequality (20-20 ratio) (β1 in Eq. 1)                    0.4998***     0.4505***     0.5233***
                                                          (0.1388)      (0.1345)      (0.1289)
MMPCE (in logs) (β2 in Eq. 1)                             −0.3626***    −0.3431***    −0.3536***
                                                          (0.0403)      (0.0379)      (0.0376)
Weighted poverty gap index (λ in Eq. 1)                   0.7235***     0.5614***     0.7629***
                                                          (0.1500)      (0.1044)      (0.1152)
Weighted inequality (20-20 ratio) (ρ1 in Eq. 1)           −0.5055*      −0.3129       −0.5028*
                                                          (0.3054)      (0.2897)      (0.2896)
Weighted MMPCE (in logs) (ρ2 in Eq. 1)                    0.2548***     0.1915***     0.2734***
                                                          (0.0496)      (0.0408)      (0.0498)
Per-capita cultivated land (components of θ̂ in Eq. 1)     0.0003***     0.0002***     0.0002***
                                                          (0.0001)      (0.0001)      (0.0000)

Proportion of economically active adults in education group (components of θ̂ in Eq. 1)
2 (refer footnote 7)                                      −0.0519       −0.0082       −0.0119
                                                          (0.0366)      (0.0356)      (0.0317)
3 (refer footnote 7)                                      −0.0750       −0.0434       −0.0503
                                                          (0.0698)      (0.0655)      (0.0606)
4 (refer footnote 7)                                      0.4393***     0.3500***     0.2844**
                                                          (0.1207)      (0.1231)      (0.1195)
Proportion of male (components of θ̂ in Eq. 1)             0.0012        0.0578        0.0612
                                                          (0.1394)      (0.1752)      (0.1550)

Proportion of population in social group (components of θ̂ in Eq. 1)
1 (refer footnote 8)                                      0.0378        0.0046        0.0042
                                                          (0.0382)      (0.0310)      (0.0316)
2 (refer footnote 8)                                      0.0791*       0.0752        0.0749*
                                                          (0.0411)      (0.0542)      (0.0432)

Proportion of population in religion group (components of θ̂ in Eq. 1)
2 (refer footnote 9)                                      0.0192        0.0210        0.0088
                                                          (0.0288)      (0.0320)      (0.0305)
3 (refer footnote 9)                                      −0.2015***    −0.2126***    −0.1551**
                                                          (0.0622)      (0.0702)      (0.0608)

Proportion of population in occupation group (components of θ̂ in Eq. 1)
1 (refer footnote 6)                                      −0.0370       −0.0725*      −0.0377
                                                          (0.0383)      (0.0403)      (0.0413)
2 (refer footnote 6)                                      −0.0879***    −0.0971**     −0.0466
                                                          (0.0334)      (0.0443)      (0.0387)
Constant                                                  0.7915**      0.9685***     0.5323
                                                          (0.3916)      (0.3601)      (0.3232)
Number of observations (NxT)                              342 (57 × 6)  342 (57 × 6)  342 (57 × 6)
Note: *, ** and *** denote significance at 10%, 5% and 1% respectively

References

Adams RH Jr (2004) Economic growth, inequality and poverty: estimating the growth elasticity
of poverty. World Dev 32(12):1989–2014
Anselin L (2007) Spatial econometrics. In: Mills TC, Patterson K (eds) Palgrave handbook of
econometrics, vol 1. Econometric theory. Palgrave MacMillan, New York, pp 901–969
Beck T, Demirgüç-Kunt A, Levine R (2007) Finance, inequality and the poor. J Econ Growth
12(1):27–49
Chambers D (2011) A non-parametric measure of poverty elasticity. Rev Income Wealth 57(4):683–
703
Datt G, Ravallion M (1992) Growth and redistribution components of changes in poverty measures:
a decomposition with applications to Brazil and India in the 1980s. J Dev Econ 38(2):275–295
Datt G, Ravallion M (2002) Why has economic growth been more pro-poor in some states of India
than others? J Dev Econ 68(2):381–400
Durbin J (1960) Estimation of parameters in time-series regression models. J R Stat Soc 22(1):139–
153
Kalwij A, Verschoor A (2007) Not by growth alone: the role of the distribution of income in regional
diversity in poverty reduction. Eur Econ Rev 51:805–829

Ravallion M (1995) Growth and poverty: Evidence for developing countries in the 1980s. Econ Lett
48(3–4):411–417
Ravallion M (1997) Can high-inequality developing countries escape absolute poverty? Econ Lett
56(1):51–57
Ravallion M, Chen S (1997) What can new survey data tell us about recent changes in distribution
and poverty? World Bank Econ Rev 11(2):357–382
Ravallion M, Chen S (2007) China’s (uneven) progress against poverty. J Dev Econ 82(1):1–42
Model Selection and Validation
in Agricultural Context: Extended Uniform
Distribution and Some Characterization
Theorems

Ratan Dasgupta

Abstract We propose a new model to explain a bulb crop production and validate
the model. A new type of distribution named ‘extended uniform distribution’ in
discrete and continuous form is proposed and related characterization theorems are
proved. The proposed model fits the data well. Two production seasons of bulb crop
garlic are compared from estimated model parameters. We estimate end point of a
distribution based on one-sided convergence and minimise bias by averaging upper
and lower almost sure estimates. Convergence properties of the estimators are also
investigated.

Keywords Power law · Bulb crop · K-S distance · Extended uniform distribution ·
One-sided convergence · Non-regular analysis

MS subject classification: Primary: 62P10 · secondary: 60F15, 60E05

1 Introduction, Preliminaries and the Data Analysis

Garlic (Allium sativum) is a year-round crop with many medicinal properties, grown in
moderate climates. The garlic plant cannot withstand extreme temperatures. Exposure
of dormant cloves or young plants to temperatures of around 20 °C or lower for a
period of time hastens subsequent bulbing. In dry weather conditions, with the increase in
evaporation rate during the Indian summer, plant growth may be substantially affected.
The maximum summer temperature can be as high as 47 °C in Jharkhand, India.
In the first study to assess crop yield, one hundred garlic clove seedlings were
planted in an experimental plot at the Indian Statistical Institute Giridih farm in Jharkhand
on 12 February 2014, in the winter season. The topsoil of the plot had eroded; the plot is part
of a barren land with sandy soil, mixed with ‘dhoincha’ (Sesbania
bispinosa) plant compost manure so as to make survival of plants easier in the

R. Dasgupta
Theoretical Statistics and Mathematics Unit, Indian Statistical Institute, 203 B T Road,
Kolkata 700108, India
e-mail: ratandasgupta@gmail.com; rdgupta@isical.ac.in

© Springer International Publishing AG 2017 183


R. Dasgupta (ed.), Growth Curve Models and Applications, Springer Proceedings
in Mathematics & Statistics 204, DOI 10.1007/978-3-319-63886-7_9
184 R. Dasgupta

infertile plot of land. Each row had ten plants, with a plant-to-plant distance
of 15 cm. There were ten rows, 30 cm apart. A small amount
of vermicompost manure was also provided in the experimental plot. Out of 100
plantations, 85 resulted in healthy garlic plants with positive yields at maturity.
The remaining 15 plants gave no yield (the number of healthy
plants mentioned in an earlier report (Dasgupta 2015) contained a typo).
A crop like garlic can be worthwhile to cultivate in the harsh environment
of Giridih, Jharkhand, if adequate fertilizers (e.g., DAP, organic manure) are
administered and additional care, such as regular irrigation and loosening of the soil near
the plants, is taken.
In a follow-up study undertaken in the subsequent year, the growth scenario is seen
to improve as a result of bringing forward the time of garlic cultivation, i.e., early
winter plantation of seedlings. At the same time, the other concerns, such as land fertility
and plant care, were also attended to in order to increase yield.
Garlic bulbs are usually divided into numerous fleshy sections called cloves. The
numbers of cloves produced in the two years are given in Table 1. Since the
yield depends on the number of cloves in each bulb, it is of interest to study the
distribution of the number of cloves over bulbs.
In Fig. 1 we plot the observed cumulative frequency distribution $nF_n(x)$ of the number
of cloves in log-log scale. An approximately linear relationship suggests the
following model for the c.d.f.

$$F(x) = (x/\theta)^{\alpha}, \quad \alpha > 0, \quad x = 1, 2, 3, \ldots, \theta \qquad (1)$$

For the observed data in the year 2013–2014, the slope and intercept of the least
squares linear fit are 0.06912 and 4.34809 respectively, with correlation coefficient
r = 0.9847.
One may take $\hat\theta = x_{(n)} = 4$, the maximum of the observations, and $\hat\alpha = 0.06912$,
the slope of the least squares fitted line.
One may estimate the value of θ from the intercept of the regression line as well: under
model (1), $\log(nF_n(x)) \approx \log n - \alpha \log\theta + \alpha \log x$, so the intercept
estimates $\log n - \alpha \log\theta$.
The value $\tilde\theta = 3.93$ is pretty close to the m.l.e. $\hat\theta = x_{(n)} = 4$.
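The fit quoted above can be retraced with a short sketch. It assumes natural logarithms and uses clove frequencies tallied here from the 2013–2014 column of Table 1 (77, 5, 1 and 2 bulbs with 1, 2, 3 and 4 cloves); the function names are illustrative and are not taken from the original analysis.

```python
import math

# Clove frequencies for 2013-2014, tallied from Table 1 (85 plants with positive yield).
# Assumption: the log-log fit uses natural logarithms and one point per distinct x.
clove_counts = {1: 77, 2: 5, 3: 1, 4: 2}


def loglog_fit(counts):
    """Least squares fit of log(n F_n(x)) on log(x); returns (slope, intercept, r)."""
    xs, ys, running = [], [], 0
    for x in sorted(counts):
        running += counts[x]              # cumulative frequency n F_n(x)
        xs.append(math.log(x))
        ys.append(math.log(running))
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx, sxy / math.sqrt(sxx * syy)


def theta_from_intercept(n, slope, intercept):
    """Solve intercept = log(n) - slope * log(theta) for theta."""
    return math.exp((math.log(n) - intercept) / slope)


slope, intercept, r = loglog_fit(clove_counts)
n = sum(clove_counts.values())
theta_tilde = theta_from_intercept(n, slope, intercept)
print(round(slope, 5), round(intercept, 5), round(r, 4), round(theta_tilde, 2))
```

Under these assumptions the sketch reproduces the slope, intercept, correlation coefficient and intercept-based estimate of θ reported in the text to the stated precision.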
The continuous version of the distribution in (1) may be written as

$$G(y) = (y/\theta)^{\alpha}, \quad \alpha > 0, \quad y \in (0, \theta] \qquad (2)$$

A scaled and suitably transformed version of the variable, $(Y/\theta)^{\alpha}$, is uniform over
the range (0, 1). As such we may term such a distribution the extended uniform distribution.
The above model resembles a power law, but has a positive exponent α; the support of
the variable has an unknown upper bound.
The maximum likelihood estimates of the parameters in (2) based on n iid
observations $y_i$, $i = 1, 2, 3, \ldots, n$, are $\hat\theta = y_{(n)}$, the maximum observation, and
$\hat\alpha = \left[-\frac{1}{n}\sum_{i=1}^{n} \log\left(\frac{y_i}{\hat\theta}\right)\right]^{-1}$.
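As an illustration of these estimators, the following sketch draws a synthetic sample from model (2) by inverse transform and evaluates the two formulas. The sample size, the seed and the true values θ = 4, α = 0.5 are arbitrary assumptions chosen for the illustration; they are not the garlic data.

```python
import math
import random


def sample_extended_uniform(n, theta, alpha, rng):
    """Inverse-transform sampling: G(y) = (y/theta)**alpha gives y = theta * u**(1/alpha)."""
    # 1 - rng.random() lies in (0, 1], which avoids y = 0 and hence log(0) below.
    return [theta * (1.0 - rng.random()) ** (1.0 / alpha) for _ in range(n)]


def mle_extended_uniform(ys):
    """theta_hat is the sample maximum; alpha_hat = 1 / mean(-log(y_i / theta_hat))."""
    theta_hat = max(ys)
    mean_neg_log = -sum(math.log(y / theta_hat) for y in ys) / len(ys)
    return theta_hat, 1.0 / mean_neg_log


rng = random.Random(0)  # fixed seed, purely illustrative
ys = sample_extended_uniform(20000, theta=4.0, alpha=0.5, rng=rng)
theta_hat, alpha_hat = mle_extended_uniform(ys)
print(theta_hat, alpha_hat)
```

The estimate of θ can never exceed the true end point, which is the one-sided behaviour of the sample maximum discussed later in the chapter.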
Model Selection and Validation in Agricultural Context … 185

Table 1 Garlic data for 2013–2014 and 2014–2015


Year 2013–2014 Year 2014–2015
Plant no Weight (gm) No. of cloves Weight (gm) No. of cloves
1 0.77 2 1.53 1
2 1.37 1 1.74 3
3 0.4 1 1.62 1
4 0.5 1 0.8 1
5 0.15 1 1.95 9
6 Nil Nil 3 1
7 Nil Nil 1.75 5
8 0.7 4 4.65 6
9 0.25 1 2.03 1
10 Nil Nil 1.39 1
11 0.35 1 0.83 6
12 0.32 1 0.39 1
13 0.5 1 2.23 1
14 0.17 1 0.52 1
15 0.1 1 2.92 5
16 Nil Nil 6.43 14
17 Nil Nil 3.82 8
18 Nil Nil 5.05 12
19 Nil Nil 4 11
20 0.27 1 6.65 3
21 Nil Nil 1.22 3
22 0.37 1 1.2 1
23 0.7 1 2.25 1
24 2.35 1 1.72 1
25 0.7 3 0.93 1
26 0.2 1 2.13 7
27 0.5 1 0.55 2
28 0.43 1 9.62 16
29 0.13 1 2.05 1
30 0.18 1 4.37 7
31 0.85 1 1.4 1
32 0.22 1 2.6 9
33 0.55 1 2.07 2
34 Nil Nil 2.4 4
35 0.55 1 5.85 11
36 Nil Nil 4.67 6
37 0.65 1 2.22 1
38 0.2 1 6.4 12
39 0.32 1 3.5 12
40 0.2 1 4.7 6
41 0.07 1 1.45 3
42 0.12 1 3.23 8
43 0.52 1 3.95 9
44 Nil Nil 2.6 2
45 Nil Nil 3 4
46 0.15 1 4.12 6
47 0.57 1 2.22 2
48 Nil Nil 1.68 1
49 Nil Nil 3.22 1
50 0.87 2 8.05 16
51 0.42 1 3.42 9
52 0.3 1 0.65 1
53 0.37 1 11 10
54 0.35 1 9.77 16
55 0.3 1 9.25 18
56 0.7 1 5.85 16
57 0.7 1 4.6 9
58 0.55 1 4.69 8
59 0.85 1 1.65 1
60 0.68 1 Nil Nil
61 0.45 1 3.9 2
62 0.2 2 5.55 17
63 0.58 1 3.22 6
64 0.33 1 1.95 1
65 0.35 1 4.18 2
66 0.5 1 5.82 19
67 0.65 2 2.72 5
68 0.75 1 3.42 1
69 Nil Nil 5.32 11
70 1.68 4 4.94 3
71 0.25 1 3.15 1
72 0.45 1 0.85 1
73 0.85 1 3.82 2
74 0.35 1 3.32 1
75 0.15 1 2.22 7
76 0.47 1 3.4 9
77 0.4 1 1.72 2
78 0.25 1 8.65 11
79 0.35 1 5.4 11
80 1.08 1 9.72 11
81 0.38 1 1.5 8
82 0.5 1 1.43 2
83 0.42 1 2.5 1
84 0.2 1 Nil Nil
85 0.7 1 5.75 9
86 0.55 2 4.53 4
87 0.35 1 2.3 1
88 0.7 1 1.83 1
89 0.16 1 2 1
90 0.53 1 7.2 12
91 0.47 1 0.72 1
92 0.17 1 2.02 1
93 0.3 1 2.57 3
94 0.65 1 3.52 2
95 0.37 1 5.32 11
96 1.12 1 2.83 3
97 0.5 1 2.62 12
98 0.45 1 3.04 2
99 0.15 1 1.72 3
100 0.47 1 3.07 6

 
A discretized version, the nearest integer $z = \lfloor y + \frac{1}{2} \rfloor$ of the continuous variable
y within the range (0, θ], with c.d.f. G given in (2), is of interest. The distribution may
be a candidate model to explain the ‘number of cloves per garlic bulb’.
Growth of underground garlic-bulb is a continuous process over time. The number
of cloves is a discrete variable depending on the continuous development process of
a garlic-bulb as this grows in weight and size over lifetime. The innermost cloves
grown near the main stem are relatively new; these gradually expand towards outer
periphery over time.
Assuming that the observations z are discretized values of y, the estimates
in terms of the y values are $\hat\theta = y_{(n)} = 4$ and $\hat\alpha = 0.0362$.
Model fit may be assessed by the chi-square goodness-of-fit test.
Fig. 1 Model fit for garlic cloves (2013–2014); horizontal axis: log (x), vertical axis: log (nFn(x))
Fig. 2 Model fit for garlic cloves (2014–2015); horizontal axis: log (x), vertical axis: log (nFn(x))

The values of the chi-square statistic are 1.1876 and 5.8719 respectively in the two above
cases, viz. the regression based estimates and the m.l.e.; with 1 d.f., the p-values
are 0.2758 and 0.0154 respectively.
Thus the first model, with $\hat\theta = x_{(n)} = 4$, the maximum of the observations, and
$\hat\alpha = 0.06912$, the slope of the regression line, seems plausible.
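The two chi-square values can be retraced with the sketch below. It assumes that the four classes are the clove counts 1 to 4 with observed frequencies 77, 5, 1 and 2 (tallied here from Table 1), that expected frequencies come from model (1) with θ = 4, and that the p-value is the upper tail of the chi-square distribution with 1 d.f.; the function names are illustrative.

```python
import math

# Observed clove frequencies for 2013-2014, tallied from Table 1.
observed = {1: 77, 2: 5, 3: 1, 4: 2}


def chi_square_stat(obs, theta, alpha):
    """Pearson chi-square against F(x) = (x/theta)**alpha on x = 1, ..., theta."""
    n = sum(obs.values())
    stat, prev_cdf = 0.0, 0.0
    for x in sorted(obs):
        cdf = (x / theta) ** alpha
        expected = n * (cdf - prev_cdf)   # expected frequency of the class {x}
        stat += (obs[x] - expected) ** 2 / expected
        prev_cdf = cdf
    return stat


def chi2_sf_1df(x):
    """Upper tail probability of the chi-square distribution with 1 d.f."""
    return math.erfc(math.sqrt(x / 2.0))


stat_reg = chi_square_stat(observed, theta=4, alpha=0.06912)  # regression based alpha
stat_mle = chi_square_stat(observed, theta=4, alpha=0.0362)   # m.l.e. based alpha
print(round(stat_reg, 4), round(chi2_sf_1df(stat_reg), 4))
print(round(stat_mle, 4), round(chi2_sf_1df(stat_mle), 4))
```

With 4 classes and 2 estimated parameters, 4 − 2 − 1 = 1 d.f. remains, which is why the single-degree-of-freedom tail is used here.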
Another set of 100 garlic seed cloves was planted on 5 December 2014 in
comparatively fertile land near the riverside; see the yield data of the year 2014–2015 in Table 1.
The crop produced a maximum of 19 cloves among a total of 98 healthy surviving
plants. In Fig. 2, the slope and intercept of the least squares linear fit are 0.4005 and 3.4382
respectively, with correlation coefficient r = 0.9956. One may take $\hat\theta = x_{(n)} = 19$,
the maximum of the observations, and $\hat\alpha = 0.4005$, the slope of the least squares regression
line. The estimated value of θ from the intercept of the least squares regression line is
$\tilde\theta = 17.52$, which is again close to the m.l.e. $\hat\theta = x_{(n)} = 19$.
Fig. 3 Empirical CDF of garlic yield and the model (2013–2014); horizontal axis: x, vertical axis: Fn(x)
As before, assuming that the observations z on the number of cloves in the yield of the
year 2014–2015 are discretized values of y, the approximate maximum likelihood
estimates in terms of the y values are $\hat\theta = y_{(n)} = 19$ and $\hat\alpha = 0.108123$.
The values of the chi-square statistic in these two cases, merging the last several classes with no.
of cloves ≥ 10 into a single class, are 7.70469 (corresponding to $\hat\alpha = 0.4005$, obtained
from the slope of the least squares regression line) and 120.324 (corresponding to the m.l.e. of
α in the continuous version of the variable) respectively; with 10 − 2 − 1 = 7 d.f., the p-values
are 0.3594 and < 0.0001 respectively.
In this case the model with $\hat\theta = x_{(n)} = 19$, the maximum of the observations, and
$\hat\alpha = 0.4005$, obtained from the slope of the least squares regression line, seems plausible.
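The p-values for 7 d.f. can be checked without statistical tables. The sketch below uses the standard closed form for the upper tail of a chi-square distribution with an odd number of degrees of freedom; the function name is illustrative, and the two statistics are simply the values quoted in the text.

```python
import math


def chi2_sf_odd_df(x, df):
    """Upper tail of chi-square with odd df:
    erfc(sqrt(x/2)) + sqrt(2/pi) * exp(-x/2) * sum_{r=1}^{(df-1)/2} x**(r-1/2) / (1*3*...*(2r-1))."""
    if df < 1 or df % 2 != 1:
        raise ValueError("df must be a positive odd integer")
    tail = math.erfc(math.sqrt(x / 2.0))
    term = math.sqrt(2.0 * x / math.pi) * math.exp(-x / 2.0)  # r = 1 term of the sum
    for r in range(1, (df - 1) // 2 + 1):
        tail += term
        term *= x / (2 * r + 1)           # move from the r-th to the (r+1)-th term
    return tail


p_regression = chi2_sf_odd_df(7.70469, 7)   # statistic for the regression based alpha
p_mle = chi2_sf_odd_df(120.324, 7)          # statistic for the m.l.e. based alpha
print(round(p_regression, 4), p_mle)
```

For df = 1 the sum is empty and the function reduces to the complementary error function alone.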
The new model, the ‘extended uniform distribution’, proposed for the bulb crop garlic
on the basis of the first data set, is therefore validated on the data set of the subsequent year. The
results from the model are close to those obtained from the growth experiment.
A similar model may be postulated for the weight of garlic bulbs. The weight of a bulb
consisting of cloves may be taken as proportional to the number of cloves in it to a first
approximation, as the former is approximately equal to the weight of a typical clove
multiplied by the number of cloves. Model fit from maximum likelihood considerations
provides α = 0.5659 for the year 2013–2014 and α = 0.7267 for the year
2014–2015, with $\hat\theta$ as the maximum weight of bulb in that data set; $\hat\theta$ = 0.87 g and 5.85 g
for the years 2013–2014 and 2014–2015 respectively.
Yet other estimates of θ are available from the intercept of the regression line as
$\tilde\theta$ = 0.63 g and 22.39 g for the years 2013–2014 and 2014–2015 respectively. Observe
that in all the cases mentioned above, α ∈ (0, 1).
Fig. 4 Empirical CDF of garlic yield and the model (2014–2015); horizontal axis: x, vertical axis: Fn(x)

Figure 3 shows the theoretical and empirical c.d.f. of the garlic weight for the year 2013–2014.
(The empirical c.d.f. is drawn as a slightly modified smooth curve joining the mid-points of the
jumps, instead of the traditional step function; this modification does not change the computation
of distances much, as the steps are of magnitude 1/n, but it smooths the zig-zag look a little,
especially for convergence to a continuous c.d.f.)
The maximum vertical distance between the two curves is 0.3713, and the value of the
Kolmogorov-Smirnov (KS) statistic is $\sqrt{85} \times 0.3713 = 3.42$; this is significant even
at the 0.5% level.
Figure 4 shows the comparison of the model with the empirical c.d.f. of garlic weight for
the year 2014–2015. The value of the KS statistic is $\sqrt{98} \times 0.2569 = 2.54$. Although
the second value is lower than the first one, it is still greater than the 0.5%
level KS value 1.73. Garlic weight data for the two years thus indicate a bad fit to the model,
in spite of the very good fit for the number of garlic cloves. We shall come back to this point
later.
In Fig. 5 we plot the empirical c.d.f. of the garlic yield for two consecutive years. The
value of the nonparametric two sample KS statistic is
$\sqrt{\frac{85 \times 98}{85 + 98}} \times 0.8327 = 5.618$, which
is highly significant. Thus the productions of garlic are markedly distinct for the two
years.
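The scaled KS statistics quoted above follow from the sample sizes and the maximum vertical distances. The sketch below recomputes them and also evaluates the asymptotic Kolmogorov tail probability, which is assumed here to be the source of the 0.5% level value 1.73; the function names are illustrative.

```python
import math


def ks_one_sample(n, d):
    """Scaled one-sample KS statistic sqrt(n) * D_n."""
    return math.sqrt(n) * d


def ks_two_sample(n1, n2, d):
    """Scaled two-sample KS statistic sqrt(n1 * n2 / (n1 + n2)) * D."""
    return math.sqrt(n1 * n2 / (n1 + n2)) * d


def kolmogorov_sf(x, terms=100):
    """Asymptotic tail P(K > x) = 2 * sum_{k>=1} (-1)**(k-1) * exp(-2 k^2 x^2), for x > 0."""
    return 2.0 * sum((-1) ** (k - 1) * math.exp(-2.0 * k * k * x * x)
                     for k in range(1, terms + 1))


ks_2013 = ks_one_sample(85, 0.3713)       # garlic weight, 2013-2014
ks_2014 = ks_one_sample(98, 0.2569)       # garlic weight, 2014-2015
ks_both = ks_two_sample(85, 98, 0.8327)   # comparison of the two seasons
print(round(ks_2013, 2), round(ks_2014, 2), round(ks_both, 3))
print(kolmogorov_sf(1.73), kolmogorov_sf(ks_2014))
```

All three statistics exceed 1.73, so each comparison is significant at the 0.5% level under the asymptotic distribution.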
Figures 6 and 7 explore the model fit to the empirical c.d.f. in log-log scale. Although
the curve in the middle is close to the straight line representing the model, deviation
from the model is prominent towards both extremes in the two data sets.
These features of the figures indicate that the implicitly made model fitting assumption,
viz. that the weight of a garlic bulb equals the number of cloves multiplied by the weight of
a typical clove, may be a good approximation for the middle segment of the data
set of weights, but departs from the model towards data points representing extreme
weights.
Fig. 5 Empirical CDF of garlic yield for two seasons; horizontal axis: x, vertical axis: Fn(x)
Fig. 6 Model fit for empirical CDF (2013–2014); horizontal axis: log (x), vertical axis: log (Fn(x))

The growth curve of garlic with lowess regression (f = 2/3) for the year 2014–2015
is shown in Fig. 8.
Spline regression in SPlus with smooth.spline and spar = 0.001 provides Fig. 9
as the growth curve of garlic. The curve is relatively smooth compared to the previous
curve.
Fig. 7 Model fit for empirical CDF (2014–2015); horizontal axis: log (x), vertical axis: log (Fn(x))
Fig. 8 Growth curve of bulb crop for the year 2014–2015; horizontal axis: lifetime (days), vertical axis: garlic weight (gm)

In Sect. 2 we prove some characterization theorems based on a linear relationship
of conditional quantiles. The results have implications for parameter estimation. In
Sect. 3, we discuss strong convergence of one-sided estimators to a parameter from
above/below.
Fig. 9 Growth curve of bulb crop for the year 2014–2015 (spline); horizontal axis: life time (day), vertical axis: garlic weight (gm)

2 Characterization of Extended Uniform Distribution

We first prove the following.

Theorem 1 Let X be a random variable with support (0, θ), θ > 0, and distribution
function F. Denote by c = c(p) the unrestricted p-th quantile of X (> 0), and
consider p in a (small) dense neighborhood $A_0$ of 1 (e.g., $p \in A_0 = (1 - \epsilon, 1) \cap Q$,
where ε > 0 is small and Q is the set of rational numbers). Then the p-th quantile of the
distribution, $p \in A_0$, under the restriction $X < x_0$ (< θ) is $c x_0/\theta$ iff F is the extended
uniform distribution function (1.2).

Proof Consider the distribution function of the scaled variable with θ = 1,

$$F(x) = x^{\alpha}, \quad 0 < x < 1, \quad \alpha > 0 \qquad (3)$$

The p-th quantile of the distribution is at $p^{1/\alpha}$. Denote $g(x) = \log F(x) =
\alpha \log x \downarrow -\infty$ as $x \downarrow 0$. The c.d.f. of the variable, given that $X < x_0$ (< 1), then turns
out to be $F(x)/F(x_0)$, and one may write
$P(X < x \mid X < x_0) = \frac{F(x)}{F(x_0)} = \left(\frac{x}{x_0}\right)^{\alpha}$.
Equating this to p, we obtain the new p-th quantile of the random variable bounded
above by the threshold $x_0$ as $c x_0$, where $c = p^{1/\alpha}$ is the p-th quantile of the
unrestricted random variable X (< 1).
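The scaling property just derived is easy to check numerically. The sketch below assumes the extended uniform form $F(x) = (x/\theta)^{\alpha}$; the parameter values and function names are arbitrary choices made for the illustration.

```python
import math


def cdf(x, theta, alpha):
    """Extended uniform c.d.f. F(x) = (x/theta)**alpha on (0, theta]."""
    return (x / theta) ** alpha


def quantile(p, theta, alpha):
    """Unrestricted p-th quantile c(p) = theta * p**(1/alpha)."""
    return theta * p ** (1.0 / alpha)


def restricted_quantile(p, x0, theta, alpha):
    """p-th quantile given X < x0: the x solving F(x) / F(x0) = p."""
    return quantile(p * cdf(x0, theta, alpha), theta, alpha)


checks = []
for theta, alpha, x0 in [(1.0, 0.4, 0.6), (4.0, 0.07, 3.0), (19.0, 0.4, 12.5)]:
    for p in (0.5, 0.9, 0.99):
        c = quantile(p, theta, alpha)
        lhs = restricted_quantile(p, x0, theta, alpha)
        rhs = c * x0 / theta              # value claimed in Theorem 1
        checks.append(math.isclose(lhs, rhs, rel_tol=1e-9))
print(all(checks))
```

With θ = 1 the right-hand side reduces to $c x_0$, the form used in the proof.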
Conversely, assume that the property of a constant multiplicative factor between restricted and
unrestricted quantiles holds for a dense set of quantiles corresponding to p ∈ (0, 1), p rational.
Suppose that the new p-th quantile of the random variable X under the restriction $X < x_0$
is at $c x_0$, where c is independent of $x_0$. Indeed c is the p-th quantile of the original
unrestricted random variable, as seen by taking $x_0 \uparrow 1$.
Next, write

$$e^{g(c x_0) - g(x_0)} = \frac{F(c x_0)}{F(x_0)} = p \qquad (4)$$

This provides

$$g(c x_0) - g(x_0) = -k \qquad (5)$$

where $k = -\log p$ (> 0).


Taking $x_0 \uparrow 1$ in (5) gives $g(c) = -k$, since $g(1) = \log F(1) = 0$.
Thus $g(c^2) = g(c) - k = -2k$, $g(c^3) = -3k, \ldots, g(c^m) = -mk$. Hence $g(x) =
\log F(x) = \alpha \log x$, where $\alpha = -k/\log c$, at the points $x = c, c^2, \ldots, c^m, \ldots$; $c \in (0, 1)$.
This specifies the distribution function F to be extended uniform on a dense set
$x = c, c^2, \ldots, c^m, \ldots$ of (0, 1). For an arbitrary real number z ∈ (0, 1), there exist
an integer m and $c = p^{1/\alpha}$, $p \in Q \cap (0, 1)$, such that $c^m$ is arbitrarily close to the number
z, where Q is the set of all rational numbers. Next, from right continuity of the distribution
function, the form of F is extended uniform at z, where z ∈ (0, 1) is arbitrary.
Finally, a dense choice of p in a small neighborhood of 1, e.g.,
$p \in A_0 = (1 - \epsilon, 1) \cap Q$ with ε > 0 small, suffices for the Theorem to hold, as the
resultant sequence $\{c^m : m = 1, 2, 3, \ldots\}$ still spans a dense support of the variable.
For the general case let the supremum of the possible values of X be θ (> 0). The
distribution function F with maximum value θ is then

$$F(x) = (x/\theta)^{\alpha}, \quad x \in (0, \theta), \quad \alpha > 0 \qquad (6)$$

One may then consider the transformed random variable X/θ ∈ (0, 1). Proceeding
as before, the characterization of Theorem 1 holds.
Characterization theorems for discrete random variables
Consider a random variable X with range either $N_0$, the set of nonnegative integers,
or the set of positive integers $N_1 = N_0 - \{0\}$. Let the cumulative distribution function
of X be denoted by $F(x) = P(X \le x)$; it is enough to define F at integer values.
For p ∈ (0, 1) the p-th quantile of F is defined as $F^{-1}(p) = \inf\{x : F(x) \ge p\}$.
The following theorem is the counterpart of Theorem 1 stated for discrete random
variables.
Theorem 2 Let X be a random variable with support $N_1 \cap [0, \theta]$, where θ is an
arbitrary positive integer, and let $F(x) = P(X \le x)$ be the distribution function. Let the
p-th quantile of the distribution under the restriction $X \le x_0$ ($x_0 \in N_1$, $x_0 \le \theta$) be $c x_0$,
where $c \in N_1$ is the unrestricted p-th quantile of X. The above property holds for
all p of the form $p = p_i = \sum_{j=1}^{i} P(X = j)$, $i = 1, 2, 3, \ldots, \theta$, iff $F(x) = (x/\theta)^{\alpha}$ for
some α > 0, where $x \in N_1$.
Proof The proof of Theorem 2 follows similar lines to that of Theorem 1. The one-way
implication of the Theorem is easy to see. Consider the ‘only if’ part.
Steps similar to (4)–(5) hold. The variable X has support $N_1 \cap [0, \theta]$. This set
is the same as the set $\{c, c^2, \ldots, c^m, \ldots\} \cap [0, \theta]$, where c = c(p) is the p-th quantile
of X, and p is of the form $p = p_i = \sum_{j=1}^{i} P(X = j)$, $i = 1, 2, 3, \ldots, \theta$. The p-th
quantile is then an integer, as the jumps of F occur at integer points. For example,
when $F(x) = (x/\theta)^{\alpha}$, the p-th quantile c = c(p) is obtained as the solution i of the
equation $p = p_i = \sum_{j=1}^{i} P(X = j) = (i/\theta)^{\alpha}$.
Over the set $N_1 \cap [0, \theta]$, the characterization for $g(x) = \log F(x) = \alpha \log(x/\theta)$ is
seen to hold in a similar fashion to Theorem 1.

3 One Sided Estimation for Upper End Point

Conventional estimators of a parameter usually fluctuate around the unknown value
of the parameter. An estimator $T_n$ of the unknown parameter θ ∈ R is said to converge
to θ from above (below) if $T_n \ge \theta$ ($T_n \le \theta$) for all sufficiently large sample sizes n
and $T_n \to \theta$ a.s., as n → ∞.
One-sided convergence from above is denoted as $T_n \to_{+} \theta$ a.s., and one-sided
convergence from below is denoted as $T_n \to_{-} \theta$ a.s.
With an application of the Marcinkiewicz-Zygmund strong law of large numbers
(MZSLLN), the estimation problem for the mean $\theta = E_{F_\theta}(X)$ from above/below has
been addressed by Gilat and Hill (1992). Observe that $\bar{X}_n$ is the natural estimator for
θ = E X. But $\bar{X}_n$ fluctuates above and below θ, although $\bar{X}_n \to \theta$ a.s., as n → ∞.
Estimation problems of this kind are considered in Dasgupta (2007) when θ is
a finite end point of the distribution function. The Borel-Cantelli lemma
and properties of extreme order statistics are some of the tools used to obtain the
results.
One-sided convergence may be useful while estimating the unknown variance of
a random variable, for which the estimator should be non-negative. Level of flood
water is another example. In such cases, one may like to estimate the parameter from
above. As for some other examples, consider estimating the strength of a dam or
bridge. One may like to estimate the unknown strength conservatively from below,
to have a protection from probable disaster.
The maximum observations in the two sets of garlic clove data for the years 2013–2014
and 2014–2015 are $X_{(n)} = 4, 19$ respectively, with n = 85, 98 over the two production
seasons. The underlying models proposed are (1)–(2). Large garlic bulbs with many
cloves have a market value. We wish to estimate θ, the maximum number of cloves,
both from above and from below.
The following result is stated in Galambos (1978).
Result. Let F be continuous; then $P[X_{(n)} \le F^{-1}(1 - \delta \frac{\log\log n}{n}) \text{ i.o.}] = 0$, δ > 1.
The above relationship may be inverted to conclude $X_{(n)} > F^{-1}(1 - \delta \frac{\log\log n}{n}) =
\theta - \beta_n$ (say) a.s., as n → ∞;
i.e., $X_{(n)} + \beta_n \to_{+} \theta$ a.s., as n → ∞.
For the form of F given in (2) we have $\beta_n = \delta \frac{\log\log n}{\alpha n}$, providing the amount of
perturbation to be added to $X_{(n)}$ for upper convergence to θ, a.s.
Application of the Borel-Cantelli lemma provides the following sharper result on upper
and lower convergence to the end point θ, in place of Proposition 1 and Proposition
3 of Dasgupta (2007).

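Theorem 3 Let $X_1, \ldots, X_n$ be iid random variables with distribution $F = F_\theta$,
where $\theta = \sup\{x : F(x) < 1\} < \infty$, and let the functional form of F be known near
the right tail of the distribution.

(i) Let $\alpha_n = \theta - F^{-1}(1 - \epsilon_n) = F^{-1}(1) - F^{-1}(1 - \epsilon_n)$, where $F^{-1}(a) = \inf\{x :
F(x) \ge a\}$ and $\epsilon_n = \frac{\log\{n(\log n)^{\delta}\}}{n} \to 0$, as n → ∞, δ > 1. Then,

$$\hat\theta_{+} = X_{(n)} + \alpha_n \to_{+} \theta \ \text{a.s., as } n \to \infty, \qquad (7)$$

(ii) Let $\alpha_n^* = \theta - F^{-1}(1 - \epsilon_n^*) = F^{-1}(1) - F^{-1}(1 - \epsilon_n^*)$, and $\epsilon_n^* = n^{-2}
(\log n)^{-\delta} \to 0$, as n → ∞, where δ > 1. Then,

$$\hat\theta_{-} = X_{(n)} + \alpha_n^* \to_{-} \theta \ \text{a.s., as } n \to \infty, \qquad (8)$$

Thus,

$$P_{F_\theta}(X_{(n)} + \alpha_n^* < \theta < X_{(n)} + \alpha_n) = 1 \qquad (9)$$

for all sufficiently large n.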

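Proof Consider,

$$P(X_{(n)} \le d_n) = F^n(d_n) < e^{-n\{1 - F(d_n)\}} < 1/\{n(\log n)^{\delta}\} \quad \text{if } 1 - F(d_n) > \frac{\log\{n(\log n)^{\delta}\}}{n},\ \delta > 1,$$

i.e., if $F(d_n) < 1 - \frac{\log\{n(\log n)^{\delta}\}}{n}$, i.e., if $d_n < F^{-1}\left(1 - \frac{\log\{n(\log n)^{\delta}\}}{n}\right)$.
In such a situation, $\sum_n P(X_{(n)} \le d_n) \le \sum_n 1/\{n(\log n)^{\delta}\} < \infty$ and therefore by Borel-Cantelli lemma one gets $P(X_{(n)} \le d_n \text{ i.o.}) = 0$.
Hence, $X_{(n)} > d_n > F^{-1}\left(1 - \frac{\log\{n(\log n)^{\delta}\}}{n}\right)$ a.s., as $n \to \infty$.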
If the functional form of F is known (at least near the right tail) then one can invert the above relation to obtain,

$$X_{(n)} > \theta - \alpha_n \ \text{a.s., as } n \to \infty,$$

where $\alpha_n \to 0$, as $n \to \infty$, since $\frac{\log\{n(\log n)^{\delta}\}}{n} \to 0$.
Thus $X_{(n)} + \alpha_n \to^{+} \theta$ a.s., as $n \to \infty$.

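For (ii), note that $\alpha_n^* \to 0$, as $\epsilon_n^* = n^{-2}(\log n)^{-\delta} \to 0$; $n \to \infty$. Next write,

$$\sum_n P(X_{(n)} + \alpha_n^* > \theta) = \sum_n \left[1 - F^n(\theta - \alpha_n^*)\right] \le \sum_n n^{-1}(\log n)^{-\delta} < \infty,\ \delta > 1,$$

if $F(\theta - \alpha_n^*) > \left(1 - n^{-1}(\log n)^{-\delta}\right)^{1/n} \approx 1 - n^{-2}(\log n)^{-\delta}$, i.e., if $\alpha_n^* \le \theta - F^{-1}\left(1 - n^{-2}(\log n)^{-\delta}\right)$. Now use Borel-Cantelli lemma to claim (8). Result (9) then follows from (7) and (8).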
For the model (2), $\alpha_n = \frac{\log\{n(\log n)^{\delta}\}}{\alpha n}$ and $\alpha_n^* = \{\alpha n^2 (\log n)^{\delta}\}^{-1}$.
Model Selection and Validation in Agricultural Context … 197

Point estimate of upper end point


The usual estimator $X_{(n)}$ underestimates θ, the upper end point of the non-regular model (2) with discontinuous likelihood function. Non-regularity is caused by dependence of the boundary on unknowns. Asymptotic analysis, as presented in Ibragimov and Hasminskii (1981), covers a number of such models. These frequently arise in real life problems including econometrics; see e.g., Chernozhukov and Hong (2004) on auction models and equilibrium job-search models with a jump of density at start. Hall and Wang (2005) considered an estimation problem with empirical prior distribution based on two extreme order statistics to estimate the lower end point of a distribution.
Our approach is based on one sided convergence. Results proved in the previous section state that the type of distribution and the shape parameter α remain the same under upper censoring of the data, say up to $X_{(n-1)}$, $X_{(n-2)}$ etc., thus providing more than one estimate of α; these can be combined by standard methods into a pooled estimate of α. This is required in computing $\alpha_n$ and $\alpha_n^*$.
For the model (2), one may then consider the midpoint of the interval in (9), viz.,
$$\tilde\theta = X_{(n)} + \frac{\log\{n(\log n)^{\delta}\}}{2\alpha n} + \{2\alpha n^2 (\log n)^{\delta}\}^{-1}$$
as a point estimate of θ.
The estimator is the simple average of a positively biased and a negatively biased estimate of the parameter θ. Other weighted averages of these two estimates may also be considered. The estimator always lies above $X_{(n)}$, the m.l.e. of θ.
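
As an illustration, the interval (9) and its midpoint can be evaluated numerically for model (2). The sketch below is not the author's code: the function name and the values `alpha=1.0` and `delta=1.5` are hypothetical placeholders (α would in practice be the pooled estimate of the shape parameter, and δ is any value above 1), while `x_max=19` and `n=98` are taken from the second garlic data set quoted earlier.

```python
import math

def endpoint_estimates(x_max, n, alpha, delta=1.5):
    """Lower estimate, midpoint and upper estimate of theta under model (2).

    x_max : observed sample maximum X_(n)
    n     : sample size
    alpha : shape parameter of model (2) (hypothetical value in the demo)
    delta : tuning constant, must exceed 1
    """
    if delta <= 1 or n < 3:
        raise ValueError("need delta > 1 and n >= 3")
    log_n = math.log(n)
    # alpha_n = log{n (log n)^delta} / (alpha n): perturbation giving the upper estimate
    a_n = math.log(n * log_n ** delta) / (alpha * n)
    # alpha_n^* = 1 / {alpha n^2 (log n)^delta}: perturbation giving the lower estimate
    a_star = 1.0 / (alpha * n ** 2 * log_n ** delta)
    lower = x_max + a_star
    upper = x_max + a_n
    midpoint = x_max + 0.5 * (a_n + a_star)
    return lower, midpoint, upper

# alpha and delta below are assumptions chosen only for illustration
lower, mid, upper = endpoint_estimates(x_max=19, n=98, alpha=1.0, delta=1.5)
print(lower, mid, upper)
```

The lower perturbation is of order $n^{-2}$ while the upper one is of order $(\log n)/n$, so the midpoint sits much closer to the upper estimate than to $X_{(n)}$.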

References

Chernozhukov V, Hong A (2004) Likelihood estimation and inference in a class of nonregular econometric models. Econometrica 72:1445–1480
Dasgupta R (2007) Almost sure confidence intervals and one-sided estimation in non-regular cases with applications. Calcutta Stat Assoc Bull 59(235–236):163–183
Dasgupta R (2015) Growth curve reconstruction in damaged experiment via nonlinear calibration. In: Growth curve and structural equation modeling: topics from the Indian Statistical Institute, Chapter 7, pp 119–134
Galambos J (1978) The asymptotic theory of extreme order statistics. Wiley
Gilat D, Hill TP (1992) One sided refinements of the strong law of large numbers and Glivenko Cantelli theorem. Ann Probab 20:1213–1221
Hall P, Wang ZJ (2005) Bayesian likelihood methods for estimating the end point of a distribution.
J R Stat Soc B 67:717–729
Ibragimov I, Hasminskii R (1981) Statistical estimation: asymptotic theory. Springer, New York
Longitudinal Growth Curve of Elephant
Foot Yam Under Extreme Stress and Plant
Sensitivity II

Ratan Dasgupta

Abstract Plant sensitivity under extreme stress and the minimal survival environment for yam plants are examined. In order to maximise total yam yield, Dasgupta (2017a) studied longitudinal growths of 60 Elephant-foot-yam plants, 20 for each seed weight 500, 650 and 800 g, with the option of interim yam detachment at either of two time points, along with final harvest on maturity from replanting the stems, in a field experiment conducted in an agricultural farm at Indian Statistical Institute, Giridih, Jharkhand (India) during the year 2016–2017. Detaching yam around four and a half months from sprouting for plants with seed weight 800 g, replanting the remaining stem structure with some roots attached to it, and continuing the experiment till final harvest on maturity was seen to significantly increase the two stage harvest in an agro-climatic environment with minimal survival condition for yam plants. In the experiment conducted, only a few irrigations were given in the peak summer temperature and little manure was administered at the start of the experiment. We now construct almost sure bands of growth curves based on the data from the above mentioned experiment. These indicate that the curves are distinct, and that the yield for seed weight 800 g is superior to that for other seed weights under the induced extreme plant stress. The proliferation rate of yam yield with seed weight 500 g stops above zero towards the end, indicating the possibility of further growth of yam if plant lifetime could be extended. In the case of extreme plant stress under yam detachment at the time of second interim reading, we look for a ‘50 day window’ from sprouting in which the accumulation of underground yam is high, i.e., the area under the proliferation rate curve is high, to identify the time region of high yam growth. This turns out to be the time span [50, 100] days for seed weight 800 g. For the other seed weights, 650 and 500 g, the time span is [100, 150] days. From the peaks of the proliferation rate curves, it appears that the yam deposition rate for seed weight 500 g is about three times that for seed weights 650 and 800 g in the ‘50 day window’ of high deposition. Individual growth trajectories are modeled by a correlated Gaussian process. A test of hypothesis on the parameters of the modeled process indicates the possibility of error components following a Brownian motion.

R. Dasgupta (B)
Theoretical Statistics and Mathematics Unit, Indian Statistical Institute, 203, B T Road,
Kolkata 700108, India
e-mail: ratandasgupta@gmail.com; rdgupta@isical.ac.in

© Springer International Publishing AG 2017 199


R. Dasgupta (ed.), Growth Curve Models and Applications, Springer Proceedings
in Mathematics & Statistics 204, DOI 10.1007/978-3-319-63886-7_10

Keywords Elephant foot yam · Proliferation rate · Ornstein-Uhlenbeck process · Brownian motion

MS subject classification: 62P10 · 62G20 · 60J65

1 Introduction

Elephant-foot-yam plants’ sensitivity under stress is studied in relation to yield. Plant stress may suitably be used for higher yield of yam. The yam plants are found to be stress resistant when cultivated in a harsh agro-climatic environment. Seed weight 800 g is seen to be appropriate for high yield in a study reported in Dasgupta (2017a). When the stress is extreme for plant survival, higher seed weight supports the plant growth at the initial stage, and 800 g of seed weight produced more yam under extreme induced plant stress.
The harsh agro-climatic environment in the field experiment, with only a few irrigations given in the peak of summer at the beginning of the experiment, acted as severe stress. The yam plants were further subjected to interim yam detachment at either of two time points in the growth experiment, and to replantation of the remaining stem structure, as described in Dasgupta (2017a). The first interim yam detachment time was at two and a half months after sprouting, and the second interim detachment time was at four and a half months from sprouting. Yams are detached only once during the experiment from living plants.
For each of the seed weights 500, 650 and 800 g, the second time period for interim yam detachment is seen to be superior for achieving high total yield.
In this study we concentrate on the second interim yam detachment strategy, made at four and a half months from sprouting, and analyse the growth curve in terms of almost sure confidence band, proliferation rate, and estimation of growth curve via mid band to contain fluctuations of curves around the central line. Modeling the error component by a Gaussian process is investigated. We estimate the process parameters from observed data by different techniques, including the method of maximum likelihood and the observed maximum fluctuation of the growth curves. We further test the hypothesis that the error components follow a Brownian motion.

2 Results

We estimate the growth curve from raw data by lowess regression; the technique is described in detail in Dasgupta (2017a), where it is observed that yam detachment at the second interim time period has a distinct advantage over other strategies. Consequently, we shall mainly consider this situation to obtain results on almost sure confidence bands and other topics. It appears that the yam plants are stimulated more by the additional stress of yam detachment at the time of second interim growth recording, when the yam deposition is usually high for larger values of time.
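
For readers unfamiliar with lowess, the core idea can be sketched as a locally weighted linear fit with tricube weights over the nearest fraction f of the points. The code below is a simplified illustration under stated assumptions, not the procedure of Dasgupta (2017a): the function name is hypothetical, the robustness iterations of full lowess are omitted, and the demo data are synthetic. The smoothing fraction f = 0.35 is the value used for the figures in this chapter.

```python
import math

def local_linear_smooth(x, y, f=0.35):
    """Tricube-weighted local linear fit at each x (lowess without robustness steps)."""
    n = len(x)
    k = max(2, math.ceil(f * n))          # number of neighbours used in each local fit
    fitted = []
    for x0 in x:
        dist = [abs(xi - x0) for xi in x]
        h = sorted(dist)[k - 1]           # bandwidth: distance to the k-th nearest point
        if h == 0.0:
            h = 1e-12                     # guard against k tied points at x0
        w = [(1 - (d / h) ** 3) ** 3 if d < h else 0.0 for d in dist]
        sw = sum(w)
        sx = sum(wi * xi for wi, xi in zip(w, x))
        sy = sum(wi * yi for wi, yi in zip(w, y))
        sxx = sum(wi * xi * xi for wi, xi in zip(w, x))
        sxy = sum(wi * xi * yi for wi, xi, yi in zip(w, x, y))
        denom = sw * sxx - sx * sx
        if denom <= 1e-12:
            fitted.append(sy / sw)        # degenerate window: fall back to weighted mean
        else:
            slope = (sw * sxy - sx * sy) / denom
            intercept = (sy - slope * sx) / sw
            fitted.append(intercept + slope * x0)
    return fitted

# synthetic check: a local linear fit reproduces exactly linear data
days = list(range(20))
yield_demo = [2.0 * d + 1.0 for d in days]
smooth_demo = local_linear_smooth(days, yield_demo, f=0.35)
```

In practice one would normally rely on an established implementation, for example the `lowess` function in `statsmodels`, which also performs the robustifying iterations omitted here.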

Fig. 1 Lowess growth curve (Mean) for different seed weights

Figure 1 corresponds to Fig. 16 of Dasgupta (2017a). This exhibits mean growth curve under lowess smoothing. The curve for second interim cut of yam with seed
weight 800 g is superior, if we incorporate the criterion of growth stability towards
the far end of the growth curves; the assertion that the cut at the time of second interim
growth recording of yam with seed weight 800 g is superior becomes clearer from
lowess regression of the curves with f = 0.35. Yam cut at second interim growth
recording with seed weight 800 g corresponds to higher yam yield. The corresponding
curve reaches a stable value of higher yield towards the end of plant lifetime viz.,
beyond 160 days.
Figure 1 also indicates that the bunch of three curves for 800 g seed weight lies
above the other two bunches with seed weight 650 and 500 g; pointing out that in an
environment of extreme stress and harsh agro-climatic condition, higher seed weight
is preferable to achieve higher yield.
In Fig. 2 we show the almost sure band for the three growth curves with seed weight 800 g, corresponding to the top three curves in Fig. 1. Almost sure confidence bands are constructed to cover growth curve with probability one i.e., with certainty, see Dasgupta (2015a) for a general exposition on such bands. These nonparametric almost sure bands are of stronger assertion than conventional model based percentage probability confidence bands. Almost sure bands indicate that the yam growth curves of 800 g in three categories viz., growth curve for plants with yam detachment at first interim growth recording time, curve for plants with yam detachment at second interim growth recording time, and undisturbed yam growth curve, are all distinct. The blue growth curve corresponding to second interim detachment is superior and the final yield is about twice the initial seed weight.

Fig. 2 Almost sure band of yam growth curve seed weight 800 g

Fig. 3 Almost sure band of yam growth curve seed weight 650 g

In Fig. 3 we show the almost sure band for the growth curves with seed weight
650 g. Here again the growth curve corresponding to second interim detachment is
superior, and the final yield is about twice the initial seed weight.
In Fig. 4 we show the almost sure band for the growth curves with seed weight 500 g. Here too the growth curve corresponding to second interim detachment is superior, as in the other two figures for seed weights 650 and 800 g, confirming that the second interim cut is a superior strategy irrespective of seed weight. The final yield with second interim cut rises sharply; it is about thrice the initial seed weight.

Fig. 4 Almost sure band of yam growth curve seed weight 500 g

Fig. 5 Proliferation rate of yam (800 g) with trimmed mean, wt. exp(−.01 x), lowess (horizontal axis: Time (day); vertical axis: Proliferation rate of yam/day)

In Fig. 5 we consider second interim yam detachment with seed weight 800 g. We plot the proliferation rate $\frac{d}{dt}\log y(t) = r(t)$ of the growth curve from observed data by a technique described in Dasgupta (2013), see also Dasgupta (2015b). The curve comes down to zero with little possibility of further yam growth over time, as evident from the relation $y = y(t) = e^{\int_0^t r(s)\,ds}$.
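
The link between the proliferation rate and the growth curve can be checked on a small example. The sketch below is an illustration only: it uses plain finite differences of log y rather than the trimmed-mean and exponentially weighted technique of Dasgupta (2013), the function names are hypothetical, and the time grid and weights are made-up values. An initial value y(0) is carried explicitly, so the reconstruction reads $y(t) = y(0)\,e^{\int_0^t r(s)\,ds}$.

```python
import math

def proliferation_rates(times, y):
    """Raw rates r_j = [log y(t_{j+1}) - log y(t_j)] / (t_{j+1} - t_j)."""
    return [
        (math.log(y[j + 1]) - math.log(y[j])) / (times[j + 1] - times[j])
        for j in range(len(y) - 1)
    ]

def reconstruct(times, rates, y0):
    """Rebuild y from the rates: y(t_k) = y0 * exp(sum_{j<k} r_j * (t_{j+1} - t_j))."""
    out = [y0]
    total = 0.0
    for j, r in enumerate(rates):
        total += r * (times[j + 1] - times[j])   # discrete version of the integral of r
        out.append(y0 * math.exp(total))
    return out

times = [0, 50, 100, 150, 200]                       # days (hypothetical grid)
weights = [800.0, 900.0, 1300.0, 1550.0, 1600.0]     # grams (hypothetical values)
rates = proliferation_rates(times, weights)
rebuilt = reconstruct(times, rates, weights[0])
```

A rate that falls to zero means the exponent stops increasing, which is why a curve ending at zero signals little further growth, while a curve ending above zero leaves room for it.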
In Fig. 6 we show the proliferation rate of the growth curve corresponding to seed weight 650 g, when yam is detached during second interim growth record. Here again, the curve comes down to zero with little possibility of further yam growth with time. The curve is different from that for seed weight 800 g. Peak of the yield proliferation rate in the case of plants with seed weight 650 g is attained at a later time than that for 800 g.

Fig. 6 Proliferation rate of yam (650 g) with trimmed mean, wt. exp(−.01 x), lowess (horizontal axis: Time (day); vertical axis: Proliferation rate of yam/day)

Fig. 7 Proliferation rate of yam (500 g) with trimmed mean, wt. exp(−.01 x), lowess (horizontal axis: Time (day); vertical axis: Proliferation rate of yam/day)

In Fig. 7 we consider second interim yam detachment for plants with seed weight 500 g. The proliferation rate reaches a peak and remains steady for a longer period of time than that for 650 g. The curve terminates well above zero, indicating a high possibility of further substantial yam growth, if plant lifetime could be extended. This interesting phenomenon is revisited in Dasgupta (2017b). There, to estimate the proliferation rate at a time t, rather than taking the median or trimmed mean of raw rates, a weighted average of the raw proliferation rates is used, with a smooth exponentially decaying weight function that down-weights raw rates involving observations distant from t. A three dimensional representation of proliferation rates then provides a deeper insight into the yam growth process with associated variables.

Fig. 8 Band of yam yield: 500 g seed weight, 2nd interim


Under the assumption of symmetric fluctuation of individual yam growth curves around the central curve of mean response, we may draw the minimal band containing all the curves and consider the central line as an estimate of the unknown response curve. This nonparametric procedure may perform well in a number of cases, e.g., see Rider (1957). In Fig. 8 we show the minimal band of yam yield for seed weight 500 g, with the central line as an estimate of growth curve when the interim yam detachment is made at the time of second interim growth recording. The growth curve in red color shows a downward trend in the beginning and then an upward trend from 85 days onward, till the far end.
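
The minimal band and its central line amount to a pointwise minimum, maximum and midrange across the individual curves. A minimal sketch, assuming all curves are recorded on a common time grid; the function name and the three short curves are hypothetical and serve only to show the computation.

```python
def minimal_band(curves):
    """Pointwise lower envelope, central line (midrange) and upper envelope."""
    lower = [min(vals) for vals in zip(*curves)]
    upper = [max(vals) for vals in zip(*curves)]
    central = [(lo + hi) / 2 for lo, hi in zip(lower, upper)]
    return lower, central, upper

# three hypothetical growth curves observed at the same three time points
curves = [
    [1.0, 1.2, 1.8],
    [0.8, 1.5, 2.4],
    [1.1, 1.0, 2.0],
]
lower, central, upper = minimal_band(curves)

# "stretched" band: envelopes measured relative to the central line as base
stretched = [(lo - c, hi - c) for lo, c, hi in zip(lower, central, upper)]
```

By construction the stretched band is symmetric about zero at every time point, so its half-width shows directly how the spread of the curves changes over time.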
The band of yam yield curve of Fig. 8 when stretched, with central line as base
is shown in Fig. 9 for seed weight 500 g. The variation of band is higher for large
values of time.
In Fig. 10 we show the minimal band of yam yield for seed weight 650 g, with the central line as an estimate of growth curve when the interim yam detachment is made at the time of second interim growth recording. The growth curve in red color shows a slight downward trend in the beginning and then an upward trend from 85 days onward.

Fig. 9 Stretched yam yield with band with central line as base: 500 g, 2nd interim

Fig. 10 Band of yam yield: 650 g seed weight, 2nd interim


Fig. 11 Stretched yam yield with band with central line as base: 650 g, 2nd interim

The band of yam yield curve of Fig. 10 when stretched, with central line as base
is shown in Fig. 11 for seed weight 650 g. The variation of band seems to be much
higher at large values of time.
In Fig. 12 we show the minimal band of yam yield for seed weight 800 g with the central line as an estimate of growth curve when the interim yam detachment is made at the time of second interim growth recording. The curve in red color shows little downward trend in the beginning and then an upward trend from 75 days onward.
The band of yam yield curve of Fig. 12 when stretched, with central line as base
is shown in Fig. 13 for seed weight 800 g. The variation of band seems to be higher
at large values of time.
We next model the individual growth curves with a particular seed weight by a Gaussian process, viz., the O-U process over time. We consider the time segment of [65, 150] days, as all plants are alive in this time segment and contribute to the variation of the individual curves. Modeling the error component in the growth curve model for yam plants by an O-U process is also proposed in Dasgupta (2015c).

Fig. 12 Band of yam yield: 800 g seed weight, 2nd interim


Fig. 13 Stretched yam yield with band with central line as base: 800 g, 2nd interim

Recall that the Ornstein-Uhlenbeck process V(s) is a continuous Gaussian Markov process with constant mean and exponentially decaying covariance structure. This satisfies the following differential equation,

$$dV(s) = -\alpha V(s)\,ds + \sigma\,dB(s), \quad \alpha > 0,\ \sigma > 0 \tag{1}$$

The maximum fluctuation of the curve is provided by

$$\lim_{t\to\infty} \left[\frac{\sigma^2}{\alpha}(1 + o(1))\log t\right]^{-1/2} V(t) = 1 \ \text{a.s.} \tag{2}$$

and

$$\lim_{t\to\infty} \left[\frac{\sigma^2}{\alpha}(1 + o(1))\log t\right]^{-1/2} \sup_{0 \le s \le t} |V(s)| = 1, \ \text{a.s.} \tag{3}$$

Hence the fluctuation of the O-U process as seen from (2) and (3) is dependent on the parameter $\sigma/\alpha^{1/2}$.
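
To get a feel for the process in (1), one can simulate it with a simple Euler-Maruyama scheme. This is a sketch for illustration, not part of the original analysis: the function name, step size and the parameter values `alpha=0.1`, `sigma=0.02` are assumptions, and the time span of 85 days merely mirrors the length of the segment [65, 150] used in this chapter.

```python
import math
import random

def simulate_ou(alpha, sigma, v0, t_end, n_steps, rng):
    """Euler-Maruyama path of dV = -alpha * V dt + sigma dB on [0, t_end]."""
    dt = t_end / n_steps
    path = [v0]
    for _ in range(n_steps):
        db = rng.gauss(0.0, math.sqrt(dt))      # Brownian increment over one step
        v = path[-1]
        path.append(v - alpha * v * dt + sigma * db)
    return path

rng = random.Random(0)                           # fixed seed for reproducibility
alpha, sigma = 0.1, 0.02                         # hypothetical parameter values
path = simulate_ou(alpha, sigma, v0=0.0, t_end=85.0, n_steps=850, rng=rng)

max_fluctuation = max(abs(v) for v in path)      # realised sup |V(s)| on the grid
scale = sigma / math.sqrt(alpha)                 # the parameter governing fluctuation

# with sigma = 0 the scheme reduces to deterministic decay towards the origin
decay = simulate_ou(0.5, 0.0, v0=1.0, t_end=1.0, n_steps=1000, rng=rng)
```

Setting `alpha` close to zero removes the pull towards the origin, and the simulated path then behaves like a scaled Brownian motion.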
Consider n independent O-U processes $V_1(s), \ldots, V_n(s)$ with parameters $(\alpha, \sigma^2)$. From (1) one may write,

$$d[V_1(s) + \cdots + V_n(s)] = -\alpha[V_1(s) + \cdots + V_n(s)]\,ds + \sigma\,d[B_1(s) + \cdots + B_n(s)] \tag{4}$$

Thus the sum $V_1(s) + \cdots + V_n(s)$ is an O-U process with parameters $(\alpha, n\sigma^2)$. One may also estimate the parameters $(\alpha, \sigma^2/n)$ from the realised average process $[V_1(s) + \cdots + V_n(s)]/n$.
Estimate of central tendency obtained from smooth lowess regression is sharper
than sample mean. Thus, for plants with a fixed seed weight, the deviations of mean
yield curve from the lowess growth curve may be interpreted as response curve
of yam growth minus data mean, and these residuals may be modeled by an O-U
process. The m.l.e. of the process parameters (α, σ 2 /n) may then be compared to
find the appropriate seed weight for yield with less variation.

Fig. 14 lowess curve on mean: seed weight 500 g, 2nd interim

In Fig. 14 we show in red color the mean of yam growth curves of 8 plants with
seed weight 500 g, when the yam detachment is made at the time of second interim
growth recording. The lowess curve on these with f = 0.35 is also shown in blue
color.
Convergence of lowess regression to the response curve is sharper than common
mean, and we may consider lowess as an estimate of the response curve to a first
approximation. We propose to model the deviations of the mean curve from the
lowess curve, taken as the base curve, as shown in Fig. 15.
An estimate of the diffusion parameter is given by

$$\lim_{n\to\infty} \frac{1}{t} \sum_{j=1}^{2^n} \left[V(jt2^{-n}) - V((j-1)t2^{-n})\right]^2 = \sigma^2 \ \text{a.s.} \tag{5}$$


Fig. 15 Deviation of mean curve from the lowess curve seed wt 500 g, 2nd interim

With grid spacing of 5 days, the following are the independent estimates of $\sigma^2$ from 8 plants in the time zone [65, 150] days:
0.001316975, 0.000954068, 7.35E-05, 8.68E-05, 0.000417231, 4.91E-05, 0.000126263, 0.001075879, with pooled estimate $\hat\sigma^2 = 0.000512$.
The m.l.e. of $\sigma^2/n$ is 4.19E-05, based on a single realization of the mean curve shown in Fig. 15, made out of n = 8 plants. Here again, we consider grid spacing of 5 days in (5).
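
On a finite grid, (5) becomes a sum of squared increments divided by the length of the time span. The sketch below shows that computation together with the pooling of the eight per-plant estimates quoted above. It assumes that pooling means a simple average, which reproduces the reported 0.000512 to the printed precision; the function name is hypothetical and the per-plant raw curves are not available here, so only the pooling step uses actual figures from the text.

```python
def realized_sigma2(values, total_time):
    """Finite-grid version of (5): sum of squared increments divided by the time span."""
    return sum((b - a) ** 2 for a, b in zip(values, values[1:])) / total_time

# per-plant estimates of sigma^2 for seed weight 500 g, as listed in the text
estimates_500g = [
    0.001316975, 0.000954068, 7.35e-05, 8.68e-05,
    0.000417231, 4.91e-05, 0.000126263, 0.001075879,
]
pooled = sum(estimates_500g) / len(estimates_500g)   # assumed pooling rule: plain mean
print(round(pooled, 6))  # 0.000512

# toy check of the increment formula on a made-up zig-zag path
toy = realized_sigma2([0.0, 1.0, 0.0, 1.0], total_time=3.0)
```

With a 5 day grid on [65, 150] each plant contributes 17 readings and hence 17 squared increments, and `total_time` would be 85.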
An estimate of the drift parameter α, based on $\hat\sigma^2 = 0.000512$ that is pooled from 8 individual plants, may be computed from

$$\hat\alpha = -\int_0^t V(s)\,dV(s) \Big/ \int_0^t V^2(s)\,ds = \frac{1}{2}\left[\int_0^t V^2(s)\,ds\right]^{-1}\left[\sigma^2 t + V^2(0) - V^2(t)\right] \tag{6}$$

Thus $\hat\alpha = \frac{1}{2}(0.01932024)^{-1}\left[(0.000512/8) \times (150 - 65) + (0.003399)^2 - (-0.035456)^2\right] = 0.10855$.
If the m.l.e. of $\sigma^2/n$ is used in the above, i.e., 0.0000419 replaces (0.000512/8), then we get a slightly different estimated value of the drift parameter: $\hat\alpha = 0.05993521$.
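
The arithmetic in (6) is easy to reproduce. The sketch below plugs in the figures quoted above for seed weight 500 g; the function and argument names are hypothetical, and the integral of $V^2$ is taken as the reported value 0.01932024 rather than recomputed from data.

```python
def drift_mle(sigma2, t, v_start, v_end, int_v2):
    """Closed form of (6): 0.5 * [sigma^2 t + V(0)^2 - V(t)^2] / integral of V^2."""
    return 0.5 * (sigma2 * t + v_start ** 2 - v_end ** 2) / int_v2

span = 150 - 65                      # length of the time segment in days
int_v2 = 0.01932024                  # reported integral of V^2 over the segment

# diffusion parameter of the mean curve taken as pooled estimate divided by n = 8
alpha_pooled = drift_mle(0.000512 / 8, span, 0.003399, -0.035456, int_v2)

# same formula with the m.l.e. of sigma^2/n from the mean curve
alpha_mle = drift_mle(0.0000419, span, 0.003399, -0.035456, int_v2)

print(round(alpha_pooled, 5), round(alpha_mle, 8))
```

The formula makes clear when the estimate can turn negative: whenever $V^2(t)$ at the right end point exceeds $\sigma^2 t + V^2(0)$.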
Yet another estimate of α is available from the maximum fluctuation of the residual curve of mean as shown in Fig. 15. We equate the realised maximum fluctuation (= |−0.029014|) of the residual curve of mean to its approximate theoretical value $\left[\frac{\sigma^2}{\alpha}(1 + o(1))\log t\right]^{1/2}$ given in (3), and consider the m.l.e. of $\sigma^2/n = 0.0000419$. Then,

$$\left[\frac{\sigma^2/n}{\alpha}\log t\right]^{1/2} = \left\{\frac{0.0000419}{\alpha}\log(150 - 65)\right\}^{1/2} \approx 0.029014,$$

i.e., from the a.s. relation (3) for large time t,

$$\alpha = \frac{\sigma^2/n}{\sup_{h \le s \le t+h} |V(s)|^2}\,(1 + o(1))\log t \approx \left\{\frac{0.0000419}{(-0.029014)^2}\log(150 - 65)\right\} = 0.2211266$$

We shall see later that this relation leads to a positive estimate of α (> 0), even when the m.l.e. fails to do so.
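
The fluctuation-based estimate is a one-line computation once the o(1) term is dropped. A minimal sketch using the 500 g figures quoted above; the function name is hypothetical and the natural logarithm is assumed, which reproduces the reported value.

```python
import math

def drift_from_fluctuation(sigma2_star, max_abs_dev, t):
    """Solve [sigma*^2 / alpha * log t]^(1/2) = max |V| for alpha, ignoring o(1)."""
    return sigma2_star * math.log(t) / max_abs_dev ** 2

# m.l.e. of sigma^2/n, realised maximum fluctuation and time span for 500 g
alpha_fluct = drift_from_fluctuation(0.0000419, abs(-0.029014), 150 - 65)
print(round(alpha_fluct, 7))
```

Because every factor in the expression is positive, this estimate can never be negative, unlike the closed form in (6).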
Next consider the plants with seed weight 650 g. In Fig. 16 we show in red color
the mean of yam growth curves of 13 plants with seed weight 650 g, when the yam
detachment is made at the time of second interim growth recording. The lowess curve
on these with f = 0.35 is also shown in blue color.

Fig. 16 lowess curve on mean: seed weight 650 g, 2nd interim


Fig. 17 Deviation of mean curve from the lowess curve: seed wt 650 g, 2nd interim

With grid spacing of 5 days, the following are the independent estimates of $\sigma^2$ from 13 plants in the time zone [65, 150] days:
0.000378069, 0.000256675, 0.000686736, 0.000987547, 0.000201478, 0.001630377, 0.002194607, 0.001432672, 0.001437628, 0.003501173, 0.000311077, 2.80E-05, 0.002202819, with pooled estimate $\hat\sigma^2 = 0.001172991$.
An estimate of the drift parameter α, based on the pooled estimate $\hat\sigma^2 = 0.001172991$ from 13 individual plants, may be obtained from (6): $\hat\alpha = \frac{1}{2}(0.395313686)^{-1}\left[(0.001172991/13) \times (150 - 65) + (-0.006485)^2 - (0.130003)^2\right] = -0.01162263$.
Since the estimate of α is negative, it is inadmissible. The same comment holds when the m.l.e. of $\sigma^2/n$ is used in the above, i.e., when 0.000180859 replaces (0.001172991/13).
The problem occurs because V(t) is large at the right end point; see Fig. 17 for the deviations of the mean curve from the lowess curve taken as base curve.
A non-negative estimate of α, computed from the maximum fluctuation (= 0.130003) of the deviation curve and the m.l.e. of $\sigma^2/n = 0.000180859$, is obtainable from the following a.s. relation for large time t:

$$\alpha = \frac{\sigma^2/n}{\sup_{h \le s \le t+h} |V(s)|^2}\,(1 + o(1))\log t \approx \left\{\frac{0.000180859}{(0.130003)^2}\log(150 - 65)\right\} = 0.0475418$$

Next consider the plants with seed weight 800 g. In Fig. 18 we show in red color
the mean of yam growth curves of 10 plants with seed weight 800 g, when the yam
detachment is made at the time of second interim growth recording. The lowess curve
on these with f = 0.35 is also shown in blue color.
With grid spacing of 5 days, the following are the independent estimates of $\sigma^2$ from 10 plants in the time zone [65, 150] days:
6.76E-05, 0.00028662, 0.000357005, 0.000241383, 0.000267159, 7.02E-05, 0.000213081, 0.000312243, 0.000162101, 0.000199446, with pooled estimate $\hat\sigma^2 = 0.0000218$.
An estimate of the drift parameter α, based on the pooled estimate $\hat\sigma^2 = 0.0000218$ from 10 individual plants, may be obtained from (6): $\hat\alpha = \frac{1}{2}(0.234073775)^{-1}\left[(0.0000218/10) \times (150 - 65) + (-0.006001)^2 - (-0.023395)^2\right] = -0.0006963916$.

Fig. 18 lowess curve on mean: seed weight 800 g, 2nd interim

The estimate of α is negative, as in the case of seed weight 650 g, hence it is inadmissible. The same comment holds when the m.l.e. of $\sigma^2/n$, based on the deviation of the mean curve from the lowess curve on grid points and computed from (5), is used in the above formula, i.e., when 0.0000932 replaces (0.0000218/10).
The problem occurs as |V(t)| is large at the right end point, see Fig. 19.
A non-negative estimate of α may be computed from the magnitude of maximum fluctuation (= |−0.098547|) of the deviation curve and the m.l.e. of $\sigma^2/n = 0.0000932$, from the following a.s. relation for large time t:

$$\alpha = \frac{\sigma^2/n}{\sup_{h \le s \le t+h} |V(s)|^2}\,(1 + o(1))\log t \approx \left\{\frac{0.0000932}{(-0.098547)^2}\log(150 - 65)\right\} = 0.0426355$$


The drift parameter α of an O-U process represents the reverting force towards the origin. As α → 0, the process gradually approaches the Brownian motion. We may test whether α is bounded away from zero by the following approximate test.

Fig. 19 Deviation of mean curve from the lowess curve: seed wt 800 g, 2nd interim

The asymptotic distribution of $\hat\alpha = \hat\alpha(t)$ is normal with mean α and variance $\left[\int_0^t V^2(s)\,ds\right]^{-1}$, i.e.,

$$\left[\int_0^t V^2(s)\,ds\right]^{1/2} (\hat\alpha(t) - \alpha) \sim AN(0, 1) \tag{7}$$


see e.g., Brown and Hewitt (1975). Consider the case of yam growth curves from seed weight 500 g, with yam detachment made at the second interim growth recording time. The hypothesis $H_0 : \alpha = 0$ in the error component may be tested by $\tau = 0.05993521 \times (0.01932024)^{1/2} = 0.00833083$, to be compared with a normal deviate. The value is insignificant, indicating a possibility of error components following a Brownian motion.
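
The test statistic follows directly from (7) under $H_0$. The sketch below reproduces τ from the quoted figures and attaches an upper-tail normal probability; the choice of a one-sided alternative α > 0 is an assumption made here for illustration, and the function names are hypothetical.

```python
import math

def drift_test_statistic(alpha_hat, int_v2, alpha0=0.0):
    """tau = sqrt(integral of V^2) * (alpha_hat - alpha0), approximately N(0, 1) by (7)."""
    return (alpha_hat - alpha0) * math.sqrt(int_v2)

def upper_tail_normal(z):
    """P(Z > z) for a standard normal Z, via the complementary error function."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))

tau = drift_test_statistic(0.05993521, 0.01932024)
p_value = upper_tail_normal(tau)     # one-sided alternative alpha > 0 (an assumption)
print(round(tau, 8), round(p_value, 3))
```

A statistic this close to zero gives a tail probability near one half, far from any conventional significance level, which is consistent with the conclusion stated above.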
From (2) and (3), maximum fluctuation of | V (s) | is proportional to the standard
deviation σ/α 1/2 of the process. In Fig. 20 we plot the values of V 2 (s). Area under
the curves in Fig. 20 relates to the accuracy of the m.l.e. of α, see (6) and (7). The
green curve corresponding to seed weight 500 g has least fluctuation in Fig. 20, the
red curve corresponding to seed weight 650 g has highest fluctuation; and the blue
curve corresponding to seed weight 800 g is of moderate fluctuation. These provide
the level of accuracy for the m.l.e. of α in three situations of different seed weights.
Next consider equation (3). The residual process $V^*$ of mean minus the lowess curve for each seed weight is modeled by an O-U process, and we consider its maximum fluctuation. Write $\sigma^{*2} = \sigma^2/n$, the diffusion parameter of the mean minus the lowess curve, where the mean curve is based on n plants of the same seed weight. Then,

$$\lim_{t\to\infty} \sup_{h \le s \le t+h} |V^*(s)| \sim \left[\frac{\sigma^{*2}}{\alpha}(1 + o(1))\log t\right]^{1/2}, \ \text{a.s.} \tag{8}$$

In Fig. 21 we plot the pair of curves viz., residual process for mean from lowess of the form $\sup_{h \le s \le t+h} |V^*(s)|$ appearing in the l.h.s. of (8), and the corresponding expression in the r.h.s. of (8), by a continuous curve and a dashed curve, for each of the seed weights 500, 650 and 800 g. We considered the time segment [65, 150] days for growth modeling in all plants. The value of h in computing the upper bound in (8) is taken as 63.9 to avoid zero in the upper bound. Features in the pair of curves exhibit similarity in each case, indicating model appropriateness.

Fig. 20 Squared deviation of the mean growth curves from lowess, 2nd interim

Fig. 21 Maximum fluctuation with model O-U process with upper bound

References

Brown BM, Hewitt JI (1975) Asymptotic likelihood theory for diffusion processes. J Appl Prob
12:228–238
Dasgupta R (2017a) Longitudinal growth curve of elephant foot yam under extreme stress and plant
sensitivity. Int J Hortic 7(13): doi:10.5376/ijh.2017070013
Dasgupta R (2017b) Longitudinal growth curve of elephant foot yam under extreme stress and plant
sensitivity III. Int J Hortic 7(23): doi:10.5376/ijh.2017070023
Dasgupta R (2013) Non uniform rates of convergence to normality for two sample U-statistics in non iid case with applications. Advances in growth curve models: topics from the Indian Statistical Institute. In: Proceedings in mathematics and statistics, chap. 4, vol 46. Springer, Heidelberg, pp 60–88
Dasgupta R (2015a) Growth of tuber crops and almost sure band for quantiles. Commun Stat Simul Comput. doi:10.1080/03610918.2014.990097
Dasgupta R (2015b) Rates of convergence in CLT for two sample U-statistics in non iid case and multiphasic growth curve. Growth curve and structural equation modeling. In: Dasgupta R (ed) Proceedings in mathematics and statistics, vol 132. Springer, Berlin, pp 35–58
Dasgupta R (2015c) Plant sensitivity and growth curve analysis of elephant foot yam. Growth curve and structural equation modeling: topics from the Indian Statistical Institute, chap. 1. Springer, Berlin, pp 1–23
Rider PR (1957) The midrange of a sample as an estimator of the population midrange. J Am Stat Assoc 52(280):537–542
An In-Depth Analysis of Population Ageing
for Selected States in India in the Perspective
of Economic Development

Prasanta Pathak

Abstract The present paper examines how selected states in India with varying levels of economic development are at varying phases of taking advantage of the demographic dividend. It shows how the variation in taking advantage of the demographic dividend occurs due to variation in the temporal pattern of the young and the old age dependent populations. The significance of the young vis-a-vis the old age dependent populations in economic perspective has been looked into separately. An attempt has been made to deal with the above aspects analytically so that the states can be classified. The states which have been selected for in-depth study are Andhra Pradesh, Bihar, Gujarat, Madhya Pradesh, Maharashtra, Rajasthan, Tamil Nadu, Uttar Pradesh and West Bengal. To maintain comparability over the years, undivided Bihar, Madhya Pradesh, Uttar Pradesh and Andhra Pradesh have been considered. Census publications for the years 1961, 1971, 1981, 1991, 2001 and 2011 have been used for the study. Besides demographic measures like the overall dependency ratio, old age dependency ratio and young age dependency ratio, a newly introduced measure called the replacement ratio has been used, and their temporal patterns have been analytically studied by using statistical models. A distance measure has been used to rank the states in terms of the estimates of the replacement ratio. The states have been classified into two groups based on advancement in taking advantage of the demographic dividend. It makes clear that variation in economic development of different states has a definite influence on the temporal and regional characteristics of the considered measures of population dynamism. The paper ends with important policy implications of the in-depth analyses.

Keywords Demographic transition · Population ageing · Demographic dividend · Dependency ratio · Replacement ratio

MS Subject classification: 62-07

P. Pathak (B)
Population Studies Unit, Indian Statistical Institute, Kolkata, India
e-mail: prasanta.pathak@gmail.com

© Springer International Publishing AG 2017 215


R. Dasgupta (ed.), Growth Curve Models and Applications, Springer Proceedings
in Mathematics & Statistics 204, DOI 10.1007/978-3-319-63886-7_11
216 P. Pathak

1 Introduction

The growth rate of human population accelerated all over the world to unprecedented levels in the second half of the twentieth century. The world population thus more than doubled, reaching 6.5 billion in 2005 (United Nations 1962, 1973 and 2007). Population growth is expected to continue for several more decades before reaching nearly 10 billion in the later part of the twenty-first century. Decline in death rates with more and more technological advancement, industrialisation and urbanisation, followed by decline in prevailing high birth rates due to increased chances of child survival, increase in literacy among females, female participation in work, various family welfare measures, etc., have brought about a very rapid change in the populations. This is characteristic of the central phase of a secular process known as demographic transition (Bongaarts 2009). Over several years, demographic transition transforms an age structure dominated by the young population into a structure dominated by the old age population. This transformation has significant developmental consequences for large populations. With increasing ages of birth cohorts in the young population, the dependency ratio is gradually lowered, and the increase in economically active population in the working age groups accelerates economic growth. Though some analysts have opined that a rapidly increasing labour force may aggravate the unemployment problem with a falling capital labour ratio (Coale and Hoover 1958), others have considered it an economic advantage (Bloom et al. 2000; Williamson and Higgins 2001). Decline in the share of dependants, constituted of children and elderly people, and increased economic activities are considered to help in increasing savings and investments in human and physical capital in an economy. This is referred to as the demographic dividend (Gribble and Bremner 2012). It has been estimated that nearly one-third of the economic achievement of the East Asian countries (including China) can be attributed to the demographic dividend (Bloom and Williamson 1998; Bloom and Finlay 2009). Behrman et al. (1999), Anderson (2001), Feng and Mason (2005), Kelley and Schmidt (2005), Bloom et al. (2003), Bloom et al. (2006), Choudhry and Elhorst (2010), Wei and Hao (2010) have noted a positive association between transition in the age structure and economic growth.
India now has a relatively young population and has started witnessing a decline in the
share of dependants. Based on experience in East and South-East Asia, there
was high expectation that the demographic dividend phase might take India to newer
economic heights (Bloom and Williamson 1998; Bloom et al. 2006; Bloom 2011;
Aiyar and Mody 2011). However, India did not gain much in the earlier phases of
demographic transition (in the 1980s and 1990s). To some extent, the poor gains may
have been associated with concerns over the surrounding growth environment
(Navaneetham 2002; Mitra and Nagarajan 2005; Chandrasekhar et al. 2006; James
2008; Bloom 2011). In spite of it, India could overcome the past stagnancy starting
from the 1980s (Panagariya 2004; Rodrik and Subramanian 2005; Basu and Maertens
2007). In fact, after the 1990s, the per capita income in India has increased at a rate
of over 5 per cent per annum, which had been below 3 per cent before the 1990s.
Choudhry and Elhorst (2010) have concluded that population dynamism can explain
39 per cent of the economic growth in India and will have a positive impact on economic
growth between 2005 and 2050. This turnaround is thought to be partly associated
with the increasing share of population in the working age group since the 1980s
(James 2008; Bloom 2011; Aiyar and Mody 2011). It has been argued in the last two
referred papers that about 1 to 2 percentage points of growth in GDP per capita, compounded
year by year, are possible if India takes advantage of the demographic dividend by
utilising productively the population in the working age group.
The present paper examines how selected states in India with varying levels of economic
development are at varying phases of taking advantage of the demographic dividend. It
shows that the variation in taking advantage of the demographic dividend occurs
due to variation in the temporal pattern of the young and the old age dependent
populations. The significance of the young vis-a-vis the old age dependent populations
from an economic perspective has been looked into separately. An attempt has been made
to deal with the above aspects analytically so that the states can be classified.

2 Methodology

The sources of data for this study are the population age distributions, published
in the Census volumes of the Government of India in the years 1961, 1971, 1981,
1991, 2001 and 2011. The states which have been selected for in-depth study are
Andhra Pradesh, Bihar, Gujarat, Madhya Pradesh, Maharashtra, Rajasthan, Tamil
Nadu, Uttar Pradesh and West Bengal. To maintain comparability over the years,
undivided Bihar, Madhya Pradesh, Uttar Pradesh and Andhra Pradesh have been
considered. As per the government estimates for the year 2011–12
(Ref. http://pib.nic.in/archieve/others/2013/dec/d2013121703.pdf), Bihar, Madhya Pradesh
and Uttar Pradesh with per capita income below Rs. 50,000 have been considered as less
economically developed. Andhra Pradesh, Rajasthan and West Bengal with per capita
income falling in the range of Rs. 50,000–80,000 have been considered as moderately
developed. The remaining states with per capita income above Rs. 80,000 have been
considered as well developed. The temporal changes in the Dependency Ratio (DR)
as defined below for all these states have been studied first.

Dependency Ratio = [(Number of young persons of age 14 years & less + Elderly persons
of age 60 years & above)/(All persons with ages between 15 and 59 years)] * 100.

The best fitted analytical function for the temporal pattern of change of the DR
for each state has helped in understanding objectively how the DR has changed over the
decades. The best fit has been judged based on the coefficient of determination,
which is measured by the proportion of the total sum of squares explained by the
regression sum of squares. When corrected for the degrees of freedom, the
coefficient of determination is called the adjusted coefficient of determination.
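The two goodness-of-fit measures described above can be illustrated with a minimal sketch. The function names and the six data values are hypothetical and serve only as an illustration; the fitting routine is a plain least-squares polynomial fit, which is an assumption about how the trend functions were obtained, and the adjusted coefficient uses the usual correction with n observations and p fitted terms besides the intercept.

```python
def polyfit(xs, ys, degree):
    """Least-squares polynomial coefficients, lowest power first."""
    m = degree + 1
    # Normal equations A c = b, with A[i][j] = sum(x^(i+j)) and b[i] = sum(y * x^i)
    a = [[sum(x ** (i + j) for x in xs) for j in range(m)] for i in range(m)]
    b = [sum(y * x ** i for x, y in zip(xs, ys)) for i in range(m)]
    # Gaussian elimination with partial pivoting
    for col in range(m):
        pivot = max(range(col, m), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        b[col], b[pivot] = b[pivot], b[col]
        for row in range(col + 1, m):
            factor = a[row][col] / a[col][col]
            for k in range(col, m):
                a[row][k] -= factor * a[col][k]
            b[row] -= factor * b[col]
    # Back substitution
    coeffs = [0.0] * m
    for i in reversed(range(m)):
        tail = sum(a[i][k] * coeffs[k] for k in range(i + 1, m))
        coeffs[i] = (b[i] - tail) / a[i][i]
    return coeffs


def r_squared(xs, ys, degree):
    """Return (R^2, adjusted R^2) for a polynomial fit; needs len(ys) > degree + 1."""
    coeffs = polyfit(xs, ys, degree)
    fitted = [sum(c * x ** i for i, c in enumerate(coeffs)) for x in xs]
    mean_y = sum(ys) / len(ys)
    ss_total = sum((y - mean_y) ** 2 for y in ys)
    ss_residual = sum((y - f) ** 2 for y, f in zip(ys, fitted))
    # With an intercept, explained share = 1 - residual share
    r2 = 1.0 - ss_residual / ss_total
    n, p = len(ys), degree
    adjusted = 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)
    return r2, adjusted


# Hypothetical ratio values at six census points coded 1..6 (not data from the paper)
xs = [1, 2, 3, 4, 5, 6]
ys = [82.5, 76.9, 70.2, 63.0, 57.1, 50.4]
r2_linear, adj_linear = r_squared(xs, ys, 1)
r2_quadratic, adj_quadratic = r_squared(xs, ys, 2)
```

With only six census points, a polynomial of degree 4 leaves a single residual degree of freedom, so the adjusted coefficient can fall well below the unadjusted one for the higher-degree fits.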

The DR has two components: the Old Age DR (OADR) and the Young Age
DR (YADR). The OADR and the YADR are defined in the following way.
(A) Old Age Dependency Ratio = (No. of persons with ages 60 years and above
/No. of persons with ages between 15 and 59 years) * 100
(B) Young Age Dependency Ratio = (No. of persons with ages 14 years and
below/No. of persons with ages between 15 and 59 years) * 100
The temporal changes in the OADR and the YADR have also been studied analytically
for all the selected states.
Lastly, a new index called the Replacement Ratio (RR) has been introduced. It has
been defined as follows.
(C) Replacement Ratio = (No. of persons with ages 14 years and below /No.
of persons with ages 60 years and above) * 100
The RR has been introduced realising the importance of the population in the working
age group. The people in this age group have the highest responsibility, including
generating income, bringing up the children and providing livelihood support to the
older adults. As the children grow up, they gradually join this important age
group and also act as replacements for those who exit it on
getting aged. To ensure some sort of balance in the dynamism of the population, it has
been thought very important to study how the people exiting from the population get
replaced by the ones who are new entrants in the population. The temporal changes
in the RR have also been studied analytically for all the selected states. An attempt
has been made to order the states by applying Euclidean distance on the RR vectors,
estimated state wise for six time points. The vector for the best performing state
has been taken as the standard. If it is denoted by $(r_1, r_2, r_3, r_4, r_5, r_6)$ and the
vector for any other state is denoted by $(s_1, s_2, s_3, s_4, s_5, s_6)$, then the Euclidean
distance between the two vectors is

$$d = \sqrt{(r_1 - s_1)^2 + (r_2 - s_2)^2 + (r_3 - s_3)^2 + (r_4 - s_4)^2 + (r_5 - s_5)^2 + (r_6 - s_6)^2}.$$

This distance has been used to order the states.
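The four measures defined in this section (DR, OADR, YADR and RR) can be computed from three age-group counts. The sketch below is only illustrative: the function name and the counts are hypothetical, not census figures.

```python
def dependency_measures(young, working, old):
    """Compute the four ratios from person counts.

    young:   persons aged 14 years and below
    working: persons aged 15 to 59 years
    old:     persons aged 60 years and above
    """
    return {
        "DR": (young + old) / working * 100,
        "OADR": old / working * 100,
        "YADR": young / working * 100,
        "RR": young / old * 100,
    }


# Hypothetical counts (in thousands), chosen only for illustration
measures = dependency_measures(young=350, working=560, old=90)
```

By construction DR = OADR + YADR, and RR = (YADR / OADR) * 100, so the RR carries information about the balance between the two components that the DR alone does not show.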
Other than tabular and graphical representations of the temporal changes, regression
models and derivatives have been used for analytical treatment of the data. The
findings and the analyses are presented below.

3 Findings and Analyses

3.1 Dependency Ratio (DR)

The findings on DR are presented graphically in Charts 1, 2, 3, 4, 5, 6, 7, 8 and 9.
In Chart 1 for Bihar, the DR is found fluctuating between 91 and 95 during the considered
decades. In the next chart, it is found that the DR in UP has been around 90 till 2001
before falling to the level of 80 in 2011. Charts 3 and 4 for MP and Rajasthan
show that the DRs have been between 80 and 100 till 2001 before falling below 80 in
2011. Chart 5 for AP shows that the DR has been between 80 and 90 till 1981. It
fell below 80 in 1991 and then it dropped to 60 in 2011 after an abrupt jump to nearly
90 in 2001. Charts 6, 7 and 8 for WB, Gujarat and Maharashtra respectively
show that after being around 90 in 1961 and 1971 the DRs have fallen consistently to
a level below 60 in 2011. In the case of TN, as shown in Chart 9, the DR was a
little above 80 in 1961. It has gradually reached the level of nearly 50 in 2011.
Clearly, TN is ahead of the other states in getting the advantages of the demographic
dividend. Bihar, UP, MP and Rajasthan need to wait a number of decades to get similar
advantages. All other states have reached the stage of getting the advantages of
demographic dividend. Changes over the decades have been captured analytically
by fitting appropriate mathematical functions. The best fitted functions have been
shown on the corresponding charts. Polynomial functions of degree two and above
are found most appropriate for Bihar, UP, MP and Rajasthan with coefficients of
determination falling in the range of 0.611 to 0.975. For WB and AP, the best fitted
functions are again found to be polynomials of degrees 2 and 4 and coefficients of
determination 0.930 and 0.942 respectively. However, the best fitted functions are
linear for Gujarat, TN and Maharashtra with coefficients of determination 0.956,
0.993 and 0.888 respectively.

Chart 1 Undivided Bihar: observed DR, 1961–2011, with fitted polynomial y = -0.308x^4 + 4.335x^3 - 20.98x^2 + 40.22x + 68.63 (R² = 0.611)
Chart 2 Undivided Uttar Pradesh: observed DR, 1961–2011, with fitted polynomial y = -0.845x^4 + 11.38x^3 - 52.82x^2 + 97.42x + 32.91 (R² = 0.975)
Chart 3 Undivided Madhya Pradesh: observed DR, 1961–2011, with fitted polynomial y = -1.952x^2 + 10.24x + 79.48 (R² = 0.842)
Chart 4 Rajasthan: observed DR, 1961–2011, with fitted polynomial y = -1.751x^2 + 8.582x + 85.74 (R² = 0.919)
Chart 5 Undivided Andhra Pradesh: observed DR, 1961–2011, with fitted polynomial y = -1.533x^4 + 20.42x^3 - 93.27x^2 + 166.0x - 7.621 (R² = 0.942)
Chart 6 West Bengal: observed DR, 1961–2011, with fitted polynomial y = -1.424x^2 + 3.419x + 85.71 (R² = 0.930)
Chart 7 Gujarat: observed DR, 1961–2011, with fitted line y = -7.315x + 102.9 (R² = 0.956)
Chart 8 Maharashtra: observed DR, 1961–2011, with fitted line y = -5.757x + 96.27 (R² = 0.888)
Chart 9 Tamil Nadu: observed DR, 1961–2011, with fitted line y = -6.448x + 89.19 (R² = 0.993)
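As a small worked check, the fitted line shown on Chart 9 for Tamil Nadu can be evaluated at each census. The sketch assumes that x takes the values 1 to 6 for the census years 1961 to 2011; this coding is not stated in the text, but it reproduces the levels described above (a little above 80 in 1961 and nearly 50 in 2011). The names used are illustrative.

```python
CENSUS_YEARS = [1961, 1971, 1981, 1991, 2001, 2011]


def fitted_dr_tamil_nadu(x):
    """Fitted line reported on Chart 9; x = 1..6 is an ASSUMED coding of the census years."""
    return -6.448 * x + 89.19


# Fitted DR by census year under the assumed coding
fitted_dr = {
    year: fitted_dr_tamil_nadu(index)
    for index, year in enumerate(CENSUS_YEARS, start=1)
}
```

Under this coding the slope of the line is the fitted change in DR per decade, that is, a fall of about 6.4 points between successive censuses.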
Based on the temporal patterns of changes in the DR over the decades, the states
may be classified into two groups. The first group, with a higher level of economic
development and having the privilege of getting the advantages of demographic dividend, is
formed by AP, Gujarat, Maharashtra, TN and WB, and the second group, with a lower
level of economic development and not having the advantages of a similar dividend, is
formed by Bihar, UP, MP and Rajasthan. The analytical functions for DR corresponding
to these two groups in terms of time (T) are the following.

Group I: DR = 96.2791 − 6.0720 * T with adjusted coefficient of determination 0.6947
Group II: DR = 4.5300 − 0.0004 * exp (T) with adjusted coefficient of determination 0.4950

The rates of fall of the DR in Group II states are found close to those for the
Group I states only from 2001 and have been in the range of −11.8 to −15.2 in 2011
with the highest value attained by Bihar.
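Read literally, the Group II function would give DR values near 4.5, far from the observed levels of 80 to 95. The sketch below explores one possible reading, which is an assumption and is not stated in the text: that the left-hand side is the natural logarithm of DR, with T = 1, ..., 6 for the census years 1961 to 2011. Under that reading the rate of change is the derivative of exp(a − b·exp(T)) with respect to T. The function names are illustrative.

```python
import math


def group1_dr(t):
    """Group I function as printed: linear in T."""
    return 96.2791 - 6.0720 * t


def group2_dr_log_reading(t):
    """Group II function under the ASSUMPTION that the fitted quantity is ln(DR)."""
    return math.exp(4.5300 - 0.0004 * math.exp(t))


def group2_slope_log_reading(t):
    """Derivative with respect to T: d/dT exp(a - b*e^T) = -b * e^T * DR(T)."""
    return -0.0004 * math.exp(t) * group2_dr_log_reading(t)


GROUP1_SLOPE = -6.0720                          # constant fall per decade for Group I
dr_2011 = group2_dr_log_reading(6)              # T = 6 assumed to correspond to 2011
slope_2001 = group2_slope_log_reading(5)        # T = 5 assumed to correspond to 2001
slope_2011 = group2_slope_log_reading(6)
```

Under this assumed reading the 2011 level is just below 80 and the 2011 slope lies inside the reported range of −11.8 to −15.2, while the 2001 slope is of the same order as the Group I slope. This is consistent with the description above, but it remains a conjecture about the printed formula.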

3.2 Old Age Dependency Ratio (OADR)

The findings on OADR are presented graphically in Charts 10, 11, 12, 13, 14, 15,
16, 17 and 18. It is clear from the charts that the OADR in Bihar, UP and MP had been
in the range of 10 to 12 in 1961. It reached the level of 14 or a little below it by 2011.
For Rajasthan, WB and Gujarat, the values had been between 9 and 10 in 1961 and
increased to values between 12 and 13 by 2011. Maximum ageing is noted in AP,
Maharashtra and TN, where the values had been between 10 and 12 in 1961
and increased to about 16 by 2011.

Chart 10 Undivided Bihar: observed OADR, 1961–2011, with fitted polynomial y = 0.048x^4 - 0.569x^3 + 2.115x^2 - 2.044x + 11.17 (R² = 0.876)
Chart 11 Undivided Uttar Pradesh: observed OADR, 1961–2011, with fitted polynomial y = 0.073x^3 - 0.843x^2 + 3.165x + 9.474 (R² = 0.96)
Chart 12 Undivided Madhya Pradesh: observed OADR, 1961–2011, with fitted polynomial y = -0.152x^2 + 1.767x + 8.176 (R² = 0.956)
Chart 13 Rajasthan: observed OADR, 1961–2011, with fitted polynomial y = -0.077x^2 + 1.144x + 8.862 (R² = 0.984)
Chart 14 West Bengal: observed OADR, 1961–2011, with fitted line y = 0.726x + 8.339 (R² = 0.891)
Chart 15 Gujarat: observed OADR, 1961–2011, with fitted line y = 0.567x + 8.950 (R² = 0.965)
Chart 16 Undivided Andhra Pradesh: observed OADR, 1961–2011, with fitted polynomial y = 0.189x^3 - 1.733x^2 + 4.758x + 8.176 (R² = 0.966)
Chart 17 Maharashtra: observed OADR, 1961–2011, with fitted line y = 1.203x + 8.261 (R² = 0.956)
Chart 18 Tamil Nadu: observed OADR, 1961–2011, with fitted line y = 1.196x + 7.936 (R² = 0.936)

Among these Group I states with maximum ageing, the best fitted analytical
functions, as shown on the charts, have been linear for Maharashtra and TN with
coefficients of determination 0.956 and 0.936 respectively. It has been a polynomial
of degree 3 for AP and the coefficient of determination has been 0.966. Gujarat and
WB are included here in Group II due to similarity in temporal pattern, indicating less
ageing. The analytical functions for these two states, however, have been linear with
coefficients of determination 0.965 and 0.891 respectively. Polynomial functions of
degree 2 and above are found best fitted with coefficients of determination in the
range of 0.876 to 0.984 for the remaining states.
For these two redefined groups of states, the analytical functions for OADR are
the following.

Group I: OADR = 8.9159 + 0.9838 * T with adjusted coefficient of determination 0.7782
Group II: OADR = 9.8148 + 0.5839 * T with adjusted coefficient of determination 0.5272

Clearly, the coefficient of T indicates that ageing of the population in Group I
states is faster than that for the Group II states.

3.3 Young Age Dependency Ratio(YADR)

The findings on YADR are presented graphically in Charts 19, 20, 21, 22, 23, 24,
25, 26 and 27. The YADR for Bihar is as high as 77 in 2011. It had been 81 in 1961
and its path of change was quite fluctuating. On the other hand, it was a little below
or around 80 in UP during 1961 to 2001 and dropped to 65 in 2011. In the case of
Rajasthan, it was a little above 80 in 1961 and dropped in a curvilinear way to a level
near 60 in 2011. In MP, it was nearly 75 in 1961 and gradually decreased to a level
above 55 in 2011. In contrast with Rajasthan, the YADR in Gujarat was a little above
80 during 1961 and 1971 and decreased to a level around 45 in 2011. Again, the
YADR in WB dropped to a level a little above 40 in 2011 after starting at 75 in 1961.
The ratio in AP was a little above 70 in 1961 and, after fluctuating around 70 till 2001,
it abruptly dropped to a level around 40 in 2011. It was in the range of 70 to 80 in
Maharashtra during 1961 and 1971 and decreased consistently over the next census
years to a level a little above 40 in 2011. The fall was most encouraging in TN, where
the ratio fell consistently over the census years to a level around 35 in 2011 after
starting at a level above 70 in 1961.

Chart 19 Undivided Bihar: observed YADR, 1961–2011, with fitted polynomial y = -0.356x^4 + 4.905x^3 - 23.09x^2 + 42.26x + 57.45 (R² = 0.901)
Chart 20 Undivided Uttar Pradesh: observed YADR, 1961–2011, with fitted polynomial y = -0.534x^3 + 4.447x^2 - 10.63x + 84.34 (R² = 0.684)
Chart 21 Rajasthan: observed YADR, 1961–2011, with fitted polynomial y = -1.674x^2 + 7.437x + 76.88 (R² = 0.937)
Chart 22 Undivided Madhya Pradesh: observed YADR, 1961–2011, with fitted polynomial y = -1.799x^2 + 8.477x + 71.30 (R² = 0.878)
Chart 23 Undivided Andhra Pradesh: observed YADR, 1961–2011, with fitted polynomial y = -1.225x^3 + 10.75x^2 - 28.87x + 94.59 (R² = 0.725)
Chart 24 Gujarat: observed YADR, 1961–2011, with fitted line y = -7.883x + 93.94 (R² = 0.962)
Chart 25 West Bengal: observed YADR, 1961–2011, with fitted line y = -7.279x + 90.67 (R² = 0.866)
Chart 26 Maharashtra: observed YADR, 1961–2011, with fitted line y = -6.960x + 88.01 (R² = 0.914)
Chart 27 Tamil Nadu: observed YADR, 1961–2011, with fitted line y = -7.644x + 81.26 (R² = 0.994)

Clearly, TN has exercised much greater control over births so as to reduce the
child dependants very significantly, and that allowed it to take advantage of the
demographic dividend much ahead of the other states. Temporal changes of YADR
over the decades for TN and other states have been analytically captured in the same
charts. The Group I states, formed based on DR, remain the same here. The best fitted
functions are linear for WB, Gujarat, Maharashtra and TN, but a polynomial of
degree 3 for AP. The coefficients of determination are found in the range of 0.725 to
0.994. In fact, without the abrupt change in YADR in the last decade for AP, the state
would almost have been classified in Group II. The state is, therefore, considered here
in Group II. Other states in Group II remain the same and the best fitted functions
for all the states have been polynomials of degrees 2 and above with coefficients of
determination ranging from 0.684 to 0.937. The following analytical functions are found
best fitted for the Group I and Group II states.

Group I: YADR = 87.4720 − 6.9210 * T with adjusted coefficient of determination 0.7273
Group II: YADR = 4.3950 − 0.0006 * exp (T) with adjusted coefficient of determination 0.5832

While the combined rate of fall over the decades for Group I states has been 6.921,
the rates of fall for the Group II states have been close to it only in 2001 and have been
in the range of −12.940 to −17.409 in 2011, with the highest rate observed
in Bihar.

3.4 Replacement Ratio (RR)

The RRs are shown over the considered decades and for the selected states in Charts 28,
29, 30, 31, 32, 33, 34, 35 and 36. All charts indicate a downward trend with varying
rates of fall. The rates of fall are particularly significant in Gujarat, Maharashtra and
TN. In Gujarat, the RR decreased from nearly 870 to about 350. In Maharashtra, it
declined from about 770 to about 270. In TN, it decreased from about 750 to nearly
220. Clearly, TN is leading in terms of the RR as the ratio is nearing 100 at the
fastest rate. Analytical functions of best fit for these states are found to be linear
with coefficients of determination in the range of 0.986 to 0.993. The status is almost
equally encouraging in WB, where the RR has dropped from a little above 800 in 1961
to nearly 300 in 2011. The best fitted analytical function for this state is a polynomial
with coefficient of determination 0.994. All these states may be considered as Group
I states. The remaining states are considered in Group II. The RRs for all these states
have been in the range of 650 to 800 in 1961 and have ended up with values in the
range of 400 to 550. The best fitted analytical functions for these Group II states are
mostly polynomials of degrees 2 and above, except for Rajasthan, for which the function
has been linear. The coefficients of determination for these states have been in
the range of 0.875 to 0.978.
Having observed much less fluctuation and greater regularity in the functional
patterns of RR for the nine states, an attempt has been made to rank the states using Euclidean
distance. TN being in the leading position in terms of the RR, distances of the temporal
patterns of the RR for the remaining states are computed relative to the temporal
pattern of the RR for TN. As per the distances, the ordered states are TN, Maharashtra,
Gujarat, WB, AP, MP, UP, Rajasthan and Bihar, with the distance of Bihar from TN
being the maximum.
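The ranking step can be sketched for the four states whose RR trends are reported as linear on the charts (TN, Maharashtra, Gujarat and Rajasthan). Two assumptions are made here and are not stated in the text: the estimated RR vector of a state is taken to be its fitted line evaluated at x = 1, ..., 6, and the rounded coefficients printed on the charts are used as they are. The sketch is restricted to these four states for brevity, and the variable names are illustrative.

```python
import math

# (slope, intercept) of the fitted RR lines as printed on Charts 30 and 34 to 36
LINEAR_RR_FITS = {
    "TN": (-106.7, 853.9),
    "Maharashtra": (-104.7, 906.9),
    "Gujarat": (-103.9, 986.2),
    "Rajasthan": (-71.75, 924.3),
}


def rr_vector(slope, intercept):
    """Fitted RR at six time points; x = 1..6 is an ASSUMED coding of 1961..2011."""
    return [slope * x + intercept for x in range(1, 7)]


def euclidean_distance(r, s):
    """Square root of the sum of squared componentwise differences."""
    return math.sqrt(sum((ri - si) ** 2 for ri, si in zip(r, s)))


standard = rr_vector(*LINEAR_RR_FITS["TN"])  # TN taken as the standard
distances = {
    state: euclidean_distance(standard, rr_vector(*fit))
    for state, fit in LINEAR_RR_FITS.items()
}
ordered_states = sorted(distances, key=distances.get)
```

For these four states the resulting order (TN, Maharashtra, Gujarat, Rajasthan) agrees with their relative positions in the ordering reported above.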
Chart 28 Undivided Uttar Pradesh: observed Replacement Ratio, 1961–2011, with fitted polynomial y = -7.286x^3 + 72.09x^2 - 228.2x + 816.1 (R² = 0.901)
Chart 29 Undivided Bihar: observed Replacement Ratio, 1961–2011, with fitted polynomial y = -5.576x^3 + 59.61x^2 - 220.9x + 931.5 (R² = 0.875)
Chart 30 Rajasthan: observed Replacement Ratio, 1961–2011, with fitted line y = -71.75x + 924.3 (R² = 0.971)
Chart 31 Undivided Madhya Pradesh: observed Replacement Ratio, 1961–2011, with fitted polynomial y = -2.184x^2 - 56.65x + 855.0 (R² = 0.978)
Chart 32 Undivided Andhra Pradesh: observed Replacement Ratio, 1961–2011, with fitted polynomial y = -15.30x^3 + 138.8x^2 - 386.8x + 919.1 (R² = 0.805)
Chart 33 West Bengal: observed Replacement Ratio, 1961–2011, with fitted polynomial y = -14.60x^2 - 0.958x + 845 (R² = 0.994)
Chart 34 Gujarat: observed Replacement Ratio, 1961–2011, with fitted line y = -103.9x + 986.2 (R² = 0.986)
Chart 35 Maharashtra: observed Replacement Ratio, 1961–2011, with fitted line y = -104.7x + 906.9 (R² = 0.987)
Chart 36 Tamil Nadu: observed Replacement Ratio, 1961–2011, with fitted line y = -106.7x + 853.9 (R² = 0.993)

4 Discussion and Conclusion

It is clear from the above findings and analyses that on the way to understanding the
mechanism of achieving demographic dividend, temporal as well as regional pat-
terns of DR should be studied simultaneously with temporal and regional patterns of
OADR, YADR and RR. If India is failing to achieve the demographic dividend as per
expectations, an in-depth study into the temporal patterns of DR, OADR, YADR and
RR is required by different regions. Variation in economic development of different
regions has definite influence on the temporal and regional characteristics of these
measures of population dynamism. The findings and analyses show that the Group I
states are ahead of the Group II states in achieving the demographic dividend through
bringing down significantly the YADR. Economic development allows overall devel-
opment of a region including increase in work participation, education, health status,
etc. on one hand and expansion of supportive infrastructure, urbanisation, etc. on
the other. This helps in bringing down the YADR. Old age population is yet to become
a matter of serious concern in India, as indicated by the OADR estimates.
Ageing of the population in the Group I states has been found to be faster, and it is
expected that more and more attention will be required in future years to ensure the
security and well-being of the elderly. It is also worth noting, based on the fitted
analytical functions, that the temporal patterns of change of the ratios are generally
linear for the Group I states and nonlinear for the Group II states. This might be due
to more systematic and planned implementation of the population control programmes
in the Group I states.
Effective and efficient utilisation of the population in the working age group in
economically gainful activities is most essential so as to provide all necessary support
to the dependent population of both types. Two things are thus most important at
this stage: (1) to develop the Group II states so that they move faster towards
achieving the benefits of demographic dividend and (2) to ensure effective and efficient
utilisation of all active population in the working age group so that savings and
investments out of their income generation may provide the necessary support for the
security and healthy living of the ageing population. It is also necessary to keep watch on
the RR, as estimates below 100 would cause concern about the future of an economy.
Complexity of some of the estimated analytical functions might be indicative of the
extent to which population dynamism is well managed in a state. Further in-depth
investigation may bring to light the reasons behind the fluctuations and may help
in better formulation of policy and planning for population.

References

Aiyar S, Mody A (2011) The demographic dividend: evidence from the Indian states. In: IMF
working paper 11/38. IMF, New York. www.imf.org/external/pubs/ft/wp/2011/wp1138.pdf
Anderson B (2001) Scandinavian evidence on growth and age structure. Reg Stud 35(5)
Basu K, Maertens A (2007) The pattern and causes of economic growth in India. Oxf Rev Econ
Policy 24(2):143–67
Behrman JR, Dureyea S, Szekely M (1999) Aging and economic opportunities: major world regions
around the turn of the century, working paper 405. Inter-American Development Bank
Bloom DE, Williamson JG (1998) Demographic transitions and economic miracles in emerging
Asia. W Bank Econ Rev 12(3):419–56
Bloom DE, Canning D, Malaney P (2000) Demographic change and economic growth in Asia
population change in East Asia transition. In: Chu C, Lee R (eds) Population and development
review, vol 26. New York, Population Council, pp 257–290
Bloom DE, David C, Jaypee S (2003) The demographic dividend: a new perspective on the economic
consequences of population change, population matters, monograph MR-1274. RAND, Santa
Monica
Bloom DE, David C, Linlin W, Yuanli L, Mahal A, Yip W (2006) Demographic change and economic
growth: comparing China and India. Harvard School of Public Health, Harvard University, Boston,
MA
Bloom DE, Finlay JE (2009) Demographic change and economic growth in Asia. Asian Econ Policy
Rev 4:45–64
Bloom DE (2011) Population dynamics in India and implications for economic growth, PGDA
working paper 65. Harvard School of Public Health, Harvard University, Boston, MA
Bongaarts J (2009) Human population growth and the demographic transition. Philos Trans R Soc
B 364:2985–90
Chandrasekhar CP, Ghosh J, Roychowdhury A (2006) The demographic dividend and young India’s
economic future. Econ Polit Wkly 9: 5055–5064
Choudhry MT, Elhorst JP (2010) Demographic transition and economic growth in China, India and
Pakistan. Econ Syst 34:218–36
Coale Ansley J, Hoover Edgar M (1958) Population growth and economic development in low-
income countries. Princeton University Press, Princeton
Feng W, Mason A (2005) Demographic dividend and prospect of economic development in China.
Paper prepared for United Nations expert group meeting on social and economic implications of
changing population age structures, Mexico City, 31 Aug to 2 Sept
Gribble JN, Bremner J (2012) Achieving a demographic dividend. Population Bulletin, vol 67, no
2. Population Reference Bureau, Washington DC
James KS (2008) Glorifying malthus: current debate on demographic dividend in India. Econ Polit
Wkly 21:63–69
Kelley Allen C, Schmidt Robert M (2005) Evolution of recent economic-demographic modelling:
a synthesis. J Popul Econ 18:275–300
Mitra S, Nagarajan R (2005) Making use of the window of demographic opportunity: an economic
perspective. Econ Polit Wkly 10:5327–5332

Navaneetham K (2002) Age structural transition and economic growth: evidence from South and
South-east Asia, working paper No. 337, Centre for Population Studies, Thiruvananthapuram
Panagariya A (2004) Growth and reforms during the 1980s and 1990s. Econ Polit Wkly 19:2581–
2594
Rodrik D, Subramanian A (2005) From Hindu growth to productivity surge: the mystery of the
Indian growth transition, IMF Staff Papers, International Monetary Fund vol 52(2)
United Nations (1962) Demographic yearbook United Nations, New York, NY
United Nations (1973) The determinants and consequences of population trends, Department of
economic and social affairs, population studies 50. United Nations, New York
United Nations (2007) World population prospects: the 2006 revision. United Nations Population
Division, New York
Wei Z, Hao R (2010) Demographic structure and economic growth: evidence from China. J Comp
Econ 38:472–91
Williamson J, Higgins M (2001) The accumulation and demography connection in East Asia. In:
Mason A (ed) Population change and economic development in East Asia. Stanford
University Press, Stanford, pp 123–54
Growth Curve of Yam from Incomplete
Data in Saline Soil of Sunderban

Ratan Dasgupta

Abstract We propose a new technique for estimating the growth curve of Elephant foot
yam when plant lifetime data are missing and have to be indirectly estimated. From an
auxiliary variable associated with plant lifetime, we estimate the missing variable.
Lifetime of yam plants in general follows a Weibull distribution. As the plants mature
gradually, downfall of canopy radius from peak to a target canopy radius, after which
time no substantial additional yam weight gain is expected, provides an estimate of
lifetime for yam plants. The target canopy radius is so selected, as to minimise the
Anderson-Darling (AD) statistic of Weibull fit for plant lifetime. The estimated plant
lifetime is then used to obtain yam growth curve with missing lifetime data. The
proposed model of estimating growth curve is validated on a different data set.

Keywords Elephant foot yam · Stress elasticity · Weibull distribution ·
Anderson-Darling (AD) statistic

MS subject classification: 62P10

1 Introduction, Genesis of the Problem and Plant Lifetime

Investigation on the growth of an improved variety of yam, near the saline water river
Bidhyadhari of Sunderban, was initiated recently while visiting the Manmathanagar
government seed farm in the context of a suitability study on different coconut cultivars
planted therein about 28 years ago. For studying yam growth, on 23 February 2015,
about 13 kg of Bidhan kusum yam seed was taken from the Indian Statistical Institute
Giridih Farm for plantation in Manmathanagar. Cut seed corms of yam of weight
within the range 275–600 g were planted in 30 pits of 9 in. depth following the usual
procedure adopted for planting yam in the Giridih farm. The plot selected is slightly
away from the Bidhyadhari river with salinity of 30 g/L, near the Farm office.

R. Dasgupta (B)
Theoretical Statistics and Mathematics Unit, Indian Statistical Institute, 203 B T Road,
Kolkata 700108, India
e-mail: ratandasgupta@gmail.com; rdgupta@isical.ac.in

© Springer International Publishing AG 2017 235


R. Dasgupta (ed.), Growth Curve Models and Applications, Springer Proceedings
in Mathematics & Statistics 204, DOI 10.1007/978-3-319-63886-7_12

Sprouting of the seed corms was observed in Manmathanagar within the usual
time span with healthy stems. However, with the progress of the rainy season, water
stagnation made the green leaves of the plants slightly yellow. Porous soil, as found in
the Giridih farm, is conducive to water passage. Soil in the Manmathanagar seed farm is of
a different type. Lack of proper drainage and clay type soil made the situation worse;
water stagnation lasted longer in the plots of the seed farm than in the Giridih experimental
field. Growth of plants was retarded and some plants had a short lifetime after the water
drainage was cleared. Sensitive growth characteristics like canopy radius of the plants
showed a downward trend over time.
Precise data on lifetime may not always be available in experimental regions,
especially in a zone that is at times difficult to access during a growth experiment. The
number of observations may sometimes be sparse, and the values of the response variable,
like yield, may be available only at a later stage. Individual plant lifetime may have to be
indirectly estimated from auxiliary variables coupled with available information on
plant lifetime, while computing the growth curve.
In the present study we consider growth curve estimation of yam, where plants
are under stress due to rain water stagnation. In the middle of the experiment, plant
survival is endangered due to water logging around yam plants from excessive rain,
and lifetime data on plants are unavailable. In such an environment of cultivation,
estimation of the yam growth curve from limited data is difficult and we need the help
of an auxiliary variable.
Canopy radius of yam plants is a stress sensitive variable. After reaching a maximum,
a sharp fall of canopy radius is seen when plants are subjected to stress. Yam
plants were short lived after rain water stagnation and leaves turned pale; data recording
became less frequent.
A variable's sensitivity to a change in another variable is termed elasticity.
The effect of stress elasticity of canopy radius is observed on the plants after a period of
stress exertion; this can be seen from the decreasing slope of radius over time.
From the sharp fall of canopy radius in the post stress period, we estimate the time
required to reach a predetermined terminal canopy radius, maintaining the individual
slope of fall in radius for each plant. Thus plant lifetime is indirectly calculated from the
time projected for the radius to fall to a target canopy radius, on attaining which plants are
supposed to be mature. The target canopy radius is so selected that the resultant lifetime of
yam adheres to a known property of yam plant lifetime; viz., the lifetime
follows a Weibull distribution. This provides reconstructed yam data for analysis.
Farmers observe the condition of above ground biomass to infer about maturity of
yam developed underground. Canopy radius is a stress and age sensitive variable. We
record canopy radius of all healthy stems in a plant and then the maximum radius over
stems in individual plants is chosen. From the decreasing slope of canopy radius when
stems turn frail in mature yam plants over time, one may infer about the remaining
time left for harvest.
Growth Curve of Yam from Incomplete Data in Saline Soil of Sunderban 237

2 Data Analysis

Data collected in this experiment are scanty. Yam corms were planted in 30 pits. In
some pits the yield is nil due to water stagnation. Table 1 provides the data on initial
seed weight and the final yam weight in grams on harvest.

Table 1 Yam yield (gm) in Sunderban

Plant no   Seed_wt (gm)   Final_wt (gm)
1          400            NIL
2          450            500
3          375            NIL
4          325            200
5          275            300
6          300            NIL
7          350            300
8          325            NIL
9          350            600
10         400            NIL
11         400            700
12         500            2400
13         500            2000
14         500            1300
15         350            1250
16         375            700
17         400            800
18         400            900
19         400            1000
20         400            1000
21         550            2300
22         450            2300
23         500            1000
24         550            300
25         600            1300
26         400            800
27         525            550
28         500            700
29         450            750
30         600            200
Fig. 1 Plant height of yam (x-axis: life time (day); y-axis: stem height (cm))

Below we state the main features of the data on plant height, canopy radius etc.
Data points in Fig. 1 represent yam plant heights in the Sunderban experiment. Short
lived plants are shown with a single reading, as no further reading after the first
could be taken. For plants having multiple readings, the readings are joined by straight
lines. The mean value at a fixed lifetime is shown in red color wherever at least
one observation on plant height is available at that lifetime. Mean values are joined by red
lines to represent the overall mean response curve of yam plant height for plants having at
least two observations. In general, the response curve shows an increasing trend, with a
slight decrease over a region after the lifetime of 150 days.
Girth of yam plant over time is an important growth characteristic. Girths at the
top of yam plants are shown in Fig. 2. Mean response curve is computed as before
and shown in red color. A downward trend is prominent for a time region after the
plant lifetime of 150 days, much like the pattern seen in Fig. 1.
With a prominent drop of plant girth near 150 days as seen in Fig. 3, plant girth
at the middle seems to be more sensitive to stress than girth at the top.
Canopy radius is a variable sensitive to plant stress. Figure 4 shows a downward
trend of canopy radius over a wide time segment of the experiment damaged by water
stagnation. Plant lifetime data are missing and need to be indirectly estimated via such a
sensitive variable under appropriate assumptions.
The first principal component $y = 0.7669x_1 - 0.6384x_2 - 0.0467x_3 - 0.0466x_4$,
where $x_1$, $x_2$, $x_3$ and $x_4$ are plant height, canopy radius, girth at the top, and girth
at the middle respectively, explains 93.13% of variation in data. Principal component's
Fig. 2 Girth at the top of yam plant (x-axis: life time (day); y-axis: girth at the top (cm))

Fig. 3 Girth at the middle of yam plant (x-axis: life time (day); y-axis: girth at the middle (cm))

Fig. 4 Canopy radius of yam plant (x-axis: life time (day); y-axis: canopy radius (cm))

growth during 3–5 months is steep in Fig. 5, indicating that a lot of changes did occur
during that time period.
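As a small illustration of how the first principal component quoted above is evaluated, the following Python sketch applies the stated loadings to one set of measurements. The measurement values are hypothetical, and the text does not say whether the variables were centred before the component was computed, so the sketch uses raw values purely for illustration.

```python
import math

# Loadings of the first principal component, as quoted in the text, for
# (plant height, canopy radius, girth at the top, girth at the middle).
LOADINGS = (0.7669, -0.6384, -0.0467, -0.0466)


def first_pc_score(height, canopy_radius, girth_top, girth_middle):
    """Linear combination y = 0.7669*x1 - 0.6384*x2 - 0.0467*x3 - 0.0466*x4."""
    values = (height, canopy_radius, girth_top, girth_middle)
    return sum(w * x for w, x in zip(LOADINGS, values))


# A principal component direction has unit length; the quoted loadings
# satisfy this up to rounding.
loading_norm = math.sqrt(sum(w * w for w in LOADINGS))

# Hypothetical measurements in cm (not taken from the experiment).
example_score = first_pc_score(60.0, 50.0, 8.0, 10.0)
print(round(loading_norm, 4), round(example_score, 4))
```

The loadings show that the component is essentially a contrast between plant height and canopy radius, with the two girth variables contributing very little.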
A steep fall in plant canopy radius over time is seen in Fig. 4. Assume that the
rate of fall computed from the first two readings is maintained until the yam plant is
mature, that is, until the terminal canopy radius is reached. For each plant we find the time
at which a radius of 20 cm is reached under this constant rate of fall. With that calculated
time as the predicted plant lifetime, we
obtain individual growth curves.
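The projection described above amounts to a linear extrapolation. A minimal Python sketch follows, assuming each plant has two canopy readings (day, radius); the function name and the reading values are hypothetical and serve only to illustrate the arithmetic.

```python
def projected_lifetime(t1, r1, t2, r2, target_radius):
    """Day on which the canopy radius is projected to reach target_radius,
    assuming the rate of fall between the two readings is maintained."""
    slope = (r2 - r1) / (t2 - t1)  # cm per day; negative when the canopy shrinks
    if slope >= 0:
        raise ValueError("canopy radius is not falling; projection undefined")
    # Straight line through (t2, r2) with the observed slope, solved for the target.
    return t2 + (target_radius - r2) / slope


# Hypothetical readings: 60 cm on day 150 and 51 cm on day 180 (-0.3 cm/day).
for target in (20.0, 35.0, 40.0):
    print(target, round(projected_lifetime(150, 60.0, 180, 51.0, target), 1))
```

A smaller terminal radius gives a longer projected lifetime, which is why the choice of terminal radius matters for the resulting growth curve.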
Mean response at different time points over the curves is computed wherever at least
one data point at that time is available. These mean points, after lowess regression
with f = 2/3, then provide the growth curve shown in Fig. 6. Time for yam maturity
is up to 450 days. This high value is unusual and needs to be rectified.
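One possible reading of the mean response step is sketched below in Python: each individual growth curve is treated as piecewise linear, and at every time point where some plant has an observation, the interpolated values of all curves covering that time are averaged. This interpretation, the function names, and the data are assumptions made for illustration. The lowess smoothing itself is not reproduced here.

```python
def interpolate(curve, t):
    """Linear interpolation on a curve of (time, value) pairs sorted by time.
    Returns None when t lies outside the time range of the curve."""
    if t < curve[0][0] or t > curve[-1][0]:
        return None
    for (t0, v0), (t1, v1) in zip(curve, curve[1:]):
        if t0 <= t <= t1:
            if t1 == t0:
                return v0
            return v0 + (v1 - v0) * (t - t0) / (t1 - t0)
    return curve[0][1]  # single-point curve observed exactly at t


def mean_response(curves):
    """Mean over curves at every time point carrying at least one observation."""
    times = sorted({t for curve in curves for t, _ in curve})
    means = []
    for t in times:
        values = [v for v in (interpolate(c, t) for c in curves) if v is not None]
        means.append((t, sum(values) / len(values)))
    return means


# Hypothetical growth curves: (day, yam weight in kg).
curves = [
    [(0, 0.4), (200, 1.0)],
    [(0, 0.5), (100, 0.9), (250, 2.3)],
]
for day, mean_weight in mean_response(curves):
    print(day, round(mean_weight, 3))
```

The resulting mean points would then be passed to a smoother such as lowess to obtain the overall growth curve.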
We find the time when canopy radius 35 cm is reached for individual plants. With
that as predicted plant lifetime, we obtain individual growth curves. Lowess growth
curve with f = 2/3 is drawn from mean over individual curves at fixed time points
where at least one recorded observation is available. Time to maturity of yam is now
seen to be lower; this is up to 300 days (Fig. 7).
Next we find the time when canopy radius 40 cm is reached. With that time as
predicted plant lifetime, we obtain individual growth curves. Lowess growth curve
with f = 2/3 is drawn from mean over individual curves at fixed time points where
at least one recorded observation is available in data. Time to maturity of yam is up to
Fig. 5 First principal component (x-axis: life time (day); y-axis: first principal component (cm))

Fig. 6 Yam growth curve with 20 cm as terminal canopy radius (x-axis: life time (day); y-axis: yam yield (kg))

Fig. 7 Yam growth curve with 35 cm as terminal canopy radius (x-axis: life time (day); y-axis: yam yield (kg))

230 days, which is usual, see Fig. 8. Thus, a reasonable choice of terminal canopy
radius seems to be near 40 cm.
We now compare the situation with another growth experiment. A growth experiment
was conducted in the year 2013 at the Indian Statistical Institute Giridih Farm, see
Fig. 3 of Dasgupta (2017); this is shown here as Fig. 9. Yam plants with seed weight
500 gm were uprooted in the middle of the experiment for taking an interim reading and
were then replanted, thus inducing plant stress. Canopy radius shows a sharp fall from
the peak after intervention in Fig. 9. Times of intervention are shown by vertical lines.
The mean weight of seed corm in the Sunderban experiment is 430 gm; this is close to the
seed weight 500 gm of the Giridih experiment, which allows comparison.
A similar feature of sharp fall of canopy radius from peak after intervention is also
seen for yam plants with seed weight 650 gm in Giridih Farm experiment, see Fig. 2
in Dasgupta (2017); this is shown here as Fig. 10. It appears that the plants’ stress
due to intervention is reflected by subsequent decrease in sensitive canopy radius.
In general, yam plant lifetime is seen to follow a Weibull distribution, e.g., see
Dasgupta (2014). We check the target canopy radius for which the estimated lifetime
data are closest to a Weibull model. The Anderson-Darling statistic, see Anderson
and Darling (1952), is 1.038 for 12 yam plants with target canopy radius 39 cm, see
Fig. 11.
The Anderson-Darling statistic is 1.037 for yam plants with target canopy radius
38.5 cm, see Fig. 12. The value is lower compared to that for 39 cm.
Fig. 8 Yam growth curve with 40 cm as terminal canopy radius (x-axis: life time (day); y-axis: yam yield (kg))

The Anderson-Darling (AD) statistic increases to 1.04 for yam plants with target
canopy radius 38 cm, see Fig. 13. The best Weibull fit for Sunderban yam plants’
lifetime thus seems to be when 38.5 cm is taken as terminal canopy radius.
A detailed picture of the variation of the AD statistic with change in canopy radius is
provided in Fig. 14. The fall in the value of the statistic is moderate in the beginning and
then sharp after canopy radius exceeds 25 cm, with the minimum value of the AD statistic
attained at 38.5 cm.
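The search over terminal canopy radius can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the Weibull parameters are fitted by maximum likelihood, the unadjusted Anderson-Darling formula is used, and the readings are hypothetical. The text does not state which fitting method or which variant of the AD statistic was used (statistical packages often report an adjusted version), so the numbers produced here are not expected to match those quoted in the chapter.

```python
import math


def weibull_mle(data):
    """Maximum-likelihood Weibull (shape, scale), by bisection on the shape."""
    top = max(data)
    y = [x / top for x in data]  # rescale so that powers cannot overflow
    mean_log = sum(math.log(v) for v in y) / len(y)

    def score(k):
        # Likelihood equation for the shape parameter; increasing in k.
        powers = [v ** k for v in y]
        weighted = sum(p * math.log(v) for p, v in zip(powers, y))
        return weighted / sum(powers) - 1.0 / k - mean_log

    lo, hi = 1e-3, 1e3
    if score(lo) >= 0 or score(hi) <= 0:
        raise ValueError("shape not bracketed; data may be nearly constant")
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if score(mid) < 0:
            lo = mid
        else:
            hi = mid
    shape = 0.5 * (lo + hi)
    scale = top * (sum(v ** shape for v in y) / len(y)) ** (1.0 / shape)
    return shape, scale


def anderson_darling_weibull(data):
    """Unadjusted A^2 of the data against the fitted Weibull distribution."""
    shape, scale = weibull_mle(data)
    xs = sorted(data)
    n = len(xs)
    z = [(x / scale) ** shape for x in xs]
    log_cdf = [math.log(-math.expm1(-v)) for v in z]  # ln F(x)
    log_sf = [-v for v in z]                          # ln(1 - F(x))
    total = sum((2 * i + 1) * (log_cdf[i] + log_sf[n - 1 - i]) for i in range(n))
    return -n - total / n


def lifetimes_for_target(readings, target):
    """Projected lifetime per plant from two readings (t1, r1, t2, r2)."""
    return [t2 + (target - r2) * (t2 - t1) / (r2 - r1) for t1, r1, t2, r2 in readings]


def best_target(readings, candidates):
    """Candidate terminal radius with the smallest AD statistic."""
    return min((anderson_darling_weibull(lifetimes_for_target(readings, c)), c)
               for c in candidates)


# Hypothetical readings for 12 plants: (day 1, radius 1, day 2, radius 2), in cm.
READINGS = [
    (150, 62.0, 180, 54.0), (150, 58.0, 180, 49.0), (150, 65.0, 185, 57.0),
    (155, 60.0, 185, 50.0), (150, 55.0, 180, 48.0), (160, 63.0, 190, 52.0),
    (150, 59.0, 175, 53.0), (155, 61.0, 190, 51.0), (150, 57.0, 180, 47.0),
    (160, 64.0, 195, 55.0), (150, 56.0, 185, 50.0), (155, 66.0, 185, 56.0),
]
CANDIDATES = [20 + 0.5 * i for i in range(41)]  # 20.0, 20.5, ..., 40.0 cm

ad_value, radius = best_target(READINGS, CANDIDATES)
print("terminal radius with smallest AD statistic:", radius, "cm")
```

Plotting the AD statistic against every candidate radius, rather than keeping only the minimiser, gives a profile of the kind shown in Fig. 14.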
The scale parameter of the Weibull fit varies almost linearly with increase in canopy
radius, see Fig. 15. The estimated slope of the scale parameter curve is −7.133 at the
canopy radius of 38.5 cm that minimises the AD statistic.
The shape parameter of the Weibull fit for Sunderban plants drops slowly in the beginning.
With further increase in canopy radius the fall of the shape parameter is sharp. The curve
in Fig. 16 is concave. The estimated slope of the shape parameter curve at the
minimising canopy radius of 38.5 cm is −0.08801.
A similar Weibull analysis of yam lifetime for the Giridih experiment, with interim
intervention of uprooting and replanting 20 yam plants having seed weight 500 gm,
shows that a terminal canopy radius of 36 cm, required to compute lifetime, provides the
best Weibull fit to yam lifetime in terms of the Anderson-Darling statistic. For each curve
in Fig. 9 we computed several (up to four) successive slopes after the peak is achieved,
that is, after intervention, then took the minimum slope and computed the further time
needed, from the time of observing that minimum slope, to attain the target canopy
radius while maintaining the same rate of fall. Total lifetimes so computed for the 20 plants are
Fig. 9 Canopy radius of 20 yam plants with seed wt. 500 gm. in calendar days (x-axis: time, 11 Apr 13 to 23 Dec 13; y-axis: leaf length (cm))

then put in a Weibull plot. The best fit is seen to be attained at the canopy radius of 36 cm
for the 20 plant lifetimes, see Fig. 17.
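The lifetime calculation used for the Giridih plants can be sketched in Python as below. The sketch assumes that the "time of observing" a slope is the later of the two readings that define it, which the text does not spell out. The function name and the series of readings are hypothetical.

```python
def lifetime_from_steepest_fall(series, target_radius, max_slopes=4):
    """series: (day, canopy radius) readings sorted by day.
    Takes up to max_slopes successive slopes after the peak, keeps the
    minimum (steepest fall) and projects from there to target_radius."""
    peak = max(range(len(series)), key=lambda i: series[i][1])
    after_peak = series[peak:peak + max_slopes + 1]
    slopes = [((r1 - r0) / (t1 - t0), t1, r1)
              for (t0, r0), (t1, r1) in zip(after_peak, after_peak[1:])]
    if not slopes:
        raise ValueError("no reading after the peak")
    slope, t_obs, r_obs = min(slopes)  # tuples compare on the slope first
    if slope >= 0:
        raise ValueError("canopy radius does not fall after the peak")
    # Further time needed from the observation carrying the minimum slope.
    return t_obs + (target_radius - r_obs) / slope


# Hypothetical readings: peak of 80 cm on day 90, steepest fall of
# -0.6 cm/day between day 120 and day 150.
series = [(30, 40.0), (60, 70.0), (90, 80.0), (120, 74.0), (150, 56.0), (180, 50.0)]
print(round(lifetime_from_steepest_fall(series, 36.0), 1))
```

Using the minimum slope rather than the first post-peak slope gives the shortest projected remaining time among the slopes considered.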
Recall that, for the growth experiment conducted at Sunderban, there is a fall in
canopy radius towards yam plant maturity. From the Weibull probability plot, the target
canopy radius of 38.5 cm seems to minimise the AD statistic. The value 38.5 cm is a bit
higher than the 36 cm corresponding to yam plants grown in the Giridih experiment with
less plant stress.
Plants at Sunderban, facing the comparatively severe stress of brief water stagnation,
seem to attain maturity a bit earlier than those in Giridih, which were subjected to uprooting
and replanting for interim growth data recording. It is quite common that growth rate is
high in yam plants when subjected to stress.
In Giridih experiment there are 20 plants with seed weight 500 gm, whereas in
Sunderban experiment there are only 12 plants. To see the efficacy of calculating the
plant lifetime via auxiliary information of canopy radius, we sort the 20 (hypothetical)
plant lifetimes in Giridih as marked in Fig. 17 of Weibull plot and delete the highest 4
and lowest 4 observations, and retain only the middle 12 observations. The individual
growth curves of these identified 12 plants of Giridih experiment are shown in Fig. 18.
These shall have to be compared with growth curves with estimated lifetime. Here
initial weight is 500 gm, interim weight is recorded while uprooting and replanting in
Fig. 10 Canopy radius of 20 yam plants with seed wt. 650 gm. in calendar days (x-axis: time, 11 Apr 13 to 23 Dec 13; y-axis: leaf length (cm))

Fig. 11 Weibull fit of plant life at Sunderban with target canopy radius 39 cm

Fig. 12 Weibull fit of plant life at Sunderban with target canopy radius 38.5 cm

Fig. 13 Weibull fit of plant life at Sunderban with target canopy radius 38 cm

the middle of experiment, and the final weight is assigned at realized plant lifetime.
Thus each growth curve has three points. One growth curve on the top in Fig. 18
seems to be an outlier.
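The selection of the middle 12 of the 20 Giridih lifetimes, described above, is a symmetric trimming of the sorted sample. A minimal Python sketch, with hypothetical lifetime values and a hypothetical function name:

```python
def middle_observations(values, trim_each_side):
    """Sort the values and drop trim_each_side from each end."""
    ordered = sorted(values)
    return ordered[trim_each_side:len(ordered) - trim_each_side]


# Hypothetical lifetimes (days) for 20 plants.
lifetimes = [231, 198, 245, 210, 187, 260, 223, 205, 239, 215,
             192, 228, 251, 219, 201, 234, 242, 208, 226, 213]
retained = middle_observations(lifetimes, 4)
print(len(retained), retained[0], retained[-1])
```

Keeping only the central order statistics brings the Giridih sample to the same size as the Sunderban sample of 12 plants.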
The individual growth curves of the 12 plants in Giridih experiment with hypo-
thetical lifetime calculated via canopy radius are shown in Fig. 19 for comparison
with Sunderban experiment. Here, in Fig. 19, initial weight is 500 gm, interim yam
weight is recorded while uprooting the plant and replanting it in the middle of exper-
iment, and the final yam weight is assigned at hypothetical plant lifetime computed
via fall in canopy radius. Thus each growth curve has three points. The growth curve
on the top in Fig. 19 seems to be an outlier.
Deleting the outlier curve on the top from Figs. 18 and 19 related to Giridih
experiment, we compute the overall growth curve. Mean response at different time
points over 11 curves are computed where at least one data point on that time is
Fig. 14 AD statistic for Weibull fit with change in canopy radius (x-axis: canopy radius; y-axis: AD statistic)

Fig. 15 Scale parameter of Weibull fit with change in canopy radius (x-axis: canopy radius; y-axis: scale)

Fig. 16 Shape parameter of Weibull fit with change in canopy radius (x-axis: canopy radius; y-axis: shape)

Fig. 17 Weibull fit for 20 plant life time with target canopy radius 36 cm

available. These mean points, after lowess regression with f = 2/3, then provide the
growth curves in Fig. 20a, b respectively. The observed and estimated growth curves
in the Giridih experiment for seed weight 500 gm are seen to be similar except towards the
far end of the graph, indicating that the proposed method of estimating yam plant
lifetime via canopy radius is satisfactory.
Since the adopted procedure of estimating lifetime via canopy radius seems satis-
factory as cross checked by yam data of Giridih Farm, in Fig. 21 we use the procedure
Fig. 18 Growth curve of 12 plants (x-axis: life time (day); y-axis: yam yield (kg))

Fig. 19 Estimated growth curve of 12 plants (x-axis: estimated life time (day); y-axis: yam yield (kg))

Fig. 20 Lowess growth curve of observed and estimated data (panel (a): observed life time (day); panel (b): estimated life time (day); y-axis: yam yield (kg))

Fig. 21 Yam growth curve with 38.5 cm as terminal canopy radius (x-axis: life time (day); y-axis: yam yield (kg))

Fig. 22 Yam growth curve with 38.5 cm as terminal canopy radius (spline) (x-axis: life time (day); y-axis: yam yield (kg))

to estimate the growth curve of Sunderban yam by lowess regression with f = 0.4.
Mean response at different time points over 12 curves is computed wherever at least
one data point at that time is available. These mean points, after lowess regression,
provide the desired growth curve of yam when lifetime data of yam plants are
missing and auxiliary information on canopy radius is used to estimate lifetime instead.
In Fig. 22 we estimate the growth curve of Sunderban yam by spline regression
in SPlus with smooth.spline and spar = 0.0001. The curve is relatively smooth compared
to the curve in Fig. 21 obtained from lowess regression.
The experimental situation was adverse in the first plantation in Sunderban. The total yam
yield on harvest was 24.15 kg from a total seed weight of 12.90 kg, that is, a yield of 1.872
times the seed weight. Experiments are now being conducted on a slanting piece of land of
the seed farm, where seed corms are planted just below the ground level to avoid the water
stagnation in rainy season faced earlier.
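The totals quoted above can be checked directly from Table 1. The short Python sketch below uses the table values as given, with NIL entries counted as zero yield (an assumption made for the purpose of the sum).

```python
# Seed weight and final yam weight (gm) for plants 1 to 30, from Table 1.
SEED_WT = [400, 450, 375, 325, 275, 300, 350, 325, 350, 400,
           400, 500, 500, 500, 350, 375, 400, 400, 400, 400,
           550, 450, 500, 550, 600, 400, 525, 500, 450, 600]
# None marks a NIL entry, counted here as zero yield.
FINAL_WT = [None, 500, None, 200, 300, None, 300, None, 600, None,
            700, 2400, 2000, 1300, 1250, 700, 800, 900, 1000, 1000,
            2300, 2300, 1000, 300, 1300, 800, 550, 700, 750, 200]

total_seed = sum(SEED_WT)                                # gm
total_yield = sum(w for w in FINAL_WT if w is not None)  # gm
yield_ratio = total_yield / total_seed
mean_seed = total_seed / len(SEED_WT)

print(total_seed / 1000, total_yield / 1000)  # 12.9 24.15 (kg)
print(f"{yield_ratio:.3f}")                   # 1.872
print(mean_seed)                              # 430.0 (gm)
```

The mean seed weight of 430 gm agrees with the value used earlier when comparing the Sunderban and Giridih experiments.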

3 Discussions

Lifetime, an important variable in constructing growth curve, may sometimes be
missing. Precise data on lifetime may not always be available in experimental regions,
especially in a zone that is difficult to access readily at times in the production period.
Taking measurement is problematic in not-so-conducive environment and number
of observations may sometimes be scanty. However, at a later stage, the values of
the response variable like yield of an agricultural product on maturity are available.
Prior knowledge on lifetime behavior from experiments conducted earlier is likely

to provide additional information in data analysis. Individual plant lifetime may then
be indirectly estimated from auxiliary variables coupled with available information
on lifetime.
In the present study we consider growth curve estimation of yam, where plants
are under stress due to rain water stagnation in a farm of Sunderban, South Bengal. In
the middle of experiment, plant survival is endangered due to water logging around
yam plants from excessive rain and lifetime data on plants are unavailable. In such an
environment of cultivation, estimation of yam growth curve is difficult from limited
data unless we take help of auxiliary information.
Canopy radius of yam plants is a stress sensitive variable. After reaching a maximum,
a sharp fall of canopy radius is seen when plants are subjected to stress. Yam
plants were short lived after water stagnation and leaves turned pale; data recording
was less frequent.
Severe plant stress results in a sharp fall in canopy radius. Elasticity is a measure of a
variable's sensitivity to a change in another variable, see e.g., Atanackovic and Guran
(2000). The concept of elasticity can be adopted in many situations. Stress elasticity of
canopy radius of yam plant may take effect after a period of stress exertion; this can
be seen from the decreasing slope of radius over time.
From the sharp fall of canopy radius in the post stress period, we estimate the time
required to reach a predetermined terminal canopy radius, maintaining the individual
slope of fall in radius for each plant. Thus plant lifetime is indirectly calculated from the
time projected for the radius to fall to a target canopy radius, on attaining which plants are
taken to be mature. This provides reconstructed yam data for analysis. Adjustment for the
choice of terminal canopy radius is made by comparing the hypothetical lifetime with the
span of yam plant lifetime observed in other experiments.
Lifetime estimation is made precise by using the prior information on yam lifetime
distribution. It is known that in general yam plant lifetime follows a Weibull distribution.
Stress on yam plants may change the parameters of the Weibull distribution,
much like the assumptions made in accelerated life testing of industrial products, e.g.,
see Nelson (1980). For different choices of target canopy radius, we look for the minimum
value of the Anderson-Darling statistic to find the best Weibull fit for plant lifetime.
With terminal radius so selected as to minimize AD statistic, we compute the yam
plant lifetime and subsequently estimate individual growth curves of underground
yam. The adopted procedure is similar to structural equation modeling, where we
assume that the unobserved plant lifetime in an experiment is functionally related to
the time required for canopy radius to fall at a terminal radius with plant specific rate
of fall, and the unobserved plant lifetime affects the growth curve under estimation.
From observed slope of decrease in canopy radius after plant stress, we estimate
individual plant lifetime as sum of the time from sprouting to plant stress, plus the
remaining plant lifetime estimated from steep decrease in canopy radius on post
stress period. To validate the procedure we check this in a situation where full data
on yam growth experiment are available. The proposed procedure of growth curve
estimation with incomplete data produces similar results in experiment with full data,
justifying the procedure.
Farmers observe condition of above ground biomass to infer about maturity of
yam developed underground. Canopy radius is a stress and age sensitive variable. We
record canopy radius of all healthy stems in a plant and then the maximum radius over
stems in individual plants is chosen. From the decreasing slope of canopy radius,
when stems turn frail for mature yam plants over time, one may infer about the
remaining time left for harvest. Thus the procedure of estimating yam lifetime based
on diminishing canopy radius is reasonable.
Once a sharp decrease in canopy radius is observed and leaves are pale, develop-
ment of yam underground is nearly complete. In such a situation little time is left for
further increase in yam weight towards maturity. The remaining lifetime of plants
after stress is induced may then be estimated by the time to reach a terminal canopy
radius, maintaining the observed steep fall in radius. Minimum of slopes observed
is a nonparametric estimate for the finite lower end point in the support of slope
distribution.
Validation of the adopted technique is made in an experiment with full data. In
the case of Giridih experiment with complete data on yam, we examine several
successive post stress slopes, and choose the minimum to infer about the remaining
lifetime to achieve terminal radius with selected elasticity of stress.

References

Anderson TW, Darling DA (1952) Asymptotic theory of certain goodness-of-fit criteria based on
stochastic processes. Ann Math Stat 23:193–212
Atanackovic TM, Guran A (2000) Hooke’s law. Theory of elasticity for scientists and engineers.
Birkhäuser, Boston
Dasgupta R (2014) Characterization theorems for Weibull distribution with applications. J Environ
Stat 6(4):1–25. (UCLA, Dept. of Stat.)
Dasgupta R (2017) Growth curve of elephant-foot-yam, plant stress and Mann-Whitney U-statistics.
Appearing in this volume as chap. 1
Nelson W (1980) Accelerated life testing-step-stress models and data analyses. IEEE Trans Reliab
2:103
