
The Canadian Journal of Statistics / La revue canadienne de statistique, Vol. 39, No. 2, 2011, Pages 181–217. DOI: 10.1002/cjs

**Case studies in data analysis**

Alison L. GIBBS1 *, Kevin J. KEEN2 and Liqun WANG3

1 Department of Statistics, University of Toronto, Toronto, ON, Canada M5S 3G3
2 Department of Mathematics and Statistics, University of Northern British Columbia, Prince George, BC, Canada V2N 4Z9
3 Department of Statistics, University of Manitoba, Winnipeg, Man., Canada R3T 2N2

The following short papers are summaries of student contributions to the Case Studies in Data Analysis from the Statistical Society of Canada 2009 annual meeting. Case studies have been an important part of the SSC annual meeting for many years, providing the opportunity for students to delve into interesting problems and data sets and to present their findings at the meeting. Since 2008, prizes have been awarded for the best poster presentations for each of two case studies. The case studies at the 2009 annual meeting and the selection of this suite of papers were organized by Gibbs and Keen.

This section consists of two groups of papers corresponding to the two case studies. Each subsection starts with an introduction by the data donors, followed by the winning paper and the contributed papers, and ends with a discussion and summary by the data donors.

The theme of case study 1 is the identification of relevant factors for the growth of lodgepole pine trees. First, Dean, Gibbs, and Parish provide an introduction to the data and the problems of scientific interest. The winning paper authors, Cormier and Sun, first use a nonparametric smoothing technique to identify a nonlinear relationship between the growth rate and the age of the trees. They then use a mixed model to explain the growth rate in terms of age and other environmental factors. In the second paper, Salamh first estimates a similar mixed model and then supplements the analysis with a dynamic model.

The theme of case study 2 is the classification of disease status through proteomic biomarkers. Balshaw and Cohen-Freue introduce the data and problems of interest. The winning paper, authored by Lu, Mann, Saab, and Stone, first explores various data imputation techniques, including k-nearest neighbours, local least squares and singular value decomposition, and then applies various variable selection methods such as the LASSO, least angle regression (LARS) and sparse logistic regression. This paper is accompanied by four contributed papers which use various modern classification techniques. Guo, Chen, and Peng use a score procedure to classify the disease status. Liu and Malik employ a multiple testing procedure. Meaney, Johnston and Sykes apply support vector machines (SVM). Wang and Xia use classification tree and logistic regression techniques. A summary and comparison of these methods and outcomes are given by Balshaw and Cohen-Freue.

We are grateful to Charmaine Dean of Simon Fraser University, Roberta Parish of the British Columbia Ministry of Forests and Range, and Rob Balshaw and Gabriela Cohen-Freue of the

* Author to whom correspondence may be addressed. E-mail: alison.gibbs@utoronto.ca © 2011 Statistical Society of Canada / Société statistique du Canada


NCE CECR PROOF Centre of Excellence for the use of their data and their contributions to this suite of papers. We also thank the former and current Editors of the CJS, Paul Gustafson and Jiahua Chen, for agreeing to publish these papers and for their patience and support during the editorial process.

Received 31 October 2010 Accepted 6 January 2011


**Case study 1: The effects of climate on the growth of lodgepole pine**

C.B. DEAN1 *, Alison L. GIBBS2 and Roberta PARISH3

1 Statistics and Actuarial Science, Simon Fraser University, Burnaby, BC, Canada V5A 1S6
2 Department of Statistics, University of Toronto, Toronto, ON, Canada M5S 3G3
3 British Columbia Ministry of Forests, Victoria, BC, Canada V8N 1R8

Key words and phrases: Climate effects; tree growth; biomass; lodgepole pine.

1. BACKGROUND

To compete successfully in the world economy, the commercial forestry industry requires an understanding of how changes in climate influence the growth of trees. The goal of this case study was to examine how well-known climate variables, combined with estimated crown biomass, can predict wood accumulation in lodgepole pine.

In order to model the growth and yield of trees over time, we need to determine how much wood a tree accumulates each year. Each year, a tree lays down an annual ring of wood in a layer under the bark. Pressler's hypothesis states that the area of wood laid down annually, measured by the cross-sectional area increment, increases linearly from the top of the tree to the base of the crown (the location of the lowest live branches); it is based on the assumption that the area increment in the crown increases with the amount of foliage above the point of interest. Below the crown, the area increment in any given year remains constant down the bole until the region of butt swell at the base of most trees.

The growth of a tree in any given year is strongly influenced by growth in the previous years. One reason for this is that buds are formed the year before they start to grow, and carbohydrates from good years can be stored to fuel growth in subsequent years. The effects of previous growing conditions can last from 1 to 3 years, depending on tree species and location.

Climate affects growth and influences both the size of the annual ring of wood and the proportions of early and late wood. Low-density early wood is laid down during the spring when water is plentiful. Late wood, which is laid down from mid-summer until growth ceases in the fall, has a high density. Cessation of wood formation is sensitive to weather conditions such as temperature and drought.

Lodgepole pine (Pinus contorta Dougl. ex Loud.) stands dominate much of western Canada and the United States, covering over 26 million hectares of forest land. It is an important commercial species in British Columbia; stands consisting of more than 50% lodgepole pine occupy 58% of the forests in the interior of the province. Lodgepole pine is primarily used for lumber, poles, railroad ties, posts, furniture, cabinetry, and construction timbers. It is commercially important to be able to predict how lodgepole pine will grow and accumulate wood over time. Using high-resolution satellite images of lodgepole pine stands to predict wood attributes is under consideration, but first the relationship of crown properties, such as the amount of foliage, to wood properties and growth must be established.
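Pressler's hypothesis described above has a simple piecewise-linear form. The sketch below is an illustration only, with a hypothetical proportionality constant `k` and heights that are not taken from the case study data (butt swell at the base is ignored):

```python
def pressler_increment(h, tree_height, crown_base, k=1.0):
    """Predicted cross-sectional area increment at height h on the bole
    under Pressler's hypothesis: the increment grows linearly with the
    distance from the tree top (a proxy for foliage above the point)
    down to the base of the crown, and stays constant below the crown.
    k is a hypothetical proportionality constant."""
    if h > tree_height:
        return 0.0                             # above the tree top: no wood
    if h >= crown_base:
        return k * (tree_height - h)           # within the crown: linear
    return k * (tree_height - crown_base)      # below the crown: constant

# Example: a 30 m tree whose live crown starts at 18 m
print(pressler_increment(25.0, 30.0, 18.0))    # inside the crown -> 5.0
print(pressler_increment(10.0, 30.0, 18.0))    # below the crown -> 12.0
```

The constant segment below the crown is why, under the hypothesis, discs cut anywhere between the crown base and the butt swell should show similar area increments in a given year.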

* Author to whom correspondence may be addressed. E-mail: dean@stat.sfu.ca


2. THE DATA

Data on the annual growth and wood density of 60 lodgepole pine trees from four sites in two geographic areas in central British Columbia were provided for this investigation. Samples were removed at 10–13 locations along each tree and two radii (A and B) per sample disc were measured. Measurements of the last year of growth and wood density are often unreliable because of proximity to the bark and difficulties of sample preparation. However, it is for this ring only that we have measures of the amount of foliage.

Several growth outcomes are available, including the widths of the A and B radii, in millimetres, the percentages of late and early wood, and the early and late wood densities, in kg/m³. Foliar biomass measurements are provided for multiple branches; estimates are available for each annual whorl. The data on biomass include the average relative position of the branch in the crown (1 is the base of the crown and 0 is the top) and the corresponding foliar biomass (the mass, in kg/m², of needles subtended by the branches at that position). Other variables, such as the total height of the tree and the height to the base of the crown, both in metres, are also provided.

Climate data from Environment Canada arise from the two nearest stations with long-term records, Kamloops and Quesnel. For each of these locations, monthly and annual data are provided on: (1) the minimum temperature, in degrees Celsius, (2) the maximum temperature, in degrees Celsius, and (3) the total precipitation, in millimetres. Additional details on the data and variables are provided at www.ssc.ca/en/education/archived-case-studies/ssc-case-studies-2009.

3. OBJECTIVES

The primary objective of this case study was to determine to what extent climate, position on the tree bole (trunk), and current foliar biomass explain cross-sectional area increment and proportion of early and late wood. Other questions of interest included:

• How have temperature and precipitation affected the annual cross-sectional growth and the proportions of early and late wood in lodgepole pine?
• Is annual growth best explained by average annual temperature, or do monthly maximum and/or minimum values provide a better explanation? Do early and late wood need to be considered separately?
• Does the use of climate variables to predict the growth and proportions of early and late wood provide more reliable estimates than the use of the growth and density measurements from previous years, as measured from the interior rings?
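The response of interest, the cross-sectional area increment, is derived from the measured ring radii: the wood added in year t is the difference of successive ring areas. A minimal sketch, assuming cumulative radii measured outward from the pith (the radii below are hypothetical):

```python
import math

def area_increments(radii_mm):
    """Annual cross-sectional area increments (mm^2) from cumulative ring
    radii (mm) ordered from the pith outward: a_t = pi * (r_t^2 - r_{t-1}^2)."""
    out, prev = [], 0.0
    for r in radii_mm:
        out.append(math.pi * (r ** 2 - prev ** 2))
        prev = r
    return out

# Cumulative radii for four rings of a hypothetical sample disc
print([round(a, 1) for a in area_increments([2.0, 3.5, 4.5, 5.0])])
# -> [12.6, 25.9, 25.1, 14.9]
```

Note that a constant ring width produces an increasing area increment, which is why area rather than width is the natural scale for Pressler's hypothesis.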

Received 31 October 2010 Accepted 6 January 2011


**The determination of the relevant explanatory variables for the growth of lodgepole pine using mixed models**

Eric CORMIER* and Zheng SUN

Department of Mathematics and Statistics, University of Victoria, Victoria, BC, Canada V8W 2Y2

Key words and phrases: Climate effects; biomass; mixed models; tree growth.

Abstract: In this paper a mixed model with nested random effects was used to model the cross-sectional area increment of lodgepole pine based on relevant explanatory variables. This model was used to show that minimum monthly temperature, monthly precipitation, and foliar biomass are positively related to the cross-sectional area increment, while an ordinal variable approximating lower trunk position and maximum monthly temperature are negatively related. It was shown that annual growth is better explained by monthly maximum and minimum temperatures than by average annual temperature, and that the use of climate variables provided more reliable estimates for growth prediction than the use of the growth and density measurements from previous years. The Canadian Journal of Statistics 39: 185–189; 2011 © 2011 Statistical Society of Canada

1. INTRODUCTION

In this analysis, we addressed the following questions:

(1) To what extent do climate, position on the tree bole, and current foliar biomass explain cross-sectional area increment?
(2) How have temperature and precipitation affected the annual cross-sectional growth?
(3) Is annual growth best explained by average annual temperature, or do monthly maximum and/or minimum values provide a better explanation?
(4) Does the use of climate variables to predict the growth and proportions of early and late wood provide more reliable estimates than the use of the growth and density measurements from previous years, as measured from the interior rings?

We used mixed models to model the relationships between the explanatory variables (climate, position on the tree bole, and current foliar biomass) and the responses (cross-sectional area increment and the proportion of late wood). There were four features of the data that complicated the analyses:

(1) Climate variables for each year were available and annual growth measurements were collected from tree samples, so we expected the data to exhibit autocorrelation. The correlation structure was accommodated by the use of random effects.
(2) For each disc, there were measurements from two separate radii. Radius was treated as a nested random effect. It could have been assumed that the measurements along the two radii were two observations of the same variable, and an average could then be taken, but due to the asymmetry of the tree radii, an average is not a good estimate of the

* Author to whom correspondence may be addressed. E-mail: ecomier@uvic.ca


variable. Alternatively, the measurements along the two radii could be used individually, but there would be very high correlations between them. To allow both sets of measurements to be used, and to include the correlations between radii, a nested random effect was used (Pinheiro & Bates, 2000, p. 40).
(3) The ages of the trees varied, resulting in a different number of observations for each tree. This complication corresponds to drop-out in a longitudinal study. The reason for the resulting missing data was not believed to be informative, because it depended only on the age of the tree and not on a growth factor; therefore the data on missing years of a tree's life were treated as missing at random and bias was not taken into account.
(4) Destructive sampling meant that foliar biomass was collected at only one point in time; an inverse regression was conducted to determine the foliar biomass in other years.

In addition to the climate variables, the growth and density measurements from previous years were used to predict the growth of early and late wood. This prediction was done using an ARIMA model, and the reliability of these estimates was examined.

2. METHODOLOGY

Non-parametric regressions of late wood percentage and cross-sectional area increment against trunk position, foliar biomass, and annual maximum temperature were fit to determine the general trends in the responses. We assumed that the nth measurement from the ground would correspond across trees regardless of the height of the trees, so the ordinal variable position was used. The plot of cross-sectional area increment versus trunk position in Figure 1 shows a negative relationship between trunk position and cross-sectional increment from positions 1–4, but a positive relationship after position 5. This could be because position 4 or 5 corresponds with the start of the crown.
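The non-parametric trend fits described above are of the LOESS type: a locally weighted linear regression with tricube weights. As an illustrative reimplementation on simulated data (not the tree measurements), a minimal version is:

```python
import numpy as np

def loess(x, y, frac=0.5):
    """Minimal LOESS: locally weighted linear regression with tricube
    weights. frac is the fraction of points entering each local fit."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    n = len(x)
    k = max(2, int(np.ceil(frac * n)))
    fitted = np.empty(n)
    for i in range(n):
        d = np.abs(x - x[i])
        idx = np.argsort(d)[:k]                 # k nearest neighbours of x[i]
        dmax = d[idx].max()
        w = (1.0 - (d[idx] / dmax) ** 3) ** 3   # tricube weights
        # weighted least squares for a local line a + b*x
        X = np.column_stack([np.ones(k), x[idx]])
        W = np.diag(w)
        beta = np.linalg.solve(X.T @ W @ X, X.T @ W @ y[idx])
        fitted[i] = beta[0] + beta[1] * x[i]
    return fitted

# Recover a smooth nonlinear trend from noisy simulated data
rng = np.random.default_rng(0)
xs = np.linspace(0.0, 10.0, 80)
ys = np.sin(xs) + rng.normal(0.0, 0.2, size=80)
smooth = loess(xs, ys, frac=0.3)
```

In practice one would use a library smoother (e.g. `statsmodels` lowess or R's `loess`); the point here is only the local weighted fit that produces the trend curves in Figure 1.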
The plot of cross-sectional area increment versus biomass and the plot of percentage of late wood versus biomass show that high values of biomass correspond to high cross-sectional area increment and low late wood percentage, respectively. The plots against annual maximum temperature show that the relationships of annual maximum temperature with cross-sectional increment and with late wood percentage differ between the two locations (Kamloops and Quesnel). This suggests including a two-factor interaction between location and annual maximum temperature. Similar interaction plots, which are not included in this paper, suggested including the interactions between the following pairs of variables: location and annual minimum temperature, location and precipitation, age of the tree and foliar biomass, and location and foliar biomass. We also included two-factor interactions among the climate variables.

One of the questions of interest was how the position on the tree bole affects the cross-sectional area increment. When examining the effect of position, it was necessary to account for the large range in the trees' absolute heights: a measurement taken 10 m from the ground on a tree that is 11 m tall has a different relative position on the trunk than the same measurement on a tree that is 40 m tall. To account for this, the ordinal variable position was used to represent relative height.

To determine whether monthly maximum and minimum temperatures or average annual temperature best explain cross-sectional growth, the mixed model was fitted separately with monthly measurements and with average annual measurements, and goodness-of-fit criteria were compared. The variables that resulted in a better fit were used in the analysis.
To model the correlation structure, a random intercept and a random slope for each tree were adopted. To model the nesting introduced by having two radii measured on each tree disc, a nested random effect was introduced. A model to describe the growth of lodgepole pine was formulated using the response Yijkh, the cross-sectional area increment for the ith tree, jth radius, kth ring and hth position, where


Figure 1: Nonparametric regressions (LOESS) of cross-sectional area increment (measured in mm) and late wood percentage on trunk position, foliar biomass, and annual maximum temperature for each location, Kamloops (K) and Quesnel (Q). Note that the scale of cross-sectional increment in the biomass plot is larger than in the other plots in the bottom panel; about 2.6% of the cases have foliar biomass greater than 1, and these cases have large measures of cross-sectional increment.

i = 1, 2, . . . , 60 indexes tree, j = 1, 2 indexes radius from each sample disc, h = 1, 2, . . . , mi indexes position on the ith tree, and k = 1, 2, . . . , nih indexes ring from the ith tree. Explanatory variables are region (X1), foliar biomass (X2), the position of the cut (X3), monthly minimum temperature (X4–X15), monthly maximum temperature (X16–X27) and monthly precipitation (X28–X39). In addition, tihk, the age of the ith tree when the kth ring at the hth position was developed, was included in the model. The age of the tree was used to determine if growth of the tree was constant over its lifetime. The mixed model is

f(Yijkh) = β0 + g(tihk) + β1 X1 + β2 X2 + β3 X3 + β4–15 (X4–15) + β16–27 (X16–27)k + β28–39 (X28–39)h + (interactions) + b0i + b1i tihk + bij + εijkh,   (1)

where the random effects are the random intercept for tree, b0i ∼ N(0, τ1²), the random slope for tree, b1i ∼ N(0, τ2²), the nested random effect for radii within disc, bij ∼ N(0, τ3²), and the random error εijkh ∼ N(0, σ²). Interactions included in the model were those suggested by the non-parametric regression plots, as well as the interactions between annual maximum temperature and annual minimum temperature, annual maximum temperature and annual precipitation, and annual minimum temperature and annual precipitation. The function f was a transformation chosen to improve the adequacy of the model, and g was a function that modelled the relationship between the cross-sectional area increment and age. The same approach was taken to fit a mixed model with proportion of late wood as the response.

The growth of a tree in any given year is strongly influenced by growth in the previous years: buds are formed the year before they start to grow, and carbohydrates from good years can be stored to fuel growth in subsequent years. To model this dependence of growth on previous years, an ARIMA model was developed. This model allowed the growth to be characterized and predicted.
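The random-effects structure of model (1), a random intercept and a random age slope per tree plus a variance component for radius nested within tree, can be sketched with simulated data using `statsmodels`; all variable names and parameter values below are hypothetical, and `MixedLM` stands in for whatever software the authors used:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulate data with the grouping structure of model (1): a random
# intercept and a random age slope for each tree, plus a nested effect
# for radius (A/B) within tree. All parameter values are hypothetical.
rng = np.random.default_rng(1)
rows = []
for tree in range(30):
    b0 = rng.normal(0.0, 0.5)        # random intercept for tree
    b1 = rng.normal(0.0, 0.02)       # random age slope for tree
    for radius in ("A", "B"):
        br = rng.normal(0.0, 0.1)    # radius-within-tree effect
        for age in range(1, 16):
            y = 2.0 + 0.1 * age + b0 + b1 * age + br + rng.normal(0.0, 0.2)
            rows.append({"tree": tree, "radius": radius, "age": age, "y": y})
df = pd.DataFrame(rows)

# Random intercept and age slope per tree; radius nested within tree
# enters as a variance component (cf. Pinheiro & Bates, 2000).
model = smf.mixedlm("y ~ age", df, groups="tree",
                    re_formula="~age",
                    vc_formula={"radius": "0 + C(radius)"})
fit = model.fit()
print(fit.params["age"])  # fixed-effect slope; the simulated truth is 0.1
```

The `vc_formula` entry is what expresses the nesting: a separate radius effect is drawn within each tree group, mirroring the bij term of model (1).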


Figure 2: Mean cross-sectional area increment over age of the tree.

Table 1: Climate effects on cross-sectional increment.

| Month | J | F | M | A | M | J | J | A | S | O | N | D |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Maximum temperature | * | − | * | * | − | − | − | * | − | − | * | − |
| Minimum temperature | + | + | + | + | + | + | + | + | + | + | + | + |
| Precipitation | + | + | + | + | + | + | + | + | + | + | + | + |

Significant positive effect (+), significant negative effect (−), not significant (*) at the 5% level.

3. RESULTS

To improve model adequacy, a Box–Cox transformation was performed with λ = 0.25. The function g was determined to be cubic through examination of Figure 2, that is, g(tihk) = β40 tihk³ + β41 tihk² + β42 tihk. Trunk position in the crown and the amount of foliar biomass have positive relationships, and age of the tree has a negative relationship, with the cross-sectional area increment. The effects of climate on cross-sectional increment are presented in Table 1. Results from the fitted models were quite consistent with the patterns observed in Figure 1, except that the interaction effect of annual maximum temperature and location was not significant. The estimated standard deviations of the random effects were τ1 = 0.5394, τ2 = 0.0225, and τ3 = 0.0284. The nested effect for radii was not significant, implying that the measurements from the two different radii did not significantly differ. Annual growth was better explained using monthly maximum and minimum temperature values than average annual values, because both AIC (66,135 vs. 67,750) and BIC (66,781 vs. 68,050) were smaller when monthly measurements were included in the model. However, although we used monthly climate variables as main effects, interactions were modelled using annual climate variables to reduce the number of interaction terms in the model. Examination of the residual plot showed heavy tails and skewness. This indicates that the residuals deviate from normality and that modelling them with a skew-elliptical distribution would be more appropriate (Jara, Quintana & San Martín, 2008).

To model the dependence of the growth of early and late wood on measurements from previous years, an autoregressive integrated moving average model was fitted using the yearly averages of early and late wood width over all trees. The final model was determined to be a third-order


Figure 3: Five-step-ahead predictions of early and late wood growth.

autoregressive, third-order integrated, and first-order moving average model. This model was used to predict future growth and determine prediction intervals (Figure 3).

4. CONCLUSIONS

Data on the cross-sectional area increment and proportion of late wood for tree rings at multiple years, heights and geographical regions were analyzed, with the addition of climate data and measurements of foliar biomass for each year of the trees' lives. The use of mixed models allowed all of these covariates to be included in each model and their relationships to be modelled using all available data. However, the model is not sufficient to conclude cause-and-effect relationships between the variables and the growth of lodgepole pine. Since the monthly climate factors were correlated, multi-collinearity was present, and the coefficients therefore require careful interpretation. In response to questions (1) and (2) in Section 1, the results show that minimum monthly temperature, monthly precipitation and foliar biomass were positively related to the cross-sectional area increment, while lower trunk position and maximum monthly temperature were negatively related. Examination of question (3) showed that annual growth was better explained by monthly maximum and minimum temperatures than by average annual temperature. Due to the wide prediction intervals from the time series analysis, it was concluded that the use of climate variables provided more reliable estimates for growth prediction (question (4)). Possible future work to improve the model includes the use of skew-elliptical distributions for the residuals, to account for both skewness and heavy tails in the error terms, and the incorporation of splines to accommodate the temporal trend in the observations.

ACKNOWLEDGEMENTS

Many thanks to Farouk Nathoo for the very helpful correspondence.

BIBLIOGRAPHY

A. Jara, F. Quintana & E. San Martín (2008). Linear mixed models with skew-elliptical distributions: A Bayesian approach. Computational Statistics and Data Analysis, 52, 5033–5045.

J. Pinheiro & D. Bates (2000). "Mixed-Effects Models in S and S-PLUS," Springer-Verlag, New York.

Received 31 October 2010 Accepted 6 January 2011


**Determinants of lodgepole pine growth: Static and dynamic panel data models**

Mustafa SALAMH*

Department of Statistics, University of Manitoba, Winnipeg, Man., Canada R3T 2M2

Key words and phrases: Linear mixed model; nested error components; autoregressive panel models; tree growth; climate effect; Pressler's hypothesis; random coefficients regression; two-stage least squares.

1. INTRODUCTION

This study was concerned with modelling the wood properties and growth over time of lodgepole pine in British Columbia. The primary objective was to determine to what extent climate, position on the tree trunk, and current foliar biomass explain cross-sectional area increment and proportion of early and late wood. The study also addressed other questions, such as whether growth is best explained by average annual temperature or by monthly temperature extremes, and whether the use of climate variables to predict the growth and wood properties provides more reliable estimates than the use of the growth and density measurements from previous years.

2. METHODOLOGY

Pressler's hypothesis states that the annual increment in cross-sectional area of wood increases linearly from the top of the tree to the base of the crown and is proportional to the amount of foliage above the point of the increment. Since tree and crown heights and foliar biomass were only available for the year in which the tree was cut down, they had to be estimated for the other years of the tree's life. This was done using loess regression based on the tree's current height, crown length, and the heights of the disks.

In order to answer the primary question about the effects of climate, disk position on the tree bole, and foliar biomass on the cross-sectional area increment and percentage of late wood, a linear mixed effects model was used to account for the two-level grouping structure of the data. The model was formulated according to Pressler's hypothesis without ignoring the possible random variability due to disks below the crown. Other nuisance factors, such as age from pith, tree height, and geographic location, were included in the model to control for their effects. The climate model takes the form

yijt = f1(disk ageijt, tree heightit, sitei) + f2(climt,t−1,t−2) + α Dijt + β topdistanceijt + γ topmassijt + νi + δi Dijt + ξi topdistanceijt + ζi topmassijt + ηij (1 − Dijt) + εijt,   i = 1, . . . , I, j = 1, . . . , Ji, t = 1, . . . , Tij,   (1)

where the response yijt is either the square root of the area increment or the log of the late to early wood ratio for disk j within tree i at year t, f1 and f2 are linear functions, and "clim" represents a vector of climate variables (temperature and precipitation). The variable D is an indicator for being above the crown, topdistance is the product of D and the distance from the tree top to the disk, and topmass

* Author to whom correspondence may be addressed. E-mail: umabdel9@cc.umanitoba.ca


is the product of D and the gross amount of foliage above the disk. The standard distributional settings were assumed for the random effects and residual error, namely (νi, δi, ξi, ζi, ηij, j = 1, . . . , Ji) ∼ N(0, diag(σν², σδ², σξ², σζ², ση² I_Ji)), with εijt ∼ N(0, σε²) following an ARMA(p, q) process independent of the random effects. Two lags of the climate variables were included since the effects of previous growing conditions can last from 1 to 3 years. I focused on the spring, summer, and fall climate variables because early wood is laid down during the spring and late wood is laid down from mid-summer to fall.

Several sub-models were fit using the R package nlme. Diagnostic graphs were produced to ensure the adequacy of the models and to check the validity of the assumptions. In almost all sub-models the residual error had an ARMA(2,1) structure, which is consistent with Monserud (1986). Likelihood ratio and Wald tests were then performed to check significance. To answer whether annual growth is best explained by average annual temperature or by monthly temperature extremes, I used the structure of Equation (1) for each set of temperature variables. Since neither of the two models is nested in the other, I applied the idea of the J-test (Gujarati, 2003, p. 533) to determine which model is preferred.

To consider whether climate variables or growth and density variables from previous years better predict growth, I used a nested error component model with autoregressive dynamics and other explanatory variables to predict the annual growth. The proposed model equation is free of the climate variables. It is given by

yijt = α1 yijt−1 + α2 yijt−2 + λ1 xijt−1 + λ2 xijt−2 + θ1 zijt−1 + θ2 zijt−2 + νij + ωijt,   νij iid ∀ i, j,   (2)

where yijt is defined as in Equation (1), and x and z are the densities of early and late wood, respectively. The heterogeneity due to the trees and disks within trees is represented by the error component νij. The model is semiparametric: the residual error ωijt satisfies the moment condition E(ωijt | νij, yijt−1, xijt−1, zijt−1, yijt−2, xijt−2, zijt−2, . . .) = 0. The model was fit using two-stage least squares on the first differences within disks. It was compared to the models of Equation (1) with regard to their out-of-sample prediction power: a test sample of size 2,718, taken across almost all the disks, was set aside, both models were fit using the remaining data, and the fitted models were compared according to their mean squared error of prediction in the test sample. The MSE for the climate models was one-quarter that of the dynamic model.

3. CONCLUSION

Regarding how temperature and precipitation affect the annual cross-sectional growth and the proportions of early and late wood, the climate models showed high contributions of the current and lagged values of rain and temperature in explaining both of the target dependent variables. For example, annual growth was positively affected by rain (especially in spring) and negatively affected by extreme levels of temperature. The proportion of early wood was positively affected by rain throughout the year and by higher temperatures in spring, whereas the proportion of late wood was negatively affected by higher temperatures in mid-summer. It is recommended that proportions of early and late wood be considered separately to allow a clearer view of how they are individually affected by the climate.
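The first-difference two-stage least squares fit of a dynamic panel model like Equation (2) can be sketched on simulated data. This simplified one-lag version differences out the unit-level error component and instruments the differenced lag with the second lag in levels (an Anderson-Hsiao-type estimator); all names and values are hypothetical:

```python
import numpy as np

# Simulate y_it = a*y_{i,t-1} + nu_i + w_it, then estimate a by IV on
# first differences: Delta y_t = a * Delta y_{t-1} + Delta w_t, with
# y_{t-2} as the instrument for the endogenous Delta y_{t-1}.
rng = np.random.default_rng(2)
a_true, n_units, n_periods = 0.5, 500, 12
num = den = 0.0
for i in range(n_units):
    nu = rng.normal(0.0, 1.0)              # unit-level error component
    y = [nu + rng.normal()]
    for t in range(1, n_periods):
        y.append(a_true * y[-1] + nu + rng.normal())
    for t in range(3, n_periods):
        dy = y[t] - y[t - 1]               # first difference removes nu
        dy_lag = y[t - 1] - y[t - 2]
        inst = y[t - 2]                    # instrument: second lag in levels
        num += inst * dy
        den += inst * dy_lag
a_hat = num / den                          # IV estimate of the AR coefficient
print(a_hat)  # consistent for the simulated truth of 0.5
```

Differencing is needed because yijt−1 is correlated with νij; the second lag in levels is uncorrelated with the differenced error, which is what makes it a valid instrument under the moment condition stated above.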
Regarding the extent to which the position on the tree bole and the current foliar biomass explain the annual cross-sectional growth and the proportions of early and late wood, it was found that the higher the disk was within the crown, the smaller the annual wood increment and the higher the proportion of early wood. Foliar biomass had a positive linear effect on the annual wood increment within the crown, consistent with Pressler's hypothesis. However, it should be noted that the proportionality parameter is highly variable from tree to tree. Considering whether growth is best explained by the average annual temperature or by the monthly maximum (minimum) values, it was found that the two dependent variables were better

DOI: 10.1002/cjs The Canadian Journal of Statistics / La revue canadienne de statistique


explained by the monthly extremes of temperature than by the average annual temperature. The incremental contribution of the annual temperatures over the monthly temperatures was not significant, but the monthly temperatures were significant additions to the annual climate model. Regarding whether the use of climate variables to predict the two dependent variables provides more reliable estimates than the use of growth and density from previous years, it was clear that the climate models provide more reliable predictions than the autoregressive dynamic model. This emphasizes the importance of climate, disk position, and foliar biomass in prediction as well as in explanation.

ACKNOWLEDGEMENTS

I am grateful to Dr. Liqun Wang for encouraging me to carry out this research project and for financial support through his research grants from NSERC and the National Institute for Complex Data Structures.

BIBLIOGRAPHY

D. N. Gujarati (2003). "Basic Econometrics," 4th ed., McGraw-Hill, New York.
R. A. Monserud (1986). Time-series analyses of tree-ring chronologies. Forest Science, 32(2), 349–372.

Received 31 October 2010 Accepted 6 January 2011


**Discussion of case study 1 analyses**

C. B. DEAN1 *, Alison L. GIBBS2 and Roberta PARISH3

1 Statistics and Actuarial Science, Simon Fraser University, Burnaby, BC, Canada V5A 1S6
2 Department of Statistics, University of Toronto, Toronto, ON, Canada M5S 3G3
3 British Columbia Ministry of Forests, Victoria, BC, Canada V8N 1R8

Key words and phrases: Nonlinear mixed models; hierarchical random effects; transformations; prediction intervals; autoregressive integrated moving average model; measurement error; model diagnostics

1. INTRODUCTION

The two analyses incorporated similar features, but with different model formulations and variables; even so, they yielded similar major conclusions. Both Salamh and Cormier & Sun considered whether and to what extent climate, position on the tree bole, and foliar biomass explain cross-sectional area increment and the proportions of early and late wood. They both also specifically investigated the effects of temperature and precipitation, and whether average annual temperature or monthly maximum and/or minimum values better explain variability in area increment. Additionally, Salamh considered whether climate variables explain the variability in growth and in the proportions of early and late wood better than measurements of these variables from previous years.

2. THE MODELS

Both Salamh and Cormier & Sun utilized nonlinear mixed effects models with hierarchical random effects. Transformations of the responses were considered, including the square root of area increment and the logarithm of the ratio of early to late wood. Lags of climate variables were included as explanatory variables by Salamh, but not by Cormier & Sun. Salamh utilized a conceptually based approach, modelling the growth as increasing linearly from the top of the tree to the base of the crown, with tree-to-tree variability in this linear functional form. Cormier & Sun incorporated a variable labelled position (taking values 1, 2, 3, . . .), which is meant to reflect the height at which the area increment was measured from the base of the tree. Note that the heights at which measurements are taken are not multiples of a specific value, so when position = 2, the height from the base of the tree is not twice that when position = 1. As well, the heights at which measurements are taken are not the same from tree to tree.
So the modelling of the transformed response variable as a linear function of the variable position, as considered in Cormier & Sun, is somewhat of an approximation to including a linear function of the height at which measurements are taken in the model. Future models should incorporate a more accurate measure of relative position by using tree heights and height to sample. Both Salamh and Cormier & Sun include (estimates of) biomass in the model, with the transformed response changing linearly with biomass increases. Salamh incorporated autoregressive errors, while Cormier & Sun incorporated interaction terms. Site-speciﬁc intercepts were included in Salamh, while Cormier & Sun omitted site effects. Cormier & Sun also handled the modelling of early and late wood separately, looking at the annual averages of early and late wood over all trees and investigating how these are related to lagged effects.

* Author to whom correspondence may be addressed. E-mail: dean@stat.sfu.ca


3. CONCLUSIONS

Both Salamh and Cormier & Sun concluded that monthly climate variables explain the transformed growth response better than annual averages of these variables. More specifically, both determined that area increment is positively affected by precipitation throughout the year and by higher temperatures in the spring, and negatively affected by high summer temperatures. Salamh confirmed that 2 years of lagged climate variables were important for explaining growth. Foliar biomass, as expected, positively affects growth, as does position (Cormier & Sun). Additionally, Salamh determined that the proportion of early wood increases with increased rain and higher temperatures in the spring, while the proportion of late wood decreases with higher temperatures in the summer. In an interesting and effective contrast of predictions from models with climate variables versus those with lagged growth and density variables, Salamh showed that the climate models were superior in terms of mean squared error of prediction in a test sample that was not used for building the model. Finally, Salamh's analysis of average annual early and late wood indicated that a third-order autoregressive, third-order integrated, first-order moving average model best predicted future growth; it was also noted that prediction intervals from this model were quite wide.

Some important concerns were expressed by Cormier & Sun, whose residual analysis indicated skewness and heavy tails. This data set is messy and complicated. Climate data are subject to many sources of error: stations may be moved, or measurements influenced by shade from growing trees or new buildings. Foliar biomass prediction is prone to error because of the smoothing required to estimate three-dimensional crown variables. Many of the effects are likely nonlinear.
It is striking that both analyses yielded similar conclusions regarding climate effects despite such challenges, and despite the use of simple assumptions of normality of errors and many uncorrelated error components in the models. The brief discussion by Cormier & Sun of residual analysis points to a need for more model diagnostics and graphical model evaluation techniques to help assess whether the models can indeed do a useful job of prediction. It would also be important to identify which interaction terms are critical, how much variability they explain, and whether their effects are scientifically meaningful and interpretable. Salamh's comparison of predictions based on climate variables versus those based on previous values of growth and densities should be complemented with models including both, as well as an indication of whether any of these models provide useful output for predicting pine growth. Additional considerations for predictions under future climate scenarios will need to account for variability in climate predictions; it is unclear whether prediction intervals are already too wide to be helpful without the incorporation of errors in climate predictions, though Salamh's preliminary results encourage some optimism in that regard.

Received 31 October 2010 Accepted 6 January 2011


**Case study 2: Proteomic biomarkers for disease status**

Robert F. BALSHAW* and Gabriela V. COHEN FREUE

NCE CECR PROOF Centre of Excellence, Vancouver, BC, Canada V6Z 1Y6

Key words and phrases: Variable selection; class prediction; imputation

1. BACKGROUND

Renal transplantation saves the lives of hundreds of Canadians every year. However, every transplant recipient must be monitored closely for signs of acute rejection, the body's immunologic and inflammatory response to the presence of the foreign organ. If not properly treated, acute rejection can lead to loss of the transplanted organ, dialysis, or even death. Unfortunately, acute rejection can only be detected by biopsy, a distressing, uncomfortable, and expensive surgical procedure that can be required multiple times during the first year post-transplant. The Biomarkers in Transplantation project was funded by Genome Canada to identify noninvasive biomarkers for the detection and prediction of acute rejection based on proteomic analysis of peripheral blood samples. A clinical test based on such a biomarker could lead to a better method for monitoring transplant recipients, reducing costs, improving treatment outcomes, and substantially improving recipients' quality of life.

Measures of protein abundance were collected from a selection of kidney transplant recipients who were known at the time of the blood sampling to be either experiencing treatable acute rejection (AR) or not experiencing acute rejection (NR). Each sample was drawn from an independent subject within the first 3 months post-transplant. For each AR sample, two NR samples were selected at approximately the same time point post-transplant. The goal of this case study was to utilize these proteomic data to create a classifier for acute rejection, which could then be evaluated on a test set of 15 samples.

2. THE DATA

At the time of the Case Study Competition for the 2009 SSC meeting, potential intellectual property considerations meant that the true nature of the dataset had to be hidden from the participants. For example, AR and NR status are referred to as the patients being in an active or inactive state of disease.
The data were also supplemented with synthetic sample data, constructed to mimic the observed characteristics of the AR and NR samples, both to enrich the size of the dataset and to further protect intellectual property. The dataset included 11 samples from AR patients, 21 samples from NR patients, plus an additional 15 samples whose classiﬁcation was hidden. All experimental samples were taken from independent patients at the time when acute rejection was suspected or at a corresponding matched time-point for non-rejection samples. Historically, approximately 10% of renal transplant recipients experience rejection during the ﬁrst few months post-transplant; however, the study design was to select approximately 2 NR samples for every AR sample.

* Author to whom correspondence may be addressed. E-mail: rob.balshaw@syreon.com


A multiplex proteomic technology, iTRAQ, was used to measure the abundances of proteins in the experimental samples relative to the quantity of the corresponding protein in a reference sample. The reference samples were taken from a homogeneous batch of blood pooled from 16 healthy volunteers. Plasma was obtained from each whole blood sample through centrifugation. To enhance detection sensitivity, the plasma samples were first depleted of the 14 most abundant proteins. Trypsin was used to digest the proteins in the depleted samples, and the resulting peptides were labelled with one of four distinct iTRAQ reagents (i.e., chemical tags with unique molecular weights but otherwise identical chemical properties). The labelled samples were then pooled and processed using MALDI TOF/TOF technology. Each iTRAQ run was designed with three experimental samples and one reference sample. Peptide identification and quantitation were carried out by ProteinPilot Software v2.0, and the data were assembled into a comprehensive summary of the relative abundances of the proteins in the experimental samples. As the same reference sample was used in all runs, these relative abundance measures were comparable across experimental runs.

Each run of the experiment detected and measured several hundred proteins (about 200 per run), but not every protein was identified in every sample, nor even in every run, leading to a complex pattern of missing data. If a protein was not identified in a particular experimental sample, the protein's relative abundance level was unknown. When this happened for the reference sample in a particular run, the relative levels of that protein could not be estimated for any of the three experimental samples in that run. Proteins were identified in the data using arbitrary protein identifiers (BPG0001–BPG0460).
Though this prevented the incorporation of biological or subject-matter context into the analysis, it was necessary due to potential intellectual property concerns. In addition to acute rejection status and relative abundance measures for 460 proteins, the sex, race, and age of each subject were provided.

Received 31 October 2010 Accepted 6 January 2011


**Disease status determination: Exploring imputation and selection techniques**

Linghong LU, Rena K. MANN*, Rabih SAAB and Ryan STONE

Department of Mathematics and Statistics, University of Victoria, Victoria, BC, Canada V8W 3R4

Key words and phrases: BPCA; imputation; k-NN; LARS; LASSO; LLS; selection method; SLR; SVD

Abstract: Analyzing a proteomics dataset that contains a large number of independent variables (biomarkers) with few response variables and many missing values can be very challenging. The authors tackle the problem by first exploring different imputation techniques to treat the missing values and then investigating multiple selection techniques to pick the best set of biomarkers to predict the unknown patients' disease status. They conclude their analysis by cross-validating the different combinations of imputation and selection techniques (using the set of patients of known disease status) in order to find the optimal technique for the supplied dataset. The Canadian Journal of Statistics 39: 197–201; 2011 © 2011 Statistical Society of Canada

Résumé: Analyser les jeux de données provenant de la protéomique représente un grand défi, car ils contiennent un grand nombre de variables indépendantes (biomarqueurs) avec peu de variables réponses et ils contiennent aussi des valeurs manquantes. Les auteurs s'attaquent à ce problème en explorant différentes techniques d'imputation pour traiter les valeurs manquantes. Ils considèrent aussi différentes techniques de sélection multiple pour choisir le meilleur ensemble de biomarqueurs afin de prédire l'état non connu de la maladie d'un patient. Ils concluent leurs analyses en croisant les différentes combinaisons de techniques d'imputation et de sélection (en utilisant un ensemble de patients dont l'état de la maladie est connu) de façon à trouver la technique optimale pour le jeu de données analysées. La revue canadienne de statistique 39: 197–201; 2011 © 2011 Société statistique du Canada

1. INTRODUCTION

There is a large amount of missing information in the data; we therefore eliminated proteins with 40 or more missing values, leaving 330 proteins. A log2 transformation was then applied, since the data are fold changes. In order to select the set of proteins that accurately predict the status of the unknown list of patients, combinations of different imputation methods and selection techniques were studied and then cross-validated using the protein expressions of individuals with known disease status.

2. IMPUTATION

Several imputation techniques were explored: k-Nearest Neighbours (k-NN), Local Least Squares (LLS), Singular Value Decomposition (SVD), and Bayesian Principal Component Analysis (BPCA). k-nearest neighbours is a sensitive and robust method in which missing values are determined from the proteins whose expressions are most similar to those of the target protein in the other samples. The optimal value of k, the number of neighbours used, has been shown in the literature to be in the range of 10–20 (Troyanskaya et al., 2001; Mu, 2008); thus the values 10 and 20 were chosen here. For each protein, the k-NN imputed value was found using Euclidean distance (Troyanskaya et al., 2001) over the columns where that protein was not missing. If more

* Author to whom correspondence may be addressed. E-mail: rmann@uvic.ca


than 50% of a protein's values were missing, the missing values were imputed using each patient's average; otherwise, the average of the k-nearest neighbours was used to estimate the missing value.

In local least squares, a missing protein value is estimated as a linear combination of similar proteins (Kim, Golub & Park, 2005). The method borrows from k-NN and from least squares imputation. It is most effective when there is strong correlation in the data, as the k proteins were selected as those having the highest correlation with the target protein. We used both Spearman and Kendall correlations, along with different values of k for the neighbours.

Singular value decomposition starts by ignoring the missing entries in the data set and calculating the mean of each row of complete data. After initializing the missing values to these row means, an iterative procedure estimates the missing values: SVD is performed on the newly completed data set, and the resulting solution replaces the initial row means in the missing positions. These steps are repeated until the solution converges, which usually happens within about five iterations (Hastie et al., 1999).

Bayesian principal component analysis is another method that was used to impute missing protein expression values. It combines an EM approach to PCA with a Bayesian model and is based on three processes: principal component regression, Bayesian estimation, and an expectation-maximization (EM)-like iterative algorithm (Oba et al., 2003). The algorithm was developed for imputation, and very few components are needed to ensure accuracy. It is an iterative process that terminates either when the increase in precision falls below a threshold of 0.01 or when the set number of iterations is reached.

3. SELECTION METHODS

We used several methods to select differentially expressed proteins between the active and inactive groups of patients.
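The iterative SVD imputation described in Section 2 can be sketched in a few lines. This is a minimal numpy illustration under simplified assumptions (the rank and tolerance below are illustrative choices), not the code used in the analysis.

```python
import numpy as np

def svd_impute(X, rank=2, tol=1e-6, max_iter=100):
    """Iterative SVD imputation: initialize missing entries with row means,
    then repeatedly refit a low-rank SVD approximation until convergence."""
    X = np.array(X, dtype=float)
    missing = np.isnan(X)
    # Initialize missing values with the mean of each row's observed entries.
    row_means = np.nanmean(X, axis=1)
    filled = np.where(missing, row_means[:, None], X)
    for _ in range(max_iter):
        U, s, Vt = np.linalg.svd(filled, full_matrices=False)
        approx = (U[:, :rank] * s[:rank]) @ Vt[:rank, :]
        # Only the missing positions are updated; observed values are kept.
        new_filled = np.where(missing, approx, X)
        if np.max(np.abs(new_filled - filled)) < tol:
            filled = new_filled
            break
        filled = new_filled
    return filled

# Example: a rank-1 matrix with two entries removed is recovered closely.
A = np.outer([1.0, 2.0, 3.0], [4.0, 5.0, 6.0])
A_miss = A.copy()
A_miss[0, 0] = np.nan
A_miss[2, 1] = np.nan
A_hat = svd_impute(A_miss, rank=1)
```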
The simplest of these methods is the t-test. We used this basic test to compare the protein expressions between the two groups. Significant results of the t-test provided a preliminary analysis and a list of proteins potentially expressed differently between the active and inactive groups. We also used the Wilcoxon signed rank test, which is conceptually similar to the t-test but provides a more robust approach. More advanced selection techniques were then explored to select biomarkers influencing the disease status: the Least Absolute Shrinkage and Selection Operator (LASSO), Least Angle Regression (LARS), and Sparse Logistic Regression (SLR).

3.1. LASSO and LARS

The LASSO, a technique suggested by Tibshirani (1996), is similar to ordinary least squares but uses a constraint that shrinks the coefficients, setting some of them exactly to zero and thereby reducing the number of predictors in the model. LARS is a model selection method, proposed by Efron et al. (2004), that is computationally simpler than the LASSO. It starts with all coefficients equal to zero and adds the predictor most correlated with the response. The next predictor added is the one most correlated with the current residuals. The algorithm proceeds in a direction equiangular between the predictors already in the model. Efron et al. (2004) presented modifications to the LARS algorithm that generate the LASSO estimates, and these were used to produce the LARS estimates in this paper.

3.2. SLR

The SLR method is an extension of the LASSO to a generalized regression model. The technique, proposed by Shevade and Keerthi (2003), fits a logistic regression model and uses maximum likelihood estimation to obtain estimates of the model coefficients. The method is similar to the LASSO in that it uses a constraint to shrink the logistic regression model.

Table 1: Number of individuals of known disease status misclassified by the different imputation and selection methods.

        SVD      SVD      k-NN      k-NN             LLS            LLS            LLS
        (e = 3)  (e = 2)  (k = 20)  (k = 10)  BPCA   (k = 20 spear) (k = 10 ken)   (k = 5 ken)
SLR      21       16       17        13        12     18             13             16
LASSO     0        0       12        13         1     13             16             15
LARS      3        0        3         4         0      5              1              1

The rows and columns correspond to the different selection and imputation methods, respectively. LLS has a correlation type of either Spearman (spear) or Kendall (ken).

Table 2: Predicted status for unknown subjects using LARS combined with SVD imputation, where I and A stand for inactive and active, respectively.

Subject  2  7  8  11  12  13  17  18  29  31  33  34  35  38  41
Status   I  I  I   I   I   A   A   A   I   I   I   I   I   A   I
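The soft-thresholding that gives the LASSO of Section 3.1 its exact zeros can be illustrated with a short coordinate-descent sketch. This is a numpy-only toy on synthetic data, purely illustrative; it is not the packaged implementation the authors would have used.

```python
import numpy as np

def soft_threshold(z, t):
    # The soft-thresholding operator that produces exact zeros in the LASSO.
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_cd(X, y, lam, n_iter=200):
    """Coordinate descent for (1/(2n))||y - Xb||^2 + lam * ||b||_1,
    assuming the columns of X are standardized."""
    n, p = X.shape
    b = np.zeros(p)
    for _ in range(n_iter):
        for j in range(p):
            r = y - X @ b + X[:, j] * b[j]     # partial residual excluding j
            z = X[:, j] @ r / n
            b[j] = soft_threshold(z, lam) / (X[:, j] @ X[:, j] / n)
    return b

# Toy example: only the first two of five predictors matter.
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 5))
X = (X - X.mean(0)) / X.std(0)
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=100)
b = lasso_cd(X, y, lam=0.2)
# The two true predictors keep large coefficients; the rest are shrunk
# toward (and often exactly to) zero.
```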

4. METHOD SELECTION

To select the set of proteins that accurately predict the status of the unknown list of patients, combinations of imputation methods and selection techniques were studied and then cross-validated using the protein expressions of individuals with known status. Misclassification counts for the eight imputation and three selection methods employed are displayed in Table 1. The LARS algorithm had relatively few misclassified cases compared to SLR and LASSO. The SLR method had high misclassification rates and was therefore dismissed for prediction purposes. SVD and BPCA appear to be the best imputation techniques.
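The cross-validation behind Table 1 can be sketched generically as follows, with leave-one-out resampling and a nearest-centroid classifier as a hypothetical stand-in for any imputation-plus-selection pipeline (synthetic, well-separated groups; not the authors' code).

```python
import numpy as np

def nearest_centroid_predict(X_train, y_train, x):
    # Classify x by the closer of the class mean vectors.
    labels = np.unique(y_train)
    dists = [np.linalg.norm(x - X_train[y_train == lab].mean(axis=0))
             for lab in labels]
    return labels[int(np.argmin(dists))]

def loo_misclassifications(X, y):
    """Leave-one-out count of misclassified subjects."""
    errors = 0
    for i in range(len(y)):
        mask = np.arange(len(y)) != i
        if nearest_centroid_predict(X[mask], y[mask], X[i]) != y[i]:
            errors += 1
    return errors

# Two well-separated synthetic groups of sizes 11 and 21, mirroring the
# known-status sample sizes; leave-one-out should misclassify none.
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0, 0.3, size=(11, 4)),
               rng.normal(3, 0.3, size=(21, 4))])
y = np.array(["A"] * 11 + ["I"] * 21)
errors = loo_misclassifications(X, y)
```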

5. PROTEIN SELECTION AND PREDICTION

We chose the proteins picked by the most methods. A protein was deemed differentially expressed by the t- and Wilcoxon signed rank tests if the respective P-values were less than 0.01. Proteins with nonzero coefficients in the SLR, LASSO, and LARS models were considered differentially expressed between the active and inactive groups of patients. The maximum selection frequency across methods was 13, so 10 was taken as the cut-off value. Seven proteins stood out: BPG0036, BPG0105, BPG0235, BPG0262, BPG0333, BPG0381, and BPG0447. For prediction we used these selected proteins and their coefficients from the LARS algorithm combined with the SVD (e = 2) imputation method, because this combination had a misclassification rate of zero. Table 2 presents the results of the predictions.
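The frequency-based consensus above (keep proteins selected by at least a cut-off number of methods) can be sketched as follows. The protein IDs and method lists below are hypothetical, not the actual selections.

```python
from collections import Counter

# Hypothetical selections from several imputation/selection combinations.
selections = [
    ["BPG0036", "BPG0105", "BPG0235"],
    ["BPG0036", "BPG0105", "BPG0333"],
    ["BPG0036", "BPG0262", "BPG0105"],
]

# Count how many method combinations selected each protein, then keep
# those meeting the cut-off.
counts = Counter(p for sel in selections for p in sel)
cutoff = 2
consensus = sorted(p for p, c in counts.items() if c >= cutoff)
print(consensus)  # → ['BPG0036', 'BPG0105']
```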

Table 3: Predicted status for unknown subjects where I and A stand for inactive and active status, respectively.

Subject  2  7  8  11  12  13  17  18  29  31  33  34  35  38  41
Status   I  A  I   I   I   A   A   I   I   A   I   I   A   A   I

6. LOGISTIC REGRESSION

After determining the seven significant proteins, the variables race, gender, and age were analyzed to determine whether they were significant. Chi-square tests of independence showed that race was not significantly associated with status while gender was, and linear regression showed that age was not significantly associated with status. A logistic regression model (Dobson, 2002) was then fitted with gender and all seven of the selected proteins to produce a second set of predictions for unknown disease status. Three proteins, BPG0036, BPG0105, and BPG0333, were found to be significant by stepwise algorithms on the fitted model; the expression levels of these three proteins changed significantly between the active and inactive groups. Fitting the reduced model on the data of given status gave the final model:

log(πj / (1 − πj)) = 1,309 − 6,948 × BPG0036 + 4,509 × BPG0105 − 2,715 × BPG0333

for j = 1, . . . , 32. The status of patient j is determined to be active if the corresponding πj is less than 0.5 and inactive otherwise. The predictions for the unknowns are shown in Table 3. To check consistency, we predicted the status of the known-status patients (11 active and 21 inactive). The misclassification was zero, which gave us confidence in the predictions for the 15 unknowns.

7. CONCLUSIONS

The main problem encountered when dealing with these data was the abundance of missing values, and the choice of imputation method therefore proved to be vital. Both logistic regression and LARS were shown to be very good selection methods; the LARS algorithm had low misclassification rates. The three proteins picked by logistic regression were good predictors of a patient's status, as there was an obvious separation between the inactive and active patients.

ACKNOWLEDGEMENTS

The support and guidance of Dr. Mary Lesperance is greatly appreciated. The study was supported by Fellowships from the University of Victoria.

BIBLIOGRAPHY

A. Dobson (2002). "An Introduction to Generalized Linear Models," Chapman & Hall/CRC, Washington.
B. Efron, T. Hastie, I. Johnstone & R. Tibshirani (2004). Least angle regression, Annals of Statistics, 32(2), 407–499.
T. Hastie, R. Tibshirani, G. Sherlock, M. Eisen, P. Brown & D. Botstein (1999). Imputing missing data for gene expression arrays. Technical Report, Department of Statistics, Stanford University, Palo Alto, California, USA.
H. Kim, G. Golub & H. Park (2005). Missing value estimation for DNA microarray gene expression data: local least squares imputation, Bioinformatics, 21(2), 187–198.

R. Mu (2008). Applications of correspondence analysis in microarray data analysis. MSc Thesis, University of Victoria, Victoria, British Columbia, Canada.
S. Oba, M. Sato, I. Takemasa, M. Monden, K. Matsubara & S. Ishii (2003). A Bayesian missing value estimation method for gene expression profile data, Bioinformatics, 19(16), 2088–2096.
S. Shevade & S. Keerthi (2003). A simple and efficient algorithm for gene selection using sparse logistic regression, Bioinformatics, 19(17), 2246–2253.
R. Tibshirani (1996). Regression shrinkage and selection via the lasso, Journal of the Royal Statistical Society, Series B, 58(1), 267–288.
O. Troyanskaya, M. Cantor, G. Sherlock, P. Brown, T. Hastie, R. Tibshirani, D. Botstein & R. B. Altman (2001). Missing value estimation methods for DNA microarrays, Bioinformatics, 17(6), 520–525.

Received 31 October 2010
Accepted 6 January 2011


**Bootstrap-multiple-imputation; high-dimensional model validation with missing data**

Billy CHANG*, Nino DEMETRASHVILI and Matthew KOWGIER

Dalla Lana School of Public Health, University of Toronto, Toronto, Ontario, Canada M5T 3M7

Key words and phrases: Bootstrap validation; EM-algorithm; multiple imputation; penalized likelihood

1. METHODOLOGY

We ignored age, sex, and race when building the classifiers. We then removed proteins with more than 50% of values missing (191 proteins were left) and transformed zero relative abundance values to 0.0001 to avoid −∞ when applying the log-transform. We compared four classifiers: regularized discriminant analysis (RDA) with penalization constant γ = 0.99, diagonal linear discriminant analysis (DLDA), penalized logistic regression (PLR) with penalization constant λ = 0.01, and linear kernel support vector machine (SVM) with penalization constant C = 1 (for the Lagrange multiplier). The classifiers are described in detail in Hastie, Tibshirani & Friedman (2009).

Multiple imputation (Little & Rubin, 2002) and bootstrap validation (Hastie, Tibshirani & Friedman, 2009) were used to compare and validate the four classifiers. We employed ROC curves and the AUC (area under the ROC curve) as the classifiers' performance metrics. Because of the small validation sets created by bootstrapping, we employed the BS.632+ error correction method (Hastie, Tibshirani & Friedman, 2009) to adjust for the small-sample prediction error bias when comparing the prediction errors of the four classifiers.

We assume the 191 log-abundance scores for subject i (i = 1, . . . , N = 47) are multivariate normal: xi ∼ N(µ, Σ). To avoid singular covariance estimates, we penalize the trace of the precision matrix:

l(µ, Σ | {xi}, i = 1, . . . , N) = log det(Σ⁻¹) − trace(S Σ⁻¹) − λ trace(Σ⁻¹)

Here S is the sample covariance matrix. The maximum penalized-likelihood estimates are

µ̂ = (1/N) ∑i xi,    Σ̂ = S + λI
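These closed forms amount to adding a ridge λI to the sample covariance; a quick numerical check (illustrative dimensions, not the study data) shows that Σ̂ is nonsingular even when there are more variables than subjects.

```python
import numpy as np

def penalized_mvn_estimates(X, lam):
    """Maximum penalized-likelihood estimates for a multivariate normal
    with an added penalty lam * trace(Sigma^{-1}):
    mu_hat is the sample mean and Sigma_hat = S + lam * I."""
    n, p = X.shape
    mu_hat = X.mean(axis=0)
    centered = X - mu_hat
    S = centered.T @ centered / n        # sample covariance (MLE version)
    sigma_hat = S + lam * np.eye(p)
    return mu_hat, sigma_hat

rng = np.random.default_rng(3)
X = rng.normal(size=(10, 20))            # p > n: S alone would be singular
mu_hat, sigma_hat = penalized_mvn_estimates(X, lam=0.02)
# Every eigenvalue of sigma_hat is at least lam, so it is invertible.
```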

where I is the identity matrix. In the presence of missing data, we use the EM-algorithm as described in Little & Rubin (2002), with a slight modification in the M-step for parameter estimation:
• E-step: compute the conditional expectations and covariances of the missing data given the observed data.
• M-step: (1) Fill in the missing entries with their conditional expectations. (2) Update the sample mean of the filled-in data.

* Author to whom correspondence may be addressed. E-mail: billy.chang@utoronto.ca


(3) Update the covariance as the sum of the filled-in data sample covariance, the conditional variance of the missing data given the observed data, and λI (we used λ = 0.02).

To validate the classifiers, we employed the following procedure:
(1) Estimate the missing-data distribution from all 47 subjects using the above EM-algorithm.
(2) Generate 100 imputed data sets with missing values imputed by draws from the estimated missing-data distribution.
(3) From the 100 imputed data sets, remove the 15 subjects with unknown status.
(4) For each data set created in steps 2 and 3, create 200 bootstrap resampled data sets. Fit the classifiers on the bootstrap samples and obtain averaged out-of-bag (OOB) ROC curves, OOB AUC scores, and BS.632+ errors.
Note that all the penalization constants were chosen only to eliminate the singularity issues due to the high dimensionality of the data; no fine-tuning was done to optimize classification performance.

2. RESULTS

Figure 1: ROC curves.

The averaged ROC curve for PLR lies above the ROC curves of the other three classifiers (Figure 1), suggesting that PLR achieves better performance on average across all levels of the threshold. The AUC score distributions (results not shown) also suggest that PLR consistently achieves good separation ability; however, certain bootstrap samples give very low OOB AUC. We used the OOB BS.632+ error to check the classifiers' performance at a threshold of 0.5 for RDA, DLDA, and PLR (i.e., classify a subject as Active if P(Active | the subject's protein scores) > 0.5) and of 0 for SVM, and found that PLR's errors are also consistently lower than those of the other three classifiers (results not shown).

3. CONCLUSIONS

Based on the above observations, PLR is the best of the four classifiers compared. However, the large variance in the AUC and BS.632+ errors casts doubt on whether PLR can truly be a legitimate alternative to the current pathological evaluation method for determining patients' disease states.
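The E-step's conditional expectation used in the imputation above has a closed form for the multivariate normal; a minimal sketch of that formula (a generic illustration, not the authors' implementation):

```python
import numpy as np

def conditional_mean(mu, sigma, x, observed):
    """E[x_mis | x_obs] for x ~ N(mu, sigma):
    mu_m + Sigma_mo Sigma_oo^{-1} (x_obs - mu_o)."""
    obs = np.asarray(observed)
    mis = ~obs
    sigma_oo = sigma[np.ix_(obs, obs)]
    sigma_mo = sigma[np.ix_(mis, obs)]
    return mu[mis] + sigma_mo @ np.linalg.solve(sigma_oo, x[obs] - mu[obs])

# Toy bivariate example with correlation 0.8: observing x1 = 1 when both
# means are 0 gives E[x2 | x1 = 1] = 0.8.
mu = np.zeros(2)
sigma = np.array([[1.0, 0.8], [0.8, 1.0]])
x = np.array([1.0, np.nan])
filled = conditional_mean(mu, sigma, x, observed=[True, False])
print(filled)  # → [0.8]
```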



ACKNOWLEDGEMENTS

We thank our team mentor Rafal Kustra for his guidance and support throughout.

BIBLIOGRAPHY

T. Hastie, R. Tibshirani & J. Friedman (2009). "The Elements of Statistical Learning," 2nd ed., Springer, New York.
R. J. A. Little & D. B. Rubin (2002). "Statistical Analysis with Missing Data," 2nd ed., Wiley, New York.

Received 31 October 2010 Accepted 6 January 2011


**Exploring proteomics biomarkers using a score**

Qing GUO1 *, Yalin CHEN2 and Defen PENG2

1 Department of Clinical Epidemiology and Biostatistics, McMaster University, Hamilton, ON, Canada L8N 3Z5
2 Department of Mathematics and Statistics, McMaster University, Hamilton, ON, Canada L8S 4K1

Key words and phrases: AUC; biomarker; cluster analysis; cross-validation; jittered QQ plot; logistic regression; missing data imputation; proteomics score; sensitivity; specificity

1. METHODOLOGY

Proteins are constantly changing and interacting with each other, which makes it challenging to single out one protein as a marker. Often a function is carried out by several proteins, some of which up-regulate and some down-regulate. In this article, we propose a procedure (PHD-SCORE) to derive a proteomics score that combines relevant functional groups of proteins. This score can be used, in place of a single protein or a limited number of potential proteins, as the biomarker to distinguish active disease status from inactive. The procedure involves the following steps: Process the data; Condense the High dimensional Dataset; Identify Statistically meaningful clusters and Calculate a proteomic score; Build statistical models to find the one most appropriate for the data; Determine patients' disease status by choosing an Optimal probability cut point; Test model prediction ability by using cross-validation; Repeat the procedure until a model with proper predictions and a low Error rate is chosen; Apply the chosen model to unknown cases.

1.1. Process the Data

Data checking was conducted to ensure consistency and integrity; nothing peculiar was found. Among the 47 subjects, 11 had active disease, 21 were inactive, and 15 had unknown disease status. Based on our exploration, any protein with more than 38.2% of observations missing in either group (equivalent to 4 in the active and 8 in the inactive group) was excluded from further analysis. This left us with 160 protein variables in the data set, together with 3 covariates (age, sex, race).

1.2. Condense the High-Dimensional Protein Data

Hierarchical cluster analysis was employed to reduce the dimensionality of the protein data. We used (1 − r), where r is the Pearson correlation coefficient, as the dissimilarity measure, with "average" linkage between clusters.
Cutting the dendrogram into too many clusters results in a small number of variables within each cluster, while too few clusters means that many weakly correlated variables are grouped together. Given these considerations, we chose 13 clusters, with a corresponding cut-off correlation coefficient of 0.85.

1.3. Impute Missing Data

Two imputation methods were used to handle missing data: hot-deck and linear regression. In the latter, the missing value of a protein is predicted and imputed from a regression model with all available predictors.
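The correlation-based clustering and dendrogram cut described above can be sketched as follows. This is a hypothetical pure-NumPy illustration: greedy average-linkage merging with a stopping height stands in for cutting the dendrogram, and the four-block toy data are invented for the demo.

```python
import numpy as np

rng = np.random.default_rng(1)

def average_linkage_clusters(d, cut):
    """Greedy agglomerative clustering with average linkage: keep merging
    the two clusters with smallest average dissimilarity until that
    dissimilarity exceeds the cut height."""
    clusters = [[i] for i in range(len(d))]
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                h = np.mean([d[i, j] for i in clusters[a] for j in clusters[b]])
                if best is None or h < best[0]:
                    best = (h, a, b)
        if best[0] > cut:
            break
        _, a, b = best
        clusters[a] += clusters[b]
        del clusters[b]
    return clusters

# Toy stand-in: 20 "protein" variables on 32 subjects, in 4 correlated blocks
base = rng.normal(size=(32, 4))
X = np.repeat(base, 5, axis=1) + 0.2 * rng.normal(size=(32, 20))

r = np.corrcoef(X, rowvar=False)                 # protein-by-protein correlations
clusters = average_linkage_clusters(1.0 - r, cut=1.0 - 0.85)
n_clusters = len(clusters)                       # expected near the 4 planted blocks
```

Cutting at height 1 − 0.85 = 0.15 keeps together only proteins whose average pairwise correlation exceeds the 0.85 threshold, mirroring the paper's choice.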

* Author to whom correspondence may be addressed. E-mail: guoq@mcmaster.ca


Table 1: Fitted coefficients for the logistic regression model.

Predictor         Estimate   SE     P-value
Constant          −4.05      4.41   0.36
Age               −0.06      0.07   0.42
Proteomic score   20.75      8.64   0.02

1.4. Identify Informative Clusters

Informative clusters were selected in two stages: first, for each of the 13 clusters, the average value of its proteins for the 32 subjects with known status was plotted against disease status to visually discriminate the active and inactive groups; then stepwise logistic regression was applied. This gave us 2 clusters, with 8 variables in one cluster and 12 in the other. The proteomics score was calculated as the difference in means between the 2 clusters.

1.5. Build Model and Optimal Estimate of Cut-Off Probability

A logistic regression model was built using the covariates and the calculated proteomics score as predictors. Gender and race were not statistically significant. The final model is logit(P) = β̂0 + β̂1 × Age + β̂2 × Score, where P is the probability of having active disease status. A probability cut-off plot was drawn to detect the patients' disease status. The AUC and a jittered QQ plot (Zhu & El-Shaarawi, 2009) were effective tools to diagnose the adequacy of the fitted model numerically and graphically.

1.6. Test Prediction Ability and Apply to the Unknown Cases

Cross-validation within the 32 subjects with known disease status was used to provide a nearly unbiased estimate of the prediction misclassification rate (Farruggia, Macdonald & Viveros-Aguilera, 1999). We used random splits to partition the observed data into a training set (2/3 of the 32 subjects, about 22) and a validation set (1/3, about 10), and pseudo-randomly repeated the cross-validation 200 times to assess the misclassification rate. Steps 1–6 were run repeatedly until a model with a low misclassification rate was chosen. Finally, we applied the model to the 15 unknown cases to identify their disease status.

2. RESULTS

The analyses were performed using the statistical package R. The maximum likelihood estimates of the coefficients of the fitted model and their P-values are shown in Table 1.
The proteomic score is significant at the 0.05 level for detecting patients' disease status; the insignificant covariate age was retained to adjust logit(P). A probability cut-point of 0.26 was chosen based on the trade-off between sensitivity and specificity: if a patient's disease probability is greater than 0.26, active disease status is assumed. The misclassification rates for the active group, the inactive group, and overall are 0.13, 0.11, and 0.12, respectively. When the chosen model was applied to the 15 participants with unknown status, patients 7, 12, 13, 17, and 18 were identified as having active disease.

3. CONCLUSIONS

There are limitations. Firstly, the chosen percentage of missingness (here, 38.2%) for each protein in either group was subjective. Secondly, multiple imputation methods were not adopted; to be conservative, we used hot-deck and linear regression fits to impute the missing values.
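The repeated random-split cross-validation of Section 1.6 can be sketched as follows. This is a simplified hypothetical: a toy one-dimensional "proteomics score" and a hand-rolled logistic fit stand in for the real score and R's glm(); only the 2/3 vs 1/3 split sizes, the 200 repeats, and the 0.26 cut-point come from the paper.

```python
import numpy as np

rng = np.random.default_rng(2)

def fit_logistic(X, y, lr=0.5, steps=1000):
    """Minimal logistic regression by gradient ascent on the log-likelihood
    (intercept included); a stand-in for a proper ML fit."""
    Xb = np.c_[np.ones(len(X)), X]
    w = np.zeros(Xb.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-Xb @ w))
        w += lr * Xb.T @ (y - p) / len(y)
    return w

def predict_prob(w, X):
    Xb = np.c_[np.ones(len(X)), X]
    return 1.0 / (1.0 + np.exp(-Xb @ w))

# Hypothetical stand-in: one-dimensional score for the 32 labelled subjects
y = np.array([1] * 11 + [0] * 21)
score = rng.normal(loc=1.5 * y, scale=1.0)[:, None]

cut = 0.26                        # the paper's sensitivity/specificity cut-point
errs = []
for _ in range(200):              # repeated pseudo-random 2/3 vs 1/3 splits
    idx = rng.permutation(32)
    train, valid = idx[:22], idx[22:]
    w = fit_logistic(score[train], y[train])
    pred = (predict_prob(w, score[valid]) > cut).astype(int)
    errs.append(float(np.mean(pred != y[valid])))

overall_error = float(np.mean(errs))   # averaged validation misclassification rate
```

Averaging the validation error over many random splits damps the variability that a single train/validation partition would leave in the estimate.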


Finally, some choices, such as the number of clusters, the cut-point for the probability of disease, and the focus on specific proteins, could be better targeted and ascertained with the involvement of the principal investigator, so as to understand the underlying biological mechanism better. Steps 1–6 of the PHD-SCORE procedure need to be run repeatedly until a satisfactory model is found.

ACKNOWLEDGEMENTS

We thank Dr. Rong Zhu for the guidance, encouragement, and availability to us during the study. We are also indebted to Drs. Eleanor Pullenayegum and Román Viveros-Aguilera for their financial support and constructive suggestions. We would also like to acknowledge Drs. Stephen Walter, Harry Shannon, and Lehana Thabane for their helpful comments.

BIBLIOGRAPHY

J. Farruggia, P. D. Macdonald & R. Viveros-Aguilera (1999). Classification based on logistic regression and trees. Canadian Journal of Statistics, 28, 197–205.
R. Zhu & A. H. El-Shaarawi (2009). Model clustering and its application to water quality monitoring. Environmetrics, 20, 190–205.

Received 31 October 2010 Accepted 6 January 2011


**A multiple testing procedure to proteomic biomarker for disease status**

Zhihui (Amy) LIU* and Rajat MALIK

Department of Mathematics and Statistics, McMaster University, Hamilton, ON, Canada L8S 4K1

Key words and phrases: Multiple testing procedure; imputation; classification tree; heatmap; proteomic biomarker

1. METHODOLOGY

When hundreds of hypotheses are tested simultaneously, the chance of false positives is greatly increased. We first removed the proteins containing more than 5 missing values in the active group and 11 in the inactive group. To control Type I error rates, resampling-based single-step and stepwise multiple testing procedures (MTPs) were applied, using the Bioconductor R package multtest (Pollard et al., 2009). Our null hypothesis was that each protein has equal mean relative abundance in the active and inactive groups. The non-parametric bootstrap with centring and scaling was used with 1,000 iterations. Non-standardized Welch t-statistics were used, allowing for unequal variances in the two groups.

We explored four imputation methods in the R package pcaMethods (Stacklies et al., 2007): the nonlinear iterative partial least squares algorithm, the Bayesian principal component analysis missing value estimator, the probabilistic principal component analysis missing value estimator, and the singular value decomposition algorithm. A classification tree, implemented in the R package rpart (Therneau et al., 2009), was fitted using the nine most "rejected" proteins from the MTPs. To visualize the results, a heatmap was produced (see Figure 1).

2. RESULTS

Setting the random seed to 20, the MTPs result in five rejections at the significance level α = 0.05: BPG0167, BPG0235, BPG0333, BPG0381, and BPG0447. Note that different choices of seed yield slightly different results. Applying the same seed and 1,000 bootstrap iterations to the imputed data, we found that the results from the four imputation methods are similar and agree with those obtained without imputation. A heatmap was plotted (Figure 1) using the nine most "rejected" proteins: BPG0098, BPG0100, BPG0105, BPG0167, BPG0235, BPG0310, BPG0333, BPG0381, and BPG0447. Notice that it correctly classifies almost all of the samples.
The classification tree predicts that patients 7, 11, 13, 18, 33, and 35 belong to the active group.

3. DISCUSSION

Multiple testing procedures are a concern because an increase in specificity is coupled with a loss of sensitivity. Furthermore, we suspect that the proteins with the greatest difference in relative abundance between the active and inactive groups are not necessarily the key players in the relevant biological processes. These problems can only be addressed by incorporating prior biological knowledge into the analysis, which may lead to focusing on a specific set of proteins.
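The flavour of resampling-based multiple testing used above can be sketched with a single-step max-T adjustment built from centred bootstrap Welch t-statistics. This is a simplified hypothetical in NumPy, not the multtest implementation, and the 50-protein data are synthetic (three proteins are planted with a true group difference).

```python
import numpy as np

rng = np.random.default_rng(3)

def welch_t(a, b):
    """Welch t-statistic allowing unequal variances, vectorized over columns."""
    na, nb = len(a), len(b)
    return (a.mean(0) - b.mean(0)) / np.sqrt(
        a.var(0, ddof=1) / na + b.var(0, ddof=1) / nb)

# Toy stand-in: 50 "proteins"; the first 3 truly differ between groups
act = rng.normal(size=(11, 50)); act[:, :3] += 2.0
ina = rng.normal(size=(21, 50))

t_obs = np.abs(welch_t(act, ina))

# Centring each column makes every null hypothesis true in the bootstrap
# world, so the resampled max |t| approximates the family-wise null
act_c, ina_c = act - act.mean(0), ina - ina.mean(0)
B = 1000
maxT = np.empty(B)
for b in range(B):
    a = act_c[rng.integers(0, 11, 11)]
    i = ina_c[rng.integers(0, 21, 21)]
    maxT[b] = np.abs(welch_t(a, i)).max()

# Family-wise adjusted p-value: fraction of bootstrap max-statistics
# at least as extreme as each observed statistic
adj_p = (maxT[None, :] >= t_obs[:, None]).mean(axis=1)
n_rejected = int((adj_p < 0.05).sum())
```

Comparing each observed statistic against the distribution of the maximum over all proteins is what controls the family-wise error rate, at the cost of the sensitivity loss discussed above.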

* Author to whom correspondence may be addressed. E-mail: amyatmac@yahoo.ca


Figure 1: A heatmap of nine proteins comparing the active (“+”) and inactive group.

4. CONCLUSION

The results from the MTPs before imputation agree more or less with those after imputation. The reason for this is unclear; it could mean either that the imputation works very well or that it is not very helpful. There is no evidence that race, sex, or age is associated with disease status. Proteins BPG0098, BPG0100, BPG0105, BPG0167, BPG0235, BPG0310, BPG0333, BPG0381, and BPG0447 seem to be important indicators of disease status. Patients 7, 11, 13, 18, 33, and 35 are predicted to be active among the 15 patients of unknown status.

ACKNOWLEDGEMENTS

We thank Dr. Peter Macdonald for his valuable advice and encouragement.

BIBLIOGRAPHY

K. S. Pollard, H. N. Gilbert, Y. Ge, S. Taylor & S. Dudoit (2009). multtest: Resampling-based multiple hypothesis testing. R package version 2.1.1.
W. Stacklies, H. Redestig & K. Wright (2007). pcaMethods: A collection of PCA methods. R package version 1.22.0.
T. M. Therneau, B. Atkinson & B. Ripley (2009). rpart: Recursive Partitioning. R package version 3.1-44.

Received 31 October 2010 Accepted 6 January 2011


**The process of selecting a disease status classifier using proteomic biomarkers**

Christopher MEANEY1 *, Calvin JOHNSTON1 and Jenna SYKES2

1 Dalla Lana School of Public Health, University of Toronto
2 Department of Statistics and Actuarial Science, University of Waterloo

Key words and phrases: Statistical classiﬁer; biomarker; support vector machine

1. METHODOLOGY

Our first challenge was to narrow down the list of available statistical classifiers. One resource at this stage was the open-source data mining and statistical learning program Weka, developed at the University of Waikato in New Zealand (Hall et al., 2009). With a few simple clicks on its intuitive "Explorer" interface, we quickly sifted through an extensive selection of supervised classification techniques, including logistic regression, trees and forests, bagging, boosting, neural networks, Bayes classifiers, and support vector machines. Weka's convenient cross-validation function allowed us to whittle down this long, intimidating list to only the most promising methods and to focus our efforts in a fruitful direction. We found that the multilayer perceptron and support vector machines (SVM) had strong empirical classification properties according to leave-one-out cross-validation. We opted to focus our energy on the SVM algorithm, as it seemed more intuitive.

Consider a Euclidean space with dimensionality equal to the number of "factors"; in this case the factors are the 400+ protein relative abundance measurements. Each patient maps to a vector in that space, positioned according to his or her measurements on each protein. Each patient also has a disease status: active or inactive. The SVM finds the optimally separating hyperplane, which splits the Euclidean space into two sections, each ideally containing patients of only one disease status. When multiple hyperplanes achieve this completely separating objective, the SVM chooses the one that maximizes the distance between the hyperplane and its nearest data points. In many SVM implementations, linear hyperplanes are extended with kernels by replacing the linear vector dot product with a nonlinear kernel function, such as a polynomial or radial basis function.
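The margin-maximizing idea can be sketched on toy data with subgradient descent on the hinge loss. This hand-rolled NumPy sketch is a hypothetical stand-in for the e1071 svm() routine the team actually used; the data and all parameter values are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(4)

def linear_svm(X, y, C=1.0, lr=0.01, epochs=500):
    """Linear soft-margin SVM fit by subgradient descent on the hinge loss.
    y must be coded +1 (active) / -1 (inactive)."""
    n, p = X.shape
    w, b = np.zeros(p), 0.0
    for _ in range(epochs):
        margins = y * (X @ w + b)
        viol = margins < 1                    # points on the wrong side of the margin
        grad_w = w - C * (y[viol, None] * X[viol]).sum(axis=0)
        grad_b = -C * y[viol].sum()
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Toy separable stand-in for the scaled protein measurements
X = rng.normal(size=(32, 10))
y = np.array([1] * 11 + [-1] * 21)
X[y == 1] += 1.5                              # well-separated classes

w, b = linear_svm(X, y.astype(float))
pred = np.sign(X @ w + b)
train_acc = float((pred == y).mean())
```

The `w` regularization term in `grad_w` is what pushes toward the widest margin, while the `C`-weighted hinge term penalizes the slack points inside it.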
A common modification is to relax the requirement that the hyperplane divide all cases perfectly by allowing a few penalized exceptions, known as slack vectors (Hastie et al., 2009). Since the provided data set was already of high dimensionality and was fully separable, neither of these extensions was used. We used the function svm() in the R (R Development Core Team, 2005) package e1071 (Wu, 2009). This function allows many options to be set, including a choice of kernels and the penalization of slack vectors (Meyer, 2009). Data were scaled within the function to prevent proteins with large variance from dominating the classification decision.

2. RESULTS

Many protein abundance measurements were absent from the data, and the function svm() requires complete data to operate. Thus, we imputed values for the missing data. Exploratory analysis indicated that there were a number of proteins for which fewer than 5 out of 47 measurements

* Author to whom correspondence may be addressed. E-mail: christopher.meaney@utoronto.ca


were missing, and a large number for which more than 30 out of 47 measurements were missing. Furthermore, several of the proteins with more than 20% missing data had missing values for all patients in one of the two status groups. We therefore limited our imputation to proteins for which fewer than 20 measurements were missing, leaving 146 proteins in the observational set of interest. After careful and extensive visual exploration of the missing data, we felt comfortable that the data were mostly missing at random and not systematically linked to the true underlying values (such as a higher probability of missingness for very small true values).

A review of the imputation literature revealed many possible strategies for replacing the missing values. The imputation segment of the Multivariate Statistics Task View on CRAN lists eight libraries devoted to multivariate imputation; however, the fact that our data set consisted of a large number of variables measured on a small number of cases restricted our options somewhat. We began with simplistic strategies such as mean/median imputation, but felt that more advanced methods would permit a more accurate classifier. We settled on a nonparametric k-nearest neighbours (k-NN) imputation strategy, as it had been recommended by authors in the field of proteomics and behaved well in leave-one-out cross-validation (Troyanskaya et al., 2001). The k-NN imputation was performed using the impute.knn() function from the impute library in R. Ranging k from 1 to 10, we found that only one patient was misclassified for every k. The misclassified patient was in the inactive status group, so our final empirical classification rates were 100% (0/11 cases misclassified) and 95.2% (1/21 cases misclassified) for the active and inactive statuses, respectively.
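The k-NN imputation idea can be sketched as follows. This is a hand-rolled hypothetical in NumPy rather than R's impute.knn(), on invented data: neighbours are ranked by distance over the columns both rows observe, and each gap is filled with the neighbours' average.

```python
import numpy as np

rng = np.random.default_rng(5)

def knn_impute(X, k=8):
    """Fill NaNs in each row using the k nearest rows, where distance is the
    root-mean-square difference over columns observed in both rows."""
    X = X.copy()
    for i in range(len(X)):
        miss = np.isnan(X[i])
        if not miss.any():
            continue
        dists = []
        for j in range(len(X)):
            if j == i:
                continue
            both = ~np.isnan(X[i]) & ~np.isnan(X[j])
            if both.any():
                dists.append((float(np.sqrt(np.mean((X[i, both] - X[j, both]) ** 2))), j))
        dists.sort()
        nbrs = [j for _, j in dists[:k]]
        for c in np.where(miss)[0]:
            vals = [X[j, c] for j in nbrs if not np.isnan(X[j, c])]
            if vals:                      # average the neighbours that observed column c
                X[i, c] = float(np.mean(vals))
    return X

# Toy stand-in: 47 "patients" by 12 proteins, about 10% of values missing
X_true = rng.normal(size=(47, 12))
mask = rng.random(X_true.shape) < 0.10
X_miss = X_true.copy()
X_miss[mask] = np.nan

X_imp = knn_impute(X_miss, k=8)
```

Observed entries are left untouched; only the NaN cells are replaced, which matches how the team's imputation fed the complete matrix into svm().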
Since all k behaved equally well, we arbitrarily chose k = 8 for our final classifier. Our decision processes were heavily reliant on empirical fitting, so there was some risk of overfitting. However, given the rigidity of the linearity requirement of our SVM implementation and the care we took in heeding this problem, we felt the amount of overfitting was probably small.

3. CONCLUSIONS

Our final classifier was a linear support vector machine with 8-nearest-neighbour imputation of missing values for proteins with fewer than 20% missing data. Allowing for small amounts of overfitting, we suspect our technique would correctly classify about 90% of all patients in each disease status category.

BIBLIOGRAPHY

T. Hastie, R. Tibshirani & J. Friedman (2009). "The Elements of Statistical Learning: Data Mining, Inference, and Prediction," 2nd ed., Springer, New York.
M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann & I. Witten (2009). The WEKA Data Mining Software: An Update. SIGKDD Explorations, 11, 1–9.
D. Meyer (2009). Support Vector Machines, CRAN R Project, accessed July 15, 2009 at cran.r-project.org/web/packages/e1071/vignettes/svmdoc.pdf.
R Development Core Team (2005). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. ISBN 3-900051-07-0, www.r-project.org.
O. Troyanskaya, M. Cantor, G. Sherlock, P. Brown, T. Hastie, R. Tibshirani, D. Botstein & R. Altman (2001). Missing value estimation methods for DNA microarrays. Bioinformatics, 17, 520–525.
T. Wu (2009). Misc Functions of the Department of Statistics (e1071), CRAN R Project, accessed July 15, 2009 at cran.r-project.org/web/packages/e1071/e1071.pdf.

Received 31 October 2010 Accepted 6 January 2011


**Biomarker detection for disease status prediction**

Xu WANG* and Chaoxiong XIA

Department of Statistics, University of British Columbia, Vancouver, BC, Canada V6T 1Z2

Key words and phrases: Missing data; imputation; K-NN; classification tree; logistic regression; variable selection; cross validation; bootstrap

1. METHODOLOGY

To examine whether the current invasive and expensive evaluation method can be substituted with a biomarker based on protein levels, we used cross-validation to evaluate the accuracy of specific methods in classifying samples as coming from an active or inactive patient. Our candidate classifiers were a classification tree with the splitting rule of Breiman et al. (1984), with the misclassification cost chosen by cross-validation error rates, and logistic regression models with proteins selected by AIC, with the number of covariates chosen by cross-validation error rates. To mitigate the effects of randomness in the cross-validation splitting, the bootstrap was used to re-sample from the 32 samples with definite disease status for use in the cross-validations.

For the missing values in the protein level data, three missing data treatments were applied before the protein levels were used as covariates of the classifiers. The first was the complete case method, which removed all variables with missing values. The second imputed each missing value with the overall mean for that protein identifier. The third was k-nearest neighbours imputation, based on the average protein level of the k nearest neighbours, with the optimal k chosen by cross-validation error rates. The original protein levels were log-transformed for the k-nearest neighbours imputation in order to obtain a distance that is symmetric in positive and negative percentage differences. The k-nearest neighbours imputation worked better for classification trees, while the complete case method was more favourable for logistic regression in terms of cross-validation error rates.
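The symmetry argument for the log transform can be checked with a two-line numeric example (the abundance values here are hypothetical):

```python
import numpy as np

# One protein on a reference subject and two neighbours: one at double
# and one at half the reference abundance
ref, double, half = 1.0, 2.0, 0.5

# On the raw scale a two-fold increase and a two-fold decrease sit at
# different distances from the reference...
raw = (abs(double - ref), abs(half - ref))

# ...but after a log transform the two two-fold changes are equidistant,
# which is the stated motivation for logging before k-NN distances
logged = (abs(np.log(double) - np.log(ref)),
          abs(np.log(half) - np.log(ref)))
```

Without the transform, up-regulated neighbours would systematically dominate the k-NN distance over equally discrepant down-regulated ones.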

Table 1: Fourfold cross-validation error rates for different methods.

Method   Missing data imputation            Total error   False inactive   False active
Tree     12-nearest neighbours imputation   0.0937        0.0917           0.0625
LG       complete case method               0.0684        0.0632           0.0828

2. RESULTS

The two classifiers considered, the classification tree (Tree) and logistic regression (LG), gave similar predictive performances according to both the cross-validation estimates of false inactive

* Author to whom correspondence may be addressed. E-mail: xuwang@stat.ubc.ca

Table 2: Predictions of patients with missing status.

Patient   Tree       LG         Decision
1         Inactive   Inactive   Inactive
2         Inactive   Active     Active
3         Inactive   Inactive   Inactive
4         Inactive   Inactive   Inactive
5         Inactive   Inactive   Inactive
6         Inactive   Active     Active
7         Active     Active     Active
8         Inactive   Active     Active
9         Inactive   Inactive   Inactive
10        Active     Inactive   Active
11        Inactive   Inactive   Inactive
12        Inactive   Inactive   Inactive
13        Active     Active     Active
14        Active     Active     Active
15        Inactive   Inactive   Inactive

Where the two classifiers disagreed, an Active prediction from either classifier was regarded as more reliable.

and false active error rates (see Table 1). Predictions for the status of the patients with missing status were obtained from the two models (see Table 2).

3. CONCLUSIONS

Given the low false inactive error rates obtained from both models, we conclude that it is possible to use classifiers as a pre-screening procedure for identifying active patients, particularly if a much larger sample were available for model training. However, without knowing the error rates of the current diagnostic method, it is not feasible to conclude whether classification methods based on a simple blood sample perform as well as the current diagnostic method.

ACKNOWLEDGEMENTS

We would like to thank the Department of Statistics of the University of British Columbia for its warm support of this case study.

BIBLIOGRAPHY

L. Breiman, J. H. Friedman, R. A. Olshen & C. J. Stone (1984). “Classiﬁcation and Regression Trees,” Wadsworth & Brooks, Paciﬁc Grove.

Received 31 October 2010 Accepted 6 January 2011


**Discussion of case study 2 analyses**

Robert F. BALSHAW* and Gabriela V. COHEN FREUE

NCE CECR PROOF Centre of Excellence, Vancouver, BC, Canada V6Z 1Y6

Key words and phrases: Variable selection; class prediction; imputation

1. INTRODUCTION

These analyses were performed on data provided as a case study for the 2009 SSC case study session organized by Alison Gibbs. A general overview of the data was provided in the Background and Description. A comparable analysis is described in Cohen Freue et al. (2010). In many ways, the data represent what has become a fairly standard statistical challenge in supervised learning: to develop a classification rule for the prediction of unknown class labels for 15 new samples based on a small training set, possibly utilizing only a subset of the features. In our experience, the principal challenge was the abundance of missing data, whose presence may well carry information about class membership. Missing values often occur with our analytical proteomic platform because of the challenge of protein identification as well as the detection of low-abundance proteins.

We would like to thank all six teams for their efforts; all are to be commended for their insightful analyses. The six teams will be referred to as follows:

• WX: Wang and Xia.
• LM: Liu and Malik.
• CDK: Chang, Demetrashvili, and Kowgier.
• MJS: Meaney, Johnston, and Sykes.
• GCP: Guo, Chen, and Peng.
• LMSS: Lu, Mann, Saab, and Stone.

2. METHODOLOGIES

The six teams used a wide variety of methodologies, which we have attempted to summarize very briefly in Table 1. In general, the teams all took similar approaches to pre-filtering the features based on detection rates and explored a variety of imputation methods. All teams eliminated a subset of the proteins with lower rates of detection, though the detection rule and threshold used varied between the teams. Essentially, two variant pre-filtering rules were used: (1) select proteins for which the overall detection rate was above an arbitrary, pre-determined threshold (0 and 1 by WX; 0.5 by CDK; 0.8 by MJS; 0.15 by LMSS); and (2) select proteins for which the within-class detection rates were above a threshold (approximately 0.5 by LM and 0.4 by GCP). Having chosen a set of proteins, the teams then utilized a variety of imputation methods (see Table 1) to permit the use of those classification techniques which do not permit missing values.

Two teams conducted univariate, one-protein-at-a-time tests before building a multivariate classifier (LM; LMSS). LM selected a reduced set of candidate significant proteins (controlling the Type I error rate), and LMSS selected a reduced set of candidate proteins that showed differential expression in several methods (including other approaches that test multiple features simultaneously). Though with different emphases and in combination with other methods, almost all teams utilized a logistic regression approach (WX, CDK, MJS, GCP, LMSS). Interestingly, one of these groups used a stepwise logistic regression to select clusters of proteins rather than individual proteins, on the basis of a preceding hierarchical cluster analysis.

* Author to whom correspondence may be addressed. E-mail: rob.balshaw@syreon.com

Table 1: Summary of strategies and accuracy.

Team WX
  Missing values: overall detection rate threshold: 1 and 0; imputation methods: mean and k-NN
  Classifier: tree with splitting rule (Tree) and logistic-AIC (LG)
  Selected model: ensemble of two models: Tree based on k-NN-imputed data and LG based on complete data
  Accuracy: 11/15

Team LM
  Missing values: within-group detection rate threshold: ∼0.5; imputation methods: NIPALS, BPCA, probabilistic PCA, SVD
  Classifier: non-standardized Welch t-test using FDR followed by a classification tree
  Selected model: tree based on top-9 tested proteins; high concordance among imputation methods
  Accuracy: 12/15

Team CDK
  Missing values: overall detection rate threshold: 0.5; imputation method: EM algorithm
  Classifier: RDA, DLDA, PLR, and SVM
  Selected model: PLR
  Accuracy: 14/15

Team MJS
  Missing values: overall detection rate threshold: 0.8; imputation methods: median/mean and k-NN (several k)
  Classifier: extensive list of supervised classification techniques available in Weka
  Selected model: SVM on k-NN-imputed data (k = 8, similar results for other values)
  Accuracy: 13/15

Team GCP
  Missing values: within-group detection rate threshold: ∼0.4; imputation methods: hot-deck and linear regression imputation
  Classifier: hierarchical cluster analysis followed by stepwise logistic regression to select clusters; a proteomic score, the mean difference between the two identified clusters, is calculated
  Selected model: logistic regression with the calculated proteomic score and age as covariates
  Accuracy: 11/15

Team LMSS
  Missing values: overall detection rate threshold: 0.15; imputation methods: k-NN, LLS, SVD, BPCA
  Classifier: classical t-test, Wilcoxon signed rank test, LASSO, LARS, and SLR
  Selected model: Model 1: LARS algorithm on the seven most frequent proteins from all methods, explored on SVD-imputed data; Model 2: logistic regression based on three proteins selected by a stepwise analysis of the seven most frequent proteins, and gender
  Accuracy: Model 1: 10/15; Model 2: 12/15
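The two detection-rate pre-filtering rules applied by the teams can be sketched on hypothetical data; the 47 × 100 matrix, the missingness rate, and the common 0.5 threshold here are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(6)

# Toy stand-in: 47 subjects (11 active, 21 inactive, 15 unknown) by
# 100 "proteins", with a substantial fraction of measurements missing
X = rng.normal(size=(47, 100))
X[rng.random(X.shape) < 0.4] = np.nan
active, inactive = np.arange(11), np.arange(11, 32)

# Rule (1): overall detection rate above a pre-determined threshold
overall = 1.0 - np.isnan(X).mean(axis=0)
keep1 = overall >= 0.5

# Rule (2): detection rate above the threshold within BOTH labelled classes
det_act = 1.0 - np.isnan(X[active]).mean(axis=0)
det_ina = 1.0 - np.isnan(X[inactive]).mean(axis=0)
keep2 = (det_act >= 0.5) & (det_ina >= 0.5)
```

Rule (2) is the stricter guard against proteins that are undetected in one class only, which, as noted above, may themselves be informative about class membership.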


Two groups explored tree-based methodologies, with (LM) and without (WX) a pre-filtering test. Two other groups used support vector machines (SVM) to build a classifier, one based on all pre-filtered features (MJS) and one in comparison with alternative methods, including two discriminant analysis approaches (CDK). In general, all groups used cross-validation to select a methodology to be used as a classification rule on the (unlabelled) test set.

3. COMPARISON OF RESULTS AND CONCLUSIONS

We had intended to compare the results by considering the teams' conclusions regarding the impact of the methods for dealing with missing values, the relevance of pre-selecting candidate features before building a classifier, and the evaluation of different multivariate classification methods. As most teams noticed, the high abundance of missing data required a pre-filtering step in which proteins with low detection rates needed to be down-weighted or discarded outright. Interestingly, most groups applied only one criterion in this pre-filtering step and did not explore the effect of different pre-filtering rules on the results. However, most groups then went on to explore the impact of the imputation method, with each team bringing some unique insight to the problem and complementing the general view. Among the imputation methods evaluated in this study, k-nearest neighbours (k-NN) was used most often; it would appear that the number of neighbours (k) did not have a significant impact on the results (MJS). LM found that the results from the univariate tests were robust to all imputation methods as well as to the presence of missing values. On the other hand, WX and LMSS each found that for some multivariate methodologies, classification performance varies depending on the imputation approach.
Although most groups gave particular emphasis to variations of logistic regression (WX, CDK, GCP, LMSS), only two teams (CDK and GCP) selected a logistic regression model after comparing its performance with that of other multivariate classification methods. One group (MJS) built its classifier using support vector machines after comparing them with other methods. Intriguingly, one team combined the results of two classifiers (a tree-based classifier and an AIC-selected logistic regression) using a majority-vote ensemble to classify the unknown samples. Another distinctive and appealing approach was taken by the GCP team, who first performed a cluster analysis on the proteins and then proposed a method (PHD-SCORE) to identify significant clusters of proteins and generate a classifier score. As proteins could quite reasonably be expected to demonstrate strong correlations, this approach may well identify groups of proteins with greater relevance than considering proteins individually. Interestingly, none of the teams investigated the use of the Elastic Net as a means to extend the utility of the logistic regression methods.

Even though the most relevant aspect of this study was the predictive performance of a protein-based classifier, an ancillary problem not explicitly explored by any group is the composition of the resulting panel of features and, in particular, its size. While high-throughput technologies have enabled the discovery of interesting proteins and genes related to many diseases, in general these technologies are not yet ready, or may not yet be sufficiently cost-effective, to be moved to the clinical setting. Thus, in many applications it is important to emphasize the selection step of the analytical pipeline, with the goal of retaining only a small number of the most informative features.
Along these lines, two teams used one-protein-at-a-time tests to select the most promising proteins (LM and LMSS), with two teams suggesting the incorporation of relevant biological understanding to focus on specific sets of candidate proteins (LM, GCP). Likewise, one team studied the frequency with which proteins were selected by the different univariate and multivariate methods (LMSS), which can be seen as an alternative way to select the most promising candidate proteins. Finally, three of the teams incorporated the clinical variables at the end of the study, after exploring the set of proteins, with varying findings: LM reported no evidence that any of the clinical variables were related to the disease condition; GCP built a logistic regression classification model that selected age from among all the covariates and the proteomic score; and LMSS found only gender to be statistically significant using a chi-square test, though it was not selected in their final logistic regression model.

The predictive performance of each team's final classifier was tested on 15 samples whose class labels were kept hidden until after each team's results were provided to the organizers. The accuracy of each team's final model appears in the last column of Table 1. Though this permits a quantitative comparison between the teams, we will resist the temptation to over-interpret the relative predictive performance of the teams and their methods. Instead, we would like to once again thank all the teams for their hard work, insight and thoughtful questions.

Received 31 October 2010
Accepted 6 January 2011

DOI: 10.1002/cjs

The Canadian Journal of Statistics / La revue canadienne de statistique
