Multi- and Megavariate Data Analysis
Part II: Advanced Applications and Method Extensions
Second revised and enlarged edition

L. Eriksson, E. Johansson, N. Kettaneh-Wold, J. Trygg, C. Wikström, and S. Wold

Umetrics Academy

By Umetrics AB

© 1999-2006 Umetrics AB

Information in this document is subject to change without notice and does not represent a commitment on the part of Umetrics AB. The software, which includes information contained in any databases, described in this document is furnished under a license agreement or non-disclosure agreement and may be used or copied only in accordance with the terms of the agreement. It is against the law to copy the software except as specifically allowed in the license or non-disclosure agreement. No part of this document may be reproduced or transmitted in any form or by any means, electronic or mechanical, including photocopying and recording, for any purpose, without the express written permission of Umetrics AB.

SIMCA and MODDE are registered trademarks of Umetrics.

Edition date: March 2006
ID #1054
ISBN-10: 91-973730-2-8
ISBN-13: 978-91-973730-2-9

Umetrics AB
Box 7960
S-907 19 Umeå
Sweden
Tel. +46 (0)90 184800
Fax. +46 (0)90 184899
Email: info@umetrics.com
Home page: www.umetrics.com

Contents

17 A multivariate approach to QSAR
17.1 Objective
17.2 Introduction
17.2.1 General considerations
17.2.2 A short historical account
17.2.3 QSAR formalism
17.2.4 Semi-empirical analogy modelling
17.2.5 Scope of current chapter
17.3 Fundamental conditions of QSAR modelling
17.3.1 Selection of the training set
17.3.2 Description of chemical and biological properties
17.3.3 Data analytical techniques
17.3.4 Predictive validation
17.4 Example 1 - The SCHÜÜRMANN data set (N=20, K=5, M=1)
17.4.1 Background
17.4.2 PCA modelling
17.4.3 PLS modelling
17.4.4 Summary
17.5 Example 2 - The McCLOSKEY data set (N=20, K=13, M=1)
17.5.1 Background
17.5.2 PCA modelling
17.5.3 PLS modelling
17.5.4 Summary
17.6 Example 3 - The MULLER data set (N=66, K=56, M=1)
17.6.1 Background
17.6.2 PCA modelling and training set selection
17.6.3 PLS modelling
17.6.4 Summary
17.7 Example 4 - The HERMENS data set (N=15, K=8, M=8)
17.7.1 Background
17.7.2 PCA modelling
17.7.3 PLS modelling
17.7.4 Summary
17.8 Example 5 - The KARLEN data set (N=22, K=18, M=1)
17.8.1 Background
17.8.2 PCA modelling
17.8.3 PLS modelling
17.8.4 Summary
17.9 Example 6 - The CELLT data set (N=16, K=6, M=2)
17.9.1 Background
17.9.2 PCA modelling
17.9.3 PLS modelling
17.9.4 Summary
17.10 Example 7 - The CYP3A4 inhibition data set (
17.10.1 Background
17.10.2 PCA modelling
17.10.3 PLS modelling
17.10.4 Summary
17.11 Questions for Chapter 17
17.12 Summary and discussion
17.12.1 PCA, PLS and multivariate design - a framework for QSAR
17.12.2 Related QSAR fields
18 Peptide QSAR
18.1 Objective
18.2 Introduction
18.2.1 Peptide QSAR
18.2.2 The z-scales
18.2.3 Updated and extended z-scales
18.2.4 Illustrations
18.3 Example: Z-SCALES
18.3.1 The data
18.3.2 Pre-processing
18.3.3 Derivation of the z-scales
18.3.4 Interpretation of the z-scales
18.3.5 Multivariate design in z-scales
18.3.6 Summary
18.4 Example: ANGIO
18.4.1 The data
18.4.2 Reference PLS model
18.4.3 Training set selection using the COST approach
18.4.4 Training set selection using fractional factorial design
18.4.5 Training set selection using D-optimal design
18.4.6 Summary
18.5 Example: BITTER
18.5.1 The data
18.5.2 Reference PLS model
18.5.3 Training set selection
18.5.4 Summary
18.6 Example: PENTAPEP
18.6.1 The data
18.6.2 Reference PLS model
18.6.3 Two disjoint sub-sets of data separated in time
18.6.4 Summary
18.7 Questions for Chapter 18
18.8 Summary and discussion
18.8.1 Building block description enables sequence quantification
18.8.2 Synthesis of non-natural amino acids with specific properties
18.8.3 Quantitative sequence-activity modelling
18.8.4 Modelling of sequences of unequal length
19 Lead finding and optimization
19.1 Objective
19.2 Introduction
19.3 Lead finding using pharmacological profiling (Example: ALDRICH)
19.3.1 Background
19.3.2 PCA for data overview
19.4 Lead optimization with SMD (Example: ALDRICH)
19.4.1 Some preliminaries
19.4.2 Substituent scales for aromatic substituents
19.4.3 Selection of representative substituents
19.4.4 Construction of lead-centered statistical molecular design
19.4.5 Summary of application
19.5 Multivariate modelling and design of hexapeptides (Example: HEXAPEP)
19.5.1 Background
19.5.2 Initial PLS modelling
19.5.3 Refined PLS modelling
19.5.4 Molecular design of new hexapeptides
19.5.5 Virtual screening of new hexapeptides
19.5.6 Summary of application
19.6 Statistical molecular design in constrained PP-spaces (Example: PCBCYP2B)
19.6.1 Background
19.6.2 Illustration data set (PCBCYP2B)
19.6.3 Modelling approaches
19.6.4 An overview PCA model
19.6.5 Non-directional mapping of the CYP2B-region
19.6.6 Directional mapping of CYP2B-region
19.6.7 Summary and discussion of mapping results
19.7 Questions for Chapter 19
19.8 Summary and discussion
20 Multivariate combinatorial chemistry
20.1 Objective
20.2 Introduction
20.2.1 General considerations
20.2.2 The importance of representativity
20.2.3 Scope of multivariate CombC
20.3 Chemical characterization of compounds and building blocks (Step 1)
20.3.1 Demands on the chemical characterization
20.3.2 Two approaches to structure description
20.4 Selection of representative building blocks (Step 2a)
20.4.1 Some systematic and some random selection
20.4.2 Conventional and cluster-based design approaches
20.4.3 A hypothetical example of cluster-based design
20.4.4 Influence of research objective
20.5 Generation of the final library (Step 2b)
20.5.1 A simple example
20.5.2 A look at three disjoint S-spaces
20.5.3 Preliminary compound selection to assess synthetic feasibility
20.5.4 Making the whole compound library
20.5.5 Some remarks
20.6 Biological testing (Step 3)
20.7 Linking of chemical and biological data: QSAR (Step 4)
20.8 Questions for Chapter 20
20.9 Summary and discussion
21 Chem- and Bioinformatics
21.1 Objective
21.2 Introduction
21.2.1 A perspective: “Molecular informatics”
21.2.2 Some thoughts on bioinformatics
21.2.3 Methods for the analysis of sequence data
21.2.4 A multivariate approach to bioinformatics
21.3 Example: PromSeq
21.3.1 The data
21.3.2 Nucleoside principal properties
21.3.3 Interpretation of the PPs
21.3.4 Promoter sequence-activity modelling using PLS
21.3.5 Converting model information to improved promoters
21.3.6 Summary
21.4 Example: ACCPEP
21.4.1 The data
21.4.2 Standardizing peptides of unequal length
21.4.3 The QSAR model
21.4.4 Interpretation of the ACC-QSAR model
21.4.5 Summary
21.5 Questions for Chapter 21
21.6 Summary and discussion
21.6.1 Components of QSAM
21.6.2 Some recent developments in the cheminformatics area
22 ‘Omics’ Data Analysis
22.1 Objective
22.2 Introduction
22.2.1 The ‘omics’ world
22.2.2 Example data sets
22.3 Results for data set I (Genomics data set)
22.3.1 PCA of 59 by 800 data set
22.3.2 Conclusions of example I
22.4 Results for data set II (Metabonomics data set)
22.4.1 PLS-DA for Sprague-Dawley animals
22.4.2 Discussion and conclusions of example II
22.5 Results for data set III (Proteomics data set)
22.5.1 Base level PCA models
22.5.2 Top level PCA model
22.5.3 Predictions for the prediction set
22.5.4 Discussion and conclusion of data set III
22.6 Questions for Chapter 22
22.7 Summary and discussion
22.7.1 The need for multivariate projection methods
23 Orthogonal PLS (OPLS)
23.1 Objective
23.2 Introduction
23.2.1 Direct and indirect calibration
23.2.2 Presence of structured noise in the X-matrix
23.2.3 Orthogonal signal correction (OSC) - the first filter in a new generation of orthogonal filters
23.2.4 A second generation OSC procedure: The OPLS method
23.2.5 OPLS and its impact on model interpretability
23.2.6 The value of Y-orthogonal variation for model and observation diagnostics
23.3 Theory of OPLS
23.3.1 Objective of OPLS
23.3.2 The OPLS model
23.3.3 Rotated regression coefficients
23.4 Implementation in SIMCA-P
23.5 Example I, a single-Y case: Binary Powder Tl
23.5.1 Removal of structured noise from the score vector of the predictive component, tp
23.5.2 Interpretation of the predictive component
23.5.3 Interpretation of the orthogonal components
23.5.4 Discussion of example
23.6 Example II, an OPLS-DA application: Ovarian Cancer
23.6.1 Overall PCA model
23.6.2 Creation of balanced training and test sets
23.6.3 PLS-DA modelling
23.6.4 OPLS-DA modelling
23.6.5 Interpretation of the predictive component (to get biomarker profiles)
23.6.6 Discussion of example
23.7 Example III, a multi-Y calibration study: Metal-ion Complexes
23.7.1 Regression coefficients, rotated regression coefficients, and estimation of pure spectral profiles - a foreword
23.7.2 Background to example data set
23.7.3 Use of all constituents in Y (all Y-variables)
23.7.4 Exclusion of the Ni2+-response from Y (three Y-variables)
23.7.5 Predictions for prediction set samples
23.7.6 Discussion of example
23.8 Questions for Chapter 23
23.9 Summary and discussion
23.9.1 Some general remarks
23.9.2 The great asset of OPLS: Improved model transparency and interpretability
23.9.3 Specific uses of OPLS
23.9.4 When and when not to use OPLS
23.9.5 Related methods and some extensions
23.9.6 Concluding remarks
24 Hierarchical modelling
24.1 Objective
24.2 Introduction
24.3 Process example: HI-PROC
24.3.1 Base level block formation
24.3.2 Base level model structure
24.3.3 Summarizing input and feed (1st base level model)
24.3.4 Summarizing reaction conditions (2nd base level model)
24.3.5 Summarizing the purification step (3rd base level model)
24.3.6 Summarizing the less important Y-variables (4th base level model)
24.3.7 Top level model: PSV of the entire process
24.3.8 Monitoring of the prediction set data
24.3.9 Discussion of HI-PROC example
24.4 Questions for Chapter 24
24.5 Summary and discussion
24.5.1 Zoom-in/zoom-out capability
24.5.2 Focus on the “right” variables at each level
24.5.3 An alternative to variable selection
24.5.4 An alternative to block-scaling
24.5.5 Filtering of data
24.5.6 Missing data and outlier handling
24.5.7 Principal component regression, PCR
24.5.8 Related multi-block techniques
25 Non-linear PLS modelling
25.1 Objective
25.2 Introduction
25.2.1 General considerations
25.2.2 Degrees of non-linearity
25.2.3 Approaches to diagnosing and handling non-linearity
25.2.4 Non-linear PLS and other non-linear modelling tools
25.3 Binning and expansion of X-variables (GIFI-PLS)
25.3.1 Introduction to GIFI-PLS
25.3.2 Main steps of GIFI-PLS
25.3.3 Properties of GIFI-PLS
25.4 QSAR Example: ELASTASE (Peptide QSAR)
25.4.1 The data
25.4.2 Reference PLS model
25.4.3 GIFI-PLS model
25.4.4 Full PLS model
25.4.5 External predictivity assessment
25.4.6 Summary
25.5 Process Example: SimCODM
25.5.1 Introduction to SimCODM
25.5.2 Reference PLS model
25.5.3 PLS modelling using original and binned process variables
25.5.4 Where lies the non-linearity?
25.5.5 Summary
25.6 Questions for Chapter 25
25.7 Summary and discussion
26 Image Analysis
26.1 Objective
26.2 Introduction
26.2.1 General considerations
26.2.2 Image analysis with different objectives
26.2.3 Aims and scope
26.3 Basic principles of MACI and its relation to other methods
26.3.1 Making incongruent images congruent
26.3.2 Rationale of the 2D-DWT approach
26.3.3 Related approach
26.4 Work-flow of MACI
26.5 Tutorial example I: COFFEE POURING
26.5.1 Importing and preparing the images for compression
26.5.2 Compressing the images (“by variance” or “by scale”)
26.5.3 PCA to overview the data
26.5.4 PLS-DA between normal and strong coffee
26.5.5 Conclusions
26.6 Tutorial example II: COFFEE FILLING
26.6.1 Phase I: Image compression
26.6.2 Phase II: Multivariate analysis of the compressed images
26.6.3 Phase III: Reconstruction of model parameters as images
26.6.4 Conclusions
26.7 Industrial example: Steel
26.7.1 Compression of the images
26.7.2 PLS-DA modelling and assessment of classification ability
26.7.3 Reconstruction of PLS-DA model parameters
26.7.4 Conclusions
26.8 Questions for Chapter 26
26.9 Summary and discussion
26.9.1 Some general remarks
26.9.2 A few practical hints
26.9.3 A future outlook
26.10 Appendix: The basics of the 2D-image compressor
27 Data mining and integration
27.1 Objective
27.2 Introduction
27.2.1 General considerations
27.2.2 A few words on data mining
27.2.3 A few words on data integration
27.2.4 Design and sampling
27.2.5 Scope of chapter
27.3 Work-flow of data mining
27.3.1 Outliers, trimming and winsorising
27.3.2 Representative and diverse data (the training set)
27.3.3 Selection of test data (validation set, prediction set)
27.3.4 Centering and scaling
27.3.5 Overview and cluster analysis
27.3.6 Classification and discriminant analysis
27.3.7 Relationships and predictive modelling
27.4 A chemical data mining example: ChemGPS
27.4.1 Background and objectives
27.4.2 The need for a stable drugspace
27.4.3 ChemGPS and its use in data mining
27.4.4 Discussion of example
27.5 Workflow of data integration
27.5.1 Pre-processing of data
27.5.2 Obtaining an overview of the data
27.5.3 Ascertaining homogeneity and rep
27.5.4 Block-modelling using the hierarchical approach
27.5.5 Linking and contrasting blocks of data, and information transfer
27.5.6 Predictive modelling
27.6 A spectroscopic data integration example: Carrageenan
27.6.1 Background and objective
27.6.2 Mixture design to get representative Y-data in the training and test sets
27.6.3 The three blocks of spectral data
27.6.4 Base level PCA modelling
27.6.5 Integration of data: Top level PCA model
27.6.6 Contrasting of data: Top level OPLS model
27.6.7 Prediction of Y: Top level PLS model
27.6.8 Separating Y-predictive and Y-orthogonal spectral variation
27.6.9 Discussion of example
27.7 Combining the concepts: The Novartis PAT example
27.7.1 Background
27.7.2 Workflow to synchronize data
27.7.3 Nine base level batch models
27.7.4 Top level batch model
27.7.5 Discussion and Epilogue
27.8 Questions for Chapter 27
27.9 Summary and discussion
28 Multivariate analysis of preference data
28.1 Objective
28.2 Introduction
28.3 Multivariate conjoint analysis (Example: CONJOINT)
28.3.1 Introduction
28.3.2 Background to data set
28.3.3 PCA for overview of response data
28.3.4 Linking of product attributes and respondent rankings: PLS regression
28.3.5 Summary of application
28.4 Multivariate preference mapping (Example: SENSCONS)
28.4.1 Introduction
28.4.2 Background to data set
28.4.3 PCA for overview of sensory data (X)
28.4.4 PCA for overview of preference data (Y)
28.4.5 PLS modelling of sensory (X) and preference (Y) data
28.4.6 Summary of application
28.5 Questions for Chapter 28
28.6 Summary and discussion
Appendix I: Model derivation, interpretation, and validation
AI.1 Objective
AI.2 Introduction
AI.2.1 Main steps of MVDA
AI.2.2 The iterative nature of MVDA
AI.3 Evaluation and pre-processing of raw data
AI.3.1 Main steps
AI.3.2 What should we look for in univariate statistics?
AI.3.3 Efficacious plots
AI.4 Model derivation and interpretation
AI.4.1 Main steps
AI.4.2 PCA for data overview
AI.4.3 PLS for quantitative modelling and predictions
AI.5 Model validation and use of model
AI.5.1 Main steps
AI.5.2 Why is predictive validation necessary at all?
AI.5.3 Internal validation (cross-validation)
AI.5.4 Response permutation and cross-validation
AI.5.5 External validation
AI.5.6 Predictions of new observations (Use of model)
AI.6 Questions for Appendix I
AI.7 Summary and discussion
AI.7.1 MVDA is an iterative procedure
AI.7.2 Flowchart for MVDA
Appendix II: Statistical notes
Scaling
Jack-knifing (JK) and uncertainties of coefficients
Principal Components Modeling
R2X and R2Xadj
The number of components (model dimensions): A
Q2 and Q2V
Q2(cum) and Q2V(cum)
Relevance of variables
Leverages
Missing values correction factor
Score plots and Loading plots
Hotelling T2
Loading plots
Distance to Model
Distance to the model, augmented, DModXPS+
[…]s and predictions
Partial Least Squares Modeling
RSD of observations and variables
R2Y, R2X, R2Yadj, R2Xadj
R2V and R2Vadj
The number of model dimensions (A), evaluation of the model
Q2 and Q2V
Q2(cum) and Q2V(cum)
Variable influence: VIP (variable importance in the projection)
Leverages
Missing Data Correction Factor
PLS Plots
Rotated Coefficients
Predictions for new observations
Distance to the model, augmented, DModXPS+
PLS Time Series Analysis
References: OPLS
References
References for MVDA book
Index Part II
(Index Part I)

Part II, Section V: Chapters 17-22
MVDA in QSAR, Drug Design, Bio- and Cheminformatics, and ‘Omics’ data analysis

17 A multivariate approach to QSAR

17.1 Objective

Chapter 17 contains an introduction to quantitative structure-activity relationship (QSAR) modelling. QSAR is useful for understanding relationships between chemical structure and the biological or pharmacological action of compounds. Inevitably, many such relationships are multivariate by nature, where groups of key chemical variables jointly influence biological behavior. Multivariate methods are ideal tools for understanding these complex relationships, and for directing research towards compounds with enhanced biological performance.

This chapter introduces principal component analysis (PCA), partial least squares projections to latent structures (PLS), and multivariate design as useful tools in deriving multivariate QSARs. Seven QSAR data sets from the fields of drug design, pesticide research, and environmental toxicology and chemistry are worked out in detail, showing the benefits of PCA, PLS and multivariate design. PCA is useful when overviewing a data set and exploring relationships among compounds and among variables. PLS is the regression extension of PCA and is used for establishing QSARs.
Multivariate design is essential for selecting an informative training set of compounds for QSAR calibration. Multivariate design is also called statistical molecular design (SMD) within the pharmaceutical industry.

17.2 Introduction

17.2.1 General considerations

In the pharmaceutical industry, one important goal is to identify interesting chemical structures that have the potential to become approved drugs. Typically, such research aims to uncover possible relationships between chemical properties and preclinical activity patterns of drug candidates. One way to investigate such relationships is to use a semi-empirical mathematical model, a more or less simple polynomial function, in which the biological performance (activity, response, etc.) of a series of compounds is expressed as a function of their physico-chemical properties. This kind of mathematical expression is often referred to as a quantitative structure-activity relationship (QSAR) model (Figure 17.1).

Figure 17.1: A schematic overview of QSAR. [Diagram: chemical data matrix X and biological data matrix Y, each with N rows.]

A QSAR model is able to predict the performance (activity, response, etc.) of a not yet biologically tested, but chemically characterized, drug candidate or any other interesting molecule [see, e.g., Norinder, et al., 1998; Osterberg and Norinder, 2000; Zuegge, et al., 2002; Winiwarter, et al., 2003; Kriegl, et al., 2005]. In addition, it also reveals which kind of chemical property regulates a certain type of biological behavior, and how to modify these important properties to achieve improved performance [Dunn, 1989; Eriksson and Johansson, 1996]. Other areas of research where the QSAR methodology is increasingly used are environmental chemistry and environmental toxicology.
The main interest within these fields is to establish QSAR models for environmental pollutants, so that their fate, performance and persistence in the environment can be reliably estimated. If available and properly validated, QSARs are invaluable tools for risk assessment of existing chemicals.

17.2.2 A short historical account

From time immemorial, man has been involved in a never-ending search for chemical substances with various kinds of optimal or desired biological properties. During the era of alchemy, scientists were engaged in trial-and-error investigations with the aim of identifying compounds capable of curing, or, at least, alleviating health conditions determined or believed to be abnormal [Thompson, 1990]. During the latter part of the nineteenth century, these very early efforts were elaborated further into a period of study of the relationships of chemical structure to physiological action. As we contemplate mankind's subsequent research, we pass from the work of Meyer and Overton, who were able to find relations between the narcotic activity of a series of compounds and their hydrophobicity [Dearden, 1985], through the more quantitative physical-organic-chemistry contributions from Hammett [Hammett, 1970] in the 1940s and Taft [Taft, 1956] in the 1950s, and on to the general breakthrough for quantitative structure-activity relationships (QSAR) as a discipline, which came with the work of Hansch [Hansch, et al., 1970] in the 1960s. Hansch and coworkers demonstrated how QSARs could be formulated for series of congeneric compounds, using model systems reflecting lipophilicity and electronic properties. At the time of the pioneering work of Hansch, the QSAR methodology was mainly developed and employed in the areas of drug design and pesticide research. It was not until the 1970s that the QSAR concept started to be applied in the environmental sciences, and in particular in environmental chemistry and toxicology [Blum, et al., 1990].
Owing to the sudden awareness of the multitude of chemicals present in the environment, it became popular to use QSARs to try to predict chemicals' biological activity, fate, persistence and other responses of environmental interest. During the 1980s the use of QSAR in environmental chemistry and toxicology steadily increased and broadened, and it is today a well-established branch of the QSAR research field. A more elaborate historical account of QSAR can be found in the tutorial of Dunn [Dunn, 1989].

17.2.3 QSAR formalism

Here, some general comments on QSAR will be given, a schematic representation of a QSAR being provided in Figure 17.1. Typically, the development of a QSAR model can be formalized in the following manner. For a certain set of compounds, appropriate biological or pharmacological activity variables, and/or environmental effect variables, are monitored. These constitute the $(N \times M)$ response data matrix $\mathbf{Y}$, with $N$ being the number of compounds in the training set and $M$ being the number of response variables. Here, the name training set corresponds to the series of chemicals on which the QSAR calculation is based. Moreover, for the same set of compounds, relevant descriptors reflecting their chemical and structural properties are assembled. This compilation forms the $(N \times K)$ chemical descriptor matrix $\mathbf{X}$, where $N$ is the same as above and $K$ is the number of descriptors. The biological and/or environmental data ($\mathbf{Y}$) are subsequently modelled by the chemical and structural data ($\mathbf{X}$) in terms of

$$\mathbf{Y} = f(\mathbf{X}, \boldsymbol{\beta}) + \mathbf{E} \qquad \text{(eqn. 17.1)}$$

where $f(\mathbf{X}, \boldsymbol{\beta})$ represents the systematic part of the data, that is, the correlation structure between chemical descriptors and biological responses, and $\mathbf{E}$ the residuals, such as measurement errors and model imperfections.
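As a minimal sketch of eqn. 17.1, assume the simplest possible case: one descriptor (K = 1), one response (M = 1), and a first-order polynomial for f. The descriptor values, the responses and the function name below are hypothetical and are not taken from any of the data sets in this chapter. The fit splits Y into a systematic part f(X, β) and residuals E.

```python
def fit_first_order(x, y):
    """Least-squares fit of y = b0 + b1*x; returns (b0, b1)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = mean_y - b1 * mean_x
    return b0, b1


# Hypothetical training set: N = 5 compounds, one descriptor, one response.
X = [0.5, 1.0, 1.5, 2.0, 2.5]   # e.g. a lipophilicity-type descriptor
Y = [1.1, 1.9, 3.2, 3.9, 5.1]   # e.g. a measured activity

b0, b1 = fit_first_order(X, Y)                    # beta = (b0, b1)
systematic = [b0 + b1 * xi for xi in X]           # f(X, beta)
E = [yi - fi for yi, fi in zip(Y, systematic)]    # residuals

print("beta:", (round(b0, 3), round(b1, 3)))
print("E:", [round(e, 3) for e in E])
```

With real QSAR data X has many columns and f is estimated with a projection method such as PLS rather than with this closed-form fit; the decomposition into a systematic part and residuals is the same.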
In eqn. 17.1, $\boldsymbol{\beta}$ corresponds to the regression coefficients, which uncover the extent to which a given descriptor variable influences the modelling of a certain response.

17.2.4 Semi-empirical analogy modelling

The chemical descriptor data and biological activities underlying QSAR investigations usually have very limited meaning in themselves; they are much more meaningful in relation to a conceptual model of the phenomenon studied. Sometimes the biological or environmental system under study is well understood, and it might be possible to draft a conceivable functional form based on theoretical considerations. Frequently, however, the mechanism of biological action is not well enough understood, or may be too complicated, to permit an exact model to be postulated from theory. In such circumstances, an empirical model describing the complex relationships between chemical properties and biological responses in a local interval might be a valuable alternative. The degree of complexity that must be incorporated into such a local empirical model can rarely be known in advance, however, and typically QSAR modelling yields linear or low-order polynomial expressions.

Provided that the QSAR modelling is successful, it is to be expected that the form of the final model is consistent with that of the underlying fundamental relationship, because the QSAR model has the nature of a Taylor expansion [Wold and Dunn, 1983]. As a result of this, QSAR analysis is usually regarded as partially empirical modelling, and consequently the prefix “semi” is often added.

The goal in any QSAR modelling is to obtain the mathematical expression that best portrays the relationship between chemistry and biology.
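The Taylor-expansion argument of Section 17.2.4 can be written out as follows. The notation is an assumption introduced here for illustration only: y is a single response, x = (x_1, ..., x_K) is the descriptor vector of a compound, f is the unknown underlying relationship, and x_0 is a reference point at the centre of the compound series.

```latex
y = f(\mathbf{x}) \approx f(\mathbf{x}_0)
  + \sum_{k=1}^{K} \left.\frac{\partial f}{\partial x_k}\right|_{\mathbf{x}_0} (x_k - x_{0k})
  + \frac{1}{2} \sum_{k=1}^{K} \sum_{l=1}^{K}
    \left.\frac{\partial^2 f}{\partial x_k \, \partial x_l}\right|_{\mathbf{x}_0}
    (x_k - x_{0k})(x_l - x_{0l}) + \dots
```

Truncating after the first-order terms gives a linear model whose coefficients approximate the local partial derivatives of f; keeping the second-order terms gives a low-order polynomial. The approximation degrades as compounds move away from x_0, which is why such a model can be expected to hold only in a local interval.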
To adequately describe the often complex nature of such phenomena, it is necessary to use a battery of relevant and consistent chemical descriptors. The assumption, or expectation, is then that the factors governing the events in a biological test system are represented in the multitude of descriptors characterizing the compounds. In other words, within a series of compounds it is anticipated that a small change in chemical structure will be accompanied by a proportionally small shift in biological activity, and that the multivariate set of descriptors will reveal these analogies. Hence, QSARs are sometimes referred to as analogy models.

Analogy models can be regarded as linearizations of the real, complicated structure-activity relationships. Wold and Dunn have shown that such analogy models normally have local validity only, that is, they can embrace only compounds rather similar in structure and exhibiting comparable modes of action. This is the reason why a QSAR should be based on a series of chemically and biologically similar compounds. It is noted, however, that the substances must be disparate enough to cause some systematic change in biological activity. Thus, unless the biological activity studied is unsophisticated, it is not recommended to include structurally too diverse chemicals in the same QSAR.

17.2.5 Scope of current chapter

To accomplish sound semi-empirical QSAR modelling, some fundamental conditions must be met. These are outlined in Section 17.3. Subsequent sections (Sections 17.4 through 17.10) are then used to illustrate these fundamental conditions. We will here discuss seven QSAR examples, taken from the fields of environmental toxicology and chemistry, and drug design.
These seven data sets vary with regard to the type and number of compounds, the number of chemical descriptors and biological responses, as well as the type of endpoint. The examples are only briefly introduced (reference is always made to the original data source) and summarized in terms of the number of compounds (N), number of chemical descriptors (K) and number of biological responses (M).

17.3 Fundamental conditions of QSAR modelling

The quality and usefulness of a QSAR model may vary considerably. Many QSARs are really useful, but some are not reliable at all. The latter situation arises because many QSARs are based on unbalanced data sets of varying quality, and sometimes rest on a statistically doubtful footing, causing them to have poor, if any, predictive power. The article written by Cronin and Schultz points out some of the most common mistakes made by QSAR practitioners [Cronin and Schultz, 2003].

For any QSAR application it is absolutely essential that the QSAR model can make good predictions of biological or environmental responses for new, previously untested or yet unsynthesized, compounds, and that some judgment of the reliability of these predictions can be made. It is indeed possible to develop such predictively sound QSARs, provided that certain fundamental conditions are met regarding (i) the selection of the training set, (ii) the description of chemical and biological properties, (iii) the type of data analytical techniques used, and (iv) the manner in which the resulting model is predictively validated [Wold and Dunn, 1983].

17.3.1 Selection of the training set

It is important to recognize that a QSAR model must be based on a representative series of compounds. One cannot construct a model based on a series exhibiting moderate variation in chemical properties, say, chloroethanes, and automatically expect that it will be predictively meaningful for an entirely different type of chemical architecture, say,
chlorinated naphthalenes. The training set compounds must be chemically and biologically similar to those compounds featuring in the validation set, and to those included in the prediction set for which systematic large-scale predictions are made [Eriksson and Hermens, 1995a; Wold and Dunn, 1983].

So the question that arises, then, is how do we make a meaningful selection of compounds to put in the training set? Fortunately, chemometrics has equipped the QSAR science field with a tool called multivariate design, which is aimed at solving the problem of how to form a well-composed training set. Multivariate design, in QSAR also known as statistical molecular design, and its use in selecting a representative set of compounds were outlined in Section 5.4.3. Training sets generated according to multivariate design are encountered in Examples 6 and 7 below.

17.3.2 Description of chemical and biological properties

The intuitive belief of most chemists and toxicologists would probably be that measuring many variables provides more information about the chemical and biological properties of compounds than measuring just a few variables does. This also corresponds with current chemometric philosophy. Interestingly, the use of multivariate chemical and biological data is becoming increasingly widespread in QSAR, concerning both drug design and environmental sciences. A multitude of chemical descriptors will stabilize the description of the chemical properties of the compounds, facilitate the detection of groups (classes) of compounds with markedly different properties, and help expose chemical outliers. A multivariate description of the biological properties is also highly recommendable. This leads to statistically beneficial properties of the QSAR and better opportunities to explore the biological similarity of the studied substances.
The absence of outliers in multivariate biological data is a very valuable indication of homogeneity of the biological response profiles among the compounds.

Thus, it is strongly recommended to include several chemical descriptors (“X-variables”) and biological or environmental responses (“Y-variables”) in a QSAR model. Because all our QSAR modelling efforts rest critically on the assumption of chemical similarity and biological homogeneity of compounds, we must analyze data that are rich enough to allow an adequate testing of this important assumption. Multivariate chemical data are used in all seven examples. In Example 4, a multivariate Y-block is used as well.

17.3.3 Data analytical techniques

A consequence of using multivariate chemical and biological data is the increased probability of multicollinearity. With multivariate data there is an obvious chance that many of the chemical and biological variables will be correlated, either partially or completely, i.e., that some variables will be linear functions of other variables. This multicollinearity property appears almost automatically in multivariate data, even though pair-wise variable correlations themselves might be low or moderate. Some aspects of multicollinearity were addressed in Section 2.3.2.

Multicollinearity is not a problem in the data analysis if it is recognized and treated appropriately, but it may give rise to serious problems in QSAR analysis if disregarded [Topliss and Edwards, 1979]. The traditional approach to QSAR analysis, based on multiple linear regression (MLR), often coupled with some kind of stepwise variable selection scheme, is not suitable for analyzing collinear variables. This is because the calculated regression coefficients become unstable and uninterpretable [Topliss and Edwards, 1979].
Some regression coefficients may be much larger than expected, or they may even have the wrong sign [Mullet, 1976]. Multivariate projection methods such as PCA and PLS, on the other hand, are often much more useful than MLR for multivariate QSAR analysis. PCA and PLS are deployed in all seven illustrations below.

17.3.4 Predictive validation

Any QSAR model needs to be predictively validated before it is seriously used to understand or predict biological or environmental responses of additional chemicals. The best predictive validation of a QSAR model is, of course, that it consistently predicts the response values of additional compounds. But the use of an independent external validation set of several compounds (say, between 5 and 10 compounds) is rare in QSAR. Predictive validation using external prediction compounds is a working principle in Examples 3, 5, 6, and 7.

In the absence of an external validation set, two reasonable principles of validation are discernible. One is based on an internal prediction procedure (cross-validation) and the other is based on fit to random numbers (response permutation). Cross-validation simulates how well the QSAR predicts new data using only the training set compounds. Response permutation estimates the chance (probability) of obtaining a good fit with randomly reordered response data. The cross-validation and response permutation procedures are further discussed in Appendix I.
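The two internal validation principles can be illustrated with a minimal sketch. It rests on several assumptions that are not in the text: a one-descriptor least-squares line stands in for the PLS model, cross-validation is done leave-one-out, and the data, function names and the number of permutations are hypothetical. Q2 is taken as 1 - PRESS/SS, where PRESS is the sum of squared prediction errors for the left-out compounds and SS is the total sum of squares of the response.

```python
import random


def fit_line(x, y):
    """Least-squares slope and intercept for a single descriptor.

    Assumption: a one-variable linear fit stands in for the PLS model.
    """
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def total_ss(y):
    """Sum of squares of the response around its mean."""
    mean_y = sum(y) / len(y)
    return sum((yi - mean_y) ** 2 for yi in y)


def r2_fit(x, y):
    """Explained variation: R2 = 1 - SS_residual / SS_total."""
    slope, intercept = fit_line(x, y)
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    return 1.0 - ss_res / total_ss(y)


def q2_loo(x, y):
    """Cross-validated Q2 = 1 - PRESS / SS_total, leaving one compound out at a time."""
    press = 0.0
    for i in range(len(x)):
        x_train, y_train = x[:i] + x[i + 1:], y[:i] + y[i + 1:]
        slope, intercept = fit_line(x_train, y_train)
        press += (y[i] - (intercept + slope * x[i])) ** 2
    return 1.0 - press / total_ss(y)


def permuted_r2(x, y, n_perm=200, seed=0):
    """R2 values obtained after randomly reordering the response."""
    rng = random.Random(seed)
    values = []
    for _ in range(n_perm):
        y_perm = list(y)
        rng.shuffle(y_perm)
        values.append(r2_fit(x, y_perm))
    return values


# Hypothetical data: one descriptor and one response for ten compounds.
x = [float(i) for i in range(10)]
y = [1.1, 2.9, 5.2, 6.8, 9.1, 11.0, 12.9, 15.2, 16.8, 19.1]

r2, q2 = r2_fit(x, y), q2_loo(x, y)
chance = permuted_r2(x, y)
share = sum(v >= r2 for v in chance) / len(chance)
print(f"R2 = {r2:.3f}, Q2 = {q2:.3f}, share of permuted fits >= R2: {share:.3f}")
```

A Q2 close to R2 and a very small share of permuted fits that reach the real R2 both point towards a model that is not a chance fit. With a real PLS model the same two quantities would be computed from the PLS predictions instead of the line fit.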
Cross-validation is used in all seven examples below.

17.4 Example 1 - The SCHUURMANN data set (N=20, K=5, M=1)

17.4.1 Background

Our first example concerns a series of ten chlorophenols and ten nitrophenols, and their acute toxicity measured according to the pollen tube growth (PTG) test [Schüürmann, et al., 1996]. The PTG test was carried out using in vitro growing pollen tubes of tobacco (Nicotiana sylvestris), and the biological response was expressed as the concentration causing 50% growth inhibition (log IC50) of the pollen tubes. As Schüürmann and coworkers pointed out, the PTG test is designed to reflect vertebrate toxicity rather than plant-specific effects of xenobiotics, because of the absence of chloroplasts and hence photosynthetic activity.

In order to characterize the chemical properties of the 20 compounds, five chemical variables were compiled by Schüürmann and coworkers. These were included to account for lipophilicity (log Kow), acidity (pKa), gas-phase dissociation enthalpy (ΔH), difference in solvation free energy (ΔΔG), and nucleophilic reaction enthalpy. With this data set, we wish particularly to show how important insights may be gained by using PCA prior to PLS and QSAR analysis.

17.4.2 PCA modelling

When applying PCA to the complete data set, i.e., six variables, a three-component model was obtained with R²X = 0.96 and Q²X = 0.78 (A = 3). These results indicate a strong systematic variation in the data. The PCA scores in Figure 17.2 reveal differences between the two classes of phenols. The chlorophenols are found in the left-hand part and the nitrophenols in the right-hand region. Interestingly, the chemical and biological properties of these classes are almost orthogonal. Furthermore, it is evident that there are (at least) two clusters within the nitrophenol group, one consisting of the mono-nitrophenols (compounds 11-13) and the other of the di- and trinitro congeners.
This means that the mono-nitrophenols have more in common with the chlorophenols than the other nitro-substituted phenols do. A consequence of the strong separation seen in Figure 17.2 is that one should expect QSAR modelling of the entire series, i.e., 20 compounds, to be of little value. Such a model would only reveal the trivial fact that we are working with two types of compounds, but be incapable of modelling the distributions of the compounds within the clusters.

To interpret the first two PCs we use the loadings (not displayed). The first PC, i.e., the one discriminating between the chloro- and nitrophenols, is governed by the behavior of the four variables log Kow, pKa, ΔH, and ΔΔG. Thus, it may tentatively be interpreted as reflecting acidity, both in aqueous phase and in gas-phase, and molecular lipophilicity. The second principal component differentiates among the substituted phenols mainly because of their nucleophilic reaction enthalpy and their biological response.

Figure 17.2: (left) PCA t1/t2 score plot of Example 1. The compounds are clustered with the chlorophenols to the left and the nitrophenols to the right.
Figure 17.3: (right) PLS regression coefficients of scaled and centered variables of Example 1 - Chlorophenol model. Log Kow is the dominant descriptor for explaining log IC50.
Variable description: log Kow = log octanol/water partition coefficient; pKa = aqueous-phase dissociation constant; ΔH = gas-phase dissociation enthalpy [kJ/mol]; ΔΔG = difference in solvation free energy [kJ/mol]; nucleophilic reaction enthalpy [kJ/mol]; log IC50 = log inhibitory concentration causing 50% effect [mol/l].

17.4.3 PLS modelling

The first QSAR was based on all twenty compounds. This model had R²Y = 0.50 and Q²Y = 0.03 (A=1), i.e., a predictively weak model, an outcome not at all surprising given the discussion in the foregoing section.
Subsequently, separate PLS models were fitted for the two classes. The PLS model for the chlorophenols yielded R²Y = 0.85 and Q²Y = 0.69 (A=1), which are satisfying values, while the PLS model of the nitrophenols was not convincing with its R²Y = 0.24 and Q²Y = -0.10 (A=1). Apparently, the correlation between chemical properties and biological data for the chlorophenols is strong, whereas no such correlation exists in the nitrophenol class. Thus, the conclusion must be that no QSAR may be obtained for the nitrophenols, although in the original work it was found possible to obtain QSARs by excluding some compounds diagnosed as outliers [Schüürmann, et al., 1996]. We will not pursue this topic here, however.

Finally, it is important to interpret the chlorophenol model. The PLS regression coefficients are displayed in Figure 17.3. We can see that log Kow is the dominant chemical descriptor, and an increase in PTG toxicity (low values of log IC50) is coupled to an increase in lipophilicity (high values of log Kow). This is in line with the conclusion reached by Schüürmann and coworkers.

17.4.4 Summary

In this example, the advantage of using PCA prior to PLS was the insight that compounds were strongly grouped. Accordingly, PLS was applied to separate sub-sets of chloro- and nitrophenols.

17.5 Example 2 - The MCCLOSKEY data set (N=20, K=13, M=1)

17.5.1 Background

QSAR technologies are predominantly applied to predict toxicity properties of organic molecules. In that respect our second illustration is interesting, because it concerns the application of QSAR modelling to inorganic species, viz., metal ions. Twenty mono-, di-, and trivalent metal ions were submitted to the Microtox® bioassay for the determination of their toxicity to bacteria [McCloskey, et al., 1996]. The biological endpoint recorded was the mean 15-min median effect concentration (log EC50).
In the original work [McCloskey, et al., 1996], this response was modelled using various one- and two-parameter selections, chosen from a pool of altogether 13 chemical descriptors, and MLR. We will here use the entire battery of 13 chemical variables and PLS modelling, and contrast old (MLR) and new (PLS) results.

17.5.2 PCA modelling

The PCA of the complete data set with 14 variables resulted in a model with R²X = 0.90 and Q²X = 0.73 (A = 3). When plotting the first two score vectors against each other (no plot shown), it was found that the metal ions were fairly evenly scattered, with the exception of the trivalent Fe- and Cr-species, and were all situated inside the 95% tolerance ellipse given by Hotelling’s T². Hence, the QSAR model should be fitted to all metal species at the same time.

17.5.3 PLS modelling

The PLS modelling gave a two-component QSAR with R²Y = 0.95 and Q²Y = 0.92. Thus, the explanatory power and predictive power are excellent. The PLS score plot of the first model dimension is shown in Figure 17.4. According to this plot, the toxicity of the metal ions in the Microtox bioluminescence assay is well modelled by the thirteen chemical descriptors. This is inferred from the tight “correlation band” in Figure 17.4, which indicates a sound correlation structure. There is a weak deviation of observations Cr³⁺ and Fe³⁺ from the correlation diagonal, but this discrepancy is accounted for by the second PLS component (no plot shown). Furthermore, inspection of DModX revealed no moderate outliers. Similarly, the Y-residuals indicated no outliers.

Figure 17.4: (left) PLS t1/u1 score plot of Example 2. There is a strong correlation between the X- and Y-data.
Figure 17.5: (right) PLS w*c1/w*c2 weight plot of Example 2. The single y-variable is marked with a box.
Variable description: OSE = number of outer shell electrons; NobleGas = dummy variable indicating whether the metal ion has noble gas configuration, 0 (No), 1 (Yes); AN = atomic number; rad = ionic radius [Å]; D-IP = difference in ionization potentials between ion oxidation number N and N-1 [eV]; D-Eo = the absolute difference in electrochemical potentials between the ion and its first stable reduced state [eV], reflects the ability of the ion to change its electronic state; Eneg = electronegativity, a measure of electron attracting ability; logKoh = log of the first hydrolysis constant for the reaction Meⁿ⁺ + H₂O => MeOH⁽ⁿ⁻¹⁾⁺ + H⁺; X2/r = metal-ligand binding tendency; Z2/r = ionic bond stability for metal-ligand complexes; AN/DIP = ratio of AN and D-IP; sigmp = “softness index”, related to the theory of hard and soft acids and bases; pH = resulting pH in test solution when examining toxicity; logEC50 = log 15-min median effect concentration in the MICROTOX® bacterial assay [µM].

For the interpretation of this model, the PLS weights plotted in Figure 17.5 are useful. This plot displays the relationship between all the variables at the same time, and reveals that the five X-variables Eneg, X2/r, sigmp, D-Eo, and NobleGas are more influential than the others. In this plot one considers the distance to the plot origin. The further away from the plot origin an X- or Y-variable lies, the stronger the model impact of that particular variable. In addition, the signs of the PLS weights inform about the correlation among the variables. For instance, the X-variable X2/r and the Y-variable log EC50 are positioned in the lower left and upper right quadrants, respectively. This means that they are inversely related.
Hence, when X2/r goes up (metal-ligand binding tendency rises) the value for log EC50 diminishes (toxicity increases).

The softness index (sigmp) is positively correlated with the biological response and discriminates among the metal ions according to their softness. Hard ions, which predominantly bind to oxygen or nitrogen, include Ca²⁺, Li⁺, and Na⁺. Borderline cases, which preferentially form complexes with oxygen, nitrogen and sulfur, are Co²⁺, Ni²⁺, Cu²⁺, Zn²⁺, and Pb²⁺. Soft ions, which mainly bind to sulfur, are Cd²⁺, Hg²⁺ and Ag⁺. It is seen that increased log EC50-values correlate with increased values of sigmp. Thus, soft ions are more toxic than hard ions in this toxicity assay. The relevance of the qualitative NobleGas descriptor indicates that metal ions with a noble gas electron configuration are modelled to be less toxic than those without. The contribution from the Eneg descriptor means that an increase in electronegativity is linked to elevated toxicity (lowered log EC50).

17.5.4 Summary

QSAR models are usually established for organic molecular structures, and rarely for inorganic species. The importance of the McCloskey data set is that it shows the possibility of deriving QSARs also for the latter category of substance. In the original work [McCloskey, et al., 1996], the authors derived eight QSARs based on employing either one or two X-predictors. Four QSARs were one-predictor models with R²Y ranging from 0.51 to 0.81 (no Q²Y given), and four QSARs were two-predictor models showing an R²Y range of 0.81 to 0.84. Recall that the R²Y we obtained with PLS, and by using the information in all 13 predictor variables, was 0.95. Thus,
the conclusion is that with MLR one cannot utilize the full potential of the complete set of X-variables.

17.6 Example 3 - The MULLER data set (N=66, K=56, M=1)

17.6.1 Background

In the third example, QSAR modelling of the soil sorption characteristics, expressed as the logarithm of the soil sorption coefficient (log Koc), of a heterogeneous set of 66 compounds will be discussed. The data set contains five sub-classes of compounds, a typical situation in environmental QSAR, multivariately characterized with 56 experimentally and quantum-chemically derived molecular descriptors. One initial aim of this project was to put together a meaningful training set. It was decided to allocate 20 compounds to the training set and utilize the remaining 46 substances as an external validation set [Eriksson, et al., 1997].

Because this series contained five sub-classes, i.e., five more or less strong chemical clusters, the training set selection phase had to ensure that representatives of all those sub-classes were chosen. This meant that one could not rely on one single multivariate design, since this would have neglected the chemical clustering. Instead it was found more realistic to construct a sequence of local multivariate designs, one in each of the five sub-classes. The compounds chosen according to each local multivariate design would subsequently be united to form the global training set. Thus, the final training set would contain members from all five chemical sub-classes.

It is the aim of this example to review how multivariate design was implemented to select an efficient training set that really acknowledges the five chemical sub-classes. The results of the QSAR analysis of the log Koc response are summarized, and the predictive validation of the QSAR model is described.

17.6.2 PCA modelling and training set selection

When analyzing all variables except the biological response, a four-component model was acquired with R²X = 0.72 and Q²X = 0.62.
This model will serve as a reference model, and will not be used for the actual training set selection. The first two model dimensions have by far the largest impact and summarize 0.55 of the total variation. Hence, the t1/t2 score plot shown in Figure 17.6 represents a good summary of the chemical properties of the 66 chemicals. We can see that the five sub-classes, denoted A - E, partially overlap, with two large clusters and three small. The two large clusters, A and D, include 26 and 17 compounds, respectively. These are, in fact, large enough to permit separate PCAs to be made. The clusters B, C, and E contain 5, 8, and 10 members, respectively.

Figure 17.6a: Overview of the training set selection procedure for Example 3. The reference PCA t1/t2 score plot is used to keep track of the selection. The clusters A - E include 26, 5, 8, 17, and 10 compounds, respectively. To be continued.

In the original work, it was decided to include 20 compounds in the training set, comprising 8 from cluster A, 2 from B, 2 from C, 5 from D and 3 from E. This allocation reflects an intentional weighting of the five clusters, so that the larger clusters get more representatives in the final training set than the smaller ones.

In short, the training set was selected by making local multivariate designs in clusters A and D, and by visual inspection of clusters B, C and E. Thus, a separate PCA was made of sub-class A, resulting in four principal components. A D-optimal design proposing 8 candidates for the training set was created in these four principal properties. Figure 17.6b underlines the good coverage of the A-cluster. Similarly, a PCA-modelling of cluster D resulted in three principal properties, which were exploited in a D-optimal design encoding the selection of five compounds. These five compounds cover well the space spanned by the D-cluster (Figure 17.6e).
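The weighting of the clusters can be checked with a little arithmetic. The text does not state which rule produced the allocation 8, 2, 2, 5, 3; the sketch below only assumes a plain proportional split of the 20 training compounds over the cluster sizes 26, 5, 8, 17 and 10, with largest-remainder rounding, and shows that this assumption happens to give the same numbers. The function name is hypothetical.

```python
def allocate_proportionally(cluster_sizes, total):
    """Split `total` picks across clusters in proportion to cluster size.

    Assumption: largest-remainder rounding, done in integer arithmetic
    so that the quotas always add up to `total`.
    """
    n_all = sum(cluster_sizes.values())
    quotas, remainders = {}, {}
    for name, size in cluster_sizes.items():
        # Whole part and remainder of the exact share total * size / n_all.
        quotas[name], remainders[name] = divmod(total * size, n_all)
    leftover = total - sum(quotas.values())
    # Hand the remaining picks to the clusters with the largest remainders.
    for name in sorted(remainders, key=remainders.get, reverse=True)[:leftover]:
        quotas[name] += 1
    return quotas


# Cluster sizes as given for Example 3.
sizes = {"A": 26, "B": 5, "C": 8, "D": 17, "E": 10}
quotas = allocate_proportionally(sizes, total=20)
print(quotas)  # {'A': 8, 'B': 2, 'C': 2, 'D': 5, 'E': 3}
```

The exact shares are about 7.9, 1.5, 2.4, 5.2 and 3.0, so the two compounds left over after rounding down go to clusters A and B.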
Figure 17.6 (cont.): (b) Selection of training set compounds from sub-class A. (c) Selection from sub-class B. (d) Selection from sub-class C. (e) Selection from sub-class D. (f) Selection from sub-class E. (g) The distribution of the overall training set (selected compounds are encircled).

The three remaining sub-classes, B, C and E, were processed visually. Hence, two B-representatives were selected by choosing one compound from each of two smaller clusters (Figure 17.6c), two C-representatives were identified in an analogous manner (Figure 17.6d), and three E-type compounds were chosen from the main cluster of this sub-class (Figure 17.6f). The discussed procedure led to the overall training set depicted in Figure 17.6g. The remarkably good coverage of this reference score space must be attributed to the multivariate design approach, here prudently modified to account for the five sub-classes. For more details regarding the training set selection phase, the reader is referred to the original work [Eriksson, et al., 1997].

17.6.3 PLS modelling

In the QSAR modelling, the selected training set and the 56 original X-variables were used. The PLS investigation yielded a two-dimensional model explaining 86% (R²Y = 0.86) and predicting 74% (Q²Y = 0.74) of the variation in the log Koc response. Figure 17.7 displays the agreement between observed and fitted log Koc for the training set.

Figure 17.7: (left) Plot of observed log Koc versus fitted values for the 20 training set compounds of Example 3.
Figure 17.8: (right) Plot of observed log Koc against predicted values for the 46 prediction set compounds of Example 3.

In the ensuing step, the established QSAR was predictively validated by means of the external prediction set.
Figure 17.8 presents the relationship between observed and predicted response values. The external Q²Y equals 0.70, which is strikingly similar to the internal Q²Y of 0.74 (obtained by cross-validation). Hence, the conclusion is that the developed QSAR is predictively valid. Finally, without entering into details regarding the model interpretation, we conclude that large compounds with high lipophilicity are modelled as having high soil sorption [Eriksson, et al., 1997].

17.6.4 Summary

In QSAR analysis in environmental sciences, adverse effects of chemicals released to the environment are modelled and predicted as a function of the chemical properties of the pollutants. Usually, the set of compounds under study contains several classes of substances, i.e., it is a more or less strongly clustered set. It is then necessary to ensure that the selected training set comprises compounds representing all those chemical classes [Eriksson, et al., 2000c].

Multivariate design in the principal properties of the compound classes is usually appropriate for selecting a meaningful training set. However, with the clustered data often seen in environmental chemistry and toxicology, a single multivariate design may be sub-optimal. This is because of the risk of ignoring small classes with few members and only selecting training set compounds from the largest classes.

The example reviewed in Section 17.6 proposes a procedure for training set selection which recognizes clustering (“cluster-based design”). Here, when non-selective biological or environmental responses are modelled, local multivariate designs are constructed within each cluster (class). The chosen compounds arising from the local designs are finally united in the overall training set, which will thus contain members from all clusters. The 66 compounds were divided into a training set of 20 chemicals and a prediction set of 46 substances.
The use of the external prediction set indicated the good predictive power of the developed model.
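The idea of cluster-based design, a local selection within each cluster followed by a union of the chosen compounds, can be sketched as follows. This is not the procedure of the original work: a greedy maximin (spread-maximizing) rule is used here as a simple stand-in for the D-optimal designs and the visual selection, and the score coordinates, quotas and function names are all hypothetical.

```python
import math


def spread_select(points, k):
    """Pick k indices that spread over the points (greedy maximin rule).

    Assumption: this rule is a stand-in for a local D-optimal design.
    """
    if k <= 0:
        return []
    if k >= len(points):
        return list(range(len(points)))
    dim = len(points[0])
    centroid = [sum(p[d] for p in points) / len(points) for d in range(dim)]
    # Seed with the point farthest from the cluster centroid.
    chosen = [max(range(len(points)), key=lambda i: math.dist(points[i], centroid))]
    while len(chosen) < k:
        rest = [i for i in range(len(points)) if i not in chosen]
        # Add the candidate whose nearest already-chosen point is farthest away.
        nxt = max(rest, key=lambda i: min(math.dist(points[i], points[j]) for j in chosen))
        chosen.append(nxt)
    return chosen


def cluster_based_training_set(clusters, quotas):
    """Unite the local selections from every cluster into one training set."""
    training = []
    for name, points in clusters.items():
        for idx in spread_select(points, quotas[name]):
            training.append((name, idx))
    return training


# Hypothetical two-dimensional score coordinates for two clusters.
clusters = {
    "A": [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (0.5, 0.5)],
    "B": [(5.0, 5.0), (5.0, 6.0), (6.0, 5.0)],
}
training = cluster_based_training_set(clusters, quotas={"A": 2, "B": 1})
print(training)
```

Because every cluster contributes its own quota, small clusters cannot be crowded out by large ones, which is the risk the summary above attributes to a single global design.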
