Chapter 32

Introduction to SIMCA-P and Its Application
Zaibin Wu, Dapeng Li, Jie Meng, and Huiwen Wang

Abstract SIMCA-P is user-friendly software developed by Umetrics, mainly used for principal component analysis (PCA) and partial least squares (PLS) regression. This paper introduces the main terms, the analysis cycle and the basic operations of SIMCA-P via a practical example. In the application section, this paper adopts SIMCA-P to estimate a PLS model with qualitative variables in the set of independent variables and applies it to sand storm prevention in Beijing. Furthermore, this paper demonstrates the advantage of the Conservation Tillage method in lowering wind erosion and shows that Conservation Tillage is worth promoting for sand storm prevention in Beijing.

32.1 Introduction to SIMCA-P
32.1.1 About SIMCA-P Software
SIMCA-P is developed by Umetrics and is mainly used for principal component analysis (PCA) and partial least squares (PLS) regression. It is user-friendly software based on Windows: models are convenient to handle, and the results can be easily illustrated by plots and lists, which present the explanations of the models in various forms. At present, SIMCA-P has been a standard tool in PLS regression analysis for researchers in many fields of science and technology.

Z. Wu and H. Wang
School of Economics and Management, Beihang University, 37 Xueyuan Road, Haidian District, Beijing 100083, China
e-mail: binship@126.com, wanghw@vip.sina.com
D. Li
Agricultural Bank of China, Beijing 100036, China
e-mail: zh.ldp@intl.abocn.com
J. Meng
School of Statistics, Central University of Finance and Economics, Beijing 100081, China
e-mail: mengjie517@126.com
V. Esposito Vinzi et al. (eds.), Handbook of Partial Least Squares, Springer Handbooks of Computational Statistics, DOI 10.1007/978-3-540-32827-8_33, © Springer-Verlag Berlin Heidelberg 2010

32.1.2 Some Terms in SIMCA-P
There are several special terms in the SIMCA-P system, and knowing them helps in mastering the software.
(1) Project. SIMCA-P is organized into projects. A project is a folder including the models with the related statistics and results.
(2) List. All the data are listed in tables in the system. The first row with the variable names is marked as the Primary variable ID. The first column is marked as identification numbers.
(3) Dataset. The set of data being processed is known as a Dataset. A project may contain several datasets.
(4) Model. Models are mathematical representations of your process and are developed using the data specified in the workset and with a specified model type.
(5) Workset. A workset is the set of data processed by the current active model. A workset can contain all the data, or be a subset of the primary data, with a particular treatment of the variables, such as role (predictor variables X, or responses Y), scaling, transformation, lagging, etc. The default workset is the whole data set with all variables as X at the project start, and the default model (unfitted) is a principal component model of X.
(6) Block. A block is a combination of the variables with the same role. For example, the Y block in a PLS model refers to all the dependent variables (responses).
(7) Class. The observations of a dataset can be split into different sets for different purposes, known as classes.

32.1.3 The Analysis Cycle
It is convenient to do PCA or PLS estimation with SIMCA-P. Users can get the analysis results after several steps in accordance with the principles of the PCA or PLS methods. The analysis cycle can be summarized as follows.
(1) Start a project. Users should import the primary data from a file or a database to create a new project.
(2) Preprocess the data. View or modify a SIMCA-P data set. For example, it is easy to generate new variables as functions of existing ones or from model results. Users can do similar operations to preprocess the data.
(3) Prepare the workset. Users should change the role of the variables to fit other models.
(4) Fit the models. After all the preparing procedures, users can do the estimation.
(5) Detect the outliers. Display the score scatter plot to show the possible presence of outliers, groups, and other patterns in the data. Users should exclude the outliers from the workset and go back to step (4) to fit a new model.

(6) Review the fit. After a fit, the whole spectrum of plots and lists is available for model interpretation. Users should judge the effect of the fitted model and decide whether to do prediction.
(7) To do prediction. Build the prediction set from the primary or any secondary data sets to do prediction.
The above steps can be shown as a road map (Fig. 32.1).

1. Start a project: Import data to create a project
2. Preprocess the data: View or modify a SIMCA-P dataset
3. Prepare the workset: Change the role of variable and specify model type
4. Fit the model: Fit the model automatically or manually
5. Detect outliers: Detect outliers and exclude them if existing
6. Review the fit: Judge the effect and interpret the results
7. To do prediction: Specify a dataset to do prediction
Fig. 32.1 Road map to SIMCA-P

32.2 The Basic Operations of SIMCA-P
The example below will illustrate the main operations of SIMCA-P. The data in this example describe the relationship between body condition and sports grade of people. 20 persons have been selected. The predictors reflect one's body condition, including avoirdupois (body weight), cummerbund (waist girth) and pulse. The responses are three grades of physical exercise: chin-up, curl and high jump. Table 32.1 shows the original data set (Jone Neter, used by Tenenhaus 1998).

Table 32.1 Observed data of body condition and sports grade
No   avoirdupois  cummerbund  pulse  chin-up  curl  high jump
1    191          36          50     5        162   60
2    189          37          52     2        110   60
3    193          38          58     12       101   101
4    162          35          62     12       105   37
5    189          35          46     13       155   58
6    182          32          56     4        101   42
7    211          38          56     8        101   38
8    167          34          60     6        125   40
9    176          31          74     15       200   40
10   154          33          56     17       251   250
11   169          34          50     17       120   38
12   166          33          52     13       210   115
13   154          34          64     14       215   105
14   247          46          50     1        50    50
15   193          36          46     6        70    31
16   202          37          62     12       210   120
17   176          37          54     4        60    25
18   157          32          52     11       230   80
19   156          33          54     15       225   73
20   138          33          68     2        110   43

32.2.1 The Main Window of SIMCA-P
Double click on the SIMCA-P icon on the desktop, and the main window opens as shown in Fig. 32.2. (Note: Before this operation, the project with the above data in Table 32.1 has been created. Otherwise the active model status window will not display.) The main window includes the following parts.
Fig. 32.2 Main window of SIMCA-P
(1) The command menu bar. The name and the folder of a project are shown in the title. The menu in the bar includes the entire functions menu.
(2) Standard and shortcut bar. These shortcut buttons are for activating command menus and plots. Pressing a button will perform a certain task.
(3) Plot and marker bar. Use the buttons in the Plot toolbar to insert labels or text in a plot, get information about observations or variables, enlarge and read positions in graphs, show a regression line in scatter plots or rotate 3D graphs. The main function of the Marker toolbar is to exclude or include the observations or variables in the active model and create a new model.
(4) The Favorites window. The Favorites window contains commands and plots, which are marked with different symbols. Double clicking on a symbol will execute a command or open a specified plot.
(5) The Workset bar. The bar displays the variables and observations in the workset and their status.
(6) The active model status window. The window shows the information about all the models, such as model name, type, number of components, etc.
(7) The Audit Trail window. The log events are shown in this window.

32.2.2 The Operations of SIMCA-P
The important operations are as follows.
(1) Import the data and create a project
Data can only be imported from files or databases, but not by keyboard. The system supports more than 10 types of files, such as txt, xls, mat, etc. Select File | New | Get data from file, and a standard dialog box opens to enter the file type, name and source address of the data file to be imported (Fig. 32.3). After importing the data file, the first page of the import wizard opens (Fig. 32.4). The row with the variable names is by default marked as the Primary variable ID and colored in dark green. You can select any other row as the Primary variable ID. If the Primary variable ID has not been specified, SIMCA-P creates the Primary variable ID as Var 1, Var 2, etc. The column with observation numbers or names is colored in dark yellow. Select any desired column as Primary Observation ID. Data are colored in white and text is colored in blue.
Fig. 32.3 Import the data file

In the Import data wizard, you can do other operations by pressing the buttons in the left window. The window also displays other information about the data set. Click on Next, and the project specification page of the import wizard displays (Fig. 32.5). Users should specify the project name and the folder in which to save the work file. The file type is usp. Click on Finish and the data set is imported. A project has been created (Fig. 32.6). The default workset is the whole data set with all variables as X and scaled to unit variance. The associated model is PCA.
Fig. 32.4 Import data wizard
Fig. 32.5 Specification of the project
Fig. 32.6 The active model status window
(2) Explore the data
Before fitting the model, you should understand the data comprehensively. Select Dataset | Quick info | Variables/Observations, and a window opens with the name of the variable/observation and default options (Fig. 32.7). The usual statistics are shown in the window, such as number, missing values, mean, etc.
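The usual statistics and the unit-variance scaling mentioned above can be reproduced outside the software. The following is a minimal sketch in Python, not SIMCA-P code: the function names `quick_info` and `unit_variance_scale` are hypothetical, and it assumes that unit-variance scaling is combined with mean-centring and uses the sample standard deviation.

```python
import statistics

# "avoirdupois" column of Table 32.1, observations 1-20.
avoirdupois = [191, 189, 193, 162, 189, 182, 211, 167, 176, 154,
               169, 166, 154, 247, 193, 202, 176, 157, 156, 138]


def quick_info(values):
    """Return a few of the usual statistics for one variable."""
    return {
        "n": len(values),
        "mean": statistics.mean(values),
        "std": statistics.stdev(values),  # sample standard deviation (n - 1)
        "min": min(values),
        "max": max(values),
    }


def unit_variance_scale(values):
    """Centre a variable and divide by its sample standard deviation (assumed convention)."""
    mean = statistics.mean(values)
    std = statistics.stdev(values)
    return [(v - mean) / std for v in values]


info = quick_info(avoirdupois)
scaled = unit_variance_scale(avoirdupois)
print(info)
print([round(z, 2) for z in scaled[:5]])
```

After scaling, every variable has mean zero and standard deviation one, so variables measured in different units contribute on a comparable footing to a PCA or PLS model.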

Fig. 32.7 Usual statistics and dataset
In some cases, a new variable should be generated from raw data. Select Dataset | Generate Variable, and SIMCA-P opens the wizard window displaying the active data set in a spreadsheet (Fig. 32.8). Enter the expression defining the new variable and click on Next. SIMCA-P displays the new variable, with its formula, statistics and Quick info plots (Fig. 32.9). Click Finish, and the new variable is added at the end of the active dataset.
Fig. 32.8 Generate variables
(3) Create the workset and set model options
After the primary data set is loaded, all the variables are selected as X variables (predictors). The active model type is PCX (Principal Component Analysis of the X variables). In order to change the role of the variables or observations, you can select Workset | Edit to open the Overview page of the workset dialog with the current observations and variables and their attributes (Fig. 32.10). The workset is organized into pages. Select the desired page to change the attributes of the observations or variables. In order to do PLS estimation, you can select Workset | Edit to change roles of variables by marking the variable y and clicking on the desired button Y (Fig. 32.11).

You can also set the class of observations in the Observations page. The model type is decided according to the role of variables, including PCA on X-block, PCA on Y-block, etc. After the above procedures, you can select Workset | Options to set the options of the current active model (Fig. 32.12). The model is by default a non-hierarchical base model.
Fig. 32.9 Info of new variable
Fig. 32.10 Overview page of the workset dialog
Fig. 32.11 Specify variables as responses
(4) Fit the model
You can select the Analysis menu to fit the model. The methods of fit include autofit, two first components, next component, zero component, remove component and autofit class models (Fig. 32.13).

Select Analysis | Autofit, and SIMCA-P extracts as many components as considered significant. When you fit a model, a plot window opens and displays the cumulative R2 and Q2 for the X (PCA) or Y (PLS) matrix (Fig. 32.14). After fitting a model, you can mark the model and click on Active Model Type | Hierarchical Base Model and select scores, residuals, or both as variables in another model (Fig. 32.15). The scores, residuals, or both would be added to the workset to be used as variables in another model.
Fig. 32.12 Model options
Fig. 32.13 Fit the model
Fig. 32.14 Model overview
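For reference, R2 and Q2 are commonly defined as below. This is the usual textbook form and is given here as an assumption, since the chapter does not spell out the exact formulas used by SIMCA-P. SS denotes the total sum of squares of the centred and scaled block, RSS the residual sum of squares after fitting, and PRESS the prediction error sum of squares obtained by cross validation.

```latex
R^2 = 1 - \frac{\mathrm{RSS}}{\mathrm{SS}}, \qquad
Q^2 = 1 - \frac{\mathrm{PRESS}}{\mathrm{SS}}
```

R2 measures how well the model fits the data at hand, whereas Q2 measures how well it predicts observations left out during cross validation, so a Q2 far below R2 suggests overfitting.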

Fig. 32.15 Hierarchical base model
(5) Detect the outliers
Double click t[1]/t[2] Scatter Plot in the Favorites window to display the score scatter plot after fitting (Fig. 32.16). In order to illustrate this plot, we extract two components. These plots show the possible presence of outliers, groups, and other patterns in the data. In Fig. 32.16, observation 14 is outside the 95% confidence region of the model. This means observation 14 is an outlier. In order to eliminate the effect of observation 14, you should exclude this observation from the workset. Press the Mark item button in the Marker toolbar and mark observation 14, and then press the red arrow button. As a result, observation 14 is excluded from the workset and a new unfitted model is created without observation 14 (Fig. 32.17).
Fig. 32.16 t[1]/t[2] scatter plot
(6) Review the results
You can select Analysis to plot or list some statistics, including scores, loadings, coefficients, residuals, etc. For example, you can select Analysis | Summary | List to show the individual cumulative R2 and Q2 for each Y variable (Fig. 32.18). In fact, you can double click a certain symbol in the Favorites window to execute a command. The Favorites bar is similar to a customized Navigation Bar. It contains commands and plots. They are marked with different symbols for specified plots/lists and for a command (works on the active model).

For example, you can double click on the Coefficients Plot in the Favorites window to show the coefficients plot (Fig. 32.19). Besides the results included in the Analysis menu and Favorites window, you can also select the Plot/List menu to plot or list all the results. The Plot/List menus allow you to plot and list input data such as observations and variable values, computed elements such as scaling weights, variable variances, etc., as well as results such as loadings, scores, predictions, etc., of all the fitted models.
Fig. 32.17 Detect the outliers
Fig. 32.18 Summary of the results
Fig. 32.19 Coefficients Plot

(7) To do Prediction
After fitting, the workset is by default specified as the prediction set. If you want to build a prediction set by combining observations from different data sets, or remove observations from the prediction set, select Predictions | Specify Predictions Set | Specify. The observations are displayed in the left window (Fig. 32.20). Select the ones you want in the prediction set and move them to the right window. After specifying the prediction set, you can select the Predictions menu to obtain the prediction information about the current model. For example, SIMCA-P computes the observation distances to the PLS model in the Y space (DModY) and displays them as line plots. The residual standard deviation of an observation in the Y space is proportional to the observation distance to the hyperplane of the PLS model in the corresponding space. By default, these distances are computed after all extracted components. A large DModY value indicates that the observation is an outlier in the Y space. You can select Distance to Model | Y Block | Line Plot to display this plot (Fig. 32.21).
Fig. 32.20 Specify predictions set
Fig. 32.21 Distance to model

32.3 Application
In this section, we provide an application of SIMCA-P. Sandstorms have been a big barrier against the development of the world, resulting in an annual global loss of 48 billion USD, including 6.5 billion USD in China. In recent years, sand storms in Beijing have caused many serious problems. Investigations showed that about 70 percent of the sand in these storms is generated by wind erosion of dry, fallow farmland around the city. Consequently, the study of wind erosion of soil becomes very important in sand storm prevention (Shen et al. 2000, Li and Gao 2001, Gao 2002, Zang 2003).
In this research, the Water Content in Soil ($x_1$), Soil Particle Size ($x_2$), the Rate of Straw Mulching ($x_3$) and the Type of Farmland are defined as four independent variables (IVs). The Type of Farmland is a qualitative variable (QV) consisting of the following four categories: sand farmland, traditional tillage farmland, grass farmland and Conservation Tillage farmland. These categories are regarded as different types of farmland. To establish a regression model of Wind Erosion Rate ($y$) with the above four IVs, the Type of Farmland should be transformed into four dummy variables (DVs), $D_1$, $D_2$, $D_3$, $D_4$. The sample size is 16. Table 32.2 shows the original data set.

Table 32.2 Wind erosion rate and IVs
No   y       x1      x2     x3    D1  D2  D3  D4
1    11.674  3.623   0.651  12.4  1   0   0   0
2    13.812  3.623   0.651  12.4  1   0   0   0
3    15.260  3.623   0.651  12.4  1   0   0   0
4    12.160  3.623   0.651  12.4  1   0   0   0
5    6.021   6.291   0.266  13.8  0   1   0   0
6    8.598   6.291   0.266  13.8  0   1   0   0
7    10.395  6.291   0.266  13.8  0   1   0   0
8    7.331   6.291   0.266  13.8  0   1   0   0
9    3.689   10.210  0.337  45.4  0   0   1   0
10   5.339   10.210  0.337  45.4  0   0   1   0
11   5.971   10.210  0.337  45.4  0   0   1   0
12   4.893   10.210  0.337  45.4  0   0   1   0
13   2.768   8.883   0.339  58.5  0   0   0   1
14   4.167   8.883   0.339  58.5  0   0   0   1
15   4.357   8.883   0.339  58.5  0   0   0   1
16   4.111   8.883   0.339  58.5  0   0   0   1

Based on the data in Table 32.2, the regression model can be written as follows:
$$y = u + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \alpha_1 D_1 + \alpha_2 D_2 + \alpha_3 D_3 + \alpha_4 D_4 \qquad (32.1)$$
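The transformation of the Type of Farmland into four dummy variables can be sketched in a few lines of Python. This is only an illustration: the helper names `dummy_code` and `matrix_rank` are hypothetical and not part of SIMCA-P. Because each observation belongs to exactly one category, the sketch also counts how many columns remain linearly independent once an intercept column is placed next to the four dummies, which matters when the model is estimated by least squares.

```python
# Farmland type of each of the 16 observations, four per category.
CATEGORIES = ["sand", "traditional tillage", "grass", "conservation tillage"]
farmland = [c for c in CATEGORIES for _ in range(4)]


def dummy_code(labels, categories):
    """Return one 0/1 column per category (D1..D4 in the text)."""
    return [[1 if label == cat else 0 for cat in categories] for label in labels]


def matrix_rank(rows, tol=1e-10):
    """Rank of a matrix by Gaussian elimination with partial pivoting."""
    m = [[float(v) for v in row] for row in rows]
    n_rows, n_cols = len(m), len(m[0])
    rank = 0
    for col in range(n_cols):
        if rank == n_rows:
            break
        pivot = max(range(rank, n_rows), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < tol:
            continue  # no usable pivot: this column depends on earlier ones
        m[rank], m[pivot] = m[pivot], m[rank]
        for r in range(rank + 1, n_rows):
            factor = m[r][col] / m[rank][col]
            for c in range(col, n_cols):
                m[r][c] -= factor * m[rank][c]
        rank += 1
    return rank


D = dummy_code(farmland, CATEGORIES)
row_sums = [sum(row) for row in D]
design = [[1] + row for row in D]  # intercept column u followed by D1..D4

print(row_sums)
print(matrix_rank(design), "independent columns out of", len(design[0]))
```

Every row sum equals one, so the intercept column is an exact linear combination of the four dummy columns and the five-column matrix has only four independent columns.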

It is clear that the following equation always holds in model (32.1):
$$D_1 + D_2 + D_3 + D_4 = 1 \qquad (32.2)$$
Consequently, the least-squares solutions for the parameters are not unique. We have used SAS 8.0 to obtain the estimation. The system provides the following notes: the model is not of full rank; some statistics will be misleading; a reported DF of 0 or B means that the estimate is biased. The above results show that there is full multicollinearity between the IVs. Therefore, the OLS method is invalid in this case study, and we adopted PLS to establish the regression model, which was executed by SIMCA-P 9.0.
The raw data was stored in c:\ and the name of the source file is sand.dif. At the beginning, select File | New | Get data from file to import the primary dataset and create a new project (Fig. 32.22). After the file was selected, the window of the Import data wizard was displayed (Fig. 32.23). The first row with the variable names is by default marked as the Primary variable ID. The first column is by default marked as identification numbers. Click on Next in the Import data wizard when finished. After importing the data, we should specify the project name and select the destination folder to save the project (Fig. 32.24). After the above procedures, a project has been created. By default all variables are selected as X. The active model type is PCX (Fig. 32.25).
In order to adopt PLS estimation, we can select Workset | Edit to change all the options of the default model in the Workset window. Variables are displayed with their roles (X, Y or excluded (-)). To change roles, mark the variable y and click on the desired button Y (Fig. 32.26).
Fig. 32.22 Get the data from file
Fig. 32.23 Import data

When we exit the Workset window, the model type has been changed from PCX to PLS. This model is unfitted and is the active model (Fig. 32.27). SIMCA-P extracts one component according to the cross validation rules after selecting Analysis | Autofit. The right plot displays the cumulative R2 and Q2 for the Y (PLS) matrix after the extracted component (Fig. 32.28). Double click t[1]/u[1] Scatter Plot in the Favorites window to display the t1/u1 plot (Fig. 32.29). The plot indicates a good fit, corresponding to the small scatter around the straight line. It indicates that there is a strong linear correlation between Wind Erosion Rate and its IVs, so the linear regression is fundamentally established.
Fig. 32.24 Specification of the project
Fig. 32.25 The default model
Fig. 32.26 Change the role of variable y
Fig. 32.27 Change the role of variable y
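The quantities behind the t1/u1 plot can be approximated by hand. The sketch below computes the first PLS component for a single response in plain Python: the weight vector w1 is proportional to X'y, the scores are t1 = X w1, and for one response u1 is simply the scaled y. It is an illustration under stated assumptions, not a reproduction of SIMCA-P: all variables, dummies included, are assumed to be centred and scaled to unit variance, and the x3 levels are our reading of Table 32.2.

```python
import math
import statistics

# Table 32.2 by farmland type: (x1, x2, x3, four observed y values).
# The x3 levels are an assumption about the table, not a verified value.
GROUPS = [
    (3.623, 0.651, 12.4, [11.674, 13.812, 15.260, 12.160]),  # sand (D1)
    (6.291, 0.266, 13.8, [6.021, 8.598, 10.395, 7.331]),     # traditional tillage (D2)
    (10.210, 0.337, 45.4, [3.689, 5.339, 5.971, 4.893]),     # grass (D3)
    (8.883, 0.339, 58.5, [2.768, 4.167, 4.357, 4.111]),      # Conservation Tillage (D4)
]

X, y = [], []
for g, (x1, x2, x3, ys) in enumerate(GROUPS):
    dummies = [1.0 if g == k else 0.0 for k in range(4)]
    for value in ys:
        X.append([x1, x2, x3] + dummies)
        y.append(value)


def autoscale(column):
    """Centre a column and scale it to unit sample variance."""
    mean, std = statistics.mean(column), statistics.stdev(column)
    return [(v - mean) / std for v in column]


def correlation(a, b):
    """Pearson correlation coefficient of two equally long sequences."""
    ma, mb = statistics.mean(a), statistics.mean(b)
    num = sum((p - ma) * (q - mb) for p, q in zip(a, b))
    den = math.sqrt(sum((p - ma) ** 2 for p in a) * sum((q - mb) ** 2 for q in b))
    return num / den


Z = list(zip(*[autoscale(list(col)) for col in zip(*X)]))  # rows of scaled X
u1 = autoscale(y)                                          # single response: u1 is the scaled y

# First weight vector, normalised to unit length, and the corresponding scores.
w1 = [sum(row[j] * u for row, u in zip(Z, u1)) for j in range(len(Z[0]))]
norm = math.sqrt(sum(w * w for w in w1))
w1 = [w / norm for w in w1]
t1 = [sum(z * w for z, w in zip(row, w1)) for row in Z]

r = correlation(t1, u1)
print(round(r, 3))
```

A correlation between t1 and u1 close to one corresponds to the small scatter around the straight line described above. Unlike OLS, this computation never inverts X'X, so the exact collinearity among the dummy variables does not prevent it from running.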

For a better illustration of the regression results, we extracted two components. The cumulative Q2 for the extracted components is 0.848, and the model explains 72.6% of the variation of the IVs and 90.2% of the variation of y. Double click t[1]/t[2] Scatter Plot in the Favorites window to display a two-dimensional score plot (Fig. 32.30). SIMCA-P draws the confidence ellipse based on Hotelling T2. Observations situated outside the ellipse are outliers. Figure 32.30 shows no outliers. Double click Observed vs. Predicted in the Favorites window to show the observed values vs. the fitted or predicted values (Fig. 32.31).
Fig. 32.28 The PLS model
Fig. 32.29 t[1]/u[1] scatter plot
Fig. 32.30 t[1]/t[2] scatter plot
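The Hotelling T2 check behind the confidence ellipse can be sketched as follows. This is an illustration rather than SIMCA-P's implementation: the scores are hypothetical, the helper names are invented, the critical value uses a commonly quoted form A(N^2 - 1) / (N(N - A)) times an F quantile, and the F quantile itself is an approximate table value supplied by hand.

```python
import statistics


def hotelling_t2(score_columns):
    """T2 of each observation from mean-centred, mutually orthogonal score vectors."""
    variances = [statistics.variance(col) for col in score_columns]
    n_obs = len(score_columns[0])
    return [
        sum(col[i] ** 2 / var for col, var in zip(score_columns, variances))
        for i in range(n_obs)
    ]


def t2_limit(n_obs, n_comp, f_crit):
    """Critical value A(N^2 - 1) / (N(N - A)) * F_crit(A, N - A)."""
    return n_comp * (n_obs ** 2 - 1) / (n_obs * (n_obs - n_comp)) * f_crit


# Hypothetical scores for 20 observations; observation 14 is placed far from the rest.
t1 = [-0.5] * 20
t1[13] = 9.5
t2 = [0.5, -0.5] * 10
t2[12] = 0.0
t2[13] = 0.0

F_CRIT_95 = 3.55  # approximate 95% point of F(2, 18), read from standard tables

t2_values = hotelling_t2([t1, t2])
limit = t2_limit(n_obs=20, n_comp=2, f_crit=F_CRIT_95)
outliers = [i + 1 for i, value in enumerate(t2_values) if value > limit]
print(outliers)
```

An observation whose T2 exceeds the limit lies outside the 95% ellipse in the score plot. With two components the ellipse is simply the set of score pairs whose T2 equals the limit.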

Figure 32.31 demonstrates that the estimation by PLS is effective. Double click Coefficients Plot in the Favorites window to show the standardized regression coefficients of the model (Fig. 32.32). The model consisting of the original variables is estimated as follows:
$$y = 9.36 - 0.36 x_1 + 5.54 x_2 - 0.03 x_3 + 2.11 D_1 + 0.10 D_2 - 0.59 D_3 - 1.62 D_4 \qquad (32.3)$$
Double click w*c[1]/w*c[2] Scatter Plot in the Favorites window to show both the X-weights (w or w*) and Y-weights (c) and thereby the correlation structure between X and Y (Fig. 32.33).
Fig. 32.31 Observed vs. Predicted values
Fig. 32.32 Standardized regression coefficients
According to Fig. 32.32, the larger the size of the soil particles, the more serious the wind erosion. Additionally, because Soil Water Content and Straw Mulching Rate are negatively correlated with Wind Erosion Rate, raising the Soil Water Content and increasing the Straw Mulching Rate help to ease the soil wind erosion problem. Considering the different kinds of farmland, we conclude that the Conservation Tillage farmland has the lowest wind erosion rate, while the sand farmland has the highest. Since the Conservation Tillage method cultivates the farmland in a shallow way and leaves the crop residues on the land surface as much as possible, it is the most effective way to prevent wind erosion. Furthermore, it can increase the land coverage rate, prevent water and soil loss and raise the level of production. Therefore, it is worthwhile promoting the Conservation Tillage method both for the prevention of sand storms in the Beijing area and for agricultural production.
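As a worked check, Eq. (32.3) can be evaluated for each type of farmland. The sketch below is illustrative only: it reads the coefficients of x1, x3, D3 and D4 as negative, which is an assumption consistent with the interpretation given in the text, and it uses the per-category values of x1, x2 and x3 as read from Table 32.2 (the x3 levels in particular are an assumption).

```python
# Coefficients of Eq. (32.3); signs are an assumption consistent with the text.
COEF = {
    "const": 9.36,
    "x1": -0.36,   # Water Content in Soil
    "x2": 5.54,    # Soil Particle Size
    "x3": -0.03,   # Rate of Straw Mulching
    "D": [2.11, 0.10, -0.59, -1.62],  # D1..D4
}

# (x1, x2, x3) for each farmland type, in the order D1..D4.
FARMLAND = {
    "sand": (3.623, 0.651, 12.4),
    "traditional tillage": (6.291, 0.266, 13.8),
    "grass": (10.210, 0.337, 45.4),
    "conservation tillage": (8.883, 0.339, 58.5),
}


def predict(x1, x2, x3, type_index):
    """Evaluate Eq. (32.3) for one observation of the given farmland type."""
    return (COEF["const"] + COEF["x1"] * x1 + COEF["x2"] * x2
            + COEF["x3"] * x3 + COEF["D"][type_index])


predictions = {
    name: predict(*values, k) for k, (name, values) in enumerate(FARMLAND.items())
}
for name, value in predictions.items():
    print(f"{name}: {value:.2f}")
```

Under these assumptions the predicted wind erosion rate is highest for sand farmland and lowest for Conservation Tillage farmland, which agrees with the ranking stated in the text.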

Fig. 32.33 Loading plots of IVs

32.4 Conclusion
This paper has introduced how to fit models with SIMCA-P. The examples show that SIMCA-P is an effective tool for conducting multivariate data analysis. In the empirical part, the results show that, compared with OLS, PLS is preferable in dealing with QVs. In the investigation, the PLS model not only illustrated the factors of soil wind erosion, which conformed fairly well with reality, but also demonstrated that the Conservation Tillage method is the most effective way to ease soil wind erosion. The results provide valuable information for Beijing sand storm prevention.

References
UMETRI AB: SIMCA-P for Windows: Multivariate Modeling, Analysis and SPC of Process Data. UMETRI AB (1996)
Tenenhaus, M.: La Régression PLS: Théorie et Pratique. TECHNIP, Paris (1998)
Shen, Y.C., Yang, Q.Y., Jing, K., Xu, J.X.: Sand-storm and Dust-storm in China and Prevention and Control. Journal of Arid Land Resources and Environment, 14(3), 11–14 (2000)
Li, L.J., Gao, Q.X.: Source Analysis of Beijing Sand-Dust in 2000. Research of Environmental Sciences, 14(2), 1–4 (2001)
Gao, Q.X.: The Dust Weather of Beijing and Its Impact. China Environmental Science, 22(5), 468–471 (2002)
Zang, Y.: Experimental Study on Soil Erosion by Wind under Conservation Tillage. 19(2), 56–60 (2003)