Instructions on how to run ParLeS

For a description of the software and algorithms please see:

Viscarra Rossel, R.A. 2007. ParLeS: Software for chemometric analysis of spectroscopic data.
Chemometrics and Intelligent Laboratory Systems (in-press) doi: 10.1016/j.chemolab.2007.06.006

1. Data file format


Before you can run ParLeS you will need to format your data correctly.

Please read carefully.

The general format is shown in Figure 1.

Header information:   Label    carbon     700     710     . .    2500

                      S1       Object1    OD11    OD12    . .    OD1k
                      .        .          .       .              .
                      .        .          .       .              .
                      Sn       Objectn    ODn1    .       . .    ODnk

                      s        y          X

Each row represents the spectrum of a sample.
Each column represents the reflectance for all samples at a particular wavelength.

s: sample label
y: response variable
X: spectroscopic predictor variables (reflectance measured at each of the K wavelengths)

X (bold capital) represents a matrix (N x K scalars)
y (bold lower case) represents a vector (array of scalars)
z (lower case) represents a scalar (a single number)

Figure 1. File format for importing into ParLeS

From Figure 1:
a. For the Calibration data:
First row must contain header information, i.e. labels that include those for your samples (S1,
S2, etc), your response variables (e.g. carbon, pH, etc) and labels for your predictor variables
(e.g. the wavelengths/wavenumbers).
First column should contain sample labels
The next column(s) should contain response variables (i.e. the y-variables). Note that ParLeS
accepts more than one response variable (see below).
The columns after that should contain the predictor variables (i.e. the X-data e.g. NIR spectra)

Example of format for calibration file, containing labels, a single response variable (OC) and NIR
spectra (700-2500nm):

Label    OC      700     702     ...    2500
S1       2.56    0.35    0.37    ...    0.67
S2       1.35    0.32    0.33    ...    0.62
.
.
.
Etc.
Note: you can have more than 1 response variable in your files. You will be asked to select the y-
variable you want to model or test in the appropriate sections of the software.

If you have more than one response variable then place them in the second, third, etc. columns, after the
sample labels and before the predictor variables (i.e. before the X-data).

Example of format for calibration file, containing labels, three response variables (OC, pH and N) and
NIR spectra (700-2500nm):

Sample name    OC      pH     N      700     702     ...    2500
S1             2.56    6.5    1.3    0.35    0.37    ...    0.67
S2             1.35    7.3    1.8    0.32    0.33    ...    0.62
.
.
.
Etc.

b. For the Prediction/test data


As for calibration data, but a prediction file requires a column of zeroes in place of the y-variable,
because these are the unknowns that you want to predict using your model.

Example of format for prediction file:

Label    OC    700     702     ...    2500
S1       0     0.35    0.37    ...    0.67
S2       0     0.32    0.33    ...    0.62
.
.
.
Etc.

If you want to test your models with independent test data then your file format will be as in a.
above, i.e. including the response variable data to be used to test the models. Remember to also
include headers as in a. above.

Prepare your data files for import into ParLeS and save them as
tab delimited (ASCII) text files.
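
If your data are already held in a script rather than a spreadsheet, the layout described above can be written out directly. The following is a minimal Python sketch, not part of ParLeS: the function names write_parles_file and write_prediction_file and all argument names are hypothetical, and the only things taken from these instructions are the column order (labels, response variables, predictor variables), the header row, the tab delimiter and the column of zeroes for prediction files.

```python
import csv


def write_parles_file(path, sample_labels, y_names, y_values, x_names, x_values):
    """Write a tab-delimited text file: labels, response column(s), then spectra."""
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t")
        # Header row: a label heading, the response names, then the
        # wavelengths/wavenumbers of the predictor variables.
        writer.writerow(["Label", *y_names, *x_names])
        for label, y_row, x_row in zip(sample_labels, y_values, x_values):
            writer.writerow([label, *y_row, *x_row])


def write_prediction_file(path, sample_labels, y_name, x_names, x_values):
    """Prediction file: a single column of zeroes stands in for the unknown y."""
    zeros = [[0] for _ in sample_labels]
    write_parles_file(path, sample_labels, [y_name], zeros, x_names, x_values)
```

For the single-response example above, you would call write_parles_file with the labels S1 and S2, the response name OC, the OC values, the wavelengths as x_names and one list of reflectances per sample.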
2. Importing data into ParLeS

a. Import data for modelling


In this tab, you may choose to:
(i) select a file (with the above formatting) for modelling, or,

(ii) check the box labelled 'Check to join files from a directory' to merge multiple
spectroscopic files in x,y format (e.g. where x is frequency and y is reflectance) into a single
file.

Figure 2. The import data for modelling tab

If you select (i) then:


- Use the 'Get file for modelling' browse folder button to select the path and your data file.
- Press the IMPORT DATA FOR MODELLING button and you will see the header information of
your file
- Using the numeric control 'Total number of y variables', select the total number of response variables
in your file. For example, if you have 3 response variables as in the example above then you will write
3 in this numeric control. If you only have 1 response variable then write 1.
- Using the numeric control 'Select y variable for modelling', select the response variable you want to
model. Remember that ParLeS uses the PLSR1 algorithm, i.e. models a single y-variable at a time.
Following from the above example, if you want to model pH then you would write 2 in this control,
if you want to model N then you write 3.
- Press the IMPORT DATA FOR MODELLING button and if your data file is correctly formatted
and you have correctly identified the total number of y-variables and the y-variable you want to model
then you should see a sample of your data in the windows labelled y variables, Labels, Selected y,
X variables and Spectral range. You will also see a histogram and descriptive statistics of the
y-data and a sample of your spectra on the graph.
- If you cannot see the correct data in the windows or you cannot see the spectra then before you
proceed you will need to check your data file and if necessary remake the file by carefully following
the above instructions.

If the file format is incorrect or you have incorrectly identified the total number of y variables in your
data file then you will be able to see this in the sample data windows and more than likely your
spectra will not plot correctly.
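
To see why a wrong 'Total number of y variables' spoils the import, it can help to work through how the two numeric controls map onto columns. The Python sketch below is an illustration only and assumes the layout from section 1 (one label column, then the y columns, then the X columns); split_header is a hypothetical helper, not a ParLeS function.

```python
def split_header(header_line, total_y, selected_y):
    """Split a tab-delimited header row the way the two numeric controls imply."""
    fields = header_line.rstrip("\r\n").split("\t")
    if not 1 <= selected_y <= total_y:
        raise ValueError("selected y must lie between 1 and the total number of y variables")
    if len(fields) < total_y + 2:
        raise ValueError("header has too few columns for the stated number of y variables")
    label_heading = fields[0]               # first column: sample labels
    y_names = fields[1:1 + total_y]         # next total_y columns: response variables
    x_names = fields[1 + total_y:]          # everything after that: predictor variables
    return label_heading, y_names[selected_y - 1], x_names
```

With the three-response header from section 1, total_y = 3 and selected_y = 2 pick out pH and leave 700, 702, ... as the predictor labels. Entering total_y = 2 by mistake would push the N column into the predictor variables, which is the kind of error that shows up in the sample data windows and in the plotted spectra.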

In Figure 2, the data file contains 152 samples (size y: rows) and 933 predictor variables (size X:
columns). The data file has 25 response (y) variables, the Labels look correct (i.e. GYD), the first
y variable was selected for modelling, the Spectral range is in wavenumbers starting from 3992.5
cm-1, and the spectra are soil MIR spectra.

If you select (ii) then the files to be merged need to be:


- in a single directory which you specify in the Directory with files to join control (the best
thing to do is to copy the file path rather than use the browse button), and
- of the same type, i.e. with the same file extension protocol, which you specify in the File
extension control. For example text files will have a .txt extension. Note that the files should
be in ASCII format.

You may then run the program using the IMPORT DATA FOR MODELLING button. Once the
software has run, a sample of the merged spectra will be displayed. This may take some time
depending on the number of files that you have. If the sample spectra do not appear to plot properly,
then an error has occurred and you should check that you have the correct directory or that you have
the correct file extension.
The merged file may be saved by checking the SAVE MERGED FILE control or it may be further
analysed in ParLeS (see below).
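
If you prefer to merge the x,y files yourself before importing, the idea can be sketched in Python as follows. This is not the ParLeS routine. It assumes that each file holds two columns (x then y) separated by tabs, commas or spaces, has no header line, and shares the same x values as every other file. The function name merge_xy_files, the use of the file name as the sample label, and the placeholder column of zeroes for y are all assumptions made for the illustration.

```python
import csv
import glob
import os
import re


def merge_xy_files(directory, extension, out_path):
    """Join two-column x,y text files into one table with one row per spectrum."""
    paths = sorted(glob.glob(os.path.join(directory, "*" + extension)))
    if not paths:
        raise FileNotFoundError("no %s files found in %s" % (extension, directory))
    x_ref, rows = None, []
    for path in paths:
        xs, ys = [], []
        with open(path) as handle:
            for line in handle:
                parts = re.split(r"[,\s]+", line.strip())
                if len(parts) < 2:
                    continue  # skip blank or malformed lines
                xs.append(parts[0])
                ys.append(parts[1])
        if x_ref is None:
            x_ref = xs
        elif xs != x_ref:
            raise ValueError("%s does not share the x values of the first file" % path)
        label = os.path.splitext(os.path.basename(path))[0]
        rows.append([label, 0] + ys)  # 0 is a placeholder for the y variable
    with open(out_path, "w", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t")
        writer.writerow(["Label", "y"] + x_ref)
        writer.writerows(rows)
    return len(rows)
```

Write the merged file to a different directory (or with a different extension) from the input files, otherwise it will be picked up as an input the next time the function is run.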

b. Import data for prediction


In this tab you may select a file for: (i) prediction of unknowns or (ii) to test your models with
independent test data. Refer to file format instructions in 1 above.

Figure 3. The import data for prediction tab. (a) for prediction of unknowns (b) for independent
testing of models. Note: in the latest version of the software you will also see a histogram and
descriptive statistics of the y-data.
For the prediction of unknowns the file requires a column of zeroes replacing the y variable. In this
case your Total number of y variables will be 1 and the Select y variable for prediction will also
be 1. See Figure 3a.

For testing your models with independent test data:


- Using the numeric control 'Total number of y variables', select the total number of response variables
in your test file.
- Using the numeric control 'Select y variable for prediction', select the response variable for which
you want to test your model.
- Press the IMPORT DATA FOR MODELLING button and if your data file is correctly formatted
and you have correctly identified the total number of y-variables and the y-variable you want to test
then you should see a sample of your data in the windows labelled y variables (this is All test y
variables in earlier versions of ParLeS), Labels, Selected y, X variables and Spectral range. The
graph will show a sample of the spectra used for the predictions.
- If you cannot see the correct data in the windows or you cannot see the spectra then before you
proceed you will need to check your data file and if necessary remake the file by carefully following
the above instructions.

If the file format is incorrect or you have incorrectly identified the total number of y variables in your
data file then you will be able to see this in the sample data windows and more than likely your
spectra will not plot correctly.

In Figure 3b, the data file contains 76 test samples (size y: rows) and 933 predictor variables (size X:
columns). The data file has 25 test response (y) variables, the Labels look correct (i.e. GYD), the
first y variable was selected for testing, the Spectral range is in wavenumbers starting from 3992.5
cm-1, and the spectra are soil MIR spectra.
3. Data transformations, preprocessing and pretreatments
The Data Manipulations tab (called Preprocessing in earlier versions of ParLeS) can be used to
transform, preprocess and pretreat your spectra.

Figure 4. Transformations and preprocessing in the data manipulations tab

From the drop-down menus select the desired combination of transformation, preprocessing and
pretreatment to apply. You can test any combination of methods as long as you understand what they
do and you carefully follow the instructions.

From Figure 4, using the dropdown menus you can perform the following transformations and
preprocessing:
o Data transformation: transform diffuse reflectance (R) data to Log(1/R) or to Kubelka-Munk units,
K/S = (1-R)^2/(2R). You may also transform from Log(1/R) to R.
o Light scatter and baseline corrections: correct data for light scattering effects, etc. using
Multiplicative Signal Correction (MSC), Standard Normal Variate (SNV), SNV with quadratic
detrending, wavelet detrending or SNV with wavelet detrending.
The wavelet detrending level (trend level) specifies the number of levels of the wavelet decomposition,
which is approximately (1 - trend level)*log2(Ls), where Ls is the signal length. When the trend level is
zero, the signal trend is equal to zero and the detrended signal is identical to the input signal. Wavelet
detrending may be thought of as a form of baseline correction.
o De-noising/Smoothing: de-noise data using a Median filter, the Savitzky-Golay filter or Wavelet
de-noising. For the Median filter, select the rank to be used in the filtering. For the Savitzky-Golay
filter, first select the number of data points used to fit the curve and then the order of the polynomial
you wish to fit. For the Wavelet de-noising, select the desired wavelet scale for de-noising. ParLeS uses
a Daubechies wavelet with 4 vanishing moments.
o Differentiation: correct the data for baseline, particle size, etc. using first or second derivatives
together with the desired sampling interval.
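
For readers who want to check what some of these options do to a single spectrum, the Python sketch below works through the two transformations, SNV and a simple first derivative. It is an illustration under stated assumptions rather than the ParLeS code: Log(1/R) is taken to use base-10 logarithms with R expressed as a fraction, SNV uses the sample standard deviation, and the derivative is approximated by a plain difference between points a fixed gap apart. All function names are hypothetical.

```python
import math
from statistics import mean, stdev


def reflectance_to_log_inverse(spectrum):
    """Log(1/R), assuming base-10 logarithms and R as a fraction in (0, 1]."""
    return [math.log10(1.0 / r) for r in spectrum]


def log_inverse_to_reflectance(spectrum):
    """Inverse of the transformation above: R = 10^(-Log(1/R))."""
    return [10.0 ** (-a) for a in spectrum]


def kubelka_munk(spectrum):
    """Kubelka-Munk units, K/S = (1 - R)^2 / (2R)."""
    return [(1.0 - r) ** 2 / (2.0 * r) for r in spectrum]


def snv(spectrum):
    """Standard Normal Variate: centre and scale a spectrum by its own mean and SD."""
    m, s = mean(spectrum), stdev(spectrum)
    return [(v - m) / s for v in spectrum]


def first_difference(spectrum, gap=1):
    """First derivative approximated by differences between points `gap` apart."""
    return [spectrum[i + gap] - spectrum[i] for i in range(len(spectrum) - gap)]
```

Each function takes one spectrum (one row of X) as a list of numbers, so a whole data set is processed by applying the function row by row.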

The software also offers a number of methods for pretreating the predictor data.

Figure 5. Pretreatments in the data manipulations tab

From Figure 5, using the drop-down menu you can select which data pretreatment (or enhancement)
to use before you move onto the multivariate modelling. The choices include:
- Mean centre,
- Variance scale,
- Mean centre & variance scale

NOTE: it is common practice, although not imperative, to Mean Centre your data before PCA and
PLSR.
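
The three pretreatments act on the columns of X (i.e. on each wavelength across all samples). A minimal Python sketch follows, assuming that variance scaling divides each column by its sample standard deviation; the function name pretreat and its arguments are hypothetical, and ParLeS may differ in detail.

```python
from statistics import mean, stdev


def pretreat(X, centre=True, scale=False):
    """Column-wise mean centring and/or variance scaling of a list-of-rows matrix."""
    columns = list(zip(*X))
    means = [mean(c) if centre else 0.0 for c in columns]
    sds = [stdev(c) if scale else 1.0 for c in columns]
    treated = [[(v - m) / s for v, m, s in zip(row, means, sds)] for row in X]
    # The column means and SDs are returned so that the same pretreatment
    # can be applied later to prediction or test spectra.
    return treated, means, sds
```

Calling pretreat(X) mean centres, pretreat(X, centre=False, scale=True) variance scales, and pretreat(X, scale=True) does both.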

Once the particular combination is selected, press the RUN SELECTION button. The first graph will
show your raw data and the graph on the bottom part of the ParLeS window will show you the
combined transformed, preprocessed and pre-treated spectra. You may investigate the effect of each
algorithm separately by selecting it and then pressing the RUN SELECTION button.

For example if you have diffuse reflectance data you may choose to transform these to Log(1/R);
correct for light scattering effects using the MSC; de-noise your signal using the wavelet de-noising at
scale = 2; take the first derivative and mean centre your data before you perform PCA or PLSR.

You can save the manipulated data to a file using the SAVE MANIPULATED DATA control (called
SAVE PREPROCESSED DATA in earlier versions of ParLeS). The saved file will be a tab
delimited text file.
4. Principal Components Analysis (PCA)
ParLeS implements an iterative PCA algorithm based on the NIPALS algorithm described in Martens
& Naes (1989).
In the PCA tab, using the numeric control or slide bar you need to select the maximum number of
PCA components to calculate (Figure 6).

Figure 6. The PCA tab

The progress bar shows the component currently iterating.


The results from the PCA are displayed in a number of graphics that include:
- the loadings vs. wavelength/wavenumber plot. Using the numeric controls on this plot you may
select to view the loading for each principal component separately or, as in Figure 6, all loadings
simultaneously
- the scores vs. scores plot. Using the numeric controls on this plot you may select the scores for the
principal component that you want to plot.
- the loadings vs. loadings plot. Using the numeric controls on this plot you may select the loadings
for the principal component that you want to plot.
- the percent variation of the predictor data that is explained by each component

In Figure 6 results are shown for a total of 10 principal components.
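
For readers who want to reproduce the scores, loadings and percent variation outside ParLeS, the following Python sketch shows one common form of the NIPALS iteration using NumPy. It is a generic textbook-style version, not the ParLeS source: the function name nipals_pca, the starting vector, the convergence tolerance and the built-in mean centring are all assumptions.

```python
import numpy as np


def nipals_pca(X, n_components, tol=1e-10, max_iter=500):
    """NIPALS PCA on a mean-centred copy of X.

    Returns the scores T, the loadings P and the percent of the variation
    in X explained by each component.
    """
    E = np.asarray(X, dtype=float)
    E = E - E.mean(axis=0)
    total_ss = np.sum(E ** 2)
    n, k = E.shape
    T = np.zeros((n, n_components))
    P = np.zeros((k, n_components))
    for a in range(n_components):
        # Start from the column with the largest remaining sum of squares.
        t = E[:, np.argmax(np.sum(E ** 2, axis=0))].copy()
        for _ in range(max_iter):
            p = E.T @ t / (t @ t)          # project the residuals onto the scores
            p /= np.linalg.norm(p)         # loadings are kept at unit length
            t_new = E @ p
            converged = np.linalg.norm(t_new - t) <= tol * np.linalg.norm(t_new)
            t = t_new
            if converged:
                break
        T[:, a], P[:, a] = t, p
        E = E - np.outer(t, p)             # deflate before the next component
    explained = 100.0 * np.sum(T ** 2, axis=0) / total_ss
    return T, P, explained
```

Plotting one column of T against another gives a scores vs. scores plot, and plotting a column of P against the wavelengths/wavenumbers gives the loadings plot.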

Note that in ParLeS version 3.1 you can interact with the scores vs. scores and loadings vs. loadings
plot. Glide your mouse over the data points and click on the point that you want to identify. The point
will change colour and its label will be briefly displayed on the graph.

The PCA scores and loadings can be saved to tab delimited text files by checking the SAVE PCA
SCORES & LOADINGS check box. Two separate dialogues will appear once you check to save: the
first will ask you to give a name for the scores file and the second will ask you to provide a name for
the loadings file.
5. Jackknife cross validation
The cross validation tab can be used to help determine the optimal number of PLSR factors to model.
The results are shown in a number of graphics showing appropriate assessment statistics.
In the PLSR Cross validation tab, using the numeric slide bar or control select the maximum number
of factors for the leave-one-out cross validation (Figure 7).

Figure 7. The PLSR cross validation tab

With large data sets it may be too computationally expensive to use leave-one-out, so you could, for
example, use leave-ten-out. To do this, type the number of samples n to leave out. To help you
decide, the total number of samples in your dataset is given in the numeric indicator No. Samples.
To start the cross validation, press the RUN X-VAL button. The progress bar indicates how much of
the data has been cross validated.
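
The effect of n on the amount of computation can be seen from how the samples are split. The Python sketch below assumes that leave-n-out holds out consecutive blocks of n samples in file order; ParLeS may group the samples differently, and leave_n_out_folds is a hypothetical name.

```python
def leave_n_out_folds(n_samples, n_out=1):
    """Yield (training indices, held-out indices) for consecutive blocks of n_out samples."""
    for start in range(0, n_samples, n_out):
        held_out = list(range(start, min(start + n_out, n_samples)))
        training = [i for i in range(n_samples) if not start <= i < start + n_out]
        yield training, held_out
```

Under this assumption, a data set of 152 samples needs 152 model fits per number of factors with leave-one-out, but only 16 with leave-ten-out (the last block holds the remaining 2 samples).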
The results of the cross validation are displayed in the following graphics:
- the root mean squared error of cross validation (RMSE) vs. the number of factors
- R2 and Q2 statistics vs. the number of factors
- the Akaike Information Criterion (AIC) vs. the number of factors. Note that the AIC penalises
additional factors and so favours parsimonious models.
- the observed vs. cross validation predictions for a selected number of factors, where the user may
select the cross validation predictions to plot using the numeric control Select X-Val model to plot.
The fitted line and equation are also given. For this cross validated model, various assessment
statistics are given: R2, R2adjusted, RMSE, mean error (ME), the standard deviation of the error (SDE)
and the RPD.
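
If you save the observed vs. cross validation predictions, several of these statistics can be recomputed by hand. The Python sketch below uses common textbook definitions, which are assumptions here and may not match ParLeS exactly: the error is taken as observed minus predicted, R2 as the squared correlation between observed and predicted, Q2 as 1 minus the ratio of the prediction error sum of squares to the total sum of squares of the observed values, and RPD as the standard deviation of the observed values divided by the RMSE. The function name is hypothetical.

```python
import math
from statistics import mean, stdev


def assessment_statistics(observed, predicted):
    """Summary statistics for observed vs. predicted values (error = observed - predicted)."""
    errors = [o - p for o, p in zip(observed, predicted)]
    n = len(errors)
    press = sum(e * e for e in errors)             # sum of squared errors
    rmse = math.sqrt(press / n)
    obs_mean, pred_mean = mean(observed), mean(predicted)
    ss_obs = sum((o - obs_mean) ** 2 for o in observed)
    ss_pred = sum((p - pred_mean) ** 2 for p in predicted)
    cross = sum((o - obs_mean) * (p - pred_mean) for o, p in zip(observed, predicted))
    return {
        "R2": cross ** 2 / (ss_obs * ss_pred),     # squared correlation
        "Q2": 1.0 - press / ss_obs,
        "RMSE": rmse,
        "ME": mean(errors),
        "SDE": stdev(errors),
        "RPD": stdev(observed) / rmse,
    }
```

The function returns a dictionary keyed by statistic name, so that the values for different numbers of factors can be tabulated side by side.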

The cross validation results can be saved by checking the SAVE X-VAL RESULTS check box. Two
separate dialogs will appear once you check to save: the first will ask you to give a name for the
assessment statistics file and the second will ask you to provide a name for the observed vs. cross
validation predictions for the selected number of factors.
Note if you do not need to cross-validate, proceed to the PLSR modelling tab.
6. Partial Least Squares Regression (PLSR)
The orthogonalised PLSR 1 algorithm implemented in ParLeS is that described by Martens & Naes
(1989). In the PLSR Modelling tab you may select the optimal number of factors to model, using the
slide bar or numerical indicator (Figure 8).

Figure 8. The PLSR model tab.

Once the number of factors to model is selected, run the software using the RUN PLSR
MODELLING button. Results from the PLSR modelling are shown in a number of graphs:
- Scores vs. scores plot
- Scores vs. y plot
- Regression coefficients (B) vs. wavelength/wavenumber plot
- Spectral loadings (P) and loading weights (W) vs. wavelength/wavenumber plot
- Variable importance for projection (VIP) vs. wavelength/wavenumber plot
- Sorted VIP and wavelength/wavenumber table
- the percent variation of each of the predictor and response data that is explained by each factor in the
PLSR model

In Figure 8 results are shown for a total of 15 PLSR factors.
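
To relate the quantities listed above (scores, loading weights W, spectral loadings P, regression coefficients b and intercept b0) to one another, the Python sketch below gives a generic orthogonal-scores PLS1 fit with NumPy. It follows the usual textbook form of the algorithm and is not taken from the ParLeS source. The function names and the built-in mean centring are assumptions, and the VIP calculation is omitted.

```python
import numpy as np


def pls1_fit(X, y, n_factors):
    """Orthogonal-scores PLS1 for a single y-variable on mean-centred data."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    x_mean, y_mean = X.mean(axis=0), y.mean()
    E, f = X - x_mean, y - y_mean              # residual matrices, deflated per factor
    n, k = X.shape
    T = np.zeros((n, n_factors))               # scores
    W = np.zeros((k, n_factors))               # loading weights
    P = np.zeros((k, n_factors))               # spectral loadings
    q = np.zeros(n_factors)                    # y-loadings
    for a in range(n_factors):
        w = E.T @ f
        w /= np.linalg.norm(w)                 # unit-length loading weight
        t = E @ w
        tt = t @ t
        p = E.T @ t / tt
        q[a] = f @ t / tt
        E = E - np.outer(t, p)                 # remove this factor from X
        f = f - q[a] * t                       # and from y
        T[:, a], W[:, a], P[:, a] = t, w, p
    b = W @ np.linalg.solve(P.T @ W, q)        # regression coefficients
    b0 = y_mean - x_mean @ b                   # intercept
    return b, b0, T, P, W


def pls1_predict(X, b, b0):
    """Predicted y for new spectra: y_hat = b0 + X b."""
    return b0 + np.asarray(X, dtype=float) @ b
```

Any transformation, preprocessing or pretreatment applied to the calibration spectra has to be applied in the same way to the spectra passed to pls1_predict.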

Note that in ParLeS version 3.1 you can interact with the scores vs. scores; scores vs. y plot;
regression coefficients vs. wavelength/wavenumber plot and the VIP vs. wavelength/wavenumber
plot. Glide your mouse over the data points and click on the point that you want to identify. The point
will change colour and its label will be briefly displayed on the graph.

The PLSR model (scores, regression coefficients (b), the intercept (b0), spectral loadings and loading
weights) as well as the VIP results can be saved to tab delimited text files by checking the SAVE
SCORES; b, b0, p, w; and VIP check box. Three separate dialogues will appear once you check to
save: the first will ask you to give a name for the PLSR scores file; the second will ask you to provide
a name for the regression coefficients and the third for the VIP results.
7. Prediction
To make PLSR predictions, press RUN PREDICTIONS. The predictions use the model
selected in the PLSR Model tab (see 6. above). The program will run and results and
assessment statistics will be displayed (Figure 9).

Figure 9. The PLSR prediction tab

The results from the PLSR predictions are displayed in a number of graphics and assessed using
various statistics:
- a sample of the spectra used for predictions
- the predicted values
- when using a test data set, the residuals (observed - predicted)
- when using a test data set, the observed vs. predicted and the fitted line, also showing its equation
- the following assessment statistics: R2, R2adjusted, RMSE and confidence intervals, mean error (ME),
the standard deviation of the error (SDE) and the RPD
- a histogram of the predicted values and their descriptive statistics

The predictions can be saved to a file using the SAVE PREDICTIONS check-box.
8. Bootstrap aggregation-PLSR (bagging-PLSR)
To make the bagging-PLSR predictions first you need to select the number of bootstraps to use for
bagging (the default is 30 bootstraps) as well as the number of PLSR factors to use. Then press the
RUN BAGGING-PLSR button. The program will run and results and assessment statistics will be
displayed (Figure 10).

Figure 10. The bagging-PLSR tab

The results from bagging-PLSR are displayed in a number of graphics and assessed using various
statistics:
- the observed vs. predicted from the bootstraps
- the out-of-bag statistics, which may also be used to evaluate the models
- a plot of the predicted values and their 95% confidence intervals
- the descriptive statistics of the predictions
- the observed vs. predicted and the fitted line, also showing its equation
- the following assessment statistics: R2, R2adjusted, RMSE and confidence intervals, mean error (ME),
the standard deviation of the error (SDE) and the RPD
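
The general idea behind bagging can be sketched in Python as follows. This is an illustration only: an ordinary least-squares model stands in for PLSR so that the example is self-contained, the 95% intervals are taken as the 2.5th and 97.5th percentiles of the bootstrap predictions, and the out-of-bag prediction for a sample is the average over the bootstraps in which that sample was not drawn. These choices, and all function names, are assumptions and may differ from what ParLeS does.

```python
import numpy as np


def ols_fit(X, y):
    """Stand-in model (NOT PLSR): least-squares fit with an intercept."""
    A = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef


def ols_predict(coef, X):
    return coef[0] + X @ coef[1:]


def bagging_predict(X, y, X_new, fit, predict, n_bootstraps=30, seed=0):
    """Bootstrap-aggregated predictions, percentile intervals and out-of-bag predictions."""
    rng = np.random.default_rng(seed)
    X, y, X_new = (np.asarray(a, dtype=float) for a in (X, y, X_new))
    n = len(y)
    all_preds = np.empty((n_bootstraps, len(X_new)))
    oob_sum, oob_count = np.zeros(n), np.zeros(n)
    for b in range(n_bootstraps):
        idx = rng.integers(0, n, size=n)           # draw n samples with replacement
        model = fit(X[idx], y[idx])
        all_preds[b] = predict(model, X_new)
        oob = np.setdiff1d(np.arange(n), idx)      # samples not drawn this round
        if oob.size:
            oob_sum[oob] += predict(model, X[oob])
            oob_count[oob] += 1
    mean_pred = all_preds.mean(axis=0)             # the bagged prediction
    lower, upper = np.percentile(all_preds, [2.5, 97.5], axis=0)
    oob_pred = np.where(oob_count > 0, oob_sum / np.maximum(oob_count, 1), np.nan)
    return mean_pred, lower, upper, oob_pred
```

To bag a PLSR model instead, pass fit and predict functions that wrap your own PLSR routine with the chosen number of factors.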

The bagging-PLSR predictions and confidence intervals can be saved to a file using the SAVE
BAGGED check-box.

Once finished you can exit ParLeS using the EXIT PROGRAM button.
9. Errors
If the file format is incorrect, the software will not run or will run incorrectly.

10. Conditions of use


Please refer to ParLeS license agreement.

You may not use the software for commercial purposes unless you have obtained permission, in
writing, from Raphael VISCARRA ROSSEL (r.viscarra-rossel@usyd.edu.au or tel. +61 413 326 457).

If ParLeS is used in research, you agree to cite the following reference:

Viscarra Rossel, R.A. 2007. ParLeS: Software for chemometric analysis of spectroscopic data.
Chemometrics and Intelligent Laboratory Systems (in-press) doi: 10.1016/j.chemolab.2007.06.006

For more up to date citation information you may also visit:


http://www.usyd.edu.au/su/agric/acpa/people/rvrossel/Publications.htm

I will appreciate comments/ suggestions for further improvements to ParLeS. In essence ParLeS is
still under development.

11. Disclaimer
I have taken all care to ensure that ParLeS is operationally sound. However, it is supplied 'as is' and no
warranty is provided or implied. I assume no liability for damages, direct or consequential, that may
result from its use.

2007 R. VISCARRA ROSSEL, The University of Sydney