
Comparing Metrics to Evaluate Performance of Regression Methods for Decoding of Neural Signals

Martin Spüler1, Andrea Sarasola-Sanz2, Niels Birbaumer2,3, Wolfgang Rosenstiel1, Ander Ramos-Murguialday2,4

1 MS and WR are with the Computer Science Department, University of Tübingen, Tübingen, Germany
2 ASS, NB and ARM are with the Institute of Medical Psychology and Behavioral Neurobiology, University of Tübingen, Tübingen, Germany
3 NB is also with the Ospedale San Camillo, IRCCS, Venice, Italy
4 ARM is also with TECNALIA, Health Technologies Department, San Sebastian, Spain

Abstract— The use of regression methods for decoding of neural signals has become popular, with its main applications in the field of Brain-Machine Interfaces (BMIs) for control of prosthetic devices or in the area of Brain-Computer Interfaces (BCIs) for cursor control. When new methods for decoding are being developed, or the parameters of existing methods should be optimized to increase performance, a metric is needed that gives an accurate estimate of the prediction error. In this paper, we evaluate different performance metrics regarding their robustness for assessing prediction errors. Using simulated data, we show that different kinds of prediction error (noise, scaling error, bias) have different effects on the different metrics, and we evaluate which methods are best suited to assess the overall prediction error as well as the individual types of error. Based on the obtained results we conclude that the most commonly used metrics, the correlation coefficient (CC) and the normalized root-mean-squared error (NRMSE), are well suited for the evaluation of cross-validated results, but should not be used as the sole criterion for cross-subject or cross-session evaluations.

I. INTRODUCTION

A Brain-Machine Interface (BMI) or Brain-Computer Interface (BCI) is a device that translates neural signals into control signals to drive an external device or a computer. For the decoding of neural signals, machine learning methods are used which can be grouped into two areas: classification methods and regression methods. Classification methods deliver a discrete output (like a yes/no response) and are often used in BCIs for communication purposes. Regression methods, on the other hand, deliver a continuous output (like movement velocity) and are the focus of this paper. While the control of a robotic arm [1] is the most prominent use, other applications include the control of prosthetic devices [2] or a computer cursor [3]. There are also examples from related fields that use regression methods for decoding human movement trajectories from neural signals [4], for predicting electrical stimulation parameters from brain signals [5], or for estimating a user's mental workload [6].

When it comes to developing and improving methods for decoding of neural signals, estimating the performance of the prediction model is crucial to the whole process. For classification methods, there are established performance metrics [7] and studies that compare those metrics for use in Brain-Computer Interfaces [8].

When using regression methods for decoding neural signals, different performance metrics are in use, with the correlation coefficient (CC) and the root mean squared error (RMSE), or its normalized version (NRMSE), being the most frequently used ones. While it is good scientific practice to state multiple performance metrics in a publication, one has to decide on a single metric when it comes to automatic parameter optimization (e.g. in a grid-search). Since those metrics capture different properties of the prediction performance, it is unclear which one is best suited overall.

A. Desired properties of a good performance metric

When trying to find a performance metric that is best suited overall, we first have to define which properties are desirable. Therefore, we have to look at the most common factors that lead to bad prediction performance. The most important factor is noise: there is noise in the recorded neural signals, and other possible reasons like ambiguous data also lead to a noisy prediction. When the prediction is evaluated using cross-validation, noise is arguably the biggest source of prediction error. However, when comparing a prediction model across sessions or across subjects, the so-called non-stationarity of the data becomes a big issue. Non-stationarity describes the fact that the probability distribution of the data changes over time (e.g., fatigue during the experiment changes the neural signals) and differs between subjects. In a cross-validation, training and testing data are both chosen from the whole dataset, which means that they have roughly the same probability distribution. When using a cross-session or cross-subject evaluation, the probability distributions of the training and testing data are different. Although there are methods to alleviate the problem of non-stationarity [9], it can be a large issue in cross-subject and cross-session evaluations and can lead to a prediction bias and scaling errors.

Therefore, if we want to evaluate a prediction model across subjects or sessions, the performance metric should not only be able to capture noise, but should also work reliably when the results are affected by a prediction bias or a scaling error. Further, the metric should be invariant to the total scaling of the data (but not to the scaling error), which makes it easier to compare results between different datasets. To be able to compare different regression results (e.g. to decide on the optimal parameters or the best regression model), the performance metric should relate to the prediction error in a monotonic fashion, with an ideally linear behavior.
While there are also application-centered metrics (like
time-to-reach target), we ignore those metrics in the course of
the paper, since they do not allow a comparison of methods
across different applications.
In this work, we tested how well different performance metrics capture the different error properties (noise, bias, and scaling) and which metric delivers the most robust results overall.
II. METHODS
In the following, we describe what performance metrics
we used, how we generated simulated data, and how we
evaluated the robustness of the metrics.
A. Performance metrics
Using the example of predicting a movement trajectory from brain signals, $y = (y_1, ..., y_n)$ denotes the actual movement trajectory for $n$ time points, and $\hat{y} = (\hat{y}_1, ..., \hat{y}_n)$ is the corresponding predicted trajectory. Based on this example, we tested the following performance metrics (a short code sketch of these definitions is given after the list):

• Correlation coefficient (CC): Pearson's correlation coefficient computed by

$$CC(y, \hat{y}) = \frac{\sum_{i=1}^{n} (y_i - m)(\hat{y}_i - \hat{m})}{\sqrt{\sum_{i=1}^{n} (y_i - m)^2 \cdot \sum_{i=1}^{n} (\hat{y}_i - \hat{m})^2}} \qquad (1)$$

with $m$ and $\hat{m}$ being the mean of $y$ and $\hat{y}$, respectively. In some publications the squared correlation coefficient is used, but due to the monotonic relationship between CC and its squared version, there is no difference in robustness between the two.

• Normalized root mean squared error (NRMSE):

$$NRMSE(y, \hat{y}) = \frac{\sqrt{\frac{\sum_{t=1}^{n} (\hat{y}_t - y_t)^2}{n}}}{y_{max} - y_{min}} \qquad (2)$$

As the RMSE depends on the total scaling of the dataset, the normalized version (NRMSE) should be used to allow a comparison of results between datasets.

• Signal-to-noise ratio (SNR): as there are different definitions of the SNR, we used the following one:

$$SNR(y, \hat{y}) = \frac{var(y - \hat{y})}{var(\hat{y})} \qquad (3)$$

• Coefficient of determination (COD): there also exist different definitions of the coefficient of determination, with one of them being the squared correlation. In this paper we used the COD defined by

$$COD(y, \hat{y}) = \frac{\sum_{t=1}^{n} (\hat{y}_t - y_t)^2}{\sum_{t=1}^{n} (\hat{y}_t - mean(\hat{y}))^2} \qquad (4)$$

• Global deviation (GD): defined as the squared average difference

$$GD(y, \hat{y}) = \left( \frac{\sum_{t=1}^{n} (\hat{y}_t - y_t)}{n} \right)^2 \qquad (5)$$
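The following is a minimal sketch of how these five metrics can be computed (Python with NumPy; the function names are ours and not from the paper, and the definitions follow Eqs. (1)-(5) as printed above):

import numpy as np

def cc(y, y_hat):
    # Pearson's correlation coefficient, Eq. (1)
    ym, yhm = y.mean(), y_hat.mean()
    num = np.sum((y - ym) * (y_hat - yhm))
    return num / np.sqrt(np.sum((y - ym) ** 2) * np.sum((y_hat - yhm) ** 2))

def nrmse(y, y_hat):
    # Root mean squared error normalized by the range of y, Eq. (2)
    return np.sqrt(np.mean((y_hat - y) ** 2)) / (y.max() - y.min())

def snr(y, y_hat):
    # Ratio of residual variance to prediction variance, Eq. (3)
    return np.var(y - y_hat) / np.var(y_hat)

def cod(y, y_hat):
    # Coefficient of determination as defined in Eq. (4); here it is an error ratio,
    # so smaller values correspond to better predictions
    return np.sum((y_hat - y) ** 2) / np.sum((y_hat - y_hat.mean()) ** 2)

def gd(y, y_hat):
    # Global deviation, the squared average difference, Eq. (5)
    return (np.sum(y_hat - y) / len(y)) ** 2

The combined metrics used later (CC-NRMSE, CC/NRMSE, CC+SNR) appear to be simple arithmetic combinations of these values; since the paper does not spell out their exact form, they are not reproduced here.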
Fig. 1. Each of the three subplots shows an example of a simulated trajectory prediction, with the actual trajectory (red) and the predicted trajectory (blue). The first run (top) shows a larger amount of noise with no scaling error and no bias. The second run (middle) shows a run with prediction bias. The third run (bottom) shows a run with scaling error.

As will be shown later, the performance metrics capture different properties of the prediction error, which is why we also tested different combinations of the metrics to find one metric (or combination of metrics) that is best suited. Due to the limited space, we will only present results for those combinations of metrics that yielded meaningful results.

B. Simulation and evaluation procedure

To test the different performance metrics, we used the example of predicting a movement trajectory and generated an artificial one-dimensional trajectory $y$ consisting of a repeated sinusoidal movement with a length of $10^5$ samples. The maximum amplitude of $y$ was set such that the variance of $y$ is 1. Based on $y$ we generated $\hat{y}$, the predicted movement trajectory. To vary the effects of noise, bias and scaling errors, we introduced the factor $e_n$ to specify the amount of noise in the prediction, $e_b$ to vary the prediction bias, and $e_s$ to control the scaling error. $N_{\sigma(0,1)}$ denotes a noise vector having the same length as the trajectory; as we assume that noise in neural recordings is Gaussian, each value is drawn from a normal distribution with mean 0 and variance 1. The predicted trajectory is then generated by

$$\hat{y} = e_s \cdot (e_n \cdot N_{\sigma(0,1)} + e_b + y) \qquad (6)$$
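A minimal sketch of this data generation (Python with NumPy; the helper names and the number of sinusoidal cycles are our own assumptions, while the trajectory length, the unit variance and Eq. (6) follow the text):

import numpy as np

def make_trajectory(n_samples=100_000, cycles=50):
    # Repeated sinusoidal movement, rescaled to unit variance
    # (the number of cycles is an assumption; the paper only fixes length and variance).
    t = np.linspace(0.0, 2.0 * np.pi * cycles, n_samples)
    y = np.sin(t)
    return y / y.std()

def make_prediction(y, e_n, e_b, e_s, rng=None):
    # Simulated prediction according to Eq. (6): y_hat = e_s * (e_n * N(0,1) + e_b + y)
    rng = np.random.default_rng() if rng is None else rng
    noise = rng.standard_normal(len(y))
    return e_s * (e_n * noise + e_b + y)

For example, make_prediction(make_trajectory(), e_n=2.0, e_b=0.2, e_s=0.8) yields a noisy, biased and down-scaled prediction comparable to the runs shown in figure 1.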

For the evaluation of the different metrics, we performed multiple runs. In each run, the error factors were chosen randomly from a predefined interval. The interval $e_n \in [1, 4]$ was chosen since these values resulted in a CC between 0.2 and 0.8, which are approximately the highest and lowest values published for trajectory prediction based on various brain signals. The interval $e_b \in [0, 0.5]$ was chosen since the minimum value of 0 is expected for cross-validated results, while the maximum value stems from personal experience with cross-subject or cross-session validated data. For $e_s \in [0.5, 1.5]$, the interval was chosen since scaling errors of up to 50 % were observed in EEG-based cross-subject workload prediction [6].

Exemplary trajectories, as well as the results of a simulated trajectory prediction with different amounts of noise, bias and scaling error, are shown in figure 1.

To evaluate how well the different metrics are able to capture the three error factors individually, we performed 1000 runs for each of the error factors, in which only one error factor was randomly chosen from the predefined intervals, while the other factors were set to a default value ($e_n = 0.1$, $e_b = 0$, $e_s = 1$), so that the respective type of error has no (or only an insignificant) effect. Since a robust metric should have a monotonic relation to the amount of error, and this relation should ideally be linear, we calculated Pearson's correlation coefficient to assess how well a metric is able to reflect the individual types of error.

To assess how well the metrics are able to capture the individual types of error when all three types occur simultaneously (e.g. in a cross-subject or cross-session prediction), we also performed 1000 runs with all error factors being randomly chosen from their predefined intervals, and again used Pearson's correlation coefficient to assess how robust the metrics are in assessing the amount of each individual error as well as the overall error.
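A compact sketch of this evaluation loop, reusing make_trajectory and make_prediction from the sketch above (how the amount of scaling error is quantified and the use of the absolute correlation are our assumptions, since the paper does not state them explicitly):

import numpy as np

def evaluate_metric(metric, vary, n_runs=1000, rng=None):
    # Correlate a metric's output with the true amount of one error factor across runs.
    # Intervals follow the text: e_n in [1, 4], e_b in [0, 0.5], e_s in [0.5, 1.5];
    # the defaults (e_n = 0.1, e_b = 0, e_s = 1) keep the other factors negligible.
    rng = np.random.default_rng() if rng is None else rng
    y = make_trajectory()
    errors, scores = [], []
    for _ in range(n_runs):
        e_n, e_b, e_s = 0.1, 0.0, 1.0
        if vary == "noise":
            e_n = rng.uniform(1.0, 4.0)
            errors.append(e_n)
        elif vary == "bias":
            e_b = rng.uniform(0.0, 0.5)
            errors.append(e_b)
        elif vary == "scaling":
            e_s = rng.uniform(0.5, 1.5)
            errors.append(abs(e_s - 1.0))  # assumed measure of the scaling error
        scores.append(metric(y, make_prediction(y, e_n, e_b, e_s, rng)))
    return abs(np.corrcoef(errors, scores)[0, 1])

Calling evaluate_metric(cc, "noise") then gives the kind of value reported in the first column of table I; the simulation with all three factors varied simultaneously works analogously, with all three factors drawn at random in every run.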
III. RESULTS

The results for the simulation in which each error factor was varied individually are shown in table I. As expected by definition, CC is invariant to bias and scaling errors, but captures noise well. NRMSE, as well as the combination CC-NRMSE, have a Pearson correlation near 1, meaning that both capture all kinds of errors very well if only one error factor is present in the data. Worth mentioning is the COD, which captures noise and bias errors well, although it does not work as well (r = 0.65) for scaling errors.

TABLE I
PEARSON'S CORRELATION BETWEEN THE PERFORMANCE METRICS AND THE AMOUNT OF ERROR, WHEN THE PREDICTION ERROR IS ONLY INFLUENCED BY ONE FACTOR (EITHER NOISE, BIAS, OR SCALING). RESULTS WITH VALUES NEAR ONE (≥ 0.95) ARE HIGHLIGHTED IN BOLD.

            Noise   Bias   Scaling
CC          0.98    0.03   0.07
NRMSE       1.00    1.00   1.00
SNR         0.96    0.03   0.65
COD         0.96    0.97   0.65
GD          0.26    0.97   0.04
CC-NRMSE    1.00    0.99   1.00
CC/NRMSE    0.91    0.69   0.52
CC+SNR      0.48    0.03   0.65

Since it is unrealistic that only a bias or only a scaling error would occur during neural signal decoding, and all error factors will rather be present to a varying degree, we also performed simulations with all factors being varied simultaneously. The results of this simulation can be found in table II. It can be seen that CC still captures noise robustly, remaining invariant to bias or scaling errors. However, the results for NRMSE and CC-NRMSE change drastically when all error factors are present simultaneously in the data, so that both methods can still be used as an indication of the amount of noise in the prediction, but fail to assess a bias or scaling error.

TABLE II
CORRELATION BETWEEN THE PERFORMANCE METRICS AND THE AMOUNT OF ERROR, SEPARATED BY EACH FACTOR, AS WELL AS THE AVERAGE OVER ALL FACTORS. FOR THESE RESULTS THE PREDICTION ERROR WAS AFFECTED BY ALL FACTORS SIMULTANEOUSLY (NOISE, BIAS, AND SCALING). BEST RESULTS ARE MARKED IN BOLD.

            Noise   Bias   Scaling   Mean
CC          0.98    0.02   0.06      0.35
NRMSE       0.74    0.02   0.09      0.28
SNR         0.85    0.01   0.32      0.39
COD         0.84    0.07   0.33      0.41
GD          0.00    0.76   0.07      0.28
CC-NRMSE    0.87    0.02   0.09      0.33
CC/NRMSE    0.83    0.03   0.00      0.29
CC+SNR      0.09    0.02   0.55      0.22

Averaged over all three types of error, COD performs best in terms of estimating the overall prediction error, but it still fails to capture bias and has problems capturing scaling errors. Due to its invariance to bias and scaling, CC is best suited if only the amount of noise in a prediction is to be estimated. GD performs best at capturing a prediction bias, while CC+SNR is the best method to estimate a scaling error.

To illustrate how the metrics relate to the amount of noise, Figure 2 shows the results of the simulation run with all three error factors being varied simultaneously. Due to space constraints, only the scatter plots for the amount of noise are shown, and not those for the amount of bias and scaling error.

Fig. 2. Result of the simulation in which the prediction is affected by all error types. Each scatter plot shows, for a different metric, how well the metric captures the amount of prediction noise when a random amount of prediction bias and scaling error is simultaneously present. Each circle represents one run with a random amount of noise, bias and scaling error.

IV. DISCUSSION

In this study, we used the example of trajectory prediction to investigate the reliability of several performance metrics for assessing the prediction error. To this end, we performed simulations in which a trajectory is predicted, with the prediction being affected by three different kinds of error to a varying degree.

When looking at predictions using real neural signals, we do not know how noisy the prediction is and how much it is affected by bias or scaling errors; this is why we need performance metrics to estimate the amount of prediction error. The simulation, on the other hand, allows us to generate predictions for which the amount of noise, bias and scaling error is exactly known, so that the performance metrics can be evaluated in order to know how well they assess these errors. By using simulations we could show which performance metrics are sensitive to which kind of error. To understand which performance metric is best suited to assess the overall prediction, we have to consider two different scenarios:

For cross-validated results, in which the data is similarly distributed in the training and test sets, bias and scaling errors are not to be expected and a noisy prediction will be the main cause of error. Based on the results, the most popular metrics CC and NRMSE give a reliable estimate of the amount of noise in the prediction and can therefore both be used as reliable performance metrics. The same holds for SNR, COD and CC-NRMSE.

However, if the results of a regression shall be evaluated in a cross-subject or a cross-session manner, the data distribution may differ between training and test sets due to the data being obtained from different subjects or sessions, which can lead to a prediction bias or a scaling error. When all three factors (noise, bias, scaling) affect the prediction error simultaneously, the results are different. Due to CC being invariant to bias and scaling, it is the best method to assess noise effects in such a scenario, but because of that invariance it completely fails to capture possible errors arising from a prediction bias or scaling error. To assess the bias in the prediction, GD is the method of choice, since it is the only method allowing a reasonable estimate of the prediction bias. When it comes to assessing the scaling error, CC+SNR is the method that works best. While these three methods (CC, GD, CC+SNR) can each assess one type of error well, there is no method that performs well on all types of error. On average, the coefficient of determination (COD) is the method with the overall best results, but although better than most methods, it does not work satisfactorily to capture bias or scaling errors.

A. Conclusion

In conclusion, it seems that the most popular metrics, CC and NRMSE, can reliably be used for the evaluation of cross-validated results, but are not recommended as the sole criteria in a cross-subject or cross-session evaluation. As CC does not take prediction bias and scaling errors into account, using only this metric could lead to wrong decisions when comparing prediction performance in those scenarios.

For cross-subject or cross-session evaluations, we recommend looking at multiple metrics (CC, GD, CC+SNR) to get a reliable assessment of the prediction performance. If only one performance metric can be used (e.g. for parameter optimization in a grid-search), none of the tested methods works satisfactorily for all types of error, but the coefficient of determination (COD) is the best choice, because it delivers the best results on average.
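As an illustration of this last use case, the following is a hedged sketch of a grid-search that uses the COD of Eq. (4) as its sole selection criterion (Python with NumPy and scikit-learn; the ridge regression model, the candidate values and the data variables X_train, y_train, X_val, y_val are hypothetical and not from the paper; since Eq. (4) defines the COD as an error ratio, the sketch minimizes it):

import numpy as np
from sklearn.linear_model import Ridge

def grid_search_alpha(X_train, y_train, X_val, y_val, alphas=(0.01, 0.1, 1.0, 10.0)):
    # Pick the regularization strength whose predictions minimize the COD of Eq. (4).
    best_alpha, best_cod = None, np.inf
    for alpha in alphas:
        model = Ridge(alpha=alpha).fit(X_train, y_train)
        y_hat = model.predict(X_val)
        score = np.sum((y_hat - y_val) ** 2) / np.sum((y_hat - y_hat.mean()) ** 2)
        if score < best_cod:
            best_alpha, best_cod = alpha, score
    return best_alpha, best_cod

If a higher-is-better definition of the coefficient of determination is used instead, the comparison simply has to be flipped.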
ACKNOWLEDGMENT

This study was funded by the Baden-Württemberg Stiftung (GRUENS), the Volkswagen Stiftung, the WissenschaftsCampus Tübingen, the Deutsche Forschungsgemeinschaft (DFG, Grant RO 1030/15-1, KOMEG), the Indian-European collaborative research and technological development projects (INDIGO-DTB2-051) and the Natural Science Foundation of China (NSFC 31450110072). Andrea Sarasola-Sanz is supported by the La Caixa-DAAD scholarship.

REFERENCES

[1] L. R. Hochberg, D. Bacher, B. Jarosiewicz, N. Y. Masse, J. D. Simeral, J. Vogel, S. Haddadin, J. Liu, S. S. Cash, P. van der Smagt et al., "Reach and grasp by people with tetraplegia using a neurally controlled robotic arm," Nature, vol. 485, no. 7398, pp. 372-375, 2012.
[2] K. Ganguly and J. M. Carmena, "Emergence of a stable cortical map for neuroprosthetic control," PLoS Biology, vol. 7, no. 7, p. e1000153, 2009.
[3] S.-P. Kim, J. D. Simeral, L. R. Hochberg, J. P. Donoghue, and M. J. Black, "Neural control of computer cursor velocity by decoding motor cortical spiking activity in humans with tetraplegia," Journal of Neural Engineering, vol. 5, no. 4, p. 455, 2008.
[4] M. Spüler, W. Rosenstiel, and M. Bogdan, "Predicting wrist movement trajectory from ipsilesional ECoG in chronic stroke patients," in Proceedings of the 2nd International Congress on Neurotechnology, Electronics and Informatics (NEUROTECHNIX 2014), Oct. 2014, pp. 38-45.
[5] A. Walter, G. Naros, M. Spüler, A. Gharabaghi, W. Rosenstiel, and M. Bogdan, "Decoding stimulation intensity from evoked ECoG activity," Neurocomputing, vol. 141, pp. 46-53, 2014.
[6] C. Walter, P. Wolter, W. Rosenstiel, M. Bogdan, and M. Spüler, "Towards cross-subject workload prediction," in Proceedings of the 6th International Brain-Computer Interface Conference, Graz, Austria, Sep. 2014.
[7] M. Sokolova and G. Lapalme, "A systematic analysis of performance measures for classification tasks," Information Processing & Management, vol. 45, no. 4, pp. 427-437, 2009.
[8] M. Billinger, I. Daly, V. Kaiser, J. Jin, B. Z. Allison, G. R. Müller-Putz, and C. Brunner, "Is it significant? Guidelines for reporting BCI performance," in Towards Practical Brain-Computer Interfaces. Springer, 2013, pp. 333-354.
[9] M. Spüler, W. Rosenstiel, and M. Bogdan, "Principal component based covariate shift adaption to reduce non-stationarity in a MEG-based brain-computer interface," EURASIP Journal on Advances in Signal Processing, vol. 2012, no. 1, pp. 1-7, 2012.