
7th Annual International IEEE EMBS Conference on Neural Engineering

Montpellier, France, 22 - 24 April, 2015

Probe-Independent EEG Assessment of Mental Workload in Pilots


Michael K. Johnson, Justin A. Blanco, Rodolphe J. Gentili, Kyle J. Jaquess, Hyuk Oh, and Bradley D. Hatfield

*Research partially supported by the Lockheed Martin Corporation, USA (project # 13051318) and DARPA Service Academies Challenge Award HR0011411994.
M. K. Johnson and J. A. Blanco are with the United States Naval Academy, Annapolis, MD 21402 USA (phone: 817-721-9303 and 410-293-6184; e-mail: michaeljohnson015@gmail.com and blanco@usna.edu).
R. J. Gentili, K. J. Jaquess, H. Oh, and B. D. Hatfield are with the University of Maryland, College Park, MD 20742 USA (phone: 301-405-2485; e-mail: bhatfiel@umd.edu).
978-1-4673-6389-1/15/$31.00 ©2015 IEEE

Abstract: Existing approaches for quantifying mental workload using electroencephalography often rely on probe stimuli to elicit stereotyped neural responses such as the P300 wave. Here we explore probe-independent algorithms for classifying three levels of task-complexity in a flight simulator experiment. Using input features derived from estimates of the average power in five frequency bands, we test a variety of classifiers, using 10-fold cross-validation to estimate test set error. Classification accuracy was above 50% (chance performance: 33.33%) in 13 of 20 subjects on at least one of the four recorded channels, and reached as high as 87.35%. There was strong variability across subjects in both the strength and direction of the relationships between the input features and task-complexity labels, suggesting that classifiers using these input features must be trained to the individual to be useful.

I. INTRODUCTION

Mental workload can be described as a ratio between task-complexity and a person’s cognitive capacity to meet task demands [1]. This description captures the intuitive idea that mental workload depends both on external factors, such as the objective difficulty of required tasks, and on internal factors, such as a person’s past experiences and skill set. There is a growing body of research focused on developing quantitative methods to assess mental workload in order to improve the mental resiliency of people in high-stress environments. Various metrics derived from physiological signals such as heart rate, blood pressure, galvanic skin response, and eye-gaze have been investigated as biomarkers of mental workload [2-4]. These signals have been used to distinguish mental workload levels with accuracies significantly better than chance, but there are still no widely accepted standards or commercial products for mental workload monitoring.

With recent improvements in the ease-of-use, reliability, and cost of portable electroencephalography (EEG) systems, there has been increasing interest in using brain signals to measure mental workload [5]. It is hypothesized that EEG offers a more direct assay of mental workload than other physiological biomarkers because of the proximity of EEG sensors to the neural substrates of cognitive stress [6].

A common method of using EEG to assess mental workload involves delivering an auditory probe to evoke event-related potentials (ERPs) such as the P300 wave [7-8] while a subject undergoes a cognitive challenge. It has been demonstrated that the amplitude of the P300 varies inversely with task-complexity [9]. Although measuring ERPs using auditory stimuli has been successful in evaluating mental workload in controlled settings, this approach may be less practical when the cognitive tasks themselves involve a strong auditory component, and hence the introduction of extraneous sounds could be disruptive. Therefore, an EEG-based metric of mental workload that exploits information from non-evoked “background” neural activity is desired.

The goal of this research was to develop an EEG-based algorithm to classify different levels of task complexity that does not rely upon ERPs. By choosing subjects with a similar level of task-experience, we partially control for differences in the capacity to perform the experimental task and therefore use task-complexity as a surrogate for mental workload. As we were particularly interested in understanding the response of aircraft pilots to the cognitive demands imposed by their flight-missions, we used flight simulator tasks of varying challenge-level as our experimental paradigm. Furthermore, since pilots are typically in persistent radio or intercom communications via headset during flight, this also represents a scenario that would be particularly well-suited to a non-ERP-based index of cognitive workload. Signal processing methods were used to extract computational features from the EEG, and machine learning techniques were used to classify the data and assess algorithm performance.

II. METHOD

A. Data Acquisition

EEG data were collected from 20 United States Naval Academy (Annapolis, MD, USA) midshipmen between the ages of 19 and 23, all with basic flight training, while they performed visuo-motor tasks in a flight simulator (Prepar3D® v1.4, Lockheed Martin Corporation, Orlando, FL, USA) under three levels of task-complexity. The three tasks were selected from predefined flight training exercises developed with advice from experienced United States Navy pilots and distinguished in challenge-level by differences in weather intensity and mission requirements. Specifically, the three tasks were: 1) Easy: maintain the aircraft’s current altitude (4000 ft), heading (180°), and airspeed (180 kn). The weather was defined by no clouds, precipitation, or wind, and unlimited visibility; 2) Medium: maintain the aircraft’s current heading (180°), airspeed (180 kn), and a “wings-level” attitude, while continuously making altitude changes between 4000 and 3000 ft, with ascent and descent rates of 1000 feet per minute (fpm). The sky was completely overcast (1/16 mi of visibility), but there was no precipitation and no wind; and 3) Hard: maintain the aircraft’s current airspeed (180 kn), while changing heading between 180 and 090° at a 15-degree angle of bank, ascending while turning right and
descending while turning left at 1000 fpm. The sky was completely overcast as in the Medium task, with no precipitation, but with the presence of a moderate (16 kn) easterly wind. One trial per task difficulty was conducted, in random order, consisting of a 1-minute setup period followed by a 10-minute flight segment. Additionally, for use in a separate analysis, audible stimuli were administered to participants via ear-bud speakers with random inter-stimulus intervals between 6 and 30 seconds to evoke the P300 response. Since the goal of this work was to analyze background EEG only, data surrounding these stimuli were excluded using a procedure described below. Fig. 1 illustrates the experimental setup.

Figure 1. An individual performs a flight task in the simulator while wearing an EEG cap. The three experimental tasks required subjects to operate a T-6A Texan II SP2 United States Navy aircraft using the control stick, throttle, and rudder pedals.

Four active, dry (gel-free) electrodes were used to measure EEG signals from sites along the frontal (Fz), fronto-central (FCz), central (Cz), and parietal (Pz) midline, based on the International 10-20 System [10]. The EEG cap was connected to an amplifier with an online band-pass filter from 0.01 to 60 Hz (g.USBamp®, g.tec medical engineering GmbH, Schiedlberg, Austria) through a driver-interface box (g.SAHARAsys®, g.tec medical engineering GmbH, Schiedlberg, Austria). Electrode impedances were measured below 5 kOhm. The right mastoid was used as ground for the system and the left ear as the hardware reference. Data were also collected from the right ear for later re-referencing. EEG signals were sampled at a rate of 512 Hz, re-referenced in EEG analysis software (BrainVision Analyzer 2, Brain Products GmbH, Munich, Germany) to an average-ear montage, and digitally lowpass filtered (in forward and reverse to give zero phase response) with a Butterworth filter with a cutoff frequency of 50 Hz and 48 dB/octave rolloff.

Data were visually inspected for the presence of eye-blink and muscle-activity artifacts, and these segments (47% of all recorded data) were excluded from analysis. In addition, all data within a 600 ms window following the onset of an auditory stimulus were excluded in order to eliminate P300 responses. Fig. 2 shows a typical P300 waveform, generated by averaging 30 post-stimulus intervals in a single trial.

B. Feature Extraction

The remaining data set was segmented into 1-second (512-sample) epochs for analysis, based primarily on a desire to have no more than 1-second lag in an envisioned real-time monitoring system. Linear trends were removed from each segment by subtracting the least squares line of best fit. We estimated the power spectral density (PSD) of each segment using Welch’s method (8 sections with 50% overlap; Hamming window applied to each section) [11]. The average powers in each of the Delta (1-4 Hz), Theta (4-8 Hz), Alpha (8-13 Hz), Beta (13-30 Hz), and Gamma (30-40 Hz) [12] bands were then computed by integrating this PSD estimate over the corresponding frequency range. Finally, feature vectors were formed for each 1-second epoch by concatenating the average power estimates for each of the five frequency bands.

C. Classification

Several different classifiers were then trained to predict task-complexity based on EEG features. Specifically, the classification techniques used were: 1) k-Nearest Neighbors (kNN); 2) Linear Discriminant Analysis (LDA); 3) Quadratic Discriminant Analysis (QDA); 4) Naïve Bayes; 5) Decision Trees (with and without pruning); and 6) Support Vector Machines (SVM). Test set error was estimated using 10-fold cross-validation, with percent correct classification used as the performance metric.
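
As a minimal sketch of the 10-fold cross-validation estimate described above, the snippet below scores a 1-nearest-neighbor classifier using NumPy only. The function names (`one_nn_predict`, `kfold_accuracy`), the random shuffling of epochs before splitting, and the synthetic five-element features are assumptions made for illustration; they are not taken from the study's own code or data.

```python
import numpy as np

def one_nn_predict(train_x, train_y, test_x):
    """Label each test row with the class of its nearest training row (Euclidean)."""
    # Pairwise squared distances, shape (n_test, n_train).
    d2 = ((test_x[:, None, :] - train_x[None, :, :]) ** 2).sum(axis=2)
    return train_y[d2.argmin(axis=1)]

def kfold_accuracy(x, y, n_folds=10, seed=0):
    """Estimate percent correct classification with k-fold cross-validation."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(y))          # shuffle once, then split into folds
    folds = np.array_split(order, n_folds)
    n_correct = 0
    for i, test_idx in enumerate(folds):
        # Train on every fold except the held-out one.
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        pred = one_nn_predict(x[train_idx], y[train_idx], x[test_idx])
        n_correct += int((pred == y[test_idx]).sum())
    return 100.0 * n_correct / len(y)

# Synthetic stand-in for band-power features: 3 task levels, 5 features each.
rng = np.random.default_rng(1)
n_per_class = 100
class_means = np.array([[0.0] * 5, [1.5] * 5, [3.0] * 5])
features = np.vstack([m + rng.normal(size=(n_per_class, 5)) for m in class_means])
labels = np.repeat(np.arange(3), n_per_class)

accuracy = kfold_accuracy(features, labels)
print(f"10-fold CV accuracy: {accuracy:.2f}% (chance: {100 / 3:.2f}%)")
```

Each epoch is held out exactly once, so the reported percentage is computed over all epochs rather than averaged over folds of slightly different sizes.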
Additionally, in an attempt to improve classifier performance by deemphasizing potentially irrelevant features, we performed principal component analysis on the unnormalized five-element feature data, reducing the number of dimensions to three (by taking the first three principal component projections as the new features), accounting for nearly 90% of the data variance on average.
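
The reduction from five features to three can be sketched with a singular value decomposition, as below. This is an illustration under stated assumptions: the helper name `pca_project` and the synthetic feature matrix are hypothetical, and features are mean-centered but not rescaled, which is one reading of "unnormalized" in the text.

```python
import numpy as np

def pca_project(features, n_components=3):
    """Project rows of `features` onto the leading principal components.

    Returns the projections and the fraction of total variance explained
    by each retained component. Columns are centered but not rescaled.
    """
    centered = features - features.mean(axis=0)
    # Rows of vt are the principal directions, ordered by singular value.
    _, s, vt = np.linalg.svd(centered, full_matrices=False)
    explained = (s ** 2) / np.sum(s ** 2)
    scores = centered @ vt[:n_components].T
    return scores, explained[:n_components]

# Hypothetical band-power features: 600 epochs x 5 bands with unequal variances.
rng = np.random.default_rng(0)
band_scale = np.array([5.0, 3.0, 2.0, 1.0, 0.5])
features = rng.normal(size=(600, 5)) * band_scale

scores, explained = pca_project(features, n_components=3)
print(scores.shape)
print(f"variance retained by 3 components: {100 * explained.sum():.1f}%")
```

Because the columns are not rescaled, bands with larger absolute power dominate the leading components, which is worth keeping in mind when comparing this feature set with the raw band powers.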
Figure 2. A typical P300 waveform generated by averaging 30 post-stimulus intervals. The peak of the response is located at approximately 250 ms post-stimulus, and a return to baseline is seen by around 600 ms, supporting the choice of the 600 ms exclusion window.

III. RESULTS AND DISCUSSION

A. Average Power Computations

As a preliminary assessment of the relationship between frequency band power and task-complexity, for each channel within a subject, the powers in each frequency band were averaged over all one-second segments for each task. Fig. 3 shows the average power in each frequency band for channel FCz in Subject 6, Subject 8, Subject 10, and Subject 11.
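
The band-power features of Section II-B and the per-task averaging described here can be sketched as follows, using NumPy only. Several details are assumptions, since the text gives only the section count, overlap, and window: the section length (113 samples, chosen so that eight half-overlapping sections fit in 512 samples), zero-padding each section to 512 points, half-open band edges, and the synthetic 10 Hz test signal with arbitrary amplitudes. The helper names are hypothetical.

```python
import numpy as np

FS = 512  # sampling rate in Hz, from Section II-A
BANDS = {"delta": (1, 4), "theta": (4, 8), "alpha": (8, 13),
         "beta": (13, 30), "gamma": (30, 40)}

def welch_psd(x, fs=FS, n_sections=8):
    """One-sided Welch PSD with Hamming-windowed, 50%-overlapping sections."""
    # Assumed section length: n_sections half-overlapping sections span x.
    seg_len = int(len(x) // (1 + (n_sections - 1) / 2))
    step = seg_len - seg_len // 2
    window = np.hamming(seg_len)
    scale = fs * np.sum(window ** 2)
    starts = range(0, len(x) - seg_len + 1, step)
    # Zero-pad each windowed section to len(x) points (1 Hz bins for 512 samples).
    spectra = [np.abs(np.fft.rfft(window * x[s:s + seg_len], n=len(x))) ** 2 / scale
               for s in starts]
    psd = np.mean(spectra, axis=0)
    psd[1:-1] *= 2  # fold negative frequencies into the one-sided estimate
    return np.fft.rfftfreq(len(x), d=1 / fs), psd

def band_powers(epoch, fs=FS):
    """Remove the least-squares line, then integrate the PSD over each band."""
    t = np.arange(len(epoch))
    slope, intercept = np.polyfit(t, epoch, 1)
    freqs, psd = welch_psd(epoch - (slope * t + intercept), fs)
    df = freqs[1] - freqs[0]
    return np.array([psd[(freqs >= lo) & (freqs < hi)].sum() * df
                     for lo, hi in BANDS.values()])

def mean_and_2sem(features):
    """Column-wise mean and two standard errors of the mean."""
    sem = features.std(axis=0, ddof=1) / np.sqrt(len(features))
    return features.mean(axis=0), 2 * sem

# Synthetic stand-in for one channel: noise plus a 10 Hz rhythm whose
# amplitude differs by task level (arbitrary values, for illustration only).
rng = np.random.default_rng(0)
t = np.arange(FS) / FS
summary = {}
for task, amp in {"easy": 0.5, "medium": 1.5, "hard": 3.0}.items():
    epochs = [amp * np.sin(2 * np.pi * 10 * t + rng.uniform(0, 2 * np.pi))
              + rng.normal(size=FS) for _ in range(50)]
    feats = np.array([band_powers(e) for e in epochs])
    summary[task] = mean_and_2sem(feats)

for task, (mean, err) in summary.items():
    print(task, np.round(mean, 3), "+/-", np.round(err, 3))
```

With sections this short the spectral main lobe is several hertz wide, so power from a narrowband rhythm leaks into neighboring bands; this is a property of the estimator settings, not of the synthetic data.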

Figure 3. Average power in each frequency band for the three different task complexities. Each plot represents channel FCz in a different subject. Error bars are two standard errors of the mean.

These subjects were chosen to display the typical variability in average spectral band power across individuals and tasks. Each subject exhibits separation of magnitudes among the task-complexities in some frequency bands. However, the specific trends seen are not consistent across subjects; nor, generally speaking, are the relative powers for each level of task complexity consistent across frequency bands within a subject. For a given frequency band, some subjects showed an increase in average power as task complexity increased while other subjects showed the opposite trend. This suggests that any relationship between average power and task complexity will likely depend on the individual, and classifiers that are either trained using data from a single subject or using data aggregated across subjects are unlikely to generalize well to new individuals.

B. Classification Performance

The average percent correct classification after 10-fold cross-validation was used to assess classifier performance. Fig. 4 illustrates an example of one particular classifier selected for its ease of visualization: a Linear Discriminant Analysis classifier using the first, second, and third principal components of the average power in each frequency band as input features, for channel FCz in Subject 8. The decision boundaries are shown as black planes, and good class-separation is evident, resulting in 82.91% classification accuracy. Table I displays, for this same subject and channel, the percent correct classification for all classifiers tested, using three sets of features: the average power in all five frequency bands; the average power in the alpha, beta, and gamma frequency bands only; and the first, second, and third principal components of the average power in all five frequency bands. This subject and channel yielded some of the strongest results we observed. Several classifier-feature combinations yielded accuracies better than 80%, suggesting that average powers in the selected frequency bands are plausible features for assessing mental workload in this subject. However, as illustrated in Fig. 5, classification performance differed greatly across subjects and channels. Sixty-five percent of subjects (13 of 20) had at least one channel above 50% accuracy (chance performance is 33.33%) using an SVM (Polynomial Kernel) classifier with Feature Set 2, and 7 subjects performed above 50% for all four channels. The lowest classification accuracy across subjects, channels, and classifier-feature combinations was 34.00%, while the highest was 87.35%.

Figure 4. Example of a Linear Discriminant Analysis classifier (Subject 8, channel FCz). The input features were the first, second, and third principal components of the average power of the five frequency bands considered. The accuracy for this classifier was 82.91%.

Figure 5. Each box represents the distribution of classification accuracies across all subjects for the best performing classifier-feature combination listed in Table I (i.e., Feature Set 2 and Polynomial SVM), per channel.
TABLE I. CLASSIFIER PERFORMANCE
Subject 8 – Channel FCz – Percent Correct Classification (%)

Classifier Name                 Feature Set 1 (a)   Feature Set 2 (b)   Feature Set 3 (c)
1-NN                            66.11               81.32               81.92
k-NN (CV (d))                   64.72               85.47               83.50
LDA                             82.11               82.21               82.91
QDA                             85.38               86.86               85.18
Naïve Bayes                     83.79               86.07               71.34
Decision Tree (Not Pruned)      82.31               81.42               80.73
Decision Tree (Pruned - CV)     83.20               83.50               82.31
SVM (Radial Basis Kernel)       66.40               86.46               85.08
SVM (Linear Kernel)             82.41               80.04               81.72
SVM (Polynomial Kernel – CV)    72.13               87.35               87.35

a. Feature Set 1 has five features: the average power in all five frequency bands.
b. Feature Set 2 has three features: the average power in the alpha, beta, and gamma bands.
c. Feature Set 3 has three features: the first, second, and third principal components of Feature Set 1.
d. CV means a range of values for the specified classifier (e.g., kNN values ranging between 3 and 51) were considered, and the best performing value selected via cross-validation.

IV. CONCLUSION

The purpose of this research was to train a classifier to predict different levels of task-complexity, and by extension, mental workload, using features extracted from background EEG data collected during a flight simulator experiment. Classification accuracies as high as 87.35% were achieved for Subject 8 and channel FCz, but performance was highly variable.

In future work, additional features and classifiers will be tested in an attempt to improve classification performance and decrease the variability across subjects and channels. Furthermore, a reassessment of the task labels intended to correspond to the different levels of mental workload may improve classifier performance. In this work, differences in individuals’ capacity to meet task demands were partially controlled for by selecting subjects with similar flight experience. Still, it is likely that some individuals were highly cognitively loaded during the “easy” task, for example, while others may have performed the “hard” task with minimal cognitive load. Therefore, behavioral performance data logged by the flight simulator during experimental trials may help to characterize mental workload levels with greater reliability than objective task complexity alone. However, in order to use these behavioral data to reclassify trials, a metric to quantify the success of flight-task performance based on these data will first need to be developed.

REFERENCES

[1] B. H. Kantowitz, “Human Factors Psychology 3. Mental Workload,” Advances in Psychology, vol. 47, pp. 81–121, 1987.
[2] T. C. Hankins and G. F. Wilson, “A comparison of heart rate, eye activity, EEG, and subjective measures of pilot mental workload during flight,” Aviation, Space, and Environmental Medicine, vol. 69, no. 4, pp. 360, 1998.
[3] P. Besson, C. Bourdin, L. Bringoux, E. Dousset, C. Maïano, T. Marqueste, D. R. Mestre, S. Gaetan, J. Baudry, and J. Vercher, “Effectiveness of Physiological and Psychological Features to Estimate Helicopter Pilots’ Workload: A Bayesian Network Approach,” IEEE Trans. Intell. Transp. Syst., vol. 14, no. 4, pp. 1872–1881, 2013.
[4] W. Li, F. Chiu, K. Wu, and B. District, “The Evaluation of Pilots Performance and Mental Workload by Eye Movement,” no. September. Taipei City 112, Taiwan, pp. 24–28, 2012.
[5] T. K. Calibo, J. A. Blanco, and S. L. Firebaugh, “Cognitive stress recognition,” in 2013 IEEE International Instrumentation and Measurement Technology Conference (I2MTC), 2013, pp. 1471–1475.
[6] A. Uetake and A. Murata, “Assessment of mental fatigue during VDT task using event-related potential,” Proceedings of the 2000 IEEE International Workshop on Robot and Human Interactive Communication, 2000. Japan.
[7] J. Song, D. Zhuang, G. Song, and X. Wanyan, “Pilot Mental Workload Measurement and Evaluation under Dual Task,” no. 2010. Beijing, China, pp. 809–812, 2011.
[8] I. Käthner, S. C. Wriessnegger, G. R. Müller-Putz, A. Kübler, and S. Halder, “Effects of mental workload and fatigue on the P300, alpha and theta band power during operation of an ERP (P300) brain-computer interface,” Biol. Psychol., vol. 102, pp. 118–29, Oct. 2014.
[9] M. W. Miller, J. C. Rietschel, C. G. McDonald, and B. D. Hatfield, “A novel approach to the physiological measurement of mental workload,” International Journal of Psychophysiology: Official Journal of the International Organization of Psychophysiology, vol. 80, no. 1, pp. 75–8, Apr. 2011.
[10] V. Jurcak, D. Tsuzuki, and I. Dan, “10/20, 10/10, and 10/5 Systems Revisited: Their Validity As Relative Head-Surface-Based Positioning Systems,” Neuroimage, vol. 34, no. 4, pp. 1600–11, Feb. 2007.
[11] P. D. Welch, “A Fixed-point Fast Fourier Transform for the Estimation of Power Spectra,” IEEE Trans. Circuit Theory, vol. 15, pp. 70–73, June 1970.
[12] M. M. Shaker, “EEG waves classifier using wavelet transform and Fourier transform,” World Academy of Science, Engineering and Technology, vol. 3, pp. 723–728, 2007.
