
UAIS (2001) 1: 4–15 / Digital Object Identifier (DOI) 10.1007/s102090100001

Long papers

Productivity, satisfaction, and interaction strategies of individuals with spinal cord injuries and traditional users interacting with speech recognition software
Andrew Sears1, Clare-Marie Karat2, Kwesi Oseitutu1, Azfar Karimullah1, Jinjuan Feng1

1 Laboratory for Interactive Systems Design, Information Systems Department, UMBC, 1000 Hilltop Circle, Baltimore, MD 21250, USA; E-mail: asears@umbc.edu
2 IBM TJ Watson Research Center, 30 Sawmill River Road, Hawthorne, NY 10532, USA; E-mail: ckarat@us.ibm.com

Published online: 18 May 2001 – © Springer-Verlag 2001

Abstract. Speech recognition is an important technology that is becoming increasingly effective for dictation-oriented activities. While recognition accuracy has increased dramatically in recent years, recent studies confirm that traditional computer users are still faster using a keyboard and mouse and spend more time correcting errors than dictating. Further, as these users become more experienced they frequently adopt multimodal strategies that require the keyboard and mouse when correcting errors. While speech recognition can be a convenient alternative for traditional computer users, it can be a powerful tool for individuals with physical disabilities that limit their ability to use a keyboard and mouse. However, research into the performance, satisfaction, and usage patterns of individuals with physical disabilities has not been reported. In this article, we report on a study that provides initial insights into the efficacy of existing speech recognition systems with respect to individuals with physical disabilities. Our results confirm that productivity does not differ between traditional users and those with physical disabilities. In contrast, numerous differences were observed when users rated their satisfaction with the system and when usage patterns were analyzed.

Key words: Speech recognition – dictation – spinal cord injuries – usability evaluation – universal access

1 Introduction

Speech recognition (SR) is an increasingly important technology. Speech-activated environmental control systems allow users to manipulate anything from lights to a television, a telephone, or the thermostat that controls the temperature of a room. Speech-recognition-based dictation systems allow users to generate email, memos, papers, medical reports, insurance claims, police reports, or meeting minutes by speaking to the computer.

For individuals without any significant disabilities, SR may be a convenient alternative to traditional input devices, especially in hands-busy/eyes-busy environments such as taking notes while using a microscope. In contrast, effective SR systems can result in fundamental changes in the lives of some individuals with disabilities that limit their ability to physically manipulate their environment. For some individuals with physical disabilities but no speech impairments, speech-activated environmental control systems can increase independence while SR-based dictation systems provide an additional method of communicating with others. As a result, SR can be an important technology for individuals with a variety of physical disabilities including spinal cord injuries, severe repetitive stress injuries, upper extremity amputations, arthrogryposis, arthritis, and muscular dystrophy.

In this article we report on an experiment designed to provide insights into the efficacy of SR-based dictation systems when used by individuals with physical disabilities. More specifically, we focus on these issues in the context of individuals with high-level spinal cord injuries (a more precise definition of the level and completeness of the spinal cord injuries is provided below). The primary goal of this study is to assess the productivity and satisfaction of individuals with spinal cord injuries as well as traditional computer users when using a state-of-the-art SR-based dictation system. A secondary goal is to determine whether or not these two groups of users employ the same process as they complete various tasks using this technology. By investigating current levels of productivity and satisfaction we provide a baseline for evaluat-
A. Sears et al.: Interacting with speech recognition software: productivity, satisfaction, and interaction strategies 5

ing alternative designs and insights into the expectations these users have regarding this technology. By understanding the processes utilized by these two user groups, we can provide guidance to researchers seeking to improve SR-based interactions by determining which strategies and commands are most often utilized. This will be particularly important if the two groups of users employ different strategies when interacting with the software.

2 Related research

Manaris and Harkreader explored the use of SR as an alternative mechanism for generating keystrokes and mouse events with the goal of developing an alternative data entry technique for individuals with upper-body motor-control impairments [6]. While their approach can allow text to be generated, SR was being used for command and control rather than dictation. As a result, their study provides few insights into the effectiveness of SR for dictation-oriented activities.

Many researchers have explored the use of SR by traditional, able-bodied computer users. For example, Ainsworth [1] examined the optimal string length for digit input and more recently Ainsworth and Pratt [2] explored several error-correction strategies for use in a phone dialing scenario. Both of these studies used isolated-word SR systems, short input strings (i.e., 1–14 digits), and small vocabularies (i.e., 11 or 14 words). Noyes and Frankish [7] explored the efficacy of auditory or visual feedback after every word or sequence of six words while Baber and Hone [3] found that the optimal error correction strategy depends on the accuracy of the system and the length of the input strings.

Each of these studies provides insights into the design of effective SR systems, but the isolated-word recognition algorithms utilized, small quantities of text generated, feedback strategies explored, and limited vocabularies make it difficult to generalize these results to the design of large-vocabulary SR systems that accept continuous speech as non-trivial quantities of text are generated. While this is not an exhaustive discussion of the existing research, it does summarize the topics explored and the limitations of much of the existing research.

Two recent articles discuss results obtained when traditional computer users dictated larger quantities of text using a large-vocabulary SR system. Karat, Halverson, Karat, and Horn compared SR software to the traditional keyboard and mouse for text entry when used by traditional computer users with no physical disabilities [5]. Their results confirmed that for this population, the keyboard and mouse resulted in faster data entry. Using the keyboard and mouse, these participants reached 32.5 words per minute (WPM) for transcription tasks and 19.0 WPM for composition tasks. When using the SR software, the same participants reached only 13.6 WPM for transcription tasks and 7.8 WPM for composition tasks during their initial use of the system, but performance for transcription tasks improved to 25.1 WPM after 20 sessions with the SR software. These researchers also investigated the number of correction episodes that occurred during each task, the number of steps involved in each correction, and how these patterns changed with experience. To summarize their results, the number of correction episodes decreased with experience (from 11.3 to 8.8), but perhaps more importantly the number of steps per episode also decreased (from 7.3 to 3.5). The reduction in the number of steps suggests that these users discovered more efficient strategies as they continued to use the system.

Halverson, Horn, Karat, and Karat discussed the strategies that users adopt when correcting errors [4]. Their results indicate that individuals learning to use SR systems often fixate on a single strategy. The most frequent strategy was to select the incorrect word and then to redictate the desired word. Unfortunately, redictation often resulted in additional errors that required correction. Since users frequently resisted changing strategies, one error often resulted in numerous problems. Experienced users learned to try a different strategy when they experienced problems, reducing the number of cascading errors. Interestingly, even experienced users spent about 3 min correcting errors for every minute they spent dictating the original text.

These last two articles confirm that traditional computer users can still produce text more quickly using a keyboard and mouse than they can using commercial SR software. Interestingly, correcting errors took much longer than the creation of the original text. At present, there is no comparable information regarding the usability of SR software by individuals with physical disabilities. We do not know how productive or satisfied these individuals can be when using this technology. Further, we do not know whether these two user groups employ the same or different strategies when interacting with this technology. As a result, we do not know if these two groups of users experience the same or different problems.

3 Research objectives

The purpose of the current study is to provide the first empirical evaluation of current SR technology when used by individuals with physical disabilities to generate non-trivial quantities of text. Currently, the focus is on individuals with high-level spinal cord injuries. By investigating the efficacy of existing SR technology as well as any differences that may exist between traditional computer users and those with high-level spinal cord injuries, we will create a foundation for future research and develop-

ment aimed at enhancing the usability of SR for individuals with physical disabilities. With these goals in mind, we define the following objectives for the current study:

1) To identify any differences in performance or satisfaction of traditional computer users and computer users with high-level spinal cord injuries. Performance will be assessed using data entry rates (words per minute), accuracy (for transcription tasks), and document quality (for composition tasks). Preferences will be assessed using a series of post-task questionnaires as well as a final questionnaire that will be completed at the end of the study.

2) To gain insight into the processes these two user groups employ as they generate text using SR-based dictation software. This will be investigated by examining the proportion of time allocated to various activities, including dictation, navigation, and other speech-based commands. It will also be assessed by examining the number of dictation episodes; this provides insights into the willingness of the users to interrupt their dictation to correct errors.

4 Method

4.1 Subjects

Fourteen individuals were recruited to participate in this study. Seven participants had no documented physical impairments that would hinder their ability to use the keyboard or mouse. The remaining seven participants had spinal cord injuries at or above C6 with American Spinal Cord Injury Association (ASIA) scores of A or B. Spinal cord injuries can be described by the level and completeness of the injury. Injuries at or above C6 (level) can result in partial to complete loss of motor function in the hands and arms. ASIA scores of A or B (completeness) indicate that there is no residual motor function below the level of the injury. An individual with an injury at or above C6 and an ASIA score of A or B is considered a quadriplegic. More specifically, this type of injury results in the individual's hands being paralyzed and either no or limited upper extremity mobility. As a result, some of these participants do have limited use of their arms, which allows them to use an assistive device to enter text using a keyboard. Typically, this would involve a brace and a typing stick. Throughout the remainder of this article, these two groups of participants are referred to as traditional and SCI users.

Participation was restricted to individuals who did not have any uncorrected visual impairments or documented cognitive, hearing, or speech impairments. All participants had prior experience using a commercial speech recognition product for dictation-oriented activities. Prior experience was defined as having gone through the enrollment process and subsequently worked with the system to complete some of their normal computer-based tasks. Finally, English had to be a native language for all participants. Participants were recruited from the Maryland area. Participants were offered a payment of U.S. $30 as compensation for their time. To address the additional time and effort required by the SCI participants, they were offered an additional U.S. $20 in compensation. Demographic information was gathered at the conclusion of the study. Thirteen participants were male, the participants' average age was 34.4 (st.dev.: 11.7), and participants averaged 15.6 years of computing experience (st.dev.: 6.8). Also at the conclusion of the study, participants rated the importance of speech recognition using a scale of 1 to 3. A rating of 1 indicated that SR was the most important method the participant used to interact with computers. A rating of 2 indicated that SR was one of several alternatives the participant used when interacting with computers. A rating of 3 indicated that the participant did not feel that SR was useful.

4.2 Apparatus

Participants utilized a Gateway Solo Pro 9300 laptop computer with a 600 MHz Pentium III processor, 128 MB of memory, and a VXI Parrot 10–3 microphone. Participants interacted with a custom speech recognition application, TkTalk Version 1.0, that uses IBM's ViaVoice Millennium Edition speech recognition engine. TkTalk allows users to dictate text and to edit that text using a full range of speech-activated text editing capabilities. By developing new SR software we were able to provide the core editing commands that were expected by individuals with previous experience using virtually any commercial SR product, including both IBM's ViaVoice and Dragon's Naturally Speaking software. Further, by providing the fundamental commands, this software effectively emulated whatever SR software our users had previously used. We designed TkTalk with the flexibility to accept a variety of commands to accomplish the same action. During the training session, participants were introduced to all of the commands available in TkTalk, including a feature that could be used at any time to display a list of the commands currently available. As a result, participants knew the appropriate command for each function they wanted to invoke during the study. Through this combination of actions, we believe our results should generalize across SR products. A complete list of the commands available in TkTalk 1.0 is given in Appendix A.

The computer was placed on a desk. Participants from the traditional user group were provided with a chair while individuals in the SCI group remained in their wheelchairs. The room contained multiple cameras, a scan converter, and microphones that enabled audio/video recording of all activity by the participants during the usability sessions. In addition, the TkTalk software was instrumented to provide a detailed record of all speech-based activity.
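The command flexibility described above, where several spoken phrasings map onto one editing action, can be sketched as a small pattern-matching dispatcher. The sketch is hypothetical: the action names and phrase patterns below are our own assumptions, not TkTalk's actual command grammar (the real command list is given in Appendix A).

```python
# Hypothetical sketch of how an SR front end like TkTalk might map several
# spoken phrasings onto one editing action. The phrase patterns and action
# names are illustrative assumptions, not TkTalk's actual command set.
import re

# Each action is reachable through several spoken variants.
COMMAND_PATTERNS = {
    "select_list_item": [r"^pick (\d+)$"],
    "move_top": [r"^move to (?:the )?top of (?:the )?document$",
                 r"^go to (?:the )?top of (?:the )?document$"],
    "move_right_word": [r"^move right one word$",
                        r"^right one word$"],
}

def classify_utterance(utterance: str):
    """Return (action, args) if the utterance matches an available command,
    otherwise ("dictation", None), mirroring the rule that only utterances
    mapping directly to an available command are treated as commands."""
    text = utterance.strip().lower()
    for action, patterns in COMMAND_PATTERNS.items():
        for pattern in patterns:
            match = re.match(pattern, text)
            if match:
                return action, match.groups()
    return "dictation", None
```

Because every pattern list maps to a single action, adding another synonym is a one-line change, which is one way a system can accept a variety of commands that accomplish the same action.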

4.3 Tasks

Each participant completed four tasks that were presented on separate pieces of paper. Two transcription tasks required users to enter a predefined paragraph of text (67 and 71 words respectively). Two composition tasks required users to respond to several questions posed in a hypothetical email (3 and 4 questions respectively). Responses for composition tasks could be as brief or lengthy as the participant desired.

4.4 Procedure

After reading and signing the consent form, participants were guided through the standard enrollment process utilized by the Millennium edition of ViaVoice. Each participant completed one training passage, ensuring that the recognition engine had an equivalent amount of information about each participant's speech patterns. Since our participants were all experienced SR users but had not used TkTalk previously, they were guided through 30 min of training and practice. Two sample tasks (one transcription and one composition) were used to guide the training and practice session. The experimenter guided the participant's activities to ensure that they were exposed to all of the capabilities of TkTalk during the training session. At the end of the training session, the experimenter left the participant alone to complete the tasks. During the experimental tasks, the experimenter provided assistance only if the participant repeatedly attempted to execute a valid command using the wrong spoken command. For example, to select an item from a list the user must say "Pick N" where N corresponds to the number of the desired entry. If a participant repeatedly attempted to accomplish this task by saying "Choose N", the experimenter would inform the participant of the correct command. Participants were instructed that they could use the keyboard and mouse whenever they felt it was necessary, but they were told that our goal was to better understand how SR was used to complete these tasks.

Participants were given one task at a time to ensure that the tasks were completed in a predefined order. A typing stand was available to hold the current task. All participants completed all four tasks. Participants completed either the two composition tasks first or the two transcription tasks first. Further, the order in which each pair of tasks (e.g., the two composition or transcription tasks) was completed was randomized. After each task, participants responded to questions about the ease of use of the software and whether or not they were satisfied with the amount of time required to complete the task. After completing all four tasks, participants completed a final questionnaire regarding their satisfaction with the software in the context of transcribing predefined text and composing new text, feelings about locating and correcting recognition errors, comparisons with their normal method of completing these types of tasks, recommendations for changes to the software, the tasks for which they would use SR, and how important SR was as a method of interacting with computers. They also provided a variety of demographic information.

At the end of the session, the experimenter answered any questions, thanked participants for their time, and provided the appropriate payment for the session. Sessions lasted approximately 1.5 h, including several required breaks. The duration of the sessions and the schedule for the breaks was defined in cooperation with professionals from the rehabilitation community who are familiar with the capabilities and limitations of individuals with spinal cord injuries.

4.5 Data analysis

The TkTalk software provides a transcript of all utterances as recognized by the SR engine. Each utterance is automatically categorized as either dictation or a command for the SR engine. Any utterance that maps directly to one of the commands available at the time the words were spoken is classified as a command. All other utterances are classified as dictation.

4.5.1 Identifying discrepancies

The audio and video recordings were used to annotate the transcripts generated by TkTalk. Initially, the annotation process focuses on identifying any discrepancies between the words in the transcript and what the user actually said. Each discrepancy is recorded in the transcript.

4.5.2 Reclassifying utterances

Once all discrepancies are identified, the transcript is reviewed to allow some utterances to be reclassified. Dictation-oriented activities using SR software result in a combination of dictation sequences where the user is entering text and command sequences where the user is modifying text, moving within the document, or completing some other activity that does not result in new text being inserted into the document.

Some utterances that are classified as commands are, in fact, misunderstood attempts at dictation. For example, the words "move to California" could be misrecognized as an attempt to issue a "move to top of document" command. Reviewing the audio/video record of the session makes these reclassifications easy.

Other utterances that are classified as dictation are, in fact, better described as being part of a command sequence. Figure 1 illustrates this point.

This session begins with the user dictating new text (1st and 2nd utterances). Utterance 3 begins a command

Utterance                              Description
1) While classes meet twice a week,    Dictation begins here
2) my class only meet on Mondays.      more dictation
3) Move to top of document             A command
4) Move right one word                 more commands
5) meets                               dictation that is part of a command sequence
6) Move to end of document             more commands
7) This week…                          Dictation begins again

Resulting text:
While most classes meet twice a week, my class only meets on Mondays. This week…

Fig. 1. An example of an utterance (#5) that appears to be dictation, but is more appropriately described as part of a command sequence

sequence where the user navigates to the desired location (3rd and 4th utterances) and then dictates one or more words that should be inserted (5th utterance). As a result, utterance 5 (i.e., "meets") is not considered dictation, but is instead considered part of a command sequence (utterances 3–6).

The fundamental question is when new text that is added in the middle of existing text should be considered part of a command sequence and when it should be viewed as new dictation. For instance, if the user had navigated to the same location and dictated two complete sentences it might be more appropriate to view this new text as dictation rather than a modification of existing text. Therefore, our goal was to provide a definition that allows text to be systematically classified as either new dictation or part of a command sequence. This definition should classify small insertions as part of a command sequence, but larger insertions as new dictation.

An utterance is considered new dictation if:

1) The text is the first text being inserted into the document. This captures activity that occurs as users begin a new task.
2) The text is being inserted at the end of the document. This is critical if correction activities are highly integrated with dictation activities.
3) The text contains a minimum of one noun phrase and one verb phrase. This deals with text inserted in the middle of existing text that could be classified as part of a command sequence, but is large enough to be considered new dictation rather than a modification of existing text.

Any utterance that fails to meet one of the three preceding criteria is reclassified as being part of a command sequence. Each utterance that is part of a command sequence is then further classified as being a navigation event (e.g., move up two lines, select Friday) or a non-navigation event.

4.5.3 Productivity measures

Productivity was assessed using data entry rates, error rates (transcription only), and document quality ratings (composition only). To address variability in both word length and the length of the resulting documents, data entry and error rates were normalized based upon a conversion of five characters per word. The following definitions were used to compute the data entry rates (WPM) and error rates (EPW):

Words in document (WiD) = number of characters in the resulting document / 5
Words per minute (WPM) = WiD / task completion time
Errors per word (EPW) = number of errors in the resulting document / WiD

For transcription tasks, two types of errors were identified. Content errors included extra words, missing words, or incorrect words. Format errors included extra or missing spaces or punctuation, as well as a few instances where numbers should have been represented by words but were erroneously represented by numerals (e.g., 50 instead of fifty).

While counting errors is simple for transcription tasks, it is more difficult for composition tasks. Therefore, for composition tasks we evaluate the quality of the resulting document rather than the number of errors. Quality was assessed by three members of the faculty in the English Department at UMBC who teach writing courses. These assessments were based upon a predefined set of criteria and resulted in three scores for each document. Assessments were first completed individually, with the group meeting afterwards to resolve any discrepancies in their scoring. Scoring was done without knowledge of who generated the responses.

The first score focuses on the content of the response: whether or not the resulting text clearly answered all of the issues raised in the email. The second score focuses on formatting, including issues such as spacing, punctuation, and the editing of incorrect words. The final score focuses on the quality of the writing, including the overall coherence of the response, the ordering of the information within the response, the use of audience-appropriate tone, and sentence structure. All three scores range from 1 to 5 (1 = unacceptable, 2 = weak, 3 = average, 4 = good, 5 = excellent).
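The normalization definitions above can be written out directly. The function names in this sketch are our own; the formulas follow the WiD, WPM, and EPW definitions in the text, with task completion time expressed in minutes.

```python
# A minimal sketch of the productivity measures defined in the text.

def words_in_document(text: str) -> float:
    """WiD = number of characters in the resulting document / 5."""
    return len(text) / 5

def words_per_minute(text: str, completion_minutes: float) -> float:
    """WPM = WiD / task completion time (in minutes)."""
    return words_in_document(text) / completion_minutes

def errors_per_word(text: str, error_count: int) -> float:
    """EPW = number of errors in the resulting document / WiD."""
    return error_count / words_in_document(text)

# Example: a 350-character document finished in 5 minutes with 7 errors.
doc = "x" * 350
assert words_in_document(doc) == 70.0      # 350 / 5
assert words_per_minute(doc, 5.0) == 14.0  # 70 / 5
assert errors_per_word(doc, 7) == 0.1      # 7 / 70
```

The five-characters-per-word conversion makes rates comparable across documents regardless of how long the individual words happen to be.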

4.5.4 Satisfaction measures

Participants responded to two satisfaction questions after each task. These questions inquired about the participants' satisfaction with the ease of completing the task and the amount of time required to complete the task. Participants answered questions using a five-point scale where 1 = very satisfied and 5 = very dissatisfied.

Participants also responded to a series of satisfaction questions after completing all four tasks. The first three questions inquired about the participants' satisfaction with the ease of completing the tasks, the time required, and the accuracy of the speech recognition software with respect to the transcription tasks. The next three questions inquired about the participants' satisfaction with the ease of completing the tasks, the time required, and the accuracy of the speech recognition software with respect to the composition tasks. Two additional questions asked about the ease of finding recognition errors and correcting such errors. The final two questions asked how this software compared to their normal method of interacting with a computer in terms of speed and ease of use. All responses used the five-point scale described above.

5 Results

Three sets of analyses are reported. First, we analyze data entry rates to identify any significant differences in productivity between the two groups of users. We conducted a similar analysis of error rates (transcription tasks) and document quality ratings (composition tasks). Next, we analyze the results from the satisfaction questionnaires. Finally, we investigate the process employed while completing the tasks.

As discussed above, participants rated the importance of SR and provided information regarding how long they had been using SR (measured in weeks). Three participants gave a rating of one, indicating that SR is the most important method they use to interact with computers. Ten participants gave a rating of two, indicating that SR is one of several alternatives they use when interacting with computers. One gave a rating of three, indicating that they do not feel SR is useful. Since importance and experience were expected to influence results, but were not controlled, all analyses utilized an analysis of covariance (ANCOVA) with importance and experience as covariates. Group (traditional vs. SCI users) was treated as an independent variable in all analyses.

5.1 Productivity

Productivity was assessed using data entry rates for both transcription and composition tasks. Error rates are analyzed for transcription tasks while document quality ratings are analyzed for composition tasks.

Table 1. Means (and standard deviations) for data entry rates measured in WPM

                          Transcription            Composition
                          T1        T2             C1        C2
Group  Traditional users  14.17     15.07          14.36     14.50
                          (8.83)    (7.25)         (6.12)    (5.92)
       SCI users          8.35      13.81          12.94     11.35
                          (5.00)    (7.58)         (4.39)    (9.21)

5.1.1 Words per minute

Means and standard deviations for WPM are reported in Table 1. A one-way ANCOVA with repeated measures for type of task (transcription vs. composition) and task number (e.g., first transcription vs. second transcription) did not identify any significant effects due to group, the type of task, or task number (F(1, 10) = 1.21, n.s.; F(1, 10) = 0.90, n.s.; F(1, 10) = 0.03, n.s. respectively). Neither importance nor experience had a significant effect (F(1, 10) = 0.62, n.s.; F(1, 10) = 1.37, n.s. respectively). There were no significant interactions.

5.1.2 Errors per word – transcription only

Means and standard deviations for the number of content and formatting errors per word are reported in Table 2. A one-way ANCOVA with repeated measures for task number (e.g., first transcription vs. second transcription) and type of error (content vs. format) did not identify any significant effects due to group, task number, or the type of error (F(1, 10) = 0.90, n.s.; F(1, 10) = 1.65, n.s.; F(1, 10) = 0.76, n.s. respectively). Neither importance nor experience had a significant effect (F(1, 10) = 0.73, n.s.; F(1, 10) = 0.00, n.s. respectively). There were no significant interactions.

Table 2. Means (and standard deviations) for the number of content and formatting errors per word for the two transcription tasks

                          T1                       T2
                          Content   Format         Content   Format
Group  Traditional users  0.036     0.012          0.024     0.016
                          (0.033)   (0.007)        (0.024)   (0.033)
       SCI users          0.020     0.014          0.022     0.007
                          (0.020)   (0.018)        (0.024)   (0.010)

5.1.3 Document quality ratings – composition only

Means and standard deviations for the quality ratings are reported in Table 3. A one-way ANCOVA with repeated measures for task number (e.g., first composition vs. second composition) and type of score (content

Table 3. Means (and standard deviations) for the content, formatting, and writing
scores for the two composition tasks.
1 = unacceptable, 2 = weak, 3 = average, 4 = good, 5 = excellent

C1 C2
Content Format Writing Content Format Writing
Traditional 4.71 3.86 3.57 4.50 3.29 2.93
Group Users (0.49) (1.07) (0.79) (1.12) (1.50) (1.17)
SCI 4.57 3.43 3.14 4.71 3.64 3.50
Users (0.79) (0.73) (0.63) (0.49) (1.11) (1.26)

vs. formatting vs. writing) did not identify any signifi- first transcription vs. second transcription), and question
cant effects due to group, task number, or type of score (ease vs. speed) identified a significant effect of group
(F (1, 10) = 0.50, n.s.; F (1, 10) = 0.42, n.s.; F (1, 10) = on participant responses (F (1, 10) = 6.12, p < 0.05), but
2.20, n.s. respectively). Neither importance nor expe- no significant effects due to the type of task, task num-
rience had a significant effect (F (1, 10) = 0.41, n.s.; ber, or question (F (1, 10) = 1.22, n.s.; F (1, 10) = 0.10,
F (1, 10) = 4.24, n.s. respectively). Two significant in- n.s.; F (1, 10) = 0.00, n.s. respectively). Neither impor-
teractions were identified that involved both the task tance nor experience had a significant effect (F (1, 10) =
number and type of score, but these interactions provide 0.65, n.s.; F (1, 10) = 3.87, n.s. respectively). There were
no useful insights. no significant interactions. Overall, SCI users provided
more positive responses when completing the post-task
5.2 Subjective satisfaction questions.
Planned post-hoc tests revealed that SCI users’ re-
Means and standard deviations for the post-task ques- sponses were significantly more positive when evaluating
tions regarding time and effort are reported in Table 4. how fast they could complete both transcription tasks
A one-way ANCOVA with repeated measures for type of (F (1, 10) = 5.03, p < 0.05; F (1, 10) = 5.66, p < 0.05). Fig-
task (transcription vs. composition), task number (e.g., ure 2 illustrates these results.

Fig. 2. SCI users rated their experience more positively than did traditional users. Of the eight questions,
significant differences were identified for the two highlighted by an asterisk (*)

Table 4. Means (and standard deviations) for participants’ responses to post-task questions regarding the
ease of completing the task and how satisfied they were with the time required. Bold entries correspond to
individual questions where the responses from the two groups differed significantly

Easy to complete the task Satisfied with time required


T1 T2 C1 C2 T1 T2 C1 C2
Traditional 2.71 2.86 2.57 1.86 3.00 3.29 2.57 2.43
Group Users (1.60) (0.90) (1.27) (0.69) (1.73) (1.50) (1.27) (1.13)
SCI 1.43 1.86 1.57 1.71 1.86 2.14 2.00 1.57
Users (0.53) (0.69) (0.53) (0.76) (0.69) (0.90) (1.00) (0.79)
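Every test in this section is an ANCOVA for the between-subjects group factor, adjusting for the importance and experience covariates. As a minimal sketch (not the authors' analysis code), the group F-statistic can be computed by comparing nested least-squares models; the toy data below are hypothetical, but assuming 14 participants (seven per group) and two covariates, the residual degrees of freedom work out to the F(1,10) form reported above.

```python
# Sketch: ANCOVA F-test for a two-level group factor via nested OLS models.
# Toy data are hypothetical; numpy is assumed to be available.
import numpy as np

def sse(X, y):
    """Residual sum of squares from an ordinary least-squares fit."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    r = y - X @ beta
    return float(r @ r)

def ancova_group_f(y, group, importance, experience):
    """F-test for group, adjusting for the two covariates."""
    n = len(y)
    ones = np.ones(n)
    reduced = np.column_stack([ones, importance, experience])
    full = np.column_stack([ones, group, importance, experience])
    df_num = 1                      # one extra parameter (group)
    df_den = n - full.shape[1]      # residual df of the full model
    f = ((sse(reduced, y) - sse(full, y)) / df_num) / (sse(full, y) / df_den)
    return f, df_num, df_den

# Hypothetical data: 14 participants, 7 per group, 5-point covariate ratings
rng = np.random.default_rng(0)
group = np.repeat([0, 1], 7)
importance = rng.integers(1, 6, 14).astype(float)
experience = rng.integers(1, 6, 14).astype(float)
y = 2.0 + 1.0 * group + rng.normal(0, 0.5, 14)  # ratings with a group effect
f, df1, df2 = ancova_group_f(y, group, importance, experience)
```

With 14 participants and a four-parameter full model, df2 = 14 − 4 = 10, matching the F(1,10) statistics throughout the section. The repeated-measures structure of the actual analyses is omitted here for brevity.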
Table 5. Means (and standard deviations) for participants' responses to post-task questions regarding
the ease of completing the tasks, how satisfied they were with the time required, and how satisfied
they were with the recognition accuracy of the software. Entries marked with an asterisk (*)
correspond to individual questions where the responses from the two groups differed significantly

                     Transcription                 Composition
              Ease     Speed    Accuracy    Ease*    Speed*   Accuracy*
Traditional   2.43     2.86     2.71        2.71     3.43     3.14
users         (0.98)   (1.21)   (1.11)      (0.49)   (1.13)   (1.07)
SCI           1.57     1.71     1.71        1.43     1.57     2.00
users         (0.53)   (0.76)   (0.95)      (0.53)   (0.53)   (0.58)

Table 6. Means (and standard deviations) for participants' responses to post-task questions regarding
the ease of locating recognition errors and the ease of correcting such errors. Entries marked with an
asterisk (*) correspond to individual questions where the responses from the two groups differed significantly

                                          Overall
              Finding    Correcting    Relative    Relative
              errors     errors*       speed*      ease of use*
Traditional   2.29       4.14          3.86        3.57
users         (1.60)     (1.21)        (1.35)      (1.13)
SCI           1.71       2.29          1.86        2.14
users         (0.49)     (0.49)        (1.21)      (0.69)

Means and standard deviations for the ease, speed, and accuracy questions that were asked after all four tasks are reported in Table 5. Means and standard deviations for the two questions regarding finding and correcting recognition errors, as well as the two questions comparing SR to their traditional method of entering data when using computers, appear in Table 6. A one-way ANCOVA with repeated measures for question (post-experiment questions 1–10) identified a significant effect of group on participant responses (F(1,10) = 13.78, p < 0.005), but no significant effect due to question (F(9,90) = 0.94, n.s.). Neither importance nor experience had a significant effect (F(1,10) = 0.11, n.s.; F(1,10) = 3.11, n.s., respectively). There were no significant interactions. Overall, SCI users provided more positive responses when completing the post-experiment questions.

Planned post-hoc comparisons of the responses provided by the two groups of users revealed that SCI users were also more positive when evaluating the ease of completing the composition tasks, how fast they could complete these tasks, and the accuracy of the speech recognition software (F(1,10) = 12.00, p < 0.01; F(1,10) = 12.93, p < 0.01; F(1,10) = 6.62, p < 0.05, respectively). SCI users were also more positive when evaluating the ease of correcting recognition errors (F(1,10) = 13.15, p < 0.01). Finally, SCI users were more positive when evaluating the speed of using this software and the ease of using this software relative to their normal method of interacting with a computer (F(1,10) = 11.23, p < 0.01; F(1,10) = 13.21, p < 0.01). These results are illustrated in Fig. 3.

Fig. 3. SCI users rated their experience more positively than did traditional users. Of the ten questions, significant differences were identified for the six highlighted by an asterisk (*)

5.3 The Process

To better understand the participants' processes for using SR software to complete these tasks, we analyzed the percentage of the task completion time allocated to dictation, navigation commands, and non-navigation commands. We also analyzed the number of dictation episodes, which provides insights into the willingness of the participants to interrupt their dictation to correct errors. Figure 4 provides an overview of how time was spent by the two groups of users.

Fig. 4. Allocation of time to dictation, non-navigation commands, and navigation commands for each group of users

5.3.1 Dictation time

Means and standard deviations for the percentage of the task completion time spent dictating are reported in Table 7. A one-way ANCOVA with repeated measures for type of task (transcription vs. composition) and task number (e.g., first transcription vs. second transcription) identified a significant effect for group (F(1,10) = 6.76, p < 0.05), but did not identify any significant effects due to type of task or task number (F(1,10) = 0.99, n.s.; F(1,10) = 1.06, n.s., respectively). Neither importance nor experience had a significant effect (F(1,10) = 0.23, n.s.; F(1,10) = 0.72, n.s., respectively). There were no significant interactions. Post-hoc comparisons revealed no significant differences between the two groups of users when tasks were analyzed separately.

Table 7. Means (and standard deviations) for the percentage of
time spent dictating

              Transcription       Composition
              T1       T2         C1       C2
Traditional   21.1     24.0       37.2     37.8
users         (6.5)    (12.6)     (23.8)   (15.0)
SCI           28.7     30.6       41.4     57.4
users         (11.6)   (14.3)     (16.0)   (17.4)

5.3.2 Non-navigation commands

Means and standard deviations for the percentage of the task completion time spent issuing non-navigation commands are reported in Table 8. A one-way ANCOVA with repeated measures for type of task (transcription vs. composition) and task number (e.g., first transcription vs. second transcription) did not identify any significant effects due to group, type of task, or task number (F(1,10) = 0.11, n.s.; F(1,10) = 1.11, n.s.; F(1,10) = 0.21, n.s., respectively). Neither importance nor experience had a significant effect (F(1,10) = 1.85, n.s.; F(1,10) = 0.22, n.s., respectively). There were no significant interactions.

Table 8. Means (and standard deviations) for the percentage of
time spent issuing non-navigation commands

              Transcription       Composition
              T1       T2         C1       C2
Traditional   36.1     36.0       30.1     20.8
users         (8.9)    (9.9)      (11.5)   (11.8)
SCI           35.3     44.8       35.3     23.9
users         (7.6)    (16.6)     (8.1)    (9.5)

5.3.3 Navigation time

Means and standard deviations for the percentage of the task completion time spent issuing navigation commands are reported in Table 9. A one-way ANCOVA with repeated measures for type of task (transcription vs. composition) and task number (e.g., first transcription vs. second transcription) identified a significant effect due to group (F(1,10) = 7.77, p < 0.02), but there was no significant effect due to type of task or task number (F(1,10) = 0.10, n.s.; F(1,10) = 0.61, n.s., respectively). Neither importance nor experience had a significant effect (F(1,10) = 0.15, n.s.; F(1,10) = 0.36, n.s., respectively). There were no significant interactions. Post-hoc comparisons revealed no significant differences between the two groups of users when tasks were analyzed separately. Overall, SCI users spent a smaller percentage of the total task completion time issuing navigation commands.

Table 9. Means (and standard deviations) for the percentage of
time spent issuing navigation commands

              Transcription       Composition
              T1       T2         C1       C2
Traditional   42.8     40.1       32.7     41.3
users         (6.6)    (10.0)     (15.8)   (13.3)
SCI           36.0     24.7       23.3     18.7
users         (15.0)   (17.1)     (11.6)   (12.5)
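The process measures above — percentage of task time per activity and the number of dictation episodes — can be derived from a time-stamped activity log. The sketch below is our own illustration (not the instrumentation used in the study), with a hypothetical log; an "episode" is counted as a maximal uninterrupted run of dictation segments.

```python
# Sketch: derive time-allocation percentages and dictation-episode counts
# from a log of (activity, seconds) segments. The log below is hypothetical.
from itertools import groupby

def time_allocation(log):
    """log: list of (activity, seconds). Returns {activity: percent of total}."""
    total = sum(sec for _, sec in log)
    shares = {}
    for activity, sec in log:
        shares[activity] = shares.get(activity, 0.0) + sec
    return {a: 100.0 * s / total for a, s in shares.items()}

def dictation_episodes(log):
    """Count maximal runs of consecutive 'dictate' segments."""
    return sum(1 for key, _ in groupby(a for a, _ in log) if key == "dictate")

log = [("dictate", 30), ("navigate", 10), ("command", 5),
       ("dictate", 20), ("command", 10), ("dictate", 25)]
shares = time_allocation(log)        # dictation: 75 of 100 s -> 75.0%
episodes = dictation_episodes(log)   # three separate dictation runs
```

A user who interrupts dictation to fix each error as it occurs (as the SCI group tended to do) produces more, shorter episodes and correspondingly less navigation time.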
5.3.4 Dictation episodes

Means and standard deviations for the number of dictation episodes are reported in Table 10. A one-way ANCOVA with repeated measures for type of task (transcription vs. composition) and task number (e.g., first transcription vs. second transcription) identified a significant effect due to group (F(1,10) = 5.12, p < 0.05), but there were no significant effects due to type of task or task number (F(1,10) = 0.01, n.s.; F(1,10) = 0.42, n.s., respectively). Neither importance nor experience had a significant effect (F(1,10) = 0.57, n.s.; F(1,10) = 1.62, n.s., respectively). There were no significant interactions. Post-hoc comparisons identified a significant effect of group on the number of dictation episodes for the first transcription task (F(1,10) = 7.11, p < 0.05) and the second composition task (F(1,10) = 6.01, p < 0.05). Overall, SCI users utilized more dictation episodes.

Table 10. Means (and standard deviations) for the number of
dictation episodes

              Transcription       Composition
              T1       T2         C1       C2
Traditional   4.00     4.29       3.86     4.57
users         (1.16)   (1.38)     (1.57)   (1.40)
SCI           7.57     10.57      11.29    7.43
users         (6.16)   (13.51)    (12.75)  (3.55)

6 Conclusions

Performance, measured in data entry rates, errors, and document quality, did not differ between the two user groups. Both groups produced text at a rate of approximately 13 WPM, which compares favorably to the initial usage results reported by Karat et al. [5]. During transcription tasks, users averaged 1 incorrect, missing, or extra word for every 39 words and 1 formatting error for every 81 words in the resulting document (i.e., after corrections). While the number of words misrecognized by speech recognition software has decreased substantially since the Karat et al. study [5], errors still occur and users often fail to correct every error. For composition tasks, the content of the responses was evaluated as good to excellent, while both formatting and writing were rated as average to good. No differences were observed between the two groups of users.

While performance did not differ, satisfaction differed significantly between the two user groups. The SCI group was more positive about the amount of time and effort required for both transcription and composition tasks. They were also more positive when evaluating the ease of correcting recognition errors, and when they compared the SR software to their normal method of interacting with computers. These subjective ratings, combined with the fact that performance did not differ between the two groups of users, indicate that the SCI users expected to work harder and longer to accomplish these tasks. These expectations are probably due, at least in part, to the different methods these users typically employ when interacting with computers. This observation is supported by the fact that the SCI group provided positive ratings when comparing this system to their normal method of interacting with computers, while the traditional users provided negative ratings. These results suggest that it will be easier to develop a SR system that is accepted by SCI users.

Finally, the two groups of users did not employ the same process when completing these tasks. The SCI group utilized more dictation episodes than the traditional users, indicating that they interrupted their dictation more often to correct errors. As a result, SCI users also spent a smaller percentage of their time navigating to incorrect words (see Fig. 4). These differences highlight the importance of designing future SR systems with specific populations of users in mind. For traditional users, we may focus on improving navigation, which accounted for 39% of their time. On the other hand, SCI users interrupted their dictation more often, correcting more errors as they occurred. As a result, they spent significantly less time navigating and would therefore benefit less from improved navigation. SCI users may benefit more from improved commands for interrupting dictation to correct errors, such as "scratch that." Our results confirm that individuals with disabilities and traditional computer users do not always adopt the same strategies when using SR software to complete the same tasks. Therefore, it is critical that future efforts to make SR systems universally accessible are based upon an understanding of the ways in which various user groups interact with the software.

While our results indicate that existing SR-based dictation systems (as represented by tkTalk) compare favorably to the techniques which SCI users are currently using to enter text – including both commercial SR systems and keyboards (using typing braces, typing sticks, or individual fingers) – both groups of users spent over 2 min on correction activities for each minute of dictation. Interestingly, this compares favorably to results reported elsewhere. For example, Halverson et al. reported that experienced users spent approximately 3 min on correction activities for every minute they spent dictating [4]. Even
with this reduction in correction time, much of which is probably due to improvements in the SR algorithms, there is much room for improvement. Both user groups were largely satisfied with the effort involved in locating errors, and the SCI users were reasonably happy with the effort involved in correcting errors, but the amount of text created during this study was relatively small. As the length and complexity of the documents increase, we believe users will benefit from assistance in error identification and more effective error correction methods. Future research should explore these ideas with both user groups.

Finally, previous research confirmed that as traditional users become more experienced they are more likely to employ multimodal correction strategies. While our traditional users were instructed that they could use the keyboard and mouse whenever they felt it was necessary, few chose to do so. We believe this is probably due to the novelty of using a new SR system, the instructions provided to participants, and the simple fact that we were obviously studying an SR system. With additional experience, it is possible that our traditional users would also convert to multimodal correction strategies. For the SCI users, this option is either not available or substantially less attractive due to the pace at which they can interact with a keyboard. Future studies should adopt a longitudinal approach so we can better understand how interaction processes change as experience increases. Further, it may prove interesting to recruit novices to participate in these longitudinal studies so we can gain insight into the way in which performance and preferences evolve with experience. The current study focused on individuals with SCI for whom SR is an appropriate technology. Individuals with higher-level injuries are unlikely to benefit from SR due to their dependence on ventilators, while individuals with lower-level injuries are likely to use a keyboard when interacting with computers.

Acknowledgements. We would like to thank Mike Monkowski and John Vergo of IBM Research, Mark Young and E.C. Townsend of the Maryland Rehabilitation Center, and all of the individuals who participated in this study for their assistance with this project. This material is based upon work supported by the National Science Foundation under Grant No. 9910607. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation (NSF).

References

1. Ainsworth WA (1988) Optimization of string length for spoken digit input with error correction. Int J Man-Machine Stud 28:573–581
2. Ainsworth WA, Pratt SR (1992) Feedback strategies for error correction in speech recognition systems. Int J Man-Machine Stud 36:833–842
3. Baber C, Hone K (1993) Modeling error recovery and repair in automatic speech recognition. Int J Man-Machine Stud 39:495–515
4. Halverson C, Horn D, Karat C-M, Karat J (1999) The beauty of errors: patterns of error correction in desktop speech systems. Proceedings of INTERACT '99, Seventh IFIP TC.13 Conference on Human-Computer Interaction. IOS Press, pp 133–140
5. Karat C-M, Halverson C, Karat J, Horn D (1999) Patterns of entry and correction in large vocabulary continuous speech recognition systems. Proceedings of CHI 99. Addison-Wesley, NY, pp 568–575
6. Manaris B, Harkreader A (1998) SUITEKeys: a speech understanding interface for the motor-control challenged. Proceedings of the 3rd International ACM SIGCAPH Conference on Assistive Technologies (ASSETS '98). Addison-Wesley, NY, pp 108–115
7. Noyes JM, Frankish CR (1994) Errors and error correction in automatic speech recognition systems. Ergonomics 37:1943–1957
Appendix: Speech-based commands available in tkTalk 1.0

A.1 Dictation mode

Begin new task                      Move right n words
Capitalize this (Capitalize)        Move to end of document
Close                               Move to top of document
Close what can I say                Move up n lines – n must be 1–10
Copy this (Copy)                    Page up
Correct text                        Page down
Correct this                        Paste this (Paste)
Cut this (Cut)                      Play audio
Delete this (Delete)                Save task
Go to sleep                         Scratch that
Lowercase this (Lowercase)          Select again
Move down n lines                   Select this
Move left n characters              Select text
Move left n words                   Spell this
Move right n characters             Uppercase this (Uppercase)

A.2 Correction mode ("Correct text" or "Correct this")

Begin spell                         Pick n
Capitalize this (Capitalize)        Play audio
Close correction window             Lowercase this (Lowercase)
Close what can I say                Uppercase this (Uppercase)
Delete this (Delete)

A.3 Spelling mode ("Begin spell" or "Spell this")

a–z                                 Enter
Apostrophe                          Hyphen
Backspace                           Pick n
Capital A–Z                         Space
Close correction window             0–9
Close what can I say
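The three command sets above imply a modal structure: some utterances move the user between dictation, correction, and spelling modes. This can be modeled as a small state machine; the sketch below is our own simplification (not tkTalk's implementation), wiring up only the mode-changing commands, and it assumes that closing the correction window always returns to dictation mode.

```python
# Sketch: modal command handling as a state machine over the three tkTalk
# modes. Transition table is a simplified assumption, not the actual system.
TRANSITIONS = {
    "dictation": {
        "correct text": "correction",
        "correct this": "correction",
        "spell this": "spelling",
    },
    "correction": {
        "begin spell": "spelling",
        "close correction window": "dictation",
    },
    "spelling": {
        "close correction window": "dictation",
    },
}

def next_mode(mode, utterance):
    """Return the mode after hearing `utterance`; unknown input keeps the mode."""
    return TRANSITIONS.get(mode, {}).get(utterance.lower().strip(), mode)

mode = "dictation"
for utterance in ["Correct this", "Begin spell", "Close correction window"]:
    mode = next_mode(mode, utterance)
```

Commands that do not change mode (e.g., "Scratch that", "Pick n") would be dispatched to mode-specific handlers instead of the transition table.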
