Copyright:
Attribution Non-commercial
Robots dealing directly with people need to be able to handle the uncertainties associated with human-centric environments, including human interaction. This field has seen much attention in the past decade, though multimodal systems have not been deeply explored. This work describes a Contextually Informed Multi-Modal Integrator, CIMMI, that fuses speech and symbolic gesture probabilistically and is informed by contextual knowledge in an assistive technology robotic application. Symbolic gestures are gestures that have semantic meaning, such as a wave gesture meaning 'hello'. Very few systems use symbolic gestures in a robotic domain, primarily because symbolic gestures are not as commonly used in day-to-day human conversation as deictic (pointing) gestures. Symbolic gestures allow the system to resolve ambiguities associated with action words in a command hypothesis, whereas deictic gestures are typically used to identify objects or locations. Exploring the use of symbolic gestures is one focus of this work.
Since no deictic gestures are implemented in this work, contextual knowledge is used to resolve ambiguities associated with object and location words in a command. Contextual knowledge is defined as both conversational and situational. Conversational contextual knowledge uses the dialogue history to select the best command from a generated list of possible commands or to resolve ambiguities in the selected command. In human conversation, if a person has been talking about a 'yellow cup', the conversational context allows them to shorten later references to the same object to 'the cup' without restating the implied 'yellow'. Following the same idea, ambiguities in which the object or location information is missing from a command can be resolved by referring to the conversational context. These simple ideas are explored and discussed further in this work.
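The 'yellow cup' example can be made concrete with a minimal sketch. It is an illustration only, not the method implemented in CIMMI: the slot names (`object`, `colour`, `location`), the function `resolve_from_history` and the rule "borrow from the most recent non-contradicting command" are all assumptions introduced here.

```python
from typing import Dict, List, Optional

# Hypothetical representation: a command is a dict of slots, None = not heard.
Slots = Dict[str, Optional[str]]
CONTEXT_KEYS = ("object", "colour", "location")  # assumed slot names


def resolve_from_history(command: Slots, history: List[Slots]) -> Slots:
    """Fill missing slots from the most recent compatible past command."""
    resolved = dict(command)
    for past in reversed(history):  # newest dialogue turn first
        # Skip a past turn that contradicts something the user just said.
        conflict = any(
            resolved.get(k) and past.get(k) and resolved[k] != past[k]
            for k in CONTEXT_KEYS
        )
        if conflict:
            continue
        for k in CONTEXT_KEYS:
            if resolved.get(k) is None and past.get(k) is not None:
                resolved[k] = past[k]
        break  # only the most recent compatible turn is used
    return resolved


# Earlier turn mentioned a yellow cup; the new command only says "the cup".
history = [{"action": "find", "object": "cup", "colour": "yellow", "location": "table"}]
new_command = {"action": "bring", "object": "cup", "colour": None, "location": None}
resolved_cup = resolve_from_history(new_command, history)

# A command about a different object must not inherit the cup's colour.
other_command = {"action": "bring", "object": "book", "colour": None, "location": None}
resolved_book = resolve_from_history(other_command, history)
```

In this sketch `resolved_cup` recovers the implied colour from the dialogue history, while `resolved_book` is left unchanged because the only past turn refers to a different object.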
The second kind of implemented contextual knowledge is situational. In this work, situational contextual knowledge consists of the last known locations of the user and the robot, and a list of objects that exist in the environment. Each object in the list has a colour, an object class type (such as 'cup') and a location. Knowing the location of an object allows users to simplify their spoken requests, so they can simply ask 'bring me the blue cup' without stating the implied 'from the table'. The same concept applies to specifying the colour of an object. Following the same idea, this knowledge can also resolve misrecognitions in which the colour or location information is missing from a command. The user and robot locations are only used to assist the robotic responses. This situational knowledge is explored further in this thesis. The exploitation of conversational and situational contextual knowledge, as introduced above, is the second focus of this work.
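The object list described above maps naturally onto a small data structure. The sketch below is a hedged illustration, not the thesis implementation: the names `KnownObject`, `Command` and `fill_from_situation`, and the rule that a gap is filled only when exactly one known object matches, are assumptions made for this example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class KnownObject:
    """One entry in the (assumed) situational object list."""
    colour: str
    kind: str       # object class type, e.g. 'cup'
    location: str


@dataclass
class Command:
    """A recognised command; None marks information that was not heard."""
    action: str
    kind: Optional[str] = None
    colour: Optional[str] = None
    location: Optional[str] = None


def fill_from_situation(cmd: Command, objects: List[KnownObject]) -> Command:
    """Fill missing colour/location when exactly one known object matches."""
    matches = [
        o for o in objects
        if (cmd.kind is None or o.kind == cmd.kind)
        and (cmd.colour is None or o.colour == cmd.colour)
        and (cmd.location is None or o.location == cmd.location)
    ]
    if len(matches) != 1:
        return cmd  # no match or still ambiguous: leave the command as heard
    o = matches[0]
    return Command(cmd.action, o.kind, o.colour, o.location)


world = [
    KnownObject("blue", "cup", "table"),
    KnownObject("yellow", "cup", "shelf"),
    KnownObject("red", "book", "table"),
]

# 'bring me the blue cup' -> the implied 'from the table' is recovered.
blue_cup = fill_from_situation(Command("bring", kind="cup", colour="blue"), world)

# 'bring me the cup' matches two cups, so nothing is filled in.
any_cup = fill_from_situation(Command("bring", kind="cup"), world)
```

Leaving an ambiguous command untouched is a deliberate choice in this sketch: it keeps the situational step from guessing, so another source such as the dialogue history could still disambiguate.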
The accuracy of the speech recognition system alone (53%) was compared with the accuracy obtained when speech was combined with contextual knowledge, both conversational and situational; adding the contextual knowledge increased the accuracy of the system to 72%. With the further addition of gesture recognition, the system's accuracy rose by only a further 4%, because only 3 cases required the gesture to clarify an ambiguity in the spoken command. CIMMI was implemented on a humanoid robot, Wakamaru, and the speech, gesture and object recognition systems were implemented on a separate laptop.