Making An On Device Personal Assistant A Reality

June 2018 @qualcomm_tech
Making an on-device
personal assistant a reality
Qualcomm Technologies, Inc.
AI brings human-like understanding and behaviors to the machines
Perception: Hear, see, and observe
Reasoning: Learn, infer context, and anticipate
Action: Act intuitively, interact naturally, and protect privacy

Advancing AI research to make on-device AI ubiquitous
A common platform is fundamental to scaling AI internally and across the industry
IoT, Mobile, Automotive
Perception: Object detection, speech recognition, contextual fusion
Reasoning: Scene understanding, language understanding, behavior prediction
Action: Reinforcement learning for decision making
Power efficiency: Model design, compression, quantization, activation, algorithms, and efficient hardware
Personalization: Continuous learning, model adaptation, and privacy-preserved distributed learning
Efficient learning: Robust learning through minimal data, unsupervised learning, and on-device learning
System architecture: Multi-task and multi-modal learning, sensor fusion, and cloud-edge systems

A true personal assistant
One of many use cases requiring a broad set of AI capabilities
Voice is the transformative user interface (UI) we’ve been waiting for
Designed to be:
• Always-on
• Conversational
• Personal
• Private
Critical to create a true virtual assistant

Voice UI components required for an end-to-end solution
Machine speech chain: listener and speaker
• Front-end processing: Signal acquisition and playback
• Speech pre-processing: Speech denoising, echo cancellation
• Voice activation: Always-on keyword detection (“Alexa,” “Hey Snapdragon”)
• Speech recognition: Speech-to-text
• Natural language processing: Natural language understanding, dialog management, natural language generation
• Speech synthesis: Text-to-speech
Qualcomm Snapdragon is a product of Qualcomm Technologies, Inc. and/or its subsidiaries.
Machine learning has ignited the voice UI revolution
[Chart: machine automatic speech recognition accuracy, 1970 to 2020. Labeled values: 50%, 55%, 60%, 62%, and 70%; human accuracy marked at 95%; model eras marked GMM, CNN, and RNN + CNN]
GMM: Gaussian Mixture Model, CNN: Convolutional Neural Network, RNN: Recurrent Neural Network
“As speech recognition accuracy goes from say 95% to 99%, all of us in the room will go from barely using it today to using it all
the time. Most people underestimate the difference between 95% and 99% accuracy — 99% is a gamechanger. No one wants
to wait 10 seconds for a response. Accuracy, followed by latency, are the two key metrics for a production speech system.”
- Andrew Ng

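Speech recognition accuracy is commonly reported through word error rate (WER): the word-level edit distance between the recognizer output and a reference transcript, divided by the reference length. The slide does not say which metric its chart uses, so reading accuracy as 1 - WER is an assumption here. A minimal Python sketch, with an example sentence borrowed from the smart-home demo:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] = edit distance between the reference prefix seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


reference = "turn on the living room lights"
hypothesis = "turn on the living room light"
wer = word_error_rate(reference, hypothesis)
print(f"WER = {wer:.3f}, accuracy = {1 - wer:.3f}")
```

Under this reading, going from 95% to 99% accuracy cuts errors from about one word in 20 to one word in 100.
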
Voice UI is proliferating across product categories
• IoT
• XR
• Smartphones and tablets
• Smart speakers
• In-car entertainment systems
• Headphones and headsets
• TV
• Wearables
• PCs and laptops

Moving voice UI functionality to the end device
An end-to-end solution powered by machine learning
Cloud centric (today)
[Diagram: the voice UI pipeline split between on-device processing (always-on and real-time) and cloud processing (services)]
• Multi-mic echo cancellation, beamforming, and speech denoising
• Voice activation
• Automatic speech recognition (ASR)
• Natural language understanding (NLU)
• Text-to-speech (TTS)
• Service manager: Maps, Wikipedia, SMS, News, Music, Weather, Stocks

Moving voice UI functionality to the end device
An end-to-end solution powered by machine learning
On-device centric (future)
[Diagram: the voice UI pipeline split between on-device processing (always-on and real-time) and cloud processing (services)]
• Multi-mic echo cancellation, beamforming, and speech denoising
• Voice activation
• Automatic speech recognition (ASR)
• Natural language understanding (NLU)
• Text-to-speech (TTS)
• Service manager: Maps, Wikipedia, SMS, News, Music, Weather, Stocks

On-device processing of voice UI
Provides unique benefits complementing the cloud
Cloud tasks
• Complex voice fallback
• Training and model update
• Knowledge base
• Services
On-device tasks
• Automatic speech recognition
• Natural language processing
• Always-on audio cognition
• On-device training
Benefits
• Privacy
• Instant response
• Always-on
• Device context
• Offline
Challenge: Providing the voice UI functionality within the power/thermal envelope
[Diagram labels between cloud and device: Machine learning models, raw data, Queries]

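One way to read the cloud/device split is as a confidence-gated router: the device answers what it can, and only hard cases go to the cloud (the "complex voice fallback" task). The sketch below illustrates that reading only. `OnDeviceResult`, `route_query`, the 0.8 threshold, and the stubbed cloud call are hypothetical and do not come from the deck.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass
class OnDeviceResult:
    text: str          # transcript produced by the on-device recognizer
    confidence: float  # recognizer confidence in [0, 1]


def route_query(
    result: OnDeviceResult,
    cloud_fallback: Callable[[str], str],
    online: bool,
    threshold: float = 0.8,  # hypothetical cut-off, to be tuned per product
) -> Tuple[str, str]:
    """Return (handler, response); prefer the device, use the cloud only when needed."""
    if result.confidence >= threshold or not online:
        # Confident enough, or no connection: stay on the device.
        return "device", f"handled locally: {result.text}"
    # Low confidence and a connection is available: forward the query.
    return "cloud", cloud_fallback(result.text)


def fake_cloud(text: str) -> str:
    """Stand-in for a cloud service call (hypothetical)."""
    return f"cloud answer for: {text}"


print(route_query(OnDeviceResult("turn on the light", 0.95), fake_cloud, online=True))
print(route_query(OnDeviceResult("what is the capital of", 0.40), fake_cloud, online=True))
print(route_query(OnDeviceResult("what is the capital of", 0.40), fake_cloud, online=False))
```

Keeping the offline branch on the device mirrors the "Offline" benefit listed on the slide. A real system would also have to decide what to tell the user when a low-confidence query cannot be escalated.
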
Speech denoising
• Single or multiple mics
• Applicable for
◦ Two-way conversation
◦ Voice/speaker recognition
◦ Keyword spotting
• Deep learning (DL) significantly improves the performance over traditional methods
• Robust in challenging interference and noise scenarios
[Figure: a noisy speech spectrogram of “If people were more generous, there would be no need for welfare” passes through DL-based denoising (a DL-based denoising model trained with extensive speech noise databases) to give the clean speech spectrogram of the same utterance. Axes: frequency 0 to 4000, time 0.5 to 3]

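For contrast with the DL-based model, one of the classic traditional methods is spectral subtraction: estimate the noise magnitude per frequency bin, then subtract it from every frame. The deck does not name a specific baseline, so this choice, the helper `spectral_subtract`, and the toy magnitudes are assumptions for illustration.

```python
from typing import List

Spectrogram = List[List[float]]  # frames x frequency bins, magnitude values


def spectral_subtract(noisy: Spectrogram, noise_frames: int, floor: float = 0.05) -> Spectrogram:
    """Subtract a per-bin noise estimate taken from the first `noise_frames` frames.

    Assumes those leading frames contain noise only (no speech).
    """
    if not 0 < noise_frames <= len(noisy):
        raise ValueError("noise_frames must be between 1 and the number of frames")
    bins = len(noisy[0])
    # Average magnitude per frequency bin over the noise-only frames.
    noise = [sum(frame[b] for frame in noisy[:noise_frames]) / noise_frames for b in range(bins)]
    cleaned = []
    for frame in noisy:
        # Keep a small spectral floor so no bin goes negative.
        cleaned.append([max(mag - noise[b], floor * mag) for b, mag in enumerate(frame)])
    return cleaned


noisy = [
    [1.0, 1.0, 1.0],  # noise only
    [1.0, 1.0, 1.0],  # noise only
    [1.0, 5.0, 1.0],  # speech energy in the middle bin
]
for frame in spectral_subtract(noisy, noise_frames=2):
    print([round(v, 2) for v in frame])
```

The method depends on a stationary noise estimate, which is one reason it struggles in the "challenging interference and noise scenarios" the slide mentions.
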
Qualcomm® Voice Activation (VA)
High accuracy, robust to background noise, and supports multiple languages
Deep learning is improving performance
Among state-of-the-art in terms of performance vs. power consumption
Qualcomm Voice Activation supports:
• Amazon Alexa
• Baidu DUEROS
• Microsoft Cortana
• Google Assistant
[Chart: Qualcomm VA power consumption, 2014 to 2017, across WCD9330, WCD9335, and WCD9340, with changes labeled -47% and -11%]
Qualcomm Voice Activation, Qualcomm WCD9330, Qualcomm WCD9335, and Qualcomm WCD9340 are products of Qualcomm Technologies, Inc. and/or its subsidiaries.
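Always-on keyword detection is often built by smoothing per-frame keyword posteriors from a small model and triggering when the smoothed score crosses a threshold. The deck does not describe how Qualcomm Voice Activation works internally, so the sketch below is a generic illustration. `detect_keyword`, the window size, the threshold, and the posterior values are hypothetical.

```python
from collections import deque
from typing import Iterable, Optional


def detect_keyword(posteriors: Iterable[float], window: int = 3, threshold: float = 0.7) -> Optional[int]:
    """Return the index of the first frame where the moving-average posterior
    reaches `threshold`, or None if the keyword never fires."""
    recent = deque(maxlen=window)  # holds the last `window` frame posteriors
    for index, p in enumerate(posteriors):
        recent.append(p)
        if len(recent) == window and sum(recent) / window >= threshold:
            return index
    return None


# Per-frame keyword posteriors from a (hypothetical) acoustic model.
frames = [0.1, 0.9, 0.1, 0.2, 0.8, 0.9, 0.95, 0.3]
print(detect_keyword(frames))
```

The isolated spike at the second frame does not trigger, because averaging requires several consecutive confident frames. This trades a little latency for fewer false activations.
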
On-device automatic speech recognition (ASR)
Automatic speech recognition
• Transcribe the audio to text
• Deep learning gives state-of-the-art accuracy on a mobile device
• Personalization: adaptation to individual accent and acoustic environment
Pipeline for the example utterance “Turn on the light”
• Acoustic features: Reduce input audio to essential information
• Acoustic model: Deep learning converts input into linguistic units. Adapted to each user’s accent and environment
• Language model: Uses context and language statistics for best utterance estimation. Adapted to each user’s speaking tendencies
• Natural language understanding (NLU): Allows the same intention to be expressed in multiple ways. Adapted to each user’s intent expressions
• Output: User intention

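The language model step can be sketched as rescoring. Each candidate transcript carries an acoustic log-probability and a language-model log-probability, and the decoder keeps the candidate with the best weighted sum. The candidate list, all probabilities, and the `lm_weight` parameter below are made up for illustration.

```python
import math
from typing import Dict, List, Tuple

# (transcript, acoustic log-probability) pairs; values are illustrative only.
candidates: List[Tuple[str, float]] = [
    ("turn on the light", math.log(0.40)),
    ("turn on the lite", math.log(0.45)),
    ("turn honor light", math.log(0.15)),
]

# Toy language model: log-probability of each whole utterance (hypothetical values).
language_model: Dict[str, float] = {
    "turn on the light": math.log(0.30),
    "turn on the lite": math.log(0.01),
    "turn honor light": math.log(0.001),
}


def best_utterance(cands: List[Tuple[str, float]], lm: Dict[str, float], lm_weight: float = 1.0) -> str:
    """Pick the argmax of log P(audio | words) + lm_weight * log P(words)."""
    def score(item: Tuple[str, float]) -> float:
        text, acoustic_logp = item
        return acoustic_logp + lm_weight * lm[text]
    return max(cands, key=score)[0]


print(best_utterance(candidates, language_model))                 # acoustic + language model
print(best_utterance(candidates, language_model, lm_weight=0.0))  # acoustic score only
```

With the language model switched off, the acoustically closest but unlikely spelling wins. With it on, the common phrase wins. Adapting the language model to a user's speaking tendencies amounts to changing the values in `language_model`.
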
An end-to-end on-device voice UI example for smart homes
Demo of automatic speech recognition and natural language understanding
Large command set
• Turn on the living room lights
• Click the kitchen lights off
• Turn off all lights
• Switch on the ceiling fan
• Shut off the sprinklers
• Start music
• Pause song
• Next track
• Go back one
• Play previous song
• Turn speaker off
• Increase temperature
Intent understanding
• Turn on the kitchen light
• Click kitchen light on
• Switch on light in the kitchen
• Turn the light on in the kitchen
NLU: These four phrases map to the same intent
99% on-device intent accuracy is achieved for domain specific command sets when adapted to accent and environmental condition

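To make the "four phrases, one intent" idea concrete, here is a deliberately simple keyword-and-slot sketch. It is not the NLU model used in the demo, which the deck does not describe. The vocabularies and the `parse_intent` helper are hypothetical.

```python
from typing import NamedTuple, Optional


class Intent(NamedTuple):
    action: str              # "on" or "off"
    device: str              # normalized device name, e.g. "light"
    location: Optional[str]  # room, if one was mentioned


ACTIONS = {"on": "on", "off": "off"}
DEVICES = {"light": "light", "lights": "light", "fan": "fan", "sprinklers": "sprinklers"}
LOCATIONS = {"kitchen", "living room"}


def parse_intent(utterance: str) -> Optional[Intent]:
    """Map differently worded commands onto one (action, device, location) triple."""
    text = utterance.lower()
    words = text.split()
    action = next((ACTIONS[w] for w in words if w in ACTIONS), None)
    device = next((DEVICES[w] for w in words if w in DEVICES), None)
    location = next((loc for loc in sorted(LOCATIONS) if loc in text), None)
    if action is None or device is None:
        return None  # not enough information to act on
    return Intent(action, device, location)


phrases = [
    "Turn on the kitchen light",
    "Click kitchen light on",
    "Switch on light in the kitchen",
    "Turn the light on in the kitchen",
]
intents = {parse_intent(p) for p in phrases}
print(intents)
```

A rule table like this breaks as soon as wording drifts outside its vocabulary ("Start music" returns `None` here), which is the gap a learned NLU model adapted to each user's intent expressions is meant to close.
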
A true virtual assistant
A “digital me” sitting on the device: context aware and personalized

Contextual intelligence is required for personalization
The fusion of many types of sensors and personal information
Sensor fusion inputs
• Sensor data: Environment, Iris scan, Ambient light, Compass, Camera, Microphone, Temperature, Humidity, Gyroscope, Pulse
• On-device data: Calendar, Messaging, Apps
• Off-device data: Cloud data, IoT data, C-V2X
Low power sensing, processing, and connectivity
• Efficient, heterogeneous architectures
• Sensor fusion and machine learning
• Integrated, always-on data capturing
• Low-energy wireless technologies (e.g. BT-LE, 5G NR IoT)

Creating personalized memories
Essential for a true virtual assistant
• GPS location: La Jolla, California
• Visual analysis: A sunset over the ocean in La Jolla
• Sound analysis: Talking with my son at sunset in La Jolla
• Activity analysis: Strolling on the beach at sunset in La Jolla talking with my son
• Live sentiment analysis: Strolling on the beach at sunset in La Jolla talking with my son and laughing
• History, number of people, identity: After the party, strolling on the beach at sunset in La Jolla talking with my son and laughing

A true personal assistant is responsive and proactive
Responsive
Decision-making and conversation based on contextual analysis and prompting (e.g. finding memories)
“Remember the time I was strolling with my son after the party at La Jolla beach?”
“Yes I do, here is a picture you took of the sunset. Should I share it with your family group on WeChat?”
Proactive
Decision-making and conversation based on contextual analysis without prompting (e.g. automatically sharing memories)
“I noticed that you are tired and stressed, I’m turning on the Rocky III soundtrack and navigating you to the gym for a workout and sauna.”
“This music gets my blood going and a workout and sauna will help me relieve stress.”

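The first step to an on-device virtual assistant
Enabling on-device voice UI
[Diagram: the voice UI pipeline split between on-device processing and cloud processing]
• Multi-mic echo cancellation, beamforming, and speech denoising
• Voice activation
• Automatic speech recognition (ASR)
• Natural language understanding (NLU)
• Text-to-speech (TTS)
• Service manager: Maps, Wikipedia, SMS, News, Music, Weather, Stocks
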
Adding an “AI agent” to create a true virtual assistant
The on-device AI agent continuously learns personal knowledge and acts intuitively
[Diagram: sensors, on-device processing, and cloud processing (services) around an AI agent]
• Sensors
• Multi-mic echo cancellation and beamforming
• Voice activation
• Automatic speech recognition (ASR)
• Natural language understanding (NLU)
• Text-to-speech (TTS)
• AI agent
• Cloud knowledge graph

Adding an “AI agent” to create a true virtual assistant
Contextualization allows personalization at acoustic, intent, and behavior levels
[Diagram: sensors, on-device processing, the AI agent, and cloud processing (services)]
• Sensors
• Multi-mic echo cancellation and beamforming
• Voice activation
• Automatic speech recognition (ASR)
• Natural language understanding (NLU)
• Text-to-speech (TTS)
• AI agent: Contextual fusion and learning, dialog management, local knowledge graph
• Speaker identification, acoustic event detection, gender and age detection, voice activity detection, emotion classification
• Cloud knowledge graph

Acoustic event detection
• ML techniques are used to
◦ Classify acoustic signals into a set of predefined events
◦ Infer acoustic environment
• Low power, always-on
[Figure: various kitchen noise samples pass through ML-based acoustic event classification to give a posterior-gram with values from 0.1 to 0.9 over the event classes Object rustling, Object snapping, Cupboard, Cutlery, Dishes, Drawer, Glass jingling, Object impact, Walking, Washing dishes, and Water tap running]

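A posterior-gram like the one in the figure is a matrix of per-frame class probabilities. The sketch below converts raw classifier scores to posteriors with a softmax and reads off the most likely event per frame. The scores are invented, the classifier itself is out of scope, and only three of the slide's event classes are used.

```python
import math
from typing import List

EVENTS = ["Cutlery", "Dishes", "Water tap running"]  # subset of the slide's classes


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax: turns raw scores into probabilities that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def posteriorgram(frame_scores: List[List[float]]) -> List[List[float]]:
    """One row of class posteriors per audio frame."""
    return [softmax(scores) for scores in frame_scores]


# Raw per-frame scores from a (hypothetical) acoustic event classifier.
frame_scores = [
    [2.0, 0.5, 0.1],
    [1.8, 0.7, 0.2],
    [0.1, 0.3, 2.5],
]
for row in posteriorgram(frame_scores):
    best = max(range(len(EVENTS)), key=row.__getitem__)
    print([round(p, 2) for p in row], "->", EVENTS[best])
```

Inferring the acoustic environment (for example "kitchen") could then be a matter of aggregating these per-frame events over a longer window, though the deck does not specify how that step is done.
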
• We are advancing AI research to make on-device AI ubiquitous
• We are creating AI platform innovations that are fundamental to scaling AI across the industry
• We provide the low-power end-to-end on-device solution for a true personal assistant

Thank you!
For more information, visit us at: www.qualcomm.com & www.qualcomm.com/blog

Nothing in these materials is an offer to sell any of the components or devices referenced herein.

©2018 Qualcomm Technologies, Inc. and/or its affiliated companies. All Rights Reserved.

Qualcomm and Snapdragon are trademarks of Qualcomm Incorporated, registered in the United States and other countries. Other products and brand names may be trademarks or registered trademarks of their respective owners.

References in this presentation to “Qualcomm” may mean Qualcomm Incorporated, Qualcomm Technologies, Inc., and/or other subsidiaries or business units within the Qualcomm corporate structure, as applicable. Qualcomm Incorporated includes Qualcomm’s licensing business, QTL, and the vast majority of its patent portfolio. Qualcomm Technologies, Inc., a wholly-owned subsidiary of Qualcomm Incorporated, operates, along with its subsidiaries, substantially all of Qualcomm’s engineering, research and development functions, and substantially all of its product and services businesses, including its semiconductor business, QCT.
