June 2018 @qualcomm_tech

Making an on-device personal assistant a reality
Qualcomm Technologies, Inc.
AI brings human-like understanding and behaviors to the machines

Perception: Hear, see, and observe
Reasoning: Learn, infer context, and anticipate
Action: Act intuitively, interact naturally, and protect privacy

2
Advancing AI research to make on-device AI ubiquitous
A common platform is fundamental to scaling AI internally and across the industry

IoT Mobile Automotive

Perception: Object detection, speech recognition, contextual fusion
Reasoning: Scene understanding, language understanding, behavior prediction
Action: Reinforcement learning for decision making

Power efficiency: Model design, compression, quantization, activation, algorithms, and efficient hardware
Personalization: Continuous learning, model adaptation, and privacy-preserved distributed learning
Efficient learning: Robust learning through minimal data, unsupervised learning, and on-device learning

System architecture: Multi-task and multi-modal learning, sensor fusion, and cloud-edge systems
3
A true personal assistant
One of many use cases requiring a broad set of AI capabilities

4
Voice is the transformative user interface (UI) we’ve been waiting for

Designed to be:
• Always-on
• Conversational
• Personal
• Private

Critical to create a true virtual assistant

5
Voice UI components required for an end-to-end solution
Machine speech chain: listener and speaker
• Front-end processing: signal acquisition and playback
• Speech pre-processing: speech denoising, echo cancellation
• Voice activation: always-on keyword detection (“Alexa,” “Hey Snapdragon”)
• Speech recognition: speech-to-text
• Natural language processing: natural language understanding, dialog management, natural language generation
• Speech synthesis: text-to-speech
6
Qualcomm Snapdragon is a product of Qualcomm Technologies, Inc. and/or its subsidiaries.
Machine learning has ignited the voice UI revolution
[Chart: machine automatic speech recognition accuracy by year, 1970 to 2020, with regions labeled GMM, CNN, and RNN + CNN. Labeled values: 50%, 55%, 60%, 62%, 70%. Human accuracy: 95%]

GMM: Gaussian Mixture Model, CNN: Convolutional Neural Network, RNN: Recurrent Neural Network

“As speech recognition accuracy goes from say 95% to 99%, all of us in the room will go from barely using it today to using it all
the time. Most people underestimate the difference between 95% and 99% accuracy — 99% is a gamechanger. No one wants
to wait 10 seconds for a response. Accuracy, followed by latency, are the two key metrics for a production speech system.”
— Andrew Ng 7
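The quote above ranks accuracy first among the metrics for a speech system. The slide does not say how the charted accuracy was measured. A common choice, assumed here, is word error rate (WER): the word-level edit distance between a reference transcript and the recognizer output, divided by the reference length. The sketch below is illustrative only. The function names and the example sentences are not from the deck.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the distance between the reference prefix seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


def expected_errors(accuracy: float, num_words: int) -> float:
    """Expected number of wrong words at a given per-word accuracy."""
    return (1.0 - accuracy) * num_words


if __name__ == "__main__":
    # One substituted word out of five reference words.
    print(word_error_rate("turn on the kitchen light", "turn on the chicken light"))
    # Per 100 spoken words: about 5 errors at 95% accuracy, about 1 at 99%.
    print(expected_errors(0.95, 100), expected_errors(0.99, 100))
```

Read this way, moving from 95% to 99% accuracy cuts the number of misrecognized words by a factor of five, which is the gap the quote describes.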
Voice UI is proliferating across product categories
• IoT
• XR
• Smartphones and tablets
• Smart speakers
• In-car entertainment systems
• Headphones and headsets
• TV
• Wearables
• PCs and laptops
8
Moving voice UI functionality to the end device
An end-to-end solution powered by machine learning

[Diagram: voice UI pipeline divided between on-device processing (always-on and real-time) and cloud processing (services). Configuration shown: cloud centric (today)]

Components:
• Multi-mic echo cancellation, beamforming, and speech denoising
• Voice activation
• Automatic speech recognition (ASR)
• Natural language understanding (NLU)
• Text-to-speech (TTS)
• Service manager
• Services: Maps, Wikipedia, SMS, News, Music, Weather, Stocks


9
Moving voice UI functionality to the end device
An end-to-end solution powered by machine learning

[Diagram: voice UI pipeline divided between on-device processing (always-on and real-time) and cloud processing (services). Configuration shown: on-device centric (future)]

Components:
• Multi-mic echo cancellation, beamforming, and speech denoising
• Voice activation
• Automatic speech recognition (ASR)
• Natural language understanding (NLU)
• Text-to-speech (TTS)
• Service manager
• Services: Maps, Wikipedia, SMS, News, Music, Weather, Stocks


10
On-device processing of voice UI
Provides unique benefits complementing the cloud

Cloud tasks:
• Complex voice fallback
• Training and model update
• Knowledge base
• Services

On-device tasks:
• Automatic speech recognition
• Natural language processing
• Always-on audio cognition
• On-device training

Exchanged between cloud and device: machine learning models, offline raw data, queries

Challenge: Providing the voice UI functionality within the power/thermal envelope

Benefits:
• Privacy
• Instant response
• Always-on
• Device context
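The split above lists "complex voice fallback" as a cloud task, with privacy and instant response as on-device benefits. One way to read that split is confidence-based routing: answer locally when the on-device model is confident, and send the query to the cloud only otherwise. The sketch below is an assumption about how such routing could look, not the deck's design. `Result`, `route_query`, the 0.8 threshold, and both toy recognizers are hypothetical stand-ins.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Result:
    intent: str
    confidence: float
    source: str  # "on-device" or "cloud"


def route_query(
    utterance: str,
    on_device: Callable[[str], Result],
    cloud: Callable[[str], Result],
    network_available: bool,
    threshold: float = 0.8,  # hypothetical confidence cut-off
) -> Result:
    """Prefer the local result; fall back to the cloud for hard queries."""
    local = on_device(utterance)
    if local.confidence >= threshold or not network_available:
        return local  # instant response, and no audio leaves the device
    return cloud(utterance)


def toy_on_device(utterance: str) -> Result:
    """Stand-in for a small on-device model with a fixed command set."""
    known = {"turn on the light": "light_on", "pause song": "music_pause"}
    intent = known.get(utterance.lower())
    if intent is None:
        return Result("unknown", 0.1, "on-device")
    return Result(intent, 0.95, "on-device")


def toy_cloud(utterance: str) -> Result:
    """Stand-in for a larger cloud service."""
    return Result("open_ended_query", 0.9, "cloud")


if __name__ == "__main__":
    print(route_query("Pause song", toy_on_device, toy_cloud, True))
    print(route_query("Who wrote Hamlet", toy_on_device, toy_cloud, True))
```

With no network, the same call returns the low-confidence local result, so the assistant degrades instead of failing.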
11
Speech denoising
• Single or multiple mics
• Applicable for
◦ Two-way conversation
◦ Voice/speaker recognition
◦ Keyword spotting
• Deep learning (DL) significantly improves the performance over traditional methods
• Robust in challenging interference and noise scenarios

DL-based denoising: DL-based denoising model trained with extensive speech noise databases

[Figure: noisy speech spectrogram and clean speech spectrogram, before and after DL-based denoising, for the utterance “If people were more generous, there would be no need for welfare”. Axes: time (0.5 to 3) and frequency (0 to 4000)]
12
Qualcomm® Voice Activation (VA)
• High accuracy, robust to background noise, and supports multiple languages
• Deep learning is improving performance
• Among state-of-the-art in terms of performance vs. power consumption

[Chart: Qualcomm VA power consumption, 2014 to 2017, for WCD9330, WCD9335, and WCD9340. Labeled changes: -47% and -11%]

Qualcomm Voice Activation supports:
• Amazon Alexa
• Baidu DUEROS
• Microsoft Cortana
• Google Assistant
13
Qualcomm Voice Activation, Qualcomm WCD9330, Qualcomm WCD9335, and Qualcomm WCD9340 are products of Qualcomm Technologies, Inc. and/or its subsidiaries.
Automatic speech recognition
On-device automatic speech recognition (ASR)

• Transcribe the audio to text
• Deep learning gives state-of-the-art accuracy on a mobile device
• Personalization: adaptation to individual accent and acoustic environment

Processing stages for the example utterance “Turn on the light”:
• Acoustic features: Reduce input audio to essential information
• Acoustic model: Deep learning converts input into linguistic units. Adapted to each user’s accent and environment
• Language model: Uses context and language statistics for best utterance estimation. Adapted to each user’s speaking tendencies
• Natural language understanding (NLU): Allows the same intention to be expressed in multiple ways. Adapted to each user’s intent expressions
• Output: User intention
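The language model stage is described as using context and language statistics to estimate the best utterance. A common formulation, assumed here, scores each candidate transcript as the acoustic log-score plus a weighted language-model log-probability and keeps the highest. The sketch uses a toy bigram model with add-one smoothing. The corpus, the acoustic scores, and the weight are invented for illustration, and the deck does not describe its actual model.

```python
import math
from collections import Counter

# Hypothetical in-domain text used to estimate language statistics.
CORPUS = [
    "turn on the light",
    "turn off the light",
    "turn on the fan",
    "switch on the light",
]


def train_bigram(corpus):
    """Count bigrams and their left contexts, with sentence boundary tokens."""
    contexts, bigrams = Counter(), Counter()
    vocab = {"<s>", "</s>"}
    for sentence in corpus:
        words = ["<s>"] + sentence.split() + ["</s>"]
        vocab.update(words)
        contexts.update(words[:-1])
        bigrams.update(zip(words[:-1], words[1:]))
    return contexts, bigrams, len(vocab)


def lm_log_prob(sentence, model):
    """Log-probability of a sentence under the add-one smoothed bigram model."""
    contexts, bigrams, vocab_size = model
    words = ["<s>"] + sentence.split() + ["</s>"]
    total = 0.0
    for prev, word in zip(words[:-1], words[1:]):
        total += math.log((bigrams[(prev, word)] + 1) / (contexts[prev] + vocab_size))
    return total


def best_utterance(hypotheses, model, lm_weight=1.0):
    """hypotheses: list of (text, acoustic_log_score) pairs."""
    return max(hypotheses, key=lambda h: h[1] + lm_weight * lm_log_prob(h[0], model))[0]


if __name__ == "__main__":
    model = train_bigram(CORPUS)
    # The acoustic model alone slightly prefers a misspelled homophone.
    candidates = [
        ("turn on the light", -4.2),
        ("turn on the lite", -4.0),
        ("tern on the light", -4.1),
    ]
    print(best_utterance(candidates, model))
```

Adapting the language model to a user's speaking tendencies would, in this picture, amount to re-estimating the counts from that user's own utterances.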
14
An end-to-end on-device voice UI example for smart homes
Demo of automatic speech recognition and natural language understanding
Large command set:
• Turn on the living room lights
• Click the kitchen lights off
• Turn off all lights
• Switch on the ceiling fan
• Shut off the sprinklers
• Start music
• Pause song
• Next track
• Go back one
• Play previous song
• Turn speaker off
• Increase temperature

Intent understanding:
• Turn on the kitchen light
• Click kitchen light on
• Switch on light in the kitchen
• Turn the light on in the kitchen
NLU: These four phrases map to the same intent

99% on-device intent accuracy is achieved for domain-specific command sets when adapted to accent and environmental conditions
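To make "map to the same intent" concrete, the sketch below reduces each phrase to an (action, device, location) triple. The demo presumably uses a learned NLU model. This keyword-based stand-in only illustrates the target representation, and the `Intent` type and the keyword tables are hypothetical.

```python
import re
from typing import NamedTuple, Optional


class Intent(NamedTuple):
    action: str
    device: str
    location: Optional[str]


# Hypothetical keyword tables covering part of the command set above.
ACTIONS = {"on": "turn_on", "off": "turn_off"}
DEVICES = {"light": "light", "lights": "light", "fan": "fan", "sprinklers": "sprinkler"}
LOCATIONS = {"kitchen": "kitchen", "living room": "living_room"}


def parse_intent(utterance: str) -> Optional[Intent]:
    """Return an Intent, or None when no action or device is recognized."""
    text = utterance.lower()
    words = re.findall(r"[a-z]+", text)
    action = next((ACTIONS[w] for w in words if w in ACTIONS), None)
    device = next((DEVICES[w] for w in words if w in DEVICES), None)
    location = next((name for phrase, name in LOCATIONS.items() if phrase in text), None)
    if action is None or device is None:
        return None
    return Intent(action, device, location)


if __name__ == "__main__":
    phrases = [
        "Turn on the kitchen light",
        "Click kitchen light on",
        "Switch on light in the kitchen",
        "Turn the light on in the kitchen",
    ]
    for phrase in phrases:
        print(phrase, "->", parse_intent(phrase))
```

All four phrasings produce the same triple, so downstream code can act on the intent without caring how it was worded.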
15
A true virtual assistant
A “digital me” sitting on the device: context aware and personalized

16
Contextual intelligence is required for personalization
The fusion of many types of sensors and personal information
Inputs to sensor fusion:
• Sensor data: Environment, Iris scan, Ambient light, Compass, Camera, Microphone, Temperature, Humidity, Gyroscope, Pulse
• On-device data: Calendar, Messaging, Apps
• Off-device data: Cloud data, IoT data, C-V2X

Low power sensing, processing, and connectivity:
• Efficient, heterogeneous architectures
• Sensor fusion and machine learning
• Integrated, always-on data capturing
• Low-energy wireless technologies (e.g. BT-LE, 5G NR IoT)
17
Creating personalized memories

• Sound analysis: Talking with my son at sunset in La Jolla
• Activity analysis: Strolling on the beach at sunset in La Jolla talking with my son
• Visual analysis: A sunset over the ocean in La Jolla
• Live sentiment analysis: Strolling on the beach at sunset in La Jolla talking with my son and laughing
• GPS location: La Jolla, California
• History, number of people, identity: After the party, strolling on the beach at sunset in La Jolla talking with my son and laughing

Essential for a true virtual assistant


18
A true personal assistant is responsive and proactive
Responsive
Decision-making and conversation based on contextual analysis and prompting (e.g. finding memories)
“Remember the time I was strolling with my son after the party at La Jolla beach?”
“Yes I do, here is a picture you took of the sunset. Should I share it with your family group on WeChat?”

Proactive
Decision-making and conversation based on contextual analysis without prompting (e.g. automatically sharing memories)
“I noticed that you are tired and stressed, I’m turning on the Rocky III soundtrack and navigating you to the gym for a workout and sauna.”
“This music gets my blood going and a workout and sauna will help me relieve stress.”

19
The first step to an on-device virtual assistant
Enabling on-device voice UI

[Diagram: voice UI pipeline divided between on-device processing and cloud processing]

Components:
• Multi-mic echo cancellation, beamforming, and speech denoising
• Voice activation
• Automatic speech recognition (ASR)
• Natural language understanding (NLU)
• Text-to-speech (TTS)
• Service manager
• Services: Maps, Wikipedia, SMS, News, Music, Weather, Stocks

20
Adding an “AI agent” to create a true virtual assistant
The on-device AI agent continuously learns personal knowledge and acts intuitively

[Diagram: sensors feeding on-device processing, alongside cloud processing (services)]

Components:
• Multi-mic echo cancellation and beamforming
• Voice activation
• Automatic speech recognition (ASR)
• AI agent
• Natural language understanding (NLU)
• Text-to-speech (TTS)
• Cloud knowledge graph

21
Adding an “AI agent” to create a true virtual assistant
Contextualization allows personalization at acoustic, intent, and behavior levels

[Diagram: sensors feeding on-device processing and the AI agent, alongside cloud processing (services)]

Components:
• Multi-mic echo cancellation and beamforming
• Voice activation
• Automatic speech recognition (ASR)
• Natural language understanding (NLU)
• Text-to-speech (TTS)
• Contextual fusion and learning
• Dialog management
• Local knowledge graph
• Cloud knowledge graph
• Speaker identification
• Acoustic event detection
• Gender and age detection
• Voice activity detection
• Emotion classification

22
Acoustic event detection
• ML techniques are used to
◦ Classify acoustic signals into a set of predefined events
◦ Infer acoustic environment
• Low power, always-on

ML-based acoustic event classification

[Figure: posterior-gram for various kitchen noise samples, with a color scale from 0.1 to 0.9]

Event classes: Object rustling, Cupboard, Dishes, Glass jingling, Walking, Water tap running, Object snapping, Cutlery, Drawer, Object impact, Washing dishes
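A posterior-gram is a frames-by-classes matrix of per-frame class probabilities. The slide does not say how events are read off it. One simple approach, assumed here, is to smooth each class over a short window, take the most likely class per frame, drop frames below a threshold, and merge consecutive frames with the same label. The window length, the threshold, and the sample values are invented, and only three of the slide's event classes are used.

```python
from itertools import groupby

LABELS = ["dishes", "water tap running", "cutlery"]


def smooth(posteriors, window=3):
    """Causal moving average over time; posteriors is a frames x classes list."""
    smoothed = []
    for t in range(len(posteriors)):
        frames = posteriors[max(0, t - window + 1):t + 1]
        smoothed.append([sum(col) / len(frames) for col in zip(*frames)])
    return smoothed


def decode_events(posteriors, labels, threshold=0.5, window=3):
    """Return (label, first_frame, last_frame) for each detected event."""
    per_frame = []
    for frame in smooth(posteriors, window):
        best = max(range(len(frame)), key=frame.__getitem__)
        per_frame.append(labels[best] if frame[best] >= threshold else None)
    events, start = [], 0
    for label, run in groupby(per_frame):
        length = len(list(run))
        if label is not None:
            events.append((label, start, start + length - 1))
        start += length
    return events


if __name__ == "__main__":
    # Four frames dominated by "dishes", then four by "water tap running".
    posteriorgram = [[0.8, 0.1, 0.1]] * 4 + [[0.1, 0.8, 0.1]] * 4
    print(decode_events(posteriorgram, LABELS))
```

Smoothing delays the detected boundary by about a frame but suppresses single-frame flips, which matters for an always-on detector that should not trigger on isolated noisy frames.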


23
We are advancing AI research to make on-device AI ubiquitous

We are creating AI platform innovations that are fundamental to scaling AI across the industry

We provide the low-power end-to-end on-device solution for a true personal assistant

24
Thank you!
Follow us on:
For more information, visit us at:
www.qualcomm.com & www.qualcomm.com/blog

Nothing in these materials is an offer to sell any of the components or devices referenced herein.

©2018 Qualcomm Technologies, Inc. and/or its affiliated companies. All Rights Reserved.

Qualcomm and Snapdragon are trademarks of Qualcomm Incorporated, registered in the United States and other countries. Other products and brand names may be trademarks or registered trademarks of their respective owners.

References in this presentation to “Qualcomm” may mean Qualcomm Incorporated, Qualcomm Technologies, Inc., and/or other subsidiaries or business units within the Qualcomm corporate structure, as applicable. Qualcomm Incorporated includes Qualcomm’s licensing business, QTL, and the vast majority of its patent portfolio. Qualcomm Technologies, Inc., a wholly-owned subsidiary of Qualcomm Incorporated, operates, along with its subsidiaries, substantially all of Qualcomm’s engineering, research and development functions, and substantially all of its product and services businesses, including its semiconductor business, QCT.
