
Tarik Sulić Paper Report 1

Article 1:

Title: Speaker Recognition by Machines and Humans

Authors: John H.L. Hansen, Taufiq Hasan

Published in: IEEE Signal Processing Magazine Volume: 32, Issue: 6

Publication Year: 2015

Purpose of the study: This research aims to show the similarities and differences between
human and machine speaker recognition; it also presents prominent speaker-modeling techniques
that have emerged in the last decade for automatic systems.

Methods used: The Gaussian-mixture-model (GMM) based technique results in a
speaker-dependent PDF. Evaluating the PDF at different data points (e.g., features obtained
from a test utterance) provides a probability score that can be used to compute the similarity
between a speaker GMM and an unknown speaker’s data. For a simple speaker-identification
task, a GMM is first obtained for each speaker. During testing, the utterance is compared
against each GMM, and the most likely speaker (i.e., the highest-scoring GMM) is selected.
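
As an illustration only, the following is a minimal sketch of such a GMM-based identification step in Python using scikit-learn's GaussianMixture; the feature arrays (e.g., MFCC frames per speaker) are assumed inputs, and the systems surveyed in the paper are considerably more elaborate.

# Minimal sketch of GMM-based speaker identification (an illustrative
# assumption, not the paper's exact system). train_features maps each
# enrolled speaker to an (N, D) array of feature frames; test_features
# holds the frames of an unknown utterance.
from sklearn.mixture import GaussianMixture

def train_speaker_models(train_features, n_components=16):
    # Fit one diagonal-covariance GMM per enrolled speaker.
    models = {}
    for speaker, feats in train_features.items():
        gmm = GaussianMixture(n_components=n_components, covariance_type="diag")
        gmm.fit(feats)
        models[speaker] = gmm
    return models

def identify_speaker(models, test_features):
    # Score the test utterance against every GMM (average log-likelihood)
    # and select the highest-scoring speaker.
    scores = {spk: gmm.score(test_features) for spk, gmm in models.items()}
    return max(scores, key=scores.get)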

Dataset: C. Cortes and V. Vapnik, “Support-vector networks,” Mach. Learning, vol. 20, no. 3,
pp. 273–297, Sept. 1995.

Findings: A substantial amount of work still needs to be done to fully understand how the
human brain makes decisions about speech content and speaker identity. However, from what
we know, it can be said that automatic speaker-recognition systems should focus more on high-
level features for improved performance. Humans are effective at effortlessly identifying
unique traits of speakers they know very well, whereas automatic systems can only learn a
specific trait if a measurable feature parameter can be properly defined. Automatic systems are
better at searching over vast collections of audio and, perhaps, at being able to more effectively
set aside those audio samples which are less likely to be speaker matches; whereas humans are
better at comparing a smaller subset and overcoming microphone or channel mismatch more
easily.


Article 2:

Title: From simulated speech to natural speech, what are the robust features for emotion
recognition?

Authors: Ya Li, Linlin Chao, Yazhu Liu, Wei Bao, Jianhua Tao

Published in: 2015 International Conference on Affective Computing and Intelligent
Interaction (ACII)

Publication Year: 2015

Purpose of the study: This paper aims to investigate the effects of the commonly utilized
spectral, prosody, and voice quality features on emotion recognition across the three types of
corpus, and to find out which features are robust for emotion recognition with natural speech.

Methods used: Three feature selection methods implemented in Weka are utilized in this work.
Feature-selection method I ranks features by their information gain with respect to the
emotion class. The second and third methods evaluate the worth of a subset of features
by considering the individual predictive ability of each feature along with the degree of
redundancy between them. The difference between methods II and III lies in the search strategy:
best-first search and genetic search are adopted in methods II and III, respectively.
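
As a rough illustration of method I, the sketch below ranks features by their relevance to the emotion class. The paper performs this step in Weka; here scikit-learn's mutual-information estimate is used as an analogous criterion, and the feature matrix X, labels y, and feature_names are assumed inputs.

# Sketch of feature-selection method I: rank features by information gain
# with respect to the emotion class. Weka is used in the paper; mutual
# information from scikit-learn serves here as the analogous criterion.
import numpy as np
from sklearn.feature_selection import mutual_info_classif

def rank_features_by_information_gain(X, y, feature_names):
    gains = mutual_info_classif(X, y)   # one relevance score per feature
    order = np.argsort(gains)[::-1]     # highest score first
    return [(feature_names[i], gains[i]) for i in order]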

Dataset: F. Burkhardt, A. Paeschke, M. Rolfes, W. F. Sendlmeier, and B. Weiss, "A database
of German emotional speech," in INTERSPEECH, 2005, pp. 1517-1520.

Findings: This paper investigates the effects of spectral, prosody, and voice quality features
on emotion recognition using simulated/acted, elicited, and natural corpora. The experiments suggest
that: a) recognition accuracy decreases as the corpus changes from simulated to natural;
in addition, spectral-related features alone can obtain emotion recognition accuracy
comparable to using all the features; b) there is a clear increase in emotion recognition accuracy
from speaker-independent to speaker-dependent settings, whereas the difference caused by text context
is small; c) prosody and voice quality features are robust for emotion recognition on the simulated
corpus, while spectral features are robust on the elicited and natural corpora.


Article 3:

Title: Voice command recognition using EEG signals

Authors: Marianna Rosinová, Martin Lojka, Ján Staš, Jozef Juhár

Published in: 2017 International Symposium ELMAR

Publication Year: 2017

Purpose of the study: The main purpose of the study is to bring a better understanding of speech
production. Speech recognition from brain signals can improve voice control, and it can help
people suffering from speech disorders.

Methods used: The experiment, aimed at the recognition of voice commands from the emitted
brain waves, was carried out with the HTK toolkit. The HTK toolkit is suitable for Hidden Markov
model (HMM) training and evaluation, along with various other functions.
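
The paper itself carries out training and evaluation with HTK; purely as an analogous sketch, the snippet below trains one Gaussian HMM per voice command with the hmmlearn library and assigns an unknown EEG feature sequence to the command whose model scores it highest. The data layout (train_data mapping commands to lists of feature sequences) is an assumption.

# Analogous sketch (hmmlearn, not HTK): one HMM per voice command, trained
# on EEG feature sequences, with recognition by maximum log-likelihood.
import numpy as np
from hmmlearn.hmm import GaussianHMM

def train_command_models(train_data, n_states=5):
    # train_data: {command: [seq_1, seq_2, ...]}, each sequence of shape (T_i, D)
    models = {}
    for cmd, sequences in train_data.items():
        X = np.vstack(sequences)
        lengths = [len(seq) for seq in sequences]
        model = GaussianHMM(n_components=n_states, covariance_type="diag", n_iter=20)
        model.fit(X, lengths)
        models[cmd] = model
    return models

def recognize_command(models, sequence):
    scores = {cmd: m.score(sequence) for cmd, m in models.items()}
    return max(scores, key=scores.get)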

Dataset: The dataset was recorded by the authors. The whole database consists of a set of
EEG signals, audio recordings, and video recordings. The audio signal was recorded for the purpose
of time-labelling the spoken words, at a sample rate of 48 kHz.

Findings: Evaluation results of the speech recognition system based on EEG signals indicate low
classification accuracy. HMM models trained on the whole available spectrum result in only up to
1.2 % successful classification. Dividing the database into left and right hemispheres has no effect
on recognition improvement. Decomposing the signals into 5 frequency bands brings better results;
the highest classification accuracy was recorded on the alpha, beta, and theta frequencies.


Article 4:

Title: Voice activity detection using discriminative restricted Boltzmann machines

Authors: Rogério G. Borin, Magno T. M. Silva

Published in: 2017 25th European Signal Processing Conference (EUSIPCO)

Publication Year: 2017

Purpose of the study: In this paper, the authors address the voice activity detection task
through machine learning by using a discriminative restricted Boltzmann machine (DRBM).

Methods used: They used a discriminative form of the RBM (restricted Boltzmann machine), a
stochastic neural network that is able to generate data according to a learned probability distribution.
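
As a loose illustration only: scikit-learn offers a generative BernoulliRBM but not a discriminative RBM, so the sketch below approximates an RBM-based VAD with an RBM feature extractor followed by logistic regression. The frame features X (scaled to [0, 1]) and speech/non-speech labels y are assumed inputs, and this is not the DRBM formulation used in the paper.

# Loose analogue of an RBM-based voice activity detector: a generative RBM
# learns a hidden representation of each frame, and a logistic regression on
# top makes the speech / non-speech decision (not a true discriminative RBM).
from sklearn.neural_network import BernoulliRBM
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

vad = Pipeline([
    ("rbm", BernoulliRBM(n_components=64, learning_rate=0.05, n_iter=20)),
    ("clf", LogisticRegression(max_iter=1000)),
])
# Usage with assumed data: vad.fit(X_train, y_train); flags = vad.predict(X_test)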

Findings: From the conducted experiments, the authors found that DRBM-based VADs
slightly outperformed the LTSD detector, which is usually considered a benchmark among
detectors, and considerably outperformed G.729-B and G.729-II, VADs used in industry.
Additionally, simulations show that the DRBM is able to properly deal with correlated inputs.


Article 5:

Title: Voice recognition by Google Home and Raspberry Pi for smart socket control

Authors: Chen-Yen Peng, Rung-Chin Chen

Published in: 2018 Tenth International Conference on Advanced Computational Intelligence
(ICACI)

Publication Year: 2018

Purpose of the study: The authors make use of Google Home's voice recognition together with
machine learning to analyze the feasibility of fulfilling users' needs with a machine-learning-based
smart home design.

Methods used: They used Google Home voice-control instructions to interpret the
meaning of commands. Through the Raspberry Pi, the system sent signals to drive the Smart Bluetooth
Socket or to control the relevant electrical appliances. All operations were recorded in a
cloud database for future analysis applications.
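
For illustration, a hedged sketch of the Raspberry Pi side is given below: a small HTTP endpoint that a cloud voice-assistant integration could call to switch a socket. The GPIO pin, route, and payload format are assumptions, and a relay wired to a GPIO pin stands in for the paper's Smart Bluetooth Socket.

# Hedged sketch of the Raspberry Pi side (assumed design, not the paper's):
# an HTTP endpoint toggles a GPIO-wired relay when the voice-assistant cloud
# posts an "on"/"off" command.
from flask import Flask, request
import RPi.GPIO as GPIO

SOCKET_PIN = 17                      # assumed GPIO pin wired to the socket relay
GPIO.setmode(GPIO.BCM)
GPIO.setup(SOCKET_PIN, GPIO.OUT)

app = Flask(__name__)

@app.route("/socket", methods=["POST"])
def switch_socket():
    state = request.get_json().get("state", "off")
    GPIO.output(SOCKET_PIN, GPIO.HIGH if state == "on" else GPIO.LOW)
    return {"socket": state}

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)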

Findings: The paper proposes an architecture for a new intelligent home service for users
through machine-learning applications. The system is highly feasible for completing smart
home control through machine learning using Google Home voice commands, a Raspberry Pi, and a
Smart Bluetooth Socket.

Articles for the next report:

1. Development of a voice-control smart home environment
2. Implementation of voice control interface for smart home automation system
3. Voice control for smart home automation: Evaluation of approaches and possible
architectures
4. Controlling electronic devices remotely by voice and brain waves
5. Low cost voice and gesture controlled
