Article 1:
Purpose of the study: This research aims to show the similarities and differences between
human and machine speech recognition. It also surveys prominent speaker-modeling techniques
that have emerged in the last decade for automatic systems.
Dataset: C. Cortes and V. Vapnik, “Support-vector networks,” Mach. Learning, vol. 20, no. 3,
pp. 273–297, Sept. 1995.
Findings: A substantial amount of work remains to be done to fully understand how the
human brain makes decisions about speech content and speaker identity. From what is known,
however, automatic speaker-recognition systems should focus more on high-level features to
improve performance. Humans effortlessly identify unique traits of speakers they know very
well, whereas automatic systems can only learn a specific trait if a measurable feature
parameter can be properly defined. Automatic systems are better at searching over vast
collections of audio and, perhaps, at more effectively setting aside audio samples that are
unlikely to be speaker matches, whereas humans are better at comparing a smaller subset and
more easily overcoming microphone or channel mismatch.
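The "searching over vast collections" strength described above can be sketched as ranking stored utterance embeddings by cosine similarity to a query and keeping only the top candidates. This is a minimal illustration, not the article's method: the embeddings below are random stand-ins for real speaker features, and the utterance names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(42)

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical database of utterance-level speaker embeddings.
database = {f"utt_{i}": rng.standard_normal(64) for i in range(1000)}

# A query that is a slightly perturbed copy of one stored utterance.
query = database["utt_7"] + 0.05 * rng.standard_normal(64)

# Rank the whole collection and shortlist the most likely speaker matches,
# setting aside everything else.
scores = {utt: cosine(query, emb) for utt, emb in database.items()}
shortlist = sorted(scores, key=scores.get, reverse=True)[:5]
print(shortlist[0])  # the near-duplicate tops the list
```

A human listener would then only need to compare the shortlisted samples, which matches the division of labor the article describes.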
Tarik Sulić Paper Report 1
Article 2:
Title: From simulated speech to natural speech, what are the robust features for emotion
recognition?
Authors: Ya Li, Linlin Chao, Yazhu Liu, Wei Bao, Jianhua Tao
Purpose of the study: This paper aims to investigate the effects of commonly used
spectral, prosody, and voice-quality features on emotion recognition across three types of
corpus, and to find the features that remain robust for emotion recognition with natural speech.
Methods used: Three feature selection methods implemented in Weka are used in this work.
Method I ranks features by their information gain with respect to the emotion class. Methods II
and III evaluate the worth of a subset of features by considering the individual predictive
ability of each feature along with the degree of redundancy between them. The difference
between methods II and III lies in the search strategy: best-first search and genetic search
are adopted in methods II and III, respectively.
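The information-gain ranking of method I can be sketched in a few lines. This is a toy illustration under assumed data, not the paper's setup: the two discretized features and the emotion labels below are invented for the example.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(feature_values, labels):
    """Class entropy minus the expected entropy after splitting on the feature."""
    n = len(labels)
    remainder = 0.0
    for v in set(feature_values):
        subset = [y for f, y in zip(feature_values, labels) if f == v]
        remainder += len(subset) / n * entropy(subset)
    return entropy(labels) - remainder

# Toy data: two discretized features, binary emotion class.
labels  = ["angry", "angry", "neutral", "neutral"]
f_pitch = ["high",  "high",  "low",     "low"]   # perfectly predictive
f_rate  = ["fast",  "slow",  "fast",    "slow"]  # uninformative

features = {"pitch": f_pitch, "rate": f_rate}
ranking = sorted(features,
                 key=lambda name: information_gain(features[name], labels),
                 reverse=True)
print(ranking)
```

Here "pitch" separates the classes perfectly (gain 1 bit) while "rate" carries no information (gain 0), so it ranks first; Weka's InfoGainAttributeEval performs the same computation per attribute.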
Findings: This paper investigates the effects of spectral, prosody, and voice-quality features
on emotion recognition over simulated/acted, elicited, and natural corpora. The experiments
suggest that: a) recognition accuracy decreases as the corpus changes from simulated to
natural, and spectral features alone achieve emotion recognition accuracy comparable to that
of the full feature set; b) there is a clear increase in emotion recognition accuracy from
speaker-independent to speaker-dependent settings, while the difference caused by text context
is small; c) prosody and voice-quality features are robust for emotion recognition on the
simulated corpus, while spectral features are robust on the elicited and natural corpora.
Article 3:
Purpose of the study: The main purpose of the study is to bring a better understanding of
speech production. Speech recognition from brain signals can improve voice control, and it can
help people suffering from speech disorders.
Methods used: The experiment, aimed at recognizing voice commands from recorded brain
waves, was carried out with the HTK toolkit, which supports Hidden Markov model (HMM)
training and evaluation alongside various other functions.
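HTK itself is a C toolkit, but the evaluation step at the heart of HMM-based recognition, scoring an observation sequence against a trained model, can be sketched with the forward algorithm. The two-state, two-symbol model below is purely illustrative; its probabilities are not taken from the article.

```python
import numpy as np

def forward_log_likelihood(pi, A, B, obs):
    """Forward algorithm: log P(obs | HMM) for a discrete-emission HMM.

    pi  -- initial state probabilities, shape (S,)
    A   -- state transition matrix, shape (S, S)
    B   -- emission matrix, shape (S, V)
    obs -- sequence of observation symbol indices
    """
    alpha = pi * B[:, obs[0]]           # initialize with the first observation
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]   # propagate and absorb the next symbol
    return np.log(alpha.sum())

# Hypothetical 2-state, 2-symbol model (numbers are illustrative only).
pi = np.array([0.6, 0.4])
A  = np.array([[0.7, 0.3],
               [0.4, 0.6]])
B  = np.array([[0.9, 0.1],
               [0.2, 0.8]])
print(forward_log_likelihood(pi, A, B, [0, 1, 0]))
```

In a recognizer, one such score is computed per word model and the highest-scoring model wins; HTK additionally rescales alpha at each step to avoid underflow on long sequences.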
Dataset: The dataset was recorded by the authors. The whole database consists of a set of
EEG signals, audio recordings, and video recordings. The audio signal, recorded at a 48 kHz
sample rate, was used for time-labelling the spoken words.
Findings: Evaluation of the EEG-based speech recognition system indicates low classification
accuracy. HMM models trained on the whole available spectrum achieve only up to 1.2 %
successful classification. Splitting the database into left and right hemispheres does not
improve recognition. Decomposing the signals into five frequency bands brings better results,
with the highest classification accuracy recorded in the alpha, beta, and theta bands.
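The frequency-band decomposition mentioned above can be sketched with a simple FFT-based band-power estimate. The band edges follow conventional EEG definitions, and the "EEG" signal here is synthetic (a 10 Hz alpha-band sine plus noise); neither comes from the article's data.

```python
import numpy as np

FS = 256  # sampling rate in Hz (illustrative, not the article's)
BANDS = {"delta": (0.5, 4), "theta": (4, 8), "alpha": (8, 13),
         "beta": (13, 30), "gamma": (30, 45)}

def band_power(signal, fs, lo, hi):
    """Mean power of the signal within [lo, hi) Hz, estimated via the FFT."""
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / fs)
    spectrum = np.abs(np.fft.rfft(signal)) ** 2
    mask = (freqs >= lo) & (freqs < hi)
    return spectrum[mask].mean()

# Synthetic "EEG": a dominant 10 Hz (alpha-band) oscillation plus noise.
t = np.arange(FS * 4) / FS
rng = np.random.default_rng(0)
eeg = np.sin(2 * np.pi * 10 * t) + 0.1 * rng.standard_normal(t.size)

powers = {name: band_power(eeg, FS, lo, hi) for name, (lo, hi) in BANDS.items()}
print(max(powers, key=powers.get))
```

Feeding per-band features like these into separate classifiers is one plausible way to compare recognition accuracy across bands, as the article does.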
Article 4:
Purpose of the study: In this paper, the authors address the voice activity detection task
through machine learning by using a discriminative restricted Boltzmann machine (DRBM).
Methods used: They used an RBM (restricted Boltzmann machine), which is a stochastic
neural network that is able to generate data according to a probability distribution.
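The generative behavior of an RBM comes from alternating Gibbs sampling between its visible and hidden layers. The sketch below shows one such step for a binary RBM with assumed toy dimensions and random weights; it is a conceptual illustration, not the authors' DRBM.

```python
import numpy as np

rng = np.random.default_rng(1)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gibbs_step(v, W, b_h, b_v):
    """One RBM Gibbs step: sample hidden units given visible, then reconstruct.

    v   -- binary visible vector, shape (V,)
    W   -- weight matrix, shape (V, H)
    b_h -- hidden biases, shape (H,)
    b_v -- visible biases, shape (V,)
    """
    p_h = sigmoid(v @ W + b_h)                     # P(h = 1 | v)
    h = (rng.random(p_h.shape) < p_h).astype(float)
    p_v = sigmoid(h @ W.T + b_v)                   # P(v = 1 | h)
    return (rng.random(p_v.shape) < p_v).astype(float), p_h

# Toy dimensions: 6 visible units (e.g. spectral bins), 3 hidden units.
W = 0.1 * rng.standard_normal((6, 3))
v0 = rng.integers(0, 2, size=6).astype(float)
v1, p_h = gibbs_step(v0, W, np.zeros(3), np.zeros(6))
print(v1)
```

The discriminative variant adds class units alongside the visible layer so that P(class | input) can be read off directly, which is what makes it usable as a voice activity detector.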
Findings: From the conducted experiments, the authors found that DRBM-based VADs
slightly outperformed the LTSD detector, which is usually considered a benchmark for
detectors, and considerably outperformed G.729-B and G.729-II, VADs used in industry.
Additionally, simulations show that the DRBM can properly handle correlated inputs.
Article 5:
Title: Voice recognition by Google Home and Raspberry Pi for smart socket control
Purpose of the study: The authors use Google Home's voice recognition together with
machine learning to demonstrate the feasibility of fulfilling users' needs with a
machine-learning-based smart-home design.
Methods used: They used Google Home voice-control instructions to interpret the meaning
of commands. A Raspberry Pi then sent the signal to drive the Smart Bluetooth Socket or to
control the relevant electrical appliances. All manipulations were recorded in a cloud
database for future analysis applications.
Findings: The paper proposes an architecture for a new intelligent family service built on
machine-learning applications. The system is highly feasible for completing smart-home
control through machine learning using Google Home voice commands, a Raspberry Pi, and the
Smart Bluetooth Socket.