
Combining Voice and Image Recognition

for Smart Home Security System

Hung-Te Lee, Rung-Ching Chen(&), and Wei-Hsiang Chung

Department of Information Management, Chaoyang University of Technology,
Taichung, Taiwan
crching@cyut.edu.tw

Abstract. Thanks to improvements in techniques and hardware, humans can handle far more data than before. Machine learning is one way to achieve artificial intelligence: different algorithms process different types of data, letting a system analyze, learn from, and process data the way a human does. Artificial neural networks come in many types; the recurrent neural network, for example, is good at handling sequential data, which suits speech recognition, since speech is continuous and different combinations of sounds carry different meanings. The Raspberry Pi is an excellent Internet of Things device because it is a micro-computer on which an operating system can be installed. It works as a base that can be extended with different modules: to collect image information we can install the lens module, and to obtain sound data we need to install the microphone module. The current Raspberry Pi already has Wi-Fi and Bluetooth, so it is one of the favorite choices for IoT devices. In this paper, we apply face recognition and speech recognition to daily life, using a Raspberry Pi to build a home security system. We propose an architecture that combines face recognition and voice recognition to develop a home security system. Through face and voice recognition, we double-check whether the identity is correct, improving the convenience and security of the smart home environment. The Raspberry Pi is the kernel of the home security system; it uses the lens module to collect face information and the microphone module to collect voice information, and uploads the data to the Face API and the speaker recognition system. The Face API is used to recognize the face information; the Speech API is used to identify the voice information. When the identity returned by the Face API equals the identity returned by the Speech API, the verification requirement is met, and the system continues to the next step of controlling devices, such as an electronic lock or other devices that support remote control.

Keywords: Face recognition · Speaker recognition · Smart home · Internet of Things · Raspberry Pi

© Springer Nature Singapore Pte Ltd. 2019


J. C. Hung et al. (Eds.): FC 2018, LNEE 542, pp. 212–221, 2019.
https://doi.org/10.1007/978-981-13-3648-5_25

1 Introduction
1.1 Background
The Internet of Things (IoT) is a hot topic these days; it is quickly making the world more intelligent, and scenes from science-fiction movies may already be close to our daily life. In the world of IoT, sensors are everywhere, collecting large amounts of data to improve human life. The question of how to use these data to help people is what makes Artificial Intelligence (AI) a hot topic. AI lets computers learn from data, which we call "machine learning": the computer acquires data and works with it the way a human would. AI has been through low ebbs before, but advances in technique have revived it and brought it back into the spotlight.
AI has many applications, such as natural language processing, speech recognition, and image recognition. Each application uses different techniques, and the most discussed at the moment is the neural network (NN). Because one type of NN cannot always process every kind of application well, people have evolved different types of NN for various tasks.
Inevitably, each NN is trained to obtain better performance. "Deep learning" is also a hot topic now; its concept is to use multilayer NNs to improve performance, and according to related work, a multilayer NN generally performs better than a single-layer NN.

1.2 Motivation
AI is the current trend, and we want to follow it by using AI to build a system that improves our quality of life. Information security is the most crucial part of IoT; it would be dangerous if anyone could control your smart home devices. To avoid that situation, we combine the concepts of IoT and AI to build an intelligent home security indicator, using AI to filter the users of smart home devices.

1.3 Purpose
In this paper, we use IoT and AI techniques to build a smart home security indicator. In Taiwan, most people still use a traditional key lock or an electronic lock, so they have to carry a key or magnetic fob and risk losing it. Our aim is to build a home access control system based on face and speaker recognition that improves quality of life: people no longer need to carry a key outdoors, avoiding the risk of losing it.

1.4 The Framework of This Paper


This paper is organized as follows. Section 2 provides a brief introduction to IoT, the Raspberry Pi, face recognition, speech recognition, machine learning, artificial neural networks, convolutional neural networks, and recurrent neural networks. Section 3 proposes the method of this paper. Section 4 presents the preliminary experiment. We discuss our future work in Sect. 5.

2 Related Work

To make sure our research is feasible, we surveyed related papers. Ben-Yacoub et al. [1] proposed a method for combined face and speech verification that compares the face and speech matching scores to make the decision (reject or accept). This survey helped us understand how to build our system. We discuss the related techniques in more detail below.

2.1 Internet of Things


As technology advances, especially in hardware, people can handle larger amounts of data. To obtain information more quickly, sensors are given the ability to connect, so that they can send the collected data back in time. Not only sensors but also other devices can connect to the network, making it easier to control them through wireless transmission techniques and making life more convenient than ever. Gubbi et al. [2] introduce what the Internet of Things is and discuss its history, future challenges, and even its safety.

2.2 Raspberry Pi
The Raspberry Pi is a micro-computer on which an operating system can be installed. It works like a base onto which different modules can be added: to collect image information we can install the lens module, and to obtain sound information we need to install the microphone module. Users of the second-generation Raspberry Pi have to install a wireless module to gain Internet connectivity or remote control. The third-generation Raspberry Pi, published on Feb. 29, 2016, already has Wi-Fi and Bluetooth, so it is more convenient than the second generation. Because the Raspberry Pi is easy to expand and program, it has become one of the favorite choices for IoT devices.
Jain et al. [3] present a basic home automation application on the Raspberry Pi that uses local networking or wireless techniques to switch LEDs by reading the subject of an e-mail.

2.3 Face Recognition


There are large pixel differences between an object's edge and the background of an image, so some research uses this property to detect the location of objects. In face recognition, the same method can be used to locate the human face and even facial features such as the eyes, nose, and mouth. If we can detect the position of a face in an image, we can extract its facial features as well. It is impossible for two people to have exactly the same facial features; even twins' features differ slightly. As Fig. 1 shows, everyone has different facial features, and even when they all wear glasses they still do not look like the same person. Because everyone's facial features are different, we can use these features to distinguish people.

Fig. 1. Each person has different facial features

A face recognition system can be divided into three parts: face detection, feature extraction, and face recognition, as in Fig. 2 [4]. In the face detection part, the system has to find the face region of the image to reduce the amount of non-face data to be processed. The feature extraction part collects the feature data of the face region. The final part is face recognition, which uses the feature data to recognize the identity of the person in the image.

Fig. 2. The process of face recognition system

Chihaoui et al. [4] summarize the techniques of 2-dimensional face recognition up to 2016 and divide the methods into three categories: the first category inputs the whole face into the system, the second uses data from facial regions, and the third fuses the first two. These techniques include PCA (principal component analysis), LDA (linear discriminant analysis), SVM (support vector machines), and others. They also mention eigenfaces. The eigenface concept was proposed by Sirovich and Kirby [5] in 1987: every person's face has different features, even among twins, so each face fed into the computation yields a different result. Their work contributed greatly to face recognition. Currently, the most popular technology in the field of face recognition is the convolutional neural network, a type of artificial neural network that outperforms other methods in most situations. We will discuss the convolutional neural network below.
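To make the eigenface idea concrete, the following is a minimal sketch of PCA-based matching over flattened face images. It is not part of the method used in this paper (which relies on the Azure Face API); the array shapes, number of components, and the nearest-neighbour rule are illustrative assumptions.

```python
# A minimal eigenfaces-style sketch: PCA projects flattened face images onto a
# low-dimensional "eigenface" space, and a new face is matched to the closest
# enrolled sample in that space. The data below is a random placeholder.
import numpy as np
from sklearn.decomposition import PCA

faces = np.random.rand(60, 64 * 64)      # 60 flattened 64x64 grayscale face crops
labels = np.repeat(np.arange(6), 10)     # six persons, ten images each

pca = PCA(n_components=20, whiten=True)  # keep 20 eigenfaces
train_proj = pca.fit_transform(faces)    # project the enrolled faces

def identify(face_vector):
    """Return the label of the nearest enrolled face in eigenface space."""
    proj = pca.transform(face_vector.reshape(1, -1))
    distances = np.linalg.norm(train_proj - proj, axis=1)
    return labels[np.argmin(distances)]
```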

2.4 Speech Recognition


Unlike face recognition, computers recognize speech based on the processing of sound information. Each person's speech frequency and tone are different, so even for the same sentence, each person's recorded data will differ; we can therefore use feature extraction to recognize the speaker's identity.
Processing voice information is not as simple as processing image information. Voice data has characteristics such as time and frequency, so the computer has to process the audio data through time-frequency domain conversion.
An audio signal can be described in two parts: the frequency domain and the time domain. The time domain represents the relationship between the signal and time; we can use it to see how the signal changes over time. From the frequency domain, we can obtain the frequency structure of the audio signal. In typical situations, more research focuses on the frequency domain than on the time domain, because the frequency domain usually reveals more features of the audio signal than the time domain.
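As a small illustration of time-to-frequency conversion (not code from this paper), the sketch below applies NumPy's FFT to a synthetic signal; the sampling rate and tone frequencies are arbitrary assumptions.

```python
# A minimal time-to-frequency conversion sketch using NumPy's FFT.
# The sampling rate and tone frequencies below are arbitrary assumptions.
import numpy as np

rate = 16000                                   # samples per second
t = np.arange(0, 1.0, 1.0 / rate)              # one second in the time domain
signal = np.sin(2 * np.pi * 440 * t) + 0.5 * np.sin(2 * np.pi * 880 * t)

spectrum = np.fft.rfft(signal)                 # frequency-domain representation
freqs = np.fft.rfftfreq(len(signal), 1.0 / rate)

# The largest magnitudes appear at the two tones we injected, 440 Hz and 880 Hz.
peaks = np.sort(freqs[np.argsort(np.abs(spectrum))[-2:]])
print(peaks)                                   # approximately [440. 880.]
```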
The first speech recognition systems were attempted in the early 1950s at Bell Laboratories, where Davis, Biddulph, and Balashek developed an isolated-digit recognition system for a single speaker [6]. Gaikwad et al. [7] wrote a review paper that surveys the techniques used in speech recognition systems; their paper helped us understand the field of speech signal processing.
The Fourier transform is an approach to time-frequency domain conversion; it lets researchers study signals conveniently. Mel-frequency cepstral coefficients (MFCC) is a well-known algorithm in audio signal processing that builds on the Fourier transform. Muda et al. [8] used MFCC and dynamic time warping (DTW) techniques to build a voice recognition system. They describe MFCC as the feature extraction step, extracting the data within the human hearing range, while the DTW algorithm measures the similarity between two time series that may vary in time or speed, functioning like a feature-matching step.
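The sketch below illustrates the MFCC-plus-DTW idea described by Muda et al. [8]; it is not the pipeline used in our system, which relies on the Azure Speaker Recognition API. The file names are hypothetical, and librosa is assumed to be available for MFCC extraction.

```python
# A minimal MFCC + DTW sketch: extract MFCC frames for two recordings and
# score their similarity with dynamic time warping (smaller = more similar).
import numpy as np
import librosa

def mfcc_features(path, n_mfcc=13):
    y, sr = librosa.load(path, sr=16000)                        # load and resample
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc).T    # (frames, coeffs)

def dtw_distance(a, b):
    """Classic dynamic time warping over two MFCC frame sequences."""
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(a[i - 1] - b[j - 1])
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return cost[n, m]

ref = mfcc_features("enrolled_speaker.wav")     # hypothetical file names
test = mfcc_features("unknown_speaker.wav")
print(dtw_distance(ref, test))
```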

3 Methodology
3.1 Architecture
In this paper, we propose a structure that combines face recognition and voice recognition to build a home security system. The Raspberry Pi is the kernel of the home security system: it uses the lens module to collect face information and the microphone module to receive voice information, and uploads the data to the Face API and the Speech API. The Face API is used to recognize the face information, and the Speech API is used to identify the voice information. When the identity returned by the Face API equals the identity returned by the Speech API, the verification requirement is met, and the system continues to the next step of controlling devices, such as an electronic lock or other devices that support remote control. Figure 3 shows the architecture of our system.

Fig. 3. The architecture of our system
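The control flow of Fig. 3 can be summarized by the following minimal sketch; every helper function in it is a hypothetical stub standing in for the actual camera, microphone, and API calls.

```python
# A minimal sketch of the control flow in Fig. 3; each helper below is a
# hypothetical stub for the real lens module, microphone module, and API calls.
def capture_face():       return "face.jpg"     # lens module would save a photo
def record_voice():       return "voice.wav"    # microphone module would record
def identify_face(img):   return "person_A"     # placeholder for the Face API call
def identify_speaker(w):  return "person_A"     # placeholder for the Speech API call
def unlock_door(person):  print(f"Unlocking for {person}")

def verify_and_control():
    face_id = identify_face(capture_face())
    voice_id = identify_speaker(record_voice())
    # Both recognizers must point to the same person before any device acts.
    if face_id is not None and face_id == voice_id:
        unlock_door(face_id)                    # e.g. an electronic lock
        return True
    return False

verify_and_control()
```

In the real system, these stubs correspond to the lens module, the microphone module, and the two Azure services described below.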

We have to finish three parts to complete this research: the first part is building the system environment on the Raspberry Pi, the second part is collecting data, including face images and voice files, and the last part is using the Microsoft Azure platform to train the models.

3.2 Environment Building


According to our previous research [9], we chose the Raspberry Pi 3, the camera module, an external microphone, and some other devices for control purposes. We use the Python programming language to build our environment because the Microsoft API development environment supports Python.

3.3 Collecting Data


For face image collection, we chose the Raspberry Pi lens module as our data collection device. For voice collection, we use the external microphone to collect our voice data.
We chose the Microsoft Azure platform [10] as our back-end system; it provides various Application Programming Interfaces (APIs), including face recognition and speaker recognition. We have to sign up for Microsoft Azure to use these APIs.

Microsoft provides web pages explaining their APIs, such as the various functions of the Face API [11] and the description of each function's parameters; the Speech API has its own description page as well [12].
Figure 4 shows the process of the Face API. First, we have to create a person group in which to save the person profiles. The second step is creating the person profiles. In the third step, we upload face images into each person profile; then we train the person group in step four. After completing these four actions, we can perform face recognition.

Fig. 4. The process of Face API
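The following sketch illustrates this four-step flow using the Face API REST endpoints as we understand them from [11]; the region, subscription key, group identifier, and image path are placeholders, and the exact paths and parameters should be checked against the current documentation.

```python
# A sketch of the four-step Face API flow in Fig. 4 (create group, create
# person, add face, train); endpoint region, key, and paths are placeholders.
import requests

ENDPOINT = "https://<region>.api.cognitive.microsoft.com/face/v1.0"   # placeholder
HEADERS = {"Ocp-Apim-Subscription-Key": "<your-face-api-key>"}        # placeholder
GROUP = "home_members"

# Step 1: create a person group to hold the profiles.
requests.put(f"{ENDPOINT}/persongroups/{GROUP}",
             headers=HEADERS, json={"name": "Home members"})

# Step 2: create a person profile inside the group.
resp = requests.post(f"{ENDPOINT}/persongroups/{GROUP}/persons",
                     headers=HEADERS, json={"name": "tester_A"})
person_id = resp.json()["personId"]

# Step 3: upload a face image into the person profile.
with open("tester_A_face.jpg", "rb") as f:                             # placeholder path
    requests.post(f"{ENDPOINT}/persongroups/{GROUP}/persons/{person_id}/persistedFaces",
                  headers={**HEADERS, "Content-Type": "application/octet-stream"},
                  data=f.read())

# Step 4: train the person group; afterwards faces can be identified.
requests.post(f"{ENDPOINT}/persongroups/{GROUP}/train", headers=HEADERS)
```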

The process of the Speech API is shown in Fig. 5. As with the Face API, we first create the personal profiles, then we enroll audio for each profile, and then we can perform speaker recognition.

Fig. 5. The process of Speech API
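A similar sketch for the Speech API flow, based on our reading of the Speaker Recognition documentation [12]; the region, key, and WAV file paths are placeholders, and the endpoint paths should be verified against the current documentation.

```python
# A sketch of the Speaker Recognition flow in Fig. 5 (create profile, enroll,
# identify); endpoint region, key, and file paths are placeholders.
import requests

ENDPOINT = "https://<region>.api.cognitive.microsoft.com/spid/v1.0"    # placeholder
HEADERS = {"Ocp-Apim-Subscription-Key": "<your-speaker-api-key>"}      # placeholder

# Step 1: create an identification profile for a speaker.
resp = requests.post(f"{ENDPOINT}/identificationProfiles",
                     headers=HEADERS, json={"locale": "en-us"})
profile_id = resp.json()["identificationProfileId"]

# Step 2: enroll the profile with a recorded WAV file of the speaker's voice.
with open("tester_A_voice.wav", "rb") as f:                             # placeholder path
    requests.post(f"{ENDPOINT}/identificationProfiles/{profile_id}/enroll",
                  headers={**HEADERS, "Content-Type": "application/octet-stream"},
                  data=f.read())

# Step 3: identify an unknown recording against the enrolled profiles.
with open("unknown_voice.wav", "rb") as f:                              # placeholder path
    requests.post(f"{ENDPOINT}/identify",
                  headers={**HEADERS, "Content-Type": "application/octet-stream"},
                  params={"identificationProfileIds": profile_id},
                  data=f.read())
```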

4 Preliminary Experiment

We chose our lab partners as testers, using their face data and voice data for the experiment, six testers in total. We collected five face images and five voice recordings from each person as our test data.
The Open Source Computer Vision Library (OpenCV) [13, 14] helps us collect face images more easily. OpenCV provides various detection filters, including ones for faces, eyes, mouths, and so on, so we can take a photo whenever a human face is detected. We use the eye and face filters in our system; if we used only the face filter, the misjudgment rate would increase, and the eye filter makes sure we have captured an actual human face. Figure 6 shows an image obtained using the eye and face filters.

Fig. 6. The image captured by using the filters
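A minimal sketch of this capture step (not the exact code of our system), using OpenCV's Haar cascade classifiers for the face and eyes; the camera index and output path are assumptions.

```python
# A minimal face-capture sketch with OpenCV Haar cascades for face and eyes;
# the camera index and output path are assumed values.
import cv2

face_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
eye_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_eye.xml")

cap = cv2.VideoCapture(0)                  # default camera (e.g. the Pi camera)
captured = False
while not captured:
    ok, frame = cap.read()
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    faces = face_cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    for (x, y, w, h) in faces:
        roi = gray[y:y + h, x:x + w]
        eyes = eye_cascade.detectMultiScale(roi)
        # Require at least one eye inside the face region before saving, which
        # reduces false positives compared with the face filter alone.
        if len(eyes) > 0:
            cv2.imwrite("captured_face.jpg", frame[y:y + h, x:x + w])
            captured = True
cap.release()
```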

For voice collection, we use the PyAudio [15] package to help us record voice data. Some PyAudio code is shown in Fig. 7; as it shows, we can modify recording parameters such as the chunk size, sample format, and number of channels.

Fig. 7. Some codes of PyAudio
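Since Fig. 7 is not reproduced here, the following is a minimal PyAudio recording sketch of the kind it describes; the chunk size, format, channel count, sampling rate, duration, and output file name are assumed values.

```python
# A minimal PyAudio recording sketch: record a few seconds from the microphone
# and save it as a WAV file; all parameter values below are assumptions.
import pyaudio
import wave

CHUNK = 1024
FORMAT = pyaudio.paInt16
CHANNELS = 1
RATE = 16000
RECORD_SECONDS = 5

p = pyaudio.PyAudio()
stream = p.open(format=FORMAT, channels=CHANNELS, rate=RATE,
                input=True, frames_per_buffer=CHUNK)

frames = [stream.read(CHUNK) for _ in range(int(RATE / CHUNK * RECORD_SECONDS))]

sample_width = p.get_sample_size(FORMAT)
stream.stop_stream()
stream.close()
p.terminate()

# Save the captured frames as a WAV file, e.g. for upload to the Speech API.
with wave.open("voice_sample.wav", "wb") as wf:
    wf.setnchannels(CHANNELS)
    wf.setsampwidth(sample_width)
    wf.setframerate(RATE)
    wf.writeframes(b"".join(frames))
```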

For both face and speech recognition, we used three files per person to train the model and two files to test. Tables 1 and 2 show the results of the Face API and the Speech API. According to these tables, the APIs made some errors (the red values in the original tables mark the errors): a value of 1 means both test files were accepted, 0.5 means one file was accepted and one was rejected, and 0 means both files were rejected. The API test results surprised us, especially for the Speech API, where the test files of tester E passed every person's verification. We will look for solutions to improve the performance of our system in future work.

Table 1. Face recognition test result


     A    B    C    D    E    F
A    1    0    0    0    0    0
B    0    1    0    0    0    0
C    0    0    1    0    0    0
D    0    0.5  0    0    0    0.5
E    0    0    0    0    1    0
F    0    0    0    0    0    1

Table 2. Speaker recognition test result


     A    B    C    D    E    F
A    1    0    0    0    0    0
B    0    1    1    0.5  1    1
C    0    0    1    0    0    0
D    0    0    0    1    1    0
E    1    1    1    1    1    1
F    0    0.5  0    0    0    1

We use a Bluetooth light as the result display device, using different light colors to let the user know the system's decision. As shown in Table 3, a blue light represents acceptance by the system, and a red light represents rejection.

Table 3. Result display


Accept: blue light    Reject: red light

5 Conclusions and Future Work

IoT technique has more common than before. Information security becomes essential
although IoT can improve life more conveniently. The smart home can bring you
convenient, also can bring you risk. So we build a home security system to enhance our
information security. We have a preliminary result of our research, and there still room
to improve our study, like Speech API accuracy, our system fluency… etc. We look for
the solution to improve the efficiency and complete the system in future.

Acknowledgements. This work was supported by the Ministry of Science and Technology, Taiwan, R.O.C. (Grant Nos. MOST-106-2221-E-324-025 and MOST-106-2218-E-324-002).

References
1. Ben-Yacoub, S., Abdeljaoued, Y., Mayoraz, E.: Fusion of face and speech data for person
identity verification. IEEE Trans. Neural Netw. 10(5), 1065–1074 (1999)
2. Gubbi, J., Buyya, R., Marusic, S., Palaniswami, M.: Internet of Things (IoT): a vision,
architectural elements, and future directions. Future Gener. Comput. Syst. 29(7), 1645–1660
(2013)
3. Jain, S., Vaibhav, A., Goyal, L.: Raspberry Pi based interactive home automation system
through E-mail. In: 2014 International Conference on Reliability Optimization and
Information Technology (ICROIT) (2014)
4. Chihaoui, M., Elkefi, A., Bellil, W., Amar, C.B.: A survey of 2D face recognition
techniques. Computers 5(4), 1–25 (2016)
5. Sirovich, L., Kirby, M.: Low-dimensional procedure for the characterization of human faces.
J. Opt. Soc. Am. A 4(3), 519–524 (1987)
6. Klevans, R.L., Rodman, R.D.: Voice Recognition, 1st edn. Artech House, Inc., Norwood,
MA, USA (1997)
7. Gaikwad, S.K., Gawali, B.W., Yannawar, P.: A review on speech recognition technique. Int.
J. Comput. Appl. 10(3), 16–24 (2010)
8. Muda, L., Begam, M., Elamvazuthi, I.: Voice recognition algorithms using mel frequency
cepstral coefficient (MFCC) and dynamic time warping (DTW) techniques. J. Comput. 2(3),
138–143 (2010)
9. Lee, H.-T., Chen, R.-C., Wei, D.: Building emotion recognition control system using
Raspberry Pi. In: The 6th International Conference on Frontier Computing (FC 2017), Japan
(2017)
10. Microsoft Azure: https://azure.microsoft.com/zh-tw/. Last accessed 21 Feb 2018
11. Face API|Microsoft Azure: https://azure.microsoft.com/zh-tw/services/cognitive-services/
face/. Last accessed 21 Feb 2018
12. Speaker Recognition API|Microsoft Azure: https://azure.microsoft.com/zh-tw/services/
cognitive-services/speaker-recognition/. Last accessed 21 Feb 2018
13. OpenCV Library: https://opencv.org/. Last accessed 21 Feb 2018
14. OpenCV Tutorial: http://monkeycoding.com/?page_id=12. Last accessed 21 Feb 2018
15. PyAudio Documentation-PyAudio 0.2.11 Documentation: http://people.csail.mit.edu/hubert/
pyaudio/docs/. Last accessed 21 Feb 2018
