SIGNAL PROCESSING

algorithms, architectures, arrangements, and applications
SPA 2016
THE INSTITUTE OF ELECTRICAL AND ELECTRONICS ENGINEERS INC. September 21-23rd, 2016, Poznań, POLAND

Multithreaded Java approach
to speaker recognition
Radosław Weychan, Tomasz Marciniak, Adam Dąbrowski
Division of Signal Processing and Electronic Systems
Chair of Control and Systems Engineering
Poznań University of Technology
Poznań, Poland
radoslaw.weychan@put.poznan.pl

Abstract—In this paper an analysis of a multithreaded approach to speaker recognition is presented. Two cases have been investigated – the use of multithreading first, to reduce the total time of training and testing (when processing the speaker database) and second, to reduce the time of a single speaker recognition (e.g. in voice-based access systems). We have investigated the processing time as a function of the number of threads in order to find the optimal number for an 8-core CPU. The obtained results show that the computation time can be reduced even over 5 times for this kind of processor. The prepared implementation is available on a Github source code repository.

Keywords – speaker identification, Java, multithreading

I. INTRODUCTION

Speaker recognition is a subject described in the literature with various applications, e.g., general purpose access, high-level customization, and speaker diarisation.

The main idea of speaker recognition is to extract individual features from the speaker voice and generate a model, which guarantees further ability to distinguish individual speakers. The most common features extracted from voice are mel-frequency cepstral coefficients (MFCC) [1], computed according to the following formula:

    MFCC(n) = Σ_{m=1}^{M} log(E_m) cos(π n (m − 0.5) / M),   n = 1, …, N      (1)

where:
N – number of cepstral features,
M – number of mel-scale filters in the filterbank,
E_m – mel-scale spectral magnitudes.

Typically 13 cepstral coefficients are extracted from a speech frame of length from 10 to 30 ms (depending on the sampling rate of the analog to digital conversion).

Having a set of those features, collected from a recording of appropriate length (at least 20 seconds of speech [2]), a modeling technique can be applied. The basis of most modern modeling techniques in the field of speaker recognition is the Gaussian mixture model (GMM) [2]. The algorithm models a set of numbers (vectors of particular MFCCs in this case) with the use of a weighted mixture of Gaussians and the expectation-maximization (EM) technique. The final form of the formula representing the GMM model is as follows:

    p(x | λ) = Σ_{i=1}^{K} w_i g(x | μ_i, Σ_i)                                (2)

where:
x – a set of MFCCs,
λ – the final model,
g(x | μ_i, Σ_i) – Gaussian component densities (Gaussians),
K – the number of Gaussians,
w_i – the i-th weight,
μ_i – the i-th mean,
Σ_i – the i-th covariance matrix.

The number of Gaussians depends on the application. In the standard GMM algorithm it is usually set between 16 and 64 components [2].

The recognition of a particular unlabeled voice sample lies in the computation of the log-likelihood of the extracted features (MFCCs) using each model in the previously prepared database. The model with the highest obtained value is assumed to be the recognized speaker.

There are also other techniques of modeling: the Gaussian mixture model − universal background model (GMM-UBM) [3], support vector machine (SVM) [4], joint factor analysis (JFA) and i-vectors [5], but in most cases GMM is the basis for them.

II. SPEECH PROCESSING AND MODELING TOOLKITS

Although the number of papers and experiments in the field of speaker recognition is rather large, there are only few tools which can be used for developing a speaker recognition system. One of the most popular is the Matlab environment with toolboxes like VoiceBox [6] (Vector Quantization [7], Gaussian Mixture Models) or MSR Identity Toolbox [8] (GMM-UBM, i-vectors), but it is a closed environment for scientific purposes. There are also C++ toolboxes like HTK [9] (Hidden Markov Models) or Alize [10] (GMM-UBM), but they are dedicated for scientific purposes and are developed more as general systems than as libraries which can be used in various scenarios. More flexible in this case is Python's scikit-learn [11] package, which includes a very good implementation of GMM, but it can be used only on machines which are able to run Python. In the case of Java, which can be run on most modern devices, there is the WEKA [12] toolkit, but it is hard to use from the practical


point of view (no direct access to speaker model parameters when using GMM). Another fact is that almost none of the presented toolkits makes use of multithreading. The only package which uses multithreading is MSR Identity Toolbox, but it cannot be used outside Matlab. The number of papers related to parallel implementations of speaker recognition is also very limited; the presented solutions are based on ineffective methods like VQ [13] and were not shared to be used by others [14].

The main idea of this paper is to measure the reduction of time of the recognition process while using multithreading, as a function of the number of threads, in two scenarios – as a technique for reducing the total time of the training and testing part of speaker recognition when using large speaker databases, and as a technique for reducing the time of a single speaker recognition in, e.g., access systems. This is especially relevant where modern CPUs have up to 8 or even more cores. The proposed approach is based on our previous experiments and the solutions presented in [15, 16, 17, 18, 19]. We used the Java programming language, as it is very intuitive to use and can be used on most of modern platforms. ANSI C and C++ are faster in general, but the codes written in these languages are CPU-dependent, in contrast to Java. The second reason is that Java has a very extensive multithreading library. Thus it can be used on all machines, including small computers, smartphones and tablets.

III. CONCURRENCY IN JAVA

The Java virtual machine can handle processes and threads. The system automatically assigns time slots for them. The basic way of creating a new thread is to put the task code inside the run() method of a class which implements the Runnable interface:

    class MyRunnableClass implements Runnable {
        public void run() {
            //the task code
        }
    }

In order to run the thread, a new instance of the class must be created and set as a parameter of a new Thread object:

    Runnable r = new MyRunnableClass();
    Thread newThread = new Thread(r);
    newThread.start();

There are many solutions for synchronization between threads. Primitive data can be declared as volatile, which is information for the compiler and the Java virtual machine that it may be modified by more than one thread at a time. Critical sections can be blocked for other threads with the use of the lock() method from the ReentrantLock class, or with the use of the synchronized keyword. Specific classes and interfaces related to the concurrency issue are supported by the java.util.concurrent package. The low-level concurrency methods can, in most cases, be replaced by high-level mechanisms like queues, used in tasks of coordination of many threads.

In the java.util.concurrent package the automatically synchronized queue is a blocking queue. It causes blocking of a thread while attempting to put an element if the queue is full, or while attempting to remove an element if the queue is empty. There are a few kinds of blocking queues in Java, like ArrayBlockingQueue, LinkedBlockingQueue, PriorityBlockingQueue, DelayQueue and SynchronousQueue. The ArrayBlockingQueue was chosen in this case as the best solution. It is a bounded FIFO blocking queue backed by an array; the length of the array must be set in advance, in contrast to LinkedBlockingQueue. The queue allows for safe data exchange between threads, which are commonly called producers and consumers. The producer threads put elements (results) in the queue, while other (dedicated) consumer threads take them (removing them from the queue) and process them in other ways (sending, displaying, etc.). The queue automatically controls the flow of the work: if one set of threads works slower than the second set, the latter must wait for the first one to finish its job. The producer-consumer scenario was chosen as the basis for further improvements due to its functionality and simplicity.

IV. SOFTWARE DESCRIPTION

The main part of the research was to develop the code related to all steps of speaker recognition: reading unprocessed wav files, computing the MFCC set from the speech, modeling the coefficients with the use of the GMM-EM algorithm initialized by K-means++, and, in the case of the testing part, computing the log-likelihood. The provided implementation of GMM is based on the implementation available in the scikit-learn Python package; our previous experiments showed that the training time was lower in the case of the Python implementation, as it uses C++ code in the time consuming maximization parts of the K-means and GMM algorithms. The presented software was written in pure Java, where no C++ parts were used, so it can be run on all machines, even tablets and smartphones. The written software uses the FFT function from the jtransforms library, which is also very fast.

The presented software contains three independent parts: offline training and offline testing with the use of speaker databases, and an approach for fast single speaker recognition which can be used in real-time scenarios, e.g. access systems. As the speaker recognition task does not require a special order of data acquisition in the training and testing part, no priority for the training/testing data is needed. As long as the threads do not share any resources and are not dependent on each other, they do not have to be synchronized; the situation is much more complicated when the threads are processing the same data source, which then must be synchronized. The detailed data will be presented in the following sections.
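The producer-consumer scheme described above can be sketched in a few lines. The following minimal example is illustrative only (the class name and the summing task are not taken from the paper's repository): one producer fills a bounded ArrayBlockingQueue with integers, two consumers drain it, and a sentinel value stops the consumers.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class ProducerConsumerDemo {
    // Bounded, automatically synchronized FIFO queue backed by an array.
    static final BlockingQueue<Integer> queue = new ArrayBlockingQueue<>(20);
    static final int POISON = -1; // sentinel telling a consumer to stop

    public static void main(String[] args) throws InterruptedException {
        final int nConsumers = 2;
        final long[] partialSums = new long[nConsumers];
        Thread[] consumers = new Thread[nConsumers];
        for (int i = 0; i < nConsumers; i++) {
            final int id = i;
            consumers[i] = new Thread(() -> {
                try {
                    while (true) {
                        int item = queue.take();   // blocks if the queue is empty
                        if (item == POISON) break; // stop signal
                        partialSums[id] += item;   // "process" the item
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            consumers[i].start();
        }
        // Producer: put() blocks if the queue is full, so the flow is self-regulating.
        for (int i = 1; i <= 100; i++) queue.put(i);
        for (int i = 0; i < nConsumers; i++) queue.put(POISON);
        for (Thread t : consumers) t.join();
        System.out.println(partialSums[0] + partialSums[1]); // sum of 1..100
    }
}
```

Regardless of how the items are interleaved between the two consumers, the printed total is always 5050, which is what makes the blocking queue safe without any explicit locking in user code.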

The multithreaded offline speaker training runs one thread for consequently filling the queue with training files, and 2 or more threads for taking a file handle from the queue and processing it (reading the wav file, computing MFCC and modeling them with the use of GMM). The queue and the cooperating producer and consumer threads are presented in Fig. 1.

Fig. 1. Scheme of multithreaded speaker training

The multithreaded offline speaker recognition testing works similarly to the previous one. One thread is used for consequently filling the queue with testing files, and 2 or more threads for taking a file handle from the queue, processing it (computing MFCC and the log-likelihood under each model the saved database consists of) and saving the result. It is presented in Fig. 2. The result variable is shared between threads, and the access to it must be synchronized.

Fig. 2. Scheme of multithreaded speaker testing

The last solution, for multithreaded real-time testing, suitable only for testing a single sentence, is presented in Fig. 3. One thread is used for consequently filling the queue with the models to be tested, and 2 or more threads for computing the log-likelihood. The database (ArrayList<> of speaker models) is shared among threads and must be synchronized individually.

Fig. 3. Scheme of multithreaded speaker testing solution for single sentence testing

Figure 4 presents the class diagram of the project. The dotted lines represent the dependency relationship, while the solid lines represent the association relationship. FileEnumerationTask is a producer thread enumerating files and putting them into the queue declared in the SpeakerTrainingTest class (which includes the main method). SpeakerModel is only a container for data (speaker ID, means, variances, weights) and provides the ability to compute the log-likelihood of given data under its own model. Saving all data (speaker ID and GMM parameters) is done with the use of the serialization method. The core of the MFCC, GMM and K-Means classes are the final classes Matrixes and Statistics, containing static methods for manipulating arrays and basic algorithms. They were placed at the top because of the same level of significance (processing of the audio signal) for the speaker recognition task.

Fig. 4. Class diagram of the speaker training project
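Saving the trained database with the serialization method mentioned above can be sketched as follows. This is a minimal, hedged example: the SpeakerModel class here is a reduced stand-in (speaker ID plus a weight vector), not the full class from the paper's repository, and the byte stream stands in for the database file.

```java
import java.io.*;
import java.util.ArrayList;

public class SerializationDemo {
    // Reduced stand-in for a speaker model: speaker ID + GMM parameters.
    static class SpeakerModel implements Serializable {
        final String speakerId;
        final double[] weights;
        SpeakerModel(String speakerId, double[] weights) {
            this.speakerId = speakerId;
            this.weights = weights;
        }
    }

    public static void main(String[] args) throws Exception {
        ArrayList<SpeakerModel> db = new ArrayList<>();
        db.add(new SpeakerModel("spk001", new double[]{0.4, 0.6}));

        // Serialize the whole database to a byte stream (a file in the real system).
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(db);
        }
        // Deserialize it back and check that the speaker ID survived the round trip.
        try (ObjectInputStream in = new ObjectInputStream(
                new ByteArrayInputStream(bytes.toByteArray()))) {
            @SuppressWarnings("unchecked")
            ArrayList<SpeakerModel> restored = (ArrayList<SpeakerModel>) in.readObject();
            System.out.println(restored.get(0).speakerId);
        }
    }
}
```

Because ArrayList and the model fields are all Serializable, a single writeObject() call is enough to persist the whole database in one shot.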

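The testing step (computing the log-likelihood of the extracted features under each stored model and picking the maximum) can be illustrated with a simplified scorer. The sketch below assumes diagonal covariance matrices and toy single-Gaussian models; all names are illustrative and are not the classes from the paper's repository.

```java
public class GmmScoringDemo {
    // Log-likelihood of a set of feature frames under a diagonal-covariance GMM:
    // weights w[i], means mu[i][d], variances var[i][d].
    static double logLikelihood(double[][] frames, double[] w, double[][] mu, double[][] var) {
        double total = 0.0;
        for (double[] x : frames) {
            double p = 0.0;
            for (int i = 0; i < w.length; i++) {          // weighted sum over Gaussians
                double logG = 0.0;
                for (int d = 0; d < x.length; d++) {
                    double diff = x[d] - mu[i][d];
                    logG += -0.5 * (Math.log(2 * Math.PI * var[i][d]) + diff * diff / var[i][d]);
                }
                p += w[i] * Math.exp(logG);
            }
            total += Math.log(p);                          // log-likelihood of one frame
        }
        return total;
    }

    public static void main(String[] args) {
        // Two toy one-Gaussian "speaker models" in a 2-D feature space.
        double[] w = {1.0};
        double[][] muA = {{0.0, 0.0}}, muB = {{5.0, 5.0}};
        double[][] var1 = {{1.0, 1.0}};
        double[][] frames = {{0.1, -0.2}, {0.05, 0.1}};    // features close to model A
        double scoreA = logLikelihood(frames, w, muA, var1);
        double scoreB = logLikelihood(frames, w, muB, var1);
        System.out.println(scoreA > scoreB ? "A" : "B");   // arg-max = recognized speaker
    }
}
```

The frames lie near the mean of model A, so the arg-max over the two scores identifies speaker A; in the real system this comparison runs over all 630 stored models.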
TrainingTask, in turn, contains the steps for computing the speaker model (consumer task). The presented scheme can be reused in a number of various solutions in a very simple way.

V. THE ANALYSIS OF PROCESSING TIME FOR TRAINING AND TESTING PART

The experiments were performed on a PC equipped with a CPU of 8 cores (Intel Core i7 @ 2.93 GHz). The software was developed in the Eclipse MARS IDE for Java Developers. It has to be noted that there is no possibility of manual assignment of a thread to a CPU core; this is done automatically by the Java virtual machine and the operating system, as the Java virtual machine assigns the various tasks to the available time slots of the CPU. The operating system also consequently performs many operations which interrupt each other, and it is impossible to switch off all of them; thus the presented results may vary each time and only the general tendency has to be taken into consideration.

All tests were performed with the use of the TIMIT speaker database, which contains 10 short various utterances of 630 speakers recorded with the sampling rate fs = 16 kSps. It has to be noted that the average duration of one TIMIT file is equal to 3.07 s. In order to select the best processing parameters, a set of experiments with the learning and parameterization techniques was performed. The number of Gaussians was set to 32 and the number of MFCCs to 13. The resolution of the FFT depended on the sampling rate and was equal to 256. The overall recognition rate was 96.4 %, which proves the correctness of the implementation of the machine learning algorithms. As the recognition rate is not the aim of the paper, the subsequent experiments do not provide this parameter.

The first experiment was related to the training part, where the total time of computing the speaker models was measured; 1 file of each speaker was chosen for the training step, and 9 for testing. Each experiment was performed with a different number of threads – from 1 to 100. The results are presented in Fig. 5. Without multithreading, the total consumption time was relatively high – almost 1050 s (about 17 minutes). With the use of multithreading, the total time of the training part decreases even over 5 times. A number of threads higher than 9 does not significantly decrease the computation time, as the number of CPU cores was 8. In the book [20] the optimal number of threads N_threads (for an application without communications with external sources) was described as

    N_threads = N_cpu + 1                                                     (3)

Fig. 5. Influence of number of threads on total training time (offline case)

In the same scenario, the time of processing a single file was measured individually for each processed file, irrespective of the number of threads. The results are presented in Fig. 6. Without multithreading, the processing time of a single file was about 1.9 s. The time of processing a single file increases, in general, in every case with more than 1 thread, because additional time for managing the tasks by the system (CPU time slots) is necessary, especially when the number of running threads is greater than the number of cores in the CPU. This has nothing in common with the total time of training as in Fig. 5, which still decreases because the files are processed concurrently. It has to be noticed that the length of the queue does not influence the result, but it should not be lower than the number of consumer threads; in most cases it was set to 20, but in the case of 100 threads it had to be set to 100.

Fig. 6. Influence of number of threads on processing time of single file (offline case)
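The rule of Eq. (3) can be applied directly at run time by querying the number of available cores. A minimal sketch (illustrative, not part of the paper's code):

```java
public class ThreadCountDemo {
    public static void main(String[] args) {
        // N_threads = N_cpu + 1, for tasks with no external communication (Eq. 3)
        int nCpu = Runtime.getRuntime().availableProcessors();
        int nThreads = nCpu + 1;
        System.out.println(nThreads >= 2); // at least one core is always reported
    }
}
```

On the 8-core machine used in the experiments this rule yields 9 threads, which matches the saturation point observed in Fig. 5.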

The second experiment was related to the testing part, where 5670 files of 630 speakers were compared against the database of 630 speakers, i.e. all 5670 files were tested. The processing parameters were the same as in the first test (32 GMM, 13 MFCC, 256-point FFT). In this case, the total time of the testing part can also be decreased over 5 times. The results are presented in Fig. 7. The general tendency is very similar to the result from the previous experiment: the best result was obtained for the number of threads equal to the number of the CPU cores, and the function saturates around the 8th thread.

Fig. 7. Influence of number of threads on total testing time (offline case)

In the same scenario the time of testing a single file against 630 speakers was about 1.07 s; the results for the time of processing a single file are presented in Fig. 8. The general tendency of increasing processing time can be noted, which is a consequence of the same process described for the previous experiment. The number of operations performed for a single tested sentence seems to be smaller in this case (only the computation of MFCC and log-likelihood), but the number of comparisons can be very high, depending on the size of the speaker database. Still, 630 comparisons for a single file took less time than the average sentence duration (in the case of a single thread).

Fig. 8. Influence of number of threads on processing time of single file (offline case)

The last experiment differed from the previous ones – the queue was filled with the speaker models instead of the files. This queue-related approach is suitable for single file/sentence processing, e.g. in security applications, in which the speaker databases can be even larger than 630 speakers. Without multithreading, the processing of a single (average) sentence took about 3 s. The use of a queue for storing the models has decreased the recognition time up to 3.5 times: the processing time of an utterance of average duration fell to about 0.7 s (similar to the training part) instead of 3 s. The results for this case are presented in Fig. 9.

Fig. 9. Influence of number of threads on processing time of a single sentence (real-time case)

VI. CONCLUSIONS

Multithreading can be used in two ways: as a technique to speed up the general training and testing parts and, moreover, to speed up a single recognition, e.g. in the security applications. In the first case, the total time of training and testing has been reduced over 5 times with the use of an 8-core CPU and 8-9 threads. In the second case, the acceleration was about 3.5 times. In comparison to the Python implementation of computing the probability model, the Java approach is much faster – about 1.5 times – while the total training time was at the same level, which makes the Python implementation hard to use in real-time scenarios. It has to be noted that the solution for synchronization was not optimal in the authors' opinion and needs to be revised: in other cases (training/testing with the use of all input sentences, as in the previous experiments), the queue may be filled again with models already processed when the current recognition part has not finished and the next file has started to be processed – in this case the number of threads may increase unexpectedly. Thus, other synchronization techniques have to be applied.

The developed software allowed for real-time recognition with the use of speaker databases containing even thousands of speakers. The presented software is available on the Github source code repository [21].

ACKNOWLEDGMENT

This work was prepared within the DS-2016 project.

REFERENCES

[1] F. Zheng, G. Zhang, Z. Song, “Comparison of different implementations of MFCC,” Journal of Computer Science and Technology, vol. 16, no. 6, 2001, pp. 582-589.
[2] D. A. Reynolds, R. C. Rose, “Robust text-independent speaker identification using Gaussian mixture models,” IEEE Trans. Speech Audio Proc., vol. 3, no. 1, 1995, pp. 72-83.
[3] D. A. Reynolds, T. F. Quatieri, R. B. Dunn, “Speaker verification using adapted Gaussian mixture models,” Digital Signal Processing, vol. 10, 2000, pp. 19-41.
[4] W. M. Campbell, D. E. Sturim, D. A. Reynolds, “Support vector machines using GMM supervectors for speaker verification,” IEEE Signal Processing Letters, vol. 13, no. 5, 2006, pp. 308-311.
[5] P. Kenny, G. Boulianne, P. Ouellet, P. Dumouchel, “Joint factor analysis versus eigenchannels in speaker recognition,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 15, no. 4, 2007, pp. 1435-1447.
[6] M. Brookes, Voicebox: Speech processing toolbox for Matlab. [Online]. Available: http://www.ee.ic.ac.uk/hp/staff/dmb/voicebox/voicebox.html
[7] C. Bishop, Pattern Recognition and Machine Learning, Springer, 2006.
[8] S. O. Sadjadi, M. Slaney, L. Heck, “MSR Identity Toolbox v1.0: A MATLAB toolbox for speaker recognition research,” Speech and Language Processing Technical Committee Newsletter, 2013.
[9] Cambridge University Engineering Department, “HTK: The Hidden Markov Model Toolkit.” [Online]. Available: http://htk.eng.cam.ac.uk/
[10] A. Larcher, J.-F. Bonastre, B. Fauve, K.-A. Lee, C. Levy, H. Li, J. Mason, J.-Y. Parfait, “Alize 3.0 – open source toolkit for state-of-the-art speaker recognition,” 2013. [Online]. Available: http://mistral.univ-avignon.fr/index_en.html
[11] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, E. Duchesnay, “Scikit-learn: Machine learning in Python,” Journal of Machine Learning Research, vol. 12, 2011, pp. 2825-2830.
[12] M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, I. H. Witten, “The WEKA Data Mining Software: An Update,” SIGKDD Explorations, vol. 11, issue 1, 2009.
[13] S. Zhanjiang et al., “Parallel Implementation of a VQ-Based Text-Independent Speaker Identification,” Advances in Information Systems, Springer, 2005, pp. 291-300.
[14] T. Herbig, F. Gerl, W. Minker, Self-Learning Speaker Identification: A System for Enhanced Speech Recognition, Springer Science & Business Media, 2011.
[15] R. Weychan, T. Marciniak, A. Dąbrowski, “Real Time Recognition of Speakers from Internet Audio Stream,” Foundations of Computing and Decision Sciences, vol. 40, issue 3, 2015, pp. 223-233, DOI: 10.1515/fcds-2015-0014.
[16] T. Marciniak, R. Weychan, A. Krzykowska, A. Dąbrowski, “Speaker recognition based on telephone quality short Polish sequences with removed silence,” Przeglad Elektrotechniczny, vol. 88, 2012.
[17] A. Krzykowska, T. Marciniak, R. Weychan, S. Drgas, A. Dąbrowski, “Speaker recognition based on short Polish sequences,” in Signal Processing: Algorithms, Architectures, Arrangements, and Applications (SPA), Poznan, 2011, pp. 95-98.
[18] T. Marciniak, A. Krzykowska, R. Weychan, “Influence of silence removal on speaker recognition based on short Polish sequences,” in Signal Processing: Algorithms, Architectures, Arrangements, and Applications (SPA), Poznan, pp. 42-46.
[19] R. Weychan, A. Stankiewicz, T. Marciniak, A. Dąbrowski, “Implementation aspects of speaker recognition using Python language and Raspberry Pi platform,” in Signal Processing: Algorithms, Architectures, Arrangements, and Applications (SPA), Poznan, September 2015, pp. 162-167.
[20] B. Goetz et al., Java Concurrency in Practice, Addison-Wesley Professional, 2006.
[21] Source code repository for the “Multithreaded speaker recognition project”. [Online]. Available: https://github.com/audiodsp/Java-multithreaded-GMM-speaker-recognition