474 - Moutaman Mirghani

Evaluation of the Quality of Encoded Quran
Digital Audio Recording

Moutaman Mirghani, National Center for Research, mtnmir@gmail.com
Hadeel Osman Madane, Sudan Academy of Science, hadeel4osman@gmail.com
Abstract: Digital audio compression is applied to reduce the data rate required to send an audio stream through band-limited communication networks, as well as to reduce the storage size of recorded audio streams. Lossy audio compression algorithms are common in many applications because they provide higher compression at the expense of fidelity. MPEG-1 and MPEG-2 Audio Layer III, or MP3, is an encoding format for digital audio that uses a form of lossy data compression. It is a common audio format for consumer audio streaming and storage, and a standard of digital audio compression for the transfer and playback of music on the majority of digital audio players. However, the quality of audio encoding depends on the phonemes, i.e. the sounds that distinguish one word from another in a particular language. Most speech models applied in audio compression rely on the phonemes of English rather than those of Arabic. Therefore, audio encoding of the Holy Quran for recording on digital media is strongly affected by these differences in the models used. Consequently, serious errors could be produced during the playback of Quran Kareem, which may change the meaning and quality of the output stream. The main objective of this paper is to discuss and evaluate encoding methods for speech in Arabic accents, focusing on the deterioration that occurs in producing correct and clear Quran digital audio recordings.

Keywords: Audio, Recording, Playback, Encoding, MP3, Compression, Arabic Language, Holy Quran.

I. INTRODUCTION

Historically, the theory of data compression was formulated by Claude E. Shannon in 1948 in his paper "A Mathematical Theory of Communication" [1]. He proved that there are limits to how much data can be compressed without losing any information. Under such compression, when the compressed (encoded) data is decompressed (decoded), the data stream is identical to the original bit stream. This kind of data compression is known as lossless compression.

On the other hand, audio data does not need to be exactly the same as the original source data. Instead, some amount of distortion (approximation) is tolerated. Lossy compression can be applied to data sources like speech and images, where we do not need all details to distinguish the speaker and understand what he says.

The limit on lossless compression, i.e. the entropy rate, depends on the probabilities of definite bit sequences in the data. It is possible to compress data at a rate close to the entropy rate, but it is mathematically not possible to do better. Note that this entropy limit applies only to lossless compression.

Speech compression, or voice encoding, is used to reduce the data rate or the data storage size. The basic method used in speech compression is linear predictive coding (LPC), which relies on analysis of the speech stream to extract features. Those features are used to compute the coefficients of the filter at the decoding end, which can reproduce the speech when excited by impulses or white noise. Later, other speech encoders, such as CELP and MELP, were developed to overcome the shortcomings of LPC.

Similarly, different audio encoding formats were developed to record and play back speech and audio in general, such as MPEG-1 and MPEG-2 Audio Layer III, which is commonly referred to as MP3. MP3 audio files can be less than a tenth of the size of the original digital stream.

II. PHONETICS IN ARABIC LANGUAGE

Phonetics is the branch of linguistics that deals with the study of the sounds that exist in human speech. Arabic is one of the most widely spoken languages in the world. It is the mother tongue of about 350 million speakers who live in the 22 Arab countries.

Arabic is a Semitic language; other members of this family include Amharic, Tigrinya, Aramaic, Hebrew and Maltese. Arabic is characterized by the existence of unique consonants such as pharyngeal, glottal and emphatic consonants [2]. Moreover, it presents several phonetic and morpho-syntactic particularities.

The alphabet of Arabic consists of 28 letters, which can be extended to a set of 90 by additional shapes, marks, and vowels. Those 28 letters represent the consonants and the long vowels, such as ى and ٱ, which are both pronounced as /a:/, ي, which is pronounced as /i:/, and و, which is pronounced as /u:/. Short vowels and certain other phonetic information, like consonant doubling (shadda in Arabic), are not represented by letters but by means of diacritics.

A diacritic is a short stroke placed above or below a consonant. Table I below shows the complete set of Arabic diacritics. Arabic diacritics are split into three sets: short vowels, doubled case endings, and syllabification marks. Normally, short vowels are written as symbols either above or below the letter in text with diacritics, and dropped altogether in text without diacritics.

In Arabic text, there are three short vowels: fat'ha, which represents the /a/ sound and is written as an oblique dash over the letter; damma, which represents the /u/ sound and has the shape of a comma written over the letter; and kasra, which represents the /i/ sound and is written as an oblique dash under the letter, as seen in Table I.

It is important to realize that what is typically referred to as Arabic is not a single linguistic variety. In fact, it is a collection of different dialects.
Table I. Short Arabic Vowels [5]

Classical Arabic is an older literary form of the language, which can be exemplified by the type of Arabic used in the Holy Quran.

Modern Standard Arabic (MSA) is a version of Classical Arabic with a modernized vocabulary. In general, MSA is the formal standard that is common to all Arabic-speaking countries. It is the language used in the media, i.e. newspapers, radio and TV, and in official speeches, courtrooms, and almost any kind of formal communication. Nevertheless, it is not used for everyday and informal communication, which is usually carried out in one of the local dialects.

Roughly, the dialects of Arabic can be divided into two groups: Western Arabic, which includes the dialects spoken in Morocco, Algeria, Tunisia, and Libya, and Eastern Arabic, which is further subdivided into Egyptian, Levantine, Gulf Arabic and Sudanese [3].

Table II. Arabic Digits Presentation

Several aspects of the Arabic language, like the phonology and the syntax, do not create difficulty for automatic speech recognition. Standard, language-independent techniques for acoustic and pronunciation modeling, such as context-dependent phones, may easily be applied to model the acoustic-phonetic properties of spoken Arabic. See Table II above for the presentation of some Arabic digits.

Generally speaking, the prime difficulties facing the development of an accurate analysis model, as well as speech recognition systems for Arabic, are the wide dialectal variety and the morphological complexity.

III. PRINCIPLES OF MP3 ENCODING

The Moving Picture Experts Group MPEG-1 (and MPEG-2) Layer III standard is usually referred to as the MP3 standard [4]. MP3 is a compressed file format that makes use of psychoacoustics and human perception.

Analog signals are sampled according to the Shannon sampling theorem, which states that the sampling frequency $f_s$ should be at least double the maximum frequency $f_m$ in the signal, i.e.

$$f_s \geq 2 f_m \qquad (1)$$

The minimum such frequency, $2 f_m$, is known as the Nyquist rate. The sampled signal is digitized to produce the regular pulse code modulation (PCM) signal, which is the standard uncompressed digital audio signal. By regular PCM, we mean linear PCM, using linear quantization. Such a digital audio signal is commonly saved in a wave audio file (.wav), which is an uncompressed audio file. An MP3 file (.mp3) is an example of a compressed audio file.

The MP3 encoder takes the Fast Fourier Transform (FFT) of the audio signal over a time window that is a fraction of a second. Each frame of the FFT is then broken into sub-bands based on frequency content, because different variations of the applied algorithms work more accurately on specific frequency ranges.

Then, the maximum amount of data that can be allocated to each frame is calculated based on the chosen bit rate setting. Each signal frame is compared to psychoacoustic models. In general, MP3 audio encoding can be carried out in four stages as follows:

1. The 1st stage of the MP3 encoder divides the sampled audio signal, i.e. the PCM stream, into smaller components in the form of time frames. Then, that signal is passed into a time-frequency mapping filter bank, which divides it into sub-bands. That is done with the aid of a polyphase filter bank that includes a set of pseudo Quadrature Mirror Filters (QMF). Then, the output samples are quantized.

2. The 2nd stage consists of an FFT block of 1024 points, followed by the psychoacoustic model. The concepts of masking and thresholds are utilized to discard data that is considered inaudible. One final output of that model is a signal-to-mask ratio (SMR) for each group of sub-bands.

3. The 3rd stage consists of quantizing and encoding each sample of each sub-band, by calculating a scaling coefficient that is required to represent the signal-to-noise ratio (SNR), which is known as noise allocation. In order to meet both the bit rate and masking requirements simultaneously, this stage looks at both the output samples from the filter bank and the SMRs from the psychoacoustic model, and adjusts the noise allocation.

4. The 4th (and last) stage of the MP3 encoder consists of formatting the bit stream. The quantized outputs from the filter bank, the noise allocation and other required side information are all collected, encoded and formatted, as seen in Fig. 1 below.
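The sampling condition in Eq. (1) and the per-frame data budget mentioned above can be illustrated with a small Python sketch. It is not taken from the paper: the function names are hypothetical, the CD-quality PCM parameters (44.1 kHz, 16 bits, stereo) and the 128 kbit/s setting are illustrative assumptions, and the frame-size formula assumes an MPEG-1 Layer III frame of 1152 samples (hence the factor 144 = 1152/8). Lower sampling rates, such as the 8 kHz used later in this paper, use different frame constants.

```python
def nyquist_rate(f_max_hz):
    """Minimum sampling frequency (Eq. 1) for a signal band-limited to f_max_hz."""
    return 2 * f_max_hz


def pcm_bit_rate(sample_rate_hz, bits_per_sample, channels):
    """Bit rate of uncompressed linear PCM, in bits per second."""
    return sample_rate_hz * bits_per_sample * channels


def mp3_frame_bytes(bit_rate_bps, sample_rate_hz, padding=0):
    """Approximate size in bytes of one frame.

    Assumption: MPEG-1 Layer III, 1152 samples per frame, so the frame
    carries 1152 / 8 = 144 byte-units per (bit_rate / sample_rate).
    """
    return (144 * bit_rate_bps) // sample_rate_hz + padding


if __name__ == "__main__":
    # Illustrative settings only (not values reported in the paper).
    pcm = pcm_bit_rate(44100, 16, 2)   # CD-quality stereo PCM
    mp3 = 128000                       # a common MP3 bit rate setting
    print("Nyquist rate for a 4 kHz signal:", nyquist_rate(4000), "Hz")
    print("PCM bit rate:", pcm, "bit/s")
    print("Compression ratio:", pcm / mp3)
    print("Bytes per MP3 frame:", mp3_frame_bytes(mp3, 44100))
```

Under these assumed settings the PCM stream is about eleven times larger than the MP3 stream, which is in line with the "less than a tenth" figure quoted in the Introduction.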
Fig. 1 Block Diagram of MP3 Encoder

IV. PROBLEM STATEMENT

MP3 players are the most common audio players used today. They are found at home, in cars and on public transport. Those players are sometimes used to play back verses from Quran Kareem that have been encoded in MP3 format.

Unfortunately, there is a distortion when playing back a recorded MP3 file that contains Quran. That is clear in particular to people who know the right way to read the Quran (i.e. Tajweed). One apparent distortion occurs during elongation of speech (i.e. Modood). While elongated vowels are being played, their duration is shortened, even below the natural length (i.e. Mad Tabeiee).

V. METHODOLOGY

In order to evaluate the quality of encoded Quran digital audio recording, the uncompressed original audio stream is compared with the played-back compressed version of the same audio stream. The two streams are plotted, both in the time and frequency domains.

The Discrete Fourier Transform (DFT) is used to represent discrete-time signals in the frequency domain by computing their spectra. The DFT of a discrete signal x(n) of length N is computed as follows

$$X(k) = \sum_{n=0}^{N-1} x(n)\, e^{-j 2\pi n k / N} \qquad (2)$$

The methodology is based on using Matlab to analyze the audio signal of some recorded verses of Quran (or Azan), before and after compression. The analysis is done both in the time and frequency domains so as to compare the two audio streams. The original recording is saved in the uncompressed Waveform Audio File Format, which is commonly known as WAV. A copy of that file is compressed using a standard MP3 encoding program.

A Matlab program is coded to save samples read from a WAV file into a matrix using the built-in wavread function. Then, those samples are plotted to represent the audio signal in the time domain. The built-in fft function is used to compute the Fast Fourier Transform, an efficient algorithm for computing the DFT, of each of the two signals, and the results are plotted to represent the spectra of the audio signals in the frequency domain.

The cross correlation function (CCF) clarifies the similarities and contrasts between two signals. For the discrete signals x(n) and y(n), it is computed as follows

$$R(k) = \sum_{n=-\infty}^{\infty} x^{*}(n)\, y(n+k) \qquad (3)$$

So as to easily evaluate the distortion that occurs due to compression, the cross correlation functions between the uncompressed and compressed audio streams are computed and plotted, both in the time and frequency domains. Correlation functions enable us to clearly visualize similarities and differences between the two audio signals. The Matlab built-in function xcorr can be used to compute the cross correlation function between the signals.

Since the Matlab built-in function wavread does not support the MP3 format, the compressed file should be played back, recorded again and saved in a WAV file. One should not use an MP3-to-WAV converter program to save that file, so as not to alter the played-back compressed audio.

VI. RESULTS

The Matlab code has been run for a WAV audio stream of a short part of the Azan, which has a duration of 10 seconds and a size of 166,444 bytes. The time domain and frequency domain plots shown below are obtained for both the original and compressed audio streams. Those two streams were recorded by a Samsung TAB-4 tablet using the WAV Recorder application under Android, at 8 kHz sampling frequency.

Fig. 2 Original Audio Stream (plot title: Original Audio Signal; axes: Amp, V versus Time, sec)

Fig. 3 Spectrum of Original Audio Stream (plot title: Fourier Transform of Original Signal; axes: Amp versus frequency, Hz)
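As an illustration of Eqs. (2) and (3), the following is a minimal Python sketch of the DFT and of the cross correlation of two finite-length sequences. It is not the authors' Matlab program: the function names dft and cross_correlation are hypothetical, the infinite sum in Eq. (3) is truncated to the samples that actually exist, and the lag convention follows Eq. (3) as written, which may differ from the one used by Matlab's xcorr. The direct O(N^2) sums are only suitable for short test signals.

```python
import cmath


def dft(x):
    """Direct evaluation of Eq. (2): X(k) = sum_n x(n) * exp(-j*2*pi*n*k/N)."""
    N = len(x)
    return [
        sum(x[n] * cmath.exp(-2j * cmath.pi * n * k / N) for n in range(N))
        for k in range(N)
    ]


def cross_correlation(x, y):
    """Eq. (3) for finite sequences: R(k) = sum_n conj(x(n)) * y(n + k).

    Samples outside the sequences are treated as zero. Returns the list of
    lags k and the list of corresponding values R(k).
    """
    lags = list(range(-(len(x) - 1), len(y)))
    values = []
    for k in lags:
        total = 0
        for n in range(len(x)):
            m = n + k
            if 0 <= m < len(y):
                total += complex(x[n]).conjugate() * y[m]
        values.append(total)
    return lags, values


if __name__ == "__main__":
    original = [1.0, 2.0, 3.0]
    delayed = [0.0, 1.0, 2.0, 3.0]   # same signal, delayed by one sample
    lags, r = cross_correlation(original, delayed)
    peak_lag = max(zip(lags, r), key=lambda pair: abs(pair[1]))[0]
    print("Lag of the correlation peak:", peak_lag)
    print("DFT magnitudes:", [round(abs(v), 6) for v in dft(original)])
```

When the second signal is an undistorted, delayed copy of the first, the correlation peak sits at the delay and equals the signal energy; distortion introduced by an encoder shows up as a departure from that autocorrelation-like shape.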
Fig. 4 Compressed Audio Stream (plot title: MP3 Encoded Signal; axes: Amp, V versus Time, sec)

Fig. 5 Spectrum of Compressed Audio Stream (plot title: Fourier Transform of Encoded Signal; axes: Amp versus frequency, Hz)

Fig. 6 Cross Correlation in Time Domain (horizontal axis: Time Shift)

Fig. 7 Cross Correlation in Frequency Domain (horizontal axis: Frequency Shift)

VII. CONCLUSION

In general, there are several techniques to measure the quality of speech and the quality of audio. For example, Perceptual Evaluation of Speech Quality (PESQ), which is an ITU-T recommendation, comprises a test methodology for assessment of the speech quality as judged by users of a telephony system. Also, Perceptual Evaluation of Audio Quality (PEAQ) is a standardized algorithm to measure perceived audio quality, developed by a joint venture of experts.

In this paper, an analytical and graphical evaluation, rather than a totally perceptual one, is carried out in order to examine the distortion that occurs in played-back Quran saved in MP3 audio format. Perceptual assessment is surely important, as it depends on the listener himself. However, when it comes to Quran, listeners are of different backgrounds, and unfortunately many of them cannot even distinguish distorted verses and misleading pronunciation of them.

From the plots shown above, we can clearly notice the changes that occurred between the original audio stream and the compressed stream, both in the time and frequency domains. The computed cross correlation functions clearly show the contrasts between the original and the compressed audio signals: they are far from being the autocorrelation functions of the audio streams.

From the correlation plots, it is clear that the distortion is more obvious in the frequency domain than in the time domain. That coincides with the nature of the human ear, which is less sensitive to phase distortion while it is sensitive to changes occurring in frequency, i.e. in the pitch of the voice.

The work done within this paper needs to be expanded in order to search for better encoding methods for Arabic, and for Quran recording in particular. The proper solution would be to develop voice encoders that are designed for Arabic, rather than using encoders designed to cope with English.

ACKNOWLEDGMENT

The authors would like to acknowledge the staff of the Arabic Language Institute for Non Arabic Speakers in Khartoum, Sudan.

REFERENCES

[1] C. E. Shannon, A Mathematical Theory of Communication, The Bell System Technical Journal, Vol. 27, pp. 379–423, 623–656, July, October, 1948.
[2] I. Zitouni, R. Sarikaya, Arabic Diacritic Restoration Approach Based on Maximum Entropy Models, Computer Speech and Language, 23:257–276, 2009.
[3] K. Kirchhoff, Novel Approach to Arabic Speech Recognition, Proceedings of the International Conference on ASSP, 344–347, 2002.
[4] Scot Hacker, MP3: The Definitive Guide, O'Reilly and Associates, Inc, 2000.
[5] Dimitra Vergyri, Katrin Kirchhoff, Automatic diacritization of Arabic for Acoustic Modeling in Speech Recognition, Semitic '04 Proceeding Workshop, Geneva, 2004.
