Audio-Based Multimedia Indexing and Retrieval Framework in MUVIS
System Overview & Applications
by Serkan KIRANYAZ.
Tampere Univ. of Tech.
MUVIS Overview
[Figure: block diagram of the MUVIS system. Recoverable labels, grouped by application:]
- MBrowser: Query, Browsing, Video Summarization, Display.
- DbsEditor: Encoding-Decoding-Rendering, Database Creation, Database Editing, FeX-AFeX Management; appending and deleting image and MM files; image and MM file type conversions.
- AVDatabase: Capturing, Encoding, Recording, AV Database Creation; appending into databases from real-time video-audio.
- Inputs: still images (*.jpg, *.jp2, *.gif, *.bmp, *.pct, *.pcx, *.png, *.pgm, *.wmf, *.eps, *.tga, *.yuv), real-time video-audio, stored MM (video-audio); an image, a frame, a video-audio clip.
- Database types: AV Database, Image Database, Hybrid Database.
- Feature extraction: FeX Modules and AFeX Modules, attached through the FeX & AFeX API.
- Operations on the databases: Indexing, Retrieval.
MUVIS Multimedia

MUVIS Video
  File Formats:   AVI, MP4
  Frame Size:     Any
  Frame Rate:     1..25 fps
  Codecs:         H263+, MPEG-4, YUV 4:2:0, RGB 24

MUVIS Audio
  File Formats:   MP3, AAC, AVI, MP4
  Channel:        Mono, Stereo
  Sampling Freq.: 8 & 11.025 kHz, 12 & 16 kHz, 22.050 kHz, 24 kHz, 32 kHz, 44.1 kHz
  Codecs:         MP3, AAC, G721, G723, PCM

MUVIS Images
  JPEG, JPEG 2K, BMP, TIFF, PNG, PCX, GIF, PCT, TGA, EPS, WMF, PGM
Audio-Based Multimedia Indexing and Retrieval Framework for MUVIS
- A global framework implemented to achieve a robust and generic solution for audio-based multimedia indexing and retrieval, specifically:
  - Generic support for audio codecs
  - Generic support for file formats
  - Generic support for audio capturing and encoding parameters
  - Generic support for AFeX framework parameters
- The main objective is content-based (speaker, subject, “sounds like..”) retrieval of audio that matches human judgment and (aural) perception.
Audio Indexing Scheme in MUVIS
[Figure: audio indexing pipeline applied to the audio stream in four numbered steps:]
1. Classification & segmentation per granule/frame (classes: Silence, Speech, Music, NotClassified).
2. Audio framing in valid classes, with classification conversion (classes: Uncertain, Speech, Music, NotClassified).
3. AFeX operation per frame, performed by the AFeX module.
4. KF extraction via MST clustering, producing the KF feature vectors (Speech, Music, NotClassified) used for audio indexing.
2. Audio Framing with Classification Conversion
[Figure: a row of granules labelled M (Music), S (Speech) and X (Silence), captioned "Classification per granule/frame", is grouped into audio frames labelled Music, Speech or Uncertain, captioned "Final Classification per audio frame".]
Audio Feature Extraction (AFeX) Framework
- Independent AFeX modules can be integrated into the MUVIS framework for audio-based indexing and retrieval.
[Figure: DBSEditor and MBrowser call AFex_Bind(), AFex_Init(), AFex_Extract(), AFex_GetDistance() and AFex_Exit() as declared in AFex_API.h; each AFex_*.DLL module exports AFex_Bind, AFex_Init, AFex_Extract, AFex_Exit and AFex_GetDistance.]
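The five entry points named on this slide suggest a plug-in contract between the host applications and each module. The slide gives only the function names, so the sketch below is a hypothetical Python mirror of that contract: the method signatures, the parameter dictionary, and the toy `EnergyZcrModule` feature (mean absolute amplitude and zero-crossing rate) are all assumptions made for illustration, not the real AFex_API.h declarations.

```python
import math
from abc import ABC, abstractmethod
from typing import Dict, List


class AFeXModule(ABC):
    """Hypothetical mirror of the AFex_* entry points (signatures assumed)."""

    @abstractmethod
    def bind(self) -> Dict[str, str]: ...            # cf. AFex_Bind

    @abstractmethod
    def init(self, params: Dict[str, float]) -> None: ...   # cf. AFex_Init

    @abstractmethod
    def extract(self, frame: List[float]) -> List[float]: ...  # cf. AFex_Extract

    @abstractmethod
    def get_distance(self, a: List[float], b: List[float]) -> float: ...  # cf. AFex_GetDistance

    @abstractmethod
    def exit(self) -> None: ...                      # cf. AFex_Exit


class EnergyZcrModule(AFeXModule):
    """Toy module, invented for this example only."""

    def bind(self) -> Dict[str, str]:
        return {"name": "EnergyZcr", "vector_size": "2"}

    def init(self, params: Dict[str, float]) -> None:
        self.ready = True

    def extract(self, frame: List[float]) -> List[float]:
        mean_abs = sum(abs(x) for x in frame) / len(frame)
        # Count sign changes between consecutive samples.
        crossings = sum(1 for x, y in zip(frame, frame[1:]) if x * y < 0)
        return [mean_abs, crossings / (len(frame) - 1)]

    def get_distance(self, a: List[float], b: List[float]) -> float:
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    def exit(self) -> None:
        self.ready = False


def index_frames(module: AFeXModule, frames: List[List[float]]) -> List[List[float]]:
    """Host-side call order assumed here: bind, init, extract per frame, exit."""
    module.bind()
    module.init({})
    vectors = [module.extract(f) for f in frames]
    module.exit()
    return vectors
```

Under these assumptions the host never needs to know what a module computes: it only stores the vectors returned by `extract` and later compares them with the same module's `get_distance`.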
Key-Framing via MST Clustering
[Figure: minimum spanning tree over the audio frames (numbered 0 to 21) of the spoken phrase "Speech Lab"; the frames fall into clusters labelled 'S', 'p', 'ee', 'ch', 'L', 'a', 'b'.]
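The slide shows frames of one utterance grouped into clusters along a minimum spanning tree, but it does not state how the tree is cut or how a key frame is chosen. The following is therefore only a minimal sketch under stated assumptions: Euclidean distance between frame feature vectors, clusters obtained by dropping MST edges longer than a hypothetical threshold `cut`, and the key frame of each cluster taken as the member nearest the cluster centroid.

```python
import math
from typing import Dict, List, Tuple


def euclid(a: List[float], b: List[float]) -> float:
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def minimum_spanning_tree(points: List[List[float]]) -> List[Tuple[float, int, int]]:
    """Prim's algorithm on the complete graph; returns (weight, u, v) edges."""
    n = len(points)
    in_tree = [False] * n
    best = [math.inf] * n      # cheapest known edge into the tree
    parent = [-1] * n
    edges = []
    if n:
        best[0] = 0.0
    for _ in range(n):
        u = min((i for i in range(n) if not in_tree[i]), key=lambda i: best[i])
        in_tree[u] = True
        if parent[u] >= 0:
            edges.append((best[u], parent[u], u))
        for v in range(n):
            if not in_tree[v]:
                d = euclid(points[u], points[v])
                if d < best[v]:
                    best[v], parent[v] = d, u
    return edges


def key_frames(points: List[List[float]], cut: float) -> List[int]:
    """Indices of one key frame per cluster (selection rule is an assumption)."""
    n = len(points)
    root = list(range(n))

    def find(i: int) -> int:
        while root[i] != i:
            root[i] = root[root[i]]
            i = root[i]
        return i

    # Keep only short MST edges; each remaining component is a cluster.
    for w, u, v in minimum_spanning_tree(points):
        if w <= cut:
            root[find(u)] = find(v)

    clusters: Dict[int, List[int]] = {}
    for i in range(n):
        clusters.setdefault(find(i), []).append(i)

    keys = []
    for members in clusters.values():
        dim = len(points[0])
        centroid = [sum(points[i][k] for i in members) / len(members) for k in range(dim)]
        keys.append(min(members, key=lambda i: euclid(points[i], centroid)))
    return sorted(keys)
```

With one key frame per cluster, only a handful of feature vectors per clip need to be stored and compared, which is the apparent purpose of the key-framing step.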
A Sample AFeX Module Imp.: MFCC
- The MFCC (Mel-Frequency Cepstrum Coefficients) AFeX module provides generic feature vectors that are independent of the following parameters:
  - Sampling frequency.
  - Number of audio channels (mono/stereo).
  - Audio volume level.

$$c_i = \left(\frac{2}{P}\right)^{1/2} \sum_{j=1}^{P} \log m_j \cdot \cos\left[\frac{\pi \cdot i}{N}\,(j - 0.5)\right]$$
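The cepstral formula on this slide, read as c_i = (2/P)^(1/2) Σ_{j=1..P} log m_j · cos[(π·i/N)(j − 0.5)], can be evaluated directly. The sketch below assumes that m_j are positive mel filter-bank outputs and that P is the number of filters, which is the usual MFCC reading; the slide does not define N, so the code takes N = P by default, and that default is an assumption.

```python
import math
from typing import List, Optional


def cepstral_coefficients(m: List[float], num_coeffs: int,
                          N: Optional[int] = None) -> List[float]:
    """Return c_1 .. c_num_coeffs from filter-bank outputs m_1 .. m_P.

    N defaults to P (an assumption; the slide leaves N undefined).
    """
    P = len(m)
    N = P if N is None else N
    scale = math.sqrt(2.0 / P)
    coeffs = []
    for i in range(1, num_coeffs + 1):
        total = 0.0
        for j in range(1, P + 1):
            total += math.log(m[j - 1]) * math.cos(math.pi * i / N * (j - 0.5))
        coeffs.append(scale * total)
    return coeffs


quiet = [1.0, 2.0, 4.0, 8.0]
loud = [10.0 * x for x in quiet]   # same spectrum shape, higher gain
c_quiet = cepstral_coefficients(quiet, 3)
c_loud = cepstral_coefficients(loud, 3)
```

With N = P, a constant gain adds the same log term to every filter output, and the cosine sum over j vanishes for 1 ≤ i < P, so `c_quiet` and `c_loud` agree up to rounding. This shows one way volume independence can arise; the slide does not say whether the module relies on this property.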
Audio Retrieval in MUVIS
[Figure: the sub-feature vectors FV(0) .. FV(i) of the query clip are compared with the sub-feature vectors FV(0) .. FV(i) of a database clip. Feature vectors are held per class type, and only matching class types are compared.]
For each frame, a search is done to find a matching frame which gives minimum distance.
Audio Retrieval in MUVIS (cont.)
- To perform an audio-based query within MUVIS, an audio clip is chosen in a multimedia database and queried through that database, provided the database includes at least one audio feature.
- Let NoS be the number of feature sets for a database and let NoF(s) be the number of sub-features per feature set. Sub-features are obtained by changing the AFeX module parameters or the audio frame size during the audio feature extraction process.

$$D_i(s,f) = \begin{cases} \min\limits_{j \in C_i} \left[ SD\left( QFV_i^{C_i}(s,f),\; DFV_j^{C_i}(s,f) \right) \right] & \text{if } j \in C_i \neq \emptyset \\ 0 & \text{if } j \in C_i = \emptyset \end{cases}$$

$$D(s,f) = \sum_{i \in C_i} D_i(s,f)$$

$$QD_c = \sum_{s}^{NoS} \sum_{f}^{NoF(s)} W(s,f) \times D(s,f)$$
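The query distance on this slide combines three steps: a per-frame minimum distance D_i(s,f) within the matching class, a sum D(s,f) over the query frames, and a weighted sum QD_c over feature sets s and sub-features f. The sketch below follows that reading. The symbols are not defined on the slide, so treating QFV and DFV as query and database feature vectors, SD as a distance between two vectors (Euclidean here), and W(s,f) as a per-sub-feature weight are all assumptions, as are the dictionary layout and the function names.

```python
import math
from typing import Dict, List

Frames = List[List[float]]            # feature vectors of one class
ClassFrames = Dict[str, Frames]       # class name -> its frames


def sd(a: List[float], b: List[float]) -> float:
    """Stand-in for SD; Euclidean distance is an assumption."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def sub_feature_distance(query: ClassFrames, db: ClassFrames) -> float:
    """D(s,f): sum over query frames of the minimum distance in the same class."""
    total = 0.0
    for cls, q_frames in query.items():
        candidates = db.get(cls, [])
        for q in q_frames:
            if candidates:
                total += min(sd(q, d) for d in candidates)
            # No database frame of this class: D_i contributes 0.
    return total


def query_distance(query: List[List[ClassFrames]],
                   db: List[List[ClassFrames]],
                   weights: List[List[float]]) -> float:
    """QD_c: weighted sum of D(s,f) over feature sets s and sub-features f."""
    return sum(
        weights[s][f] * sub_feature_distance(query[s][f], db[s][f])
        for s in range(len(query))
        for f in range(len(query[s]))
    )


q = {"speech": [[0.0, 0.0], [1.0, 0.0]], "music": [[5.0, 5.0]]}
d = {"speech": [[0.0, 1.0], [1.0, 0.0]]}
total = query_distance([[q]], [[d]], [[2.0]])
```

In this example the first speech frame is at distance 1 from its nearest database frame, the second matches exactly, and the music frame contributes 0 because the database clip has no music frames, so `total` is 2.0 after weighting.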


Conclusions & Remarks
- Audio is important. Sometimes it bears more semantic and content information than video.
- The preliminary results show that audio-based retrieval is effective compared to visual retrieval (similar or better results).
- The classification and segmentation algorithm has recently been improved. A new approach based on fuzzy regions and semantic-rule-based classification with intra-segment boundary detection has been developed.
