
Implementation of JPEG2000 Using SSE Instruction Set

ECE 734 VLSI Array Structures for Digital Signal Processing

Ami Mehta

Gilles Muller

Spring 2004

Table of Contents
Introduction
1 - JPEG2000 Algorithm
1.1 - Architecture
1.2 - 2D-DWT
2 - Implementation
2.1 - Non-optimized version
2.2 - Optimizations
3 - Verification
4 - Profiler
5 - Results
6 - References

Introduction
JPEG2000 is a new image compression standard being developed by the Joint Photographic Experts Group (JPEG), part of the International Organization for Standardization (ISO). It is designed for different types of still images (bi-level, gray-level, color, multicomponent), allowing different imaging models (client/server, real-time transmission, image library archival, limited buffer and bandwidth resources, etc.) within a unified system. JPEG2000 is intended to provide low bit rate operation with rate-distortion and subjective image quality performance superior to existing standards, without sacrificing performance at other points in the rate-distortion spectrum.

JPEG2000 addresses areas where current standards fail to produce the best quality or performance, such as:
- Low bit rate compression performance (rates below 0.25 bpp for highly detailed gray-level images)
- Lossless and lossy compression in a single code stream
- Seamless quality and resolution scalability, without having to download the entire file. The major benefit is the conservation of bandwidth
- Large images: JPEG is restricted to 64K x 64K images (without tiling). JPEG2000 will handle image sizes up to (2^32 - 1)
- Single decompression architecture
- Error resilience for transmission in noisy environments, such as wireless and the Internet
- Computer generated imagery
- Compound documents
- Region of Interest coding
- Improved compression techniques to accommodate richer content and higher resolutions
- Metadata mechanisms for incorporating additional non-image data as part of the file

JPEG2000 will be able to handle up to 256 channels of information, as compared to JPEG, which is limited to only RGB data. Thus, JPEG2000 will be capable of describing complete alternate color models, such as CMYK, and full ICC (International Color Consortium) profiles.

1 - JPEG2000 Algorithm

1.1 - Architecture

The encoding procedure is as follows:
- The source image is decomposed into components.
- The image and its components are decomposed into rectangular tiles. The tile-component is the basic unit of the original or reconstructed image.
- The wavelet transform is applied on each tile. The tile is decomposed into different resolution levels.
- These decomposition levels are made up of subbands of coefficients that describe the frequency characteristics of local areas of the tile-component, rather than of the entire tile-component.
- Markers are added in the bitstream to allow error resilience.
- The codestream has a main header at the beginning that describes the original image and the various decomposition and coding styles that are used to locate, extract, decode and reconstruct the image with the desired resolution, fidelity, region of interest and other characteristics.
- The optional file format describes the meaning of the image and its components in the context of the application.

1.2 - 2D-DWT
The 1-D DWT decomposes the original signal into two sub-bands, usually denoted as the coarse scale approximation (lower band) and the detail signal (higher band). This transform can be easily extended to multiple dimensions by using separable filters, i.e. by applying separate 1-D transforms along each dimension. In particular, we have implemented the most common approach, commonly known as the square decomposition. This scheme alternates between operations on rows and columns, i.e. one stage of the 1-D DWT is applied first to the rows of the image and then to the columns.

This process is applied recursively to the quadrant containing the coarse scale approximation in both directions. In this way, the data on which computations are performed is reduced to a quarter in each step. From a performance point of view, the main bottleneck of this transformation is caused by the vertical filtering (the processing of image columns) or the horizontal one (the processing of image rows), depending on whether we assume a row-major or a column-major layout for the images. The lifting filter computes the low frequency and high frequency coefficients in 1-D.
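The square decomposition described above can be sketched in C as follows. This is only an illustration of the control flow (rows, then columns, then recursion on the coarse quadrant): the function names are hypothetical, and a simple average/difference step stands in for the real 1-D filter so that the example stays short.

```c
#include <assert.h>

/* Stand-in 1-D step (NOT the 9/7 filter): pairwise averages go to the
 * first half of the vector, pairwise differences to the second half.
 * `stride` is 1 for a row and the image width for a column. */
static void haar_1d(float *v, int n, int stride, float *tmp)
{
    int half = n / 2;
    for (int i = 0; i < half; i++) {
        float a = v[(2 * i) * stride];
        float b = v[(2 * i + 1) * stride];
        tmp[i] = 0.5f * (a + b);   /* coarse approximation */
        tmp[half + i] = a - b;     /* detail signal */
    }
    for (int i = 0; i < n; i++)
        v[i * stride] = tmp[i];
}

/* Square decomposition on a row-major image: one 1-D stage on the rows,
 * one on the columns, then repeat on the top-left (coarse) quadrant.
 * `tmp` must hold at least max(width, height) floats. */
static void dwt2d_square(float *img, int width, int height,
                         int levels, float *tmp)
{
    int w = width, h = height;
    for (int l = 0; l < levels; l++) {
        assert(w >= 2 && h >= 2 && w % 2 == 0 && h % 2 == 0);
        for (int r = 0; r < h; r++)          /* horizontal filtering */
            haar_1d(img + r * width, w, 1, tmp);
        for (int c = 0; c < w; c++)          /* vertical filtering */
            haar_1d(img + c, h, width, tmp);
        w /= 2;                              /* a quarter of the data */
        h /= 2;                              /* remains for the next level */
    }
}
```

With a row-major image the column loop touches memory with a stride of `width` floats, which is the access pattern that makes vertical filtering the bottleneck.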

The following filter coefficients are used for the JPEG2000 lossy filter:
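As a rough illustration, the 9/7 filter in lifting form is usually quoted with four lifting constants and one scaling factor. The constants, the scaling convention and the function names below are assumptions taken from the common literature on the CDF 9/7 wavelet, not values copied from our code. The sketch leaves the output interleaved (low-pass on even indices, high-pass on odd indices) and mirrors samples at the borders.

```c
#include <assert.h>

/* Commonly quoted CDF 9/7 lifting constants (assumed values). */
#define ALPHA  (-1.586134342f)
#define BETA   (-0.052980118f)
#define GAMMA  ( 0.882911075f)
#define DELTA  ( 0.443506852f)
#define KSCALE ( 1.230174105f)

/* One lifting step: x[i] += c * (x[i-1] + x[i+1]) for i = first,
 * first + 2, ... A missing neighbour at a border is replaced by the
 * neighbour on the other side (symmetric extension). */
static void lift(float *x, int n, int first, float c)
{
    for (int i = first; i < n; i += 2) {
        float left  = (i > 0)     ? x[i - 1] : x[i + 1];
        float right = (i < n - 1) ? x[i + 1] : x[i - 1];
        x[i] += c * (left + right);
    }
}

/* Forward 1-D 9/7 transform, in place, for an even number of samples. */
static void fwd97_1d(float *x, int n)
{
    assert(n >= 2 && n % 2 == 0);
    lift(x, n, 1, ALPHA);   /* first predict step, odd samples  */
    lift(x, n, 0, BETA);    /* first update step, even samples  */
    lift(x, n, 1, GAMMA);   /* second predict step              */
    lift(x, n, 0, DELTA);   /* second update step               */
    for (int i = 0; i < n; i += 2) {
        x[i]     *= 1.0f / KSCALE;  /* low-pass, unit DC gain (one convention) */
        x[i + 1] *= KSCALE;         /* high-pass */
    }
}
```

With this scaling a constant signal comes out with the same constant in the low band and values close to zero in the high band; other texts place the factor differently.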

2 - Implementation

2.1 - Non-optimized version


We chose to implement the lifting transform using the Daubechies 9/7 analysis filter. This was done mainly because lifting requires less working memory and fewer arithmetic computations. The lossy filter was implemented since our focus is to make JPEG2000 faster on general purpose architectures, where some loss of resolution is tolerable and lossy compression gives a better compression ratio. This version was implemented using only C code.

The lifting filter described above computes the high and low frequency coefficients at the same time, reusing intermediate results. This is an advantage, as only one filter is needed to compute both sets of coefficients.

The wavelet transform poses a major hurdle for the memory hierarchy, due to the discrepancies between the memory access patterns of the two main components of the 2-D wavelet transform: the vertical and horizontal filtering. Consequently, improving the use of the memory hierarchy is the most important challenge of this algorithm from a performance perspective. As a result, we needed to think about memory use even though this version isn't optimized for SSE.

The goal of this project was to optimize the algorithm implementation for speed. We therefore chose to store all intermediate results independently, which prevents unnecessary false dependences at the cost of increasing the amount of memory used.

The Mallat strategy uses an auxiliary matrix to store the results of the horizontal filtering. In this way, as the following figure shows, the horizontal high and low frequency components are not interleaved in memory. This method prevents memory scattering and enables better exploitation of SIMD parallelism. The vertical filtering reads these components and writes the results into the original matrix following the order expected by the quantization step.

For our implementation we did the down sampling before the calculations, to ensure none of our calculations were wasted, and we also reused calculations across computations whenever possible.
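The memory layout produced by the Mallat strategy can be sketched as follows. The function names are hypothetical and the filtering itself is left out: the sketch only shows how a row whose coefficients are still interleaved (low, high, low, high, ...) is written into the auxiliary matrix with all low-pass values first and all high-pass values after them.

```c
#include <assert.h>

/* Copy one interleaved row into a row of the auxiliary matrix:
 * even positions (low-pass) fill the left half,
 * odd positions (high-pass) fill the right half. */
static void deinterleave_row(const float *row, float *aux_row, int n)
{
    assert(n % 2 == 0);
    int half = n / 2;
    for (int i = 0; i < half; i++) {
        aux_row[i]        = row[2 * i];      /* low frequency  */
        aux_row[half + i] = row[2 * i + 1];  /* high frequency */
    }
}

/* Apply the layout change to every row of a row-major image. */
static void horizontal_layout_mallat(const float *in, float *aux,
                                     int width, int height)
{
    for (int r = 0; r < height; r++)
        deinterleave_row(in + r * width, aux + r * width, width);
}
```

After this step each band occupies a contiguous block of every row, so the vertical pass can load four neighbouring coefficients of the same band into one SSE register.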

2.2 - Optimizations
The analysis of this C code using VTune Analyzer showed that the major bottleneck of our code was the memory access, and particularly a low hit rate in the L1 data cache. To solve this problem, the following optimizations are used:

Data alignment
Strictly speaking, data alignment is not required in our codes since the SSE instruction set includes instructions that allow unaligned data to be copied into and out of the vector registers. However, such operations are much slower than aligned accesses, which may cause a significant overhead. To avoid this drawback we have employed 16-byte aligned data in all our codes, although for the scalar versions this optimization has no significant effect.

[Figure: cache layout without alignment and cache layout with alignment]
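The difference between aligned and unaligned accesses shows up directly in the intrinsics. The sketch below is illustrative only (the function names are hypothetical): it uses `_mm_load_ps`/`_mm_store_ps`, which require 16-byte aligned addresses, when all three buffers are aligned, and falls back to the slower `_mm_loadu_ps`/`_mm_storeu_ps` otherwise.

```c
#include <assert.h>
#include <stdint.h>
#include <xmmintrin.h>

/* True when the address is a multiple of 16 bytes. */
static int is_aligned16(const void *p)
{
    return ((uintptr_t)p & 15u) == 0;
}

/* dst[i] = a[i] + b[i] for n floats, n being a multiple of 4. */
static void add_ps(float *dst, const float *a, const float *b, int n)
{
    assert(n % 4 == 0);
    if (is_aligned16(dst) && is_aligned16(a) && is_aligned16(b)) {
        for (int i = 0; i < n; i += 4) {
            /* aligned path: addresses must be 16-byte aligned */
            __m128 v = _mm_add_ps(_mm_load_ps(a + i), _mm_load_ps(b + i));
            _mm_store_ps(dst + i, v);
        }
    } else {
        for (int i = 0; i < n; i += 4) {
            /* unaligned path: works for any address */
            __m128 v = _mm_add_ps(_mm_loadu_ps(a + i), _mm_loadu_ps(b + i));
            _mm_storeu_ps(dst + i, v);
        }
    }
}
```

Both paths give the same results; only the cost of the memory accesses differs.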

Matrix juxtaposition
Input and output matrices are juxtaposed in memory to prevent associativity conflicts. Indeed, after juxtaposition, the 2 matrices can be loaded entirely in the L1 data cache without having to switch from one matrix to the other. Here is a part of our code showing these optimizations:
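    /* Allocate memory for the 2 matrices */
    /* The 2 matrices are juxtaposed and aligned on the cache row size: 128 bits = 16 bytes */
    #include <stdio.h>
    #include <stdlib.h>
    #include <stdint.h>

    static inline void load_matrix(float **matrix_in, int row, int column, float **matrix_out)
    {
        size_t size = 2 * (size_t)row * (size_t)column * sizeof(float);
        /* 16 spare bytes leave room to round the address up */
        void *raw = malloc(size + 16);
        if (raw == NULL) {
            printf("Not enough memory available.\n");
            exit(1);
        }
        /* align the address at the beginning of a cache line;
           uintptr_t replaces the cast to int, which truncates 64-bit pointers */
        uintptr_t addr = (uintptr_t)raw + 16;
        *matrix_in = (float *)(addr - (addr % 16));
        /* the output matrix starts right after the input matrix;
           note that raw is not kept, so the block cannot be freed later */
        *matrix_out = *matrix_in + (size_t)row * (size_t)column;
    }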

Once these were done we transformed the code using SSE2. This was done to make use of the inherent parallelism in the code. SSE does four calculations at a time using special registers allocated for SSE. We used these 128-bit floating-point registers to store four 32-bit single-precision values, so that the four computations were done in parallel. We used intrinsics for this part. The following piece of code is typical of our use of SSE intrinsics.
    m1 = _mm_add_ps(*pscr2, *pscr0);
    m2 = _mm_mul_ps(m0_a0, m1);
    m3 = _mm_add_ps(*pscr1, m2);  // high1[i] = start1[i] + a0*(start2[i] + start0[i]);

m1, m2, m3, pscr2, m0_a0, etc. are all 32-bit values, stored in groups of four as mentioned before. The following figure shows how the 128-bit registers are used for four calculations to take advantage of parallelism.
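To make the fragment above self-contained, here is a minimal sketch of a complete lifting step written with the same three intrinsics. The function name, the argument names and the use of unaligned loads are assumptions made for the example; our code works on aligned data through `__m128` pointers instead.

```c
#include <assert.h>
#include <xmmintrin.h>

/* high[i] = s1[i] + a0 * (s2[i] + s0[i]), four floats per iteration.
 * n must be a multiple of 4. */
static void predict_step_sse(float *high, const float *s0, const float *s1,
                             const float *s2, float a0, int n)
{
    /* the same coefficient replicated in all four lanes */
    __m128 m0_a0 = _mm_set1_ps(a0);

    assert(n % 4 == 0);
    for (int i = 0; i < n; i += 4) {
        __m128 m1 = _mm_add_ps(_mm_loadu_ps(s2 + i), _mm_loadu_ps(s0 + i));
        __m128 m2 = _mm_mul_ps(m0_a0, m1);
        __m128 m3 = _mm_add_ps(_mm_loadu_ps(s1 + i), m2);
        _mm_storeu_ps(high + i, m3);
    }
}
```

Each loop iteration replaces four iterations of the scalar loop given in the comment of the original fragment.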

3 - Verification
We verified our code using MATLAB. We took an image and converted it to a matrix. We used this matrix as an input to our DWT code. Then the output matrix of our code was dumped into a file. We took this output matrix, normalized it by subtracting the smallest number and dividing by the greatest number, and displayed it as an image. The following results were produced.
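The normalization step can be sketched in C as below. The function name is hypothetical, and the sketch assumes that "dividing by the greatest number" means dividing by the largest value remaining after the minimum has been subtracted, so that the result lies in [0, 1] and can be displayed as a grayscale image.

```c
#include <assert.h>
#include <stddef.h>

/* Map the n coefficients of m to the range [0, 1] in place:
 * m[i] = (m[i] - min) / (max - min).
 * A constant matrix is mapped to all zeros. */
static void normalize_for_display(float *m, size_t n)
{
    assert(n > 0);
    float lo = m[0], hi = m[0];
    for (size_t i = 1; i < n; i++) {
        if (m[i] < lo) lo = m[i];
        if (m[i] > hi) hi = m[i];
    }
    float range = hi - lo;
    for (size_t i = 0; i < n; i++)
        m[i] = (range > 0.0f) ? (m[i] - lo) / range : 0.0f;
}
```

For example, the values 2, 4, 6, 3 become 0, 0.5, 1, 0.25.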

[Figure: four panels: Original image; MATLAB output for DWT; After row transform using our code; After DWT, using our code]

The results from our code match those of MATLAB, which shows that our code works.


4 - Profiler
As mentioned before, we used VTune Analyzer to profile our results. The VTune Performance Analyzer helps you analyze the performance of your application by locating hotspots. Hotspots are areas in your code that take a long time to execute. It also helps you find out what is causing these hotspots and decide what type of improvements to make. You can track critical function calls and monitor specific processor events, such as cache misses, triggered by sections in your code. You can also calculate event ratios to determine if processor events are causing the hotspots.

The VTune analyzer collects performance data on your application and system, and displays it in graphs or tables. From this display, you can analyze the performance of your application and determine which portions of an application are slowest and why. To optimize the performance of your application or system, you can do one or more of the following to find the performance bottlenecks:
- Determine how your system resources, such as memory and processor, are being utilized to identify system-level bottlenecks.
- Measure the execution time for each module and function in your application.
- Determine how the various modules running on your system affect the performance of each other.
- Identify the most time-consuming function calls and call sequences within your application.
- Determine how your application is executing at the processor level to identify microarchitecture-level performance problems.

The VTune Performance Analyzer can find this information by automating the process of data collection with three types of data collectors, namely sampling, call graph, and counter monitor. Thus we decided to use VTune Analyzer. With the help of profiling we were able to identify that memory access was the biggest bottleneck for our code.

5 - Results
We conducted a set of simulations using the pre-optimized and post-optimized versions of our code. The simulations were done on a 2.4 GHz Pentium Celeron processor, with an 8K uops L1 trace cache and a 128 KB L2 cache. Since the profiler works by taking samples, we repeated the runs to get a good estimate: we ran a set of three simulations on images of size 256 x 256, 512 x 512 and 1024 x 1024. The results for the biggest image, 1024 x 1024, are shown here. The results for the improvement in cycles per uop are given below.


[Figure: Cycles per retired uop (y-axis: clockticks), per run, before optimization and after optimization]

As can be seen, there is an average improvement of 44.61% after optimization. The results for 1st level cache load misses are shown below.
[Figure: 1st Level Cache Load Misses, per run, before optimization and after optimization]


As can be seen, the various optimization techniques dramatically reduced 1st level cache load misses. On average the misses fell by 74%, from a little below 5,000,000 to a little over 1,000,000. The second level cache improvement is shown below.
[Figure: 2nd Level Cache Load Misses, per run, before optimization and after optimization]

There is an improvement of 58% on average. Thus our cache optimizations, together with the use of SSE code to exploit parallelism, worked well and improved both performance and cache access.

6 - References
[1] Taubman, David and Marcellin, Michael, JPEG 2000: Image Compression Fundamentals, Standards and Practice.
[2] Michael D. Adams, The JPEG-2000 Still Image Compression Standard.
[3] C. Tenllado, D. Chaver, L. Piñuel, M. Prieto and F. Tirado, Vectorization of the 2D Wavelet Lifting Transform using SIMD extensions.
[4] D. Chaver, C. Tenllado, L. Piñuel, M. Prieto and F. Tirado, 2-D Wavelet Transform Enhancement on General-Purpose Microprocessors: Memory Hierarchy and SIMD Parallelism Exploitation.
[5] IA-32 Intel Software Developer's Manual, Volume 2: Instruction Reference.
