You are on page 1of 7

Previous Home Next

chapter 22

zenning and the flexible mind


Ch

And so we come &>theend of ourjourney; fornow, at least. What follows is a modest


riginally served to show readers of Zen of Assembly
Language that they morethanjust bits and pieces of
knowledge; that
they had also begun to apply the flexible
mind-unconventional,broadly
integrative thinkin hing high-level optimization the
at algorithmic and
urse, need no such reassurance, having just spent
xible mind in many guises,but I thinkyou’ll find
ve nonetheless. Try to stay ahead as the level of optimization
elimination to instruction substitution to more creative solu-
nding andredesign. We’ll start out by compacting
individual instructiods and bits of code, butby the endwe’ll come up with a solution
that involves the very structure of the subroutine, with each instruction carefully
integrated into a remarkably compact whole. It’s a neat example of how optimiza-
tion operates at many levels,some much less determininstic than others-and besides,
it’sjust plain fun.
Enjoy!

Lennmg
In Jeff Duntemann’s excellent book Bodand PascaZFrum Square One (Random House,
1993),there’s asmall assemblysubroutine that’s designed to be called from a Turbo

415
Pascal program in order to fill the screen or a system-memory screen buffer with a
specified character/attribute pair in text mode. This subroutine involves only 21
instructions and works perfectly well; however, with whatwe know, we can compact
the subroutine tremendously and speed it up a bit as well. To coin a verb, we can
“Zen” this already-tight assembly code to an astonishing degree. In the process, I
hope you’ll get a feel forhow advanced your assembly skills havebecome.
Jeff‘s original code follows as Listing 22.1(withsome text converted to lowercase in
order to match the style of this book), but thecomments are mine.

LISTING 22.1 122- 1.ASM


OnStack s t r u:cd a ttah a t ’sst o r e d on t hset a cakf t e r PUSH
BP
01 dBP dw ? : c a l l e r ’ s BP
RetAddr dw ? : r ae dt udrrne s s
Filler dw ? : c h a r a c t teor fill t hbeu f f ewri t h
Attrib dw ? : a t t r i b u tt eo fill t hbeu f f ewri t h
BufSize dw ? :number o c hf a r a c t e r / a t t r i b u tpea i rt os fill
BufOfs dw ? : b uoffffesre t
BufSeg dw ? : b u f f e r segment
EndMrk db ? : m a r k ef otrhr e end ot hfset a cf rka m e
OnStack ends

Clears p r once a r
push bP caller’s ;save BP
mov bP. SP f r a mset a c k t o : p o i n t
cmp word p t r C b p l . B u f S e g:.s0ktihpe fill i f a n u l l
jne Start : p o i n t e r i s passed
cmp word p t r Cbpl.BufOfs,O
je Bye
Start: cld :make STOSW countup
mov a x . C b p l . A t :t lroi ba d AX waittthr i bpuat er a m e t e r
and a x . O f; fpm 0r 0efhprw
ogari itrnheg fill c h a r
mov b x . [ b p l . F i :l l oe ar d BX w i t h fill c h a r
and ambt :tew xpr r.irf0gbteohfiuprn
f thager e
o r : caaotantmrxdib.bbiunxtee fill c h a r
mov bx,Cbpl.BufO : l of sa d DI w ti tahr gbeutf foef rf s e t
mov d i ,bx
mov b x . [ b p l . B u f S: leoga d ES w ti tahr gbeutf f e r segment
mov e s , bx
mov c x . C b p l . B u f S;ilzoea d C X w bi tuhf f e r size
rep stosw b u f;fill
fer the
p o i nstmov
Bye: teoarr;cri kge isnst aoplr. eb p
POP bp ; and c a l l e r ’ s BP
ret E n d M r k - R e t A d d r:-r2e t u rcnl .e a r i nt hpgea r m f rsot m
hs tea c k
Clears endp

The first thing you’ll notice about Listing 22.1 is that Clears uses a REP STOSW
instruction. That means that we’re not going to improve performance by any great
amount, no matter how clever we are. While we can eliminate some cycles, the bulk
of the work in Clears is done by that one repeated string instruction, and there’s no
way to improve on that.
Does that mean thatListing 22.1 is as good as it can be? Hardly. While the speed of
Clears is verygood, there’s another side to the optimization equation: size. The whole of
Clears is 52 bytes long as it stands-but, as we’ll see, that size is hardly set in stone.
41 6 Chapter 22
Where do we begin with Clears? For starters, there’s an instruction in there that
serves no earthly purpose-MOV SP,BP. SP is guaranteed to be equal to BP at that
point anyway, so why reload it with the same value? Removingthat instruction saves
us two bytes.
Well, that was certainly easy enough! We’re not going to find any more totally non-
functional instructions in Clears, however, so let’s get on to someserious optimizing.
We’ll look first for cases where we know of better instructions for particular tasks
than those that were chosen. For example, there’s no need to load any register,
whether segment or general-purpose, through BX; we can eliminate two instruc-
tions by loading ES and DI directly as shownin Listing 22.2.

LISTING22.2122-2.ASM
Clears p r once a r
push b p : s a v ec a l l e r ’ s BP
mov bp. sp : p o i n tt os t a c kf r a m e
cmp word p t r Cbpl.BufSeg.0 : s k i pt h e fill i f a n u l l
jne Start : p o i n t e ri sp a s s e d
cmp word p t r[ b p l . B u f O f s . O
je Bye
S t a r t :c l d :make STOSW c o u n t up
mov ax.Cbpl.Attrib : l o a d AX w i t h a t t r i b u t e p a r a m e t e r
ax.Off00h
and : p r e p a r ef o rm e r g i n gw i t h fill c h a r
mov b x . [ b p l .F i 1 1 e r : l o a d BX w i t h fill c h a r
and bx.0ffh : p r e p a r ef o rm e r g i n gw i t ha t t r i b u t e
a x . b xo r : c o m b i n ea t t r i b u t ea n d fill c h a r
mov d i .Cbp].BufOfs ;load D I w i t ht a r g e tb u f f e ro f f s e t
mov es,[bpl.BufSeg : l o a d ES w i t h t a r g e t b u f f e r segment
mov cx.[bpl.BufSize :load CX w i t hb u f f e rs i z e
s troespw :fill t h eb u f f e r
Bye :
POP bP : r e s t o r e c a l l e r ’ s BP
ret EndMrk-RetAddr-2 : r e t u r n .c l e a r i n gt h e parms f r o mt h es t a c k
C1endp
ears

(The OnStack structure definition doesn’t change in any of our examples, so I’m
not going clutter up this chapter by reproducing it for each new version ofClears.)
Okay, loading ES and DI directly saves another four bytes. We’ve squeezed a total of
6 bytes-about 11 percent-out of Clears. What next?
Well, LES would servebetter than two MOV instructions for loading ES and DI as shown
in Listing 22.3.

LISTING22.3122-3.ASM
nC
epalreor acr s
r’s :save bp push BP
mov bp,sp : p o i nf rt a tmoe s t a c k
cmp word p t r C b p l . B u f S e g:.s0kt ihpe fill i f a n u l l
Start jne passed i s: p o i n t e r
cmp word p t[ rb p l . B u f O f s , O
je Bye
cld Start: :makeu pSTOSW c o u n t
mov ax.[bpl.Attrib : l o a d AX w i t ha t t r i b u t ep a r a m e t e r
ax.Off00h
and : p r e p a r ef o rm e r g i n gw i t h fill c h a r

Zenning and the Flexible Mind 41 7


mov bx.[bpl.Filler : l o a d BX w i t h fill c h a r
and bx.0ffh : p r e p a r ef o rm e r g i n gw i t ha t t r i b u t e
or ax, bx ;combine a t t r i b u t e and fill c h a r
1es d i . d w o r dp t r[ b p l . B u f O f s : l o a d E S : D I w i t ht a r g e tb u f f e r
:segment:offset
mov cx,Cbpl.BufSize :load CX w i t hb u f f e rs i z e
rep stosw :fill t h eb u f f e r
Bye :
POP bP : r e s t o r e c a l l e r ’ s BP
ret EndMrk-RetAddr-2 : r e t u r n .c l e a r i n gt h e p a r m sf r o mt h es t a c k
Clears endp

That’s good for another three bytes. We’re downto 43 bytes, and counting.
We can save 3 more bytes by clearing the low and high bytes of AX and BX, respectively,
by using SUB reg8,reg8rather than ANDing 16-bit values as shown in Listing 22.4.

LISTING 22.4122-4.ASM
Clears p r once a r
push bp : s a v ec a l l e r ’ s B P
mov bp.sp : p o i n tt os t a c kf r a m e
cmp word p t r Cbpl.BufSeg.0 : s k i pt h e fill i f a n u l l
jne Start : p o i n t e ri sp a s s e d
cmp word p t rC b p l . B u f O f s . 0
je Bye
Start: cld ;make STOSW countup
mov ax.[bpl.Attrib : l o a d AX w i t h a t t r i b u t e p a r a m e t e r
sub a1,al : p r e p a r ef o rm e r g i n gw i t h fill c h a r
mov bx.Cbpl.Filler : l o a d BX w i t h fill c h a r
bh,bh
sub : p r e p a r ef o rm e r g i n gw i t ha t t r i b u t e
a x . b xo r : c o m b i n ea t t r i b u t ea n d fill c h a r
l edsi . d w o rpd[t br p l . B u f O f s : l o a d E S : D I w i t ht a r g e tb u f f e r
;segment:offset
mov cx.Cbpl.BufSize :load CX w i t hb u f f e rs i z e
srt eo ps w :fill t h eb u f f e r
Bye :
POP bP : r e s t o r e c a l l e r ’ s BP
E nr edtM r k - R e t A d d r - 2 : r e t u r n .c l e a r i n gt h e p a r m sf r o mt h es t a c k
Clears endD

Now we’re down to 40 bytes-more than 20 percent smaller than the original code.
That’s pretty much it for simple instruction-substitution optimizations. Now let’s look
for instruction-rearrangement optimizations.
It seems strange to load a word value into AX and then throw away AL. Likewise, it
seems strange to load a word value into BX and then throw away BH. However, those
steps are necessary becausethe two modified word valuesare ORed into a single char-
acter/attribute word value that is then used to fill the target buffer.
Let’s step back and see what this code really does, though. All it does in the end is
load one byte addressed relative to BP into AH and anotherbyte addressed relative
to BP into AL. Heck, we can just do that directly! Presto-we’ve saved another 6
bytes, and turned two word-sized memory accesses into byte-sized memory accesses
as well. Listing22.5 shows the new code.

41 8 Chapter 22
LISTING22.5122-5.ASM
Clears p r once a r
push bp ; s a v ec a l l e r ' s BP
mov bp,sp ; p o i n tt os t a c kf r a m e
cmp word p t r Cbpl.BufSeg.0 ; s k i pt h e fill i f a n u l l
jne Start : p o i n t e r i s passed
cmp word p t r[ b p l . B u f O f s . O
je Bye
S t a r t :c l d ;make STOSW countup
mov a h , b y t pe t[ rb p ] . A t t r i b [ l l ;load AH w i t h a t t r i b u t e
mov a1 , b y t ep t r[ b p l . F i l l e r ;load AL w i t h fill c h a r
l edsi . d w o rpd[t br p ] . B u f O f s ;load ES:OI w i t h t a r g e t b u f f e r s e g m e n t : o f f s e t
mov cx,Cbpl.BufSize ;load CX withbuffersize
s rt eo ps w ;fill t h eb u f f e r
Bye :
POP bp ; r e s t o r e c a l l e r ' s BP
E nr edtM r k - R e t A d d r - 2 : r e t u r n .c l e a r i n gt h e p a r m sf r o mt h es t a c k
e n d pC l e a r s

(We could getrid ofyet anotherinstruction by having the calling code pack both the
attribute and thefill valueinto thesame word, but that's not partof the specification
for this particular routine.)
Another nifty instruction-rearrangement trick saves 6 more bytes. Clears checks to see
whether the far pointer is null (zero) at the start
of the routine.. .thenloads and uses
that same far pointer later on.Let's get that pointerinto registers and keep it there;
that way we can check to seewhether it's null with a single comparison, and can use it
later without having toreload it from memory.This technique is shown in Listing22.6.

LISTING 22.6122-6.ASM
Clears p r once a r
push bp ; s a v ec a l l e r ' s BP
mov bp,sp ; p o i n tt os t a c kf r a m e
les d i . d w o r dp t r[ b p ] . B u f O f s ;load ES:DI w i t ht a r g e tb u f f e r
;segment:offset
mov ax.es ;putsegmentwhere we c a n t e s t it
or ax.di ; i s i t a n u l lp o i n t e r ?
je Bye ;yes. s o w e ' r ed o n e
Start: c l d ;make STOSW countup
mov a h . b y tpeCt rb p l . A t t r i b C 1 1 ; l o a d AH w i t h a t t r i b u t e
mov a l . b y t pe tCr b p ] . F i l l e r ; l o a d AL w i t h fill c h a r
mov cx.[bpl.BufSize :load CX w i t hb u f f e rs i z e
s rt oe spw :fill t h eb u f f e r
Bye :
POP bp ; r e s t o r ec a l l e r ' s BP
E nr edtM r k - R e t A d d r - 2 : r e t u r n ,c l e a r i n gt h e p a r m sf r o mt h es t a c k
e n d pC l e a r s

Well. Now we're down to 28 bytes, having reduced the size of this subroutine by
nearly 50 percent. Only 13 instructions remain. Realistically, howmuch smaller can
we make this code?
About one-third smaller yet, as it turnsout-but in order to do that, we must stretch
our minds and use the 8088's instructions inunusual ways. Let me ask youthis: What
do most of the instructionsin the currentversion of Clears do?

Zenning and the Flexible Mind 41 9


Previous Home Next
They either load parameters from the stack frame or set up the registers so that the
parameters can be accessed. Mind you, there’s nothing wrong with the stack-frame-
oriented instructions used in Clears;those instructions access the stack frame in a
highly efficientway, exactly asthe designers of the 8088 intended, and justas the code
generated by a high-level language would. That means that we aren’t going to be able
to improvethe code if we don’t bend therules a bit.
Let’s think ...the parameters are sitting on the stack, and most of our instruction
bytes are beingused to read bytes off the stack with BP-based addressing.. .we need a
more efficient way to address the stack...the stack.. .THE STACK!
Ye gods! That’s easy-we can use the stuck pointer to address thestack rather thanBP.
While it’s true that the stack pointer can’t be used for mod-reg-rm addressing, as BP
can, it can be used to pop data off the stack-and POP is a one-byte instruction.
Instructions don’t getany shorter than that.
There is one detail to be taken care of before we can put ourplan into action: The
return address-the address of the calling code-is on top of the stack, so the pa-
rameters we want can’t be reached with POP. That’s easily solved, however-we’ll
just pop the return address into an unused register, then branch through thatregis-
ter when we’re done, as we learned to do in Chapter 14. As we pop theparameters,
we’ll also be removing them from the stack, thereby neatly avoiding the need to
discard them when it’s time to return.
With that problem dealtwith, Listing 22.7 shows the Zennedversion of Clears.

LISTING 22.7 122-7.ASM


nC
epalreor acr s
POP dx ; g e tt h er e t u r na d d r e s s
POP ax ; p u t fill c h a ri n t o AL
POP bx ; g e tt h ea t t r i b u t e
mov ah.bh ;putattributeinto AH
POP cx ; g e tt h eb u f f e rs i z e
pop di :gettheoffsetofthebufferorigin
POP es : g e tt h es e g m e n to ft h eb u f f e ro r i g i n
mov bx.es ;putthesegmentwhere we c a n t e s t it
b x . doi r ;nul 1 p o i n t e r ?
je Bye ;yes. so we‘redone
cl d :make STOSW countup
s rt eo ps w :do t h e s t r i n g s t o r e
Bye:
dxjrnp : r cet hat tuelolrci nno gd e
e n d pC l e a r s

At long last, we’re down to the baremetal. This version of Clears isjust 19 bytes long.
That’s just 37 percent as long as the original version,without any change whatsoarer in the
&nctzonuZiCy that CbarS maka available to the culling code. The code is bound to run a bit
faster too, giventhat there arefar fewer instruction bytes and fewer memory accesses.
All in all,the Zenned version ofClearsis a vast improvement over the original. Probably
not thebest possible implementation-never say never!-but an awfully good one.

420 Chapter 22

You might also like