Professional Documents
Culture Documents
chapter 22
Lennmg
In Jeff Duntemann’s excellent book Bodand PascaZFrum Square One (Random House,
1993),there’s asmall assemblysubroutine that’s designed to be called from a Turbo
415
Pascal program in order to fill the screen or a system-memory screen buffer with a
specified character/attribute pair in text mode. This subroutine involves only 21
instructions and works perfectly well; however, with whatwe know, we can compact
the subroutine tremendously and speed it up a bit as well. To coin a verb, we can
“Zen” this already-tight assembly code to an astonishing degree. In the process, I
hope you’ll get a feel forhow advanced your assembly skills havebecome.
Jeff‘s original code follows as Listing 22.1(withsome text converted to lowercase in
order to match the style of this book), but thecomments are mine.
Clears p r once a r
push bP caller’s ;save BP
mov bP. SP f r a mset a c k t o : p o i n t
cmp word p t r C b p l . B u f S e g:.s0ktihpe fill i f a n u l l
jne Start : p o i n t e r i s passed
cmp word p t r Cbpl.BufOfs,O
je Bye
Start: cld :make STOSW countup
mov a x . C b p l . A t :t lroi ba d AX waittthr i bpuat er a m e t e r
and a x . O f; fpm 0r 0efhprw
ogari itrnheg fill c h a r
mov b x . [ b p l . F i :l l oe ar d BX w i t h fill c h a r
and ambt :tew xpr r.irf0gbteohfiuprn
f thager e
o r : caaotantmrxdib.bbiunxtee fill c h a r
mov bx,Cbpl.BufO : l of sa d DI w ti tahr gbeutf foef rf s e t
mov d i ,bx
mov b x . [ b p l . B u f S: leoga d ES w ti tahr gbeutf f e r segment
mov e s , bx
mov c x . C b p l . B u f S;ilzoea d C X w bi tuhf f e r size
rep stosw b u f;fill
fer the
p o i nstmov
Bye: teoarr;cri kge isnst aoplr. eb p
POP bp ; and c a l l e r ’ s BP
ret E n d M r k - R e t A d d r:-r2e t u rcnl .e a r i nt hpgea r m f rsot m
hs tea c k
Clears endp
The first thing you’ll notice about Listing 22.1 is that Clears uses a REP STOSW
instruction. That means that we’re not going to improve performance by any great
amount, no matter how clever we are. While we can eliminate some cycles, the bulk
of the work in Clears is done by that one repeated string instruction, and there’s no
way to improve on that.
Does that mean thatListing 22.1 is as good as it can be? Hardly. While the speed of
Clears is verygood, there’s another side to the optimization equation: size. The whole of
Clears is 52 bytes long as it stands-but, as we’ll see, that size is hardly set in stone.
41 6 Chapter 22
Where do we begin with Clears? For starters, there’s an instruction in there that
serves no earthly purpose-MOV SP,BP. SP is guaranteed to be equal to BP at that
point anyway, so why reload it with the same value? Removingthat instruction saves
us two bytes.
Well, that was certainly easy enough! We’re not going to find any more totally non-
functional instructions in Clears, however, so let’s get on to someserious optimizing.
We’ll look first for cases where we know of better instructions for particular tasks
than those that were chosen. For example, there’s no need to load any register,
whether segment or general-purpose, through BX; we can eliminate two instruc-
tions by loading ES and DI directly as shownin Listing 22.2.
LISTING22.2122-2.ASM
Clears p r once a r
push b p : s a v ec a l l e r ’ s BP
mov bp. sp : p o i n tt os t a c kf r a m e
cmp word p t r Cbpl.BufSeg.0 : s k i pt h e fill i f a n u l l
jne Start : p o i n t e ri sp a s s e d
cmp word p t r[ b p l . B u f O f s . O
je Bye
S t a r t :c l d :make STOSW c o u n t up
mov ax.Cbpl.Attrib : l o a d AX w i t h a t t r i b u t e p a r a m e t e r
ax.Off00h
and : p r e p a r ef o rm e r g i n gw i t h fill c h a r
mov b x . [ b p l .F i 1 1 e r : l o a d BX w i t h fill c h a r
and bx.0ffh : p r e p a r ef o rm e r g i n gw i t ha t t r i b u t e
a x . b xo r : c o m b i n ea t t r i b u t ea n d fill c h a r
mov d i .Cbp].BufOfs ;load D I w i t ht a r g e tb u f f e ro f f s e t
mov es,[bpl.BufSeg : l o a d ES w i t h t a r g e t b u f f e r segment
mov cx.[bpl.BufSize :load CX w i t hb u f f e rs i z e
s troespw :fill t h eb u f f e r
Bye :
POP bP : r e s t o r e c a l l e r ’ s BP
ret EndMrk-RetAddr-2 : r e t u r n .c l e a r i n gt h e parms f r o mt h es t a c k
C1endp
ears
(The OnStack structure definition doesn’t change in any of our examples, so I’m
not going clutter up this chapter by reproducing it for each new version ofClears.)
Okay, loading ES and DI directly saves another four bytes. We’ve squeezed a total of
6 bytes-about 11 percent-out of Clears. What next?
Well, LES would servebetter than two MOV instructions for loading ES and DI as shown
in Listing 22.3.
LISTING22.3122-3.ASM
nC
epalreor acr s
r’s :save bp push BP
mov bp,sp : p o i nf rt a tmoe s t a c k
cmp word p t r C b p l . B u f S e g:.s0kt ihpe fill i f a n u l l
Start jne passed i s: p o i n t e r
cmp word p t[ rb p l . B u f O f s , O
je Bye
cld Start: :makeu pSTOSW c o u n t
mov ax.[bpl.Attrib : l o a d AX w i t ha t t r i b u t ep a r a m e t e r
ax.Off00h
and : p r e p a r ef o rm e r g i n gw i t h fill c h a r
That’s good for another three bytes. We’re downto 43 bytes, and counting.
We can save 3 more bytes by clearing the low and high bytes of AX and BX, respectively,
by using SUB reg8,reg8rather than ANDing 16-bit values as shown in Listing 22.4.
LISTING 22.4122-4.ASM
Clears p r once a r
push bp : s a v ec a l l e r ’ s B P
mov bp.sp : p o i n tt os t a c kf r a m e
cmp word p t r Cbpl.BufSeg.0 : s k i pt h e fill i f a n u l l
jne Start : p o i n t e ri sp a s s e d
cmp word p t rC b p l . B u f O f s . 0
je Bye
Start: cld ;make STOSW countup
mov ax.[bpl.Attrib : l o a d AX w i t h a t t r i b u t e p a r a m e t e r
sub a1,al : p r e p a r ef o rm e r g i n gw i t h fill c h a r
mov bx.Cbpl.Filler : l o a d BX w i t h fill c h a r
bh,bh
sub : p r e p a r ef o rm e r g i n gw i t ha t t r i b u t e
a x . b xo r : c o m b i n ea t t r i b u t ea n d fill c h a r
l edsi . d w o rpd[t br p l . B u f O f s : l o a d E S : D I w i t ht a r g e tb u f f e r
;segment:offset
mov cx.Cbpl.BufSize :load CX w i t hb u f f e rs i z e
srt eo ps w :fill t h eb u f f e r
Bye :
POP bP : r e s t o r e c a l l e r ’ s BP
E nr edtM r k - R e t A d d r - 2 : r e t u r n .c l e a r i n gt h e p a r m sf r o mt h es t a c k
Clears endD
Now we’re down to 40 bytes-more than 20 percent smaller than the original code.
That’s pretty much it for simple instruction-substitution optimizations. Now let’s look
for instruction-rearrangement optimizations.
It seems strange to load a word value into AX and then throw away AL. Likewise, it
seems strange to load a word value into BX and then throw away BH. However, those
steps are necessary becausethe two modified word valuesare ORed into a single char-
acter/attribute word value that is then used to fill the target buffer.
Let’s step back and see what this code really does, though. All it does in the end is
load one byte addressed relative to BP into AH and anotherbyte addressed relative
to BP into AL. Heck, we can just do that directly! Presto-we’ve saved another 6
bytes, and turned two word-sized memory accesses into byte-sized memory accesses
as well. Listing22.5 shows the new code.
41 8 Chapter 22
LISTING22.5122-5.ASM
Clears p r once a r
push bp ; s a v ec a l l e r ' s BP
mov bp,sp ; p o i n tt os t a c kf r a m e
cmp word p t r Cbpl.BufSeg.0 ; s k i pt h e fill i f a n u l l
jne Start : p o i n t e r i s passed
cmp word p t r[ b p l . B u f O f s . O
je Bye
S t a r t :c l d ;make STOSW countup
mov a h , b y t pe t[ rb p ] . A t t r i b [ l l ;load AH w i t h a t t r i b u t e
mov a1 , b y t ep t r[ b p l . F i l l e r ;load AL w i t h fill c h a r
l edsi . d w o rpd[t br p ] . B u f O f s ;load ES:OI w i t h t a r g e t b u f f e r s e g m e n t : o f f s e t
mov cx,Cbpl.BufSize ;load CX withbuffersize
s rt eo ps w ;fill t h eb u f f e r
Bye :
POP bp ; r e s t o r e c a l l e r ' s BP
E nr edtM r k - R e t A d d r - 2 : r e t u r n .c l e a r i n gt h e p a r m sf r o mt h es t a c k
e n d pC l e a r s
(We could getrid ofyet anotherinstruction by having the calling code pack both the
attribute and thefill valueinto thesame word, but that's not partof the specification
for this particular routine.)
Another nifty instruction-rearrangement trick saves 6 more bytes. Clears checks to see
whether the far pointer is null (zero) at the start
of the routine.. .thenloads and uses
that same far pointer later on.Let's get that pointerinto registers and keep it there;
that way we can check to seewhether it's null with a single comparison, and can use it
later without having toreload it from memory.This technique is shown in Listing22.6.
LISTING 22.6122-6.ASM
Clears p r once a r
push bp ; s a v ec a l l e r ' s BP
mov bp,sp ; p o i n tt os t a c kf r a m e
les d i . d w o r dp t r[ b p ] . B u f O f s ;load ES:DI w i t ht a r g e tb u f f e r
;segment:offset
mov ax.es ;putsegmentwhere we c a n t e s t it
or ax.di ; i s i t a n u l lp o i n t e r ?
je Bye ;yes. s o w e ' r ed o n e
Start: c l d ;make STOSW countup
mov a h . b y tpeCt rb p l . A t t r i b C 1 1 ; l o a d AH w i t h a t t r i b u t e
mov a l . b y t pe tCr b p ] . F i l l e r ; l o a d AL w i t h fill c h a r
mov cx.[bpl.BufSize :load CX w i t hb u f f e rs i z e
s rt oe spw :fill t h eb u f f e r
Bye :
POP bp ; r e s t o r ec a l l e r ' s BP
E nr edtM r k - R e t A d d r - 2 : r e t u r n ,c l e a r i n gt h e p a r m sf r o mt h es t a c k
e n d pC l e a r s
Well. Now we're down to 28 bytes, having reduced the size of this subroutine by
nearly 50 percent. Only 13 instructions remain. Realistically, howmuch smaller can
we make this code?
About one-third smaller yet, as it turnsout-but in order to do that, we must stretch
our minds and use the 8088's instructions inunusual ways. Let me ask youthis: What
do most of the instructionsin the currentversion of Clears do?
At long last, we’re down to the baremetal. This version of Clears isjust 19 bytes long.
That’s just 37 percent as long as the original version,without any change whatsoarer in the
&nctzonuZiCy that CbarS maka available to the culling code. The code is bound to run a bit
faster too, giventhat there arefar fewer instruction bytes and fewer memory accesses.
All in all,the Zenned version ofClearsis a vast improvement over the original. Probably
not thebest possible implementation-never say never!-but an awfully good one.
420 Chapter 22