
Singapore ICCS/ISITA '92

Studies on Protocol Layer Residency in an Intelligent Network Interface Unit
Amit S. Bapat
S.V. Raghavan

Department of Computer Science & Engineering


Indian Institute of Technology, Madras, India

ABSTRACT

To exploit the high data rates offered by the physical media, efficient protocol processing in every network node is a must. Network Interface Units (NIUs) have a significant role in this regard. Balanced protocol partitioning and processing in a host machine and an NIU is a key to achieving faster protocol processing in a network node. This paper discusses various issues related to partitioning a protocol stack between a host machine and an NIU. We describe the experiments carried out to understand the partitioning of the OSI protocol layers between a host machine and an NIU. OSINET, a networking software developed at IIT Madras, is used in the experiments, which were designed specifically for this purpose. The implementation details of the work, along with the performance measurements, are presented.

1.0 Introduction

In the networking environment, the data throughput available to an application program has been considerably lower than the raw data rate of the underlying physical medium. Protocol processing overhead [2] and lack of proper architectural support [5] are considered to be the factors causing this inefficiency. We concentrate our study on protocol processing in a network node. With the advent of hardware technology, the so-called 'intelligent NIUs' are fast becoming commonplace. Most such NIUs have at least one on-board processor that performs some protocol functions and controls the protocol processing hardware. Such an intelligent NIU forms a multiprocessor system with the host machine. Protocol processing has to be properly partitioned between the processors of this system to achieve optimum performance from the network node. This paper addresses the issues in partitioning the OSI protocol layers in a typical host-NIU system. We describe the experiments, carried out using the OSINET protocol software and an intelligent NIU, to support our discussion.

2.0 Related Work

Several designs have been proposed for a high performance NIU that off-loads a host machine by performing certain time-critical protocol functions in hardware. The designs range from the use of protocol-specific hardware to implementing protocols in silicon. Kanakia et al [5] and Chesson [1] have designed protocol-specific NIUs, which perform some transport level functions in hardware. Idoue et al [4] have a design which uses a general purpose processor to execute protocol layers in the NIU. They propose to execute the entire OSI stack on the NIU. Giarrizzo et al [3] have a design which employs multiple processors on the NIU, executing individual layers on separate processors. These designs can off-load the complete protocol stack from the host machine, but only prototypes or simulation studies are reported. All these efforts differ in the amount of protocol functions being executed on the NIU. There appears to be no general consensus on how much protocol code should reside on the NIU. These designs concentrate on the NIU, and try to make protocol processing on the NIU more efficient. In this paper, we try to study the host machine and the NIU as a single system, to optimize the performance of a network node.

3.0 Typical Host-NIU System

NIUs on a network can have different types of host interface mechanisms. An NIU can have a character I/O type, a DMA type, or a shared memory type of interface with the host machine. In the character I/O type of interface, the interface overhead is borne by the host machine. In the DMA type of interface, the job of carrying out the DMA transfer generally lies with the NIU. The shared memory type of interface avoids an extra copy of data and offers the least interface overhead. For every mechanism, there is a certain overhead of data exchange across the interface, which is borne either by the NIU or by the host.

Various cases of host-NIU operating environments are depicted in figure 1. The host machine can run a unitasking or a multitasking operating system. In a unitasking host, the protocol layers may run as a single monolithic program, as shown in figures 1a and 1d, or they may run as multiple threads of a single program, as in figures 1b and 1e. A multitasking host will generally run protocol layers as separate processes, as in figures 1c and 1f. The processor on the NIU may execute some protocol functions in software, apart from controlling the hardware operations. The protocol software running on the NIU may run as a single process, as in figures 1a, 1b and 1c, or it may run different layers as separate threads of a single process, as in figures 1d, 1e and 1f. Some implementations can also have multiple processors on the NIU. The operating environment of the host decides the typical processing time for protocol functions. In a unitasking host, the host processor is mainly utilized for protocol processing, so the protocol processing time can be taken as fixed. In a multitasking host, the processing time will depend on factors like the scheduling policy of the host and its processing load.

For our discussion, we consider a typical NIU architecture as shown in figure 2. Depending on the implementation, some of the blocks of the figure may be void and some may have varying complexity. Protocol processing time on the NIU will depend on its execution speed and the hardware resources available.

Hence, a typical host-NIU system can be specified by the host protocol processing time, the NIU protocol processing time, and the overhead at the interface.
4.0 Issues in Partitioning the Protocol Software

In this section, we discuss the issues that have to be addressed when protocol functions are partitioned among the processors of the host-NIU system.


[Figure 1. Host-NIU Operating Environments. Six panels combine a host stack style with an NIU stack style: (1a) monolithic host stack, monolithic NIU stack; (1b) multi-thread host stack, monolithic NIU stack; (1c) multi-tasking host stack, monolithic NIU stack; (1d) monolithic host stack, multi-thread NIU stack; (1e) multi-thread host stack, multi-thread NIU stack; (1f) multi-tasking host stack, multi-thread NIU stack.]

4.1 Overhead due to Protocol Partitioning

When all the protocol layers are running on a single processor, passing of control and data between them is easier. Data may be passed between layers using pointers in shared memory or through a system area. The underlying operating system provides suitable primitives for inter-layer communication and synchronization. When the protocol stack is partitioned, it executes on separate processors. This may involve an additional copy of data, as data is exchanged across the processors. The overhead of this copy is borne by the host processor or by the NIU processor. Synchronization between the two processors has to be done explicitly, through interrupts or through some polling mechanism. If there are frequent interrupts from the NIU, then the context switching overhead can be significant. In the polling type of synchronization, processor cycles are wasted. An improvement in the performance of the overall system can be seen only when these additional costs due to partitioning are overcome by the performance advantages gained by splitting the stack.
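As an illustration of the polling alternative, the fragment below busy-waits on a flag in shared memory; the structure, field name and protocol are our own invention, not the interface of any particular NIU.

    /* Hedged sketch of polling-type synchronization: the host spins on a
       flag that the NIU sets once it has consumed the packet handed over.
       'volatile' stops the compiler from caching the shared location. */
    struct handshake {
        volatile int niu_done;   /* set by the NIU, cleared by the host */
    };

    void host_wait_for_niu(struct handshake *hs)
    {
        while (!hs->niu_done)
            ;                    /* host processor cycles wasted here */
        hs->niu_done = 0;        /* re-arm for the next packet        */
    }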
4.2 The Host-NIU Communication

Every NIU uses some local buffer memory. Data packets reside in this buffer memory for a brief transit time, until they are processed by the NIU and transmitted, or they are accepted by the host process and discarded. In the character I/O or DMA type of interface with the host, data packets are copied into this memory, processed and discarded. In the shared memory type of interface, the host and the NIU use the shared memory for forming and processing packets. If either the host or the NIU processing stream is slow, data packets will get queued in this buffer memory. If the difference in their processing times is large, more and more packets will get queued. At some point of time, inter-layer flow control will be enforced, or the faster stream will get blocked on overrunning the buffer space. If the buffer memory is sufficiently large, then blocking may not be observed, but the response time for a packet queued at the interface will still deteriorate due to longer queue lengths. Blocking will remove the temporal parallelism between the host and the NIU processes. Balanced protocol processing by the host and the NIU will reduce the queuing time of data packets at the interface, and will increase the degree of parallelism between the host and the NIU processors.

[Figure 2. Typical NIU Architecture.]
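The queueing behaviour can be pictured with a toy simulation (every number below is invented): a fixed pool of interface buffers fills whenever the host side outpaces the NIU, after which the host blocks.

    #include <stdio.h>

    #define NBUF   8      /* hypothetical number of interface buffers */
    #define NPKTS  100

    int main(void)
    {
        double t_host = 1.0;   /* invented host time per packet (ms) */
        double t_niu  = 2.5;   /* invented NIU time per packet (ms)  */
        double done[NPKTS];    /* instant the NIU frees packet i's buffer */
        double clock = 0.0;
        int i, blocked = 0;

        for (i = 0; i < NPKTS; i++) {
            clock += t_host;                    /* host hands over packet i */
            if (i >= NBUF && done[i - NBUF] > clock) {
                blocked++;                      /* all buffers occupied     */
                clock = done[i - NBUF];         /* host waits for a buffer  */
            }
            /* the NIU starts packet i when it arrives and the NIU is idle */
            done[i] = ((i && done[i - 1] > clock) ? done[i - 1] : clock) + t_niu;
        }
        printf("host blocked %d times; throughput %.2f packets/ms\n",
               blocked, NPKTS / done[NPKTS - 1]);
        return 0;
    }

With these example numbers the host eventually blocks on almost every packet, and the throughput settles at the NIU's rate of 0.4 packets/ms regardless of how fast the host is.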
4.3 Protocol Layer Residency

In the layered approach to protocol processing, each layer successively acts on a data packet. As a data packet is processed by the layers, protocol headers are either added to or removed from the data packet. In a multiprocessor situation (e.g. a host and an NIU), performance can be improved by identifying some temporal parallelism in the processing of data packets. Pipelined execution of protocol functions enables this parallelism. If a pipeline of data packets exists, the host and the NIU processors can concurrently act on this pipeline. If both the processors are continuously busy, the additional cost incurred by partitioning of the protocol layers can be amortized over several overlapping cycles. The partitioning of protocol layers should ensure a moving pipeline of data between the host and the NIU.

If blocking is encountered at the host-NIU interface, further packets exchanged between the host and the NIU are likely to follow a stop-and-wait type of communication, irrespective of the amount of buffer space at the interface. The blocking will remove the temporal parallelism between the host and the NIU processors, and the (moving) pipeline (of data packets) will assume an alternating bursty characteristic. Balanced partitioning of the protocol processing load between the host and the NIU can be used to ensure a (smoothly!) moving pipeline of data packets.
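A back-of-the-envelope version of this argument, again with invented per-layer times: in a smoothly moving two-stage pipeline the steady-state throughput is set by the slower of the two partitions, so the split that balances them wins.

    #include <stdio.h>

    int main(void)
    {
        /* invented per-layer times (ms), application layer first */
        double layer_ms[6] = { 0.9, 0.3, 0.4, 1.1, 0.1, 0.7 };
        double copy_ms = 0.3;     /* invented interface copy overhead */
        int split, i;

        /* split = number of upper layers resident on the host */
        for (split = 1; split < 6; split++) {
            double host = 0.0, niu = copy_ms, slow;
            for (i = 0; i < split; i++) host += layer_ms[i];
            for (i = split; i < 6; i++) niu  += layer_ms[i];
            slow = host > niu ? host : niu;   /* slower pipeline stage */
            printf("host keeps %d layers: host %.1f ms, NIU %.1f ms, "
                   "throughput %.2f pkt/ms\n", split, host, niu, 1.0 / slow);
        }
        return 0;
    }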


5.0 OSINET and its Architecture

OSINET [6] is an implementation of a subset of the ISO-OSI model for LANs. It has a network kernel which provides the basic networking support, and User Applications that run over the kernel. The network kernel consists of the lower six layers of the OSI reference model. The MAC and the LLC type 1 layers form the data link layer. The network layer is null. Transport class 4 is implemented to provide a reliable connection oriented service. The session layer supports dialogue management and synchronization. The presentation layer provides ASN.1 encoding and decoding. The application layer consists of FTAM, CASE and DASE services. A single control loop schedules execution of each layer in round robin fashion.
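The single control loop can be pictured as below; the function names and the layer table are our own illustration of round-robin scheduling, not OSINET's actual source.

    /* Illustrative skeleton of a round-robin layer scheduler in the spirit
       of the OSINET kernel (the network layer is null and is skipped). */
    typedef void (*layer_fn)(void);

    static void mac_layer(void)       { /* MAC processing        */ }
    static void llc_layer(void)       { /* LLC type 1 processing */ }
    static void transport_layer(void) { /* transport class 4     */ }
    static void session_layer(void)   { /* dialogue management   */ }
    static void pres_layer(void)      { /* ASN.1 encode/decode   */ }

    static layer_fn layers[] = {
        mac_layer, llc_layer, transport_layer, session_layer, pres_layer
    };

    void kernel_loop(void)
    {
        int i;
        for (;;)                                 /* single control loop */
            for (i = 0; i < 5; i++)
                layers[i]();                     /* one turn per layer  */
    }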
6.0 The Intelligent NIU

An intelligent NIU, the PC Link2 Network Interface Adapter from Intel Inc., was used in our work. The board is an ethernet interface NIU, providing an i82586 ethernet co-processor, an i80186 processor and 256k bytes of local memory. Any arbitrary protocol software can be downloaded and executed on the NIU. The board represents a general purpose intelligent NIU, and hence was selected for the experiments.

6.1 The Host-NIU Communication

The host machine can access the entire 256k bytes of NIU memory through an 8k window which is mapped into its address space. The NIU memory is shared by the host processor, the on-board 80186 and the 82586. Access to the memory by the host machine is through an 8-bit data path, while the NIU local processors access it through a 16-bit data path. Because of this, and due to the dual-ported implementation, access to the NIU memory is somewhat slower for the host processor, compared to the access to its local memory. The host machine controls the windowing mechanism, and the operation of the NIU, through two control ports mapped in its I/O space. The NIU communicates with the host machine through interrupt signals on the standard PC bus. Apart from this, it cannot interfere with the host operation.
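From the host side, such a windowed access might look like the fragment below. The window base, port number and paging scheme are entirely hypothetical, invented for illustration; the paper does not describe the PC Link2 programming interface at this level.

    #include <stdint.h>

    #define WIN_SIZE   0x2000UL   /* the 8k window                    */
    #define PAGE_PORT  0x300      /* hypothetical control port number */

    extern volatile uint8_t *win_base;              /* mapped 8k window  */
    extern void outb(uint16_t port, uint8_t value); /* assumed primitive */

    /* Read one byte of the 256k NIU memory: select which 8k page the
       window exposes, then read at the offset inside the window. */
    uint8_t niu_read(uint32_t niu_addr)
    {
        outb(PAGE_PORT, (uint8_t)(niu_addr / WIN_SIZE)); /* pick page */
        return win_base[niu_addr % WIN_SIZE];
    }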
7.0 Implementation of Protocol Partitioning

We have performed our work on unitasking host machines running the MS-DOS operating system. The host-NIU operating environment looks like the one in figure 1a. The data and commands are exchanged between the host and the NIU through the shared memory on the NIU.

7.1 Host-NIU Data Exchange

In OSINET, data exchange across the protocol layers is done by reference, using a buffer passing mechanism. Two buffer pools are used. OSINET handles the application layer messages in chunks of 2000 bytes, which are segmented by the transport layer into segments of 1000 bytes each. One of the buffer pools, called the session buffer pool, thus has fixed-size buffers of 2044 bytes each, enough to accommodate the largest application layer message. The other buffer pool, called the transport buffer pool, has buffers of 1518 bytes each, enough to accommodate the largest ethernet frame. The application layer treats large and small data differently, and copies it into a buffer from the appropriate buffer pool. This avoids an extra copy of data. The host and the NIU protocol partitions use a similar buffer structure in their own address spaces. The data exchange between the host and the NIU takes place through the buffer pools defined in the NIU memory space. The host protocol partition passes data to the NIU partition by writing it into the appropriate buffer pool in the NIU memory. Handshake variables are used, which indicate to the host protocol partition where the data meant for the NIU is to be written. This data is processed by the NIU, and handed over to the ethernet coprocessor for transmission on the network. The data on the network is received by the co-processor into a transport buffer pool buffer. This data is processed by the NIU and reassembled into a session buffer pool buffer if necessary. The NIU passes this data to the host by posting the buffer number and the buffer pool identity into the handshake variables. Figure 3 shows the typical data flow between the host and the NIU. The NIU offers a shared memory option, but we still prefer a copy of data across the interface. This is because of the slower access to the NIU memory from the host machine, as explained in the earlier section.
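A minimal sketch of the handshake just described. The paper states only that buffer numbers and pool identities are posted through handshake variables in the NIU memory; the field names and layout below are our own.

    #include <stdint.h>

    #define SESSION_POOL   0        /* 2044-byte buffers */
    #define TRANSPORT_POOL 1        /* 1518-byte buffers */

    /* Hypothetical handshake area in the shared NIU memory. */
    struct handshake_area {
        volatile uint16_t tx_pool;  /* pool the host must write into next   */
        volatile uint16_t tx_buf;   /* free buffer number for host data     */
        volatile uint16_t rx_pool;  /* pool holding data meant for the host */
        volatile uint16_t rx_buf;   /* buffer number posted by the NIU      */
        volatile uint16_t rx_ready; /* set by the NIU when rx_buf is valid  */
    };

    /* Host side: copy an outgoing message into the buffer advertised by
       the NIU -- the single data copy across the host-NIU interface. */
    void host_send(struct handshake_area *hs, void *pool_base[],
                   const unsigned buf_size[], const uint8_t *msg, unsigned len)
    {
        uint8_t *dst = (uint8_t *)pool_base[hs->tx_pool]
                     + (unsigned long)hs->tx_buf * buf_size[hs->tx_pool];
        while (len--)
            *dst++ = *msg++;
    }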

[Figure 3. Host-NIU Data Movement. The diagram shows the memory area of the host protocol process, the host memory, the NIU memory with its buffer pools, and the ethernet coprocessor: (1) the network; (2) large transmit data segmented; (3) large received data reassembled; (4) transmit and receive data linked to coprocessor data structures.]
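Step 2 of the figure, the segmentation of a large transmit message, amounts to the following arithmetic (a toy rendering, using the chunk and segment sizes quoted in section 7.1):

    #include <stdio.h>

    #define APP_CHUNK 2000u   /* application layer chunk (section 7.1) */
    #define TP_SEG    1000u   /* transport layer segment (section 7.1) */

    /* Toy illustration of transmit-side segmentation: a message sitting
       in a session buffer is cut into transport-sized segments. */
    int main(void)
    {
        unsigned msg_len = APP_CHUNK, offset, seg;

        for (offset = 0; offset < msg_len; offset += TP_SEG) {
            seg = msg_len - offset < TP_SEG ? msg_len - offset : TP_SEG;
            printf("segment at offset %4u: %u bytes\n", offset, seg);
        }
        return 0;
    }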


7.2 Organization of the NIU Software

The software running in the NIU is organized into the following three modules.
1. Host Interface Module: This module handles interaction with the host machine. A corresponding module runs on the host machine.
2. OSI Protocol Software: This module consists of the protocol layers of the OSINET software.
3. LAN Interface Module: This module consists of software that handles the ethernet co-processor.
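An illustrative skeleton of this organization (ours, not the actual NIU code): a top-level loop on the on-board processor gives each module a turn.

    /* Sketch of the three-module NIU software organization. */
    static void host_interface_poll(void) { /* commands/data from the host  */ }
    static void osi_protocol_run(void)    { /* resident OSINET layers       */ }
    static void lan_interface_poll(void)  { /* drive the i82586 coprocessor */ }

    void niu_main(void)
    {
        for (;;) {
            host_interface_poll();
            osi_protocol_run();
            lan_interface_poll();
        }
    }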
8.0 Performance Studies

The experiments to measure the performance of a network node were carried out on IBM PC XT and PC AT compatible machines, running the MS-DOS operating system. The machines were connected on a 10 Mbps ethernet, which forms the backbone of the departmental LAN at the Computer Science department of IIT Madras.
8.1 Design of the Experiments

We intend to study protocol layer partitioning in different host-NIU systems and verify that the optimum partitioning varies from one host-NIU system to another. An 8 MHz PC XT host with an 8 MHz NIU, and a 25 MHz PC AT host with an 8 MHz NIU, provide the two host-NIU systems under study. Throughput offered by the host-NIU system was chosen as the performance parameter. A memory-to-memory file transfer program, directly accessing the session layer services, was used for the measurements. We consider the following three cases of OSI layer partitioning.
I. Only the MAC layer executes in the NIU.
II. Layers up to the transport layer reside in the NIU.
III. Layers up to the session layer reside in the NIU.

Each case of protocol partitioning represents a different load on the host and the NIU processors. To compute the processing load on each processor, we measured the typical processing time for the individual protocol layers in each of the host and the NIU, and the overhead at the interface. With these timings, the partitioning of the protocol processing load was calculated. Throughput offered by the host-NIU system was measured for application message sizes of 16 to 1024 bytes.
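As an example of this load calculation (the per-layer timings below are invented; the paper reports only the resulting percentages), the host's share for a given case is its partition's processing time over the total, including the interface overhead:

    #include <stdio.h>

    int main(void)
    {
        /* invented per-layer times (ms): app, pres, sess, tp4, llc, mac */
        double layer_ms[6] = { 0.8, 0.5, 0.2, 1.2, 0.4, 0.3 };
        double overhead_ms = 0.4;  /* invented interface overhead        */
        int on_host = 3;           /* case II: app/pres/sess on the host */
        double host = 0.0, total = overhead_ms;
        int i;

        for (i = 0; i < 6; i++) total += layer_ms[i];
        for (i = 0; i < on_host; i++) host += layer_ms[i];
        printf("host share of the processing load: %.0f%%\n",
               100.0 * host / total);
        return 0;
    }

In the actual measurements the host and the NIU timings for the same layer differ, so each partition would be summed with its own processor's timings.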
8.2 Observations

By partitioning the stack we incur the additional cost of a data copy across the interface. We are using a general purpose NIU, which is not faster than the host machine. The NIU does not perform any function in hardware, and so the complete protocol is executed in software on the NIU processor. In such an environment, performance gain is expected only due to parallel operation of the host and the NIU processors.

Figure 4 shows the relative performance obtained in the three cases of protocol partitioning. In each host-NIU case, the throughput obtained for an application message size of 16 bytes with only the MAC layer executing on the NIU is treated as unity, and the throughput obtained in the other cases is plotted with reference to this throughput. Thus, this ratio plot gives the relative performance of the host-NIU system for various application message sizes. In the 8 MHz host machine, for case I of layer partitioning, the host processor has about 80% of the processing load. The NIU processor remains underloaded and so there is a lesser degree of parallelism between the two processors. In case II, the host shares about 46% of the processing load, while in case III, the host has about 44% of the processing load. In the data exchange stage of a connection, the session layer has minimal tasks, so cases II and III reflect similar partitioning of the processing load.

[Figure 4. Comparative Throughput. Relative throughput plotted against application message size (10 to 1000 bytes) for six curves: AT,MacNIU; AT,TransNIU; AT,SessNIU; XT,MacNIU; XT,TransNIU; XT,SessNIU. AT: 25 MHz host machine, XT: 8 MHz host machine.]


In these cases, the protocol partitions are well balanced in terms of their execution times. No blocking of the host process is observed at the interface. The higher degree of parallelism between the host and the NIU processors results in higher throughput.

In the 25 MHz host machine, the protocol layers are migrated from a faster host machine to a slower NIU. Case I of protocol partitioning represents a 46% load on the host, while case II and case III represent 23% and 21% load on the host machine respectively. Thus, case I reflects a balanced partitioning of the protocol processing load. For this case, no blocking was observed at the host-NIU interface. This results in higher throughput than the other two cases of partitioning. In cases II and III, the host process, having less processing load, queues up packets at the interface at a faster rate than the NIU can process them. Hence, it runs out of the buffer space in the NIU and gets blocked at the interface for every packet transferred to the NIU. Thus, the host-NIU communication reduces to a stop-and-wait type of protocol. The host process has to wait for a buffer, and the throughput of the system drops compared to that obtained for case I. These two cases emphasize the loss in performance caused by blocking of the host process. The blocking can be avoided by balancing the protocol processing load.

The performance gain observed due to balanced protocol load partitioning is available for all sizes of application message. Our experimental results show that, by appropriate partitioning of protocol layers, temporal parallelism between the host and the NIU processors can be exploited and the throughput of the network node can be improved. For the 8 MHz host machine, up to 4 times increase in throughput is obtained due to the balanced sharing of protocol load in cases II and III. For the 25 MHz host machine, the blocking at the interface observed in cases II and III of layer partitioning reduces the throughput by a factor of 2 compared to the throughput obtained in case I. Balancing the protocol processing load suggests different protocol layer residency for different host-NIU pairs.
9.0 Summary

The analysis of the host-NIU system presented in this paper is applicable to any of the host-NIU systems of figure 1. No assumption is made about the host machine, the NIU or the protocol software. The advantages due to balanced protocol load sharing hold good for a typical host-NIU pair of a network node. Our implementation used a general purpose NIU which gave a performance gain by suitably off-loading the host machine. Up to four times improvement in the throughput of a host-NIU system was obtained, by exploiting temporal parallelism between the host and the NIU processors. Imbalance in the partitioning of the protocol processing load deteriorated the performance by a factor of two in another host-NIU system.

In a high performance NIU, migrating the protocol functions to the efficient NIU may give better results, even in the absence of any parallelism. But the performance may be further improved if the host and the NIU processors exhibit a higher degree of parallelism. Balancing the host and the NIU protocol processing times will give different protocol layer residency solutions in different host-NIU pairs. This is being investigated further.

References

[1] G. Chesson, "XTP/PE Design Considerations", in Protocols for High Speed Networks, H. Rudin & R. Williamson (Eds), Elsevier Science Publ., 1989, pp 27-33.

[2] D. Clark et al, "Architectural Considerations for a New Generation of Protocols", in Proc. of ACM SIGCOMM '90, 1990, pp 200-208.

[3] D. Giarrizzo et al, "High Speed Parallel Protocol Implementation", in Protocols for High Speed Networks, H. Rudin & R. Williamson (Eds), Elsevier Science Publ., 1989, pp 27-33.

[4] A. Idoue et al, "Design and Implementation of OSI Communication Board for Personal Computers and Workstations", in Proc. of ICCC '90, New Delhi, India, 1990, pp 585-592.

[5] H. Kanakia et al, "The VMP Network Adapter Board (NAB): High Performance Network Communication for Multiprocessors", in Proc. of ACM SIGCOMM '88, 1988, pp 175-187.

[6] S.V. Raghavan et al, "OSI Protocol Suite Implementation - An Indian Experience", in Proc. of INFOCOM '91, 1991, pp 41-78.

