A Thesis
Presented to
the Faculty of the California Polytechnic State University
San Luis Obispo
In Partial Fulfillment
of the Requirements for the Degree
Master of Science in Computer Science
by
Michael S. Watts
June 2005
Multi-Threaded End-to-End Applications on Network Processors
Copyright
© 2005
by
Michael S. Watts
APPROVAL PAGE
Abstract
by
Michael S. Watts
High speed networks put a heavy load on network processors; optimization therefore requires the developer to utilize the available parallelism. To explore this parallelism, we developed a set of end-to-end benchmark applications and a generic network processor simulator modeled after the Intel IXP1200. Using these tools, we show that the characteristics of individual kernels are not sufficient to predict end-to-end performance and that relying on such data can lead to sub-optimal parallelization.
Contents

1 Introduction
2 Related Work
  2.1 Network Processors
    2.1.1 Intel IXP1200
  2.2 Network Processor Simulators
    2.2.1 SimpleScalar
    2.2.2 PacketBench
  2.3 Benchmarks
    2.3.1 MiBench
    2.3.2 CommBench
    2.3.3 NetBench
  2.4 Application Frameworks
    2.4.1 Click
    2.4.2 NP-Click
    2.4.3 NEPAL
    2.4.4 NetBind
3 The Simulator
  3.1 Processing Units
  3.2 Memory Structure
  3.3 Methods of Use
  3.4 Application Development
4 Benchmark Applications
  4.1 Message Digest
  4.2 URL-Based Switch
  4.3 Advanced Encryption Standard
5 Results
  5.1 Isolation Tests
    5.1.1 MD5
    5.1.2 URL
    5.1.3 AES
    5.1.4 Isolation Analysis
  5.2 Shared Tests
    5.2.1 MD5
    5.2.2 URL
    5.2.3 AES
    5.2.4 Shared Analysis
  5.3 Static Tests
  5.4 Dynamic Tests
  5.5 Analysis
6 Conclusion
7 Future Work
Bibliography
A Acronyms
Chapter 1
Introduction
Network devices have traditionally been built on Application Specific Integrated Circuit (ASIC) chips, which provide high performance but low flexibility, and are beginning to use programmable processors instead. Network processors aim to bridge the gap between speed and flexibility by taking advantage of the benefits of both ASICs and general purpose processors. There is no single unifying characteristic that allows all network processors to accomplish this goal. However, there are several major strategies employed to bridge the gap, including specialized hardware mechanisms and peripherals [24]. Network processors have made it possible to deploy complex applications into the network at nodes that previously acted only as routers and switches.
High speed networks put a heavy load on network processors; optimization therefore depends on utilizing the available parallelism. This thesis explores that parallelism in the context of end-to-end applications.
In the context of this thesis, kernels are programs that carry out a single task. This task is of limited use in and of itself; however, multiple kernels can often be combined to provide a more useful solution. In the area of networking, kernels are typically small packet processing routines. These kernels can also be applicable outside the area of networking; however, since the context of this thesis is networking, these kernels focus on packet processing.
The end-to-end application described in Chapter 4 processes each packet by first calculating the MD5 signature of each packet, then determining its destination, and finally encrypting its payload. In the proposed scenario, the integrity of the packet could be verified and its payload decrypted at the receiving host.
To support this work, we developed a generic simulator for a network processor modeled on the Intel IXP1200. Our simulator fills a gap in existing tools: with it, the interaction between the six microengines of the Intel IXP1200 can be simulated. We also developed end-to-end application benchmarks based on the NetBench [18] and MiBench [10] single-kernel suites. These benchmarks have been designed to provide insight into the characteristics of end-to-end applications.
Our results characterize both the kernels making up the end-to-end applications and the end-to-end applications themselves, and provide insight into the strengths and weaknesses of each thread allocation strategy. The remainder of this thesis is organized as follows. In Chapter 2 we discuss related work. In Chapter 3 we present our simulator. The kernels that make up our benchmark applications are described in Chapter 4, and results are presented in Chapter 5.
Chapter 2
Related Work
As the size and capacity of the Internet continues to grow, devices within the
network and at the network edge are increasing in complexity in order to provide
more services. Traditionally, these devices have made use of ASICs which provide
high performance and low flexibility. NPUs bridge the gap between speed and
flexibility by taking advantage of the benefits of both ASICs and general pur-
pose processors. There is no single unifying characteristic that allows all network
processors to accomplish this goal. However, there are several major strategies employed to bridge the gap. Network processors have made it possible to deploy complex applications into the network at nodes that previously acted only as routers and switches.
2.1 Network Processors

The programmability of NPUs allows applications deployed to them to access higher layers
of the network stack than traditional routers and switches. The OSI reference
model defines seven layers of network communication from the physical layer
(layer 1) to the application layer (layer 7) [15]. NPUs are capable of supporting
layer 7 applications which have traditionally been reserved for desktop and server
computers.
There are over 30 different self-identified NPUs available today [24]. These
NPUs can be classified into two categories based on their processing element (PE) configuration: pipelined, in which each PE performs a dedicated task, and symmetric, in which each PE is capable of performing any task [24]. Both of these configurations are found in commercial products.
Pipelined architectures include the Cisco PXF [25], EZChip NP1 [8], and Xelerator Network Processors [31]. Symmetric architectures include the Intel IXP [6] line. In order to prevent network communication delays, NPUs must quickly and efficiently process packets. Parallel processing through the use of multiple PEs is one strategy for accomplishing this.
Co-processors are more complex than functional units. They may be attached
to several PEs, memories, and buses, and they may store state. A co-processor can also dictate that the programmer use a specific algorithm in order to take advantage of its capabilities.
Since memory access can potentially waste processing cycles, NPUs often use multi-threading to hide memory latency, with hardware support such as separate register banks for different threads and hardware
units to schedule and swap threads with no overhead. Special units also handle
memory management and the copying of packets from network interfaces into memory.

2.1.1 Intel IXP1200

The Intel IXP1200 provides fast memory access, low latency access to network interfaces, and strong processing of bit, byte, word, and longword operations. For processors, the IXP1200 provides a
single StrongARM processor and six independent 32-bit RISC PEs called microengines: a single powerful processor coupled with six simpler engines for highly parallel computation. In addition, each microengine supports four hardware threads with zero-overhead context switching. The StrongARM was designed to manage complex tasks and to offload simple packet processing to the microengines. For memory, the IXP1200 provides 8 MBytes of SRAM for fast accesses and 256 MBytes of SDRAM for larger memory space requirements (but slow accesses). There is also a scratch memory unit available to all processors.
The StrongARM has a 16 KByte instruction cache and an 8 KByte data cache, providing it with fast accesses on a small amount of data. Each microengine has a 1 KByte data cache and a large number of transfer registers. The IXP1200 platform does not provide any built-in memory management. Development tools for the IXP1200 are also able to provide performance statistics such as cycle count, memory usage, bus bandwidth, and cache misses. These statistics enable developers to identify bottlenecks in their applications.
2.2 Network Processor Simulators

Simulation is an attractive research tool given the high cost and the wide variety of architectures found in current NPUs. High-performance but outdated NPUs are becoming more accessible, but the wide variety of NPU architectures makes simulation an appealing alternative.
2.2.1 SimpleScalar
SimpleScalar executes binaries compiled for the SimpleScalar architecture and simulates their execution on processor models like those found in NPU platforms such as the Intel IXP. A modified version of GNU GCC allows C applications to be compiled for the simulated architecture.
2.2.2 PacketBench
PacketBench also emulates some of the functionality of an NPU by providing a simple API for sending and receiving packets and for memory management [22]. In this way, the underlying details of specific NPU architectures are hidden from the application developer.
2.3 Benchmarks
Benchmarks that impose an artificial workload are called Synthetic, while Application benchmarks are real-world applications [27]. For the purposes of this paper, our interest is in application benchmarks, and more specifically, representative benchmarks for the domain of NPUs.
2.3.1 MiBench
MiBench's security category includes the Secure Hash Algorithm (SHA), Blowfish, and Pretty Good Privacy (PGP) algorithms, while the other categories are not relevant to this discussion.
2.3.2 CommBench
2.3.3 NetBench
NetBench applications are split into three categories: micro, IP, and application. The micro-level includes the CRC-32 checksum calculation and the table lookup routing scheme. The application level includes URL-based switching, encryption for VPN connections, and Message-Digest 5 (MD5) packet signing [18].
2.4 Application Frameworks

Several frameworks for NPUs are available in academia, each offering various features. NPU vendors also provide frameworks tailored to their architectures, such as the Intel IXA Software Development Kit [5]. One key advantage of academic frameworks is the possibility that they will support multiple NPU architectures; to date, only one of the frameworks described below realizes cross-platform support. The others are currently striving to meet this goal.
2.4.1 Click
Click applications are structured as directed graphs, with processing elements at the nodes and packet flow described using edges. Click supports multi-threading but has not been extended to multiprocessor architectures. The modularity of Click applications gives insight into their inherent functionality.
2.4.2 NP-Click
NP-Click bridges the gap between the application and the hardware through the use of a programming model. Code produced using the NP-Click programming model has been shown to run within 10% of the performance of hand-written code while reducing development time [19]. The current implementation of NP-Click targets only the Intel IXP1200.
2.4.3 NEPAL
NEPAL provides an environment for developing and executing module-based applications for network processors. Applications are built by defining a set of modules and a module tree that defines the flow of execution between them. NEPAL was verified using the authors' own customized version of the SimpleScalar ARM simulator for multiprocessor architectures. They provide performance results for two simulated NPUs modeled after the IXP1200 [6] and Cisco Toaster [25].
2.4.4 NetBind
Chapter 3
The Simulator
The simulator developed for this work was built on the SimpleScalar tool set, a collection of simulation software that models real-world architectures. This simulation tool set was chosen because of its prevalence in architectural research. For this work, we modified an existing simulator with support for multiple processors in order to create
a generic network processor simulator modeled after the Intel IXP1200. We chose
to model the IXP1200 because it is a member of the commonly used Intel IXP
line of NPUs.
The simulator includes a single main processor and six auxiliary processors, corresponding to the StrongARM and microengines of the IXP1200. Their parameters are summarized below.
Parameter       | StrongARM        | Microengines
----------------|------------------|------------------------
Scheduling      | Out-of-order     | In-order
Width           | 1 (single-issue) | 1 (single-issue)
L1 I Cache Size | 16 KByte         | SRAM (no miss penalty)
L1 D Cache Size | 8 KByte          | 1 KByte
Since each microengine must support four threads with zero overhead context switching [6], the simulator creates one single-issue in-order processor for each microengine thread. The number of threads is specified on the command line when the simulator is run; therefore, the thread configuration can vary between executions.
The StrongARM and microengines share 8 MBytes of SRAM and 256 MBytes
of SDRAM [6]. There is also a scratch memory unit consisting of 1 MByte SRAM.
These memory units are represented in the simulator using a single DRAM unit.
Separate DRAM caches back these memory units for the StrongARM and micro-
engines.
The StrongARM has a 16 KByte instruction cache and 8 KByte data cache
that are backed by SRAM [6]. Each microengine has a 1 KByte data cache and an instruction store with no miss penalty. We modeled the microengine registers as a data cache in order to mimic the behavior of the large number of transfer registers associated with each microengine on the IXP1200. Since the number of simulated registers cannot exceed the number of physical registers on the host architecture, we determined this to be the best option available.
Since the IXP1200 is capable of connecting with any number of network de-
vices through its high speed 64 bit IX bus interface, the amount of delay incurred
to fetch a packet could vary greatly. For the purposes of this simulator, network delay is not important and it is assumed that the next packet is available as soon as the application is ready to receive it. In order to imitate this behavior, a large buffer of packets is loaded into memory before the simulation begins.
The simulator does not provide any built-in memory management; therefore, application developers are responsible for managing the memory address space. The simulator assigns address ranges to each of the memory units. SRAM is dedicated to the call stack and DRAM is broken up into ranges for text, global data, and the heap.
Command line arguments override defaults and indicate the location of a SimpleScalar native executable and any arguments that should be passed to it. These arguments can also modify aspects of the simulation architecture, including the number of threads. Default values for each available parameter are based on the IXP1200 architecture.
The most important parameter for this work was Threads, which controls the number of microengine threads created by the simulator. Threads can be any value between 0 and 24 inclusive. Zero threads indicates the microengines will not be used and therefore the application will execute only on the StrongARM processor. When the number of threads is greater than zero, the simulator ensures that the threads are distributed as evenly as possible across the six microengines. For instance, if 8 threads are requested, then 4 microengines will run 1 thread and 2 microengines will run 2 threads.
The applications developed for this work were written in C and compiled with a cross-compiling version of GNU GCC. For this work, the host architecture was Linux/x86 and the target architecture was SimpleScalar ARM. Since the simulator does not support POSIX threads, developing multi-threaded applications requires a different model: a single executable is run by each simulator thread. In order for the application code to distinguish which thread it is running on, the simulator provides a function, getcpu(), that returns the thread identifier, not the CPU identifier. Code that is meant to run in a particular thread must be isolated in an if block that tests the return value from getcpu(). This function incurs a penalty of one cycle, but it is typically called only once and its value stored in a local variable during the initialization of the application. Threads can be synchronized using another function made available by the simulator called barrier(). A call to barrier() requires one cycle for the function call, but induces no penalty while a thread waits.
The simulator reports statistics on the utilization of each hardware unit at the end of each execution. For each PE this includes cycle count, instruction count, and fetch stalls. For each memory unit this includes hits, misses, reads, and writes. The simulator also provides a function that can be used at any time to print the cycle count of the current thread to standard error and standard output. This function is useful when an application has an initialization process that should not count towards the total cycle count. By calling the function before and after a block of code, the total cycle count for that block can be determined by analyzing the output. When the developer requires that the application make some calculation based on cycle count, the function GetCycles(), which returns an integer, can be used. Both of these functions induce a penalty of one cycle for the call, but no additional overhead.
Chapter 4
Benchmark Applications
Our benchmark applications aim at modeling more complex applications that guide packets through a series of applications running in parallel. For the first stage of this work we ported three typical network kernels to the simulator and modified them to take advantage of multiple threads. For the second stage of this work we combined these three applications into three types of end-to-end applications: shared, static, and dynamic. These distinctions refer to three different ways of utilizing the available threads.
4.1 Message Digest
The MD5 algorithm produces a fixed-length signature of an arbitrary message and was designed to make it computationally infeasible to find two messages with the same signature or to produce the original message given a signature. However, in March 2005, Arjen Lenstra, Xiaoyun Wang, and Benne de Weger demonstrated [14] that two valid X.509 certificates [11] could be created with identical MD5 signatures. Although more robust algorithms exist, MD5 remains in widespread use.
Our implementation of MD5 was adopted from the NetBench suite of applications for network processors [18, 16]. The NetBench implementation was modified to distribute packets across the microengine threads. Communication between the StrongARM and microengines is done through the use of shared memory and semaphores. As shown in Figure 4.1, the StrongARM copies a pointer to the current packet and the length of the current packet to shared memory locations known by both the StrongARM and the thread. The StrongARM then sets a semaphore that triggers the thread to begin executing. When all packets have been processed, the StrongARM waits for each thread to become idle, then signals the threads to exit.
Figure 4.1. MD5 - StrongARM Algorithm
Each microengine thread proceeds as shown in Figure 4.2. It waits until its semaphore has changed, then either exits or copies the current packet to its stack and calculates the packet's MD5 signature before resetting its semaphore and returning to the idle state.
Figure 4.2. MD5 - Microengine Algorithm

4.2 URL-Based Switch

A URL-based switch directs traffic based on the Uniform Resource Locator (URL) found in a packet. Other terms for URL-based switch include content switch and Layer 7 switch.
Clusters of servers are commonly used to provide scalable network services. A Layer 4 switch located in front of the cluster of servers can control traffic on a per connection basis. How requests are directed within a connection is out of reach for a Layer 4 switch. With a URL-based switch, traffic can be managed per request, rather than per connection. The switch terminates the client's TCP connection and establishes its own connections to the servers containing the content requested by the client. It then relays content to the client. In this way, the switch can perform load-balancing and fault detection and recovery. For instance, if one server is overloaded or unreachable, the switch can send its requests to another server.
Our URL-based switch was adapted from the URL-based switching application in NetBench [18, 16]. The algorithm searches the contents of a packet for a list of patterns; a matching pattern can then be used to switch the packet or begin another process. The focus of our URL-based switch is the pattern matching algorithm.
Unlike our implementation of MD5, our URL-based switch does not utilize a single thread to process each packet; instead, multiple threads cooperate to process each packet. The data structure used to store patterns is a list of lists. Each element of the primary list is made up of a secondary list and the largest common substring of the patterns in the secondary list. The algorithm proceeds as follows.
Each packet received by the StrongARM is copied to the stack and run through the Internet checksum algorithm to verify its integrity. For each element in the list of largest common substrings, the StrongARM copies the element to shared memory and then the idle microengine's semaphore is set to notify it to begin executing.

Figure 4.4. URL - Microengine Algorithm

The microengine thread first copies the packet to its stack and then
uses a Boyer-Moore search function to determine whether the packet contains
the largest common pattern. If this test is positive, then the thread proceeds to
search for a matching pattern in the secondary list. Otherwise, the microengine
resets its semaphore and returns to an idle state. If the thread finds a matching
pattern, it sets its semaphore to reflect this before returning to an idle state.
The StrongARM continues until it reaches the end of the primary list or until a thread reports a matching pattern.
4.3 Advanced Encryption Standard

The Advanced Encryption Standard (AES) was established by the National Institute of Standards and Technology [20]. The standard was proposed by Vincent Rijmen and Joan Daemen under the
name Rijndael [7]. AES is a block cipher encryption algorithm based on 128, 192,
or 256 bit keys. The algorithm is known to perform efficiently in both hardware
and software.
Our implementation of AES was adopted from the MiBench embedded benchmark suite [10, 21]. In much the same way that our MD5 algorithm processes packets in parallel, our AES algorithm offloads the encryption of individual packets to the microengine threads, using a 256 bit key that is loaded into each thread's stack during startup. This algorithm executes on the simulator in the same manner as MD5 above (Figures 4.1 and 4.2).
Chapter 5
Results
To determine how well each kernel and end-to-end application could take advantage of the NPU architecture, we performed four types of tests: Isolation, Shared, Static, and Dynamic. The Isolation tests establish a baseline and explore the effects of multi-threading. The Shared tests explore how each application is affected by the concurrent execution of other applications. The Static tests reveal the location of bottlenecks and how to best distribute threads. Finally, the Dynamic tests serve to compare demand-driven thread allocation against the best static configurations.

5.1 Isolation Tests
The purpose of the Isolation tests is twofold: to establish a baseline for subsequent tests and to explore the effects of multi-threading on the NPU. The Isolation tests consisted of independent tests for each application. For each individual test, the number of threads was varied between 1 and 24, since the simulator supports up to 24 threads. A data point was also gathered for the serial version of each application, in which the application runs entirely on the StrongARM.
5.1.1 MD5
Test results in Figure 5.1 show that parallelization of the MD5 algorithm offers significant speedup compared to its serial counterpart. The data point at zero threads represents the serial version of MD5 executed on the StrongARM alone, while the data point at one thread represents the parallel version making use of the StrongARM and a single microengine. This case is slower than the serial version because of the overhead involved in communication between the StrongARM and the microengine and because the microengine does not offer as much processing power as the StrongARM; additional threads amortize this overhead.
The slope of the speedup graph in Figure 5.1 decreases suddenly at 7, 13, and 19 threads. These changes can be attributed to the fact that there are 6 microengines: at these points, microengines become burdened with 2, then 3, and then 4 threads, causing the speedup to approach a flat line.
5.1.2 URL
Although test results for the parallelization of URL show improvements over the serial version, the speedup is far less dramatic than that of MD5. As described in the previous chapter, the URL algorithm is parallelized in such a way that multiple threads work together to process each packet. Each thread is responsible for searching the packet for a particular set of patterns, and the first match preempts further execution. The drawback of this algorithm is that since only one thread will find a match, the other threads do work that in hindsight is unnecessary.
Figure 5.3. URL Isolated Speedup on 100 Packets (polling)
A further drawback is that all threads are vying for a limited number of shared resources. We implemented two versions of URL that differ in how they limit the cycles spent searching false leads. The first version allows each thread to finish its current search; once a thread notifies the StrongARM that a match has been found, the StrongARM stops spawning new threads and simply waits for the active threads to finish, although their remaining work is unnecessary. In the second version, when a match is found, the StrongARM sets a global flag that is constantly polled by each thread. When
Although it was expected that the polling version of URL would perform better, the opposite proved true. As shown in Figures 5.2 and 5.3, the highest speedup attained by non-polling was 1.75 and for polling 1.64. Analysis of the application's output shows that a matching pattern is found in only about 40% of the trace packets, thus polling is unable to preempt execution 60% of the time. The difference in speedup is due to the fact that the polling version pays the cost of polling unnecessarily 60% of the time.
In both versions of URL, the speedup drops off after reaching a maximum.
5.1.3 AES
Speedup tests on AES show that this algorithm performs poorly when offloaded to the microengines. The AES encryption algorithm requires each packet to be read and processed 16 bytes at a time. State is maintained for the lifetime of each packet in an accumulator that is made up of the encryption key and state information. Recall that the data cache for the StrongARM is 8 KBytes compared to 1 KByte for the microengines. Due to the limited size of the microengine caches, AES suffers from substantial cache misses.
Processing of each packet consumes roughly 1.36 million simulator cycles when AES runs on the StrongARM, compared to roughly 11.4 million simulator cycles on a microengine thread when it is the only microengine thread in use. By comparison, MD5 consumes roughly 0.518 million cycles on the StrongARM and 0.922 million cycles on a microengine thread. Thus, AES requires a substantially higher increase in cycles when moving from the StrongARM to a microengine. Although the speedup of the parallel version is far worse than the serial version, it remains relatively constant as the number of threads increases. This indicates that the poor performance of the microengine threads is primarily a result of processing power and cache size, not memory contention between threads, which would be the case if speedup tailed off.
5.1.4 Isolation Analysis

These tests reveal general characteristics of each kernel on both the StrongARM and microengines. MD5 has been shown to offer strong speedup on microengines, and URL offers more modest speedup. AES reveals an algorithm with poor performance on the microengines that cannot be overcome by multi-threading.
5.2 Shared Tests
The purpose of the Shared tests is to determine how sensitive each kernel
is to the concurrent execution of the other kernels. For these tests we ran all
three kernels on the simulator at the same time. The StrongARM served as the controller for all three kernels. We ran one test for each kernel, in which the number of threads available to the kernel under test was varied, while the threads available to the other kernels remained constant. Our baseline for each of these tests was 1 thread for MD5, 4 threads for URL, and 1 thread for AES. This baseline was chosen because running URL with fewer than 4 threads was found to cause a significant bottleneck. The
number of threads available to the kernel under examination was increased for
each subsequent run. Each kernel processed a separate packet stream until the
kernel under test completed the desired number of packets, in this case 50.
Figure 5.5 shows the speedup results from all three tests on the same graph, revealing the relative speedup of each kernel. Clearly, MD5 and AES have much greater speedup than URL, indicating they are less sensitive to the concurrent execution of the other kernels. It is also informative to compare the Shared speedup of each kernel with its Isolated speedup. This comparison is covered in the following sections.
5.2.1 MD5
The speedup results of MD5 in the Isolation and Shared tests, shown in
Figures 5.1 and 5.5 respectively, show few differences. The slope of each graph
is approximately the same and both peak near a speedup of six. This indicates
that MD5 is not substantially affected by the concurrent execution of URL and AES.

Figure 5.5. Shared Speedup on 50 Packets

The lightweight nature of MD5 with regard to memory is the most likely explanation for this result.
Figure 5.6 compares the MD5 Isolation and Shared tests with regard to the number of cycles consumed, revealing that more cycles are required to process the same packet stream when MD5 is sharing the resources of the NPU. The horizontal axis corresponds to the number of MD5 threads employed to process the packets while the vertical axis corresponds to the number of cycles spent processing the packet stream. Since in the Shared tests 4 threads are allocated to URL and 1 to AES, these threads cause contention for access to shared resources and therefore higher cycle counts.
Figure 5.6. MD5 Isolated vs. Shared Cycles on 50 Packets
5.2.2 URL
Although the Shared speedup of URL shown in Figure 5.5 steadily increases, its maximum of 1.17 with 22 threads does not match the Isolation speedup shown in Figure 5.2, which peaks at 1.75 and degrades to 1.41 with 22 threads. This indicates that URL is sensitive to the concurrent execution of the other kernels.
5.2.3 AES
The Shared speedup of AES is much greater than the Isolation speedup shown in Figure 5.4. This high speedup is due to the fact that the baseline for this test performed extremely poorly, which can be attributed to two factors. First, as shown in the Isolation tests, AES performs poorly on the microengines due to their lack of processing power and the limited size of their cache. Secondly, since the StrongARM is the controller for all three kernels, it continuously monitors all of their threads. In the baseline, the StrongARM has to monitor one thread for each kernel, thus only one-third of its time is spent monitoring the AES thread. Therefore, the AES thread occasionally finishes processing a packet and wastes idle cycles waiting for the StrongARM to send it another packet. As more threads are allocated to AES, the StrongARM spends a larger percentage of time monitoring AES threads, and fewer cycles are wasted.
5.2.4 Shared Analysis

The Shared tests reveal that MD5 and AES are relatively insensitive to the concurrent execution of the other kernels, while URL is sensitive, and its speedup suffers when it is run alongside the other kernels.
5.3 Static Tests

The Static tests were designed to reveal characteristics of the end-to-end application, such as the location of bottlenecks and the ideal thread configuration. The testing process was similar to that of the Shared tests, the difference being that the three kernels work together to process a single packet stream. Each incoming packet was processed first by MD5, then by URL, and finally by AES. This scenario represents a gateway that transmits packets from an internal network through the Internet to a variety of hosts. Each packet is received by the application from the internal network; the application calculates its MD5 signature, determines its destination based on a deep inspection of the packet, and then encrypts it. Finally, the encrypted packet along with its signature is sent to a host machine, although this step is not included in the simulated application. To complete this scenario, the host machine would decrypt the packet and verify that the contents were not modified in transit by comparing the included signature to a newly generated one. This is also not included in the simulation.
For these tests, the number of threads allocated to each stage of the end-to-end application is static for each run. Once again, the baseline test is 1 thread for MD5, 4 threads for URL, and 1 thread for AES. Each subsequent test increases the number of threads by one and attempts to determine the optimal configuration by allocating the additional thread to each of the applications in turn and observing which configuration yields the best speedup. This configuration is then used as a starting point for the subsequent test.
Figure 5.8 shows the resulting optimal configurations for each number of available threads between 6 and 24. These configurations were found through test runs of 50 packets. MD5 never became a bottleneck point and 1 thread remained sufficient throughout the tests. URL and AES almost evenly split the remaining threads, with a final configuration of 12 threads for AES, 11 for URL, and 1 for MD5. These results show that the demands of AES and URL are similar and parallelization offers increased performance for these applications, while the simplicity of MD5 makes allocating additional threads to that application unnecessary.
Although MD5 provided the best speedup in the Isolation tests, additional MD5 threads do little for the end-to-end application. This behavior is explained by Amdahl's Law [2], which states that the overall speedup achievable from the improvement of one part of a system is limited by the fraction of time that part is used:

    Speedup = 1 / ((1 − P) + P/S)

where P is the proportion of the computation that benefits from the improvement and S is the speedup of that proportion.
It is also interesting to note that although AES did not benefit from additional microengines during the Isolation tests (Figure 5.4), in the high-load context of the end-to-end application additional AES threads proved beneficial.
Figure 5.8 also shows that initially more threads were allocated to URL, and after 14 threads more threads were allocated to AES. Since URL is required to finish processing each packet before it can be sent to AES, URL caused more of a bottleneck when it had fewer than 10 threads. After that point, AES required more of the additional threads.
5.4 Dynamic Tests

The Dynamic tests present an alternative approach to the Static tests. Where the Static tests represent ideal configurations, the Dynamic tests represent realistic conditions. Finding the ideal static configuration is not generally feasible, since all possible configurations must be run in order to determine the best one for the given end-to-end application. This could become an extremely complex and lengthy process. The trade-off with a dynamic heuristic is increased runtime overhead. The Dynamic tests consist of all three kernels processing the same packet stream in serial, as in the Static
tests, but with threads dynamically allocated based on demand. Once again,
the StrongARM serves as the controller and is responsible for allocating threads.
Allocation is implemented through the use of queues for each stage of the end-
to-end application. Each queue stores pointers to packets that are waiting to be
processed by the next stage. The StrongARM detects when a queue has packets waiting and assigns idle threads to the stages with the longest backlogs.
Figure 5.9 shows the speedup of the Dynamic application using as a baseline
the Static configuration consisting of 1 MD5, 4 URL, and 1 AES thread. The
speedup increases from 4.29 with 6 threads to 4.39 with 24 threads, a substantial speedup over the baseline but only a marginal gain from the additional threads.
Figure 5.10 shows the difference between the number of cycles required for each of the applications to process the same number of packets. While the Static version spent in the neighborhood of 1.3 billion cycles per 50 packets, the Dynamic version required noticeably fewer.
Figure 5.10. Static vs. Dynamic Cycles on 50 Packets
This discrepancy can be attributed to cycles wasted on idle threads. With the Static version, each thread is statically assigned to perform either MD5, URL, or AES. Since the URL kernel requires much longer to run than MD5, the queue of packets waiting for URL processing quickly fills, forcing the MD5 thread to stop processing new packets until URL can reduce the queue. At the same time, when the URL threads were unable to process packets as quickly as the AES threads, some AES threads wasted idle cycles. The Dynamic version did not suffer from these bottleneck issues because idle threads were put to use by whichever stage was in demand. The Dynamic version also adapts to variation in load caused by varying packet sizes and payloads. Specifically, since URL
performs a thorough string matching on the payload of each packet, the size of
the packet has a large affect on the number of cycles required to process it. The
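To make the idle-cycle argument concrete, the toy time-stepped model below contrasts static and dynamic thread allocation. The per-kernel cycle costs are invented placeholders chosen only to preserve the relative ordering observed in our tests (URL most expensive, MD5 cheapest); they are not measured values, and the model greatly simplifies the simulator:

```python
from collections import deque

STAGES = ["MD5", "URL", "AES"]
COST = {"MD5": 10, "URL": 80, "AES": 40}  # hypothetical cycles per packet

def simulate(num_packets, num_threads, pinned=None):
    """Run the 3-stage pipeline one cycle at a time.
    pinned[i] gives the stage index thread i is statically bound to;
    if pinned is None, idle threads grab work from the deepest queue.
    Returns (total_cycles, idle_thread_cycles)."""
    queues = {s: deque() for s in STAGES}
    queues["MD5"].extend(range(num_packets))  # all packets ready at t=0
    busy = [None] * num_threads               # [stage, cycles_left] or None
    done = cycles = idle = 0
    while done < num_packets:
        # Dispatch idle threads to waiting packets.
        for i in range(num_threads):
            if busy[i] is None:
                if pinned is not None:
                    s = STAGES[pinned[i]]
                else:
                    s = max(STAGES, key=lambda k: len(queues[k]))
                if queues[s]:
                    queues[s].popleft()
                    busy[i] = [s, COST[s]]
        # Advance every thread by one cycle; count idle thread-cycles.
        for i in range(num_threads):
            if busy[i] is None:
                idle += 1
                continue
            busy[i][1] -= 1
            if busy[i][1] == 0:
                nxt = STAGES.index(busy[i][0]) + 1
                if nxt < len(STAGES):
                    queues[STAGES[nxt]].append(None)  # hand off downstream
                else:
                    done += 1
                busy[i] = None
        cycles += 1
    return cycles, idle

static = simulate(50, 6, pinned=[0, 1, 1, 1, 1, 2])  # 1 MD5, 4 URL, 1 AES
dynamic = simulate(50, 6)
print(static, dynamic)  # dynamic finishes sooner with far fewer idle cycles
```

Even in this simplified model, the statically pinned AES thread becomes the bottleneck while MD5 and URL threads sit idle, whereas dynamic allocation keeps all six threads busy.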
Figure 5.10 also shows that neither the Static nor the Dynamic version of the end-to-end application benefits much from additional threads. The per-kernel results show that between 6 and 24 threads MD5 is the only kernel to experience significant performance improvement. The speedup of URL declines slightly and AES remains relatively constant. Therefore, with the exception of the MD5 kernel, the kernels scale no better here than in the Isolation tests. Once again, this can be explained by Amdahl's Law [2], because MD5 constitutes only a small percentage of the overall computation.
5.5 Analysis
We performed four types of tests for our analysis: Isolation, Shared, Static, and Dynamic. The Isolation tests established a baseline and explored each kernel's scalability in isolation. The Shared tests explored how each kernel was affected by the concurrent execution of other kernels. The AES kernel exhibited poor performance on the microengines that could not be overcome by multithreading.
The Shared tests revealed that MD5 and AES are relatively insensitive to the concurrent execution of other kernels, while URL was shown to be sensitive because its speedup suffered when it was run alongside the other kernels.
The Static tests provided a baseline for the Dynamic tests. They showed that the demands of AES and URL are similar and that parallelization offered increased performance for these applications, while the simplicity of MD5 made additional threads largely unnecessary. The Dynamic tests demonstrated the benefits of dynamically allocating threads. Overall, the Dynamic tests required less than 25% as many cycles to process each 50-packet test as their Static counterpart.
Chapter 6
Conclusion
This work explored multi-threaded end-to-end applications on NPUs. Our first contribution was the creation of a simulator that emulates a generic network processor modeled on the Intel IXP1200.
While the MD5 kernel scaled well in the Isolation and Shared tests, parallelization of the full end-to-end application was less effective: the Static and Dynamic tests found that the application gained little performance from the addition of more than 6 threads. Finally, the Dynamic version of the end-to-end application required less than 25% as many cycles to process the same packet stream compared to the Static version.
In an attempt to bridge the gap between the speed of ASIC chips and the flexibility of general-purpose processors, NPUs occupy a middle ground. While NPUs make it possible to deploy complex end-to-end applications into the network, high speed networks put a heavy load on these devices, making application optimization essential.
Chapter 7
Future Work
The simulator developed in this work provides a tool that can be used in a
variety of future projects. Thus far, the simulator has been used by Gridley in
his Master’s thesis on active network algorithm performance [9] and Tsudama to
test his denial-of-service detection algorithm as part of his Master’s thesis [29].
The simulator itself could be extended in several ways, including support for dedicated processing chips, larger cycle count capability, and other enhancements.
Other future work could include testing the existing end-to-end applications on current hardware to determine whether the bottlenecks found in this work have been overcome by the current generation of NPUs. If bottlenecks are not found on current NPUs, then larger scale end-to-end applications could be developed to stress the hardware further and reveal new bottlenecks. Finally, optimization of the current and future kernels and end-to-end applications will remain an important area of future work.
Bibliography
Development, 2003.
[2] Gene Amdahl. Validity of the single processor approach to achieving large scale computing capabilities. In AFIPS Conference Proceedings, 1967.
[3] Douglas C. Burger and Todd M. Austin. The SimpleScalar tool set, version 2.0. Technical report, University of Wisconsin-Madison, 1997.
In Proceedings of IEEE OPENARCH 2002, New York City, NY, June 2002.
[6] Intel Corporation. IXP1200 Network Processor Datasheet, September 2003.
[7] J. Daemen and V. Rijmen. AES proposal: Rijndael. First Advanced Encryption Standard (AES) Conference, 1998.
com/html/tech nsppaper.html.
[11] R. Housley. Internet X.509 public key infrastructure certificate and certificate revocation list (CRL) profile. RFC 3280, Internet Engineering Task Force, April 2002.
February 2001.
[13] Eddie Kohler, Robert Morris, Benjie Chen, John Jannotti, and M. Frans Kaashoek. The Click modular router. ACM Transactions on Computer Systems, 18(3), August 2000.
[14] Arjen Lenstra, Xiaoyun Wang, and Benne de Weger. Colliding X.509 certificates. Cryptology ePrint Archive, 2005. http://eprint.iacr.org/.
[15] Alberto Leon-Garcia and Indra Widjaja. Communication Networks: Fundamental Concepts and Key Architectures. McGraw-Hill, 2000.
NetBench/, 2002.
[18] G. Memik, W. H. Mangione-Smith, and W. Hu. NetBench: A benchmarking suite for network processors.
February 2003.
//csrc.nist.gov/publications/fips.
eecs.umich.edu/mibench/, 2002.
2003.
[23] Ronald L. Rivest. The MD5 message-digest algorithm. RFC 1321, Internet Engineering Task Force, April 1992.
2002. http://www.cisco.com/en/US/products/hw/routers/ps133/
June 2004.
April 2000.
[31] Xelerated. Xelerator X10q Network Processor. Product Brief, 2004. http://www.xelerated.com/file.aspx?file_id=62.
Appendix A
Acronyms
DH Diffie-Hellman
MD5 Message-Digest 5
PE processing element
PISA Portable Instruction Set Architecture