Thesis

Multi-Threaded End-to-End Applications on Network Processors
A Thesis
Presented to
the Faculty of the California Polytechnic State University
San Luis Obispo
In Partial Fullfilment
of the Requirements for the Degree
Master of Science in Computer Science
by
Michael S. Watts
June 2005
Copyright
c 2005
by
Michael S. Watts
ii
APPROVAL PAGE
TITLE: Multi-Threaded End-to-End Applications on Network Processors
AUTHOR: Michael S. Watts
DATE SUBMITTED: January 26, 2006
Professor Diana Franklin

Advisor or Committee Chair Signature
Professor Hugh Smith

Committee M ember Signature
Professor Phil Nico

Committee M ember Signature
iii
Abstract
by
Michael S. Watts
High speed networks put a heavy load on network processors, therefore optimiza-
tion of applications for these devices is an important area of research. Many
network processors provide multiple processing chips, and it is up to the applica-
tion developer to utilize the available parallelism. To fully exploit this power, one
must be able to parallelize full end-to-end applications that may be composed of
several less complex application kernels.
This thesis presents a multi-threaded end-to-end application benchmark suite
and a generic network processor simulator modeled after the Intel IXP1200. Us-
ing our benchmark suite we evaluate the effectiveness of network processors to
support end-to-end applications as well as the effectiveness of various paralleliza-
tion techniques to take advantage of the network processor architecture. We show
that kernel performance is an inaccurate indicator of end-to-end application per-
formance and that relying on such data can lead to sub-optimal parallelization.
iv
Contents
Contents v
1 Introduction 1
2 Related Work 4
2.1 Network Processors . . . . . . . . . . . . . . . . . . . . . . . . . . 4
2.1.1 Intel IXP1200 . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.2 Network Processor Simulators . . . . . . . . . . . . . . . . . . . . 7
2.2.1 SimpleScalar . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.2.2 PacketBench . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.3 Benchmarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.3.1 MiBench . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.3.2 CommBench . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.3.3 NetBench . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.4 Application Frameworks . . . . . . . . . . . . . . . . . . . . . . . 10
2.4.1 Click . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.4.2 NP-Click . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.4.3 NEPAL . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.4.4 NetBind . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
3 The Simulator 13
3.1 Processing Units . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
3.2 Memory Structure . . . . . . . . . . . . . . . . . . . . . . . . . . 14
v
3.3 Methods of Use . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
3.4 Application Development . . . . . . . . . . . . . . . . . . . . . . . 16
4 Benchmark Applications 18
4.1 Message Digest . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
4.2 URL-Based Switch . . . . . . . . . . . . . . . . . . . . . . . . . . 20
4.3 Advanced Encryption Standard . . . . . . . . . . . . . . . . . . . 23
5 Results 25
5.1 Isolation Tests . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
5.1.1 MD5 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
5.1.2 URL . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
5.1.3 AES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
5.1.4 Isolation Analysis . . . . . . . . . . . . . . . . . . . . . . . 30
5.2 Shared Tests . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
5.2.1 MD5 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
5.2.2 URL . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
5.2.3 AES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
5.2.4 Shared Analysis . . . . . . . . . . . . . . . . . . . . . . . . 34
5.3 Static Tests . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
5.4 Dynamic Tests . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
5.5 Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
6 Conclusion 42
7 Future Work 44
Bibliography 46
A Acronyms 50
vi
Chapter 1
Introduction
As available processing power has increased, devices that traditionally used
Application Specific Integrated Circuit (ASIC) chips are beginning to use pro-
grammable processors in order to take advantage of their flexibility. This increase
in flexibility has traditionally been gained at the sacrifice of speed. Network
processors aim to bridge the gap between speed and flexibility by taking advan-
tage of the benefits of both ASICs and general purpose processors. There is
no single unifying characteristic that allows all network processors to accomplish
this goal. However, there are several major strategies employed to bridge the gap:
parallel processing, special-purpose hardware, memory structure, communication
mechanisms, and peripherals [24]. Network processors have made it possible for
the deployment of complex applications into the network at nodes that previously
acted only as routers and switches.
High speed networks put a heavy load on network processors, therefore opti-
mization of applications for these devices is an important area of research. It is up
to the application developer to utilize the parallelism available in network proces-
sors. Parallelization of kernels is often a trivial task compared to parallelization
1
of end-to-end applications.
In the context of this thesis, kernels are programs that carry out a single task.
This task is of limited use in and of itself, however multiple kernels can often be
combined to provide a more useful solution. In the area of networking, kernels are
programs such as MD5, URL-based switching, and AES discussed in Chapter 4.
These kernels can also be applicable outside the area networking, however since
the context of this thesis is networking, these kernels focus on packet processing.
An end-to-end application refers to a useful combination of kernels. The end-
to-end application discussed in Chapter 5 makes use of the kernels in Chapter
4 by first calculating the MD5 signature of each packet, then determining its
destination using URL-based switching, and finally encrypting it using AES. In
the proposed scenario, the integrity of the packet could be verified and its payload
decrypted at the destination node.
Our first contribution is the creation of a simulator that emulates a generic
network processor modeled on the Intel IXP1200. Our simulator fills a gap in
existing academic research by supporting multiple processing units. In this way,
interaction between the six microengines of the Intel IXP1200 can be simulated.
We chose to emulate the IXP1200 because it is a member of the commonly used
Intel IXP line of Network Processing Unit (NPU)s.
Our second contribution is the construction of multi-threaded, end-to-end
application benchmarks based on the NetBench [18] and MiBench [10] single-
threaded kernels. Since network processors are capable of supporting complex
applications, it is important to have benchmarks that fully utilize them. Exist-
ing benchmark suites make it difficult to research the properties of parallelized
end-to-end applications since they are made up of single-threaded kernels. Our
benchmarks have been designed to provide insight into the characteristics of end-
2
to-end applications.
Our third contribution is an analysis of our multi-threaded, end-to-end ap-
plication benchmarks on our network processor simulator. This analysis reveals
characteristics of the kernels making up the end-to-end applications and the end-
to-end applications themselves, as well as insight into the strengths and weak-
nesses of network processors.
This paper is organized as follows. In the next chapter we provide background
and related work. In Chapter 3 we present our simulator. The kernels that make
up our end-to-end application benchmark are presented in Chapter 4. Chapter
5 describes our testing methodology and our evaluation of the effectiveness of
network processors to support end-to-end applications as well as the effectiveness
of various parallelization techniques to take advantage of the network processor
architecture. Finally, our conclusion is presented in Chapter 6.
3
Chapter 2
Related Work
As the size and capacity of the Internet continues to grow, devices within the
network and at the network edge are increasing in complexity in order to provide
more services. Traditionally, these devices have made use of ASICs which provide
high performance and low flexibility. NPUs bridge the gap between speed and
flexibility by taking advantage of the benefits of both ASICs and general pur-
pose processors. There is no single unifying characteristic that allows all network
processors to accomplish this goal. However, there are several major strate-
gies employed to bridge the gap: parallel processing, special-purpose hardware,
memory structure, communication mechanisms, and peripherals [24]. Network
processors have made it possible for the deployment of complex applications into
the network at nodes that previously acted only as routers and switches.
2.1 Network Processors
NPU is a general term used to describe any processor designed to process
packets for network communication. Another characteristic of NPUs is that their
4
programmability allows applications deployed to them to access higher layers
of the network stack than traditional routers and switches. The OSI reference
model defines seven layers of network communication from the physical layer
(layer 1) to the application layer (layer 7) [15]. NPUs are capable of supporting
layer 7 applications which have traditionally been reserved for desktop and server
computers.
There are over 30 different self-identified NPUs available today [24]. These
NPUs can be classified into two categories based on their processing element
configuration: pipelined and symmetric. A processing element (PE) is a processor
able to decode an instruction stream [24]. Pipelined configurations dedicate each
PE to a particular packet processing task, while in symmetric configurations
each PE is capable of performing any task [24]. Both of these configurations are
capable of taking advantage of the inherent parallelism in packet processing.
Pipelined architectures include: Cisco PXF [25], EZChip NP1 [8], and Xel-
erator Network Processors [31]. Symmetric architectures include: Intel IXP [6]
and IBM PowerNP [1].
High-speed networks place high demands on the performance of NPUs. In
order to prevent network communication delays, NPUs must quickly and effi-
ciently process packets. Parallel processing through the use of multiple PEs is
only one strategy used in NPUs to improve performance. Another strategy is
to use special-purpose hardware to offload tasks from the PEs. Special-purpose
hardware includes co-processors and special functional units.
Co-processors are more complex then functional units. They may be attached
to several PEs, memories, and buses, and they may store state. A co-processor
can be advantageous to the programmer when implementing an application, but
5
can also dictate that the programmer use a specific algorithm in order to take
advantage of a particular co-processor. Special functional units are used to im-
plement common networking operations that are hard to implement efficiently in
software yet easy to implement in hardware [24].
Since memory access can potentially waste processing cycles, NPUs often use
multi-threading to efficiently utilize processing power. Hardware is dedicated to
multi-threading such as separate register banks for different threads and hardware
units to schedule and swap threads with no overhead. Special units also handle
memory management and the copying of packets from network interfaces into
shared memory [24].
2.1.1 Intel IXP1200
The IXP1200 was designed to support applications requiring fast memory
access, low latency access to network interfaces, and strong processing of bit,
byte, word, and longword operations. For processors, the IXP1200 provides a
single StrongARM processor and six independent 32-bit RISC PEs called mi-
croengines. This boils down to a single powerful processor coupled with 6 very
simple, weaker engines for highly parallel computation. In addition, each mi-
croengine provides four hardware supported threads with zero-overhead context
switching. The StrongARM was designed to manage complex tasks and to offload
specific tasks to individual microengines [6].
The StrongARM and microengines share 8 MBytes of SRAM for relatively
fast accesses and 256 MBytes of SDRAM for larger memory space requirements
(but slow accesses). There is also a scratch memory unit available to all proces-
sors consisting of 1 MByte SRAM. The StrongARM has a 16 KByte instruction
6
cache and 8 KByte data cache, providing it with fast accesses on a small amount
of data. Each microengine has a 1 KByte data cache and a large number of
transfer registers. The IXP1200 platform does not provide any built-in memory
management, therefore the application developers are responsible for maintaining
memory address space [6].
2.2 Network Processor Simulators
Simulators are often used to execute programs written to run on hardware
platforms that are inconvenient or inaccessible to developers [28]. Simulators are
also able to provide performance statistics such as cycle count, memory usage,
bus bandwidth, and cache misses. These statistics enable developers to identify
bottlenecks and tune applications to specific hardware configurations.
Simulators are an important aspect of research in network processors due to
the high-cost and the wide variety of architecture found in current NPUs. High-
cost often makes cutting-edge NPUs inaccessible in academic research although
outdated NPUs are becoming more accessible. The wide variety of NPU archi-
tectures makes developing applications to run across multiple platforms difficult.
Since simulators can potentially be configured to simulate multiple platforms,
analysis of architectural differences can be performed.
2.2.1 SimpleScalar
SimpleScalar provides tools for developing cycle-accurate hardware simulation
software that models real-world architecture [3]. We chose to use SimpleScalar
because of its prevalence in architectural research. SimpleScalar takes as input
7
binaries compiled for the SimpleScalar architecture and simulates their execution
[3]. The SimpleScalar architecture is similar to MIPS, which is commonly found
in NPU platforms such as the Intel IXP. A modified version of GNU GCC allows
binaries to be compiled from FORTRAN or C into SimpleScalar binaries [3].
2.2.2 PacketBench
PacketBench is a simulator developed at the University of Massachusetts to
provide exploration and understanding of NPU workloads [22]. PacketBench
makes use of SimpleScalar ARM for cycle-accurate simulation [22]. PacketBench
also emulates some of the functionality of a NPU by providing a simple API for
sending and receiving packets and for memory management [22]. In this way, the
underlying details of specific NPU architectures are hidden from the application
developer. Although PacketBench is useful in characterizing workload, it does
not provide simulation support for multiprocessor environments. Since NPUs
make extensive use of parallelization, we chose not to use this tool.
2.3 Benchmarks
Benchmarks are applications designed to assess the performance character-
istics of computer hardware architectures [27]. One approach is to use a single
benchmark suite to compare the performance of several different architectures.
Another approach is to compare the performance of different applications on a
specific architecture. Benchmarks designed to mimic a particular type of work-
load are called Synthetic, while Application benchmarks are real-world applica-
tions [27]. For the purposes of this paper, our interest is in application bench-
8
marks, and more specifically, representative benchmarks for the domain of NPUs.
2.3.1 MiBench
MiBench is a benchmark suite providing representative applications for em-
bedded microprocessors [10]. Due to the diversity of the embedded microproces-
sor domain, MiBench is composed of 35 applications divided into six categories:
Automotive and Industry Control, Network, Security, Consumer Devices, Office
Automation, and Telecommunications. The Network and Security categories in-
clude Rijndael encryption, Dijkstra, Patricia, Cyclic Redundancy Check (CRC),
Secure Hash Algorithm (SHA), Blowfish, and Pretty Good Privacy (PGP) algo-
rithms. The Telecommunications category consists of mostly signal processing
algorithms, while the other categories are not relevant to this discussion. All
MiBench applications are available in standard C source code allowing them to
be ported to any platform with compiler support.
2.3.2 CommBench
CommBench was designed to evaluate the performance of network devices
based on eight typical network applications. The applications included in Comm-
Bench are categorized into header-processing and payload-processing. Header-
processing applications include Radix-Tree Routing table lookup, FRAG packet
fragmentation, Deficit Round Robin scheduling, and tcpdump traffic monitor-
ing. Payload-processing applications include CAST block cipher encryption, ZIP
data compression, Reed-Solomon Forward Error Correction (REED) redundancy
checking, and JPEG lossy image compressing [30].
9
2.3.3 NetBench
NetBench is a benchmarking suite consisting of a representative set of network
applications likely to be found in the network processor domain. These applica-
tions are split into three categories: micro, IP, and application. The micro-level
includes the CRC-32 checksum calculation and the table lookup routing scheme.
IP-level programs include IPv4 routing, Deficit-Round Robin (DRR) scheduling,
Network Address Translation (NAT), and the IPCHAINS firewall application.
Finally, application-level includes URL-based switching, Diffie-Hellmen (DH) en-
cryption for VPN connections, and Message-Digest 5 (MD5) packet signing [18].
Although CommBench and NetBench offer good representations of typical
network applications, they are both limited to single-threaded environments. Our
work builds on the NetBench suite by parallelizing several NetBench applications.
2.4 Application Frameworks
Application framework is a widely used term referring to a set of libraries and
a standard structure for implementing applications for a particular platform [26].
Application frameworks often promote code-reuse and good design principles.
Several frameworks for NPUs are available in academia, each offering various
benefits to application developers. NPU vendors also provide frameworks specific
to their architectures, such as the Intel IXA Software Development Kit [5].
One key advantage of academic frameworks is the possibility that they will
be able to support multiple architectures, thus enabling developers to design and
implement applications independent of a specific architecture. Unfortunately,
of the NPU-specific frameworks surveyed in this paper only NEPAL currently
10
realizes cross-platform support. The others are currently striving to meet this
goal.
2.4.1 Click
Click is an application development environment designed to describe net-
working applications [13]. Applications implemented using Click are assembled
by combining packet processing elements. Each element implements a simple
autonomous function. The application is described by building a directed graph
with processing elements at the nodes and packet flow described using edges.
Click supports multi-threading but has not been extended to multiprocessor ar-
chitectures. The modularity of Click applications gives insight into their inherent
concurrency and allows alterations in parallelization to be made without changing
functionality.
2.4.2 NP-Click
NP-Click is based upon Click and designed to enable application development
on NPUs without requiring in-depth understanding of the details of the target
architecture [19]. NP-Click offers a layer of abstraction between the developer
and the hardware through the use of a programming model. The code produced
using the NP-Click programming model has been shown to run within 10% of
the performance of hand-coded solutions while significantly reducing development
time [19]. The current implementation of NP-Click targets only the Intel IXP1200
network processor although a goal of this project is to support other architectures.
11
2.4.3 NEPAL
The Network Processor Application Language (NEPAL) is a design environ-
ment for developing and executing module-based applications for network proces-
sors [17]. In a similar fashion to Click, application development takes place by
defining a set of modules and a module tree that defines the flow of execution
and communication between modules. The platform independence of NEPAL
was verified using their own customized version of SimpleScalar ARM simula-
tor for multiprocessor architectures. They provide performance results for two
simulated NPUs modeled after the IXP1200 [6] and Cisco Toaster [25].
2.4.4 NetBind
NetBind is a binding tool for dynamically constructing data paths in NPUs
[4]. Data paths are made up of components performing simple operations on
packet streams. NetBind modifies the machine code of executable components
in order to combine them into a single executable at run-time. The current
implementation of NetBind specifically targets the IXP1200 network processor,
although it could be ported to other architectures in the future.
12
Chapter 3
The Simulator
The simulator developed for this work was built on the SimpleScalar tool set
[3]. SimpleScalar provides tools for developing cycle-accurate hardware simula-
tion software that models real-world architecture. This simulation tool set was
chosen because of its prevalence in architectural research. For this work, we mod-
ified an existing simulator with support for multiple processors in order to create
a generic network processor simulator modeled after the Intel IXP1200. We chose
to model the IXP1200 because it is a member of the commonly used Intel IXP
line of NPUs.
3.1 Processing Units
The simulator includes a single main processor and six auxiliary processors
each supporting up to four concurrent threads. This configuration corresponds to
the StrongARM core processor and accompanying microengines on the IXP1200.
The StrongARM core is represented by an out-of-order processor. The micro-
engines are represented by single-issue in-order processors. Since each micro-
13
Parameter StrongARM Microengines
Scheduling Out-of-order In-order
Width 1 (single-issue) 1 (single-issue)
L1 I Cache Size 16 KByte SRAM (no miss penalty)
L1 D Cache Size 8 KByte 1 KByte
Table 3.1. Processor Parameters
engine must support four threads with zero overhead context switching [6], the
simulator creates one single-issue in-order processor for each microengine thread.
When a single-issue in-order processor is created, it is given the number of threads
allocated to its physical microengine so it knows to execute every n cycles, where
n is the number of threads on the microengine. The total number of required
threads is specified on the command line when the simulator is run, therefore
unused threads are not created.
3.2 Memory Structure
The StrongARM and microengines share 8 MBytes of SRAM and 256 MBytes
of SDRAM [6]. There is also a scratch memory unit consisting of 1 MByte SRAM.
These memory units are represented in the simulator using a single DRAM unit.
Separate DRAM caches back these memory units for the StrongARM and micro-
engines.
The StrongARM has a 16 KByte instruction cache and 8 KByte data cache
that are backed by SRAM [6]. Each microengine has a 1 KByte data cache and
unlimited instruction cache. The microengines are given unlimited instruction
cache in order to mimic the behavior of the large number of transfer registers
associated with each microengine on the IXP1200. Since the number of simulated
registers cannot exceed the number of physical registers on the host architecture,
14
we determined this to be the best option available.
Since the IXP1200 is capable of connecting with any number of network de-
vices through its high speed 64 bit IX bus interface, the amount of delay incurred
to fetch a packet could very greatly. For the purposes of this simulator, network
delay is not important and it is assumed that the next packet is available as soon
as the application is ready to receive it. In order to imitate this behavior, a large
chunk of the DRAM unit is allocated as “network” memory and is backed by a
no-penalty cache object available to all processors.
The simulator does not provide any built in memory management, therefore
the application developers are responsible for maintaining memory address space.
The simulator assigns address ranges to each of the memory units. SRAM is
dedicated for the call stack and DRAM is broken up into a range for text, global
variables, and the heap.
3.3 Methods of Use
The simulator compiles to Linux using GCC 3.2.3 to an executable called
sim3ixp1200. The simulator takes a list of arguments that modify architectural
defaults and indicate the location of a SimpleScalar native executable and any
arguments that should be passed to it. Its use can be expressed as:
sim3ixp1200 [-h] [sim-args] program [program-args]
The -h option lists available simulator arguments of the form -Parameter:value.
These arguments can modify aspects of the simulation architecture including the
number of PEs, threads, cache specifications, and memory unit specifications.
Default values for each available parameter are based on the IXP1200 architec-
15
ture.
The most important parameter for this work was Threads that controls the
number of microengine threads made available to the SimpleScalar application.
Threads can be any value between 0 and 24 inclusive. Zero threads indicates the
microengines will not be used and therefore the application will execute only on
the StrongARM processor. When the number of threads are greater than zero,
they are allotted to the 6 possible microengines using a round-robin scheme so
that the threads are distributed as evenly as possible. For instance, if 8 threads
are requested, then 4 microengines will be run 1 thread and 2 microengines will
run 2 threads.
3.4 Application Development
The applications developed for this work were written in C and compiled
using a GCC 2.7.2.3 cross-compiler. A cross-compiler translates code written on
one computer architecture into code that is executable on another architecture.
For this work, the host architecture was Linux/x86 and the target architecture
was SimpleScalar PISA (a MIPS-like instruction set).
Since the simulator does not support POSIX threads, developing multi-threaded
applications follows a completely different path. Instead of the main process
spawning child threads, the same application code is automatically executed in
each simulator thread. In order for the application code to distinguish which
thread it is running in, a function called getcpu() that returns an integer is
made available by the simulator. This function, although mis-named, returns
the thread identifier, not the CPU identifier. Code that is meant to run in a
particular thread must be isolated in an if block that tests the return value from
16
getcpu(). This function requires a penalty of one cycle, but it is typically called
only once and its value stored in a local variable during the initialization of the
application. A global variable called ncpus is automatically made available by
the simulator and populated with the number of threads.
It is often necessary in application development to require all threads to reach
a particular point before any thread is allowed to proceed. This is accomplished
using another function made available by the simulator called barrier(). A call
to barrier() requires one cycle for the function call, but induces no penalty
while a thread waits.
The simulator reports statistics on the utilization of each hardware unit at the
end of each execution. For each PE this includes cycle count, instruction count,
and fetch stalls. For each memory unit this includes hits, misses, reads, and
writes. In addition, the simulator provides a function called PrintCycleCount()
that can be used at any time to print the cycle count of the current thread to
standard error and standard output. This function is useful when an application
has an initialization process that should not count towards the total cycle count.
By making a call to PrintCycleCount() at the beginning and end of a block of
code, the total cycle count for that block can be determined by analyzing the
output. When the developer requires that the application make some calculation
based on cycle count, the function GetCycles() returning and integer can be
used. Both of these functions induce a penalty of one cycle for the call, but no
penalty for their execution.
17
Chapter 4
Benchmark Applications
Previous research in the area of network processors has focused on exploring
their performance characteristics by running individual applications in isolation
and in a single threaded environment. Network processors are capable of support-
ing more complex applications that guide packets through a series of applications
running in parallel. For the first stage of this work we ported three typical network
applications to our simulator: MD5, URL-switching, and Advanced Encryption
Standard (AES). This process involved modifying memory allocations to use
appropriate simulator address space and reorganizing each application to take
advantage of multiple threads. For the second stage of this work we combined
these three applications into three types of end-to-end applications: shared, sta-
tic, and dynamic. These distinctions refer to three different ways of utilizing the
available threads.
18
4.1 Message Digest
The MD5 algorithm [23] creates a 128-bit signature of an arbitrary length
input message. Until recently, it was believed to be infeasible to produce two
messages with the same signature or to produce the original message given a
signature. However, in March 2005, Arjen Lenstra, Xiaoyun Wang, and Benne
de Weger demonstrated [14] that two valid X.509 certificates [11] could be created
with identical MD5 signatures. Although more robust algorithms exist, MD5 is
still extensively used in public-key cryptography and in verifying data integrity.
Our implementation of MD5 was adopted from the NetBench suite of ap-
plications for network processors [18, 16]. The NetBench implementation was
designed to process packets in a serial fashion utilizing a single thread. The
multi-threaded, multiprocessor nature of NPUs is better utilized by processing
packets in parallel. In order to analyze the performance characteristics in this
environment, our implementation of MD5 offloads the processing of packets to
available microengine threads. In this way, the number of packets processed in
parallel is equal to the number of microengine threads.
As shown in Figure 4.1, the StrongARM processor is responsible for accepting
incoming packets and distributing them to idle microengine threads. Commu-
nication between the StrongARM and microengines is done through the use of
semaphores. When the StrongARM finds an idle thread, it copies a pointer to
the current packet and the length of the current packet to shared memory lo-
cations known by both the StrongARM and the thread. The StrongARM then
sets a semaphore that triggers the thread to begin executing. When all packets
have been processed, the StrongARM waits for each thread to become idle, then
notifies them to exit before exiting itself.
19
Figure 4.1. MD5 - StrongARM Algorithm
Each microengine thread proceeds as shown in Figure 4.2. It waits until its
semaphore has changed, then either exits or copies the current packet to its stack
before processing it to generate a 128-bit signature. It then resets its semaphore
and returns to waiting.
4.2 URL-Based Switch
URL-based switching directs network traffic based on the Uniform Resource
Locator (URL) found in a packet. Other terms for URL-based switch include
Layer 7 switch, content-switch, and web-switch. The purpose of switching based
on Layer 7 content is to realize improved performance and reliability of web-based
20
Figure 4.2. MD5 - Microengine Algorithm
services. A Layer 4 switch located in front of the cluster of servers can control
how each Transmission Control Protocol (TCP) connection is established on a
per connection basis. How requests are directed within a connection is out of
reach to a Layer 4 switch. Traffic can be managed per request, rather than per
connection, by a URL-based switch [12].
In order to manager requests, a URL-switch acts as the end point of each
TCP connection and establishes its own connections to the servers containing
the content requested by the client. It then relays content to the client. In this
way, the switch can perform load-balancing and fault detection and recovery.
For instance, if one server is overloaded or unreachable, the switch can send its
request to a different server.
Our URL-based switching algorithm is based on the implementation found
in NetBench [18, 16]. The algorithm searches the contents of a packet for a list
of matching patterns. Each pattern has an associated destination that can be
used to switch the packet or begin another process. The focus of our URL-based
21
switch is the pattern matching algorithm.
Unlike our implementation of MD5, our URL-based switch does not utilize
parallelism by processing multiple packets at once, instead it uses multiple threads
to process each packet. The data structure used to store patterns is a list of lists.
Each element of the primary list is made up of a secondary list and the largest
common substring of the patterns in the secondary list. The algorithm proceeds
as shown in Figures 4.3 and 4.4.
Figure 4.3. URL - StrongARM Algorithm
Each packet received by the StrongARM is copied to the stack and run
through the Internet checksum algorithm to verify its integrity. For each ele-
ment in the list of largest common substrings, the StrongARM copies the el-
ement’s secondary list pointer to a shared memory location known by an idle
microengine thread. A pointer to the current packet is also copied to shared
memory and then the idle microengine’s semaphore is set to notify it to begin
22
Figure 4.4. URL - Microengine Algorithm
executing. The microengine thread first copies the packet to its stack and then
uses a Boyer Moore search function to determine whether the packet contains
the largest common pattern. If this test is positive, then the thread proceeds to
search for a matching pattern in the secondary list. Otherwise, the microengine
resets its semaphore and returns to an idle state. If the thread finds a matching
pattern, it sets its semaphore to reflect this before returning to an idle state.
The StrongARM continues until it reaches the end of the primary list or until a
thread finds a matching pattern, it then processes the next packet.
4.3 Advanced Encryption Standard
The AES is an encryption standard adopted by the US government in 2001
[20]. The standard was proposed by Vincent Rijmen and Joan Daemen under the
name Rijndael [7]. AES is a block cipher encryption algorithm based on 128, 192,
or 256 bit keys. The algorithm is known to perform efficiently in both hardware
and software.
Our implementation of AES is based on the Rijndael algorithm found in
23
the MiBench embedded benchmark suite [10, 21]. In much the same way that
our MD5 algorithm processes packets in parallel, our AES algorithm offloads
the encryption of packets to microengine threads. The encryption is performed
using a 256 bit key that is loaded into each thread’s stack during startup. This
algorithm executes on the simulator in the same manner as MD5 above (Figures
4.1 and 4.2).
24
Chapter 5
Results
In order to evaluate the effectiveness of NPUs to support multi-threaded end-
to-end applications and the effectiveness of various parallelization techniques to
take advantage of the NPU architecture, we performed four types of tests: Isola-
tion, Shared, Static, and Dynamic. The Isolation tests establish a baseline and ex-
plore application behavior on the multi-threading NPU architecture. The Shared
tests explore how each application is affected by the concurrent execution of other
applications. The Static tests reveal characteristics of an end-to-end application
and how to best distribute threads. Finally, the Dynamic tests serve to compare
an on-demand thread allocation algorithm to statically allocated threads.
5.1 Isolation Tests
The purpose of the Isolation tests is twofold: to establish a baseline for sub-
sequent tests and to explore the effects of multi-threading on the NPU. The
Isolation tests consisted of independent tests for each application. For each in-
dependent test, the number of microengine threads available to the application
25
was varied between 1 and 24, since the simulator supports up to 24 threads. A
data point was also gathered for the serial version of each application in which
no microengine threads were used.
5.1.1 MD5
Figure 5.1. MD5 Isolated Speedup on 1000 Packets
Test results in Figure 5.1 show that parallelization of the MD5 algorithm
offers significant speedup compared to its serial counterpart. The data point at
zero threads represents the serial version of MD5 executed on the StrongARM
processor. The data point at 1 thread represents the multi-threaded version
making use of the StrongARM and a single microengine. This case is slower than
the serial version because of the overhead involved in communication between the
StrongARM and the microengine and because the microengine does not offer as
strong processing power as the StrongARM. As the number of threads increases,
the combined processing power of the microengines outweighs the communication
overhead.
26
The slope of the speedup graph in Figure 5.1 decreases suddenly at 7, 13,
and 19 threads. These changes can be attributed to the fact that there are 6
microengines, therefore, up until 7 threads each microengine is responsible for
a single thread. From 7-12 threads, each microengine is burdened with up to 2
threads. Similarly, as the number of threads increases to 24, each microengine is
burdened with 3 and then 4 threads causing the speedup to approach a flat line.
5.1.2 URL
Figure 5.2. URL Isolated Speedup on 100 Packets (non-polling)
Although test results for the parallelization of URL show improvements over
the serial version, characteristics of the algorithm limited speedup. As stated in
the previous chapter, the URL algorithm is parallelized in such a way that multi-
ple threads work together to process each packet. Each thread is responsible for
searching the packet for a particular set of patterns, and the first match preempts
further execution. The drawback of this algorithm is that since only one thread
will find a match, the other threads do work that in hindsight is unnecessary.
27
Figure 5.3. URL Isolated Speedup on 100 Packets (polling)
This in itself would not be detrimental to the application’s performance except
that all threads are vying for a limited number of shared resources.
We developed two variations of the URL algorithm in an attempt to minimize
the cycles spent searching false leads. The first version allows each thread to
run to completion after a matching pattern is found. Once a thread reports to
the StrongARM that a match has been found, the StrongARM stops spawning
new threads and simply waits for the active threads to finish, although their
processing is immaterial. In the alternative approach, when a match is found,
the StrongARM sets a global flag that is constantly polled by each thread. When
a thread detects that the flag has changed, it stops executing.
Although it was expected that the polling version of URL would perform
better, it actually performed slightly worse than the non-polling version. As
shown in Figures 5.2 and 5.3, the highest speedup attained by non-polling was
1.75 and for polling 1.64. Analysis of the application’s output shows that a
matching pattern is found in only about 40% of the trace packets, thus polling
28
is unable to preempt execution 60% of time. The difference in speedup is due to
the fact that the polling version is doing unnecessary work 60% of the time and
that polling itself wasts too many cycles.
In both versions of URL, the speedup drops off after reaching a maximum
between 4 and 6 threads. This indicates that contention to shared resources
becomes a problem after this point.
5.1.3 AES
Figure 5.4. AES Isolated Speedup on 100 Packets
Speedup tests on AES show that this algorithm performs poorly when of-
floaded to the microengines. The AES encryption algorithm requires each packet
be read and processed 16 bytes at a time. State is maintained for the lifetime
of each packet in an accumulator that made up of the encryption key and state
variables. In addition, a static lookup table of 8 Kbytes is required. The L1 data
cache for the StrongARM is 8 Kbytes compared to 1 Kbytes for the microengines.
Due to the limited size of the microengine caches, AES suffers from substantial
29
cache misses.
Processing of each packet consumes roughly 1.36 million simulator cycles when
encryption is performed on the StrongARM. The same process consumes roughly
11.4 million simulator cycles on a microengine thread when it is the only micro-
engine thread running. This is an increase by a factor of 8.4. In contrast, MD5
consumes roughly 0.518 million cycles on the StrongARM and 0.922 million cy-
cles on a single microengine thread. This results in an increase by a factor of 1.6.
Thus, AES requires a substantially higher increase in cycles when moving from
the StrongARM to a microengine thread.
Figure 5.4 shows that although performance on the microengine threads is
far worse than the serial version, it remains relatively constant as the number of
threads increases. Therefore, the poor performance of AES on the microengine
threads is primarily a result of processing power and cache size, not memory
contention between threads which would be the case if speedup tailed off.
5.1.4 Isolation Analysis
These tests reveal general characteristics of each kernel on both the Stron-
gARM and microengines. MD5 has been shown to offer strong speedup on mi-
croengine threads using conventional parallelization. URL, using an alternative
approach to multi-threading, has been shown to provide maximum speedup be-
tween 4 and 6 threads employing either a polling or non-polling scheme. Finally,
AES reveals an algorithm with poor performance on the microengines that cannot
be overcome by multi-threading.
30
5.2 Shared Tests
The purpose of the Shared tests is to determine how sensitive each kernel
is to the concurrent execution of the other kernels. For these tests we ran all
three kernels on the simulator at the same time. The StrongARM served as the
controller, passing incoming packets to available microengine threads. We ran
one test for each kernel, in which the number of threads available to the kernel
under test was varied, while the threads available to the other kernels remained
constant. Our baseline for each of these tests was 1 thread for MD5, 4 threads
for URL, and 1 thread for AES. This baseline was chosen because running
URL with few than 4 threads was found to cause a significant bottleneck. The
number of threads available to the kernel under examination was increased for
each subsequent run. Each kernel processed a separate packet stream until the
kernel under test completed the desired number of packets, in this case 50.
Figure 5.5 shows the speedup results from all three tests on the same graph
revealing the relative speedup of each kernel. Clearly, MD5 and AES have much
greater speedup than URL, indicating they are less sensitive to the concurrent
processing of other kernels. However, it is more interesting to compare the Shared
speedup of each kernel with its Isolated speedup. This comparison is covered in
the following subsections.
5.2.1 MD5
The speedup results of MD5 in the Isolation and Shared tests, shown in
Figures 5.1 and 5.5 respectively, show few differences. The slope of each graph
is approximately the same and both peak near a speedup of six. This indicates
that MD5 is not substantially affected by the concurrent execution of URL and
31
Figure 5.5. Shared Speedup on 50 Packets
AES. The lightweight nature of MD5 with regards to memory is the most likely
explanation for this behavior.
Figure 5.6 compares the MD5 Isolation and Shared tests with regard to the
number of cycles consumed by the StrongARM while 50 packets are processed,
revealing that more cycles are required to process the same packet stream when
MD5 is sharing the resources of the NPU. The horizontal-axis corresponds to the
number of MD5 threads employed to process the packets while the vertial-axis
corresponds to the number of cycles spent processing the packet stream. Since
with the Shared tests 4 threads are allocated to URL and 1 to AES, these threads
cause contention for access to shared resources and therefore higher cycle counts
than the Isolation tests.
32
Figure 5.6. MD5 Isolated vs. Shared Cycles on 50 Packets
5.2.2 URL
Although the Shared speedup of URL shown in Figure 5.5 steadily increases,
its maximum of 1.17 with 22 threads does not match the Isolation speedup shown
in Figure 5.2 that peaks at 1.75 and degrades to 1.41 with 22 threads. This
indicates that URL is affected by the concurrent execution of other applications
due to its memory access requirements.
5.2.3 AES
The Shared speedup of AES shown in Figure 5.5 is an order of magnitude
greater than the Isolation speedup shown in Figure 5.4. This high speedup is
due to the fact that the baseline for this test performed extremely poorly. This
can be attributed to two characteristics of the AES kernel. Firstly, as shown
in the Isolation tests, AES performs poorly on the microengines due to their
lack of processing power and limited size of their cache. Secondly, since the
33
StrongARM is the controller for all three kernels it continuously monitors all of
the microengine threads and distributes incoming packets as necessary. In the
baseline, the StrongARM has to monitor one thread for each kernel, thus only
one-third of its time is spent monitoring the AES thread. Therefore, the AES
thread occasionally finishes processing a packet and wastes idle cycles waiting
for the StrongARM to send it another packet. As more threads are allocated to
AES, the StrongARM spends a larger percentage of time monitoring AES threads
therefore increasing throughput.
5.2.4 Shared Analysis
The Shared tests reveal that MD5 and AES are relatively insensitive to the
concurrent execution of the other kernels on a single NPU. URL, however, is
sensitive, and its speedup suffers when it is run alongside the other kernels.
5.3 Static Tests
The Static tests were designed to reveal characteristics of the end-to-end ap-
plication, such as the location of bottlenecks and the ideal thread configuration.
The testing process was similar to that of the Shared tests. The difference being
that instead of processing independent packet streams, the applications worked
together to process a single packet stream. Each incoming packet was processed
first by MD5, then by URL, and finally by AES. This scenario represents a
possible end-to-end application running on a NPU as shown in Figure 5.7. The
purpose of this application is to distribute sensitive information from a trusted
internal network through the Internet to a variety of hosts. Each packet is re-
34
ceived by the application from the internal network, the application calculates
its MD5 signature, determines its destination based on a deep inspection of the
packet, and then encrypts it. Finally, the the encrypted packet along with its
signature is sent to a host machine; although this step is not included in the sim-
ulated application. To complete this scenario, the host machine would decrypt
the packet and verify that the contents were not modified in transit by comparing
the included signature to a newly generated one. This is also not included in the
simulation.
Figure 5.7. End-to-End Application Scenario
Figure 5.8. Optimization with Static Allocation of Threads
For these tests, the number of threads allocated to each stage of the end-to-
35
end application is static for each run. Once again, the baseline test is 1 thread for
MD5, 4 threads for URL, and 1 thread for AES. Each subsequent test increases
the number of threads by one and attempts to determine the optimal configura-
tion. The optimal configuration is determined by giving the additional thread to
each of the applications in turn, and observing which configuration yields the best
speedup. This configuration is then used as a starting point for the subsequent
test.
Figure 5.8 shows the resulting optimal configurations for each number of avail-
able threads between 6 and 24. These configurations were found through test runs
of 50 packets. MD5 never became a bottleneck point and 1 thread remained suf-
ficient throughout the tests. URL and AES almost evenly split the remaining
threads, with the a final configuration of 12 threads for AES, 11 for URL, and
1 for MD5. These results show that the demands of AES and URL are similar
and parallelization offers increased performance for these applications, while the
simplicity of MD5 makes parallelization of it in the context of this end-to-end
application unnecessary.
The above discovery reveals an interesting characteristic of this end-to-end
application. Although MD5 provided the best speedup in the Isolation tests,
parallelizing it in the Static tests resulted in less performance improvement than
further parallelization of the other applications. This can be explained by Am-
dahl’s Law [2], which states that the overall speedup achievable from the improve-
ment of a proportion of the required computation is affected by the size of that
proportion. If P is the proportion and S is the speedup, Amdahl’s Law states
that the overall speedup will be:
36
1
P
(1 − P ) + S
Therefore, the computation required to perform MD5 in this end-to-end ap-
plication is a small proportion of the overall computation. Subsequently, speedup
benefits more through increased parallelization of URL and AES.
It is also interesting to note that although AES did not benefit from additional
microengines during the Isolation tests (Figure 5.4), in the high-load context of
this end-to-end application additional AES threads benefit overall performance.
Figure 5.8 also shows that initially more threads were allocated to URL and
after 14 threads more threads were allocated to AES. Since URL is required to
finish processing each packet before it can be sent to AES, URL caused more of
a bottleneck when it had less than 10 threads. After that point, AES required 4
threads to ever 1 for URL in order to keep pace.
5.4 Dynamic Tests
The Dynamic tests present an alternative approach to the Static tests. Where
the Static tests represent ideal configurations, the Dynamic tests represent real-
istic configurations. Static allocation of microengine threads is also much less
feasible since all possible configurations must be run in order to determine the
best one for the given end-to-end application. This could become an extremely
complex and lengthy process. The trade-off with a dynamic heuristic is increased
complexity in the logic of the application.
The purpose of these tests was to determine how an on-demand allocation
of threads performs against a static approach. The Dynamic tests consist of
37
all three kernels processing the same packet stream in serial, as in the Static
tests, but with threads dynamically allocated based on demand. Once again,
the StrongARM serves as the controller and is responsible for allocating threads.
Allocation is implemented through the use of queues for each stage of the end-
to-end application. Each queue stores pointers to packets that are waiting to be
processed by the next stage. The StrongARM detects when a queue has packets
and creates threads to process them.
Figure 5.9. Dynamic Speedup on 50 Packets
Figure 5.9 shows the speedup of the Dynamic application using as a baseline
the Static configuration consisting of 1 MD5, 4 URL, and 1 AES thread. The
speedup increases from 4.29 with 6 threads to 4.39 with 24 threads, a substantial
increase over the Static baseline.
Figure 5.10 shows the difference between the number of cycles requires for
each of the applications to process the same number of packets. While the Static
version spent in the neighborhood of 1.3 billion cycles per 50 packets, the Dynamic
version spent closer to 300 million, a ratio of 4.3:1.
38
Figure 5.10. Static vs. Dynamic Cycles on 50 Packets
This discrepancy can be attributed to cycles wasted on idle threads. With the
Static version, each thread is statically assigned to perform either MD5, URL, or
AES. Since the URL kernel requires much longer to run than MD5, the queue
of packets waiting for URL processing is quickly filled forcing the MD5 thread
to stop processing new packets until URL can reduced the queue. At the same
time, when the URL threads were unable to process packets as quickly as the
AES threads, some AES threads wasted idle cycles. The Dynamic version did
not suffer from these bottleneck issues because idle threads were put to use by
whichever kernel required them.
Another benefit of the Dynamic version is that it is able to adjust to changes
in load caused by varying packet sizes and payloads. Specifically, since URL
performs a thorough string matching on the payload of each packet, the size of
the packet has a large affect on the number of cycles required to process it. The
Dynamic version is able to minimize bottlenecks in URL due to large packets by
putting more threads to work on the bottleneck.
39
Figure 5.10 also shows that neither the Static nor the Dynamic versions of
the end-to-end application benefit much from additional threads. The number
of cycles remains relatively constants from 6 to 24 threads. The Isolation tests
show that between 6 and 24 threads MD5 is the only kernel to experience signif-
icant performance improvement. The speedup of URL declines slightly and AES
remains relatively constant. Therefore, with the exception of the MD5 kernel,
the end-to-end applications experience performance characteristics similar to the
Isolation tests. Once again, this can be explained by Amdahl’s Law [2], because
MD5 constitutes only a small percentage of the overall computation. Thus, the
performance of the end-to-end application is driven by the performance of the
URL and AES kernels.
5.5 Analysis
We performed four types of tests for our analysis: Isolation, Shared, Static,
and Dynamic. The Isolation tests established a baseline and explored kernel
behavior on the multi-threading NPU architecture. The Shared tests explored
how each kernel was affected by the concurrent execution of other kernels. The
Static tests revealed characteristics of an end-to-end application and how to best
distribute threads. Finally, the Dynamic tests served to compare an on-demand
thread allocation algorithm to statically allocated threads.
The Isolation tests revealed general characteristics of each kernel on both
the StrongARM and microengines. MD5 offered strong speedup on microengine
threads using conventional parallelization. URL, using an alternative approach to
multi-threading, provided maximum speedup between 4 and 6 threads employing
either a polling or non-polling scheme. Finally, AES revealed an algorithm with
40
poor performance on the microengines that could not be overcome by multi-
threading.
The Shared tests revealed that MD5 and AES are relatively insensitive to the
concurrent execution of the other applications on a single NPU. URL, however,
was shown to be sensitive because its speedup suffered when it was run alongside
the other kernels.
The Static tests provided a baseline for the Dynamic tests and revealed the
optimal thread configurations for running the end-to-end application. Results
showed that the demands of AES and URL are similar and parallelization offered
increased performance for these applications, while the simplicity of MD5 made
parallelization of it in the context of this end-to-end application unnecessary.
As an alternative to statically allocating threads, the Dynamic tests explored
the benefits of dynamically allocating threads. Overall the Dynamic tests re-
quired less than 25% as many cycles to process each 50 packet test as the Static
conterpart.
41
Chapter 6
Conclusion
We have presented a network processor simulator, multi-threaded end-to-end
benchmark applications, and an analysis of the characteristics of these appli-
cations on NPUs. Our first contribution was the creation of a simulator that
emulates a generic network processor modeled on the Intel IXP1200. Our simu-
lator fills a gap in existing academic research by supporting multiple processing
units. Our second contribution was the construction of multi-threaded, end-to-
end application benchmarks. These benchmarks extend the functionality of ex-
isting benchmarks based on single-threaded kernels. Our final contribution was
an analysis of the characteristics of our benchmarks on our network processor
simulator.
Our analysis in Chapter 5 found several interesting results. Firstly, although
the MD5 kernel scaled well in the Isolation and Shared tests, parallelization of
it in an end-to-end application had little effect due to Amdahl’s Law. Secondly,
the Static and Dynamic tests found that the end-to-end application did not have
much performance gain from the addition of more than 6 threads. Finally, the
Dynamic version of the end-to-end application required less than 25% as many
42
cycles to process the same packet stream compared to the Static version.
In an attempt to bridge the gap between the speed of ASIC chips and the
flexibility of general purpose processors, NPUs utilize parallel processing and
special-purpose hardware and memory structure, as well as other techniques.
While NPUs make it possible to deploy complex end-to-end applications into the
network, high speed networks put heavy load on these devices making application
optimization an important area of research. The simulator presented in this paper
made development and analysis of two end-to-end application benchmarks as well
as the kernels making up these applications. Through the development of these
kernels and applications we explored several parallelization techniques. Using our
simulator and testing methodology, we unveiled the performance characteristics
of these kernels and application benchmarks on a typical NPU.
43
Chapter 7
Future Work
The simulator developed in this work provides a tool that can be used in a
variety of future projects. Thus far, the simulator has been used by Gridley in
his Master’s thesis on active network algorithm performance [9] and Tsudama to
test his denial-of-service detection algorithm as part of his Master’s thesis [29].
As future work, several improvements could be made to the existing simulator
including support for dedicated processing chips, larger cycle count capability,
and updates necessary to model the current generation of NPUs.
Other future work could include testing the existing end-to-end applications
on an updated simulator to determine whether or not the performance problems
found in this work have been overcome by the current generation of NPUs. If
the same performance problems remain, further investigation into methods of
designing parallel applications to avoid bottlenecks on NPUs will be required.
Additionally, the parameters of the NPU architecture could be adjusted to deter-
mine which changes lead to performance improvements. However, if performance
bottlenecks are not found on current NPUs, then larger scale end-to-end appli-
cations should be developed to push the performance limits of the architecture
44
and reveal new bottlenecks.
The benchmark suite could be extended by including additional kernels. The
end-to-end applications could be extended to include these kernels or new end-
to-end applications could be developed to model other real-world scenarios. Op-
timization of the current and future kernels and end-to-end applications will
continue to be an open area of research.
45
Bibliography
[1] J. Allen, B. Bass, C. Basso, R. Boivie, J. Calvignac, G. Davis, L. Frelechoux,
M. Heddes, A. Herkersdorf, A. Kind, J. Logan, M. Peyravian, M. Rinaldi,
R. Sabhikhi, M. Siegel, and M. Waldvogel. IBM PowerNP network proces-
sor: Hardware, software, and applications. IBM Journal of Research and
Development, 2003.
[2] Gene Amdahl. Validity of the single processor approach to achieving large-
scale computing capabilities. In AFIPS Conference Proceedings, pages 483–
485, Atlantic City, N.J., 1967.
[3] Douglas C. Burger and Todd M. Austin. The simplescalar tool set, version
2.0. Technical Report CS-TR-1997-1342, Computer Sciences Department,
University of Wisconsin, June 1997.
[4] A. Campbell, S. Chou, M. Kounavis, V. Stachtos, and J. Vicente. Netbind: A
binding tool for constructing data paths in network processor-based routers.
In Proceedings of IEEE OPENARCH 2002, New York City, NY, June 2002.
[5] Intel Corporation. Intel internet exchange architecture (IXA) software
development kit. http://www.intel.com/design/network/products/
npfamily/sdk download.htm. Accessed June 3, 2005.
46
[6] Intel Corporation. Ixp1200 network processor datasheet, September 2003.
[7] J. Daemen and V. Rijmen. AES proposal: Rijndael. First Advanced En-
cryption Standard (AES) Conference, August 1998.
[8] EZchip technologies. Network processor designs for next-generation net-
working equipment. White Paper, December 1999. http://www.ezchip.
com/html/tech nsppaper.html.
[9] Dave Gridley. Active network algorithm performance on a network processor:
Adaptive metric based routing and multicast. Master’s thesis, California
Polytechnic State University, San Luis Obispo, June 2004.
[10] M. Guthaus, J. Ringenberg, T. Austin, T. Mudge, and R. Brown. Mibench:
A free, commercially representative embedded benchmark suite. In Pro-
ceedings of the IEEE 4th Annual Workshop on Workload Characterization,
Austin, TX, December 2001.
[11] R. Housley. Internet X.509 public key infrastructure certificate and certifi-
cate revocation list (CRL) profile. RFC 3280, Internet Engineering Task
Force, April 2002.
[12] PMC-Sierra Inc. URL-based switching. WHITE PAPER PMC-2002232,
February 2001.
[13] Eddie Kohler, Robert Morris, Benjie Chen, John Jannotti, and M. Frans
Kaashoek. The click modular router. ACM Transactions on Computer Sys-
tems, 18(3):263–297, August 2000.
[14] Arjen Lenstra, Xiaoyun Wang, and Benne de Weger. Colliding X.509
certificates. Cryptology ePrint Archive, Report 2005/067, 2005. http:
//eprint.iacr.org/.
47
[15] Alberto Leon-Garcia and Indra Widjaja. Communication Networks: Fun-
damental Concepts and Key Architectures. McGraw-Hill School Education
Group, 2000.
[16] Gokhan Memik. Netbench web site. http://cares.icsl.ucla.edu/
NetBench/, 2002.
[17] Gokhan Memik and William H. Mangione-Smith. NEPAL: A framework for
efficiently structuring applications for network processors. Second Workshop
on Network Processors (NP-2), February 2003.
[18] Mangione Smith Memik and Hu. Netbench: A benchmarking suite for
network processors. In Proceedings of IEEE International Conference on
Computer-Aided Design, November 2001.
[19] K. Keutzer N. Shah, W. Plishker. NP-click: A programming model for
the Intel IXP1200. HPCA-92nd Workshop on Network Processors (NP-2),
February 2003.
[20] National Institute of Standards and Technology, National Bureau of Stan-
dards, U.S. Department of Commerce. Advanced encryption standard. Fed-
eral Information Processing Standard (FIPS) 197, November 2001. http:
//csrc.nist.gov/publications/fips.
[21] University of Michigan at Ann Arbor. Mibench version 1. http://www.
eecs.umich.edu/mibench/, 2002.
[22] Ramaswamy R and T. Wolf. Packetbench: A tool for workload characteri-
zation of network processing. In Proceedings of IEEE 6th Annual Workshop
on Workload Characterization (WWC-6), pages 42–50, Austin, TX, October
2003.
48
[23] Ronald L. Rivest. The MD5 message-digest algorithm. RFC 1321, Internet
Engineering Task Force, April 1992.
[24] N. Shah and K. Keutzer. Network processors: Origin of species. In Proceed-
ings of ISCIS XVII, The Seventeenth International Symposium on Computer
and Information Sciences, 2002.
[25] Cisco Systems. Parallel express forwarding. White Paper,
2002. http://www.cisco.com/en/US/products/hw/routers/ps133/
products white paper09186a008008902a.shtml.
[26] Wikipedia, the free encyclopedia. Application framework. http://en.
wikipedia.org/wiki/Application framework, May 2005.
[27] Wikipedia, the free encyclopedia. Benchmark (computing). http://en.
wikipedia.org/wiki/Benchmark %28computing%29, June 2005.
[28] Wikipedia, the free encyclopedia. Simulator. http://en.wikipedia.org/
wiki/Simulator#Simulation in computer science, May 2005.
[29] Brett Tsudama. A novel distributed denial-of-service detection algorithm.
Master’s thesis, California Polytechnic State University, San Luis Obispo,
June 2004.
[30] T. Wolf and M. A. Franklin. Commbench - a telecommunications benchmark
for network processors. In Proceedings of IEEE International Symposium on
Performance Analysis of Systems and Software, pages 154–162, Austin, TX,
April 2000.
[31] Xelerated. Xelerator X10q network processor. Product Brief, 2004. http:
//www.xelerated.com/file.aspx?file id=62.
49
Appendix A
Acronyms
AES Advanced Encryption Standard
ASIC Application Specific Integrated Circuit
CRC Cyclic Redundancy Check
DH Diffie-Hellmen
DMM Dynamic Module Manager
DRR Deficit-Round Robin
HTTP HyperText Transport Protocol
MD5 Message-Digest 5
NAT Network Address Translation
NEPAL Network Processor Application Language
NPU Network Processing Unit
PE processing element
50
PISA Portable Instruction Set Architecture
PGP Pretty Good Privacy
REED Reed-Solomon Forward Error Correction
SHA Secure Hash Algorithm
TCP Transmission Control Protocol
URL Uniform Resource Locator
51

Thesis

Uploaded by

Document Information

Original Description:

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Thesis

Uploaded by

Copyright:

Available Formats

Multi-Threaded End-to-End Applications on Network Processors

TITLE: Multi-Threaded End-to-End Applications on Network Processors

AUTHOR: Michael S. Watts

DATE SUBMITTED: January 26, 2006

Professor Diana Franklin

Professor Hugh Smith

Professor Phil Nico

Multi-Threaded End-to-End Applications on Network Processors

tion of applications for these devices is an important area of research. Many

network processors provide multiple processing chips, and it is up to the applica-

must be able to parallelize full end-to-end applications that may be composed of

several less complex application kernels.

This thesis presents a multi-threaded end-to-end application benchmark suite

ing our benchmark suite we evaluate the effectiveness of network processors to

support end-to-end applications as well as the effectiveness of various paralleliza-

tion techniques to take advantage of the network processor architecture. We show

that kernel performance is an inaccurate indicator of end-to-end application per-

As available processing power has increased, devices that traditionally used

grammable processors in order to take advantage of their flexibility. This increase

in flexibility has traditionally been gained at the sacrifice of speed. Network

no single unifying characteristic that allows all network processors to accomplish

parallel processing, special-purpose hardware, memory structure, communication

acted only as routers and switches.

mization of applications for these devices is an important area of research. It is up

to the application developer to utilize the parallelism available in network proces-

sors. Parallelization of kernels is often a trivial task compared to parallelization

programs such as MD5, URL-based switching, and AES discussed in Chapter 4.

An end-to-end application refers to a useful combination of kernels. The end-

to-end application discussed in Chapter 5 makes use of the kernels in Chapter

destination using URL-based switching, and finally encrypting it using AES. In

decrypted at the destination node.

Our first contribution is the creation of a simulator that emulates a generic

existing academic research by supporting multiple processing units. In this way,

We chose to emulate the IXP1200 because it is a member of the commonly used

Intel IXP line of Network Processing Unit (NPU)s.

Our second contribution is the construction of multi-threaded, end-to-end

threaded kernels. Since network processors are capable of supporting complex

applications, it is important to have benchmarks that fully utilize them. Exist-

ing benchmark suites make it difficult to research the properties of parallelized

end-to-end applications since they are made up of single-threaded kernels. Our

Our third contribution is an analysis of our multi-threaded, end-to-end ap-

plication benchmarks on our network processor simulator. This analysis reveals

nesses of network processors.

This paper is organized as follows. In the next chapter we provide background

up our end-to-end application benchmark are presented in Chapter 4. Chapter

5 describes our testing methodology and our evaluation of the effectiveness of

network processors to support end-to-end applications as well as the effectiveness

of various parallelization techniques to take advantage of the network processor

architecture. Finally, our conclusion is presented in Chapter 6.

gies employed to bridge the gap: parallel processing, special-purpose hardware,

memory structure, communication mechanisms, and peripherals [24]. Network

2.1 Network Processors

NPU is a general term used to describe any processor designed to process

packets for network communication. Another characteristic of NPUs is that their

configuration: pipelined and symmetric. A processing element (PE) is a processor

able to decode an instruction stream [24]. Pipelined configurations dedicate each

PE to a particular packet processing task, while in symmetric configurations

capable of taking advantage of the inherent parallelism in packet processing.

and IBM PowerNP [1].

High-speed networks place high demands on the performance of NPUs. In

only one strategy used in NPUs to improve performance. Another strategy is

to use special-purpose hardware to offload tasks from the PEs. Special-purpose

hardware includes co-processors and special functional units.