A Short Course on
"Programming Multi-Core
Processors Based
Embedded Systems"
2010
Rev 1209-1
LAB WORKBOOK
This workbook is written to assist the students of the short course "Programming Multi-Core Processors Based Embedded Systems - A Hands-On Experience with Cavium Octeon Based Platforms". The contents of this document have been compiled from various academic resources to expose the students to the basics of multi-core architectures in a hands-on fashion.
1.1. Objectives
The objective of this lab is to understand the underlying multi-core architecture and its performance. For this purpose, this lab session introduces the "Multi-core Processor Architecture and Communication" (MPAC) library and reference performance benchmarks. You will learn to develop parallel applications for multi-core based systems using the MPAC library. At the end of this lab, you should know:
1. How to use the MPAC benchmarks to understand the processor architecture and its performance;
2. How to write a basic parallel program in C using the MPAC library.
1.2. Setup
All of the code used for this lab is provided with the description on your host system. You will need to build, execute, and analyze the code. The MPAC library and benchmarking suite are available on-line at http://www.university.caviumnetworks.com/downloads/mpac_1.2.zip. To run the code on the target embedded system (Cavium board), you will need to cross compile it on your host system for the target, copy the executables to the target system, and run them there.
The MPAC library provides a framework that eases the development of parallel applications and benchmarks for state-of-the-art multi-core processor based computing and networking platforms. MPAC uses multiple threads in a fork-and-join approach, which helps exercise multiple processor cores of a system simultaneously according to a user-specified workload. The flexibility of the MPAC software architecture allows a user to parallelize a task without going into the peculiar intricacies of parallelism. The MPAC library allows the user to implement suitable experimental control and to replicate the same task across multiple processors or cores using fork-and-join parallelism. MPAC is an open-source, C-based, POSIX-compliant library, which is freely available under a FreeBSD-style licensing model. Fig. 1 provides an overview of MPAC's software architecture. It provides an implementation of some common tasks, such as measurement of timer resolution, accurate interval timers, and other statistical and experimental-design related functions, which may be too time consuming or complex to be written by a regular user. However, these ideas are fundamental to accurate and repeatable measurement-based evaluation.
Fig. 2 shows an overview of MPAC's fork-and-join execution model. In the following subsections, we provide details about the various MPAC modules and related APIs. Thread-based parallel application development requires thread creation, execution control, and termination. Thread usage varies depending on the task. A user may require a thread to terminate after it has completed its task, or to wait for other threads to complete their tasks and terminate together. The MPAC library provides a Thread Manager (TM), which handles thread activities transparently for the end user. It offers high-level functions to manage the life cycle of a user-specified thread pool of non-interacting workers. It is based on the fork-and-join threading model for concurrent execution of the same workload on all processor cores. Thread Manager functions include thread creation, thread locking, thread affinity, and thread termination.
[Fig. 2: MPAC fork-and-join execution model: MPAC Initialization → Argument Handling → Thread Creation & Forking → Thread Routine() (one instance per worker) → Thread Joining → Output Processing & Display.]
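The Thread Manager behavior described above, including thread affinity, can be approximated with plain pthreads. The sketch below forks workers, pins each one to a core with the glibc-specific pthread_setaffinity_np call (an assumption: a Linux/glibc toolchain), and joins them; it uses no MPAC APIs.

```c
/* Fork-and-join with per-thread core affinity using plain pthreads.
 * This approximates what MPAC's Thread Manager automates; the names
 * worker() and fork_and_join() are illustrative only. */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>
#include <unistd.h>

#define NWORKERS 4

static void *worker(void *arg)
{
    long id = (long)arg;
    /* every worker executes the same workload, MPAC-style */
    printf("worker %ld on CPU %d\n", id, sched_getcpu());
    return NULL;
}

int fork_and_join(void)
{
    pthread_t tid[NWORKERS];
    long ncpu = sysconf(_SC_NPROCESSORS_ONLN);

    for (long i = 0; i < NWORKERS; i++) {       /* fork phase */
        if (pthread_create(&tid[i], NULL, worker, (void *)i) != 0)
            return -1;
        cpu_set_t set;                          /* pin worker i to a core */
        CPU_ZERO(&set);
        CPU_SET((int)(i % ncpu), &set);
        pthread_setaffinity_np(tid[i], sizeof set, &set);
    }
    for (long i = 0; i < NWORKERS; i++)         /* join phase */
        pthread_join(tid[i], NULL);
    return 0;
}
```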
Before writing a parallel application for a specific platform, it is a good idea to identify and understand the underlying hardware architecture.
To expose the hardware details of the system (CPU, memory, I/O devices, and network interfaces), Linux maintains the /proc file system, which dynamically tracks system hardware resources and their performance statistics. We can read various files under /proc to identify system hardware details:
$ cat /proc/cpuinfo
$ cat /proc/meminfo
The file "cpuinfo" gives details of all the processor cores available in the system. The main fields to observe are "processor", "model name", "cpu MHz", "cache size", and "cpu cores". The "processor" field is the processor id of the core, and "model name" gives the processor type and vendor. "cpu MHz", "cache size", and "cpu cores" represent the processor frequency, the L2 cache size, and the number of cores per socket, respectively. Another way to get the details of your processor is by issuing the following command.
$ dmesg | grep CPU
The file "meminfo" gives details of the memory organization of the system. The main variable
to observe is "MemTotal" which represents the total size of the main memory of your system.
Another way to get the details of the system memory is by issuing the following command.
$ dmesg | grep mem
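The same details can be pulled out of /proc non-interactively. A small shell sketch using only standard Linux tools (the variable names are our own):

```shell
# Summarize core count and memory size from /proc (Linux only).
cores=$(grep -c '^processor' /proc/cpuinfo)
mem_kb=$(awk '/^MemTotal:/ {print $2}' /proc/meminfo)
echo "logical cores: $cores, total memory: $mem_kb kB"
```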
Fill in the following table to note your observations and measurements of the host and target
systems:
Table 1. System hardware details.

                        Host system        Target system
Processor id(s)
Model name
CPU frequency (MHz)
L2 Cache size
Number of cores
MemTotal
1.5. Understanding Processor Architecture and Performance using MPAC for host system
In order to understand the performance of your multi-core architecture under CPU- and memory-intensive workloads, MPAC provides CPU and memory benchmarks. After you have downloaded and unpacked the MPAC software, run the following commands to configure and compile it on your development host.
To go to the main directory issue the following command.
host$ cd /<path-to-mpac>/mpac_1.2
© Copyright 2010 Dr Abdul Waheed for Cavium University Program
Cavium University Program LAB WORK BOOK
where <path-to-mpac> is the directory where mpac is located. Then issue the following
commands.
host$ ./configure
host$ make clean
host$ make
Then run the CPU benchmark with:
host$ ./mpac_cpu_bm –n <# of Threads> -r <# of repetitions>
where –n is the number of threads and –r is the number of times the task is run. For additional arguments that can be passed through the command line, issue the following command.
host$ ./mpac_cpu_bm –h
Fill in the following table to note your observations and measurements from running the MPAC CPU benchmark on the host system:
Table 2. CPU performance in MOPS
To run the MPAC memory benchmark, issue the following commands.
host$ cd benchmarks/mem
host$ ./mpac_mem_bm –n <# of Threads> -s <array size> -r <# of repetitions> -t <data type>
For additional arguments that can be passed through command line, issue the following
command.
host$ ./mpac_mem_bm –h
Fill in the following table to note your observations and measurements from running the MPAC memory benchmark on the host system:
Table 3. Memory performance in Mbps for integer data type (array size 1048576 elements, 8 MB).
1.6. Understanding Processor Architecture and Performance using MPAC for Target system
host$ cd /<path-to-mpac>/mpac_1.2
where <path-to-mpac> is the directory where MPAC is located. Then issue the following commands.
host$ make clean
host$ make CC=mips64-octeon-linux-gnu-gcc AR=mips64-octeon-linux-gnu-ar
where "mips64-octeon-linux-gnu-gcc" is the gcc cross compiler for OCTEON based systems.
To execute the MPAC CPU benchmark on the target system, copy the executable "mpac_cpu_bm" to the target system and issue the following command.
target$ ./mpac_cpu_bm –n <# of Threads> -r <# of repetitions>
where –n is the number of threads and –r is the number of times the task is run. For additional arguments that can be passed through the command line, issue the following command.
target$ ./mpac_cpu_bm –h
Fill in the following table to note your observations and measurements from running the MPAC CPU benchmark on the target system:
Table 4. CPU performance in MOPS
To execute the MPAC memory benchmark on the target system, copy the executable "mpac_mem_bm" to the target system and issue the following command.
target$ ./mpac_mem_bm –n <# of Threads> -s <array size> -r <# of repetitions> -t <data type>
For additional arguments that can be passed through the command line, issue the following command.
target$ ./mpac_mem_bm –h
Fill in the following table to note your observations and measurements from running the MPAC memory benchmark on the target system:
Table 5. Memory performance in Mbps for integer data type (array size 1048576 elements, 8 MB).
In this exercise we will compile and run a simple sequential "Hello World" program, written in C, which prints "Hello World" to the screen and exits.

#include <stdio.h>

int main(void)
{
    printf("Hello World\n");
    return 0;
}
On Target System
host$ mips64-octeon-linux-gnu-gcc -o outputFileName sourceFileName.c
target$ ./outputFileName
1. #include <pthread.h>
2. #include <stdio.h>
3. #include <stdlib.h>
4. #define MAX_WORKER 8
This is the multithreaded version of the "Hello World" program using POSIX threads. On line 14 of the main program, the thread variable is initialized. On line 22, the number of threads specified by the user on the command line ("num_thrs") are created. If an invalid value is given, "MAX_WORKER" threads are created. The created threads execute the function "PrintHello" in parallel, starting at line 5 (as specified by the third argument of pthread_create), and each thread terminates by calling "pthread_exit" on line 9. Meanwhile, the main thread waits at line 26 for the created threads to complete, using the "pthread_join" function; once control returns to the main thread, the program exits.
"-lpthread" is necessary for every pthread program. It links the pthread library "libpthread.so" to your code. Without it, your program will fail to compile or link.
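The pthreads listing above is truncated. The sketch below is a self-contained reconstruction of the same idea; the names PrintHello, num_thrs, and MAX_WORKER come from the text, the run_hello wrapper is our own addition for illustration, and line numbers will not match the excerpt.

```c
/* Self-contained sketch of the multithreaded "Hello World" described
 * above. run_hello() is an illustrative wrapper, not from the lab. */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

#define MAX_WORKER 8

static void *PrintHello(void *arg)
{
    long tid = (long)arg;
    printf("Hello World from thread %ld\n", tid);
    pthread_exit(NULL);            /* terminate this worker thread */
    return NULL;                   /* not reached */
}

/* Create num_thrs workers, wait for all of them, return 0 on success. */
int run_hello(int num_thrs)
{
    pthread_t thread[MAX_WORKER];
    long t;

    if (num_thrs < 1 || num_thrs > MAX_WORKER)
        num_thrs = MAX_WORKER;     /* fall back on an invalid request */

    for (t = 0; t < num_thrs; t++)              /* fork phase */
        if (pthread_create(&thread[t], NULL, PrintHello, (void *)t) != 0)
            return -1;

    for (t = 0; t < num_thrs; t++)              /* join phase */
        if (pthread_join(thread[t], NULL) != 0)
            return -1;
    return 0;
}
```

Remember to compile with -lpthread, as noted above.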
1.9. Writing a parallel program using the MPAC library
A four-step generic procedure is required to develop a parallel application using the MPAC library: (1) declarations; (2) thread routine; (3) thread creation; and (4) optional final calculations and garbage collection. The declaration step requires the declaration and initialization of the user input structure and thread data structure variables. The 'Thread Routine' step requires writing a thread subroutine to be executed by the threads. The 'Thread Creation' phase requires creating a joinable or detachable thread pool according to user requirements. The 'Optional final calculations and garbage collection' step, in the case of joinable threads, requires performing the final calculations, displaying the output, and releasing the acquired resources.
To write a parallel program using the MPAC library, four files need to be created, along with a makefile that eases compilation of your application. The first is the header file for the application to be developed, which includes the data structure for user input (config_t), the data structure for passing data to threads (context_t), global variables, and function prototypes. The second file includes all the general functions, which include processing user input arguments, handling default arguments, initializing the thread data structure, and help and printing functions. The thread file includes the main function of the application and invokes the thread function. The fourth file includes the thread function that is executed by each thread.
The "hello world" example is included in MPAC under the "apps" directory. It takes two arguments from the user: (1) the number of threads and (2) the processor affinity. In order to write your own application using the MPAC library, simply extend the "hello world" example by updating the data structures and general functions in the mpac_hello.h and mpac_hello.c files. The only major changes you have to make will be in the mpac_hello_app_hw.c file, which contains the thread function.
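The four-file structure described above can be compressed into one illustrative listing. config_t and context_t are the names used in the text, but their fields and the run_app function below are hypothetical, chosen only to show the pattern; plain pthreads stands in for MPAC's thread manager.

```c
/* Illustrative shape of an MPAC-style application, compressed into one
 * listing. Field names and run_app() are hypothetical. */
#include <pthread.h>
#include <stdio.h>

typedef struct {            /* user input (header file) */
    int num_threads;
    int affinity;
} config_t;

typedef struct {            /* per-thread data passed to the routine */
    int thread_id;
    config_t *cfg;
} context_t;

/* thread function file: the routine each worker executes */
static void *thread_fn(void *arg)
{
    context_t *ctx = arg;
    printf("hello from thread %d of %d\n",
           ctx->thread_id, ctx->cfg->num_threads);
    return NULL;
}

/* thread file: create the pool, join it, then final calculations */
int run_app(config_t *cfg)
{
    pthread_t tid[16];
    context_t ctx[16];

    if (cfg->num_threads < 1 || cfg->num_threads > 16)
        return -1;                                 /* reject bad input */

    for (int i = 0; i < cfg->num_threads; i++) {   /* thread creation */
        ctx[i] = (context_t){ .thread_id = i, .cfg = cfg };
        pthread_create(&tid[i], NULL, thread_fn, &ctx[i]);
    }
    for (int i = 0; i < cfg->num_threads; i++)     /* join */
        pthread_join(tid[i], NULL);
    return 0;      /* final calculations and cleanup would go here */
}
```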
2. Parallel Sorting
2.2. Setup
All of the code used for this lab is provided with the description on your local system. You will need to build, execute, and analyze the code. The MPAC library and benchmarking suite are available on-line at http://www.university.caviumnetworks.com/downloads/mpac_1.2.zip. To run the code on the target embedded system (Cavium board), you will need to cross compile it on your host system for the target, copy the executables to the target system, and run them there.
To execute the Parallel Quick Sort example issue the following commands:
host$ cd /<path-to-mpac>/mpac_1.2/apps/sort
host$ make clean
host$ make CC=mips64-octeon-linux-gnu-gcc AR=mips64-octeon-linux-gnu-ar
target$ ./mpac_sort_app –n <# of Threads> –s <Array Size> -u q
Other options are –m and –l for upper and lower limit of random data, and –a is to set the
processor affinity.
Fill in the following table to note your observations and measurements after running MPAC parallel quick sort on the target system:
Table 6. Time taken in microseconds to sort an array of Million Elements using parallel quick sort.
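Inside each worker, the parallel sort reduces to a serial quick sort of that worker's chunk of the array. With the standard C library this is qsort plus a comparator; the sketch below is generic, not MPAC's code, and the names cmp_int and sort_chunk are our own.

```c
/* Serial quick sort of one worker's chunk via the C library's qsort.
 * Function names here are illustrative, not MPAC's. */
#include <stdlib.h>

static int cmp_int(const void *a, const void *b)
{
    int x = *(const int *)a, y = *(const int *)b;
    return (x > y) - (x < y);     /* avoids overflow of x - y */
}

void sort_chunk(int *data, size_t n)
{
    qsort(data, n, sizeof *data, cmp_int);
}
```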
In the parallel bucket sort algorithm, the minimum and maximum elements are identified, and the range between the minimum and maximum is divided equally among the total number of threads, forming the buckets; there will be as many buckets as there are threads. Each element of the data array is then placed in its appropriate bucket array. The bucket arrays are passed to the threads, which sort the data using quick sort and return the bucket arrays to the main thread. In the main thread, the bucket arrays are concatenated to form the sorted data array. The bucket sort algorithm is illustrated in figure 4.
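The partitioning step described above can be sketched as a small helper that maps a value to its bucket. The name bucket_index is our own, and the width formula is one plausible choice that reproduces the figure's example (min 1, max 43, 4 threads, buckets of width 11).

```c
/* Map a value to its bucket for the parallel bucket sort described
 * above. Function name and width formula are illustrative. */
#include <stddef.h>

size_t bucket_index(int v, int min, int max, size_t nbuckets)
{
    /* choose a width so that nbuckets buckets cover [min, max] */
    size_t width = (size_t)(max - min) / nbuckets + 1;
    return (size_t)(v - min) / width;
}
```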
To Compile & Run:
To execute the Parallel Bucket Sort example issue the following commands:
host$ cd /<path-to-mpac>/mpac_1.2/apps/sort
host$ make clean
host$ make CC=mips64-octeon-linux-gnu-gcc AR=mips64-octeon-linux-gnu-ar
target$ ./mpac_sort_app –n <# of Threads> –s <Array Size> -u b
Other options are –m and –l for upper and lower limit of random data, and –a is to set the
processor affinity.
[Fig. 4: Parallel bucket sort example. Input array: 31 23 14 26 8 36 4 21 4 7 1 43 32 12 21 7. Min = 1, Max = 43, Difference = 42; with Threads = 4 the bucket size is 42/4 ≈ 11, giving bucket ranges 1-11, 12-22, 23-33, and 34-44. Each bucket is quick-sorted by its thread and the buckets are concatenated into the sorted array: 1 4 4 7 7 8 12 14 21 21 23 26 31 32 36 43.]
Fill in the following table to note your observations and measurements after running MPAC
parallel bucket sort on the target system:
Table 7. Time taken in microseconds to sort an array of Million Elements using parallel bucket sort.
3.2. Setup
All of the code used for this lab is provided with the description on your local system. You will need to build, execute, and analyze the code. The testbed setup for Labs 3-5 is shown in figure 5. By definition, the word sniffing means that you 'sniff', or pick up, something for further analysis. In computer networking terminology, sniffing a packet means capturing a packet arriving at or departing from a network interface. This capturing does not disturb the ongoing communication, and it can be done using one or more systems. Figure 5 shows a typical scenario in which the communication endpoints (server and client) have ongoing network flows and a sniffer captures the packets en route. Such a scenario can also be produced on a single machine using the loopback device.
Table 8. Network packets sniffing throughput in Mbps, for multiple queues case with lock free and optimized
enqueue/dequeue functions.
No of Threads
1 2 4 8 16 32
Throughput (Mbps)
4.2. Setup
All of the code and tools required for this lab are provided with the description on your host system. You will need to build, execute, and analyze the code. The MPAC library and benchmarking suite are available on-line at http://www.university.caviumnetworks.com/downloads/mpac_1.2.zip.
In this lab there will be one dispatcher thread, which sniffs the packets, and multiple worker threads to filter the sniffed traffic. To show gradual performance tuning, we implemented the packet filtering application in MPAC using three architectures: (1) a single queue shared between multiple threads with a lock-based design for the enqueue/dequeue functions, as shown in figure 6; (2) multiple queues with a lock-based design for the enqueue/dequeue functions, as shown in figure 7; and (3) multiple queues with lock-free and optimized enqueue/dequeue functions. We have one dispatcher thread in addition to a number of worker threads. The dispatcher gets packets from the NIC and fills the packet queues in round-robin fashion. Packet filtering is defined as packet header inspection at different OSI layers. In this lab, filtering is done using the source and destination IP addresses, the IP protocol field (which is fixed to TCP for this specific lab), and the source and destination ports. These parameters are provided by the user on the command line. The worker threads filter the packets based on the given 5-tuple comparison. We will measure the throughput of the worker threads to see whether or not they can keep up with the sniffer. You will also experiment with and observe the improvements from increasing the number of threads and using thread-to-core affinity.
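The 5-tuple comparison performed by the worker threads can be sketched as a plain struct compare. The struct and field names below are illustrative, not MPAC's actual types.

```c
/* Sketch of the 5-tuple match each worker performs: compare a packet's
 * source/destination IP, protocol, and source/destination port against
 * the user-supplied filter. Type and field names are illustrative. */
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    uint32_t src_ip, dst_ip;      /* IPv4 addresses */
    uint8_t  proto;               /* e.g. 6 for TCP */
    uint16_t src_port, dst_port;
} tuple5_t;

bool match_5tuple(const tuple5_t *pkt, const tuple5_t *filter)
{
    return pkt->src_ip   == filter->src_ip
        && pkt->dst_ip   == filter->dst_ip
        && pkt->proto    == filter->proto
        && pkt->src_port == filter->src_port
        && pkt->dst_port == filter->dst_port;
}
```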
In this case, the sender and receiver are on the same system, which means that the interface to be tested is "lo". If you want to test the Ethernet interface, the sender can be executed on one target and the receiver on the other. The options for running the NPF application are –n for the number of threads, -d for the duration of the test, -e for the execution mode (in this case 4), -f for the interface the sniffer should use to sniff packets (e.g. lo, eth0, or eth1), -p and –P for the port numbers of the sender and receiver respectively, -i and -I for the IP addresses of the sender and receiver respectively, and –a to set the processor affinity. To execute the parallel NPF example, copy the "mpac_sniffer_app" executable under the directory mpac_1.2/apps/sniffer/sniffer_1Q/ to the target system and issue the following commands:
target$ ./mpac_sniffer_app –n <# of Threads> –d <test duration>
-f <interface to sniff> -e 4
[Fig. 6: A single queue shared between the dispatcher and the N worker threads T0 … TN-1; arrows indicate the dispatcher's put direction into the queue.]
Fill in the following table to note your observations and measurements after running MPAC NPF
on the target system:
Table 9. Network packets filtering throughput in Mbps, for single shared queue case.
No of Threads
1 2 4 8 16 32
Throughput (Mbps)
[Fig. 7: One queue per worker thread T0 … TN-1; the dispatcher fills the queues in round-robin order.]
Fill in the following table after running MPAC NPF on the target system:
Table 10. Network packets filtering throughput in Mbps, for multiple queues case.
No of Threads
1 2 4 8 16 32
Throughput (Mbps)
We execute our example after further performance tuning, using multiple queues with lock-free and optimized enqueue/dequeue functions. Copy the "mpac_sniffer_app" executable under the directory mpac_1.2/apps/sniffer/sniffer_MQ_optimized/ to the target system and issue the following command.
target$ ./mpac_sniffer_app –n <# of Threads> –d <test duration>
-f <interface to sniff> -e 4
Fill in the following table after running MPAC NPF on the target system:
Table 11. Network packets filtering throughput in Mbps, for multiple queues case with lock free and optimized
enqueue/dequeue functions.
No of Threads
1 2 4 8 16 32
Throughput (Mbps)
You are encouraged to compare the results of Tables 9, 10, and 11 so that you can appreciate the performance of the different synchronization and design techniques.
5.2. Setup
All of the code and tools required for this lab are provided with the description on your local system. You will need to build, execute, and analyze the code. The MPAC library and benchmarking suite are available on-line at http://www.university.caviumnetworks.com/downloads/mpac_1.2.zip.
Fill in the following table to note your observations and measurements after running MPAC DPI
on the target system:
Table 12. Deep packet inspection throughput in Mbps, for single shared queue case.
No of Threads
1 2 4 8 16 32
Throughput (Mbps)
To observe the results after performance tuning, we use the multiple-queue architecture with a lock-based design for the enqueue/dequeue functions. To execute this example, copy the "mpac_sniffer_app" executable under the directory mpac_1.2/apps/sniffer/sniffer_MQ/ to the target system and issue the following command.
target$ ./mpac_sniffer_app –n <# of Threads> –d <test duration>
-f <interface to sniff> -e 5
Fill in the following table to note your observations and measurements after running MPAC DPI
on the target system:
Table 13. Deep packet inspection throughput in Mbps, for multiple queues case.
No of Threads
1 2 4 8 16 32
Throughput (Mbps)
We execute our example after further performance tuning, using multiple queues with lock-free and optimized enqueue/dequeue functions. Copy the "mpac_sniffer_app" executable under the directory mpac_1.2/apps/sniffer/sniffer_MQ_optimized/ to the target system and issue the following command.
target$ ./mpac_sniffer_app –n <# of Threads> –d <test duration>
-f <interface to sniff> -e 5
Fill in the following table to note your observations and measurements after running MPAC DPI
on the target system:
Table 14. Deep packet inspection throughput in Mbps, for multiple queues case with lock free and optimized
enqueue/dequeue functions.
No of Threads
1 2 4 8 16 32
Throughput (Mbps)
You are encouraged to compare the results in Tables 12, 13, and 14 so that you can appreciate the high performance of the lock-free design. You are also encouraged to study the code of the three different implementations used in these networking labs.