A Short Course on
"Programming Multi-Core
Processors Based
Embedded Systems"
2010
Rev 1209-1
LAB WORKBOOK
This workbook is written to assist the students of the short course "Programming Multi-Core Processors Based Embedded Systems - A Hands-On Experience with Cavium Octeon Based Platforms". The contents of this document have been compiled from various academic resources to expose the students to the basics of multi-core architectures in a hands-on fashion.
1.1. Objectives
The objective of this lab is to understand the underlying multi-core architecture and its performance. For this purpose, this lab session introduces the "Multi-core Processor Architecture and Communication" (MPAC) library and reference performance benchmarks. You will learn to develop parallel applications for multi-core based systems using the MPAC library. At the end of this lab, you should know:
1. How to use the MPAC benchmarks to understand the processor architecture and its performance;
2. How to write a basic parallel program in C using the MPAC library.
1.2. Setup
All of the code used for this lab is provided with the description on your host system. You will need to build, execute, and analyze the code. The MPAC library and benchmarking suite are available on-line at http://www.university.caviumnetworks.com/downloads/mpac_1.2.zip. To run the code on the target embedded system (Cavium board), you will need to cross compile it on your host system for the target, copy the executables to the target system, and run them there.
The MPAC library provides a framework that eases the development of parallel applications and benchmarks for state-of-the-art multi-core processor based computing and networking platforms. MPAC uses multiple threads in a fork-and-join approach, which helps exercise multiple processor cores of a system simultaneously according to a user-specified workload. The flexibility of the MPAC software architecture allows a user to parallelize a task without going into the peculiar intricacies of parallelism. The MPAC library allows the user to implement suitable experimental control and to replicate the same task across multiple processors or cores using fork-and-join parallelism. MPAC is an open-source, C-based, POSIX-compliant library, which is freely available under a FreeBSD-style licensing model. Fig. 1 provides an overview of MPAC's software architecture. It provides an implementation of some common tasks, such as measurement of timer resolution, accurate interval timers, and other statistical and experimental-design related functions, which may be too time consuming or complex to be written by a regular user. However, these ideas are fundamental to accurate and repeatable measurement-based evaluation.
Fig. 2 shows an overview of MPAC's fork-and-join execution model. In the following subsections, we provide details about the various MPAC modules and related APIs. Thread-based parallel application development requires thread creation, execution control, and termination. Thread usage varies depending on the task. A user may require a thread to terminate after it has completed its task, or to wait for other threads to complete their tasks and terminate together. The MPAC library provides a Thread Manager (TM), which handles thread activities transparently for the end user. It offers high-level functions to manage the life cycle of a user-specified thread pool of non-interacting workers. It is based on the fork-and-join threading model for concurrent execution of the same workload on all processor cores. Thread Manager functions include thread creation, thread locking, thread affinity, and thread termination.
[Fig. 2: MPAC fork-and-join execution model: MPAC Initialization → Argument Handling → Thread Creation & Forking → Thread Routine() (one instance per worker) → Thread Joining → Output Processing & Display.]
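The Thread Manager behavior described above, including thread affinity, can be approximated with plain pthreads. The sketch below forks workers, pins each one to a core with the glibc-specific pthread_setaffinity_np call (an assumption: a Linux/glibc toolchain), and joins them; it uses no MPAC APIs.

```c
/* Fork-and-join with per-thread core affinity using plain pthreads.
 * This approximates what MPAC's Thread Manager automates; the names
 * worker() and fork_and_join() are illustrative only. */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>
#include <unistd.h>

#define NWORKERS 4

static void *worker(void *arg)
{
    long id = (long)arg;
    /* every worker executes the same workload, MPAC-style */
    printf("worker %ld on CPU %d\n", id, sched_getcpu());
    return NULL;
}

int fork_and_join(void)
{
    pthread_t tid[NWORKERS];
    long ncpu = sysconf(_SC_NPROCESSORS_ONLN);

    for (long i = 0; i < NWORKERS; i++) {       /* fork phase */
        if (pthread_create(&tid[i], NULL, worker, (void *)i) != 0)
            return -1;
        cpu_set_t set;                          /* pin worker i to a core */
        CPU_ZERO(&set);
        CPU_SET((int)(i % ncpu), &set);
        pthread_setaffinity_np(tid[i], sizeof set, &set);
    }
    for (long i = 0; i < NWORKERS; i++)         /* join phase */
        pthread_join(tid[i], NULL);
    return 0;
}
```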
Before writing a parallel application for a specific platform, it is a good idea to identify and understand the underlying hardware architecture.
To expose the hardware details of the system (CPU, memory, I/O devices, and network interfaces), Linux maintains the /proc file system, which dynamically tracks system hardware resources and their performance statistics. We can read various files under /proc to identify system hardware details:
$ cat /proc/cpuinfo
$ cat /proc/meminfo
The file "cpuinfo" gives details of all the processor cores available in the system. The main fields to observe are "processor", "model name", "cpu MHz", "cache size", and "cpu cores". The "processor" field is the processor id of the core, and "model name" gives the processor type and vendor. "cpu MHz", "cache size", and "cpu cores" represent the processor frequency, the L2 cache size, and the number of cores per socket, respectively. Another way to get the details of your processor is by issuing the following command.
$ dmesg | grep CPU
The file "meminfo" gives details of the memory organization of the system. The main variable
to observe is "MemTotal" which represents the total size of the main memory of your system.
Another way to get the details of the system memory is by issuing the following command.
$ dmesg | grep mem
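The same details can be pulled out of /proc non-interactively. A small shell sketch using only standard Linux tools (the variable names are our own):

```shell
# Summarize core count and memory size from /proc (Linux only).
cores=$(grep -c '^processor' /proc/cpuinfo)
mem_kb=$(awk '/^MemTotal:/ {print $2}' /proc/meminfo)
echo "logical cores: $cores, total memory: $mem_kb kB"
```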
Fill in the following table to note your observations and measurements of the host and target
systems:
Table 1. System hardware details.

                        Host system        Target system
Processor id(s)
Model name
CPU frequency (MHz)
L2 Cache size
Number of cores
MemTotal
1.5. Understanding Processor Architecture and Performance using MPAC for host system
In order to understand the performance of your multi-core architecture under CPU- and memory-intensive workloads, MPAC provides CPU and memory benchmarks. After you have downloaded and unpacked the MPAC software, run the following commands to configure and compile it on your development host.
To go to the main directory issue the following command.
host$ cd /<path-to-mpac>/mpac_1.2
© Copyright 2010 Dr Abdul Waheed for Cavium University Program
Cavium University Program LAB WORK BOOK
where <path-to-mpac> is the directory where mpac is located. Then issue the following
commands.
host$ ./configure
host$ make clean
host$ make
Then run the CPU benchmark with:
host$ ./mpac_cpu_bm –n <# of Threads> -r <# of repetitions>
where –n is the number of threads and –r is the number of times the task is run. For additional arguments that can be passed through the command line, issue the following command.
host$ ./mpac_cpu_bm –h
Fill in the following table to note your observations and measurements from running the MPAC CPU benchmark on the host system:
Table 2. CPU performance in MOPS
To run the MPAC memory benchmark, issue the following commands.
host$ cd benchmarks/mem
host$ ./mpac_mem_bm –n <# of Threads> -s <array size> -r <# of repetitions> -t <data type>
For additional arguments that can be passed through command line, issue the following
command.
host$ ./mpac_mem_bm –h
Fill in the following table to note your observations and measurements from running the MPAC memory benchmark on the host system:
Table 3. Memory performance in Mbps for integer data type (array size 1048576 elements, 8 MB).
1.6. Understanding Processor Architecture and Performance using MPAC for Target system
host$ cd /<path-to-mpac>/mpac_1.2
where <path-to-mpac> is the directory where MPAC is located. Then issue the following commands.
host$ make clean
host$ make CC=mips64-octeon-linux-gnu-gcc AR=mips64-octeon-linux-gnu-ar
where "mips64-octeon-linux-gnu-gcc" is the gcc cross compiler for OCTEON based systems.
To execute the MPAC CPU benchmark on the target system, copy the executable "mpac_cpu_bm" to the target system and issue the following command.
target$ ./mpac_cpu_bm –n <# of Threads> -r <# of repetitions>
where –n is the number of threads and –r is the number of times the task is run. For additional arguments that can be passed through the command line, issue the following command.
target$ ./mpac_cpu_bm –h
Fill in the following table to note your observations and measurements from running the MPAC CPU benchmark on the target system:
Table 4. CPU performance in MOPS
To execute the MPAC memory benchmark on the target system, copy the executable "mpac_mem_bm" to the target system and issue the following command.
target$ ./mpac_mem_bm –n <# of Threads> -s <array size> -r <# of repetitions> -t <data type>
For additional arguments that can be passed through the command line, issue the following command.
target$ ./mpac_mem_bm –h
Fill in the following table to note your observations and measurements from running the MPAC memory benchmark on the target system:
Table 5. Memory performance in Mbps for integer data type (array size 1048576 elements, 8 MB).
In this exercise we will compile and run a simple sequential "Hello World" program, written in C, which prints "Hello World" to the screen and exits.

#include <stdio.h>

int main(void)
{
    printf("Hello World\n");
    return 0;
}
On Target System
host$ mips64-octeon-linux-gnu-gcc -o outputFileName sourceFileName.c
target$ ./outputFileName
1. #include <pthread.h>
2. #include <stdio.h>
3. #include <stdlib.h>
4. #define MAX_WORKER 8
This is the multithreaded version of the "Hello World" program using POSIX threads. On line 14 of the main program, the thread variable is initialized. On line 22, the number of threads specified by the user on the command line ("num_thrs") are created. If an invalid value is given, "MAX_WORKER" threads are created. The created threads execute the function "PrintHello" in parallel, starting at line 5 (as specified by the third argument of pthread_create), and each thread terminates by calling "pthread_exit" on line 9. Meanwhile, the main thread waits at line 26 for the created threads to complete, using the "pthread_join" function; once control returns to the main thread, the program exits.
"-lpthread" is necessary for every pthread program. It links the pthread library "libpthread.so" to your code. Without it, your program will fail to compile or link.
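The pthreads listing above is truncated. The sketch below is a self-contained reconstruction of the same idea; the names PrintHello, num_thrs, and MAX_WORKER come from the text, the run_hello wrapper is our own addition for illustration, and line numbers will not match the excerpt.

```c
/* Self-contained sketch of the multithreaded "Hello World" described
 * above. run_hello() is an illustrative wrapper, not from the lab. */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

#define MAX_WORKER 8

static void *PrintHello(void *arg)
{
    long tid = (long)arg;
    printf("Hello World from thread %ld\n", tid);
    pthread_exit(NULL);            /* terminate this worker thread */
    return NULL;                   /* not reached */
}

/* Create num_thrs workers, wait for all of them, return 0 on success. */
int run_hello(int num_thrs)
{
    pthread_t thread[MAX_WORKER];
    long t;

    if (num_thrs < 1 || num_thrs > MAX_WORKER)
        num_thrs = MAX_WORKER;     /* fall back on an invalid request */

    for (t = 0; t < num_thrs; t++)              /* fork phase */
        if (pthread_create(&thread[t], NULL, PrintHello, (void *)t) != 0)
            return -1;

    for (t = 0; t < num_thrs; t++)              /* join phase */
        if (pthread_join(thread[t], NULL) != 0)
            return -1;
    return 0;
}
```

Remember to compile with -lpthread, as noted above.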
1.9. Writing a parallel program using the MPAC library
A four-step generic procedure is required to develop a parallel application using the MPAC library: (1) declarations; (2) thread routine; (3) thread creation; and (4) optional final calculations and garbage collection. The declaration step requires the declaration and initialization of the user input structure and thread data structure variables. The 'Thread Routine' step requires writing a thread subroutine to be executed by the threads. The 'Thread Creation' phase requires creating a joinable or detachable thread pool according to user requirements. The 'Optional final calculations and garbage collection' step, in the case of joinable threads, requires performing the final calculations, displaying the output, and releasing the acquired resources.
To write a parallel program using the MPAC library, four files need to be created, along with a makefile that eases compilation of your application. The first is the header file for the application to be developed, which includes the data structure for user input (config_t), the data structure for passing data to threads (context_t), global variables, and function prototypes. The second file includes all the general functions, which include processing user input arguments, handling default arguments, initializing the thread data structure, and help and printing functions. The thread file includes the main function of the application and invokes the thread function. The fourth file includes the thread function that is executed by each thread.
The "hello world" example is included in MPAC under the "apps" directory. It takes two arguments from the user: (1) the number of threads and (2) the processor affinity. In order to write your own application using the MPAC library, simply extend the "hello world" example by updating the data structures and general functions in the mpac_hello.h and mpac_hello.c files. The only major changes you have to make will be in the mpac_hello_app_hw.c file, which contains the thread function.
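The four-file structure described above can be compressed into one illustrative listing. config_t and context_t are the names used in the text, but their fields and the run_app function below are hypothetical, chosen only to show the pattern; plain pthreads stands in for MPAC's thread manager.

```c
/* Illustrative shape of an MPAC-style application, compressed into one
 * listing. Field names and run_app() are hypothetical. */
#include <pthread.h>
#include <stdio.h>

typedef struct {            /* user input (header file) */
    int num_threads;
    int affinity;
} config_t;

typedef struct {            /* per-thread data passed to the routine */
    int thread_id;
    config_t *cfg;
} context_t;

/* thread function file: the routine each worker executes */
static void *thread_fn(void *arg)
{
    context_t *ctx = arg;
    printf("hello from thread %d of %d\n",
           ctx->thread_id, ctx->cfg->num_threads);
    return NULL;
}

/* thread file: create the pool, join it, then final calculations */
int run_app(config_t *cfg)
{
    pthread_t tid[16];
    context_t ctx[16];

    if (cfg->num_threads < 1 || cfg->num_threads > 16)
        return -1;                                 /* reject bad input */

    for (int i = 0; i < cfg->num_threads; i++) {   /* thread creation */
        ctx[i] = (context_t){ .thread_id = i, .cfg = cfg };
        pthread_create(&tid[i], NULL, thread_fn, &ctx[i]);
    }
    for (int i = 0; i < cfg->num_threads; i++)     /* join */
        pthread_join(tid[i], NULL);
    return 0;      /* final calculations and cleanup would go here */
}
```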
2. Parallel Sorting
2.2. Setup
All of the code used for this lab is provided with the description on your local system. You will need to build, execute, and analyze the code. The MPAC library and benchmarking suite are available on-line at http://www.university.caviumnetworks.com/downloads/mpac_1.2.zip. To run the code on the target embedded system (Cavium board), you will need to cross compile it on your host system for the target, copy the executables to the target system, and run them there.
To execute the Parallel Quick Sort example issue the following commands:
host$ cd /<path-to-mpac>/mpac_1.2/apps/sort
host$ make clean
host$ make CC=mips64-octeon-linux-gnu-gcc AR=mips64-octeon-linux-gnu-ar
target$ ./mpac_sort_app –n <# of Threads> –s <Array Size> -u q
Other options are –m and –l for upper and lower limit of random data, and –a is to set the
processor affinity.
Fill in the following table to note your observations and measurements after running MPAC parallel quick sort on the target system:
Table 6. Time taken in microseconds to sort an array of Million Elements using parallel quick sort.
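Inside each worker, the parallel sort reduces to a serial quick sort of that worker's chunk of the array. With the standard C library this is qsort plus a comparator; the sketch below is generic, not MPAC's code, and the names cmp_int and sort_chunk are our own.

```c
/* Serial quick sort of one worker's chunk via the C library's qsort.
 * Function names here are illustrative, not MPAC's. */
#include <stdlib.h>

static int cmp_int(const void *a, const void *b)
{
    int x = *(const int *)a, y = *(const int *)b;
    return (x > y) - (x < y);     /* avoids overflow of x - y */
}

void sort_chunk(int *data, size_t n)
{
    qsort(data, n, sizeof *data, cmp_int);
}
```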
In the parallel bucket sort algorithm, the minimum and maximum elements are identified, and the range between the minimum and maximum is divided equally among the total number of threads, forming the buckets; there will be as many buckets as there are threads. Each element of the data array is then placed in its appropriate bucket array. The bucket arrays are passed to the threads, which sort the data using quick sort and return the bucket arrays to the main thread. In the main thread, the bucket arrays are concatenated to form the sorted data array. The bucket sort algorithm is illustrated in figure 4.
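The partitioning step described above can be sketched as a small helper that maps a value to its bucket. The name bucket_index is our own, and the width formula is one plausible choice that reproduces the figure's example (min 1, max 43, 4 threads, buckets of width 11).

```c
/* Map a value to its bucket for the parallel bucket sort described
 * above. Function name and width formula are illustrative. */
#include <stddef.h>

size_t bucket_index(int v, int min, int max, size_t nbuckets)
{
    /* choose a width so that nbuckets buckets cover [min, max] */
    size_t width = (size_t)(max - min) / nbuckets + 1;
    return (size_t)(v - min) / width;
}
```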
To Compile & Run:
To execute the Parallel Bucket Sort example issue the following commands:
host$ cd /<path-to-mpac>/mpac_1.2/apps/sort
host$ make clean
host$ make CC=mips64-octeon-linux-gnu-gcc AR=mips64-octeon-linux-gnu-ar
target$ ./mpac_sort_app –n <# of Threads> –s <Array Size> -u b
Other options are –m and –l for upper and lower limit of random data, and –a is to set the
processor affinity.
[Fig. 4: Parallel bucket sort example. Input array: 31 23 14 26 8 36 4 21 4 7 1 43 32 12 21 7. Min = 1, Max = 43, Difference = 42; with Threads = 4 the bucket size is 42/4 ≈ 11, giving bucket ranges 1-11, 12-22, 23-33, and 34-44. Each bucket is quick-sorted by its thread and the buckets are concatenated into the sorted array: 1 4 4 7 7 8 12 14 21 21 23 26 31 32 36 43.]
Fill in the following table to note your observations and measurements after running MPAC
parallel bucket sort on the target system:
Table 7. Time taken in microseconds to sort an array of Million Elements using parallel bucket sort.
3.2. Setup
All of the code used for this lab is provided with the description on your local system. You will need to build, execute, and analyze the code. The testbed setup for Labs 3-5 is shown in figure 5. By definition, the word sniffing means that you 'sniff', or pick up, something for further analysis. In computer networking terminology, sniffing a packet means capturing a packet arriving at or departing from a network interface. This capturing does not disturb the ongoing communication, and it can be done using one or more systems. Figure 5 shows a typical scenario in which the communication endpoints (server and client) have ongoing network flows and a sniffer captures the packets en route. Such a scenario can also be produced on a single machine using the loopback device.
Table 8. Network packets sniffing throughput in Mbps, for multiple queues case with lock free and optimized
enqueue/dequeue functions.
No of Threads
1 2 4 8 16 32
Throughput (Mbps)
4.2. Setup
All of the code and tools required for this lab are provided with the description on your host system. You will need to build, execute, and analyze the code. The MPAC library and benchmarking suite are available on-line at http://www.university.caviumnetworks.com/downloads/mpac_1.2.zip.
In this lab there will be one dispatcher thread, which sniffs the packets, and multiple worker threads to filter the sniffed traffic. To show gradual performance tuning, we implemented the packet filtering application in MPAC using three architectures: (1) a single queue shared between multiple threads with a lock-based design for the enqueue/dequeue functions, as shown in figure 6; (2) multiple queues with a lock-based design for the enqueue/dequeue functions, as shown in figure 7; and (3) multiple queues with lock-free and optimized enqueue/dequeue functions. We have one dispatcher thread in addition to a number of worker threads. The dispatcher gets packets from the NIC and fills the packet queues in round-robin fashion. Packet filtering is defined as packet header inspection at different OSI layers. In this lab, filtering is done using the source and destination IP addresses, the IP protocol field (which is fixed to TCP for this specific lab), and the source and destination ports. These parameters are provided by the user on the command line. The worker threads filter the packets based on the given 5-tuple comparison. We will measure the throughput of the worker threads to see whether or not they can keep up with the sniffer. You will also experiment with and observe the improvements from increasing the number of threads and using thread-to-core affinity.
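The 5-tuple comparison performed by the worker threads can be sketched as a plain struct compare. The struct and field names below are illustrative, not MPAC's actual types.

```c
/* Sketch of the 5-tuple match each worker performs: compare a packet's
 * source/destination IP, protocol, and source/destination port against
 * the user-supplied filter. Type and field names are illustrative. */
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    uint32_t src_ip, dst_ip;      /* IPv4 addresses */
    uint8_t  proto;               /* e.g. 6 for TCP */
    uint16_t src_port, dst_port;
} tuple5_t;

bool match_5tuple(const tuple5_t *pkt, const tuple5_t *filter)
{
    return pkt->src_ip   == filter->src_ip
        && pkt->dst_ip   == filter->dst_ip
        && pkt->proto    == filter->proto
        && pkt->src_port == filter->src_port
        && pkt->dst_port == filter->dst_port;
}
```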
In this case, the sender and receiver are on the same system, which means that the interface to be tested is "lo". If you want to test the Ethernet interface, the sender can be executed on one target and the receiver on the other. The options for running the NPF application are –n for the number of threads, -d for the duration of the test, -e for the execution mode (in this case 4), -f for the interface the sniffer should use to sniff packets (e.g. lo, eth0, or eth1), -p and –P for the port numbers of the sender and receiver respectively, -i and -I for the IP addresses of the sender and receiver respectively, and –a to set the processor affinity. To execute the parallel NPF example, copy the "mpac_sniffer_app" executable under the directory mpac_1.2/apps/sniffer/sniffer_1Q/ to the target system and issue the following commands:
target$ ./mpac_sniffer_app –n <# of Threads> –d <test duration>
-f <interface to sniff> -e 4
[Fig. 6: A single queue shared between the dispatcher and the N worker threads T0 … TN-1; arrows indicate the dispatcher's put direction into the queue.]
Fill in the following table to note your observations and measurements after running MPAC NPF
on the target system:
Table 9. Network packets filtering throughput in Mbps, for single shared queue case.
No of Threads
1 2 4 8 16 32
Throughput (Mbps)
[Fig. 7: One queue per worker thread T0 … TN-1; the dispatcher fills the queues in round-robin order.]
Fill in the following table after running MPAC NPF on the target system:
Table 10. Network packets filtering throughput in Mbps, for multiple queues case.
No of Threads
1 2 4 8 16 32
Throughput (Mbps)
We execute our example after further performance tuning, using multiple queues with lock-free and optimized enqueue/dequeue functions. Copy the "mpac_sniffer_app" executable under the directory mpac_1.2/apps/sniffer/sniffer_MQ_optimized/ to the target system and issue the following command.
target$ ./mpac_sniffer_app –n <# of Threads> –d <test duration>
-f <interface to sniff> -e 4
Fill in the following table after running MPAC NPF on the target system:
Table 11. Network packets filtering throughput in Mbps, for multiple queues case with lock free and optimized
enqueue/dequeue functions.
No of Threads
1 2 4 8 16 32
Throughput (Mbps)
You are encouraged to compare the results of Tables 9, 10, and 11 so that you can appreciate the performance of the different synchronization and design techniques.
5.2. Setup
All of the code and tools required for this lab are provided with the description on your local system. You will need to build, execute, and analyze the code. The MPAC library and benchmarking suite are available on-line at http://www.university.caviumnetworks.com/downloads/mpac_1.2.zip.
Fill in the following table to note your observations and measurements after running MPAC DPI
on the target system:
Table 12. Deep packet inspection throughput in Mbps, for single shared queue case.
No of Threads
1 2 4 8 16 32
Throughput (Mbps)
To observe the results after performance tuning, we use the multiple-queue architecture with a lock-based design for the enqueue/dequeue functions. To execute this example, copy the "mpac_sniffer_app" executable under the directory mpac_1.2/apps/sniffer/sniffer_MQ/ to the target system and issue the following command.
target$ ./mpac_sniffer_app –n <# of Threads> –d <test duration>
-f <interface to sniff> -e 5
Fill in the following table to note your observations and measurements after running MPAC DPI
on the target system:
Table 13. Deep packet inspection throughput in Mbps, for multiple queues case.
No of Threads
1 2 4 8 16 32
Throughput (Mbps)
We execute our example after further performance tuning, using multiple queues with lock-free and optimized enqueue/dequeue functions. Copy the "mpac_sniffer_app" executable under the directory mpac_1.2/apps/sniffer/sniffer_MQ_optimized/ to the target system and issue the following command.
target$ ./mpac_sniffer_app –n <# of Threads> –d <test duration>
-f <interface to sniff> -e 5
Fill in the following table to note your observations and measurements after running MPAC DPI
on the target system:
Table 14. Deep packet inspection throughput in Mbps, for multiple queues case with lock free and optimized
enqueue/dequeue functions.
No of Threads
1 2 4 8 16 32
Throughput (Mbps)
You are encouraged to compare the results in Tables 12, 13, and 14 so that you can appreciate the high performance of the lock-free design. You are also encouraged to study the code of the three different implementations used in these networking labs.