A TECHNICAL PAPER PRESENTATION

ON
MULTIPROCESSOR REAL-TIME ANALYSIS FOR
SCAN-BASED EMULATION: A METHODOLOGY
OF DSP APPLICATION
FOR
QUEST-2K8

AT

JAWAHARLAL NEHRU TECHNOLOGICAL UNIVERSITY

BY

GUDLAVALLERU
ENGINEERING COLLEGE
GUDLAVALLERU

AUTHORS:

ABDUL HAFEEZ K.N.MALLIKHARJUN RAO
III/IV B.TECH III/IV B.TECH
05481A0401 05481A0454

E MAIL: lathief_hafeez@yahoo.co.in E MAIL: mallik454@gmail.com

Abstract

Scan-based emulation is a pervasive method that is deployed to debug
and develop DSP applications. Scan-based emulation relies on scanning data out from
the target to the host computer using proprietary emulation hardware that is designed
into the DSP core. Emulation support in our single processor and multiprocessor
environments comprises software and hardware on both the target and the host, and
therefore provides a transport mechanism upon which to base real-time analysis.

In this paper we present an end-to-end methodology for developing a multiprocessor
real-time analysis capability built on JTAG scan-based mechanisms. Specifically, we
discuss the issues related to extending the capability from single to multiprocessor
domains. We examine the issues related to supporting real-time analysis in all of the
software and hardware components. Finally, we enumerate the challenges in developing
a methodology for deploying both homogeneous and heterogeneous multiprocessor
combinations.

Introduction

The ability to analyze the proper execution of real-time applications in an
embedded system is critical to their development and deployment. This applies to
real-time applications ranging from mission critical to multimedia.

In embedded systems, Real-Time Analysis (RTA) can involve a dedicated hardware and
software capability with an end-to-end methodology that supports transferring data
between the host and the target in a lossless and reliable manner. Specifically, the
Real-Time Analysis encompassed by this methodology consists of capturing data from a
target application using dedicated hardware, transferring it through various layers of
software dedicated to creating a real-time path, and making it available to a host
application for the purpose of analysis. Analysis includes determining whether
applications meet both timing and logical correctness requirements.

In this paper we present an end-to-end methodology for developing a multiprocessor
real-time analysis capability built on JTAG scan-based mechanisms. Scan-based
emulation is a pervasive method for debugging, developing and analyzing real-time
applications running on DSPs. The JTAG boundary scan specification permits multiple
devices to be connected in a serial daisy-chained arrangement. The paper covers
scan-based emulation in detail, which sets the stage for RTA. It presents the RTA
hardware and software architecture on which our methodology relies, and it includes an
application that illustrates the necessity of RTA in a multiprocessor environment.

Scan-Based Emulation

With a traditional emulator, the CPU to be emulated is usually removed from its socket
and replaced with an emulator pod. The emulator pod typically has a replacement CPU,
plus various amounts of random logic and memory to monitor what is happening on
the CPU pins. With modern CPUs such as DSPs, the traditional approach has several
problems. The first problem is the speed of newer DSP chips. Bus cycle times can be
25 nanoseconds or shorter, and all instructions typically execute in a single cycle. This
makes it difficult for a traditional emulator to allow emulation at full speed. The
number of pins to monitor can be staggering, with chips having multiple 32-bit address
and data buses, making a traditional emulator expensive. The second, and more serious,
problem is that DSPs often have on-chip caches, pipelines, memory and peripherals.
Sometimes a whole algorithm can execute without any activity on the CPU pins.

The solution to these problems is scan-based emulation. With scan-based
emulation, the CPU is never removed from the socket; in fact it can be soldered directly
onto the board. Instead, the CPU has a serial scan interface, allowing the emulator to
scan the internals of the device through a standard connector.
The scan-based approach to emulation has many advantages:

• Emulation at full device speed - Since there is no logic needed to monitor
what happens on the CPU bus, the emulator can allow the device to execute
programs at full speed.
• Non-intrusive emulation - Since no logic is attached to any CPU pin, the
CPU bus is not affected at all by the emulation process. The emulator will not
affect the operation of the bus, as is so often the case with traditional
emulators.
• In-circuit emulation - The CPU can be soldered to the board while
emulating. This makes denser packaging possible, and also makes the
emulator a manufacturing test tool.
• Full access to internal memory, caches, pipelines and registers - The complete
state of the processor is visible to the outside through a scan interface.
• Complete access to the system from the CPU - Any peripheral or memory
that the CPU can access in the system can also be accessed through the scan
interface. The emulator can look at the system "through the eyes of the CPU".
This makes it possible to debug and diagnose a system where nothing is
working except the CPU itself.

The JTAG Interface

The Joint Test Action Group (JTAG) defines an interface called the JTAG interface for
testing individual devices on printed circuit boards, without the need to remove the
devices from the board. This is accomplished by a method called boundary scan, whereby
the state of each pin of each device (with some special logic on the device) is serially
scanned out from the device.
Multiple devices can be daisy chained, and an entire PC board can therefore be scanned
in a single scan chain. The same method can be used to scan out not only the state
of a device's pins but also any internal information from the device, such as
register values and memory locations; as a consequence, scan-based emulation was born. The
JTAG specification does not include the pin out for the JTAG connector. The extension to
JTAG defines a 14 pin, 2 row, 0.1" spacing JTAG connector header, with pin out and
physical dimensions common to all DSPs that support JTAG involved in this
methodology. During JTAG emulation, the emulator supplies the clock that scans the
device. This means that the target clock speed is completely independent of the emulation
clock, and the emulator can support targets running at any clock speed.

The Boundary Scan Mechanism
Device Architecture

The JTAG device architecture is based on the IEEE 1149.1 architecture. In this
specification, there are four mandatory dedicated pins and one optional pin, collectively
known as the Test Access Port (TAP). They are:

• Test Data In (TDI)
• Test Data Out (TDO)
• Test Mode Select (TMS)
• Test Clock (TCK)
• Test Reset (TRST, optional)

A boundary scan cell is attached to each signal pin of each device that is being
scanned, and the cells of a device together form its boundary scan register. The
architecture further specifies a finite-state machine TAP controller with
inputs TMS and TCK. There is an Instruction Register (IR) holding the current
instruction, a bypass register, and an optional 32-bit identification register for permanent
identification.
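
The TAP controller described above can be sketched as a transition table. The table follows the 16-state diagram of IEEE 1149.1; the helper names `step` and `walk` are illustrative and are not taken from any emulator API. The sketch models only the state sequencing driven by TMS on each TCK edge, not the registers themselves.

```python
# TAP controller state transitions as defined by the IEEE 1149.1 state diagram.
# Each entry maps a state to (next state when TMS=0, next state when TMS=1).
TAP_FSM = {
    "Test-Logic-Reset": ("Run-Test/Idle", "Test-Logic-Reset"),
    "Run-Test/Idle":    ("Run-Test/Idle", "Select-DR-Scan"),
    "Select-DR-Scan":   ("Capture-DR", "Select-IR-Scan"),
    "Capture-DR":       ("Shift-DR", "Exit1-DR"),
    "Shift-DR":         ("Shift-DR", "Exit1-DR"),
    "Exit1-DR":         ("Pause-DR", "Update-DR"),
    "Pause-DR":         ("Pause-DR", "Exit2-DR"),
    "Exit2-DR":         ("Shift-DR", "Update-DR"),
    "Update-DR":        ("Run-Test/Idle", "Select-DR-Scan"),
    "Select-IR-Scan":   ("Capture-IR", "Test-Logic-Reset"),
    "Capture-IR":       ("Shift-IR", "Exit1-IR"),
    "Shift-IR":         ("Shift-IR", "Exit1-IR"),
    "Exit1-IR":         ("Pause-IR", "Update-IR"),
    "Pause-IR":         ("Pause-IR", "Exit2-IR"),
    "Exit2-IR":         ("Shift-IR", "Update-IR"),
    "Update-IR":        ("Run-Test/Idle", "Select-DR-Scan"),
}


def step(state, tms):
    """Advance the controller by one TCK edge with the given TMS value (0 or 1)."""
    return TAP_FSM[state][tms]


def walk(state, tms_bits):
    """Apply a sequence of TMS values, one per TCK edge, and return the final state."""
    for tms in tms_bits:
        state = step(state, tms)
    return state
```

For example, `walk("Test-Logic-Reset", [0, 1, 0, 0])` ends in Shift-DR, the state in which data register bits move from TDI towards TDO. Holding TMS high for five TCK edges returns the controller to Test-Logic-Reset from any state.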

Principles of Boundary Scan

Boundary scan cells are configured into a parallel-in, parallel-out shift register. Parallel
load operations cause signals from the core logic to be loaded into the output cells. Parallel
unload operations cause the signals to be loaded from the input cells to the core logic. Data
is shifted in serial mode by daisy chaining devices. The figure below shows the TDI of
each device connected to the TDO of the next device in the scan chain. It is possible to
avoid scanning any device by placing it in bypass mode. Typically, the system architect
is responsible for determining the type of arrangement of devices (homogeneous or
heterogeneous), their order in the scan chain and whether they will be placed in
bypass.
Fig 1: Boundary Scan
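
The daisy-chained shifting and bypass behaviour described above can be modelled with a minimal sketch. The `Device` class, the register lengths and the function names are assumptions made for illustration; real devices have register lengths fixed by their design. The one property taken from the standard is that a device in bypass contributes a single-bit register to the chain.

```python
from collections import deque


class Device:
    """One device in the scan chain (hypothetical model, not a real driver)."""

    def __init__(self, name, dr_bits, bypass=False):
        self.name = name
        # In bypass mode the device presents a 1-bit register between TDI and TDO.
        length = 1 if bypass else dr_bits
        self.reg = deque([0] * length)

    def shift(self, tdi):
        """Clock one bit in at TDI and return the bit that leaves at TDO."""
        self.reg.appendleft(tdi)
        return self.reg.pop()


def chain_length(devices):
    """Total number of bits between the chain's TDI and TDO."""
    return sum(len(d.reg) for d in devices)


def shift_chain(devices, bits_in):
    """Shift bits through the whole chain, one TCK per input bit."""
    out = []
    for bit in bits_in:
        for dev in devices:
            # Each device receives the bit that just left the previous device.
            bit = dev.shift(bit)
        out.append(bit)
    return out
```

With devices of 8, 16 and 8 bits where the middle device is in bypass, `chain_length` is 17 instead of 32, so a bit entered at the head of the chain reaches the final TDO after 17 clocks.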

The following application in the domain of high energy physics illustrates the necessity
for RTA in a heterogeneous multiprocessor environment. The Fermilab Tevatron
Collider generates 15 million particle collisions per second. These particle collisions
result in the creation of subatomic particles that travel through a spectrometer. The data
output from the spectrometer is on the order of terabytes per second and must be analyzed
in real time. The analysis engine comprises a massively parallel arrangement of
heterogeneous DSPs and GPPs (general purpose processors). Analysis consists of
applying algorithms that reconstruct and filter the collision data. The result is a select set
of interesting collisions from which physicists can study some of the remaining mysteries
of matter and antimatter in the universe.

Analysis of real-time embedded applications is necessary at several points during the
software life cycle: during development, as a means to debug; towards the end of
development, for tuning performance; and after the application is deployed, for
failure analysis. Logic analyzers have been used for many years to clamp onto the
data busses of the target and monitor the data flow of the application in order to
analyze application behavior. Aside from the fact that logic analyzers are expensive
($15K to $60K for a DSP), the increase in system-level integration over the years has
resulted in fewer exposed data paths for the logic analyzer to monitor. Most modern
microprocessors are architected with specialized hardware counters that can be
programmed for the purpose of tracing applications. Traditionally these registers
have been used to guide the design of microarchitecture features such as caches and
TLBs. Although these registers can trace the behavior of applications
at a very fine level of granularity, they cannot easily be used as an RTA mechanism.
An ancillary yet significant issue is that analysis requires the user to have
advanced knowledge of the target microarchitecture in order to interpret the data.
Finally, tracing supports data transfer only from target to host and not from host to
target.

Fig 2: Debugger with real-time data exchange

An alternative real-time analysis solution based upon JTAG emulation is presented here.
This hardware and software architecture for a single processor is shown in Fig 3. The JTAG
interface that connects the on-chip emulation logic to the host-based emulator provides
the physical mechanism on which to transport data from the target to the host and vice
versa. The target application is the subject matter to be analyzed; it is the source of data
to be sent to the host and the sink for data received from the host. Therefore, an RTA
target software library exists to bridge the gap from the target application to the on-chip
emulation hardware. Good software engineering practices dictate that an API exists for
this software library. On the host, the data is to be analyzed by a host application. This
host application may also input data to the target application. We must therefore bridge
the gap from the emulator to the host application.

An emulation software driver controls the scanning of data to/from the target via the
emulator. It is the first piece of host software to receive data from the target and the last
piece of host software to handle data heading to the target. An RTA host software library
funnels the data between the emulation software driver and the host application. Again,
an API exists for the RTA host software library. It should be noted that multiple host
applications may be run concurrently.

Data flow in this architecture is bi-directional: data flows from the target application to
the host application for analysis, and data may flow from the host application to the target
application for supplying input parameters. Such input parameters may be used for fine
tuning performance, for supplying test data, etc.
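
The layering described above can be sketched as follows. All class and method names (`EmulationLink`, `RtaTargetLib`, `EmulationDriver`, `RtaHostLib`, `write`, `read`, `send`, `receive`) are hypothetical and chosen for illustration; the paper does not name its APIs. Two in-memory queues stand in for the on-chip emulation logic and the JTAG scan path.

```python
from collections import deque


class EmulationLink:
    """Stand-in for the on-chip emulation logic plus the JTAG scan path."""

    def __init__(self):
        self.to_host = deque()
        self.to_target = deque()


class RtaTargetLib:
    """Target-side library: the API the target application calls."""

    def __init__(self, link):
        self.link = link

    def write(self, data):
        self.link.to_host.append(data)

    def read(self):
        return self.link.to_target.popleft() if self.link.to_target else None


class EmulationDriver:
    """Host-side driver: first host software to see target data, last to handle host data."""

    def __init__(self, link):
        self.link = link

    def scan_in(self):
        return self.link.to_host.popleft() if self.link.to_host else None

    def scan_out(self, data):
        self.link.to_target.append(data)


class RtaHostLib:
    """Host-side library: funnels data between the driver and host applications."""

    def __init__(self, driver):
        self.driver = driver

    def receive(self):
        return self.driver.scan_in()

    def send(self, data):
        self.driver.scan_out(data)
```

In this sketch a target application would call `write` to publish samples and `read` to pick up tuning parameters, while a host application uses `receive` and `send` on the other side. Neither application touches the link directly, which is the point of placing an API on each library.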

[Figure labels: Host; Target; Emu HW; RTA Target SW Lib; Target API; Real-Time Target App]

Fig 3: Single processor RTA based upon JTAG emulation

For target-to-host data transfer, there are two distinct parts of the data flow path. The
first part extends from the target application to the RTA host software library. This is
the real-time transportation leg. Since our target application has real-time constraints,
data must be off-loaded from the target to the host at a certain rate. The RTA host
software library is the first piece of software on the host that realizes it has received
real-time data for analysis. (The emulation software is agnostic to what type of data it is
scanning.) The RTA host software can record the data to disk and be done with it, or
buffer it internally.

The second part of the target-to-host data flow path extends from the RTA host software
library to the host application. The data is analyzed by the host application. If the data
has been recorded in persistent storage, then the data can be played back at any time. If
the data is not in persistent storage, then it must be analyzed by the host application as
it is produced; that is, the data must be drained from the RTA host software buffers as
they fill.

Fig 4: Data flow in single processor RTA based upon JTAG emulation
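
The two options for the host library, recording to persistent storage or buffering internally, can be sketched as below. The `HostBuffer` class, the length-prefixed record format and the overflow counter are assumptions made for this illustration and are not described in the paper. The counter is there only to make visible what happens when the host application fails to drain the buffers as they fill.

```python
import io
from collections import deque


class HostBuffer:
    """Hypothetical host-side store for blocks arriving from the target."""

    def __init__(self, capacity, log=None):
        self.capacity = capacity
        self.pending = deque()
        self.log = log      # binary file-like object, or None for in-memory buffering
        self.dropped = 0    # blocks lost because nobody drained the buffer in time

    def on_target_data(self, block):
        if self.log is not None:
            # Recorded data is written once and can be replayed later.
            self.log.write(len(block).to_bytes(4, "big") + block)
        elif len(self.pending) < self.capacity:
            self.pending.append(block)
        else:
            self.dropped += 1

    def drain(self):
        """Hand buffered blocks to the host application, oldest first."""
        while self.pending:
            yield self.pending.popleft()


def playback(log):
    """Replay every recorded block from the start of the log."""
    log.seek(0)
    while True:
        header = log.read(4)
        if len(header) < 4:
            break
        yield log.read(int.from_bytes(header, "big"))
```

With a log attached, `playback` can be called any number of times. Without one, the host application has to call `drain` at least as fast as the target produces data.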

The above RTA architecture for a single embedded processor is easily extended to a
multiprocessor environment, as shown in Fig 5. An RTA target software library must exist
on each embedded target to connect the target application to the emulation logic on that
target. For multi-core architectures, an RTA target software library will exist for each
core. The data from each processor is scanned up to the host via the JTAG interface ring
as described in the Scan-Based Emulation section. On the host, there exists an emulation
software driver corresponding to each target in the system. Each emulation software
driver receives the data from its corresponding target and delivers the data to the one
RTA host software library.

Fig 5: Multiprocessor RTA via Scan-Based Emulation

This architecture for multiprocessor real-time analysis via scan-based emulation provides
the basis for the methodology.

Methodology

In this section we present an end-to-end methodology that is predicated on support in
hardware and software across several families of DSPs. There is special emulation
hardware architected into the DSP core, and there are emulation drivers as well as RTA
target and host side software that permit the user to perform RTA. Fundamentally, this
methodology involves using a development environment to develop and download a target
application to a DSP. The application running on the DSP interfaces with the RTA software
to send and receive data. The data is scanned out using JTAG boundary scan. The data is
received on the host by the emulation driver that interfaces to the host-side RTA
software. The data is then presented to the host client application for analysis.
The figures of merit used to determine the success of this methodology are performance,
scalability, ease of use and reliability. We discuss these criteria within the scope of both
hardware capabilities and the RTA software architecture in a multiprocessor
environment.

Performance

An important consideration in providing a methodology for multiprocessor RTA is
performance.

Dedicated Emulation Hardware

Data is transferred between target and host using dedicated emulation hardware to
improve performance. In a heterogeneous multiprocessor arrangement of DSPs, one
complication that arises is that of varying scan lengths. Each DSP design has its
own emulation hardware, so scan lengths vary both within a family of
DSPs and between Instruction Set Architectures (ISAs). Longer scan chains require
more time to disassemble the scanned data, which lowers throughput. In a multiprocessor
JTAG scan, the JTAG boundary scan bypass instruction obviates the need to
scan the full path of any device set to bypass mode. This reduces the time needed to
disassemble data being transferred between host and target.
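
The effect of scan length on throughput can be illustrated with simple arithmetic. The model below is a deliberate simplification and an assumption of this sketch: it counts only the bits clocked per scan, ignores TAP state transitions and host-side processing, and uses made-up scan lengths. The function names are illustrative.

```python
def scan_bits(device_lengths, selected):
    """Bits clocked per scan when only `selected` is active and the rest are in bypass.

    A bypassed device contributes a single bit to the chain.
    """
    return sum(n if i == selected else 1 for i, n in enumerate(device_lengths))


def payload_rate(tck_hz, payload_bits, device_lengths, selected, use_bypass=True):
    """Approximate payload bits per second delivered from the selected device."""
    if use_bypass:
        total = scan_bits(device_lengths, selected)
    else:
        total = sum(device_lengths)
    return tck_hz * payload_bits / total
```

For three devices with hypothetical scan lengths of 64, 128 and 256 bits, reading device 0 costs 66 bits per scan with bypass against 448 bits without it, so the same TCK frequency delivers several times more payload.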

Fig 6: Virtual Data Paths
Data Identification

An RTA solution for a multiprocessor environment must be able to identify the processor from
which data originates. This introduces the need to mark the data with a processor identifier.
The decision then becomes where in the system to do this. If we examine the host, we see
that there is a one-to-one correspondence between the emulation software drivers and the
processors in the system. Since there is an emulation software driver for each target in the
system, these drivers can stamp the data with a processor identifier.
Note that from a performance perspective, it is better to
mark the data on the host side than on the target side. If a
unique processor identifier were sent down to the target and
the data were tagged there, then more data would be sent
from the target to the host and would consume precious
bandwidth. Virtual data paths that extend from the target
application to the host application are used to segregate
data.
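
Host-side stamping and virtual data paths can be sketched as follows. The names `TaggedBlock`, `StampingDriver` and `HostLibrary`, and the idea of keying a path by processor identifier plus a channel name, are assumptions made for illustration. The sketch shows only that the identifier is attached after the data has crossed the scan link, so it costs no target-to-host bandwidth.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaggedBlock:
    processor_id: int
    channel: str
    payload: bytes


class StampingDriver:
    """One driver per target; it knows which processor it serves."""

    def __init__(self, processor_id):
        self.processor_id = processor_id

    def deliver(self, channel, payload):
        # The tag is added on the host and never travels over the JTAG link.
        return TaggedBlock(self.processor_id, channel, payload)


class HostLibrary:
    """Single host library that keeps one virtual path per (processor, channel)."""

    def __init__(self):
        self.paths = {}

    def accept(self, block):
        key = (block.processor_id, block.channel)
        self.paths.setdefault(key, []).append(block.payload)

    def read(self, processor_id, channel):
        return self.paths.get((processor_id, channel), [])
```

Because every block is keyed by its origin, data from two processors running the same target application on the same channel name stays separate on the host.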

Scalability

A key aspect of this methodology is scalability. This issue
is addressed in both hardware and software.

Hardware Scalability

The JTAG specification permits the daisy chaining of
hardware. The limit on the number of devices that
can be daisy chained is set by signal strength limitations
rather than by the JTAG specification.

Software Scalability

In software, data from each target is tagged with a unique
identifier so that data being transferred between host and
target can be attributed to the processor it belongs to.
Further, the RTA architecture is software scalable; writing
the target application does not depend on the number of
processors, and the application does not need to be altered if
processors are added to or removed from the system
configuration. There is no requirement that the target
application have any knowledge of the type or number of
processors in a scan chain at the time of development.

Data Selection

The host application should be able to select from which
processor to send or receive data. This is accomplished
by incorporating this functionality into the host API.
This proves to be very favorable with respect to scalability. For
example, let’s assume that we have a target application
that performs a series of transformations on a given
vector, and then transfers the resulting vector to the host
for analysis. Let’s further assume that there are many
vectors that must be transformed and that we choose to
deploy the same target application on as many processors
as there are vectors to achieve maximum computing
parallelization. We can design a host application that
sends a different vector to each processor and then
collects the resulting vectors from each processor,
respectively, for analysis.

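
The vector example above can be sketched as follows. The `HostApi` class, its `send` and `receive` methods and the placeholder transformation are hypothetical; the paper does not specify the host API. The targets are simulated in-process so that the sketch is runnable.

```python
def target_app(vector):
    """Same code on every processor; the transformation is a placeholder."""
    return [2 * x + 1 for x in vector]


class HostApi:
    """Hypothetical host API that addresses targets by processor index."""

    def __init__(self, processor_count):
        self.inbox = {p: None for p in range(processor_count)}
        self.outbox = {}

    def send(self, processor, vector):
        self.inbox[processor] = vector

    def run_targets(self):
        # Stands in for the targets executing in parallel on real hardware.
        for processor, vector in self.inbox.items():
            if vector is not None:
                self.outbox[processor] = target_app(vector)

    def receive(self, processor):
        return self.outbox[processor]


def analyze(vectors):
    """Send one vector to each processor and collect the results in order."""
    host = HostApi(len(vectors))
    for processor, vector in enumerate(vectors):
        host.send(processor, vector)
    host.run_targets()
    return [host.receive(p) for p in range(len(vectors))]
```

Note that `target_app` takes no processor count or identifier. Adding a vector only changes how many times the host calls `send` and `receive`, which is the scalability property the text describes.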
Ease of Use

Ease of use is an important figure of merit but one that is often difficult to sustain. A
software debugging environment is provided that permits the user to easily configure the
hardware in the system.

Fig 7: Processor Selection

Hardware Support

A trend in DSP emulation hardware is to support device
registers that are mapped at fixed addresses. This permits the
source code porting of applications. Further, a trend in more
contemporary DSP emulation logic is to replicate the logic
on all DSPs. This further simplifies the deployment of RTA
tools.

Software Support

At setup, the user selects the type of target and loads the
system with an emulation driver for that target. The user
also specifies the number of targets of each type and their
position in the scan chain. Without this capability users
would have to add code to their host applications to
perform the same function, resulting in messy and
unnecessarily complex code. Host side support is provided in
the form of object-oriented interfaces based on the Component
Object Model (COM), which is a de facto industry standard.
This permits the host application developer to write client
programs that are not tightly coupled to a specific DSP.

Hardware Reliability

The JTAG specification has long been established as a reliable
standard. It has been adopted and extended. An extensive
set of target libraries has been developed for various
flavors of DSPs based on boundary scan. Reliability is
achieved through reuse of the same register set in different
versions of emulation hardware, both across ISAs and within an ISA.

Challenges

There are several challenges in supporting a uniform
multiprocessor RTA capability on various families of DSPs.
Each family of DSP has its particular variant of emulation
hardware. This has an impact on the RTA protocol that is
used to transfer data between host and target. For instance,
the emulation capabilities on some DSPs use
interrupts to signal the flow of data between host and target.
In the absence of emulation interrupt support, the application
must poll the emulation hardware for the presence of data.
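
The difference between the two signalling styles can be sketched on the target side as follows. The `EmulationHw` class and its methods are a hypothetical model, not a real DSP register interface. The sketch shows how a target library might hide the difference behind one receiver, which is one possible way to keep the RTA protocol uniform across families.

```python
class EmulationHw:
    """Hypothetical model of on-chip emulation hardware receiving words from the host."""

    def __init__(self, has_interrupt):
        self.has_interrupt = has_interrupt
        self._rx = []
        self._handler = None

    def attach(self, handler):
        if not self.has_interrupt:
            raise RuntimeError("this DSP has no emulation interrupt")
        self._handler = handler

    def host_writes(self, word):
        self._rx.append(word)
        if self._handler is not None:
            # Interrupt-capable hardware pushes the word to the handler at once.
            self._handler(self._rx.pop(0))

    def data_ready(self):
        return bool(self._rx)

    def read(self):
        return self._rx.pop(0)


def make_receiver(hw):
    """Return (received list, poll function) that works on either kind of hardware."""
    received = []
    if hw.has_interrupt:
        hw.attach(received.append)

        def poll():
            # Nothing to do: data arrives through the interrupt handler.
            return None
    else:
        def poll():
            # Without interrupts the application must ask the hardware for data.
            while hw.data_ready():
                received.append(hw.read())
    return received, poll
```

On interrupt-capable hardware the data appears in `received` as soon as the host writes it. On polling-only hardware it appears only after the application calls `poll`, so the polling interval bounds the latency.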

Conclusion

The RTA methodology presented in this paper is extensively
used and widely accepted. The problem illustrated by
the high energy physics application is not limited to that domain.
Experience to date has shown that other domains such as
wireless and mobile computing require the processing of
RTA data where both DSPs and microcontrollers are on the
same scan chain. The development of this RTA capability
has been predicated on the JTAG specification and the
adherence to this standard in the emulation hardware that
has been designed into the DSP core. The software that has
been developed is able to differentiate between the various
DSPs. The virtual paths in the RTA architecture guarantee
data integrity.

References:

[1] JTAG IEEE 1149.1 Specification http://www.ieee.org

[2] Microsoft Component Object Model (COM)
http://www.microsoft.com

[3] Gottschalk, E.E., et al., "The BTeV DAQ and Trigger
System – Some Throughput, Usability, and Fault Tolerance
Aspects," Proceedings of the Computing in High Energy and
Nuclear Physics Conference (CHEP 2001), p. 628, Beijing,