







05481A0454 05481A0401

E MAIL: E MAIL:lathief_hafeez
Abstract

Scan-based emulation is a pervasive method that is deployed to debug and develop DSP applications. Scan-based emulation relies on scanning data out from the target to the host computer using proprietary emulation hardware that is designed into the DSP core. Emulation support in our single processor and multiprocessor environments comprises software and hardware on both the target and the host, and therefore provides a transport mechanism upon which to base real-time analysis.

In this paper we present an end-to-end methodology for developing a multiprocessor real-time analysis capability formulated on JTAG scan-based mechanisms. Specifically, we discuss the issues related to extending the capability from single to multiprocessor domains. We examine the issues related to supporting real-time analysis in all of the software and hardware components. Finally, we enumerate the challenges in developing a methodology for deploying both homogeneous and heterogeneous multiprocessor combinations.

Introduction

The ability to analyze the proper execution of real-time applications in an embedded system is critical to their development and deployment. This applies to real-time applications ranging from mission critical to multimedia.

In embedded systems the ability to perform Real-Time Analysis (RTA) can involve a dedicated hardware and software capability with an end-to-end methodology that supports the transferring of data between the host and the target in a lossless and reliable manner. Specifically, the Real-Time Analysis encompassed by this methodology consists of capturing data from a target application using dedicated hardware, transferring it through various layers of software dedicated to creating a real-time path, and making it available to a host application for the purpose of analysis. Analysis includes determining whether applications meet both timing and logical correctness requirements.

In this paper we present an end-to-end methodology for developing a multiprocessor real-time analysis capability formulated on JTAG scan-based mechanisms. Scan-based emulation is a pervasive method that is deployed to debug, develop and analyze real-time applications running on DSPs. The JTAG boundary scan specification permits the connecting of multiple devices in a serial daisy-chained arrangement. This paper covers scan-based emulation in detail and sets the stage for RTA. The RTA hardware and software architecture that our methodology relies upon is presented. The paper also includes an application that illustrates the necessity of RTA in a multiprocessor environment.

Scan-Based Emulation

With a traditional emulator, the CPU to be emulated is usually removed from its socket and replaced with an emulator pod. The emulator pod typically has a replacement CPU, plus various amounts of random logic and memory to monitor what is happening on the CPU pins. With modern CPUs such as DSPs, the traditional approach has several problems. The first problem is the speed of newer DSP chips. Bus cycle times can be 25 nanoseconds or shorter, and all instructions typically execute in a single cycle. This makes it difficult for a traditional emulator to allow emulation at full speed. The number of pins to monitor can be staggering, with chips having multiple 32-bit address and data buses, making a traditional emulator expensive. The second, and more serious, problem is that DSPs often have on-chip caches, pipelines, memory and peripherals. Sometimes a whole algorithm can execute without any activity on the CPU pins.

The solution to these problems is scan-based emulation. With scan-based emulation, the CPU is never removed from the socket; in fact it can be soldered directly onto the board. Instead, the CPU has a serial scan interface, allowing the emulator to scan the internals of the device through a standard connector.

The scan-based approach to emulation has many advantages:

• Emulation at full device speed - Since there is no logic needed to monitor what happens on the CPU bus, the emulator can allow the device to execute programs at full speed.
• Non-intrusive emulation - Since no logic is attached to any CPU pin, the CPU bus is not affected at all by the emulation process. The emulator will not affect the operation of the bus, as is so often the case with traditional emulators.
• In-circuit emulation - The CPU can be soldered to the board while emulating. This makes denser packaging possible, and also makes the emulator a manufacturing test tool.
• Full access to internal memory, caches, pipelines and registers - The complete state of the processor is visible to the outside through a scan interface.
• Complete access to the system from the CPU - Any peripheral or memory that the CPU can access in the system can also be accessed through the scan interface. The emulator can look at the system "through the eyes of the CPU". This makes it possible to debug and diagnose a system where nothing is working except the CPU itself.

The JTAG Interface

The Joint Test Action Group (JTAG) defines an interface called the JTAG interface for testing individual devices on printed circuit boards, without the need to remove the devices from the board. This is accomplished by a method called boundary scan, whereby the state of each pin of each device (with some special logic on the device) is serially scanned out from the device.

Multiple devices can be daisy chained, and an entire PC board can therefore be scanned in a single scan chain. The same method can be used to scan out not only the state of a device's pins but also any internal information from the device, such as register values and memory locations; as a consequence, scan-based emulation was born. The JTAG specification does not include the pinout for the JTAG connector. The extension to JTAG defines a 14-pin, 2-row, 0.1" spacing JTAG connector header, with pinout and physical dimensions common to all DSPs that support JTAG involved in this methodology. During JTAG emulation, the emulator supplies the clock that scans the device. This means that the target clock speed is completely independent of the emulation clock, and the emulator can support targets running at any clock speed.

The boundary scan mechanism

Device architecture

The JTAG device architecture is based on the IEEE 1149.1 architecture. In this specification, there are four dedicated pins, plus an optional fifth, collectively known as the Test Access Port (TAP). They are:

• Test Data In (TDI)
• Test Data Out (TDO)
• Test Mode Select (TMS)
• Test Clock (TCK)
• Test Reset (TRST, optional)

A boundary scan cell is attached to each pin of each device that is being scanned, and together these cells form the device's boundary scan register. The architecture further specifies a finite state machine, the TAP controller, with inputs TMS and TCK. There is an Instruction Register (IR) holding the current instruction, a bypass register, and an optional 32-bit identification register for permanent identification.

Principles of Boundary Scan

Boundary scan cells are configured into a parallel-in, parallel-out shift register. Parallel load operations cause signals from the core logic to be loaded into the output cells. Parallel unload operations cause the signals to be loaded from the input cells to the core logic. Data is shifted in serial mode by daisy chaining devices.

In the scan chain, the TDO of each device is connected to the TDI of the next device. It is possible to avoid scanning any device by placing it in bypass mode.

Typically, the system architect is responsible for determining the type of arrangement of devices (homogeneous or heterogeneous), their order in the scan chain, and whether they will be placed in bypass.
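
The effect of bypass mode on a daisy chain can be illustrated with a minimal Python sketch. It models each device as contributing either its selected data register or, when bypassed, the single-bit bypass register defined by IEEE 1149.1, and it models the chain as one long shift register. The device names and register lengths are hypothetical and chosen only for illustration; they do not describe any particular DSP.

```python
from dataclasses import dataclass


@dataclass
class Device:
    name: str              # hypothetical device label
    dr_bits: int           # length of the selected data register (illustrative)
    bypassed: bool = False

    def scan_bits(self) -> int:
        # In bypass mode a device contributes only its 1-bit bypass register.
        return 1 if self.bypassed else self.dr_bits


def chain_length(chain) -> int:
    """Total number of bits shifted to traverse the whole scan chain."""
    return sum(dev.scan_bits() for dev in chain)


def shift(register, tdi_bits):
    """Shift bits through a chain modeled as one long shift register.

    register[0] is the cell nearest TDI; register[-1] is nearest TDO.
    Returns the new register contents and the bits that appeared on TDO.
    """
    register = list(register)
    tdo_bits = []
    for bit in tdi_bits:
        tdo_bits.append(register.pop())  # last cell drives TDO
        register.insert(0, bit)          # new bit enters from TDI
    return register, tdo_bits


chain = [Device("dsp_a", 32), Device("dsp_b", 48), Device("mcu", 16)]
full = chain_length(chain)       # every device scanned
chain[1].bypassed = True
reduced = chain_length(chain)    # dsp_b now costs a single bit
print(full, reduced)
```

With these illustrative lengths, placing the middle device in bypass shortens the scan from 96 bits to 49 bits, which is why the choice of which devices to bypass matters to the system architect.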
Real-Time Analysis (RTA)

The following application in the domain of high energy physics illustrates the necessity for RTA in a heterogeneous multiprocessor environment. The Fermilab Tevatron Collider generates 15 million particle collisions per second. These particle collisions result in the creation of subatomic particles that travel through a spectrometer. The data output from the spectrometer is on the order of terabytes per second and must be analyzed in real time. The analysis engine comprises a massively parallel arrangement of heterogeneous DSPs and GPPs (general purpose processors). Analysis consists of applying algorithms that reconstruct and filter the collision data. The result is a select set of interesting collisions from which physicists can study some of the remaining mysteries of matter and antimatter in the universe.

Analysis of real-time embedded applications is necessary at several points during the software life cycle: during development, as a means to debug; towards the end of development, for tuning performance; and after the application is deployed, for failure analysis. Logic analyzers have been used for many years to clamp onto the data busses of the target and monitor the data flow of the application in order to analyze application behavior. Aside from the fact that logic analyzers are expensive ($15K to $60K for a DSP), the increase in system-level integration over the years has resulted in fewer exposed data paths for the logic analyzer to monitor. Most modern microprocessors are architected with specialized hardware counters that can be programmed for the purpose of tracing applications.

Traditionally these registers have been used to determine the design of the microarchitecture, such as caches and TLBs. Whereas these registers can be used to trace the behavior of applications at a very fine level of granularity, they cannot easily be used as an RTA mechanism. An ancillary yet significant issue is that analysis requires that the user have an advanced knowledge of the target microarchitecture in order to interpret the data.
Fig 2: Debugger with real-time data

Finally, tracing supports data transfer only from target to host and not from host to target.

An alternative real-time analysis solution based upon JTAG emulation is presented here. This hardware and software architecture for a single processor is shown in Fig 3. The JTAG interface that connects the on-chip emulation logic to the host-based emulator provides the physical mechanism on which to transport data from the target to the host and vice versa. The target application is the subject matter to be analyzed; it is the source of data to be sent to the host and the sink for data received from the host. Therefore, an RTA target software library exists to bridge the gap from the target application to the on-chip emulation hardware. Good software engineering practices dictate that an API exists for this software library. On the host, the data is to be analyzed by a host application. This host application may also input data to the target application. We must therefore bridge the gap from the emulator to the host application.

Fig 3: Single processor RTA based upon JTAG

An emulation software driver controls the scanning of data to/from the target via the emulator. It is the first piece of host software to receive data from the target and the last piece of host software to handle data heading to the target. An RTA host software library funnels the data between the emulation software driver and the host application. Again, an API exists for the RTA host software library. It should be noted that multiple host applications may be run concurrently.
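
The layering on the host side can be sketched in a few lines of Python. This is only an illustration of the roles described above, not the actual driver or library API: the class names, method names, and the choice to deliver every block to every attached host application are all assumptions made for the example.

```python
class EmulationDriver:
    """Hypothetical stand-in for the emulation software driver."""

    def __init__(self):
        self.to_target = []     # blocks queued to be scanned down to the target
        self.on_receive = None  # callback installed by the host library

    def scan_in(self, data: bytes):
        # Called when a block has been scanned up from the target.
        if self.on_receive is not None:
            self.on_receive(data)

    def scan_out(self, data: bytes):
        # Last piece of host software to handle data heading to the target.
        self.to_target.append(data)


class RtaHostLibrary:
    """Hypothetical RTA host library: funnels data between driver and apps."""

    def __init__(self, driver: EmulationDriver):
        self.driver = driver
        self.apps = []
        driver.on_receive = self._deliver

    def attach(self, callback):
        # Several host applications may be attached at the same time.
        self.apps.append(callback)

    def _deliver(self, data: bytes):
        # Assumption: every attached application sees every block.
        for callback in self.apps:
            callback(data)

    def send(self, data: bytes):
        self.driver.scan_out(data)


driver = EmulationDriver()
library = RtaHostLibrary(driver)
plot_app, log_app = [], []
library.attach(plot_app.append)
library.attach(log_app.append)
driver.scan_in(b"\x01\x02")   # target-to-host
library.send(b"\x10")         # host-to-target
```

Keeping the driver unaware of what the data means, and the library unaware of how it is scanned, mirrors the separation of concerns that the two APIs are intended to provide.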
Data flow in this architecture is bi-directional: data flows from the target application to the host application for analysis, and data may flow from the host application to the target application for supplying input parameters. Such input parameters may be used for fine tuning performance, for supplying test data, etc.

For target-to-host data transfer, there are two distinct parts of the data flow path. The first part extends from the target application to the RTA host software library. This is the real-time transportation leg. Since our target application has real-time constraints, data must be off-loaded from the target to the host at a certain rate. The RTA host software library is the first piece of software on the host that realizes it has received real-time data for analysis. (The emulation software is agnostic to what type of data it is scanning.) The RTA host software can record the data to disk and be done with it, or buffer it internally. The second part of the target-to-host data flow path extends from the RTA host software library to the host application. The data is analyzed by the host application. If the data has been recorded in persistent storage, then the data can be played back at any time. If the data is not in persistent storage, then it must be analyzed by the host application as it is produced; that is, the data must be drained from the RTA host software buffers as they fill.

Fig 4: Single processor RTA based upon JTAG emulation

The above RTA architecture for a single embedded processor is easily extended to a multiprocessor environment. This is shown in Fig 5. An RTA target software library must exist on each embedded target to connect the target application to the emulation logic on that target. For multi-core architectures, an RTA target software library will exist for each core. The data from each processor is scanned up to the host via the JTAG interface ring as described in the Scan-Based Emulation section. On the host, there exists an emulation software driver corresponding to each target in the system. Each emulation software driver receives the data from its corresponding target and delivers the data to the one RTA host software library.
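
The two ways the RTA host software can handle incoming target-to-host data, recording it or buffering it until the host application drains it, can be sketched as follows. The class name, the block-count capacity, and the decision to refuse a block when the buffer is full are assumptions made for this illustration; they are not taken from the actual RTA library.

```python
import io
from collections import deque


class RtaHostBuffer:
    """Hypothetical host-side store for blocks arriving from the target."""

    def __init__(self, capacity, log=None):
        self.capacity = capacity   # maximum number of buffered blocks
        self.log = log             # binary file-like object, or None
        self.pending = deque()

    def receive(self, block: bytes) -> bool:
        if self.log is not None:
            self.log.write(block)  # recorded data can be played back later
            return True
        if len(self.pending) >= self.capacity:
            return False           # the host application did not drain in time
        self.pending.append(block)
        return True

    def drain(self):
        """Hand all buffered blocks to the host application."""
        blocks = list(self.pending)
        self.pending.clear()
        return blocks


# Buffered mode: the host application must keep up with the target.
live = RtaHostBuffer(capacity=2)
accepted = [live.receive(b) for b in (b"a", b"b", b"c")]
drained = live.drain()

# Recording mode: data goes to persistent storage (an in-memory file here).
log = io.BytesIO()
recorder = RtaHostBuffer(capacity=2, log=log)
for block in (b"a", b"b", b"c"):
    recorder.receive(block)
```

In buffered mode the third block is refused because nothing drained the buffer in time, whereas in recording mode all three blocks are kept and could be replayed at any later point.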
Fig 5: Multiprocessor RTA via Scan-Based Emulation

This architecture for multiprocessor real-time analysis via scan-based emulation provides the basis for the methodology.

Methodology

In this section we present an end-to-end methodology that is predicated on support in hardware and software across several families of DSPs. There is special emulation hardware architected into the DSP core, and there are emulation drivers as well as RTA target and host side software that permit the user to perform RTA. Fundamentally, this methodology involves using a development environment to develop and download a target application to a DSP. The application running on the DSP interfaces with the RTA software to send and receive data. The data is scanned out using JTAG boundary scan. The data is received on the host by the emulation driver that interfaces to the host-side RTA software. The data is then presented to the host client application for analysis. The figures of merit used to determine the success of this methodology are performance, scalability, ease of use and reliability. We discuss these criteria within the scope of both hardware capabilities and the RTA software architecture in a multiprocessor environment.

An important consideration in providing a methodology for multiprocessor RTA is performance.

Dedicated Emulation Hardware

The performance problem has been partially addressed in hardware by dedicating hardware for scan-based emulation. Data is transferred between target and host using dedicated emulation hardware to improve performance. In a heterogeneous multiprocessor arrangement of DSPs, one complication that arises is that of varying scan lengths. Each design of DSP has its own emulation hardware. This results in scan lengths that vary within a family of DSPs and between Instruction Set Architectures (ISAs). The result of this variance is that longer scan chains require more time to disassemble the scanned data, which lowers throughput and therefore performance. Data can be streamed between target and host by using peripherals such as DMA and by performing real-time memory write operations. In a multiprocessor JTAG scan, a special JTAG boundary scan bypass instruction obviates the need to scan any device set to bypass mode. This results in less time to disassemble data being transferred between host and target.

Fig 6: Virtual Data Paths

An RTA solution for a multiprocessor environment must be able to identify the processor from which data originates. This introduces the need to mark the data with a processor identifier. The decision then becomes where in the system to do this. If we examine the host, we see that there is a one-to-one correspondence between the emulation software drivers and the processors in the system. Since there is an emulation software driver for each target in the system, these drivers can stamp the data with a processor identifier. Note that, from a bandwidth perspective, it is better to mark the data on the host side than on the target side. If a unique processor identifier were sent down to the target and the data were tagged there, then more data would be sent from the target to the host and would consume precious bandwidth. At the processor level, it is possible to allow finer-grain identification of data. Virtual data paths that extend from the target application to the host application are used to segregate data.

Scalability

A key aspect of this methodology is scalability. This issue is addressed in both hardware and software.
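
Host-side stamping and virtual data paths can be illustrated with a short Python sketch. One driver object exists per target, so each driver already knows which processor its data came from and can attach the identifier after the data has crossed the JTAG link. The class names, the `channel` parameter standing in for a virtual data path, and the tag sizes used in the overhead calculation are hypothetical.

```python
from collections import defaultdict


class StampingDriver:
    """Hypothetical host-side emulation driver bound to a single target."""

    def __init__(self, processor_id, sink):
        self.processor_id = processor_id
        self.sink = sink

    def on_scan(self, channel, payload: bytes):
        # The stamp is added on the host, so it costs no scan bandwidth.
        self.sink(self.processor_id, channel, payload)


class HostLibrary:
    """Hypothetical RTA host library that segregates data per virtual path."""

    def __init__(self):
        self.paths = defaultdict(list)  # (processor_id, channel) -> payloads

    def deliver(self, processor_id, channel, payload: bytes):
        self.paths[(processor_id, channel)].append(payload)


def target_tag_overhead(payload_bytes: int, tag_bytes: int) -> float:
    """Fraction of scanned bytes spent on tags if tagging were done on target."""
    return tag_bytes / (payload_bytes + tag_bytes)


library = HostLibrary()
driver0 = StampingDriver(0, library.deliver)
driver1 = StampingDriver(1, library.deliver)
driver0.on_scan("trace", b"\x01")
driver1.on_scan("trace", b"\x02")
driver0.on_scan("trace", b"\x03")
overhead = target_tag_overhead(payload_bytes=12, tag_bytes=4)
```

With the illustrative sizes above, tagging on the target would spend a quarter of every scanned block on the identifier, whereas stamping in the host-side driver spends none.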
Hardware Scalability

The JTAG specification permits the daisy chaining of hardware. The limit on the number of devices that can be daisy chained is based on signal strength limitations rather than on the JTAG specification.

Software Scalability

In software, data from each target is tagged with a unique identifier so that data being transferred between host and target can be identified as to which processor it belongs to. Further, the RTA architecture is software scalable; the target application does not depend on the number of processors and does not need to be altered if processors are either added to or removed from the system configuration. There is no requirement that the target application have any knowledge of the type or number of processors in a scan chain at the time of development.

Data Selection

The host application should be able to select the processor to which it sends data or from which it receives data. This is accomplished by incorporating this functionality into the host API, which proves to be very favorable with respect to scalability. For example, let's assume that we have an application that performs a series of transformations on a given vector, and then transfers the resulting vector to the host for analysis. Let's further assume that there are many vectors that must be transformed and that we choose to deploy the same target application on as many processors as there are vectors to achieve maximum computing parallelization. We can design a host application that sends a different vector to each processor and then collects the resulting vectors from each processor, respectively, for analysis.

Ease of Use

Ease of use is an important but often difficult figure of merit to sustain. A software debugging environment is provided that permits the user to easily configure the hardware in the system.

Fig 7: Processor Selection

Hardware Support

A trend in DSP emulation hardware is to support device registers that are mapped at fixed addresses. This permits the source code porting of applications. Further, a trend in more contemporary DSP emulation logic is to replicate the logic on all DSPs. This further simplifies the deployment of RTA tools.

Software Support

At setup, the user selects the type of target and loads the system with an emulation driver for that target. The user also specifies the number of targets of each type and their position in the scan chain. Without this capability users would have to add code in their host applications that performed the same function, resulting in messy and unnecessarily complex code. Host side support is provided in the way of object-oriented interfaces based on the Component Object Model (COM), which is a de facto industry standard. This permits the host application developer to write client programs that are not tightly coupled to a specific DSP.

Hardware Reliability

The JTAG specification has long been established as a reliable standard. It has been adopted and extended. An extensive set of target libraries has been developed for various flavors of DSPs based on boundary scan. Reliability is achieved through reuse of the same register set in different versions of emulation hardware across ISAs and within a family of DSPs.

Software Reliability

The use of unidirectional virtual paths for both target-to-host and host-to-target data transfers assists in ensuring that there is no data corruption. Further, host [...] on data connected to virtual paths, and so there is no data loss. Buffer management is precise and is architected to ensure no data loss on both target and host sides. Another feature of the RTA architecture is congestion control. With this capability, buffers are guaranteed not to overflow. During host-to-target data transfer, the RTA architecture signals the end of data transfer through a virtual path using callbacks.
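
The vector-processing scenario used to motivate per-processor data selection can be sketched in Python as follows. The targets are simulated in-process, the transformation (doubling each element) is a placeholder, and the `HostApi` class with its `send` and `receive` methods is a hypothetical stand-in for a host API that lets the application name the processor it talks to.

```python
class SimulatedTarget:
    """Stands in for one processor running the same target application."""

    def __init__(self):
        self.inbox = None

    def write(self, vector):
        self.inbox = vector

    def read(self):
        # Placeholder transformation: double every element.
        return [2 * x for x in self.inbox]


class HostApi:
    """Hypothetical host API exposing per-processor selection."""

    def __init__(self, targets):
        self.targets = targets

    def send(self, processor: int, vector):
        self.targets[processor].write(vector)

    def receive(self, processor: int):
        return self.targets[processor].read()


def scatter_gather(api: HostApi, vectors):
    """Send one vector to each processor, then collect results in order."""
    for processor, vector in enumerate(vectors):
        api.send(processor, vector)
    return [api.receive(processor) for processor in range(len(vectors))]


api = HostApi([SimulatedTarget() for _ in range(3)])
results = scatter_gather(api, [[1, 2], [3], [0, 5]])
print(results)
```

Because the host application addresses processors by index, adding or removing processors changes only the length of the list passed to `HostApi`; the simulated target code is identical on every processor, which is the scalability property described above.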
Challenges

There are several challenges in supporting a uniform multiprocessor RTA capability on various families of DSPs. Each family of DSP has its particular variant of emulation hardware. This has an impact on the RTA protocol that is used to transfer data between host and target. For instance, some of the emulation capabilities on some DSPs use interrupts to signal the flow of data between host and target. In the absence of emulation interrupt support, the application must poll the emulation hardware for the presence of data.

The RTA methodology presented in this paper is extensively used and widely accepted. The methodology is not limited to the problem illustrated by the high energy physics application. Our experiences to date have shown that other domains such as wireless and mobile computing require the processing of RTA data where both DSPs and microcontrollers are on the same scan chain. The development of this RTA capability has been predicated on the JTAG specification and the adherence to this standard in the emulation hardware that has been designed into the DSP core. The software that has been developed is able to differentiate between the various DSPs. The virtual paths in the RTA architecture guarantee data integrity.

References:

[1] JTAG IEEE 1149.1 Specification

[2] Gottschalk, E.E., et al., "The BTeV DAQ and Trigger System – Some Throughput, Usability, and Fault Tolerance Aspects," Proceedings of the Computing in High Energy and Nuclear Physics Conference (CHEP 2001), p. 628,