Class 543 Embedded Systems Conference San Francisco, CA March 2002

Multiprocessor Real-Time Analysis for Scan-Based Emulation:
A Methodology
Debbie Keil, Systems Software Architect, Texas Instruments, Inc.
Prithvi Rao, Systems Software Developer, Texas Instruments, Inc.

Abstract: Scan-based emulation is a pervasive method that is deployed to debug and
develop DSP applications. Scan-based emulation relies on scanning data out from the
target to the host computer using proprietary emulation hardware that is designed into
the DSP core. Emulation support in our single processor and multiprocessor
environments comprises software and hardware on both the target and the host, and
therefore provides a transport mechanism upon which to base real-time analysis.

In this paper we present an end-to-end methodology in developing a multiprocessor real-
time analysis capability formulated on JTAG scan-based mechanisms. Specifically we
discuss the issues related to extending the capability from single to multiprocessor
domains. We examine the issues related to supporting real-time analysis in all of the
software and hardware components. Finally, we enumerate the challenges in developing
a methodology for deploying both homogeneous and heterogeneous multiprocessor
combinations.

1. Introduction

The ability to analyze the proper execution of real-time applications in an embedded
system is critical to their development and deployment. This applies to real-time
applications ranging from mission critical to multimedia.

In embedded systems the ability to perform Real-Time Analysis (RTA) can involve a
dedicated hardware and software capability with an end-to-end methodology that
supports the transferring of data between the host and the target in a lossless and reliable
manner. Specifically, the Real-Time Analysis encompassed by this methodology consists
of capturing data from a target application using dedicated hardware, transferring it
through various layers of software dedicated to creating a real-time path and making it
available to a host application for the purpose of analysis. Analysis includes
determining whether applications meet both timing and logical correctness
requirements.

In this paper we present an end-to-end methodology in developing a multiprocessor real-
time analysis capability formulated on JTAG scan-based mechanisms.

Scan-based emulation is a pervasive method that is deployed to debug, develop and
analyze real-time applications running on DSPs. The JTAG boundary scan specification
permits multiple devices to be connected in a serial daisy-chained arrangement. Section
2 covers scan-based emulation in detail and sets the stage for RTA.

The RTA hardware and software architecture that our methodology relies upon is
presented in Section 3. It also includes an application that illustrates the necessity of
RTA in a multiprocessor environment.

The end-to-end methodology is discussed in Section 4. Specifically we discuss the issues
related to extending the capability from single to multiprocessor domains.

In Section 5 we enumerate the challenges in developing a methodology for deploying
both homogeneous and heterogeneous multiprocessor combinations.

Our conclusions are presented in Section 6.

2. Scan-Based Emulation

With a traditional emulator, the CPU to be emulated is usually removed from its socket
and replaced with an emulator pod. The emulator pod typically has a replacement CPU,
plus various amounts of random logic and memory to monitor what is happening on the
CPU pins.

With modern CPUs such as DSPs, the traditional approach has several problems. The
first problem is the speed of newer DSP chips. Bus cycle times can be 25 nanoseconds or
shorter, and all instructions typically execute in a single cycle. This makes it difficult for
a traditional emulator to allow emulation at full speed. The number of pins to monitor can
be staggering, with chips having multiple 32-bit address and data buses, making a
traditional emulator expensive. The second, and more serious, problem is that DSPs often
have on-chip caches, pipelines, memory and peripherals. Sometimes a whole algorithm
can execute without any activity on the CPU pins.

The solution to these problems is scan-based emulation. With scan-based emulation, the
CPU is never removed from the socket; in fact it can be soldered directly onto the board.
Instead, the CPU has a serial scan interface, allowing the emulator to scan the internals of
the device through a standard connector. The pinout for this connector is defined by a
standards committee, making it possible to support several devices with a single
emulator [1].

The scan-based approach to emulation has many advantages:

• Emulation at full device speed - Since there is no logic needed to monitor what
happens on the CPU bus, the emulator can allow the device to execute programs
at full speed.
• Non-intrusive emulation - Since no logic is attached to any CPU pin, the CPU
bus is not affected at all by the emulation process. The emulator will not affect the
operation of the bus, as is so often the case with traditional emulators.
• In-circuit emulation - The CPU can be soldered to the board while emulating.
This makes denser packaging possible, and also makes the emulator a
manufacturing test tool.
• Full access to internal memory, caches, pipelines and registers - The complete
state of the processor is visible to the outside through a scan interface.
• Complete access to the system from the CPU - Any peripheral or memory that
the CPU can access in the system can also be accessed through the scan interface.
The emulator can look at the system "through the eyes of the CPU". This makes it
possible to debug and diagnose a system where nothing is working except the
CPU itself.

2.1 The JTAG Interface

The Joint Test Action Group (JTAG) defines an interface called the JTAG interface for
testing individual devices on printed circuit boards, without the need to remove the
devices from the board. This is accomplished by a method called boundary scan, whereby
the state of each pin of each device (with some special logic on the device) is serially
scanned out from the device. Multiple devices can be daisy chained, and an entire PC
board can therefore be scanned in a single scan chain.

It is possible to use the same method to scan out not only the state of a device's pins, but
also any internal information from the device, such as register values and memory
locations. As a consequence, scan-based emulation was born.

The JTAG specification does not include the pinout for the JTAG connector. An
extension to JTAG defines a 14-pin, 2-row, 0.1" spacing JTAG connector header, with
pinout and physical dimensions common to all DSPs that support JTAG involved in this
methodology [2].

During JTAG emulation, the emulator supplies the clock that scans the device. This
means that the target clock speed is completely independent of the emulation clock, and
the emulator can support targets running at any clock speed.

2.2 The Boundary Scan Mechanism

2.2.1 Device Architecture

The JTAG device architecture is based on the IEEE 1149.1 architecture. This
specification defines four dedicated pins, plus an optional fifth, collectively known as
the Test Access Port (TAP). They are:

• Test Data In (TDI)
• Test Data Out (TDO)
• Test Mode Select (TMS)
• Test Clock (TCK)
• Test Reset (TRST, optional)

A boundary scan cell is connected to each pin of each device that is being scanned;
together these cells form the boundary scan register. The architecture further specifies a
finite state machine, the TAP controller, with inputs TMS and TCK. There is an
Instruction Register (IR) holding the current instruction, a bypass register, and an
optional 32-bit identification register for permanent identification.

2.2.2 Principles of Boundary Scan

Boundary scan cells are configured into a parallel-in, parallel-out shift register. Parallel
load operations cause signals from the core logic to be loaded into the output cells.
Parallel unload operations cause the signals to be loaded from the input cells to the core
logic. Data is shifted in serial mode by daisy chaining devices. Figure 1 shows
the TDO of each device connected to the TDI of the next device in the scan chain.

[Figure 1: Boundary Scan. Loosely coupled DSP arrangement (multiple boards): Target
Board 1 and Target Board 2 each expose a 14-pin JTAG header; a JTAG splitter card links
their TDI and TDO pins into one scan chain that runs to the host computer.]

In a homogeneous multiprocessor environment all devices have the same emulation
hardware with the same scan chain length. In a heterogeneous environment, devices have
different emulation hardware resulting in varying scan chain lengths.

It is possible to avoid scanning any device by placing it in bypass mode.

Typically, the system architect is responsible for determining the type of arrangement
of devices (homogeneous or heterogeneous), their order in the scan chain and whether
they will be placed in bypass.
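
The daisy-chain and bypass behavior described in this section can be illustrated with a
small simulation. This is a minimal sketch, not the emulator's actual implementation: the
Device class, the register contents and the helper names scan_out and disassemble are
all hypothetical. It assumes only what IEEE 1149.1 defines, namely that each device
behaves as a shift register between TDI and TDO and that a bypassed device contributes a
single-bit register.

```python
from collections import deque


class Device:
    """One device on the chain, modeled as a shift register from TDI to TDO."""

    def __init__(self, name, data_bits, bypass=False):
        self.name = name
        self.bypass = bypass
        # A bypassed device presents only its 1-bit bypass register.
        self.register = deque([0] if bypass else data_bits)

    def clock(self, tdi):
        """One TCK cycle: the bit at the TDO end leaves, the TDI bit enters."""
        tdo = self.register.pop()
        self.register.appendleft(tdi)
        return tdo


def scan_out(chain):
    """Shift the whole chain out to the host, one bit per TCK cycle."""
    total = sum(len(dev.register) for dev in chain)
    stream = []
    for _ in range(total):
        bit = 0                      # the host drives zeros into the first TDI
        for dev in chain:            # chain[0] is the device fed by the host
            bit = dev.clock(bit)     # each TDO feeds the next device's TDI
        stream.append(bit)
    return stream


def disassemble(chain, stream):
    """Split the scanned stream back into per-device fields, skipping bypassed devices."""
    fields, pos = {}, 0
    for dev in reversed(chain):      # the device nearest the host's TDO arrives first
        length = len(dev.register)
        if not dev.bypass:
            fields[dev.name] = stream[pos:pos + length]
        pos += length
    return fields
```

With a chain of device A (4 bits), device B in bypass and device C (3 bits), the host
shifts 4 + 1 + 3 = 8 bits per pass. Each field arrives with the bit nearest TDO first,
and the single bit from B is discarded during disassembly.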

3. Real-Time Analysis (RTA)

The following application in the domain of high energy physics illustrates the necessity
for RTA in a heterogeneous multiprocessor environment. The Fermilab Tevatron
Collider generates 15 million particle collisions per second. These particle collisions
result in the creation of subatomic particles that travel through a spectrometer. The data
output from the spectrometer is on the order of terabytes per second and must be analyzed
in real time. The analysis engine comprises a massively parallel arrangement of
heterogeneous DSPs and GPPs (general purpose processors). Analysis consists of
applying algorithms that reconstruct and filter the collision data. The result is a select set
of interesting collisions from which physicists can study some of the remaining mysteries
of matter and antimatter in the universe [3].

Analysis of real-time embedded applications is necessary at several points during the
software life cycle: during development, as a means to debug; towards the end of
development, for tuning performance; and after the application is deployed, for failure
analysis.

Historically, several different methods have been employed to debug and analyze real-
time embedded applications. Traditional debuggers were used to set breakpoints that
stop the target application so that the memory state could be examined. This method has
proven to be inadequate for most real-time applications because setting breakpoints halts
the application and therefore interferes with the timing constraints of the system. The
memory state is not guaranteed to contain reliable results.

Logic analyzers have been used for many years to clamp onto the data busses of the
target and monitor the data flow of the application in order to analyze application
behavior. Aside from the fact that logic analyzers are expensive ($15K to $60K for a
DSP), the increase in system-level integration over the years has resulted in fewer
exposed data paths for the logic analyzer to monitor.

In some cases, pre-production versions of the chips containing in-circuit emulation (ICE)
structures were manufactured. These could be used to debug real-time applications.
However, since the debugging environment is not equivalent to the final production
environment, the application’s performance cannot be guaranteed to remain the same
from the ICE version to the final chip.

Most modern microprocessors are architected with specialized hardware counters that can
be programmed for the purpose of tracing applications. Traditionally these counters have
been used to guide the design of microarchitectural features such as caches and TLBs.
Although they can trace the behavior of applications at a very fine level of granularity,
they cannot easily be used as an RTA mechanism. An ancillary yet significant issue is
that the user needs advanced knowledge of the target microarchitecture in order to
interpret the data. Finally, tracing supports data transfer only from target to host and
not from host to target.

[Figure 2: Single processor RTA based upon JTAG emulation. Host: Host App1 through Host
AppM, Host API, RTA Host SW Lib, Emu SW. Emulator and JTAG link. Target: Emu HW, RTA
Target SW Lib, Target API, Real-Time Target App.]

An alternative real-time analysis solution based upon JTAG emulation is presented here.
The hardware and software architecture for a single processor is shown in Figure 2. The
JTAG interface that connects the on-chip emulation logic to the host-based emulator
provides the physical mechanism for transporting data from the target to the host
and vice versa.

On the target, the application is the subject to be analyzed; it is the source of data
sent to the host and the sink for data received from the host. An RTA target software
library therefore bridges the gap between the target application and the on-chip
emulation hardware. Good software engineering practice dictates that an API exist for
this library.

On the host, the data is analyzed by a host application, which may also supply input
data to the target application. We must therefore bridge the gap between the emulator
and the host application. An emulation software driver controls the scanning of data to
and from the target via the emulator. It is the first piece of host software to receive
data from the target and the last piece of host software to handle data heading to the
target. An RTA host software library funnels the data between the emulation software
driver and the host application. Again, an API exists for the RTA host software library.
Multiple host applications may run concurrently.

Data flow in this architecture is bi-directional: data flows from the target application to
the host application for analysis, and data may flow from the host application to the target
application for supplying input parameters. Such input parameters may be used for fine
tuning performance, for supplying test data, etc. Refer to Figure 3.

For target-to-host data transfer, there are two distinct parts of the data flow path. The
first part extends from the target application to the RTA host software library. This is the
real-time transportation leg. Since our target application has real-time constraints, data
must be off-loaded from the target to the host at a certain rate. The RTA host software
library is the first piece of software on the host that realizes it has received real-time data
for analysis. (The emulation software is agnostic to what type of data it is scanning.) The
RTA host software can record the data to disk and be done with it, or buffer it internally.
The second part of the target-to-host data flow path extends from the RTA host software
library to the host application. The data is analyzed by the host application. If the data
has been recorded in persistent storage, then the data can be played back at any time. If
the data is not in persistent storage, then it must be analyzed by the host application as it
is produced; that is, the data must be drained from the RTA host software buffers as they
fill.
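
The two choices open to the RTA host software, recording to persistent storage or
buffering for live analysis, can be sketched as follows. This is an illustrative model
only: the class name RtaHostBuffering, its methods, the buffer capacity and the
length-prefixed record format are all assumptions introduced here, not the actual RTA
host library interface.

```python
import io
from collections import deque


class RtaHostBuffering:
    """Hypothetical model of the RTA host library's two modes of operation."""

    def __init__(self, record_to=None, capacity=4):
        self.record_to = record_to   # file-like object, or None for live mode
        self.capacity = capacity     # number of in-memory buffers in live mode
        self.buffers = deque()

    def on_target_data(self, block):
        """Called by the emulation driver; returns False if the block cannot be accepted."""
        if self.record_to is not None:
            # Record mode: write a length-prefixed record and be done with it.
            self.record_to.write(len(block).to_bytes(4, "little") + block)
            return True
        if len(self.buffers) >= self.capacity:
            return False             # the host application is not draining fast enough
        self.buffers.append(block)
        return True

    def read(self):
        """Live mode: the host application drains one block, or gets None if empty."""
        return self.buffers.popleft() if self.buffers else None


def playback(recording):
    """Replay recorded blocks in their original order, at any later time."""
    recording.seek(0)
    while True:
        header = recording.read(4)
        if len(header) < 4:
            return
        yield recording.read(int.from_bytes(header, "little"))
```

In live mode the host application must call read() at least as fast as blocks arrive;
in record mode the analysis can be deferred and repeated through playback(). An
io.BytesIO object can stand in for a disk file when trying the sketch out.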

[Figure 3: Data flow in single processor RTA based upon JTAG emulation. Same blocks as
Figure 2 (Host App1 through Host AppM, RTA Host SW Lib, Emu SW, Emulator, JTAG, Emu HW,
RTA Target SW Lib, Real-Time Target App), with the RTA real-time transportation leg and
the host-to-target input flow marked.]

The above RTA architecture for a single embedded processor extends naturally to a
multiprocessor environment, as shown in Figure 4. An RTA target software library
must exist on each embedded target to connect the target application to the emulation
logic on that target. For multi-core architectures, an RTA target software library exists
for each core. The data from each processor is scanned up to the host via the JTAG
interface ring as described in Section 2 (Scan-Based Emulation). On the host, there is
one emulation software driver for each target in the system. Each emulation
software driver receives the data from its corresponding target and delivers it to the
single RTA host software library.

[Figure 4: Multiprocessor RTA via Scan-Based Emulation. Host processor: Host App1
through Host AppM, one RTA Host SW library, and emulation software drivers Emu SW P1
through Emu SW Pn. Target processors P1 through Pn: each has Emu HW, RTA Target SW and a
Real-Time App, with TDI and TDO pins daisy chained.]

This architecture for multiprocessor real-time analysis via scan-based emulation provides
the basis for the methodology.

4. Methodology

In this section we present an end-to-end methodology that is predicated on support in
hardware and software across several families of DSPs. Special emulation hardware is
architected into the DSP core, and emulation drivers together with RTA target-side and
host-side software permit the user to perform RTA.

Fundamentally, this methodology involves using a development environment to develop
and download a target application to a DSP. The application running on the DSP
interfaces with the RTA software to send and receive data. The data is scanned out using
JTAG boundary scan. The data is received on the host by the emulation driver that
interfaces to the host-side RTA software. The data is then presented to the host client
application for analysis.

The figures of merit used to determine the success of this methodology are performance,
scalability, ease of use and reliability. We discuss these criteria within the scope of both
hardware capabilities and the RTA software architecture in a multiprocessor
environment.

4.1 Performance

An important consideration in providing a methodology for multiprocessor RTA is
performance.

4.1.1 Dedicated Emulation Hardware

The performance problem has been partially addressed in hardware by dedicating
hardware for scan-based emulation. Data is transferred between target and host using
dedicated emulation hardware to improve performance.

In a heterogeneous multiprocessor arrangement of DSPs, one complication that arises is
that of varying scan lengths. Each DSP design has its own emulation hardware, so scan
lengths vary both within a family of DSPs and between Instruction Set Architectures
(ISAs). Longer scan chains require more time to disassemble the scanned data, which
lowers throughput and therefore performance.

Data can be streamed between target and host by using peripherals such as DMA and by
performing real-time memory write operations.

In a multiprocessor JTAG scan, a special JTAG boundary scan bypass instruction
obviates the need to scan any device set to bypass mode. This results in less time to
disassemble data being transferred between host and target.
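
The effect of scan length and bypass on transfer time can be estimated with simple
arithmetic. The sketch below is a rough model under stated assumptions: the scan lengths
(512 and 2048 bits) and the 10 MHz TCK rate are hypothetical numbers chosen for
illustration, the function names are invented here, and the model ignores TAP state
transitions and host-side processing. It relies only on the fact that a device in bypass
contributes one bit to the chain.

```python
def bits_per_pass(chain):
    """Bits shifted to move every register on the chain once.

    chain: list of (scan_length_bits, bypassed) tuples, one per device.
    A bypassed device contributes only its 1-bit bypass register.
    """
    return sum(1 if bypassed else length for length, bypassed in chain)


def pass_time_us(chain, tck_hz):
    """Idealized time for one pass, in microseconds, at the given TCK rate."""
    return bits_per_pass(chain) / tck_hz * 1e6


# Hypothetical heterogeneous chain: two 512-bit devices around one 2048-bit device.
all_active = [(512, False), (2048, False), (512, False)]
# Same chain when only the last device is of interest.
two_bypassed = [(512, True), (2048, True), (512, False)]

full_bits = bits_per_pass(all_active)        # 512 + 2048 + 512
reduced_bits = bits_per_pass(two_bypassed)   # 1 + 1 + 512
```

Under these assumptions, bypassing the two devices that are not of interest shortens
each pass from 3072 bits to 514 bits, roughly a sixfold reduction in the data that must
be shifted and disassembled.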

[Figure 5: Virtual Data Paths. Virtual Path 1 through Virtual Path N connect the Host
App on the host to the Target App on the target.]

4.1.2 Data Identification

An RTA solution for a multiprocessor environment must be able to identify the processor
from which data originates. This introduces the need to mark the data with a processor
identifier. The decision then becomes where in the system to do this. If we examine the
host, we see that there is a one-to-one correspondence between the emulation software
drivers and the processors in the system. Since there is an emulation software driver for
each target in the system, these drivers can stamp the data with a processor identifier.

Note that from a performance perspective, it is better to mark the data on the host side
than on the target side. If a unique processor identifier were sent down to the target and
the data were tagged there, then more data would be sent from the target to the host and
would consume precious bandwidth.
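
Host-side stamping can be sketched as follows. The class names (TaggedBlock,
EmulationDriver, RtaHostRouter) and their methods are hypothetical and stand in for the
real drivers and host library; the point of the sketch is only that the identifier is
attached after the data has crossed the JTAG link, so no identifier bytes are scanned.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaggedBlock:
    processor_id: int
    payload: bytes


class RtaHostRouter:
    """Hypothetical stand-in for the single RTA host software library."""

    def __init__(self):
        self.by_processor = {}

    def deliver(self, block):
        # Keep each processor's data separate, in arrival order.
        self.by_processor.setdefault(block.processor_id, []).append(block.payload)


class EmulationDriver:
    """Hypothetical per-target driver: one instance per processor on the chain."""

    def __init__(self, processor_id, sink):
        self.processor_id = processor_id
        self.sink = sink

    def on_scan_data(self, payload):
        # The payload arrives untagged from the target; the driver stamps it here,
        # on the host, using the identity it was configured with.
        self.sink.deliver(TaggedBlock(self.processor_id, payload))
```

Because each driver already knows which target it serves, the stamp costs nothing on the
scan path: the bytes scanned from the target are exactly the bytes the target
application produced.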

At the processor level, it is possible to allow finer-grain identification of data. Virtual
data paths that extend from the target application to the host application are used to
segregate data. See Figure 5. For target-to-host data transfer, the segregation policy is
determined by the target application writer, whereas for host-to-target data transfer, the
segregation policy is determined by the host application writer. In either case, the
corresponding application (host or target) must be aware of how the data is segregated
according to virtual paths. Therefore, both the target API and the host API must contain
methods to identify the virtual path on which the data is flowing. The introduction of
virtual data path identification has ramifications on performance because this identifier
must be carried with the data.
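
The cost of carrying a virtual path identifier can be made concrete with a small framing
sketch. The header layout used here (a 1-byte path identifier followed by a 2-byte
little-endian payload length) is purely an assumption for illustration; the paper does
not specify the actual wire format, and the function names are invented.

```python
import struct

# Hypothetical header: 1-byte virtual path id, 2-byte payload length, no padding.
HEADER = struct.Struct("<BH")


def frame(path_id, payload):
    """Prefix a payload with its virtual path identifier and length."""
    return HEADER.pack(path_id, len(payload)) + payload


def deframe(stream):
    """Split a byte stream back into (path_id, payload) pairs."""
    messages, pos = [], 0
    while pos < len(stream):
        path_id, length = HEADER.unpack_from(stream, pos)
        pos += HEADER.size
        messages.append((path_id, stream[pos:pos + length]))
        pos += length
    return messages


def overhead_fraction(payload_len):
    """Fraction of transferred bytes spent on the header rather than on data."""
    return HEADER.size / (HEADER.size + payload_len)
```

With this 3-byte header, a 29-byte payload spends 3 of every 32 transferred bytes on
identification, while a 3-byte payload spends half. Under these assumptions, batching
data into larger blocks per virtual path keeps the identification overhead small.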

4.2 Scalability

A key aspect of this methodology is scalability. This issue is addressed in both hardware
and software.

4.2.1 Hardware Scalability

The JTAG specification permits the daisy chaining of hardware. The limit on the
number of devices that can be daisy chained is set by signal strength limitations rather
than by the JTAG specification.

4.2.2 Software Scalability

In software, data from each target is tagged with a unique identifier (as described in
Section 4.1.2) so that data being transferred between host and target can be attributed
to the processor to which it belongs.

Further, the RTA architecture is software scalable; writing the target application is not
dependent upon the number of processors, and the application does not need to be altered
if processors are added to or removed from the system configuration. There is no
requirement that the target application have any knowledge of the type or number of
processors in a scan chain at the time of development.

The emulation drivers and the RTA host software are architected to manage the data from
the different processors.

4.2.2.1 Data Selection

The host application should be able to select the processor to which it sends data or
from which it receives data. This is accomplished by incorporating this functionality
into the host API. This proves to be very favorable with respect to scalability. By
allowing the host application to select the processor, the same target application can be
replicated without change on multiple processors to exploit parallel computing power.

For example, let’s assume that we have a target application that performs a series of
transformations on a given vector, and then transfers the resulting vector to the host for
analysis. Let’s further assume that there are many vectors that must be transformed and
that we choose to deploy the same target application on as many processors as there are
vectors to achieve maximum computing parallelization. We can design a host application
that sends a different vector to each processor and then collects the resulting vectors from
each processor, respectively, for analysis. See Figure 6.

[Figure 6: Processor Selection. Processor1, Processor2 and Processor3 each run the same
Transform application. The host application performs the following steps in order.]

• Select Processor1
• Send vector1
• Select Processor2
• Send vector2
• Select Processor3
• Send vector3
• Select Processor1
• Get resultant vector
• Select Processor2
• Get resultant vector
• Select Processor3
• Get resultant vector

Note that this example still holds if the target application sends its data via a virtual data
path. Since the target application is replicated unchanged, each processor would be
sending data on the same virtual path. However, since the host application selects data
on the basis of both processor identifier and virtual data path identifier, all data is
uniquely identifiable.

This example illustrates that host-application control over processor selection results in a
scalable multiprocessor methodology.
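
The select, send and get sequence of the vector example can be sketched in code. The
host API shown here (HostRta with select_processor, send and get) is hypothetical, and
SimulatedTarget merely stands in for a DSP running the replicated target application;
the transform itself (doubling each element) is an arbitrary placeholder.

```python
class SimulatedTarget:
    """Stand-in for one DSP running the same, unmodified target application."""

    def __init__(self, transform):
        self.transform = transform
        self.outbox = []

    def receive(self, vector):
        # The target application transforms the vector and queues the result.
        self.outbox.append(self.transform(vector))


class HostRta:
    """Hypothetical host API in which one processor is selected at a time."""

    def __init__(self, targets):
        self.targets = targets       # processor id -> target
        self.selected = None

    def select_processor(self, processor_id):
        if processor_id not in self.targets:
            raise KeyError(processor_id)
        self.selected = processor_id

    def send(self, vector):
        self.targets[self.selected].receive(vector)

    def get(self):
        return self.targets[self.selected].outbox.pop(0)


def scale_by_two(vector):
    return [2 * x for x in vector]


def run(vectors):
    """Send one vector to each processor, then collect the results in order."""
    ids = range(1, len(vectors) + 1)
    host = HostRta({pid: SimulatedTarget(scale_by_two) for pid in ids})
    for pid, vector in zip(ids, vectors):
        host.select_processor(pid)
        host.send(vector)
    results = []
    for pid in ids:
        host.select_processor(pid)
        results.append(host.get())
    return results
```

Adding a fourth vector only changes the length of the list passed to run(); neither the
host logic nor the replicated target application needs to be altered, which is the
scalability property the example is meant to show.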

4.3 Ease of Use

Ease of use is an important figure of merit, but one that is often difficult to sustain.
A software debugging environment is provided that permits the user to easily configure
the hardware in the system.

4.3.1 Hardware Support

A trend in DSP emulation hardware is to support device registers that are mapped at fixed
addresses. This permits the source code porting of applications. Further, a trend in more
contemporary DSP emulation logic is to replicate the logic on all DSPs. This further
simplifies the deployment of RTA tools.

4.3.2 Software Support

At setup, the user selects the type of target and loads the system with an emulation driver
for that target. The user also specifies the number of targets of each type and their
position in the scan chain. Without this capability users would have to add code in their
host applications that performed the same function, resulting in messy and unnecessarily
complex code.

The debugging support software permits devices on a scan chain to be set to bypass.
In the absence of this support, the application might have to disassemble
unwanted scans.

Host-side support is provided in the form of object-oriented interfaces based on the
Component Object Model (COM) [4], which is a de facto industry standard. This permits
the host application developer to write client programs that are not tightly coupled to a
specific DSP.

4.4 Reliability

Reliability is critical to the deployment of the RTA capability.

4.4.1 Hardware Reliability

The JTAG specification has long been established as a reliable standard. It has been
adopted and extended. An extensive set of target libraries has been developed for
various flavors of DSPs based on boundary scan. Reliability is achieved through reuse of
the same register set in different versions of emulation hardware across ISAs and within
ISAs.

4.4.2 Software Reliability

The use of unidirectional virtual paths for both target-to-host and host-to-target data
transfers helps ensure that there is no data corruption. Further, host applications
synchronize on the data buffers connected to virtual paths, so no data is lost.
Buffer management is precise and is architected to ensure no data loss on either the
target or the host side.

Another feature of the RTA architecture is congestion control. With this capability
buffers are guaranteed not to overflow.

During host-to-target data transfer, the RTA architecture uses callbacks to signal the end
of a data transfer through a virtual path. The callbacks notify the target application
that data sent by the host is ready to be read. A virtual path cannot be reused until
previously written data has been consumed.

Data is copied from the target application into buffers in the RTA target software library.
This supports reliability by ensuring that the target application does not accidentally
overwrite data.
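
The reliability rules above (notification by callback, no reuse of a path until its data
is consumed, and copying data into library-owned buffers) can be combined in one small
sketch. The VirtualPath class and its method names are hypothetical, and a single-slot
buffer is assumed for simplicity; the actual library's buffering is not described at
this level of detail.

```python
class VirtualPath:
    """Hypothetical unidirectional path holding at most one unread message."""

    def __init__(self, on_data_ready):
        self._slot = None
        self._on_data_ready = on_data_ready

    def write(self, data):
        """Accept data only if the previous message has been consumed."""
        if self._slot is not None:
            return False             # refuse rather than overwrite unread data
        self._slot = bytes(data)     # copy, so the sender may reuse its own buffer
        self._on_data_ready(self)    # callback: tell the reader that data is waiting
        return True

    def read(self):
        """Consume the message and free the path for the next write."""
        data, self._slot = self._slot, None
        return data
```

A writer that receives False simply retries later, so the buffer can never overflow, and
because write() stores a copy, later changes to the writer's own buffer do not affect
the data in flight.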

5. Challenges

There are several challenges in supporting a uniform multiprocessor RTA capability on
various families of DSPs.

Each family of DSP has its particular variant of emulation hardware. This has an impact
on the RTA protocol that is used to transfer data between host and target. For instance,
some of the emulation capabilities on some DSPs use interrupts to signal the flow of data
between host and target. In the absence of emulation interrupt support, the application
must poll the emulation hardware for the presence of data.

Another issue is the support for DSPs with varying word sizes (16 bit and 32 bit).

There is a need to support RTA in the presence of various memory hierarchies.
Specifically, RTA must run when the application is loaded into on-chip or off-chip
memory.

These issues have been addressed on the target side by developing the RTA target
software libraries that are linked in with the application. These target libraries comprise
the software that is responsible for programming the emulation and peripheral device
registers and for effecting data transfer.

On the host side, the RTA host software is a target independent layer that can filter data
in a multiprocessor environment to send and receive the data from a particular target
unambiguously.

6. Conclusion

The RTA methodology presented in this paper is extensively used and widely accepted.
The problem illustrated by the high energy physics application in Section 3 is not the
only one it addresses. Our experience to date has shown that other domains, such as
wireless and mobile computing, require the processing of RTA data where both DSPs and
microcontrollers are on the same scan chain.

The development of this RTA capability has been predicated on the JTAG specification
and the adherence to this standard in the emulation hardware that has been designed into
the DSP core.

The software that has been developed is able to differentiate between the various DSPs.
The virtual paths in the RTA architecture guarantee data integrity. The COM interfaces
permit the analysis of data via Commercial Off-The-Shelf (COTS) tools such as
MATLAB® and LabVIEW™.

We consider this methodology to be successful based on the criteria cited in Section 4.
The methodology has been demonstrated to scale well. It incorporates many ease-of-use
features and is reliable from both hardware and software perspectives. We have
maximized performance by utilizing emulation hardware and streamlining the software
layers.

Biographies

Debbie Keil is a member of Texas Instruments’ technical ladder, a distinction held by the
top twenty percent of TI’s technical staff worldwide. She is the co-architect of TI’s Real-
Time Analysis technology and the technical lead of the software engineering team that
develops real-time analysis solutions for TI DSPs. Debbie has extensive industry
experience in compiler development and embedded systems programming. She has
published in the EE Times and has patents pending. She is a member of the International
WHO’S WHO of Information Technology. Debbie holds a Master’s degree in Computer
Science and a Bachelor’s degree in Computer Science and Math from the University of
Pittsburgh.

Prithvi Rao is a member of technical staff with Texas Instruments. He is currently
working on developing Real-Time Analysis capability on multiprocessor arrangements of
DSPs. He has worked in the development of the Real-Time Mach operating system at
CMU and was awarded two patents for his work in distributed control architectures for
mobile robots. He has a Bachelor's degree in Electrical Engineering from the University
of Canterbury, New Zealand and a Master's degree in Electrical and Computer
Engineering from CMU. He has been invited to present several tutorials on Java for the
Usenix organization and has published numerous articles on Java and CORBA in their
“;login:” technical journal. He has an adjunct faculty appointment at Carnegie Mellon
where he teaches graduate courses in Electronic Commerce, Java, Distributed Object
Technologies and Telecommunications Management.

References

[1] JTAG IEEE 1149.1 Specification http://www.ieee.org

[2] Texas Instruments JTAG Extensions http://www.ti.com

[3] Gottschalk, E.E., et al., "The BTeV DAQ and Trigger System – Some Throughput,
Usability, and Fault Tolerance Aspects," Proceedings of the Computing in High Energy
and Nuclear Physics Conference (CHEP 2001), p. 628, Beijing, China, September 2001.

[4] Microsoft Component Object Model (COM) http://www.microsoft.com
