A TECHNICAL PAPER PRESENTATION

ON
MULTIPROCESSOR REAL-TIME ANALYSIS FOR
SCAN-BASED EMULATION: A METHODOLOGY
OF DSP APPLICATION
FOR

TECHFEST’08
AT

Sir C R Reddy College of Engineering
BY

GUDLAVALLERU
ENGINEERING COLLEGE
GUDLAVALLERU

M.V.S NAIDU K.N.MALLIKHARJUN RAO
III/IV B.TECH ECE III/IV B.TECH ECE
05481A0443 05481A0454

E MAIL:mandala.naidu@gmail.com E MAIL: mallik454@gmail.com
Abstract

Scan-based emulation is a pervasive method that is deployed to debug
and develop DSP applications. Scan-based emulation relies on scanning data out from
the target to the host computer using proprietary emulation hardware that is designed
into the DSP core. Emulation support in our single processor and multiprocessor
environments comprises software and hardware on both the target and the host, and
therefore provides a transport mechanism upon which to base real-time analysis.

In this paper we present an end-to-end methodology for developing a multiprocessor
real-time analysis capability built on JTAG scan-based mechanisms. Specifically we
discuss the issues related to extending the capability from single to multiprocessor
domains. We examine the issues related to supporting real-time analysis in all of the
software and hardware components. Finally, we enumerate the challenges in developing
a methodology for deploying both homogeneous and heterogeneous multiprocessor
combinations.

Introduction

The ability to analyze the proper execution of real-time applications in an
embedded system is critical to their development and deployment. This applies to
real-time applications ranging from mission critical to multimedia.

In embedded systems the ability to perform Real-Time Analysis (RTA) can involve a
dedicated hardware and software capability with an end-to-end methodology that
supports transferring data between the host and the target in a lossless and reliable
manner. Specifically, the Real-Time Analysis encompassed by this methodology consists
of capturing data from a target application using dedicated hardware, transferring it
through various layers of software dedicated to creating a real-time path, and making it
available to a host application for the purpose of analysis. Analysis includes
determining whether applications meet both timing and logical correctness
requirements.

In this paper we present an end-to-end methodology for developing a multiprocessor
real-time analysis capability built on JTAG scan-based mechanisms. Scan-based
emulation is a pervasive method for debugging, developing and analyzing real-time
applications running on DSPs. The JTAG boundary scan specification permits
multiple devices to be connected in a serial daisy-chained arrangement. The paper
first covers scan-based emulation in detail, which sets the stage for RTA. It then
presents the RTA hardware and software architecture that our methodology relies
upon, including an application that illustrates the necessity of RTA in a
multiprocessor environment.

Scan-Based Emulation

With a traditional emulator, the CPU to be emulated is usually removed from its socket
and replaced with an emulator pod. The emulator pod typically has a replacement CPU,
plus various amounts of random logic and memory to monitor what is happening on
the CPU pins. With modern CPUs such as DSPs, the traditional approach has several
problems. The first problem is the speed of newer DSP chips. Bus cycle times can be
25 nanoseconds or shorter, and all instructions typically execute in a single cycle. This
makes it difficult for a traditional emulator to allow emulation at full speed. The
number of pins to monitor can be staggering, with chips having multiple 32-bit address
and data buses, making a traditional emulator expensive. The second, and more serious,
problem is that DSPs often have on-chip caches, pipelines, memory and peripherals.
Sometimes a whole algorithm can execute without any activity on the CPU pins.

The solution to these problems is scan-based emulation. With scan-based
emulation, the CPU is never removed from the socket; in fact it can be soldered directly
onto the board. Instead, the CPU has a serial scan interface, allowing the emulator to
scan the internals of the device through a standard connector.
The scan-based approach to emulation has many advantages:

• Emulation at full device speed - Since there is no logic needed to monitor
what happens on the CPU bus, the emulator can allow the device to execute
programs at full speed.
• Non-intrusive emulation - Since no logic is attached to any CPU pin, the
CPU bus is not affected at all by the emulation process. The emulator will not
affect the operation of the bus, as is so often the case with traditional
emulators.
• In-circuit emulation - The CPU can be soldered to the board while
emulating. This makes denser packaging possible, and also makes the
emulator a manufacturing test tool.
• Full access to internal memory, caches, pipelines and registers - The complete
state of the processor is visible to the outside through a scan interface.
• Complete access to the system from the CPU - Any peripheral or memory
that the CPU can access in the system can also be accessed through the scan
interface. The emulator can look at the system "through the eyes of the CPU".
This makes it possible to debug and diagnose a system where nothing is
working except the CPU itself.

The JTAG Interface

The Joint Test Action Group (JTAG) defines an interface, called the JTAG interface, for
testing individual devices on printed circuit boards without the need to remove the
devices from the board. This is accomplished by a method called boundary scan, whereby
the state of each pin of each device (with some special logic on the device) is serially
scanned out from the device.

Multiple devices can be daisy chained, and an entire PC board can therefore be scanned
in a single scan chain. The same method can be used to scan out not only the state
of a device's pins but also any internal information from the device, such as
register values and memory locations; as a consequence, scan-based emulation was born. The
JTAG specification does not include the pin out for the JTAG connector. The extension to
JTAG defines a 14 pin, 2 row, 0.1" spacing JTAG connector header, with pin out and
physical dimensions common to all DSPs that support JTAG involved in this
methodology [2]. During JTAG emulation, the emulator supplies the clock that scans the
device. This means that the target clock speed is completely independent of the emulation
clock, and the emulator can support targets running at any clock speed.

The Boundary Scan Mechanism
Device Architecture

The JTAG device architecture is based on the IEEE 1149.1 architecture. This
specification defines four mandatory pins and one optional pin, collectively known as
the Test Access Port (TAP). They are:

• Test Data In (TDI)
• Test Data Out (TDO)
• Test Mode Select (TMS)
• Test Clock (TCK)
• Test Reset (TRST, optional)

A boundary scan cell is attached to each pin of each device that is being scanned;
together these cells form the device's boundary scan register. The architecture further
specifies a finite state machine, the TAP controller, with inputs TMS and TCK. There is
an Instruction Register (IR) holding the current instruction, a bypass register, and an
optional 32-bit identification register for permanent identification.
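
The TAP controller described above can be sketched as a small state table. The 16 state names and their transitions follow the controller defined in IEEE 1149.1; the helper names `step` and `run` are illustrative only and do not belong to any emulator API.

```python
# Sketch of the IEEE 1149.1 TAP controller: the next state depends only on
# the current state and the TMS value sampled on each TCK cycle.
# Each entry maps state -> (next state if TMS=0, next state if TMS=1).
TAP_TRANSITIONS = {
    "Test-Logic-Reset": ("Run-Test/Idle", "Test-Logic-Reset"),
    "Run-Test/Idle":    ("Run-Test/Idle", "Select-DR-Scan"),
    "Select-DR-Scan":   ("Capture-DR", "Select-IR-Scan"),
    "Capture-DR":       ("Shift-DR", "Exit1-DR"),
    "Shift-DR":         ("Shift-DR", "Exit1-DR"),
    "Exit1-DR":         ("Pause-DR", "Update-DR"),
    "Pause-DR":         ("Pause-DR", "Exit2-DR"),
    "Exit2-DR":         ("Shift-DR", "Update-DR"),
    "Update-DR":        ("Run-Test/Idle", "Select-DR-Scan"),
    "Select-IR-Scan":   ("Capture-IR", "Test-Logic-Reset"),
    "Capture-IR":       ("Shift-IR", "Exit1-IR"),
    "Shift-IR":         ("Shift-IR", "Exit1-IR"),
    "Exit1-IR":         ("Pause-IR", "Update-IR"),
    "Pause-IR":         ("Pause-IR", "Exit2-IR"),
    "Exit2-IR":         ("Shift-IR", "Update-IR"),
    "Update-IR":        ("Run-Test/Idle", "Select-DR-Scan"),
}


def step(state, tms):
    """Advance the controller by one TCK cycle."""
    return TAP_TRANSITIONS[state][tms]


def run(state, tms_bits):
    """Apply a sequence of TMS values, one per TCK cycle."""
    for tms in tms_bits:
        state = step(state, tms)
    return state


# Holding TMS high for five clocks reaches Test-Logic-Reset from any state.
for start in TAP_TRANSITIONS:
    assert run(start, [1] * 5) == "Test-Logic-Reset"

# From reset, TMS = 0, 1, 0, 0 walks to the state in which data is shifted.
print(run("Test-Logic-Reset", [0, 1, 0, 0]))
```

Because every device on the chain shares TMS and TCK, all TAP controllers move through these states in lockstep, which is what allows a single emulator to drive the whole chain.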

Principles of Boundary Scan

Boundary scan cells are configured into a parallel-in, parallel-out shift register. Parallel
load operations cause signals from the core logic to be loaded into the output cells. Parallel
unload operations cause the signals to be loaded from the input cells to the core logic. Data
is shifted in serial mode by daisy chaining devices. The figure below shows the TDI of
each device connected to the TDO of the next device in the scan chain. It is possible to
avoid scanning any device by placing it in bypass mode. Typically, the system architect
is responsible for determining the type of arrangement of devices (homogeneous or
heterogeneous), their order in the scan chain and whether they will be placed in
bypass.

[Figure: Loosely Coupled DSP Arrangement (Multiple Boards). Labels: Target Board 1,
Target Board 2, JTAG Header (14 Pin), TDI/TDO connections, JTAG Splitter Card,
To Host Computer.]

Class 543: Debbie Keil & Prithvi Rao 1

Fig 1: Boundary Scan
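
As a rough illustration of the daisy chain and of bypass mode, the sketch below models each device purely as a shift register and clocks bits from the host's TDI side to its TDO side. The class name `Device`, the function `shift_chain` and the scan lengths (8, 16 and 4 bits) are assumptions chosen for the example, not values from any real DSP.

```python
from collections import deque


class Device:
    """One device on the chain, modelled only as a shift register."""

    def __init__(self, name, scan_bits, bypass=False):
        self.name = name
        # In bypass mode a device presents a single-bit register.
        length = 1 if bypass else scan_bits
        self.reg = deque([0] * length)

    def clock(self, tdi):
        """Shift one bit in at TDI and return the bit that falls out at TDO."""
        self.reg.appendleft(tdi)
        return self.reg.pop()


def shift_chain(devices, bits_in):
    """Clock bits through the chain; each device's TDO feeds the next TDI."""
    bits_out = []
    for bit in bits_in:
        for dev in devices:
            bit = dev.clock(bit)
        bits_out.append(bit)
    return bits_out


# Hypothetical chain: two active devices with one bypassed device between them.
chain = [Device("dsp_a", 8), Device("dsp_b", 16, bypass=True), Device("mcu", 4)]
chain_length = sum(len(d.reg) for d in chain)    # 8 + 1 + 4 bits

# A single marker bit emerges after exactly chain_length clocks.
out = shift_chain(chain, [1] + [0] * chain_length)
print(chain_length, out.index(1))
```

With `dsp_b` taken out of bypass the same marker bit would need 28 clocks instead of 13, which is the cost that bypass mode avoids for devices that are not being examined.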

Real-Time Analysis (RTA)

The following application in the domain of high energy physics illustrates the necessity
for RTA in a heterogeneous multiprocessor environment. The Fermilab Tevatron
Collider generates 15 million particle collisions per second. These particle collisions
result in the creation of subatomic particles that travel through a spectrometer. The data
output from the spectrometer is on the order of terabytes per second and must be analyzed
in real time. The analysis engine comprises a massively parallel arrangement of
heterogeneous DSPs and GPPs (general purpose processors). Analysis consists of
applying algorithms that reconstruct and filter the collision data. The result is a select set
of interesting collisions from which physicists can study some of the remaining mysteries
of matter and antimatter in the universe [3].

Analysis of real-time embedded applications is necessary at several points during the
software life cycle: during development, as a means to debug; towards the end of
development, for tuning performance; and after the application is deployed, for
failure analysis. Logic analyzers have been used for many years to clamp onto the
data buses of the target and monitor the data flow of the application in order to
analyze application behavior. Aside from the fact that logic analyzers are expensive
($15K to $60K for a DSP), the increase in system-level integration over the years has
resulted in fewer exposed data paths for the logic analyzer to monitor. Most modern
microprocessors are architected with specialized hardware counters that can be
programmed for the purpose of tracing applications. Traditionally these registers
have been used to guide the design of microarchitectural features such as caches and
TLBs. Whereas these registers can be used to trace the behavior of applications
at a very fine level of granularity, they cannot easily be used as an RTA mechanism.
An ancillary yet significant issue is that analysis requires the user to have
advanced knowledge of the target microarchitecture in order to interpret the data.
Finally, tracing supports data transfer only from target to host and not from host to
target.

Fig 2: Debugger with real-time data exchange

An alternative real-time analysis solution based upon JTAG emulation is presented here.
The hardware and software architecture for a single processor is shown in Fig 3. The JTAG
interface that connects the on-chip emulation logic to the host-based emulator provides
the physical mechanism on which to transport data from the target to the host and vice
versa. The target application is the subject matter to be analyzed; it is the source of data
to be sent to the host and the sink for data received from the host. Therefore, an RTA
target software library exists to bridge the gap from the target application to the on-chip
emulation hardware. Good software engineering practice dictates that an API exist for
this software library. On the host, the data is to be analyzed by a host application. This
host application may also input data to the target application. We must therefore bridge
the gap from the emulator to the host application.

An emulation software driver controls the scanning of data to/from the target via the
emulator. It is the first piece of host software to receive data from the target and the last
piece of host software to handle data heading to the target. An RTA host software library
funnels the data between the emulation software driver and the host application. Again,
an API exists for the RTA host software library. It should be noted that multiple host
applications may be run concurrently.

[Figure: block diagram. Host side: Host App1 ... Host AppM, Host API, RTA Host SW Lib,
Emu SW. The Emulator connects host and target over JTAG. Target side: Emu HW,
RTA Target SW Lib, Target API, Real-Time Target App.]

Fig 3: Single processor RTA based upon JTAG emulation

Data flow in this architecture is bi-directional: data flows from the target application to
the host application for analysis, and data may flow from the host application to the target
application for supplying input parameters. Such input parameters may be used for fine
tuning performance, for supplying test data, etc.

For target-to-host data transfer, there are two distinct parts of the data flow path. The first
part extends from the target application to the RTA host software library. This is the real-
time transportation leg. Since our target application has real-time constraints, data must be
off-loaded from the target to the host at a certain rate. The RTA host software library is
the first piece of software on the host that realizes it has received real-time data for
analysis. (The emulation software is agnostic to what type of data it is scanning.) The
RTA host software can record the data to disk and be done with it, or buffer it internally.
The second part of the target-to-host data flow path extends from the RTA host software
library to the host application. The data is analyzed by the host application. If the data
has been recorded in persistent storage, then the data can be played back at any time. If
the data is not in persistent storage, then it must be analyzed by the host application as it
is produced; that is, the data must be drained from the RTA host software buffers as they
fill.
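
[Figure: the same host and target blocks as in Fig 3 (Host App1 ... Host AppM, RTA Host
SW Lib, Emu SW, Emulator, JTAG, Emu HW, RTA Target SW Lib, Real-Time Target App), with
the span marked "RTA Real-Time Transportation" and the host-to-target direction
marked "Input".]

Fig 4: Data flow in single processor RTA based upon JTAG emulation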

The above RTA architecture for a single embedded processor is easily extended to a
multiprocessor environment, as shown in Fig 6. An RTA target software library must exist
on each embedded target to connect the target application to the emulation logic on that
target. For multi-core architectures, an RTA target software library will exist for each core.
The data from each processor is scanned up to the host via the JTAG interface ring as
described in the Scan-Based Emulation section. On the host, there exists an emulation
software driver corresponding to each target in the system. Each emulation software driver
receives the data from its corresponding target and delivers the data to the single RTA host
software library.

Fig 5: Real-time trace and debug for multiple processing

[Figure: on the host processor, Host App1, Host App2, ..., Host AppM connect to the
Host RTA SW, which connects to one emulation driver per target (Emu SW P1, Emu SW P2,
..., Emu SW Pn). Each target processor P1 ... Pn contains Emu HW, RTA Target SW and a
Real-Time App, and the processors are linked TDI to TDO in a scan chain.]

Fig 6: Multiprocessor RTA via Scan-Based Emulation

This architecture for multiprocessor real-time analysis via scan-based emulation provides
the basis for the methodology.

Methodology

In this section we present an end-to-end methodology that is predicated on support in
hardware and software across several families of DSPs. There is special emulation
hardware architected into the DSP core and emulation drivers as well as RTA target and
host side software that permit the user to perform RTA. Fundamentally, this
methodology involves using a development environment to develop and download a
target application to a DSP. The application running on the DSP interfaces with the RTA
software to send and receive data. The data is scanned out using JTAG boundary scan.
The data is received on the host by the emulation driver that interfaces to the host-side
RTA software. The data is then presented to the host client application for analysis. The
figures of merit used to determine the success of this methodology are performance,
scalability, ease of use and reliability. We discuss these criteria within the scope of both
hardware capabilities and the RTA software architecture in a multiprocessor
environment.

Performance

An important consideration in providing a methodology for multiprocessor RTA is
performance.

Dedicated Emulation Hardware

The performance problem has been partially addressed by dedicating
hardware to scan-based emulation: data is transferred between target and host
through this dedicated emulation hardware. In a heterogeneous
multiprocessor arrangement of DSPs, one complication that arises is that of varying
scan lengths. Each DSP design has its own emulation hardware, so
scan lengths vary within a family of DSPs and between Instruction Set
Architectures (ISAs). Longer scan chains require
greater disassembly time for the scanned data, which lowers throughput and hence
performance. Data can be streamed between target and host by using
peripherals such as DMA and by performing real-time memory write operations. In a
multiprocessor JTAG scan, the JTAG boundary scan bypass instruction
reduces any device set to bypass mode to a single bit in the scan path. This results
in less time to disassemble data being transferred between host and target.

[Figure: the Host App and the Target App are connected by Virtual Path 1,
Virtual Path 2, ..., Virtual Path N.]

Fig 7: Virtual Data Paths

Data Identification

An RTA solution for a multiprocessor environment must be able to identify the processor
from which data originates. This introduces the need to mark the data with a processor
identifier. The decision then becomes where in the system to do this. If we examine the
host, we see that there is a one-to-one correspondence between the emulation software
drivers and the processors in the system. Since there is an emulation software driver for
each target in the system, these drivers can stamp the data with a processor identifier. Note
that from a performance perspective, it is better to mark the data on the host side than on
the target side. If a unique processor identifier were sent down to the target and the data
were tagged there, then more data would be sent from the target to the host,
consuming precious bandwidth. At the processor level, it is possible to allow finer-grained
identification of data: virtual data paths that extend from the target application to the
host application are used to segregate data.
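
A minimal sketch of host-side tagging follows. All names here (`TaggedBlock`, `EmulationDriver`, `on_scan`) are hypothetical, and the assumption that a virtual path number travels with the payload is ours; the paper does not specify the wire format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaggedBlock:
    processor_id: int   # stamped on the host, never sent over JTAG
    path_id: int        # virtual data path (assumed to accompany the payload)
    payload: bytes


class EmulationDriver:
    """Hypothetical per-target driver: one instance per processor on the chain."""

    def __init__(self, processor_id):
        self.processor_id = processor_id

    def on_scan(self, path_id, payload):
        # The processor identifier is implied by which driver received the
        # scan, so attaching it here costs no target-to-host bandwidth.
        return TaggedBlock(self.processor_id, path_id, payload)


# One driver per target, mirroring the one-to-one correspondence in the text.
drivers = {pid: EmulationDriver(pid) for pid in (1, 2, 3)}
block = drivers[2].on_scan(path_id=0, payload=b"\x10\x20")
print(block)
```

The host library can then route each block by `(processor_id, path_id)` without the target application ever knowing its own position in the scan chain.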

Scalability

A key aspect of this methodology is scalability. This issue is addressed in both hardware
and software.

Hardware Scalability

The JTAG specification permits the daisy chaining of hardware. The limit on the
number of devices that can be daisy chained is set by signal strength limitations rather
than by the JTAG specification.

Software Scalability

In software, data from each target is tagged with a unique identifier so that data being
transferred between host and target can be attributed to the processor to which it belongs.
Further, the RTA architecture is software scalable; writing the target application is not
dependent upon the number of processors, and the application does not need to be altered
if processors are added to or removed from the system configuration. There is no requirement
that the target application have any knowledge of the type or number of processors in a
scan chain at the time of development.

Data Selection

The host application should be able to select the processor to which it sends data or from
which it receives data. This is accomplished by incorporating this functionality into the
host API, which proves to be very favorable with respect to scalability. For example, let’s
assume that we have a target application that performs a series of transformations on a
given vector, and then transfers the resulting vector to the host for analysis. Let’s further
assume that there are many vectors that must be transformed and that we choose to deploy
the same target application on as many processors as there are vectors to achieve maximum
computing parallelization. We can design a host application that sends a different
vector to each processor and then collects the resulting vectors from each processor,
respectively, for analysis.
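
The vector example can be sketched as follows. `HostRta`, its `select`, `send` and `receive` methods, and the doubling `transform` are hypothetical stand-ins; the real host API and the target's transformations are not specified in this paper.

```python
class HostRta:
    """Hypothetical host API: select a processor, then send or receive on it."""

    def __init__(self, targets):
        self._targets = targets      # processor id -> callable standing in for a target app
        self._results = {}
        self._selected = None

    def select(self, processor_id):
        if processor_id not in self._targets:
            raise KeyError(f"no processor {processor_id} in the scan chain")
        self._selected = processor_id

    def send(self, vector):
        # A real system would scan the data down to the target; here the
        # stand-in target application runs immediately.
        self._results[self._selected] = self._targets[self._selected](vector)

    def receive(self):
        return self._results.pop(self._selected)


def transform(vector):
    """Stand-in for the target application's series of transformations."""
    return [2 * x for x in vector]


vectors = {1: [1, 2], 2: [3, 4], 3: [5, 6]}
host = HostRta({pid: transform for pid in vectors})

# Send a different vector to each processor ...
for pid, vec in vectors.items():
    host.select(pid)
    host.send(vec)

# ... then collect the resulting vector from each processor in turn.
results = {}
for pid in vectors:
    host.select(pid)
    results[pid] = host.receive()
print(results)
```

Adding a fourth processor only requires another entry in `vectors`; neither the stand-in target application nor the host loops change, which is the scalability property the text describes.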

Ease of Use

Ease of use is an important but often difficult figure of merit to sustain. A software
debugging environment is provided that permits the user to easily configure the
hardware in the system.

[Figure: the host application drives Processor1, Processor2 and Processor3, each running
the same Transform, through the following sequence:
• Select Processor1, send vector1
• Select Processor2, send vector2
• Select Processor3, send vector3
• Select Processor1, get resultant vector
• Select Processor2, get resultant vector
• Select Processor3, get resultant vector]

Fig 8: Processor Selection

Hardware Support

A trend in DSP emulation hardware is to support device registers that are mapped at
fixed addresses. This permits the source code porting of applications. Further, a trend in
more contemporary DSP emulation logic is to replicate the logic on all DSPs. This
further simplifies the deployment of RTA tools.

Software Support

At setup, the user selects the type of target and loads the system with an emulation
driver for that target. The user also specifies the number of targets of each type and
their position in the scan chain. Without this capability users would have to add code in
their host applications that performed the same function, resulting in messy and
unnecessarily complex code. Host-side support is provided by way of object-oriented
interfaces based on the Component Object Model (COM) [4], which is a de facto industry
standard. This permits the host application developer to write client programs that are not
tightly coupled to a specific DSP.

Hardware Reliability

The JTAG specification has long been established as a reliable standard. It has been
adopted and extended. An extensive set of target libraries has been developed for
various flavors of DSPs based on boundary scan. Reliability is achieved through reuse
of the same register set in different versions of emulation hardware across ISAs and
within ISAs.

Software Reliability

The use of unidirectional virtual paths for both target-to-host and host-to-target data
transfers assists in ensuring that there is no data corruption. Further, host applications
synchronize on data buffers connected to virtual paths and so there is no data loss. Buffer
management is precise and is architected to ensure no data loss on both target and
host sides. Another feature of the RTA architecture is congestion control. With this
capability buffers are guaranteed not to overflow. During host-to-target data transfer,
the RTA architecture signals the end of data transfer through a virtual path using
callbacks.
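
One way to picture a unidirectional virtual path with congestion control and an end-of-transfer callback is the sketch below. The `VirtualPath` class and its method names are hypothetical, and refusing a write when the buffer is full is only one possible congestion-control policy; the paper does not describe the actual mechanism.

```python
from collections import deque


class VirtualPath:
    """Hypothetical unidirectional path with a bounded buffer."""

    def __init__(self, capacity, on_complete=None):
        self._buf = deque()
        self._capacity = capacity
        self._on_complete = on_complete

    def try_write(self, block):
        # Congestion control: refuse the block instead of overflowing,
        # so the writer must retry after the reader has drained.
        if len(self._buf) >= self._capacity:
            return False
        self._buf.append(block)
        return True

    def read(self):
        """Return the oldest block, or None if the path is empty."""
        return self._buf.popleft() if self._buf else None

    def close(self):
        # Signal the end of the transfer through a callback.
        if self._on_complete is not None:
            self._on_complete()


events = []
path = VirtualPath(capacity=2, on_complete=lambda: events.append("done"))

accepted = [path.try_write(b) for b in (b"a", b"b", b"c")]  # third write is refused
first = path.read()                                         # draining frees a slot
retried = path.try_write(b"c")                              # the retry now succeeds
path.close()
print(accepted, first, retried, events)
```

Because each path carries data in one direction only, a writer and a reader never compete for the same end of the buffer, which keeps the no-loss argument simple.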

Challenges

There are several challenges in supporting a uniform multiprocessor RTA capability on
various families of DSPs. Each family of DSP has its particular variant of emulation
hardware. This has an impact on the RTA protocol that is used to transfer data between
host and target. For instance, the emulation hardware on some DSPs uses
interrupts to signal the flow of data between host and target. In the absence of emulation
interrupt support, the application must poll the emulation hardware for the presence of
data.
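
The difference between interrupt-driven and polled reception can be sketched as below. `EmulationPort` is a hypothetical model written for this example and does not correspond to the register interface of any particular DSP.

```python
class EmulationPort:
    """Hypothetical model of the receive side of on-chip emulation hardware."""

    def __init__(self, has_interrupt):
        self.has_interrupt = has_interrupt
        self._pending = []
        self._handler = None

    def on_data(self, handler):
        self._handler = handler

    def host_writes(self, word):
        # With interrupt support the handler runs as soon as data arrives;
        # without it the word waits until the application polls.
        if self.has_interrupt and self._handler is not None:
            self._handler(word)
        else:
            self._pending.append(word)

    def poll(self):
        """Return one pending word, or None if nothing has arrived."""
        return self._pending.pop(0) if self._pending else None


received = []

irq_port = EmulationPort(has_interrupt=True)
irq_port.on_data(received.append)
irq_port.host_writes(0x11)            # delivered immediately by the handler

poll_port = EmulationPort(has_interrupt=False)
poll_port.on_data(received.append)    # registered, but this hardware never calls it
poll_port.host_writes(0x22)
word = poll_port.poll()               # the application must ask for the data
if word is not None:
    received.append(word)
print(received)
```

A uniform RTA target library would need to hide this difference behind one API so that the target application is written the same way for both kinds of hardware.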

Conclusion

The RTA methodology presented in this paper is extensively used and widely accepted. Its
use is not limited to the high energy physics problem illustrated here.
Our experiences to date have shown that other domains such as wireless and mobile
computing require the processing of RTA data where both DSPs and microcontrollers are
on the same scan chain. The development of this RTA capability has been predicated on
the JTAG specification and the adherence to this standard in the emulation hardware that
has been designed into the DSP core. The software that has been developed is able to
differentiate between the various DSPs. The virtual paths in the RTA architecture
guarantee data integrity.

References:

[1] JTAG IEEE 1149.1 Specification http://www.ieee.org

[2] Texas Instruments JTAG Extensions http://www.ti.com

[3] Gottschalk, E.E., et al., "The BTeV DAQ and Trigger System – Some Throughput, Usability,
and Fault Tolerance Aspects," Proceedings of the Computing in High Energy and Nuclear
Physics Conference (CHEP 2001), p. 628, Beijing,