
When Zero Picoseconds Edge Placement Accuracy Is Not Enough

John Cheng, Teradyne Inc.

1.0 Abstract

In the last ten years, test equipment suppliers have driven improvements in edge placement accuracy (EPA), taking it from ±225 ps to sub-100 ps through a combination of architectural improvements and new calibration technology. However, with the adoption of high speed source synchronous buses such as HyperTransport and RapidIO on high performance devices, it is no longer sufficient to look only at the tester EPA component in the overall timing budget. Although test system accuracy is still very important, error terms from the DUT must also be considered. The proposed methodology of device strobed comparators addresses both test system and device error terms.

2.0 Background

Microprocessor and PC system designers have long known that one of the major bottlenecks in achieving high system level performance is bandwidth ("It's the bandwidth, stupid", Dick Sites) [1]. Microprocessor core speed has been growing at a faster rate than Moore's Law. But without a high bandwidth bus, the processor is left executing wait cycles while off-chip data is being retrieved. Legacy I/O bus architectures such as PCI provide a maximum bandwidth of 512 MB/sec (PCI-X at 1 GB/sec) [2]. Because PCI is a shared bus architecture, it contains certain inherent limitations.
In a shared bus architecture, the bandwidth of the bus is shared among several devices. Any other device that wants to send or receive data must wait for the arbitrator to complete a previous transaction before beginning another. The total available bandwidth of the bus is limited by the ability of one device to complete a transaction with another device in a given clock period. This means that the clock-to-data valid time on the transmitting device and the setup time in the receiving device must be considered.

Because there are well established software and hardware standards, PCI can be easily implemented and has gained wide acceptance not only in PC applications but also in embedded applications such as networking and communication systems, digital consumer electronics and information appliances.

In embedded systems such as networking and communication systems, system designers are facing a challenge similar to that of their PC counterparts. Data rates are reaching in excess of 1 GHz to comply with standards such as 10 Gigabit Ethernet and OC-192. Multimedia data, communications and compression algorithms, and routing and addressing databases all need to be processed at high speeds [3]. Currently there is the added complexity of the multitude of buses which need to talk to one another, as network processor, control plane processor, switch fabric chip and security processor vendors all have their own proprietary buses. Ideally, there would be an industry standard interconnect that makes it easy to link chips from different vendors while meeting the bandwidth requirement without straining tight pin counts [4].

RapidIO and HyperTransport are both point-to-point, packet switched interconnects. In such a system, all devices are connected through multiple switching devices, forming a common switching fabric network. In systems with more than two devices, any two devices can communicate at the same time with no effect on other devices.
This means transactions can take place concurrently, which increases overall system bandwidth [5]. The RapidIO and HyperTransport protocols transmit a clock along with the data, with both being clocked from the same PLL (synchronized at the transmitter, hence the term source synchronous; Figure 1). This has the advantage of reducing skew between clock and data [6]. In addition, clock periods much less than the flight time can be realized. This allows for more scalable topologies, including the ability to use switches to connect more devices.

Paper 41.2

ITC INTERNATIONAL TEST CONFERENCE

1134

0-7803-7169-0/01 $10.00 © 2001 IEEE

Example of errors at transmitter [7]:
- Clock jitter
- Clock duty cycle variation
- Clock to data skew
- Ground bounce
- Threshold and delay mismatch of device output cells
- Simultaneous switching effects (crosstalk, di/dt, etc.)

Example of errors due to transmission path:
- Etch mismatch
- Clock to data skew

Example of errors at receiver:
- DLL jitter
- Common mode skew
- Clock to data skew
- Edge rate mismatch between clock and data output cells
- Receiver threshold mismatch
- Receiver Vref variation due to on-chip

[Figure 1 diagram: a transmitter driving Clk and Data to a receiver. Source synchronous system: data and clock are transmitted simultaneously, clocked from the same PLL. At the destination, clock and data are received]

Figure 1: Simplified Block Diagram Of Source Synchronous System

The widespread adoption of these buses provides much needed bandwidth, but it also leads to new testing challenges. As speeds increase and data valid windows decrease, reducing the error component in the overall AC timing budget becomes increasingly important (Figure 2). The error components of the DUT can't be allowed to dominate the already small data valid windows at high speeds. To this end, the test environment must mimic the end application of a source synchronous system.

In this paper, issues that impact test as a result of source synchronous design are examined by reviewing three methodologies:

1. Search and strobe
2. Sweeping window
3. Device strobed comparator

A proposal is made as to which methodology offers the best trade-offs.
[Figure 2 diagram: clk-out and data-out waveforms with the data valid window and the DUT/tester error regions, drawn at a data rate of 250 MT/s (bit time = 4 ns) and at 1.6 GT/s (bit time = 625 ps). As data rate increases, the data valid window decreases, which decreases the amount of timing margin available.

Timing margin = Bit time - (DUT error terms + Tester error terms); error terms shown: 300 ps, 150 ps, 100 ps

Timing margin at 250 MT/s: 3450 ps
Timing margin at 1 GT/s:   450 ps
Timing margin at 1.6 GT/s: 75 ps]

Figure 2: Timing Margin In AC Timing Budget
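The arithmetic behind Figure 2 can be checked with a short sketch. It assumes that the margin is the bit time minus the sum of the error terms, which is consistent with the margins quoted in the figure (300 ps + 150 ps + 100 ps = 550 ps of total error). The function name and the lumping of DUT and tester terms into one list are illustrative choices, not part of the original figure.

```python
from typing import Sequence


def timing_margin_ps(rate_mtps: float, error_terms_ps: Sequence[float]) -> float:
    """Return the timing margin in ps for a data rate given in MT/s."""
    bit_time_ps = 1e6 / rate_mtps  # one bit time, converted to picoseconds
    return bit_time_ps - sum(error_terms_ps)


# Error terms read from Figure 2 (assumed here to be simply additive).
FIXED_STROBE_ERRORS_PS = [300, 150, 100]

for rate in (250, 1000, 1600):
    print(rate, "MT/s ->", timing_margin_ps(rate, FIXED_STROBE_ERRORS_PS), "ps")
```

With a fixed 550 ps of error, the margin shrinks from 3450 ps at 250 MT/s to 75 ps at 1.6 GT/s, which is the trend the figure illustrates.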


3.0 Search And Strobe Method

In an application where the DUT's output-to-output timings need to be tested, the traditional method is to program the tester comparator strobe at a fixed location to test for the start of the data valid window. When using a search and strobe technique, the compare strobe for the output clock is first swept through the valid window range to find the position of the clock, and the placement of the strobe for the data output is then calculated from that position (Figure 3). After the strobe position of data-out is programmed, the patterns are burst again.
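The search and strobe flow can be sketched as follows. This is a minimal model, not a tester API: `read_clk` and `read_data` are hypothetical callables standing in for a pattern burst with the compare strobe at a given time, and the DUT timings are invented for illustration.

```python
def search_edge(read_clk, start_ps, stop_ps, step_ps):
    """Sweep the clk-out compare strobe and return the first position (ps)
    at which the clock reads high, or None if no edge is found."""
    t = start_ps
    while t <= stop_ps:
        if read_clk(t):          # one pattern burst per strobe position
            return t
        t += step_ps
    return None


def search_and_strobe(read_clk, read_data, expected_bit, clk_to_data_spec_ps,
                      start_ps=0, stop_ps=2000, step_ps=10):
    """Find clk-out, place the data-out strobe relative to it, then compare."""
    clk_edge_ps = search_edge(read_clk, start_ps, stop_ps, step_ps)
    if clk_edge_ps is None:
        return False
    data_strobe_ps = clk_edge_ps + clk_to_data_spec_ps
    return read_data(data_strobe_ps) == expected_bit  # final burst


# Hypothetical DUT: clk-out rises at 730 ps, data-out is valid 150 ps later.
dut_clk = lambda t_ps: t_ps >= 730
dut_data = lambda t_ps: 1 if t_ps >= 880 else 0

print(search_edge(dut_clk, 0, 2000, 10))             # 730
print(search_and_strobe(dut_clk, dut_data, 1, 200))  # True: strobe at 930 ps
print(search_and_strobe(dut_clk, dut_data, 1, 100))  # False: strobe at 830 ps
```

Note that the clock position is measured once and then reused, so any movement of clk-out between the search and the final burst goes uncorrected.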
3.1 Limitations With Search And Strobe Method

The search and strobe method is effective in that it compensates for the DUT's analog delay errors such as skew errors (e.g. clock skew, common mode skew, etc.). However, this method doesn't compensate for analog drift errors such as PLL drift and low frequency jitter that affect both clock and data. In addition, this method introduces error sources from the test system. The first source of error is in the strobe search of the output clock (clk-out). The amount of error introduced is governed by the compare side EPA. The second source of error is in the strobe for the output data (data-out), which is calculated from the strobe search of clk-out. In this case, the device guardband would typically include the RMS of the EPA of the test system.

In a source synchronous system, timing for clk-out and data-out is derived from the same PLL and transmitted together. The receiving device latches the data based on the output clock of the transmitting device. The critical timing relationship is the relative timing between clk-out and data-out, and less so the absolute position of the pair. Because the relative timing for clk-out/data-out is a tighter spec than the spec for the absolute position of the pair, simply programming a fixed strobe for data-out is insufficient. There will be devices that incorrectly fail because they don't meet the absolute timing specification even though they meet the relative timing.
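The difference between the absolute and the relative check can be illustrated with a small sketch. All timing values and function names below are hypothetical; the point is only that a clk-out/data-out pair that arrives late as a pair can fail a pre-programmed strobe while still meeting the clock-to-data relationship that the receiving device actually depends on.

```python
def passes_fixed_strobe(data_valid_ps, fixed_strobe_ps):
    """Absolute check: data-out must already be valid at a pre-programmed strobe."""
    return data_valid_ps <= fixed_strobe_ps


def passes_relative(clk_edge_ps, data_valid_ps, max_clk_to_data_ps):
    """Relative check: data-out must be valid within the spec after clk-out."""
    return (data_valid_ps - clk_edge_ps) <= max_clk_to_data_ps


# Hypothetical device whose clk-out/data-out pair arrives late as a pair.
clk_edge_ps, data_valid_ps = 900, 1050
print(passes_fixed_strobe(data_valid_ps, fixed_strobe_ps=1000))             # False
print(passes_relative(clk_edge_ps, data_valid_ps, max_clk_to_data_ps=200))  # True
```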

[Figure 3 diagram: clk-out and data-out each feed a tester capture path, and the results go to pass/fail processing. The strobe for data-out is calculated based on the position of clk-out.
Compare event for clk-out: this edge is swept through the valid window range to find the position of the clock.
Compare event for data-out: this edge is determined by the data-valid spec and is programmed relative to the position of clk-out.]

Figure 3: Search And Strobe Method



4.0 Sweeping Window Method


While the search and strobe method compensates for skew and delay errors, analog drift errors still need to be addressed. Analog drift errors due to variations in temperature, process and voltage will cause the position of the output clock and data to vary within each cycle. This means that for every cycle in the burst, the absolute timing position of clk-out and data-out will be different, even though the relative timing may remain unchanged. Because of the variation in the absolute position of clk-out from cycle to cycle, programming the data strobe after an edge search for the output clock is insufficient. Using such a technique will reduce yield, allow escapes, or both. In our experience, there are devices that have these characteristics, especially at speeds above 800 MT/s.

In order to address the cycle to cycle variation of the output clock, a different methodology needs to be used (Figure 4):

1. Program the offset between clk-out and data-out by using the data-valid spec
2. Sweep the clk-out, data-out pair through the valid range
3. Keep track of whether each cycle has ever passed

[Figure 4 diagram: clk-out and data-out waveforms annotated with the steps:
1. Program the offset between clk-out and data-out using data-valid spec
2. Sweep clk-out, data-out pair through the valid range
3. Keep track of whether each cycle has ever passed
4. Repeat for next cycle]

Figure 4: Sweeping Window Method


In order to test every combination of clk-out and data-out, timing sets need to be programmed to sweep the clk-out, data-out timing relationship for the entire cycle. The algorithm would look something like:

Iteration 1: clk-out-min/data-out-min
Iteration 2: clk-out-min/data-out-min + minimum timing resolution
...
Iteration N: clk-out-min/data-out-min + (N-1)*minimum timing resolution
...
Iteration last: clk-out-max/data-out-max
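The iteration scheme above can be sketched in a few lines. This is a simplified model under stated assumptions: each cycle is described by a clock-high interval and a data-valid interval (in ps, relative to the start of the cycle), a compare passes when its strobe lands inside the corresponding interval, and every outer loop iteration stands for one pattern burst. The function name and the example timings are hypothetical.

```python
def sweeping_window(cycles, offset_ps, sweep_min_ps, sweep_max_ps, resolution_ps):
    """Sweep a clk-out/data-out strobe pair with a fixed offset and record,
    per cycle, whether the pair has ever passed.

    cycles: list of (clk_high_start, clk_high_end, data_valid_start, data_valid_end)
    """
    ever_passed = [False] * len(cycles)
    n_steps = int((sweep_max_ps - sweep_min_ps) // resolution_ps) + 1
    for i in range(n_steps):                      # one burst per iteration
        clk_strobe = sweep_min_ps + i * resolution_ps
        data_strobe = clk_strobe + offset_ps      # offset from the data-valid spec
        for c, (ck0, ck1, d0, d1) in enumerate(cycles):
            if ck0 <= clk_strobe <= ck1 and d0 <= data_strobe <= d1:
                ever_passed[c] = True
    return ever_passed


# Cycle 1 drifts by 60 ps as a pair; cycle 2 has data-out late relative to clk-out.
cycles = [
    (100, 400, 250, 550),
    (160, 460, 310, 610),
    (100, 400, 600, 900),
]
print(sweeping_window(cycles, offset_ps=150, sweep_min_ps=0,
                      sweep_max_ps=500, resolution_ps=10))  # [True, True, False]
```

The drifting cycle still passes because the pair is swept together, while the cycle whose relative timing is out of spec never passes at any sweep position. The cost is one burst per sweep position, which is what the burst time optimizations aim to reduce.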


4.1 Burst Time Optimization

All combinations of timing values are used to sweep through the entire cycle. If we can reduce the number of combinations needed, burst time clearly goes down. By assuming that each vector is only valid over a range of time values, the number of timing value sets goes down. In addition, we can assume that, per device, the range of the absolute position of the clk-out and data-out values will be smaller than the range possible for all the devices. By assuming that this range can be determined per device, the timing value combinations can be further reduced. Finally, we can assume that some patterns have greater variations in clk-out than others. By characterizing patterns to find the ones with the greatest variations and the smallest minimum and maximum values, timing value combinations can be reduced.

4.2 Applications Using Sweeping Window Method

Based on the work that we have done with devices ranging from microprocessors to chipsets/graphics and communications devices, different types of devices will have different amounts of output jitter (e.g. the output jitter of a microprocessor will be more than that of a chipset). If the amount of output jitter from the device is minimal (as a percentage of the period), the sweeping window method is sufficient. When the output jitter of the device becomes a significant portion of the period, it is critical that tester error sources be kept at a minimum.

The ideal methodology would minimize the effects of the DUT and tester error sources while capturing and processing the data in a reasonable amount of time. In order to accomplish this, the ATE capture environment must mimic a receiving device in a source synchronous system. In such a system, the receiving device latches its data based on the forwarded clock of the transmitter (Figure 5). For the tester to mimic the end application, the tester comparator must also latch data based on the forwarded clock from the DUT (transmitter), with the resulting value stored in a capture RAM.

[Figure 5 diagram: two panels. Source Synchronous System: a transmitting device sends data-out and clk-out to a receiving device. ATE Environment: the DUT sends data-out and clk-out to the ATE, and the captured result goes to the capture RAM.]

Figure 5: Device Strobed Comparator Method


5.0 Device Strobed Comparator Methodology

The key motivation for adopting a source synchronous capture methodology is to reduce or eliminate error components that would be present in a standard fixed strobe environment. In a source synchronous system, the clock and data are derived from the same PLL at the source (i.e. the transmitting device). Since the receiving device latches data with the forwarded clock, any error components on the output clock and data can be ignored as long as the relative timing between clock and data is valid.

In a fixed strobe system, any errors that affect the clock and data equally, including ground bounce, clock jitter, clock duty cycle variation, simultaneous switching effects (crosstalk, di/dt, etc.) and analog delays due to temperature or voltage supply changes, have to be added to the error budget. The magnitude of these errors can be the largest single component of the error budget. In addition, because these errors are dynamic, they can prove very difficult to compensate for if the location of the strobe has to be programmed prior to the pattern burst.

In a device strobed comparator ATE system, however, the data is latched by the DUT's output clock. This removes any error terms that affect the clock and data equally, as described above. After data is captured, it needs to be processed. This is accomplished by examining the capture RAM and comparing it to a programmed vector set, similar to a standard functional test pattern (Figure 6).
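The cancellation of common error terms can be illustrated with a small simulation. It is a sketch under assumed numbers, not a model of any particular tester: each cycle receives a random shift that moves clk-out and data-out equally, and the data is latched either at a strobe fixed before the burst or at a strobe that rides on clk-out. The function name, the uniform jitter distribution and all timing values are illustrative assumptions.

```python
import random


def count_failures(n_cycles, common_jitter_ps, strobe_ps, clk_to_data_ps,
                   valid_width_ps, device_strobed, seed=0):
    """Count cycles whose latch time misses the data valid window."""
    rng = random.Random(seed)
    failures = 0
    for _ in range(n_cycles):
        # Common-mode error: shifts clock and data by the same amount.
        shift = rng.uniform(-common_jitter_ps, common_jitter_ps)
        clk_edge = shift
        data_start = shift + clk_to_data_ps
        data_end = data_start + valid_width_ps
        if device_strobed:
            latch_time = clk_edge + strobe_ps   # strobe follows clk-out
        else:
            latch_time = strobe_ps              # strobe programmed before the burst
        if not (data_start <= latch_time <= data_end):
            failures += 1
    return failures


fixed = count_failures(1000, 300, 300, 100, 400, device_strobed=False)
strobed = count_failures(1000, 300, 300, 100, 400, device_strobed=True)
print("fixed strobe failures:", fixed)
print("device strobed failures:", strobed)  # 0: the common shift cancels
```

With the strobe derived from clk-out, the shift appears on both sides of the comparison and drops out, so only errors that affect clock and data differently remain in the budget.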

[Figure 6 diagram: clk-out and data-out waveforms at a data rate of 1.6 GT/s (bit time = 625 ps), showing the data valid window and the DUT/tester error regions for a fixed strobe system and for a device strobed comparator system. As data rate increases, the data valid window decreases, which decreases the amount of timing margin available.

Timing margin = Bit time - (DUT error terms + Tester error terms); error terms shown: 30 ps, 100 ps, 10 ps

                            Fixed Strobe   Device Strobed Comparator
Timing margin at 250 MT/s:  3450 ps        3860 ps
Timing margin at 1 GT/s:    450 ps         860 ps
Timing margin at 1.6 GT/s:  75 ps          485 ps

In a device strobed comparator ATE environment, DUT and tester errors are essentially eliminated.]

Figure 6: Timing Margin In Source Synchronous ATE System


In a clock forwarded scheme such as DDR, the DQS signal becomes active with valid data [8]. This DQS signal (analogous to clk-out in Figure 7) would be used to trigger the comparator. Any clock signal may be used to trigger the comparator as long as the data being captured is aligned to the cycle of the expect data. In the DDR protocol, all of the captured data is relevant and needs to be processed for pass/fail information. However, in protocols that have a free running clock, this is not necessarily true.

If the free running clock triggers the data capture, there needs to be a way to determine when a valid data word begins. One way to address this is to have the tester generate a signal that corresponds to the beginning of the valid data. This reintroduces the analog delay error component, because the tester will generate this signal based on when the data should be valid, but the data out of the DUT is subject to the analog delay effects on the signal path. Therefore, there is going to be a shift between when the tester starts capturing data and when relevant data appears on the bus. In order to process pass/fail results, this shift needs to be known. Because this shift is a function of the DUT's analog delays, it is not trivial to determine what this shift is.

[Figure 7 diagram: DUT-reset, clk-out and data-out waveforms with the tester capture and the resulting RAM contents. Clk-out triggers the tester to start capture.]

Figure 7: Data Capture Using Device Strobed Comparator Method


In order to accurately align the compare values with the captured data, there needs to be a method to determine where the first valid data bit is stored in the RAM. This requires the tester to take one of three paths to account for the analog delays of the various data paths.

The first would be to determine the amount of delay on each bus and then program don't care cycles in the pattern to account for it. However, since the amount of analog delay can change as a function of speed, this would require the generation of several patterns, each with a different number of don't care vectors, for each test. This quickly increases the amount of vector memory needed to store all of these redundant patterns.

The second is to search the first few bits of the output data, using a match routine, to determine which bit in the stored array corresponds to the start of the data, and then to make pass/fail determinations. Depending on the complexity of the algorithm used, this may require a significant amount of time to find a match and begin the compare.

A third option is to once again use one of the DUT pins to tell the tester when data becomes valid. When two source synchronous devices are talking to each other, they send a bit of data on a separate control line that signifies the beginning of a valid data word. A test platform can use this bit to trigger the capture of data allowing it to store only the valid data bits. Since the sync pulse will shift with the data, the latency delay is removed. This method eliminates the need to perform search routines or load additional patterns into the tester.
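The second and third options can be sketched as follows. This is an illustrative model of the capture RAM as a Python list of bits; the function names, the preamble length and the example data are assumptions, not part of any tester's software.

```python
def find_data_start(captured, preamble):
    """Match routine: return the index in the capture RAM where the expected
    preamble first appears, or None if it is never found."""
    n = len(preamble)
    for i in range(len(captured) - n + 1):
        if captured[i:i + n] == preamble:
            return i
    return None


def compare_from(captured, expected, start):
    """Compare the expected data against the capture from the aligned index."""
    return captured[start:start + len(expected)] == expected


def capture_with_sync(data_bits, sync_bits):
    """Sync-pulse option: keep only the bits captured from the first asserted
    sync bit onward, so no search is needed afterwards."""
    if 1 not in sync_bits:
        return []
    return data_bits[sync_bits.index(1):]


# Three cycles of latency precede the valid data in this hypothetical capture.
captured = [0, 0, 0, 1, 0, 1, 1, 0, 0, 1]
expected = [1, 0, 1, 1, 0, 0, 1]

start = find_data_start(captured, expected[:4])
print(start)                                     # 3
print(compare_from(captured, expected, start))   # True

sync = [0, 0, 0, 1, 0, 0, 0, 0, 0, 0]
print(capture_with_sync(captured, sync) == expected)  # True
```

A short preamble can also match by chance inside the latency cycles, which is one reason the match routine may need a longer search than this sketch suggests. The sync-pulse path avoids the search entirely because the alignment information travels with the data.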


Figure 8 shows experimental results on a bench fixture. An oscilloscope, acting as a receiver, is triggered by a clock that is independent of the data at the source (i.e. the tester) in one case while in the other case, the trigger clock is synchronized with the data. At a data rate of over 600Mbps, there was significant improvement in the amount of timing margin available (i.e. the size of the data eye).

Figure 8: Capture Triggered by asynchronous and synchronous


6.0 Conclusion

The key benefit of a source synchronous system is that a receiving device latches data based on the forwarded clock of the transmitting device. As long as the relative timing between the output clock and data is met, the receiver is able to latch in the correct data. This architecture inherently provides additional timing margin to the overall system. In a test environment, it is essential that the tester comparator operates in the same manner as the receiving device in the source synchronous system, or else the benefit of having additional timing margin will not be realized. A device strobed comparator system best mimics the end application of a source synchronous system. Depending on the output jitter characteristics of the device, it is possible that, without such a system, devices that would work properly in the end application will fail at test.

The author would like to thank the following individuals for their contribution: Calvin Cheung, Dumith Desilva, Greg Hilliard, Scott Schaber, Jason Sturm

7.0 References

[1] Gwennap, Linley. Digital 21264 Sets New Standard, 1996, Microprocessor Report
[2] Nwaekwe, Laverty, Chowdhury, Syeed. PCI-X Boosts Bus Bandwidth to 1 Gbps, 2000, EDN
[3] The Lightning Data Transport I/O Bus Architecture, 2000, API Networks
[4] Glaskowsky, Peter N. RapidIO Expands Narrow-Bus Options, 2000, Microprocessor Report
[5] Powell, Courtney. Internetworking Equipment Design: RapidIO Renders Overall Performance, 2000, EE Times
[6] Bouvier, Dan. Example AC Timing Budget, 2000, RapidIO Trade Association
[7] Haller, Robert. The Nuts And Bolts Of Signal-Integrity Analysis, 3/16/00, EDN
[8] Double Data Rate (DDR) Specifications, 2000, JEDEC Solid State Technology Association
