by
Danny Yoo
© Copyright 2018 by Danny Yoo
Abstract
Danny Yoo
Master of Applied Science
The Edward S. Rogers Sr. Department of Electrical and Computer Engineering
University of Toronto
2018
This thesis presents an adaptive baud-rate CDR with CTLE and 1-tap DFE. The novelty
in this design is the adaptation engine tailored for baud-rate clock and data recovery
where the comparators for the DFE and the PD are shared to save power. A testchip was
fabricated in TSMC 28nm CMOS. The adaptation engine is demonstrated for 34-36Gb/s
operation with a Tyco 5” channel resulting in 15.05-18.25dB channel losses. At 35Gb/s,
the total power consumption is measured to be 106.3mW or a FOM of 3.04pJ/bit.
This thesis also presents a 2x half-baud-rate clock and data recovery technique with 2x
oversampling at half-baud-rate (every other UI). A testchip was also fabricated in TSMC
28nm CMOS. A 30Gb/s 2x half-baud-rate CDR was tested with a Tyco 5” channel with
13.06dB of loss. The total power consumption is measured to be 79.2mW or a FOM of
2.64pJ/bit.
Acknowledgements
I would like to sincerely thank my supervisor, Professor Ali Sheikholeslami for providing
me the opportunity to conduct research in the area of high-speed wireline circuits. Pro-
fessor Sheikholeslami has supported me throughout every step of my tapeout, which is
fabricated in a leading-edge advanced process technology.
I thank Professor David Johns, Professor Tony Chan Carusone and Professor Joyce
Poon for serving on my thesis examination committee. Their insightful comments and
recommendations were an invaluable addition to this thesis.
I am thankful for the support and design review provided by Fujitsu’s staff, espe-
cially, Hirotaka Tamura, Takayuki Shibasaki and Junji Ogawa. Special thanks to Wahid
Rahman and Joshua Liang for guidance throughout my MASc research and Mohammad
Tabrizi for layout and measurement support. I would also like to thank Nikola Nedovic
for his visit to help set up the digital synthesis flow back in 2015, which still had an
impact on my 2017 tapeout.
My gratitude goes out to Jaro Pristupa and the MOSIS support team for CAD and technical
support. I would also like to acknowledge Professor Antonio Liscidini, Professor
Sorin Voinigescu and CMC for test equipment rental, as I could not have finished my
testchip measurements without them.
Finally, I would like to send my deepest thanks to my parents and my brother for their
unconditional love and support.
Contents
Acknowledgements iii
Table of Contents iv
List of Abbreviations x
1 Introduction 1
1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 An Adaptive Baud-Rate CDR . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.3 A 2x Half-Baud-Rate CDR . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.4 Thesis Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
2 Background 3
2.1 Overview of Baud-Rate PD . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
2.2 Pattern-based Baud-Rate Scheme . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
2.2.1 Pattern Detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
2.2.2 Optimal Sampling Point . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
2.3 Why Adaptation Engine? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.3.1 Challenges . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.3.2 CTLE Adaptation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.3.3 Comparator Level Adaptation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
4 Circuit Simulations and Measurement Results 32
4.1 Analog Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
4.1.1 Closed-loop CDR Simulations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
4.2 Digital Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
4.3 Lab Measurements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
4.3.1 Testchip . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
4.3.2 Test Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
4.3.3 Measurement Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
7 Conclusion 86
7.1 Thesis Contribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
7.2 Future Works . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
7.2.1 Improvements for an Adaptive Baud-Rate CDR . . . . . . . . . . . . . . . . . . . . 87
7.2.2 Improvements for a 2x Half-Baud-Rate CDR . . . . . . . . . . . . . . . . . . . . . 87
Bibliography 89
Appendices 94
A Ancillary 95
A.1 Portlist for Synthesized Digital . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
A.2 Output Pad MUX Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
List of Figures
3.1 Full-rate system level block diagram of CDR and the proposed adaptation engine . . . . . 12
3.2 Block diagram of proposed data level loop . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
3.3 Example of dLev converging . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
3.4 Example of data level (dLev) filtered for 111 and 011 pattern . . . . . . . . . . . . . . . . 14
3.5 Diagram illustrating the goal of on-chip adaptation . . . . . . . . . . . . . . . . . . . . . . 14
3.6 Diagram illustrating the optimal PD level . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
3.7 Diagram of eye opening for comparator optimized for DFE and PD respectively . . . . . . 16
3.8 Flow Diagram of Proposed Adaptation Engine . . . . . . . . . . . . . . . . . . . . . . . . 17
3.9 Schematic of a CTLE stage with tunable Cs . . . . . . . . . . . . . . . . . . . . . . . . . . 18
3.10 CTLE transfer function across Cs settings simulated in MATLAB Simulink . . . . . . . . 19
3.11 CTLE adaptation using line thickness . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
3.12 Block diagram of spectrum balancing [17] . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
3.13 CTLE transfer function showing 0011 pattern and its neighboring patterns for three dif-
ferent CTLE settings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
3.14 Visual Example of CTLE adaptation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
3.15 Visual example of theory behind the proposed algorithm for finding optimal PD level . . . 24
3.16 Visual example of finding Vamp . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
3.17 Visual example of why Vamp = dLev(011)max . . . . . . . . . . . . . . . . . . . . . . . . . 25
3.18 Adaptation where line thickness guides CTLE adaptation of Cs (top right) and optimal
sampling phase deduced from the slew rate guides adaptation of comparator level . . . . 26
3.19 Proposed schematic of quarter-rate baud-rate receiver. Digital adaptation is completed
in digital back-end and the rest of the CDR is done in analog front-end . . . . . . . . . . 27
3.20 Plot of channel characteristic of various channels imported and converted to rational
system model in MATLAB . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
3.21 Adaptation vs time where line thickness guides CTLE adaptation and optimal PD level
guides α level adaptation. Tyco 5” channel at 36 Gb/s . . . . . . . . . . . . . . . . . . . . 29
3.22 Step response and pulse response of channel + CTLE . . . . . . . . . . . . . . . . . . . . 30
3.23 Jitter tolerance simulated for the converged adaptive setting of event-driven model (BER
< $10^{-6}$) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
5.1 Schematic of Alexander 2x-oversampled bang-bang PD and its basic operation . . . . . . 56
5.2 Visual example of the lock point of Mueller-Muller PD [25] . . . . . . . . . . . . . . . . . 57
5.3 Half baud-rate data sampling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
5.4 Sub baud-rate data recovery by exploiting ISI . . . . . . . . . . . . . . . . . . . . . . . . . 59
5.5 Eye diagram example of sub baud-rate data recovery by exploiting ISI. Green arrows on
the left show theoretical maximum horizontal eye opening of 0.5UI. Green arrows on the
right show the small vertical eye opening margins . . . . . . . . . . . . . . . . . . . . . . . 59
5.6 Sub baud-rate (0.5x-sampled) data and clock recovery in comparison to Mueller-Muller
CDR . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
5.7 Full-rate block diagram of the proposed 2x half-baud-rate CDR architecture . . . . . . . . 61
5.8 Eye diagram of the proposed 2x half-baud-rate scheme . . . . . . . . . . . . . . . . . . . . 62
5.9 Proposed 2x half-baud-rate PD compared to the conventional baud-rate Mueller-Muller PD 63
5.10 Proposed quarter-rate implementation of 2x half-baud-rate CDR. Proposed 2x half-baud-
rate PD and the data decoder are simple custom high-speed digital logic gates . . . . . . 64
5.11 Jitter tolerance simulated for event-driven model of 2x half-baud-rate CDR (BER < $10^{-6}$) 65
5.12 Simulated results of closed loop CDR at 32 Gb/s PRBS31 input in Cadence: Error count 67
5.13 Simulated results of closed loop CDR at 32 Gb/s PRBS31 input in Cadence: Recovered
clock frequency . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
5.14 Simulated results of closed loop CDR at 32 Gb/s PRBS31 input in Cadence: Vctrl of VCO 67
5.15 Open-cavity QFN under a microscope showing wire bond connections for the proposed
2x half-baud-rate CDR . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
5.16 Package Pinout for D2: Non-uniform baud-rate CDR with CTLE . . . . . . . . . . . . . . 69
5.17 Die photo in TSMC 28nm HPC process for the proposed 2x half-baud-rate CDR. Dimen-
sions of each building block is listed in a table. . . . . . . . . . . . . . . . . . . . . . . . . 70
5.18 High-speed testboard for design 2: non-uniform baud-rate CDR testchip. Testboard is
programmed and controlled by Arduino Mega2560 + PC . . . . . . . . . . . . . . . . . . . 71
5.19 Measurement setup for testing 2x half-baud-rate CDR . . . . . . . . . . . . . . . . . . . . 72
5.20 Measured S21 insertion loss of Tyco 5” channel with 36 cables and bias tees using a VNA.
Bias tees are required for setting the input common-mode as the SHF 12104A PRBS bit
pattern generator cannot set voltage offset to set the common-mode. . . . . . . . . . . . . 73
5.21 Measured clock spectrum of an open-loop CDR . . . . . . . . . . . . . . . . . . . . . . . . 75
5.22 Measured clock spectrum and phase noise for locked CDR at 30 Gb/s . . . . . . . . . . . 76
5.23 Measured clock spectrum and phase noise for locked CDR at 30 Gb/s . . . . . . . . . . . 77
5.24 Measured jitter tolerance with sinusoidal jitter injected at 30 Gb/s PRBS31 & 7 . . . . . 78
5.25 Power breakdown of 2x half-baud-rate CDR testchip . . . . . . . . . . . . . . . . . . . . . 78
5.26 Performance comparison to recently published baud-rate CDRs . . . . . . . . . . . . . . . 79
6.1 Layout of the full-chip die with two CDRs on top & bottom . . . . . . . . . . . . . . . . . 81
6.2 Layout of the full-chip die showing top aluminum layer for power distribution . . . . . . . 82
List of Abbreviations
CM Common-Mode
CP Charge Pump
CT Continuous-Time
DEMUX Demultiplexer
LF Loop Filter
MUX Multiplexer
P & R Place-and-Route
PD Phase Detector
PI Phase Interpolator
PM Phase Margin
PN Phase Noise
RF Radio Frequency
SJ Sinusoidal Jitter
UI Unit Interval
Chapter 1
Introduction
1.1 Motivation
In today’s society, digital data is ubiquitous. To get the most out of the data that sur-
rounds us in this digital age, sufficient processing speed and data-rate are imperative.
As the demand for speed is greater than ever before, development of wireline circuits
and SERDES is critical. However, an increase in data-rate is usually accompanied by
an increase in power consumption, which leads to environmental concerns. The current
scenario urgently calls for new low-power architectures and circuit techniques that will
improve speed and data-rate, while addressing the energy concerns. This thesis intro-
duces two unique research topics based on baud-rate clock and data recovery which aim
to solve the power consumption problem in high-speed wireline I/O links.
In an attempt to further save power, the baud-rate clock and data recovery circuit (CDR)
published in ISSCC 2016 [35] shares the front-end comparators between the decision
feedback equalizer (DFE) and the phase detector (PD). In the following year, a frequency
detector (FD) based on the same baud-rate scheme was published in ISSCC 2017 [28].
However, these two CDRs lacked an adaptation engine where the CDR settings could
be autonomously adapted to the optimal lock point for various channels. Furthermore,
manual tuning of these CDRs is difficult and tedious as will be discussed in this thesis.
As a result, this thesis will present the proposed adaptive baud-rate CDR with CTLE
and 1-tap DFE. The novelty in this design is the adaptation engine tailored for baud-rate
clock and data recovery where comparators for the DFE and the PD are shared to save
power.
In this thesis, there are two stand-alone baud-rate CDRs taped out in TSMC 28nm
technology. To address both chip designs, this thesis will be organized as follows. Chapters
2, 3 and 4 cover the first design, which is the proposed adaptive baud-rate CDR. First,
Chapter 2 will cover the background. Second, Chapter 3 will delve into the proposed
adaptation engine. Lastly, Chapter 4 will present circuit simulations and measurement
results. The second design which is the proposed 2x half-baud-rate CDR will be presented
in Chapter 5. Chip design methodology of how the two separate chip designs were taped
out on time for the same fabrication shuttle will follow in Chapter 6. The final chapter
(Chapter 7) will be the conclusion which summarizes the two contributions of this thesis.
Chapter 2
Background
The background to the proposed adaptive baud-rate CDR with CTLE and 1-tap DFE
fabricated in TSMC 28nm technology will be covered in this chapter. The testchip was
an analog mixed-signal design with a synthesized digital block incorporated via a place & route
tool. First, an overview of the baud-rate PD scheme will be presented in the following section.
In recent years, baud-rate clock and data recovery has gained prominence over classical
oversampling CDRs such as the Alexander bang-bang PD (BBPD) [4], which require multiple
clock phases to perform phase detection and clock and data recovery. Details and
background of the Alexander BBPD can be found in Section 5.1.1, which serves
as background to the proposed 2x half-baud-rate CDR. A baud-rate CDR samples the data
only once per UI; it therefore requires fewer clock phases, which reduces
the power consumption in the clock distribution network [9]. There are many different
baud-rate PD schemes, such as integrating-based [8], Mueller-Muller [24], and MMSE
(minimum mean squared error) [26] PDs. The Mueller-Muller PD, which is a
prevalent baud-rate scheme, will be discussed in detail in Section 5.1.2. Despite the wide
range of baud-rate schemes, this thesis will focus on 1) the pattern-based PD presented in
ISSCC 2016 [35], on which the proposed adaptive baud-rate CDR is based, and 2) the proposed
2x half-baud-rate PD, which will be presented in Chapter 5.
The proposed adaptive CDR is based on a baud-rate CDR from ISSCC 2016 shown
in Figure 2.1 [34, 35]. Its novel analog front-end and the phase detector (PD) will be
discussed first.
In order to save power, the 1-tap DFE and the PD share the same comparators in
the analog front-end. Figure 2.2 illustrates the advantages of combining the front-end
comparators. First, the number of comparators is two-thirds of that in the conventional data
and edge sampling bang-bang PD. Essentially, this scheme detects phase (clock
information) from the 1-tap speculative DFE data. In addition, the number of clock phases
that need to be routed is half that of the conventional BBPD scheme. As a result,
the PLL and clock distribution power is also halved. The comparator levels are set to +/-α to
cancel out the 1st post-cursor ISI (inter-symbol interference) left after the CTLE’s equalization.
At the same time, this α level is also used as the locking point for the phase detector. In
Figure 2.2, +α is labelled as DH and -α is labelled as DL.
Shibasaki’s baud-rate PD detects slow-rising and slow-falling patterns to decide
whether the clock is late or early. It looks at 3 consecutive samples at a time,
$(S_{n-1}, S_n, S_{n+1})$, to filter out a specific pattern. For example, the 011 pattern is used for
detecting a rising waveform and the 100 pattern is used for detecting a falling waveform. Figure
2.3 demonstrates rising waveform detection with the +α comparator level setting the CDR’s
lock point. In terms of data recovery, the eye opening available to the 1-tap speculative DFE
for a rising waveform is shown in blue in Figure 2.4. Although Shibasaki’s PD only looks
at 3 consecutive samples at a time, for a valid 011 rising waveform pattern satisfying
the PD logic table in Figure 2.5, a 001 pattern must be present prior to sample n. Essentially,
when Shibasaki’s PD detects 011 for a rising pattern and 100 for a falling pattern,
it is actually detecting the 0011 and 1100 patterns respectively. In other words, a 2UI pulse
pattern is used for pattern detection in order to recover timing information.
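The 3-sample pattern filter described above can be illustrated with a minimal Python sketch. It only shows how 011 and 100 windows might be located in a recovered bit stream, and how the extra preceding bit turns them into the 0011 and 1100 2UI pulse patterns. The function names and the example bit stream are hypothetical, and the early/late decision logic of the PD table in Figure 2.5 is not modelled.

```python
def find_pattern(bits, pattern):
    """Return every index n where (bits[n-1], bits[n], bits[n+1]) equals pattern."""
    hits = []
    for n in range(1, len(bits) - 1):
        if (bits[n - 1], bits[n], bits[n + 1]) == tuple(pattern):
            hits.append(n)
    return hits


def keep_2ui_pulses(bits, hits, preceding_bit):
    """Keep only hits whose sample n-2 equals preceding_bit (0011 or 1100)."""
    return [n for n in hits if n >= 2 and bits[n - 2] == preceding_bit]


# Hypothetical recovered bit stream, used for illustration only.
bits = [1, 0, 1, 1, 0, 0, 1, 1, 1, 0, 0, 1]

rising = find_pattern(bits, (0, 1, 1))          # candidate rising waveforms
falling = find_pattern(bits, (1, 0, 0))         # candidate falling waveforms
rising_2ui = keep_2ui_pulses(bits, rising, 0)   # 0011 patterns
falling_2ui = keep_2ui_pulses(bits, falling, 1) # 1100 patterns
```

In this toy stream the 011 window at n = 2 is preceded by a 1, so it is a 1011 sequence and is dropped by the 2UI filter.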
The CTLE’s boost changes the data slope of the 011 and 100 patterns that the PD detects.
The CTLE should be adjusted so that the comparator assigned to phase detection
produces 0 and 1 with equal probability at the “optimal sampling point”.
To illustrate the optimal sampling point, consider the scenario in which the
previous bit is a 1. The eye opening available to a data sampler when the previously
detected bit is a 1 is shown in red at the top left of Figure 2.5. The optimal sampling position
is 1/2 UI from the convergence point of the low-rate pattern (i.e. the 100 sequence) with the +α
(DH) level, such that the data decision is made at the x-axis center of the eye opening. At
this point, the 100 pattern should produce an equal chance of early/late decisions at the second
comparator, set at -α (DL), which is used for the PD.
Figure 2.4: Eye opening of 1-tap speculative DFE for Shibasaki’s baud-rate PD
An autonomous adaptation scheme that can adapt to the best CDR settings dynamically
on-chip is imperative because the CDR in each lane cannot be tuned manually in a
real-world product. In the following, we outline the challenges in implementing an adaptation
engine.
2.3.1 Challenges
The challenge of forming an adaptation scheme for the pattern-based PD used in this
CDR is that there exist two tuning knobs: the CTLE setting and the comparator level
(+/-α). With two tuning knobs, finding the optimal setting is hard even for a known,
fixed channel, and harder still for various unknown channels.
Furthermore, the fact that these two tuning knobs are correlated complicates the
adaptive scheme. First, changing the CTLE setting changes the data slope, which
in turn changes the PD gain and the +/-α required for an optimal lock point with
the maximum timing margin. Second, changing the CTLE setting also changes the +/-α
needed for the 1-tap DFE by affecting the amount of post-cursor ISI remaining. For
example, more equalization means a smaller α is needed and less equalization means
a larger α is needed. As a result, when the CTLE setting and/or the α level changes, the
optimal sampling point changes, which makes manual tuning difficult. Even a slight shift
in the comparator level α reduces the eye opening, as shown in Figure 2.6. A reduced
eye opening is detrimental to the robustness of the CDR system, as the noise and jitter
margins are significantly reduced. Therefore, the goal of adaptation is to find the CTLE setting
and the comparator level (+/-α) that give the optimal jitter tolerance.
The ultimate goal of an inductor-less CTLE is not to fully compensate and equalize for
the channel loss. The 1-tap DFE present in the CDR design would not be required in
that scenario, making the DFE wasteful in terms of the CDR’s power budget.
We want $f_{3dB}$ of the CTLE output to be at $f_{baud}/3 \sim f_{baud}/4$ for ideal PD operation.
This means that at the output of the CTLE, a 2UI pulse should reach full swing
in an ideal scenario. In other words, all residual ISI ($\sum_{i=2}^{\infty} \alpha_i$) other than the first
post-cursor should be minimized, for a system with a pulse response of
$(1 + \alpha_1 D^1 + \alpha_2 D^2 + \alpha_3 D^3 + \cdots)$, as shown in Figure 2.7. The one remaining significant
post-cursor ISI can be canceled out by the 1-tap DFE, which is responsible for equalizing the
content at $f_{baud}/2$ (the Nyquist frequency).
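As a small numerical illustration of the residual ISI term, the sketch below normalizes a baud-spaced pulse response to its cursor and separates the first post-cursor from the remaining terms. The pulse response values are invented for illustration. Summing magnitudes rather than signed values is an assumption made here so that terms of opposite sign do not cancel.

```python
def isi_terms(pulse_response):
    """Split a baud-spaced pulse response into (alpha_1, residual ISI).

    pulse_response[0] is the cursor; later entries are the post-cursors.
    The residual is the sum of |alpha_i| for i >= 2 (assumption: magnitudes).
    """
    cursor = pulse_response[0]
    alphas = [p / cursor for p in pulse_response[1:]]
    alpha_1 = alphas[0] if alphas else 0.0
    residual = sum(abs(a) for a in alphas[1:])
    return alpha_1, residual


# Hypothetical pulse responses at the CTLE output.
under_equalized = [1.0, 0.45, 0.20, 0.08]
well_equalized = [1.0, 0.30, 0.02, 0.01]

alpha_under, residual_under = isi_terms(under_equalized)
alpha_well, residual_well = isi_terms(well_equalized)
```

In the second case almost all of the ISI sits in the first post-cursor, which is the situation the 1-tap DFE can handle.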
Figure 2.6: Sub-optimal α levels illustrating reduced eye opening for noise and jitter margin
Since the front-end comparators are shared by the DFE and the PD, they can only be
optimized for one of the two. If our primary goal is to adapt the comparator levels
to optimize the DFE (i.e. set the comparators to exactly α to perfectly cancel out the
1st post-cursor ISI remaining after the CTLE), the α level can be extracted by looking at the
data levels present in the CTLE’s output eye. A system with one significant post-cursor ISI
will have 4 distinct levels: (1+α), (1-α), (-1+α), (-1-α). The exact value of α can be obtained,
for example, from Eq. 2.1 and Eq. 2.2 below. Figure 2.8 demonstrates the
latter equation visually. The apex of the red eye opening, for when the previous bit was a 1, is
(1+α). The apex of the blue eye opening, for when the previous bit was a 0, is (1-α). Taking
the difference yields exactly 2α, and dividing by two then gives the value of the 1st
post-cursor ISI.
(1 + α) + (−1 + α) = 2α (2.1)
(1 + α) − (1 − α) = 2α (2.2)
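A short numerical check of Eq. 2.1 and Eq. 2.2 is sketched below, assuming data levels normalized to a main-cursor amplitude of 1 and an illustrative α of 0.3. The function and variable names are hypothetical.

```python
def alpha_from_sum(level_1_plus_a, level_minus1_plus_a):
    """Eq. 2.1: (1 + alpha) + (-1 + alpha) = 2 * alpha."""
    return (level_1_plus_a + level_minus1_plus_a) / 2.0


def alpha_from_difference(level_1_plus_a, level_1_minus_a):
    """Eq. 2.2: (1 + alpha) - (1 - alpha) = 2 * alpha."""
    return (level_1_plus_a - level_1_minus_a) / 2.0


# Illustrative normalized levels for alpha = 0.3.
levels = {"1+a": 1.3, "1-a": 0.7, "-1+a": -0.7, "-1-a": -1.3}

alpha_eq_2_1 = alpha_from_sum(levels["1+a"], levels["-1+a"])
alpha_eq_2_2 = alpha_from_difference(levels["1+a"], levels["1-a"])
```

Both expressions recover the same α from different pairs of the four levels.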
In practice, by exploiting the 4 distinct data levels, a simple sign-sign LMS could be
implemented to find α. Two sign-sign LMS loops, one for eref (the error reference)
and a slower loop for α, would get α to converge to the middle of the 110 pattern eye
opening for when the previous bit was 1, as shown in Figure 2.9. Essentially, this LMS
DFE adaptation takes (1+α) + (-1+α) = 2α and divides it by two to obtain α.
The α in Figure 2.9 converges to the value shown in green. This green α value is the
vertical mid-point of the 110 eye pattern. The sign-sign LMS convergence is governed by the
where the err signal is generated by a comparator which compares dout to ±α ± eref, depending
on the data pattern. For example, for a data pattern where the previous data and the
current data are both 1, the threshold of the comparator is +α + eref. If the previous
data is 1 and the current data is 0, then the threshold is +α − eref. However,
it is apparent in Figure 2.9 that the value to which α converges is not the optimal comparator
level for PD operation in terms of jitter tolerance. In fact, the optimal PD
level is shown in red. It was previously stated that for optimal DFE operation, the
comparator’s α level should be the vertical midpoint of the 110 eye opening available to the
1-tap loop-unrolled DFE. For optimal PD operation, the comparator’s α level should
intersect the 011 rising data sequence at the x-axis midpoint of the 110 pattern eye
opening. This x-axis midpoint should provide the greatest peak-to-peak jitter tolerance.
In the next section, a novel technique for obtaining the optimal PD level will be discussed.
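To make the two-loop sign-sign LMS idea concrete, the following behavioural sketch uses the generic textbook update, which may differ from the thesis's own update equation. It assumes ideal bit decisions, a signal of the form cur + α·prev plus Gaussian noise, and a threshold of prev·α + cur·eref, which reproduces the +α + eref and +α − eref cases described above. All step sizes, the noise level, and the starting values are hypothetical.

```python
import random


def sign(x):
    """Return +1 for non-negative x and -1 otherwise."""
    return 1 if x >= 0 else -1


def sign_sign_lms(alpha_true=0.3, amp_true=1.0, steps=20000,
                  mu_eref=0.004, mu_alpha=0.001, noise=0.02, seed=0):
    """Adapt eref (fast loop) and alpha (slow loop) with sign-sign LMS."""
    rng = random.Random(seed)
    alpha, eref = 0.0, 0.5            # arbitrary starting guesses
    prev = 1                          # bits are represented as -1 / +1
    for _ in range(steps):
        cur = rng.choice((-1, 1))
        dout = amp_true * cur + alpha_true * prev + rng.gauss(0.0, noise)
        threshold = prev * alpha + cur * eref
        err = sign(dout - threshold)
        eref += mu_eref * err * cur   # error-reference loop
        alpha += mu_alpha * err * prev  # slower post-cursor loop
        prev = cur
    return alpha, eref


alpha_est, eref_est = sign_sign_lms()
```

With these toy settings the estimates settle near the 1st post-cursor value and the main-cursor amplitude, which corresponds to the DFE-optimal (green) level rather than the PD-optimal level.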
Chapter 3
Proposed Adaptation Engine
Figure 3.1: Full-rate system level block diagram of CDR and the proposed adaptation engine
The calculated values of dLev for various filters are then used by the adaptive logic block
to guide both the CTLE parameter (Cs) and the one-tap DFE coefficient (+/-α), which
also determines the sampling phase of the PD.
In order to perform both CTLE and DFE/PD adaptation, we rely heavily on the data
level loop. The original use of the data level loop for finding voltage levels was published
in JSSC 2005 by Stojanovic et al. [37]. The proposed data level loop is modified to serve
the adaptive scheme for the Shibasaki baud-rate CDR specifically. Figure 3.2 illustrates
the block diagram of the proposed data level loop. The dLev converges to the middle
of the data level of the filtered data sequence sampled by the adaptive clock, CKadapt.
dLev convergence is governed by Eq. 3.1.
Different data sequence patterns can be filtered out for the data level loop, making it very
useful in determining the voltage level for any specific data pattern. For example, Figure
3.3 shows that dLev converges to the correct value of 200 mV. The final dLev value after
convergence can be further stabilized by applying more filtering. To demonstrate that the
data level loop can track any data pattern, we sweep the adaptive clock
for different pattern filter settings. Figure 3.4 depicts an example eye diagram and the
resulting dLev for the 111 pattern, shown in green, and the 011 pattern, shown in
blue. The adaptive clock here is swept over 1UI as a demonstration. Again, the dLev output
could have been filtered more heavily, which would have improved the monotonicity of
the dLev values, especially for the 011 pattern shown in blue.
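The behaviour of a pattern-filtered data level loop can be sketched as follows. This is only a behavioural illustration that assumes a simple sign-based up/down update, which may not match Eq. 3.1. The toy signal levels (other than the 200 mV value mentioned above), the noise, and all names are hypothetical.

```python
import random


def data_level_loop(samples, bits, pattern, mu=1.0, start=0.0):
    """Track the level (mV) of samples whose 3-bit window matches pattern.

    dLev moves up or down by mu depending on the sign of (sample - dLev),
    so it settles where up and down decisions are equally likely.
    """
    d_lev = start
    for n in range(1, len(bits) - 1):
        if (bits[n - 1], bits[n], bits[n + 1]) != tuple(pattern):
            continue                      # pattern filter rejects this sample
        d_lev += mu if samples[n] > d_lev else -mu
    return d_lev


def toy_level(window):
    """Hypothetical noiseless level (mV) seen by the adaptive clock."""
    if window == (1, 1, 1):
        return 200.0
    if window == (0, 1, 1):
        return 120.0
    return 150.0 if window[1] == 1 else -150.0


rng = random.Random(1)
bits = [rng.randint(0, 1) for _ in range(8000)]
samples = [0.0] * len(bits)
for n in range(1, len(bits) - 1):
    window = (bits[n - 1], bits[n], bits[n + 1])
    samples[n] = toy_level(window) + rng.gauss(0.0, 10.0)

dlev_111 = data_level_loop(samples, bits, (1, 1, 1))
dlev_011 = data_level_loop(samples, bits, (0, 1, 1))
```

Changing only the pattern argument makes the same loop report the level of a different data sequence, which is the property the adaptation engine relies on.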
Figure 3.4: Example of data level (dLev) filtered for 111 and 011 pattern
The end goal of the proposed adaptation engine is to arrive at the optimal jitter tolerance
for the baud-rate CDR system. Initially, the adaptation engine will need to run with the
CDR locked to a sub-optimal phase. This essentially means that the CDR needs to be frequency
locked and approximately phase locked, although the presence of bit errors is acceptable
(e.g. a BER of 1E-3). Initial bit errors in the system are acceptable because once the adaptation
engine is turned on, errors average out in the data level loop. Therefore, the data level loop is
still able to track and perform data filtering to obtain dLev. Via adaptation, maximum
jitter tolerance is autonomously achieved by adapting the lock position to the optimal
data-sampling phase, as shown in Figure 3.5.
There are two steps to the proposed adaptive scheme. The first is the CTLE adaptation
and the second is the DFE/PD adaptation:
1. Find the optimal CTLE setting (Cs) with the flattest equalization up to $f_{baud}/3 \sim f_{baud}/4$,
ensuring that there is only 1 significant post-cursor ISI with all other higher-order
residual ISI minimized.
2. Find the optimal comparator level for PD operation. The optimal PD level intersects
the 011 data sequence at the x-axis midpoint of the 110 pattern eye opening (Figure
3.6).
When the two conditions above are satisfied via adaptation, our baud-rate CDR should
have the optimal jitter tolerance.
It was previously mentioned that the comparator level can only be optimized for either
the PD or the DFE since the front-end comparators are shared. Instead of adapting +/-α
to optimize the DFE to cancel out the 1st post-cursor ISI perfectly, we opt to adapt for
optimal PD operation. The main reason for optimizing for PD operation is that
the Shibasaki baud-rate CDR has less timing margin (jitter) than voltage margin
(amplitude) for the DFE to recover the data correctly. Even if the optimal PD level is not
at the exact value of the 1st post-cursor ISI α for the DFE, we are trading off a small amount of
eye opening for better jitter tolerance by locking to a more optimal data-sampling phase.
Figure 3.7 depicts a hypothetical example of the final eye opening where the comparator
level is optimized for the DFE in (a) and for the PD in (b). It is apparent that
the total peak-to-peak jitter tolerance is larger for the scenario where the comparator
level is optimized for the PD.
Figure 3.7: Diagram of eye opening for comparator optimized for DFE and PD respectively
A flow diagram of the proposed adaptation scheme is shown in Figure 3.8. First, the
CTLE is set to the maximum equalization setting and the comparator level α for the
DFE/PD is set to 0.3FS (full-scale). These settings should cause our baud-rate CDR
to lock to a sub-optimal phase. This means that the CDR must be frequency locked,
although there could be bit errors present due to the clock phase being sub-optimal.
The initial comparator level of 0.3FS is chosen as the starting point because the system
should only have one significant post-cursor ISI after the CTLE, since our baud-rate CDR
only has a 1-tap loop-unrolled DFE, which can cancel out just one post-cursor ISI. When
the pulse response has just one significant post-cursor ISI, α is usually close to
0.3FS. This is not always the case, as different channels exhibit different channel and pulse
responses. An initial α level of 0.3FS (e.g. roughly 100mV for a 300mV full-scale input) should
be a viable starting point, but if the baud-rate CDR is unable to achieve phase lock with
a reasonable BER (e.g. 1E-3), the proposed adaptation engine can be restarted with a
different initial comparator level α.
After the CTLE is set to the maximum setting and the comparator level α is set
to the pre-defined value, adaptation begins. First, for the highest CTLE setting, the
adaptation engine obtains a new α value (the optimal PD level). This means that for the
current eye, which is most likely over-equalized by the maximum CTLE setting, the
adaptation engine has picked the optimal PD level, which gives the largest peak-to-peak
jitter tolerance. The α level selection algorithm used by the proposed adaptation
engine will be further discussed in the next section. Once the optimal PD level is obtained,
the adaptation engine lowers the CTLE setting by one step. The proposed adaptation
engine essentially goes back and forth between the CTLE (blue) and the comparator
(red) adaptation, as shown in Figure 3.8. In essence, for each CTLE setting along the
way, the proposed adaptation engine is always tuning the CDR
to the best lock position. In other words, it tunes the CDR in small tick-tock-like
increments to avoid losing phase lock. If the CTLE adaptation were to finish completely
before adapting the comparator level α at all, there would be no guarantee that the CDR
would maintain lock. Furthermore, as previously described, the CTLE setting and the
comparator level α are correlated, and therefore it makes sense to adapt them in small
increments to find the optimal solution in a multi-dimensional solution space. Once the
proposed adaptation engine detects that the lowered CTLE setting did not lower the
CTLE line thickness, the engine reverts to the previous CTLE setting
without updating the α level and ends adaptation. Using line thickness as the metric for
CTLE adaptation will be further elaborated in the next section.
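The alternating flow described above can be summarized in a minimal Python sketch. It is only one reading of the text: the two callables stand in for the on-chip line-thickness measurement and the α selection algorithm, and the toy models below (including their numeric values) are purely hypothetical.

```python
def run_adaptation(max_ctle, line_thickness, optimal_pd_level, alpha_init=0.3):
    """Alternate comparator-level and CTLE steps until thickness stops improving.

    line_thickness(ctle) and optimal_pd_level(ctle) are hypothetical stand-ins
    for the measurements made by the adaptation engine.
    """
    ctle = max_ctle                      # start at maximum equalization
    alpha = alpha_init                   # start at 0.3 of full-scale
    thickness = line_thickness(ctle)
    while True:
        alpha = optimal_pd_level(ctle)   # comparator step for this setting
        if ctle == 0:
            break                        # no lower CTLE setting to try
        lowered = line_thickness(ctle - 1)
        if lowered >= thickness:
            break                        # revert: keep current ctle and alpha
        ctle, thickness = ctle - 1, lowered
    return ctle, alpha


# Toy models: thickness is smallest at setting 5, alpha tracks the setting.
def toy_thickness(ctle):
    return abs(ctle - 5) + 1.0


def toy_pd_level(ctle):
    return 0.2 + 0.01 * ctle


final_ctle, final_alpha = run_adaptation(15, toy_thickness, toy_pd_level)
```

Because α is updated before every CTLE step, the α that is kept when the loop stops is the one obtained for the retained CTLE setting.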
The proposed adaptation engine is broken into two different phases: CTLE and compara-
tor level adaptations. The former, CTLE adaptation, will be discussed in more detail in
this section.
The CTLE used in our baud-rate CDR shown in Figure 3.1 is a common current-mode
logic (CML) CTLE architecture. It is a differential pair with RC source degeneration
with a resistor in parallel with a capacitor (Figure 3.9). This CTLE stage is repeated
twice for the 2-stage CTLE in the CDR’s design. The transfer function of the CTLE is
\[
H(s) = \frac{g_m}{C_L} \cdot \frac{s + \dfrac{1}{R_s C_s}}{\left(s + \dfrac{1 + g_m R_s/2}{R_s C_s}\right)\left(s + \dfrac{1}{R_D C_L}\right)} \tag{3.2}
\]
The CTLE’s zero, poles, DC gain, and peaking gains are governed by equations below.
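\[
\omega_z = \frac{1}{R_s C_s} \tag{3.3}
\]
\[
\omega_{p1} = \frac{1 + g_m R_s/2}{R_s C_s} \tag{3.4}
\]
\[
\omega_{p2} = \frac{1}{R_D C_L} \tag{3.5}
\]
\[
\text{DC gain} = \frac{g_m R_D}{1 + g_m R_s/2} \tag{3.6}
\]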
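As an illustrative sketch of Eq. 3.2 to 3.6, the following Python snippet evaluates the zero, the poles, the DC gain and the magnitude response while Cs is swept. The component values (GM, RS, RD, CL) and the function names are hypothetical placeholders chosen only for illustration; they are not the values used in the testchip.

```python
import math

def ctle_params(gm, rs, cs, rd, cl):
    """Zero and poles (rad/s) and DC gain of the CTLE, per Eq. 3.3 to 3.6."""
    wz = 1.0 / (rs * cs)
    wp1 = (1.0 + gm * rs / 2.0) / (rs * cs)
    wp2 = 1.0 / (rd * cl)
    dc_gain = gm * rd / (1.0 + gm * rs / 2.0)
    return wz, wp1, wp2, dc_gain

def ctle_gain(f_hz, gm, rs, cs, rd, cl):
    """Magnitude of H(s) from Eq. 3.2 evaluated at s = j*2*pi*f."""
    wz, wp1, wp2, _ = ctle_params(gm, rs, cs, rd, cl)
    s = 1j * 2.0 * math.pi * f_hz
    return abs((gm / cl) * (s + wz) / ((s + wp1) * (s + wp2)))

# Hypothetical component values, for illustration only.
GM, RS, RD, CL = 10e-3, 400.0, 200.0, 20e-15

for cs in (100e-15, 200e-15, 400e-15):
    wz, wp1, wp2, dc = ctle_params(GM, RS, cs, RD, CL)
    fz_ghz = wz / (2.0 * math.pi * 1e9)
    g10 = ctle_gain(10e9, GM, RS, cs, RD, CL)
    print(f"Cs={cs * 1e15:.0f} fF  fz={fz_ghz:.2f} GHz  DC gain={dc:.3f}  |H(10 GHz)|={g10:.3f}")
```

With these assumed values the DC gain stays constant while the zero frequency moves down as Cs increases, which matches the behaviour described for Figure 3.10.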
For our proposed CTLE adaptation scheme, the adaptive variable is the source de-
generation capacitor Cs . Sweeping Cs is very similar to sweeping the zero, ωz , overall.
Figure 3.10: CTLE transfer function across Cs settings simulated in MATLAB Simulink
Digitally tunable capacitors are used to adjust Cs for gentle tuning of the CTLE transfer
function [27]. By increasing the source degeneration capacitance, the zero frequency of
the system can be reduced while maintaining the low frequency DC gain as seen in Figure
3.10. In addition, if Rs were the adaptive variable, a VGA would have been needed
to boost the DC gain since Rs increases peaking by lowering the DC gain. Adding a
VGA is complicated because the VGA setting must be adapted as well, adding a
third variable which would further complicate the adaptive scheme.
As mentioned in Section 3.2, the goal of CTLE adaptation is to find the optimal CTLE
setting (Cs) with the flattest equalization up to fbaud /3 ∼ fbaud /4, ensuring that there is
only 1 significant post-cursor ISI. This also means that 0011 2UI pulse pattern should
reach full-scale and have the minimum line thickness. In other words, the line thickness
of the CTLE’s eye is representative of the residual ISI ($\sum_{i=2}^{\infty} \alpha_i$) for a pulse response of
$(1 + \alpha_1 D^1 + \alpha_2 D^2 + \alpha_3 D^3 + \cdots)$. The CTLE adaptation involves observing the CTLE
output’s line thickness and exploiting this property. Figure 3.11(a) illustrates that the
line thickness of the CTLE’s eye can be measured at three different data patterns at the
crossing (CKX ): 111, 011, and 0101 pattern. Measuring line thickness at 011 and 111
patterns would ensure flattest equalization up to fbaud /3 ∼ fbaud /4. For example, the line
thickness for the 011 pattern can be obtained by setting the data filter of the data level loop
to 011 and allowing the loop to track the maximum dLev, which would be dLev(011)max .
Similarly, the minimum value of the 011 pattern can be obtained by allowing the data level
loop to track the minimum dLev which would be the dLev(011)min . Taking the difference
would yield the line thickness for the 011 pattern as shown in Eq. 3.9. Similarly, the line
thickness of the 111 pattern can be obtained in the same manner. Since the max and min
values of dLev are heavily affected by voltage noise, the line thickness measurement is heavily filtered
to average out the noise.
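A minimal sketch of the line thickness measurement, assuming the pattern-filtered sampler output is available as a list of (pattern, voltage) pairs. In the actual design the data level loop tracks the maximum and minimum dLev with heavy filtering; taking the max and min directly, as done here, is a simplification, and the names `line_thickness` and `samples` are hypothetical.

```python
def line_thickness(samples, pattern):
    """Return max minus min voltage of the samples matching the data pattern.

    samples: list of (pattern_string, voltage) pairs taken at the CKX crossing.
    """
    levels = [v for bits, v in samples if bits == pattern]
    if not levels:
        raise ValueError(f"no samples for pattern {pattern!r}")
    return max(levels) - min(levels)

# Hypothetical sampled levels in volts, for illustration only.
samples = [("011", 0.30), ("111", 0.35), ("011", 0.26), ("111", 0.33), ("011", 0.28)]

print("011 thickness:", line_thickness(samples, "011"))
print("111 thickness:", line_thickness(samples, "111"))
```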
Figure 3.11(b) depicts the CTLE’s line thickness for different data patterns vs. CTLE
Cs settings. This plot is generated in MATLAB Simulink with the baud-rate CDR
running with the settings described in the plot title. It is evident that a Cs setting of 200
fF in Figure 3.11(b) yields the optimal CTLE setting with the minimum line thickness,
thus the flattest equalization up to fbaud /3 ∼ fbaud /4. To converge to the minimum
thickness, the CTLE setting is initialized to the maximum value as described in the
previous section (Section 3.3) and then lowered until the sign of the change
in line thickness flips. At this point, the adaptation engine reverts to the previous
CTLE setting and ends adaptation.
In contrast to spectrum balancing of a CTLE as in Figure 3.12 [33, 20, 17, 10, 18],
performing CTLE adaptation based on the line thickness of its output is essentially
a pattern-guided CTLE adaptation similar to [12, 13, 32]. Figure 3.13 shows the
CTLE transfer function with the 0011 pattern and its neighboring patterns for different
CTLE settings. Selecting the 011 and 111 patterns and measuring their line thickness is
a pattern-guided method of optimizing the CTLE equalization setting such that the
transfer function has the flattest response without certain patterns being over- or under-
equalized. Essentially, minimizing line thickness removes all the higher-order
post-cursor ISI terms except for the first post-cursor, which the DFE cancels after
the CTLE.
A visual example demonstrating the CTLE adaptation is shown in Figure 3.14. The
CTLE’s eye diagram for various Cs settings from the highest to the lowest is shown on
the left. Following the CTLE adaptation algorithm starting from the highest setting, it
will converge to the CTLE setting with minimum line thickness which is at Cs = 200 fF.
The eye diagram associated with this optimal setting is highlighted in red.
The comparator level adaptation is predicated on optimizing for the PD operation instead
of the DFE. The optimal PD level essentially is the x-axis midpoint of the eye opening.
Figure 3.13: CTLE transfer function showing 0011 pattern and its neighboring patterns for three different
CTLE settings
In order to find the optimal PD level, we take a look at slew rate similar to [23]. It is
apparent from Figure 3.15 that the 011 rising data sequence slews and thus the slew rate
is given by:
\[
\text{Slew rate} = \frac{V_{amp}}{0.5\,\text{UI}} \tag{3.11}
\]
The 0.5 UI base of the triangle follows from the premise that if the CTLE is able to
equalize up to fbaud /4, all ISI is equalized for the 0011 pattern, and thus the rise time
of a 011 sequence equals the fall time of a 110 sequence. Exploiting the
right triangle shown in Figure 3.15, the optimal PD level follows Eq. 3.12. Setting
α to this optimal PD level will sample the data at the center of the eye opening available
for the 1-tap speculative DFE with 0.5UI of timing margin on both sides. As a result,
the comparator level α can be set to the optimal PD level once Vamp is found, simply
by dividing Vamp by 2.
\[
\text{Optimal PD level} = 0.25\,\text{UI} \times \text{Slew rate} = \frac{V_{amp}}{2} \tag{3.12}
\]
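As a worked check of Eq. 3.12, substituting the slew rate from Eq. 3.11 gives the result below. The numerical value of V_amp = 200 mV is an assumed example and not a measured result.

\[
0.25\,\text{UI} \times \frac{V_{amp}}{0.5\,\text{UI}} = \frac{V_{amp}}{2},
\qquad V_{amp} = 200\,\text{mV} \;\Rightarrow\; \alpha = 100\,\text{mV}
\]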
To find Vamp , the adaptive sampler inside the data level loop is used to find the voltage
levels for different data patterns. First, the PI code is swept to find the cross point
between dLev(011)avg and dLev(110)avg . This cross point of average values of the two
patterns is illustrated in Figure 3.16. This figure also illustrates various patterns high-
lighted on the eye diagram and its corresponding values of dLev simulated in MATLAB
Simulink on the right. At this clock phase of CKX (at the crossing), Vamp = dLev(011)max
can be obtained. At this same clock phase, line thickness for CTLE adaptation is also
obtained by measuring dLev(011)max and dLev(011)min . The maximum value is deliberately
used for Vamp instead of the average value because the rising waveform
does not slew along an ideal straight line; the slope tails off slightly near
the crossing point of the 011 and 110 patterns. This phenomenon can be seen in Figure
3.17. Taking the maximum value of dLev helps to compensate for this nonideality of the rising
waveform.
Figure 3.15: Visual example of theory behind the proposed algorithm for finding optimal PD level
Figure 3.18 shows a visual summary of the entire adaptation plotted against CTLE
parameter Cs. The following graph should be read from right to left where the adaptation
process starts at the highest setting of 200 fF until it finds the minimum line thickness
(e.g. 100 fF). Where and how the line thickness is taken is shown on the left side. Every
step along the way, for each value of Cs, the α level is updated to the optimal PD level
at the x-axis center of the eye opening with 0.5UI margin on both sides as shown in the
eye diagram. At Csopt which has the minimum line thickness, α level is considered αopt
and these are the final converged values after adaptation. In the next section (Section
3.7), where the system-level behavioral model is discussed, adaptation versus time is
illustrated via a time-domain simulation.
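The alternating CTLE and comparator level adaptation summarized above can be sketched as a simple loop. This is only an illustration under assumptions: `measure` is a hypothetical callback standing in for the data level loop (returning Vamp and the line thickness at the crossing for a given Cs), and the table of values is made up for the example.

```python
def adapt(cs_settings, measure):
    """Alternate between CTLE (Cs) and comparator level (alpha) updates.

    cs_settings: Cs values ordered from the highest to the lowest setting.
    measure(cs): hypothetical callback returning (v_amp, line_thickness).
    Returns the converged (cs, alpha).
    """
    cs = cs_settings[0]
    v_amp, thickness = measure(cs)
    alpha = v_amp / 2.0                    # optimal PD level, Eq. 3.12
    for next_cs in cs_settings[1:]:
        next_v_amp, next_thickness = measure(next_cs)
        if next_thickness >= thickness:
            break                          # revert: keep previous Cs, alpha not updated
        cs, thickness = next_cs, next_thickness
        alpha = next_v_amp / 2.0           # re-center alpha for the new Cs
    return cs, alpha

# Hypothetical measurements: Cs in fF -> (Vamp in V, line thickness in V).
table = {400: (0.30, 0.060), 300: (0.26, 0.045), 200: (0.22, 0.030), 150: (0.20, 0.038)}

cs_opt, alpha_opt = adapt([400, 300, 200, 150], lambda cs: table[cs])
print(cs_opt, alpha_opt)
```

Updating α after every accepted Cs step, rather than once at the end, mirrors the small-increment approach described in the text for preserving phase lock.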
Figure 3.18: Adaptation where line thickness guides CTLE adaptation of Cs (top right) and optimal
sampling phase deduced from the slew rate guides adaptation of comparator level
A system-level behavioral model was built in MATLAB Simulink for a quarter-rate architecture
as shown in Figure 3.19, which is the exact schematic of the circuit
implementation discussed in the next chapter, where the analog design is covered.
Variables that are adjusted during adaptation are highlighted in red. These variables
(CTLE parameter, α and dLev, PI code) form a feedback loop from digital to analog
and could be observed in a system-level model. Both continuous-time and discrete-time
event-driven models were created, as each has specific pros and cons. An advantage
of the continuous-time model is that it is more accurate and provides insight into the
real-time eye diagram of the signals, which is very useful for debugging. The cost of a
continuous-time model is slow simulation. An event-driven model allows
for faster simulation, which is useful for running sweeps of long simulations, e.g. a jitter
tolerance test. In both the continuous-time and event-driven models of the CDR, the analog
front-end is separated from the digital back-end, which is solely for digital adaptation.
For modelling the channel, the Tyco 5” channel’s real s-parameters measured with a
vector network analyzer (VNA) were imported. The RF Toolbox from MATLAB was
used to map this into a rational function with poles and zeros such that the transfer
function block can be used in Simulink for the channel model. An example of various
channels mapped in MATLAB is shown in Figure 3.20. Specifically, the proposed CDR
was mainly designed for the Tyco 5” channel as other channels have too much channel
loss at the Nyquist frequency of 18 GHz for the target data-rate of 36 Gb/s.
Figure 3.19: Proposed schematic of quarter-rate baud-rate receiver. Digital adaptation is completed in
digital back-end and the rest of the CDR is done in analog front-end
One problem with a behavioral model is that it is usually built entirely from ideal blocks and is
hence unrealistic. Gaussian and sinusoidal jitter are added to the quarter-rate clocks in
order to imitate real-life jitter and noise as much as possible. Gaussian voltage noise is
also added to the comparators. Although metastability, offset and hysteresis are not
captured properly for the comparator, adding voltage noise is far better than a noiseless
ideal comparator.
A graphical example of adaptation vs. time is shown in Figure 3.21. To explain the
context for this simulation, the channel model is Tyco 5” at a data-rate of 36 Gb/s.
This channel in MATLAB has 19.9 dB loss at Nyquist. The VCO’s phase noise at 1
MHz offset is set at -80dBc/Hz. Other simulation conditions include 0.1 UIpp sinusoidal
jitter, 0.1 UIpp random Gaussian jitter, additive white Gaussian noise with 40 dB SNR,
and comparator noise turned on. The left column with the eye diagrams illustrates the
quality of the eye for the initial CTLE equalization setting and the final CTLE setting.
At a Cs value of 200 fF, the eye has the smallest line thickness and hence the optimal CTLE
equalization setting. On the right are the different variables plotted against time. The
CTLE parameter Cs is initially set to the highest level of 400 fF and then adaptation
begins. Once the line thickness of the next Cs value is greater than the previous value,
Cs returns to the previous state and adaptation finishes. In this example, when Cs is
set to 150 fF, the line thickness increases; therefore, the Cs setting returns to 200 fF,
which corresponds to the minimum line thickness. At each Cs setting along the way, the
comparator level α is set to the optimal PD level discussed in the previous section so
that the CDR always preserves phase lock, since large abrupt changes in CDR settings can
cause the CDR to lose lock and diverge completely. To verify that the CDR is error free,
a BERT (bit error rate tester) was built in Simulink to confirm that the recovered data
is still a PRBS just like the input to the channel. The error count remained stable after
the adaptation converged at 20 µs.
Figure 3.20: Plot of channel characteristic of various channels imported and converted to rational system
model in MATLAB
Figure 3.21: Adaptation vs time where line thickness guides CTLE adaptation and optimal PD level
guides α level adaptation. Tyco 5” channel at 36 Gb/s
continuous-time transfer function with poles and zeros. CTLE’s transfer function is also
combined with the channel model’s transfer function all into one combined step response
since it cannot be cascaded. Figure 3.22 illustrates an example of the combined step
and pulse responses of the channel + CTLE. Furthermore, the CDR’s loop filter is also
converted into a discrete-time z-domain function using the bilinear transform.
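The bilinear transform step can be sketched for a first-order section. This is an illustration under assumptions: the function name, the sample period and the R and C1 values are hypothetical, and the real loop filter also includes C2, which is ignored here so that the impedance reduces to Z(s) = R + 1/(s C1).

```python
def bilinear_first_order(b1, b0, a1, a0, t_s):
    """Map H(s) = (b1*s + b0) / (a1*s + a0) to the z-domain.

    Uses the substitution s = (2/T) * (1 - z^-1) / (1 + z^-1).
    Returns (num, den) as coefficients of z^0 and z^-1, with den[0] = 1.
    """
    k = 2.0 / t_s
    d0 = a0 + a1 * k
    num = ((b0 + b1 * k) / d0, (b0 - b1 * k) / d0)
    den = (1.0, (a0 - a1 * k) / d0)
    return num, den

# Hypothetical values: series R-C1 branch, sampled at an assumed 8 GHz rate.
R, C1, T = 1e3, 1e-12, 1.0 / 8e9

# Z(s) = R + 1/(s*C1) = (R*C1*s + 1) / (C1*s)
num, den = bilinear_first_order(R * C1, 1.0, C1, 0.0, T)
print("num:", num)
print("den:", den)
```

The denominator comes out as 1 - z^-1, i.e. the continuous-time integrator of the type-II loop becomes a discrete-time accumulator.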
Using the event-driven model of the CDR, jitter tolerance was simulated with injection
of sinusoidal jitter after the digital adaptation converged to the final values. The jitter
tolerance simulation was programmed in MATLAB to use a binary search algorithm with
5 iterations from the initial search points (red line) shown in Figure 3.23. For example,
from this initial search point, if the jitter tolerance test passes for the specified BER,
that is the jitter tolerance for that frequency. If the jitter tolerance test fails at the initial
search point, the test would try again at the halfway point in terms of amplitude, like
a binary search. The high-frequency jitter tolerance is 0.5 UIpp and the minimum jitter
tolerance is around 0.4 UIpp at the dip due to an undershoot. These jitter tolerance
values are for BER < $10^{-6}$; therefore, for BER < $10^{-12}$, degradation is expected.
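A minimal sketch of the amplitude search at one jitter frequency is given below. The text only specifies the first step (accept the initial search point if it passes, otherwise try the halfway amplitude) and the use of 5 iterations; the bisection bounds and the hypothetical callback `passes_ber`, which stands in for a full event-driven simulation at a given jitter amplitude, are assumptions.

```python
def jitter_tolerance(passes_ber, start_amp, iterations=5):
    """Search for the largest passing sinusoidal jitter amplitude (UIpp).

    passes_ber(amp): hypothetical callback, True if the BER target is met.
    start_amp: initial search point for this jitter frequency.
    """
    if passes_ber(start_amp):
        return start_amp               # initial point passes: report it directly
    lo, hi = 0.0, start_amp            # lo is assumed to pass, hi is known to fail
    for _ in range(iterations):
        mid = (lo + hi) / 2.0
        if passes_ber(mid):
            lo = mid                   # tolerance is at least mid
        else:
            hi = mid                   # tolerance is below mid
    return lo

# Hypothetical stand-in for the simulation: passes up to 0.4 UIpp.
result = jitter_tolerance(lambda amp: amp <= 0.4, start_amp=1.0)
print(result)
```

With 5 iterations the returned amplitude is within start_amp/32 of the true pass/fail boundary, which trades resolution against the number of long simulations.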
Figure 3.23: Jitter tolerance simulated for the converged adaptive setting of event-driven model (BER < $10^{-6}$)
Chapter 4
Circuit Simulations and Measurement Results
Circuit implementation (schematic design) in Cadence and its simulation results as well
as the measured results of the testchip will be discussed in this section.
A 36 Gb/s inductor-less baud-rate CDR from Figure 3.19, on which the adaptation
engine will be demonstrated, is fully implemented in the analog front-end. Since it is a stand-
alone analog CDR, it is fully functional without digital synthesis. The adaptation engine
and the BERT are the only circuits built digitally. An advantage of having a standalone,
fully functional CDR designed in analog is that locking of the CDR can be simulated
in Cadence. When a digital CDR is used, an AMS (analog/mixed-signal) verification tool
has to be set up to test the analog circuits together with the digital circuits in order to
verify locking of the CDR, which is more complicated. To minimize the
chance of the adaptive baud-rate CDR not working, the analog CDR was fully verified
for lock in Cadence using Spectre, and it was verified that there are no bit errors. In addition,
the digital circuit is verified functionally in ModelSim and NCSim every step along the
way (Verilog RTL, synthesized RTL, post place & route) against the vectors generated
in Simulink simulation. The digital implementation will be discussed in more detail after
the analog design section.
Since the digital adaptation is the novelty of this adaptive baud-rate CDR, only key
schematics and simulation results will be discussed.
To begin the analog design section of the thesis, Figure 4.1 shows the schematic of a 2-
stage CTLE. The tunable source degeneration capacitor is a 4-bit digitally programmable
array of MOM capacitors and pass-gate switches. A 2-stage CTLE (both with high-
frequency boost) had to be implemented because the CTLE is heavily loaded at the
output by 8 comparators for the DFE and the PD, and one additional comparator used for
adaptation. As a result, with a single CTLE stage, the output capacitance limits the bandwidth
of the boost. When two stages are used, the first stage is only loaded by the input gate
of the second CTLE stage and hence provides boost at a higher frequency and improves
the CTLE’s overall bandwidth. Figure 4.2 is the simulated AC response of the 2-
stage CTLE with post-layout extraction. Due to layout parasitics from the interconnect,
power grid and fill, it is very challenging to push the bandwidth any further. Figure 4.3
demonstrates the eye opening for various Cs values with post-layout extraction with 32
Gb/s PRBS31 as the input.
Following the CTLE in the analog front-end are the comparators. In a quarter-rate
architecture, the number of comparators required increases. In the proposed adaptive
baud-rate CDR, 8 comparators are required for the DFE and the PD. Four comparators
have +α and the other four comparators have -α as the threshold level as shown in Figure
3.19. A double-tail latch published at ISSCC 2007 is used as the sense amplifier as opposed
to a StrongArm latch because double-tail latches have a performance advantage at lower
supply voltages due to less stacking of devices [31]. A schematic of the double-tail latch is
shown in Figure 4.4. Since TSMC 28nm HPC uses 0.9V for supply of thin-oxide devices,
double-tail latches were preferred. Since we require comparisons to +/-α instead of the
zero-level, dual-difference comparator scheme was used where the input data is compared
to the threshold level α. These threshold levels are generated by a 9-bit reference DAC
block with 512 levels of 1mV/step. For the single adaptive sampler used to provide error
information to digital adaptation, the input sensitivity is modified by optimizing the
sizing of the double-tail latch. A higher input accuracy was necessary for adaptation,
whereas for the other 8 comparators used for the DFE and the PD, higher sensitivity was
not required to achieve a CDR lock and error-free recovered data.
Figure 4.3: Simulated eye diagram at the output of the 2-stage CTLE. Top two eyes are under-equalized.
The bottom left is optimally equalized and the bottom right is starting to over-equalize
The DFE and the PD, which use the information gathered by the comparators, are
custom digital logic blocks made up of simple logic gates, flip-flops, multiplexers and
adders. The DFE is designed to operate at 2 Gb/s in 16 parallel interleaved paths. The
PD is designed to operate at 4 Gb/s in 8 parallel paths. The four quarter-rate comparator
paths are demuxed accordingly for both the DFE and the PD.
The charge pump is a simple current steering differential pair and the loop filter is
a common higher-order RC loop filter of type-II PLL. Figure 4.5 depicts a simplified
schematic of the charge pump and loop filter combination. The amount of current steered by
the charge pump can be digitally adjusted with a 4-bit setting, which affects the CDR’s loop
gain. C1 and C2 of the loop filter are fixed capacitance values set by MOM capacitors. The
resistance is a 4-bit tunable resistor switch array for adjusting the CDR’s loop dynamics.
The VCO is made up of CML based 8-stage ring oscillator as shown in Figure 4.6.
8 stages were used to reduce phase noise as it has been published that increasing the
number of stages in a ring oscillator reduces the phase noise [3]. The proposed 8-stage
ring VCO uses the same CML delay stage architecture as [16, 29] and has a tuning
range between 6.76 to 9.14 GHz when Vctrl is swept from 200 to 700mV. For this tuning
range, the simulated VCO’s free-running phase noise in Cadence Spectre after post-layout
extraction was -80.77 to -82.42dBc/Hz at 1 MHz offset. The CML clocks coming out of
the VCO are converted to CMOS signals used by the comparators, using a structure similar
to the CML2CMOS circuit from [9].
The VCO and the clock buffers are under a regulated voltage to suppress supply
voltage noise. An LDO (low-dropout) regulator with a PMOS pass-gate was implemented
for the regulator. A PMOS design with a lower drop-out voltage had to be used instead
of an NMOS pass-gate, which has a superior PSRR (power supply rejection ratio), because
high-voltage thick-oxide devices were not available in the TSMC 28nm HPC design kit
through MOSIS. Since the nominal supply voltage is 0.9V with a maximum recommended
voltage of 1.0V, an LDO regulator had to be used. Even then, the LDO is designed with
a 1.1V supply, which is a little higher than the recommended maximum supply voltage
and hence could have some repercussions in terms of reliability. Since this is a testchip
rather than a real product with more stringent reliability requirements, applying a 1.1V
supply solely to the PMOS pass-gate of the LDO was deemed acceptable. If higher-supply thick-
oxide devices were available, the LDO regulator would have been placed on the higher
supply (VDDH).
Inputs to the digital back-end of the CDR are designed to be approximately 1 Gb/s
data and 1 GHz core clock (CKrec/8 ). The Dout data path is 32-parallel bits of data as
shown in Figure 3.19. An adaptive sampler is clocked by an adaptive clock CKX with
PI’s phase controlled by output of digital adaptation. The comparator threshold level
uses dLev from the output of the digital adaptation as well. The error signal from the
The simulation results of the closed-loop CDR simulations with post-layout extraction
will be discussed in this subsection for the proposed adaptive baud-rate CDR with just
the analog CDR portion sans digital adaptation. The testbench of the analog CDR
portion of the adaptive baud-rate CDR is as follows. The input data is 32 Gb/s PRBS31
pattern which is attenuated through a Verilog-A model of the Tyco 5” channel imported
into Cadence. Since the CDR is a PLL-style CDR, the initial frequency is adjusted by
setting the Vctrl of the VCO to an initial frequency that matches the input data within
the frequency capture range of the PD.
Figure 4.7 illustrates that the CDR is error free for all 16 parallel DFE paths of the
data after phase lock. Even if all parallel data paths are error free for a PRBS pattern, it
does not prove that the interleaved data at full-rate is also error free. Therefore, another
test with a PRBS7 was conducted to verify that the data is still a PRBS7 when
the parallel recovered data are interleaved and combined manually. A PRBS7 was used
for this as it is a short repeating pattern that can be checked much more easily than, say,
a PRBS31. For a PRBS31, a digital BERT written in Verilog for digital synthesis can
be used to check that it is error free after fabrication. It interleaves all 32 down-sampled
parallel data paths and verifies that the result is error free.
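The interleave-and-check step can be sketched in a few lines. This sketch assumes the common PRBS7 polynomial x^7 + x^6 + 1 (the text does not state which polynomial was used), and all function names are hypothetical.

```python
def prbs7(n, seed=0x7F):
    """Generate n bits of PRBS7, assuming the polynomial x^7 + x^6 + 1."""
    state = seed & 0x7F                # last 7 bits, most recent in bit 0
    out = []
    for _ in range(n):
        new = ((state >> 6) ^ (state >> 5)) & 1
        out.append(new)
        state = ((state << 1) | new) & 0x7F
    return out

def deinterleave(bits, ways):
    """Split a serial stream into `ways` parallel down-sampled paths."""
    return [bits[i::ways] for i in range(ways)]

def interleave(paths):
    """Recombine parallel paths into the full-rate serial stream."""
    out = []
    for group in zip(*paths):
        out.extend(group)
    return out

def prbs7_errors(bits):
    """Count bits that violate the recurrence b[n] = b[n-6] XOR b[n-7]."""
    return sum(bits[n] != (bits[n - 6] ^ bits[n - 7]) for n in range(7, len(bits)))

bits = prbs7(320)
paths = deinterleave(bits, 16)         # 16 parallel paths, as for the DFE outputs
recombined = interleave(paths)
print("errors after interleaving:", prbs7_errors(recombined))
```

Checking the recurrence on the recombined stream, rather than on each path separately, is what confirms that the path ordering is also correct.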
Figure 4.7: Simulated results of closed loop CDR at 32 Gb/s PRBS31 input in Cadence: Error count
Figure 4.8: Simulated results of closed loop CDR at 32 Gb/s PRBS31 input in Cadence: Recovered
clock frequency
Figure 4.8 illustrates the frequency of the recovered clock. It fluctuates and dithers around
an average value of 8 GHz (quarter-rate) for 32 Gb/s PRBS31 data. Figure 4.9 shows
the Vctrl which is the control voltage going into the VCO. It fluctuates after the CDR
is locked since the PD periodically dithers between early and late. The peak-to-peak
amplitude of Vctrl matched the Simulink simulation result very closely, confirming a close
resemblance between the behavioral model and the circuit simulation.
During the initial design of the digital adaptation in MATLAB Simulink, the building
blocks were specifically designed with MATLAB fcn (function) blocks with code written
much like Verilog for an easier HDL conversion in the later design stage, i.e. RTL logic
synthesis. This is because the end goal of the digital circuit was not to just simulate and
validate the results solely in a behavioral model but to synthesize the digital and place
& route such that the digital layout could be taped out along with the custom analog
layout.
Figure 4.9: Simulated results of closed loop CDR at 32 Gb/s PRBS31 input in Cadence: Vctrl of VCO
There are three major digital blocks for the adaptation engine in Figure 3.1. First is the
accumulator block which makes up the data level loop in conjunction with the adaptive
sampler which is an analog block designed in Cadence. The accumulator block consists
of a digital filter, pattern filter, integrator and an FSM for pattern filter scheduling. The
second digital block is the PI logic block for obtaining the appropriate clock phase for
the adaptive sampler at the crossing point of the 011 and 110 data patterns. Thirdly,
the adaptation logic block is responsible for measuring the line thickness and Vamp to adapt
the CTLE parameter Cs and the comparator level α for the DFE/PD adaptation.
Since the digital blocks have been introduced, the process of digital design will now
be discussed. The first procedure was to generate the test vectors from a Simulink
simulation for both input and output variables of digital adaptation circuit. These test
vectors are saved for every rising edge of core clock used to clock the digital circuits.
Second, all MATLAB fcn blocks in Simulink were manually converted into Verilog code.
The validity of the Verilog models was confirmed with a testbench in which the Verilog
models were fed with the input test vectors saved from Simulink. For each bit, the
output of the Verilog model was compared to the expected output vectors from Simulink.
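The bit-by-bit comparison against the saved vectors can be sketched as follows. This is not the actual testbench, which runs in a Verilog simulator; it is a hypothetical Python illustration in which each vector entry is the output word captured on one rising edge of the core clock.

```python
def compare_vectors(actual, expected, width):
    """Return a list of (cycle, bit_index) positions where the outputs differ.

    actual, expected: lists of integer output words, one per core clock cycle.
    width: number of bits in each output word.
    """
    if len(actual) != len(expected):
        raise ValueError("vector lengths differ")
    mismatches = []
    for cycle, (a, e) in enumerate(zip(actual, expected)):
        diff = a ^ e                   # set bits mark disagreeing positions
        for bit in range(width):
            if (diff >> bit) & 1:
                mismatches.append((cycle, bit))
    return mismatches

# Hypothetical 4-bit vectors: the second cycle differs in bit 1.
expected = [0b1010, 0b0101, 0b1111]
actual = [0b1010, 0b0111, 0b1111]
print(compare_vectors(actual, expected, width=4))
```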
The next step in the digital design was to take the RTL design written in Verilog
and run RTL synthesis in Cadence using a .tcl script. After doing RTL synthesis, the
adaptation engine was not able to meet timing constraint at 1GHz clock speed, even with
the high-speed custom standard cells. To fix this timing failure, the critical path was
identified. The adaptation engine takes in 32 parallel data as the input and evaluates all
of them in 8 blocks (groups) of 4-bit data. These blocks are evaluated serially in the digital logic, hence
timing could not be closed. The solution was to use only one block of 4-bit data instead
of all 8 and discard the other 7 blocks (28 bits) of data every clock cycle. As a result, the
timing constraint is met with the standard cells, and the trade-off is that digital adaptation
takes 8x longer to achieve convergence.
Alternatively, one could make an argument to slow down the digital core clock to
250 MHz to meet timing, but this means that the recovered data going into the digital
adaptation has to be demuxed from 32 to 128 bits. To process all 128 bits, there is a
4x increase in the propagation delay of the combinational logic. Although 250 MHz means a 4x
longer period, there is zero gain in terms of timing since the delay scales linearly.
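The linear-scaling argument can be written out as follows, where t_32 denotes the (assumed) combinational delay for processing 32 bits and the delay is taken to grow in proportion to the number of bits, as stated above.

\[
\left.\frac{T_{clk}}{t_{logic}}\right|_{1\,\text{GHz}} = \frac{1\,\text{ns}}{t_{32}},
\qquad
\left.\frac{T_{clk}}{t_{logic}}\right|_{250\,\text{MHz}} = \frac{4\,\text{ns}}{4\,t_{32}} = \frac{1\,\text{ns}}{t_{32}}
\]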
The second method, which further improved the timing of the digital circuit, was to remove the
>= (greater than or equal) operation, which required a 5-bit digital comparator cascaded
bit by bit. Instead, >= was replaced with == so that parallel XNOR gates could be used,
which is a much cheaper operation in logic gates after RTL synthesis. Although
>= is always the safer operation in case there is a glitch in a digital bit, == had to
be used in order to meet the timing constraint for the testchip. Once the synthesized Verilog
met timing, it was again tested against the test vectors from Simulink to validate
error-free operation. NCSim, a digital verification tool from Cadence, was used after RTL
synthesis.
In the next step, the synthesized Verilog was used with the P & R (place & route)
flow in Cadence Innovus. This process again regenerates a new Verilog file representing
the end result of the place & route and was tested against the test vectors in NCSim.
Since the Verilog codes were validated against the test vectors generated from Simulink
in every step of the digital design process, it provided confidence that the synthesized
digital circuits will be functional post-tapeout even without AMS simulation. An AMS
simulation was omitted due to lack of time and resources, especially with a tight tapeout
schedule. A GDS (graphic database system) file created from P & R was streamed into
Cadence for the layout of the digital adaptation, and the final Verilog generated from P
& R was imported into Cadence for the schematic. With the imported schematic, an LVS
(layout-versus-schematic) check was performed to ensure that all the connections were
correct without shorts or opens.
In this section, measured results of the testchip from the lab will be discussed.
4.3.1 Testchip
The testchip of adaptive baud-rate CDR with CTLE and 1-tap DFE was fabricated in
TSMC 28nm HPC CMOS technology with a 0.9V supply. The testchip die was packaged
with an open-cavity QFN so that the high-speed input and output could be probed.
Figure 4.10 shows the packaged die under the microscope with the wire bond connections,
and Figure 4.11 is the package pinout instruction sent to the packaging company. Figure
4.12 zooms in on the die, with all the major building blocks highlighted in aqua
blue. The total testchip area was 1.57 mm wide by 0.785 mm high. The following
subsection will explain the test setup for the testchip.
Figure 4.10: Open-cavity QFN under a microscope showing wire bond connections for the proposed
adaptive baud-rate CDR
Figure 4.15 illustrates the testing setup for a normal operation of adaptive baud-rate
CDR. The packaged testchip was soldered onto a PCB as shown in Figure 4.13,
which is programmed via an Arduino Mega2560 (Figure 4.14) with a PC. Figure 4.10 depicts
the QFN package under a microscope. High-speed probes rated for 40G were used to probe
the high-speed PRBS input data. The SHF 12104A bit pattern generator was used to
generate both PRBS7 and PRBS31. Input data then passes through the Tyco 5” channel
using 36” SMA cables and is connected to a 40G bias tee before being connected to the
GSGSG probe head. The channel loss through this setup is shown in Figure 4.16 which
is measured using the Agilent N5222A PNA microwave network analyzer. The setup for
obtaining S21 channel characteristic is depicted in Figure 4.17. Figure 4.18 illustrates an
eye diagram of this PRBS input at 36 Gb/s, observed using the Agilent Infiniium DCA-J
86100C digital communication analyzer with an 86112A electrical module. At this data-
rate, the input eye is completely closed before being equalized. The measurement setup
for observing this PRBS input eye diagram is shown in Figure 4.19.
Figure 4.11: Package Pinout for D1: Adaptive baud-rate CDR with CTLE + 1-tap DFE
Figure 4.12: Die micrograph in TSMC 28nm HPC process for the proposed adaptive baud-rate CDR
Figure 4.13: High-speed testboard for design 1: adaptive baud-rate CDR testchip. Testboard is pro-
grammed and controlled by Arduino Mega2560 + PC
Figure 4.16: Measured S21 insertion loss of Tyco 5” channel with 36” cables and bias tees using a VNA. Bias
tees are required for setting the input common-mode as the SHF 12104A PRBS bit pattern generator
cannot set voltage offset to set the common-mode. Low-frequency loss is caused by poor low-frequency
performance of bias tees.
Figure 4.18: 36 Gb/s PRBS31 input eye measured using a sampling scope including all channel loss
On the output side, high-speed quarter-rate recovered clocks were probed at the output
pads for observing the clock spectrum and the phase noise with the Rohde & Schwarz
FSWP26 phase noise analyzer and VCO tester. Before the spectrum analyzer is connected,
the differential clocks being probed need to be converted to a single-ended signal
using a Narda 4346 180° coupler. For low-speed (250-500 Mb/s) or static digital signals,
the Agilent DSAX91604A Infiniium 16GHz real time digital storage oscilloscope was used
to observe and debug digital adaptation.
All measurements presented in this subsection were obtained with the setup shown in
Figure 4.15. With 35 Gb/s PRBS31 input data, the proposed adaptive baud-rate CDR
is able to converge to the optimal CDR settings and achieve a CDR lock. The CDR’s
recovered clock spectrum at 8.75 GHz quarter-rate is shown in Figure 4.20(a). The locked
clock spectrum exhibits a skirt characterized by the loop dynamics or the bandwidth of
the CDR. Phase noise of a locked CDR for the same converged adaptive settings is shown
in Figure 4.20(b). There is an overshoot present in the CDR’s loop dynamic but this
is the optimal setting with minimum overshoot for the analog loop filter present on the
chip. Since the loop filter was designed using fixed MOM capacitor values, we do not
have an extra tuning knob to completely fix the overshoot. The worst case phase noise
is at -104 dBc/Hz at 40 MHz offset. Total integrated jitter is 875.8 fs with a PRBS31
input pattern. Figure 4.21 represents the same measured results but for PRBS7. The
worst case phase noise is -105 dBc/Hz at 35 MHz offset. Total integrated jitter is 750.6 fs.
Figure 4.20: Measured clock spectrum and phase noise for locked CDR at 35 Gb/s
Figure 4.21: Measured clock spectrum and phase noise for locked CDR at 35 Gb/s
In addition, the converged setting is actually the optimal setting with the least amount
of undershoot. This becomes evident when we plot and observe the jitter tolerance
of settings around the converged setting after adaptation. Figure 4.23 shows that the
minimum 10-100MHz JTol (jitter tolerance) degrades rapidly as the CTLE’s parameter
Cs diverges from the converged value of 8, highlighted in red. When sweeping the
Cs value, we hold the comparator level α constant at the converged value (α = 133 mV).
Similarly, Figure 4.24 shows the result of taking the converged setting and manually sweeping
the comparator level α while holding Cs constant (Cs = 8). It is evident that the minimum
10-100MHz JTol is less sensitive to changes in the finer 9b comparator level α (512
settings) than to changes in the coarser 4b CTLE parameter Cs (16 settings).
Figure 4.25 demonstrates that the designed adaptive baud-rate CDR is able to adapt
to different channel losses, as all three curves pass the IEEE 802.3 masks. Different
channel losses were created by changing the input data-rate, which in essence changes
the channel loss at Nyquist, since the Nyquist frequency itself changes. New channels with
different attenuation could not be obtained, which is why the input data-rate
had to be swept. Counterintuitively, at the slower data-rate of 34 Gb/s, the proposed adaptive
baud-rate CDR actually performs worse because 34 Gb/s is at the
bottom of the VCO’s tuning range, where the KVCO gain is lower and the VCO may be very
noisy and perhaps not even monotonic. The testchip returned from fabrication
faster than the TT (typical) corner, so the center frequency of the VCO is higher than
intended. Ideally, with this process shift toward the FF corner, the CDR should
Figure 4.23: Minimum 10-100MHz jitter tolerance around converged adaptive setting, manually sweeping
CTLE parameter Cs
Figure 4.24: Minimum 10-100MHz jitter tolerance around converged adaptive setting, manually sweeping
comparator level α
Figure 4.25: Measured jitter tolerance for different channel losses by sweeping the data-rate, hence
Nyquist frequency
be operating at a higher range between 36-45 Gb/s but we do not have an appropriate
channel with 10-18 dB loss at Nyquist for the mentioned data-rates.
The final measurement performed in the lab is the power consumption. On the
PCB, sense resistors were installed to measure the current drawn
by the DUT (device under test) in each of the power domains. The power domains are
separated into VDDA, VDD CTLE, VDD DAC, VDD LDO, VDD DIG and VDD IO.
VDDA contains most of the analog circuits, including the comparators, demux, PD, DFE, CP,
LF and some clock buffers and clock dividers. The VDD CTLE domain powers the two-stage
CTLE. VDD DAC is a separate supply domain solely for the reference DAC used
to set the comparator’s threshold levels. The reference DAC’s power supply was kept
separate in case the reference DAC’s levels had to be adjusted independently from
the other power domains after tapeout. VDD LDO consists of the VCO, the VCO bias, the
LDO regulator and clock buffers. VDD DIG is for the synthesized digital circuit and
VDD IO is for the IO drivers from the TSMC standard library and the intermediate buffers to
the output pads for low-speed digital signals used for debugging (these also include heavily
down-sampled analog signals such as the comparator and the DFE outputs).
Since the adaptation engine implemented in digital is designed to turn off automatically
after convergence, the VDD DIG power is omitted from the total power consumption in
normal operation, although the power consumed by VDD DIG (6.3 mW) is still reported.
Likewise, the 2.7 mW of VDD IO is omitted, as the IO drivers and buffers were only present
for debugging the testchip. Figure 4.26 shows the measured power consumption with 35
Figure 4.26: Measured power consumption with 35 Gb/s PRBS31 input with CDR lock
Gb/s PRBS31 input while the CDR is phase locked and error free. The total power
consumption is 106.3 mW, which corresponds to 3.04 pJ/bit. Most of the power is consumed
by the VCO to bring down the phase noise. A ring VCO’s phase noise improves by 3 dB with
every doubling of the current consumption. Extra power was spent in the VCO to lower
the risk of the CDR not locking due to poor phase noise, especially due to inductance from
the package wirebonds. Therefore, if less power had been spent on the VCO by trading off
phase noise margin, the total power consumption of the CDR could have been reduced
considerably, yielding a better figure of merit (FOM) in terms of pJ/bit.
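The reported FOM and the 3 dB-per-doubling rule can be checked with a few lines of arithmetic. The following is only an illustrative sketch: the function name is hypothetical, and the 10·log10(2) relation assumes the usual thermal-noise-limited scaling of ring-oscillator phase noise with current.

```python
import math

def fom_pj_per_bit(power_mw, rate_gbps):
    """Energy per bit: mW divided by Gb/s gives pJ/bit."""
    return power_mw / rate_gbps

# Measured values quoted in the text for the adaptive baud-rate CDR.
fom_design1 = fom_pj_per_bit(106.3, 35.0)

# Assumption: phase noise scales inversely with current, so doubling the
# current improves phase noise by 10*log10(2) dB (about 3 dB).
pn_gain_per_doubling_db = 10 * math.log10(2)

print(f"FOM = {fom_design1:.2f} pJ/bit")
print(f"Phase noise gain per current doubling = {pn_gain_per_doubling_db:.2f} dB")
```

Read in reverse, the same relation suggests that halving the VCO current would cost roughly 3 dB of phase noise, which is the trade-off described above.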
Finally, Figure 4.27 compares the performance of the proposed work to prior works.
This is the first on-chip, live adaptation engine tailored for a baud-rate CDR where
the comparators are shared between the DFE and the PD to save power. This figure
concludes Chapter 4 on the adaptive baud-rate CDR with CTLE and 1-tap DFE.
Figure 4.27: Performance comparison to prior work for the same CDR architecture
Chapter 5
This chapter presents the details of the second baud-rate CDR design that was taped out
in TSMC 28nm technology. The proposed design #2 is a 2x half-baud-rate CDR with
CTLE and data decoder. This testchip consists of an analog CDR and a digital BERT, the
latter being the only synthesized digital block, incorporated via a place & route tool.
5.1 Background
Robustness is a crucial aspect of building a receiver for an I/O link. The Alexander
(2x-oversampled) bang-bang phase detector (BBPD), where the data is sampled twice, at the
center and the edge, has been prominent in clock and data recovery due to its robustness
and simple hardware implementation [4]. Figure 5.1 illustrates the simple hardware involved
in the Alexander 2x-oversampled BBPD. The basic operation is as follows. If the
previous data Dn and edge En are the same, then the clock is early. If the next data
Dn+1 and edge En are the same, then the clock is late. Since this is a bang-bang PD,
theoretically, the output should be totally non-linear. However, the jitter that is
inevitably present in practice linearizes the PD characteristic, where the slope, or the PD
gain, is a function of σ of the jitter.
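The early/late decision described above can be sketched in a few lines. This is a minimal behavioral illustration, not the circuit in Figure 5.1: the function name and the +1/-1/0 encoding are hypothetical, and returning 0 when there is no data transition is an assumption based on the usual Alexander PD behaviour.

```python
def alexander_pd(d_prev, edge, d_next):
    """Alexander bang-bang PD decision for one data/edge/data sample triple.

    d_prev: previous data sample Dn
    edge:   edge sample En taken between the two data samples
    d_next: next data sample Dn+1
    Returns +1 if the clock is early, -1 if it is late, 0 if undecided.
    """
    if d_prev == d_next:
        # Assumption: no transition means the edge sample carries no timing information.
        return 0
    if edge == d_prev:
        return +1  # edge sample still equals the previous bit: clock is early
    return -1      # edge sample already equals the next bit: clock is late


for triple in [(0, 0, 1), (0, 1, 1), (1, 1, 0), (1, 0, 0), (1, 1, 1)]:
    print(triple, alexander_pd(*triple))
```

Averaging these +1/-1 decisions over many transitions in the presence of jitter is what produces the linearized PD characteristic mentioned above.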
Chapter 5. Proposed 2x Half-Baud-Rate CDR 56
Figure 5.1: Schematic of Alexander 2x-oversampled bang-bang PD and its basic operation
Despite the robustness and simple hardware of the Alexander 2x-oversampled
BBPD, the recent trend has shifted towards baud-rate phase detectors as a means of reducing
power consumption by sampling only once per UI [36, 9, 7, 11, 14, 38]. However, it is
apparent from prior works [9, 7] that the Mueller-Muller phase detector (MMPD), which
is a popular baud-rate PD, is sensitive to equalization and symmetry in the
pulse response [25]. The MMPD’s lock point is at the middle of the symmetric pulse, where
the pre-cursor equals the post-cursor, as shown in Figure 5.2. As a result, if the pulse
response is not perfectly symmetric, the locking point will not be at the peak of the pulse
response. Also, the MMPD is only functional for uncorrelated random data and cannot lock
to an alternating 0101 pattern. These disadvantages of the MMPD will be compared to
the proposed 2x half-baud-rate scheme in a later section.
Following the trend of opting for a lower power consumption by reducing the number
of samples per UI, the most intuitive solution would be to take a baud-rate scheme and
somehow sample it even less frequently. A half-baud-rate scheme shown in Figure 5.3
where the data is sampled every other UI (0.5x-sampled) would potentially lower the
power consumption. Whenever the data is sampled every other UI, information about
the previous bit needs to be recovered, as illustrated in Figure 5.4. For a system with
only one significant post-cursor ISI, four distinct data levels exist: (h0+h1), (h0-h1),
(-h0+h1), (-h0-h1), where h0 is the magnitude of the main cursor and h1 is the post-cursor ISI,
as illustrated in the bottom portion of Figure 5.4. Samplers can be placed at appropriate
threshold levels (+/-Vref and 0) to recover the data for the unsampled UI, thus recovering
2 bits of data for each sample. Figure 5.5 illustrates the threshold levels in which the
CDR could yield an error-free data recovery for any sequence of 2-bit data (dn−1 , dn ):
(0,0), (0,1), (1,0), (1,1). However, green arrows in Figure 5.5 highlight the theoretical
maximum horizontal eye opening of 0.5 UI on the left and vertical eye opening margins
on the right. Small vertical eye opening translates into poor noise margin.
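The mapping between the four ISI levels and the two recovered bits can be illustrated numerically. The sketch below is only an example under stated assumptions: the cursor values h0 = 1.0 and h1 = 0.4 are hypothetical, Vref is assumed to sit at h0 (midway between the two outer levels of each polarity), and the function names are not from the design.

```python
def sample_level(d_prev, d_curr, h0=1.0, h1=0.4):
    """Sampled amplitude for bits (d_prev, d_curr) with one post-cursor ISI term."""
    a_prev = 2 * d_prev - 1  # map bit {0, 1} to symbol {-1, +1}
    a_curr = 2 * d_curr - 1
    return h0 * a_curr + h1 * a_prev


def decode_two_bits(v, vref):
    """Slice one sample against -vref, 0 and +vref to recover (d_prev, d_curr)."""
    d_curr = 1 if v > 0 else 0       # the zero threshold gives the current bit
    if v > vref:
        d_prev = 1                   # outer level h0+h1: previous bit was 1
    elif v < -vref:
        d_prev = 0                   # outer level -h0-h1: previous bit was 0
    else:
        d_prev = 1 - d_curr          # inner levels: previous bit differs from current
    return d_prev, d_curr


for d_prev in (0, 1):
    for d_curr in (0, 1):
        v = sample_level(d_prev, d_curr)
        print((d_prev, d_curr), v, decode_two_bits(v, vref=1.0))
```

With these assumed values, each level sits only h1 or (h0 - h1) away from the nearest threshold, which is the small vertical margin noted above.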
Despite this, half-baud-rate operation is theoretically feasible for a clock recovery as
well, even without an integration & dump technique. Figure 5.6 illustrates that clock
recovery could be achieved by adding two additional samplers at +/-α on top of the
three samplers originally required solely for data recovery. The white circles at +/-α
indicate the lock points. In comparison to the Mueller-Muller baud-rate CDR, this
would still require fewer comparators (samplers), even for a quarter-rate clocking
implementation. A total of 12 comparators would be required for an MM-CDR, whereas
the half-baud-rate (0.5x-sampled) CDR would only require 10 comparators. In addition,
since only every other UI is sampled, there would be a large power saving over the MM-CDR
in the clock distribution network as well. A half-baud-rate scheme without the need for
an integrate-and-dump technique sounds attractive in theory but is not very feasible in
practice due to its poor noise and jitter margins.
Figure 5.5: Eye diagram example of sub baud-rate data recovery by exploiting ISI. Green arrows on
the left show theoretical maximum horizontal eye opening of 0.5UI. Green arrows on the right show the
small vertical eye opening margins
Figure 5.6: Sub baud-rate (0.5x-sampled) data and clock recovery in comparison to Mueller-Muller CDR
the data and edge are sampled every other UI, the 2x half-baud-rate PD is essentially a 2x
oversampling BBPD at half-baud-rate that locks to the edge. Advantages of edge locking
will be delved into in this section. Figure 5.7 illustrates the full-rate block diagram of the
proposed 2x half-baud-rate CDR. Blocks highlighted in red are simple logic-gate circuits
required for the 2x half-baud-rate operation. The data decoder circuit is imperative for
the data-recovery of 1UI that is not sampled at all, and this is done by exploiting the
inherent ISI present in the system.
Figure 5.8 illustrates an eye diagram corresponding to a channel with one significant
post-cursor ISI while all other ISI terms are assumed to be minimized through a front-
end equalizer. We sample a UI by three comparators at the edge phase φe with their
outputs labeled as DL, ED, and DH, and by one comparator at the center phase φc with
its output labeled as DM, while we skip sampling the following UI altogether. Indeed, we
rely on ISI to recover the previous bit. In doing so, we perform 4 comparisons in every
other UI, or on average 2 comparisons per UI. By having the center and edge samples,
albeit in every other UI, this scheme inherits the benefits of a bang-bang PD (BBPD)
by locking to the edge, as will be demonstrated later. By skipping every other UI, the
proposed scheme shares the benefits of reduced hardware and low power consumption
with the baud-rate Mueller-Muller PD (MMPD).
We explain the phase detector (PD) and the data decoder (DD) logic by observing
samples from current UI (n). If at φe the data falls between +/-Vref , we conclude that
there is a data transition (0→1 or 1→0) at this phase and hence we will judge the
early/late by the output of the edge (ED) and the data (DM) comparators, similar to
a BBPD logic. If these two bits are identical, the clock is late; otherwise, it is early as
shown in the phase detector table of Figure 5.8.
Figure 5.7: Full-rate block diagram of the proposed 2x half-baud-rate CDR architecture
The DD only needs to observe the outputs of the data comparators (DH, DL, and
DM) to decode the current and the previous bit. Similar to a 1-tap speculative DFE,
the DD recovers the unsampled UI by slicing the data eye at a threshold that is adjusted
depending on the previous bit sequence. If the outputs of all three comparators are zero,
Dn−1 and Dn are both zero. Similarly, if the outputs of all three comparators are 1,
Dn−1 and Dn are both 1. If the data at φe falls between +/-Vref , it implies a transition
between Dn−1 and Dn . Therefore, by observing the sign of DM (which indicates Dn ), we
can find Dn−1 =D̄n . The data decoder logic is also summarized in a table in Figure 5.8.
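The phase detector and data decoder rules described above can be collected into one small behavioral function. This is a sketch of the described logic rather than the gate-level tables of Figure 5.8: it assumes DH and DL are the edge-phase comparators referenced to +Vref and -Vref respectively, and the function name, the "hold" label and the None return for combinations the text does not cover are hypothetical.

```python
def half_baud_pd_dd(dh, dl, ed, dm):
    """Behavioral sketch of the 2x half-baud-rate PD and data decoder.

    dh, dl: edge-phase comparator outputs (assumed thresholds +Vref and -Vref)
    ed:     edge-phase comparator output at the zero level
    dm:     center-phase data comparator output
    Returns (pd, d_prev, d_curr), or None for combinations not covered by the text.
    """
    if dl == 1 and dh == 0:
        # The edge sample lies between -Vref and +Vref: a data transition occurred.
        pd = "late" if ed == dm else "early"
        return pd, 1 - dm, dm          # DM gives Dn, and Dn-1 is its complement
    if dh == dl == dm:
        # All three data comparators agree: no transition, both bits equal DM.
        return "hold", dm, dm
    return None


print(half_baud_pd_dd(dh=0, dl=1, ed=1, dm=1))  # transition, ED equals DM
print(half_baud_pd_dd(dh=0, dl=1, ed=0, dm=1))  # transition, ED differs from DM
print(half_baud_pd_dd(dh=1, dl=1, ed=1, dm=1))  # no transition, both bits 1
```

Because only comparisons and one inversion are involved, the same logic maps directly onto a handful of logic gates, which is consistent with the low hardware cost claimed for the decoder.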
Although the eye diagram of the proposed 2x half-baud-rate scheme may look similar
to duobinary signalling, there are some advantages to the proposed scheme. The disad-
vantage of duobinary is that precoder & decoder are required and the precoder especially
is not trivial as stated in various prior works [40, 19]. The proposed 2x half-baud-rate
does not need a precoder and operates with conventional NRZ signalling. In addition,
the proposed data decoder is simple hardware made up of digital logic gates, which is
efficient in terms of both power and area.
Robustness of a conventional 2x oversampling BBPD and power saving of MMPD by
sampling at baud-rate are combined in the proposed 2x half-baud-rate PD. Both MMPD
and the proposed 2x half-baud-rate PD in Figure 5.9 display similar PD characteris-
tic over 1UI period when properly tuned and equalized, depicted by the black curves.
However, MMPD suffers significantly as the equalization setting and comparator level Vref
diverge from the optimal point. For instance, when an offset is present in the comparator
reference level +/-Vref, a dead zone forms for MMPD, as shown in Figure 5.9 (first row).
Similarly, the second row of Figure 5.9 shows that when residual ISI exists due to poor
front-end equalization, a dead zone appears for MMPD. It is apparent from the simulation
results that the proposed PD (similar to BBPD) is much less sensitive to these two
settings.
A system-level behavioral model was built in MATLAB Simulink for the quarter-rate
architecture as shown in Figure 5.10. Similar to the behavioral models built for the first
design: adaptive baud-rate CDR from Section 3.7, both continuous-time and event-driven
models in MATLAB Simulink were created for 2x half-baud-rate CDR. The details of
how continuous-time and event-driven behavioral models were built will be omitted as
they are designed the same way with minor tweaks in some of the building blocks such
as the PD and the addition of data decoder.
Using the event-driven model of the 2x half-baud-rate CDR, jitter tolerance was sim-
ulated with the injection of sinusoidal jitter. The jitter tolerance simulation was pro-
grammed in MATLAB to use a binary search algorithm with 5 iterations from the initial
search points (red line) shown in Figure 3.23. The high-frequency jitter tolerance is 0.5
UIpp and the minimum jitter tolerance is 0.394 UIpp. These jitter tolerance values are for
BER < 10^-6 due to simulation time; therefore, for BER < 10^-12, degradation is expected.
Figure 5.11: Jitter tolerance simulated for event-driven model of 2x half-baud-rate CDR (BER < 10^-6)
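The binary search over jitter amplitude mentioned above can be sketched as follows. This is a generic bisection written in Python for illustration, not the MATLAB script used in this work: the function names are hypothetical, the pass/fail callback stands in for a full event-driven CDR simulation at one jitter frequency, and the search assumes the lower starting point passes and the upper one fails.

```python
def jitter_tolerance(passes, lo, hi, iterations=5):
    """Bisect for the largest sinusoidal jitter amplitude (UIpp) that still passes.

    passes:     callable(amplitude) -> True if the BER target is met
    lo, hi:     initial search points, with lo assumed passing and hi assumed failing
    iterations: number of bisection steps (5 in the text)
    """
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        if passes(mid):
            lo = mid   # still error free: tolerance is at least mid
        else:
            hi = mid   # errors observed: tolerance is below mid
    return lo


# Stand-in for the CDR simulation: pretend the true tolerance is 0.394 UIpp.
def toy_cdr_passes(amplitude):
    return amplitude <= 0.394


estimate = jitter_tolerance(toy_cdr_passes, lo=0.0, hi=1.0)
print(f"Estimated tolerance: {estimate} UIpp")
```

With five iterations, the result is resolved to 1/32 of the initial search interval, so the starting points bound the accuracy of each point on the tolerance curve.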
Circuit implementation (schematic design) in Cadence and its simulation results will be
presented in this section. The quarter-rate circuit implementation follows Figure 5.10
such that the behavioral model and the schematic in Cadence match exactly.
compared to the plus and minus polarities of the input signal itself, hence, effectively the
zero-level.
The two critical circuit blocks, 1) the 2x half-baud-rate PD and 2) the data decoder, highlighted
by red boxes in Figure 5.10, are designed using custom high-speed digital logic
gates operating at 7.5 Gb/s in accordance with the truth table in Figure 5.8. These blocks,
made up of digital logic gates, are very low in complexity and consume very little
power.
The CDR’s loop remains a higher-order RC loop of a type-II PLL for the proposed
2x half-baud-rate CDR. While the charge pump and the loop filter are the same as in
the proposed adaptive baud-rate CDR, the 8-stage ring VCO is tuned such that the centre
frequency is a little lower to compensate for the fact that there is no DFE. Therefore, the
proposed 2x half-baud-rate CDR is not able to tolerate the same data-rate or the same
attenuation. The quarter-rate VCO clock is divided down by a factor of eight to be used
for the digital BERT. Similarly, in the data path, the output of the data decoder is demuxed
to produce 32 parallel data signals for the digital BERT. This digital BERT gives the true
error-rate, as opposed to checking one of the demuxed data paths in the analog domain.
A demuxed version of a PRBS is guaranteed to be a PRBS, but not the other way around.
Therefore, finding one of the demuxed data paths to be error-free does not imply that
the interleaved version of all parallel paths is error-free. As a result, the digital
BERT is imperative for obtaining the true bit error rate, and it is used for the BER
measurement of the testchip.
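The demux argument above can be demonstrated with a short experiment. The sketch below is illustrative only: it uses PRBS7 and 4 lanes instead of PRBS31 and the 32 lanes of the testchip, the recurrence s[k] = s[k-6] XOR s[k-7] is one common PRBS7 convention, and all function and variable names are hypothetical.

```python
def prbs7(n, seed=0b1111111):
    """First n bits of a PRBS7 sequence defined by s[k] = s[k-6] XOR s[k-7]."""
    bits = [(seed >> i) & 1 for i in range(7)]
    while len(bits) < n:
        bits.append(bits[-6] ^ bits[-7])
    return bits[:n]


def recurrence_errors(bits):
    """Count positions where a stream violates the PRBS7 recurrence."""
    return sum(bits[k] != (bits[k - 6] ^ bits[k - 7]) for k in range(7, len(bits)))


LANES = 4  # a power of two, like the 32-way demux in the design
stream = prbs7(LANES * 300)

# Each demuxed lane of a clean PRBS7 still obeys the PRBS7 recurrence.
clean_lane_errors = [recurrence_errors(stream[i::LANES]) for i in range(LANES)]

# Flip a single bit that lands in lane 1 only.
corrupted = list(stream)
corrupted[1 + LANES * 100] ^= 1
lane0_errors = recurrence_errors(corrupted[0::LANES])  # lane 0 still looks perfect
full_errors = recurrence_errors(corrupted)             # interleaved check sees the error

print(clean_lane_errors, lane0_errors, full_errors)
```

A checker attached to lane 0 alone reports no errors even though the interleaved stream is corrupted, which is why the BERT has to operate on all parallel paths together.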
The simulation results of the closed-loop CDR simulations with post-layout extraction
will be discussed in this subsection for the proposed 2x half-baud-rate CDR.
Figure 5.12 illustrates that CDR is error free for all parallel paths of the data after
phase lock. To ensure that the interleaved data is also a PRBS, a test with a PRBS7
was conducted to verify that the parallel paths are still a PRBS7 when the parallel
recovered data are interleaved and combined manually. For a PRBS31, the synthesized
digital BERT written in Verilog could be used to check that the CDR is error free when
measured in the lab after fabrication.
Figure 5.13 illustrates the frequency of the recovered clock. It fluctuates & dithers at
an average value of 7 GHz quarter-rate for 28 Gb/s PRBS31 data. Figure 5.14 shows
the Vctrl which is the control voltage going into the VCO. It fluctuates after the CDR
is locked since the PD periodically dithers between early and late. The peak-to-peak
amplitude of Vctrl matched the Simulink simulation result very closely, confirming a close
resemblance between behavioral model and the circuit simulation.
Figure 5.12: Simulated results of closed loop CDR at 32 Gb/s PRBS31 input in Cadence: Error count
Figure 5.13: Simulated results of closed loop CDR at 32 Gb/s PRBS31 input in Cadence: Recovered
clock frequency
Figure 5.14: Simulated results of closed loop CDR at 32 Gb/s PRBS31 input in Cadence: Vctrl of VCO
Figure 5.15: Open-cavity QFN under a microscope showing wire bond connections for the proposed 2x
half-baud-rate CDR
The digital BERT from Figure 5.10 is the only circuit that is synthesized in digital and placed &
routed. The digital BERT takes in 32 parallel down-sampled recovered data as the input
and interleaves them to check that it is still an error-free PRBS pattern. Errcnt[19:0] is
the total error count, err is the bit error for every clock cycle and erronce is a flag that
stays high if there is at least one bit error after the BERT is enabled. Because an
AMS (analog/mixed signal) verification tool was not set up for the TSMC 28nm design
kit, the interface between the analog CDR and the digital BERT was never simulated. Since
the analog CDR locked with post-layout extraction and the digital BERT was tested
separately in ModelSim and NCSim throughout the digital design stages, the chance of
the interface breaking was minimized. In addition, the digital BERT has the option to
flip the LSB and MSB order of the input data in case the bus ordering at the analog/digital
interface does not match.
In this section, measured results of the testchip from the lab will be presented.
Figure 5.16: Package Pinout for D2: Non-uniform baud-rate CDR with CTLE
Figure 5.17: Die photo in TSMC 28nm HPC process for the proposed 2x half-baud-rate CDR. Dimensions
of each building block are listed in a table.
5.4.1 Testchip
The testchip of 2x half-baud-rate CDR with CTLE was fabricated in TSMC 28nm HPC
CMOS technology with a 0.9V supply. The testchip die was packaged with an open-cavity
QFN so that the high-speed input could be probed. Under the microscope, Figure 5.15
reveals the packaged die with the wire bond connections and Figure 5.16 is the package
pinout instruction sent to the packaging company. Figure 5.17 zooms further into the
die, with all the major building blocks highlighted. The total testchip area was 1.57
mm in width by 0.785 mm in height. The total die area is 1.232 mm² and the area consumed
by the building blocks of the CDR is only 0.135 mm². The following subsection will
explain the test setup for the testchip.
Figure 5.19 illustrates the testing setup for normal operation of the 2x half-baud-rate
CDR. The packaged testchip was soldered onto a PCB (Figure 5.18)
that is programmed via an Arduino Mega2560 with a PC. Figure 5.15 depicts the QFN
Figure 5.18: High-speed testboard for design 2: non-uniform baud-rate CDR testchip. Testboard is
programmed and controlled by Arduino Mega2560 + PC
package under a microscope. High-speed probes rated for 40G were used to probe the
high-speed PRBS input data. The SHF 12104A bit pattern generator was used to generate
both PRBS7 and PRBS31. The input data then passes through the Tyco 5” channel using
36” SMA cables and is connected to a 40G bias tee before being connected to the SGS
probe head. An SGS probe had to be used instead of GSGSG due to limited clearance
between the wirebonds. The channel loss through this setup, shown in Figure 5.20,
was measured using the Agilent N5222A PNA microwave network analyzer.
For the output recovered clock, CK/16 was observed instead of probing the high-speed
quarter-rate clock because there was not enough clearance between the wirebonds to land
the probes. The clock spectrum and the phase noise of the low-speed, divided-down version
of the recovered clock (CK/16) were observed using the Rohde & Schwarz FSWP26 phase noise
analyzer and VCO tester. For low-speed (250-500 Mb/s) or static digital signals, the
Agilent DSAX91604A Infiniium 16GHz real time digital storage oscilloscope was used.
All measurements presented in this subsection were done with the setup shown in Figure
5.19(a), using the Tyco 5” channel with 13.06 dB loss at Nyquist for 30 Gb/s.
Initially, the VCO’s frequency is manually tuned for 30 Gb/s operation for an
Figure 5.20: Measured S21 insertion loss of the Tyco 5” channel with 36” cables and bias tees using a VNA. Bias
tees are required for setting the input common-mode, as the SHF 12104A PRBS bit pattern generator
cannot apply a voltage offset to set the common-mode.
open-loop CDR as shown in Figure 5.21. Figure 5.22(a) illustrates the measured clock
spectrum of the divided recovered clock (CK/16 ) for PRBS31 when the CDR is locked.
The locked clock spectrum exhibits a skirt characterized by the loop dynamics or the
bandwidth of the CDR. The integrated jitter from the phase noise plot is 823.5 fs for
PRBS31. The recovered clock spectrum for PRBS7 and its phase noise plot are shown in
Figure 5.23. The integrated jitter for PRBS7 is lower, as expected, at 731.8 fs.
In addition, the capture range was measured to be -2300 ppm to +66000 ppm. The
larger range in the positive direction is due to the asymmetric nature of the 2x
half-baud-rate PD logic, where the data sample always follows the edge sample, not the other
way around. This property makes frequency acquisition available for free in one direction
without adding any additional feedback loop to the CDR. In other words, in the positive
direction, where the incoming data is faster than the CDR’s initial VCO frequency, the PD
is able to pull up the VCO frequency by +66000 ppm (equivalently 2 Gb/s) to a frequency
lock and then track the phase to achieve a phase lock.
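The conversion between the capture range in ppm and the equivalent data-rate offset is simple arithmetic, sketched below for the 30 Gb/s operating point. The function name is hypothetical.

```python
def ppm_to_rate_offset_gbps(ppm, rate_gbps):
    """Convert a frequency offset in ppm into an equivalent data-rate offset in Gb/s."""
    return rate_gbps * ppm * 1e-6


pull_up = ppm_to_rate_offset_gbps(66000, 30.0)    # positive capture limit
pull_down = ppm_to_rate_offset_gbps(-2300, 30.0)  # negative capture limit

print(f"+66000 ppm at 30 Gb/s is {pull_up:.2f} Gb/s")
print(f"-2300 ppm at 30 Gb/s is {pull_down:.3f} Gb/s")
```

The positive limit works out to about 1.98 Gb/s, which matches the approximately 2 Gb/s quoted above, while the negative limit is only about 0.07 Gb/s, illustrating how asymmetric the capture range is.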
The measured jitter tolerance with sinusoidal jitter injected at the input bit pattern
generator is shown in Figure 5.24. The jitter tolerance curves for both PRBS31 and PRBS7
pass the IEEE 802.3 masks, although PRBS31 passes only marginally. The proposed CDR
was originally designed for 28 Gb/s. After fabrication, however, the VCO’s tuning range
was shifted up due to a process shift, perhaps to an FF corner, so the VCO’s frequency
cannot be brought down to 28 Gb/s even after adjusting the VCO’s supply voltage and
the bias tail current. As a result, the measured jitter tolerance at 30 Gb/s is not as high,
since the CDR was never designed to operate at such a speed.
Figure 5.25 illustrates the power breakdown per block. The total power consumption
measured is 79.2 mW and the FOM is 2.64 pJ/bit (at 30 Gb/s). It is clear that the
VCO and the clocking have been over-designed for phase noise; therefore, there is room for
improvement in terms of power and FOM. Omitting the VCO and the clocking power,
only 25 mW is consumed. Finally, the table in Figure 5.26
compares the performance of the proposed 2x half-baud-rate CDR to recently published
baud-rate CDRs. This work is the first reported 2x half-baud-rate CDR, which is 2x
oversampling at half-baud-rate, hence sampling every other UI and locking to the edge.
Figure 5.22: Measured clock spectrum and phase noise for locked CDR at 30 Gb/s
Figure 5.23: Measured clock spectrum and phase noise for locked CDR at 30 Gb/s
Figure 5.24: Measured jitter tolerance with sinusoidal jitter injected at 30 Gb/s PRBS31 & 7
In a single tapeout shuttle run, two separate wireline receiver chips with two separate
CDRs were designed and fabricated for this research. It is known that meeting the
tapeout deadline for a single CDR design itself is quite onerous due to the sheer size
of the CDR as a system. In addition, doing two system level designs and simulations,
schematic designs and simulations, digital designs and verification and layout designs is
very time consuming. This chapter will delve into the design methodology that allowed
the design of two separate CDRs within a very tight tapeout time-frame. Furthermore,
advanced layout techniques implemented will be discussed.
As discussed in Section 5.2.1, the two CDR designs share most of the continuous-time and
event-driven behavioral models. When designing the second CDR, most of the building
blocks were recycled with some minor tweaks, such as the new PD and the data decoder.
For the second CDR design, a half-rate clocking architecture was preferred over the
quarter-rate clocking architecture, but due to the tight tapeout schedule, the quarter-rate clocking
scheme was re-used from the initial design stage when building the behavioral model so
that completely new continuous-time and event-driven models did not have to be built.
The schematic and the layout were designed to be shared and re-used for both CDR
designs. Many of the circuit components, such as the CTLE, the charge pump, the
loop filter and much of the biasing circuitry, are shared. Even circuits that are different,
such as the comparators and the VCOs tailored to each CDR’s needs, were only slightly
modified instead of doing a full re-design. Furthermore, filler cells, decoupling capacitor
Chapter 6. Chip Design Methodology 81
Figure 6.1: Layout of the full-chip die with two CDRs on top & bottom
cells, IO pads, and the power grid were designed to be shared and used for both CDR
designs. Figure 6.1 illustrates that the top CDR is a mirror of the bottom CDR with
tweaks made to the building blocks and routing. Without sharing many of the common
blocks, the tapeout deadline could not have been met for such a large layout of 1.57mm
by 1.57mm die area in a 28nm technology. Figure 6.2 reveals the top aluminum (AP)
layer used to distribute power vertically to the metal8 layer below. This figure clearly
illustrates that the power grid, output pads and the IO were also shared and re-used for
both CDR designs.
6.3.1 Matching
The matching of the double-tail latch clocked comparators in layout was critical as a
mismatch in device properties such as the threshold voltage (Vt ) can eat into the timing
Figure 6.2: Layout of the full-chip die showing top aluminum layer for power distribution
and noise margin. Therefore, although offset calibration was built for the comparators,
comparator layout was done with meticulous care. First, common centroid layout tech-
niques [22, 21] were used for sets of four quarter-rate comparators. As a result, any linear
gradients in the die tend to cancel out. MOSFETs are afflicted by gradients in etching,
in Vt , and in oxide thickness. All capacitor and resistor arrays incorporated common
centroid layout to enhance relative matching between them as well. In addition to the
common centroid layout technique, the quarter-rate comparators were interconnected
using a symmetrically RC distributed H-tree method [30] for routing data and clocks to
minimize skew and delay offset. A disadvantage of the H-tree method is that the extra
routing metal adds parasitic capacitance, which increases the absolute delay and requires
larger clock buffers. However, the lower skew between the quarter-rate comparators
provides timing margin, which is critical for the CDR’s front-end.
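The gradient-cancelling property of a common centroid placement can be illustrated with a minimal one-dimensional sketch. It assumes Vt varies linearly with position; the nominal Vt, the gradient value, and the helper names are hypothetical and are not taken from the testchip, which used sets of four quarter-rate comparators rather than this two-device row.

```python
# Hypothetical 1-D illustration: four unit devices in a row, split between
# two matched devices A and B. Vt is assumed to vary linearly with position.

def mean_vt(positions, vt0=0.35, gradient=0.002):
    """Average Vt (in volts) of the unit devices at the given positions."""
    return sum(vt0 + gradient * x for x in positions) / len(positions)

def mismatch(pattern):
    """Vt difference between A and B for a placement string such as 'ABBA'."""
    a = [i for i, d in enumerate(pattern) if d == "A"]
    b = [i for i, d in enumerate(pattern) if d == "B"]
    return mean_vt(a) - mean_vt(b)

side_by_side = mismatch("AABB")     # centroids of A and B differ
common_centroid = mismatch("ABBA")  # centroids of A and B coincide

print(f"AABB mismatch: {side_by_side * 1e3:+.1f} mV")
print(f"ABBA mismatch: {common_centroid * 1e3:+.1f} mV")
```

Because the centroids of A and B coincide in the ABBA arrangement, any linear term contributes equally to both devices and drops out of the difference. Non-linear gradients are not cancelled by this argument.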
In addition to the common centroid and H-tree interconnect techniques discussed
above, this chip design uses “interdigitation” [6], a technique commonly found in
analog differential pairs. Interdigitation was mainly implemented in CML circuits such
as the CTLE and the ring oscillator delay stages in the VCO. Interdigitation lowers the
device mismatch between M1 and M2 of the input diff-pairs and also helps to cancel out
linear gradients in the die, similar to common centroid layout.
This subsection will discuss the design methodology for EMIR (electromigration and
IR drop). Considerations were made for the maximum current that a specific width of
metal wire can carry during the physical layout design to pass electromigration (EM)
requirements. The maximum current values (Imax ) were found in the design rule check
document for the TSMC 28nm process. In order to meet the allowable maximum current
density for CML analog blocks, multiple fingers had to be used instead of a device with
a larger width (W). By increasing the number of fingers, more metal tracks are available
for the source/drain to meet EM rules.
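The finger-count reasoning above can be sketched as a short calculation. This is a simplified model that assumes the device current divides evenly, with one source/drain metal track per finger; the current values are placeholders and are not the Imax limits from the TSMC design rule document.

```python
import math

def min_fingers(device_current_ma, imax_per_track_ma):
    """Smallest finger count such that each source/drain track carries no
    more than imax_per_track_ma, assuming the current splits evenly."""
    if imax_per_track_ma <= 0:
        raise ValueError("imax_per_track_ma must be positive")
    return math.ceil(device_current_ma / imax_per_track_ma)

# Placeholder numbers for illustration only.
fingers = min_fingers(device_current_ma=4.0, imax_per_track_ma=0.5)
print(f"at least {fingers} fingers needed")
```

Splitting one wide device into this many fingers keeps the total width unchanged while spreading the current over more tracks.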
In terms of IR drop, a full mesh power grid with via stacks was incorporated to
minimize IR drop on the supply. For example, the aluminum (AP) layer of the power
grid was routed vertically and metal8 (M8) layer below was routed horizontally to form
a power mesh from top to bottom, all the way down to the base of the transistor.
Furthermore, to improve EM, staggered output pads can be seen in Figure 6.2 on some
power supply pads that sink a lot of current, such as those of the VCO. Staggering allows
more pads to be placed and wirebonded when packaging the chip. To improve the EM
even further and to reduce pad inductance due to the double bonding that was applied
Many other layout considerations will be discussed in this subsection. First, for signal
integrity, all high-speed clocks that were routed a long distance were shielded. Shield-
ing the clock routes with VSS ground metal traces reduces the mutual inductance and
capacitive cross-talk on high-speed quadrature clock routes.
ESD was also considered during the layout stage of the chip design. ESD diodes and
secondary ESD diodes were added to protect the gates from signals coming into the input
pads. ESD clamps from the TSMC ESD library were placed between power/ground to
properly clamp the supplies during an electrostatic discharge.
In order to prevent any latch-up, an adequate number of n-taps and p-taps were placed
in the layout. For custom analog layouts, all the MOS devices were laid out in such a
way that each device always saw the same environment, including the n-tap/p-tap. For
example, if there were many rows of NMOS devices, p-taps that connect the substrate
to the ground were placed in between every row as well as at the outer edges. This way,
every row of NMOS devices saw the same distance to the top and bottom p-taps.
PMOS devices must reside in an n-well, which is more isolated in terms of noise
than the p- substrate that holds the NMOS devices. The deep n-well
(DNW) layout technique was used to provide noise isolation for the NMOS devices inside
isolated p-wells. In addition, DNW had to be put in to provide ground isolation between
digital ground (vss dig) and analog ground (vssa), since they cannot share the same
p-substrate or they will be shorted. The different ground domains were shorted
together only at the PCB level.
During tapeout, the digital adaptation engine was designed for both baud-rate CDR
designs. However, for the second CDR design from Chapter 5, the adaptation engine did
not work on the testchip after fabrication, which is why its details are omitted from this
thesis. Nonetheless, during the chip design process, even the digital flow from RTL
synthesis through place & route was shared between the two CDR designs. The digital flow
script just had to be modified to point to different Verilog files for their respective digital
adaptation scheme. Furthermore, the area allocated for the digital block after place &
route in the layout was the same as well. Therefore, once the layout was streamed into
Cadence Virtuoso, it fit perfectly for both CDR layouts. The pin coordinates were also
preserved during the digital flow for P & R, thus, routing to the input and output of the
P & R layout was shared between the two CDRs in order to save valuable time.
Chapter 7
Conclusion
In summary, this thesis began by motivating the need for a baud-rate CDR with the
goal of reducing power consumption. In Chapter 2, the background to the first design
(adaptive baud-rate CDR) was presented. Chapter 3 followed up with the details of the
proposed adaptive engine. Chapter 4 shared the simulated and measured results. For the
second design (2x half-baud-rate CDR) presented in Chapter 5, all of the background,
proposed 2x half-baud-rate scheme and the measured results were self-contained within
the same chapter. In addition, Chapter 6 gave insight into how the two chips were
designed simultaneously in an efficient manner, as well as some advanced layout
techniques incorporated in the testchips.
The contributions from each of the two baud-rate CDR designs will be summarized in
this section as a conclusion.
The first contribution of this thesis is the proposed adaptive baud-rate CDR with
CTLE and 1-tap DFE. The novelty in this design is the adaptation engine tailored for
baud-rate clock and data recovery where the comparators for the DFE and the PD are
shared to save power. A testchip was fabricated in TSMC 28nm HPC CMOS technology
with a 0.9 V supply. The adaptation engine is demonstrated for 34-36 Gb/s operation
with a Tyco 5” channel resulting in 15.05-18.25 dB channel losses. Measurement in the
lab demonstrated that the testchip is able to pass the IEEE 802.3 jitter tolerance masks
for the mentioned channel losses. At 35 Gb/s, the total power consumption is measured
to be 106.3mW or a FOM of 3.04 pJ/bit. A paper that presents this 36Gb/s adaptive
baud-rate CDR has been submitted to ISSCC 2019.
The second contribution is the proposed 2x half-baud-rate clock and data recovery
technique using both data and edge samples every other UI (half-baud-rate) to lock at
the edge. A testchip was also fabricated in TSMC 28nm HPC CMOS technology with
a 0.9 V supply. A 30 Gb/s 2x half-baud-rate CDR was tested with a Tyco 5” channel
with 13.06 dB of loss. The total power consumption is measured to be 79.2 mW or a
FOM of 2.64 pJ/bit. A paper written for a 30Gb/s 2x-half-baud-rate CDR also has been
submitted to ISSCC 2019.
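The two FOM figures quoted above follow from dividing the measured power by the data-rate. The short check below uses only the numbers stated in this chapter; the function name is illustrative.

```python
def fom_pj_per_bit(power_mw, data_rate_gbps):
    """Energy per bit in pJ/bit. Since 1 mW / 1 Gb/s = 1 pJ/bit, the
    ratio of the two numbers needs no further scaling."""
    return power_mw / data_rate_gbps

adaptive = fom_pj_per_bit(106.3, 35.0)   # adaptive baud-rate CDR at 35 Gb/s
half_baud = fom_pj_per_bit(79.2, 30.0)   # 2x half-baud-rate CDR at 30 Gb/s

print(f"adaptive baud-rate CDR:  {adaptive:.2f} pJ/bit")
print(f"2x half-baud-rate CDR:   {half_baud:.2f} pJ/bit")
```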
In conclusion, two separate CDR testchips were fabricated in a 28nm process technol-
ogy and successfully measured in the lab.
There are several ways to follow up on the work from this thesis, which are broken down
into two sections, one for each of the two designs.
One possible future work is to take the pattern-based baud-rate CDR where the comparators
are shared between the DFE and the PD and turn it into a PAM4 receiver. Since
PAM4 signaling [5, 14, 38] sends/receives two bits per symbol, it is a popular approach
for achieving a higher data-rate. In addition, if PAM4 signaling is indeed feasible, an
adaptive scheme that is compatible with PAM4 should be studied as well.
Second, the CDR’s equalization capabilities could be improved. For example, a new
tuning knob could be added to the CTLE. By tuning the source degeneration resistance,
the peaking could be improved by lowering the DC gain. In addition, another tuning
knob could be added to the speculative DFE as well. A 2-tap speculative DFE would be
able to cancel two post-cursor ISI terms as opposed to one. Adding more knobs would
complicate the adaptation, but in exchange, higher channel attenuation could be handled.
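The benefit of a second DFE tap can be illustrated with a small behavioural sketch. For simplicity it models a direct-feedback DFE rather than the speculative (loop-unrolled) architecture of the testchip, and the pulse response values and function names are made up for illustration.

```python
def received(symbols, pulse):
    """Convolve +/-1 symbols with a causal pulse response (main cursor first)."""
    return [sum(pulse[k] * symbols[n - k]
                for k in range(len(pulse)) if n - k >= 0)
            for n in range(len(symbols))]

def dfe(samples, taps):
    """Subtract tap-weighted past decisions from each sample, then slice at 0."""
    decisions, corrected = [], []
    for n, x in enumerate(samples):
        feedback = sum(t * decisions[n - 1 - k]
                       for k, t in enumerate(taps) if n - 1 - k >= 0)
        y = x - feedback
        corrected.append(y)
        decisions.append(1 if y >= 0 else -1)
    return corrected, decisions

pulse = [1.0, 0.4, 0.2]              # main cursor, h1, h2 (illustrative values)
data = [1, -1, -1, 1, 1, -1, 1, -1]
rx = received(data, pulse)

one_tap, d1 = dfe(rx, [0.4])         # h2 is left uncancelled
two_tap, d2 = dfe(rx, [0.4, 0.2])    # both post-cursors are cancelled

print("1-tap levels:", [round(y, 2) for y in one_tap])
print("2-tap levels:", [round(y, 2) for y in two_tap])
```

With one tap, the second post-cursor remains and the corrected levels spread around the ideal +/-1; with two taps the levels return to +/-1 for this idealized channel.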
First, a 2x half-baud-rate CDR could be improved in terms of power. The VCO made
up the majority of the power consumption, which hindered the design from achieving a
state-of-the-art FOM for a high-speed wireline CDR. Instead of wirebonding, flip-chip
packaging could be used, which would eliminate the wirebond inductance. In addition,
an LC VCO could be implemented to lower the phase noise and improve the power
consumption, at the cost of a reduced tuning range. Since the LC VCO already requires
inductors, t-coils could also be implemented in the front-end to extend the bandwidth,
and inductive peaking could be applied to the CTLE to extend the bandwidth of the
high-frequency boost. As a result, the data-rate could be
improved as well.
Second, adaptation of the Vref and CTLE settings could be implemented for the 2x
half-baud-rate CDR. Our adaptation scheme for this proposed CDR did not work after
fabrication due to an error in the interface between the analog and the digital circuits.
In future work, the adaptive scheme could be fixed and implemented properly.
Third, a 2x half-baud-rate CDR that can tolerate a higher channel loss for MR and
LR applications would be an interesting project. The proposed 2x half-baud-rate CDR
in this thesis was tailored for an XSR/USR application with only the CTLE as the
main equalization scheme. For a higher-loss application such as for a backplane, a direct
feedback DFE with multiple taps could be implemented after the CTLE. At the summing
node of the DFE, which is still an analog eye, the phase detector and the data decoder
may still work.
Lastly, similar to the possible future work for the first design, the feasibility of PAM4
signaling should be studied for a 2x half-baud-rate CDR. If PAM4 is indeed feasible, it
would be possible to double the data-rate at the same clock rate.
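The data-rate doubling follows from PAM4 carrying two bits per symbol. The sketch below assumes the commonly used Gray-coded level mapping, which is not a mapping specified in this thesis, and reuses the 30 Gb/s figure of the second design as the symbol rate for illustration.

```python
# Gray-coded PAM4 mapping (an assumed, commonly used convention):
# adjacent levels differ by exactly one bit.
GRAY_PAM4 = {(0, 0): -3, (0, 1): -1, (1, 1): 1, (1, 0): 3}

def pam4_encode(bits):
    """Map an even-length bit sequence to PAM4 levels, two bits per symbol."""
    if len(bits) % 2:
        raise ValueError("PAM4 needs an even number of bits")
    return [GRAY_PAM4[(bits[i], bits[i + 1])] for i in range(0, len(bits), 2)]

def data_rate_gbps(symbol_rate_gbaud, bits_per_symbol):
    """Data-rate is the symbol rate times the number of bits per symbol."""
    return symbol_rate_gbaud * bits_per_symbol

symbols = pam4_encode([0, 0, 0, 1, 1, 1, 1, 0])
nrz_rate = data_rate_gbps(30, 1)    # NRZ: one bit per symbol
pam4_rate = data_rate_gbps(30, 2)   # PAM4: two bits per symbol

print(symbols, nrz_rate, pam4_rate)
```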
Bibliography
[1] IEEE Standard for Ethernet - Amendment 10: Media access control parameters, physical
layers, and management parameters for 200 Gb/s and 400 Gb/s operation.
IEEE Std 802.3bs-2017 (Amendment to IEEE 802.3-2015 as amended by IEEE’s
802.3bw-2015, 802.3by-2016, 802.3bq-2016, 802.3bp-2016, 802.3br-2016, 802.3bn-2016,
802.3bz-2016, 802.3bu-2016, 802.3bv-2017, and IEEE 802.3-2015/Cor1-2017),
pages 1–372, Dec 2017.
[2] IEEE Standard for Ethernet - Amendment 11: Physical layer and management parameters
for serial 25 Gb/s Ethernet operation over single-mode fiber. IEEE Std 802.3cc-2017
(Amendment to IEEE Std 802.3-2015 as amended by IEEE’s 802.3bw-2015,
802.3by-2016, 802.3bq-2016, 802.3bp-2016, 802.3br-2016, 802.3bn-2016, 802.3bz-2016,
802.3bu-2016, 802.3bv-2017, 802.3-2015/Cor 1-2017, and 802.3bs-2017),
pages 1–45, Jan 2018.
[3] A. A. Abidi. Phase noise and jitter in CMOS ring oscillators. IEEE Journal of Solid-
State Circuits, 41(8):1803–1816, Aug 2006.
[4] J.D.H. Alexander. Clock recovery from random binary data. 11:541 – 542, 02 1975.
[5] M. Bassi, F. Radice, M. Bruccoleri, S. Erba, and A. Mazzanti. 3.6 A 45Gb/s PAM-4
transmitter delivering 1.3Vppd output swing with 1V supply in 28nm CMOS FDSOI. In
2016 IEEE International Solid-State Circuits Conference (ISSCC), pages 66–67, Jan
2016.
[6] J. D. Bruce, H. W. Li, M. J. Dallabetta, and R. J. Baker. Analog layout using ALAS!
IEEE Journal of Solid-State Circuits, 31(2):271–274, Feb 1996.
[7] R. Dokania, A. Kern, M. He, A. Faust, R. Tseng, S. Weaver, K. Yu, C. Bil, T. Liang,
and F. O’Mahony. 10.5 A 5.9pJ/b 10Gb/s serial link with unequalized MM-CDR in
14nm tri-gate CMOS. In 2015 IEEE International Solid-State Circuits Conference -
(ISSCC) Digest of Technical Papers, pages 1–3, Feb 2015.
[11] J. Han, Y. Lu, N. Sutardja, and E. Alon. 6.2 A 60Gb/s 288mW NRZ transceiver
with adaptive equalization and baud-rate clock and data recovery in 65nm CMOS
technology. In 2017 IEEE International Solid-State Circuits Conference (ISSCC),
pages 112–113, Feb 2017.
[15] Raj Jain. The art of computer systems performance analysis - techniques for experi-
mental design, measurement, simulation, and modeling. Wiley professional comput-
ing. Wiley, 1991.
[17] H. Y. Joo, K. S. Ha, and L. S. Kim. A data pattern-tolerant adaptive equalizer using
spectrum balancing method. In 2009 Symposium on VLSI Circuits, pages 220–221,
June 2009.
[18] Y. H. Kim, Y. J. Kim, T. Lee, and L. S. Kim. A 21-Gbit/s 1.63-pJ/bit adaptive CTLE
and one-tap DFE with single loop spectrum balancing method. IEEE Transactions
on Very Large Scale Integration (VLSI) Systems, 24(2):789–793, Feb 2016.
[19] J. Lee, M. Chen, and H. Wang. Design and comparison of three 20-Gb/s backplane
transceivers for duobinary, PAM4, and NRZ data. IEEE Journal of Solid-State Circuits,
43(9):2120–2133, Sept 2008.
[20] Jri Lee. A 20Gb/s adaptive equalizer in 0.13µm CMOS technology. In 2006
IEEE International Solid State Circuits Conference - Digest of Technical Papers,
pages 273–282, Feb 2006.
[22] Di Long, Xianlong Hong, and Sheqin Dong. Optimal two-dimension common centroid
layout generation for MOS transistors unit-circuit. In 2005 IEEE International
Symposium on Circuits and Systems, pages 2999–3002 Vol. 3, May 2005.
[24] K. Mueller and M. Muller. Timing recovery in digital synchronous data receivers.
IEEE Transactions on Communications, 24(5):516–531, May 1976.
[25] F. A. Musa. High-speed baud-rate clock recovery. PhD thesis, University of Toronto,
Toronto, ON, 2008.
[26] F. A. Musa and A. C. Carusone. A baud-rate timing recovery scheme with a dual-
function analog filter. IEEE Transactions on Circuits and Systems II: Express Briefs,
53(12):1393–1397, Dec 2006.
[27] P. J. Peng, J. F. Li, L. Y. Chen, and J. Lee. 6.1 A 56Gb/s PAM-4/NRZ transceiver in
40nm CMOS. In 2017 IEEE International Solid-State Circuits Conference (ISSCC),
pages 110–111, Feb 2017.
[33] M. H. Shakiba. A 2.5 Gb/s adaptive cable equalizer. In 1999 IEEE International
Solid-State Circuits Conference. Digest of Technical Papers. ISSCC. First Edition
(Cat. No.99CH36278), pages 396–397, Feb 1999.
Appendix A
Ancillary
oPAD0
coreck_16 (500 MHz)
oPAD1:
0) dout clk0 sample
1) dLev <0 or 4>
2) Thresh <8:0> selectable
3) Thresh [0-->8] burst mode
4) new_trial
5) dcomp <0>
6) PIPO<0>
7) mode_pi <0> (1-bit)
8) FSM_state [0-->2] burst mode
9) max [0-->8] burst mode
10) tot_max [0-->15] burst mode
11) tot_vamp [0-->15] burst mode
12) dLev011 [0-->8] burst mode
13) o_bert_pnebitcnt [0->19] burst mode
14) block_detected<0>
oPAD2:
0) dout clk90 sample
1) dLev <1 or 5>
2) adapt_rdy
3) Thresh flag_bit0
4) cross_rdy
5) dcomp <1>
6) PIPO<1>
7) new_Cs_setting
8) FSM_state flag_bit0
9) max flag_bit0
10) tot_max flag_bit0
11) tot_vamp flag_bit0
12) dLev011 flag_bit0
13) o_bert_pnebitcnt flag_bit0
14) ck_div16
oPAD3:
0) dout clk180 sample
1) dLev <2 or 6>
2) PI <4:0> selectable
3) PI [0-->4] burst mode
4) Cs [0-->3] burst mode
5) dcomp <2>
6) PIPO <2>
7) mode_adpt [0-->2] burst mode
8) thick_out [0-->9] burst mode
9) min [0-->8] burst mode
10) tot_min [0-->15] burst mode
11) prev_diff [0-->9] burst mode
12) dLev110 [0-->8] burst mode
13) o_bert_pnerr (1-bit)
oPAD4:
0) e_k_demux (between ck90 and 180)
1) dLev <3 or 7>
2) Cs <3:0> selectable
3) PI flag_bit0
4) Cs flag_bit0
5) dcomp <3>
6) PIPO<3>
7) mode_adpt flag_bit0
8) thick_out flag_bit0
9) min flag_bit0
10) tot_min flag_bit0
11) prev_diff flag_bit0
12) dLev110 flag_bit0
13) o_bert_pneonce (1-bit)
14) ck_div16
4 bit select
e.g.
Sel<dec 0> = selects dout<0>, dout<1>, dout<2>, e_k<0> [block 0]
Sel<dec 1> = selects dout<4>, dout<5>, dout<6>, e_k<0> [block 1]
.
.
Sel<dec 7> = selects dout<28>, dout<29>, dout<30>, e_k<0> [block 7]
Sel<dec 8> = selects dout<32>, dout<33>, dout<34>, e_k<8> [block 8]
Sel<dec 15> = selects dout<60>, dout<61>, dout<62>, e_k<8> [block 15]
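The select examples above suggest a regular pattern, which can be sketched as a small lookup helper. The rule below is inferred only from the listed rows (Sel 0, 1, 7, 8 and 15); the behaviour for the unlisted select values and the function name are assumptions.

```python
def selected_signals(sel):
    """Signals routed out for a 4-bit select value (0 to 15).

    Assumed pattern: block `sel` exposes dout<4*sel> to dout<4*sel+2>,
    with e_k<0> for blocks 0 to 7 and e_k<8> for blocks 8 to 15.
    """
    if not 0 <= sel <= 15:
        raise ValueError("sel must fit in 4 bits")
    base = 4 * sel
    e_k_index = 0 if sel < 8 else 8
    return [f"dout<{base}>", f"dout<{base + 1}>", f"dout<{base + 2}>",
            f"e_k<{e_k_index}>"]

for sel in (0, 1, 7, 8, 15):
    print(f"Sel<dec {sel}> ->", ", ".join(selected_signals(sel)))
```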