
High-Speed Baud-Rate Clock and Data Recovery

by

Danny Yoo

A thesis submitted in conformity with the requirements


for the degree of Master of Applied Science
The Edward S. Rogers Sr. Department of Electrical and Computer Engineering
University of Toronto


© Copyright 2018 by Danny Yoo
Abstract

High-Speed Baud-Rate Clock and Data Recovery

Danny Yoo
Master of Applied Science
The Edward S. Rogers Sr. Department of Electrical and Computer Engineering
University of Toronto
2018

This thesis presents an adaptive baud-rate CDR with CTLE and 1-tap DFE. The novelty
in this design is the adaptation engine tailored for baud-rate clock and data recovery
where the comparators for the DFE and the PD are shared to save power. A testchip was
fabricated in TSMC 28nm CMOS. The adaptation engine is demonstrated for 34-36Gb/s
operation with a Tyco 5” channel resulting in 15.05-18.25dB channel losses. At 35Gb/s,
the total power consumption is measured to be 106.3mW or a FOM of 3.04pJ/bit.
This thesis also presents a 2x half-baud-rate clock and data recovery technique with 2x
oversampling at half-baud-rate (every other UI). A testchip was also fabricated in TSMC
28nm CMOS. A 30Gb/s 2x half-baud-rate CDR was tested with a Tyco 5” channel with
13.06dB of loss. The total power consumption is measured to be 79.2mW or a FOM of
2.64pJ/bit.

Acknowledgements
I would like to sincerely thank my supervisor, Professor Ali Sheikholeslami, for providing
me the opportunity to conduct research in the area of high-speed wireline circuits. Pro-
fessor Sheikholeslami has supported me throughout every step of my tapeout, which is
fabricated in a leading-edge advanced process technology.
I thank Professor David Johns, Professor Tony Chan Carusone and Professor Joyce
Poon for serving on my thesis examination committee. Their insightful comments and
recommendations were an invaluable addition to this thesis.
I am thankful for the support and design review provided by Fujitsu’s staff,
especially Hirotaka Tamura, Takayuki Shibasaki and Junji Ogawa. Special thanks to Wahid
Rahman and Joshua Liang for guidance throughout my MASc research and Mohammad
Tabrizi for layout and measurement support. I would also like to thank Nikola Nedovic
for his visit to help set up the digital synthesis flow back in 2015, which still had an
impact on my 2017 tapeout.
My gratitude goes out to Jaro Pristupa and the MOSIS support team for CAD and
technical support. I would also like to acknowledge Professor Antonio Liscidini, Professor
Sorin Voinigescu and CMC for test equipment rental as I could not have finished my
testchip measurements without them.
Finally, I would like to send my deepest thanks to my parents and my brother for their
unconditional love and support.

Contents

Acknowledgements

Table of Contents

List of Figures

List of Abbreviations

1 Introduction
1.1 Motivation
1.2 An Adaptive Baud-Rate CDR
1.3 A 2x Half-Baud-Rate CDR
1.4 Thesis Outline

2 Background
2.1 Overview of Baud-Rate PD
2.2 Pattern-based Baud-Rate Scheme
2.2.1 Pattern Detection
2.2.2 Optimal Sampling Point
2.3 Why Adaptation Engine?
2.3.1 Challenges
2.3.2 CTLE Adaptation
2.3.3 Comparator Level Adaptation

3 Proposed Adaptation Engine
3.1 Data Level Loop
3.2 Goals for On-Chip Adaptation
3.3 Adaptation Flow
3.4 Part 1: CTLE Adaptation
3.5 Part 2: Comparator Level Adaptation
3.6 Summary of Adaptation
3.7 System-level Behavioral Model
3.7.1 Behavioral Model: Continuous-time Model
3.7.2 Behavioral Model: Event-driven Model

4 Circuit Simulations and Measurement Results
4.1 Analog Design
4.1.1 Closed-loop CDR Simulations
4.2 Digital Design
4.3 Lab Measurements
4.3.1 Testchip
4.3.2 Test Setup
4.3.3 Measurement Results

5 Proposed 2x Half-Baud-Rate CDR
5.1 Background
5.1.1 Alexander 2x-oversampled Bang-Bang PD
5.1.2 Mueller-Muller Baud-Rate PD
5.1.3 Sub-Baud-Rate Clock and Data Recovery
5.2 Proposed 2x half-baud-rate scheme
5.2.1 System-level Behavioral Model
5.3 Circuit Implementation & Simulations
5.3.1 Analog Design
5.3.2 Closed-loop CDR Simulations
5.3.3 Digital Design
5.4 Lab Measurements
5.4.1 Testchip
5.4.2 Test Setup
5.4.3 Measurement Results

6 Chip Design Methodology
6.1 Behaviour Model Methodology
6.2 Schematic & Layout Design Methodology
6.3 Advanced Layout Techniques & Considerations
6.3.1 Matching
6.3.2 Design for Electromigration (EM) & IR drop
6.3.3 Other Layout Considerations
6.4 Place & Route Digital Implementation Methodology

7 Conclusion
7.1 Thesis Contribution
7.2 Future Works
7.2.1 Improvements for an Adaptive Baud-Rate CDR
7.2.2 Improvements for a 2x Half-Baud-Rate CDR

Bibliography

Appendices

A Ancillary
A.1 Portlist for Synthesized Digital
A.2 Output Pad MUX Selection

List of Figures

2.1 ISSCC 2016 Shibasaki’s Baud-Rate CDR
2.2 Shibasaki’s proposed analog front-end (VLSI2014) [34]
2.3 Pattern detection of Shibasaki’s baud-rate PD
2.4 Eye opening of 1-tap speculative DFE for Shibasaki’s baud-rate PD
2.5 VLSI2014 Shibasaki’s proposed PD logic [34]
2.6 Sub-optimal α levels illustrating reduced eye opening for noise and jitter margin
2.7 Pulse response of Channel + CTLE
2.8 Eye diagram demonstrating (1+α) - (1-α) = 2α
2.9 LMS for adapting comparators for DFE

3.1 Full-rate system level block diagram of CDR and the proposed adaptation engine
3.2 Block diagram of proposed data level loop
3.3 Example of dLev converging
3.4 Example of data level (dLev) filtered for 111 and 011 pattern
3.5 Diagram illustrating the goal of on-chip adaptation
3.6 Diagram illustrating the optimal PD level
3.7 Diagram of eye opening for comparator optimized for DFE and PD respectively
3.8 Flow Diagram of Proposed Adaptation Engine
3.9 Schematic of a CTLE stage with tunable Cs
3.10 CTLE transfer function across Cs settings simulated in MATLAB Simulink
3.11 CTLE adaptation using line thickness
3.12 Block diagram of spectrum balancing [17]
3.13 CTLE transfer function showing 0011 pattern and its neighboring patterns for three different CTLE settings
3.14 Visual Example of CTLE adaptation
3.15 Visual example of theory behind the proposed algorithm for finding optimal PD level
3.16 Visual example of finding Vamp
3.17 Visual example of why Vamp = dLev(011)max
3.18 Adaptation where line thickness guides CTLE adaptation of Cs (top right) and optimal sampling phase deduced from the slew rate guides adaptation of comparator level
3.19 Proposed schematic of quarter-rate baud-rate receiver. Digital adaptation is completed in digital back-end and the rest of the CDR is done in analog front-end
3.20 Plot of channel characteristic of various channels imported and converted to rational system model in MATLAB
3.21 Adaptation vs time where line thickness guides CTLE adaptation and optimal PD level guides α level adaptation. Tyco 5” channel at 36 Gb/s
3.22 Step response and pulse response of channel + CTLE
3.23 Jitter tolerance simulated for the converged adaptive setting of event-driven model (BER < 10⁻⁶)

4.1 Schematic of the 2-stage CTLE
4.2 Simulated AC response of the 2-stage CTLE in Cadence
4.3 Simulated eye diagram at the output of the 2-stage CTLE. Top two eyes are under-equalized. The bottom left is optimally equalized and the bottom right is starting to over-equalize
4.4 Schematic of double-tail latch published in ISSCC2007 [31]
4.5 Schematic of charge pump and loop filter
4.6 Schematic of 8-stage ring oscillator used as VCO
4.7 Simulated results of closed loop CDR at 32 Gb/s PRBS31 input in Cadence: Error count
4.8 Simulated results of closed loop CDR at 32 Gb/s PRBS31 input in Cadence: Recovered clock frequency
4.9 Simulated results of closed loop CDR at 32 Gb/s PRBS31 input in Cadence: Vctrl of VCO
4.10 Open-cavity QFN under a microscope showing wire bond connections for the proposed adaptive baud-rate CDR
4.11 Package Pinout for D1: Adaptive baud-rate CDR with CTLE + 1-tap DFE
4.12 Die micrograph in TSMC 28nm HPC process for the proposed adaptive baud-rate CDR
4.13 High-speed testboard for design 1: adaptive baud-rate CDR testchip. Testboard is programmed and controlled by Arduino Mega2560 + PC
4.14 Arduino Mega2560 used to program the testboard PCB
4.15 Measurement setup for testing adaptive baud-rate CDR
4.16 Measured S21 insertion loss of Tyco 5” channel with 36 cables and bias tees using a VNA. Bias tees are required for setting the input common-mode as the SHF 12104A PRBS bit pattern generator cannot set voltage offset to set the common-mode. Low-frequency loss is caused by poor low-frequency performance of bias tees.
4.17 Measurement setup for measuring S21 channel loss
4.18 36 Gb/s PRBS31 input eye measured using a sampling scope including all channel loss
4.19 Measurement setup for eye diagram of input PRBS31
4.20 Measured clock spectrum and phase noise for locked CDR at 35 Gb/s
4.21 Measured clock spectrum and phase noise for locked CDR at 35 Gb/s
4.22 Measured jitter tolerance with sinusoidal jitter injected
4.23 Minimum 10-100MHz jitter tolerance around converged adaptive setting, manually sweeping CTLE parameter Cs
4.24 Minimum 10-100MHz jitter tolerance around converged adaptive setting, manually sweeping comparator level α
4.25 Measured jitter tolerance for different channel losses by sweeping the data-rate, hence Nyquist frequency
4.26 Measured power consumption with 35 Gb/s PRBS31 input with CDR lock
4.27 Performance comparison to prior work for the same CDR architecture

5.1 Schematic of Alexander 2x-oversampled bang-bang PD and its basic operation
5.2 Visual example of the lock point of Mueller-Muller PD [25]
5.3 Half baud-rate data sampling
5.4 Sub baud-rate data recovery by exploiting ISI
5.5 Eye diagram example of sub baud-rate data recovery by exploiting ISI. Green arrows on the left show theoretical maximum horizontal eye opening of 0.5UI. Green arrows on the right show the small vertical eye opening margins
5.6 Sub baud-rate (0.5x-sampled) data and clock recovery in comparison to Mueller-Muller CDR
5.7 Full-rate block diagram of the proposed 2x half-baud-rate CDR architecture
5.8 Eye diagram of the proposed 2x half-baud-rate scheme
5.9 Proposed 2x half-baud-rate PD compared to the conventional baud-rate Mueller-Muller PD
5.10 Proposed quarter-rate implementation of 2x half-baud-rate CDR. Proposed 2x half-baud-rate PD and the data decoder are simple custom high-speed digital logic gates
5.11 Jitter tolerance simulated for event-driven model of 2x half-baud-rate CDR (BER < 10⁻⁶)
5.12 Simulated results of closed loop CDR at 32 Gb/s PRBS31 input in Cadence: Error count
5.13 Simulated results of closed loop CDR at 32 Gb/s PRBS31 input in Cadence: Recovered clock frequency
5.14 Simulated results of closed loop CDR at 32 Gb/s PRBS31 input in Cadence: Vctrl of VCO
5.15 Open-cavity QFN under a microscope showing wire bond connections for the proposed 2x half-baud-rate CDR
5.16 Package Pinout for D2: Non-uniform baud-rate CDR with CTLE
5.17 Die photo in TSMC 28nm HPC process for the proposed 2x half-baud-rate CDR. Dimensions of each building block are listed in a table.
5.18 High-speed testboard for design 2: non-uniform baud-rate CDR testchip. Testboard is programmed and controlled by Arduino Mega2560 + PC
5.19 Measurement setup for testing 2x half-baud-rate CDR
5.20 Measured S21 insertion loss of Tyco 5” channel with 36 cables and bias tees using a VNA. Bias tees are required for setting the input common-mode as the SHF 12104A PRBS bit pattern generator cannot set voltage offset to set the common-mode.
5.21 Measured clock spectrum of an open-loop CDR
5.22 Measured clock spectrum and phase noise for locked CDR at 30 Gb/s
5.23 Measured clock spectrum and phase noise for locked CDR at 30 Gb/s
5.24 Measured jitter tolerance with sinusoidal jitter injected at 30 Gb/s PRBS31 & 7
5.25 Power breakdown of 2x half-baud-rate CDR testchip
5.26 Performance comparison to recently published baud-rate CDRs

6.1 Layout of the full-chip die with two CDRs on top & bottom
6.2 Layout of the full-chip die showing top aluminum layer for power distribution

List of Abbreviations

ADC Analog-to-Digital Converter

BBPD Bang-Bang Phase Detector

BERT Bit Error Rate Test

BER Bit Error Rate

CDR Clock and Data Recovery

CML Current Mode Logic

CMOS Complementary Metal-Oxide Semiconductor

CM Common-Mode

CP Charge Pump

CTLE Continuous-Time Linear Equalizer

CT Continuous-Time

DAC Digital-to-Analog Converter

DCC Duty Cycle Correction

DCD Duty Cycle Distortion

DEMUX Demultiplexer

DFE Decision Feedback Equalizer

dLev Data Level

ESD Electrostatic Discharge

FOM Figure of Merit

HDL Hardware Description Language

I/O Input / Output

ISI Inter-Symbol Interference

LDO Low Drop Out

LF Loop Filter

LMS Least Mean Square

LPF Low-Pass Filter

MMPD Mueller-Muller Phase Detector

MMSE Minimum Mean Square Error

MUX Multiplexer

P & R Place-and-Route

PCB Printed Circuit Board

PD Phase Detector

PIPO Parallel In Parallel Out

PI Phase Interpolator

PLL Phase-Locked Loop

PM Phase Margin

PN Phase Noise

PRBS Pseudo Random Binary Sequence

PSRR Power Supply Rejection Ratio

PVT Process, Voltage and Temperature

QFN Quad Flat No Lead

RF Radio Frequency

RMS Root Mean Square

RTL Register Transfer Level


SERDES Serializer / Deserializer

SIPO Serial In Parallel Out

SJ Sinusoidal Jitter

UI Unit Interval

VCO Voltage-Controlled Oscillator

VGA Variable Gain Amplifier

VNA Vector Network Analyzer

Chapter 1

Introduction

1.1 Motivation

In today’s society, digital data is ubiquitous. To get the most out of the data that sur-
rounds us in this digital age, sufficient processing speed and data-rate are imperative.
As the demand for speed is greater than ever before, development of wireline circuits
and SERDES is critical. However, an increase in data-rate is usually accompanied by
an increase in power consumption, which leads to environmental concerns. The current
scenario urgently calls for new low-power architectures and circuit techniques that will
improve speed and data-rate, while addressing the energy concerns. This thesis intro-
duces two unique research topics based on baud-rate clock and data recovery which aim
to solve the power consumption problem in high-speed wireline I/O links.

1.2 An Adaptive Baud-Rate CDR

In an attempt to further save power, the baud-rate clock and data recovery circuit (CDR)
published in ISSCC 2016 [35] shares the front-end comparators between the decision
feedback equalizer (DFE) and the phase detector (PD). In the following year, a frequency
detector (FD) based on the same baud-rate scheme was published in ISSCC 2017 [28].
However, these two CDRs lacked an adaptation engine where the CDR settings could
be autonomously adapted to the optimal lock point for various channels. Furthermore,
manual tuning of these CDRs is difficult and tedious as will be discussed in this thesis.
As a result, this thesis will present the proposed adaptive baud-rate CDR with CTLE
and 1-tap DFE. The novelty in this design is the adaptation engine tailored for baud-rate
clock and data recovery where comparators for the DFE and the PD are shared to save
power.


1.3 A 2x Half-Baud-Rate CDR

A traditional, Alexander-like [4] 2x oversampling bang-bang CDR, where the recovered
clock locks to the data edge, has many inherent advantages such as robustness. In
contrast, a Mueller-Muller PD, which is a prevalent baud-rate PD scheme, locks the
sampling phase to the middle of the pulse response. Locking to the center of the data
brings some innate disadvantages. Therefore, the proposed clock and data recovery
technique aims to combine the advantage of locking to the edge, similar to a BBPD, with
baud-rate sampling, similar to an MMPD. The proposed 2x half-baud-rate CDR collects
two samples (2x) from every other UI (half-baud-rate), effectively sampling the data at
baud-rate, but locks to the data edge similar to a BBPD.

1.4 Thesis Outline

In this thesis, there are two stand-alone baud-rate CDRs taped out in TSMC 28nm
technology. To address both chip designs, this thesis will be organized as follows. Chapters
2, 3 and 4 cover the first design, which is the proposed adaptive baud-rate CDR. First,
Chapter 2 will cover the background. Second, Chapter 3 will delve into the proposed
adaptation engine. Lastly, Chapter 4 will present circuit simulations and measurement
results. The second design which is the proposed 2x half-baud-rate CDR will be presented
in Chapter 5. Chip design methodology of how the two separate chip designs were taped
out on time for the same fabrication shuttle will follow in Chapter 6. The final chapter
(Chapter 7) will be the conclusion which summarizes the two contributions of this thesis.
Chapter 2

Background

The background to the proposed adaptive baud-rate CDR with CTLE and 1-tap DFE
fabricated in TSMC 28nm technology will be covered in this chapter. The testchip was
an analog mixed-signal design with a synthesized digital block incorporated via a place & route
tool. First, an overview of the baud-rate PD scheme will be presented in the following section.

2.1 Overview of Baud-Rate PD

In recent years, baud-rate clock and data recovery has gained prominence over classical
oversampling CDRs such as the Alexander bang-bang PD [4], which require multiple
clock phases to perform phase detection and clock and data recovery. Details and the
background of the Alexander BBPD can be found in Section 5.1.1, which serves
as a background to the proposed 2x half-baud-rate CDR. A baud-rate CDR samples the
data only once per UI and therefore requires fewer clock phases, which reduces
the power consumption in the clock distribution network [9]. There are many different
baud-rate PD schemes, such as integrating-based [8], Mueller-Muller [24], and MMSE
(minimum mean squared error) [26]. The background to the Mueller-Muller PD, which is a
prevalent baud-rate scheme, will be discussed in detail in Section 5.1.2. Despite the wide
range of baud-rate schemes, this thesis will focus on 1) the pattern-based PD presented in
ISSCC 2016 [35], on which the proposed adaptive baud-rate CDR is based, and 2) the
proposed 2x half-baud-rate PD, which will be presented in Chapter 5.

2.2 Pattern-based Baud-Rate Scheme

The proposed adaptive CDR is based on a baud-rate CDR from ISSCC 2016 shown
in Figure 2.1 [34, 35]. Its novel analog front-end and the phase detector (PD) will be
discussed first.


In order to save power, the 1-tap DFE and the PD share the same comparators in
the analog front-end. Figure 2.2 illustrates the advantages of combining the front-end
comparators. First, the number of comparators is two-thirds of that in the conventional
data and edge sampling bang-bang PD. Essentially, the proposed scheme detects phase
(clock information) from the 1-tap speculative DFE data. In addition, the number of clock
phases that need to be routed is half that of the conventional BBPD scheme. As a result,
the PLL and clock distribution power is also halved. The comparator levels are set to +/-α
to cancel out the 1st post-cursor ISI (inter-symbol interference) left after the CTLE’s
equalization. At the same time, this α level is also used as a locking point for the phase
detector. In Figure 2.2, +α is labelled as DH and -α is labelled as DL.
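
As a minimal sketch of how two comparators at +α (DH) and -α (DL) can serve a 1-tap
speculative DFE, the Python snippet below selects between the two comparator outputs
based on the previous decision. The signal model (ideal ±1 symbols with a single
post-cursor of size α and no noise) and the function names are assumptions made for
illustration only; they are not taken from the testchip implementation.

```python
def channel_with_one_postcursor(bits, alpha, first_prev_bit=0):
    """Hypothetical noiseless model: y[n] = d[n] + alpha * d[n-1], with d = +/-1."""
    prev = first_prev_bit
    samples = []
    for b in bits:
        d = 1.0 if b else -1.0
        p = 1.0 if prev else -1.0
        samples.append(d + alpha * p)
        prev = b
    return samples


def speculative_dfe(samples, alpha, first_prev_bit=0):
    """Pick the DH (+alpha) or DL (-alpha) comparator output using the previous bit."""
    prev = first_prev_bit
    decisions = []
    for y in samples:
        dh = 1 if y > +alpha else 0  # comparator with threshold +alpha (DH)
        dl = 1 if y > -alpha else 0  # comparator with threshold -alpha (DL)
        bit = dh if prev == 1 else dl  # previous bit 1 -> ISI is +alpha -> use DH
        decisions.append(bit)
        prev = bit
    return decisions


bits = [0, 0, 1, 1, 0, 1, 0, 0, 1, 1, 1, 0]
rx = channel_with_one_postcursor(bits, alpha=0.3)
recovered = speculative_dfe(rx, alpha=0.3)
```

In this idealized model the recovered sequence matches the transmitted bits, because each
threshold sits at the vertical center of the eye conditioned on the previous bit.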

2.2.1 Pattern Detection

Shibasaki’s baud-rate PD detects slow rising and slow falling patterns to make a timing
decision on whether the clock is late or early. It looks at 3 consecutive samples at a time,
(S_{n-1}, S_n, S_{n+1}), to filter out a specific pattern. For example, the 011 pattern is used
for detecting a rising waveform and the 100 pattern is used for detecting a falling waveform.
Figure 2.3 demonstrates rising waveform detection with the +α comparator level setting the
CDR’s lock point. In terms of data recovery, the eye opening available to the 1-tap
speculative DFE for a rising waveform is shown in blue in Figure 2.4. Although Shibasaki’s
PD only looks at 3 consecutive samples at a time, for a valid 011 rising waveform pattern
satisfying the PD logic table in Figure 2.5, a 001 pattern must precede sample n.
Essentially, when Shibasaki’s PD detects 011 for a rising pattern and 100 for a falling
pattern, it is actually detecting the 0011 and 1100 patterns respectively. In other words,
a 2UI pulse pattern is used for pattern detection in order to recover timing information.
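
The 3-sample pattern filter described above can be sketched as a sliding window over
the decided bits. The function name and the list-of-bits representation below are
hypothetical, and the sketch only flags where 011 and 100 windows occur; it does not
reproduce the early/late decisions of the PD logic table in Figure 2.5.

```python
def find_pd_patterns(bits):
    """Return indices n where (S[n-1], S[n], S[n+1]) is 011 (rising) or 100 (falling)."""
    rising, falling = [], []
    for n in range(1, len(bits) - 1):
        window = (bits[n - 1], bits[n], bits[n + 1])
        if window == (0, 1, 1):
            rising.append(n)   # candidate slow rising waveform
        elif window == (1, 0, 0):
            falling.append(n)  # candidate slow falling waveform
    return rising, falling


# A repeating 0011 sequence contains both pattern types.
rising, falling = find_pd_patterns([0, 0, 1, 1, 0, 0, 1, 1])
```

A fast-toggling sequence such as 1010 produces no detections, which is consistent with
the PD relying on 2UI pulse patterns for timing information.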

2.2.2 Optimal Sampling Point

The CTLE’s boost changes the data slope of the 011 and 100 patterns that the PD detects.
The CTLE should be adjusted so that the comparator assigned for phase detection
produces 0 and 1 with equal probability at the “optimal sampling point”.
To illustrate the optimal sampling point, let us take a look at the scenario when the
previous bit is a 1. The eye opening available to a data sampler when the previously
detected bit is a 1 is shown in red on the top left of Figure 2.5. The optimal sampling
position is 1/2 UI from the convergence point of the low-rate pattern (i.e. the 100 sequence)
with the +α (DH) level, such that the data decision is made at the x-axis center of the eye
opening. At this point, the 100 pattern should produce an equal chance of early/late with
the second comparator, set at -α (DL), used for the PD.

Figure 2.1: ISSCC 2016 Shibasaki’s Baud-Rate CDR

Figure 2.2: Shibasaki’s proposed analog front-end (VLSI2014) [34]

Figure 2.3: Pattern detection of Shibasaki’s baud-rate PD



Figure 2.4: Eye opening of 1-tap speculative DFE for Shibasaki’s baud-rate PD

Figure 2.5: VLSI2014 Shibasaki’s proposed PD logic [34]



2.3 Why Adaptation Engine?

An autonomous adaptation scheme that can adapt to the best CDR settings dynamically
on-chip is imperative because the CDR in each lane cannot be tuned manually in a
real-world product. In the following, we outline the challenges in implementing an
adaptation engine.

2.3.1 Challenges

The challenge of forming an adaptation scheme for the pattern-based PD used in this
CDR is that there exist two tuning knobs: one for the CTLE setting and the other
for the comparator level (+/-α). With two tuning knobs, finding the optimal setting is
hard even for a known, fixed channel, and harder still for various unknown channels.
Furthermore, the fact that these two tuning knobs are correlated complicates the
adaptive scheme. First, changing the CTLE setting will change the data slope, which
essentially changes the PD gain and the +/-α required for an optimal lock point with
the maximum timing margin. Second, changing the CTLE setting also changes the +/-α
needed for the 1-tap DFE by affecting the amount of post-cursor ISI remaining. For
example, more equalization means a smaller α is needed and less equalization means
a larger α is needed. As a result, when the CTLE and/or the α level changes, the
optimal sampling point changes, which makes manual tuning difficult. Even a slight shift
in the comparator level α undermines the eye opening, as shown in Figure 2.6. A reduced
eye opening compromises the robustness of the CDR, as the noise and jitter margins are
significantly reduced. Therefore, the goal of adaptation is to find the CTLE setting
and the comparator level (+/-α) for the optimal jitter tolerance.

2.3.2 CTLE Adaptation

The ultimate goal of an inductor-less CTLE is not to fully compensate and equalize for
the channel loss. The 1-tap DFE present in the CDR design would not be required in
that scenario, making the DFE wasteful in terms of the CDR’s power budget.
We want the f_3dB of the CTLE output to be at f_baud/3 to f_baud/4 for ideal PD operation.
This means that, at the output of the CTLE, a 2UI pulse should reach full swing
in an ideal scenario. In other words, all residual ISI ($\sum_{i=2}^{\infty} \alpha_i$)
other than the first post-cursor should be minimized, for a system with a pulse response of
$1 + \alpha_1 D^1 + \alpha_2 D^2 + \alpha_3 D^3 + \cdots$ as shown in Figure 2.7. The one
remaining significant post-cursor ISI can be canceled out by the 1-tap DFE, responsible
for equalizing the content at f_baud/2 (the Nyquist frequency).
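
To make the residual-ISI quantity concrete, the short Python sketch below splits a
UI-spaced pulse response into the first post-cursor α1 and the remaining sum over
i ≥ 2. It assumes the first sample is the main cursor, that there is no pre-cursor ISI,
and that the sample values are made-up illustrative numbers; none of these come from
the measured channel.

```python
def isi_terms(pulse_samples):
    """Split UI-spaced pulse-response samples into (alpha_1, residual ISI).

    pulse_samples[0] is assumed to be the main cursor; later entries are
    post-cursors. Values are normalized by the main cursor.
    """
    main = pulse_samples[0]
    alphas = [s / main for s in pulse_samples[1:]]  # alpha_1, alpha_2, ...
    alpha_1 = alphas[0]          # left for the 1-tap DFE to cancel
    residual = sum(alphas[1:])   # sum of alpha_i for i >= 2, to be minimized by the CTLE
    return alpha_1, residual


# Illustrative (made-up) pulse response: main cursor 0.8, three post-cursors.
alpha_1, residual = isi_terms([0.8, 0.24, 0.04, 0.02])
```

With these example numbers α1 is about 0.3 and the residual ISI is about 0.075; a CTLE
setting that lowers the residual while leaving α1 for the DFE matches the goal stated above.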

Figure 2.6: Sub-optimal α levels illustrating reduced eye opening for noise and jitter margin

Figure 2.7: Pulse response of Channel + CTLE



Figure 2.8: Eye diagram demonstrating (1+α) - (1-α) = 2α

2.3.3 Comparator Level Adaptation

Since the front-end comparators are shared by the DFE and the PD, they can only be
optimized for one of the two. If our primary goal is to adapt the comparator levels
to optimize the DFE (i.e. set the comparators to exactly α to perfectly cancel out the
1st post-cursor ISI remaining after the CTLE), the α level can be extracted by looking at the data
levels present in the CTLE's output eye. A system with one significant post-cursor ISI
will have 4 distinct levels: (1+α), (1-α), (-1+α), (-1-α). The exact value of α can be obtained
from either Eq. 2.1 or Eq. 2.2 below. Figure 2.8 demonstrates the
latter equation visually. The apex of the red eye opening, for when the previous bit was a 1, is
(1+α). The apex of the blue eye opening, for when the previous bit was a 0, is (1-α). Taking
the difference yields exactly 2α, and dividing by two gives the value of the 1st
post-cursor ISI.

(1 + α) + (−1 + α) = 2α (2.1)
(1 + α) − (1 − α) = 2α (2.2)

In practice, by exploiting the 4 distinct data levels, a simple sign-sign LMS could be
implemented to find α. Two sign-sign LMS loops, one for eref (the error reference)
and a slower loop for α, make α converge to the middle of the 110 pattern eye
opening for when the previous bit was 1, as shown in Figure 2.9. Essentially, this LMS
DFE adaptation takes (1+α) + (-1+α) = 2α and divides it by two to obtain α.
The α in Figure 2.9 converges to the value shown in green. This green α value is the
vertical mid-point of 110 eye pattern. The sign-sign LMS convergence is governed by the

Figure 2.9: LMS for adapting comparators for DFE

following two equations Eq. 2.3 & Eq. 2.4:

$\alpha_{n+1} = \alpha_n + \mu \cdot \mathrm{sgn}(err)\,\mathrm{sgn}(d_{out})$ (2.3)

$eref_{n+1} = eref_n + k \cdot \mathrm{sgn}(err)\,\mathrm{sgn}(d_{out})$ (2.4)

where the err signal is generated by a comparator that compares dout to ±α ± eref, depending
on the data pattern. For example, for a data pattern where the previous data and the
current data are both 1, the threshold of the comparator is +α + eref. If the previous
data is 1 and the current data is 0, the threshold is +α − eref. However,
it is apparent in Figure 2.9 that the value to which α converges is not the optimal comparator
level for PD operation in terms of jitter tolerance. In fact, the optimal PD
level is shown in red. It was previously stated that for optimal DFE operation, the
comparator's α level should be the vertical midpoint of the 110 eye opening available to the
1-tap loop-unrolled DFE. For optimal PD operation, the comparator's α level should
intersect with the 011 rising data sequence at the x-axis midpoint of the 110 pattern eye
opening. This x-axis midpoint should provide the greatest peak-to-peak jitter tolerance.
In the next chapter, a novel technique for obtaining the optimal PD level is discussed.
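
The two sign-sign LMS loops of Eq. 2.3 and Eq. 2.4 can be illustrated with a small behavioral sketch. This is not the thesis implementation: the signal is normalized to a main cursor of 1, the function name and step sizes are hypothetical, and the choice to correlate the α update with the previous bit and the eref update with the current bit is an assumption, since the equations above only write sgn(dout).

```python
import random

def sgn(x):
    """Sign function that maps zero to +1 so every update moves by one step."""
    return 1 if x >= 0 else -1

def sign_sign_lms(alpha_true=0.3, mu=1e-3, k=4e-3, n_bits=20000, noise=0.02, seed=0):
    """Toy sign-sign LMS for a channel with one post-cursor (hypothetical values)."""
    rng = random.Random(seed)
    alpha, eref = 0.0, 0.0
    d_prev = 1
    for _ in range(n_bits):
        d_cur = rng.choice((-1, 1))
        # CTLE output: one of the four levels (+/-1 +/- alpha_true) plus noise.
        y = d_cur + alpha_true * d_prev + rng.gauss(0.0, noise)
        # Pattern-dependent threshold +/-alpha +/-eref, as described in the text.
        threshold = alpha * d_prev + eref * d_cur
        err = sgn(y - threshold)
        # Assumption: alpha is correlated with the previous bit (slower loop),
        # eref with the current bit (faster loop, k > mu).
        alpha += mu * err * d_prev
        eref += k * err * d_cur
        d_prev = d_cur
    return alpha, eref

alpha_est, eref_est = sign_sign_lms()
print(f"alpha ~ {alpha_est:.3f}, eref ~ {eref_est:.3f}")
```

In this toy setting α settles near the true post-cursor value and eref near the main-cursor amplitude, which corresponds to the DFE-optimal level (green in Figure 2.9), not the PD-optimal level.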
Chapter 3

Proposed Adaptation Engine

As indicated in the background chapter, an adaptive scheme is imperative because manually
fine-tuning for the best jitter tolerance is difficult. In addition, the jitter tolerance of
Shibasaki’s baud-rate CDR is heavily affected by both the CTLE setting and the com-
parator level. Worse, the CTLE setting and the comparator level are correlated. Figure
3.1 illustrates the proposed adaptation engine which is specifically tailored for a baud-
rate CDR where comparators for the PD and the DFE are shared. Hence, the full-rate
system level block diagram in Figure 3.1 contains many building blocks found in the
Shibasaki baud-rate CDR even though all building blocks were designed independently
in a slightly different process technology. The front-end consists of a CTLE, a 1-tap
speculative DFE and a baud-rate PD which shares the comparators, and the CDR. The
CDR loop is an analog loop consisting of a charge pump (CP) and a low-pass filter (LPF)
as the loop filter. A ring oscillator is used for the voltage-controlled oscillator (VCO) in
this inductor-less CDR.
The adaptive engine, highlighted at the bottom, receives the recovered clock (CKrec )
from the CDR and rotates it to a new phase (CKX ) in order to adaptively sample the
CTLE output at the intersection of the 011 and 110 patterns. The rotation is performed
by a phase interpolator (PI) and guided by the PI logic in the quarter-rate system im-
plementation. The heart of the adaptive engine is a data level loop, which is a feedback
loop that observes the CTLE samples and reconstructs the data level (dLev) with 9-bit
resolution. To do so, the adaptive sampler subtracts the stored dLev from the current
CTLE sample, quantizes the difference to 1-bit, and feeds the resulting bits to a digital
filter prior to summing them up in an accumulator. The DAC produces an analog level
corresponding to the 9-bit dLev and feeds it to the sampler as its threshold. The role
of the digital filter and the pattern filter is to calculate the average, the maximum, and
the minimum of the CTLE output at phase CKX . A cycle counter (or filter scheduler)
schedules 16k clock cycles to execute each of these tasks using the same data level loop.


Figure 3.1: Full-rate system level block diagram of CDR and the proposed adaptation engine

The calculated values of dLev for various filters are then used by the adaptive logic block
to guide both the CTLE parameter (CS ) and the one-tap DFE coefficient (+/-α) that
also determines the sampling phase of the PD.

3.1 Data Level Loop

In order to perform both CTLE and DFE/PD adaptation, we rely heavily on the data
level loop. The original use of the data level loop for finding voltage levels was published
in JSSC 2005 by Stojanovic et al. [37]. The proposed data level loop is modified to serve
the adaptive scheme for the Shibasaki baud-rate CDR specifically. Figure 3.2 illustrates
the block diagram of the proposed data level loop. The dLev converges to the middle
of the data level of the filtered data sequence sampled by the adaptive clock, CKadapt .
dLev convergence is governed by Eq. 3.1.

$dLev_{n+1} = dLev_n + \Delta dLev \cdot \mathrm{sgn}(e_n)$ (3.1)

Different data sequence patterns can be filtered out for the data level loop, making it very
useful in determining the voltage level for any specific data pattern. For example, Figure

Figure 3.2: Block diagram of proposed data level loop

Figure 3.3: Example of dLev converging

3.3 shows that dLev converges to the correct value of 200 mV. The final dLev value after
convergence can be further stabilized by applying more filtering. To demonstrate that the
data level loop can track any data pattern, we sweep the adaptive clock
for different pattern filter settings. Figure 3.4 depicts an example eye diagram and the
resulting dLev filtered for the 111 pattern, shown in green, and the 011 pattern, shown in
blue. The adaptive clock here is swept over 1UI as a demonstration. Again, the data level
loop output could have been filtered more heavily, which would have improved the monotonicity of
the dLev values, especially for the 011 pattern shown in blue.
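
As a rough illustration of Eq. 3.1 combined with a pattern filter, the sketch below steps a 9-bit dLev code up or down by one LSB for every sample whose preceding bits match the selected pattern. The 1 mV LSB, the unsigned code range, the toy signal levels, and all function names are assumptions made for this example, and only the average level is tracked (not the maximum or minimum).

```python
import random

def data_level_loop(samples_mv, bits, pattern, step_mv=1, n_levels=512):
    """Track the level of samples whose last len(pattern) bits match `pattern`."""
    code = 0  # 9-bit accumulator value driving the (hypothetical) DAC
    n = len(pattern)
    for i in range(n - 1, len(bits)):
        if tuple(bits[i - n + 1:i + 1]) != tuple(pattern):
            continue  # pattern filter: ignore samples from other sequences
        e = samples_mv[i] - code * step_mv   # adaptive sampler difference
        code += 1 if e >= 0 else -1          # Eq. 3.1: dLev += delta * sgn(e)
        code = max(0, min(n_levels - 1, code))
    return code * step_mv

def make_toy_samples(n_bits=16384, seed=1):
    """Toy eye: 200 mV after '111', 120 mV for other 1s, 0 mV for 0s, 5 mV noise."""
    rng = random.Random(seed)
    bits = [rng.randint(0, 1) for _ in range(n_bits)]
    samples = []
    for i, b in enumerate(bits):
        if b == 0:
            level = 0.0
        elif i >= 2 and bits[i - 1] == 1 and bits[i - 2] == 1:
            level = 200.0
        else:
            level = 120.0
        samples.append(level + rng.gauss(0.0, 5.0))
    return samples, bits

samples, bits = make_toy_samples()
print("dLev(111) ~", data_level_loop(samples, bits, (1, 1, 1)), "mV")
print("dLev(011) ~", data_level_loop(samples, bits, (0, 1, 1)), "mV")
```

Because each update moves by a fixed step regardless of the error size, the converged value dithers by a few LSBs around the true level, which is why additional filtering helps.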

3.2 Goals for On-Chip Adaptation

The end goal of the proposed adaptation engine is to arrive at the optimal jitter tolerance
for the baud-rate CDR system. Initially, the adaptation engine will need to run with the
CDR locked to a sub-optimal phase. This essentially means that it needs to be frequency
locked and roughly phase locked, although the presence of bit errors is acceptable

Figure 3.4: Example of data level (dLev) filtered for 111 and 011 pattern

Figure 3.5: Diagram illustrating the goal of on-chip adaptation

(e.g. 1E-3). Initial bit errors in the system are acceptable because once the adaptation engine
is turned on, errors average out in the data level loop. Therefore, the data level loop is
still able to track and perform data filtering to obtain dLev. Via adaptation, maximum
jitter tolerance is autonomously achieved by adapting the lock position to the optimal
data-sampling phase as shown in Figure 3.5.
There are two steps to the proposed adaptive scheme. The first is the CTLE adaptation
and the second is the DFE/PD adaptation:

1. Find the optimal CTLE setting (Cs) with the flattest equalization up to fbaud /3 ∼
fbaud /4 ensuring that there is only 1 significant post-cursor ISI with other higher-
order residual ISIs all minimized.

2. Find the optimal comparator level for PD operation. The optimal PD level intersects

Figure 3.6: Diagram illustrating the optimal PD level

with 011 data sequence at the x-axis midpoint of the 110 pattern eye opening (Figure
3.6).

When the two conditions above are satisfied via adaptation, our baud-rate CDR should
have the optimal jitter tolerance.
It was previously mentioned that the comparator level can only be optimized for either
the PD or the DFE since the front-end comparators are shared. Instead of adapting +/-α
to optimize the DFE to cancel out the 1st post-cursor ISI perfectly, we opt to adapt for
the optimal PD operation. The main reason for optimizing for PD operation is that
the Shibasaki baud-rate CDR has less timing margin (jitter) than voltage margin
(amplitude) for the DFE to recover the data correctly. Even if the optimal PD level is not
at the exact value of 1st post-cursor ISI α for the DFE, we are trading off a little bit of
eye opening for better jitter tolerance by locking to a more optimal data-sampling phase.
Figure 3.7 depicts a fictitious example of the final eye opening where the comparator
level is optimized for the DFE in (a) and optimized for the PD in (b). It is apparent that
the total peak-to-peak jitter tolerance is larger for the scenario where the comparator
level is optimized for the PD.

3.3 Adaptation Flow

A flow diagram of the proposed adaptation scheme is shown in Figure 3.8. First, the
CTLE is set to the maximum equalization setting and the comparator level α for the
DFE/PD is set to 0.3FS (full-scale). These settings should cause our baud-rate CDR
to lock to a sub-optimal phase. This means that the CDR must be frequency locked
although there could be bit errors present due to the clock phase being sub-optimal.

Figure 3.7: Diagram of eye opening for comparator optimized for DFE and PD respectively

The initial comparator level of 0.3FS is chosen as the starting point because the system
should only have one significant post-cursor ISI after the CTLE since our baud-rate CDR
only has a 1-tap loop-unrolled DFE which can cancel out just one post-cursor ISI. When
the pulse response has just one significant post-cursor ISI, α is usually a value close to
0.3FS. This is not always the case, as different channels exhibit different pulse
responses. An initial α level of 0.3FS (e.g. roughly 100mV for a 300mV full-scale input) should
be a viable starting point but if the baud-rate CDR is unable to achieve phase lock with
a reasonable BER (e.g. 1E-3), the proposed adaptation engine can be restarted with a
different initial comparator level α.
After the CTLE setting is set to the maximum and the comparator level α is set
to the pre-defined value, adaptation begins. First, for the highest CTLE setting, the
adaptation engine obtains a new α value (optimal PD level). This means that for the
current eye, which is most likely over equalized by the maximum CTLE setting, the
adaptation engine has picked the optimal PD level, which gives the largest peak-to-peak
jitter tolerance. The α level selection algorithm used by the proposed adaptation
engine is discussed further in Section 3.5. Once the optimal PD level is obtained,
the adaptation engine lowers the CTLE setting by one step. The proposed adaptation
engine essentially goes back and forth between the CTLE (blue) and the comparator
(red) adaptation, as shown in Figure 3.8. In essence, for each CTLE setting along the
way, the proposed adaptation engine is always tuning the CDR
to the best lock position. In other words, it tunes the CDR in small tick-tock-like
increments to avoid losing phase lock. If the CTLE adaptation were to finish completely
before adapting the comparator level α at all, there is no guarantee that it will even
maintain a CDR lock. Furthermore, as previously described, the CTLE setting and the
comparator level α are correlated, and therefore it makes sense to adapt them in small
increments to find the optimal solution in a multi-dimensional solution space. Once the

Figure 3.8: Flow Diagram of Proposed Adaptation Engine

proposed adaptation engine detects that the lowered CTLE setting did not reduce the
CTLE line thickness, the engine reverts to the previous CTLE setting
without updating the α level and ends adaptation. The use of line thickness as the metric for
CTLE adaptation is elaborated in the next section.
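
The tick-tock flow described in this section can be summarized with the following sketch. It assumes the engine is given two measurement callbacks, `measure_thickness` and `find_pd_level`, which are hypothetical stand-ins for the data level loop measurements; the toy models used in the usage example are invented purely to exercise the control flow.

```python
def run_adaptation(cs_settings, measure_thickness, find_pd_level, alpha_init):
    """Alternate CTLE and comparator-level adaptation, one CTLE step at a time.

    cs_settings is sorted from lowest to highest; adaptation starts at the highest.
    """
    idx = len(cs_settings) - 1
    # For the starting CTLE setting, pick the PD-optimal comparator level.
    alpha = find_pd_level(cs_settings[idx], alpha_init)
    best = measure_thickness(cs_settings[idx], alpha)
    history = [(cs_settings[idx], alpha, best)]
    while idx > 0:
        # Lower the CTLE setting by one step and re-measure line thickness.
        trial = measure_thickness(cs_settings[idx - 1], alpha)
        if trial >= best:
            break  # no improvement: revert CTLE, keep alpha, end adaptation
        idx -= 1
        best = trial
        # Re-tune the comparator level for the new CTLE setting.
        alpha = find_pd_level(cs_settings[idx], alpha)
        history.append((cs_settings[idx], alpha, best))
    return cs_settings[idx], alpha, history

# Toy models (assumptions): thickness is minimal at 200 fF, alpha shrinks with Cs.
toy_thickness = lambda cs_ff, alpha: 20.0 + 0.1 * abs(cs_ff - 200)
toy_pd_level = lambda cs_ff, alpha: 0.5 * (300.0 - 0.1 * cs_ff)

cs_final, alpha_final, trace = run_adaptation(
    [100, 150, 200, 250, 300, 350, 400], toy_thickness, toy_pd_level, alpha_init=90.0
)
print(cs_final, alpha_final, len(trace))
```

Updating α after every accepted CTLE step, instead of once at the end, mirrors the argument above that small alternating increments help the CDR keep phase lock.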

3.4 Part 1: CTLE Adaptation

The proposed adaptation engine is broken into two different phases: CTLE and compara-
tor level adaptations. The former, CTLE adaptation, will be discussed in more detail in
this section.
The CTLE used in our baud-rate CDR shown in Figure 3.1 is a common current-mode
logic (CML) CTLE architecture. It is a differential pair with RC source degeneration
with a resistor in parallel with a capacitor (Figure 3.9). This CTLE stage is repeated
twice for the 2-stage CTLE in the CDR’s design. The transfer function of the CTLE is

Figure 3.9: Schematic of a CTLE stage with tunable Cs

as follows (Eq. 3.2).

$H(s) = \frac{g_m}{C_L} \cdot \frac{s + \frac{1}{R_s C_s}}{\left(s + \frac{1 + g_m R_s/2}{R_s C_s}\right)\left(s + \frac{1}{R_D C_L}\right)}$ (3.2)

The CTLE’s zero, poles, DC gain, and peaking gains are governed by equations below.
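$\omega_z = \frac{1}{R_s C_s}$ (3.3)

$\omega_{p1} = \frac{1 + g_m R_s/2}{R_s C_s}$ (3.4)

$\omega_{p2} = \frac{1}{R_D C_L}$ (3.5)

$\text{DC gain} = \frac{g_m R_D}{1 + g_m R_s/2}$ (3.6)

$\text{Ideal peak gain} = g_m R_D$ (3.7)

$\text{Ideal peaking} = \frac{\text{Ideal peak gain}}{\text{DC gain}} = \frac{\omega_{p1}}{\omega_z} = 1 + g_m R_s/2$ (3.8)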
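
To make Eq. 3.2 through Eq. 3.8 concrete, the sketch below evaluates the zero, poles, DC gain, and peaking for one CTLE stage and sweeps Cs. The component values are placeholders chosen only for illustration and are not taken from the design.

```python
import math

def ctle_params(gm, rs, cs, rd, cl):
    """Zero, poles, and gains of one CTLE stage per Eq. 3.3 to Eq. 3.8."""
    wz = 1.0 / (rs * cs)
    wp1 = (1.0 + gm * rs / 2.0) / (rs * cs)
    wp2 = 1.0 / (rd * cl)
    dc_gain = gm * rd / (1.0 + gm * rs / 2.0)
    peak_gain = gm * rd
    return {"wz": wz, "wp1": wp1, "wp2": wp2,
            "dc_gain": dc_gain, "peak_gain": peak_gain,
            "peaking": peak_gain / dc_gain}

def ctle_gain(f_hz, gm, rs, cs, rd, cl):
    """Magnitude of H(s) from Eq. 3.2 evaluated at s = j*2*pi*f."""
    p = ctle_params(gm, rs, cs, rd, cl)
    s = 1j * 2.0 * math.pi * f_hz
    h = (gm / cl) * (s + p["wz"]) / ((s + p["wp1"]) * (s + p["wp2"]))
    return abs(h)

# Placeholder values (assumptions): gm = 10 mS, Rs = 400 ohm, RD = 200 ohm, CL = 20 fF.
GM, RS, RD, CL = 10e-3, 400.0, 200.0, 20e-15
for cs in (100e-15, 200e-15, 400e-15):
    p = ctle_params(GM, RS, cs, RD, CL)
    fz_ghz = p["wz"] / (2.0 * math.pi * 1e9)
    print(f"Cs = {cs * 1e15:.0f} fF: zero at {fz_ghz:.2f} GHz, "
          f"DC gain {p['dc_gain']:.3f}, gain at 9 GHz {ctle_gain(9e9, GM, RS, cs, RD, CL):.3f}")
```

The sweep shows the property used by the adaptation: changing Cs moves the zero and first pole together while leaving the DC gain and the ideal peaking ratio unchanged.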

For our proposed CTLE adaptation scheme, the adaptive variable is the source de-
generation capacitor Cs . Sweeping Cs is very similar to sweeping the zero, ωz , overall.

Figure 3.10: CTLE transfer function across Cs settings simulated in MATLAB Simulink

Digitally tunable capacitors are used to adjust Cs for gentle tuning of the CTLE transfer
function [27]. By increasing the source degeneration capacitance, the zero frequency of
the system can be reduced while maintaining the low frequency DC gain as seen in Figure
3.10. In addition, if Rs was an adaptive variable, then a VGA would have been needed
to boost the DC gain since Rs increases peaking by lowering the DC gain. Adding a
VGA is complicated because the VGA setting must be adapted as well, thus adding a
3rd variable which would further complicate the adaptive scheme.
As mentioned in Section 3.2, the goal of CTLE adaptation is to find the optimal CTLE
setting (Cs) with the flattest equalization up to fbaud /3 ∼ fbaud /4, ensuring that there is
only 1 significant post-cursor ISI. This also means that 0011 2UI pulse pattern should
reach full-scale and have the minimum line thickness. In other words, the line thickness
of the CTLE's eye is representative of the residual ISI ($\sum_{i=2}^{\infty} \alpha_i$) for a pulse response of
$(1 + \alpha_1 D^1 + \alpha_2 D^2 + \alpha_3 D^3 + \cdots)$. The CTLE adaptation involves observing the CTLE
output’s line thickness and exploiting this property. Figure 3.11(a) illustrates that the
line thickness of the CTLE’s eye can be measured at three different data patterns at the
crossing (CKX ): 111, 011, and 0101 pattern. Measuring line thickness at 011 and 111
patterns would ensure the flattest equalization up to fbaud /3 ∼ fbaud /4. For example, the line
thickness for the 011 pattern can be obtained by setting the data filter of the data level loop
to 011 and allowing the loop to track the maximum dLev, which is dLev(011)max .
Similarly, the minimum value of the 011 pattern can be obtained by allowing the data level

loop to track the minimum dLev which would be the dLev(011)min . Taking the difference
would yield the line thickness for the 011 pattern as shown in Eq. 3.9. Similarly, the line
thickness of the 111 pattern can be obtained in the same manner. Since the max and min
values of dLev are heavily affected by voltage noise, the line thickness is heavily filtered
to average out the noise.

$\text{Line Thickness}(011) = dLev(011)_{max} - dLev(011)_{min}$ (3.9)

$\text{Line Thickness}(111) = dLev(111)_{max} - dLev(111)_{min}$ (3.10)

Figure 3.11(b) depicts the CTLE’s line thickness for different data patterns vs. CTLE
Cs settings. This plot is generated in MATLAB Simulink with the baud-rate CDR
running with the settings described in the plot title. It is evident that a Cs setting of 200
fF in Figure 3.11(b) yields the optimal CTLE setting with the minimum line thickness,
thus the flattest equalization up to fbaud /3 ∼ fbaud /4. To converge to the minimum
thickness, the CTLE setting is initially set to the maximum value, as described in
Section 3.3, and then lowered until the sign of the change
in line thickness flips. At this point, the adaptation engine reverts to the previous
CTLE setting and ends adaptation.
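
Eq. 3.9 and Eq. 3.10 can be illustrated offline on a recorded set of samples. The sketch below is a simplification: it takes the maximum and minimum directly from the pattern-filtered samples, whereas the engine tracks them through the data level loop with heavy filtering. The function name and the sample values are hypothetical.

```python
def line_thickness(samples_mv, bits, pattern):
    """Max minus min of the samples whose last len(pattern) bits match `pattern`."""
    n = len(pattern)
    picked = [samples_mv[i] for i in range(n - 1, len(bits))
              if tuple(bits[i - n + 1:i + 1]) == tuple(pattern)]
    if not picked:
        raise ValueError("pattern never occurs in the bit sequence")
    return max(picked) - min(picked)

# Toy capture at phase CKX (values in mV are made up for illustration).
bits = [0, 1, 1, 1, 0, 1, 1, 0, 0, 1, 1]
samples_mv = [-200, 90, 180, 210, -60, 95, 195, -50, -190, 85, 188]

# The 011 pattern ends at indices 2, 6 and 10, giving samples 180, 195 and 188.
print(line_thickness(samples_mv, bits, (0, 1, 1)))
```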
In contrast to spectrum balancing a CTLE as in Figure 3.12 [33, 20, 17, 10, 18],
performing CTLE adaptation based on the line thickness of its output is essentially
a pattern-guided CTLE adaptation similar to [12, 13, 32]. Figure 3.13 shows the
CTLE transfer function with the 0011 pattern and its neighboring patterns for different
CTLE settings. Filtering for the 011 and 111 patterns and measuring the line thickness is
a pattern-guided method of optimizing the CTLE equalization setting such that the
transfer function has the flattest response without certain patterns being over/under
equalized. Essentially, minimizing line thickness is getting rid of all the higher-order
post-cursor ISIs except for the first post-cursor ISI which the DFE will cancel out after
the CTLE.
A visual example demonstrating the CTLE adaptation is shown in Figure 3.14. The
CTLE’s eye diagram for various Cs settings from the highest to the lowest is shown on
the left. Following the CTLE adaptation algorithm starting from the highest setting, it
will converge to the CTLE setting with minimum line thickness which is at Cs = 200 fF.
The eye diagram associated with this optimal setting is highlighted in red.

3.5 Part 2: Comparator Level Adaptation

The comparator level adaptation is predicated on optimizing for the PD operation instead
of the DFE. The optimal PD level essentially is the x-axis midpoint of the eye opening.

(a) Measuring line thickness for different data patterns

(b) Line thickness Vs CTLE Cs setting

Figure 3.11: CTLE adaptation using line thickness



Figure 3.12: Block diagram of spectrum balancing [17]

Figure 3.13: CTLE transfer function showing 0011 pattern and its neighboring patterns for three different
CTLE settings

Figure 3.14: Visual Example of CTLE adaptation

In order to find the optimal PD level, we look at the slew rate, similar to [23]. It is
apparent from Figure 3.15 that the 011 rising data sequence slews, and thus the slew rate
is given by:

$\text{Slew rate} = \frac{V_{amp}}{0.5\,UI}$ (3.11)

The 0.5 UI base of the triangle follows from the premise that if the CTLE is able to
equalize up to fbaud /4, all ISI is equalized for the 0011 pattern, and thus the rise time
of a 011 sequence equals the fall time of a 110 sequence. Exploiting the
right triangle shown in Figure 3.15, the optimal PD level follows Eq. 3.12. Setting
α to this optimal PD level samples the data at the center of the eye opening available
for the 1-tap speculative DFE with 0.5UI of timing margin on both sides. As a result,
we can set the comparator level α to the optimal PD level once we find Vamp , simply
by dividing it by 2.

$\text{Optimal PD level} = 0.25\,UI \times \text{Slew rate} = \frac{V_{amp}}{2}$ (3.12)
To find Vamp , the adaptive sampler inside the data level loop is used to find the voltage
levels for different data patterns. First, the PI code is swept to find the cross point
between dLev(011)avg and dLev(110)avg . This cross point of average values of the two
patterns is illustrated in Figure 3.16. This figure also illustrates various patterns high-
lighted on the eye diagram and its corresponding values of dLev simulated in MATLAB
Simulink on the right. At this clock phase of CKX , (at the crossing) Vamp = dLev(011)max

Figure 3.15: Visual example of theory behind the proposed algorithm for finding optimal PD level

can be obtained. At this same clock phase, the line thickness for CTLE adaptation is also
obtained by measuring dLev(011)max and dLev(011)min . The maximum value, rather than the
average value, is deliberately used for Vamp because the rising waveform
does not slew along an ideal straight line: the slope tails off slightly near
the crossing point of the 011 and 110 patterns. This phenomenon can be seen in Figure
3.17. Taking the maximum value of dLev helps compensate for this nonideality of the rising
waveform.
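
The PI sweep and Eq. 3.12 can be sketched as follows. The sketch assumes the data level loop has already produced dLev(011)avg, dLev(110)avg, and dLev(011)max for every PI code; the list-based interface, the function name, and the toy numbers are illustrative assumptions.

```python
def optimal_pd_level(dlev_011_avg, dlev_110_avg, dlev_011_max):
    """Find the PI code where the 011 and 110 averages cross, then apply Eq. 3.12."""
    diffs = [a - b for a, b in zip(dlev_011_avg, dlev_110_avg)]
    # Crossing point CKX: the PI code with the smallest gap between the two averages.
    cross_code = min(range(len(diffs)), key=lambda c: abs(diffs[c]))
    v_amp = dlev_011_max[cross_code]   # Vamp = dLev(011)max at the crossing
    alpha_opt = v_amp / 2.0            # Optimal PD level = Vamp / 2
    return cross_code, v_amp, alpha_opt

# Toy sweep over 7 PI codes (mV): 011 rises while 110 falls.
avg_011 = [0, 50, 100, 150, 200, 250, 300]
avg_110 = [300, 250, 200, 150, 100, 50, 0]
max_011 = [10, 60, 110, 160, 210, 260, 310]

print(optimal_pd_level(avg_011, avg_110, max_011))
```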

3.6 Summary of Adaptation

Figure 3.18 shows a visual summary of the entire adaptation plotted against the CTLE
parameter Cs. The graph should be read from right to left: the adaptation
process starts at the highest setting of 200 fF and continues until it finds the minimum line thickness
(e.g. 100 fF). Where and how the line thickness is taken is shown on the left side. At every
step along the way, for each value of Cs, the α level is updated to the optimal PD level
at the x-axis center of the eye opening with 0.5UI margin on both sides, as shown in the
eye diagram. At Csopt , which has the minimum line thickness, the α level is considered αopt ,
and these are the final converged values after adaptation. In the next section (Section
3.7), where the system-level behavioral model is discussed, adaptation versus time is
illustrated via a time-domain simulation.

Figure 3.16: Visual example of finding Vamp

Figure 3.17: Visual example of why Vamp = dLev(011)max



Figure 3.18: Adaptation where line thickness guides CTLE adaptation of Cs (top right) and optimal
sampling phase deduced from the slew rate guides adaptation of comparator level

3.7 System-level Behavioral Model

A system-level behavioral model was built in MATLAB Simulink for a quarter-rate architecture,
as shown in Figure 3.19, which is the exact schematic of the circuit
implementation to be discussed in the next chapter when we delve into analog design.
Variables that are adjusted during adaptation are highlighted in red. These variables
(CTLE parameter, α and dLev, PI code) form a feedback loop from digital to analog
and could be observed in a system-level model. Both continuous-time and discrete-time
event driven models were created as they each have specific pros and cons. An advantage
of the continuous-time model is that it is more accurate and provides insight into the
real-time eye diagram of the signals, which is very useful in debugging. The cost of a
continuous-time model is that the simulation is slow. An event-driven model allows
for a faster simulation which is useful for running a sweep of long simulations e.g. a jitter
tolerance test. In both continuous-time and event-driven models of the CDR, analog
front-end is separated from digital back-end which is solely for digital adaptation.

3.7.1 Behavioral Model: Continuous-time Model

For modelling the channel, the Tyco 5” channel’s real s-parameters measured with a
vector network analyzer (VNA) were imported. The RF Toolbox from MATLAB was

Figure 3.19: Proposed schematic of quarter-rate baud-rate receiver. Digital adaptation is completed in
digital back-end and the rest of the CDR is done in analog front-end

used to map this into a rational function with poles and zeros such that the transfer
function block can be used in Simulink for the channel model. An example of various
channels mapped in MATLAB is shown in Figure 3.20. Specifically, the proposed CDR
was mainly designed for the Tyco 5” channel as other channels have too much channel
loss at the Nyquist frequency of 18 GHz for the target data-rate of 36 Gb/s.
One problem with a behavioral model is that it is usually built from ideal models and is hence
too unrealistic. Gaussian and sinusoidal jitter are added to the quarter-rate clocks in
order to imitate real-life jitter and noise as much as possible. Even on the comparators,
Gaussian voltage noise is added. Although metastability, offset and hysteresis are not
captured properly for the comparator, adding voltage noise is far better than a noise-less
ideal comparator.
A graphical example of adaptation vs. time is shown in Figure 3.21. To explain the
context for this simulation, the channel model is Tyco 5” at a data-rate of 36 Gb/s.
This channel in MATLAB has 19.9 dB loss at Nyquist. The VCO’s phase noise at 1
MHz offset is set at -80dBc/Hz. Other simulation conditions include 0.1 UIpp sinusoidal
jitter, 0.1 UIpp random Gaussian jitter, additive white Gaussian noise with 40 dB SNR,
and comparator noise turned on. The left column with the eye diagrams illustrates the
quality of the eye for the initial CTLE equalization setting and the final CTLE setting.
A Cs value of 200 fF gives the smallest line thickness and hence the optimal CTLE
equalization setting. On the right are the different variables plotted against time. The
CTLE parameter Cs is initially set to the highest level of 400 fF and then adaptation
begins. Once the line thickness of the next Cs value is greater than the previous value,

Figure 3.20: Plot of channel characteristic of various channels imported and converted to rational system
model in MATLAB

Cs returns to the previous state and adaptation finishes. In this example, when Cs is
set to 150 fF, the line thickness increases; therefore, the Cs setting returns to 200 fF,
which corresponds to the minimum line thickness. At each Cs setting along the way, the
comparator level α is set to the optimal PD level discussed in the previous section so
that the CDR always preserves phase lock, since large abrupt changes in CDR settings can
cause the CDR to lose lock and diverge completely. To verify that the CDR is error free,
a BERT (bit error rate tester) was built in Simulink to confirm that the recovered data
is still a PRBS just like the input to the channel. The error count remained stable
after the adaptation converged and finished at 20µs.

3.7.2 Behavioral Model: Event-driven Model

An event-driven simulation refers to a class of discrete-time simulations where the smallest
simulation step size corresponds to the occurrence of an event as opposed to time [15].
An advantage of an event-driven model of a CDR is that the simulation is up to
1800 times faster than a continuous-time model [39]. Event-driven simulations
use a variable time step that captures events of interest, resulting in faster simulation.
The event-driven model is designed to share many of the same components as the
continuous-time model. However, one difference is how the VCO clocks and the input
data are created. For the channel model, channel step response is used instead of a

Figure 3.21: Adaptation vs time where line thickness guides CTLE adaptation and optimal PD level
guides α level adaptation. Tyco 5” channel at 36 Gb/s

Figure 3.22: Step response and pulse response of channel + CTLE

continuous-time transfer function with poles and zeros. The CTLE's transfer function is
combined with the channel model's transfer function into one step response
since they cannot be cascaded. Figure 3.22 illustrates an example of the combined step
and pulse responses of the channel + CTLE. Furthermore, the CDR's loop filter is
converted into a discrete-time z-domain function using the bilinear transform.
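
As a minimal sketch of the bilinear transform step, the example below discretizes a single-pole RC low-pass filter, H(s) = 1/(1 + sRC), by substituting s = 2·fs·(1 − z⁻¹)/(1 + z⁻¹). The single-pole filter is only a stand-in, since the actual loop filter topology and component values are not given here; the R, C, and sample-rate values are assumptions.

```python
def bilinear_rc_lowpass(r_ohm, c_farad, fs_hz):
    """Return (b, a) of H(z) = (b0 + b1 z^-1) / (1 + a1 z^-1) for H(s) = 1/(1 + sRC)."""
    k = 2.0 * fs_hz * r_ohm * c_farad
    b = (1.0 / (1.0 + k), 1.0 / (1.0 + k))
    a = (1.0, (1.0 - k) / (1.0 + k))
    return b, a

def run_filter(b, a, x):
    """Apply the first-order difference equation y[n] = b0 x[n] + b1 x[n-1] - a1 y[n-1]."""
    y = []
    x_prev = 0.0
    y_prev = 0.0
    for x_n in x:
        y_n = b[0] * x_n + b[1] * x_prev - a[1] * y_prev
        y.append(y_n)
        x_prev, y_prev = x_n, y_n
    return y

# Hypothetical values: R = 1 kohm, C = 1 pF, sampled at 9 GHz.
b, a = bilinear_rc_lowpass(r_ohm=1e3, c_farad=1e-12, fs_hz=9e9)
step = run_filter(b, a, [1.0] * 500)
print(b, a, step[-1])
```

The bilinear transform preserves the DC gain of the continuous-time filter, so the discrete step response settles to the same final value.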
Using the event-driven model of the CDR, jitter tolerance was simulated with injection
of sinusoidal jitter after the digital adaptation converged to the final values. The jitter
tolerance simulation was programmed in MATLAB to use a binary search algorithm with
5 iterations from the initial search points (red line) shown in Figure 3.23. For example,
from this initial search point, if the jitter tolerance test passes for the specified BER,
that is the jitter tolerance for that frequency. If the jitter tolerance test fails at the initial
search point, the test would try again at the halfway point in terms of amplitude, like
a binary search. The high-frequency jitter tolerance is 0.5 UIpp and the minimum jitter
tolerance is around 0.4 UIpp at the dip due to an undershoot. These jitter tolerance
values are for BER < $10^{-6}$; therefore, for BER < $10^{-12}$, degradation is expected.
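
The search procedure can be sketched as below (the thesis simulation itself was programmed in MATLAB). The `passes` callback is a hypothetical stand-in for running the event-driven CDR model at one jitter amplitude and checking the BER target, and the exact halving rule is one plausible reading of the description above.

```python
def jitter_tolerance(passes, start_amp_ui, iterations=5):
    """Binary search for the largest sinusoidal jitter amplitude that still passes.

    passes(amp_ui) -> bool is a hypothetical BER check at one jitter frequency.
    """
    if passes(start_amp_ui):
        return start_amp_ui            # initial search point already passes
    lo, hi = 0.0, start_amp_ui         # lo: assumed passing, hi: known failing
    for _ in range(iterations):
        mid = (lo + hi) / 2.0          # try the halfway point in amplitude
        if passes(mid):
            lo = mid
        else:
            hi = mid
    return lo

# Toy pass/fail model: the CDR tolerates up to 0.4 UIpp at this frequency.
print(jitter_tolerance(lambda amp: amp <= 0.4, start_amp_ui=1.0))
```

With 5 iterations the result is resolved to within 1/32 of the initial search amplitude, so the reported tolerance is a slightly conservative estimate.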

Figure 3.23: Jitter tolerance simulated for the converged adaptive setting of event-driven model (BER < $10^{-6}$)
Chapter 4

Circuit Simulations and Measurement Results

Circuit implementation (schematic design) in Cadence and its simulation results, as well
as the measured results of the testchip, are discussed in this chapter.
The 36 Gb/s inductor-less baud-rate CDR from Figure 3.19, on which the adaptation
engine is demonstrated, is fully implemented in the analog front-end. Since it is a
standalone analog CDR, it is fully functional without digital synthesis. The adaptation engine
and the BERT are the only circuits built digitally. An advantage of having a standalone,
fully functional CDR designed in analog is that locking of the CDR can be simulated
in Cadence. When a digital CDR is used, an AMS (analog/mixed signal) verification tool
has to be set up to test the analog circuits together with the digital circuits in order to
verify locking of the CDR, which is more complicated. To minimize the
chance of the adaptive baud-rate CDR not working, the analog CDR was fully verified
for lock in Cadence using Spectre, and it was verified that there are no bit errors. In addition,
the digital circuit is verified functionally in ModelSim and NCsim at every step along the
way (Verilog RTL, synthesized RTL, post place & route) against the vectors generated
in the Simulink simulation. The digital implementation is discussed in more detail after
the analog design section.

4.1 Analog Design

Since the digital adaptation is the novelty of this adaptive baud-rate CDR, only key
schematics and simulation results will be discussed.
To begin the analog design section of the thesis, Figure 4.1 shows the schematic of a 2-
stage CTLE. The tunable source degeneration capacitor is a 4-bit digitally programmable
array of MOM capacitors and pass-gate switches. A 2-stage CTLE (both with high-


Figure 4.1: Schematic of the 2-stage CTLE

frequency boost) had to be implemented because the CTLE is heavily loaded at the
output by 8 comparators for the DFE and the PD, and one additional comparator used for
adaptation. As a result, with a single CTLE stage, the output capacitance limits the bandwidth
of the boost. When two stages are used, the first stage is only loaded by the input gate
of the second CTLE stage and hence provides boost at a higher frequency and improves the
CTLE's overall bandwidth. Figure 4.2 is the simulated AC response of the 2-stage
CTLE with post-layout extraction. Due to layout parasitics from the interconnect,
power grid and fill, it is very challenging to push the bandwidth any further. Figure 4.3
demonstrates the eye opening for various Cs values with post-layout extraction with 32
Gb/s PRBS31 as the input.
Following the CTLE in the analog front-end are the comparators. In a quarter-rate
architecture, the number of comparators required increases. In the proposed adaptive
baud-rate CDR, 8 comparators are required for the DFE and the PD. Four comparators
have +α and the other four comparators have -α as the threshold level, as shown in Figure
3.19. A double-tail latch published at ISSCC 2007 is used as the sense amplifier, as opposed
to a StrongArm latch, because double-tail latches have a performance advantage at low
supply voltages due to less stacking of devices [31]. A schematic of the double-tail latch is
shown in Figure 4.4. Since TSMC 28nm HPC uses a 0.9V supply for thin-oxide devices,
double-tail latches were preferred. Since comparisons to +/-α rather than to the
zero level are required, a dual-difference comparator scheme was used in which the input
data is compared to the threshold level α. These threshold levels are generated by a 9-bit
reference DAC block with 512 levels at 1mV/step. For the single adaptive sampler used to
provide error information to the digital adaptation, the input sensitivity is modified by
optimizing the sizing of the double-tail latch. A higher input accuracy was necessary for adaptation

Figure 4.2: Simulated AC response of the 2-stage CTLE in Cadence

Figure 4.3: Simulated eye diagram at the output of the 2-stage CTLE. Top two eyes are under-equalized.
The bottom left is optimally equalized and the bottom right is starting to over-equalize

Figure 4.4: Schematic of double-tail latch published in ISSCC2007 [31]

whereas for the remaining 8 comparators used for the DFE and the PD, higher sensitivity was
not required to achieve CDR lock and error-free recovered data.
The DFE and the PD, which use the information gathered by the comparators, are
custom digital logic blocks made up of simple logic gates, flip-flops, multiplexers and
adders. The DFE is designed to operate at 2 Gb/s in 16 parallel interleaved paths. The
PD is designed to operate at 4 Gb/s in 8 parallel paths. The four quarter-rate comparator
paths are demuxed accordingly for both the DFE and the PD.
The charge pump is a simple current-steering differential pair and the loop filter is
a common higher-order RC loop filter for a type-II PLL. Figure 4.5 depicts a simplified
schematic of the charge pump and loop filter combination. The amount of current steered by
the charge pump can be digitally adjusted with a 4-bit setting, which affects the CDR’s loop
gain. C1 and C2 of the loop filter are fixed capacitances set by MOM capacitors. The
resistance is a 4-bit tunable resistor switch array for adjusting the CDR’s loop dynamics.
The VCO is a CML-based 8-stage ring oscillator, as shown in Figure 4.6.
8 stages were used to reduce phase noise, as it has been published that increasing the
number of stages in a ring oscillator reduces the phase noise [3]. The proposed 8-stage

Figure 4.5: Schematic of charge pump and loop filter

ring VCO uses the same CML delay stage architecture as [16, 29] and has a tuning
range from 6.76 to 9.14 GHz when Vctrl is swept from 200 to 700mV. Over this tuning
range, the VCO’s free-running phase noise simulated in Cadence Spectre with post-layout
extraction was -80.77 to -82.42dBc/Hz at 1 MHz offset. The CML clocks coming out of
the VCO are converted to CMOS signals used by the comparators, using a CML2CMOS
circuit similar in structure to that of [9].
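As a rough cross-check of the simulated tuning range, the following sketch converts the reported frequency limits into an average VCO gain and into the corresponding data-rate range under quarter-rate clocking. The linear-tuning assumption and all variable names are illustrative only and are not taken from the design; the actual KVCO varies across the range.

```python
# Back-of-the-envelope numbers from the simulated VCO tuning range.
# Assumption: tuning is treated as linear between the two end points.

F_MIN_GHZ, F_MAX_GHZ = 6.76, 9.14   # simulated VCO frequency limits
V_MIN_V, V_MAX_V = 0.200, 0.700     # Vctrl sweep in volts
CLOCK_FRACTION = 4                  # quarter-rate: data rate = 4 x VCO frequency

# Average gain over the sweep (end-point slope, not a small-signal KVCO).
avg_kvco_ghz_per_v = (F_MAX_GHZ - F_MIN_GHZ) / (V_MAX_V - V_MIN_V)

# Data rates the CDR could track if the VCO were the only limit.
rate_min_gbps = CLOCK_FRACTION * F_MIN_GHZ
rate_max_gbps = CLOCK_FRACTION * F_MAX_GHZ

print(f"average KVCO ~ {avg_kvco_ghz_per_v:.2f} GHz/V")
print(f"data-rate range ~ {rate_min_gbps:.2f} to {rate_max_gbps:.2f} Gb/s")
```

These are simulated typical-corner figures; fabricated silicon can shift the range.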
The VCO and the clock buffers are under a regulated voltage to suppress supply
voltage noise. An LDO (low-dropout) regulator with a PMOS pass-gate was implemented
for the regulator. A PMOS design with a lower drop-out voltage had to be used instead
of an NMOS pass-gate, which has a superior PSRR (power supply rejection ratio), because
high-voltage thick-oxide devices were not available in the TSMC 28nm HPC design kit
through MOSIS. Since the nominal supply voltage is 0.9V with a maximum recommended
voltage of 1.0V, an LDO regulator had to be used. Even then, the LDO is designed with
a 1.1V supply, which is a little higher than the recommended maximum supply voltage
and hence could have some repercussions in terms of reliability. Since this is a testchip
rather than a real product with more stringent reliability requirements, applying a 1.1V
supply solely to the PMOS pass-gate of the LDO was deemed acceptable. If higher-supply
thick-oxide devices had been available, the LDO regulator would have been placed on the
higher supply (VDDH).
Inputs to the digital back-end of the CDR are designed to be approximately 1 Gb/s
data and a 1 GHz core clock (CKrec/8). The Dout data path is 32 parallel bits of data, as
shown in Figure 3.19. An adaptive sampler is clocked by an adaptive clock CKX with the
PI’s phase controlled by the output of the digital adaptation. The comparator threshold level
uses dLev from the output of the digital adaptation as well. The error signal from the

Figure 4.6: Schematic of 8-stage ring oscillator used as VCO

adaptive sampler is also down-sampled to 1 Gb/s by multiple demux stages. However,
it is still a 1-bit signal, as only the MSB is propagated through the demux stages.
To generate the core clock (CKrec/8) used for digital adaptation, the quarter-rate clock is
divided down by clock dividers.
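The rates quoted in this section follow directly from the 32 Gb/s simulated input and the number of parallel paths. The short sketch below reproduces that arithmetic; the function and variable names are illustrative and do not come from the design database.

```python
# Per-path data rates and core clock for a 32 Gb/s quarter-rate receiver.
# All names here are illustrative.

DATA_RATE_GBPS = 32.0                        # simulated input data rate
QUARTER_RATE_CLOCK_GHZ = DATA_RATE_GBPS / 4  # CKrec

def per_path_rate_gbps(data_rate_gbps, n_paths):
    """Rate of each path when the stream is split into n_paths interleaved paths."""
    return data_rate_gbps / n_paths

dfe_path_rate = per_path_rate_gbps(DATA_RATE_GBPS, 16)   # DFE: 16 paths
pd_path_rate = per_path_rate_gbps(DATA_RATE_GBPS, 8)     # PD: 8 paths
dout_path_rate = per_path_rate_gbps(DATA_RATE_GBPS, 32)  # Dout: 32 parallel bits
core_clock_ghz = QUARTER_RATE_CLOCK_GHZ / 8              # CKrec/8

print(dfe_path_rate, pd_path_rate, dout_path_rate, core_clock_ghz)
```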

4.1.1 Closed-loop CDR Simulations

The results of the closed-loop CDR simulations with post-layout extraction
are discussed in this subsection for the proposed adaptive baud-rate CDR with just
the analog CDR portion, without digital adaptation. The testbench of the analog CDR
portion of the adaptive baud-rate CDR is as follows. The input data is a 32 Gb/s PRBS31
pattern which is attenuated through a Verilog-A model of the Tyco 5” channel imported
into Cadence. Since the CDR is a PLL-style CDR, the initial frequency is adjusted by
setting the Vctrl of the VCO so that the initial frequency matches the input data to within
the frequency capture range of the PD.
Figure 4.7 illustrates that the CDR is error free for all 16 parallel DFE paths of the
data after phase lock. Even if all parallel data paths are error free for a PRBS pattern, it
does not prove that the interleaved data at full-rate is also error free. Therefore, another
test with a PRBS7 was conducted to verify that the recovered data is still a PRBS7 when
the parallel recovered data are interleaved and combined manually. A PRBS7 was used
for this as it is a short repeating pattern that can be checked much more easily than, say,
a PRBS31. For a PRBS31, a digital BERT written in Verilog for digital synthesis can
be used to check that the data is error free after fabrication. It interleaves all 32 down-sampled
parallel data paths and verifies that the result is error free.
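The point that error-free parallel paths do not guarantee an error-free interleaved stream can be illustrated with a small behavioral sketch. It assumes the common PRBS7 recurrence a[n] = a[n-6] XOR a[n-7] and a self-synchronizing checker; the function names, the seed and the deliberate path swap are illustrative and are not the testbench used in this work.

```python
# Behavioral sketch: PRBS7 split into 16 paths, then re-interleaved and checked.
# Assumption: PRBS7 defined by a[n] = a[n-6] XOR a[n-7].

def prbs7(n_bits, seed=(1, 1, 1, 1, 1, 1, 1)):
    """Return n_bits of PRBS7; seed holds the 7 previous bits, oldest first."""
    history = list(seed)
    out = []
    for _ in range(n_bits):
        new_bit = history[-6] ^ history[-7]
        out.append(new_bit)
        history = history[1:] + [new_bit]
    return out

def prbs7_errors(bits):
    """Self-synchronizing check: count positions that violate the recurrence."""
    return sum(bits[n] != (bits[n - 6] ^ bits[n - 7]) for n in range(7, len(bits)))

def deinterleave(bits, n_paths):
    """Path k receives bits k, k + n_paths, k + 2*n_paths, ..."""
    return [bits[k::n_paths] for k in range(n_paths)]

def interleave(paths):
    """Inverse of deinterleave for equal-length paths."""
    return [bit for group in zip(*paths) for bit in group]

N_PATHS = 16
full_rate = prbs7(N_PATHS * 127)          # 16 periods of the 127-bit pattern
paths = deinterleave(full_rate, N_PATHS)

per_path_errors = [prbs7_errors(p) for p in paths]
good_errors = prbs7_errors(interleave(paths))

# Swap two paths: each path on its own still looks like a valid PRBS7,
# but the interleaved full-rate stream no longer does.
swapped = paths[:]
swapped[0], swapped[1] = swapped[1], swapped[0]
swapped_per_path_errors = [prbs7_errors(p) for p in swapped]
swapped_errors = prbs7_errors(interleave(swapped))

print(good_errors, swapped_errors)
```

Each decimated path passes the check on its own because decimating a maximal-length sequence by a power of two yields a shifted copy of the same sequence, which is why the interleaved check is the meaningful one.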

Figure 4.7: Simulated results of closed loop CDR at 32 Gb/s PRBS31 input in Cadence: Error count

Figure 4.8: Simulated results of closed loop CDR at 32 Gb/s PRBS31 input in Cadence: Recovered
clock frequency

Figure 4.8 illustrates the frequency of the recovered clock. It dithers around
an average value of 8 GHz (quarter-rate) for 32 Gb/s PRBS31 data. Figure 4.9 shows
Vctrl, the control voltage going into the VCO. It fluctuates after the CDR
is locked since the PD periodically dithers between early and late. The peak-to-peak
amplitude of Vctrl matched the Simulink simulation result very closely, confirming a close
resemblance between the behavioral model and the circuit simulation.

4.2 Digital Design

During the initial design of the digital adaptation in MATLAB Simulink, the building
blocks were specifically designed with MATLAB fcn (function) blocks with code written
much like Verilog for an easier HDL conversion in the later design stage, i.e. RTL logic
synthesis. This is because the end goal of the digital circuit was not just to simulate and
validate the results in a behavioral model but to synthesize and place & route the digital
logic such that the digital layout could be taped out along with the custom analog
layout.

Figure 4.9: Simulated results of closed loop CDR at 32 Gb/s PRBS31 input in Cadence: Vctrl of VCO

There are three major digital blocks for the adaptation engine in Figure 3.1. The first is the
accumulator block, which makes up the data level loop in conjunction with the adaptive
sampler, an analog block designed in Cadence. The accumulator block consists
of a digital filter, a pattern filter, an integrator and an FSM for pattern filter scheduling. The
second digital block is the PI logic block for obtaining the appropriate clock phase for
the adaptive sampler at the crossing point of the 011 and 110 data patterns. The third,
the adaptation logic block, is responsible for measuring the line thickness, Vamp, to adapt
the CTLE parameter Cs and the comparator level α for the DFE/PD adaptation.
Now that the digital blocks have been introduced, the digital design process will
be discussed. The first step was to generate the test vectors from a Simulink
simulation for both the input and output variables of the digital adaptation circuit. These test
vectors are saved at every rising edge of the core clock used to clock the digital circuits.
Second, all MATLAB fcn blocks in Simulink were manually converted into Verilog code.
The validity of the Verilog models was confirmed with a testbench in which the Verilog
models were fed with the input test vectors saved from Simulink. For each bit, the
output of the Verilog model was compared to the expected output vectors from Simulink.
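The bit-by-bit comparison against the saved vectors can be sketched as follows. This is only an illustration of the checking idea, not the Verilog testbench used in this work; the vector format (one bit string per core-clock edge), the function name and the example data are all hypothetical.

```python
# Sketch of a golden-vector check: one output word per rising core-clock edge.
# The data below is made up purely to exercise the function.

def compare_vectors(expected, actual):
    """Return (cycle, bit_index) for every bit where the two outputs differ."""
    if len(expected) != len(actual):
        raise ValueError("vectors cover a different number of clock cycles")
    mismatches = []
    for cycle, (exp_word, act_word) in enumerate(zip(expected, actual)):
        if len(exp_word) != len(act_word):
            raise ValueError(f"word width differs at cycle {cycle}")
        for bit_index, (e, a) in enumerate(zip(exp_word, act_word)):
            if e != a:
                mismatches.append((cycle, bit_index))
    return mismatches

golden = ["0101", "1100", "0011"]    # hypothetical expected outputs
dut_ok = ["0101", "1100", "0011"]    # hypothetical matching model outputs
dut_bad = ["0101", "1000", "0011"]   # one bit differs in cycle 1

print(compare_vectors(golden, dut_ok))
print(compare_vectors(golden, dut_bad))
```

Reporting the cycle and bit position of each mismatch, rather than a pass/fail flag, makes it easier to trace a discrepancy back to a specific block.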
The next step in the digital design was to take the RTL design written in Verilog
and run RTL synthesis in Cadence using a .tcl script. After RTL synthesis, the
adaptation engine was not able to meet the timing constraint at a 1GHz clock speed, even with
the high-speed custom standard cells. To fix this timing failure, the critical path was
identified. The adaptation engine takes in 32 parallel data bits as the input and evaluates all
of them in 8 blocks (groups) of 4-bit data. This is a serial operation in the digital logic, hence
the timing could not be closed. The solution was to use only one block of 4-bit data instead
of all 8 and throw away 7 blocks (28 bits) of data every clock cycle. As a result, the
timing constraint is met for the standard cells, and the trade-off is that digital adaptation
takes 8x longer to achieve convergence.
Alternatively, one could make an argument to slow down the digital core clock to

250 MHz to meet timing, but this means that the recovered data going into the digital
adaptation has to be demuxed from 32 to 128 bits. To process all 128 bits, there is a
4x increase in the propagation delay of the combinational logic. Although 250 MHz means a 4x
longer period, there is zero gain in terms of timing since the delay scales linearly.
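The scaling argument can be made concrete with a toy model in which the critical-path delay grows linearly with the number of bits evaluated per clock cycle. The per-bit delay below is a made-up number chosen only for illustration; it is not a figure from the synthesis reports.

```python
# Toy timing model: critical-path delay proportional to bits processed per cycle.
# DELAY_PER_BIT_NS is hypothetical.

DELAY_PER_BIT_NS = 0.04

def path_utilization(clock_mhz, n_bits, delay_per_bit_ns=DELAY_PER_BIT_NS):
    """Critical-path delay divided by the clock period; above 1.0 fails timing."""
    period_ns = 1e3 / clock_mhz
    return n_bits * delay_per_bit_ns / period_ns

all_bits_1ghz = path_utilization(1000, 32)     # evaluate all 32 bits at 1 GHz
all_bits_250mhz = path_utilization(250, 128)   # 128 bits at 250 MHz
one_block_1ghz = path_utilization(1000, 4)     # one 4-bit block at 1 GHz

print(all_bits_1ghz, all_bits_250mhz, one_block_1ghz)
```

Under this linear assumption the first two cases have the same utilization, whereas keeping a single 4-bit block cuts it by 8x at the cost of the 8x longer convergence noted above.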
The second method, which further improved the timing of the digital circuit, was to remove the
>= (greater than or equal) operation, which required a 5-bit digital comparator that is cascaded
for each bit. Instead, >= was changed to == so that parallel XNOR gates could be used,
which is a much cheaper operation in digital logic gates during RTL synthesis. Although
>= is always a safer operation in case there is a glitch in the digital bits, == had to
be used in order to meet the timing constraint for the testchip. Once the synthesized Verilog
met timing, it was again tested against the test vectors from Simulink to validate
error-free operation. NCSim, a digital verification tool from Cadence, was used after
RTL synthesis.
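The difference between the two comparisons can be sketched at the bit level. The model below is a behavioral illustration only, with hypothetical function names and a hypothetical threshold; it does not reproduce the synthesized gates.

```python
# Bit-level sketch of a 5-bit equality check versus a magnitude comparison.
# THRESHOLD and all names are hypothetical.

WIDTH = 5

def to_bits(value, width=WIDTH):
    """MSB-first list of bits."""
    return [(value >> i) & 1 for i in reversed(range(width))]

def equal_xnor(a, b, width=WIDTH):
    """a == b: one XNOR per bit, all independent, followed by an AND."""
    xnor = [1 - (x ^ y) for x, y in zip(to_bits(a, width), to_bits(b, width))]
    return int(all(xnor))

def greater_equal_ripple(a, b, width=WIDTH):
    """a >= b: the decision depends on the first differing bit, MSB to LSB."""
    for x, y in zip(to_bits(a, width), to_bits(b, width)):
        if x != y:
            return int(x > y)
    return 1  # every bit is equal

THRESHOLD = 20
counts_with_skip = [18, 19, 21, 22]  # counter that jumps over the threshold

fired_eq = [equal_xnor(c, THRESHOLD) for c in counts_with_skip]
fired_ge = [greater_equal_ripple(c, THRESHOLD) for c in counts_with_skip]
print(fired_eq, fired_ge)
```

The skipped count shows why >= is the safer choice: the equality check never fires if the value jumps past the threshold, while the magnitude comparison still does.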
In the next step, the synthesized Verilog was used with the P & R (place & route)
flow in Cadence Innovus. This process generates a new Verilog file representing
the end result of the place & route, which was again tested against the test vectors in NCSim.
Since the Verilog code was validated against the test vectors generated from Simulink
at every step of the digital design process, it provided confidence that the synthesized
digital circuits would be functional post-tapeout even without AMS simulation. An AMS
simulation was omitted due to lack of time and resources, especially with a tight tapeout
schedule. A GDS (graphic database system) file created from P & R was streamed into
Cadence for the layout of the digital adaptation, and the final Verilog generated from P
& R was imported into Cadence for the schematic. With the imported schematic, an LVS
(layout-versus-schematic) check was performed to ensure that all the connections were
correct without shorts or opens.

4.3 Lab Measurements

In this section, measured results of the testchip from the lab will be discussed.

4.3.1 Testchip

The testchip of the adaptive baud-rate CDR with CTLE and 1-tap DFE was fabricated in
TSMC 28nm HPC CMOS technology with a 0.9V supply. The testchip die was packaged
in an open-cavity QFN so that the high-speed input and output could be probed.
Figure 4.10 shows the packaged die under the microscope with the wire bond connections,
and Figure 4.11 is the package pinout instruction sent to the packaging company. Figure
4.12 is a closer view of the die in which all the major building blocks are highlighted in aqua

Figure 4.10: Open-cavity QFN under a microscope showing wire bond connections for the proposed
adaptive baud-rate CDR

blue. The total testchip area was 1.57 mm (width) by 0.785 mm (height). The following
subsection will explain the test setup for the testchip.

4.3.2 Test Setup

Figure 4.15 illustrates the test setup for normal operation of the adaptive baud-rate
CDR. The packaged testchip was soldered onto a PCB, as shown in Figure 4.13,
which is programmed via an Arduino Mega2560 (Figure 4.14) connected to a PC. Figure 4.10
depicts the QFN package under a microscope. High-speed probes rated for 40G were used to
probe the high-speed PRBS input data. The SHF 12104A bit pattern generator was used to
generate both PRBS7 and PRBS31. The input data then passes through the Tyco 5” channel
using 36” SMA cables and is connected to a 40G bias tee before being connected to the
GSGSG probe head. The channel loss through this setup is shown in Figure 4.16, which
was measured using the Agilent N5222A PNA microwave network analyzer. The setup for
obtaining the S21 channel characteristic is depicted in Figure 4.17. Figure 4.18 shows an
eye diagram of this PRBS input at 36 Gb/s, observed using the Agilent Infiniium DCA-J
86100C digital communication analyzer with an 86112A electrical module. At this data-
rate, the input eye is completely closed before being equalized. The measurement setup
for observing this PRBS input eye diagram is shown in Figure 4.19.
On the output side, high-speed quarter-rate recovered clocks were probed at the output

Figure 4.11: Package Pinout for D1: Adaptive baud-rate CDR with CTLE + 1-tap DFE

Figure 4.12: Die micrograph in TSMC 28nm HPC process for the proposed adaptive baud-rate CDR

Figure 4.13: High-speed testboard for design 1: adaptive baud-rate CDR testchip. Testboard is
programmed and controlled by Arduino Mega2560 + PC

Figure 4.14: Arduino Mega2560 used to program the testboard PCB



Figure 4.15: Measurement setup for testing adaptive baud-rate CDR, panels (a) and (b)



Figure 4.16: Measured S21 insertion loss of Tyco 5” channel with 36” cables and bias tees using a VNA. Bias
tees are required for setting the input common-mode as the SHF 12104A PRBS bit pattern generator
cannot set a voltage offset to set the common-mode. Low-frequency loss is caused by poor low-frequency
performance of the bias tees.

Figure 4.17: Measurement setup for measuring S21 channel loss



Figure 4.18: 36 Gb/s PRBS31 input eye measured using a sampling scope including all channel loss

pads for observing the clock spectrum and the phase noise with the Rohde & Schwarz
FSWP26 phase noise analyzer and VCO tester. Before the spectrum analyzer is connected,
the differential clocks being probed need to be converted to a single-ended signal
using a Narda 4346 180° coupler. For low-speed (250-500 Mb/s) or static digital signals,
the Agilent DSAX91604A Infiniium 16GHz real-time digital storage oscilloscope was used
to observe and debug the digital adaptation.

4.3.3 Measurement Results

All measurements presented in this subsection were obtained with the setup shown in
Figure 4.15. With 35 Gb/s PRBS31 input data, the proposed adaptive baud-rate CDR
is able to converge to the optimal CDR settings and achieve CDR lock. The CDR’s
recovered clock spectrum at the 8.75 GHz quarter-rate is shown in Figure 4.20(a). The locked
clock spectrum exhibits a skirt characterized by the loop dynamics, or the bandwidth, of
the CDR. The phase noise of the locked CDR for the same converged adaptive settings is shown
in Figure 4.20(b). There is an overshoot present in the CDR’s loop dynamics, but this
is the optimal setting with minimum overshoot for the analog loop filter present on the
chip. Since the loop filter was designed using fixed MOM capacitor values, we do not
have an extra tuning knob to completely fix the overshoot. The worst-case phase noise
is -104 dBc/Hz at 40 MHz offset. The total integrated jitter is 875.8 fs with a PRBS31
input pattern. Figure 4.21 presents the same measurements for PRBS7. The
worst-case phase noise is -105 dBc/Hz at 35 MHz offset. The total integrated jitter is 750.6

Figure 4.19: Measurement setup for eye diagram of input PRBS31

fs for PRBS7, which is better than for PRBS31, as expected.


The measured phase noise of the locked CDR is a result of the CDR loop bandwidth
suppressing the poor phase noise of the free-running VCO, which was simulated in Cadence
Spectre to be -80.77 to -82.42dBc/Hz over the entire tuning range at 1 MHz offset.
Increasing the CDR’s bandwidth suppresses the phase noise of the VCO further but, as a
consequence, allows more jitter from the input data signal to enter
the system. As a result, there is a fine balance in increasing or decreasing the loop
bandwidth of the CDR around the optimal point before bit errors begin to be introduced.
For the converged CDR setting after adaptation, jitter tolerance was tested by injecting
sinusoidal jitter with the SHF 12104A bit pattern generator, which was programmed
via a PC using an Ethernet cable. Figure 4.22 shows the measured jitter tolerance for PRBS7
& PRBS31 plotted against IEEE 802.3 masks [1, 2]. The equipment limits were 54ps of
absolute sinusoidal jitter amplitude and a maximum jitter frequency of 400 MHz. The
sinusoidal jitter was injected only up to 300 MHz, as it was unclear whether the equipment
remains accurate up to the maximum frequency of 400 MHz. For the jitter amplitude, however,
the full 54ps was used, since even 54ps seemed a little low at lower jitter frequencies.
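To relate the 54ps equipment limit to jitter tolerance masks, which are usually expressed in unit intervals, the conversion can be sketched as below. The function names are illustrative, and whether the 54ps figure is a peak or a peak-to-peak amplitude is not assumed here; the sketch only rescales the number as reported.

```python
# Convert an absolute jitter amplitude in ps into unit intervals (UI).

def unit_interval_ps(data_rate_gbps):
    """Duration of one UI in picoseconds."""
    return 1e3 / data_rate_gbps

def jitter_in_ui(jitter_ps, data_rate_gbps):
    """Jitter amplitude expressed as a fraction of the UI."""
    return jitter_ps / unit_interval_ps(data_rate_gbps)

EQUIPMENT_LIMIT_PS = 54.0
ui_at_35g = unit_interval_ps(35.0)
limit_in_ui = jitter_in_ui(EQUIPMENT_LIMIT_PS, 35.0)

print(f"UI at 35 Gb/s = {ui_at_35g:.2f} ps, 54 ps = {limit_in_ui:.2f} UI")
```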
The dip in the jitter tolerance curve is due to the undershoot, which goes hand in hand
with an overshoot in the CDR’s jitter transfer curve represented by the phase noise plot.
Again, this dip caused by the undershoot cannot be completely fixed due to the fixed
capacitors in the analog loop filter, which set the CDR’s loop dynamics. Despite the
undershoot, the jitter tolerance after adaptation passes both the IEEE 802.3bs and IEEE
802.3cc [1, 2] receiver jitter tolerance masks.

Figure 4.20: Measured (a) clock spectrum and (b) phase noise for locked CDR at 35 Gb/s with PRBS31 input

Figure 4.21: Measured (a) clock spectrum and (b) phase noise for locked CDR at 35 Gb/s with PRBS7 input

Figure 4.22: Measured jitter tolerance with sinusoidal jitter injected

In addition, the converged setting is actually the optimal setting with the least amount
of undershoot. This becomes evident when we plot the jitter tolerance
of settings around the converged setting after adaptation. Figure 4.23 shows that the
minimum 10-100MHz JTol (jitter tolerance) degrades rapidly as the CTLE’s parameter
Cs diverges from the converged value of 8 highlighted in red. When sweeping the
Cs value, we hold the comparator level α constant at the converged value (α = 133 mV).
Similarly, Figure 4.24 shows the result when we take the converged setting and manually
sweep the comparator level α while holding Cs constant (Cs = 8). It is evident that the minimum
10-100MHz JTol is less sensitive to a change in the finer 9b comparator level α with 512
settings than to a change in the coarser 4b CTLE parameter Cs with 16 settings.
Figure 4.25 demonstrates that the designed adaptive baud-rate CDR is able to adapt
to different channel losses, as all three curves pass the IEEE 802.3 masks. Different
channel losses were created by changing the input data-rate, which in essence changes
the channel loss at Nyquist, since the Nyquist frequency itself changes. New channels with
different attenuation could not be obtained, which is the reason why the input data-rate
had to be swept. Ironically, at the slower data-rate of 34 Gb/s, the proposed adaptive
baud-rate CDR actually performs more poorly because 34 Gb/s is at the
bottom of the VCO’s tuning range, where the KVCO gain is lower and may be very noisy
and perhaps not even monotonic. The testchip returned from the fab
faster than the TT (typical) corner, therefore the center frequency of the VCO is higher than
the intended design. Ideally, the CDR with this process shift to an FF corner should

Figure 4.23: Minimum 10-100MHz jitter tolerance around converged adaptive setting, manually sweeping
CTLE parameter Cs

Figure 4.24: Minimum 10-100MHz jitter tolerance around converged adaptive setting, manually sweeping
comparator level α

Figure 4.25: Measured jitter tolerance for different channel losses by sweeping the data-rate, hence
Nyquist frequency

be operating at a higher range of 36-45 Gb/s, but we do not have an appropriate
channel with 10-18 dB loss at Nyquist for the mentioned data-rates.
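Sweeping the data-rate moves the Nyquist frequency at which the fixed channel is evaluated, which is what changes the effective loss. A minimal sketch of that relationship for the three tested data-rates follows; the names are illustrative, and the loss at each frequency would still have to be read from the measured S21 curve.

```python
# Nyquist frequency for NRZ data is half the data rate.

def nyquist_ghz(data_rate_gbps):
    """Nyquist frequency in GHz for an NRZ data rate in Gb/s."""
    return data_rate_gbps / 2.0

tested_rates_gbps = [34.0, 35.0, 36.0]
nyquist_points = {rate: nyquist_ghz(rate) for rate in tested_rates_gbps}

for rate, f_nyq in nyquist_points.items():
    print(f"{rate:.0f} Gb/s -> Nyquist at {f_nyq:.1f} GHz")
```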
The final measurement done in the lab is the power consumption measurement. On the
PCB, sense resistors were installed in order to measure the current being drawn
by the DUT (device under test) for each of the power domains. The power domains are
separated into VDDA, VDD CTLE, VDD DAC, VDD LDO, VDD DIG and VDD IO.
VDDA contains most of the analog circuits, including comparators, demux, PD, DFE, CP,
LF and some clock buffers & clock dividers. The VDD CTLE domain powers the two-stage
CTLE. VDD DAC is a separate supply domain solely for the reference DAC used
to set the comparators’ threshold levels. The reference DAC’s power supply was kept
separate in case the reference DAC’s levels had to be adjusted independently from
other power domains after tapeout. VDD LDO consists of the VCO, the VCO bias, the
LDO regulator and clock buffers. VDD DIG is for the synthesized digital circuit and
VDD IO is for the IO drivers from the TSMC standard library and intermediate buffers to
the output pads for low-speed digital signals for debugging (also contains heavily down-
sampled analog signals such as the comparator and the DFE outputs).
Since the adaptation engine implemented in digital is designed to turn off automatically
after convergence, VDD DIG power is omitted from the total power consumption in
normal operation, although the power consumed by VDD DIG (6.3 mW) is still reported.
Likewise, the 2.7 mW of VDD IO is omitted, as the IO drivers and buffers were only present for
the testchip’s debugging purposes. Figure 4.26 shows the measured power consumption with 35

Figure 4.26: Measured power consumption with 35 Gb/s PRBS31 input with CDR lock

Gb/s PRBS31 input while the CDR is phase locked and error free. The total power
consumption is 106.3 mW, which is 3.04 pJ/bit. Most of the power is consumed by the
VCO to bring down the phase noise. A ring VCO’s phase noise improves by 3 dB with
every doubling of the current consumption. Extra power was spent in the VCO to lower
the risk of the CDR not locking due to poor phase noise, especially due to inductance from
the package wirebonds. Therefore, if less power were spent on the VCO by trading off
phase noise margin, the total power consumption of the CDR could have been improved
drastically, with a better figure of merit (FOM) in terms of pJ/bit.
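The figure of merit and the phase-noise trade-off quoted above can be reproduced with a few lines of arithmetic. The sketch assumes the 3 dB-per-doubling rule stated in the text, which corresponds to a 10·log10 dependence on current; the function names are illustrative.

```python
import math

def energy_per_bit_pj(power_mw, data_rate_gbps):
    """FOM in pJ/bit: mW divided by Gb/s gives pJ/bit directly."""
    return power_mw / data_rate_gbps

def phase_noise_change_db(current_ratio):
    """Phase-noise change for a ring VCO when its current is scaled by
    current_ratio, assuming 3 dB improvement per doubling (10*log10 rule)."""
    return -10.0 * math.log10(current_ratio)

fom = energy_per_bit_pj(106.3, 35.0)
doubling = phase_noise_change_db(2.0)   # negative means lower (better) phase noise
halving = phase_noise_change_db(0.5)    # positive means phase noise degrades

print(f"FOM = {fom:.2f} pJ/bit")
print(f"2x current: {doubling:+.2f} dB, 0.5x current: {halving:+.2f} dB")
```

Under this rule, halving the VCO current would cost about 3 dB of phase noise, which is the margin that was deliberately not traded away in this testchip.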
Finally, Figure 4.27 compares the performance of the proposed work to prior works.
This is the first on-chip, live adaptation engine tailored for a baud-rate CDR where
the comparators are shared between the DFE and the PD to save power. This figure
concludes the discussion of the adaptive baud-rate CDR with CTLE and 1-tap DFE.

Figure 4.27: Performance comparison to prior work for the same CDR architecture
Chapter 5

Proposed 2x Half-Baud-Rate CDR

This chapter presents the details of the second baud-rate CDR design that was taped out
in TSMC 28nm technology. The proposed design #2 is a 2x half-baud-rate CDR with
CTLE and data decoder. This testchip consists of an analog CDR, with a digital BERT
being the only synthesized digital block incorporated via a place & route tool.

5.1 Background

Background to the proposed 2x half-baud-rate scheme will be discussed in this section.
First, some prior architectures in clock and data recovery will be
introduced.

5.1.1 Alexander 2x-oversampled Bang-Bang PD

Robustness is a crucial aspect of building a receiver for an I/O link. The Alexander (2x-
oversampled) bang-bang phase detector (BBPD), where the data is sampled twice, at the
center and the edge, has been prominent in clock and data recovery due to its robustness
and simple hardware implementation [4]. Figure 5.1 illustrates the simple hardware involved
in the Alexander 2x-oversampled BBPD. The basic operation is as follows. If the
previous data Dn and edge En are the same, then the clock is early. If the next data
Dn+1 and edge En are the same, then the clock is late. Since this is a bang-bang PD,
theoretically, the output should be totally non-linear. However, the
inevitable jitter present in real life linearizes the PD characteristic, where the slope, or the PD
gain, is a function of the σ of the jitter.
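The early/late decision described above can be written as a small truth-table function. This is a behavioral sketch only; the +1/-1/0 output encoding and the function name are assumptions made for illustration, not the hardware of Figure 5.1.

```python
# Behavioral sketch of the Alexander bang-bang PD decision.
# Output encoding (assumed): +1 = early, -1 = late, 0 = no transition.

def alexander_pd(d_prev, edge, d_next):
    """d_prev = Dn, edge = En (sampled between the two data bits), d_next = Dn+1."""
    if d_prev == d_next:
        return 0          # no data transition, so no timing information
    if edge == d_prev:
        return +1         # edge sample still shows the previous bit: clock is early
    return -1             # edge sample already shows the next bit: clock is late

decisions = {
    (0, 0, 1): alexander_pd(0, 0, 1),
    (0, 1, 1): alexander_pd(0, 1, 1),
    (1, 1, 0): alexander_pd(1, 1, 0),
    (1, 0, 0): alexander_pd(1, 0, 0),
    (1, 1, 1): alexander_pd(1, 1, 1),
}
print(decisions)
```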

Chapter 5. Proposed 2x Half-Baud-Rate CDR 56

Figure 5.1: Schematic of Alexander 2x-oversampled bang-bang PD and its basic operation

Figure 5.2: Visual example of the lock point of Mueller-Muller PD [25]

5.1.2 Mueller-Muller Baud-Rate PD

Despite the robustness and simple hardware of the Alexander 2x-oversampled
BBPD, the recent trend has shifted towards baud-rate phase detectors as a means of reducing
power consumption by sampling only once per UI [36, 9, 7, 11, 14, 38]. However, it is
apparent from prior works [9, 7] that the Mueller-Muller phase detector (MMPD), which
is a popular baud-rate PD, is sensitive to equalization and to symmetry in the
pulse response [25]. The MMPD’s lock point is at the middle of a symmetric pulse, where
the pre-cursor equals the post-cursor, as shown in Figure 5.2. As a result, if the pulse
response is not perfectly symmetric, the lock point will not be at the peak of the pulse
response. Also, the MMPD is only functional for uncorrelated random data and cannot lock
to an alternating 0101 pattern. These disadvantages of the MMPD will be compared to
the proposed 2x half-baud-rate scheme in a later section.

5.1.3 Sub-Baud-Rate Clock and Data Recovery

Following the trend of opting for lower power consumption by reducing the number
of samples per UI, the most intuitive solution would be to take a baud-rate scheme and
sample it even less frequently. A half-baud-rate scheme, shown in Figure 5.3,
where the data is sampled every other UI (0.5x-sampled), would potentially lower the
power consumption. Whenever the data is sampled every other UI, information about
the previous bit needs to be recovered, as illustrated in Figure 5.4. For a system with
only one significant post-cursor ISI, four distinct data levels exist: (h0+h1), (h0-h1), (-

Figure 5.3: Half baud-rate data sampling

h0+h1), (-h0-h1), where h0 is the magnitude of the main cursor and h1 is the post-cursor ISI,
as illustrated in the bottom portion of Figure 5.4. Samplers can be placed at appropriate
threshold levels (+/-Vref and 0) to recover the data for the unsampled UI, thus recovering
2 bits of data for each sample. Figure 5.5 illustrates the threshold levels with which the
CDR could yield error-free data recovery for any sequence of 2-bit data (dn−1, dn):
(0,0), (0,1), (1,0), (1,1). However, the green arrows in Figure 5.5 highlight the theoretical
maximum horizontal eye opening of 0.5 UI on the left and the vertical eye opening margins
on the right. A small vertical eye opening translates into a poor noise margin.
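The four-level picture can be checked numerically with a small sketch. It assumes bits mapped to ±1 symbols, h0 > h1 > 0, and a reference placed at Vref = h0 (midway between the two levels that share the same current bit); that placement, the example cursor values and the function names are assumptions for illustration and are not taken from Figure 5.5.

```python
# Sketch: recover (d_prev, d_cur) from a single sample with one post-cursor.
# Assumptions: symbols are +/-1, h0 > h1 > 0, and v_ref = h0.

def sample_level(d_prev, d_cur, h0, h1):
    """Sampled voltage for bits in {0, 1}: main cursor plus post-cursor ISI."""
    return (2 * d_cur - 1) * h0 + (2 * d_prev - 1) * h1

def decode_two_bits(v, v_ref):
    """Slice at 0 for the current bit, then at +/-v_ref for the previous bit."""
    d_cur = int(v > 0)
    threshold = v_ref if d_cur else -v_ref
    d_prev = int(v > threshold)
    return d_prev, d_cur

H0, H1 = 1.0, 0.25            # hypothetical cursor values
levels = {(p, c): sample_level(p, c, H0, H1) for p in (0, 1) for c in (0, 1)}
decoded = {bits: decode_two_bits(v, H0) for bits, v in levels.items()}

outer_eye = 2 * H1            # spacing between levels that share the current bit
inner_eye = 2 * (H0 - H1)     # spacing between the two levels around zero
print(levels, decoded, outer_eye, inner_eye)
```

With these example values the outer eyes are only 2·h1 tall, which illustrates the small vertical margin noted above: the scheme relies on the ISI term itself for its noise margin.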
Despite this, half-baud-rate operation is theoretically feasible for clock recovery as
well, even without an integrate & dump technique. Figure 5.6 illustrates that clock
recovery could be achieved by adding two additional samplers at +/-α on top of the
three samplers originally required solely for data recovery. The white circles at +/-α
indicate the lock points. In comparison to the Mueller-Muller baud-rate CDR, this
would still require fewer comparators (samplers), even for a quarter-rate clocking
implementation. A total of 12 comparators would be required for an MM-CDR, whereas a
half-baud-rate (0.5x-sampled) CDR would only require 10 comparators. In addition,
since only every other UI is sampled, there would be a large power saving over the MM-CDR
in the clock distribution network as well. A half-baud-rate scheme without the need for
an integrate & dump technique sounds attractive in theory but is not very feasible in
practice due to poor noise and jitter margins.

5.2 Proposed 2x half-baud-rate scheme

As discussed in the previous subsection on the half-baud-rate (0.5x-sampled) scheme, due
to poor jitter and noise margins, it would be very difficult to prototype this in real life.
Most certainly, jitter and noise would prevent the CDR from being error-free at BER
< 10^-12. Therefore, the proposed CDR opts for a 2x half-baud-rate scheme where edge
sampling is added to the half-baud-rate scheme to make it baud-rate on average. Since

Figure 5.4: Sub baud-rate data recovery by exploiting ISI

Figure 5.5: Eye diagram example of sub baud-rate data recovery by exploiting ISI. Green arrows on
the left show theoretical maximum horizontal eye opening of 0.5UI. Green arrows on the right show the
small vertical eye opening margins

Figure 5.6: Sub baud-rate (0.5x-sampled) data and clock recovery in comparison to Mueller-Muller CDR

the data and edge are sampled every other UI, the 2x half-baud-rate PD is essentially a 2x
oversampling BBPD at half-baud-rate that locks to the edge. The advantages of edge locking
are discussed in this section. Figure 5.7 illustrates the full-rate block diagram of the
proposed 2x half-baud-rate CDR. Blocks highlighted in red are simple logic-gate circuits
required for the 2x half-baud-rate operation. The data decoder circuit is essential for
recovering the data of the UI that is not sampled at all, and this is done by exploiting the
inherent ISI present in the system.
Figure 5.8 illustrates an eye diagram corresponding to a channel with one significant
post-cursor ISI term, while all other ISI terms are assumed to be minimized by a
front-end equalizer. We sample one UI with three comparators at the edge phase φe, whose
outputs are labeled DL, ED, and DH, and with one comparator at the center phase φc, whose
output is labeled DM, while we skip sampling the following UI altogether. Instead, we
rely on ISI to recover the previous (unsampled) bit. In doing so, we perform 4 comparisons
in every other UI, or on average 2 comparisons per UI. By having the center and edge samples,
albeit in every other UI, this scheme inherits the benefits of a bang-bang PD (BBPD)
by locking to the edge, as will be demonstrated later. By skipping every other UI, the
proposed scheme shares the benefits of reduced hardware and low power consumption
with the baud-rate Mueller-Muller PD (MMPD).
We explain the phase detector (PD) and the data decoder (DD) logic by observing
samples from current UI (n). If at φe the data falls between +/-Vref , we conclude that
there is a data transition (0→1 or 1→0) at this phase and hence we will judge the
early/late by the output of the edge (ED) and the data (DM) comparators, similar to
a BBPD logic. If these two bits are identical, the clock is late; otherwise, it is early as
shown in the phase detector table of Figure 5.8.
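The phase detector decision described above can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the testchip logic: the function name phase_detect, the "hold" output for a UI without a transition, and the convention that every comparator outputs 1 when its input is above its threshold are assumptions made here. The authoritative truth table is the one in Figure 5.8.

```python
def phase_detect(dh: int, dl: int, ed: int, dm: int) -> str:
    """Return 'early', 'late', or 'hold' for one sampled UI.

    Assumed encoding (not stated in the text): each comparator outputs 1
    when the input is above its threshold. DH compares against +Vref,
    DL against -Vref, and ED and DM against the zero level.
    """
    # The data at the edge phase lies between -Vref and +Vref when it is
    # above -Vref (DL = 1) and below +Vref (DH = 0): a transition.
    transition = (dl == 1) and (dh == 0)
    if not transition:
        # Assumption: without a transition there is no phase information.
        return "hold"
    # BBPD-style decision: identical edge and data bits mean the clock is late.
    return "late" if ed == dm else "early"


print(phase_detect(dh=0, dl=1, ed=1, dm=1))
print(phase_detect(dh=0, dl=1, ed=0, dm=1))
```

Under these assumptions the first call reports a late clock (ED and DM agree) and the second an early clock (ED and DM differ).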

Figure 5.7: Full-rate block diagram of the proposed 2x half-baud-rate CDR architecture

The DD only needs to observe the outputs of the data comparators (DH, DL, and
DM) to decode the current and the previous bit. Similar to a 1-tap speculative DFE,
the DD recovers the unsampled UI by slicing the data eye at a threshold that is adjusted
depending on the previous bit sequence. If the outputs of all three comparators are 0,
Dn−1 and Dn are both 0. Similarly, if the outputs of all three comparators are 1,
Dn−1 and Dn are both 1. If the data at φe falls between +/-Vref, it implies a transition
between Dn−1 and Dn. Therefore, by observing the sign of DM (which indicates Dn), we
can find Dn−1 = D̄n. The data decoder logic is also summarized in a table in Figure 5.8.
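The three decoding cases described above can be written out as a short Python sketch. The function name decode and the comparator encoding (output 1 when the input is above the threshold) are assumptions introduced for illustration. Input combinations that the text does not describe are returned as None rather than guessed; the complete table is given in Figure 5.8.

```python
from typing import Optional, Tuple


def decode(dh: int, dl: int, dm: int) -> Optional[Tuple[int, int]]:
    """Return (D[n-1], D[n]) from the data comparator outputs, or None.

    Assumed encoding: DH = 1 if the edge-phase input is above +Vref,
    DL = 1 if it is above -Vref, DM = 1 if the center-phase input is
    above zero.
    """
    if (dh, dl, dm) == (0, 0, 0):
        return (0, 0)  # all comparators low: both bits are 0
    if (dh, dl, dm) == (1, 1, 1):
        return (1, 1)  # all comparators high: both bits are 1
    if dh == 0 and dl == 1:
        # Input between -Vref and +Vref: a transition, so D[n-1] = not D[n],
        # and DM gives D[n].
        return (1 - dm, dm)
    return None  # combination not covered by the description in the text


print(decode(0, 1, 1))
print(decode(0, 1, 0))
```

With a transition detected, DM = 1 decodes to a 0-to-1 transition and DM = 0 to a 1-to-0 transition.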
Although the eye diagram of the proposed 2x half-baud-rate scheme may look similar
to duobinary signalling, the proposed scheme has some advantages. The disadvantage
of duobinary is that a precoder and a decoder are required, and the precoder in particular
is not trivial, as stated in various prior works [40, 19]. The proposed 2x half-baud-rate
scheme does not need a precoder and operates on conventional NRZ signalling. In addition,
the proposed data decoder is simple hardware made up of digital logic gates, which is
efficient in terms of both power and area.
The proposed 2x half-baud-rate PD combines the robustness of a conventional 2x
oversampling BBPD with the power saving of an MMPD that samples at baud-rate. Both the MMPD
and the proposed 2x half-baud-rate PD in Figure 5.9 display a similar PD characteristic
over a 1UI period when properly tuned and equalized, as depicted by the black curves.
However, the MMPD degrades significantly as the equalization setting and the comparator level Vref

Figure 5.8: Eye diagram of the proposed 2x half-baud-rate scheme



Figure 5.9: Proposed 2x half-baud-rate PD compared to the conventional baud-rate Mueller-Muller PD

diverge from the optimal point. For instance, when an offset is present in the comparator
reference level +/-Vref, a dead zone forms for the MMPD, as shown in Figure 5.9 (first row).
Similarly, the second row of Figure 5.9 shows that when residual ISI exists due to poor
front-end equalization, a dead zone appears for the MMPD. It is apparent from the simulation
results that the proposed PD (similar to a BBPD) is much less sensitive to these two
settings.

5.2.1 System-level Behavioral Model

A system-level behavioral model was built in MATLAB Simulink for the quarter-rate
architecture, as shown in Figure 5.10. Similar to the behavioral models built for the first
design (the adaptive baud-rate CDR) in Section 3.7, both continuous-time and event-driven
models were created in MATLAB Simulink for the 2x half-baud-rate CDR. The details of
how the continuous-time and event-driven behavioral models were built are omitted, as
they are designed the same way with minor tweaks to some of the building blocks, such
as the PD and the addition of the data decoder.
Using the event-driven model of the 2x half-baud-rate CDR, jitter tolerance was
simulated with the injection of sinusoidal jitter. The jitter tolerance simulation was
programmed in MATLAB to use a binary search algorithm with 5 iterations from the initial
search points (red line) shown in Figure 3.23. The high-frequency jitter tolerance is 0.5

Figure 5.10: Proposed quarter-rate implementation of 2x half-baud-rate CDR. Proposed 2x
half-baud-rate PD and the data decoder are simple custom high-speed digital logic gates

Figure 5.11: Jitter tolerance simulated for event-driven model of 2x half-baud-rate CDR (BER < 10^-6)

UIpp and the minimum jitter tolerance is 0.394 UIpp. These jitter tolerance values are for
BER < 10^-6 because of simulation time; some degradation is therefore expected at BER < 10^-12.
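A binary search of this kind can be sketched as follows. This is a minimal Python illustration and not the MATLAB script used for the thesis: the function jitter_tolerance, the pass/fail callback, the 0 to 1 UIpp starting bracket, and the 0.4 UIpp threshold of the stand-in model are all hypothetical. In the actual flow, the pass/fail decision would come from running the event-driven model at the BER target.

```python
from typing import Callable


def jitter_tolerance(passes: Callable[[float], bool],
                     lo: float, hi: float, iterations: int = 5) -> float:
    """Binary-search the largest jitter amplitude (UIpp) that still passes.

    lo is an amplitude assumed to pass and hi an amplitude assumed to fail.
    Each iteration halves the bracket [lo, hi].
    """
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        if passes(mid):
            lo = mid   # still error-free: search higher amplitudes
        else:
            hi = mid   # errors observed: search lower amplitudes
    return lo          # largest amplitude known to pass


# Hypothetical stand-in for one closed-loop simulation run at one jitter
# frequency: the CDR "passes" up to an arbitrary 0.4 UIpp.
def toy_cdr_passes(amplitude_uipp: float) -> bool:
    return amplitude_uipp <= 0.4


result = jitter_tolerance(toy_cdr_passes, lo=0.0, hi=1.0)
print(result)  # 0.375 for this toy model
```

With 5 iterations the bracket shrinks by a factor of 32, so starting from a 1 UIpp bracket the result is resolved to about 0.03 UIpp. This is one reason the reported tolerance values should be read as quantized estimates.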

5.3 Circuit Implementation & Simulations

The circuit implementation (schematic design) in Cadence and its simulation results are
presented in this section. The quarter-rate circuit implementation follows Figure 5.10
so that the behavioral model and the schematic in Cadence match exactly.

5.3.1 Analog Design

The proposed 2x half-baud-rate scheme is implemented on a PLL-style, fully analog CDR.
The analog design also shares many components with the proposed adaptive baud-rate
CDR from the previous chapter. The same two-stage CTLE is used, with a tunable source
degeneration resistor and capacitor with 4-bit controls each.
The output of the CTLE is sampled by a total of 8 double-tail latch comparators from [31],
the same comparator used in the proposed adaptive baud-rate CDR. Since the
2x half-baud-rate CDR samples every other UI, only two comparators each at +Vref and -Vref
are required for a quarter-rate implementation. For the zero-level comparators, two extra
comparators are required for edge sampling on the φe phase for clock recovery, for a total of
four. One difference in the zero-level comparators at φe and φc is that the dual-difference
scheme is removed, so that instead of the input being compared to a threshold, it is

compared to the plus and minus polarities of the input signal itself, hence, effectively the
zero-level.
The two critical circuit blocks highlighted by red boxes in Figure 5.10, namely the
2x half-baud-rate PD and the data decoder, are designed using custom high-speed digital logic
gates operating at 7.5 Gb/s in accordance with the truth table in Figure 5.8. These blocks,
made up of digital logic gates, are extremely simple and consume very little
power.
The CDR loop remains a higher-order type-II PLL with an RC loop filter for the proposed
2x half-baud-rate CDR. While the charge pump and the loop filter are the same as in
the proposed adaptive baud-rate CDR, the 8-stage ring VCO is tuned to a slightly lower
centre frequency to account for the absence of a DFE: without a DFE, the
proposed 2x half-baud-rate CDR cannot tolerate the same data-rate or the same
attenuation. The quarter-rate VCO clock is divided down by a factor of eight for use
by the digital BERT. Similarly, in the data path, the output of the data decoder is demuxed
to produce 32 parallel data signals for the digital BERT. This digital BERT gives the true
error rate, as opposed to checking one of the demuxed data paths in the analog domain.
A demuxed version of a PRBS is guaranteed to be a PRBS, but not the other way around.
It therefore cannot be assumed that, because one demuxed data path is
error-free, the interleaved version of all parallel paths is error-free. As a result, the digital
BERT is essential for obtaining the true bit error rate and is used for the BER
measurement of the testchip.
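The point about demuxed and interleaved PRBS streams can be illustrated with a small Python experiment. This is only a sketch and not the Verilog BERT: it uses PRBS7 with the polynomial x^7 + x^6 + 1 for brevity, and the helper names prbs7, prbs7_errors, demux and interleave are hypothetical. The bus-order reversal stands in for any lane-ordering mistake between the demux and the checker.

```python
N_LANES = 32  # number of parallel paths fed to the digital BERT


def prbs7(n_bits, seed=0x7F):
    """Generate n_bits of PRBS7 (x^7 + x^6 + 1) as a list of 0/1 values."""
    state = seed & 0x7F
    bits = []
    for _ in range(n_bits):
        new_bit = ((state >> 6) ^ (state >> 5)) & 1
        bits.append(new_bit)
        state = ((state << 1) | new_bit) & 0x7F
    return bits


def prbs7_errors(bits):
    """Count positions that violate b[n] = b[n-6] XOR b[n-7]."""
    return sum(bits[n] != (bits[n - 6] ^ bits[n - 7])
               for n in range(7, len(bits)))


def demux(bits, n_lanes=N_LANES):
    """Split a serial stream into n_lanes parallel streams."""
    return [bits[i::n_lanes] for i in range(n_lanes)]


def interleave(lanes):
    """Recombine parallel lanes into one serial stream, lane 0 first."""
    return [lane[k] for k in range(len(lanes[0])) for lane in lanes]


stream = prbs7(N_LANES * 127)
lanes = demux(stream)

lane_errors = [prbs7_errors(lane) for lane in lanes]     # each lane on its own
good_errors = prbs7_errors(interleave(lanes))            # correct lane order
flipped_errors = prbs7_errors(interleave(lanes[::-1]))   # reversed bus order

print(max(lane_errors), good_errors, flipped_errors > 0)
```

Every individual lane passes the PRBS7 check, and so does the correctly interleaved stream, but the stream interleaved in the wrong lane order does not. Checking a single demuxed lane would therefore miss this kind of fault, which is why the BERT checks the interleaved data.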

5.3.2 Closed-loop CDR Simulations

The closed-loop CDR simulation results with post-layout extraction
are discussed in this subsection for the proposed 2x half-baud-rate CDR.
Figure 5.12 illustrates that the CDR is error-free for all parallel paths of the data after
phase lock. To ensure that the interleaved data is also a PRBS, a test with a PRBS7
was conducted to verify that the data are still a PRBS7 when the parallel
recovered data are interleaved and combined manually. For a PRBS31, the synthesized
digital BERT written in Verilog can be used to check that the CDR is error-free when
measured in the lab after fabrication.
Figure 5.13 illustrates the frequency of the recovered clock. It dithers around
an average value of 7 GHz (quarter-rate) for 28 Gb/s PRBS31 data. Figure 5.14 shows
Vctrl, the control voltage going into the VCO. It fluctuates after the CDR
is locked since the PD periodically dithers between early and late. The peak-to-peak
amplitude of Vctrl matched the Simulink simulation result very closely, confirming a close
resemblance between the behavioral model and the circuit simulation.

Figure 5.12: Simulated results of closed loop CDR at 32 Gb/s PRBS31 input in Cadence: Error count

Figure 5.13: Simulated results of closed loop CDR at 32 Gb/s PRBS31 input in Cadence: Recovered
clock frequency

Figure 5.14: Simulated results of closed loop CDR at 32 Gb/s PRBS31 input in Cadence: Vctrl of VCO

Figure 5.15: Open-cavity QFN under a microscope showing wire bond connections for the proposed 2x
half-baud-rate CDR

5.3.3 Digital Design

The digital BERT from Figure 5.10 is the only circuit that is synthesized and placed &
routed. The digital BERT takes in 32 parallel down-sampled recovered data signals as its input
and interleaves them to check that they still form an error-free PRBS pattern. Errcnt[19:0] is
the total error count, err is the bit error for every clock cycle, and erronce is a flag that
stays high if there is at least one bit error after the BERT is enabled. Because the
AMS (analog/mixed signal) verification tool was not set up for the TSMC 28nm design
kit, the interface between the analog CDR and the digital BERT was never simulated. Since
the analog CDR locked with post-layout extraction and the digital BERT was tested
separately in ModelSim and NCSim throughout the digital design stages, the chance of
the interface breaking was minimized. In addition, the digital BERT has the option to
flip the LSB and MSB order of the input data in case the bus ordering at the analog/digital
interface does not match.

5.4 Lab Measurements

In this section, measured results of the testchip from the lab will be presented.

Figure 5.16: Package Pinout for D2: Non-uniform baud-rate CDR with CTLE

Figure 5.17: Die photo in TSMC 28nm HPC process for the proposed 2x half-baud-rate CDR. Dimensions
of each building block are listed in a table.

5.4.1 Testchip

The testchip of the 2x half-baud-rate CDR with CTLE was fabricated in TSMC 28nm HPC
CMOS technology with a 0.9V supply. The testchip die was packaged in an open-cavity
QFN so that the high-speed input could be probed. Figure 5.15 shows the packaged die
under the microscope with the wire bond connections, and Figure 5.16 is the package
pinout instruction sent to the packaging company. Figure 5.17 shows a closer view of the
die with all the major building blocks highlighted. The testchip measures 1.57
mm wide by 0.785 mm high, for a total die area of 1.232 mm^2, of which the
building blocks of the CDR occupy only 0.135 mm^2. The following subsection
explains the test setup for the testchip.

5.4.2 Test Setup

Figure 5.19 illustrates the test setup for normal operation of the 2x half-baud-rate
CDR. The packaged testchip was soldered onto a PCB, shown in Figure 5.18,
which is programmed via an Arduino Mega2560 connected to a PC. Figure 5.15 depicts the QFN

Figure 5.18: High-speed testboard for design 2: non-uniform baud-rate CDR testchip. Testboard is
programmed and controlled by Arduino Mega2560 + PC

package under a microscope. High-speed probes rated for 40G were used to probe the
high-speed PRBS input data. The SHF 12104A bit pattern generator was used to generate
both PRBS7 and PRBS31. The input data then passes through the Tyco 5” channel using
36” SMA cables and a 40G bias tee before reaching the SGS
probe head. An SGS probe had to be used instead of a GSGSG probe due to the limited clearance
between wirebonds. The channel loss through this setup, shown in Figure 5.20,
was measured using the Agilent N5222A PNA microwave network analyzer.
For the output recovered clock, CK/16 was observed instead of probing the high-speed
quarter-rate clock because there was not enough clearance between the wirebonds to land
the probes. The clock spectrum and the phase noise of this low-speed, divided-down version
of the recovered clock (CK/16) were observed using the Rohde & Schwarz FSWP26 phase noise
analyzer and VCO tester. For low-speed (250-500 Mb/s) or static digital signals, the
Agilent DSAX91604A Infiniium 16GHz real-time digital storage oscilloscope was used.
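As a quick sanity check on where the observed clock should appear, the expected CK/16 frequency can be computed from the data-rate. The short Python sketch below assumes that CK is the quarter-rate recovered clock, as the paragraph above suggests; the function name divided_clock_mhz and its parameters are hypothetical.

```python
def divided_clock_mhz(data_rate_gbps, clock_fraction=4, divider=16):
    """Frequency in MHz of a 1/clock_fraction-rate clock divided by `divider`."""
    return data_rate_gbps * 1e3 / clock_fraction / divider


print(divided_clock_mhz(30))  # 468.75 MHz for 30 Gb/s operation
print(divided_clock_mhz(28))  # 437.5 MHz for the 28 Gb/s design target
```

Under this assumption both values fall within the 250-500 MHz range, which is consistent with the low-speed instruments listed above.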

5.4.3 Measurement Results

All measurements presented in this subsection were done with the setup shown in Figure
5.19(a), using the Tyco 5” channel with 13.06 dB loss at Nyquist for 30 Gb/s.
Initially, the VCO’s frequency is manually tuned to 30 Gb/s for an

Figure 5.19: Measurement setup for testing 2x half-baud-rate CDR, panels (a) and (b)



Figure 5.20: Measured S21 insertion loss of Tyco 5” channel with 36” cables and bias tees using a VNA. Bias
tees are required for setting the input common-mode as the SHF 12104A PRBS bit pattern generator
cannot apply a voltage offset to set the common-mode.

open-loop CDR as shown in Figure 5.21. Figure 5.22(a) illustrates the measured clock
spectrum of the divided recovered clock (CK/16) for PRBS31 when the CDR is locked.
The locked clock spectrum exhibits a skirt characterized by the loop dynamics, or the
bandwidth, of the CDR. The integrated jitter from the phase noise plot is 823.5 fs for
PRBS31. The recovered clock spectrum for PRBS7 and its phase noise plot are shown in
Figure 5.23. The integrated jitter for PRBS7 is lower, as expected, at 731.8 fs.
In addition, the capture range was measured to be -2300ppm to +66000ppm. The
larger range in the positive direction is due to the asymmetric nature of the 2x
half-baud-rate PD logic, where the data sample always follows the edge sample, not the other way
around. This property makes frequency acquisition available for free in one direction
without adding any feedback loop to the CDR. In other words, in the positive
direction, where the incoming data is faster than the CDR's initial VCO frequency, the PD
is able to pull the VCO frequency up by as much as +66000ppm (equivalently about 2Gb/s)
to reach frequency lock while also tracking the phase to achieve phase lock.
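The equivalence between the ppm figures and a data-rate offset follows directly from the 30 Gb/s operating point. A minimal Python check is shown below; the function name ppm_to_rate_offset_gbps is hypothetical and the 30 Gb/s reference is taken from the measurement setup.

```python
def ppm_to_rate_offset_gbps(ppm, data_rate_gbps=30.0):
    """Convert a frequency offset in ppm into a data-rate offset in Gb/s."""
    return data_rate_gbps * ppm * 1e-6


print(ppm_to_rate_offset_gbps(66000))  # about +1.98 Gb/s, i.e. roughly 2 Gb/s
print(ppm_to_rate_offset_gbps(-2300))  # about -0.069 Gb/s, i.e. roughly -69 Mb/s
```

The positive capture range is therefore almost 30 times wider than the negative one, which reflects the one-directional frequency acquisition described above.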
The measured jitter tolerance with sinusoidal jitter injected at the input bit pattern
generator is shown in Figure 5.24. The jitter tolerance curves for both PRBS31 and PRBS7
pass the IEEE 802.3 masks, although PRBS31 passes only marginally. The proposed CDR
was originally designed for 28 Gb/s; however, after fabrication the VCO’s tuning range
was shifted up, possibly because of a process shift toward the FF corner, so the VCO’s
frequency could not be brought down to 28 Gb/s even after adjusting the VCO’s supply voltage and
the bias tail current. As a result, the measured jitter tolerance at 30 Gb/s is not as high,
since the CDR was never designed to operate at that speed.
Figure 5.25 illustrates the power breakdown per block. The total power consumption
measured is 79.2 mW and the FOM is 2.64 pJ/bit (at 30 Gb/s). It is clear that the
VCO and the clocking have been over-designed for phase noise; there is therefore room for
improvement in terms of power and FOM. Omitting the VCO and the clocking power,
only 25 mW is consumed, which is very low. Finally, the table in Figure 5.26
compares the performance of the proposed 2x half-baud-rate CDR to recently published
baud-rate CDRs. This work is the first reported 2x half-baud-rate CDR that performs 2x
oversampling at half-baud-rate, hence sampling every other UI and locking to the edge.
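The FOM values quoted in this thesis follow from dividing the measured power by the data-rate. The short Python check below reproduces them; the function name fom_pj_per_bit is hypothetical, and the last figure (the FOM excluding VCO and clocking power) is derived here from the 25 mW number rather than reported as a measurement.

```python
def fom_pj_per_bit(power_mw, data_rate_gbps):
    """Energy per bit: mW / (Gb/s) = 1e-3 W / 1e9 b/s = 1e-12 J/bit = pJ/bit."""
    return power_mw / data_rate_gbps


print(round(fom_pj_per_bit(79.2, 30), 2))   # 2.64 pJ/bit, 2x half-baud-rate CDR
print(round(fom_pj_per_bit(106.3, 35), 2))  # 3.04 pJ/bit, adaptive baud-rate CDR
print(round(fom_pj_per_bit(25.0, 30), 2))   # 0.83 pJ/bit without VCO and clocking
```

The last line gives a rough indication of how much of the energy per bit is spent in the VCO and clock distribution rather than in the samplers and logic.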

Figure 5.21: Measured clock spectrum of an open-loop CDR



Figure 5.22: Measured clock spectrum and phase noise for locked CDR at 30 Gb/s (PRBS31): (a) clock
spectrum, (b) phase noise

Figure 5.23: Measured clock spectrum and phase noise for locked CDR at 30 Gb/s (PRBS7): (a) clock
spectrum, (b) phase noise

Figure 5.24: Measured jitter tolerance with sinusoidal jitter injected at 30 Gb/s for PRBS31 and PRBS7

Figure 5.25: Power breakdown of 2x half-baud-rate CDR testchip



Figure 5.26: Performance comparison to recently published baud-rate CDRs


Chapter 6

Chip Design Methodology

In a single tapeout shuttle run, two separate wireline receiver chips with two separate
CDRs were designed and fabricated for this research. Meeting the
tapeout deadline for even a single CDR design is quite onerous due to the sheer size
of the CDR as a system. Carrying out two sets of system-level designs and simulations,
schematic designs and simulations, digital designs and verification, and layout designs is
even more time-consuming. This chapter describes the design methodology that allowed
two separate CDRs to be designed within a very tight tapeout time-frame. Furthermore,
the advanced layout techniques that were implemented are discussed.

6.1 Behaviour Model Methodology

As discussed in Section 5.2.1, the two CDR designs share most of the continuous-time and
event-driven behaviour models. When designing the second CDR, most of the building
blocks were recycled with some minor tweaks, such as the new PD and the data decoder.
For the second CDR design, a half-rate clocking architecture was preferred over the
quarter-rate clocking architecture, but due to the tight tapeout schedule, the quarter-rate clocking
scheme was re-used from the initial design stage when building the behaviour model so
that completely new continuous-time and event-driven models did not have to be built.

6.2 Schematic & Layout Design Methodology

The schematic and the layout were designed to be shared and re-used for both CDR
designs. Many of the circuit components, such as the CTLE, the charge pump, the
loop filter, and much of the biasing circuitry, are shared. Even circuits that differ,
such as the comparators and the VCOs tailored to each CDR’s needs, were only slightly
modified instead of being fully re-designed. Furthermore, filler cells, decoupling capacitor

Chapter 6. Chip Design Methodology 81

Figure 6.1: Layout of the full-chip die with two CDRs on top & bottom

cells, IO pads, and the power grid were designed to be shared and used for both CDR
designs. Figure 6.1 illustrates that the top CDR is a mirror of the bottom CDR with
tweaks made to the building blocks and routing. Without sharing many of the common
blocks, the tapeout deadline could not have been met for such a large layout, with a 1.57mm
by 1.57mm die area in a 28nm technology. Figure 6.2 reveals the top aluminum (AP)
layer used to distribute power vertically to the metal8 layer below. This figure clearly
illustrates that the power grid, output pads and the IO were also shared and re-used for
both CDR designs.

6.3 Advanced Layout Techniques & Considerations

6.3.1 Matching

The matching of the double-tail latch clocked comparators in layout was critical, as a
mismatch in device properties such as the threshold voltage (Vt) can erode the timing

Figure 6.2: Layout of the full-chip die showing top aluminum layer for power distribution

and noise margins. Therefore, although offset calibration was built for the comparators,
the comparator layout was done with meticulous care. First, common centroid layout
techniques [22, 21] were used for sets of four quarter-rate comparators. As a result, any linear
gradients in the die tend to cancel out. MOSFETs are affected by gradients in etching,
in Vt, and in oxide thickness. All capacitor and resistor arrays also incorporated common
centroid layout to enhance the relative matching between them. In addition to the
common centroid layout technique, the quarter-rate comparators were interconnected
using a symmetric, RC-distributed H-tree method [30] for routing data and clocks to
minimize skew and delay offset. A disadvantage of the H-tree method is that it is more
heavily loaded by parasitic capacitance from the extra routing metal, which causes a larger
absolute delay and requires larger clock buffers. However, timing margin is gained from
the lower skew between the quarter-rate comparators, which is absolutely critical for the
CDR’s front-end.
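The cancellation of a linear gradient by a common centroid arrangement can be illustrated numerically. The Python sketch below is a deliberately simplified one-dimensional model and not the actual comparator layout: two devices A and B are each split into two unit cells placed along one axis, a parameter such as Vt is assumed to vary linearly with position, and the function names and the unit gradient are hypothetical.

```python
def mean_param(layout, device, gradient=1.0, offset=0.0):
    """Average parameter value of one device over its unit-cell positions.

    layout is a string such as "ABBA"; the parameter of the cell at index p
    is modelled as offset + gradient * p (a purely linear gradient).
    """
    positions = [p for p, name in enumerate(layout) if name == device]
    return sum(offset + gradient * p for p in positions) / len(positions)


def mismatch(layout, gradient=1.0):
    """Difference between the average parameters of devices A and B."""
    return mean_param(layout, "A", gradient) - mean_param(layout, "B", gradient)


print(mismatch("ABBA"))  # 0.0: both devices share the same centroid
print(mismatch("AABB"))  # -2.0: side-by-side placement sees the full gradient
```

In the "ABBA" arrangement the cells of A and B have the same average position, so any linear gradient contributes equally to both and drops out of the difference. Nonlinear gradients and random mismatch are not captured by this model.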
In addition to the common centroid and H-tree interconnect techniques discussed
above, another technique used in this chip design is “interdigitation” [6], found in
analog differential pairs. Interdigitation was mainly implemented in CML circuits such
as the CTLE and the ring oscillator delay stages in the VCO. Interdigitation lowers the
device mismatch between M1 and M2 of the input diff-pairs and also helps to cancel out
linear gradients in the die, similar to common centroid.

6.3.2 Design for Electromigration (EM) & IR drop

This subsection discusses the design methodology for EMIR (electromigration and
IR drop). During the physical layout design, the maximum current that a metal wire of a
given width can carry was taken into account in order to pass electromigration (EM)
requirements. The maximum current values (Imax) were found in the design rule check
document for the TSMC 28nm process. In order to meet the allowable maximum current
density for CML analog blocks, multiple fingers had to be used instead of a device with
a larger width (W). By increasing the number of fingers, more metal tracks are available
for the source/drain to meet EM rules.
In terms of IR drop, a full mesh power grid with via stacks was incorporated to
minimize IR drop on the supply. For example, the aluminum (AP) layer of the power
grid was routed vertically and the metal8 (M8) layer below was routed horizontally to form
a power mesh from top to bottom, all the way down to the transistors.
Furthermore, to improve EM, staggered output pads can be seen in Figure 6.2 on some
power supply pads that sink a lot of current, such as those of the VCO. Staggering allows
more pads to be placed, which can then be wirebonded when packaging the chip. To improve EM
even further and to reduce pad inductance, double bonding was applied
when wire bonding to the QFN package.

6.3.3 Other Layout Considerations

Several other layout considerations are discussed in this subsection. First, for signal
integrity, all high-speed clocks that were routed over a long distance were shielded.
Shielding the clock routes with VSS ground metal traces reduces the mutual inductance and
capacitive cross-talk on the high-speed quadrature clock routes.
ESD was also considered during the layout stage of the chip design. ESD diodes and
secondary ESD diodes were added to protect the gates from signals coming into the input
pads. ESD clamps from the TSMC ESD library were placed between power and ground to
properly clamp the supplies during an electrostatic discharge.
In order to prevent any latch-up, an adequate number of n-taps and p-taps were placed
in the layout. For custom analog layouts, all the MOS devices were laid out in such a
way that each device always sees the same environment, including the n-taps/p-taps. For
example, if there were many rows of NMOS devices, p-taps that connect the substrate
to ground were placed between every row as well as at the outer edges. This way,
every row of NMOS devices sees the same distance to the p-taps above and below it.
PMOS devices must reside in an n-well, which is better isolated
in terms of noise than the p− substrate used for the NMOS devices. The deep n-well
(DNW) layout technique was used to provide noise isolation for the NMOS devices inside
isolated p-wells. In addition, DNW had to be used to provide ground isolation between
the digital ground (vss_dig) and the analog ground (vssa), since they cannot share the same
p-substrate without being shorted. The different ground
domains were shorted together only at the PCB level.

6.4 Place & Route Digital Implementation Methodology

During tapeout, a digital adaptation engine was designed for both baud-rate CDR
designs. However, for the second CDR design from Chapter 5, the adaptation engine did
not work on the testchip after fabrication, which is why its details are omitted from this
thesis. Nonetheless, during the chip design process, even the digital flow from RTL
synthesis to place & route was shared between the two CDR designs. The digital flow
script only had to be modified to point to the different Verilog files for the respective digital
adaptation schemes. Furthermore, the area allocated for the digital block after place &
route in the layout was the same as well. Therefore, once the layout was streamed into
Cadence Virtuoso, it fit perfectly in both CDR layouts. The pin coordinates were also

preserved during the digital flow for P & R; thus, the routing to the inputs and outputs of the
P & R layout was shared between the two CDRs in order to save valuable time.
Chapter 7

Conclusion

In summary, this thesis began by motivating the need for a baud-rate CDR with the
goal of reducing power consumption. In Chapter 2, the background to the first design
(adaptive baud-rate CDR) was presented. Chapter 3 followed up with the details of the
proposed adaptation engine. Chapter 4 shared the simulated and measured results. For the
second design (2x half-baud-rate CDR) presented in Chapter 5, the background, the
proposed 2x half-baud-rate scheme and the measured results were all self-contained within
the same chapter. In addition, Chapter 6 gave insight into how the two chips were
designed simultaneously in an efficient manner, as well as some advanced layout
techniques incorporated in the testchips.

7.1 Thesis Contribution

The contributions from each of the two baud-rate CDR designs are summarized in
this section.
The first contribution of this thesis is the proposed adaptive baud-rate CDR with
CTLE and 1-tap DFE. The novelty in this design is the adaptation engine tailored for
baud-rate clock and data recovery where the comparators for the DFE and the PD are
shared to save power. A testchip was fabricated in TSMC 28nm HPC CMOS technology
with a 0.9 V supply. The adaptation engine is demonstrated for 34-36 Gb/s operation
with a Tyco 5” channel resulting in 15.05-18.25 dB channel losses. Measurement in the
lab demonstrated that the testchip is able to pass the IEEE 802.3 jitter tolerance masks
for the mentioned channel losses. At 35 Gb/s, the total power consumption is measured
to be 106.3mW or a FOM of 3.04 pJ/bit. A paper that presents this 36Gb/s adaptive
baud-rate CDR has been submitted to ISSCC 2019.
The second contribution is the proposed 2x half-baud-rate clock and data recovery
technique using both data and edge samples every other UI (half-baud-rate) to lock at

Chapter 7. Conclusion 87

the edge. A testchip was also fabricated in TSMC 28nm HPC CMOS technology with
a 0.9 V supply. A 30 Gb/s 2x half-baud-rate CDR was tested with a Tyco 5” channel
with 13.06 dB of loss. The total power consumption is measured to be 79.2 mW or a
FOM of 2.64 pJ/bit. A paper written for a 30Gb/s 2x-half-baud-rate CDR also has been
submitted to ISSCC 2019.
In conclusion, two separate CDR testchips were fabricated in a 28nm process technol-
ogy and successfully measured in the lab.

7.2 Future Works

There are several ways to follow up on the work from this thesis. They are broken down
into two subsections, one for each of the two designs.

7.2.1 Improvements for an Adaptive Baud-Rate CDR

One possible direction is to take the pattern-based baud-rate CDR, where the comparators
are shared between the DFE and the PD, and turn it into a PAM4 receiver. Since
PAM4 signaling [5, 14, 38] sends/receives two bits per symbol, it is a popular approach
for achieving a higher data-rate. In addition, if PAM4 signaling is indeed feasible, an
adaptive scheme that is compatible with PAM4 should be studied as well.
Second, the CDR’s equalization capabilities could be improved. For example, a new
tuning knob could be added to the CTLE. By tuning the source degeneration resistance,
the peaking could be improved by lowering the DC gain. In addition, another tuning
knob could be added to the speculative DFE as well. A 2-tap speculative DFE would be
able to cancel two post-cursor ISI terms as opposed to one. Adding more knobs would
complicate the adaptation but, as a trade-off, higher channel attenuation could be handled.

7.2.2 Improvements for a 2x Half-Baud-Rate CDR

First, the 2x half-baud-rate CDR could be improved in terms of power. The VCO made
up the majority of the power consumption, which prevented the design from achieving a
state-of-the-art FOM for a high-speed wireline CDR. Instead of wirebonding, a flip-chip
package could be used, which would remove the inductance of the wirebonds. In addition,
an LC VCO could be implemented to lower the phase noise and improve the power
consumption at the cost of a reduced tuning range. Since inductors are
required for the LC VCO anyway, t-coils could also be implemented in the front-end to extend
the bandwidth, and inductive peaking could be applied to the CTLE to
extend the bandwidth of the high-frequency boost. As a result, the data-rate could be

improved as well.
Second, adaptation of the Vref and CTLE settings could be implemented for the 2x
half-baud-rate CDR. Our adaptation scheme for this proposed CDR did not work after
fabrication due to an error in the interface between the analog and the digital circuits.
In future work, the adaptive scheme could be fixed and implemented properly.
Third, a 2x half-baud-rate CDR that can tolerate a higher channel loss for MR and
LR applications would be an interesting project. The proposed 2x half-baud-rate CDR
in this thesis was tailored for an XSR/USR application with only the CTLE as the
main equalization scheme. For a higher-loss application such as for a backplane, a direct
feedback DFE with multiple taps could be implemented after the CTLE. At the summing
node of the DFE, which is still an analog eye, the phase detector and the data decoder
may still work.
Lastly, similar to the possible future work for the first design, the feasibility of PAM4
signaling should be studied for a 2x half-baud-rate CDR. If PAM4 is indeed feasible, it
would be possible to double the data-rate at the same clock rate.
Bibliography

[1] IEEE Standard for Ethernet - Amendment 10: Media Access Control Parameters,
Physical Layers, and Management Parameters for 200 Gb/s and 400 Gb/s Operation.
IEEE Std 802.3bs-2017 (Amendment to IEEE 802.3-2015 as amended by IEEE’s
802.3bw-2015, 802.3by-2016, 802.3bq-2016, 802.3bp-2016, 802.3br-2016,
802.3bn-2016, 802.3bz-2016, 802.3bu-2016, 802.3bv-2017, and IEEE 802.3-2015/Cor1-2017),
pages 1–372, Dec 2017.

[2] Ieee standard for ethernet - amendment 11: Physical layer and management param-
eters for serial 25 gb/s ethernet operation over single-mode fiber. IEEE Std 802.3cc-
2017 (Amendment to IEEE Std 802.3-2015 as amended by IEEE’s 802.3bw-2015,
802.3by-2016, 802.3bq-2016, 802.3bp-2016, 802.3br-2016, 802.3bn-2016, 802.3bz-
2016, 802.3bu-2016, 802.3bv-2017, 802.3-2015/Cor 1-2017, and 802.3bs-2017),
pages 1–45, Jan 2018.

[3] A. A. Abidi. Phase noise and jitter in cmos ring oscillators. IEEE Journal of Solid-
State Circuits, 41(8):1803–1816, Aug 2006.

[4] J.D.H. Alexander. Clock recovery from random binary data. Electronics Letters,
11:541–542, 02 1975.

[5] M. Bassi, F. Radice, M. Bruccoleri, S. Erba, and A. Mazzanti. 3.6 a 45gb/s pam-4
transmitter delivering 1.3vppd output swing with 1v supply in 28nm cmos fdsoi. In
2016 IEEE International Solid-State Circuits Conference (ISSCC), pages 66–67, Jan
2016.

[6] J. D. Bruce, H. W. Li, M. J. Dallabetta, and R. J. Baker. Analog layout using alas!
IEEE Journal of Solid-State Circuits, 31(2):271–274, Feb 1996.

[7] R. Dokania, A. Kern, M. He, A. Faust, R. Tseng, S. Weaver, K. Yu, C. Bil, T. Liang,
and F. O’Mahony. 10.5 a 5.9pj/b 10gb/s serial link with unequalized mm-cdr in
14nm tri-gate cmos. In 2015 IEEE International Solid-State Circuits Conference -
(ISSCC) Digest of Technical Papers, pages 1–3, Feb 2015.

[8] A. Emami-Neyestanak, S. Palermo, Hae-Chang Lee, and M. Horowitz. Cmos
transceiver with baud rate clock recovery for optical interconnects. In 2004 Sym-
posium on VLSI Circuits. Digest of Technical Papers (IEEE Cat. No.04CH37525),
pages 410–413, June 2004.

[9] P. A. Francese, T. Toifl, P. Buchmann, M. Brändli, C. Menolfi, M. Kossel, T. Morf,
L. Kull, and T. M. Andersen. A 16 gb/s 3.7 mw/gb/s 8-tap dfe receiver and baud-
rate cdr with 31 kppm tracking bandwidth. IEEE Journal of Solid-State Circuits,
49(11):2490–2502, Nov 2014.

[10] C. Gimeno, E. Guerrero, C. Aldea, S. Celma, and C. Azcona. A fully-differential
adaptive equalizer using the spectrum-balancing technique. In 2013 IEEE Inter-
national Symposium on Circuits and Systems (ISCAS2013), pages 1187–1190, May
2013.

[11] J. Han, Y. Lu, N. Sutardja, and E. Alon. 6.2 a 60gb/s 288mw nrz transceiver
with adaptive equalization and baud-rate clock and data recovery in 65nm cmos
technology. In 2017 IEEE International Solid-State Circuits Conference (ISSCC),
pages 112–113, Feb 2017.

[12] Y. Hidaka, W. Gai, A. Hattori, T. Horie, J. Jiang, K. Kanda, Y. Koyanagi, S. Mat-
subara, and H. Osone. A 4-channel 3.1/10.3gb/s transceiver macro with a pattern-
tolerant adaptive equalizer. In 2007 IEEE International Solid-State Circuits Con-
ference. Digest of Technical Papers, pages 442–443, Feb 2007.

[13] Y. Hidaka, W. Gai, T. Horie, J. H. Jiang, Y. Koyanagi, and H. Osone. A 4-channel
1.25-10.3 gb/s backplane transceiver macro with 35 db equalizer and sign-based zero-
forcing adaptive control. IEEE Journal of Solid-State Circuits, 44(12):3547–3559,
Dec 2009.

[14] J. Im, D. Freitas, A. Roldan, R. Casey, S. Chen, A. Chou, T. Cronin, K. Geary,
S. McLeod, L. Zhou, I. Zhuang, J. Han, S. Lin, P. Upadhyaya, G. Zhang, Y. Frans,
and K. Chang. 6.3 a 40-to-56gb/s pam-4 receiver with 10-tap direct decision-feedback
equalization in 16nm finfet. In 2017 IEEE International Solid-State Circuits Con-
ference (ISSCC), pages 114–115, Feb 2017.

[15] Raj Jain. The art of computer systems performance analysis - techniques for experi-
mental design, measurement, simulation, and modeling. Wiley professional comput-
ing. Wiley, 1991.

[16] M. S. Jalali, A. Sheikholeslami, M. Kibune, and H. Tamura. A reference-less single-
loop half-rate binary cdr. IEEE Journal of Solid-State Circuits, 50(9):2037–2047,
Sept 2015.

[17] H. Y. Joo, K. S. Ha, and L. S. Kim. A data pattern-tolerant adaptive equalizer using
spectrum balancing method. In 2009 Symposium on VLSI Circuits, pages 220–221,
June 2009.

[18] Y. H. Kim, Y. J. Kim, T. Lee, and L. S. Kim. A 21-gbit/s 1.63-pj/bit adaptive ctle
and one-tap dfe with single loop spectrum balancing method. IEEE Transactions
on Very Large Scale Integration (VLSI) Systems, 24(2):789–793, Feb 2016.

[19] J. Lee, M. Chen, and H. Wang. Design and comparison of three 20-gb/s backplane
transceivers for duobinary, pam4, and nrz data. IEEE Journal of Solid-State Circuits,
43(9):2120–2133, Sept 2008.

[20] Jri Lee. A 20gb/s adaptive equalizer in 0.13 µm cmos technology. In 2006
IEEE International Solid State Circuits Conference - Digest of Technical Papers,
pages 273–282, Feb 2006.

[21] M. P. Lin, Y. He, V. W. Hsiao, R. Chang, and S. Lee. Common-centroid ca-
pacitor layout generation considering device matching and parasitic minimization.
IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems,
32(7):991–1002, July 2013.

[22] Di Long, Xianlong Hong, and Sheqin Dong. Optimal two-dimension common cen-
troid layout generation for mos transistors unit-circuit. In 2005 IEEE International
Symposium on Circuits and Systems, pages 2999–3002 Vol. 3, May 2005.

[23] H. Miyaoka, F. Terasawa, M. Kudo, H. Kano, A. Matsuda, N. Shirai, S. Kawai,
T. Shibasaki, T. Danjo, Y. Ogata, Y. Sakai, H. Yamaguchi, T. Mori, Y. Koyanagi,
H. Tamura, Y. Ide, K. Terashima, H. Higashi, T. Higuchi, and N. Naka. A 28.3 gb/s
7.3 pj/bit 35 db backplane transceiver with eye sampling phase adaptation in 28 nm
cmos. In 2016 IEEE Symposium on VLSI Circuits (VLSI-Circuits), pages 1–2, June
2016.

[24] K. Mueller and M. Muller. Timing recovery in digital synchronous data receivers.
IEEE Transactions on Communications, 24(5):516–531, May 1976.

[25] F. A. Musa. High-speed baud-rate clock recovery. PhD thesis, University of Toronto,
Toronto, ON, 2008.

[26] F. A. Musa and A. C. Carusone. A baud-rate timing recovery scheme with a dual-
function analog filter. IEEE Transactions on Circuits and Systems II: Express Briefs,
53(12):1393–1397, Dec 2006.

[27] P. J. Peng, J. F. Li, L. Y. Chen, and J. Lee. 6.1 a 56gb/s pam-4/nrz transceiver in
40nm cmos. In 2017 IEEE International Solid-State Circuits Conference (ISSCC),
pages 110–111, Feb 2017.

[28] W. Rahman, D. Yoo, J. Liang, A. Sheikholeslami, H. Tamura, T. Shibasaki, and
H. Yamaguchi. A 22.5-to-32-gb/s 3.2-pj/b referenceless baud-rate digital cdr with
dfe and ctle in 28-nm cmos. IEEE Journal of Solid-State Circuits, 52(12):3517–3531,
Dec 2017.

[29] W. Rahman, D. Yoo, J. Liang, A. Sheikholeslami, H. Tamura, T. Shibasaki, and
H. Yamaguchi. A 22.5-to-32-gb/s 3.2-pj/b referenceless baud-rate digital cdr with
dfe and ctle in 28-nm cmos. IEEE Journal of Solid-State Circuits, 52(12):3517–3531,
Dec 2017.

[30] B. Ravelo and A. K. Jastrzebski. Modelling of symmetrical distributed clock rc
h-tree. In International Symposium on Electromagnetic Compatibility - EMC EU-
ROPE, pages 1–6, Sept 2012.

[31] D. Schinkel, E. Mensink, E. Klumperink, E. van Tuijl, and B. Nauta. A double-
tail latch-type voltage sense amplifier with 18ps setup+hold time. In 2007 IEEE
International Solid-State Circuits Conference. Digest of Technical Papers, pages 314–
605, Feb 2007.

[32] S. Shahramian, C. Ting, A. Sheikholeslami, H. Tamura, and M. Kibune. A pattern-
guided adaptive equalizer in 65nm cmos. In 2011 IEEE International Solid-State
Circuits Conference, pages 354–356, Feb 2011.

[33] M. H. Shakiba. A 2.5 gb/s adaptive cable equalizer. In 1999 IEEE International
Solid-State Circuits Conference. Digest of Technical Papers. ISSCC. First Edition
(Cat. No.99CH36278), pages 396–397, Feb 1999.

[34] T. Shibasaki, W. Chaivipas, Yanfei Chen, Y. Doi, T. Hamada, H. Takauchi, T. Mori,
Y. Koyanagi, and H. Tamura. A 56-gb/s receiver front-end with a ctle and 1-tap dfe
in 20-nm cmos. In 2014 Symposium on VLSI Circuits Digest of Technical Papers,
pages 1–2, June 2014.

[35] T. Shibasaki, T. Danjo, Y. Ogata, Y. Sakai, H. Miyaoka, F. Terasawa, M. Kudo,
H. Kano, A. Matsuda, S. Kawai, T. Arai, H. Higashi, N. Naka, H. Yamaguchi,
T. Mori, Y. Koyanagi, and H. Tamura. 3.5 a 56gb/s nrz-electrical 247mw/lane
serial-link transceiver in 28nm cmos. In 2016 IEEE International Solid-State Circuits
Conference (ISSCC), pages 64–65, Jan 2016.

[36] F. Spagna, L. Chen, M. Deshpande, Y. Fan, D. Gambetta, S. Gowder, S. Iyer,
R. Kumar, P. Kwok, R. Krishnamurthy, C. Lin, R. Mohanavelu, R. Nicholson, J. Ou,
M. Pasquarella, K. Prasad, H. Rustam, L. Tong, A. Tran, J. Wu, and X. Zhang. A
78mw 11.8gb/s serial link transceiver with adaptive rx equalization and baud-rate
cdr in 32nm cmos. In 2010 IEEE International Solid-State Circuits Conference -
(ISSCC), pages 366–367, Feb 2010.

[37] V. Stojanovic, A. Ho, B. W. Garlepp, F. Chen, J. Wei, G. Tsang, E. Alon, R. T.
Kollipara, C. W. Werner, J. L. Zerbe, and M. A. Horowitz. Autonomous dual-mode
(pam2/4) serial link transceiver with adaptive equalization and data recovery. IEEE
Journal of Solid-State Circuits, 40(4):1012–1026, April 2005.

[38] P. Upadhyaya, C. F. Poon, S. W. Lim, J. Cho, A. Roldan, W. Zhang, J. Namkoong,
T. Pham, B. Xu, W. Lin, H. Zhang, N. Narang, K. H. Tan, G. Zhang, Y. Frans,
and K. Chang. A fully adaptive 19-to-56gb/s pam-4 wireline transceiver with a
configurable adc in 16nm finfet. In 2018 IEEE International Solid - State Circuits
Conference - (ISSCC), pages 108–110, Feb 2018.

[39] M. van Ierssel, H. Yamaguchi, A. Sheikholeslami, H. Tamura, and W. W. Walker.
Event-driven modeling of cdr jitter induced by power-supply noise, finite decision-
circuit bandwidth, and channel isi. IEEE Transactions on Circuits and Systems I:
Regular Papers, 55(5):1306–1315, June 2008.

[40] K. Yamaguchi, K. Sunaga, S. Kaeriyama, T. Nedachi, M. Takamiya, K. Nose,
Y. Nakagawa, M. Sugawara, and M. Fukaishi. 12gb/s duobinary signaling with
×2 oversampled edge equalization. In ISSCC. 2005 IEEE International
Digest of Technical Papers. Solid-State Circuits Conference, 2005., pages 70–585
Vol. 1, Feb 2005.
Appendices

Appendix A

Ancillary

A.1 Portlist for Synthesized Digital


This section lists the ports of the synthesized digital block, which includes the adaptation
engine and the BERT. The input ports listed are programmed using an Arduino.
PORT                WIDTH  DIRECTION  DESCRIPTION
i_CORECLK           1      INPUT      Input clock for digital core
i_RSTb              1      INPUT      Active-low reset for digital core
i_e_k               8      INPUT      8 bits for 1 GHz
i_DOUT              32     INPUT      Input data for digital core, 32 bits for 1 GHz
i_ncon_avg          5      INPUT
i_ncon_max          5      INPUT
i_dLev_delay_cycle  10     INPUT
i_counter_max       24     INPUT
i_num_pi_trial      5      INPUT
i_num_trial         9      INPUT
i_thick_mode        2      INPUT
i_max_Cs            4      INPUT
o_new_trial         1      OUTPUT
o_block_detected    1      OUTPUT     Accumulator port. Simplified, only has 1 block processed
o_dLev              9      OUTPUT     Accumulator port
o_pi_ctrl           5      OUTPUT     PI_logic port
o_mode_pi           3      OUTPUT     PI_logic port
o_cross_rdy         1      OUTPUT     PI_logic port
o_prev_diff         10     OUTPUT     PI_logic port (WIDTH_DLEV+1)
o_dLev011           9      OUTPUT     PI_logic port
o_dLev110           9      OUTPUT     PI_logic port
o_new_Cs_setting    1      OUTPUT     adaptive_logic port
o_thresh_H          9      OUTPUT     adaptive_logic port
o_Cs                4      OUTPUT     adaptive_logic port
o_FSM_state         3      OUTPUT     adaptive_logic port
o_tot_max           16     OUTPUT     adaptive_logic port
o_tot_min           16     OUTPUT     adaptive_logic port
o_tot_vamp          16     OUTPUT     adaptive_logic port
o_max               9      OUTPUT     adaptive_logic port
o_min               9      OUTPUT     adaptive_logic port
o_thick_out         10     OUTPUT     adaptive_logic port (WIDTH_MAX+1)
o_mode_adpt         3      OUTPUT     adaptive_logic port
o_adapt_rdy         1      OUTPUT     adaptive_logic port
i_bert_ENAb         1      INPUT      BERT port: active-low enable to process data or reset error counter
i_bert_pncctl       2      INPUT      BERT port: 2b00 = PRBS7, 2b01 = PRBS31, 2b10 = PRBS23, 2b11 = PRBS15
i_bert_rxorder      1      INPUT      BERT port: invert input data order. 0: Yes, bit 31 first; 1 [default]: No, bit 0 first
o_bert_pnerr        1      OUTPUT     BERT port: comparison error, current sample
o_bert_pneonce      1      OUTPUT     BERT port: comparison error, all past samples
o_bert_pnebitcnt    20     OUTPUT     BERT port: error counter
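
The BERT ports above can be illustrated with a small software reference model. The following is only a sketch and not the synthesized logic: the `i_bert_pncctl` encoding is taken from the portlist, while the feedback polynomials (the commonly used x^7+x^6+1, x^15+x^14+1, x^23+x^18+1 and x^31+x^28+1), the function names, and the known-seed comparison are assumptions made for illustration. The on-chip BERT processes 32 bits per clock cycle, which is not modelled here.

```python
# Encoding of i_bert_pncctl, taken from the portlist.
PNCCTL_TO_PRBS = {0b00: "PRBS7", 0b01: "PRBS31", 0b10: "PRBS23", 0b11: "PRBS15"}

# Assumed feedback taps (n, m) for x^n + x^m + 1; not stated in the thesis.
PRBS_TAPS = {"PRBS7": (7, 6), "PRBS15": (15, 14),
             "PRBS23": (23, 18), "PRBS31": (31, 28)}


def prbs_bits(name, count, seed=1):
    """Generate `count` bits of the named PRBS with a Fibonacci LFSR."""
    n, m = PRBS_TAPS[name]
    mask = (1 << n) - 1
    state = seed & mask
    if state == 0:
        raise ValueError("seed must be non-zero")
    out = []
    for _ in range(count):
        # XOR of the two tapped register stages gives the next bit.
        new_bit = ((state >> (n - 1)) ^ (state >> (m - 1))) & 1
        out.append(new_bit)
        state = ((state << 1) | new_bit) & mask
    return out


def count_errors(pncctl, received, seed=1):
    """Count mismatches between received bits and the expected pattern
    (hypothetical stand-in for the o_bert_pnebitcnt error counter)."""
    expected = prbs_bits(PNCCTL_TO_PRBS[pncctl], len(received), seed)
    return sum(e != r for e, r in zip(expected, received))
```

For example, `count_errors(0b00, received_bits)` would compare a captured bit list against PRBS7, assuming the capture starts at the same seed as the reference.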

A.2 Output Pad MUX Selection


This section outlines the 4-bit selectable output pad mux and what each of its 16 possible settings outputs to the outside world
via PCB traces.

oPAD0:
coreck_16 (500 MHz)

oPAD1:
0) dout clk0 sample
1) dLev <0 or 4>
2) Thresh <8:0> selectable
3) Thresh [0-->8] burst mode
4) new_trial
5) dcomp <0>
6) PIPO<0>
7) mode_pi <0> (1-bit)
8) FSM_state [0-->2] burst mode
9) max [0-->8] burst mode
10) tot_max [0-->15] burst mode
11) tot_vamp [0-->15] burst mode
12) dLev011 [0-->8] burst mode
13) o_bert_pnebitcnt [0-->19] burst mode
14) block_detected<0>

oPAD2:
0) dout clk90 sample
1) dLev <1 or 5>
2) adapt_rdy
3) Thresh flag_bit0
4) cross_rdy
5) dcomp <1>
6) PIPO<1>
7) new_Cs_setting
8) FSM_state flag_bit0
9) max flag_bit0
10) tot_max flag_bit0
11) tot_vamp flag_bit0
12) dLev011 flag_bit0
13) o_bert_pnebitcnt flag_bit0
14) ck_div16

oPAD3:
0) dout clk180 sample
1) dLev <2 or 6>
2) PI <4:0> selectable
3) PI [0-->4] burst mode
4) Cs [0-->3] burst mode
5) dcomp <2>
6) PIPO <2>
7) mode_adpt [0-->2] burst mode
8) thick_out [0-->9] burst mode
9) min [0-->8] burst mode
10) tot_min [0-->15] burst mode
11) prev_diff [0-->9] burst mode
12) dLev110 [0-->8] burst mode
13) o_bert_pnerr (1-bit)

oPAD4:
0) e_k_demux (between ck90 and 180)
1) dLev <3 or 7>
2) Cs <3:0> selectable
3) PI flag_bit0
4) Cs flag_bit0
5) dcomp <3>
6) PIPO<3>
7) mode_adpt flag_bit0
8) thick_out flag_bit0
9) min flag_bit0
10) tot_min flag_bit0
11) prev_diff flag_bit0
12) dLev110 flag_bit0
13) o_bert_pneonce (1-bit)
14) ck_div16

<63:0> 500 MHz data_rec
<15:0> 500 MHz e_k_demux

4-bit select, e.g.:
Sel<dec 0> = selects dout<0>, dout<1>, dout<2>, e_k<0> [block 0]
Sel<dec 1> = selects dout<4>, dout<5>, dout<6>, e_k<0> [block 1]
.
.
Sel<dec 7> = selects dout<28>, dout<29>, dout<30>, e_k<0> [block 7]
Sel<dec 8> = selects dout<32>, dout<33>, dout<34>, e_k<8> [block 8]
Sel<dec 15> = selects dout<60>, dout<61>, dout<62>, e_k<8> [block 15]
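
The pattern in the listed examples can be written as a short lookup. This is a sketch inferred from the five settings shown above; the function name is hypothetical, and the behaviour for the settings that are not listed (2 to 6 and 9 to 14) is an assumption that the same pattern continues.

```python
def pad_mux_selection(sel):
    """Return the signals routed out for a 4-bit block select value.

    Inferred from the listed examples: block `sel` exposes dout<4*sel>,
    dout<4*sel+1> and dout<4*sel+2>, plus e_k<0> for blocks 0..7 or
    e_k<8> for blocks 8..15 (assumed for the unlisted settings).
    """
    if not 0 <= sel <= 15:
        raise ValueError("sel must fit in 4 bits (0..15)")
    base = 4 * sel                    # first dout index of the block
    e_k_index = 0 if sel < 8 else 8   # one e_k bit per group of 8 blocks
    signals = [f"dout<{base + i}>" for i in range(3)]
    signals.append(f"e_k<{e_k_index}>")
    return signals
```

For instance, `pad_mux_selection(7)` gives `dout<28>`, `dout<29>`, `dout<30>` and `e_k<0>`, matching the entry listed for block 7.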
