
IEEE JOURNAL OF SOLID-STATE CIRCUITS, VOL. 57, NO. 4, APRIL 2022

A 1.24-pJ/b 112-Gb/s (870 Gb/s/mm) Transceiver for In-Package Links in 7-nm FinFET
Chi Fung Poon, Wenfeng Zhang, Junho Cho, Shaojun Ma, Yipeng Wang, Ying Cao,
Asma Laraba, Eugene Ho, Winson Lin, Daniel Zhaoyin Wu, Kee Hian Tan,
Parag Upadhyaya, Member, IEEE, and Yohan Frans, Member, IEEE

Abstract— This article describes the design of a 1.24-pJ/b
112-Gb/s PAM4 transceiver test chip in 7-nm FinFET for
in-package die-to-die communication. The receiver supports
0–1.2-V input common mode and utilizes a single-stage active
inductor-based CMOS continuous-time linear equalizer (CTLE)
with 12 data slicers and two error slicers. The quad-rate voltage-
mode transmitter implements delay-based sub-UI two-tap FFE
and digital I/Q and DCC clock calibration. A single-phase clock
from a wideband LC phase-locked loop (PLL) is distributed
to eight transceiver channels. In each channel, an injection-
locked oscillator (ILO) generates eight-phase clocks that feed
an 8-bit CMOS phase interpolator (PI). The transceiver achieves
<1e−12 bit error rate (BER) over 30-mm channel at 106.25 Gb/s
and over 20-mm channel at 112 Gb/s.
Index Terms— 112G-XSR, chiplet, CMOS continuous-time lin-
ear equalizer (CTLE), die-to-die links, digital clock calibration,
in-package links, low-power wireline, short reach.
Fig. 1. Chiplet illustrating inter-die connectivity within the module.

I. INTRODUCTION

A growing trend of chiplet-based architecture drives
the need for a power-efficient, cost-effective, and high-
bandwidth interface for in-package die-to-die communication.
Fig. 1 shows an example of high-performance multi-die field-
programmable gate arrays (FPGAs) or application-specific
integrated circuits (ASICs) that require an interface between a
core die and an electrical or optical I/O die [1]. Standardization
of such an in-package interface, shown in Fig. 2, is already in the
works as part of OIF CEI-112G-XSR, which covers up to
112-Gb/s per-lane data rate [2]. Two key metrics for such an
interface are power efficiency (pJ/b) and die-edge (shoreline)
bandwidth density (Gb/s/mm). The power efficiency requirement
dictates the architecture choice, supply levels, and equalization
scheme, while shoreline density is achieved by employing extreme
area scaling and a metallization scheme. This article presents an
eight-lane transceiver test chip implemented in 7-nm FinFET
technology that achieves both high shoreline bandwidth density
(>800 Gb/s/mm) and low power (<1.3 pJ/b) over 5–30-mm package
traces, while meeting <1E−12 bit error rate (BER) at 106.25-Gb/s
line rate and <1E−9 BER at 112-Gb/s line rate. A small, low-power,
low-latency FEC could be used to make the link error free if needed
for certain applications. This article is organized into eight
sections. Section II offers an architecture overview. Receiver (RX),
transmitter (TX), clocking, and package considerations are discussed
in Sections III–VI, respectively. Section VII provides the
measurement results. This article is concluded with key highlights
in Section VIII.

Fig. 2. XSR interconnect framework.

Manuscript received August 17, 2021; revised November 2, 2021 and
December 15, 2021; accepted December 29, 2021. Date of publication
January 28, 2022; date of current version March 28, 2022. This article
was approved by Associate Editor Yusuke Oike. (Corresponding author:
Chi Fung Poon.)
Chi Fung Poon, Wenfeng Zhang, Junho Cho, Shaojun Ma, Ying Cao, Asma
Laraba, Eugene Ho, Winson Lin, Daniel Zhaoyin Wu, Parag Upadhyaya,
and Yohan Frans are with Xilinx, Inc., San Jose, CA 94086 USA (e-mail:
chifungp@xilinx.com).
Yipeng Wang and Kee Hian Tan are with Xilinx, Inc., Singapore 789976.
Color versions of one or more figures in this article are available at
https://doi.org/10.1109/JSSC.2022.3141802.
Digital Object Identifier 10.1109/JSSC.2022.3141802
0018-9200 © 2022 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.
See https://www.ieee.org/publications/rights/index.html for more information.

Authorized licensed use limited to: University of Science & Technology of China. Downloaded on May 30,2022 at 01:26:27 UTC from IEEE Xplore. Restrictions apply.

Fig. 3. Octal transceiver block diagram.

Fig. 4. DC-coupled PAM4/NRZ RX architecture.

II. ARCHITECTURE

To achieve high power efficiency and shoreline bandwidth
density, many traditional approaches, such as multi-tap TX
FIR, analog DFE-based [3], or analog-to-digital converter
(ADC)-based [4] architectures commonly used for PAM4, had
to be abandoned to simplify the architecture and enable area scaling.
A CMOS-only architecture is selected for its area benefit
over CML-based approaches traditionally used for front-end
circuits such as the continuous-time linear equalizer (CTLE) and
for calibrations. A scalable regulated supply from lower supply
voltages, which allows the use of core transistors, is used for
high-speed clock distribution and multi-phase clock generation
to achieve power efficiency while meeting performance. The
use of regulators also minimizes the need for additional supply
bumps, which helps meet shoreline density. Passive inductor
usage is restricted to I/Os only due to metal congestion to
support shoreline density, and hence, an active inductor-based
solution is employed in circuits such as the CTLE. In addition,
a simple equalization architecture is used to take advantage
of the lower channel loss, ranging from 6 to 10 dB at the 28-GHz
Nyquist frequency. In the RX, a single-stage CTLE is implemented to
provide up to 5 dB of equalization, while the TX adopts a
two-tap delay-based FIR with 2 dB of equalization. Finally,
a single phase-locked loop (PLL) is shared among eight
channels for area and power benefit.

The overall test chip architecture is shown in Fig. 3. Clock
distribution from the PLL is done using a single-phase CMOS clock to
save power and area. Multi-phase clocks are then generated
locally using ring-based injection-locked oscillators (ILOs)
within each lane. A 64-bit parallel interface with integrated
clock and data recovery (CDR) is used. A 64k memory is used
to store data, and the test chip control consists of a PRBS generator
and checker.

III. RECEIVER

The PAM4/NRZ RX consists of a level shifter, a CTLE,
14 slicers (12 data and 2 error), deserializers, and a phase
interpolator (PI)-based baud-rate CDR, as shown in Fig. 4.
Since the extra short reach (XSR) standard calls for a dc-coupled
interface, a level shifter, shown in Fig. 5, with a wide
input common-mode range is used to drive a single-stage CMOS
CTLE. Fig. 6 shows the CTLE architecture that has both
mid-frequency (MF) boosting and high-frequency (HF) boosting [5].
To save area and metal utilization, an active inductor is
used in the CTLE to achieve peaking and bandwidth extension
at Nyquist. The HF boosting is controlled using programmable
active inductor loads, while the MF boosting is controlled
using programmable RC filters. The single-stage CTLE drives
14 strong-arm slicers for quarter-rate PAM4: three data slicers for
each of the four phases, plus two error slicers (instead of four)
shared through a ping-pong scheme. The data and error samples are
then deserialized into 64 bits for data and 16 bits for error. The
baud-rate CDR logic takes in these samples and controls a
1/64-UI-step 8-bit CMOS PI. The PI drives another ILO (ILO2), which
generates four clock phases for the slicers. The ILO1-PI-ILO2
structure is tuned using a single quadrature error correction loop
(QLL). The output of ILO1 is also used to generate quadrature clock
phases for the TX.

Fig. 5. RX ODT and level shifter.

A. On-Die Termination and Level Shifter

Fig. 5 shows the RX termination and level shifter that
support the dc-coupled link. The PAM-4/NRZ signal received at
the RX+/RX− input pads is terminated by the calibrated on-die
termination (ODT) resistors. T-coil inductors are used for
impedance matching to enhance the return loss (RL) and for
bandwidth extension by neutralizing the parasitic capacitances
introduced by input pads, ESD diodes, ODT, and CTLE
input capacitance. Since channels are embedded in a package,
ESD diodes are optimized to reduce parasitic capacitance while
meeting the CDM ESD rating of 200 V. High-density metal
capacitor, Cterm, between two ODT resistors is used to meet the
CM RL specification. Programmable OCAPs are introduced
in the termination to adjust peaking to support 10–30-mm
channels and low line-rate applications. To support the wide
input common-mode range required by the XSR standard,
a level-shifting input buffer shifts the input common mode
from 0 to 1.2 V to a level optimum for the CTLE. The level
shifter consists of a unity-gain amplifier (Opa) (low-frequency
path) and ac-coupling caps (Cac) (HF path) to achieve <0.5-dB
in-band droop (Fig. 5). Cac along with Rbig set the high-pass
corner frequency. Output common mode is set to mid supply
and is dependent on reference Vcm. Vcm is generated by a
replica CTLE stage that is configured as a voltage divider.

Fig. 6. CTLE and offset/equalization loops.

Fig. 7. CTLE transfer function.

Fig. 8. PAM-4 eye and naming convention.

B. Continuous-Time Linear Equalizer

The compact inverter-based CTLE is implemented with an
additive topology, as shown in Fig. 6, and its transfer function
is shown in Fig. 7. Hybrid peaking (both HF and MF) is
realized in a single stage. The HF peaking is programmable by
changing the active inductor feedback resistor Rg. The active
inductor is realized by a complementary pass-gate transistor
operating in the triode region, instead of the conventional
TiN resistors, dramatically reducing the cell size and parasitic
wire loading. The HF path provides 5-dB peaking at the max setting.
The HF peaking frequency is tunable from 10 to 25 GHz
to support low data rates by adjusting the HF code and Rg. The
MF peaking provides additional long-tail ISI cancellation. The
MF peaking frequency is also programmable by changing
the resistor Rmf with a range of 100 MHz–1 GHz, and the MF
peaking range is from 0 to 1 dB. An offset adjustment circuit
is included in parallel with the output of the CTLE for front-end
offset and baseline wander (BLW) cancellation. Each PAM4
slicing level shown in Fig. 8 has a corresponding digital-to-analog
converter (DAC) code, and the offset can be computed by
comparing the EHP-to-DZ difference with the ELN-to-DZ difference
using a ping-pong scheme. The error slicers EHP and ELN are
sensed in an alternating pattern (i.e., ping-ponged) for offset
correction. Logic is implemented inside an off-chip adaptation
block. A digital offset code is converted to analog voltages
(VCAL+ and VCAL−) to generate an offset-canceling current
through a CTLE gm stage (Fig. 6). The nominal bandwidth of the
offset loop is a few hundred kHz, which is sufficient to
track offset variation caused by VT drift. The area overhead of the
adaptation logic is already included as part of the RX area for the
next implementation.

C. Slicer and DAC I-to-V

A modified strong-arm-based slicer with an un-clocked keeper
is optimized to operate at the CTLE common mode. The slicer
array contains 14 slicers (12 data and 2 error), which are
clocked by four phases of 14-GHz clocks to sample the input
data.

Fig. 8 shows the PAM-4 signal and the corresponding data and
error slicer naming convention used in this article. The slicing
level is adapted and converted to an analog voltage through a
current DAC, as shown in Fig. 6. For every sampling UI, three
data slicers (DH, DZ, and DL) generate 3-bit thermometer-coded
data corresponding to one of four PAM-4 symbols. The
thermometer-coded data are later translated to two-bit binary
data. Error slicers are used to sample the eye peak to control
the CDR, CTLE adaptation, and offset calibrations. To reduce
power and area, a time-multiplexing (ping-pong) scheme is
used for error slicers on the 0° and 90° sampling UI. A pair of
error slicers is time-shared between EHP/ELP and EHN/ELN
by a switch clock (divided down from the 32-UI deserializer
parallel clock), as shown in Fig. 9(a). The error samples are
shown in Fig. 9(b).

Fig. 9. (a) Switch clock. (b) Error slicers.

D. Quadrature Clock Generation

As shown in Fig. 4, quadrature clocks are generated per lane
using the ILO-PI-ILO scheme. To achieve good PI linearity,
eight clock phases are generated from a single-phase clock
using a ring-based ILO (ILO1). The PI drives another identical
ILO (ILO2) to generate the four sampling phases needed in the
slicers.

The ILO-PI-ILO structure shares the same supply and is
tuned using a single quadrature error correction loop (QLL).
The quadrature error is sensed at the ILO2 output to tune
the supply voltage accordingly (typically in the range of
0.4–0.7 V) to ensure that the injection frequency is optimally close
to the natural oscillation frequency of the ILOs. Since sensing is
done at the ILO2 output (slicer input), <2° phase error is achieved
at the RX slicer at 14 GHz. This comes at the expense of phase
error at the ILO1 output, which is <5° even though the two ILOs are
identical, due to Monte Carlo variation. The ILO1 outputs drive the
RX PI inputs and are the source for the TX IQ clocks, so the phase
error from ILO1 degrades PI linearity and TX output jitter,
but at acceptable levels. A digital timing calibration loop is
described in the TX section to minimize jitter due to I/Q phase
error. Integral non-linearity (INL) of the PI is less than 3.5 LSB,
including mismatch, which is acceptable for source-synchronous
links.

To accommodate the wideband operation of 4–16 GHz
while avoiding large Kvco and false sub-harmonic locking, the
ILOs are segmented into three operation ranges (4–6, 6–10,
and 10–16 GHz). This also ensures that the adjusted supply
of the structure (regulated from 0.88 V) is within 0.4–0.7 V
in each band to avoid the headroom issue of the LDO.

The ILO-PI-ILO structure provides two major benefits:
small area and clock jitter filtering. An ILO has a low-pass
noise response, which filters the HF jitter from its input clock
(e.g., DCD and RJ from the PI).

During initial coarse frequency tracking, the off-chip control
logic searches for the correct band setting of the ILOs and sweeps
the on-chip voltage DAC by comparing the injection frequency
with the free-running frequency of the ILOs. Once coarse
tuning is done, the QLL loop is enabled to lock the frequency
of the ILO by quadrature error cancellation [11].

Both ILOs are identical to minimize phase error caused by
mismatch. The LDO is placed in the middle of the two ILOs
to reduce any frequency mismatch introduced by IR drop. In addition,
ILO1 and the PI are placed close to each other to save power on
eight-phase clock delivery.

Fig. 10. TX block diagram.

IV. TRANSMITTER

Fig. 10 shows the TX architecture, with a 4:1 MUX as the final
serialization stage and a voltage-mode driver employing a
delay-based sub-UI two-tap FFE that supports up to 0.7-V
differential peak-to-peak swing.

A voltage-mode driver is chosen for its power benefit. The
TX impedance calibration is done digitally to ensure good
RL based on an externally calibrated resistor. A delay-based
sub-UI FFE provides up to 2 dB of pre-cursor equalization.
Post-cursor ISI is equalized by the RX CTLE.

A 4:1 MUX as the final muxing stage eliminates the need
for a high-speed 2-UI clock (28 GHz), enabling the TX clock
distribution to operate at a low regulated voltage of 0.7 V for
power saving and supply noise rejection. Since the 4:1 MUX
uses all edges of the quadrature clocks, any duty-cycle distortion
(DCD) or quadrature (I/Q) error directly translates into
deterministic jitter. To correct these phase errors, an area-efficient
digital calibration loop is proposed.

A. TX Front End

The TX front end comprises a pre-driver followed by the
voltage-mode driver. To enable large fan-out for power
reduction, sub-UI de-emphasis is adopted at the pre-driver [6].
A four-stage CMOS pre-driver with an average fan-out of 2 is
used to drive the voltage-mode driver while keeping the
data-dependent jitter below 1 ps (56 mUI) at 56-Gb/s NRZ
to reduce power.

Fig. 11. TX front end with sub-UI FIR.

Fig. 12. Voltage-mode driver slice: (a) with programmability and (b) without
programmability.

Fig. 13. Digital impedance calibration block diagram.

Fig. 11 shows the voltage-mode driver with a programmable
two-tap FFE capable of 2-dB pre-cursor equalization. A
three-stage inverter chain is used to provide the open-loop delay
(12–17 ps across PVT) in the two-tap sub-UI FFE, eliminating
the need for the flip-flops and clock buffer typically associated with
tap generation. The drawback of this FFE implementation is
that delay does not scale with data rate. However, at a lower
data rate, the need for equalization diminishes, justifying the
tradeoff for its area and power benefit.
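
The scaling drawback can be made concrete with a short calculation. The sketch below expresses the fixed 12–17-ps inverter-chain delay as a fraction of the unit interval. The 56-GBd symbol rate corresponds to 112-Gb/s PAM4; the two lower rates and the function name `delay_in_ui` are assumptions chosen for illustration, not values from the measurements.

```python
def delay_in_ui(delay_ps: float, baud_gbd: float) -> float:
    """Return a fixed delay expressed as a fraction of one unit interval (UI)."""
    ui_ps = 1000.0 / baud_gbd  # UI in ps for a symbol rate given in GBd
    return delay_ps / ui_ps


# 56 GBd is 112-Gb/s PAM4; 28 and 14 GBd are illustrative lower rates.
for baud in (56.0, 28.0, 14.0):
    lo = delay_in_ui(12.0, baud)
    hi = delay_in_ui(17.0, baud)
    print(f"{baud:4.0f} GBd: tap delay spans {lo:.2f} to {hi:.2f} UI")
```

At 56 GBd the delay covers roughly two thirds of a UI up to nearly a full UI, while at lower symbol rates the same delay is a progressively smaller fraction of the UI, which is the non-scaling behavior the text describes.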
The two variants of the driver slice are shown in Fig. 12: one
includes gating logic to enable or disable the slice, as in
Fig. 12(a), while the other is without it, as shown in Fig. 12(b).
The number of driver slices is programmable to provide ±20%
tuning range with 5% resolution to cover resistance variation
over PVT. The basic structure of the driver slice remains the
same, with a common Hi-R poly resistor and an active device
operating in the triode region for pull-up/pull-down. The ratio
of resistance contributed by the Hi-R poly resistor to that of the
active device is approximately 1.5:1. The lower bound of this
ratio is limited by output linearity, as the active device resistance
becomes non-linear with a large drain–source voltage drop. A
T-coil inductor is used to help maintain good RL at the driver
that includes the ESD diode.

B. Digital Impedance Calibration

An analog impedance calibration loop often occupies a large
area due to the large devices needed for low offset and the capacitor
needed for stability. An all-digital approach is adopted to
circumvent this problem. As shown in Fig. 13, a digital
impedance calibration circuit employs a dynamic comparator
to compare the pull-up and pull-down resistance of a driver
replica (LSB half cell replica) against a calibrated resistor
(Rtrim). Rtrim is digitally calibrated to the ideal output
resistance of the LSB driver (50 × 3 = 150 Ω) to serve as a
target resistance for the replica pull-up and pull-down paths.
To minimize the comparator noise requirement, a large sample
size (nominally 1024) is taken for averaging.

There are three steps in the calibration. First, the dynamic
comparator offset is calibrated out. With the positive
and negative inputs of the comparator shorted, any offset appears
as an uneven number of sampled “1s” and “0s.” Based on the result,
the comparator offset is corrected by adjusting the capacitive DAC
at the drain node of the input devices (Fig. 13). The maximum
CDAC step size is 1.8 mV referred to the comparator input,
which is less than half of the voltage resolution of Vdrv. Once
the comparator offset is corrected, the driver replica pull-up
path is connected in series with the trimmed resistor. If the
midpoint voltage, Vdrv, in Fig. 13, is higher than the reference
voltage (0.5 × Vrefp), the replica pull-up slice count is reduced,
and vice versa. Finally, a similar approach is taken to calibrate the
replica pull-down resistance against Rtrim by putting them in series.
The calibration codes are sent to the TX driver once all three
loops are done.

The use of a replica path allows slow background calibration to
track voltage variation, and an all-digital loop allows
power cycling or duty cycling the calibration loop to further
reduce power consumption.

Fig. 14. TX 4:1 MUX diagram.

C. Digital Timing Calibration

The final 4:1 MUX utilizes four-phase 4-UI clocks to generate
a 1-UI bitstream (Fig. 14). Since only 4-UI clocks are used in
this muxing scheme, clock power is lowered compared to the
traditional scheme, which requires a higher speed 2-UI clock
or 1-UI pulses. Clock misalignment between the I and Q clock phases
will appear as TX output jitter, and hence, timing calibration
is required. A timing calibration scheme is discussed in [7],
which sends two training patterns to a replica 4:1 MUX, and an
analog calibration loop is used for foreground and background
calibrations.

There are two drawbacks with such a scheme. First, mismatch
between the main and replica paths limits the minimum replica path
size. Second, as discussed in Section IV-B, analog components
in the calibration loop take up substantial area.

To meet the stringent area requirement, a new, all-digital
timing calibration loop is proposed. Fig. 15(a) shows the new
timing calibration scheme that employs a slow asynchronous
clock to sample the 4-UI clocks and the 4:1 MUX output to obtain
information on clock duty cycle and IQ phase alignment,
respectively. In the case of IQ phase alignment, two calibration
patterns, 0101 (pattern A) and 1010 (pattern B) at the 4:1 MUX
input, are used as in [7]. As shown in Fig. 16, misalignment
will result in a mismatch in output pulsewidth between the
two patterns. A pulsewidth difference will then manifest as
a difference in the number of “1s” captured by the flip-flop clocked
by the asynchronous slow clock over a large number of samples.

To circumvent the need for a large replica path, foreground
calibration is done with the mission-mode path [MSB and LSB
in Fig. 15(a)]. When data traffic begins, calibration is then
switched to the replica path [replica in Fig. 15(a)]. Mismatch
between the mission-mode and replica paths is captured as a
difference in “1s” count. Timing calibration is then continued
with the replica path with Nos added for VT tracking. Fig. 15(b)
shows the timing diagram that illustrates the IQ calibration
scheme described above. With this scheme, we can capture
and correct the mismatch “digitally,” enabling a small replica
path to be used to maintain phase error <150 fs over PVT.

Fig. 15. (a) Digital timing calibration scheme. (b) Digital timing calibration
scheme.

Fig. 16. 4:1 MUX output (Dout) with training patterns A and B. Perfect I/Q
clock alignment (left). Misalignment between I/Q clocks (right).

V. CLOCKING

The central clocking architecture consists of a 12–16.5-GHz
LC PLL that supports both full- and half-rate modes. To support
high dynamic range, the charge pump (CP) requires a 0.9-V
regulated supply, which is generated from a 1.2-V supply using
a low-dropout, high-PSRR, flipped voltage follower (FVF)
regulator. The LC voltage-controlled oscillator (VCO) and
single-phase high-speed clock distribution run on a 0.7-V
regulated supply to reduce power consumption while meeting
jitter requirements.

Fig. 17. LC-PLL block diagram.

Fig. 18. FVF regulator.

A. LC Phase-Locked Loop (PLL)

Fig. 17 shows the block diagram of the third-order analog
LC-PLL with dual regulators. To provide high output dynamic
range, the CP is powered by a 0.9-V regulated supply from
a low-noise, high-PSRR FVF regulator. On the other hand, the
LC-VCO and global high-speed clock distribution network
share a 0.7-V supply generated by a low-noise source-follower
(SF) regulator with high load regulation to achieve better
power efficiency for the high-speed clock distribution network.
An NMOS cross-coupled LC-VCO topology is selected for
high output swing and low phase noise under low supply
voltage. With a 6-b coarse-tuning capacitor bank, the VCO
is designed to cover 12–16.5 GHz with better than 70%
band overlap over PVT. An open-loop temperature compensation
(TC) block controls an auxiliary varactor to minimize
frequency drift and keep Vctrl within the CP’s dynamic
range over temperature variations after the coarse loop band
searching is completed. The simulated worst case phase noise
is −98 dBc/Hz at 1-MHz offset when operating at 14 GHz.
Closed-loop PLL phase noise at 14 GHz is measured to be −99
and −120 dBc/Hz at the 1- and 10-MHz offset, respectively,
as shown in Fig. 24.
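
As a rough illustration of how spot phase-noise numbers relate to rms jitter, the sketch below integrates a single-sideband phase-noise profile using the standard relation σ = sqrt(2∫L(f)df)/(2πf0), with straight-line interpolation on a log-log plot between points. Only the two measured points quoted above (−99 and −120 dBc/Hz at 1 and 10 MHz, 14-GHz carrier) come from the text; the interpolation, the restriction to the 1–10-MHz band, and the function name are assumptions. The result is therefore not comparable to jitter integrated over a wider band or with a CDR filter applied.

```python
import math


def rms_jitter_from_phase_noise(points, f_carrier_hz):
    """Integrate SSB phase noise L(f) and return rms jitter in seconds.

    points: list of (offset_hz, dbc_per_hz) in ascending offset order.
    Each segment is treated as a power law, i.e. a straight line on a
    log-log plot (an assumption about the shape between points).
    """
    total = 0.0
    for (f1, l1), (f2, l2) in zip(points, points[1:]):
        exponent = (l2 - l1) / (10.0 * math.log10(f2 / f1))
        a1 = 10.0 ** (l1 / 10.0)  # linear noise density at f1
        if abs(exponent + 1.0) < 1e-12:
            total += a1 * f1 * math.log(f2 / f1)
        else:
            total += a1 * f1 / (exponent + 1.0) * ((f2 / f1) ** (exponent + 1.0) - 1.0)
    phase_rms_rad = math.sqrt(2.0 * total)  # factor 2 covers both sidebands
    return phase_rms_rad / (2.0 * math.pi * f_carrier_hz)


# Measured closed-loop points from the text; band limited to 1-10 MHz.
measured = [(1e6, -99.0), (10e6, -120.0)]
jitter_s = rms_jitter_from_phase_noise(measured, 14e9)
print(f"rms jitter over 1-10 MHz: {jitter_s * 1e15:.0f} fs")
```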
Fig. 18 shows the block diagram of the FVF voltage reg-
ulator consisting of an operational transconductance amplifier
(OTA), two FVF cells (replica and main), and noise compen-
sation low-pass filters. Compared to conventional designs, the
proposed FVF regulator suppresses supply noise by a “supply
modulated bias current,” as shown in blue in Fig. 19(a).
With noise compensation, gate voltage of PMOS current
source (Pbias) is filtered by an RC filter referencing to ground
and stays “quiet” for frequencies outside RC filter bandwidth.
Therefore, Pbias drain current is modulated by supply noise
via its noisy Vgs and thus called “supply-modulated bias
current.” As shown in Fig. 19(b), this supply noise-modulated
bias current stabilizes Vg of Mbuf (Vg_buf ) and thus reduces
variation of Ids,Mfvf as well as noise injection to the regulator’s
output. The proposed supply noise compensation scheme sig-
nificantly improves PSRR in the mid-frequency range where
conventional FVF regulator loses its loop gain and degrades
supply noise rejection [Fig. 19(c)].

Fig. 19. (a) FVF regulator supply noise analysis. “Supply modulated bias
current” (blue) and noise compensated (gray). (b) Voltage/current waveform.
(c) Simulated PSRR showing improved performance.

B. High-Speed Clock Distribution

A single-phase 14-GHz clock is delivered over 800-μm
distance to each channel for local I/Q generation. To reduce
additive random and supply-induced noise, the clock buffers
are designed to have a minimum number of stages and less
than 12-ps rise/fall time. The worst case additive rms jitter is
65 fs measured at the farthest destination.

VI. PACKAGE

Since only simple equalization is used to save power and
area, the link BER performance is sensitive to reflections.
While TX and RX RL are improved using an on-die inductor
and T-coil, a significant reflection still exists at the RX bump
pad. The problem is more severe for short traces (<10 mm)
where reflection limits BER to 1E−10 at 112 Gb/s based
on simulation. To address this, a broadband inductor in the
package is implemented to tune out the bump pad parasitic
using package metal traces. Short traces are routed at the
top layer of the package as embedded microstrip to reduce via
pad capacitance with layer 2. Fig. 20 shows the time-domain
reflectometer (TDR) impedance for traces routed as microstrip
and stripline, demonstrating the advantage of microstrip in
terms of impedance continuity.

The on-package inductor, realized with substrate metallization
to provide ∼200 pH of inductance, along with the microstrip
route improves RX RL by up to 2 dB. The simulated single-bit
response (SBR), in Fig. 21, shows the improvement in
reflection after the on-package inductance is implemented.
Package layers 1–4 are used to route the high-speed signals (eight
channels) to meet shoreline bandwidth density.
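
To connect the TDR impedance continuity discussion to return loss, the sketch below applies the textbook relations Γ = (Z − Z0)/(Z + Z0) and RL = −20·log10|Γ| to a purely resistive impedance step. The 100-Ω reference and the 85-Ω and 92-Ω local impedances are hypothetical values picked for illustration; they are not read from Fig. 20, and the function names are likewise assumptions.

```python
import math


def reflection_coefficient(z_ohm: float, z0_ohm: float) -> float:
    """Reflection coefficient of a resistive impedance step seen from z0."""
    return (z_ohm - z0_ohm) / (z_ohm + z0_ohm)


def return_loss_db(z_ohm: float, z0_ohm: float) -> float:
    """Return loss in dB (larger is better); infinite for a perfect match."""
    gamma = abs(reflection_coefficient(z_ohm, z0_ohm))
    return math.inf if gamma == 0.0 else -20.0 * math.log10(gamma)


# Hypothetical impedance dips against an assumed 100-ohm differential reference.
for z_local in (85.0, 92.0):
    print(f"{z_local:.0f} ohm dip: RL = {return_loss_db(z_local, 100.0):.1f} dB")
```

A smaller impedance excursion yields a smaller reflection coefficient and a higher return loss, which is why a flatter TDR trace is preferable for a link that relies on simple equalization.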


Fig. 20. TDR between microstrip and stripline.

Fig. 21. SBR without (top) and with (bottom) on-package inductor. CHout:
@Receiver bump. capFFin_post_bw: @Slicer input.

To minimize crosstalk, adjacent channels are routed in
different PCB layers. Near-end crosstalk (NEXT) is particularly
detrimental due to the lack of aggressor attenuation; as such, a
ground bump is inserted between adjacent TX and RX channels
to mitigate its impact. A worst case power sum crosstalk
of −40 dB at the bump is achieved based on the simulation result.
Integrated crosstalk noise (ICN) is <1 mVrms, while the dominant
noise source of the system is the RX front-end thermal
noise at 3–4 mVrms.

VII. MEASUREMENT RESULTS

The chip is fabricated in the TSMC 7-nm CMOS FinFET
process. The die micrograph presented in Fig. 22(a) shows
eight channels sharing a common bandgap bias generation and
LC PLL. Dual channels are stacked along the critical die edge
to improve the bandwidth density. With a 1.03-mm die edge and
aggregated data throughput of 896 Gb/s (8 × 112 Gb/s), a
shoreline bandwidth density of 870 Gb/s/mm is achieved. The
energy efficiency is 1.24 pJ/b when operating at 112 Gb/s.
The power breakdown in Fig. 22(b) shows RX, TX, and clocking
(amortized across eight channels) taking up 55%, 39%, and
6% of the total power, respectively.

Fig. 22. (a) Die micrograph. (b) Power breakdown.

Fig. 23 shows the testbench setup for two chips residing
within the same package with organic substrate. Eight pairs
of in-package high-speed traces with lengths varying from
10 to 40 mm are used to connect the two chips along the
beachfront. Four package layers are needed to route all eight
pairs of high-speed signals.

Fig. 23. Testbench setup.

Fig. 24. 28-GHz spectrum at TX output.

Fig. 25. TX eye diagram with PRBS7 pattern at 56-Gb/s NRZ.

Fig. 26. PRBS23 BER at 106.25 Gb/s (30 mm). Bathtub with PI code
sweep (top) and continuous-time BER (bottom).

Fig. 27. PRBS23 BER at 112 Gb/s (20 mm). Bathtub with PI code
sweep (top) and continuous-time BER (bottom).

To characterize the performance of the LC-PLL and TX,
one of the TXs is bonded out for jitter measurement. Since
the chip is designed for in-package traces, loss from the package
bump to the measurement scope is calibrated and de-embedded
from the measurement result. Random jitter, integrated from
10 kHz to 14 GHz, at TX output of 28-GHz clock pattern
is measured to be 114 fs with 4-MHz CDR filter applied,
as shown in Fig. 24. The eye diagram of TX with PRBS7
pattern at 56-Gb/s NRZ is shown in Fig. 25. The overall link
performance is measured with RX eye scan and on-chip BER
counter at 106.25 and 112 Gb/s over 10-, 20-, and 30-mm
channels using the PRBS23 pattern.
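
When a BER counter is used to support a claim such as BER <1E−12, a common rule of thumb is that N = −ln(1 − CL)/BER error-free bits are needed for confidence level CL, assuming independent bit errors. The sketch below applies that rule to the two line rates used here. The 95% confidence level and the function names are assumptions; the text does not state how long each measurement ran.

```python
import math


def bits_for_confidence(ber_target: float, confidence: float = 0.95) -> float:
    """Error-free bits needed to claim BER < ber_target at the given confidence."""
    return -math.log(1.0 - confidence) / ber_target


def seconds_for_confidence(ber_target: float, line_rate_bps: float,
                           confidence: float = 0.95) -> float:
    """Error-free observation time in seconds at a given line rate."""
    return bits_for_confidence(ber_target, confidence) / line_rate_bps


for rate_gbps in (106.25, 112.0):
    t = seconds_for_confidence(1e-12, rate_gbps * 1e9)
    print(f"{rate_gbps} Gb/s: about {t:.1f} s error-free for BER < 1E-12 (95%)")
```

At these line rates the required error-free window is on the order of tens of seconds, so a 1E−12 claim is practical to check with an on-chip counter.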
Fig. 28. PRBS23 BER at 112 Gb/s (30 mm). Bathtub with PI code
sweep (top) and continuous-time BER (bottom).

The links achieved BER <1E−12 for trace length ≤30 mm
at 106.25 Gb/s (see Fig. 26). At 112 Gb/s, BER is <1E−12
and 3E−9 for 20- and 30-mm channels, respectively, as shown
in Figs. 27 and 28. It is worthwhile to note that PI code sweep
BER (bathtub curve) is worse than continuous-time BER due
to worse jitter tracking without CDR running during PI code
sweep. RX slicer input eye scan for 20- and 30-mm channels
at 112 Gb/s is shown in Fig. 29. PAM4 eye is constructed
by sweeping PI code (x-axis) and error slicer DAC code
(y-axis). For each PI code, a probability density function
(PDF) can be reconstructed based on the probability of
TABLE I
COMPARISON WITH STATE OF THE ART

yield as process technology scaling slows down. Chiplets are
integrated in the same package to form a complex system.
High shoreline bandwidth density, energy efficiency, and low
latency are the critical parameters for in-package links in
order to keep the cost down and avoid thermal issues. In this
article, we demonstrated an octal of 112-Gb/s PAM-4 links
that achieved a shoreline bandwidth density of 870 Gb/s/mm
and an energy efficiency of 1.24 pJ/b. Pre-FEC BER is less
than 1E−12 for 20-mm channel and less than 3E−9 for 30-mm
channel at 112 Gb/s. Architecture and circuit choices are made
to enable low-voltage operation, small footprint, and minimum
metal utilization.
ACKNOWLEDGMENT
The authors would like to thank Santiago Asuncion for data
collection and the entire transceiver team who contributed to
this work.

R EFERENCES
[1] M. Thiara, “Die-to-die interconnects for chip disaggregation,” Semicond.
Eng., Tech. Rep., Nov. 2018.
[2] N. Tracy et al., “112 Gbps electrical interfaces—An OIF update on CEI-
112G,” presented at the OFC, 2020. [Online]. Available: https://www.
oiforum.com/wp-content/uploads/00311c-OIF-112G-OFC-
slides_ofc20_presentation.pdf
[3] J. Im et al., “A 40-to-56 Gb/s PAM-4 receiver with ten-tap direct
Fig. 29. Eye diagram at 112 Gb/s with 20- (top) and 30-mm (bottom) decision-feedback equalization in 16-nm FinFET,” IEEE J. Solid-State
channels. Eye-opening BER <1E−7 . Left: PAM-4 eye. Right: concatenation Circuits, vol. 52, no. 12, pp. 3486–3502, Dec. 2017.
of the three eyes. Y -axis: RX slicing-level DAC code (LSB = 1.5 mV). [4] P. Upadhyaya et al., “A fully adaptive 19–58-Gb/s PAM-4 and
9.5–29-Gb/s NRZ wireline transceiver with configurable ADC in 16-nm
ones. The eye diagram is formed by piecing the PDFs for all FinFET,” IEEE J. Solid-State Circuits, vol. 54, no. 1, pp. 18–28,
Jan. 2019.
PI codes together. [5] J. Im et al., “A 112 Gb/s PAM-4 long-reach wireline transceiver using a
Table I shows the comparison of this work to the state-of- 36-way time-interleaved SAR-ADC and inverter-based RX analog front-
the-art in-package links. Based on the author’s best knowl- end in 7 nm FinFET,” IEEE J. Solid-State Circuits, vol. 56, no. 1, pp7-18,
Jan. 2021.
edge, this work achieved the best shoreline bandwidth density [6] M. Erett et al., “A 126 mW 56 Gb/s NRZ wireline transceiver for syn-
(870 Gb/s/mm) and energy efficiency (1.24 pJ/b). chronous short-reach applications in 16 nm FinFET,” in IEEE Int. Solid-
State Circuits Conf. (ISSCC) Dig. Tech. Papers, Feb. 2018, pp. 274–276.
[7] K. Tan et al., “A 112-GB/S PAM4 transmitter in 16 nm FinFET,” in
VIII. C ONCLUSION Proc. IEEE Symp. VLSI Circuits, Jun. 2018, pp. 45–46.
Progress made in heterogenous integration in recent years [8] R. Yousry et al., “A 1.7 pJ/b 112 Gb/s XSR transceiver for intra-package
communication in 7 nm FinFET technology,” in IEEE Int. Solid-State
offers a path to drive down cost per function and improves Circuits Conf. (ISSCC) Dig. Tech. Papers, Feb. 2021, pp. 180–182.

Authorized licensed use limited to: University of Science & Technology of China. Downloaded on May 30,2022 at 01:26:27 UTC from IEEE Xplore. Restrictions apply.
POON et al.: 1.24-pJ/b 112-Gb/s (870 Gb/s/mm) TRANSCEIVER FOR IN-PACKAGE LINKS IN 7-nm FinFET 1209

[9] R. Shivnaraine et al., “A 26.5625-to-106.25 Gb/s XSR SerDes with 1.55 pJ/b efficiency in 7 nm CMOS,” in IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers, Feb. 2021, pp. 181–183.
[10] K. McCollough, S. D. Huss, J. Vandersand, R. Smith, C. Moscone, and Q. O. Farooq, “A 480 Gb/s/mm 1.7 pJ/b short-reach wireline transceiver using single-ended NRZ for die-to-die applications,” in IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers, Feb. 2021, pp. 1–3.
[11] M. Raj, S. Saeedi, and A. Emami, “A wideband injection locked quadrature clock generation and distribution technique for an energy-proportional 16–32 Gb/s optical receiver in 28 nm FDSOI CMOS,” IEEE J. Solid-State Circuits, vol. 51, no. 10, pp. 2446–2462, Oct. 2016.

Chi Fung Poon received the B.S. degree in electrical engineering from the University of California at Santa Barbara, Santa Barbara, CA, USA, in 2004, and the M.S. degree in electrical engineering from Stanford University, Stanford, CA, USA, in 2006.
He is currently a Senior Staff Design Engineer with the SerDes Technology Group, Xilinx, Inc., San Jose, CA, USA, where he works on the high-speed mixed-signal circuit. His current research interests include data converters, clocking, and analog front end for the serial link.

Wenfeng Zhang received the M.S. degree in electrical engineering from Oklahoma State University, Stillwater, OK, USA, in 1994.
From 1994 to 2004, he was a Circuit Designer with Datapath Systems, Los Gatos, CA, USA, and Micron Technology, San Jose, CA, USA. Since 2005, he has been with Xilinx, Inc., San Jose, where he is involved in various projects on DDR memory interfaces (DLL) and high-speed IOs. He is involved in high-speed low-power analog front end and custom digital blocks for transceivers.

Junho Cho received the B.S. and M.S. degrees in electronic engineering from Sogang University, Seoul, South Korea, in 1994 and 1996, respectively.
Since 1999, he was a Design Engineer with the Memory Division, Samsung Electronics, Suwon, South Korea. From 2002 to 2004, he was an Associate Principal Engineer with FTD Solutions, Singapore, where he was involved in analog and mixed-signal IP development. In 2004, he joined AMD Inc., Markham, ON, Canada, where he was in charge of display SerDes. From 2011 to 2014, he was a Staff Engineer with AMD Inc., Sunnyvale, CA, USA, where he worked on high-speed SerDes. From 2014 to 2021, he was a Senior Staff Engineer with Xilinx, Inc., San Jose, CA, USA, where he worked on high-speed SerDes and data converters. He holds more than ten U.S. patents.
Mr. Cho was a member of the Korea International Cooperation Agency, Ministry of Foreign Affairs of Korean Government, from 1996 to 1998.

Shaojun Ma received the M.S. degree in electrical engineering from New York University, New York, NY, USA, in 2010.
Since 2010, he has been a Staff Design Circuit Design Engineer with Xilinx, San Jose, CA, USA. His current research interests include circuit design in leading-edge CMOS technology, including high-speed IO circuits, clocking circuits, quadrature phase-locked loop (QLL), and analog/mixed-signal circuit design.

Yipeng Wang received the B.S. degree in electrical engineering from Xiamen University, Xiamen, China, in 2010, the M.S. degree in electronics and communication engineering from the University of California at Santa Barbara, Santa Barbara, CA, USA, in 2012, and the Ph.D. degree in electronics and communication engineering from The Hong Kong University of Science and Technology, Hong Kong, in 2016.
He is currently a Staff Mixed-Signal Design Engineer with Xilinx, Inc., Singapore, where he is involved in high-speed wireline transceiver design.

Ying Cao received the B.S. degree in physics from Jilin University, Changchun, China, in 1998, and the M.S. degree in physics from California State University, Long Beach, CA, USA, in 2000.
In 2009, she joined the SerDes Technology Group, Xilinx, Inc., San Jose, CA, USA, where she has been working on high-speed SerDes transceiver circuits. Her current interests include high-speed clocking, transceiver front-end circuits, clock and data recovery (CDR), and phase-locked loops (PLLs).

Asma Laraba received the M.Sc. degree in microelectronics and nanoelectronics from Joseph Fourier University, Grenoble, France, in 2010, and the Ph.D. degree in electrical engineering from the Grenoble Institute of Technology, Grenoble, in 2013. Her Ph.D. thesis research was conducted at the TIMA Laboratory, Grenoble, and focused on DFT and BIST of analog-to-digital converters.
Since joining Xilinx in 2014, she worked on data converter DFT, RFSoC analog-to-digital converter (ADC) design, and various high-speed SerDes projects. Her current research interests include data converters and the analog frontend of wireline links.
Dr. Laraba was a recipient of the 2012 European Test Symposium Best Paper Award.

Eugene Ho received the B.S. degree in electrical engineering and the B.S. degree in computer science from the University of Southern California, Los Angeles, CA, USA, in 2001, and the M.S. degree in electrical engineering from Stanford University, Stanford, CA, USA, in 2003.
From 2003 to 2018, he was with Rambus Inc., Sunnyvale, CA, USA, working on high-speed I/O design and microarchitectures of high-speed logic buses for consumer gaming systems, as well as DDRx memory interface PHYs and buffer chips. He joined Xilinx, Inc., San Jose, CA, USA, in 2019, working on SerDes architectures. Since 2021, he has been working with Nubis Communications Inc., New Providence, NJ, USA. His current research interests include mixed analog-to-digital circuit design and low-power memory and SerDes architectures.

Winson Lin received the B.S. and M.Eng. degrees in electrical and computer engineering from Cornell University, Ithaca, NY, USA, in 2006 and 2007, respectively.
Since 2007, he has been a Design Engineer with Xilinx, Inc., San Jose, CA, USA, where he was involved in the design and development of digital circuits for high-performance transceivers, such as clock data recovery, delta–sigma modulation, equalization, and calibration loops. He is involved in both the front-end RTL design and back-end SAPR implementation of these high-speed digital circuits, and behavioral model development of analog circuits for system-level verification.

Daniel Zhaoyin Wu received the B.S. degree in electro-physics from National Chiao Tung University, Hsinchu, Taiwan, in 1991, and the M.S. degree in electrical engineering from National Taiwan University, Taipei, Taiwan, in 1993.
He joined Xilinx, Inc., San Jose, CA, USA, in 2010, and is currently a Principal Engineer at the Wire Engineering Group, working on architecture development and system-level modeling of high-speed SerDes and silicon photonics. Prior to Xilinx, he worked for a few companies, including Ansoft Corporation, Altrabroadband, and ITRI/CCL, designing the built-in passive components for RF front end of cell phone and analog front end of high-speed SerDes.

Kee Hian Tan received the B.S. and M.Eng. degrees in electrical engineering from the National University of Singapore, Singapore, in 2000 and 2001, respectively.
He joined Marvel Asia, Singapore, in 2001, and LSI, San Jose, CA, USA, in 2008, both working on hard disk preamplifier design. Since 2012, he has been with Xilinx, Inc., Singapore, leading a team of circuit designers developing analog front-end and high-speed digital circuit blocks for SerDes PHY.

Parag Upadhyaya (Member, IEEE) received the B.S., M.S., and Ph.D. degrees in electrical engineering from Washington State University, Pullman, WA, USA, in 2000, 2005, and 2008, respectively.
From 2001 to 2003, he was with Cypress Semiconductor, Austin, TX, USA, working on the development of high-speed wireline and optical transceivers. Since 2008, he has been with Xilinx, Inc., San Jose, CA, USA, where he is currently the Director of Engineering with the Wired and Wireless Group, leading the development of high-speed transceivers for field-programmable gate array applications. He has authored or coauthored over 72 journal, conference, and book chapter publications. He holds more than 47 U.S. patents. His research interests include high-speed mixed-signal circuits for wireline, wireless, and optical transceivers; high-speed data converters; and phase-locked loops.

Yohan Frans (Member, IEEE) received the B.S. degree in electrical engineering from the Bandung Institute of Technology, Bandung, Indonesia, in 1995, and the M.S. degree in electrical engineering from Stanford University, Stanford, CA, USA, in 2001.
From 2001 to 2012, he was a Circuit Design Engineer, a Circuit Architect, and a Design Manager with Rambus Inc., Sunnyvale, CA, USA, where he was involved in high-performance and low-power serial links and memory interfaces. Since 2012, he has been with Xilinx, Inc., San Jose, CA, USA. He is currently leading design teams as the Vice President of the Xilinx Wired Engineering Group, San Jose, where he is involved in the development of high-speed wireline transceivers for advanced field-programmable gate arrays. His current research interests include high-speed mixed-signal circuit design, serial link architecture, transmitter/receiver design, phase-locked loops/DLL, memory interfaces, and low-power circuit architectures.
