Coarse-Grained Online Monitoring of BTI Aging by Reusing Power-Gating Infrastructure

In this paper, we present a novel coarse-grained technique for online monitoring of the bias
temperature instability (BTI) aging of circuits by exploiting their power gating
infrastructure. The proposed technique relies on monitoring the discharge time of the
virtual-power-network during standby operations, the value of which depends on the
threshold voltage of the CMOS devices in a power-gated design (PGD). It does not
require any distributed sensors, because the virtual-power-network is already distributed
in a PGD. It consists of a hardware block for measuring the discharge time concurrently
with normal standby operations and a processing block for estimating the BTI aging
status of the PGD according to collected measurements. Through SPICE simulation, we
demonstrate that the BTI aging estimation error of the proposed technique is less than
1% and 6.2% for PGDs with static operating frequency and dynamic voltage and
frequency scaling, respectively. Its area cost is also found to be negligible. The power gating
minimum idle time (MIT) cost induced by the energy consumed for monitoring the
discharge time is evaluated on two scalar machine models using either x86 or ARM
instruction sets. It is found to be less than 1.3x and 1.45x the original power gating MIT,
respectively. We validate the proposed technique through accelerated aging
experiments conducted with five actual chips that contain an ARM Cortex-M0 processor,
manufactured with a 65 nm CMOS technology.

A Fully Integrated Discrete-Time Superheterodyne

The zero/low intermediate frequency (IF) receiver (RX) architecture has enabled full
CMOS integration. As the technology scales and wireless standards become ever more
challenging, the issues related to time-varying dc offsets, the second-order nonlinearity,
and flicker noise become more critical. In this paper, we propose a new architecture of a
superheterodyne RX that attempts to avoid such issues. By exploiting discrete-time (DT)
operation and using only switches, capacitors, and inverter-based gm-stages as building
blocks, the architecture becomes amenable to further scaling. Full integration is
achieved by employing a cascade of four complex-valued passive switched-cap-based
bandpass filters sampled at 4x the local oscillator rate that perform IF image rejection.
Channel selection is achieved through an equivalent of the seventh-order filtering. A new
twofold noise-canceling low-noise transconductance amplifier is proposed. Frequency
domain analysis of the RX is presented by the proposed DT model. The RX is wideband
and covers 0.4-2.9 GHz with a noise figure of 2.9-4 dB. It is implemented in 65-nm
CMOS and consumes 48-79 mW.

Defect- and Variation-Tolerant Logic Mapping in Nanocrossbar Using Bipartite Matching and Memetic

High defect density and extreme parameter variation make it very difficult to implement
reliable logic functions in crossbar-based nanoarchitectures. It is a major design
challenge to tolerate defects and variations simultaneously for such architectures. In this
paper, a method based on a bipartite matching and memetic algorithm is proposed for
defect- and variation-tolerant logic mapping (D/VTLM) problem in crossbar-based
nanoarchitectures. In the proposed method, the search space of the D/VTLM problem
can be dramatically reduced through the introduction of the min-max weight
maximum-bipartite-matching (MMW-MBM) and a related heuristic bipartite matching method.
MMW-MBM is defined on a weighted bipartite graph as an MBM, where the maximal
weight of the edges in the matching has a minimal value. In addition, a defect- and
variation-aware local search (D/VALS) operator is proposed for D/VTLM and embedded
in a global search framework. The D/VALS operator is able to utilize the domain
knowledge extracted from problem instances and, thus, has the potential to search the
solution space more efficiently. Compared with the state-of-the-art heuristic and
recursive algorithms, and a simulated annealing algorithm, the good performance of our
proposed method is verified on a 3-bit adder and a large set of random benchmarks of
various scales.
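
The MMW-MBM definition above (a maximum bipartite matching whose largest edge weight is as small as possible) can be illustrated with a minimal sketch. It assumes a small weighted bipartite graph given as a matrix in which None marks a missing edge. The function names, the matrix encoding, and the combination of a threshold search with Kuhn's augmenting-path matching are illustrative choices, not the method of the paper, which pairs MMW-MBM with a heuristic matching method and a memetic search.

```python
from typing import List, Optional, Tuple


def max_matching(adj: List[List[int]], n_right: int) -> List[int]:
    """Kuhn's augmenting-path algorithm; returns the right partner of each left node (-1 if none)."""
    match_right = [-1] * n_right

    def try_augment(u: int, seen: List[bool]) -> bool:
        for v in adj[u]:
            if not seen[v]:
                seen[v] = True
                if match_right[v] == -1 or try_augment(match_right[v], seen):
                    match_right[v] = u
                    return True
        return False

    for u in range(len(adj)):
        try_augment(u, [False] * n_right)

    match_left = [-1] * len(adj)
    for v, u in enumerate(match_right):
        if u != -1:
            match_left[u] = v
    return match_left


def mmw_mbm(weights: List[List[Optional[float]]]) -> Tuple[Optional[float], List[int]]:
    """Return (smallest achievable maximal edge weight, matching) over all maximum matchings."""
    n_left = len(weights)
    n_right = len(weights[0]) if weights else 0

    def matching_up_to(limit: float) -> List[int]:
        # Keep only edges whose weight does not exceed the current limit.
        adj = [[v for v, w in enumerate(row) if w is not None and w <= limit]
               for row in weights]
        return max_matching(adj, n_right)

    def size(match: List[int]) -> int:
        return sum(1 for v in match if v != -1)

    levels = sorted({w for row in weights for w in row if w is not None})
    if not levels:
        return None, [-1] * n_left

    target = size(matching_up_to(levels[-1]))  # cardinality of a maximum matching
    lo, hi = 0, len(levels) - 1
    while lo < hi:  # binary search for the lowest limit that keeps the cardinality
        mid = (lo + hi) // 2
        if size(matching_up_to(levels[mid])) == target:
            hi = mid
        else:
            lo = mid + 1
    return levels[lo], matching_up_to(levels[lo])


# Hypothetical 3x3 instance: rows are left nodes, columns are right nodes.
example_weights = [[3, 1, None],
                   [2, None, 5],
                   [None, 4, 2]]
bottleneck, matching = mmw_mbm(example_weights)
print(bottleneck, matching)
```

In this hypothetical instance two perfect matchings exist, with maximal edge weights 5 and 2, and the sketch selects the one whose maximal weight is 2.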

Streaming Elements for FPGA Signal and Image Processing Accelerators
Field-programmable gate array (FPGA) devices boast abundant resources with which
custom accelerator components for signal, image, and data processing may be realized;
however, realizing high-performance, low-cost accelerators currently demands manual
register transfer level design. Software-programmable soft processors have been
proposed as a way to reduce this design burden, but they are unable to support
performance and cost comparable to custom circuits. This paper proposes a new soft
processing approach for FPGA that promises to overcome this barrier. A
high-performance, fine-grained streaming processor, known as a streaming accelerator
element, is proposed, which realizes accelerators as large-scale custom multicore
networks. By adopting a streaming execution approach with advanced program control
and memory addressing capabilities, typical program inefficiencies can be almost
completely eliminated, enabling performance and cost that are unprecedented among
software-programmable solutions. When used to realize accelerators for fast Fourier
transform, motion estimation, matrix multiplication, and Sobel edge detection, the
proposed architecture is shown to enable real-time performance, with performance and
cost comparable with hand-crafted custom circuit accelerators and up to two orders of
magnitude beyond existing soft processors.

Toward Solving Multichannel RF-SoC Integration

Issues Through Digital Fractional Division
In modern RF system on chips (SoCs), the digital content consumes up to 85% of the IC
chip area. The recent push to integrate multiple RF-SoC cores is met with heavy
resistance by the remaining RF/analog circuitry, which creates numerous strong
aggressors and weak victims leading to RF performance degradation. One key such
mechanism is injection pulling through parasitic coupling between various LC-tank
oscillators as well as between them and strong transmitter (TX) outputs. Any static or
dynamic frequency proximity between aggressors (i.e., oscillators and TX outputs) and
victims (i.e., oscillators) that share the same die causes injection pulling, which produces
unwanted spurs and/or modulation distortion. In this paper, we propose and demonstrate
a new frequency planning technique for a multicore TX in which each LC-tank oscillator is
separated from other aggressors beyond its pulling range. This is done by breaking the
integer harmonic frequency relationship of victims/aggressors within and between the
RF transmission channels using a digital fractional divider based on phase rotation.
The oscillators' center frequencies can be fractionally separated by ~28% while, at the
same time, both produce closely spaced frequencies at the phase rotator outputs. The
injection-pulling spurs are so far away that they are insignificantly small (-80 dBc) and
coincide with the second harmonic of the carrier. This method is experimentally verified
in a two-channel system in 65-nm digital CMOS, each channel comprising a high-swing
class-C oscillator, frequency divider, and phase rotator.
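
The frequency-planning idea above can be made concrete with a short arithmetic sketch: two oscillators run far apart, yet fractional division brings their outputs close together. The divide ratios (1.75 and 2.25) and the channel frequencies below are hypothetical values, not taken from the paper. They are chosen only so that the oscillator separation lands near the quoted ~28%.

```python
def oscillator_frequency(f_out_hz: float, divide_ratio: float) -> float:
    """Oscillator frequency needed to obtain f_out_hz after fractional division."""
    return f_out_hz * divide_ratio


def fractional_separation(f_a_hz: float, f_b_hz: float) -> float:
    """Relative spacing between two frequencies, referred to the lower one."""
    return abs(f_a_hz - f_b_hz) / min(f_a_hz, f_b_hz)


# Hypothetical channel plan (assumed values for illustration only).
F_OUT_A_HZ = 2.400e9   # channel A carrier at the phase rotator output
F_OUT_B_HZ = 2.410e9   # channel B carrier, 10 MHz away
RATIO_A = 1.75         # assumed fractional divide ratio for channel A
RATIO_B = 2.25         # assumed fractional divide ratio for channel B

f_osc_a = oscillator_frequency(F_OUT_A_HZ, RATIO_A)
f_osc_b = oscillator_frequency(F_OUT_B_HZ, RATIO_B)

osc_separation = fractional_separation(f_osc_a, f_osc_b)
out_separation = fractional_separation(F_OUT_A_HZ, F_OUT_B_HZ)

print(f"oscillators: {f_osc_a / 1e9:.4f} GHz and {f_osc_b / 1e9:.4f} GHz")
print(f"oscillator separation: {osc_separation:.1%}")
print(f"output separation:     {out_separation:.2%}")
```

Under these assumed ratios, the two carriers differ by well under 1% while the oscillators that generate them sit more than a quarter of their frequency apart, which is the property the abstract relies on to keep each oscillator outside the others' pulling range.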

A CMOS PWM Transceiver Using Self-Referenced Edge Detection
A CMOS pulsewidth modulation (PWM) transceiver circuit that exploits the self-referenced
edge detection technique is presented. By comparing the rising edge that is
self-delayed by about 0.5T and the modulated falling edge in one carrier clock cycle,
area-efficient and high-robustness (against timing fluctuations) edge detection enabling
PWM communication is achieved without requiring elaborate phase-locked loops. Since
the proposed self-referenced edge detection circuit has the capability of timing error
measurement while changing the length of the self-delay element, adaptive data-rate
optimization and delay-line calibration are realized. The measured results with a 65-nm
CMOS prototype demonstrate a 2-bit PWM communication, high data rate (3.2 Gb/s),
and high reliability (BER < 10^-12) with small area occupation (540 µm²). For reliability
improvement, error check and correction associated with intercycle edge detection is
introduced and its effectiveness is verified by 1-bit PWM measurement.
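
The self-referenced comparison described above can be sketched as a behavioral model for the 1-bit case: each cycle's rising edge, delayed by about half a period, serves as the timing reference for that cycle's falling edge. This is a minimal sketch under assumptions. The duty cycles (25% and 75%), the normalized period, the jitter model, and all function names are hypothetical and do not describe the circuit in the paper.

```python
import random
from typing import List, Tuple

Edges = List[Tuple[float, float]]  # (rising edge time, falling edge time) per cycle


def encode_1bit(bits: List[int], period: float = 1.0,
                duty_zero: float = 0.25, duty_one: float = 0.75) -> Edges:
    """Map each bit to one carrier cycle whose pulse width encodes the bit."""
    edges = []
    for i, bit in enumerate(bits):
        rise = i * period
        fall = rise + (duty_one if bit else duty_zero) * period
        edges.append((rise, fall))
    return edges


def add_jitter(edges: Edges, max_jitter: float, seed: int = 0) -> Edges:
    """Perturb every edge independently by a bounded timing error."""
    rng = random.Random(seed)
    return [(rise + rng.uniform(-max_jitter, max_jitter),
             fall + rng.uniform(-max_jitter, max_jitter))
            for rise, fall in edges]


def decode_1bit(edges: Edges, period: float = 1.0, self_delay: float = 0.5) -> List[int]:
    """Compare each falling edge with the same cycle's rising edge delayed by self_delay * period."""
    bits = []
    for rise, fall in edges:
        reference = rise + self_delay * period  # self-delayed rising edge
        bits.append(1 if fall > reference else 0)
    return bits


sent = [1, 0, 0, 1, 1, 0, 1]
received = decode_1bit(add_jitter(encode_1bit(sent), max_jitter=0.1))
print(received == sent)
```

Because the reference is derived from the same cycle's rising edge rather than from a recovered clock, a timing shift common to both edges of a cycle cancels in the comparison. In this model the decision fails only if the independent edge errors together exceed the 0.25-period margin.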

An I/O Efficient Model Checking Algorithm for Large-Scale Systems

Model checking is a powerful approach for the formal verification of hardware and
software systems. However, this approach suffers from the state space explosion
problem, which limits its application to large-scale systems due to space shortage. To
overcome this drawback, one of the most effective solutions is to use external memory
algorithms. In this paper, we propose an I/O efficient model checking algorithm for
large-scale systems. To lower I/O complexity and improve time efficiency, we combine three
new techniques: 1) a linear hash-sorting technique; 2) a cached duplicate detection
technique; and 3) a dynamic path management technique. We show that the new
algorithm has a lower I/O complexity than state-of-the-art I/O efficient model checking
algorithms, including detect accepting cycle, maximal accepting predecessors, and
iterative-deepening depth-first search. In addition, the experiments show that our
algorithm clearly outperforms these three algorithms on the selected representative
benchmarks in terms of performance.

Built-in Self-Calibration and Digital-Trim Technique for 14-Bit SAR ADCs Achieving 1 LSB INL
Several state-of-the-art monitoring and control systems, such as dc motor controllers,
power line monitoring and protection systems, instrumentation systems, and battery
monitors, require direct digitization of high-voltage (HV) input signals. Analog-to-digital
converters (ADCs) that can digitize HV signals require high linearity and low-voltage
coefficient capacitors. A built-in self-calibration and digital-trim algorithm correcting static
mismatches in capacitive digital-to-analog converter (DAC) used in successive
approximation register analog-to-digital converters (SAR ADCs) is proposed. The
algorithm uses a dynamic error correction (DEC) capacitor to cancel the static errors
occurring in each capacitor of the array as the first step upon power-up and eliminates
the need for an extra calibration DAC. Self-trimming is performed digitally during normal
ADC operation. The algorithm is implemented on a 14-bit HV input range SAR ADC with
integrated DEC capacitors. The IC is fabricated in a 0.6-µm HV-compliant CMOS process,
accepting up to 24-Vpp differential input signals. The proposed approach achieves 73.32-dB
signal-to-noise and distortion ratio, which is an improvement of 12.03 dB after
self-calibration, at 400-kS/s sampling rate, consuming 90 mW from a 15-V supply. The
calibration circuitry occupies 28% of the capacitor DAC and consumes <15 mW during
operation. Measurement results show that this algorithm reduces integral nonlinearity
from as high as 7 LSBs down to 1 LSB, and it works even in the presence of larger
mismatches exceeding 260 LSBs. Similarly, it reduces differential nonlinearity errors
from 10 LSBs down to 1 LSB. The ADC occupies an active area of 9.76 mm².
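
To put the reported SNDR figures in terms of resolution, the standard relation ENOB = (SNDR - 1.76 dB) / 6.02 dB for a full-scale sine input can be applied. This is a worked conversion only. The abstract does not report ENOB, and the pre-calibration SNDR below is inferred as 73.32 dB minus the stated 12.03-dB improvement.

```python
def enob(sndr_db: float) -> float:
    """Effective number of bits from SNDR, assuming a full-scale sine-wave input."""
    return (sndr_db - 1.76) / 6.02


SNDR_AFTER_DB = 73.32        # reported SNDR after self-calibration
IMPROVEMENT_DB = 12.03       # reported improvement due to self-calibration
SNDR_BEFORE_DB = SNDR_AFTER_DB - IMPROVEMENT_DB  # inferred, not stated directly

enob_after = enob(SNDR_AFTER_DB)
enob_before = enob(SNDR_BEFORE_DB)
enob_gain = enob_after - enob_before

print(f"ENOB before calibration: {enob_before:.2f} bits")
print(f"ENOB after calibration:  {enob_after:.2f} bits")
print(f"Gain from calibration:   {enob_gain:.2f} bits")
```

Under this relation, the 12.03-dB improvement corresponds to roughly two additional effective bits.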

Design of Self-Timed Reconfigurable Controllers for Parallel Synchronization via Wagging
Synchronization is an important issue in modern system design as systems-on-chips
integrate more diverse technologies, operating voltages, and clock frequencies on a
single substrate. This paper presents a methodology for the design and implementation
of a self-timed reconfigurable control device suitable for a parallel cascaded flip-flop
synchronizer based on a principle known as wagging, through the application of
distributed feedback graphs. By modifying the endpoint adjacency of a common
behavior graph via one-hot codes, several configurable modes can be implemented in a
single design specification, thereby facilitating direct control over the synchronization
time and the mean-time between failures of the parallel master-slave latches in the
synchronizer. Therefore, the resulting implementation is resistant to process
nonidealities, which are present in physical design layouts. This paper includes a
discussion of the reconfiguration protocol and implementations of both a sequential
token ring control device and an interrupt subsystem necessary for reconfiguration, all
simulated in UMC 90-nm technology. The interrupt subsystem demonstrates operating
frequencies between 505 and 818 MHz per module, with average power consumptions
between 70.7 and 90.0 µW in the typical-typical case under a corner analysis.