
A 0.78-μW 96-Ch. Deep Sub-Vt Neural Spike Processor Integrated with a Nanowatt Power Management Unit

Jiangyi Li1, Pavan K. Chundi1, Sung Kim1, Zhewei Jiang1, Minhao Yang1, Joonseong Kang2, Seungchul Jung2, Sang Joon Kim2, Mingoo Seok1
1 Department of Electrical Engineering, Columbia University, New York, NY, USA
2 Samsung Electronics, Suwon, Korea
Abstract—We present a sub-μW Neural Spike Processor integrated with a Power Management Unit (PMU) for on-implant processing in motor intention decoding, demonstrating: (i) one of the highest levels of integration, including spike detection, feature extraction, sorting, and the first half of decoding, which reduces the wireless data rate by more than 4 orders of magnitude; (ii) on-chip PMU integration, enabling the system to be directly powered by harvesters; (iii) the lowest power dissipation of 0.78μW for 96 channels, 21× lower than the prior art at comparable or better accuracy.

Keywords—neural spike processing; motor intention decoding; power management; in-situ error detection and correction; switched-capacitor DC-DC converter

I. INTRODUCTION

Recent advances in neuroscience and integrated circuits enable new possibilities for long-term Brain-Computer-Interface (BCI) implants [1]. Such implants have stringent power budgets, as they are preferably powered by wireless energy harvesting. Since wireless communication often dominates system power, on-implant processing that reduces the wireless data rate is highly desirable. In this paper, a Neural Spike Processor (NSP) is designed for motor intention decoding, which decodes neural spike information into the direction and velocity of intended muscle movement, enabling rehabilitative services for patients. By integrating spike detection, feature extraction, sorting, and the first half of decoding onto the implant, it reduces the wireless data rate by more than four orders of magnitude (Fig. 1).

[Fig. 1: signal chain. On-implant processing (Neural Spike Processor): Brain, AFE, ADC, Detection & Feature Ext., Sorting, Ensemble Averaging. On-prosthetics processing: Kinematic State Prediction, Prosthetics. Annotated data rates along the chain: 240kbps/ch. (8bit@30kHz), 1.6kbps/ch., 0.7kbps/ch., 320bps/96ch., 320bps/96ch. Other annotations: ~100 spikes/ch./s; 100ms motor decoding rate. Assuming 20 channels used in motor decoding, data rate reduction: 240kbps×20/320bps=15,000×.]
Fig. 1 The proposed NSP with on-implant processing capability reduces wireless data rate by more than 4 orders of magnitude.

On the other hand, thanks to ultra-low-power circuit techniques, the power consumption of the NSP scales down to the sub-μW range. This imposes new challenges on Power Management Unit (PMU) design, including compact form factor, ultra-low quiescent power, wide input voltage range, and low input voltage support for harvesters, among others.

Moreover, as sub-threshold (Sub-Vt) operation gains popularity for its significantly improved system energy efficiency, it introduces new challenges. Due to the impact of Process, Voltage, and Temperature (PVT) variations and other fast-varying variations (voltage droop, coupling noise, etc.), robust sub-threshold operation requires a prohibitive voltage margin. Thus, adaptive circuit techniques such as in-situ timing-error detection and correction (EDAC) [9] and timing-error prediction [7], together with dynamic voltage scaling (DVS) capabilities, are desirable in the PMU design, such that the supply voltage (VDD) can be tuned dynamically to minimize the voltage margin while guaranteeing robust operation.

[Fig. 2: block diagrams. Labels in (a): VREF generator, controller, error monitor, LDO/Buck, core w/ EDAC, core CLK. Labels in (b): hybrid EDAC controller, SC CLK, SC DC-DC, TRC, UV/OV, error, core w/ EDAC, core CLK.]
Fig. 2 The architecture of (a) existing and (b) proposed adaptive DVS systems.

Previous adaptive DVS architectures (Fig. 2(a)) either employ an off-chip regulator [6], which results in slow transient response, increased system form factor, and extra power consumption from off-chip circuits, or utilize an on-chip LDO [10], which has substantial power conversion loss when the difference between the input and output of the LDO becomes large. These factors prohibit their use in a sub-μW, low-voltage system. Moreover, they require analog/mixed-signal circuits such as voltage references and comparators, which further increase quiescent power and degrade power efficiency.

Direct error regulation with a Switched-Capacitor DC-DC converter (SC-DC) is proposed in [5] to enable a fully-integrated, fully-digital PMU design. The timing errors directly trigger switching of the SC-DC for fast droop response, while error statistics are used to tune VDD by changing the SC-DC topology. Yet, the loss of regulation due to inactivity or non-critical-path execution can cause critical failures. Also, the SC-DC is designed to switch at a fixed frequency, leading to efficiency degradation when the load power varies.

This paper presents a PMU-Load co-design that tackles the aforementioned challenges. The proposed PMU is based on a fully-integrated SC-DC and is integrated with the NSP. With various algorithm and circuit design optimizations, the NSP marks a record power efficiency of 0.61μW for a 96-channel system (6.35nW/Ch.), 27× better than the prior art [3].

We propose a hybrid error/replica-based control scheme (Fig. 2(b)) to remove the voltage reference and comparator, making the control scheme fully digital and able to operate under low input voltage (VIN). This allows the System-on-Chip (SoC) to be directly powered by harvesters and avoids efficiency degradation in conversions to/from batteries (~4V) [4]. Moreover, a Tunable Replica Circuit (TRC) is added to assist the error regulation and prevent loss of regulation. We also propose an automatic energy-robustness co-optimization method that sets the optimal conversion ratio (CR) and switching frequency (fSC) for the SC-DC. With the PMU achieving a PCE of 77.7% at VIN of 0.6V (72.2% at 1V), the NSP-PMU SoC consumes 0.78μW (0.84μW) at the margin-free operating point, marking a record-high power efficiency of 8.1nW/Ch. for the 96-channel system, a 21× reduction from the prior art [3].
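The headline ratios quoted above can be re-derived from numbers that appear in Fig. 1, this section, and Table I. The following is a minimal arithmetic sketch, not part of the original design flow; the function and variable names are illustrative only.

```python
# Re-derive the headline ratios from figures quoted in the paper.
# All names below are illustrative; only the numeric inputs come from the text.

RAW_RATE_BPS_PER_CH = 8 * 30_000   # 8 bit @ 30 kHz ADC output = 240 kbps per channel (Fig. 1)
DECODED_RATE_BPS = 320             # rate leaving the implant for the array (Fig. 1)
CHANNELS_USED = 20                 # channels assumed for motor decoding (Fig. 1)
NUM_CHANNELS = 96                  # channels on the chip
PRIOR_ART_NW_PER_CH = 175.0        # core power per channel of [3] (Table I)


def data_rate_reduction(raw_bps_per_ch, channels, out_bps):
    """Ratio of the raw multi-channel data rate to the transmitted rate."""
    return raw_bps_per_ch * channels / out_bps


def power_per_channel_nw(total_uw, channels):
    """Convert a total power in microwatts to nanowatts per channel."""
    return total_uw * 1000.0 / channels


reduction = data_rate_reduction(RAW_RATE_BPS_PER_CH, CHANNELS_USED, DECODED_RATE_BPS)
nsp_nw = power_per_channel_nw(0.61, NUM_CHANNELS)   # NSP core only
soc_nw = power_per_channel_nw(0.78, NUM_CHANNELS)   # NSP + PMU at VIN = 0.6 V

print(f"data-rate reduction: {reduction:.0f}x")
print(f"NSP: {nsp_nw:.2f} nW/ch, {PRIOR_ART_NW_PER_CH / nsp_nw:.1f}x below [3]")
print(f"SoC: {soc_nw:.2f} nW/ch, {PRIOR_ART_NW_PER_CH / soc_nw:.1f}x below [3]")
```

The 27× figure thus refers to the NSP core alone, while the 21× figure includes the PMU conversion loss.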

978-1-5386-5404-0/18/$31.00 ©2018 IEEE

Authorized licensed use limited to: University College London. Downloaded on July 14,2021 at 10:29:06 UTC from IEEE Xplore. Restrictions apply.
II. NSP ARCHITECTURE

Fig. 3 shows the architecture of the NSP-PMU SoC. The NSP starts with 96 sets of threshold-crossing spike detectors and feature extractors, whose outputs feed three sorters via three 32-to-1 priority-encoded MUXs. The three sorter outputs then merge into the decoder via a queue. The sorters adopt a 1.5D Bayesian boundary sorting algorithm, where the decision is made based on partitions defined by orthogonal boundaries in the 2D space of features (max and min values of a spike waveform) (Fig. 4). This algorithm requires 10-100× less computation yet achieves comparable or better accuracy than the conventional distance-based algorithms for a range of datasets [1].

[Fig. 3: block diagram. SC-DC: six cascaded 2:1 SC stages (1x, 2x, 4x, 8x, 16x, 32x) with intermediate nodes VM1-VM5; VDD=Vin×CR/64; each 2:1 SC: Vout=Vin/2 or (Vin+VM)/2 or VM/2. PMU: 63-ratio SC DC-DC, controller, TRC, EDAC control, body-driving unit, body-swap demux; VIN (0.6-1V), VDD (~0.32V); signals CLKCNTL, CLKCore, UV, OV, Error. NSP: three 32-ch detector and feature extractor groups (Ch1-32, Ch33-64, Ch65-96), MUX, sorter, queue, decoder, accumulator; memories for features, 1.5D boundaries, and weights (128e×16b, 32e×57b, 32e×18b). EDL: Error Detection Latch.]
Fig. 3 The architecture of the proposed NSP-PMU SoC.

[Fig. 4: three scatter diagrams over Feature 1 and Feature 2, each partitioned into Class 1 to Class 4: Distance-Based, 2D Boundaries, 1.5D Boundaries.]
Fig. 4 The proposed 1.5D-boundary-based sorting.

For decoding, we adopt the Ensemble Observation Kalman Filter (EOKF) algorithm [1], which uses regressed spiking rates as the state observation in a Kalman Filter (KF) and requires about 400× less computation than the standard KF (Fig. 5). For the DREAM dataset, the EOKF achieves better accuracy than the standard KF, especially in the higher velocity regime (Fig. 6) [1]. We map these algorithms to deep sub-Vt circuits in a 0.18μm technology, which is carefully selected for low leakage, as the performance requirement for the NSP is relatively low.

# of Calculation (Mult./Add/Div.)
KF Operations             Standard KF          EOKF
A Priori Estimate         4/2/~                4/2/~
Posterior Estimate        80/80/~              46/46/~
Kalman Gain               32180/32060/1180     10/9/4
Post. Error Cov. Matrix   96/92/~              16/16/~
Total                     32360/32234/1180     76/73/4 (~400x less)
Fig. 5 EOKF reduces computational complexity by 400×.

[Fig. 6: two plots of decoding error (0.00 to 0.12) versus velocity (0.0 to 0.4 m/s): Error - Standard KF and Error - EOKF.]
Fig. 6 Performance comparison between standard KF and EOKF.

III. PMU AND NSP CO-DESIGN

The PMU and NSP are co-designed to support the following features: (i) modulating the VDD of the NSP across PVT variations to remove the safety margin that is prohibitive for sub-Vt circuits, and (ii) optimizing the Power Conversion Efficiency (PCE) of the PMU by automatically finding an optimal configuration for the SC-DC.

To enable the first feature, we propose hybrid error/replica-based regulation. This scheme uses EDAC [5] embedded in the NSP, which directly regulates VDD by controlling a 63-ratio configurable SC-DC. We added the EDAC capability to the NSP by leveraging recent advances [5]. First, following the sparse insertion scheme, we inserted error detection latches (EDL) only between the sorters and the queue (Fig. 3; highlighted in red). Second, we employed body-swap-based error correction in the weight memory. To reduce the power overhead of body swapping, we designed it to swap only the bodies of the one memory bank (out of 16) that is being accessed.

[Fig. 7: waveforms of CLKCore, Error, CLKSC, and fSC. Annotations: 3 errors observed; counting errors for 16 cycles; no error for 16 cycles; fSC: initial value, increase by 1 step, not changing, decrease by 1 step; single-cycle fSC boost. Example error regulation: TER = [1/16, 2/16] in the counting window of 16 cycles (small window chosen for showcase).]
Fig. 7 Showcase waveforms of the error-based regulation process.

Fig. 7 showcases the process of the error/replica-based regulation. Upon each error detection, the PMU controller forces the SC-DC to immediately start a new switching phase and boosts fSC until the next NSP clock (CLKCore) cycle so as to recover from the VDD droop. If errors continue to occur and the number of errors in the counting window becomes larger than a Target Error Rate (TER), the controller increases the fSC setting of the SC-DC and thus raises VDD. On the other hand, if the number of errors in a counting window is less than the TER, fSC is decreased, which lowers VDD.

[Fig. 8: schematic. Labels: TRC_in, Delay1, SLUV, TRC_UV, Delay2, FFOV, OV detection, TRC_OV, CLKCore.]
Fig. 8 The schematic of TRC.

Moreover, we added a Tunable Replica Circuit (TRC) to enable continuous error-based regulation (Fig. 9), as the existing error-based works [5, 6] lose regulation if the critical paths are not executed due to, e.g., input inactivity or VDD overshoot. The TRC sets the upper and lower bounds of VDD and thus avoids this issue. Since we pipelined the NSP with two-phase latches, the critical path per latch stage is 0.5∙TCLK. In the TRC, Delay1 is tuned to 0.4∙TCLK such that a timing violation is detected by SLUV slightly before the actual critical path in the NSP fails. On the other hand, if the NSP critical path becomes much shorter than 0.5∙TCLK in the VDD overshoot case, the TRC asserts TRC_OV through the OV detection logic when (Delay1+Delay2)<0.5∙TCLK. However, false OV detection happens when 2∙TCLK<(Delay1+Delay2)<2.5∙TCLK, which limits the maximal value of Delay2. As a result, we chose Delay2 as 0.8∙TCLK and added a NOR gate after the OV detection logic

to mask TRC_OV with TRC_UV. In such cases, when false OV detection happens, Delay1 resides in [0.67, 0.83]∙TCLK and TRC_UV is asserted. Note that the TRC does not directly affect the exact VDD setting and thus adds no margin.

To enable the second feature in the PMU, we devised a CR/fSC search scheme in the context of the error-based regulation. It searches the fSC and CR of the SC-DC for optimal PCE. Prior works on error-based regulation paid little attention to such a scheme [5, 6], while voltage look-up tables are used in some of the conventional voltage-based regulations [7, 8]. The scheme first scales VDD to the Point of the First Failure (PoFF, VDDmin) using EDAC. This minimizes the NSP's power dissipation, since the NSP operates at a fixed clock frequency (30 kHz). Then, it finds a PCE-optimal CR-fSC setting by setting a proper value of ΔV=VDD,OC-VDD (where VDD,OC is the open-circuit SC-DC output voltage), which balances the switching loss and the linear loss of the SC-DC.

[Fig. 9: state diagram. States: RESET (CR=CRmax, fSC=fSC,max), CR- (CR=CR-1, fSC=fSC,max), SCR-, CRO- (CR=CR+CRoffset), CR+ (CR=CR+1, fSC=fSC,max), SCR+, CRO+ (CR=CR+1+CRoffset), F-, SF, FO, RUN, RUN+ (fSC++), RUN- (fSC--). Transition conditions: ER>=TER; ER<TER; TERLO<ER<TERHI; (ER>TERHI || TRC_UV) && fSC<fSC,max; (ER<TERLO || TRC_OV) && fSC>fSC,min; (ER>TERHI || TRC_UV) && fSC>fSC,max; (ER<TERLO || TRC_OV) && fSC<fSC,min.]
Fig. 9 The state diagram of the proposed controls: the CR/fSC search and the error-based regulation.

[Fig. 10: measured waveforms. CR search: CR=CRmax=27, 26, 25, 24, 23, 22, 21, then CRmin=20 (ER>TER), CRopt=CRmin+CRoffset=23. fSC search (unit: kHz): fSC=fSC,max=64.2, 51.4, 42.9, 36.7, 32.1, 28.5, 25.7, 23.3, ..., stop at 12.8kHz.]
Fig. 10 Waveforms of the CR/fSC search process.

Fig. 9 shows the state diagram of the CR/fSC search. It first finds VDDmin by reducing CR to a minimal CR (CRmin) where the error rate (ER) reaches the TER, using the highest allowed fSC, which minimizes ΔV (CR- and SCR-). CR is then set to an optimal value, CRopt=CRmin+CRoffset, at CRO-. The CRoffset is proportional to the optimal ΔV, which is a function of SC-DC design parameters but insensitive to load conditions. Setting CR to CRopt raises VDD, largely reducing ER. Then, the optimal fSC is found by reducing fSC until ER reaches TER again, at which point ΔV also reaches its optimal value. The system then enters the aforementioned error/replica-based regulation (RUN, RUN+, and RUN-). During the regulation, if fSC reaches the pre-defined upper or lower bound (fSC,max/fSC,min), implying that either the switching or the linear loss has become dominant, the controller re-initiates the above CR/fSC search (CR+ or CR-). The CR+, SCR+, and CRO+ states are for the CR search when the optimal CR is higher than the current CR. Measured example waveforms of the control scheme are shown in Fig. 10, confirming its correct operation.

We designed the controller, which performs the error/replica-based regulation and the CR/fSC search, in fully-digital circuits. This avoids variable VREF generators and voltage comparators, which would otherwise impose power and delay overhead as well as a VIN scaling limit on the PMU. As a result, the SoC can be powered directly by capacitors and energy harvesters, thus avoiding efficiency degradation in conversions to/from the battery level (4V) [4].

IV. MEASUREMENT & COMPARISON

We prototyped the NSP-PMU SoC in a 0.18μm technology, as leakage power is the major energy efficiency bottleneck given that the NSP has a relatively low performance requirement. The NSP operates at the target frequency of 30 kHz at VDD of 0.32V with 0.1% TER while consuming only 0.61μW (Fig. 11(a)). The SoC consumes 0.78μW and 0.84μW for VIN of 0.6V and 1V, respectively, at the SC-DC configuration found automatically by the proposed control scheme. The NSP consumes 0.61μW and the TRC consumes 7nW. The SC-DC loss (SC Loss) is 151nW at 0.6V VIN and 198nW at 1V. The PMU controller (SC CNTL) consumes 14nW at 0.6V VIN and 27nW at 1V. Fig. 11(b) summarizes the power breakdown.

[Fig. 11: measurement. (a) NSP (core) frequency (kHz) and NSP (core) power (μW) versus VDD (V) from 0.30 to 0.45; operating point: VDD≈0.32V, PCore≈0.61μW @ER=0.1%. (b) Power breakdown (μW) into NSP, TRC, SC CNTL, and SC Loss at 1V VIN (PIN=0.84μW) and 0.6V VIN (PIN=0.78μW); PNSP=0.61μW, PTRC=7nW.]
Fig. 11 (a) Power and performance of the NSP. (b) SoC power breakdown.

The PMU was also tested to be fully functional: it varies its output in a fine-grained manner when the CR changes (Fig. 12(a)) and achieves high PCEs at the target VDD (Fig. 12(b)).

[Fig. 12: measurement @ILoad=1.5μA, fSC=35kHz (fixed, non-optimal fSC), for VIN=0.6V and VIN=1V. (a) VDD versus CR (0 to 60); operating point VDD≈0.32V; annotations CR=20 and CR=36. (b) PCE (%) versus VDD (V) from 0.0 to 0.8; annotations at the operating point (VDD≈0.32V): PCE 77% and PCE 70%; peak PCE: 83% and 87%.]
Fig. 12 Performance of the integrated SC DC-DC showing: (a) Desirable VDD scaling behavior and (b) high PCE.

[Fig. 13: PCE at minimum fSC for VDD=0.32V (%) versus CR, for ILoad=1.5μA and ILoad=0.75μA. (a) Measurement @VIN=0.6V, CR from 32 to 43; annotations: 85.1%@Opt., 83.9%@Opt., CRmin=35, CRmin=33, optimal CRoffset=2 or 3. (b) Measurement @VIN=1V, CR from 19 to 32; annotations: 77.5%@Opt., 76.9%@Opt., CRmin=20, optimal CRoffset=3.]
Fig. 13 The measurements validate the optimal CRoffset's robustness across load and VIN conditions.
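As a rough illustration of the CR/fSC search described in Section III, the sketch below runs the three steps (lower CR to CRmin, add CRoffset, then lower fSC) on a toy converter model. Only VDD,OC=VIN×CR/64, the 0.32V target, and CRoffset=3 come from the paper. The droop model ΔV=ILoad/(fSC∙C), the capacitance, the clock-divider scheme, and all function names are assumptions made for this example, and the check "VDD below VDD_MIN" merely stands in for "ER reaches TER".

```python
# Toy model of the CR/fSC search. Parameters marked "hypothetical" are
# illustrative assumptions, not measured chip values.
VIN = 1.0            # input voltage (V)
VDD_MIN = 0.32       # stand-in for the point of first failure (V)
I_LOAD = 1.5e-6      # load current (A), assumed constant
C_EFF = 4e-9         # hypothetical effective flying capacitance (F)
F_BASE = 256e3       # hypothetical base clock (Hz); fSC = F_BASE / divider
DIV_MIN, DIV_MAX = 4, 64   # hypothetical divider range (fSC,max at DIV_MIN)
CR_MAX = 27          # starting conversion-ratio code
CR_OFFSET = 3        # CRoffset used in the paper's measurements


def vdd(cr, divider):
    """Open-circuit output VIN*CR/64 minus an assumed slow-switching droop."""
    f_sc = F_BASE / divider
    return VIN * cr / 64 - I_LOAD / (f_sc * C_EFF)


def errors_reach_target(cr, divider):
    """Stand-in for 'ER >= TER': the core fails once VDD drops below VDD_MIN."""
    return vdd(cr, divider) < VDD_MIN


def search():
    # Step 1: at the highest fSC, lower CR until errors appear (CRmin).
    cr = CR_MAX
    while cr > 1 and not errors_reach_target(cr, DIV_MIN):
        cr -= 1
    cr_min = cr
    # Step 2: add the fixed offset to obtain CRopt, which raises VDD again.
    cr_opt = cr_min + CR_OFFSET
    # Step 3: lower fSC (raise the divider) while the next step still passes.
    divider = DIV_MIN
    while divider < DIV_MAX and not errors_reach_target(cr_opt, divider + 1):
        divider += 1
    return cr_min, cr_opt, divider


cr_min, cr_opt, divider = search()
print(f"CRmin={cr_min}, CRopt={cr_opt}, fSC={F_BASE / divider / 1e3:.1f} kHz")
```

In this sketch the fSC search stops at the last passing setting, whereas the controller in the paper stops when ER reaches TER and then hands over to the error/replica-based regulation. The resulting fSC depends entirely on the assumed capacitance and should not be compared with the measured values.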

We also validated the proposed CR/fSC search. We first validated that CRoffset remains 2 to 3 robustly across ranges of CR settings and load current levels (Fig. 13). Using CRoffset=3 and for 1V VIN, the proposed scheme found [fSC=14.9kHz, CR=23], at which the SoC draws an input power (PIN) of 0.84μW (Fig. 14(a)). For 0.6V VIN, it found [fSC=18.4kHz, CR=39], at which the SoC consumes 0.78μW (Fig. 14(b)). We then swept all the possible CRs and fSC's and found that the SoC consumes the optimal PIN of 0.83μW at [fSC=12.9kHz, CR=23, VIN=1V] and 0.78μW at [18.4kHz, 39, 0.6V]. The proposed search scheme achieves PCEs that are within 1.1% of the results from the brute-force search (Fig. 14(a, b)).

[Fig. 14: PIN (μW) versus fSC (kHz). (a) Measurement @VIN=1V; curves for CR=21 to 26; CR&fSC search: CRopt=23. (b) Measurement @VIN=0.6V; curves for CR=37 to 42; CR&fSC search: CRopt=39. Annotations: 1.1% higher than optimal; 0.1% higher than optimal.]
Fig. 14 PIN comparison between the brute-force search and the proposed CR/fSC search at (a) VIN=1V; (b) VIN=0.6V.

We further evaluated the efficacy of the scheme across a wider range of load power. It still achieves at most 2.2% worse PCE than the brute-force search (Fig. 15(a)). The chip die photo is shown in Fig. 15(b).

[Fig. 15: (a) measured PCE (%) versus PLoad (μW) from 0.0 to 1.5; curves: VIN=0.6V, B. F.; VIN=0.6V, Prop.; VIN=1V, B. F.; VIN=1V, Prop.; annotations: 78.6%, 77.7%@VIN=0.6V, 72.8%, 72.2%@VIN=1V, 2.2%, PNSP=0.61μW. (b) die photo; length unit: μm (1308, 1420, 530); NSP: 1.86mm2 (Detector, Sorter, Decoder); SC DC-DC: 0.54mm2; Test circuits; SC Cntl.; Level Conv.]
Fig. 15 (a) PCE comparison between brute-force search and the proposed CR/fSC search for varying load power. (b) Die photo.

Table I. Comparison to the prior BCI processors.
                        This work   [2]     [3]
Process (nm)            180         65      65
No. of Channels         96          96      128
Detection & Sorting     Y           Y       Y
Partial Decoding        Y           N       N
Integrated PMU          Y           N       N
Core VDD (V)            0.32        0.6     0.54
Core Power/Ch. (nW)     6.3         1740    175
Core Area/Ch. (mm2)     0.0194      0.12    0.003

Compared to the state-of-the-art BCI processors [2, 3] (Table I), the proposed SoC achieves 21× smaller power dissipation than [3], even after including the PMU conversion loss. It also demonstrates the first end-to-end integration of neural signal processing at comparable or better accuracy than the prior art. Compared to the prior PMU-load co-design works (Table II), our design demonstrates continuous regulation and optimal PCE search in error-based regulation, which has an advantage over voltage-based regulation designs [7, 8] in terms of PMU VIN scalability, fully-digital control, and safety margin reduction.

Table II. Comparison to the prior PMU-load co-designs.
                        This work            [5]           [7]                  [8]
Process (nm)            180                  65            65                   180
Converter Type          63-ratio SC          63-ratio SC   LDO & SC             Multi-SCs
PLoad Range (W)         <1μ                  ~25μ          8-173m               20n-500μ
VIN (V)                 0.6-1                0.6-1         0.65-1.05            1-4
Core VDD (V)            ~0.32                ~0.45         0.38-0.92            0.6/1.2/3.3
PVT Adaptive Scheme     EDAC                 EDAC          Replica              N.A.
PVT Margin              None                 None          Yes                  N.A.
Regulation Method       Error/Replica        Error         VREF+CMP             VREF+CMP
Continuous Regulation   Yes                  No            Yes                  Yes
PCE Optimization        Error-based Search   Fixed fSC     VREF Look-up Table   VREF Look-up Table
PCE Range               72-77%               73-87%        ~52-73% (SC)         60-68%

V. CONCLUSION

In this paper, we present a sub-μW NSP integrated with a PMU through a hybrid error/replica control scheme and an automatic energy-robustness co-optimization method. The NSP not only achieves the highest level of integration, from spike detection to power management, but also marks a record energy efficiency. The proposed PMU-NSP co-design approach improves the energy efficiency of a sub-μW system by employing a fully-digital, fully-integrated design with the capability of energy-robustness co-optimization via EDAC.

ACKNOWLEDGMENT

This work is supported by National Science Foundation, Catalyst Foundation, Wei Family Private Foundation, Samsung, and Semiconductor Research Corporation.

REFERENCES

[1] Zhewei Jiang et al., “Microwatt End-to-End Digital Neural Signal Processing Systems for Motor Intention Decoding,” DATE, 2017.
[2] Zhewei Jiang et al., “1.74-μW/ch, 95.3%-accurate Spike-Sorting Hardware based on Bayesian Decision,” VLSIC, 2016.
[3] S. M. A. Zeinolabedin et al., “A 128-Channel Spike Sorting Processor Featuring 0.175 μW and 0.0033 mm2 per channel in 65-nm CMOS,” VLSIC, 2016.
[4] Jiangyi Li et al., “Triple-Mode, Hybrid-Storage Energy Harvesting Power Management Unit: Achieving High Efficiency against Harvesting and Load Variabilities,” JSSC, vol. 52, no. 10, pp. 2550–2562, 2017.
[5] SeongJong Kim and Mingoo Seok, “Ultra-Low-Power and Robust Power-Management/Microprocessor System Using Digital Error-based Regulation,” ESSCIRC, 2017.
[6] David Bull et al., “A Power-Efficient 32 bit ARM Processor Using Timing-Error Detection and Correction for Transient-Error Tolerance and Adaptation to PVT Variation,” JSSC, vol. 46, no. 1, pp. 18–31, 2011.
[7] Stephen T. Kim et al., “Enabling Wide Autonomous DVFS in a 22 nm Graphics Execution Core Using a Digitally Controlled Hybrid LDO/Switched-Capacitor VR with Fast Droop Mitigation,” ISSCC, 2015.
[8] Wanyeong Jung et al., “A 60%-Efficiency 20 nW-500 μW Tri-Output Fully Integrated Power Management Unit with Environmental Adaptation and Load-Proportional Biasing for IoT Systems,” ISSCC, 2016.
[9] Seongjong Kim, Joao Pedro Cerqueira, and Mingoo Seok, “A 450 mV timing-margin-free waveform sorter based on body swapping error correction,” VLSIC, 2016.
[10] K. Hirairi et al., “13% Power Reduction in 16 b Integer Unit in 40 nm CMOS by Adaptive Power Supply Voltage Control with Parity-Based Error Prediction and Detection (PEPD) and Fully Integrated Digital LDO,” ISSCC, 2012.
