15.1 (8073)
IEEE Asian Solid-State Circuits Conference
November 5-7, 2018/Tainan, Taiwan

A Bulk 65nm Cortex-M0+ SoC with All-Digital Forward Body Bias for 4.3X Subthreshold Speedup

Pranay Prabhat, Graham Knight, Supreet Jeloka, Sheng Yang, James Myers
Arm Ltd, Cambridge, UK; Arm Inc., Austin, US
first.last@arm.com

Abstract: IoT devices demand ultra-low power operation while still achieving the performance demanded by application constraints. Dynamic forward body biasing can help to achieve this by providing a speed-up during active operation without incurring a leakage penalty during standby periods. While body biasing has been fully explored in FD-SOI technology, bulk CMOS can also benefit from efficient forward body biasing. At subthreshold voltage levels, the Low Voltage Swapped Body (LVSB) technique, in which n-well and p-well are driven to VSS and VDD respectively, helps to realize a significant speedup without incurring analog bias generation overheads. This work presents key advances to leverage LVSB, proven on a bulk 65nm subthreshold Arm Cortex-M0+ system. The system achieves a 4.3X speedup on the ULPBench benchmark at a cost of only 11% average power and 10.4% area, while showing that LVSB can be usefully applied up to 0.50V.

Keywords: Arm; Cortex-M0+; LVSB; subthreshold; IoT; FBB

I. INTRODUCTION

The energy efficiency demands of IoT devices are driving renewed interest in subthreshold systems [1-3]. However, reduced performance is a major barrier to adoption. Fine-grained VDD control or analog body bias are commonly used to improve performance at the expense of area, energy and complexity. Low Voltage Swapped Body (LVSB) is a performance boosting technique introduced in [5], in which n-well and p-well may be "swapped" to the ground and power rails respectively as long as the device is operating below the substrate latch-up voltage (Fig. 1). This provides a large forward body bias to both n and p devices, which translates to a significant speedup.
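The link between forward body bias and speedup can be illustrated with the textbook subthreshold current model, in which drive current grows exponentially as the threshold voltage falls. The sketch below is only a first-order illustration under assumptions that are not taken from this paper: gate delay is taken as inversely proportional to subthreshold current, and the slope factor of 1.5 and the 300 K temperature are illustrative defaults. The function names are hypothetical.

```python
import math

# Boltzmann constant divided by the electron charge, in volts per kelvin.
K_OVER_Q = 8.617333262e-5


def thermal_voltage(temp_kelvin: float) -> float:
    """Thermal voltage kT/q in volts."""
    return K_OVER_Q * temp_kelvin


def subthreshold_speedup(delta_vth: float, slope_factor: float = 1.5,
                         temp_kelvin: float = 300.0) -> float:
    """Speed ratio for a threshold-voltage reduction of delta_vth volts.

    Assumes I_sub ~ exp((Vgs - Vth) / (n * kT/q)) and delay ~ 1 / I_sub.
    slope_factor (n) is an illustrative assumption, not a measured value.
    """
    return math.exp(delta_vth / (slope_factor * thermal_voltage(temp_kelvin)))


def vth_shift_for_speedup(speedup: float, slope_factor: float = 1.5,
                          temp_kelvin: float = 300.0) -> float:
    """Threshold-voltage reduction (volts) needed for a given speed ratio."""
    if speedup <= 0.0:
        raise ValueError("speedup must be positive")
    return slope_factor * thermal_voltage(temp_kelvin) * math.log(speedup)


if __name__ == "__main__":
    shift = vth_shift_for_speedup(4.3)
    print(f"Vth reduction for a 4.3x speedup (n = 1.5, 300 K): {shift * 1e3:.1f} mV")
```

Under these assumptions a threshold shift of only a few tens of millivolts is enough for a several-fold speedup, which is why a large forward bias is so effective below threshold. The same exponential also applies to leakage, so it is equally a reminder of why the bias should be removed in standby.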
Prior work [4-6] applied LVSB primarily to standalone logic blocks and reported speed increases of up to 3.7×.

Fig. 1: LVSB operating principle (VDD < 0.6V) [6].

978-1-5386-6419-1/18/$31.00 ©2018 IEEE

This work proposes:
1. A no-trim, all-digital, fully synthesized LVSB implementation with µs-order mode switching.
2. Fine-grained independent bias control per power domain.
3. Split bias for SRAM array and periphery.
4. Adaptive clock generation to track system performance.

A bulk 65nm Arm Cortex-M0+ LVSB SoC using these techniques achieves a 4.3× subthreshold speedup with only 11% power penalty for an industry standard duty-cycled workload. This paper is organized as follows. Section II details the digital logic, memory, clock generation and power control sequencing considerations with block-level silicon results, Section III presents full system silicon results and Section IV presents conclusions.

II. LVSB CIRCUITS

A. Logic and well drivers

Careful implementation is required to integrate dynamic LVSB into synthesized logic with multiple power domains. The CPU, bus and each peripheral are implemented as separate power domains for fine-grained bias control (Fig. 2). A physically separate deep n-well under each power domain isolates p-wells from the substrate at a marginal area cost. This enables per-domain well control to minimize leakage by forward biasing only active domains. Special tie cells are used to disconnect wells from the supply rails, allowing an industry-standard router to route wires as signals from well driver buffer outputs to well ties. Custom well driver buffer cells (Fig. 2, zoomed inset) are supplied by the same low-voltage power rail as the rest of the logic. The well driver cells incorporate a deep n-well hole, to prevent their internal n-well from shorting to the driven n-well and creating unwanted feedback. A pair of p- and n-well drivers is combined into a single cell to reduce DRC keep-out overhead.
To accommodate these constraints in a library cell footprint, the buffer cell is 3 rows high with a jogged power rail. Well drivers are kept on the always-on ground rail so that wells do not float when power-gating the rest of the logic, as this would forward-bias power-gated logic and increase sleep power. When enabled, power-gating footers are forward biased along with the rest of the logic to supply the increased current demand from forward-biased operation. Buffer density and sizing are determined from SPICE simulation including extracted well diodes, to meet an LVSB transition target of 16 cycles for all domains (Fig. 2, bottom right inset). Place and route is fully automated, with 4.3% area and 9% (simulated) active leakage overhead.

Fig. 2: Logic floorplan showing well buffer layout, extracted well diode and simulated well transition times.

Logic LVSB is tested with a 128b AES encrypt/decrypt accelerator driven by on-chip LBIST, showing a 6.2× speedup on silicon at a Minimum Energy Point (MEP) of 0.30V with only 6% energy penalty (Fig. 6).

B. Memory

LVSB logic alone is not sufficient for a functional and efficient LVSB system. If the entire SRAM is at Zero Body Bias (ZBB), clock buffers inside it will increase clock skew in system Forward Body Bias (FBB) mode, causing hold time violations. Also, SRAM access can be up to 50% of system cycle time, reducing potential FBB speedup and degrading system energy efficiency. However, the SRAM array has a low activity factor, so leakage dominates its energy: in simulation, a forward-biased array drives up the MEP, increasing system energy by 16% with only 16% performance improvement. To solve these issues, SRAM array and periphery are split into separate deep n-wells (Fig. 3).
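The cycle-time argument above can be made concrete with an Amdahl-style estimate. The sketch below is only a rough model, assuming the system cycle splits cleanly into an SRAM-access portion and a logic portion that scale independently under FBB. The function name and the sweep of SRAM fractions are hypothetical. The 50% fraction and the 6.2× logic speedup are the figures quoted in the text, although they are not stated for the same operating point, so the result is indicative only.

```python
def system_speedup(sram_fraction: float, logic_speedup: float,
                   sram_speedup: float = 1.0) -> float:
    """Amdahl-style system speedup when SRAM and logic scale differently.

    sram_fraction: share of the original cycle time spent in SRAM access.
    logic_speedup: speedup of the logic portion under FBB.
    sram_speedup:  speedup of the SRAM portion (1.0 means SRAM left at ZBB).
    """
    if not 0.0 <= sram_fraction <= 1.0:
        raise ValueError("sram_fraction must lie in [0, 1]")
    if logic_speedup <= 0.0 or sram_speedup <= 0.0:
        raise ValueError("speedups must be positive")
    # New cycle time relative to the original cycle time of 1.0.
    new_cycle = (sram_fraction / sram_speedup
                 + (1.0 - sram_fraction) / logic_speedup)
    return 1.0 / new_cycle


if __name__ == "__main__":
    # Sweep of SRAM fractions, with the SRAM left entirely at ZBB.
    for fraction in (0.0, 0.25, 0.5):
        speedup = system_speedup(fraction, logic_speedup=6.2)
        print(f"SRAM fraction {fraction:.2f}: system speedup {speedup:.2f}x")
```

With half of the cycle spent in an unbiased SRAM, this model keeps the system speedup below 2× however fast the logic becomes, which is the motivation for biasing at least part of the memory.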
The array is kept at ZBB while the periphery incorporates LVSB. Well drivers are integrated into the macro to drive periphery wells. Tested on silicon with on-chip MBIST, the 4KB 10T SRAM macro with LVSB shows a 3× speedup near the individual MEP of 0.40V with 19% energy penalty. The SRAM is fully functional down to 0.29V in both FBB and ZBB modes.

Fig. 3: SRAM layout section showing well split and simulated waveforms. Total SRAM area penalty is 16%.

C. Clock generation and control

A Tuned Clock Ring Oscillator (TCRO) [7] is used to generate a voltage and temperature-tracking system clock. The delay chain is physically spread out (Fig. 4) to capture representative wire load and gate delay effects. It also needs to track well bias, considering SRAM and logic performance scaling. To mimic simulated system behavior, inverter chains representing 26% of the maximum TCRO delay are kept at ZBB and the rest may be forward biased to match the system critical path in both bias modes. TCRO voltage and temperature tracking is intrinsic and tuning bits are only used to trim chip-to-chip variation margin. Fig. 4 shows excellent matching within 25% of system frequency across more than two orders of magnitude frequency variation due to voltage, temperature and bias conditions, without any runtime re-trimming.

Fig. 4: TCRO schematic, layout, clock capture during bias transition and margin across voltage, temperature and bias mode. Panel annotations: TCRO margin over 0.30V to 0.55V, 0 to 85°C; system operates correctly at all points shown.

D. Power and bias sequencing

An on-chip Power Control State Machine (PCSM) safely sequences LVSB transitions and generates per-domain control signals (Fig. 5) in response to power-on reset, wake from sleep, or SW power/performance mode requests. When switching bias mode, integrated clock gates are disabled while wells are switched. No state is lost during well transitions.
Fig. 5: System block diagram and well switching control sequence. Power gate footers are 2.5V I/O devices driven by 1.2V VBATT control.

FBB is turned off during sleep: well switching is incorporated into the sleep and wakeup sequence. Most mode transitions count 16 TCRO cycles, including enabling/disabling FBB, which required careful simulation. The TCRO output clock was measured to settle within this time frame as expected, without any glitches. As the PCSM is always-on and supplied by 1.2V VBATT, it has significant timing slack and does not require any body bias.

III. SYSTEM RESULTS

Measured results are shown in Fig. 6. The change in FBB/ZBB relative leakage power and frequency with VDD causes a leakage energy penalty for logic above 0.25V. As logic and SRAM scaling under FBB are quite different, the system efficiency is quite sensitive to the amount of SRAM used. When the system runs with 8KB SRAM enabled, this results in a total energy penalty of 24% with a 4.3× speedup at 0.4V. The moderate energy penalty shows that LVSB offers a viable fast dynamic system operating point from a single supply voltage. For a duty-cycled or intermittent workload, the system can go into ZBB retention, resulting in an average power penalty of only 11% for EEMBC's ULPBench at 0.40V. Without the LVSB capability, Fig. 6 shows that equivalent performance would need 0.47V operation and incur a 78% power penalty. Fig. 7 shows annotated logic analyzer capture for these scenarios. LVSB can also compensate for temperature inversion: at 0°C the 2× reduction from 25°C performance becomes a 2.6× increase at only 11% energy cost if FBB is asserted.

Fig. 6: (a) Measured AES (logic only) FBB/ZBB frequency and leakage.
(b) Measured system FBB/ZBB frequency and energy efficiency.

Figure 7: Measured behaviour of ULPBench running at 0.4V with and without active mode body bias.

IV. CONCLUSION

As far as the authors are aware, this work is the first application of LVSB to a complete digital SoC including SRAM and clock generator. Careful system design through fine-grained bias control, a voltage and temperature tracking clock, and a core/periphery partitioned RAM macro ensures that the excess leakage power due to large forward bias is effectively mitigated, demonstrating only 11% power overhead during system operation. Densely distributed well buffers ensure that the forward bias can be switched off in a few cycles when entering sleep mode, resulting in a negligible sleep power penalty. Table I shows a comparison with LVSB prior art. This work achieves the best speedup (other than a ring oscillator in []) as well as the highest level of integration. Fig. 8 shows an FBB system speedup histogram and annotated chip photo.

LVSB shows promise for system and circuit co-optimization of cost-sensitive ultra-efficient IoT systems that are torn between the energy efficiency of sub-threshold operation and the performance demands of existing software.
While sophisticated back-bias schemes have been proposed on FD-SOI [8], this work demonstrates a low-overhead, fully synthesizable zero-trim all-digital biasing scheme to get a low-voltage performance boost on widely available, cost-effective and mature bulk CMOS processes.

TABLE I: COMPARISON TO LVSB PRIOR ART
Columns: prior work [4-6] and this work. Rows: Technology, Design, Circuits, Bias scheme, On-chip drivers, Voltage range, Speedup.

Figure 8: Speedup histogram and annotated chip photo.

ACKNOWLEDGMENTS

The authors thank An Nguyen and Dave Ondeiek of Arm Inc. in San Jose for extensive layout support and Anand Savanth of Arm Ltd in Cambridge for invaluable test and PCB assistance.

REFERENCES

[1] M. Fojtik et al., "A Millimeter-Scale Energy-Autonomous Sensor System With Stacked Battery and Solar Cells," in IEEE Journal of Solid-State Circuits, vol. 48, no. 3, pp. 801-813, March 2013.
[2] S. Clerc et al., "8.4 A 0.33V/-40°C process/temperature closed-loop compensation SoC embedding all-digital clock multiplier and DC-DC converter exploiting FDSOI 28nm back-gate biasing," 2015 IEEE International Solid-State Circuits Conference (ISSCC) Digest of Technical Papers, San Francisco, CA, 2015, pp. 1-3.
[3] S. Paul et al., "An energy harvesting wireless sensor node for IoT systems featuring a near-threshold voltage IA-32 microcontroller in 14nm tri-gate CMOS," 2016 IEEE Symposium on VLSI Circuits (VLSI-Circuits), Honolulu, HI, 2016, pp. 1-2.
[4] W. Zhao, Y. Ha and M. Alioto, "Novel Self-Body-Biasing and Statistical Design for Near-Threshold Circuits With Ultra Energy-Efficient AES as Case Study," in IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 23, no. 8, pp. 1390-1401, Aug. 2015.
[5] S. Narendra et al., "Ultra-low voltage circuits and processor in 180nm to 90nm technologies with a swapped-body biasing technique," 2004 IEEE International Solid-State Circuits Conference (IEEE Cat. No.04CH37519), 2004, pp. 156-518 Vol.1.
[6] J.-S. Wang, J.-S. Chen, Y.-M. Wang and C. Yeh, "A 230mV-to-500mV 375KHz-to-16MHz 32b RISC Core in 0.18µm CMOS," 2007 IEEE International Solid-State Circuits Conference. Digest of Technical Papers, San Francisco, CA, 2007, pp. 294-604.
[7] J. Myers et al., "A 12.4pJ/cycle sub-threshold, 16pJ/cycle near-threshold ARM Cortex-M0+ MCU with autonomous SRPG/DVFS and temperature tracking clocks," 2017 Symposium on VLSI Circuits, Kyoto, 2017, pp. C332-C333.
[8] A. Quelen, G. Pillonnet, P. Flatresse and E. Beigné, "A 2.5µW 0.0067mm² automatic back-biasing compensation unit achieving 50% leakage reduction in FDSOI 28nm over 0.35-to-1V VDD range," 2018 IEEE International Solid-State Circuits Conference (ISSCC), San Francisco, CA, 2018, pp. 304-306.
