You are on page 1of 527

Massimo Alioto Editor

Enabling the
Internet of
Things
From Integrated Circuits to Integrated Systems
Enabling the Internet of Things
Massimo Alioto
Editor

Enabling the Internet


of Things
From Integrated Circuits
to Integrated Systems
Editor
Massimo Alioto
Department of Electrical & Computer Engineering
National University of Singapore
Singapore, Singapore

ISBN 978-3-319-51480-2 ISBN 978-3-319-51482-6 (eBook)


DOI 10.1007/978-3-319-51482-6

Library of Congress Control Number: 2016963262

# Springer International Publishing AG 2017


This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or
part of the material is concerned, specifically the rights of translation, reprinting, reuse of
illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way,
and transmission or information storage and retrieval, electronic adaptation, computer software,
or by similar or dissimilar methodology now known or hereafter developed.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this
publication does not imply, even in the absence of a specific statement, that such names are
exempt from the relevant protective laws and regulations and therefore free for general use.
The publisher, the authors and the editors are safe to assume that the advice and information in
this book are believed to be true and accurate at the date of publication. Neither the publisher nor
the authors or the editors give a warranty, express or implied, with respect to the material
contained herein or for any errors or omissions that may have been made.

Printed on acid-free paper

This Springer imprint is published by Springer Nature


The registered company is Springer International Publishing AG
The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland
To Maria Daniela and Marco, and now also Marina.
My deepest merriment begins with M. And them, indeed.

To my family, including my beloved nephews and niece:


Rachele, Davide, Gaetano and Francesco.

And to everyone who has inspired my curiosity and love for life,
including Giusi, Rodolfo, Alfio, Dora, Gaetano, Annamaria
and Santina. And thankfully many, many others.
Preface

Decades of exponential improvements in integrated circuit manufacturing


and design have been spurred by (and given rise to, in a virtuous circle)
relentless reduction in the cost per transistor, as well as many other interest-
ing consequences. The cost reduction keeps following a well-proven learning
curve, which includes Moore’s law as a corollary, and will continue in spite
of the end of this law. According to the Bell’s law, such technology trend has
led to inexorable shrinking of electronic systems, which are now approaching
the sub-centimeter scale and below. At the same time, the Koomey’s and
Gene’s laws promise the reduction in the energy consumption by another two
orders of magnitude. Those trends promise the possibility of integrated
electronic systems that are very inexpensive, small, and extremely low
power. In other words, we will increasingly see systems that are pervasive
in space and long-lived in time. At the same time, Metcalfe’s law (or
Sarnoff’s law for the least optimistic) traces the fast-growing value of
connectivity, thanks to the rapidly increasing number of connected users,
and, more in general, connected objects.
At the same time, several megatrends are demanding more pervasive and
continuous sensing, as well as sensemaking and transfer of physical data.
Accelerated urbanization and increasing worldwide population requires sus-
tainable usage and sharing of resources, as well as more livable and smarter
environments at all scales (from home to city). Pervasive sensing and
sensemaking are also being required by assistive and proactive technologies
(e.g., robotics, decision support) that increasingly relieve humans from
routine tasks, repetitive labor, and recently data-driven decision-making.
The sharing economy is demanding the ability to spatially track, to physically
monitor and manage objects, to encourage responsible usage, and to charge
users by the actual usage. Well-being and other human factors are being
modeled and monitored to create healthy environments where humans can be
happy and productive. Geosocialization and participatory sensing are pro-
gressively involving objects other than individuals or as support to human
activities. Three-dimensional remote physical interaction with reality
provides sensory feedback, thus demanding ubiquitous sensing to enable
this ability on a wider scale and on a finer granularity.
The push and the pull effect of the above technological trends and
applications is converging on and creating a virtuous circle that we now

vii
viii Preface

call the “Internet of Things” (IoT). The IoT can evidently create a huge value
and bring unprecedented benefits to the society. To set this on a trend
perspective, we can extrapolate Hick’s law to artificial intelligence and
cloud computing: more physical data will enable us to take more automated
decisions with an effort that is only logarithmic in the space of decision
choice. The IoT is ultimately a powerful enabler to share on a larger scale,
make technology more human centric and real time, and decouple socioeco-
nomic progress from intensive use of resources. And, interestingly, IoT
silicon technology becomes so small that the user is immersed in it (there
is no more “user experience,” in a sense), with interesting implications in
terms of market and perceived value.
In spite of the daily IoT-related claims in the chip design community, the
tiny sensing nodes of the IoT at its edge (the “IoT nodes”) are still in their
technological infancy. Several challenges need to be tackled, such as energy
efficiency and related lifetime, cost, security, and interoperability, among
others. Such challenges need to be tackled in a holistic manner, developing
both an understanding of the different parts of IoT nodes and an insight into
the big picture and the strong linkage to applications and related
requirements.
To the best of our knowledge, this is the first book on integrated circuit
and system design for the Internet of Things. This book develops in both the
“vertical” and the “horizontal” dimension. Vertically, it provides a compre-
hensive view on the challenges and the solutions to successfully design chips
for IoT nodes as systems (from circuits to packages), a broad analysis of how
chip design needs to evolve to meet those challenges, and a fresh perspective
grounded on historical and recent trends. Horizontally, the book covers in
one place the very diverse domain-specific expertise of the subareas involved
in the design of IoT nodes, which was previously scattered across a large
number of talks, journals, and conferences.
This book provides a design-centric perspective, providing an understand-
ing of what the IoT really means from a design point of view. Typical
specifications of commercial IoT nodes are discussed, and constraints
imposed by IoT applications are translated into design constraints that chip
designers are used to deal with. Design guidelines to meet them are system-
atically discussed in every chapter.
This book started in the form of talks at various venues, such as VLSI
Symposium, HotChips, and ISCAS, where I had very interesting
conversations with several other speakers. Those talks were motivated by
the lack of a cohesive and detailed source of accessible knowledge on the
design of IoT nodes. The idea to write this book came exactly from those
conversations, which later continued throughout the interaction with chapter
authors. They really made this book possible, providing their deep insights
and invaluable expertise. I deeply thank all outstanding researchers and
designers who contributed to the chapters of this book, sharing their expertise
in an accessible and concise manner for the benefit of our community.
Preface ix

This book is structured as follows. Chapter 1 describes the big picture in


view of technological trends, an overview of the challenges ahead and the
possibilities that research has recently opened, and some link to the econom-
ics of the IoT and social megatrends. Chapter 2 provides a system-level
perspective of IoT nodes. Then, Chaps. 3–7 cover the design of digital
subsystems of IoT nodes, from architectures to circuits, and memories in
CMOS and other emerging technologies. Chapter 8 is about hardware-level
security techniques, whereas Chap. 9 focuses on System-on-Chip design
methodologies. Power management and energy harvesting are covered in
Chaps. 10 and 11. Analog interfaces and analog–digital converters are
discussed in Chaps. 12 and 13. Short-range radios are discussed in Chap.
14. Batteries as further essential component of IoT nodes are the focus of
Chap. 15. Packaging is the topic of Chap. 16. Finally, Chaps. 17 and 18
describe two system integration examples, exemplifying the design
techniques introduced in the previous chapters. As a common thread, all
chapters include a final section on perspectives and trends, which provides a
glance into the future, and a good starting point for further research and
advances.
There are many ways to use this book. In particular, it can serve as a
reference to practicing engineers working in the broad area of integrated
circuit/system design of IoT nodes, in view of the wide and detailed coverage
of state-of-the-art solutions for IoT and the fresh perspective on the future of
such technologies. The book is also very well suited for undergraduate,
graduate, and postgraduate students, thanks to the rigorous and lean coverage
of topics and selected references.

Singapore Massimo Alioto


December 2016
Contents

1 IoT: Bird’s Eye View, Megatrends and Perspectives . . . . . 1


Massimo Alioto
2 IoT Nodes: System-Level View . . . . . . . . . . . . . . . . . . . . . . 47
Pascal Urard and Mališa Vučinić
3 Ultra-Low-Power Digital Architectures for the Internet
of Things . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
Davide Rossi, Igor Loi, Antonio Pullini, and Luca Benini
4 Near-Threshold Digital Circuits for Nearly-Minimum
Energy Processing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
Massimo Alioto
5 Energy Efficient Volatile Memory Circuits
for the IoT Era . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 149
Jaydeep P. Kulkarni, James W. Tschanz, and Vivek K. De
6 On-Chip Non-volatile Memory for Ultra-Low Power
Operation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 171
Meng-Fan Chang
7 On-Chip Non-volatile STT-MRAM for
Zero-Standby Power . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 213
Xuanyao Fong and Kaushik Roy
8 Security Down to the Hardware Level . . . . . . . . . . . . . . . . 247
Anastacia Alvarez and Massimo Alioto
9 Design Methodologies for IoT Systems on a Chip . . . . . . . 271
David Flynn, James Myers, and Seng Toh
10 Power Management Circuit Design for IoT Nodes . . . . . . . 287
D. Brian Ma and Yan Lu
11 Energy Harvesting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 317
Ying-Khai Teh and Philip K.T. Mok
12 Ultra-Low Power Analog Interfaces for IoT . . . . . . . . . . . 343
Jerald Yoo

xi
xii Contents

13 Ultra-Low Power Analog-Digital Converters for IoT . . . . . 361


Pieter Harpe
14 Circuit Techniques for IoT-Enabling Short-Range
ULP Radios . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 385
Pui-In Mak, Zhicheng Lin, and Rui Paulo Martins
15 Battery Technologies for IoT . . . . . . . . . . . . . . . . . . . . . . . 409
Jeff Sather
16 System Packaging and Assembly in IoT Nodes . . . . . . . . . . 441
You Qian and Chengkuo Lee
17 An IPv6 Energy-Harvested WSN Demonstrator
Compatible with Indoor Applications . . . . . . . . . . . . . . . . 483
Pascal Urard, Liviu Varga, Mališa Vučinić,
and Roberto Guizzetti
18 Ferro-Electric RAM Based Microcontrollers:
Ultra-Low Power Intelligence for the Internet
of Things . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 503
Sudhanshu Khanna, Mark Jung, Michael Zwerg,
and Steven Bartling
About the Editor

Massimo Alioto (M’01–SM’07-F’16) was


born in Brescia, Italy, in 1972. He received
the Laurea (M.Sc.) degree in Electronics
Engineering and the Ph.D. degree in Elec-
trical Engineering from the University of
Catania (Italy) in 1997 and 2001, and a
Bachelor of Music in Jazz Studies from
the Conservatory of Music of Bologna
in 2007.
He is currently an Associate Professor at
the Department of Electrical and Computer
Engineering, National University of Singapore, where he leads the Green IC
group and is the Director of the Integrated Circuits and Embedded Systems
area. Previously, he was Associate Professor at the Department of Informa-
tion Engineering of the University of Siena. In 2013 he was also Visiting
Scientist at Intel Labs—CRL (Oregon) to work on ultra-scalable microarch-
itectures. In 2011–2012, he was Visiting Professor at the University of
Michigan, Ann Arbor, investigating on active techniques for resiliency in
near-threshold processors, energy-quality scalable VLSI design, and self-
powered circuits. In 2009–2011, he was Visiting Professor at BWRC—
University of California, Berkeley, investigating on next-generation ultra-
low power circuits and wireless nodes. In the summer of 2007, he was a
Visiting Professor at EPFL—Lausanne (Switzerland).
He has authored or co-authored more than 220 publications in journals
(80+, mostly IEEE Transactions) and conference proceedings. One of them is
the second most downloaded TCAS-I paper in 2013. He is co-author of three
books, Enabling the Internet of Things—From Integrated Circuits to
Integrated System (Springer, 2017), Flip-Flop Design in Nanometer
CMOS—From High Speed to Low Energy (Springer, 2015) and Model and
Design of Bipolar and MOS Current-Mode Logic: CML, ECL and SCL
Digital Circuits (Springer, 2005). His primary research interests include
ultra-low power VLSI circuits, self-powered and wireless nodes, near-
threshold circuits for green computing, energy-quality scalable VLSI
circuits, hardware-level security, circuits for on-chip learning, and circuit
techniques for emerging technologies.

xiii
xiv About the Editor

Prof. Alioto was a member of the HiPEAC Network of Excellence


(EU) and the MuSyC FCRP Center (US). In 2010–2012 he was the Chair
of the “VLSI Systems and Applications” Technical Committee of the
IEEE Circuits and Systems Society, for which he was also Distinguished
Lecturer in 2009–2010 and member of the DLP Coordinating Committee in
2011–2012. He is also member of the Board of Governors of the IEEE
Circuits and Systems Society (2015–2017). In the last 5 years, he has given
50+ invited talks in top universities and leading semiconductor companies.
He currently serves as Associate Editor-in-Chief of the IEEE Transactions on
VLSI Systems, and served as Guest Editor of various journal special issues
(e.g., IEEE TCAS-I issue on Internet of Things in 2017, IEEE TCAS-II issue
on green computing in 2012). He also serves or has served as Associate
Editor of a number of journals, such as IEEE Transactions on VLSI Systems,
ACM Transactions on Design Automation of Electronic Systems, IEEE
Transactions on CAS—part I and part II, Microelectronics Journal, and
others. He serves or has served as panelist for several funding agencies and
research programs in the USA and Europe. He was Technical Program Chair
(ICECS, PRIME, VARI, NEWCAS, ICM, SOCC) and Track Chair in a
number of conferences (ICCD, ISCAS, ICECS, VLSI-SoC, APCCAS, ICM).
Prof. Alioto is an IEEE Fellow.
IoT: Bird’s Eye View, Megatrends
and Perspectives 1
Massimo Alioto

This chapter opens the book and provides a sum- its pervasive networking, cloud and related
mary of the challenges and the opportunities that advances (e.g., big data), the physical world
are offered by the Internet of Things (IoT), with through distributed sensing and people’s
emphasis on the aspects that are relevant to activities, in the unprecedented form of mostly
integrated circuit and system design from circuits real-time fine-grain and aggregated data from the
to packaging for IoT nodes. The chapter is knowledge coming from environments, goods,
organized along a chronological perspective, resources, tools, infrastructures, among the others.
first reviewing technology historical trends So far, the IoT has been defined in several
beyond mere Moore’s law, and summarizing different ways, and its meaning has become so
recent past achievements and capabilities that broad that it oftentimes includes any object on
are making the IoT possible. Then, present earth that is connected to the Internet, such as
challenges are described, as pathway to connected cars, drones, smartphones, smart
up-coming advances and developments in the appliances, industrial tools, and so on. Under
design of IoT nodes. Finally, mega-trends are such generic definition based on pure Internet
examined to unearth clues on longer-term evolu- connectivity, the IoT has been already realized
tion of the IoT and the implications on integrated as the number of computing devices connected to
system design. the Internet surpassed the worldwide population
back in 2008–2009 (Evans et al. 2011).
This book focuses on the IoT as pervasive,
1.1 The Internet of Things: Context unobtrusive, systematic and coordinated intro-
and Overview duction of sense-, compute-, communication-
ability and sensemaking of physical data in a
The concept of the IoT seems to first appear in very large number of objects on earth. This is
Kevin Ashton in a presentation delivered at enabled by the introduction of extremely
Procter & Gamble in 1999 (Ashton 2009), which miniaturized integrated systems (“IoT nodes”)
was then described as a large-scale network of with very long lifetime (e.g., decades) that are
smart RFIDs. On a broad perspective, the IoT autonomous in many respects, from functional-
lies at the intersection of the Internet realm with ity, to energy, to the way they interact with the
physical world and the network infrastructure.
M. Alioto (*) From this perspective, the IoT pushes such
National University of Singapore, Singapore 117583, capabilities beyond personal devices (e.g.,
Singapore smartphones), embedding them in everyday
e-mail: malioto@ieee.org

# Springer International Publishing AG 2017 1


M. Alioto (ed.), Enabling the Internet of Things, DOI 10.1007/978-3-319-51482-6_1
2 M. Alioto

objects and living environments. This book predict a somewhat slower growth (Nordrum
addresses the challenges involved in the creation 2016). The IoT market size is expected to have a
of IoT nodes in the form of integrated circuits, global economic impact of 2.5–11.1 T$ by
covering the different areas involved in this pro- 2022–2025 (Dobbs et al. 2015; Jankowski et al.
cess including architecture, circuit building 2014).
blocks, design methodologies, packaging and As shown in the simplified architecture in
system demonstrations. Being the IoT an exten- Fig. 1.1, the IoT is structured into three tiers of
sive topic, the scope of this book purposely devices. At the bottom, IoT nodes perform sensing
excludes the challenges related to the integration and interact with the physical world. To assure
of IoT nodes into a cohesive and scalable net- scalability and ubiquitous network access, gate-
work comprising inter-operable and heteroge- ways and concentrators collect, protect (under
neous nodes, and related communication users’ control) and route data from several and
protocol and software layers. physically proximal IoT nodes, and route it to
A commonly agreed target of the IoT is to servers. The latter perform data aggregation and
expand the number of connected devices per per- knowledge extraction, and deliver physically-
son to the order of a thousand, thus reaching an enhanced cloud services. Some additional inter-
unprecedented scale of trillions of connected mediate levels of aggregation might be needed,
devices (Gaudin 2015). The number of depending on the amount of data generated, the
connected devices is expected to grow to 30–50 area covered by a sub-network, and the density of
billion devices by 2020, with an expected market IoT nodes, among the others. For example,
CAGR growth of 15–35% (Markets and Markets; concentrators might actually be a sub-set of the
https://newsroom.cisco.com/press-release-content network below an Internet hub/gateway, which is
?type¼press-release&articleId¼1771211; Ericss here omitted as this would be simply part of the
on Mobility Report; http://www.gartner.com/ existing Internet infrastructure.
newsroom/id/3165317; Greenough and Camhi The hardware requirements of the devices in
2015; Worldwide Internet of Things Forecast the three tiers in Fig. 1.1 are very different, by
2015; TechNavio 2015; Machina Research 2015; virtue of their significantly different number and
Bauer et al. 2014; Jankowski et al. 2014; Dobbs level of pervasiveness. The number of IoT nodes
et al. 2015; Digital Universe of Opportunities is expected to be approximately two orders of
2014), [IoT Analytics, Oct. 2014], (http://www. magnitude larger than the number of con-
postscapes.com/internet-of-things-market-size/). centrators, which in turn is plausibly higher than
Some forecasts question such fast growth and the number of server blades by another two orders

# DEVICES WORLDWIDE
size
cost/item USERS compute- communication- sense-
power ability ability ability status
100 Millions
illions 100-1,000
meterss GFLOPs well deployed
CLOUD 10,000 $
1,000-10,000 keeps expanding
1,000 W
10 Billions
ions 100-1,000
INTERNET GATEWAYS/ 10 cm MFLOPS well deployed
OF THINGS CONCENTRATORS 10+ $ keeps expanding
1 -10 W
1 Trillion
on 1 -100 commercial nodes
1 -10 mm MFLOPS still far from IoT
IoT NODES
1$ target
000 mW
0.1-1,000

PHYSICAL
WORLD

Fig. 1.1 A simplified architecture of the IoT


1 IoT: Bird’s Eye View, Megatrends and Perspectives 3

of magnitude. To be embeddable in objects and following sections. As a result, IoT technologies


the living environment, the form factor of IoT currently tend to substantial fragmentation, pos-
nodes is expected to be in the scale of millimeters, ing a fundamental challenge in terms of economy
which is at least an order of magnitude smaller of scale and interoperability, which adds to the
than concentrators, whose size is expectedly in the expected fragmentation due to lack of
same order as today’s wireless routers (10-cm standardization in this early phase of its
range). The form factor of server blades is another development.
order of magnitude larger. The cost target for IoT Internet cloud services and wireless networks
node is widely accepted to be in the 1-dollar range will be greatly affected by the expansion of the
(Ricker et al. 2016), as might be expected by IoT, due to the large number of connected nodes.
observing that an average customer in the con- The IoT is indeed being responsible for data del-
sumer electronics market would likely spend as uge issues that impact the network traffic, and the
much as a top-of-the-line smartphone to populate power associated with wireless communications.
their home and objects with 1000 IoT nodes. Regarding the data deluge, currently only 1% of
Concentrators are allowed to have a larger cost enterprise data is being used to generate valuable
in view of their lower number, expectedly by an knowledge, and is mostly utilized for alarms or
order of magnitude at least, considering their real-time control (Dobbs et al. 2015). Such poor
non-trivial computational and wireless bandwidth data utilization for useful purposes and value
requirements. In turn, cloud servers clearly entail creation will further worsen due to the volume
a larger cost by at least two orders of magnitude. increase determined by the IoT, as the worldwide
Similarly, concentrators are expected to deliver at data is expected to grow by 4 in 2015–2020
least two orders of magnitude more compute (Digital Universe of Opportunities 2014;
power compared to IoT nodes, which can Jankowski et al. 2014; https://newsroom.cisco.
typically have very limited (e.g., sub-Mega com/press-release-content?type¼press-release&
Operations per Second—MOPS) or moderate articleId¼1771211). By 2020, data generated by
computational capability (100 MOPS). Cloud IoT devices will account for 10% of the world’s
servers are certainly required to have a much data (Digital Universe of Opportunities 2014)
larger computational power compared to (i.e., approximately 44 zettabytes). Hence, the
concentrators, by at least two orders of magnitude. IoT will demand better data utilization as well
Due to their large number and ubiquity, IoT as pre-selection and filtering of valuable data to
nodes need to be untethered and hence their be processed and stored in the cloud. Regarding
power budget is very small, and is as low as the volume of wirelessly transmitted data, in
sub-μW for miniaturized systems powered by 2020 the IoT is expected to generate 1000
energy harvesters. Due to their larger size and more data than in 2015 (Digital Universe of
lower density, concentrators are expected to be Opportunities 2014), with an overall power con-
mostly tethered, and hence their power can be sumption that would become comparable to the
much larger (e.g., in the order of Watts). A server expected total worldwide energy production of
blade dissipates a power that is two orders of 25 PWh (Callewaert 2016). Accordingly, the
magnitude larger. wireless power consumption in the IoT needs to
IoT nodes have design requirements that are be substantially reduced for sustainability
markedly different from existing Internet- reasons, which adds to the issues raised by the
connected devices (e.g., networked computers tight power limitations of IoT nodes (see later). In
and smartphones), as they aim at facilitating addition, the large number of IoT nodes requires
convergence of several tasks onto a single plat- an acceleration in the transition from the 32-bit
form (Jankowski et al. 2014). Instead, IoT nodes IPv4 Internet protocol suitable for 4  109 dif-
need to pursue hardware specialization and appli- ferent addresses, to the 128-bit IPv6 protocol that
cation specificity, mostly for the very stringent can handle up to 1038 addresses, with some chal-
power requirements, as discussed in the lenge imposed by the different 64–96 bit length
4 M. Alioto

of RFID identifiers (Atzori et al. 2010). Finally, buildings and nations, toys, worksites, smart
the Internet as we know it today was mostly infrastructures, energy, lifestyle/entertainment,
designed for non-real-time sharing of documents among the others.
and data, with resiliency being the main concern As opposed to previous technological waves
(Greenemeier et al. 2013). Due to the generation in the semiconductor history, the IoT is the first
of large amounts of real-time data, the IoT pushes one that is so pervasive that it becomes invisible
the Internet towards its limit and hence needs to to the users, with several implications on the
be structured in a more decentralized manner to value capturing in the semiconductor industry.
assure sustainable scalability. For example, only 5–10% of the IoT technology
The above issues related to the IoT data del- spending is expected to fuel the semiconductor
uge are drastically mitigated by moving intelli- industry market (Dobbs et al. 2015), whereas
gence from the cloud to the concentrators and more value (15–20% each) will be captured by
most importantly to the IoT nodes in Fig. 1.1, software and integration services. Plausibly,
i.e. making the IoT nodes “smarter” most of the value of the IoT will come from the
(or “cognitive”, if intelligence means ability to data aggregation and the real-time response
detect and classify patterns) than they are today. (or actuation) of cloud services, as well as the
Indeed, pre-processing in the IoT nodes and more demand prediction for new proactive services
distributed intelligence reduce the data volume, and tasks that no longer need us to “push a
as only partially aggregated data needs to be sent button” (or click a mouse) to be executed. To
over the network, as opposed to raw data. capture more value from the large market volume
and by delivering integration services (e.g., from
IoT nodes to software for data aggregation and
1.2 Brief Review of IoT sensemaking), semiconductor companies will
Applications likely become more vertically integrated through
acquisitions, close partnerships and industrial
1.2.1 Considerations on the IoT consortia. As further benefit, this trend will also
Market Volume favor IoT node inter-operability and
standardization.
The IoT as a whole is inherently a general-
purpose technology, similarly to computers and
mobile devices in the past decades. Like any 1.2.2 Summary of Current
other general-purpose technology, it can boost and Prospective Applications
true productivity and create a value that is sub- of the IoT
stantially higher than its market size, as it can
serve as catalyst for bigger change (Brynjolfsson The IoT is a very fragmented application sce-
and Hitt 1998). Indeed, the IoT can further nario (Vermesan and Friess 2014), and
improve efficiency, economy of scale, ability to encompasses a wide range of applications, some
react to and predict demand in capex, labor and of which are summarized in the following.
energy. Also, the IoT is expected to enable better In the agriculture sector, the IoT infrastructure
coordination and usage monitoring of buildings, can monitor the quality, the actual usage and the
machinery, manufacturing processes, factories, availability of resources, for better and predictive
supply chain and resources. The IoT will impact management (e.g., irrigation) and storage (e.g.,
a very wide diversity of applications, from agri- avoid waste of feed and fertilizing). Monitoring
culture to consumer products, automotive, the environmental conditions permits to support
healthcare, retail, manufacturing and supply the growth of animals and plants (e.g., aquacul-
chains (e.g., Industry 4.0), telecommunications, ture), optimally time the next course of action,
logistics, public sector, financial, transportation and ultimately assure quality (e.g., wine) and
and shipping, smart environments from homes to raise the efficiency in the production process.
1 IoT: Bird’s Eye View, Megatrends and Perspectives 5

In automotive, the IoT enables the monitoring objects for personal care and hygiene can be
of the state of a vehicle down to its critical used to remind of regular but infrequent care
components, from initial shipping to usage, to activities based on dentist’s suggestions shared
assess their correct utilization (e.g., detecting in the cloud, and motivate positive behavior in
bumps, vibration) and maintenance (e.g., open- children. Smart clothing can remind of periodic
ing of containers, wearing parts). Based on actual cleaning based on actual usage. Smart toys can
usage, predictive maintenance can be performed be selectively enabled only upon the occurrence
to lengthen the vehicle lifetime, and lower the of desired conditions to create positive habits
upkeep. Such capabilities enabled by the IoT are (e.g., only at certain times or lighting conditions),
also very useful in fleet management and car and prevent danger by disabling them under the
sharing services. Also, distributed sensing and presence of others (e.g., toddlers). Smart jewelry
global sensemaking enables traffic control can be used to unobtrusively track activity, mea-
through differentiated and personalized road sure exposition to solar light and other environ-
pricing to encourage virtuous behavior and pri- mental conditions, and make emergency calls.
oritize tasks for commercial (e.g., car pooling Energy management at different scales can be
with multiple passengers sharing cost) and pri- made more effective by the IoT. At the city scale,
vate vehicles (e.g., fast delivery for critical the smart grid offers several opportunities to
goods), through virtual/dynamic city area leverage the sensing and sensemaking
boundaries. capabilities of the IoT to optimize the energy
In public transportation, the occupancy and usage across many users, a better coordinated
utilization can be monitored to assure an ade- usage and planning of alternative energy sources,
quate quality of service, detect potential danger ultimately reducing the overall energy and the
(e.g., potential collision between vehicles and currently large gap between the peak and the
pedestrians), and predict short term demand average consumption.
based on crowd monitoring in strategic locations. Health care is another important application
On the road side, excessive congestion and pol- area in which the IoT promises to fundamentally
lution can be managed with real-time demand- contribute to. As few examples, the miniaturiza-
response schemes where the road pricing is tion and long lifetime of IoT nodes provides an
dynamically adjusted through real-time obser- unobtrusive mean to constantly monitor vital
vations and utilization prediction, based on signs and other related parameters (e.g., behav-
previous history and real-time data in strategic ioral) and develop deeper understanding of the
locations. Also, the transportation of dangerous patient’s health evolution. In addition, the avail-
goods and the circulation of slow (or frequently ability of big data from a large number of patients
stationary) vehicles can be optimally coordinated offers an unprecedented opportunity to explore
with the ordinary traffic to minimize their nega- correlations, build models and tools for predic-
tive impact. Again, the IoT offers unprecedented tive diagnosis, early treatment and make drug
opportunities to share resources efficiently, while discovery more efficient and effective. Similar
preserving their running condition. considerations hold for the elderly and the
Consumer electronics substantially benefits disabled, as constant non-obtrusive monitoring
from the IoT, as its pervasiveness permits to allows for better and highly responsive/predic-
track smart goods (e.g., positioning systems for tive care, while preserving individual’s indepen-
object retrieval) and detect their exposure to dency and offloading hospitals. Remote
anomaly conditions (e.g., overheating, physical supervision also enhances the ability to share
shocks). IoT sensing can signal spatial professionals across a larger number of
co-presence of objects and specific people (e.g., individuals and patients, thus driving the care
kids) to signal potential danger, or to recommend cost down.
activities to complete when all necessary objects Industrial processes and logistics can also
tools are available in the same space. Smart highly benefit from the IoT, as it can enable
6 M. Alioto

ubiquitous sensing of operating conditions, real- Smart homes promise to automatically order
time tracking of semifinished products, detection supplies before exhaustion and upon the avail-
of events that slow down the process throughput ability of online offers, as a major step forward
and potential safety issues. The data generated in compared to the today’s Amazon Dash wireless
the production line can be intelligently shared button that simply orders goods online when the
with the quality assurance process and across button is pushed (Amazon Dash). At the same
different sites, to raise the yield and reduce time, waste management will be made more effi-
cost. On the warehouse side, product location cient by sensing the actual demand, and pricing
and storage conditions can be tracked for more based on actual consumption habits, thus encour-
efficient product delivery and distribution. Simi- aging virtuous behavior. The IoT can also make
larly, sharing real-time and historical data on residential compost recycling easier and
parts with the procurement process makes automated, through the monitoring of humidity
restocking more efficient, and reduces the inven- and temperature trends.
tory cost through more strategic purchasing Smart buildings can leverage the IoT to be
strategies. The presence of IoT within machines more adaptive to the actual demand and needs
can enable early and self-diagnosis, predictive of the occupants, while ensuring the highest
(rather than reactive or pre-scheduled) mainte- safety and comfort standards. Indeed, air quality
nance, pre-emptive vendor support to prevent and thermal/acoustic/visual comfort can be mon-
known failures. Again, the IoT enables better itored and controlled for the first time with a
economy of scale, efficiency and makes pro- granularity that goes down to the single room,
cesses leaner. with obvious advantages in terms of comfort
In the area of retail, smart malls can provide assurance and energy cost. Beyond normal build-
real-time shopping recommendations, matching ing operation, the real-time capability of the IoT
available offers with individual customers, dis- enables the ability to respond to critical events
card products for potential customers with (e.g., fire) quickly, minimizing the human and
allergy issues and provide other personalized material losses in case of emergencies.
services (e.g., for customer fidelization). The Through the IoT, smart cities can manage
tight coupling between the individual and collec- resources more efficiently, be made much more
tive customers’ behavior, the store setting and resilient to temporary malfunctions and disasters,
the warehouse permits to streamline the inven- and encourage virtuous behavior. Smart and
tory management, offer better shopping experi- weather-adapting lighting, water/gas leakage
ence, dynamically adjust in-store display based monitoring, smart parking with dynamic pricing
on the predicted demand, and cut inventory costs. and area allocation, no physical boundaries and
Through the IoT, smart homes can manage automated parking advice are just a few
utilities more efficiently by controlling individ- examples of how to use the IoT to solve today’s
ual appliances based on actual utilization and urban challenges. Ubiquitous vision can enable
needs, and purchasing electricity when cost is an unprecedented level of safety and security,
lowest within the day in demand-response energy detecting potential danger and provide crucial
pricing schemes. Unprecedented levels of secu- information on crowd behavior and citizens’
rity (e.g., perimeter access control) are achiev- needs (e.g., for adaptive and predictive transpor-
able thanks to the pervasiveness of IoT nodes and tation management, real-time digital signage
sensemaking ability. Occupant recognition recommendations to prevent immediate danger).
permits to adjust lighting, sound, air condition- Other than enabling ubiquitous and augmented
ing/heating based on individual preferences. This surveillance, vision in IoT offers physical aug-
can be done in a predictive manner, so that mentation to social media and recommendation
occupants do not need to “push any button”, systems (e.g., venue recommendation based on
leveraging the fine-grain knowledge of crowdedness, and crow sentiment), and human
occupants’ habits and the ability of the cloud to activity monitoring to achieve better match
generalize and extract trends and predictions. between demand and supply of services
1 IoT: Bird’s Eye View, Megatrends and Perspectives 7

dynamically. Similarly, distributed audition


permits to develop situational awareness, build
real-time noise urban maps to mitigate noise form
pollution at critical times, and localize noise factor
events for safety assurance. Smart irrigation of access to
security
green spaces and parks is another sub-area where power
the IoT has potential to make an impact. Smart
tourism promises to give tourists the ability to cost lifetime
have an immediate understanding of the city,
such as availability, crowdedness or quietness
of different places to receive dynamic wireless
always
recommendations on tours that adapt to their connected

disposition, other than already available factual sensing &


processing
information on places. Waste management can
be made more efficient and priced fairly as
discussed before, while detecting potentially
dangerous and inappropriate waste that would Fig. 1.2 Summary of IoT node requirements
need to be disposed with different procedure
(again, encouraging virtuous behavior).
required capabilities of IoT nodes and user
Smart infrastructures will also benefit from
requirements, as summarized in Fig. 1.2 and
the IoT in terms of safety (e.g., structural moni-
discussed in the following.
toring of bridges) and security (e.g., automated
identification of unattended bags and suspicious
behavior). Distributed IoT nodes will enable
gesture-based natural human-infrastructure inter- 1.3.1 Physical Constraints
face, where users do express their preferences
anywhere and any time, being constantly Physical constraints of IoT are dictated by size
observed, instead of pushing buttons on an elec- considerations and the necessity to untether IoT
tronic booth or controllers (e.g., thermostat). This nodes and avoid any maintenance (e.g., battery
also introduces the new capability to average out replacement), as dictated by their large number.
requests and preferences from multiple users, Regarding the form factor, IoT nodes need to be
thanks to the distributed nature of such human- sufficiently small to make the deployment of IoT
infrastructure interface. nodes non-intrusive, with a typical volume rang-
In the areas of wildlife and nature preserva- ing from cubic millimeters to hundreds of mm3.
tion, the IoT can monitor both the activity and the Being untethered, IoT nodes need to be energy
living conditions of wildlife, as well as the qual- autonomous and rely on a battery and/or an
ity of available natural resources (e.g., water), energy harvester as energy source (Alioto 2012).
their level of pollution, forest fire detection, In purely battery-powered nodes, the average
earthquake early detection, counteraction of ille- IoT node power Pavg needs to be small enough to
gal activities against wildlife. achieve the desired lifetime tlifetime ¼
Ebattery =Pavg , for a given battery energy capacity
Ebattery. Figure 1.3 shows the lifetime of an IoT
1.3 Requirements of IoT Nodes system versus its average power consumption,
assuming optimistically that the battery self-
The distinctive features of IoT nodes nodes are leakage and ageing are negligible. From this
defined by the requirements imposed by IoT figure, smartphone or button cell batteries assure
applications in terms of physical constraints, a reasonably long lifetime of a decade (or more)
type of interaction with the external world, for Pavg in the order of few hundreds of nWs.
8 M. Alioto

1E+12
1E+10
10 years
1E+8
1 year
lifeme (s)

1E+6 1 week
1E+4 1 hour
1E+2 1 minute
1E+0
1E-2
nW mW mW
1E-9 1E-8 1E-7 1E-6 1E-5 1E-4 1E-3 1E-2 1E-1 1E+0
power (W)
smartwatch battery button cell battery thin-film battery

type cost capacity volume energy density


GH43-03992A 30$ 300 mAh 2,400 mm3 0.12 mAh/mm3
LR44 <1$ 150 mAh (non- 500 mm3 0.28 mAh/mm3
rechargeable)
Cymbet CBC005 0.2$ 5 mAh 0.7 mm3 6.5 mAh/mm3

Fig. 1.3 Lifetime vs. average power consumption for different batteries

Larger Pavg mandate the addition of an energy to deliver the peak power, if the former does not
harvester, whose size is generally proportional to have adequate instantaneous power capability, as
Pavg. Figure 1.4 shows the harvester size required dictated by its size.
for a given Pavg for various energy sources. From As opposed to purely battery-powered systems,
this figure, millimeter-sized photovoltaic energy harvested IoT nodes can operate nearly-
(indoor), thermo-electric (on-body patch) and perpetually, as long as the harvester power
airflow (indoor) harvesters can indefinitely sus- exceeds Pavg (i.e., the harvester size is large
tain Pavg in the order of μWs (Alioto 2015). Tens enough), and can hence indefinitely sustain the
of μWs are sustainable under more abundant power required by the IoT node. On the other
energy sources, such as photovoltaic (outdoor), hand, an increase in the targeted lifetime tlifetime
thermo-electric (industrial machines) and body ¼ Ebattery =Pavg for a given Pavg requires the adop-
vibration (e.g., walking) harvesting (Alioto tion of proportionally larger batteries. Hence,
2015). GSM radio-frequency energy harvesting energy harvesters are invariably more compact
can instead sustain only tens to very few than batteries for long enough lifetime targets.
hundreds of nWs. From Fig. 1.3, printed and Figure 1.4 shows the breakeven lifetime at which
solid-state batteries (see Chap. 15) enable harvester and battery have the same size, assum-
aggressive miniaturization at the cost of much ing a battery with energy density equal to typical
shorter lifetime, whose extension requires the alkaline button cell batteries (e.g., LR44). From
addition of an energy harvester in all practical this figure, harvesting is always more compact for
cases. As a third energy source option, the battery all practical lifetime targets under abundant
can be suppressed altogether by pairing the energy sources, such as photovoltaic (outdoor),
energy harvester with a small energy source thermo-electric (industrial machines) and body
(e.g., off-chip supercapacitor, on-chip capacitor) vibration (e.g., walking) (Alioto 2015). On the
1 IoT: Bird’s Eye View, Megatrends and Perspectives 9

Fig. 1.4 Lifetime 1000


vs. average power
consumption for different
batteries
100

harvester size (mm)


10

IoT node
1
size target

0.1
0.1 1 10 100 1000 10000
avg power Pavg (mW)
photovoltaic (indoor) photovoltaic (outoor)
thermal (human) thermal (industrial)
vibration/motion (human) air flow (indoor)
electromagnetic (GSM)

other hand, harvesters become more compact than other living environments and infrastructures,
batteries for targeted lifetimes of 2–3 years and this translates into a lifetime of several decades.
longer, and hence in most of IoT applications. Industrial applications, transportation and
GSM radio-frequency harvesters are instead shipping might require a shorter lifetime,
always larger sized than the battery counterpart. although still in the order of a decade. The life-
Regardless of the specific lifetime target and time requirement can be further relaxed in other
energy source architecture, the volume of IoT applications such as retail, worksites, lifestyle/
nodes is certainly dominated by off-chip entertainment. Hence, the above considerations
components, and in particular by the energy on the power budget of IoT nodes apply almost
source, as the antenna can be made very thin. In unmodified in a very wide range of applications.
other words, the size of IoT nodes is essentially Meeting power budgets of few μWs or below is
set by their power consumption, which hence feasible only if the IoT node actively performs
become a very stringent and crucial requirement tasks (e.g., sensing, processing) only infrequently.
in any IoT node design. In other words, power needs to be aggressively
reduced by duty cycling the IoT node operation,
alternating active tasks and long sleep periods as
depicted in Fig. 1.5, with periodicity set by the
1.3.2 Interaction with the External
wake-up cycle Twkup. From an architectural stand-
World
point, this means that IoT nodes are organized into
an always-on (ALWON) sub-system that manages
With reference to Fig. 1.2, the interaction of IoT
the periodicity of the wake-up cycle and stores
nodes with the external world needs to last at
information across active tasks, and a duty-cycled
least the lifespan of the object/environment they
(DCYC) sub-system that periodically performs the
are embedded in, as battery replacement is not an
active task (Alioto 2012). Hence, the average IoT
option due to the large number or the inaccessi-
node power can be written as the sum of the
bility of nodes. When deployed in buildings or
10 M. Alioto

Fig. 1.5 Lifetime active mode active mode


vs. average power
consumption for different
batteries
sleep mode sleep mode

Twkup

ALWON power PALWON and the DCYC energy designed resolution (and hence higher cost and
EDCYC (Alioto 2012): power). From Fig. 1.6, most of IoT applications
require a minimum resolution that is below
EDCYC
Pavg ¼ PALWON þ : ð1:1Þ 12 bits, and 8 bits are sufficient for a rather
T wkup wide range of practical cases. On the secondary
From Eq. (1.1), the power reduction can be y axis, the figure also reports the energy per
reduced by reducing the power (energy) of the conversion, assuming an energy per conversion
always-on (duty-cycled) sub-system. In other step of 30 fJ, which is relatively optimistic espe-
words, nearly-minimum power design needs to cially for larger resolutions (Murmann 1997).
be pursued in the always-on sub-system, while The datarate range of the above sensors is plot-
nearly-minimum energy design is the objective ted in Fig. 1.7. This figure shows that most of the
in the duty-cycled one (see Chap. 4). Of course, sensors require only thousands of bits per second
larger Twkup and hence more infrequent active when operating continuously, whereas tasks
operation mitigates power, although Twkup is related to vision and audio processing need orders
upper bounded by the application, depending on of magnitude higher datarates (up to 10 Mbps in
how frequently data needs to be updated. Such the case of compressed VGA video streaming).
system-level tradeoffs are discussed in Chap. 2, From the above considerations, the specifications
whereas approaches to further reduce the leakage of IoT node sensing interfaces are actually quite
power cost of storing information across tasks is relaxed, thus cost and power consumption are far
discussed in Chaps. 5–7. more important aspects than pure performance.
Both challenges are well addressed by tailoring
such circuits around the specific application. The
power consumption of the ADC is proportional to
1.3.3 On-Board Capabilities of IoT the datarate in Fig. 1.7 and the energy per conver-
Nodes sion in Fig. 1.6, and is plotted in Fig. 1.8. From this
figure, the power consumption of ADCs for IoT
IoT nodes need to have sensing, computation, nodes spans a very wide range, mostly because of
and wireless communication capabilities. In IoT the wide energy per conversion range in Fig. 1.6,
nodes design for a specific purpose, sensing can as dictated by the exponential relationship between
be typically made more inexpensive by tailoring resolution and energy (Freyman et al. 2014). This
the MEMS design and the analog interface confirms that tailoring the ADC to the specific
around the specific application. This permits to application is crucial in IoT, and the same consid-
substantially reduce the complexity that is expe- eration applies to most of the other building blocks
rienced by general-purpose platforms, and hence and sub-systems.
the cost. As simple example, Fig. 1.6 shows that Let us now consider the case where the raw
sensors for IoT applications cover a wide range sensor data is transferred directly to
of resolutions, hence using the appropriate ADC concentrators and cloud. Assuming a best-in-
resolution (see Chap. 13) is necessary to avoid class radio consuming 5 nJ/bit (ISSCC 2016),
using general-purpose platforms with over- the resulting power to wirelessly transmit such
1 IoT: Bird’s Eye View, Megatrends and Perspectives 11

Fig. 1.6 Resolution range required by various sensors

wireless power (W)


1E+8 5E-1
datarate (bps)

1E+6 5E-3

1E+4 5E-5

1 mW
1E+2 5E-7

1E+0
sensor
heart rate
humidity (capacitive)
battery monitor
temperature
accelerometer
magnetometer
altimeter/pressure
imager (VGA, RGB)
imager (MP4 compressed)
infrared proximity
gyroscope
microphone
CO2
light
strain
ultra-violet

Fig. 1.7 Datarate range required by various sensors, and wireless power required to continuously transmit data
(energy/bit assumed to be 5 nJ/bit (ISSCC 2016))
12 M. Alioto

ADC power consumption (W)


1E-03
1E-04
1E-05
1E-06
1E-07
1E-08
1E-09
1E-10
sensor
heart rate
humidity (capacitive)
battery monitor
temperature
accelerometer
magnetometer
altimeter/pressure
imager (VGA, RGB)
imager (MP4 compressed)
infrared proximity
gyroscope
microphone
CO2
light
strain
ultra-violet
Fig. 1.8 ADC power consumption under sampling rate associated with the datarate in Fig. 1.7, assuming an energy/
conversion step of 30 fJ

raw data is reported on the secondary y axis in communication. For example, the IoT node can
Fig. 1.7. From this figure, most of sensors cer- be proactive and monitor for critical or important
tainly exceed 1 μW and hence the range of prac- events (e.g., the crossing of a threshold, or an
tical IoT node power targets mentioned above. increase rate larger than a pre-set value), and
Hence, mere computation offloading to transmit data only upon their occurrence. From
concentrators and cloud through raw data trans- Fig. 1.7, this is particularly crucial in applications
mission is not an option for IoT nodes operating involving large datarates, such as continuous
continuously. vision and audio sensing. In such applications,
For some environmental sensors (e.g., temper- more intelligence needs to be embedded in the
ature, CO2, light, UV), duty cycling discussed in IoT node, such as the ability to perform pattern
the previous subsection is applicable since recognition and classification. Other options to
measurements do not need to be taken continu- trade off computation and communication are in
ously, as the related phenomena exhibit slower the choice of the data representation and sampling
time constants. For such sensors, a duty cycle of approach (e.g., compressive sensing, including
percentage points reduces the average datarate computation in the compressive sensing domain
down to hundreds of tens of bits/second, and the (Shoaib et al. 2015)), as well as signal dimension-
power down to tens of nWs. Often times, the other ality reduction (e.g., in-node feature extraction,
sensors cannot be duty cycled as the dynamics of which is equivalent to compression, with the fur-
the related phenomenon does not really allow it ther advantage that it is often a necessary task to
(e.g., accelerometers, gyroscopes, imaging, be performed anyway in many algorithms).
audio). In these cases, further power reduction From the above considerations, the wireless
can be achieved by leveraging the well-known power is always an issue in IoT nodes, and hence
computation-communication tradeoff (Min et al. requires the choice of appropriate communica-
2001), moving computation onto the IoT node tion standard for the intended range and datarate,
to reduce the volume of wireless data as will be discussed in the next section.
1 IoT: Bird’s Eye View, Megatrends and Perspectives 13

1.3.4 User Constraints The above challenges need to be addressed


through platform-based design and moderate
Other important requirements of IoT nodes come reconfigure-ability to reduce the design cost and
from the user, and are mainly related to cost and widen the range of targetable applications, espe-
security. Regarding the cost, consumer cially in consumer electronics. On the other
applications dictate a target of approximately 1 hand, such challenges are mitigated in
$/node as was discussed in Sect. 1.1 (which limits applications where the ability of directly receiv-
the die cost to a fraction of it). This clearly puts ing information on large numbers of objects is
pressure on the financials of the semiconductor particularly valuable. For example, this is the
industry due to the limited room for profit margin, case of manufacturing, logistics and smart cities,
and can be addressed through large sales volumes whereas it is not for single users in a smart home.
(say, at least several tens of millions per year) and Security is another important requirement
specialized hardware for reasons related to cost, coming from the user, as the IoT offers a very
power and form factor (see previous subsections). large number of backdoors to attackers, in view
On the other hand, achieving such volumes is diffi- of its large scale. In addition, traditional solutions
cult even deploying an IoT with a trillion devices. to counteract cyber-attacks (e.g., firewall, cryp-
Indeed, the IoT space is highly fragmented, and tography) are not applicable to IoT nodes, due to
only few applications are so pervasive that they their very limited power budget and cost. As fur-
require such large volumes (https://www. ther challenge, the dispersed deployment of IoT
mckinsey.de/files/mckinsey-gsa-internt-of-things- nodes makes it hard to keep track of individual
exec-summary.pdf). Similarly, it is hard to justify nodes, thus exposing them to physical attacks.
the non-recurring engineering cost of a new chip Such challenges require novel security
design for applications that do not require more approaches that embrace the hardware level rather
than tens of million pieces. This will require the than being confined at the network or software
development of an ecosystem that favors design level, in order to reduce the cost and energy, and
reuse, and platform-based design approaches. assure chip-level authentication (see Chap. 8).
Regarding the recurring costs, IoT node cost
reduction certainly requires more aggressive
on-die integration to limit the cost of off-chip 1.4 Looking at the Past: IoT
component assembly and testing. For example, as Natural Outcome
circuits for power delivery and harvesting need to of Technological Trends
avoid off-chip passive components (see Chaps. 10
and 11), and innovative integration techniques and The IoT can be shown to be a natural conse-
packaging becomes crucial to assemble heteroge- quence of historical trends that are relevant to
neous components in an inexpensive and ultra- its distinctive features, such as size, energy, sales
compact manner (Heterogeneous Integration volume, cost, with other software implications
Roadmap) (see Chap. 16). discussed at the end of the section. Other
As an additional challenge, IoT nodes are considerations on the evolution of the communi-
required to have a long lifetime (e.g., decades), cation infrastructure will be made in Sect. 1.7.
which translates into a missed opportunity to The Bell’s law observes that a new computer
replace the nodes for a very long time. In the class has appeared every 10 years, thus bringing
long run, this will expectedly make the IoT exponential improvements in computer size
market very different from the consumer mar- (100 smaller every 10 years) and cost (Bell
ket, which typically relies on periodic new et al. 1972; Bell 2008; Fojtik et al. 2013), as
waves of demand stimulated by incoming summarized in Fig. 1.9. This has driven the com-
generations of products with improved features puter market expansion in computer units by a
(and predictable release timeline, which allows factor of 10–20 every 10 years (Tsai 2014), as
planning). in Fig. 1.9. Based on the current dominant wave
14 M. Alioto

Fig. 1.9 Scaling laws of computer market size and system volume

of mobile computing, the next technological concept of “learning curve” for the silicon
wave is expectedly composed of systems with a manufacturing, as plotted in Fig. 1.11. In general,
volume of hundreds of mm3 or lower, i.e. the the learning curve comes from the observation
IoT. Essentially, the last few decades have seen that the doubling in the large-scale cumulative
computers and integrated systems move from production typically leads to a fixed reduced cost
desks to pockets, and then anywhere. per unit at any point in time (Jaber 2011). The
The energy efficiency trend for computers and semiconductor industry is only one of the many
Digital Signal Processors is well captured by the examples of steady (exponential) cost reduction
Koomey’s (Koomey et al. 2011) and Gene’s law that comes from the accumulated knowledge and
(Frantz et al. 2000; Karam et al. 2009), which are improvements in the overall design/production
representative of control-heavy and data-heavy process. In particular, Fig. 1.11 shows that the
architectures. As summarized in Fig. 1.10a–b, transistor manufacturing cost fits a 55% learning
both classes of computing have equally benefited curve, as doubling the overall number of
from technology advances, which can be manufactured transistors reduces the cost to 55%
expected to hold true in the future as well. Histor- of the original cost. Such relentless cost reduction
ically, most of 100 energy reduction achieved is at the basis of the Moore’s law (Jovanovic and
every 10 years has been obtained through tech- Rousseau 2002), which has driven the learning
nology scaling (90% according to (International curve of transistor cost down relying on a steady
Technology Roadmap for Semiconductors (exponential) doubling of transistors per die at
2015)). In the IoT domain, energy efficiency each CMOS generation, assuming the cost reduc-
improvements will not come from technology tion comes from shrinking (Hutcheson 2009)
scaling, as advanced CMOS technologies are (which is no longer true at 20 nm and below due
simply too expensive for the very low cost target to higher lithography cost, thus making those
of IoT nodes, as discussed below. Hence, keeping CMOS generations unsuitable for low-cost IoT
the same pace in the energy reduction requires nodes, as discussed in Sect. 1.8). The cost per
major innovation at system level (Chap. 2), in key transistor has been constantly reduced through
building blocks (Chaps. 3–7, 10–14), through shrinking in the past decades, although this is
innovative design methodologies (Chap. 9), and clearly not the only way to do it. In other words,
suitable batteries (Chap. 15). the learning curve in Fig. 1.11 actually transcends
Regarding the cost of IoT nodes, and in partic- Moore’s law, and is expected to continue for a
ular the cost per transistor, it is useful to recall the long time in spite of the end of the latter (Rhines
1 IoT: Bird’s Eye View, Megatrends and Perspectives 15

Fig. 1.10 Scaling laws of


energy in (a) computers
(Koomey’s law from
(Koomey et al. 2011)),
(b) Digital Signal
Processors (Gene’s law)

(a)

100
(mW/MIPS=nJ/instrucon)
energy per sample

10

0.1

0.01
1970 1980 1990 2000 2010 2020
year
(b)
16 M. Alioto

Fig. 1.11 Learning curve


2-0.8528 = 0.55

revenue per transistor ($)


of transistor manufacturing 1E+0
fi 55% learning curve

1E-3

1E-6

1E-9
1E+09 1E+12 1E+15 1E+18 1E+21
cumulative no. of manufactured transistors

2015), leveraging on the accumulated knowledge replenishment. Circuits to monitor the state of
of semiconductor industry, and various other the battery are included to adapt the power man-
approaches described in Sect. 1.8. agement strategy to the actual energy availability
More considerations on trends in software and (e.g., brown-out detector). Specialized circuitry
communication infrastructure will be made in is embedded for the generation of proper multi-
Sect. 1.8. ple voltages from the single (and not perfectly
stable) voltage coming from the battery. Voltage
booster and battery recharger are inserted
1.5 Looking at the Present: Typical between the energy harvesters and the battery.
Specifications of IoT Nodes A Power Management Unit (PMU) coordinates
the interaction between the energy source, the
1.5.1 Architecture of IoT Nodes harvester and the power mode of the node.
The task performed in active mode (see
A relatively general architecture of IoT nodes is Sect. 1.3.2) is started by a periodic trigger
depicted in Fig. 1.12 (covered in Chap. 2), which (time-driven) or specific events (event-driven),
details its main sub-systems, including depending on the power mode and the specific
processing and security assurance (Chaps. 3–9), application. Time-driven control is determined
power conversion and delivery (Chaps. 10 and by deriving a proper clock from the system
11), analog interfaces (Chaps. 12 and 13), radios clock, whose frequency is typically 32,768 Hz
(Chap. 14), energy sources (Chap. 15), system to serve as Real-Time Clock and hence perform
integration and assembly (Chap. 16). Two accurate timestamping and inter-node synchroni-
examples of detailed architectures of IoT nodes zation. Event-driven control is achieved by con-
are provided in Chaps. 17 and 18. In this section, stantly monitoring digital and analog signals of
a review of the main features of IoT node interest, respectively through digital transition
architectures is provided. detectors and analog comparators that detect the
As in Fig. 1.12, sensors are connected to ana- crossing of a meaningful threshold to signal an
log interfaces that include an amplifier with pro- event occurrence. When more sophisticated com-
grammable gain (and sometimes analog filters), parison is needed for signals coming from
and are multiplexed to share a single ADC for all sensors, the ADC and processing are kept on to
analog channels. Analog voltages are generated acquire samples and process them until the event
by a DAC for actuation. Energy interfaces of interest is detected.
involve an energy storage element (e.g., battery, In Fig. 1.12, memories are an important part
supercapacitor), and energy harvesters for energy of IoT node processing. RAM is needed for the
1 IoT: Bird’s Eye View, Megatrends and Perspectives 17

digital interface
for configuration

SPI
settings

Non-Volatile
Memory (Flash)
analog triggers
EVENT

& interrupt/reset handling


wake-up controller (WUC)
Microcontroller
DRIVEN
digital triggers
Direct Memory
Access (DMA) PROCESSING
internal clock
RAM
TIME Real Time Clock
watchdog
32,768 Hz
DRIVEN crystal generator/
clock timer
(WDT)
accelerators
(crypto, filters…)
external clock
conditioning

Power
Management
Unit (PMU) WIRELESS
COMMUNICATION
other supply DC-DC radio-frequency
voltages conversion RX/TX
ENERGY

system BUS
INTERFACES battery monitor
& brown-out
detection

general-purpose
booster for digital IOs
harvesting &
energy
harvester battery charger

I, V ANALOG
references INTERFACES
general-purpose
DAC analog IOs

sensors ADC

Fig. 1.12 Relatively general architecture of IoT nodes with detailed sub-systems

microcontroller/microprocessor execution, as • Systems on Board (SoBs), using off-the-shelf


well as to store data in sleep mode. The components that are assembled on a printed
Non-Volatile Memory (NVM) contains the circuit board.
instructions, settings and also data that needs to • Systems on Chip (SoCs), usually in the form
be retained for a long time, thus suppressing the of MicroController Units (MCUs) with addi-
leakage power consumption associated with tional peripherals such as analog interfaces
retention. and radios.

To develop some quantitative understanding


of existing IoT nodes, we performed a survey of
1.5.2 Typical Specifications
more than 90 commercially available “motes”
of Commercial IoT Nodes
(i.e., IoT nodes in the form of SoB), more than
30 MCUs and more than 30 sensor hubs (i.e., a
Today’s IoT nodes are implemented according to
sub-system collecting, fusing and processing
various system integration approaches:
18 M. Alioto

Table 1.1 Survey of motes, MCUs and sensor hubs


Motes (SoB) MCUs (SoC) Sensor hubs (SoC)
Min. Avg. Max. Min. Avg. Max. Min. Avg. Max.
Cost ($) 10 295 1000 0.8 12 29 2 7 25
RAM (kB) 2 64–128 512 8 30 32 4 – 16
Flash (kB) 32 32,000–64,000 64,000 2 16 32 – – –
Lifetime (months) 3 9 11 – – – – – –
Volume (mm3) 300 73,000 200,000 – – – – – –
ADC resolution (bits) 8 11 16 10 11 12 12 14 16
DAC resolution (bits) 10 11 12 4 6 12 – – –
RF receiver sensitivity (dBm) 110 – 90 – – – – – –
Min. operating voltage (V) 1.8 3.3 15 1.6 1.6 1.8 1 1.8 3.2

30

20

10

sensor hubs MCUs motes

Fig. 1.13 Survey on sensors available on board in commercial sensor hubs, MCUs (SoCs) and motes (SoBs)

data from multiple sensors to offload energy- This means that even SoCs are still far from the
hungry SoCs, typically encountered on a IoT node cost target, which poses a challenge for
smartphone). The results are summarized in the development of the IoT, as will be further
Table 1.1 and are commented below. discussed in Sect. 1.8.
Regarding the cost, SoBs represents a design Regarding the availability of on-board
approach with very low capital entry barrier, sensors, Fig. 1.13 shows that about 30% of avail-
being the prototyping and development cost rel- able SoB Printed Circuit Boards (PCBs) include
atively small. However, the recurring cost is high a temperature sensor, some also include
and the specifications (e.g., power) are drasti- accelerometers (mostly 3-axis), humidity
cally worse than SoCs, as separate general- sensors. Less than 5% of the SoBs include
purpose components are used. From Table 1.1, on-board light sensors, magnetometers,
the typical cost of motes is in the order of microphones and battery monitors. Other sensors
hundreds of dollars, and only in a few cases it is are available only externally. Expectedly, SoCs
as low as tens of dollars, largely exceeding the have a much narrower choice of on-board
cost requirements of IoT nodes (see Sect. 1.1). sensors, mostly for environmental monitoring
SoCs are an order of magnitude more inexpen- (e.g., temperature, humidity, light), and motion
sive, with a typical cost of approximately 10$. sensors (mostly 3-axis accelerometers and
1 IoT: Bird’s Eye View, Megatrends and Perspectives 19

gyroscopes). Some MCUs have an embedded (which offers lower path loss), with a datarate
battery monitor for power management purposes. ranging from a few kbps to 1 Mbps.
Sensor hubs do not include a battery monitor as Operating voltage: it typically ranges from 1.8
they are employed in smartphone platforms, to 3.3 V in motes and MCUs, as respectively
where battery management is performed by the required by off-the-shelf chips and compati-
Application Processor. bility with commercial peripherals for MCUs.
Further observations can be made on typical Sensor hubs can operate down to 1 V, as
features of existing IoT nodes, from Table 1.1: dictated by the low power consumption
requirements of smartphone platforms and
• RAM capacity: in motes, it ranges from KBs the adoption of such voltages in the related
to 512 KB, and is typically around chip ecosystem.
64–128 KB, as allowed by inexpensive
stand-alone (general-purpose) RAM chips. To build a tighter link between the above
SoCs have much lower capacity in the order specifications and the applications in
of 8–32 KB, as enabled by fairly expensive Sect. 1.2.2, Table 1.2 shows the sensor-
on-chip SRAM. application matrix, which lists typical sensors
• Flash capacity: motes can have up to 64 MB employed in each application area.
of Flash, again as allowed by inexpensive As shown in Fig. 1.14, the power consumption
stand-alone memory chips. MCUs have of current IoT nodes is dominated by the wireless
embedded Flash on chip, which is expectedly communication power, considering short-range
much lower capacity (up to 32 KB) due to communication standards with low average and
their larger cost. Sensor hubs typically do peak current, as required by IoT nodes (e.g.,
not have any non-volatile memory, as they ZigBee and Bluetooth Low Energy—BLE).
can rely on the Application Processor. This is particularly true for nodes employing
• Battery lifetime: the battery lifetime claimed ZigBee radios (see, e.g., Fig. 1.14a), whereas
by mote vendors is in the order of months, the impact of the wireless communication
which is well below the typical IoT node power becomes around 50% of the power budget
targets of decades (see Sect. 1.1). with BLE radios, as they offer better energy per
• Volume: the volume of SoBs is several orders bit (Siekkinen et al. 2012). On the other hand,
of magnitude larger than IoT targets, due to BLE handles only short-range point-to-point
the unavoidably large size of PCBs. Full communications and few nodes, as opposed to
on-chip integration enables substantial system ZigBee or Z-Wave that have true networking
shrinking, with packaging playing a funda- capabilities (e.g., meshes). For IoT nodes that
mental role (see Sect. 1.3.4 and Chap. 16). are connected directly to the Internet, their num-
• ADC resolution: as required by the wide range ber certainly exceeds the available number of
of sensors in Fig. 1.6, it ranges from 8 to addresses in the IPv4 Internet protocol (32-bit
16 bits, although MCUs have a somewhat address, i.e. billions of unique addresses).
narrower range (10–12 bits). Techniques to reuse and mask addresses such as
• DAC resolution: it is in the order of 10–12 bits network address translation and private network
in SoBs, and goes down to much lower values addressing cannot deal with the prospective
in MCUs, which are better tailored to the lousy explosion in the number of connected devices.
requirements of many practical applications. The IPv6 protocol solves this issue, as it
Sensor hubs do not have a DAC, as this is not comprises 128 bits for the address (i.e., virtually
required in sensor data gathering. unlimited unique addresses), and can maintain
• Wireless receiver: it has typically a sensitivity compatibility with IPv4 by using the last 32 bits
of 90 to 100 dBm. The radios of most of for traditional IPv4 addressing. Since early
motes and MCUs operate in the 2.4 GHz band, 2010s, the IPv6 protocol has been rapidly
with very few exceptions working at 433 MHz adopted by new users, and is expected to be
Table 1.2 Sensor-application matrix
Sensor
Galvanic
Altimeter/ Battery skin Heart Magneto Proximity Ultra-
Application Accelerometer pressure monitor Gas response Gyroscope rate Humidity Imager Light Microphone meter (IR) Strain Temperature violet
Agriculture ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓
Automotive ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓
Consumer ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓
Energy ✓ ✓ ✓ ✓ ✓ ✓ ✓
Healthcare ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓
Industrial ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓
Lifestyle/ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓
entertainment
Logistics ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓
Manufacturing ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓
Public sector ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓
Retail ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓
Shipping ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓
Smart ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓
buildings
Smart cities ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓
Smart homes ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓
Smart ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓
infrastructures
Transportation ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓
1 IoT: Bird’s Eye View, Megatrends and Perspectives 21

ADC others ADC


4% 2% 2%

ADC
1 3%
RF receiver ADC
RF processing
1 9% 25%
receiver 1 5%
32%

processing
RF RF 51%
processing processing
transmier transmier
75% 87%
47% 28%

Texas Instruments Texas Instruments NXP/Freescale Kinetis Silicon Labs


CC2620 (ZigBee) CC2640 (BLE) KL2 (no radio) EFM32HG110F64 (no radio)

(a) (b) (c) (d)


Fig. 1.14 Power breakdown of commercial motes (on the left) and MCUs (on the right)

very widespread by the time the IoT reaches its budget, and demands for energy efficiency
inflection point. From the IoT node point of view, improvements by orders of magnitude compared
the adoption of IPv6 entails some communica- to today’s commercial IoT nodes in Sect. 1.5.2.
tion overhead, due to the increase in the length of In practice, excessive power consumption leads
the header by few tens of bytes (Siekkinen et al. to discontinuous operation depending on the
2012). This suggests that IoT nodes carrying residual (instantaneous) energy availability in
little information per measurement (e.g., temper- battery-powered (purely energy-harvested) IoT
ature, humidity) should temporarily store and nodes. As an analogy to the “dark silicon” issue
cluster measurements, and transmit multiple in high-performance SoCs, where a spatial por-
measurements in a burst fashion. tion of the chip needs to be kept off to meet the
When lower power radios are employed (see, power budget (Esmaeilzadeh et al. 2012) (see
e.g., Fig. 1.14b), the power consumed by the Fig. 1.15a), IoT nodes become “dark” (i.e.,
processing becomes a major contribution, and it entirely off) when their power consumption can-
can reduced through substantial voltage reduc- not be sustained by their energy source. In other
tion and other techniques, which will be words, inadequate energy efficiency of IoT nodes
discussed in Chaps. 3–5, 9 and 12–13 from the again leads to dark silicon as occurs in high-
perspective of power utilization, and Chaps. 10 performance SoCs, although on a temporal
and 11 for the power delivery. As shown in dimension rather than spatially (see Fig. 1.15b).
Fig. 1.14c, d, the analog interface consumes a
power that is a fraction of the processing contri-
bution. Chapters 17 and 18 will present two
1.6.1 The Wireless Power Issue
examples of IoT node architectures and designs
and the Communication-
based in SoB and SoC implementations, and will
Computation Tradeoff
discuss the underlying design tradeoff.
As discussed in the previous section, the wireless
communication power contribution needs to be
1.6 Present and Future Challenges substantially reduced to create room for energy
in Chips for IoT Nodes: Energy efficiency improvements. Best-in-class commer-
Efficiency cial radios consume an energy in the order of tens
of nJ/bit, and several academic prototypes with
The scarce and often intermittent power avail- energy around 1 nJ/bit have been demonstrated
ability in IoT nodes tightly constrains their power for both receivers and transmitters with a range in
22 M. Alioto

Fig. 1.15 Dark silicon in (a) high-performance SoCs (spatial dimension), (b) IoT nodes (temporal dimension)

the order of 10 m (ISSCC 2016). Unfortunately, cost of an unaffordable energy cost associated with
limited savings in the energy per bit are expected encoding/coding (e.g., 250 pJ/bit in convolutional
from circuit and modulation techniques in the codes, 800 pJ/bit in Reed-Muller and Turbo codes,
decade ahead, compared to today’s state of the and more than 1 nJ/bit in Reed-Solomon codes).
art. Indeed, from the Shannon-Hartley theorem on Overall, the above advances in ultra-low
the channel capacity (Shannon 1949), the energy power radios will only moderately reduce the
per bit during transmission cannot be really energy per bit, beyond the great achievements
reduced through the choice of the modulation of research in the field in the last 15 years. Such
scheme and spectral efficiency improvements, as energy per bit reduction will translate into an
existing schemes with reasonable low complexity even smaller saving in the consumption of IoT
(e.g., BPSK, BFSK, OOK) are already very close nodes, due to the additional fixed energy cost of
to best-in-class highly complex modulations (unaf- radio start-up (e.g., waiting oscillators to settle to
fordable in IoT nodes, such as M-ary FSK with the correct frequency), which becomes a major
large M), and only 10 dB higher than the minimum energy contributor in typical IoT nodes sending
theoretical limit (Otis and Rabaey 2007). Also, the short packets.
adoption of low-complexity modulation schemes Another way to substantially reduce the wire-
keeps the energy overhead for modulation/demod- less power is to reduce the communication range
ulation low, which in the foreseeable future can be R, as the transmitted power can be reduced
expected to be moderately reduced (e.g., by units, quadratically in free space (or faster, in indoor
not orders of magnitude) through energy-efficient or obstructed environments) when R is reduced
circuit techniques. Channel coding will also bring (Friis 1946). Pragmatically, a 10 wireless
some further energy reduction in the energy per bit, power reduction can be achieved by shortening
using error correcting codes (e.g., Cyclic Redun- the range from the 10-m range down to a few
dancy Check—CRC) and packet retransmission meters. Such range reduction requires a denser
schemes, as customary in most of low-power wire- distribution of concentrators, whose number
less network standards, and recently in cellular increases quadratically when reducing the
standards for IoT (e.g., EC-GSM (Extended Cov- range, as shown in Fig. 1.16. Although the
erage-GSM), EC-GSM, NB-IoT (Narrow Band range reduction and increase in concentrator den-
IoT)). For example, Hamming (Golay) codes are sity is technologically feasible even today, this
able to reduce the energy per bit by 1.4 (2.5) at entails an unacceptable cost increase due to the
the additional computation energy cost of 10 pJ/bit substantially larger number of concentrators, and
(30 pJ/bit) (Desset et al. 2003), as scaled down to considering that their cost is easily tens of times
40 nm. Other codes may potentially reduce the the cost of a single IoT node (the cost of a
energy per bit further, but this would come at the concentrator is probably similar to a wireless
1 IoT: Bird’s Eye View, Megatrends and Perspectives 23

Fig. 1.16 Quadratic R


increase in the number of
concentrators per IoT
system under range
IoT node R R R R
reduction from (a) to (b) concentrator

(a)
R

(b)

router). In other words, reducing the range of IoT power is to make IoT nodes smarter (“cognitive”,
nodes is not an option in real applications due to “attentive”) than they are today, as was observed
economic reasons, rather than technological in Sect. 1.1. In other words, further reductions in
ones. Accordingly, the range of IoT nodes is IoT node consumption are subject to
expected to remain in the 10-m range for most improvements in the energy cost associated
of the prospective applications. with processing. As aggressive technology scal-
The above trend in the IoT node range is ing is not an option in IoT nodes (see Sect. 1.8),
somewhat in contrast with cellular various tradeoffs need to be explored to further
communications, in which the distance between reduce energy, as discussed in the next
phones and the base station keeps shrinking, due subsection.
to the more pervasive adoption of small cells To express the communication-computation
(Keene 2015) (e.g., pico and femto cells), and tradeoff more quantitatively, we will refer to
the integration with non-licensed bands offering “simple” IoT nodes if the wireless power while
more capillary distribution of access points (e.g., transmitting raw sensed data Pwireless,raw is well
WiFi as in LTE-U networks, which clearly pose below the targeted power budget Pbudget (i.e.,
other challenges due to the expansion of the Pwireless,raw/Pbudget  1). In other words, the
cellular traffic into existing bands (Ngo et al. wireless power for raw data is not a major bottle-
2016)). This difference is explained by the dif- neck to the overall power consumption, and
ferent economy of scale. Indeed, the cost of cells likely intelligence can be entirely kept in the
in the cellular network infrastructure is shared by cloud (i.e., there is no fundamental need for
a very large number of users, whereas IoT on-board intelligence to bring down the power).
sub-networks are spatially fragmented and tend On the other hand, we will refer to “smart” nodes
to be proprietary, thus their cost is born by a when the wireless power associated with raw data
smaller number of users per device. As further would simply exceed the power budget (i.e.,
fundamental difference with respect to cellular Pwireless,raw/Pbudget  1), hence on-board
and broadband networks, the preferential traffic processing has been necessarily added to aggre-
direction is expected to be upstream, as massive gate information in a more compact form that can
quantities of physical data are used to receive be transmitted more inexpensively. Assuming a
aggregate information and decision outcomes radio with state-of-the-art energy per bit Ebit, the
(Zhang et al. 2015). notion of “simple” and “smart” nodes essentially
The above considerations again confirm that depends on Pbudget set by the energy source, and
the most effective way to reduce the wireless Pwireless, raw ¼ Ebit  N bit, measure  f measurements , being
24 M. Alioto

1E+0
“smart” nodes 20-yr lifetime
Pwireless,raw (W) 1E-3 (Pwireless,raw>>Pbudget)
1E-6 Pbudget = 1.3 mW

1E-9
“simple” nodes
1E-12 (Pwireless,raw<<Pbudget)

1E-15
sensor
heart rate
humidity (capacitive)
battery monitor
temperature
accelerometer
magnetometer
altimeter/pressure
imager (VGA, RGB)
imager (MP4 compressed)
infrared proximity
gyroscope
microphone
CO2
light
strain
ultra-violet
Fig. 1.17 Average wireless power Pwireless,raw required LR44 battery under 20-year lifetime target. IoT nodes
by sensors in Fig. 1.7, assuming best-in-class 5-nJ/bit above the black dashed line need on-board intelligence
radio and comparison with power budget imposed by to reduce the amount of wireless data communication

Nbit,measure the number of bits transmitted per core knowledge that needs to be extracted from
measurement, and fmeasurements the number of the data (i.e., some of the cloud computation is
measurements per unit of time. moved into the IoT node). The above
Numerical examples are presented in considerations are all relative to the available
Fig. 1.17, for various sensors with datarate power budget, hence relaxing (tightening) Pbudget
taken from Fig. 1.7, assuming an LR44 button has an immediate effect on which sensors need to
cell battery as only source of energy with capac- be smart (essentially, the black dashed line in
ity of 150 mAh, a 5-nJ/bit radio, and a targeted Fig. 1.17 moves and hence splits the sensors
lifetime of 20 years. From this figure and for into the two categories accordingly).
LR44 battery-powered nodes, environmental As few representative examples of levels of
sensors (e.g., humidity, temperature) and battery semantic understanding encountered in IoT
monitors are “simple” nodes, i.e. there is no point applications, Fig. 1.18 ordered by data succinct-
in adding intelligence on board, as the wireless ness. The simplest type of processing (i.e., lowest
power to transmit raw data is much smaller than level of semantic understanding) above raw data
the power budget anyway. This is mainly due to is compression, which does not really extract any
the aggressive duty cycling enabled by the slow knowledge, or translation into a more efficient
dynamics of the physical quantities monitored. representation (e.g., compressive sensing
All other types of IoT need to be “smart”, in the (Baraniuk 2007; Candès et al. 2006; Donoho
sense that some on-board processing is needed to 2006)). Similarly, a very simple type of
reduce the amount of transferred data. Clearly, processing encountered in IoT applications is
smart IoT nodes with a larger Pwireless,raw can represented by the evaluation of aggregate
benefit more from the insertion of on-board intel- metrics to enable simple critical event detection
ligence. Hence, IoT nodes with larger Pwireless, (e.g., comparison of average/peak value or slope
raw/Pbudget should internally process the sensed of recent data with a threshold (WebSphere Pro-
data at a higher level of their semantic under- cess Server)). Higher and relevance. Salient and
standing, so that the output to be transmitted relevant features can then be analyzed in terms of
becomes more succinct while still retaining the semantic understanding is achieved through
1 IoT: Bird’s Eye View, Megatrends and Perspectives 25

computation computation in
detection / low dimensional space
classification
novelty
in IoT node the cloud

more succinct transmitted data


higher level of understanding

more on-board intelligence


assessment
saliency
assessment
feature extraction
compression / aggregate
metrics
raw data high dimensional space

Fig. 1.18 Levels of semantic understanding ordered and cloud. Smarter IoT nodes embrace more levels of
from low (raw data) to high (detection/classification) understanding to reduce the amount of wirelessly
and allocation of related computation between IoT node transmitted data

feature extraction, which compresses informa- processing model applies to computation


tion and also identifies potentially interesting performed in the cloud.
patterns, as preliminary step for higher levels of
understanding. On a higher level of understand-
ing, extracted features can be analyzed in terms 1.6.2 Opportunities to Achieve
of their saliency novelty, as they should be fur- Highly Energy-Efficient
ther processed only if they bring novel informa- Processing
tion. Relevant and novel features are then
processed through the more computationally Trading off performance and energy is an obvi-
expensive detection/classification engine. As in ous option in the processing sub-system of IoT
Fig. 1.18, lower and higher levels of understand- nodes. Such dynamic tradeoff is primarily
ing are properly allocated between the IoT node enabled by dynamic voltage scaling, as was
and the cloud, with smarter nodes embracing a briefly mentioned in Sect. 1.5.2. From a circuit
larger number of levels of understanding. point of view, operation at nearly-minimum
In summary, some computational tasks that energy is achieved through the techniques and
are traditionally performed in the cloud are design guidelines described in Chaps. 4 and 5,
expected to be moved into IoT nodes, whenever and the automated design methodologies in
this achieves a more favorable balance between Chap. 9. Architecturally, energy efficiency in
the reduced wireless datarate (thanks to the the processing sub-system is improved through
reduced data dimensionality at higher levels of parallelism and the adoption of energy-efficient
understanding) and the increased computational accelerators (i.e., trading off energy efficiency
effort performed in the IoT node. As a result, the for area efficiency), as discussed in Chap. 3.
IoT is expected to require an even more Another general approach to mitigate the con-
distributed processing model compared to the sumption in IoT nodes is to resort to event-driven
conventional cloud computing based on sensing and processing paradigms, rather than
virtualization. Indeed, in the IoT computation more conventional time-driven schemes where
can be dynamically allocated between IoT data is continually sampled and processed while
nodes, concentrators and the cloud, depending waiting for specific conditions that need to be
on the energy availability of the former ones. detected. Event-driven sensing and processing
Then, conventional datacenter-scale distributed is inherently asynchronous and triggers
26 M. Alioto

processing and communication only when really more favorable by mitigating the energy per
needed. Indeed, event-driven operation permits write, compared to state-of-the-art embedded
to stop the clock and limit the overall power to Flash memories. Indeed, the write energy per
few always-on analog blocks (e.g., the compara- bit in Flash memories is in the order of nJ/bit
tor in Fig. 1.12) and the leakage of the processing (Aitken et al. 2014), which is a few tens of times
sub-system. This approach is clearly advanta- larger than low-power DRAM (e.g., LPDDR2
geous when the active task in Fig. 1.5 is triggered (Malladi et al. 2012)), and tens of thousand
by specific events whose occurrence has a wide times larger than SRAM (ISSCC 2016). The
temporal distribution, rather than a strict tempo- lower write energy per bit in emerging NVMs
ral periodicity. makes data storing more advantageous at smaller
In non-real time systems whose dynamics of TMEM,avg.
actuation and/or decision is much slower than the Another important design dimension that can
data acquisition, latency is not an issue to a be traded off to reduce the energy per computa-
certain extent. In other words, the value of sensed tion is the quality of processing, leveraging the
data comes from its collection and aggregation, fact that algorithms, architectures and circuits for
rather than the most recent datum. In such cases, processing are substantially overdesigned in
the IoT node senses and stores data in a time- terms of their accuracy in most of the applications
driven fashion, occasionally processes data in a (Alioto 2016; Frustaci et al. 2015). Accuracy is a
batch fashion, and then transmits only if the very general term that can involve arithmetic
stored data is representative of a condition of precision (e.g., approximate computing), circuit
interest that should be sent to the cloud resiliency to occasional faults (e.g., voltage
(or examined by the nearby concentrator to overscaling (Hegde and Shanbhag 2001)), classi-
develop a more global perspective). In other fication accuracy or number of classes in a classi-
words, the IoT node behaves like a smart data fier. The fundamental idea is that higher quality
logger, as it normally stores without transmitting requires more energy per computation, or equiv-
and uses on-board intelligence to decide when alently the energy can be reduced whenever lower
wireless communication is truly needed. Such quality can be targeted. Energy-quality scalable
approach is exemplified in the SoC design systems allow dynamic tradeoff between energy
discussed Chap. 18, where NVM is tightly and quality of computation, minimizing the for-
integrated with logic to enable such capability. mer for a given quality target. As an illustrative
From a design perspective, such approach example of energy-quality scalable system,
involves a tradeoff between the wireless power Fig. 1.19 shows the energy-quality tradeoff in
reduction and the increased power cost due to the the processing of a video stream on a smartphone.
data storage in a memory (all other power As in Fig. 1.19a, the quality of the video stream
contributions are unaffected, and can be ignored processing (e.g., video decoding) can be degraded
in terms of power optimization). The latter con- under poorer lighting conditions, when the user is
tribution becomes important when meaningful paying less attention, when the battery is close to
events occur infrequently, as the leakage energy discharged, or when the user is being physically
Plkg, MEM ¼ V DD  I off , MEM  T MEM, avg for storage active. As in Fig. 1.19b, lighting, attention (e.g.,
is proportional to the average storage time TMEM, camera), battery monitor and motion sensors can
avg, and hence becomes large for slow sampling detect the usage context and generate an appro-
and long observation times. This tradeoff can be priate quality target, which is achieved by letting
made much more favorable by embedding a the operating system (OS, or a hardware control-
non-volatile memory for acquired data, which ler) adjust the energy-quality (EQ) knobs that
motivates Chaps. 6 and 7. In these chapters, the are supported by the hardware architecture.
consideration of emerging non-volatile Then, the hardware provides the OS with the
memories also aims to make the above tradeoff actual quality measurement for feedback loop
closure.
1 IoT: Bird’s Eye View, Megatrends and Perspectives 27

energy sensors (context)

OS
energy-quality
knobs
HW
quality quality
(a) (b)

Fig. 1.19 Example of energy-quality scaling allowed loop that senses such context, generates the corresponding
during video stream processing in a smartphone: (a) quality target and adjusts the related energy-quality
energy can be reduced when lower quality is acceptable, knobs, and measures the actual quality for further tuning
based on the usage context, through (b) a feedback control

required accuracy resiliency app/task context/usage dataset


just-enough quality quality slack quality
available

Fig. 1.20 Slack between the quality available from the processing platform and the actual quality required by the
specific chip, application, usage scenario and dataset

As summarized in Fig. 1.20, digital platforms IoT nodes can greatly benefit from the con-
are designed with a given level of processing cept of energy-quality scaling, as their on-board
quality, as constrained by the architecture for processing deals with physical signals, which are
general-purpose designs, or by the quality noisy in nature and hence can be processed with a
imposed by the most stringent task in precision that adapts to the level of noise. Simi-
application-specific designs. Hence, a “quality larly, in certain cases smart IoT nodes might
slack” exists and can be reduced to improve the execute statistical and soft (e.g., machine
energy efficiency. Such slack is imposed by the learning) algorithms, which are very resilient to
requirement of adequate circuit resiliency, i.e. no occasional faults or approximations. Also, IoT
hardware faults are allowed, although actually nodes might be involved in sensing physical
the application might occasionally allow them. quantities that are related to human perception
The slack is also due to the fact that the design is (e.g., vision, hearing), which have intrinsic
sized for the most demanding application or task, redundancy and are inherently resilient against
whereas the one at hand could run at lower qual- simplifications and uncertainty. Finally, “swarm
ity. Also, the lack of knowledge of the usage intelligence” and intelligent global behavior
context forces the quality to be set at its maxi- (Beni et al. 1989) arises due to the massively
mum level, although this could actually be distributed nature of IoT nodes, and it can be
relaxed (see above example in Fig. 1.19). Finally, exploited to reduce the processing ability of indi-
the required quality of processing actually vidual nodes, thus offering additional
depends on the specific incoming data, although opportunities for energy-quality scaling.
the quality is usually set the worst-case dataset Other opportunities to reduce energy come
(e.g., object classification might become difficult from context-awareness and tight coupling
under cluttered scene). with task under execution in power management.
28 M. Alioto

Indeed, the distributed nature of the IoT gives “backdoors” that can be exploited by attackers
puts certain nodes (e.g., some concentrators) in to perform physical and network attacks
a position to have a global or semi-global • The always-connected feature of IoT nodes
perspective on the context, and hence they can makes them rather vulnerable to
help setup the power management strategies in eavesdropping and software attacks, data
individual IoT nodes. In other words, power stealing and device cloning (Narendra
management in IoT has the potential to Mahalle and Railkar 2015)
become distributed, and leverage the global • In our view, the personal data stored in the
understanding to mitigate the power consump- cloud will be likely shared across different
tion in individual IoT nodes. On a smaller scale, service providers, which will deliver their ser-
tight coupling between power delivery and vice through “cloud apps” that will run on a
processing offers additional opportunities for datacenter-scale software platform. The user
energy efficiency improvements (Ramadass and will likely give access to the service providers
Chandrakasan 2008). For example, adjusting the as we do today when we install apps on a
power delivery to the instantaneous performance smartphone. This data sharing with multiple
requirement of the processing sub-system service providers poses new security
permits to improve both the energy per computa- challenges, which have to be solved mostly
tion of the latter (Ramadass and Chandrakasan on the server side.
2008; Kwong et al. 2009), and the power effi- • The limited energy availability of IoT nodes
ciency of the former (Zimmer et al. 2016). Fur- makes them vulnerable to resource enervation
thermore, reconfigurable on-chip DC-DC and Denial of Service (DoS) attacks
converters offer the ability to adapt to the system (Narendra Mahalle and Railkar 2015; Aitken
active-sleep pattern, maximizing the energy effi- et al. 2014).
ciency via optimal sleep/wakeup power balanc- • The long lifetime of IoT nodes makes soft-
ing (Lueders et al. 2014; Alioto et al. 2013), ware and firmware updates unavoidable, and
while avoiding the area cost of several different hence mandates strong authentication
regulators to cover the available power modes mechanisms to assess the authenticity and
(Kristjansson 2006). the integrity of the updates and patches,
under the tight power budget of IoT nodes.

The above security challenges are clearly per-


1.7 Present and Future Challenges
ceived to be crucial by users, and a credible solu-
in Chips for IoT Nodes: Security
tion is well known to be a fundamental prerequisite
for the future expansion of the IoT (Hofschen et al.
1.7.1 Security Challenges in IoT
2015). This is already being reflected in the growth
Nodes
of IoT security spending, with an expected CAGR
close to 25% until 2018 (http://www.gartner.com/
Security represents a very broad challenge in the
newsroom/id/3291817). In line with the scope of
IoT, due to several factors:
this book, security will be mostly discussed from
the IoT node standpoint, and introducing new
• Traditional security schemes such as public- concepts that assure truly distributed and ubiqui-
key cryptography are not feasible in most of tous security, as discussed in Chap. 8.
IoT nodes, due to their stringent power and Regarding the first of the above challenges,
cost requirements public-key cryptography cannot be embedded in
• The large number of IoT nodes and the most of IoT nodes, although it is a fundamental
resulting scale of the IoT network create an building block in today’s information security
unprecedented very large number of architectures (e.g., for crypto-key exchange
1 IoT: Bird’s Eye View, Megatrends and Perspectives 29

over an insecure channel, authentication, and security integration are being made available
others). Indeed, its computational effort is unaf- (Atmel), where keys, certificates and other sensi-
fordable under typical power budgets of IoT tive security data used for authentication are
node (see Sects. 1.1 and 1.6). For example, soft- stored in secure hardware and protected against
ware implementations on an MCU of the popular software, hardware and back-door attacks.
encryption RSA algorithm at typical 2048-bit
key size entails an energy consumption per bit
in the order of mJ/bit (Nehru et al. 2014; Hu et al. 1.7.2 Opportunities to Address
2009), i.e. 109 times larger than state-of-the-art Security Challenges on the IoT
energy per bit in private cryptography (e.g., AES Node Side: Physically
(Zhao et al. 2015), see Chap. 8), and 106 times Unclonable Functions (PUFs)
larger than energy/bit of best-in-class radios and PUF-Enhanced
(see Sect. 1.3.3). RSA decryption entails a Cryptography
10 larger computational effort and energy
(Hu et al. 2009). This means that public-key Due to the pervasive nature of the IoT, security
cryptography is affordable only if it is used measures need to be distributed and ubiquitous.
roughly once every 109 transactions, as in this Distributed security is necessary to preserve scal-
case its impact on the average power becomes ability in the IoT, introducing a centralized
comparable to private-key cryptography. mechanism to preliminarily build the trust
Exchanging a key every 109 transactions tran- among IoT nodes once and for all, so that inter-
slates into one RSA execution every 192 years node communication does not need any involve-
(!), assuming that one measurement (and trans- ment of other devices to manage their mutual
mission) every 10 min is performed by the IoT trust, as summarized in Fig. 1.21. Regarding the
node. In other words, public-key cryptography is need for ubiquitous security, in the IoT a hard-
not really usable in power-constrained IoT ware root of trust is needed in every single piece
nodes, and novel methods to exchange keys of hardware to assure its authenticity (and possi-
over the insecure wireless channel are needed, bly integrity), as well as the integrity and confi-
as discussed in Chap. 8. dentiality of sensitive data used to establish trust
On the other hand, application-specific (Atmel). Such hardware root of trust also drasti-
on-chip accelerators for RSA encryption con- cally simplifies difficult tasks such as the key
sume an energy in the order of uJ/bit (Hu et al. exchange over an insecure channel (see previous
2009), i.e. 106 times larger than private cryptog- subsection), as will be discussed in Chap. 8.
raphy. This comes at the cost of additional com- A hardware root of trust is a secret shared
plexity in the order of 40 kgates, which is between each chip and a secure database server,
comparable to the gate count of an entire MCU and is typically based on chip IDs that uniquely
(e.g., MSP430 (Kwong et al. 2009)), three times a identifies the node and are stored in the chip and
low-end processor for IoT (ARM Cortex M0), the server. The (non-volatile) chip IDs can be
5–7 more complex than AES crypto-cores, classified as extrinsic and intrinsic. Extrinsic
and 40 more complex than recently proposed keys are created outside the chip and shared
crypto-cores for IoT (e.g., Simon (Beaulieu et al. with it by writing them onto it before deploy-
2015)). Hence, public-key cryptographic ment. Conversely, intrinsic keys are directly cre-
accelerators might be affordable in terms of ated and available on the chip once it is
power for rather infrequent key exchange (once fabricated, and must be securely read before
every few months), but at significant hardware deployment to share the secret with the database
cost. Some research effort is being performed to managing trust.
devise new public-key crypto-algorithms that are As mainstream approach to hardware-level
affordable in IoT nodes (Arbit et al. 2015). Also, security (e.g., RFIDs, MCUs), extrinsic keys can
crypto-accelerators for hardware-software be stored in eFuse-based, One-Time
30 M. Alioto

secure
CRP
database
secure secure
request communication communication
node 1 node 1 node 2
request to establish trust in-field operation with
centralized trust management
(a)

secure
CRP
database

request CRPs CRPs

node 1 node 1 node 2 node 1 node 2


request to establish trust one-time inter-node direct inter-node
trust establishment secure communication
(b)

Fig. 1.21 (a) Centralized security architectures require trust between nodes, which then communicate securely
other devices to constantly maintain the trust between and directly afterwards
nodes, (b) distributed architectures preliminarily establish

Programmable Read Only Memory (OTPROM) vulnerable against physical attacks, compared to
or Flash memories (Aitken et al. 2014; Rosenblatt extrinsic key storage. Indeed, keys not really stored
et al. 2013; Uhlmann et al. 2008). In IoT nodes, on chip, but are recreated on demand through cir-
the Flash memory is certainly needed to store the cuit techniques that are sensitive to random
program and intermediate data (see Sect. 1.5.2), variations, and insensitive to systematic or fully-
hence it is the natural mainstream solution to correlate variations (see Chap. 8). Also, they also
store keys on IoT nodes. Unfortunately, extrinsic provide tamper evidence, at least to a certain extent
keys can be recovered with little or moderate (Helfmeier et al. 2014). Clearly, PUFs are still
effort through several types of attacks (van somewhat vulnerable to sophisticated physical
Tilborg and Jajodia 2005). For example, eFuses attacks, and their resiliency against such attacks is
that disable the testing and programming inter- currently under investigation (Nedospasov et al.
face can be restored with invasive methods (e.g., 2013; BlackHat; Helfmeier et al. 2014).
FIB) or bypassed through fault attacks (Helfmeier Due to their ability to inexpensively introduce
et al. 2013; Tehranipoor and Wang 2012). hardware-level security into individual silicon
OTPROM or Flash key storage is prone to clon- dice, PUFs have drawn substantial attention,
ing through de-layering or manipulation/read-out and have already been adopted in a few commer-
through a number of well-established techniques cial products (ICTK, Co. Ltd. 2014; Intrinsic-ID
to read out on-chip memories, and (Kommerling 2016; Invia PUF IP; QuantumTrace and Product
et al. 1999; Nedospasov et al. 2013; 2013; Verayo Inc. 2013). For the sake of simplic-
Skorobogatov et al. 2010). ity, a PUF can be thought of as a digital
Physically Unclonable Functions (PUFs) are sub-system that provides chip-specific, unpre-
able to generate intrinsic keys, and are far less dictable and perfectly repeatable responses to
1 IoT: Bird’s Eye View, Megatrends and Perspectives 31

Fig. 1.22 Summary of


PUF operation, with chip- 01101 PUF 1011010010
specific responses being Challenge Response
stored in a secure server for
authentication via
comparison. Responses are
unique to the die, and can
be regarded as “silicon
fingerprint”

incoming challenges (i.e., inputs), as thus reducing the number of CRPs during the
summarized in Fig. 1.22. The chip-specific IoT node lifespan, and lowering the power by
responses permit to identify and authenticate strengthening the crypto-key with the PUF key.
the die in an unambiguous manner. Authentica- Also, in Chap. 8 it is shown that recent advances
tion is largely performed by storing challenge- in energy-efficient crypto-cores now allows full
response pairs (CRPs) in a secure server, each and continuous encryption of any packet under
pair being used to perform authentication strictly very small power budgets (Zhao et al. 2015)
in a single transaction (being transmitted over an (e.g., down to UHF RFIDs), as AES encryption/
insecure channel). In Chap. 8, this model is decryption can be performed at a small fraction
questioned since the need for a fresh CRP for of a μW at typical IoT throughputs (e.g.,
each transaction fundamentally conflicts with the hundreds of kbps).
long lifetime requirement of IoT nodes (i.e., a In reality, the responses of PUFs are not per-
large number of fresh CRPs during the lifespan fectly unpredictable and repeatable (i.e., random
of the PUF). In particular, it is shown that such and stable), and significant research effort is
conventional CRP-based security scheme being spent to improve such features in an inex-
requires a PUF capacity that translates to a sili- pensive manner. To gain deep understanding of
con cost of more than a dollar only for the PUF the state of the art and fairly compare the differ-
itself. This is clearly unaffordable and requires ent available options, Chap. 8 of this book
the adoption of a more efficient usage of CRPs. introduces for the first time the public PUF data-
A much more inexpensive approach for base (http://www.green-ic.org/pufdb), named
authentication will be discussed in Chap. 8, “PUFdb”. This database summarizes the state of
which introduces the new concept of “PUF- the art in PUFs, and reports up-to-date technol-
enhanced cryptography”, which merges Physi- ogy trends and advances in the field of PUFs,
cally Unclonable Functions (PUFs, or “silicon through the comparative evaluation of relevant
fingerprint”) for hardware-level security and tra- metrics that are discussed in Chap. 8.
ditional cryptography. Interestingly, as will be In perspective, the long lifetime of IoT nodes
discussed in Chap. 8, PUF-enhanced cryptogra- poses additional challenges, in addition to their
phy drastically reduces the PUF capacity require- limited power budget and hence capabilities.
ment, solving the above mentioned issue. In Indeed, security standards will change over time
addition, PUF-enhanced cryptography enables and will impose higher standards. On one end, this
novel approaches to exchange crypto-keys over means that IoT node will need to be margined in
an insecure channel without resorting to conven- terms of hardware-level security and cryptography
tional energy/area-hungry public cryptography to exceed current requirements, with obvious cost
schemes (see previous subsection). As discussed implications. At the same time, security primitives
in Chap. 8, this concept permits to combine the in IoT nodes need to be flexible enough to adapt to
best of the two worlds: (1) chip-specific opera- future standards and requirements, as well as to
tion, excellent energy/area efficiency of PUFs, allow for introducing hardware patches for unantic-
(2) while hiding CRPs behind a crypto-core, ipated types of attacks and vulnerabilities.
32 M. Alioto

A possible approach can be borrowed from today’s possibilities to address issues caused by
accelerators for cryptography, which combine malfunctioning or hacking in a single node,
highly efficient primitives for operations common thanks to the partial redundancy that is available
to most of crypto-algorithms, while allowing flexi- across nodes placed in the same vicinity. For
bility to reuse them in a wide range of cryptographic example, the measurements of multiple IoT
standards (see, e.g., (Silicon Labs 32-bit Microcon- nodes sensing the same physical quantity have
troller Application Notes; NXP Cryptographic some degree of correlation, if not too far from
Acceleration Technology; STMicroelectronics each other. This offers the opportunity to identify
Nescrypt cryptoprocessor datasheet; Texas anomalies (due to faults or attacks) in individual
Instruments Cryptoaccelerators)). Future nodes, based on the past observations.
accelerators will need to incorporate PUFs as addi- Distributed sensor fusion offers additional
tional primitives. opportunities as the deviation of a given physical
As additional challenge, there is a fundamen- quantity can affect other quantities, hence the
tal conflict between chip debug-ability and correlation between different sensors in different
hardware-level security (Bhadra et al. 2016; IoT nodes can be leveraged similarly.
Ray et al. 2015). Indeed, the former requirement
demands architectural support for internal state
observability through the testing port, and some 1.8 Present and Future Challenges
controllability through firmware patching. In in IoT Nodes: Cost
turn, such architectural support weakens the
level of hardware security since it offers The challenges related to the tight IoT cost
opportunities and backdoors to discover and requirement need to be looked at from the per-
leverage vulnerabilities, having access to internal spective of semiconductor economics trend
signals and states (e.g., through assertion checker (Rhines 2015), as exemplified by the learning
and coverage monitor circuits), and expose them curve in Fig. 1.11. To this aim, the cost per
outside the chip. This can enable the attacker to transistor and the cumulative number of
recover sensitive information (e.g., digital rights transistors are separately plotted in Fig. 1.23 ver-
management keys), settings of programmable sus time. Since the inception of integrated circuit
fuses, defeature bits, among the others (Bhadra manufacturing, the overall number of
et al. 2016). This tradeoff is difficult to manage, manufactured transistors has doubled every
and requires deeper integration of design 18 months, i.e. expectedly faster than the transis-
methodologies for verification and security, tor density as per Moore’s law, which has
which in turn demands novel tools with such increased by 1.6–1.7 every 18 months.
ability. A possible solution to cleanly manage The economy of scale and the efficiency
such tradeoff is to introduce an explicit “hard- improvements from the accumulated design/
ware firewall”, which manages independently the manufacturing knowledge has translated into a
access to such architectural support based on the halved cost per transistor every 22 months, being
context and software under execution. For exam- a 55% learning curve (see Sect. 1.4).
ple, such hardware firewall can preventing such Figure 1.23 is useful to develop a sense of how
access under fault injection (as detected by semiconductor manufacturing will help drive
timing, voltage, temperature sensors), physical down the cost per transistor in the IoT, assuming
intrusion (as detected by on-chip sensors), or we will continue to innovate to maintain the same
unsuccessful authentication. The same firewall cost reduction rate (quite reasonable, based on the
could also manage the testing port and forbid past experience). On the other hand, the cost
access to critical primitives (e.g., PUF) and reduction in IoT nodes is expected to be driven
related signals transferred to other sub-systems. by very different factors, compared to the past.
On the larger scale of an IoT node cluster, the Indeed, cost reductions has been mainly pushed
distributed nature of the IoT also offers new by the technology scaling and Moore’s law, with
1 IoT: Bird’s Eye View, Megatrends and Perspectives 33

1E+20 1E-02

revenue per transistor ($)


manufactured transistors 1E+18 1E-04
cumulave no. of

1E+16 1E-06

1E+14 1E-08

1E+12 1E-10
1965 1970 1975 1980 1985 1990 1995 2000 2005 2010
year
manufactured transistors revenue per transistor ($)

Fig. 1.23 Plot of cumulative manufactured transistors and cost per transistor (adjusted for inflation) versus year, and
related scaling rules

1E-4
revenue per transistor ($)

1E-5

1E-6 2X / 68 months
2X / 53 months
1E-7 2X / 42 months

1E-8
1990 1995 2000 2005 2010
year
1 um 0.7 um 0.5 um 0.35 um 0.25 um
0.18 um 0.13 um 90 um 65 um

Fig. 1.24 Plot of cost per transistor (not adjusted for inflation) for several CMOS generations vs. time (data from
(Kurzweil 2005; Goodall et al. 2002))

steady reduction enabled at each new CMOS the envelope across multiple generations
generation as in Fig. 1.24 (the cost reduction is (Kurzweil 2005; Goodall et al. 2002).
somewhat slower than in Fig. 1.23 since the for- Unfortunately, it will be very difficult to
mer does not account for inflation). Within the leverage aggressive technology scaling (e.g.,
same generation, some cost reduction has also below 28 nm) for the implementation of IoT
been achieved over time, thanks to the amortiza- nodes in the foreseeable future, due to various
tion of the initial investments across a larger reasons. Indeed, embedded Flash is a fundamen-
production volume, and the depreciation com- tal requirement in IoT nodes (see Sect. 1.5.2),
pared to new generations. The cost reduction hence the technology of choice is severely
rate within the same generation at 90–65 nm restricted by the availability of such process
technologies is approximately twice as slow as option. As of 2016, 40 and 55 nm are the most
34 M. Alioto

Fig. 1.25 Plot of cost per 3 100

cost per Mgate (a.u.)

gate ulizaon (%)


million gates versus CMOS
2.5 80
technology generation, for
both prototyping and mass 2
production (data from 60
(Jones 2014)) 1.5
40
1
0.5 20

0 0
16 28 40 65 90
technology generaon (nm)
cost/Mgate for prototyping (normalized to 40nm)
cost/Mgate at mass production (normalized to 28nm)

advanced technologies offering eFlash, whereas section). Some further considerations on the
28 nm is being developed for mass production technology selection will be made in Chap. 18.
and makes sense only for highly digital-intensive From the above considerations, chips for IoT
IoT nodes (Taito et al. 2015). 2T-NOR bitcells nodes will not be able to benefit from further
are generally a preferable option in IoT nodes, in technology scaling, and will only enjoy half of
view of their low power consumption and rela- the cost reduction that we have observed (see
tively small capacity (Strenz et al. 2012). No Fig. 1.24). The remaining half will need to
further downscaling of eFlash (e.g., at 20 nm) is come from other strategies that enable further
expected, being technologically very challenging cost reduction while not shrinking the process
and excessively expensive (Jurczak et al. 2015). (i.e., extend the lifespan of mature technology
Other emerging NVM technologies are being generations such as 65, 40, and ultimately
considered as a replacement of successive 28 nm). For example, we envision that 450-mm
generations (e.g., ReRAM and STT-MRAM, as wafers should actually be used at such well-
discussed in Chaps. 6 and 7), although with established processes, rather than being limited
uncertain roadmap for the market introduction. to up-coming cutting-edge generations (Global
As more fundamental reason why IoT nodes 450 Consortium (G450C) Program). Indeed,
will hardly scale below 28 nm in the foreseeable IoT offers an unprecedented opportunity to
future, the cost per transistor in 28 nm has scale up the manufacturing volume at such
reached a minimum, as shown in Fig. 1.25 well-established generations, and hence truly
(Jones 2014). The cost per transistor increases benefit from the cost reduction allowed by larger
again at 20 nm and finer technologies due to wafer size. Furthermore, such cost benefit is
higher cost associated with double patterning expected to be even more pronounced compared
and poorer gate utilization per unit area, sanc- to most advanced CMOS generations, as lithog-
tioning the end of Moore’s law. From Fig. 1.25, raphy takes a smaller fraction of the wafer
40–65 nm generations are very reasonable processing time, as no double patterning is
options for cost-effective IoT nodes, with needed (the cost associated with does not scale
28-nm becoming preferable only upon availabil- down at larger wafers). Another very interesting
ity of eFlash and for digital-dominated SoCs. option to extract value from well-established
Then, we envision that the range of feasible technologies at low cost is to introduce the
technologies for IoT nodes will remain the FDSOI option at 40 and 65 nm (Esteve et al.
same for several years, dictating a radical shift 2016), leveraging its low cost and benefits in
in the business model of chip design (see next terms of voltage scalability and leakage control.
1 IoT: Bird’s Eye View, Megatrends and Perspectives 35

As another important recurring engineering hence does not explicitly address challenges at
cost, testing of chips for IoT nodes needs to be higher levels of abstraction, such as inter-
kept to a minimum, in spite of their low voltage operability, communication protocols, data man-
operation and hence large sensitivity to process, agement, detailed IoT architecture with
voltage and temperature variations (see Chap. 4). distributed processing and software allocation,
For example, their sensors are generally needed to among the others. Such challenges are crucial to
be calibrated in a post-silicon manner, between the IoT, and mandate the creation of proper
testing and run time (i.e., at additional testing cost abstraction layers that can abstract the specific
or higher energy in in-field operation). In perspec- behavior of IoT nodes and concentrators, to create
tive, such cost can be driven down by leveraging a scalable, secure, IoT infrastructure with predict-
the distributed nature of IoT systems, and the able latency and availability as a whole (Zhang
intrinsic spatial redundancy of IoT sensors (Lee et al. 2015). Ultimately, the challenge is to create a
et al. 2014; Miluzzo et al. 2008; Stojmenovic true ecosystem that seamlessly collects pushes
2005). The research work in this field has been data to the cloud on the enterprise side (communi-
mainly algorithmic and dealt with sensors that cation with sensors), create repositories and struc-
have not been conceived to calibrate each other. ture data, perform storage and security, analyze
In perspective, mutual calibration should be and visualize data for sensemaking. The reader is
enabled by incorporating it in IoT nodes in first encouraged to explore the related topics in the vast
place, offering architectural support to the net- existing literature.
work for faster, continuous and accurate mutual
calibration. We envision that such approach will
help create “supersensors”, i.e. sensors exceeding 1.9 Looking at the Future: IoT
their intrinsic capabilities through synergy with Market towards the End
the neighboring IoT nodes. This can be achieved of Moore’s Law and Related
through conventional sensor fusion (i.e., combin- Trends
ing information from different physical
dimensions), as well as cross-calibration by mea- 1.9.1 Perspectives on the Growth
suring (1) their operating conditions (e.g., using an of the IoT
IoT node sensing temperature) to suppress the
margin that is traditionally added to the sensor The IoT is currently at its very early stage of
static characteristics, (2) their usage intensity for adoption and development, and more precisely it
aging/drift estimation and compensation corresponds to the “innovators” adoption stage,
(as needed by their long lifetime). according to the well-known S-curve adoption
To mitigate their Non-Recurring Engineering model of innovation (Rogers 2003). The S-curve
cost (NRE), IoT nodes need to be designed effi- is the cumulative distribution of the time of adop-
ciently through an application-driven combination tion, and is usually Gaussian distributed (Rogers
of platform-based design, design composability 2003). The adoption stages are defined by the
(both intra-die and inter-die, i.e. reuse and distance from the mean μ of the time of adoption,
System-in-Package design), reconfigurability and expressed in number of standard deviations σ. For
low-cost adaptation. As an example, Application- example, from Fig. 1.26 innovators are very early
Specific Instruction set Processors (ASIP) with adopters, as they adopt a specific innovation
processor extensions (i.e., accelerators) offer flex- 2σ before μ, thus accounting for 2.5% of the
ibility to be tailored around an application and total adopters at maturity stage in a Gaussian
achieve both area (i.e., cost) and energy effi- S-curve. Fast growth is observed until the aver-
ciency, while mitigating the design effort through age/median adopter is included, then growth
automated generation of RTL, Software Develop- slows down. After μ + σ, the adoption tends to
ment Kit, and verification support. saturate, leading the market to a maturity stage.
As discussed in Sect. 1.1, this book focuses on This is for example the case of the smartphone
the design of integrated circuits for IoT nodes, and market around 2016–2017 (IDC 2016), which is
36 M. Alioto

Fig. 1.26 The S-curve smartphone market is currently


describes the stages of in this adoption phase
adoption of innovation in
the form of cumulative cumulative
distribution of the time of frequency
adoption 84% distribution
(Gaussian)

50%

S-curve

16%

IoT is currently in
this adoption phase 2.5%
frequency
m-2s m-s m m+s distribuon
(Gaussian)
2.5% 13.5% 34% 34% 16%
time
innovators

adopters
early

majority
early

majority
late

laggards
IoT
smart
400
mobile
semiconductor market size

350
mobile
(billions of US$)

300 networked/
250 portable PC
200
150
PC
100
50
0
1985 1990 1995 2000 2005 2010 2015 2020
year

Fig. 1.27 Semiconductor annual sales over time, and technology waves leading to growth

only a few percentage points, after the impetuous 2015), which have grown the market size by
growth in the previous 10 years. approximately 11 B$ per year in the last three
The above considerations show that the decades (this size is expected to grow as the
smartphone market wave is now in the logarithm of the IoT market volume increase,
“laggards” phase, as shown in Fig. 1.26. Accord- comparing the trends in Figs. 1.9 and 1.27 and
ingly, the growing IoT wave is expected to trig- assuming they will not change drastically).
ger the next growth in the semiconductor market The push towards the next phase of adoption
size as shown in Fig. 1.27, as steadily occurred in IoT in Fig. 1.26 will likely come from very
through previous technology waves (Rhines few key applications that are less impacted by the
1 IoT: Bird’s Eye View, Megatrends and Perspectives 37

open challenges in IoT described in Sect. 1.3 introduced by the IoT in terms of conditioning/
(e.g., cost/profitability, inter-operability), and heating system management, security, lighting,
benefit more from the IoT infrastructure (e.g., entertainment, appliances, assisted living, safety
applications where having large numbers of and security (i.e., well beyond the existing simple
coordinated sensing points is perceived as a thermostats or surveillance cameras.). The need
strong need). In terms of profitability, in for deeper integration with the external world
Sect. 1.3.4 it was pointed out that profitability (e.g., cloud services, smartphone apps) will make
in the IoT area is not trivial in spite of the large the start of the smart homes technology wave
production volume, due to the extreme diversity slower than Industry 4.0. We believe that the
of IoT applications, which in turn limits the vol- third wave of IoT technologies will be in smart
ume in each specific application. Design buildings and then smart cities, as the experience
approaches mitigating this profitability challenge and standardization efforts developed in smaller
have been briefly discussed in Sect. 1.3.4, and settings will be valuable to grow to the next scale.
will require better coordination within the semi- Other applications such as retail, transportation,
conductor value chain. Meanwhile, the environment, security and public safety, will
applications that inherently have larger profit develop later, and will augment the capabilities
margin will develop faster than others. of the IoT in the above application spaces. Several
Similarly, applications where a proprietary/ other views have been expressed on the evolution
closed system is acceptable will develop faster of the IoT (see, e.g., (Beica 2015; Jankowski et al.
than those requiring pervasive standardization 2014; Dobbs et al. 2015; https://www.mckinsey.
and open hardware/software platforms. On both de/files/mckinsey-gsa-internt-of-things-exec-sum
respects, industrial applications are particularly mary.pdf; Lueth 2014)).
well suited for initiating the next stage of adoption
of the IoT. Indeed, industrial applications put less
pressure on the profit margin per IoT node, as a 1.9.2 Perspectives on the Impact
disproportionately larger value comes from the of the IoT on the
savings and the efficiencies that are enabled by Semiconductor Industry
the distributed real-time monitoring capability. At Structure
the same time, industrial environments do not
currently need to be pervasively connected to the The value chain of IoT includes silicon integra-
outside world, thus making the adoption of propri- tion (i.e., sensors, chips with processing and
etary/closed systems (i.e., non-interoperable with communication abilities, actuators), system
the outside world) acceptable in the foreseeable integration (software, network/server infrastruc-
future. And, overall, the ability to coordinate large ture), and applications (cloud services and
numbers of IoT nodes is of great value to industry, end-user applications) (https://www.mckinsey.
as efficiencies come from economy of scale and de/files/mckinsey-gsa-internt-of-things-exec-sum
global coordination (from machines, to processes, mary.pdf). The silicon integration is of crucial
to inventory and sales). importance to the realization of the IoT, and semi-
After fostering Industry 4.0 (i.e., a business-to- conductor manufacturing will be boosted by
business commercialization model), we envision More-than-Moore integration focusing on inte-
that the IoT will penetrate smart homes as second grated MEMS/ASICs, actuators (power devices,
large-scale application over time (and first as busi- LEDs), microfluidics, integration with printed
ness-to-customer). Indeed, the cost of individual electronics and flexible substrates, among the
IoT nodes is not a key factor in smart homes, as others. However, the value perceived by the end
they can be progressively upgraded (as opposed to users will mostly come from the services that will
an industrial plant) and their value is well per- be delivered through it. Indeed, such services
ceived by users, as the IoT truly becomes part of ultimately define the user experience, being the
their life. In smart homes, new capabilities will be infrastructure and the IoT nodes physically
38 M. Alioto

invisible or inaccessible. This has implications on To create such economy of scale and enable
the viable business models for IoT, both for single the above mentioned growth of IoT through
users, and small/large-sized enterprises. bottom-up business and under mature CMOS
The above mentioned value shift towards the technologies, the EDA industry will need to pro-
service will expectedly favor bottom-up small vide value in a way that deviates from the tradi-
business approaches that deliver high-value tional business fueled by technology scaling.
services that stem from a strong expertise in a Indeed, we envision that user-friendliness of
specific application area (domain knowledge). In EDA tools for IoT node design will become a
other words, we expect that the IoT will develop major differentiator in the future, considering
in a similar way as the Internet did, with many that the value will not come from the develop-
small enterprises leveraging their deep expertise ment of tools for new technology generations.
in a specific application domain, and learning User-friendly tools are indeed expected to bring
about the technologies that the IoT is made up substantial value to customers who are focused
of. Similarly, once the Internet infrastructure was on applications rather than chip design, as they
available in the 1990s, innovation (and its will give them the capability to build chips with
growth) was created “on the fringes”, by people reduced effort, team size and related capital
and small organizations focused on expenditures. On the EDA tool vendor side, the
applications, instead of being specialized reduced profit margin will be (over) compensated
companies with strong technological expertise by the much wider market volume, thanks to the
(e.g., chip design) in search of promising larger demand of a wider basis of small
applications. This explains why 50% of the IoT application-focused enterprises performing chip
solutions are expected to be provided by design. To continue create value, EDA tool
companies which are less than 3 years old in vendors will likely offer vertically integrated
2017 (Gartner 2014). software packages and pay-per-use emulation
The above trend of application-focused services, where IoT nodes (and maybe complete
companies driving the IoT will need to be systems) can be simulated embracing their multi-
supported by an ecosystem that makes chip physics nature, from electrical to microfluidics,
design more accessible, in terms of design effort chemical and mechanics, including packaging
and financial entry bar. In the last decade, tech- aspects at their interface.
nological advances (e.g., cloud computing) have Other developments in the software develop-
drastically lowered the financial bar for starting ment are expected to become part of chip design
new companies (Horowitz et al. 2014), and espe- environments. Among the others, they will need
cially for software startups (Wertz et al. 2012). to become more collaborative for rapid develop-
We envision that this will extend to hardware ment through distributed and virtual/elastic
startups as well, through an ecosystem built on teams, similarly to the raise of Collaborative
consortia of silicon service providers that are Development Environments (CDE) introduced
linked to foundries with well-established in software design since early 2000s (Q&A with
technologies (e.g., 65–40 nm, or 28 nm) Grady Booch: Collaborative Development
congregating to average out demand volatility Environments 2006), and CAD/CAM environ-
and have predictable/high wafer throughput, cre- ments (Wang and Zhang 2002). In turn, the adop-
ate economy of scale through collaborative cloud tion of such collaborative environments will
design environments, with a new breed of Elec- favor rapid communication between designers,
tronic Automation Design (EDA) tools that application engineers, manufacturing, sales and
lower the expertise bar to build integrated customers, and more frequent pivoting towards
systems, and platform-based design methodo- the objectives. As a consequence, the principles
logies (e.g., based on reconfigurable archi- of agile design will ultimately permeate chip
tectures) that favor design reuse and rapid design, as an approach to further reduce time to
design closure. market, improve responsiveness to customers’
1 IoT: Bird’s Eye View, Megatrends and Perspectives 39

needs, and tight coupling between design and stable core set of available IPs thanks to the longer
applications (Beck et al. 2001; Patterson et al. lifetime of IPs, as enabled by the adoption of
2015; Beck and Andres 2005; Agile SoC). mature (and stable) technology generations. As
Development in short sprints will be enabled by side benefit, the longer lifetime of IPs will amor-
the reduced design effort in individual tasks, as tize their NRE design cost across a larger number
enabled by reuse and platform-based design. of systems, thus reducing the cost. As a side
Cloud-based chip design platform are benefit of cloud-based design environments,
emerging and will likely become mainstream in workflow management (e.g., for agile design)
design under well-established CMOS generations will be integrated directly with the design tools
(see, e.g., (Silicon Cloud)). These cloud-based for more predictable design schedule, and nearly-
design platforms will primarily create further real time management of related resources (e.g.,
economy of scale in design team of application- cashflow, inventory, sales).
focused companies, as they will enable pay-per- The above envisioned bottom-up model of
use cost models, thus avoiding the large cost of IoT companies will not put an end to the tradi-
entry related to the purchase of EDA tools, and tional top-down technology push from Integrated
related system management costs (Brayton et al. Device Manufacturers (IDMs), of course. In the
2015). Also, cloud-based design platforms will future, IDMs will need to innovate by developing
favor collaboration within enterprises, and will truly vertical design capabilities, ranging from
create new opportunities for synergy between chip design to software. This will be financially
companies (e.g., rapid and unceasing exchange viable in applications with large market size
of designs for strategic partnerships). Also, (e.g., wearables). As another important role of
cloud-based design platforms will help build IDMs, they will be able to build complete secu-
trust between entities generating Intellectual rity/trust chains that go from design to
Properties (IP), foundries and system integrators, manufacturing and die-specific key and certifi-
as access to resources of each party is strictly cate management, as needed by hardware-level
disciplined in the cloud. The coexistence of sev- primitives such as PUFs (NXP).
eral designs and related aggregate data (without Finally, the growth in the market segment of
details on design execution, for obvious confi- IoT nodes will certainly boost (with a lever
dentiality reasons) on their performance will cre- effect) several other segments, being responsible
ate unprecedented opportunities to develop for the production of massive amounts of data. In
across-design understanding of the underlying other words, the IoT market expansion will actu-
design tradeoffs in a data-driven manner. This ally lead to the expansion of markets related to
will enable new capabilities that go beyond the servers, storage and communications. For this
individual designer’s experience, and enable rapid reason, we envision that traditional players in
architectural exploration, IP and process selection, such spaces might actually subsidize the creation
and verification/debugging, NRE cost prediction of the IoT infrastructure, with the goal of
and manufacturing cost model creation through expanding their core business.
data mining and machine learning-based models.
Linking this new type of collective knowledge to a
collaborative design environment will further 1.10 Looking at the Future:
accelerate the capabilities of design teams. For Convergence of IoT and Other
example, new online (private or public) forums Social Megatrends
can be created to discuss about design issues while
linking them to specific parts of the design or The growth of the IoT is expected to be fostered
simulation outcomes, thus simplifying by the convergence with other social megatrends.
troubleshooting, favoring communication and Accelerated urbanization and increased human
knowledge building. Such more collaborative population are posing fundamental challenges
and shared environment can rely on a relatively in terms of sustainability, livability, and demands
40 M. Alioto

more efficient resource management (even space Distributed sensing is again the key to recreate
and time, other than water, electricity, transpor- a remote environment.
tation, etc.) and opportunities for more pervasive Wide adoption of participatory sensing is
sharing (Ericsson report on “City Index”). An another megatrend that is expected to converge
examples of opportunity to share services and with IoT. In participatory sensing, users deliber-
goods is the e-mobility (e.g., bikes, smart city ately agree to share their sensor information (cur-
cars, self-driving cars and drones for good deliv- rently on a smartphone) with a cloud platform
ery). All these services need to track goods, that aggregates data and provides valuable infor-
detect anomalies, and keep track of their mainte- mation and recommendations to the participants
nance status. (e.g., traffic congestion). We envision that an
Geo socialization is also well recognized as an extended form of participatory sensing will take
up-coming trend that will supersede today’s place where users choose to share the data sensed
social networking, adding the geographical by their IoT nodes for the benefit of all the
dimension as additional component to facilitate participants (e.g., fire, house break-in and loud
social interaction and communication for leisure events detection).
and business purposes (Frost and Sullivan As further interesting megatrend, the sharing
Report). The IoT will play a key role in geo economy is proving that aggregate advice from
socialization, as it will provide the necessary individual recommendations is perceived as
physical information to augment social media trustworthy, in a time when public institutions
(e.g., auditive awareness, cognitive cameras). are losing people’s trust (Botsman 2016). The
Pervasive assistive or proactive robot technol- IoT might strengthen this trend, augmenting
ogy for are likely to be part of our life in one or shared data with its physical dimension.
two decades. Such robots will provide domestic Recently, the human factor is playing a more
help, perform repetitive labor in factories, and important role in IoT product planning and
assist the elderly. Such robots will need to be design, with happiness being modeled, sensed
aware of the physical context in which they are and kept in the actuation loop to make the built
immersed, beyond mere Internet connectivity. environment closer to the actual needs of human
Again, the IoT will be instrumental to augment beings (Yano et al. 2015).
their capabilities with physical understanding. In the end, the IoT will enable new capabilities
Constant and data-driven product upgrade that can truly improve the quality of life of a large
will become a norm, following today’s examples number of people on earth. Indeed, it will allow
from the automotive industry (Tesla). Such more efficient and real-time management of com-
upgrades are pushed from the vendor to continu- mon resources, based on actual and predicted
ously improve the product, based on the data demand. It will make technology more human
generated by the sensors embedded in all similar centric, as it will sense and learn the behavioral
products. Accordingly, the IoT will play a funda- patterns and the users’ needs, to overcome the
mental role in understanding the in-field usage need for “pushing buttons” and enable proactive
and performance of products, and will be instru- adaptation to such needs. It will also take the
mental in closing the loop for regular product sharing economy to the next level, as embedding
improvements. sensors in all shared objects permits to measure
Three-dimensional remote physical interac- actual utilization of goods, resources, tools, and
tion is another megatrend that has clear conver- monitor their operating conditions. In turn, this
gence with the IoT. 3D remote interaction encourages responsible usage, incentivizes virtu-
permits to manipulate objects and receive sen- ous behavior that favors sustainability and
sory feedback (e.g., visual, tactile), and has resource share-ability, leveraging the capability
applications to immersive conferencing, medical to charge users by the effective usage (instead of
training, remote surgery, among the others. merely charging by time, as is customarily done
in equipment leasing, car sharing, and others).
1 IoT: Bird’s Eye View, Megatrends and Perspectives 41

In summary, the pervasive nature of the IoT first-hardware-interface-library-for-tls-stacks-used-in-


and its ability to distribute intelligence across iot-edge-node-apps/
L. Atzori, A. Iera, G. Morabito, Internet of things—a
objects has a very strong potential impact on survey. Comput. Netw. 54(15), 2787–2805 (2010)
our ability to share on a larger scale, to ultimately R.G. Baraniuk, Compressive sensing [Lecture Notes].
improve our lives and decouple socioeconomic IEEE Signal Process. Mag. 24(4), 118–124 (2007)
progress from intensive use of resources R. Beaulieu, S. Treatman-Clark, D. Shors, B. Weeks,
J. Smith, L. Wingers, The SIMON and SPECK light-
(Ericsson Mobility Report). weight block ciphers, in Proceedings of the DAC
2015, San Francisco, USA (2015)
Acknowledgments The author acknowledges the kind K. Beck, C. Andres, Extreme programming explained,
support by the MOE2014-T2-1-161 grant from the 2nd edn. (Addison-Wesley, Boston, 2005)
Singaporean Ministry of Education, and Mohsen K. Beck, J. Grenning, R. C. Martin, M. Beedle,
Shaghasemi for gathering data on commercial IoT J. Highsmith, S. Mellor, A. van Bennekum, A. Hunt,
nodes. Special thanks go to Vivek De, Giovanni De K. Schwaber, A. Cockburn, R. Jeffries, J. Sutherland,
Micheli, Gerhard Fettweis, Niraj Jha, and Samuel W. Cunningham, J. Kern, D. Thomas, M. Fowler,
Naffziger for the interesting discussion and their valuable B. Marick, Manifesto for Agile Software Develop-
feedback. ment (2001). http://agilemanifesto.org/
R. Beica, MEMS and sensors: applications and key
aspects—Short course at VLSI Symposium 2015,
Kyoto, 15 June (2015)
G. Bell, Bell’s law for rise and death of computer classes.
References Commun. ACM 51(1), 86–94 (2008)
C.G. Bell, R. Chen, S. Rege, Effect of technology on near term
A1006 Secure Authenticator for anti-counterfeit computer structures. IEEE Comput. 5(2), 29–38 (1972)
applications, NXP website on http://www.nxp.com/ G. Beni, J. Wang, Swarm intelligence in cellular robotic
documents/leaflet/A1006-LF-SECURE.pdf?fasp¼1& systems, in Proceedings of the NATO advanced work-
WT_TYPE¼Brochures&WT_VENDOR¼FREESCA shop on robots & biological systems, Italy (1989)
LE&WT_FILE_FORMAT¼pdf&WT_ASSET¼Docu J. Bhadra, S. Ray, Security challenges in mobile and IoT
mentation&fileExt¼.pdf systems, in Proceedings of the SOCC 2016, Seattle,
Agile SoC website, http://www.agilesoc.com/about/ USA (2016)
R. Aitken, V. Chandra, J. Myers, B. Sandhu, L. Shifren, BlackHat conference website. https://www.blackhat.com/
G. Yeric, Device and technology implications of the R. Brayton, L.P. Carloni, A. Sangiovanni-Vincentelli,
Internet of Things, in Proceedings of IEEE Sympo- T. Villa, Design automation of electronic systems:
sium VLSI Circuits (VLSI-Symposium) (2014) past accomplishments and challenges ahead. Proc.
M. Alioto, Ultra-low power VLSI circuit design IEEE 103(11), 1952–1957 (2015)
demystified and explained: a tutorial. IEEE Trans. E. Brynjolfsson, L.M. Hitt, Beyond the productivity para-
Circ. Syst. 59(1), 3–29 (2012) dox. Commun. ACM 41(8), 49–55 (1998)
M. Alioto, Designing (relatively) reliable systems with S. Callewaert, Communication infrastructure disruptions
(highly) unreliable components,—Keynote at IEEE caused by the forthcoming IoT data deluge: the
NEWCAS 2016, Vancouver (CA), 24–26 June (2016) FD-SOI solution, in Proceedings of the SOI Consor-
M. Alioto, Energy harvesters for IoT: applications and tium Symposium, San Jose (2016). http://www.
key aspects—Short course at VLSI Symposium 2015, soiconsortium.org/fully-depleted-soi/presentations/
Kyoto, 15 June (2015) SOI-Consortium-FD-SOI-Symposium-Sanjose-2016/
M. Alioto, E. Consoli, J. Rabaey, “EChO” reconfigurable E. Candès, J. Romberg, T. Tao, Robust uncertainty
power management unit for energy reduction in sleep- principles: exact signal reconstruction from highly
active transitions. IEEE J. Solid State Circuits 48(8), incomplete frequency information. IEEE Trans.
1921–1932 (2013) Inform. Theory 52(2), 489–509 (2006)
Amazon Dash, https://www.amazon.com/Dash-Buttons/ Cisco Visual Networking Index Predicts Near-Tripling of
b?ie¼UTF8&node¼10667898011 IP Traffic by 2020, white paper, https://newsroom.
A. Arbit, Y. Livne, Y. Oren, A. Wool, Implementing cisco.com/press-release-content?type¼press-
public-key cryptography on passive RFID tags is prac- release&articleId¼1771211
tical. Int. J. Inform. Secur. 14(1), 85–99 (2015) C. Desset and A. Fort, Selection of channel coding for
ARM Cortex M0 datasheet, https://www.arm.com/ low-power wireless systems, in Proceedings of the
products/processors/cortex-m/cortex-m0.php IEEE Vehicular Technology Conference (2003),
K. Ashton, That ‘Internet of Things’ Thing. RFID pp. 22–25
J. http://www.rfidjournal.com/articles/view?4986. R. Dobbs, J. Manyika, J. Woetzel, The internet of things:
Accessed 22 June 2009 mapping the value beyond the hype, McKinsey Global
Atmel website, The first hardware interface library for Institute report (2015). http://www.mckinsey.com/
TLS stacks used in IoT edge node apps, http://blog. business-functions/business-technology/our-insights/
atmel.com/2016/02/22/atmel-launches-the-industrys-
42 M. Alioto

the-internet-of-things-the-value-of-digitizing-the- Gartner Says Worldwide IoT Security Spending to Reach


physical-world $348 Million in 2016. http://www.gartner.com/news
D. Donoho, Compressed sensing. IEEE Trans. Inform. room/id/3291817. Accessed 25 April 2016
Theory 52(4), 1289–1306 (2006) S. Gaudin, Get ready to live in a trillion-device world, in
Ericsson Mobility Report, https://www.ericsson.com/res/ ComputerWorld, Sept 11 (2015). http://www.
docs/2016/ericsson-mobility-report-2016.pdf computerworld.com/article/2983155/internet-of-
Ericsson report on “City Index”. https://www.ericsson. things/get-ready-to-live-in-a-trillion-device-world.
com/networked-society/trends-and-insights/city-ind html
ex?utm_source¼programmatic&utm_medium¼dbm- Global 450 Consortium (G450C) Program website. http://
1stpartydata&utm_campaign¼cityindex www.f450c.org/
H. Esmaeilzadeh, E. Blem, T.S. Amant, Global Internet of Things (IoT) Market 2015–2019,
K. Sankaralingam, D. Burger, Dark silicon and the TechNavio report, Feb (2015), http://www.technavio.
end of multicore scaling. IEEE Micro 32(3), com/report/global-internet-of-things-iot-market-
122–134 (2012) 2015-2019
E. Esteve, No reason for FD-SOI Roadmap to follow R. Goodall, D. Fandel, H. Huffet, Long-Term Productiv-
Moore’s law!, SemiWiki forum, 26 April (2016). ity Mechanisms of the Semiconductor Industry, in
https://www.semiwiki.com/forum/content/5720-no- Proceedings of the Ninth International Symposium
reason-fd-soi-roadmap-follow-moores-law.html# on Silicon Materials Science and Technology,
D. Evans, The internet of things—How the next evolution Philadelphia, USA (2002)
of the internet is changing everything, in CISCO white Q&A with Grady Booch: Collaborative Development
paper (2011). http://www.cisco.com/c/dam/en_us/ Environments, https://web.archive.org/web/
about/ac79/docs/innov/IoT_IBSG_0411FINAL.pdf 20081011045609/http://www.alphaworks.ibm.com/
Extended Coverage-GSM—Internet of Things standard. contentnr/cdeintro. Accessed 7 Dec 2006
http://www.gsma.com/connectedliving/extended-cov L. Greenemeier, When Will the Internet Reach Its Limit
erage-gsm-internet-of-things-ec-gsm-iot/ (and How Do We Stop That from Happening)? Scien-
M. Fojtik, D. Kim, G. Chen, Y.-S. Lin, D. Fick, J. Park, tific American, Feb 12th (2013). https://www.
M. Seok, M.-T. Chen, Z. Foo, D. Blaauw, scientificamerican.com/article/when-will-the-inter
D. Sylvester, A millimeter-scale energy-autonomous net-reach-its-limit/
sensor system with stacked battery and solar cells. J. Greenough, J. Camhi, The internet of things: examining
IEEE J. Solid State Circuits 48(3), 801–813 (2013) how the IoT will affect the world, Business Intelli-
G. Frantz, Digital Signal Processor Trends, IEEE Micro gence report (2015)
(2000) H. Jones, Why Migration to 20 nm Bulk CMOS and
L. Freyman, D. Fick, M. Alioto, D. Blaauw, D. Sylvester, 16/14 nm FinFETs Is Not Best Approach for Semi-
A 346 μm2 VCO-based, reference-free, self-timed conductor Industry (white paper) (2014). http://www.
sensor interface for cubic-millimeter sensor nodes in soitec.com/pdf/WP_handel-jones.pdf
28 nm CMOS. IEEE J. Solid State Circuits 49(11), R. Hegde, N.R. Shanbhag, Soft digital signal processing.
2462–2473 (2014) IEEE TVLSI 9(6), 813–823 (2001)
H.T. Friis, A note on a simple transmission formula. Proc. C. Helfmeier, C. Boit, D. Nedospasov, J.-P. Seifert, Clon-
IRE 34(5), 254–256 (1946) ing Physically Unclonable Functions, in Proceeding of
Frost & Sullivan report, Six degrees apart: geo socializa- the IEEE International Symposium on Hardware-
tion: the next trend in social networking, http://www. Oriented Security and Trust (HOST) (2013), pp. 1–6.
growthconsulting.frost.com/web/images.nsf/0/ C. Helfmeier, C. Boit, D. Nedospasov, S. Tajik, J.-P.
8b2a3d36c83d296c652577ee00270b79/$FILE/ Seifert, Physical Vulnerabilities of Physically
Megatrends_GeoSocialization.pdf Unclonable Functions, in Proceedings of the DATE
F. Frustaci, M. Khayatzadeh, D. Blaauw, D. Sylvester, 2014, Grenoble (France) (2014)
M. Alioto, SRAM for error-tolerant applications with Heterogeneous Integration Roadmap, http://cpmt.ieee.
dynamic energy-quality management in 28 nm org/technology/heterogeneous-integration-roadmap.
CMOS. IEEE J. Solid State Circuits 50(3), html
1310–1323 (2015) S. Hofschen, Internet of things—Secure Authentication
Gartner Says By 2017, 50 Percent of Internet of Things and Communication Are Vital for Success, Silicon
Solutions Will Originate in Startups That Are Less Trust website, https://silicontrust.wordpress.com/
Than Three Years Old. http://www.gartner.com/news 2015/01/20/internet-of-things-secure-authentication-
room/id/2869521. Accessed 9 Oct 2014 and-communication-are-prerequisites-for-business-
Gartner Says 6.4 Billion Connected “Things” Will Be in success/. Accessed 20 Jan 2015
Use in 2016, Up 30 Percent From 2015. http://www. B. Horowitz, The hard thing about hard things: building a
gartner.com/newsroom/id/3165317. Accessed 10 Nov business when there are no easy answers,
2015 HarperBusiness (2014)
1 IoT: Bird’s Eye View, Megatrends and Perspectives 43

H. Bauer, M. Patel, J.Veira, The Internet of Things: Sizing E. Kristjansson, Low Power ARM7 Design (2006)
up the opportunity (2014) http://www.mckinsey.com/ [Online]. http://www.microcontroller.com/ARM7_
industries/high-tech/our-insights/the-internet-of- Low_Power_Design_White_Paper.htm
things-sizing-up-the-opportunity R. Kurzweil, The Singularity Is Near (Viking Press,
http://www.postscapes.com/internet-of-things-market- 2005). http://www.singularity.com/charts/page60.
size/ html
W. Hu, P. Corke, W. C. Shih, L. Overs, secFleck: a public J. Kwong, Y.K. Ramadass, N. Verma,
key technology platform for wireless sensor networks, A.P. Chandrakasan, A 65 nm Sub-Vt microcontroller
in Proceedings of the EWSN’09, Cork (Ireland) with integrated SRAM and switched capacitor DC-DC
(2009), pp. 296–311. converter. IEEE J. Solid State Circuits 44(1), 115–126
G.D. Hutcheson, The economic implications of Moore’s (2009)
law, in Into the nano era, ed. by H. Huff (Springer, B.-T. Lee, S.-C. Son, K. Kang, A blind calibration scheme
Berlin, 2009) exploiting mutual calibration relationships for a dense
ICTK, Co. Ltd., (2014). http://www.ictk.com/ mobile sensor network. IEEE Sensors J. 14(5),
servicenproduct/puf 1518–1526 (2014)
International Technology Roadmap for Semiconductors: M. Lueders, B. Eversmann, J. Gerber, K. Huber, R. Kuhn,
2015 edition. http://www.semiconductors.org/main/ M. Zwerg, D. Schmitt-Landsiedel, R. Brederlow,
2015_international_technology_roadmap_for_ Architectural and circuit design techniques for power
semiconductors_itrs (2015) management of ultra-low-power mcu systems. IEEE
Intrinsic-ID, SRAM PUF : the secure silicon fingerprint, Trans. VLSI Syst. 22(11), 2287–2296 (2014)
White Paper (2016) M. Alioto, A. Alvarez, Physically unclonable function
Invia PUF IP. http://invia.fr/infrastructure/physical- database, [Online]. http://www.green-ic.org/pufdb
unclonable-function-PUF.aspx (2016) K. T. Malladi, F. A. Nothaft, K. Periyathambi, B. C. Lee,
ISSCC 2016 Trends. http://isscc.org/doc/2016/ C. Kozyrakis, M. Horowitz, Towards Energy-
ISSCC2016_TechTrends.pdf Proportional Datacenter Memory with Mobile
M.Y. Jaber, Learning curves: theory, models, and DRAM, in Proceedings of the ISCA 2012, Portland,
applications (CRC Press, Boca Raton, 2011) USA (2012).
S. Jankowski, J. Covello, H. Bellini, H. Ritchie, D. Costa, Markets and Markets report—“Internet of Things Tech-
IoT primer the internet of things: making sense of the nology Market by Hardware, Platform, Software
next mega-trend, Global Investment Research, The Solutions, and Services, Application, and Geogra-
Goldman Sachs Group (2014) phy—Forecast to 2022”. http://www.
B. Jovanovic, P.L. Rousseau, Moore’s Law and marketsandmarkets.com/Market-Reports/iot-applica
Learning by Doing. Rev. Econ. Dyn. 5(2), 346–375 tion-technology-market-258239167.html
(2002). http://www.nyu.edu/econ/user/jovanovi/ E. Miluzzo, N. D. Lane, A. T. Campbell1, R. Olfati-Saber,
MooreLawRed.pdf CaliBree: A Self-calibration System for Mobile Sen-
G. Jurczak, Advances and Trends of RRAM technology, sor Networks, in Proceedings of the 4th IEEE interna-
SemiconTaiwan 2015, Taipei (Taiwan) (2015). http:// tional conference on Distributed Computing in Sensor
www.semicontaiwan.org/en/sites/semicontaiwan.org/ Systems, Santorini (Greece) (2008), pp. 314–331
files/data15/docs/2_5_advances_and_trends_in_ R. Min, M. Bhardwaj, S.-H. Cho, E. Shih, A. Sinha,
rram_technology_semicon_taiwan_2015_final.pdf A. Wang, A. Chandrakasan, Low-Power Wireless
K. L. Lueth, IoT market segments—Biggest opportunities Sensor Networks, in Proceedings of the 14th Interna-
in industrial manufacturing, IOT Analytics report tional Conference on VLSI Design (2001),
(2014). http://www.icinsights.com/news/bulletins/ pp. 205–210
Internet-Of-Things-Boosts-Embedded-Systems- B. Murmann, ADC Performance Survey 1997–2016.
Growth/ http://web.stanford.edu/~murmann/adcsurvey.html
L. Karam, I. Alkamal, A. Gatherer, G.A. Frantz, P. Narendra Mahalle, P.N. Railkar, Identity Management
D.V. Anderson, B.L. Evans, Trends in multicore for Internet of Things (Rivers Publishers, Netherlands,
DSP platforms. IEEE Signal Process. Mag. 26(6), 2015)
38–49 (2009) D. Nedospasov, J.-P. Seifert, C. Helfmeier, C. Boit, Inva-
I. Keene, Market trends: small cell infrastructure, femtocell, sive PUF Analysis, in Proceedings of the 2013 Work-
picocell and carrier wi-fi hot spot deployment plans start shop on Fault Diagnosis and Tolerance in
to solidify, Gartner report, 19 Mar (2015) Cryptography (2013), pp. 30–38
O. Kommerling, M. Kuhn, Design principles for tamper- V. Nehru, H. S. Jattana, Efficient ASIC Architecture of RSA
resistant security processors, USENIX Workshop on Cryptosystem, in Procedings of the Fourth International
Smartcard Technology, Chicago, IL, 10–11 May Conference on Advances in Computing and Information
1999. http://www.cl.cam.ac.uk/Research/Security/ Technology (ACITY 2014), Delhi, India (2014)
tamper (1999) T. Ngo, Why Wi-Fi Stinks—and How to Fix It, IEEE
J. Koomey, et al., Implications of historical trends in the Spectrum, http://spectrum.ieee.org/telecom/wireless/
electrical efficiency of computing. IEEE Ann. History why-wifi-stinksand-how-to-fix-it. Accessed 28 Jun
Comput. (2011) 2016
44 M. Alioto

A. Nordrum, Popular internet of things forecast of 50 bil- S. Skorobogatov, Optical Fault Masking Attacks, in
lion devices by 2020 is outdated, IEEE Spectrum, Proceedings of the 2010 Workshop on Fault Diagno-
http://spectrum.ieee.org/tech-talk/telecom/internet/ sis and Tolerance in Cryptography (2010), pp. 23–29
popular-internet-of-things-forecast-of-50-billion- STMicroelectronics Nescrypt cryptoprocessor datasheet,
devices-by-2020-is-outdated, Accessed 18 Aug 2016 http://www.st.com/en/secure-mcus/sr31z052.html
NXP Cryptographic Acceleration Technology, http:// I. Stojmenovic (ed.), Handbook of Sensor Networks:
www.nxp.com/products/identification-and-secu Algorithms and Architectures (Wiley, Hoboken, 2005)
rity/network-security-technology/cryptographic-accel R. Strenz, Embedded Flash Technologies: Enabler for
eration-technology:NETWORK_SECURITY_ Automotive μCs & Smartcards, in Workshop on Inno-
CRYPTOG vative Memory Technologies MINATEC, Grenoble,
B. Otis, J. Rabaey, Ultra-Low Power Wireless Technologies France, 21 June (2012), http://leti.congres-
for Sensor Networks (Springer, Berlin, 2007) scientifique.com/workshopmemories2012/11_Strenz_
D. Patterson, B. Nikolic, Agile Design for Hardware, Leti_InnovativeMemoryWorkshop_2012.pdf
Part I, EETimes, July 27 (2015), http://www.eetimes. Y. Taito, M. Nakano, H. Okimoto, D. Okada, T. Ito,
com/author.asp?doc_id¼1327239 T. Kono, K. Noguchi, H. Hidaka, T. Yamauchi, A
QuantumTrace, LLC PUF IP Product (2013), http://www. 28 nm Embedded SG-MONOS Flash Macro for Auto-
quantumtrace.com/Products/IP/PUF%20IP/ motive Achieving 200 MHz Read Operation and 2.0
R. Botsman, We’ve stopped trusting institutions and MB/s Write Throughput at Tj of 170  C, in IEEE
started trusting strangers, TED talk (2016), http:// ISSCC Dig. Tech. Papers (2015), pp. 132–133
www.ted.com/talks/rachel_botsman_we_ve_stopped_ M. Tehranipoor, C. Wang, Introduction to Hardware
trusting_institutions_and_started_trusting_strangers Security and Trust (Springer, Berlin, 2012)
Y.K. Ramadass, A.P. Chandrakasan, Minimum energy Tesla, https://www.tesla.com/
tracking loop with embedded DC-DC converter Texas Instruments Cryptoaccelerators, http://
enabling ultra-low-voltage operation down to processors.wiki.ti.com/index.php/Cryptography_
250 mV in 65 nm CMOS. IEEE J. Solid State Circuits Users_Guide#Devices_Supported
43(1), 256–265 (2008) The Digital Universe of Opportunities: Rich Data and the
S. Ray, J. Yang, A. Basak, and S. Bhunia, Correctness and Increasing Value of the Internet of Things, IDC report,
Security at Odds: Post-silicon Validation of Modern April (2014). http://www.emc.com/leadership/digital-
SoC Designs, in Proceedings of the 52nd Annual universe/2014iview/index.htm
Design Automation Conference (2015) The Global IoT Market Opportunity Will Reach USD4.3
W. C. Rhines, Cost Challenges on the Way to the Internet Trillion by 2024, Machina Research report, (2015).
of Things—Keynote at IEEE ICECS 2015, Cairo, https://machinaresearch.com/news/the-global-iot-mar
Egypt (2015) ket-opportunity-will-reach-usd43-trillion-by-2024/
T. Ricker, First Click: This $1 chip will connect The Internet of Things—Opportunities and challenges for
your things to the city for free, The Verge, http:// semiconductor companies.pdf, McKinsey Global
www.theverge.com/2016/2/17/11030692/Lorawan-in Institute and Global Semiconductor Alliance report,
ternet-of-things-network-amsterdam. Accessed May (2015); Executive summary available at https://
17 Feb 2016 www.mckinsey.de/files/mckinsey-gsa-internt-of-
E.M. Rogers, Diffusion of Innovations, 5th edn. (Simon things-exec-summary.pdf
and Schuster, New York, 2003) M.-K. Tsai, Cloud 2.0 Clients and Connectivity—Tech-
S. Rosenblatt, D. Fainstein, A. Cestero, J. Safran, nology and Challenges, keynote at IEEE ISSCC
N. Robson, T. Kirihata, S.S. Iyer, Field tolerant (2014). http://isscc.org/videos/2014_plenary.html
dynamic intrinsic chip ID using 32 nm high-k/metal G. Uhlmann, T. Aipperspach, T. Kirihata,
gate SOI embedded dram. IEEE J. Solid State Circuits C. Kothandaraman, Y. Li, C. Paone, B. Reed,
48(4), 940–947 (2013) N. Robson, J. Safran, D. Schmitt, and S. Iyer, A
C.E. Shannon, Communication in the presence of noise. commercial field-programmable dense eFUSE array
Proc. IRE 37(1), 10–21 (1949) memory with 99.999% sense yield for 45 nm SOI
M. Shoaib, N.K. Jha, N. Verma, Signal processing with CMOS, in IEEE ISSCC Dig. Tech. Papers (2008),
direct computations on compressively sensed data. pp. 406–407
IEEE Trans. VLSI Syst. 23(1), 30–43 (2015) H.C.A. van Tilborg, S. Jajodia (eds.), Encyclopedia of
M. Siekkinen, M. Hiienkari, J. K. Nurminen, J. Nieminen, Cryptography and Security (Springer, New York,
“How Low Energy is Bluetooth Low Energy? Com- 2005)
parative Measurements with ZigBee/802.15.4,” in Verayo Inc. (2013), http://www.verayo.com/tech.php
Proceedings of the WCNC 2012 (2012), pp. 232–237 O. Vermesan, P. Friess, Internet of Things—From
Silicon Cloud website, https://siliconcloudinternational. Research and Innovation to Market Deployment
com/white-paper/ (River Publishers, Netherlands, 2014)
Silicon Labs 32-bit Microcontroller Application Notes, H.F. Wang, Y.L. Zhang, CAD/CAM Integrated System in
http://www.silabs.com/products/mcu/Pages/32-bit- Collaborative Development Environment. Robot
mcu-application-notes.aspx Comput. Integrated Manuf. 18, 135–145 (2002)
1 IoT: Bird’s Eye View, Megatrends and Perspectives 45

WebSphere Process Server documentation. https://www. B. Zhang, N. Mor, J. Kolb, D. S. Chan, N. Goyal, K. Lutz,
ibm.com/support/knowledgecenter/SSQH9M_7.0.0/ E. Allman, J. Wawrzynek, E. Lee, J. Kubiatowicz, The
com.ibm.btools.modeler.advanced.deploy.doc/ cloud is not enough: saving iot from the cloud, in
measures/aggregates.html HotCloud’15 Proceedings of the 7th USENIX Confer-
B. Wertz, “The One Barrier to Entry Startups Should ence on Hot Topics in Cloud Computing, Santa Clara,
Focus on,” on VersionOne website, http://versionone. USA (2015)
vc/the-only-barrier-to-entry-you-should-care-about/. W. Zhao, Y. Ha, M. Alioto, Novel self-body-biasing and
Accessed 5 Nov 2012 statistical design for near-threshold circuits with ultra
Worldwide Internet of Things Forecast, 2015–2020, IDC energy-efficient aes as case study. IEEE Trans. VLSI
report, May (2015). http://www.idc.com/getdoc.jsp? Syst. 23(8), 1390–1401 (2015)
containerId¼256397 B. Zimmer, Y. Lee, A. Puggelli, J. Kwak, R. Jevtic,
Worldwide Smartphone Growth Forecast to Slow to 3.1% B. Keller, S. Bailey, M. Blagojevic, P.-F. Chiu, H.-P.
in 2016 as Focus Shifts to Device Lifecycles, IDC Le, P.-H. Chen, N. Sutardja, R. Avizienis,
report, June (2016). http://www.idc.com/getdoc.jsp? A. Waterman, B. Richards, P. Flatresse, E. Alon,
containerId¼prUS41425416 K. Asanovic, B. Nikolic, A RISC-V Vector Processor
K. Yano, T. Akitomi, K. Ara, J. Watanabe, S. Tsuji, with Simultaneous-Switching Switched-Capacitor
N. Sato, M. Hayakawa, N. Moriwaki, Profiting From DC–DC Converters in 28 nm FDSOI. IEEE J. Solid
IoT: The Key Is Very-Large-Scale Happiness Integra- State Circuits 51(4), 930–942 (2016)
tion, keynote at VLSI Symposium 2015 (2015)
IoT Nodes: System-Level View
2
Pascal Urard and Mališa Vučinić

In this chapter, we detail the key elements of the • a battery: many possibilities, from alkaline to
wireless sensor network nodes architectures. We long-life Lithium based
also review the most important tradeoffs to make
in order to maximize the system energy effi- The current market trend is to enlarge battery
ciency, keeping the cost of the solution under life by increasing the global solution energy effi-
control. We also review the pairing and registra- ciency. One of the research directions is to get rid
tion operations and detail the security require- of the disposable batteries and create an energy-
ments as well as the impact of security on the autonomous solution at a reasonable price thanks
energy efficiency and the cost of the solution. to small energy harvesters. In the case of such an
autonomous node the system would also embed:

2.1 Architecture of IoT Nodes • An energy harvester enabling energy


harvesting from the environment: photovoltaic
There are multiple possible architectures of IoT cell or vibration harvester, or even a Seebeck
nodes. Depending on the mission profile, the effect harvester enabling to create some energy
use-case conditions, the pairing conditions to from a delta of temperature between two
setup the network, topology of the network, and materials, or between a hot surface and a
for sure the application use-cases, then the archi- radiator.
tecture shall be optimized in a direction or • an energy storage, usually a rechargeable bat-
another. However, some elements are common tery or a supercap
to the possible architectures: • a power management unit, typically taking
care of adapting harvester voltage value to
• One or several processors, typically MCUs the battery, and managing battery charge/dis-
(microcontroller units) like STM32, based on charge, plus eventually energy distribution to
ARM CortexM solutions the system (Fig. 2.1).
• a communication unit (i.e.,: radio for wireless
sensor networks), In the following, we take as an example a
• one or several sensors or actuators wireless sensors network end-device node. This
node is collecting data from the environment
(sensor) or act on the environment (actuator).
P. Urard (*) • M. Vučinić The so-called ‘environment’ is a general term
STMicroelectronics, Crolles, France that can represent multiple domains. The sensing
e-mail: urard.pascal@gmail.com

# Springer International Publishing AG 2017 47


M. Alioto (ed.), Enabling the Internet of Things, DOI 10.1007/978-3-319-51482-6_2
48 P. Urard and M. Vučinić

Fig. 2.1 Basic blocks and


functions

function is for sure adapted to each environment. source through a switch or a relay, managed
In the case of: by a GPIO. This is typically the case of an
actuator like a motor, or in the case of some
• a motor: sensing T or vibration, . . . high-current consumption sensors.
• a room: sensing T , pressure, humidity, light, • The digital sensors interfaces can be directly
gas (CO or CO2 or any chemicals), noise, connected to the I2C or SPI buses of the
presence of roommate, . . . application processor
• a forest: sensing humidity, T , fire, light, • The analog sensors can be connected directly
gases, chemicals, noise, and even the presence to analog-to-digital IOs. Some MCUs like
of bugs,. . . STM32 can share this ADC between several
analog inputs, enabling to reuse this important
The first task is to precisely define the use component to manage several analog sensors.
cases and the targeted cost of the solution. This • In the case a sensor is used as a wake-up func-
enables to select what kind of sensors have to be tion, then the usually available “wake-up” out-
used, as well as the amount of local computing in put of the sensor should be connected to the
the node (by opposition to raw-data sending processor interrupt controller to turn on the
through the network) and the kind of MCU MCU
needed. Then the global architecture often • In the case a sensor needs some specific ana-
respects the following tradeoffs. log voltage values, MCUs also embed a
digital-to-analog converter (DAC).
• In order to keep a low-cost solution, the appli-
cation processor can manage directly sensors Concerning MCU choice, modern low-power
and actuators: MCU solutions like STM32L family embed a
• Sensor power supply: in the vast majority of the built-in DCDC converter enabling to drastically
low-power sensors, sensor power supply can be reduce the power consumption of the processor
directly connected to one of the MCU digital itself. In the case such a processor is used, the
General Purpose Input/Output (GPIO) pins. sensors and eventually the radios powered by this
• Actuator power supply: in some other case, processor will also benefit of this power reduc-
when the load need in terms of current is tion. Let’s take the example of a 3 V battery-
higher than the MCU capability (tens of mA), operated sensing system embedding an MCU and
the actuator can be connected to the power a low-power sensor. In the case the MCU has an
2 IoT Nodes: System-Level View 49

eDCDC (i.e.,: embedded DCDC, which is the keeping an application processor for system and
case of the STM32L151), then a 1.2 V sensor application management and an additional net-
that consumes ~5 mA under 1.2 V will only draw work co-processor for the radio and protocol
2 mA from the 3 V battery if powered through management enables to have a quite robust solu-
the MCU. tion where both processors can be turned on or
In many cases, the radio or the radio subsys- off independently. It also enables to set the opti-
tem has its own DCDC or can also take advan- mization of each subsystem independently from
tage of the eDCDC of the main processor. the other (i.e.,: choose for each function the most
Radio subsystem can rely on a module or on adequate solution in terms of computing capac-
separated components. Module solution is easier ity, without affecting the other functions).
to integrate but also more expensive. In recent Having only one processor is more complex
generations of MCUs (e.g., NXP/Freescale), the because depending on radio protocol, radio
radio subsystem covering Bluetooth-low-energy actions may be higher priority than application,
and 802.15.4 is integrated in the main processor. interrupting an on-going sensing in some cases
One of the first choices to be done is the (so we have to define priorities and possibly
system partitioning at processor level. Several require software-level concurrency contol).
possibilities exist at SoC level as shown in The battery voltage choice is also a key
Fig. 2.2. parameter of the system line-up, as the battery
Depending on the use case, the chosen archi- voltage is heavily affecting the global architec-
tecture of the MCU should be more or less pow- ture of the node. If the battery voltage is higher
erful. One of the key requirements of a than the energy harvester output voltage, then a
low-power solution is to not wake-up the proces- DCDC boost is needed to charge de battery. This
sor(s) for nothing. Some system analysis with the component will enable to target high charging
different tradeoffs shall be done, taking into voltage batteries like LiPO. Another important
account the different wake-up of the MCU(s), parameter in the choice of the battery is the
the power-consumption during these active internal resistance value. Minimizing this resis-
periods, and also the cost of waking-up the tance enables to increase system power effi-
processors. Choosing a powerful MCU to man- ciency. A high internal resistance (10s or 100s
age at the same time application and communi- of ohms) will lead to an important voltage drop
cation may be lower cost than two processors, when the system is running, and may require
but also require to turn-on regularly a large raising the input voltage value when the har-
power-consuming processor. On the other hand, vester is charging the battery. Some batteries

Fig. 2.2 System level


radio partitioning
50 P. Urard and M. Vučinić

must be charged at a fixed given voltage. In this • A quick estimation of the battery efficiency is
case, the battery current charge decreases when given by the following ratio: Vbatt when using
the battery charge reaches its maximum capacity. the system/Vbatt when charging. The higher
In other case, the battery must be charged in the internal resistance of the battery, the
current, always keeping the current bellow a lower is this ratio.
maximum given in order to avoid any damage
to the battery. In this case, the PMU shall adapt
the impedance of the harvester to the battery, and
battery voltage (Vbatt) increases with the battery 2.2 Requirements for IoT Nodes
charge, up to a certain point where the charge has
to be stopped to preserve the battery. Continuing Depending on the usage and the targeted market
the charge may damage it. one radio protocol will have to be chosen to
There are some key advices that could save enable interoperability. For industrial market
time to design a performant solution: purpose, it is admitted that sub-gigahertz radios
are the ones to target. For home-automation and
• The battery voltage range should be chosen building automation, it can be both. The market
taking into account the energy harvester out- seems to push 2.4 GHz at home level: NEST-
put voltage. Minimizing the number of volt- labs, acquired by Google and promoting Thread,
age conversions enables better system seem to rely primarily on Bluetooth low-energy
efficiency. and 802.15.4 @ 2.4GHz (web:), but Zwave and
• The harvester has to be chosen to meet the at a lower scale the energy-harvested solution
requirements of your use cases: typical/mini- Enocean still offer sub-gigahertz solutions. At
mal light and output charging voltage to min- local or regional area, we see upcoming solutions
imize the need for complex PMU. A PV cell like Lora or Sigfox enabling long (resp. very
can easily and at low cost be designed on long) distance of transmission at the cost of a
purpose in many places worldwide. low or super-low data rate. There are plenty of
• In the case of unknown harvesting conditions applications that could take advantage of such
or risk of non-respected use-cases, an MPPT solutions in the coming years.
(maximum power point tracking) or pseudo- We can divide these solutions in two
MPPT power management unit can be chosen topologies (see Fig. 2.3): the solutions enabling
to adapt to any situation. This option has a star network topology such as Wifi, Lora, Sigfox,
higher cost. or even GSM or LTE, and the ones enabling

Fig. 2.3 Network


topologies and definitions
2 IoT Nodes: System-Level View 51

Mesh networks such as 802.15.4x including • System with wake-up on purpose (e.g., alarm).
Zigbee & Thread, Zwave, Enocean,. . .
We discuss in the next section the impact of In the vast majority of modern sensors, this
topology on the power consumption. function is natively embedded in the sensors,
providing a mode with activity watch. So the
design can directly use this feature, usually
2.2.1 Power called low-power mode.
In order to successfully design an ultra-low-
In order to target long battery life duration or power solution, one should think energy
energy harvesting compatibility, there should be harvesting and duty-cycling. See Fig. 2.4.
a ratio of at least x500 between the system sleep- For sure, every component of the node has to
ing time and its functional time. be carefully chosen to minimize power. How-
ever, in the first place, the radio choice and
• Processor in terms of current is in the range of radio protocol are one of the major contributors
mA (e.g., 5–10 mA for CortexM3 running at in power consumption.
12 MHz under a 3 V supply, which results in Minimizing power by minimizing protocol
15 to 30 mW power) when in activity, but overhead is an excellent starting point. However
only around 1uA in deep sleep mode (3 μW). as we will see in the next sections this may go
It should logically evolve in the coming years, against interoperability. The first adopted
thanks to Moore’s law and additional low systems in the past used to propose proprietary
power techniques to achieve a division by optimized solutions for radios and non-IP
3 of this power consumption, reaching 5 to protocols. However, the future seems to be
10 mW for the same function. standards-based and IP-based. The Moore’s law
• Energy cost of wake-up and go-to-sleep has to helps to gain in power efficiency: the former
be taken into account in the tradeoff, as it is generations of radios used to consume 20 mA
not negligible. Processors usually offer sev- @ 3 V to send a frame at 0dBm (1 mW in the
eral kinds of sleep modes (light, medium, air). In 2015, the best solutions on the market
deep). Each of them can have some interest, consume bellow 6 mA for the same service, and
depending on your application and context. the trend is still to reduce further, towards 3 mA
Consider that you have to empty a large in the 2 to 4 coming years.
on-board capacitance to make the system Once the radio power is under control, the leak-
sleep; the price-of-wake-up may be higher age has to be addressed. Let’s imagine a node
than putting the system in an intermediate working during 20 ms every 2 min (mean schedule
sleep mode where this capacitance remains in real operating mode) with a mean power of
charged. So it is important to take this param- 30 mW (10 mA under 3 V). It would consume
eter into account, and not put the system in around 600 μJ per active cycle. If during the inac-
some deeper sleep mode when the power bud- tive period, the total current (board level leakage +
get is finally larger than staying in some ligh- low-power-watch-dogs + counters) is around
ter sleep mode for a given time. 2 μA (6 μW), then the total leakage would be
• Sensors have to be turned off as much as 720 μJ per inactive cycle. The Pareto of the
possible (i.e.,: between two sensing phases), power consumption would start by the leakage!
or at least forced to low-power mode if they So the leakage can really become your number
have some, especially when they have the one issue to meet your total power budget. In the
mission to wake-up the system. case your objective is to build an energy-harvested
• Radio has to be mainly off. platform, the global requirements should lead you
to consider every leakage above 30 nA in stop
We can consider two families of systems: mode, and any way to save 5 μJ or more in
• System with regular wake-up active mode.
52 P. Urard and M. Vučinić

Fig. 2.4 Heavy duty cycling enables Ultra-Low-Power solutions

2.2.2 Cost issue was not a major problem few years ago, but
it is more and more reported by final customers
Radio protocol has a direct impact on the global that the battery budget per year can be a show-
cost, because of its huge impact on the power stopper after a first trial. Imagine you have to
efficiency. Also, factors like whether the upper change every year heavy-duty long-duration
protocol is IPv6 or not-IP, if security is enabled lithium-AA batteries in 10 devices. The cost
or not will directly impact useful data rate, and so after 5y shall be higher than the system itself.
the global energy per useful bit transmitted, lead- The pain to change the batteries every year in a
ing to expend the battery size or reduce the lifetime more-than-20 devices network very quickly
of the node operating in harsh conditions where the becomes upsetting or too expensive if a specialist
eventual energy harvester would not be able to has to do it. We can easily understand why it is
harvest anything (i.e.,: dark for a PV cell). requested to optimize the solutions, still
The battery choice, as seen in the previous maintaining the cost low:
section will heavily impact the cost of the node.
Choosing components with high power con- • by choosing an optimal system partitioning
sumption would lead to choose a large battery, • by optimizing global power efficiency
like the ones of the cellphones. Even if the price • by minimizing additional components and
in volume may seem reasonable, it shall impose voltage conversions.
to add a specific power-management unit,
enabling to charge this battery up to 4.2 V. Let’s take as an example the final cost of an
Such a high voltage will prevent to connect energy-harvested node in volume, the GreenNet
directly the MCU to the battery: MCU maximum node V2.1 as shown in Fig. 2.5 (Urard et al.
operating voltage is usually around 2.6 V. 2015). It was targeting a Bill-Of-Material
The cost of not-optimizing the global power (BOM) in high volume below 12 USD. This
consumption would lead to a poor battery life for R&D project has been an ST-internal demonstra-
some or even all the nodes of the network. This tor to understand the WSN challenges and
2 IoT Nodes: System-Level View 53

Fig. 2.5 GreenNet V2.1: Energy Harvester secured IPv6 Node content

requirements, regarding system energy effi- Figure 2.6. shows the key elements of various
ciency on the hardware side, and secured-IPv6 802.15.4 solutions and highlights in red some of
power-optimized solution compatible with the most interesting added values to increase the
energy harvesting on the protocol side, in coop- quality of service or the energy efficiency of the
eration with the Laboratoire d’Informatique de network. Please note the similarity between those
Grenoble (LIG). GreenNet is a successful dem- solutions. Most of them are using 6LoWPAN
onstrator but has not been commercialized. Some for IPv6 interoperability. The two on the right
of the key elements are detailed in Chap. 17. (circled in red) are the latest ones, with a larger
adoption of 802.15.4e, now part of 802.15.4-
2015 specification. However, in some particular
cases detailed in the coming chapters, there is a
2.2.3 Interoperability
room for improvement and some part of the
802.15.4-beaconed may in the future be reused
Wireless Sensor Networks have suffered for
to further improve the energy efficiency of
years from the fragmentation of the radio offer:
802.15.4e standard.
too many different standards, not enough inter-
802.15.4x solutions aren’t the only WSN
operability. Zigbee is the typical example where
protocols to adopt IPv6: as an example, 802.15.1
4 main “application profiles” were offered as
(Bluetooth) compliancy to IPv6 has been devel-
defensive layers, one for each of the markets:
oped in 2015 and should be available during 2016.
Smart Energy, Home Automation, Building
Automation, and Lighting. Thread, pushed by
Google through Nest-labs, has pushed further
by proposing a single profile for all the 2.2.4 Security
use-cases. IPv6 is a definitely a trend: ensuring
interoperability by offering an IPv6 solution. Few years ago (2013), while the author was
Nowadays, it seems also as one of the preferred demonstrating GreenNet in Paris, audience was
ways to adopt a secured solution. IPv6 security is not convinced, at the time, about the need of
constantly evolving, so progressing at low cost secured connections for WSN. In only 2 years,
can only come from solutions that can be shared the number of hacks worldwide, the need for
among the standards. more secured solutions at all levels, especially
54 P. Urard and M. Vučinić

Fig. 2.6 2.4 GHz 802.15.4 enlarged protocols family

private data, and the fact that the interesting infor- • Passive attacks: Such as eavesdropping and
mation for hackers is maybe not the one you think traffic analysis, where the attacker gains
about, made people change their mind. Nowa- knowledge on ongoing communication by
days, no question any more: security is a must. passive means. For instance, if messages are
We present these hereafter an overview of sent in clear, attacker is able to read full mes-
typical threats and attacks in WSNs. Some more sage content. If network messages are
details can be found in (Vučinić et al. 2015). encrypted, attacker may still be able to infer
Security of a system can be studied within a some information by studying communication
given model. Internet protocols typically con- patterns, timing, or message length.
sider the traditional Dolev-Yao model (Dolev • Active attacks: Attacker actively participates
and Yao 1982) where the attacker has full control in the communication by re-playing old
over the network. messages (replay attack), modifying messages
More precisely, the attacker can: and playing Man in the Middle (MITM),
• Intercept messages, pretending to be another entity in order to
• Modify messages, gain unauthorized access to a resource and
• Block messages, similar. A particular class of active attacks
• Generate and insert new messages are Denial of Service (DoS) attacks, where
the attacker’s ultimate goal is to disrupt the
It is important to understand that crypto- availability of a network service, such as the
graphic algorithms are considered “perfect” and alarm notification, typically by exhausting
the attacker can decrypt/forge a message only if physical resources (memory, energy, band-
he possesses the corresponding key. In the net- width) on the target node.
working context, “message” corresponds to a
Protocol Data Unit (PDU) of an abstraction An important point to note is that the Dolev-
layer under study. For instance, if we consider Yao model typically considered in protocol
security solution at the link layer (radio proto- design is a formal model that does not take into
col), message corresponds to a radio frame. account physical compromise of a node. There-
Traditionally, there are two typical classes of fore, research around WSNs (Atakli et al. 2008;
attacks: Chan et al. 2003; Karlof and Wagner 2002;
2 IoT Nodes: System-Level View 55

Rezvani et al. 2015; Vempaty et al. 2012) and Midkiff 2008). This DoS attack is often
has often taken into account a more powerful, called jamming and is mostly a concern in
Byzantine attacker (Awerbuch et al. 2004). military scenarios. Common defense is chan-
In such scenarios, attacker has access to local nel hopping that increases the bandwidth
cryptographic material on the node and we can- attacker has to jam, which can require a sub-
not rely on cryptographic techniques to prevent stantial power supply and thus make the
attacks (Rezvani et al. 2015). Indeed, Byzantine attack less practical. Also, network-level
attacker can compromise a set of nodes and redundancy can help in order to route around
through them inject false data that passes all the jammed area.
cryptographic checks. • Traffic Injection. Injecting false traffic in the
Becher et al. (2006) conclude that physical network can have multiple consequences.
compromise of a device in order to extract Firstly, it is possible to affect network
keying material and obtain full control over it, applications, e.g., by introducing a bogus tem-
as assumed by the Byzantine model, is not as perature reading to trigger the Heating, Venti-
easy as often perceived in WSN literature. lation, and Air Conditioning (HVAC) system,
It requires costly equipment, expert knowledge or even to directly control an actuator, such as
on hardware and hard determination of the the pressure regulating valve in the industrial
attacker. An interesting observation of this study automation system. Similarly, one can obtain
is that such attacks often require that a node be full control over the network by forging net-
removed from the network for a non-trivial work maintenance packets (Karlof and
amount of time making detection of unusual Wagner 2002), e.g., beacons, and corrupting
activity via neighbor discovery protocols a sim- neighbor tables of the nodes. Secondly,
pler approach than specialized Byzantine-tolerant attacker can launch DoS attacks by generating
schemes. Common sense practices, such as dis- significant traffic loads that can cause network
abling JTAG port or Bootstrap Loader (BSL) once collapse in terms of depleted energy due to,
the product is deployed, go a long way towards e.g., multihop forwarding. First-level protec-
making attacks in the field more difficult. tion against such attacks is link-layer secu-
We do recognize that in many IoT rity—network nodes should not accept any
deployments, devices will be physically avail- radio frame other than those secured with
able to the general public and as such, system link-level keys they possess locally. At the
designers should take into account the threat of a application level, access rights should be
physical compromise and extraction of the properly configured in order to limit the dam-
keying material. We emphasize that final IoT age if one of the nodes in the network is
products should either have hardware- level or compromised. For instance, node measuring
software-level protection against physical tam- the temperature should not be allowed to issue
pering, i.e., tamper-resistant packages or pressure valve regulating commands. Second-
schemes to detect unauthorized access to the level defense is common sense program-
hardware (Becher et al. 2006; Liu et al. 2007). ming—if some of the network nodes gets
compromised and starts injecting
2.2.4.1 Wireless Network Threats cryptographically-valid traffic, one should
locally check the rate at which it is forwarding
• Physical Jamming. The most basic attempt to
packets or performing local operations instead
disrupt the network service is the attack on
of blindly following the protocol.
physical resources—the radio channel.
• Attacks on Join Protocol. Link-layer security
Attacker can generate high-power signal that
protects the wireless network in “steady”
will interfere at different receivers in the net-
state, when all the nodes have joined and
work and increase the error rate, possibly
have been provisioned with necessary keying
completely disrupting wireless operation
material. Before we admit a new node in the
(Law et al. 2005; Li et al. 2007; Raymond
56 P. Urard and M. Vučinić

network, it is necessary to perform some represents a privacy concern, we focus on


checks. Join protocols are technology-specific information that may leak to an outsider.
but some common points exist: Obviously, data confidentiality at the link
– The joining node may initiate the join pro- layer (protected radio frames) is the first step
tocol multiple hops away from the gateway to improve user’s privacy. In many IoT
– Several messages may need to be scenarios, however, radio communication
exchanged between the joining node and alone suffices to reveal some information
the gateway before the “admittance” deci- about the user. For example, a presence sensor
sion can be made. This necessitates that may initiate radio communication when a per-
intermediate nodes in the mesh forward son enters a building (Tschofenig et al. 2015)
the messages that may come from a rogue or a light switch may indicate that the state has
joining node (attacker), which opens up the been toggled by emitting a radio frame. Typi-
possibility of DoS attacks. Although this cal defense would involve injecting dummy
threat can never be fully neutralized, a com- traffic in the network but that may not always
mon strategy is to minimize the damage a be feasible due to the local energy constraints.
potential attacker can do. As such, one may
ensure that joining messages do not instill
state information in the network and can 2.2.4.2 Countermeasures
control the rate at which intermediate Threats described in above section are typically
nodes forward join protocol messages. fought using security mechanisms at 2 levels:
• Attacks on Routing Protocol. Due to their Level 2 (L2, security between radio neighbors)
distributed nature, WSNs are prone to attacks for hop-to-hop security or link-layer security and
that involve an attacker that can for example: Level 5 for end-to-end security (L5, security
– Selectively forward messages if it is within between an IoT node and e.g., smartphone).
the network, or jam radio transmissions Each level of protection needs to meet 4 funda-
and cause collisions from the outside. mental security goals (Fig. 2.7):
– Advertise false routes in order to attract the
surrounding traffic and create a sinkhole. 1. Confidentiality: ensured by using encryption.
– Present multiple false identities to other 2. Integrity: we want the message to arrive
nodes in order to reduce effectiveness of safely, and not loose parts of it. Ensured by
fault-tolerant schemes. appending a checksum at the end of a
– Create radio “tunnels”, so called message.
“wormholes”, between two distant parts 3. Authentication: Am I talking to the right node
of the network in order to appear closer to and is the message I received coming from the
the gateway and create a sinkhole at the right node? Ensured by an authentication pro-
other end of the tunnel. Such attacks can tocol and by using secure checksums,
only partly be neutralized by using link- computed using a secret key.
layer security in order to reject radio 4. Availability: Achieving the guaranteed level
frames coming from the outside. When an of operational performance even in presence
attacker is inside the network, i.e., a of Denial of Service attacks (DoS) is a
compromised node, defense requires care- non-trivial task for a system designer. In
ful design of the routing protocol that takes terms of link-layer security, message filters
security into account from the beginning and access control lists implemented in hard-
(Karlof and Wagner 2002). ware allow certain degree of confidence but
• Privacy issues. Sensor and actuator networks are alone not enough due to possible jamming
that make part of our daily life bring along attacks. Radio technologies that use frequency
various privacy issues. While management of hopping (802.15.4e and BLE) help in
data collected by these networks in itself preventing networks to be stuck on a single
2 IoT Nodes: System-Level View 57

Fig. 2.7 The four fundamental security goals of a secured solution

radio channel and make it more difficult to the 2.3.1 Node Availability and
attacker to disrupt the service. However, Duty-Cycled Operation
availability is a system-level goal and thus
must not be treated only at the node level, As for the system, the radio shall sleep as long as
but also at the gateway level thanks to: it makes sense. This is different than “as long as
(a) The gateway firewall possible” for several reasons. First reason: the
(b) The decoupling of fast Internet world need for synchronization in radio protocols. In
(HTTPS) and the slow one (CoAP this trade-off, the topology plays a major role.
over DTLS).
• Case of star network topologies: we can con-
Attacks on the routing protocol are, from the sider the central node (concentrator) as
point of view of confidentiality and authentica- always-on like on Wi-Fi or Sigfox or Lora.
tion, defended using end-to-end security • Case of Mesh or extended Star topology
mechanisms, where even the on-path attacker is (extended star enables each node to have one
not able to modify or read the data, as it is not in additional peer attached to it). In this case
possession of the end-to-end keys. there are multiple solutions:
– Either the routers are always or mainly
listening (e.g., Enocean, Zwave, Zigbee,
802.15.4-by-default) in order to receive
2.3 Power-Related Challenges
information from the other nodes of the
and Design Tradeoffs
mesh and transmit them to the next node
– Either both emitter and receiver are mainly
Chapter 2 of Varga’s PhD thesis (Varga 2015)
off. This is the trend of the new radio
presents an overview of the latest technologies
standards. Bluetooth-low-Energy (BLE),
used in IoT as well a presentation of the
802.15.4e, 802.15.4-beaconned-option
synchronized and unsynchronized operation
standards are proposing mainly off
mode. These are presented hereafter.
58 P. Urard and M. Vučinić

solutions with synchronized wake-ups. This communicate with him can get by decoding this
enables better energy efficiency versus beacon, the time of the next available beacon,
older standards. We can say the system has the time slot when to emit if they need and even
slots: an active period where the radio the fact that they have some information to
transmits or listens to the RF activity request to the master node in the case they
(in case of message for the node), and an have: Bluetooth low-energy, 802.15.4
inactive period where the radio is turned off. beaconed-option are working this way.
By extension we say the system is slotted.
All these possibilities enable to have more or
Slotted systems have however a constraint to less energy efficient protocols see for instance
solve: the need for synchronization between Romaniello (2015). In Fig. 2.8, we can consider
emitters and receivers. As there is no always- that green bubble could enable energy-harvested
listening node over the air, a de-synchronized nodes in some specific conditions.
network would have a very poor quality of ser-
vice, unless the active slots represent a very high • In the first category, we can fit mains-powered
percentage of the time, reducing by this way the devices. We can also consider part of this
efficiency and the interest of such solutions category the networks where all the devices
regarding previously existing ones. have been declared as routers. This is typi-
As a conclusion: in order to maximize energy cally the case when all regular devices offer
efficiency, new systems have the requirement to routing capability to enable mesh network:
be mainly-off so they are synchronized in order Zwave & Zigbee.
to transmit the data in the shortest time for both • ZigbeePro-GreenPower, Enocean, would fit in
nodes, produce the relevant acknowledge (emit- the second category, enabling end-device to
ter get the feedback that the data has been be super-low-power and energy-harvested at
received) and go back to sleep mode as soon as the cost of a mains-powered, always-listening
possible. router device. However, bi-directionality is
There are two ways to maintain the synchro- not granted, preventing low-cost battery-
nization of the network. operated actuators or IPv6 secured link.
• BLE and 802.15.4e and 802.15.4-beaconned
• Either there is enough and regular would be categorized on the third category
communications: the transmitted data or the (row), enabling bidirectional IPv6 and
first-level acknowledge can carry the synchro- secured networks, at the cost of a need for
nization information: this is the case of the synchronization through beacons or regular
802.15.4e: when two nodes communicate data exchanges.
together, the receiver sends to the emitter a • 802.15.4-beaconned could also be categorized
low-level acknowledge embedding the syn- in the fourth category in some specific
chronization data (e.g., timing corrections for conditions (low-latency network), enabling
the next wake-up). This solution requires a energy-harvested routers, as demonstrated in
one-to-one communication: a specific rendez- the GreenNet project.
vous between the two nodes. This is possible
in the 802.15.4e standard in which a specific
time slot and frequency channel is attributed
by the router (or the master node) to each of 2.3.2 Activity Profile and Power
the nodes that needs to communicate Modes
with him.
• Either the network synchronization is One of the goals of an embedded code software
maintained globally thanks to the usage of designer in charge of providing a solution for an
beacons: the master node sends a beacon in autonomous wireless sensor network is to mini-
broadcast, all the surrounding nodes needing to mize the amount of energy spent in the nodes.
2 IoT Nodes: System-Level View 59

Fig. 2.8 (a) 802.15.4-beaconned option versus 802.15.4e; (b) network topology impacting the power (red arrows
mean ultra-low-power protocols, bidirectional arrows mean bidirectional communication)

In the case of a sensor node operating regu- case having the local low-speed oscillator run-
larly, there are few main ways to achieve this ning for the next RF rendezvous, or having
goal: some sensors in low-power mode, able to
wake up the node in the case the environment
• Maximizing the deep-sleep periods of time. In parameter (e.g., temperature, vibration) goes
deep-sleep mode, the application processor is beyond a programmed limit. In this case, the
programmed at the lowest level of power, still node would register this kind of alarm and
maintaining its stack in retention mode send it to the gateway as soon as the next
(or equivalent) in the System RAM and the communication slot is available. So on one
logic glue, and turning off the NVM. In some hand, the amount of time spent in deep-sleep
60 P. Urard and M. Vučinić

mode shall be as long as possible. However, cost of the solution. It depends on the cost of
waking-up from deep-sleep requires some maintenance versus the cost of fabrication. Only
energy, so on the other hand, the number of a detailed cost study taking into account the
time we put the system in deep-sleep has to be forecasted volumes, price per node in each case,
minimized. One unique long period between and the cost of SW maintenance of several
each sensor measurement is an ideal case and applications versus 1 unique application can
answers to both constraints. give you a precise answer.
• When the duration of the sleeping period is not
sufficient (e.g., less than 2 ms in the case of the
GreenNet node) some intermediate stop mode
may be used: reducing the power by a fair 2.4.1 Impact of Power on Cost of IoT
amount, still consuming more than the deep- Node and Concentrator
sleep, but at a lower penalty cost for wake-up.
• Turn the radio off as often and as soon as The most expensive part of a sensor node can
possible. easily become the battery. Having a low-cost
• The sensors have to be turned to their lowest coin battery is a plus as its price in volume is
power mode when in run mode and to be around 1USD (2032 LiMn). However, it is lim-
switched off asap. ited to 200 mAh (announced) and the usable part
would only be around 40mAh (absolute limit) in
In the case of an alarm node able to operate the case you want to preserve the battery from
anytime, like a light-switch, the amount of short life duration.
energy required by the node is directly propor- What would be the maximum activity rate of a
tional to the latency you would accept to wait in node using such a battery? We first consider it is
order to operate the action. Same principle for an acceptable to cycle by 0.5% per day on such a
actuator: the amount of energy consumed to battery which means 1mAh (i.e., 2.6 Coulombs)
enable actuation control is directly linked to the per day. In other words, the mean current con-
actuator latency. E.g.: in the case of a light sumed by such a node would have to be lower
switch, we would consider only the amount of than 41.6 μA if no energy is harvested during
energy to be able to command the switch, not the 24 h. Any energy harvested would enable more
energy consumed by the light. The total latency activity. For example a 6 h harvesting duration
would then be: switch latency to send the com- per day with a 20 cm2 PV cell would enable the
mand + network latency + actuator latency to same system to have a mean current of 55 μA.
receive and execute the order. This would correspond to an ultra-low-power
In the case the alarm node is operating without system operating every 20–30 s, which is not so
any battery but with a capacitor coupled to a bad. If the components are consuming more, or if
pulsed-energy harvester generating 100 uJ or the communication needs to occur more often,
less, then the only way to operate is to have an then the battery may have to be upgraded to a
always-on router, quite-always listening, that larger capacity, from another type, more expen-
will immediately take the message and relay it sive, with sometimes more complex charging
inside the network. protocols.
Regarding indoor routers: in the general case
one doesn’t have any choice: routers currently
2.4 Cost-Related Challenges must have a powerful battery or be mains-
and Design Tradeoffs powered. In few particular use-cases however,
it is possible to specify an energy-harvested
Is it better to design a single node with multiple node that handles the router function. However
programmable sensors you can activate over the due to the poor harvesting capabilities, this kind
air, or it is worse putting only one sensor per of router would need to be placed very close to an
node? The answer has a direct impact on the energy source, like a window.
2 IoT Nodes: System-Level View 61

2.4.2 Impact of Protocol on Cost much longer payload. So there is still room
improvement regarding protocol overhead.
We have seen in the previous chapters that a trend In summary, IPv6 heavily impacts the useful
for interoperability is the use of IPv6 protocol. data rate. But in order to enable interoperability,
However, it defers from the usual “fast” internet as well as implement regularly latest security
world (TCP-IP based) in several points. Figure 2.9 updates from IETF, it is worth implementing an
shows the equivalence between the WSN “slow” IPv6 solution, compared to proprietary ones. In
internet world on one hand (low latencies in a another chapter, we will show through the
WSN are hard to achieve, and are not granted in GreenNet demonstrator how efficient such a net-
energy-constrained networks), and “fast” internet work can be.
world (i.e.,: usual internet). The HTTP protocol is
replaced by CoAP (Constrained Application Pro-
tocol). The main role of CoAP is to reduce the
2.5 Pairing and Security
number and overhead of the exchanged packets.
From end device node point of view, CoAP is also
2.5.1 Pairing, Registration,
interfacing the two internet worlds, protecting the
and Installation of IoT Nodes
WSN from repeated high-speed requests.
In the IPv6 frame in Fig. 2.9, there is a fairly
good reduction of the number of bytes exchanged Pairing is a mandatory phase of a Wireless Sen-
in a CoAP-based case: from 681 to 111, but over a sor Network. The area of pre-paired objects is
short IP packet as it is defined in 802.15.4 quite over. Pairing by the final users enables to
(127 bytes), the payload is not representing more complete an existing network with new nodes,
than 20 to 27 bytes, depending on the options and customize each network with various set of
(long or short addresses, security level, . . .). This nodes, even coming from multiple vendors.
gives a useful bit efficiency of only 15–20%. Technically speaking, pairing enables the node
This efficiency would be better with long to enter into the network by providing the ID of
Internet packet (2047 bytes) like in 802.15.4 g, the network and some network-type specific data
where these long packets are supported. The IPv6 like the discovery RF channel to be able to regis-
overhead would remain the same in absolute ter at each level of the stack, plus some security
value, but would be more acceptable with a parameters if the network is secured.

Fig. 2.9 Protocol suites typically used in traditional Internet, versus IP-based 802.15.4 Wireless Sensor Networks
62 P. Urard and M. Vučinić

A clear trend to ease secured pairing is to use place. A node can always be moved, but regular
NFC as an out-of-band (OOB) channel. moves may lead to frequently rebuild the routing
NFC-forum (http://nfc-forum.org) specifies the tables, which can be energy hungry. This is why
way to enable secured pairing for BLE. The we consider ultra-low-power WSN as quasi-
same principle could be used for 802.15.4 stan- static networks.
dard and its extensions. The principle is to
exchange handover data using NFC. The
negotiated handover protocol, introduced in 2.5.2 Impact of Security on Power
January 2014 with the 1.3 release of the
NFC-forum Connection Handover Specification Security of WSNs typically relies on Level
[http://nfc-forum.org/our-work/specifications- 2 (MAC) and L5 (CoAP over DTLS) that both
and-application-documents/specifications/nfc- provide the four security goals, as seen in Sect.
forum-technical-specifications/], enable to use a 2.2.4. The impact of L2-802.15.4 security on
smartphone to pair securely 2 devices like an IoT power is not as bad as often perceived, if we
node and a Concentrator or a Gateway. consider the nodes already paired. We measured
The registration is a phase coming after the the impact of link-layer security in terms of
pairing, requiring a fair amount of exchanges energy on the GreenNet demonstrator, using a
between a node and the concentrator/gateway. fully autonomous scenario in harsh environment
Registration protocol is usually described in the (e.g., sensor nodes communicating during 31 ms
application profile (e.g., Smart Energy Profile— once every 4min20s). The overall cost of
SEP2.0 specified by Zigbee Alliance). An exam- IEEE 802.15.4 security in our scenario ranges
ple of SEP2.0 registration for a Temperature from 1.94 to 4.18%, depending on the security
sensor is given in Fig. 2.10. level. For energy-harvested platforms, such as
We use the term installation for the physical GreenNet, this result directly corresponds to the
installation of the node in its final working requirement that 1.94–4.18% extra energy needs

Fig. 2.10 SEP registration procedure example


2 IoT Nodes: System-Level View 63

to be harvested from the environment, in respect GreenNet demonstrator. In this network pairing
to the scenario without security (Vučinić et al. and registration are done at high rate: up to
2015). 32 frames per second if the routers enable
L5 end-to-end security impact would be more it. Once the registration is performed, the node
important, but still smaller than the impact of ‘slows down’ to its normal rate (e.g., 1 communi-
using IPv6 transmissions. During pairing phase cation per minute or less).
however, the Level 5 security ensured by DTLS
implies to exchange a common key. Typically L5
messages shall be exchanged between the gate- 2.5.3 Impact of Security, Pairing
way and each node, penalizing the energy of the and Installation on Cost
node, and the energy of all the router nodes on
the way if the registration is done prior The bill of material may be affected by the
installation. solutions chosen for Pairing and Installation
It is advised to define both pairing and regis- protocols. As seen above, the need for security
tration protocols as well as installation use-cases on top of IPv6 may induce to enlarge the battery
early enough, to take their energy impact into capacity and to add some NFC specific chip.
account during the architectural definition phase Some energy-independent low-cost solutions
of the nodes. In the case you intend to use exist to perform NFC pairing (e.g., Dual-
energy-harvested or battery-operated routers, interface I2C-NFC EEPROM). An NFC-specific
we would advise to perform pairing and registra- antenna must also be added on the IoT board.
tion phases close to the gateway, prior to physical There are some electromagnetic rules to comply
installation, to avoid many messages exchanges with, in order to avoid NFC (12.56 MHz) antenna
through the full network. to interact with the node radio antenna (2.4 GHz
This exchange of data can take a fair amount or Subgig). You can refer to (AN2866 Applica-
of time, up to several minutes in the case of very tion Note) to design a 12.56 MHz customized tag
slow, ultra-low-power networks. This could indi- antenna. In the case your IoT node is small
rectly raise some energy issues if the physical enough and isn’t compatible with printed NFC
installation of a given node occurs before the antenna size, some discrete coil antenna
registration is finalized. Effectively, if a node is components are also available from distributors.
paired in a place (e.g., nearby the gateway) and In some case, the most expensive component
installed in another place far from the pairing may finally be the NFC antenna.
place (i.e.,: lower in the network tree, at a ‘dis-
tance’ of several hops), the required exchanges to
update the routing tables in order to follow a 2.6 Battery Lifetime and Examples
“moving node” inside the network would come
on top of the registration exchanges. Some regis- Let define what we call battery lifetime: it is the
tration exchanges may be lost and would have to time a node will operate in its normal/typical
be sent several times, leading to burn a fair mode, without replacing the battery.
amount of the energy stored in the nodes. A first approximation of the battery lifetime
In the case the network is very slow because (in seconds) can be done by dividing the battery
very low-energy (e.g., one active mode per capacity provided by the battery maker
minute), it is possible to fix this issues by (in Coulombs), by the mean power consumption
speeding-up the pairing and registration phases. of the node (in Amperes). However battery capac-
Fast-exchanges can be used to quickly perform ity is usually provided in mAh. 1 mAh means
these phases in a few seconds. Then any move of 1 mA delivered during 1 h, which corresponds
the registered node to its final place inside the to 3.6 Coulombs.
network becomes much more energy friendly. Let take a 2000 mAh “AA” battery. It would
This technique has been developed for the provide 7200C. If the system consumes only
64 P. Urard and M. Vučinić

3.2 uA, then the theoretical battery lifetime higher than the mean current. The actual battery
would be 2.25E9 seconds which corresponds to capacity of your system will depend on the way
about 71 years. This duration does not take into you use this battery:
account battery aging or battery auto-discharge
(that reaches up to 20% per year in the case of • To preserve the battery capacity, it is advised
Alkaline batteries), as well as battery usage to never charge the battery to 100% of its
conditions (temperature, humidity, chemicals, capacity, but keep a 5–10% margin.
non-continuous discharge, peak current). All • To preserve battery life, it is required to never
these parameters affect the battery lifetime. discharge completely the battery. Battery
Long-life non-rechargeable lithium batteries cycling (deep discharge) usually reduces the
have a much better auto-discharge specification battery capacity; however the effective impact
than alkaline ones. However, because of the depends on the battery model. Deep cycles
other aging parameters, it is difficult to predict (60–80% discharge) have heavy impact com-
the exact battery duration. This is why consumer pared to light cycles (10% or less). As the
systems are claiming 10-year autonomy but very harvester is refueling regularly, it is possible
few makers (if any) really guarantee the battery to calculate the daily cycle and choose a bat-
lifetime. tery capacity to maintain in typical conditions
In the case the node has to use an energy this cycle around 0.5% of the ideal capacity.
harvester, choosing the right battery capacity is In this case the impact on aging can be con-
not trivial. We describe hereafter a way to sidered as negligible.
proceed. • When operating in a real environment, every
The battery capacity provided by the battery day is different. For example during the week-
maker is obtained when the battery is charged at end, there may be less light in the offices and
its maximum capacity then totally discharged the harvester may not be able to recharge the
with a constant current value which is specified battery, leading to a weekly cycle, deeper than
for each kind of battery. This current is usually the daily one see (see Fig. 2.11). You should
much smaller than the peak current of a node, but estimate this weekly cycle and make sure it

Fig. 2.11 Energy-Harvested GreenNet node rechargeable battery voltage


2 IoT Nodes: System-Level View 65

represents less than 2% of the ideal capacity. broadcast transmissions and network mainte-
In some case you may enlarge the battery size. nance to its minimal activity
• Every week is different, and every month is • Tradeoff deep-sleep to save more charges
different. Using the same way, you should than the amount of charge you loose when
estimate monthly cycles be less than 15% you discharge external capacitors. In some
and yearly cycles to be less sufficient to target cases, some higher leakage sleep mode need-
a 10 year life. Values differ for each kind of ing less energy to recover, can be more profit-
battery, but a 40% cycle can be considered in able for the energy efficiency
many cases as a sufficient value. • Tradeoff higher-level software. Usage of
• Tradeoff network loss and recovery algorithm DTLS + CoAP seems a good compromise to
to be lower than a monthly cycle in terms of ensure an IPv6 network with fairly good
battery discharge energy efficiency.
• Take pairing and installation into account as a • Checkout the wires between non-powered
major cycle debug-purpose on board material. Some per-
nicious leakage could be hidden there.
• When using multiple sensors, check that
sensors are really off even if several sensors
were running asynchronously.
2.7 Global System Power
• In some case, depending on your node archi-
Optimization
tecture, grouping sensors wake-up and radio
wake-up enables to reduce wake-up penalty.
The power optimization is of tremendous impor-
tance when designing an Ultra-Low-Power node
Many of those points concern software impact
and it is very often a never-ending process. We
on hardware. Software impact is effectively huge
saw all along the chapter some critical points to
in term of power efficiency. Ultra-low-energy
keep high the energy efficiency. Here are few
efficient software programming is not an easy
additional points to check out, in order to still
task, and is a new domain for many software
be in your power budget:
engineers.
Figure 2.12 gives an overview of the commer-
• 15 to 30 nA over-leakage on each of the GPIO
cially available nodes in November 2015, and
seems not so much, but given the number of
provides a comparison with the results with the
GPIOs, it may lead to overpass your power
GreenNet development. Node names depicted in
budget. Programming GPIOs correctly in the
italic are not commercially available.
lowest possible energy state for long sleep is
not always simple, but very often profitable.
• Tradeoff quartz capacitances and precision to
avoid wake-up in advance. You need to com- 2.8 Perspectives and Trends
pensate the local oscillator error of both the
emitter and the receiver by doubling the max The current generations of IoT nodes on the
possible error. The longer the sleep between market are mainly AAA battery operated, con-
two synchronizations, the larger the error. suming 50–100 mW of power to transmit a mes-
• In the case of synchronized networks, a large sage at 0 dBm. Most of them are operating thanks
temperature change may lead to the wake-up to always-on routers. Few of them are IPv6 com-
of the node for synchronization. Regarding pliant, few of them offer a secured solution, few
energy efficiency in such networks, it is better other are energy-harvested but none are able to
to wake-up a node to keep synchronization provide all those features at the same time. In the
than to recover a loss of synchronization. future, the growing multiplicity of sensors,
• Minimize routing activity: routing protocols including image sensors, as well as the needs
are very hungry in terms of transactions, limit for lower cost, interoperable and secured
66 P. Urard and M. Vučinić

CPU Tx Batt.
MCU RAM CPU ON Rx Har-
Node sleep 0dBm Size/type
#bits [kB] [mA/MHz] [mA] vested
[µA] [mA] [mAh]
25
GreenNet 32 32 0.185 0.44 4.9 4.5 Y
LiMn
Hikob Azure 2000
32 16 0.180 0.6 12.8 11.8 Y
[H14]
SmartMeshIP
32 72 0.176 0.8 5.4 4.5 OPT 2AA
[WDS13]
M3OpenNode 650
32 64 1.138 25 11.6 10.3 N
[FIT15] LiPo
OpenMote
32 32 0.438 0.4 24 20 N 2AAA
[O15]
WisMote
16 16 0.312 1.69 25.8 18.5 N 2AAA
[W15]
TelosB
16 10 1.8 5.1 19.5 21.8 N 2AA
[M04]
Waspmote15.4
8 8 1.07 7.2 45 50 OPT N/A
[WD15]
MICAz
8 4 1.0 <15 17.4 19.7 N 2AA
[C08]

Fig. 2.12 GreenNet versus commercially available nodes (Varga et al. 2015)

solutions seem to be paving the way towards the MCU and digital computing part of the radio
use of secured-IPv6 networking solutions. Easier able to process 5–10 times more data keeping
to maintain than proprietary solutions, IETF the same power budget. The new generation
compliant, enabling complete interoperability, of MCUs, using CMOS 40 nm Ultra-Low-
these solutions seem to be the definitive trend Power with embedded non-volatile memories
of the WSN. technology should enable to offer 90 nm-like
There are mainly three limitations to adopt retention current with a dynamic power reduc-
secured IPv6: first, the need for interoperability tion from x2 to x5 on the digital part
was not strong enough during years in the WSN depending on the needed frequency, thanks
domain. This has evolved thanks to the Thread to DVFS techniques. This technology will
initiative. The other two reasons are: definitely help to fix the power consumption
issue, on one hand by keeping power con-
• In some case the cost increase due to addi- sumption at a fair level, compatible with
tional need for more SRAM in the processor low-cost batteries, and on the other hand by
to store and process IPv6 frames. enabling much more computing capability
• In some other case the power consumption than the previous technology nodes for the
reached with the current solutions (MCU + same power budget. One step further will be
radio + IPv6 secured protocols) which leads reached few years after thanks to 28FDSOI-
to either decrease the lifetime of the battery, ULP technology, achieving to save an addi-
or increase the battery capacity, impacting the tional x2 to x10 dynamic power versus
price of the global solution, preventing the CMOS40 (depending on the targeted fre-
usage of energy harvesting as energy source. quency), keeping the leakage at a fair level.
In order to use energy harvesting with a plain Once again, cost shall be the limiting factor
802.15.4 secured IPv6, we would need an for adoption and only the needs for additional
2 IoT Nodes: System-Level View 67

features like extended signal processing capa- Among the existing solutions, you can find Lora,
bility, extended memory storage or even SigFox and Weightless solutions. All of them
graphical display at node level may justify operate in the Subgig ISM bands, with reduced
the need for adoption. data rates (100bit/s to 100 kb/s) compared to
2.4 GHz technologies. Business model should
At node level, we start to see NFC-pairing be different in this case, as it requires the usage
adopted by a fair amount of the solutions of an existing infrastructure usually managed by
reaching the market in 2016. Using NFC for operators, enabling some pay-per-use options.
pairing and registration should become a must Those solutions will certainly survive in parallel
in a near future. to WSN-based solutions, targeting different
Among the future evolutions of the radio pro- needs.
tocol solutions, two major opportunities are
foreseen at the time:
References
• Bluetooth-low-energy is one of them and
seems evident when targeting Personal Area AN2866 Application Note, How to design a 12.56 MHz
Network communication. BLE should move customized tag antenna. http://www.st.com/st-web-ui/
short term to IPv6 however its poor network- static/active/cn/resource/technical/document/applica
tion_note/CD00221490.pdf
ing capability and its relatively short distance
I.M. Atakli, H. Hu, Y. Chen, W.S. Ku, Z. Su, Malicious
of communication may not enable to replace node detection in wireless sensor networks using
WSN-oriented radios (802.15.4 family). In weighted trust evaluation, in Proceedings of the 2008
order to solve these issues, BLE will eventu- Spring Simulation Multiconference, SpringSim’08
(Society for Computer Simulation International, San
ally offer real mesh network capability and
Diego, 2008), pp. 836–843
even some longer transmission range option. B. Awerbuch, R. Curtmola, D. Holmer, C. Nita-Rotaru,
The success of these options will depend on H. Rubens, Mitigating byzantine attacks in ad hoc
how it compares to alternatives such as wireless networks. Department of Computer Science,
Johns Hopkins University, Technical Report, Version,
802.15.4 family.
1 (2004)
• The 802.15.4e evolution for industrial A. Becher, Z. Benenson, M. Dornseif, Tampering with
applications, part of the 802.15.4-2015 is motes: real-world physical attacks on wireless sensor
enabling the IPv6 frequency hopping multi- networks, in Proceedings of the Third International
Conference on Security in Pervasive Computing,
channel mesh network in 2.4GHz networks.
SPC’06 (Springer-Verlag, Berlin, 2006), pp. 104–118
This should increase quality of service in H. Chan, A. Perrig, D. Song, Random key predistribution
crowded environment. schemes for sensor networks, in Proceedings of 2003
Symposium on Security and Privacy 2002 (May 2002),
pp. 197–213
Both those standards may survive together,
D. Dolev, A.C. Yao, On the security of public key
and we may see a generalization of the radio protocols. IEEE Trans. Inform. Theory 29(2),
combos (e.g., BLE + 802.15.4 in 2.4GHz) each 198–208 (1982)
solution enabling to target a different network HIKOB, http://www.hikob.com/wp-content/uploads/
2015/06/HIKOB_AZURE_LION_ProductSheet_EN.
depending on the required service. As an exam-
pdf
ple: a node could use BLE to transmit some C. Karlof, D. Wagner, Secure routing in wireless sensor
information like an image to a smartphone networks: attacks and countermeasures. Ad hoc Netw.
when it is in the range, but would use 802.15.4 1(2), 293–315 (2002)
Y.W. Law, L. van Hoesel, J. Doumen, P. Hartel,
or 4e for infrastructure management. Multi-radio
P. Havinga, Energy-efficient link-layer jamming
nodes sharing the same antenna for cost reason attacks against wireless sensor network mac protocols,
shall become the standard in a near future. in Proceedings of the 3rd ACM Workshop on Security
As an alternative to Mesh WSN enabling to of Ad Hoc and Sensor Networks, SASN’05 (ACM,
New York, 2005), pp. 76–88
repeat the messages from nodes to nodes to cover
M. Li, I. Koutsopoulos, R. Poovendran, Optimal jamming
long distances, star topology networks could be attacks and network defense policies in wireless sen-
used in so-called Wide Area Network (WAN). sor networks, in 26th IEEE International Conference
68 P. Urard and M. Vučinić

on Computer Communications, INFOCOM 2007 L. Damon, R. Guizzetti, L Andre, A. Cathelin, A


(IEEE, May 2007), pp. 1307–1315 self-powered IPv6 bidirectional wireless sensor &
F. Liu, X. Cheng, D. Chen, Insider attacker detection in actuator network for indoor conditions, in 2015 Sym-
wireless sensor networks, in 26th IEEE International posium on VLSI Circuits (VLSI Circuits) (IEEE, June
Conference on Computer Communications, 2015), pp. C100–C101
INFOCOM 2007 (IEEE, May 2007), pp. 1937–1945 L.-O. Varga, G. Romaniello, M. Vucinic, M. Favre,
M3 Open Node motes, https://www.iot-lab.info/hard A. Banciu, R. Guizzetti, C. Planat et al., GreenNet:
ware/m3/ an energy-harvesting IP-enabled wireless sensor net-
MICAz mote, http://www.openautomation.net/upload work. IEEE Internet Things J 2(5), 412–426 (2015a)
sproductos/micaz_datasheet.pdf L.-O. Varga, Multi-hop energy harvesting wireless sensor
OpenMote, http://www.openmote.com/hardware/open networks: routing and low duty-cycle link layer. PhD
mote-cc2538-en.html thesis, Grenoble Alps University, December 2015
D.R. Raymond, S.F. Midkiff, Denial-of-service in wire- A. Vempaty, O. Ozdemir, K. Agrawal, H. Chen,
less sensor networks: attacks and defenses. IEEE Per- P.K. Varshney, Localization in wireless sensor
vasive Comput. 7(1), 74–81 (2008) networks: byzantines and mitigation techniques.
M. Rezvani, A. Ignjatovic, E. Bertino, S. Jha, Secure data IEEE Trans. Signal Process. 61(6), 1495–1508 (2012)
aggregation technique for wireless sensor networks in M. Vučinić, Architectures and protocols for secure and
the presence of collusion attacks. IEEE Trans. energy-efficient integration of wireless sensor
Dependable Secure Comput. 12(1), 98–110 (2015) networks with the internet of things. PhD thesis,
G. Romaniello, Energy efficient protocols for harvested Grenoble Alps University, November 2015
wireless sensor networks. PhD thesis, Universite de Waspmote, http://www.libelium.com/downloads/docu
Grenoble, March 2015 mentation/waspmote_datasheet.pdf and http://www.
TelosB motes, http://www4.ncsu.edu/~kkolla/CSC714/ digi.com/pdf/ds_xbeemultipointmodules.pdf
datasheet.pdf T. Watteyne, L. Doherty, J. Simon, K. Pister, Technical
H. Tschofenig, T. Fossati, TLS/DTLS Profiles for the overview of SmartMesh IP. in Proceedings of IMIS’13
Internet of Things. draft-ietf-dice-profile-14. Work in (Washington, DC, 2012)
progress, August 2015 WisMote, http://wismote.org
P. Urard, G. Romaniello, A. Banciu, J.C. Grasset,
V. Heinrich, M. Boulemnakher, F. Todeschni,
Ultra-Low-Power Digital Architectures
for the Internet of Things 3
Davide Rossi, Igor Loi, Antonio Pullini, and Luca Benini

This chapter introduces the architectures part is implemented by a sensing subsystem,


implementing the digital processing platforms realized with low-power sensors such as visual
and control for Internet of things applications. It imagers, microphone arrays, Micro Electro-
will provide a review of the state of the art Ultra- Mechanical Systems (MEMS), or bioelectrical
Low-Power (ULP) micro-controllers architec- sensors. Sensed data is then digitally processed
ture, highlighting the main challenges and with low-power microcontrollers (MCUs), and
perspectives, and introducing the potential of transmitted through a wireless communication
exploiting parallelism in this field currently subsystem consisting of low-energy TX/RX
dominated by single issue processors. radio transceivers implementing low-energy
stacks. While these three subsystem are some-
times split over more than one chip, the market is
3.1 Definitions and Motivations trending toward fully-integrated single-chip
solution. The entire IoT node is powered by
The last years have seen an explosive growth of harvesters or small form factor batteries (power
small, battery powered devices that sense the supply/conversion subsystem). Despite of an
environment and communicate wirelessly the almost 10x reduction of the transceiver power
sensed data after some data processing, recogni- in just a few years, its share in the overall
tion, or classification. Collectively referred to as power budget of most wireless sensor nodes and
the Internet of Things (IoT), all these devices wearables remains dominant (De Groot 2015). In
share the need for extreme energy efficiency this scenario, a high-potential approach to reduce
and power envelopes of few milliwatts. From a the system energy is to increase the complexity
system level perspective, in architectures of near-sensor data analysis and filtering by
targeting such applications, the data acquisition providing more computational power to the
processing sub-system. This approach can dra-
matically reduce the amount of wireless data
D. Rossi (*) • I. Loi transmitted, that could be reduced to a class, a
University of Bologna, Bologna, Italy
signature, or even just a simple event. For this
e-mail: davide.rossi@unibo.it
reason, the availability of powerful, flexible and
A. Pullini
energy-efficient digital processing hardware in
ETH, Zurich, Switzerland
close proximity to sensors plays a key role in
L. Benini
the internet of things revolution. The aim of this
University of Bologna, Bologna, Italy
chapter is to review the state of the art of
ETH, Zurich, Switzerland

# Springer International Publishing AG 2017 69


M. Alioto (ed.), Enabling the Internet of Things, DOI 10.1007/978-3-319-51482-6_3
70 D. Rossi et al.

ultra-low power computing platforms for near- their function is limited to the specific purpose
sensor processing, and highlight the main for which they are designed. Also, their perfor-
challenges and perspectives related to these mance is not scalable as they are designed with a
architectures. specific use case in mind. While adoption of
Most of low-power commercial microcon- dedicated circuits is attractive to tackle the strin-
trollers cannot provide the required performance gent constraints of implantable applications for
levels for several applications within the power health monitoring (tens of microwatts), these two
budgets offered by small form factor coin important aspects limit the exploitation of dedi-
batteries and energy harvesters. A promising cated circuits for the majority of IoT applications
approach to achieve up to one order of magnitude since the Non Recurrent Engineering (NRE)
of improvement in energy efficiency of costs required for their development cannot be
integrated circuits with respect to “business as amortized over large volume products.
usual” CMOS in strong inversion is ultra-low When the targeted algorithms are more
voltage, near-threshold computing. The key generic, so that can they be re-utilized for more
idea is to lower the supply voltage of chips to a than a single application, the above described
value only slightly higher than the threshold restrictions can be relaxed by providing some
voltage. Aggressive voltage scaling has been run-time configurability to the integrated
extensively analyzed in the literature, including circuits, and by increasing the operating range
its limitations and disadvantages (Dreslinski via voltage and frequency scaling. As this
et al. 2010). One of the main issues with approach has mainly been applied to the signal
low-voltage operation is performance degrada- processing field, this class of computing devices
tion, which can limit the degree of use of is usually referred as Application (or domain)
voltage-scaling for a given processing Specific Signal Processors (ASSPs). Several
requirement. examples of this class of devices apply to visual
A commonly adopted approach to overcome sensors, where several basic functions
the performance loss in ultra-low voltage devices implemented with dedicated accelerators or
on hardwired functions implemented in Applica- specialized processors can be shared among dif-
tion Specific Integrated Circuits (ASICs). By ferent applications (Park et al. 2013; Hsu et al.
exploiting dedicated circuits, digital processing 2012; Jeon et al. 2013).
systems are able to match the performance The ultimate step toward flexibility is given
requirements of applications, even at a very low by the exploitation of the software
operating voltage, with frequencies of tens to programmability of instruction processors. In
hundreds of KHz and power consumptions in last few years some instruction processors work-
the range of few μW to hundreds μW. In some ing in the near-threshold or sub-threshold
cases, these dedicated systems, implemented as operating region have been presented from both
System-On-Chip (SoC) or System-In-Package industry (Ambiq 2015) and research (Bol et al.
(SiP), integrate digital signal processing circuits, 2013). Some of the proposed devices are also
analog front end, analog signal processing able to work on a wide range of operating points,
circuits, and power supply circuits (batteries, as the low voltage operation might not be always
harvester or both) leading to extremely compact suitable to match the requirements of the appli-
form factors. cation targets (Gammie et al. 2011). Again, when
The dedicated ASIC approach has been exten- operating at low voltage, sequential instruction
sively used in traditional fields of applications of execution coupled with very slow operating fre-
ultra-low power devices, such as wearable or quency may lead to insufficient performance for
implantable sensors for health monitoring the application requirements. Explored solutions
(Zhang et al. 2012; Yoo et al. 2012; Yakovlev to improve performance of ultra-low power
et al. 2012). Although these devices minimize processors while maintain a high degree of
power consumption, they are not flexible as flexibility also rely on the exploitation of
3 Ultra-Low-Power Digital Architectures for the Internet of Things 71

software parallelism. The exploitation of multi 3.2 Ultra-Low-Power


core platforms can provide benefits with respect Microcontroller Architectures
to single processor cores for high application
workloads as it has been demonstrated that par- The applications targeted by current off-the-shelf
allel computing at low voltage can be more microcontroller (MCUs) architectures require to
energy efficient than sequential computing at a periodically fetch environmental information
higher voltage under certain assumptions (Dogan from a wide variety of sensors, which is then
et al. 2011). This concept has been extensively transmitted via wireless through low-power
exploited for high-end embedded applications, antennas after a limited amount of processing.
where multi-core architecture has become the In this scenario, to maximize the power effi-
de-facto standard. On the other hand, when the ciency of the system, the data processing hard-
application workload is low, or when the ware should be active only for the small amount
workloads are not easily parallelizable, the of time required to read, process and transmit the
introduced multi-core platforms suffer from information, while it is idle for the rest of the
energy efficiency losses with respect to single time. It is usual to refer to this mechanism as duty
core platforms, mainly caused by static (primar- cycling. Figure 3.1 shows the typical power con-
ily leakage-induced) power consumption, due to sumption pattern of a microcontroller employed
the larger area and architectural overheads. for a generic IoT application. For most of the
Hence, while multi-core platforms are starting time, the MCU is in a deep sleep state where it
to be seen with interest in the world of ultra- is waiting for an event to wake up, triggered by
low-power applications, a great effort still needs an internal timer or by an external event
to be done to target the applications driving the generated by one of the sensors. Once the event
IoT domain. is captured the device restores the power supply
The rest of the chapter is organized as follow: and restarts clocks (wake-up), restores the state
Section 3.1 describes the architecture of off-the- of the CPU (stack, data etc.) and then can start
shelf microcontrollers typically employed in IoT fetching new data from I/Os, process and trans-
applications. Section 3.2 describes the main mit it. Once done with the processing, it can save
challenges of IoT computing platforms, mainly the state and go back to sleep.
related to the exploitation of performance scal- Today microcontrollers (MCUs) feature sev-
ability exploiting parallel processing. Section 3.3 eral power modes to deal with different applica-
provides an example of research Parallel Ultra- tion scenarios where the various states may have
Low-Power computing platform (PULP). a very different duration, duty cycles and perfor-
Finally, Sect. 3.4 provides some concluding mance requirements. For example if an applica-
remarks, and highlights challenges and tion has to deal with frequent wake-ups, paying
perspectives. at each wake-up the price of saving and restoring
POWER

STATE RESTORE

DATA
STATE SAVE

PROCESSING
WAKE

I/O
DEEP SLEEP DEEP SLEEP

TIME
Fig. 3.1 Typical power profile of IoT applications
72 D. Rossi et al.

its full state is highly inefficient and a power any overhead at circuit level. Wakeup time after
mode with higher off current but with state reten- an idle state is dependent only on the status of the
tion in this case is much more valuable. clock generator. If the clock generator is kept on,
the wakeup time is negligible; but the clock
generator has also been stopped for more aggres-
3.2.1 Power Management sive power savings. The wake-up time is strongly
impacted by the type of clock generator. In
Having an efficient off and idle states is a key today’s microcontrollers there are different
requirements for micro controllers. In many IoT types of clock generation unit to deal with the
applications sleep-mode current is the biggest different operating corners that always guarantee
contributor to the overall power consumption. the maximum efficiency. A good example of
When considering sleep-mode current we should state of the art clocking offer in microcontrollers
consider also the current required to reactivate is the latest STM32L4 family. In the micro of the
the circuit. Wakeup currents have a big impact on L4 family there are five available clock sources
the choice of the sleep mode. There are many (see Fig. 3.2). There are two low speed
different techniques to reduce the power con- generators that generate a 32 kHz frequency:
sumption when idle (Sect. 10). one oscillator using an external quartz (LSE)
The simplest approach is clock gating, which and another one using an internal RC oscillator
only implies stopping the clock. In this mode (LSI), which is more power-efficient but less
dynamic power is cut and only leakage current precise. The same is done for the faster 48 MHz
is flowing. This mode is state retentive and clock, where there is an oscillator using an exter-
requires no time to go back to active if the nal clock (HSE) and an internal configurable RC
clock generator is still running. Clock gating oscillator (MSI). The fifth clock source is a fixed
usually has a very fine grained granularity, 16 MHz internal RC oscillator. The L4 has also
mostly all blocks in the system can be put in three PLLs capable of multiplying the frequency
idle mode independently since it does not have and reaching up to 180 MHz (Microelectronics

Fig. 3.2 Example of state of the art clock distribution


3 Ultra-Low-Power Digital Architectures for the Internet of Things 73

2016). Sleep mode turns the clocks to the core designer to analyze the overall wake-up and
off, but the user has the option to leave on the settling time for both the digital and analog cir-
peripherals’ clocks. The power in this mode is cuitry to factor in the true cost of this wasted
not just leakage but dynamic current of the energy. State of the art devices as the Ambiq
peripherals that are left on. In this mode data Apollo have deep sleep currents as low as
can be still received, and the core retains its 100 nA with RTC on (Ambiq 2015).
state and continues operation when required.
All clocks to the digital logic are turned off and
the analog sub-systems can be controlled to have 3.2.2 IO Architecture
flexible wake up times depending on the applica-
tion requirements. The lowest power mode is Peripheral subsystems in microcontrollers
when all the analog clocking elements are turned include an extensive set of peripherals needed
off. Wake up time is determined by the selection to connect to the wide variety of sensors avail-
of the wake up clock source. The fastest time is able on the market. Although MCUs traditionally
from the low-power oscillator and the slowest feature low bandwidth interfaces like UART,
time is from the crystal oscillator and the PLL. I2C, I2S or standard SPI, they lately include
To reduce leakage in idle mode, clock gating also higher bandwidth peripherals like USB or
is used in conjunction with voltage scaling. camera and display interfaces. Low bandwidth
When the clock is stopped we can afford to peripherals are usually attached to a shared bus
lower the voltage as low as the retention voltage and high bandwidth peripherals are usually
for the memory elements. The cost of this at connected to the system bus. Traditionally, the
circuit level is the required flexibility in the I/O was constantly supervised by the CPU and
power supply. The effect on the application is the CPU was responsible of handling peripherals
that we require more time to exit the idle mode events and regulate data transfer to/from periph-
due to the settling time of the power supply. eral and memory. Recent MCUs have increased
Further reduction can be obtained by not only the complexity of the I/O subsystem to support
lowering the voltage but also applying a reverse more power modes to be able to selectively turn
body biasing to increase the threshold voltage of on peripherals only when needed (Figure 3.3).
the devices (Rossi et al. 2016a). Further reduction in power can be obtained by
The lowest leakage is obtainable only with increasing the intelligence of the peripherals and
power gating, where the power supply is turned have they run without CPU supervision while the
off. All microcontrollers available today feature CPU is in sleep mode. The “smart peripheral”
a deep-sleep mode where the device power sup- approach is used more and more in the most
ply is disconnected and only a small always-on advanced MCUs. It has many variants with dif-
part of the chip controlling the wake up is kept ferent names, but the general idea is the same.
alive. Always-on domain usually have an RTC to For example ATMEL “SleepWalking” features
have the possibility to wake up the system after a in the latest SAM-L2 family enable events to
certain amount of time and have a set of memory wake up peripherals and put them back to idle
elements (flops or SRAMs) to keep track of the when data transfer is done without any CPU
previous state. The amount of features active on intervention for instance using a DMA (Atmel
the always-on domain is usually selectable by the 2015). Another example is the Renesas’ RL78
user to give maximum flexibility depending on which has a “snooze mode” where the analog-to-
the application requirements. The startup behav- digital converter (ADC) operates while the pro-
ior of the analog modules can have a major cessor is asleep. An I2C slave or CAN controller
impact on the amount of time spent in active can watch for an address before it captures
mode; voltage regulators or references utilizing incoming data and then wakes up the CPU
external decoupling caps can take milliseconds (Renesas 2014). Cypress Semiconductor’s
to settle. Therefore, it is important for a systems PSoC flexible family was one of the pioneers
74 D. Rossi et al.

with their configurable digital and analog 3.2.3 Data Processing


peripherals. Part of the configuration is the abil-
ity to link peripherals together without the CPU Reduction of energy consumption for the data
being involved. The company’s latest PSoC processing part should be improved under differ-
4 BLE (Bluetooth Low Energy) can operate the ent aspects, by improving the micro-architecture
BLE radio while the CPU is idle (Cypress et al. of the CPU increasing the amount of data that can
2016). Microchip’s Configurable Logic Cell be processed per cycle, or by optimizing the
(CLC) highlights the more conventional circuit design to reduce the power consumed
approach to configurable peripherals. A micro- per cycle.
controller may have one or more CLC blocks. The Architecture of CPUs used in MCUs has
Each block has a selectable set of inputs and rapidly evolved in the last years, moving quickly
outputs with limited logic capability so the out- from the 8 bit architectures to the now wide-
put of one peripheral can be fed to another. As spread 32 bit ARM Cortex-M architectures. The
with the PSoC, linked peripherals can often han- architectural evolutions are driven by the con-
dle simple algorithms faster than the CPU. ST stantly increasing demand for performance for
Microelectronics’ STM32 has “autonomous the always more complex applications. The vast
peripherals” that use a “peripheral interconnect majority of CPUs used in microcontrollers are
matrix” to link peripherals together. Timers, single issue machine in which instructions are
DMAs, ADCs, and DACs can be linked together. executed in order. The only exception today is
Or with their latest “Analog Chain” where the Cortex-M7 recently released by ARM which
comparators, references, DACs and ADCs can has a dual-issue pipeline. The choice of simple
be interconnected to generate complex triggers single issue micro-architecture is mainly dictated
(Microelectronics 2016). The trend towards more by energy efficiency for the target performance.
functionality and complexity is highlighted by We are not yet in the performance range to
Microchip’s PIC16F18877 family. It has a justify multiple-issue superscalar architectures.
10-ADC tied to a computational unit that can Engineers focused more on optimization to the
do accumulation and averaging. It can even do micro architecture to improve as much as possi-
low-pass filter calculations in hardware. The ble the IPC (instruction per cycle) and the data
CPU can sleep until a filtered result exceeds level parallelism.
programmed limits (Microchip 2016).

Data moving and check for trigger Data Processing


conditions
Traditional IO
CPU Subsystem
CPU CPU
IOs IOs IOs

Data movement and trigger conditions Data Processing


done by dedicated hardware
New «Autonomous»
CPU IO Subsystem
POWER

IOs HW IOs HW IOs

TIME

Fig. 3.3 Power profile of new “Autonomous” I/O subsystem


3 Ultra-Low-Power Digital Architectures for the Internet of Things 75

The most common improvements include for common to have the possibility to process 2 half-
example hardware loops to speed up loops on words or 4 chars with a single 32bit instruction.
tiny kernels typical of most DSP applications. It Good examples are the DSP extensions
is usually done with one or more instructions that introduced in the Cortex-M4 (ARM 2010).
setup a dedicated logic which controls the num- Architectural optimization to support single
ber of interactions of a loop and the fetch stage, cycle multiplications and dedicated instructions
with the benefit of reducing the amount of cycles for MACs, as well as hardware support for satu-
needed to compare and jump back at the end of a ration arithmetic to better handle fixed point
loop. The benefit is huge when small kernels are arithmetic, are other examples of how CPUs are
considered (Gautschi et al. 2015). extended to increase energy efficiency during the
Other non DSP-specific improvements are execution of DSP algorithms.
loads and stores with pre- and post-increment Those were the optimizations for the data
whose main usage is to reduce the number of and control logic but with the introduction of
instructions needed to access consecutive mem- the ARM cores we also saw improvements on
ory locations. the instruction side. The ARM7TDMI ARM
More on the DSP optimizations we have for introduced the THUMB instructions, which
example the use of SIMD (Single instruction allows some instructions to be coded using
Multiple Data) where a single arithmetic instruc- 16-bit instead of 32-bit, reducing the pressure
tion can operate on multiple data. On high-end on the instruction memory, the overall code
CPUs this is often coupled with wider access to size and instruction memory requirements.
memory to handle multiple data at the data path Table 3.1 summarizes the main features of few
width. On microcontrollers it is much more com- MCUs widely used for ultra-low-power
mon to use their less power hungry version based applications, while Figure 3.4 shows their typical
on data size lower than the data path width. It is architecture.

Table 3.1 Summary of commercial low power MCUs


EMF32
Kinetis Pearl
MCU 16 F1503 MSP430FR6x SAML21x KL17 Gecko Apollo STM32F745xx
Instruction set PIC MSP430 ARM ARM ARM ARM ARM Cortex
architecture Cortex M0 Cortex M0 Cortex Cortex M7
+ + M4 M4
Datapath 8-bit 16-bit 32-bit 32-bit 32- 32- 32-bit + FPU
bit + FPU bit + FPU (dual issue)
Memory I: 2 bB 128 kB 256 kB 256 kB 256 kB 512 kB 1 MB FLASH
SRAM FRAM FLASH FLASH FLASH FLASH
D: 128 B 64 kB 32 kB 32 kB 64 kB 340 kB SRAM
SRAM SRAM SRAM SRAM SRAM
Max Freq. 20 16 48 48 40 24 216
(MHz)
Current (μA/ 30 100 35 54 60 34 700
MHz)
Deep sleep 0.02 0.02 0.2 0.28 0.02 0.12 2
current (μA)
State-retentive NA 0.02 1.3 1.96 1.4 0.193 2.75
current (μA)
Retentive deep No Yes No No No No No
sleep
76 D. Rossi et al.

Fig. 3.4 Typical ICache

Flash Ctrl.
architecture of off-the-shelf DCache
MCUs for IoT CPU FLASH
Ld/St

SRAM

BUS MATRIX
DMA

High
Performance
Peripherals

Low BW
I/O Subsystem
and peripherals
UART
HP Interconnect SPI
LP Interconnect UART
System Masters
Periphs and I/O
Timers
Storage

3.2.4 Non-volatile Memories and one capacitor and they share with DRAM the
high speed. The capacitor is implemented by
As briefly mentioned, before energy spent to save using a ferroelectric material (usually PZT lead-
and restore the state before and after a deep sleep zirconate-titanate) which is polarized in two pos-
state has a deep impact on how often and for how sible states that are kept without requiring any
long a deep sleep state could be used. Such cost refresh. Other types of memory currently under
depends almost entirely on the type of memory study and close to commercial applications are
used for the storage. Embedded flash memories MRAM, STT-MRAM, and PCRAM. Magnetic
are the most common type of non-volatile mem- RAM (MRAM) are based on memory cells
ory used but, due to their high write energy and constituted of two magnetic storage elements,
low endurance, they are not suitable to be used one with fixed polarity and one with switchable
for state retention during frequent power cycles polarity. Depending on the state of the switch-
(Chap. 6). In today microcontrollers, state reten- able element the resistance of the overall cell
tion is implemented by keeping active part of the changes and that is what is sensed by the read
circuit, with the consequent reduction of the circuitry. STT-MRAM is a particular implemen-
effectiveness of the deep sleep state. tation of MRAMs more suitable for scaled tech-
Recent developments in non-volatile nology. It uses spin-polarized currents enabling
memories show many possible solution to this smaller and less power demanding bit cells. Cur-
problem. Texas Instruments has recently rently, STT-RAM is being developed in various
included in their MSP430 family some companies including Everspin, Grandis, Hynix,
microcontrollers based on FeRAM (Texas IBM, Samsung, TDK, and Toshiba. Due to its
2011). FeRAM have a structure very similar to easy integration in the CMOS process as well as
DRAM where a bit cell is made of one transistor its performance, power and scalability properties,
3 Ultra-Low-Power Digital Architectures for the Internet of Things 77

it is one of the most appealing solutions. performance target, compensation of PVT


Chapter 7 will present a detailed explanation. variations has to be achieved in a transparent
Another example of NVM currently investigated way with respect to the end user, which cannot
by the industry is the Phase Change RAM be aware of the operating conditions of the
(PCRAM) in which the bit cell is made of device.
materials that can exist in two phases (e.g., crys-
talline and amorphous) and result in different
resistance. A more detailed description of NVM 3.2.6 Compensation of Process
memories in the context of IoT architectures is and Environmental Variations
presented in Chaps. 6 and 7. in Near-Threshold

The variability of ultra-low-power devices


3.2.5 A Step Forward: operating at low voltage has been extensively
Near-Threshold MCU analyzed in the last few years from the research
Architectures community (Alioto 2012). Some of the
approaches leverage design-time techniques to
Today’s low-power MCU architectures operate improve resiliency of the circuits with respect to
in the super-threshold domain during active process and temperature variations, including
phases of computation, relying on duty cycling standard-cell design, clock tree optimization,
and heavily optimized deep sleep modes to and automatic synthesis. All these techniques
improve energy consumption. For example, are extensively analyzed in Chap. 4. On top of
some commercial devices, such as Ambiq Apollo design level techniques to mitigate the impact of
(Ambiq 2015) provides a sub-threshold RTC to variations, their compensation is usually
minimize deep-sleep power (~300 nW), but it addressed at the architectural level, integrating
operates at 0.9 V, which is only 100 mV below mechanisms able to probe the PVT conditions
the nominal operating voltage for the 90 nm pro- of the circuit and aging, and provide a feedback
cess technology utilized for its implementation, to knobs exposed at system level that allow to
providing active power efficiency similar to other compensate the variations by adjusting the sup-
commercial MCUs. Although this approach ply voltage of the circuit.
provides a very low power for applications Widely explored approaches to implement the
requiring low computational workloads (e.g., a probing circuits to adapt the supply voltage are
temperature sensor wakes up the microcontroller process monitoring blocks (PMBs), canary
once per minute to transmit sensed data) the circuits, or razor flip-flops (Ernst et al. 2005).
situation drastically changes when the appli- PMBs are generic structures implementing ring
cations require nearly always-active operation. oscillators with different characteristics (e.g.,
In this scenario, the energy consumption of a PMOS only, NMOS only, inverters with long
microcontroller can increase by up to three orders wires). Depending on the specific oscillator
of magnitude (i.e., from tens of μW to tens of probed from a PMB it is possible to acquire
mW, on average). Near threshold computing is specific information about the process condition
emerging as a promising approach to achieve of PMOS and NMOS, temperature, or voltage.
major energy efficiency improvements of ultra- Although this approach is generic, as it does not
low-power digital architectures (Dreslinski et al. require design of specific hardware for each chip,
2010). However, this comes at the cost of perfor- some additional logic (e.g., a lookup table) or
mance degradation and increased sensitivity with software processing is required to merge the
respect to process, voltage and temperature data and provide a feedback to the actuator. A
(PVT) variations. While the performance degra- canary circuit is a replica of the critical path plus
dation can be managed by adjusting the operating some delay element, often programmable, which
voltage of the device according to the required makes the monitor super-critical (Calhoun and
78 D. Rossi et al.

Chandrakasan 2004). As canary circuits are react in a single clock cycle to the detection of
designed to clone a specific critical path, they a fault or a critical situation. On the other hand,
are not generic components, but they provide a once again, they are very invasive from the
more direct information on the criticality of the micro-architectural viewpoint. Alternative
operating point. The global process variations as approaches to compensate for PVT variations
well the environmental conditions can be moni- rely on the adoption of adaptive body biasing
tored with both PMBs and canary circuits, and (Tschanz et al. 2002, 2007). This approach has
the supply voltage can be adapted according to several advantages with respect to modulation
these conditions. However, there are still consid- through supply voltage. First of all, dynamically
erable margins to be assumed to ensure a reliable adapting threshold voltage of transistors only
operation of the circuit. For example, local pro- increases leakage power of the circuit, while
cess variations and IR drop cannot be covered, as adaptation of supply voltage has an impact on
the monitor circuit is a copy of the critical path both leakage and dynamic power. With body
placed in a different location. A yet more direct biasing it is possible to apply independent polar-
approach is to use the speed monitor within the ization to PMOS and NMOS. Hence, it is possi-
respective circuit block. The razor concept (Ernst ble to optimize the leakage power consumption
et al. 2003; Blaauw et al. 2008) provides energy of a circuit by applying asymmetric body bias-
reduction guaranteeing reliable operation by ing. Moreover, body biasing is very effective
lowering the supply voltage to the point of first when the device operates in near-threshold, as
failure. Timing failures are detected by shadow in this operating region small changes of the
latches placed on the critical path of the design threshold voltage of transistor provide signifi-
controlled using a delayed clock, eliminating all cant increase (or reduction) of the operating
margins due to global and local PVT variations. frequency (Rossi et al. 2016a). Finally, since
By comparing the values latched by the flip-flop the polarization of the PWELL and NWELL
and the shadow latch, a delay error in the main only requires transient currents, modulation of
flip-flop is detected. The value in the shadow body biasing can be implemented with simple
latch, which is guaranteed to be correct, is then and energy efficient circuits (e.g., charge
used to correct the delay failure. The main draw- pumps), as opposed to supply voltage control,
back of the razor approach is that is very design which requires the adoption of DC/DC or volt-
specific, requiring major manual adjustments to age regulators. As a drawback, the adoption of
the microarchitecture of the SoC, as no automatic body biasing to address large PVT variations is
razor flops mechanism is available in commer- somehow challenging due to the limited
cial synthesis tools. This makes the adoption of capabilities of bulk technology to provide
this approach very challenging for a wide set of extended body bias ranges. For this reason, join-
designs, and impossible when the IP cores are ing the degrees of freedom provided by voltage
provided as hard macros or encrypted netlist by scaling and adaptive body biasing appear as a
processors or IPs providers. suitable solution to compensate for PVT
Although the most traditional way of com- variations while tracking the best energy
pensation for PVT variations leads to adjust the operation.
supply voltage of the circuits, other approaches
have been explored at the architectural level:
clock gating the specific pipeline stage where a 3.3 From Single Core to Multi Core
fault is detected (Ernst et al. 2003), using
counter flow pipelining (Charles et al. 1994), The trend to develop multicore-systems is a gen-
through architectural replay (Blaauw et al. eral tendency for several high-end embedded,
2008), or through insertion of reconfigurable desktop and server platforms. Energy efficiency
pipeline stages (Bortolotti et al. 2013). The requirements have forced processor developers
main benefit of the architectural compensation to add multi-core capabilities instead of increas-
or error recovery mechanism is that they can ing the system clock frequency in single-core
3 Ultra-Low-Power Digital Architectures for the Internet of Things 79

systems (Parkhurst et al. 2006). Cache coherency provide this amount of computational power,
and memory bandwidth determine the architec- and the trend is to migrate from a single-core to
ture of such multi-core systems, and when pri- a multi-core architecture, while maintaining low
vate caches are involved, support for cache the complexity of the memory hierarchy to target
coherency across the entire multi-core system is high energy efficiency.
desired, as it enables the software running on For this reason, multi-core architectures are
embedded processing system to balance load beginning to penetrate the microcontroller busi-
and allocate tasks seamlessly between cores. ness segment: recently, a new class of heteroge-
Cache coherence protocols must react imme- neous dual-core MCU products appeared in the
diately to writes, and invalidate all cached read market. These devices have cores with different
copies in the private cache banks, and this is instruction set architectures (i.e., Cortex M0 and
source of significant complexity. Complexity Cortex M4) (NXP Semiconductors 2015) and
translates into cost, both from silicon and energy rely on the architectural heterogeneity to achieve
perspective (Martin et al. 2012). Data synchroni- energy efficiency with a principle similar to the
zation in multi-core systems prevents data from mid-to-high-end big-little multi-cores (Peter
being invalidated by parallel access whereas 2011). In these architectures, the little core is
event synchronization coordinates concurrent mainly meant for control of peripherals and low
execution. One common mechanism to achieve workload tasks, while the big processor perform
data synchronization is a lock. Event synchroni- heavy data processing tasks. The main advantage
zation forces processes to join at a certain point of these architectures is that cores are completely
of execution. Barriers can be used to separate decoupled, easing the partial shut-down of the
distinct phases of computation and they are nor- platform when one of the cores is not used. On
mally implemented without special hardware the other hand, this decoupled architecture,
using locks and shared memory (Culler and where each core has its own binary and process
Singh 1999). Efficient data transfer between data on private memories, doesn’t allow for easy
cores in a multi-core system is critical for bal- and fair workload distribution among cores, and
anced system performance. A multi-core envi- requires to keep private copies of data-buffers,
ronment introduces high demands on the which dramatically reduces the computational
interconnect infrastructure and the interconnect efficiency of the platform when true data level
needs to be able to handle multiple streams parallelism has to be exploited.
simultaneously. Complex interconnect are A more convenient approach to the design of
widely used to sustain bandwidth and QoS parallel low-power architectures is a tightly cou-
requirements (e.g., AXI ACE), and again com- pled clusters sharing, similarly to what happens
plexity is translated into cost. in GPGPUs, a multi-banked L1 memory. With
On the other side, micro-controller systems respect to traditional architectures featuring
can take benefits of lean a simple architecture per-core private data memory, where data buffers
with respect to high-end single/multi-core are processed on the local, low-latency L1 mem-
systems. For this reason, they are much more ory, and shared with the other cores through a
attractive for a wide category of IoT devices. higher high-latency L2 memory, the tightly cou-
For instance basic micro-controllers typically pled data memory approach significantly reduces
have no data/instruction caches, while multicores the data-sharing overhead, as the memory banks
needs complex memory hierarchies to sustain can be accessed by all cores with a fixed latency
bandwidth and computational power require- (usually one cycle). This way allows to effi-
ments. But the current trend in IoT devices is ciently exploit both data and task parallelism, as
that, modern use cases are demanding more and opposed to more traditional private L1 memory
more computational power (eg. speech and scheme that allows to run efficiently only task
image recognition, surveillance etc.). To achieve parallel applications, enabling better perfor-
this target with a prefixed energy budget, stan- mance scalability when the required workloads
dard micro-controllers are not sufficient to allow for true data-level parallelism.
80 D. Rossi et al.

3.3.1 Energy Benefits and Challenges period, the energy drained by the core is given
for Parallel ULP Processors by the sum of core and system energy. Assuming
to run the same application on a multi-core plat-
Figure 3.5 shows the tradeoff for multi-core vs. - form, then the application runtime (active region)
single-core systems, under different workload will decrease (in the ideal case, will be N times
conditions. Each workload represents a specific smaller, where N is the number of cores). In this
voltage-frequency pair that exploits the required region, we have multiple cores and the system
computation power. Under high workload draining energy from the power supply, and
requirements the single-core is less energy effi- assuming that the system power is the same in
cient when compared to multi-core due to the both cases, and that core energy is the same, by
quadratic dependency of supply voltage with reducing the active period, some amount of
dynamic power (Dogan et al. 2011). Indeed, to energy is saved, thus better energy efficiency is
achieve the same throughput the single-core achieved.
needs to operate with a supply voltage much This scenario highlights the importance of a
higher than that of the multi-core, where CMOS power management strategy at the system-level
devices are far from their maximum energy effi- to develop efficient shutdown policies of unused
ciency point. However, when the voltage is close cores and workload consolidation policies for
to the threshold, leakage become dominant and minimizing the dynamic and leakage power con-
circuit slowdown increases, so from an energy sumption for the idle cores (Rossi et al. 2016a).
perspective, single core is much more efficient, An effective strategy to eliminate dynamic
therefore, this region is not interesting for power of unused processors during the execution
always-on parallel computing over multiple of sequential portions of code is architectural
cores (Fig. 3.5). clock gating. Indeed, joining hardware support
On the other hand, when we introduce a deep- for synchronization mechanism with architec-
sleep mode typical of the MCU domain, under tural clock gating of cores provides significant
low workloads, the multicore solution is still energy savings during execution of sequential or
attractive. In Fig. 3.6, in case of a single core not perfectly balance code, while minimizing the
execution, energy is consumed in the whole synchronization overhead, not present in single-
active period, then when computation is done, core platforms. However, even at low speed
the system is put in deep-sleep. In the active levels a processor consumes a significant amount

Fig. 3.5 Power efficiency DD


of multi-core vs. single
core under different
operating points. Each
operating point is function
750 MOps/s

of both workload and


65.6 mW

frequency/voltage that
Power, mW

167 MOps/s

enable such performance


6.0 mW

 GOPS/mW
1E-1 1E+0 1E+1 1E+2 1E+3

Workload, MOps/s
3 Ultra-Low-Power Digital Architectures for the Internet of Things 81

Fig. 3.6 Energy efficiency Multi-core vs Single core on parallel application

of static energy, caused, for example, by leakage implement the power gating. This significantly
current. A possible alternative to reduce both increases the area of the cores, and most impor-
leakage and dynamic power consumption of tant significantly degrades the performance of the
multi-core platform is per-core DVFS. circuits when operating at low voltage. Indeed,
Exploiting a per-core DVFS scheme, each core the presence of one additional PMOS stacked
is allowed to operate with and independent volt- over the pull-up network of standard cells has
age and frequency, thus theoretically achieving been shown to lead significant performance deg-
the best possible energy efficiency for each given radation in the near threshold operating region
workload, and can be shut down when idle. How- (Alioto 2012). Finally, power gating does not
ever, this comes with level shifters and dual- allow for state retention, as during the idle
clock FIFOs at the boundaries of every core phases all the cells within a gated region are not
introducing significant data sharing overhead, supplied. In this context, implementing a state
due to the handshaking required to implement retention mechanism requires more complex
the clock domain crossing between the state-retentive memories and flip-flops, with the
processors and the data memory. Moreover, the associated overheads in terms of area, power, and
small form factor of processors typically adopted additional routing required to bring to all these
in this domain might not justify the costs of a so components the retention voltage. Another alter-
complex solution (i.e., per-core DC/DC native to manage leakage power of idle cores is
converters or LDOs are required). A simpler but reverse body biasing (RBB). RBB can reduce
effective approach to reduce leakage power of leakage power by up to one order of magnitude
unused cores lies in per-core power gating. (Rossi et al. 2016a). Although this is not enough
Exploiting a per-core power gating architecture for implementing deep-sleep modes typical of
all the cores belong to the same frequency MCUs, it provides an effective and fast (less
domain, and operate at the same voltage when than 100 ns settling time) way to reduce the
active, eliminating clock synchronization and leakage power of idle cores in a ULP parallel
level shifting overhead, but they are able to shut architecture. Indeed, in contrast to multiple volt-
down independently when idle. Although this age domains and power gating approaches, this
approach is effective as it can reduce leakage architecture has minimal overhead in term of
power by several orders of magnitude, it requires isolation, as it usually requires a small ring of
a ring of PMOS transistors around each core to deep N-well around the region to isolate P-wells
82 D. Rossi et al.

and N-wells in a typical triple well process. et al. 2009). Scalability issues are real in multi-
Moreover, it does not require level shifters and core system, where the cache coherency com-
isolation with power gating transistors, since all plexity puts a limit to the number of cores that
the regions belong to the same voltage domain. can be integrated in the chip.
Finally, employing leakage reduction through Data Scratch-Pad Memories (SPMs on
body biasing allows standard registers to main- Fig. 3.7a) are a valid alternative to caches for
tain the state during the sleep phases, thus on-chip data memories. SPM is small on-chip
avoiding the usage of more complex state- memory bank (e.g., SRAM) mapped into the
retentive flip-flops and the related overhead in processor’s address space, and tightly coupled
terms of power routing and area. with the processor pipeline. SPMs are faster and
consume less energy with respect to larger or
off-chip memories, and their inherent predictabil-
ity have made them popular in real-time systems
3.3.2 Memory Hierarchy for Parallel
for instance. SPM also offer better scalability by
ULP Processors
allowing explicit control and optimization of data
placement and transfers. These systems, then
IoT is a network of physical objects that can refer
employ specialized data memories hierarchies
to a wide variety of devices. In such scenario it is
for better efficiency for targeted data. However
not possible to derive a single architecture that
these memory structures need a mechanism to
fits all the possible cases, hence the device archi-
move data from the different levels of the mem-
tecture is tailored upon the computation
ory hierarchy to the SPM, and transferring data
requirements of the use case, and energy
to/from this address space (explicit communica-
constraints. The memory hierarchy of such
tion) may lead to inefficiencies that can vanish the
architectures plays a dominant role in the perfor-
benefits of specialization. Remote direct memory
mance/energy metrics: for example, in real time
accesses (RDMA) is a wide used technique to
systems caches would never exist for the fact that
move data from/to the SPM from the different
different layers of memory hierarchy have such
levels of the memory hierarchy. This technique
different access/speed characteristics, and the
is efficient if the programming overhead to trigger
global access time, e.g., in case of miss, is not
the DMA transfer is minimized, and it is afford-
predictable, and may lead to deadline violations
able in the cases when the producer knows who
and system failures (safety issues). Moreover, it
the consumers will be, or when the consumer
is hard for the software to explicitly control and
knows its input data set ahead of time. Moreover,
optimize data locality and transfers (Kalokerinos

a b
Read Only Interconnect
To L2
CORE CORE CORE CORE
L1 I$ L1 I$ L1 I$ L1 I$
L1 I$ L1 D$ L1 I$ L1 D$ L1 I$ L1 D$ L1 I$ L1 D$
CORE CORE CORE CORE

Load/store Load/store Load/store Load/store


Cache Coherent Interconnect
(eg. snooping)
To L2
Interconnect

L2 Unified Cache
SPM SPM SPM SPM

Fig. 3.7 Overview of cached (a) vs. cacheless (b) non uniform access time. Cacheless approach removes
multicore system. Private caches pose a limit to scalabil- any coherency issues, and improve energy efficiency
ity, increase complexity (e.g., snooping), and introduce (simple architecture), and predictability
3 Ultra-Low-Power Digital Architectures for the Internet of Things 83

SPMs can replace data caches only if they are involved, and parametric variations of the indi-
supported by an effective compiler. The mapping vidual devices can lead to functional failures of
of memory elements of the application the cell. Low-voltage SRAMs use different sup-
benchmarks to the SPM is done mostly for data, ply voltages for the digital domain and memories.
since for instructions there are no coherency This approach entails additional complexity on
issues, therefore having a cache is highly desir- system level (level shifters and power distribu-
able on this side. However, the major benefit to tion). Other approaches for designing
adopt SPMs vs. on-chip caches is the better low-voltage SRAMs include design of 7 T, 8 T,
energy efficiency, better access timing, and better or 10 T bit-cells, decoupled read/write operations,
silicon area usage. Caches using static RAM con- read and write assist techniques, and adaptive and
sume power in the range of 25–45% of the total resilient SRAMs design (Chap. 5).
chip power, and energy can be reduced by 40% if One important factor is given by the leakage
replaced by SPM (Banakar et al. 2002). The Area- power, which is proportional to the chip area,
Time product metric can be reduced to 46% in which generally is dominated by memories. Scal-
favor of the SPM. ing the voltage in the NTC area can leverage to
10X better static power consumption, which in
IoT devices can lead to huge power improvement
since those devices are most of the time in
3.3.3 SPMs in the Near-Threshold
standby. If, from one side, scaling the voltage
Region
can lead to some power benefit, on the other
side, SRAMs become slower and slower and
On the other side, with the introduction of near-
more susceptible to process variation, leading to
threshold or even sub-threshold operation to dig-
failures when the voltage is below 0.6 V in a
ital circuits, memory design has gained renewed
28 nm process (Teman et al. 2015).
attention as being more susceptible to the side
Figure 3.8 shows the power breakdown for a
effects of low voltage operation.
multicore system, a cluster of four RISC
A classic SRAM cell is a ratioed circuit relying
processors, equipped with private instruction
on the relative drive strength of the transistors

6% 6%
4% 6%

6%
14%
PROCESSOR 0
PROCESSOR 1
6%
PROCESSOR 2
PROCESSOR 3
ICACHE 0
ICACHE 1
ICACHE 2
13% 13% ICACHE 3
L1 DATA MEMORY
L2 MEMORY
INTERCONNECT

13% 13%

Fig. 3.8 Power Breakdown for a QuadCore with private instruction caches, based on monolithic RAM memories
84 D. Rossi et al.

caches, based on SRAM memories (TAG and those access patterns are cached locally, leading
DATA). As shown in the figure, private I$ are to data replication, which results in reduced
responsible for more than 50% of the power bud- effective memory capacity, and poor energy effi-
get. One of the alternatives that has been proposed ciency. Secondly, L1 private cache need to be
in recent years to overcome some of these coherent, and as highlighted in the previous sec-
shortcomings is to synthesize memories from tion, cache coherency pose serious limit to multi-
standard cells. Standard Cell Memories (SCMs) core scalability, and it brings a not negligible
are soft-digital blocks, described in a hardware intrinsic cost (power).
description language (HDL) at register-transfer Moreover, for each processor, the available
level (RTL), and mapped to Standard Cell memory capacity is limited by its local memory
(SC) libraries. By using SCMs instead of size, and each inter-core communication is
SRAMs, designers can define each memory expensive because is done though message-
block according to the specific needs of each passing mechanism (mailbox etc.). The major
component and achieve the specifications benefit of the private caching scheme is a lower
required by their design as part of the standard cache hit latency.
digital design implementation flow. One of the On the other hand, the shared caching scheme
primary advantages that SCMs have over tradi- always maps data to a fixed location. Because
tional SRAM is their robustness, especially at there is no replication of data, this scheme
low-voltages. Since SCMs are constructed exclu- achieves a lower on-chip miss rate than private
sively from SCs, they scale along with the core caching (because of large aggregate cache capac-
digital logic, and continue to operate well below ity), and simple and efficient mechanism for
the limit of standard 6 T SRAM arrays inter-core communication, and finally no coher-
(600–700 mV). This advantage is accompanied ency issues (at L1 level). However, the average
by a loss of memory density, since the basic SCM cache hit latency is larger, because cache blocks
storage cell is much larger than a standard 6 T are simply distributed to all available cache slices.
SRAM bitcell. However, this trade-off is common To mitigate this penalty, several architectural
to all low-voltage SRAM solutions (Teman et al. solutions have been proposed (e.g., victim cache).
2015). SCM offers better energy efficiency with However, both in case of shared or private
respect to SRAM based SPMs In the application scenarios, the data cache is not attractive for an
example provided in (Meinerzhagen et al. 2010) it energetic point of view. Secondly a wide range of
was shown that the use of the considered SCM applications from image and video processing
architecture reduces the power consumption 37% domains have significant data storage
compared to the use of SRAM. requirements in addition to their computational
requirements and recent and studies (Benini et al.
2000) have shown that regular data access
3.3.4 Architecture of Memory patterns found in array-dominated applications
Subsystem for Parallel ULP can be better captured if SPM is employed,
Processors instead of a more traditional data cache.
Based on the assumption discussed before, in
The conventional memory hierarchy for ultra-low-power multi-core systems, the data-
low-power multi-core architectures is generally cache is replaced by several SPM blocks
composed by a private L1 subsystem and a (Fig. 3.7b) that are linked together in a multi-
shared L2 level (as depicted in Fig. 3.7a). In the banked shared data memory, and referred as
private L1 subsystem, local data keeps a copy of Tightly Coupled Data Memory (TCDM).
the accessed data, potentially replicating the TCDM is tightly integrated on the processors
same data in the different private local memory load/store unit interface, and shared through a
subsystem. If the L1 is based on data caches, and lean, fast and thin distributed crossbar, that
cores are working on the same dataset, then all provides shared access to several SPM banks.
3 Ultra-Low-Power Digital Architectures for the Internet of Things 85

The TCDM therefore implements a multi- To make this approach affordable, shared and
ported multi-banked shared data memory, private SPM schemes must behave in the same way
where each processor load/store interface is in the case of best case (no collision on shared
plugged directly on one master port of the accesses). The processor load/store interface
TCDM, while the SPM memory banks are requires to be carefully optimized, since SPM are
connected on the slave side. The internal arbiter now shared, and there is the crossbar logic in
handles routing and flow control of request com- between processor pipeline and SPM. Second, in
ing from different processors and directed to the case of conflicts (e.g., two or more processors
same SPM bank. Memory banks are mapped on making a request on the same shared memory
the global address space, and are accessible from bank) the processor pipeline must be frozen until
each master port with a simple flow control pro- the request is granted, therefore and additional stall
tocol (req/grant). To reduce the collision proba- must is added in the processor pipeline architecture.
bility, those banks are mapped with a word level Private instruction caches are usually able to
interleaving scheme, meaning that adjacent achieve higher speed, due to their simpler design
addresses are mapped on adjacent banks, with a (deep integration with processor pipeline), but
typical granularity of 32 bits (same as the proces- similarly to private data cache, the reduced
sor architecture). To further decrease the pres- capacity (vs. the aggregate capacity of shared
sure on memory side, the number of SPM banks instruction cache) lead to an increase in the
are doubled with respect the number of miss ratio. Although large private instruction
processors. Since TCDM is a shared architecture, caches can significantly improve performance,
it loses the determinist latency feature of private they have the potential to increase power con-
SPM, but on the other side, the embedded cross- sumption, therefore are not affordable for ultra-
bar ensures a maximum worst case latency (with low-power multicore systems.
round robin arbitration, the maximum latency for The optimal solution from a point of view of
the worst case is equal to the number of cores). the energy saving is the shared instruction cache

instr_if
AXI interconnect
Shared I$

To L2
L1 I$ L1 I$ L1 I$ L1 I$
Bank Bank Bank Bank

Read Only Interconnect

CORE CORE CORE CORE


Load/store Load/store Load/store Load/store

To L2
TCDM

Interconnect
data_if

SPM SPM SPM SPM

Fig. 3.9 Cluster of four processing elements, with tightly coupled data memory (SPM) and shared instruction cache
86 D. Rossi et al.

(Fig. 3.9). This approach can is an attractive performance across a wide range of workloads
solution to improve performance and energy effi- (Rossi et al. 2016a; Rossi et al. 2016b). The aim
ciency while reducing the footprint area of the of PULP is to satisfy the computational demands
chip, since the total capacity seen by each pro- of IoT applications requiring flexible processing
cessor is given by the aggregate capacity of the of data streams generated by multiple sensors,
whole shared instruction cache (Loi et al. 2015). such as accelerometers, low-resolution cameras,
The shared instruction cache has no coherency microphone arrays, vital signs monitors. As
issues (cache is read only), and it is implemented opposed to single-core MCUs, a parallel ultra-
as multi-banked, multi ported read-only cache. low-power programmable accelerator allows to
Each cache bank is shared through a thin, fast meet the computational requirements of these
read-only interconnect (which arbitrates the fetch applications, without exceeding the power enve-
request among the available cores), and mapped lope of a few mW typical of miniaturized,
at cache line level address interleaving to better battery-powered systems.
distribute accesses in all available cache banks
(to reduce the collision rate). Similarly to the
TCDM, the latency is not deterministic because 3.4.1 SoC Architecture
both hot/miss and local collision may happen.
Local contention may lower the instruction per The compute engine is a cluster with a
cycle metric (IPC) in each core, so some addi- parametric number (2–16) of cores (Fig. 3.10).
tional features are required to reduce perfor- Cores are based on a power-optimized micro-
mance degradations. architecture implementing the OpenRISC ISA.
Adding a pre-fetcher in between the shared Pipeline depth is optimized at four balanced
instruction cache and processor instruction-fetch stages, enabling full forwarding with single stalls
interface, with the capability to fetch one cache only on load-use and mis-predicted branches.
line in a single cycle (wide interface) further Further pipelining would not improve near-
reduce the local collision rate at cache bank threshold energy-efficiency because L1 memory
level, leading to reduce the IPC, very close to access time dominates cycle time, while register
1. With these optimizations the shared cache can and clocking overhead increase power (Gautschi
be considered a valid alternative to private cache et al. 2014). The original OpenRISC ISA and the
subsystem, with a negligible overhead due addi- core micro-architecture is extended for energy
tional resources, but with the big advantage of an efficient digital signal processing, supporting
increased memory capacity which finally leads to zero-overhead hardware loops with L0 I-buffer,
improve hit ratio and less refills though the L2 load and store operations embedding pointer
level hierarchy (with a sensible improvement of arithmetic, SIMD vector instructions and power
energy/power consumption). management instructions (Gautschi et al. 2015).
Both shared TCDM and Shared instruction The cluster can be configured with either private
cache, can be implemented using SCM, making or shared instruction cache with instruction
these architecture suitable even in the Near broadcasting support. By coupling the shared
Threshold region, where ultra-low-power I-cache with L0 buffers, it is possible to greatly
requirements are mandatory, and where SRAM reduce cache pressure, resulting in a much higher
based memory are not operative. energy efficiency (Loi et al. 2015). The cores do
not have private data caches, avoiding memory
coherency overhead and greatly increasing leak-
3.4 Design Example: The PULP age and area efficiency for data memory. A L1
Platform multi-banked Tightly Coupled Data Memory
(TCDM) acts as a software-managed shared
PULP is a multi-core platform achieving leading- data scratchpad. The TCDM features a
edge energy efficiency and featuring tunable parametric number of word-level interleaved
3 Ultra-Low-Power Digital Architectures for the Internet of Things 87

Fig. 3.10 PULP platform architecture

banks connected to the processors through a optimized for low power with just ten cycles
non-blocking interconnect to minimize banking programming latency, up to 16 outstanding
conflicts (Rahimi et al. 2011). Each logical bank transactions and a private physical channel for
is implemented as a heterogeneous memory, DMA control for each core. Various peripherals
composed of a mix of SRAMs and latch-based are available for PULP SoCs, including SPI
Standard Cell Memory (SCM) banks. Instruction interfaces with streaming support, I2C, I2S, a
caches can also be implemented using SCMs. camera interface, GPIOs, a bootup ROM and a
While SRAMs achieve a higher density than JTAG interface for debug and test purposes. SPI
SCMs (3x-4x), SCMs are able to work over the interfaces can be configured in master/slave sin-
same voltage range as logic, extending the gle or quad mode. PULP SoCs can operate stand-
operating range of the whole cluster to the very alone or as a slave accelerator of a standard host
limit of the technology; moreover, their energy/ processor (e.g., an ARM Cortex-M microcontrol-
access is lower than that of SRAMs for the small ler). To reduce the overall number of pads, and
cuts needed in L1 (Teman et al. 2015). make low-cost wire bonding packaging of the
Depending on the availability of low-voltage SoC suitable for IoT applications, the IO
memories in the targeted implementation tech- interfaces can be multiplexed.
nology, different ratios of SCM and SRAM
memory can be instantiated at design time.
Reconfigurable pipeline stages, controlled by 3.4.2 Power Management
the processors through a memory mapped inter- Architecture
face, can be optionally added to the TCDM inter-
connect to deal with the performance To provide high energy efficiency across a wide
degradation and variability of SRAMs at low range of workloads, the PULP cluster and the rest
voltage. of the SoC are in different power and clock
The cluster has AXI4-compliant interfaces. domains. Fine-grained tuning of the SoC and
Off-cluster (L2) memory and peripheral access cluster frequencies is achieved through two
latency is managed by a tightly coupled DMA FLLs (Frequency-Locked Loops) (Miro-Panades
88 D. Rossi et al.

Fig. 3.11 Power management architecture

et al. 2014). Each processor within the cluster, as is able to keep active an arbitrary number of
well as other data transfer, and memory resources processing elements, while the others consume
can be separately disabled, clock-gated and zero dynamic power and up to 10x less leakage
reverse body biased when idle. The concept of power. An event unit automatically manages
fine-grain power management is shown in transitions of the cores between the active and
Fig. 3.11. In the PULP platform, the body-bias idle states. Processors can be put in idle state
multiplexers allow to dynamically select the with a write operation on a memory mapped
back-bias voltage of each processor, enabling control register. After going in sleep mode, a
ultra-fast transitions between the active mode core remains idle until a configurable event is
and the sleep mode. Although the same concept triggered. Events can be issued by all IO
applies also to leakage management with power peripherals, the DMA, and timers. Emergency
gating, in the context of a near-threshold proces- and general-purpose events are also supported.
sor body biasing offers two key advantages for The same approach is replicated hierarchically at
partial shut-down of small blocks. First it is cluster and SoC level. A power manager based in
intrinsically state-retentive, and does not require a safe domain controls the state of the SoC and
to increase the complexity of flip flops and cluster domains, managing dynamic and leakage
memories with additional supply rails. Second, power during sleep states and generating wake-
power gating requires a ring of transistors around up sequence when a termination event is
the digital area to me managed, causing a rele- received, for example due to the completion of
vant overhead for small blocks, and add a PMOS a transfer from an IO peripheral to the L2 mem-
to the pull-up network of the digital logic, which ory. This mechanism allows to minimize the
degrades the performance in near-threshold. intrinsic overheads of a relatively complex paral-
Depending on the required workload, the cluster lel computing platform, guaranteeing that only
3 Ultra-Low-Power Digital Architectures for the Internet of Things 89

the strictly necessary blocks are active during the 3.4.4 Extending PULP
phases of parallel or sequential computation, and
data transfer. On the other hand, power gating is When the application requirements are so strict
more effective to implement system-level deep- that they cannot be matched by parallel execu-
sleep modes of heavily duty-cycled applications, tion on power-optimized processors, customiza-
where the whole system needs to be managed for tion of the cluster may be required. It is possible
longer periods (tens to hundreds micro seconds) to include hardware accelerators as an extension
and leakage power needs to be reduced by sev- of the baseline PULP cluster sharing L1 memory
eral orders of magnitudes (100 nW–10 μW). (Dehyadegari et al. 2015). Integration of HW
accelerators in the PULP cluster is fully modular;
the IPs used to couple accelerators with the clus-
ter are parametric and support manually or
3.4.3 Programming PULP
HLS-designed accelerators using either a stream-
ing dataflow model or a memory-mapped one. In
OpenMP, OpenCL and OpenVX programming
the context of high-performance computing,
models are available for PULP, as they are
deep learning (Memisevic 2015) and more spe-
well-known standards for shared memory pro-
cifically Convolutional Neural Network (CNN)
gramming, and widely adopted in embedded
algorithms have rapidly grown since 2012 thanks
MPSoCs (Stotzer et al. 2013). GCC- and
to their outstanding capabilities in the recogni-
LLVM-based tool-chains and light-weight
tion (Hannun 2014), and classification
implementations of all these environments have
(Russakovsky 2014) fields. Being these
been tailored to PULP’s explicitly managed,
algorithms approximation tolerant, a new key
scratchpad-based memory hierarchy. To achieve
challenge is to exploit them as hardware
energy efficiency, however, it is necessary that
accelerators for deeply embedded applications
the implementation of the different execution
by replacing their native single- or double- preci-
models efficiently exploits the ULP features of
sion implementations with integer or
the hardware. The alternance of sequential and
approximated floating point arithmetic to
parallel code parts is inherent to all the program-
improve energy efficiency. Another advantage
ming paradigms based on the fork-join model.
of CNNs is their affinity to be rapidly adapted
While minimizing the impact of Amdahl’s law
to several application domains, such as embed-
by extracting the maximum degree of concur-
ded audio and video processing (Wu et al. 2015).
rency in applications is paramount for every par-
In addition, the computational patterns typically
allel system, for ULP parallel system the way
implemented by CNNs are very similar to those
idleness is implemented is key to achieve energy
of more traditional filters used in several deeply
efficiency. Idle power of unused cores might be
embedded classification applications such as FIR
significant, causing huge energy efficiency drop.
filtering or FFT. Hence, CNN hardware
To this end, special hardware for accelerating
accelerators join the typical advantages of appli-
key software patterns has been developed. The
cation specific computing (i.e., significant boost
PULP software runtime integrates a clock-gating
in performance and energy efficiency with
based thread docking scheme to eliminate
respect to equivalent software implementations)
dynamic power, coupled with Reverse Body
with generality, matching two of the key
Bias (RBB) to reduce leakage, when worker
requirements of IoT computing devices. Vision-
threads are idling (e.g., in sequential regions of
PULP is an extension of the basic platform ori-
the program). In addition, the key operations
ented to embedded vision that features a 2D
required in fork/join thread management are
convolutional accelerator able to perform two
HW-accelerated, dramatically increasing the per-
16-bit 55 convolutions per cycle (Conti et al.
formance with respect to polling or event-based
2015). This improves energy efficiency by a
synchronization.
90 D. Rossi et al.

factor of 40x or more in convolutional regular intervals, with long idle periods in
workloads, reaching up to 80 GOPS of perfor- between. Hence, the computational platforms
mance and 3000 GOPS/W. have to be always-on, powered by harvester or
small coin batteries, and their life-time can be as
long as several years, as expected by their target
3.5 Trends and Perspectives applications. As a consequence, both sleep
modes and active modes need to be carefully
Table 3.2 provides a summary of recent single- optimized to reach the goal of zero-energy
core and multi-core ultra-low-power and energy sense, classify and transmit IoT systems.
efficient digital computing architectures In the next generation of IoT nodes energy
targeting the IoT application domain. Until last saving will have to be considered as a system-
few years, most IC design has mainly been wide concern. All the components within an IoT
driven by performance and area to reduce the node need to be optimized for power, including
cost production, while power consumption has sensor interfaces (Chaps. 12 and 13), digital
always been a secondary concern for optimiza- platforms (Chap. 9) and RX/TX transceivers
tion. With the coming of for near-sensor (Chap. 14), power supply and conversion subsys-
processing and IoT we are experiencing a shift tem (Chaps. 10, 11, 15). Optimized software also
from a computation-driven to a data-driven envi- plays a crucial role system-wide to manage
ronment, requiring a radical change in the design power state of all the components forming the
of the hardware platform architectures. In this IoT node. While DVFS joint with parallelism has
scenario data are generated by several sensors, been demonstrated to be extremely effective in
often asynchronously (event-based computing), reducing the energy of digital blocks, this tech-
that activate the computational platform at nique can only be applied in a very limited way

Table 3.2 Summary of recent ultra-low-power and energy efficient processors and DSPs
SLEEP REISC PULP PULP2
WALKER (Ickes DSP FRISBEE CENTR (Rossi (Rossi
(Bol et al. et al. (Gammie (Wilson P3DE (Fick et al. et al.
2013) 2011) et al. 2011) et al. 2014) et al. 2012) 2016a) 2016b)
CPU MSP430 ReiSC TMS320C64x FRISBEE ARM OpenRISC OpenRISC
CortexM3
Data format 16-bit 32-bit 32-bit VLIW 32-bit 32-bit 32-bit 32-bit
VLIW
Number of 1 1 1 1 64 4 4
cores
I$/D$/L2 16 k/2 k/n.a. 8 k/8 k/n. 32 k/32 k/ 4 k/4 k/n.a. 64 k/512 k/n. 16 k/4 k/ 16 k/4 k/
(bytes) a. 128 k a. 16 k 16 k
Technology CMOS CMOS CMOS FD-SOI CMOS FD-SOI FD-SOI
Node 65 nm GP 65 nm LP 28 nm LP 28 nm HP 130 nm 28 nm LP 28 nm HP
VDD range 0.4 (1.0) 0.54–1.2 0.6–1.0 0.4–1.3 0.65–1.15 0.44–1.2 0.32–1.2
(mem) [V] (0.4–1.2) (0.8–1.65) (0.54–1.2) (0.45–1.2)
Max freq. 25 82.5 331 2600 80 475 825
(MHz)
Power 7.7 10.2 409 62 317 72 20.7
dens.
(uW/MHz)
Best Perf. 25 57.5 662 2600 1600 1800 3300
(MOPS)
Energy eff. 64.5 68.6 4.5 16 3.9 60 193
(MOPS/
mW)
3 Ultra-Low-Power Digital Architectures for the Internet of Things 91

and with some restrictions to analog blocks. For technology nodes, ranging from 180 to 65 nm.
example, memories are very critical IPs, and Advanced nodes such sub 20 nm, or FD-SOI,
often form a bottleneck for energy efficiency of would have advantages at the transistor level
ultra-low-power platforms, since the bit-cells, as but their mask set and wafer cost will only
well as some of the analog IPs in the periphery be justified only with the expected increase of
requires higher voltages than the logic to be IoT product volumes, that would reduce manu-
operational in a reliable way. Although several facturing costs as well. A big concern of scaled
solutions has been proposed in research technology nodes is the availability of embedded
(Chap. 5), silicon vendors still strive to provide Flash as today 40 nm is the most advanced node
memory generators optimized for low-voltage featuring embedded Flash. Another issue that has
operation in production design kits. to be addressed deals with process and tempera-
An emerging trend for ultra-low-power ture variations. IoT systems are intrinsically sub-
systems, supplied only with energy harvesters ject to variations during their long lifetime due to
or small batteries is that of energy-driven com- degradation of batteries and power supply net-
putation. This computational paradigm, referred work lowering the supply voltage, or aging.
as transient computing, can be applied to Usage of scaled technology magnifies variations
applications with very low requirements in caused by process and temperature. More specif-
terms of real-time constraints. For example a ically, scaled transistors operating at low voltage
sensor that collects and transmit environmental are subject to thermal inversion phenomena,
conditions (e.g., temperature, pressure). When an which causes a strong dependency between the
energy burst is captured and accumulated in ambient temperature and maximum operating
some energy storage (e.g., super-caps) by the frequency. This issues will have to be tackled in
harvester the acquisition and computation can next generation of IoT commercial platforms
start. When the energy on the accumulator— with advanced techniques to tolerate or compen-
monitored by the hardware platform—is going sate these variations, such as in situ error detec-
to finish, the hardware platform goes into a deep- tion and correction, adaptive voltage scaling, and
sleep or even shut-down mode, after saving its adaptive body biasing.
internal state and temporary data on a
non-volatile support. When a further energy
burst occurs, the hardware platform restores the
References
state from its non-volatile support and continues
acquisition, processing and transmission. This is M. Alioto, Ultra-low power VLSI circuit design
yet an example showing the new requirements in demystified and explained: a tutorial. IEEE Trans.
terms of tighter system-level integration between Circuits Syst.—Part I (Invited) 59(1), 3–29 (2012)
the MCU and the peripheral parts of the node Ambiq Micro, Ultra-low power MCU family. Apollo
Datasheet rev 0.45 (Sep. 2015)
(e.g., batteries, sensors, transceivers) whose ARM, Cortex-M4, Technical Reference Manual r0p0
power consumption need to be constantly moni- Issue A (Mar. 2010), pp. 3.8–3.10
tored and predicted to adopt power-management Atmel, SMART ARM-based Microcontrollers. SAM
and shut-down strategies, while guaranteeing to L22x Rev. A datasheet (Aug. 2015)
R. Banakar, S. Steinke, B. Lee, M. Balakrishnan,
always be able to recover processing from a P. Marwedel, Scratchpad memory: a design alterna-
consistent state. In this scenario, a key role will tive for cache on-chip memory in embedded systems.
be played by non-volatile memories, which has in Tenth International Symposium on Hardware/Soft-
to be re-designed and optimized for their new ware Codesign (CODES) (Estes Park, 2002)
L. Benini, A. Macii, E. Macii, M. Poncino, Increasing
scope (Chaps. 6 and 7). energy efficiency of embedded systems by
Most of today’s IoT systems are fairly low application-specific memory hierarchy generation.
volume, and intended for a cost conscious mar- IEEE Design Test Comput. 17(2), 74–85 (2000)
ket. Hence, computing devices for IoT are D. Blaauw, et al., Razor II: in situ error detection and
correction for PVT and SER tolerance, in IEEE
implemented with very cheap and not scaled
92 D. Rossi et al.

International Solid-State Circuits Conference, 2008. Cortex-M3 cores. in 2012 I.E. International Solid-
ISSCC 2008. Digest of Technical Papers (IEEE, 2008) State Circuits Conference (San Francisco, 2012),
D. Bol, J. De Vos, C. Hocquet, F. Botman, F. Durvaux, pp. 190–192
S. Boyd, D. Flandre, J. Legat, SleepWalker: a 25-MHz G. Gammie, N. Ickes, M.E. Sinangil, R. Rithe, J. Gu,
0.4-V Sub-mm 2 7-μW/MHz microcontroller in A. Wang, H. Mair, S. Datla, B. Rong, S. Honnavara-
65-nm LP/GP CMOS for low-carbon wireless sensor Prasad, L. Ho, G. Baldwin, D. Buss,
nodes. IEEE J. Solid-State Circuits 48(1), 20–32 A.P. Chandrakasan, Uming Ko, A 28 nm 0.6 V
(2013) low-power DSP for mobile applications, in 2011 I.E.
D. Bortolotti, D. Rossi, A. Bartolini, L. Benini, A varia- International Solid-State Circuits Conference Digest
tion tolerant architecture for ultra low power multi- of Technical Papers (ISSCC) (20–24 Feb. 2011),
processor cluster, in 2013 23rd International pp. 132–134
Workshop on Power and Timing Modeling, Optimiza- M. Gautschi, D. Rossi, L. Benini, Customizing an open
tion and Simulation (PATMOS) (Karlsruhe, 2013), source processor to fit in an ultra-low power cluster
pp. 32–38 with a shared L1 memory, in Proceedings of the 24th
B.H. Calhoun, A.P. Chandrakasan, Standby power reduc- Edition of the Great Lakes Symposium on VLSI-
tion using dynamic voltage scaling and canary flip- GLSVLSI’14 (2014), pp. 87–88
flop structures. IEEE J. Solid-State Circuits 39(9), M. Gautschi, A. Traber, A. Pullini, L. Benini,
1504–1511 (2004) M. Scandale, A. Di Federico, M. Beretta, G. Agosta,
F. Conti, L. Benini, A ultra-low-energy convolution Tailoring instruction-set extensions for an ultra-low
engine for fast brain-inspired vision in multicore power tightly-coupled cluster of OpenRISC cores, in
clusters, in Design, Automation & Test in Europe Con- IFIP/IEEE International Conference on Very Large
ference & Exhibition (9–13 Mar 2015), pp. 683–688 Scale Integration (VLSI-SoC) (October 2015)
D.E. Culler, J.P. Singh, Parallel Computer Architecture, a A. Hannun, Deep speech: scaling up end-to-end speech
hw/sw Approach (Morgan Kaufmann, San Francisco, recognition, arXiv (2014)
1999) S. Hsu, A. Agarwal, M. Anders, S. Mathew, H. Kaul,
Cypress, Designing for low power and estimating battery F. Sheikh, R. Krishnamurthy, A 280 mV-to-1.1 V
life for BLE applications, AN92584 Rev. C Applica- 256b reconfigurable SIMD vector permutation engine
tion Note (Mar. 2016) with 2-dimensional shuffle in 22 nm CMOS, in 2012 I.
C.E. Molnar, R.F. Sproull, I.E. Sutherland, The counter- E. International Solid-State Circuits Conference
flow pipeline processor architecture. IEEE Des. Test Digest of Technical Papers (ISSCC) (19–23 Feb.
Comput. 11, 48–59 (1994) 2012), pp. 178–180
H. De Groot, IoT and the cloud: a hacked personality and N. Ickes, Y. Sinangil, F. Pappalardo, E. Guidetti, A. P.
an empty battery head-ache or an intuitive environ- Chandrakasan, “A 10 pJ/cycle ultra-low-voltage
ment to make our lives easier? S3S (2015) 32-bit microprocessor system-on-chip, in 2011
M. Dehyadegari, A. Marongiu, M.R. Kakoee, Proceedings of the ESSCIRC (ESSCIRC) (Helsinki,
S. Mohammadi, N. Yazdani, L. Benini, Architecture 2011), pp. 159–162
support for tightly-coupled multi-core clusters with D. Jeon, Y. Kim, I. Lee, Z. Zhang, D. Blaauw,
shared-memory HW accelerators. IEEE Trans. D. Sylvester, A 470 mV 2.7 mW feature extraction-
Comput. 64(8), 2132–2144 (2015) accelerator for micro-autonomous vehicle navigation
A.Y. Dogan, D. Atienza, A. Burg, I. Loi, L. Benini, in 28 nm CMOS, in 2013 I.E. International Solid-State
Power/performance exploration of single-core and Circuits Conference Digest of Technical Papers
multi-core processor approaches for biomedical signal (ISSCC) (17–21 Feb. 2013), pp. 166–167
processing, in Proceedings of 21st International G. Kalokerinos, V. Papaefstathiou, G. Nikiforos,
Workshop, PATMOS 2011 (Madrid, 26–29 Sept. S. Kavadias, Ma. Katevenis, D. Pnevmatikatos,
2011), pp. 102–111 X. Yang, FPGA implementation of a configurable
R.G. Dreslinski, M. Wieckowski, D. Blaauw, cache/scratchpad memory with virtualized user-level
D. Sylvester, T. Mudge, Near-threshold computing: RDMA capability, in International Symposium on
reclaiming Moore’s law through energy efficient Systems, Architectures, Modeling, and Simulation,
integrated circuits. Proc. IEEE 98(2), 253–266 (2010) 2009. SAMOS ’09 (Samos, 2009), pp. 149–156
D. Ernst et al., Razor: a low-power pipeline based on I. Loi, D. Rossi, G. Haugou, Exploring multi-banked
circuit-level timing speculation, in Proceedings of shared-L1 program cache on ultra-low power, tightly
36th Annual IEEE/ACM International Symposium on coupled processor clusters, in Proceedings of the 11th
Microarchitecture, 2003. MICRO-36 (2003), pp. 7–18 ACM Conference on Computing Frontiers-CF ’15
D. Ernst, S. Das, S. Lee, D. Blaauw, T. Austin, T. Mudge, (July 2015), pp. 1–10
N.S. Kim, Razor: circuit-level correction of timing M.M.K. Martin, M.D. Hill, D.J. Sorin, Why on-chip cache
errors for low-power operation. IEEE Micro 34(6), coherence is here to stay. Commun. ACM 55(7),
10–20 (2005) 78–89 (2012)
D. Fick et al., Centip3De: a 3930DMIPS/W configurable P. Meinerzhagen, C. Roth, A. Burg, Towards generic
near-threshold 3D stacked system with 64 ARM lowpower area-efficient standard cell based memory
3 Ultra-Low-Power Digital Architectures for the Internet of Things 93

architectures, in Proceedings of IEEE International energy-efficient parallel and sequential digital


Midwest Symposium on Circuits and Systems (Aug. processing, Cool Chips (2016b)
2010), pp. 129–132 O. Russakovsky, ImageNet large scale visual recognition
R. Memisevic, Deep learning architectures, algorithms challenge. Int. J. Comput. Vis (2014)
and applications, in Hot Chips: A Symposium on E. Stotzer, A. Jayaraj, M. Ali, A. Friedmann, G. Mitra,
High Performance Chips (Cupertino, 23–25 Aug. A. P. Rendell, I. Lintault, OpenMP on the low-power
2015) TI keystone II ARM/DSP system-on-chip, in OpenMP
Microchip, Analog-to-digital converter with computation in the Era of Low Power Devices and Accelerators
technical brief. TB3146 Application Note (Aug. 2016) (2013)
ST Microelectronics, STM32L4x5 advanced ARM®- A. Teman, D. Rossi, P. Meinerzhagen, L. Benini, A. Burg,
based 32-bit MCUs. RM0395 Reference Manual Controlled placement of standard cell memory arrays
(Feb. 2016) for high density and low power in 28 nm FD-SOI, in
I. Miro-Panades, E. Beignè, Y. Thonnart, L. Alacoque, 20th Asia and South Pacific Design Automation Con-
P. Vivet, S. Lesecq, D. Puschini, A. Molnos, ference (ASP-DAC), 19–22 January, 2015 (2015),
F. Thabet, B. Tain, K.B. Chehida, S. Engels, pp. 81–86
R. Wilson, D. Fuin, A fine-grain variation-aware Texas Instruments, Low-power FRAM microcontrollers
dynamic Vdd-hopping AVFS architecture on a and their applications, SLAA502 White Paper (June
32 nm GALS MPSoC. IEEE J. Solid-State Circuits 2011)
49(7), 1475–1486 (2014) J.W. Tschanz et al., Adaptive body bias for reducing
NXP Semiconductors. 2015. LPC5410X Product Data impacts of die-to-die and within-die parameter
Sheet. NXP. Rev 2.2 variations on microprocessor frequency and leakage.
J. Park, I. Hong, G. Kim, Y. Kim, K. Lee, S. Park, IEEE J. Solid-State Circuits 37(11), 1396–1402
K. Bong, H.-J. Yoo, A 646GOPS/W multi-classifier (2002)
many-core processor with cortex-like architecture for J. Tschanz et al., Adaptive frequency and biasing
super-resolution recognition, in 2013 I.E. Interna- techniques for tolerance to dynamic temperature-
tional Solid-State Circuits Conference Digest of Tech- voltage variations and aging, in IEEE International
nical Papers (ISSCC) (17–21 Feb. 2013), pp. 168–169 Solid-State Circuits Conference, 2007. ISSCC 2007.
J. Parkhurst, J. Darringer, B. Grundmann, From single Digest of Technical Papers (11–15 Feb. 2007),
core to multi-core: preparing for a new exponential, pp. 292–604
in 2006 IEEE/ACM International Conference on R. Wilson et al., A 460MHz at 397mV, 2.6GHz at 1.3V,
Computer Aided Design (San Jose, CA, 2006), 32b VLIW DSP, embedding FMAX tracking, in 2014 I.
pp. 67–72 E. International Solid-State Circuits Conference
Peter Greenhalgh, big.LITTLE processing with ARM Digest of Technical Papers (ISSCC) (San Francisco,
Cortex-A15 & Cortex-A7. Technical Report (ARM 2014), pp. 452–453
Ltd., 2011) M. Wu, R. Iyer, Y. Hoskote, S. Zhang, B. Deadman,
A. Rahimi, I. Loi, M.R. Kakoee, L. Benini, A fully- M. Bhartiya, Y. Satish, Design of an ultra-low Power
synthesizable single-cycle interconnection network SoC testchip for wearables & IOT, in Hot Chips: A
for Shared-L1 processor clusters, in Design, Automa- Symposium on High Performance Chips (Cupertino,
tion & Test in Europe Conference & Exhibition (Mar. 23–25 Aug. 2015)
2011), pp. 1–6 A. Yakovlev, D. Pivonka, T. Meng, A. Poon, A mm-sized
Renesas, RL78/G13 Rev. 3.20 User’s Manual Hardware wirelessly powered and remotely controlled locomo-
(July 2014) tive implantable device, in 2012 I.E. International
D. Rossi, I. Loi, G. Haugou, L. Benini, Ultra-low-latency Solid-State Circuits Conference Digest of Technical
lightweight DMA for tightly coupled multi-core Papers (ISSCC) (19–23 Feb. 2012), pp. 302–304
clusters, in Proceedings of the 11th ACM Conference J. Yoo, Y. Long, D. El-Damak, M. Bin Altaf, A. Shoeb,
on Computing Frontiers—CF ’14 (July 2014), Y. Hoi-Jun, A. Chandrakasan, An 8-channel scalable
pp. 1–10 EEG acquisition SoC with fully integrated patient-
D. Rossi, A. Pullini, I. Loi, M. Gautschi, F.K. Gurkaynak, specific seizure classification and recording processor,
A. Bartolini, P. Flatresse, L. Benini, A 60 GOPS/ in 2012 I.E. International Solid-State Circuits Confer-
W, 1.8 V to 0.9 V body bias ULP cluster in 28 nm ence Digest of Technical Papers (ISSCC) (19–23 Feb.
UTBB FD-SOI technology. Solid-State Electron. 117, 2012), pp. 292–294
170–184 (2016a) F. Zhang, Y. Zhang, J. Silver, Y. Shakhsheer,
D. Rossi, A. Pullini, I. Loi, M. Gautschi, F.K. Gurkaynak, M. Nagaraju, A. Klinefelter, J. Pandey, J. Boley,
A. Teman, J. Constantin, A. Burg, I.M. Panades, E. Carlson, A. Shrivastava, B. Otis, B. Calhoun, A
E. Beignè, F. Clermidy, F. Abouzeid, P. Flatresse, batteryless 19 μW MICS/ISM-band energy harvesting
L. Benini, 193 MOPS/mW @ 162 MOPS, 0.32V to body area sensor node SoC, in 2012 I.E. International
1.15V voltage range multi-core accelerator for Solid-State Circuits Conference Digest of Technical
Papers (ISSCC). (Feb. 2012), pp. 298–300
Near-Threshold Digital Circuits
for Nearly-Minimum Energy Processing 4
Massimo Alioto

This chapter addresses the challenges and the 4.1 Preliminary Considerations
opportunities to perform computation with on Near-Threshold Operation
nearly-minimum energy consumption through
the adoption of logic circuits operating at near- 4.1.1 Transistor Current vs. Supply
threshold voltages. Simple models are provided Voltage and Transregional
to gain an insight into the fundamental design Model
tradeoffs. A wide set of design techniques is
presented to preserve the nearly-minimum Voltage scaling is well known to be a very effec-
energy feature in spite of the fundamental tive knob to reduce the energy per computation at
challenges in terms of performance, leakage the cost of degraded performance (Burd et al.
and variations. Emphasis is given on debunking 2015). The performance degradation at supply
the incorrect assumptions that stem from tradi- voltages VDD lower than the nominal voltage is
tional low-power common wisdom at above- determined by the reduction in the transistor
threshold voltages. on-current Ion, which in turn depends on the
In this analysis, the main emphasis is given on operating region (i.e., the voltage range). The
the energy consumption, as performance transregional EKV model can be conveniently
requirements in IoT nodes are easily achievable used to express such dependence in all regions
with near-threshold circuits in most cases, as (Enz and Vittoz 2006):
discussed in Chap. 1 and in the following.
Sustained higher levels of performance can I on ¼ I 0  IC ¼ I 0  ½lnðev þ 1Þ2 ð4:1Þ
always be achieved through architectural
techniques (see Chap. 3), whereas occasional where IC is the inversion coefficient (i.e.,
performance boosts can be obtained through cir- normalized current), I0 is the specific current
cuit techniques (see below). 2  n  μ  COX WL ðkT=qÞ2 , and v is the normalized
gate overdrive v ¼ ðV DD  V TH Þ=½2  n  ðkT=qÞ.
In the above equations, n is the transistor sub-
threshold factor, μ is the carrier mobility, COX is
the MOS capacitance per unit area, W/L is the
aspect ratio, VTH is the transistor threshold volt-
age, and kT/q is the thermal voltage.
M. Alioto (*) In the EKV model in (4.1), a transistor
National University of Singapore, Singapore, Singapore operates in weak inversion when IC < 0:1
e-mail: malioto@ieee.org

# Springer International Publishing AG 2017 95


M. Alioto (ed.), Enabling the Internet of Things, DOI 10.1007/978-3-319-51482-6_4
96 M. Alioto

Fig. 4.1 Qualitative trend


of Ion transistor current (log IC < 0.1 0.1 £ IC £ 10 IC > 10
scale) versus the gate
overdrive V DD  V TH weak moderate strong
inversion inversion inversion

log(Ion)

VDD=VTH

-50 mV 0 200 mV VDD-VTH

(i.e., for v < 1), which from (4.1) corresponds circuit simulations on average (in the worst
to voltages below V TH  50 mV for typical case), hence it is well suited for quick estimates
n (~1.3–1.5) and operating temperatures (Sansen and design purposes.
2006). On the other hand, a transistor operates in
strong inversion for IC > 10 (i.e., for v > 3:1),
and hence for voltages above V TH þ 200 mV
4.1.2 Transistor Current and Gate
(Sansen 2006). Near-threshold operation occurs
Delay in Different Regions
for intermediate voltages, as summarized in
Fig. 4.1.
By the definition summarized in Fig. 4.2,
The above traditional EKV model is very
sub-threshold voltages correspond to transistor
useful for quick estimates, but it oversimplifies
operation in weak inversion, above-threshold
the I–V characteristics at voltages above VTH.
are associated with strong inversion, and near-
Indeed, eq. (4.1) leads to I on  I 0  v2 in strong
threshold voltages correspond to intermediate
inversion, and its quadratic trend is far from the
voltages between V TH  50 mV and
linear trend that is observed in actual nanometer
V TH þ 200 mV. For typical standard threshold
CMOS technologies.1
voltages,2 near-threshold voltages are in the
Introducing voltage-dependent coefficients in
range of 400–600 mV, approximately.
(4.1) solves the issue, but leads to impractically
At above-threshold voltages such that
complicated expressions for pencil-and-paper V DD V TH

evaluations. To retain its simplicity while e nðkT=qÞ  1, Eq. (4.2) is approximately a linear
employing constant coefficients, (4.1) is here function of V DD  V TH as expected
modified according to
 VDD VTH 
I on ¼ I 0  ln e nðkT=qÞ þ 1 ð4:2Þ

which is plotted in Fig. 4.2 along with the actual 2


I–V characteristics for 28-nm NMOS and PMOS Operation at near-threshold voltages tends to increase
V TH compared to the value at nominal voltage, due to
transistors. The model is 10% (20%) within DIBL (see Sect. 5.2.2). For standard V TH of 350–380 mV
at nominal voltage, it is common to have V TH in the order
of 400–450 mV when operating at near-threshold voltages
1
Indeed, sub-100 nm CMOS technologies typically have (see, e.g., Fig. 4.7). Observe that the “standard VTH”
an I–V characteristics that is proportional to nomenclature might be attributed to different threshold
ðV DD  V TH Þα with α  1 (Sakurai and Newton 1990). voltages in some processes.
4 Near-Threshold Digital Circuits for Nearly-Minimum Energy Processing 97

Fig. 4.2 Plot of Ion VOLTAGE sub near


transistor current (log above threshold
RANGE threshold threshold
scale) in (4.2) versus the
TRANSISTOR weak moderate strong inversion
magnitude of the gate-
REGION inversion inversion
source voltage VGS
(in CMOS logic gates, V GS EKV MODEL IC < 0.1
0.1 < IC < 10 IC > 10
IC=I/I0 RANGE
¼ V DD )
VDD RANGE <VTH-50 mV intermediate >VTH+200 mV

1E-03

ION(A/mm)
1E-04

1E-05
threshold voltage
1E-06
|VTH|
1E-07
200 400 600 800 1000 1200
|VGS| (mV)
NMOS LVT (sims) NMOS LVT (model)
PMOS LVT (sims) PMOS LVT (model)

 
kT (Weste and Harris 2011), its propagation delay
I abovethreshold  I 0 =n  ðV DD  V TH Þ
q τPD can be expressed as ðC=I on Þ  ðV DD =2Þ:
ð4:3Þ C V DD C V DD
τPD ¼     :
I on 2 2  I 0 ln eVDD V TH
nðkT=qÞ þ 1
whereas at sub-threshold voltages it can be
approximated as3
ð4:6Þ
V DD V TH
I subthreshold  I 0  e nvt ð4:4Þ As shown in Fig. 4.3, from (4.3) and (4.6), volt-
age downscaling leads to an approximately linear
which exponentially decreases when lowering
delay (i.e., performance) degradation, when
the voltage. At near-threshold voltages,
operating above threshold. As discussed in
Eq. (4.2) can be approximated as
Sect. 4.3, the energy is typically dominated by
"   #
I0 V DD  V TH 1:35 the dynamic contribution, hence a quadratic
I nearthreshold   1:5 þ energy saving is observed above threshold. On
2 n  kT=q
the other hand, an exponential increase in the
ð4:5Þ gate delay is observed in the sub-threshold
region. Also, due to the heavier leakage contri-
which is within 15% of the exact I–V bution at low voltages (Sect. 4.3), the energy
characteristics in Fig. 4.2. From (4.5), the near- reaches a minimum energy point (MEP), and it
threshold I–V characteristics is a power law, and tends to increase again when further lowering
is steeper than in the above-threshold region. VDD. Hence, near-threshold voltages are an
Let us now consider a CMOS logic gate ideal compromise between energy and perfor-
driving a capacitive load C, which includes the mance in energy-centric VLSI designs. Indeed,
capacitive parasitics of the gate itself. As usual the near-threshold gate delay is still reasonably
small, and energy is close to its minimum value
across all voltages. This motivates this chapter,
and the adoption of near-threshold circuits for
3
Indeed, lnðex þ 1Þ  ex for x < 0 (i.e., for V DD < V TH) in
(5.2). VLSI processing in the IoT domain.
98 M. Alioto

Fig. 4.3 Qualitative trend linear performance


of performance (gate delay) performance degradation
and energy per operation
energy
versus supply voltage VDD

MEP

VTH VDD
subthreshold near threshold above threshold
(ST) (NT) (AT)

shown by Fig. 4.4, the transistor gate capacitance


4.2 Near-Threshold Transistor tends to moderately decrease at voltages close to
and Circuit Properties or below VTH, and hence makes the delay degra-
dation more graceful than discussed above,
In this section, properties of transistors at near although to a minor extent.
threshold are discussed to provide general circuit According to the above observation, the above
design guidelines. Preliminary considerations on qualitative considerations on the delay at near- and
voltage scaling and threshold voltage depen- sub-threshold voltages fully apply to any practical
dence on sizing are respectively provided in design. As an example, Fig. 4.5 shows the trend of
Sects. 4.2.1 and 4.2.2. The impact of transistor the fan-out-of-4 delay FO4 (i.e., the delay of an
stacking and PMOS/NMOS imbalance are inverter gate driving four equal inverters). This
discussed in Sect. 4.2.3 to guide the topology metric is widely used at process level to character-
selection during circuit design. As second and ize the speed of the technology, at circuit level to
equally fundamental aspect of circuit design, abstract the circuit design from the process details,
transistor strength adjustment is discussed in and at architectural level since the clock cycle
Sect. 4.2.4. normalized to FO4 is typically a constant that is
defined by the architecture (Harris). In short, FO4
characterizes the system performance versus volt-
4.2.1 Impact of Aggressive Voltage age for a given architecture. From Fig. 4.5, opera-
Scaling on Transistor Current tion in the middle of the near-threshold region
and Delay degrades the performance by approximately a fac-
tor of 10, compared to operation at nominal volt-
The considerations on the delay degradation age. This is generally true regardless of the adopted
under voltage scaling in the previous section technology (Dreslinski et al. 2010).
were based on the assumption that the gate load A very distinctive property of near-threshold
C is independent of VDD. Observe that the load operation is the stronger delay sensitivity to a
C comprises wire parasitics and transistor gate given absolute change in the gate overdrive (i.e.,
capacitances. The above assumption certainly both VDD and VTH), compared to above-threshold
holds in wire-dominated loads (as wire parasitics designs. This is partially explained by the steeper
are voltage-independent), whereas it is somewhat I–V characteristics (4.5) compared to (4.3) (the
pessimistic in gate-dominated loads. Indeed, as exponent of v is respectively 1.35 and 1). But the
4 Near-Threshold Digital Circuits for Nearly-Minimum Energy Processing 99

Fig. 4.4 Gate capacitance 1


normalized to value at
nominal voltage versus

normalized Cg
supply voltage VDD (28 nm, 0.9
LVT and RVT transistors)

0.8

0.7

0.6
0 200 400 600 800 1000 1200
VDD [mV]
LVT NMOS LVT PMOS RVT NMOS RVT PMOS

Fig. 4.5 Fanout-of-4 1E+04


(FO4) delay normalized to
normalized FO4 delay

value at nominal voltage


versus supply voltage VDD 1E+03
(28 nm, LVT and RVT
transistors)
1E+02

1E+01

1E+00
0 200 400 600 800 1000 1200
VDD [mV]
LVT RVT

main reason is due to the very large sensitivity of Table 4.1 Ion Improvement due to supply voltage
v ¼ ðV DD  V TH Þ=½2  n  ðkT=qÞ to a given boosting by 100 mV
change in VDD, as VDD is much closer to VTH VDD Ion improvement
compared to above-threshold voltages. The rela- 400 mV 4.05X
tive Ion improvement due to a 100-mV supply 500 mV 2.24X
voltage increase (i.e., boosting) for a 28-nm tech- 600 mV 1.7X
nology is shown in Table 4.1. As expected, the 800 mV 1.31X
impact of voltage boosting at near-threshold 1V 1.17X
voltages is substantially larger than above thresh-
old, with improvements in Ion in the range of gate overdrive V DD  V TH . For example, the Ion
2-4X. This unique feature permits to have signifi- and speed sensitivity to a 100-mV VDD shift in
cant speed adjustment capability with very lim- Table 4.1 hold for the same change in VTH
ited amount of boosting, which needs to be (although with negative sign). In other words,
thoroughly exploited in near-threshold-designs. increasing VTH by 100 mV (i.e., the typical dif-
The above considerations equally apply to the ference between a low and regular VTH) at
threshold voltage, as Ion is a direct function of the near-threshold supply voltages leads to a 2-4X
100 M. Alioto

Fig. 4.6 Ratio between 20


FO4 of RVT and LVT

FO4(RVT)/FO4(LVT)
transistors versus supply
voltage VDD 15

10

0
0 200 400 600 800 1000 1200
VDD [V]

reduction in speed. This is shown in Fig. 4.6, Biasing, i.e. for positive4 (negative) body bias
which plots the inverter delay ratio under regu- voltages VBB (Tsividis 1999). The approximately
lar-VTH (RVT) and low-VTH (LVT). At nominal linear dependence in both effects is captured by
voltage, the different VTH has a moderate impact the following equation
on the performance, whereas such difference is
V TH ¼ V TH0  λDIBL V DS  λBB V BB ð4:7Þ
much more pronounced at near-threshold
voltages. where VTH0 is the threshold voltage extrapolated
In summary, the high sensitivity of perfor- for very low VDD and V BB ¼ 0 V, λDIBL is the
mance to VDD and VTH makes them very power- DIBL coefficient and λBB is the body effect coef-
ful knobs at near-threshold voltages, although the ficient. The DIBL coefficient is in the order of
(same) sensitivity to their variations poses a chal- 0.1 V/V or larger for technologies suited for IoT,
lenge at the same time, as will be discussed in the and hence denotes a pronounced dependence of
following sections. the threshold voltage on the drain-source voltage.
As an example, Fig. 4.7 shows the change in VTH
versus the drain-source voltage (i.e., VDD) in
4.2.2 Impact of DIBL and Sizing 28-nm transistors. When VDD is reduced down
on Threshold Voltage to near-threshold voltages, VTH typically
increases by around 100 mV compared to opera-
In the previous subsection, VTH was implicitly tion at nominal voltage. This needs to be explic-
considered constant. In view of the large sensi- itly taken into account when choosing the type of
tivity of Ion to VTH, the dependence of VTH on threshold voltage at design time.
transistor voltages and sizing needs to be explic- On the other hand, the threshold voltage
itly considered at near-threshold voltages. dependence on the body voltage is well known
Regarding the dependence of the transistor to be rather weak in advanced technologies,
voltages, VTH tends to be quite sensitive to the although it is appreciable in 90-nm generations
drain-source voltage due to the Drain Induced or older. Considering the strong sensitivity of
Barrier Lowering (DIBL) effect (Tsividis 1999). performance and leakage on VTH, body biasing
Due to the DIBL effect, VTH increases in an
approximately linear fashion when the magni- 4
These considerations hold for NMOS transistors. For
tude of the drain-source voltage is reduced. Due
PMOS transistors, change the sign in all voltages. Regard-
to the body effect, VTH decreases (increases) ing the body effect, FBB (RBB) refers to body voltages
under Forward FBB (Reverse, RBB) Body V BB below (above) V DD .
4 Near-Threshold Digital Circuits for Nearly-Minimum Energy Processing 101

change of threshold voltage


Fig. 4.7 Threshold 250
voltage deviation vs. drain-
source voltage due to DIBL 200

150

[mV]
100

50

0
0 200 400 600 800 1000 1200
|VDS| [mV]
LVT NMOS LVT PMOS RVT NMOS RVT PMOS

VTH
VTH

Wmin W Lmin L
(a) (b)

Fig. 4.8 Qualitative trend of threshold voltage vs. (a) transistor channel width, (b) transistor channel length (due to
Short Channel Effect)

is still a viable option in near-threshold circuits in of L leads to an increase (decrease) in VTH due to
90-nm technology bulk generations, or in more the Reverse (traditional) Short-Channel Effect
recent generations in FDSOI CMOS technology. RSCE (SCE) (Tsividis 1999). The dominance
The transistor threshold voltage also depends of one of the two former mainly depends on
on the size, especially when the latter is close to whether the transistor body is lightly doped or
the minimum allowed by the process. The quali- includes halos to counteract short-channel effects
tative dependence on the channel width (Tsividis 1999).
W (length L) is qualitatively depicted in From the above considerations, transistor
Fig. 4.8a, b. From Fig. 4.8a, the reduction of sizing can affect the performance in ways that
W leads to a decrease (increase) in VTH due to are more complicated than the usual linear
the Reverse (traditional) Narrow-Channel Effect dependence of Ion on W/L, due to the additional
RNCE (NCE) (Tsividis 1999). The dominance of (strong) dependence of VTH on size at near-
one of the two effects mainly depends on the threshold voltages. As an example, Fig. 4.9
transistor isolation technology (e.g., Shallow shows that the Ion current trend versus
Trench Isolation vs LOCOS), device structure W deviates from the traditional linear depen-
(bulk, FinFET, FDSOI) and several parameters. dence of I on / W. For the specific considered
On the other hand, from Fig. 4.8b, the reduction technology, the current is increasing faster than
102 M. Alioto

Fig. 4.9 Ion normalized to 30

Ion normalized to Wmin, Lmin


current in minimum-sized LVT PMOS
transistor vs. channel width 25
(normalized to minimum
width Wmin) 20
15
10
5
0
1 2 3 4 5 6 7 8 9 10
W/Wmin
VDD=200 mV VDD=400 mV VDD=600 mV
VDD=800 mV VDD=1 V

I on / W, denoting that the NCE effect dominates mild dependence on VTH and is mostly defined
(i.e., wider channels lead to large current than by the carrier mobility at nominal voltage.
expected due to the simultaneous reduction in Hence, differences in VTH between PMOS and
VTH). This effect is clearly more pronounced at NMOS do not significantly impact the strength.
low voltages due to the stronger dependence of On the other hand, the strength has a strong
Ion on VTH, whereas it is negligible at nominal dependence on VTH at near-threshold (and
voltage. lower) voltages, hence even moderate
Other technologies might have opposite differences in VTH between PMOS and NMOS
behavior due to dominant RNCE (i.e., Ion substantially alter their strength ratio. The latter
increases slower than W, due to the progressive can be smaller or larger than the value at nominal
increase in VTH due to the increase in VTH). On voltage, depending on the VTH differences
the other hand, Fig. 4.10 shows that Ion decreases between PMOS and NMOS (including DIBL),
faster than 1/L at near-threshold voltages, due to and hence the specific technology. Figure 4.11
the dominance of SCE. Again, this dependence is shows the trend of the PMOS/NMOS strength
1.5–3X stronger than at nominal voltage due to ratio in a specific 28-nm technology, which at
the stronger dependence of Ion on VTH at low near-threshold voltages can be reduced by up to
voltages. Other technologies might have differ- 2.5X compared to nominal voltages. At lower
ent behavior, due to the dominance of RSCE. voltages, the impact is even larger, due to the
exponential dependence of Ion on VTH in (4.4).
This deviation of the PMOS/NMOS strength
4.2.3 PMOS/NMOS Strength Ratio, ratio clearly threatens the noise margin of
Stacking and Wire Delay CMOS logic gates, thus degrading robustness
and exposing logic gates to malfunctions due to
Another important effect observed at near- variations. This also emphasizes the imbalance
threshold voltages is the deviation of the ratio between the rise and fall delay, thus degrading
of the PMOS and NMOS strength (i.e., Ion) at performance.
iso-size, compared to nominal voltage. This is Analogously, the strength of stacked
due to the different dependence of PMOS and transistors (i.e., connected in series) can heavily
NMOS Ion across different voltages. Indeed, deviate from the strength of a single transistor,
from (4.3)–(4.6) the transistor strength has a compared to operation at nominal voltage.
4 Near-Threshold Digital Circuits for Nearly-Minimum Energy Processing 103

Ion normalized to Wmin, Lmin


LVT PMOS
0.8
1.5-3X more sensitive to L
0.6 variations than nominal VDD trend with VTH
independent of L
0.4

0.2
1
0 L/Lmin
1 2 3 4 5 6 7 8 9 10
L/Lmin
VDD=200 mV VDD=400 mV VDD=600 mV
VDD=800 mV VDD=1 V

Fig. 4.10 Ion normalized to current in minimum-sized transistor vs. channel length (normalized to minimum length
Lmin)
PMOS/NMOS strength rao

Fig. 4.11 Ratio between 1.2


the strength of PMOS and LVT transistors
ransistors
NMOS versus supply 1
voltage VDD (LVT
transistors) 0.8

0.6

0.4

0.2

0
0 200 400 600 800 1000 1200
V DD [mV]
W=Wmin W=2Wmin W=4Wmin W=8Wmin

This can be shown by the Ion stacking factor Xon, The stacking factor peaking at near-threshold
defined as the factor by which the Ion current is voltages can be observed in any CMOS technol-
reduced due to the transistor stacking, compared ogy, as the presence of stacked transistors
to a single transistor (assuming all transistors reduces the drain-source voltage of each stacked
have the same size as the single one). The trend transistor, and hence leads to a further increase
in Fig. 4.12 shows that the stacking factor tends to in VTH (and decrease in the strength) due to
peak around near-threshold voltages, and the phe- DIBL, compared to a single transistor. This
nomenon is more evident under a larger number explains why the near-threshold current deliv-
of stacked transistors. At lower (sub-threshold) ered by four stacked transistors is up to 7X
voltages, the stacking factor goes back to lower than a single transistor at iso-size,
smaller and threshold-voltage independent value although this factor is about half of it at nominal
(Alioto 2012). voltage. Due to the same reason, the degradation
104 M. Alioto

7
LVT NMOS

Ion stacking factor


Nstacked
6

3
2
2

1
1

0
0 200 400 600 800 1000 1200
VDD [mV]
2 stacked transistors 3 stacked transistors 4 stacked transistors

Fig. 4.12 Ratio between the strength of PMOS and NMOS versus supply voltage VDD (LVT transistors)

for two and three stacked transistors is much impact of wire delay. This brings back circuit
less pronounced, and is acceptable from a per- and architectural solutions that were abandoned
formance point of view. Hence, as general cir- in the late 90s, due to the then incumbent impact
cuit design guideline, the maximum number of of wires on the system performance.
stacked transistors in near-threshold designs In summary, the large performance sensitivity
needs to be lower (e.g., 3) than at nominal volt- on VDD and VTH represents a very interesting
age (typically up to four). opportunity in near-threshold circuits, but also
Finally, another fundamental difference poses various challenges. Among those, it is not
encountered in near-threshold designs is the possible to maintain a fixed delay ratio between
deviation of the ratio between the gate and cells with different amount of stacking and
wire delay, compared to nominal voltage. threshold voltage, when scaling the voltage
Indeed, at lower voltages, the gate delay (even without variations). This, in addition to
increases as in (4.6), whereas the wire delay the substantial performance degradation due to
remains constant. As an example, Fig. 4.13 stacked transistors, suggests that the maximum
considers a wire whose delay matches the fan-in of near-threshold CMOS standard cell
delay of a single gate designed for high perfor- should be three. Within the same cell, it is not
mance (i.e., its delay is about one FO4 possible to maintain a stable PMOS/NMOS
(Sutherland et al. 1999)) at nominal VDD. This strength ratio across different voltages. For the
corresponds to a global wire with a length in the same reasons, ratioed and dynamic logic styles
order of a millimeter. From this figure, the wire are unfeasible at near-threshold voltages (not to
delay at near-threshold voltages represents only mention the larger impact of leakage and
a small fraction (in the order of 10X smaller) of variations, as discussed in Sects. 4.3 and 4.6).
the gate delay. This means that above-threshold Similarly, topologies that are inherently based
designs and architectures that aims at mitigating on current contention and positive feedback
the impact of wire delay are definitely need to be definitely avoided (e.g., cross-coupled
overdesigned and performance/energy non-clocked inverters in flip-flops). Unfortu-
sub-optimal at near-threshold voltages. Hence, nately, this cannot be avoided in SRAM bitcells
near-threshold circuits and architectures need to and register files for reasons due to density, and
be very different from traditional above- other sophisticated techniques need to be
threshold solutions, due to drastically smaller deployed (see Chap. 5).
4 Near-Threshold Digital Circuits for Nearly-Minimum Energy Processing 105

Fig. 4.13 Ratio of wire 1E+00


and gate delay normalized

wire/gate delay ratio


to value at nominal voltage
versus supply voltage VDD
(28 nm, LVT transistors) 1E-01

1E-02

1E-03
0 200 400 600 800 1000 1200
V DD [mV]

4.2.4 Knobs to Adjust Transistor operation at nominal voltage. Indeed, the sensi-
Strength tivity of Ion to VTH translates into a strong sensi-
tivity to its process variations. Also, the delay
From the previous subsection, the transistor ratio of an RVT and LVT logic gate (see
strength can be adjusted with the following Fig. 4.14. for an inverter gate) strongly depends
knobs: on the supply voltage. In other words, mixing
standard cells with different VTH poses the prob-
• transistor size lem of having different delay scaling in different
• body biasing portions of the system. In turn, this makes timing
• VTH selection closure certainly more difficult and might reduce
• VDD tuning and fine-grain boosting. the energy benefit of dynamic voltage scaling, as
the critical path(s) depends on the voltage.
From the previous subsection, transistor Let us now consider fine-grain voltage
sizing is relatively effective, and can be more or boosting, which consists in selectively over-
less effective than at nominal voltage, depending driving appropriate transistors with a voltage
on the dominance of RNCE over NCE, and SCE above VDD. As shown in the illustrative example
over RSCE. Body biasing can significantly alter in Fig. 4.15, this might be the case of a single
the transistor strength only in old technologies large transistor M1 (e.g., sleep transistor, large
(e.g., 90 nm), or in recent FDSOI technologies, buffer) that drives a sub-circuit containing sev-
with a typical 30% range of adjustment at near- eral smaller transistors. Let us assume that the
threshold voltages. gate of M1 is overdriven at V DD þ ΔV DD as
In view of the strong dependence of Ion on the opposed to all other transistors and logic gates,
gate overdrive discussed in Sect. 4.1, the transis- which are powered at VDD. Due to the strong
tor strength can be substantially modified (super-linear) Ion increase in M1 due to the gate
through the proper selection of the threshold voltage boosting by ΔVDD, the transistor can be
voltage, and the fine-grain boosting of VDD to significantly undersized while maintaining the
selectively increase Ion where required. Regard- same strength as the transistor that is driven by
ing the VTH selection, a 2–4X Ion (and delay) VDD. In view of the strong dependence of Ion on
change was previously shown to be feasible the gate voltage in M1 at near-threshold voltages,
when changing VTH from one type (e.g., RVT) a small amount of boosting ΔVDD permits to
to the next available one (e.g., LVT). However, substantially reduce the area occupied by M1.
VTH selection at near-threshold voltages poses This is shown in the example in Fig. 4.15 in
various additional challenges, compared to 28 nm, where the area of M1 can be reduced by
106 M. Alioto

Fig. 4.14 Ratio of Ion of 16


LVT and RVT transistors 14

LVT/RVT Ion ratio


normalized to value at 12
nominal voltage versus
10
supply voltage VDD
8
6
4 4X
1.8X
2
0
0 200 400 600 800 1000 1200
V DD [mV]
NMOS PMOS

1.2 boosting improves Ion superlinearly


VDD fi lower size/energy @ iso-current
size, energy of boosted

1
transistor (a.u.)

0.8 2X lower
VDD+D VDD 0.6 energy
0.4 11X lower
VDD
0.2 area
0
M1 0 100 200 300 400 500 600 700
energy D VDD [mV]
transistor size energy to drive transistor

(a) (b)
Fig. 4.15 (a) In-principle circuit with large transistor whose gate voltage is boosted by ΔVDD, (b) area and energy
improvement vs. ΔVDD

up to an order of magnitude while maintaining operating above threshold (i.e., Ion becomes less
the same strength, through an amount of boosting sensitive to ΔVDD), and the energy cost Cg, M1
in the order of a few hundreds of mVs. Similarly, ðV DD þ ΔV DD Þ2 of boosting increases substan-
such selective boosting permits to super-linearly tially due to the quadratic dependence.
reduce the leakage current of M1. At the same From the above considerations, near-threshold
time, the gate capacitance Cg,M1 of M1 is reduced circuits can be made more energy- area-efficient
super-linearly as well, whereas the supply volt- by selectively boosting portions of the circuit that
age is increased by the very limited amount contain large (and hence energy- and area-
ΔVDD. This means that the dynamic energy Cg, M1 hungry) transistors. As opposed to traditional
ðV DD þ ΔV DD Þ2 to switch M1 ON is reduced multi-VDD approaches that are applied at the
overall. In the example in Fig. 4.15, a 2X energy module level, in this case the supply is boosted
reduction can be achieved through selective with fine granularity (i.e., down to the single
boosting of M1, when adopting an adopting an transistor). Such fine-grain voltage boosting also
optimal ΔVDD of 300 mV (this voltage depends offers the opportunity to equalize imbalanced
on the specific technology). Very similar energy logic across pipestages. As fundamental chal-
saving is observed for ΔVDD in the order of lenge, fine-grain boosting entails significant area
100–200 mV. On the other hand, larger amount overhead, due to the additional level shifters to
of boosting slightly increases the energy con- drive boosted-voltage domains, and to the slight
sumption to turn on M1, since the transistor starts additional cost of distributing multiple voltages at
4 Near-Threshold Digital Circuits for Nearly-Minimum Energy Processing 107

IST D D D

IG IJ
G B G B G B
IG IJ

S S S
(a) (b) (c)
Fig. 4.16 Transistor leakage contributions: (a) sub-threshold, (b) gate, (c) substrate

the physical design level (Flynn et al. 2007). In threshold voltages is well approximated by
other words, near-threshold circuits should cer- (4.4), where the gate-source voltage (assigned
tainly take advantage of fine-grain voltage to VDD in (4.4)) has to be set to zero. By
boosting, but innovative techniques are needed substituting the dependence of VTH in (4.7), the
to minimize the unavoidable overhead. Some near-threshold leakage current of an NMOS tran-
recently proposed ideas to address this challenge sistor immediately results to
will be presented in Sect. 4.1. V TH0 λBB V BB λDIBL V DD
I lkg ¼ I 0  e nkT=q e nkT=q ð4:8Þ

where the first exponential term is set by the


4.3 Energy Trends threshold voltage (including body biasing, when
applied), and the second expresses the DIBL
In this section, the impact of voltage scaling on the
effect and hence the leakage dependence on
energy is reviewed, providing models and design
VDD. The latter dependence is exponential, and
guidelines for minimum-energy operation.
typically operation at near-threshold voltages
reduces the leakage current by about an order of
magnitude for typical DIBL coefficients in the
4.3.1 Transistor Leakage Current order of 0.1 V/V, compared to nominal voltage.
at Near-Threshold Voltages The consistent exponential trend across voltages
in (4.8) is shown in Fig. 4.17 for a 28-nm tech-
The MOS transistor leakage contributions are nology, along with a leakage current reduction at
summarized in Fig. 4.16. The dominant contri- near-threshold voltages by 4–8.5X.
bution is due to the sub-threshold leakage in
(4.4), which flows between drain and source
and is due to the diffusion of minority carriers 4.3.2 Energy Consumption of Digital
between the two terminals (Tsividis 1999). The Systems at Near-Threshold
gate leakage flows from gate to source/drain or Voltages
vice versa, depending on the applied voltages,
and tends to be exponentially smaller than the The total energy per operation5 in a near-
sub-threshold contribution when lowering VDD threshold VLSI digital system is essentially
(Narendra and Chandrakasan 2006). Similar equal to the sum of the dynamic and the leakage
considerations hold for the substrate leakage, energy. Indeed, the short-circuit energy
which is mostly due to the Band-to-Band
Tunneling (BTBT), and the inverse saturation 5
An operation is here defined as the basic task that the
current of the source-bulk and drain-bulk pn
considered system is executing, e.g., an instruction in a
junctions (Narendra and Chandrakasan 2006). CPU or GPU, a new output sample in a DSP, an arithmetic
Hence, the transistor leakage current at near- operation in an Arithmetic Logic Unit.
108 M. Alioto

Fig. 4.17 Leakage current 1E+00 LVT NMOS


Ioff of LVT transistor
(normalized to value at

normalized Ioff
nominal voltage) versus 4X lower than
supply voltage VDD nominal VDD
8.5X lower than
1E-01
nominal VDD

1E-02
0 200 400 600 800 1000 1200
VDD [mV]

contribution (Weste and Harris 2011) is negligi- number of FO4 delays (i.e., cascaded inverters
ble at near-threshold voltages, as opposed to with fan-out of 4) that can fit the cycle time.
operation at nominal voltage. This is because Hence, LDeff represents the effective logic depth
the transistors for input voltages around VDD/ per pipestage, which is a constant defined by the
2 is a small sub-threshold current, which also (micro)architecture.6 Hence the leakage energy
rapidly vanishes when the input voltage deviates per operation can be written as
from VDD/2 to settle to its stable value (Alioto Elkg ¼ V DD  I off  T CK  CPO, or equivalently
2012) (due to the exponential I–V characteristics
in the sub-threshold region).
The dynamic energy per operation is given by
affected by
(micro) architecture In (4.10), the only parameters that depend on
affected by circuit design VDD are VDD itself, Ioff and FO4. When down-
affected by technology
scaling the voltage, the first term decreases line-
arly and Ioff decreases exponentially due to
DIBL, although not very rapidly since VDD is
where αSW  C  V 2DD is the energy per cycle, being
multiplied by λDIBL  1 in (4.8). On the other
C the total capacitance within the circuit, αSW is
hand, FO4 rapidly increases as in (4.6) when VDD
the activity factor (Weste and Harris 2011) (i.e.,
is reduced down to near-threshold voltages and
the fraction of C that is switched in a cycle, on
below. The overall effect of the three factors
average). In (4.9), it was considered that an oper-
leads to an increase in Elkg at near- and
ation in general takes an average number of
sub-threshold voltages when VDD is reduced, as
cycles CPO (Cycles per Operation), which
opposed to the dynamic energy. The leakage
depends on the specific (micro)architecture, and
energy tends to increase very rapidly when
the dataset to a minor extent (e.g., in
decreasing VDD down to the transistor threshold
microprocessors).
The leakage energy per operation can be
expressed as the product of the average leakage
power V DD  I off (being Ioff the average leakage
6
T CK =FO4 is essentially constant in gate-delay dominated
critical paths when varying V DD , as all gate delays gener-
current), the clock cycle TCK and CPO (Alioto
ally scale like FO4 (Harris et al. n.d.; Weste and Harris
2012). TCK can be expressed as FO4  LDeff , 2011). At near-threshold voltages, this assumption is gen-
where LDeff ¼ T CK =FO4 is the number of the erally correct, as the wire delay is typically much smaller
(see Sect. 5.2.3).
4 Near-Threshold Digital Circuits for Nearly-Minimum Energy Processing 109

Fig. 4.18 Leakage energy higher VTH fi higher


vs. supply voltage VDD for Elkg at low VDD RVT
different logic depths LDeff 40 marchitectures with larger logic

normalized leakage energy


equal to 25FO4 and 50FO4 35 depth have higher Elkg at low VDD
(28 nm, LVT and RVT
transistors) 30
LVT
25
Elkg increases due to FO4
20
increase at lower VDD
15
DIBL decreases Elkg when
10
lowering VDD near nominal
5
0
0 200 400 600 800 1000 1200
VDD [mV]
LVT (25FO4) LVT (50FO4) RVT (25FO4) RVT (50FO4)

voltage, due to the resulting rapid increase in the The qualitative trend of (4.11)–(4.12) versus VDD
gate delay in (4.6). This is shown in Fig. 4.18, in Fig. 4.19 shows that the voltage down-scaling
where the leakage energy of the reference digital reduces the dynamic energy, but increases the
circuit in Sect. 4.4 is plotted versus VDD under leakage energy. Hence, a minimum-energy
RVT and LVT transistor flavor. As expected, the point (MEP) is observed at a voltage VDD,opt
leakage energy under RVT flavor rapidly that optimally balances the dynamic and leakage
increases at higher voltages compared to LVT, energy, thus leading to the minimum energy7
due to the higher threshold voltage. This figure Emin. The MEP voltage VDD,opt typically lies in
also shows that Elkg tends to shoot up at larger the sub-threshold or near-threshold region
voltages, under microarchitectures with larger (Hanson et al. 2006a; Hanson et al. 2006b), as
logic depth (e.g., LDeff ¼ 50 instead of 25). discussed in the next section. Due to the flatness
This is because such microarchitectures suffer of the MEP, near-threshold operation permit
from larger leakage energy from (4.10), and true- or nearly-minimum energy operation, as
hence the rapid increase can be observed at larger fundamental design target of this chapter.
voltages. On a side note, Fig. 4.18 also shows Figure 4.20a–d shows the energy trend and
that Elkg has an opposite behavior at above- the presence of the MEP in various integrated
threshold voltages (i.e., it decreases when prototypes, including an FFT core from MIT
decreasing VDD), due to the dominance of the (Wang and Chandrakasan 2005), an 8-bit micro-
exponential effect of DIBL over the linear FO4 processor from Umich (Hanson et al. 2008), an
increase. IA-32 processor from Intel (Jain et al. 2012), and
From (4.9)–(4.10), the total energy per opera- an AES core from NUS (Zhao et al. 2015). From
tion ETOT of a given VLSI system or sub-system these figures, the energy curve is relatively flat
results to around the MEP, hence the minimum- or nearly-
minimum energy per operation does not require a
ETOT ¼ Edyn þ Elkg ¼ Ecycle  CPO ð4:11Þ stringent precision in the generation of the supply
where the energy per cycle Ecycle ¼ ETOT =CPO is voltage. In practical designs, a change in VDD
defined as
7
Ecycle ¼ αSW  C  V 2DD þ V DD  I off Since the energy per operation ETOT in (5.11) is propor-
 FO4  LDeff ð4:12Þ tional to Ecycle , in the following we will simply refer to the
energy per cycle in (5.12), unless otherwise specified. All
considerations are immediately extended to ETOT by
simply multiplying Ecycle by CPO.
110 M. Alioto

Fig. 4.19 Qualitative Ecycle ETOT


trend of dynamic, leakage
(ETOT) Edyn
and total energy per cycle
Ecycle (or equivalently total
energy per operation ETOT)
vs. supply voltage VDD

MEP
Emin
Elkg

Vmin VDD,opt VDD

Fig. 4.20 Energy vs. VDD and minimum-energy point in from Umich), (c) IA-32 processor ((Jain et al. 2012)
(a) FFT core ((Wang and Chandrakasan 2005) from from Intel), (d) AES core ((Zhao et al. 2015) from NUS)
MIT), (b) 8-bit microprocessor ((Hanson et al. 2008)

around the optimal voltage VDD,opt by various points). The MEP voltage VDD,opt in the above
tens of mVs (e.g., 30-50 mV) keeps the energy examples covers the typical range encountered in
very close to Emin (e.g., within a few percentage real designs (300–450 mV). The detailed
4 Near-Threshold Digital Circuits for Nearly-Minimum Energy Processing 111

analysis on the dependence of the MEP position As even more crucial observation, the leakage
on process and design parameters is presented in energy is a substantially larger fraction of the
the next subsection. overall energy budget at near-threshold voltages,
Let us observe that the leakage energy compared to nominal voltage. Indeed, Elkg (Edyn)
increase at low voltages limits the energy at near-threshold voltages is larger (smaller) than
reductions enabled by aggressive voltage scaling, at nominal VDD. Table 4.3 shows the detailed
compared to the quadratic reduction that would energy breakdown measured in the microproces-
be achievable if the total energy were dominated sor in (Jain et al. 2012), which includes a level-1
by Edyn. Indeed, in the latter case the minimum cache. Above threshold, the leakage energy is
achievable energy would be given by (4.9) with 14% and is well in line with the expectations at
VDD equal to the minimum operating voltage nominal voltage. In near-threshold region, the
Vmin that ensure correct operation, as in leakage energy raises to a much larger 42% as
Fig. 4.19. The related energy saving compared expected, and in sub-threshold region it
to nominal voltage is reported in Table 4.2, completely dominates the overall energy.
which represents an upper bound of the energy For all the above reasons, mitigating the leak-
savings achievable for quick estimates. Observe age energy is a crucial goal of near-threshold
that the potential energy savings in circuits with designs, and is far more important than tradi-
wire-dominated load are lower than the case of tional low-power above-threshold designs. In
gate-dominated load. Indeed, in the former case addition, Sect. 4.4 will show that traditional
the dynamic energy reduction is purely qua- low-power techniques to mitigate leakage are
dratic, whereas the latter also benefits from the rather ineffective when VDD is pushed down to
simultaneous load reduction due to the reduction near threshold.
in the transistor gate capacitance at low VDD (see
Sect. 4.2.1 and Fig. 4.4). From this table, opera-
tion at near-threshold voltages potentially
4.3.3 Trans-Regional Energy Model
reduces the energy by up to an order of magni-
tude, compared to the nominal voltage. At the
From the previous subsection, the MEP is set by
same time, the presence of leakage narrows down
the optimal balance between dynamic and leak-
the range of voltages at which energy reduction
age energy. In other words, the MEP voltage
is truly allowed.
VDD,opt and the resulting minimum energy Emin
both depend on the ratio between leakage and
Table 4.2 Dynamic energy reduction vs. VDD
total energy. This means that the MEP position in
VDD2 energy saving CgVDD2 energy saving the energy-voltage plane changes according to
VDD (load ¼ (load ¼
(mV) wire only) transistor only)
this ratio, as discussed below.
200 mV 36X 54X
When the leakage energy significantly
400 mV 9X 11.6X increases for some reason, whereas the dynamic
600 mV 4X 4.4X energy remains constant, the total energy clearly
800 mV 2.2X 2.4X increases and VDD,opt increases as well (i.e., the
1V 1.4X 1.4X MEP moves to the right, and upwards, as
1.2 V 1X 1X summarized in Fig. 4.21a). Indeed, in this case

Table 4.3 Measured energy breakdown in Jain et al. (2012)


VDD (V) Elkg(%) Edyn(%)
Logic L1C Logic L1C Total Logic L1C Total
Sub threshold (Vmin) 0.28 0.55 62% 33% 95% 4% 1% 5%
Near threshold (MEP, VDD,opt) 0.45 0.55 27% 15% 42% 53% 5% 58%
Above threshold (nominal VDD) 1.2 1.2 11% 3% 14% 81% 5% 86%
logic ¼ Core, L1C ¼ L1 Cache
112 M. Alioto

ETOT ETOT

ase
incre yn
E l eas
in

Ed
c
k g
r
e

MEP MEP

VDD VDD
(a) (b)
Fig. 4.21 Qualitative description of how the minimum-energy point (MEP) changes position when (a) Elkg increases
(Edyn kept constant), (b) Edyn increases (Elkg kept constant)

the leakage energy tends to be a larger fraction of modules are activated). Since the input dataset
ETOT, hence VDD needs to be increased to reduce and the temperature are time varying and are
Elkg (as explained by Fig. 4.19). On the other unpredictable at design time, a feedback scheme
hand, when the dynamic energy increases at that tracks the actual MEP through energy sens-
iso-leakage energy, Edyn becomes a larger frac- ing (or estimation) and adjusts the supply voltage
tion of ETOT, hence it becomes more important to accordingly (Ramadass and Chandrakasan
reduce Edyn and hence VDD,opt decreases (i.e., the 2008).
MEP moves to the left, and upwards, as To have a more quantitative understanding of
summarized in Fig. 4.21a). From the above the dependence of the MEP position on process,
considerations, the MEP tends to move to the design and environmental parameters, let us con-
right when the temperature is increased and/or sider an analytical model of the energy. In detail,
the circuit activity is reduced, due to a different eq. (4.12) can be written in the following more
input data profile or power mode (i.e., different useful form8:

 
I off , TOT
Ecycle ¼ αSW  CTOT  V 2DD 1 þ LDeff  FO4
αSW  CTOT  V DD
"  #
gatecount  I off 5Cin, min
 αSW  CTOT  V 2DD 1 þ LDeff   
αSW  gatecount  Ccell 2I on, min
2 V TH 3

6 I 0, min  e  nkT
q
strength
Xstack, of f 1 7
 αSW  CTOT  V 2DD 41 þ 2:5  LDef f    VDD VTH 5
αSW  Cin, min
Ccell
I 0, min ln e nðkT=qÞ þ1
2   3
V DD 1þαX
6 7
off

¼ αSW  CTOT  V 2DD 6
41 þ ILDR  e
nkT
q  f ILDR ðV DD Þ7
5 ð4:13Þ

where (4.7) and (4.26) were used, and the intrin- 8


The following analysis is inspired by Hanson et al.
sic leakage-dynamic energy ratio ILDR (i.e., the
(2006a), Hanson et al. (2006b), Bo et al. (2004) and
contribution of Elkg/Edyn that is independent of generalizes the results to arbitrary designs, instead of
VDD) was defined as being valid only for simple cascaded inverters.
4 Near-Threshold Digital Circuits for Nearly-Minimum Energy Processing 113

strength 1
ILDR ¼ 2:5  LDeff 
V DD, nom  ð4:14aÞ
Xstack, off
V DD, nom
 eαXoff  nkT=q αSW  CCincell
, min

and fILDR(VDD) was defined as VDD,nom of the adopted technology, and its down-
V DD ð1þλDIBL ÞV TH0
scaling at low voltages is accounted for by the
e nkT
q
technology-dependent parameter αXoff  1 (see
f ILDR ðV DD Þ ¼  V 1þλ V  (4.26)). Eq. (4.13) is strongly affected by ILDR.
DD ð DIBL Þ TH0
ln e nð kT=q Þ þ1 From (4.13), the latter parameter represents the
voltage-independent (i.e., intrinsic) contribution
ð4:14bÞ of the ratio between leakage and dynamic
Also, in (4.13) it was observed that energy, and is defined by

• the total leakage current Ioff,TOT of the design • the architecture through LDeff (i.e., faster
under consideration is equal to the gate count designs with low LDeff exhibit lower ILDR)
(gatecount) multiplied by the average leakage • the function implemented by the design under
consideration, which in turn sets the standard
per standard cell I off
cell usage statistics (i.e., the average
• I off can be expressed as the leakage current of
off-stacking factor Xstack,off) and the average
a minimum-sized inverter I 0, min  eV TH =n q ,
kT

fan-out Ccell =Cin, min (i.e., the average equiva-


multiplied by the average cell strength lent number of minimum-sized inverters that
strength (where 1X refers to the minimum- load the cells); such dependence tends to be
sized inverter) and divided by the average fairly weak, due to the averaging effect across
off-stacking factor Xstack,off (i.e., the factor cells in large designs
by which the leakage current is reduced due • the performance target in the automated cell
to transistor stacking) sizing phase, as strength and the average
• the total capacitance CTOT is equal to the gate fan-out Ccell =Cin, min both tend to be larger
count multiplied by the average switched for tighter timing constraints (again, faster
capacitance per standard cell Ccell designs have lower ILDR); accordingly, such
• FO4 can be thought of as the delay of a dependence tends to be fairly weak as well
minimum-sized inverter driving four equal • the input dataset, which in turn sets the circuit
inverters, and hence its total load capacitance activity (i.e., αSW).
is 4Cin,min (Cin,min is the input capacitance of a
minimum-sized inverter) plus its parasitic In summary, parameter ILDR is mainly set by
capacitance, which is approximately equal to the architecture and the input statistics, and low
Cin,min (Sutherland et al. 1999); Ion,min is the values of ILDR are associated with more active
current delivered by such minimum-sized and faster designs. In practical cases, ILDR
inverter (see (4.4)), and I0,min is the I0 param- ranges from a few hundreds for rather fast and
eter in (4.1) and (4.2) pertaining to the same active designs with heavy dynamic energy, to a
inverter. several tens of thousands in very slow and inac-
tive circuits. Higher values are observed only
In (4.13), the average off-stacking factor when an additional constant power contribution
Xstack,off is evaluated at the nominal voltage comes from external blocks.
114 M. Alioto

Table 4.4 Numerical Examples for ILDR in 28 nm (V DD, nom ¼ 1:2 V)


Elk

Edyn
MEP
Ccell
Design LDeff αSW Cin, min strength ILDR VDD,opt
High performance, active 20 15% 20 6 110 180 mV 0.62
Low performance, little active 100 3% 6 3 3100 320 mV 0.27

Observe that ILDR is defined at nominal volt- exclusively based on the performance require-
age and all parameters not explicitly related to ment (i.e., targeted FO4), according to (4.4).
VDD,nom are essentially independent of the volt- Let us now analyze the optimum voltage VDD,opt
age,9 and hence can be evaluated from the report that minimizes the energy in (4.13) assuming
of synthesis/place&route at such voltage without the MEP to be in sub-threshold. Although a
requiring the full characterization of the library closed-form solution cannot be found, a good
at different voltages. A few numerical examples approximation for a single-VTH design is loga-
in 28 nm are reported in Table 4.4, assuming rithmic (similar to (Bo et al. 2004; Hanson
λDIBL ¼ 0:1 (i.e., αXoff ¼ 0:098 from (4.26)), et al. 2006a, b))

Xstack, off
V DD, nom equal to 20 (i.e., average of two
n kT
stacked transistors in this technology) at nominal V DD, opt  ½1:25lnðILDRÞ  0:5
1 þ αXoff q
voltage. From the technology scaling viewpoint,
k0 tends to slightly decrease at finer technologies, ð4:15Þ
due to stronger DIBL and hence larger

which has a maximum error of 4% for practical
Xstack, off
V DD, nom . As a simpler approach, ILDR values of ILDR, as plotted in Fig. 4.22a under the
can also be estimated as the value that makes above 28-nm parameters and V TH0 ¼ 0:35 V.

From this figure, the MEP voltage logarithmi-
the ratio Elkg/Edyn (i.e., ILDR  eV DD 1þαXoff =n q
kT

cally increases with the constant slope in (4.15),


f ILDR ðV DD Þ in (4.13)) equal to the value that is
which is independent of the adopted threshold
obtained from power analysis at RTL level.
voltage, as shown in Fig. 4.22b.
As expected from the above considerations
and Fig. 4.21, the MEP moves to the right for
4.3.4 Considerations on the MEP slow and less active circuits, and to the left for
Voltage circuits with dominating dynamic energy and
low logic depth. Observe that operation at V DD
Typically, the MEP mostly lies in the deep < V min severely degrades the die yield, hence
sub-threshold region (Hanson et al. 2006a), and minimum-energy designs need to adopt a supply
sometimes near-threshold (Hanson et al. 2006b). voltage equal to the minimum between V DD, opt in
In the former case, f ðV DD Þ  1 in (4.13) since (4.15) and Vmin. From Fig. 4.22a, b, this means
V DD < V TH0 in (4.14b), hence the energy is inde- that fast (i.e., with low LDeff) designs with ILDR
pendent of the transistor threshold voltage. This lower than a few thousands cannot really achieve
is because the latter affects both the leakage and true-minimum energy, due to voltage scaling
the on-current in the same way (i.e., both are
  limitations imposed by robustness issues.
proportional to exp V TH =n  kTq in The above analysis assumed that the MEP lies
sub-threshold). In this case, VTH is chosen in the deep sub-threshold region, which is correct
as long as VDD,opt in (4.13) is lower than V TH
50 mV (see Fig. 4.1), i.e., when
9
For example, the average fan-out Ccell =Cin, min is inde- V TH 50 mV
kT þ1:6
pendent of V DD since the wire capacitance is constant, and ILDR < e 1:5n q . The latter boundary value
the transistor gate capacitance does not change substan- for ILDR in 28 nm is typically in the order of
tially (see Fig. 5.4). Similarly, the logic depth, the activity
1,000–2,000 for V TH ¼ 350 mV (including
and the average strength do not depend on V DD .
4 Near-Threshold Digital Circuits for Nearly-Minimum Energy Processing 115

Fig. 4.22 MEP voltage


vs ILDR for 28-nm 700
VDD,opt saturation
technology with (a) V TH0 ¼ above threshold
0:35 V and detailed 600 /
ub
analytical model, (b) V TH0 500 in s ld .5]
¼ 0:35 V, 0:45 V, 0:55 V, ope sho ) –0

VDD,opt (mV)
l
DR
e
0:65 V (exact solution via 400 e s -thr
numerical minimization of m
sa near n (IL
I
.25
300
(4.13), approximate
expression as in kT [1
200 q
(4.15)–(4.17)) n
f
a x of
100 1+
0
1E+02 1E+04 1E+06 1E+08 1E+10
ILDR
exact approximate
(a)
1200

1000 saturation at
larger VDD under
800
larger VTH
VDD,opt (mV)

ent
600 end
indep
e
400 slop of V TH

200

0
1E+02 1E+04 1E+06 1E+08 1E+10
ILDR
VTH0=035 V (exact) VTH0=0.35 V (approximate)
VTH0=0.45 V (exact) VTH0=0.45 V (approximate)
VTH0=0.55 V (exact) VTH0=0.55 V (approximate)
VTH0=0.65 V (exact) VTH0=0.65 V (approximate)
(b)

DIBL), 4000–5000 for V TH ¼ 400 mV, and value V DD, opt


ILDR!1 . Indeed, f(VDD) in (4.14b)
10,000 for V TH ¼ 450 mV. Slightly larger becomes approximately equal to
values are typically found in older technologies, V DD ð1þλDIBL ÞV TH0  
due to the lower subthreshold factor n. e nkT
q  V DD ð1þλ DIBL ÞV TH0
nðkT=qÞ , and the
Interestingly, Fig. 4.22a, b show that (4.15) resulting VDD,opt is found by minimizing (4.13)
can be extended to the near-threshold region (i.e., for ILDR ! 1:
larger ILDR), as it still predicts the MEP voltage
2V TH0
with good accuracy. Hence, the MEP voltage is V DD, opt
ILDR!1 ¼ ð4:16Þ
1 þ λDIBL
again independent of the threshold voltage, even
in the near-threshold region. For even larger which was found to be always within 12% of the
values of ILDR such that V DD > V TH0 þ 200 exact solution that minimizes (4.13) in 28 nm
mV (see Fig. 4.1), the MEP moves to the above- (and typically within 5%). VDD,opt saturates
threshold region and eventually saturates to a because Elkg increases when increasing VDD in
116 M. Alioto

the above-threshold region, as opposed to sub- voltage at which Elkg is minimum (i.e., such
and near-threshold (see Fig. 4.18 and related that V DD  I off  FO4 is minimum, from (4.12)).
discussion). In other words, it does not make In summary, the above considerations suggest
sense to increase VDD beyond (4.16) from an that VDD,opt can be simply modeled by extending
energy viewpoint, as this would surely increase (4.15) to the near-threshold and part of the
both dynamic and leakage energy, and hence the above-threshold region, and limiting it to its
total energy. Indeed, (4.16) represents the asymptotic maximum value in (4.16):

 
n kT 2V TH0
V DD, opt  min ½1:25lnðILDRÞ  0:5 , ð4:17Þ
1 þ αXoff q 1 þ λDIBL

which has a typical (maximum) error of 4% (7%) 4.3.5 Considerations on the MEP
across the very wide range of ILDR in Fig. 4.22a, b. Energy
Eq. (4.17) is a useful tool to estimate the MEP
position by knowing the type of design (i.e., From (4.13), the resulting energy at the MEP in
ILDR), and a few other technology-dependent deep sub-threshold region (i.e., under (4.15)) can
parameters. From (4.17), the transistor threshold be written as
choice affects only the value of ILDR and the 

Elkg

voltage at which the MEP saturates at. In particular, Emin ¼ αSW  CTOT  V 2DD, opt 1 þ ð4:18Þ
larger VTH0 moves saturation towards exponen- Edyn
MEP
tially larger k0 and proportionally larger

where, considering that f ðV DD Þ  1 and αXoff  1
V DD, opt
ILDR!1 .
in (4.13), the energy-optimum ratio Elkg/Edyn at
the MEP is given by

Elk

V , opt
 DDkT 17

 ILDR  e n q
 0:2 þ ð4:19Þ
Edyn MEP, subthreshold ILDR0:75

In (4.19), an empirical approximate expression Compared to the overall energy budget, Elkg at
has been introduced to facilitate its estimate at the MEP needs to be 40–50% for very fast/active
design time, and its error is within 12% for low designs (ILDR  150), around 15–30% for more
values of ILDR (down to 150), as plotted in typical designs ( ILDR > 150 but still in
Fig. 4.23. Observe that Elkg/Edyn at the MEP is sub-threshold). Previous work on joint supply/
independent from the chosen (single) threshold threshold voltage and sizing optimization
voltage, as expected from the considerations in showed that energy optimality is achieved when
Sect. 4.3.4. In other words, sub-threshold designs Elkg is about one third of the overall energy
that differ only for the threshold voltage choice (Markovic et al. 2004; Nose et al. 2000; Patil).
have the same leakage percentage contribution, Accordingly, these results hold in sub-threshold
other than the same VDD,opt (see Sect. 4.3.4). region only for relatively slow and inactive
Accordingly, the MEP is hence chosen based on designs, from Fig. 4.23.
the performance target rather than energy. Also, As discussed in Sect. 4.3.4, very fast and
this means that the MEP voltage can be estimated active designs have low VDD,opt, which often
at design time even before choosing the transistor times falls below Vmin. In these cases,
flavor (and hence before actual implementation). minimum-energy and reliable operation is
4 Near-Threshold Digital Circuits for Nearly-Minimum Energy Processing 117

10

ld

old
VERY SLOW/INACTIVE DESIGNS

ho

old

old
esh
re s
(near-threshold)

esh

esh
-th
(Elkg / Edyn)MEP
Elkg > 50-90% of ETOT

thr
thr

thr
ar

ar-
ar-

ar-
ne

ne
1

ne

ne
VERY FAST/ACTIVE DESIGNS
1E+02 1E+04 1E+06 1E+08 1E+10
(deep sub-threshold, ILDR≤150)
Elkg = 40-50% of of ETOT
SLOW/INACTIVE DESIGNS
(sub-threshold) 0.1
Elkg = 15-30% of of ETOT ILDR
VTH0=0.35 V VTH0=0.45 V
VTH0=0.55 V VTH0=0.65 V

Fig. 4.23 Leakage-dynamic energy ratio at the MEP vs. ILDR, obtained through numerical minimization of (4.13a)

achieved at V DD ¼ V min > V DD, opt . If the extra above-threshold region, and is definitely
performance compared to the MEP is not utilized dominated by Elkg.
since the design is essentially energy From the above considerations, the energy
constrained, operation at V min > V DD, opt leads Emin at the MEP monotonically increases when
to an increase in Edyn by a factor (Vmin/VDD,opt)2 increasing ILDR. At sub-threshold voltages, this
from (4.9), compared to the MEP. At the same is due to the increase in the energy-optimal volt-
time, a smaller increase in Elkg by a factor Vmin/ age VDD,opt in (4.17), which is certainly more


VDD,opt is observed (since same clock cycle is rapid than the reduction in 1 þ Elkg =Edyn
MEP
assumed in (4.10), and DIBL effect is neglected).
in (4.19). Since both dependencies were found to
In other words, designs with V min > V DD, opt typi-
be unaffected by the threshold voltage in
cally have lower Elkg/Edyn, compared to opera-
sub-threshold, Emin for MEP in sub-threshold is
tion at the MEP in Fig. 4.23.
independent of VTH as well. At near-threshold
For larger ILDR, again the MEP moves to the
voltages, Emin keeps increasing since VDD,opt in
near-threshold region and the energy-optimum
(4.17) continues to increase with the same trend
ratio Elkg/Edyn at the MEP increases again when
as sub-threshold (see Fig. 4.22a), and
increasing VDD. This is because the increase in 

VDD,opt at near-threshold voltages determines a 1 þ Elkg =Edyn

MEP
increases as well. For
much smaller reduction in Elkg compared to above-threshold MEP, VDD,opt in (4.17) saturates
sub-threshold, as the gate delay decreases much to an almost constant value, and


slower than exponentially from (4.6). Analyti-

1 þ Elkg =Edyn MEP


keeps increasing.
cally, this is accounted for by the increase in f
(VDD) in (4.13), which determines a proportional From the above considerations, the energy at
increase in Elkg/Edyn. Accordingly, Elkg becomes the MEP is monotonically degraded when ILDR
again a substantial fraction of the energy budget is increased, i.e., for leaky or little active designs.
when the MEP is pushed at near-threshold This is essentially due to the increase in VDD,opt
voltages or higher (i.e., for large ILDR), as (i.e., dynamic energy at the MEP), and the
shown in Fig. 4.23. This explains why Elkg/Edyn increase in the leakage-dynamic energy ratio at
heavily depends on VTH at near-threshold voltages above VTH. More quantitatively,
voltages, as in Fig. 4.23. For extremely slow Fig. 4.24 shows that Emin increases in an approx-
and inactive circuits, the MEP moves to the imately linear fashion when increasing ILDR. In
118 M. Alioto

Fig. 4.24 Energy at MEP 1.60E+07


Emin normalized to

(normalized energy)
αSWCTOT vs. ILDR for
various threshold voltages linear r )

Emin /a SW CTOT
8.00E+04
e p e r V TH
independent of VTH te e
t s larg
~3.5×10-4×ILDR u
b at
4.00E+02
ear, ore
in m
till l tly
2.00E+00 s igh
(sl
1E+02 1E+04 1E+06 1E+08 1E+10
1.00E-02 ILDR
sub-threshold above-threshold

VTH0=0.35 V VTH0=0.45 V VTH0=0.55 V VTH0=0.65 V

particular, in the sub-threshold region the ratio 4.3.6 Sensitivity of Nearly-Minimum


Emin/αSWCTOT is well approximated by 3:5  104 Energy to VDD Inaccuracies
ILDR regardless of the threshold voltage, hence
From a design standpoint, it is necessary to pre-
Emin  3:5  104  αSW  CTOT  ILDR ð4:20Þ dict the required accuracy for VDD to achieve
nearly-minimum energy per operation, which in
which is within 20% of exact Emin for ILDR up to
turn constraints the design of the voltage regula-
a few tens of thousands. For less typical design
tion circuitry and the power management
with larger ILDR, the trend becomes slightly
sub-system. As can be seen from Fig. 4.20a–d,
steeper by a factor ranging from to 2.5X to 4X
the energy-voltage curve is steeper at the left of
compared to (4.20), for a threshold voltage in the
the MEP, due to the exponential increase in the
350–650 mV range. In other words, when the
leakage energy at low voltages. In other words,
MEP is at near-threshold voltages, Emin actually
an uncertainty ΔV DD in the supply voltage
increases when VTH increases, although
around the MEP degrades the energy more sub-
moderately.
stantially when it pushes VDD below VDD,opt
Summarizing these conclusions and those in
rather than above (even more so, if performance
Sect. 4.3.4, the MEP voltage is unaffected by the
is considered). Due to the same reason, the
(single) threshold voltage when operating in sub-
energy degradation at the left of the MEP com-
and near-threshold voltages. On the other hand,
pared to the right becomes more evident for
the energy portion associated with leakage dras-
larger ΔVDD.
tically increases when operating at near-
In nearly-minimum energy designs, the maxi-
threshold voltages, compared to sub-threshold
mum tolerable percentage energy degradation
ones. At above-threshold voltages, the MEP volt-
% energydegradation compared to the MEP
age becomes a function of VTH, and energy is
due to the uncertainty in VDD needs to be trans-
dominated by leakage. Regardless of the voltage
lated into the specification of the maximum tol-
range in which the MEP lies in, the minimum
erable uncertainty ΔVDD. In sub-threshold
achievable energy monotonically and propor-
region, the resulting tolerable uncertainty ΔVDD
tionally increases with ILDR.
4 Near-Threshold Digital Circuits for Nearly-Minimum Energy Processing 119

Fig. 4.25 Maximum


supply voltage deviation
from MEP that maintains
the energy degradation
within the target (on x-axis)
in sub-threshold region
(model in (4.21a), (4.21b))

is independent of VTH, since the energy is solution ΔVDD,norm turns out to be largely inde-
independent of VTH as well (see Sect. 4.3.4). pendent of ILDR, and is hence only a function of
Since f ðV DD Þ  1, Eq. (4.13) can be % energy degradation. ΔVDD,norm is well
expressed in a technology-independent manner10 approximated (within 10%) by
by defining the normalized voltage 0:62 þ 0:034  ð%energydegradationÞ, as shown

V DD, norm ¼ V DD 1 þ αXoff =n kT
q . The maximum in Fig. 4.25. Accordingly, the maximum tolera-
deviation ΔVDD,norm in VDD,norm compared to ble voltage deviation that meets a targeted per-
the value that minimizes the energy can be centage deviation in sub-threshold region is
easily solved numerically. The numerical

n kT
ΔV DD, subthreshold    gð%energy degradationÞ
q
ð4:21aÞ
1 þ αXoff

to be set with a precision of about a thermal


gðxÞ ¼ 0:62 þ 0:034  x ð4:21bÞ
voltage (25–35 mV) to keep the energy degrada-
Interestingly, from (4.21a), (4.21b), ΔVDD in tion compared to the MEP modest (5–10%).
sub-threshold does not depend on the position Larger VDD uncertainty (e.g., 1.5–2X the thermal
of the MEP, and it only depends on technology voltage) leads instead to an unacceptably large
through subthreshold slope and DIBL coeffi- energy degradation, and should hence be avoided
cient, and on the targeted maximum energy in practical cases.
degradation. As an example, Fig. 4.25 plots the When the MEP moves to the near-threshold
maximum tolerable ΔVDD versus the energy region, a larger voltage deviation can be tolerated
degradation in 28 nm, and shows that VDD needs for a targeted maximum energy degradation
compared to the MEP. This is because FO4 and
hence Elkg (see Eq. (4.10)) become less sensitive
10
Indeed, eq. (5.13) in sub-threshold region becomes to VDD compared to sub-threshold, as shown in
Ecycle / V 2DD, norm ½1 þ ILDR  eV DD, norm , assuming αXoff  Fig. 4.26. As expected, the tolerable voltage
λDIBL (which is generally, since λDIBL  1).
120 M. Alioto

Fig. 4.26 Maximum 200 ST = sub-threshold


supply voltage deviation 180 NT = near-threshold AT
160 AT = above-threshold

tolerable VDD deviation


from MEP vs. ILDR for
different VTH0 in 28 nm AT

w.r.t. VDD,opt (mV)


(target energy degradation 140
AT

NT
w.r.t. MEP ¼ 10%) 120
independent

NT
100

NT
of VTH and ILDR

NT
80
60
40
20 ST
0
1E+02 1E+04 1E+06 1E+08 1E+10
ILDR
VTH0=0.35 V VTH0=0.45 V VTH0=0.55 V VTH0=0.65 V

deviation at near-threshold depends on VTH0, as deviation saturates as well at above-threshold


opposed to sub-threshold. This is because the voltages, as shown in Fig. 4.26. As expected from
energy in (4.13) depends on VTH0 at near thresh- (4.16), larger VTH0 pushes the saturation to higher
old (see Sect. 4.3.4), since VTH0 defines the volt- voltages (and hence ILDR). Analytically, the max-
age range (and hence ILDR) in which transistors imum tolerable allowed voltage deviation around
enter this region. From Fig. 4.26, 4 to 6 thermal the MEP is evaluated by equating (4.13) and the
voltages can be tolerated with minimal energy energy at MEP (i.e., voltage in (4.16)) increased by
penalty at near-threshold voltages. a factor ð1 þ %energy degradation=100Þ,
For larger MEP voltages in the above-threshold and solving for VDD. By approximating
 V 1þλ V 
region, an even larger voltage deviation around the DD ð DIBL Þ TH0
ln e nðkT=qÞ þ 1  V DD ð1þλ DIBL ÞV TH0
nðkT=qÞ and
MEP is tolerable for a given allowed energy degra-
dation. This is because FO4 has the minimum αXoff  λDIBL in (4.13), the maximum voltage devi-
sensitivity to VDD across voltages from (4.6). As a ation at above-threshold voltages results to
consequence of the saturation of the MEP voltage
discussed in Sect. 4.3.4, the maximum voltage

 

%energy degradation
ΔV DD, abovethreshold ¼ V DD, opt
ILDR!1  h ð4:22aÞ
100

pffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi above calculations hence become inaccurate.


hðxÞ ¼ x þ x  ð1 þ xÞ ð4:22bÞ From (4.22a), (4.22b) the maximum voltage
deviation around MEP above threshold depends
the first of which is plotted in Fig. 4.27 for a
on the technology through a proportional depen-
28-nm technology, along with the technology-
dence on V TH0 =ð1 þ λDIBL Þ. In other words, the
independent curve in (4.22b). Equation (4.22a),
maximum voltage deviation around the MEP
(4.22b) was found to be within 10-20% of the
above threshold is a fixed and technology-
exact solution, for typical threshold voltages. For
independent fraction of the MEP voltage, as set
large VTH0 (e.g., 0.65 V), the error increases to
by h(% energy degradation/100). As opposed to
30–40% since ΔVDD in (4.22a) becomes so large
sub-threshold region, the maximum deviation
that it intrudes the near-threshold region, and the
4 Near-Threshold Digital Circuits for Nearly-Minimum Energy Processing 121

Fig. 4.27 Maximum


supply voltage deviation
from MEP that maintains
the energy degradation
within the target (on x-axis)
in above-threshold region
(model in (4.22a), (4.22b))

around a MEP lying above threshold depends on 4.3.7 Example: ARM Core Operating
VTH0, and larger thresholds further relax the pre- at Minimum Energy
cision requirement on the voltage optimization
and delivery. This is because a larger VTH0 As a further numerical example, let us apply the
enlarges the voltage range in which the MEP above energy and MEP models to the design of
effectively increases for larger ILDR (i.e., from the ARM Cortex M0 core in Myers et al. (2015).
about VTH0 up to (4.16)). Table 4.5 reports all technology-, design- and
In summary, nearly-minimum energy opera- workload-dependent parameters for this design,
tion requires the voltage to be controlled within as obtained from the process design kit and infor-
approximately one thermal voltage when the mation provided in Myers et al. (2015). The two
MEP is in the sub-threshold region, indepen- programs “checksum” and “AES” are considered
dently of the position of the MEP. This require- to consider a wide range of activities, from low
ment is substantially relaxed at near threshold, (checksum) to high (AES), and hence observe the
and increases at above threshold until saturation MEP shift due to different activity factors (the
to a value that is proportional to the threshold latter has 60% higher activity than the former
voltage and sub-linearly related to the tolerable (Myers et al. 2015)).
energy degradation. More in detail, deviation The resulting energy curve versus VDD from
increases up to 4 thermal voltages for relatively experimental results in Myers et al. (2015) and
low VTH0, whereas it increases to more than the model in (4.13) is plotted in Fig. 4.29. Very
6 thermal voltages under large VTH0. good agreement can be observed across the wide
Overall, this suggests that the performance voltage range from 0.25 to 1.2 V, with an average
can be increased with modest energy penalty by error of 1.7%, and an error well below 10% down
raising the voltage compared to the MEP, when to 0.3 V. As expected from Sect. 4.3.3 and
the latter is at near- or above-threshold voltages. Fig. 4.21, the MEP for the AES program is
These considerations are summarized in the pushed to the left of the MEP for the checksum
example in 28 nm in Fig. 4.28, which plots program, due to the higher activity determined
ΔVDD versus ILDR for various energy degrada- by the former one. More quantitatively, the ILDR
tion targets and the related analytical models. factor in (4.13) and (4.14a), (4.14b) from the
122 M. Alioto

Fig. 4.28 Summary of


maximum VDD deviation
from MEP that maintains
the energy degradation
within the target vs. ILDR
and related models

Table 4.5 Technology-, design- and workload-


From Table 4.6, the estimated MEP voltage of
dependent parameters for ARM Cortex M0 Core in the core in Myers et al. (2015) from Eq. (4.17) is
Myers et al. (2015) 378 mV for the checksum program, which is close
Parameter Value to the measured MEP voltage of 390 mV. The
Technology Process 65 nm resulting minimum energy estimate of 11.6 pJ
VDD,nom 1.2 V from (4.18) is also close to the measured energy
VTH @ VDD,nom 0.47 V of 11.7 pJ (Myers et al. 2015). At the MEP, the
λDIBL 0.095 V/V leakage energy is estimated to be smaller than the
αXoff 0.087 V/V dynamic energy by a factor of 0.21 from eq. (4.19),
FO4 @ VDD,nom 60 ps which is close to the value of 0.22 in Myers et al.
Architecture/ckt LDeff 240a (2015). Good agreement of the models is also
design Ccell 4b confirmed for the AES program, from the same
Cin, min

strength 2b table. Finally, the maximum tolerable VDD uncer-


Workload αSW (checksum 5%c tainty for a 10% energy increase compared to the
program) MEP results to 45 mV from (4.21a), (4.21b),
αSW (AES program) 8%c which agrees well with the value of approximately
a
Estimated from cycle time at nominal voltage (14.3 ns) 48 mV in Myers et al. (2015).
and FO4 at nominal voltage
b
Typical values for very slow and low-energy designs
(changes in a reasonable range do not significantly influ-
ence results) 4.4 Exploration of MEP
c
Activity factor in AES obtained via a 60% increase Dependence on Logic Depth,
compared to checksum program (Myers et al. 2015) VTH, Activity
and Ineffectiveness of Leakage
Reduction Techniques

parameters in Table 4.5 results to 18,000 for the In this section, the impact of logic depth, thresh-
checksum program, and 11,200 for the AES pro- old voltage and activity are quantitatively and
gram. The resulting voltage and energy at the widely explored by considering the reference
MEP are summarized in Table 4.6, for both the circuit in Fig. 4.30, applying the insights gained
experimental results in Myers et al. (2015) and in Sect. 4.3. The simplicity and regularity of the
the above models. circuit in 4.30 permits to gain an intuitive
4 Near-Threshold Digital Circuits for Nearly-Minimum Energy Processing 123

Fig. 4.29 Experimental 140


energy curve vs. VDD in
ARM Cortex M0 core 120
(Myers et al. 2015) and
100
energy predicted by the

ETOT (pJ)
MEP = (350 mV, 17.3 pJ)
model in (4.13) 80
60
40
20
MEP = (390 mV, 11.7 pJ)
0
0 200 400 600 800 1000 1200
VTH VDD
experimental (checksum) model (checksum)
experimental (AES) model (AES)

Table 4.6 Minimum Energy Point, Leakage/Dynamic Energy Ratio and Maximum Tolerable VDD Uncertainty from
(Myers et al. 2015) and Above Models
checksum program AES program
experimental (Myers et al. model experimental (Myers et al. model
2015) (equation) 2015) (equation)
VDD,opt 390 mV 378 mV 350 mV 360 mV
(4.17) (4.17)
Emin 11.7 pJ 11.6 pJ 17.3 pJ 16.8 pJ
(4.18) (4.18)

Elk
0.22 0.21 0.22 0.22
Edyn
MEP
(4.19) (4.19)
ΔVDD (energy 45 mV 48 mV 45 mV 48 mV
increase ¼ 10%) (4.21a), (4.21a),
(4.21b) (4.21b)

pipestage 1 … ... pipestage 200/LDeff

basic slice
basic cell
… ... D Clk Q … ... … ... D Clk Q
BIT
3 3 3
SLICE 1 3 3 3 3 3

BIT … ... D Clk Q … ... … ... D


Clk
Q

SLICE 2 3 3 3 3 3 3 3 3

… … … … … … … … … … … ... … … … … … … … … …
… ... D Clk Q … ... … ... D
Clk
Q
BIT SLICE
3 3 3 3 3 3 3 3
32

1 2 3 … … ... LDeff 1 2 3 … … ... LDeff

200 FO4
Fig. 4.30 Reference circuit for evaluation of the impact of logic depth, threshold voltage and activity

understanding of the underlying tradeoffs. Such 200FO4, as representative of a relatively com-


circuit contains 32 slices of inverter gates, each plex microprocessor. The slices are interrupted
with a fan-out of 4, and with a total logic depth through the insertion of registers, whose number
LDTOT (and hence delay by construction) of is adjusted to achieve a targeted logic depth
124 M. Alioto

LDeff. Registers are made up of transmission-gate judiciously to avoid a significant increase in the
flip-flops, which are customarily encountered in clocking overhead, which might offset some of
standard cell libraries (Alioto et al. 2015). the benefit brought by reduction in Elkg. In the
following, we will assume that the additional
clocking cost of meeting the timing constraints
4.4.1 Impact of Logic Depth with lower logic depth is modest (which is typi-
cally true in microarchitectures with LDeff
25
The heavy impact of Elkg at near- and FO4=cycle).
sub-threshold voltages can be mitigated by The reference circuit in Fig. 4.30 has an over-
adopting microarchitectures with lower logic all energy per cycle equal to
depth (i.e., deeper pipelining), from (4.10). How-
ever, deeper pipelining should be applied

LDTOT
Ecycle ¼ αSW  CTOT  V 2DD þ  EREG þ V DD  I off  FO4  LDeff ð4:23Þ
LDeff

where it was assumed that pipestages are per- number of registers in the above circuit. From
fectly balanced (i.e., the number of pipestages (4.23), an energy-optimal logic depth exists at a
is LD
LDeff ). In (4.23), EREG is the energy consumed
TOT given VDD, and its expression is readily found to
LDTOT be
by a single register, and LDeff represents the

sffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi sffiffiffiffiffiffiffiffiffiffi
EREG EREG
LDeff ¼ LDTOT ¼ LDTOT ð4:24Þ
V DD  I off  LDTOT  FO4 Elkg

where (4.10) was used to express Elkg. From voltage, and traditional architectures conceived
(4.24), the optimal logic depth is determined by for nominal voltage operation tend to be energy
the balance between the clocking and the leakage inefficient at low voltage. In other words, ultra-
energy, as a larger number of registers leads to an low power architectures need to be deeply
increase in the former and a decrease in the latter. rethought to truly enable nearly-minimum
Such tradeoff is not really observed in traditional energy operation, as discussed in Chap. 3.
low-power (above-threshold) designs, as the Quantitatively, eq. (4.24) suggests that the
leakage energy is usually kept a small fraction energy-optimal pipeline depth LDTOT/LDeff is
of the overall budget through several techniques given by the square root of the leakage-clocking
(Narendra and Chandrakasan 2006), and the energy ratio. Considering the large contribution
amount of pipelining is mainly defined by the of Elkg at near-threshold voltages, the theoretical
performance target, or the dynamic energy- energy-optimal pipedepth tends to be quite small.
performance tradeoff under dynamic voltage In (4.24), the energy cost of all registers EREG is
scaling. Instead, nearly-minimum energy designs assumed to be the same, since it refers to the
require a careful management of the clocking- simple reference circuit in Fig. 4.30. In more
leakage energy tradeoff, due to their strong inter- general architectures, the number of flip-flops
dependence (Alioto 2012). In addition, this per register, and hence the energy cost of a regis-
means that energy-centric (micro)architectures ter, increases super-linearly under higher
need to be tailored around the targeted operation pipedepths (Chinnery and Keutzer 2007). Indeed,
4 Near-Threshold Digital Circuits for Nearly-Minimum Energy Processing 125

0.25
Fig. 4.31 Energy LVT-100 mV MEP1=(250 mV, 0.055)
normalized to value at
0.2 MEP2=(300 mV, 0.066)

normalized energy
nominal voltage vs. VDD
for logic depth of 25FO4, MEP3=(350 mV, 0.086)
50FO4 and 100FO4 0.15
1.6X larger Emin
0.1 MEP3
1.2X larger Emin
0.05 MEP2
MEP1
VTH
0
0 200 400 600 800 1000 1200
VDD [mV]
25FO4 50FO4 100FO4

the overall number of flip-flops in a digital mod- some non-trivial clocking technique to avoid
ule increases according to a power law (LDTOT/ timing violations under the unavoidably large
LDeff)LGF, where LGF > 1 is the Latch Growth variations (see Sect. 4.7), such as 2-phase latch
Factor, which is mainly defined by the specific clocking, custom latches with embedded logic,
function implemented (Srinivasan et al. 2002). aggressive hold fix buffer insertion, and shallow
Hence, the energy-optimal logic depth in general clock network (3 levels, for the reasons clarified
architectures tends to be moderately larger than in Sect. 4.10).
predicted by (4.24). The resulting energy curve versus VDD for
On the low side, the energy-optimal logic different logic depths in the reference circuit in
depth is practically limited by the rapidly Fig. 4.30 is reported in Fig. 4.31 in 28-nm
increasing clocking energy cost at small logic CMOS. As expected, increasing the logic depth
depths (i.e., deep pipelines). Typically, low to the practical lower bound of 25FO4 to larger
logic depths in the order of 20–25 FO4 or smaller logic depths of 50 and 100FO4 leads to a signifi-
have a disproportionately large energy cost in the cant 20% and a considerable 60% energy
clock network at ultra-low voltages, and require increase at the MEP. This is respectively due to
non-straightforward clock network design a 2X and 4X leakage energy increase, due to the
approaches. The necessity of “fast” circuit and larger logic depth from (4.10). From (4.14a),
architectural designs11 with deep pipelining at (4.14b), such increase in Elkg leads to a 2X
ultra-low voltages was first shown in Jeon et al. (4X) increase in ILDR, which from (4.15)
(2013), where an aggressive 17FO4 logic depth translates into an increase in the MEP voltage
was adopted in a 1024-point complex FFT pro- of approximately 35 mV and 65 mV (Fig. 4.31
cessor. The adoption of such deep pipeline led to discretizes voltages in 50-mV step, and hence
17.7 nJ/transform at V DD, opt ¼ 270 mV in 65-nm results to 50 and 100 mV).
CMOS, which was a 3.6X lower energy than Finally, it should be observed that true
previous state of the art. However, this required minimum-energy operation actually requires a
complex optimization that involves logic depth,
11 voltage and transistor sizing. Unfortunately, no
Here, “fast” refers to the clock cycle normalized to FO4
(i.e., LDeff ), rather than the absolute clock cycle. This thorough methodology and no CAD support is
choice is motivated by the need for characterizing the currently available for this purpose, hence such
design regardless of the specific voltage and hence FO4. joint optimization is still an open research ques-
Indeed, low values of T CK =FO4 identify designs that tion. A qualitative treatment of this problem will
would be fast at nominal and any other voltage, regardless
of FO4. On the other hand, ultra-low voltage operation
be presented in Sect. 4.11, to gain an insight into
makes the absolute T CK large simply because of the this fundamental design problem.
increase in FO4, not because of the design itself.
126 M. Alioto

Fig. 4.32 Energy 0.25


MEP1=(350 mV, 0.072)
normalized to value at
0.2 MEP2=(350 mV, 0.073)

normalized energy
nominal voltage vs. VDD
for single- and multi-VTH MEP3=(450 mV, 0.12)
design (50% HVT, 50% 0.15
LVT cells, logic depth of MEP3 1.7X larger Emin
25FO410% activity) 0.1 than single-VTH
MEP2
0.05 MEP1
LVT LVT+100 mV
0
0 200 400 600 800 1000 1200
VDD [mV]
LVT LVT+100 mV multi-VTH

4.4.2 Impact of Threshold Voltage region for most of the designs, and it is pushed
and Activity into in the near-threshold region only for very
low threshold voltages (see Fig. 4.34a).
The impact of the threshold voltage on the refer- Figure 4.34d–f show the contribution of the leak-
ence circuit in Fig. 4.30 is shown in Fig. 4.32. As age energy as a fraction of the overall energy for
expected from Sect. 4.3.4, the MEP voltage and the same threshold voltages. From this figure,
energy are the same for different threshold Elkg accounts for 40% of the total energy or
voltages under a single-VTH design, being in the more in most of the designs, and tends to be
sub-threshold region. On the other hand, mixing larger under lower threshold voltages. According
the two threshold voltages leads to a substantially to Fig. 4.23, this is because the MEP is pushed to
larger MEP energy (by 1.7X in this specific near-threshold voltages at low VTH, and the
case). This suggests that multi-VTH design is not resulting Elkg can be as high as 70% of the total
really advantageous at near- and sub-threshold energy in some designs (see Fig. 4.34d).
voltages, and it should hence be avoided. Thor-
ough analysis and justification of this observation
will be provided in the next section. 4.5 Ineffectiveness of Traditional
The effect of activity is depicted in Fig. 4.33, Leakage Reduction Techniques
which once again confirms that the MEP moves to
the left when the dynamic energy is increased, as This section shows that traditional leakage reduc-
was observed in Fig. 4.21. More quantitatively, tion techniques (e.g., stacking, multi-VTH) are far
the increase in the activity factor from 3% to 10% less effective at near-threshold voltages, thus
(20%) leads to a 3.3X (6.6X) decrease in ILDR, posing a challenge on leakage management at
which from (4.15) translates into a decrease in the such voltages.
MEP voltage of 51 mV and 81 mV (the latter is Transistor stacking has been extensively
not precisely visualized in Fig. 4.33, as voltages exploited to reduce leakage in above-threshold
are discretized in 50-mV step). circuits (Narendra and 2006), as the off-stacking
To provide a broader view on the impact of factor12 is typically much higher than the
the above parameters onto the MEP position, on-stacking factor. In other words, the series
Fig. 4.34a–c plot the statistical distribution of
the MEP voltage for several different activities 12
and logic depths, respectively for a very low, The off (on) stacking factor is defined as the factor by
which the transistor current of an off (on) single transistor
relatively low and relatively high VTH. From is reduced due to the series connection of multiple
this figure, the MEP lies in the sub-threshold transistors having the same size.
4 Near-Threshold Digital Circuits for Nearly-Minimum Energy Processing 127

0.25
LVT-100 mV
0.2
normalized energy

MEP1=(300 mV, 0.037)


MEP2=(250 mV, 0.055) 2X smaller Emin
0.15
MEP3=(200 mV, 0.074) 1.3X smaller Emin
0.1

0.05
VTH
0
0 200 400 600 800 1000 1200
VDD [mV]
aSW=3% aSW=10% aSW=20%

Fig. 4.33 Energy normalized to value at nominal voltage vs. VDD for different activity values (logic depth of 25FO4)

20 LVT-100 mV 20 20 LVT+100 mV
18
LVT 18
18
16 16 16
14 14
occurrences

occurrences
14
occurrences

12 12 12
10 10 10
8 8 8
6 6 6
4 4 4
2 VTH 2 2
VTH VTH
0 0 0
150 200 250 300 350 400 450 500 150 200 250 300 350 400 450 500 150 200 250 300 350 400 450 500
VDD,opt VDD,opt VDD,opt

(a) (b) (c)


40 LVT-100 mV 35 LVT 30 LVT+100 mV
35 30 25
30 25
occurrences

occurrences

occurrences

20
25
20
20 15
15
15
10
10 10
5 5
5
0 0 0
0.1 0.25 0.4 0.55 0.7 0.1 0.25 0.4 0.55 0.7 0.1 0.25 0.4 0.55 0.7
Elkg /ETOT Elkg /ETOT Elkg /ETOT

(d) (e) (f)


Fig. 4.34 Statistics on the MEP voltage VDD,opt for the (a)–(c). Ratio between leakage energy and total energy
reference circuit in Fig. 4.30 across different values for same threshold voltages (d)–(f)
of activity and logic depth for three threshold voltages

connection of multiple transistors reduces the eαXoff V DD =ðnkT=qÞ (Narendra and Chandrakasan
leakage current much more heavily than the 2006), where
on-current (i.e., performance). As shown in
Fig. 4.35, this is true in the above-threshold 1 þ λDIBL
αXoff ¼ λDIBL  1 ð4:25Þ
region, where the off-stacking factor for 2 to 1 þ 2λDIBL
4 transistors is larger than the on-stacking factor
by an order of magnitude. At lower voltages, the and approximately the same dependence is
on-stacking factor tends to moderately increase, observed for a larger number of stacked transistors,
for the reasons discussed in Sect. 4.2.3. At the as shown in Fig. 4.35. In other words, the
same time, the off-stacking factor decreases expo- off-stacking factor Xstack,off is proportional to
V DD
nentially when reducing VDD (Narendra and eαXoff nkT=q (Narendra and Chandrakasan 2006),
Chandrakasan 2006). Indeed, the off-stacking fac- hence it can be expressed as
tor for two stacked transistors can be expressed as
128 M. Alioto

Ion and Ioff stacking factor


100
LVT NMOS

Nstacked
Xon /Xoff Xon /Xoff only 3X
only 1.2X
10
2
1

1
0 200 400 600 800 1000 1200
VDD [mV]
Ion (2-stacked) Ion (3-stacked) Ion (4-stacked)
Ioff (2-stacked) Ioff (3-stacked) Ioff (4-stacked)

Fig. 4.35 On- and off-stacking factor Xon and Xoff vs. VDD for 2, 3 and 4 stacked transistors in 28 nm

Fig. 4.36 Multi-VTH critical path @ 0.4V


approach and critical path VDD=0.4 V HVT
shift from LVT to HVT
paths when scaling down
VDD
VDD=0.6 V LVT
critical path @ 0.6 V


V DD V DD, nom
not really degrade performance thanks to its
Xstack, off ¼ Xstack, off
V DD, nom  eαXoff  nkT=q ð4:26Þ
weak dependence on VTH, while it certainly
reduces the leakage current thanks to its strong
which tends to be very accurate across all
dependence on VTH from (4.8). In other words,
voltages (within 2% in 28 nm, according to
the multi-VTH approach offers a favorable
Fig. 4.35).
tradeoff between performance and leakage in
From Fig. 4.35, the off-stacking factor at near-
traditional low-power designs operating above
threshold voltages is no longer much larger than the
threshold. On the contrary, performance
on-stacking factor, hence no significant leakage
becomes very sensitive to VTH at near-threshold
reduction is actually allowed for a given perfor-
voltages as discussed in Sect. 4.2.4, and the HVT
mance penalty. In other words, transistor stacking
cells are slowed down much more substantially
is rather ineffective in counteracting leakage at
than LVT when VDD is dynamically down-scaled
near-threshold voltages, as opposed to common
(see Figs. 4.5 and 4.6). As a consequence,
low-power wisdom (i.e., above threshold).
non-critical HVT paths at a given voltage (e.g.,
As another traditional leakage reduction tech-
0.6 V in Fig. 4.36) actually become critical13
nique, let us consider the adoption of multiple
when down-scaling VDD (e.g., 0.4 V in
threshold voltages, as depicted in Fig. 4.36 for
the simple case of a design with two thresholds
(i.e., low and high VTH). In multi-VTH designs, 13
This is unavoidable in real designs, as overall energy-
cells in critical paths are LVT to meet the perfor- performance optimization aims to equalize the delay of
mance requirement, whereas cells in non-critical different paths (De Micheli 1994), so that non-critical
paths can be down-sized to reduce their energy, while
paths are replaced by the HVT counterparts. At maintaining the same performance target (Narayanan
above-threshold voltages, such replacement does et al. 2010).
4 Near-Threshold Digital Circuits for Nearly-Minimum Energy Processing 129

Fig. 4.36). In other words, the clock cycle of a design, in spite of its significantly larger energy
multi-VTH design at lower voltages is signifi- consumption.
cantly larger than a single-LVT design, thus Similar considerations hold for other tradi-
leading to a leakage energy increased compared tional leakage reduction techniques, such as
to the latter one, from (4.10). At the same time, power gating (Flynn et al. 2007). At above-
the leakage current of a multi-VTH design is sig- threshold voltages, power gating is well-known
nificantly larger than a single-HVT design, as the to provide substantial leakage reductions due to
LVT cells in the design have a considerably two different mechanisms. First, the sleep tran-
larger leakage (typically more than an order of sistor size can be much lower than the overall
magnitude increase when moving from a thresh- effective transistor width of the power gated cir-
old value to the immediately lower one (Interna- cuit, as only a fraction of the cells are active at a
tional Technology Roadmap for Semiconductors given time. Since the relative strength of the
2013)). sleep and the power gated transistors is
From the above considerations, multi-VTH maintained at low voltages, this reduction mech-
designs suffers from substantially larger leakage anism is essentially maintained at near-threshold
energy compared to single-VTH designs for two voltages. Second, the sleep transistor (see
concurrent reasons, under dynamic voltage scal- Fig. 4.38a) is able to provide its large
ing. Hence, multi-VTH approaches actually dete- on-current during active mode ( sleep ¼ 0 ),
riorate the energy efficiency of VLSI circuits, whereas it delivers only its off-current during
and should be always avoided in favor of sin- sleep mode (sleep ¼ 1). Such reduction is clearly
gle-VTH designs. The choice of the single VTH has more pronounced for larger Ion/Ioff ratio, which is
been discussed in Sect. 4.2.4. As an example, this traditionally obtained by using HVT devices for
is shown in Fig. 4.32, where the multi-VTH design sleep transistors at above-threshold voltages. At
of the reference circuit in Fig. 4.30 is found to be near-threshold voltages, the transistor Ion/Ioff
1.7X less energy efficient than the single-VTH ratio is severely degraded (by 1–2 orders of
designs. In terms of energy-performance tradeoff magnitude) as shown in Fig. 4.38b. Hence, the
at the MEP, Fig. 4.37 confirms that the multi-VTH leakage reduction enabled by power gating at
design is essentially as slow as the single-HVT near-threshold voltages is worsened by at least

1
MEP1=(1.4E-2, 0.073)
MEP2=(1.2E-3, 0.072) 0.8
normalized energy

MEP3=(1.2E-2, 0.12)
0.6

0.4

MEP3 0.2

MEP2 MEP1 0
1.00E-05 1.00E-04 1.00E-03 1.00E-02 1.00E-01 1.00E+00
MHz normalized TCK GHz
LVT LVT+100 mV multi-VTH

Fig. 4.37 Energy vs. clock frequency (both normalized to value at nominal voltage) for single- and multi-VTH design,
and logic depth of 25FO4
130 M. Alioto

VDD
1E+8 1E+7

degradaon in Ioff reducon


Ion/Ioff normalized to
sleep sleep 1E+7 1E+6
transistor

w.r.t. nominal VDD


1E+6 1E+5

nominal VDD
1E+5
1E+4
1E+4
1E+3
1E+3
1E+2 1E+2
1E+1 1E+1
1E+0 1E+0
0 200 400 600 800 1000 1200
VDD [mV]
degradation in Ioff reduction
degradation in leakage reduction w.r.t. nominal VDD
(a) (b)
Fig. 4.38 (a) Power gating scheme, (b) Ion/Ioff ratio of sleep transistor in 28 nm

one order of magnitude, compared to above the order of nanoseconds. Hence, throughputs in
threshold. Such degradation in the effectiveness the order of hundreds of MOPS (Millions of
of power gating at near-threshold voltages can be Operations per Second) are easily achievable by
partially recovered by boosting the gate voltage near-threshold microprocessor cores. Such level
of the sleep transistor (Myers et al. 2015). of performance achievable near the threshold is
Indeed, boosting its gate voltage only during actually acceptable for (or can exceed) the typi-
active mode significantly increases Ion, while cal requirements of IoT systems, at least in the
maintaining the same Ioff. At near-threshold most frequent operation modes and in most of the
voltages, the sleep transistor Ion/Ioff ratio (and practical applications. Higher performance might
hence the effectiveness of power gating) can be be needed occasionally in some applications, or
further improved by using thick-oxide (i.e., I/O) customarily for compute-intensive ones, such as
NMOS transistors whose gate is powered at the computer vision or real-time pattern recognition.
large I/O voltage (e.g., 1.8 V instead of 1 V). In Sustained throughputs that are higher than
this case, such Ion/Ioff improvement is achieved at hundreds of MOPS can always be obtained
the expense of a larger energy and slower tran- through appropriate architectures at near-
sient to turn on the sleep transistor, and hence to threshold voltages (e.g., multi-core) and
switch from sleep to active mode. specialized hardware, as discussed in Chap. 3.
Occasional performance boosts can be achieved
through wide dynamic voltage scaling, i.e., by
raising VDD from the MEP to the nominal voltage
4.6 Challenges: Performance
(Chandrakasan et al. 2010). Such temporary volt-
age up-scaling permits to increase the perfor-
As discussed in Sect. 4.2, operation at near-
mance by one (two) order(s) of magnitude,
threshold voltages entails a ~10X penalty in
when the MEP is in the near-threshold (sub-
terms of FO4 and hence performance, compared
threshold) region. This performance increase is
to the same architecture operating at nominal
achieved at the expense of an increase in the
voltage. For sub-100 nm technologies, FO4 at
energy per operation, as summarized in
near-threshold voltages is typically in the order
Fig. 4.39 for several integrated prototypes
of few hundreds of picoseconds. For reasonable
(Abouzeid et al. 2012; Gammie et al. 2011; Hsu
architectures with a logic depth of up to several
et al. 2012; Jain et al. 2012; Kaul et al. 2012;
tens of FO4, this translates into a cycle time in
4 Near-Threshold Digital Circuits for Nearly-Minimum Energy Processing 131

Fig. 4.39 Energy improvement at MEP vs. performance degradation at MEP, as compared to operation at nominal
voltage

Fig. 4.40 (a) RC wire

delay
delay, (b) relative scaling y
of gate and wire delay wire dela
vs. VDD ...

VDD
(a) (b)

Myers et al. 2015; Sheikh et al. 2012; Wilson choice of the architecture, and the way the latter
et al. 2014; Jacquet et al. 2013). From this figure, is mapped into the physical level. Indeed, VLSI
this energy increase is more pronounced when architectures for nearly-minimum energy need to
the MEP is in the sub-threshold region, due to the be different from traditional low-power
larger voltage difference between the MEP and architectures (i.e., for above-threshold opera-
the nominal voltage. tion). More specifically, signals can be
As another property of circuits operating near propagated through a wider silicon area com-
the MEP, the gate delay dominates over the wire pared to nominal voltage operation. In detail,
delay, as shown in Fig. 4.40. Indeed, they might be for unrepeated wires a ~3X longer distance14
comparable at nominal voltage in realistic VLSI can be covered by the same wire at near-
architectures, due to the significant resistive, threshold voltages, as compared to the same cir-
capacitive, and sometimes inductive parasitics of cuit operating at nominal voltage, when
metal wires. However, operation at the MEP volt- maintaining the wire delay a fixed fraction of
age determines a substantial increase in the gate the clock cycle. For similar reasons, repeated
delay, while keeping the wire delay constant. wires require 3X fewer repeaters per wire unit
Hence, the wire delay is no longer a challenge in length since the optimal distance between
circuits with nearly-minimum energy, which cer-
tainly simplifies the design, the circuit modeling
14
and the timing closure. This is due to the well-known quadratic dependence of
The reduced wire-to-gate delay ratio around the RC wire delay on its length (Weste and Harris 2011),
and assuming a 10X FO4 degradation at the MEP com-
the MEP has also important consequences on the pared to the nominal voltage.
132 M. Alioto

repeaters is proportional to the square root of around the MEP. In general, process, voltage
FO4 (Weste and Harris 2011), thus improving and temperature variations as well as aging
the route-ability. At the same time, the size of impose an additional timing margin that stretches
each repeater at MEP needs to be increased by the clock cycle as shown in Fig. 4.41. This con-
3X compared to nominal voltage, as its servative approach preserves correct functional-
performance-optimal size is proportional to ity and performance specifications even in the
FO4 (Weste and Harris 2011). Since the number worst-case die and environmental conditions.
of repeaters is reduced by the same factor by The above cycle time margin resulting from
which their area is increased, the energy and variations translates into an increase in the
area cost of intra-chip global communication in energy per operation, as faster chips are forced
designs around the MEP remains approximately to operate as slowly as the worst-case die and at
the same. In summary, VLSI architectures for the same voltage (which is higher than needed).
nearly-minimum energy can afford more global Figure 4.42 shows the voltage increase required
communications and larger modules (e.g., shared by the circuit in Fig. 4.30 to maintain a given
caches), compared to traditional low-power clock cycle under a given clock cycle margin, as
architectures. Such profound difference in the well as the resulting energy increase, assuming a
communication-computation energy/perfor- logic depth of 25FO4, 10% activity, and LVT
mance tradeoff requires the adoption of innova- transistors. From Fig. 4.42, the voltage increase
tive architectures, as discussed in Chap. 3. imposed by variations is fairly linear with the
cycle time margin, and tends to be larger at
higher nominal operating voltages. This is
because FO4 (i.e., the cycle time from
4.7 Challenges: Variations
Sect. 4.3.2) is less sensitive to voltage increases
at higher operating voltages, and hence requires
In this section, the impact of variations is
larger increase to achieve a given percentage
analyzed in the context of circuits operating

Fig. 4.41 Nominal cycle


time and additional margin
accounting for variations

nominal clock cycle margin

2 0.2
req'd voltage increase (V)

Fig. 4.42 Required VDD to


energy increase factor

sustain a given
performance specification 1.5 0.15
vs. cycle time margin (i.e.,
factor by which the cycle
1 0.1
time needs to be increased
due to variations), and
resulting energy increase 0.5 0.05
due to variations
0 0
1.1 1.2 1.3 1.4 1.5 1.6 1.7 1.8 1.9 2
cycle time margin
VDD=400 mV VDD=500 mV VDD=600 mV
VDD=400 mV VDD=500 mV VDD=600 mV
4 Near-Threshold Digital Circuits for Nearly-Minimum Energy Processing 133

performance improvement. From Fig. 4.42, a timing error detection or prediction (Bowman
typical 1.5X–2X cycle time increase requires a et al. 2009; Bowman et al. 2011; Das et al.
voltage increase by 100–200 mV to sustain the 2009; Khayatzadeh et al. 2016; Zhang et al.
same performance as the nominal corner, which 2016)), rather than corner-based adaptive voltage
leads to an energy increase by a factor of 1.5X– scaling and body biasing techniques (Gregg and
2X. In other words, variations at near threshold Chen 2007; Meijer and Pineda de Gyvez 2012;
entail a very large energy cost, which can negate Martin et al. 2002; Olivieri et al. 2005; Tschanz
the advantages of operating at the MEP. Accord- et al. 2002). Accordingly, our analysis in the
ingly, variations need to be accounted for in first following will be focused on random variations.
place in the design of near-threshold circuits At low voltages, process variations determine
rather than an afterthought, as discussed more a much larger path delay variations than above-
in detail in the following. Similar considerations threshold voltages due to two phenomena:
hold for the sensitivity to soft errors, which is
somewhat increased at near threshold voltages, 1. the variability of the gate delay defined as the
compared to nominal voltage. On the other hand, ratio between the standard deviation and mean
operation at near-threshold voltages suppresses value increases significantly
most of aging and reliability issues, such as Bias 2. the probability distribution function (PDF) of
Temperature Instability, Hot Carrier Injection, such delay is no longer Gaussian for short
Time Dependent Dielectric Breakdown. Indeed, paths, and has a longer tail on the right side.
such phenomena are all exponentially dependent
on the supply voltage, and its reduction to near- Regarding the first phenomenon, the
threshold voltage substantially mitigates them. variability of the critical path delay is mainly
due to the intrinsically larger variability of the
transistor Ion current (see Eq. (4.6)). This is
4.7.1 Process Variations mostly due to the larger impact of the threshold
voltage, and hence of its variations, at lower
Random (within-die) process variations are well voltages (see Sect. 4.2.1). More quantitatively,
known to be responsible for a major fraction of the delay variability is approximately equal to
the cycle time margin, having a much heavier the variability in Ion from (4.6). If the nominal
effect than fully correlated (die-to-die) variations threshold voltage VTH is subject to a variation
(Orshansky et al. 2008). Indeed, threshold volt- ΔVTH that is Gaussian distributed with zero
age variations at low voltages are dominated by mean and standard deviation σ V TH , from (4.3) to
random dopant fluctuations (Alioto et al. 2010; (4.4) the variability of Ion for above- and
Orshansky et al. 2008), and their effect requires sub-threshold voltages is readily found to be
much more sophisticated feedback schemes that
are immune to transistor mismatch (e.g., with

σ τPD

σ Ion

σ V TH
 ¼ ð4:27aÞ
μτPD
abovethreshold μIon
abovethreshold V DD  V TH



qffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi

σ τPD
σ Ion

σV 2

 ¼
TH
1 ð4:27bÞ
μτPD
μIon
subthreshold e nvt
subthreshold

For example, for the typical values σ V TH ¼ 35 124% in the sub-threshold region (e.g., 0.3 V). In
mV and V TH ¼ 0:4 V in 28 nm CMOS, the gate other words, the delay variability in
delay variability turns out to be 7% at 0.9 V, and sub-threshold is an order of magnitude larger
134 M. Alioto

Fig. 4.43 Variability of 8

s/m normalized to value


FO4 normalized to value at 7 rise fall
1.2 V vs. VDD (28 nm

at nominal VDD
CMOS) 6
5
4
3
2
1
0
0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 1.1 1.2
VDD (V)

than above threshold. At near-threshold voltages, lognormal distribution has a much longer tail
the variability is somewhat intermediate. compared to the Gaussian distribution, at same
Figure 4.43 plots the variability of the FO4 standard deviation. This leads to a considerable
delay normalized to the value at nominal voltage increase in the number of standard deviations
in 28 nm CMOS. From this figure, the gate delay needed as design margin to meet a given yield
variability increases when VDD is reduced, and target, as shown in Table 4.7. For example, from
becomes 2X–6X larger than at nominal VDD at this table the worst-case gate delay margin across
near threshold, and about an order of magnitude 99.9% of the cases is three standard deviations
larger in the deep sub-threshold region. A typical for Gaussian (i.e., above threshold), and twenty
delay variability of around 6–8% at nominal standard deviations for lognormal (i.e.,
voltage in 28 nm translates into a sizable delay sub-threshold). In the near-threshold region, the
variability of various tens of percentage points at distribution is somewhat intermediate between
near threshold. To achieve a parametric yield of above- and sub-threshold, and hence it is neither
approximately 99%, three standard deviations perfectly Gaussian nor lognormal. This is shown
are needed, hence the margin for a single gate in Fig. 4.45a–c, which show the quantile-quantile
can easily be 100%, which entails an unfeasibly (Q–Q) plot (Walpole et al. 2006) of the statistical
large margin in Fig. 4.41 (i.e., an unacceptably FO4 delay sample in 28 nm CMOS versus the
high energy cost from Fig. 4.42, which easily theoretical quantiles of a Gaussian distribution
offsets the energy benefit of operating at near- with same mean and standard deviation. The
threshold voltages). When the MEP is in deviation from a straight line (i.e., perfect Gauss-
sub-threshold region, such margin becomes ian behavior) of the Q–Q plot becomes notice-
even higher. able at near threshold (see Fig. 4.45b), and is
Regarding the second phenomenon that was substantial at sub-threshold voltages (see
observed above, the statistical delay distribution Fig. 4.45c). Figure 4.45d confirms the FO4 log-
of a single gate is no longer Gaussian when normal distribution in sub-threshold.
operating at near- and sub-threshold voltages The above considerations of non-Gaussian
(Alioto 2012; Alioto et al. (in press); Gammie delay distribution at low voltages hold for single
et al. 2011). In the sub-threshold region, Ion and logic gates, and can be extended to short paths,
hence the gate delay are lognormally distributed i.e., paths that can be problematic in terms of hold
due to the exponential dependence of Ion on the time violations rather than setup time. Accord-
threshold voltage in (4.4), being the latter Gauss- ingly, short paths and hold fix at sub- and near-
ian distributed. As shown in Fig. 4.44, the threshold voltages requires a much wider design
4 Near-Threshold Digital Circuits for Nearly-Minimum Energy Processing 135

0.8

probability distribution function


Fig. 4.44 Probability
density function of
Gaussian and lognormal 0.7 s=1
distribution at same
0.6
standard deviation (σ ¼ 1)
0.5
0.4 s=1
0.3
0.2 lognormal distribution
has much longer tail
0.1
0
0 s 2s 3s 4s 5s 6s

m = 1.27
x (standard deviaons)
m=0

Gaussian Lognormal

Table 4.7 Number of VTH Standard Deviations beyond the Mean to Achieve Given Yield Target in Gaussian and
Lognormal
Logic path (lognormal gate delay)
Single gate Logic depth #
Yield target Gaussian Lognormal 4 gates 8 gates 16 gates 32 gates
84% σ V TH eσ V TH 1.07σ V TH 1.05σ V TH 1.03σ V TH 1.02σ V TH
97.7% 2σ V TH e2σ V TH 7.4σ V TH 2.07σ V TH 2.05σ V TH 2.03σ V TH 2.02σ V TH
99.87% 3σ V TH e3σ V TH 20.1σ V TH 3.07σ V TH 3.05σ V TH 3.03σ V TH 3.02σ V TH
99.997% 4σ V TH e4σ V TH 54.6σ V TH 4.07σ V TH 4.05σ V TH 4.03σ V TH 4.02σ V TH
99.99997% 5σ V TH e5σ V TH 148.4σ V TH 5.07σ V TH 5.05σ V TH 5.03σ V TH 5.02σ V TH
99.9999999% 6σ V TH e6σ V TH 403.4σ V TH 6.07σ V TH 6.05σ V TH 6.03σ V TH 6.02σ V TH

margin, compared to above-threshold. In other shown in Table 4.7 for 4, 8, 16 and 32 equal
words, the timing margin against hold time cascaded gates, which are individually assumed
violations at low voltages tends to be very large to have a lognormal delay distribution, as rele-
compared to nominal voltage, and hence requires vant to the sub-threshold region. Indeed, this
a much larger number of hold fix buffers. table shows that margin in terms of standard
On the other hand, long logic paths have a deviations is essentially the same as an ideal
Gaussian delay distribution even in Gaussian distribution even for a relatively short
sub-threshold voltages. This is because of the path of 4 cascaded gates, and is closer for a larger
Central Limit theorem, which guarantees that number of gates. This means that the clock cycle
the sum of non-Gaussian random variables rap- distribution is Gaussian at any voltage, and hence
idly tends to a Gaussian distribution, when the margin in terms of standard deviations is the
increasing the number of variables being same as nominal voltage. In other words, the
summed (Walpole et al. 2006) (e.g., the number timing margin against setup time violations at
of logic gates whose delays are added to derive low voltages scales like FO4, as opposed to
the critical path delay). This is quantitatively hold violations.
136 M. Alioto

Fig. 4.45 Q–Q plots of Gaussian, VDD=1.2 V Gaussian, VDD=0.6 V


4

quantiles of actual sample (s)

quantiles of actual sample (s)


FO4 distribution (y axis) 4
in 28 nm CMOS at 3 3
(a) 1.2 V, (b) 0.6 V, 2 2
(c) 0.3 V with normal
distribution on the x axis. 1 1
(d) at 0.3 V with lognormal 0 0
distribution on the x axis -1 -1
(100,000 Monte Carlo
runs)
-2 -2
-3 -3
-4 -4
-4 -3 -2 -1 0 1 2 3 4 -4 -3 -2 -1 0 1 2 3 4
Gaussian quantiles (s) Gaussian quantiles (s)
a) b)
Gaussian, VDD=0.3 V Lognormal, VDD=0.3 V
10
quantiles of actual sample (s)

quantiles of actual sample (s)


4
3 9
2 8
1
7
0
6
-1
-2 5
-3 4
-4 3
-4 -3 -2 -1 0 1 2 3 4 3 4 5 6 7 8 9 10
Gaussian quantiles (s) lognormal quantiles (s)

c) d)

4.7.2 Voltage and Temperature providing the supply. Accordingly, supplies for
Variations minimum-energy operation need to be designed
with quite stringent specifications on voltage sta-
Voltage variations have a heavy impact on the bility across temperatures, as well as line and
clock cycle margin, due to the strong sensitivity load regulation.
of Ion and hence the gate delay on VDD (see Temperature variations in circuits designed
Sect. 4.2.1). Figure 4.46 plots the cycle time for minimum energy have an effect that is quite
margin associated with a typical 5% and 10% different from above-threshold circuits. Indeed,
voltage drop of the circuit in Fig. 4.30 in 28 nm larger temperatures lead to a substantial increase
CMOS. As all gate delays scale approximately in the energy per operation at low voltages, due
like FO4 when VDD changes, this example is to the large contribution of the leakage energy
representative of any logic path. From this figure, (see Sect. 4.3.2). Such effect is more pronounced
a large cycle time margin of 20–50% is imposed in architectures with larger leakage energy, e.g.,
by supply variations, if not kept under strict with larger logic depth. Figure 4.47 plots the
control. As discussed in Sect. 4.10, supply energy versus VDD of the circuit in Fig. 4.30 at
variations in systems designed for nearly- different temperatures (27 C and 70 C), and for
minimum energy operation are dominated by logic depths widely ranging from 25FO4 to
fluctuations in the output voltage of the regulator 100FO4. From this figure, the minimum energy
4 Near-Threshold Digital Circuits for Nearly-Minimum Energy Processing 137

Fig. 4.46 Plot of the cycle 1.6


time margin vs. VDD 1.5

cycle time margin


(assuming cycle time
scaling to be proportional 1.4
to FO4)
1.3

1.2

1.1

1
0 200 400 600 800 1000 1200
V DD [mV]
5% VDD droop 10% VDD droop

0.25
Fig. 4.47 Impact of LVT-100 mV
temperature on minimum
0.2
normalized energy

energy vs. VDD for different


logic depths (10% activity,
very low VTH) 0.15
1.69X larger Emin
0.1
1.36X larger Emin
0.05
VTH 1.36X larger Emin
0
0 200 400 600 800 1000 1200
VDD [mV]
25FO4 (27oC) 50FO4 (27oC) 100FO4 (27oC)
25FO4 (70Oc) 50FO4 (70oC) 100FO4 (70oC)

is heavily influenced by the operating tempera- (4.7). Such effect is less pronounced at higher
ture, as it is increased by a factor of 1.3X for threshold voltages, as the corresponding higher
well-designed architectures with reasonable supply voltage emphasizes the carrier velocity
logic depth, and 1.7X for less energy-efficient saturation and the mobility degradation due to
and leakier architectures. high-field operation, which in turn weaken the
As further difference compared to traditional dependence of Ion on VTH0. At above-threshold
above-threshold low-power designs, the perfor- voltages around 700–800 mV, the effect of tem-
mance of circuits operating around the MEP perature is insignificant. At larger voltages, the
actually benefits from increased temperature. temperature has the traditional inverse effect on
Indeed, the Ion transistor current is much more the performance, and is much weaker (e.g., 2%
sensitive to the threshold voltage rather than the change due to a temperature change from 27 C
carrier mobility, hence it increases at larger to 70 C) than near threshold. Hence, unless the
temperatures. Figure 4.48 shows the FO4 delay operating temperature range set by the applica-
versus VDD for various threshold voltages. From tion is narrow (e.g., indoor applications), active
this figure, a temperature raise from 27 C to compensation of temperature variations is essen-
70 C leads to a 1.4X–2X FO4 reduction at tial in any integrated system aiming at minimum-
VDD equal to the threshold voltage VTH0 in energy operation.
138 M. Alioto

100
1.43X faster
normalized FO4

2X faster
10 2X faster
2% slower
same speed 2% slower
2% slower
1
0 200 400 600 800 1000 1200
V DD [mV]
LVT-100 mV (27°C) LVT (27°C) LVT+100 mV (27°C)
LVT-100 mV (70°C) LVT (70°C) LVT+100 mV

Fig. 4.48 Impact of temperature on performance vs. VDD for different threshold voltages (FO4 is normalized to the
value at nominal voltage for lowest VTH)

Table 4.8 Dependence of the delay variability in logic paths, cells and transistors
Design parameter Delay variability dependence
Logic path logic depth LD / p1ffiffiffiffi

3 LD LD
1 2 ...
Standard cell Nstacked / pffiffiffiffiffiffiffiffiffiffi
N
1
stacked
...
1 2 Nstacked
Transistor channel width W / pffiffiffiffiffiffiffiffiffiffiffi
1 ffi
strength W/Wmin strength

L=Lmin

to better averaging across a larger number of


4.8 The Leakage-Variability cells). Hence, the reduction of the delay
Tradeoff variability would require the adoption of
microarchitectures with larger logic depths. On
Operation around the MEP introduces a tradeoff the other hand, larger logic depths increase the
that is not encountered in traditional low-power clock cycle and hence the leakage energy per
above-threshold designs, namely the variability- cycle from (4.10). In other words, the mitigation
leakage tradeoff. This is an unavoidable tradeoff of delay variations comes at the cost of a higher
that constrains the design at all levels of abstrac- leakage energy, and vice versa. Such tradeoff is
tion and is tightly linked to the averaging effect very specific to operation at near- and
of additive variations, as discussed below. sub-threshold voltages, due to the much more
At the gate level, a logic path with logic depth important contribution of the leakage energy, as
LD has a delay that is the sum of LD delays, as opposed to above-threshold designs.
depicted in Table 4.8. The resulting path delay
pffiffiffiffiffiffiffi At the cell circuit level, a similar tradeoff is
variability is inversely proportional to LD encountered when the number of stacked
thanks to the averaging effect of the random transistors is considered in a standard cell (Alioto
variations across cascaded gates (Alioto et al. et al. 2010; Merrett et al. 2010) (i.e., the cell
2010; Merrett et al. 2010) (i.e., more cascaded fan-in). Indeed, the variability of the Ion current
gates reduce the overall delay variability thanks delivered by the cell to the load, and hence the
4 Near-Threshold Digital Circuits for Nearly-Minimum Energy Processing 139

cell delay variability, is inversely proportional to tradeoffs observed at near-threshold voltages in


pffiffiffiffiffiffiffiffiffiffiffiffiffiffi
N stacked as in Table 4.8. Thus, the mitigation of a more efficient manner, at the cost of additional
variations through more stacked transistors design and characterization effort.
comes at the cost of larger delay (see Since performance is not the main objective,
considerations on stacking in Sect. 4.2.3) and near-threshold cell libraries can be designed with
hence larger leakage energy from (4.10). short standard cells (e.g., 7 metal tracks), as
At the transistor level, from Table 4.8 wider transistors typically do not need to be wide, as
transistors exhibit smaller Ion variability thanks depicted in Fig. 4.49. In near-threshold designs,
to the Pelgrom’s law (Pelgrom et al. 1989). taller cells (e.g., 10–12 metal 1 tracks) can
Hence cells with larger strength have smaller achieve higher performance but lead to a signifi-
delay variability, as the latter is inversely propor- cant area efficiency degradation, and longer
pffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi
tional to strength. Again, the delay variability interconnects, thus degrading energy efficiency.
mitigation comes at the cost of larger leakage The composition of near-threshold libraries
energy, as the leakage current drawn by a cell is does not need to be as wide as libraries for
proportional to its strength. above-threshold (i.e., higher performance) oper-
Summarizing, the variability-leakage tradeoff ation. Indeed, cell versions with very large
is an unescapable challenge in the design of strength can be suppressed, as they are typically
circuits and systems operating around the MEP, not used due to the more relaxed performance
as opposed to traditional low-power above-thresh- constraints. Similarly, cells with large fan-in
old designs. This tradeoff involves all levels of need to be eliminated, since they suffer from
abstraction, and needs to be constantly taken care disproportionately larger delay, as discussed in
of during the design process. Such tradeoff can be Sect. 4.2.3. Typically, libraries with around
broken by introducing innovative design 100 cells are adequate for near-threshold designs.
techniques that do not purely rely on timing From observations of prior designs, the energy
margining, as discussed later in this chapter. reduction obtained through a custom near-
threshold library can be in the order of 20%
(Gemmeke et al. 2013; Gammie et al. 2011),
4.9 Near-Threshold Cell Libraries compared to a pruned out conventional library
for above-threshold voltages (see below).
Designing cell libraries for operation around the The circuit design of cells is affected by near-
MEP certainly helps manage the peculiar threshold operation in terms of sizing as well.

Fig. 4.49 Near-threshold


cells are shorter than
typical above-threshold
cells

cell cell
height height

sub- or near-
threshold cell

above-threshold cell
140 M. Alioto

Indeed, minimum transistor size needs to be As an alternative option, existing cell libraries
skipped in technologies that are significantly designed for above-threshold regions can be
affected by Narrow Channel Effects (see reused at lower voltages, after proper pruning to
Sect. 4.2.2), to avoid the related increase in the eliminate the cells that suffer from robustness
transistor threshold voltage. issues or particularly pronounced delay increase
Near-threshold libraries might need to be (Alioto 2010; Wang et al. 2006). In a given
enriched with cells that are normally not avail- library designed for above-threshold voltages,
able in above-threshold libraries. For example, the number of usable cells at lower voltages
cells with thick-oxide transistors might be decreases when reducing VDD, as fewer cells
needed for always-on blocks (see Chap. 1) that can operate reliably at lower voltages. Typically,
need to be very low leakage, or connected as summarized in Fig. 4.51, the suppression of
directly to 3.6-V LiIon batteries (see Chap. 15). cells with a high fan-in (e.g., 4) leads to approxi-
Being particularly critical in terms of the mini- mately 100-mV Vmin reduction (Gemmeke et al.
mum voltage Vmin assuring correct operation, 2013).
flip-flops usually need to be thoroughly
redesigned to achieve adequate functional yield
at low voltages. This is usually achieved through 4.10 Clock and Supply Networks
circuit techniques that eliminate the potential for Near-Threshold Operation
current contention between transistors (Jain
et al. 2012; Kim et al. 2014a). Vmin is further The design of clock networks for near-threshold
reduced by replacing conventional dynamic designs is very different from above-threshold
circuits (e.g., periphery in register files) by their networks, due to the very different balance
static CMOS counterparts. As summarized in between clock repeater and wire delay, and the
Fig. 4.50, Vmin is determined by several clock skew is determined by different dominant
contributions arising at the process and circuit mechanisms (Alioto 2014; Lin et al. 2017; Seok
level, and is certainly dominated by variations et al. 2011; Tolbert et al. 2011). At above-
(Alioto 2012). threshold voltages, several levels of clock

Fig. 4.50 Breakdown of


minimum supply voltage
Vmin of logic gates ensuring
correct operation (Alioto
2012)

8 – 9 vt VDD,min increase due to


variations

13 – 14 vt ~
~325 – 350 mV

VDD,min increase due to residual PUN/


0.5 vt PDN imbalance

VDD,min increase due to NMOS/


2.5 vt PMOS imbalance

2 vt theoretical lower bound


4 Near-Threshold Digital Circuits for Nearly-Minimum Energy Processing 141

Fig. 4.51 Percentage of


library cells operating
correctly vs. VDD
(Gemmeke et al. 2013)

clk
1 2 ... CLKLVL
clk clock slope = rise/fall time
clk clock skew
clk
clk
BUFFER WIRE
clk
REGISTERS
CLOCK DISTRIBUTION NETWORK TIMING PARAMETERS

Fig. 4.52 General clock network structure and related timing parameters (Alioto 2014)

repeaters are needed to frequently interrupt wires cascaded repeaters, i.e., of the depth of the
to limit the related RC time constant and hence clock network. Accordingly, shallow networks
clock slope through the wires (Xanthopoulos need to be used at sub- and near-threshold
2009) (see Fig. 4.52). Indeed, excessive clock voltages, so that the dominant skew contribution
slope induces large random delay variations in due to the number of clock repeaters is reduced.
the clock repeaters at intermediate nodes of the From the above considerations, the design the
clock network, and degrades flip-flop nominal clock network at a given voltage leads to a skew
timing parameters (as well as its variations) degradation at the other end of the voltage range.
when considering the sinks of the same network. As an example, Fig. 4.53 plots the skew of a
In other words, the significant wire RC delay and clock network in 28 nm that has been designed
its impact on clock skew through the clock slope at 1.2 V and used at lower voltages. This figure
justifies the adoption of deep above-threshold shows that the skew at low voltages becomes
clock networks. several FO4 and even exceeds 10FO4 in
At sub- and near-threshold voltages, the gate sub-threshold. This means that the skew in a
delay becomes much larger than the wire delay clock network used in a wide range of voltages
(see Sects. 4.2.3 and 4.6), hence the clock slope becomes a large fraction of typical cycle time
through wires is no longer an issue, and the targets of energy-efficient designs (see
random skew is dominated by the intrinsic Sect. 4.4.1). In other words, using a clock net-
variations in the clock repeaters. According to work in a wide voltage range leads to significant
the Central Limit theorem (Walpole et al. 2006), performance degradation (or energy efficiency, if
the random skew standard deviation is propor- VDD is increased to recover the lost perfor-
tional to the square root of the number of mance). Similarly, the clock skew easily exceeds
142 M. Alioto

12

clock skew/FO4 in a 4-level


Fig. 4.53 Clock skew of a
sample clock network in
28 nm designed at 1.2 V 10

clock network
(normalized to FO4)
(Alioto 2014) 8

0
0 200 400 600 800 1000 1200
VDD [mV]

the available hold margin (Alioto et al. 2015), memory), although no adaption to voltage has
thus leading to timing failures at low voltages. In been performed within each clock domain.
other words, the clock skew degradation at low Clock network adaptation to a wide range of
voltages typically defines Vmin. Similar trends are voltages with each clock domain has been
observed when designing the clock network at demonstrated in Lin et al. (2017), where the
low voltages and running at above-threshold clock network topology is reconfigured to mini-
voltages. mize the skew at each specific voltage.
From the above considerations, the design of Regarding the supply network, voltage drops
the clock network of integrated systems are less of a concern at near-threshold voltages
operating in a wide voltage range entails a fun- and below, as the Ion transistor current is at least
damental tradeoff between the performance at an order of magnitude lower than at nominal
above-threshold voltages, and the ability to voltage. Accordingly, the current density drawn
scale down to low voltages. Various approaches by the digital circuit is reduced by the same
have been proposed to make this tradeoff more amount, and hence issues related to voltage
favorable, and mitigate the skew-energy penalty drops across the supply network are largely
imposed by the adoption of deep or shallow clock mitigated. This partially alleviates the problem
networks. For example, moderately deep of the stronger impact of VDD fluctuations at
networks with long-channel LVT buffers have near-threshold voltages, due to the larger sensi-
been proposed in Myers et al. (2015). Design tivity of performance (see Sect. 4.7.2). This
methodologies have been introduced in Seok translates in a relaxed requirement on the supply
et al. (2011), Tolbert et al. (2011), Zhao et al. rail width in the cell library, which can help
(2012) to optimally design clock networks, slightly reduce the cell height. For analogous
although for a single low voltage. Techniques reasons, the lower clock frequency of near-
for adaptive point-to-point interconnects with threshold circuits makes the effect of the wire
regenerative drivers have been also proposed parasitic inductance negligible. Finally, the peak
(Kim et al. 2014b; Wang et al. 2015), although current absorbed by near-threshold circuits is
they cannot be used for clock networks and are also reduced by an order of magnitude, compared
not supported by commercial EDA tools. to above-threshold operation. Hence, the size of
Voltage-adaptive delay insertion across different decoupling capacitors to keep VDD fluctuations
clock domains was introduced in (Jain et al. within a targeted band can be reduced by the
2012; Tokunaga et al. 2014) to mitigate the same amount, thus saving area and improving
inter-domain skew (e.g., between processor and the utilization factor of the module under design.
4 Near-Threshold Digital Circuits for Nearly-Minimum Energy Processing 143

4.11 Perspectives and Trends altogether, so that the voltages can be freely
assigned to very small domains to compensate
In summary, near-threshold circuits pose chal- variations where they arise, while avoiding the
lenge and opportunities that are significantly differ otherwise large overhead of level shifters. The
from conventional above-threshold low-power Panoptic approach (Putic et al. 2009) introduces
circuits. Counteracting leakage in spite of the both spatial and temporal fine granularity by using
inefficacy of conventional leakage reduction multiple sleep transistors that also dynamically
techniques (see Sect. 4.5) requires a radically dif- connect sub-blocks to three different supply
ferent approach that maximizes the opportunities voltages. The sleep transistors serve the purpose
to reduce leakage when transistors are not being of reducing leakage of unused sub-blocks, and
used. This can be accomplished by introducing assign them the minimum possible voltage for
fine-grain power domains that can be power the task at hand when used.
gated (e.g., with gate boosting to improve its Variations can also be exploited rather than
effectiveness, as in Sect. 4.5) or voltage scaled to added as design margin, when an adequately
mitigate the leakage contribution of unused large number of replicas of a given block are
transistors. Power domains are typically coarse available on the same chip. For example,
and of the size of at least an entire microprocessor, Raghunathan et al. (2013) introduces the concept
whereas such fine-grain power domains have the of “cherry picking” among many on-chip cores,
size of sub-blocks or execution units (e.g., ALU), which consists in the post-silicon selection of the
or even finer (e.g., individual operators in the most energy-efficient cores while keeping the
ALU). Although such approach certainly others off. This permits to maximize the energy
enhances the chances to turn off transistors, its efficiency by leveraging the inevitable random
direct application leads to significant area/ variations, rather than tolerating them, at the cost
energy/performance overhead. The latter is due of area due to the partial utilization of cores.
to the need for additional power domain control Observe that full utilization would not be
circuitry, as well as isolation/clamping cells for allowed anyway in practical cases, due to the
power gating and level shifters (see Chap. 9) at the “dark silicon” issue (see Chap. 1) that is deter-
boundary of each domain. mined by the chip power constraint.
Fine-grain voltage domains are also a highly In general, variations can be mitigated at dif-
promising approach in near-threshold circuits. ferent times, from design time to testing, chip
Indeed, the ability to distribute different voltages boot time and run-time, as summarized in
with fine granularity maximizes the opportunities Fig. 4.54. At design time, all variation
to correct variations in paths that turn out to be contributions need to be incorporated into the
critical due to random variations, while reducing design (e.g., cycle time) margin, as they are not
the energy in all other domains. The effectiveness known upfront. The margin is lowered at testing
of fine-grain voltage domains is further enhanced time, as process variations are known and can
by the strong sensitivity of performance on VDD hence be suppressed, whereas voltage, tempera-
(see Sect. 4.2), which ensures that voltage ture and aging-induced variations are need to be
boosting is kept small (e.g., 100–200 mV) in all included (as they will be defined later at in-field
practical cases. For example, selective boosting operation). At boot time, aging can be
can be used to reduce the general Vmin of the compensated as well. The margin is made very
circuit, while raising the voltage of the small small and virtually removed when variations are
portion of the circuit that needs to operate at compensated at run-time, i.e., when all process,
higher voltages (Tokunaga et al. 2014). As temperature, (slow) voltage variations are
another example, (Muramatsu et al. 2011) known. Obviously, the cost of such detection
leverages such small voltage difference across and compensation of variations increases when
voltage domains by suppressing level shifters moving from design to run-time.
144 M. Alioto

margined clock cycle

nominal clock cycle clock cycle margin

clock cycle margin to deal with design uncertainty sources:


temperature/slow fast total clock cycle implementation
process variations aging
VDD variations variations margin complexity/area overhead
design time
testing time
boot time
runtime
periodically @runtime

Fig. 4.54 Summary of techniques to counteract variations at different time, and resulting cycle margin and overhead

Due to very large design margin required at 2010), Bubble Razor (Fojtik et al. 2013). How-
near-threshold voltages (see Sect. 4.7), adequate ever, their overhead is in the order of various
yield and energy efficiency certainly require the (if not several) tens of percentage points, and
adoption of run-time compensation of variations. hence an order of magnitude larger than TRCs,
This is typically performed through timing error which has prevented their adoption in commer-
detection and correction (EDAC) methods, cial chips. Recently, very low-overhead (i.e.,
which have been investigated since early 2000s percentage points) in-situ approaches have been
(Ernst et al. 2003), EDAC methods sense the demonstrated, such Razor-Lite (Kwon et al.
timing margin at run time by detecting timing 2014) and iRazor (Zhang et al. 2016) for
failures, so that the system can be tuned to oper- processors, and RazorSRAM for on-chip
ate at nearly-zero margin (Ernst et al. 2003). This memories (Khayatzadeh et al. 2016). Being
permits to run at the highest possible frequency very lightweight, these approaches promise a
at given voltage, or at the minimum possible much wider adoption of in-situ error detection
voltage at given frequency. Hence, error detec- in mass produced chips.
tion and correction improves the energy effi- Another very promising direction to further
ciency of circuits operating at any voltage, reduce the energy per computation is offered by
typically by 1.3–1.45X (see references below). its tradeoff with quality. As discussed in Chap. 1,
Error detection can be performed through quality can be defined in different ways
canary circuits and Tunable Replica Circuits depending on the application and the
(TRCs) mimicking critical path variations, and sub-system under design. In a processing
hence predicting the occurrence of timing sub-system, quality is related to accuracy in
violations with high (but not 100%) level of terms of precision in case of arithmetic tasks,
confidence, at rather low overhead (Bowman misclassification rate in classification tasks, or
et al. 2011). However, tracking the critical path effective number of bits in an Analog-to-Digital
across a wide range of voltages is difficult, and Converter (ADC). The concept is far more gen-
hence such methods are more appropriate for eral than approximations (e.g., approximate com-
operation on a narrow range. Also, since TRCs puting), in that it applies to a broad range of types
try to replicate the critical path, they cannot of tasks and applications, and the tradeoff
completely eliminate the design margin. In-situ between quality and energy is dynamic and
error detection is performed by inserting timing based on quality sensing (see example in Sect.
sensors to detect true timing failures, which typi- 1.6.2).
cally entails significant area overhead. Several Based on the concepts described in Sect.
in-situ error detection methods have been pro- 1.6.2, energy-quality scalability has been
posed, such as Razor (Ernst et al. 2003), Razor II introduced in many different sub-systems and
(Das et al. 2009), EDS (Bowman et al. 2009; levels of abstraction. For example, the first
Bowman et al. 2011), ERSA (Leem et al. energy-quality SRAM memory has been
4 Near-Threshold Digital Circuits for Nearly-Minimum Energy Processing 145

Fig. 4.55 Energy curve vs. VDD in a 32-bit multiplier in 28-nm technology

introduced in Frustaci et al. (2015), where occa- the resolution is reduced by one bit, leading to an
sional faults (e.g., bitcells with inadequate write- exponential energy saving.
or read-ability) occur in the array. The scalability Finally, the presence of a minimum-energy
comes from a bit-level management of the point (MEP) actually poses a fundamental chal-
tradeoff between the bit error rate and the energy, lenge in terms of energy scalability when the
by adjusting assist techniques (see Chap. 5) dif- system is operating at the right of the MEP.
ferently for different positions and in a Indeed, the MEP tends to be a flat minimum as
dynamically scalable manner. This is beyond discussed in Sect. 4.3.2, which in turns translates
traditional memories where assist is uniformly into insignificant energy savings when the volt-
applied to all bit positions and to fully suppress age is scaled down from values at the right of the
errors, which entails a substantial error cost. MEP towards the MEP itself. As an example,
Similarly, selective Error Correction Codes from Fig. 4.55 there is almost no energy saving
have been introduced in the SRAM, to favor the when scaling from 0.6 V (i.e., at the right of the
robustness of the bits carrying the highest infor- MEP) down to 0.5 V (i.e., the MEP), due to the
mation content (e.g., MSBs in video processing flatness of the MEP. In other words, the voltage
applications), while saving on the other bit scalability of the design (i.e., its ability to operate
positions. Overall, this approach leads permits at very low voltages) does not translate in an
to improve the general quality by spending actual energy scalability (i.e., the ability to
some energy in selected bit positions, thus reduce energy when scaling down the voltage).
enabling much more aggressive scaling on all To preserve energy scalability, the energy curve
positions and hence achieving quadratic benefit. in Fig. 4.55 needs to be steep rather than flat,
Energy reductions of 2X have been demonstrated which is achieved only if the operating voltage is
compared to traditional voltage scaling, at far enough on the right side of the MEP. For
iso-quality (Frustaci et al. 2016). The same gen- example, quadratic benefit is observed in this
eral concept has been applied to several other figure, at voltages from 0.8 V to 1 V. Conversely,
sub-systems, such as ADCs with dynamically to achieve good energy scalability at a given low
scalable resolution (Freyman et al. 2014; Yip voltage (e.g., 0.5 V), the MEP needs to be pushed
et al. 2011). In this case, when the application to the left of this targeted voltage (e.g., 0.3 V). In
can tolerate a reduction in the ADC resolution, a other words, innovation is needed to move the
more than 2X energy reduction is gained when MEP where needed, depending on the operating
146 M. Alioto

voltage. At above-threshold voltages, the MEP efficient and metastability-immune resilient circuits
can lie at a fairly high voltage, while not being a for dynamic variation tolerance. IEEE J. Solid-State
Circuits 44, 49–63 (2009)
problem since the dominance of the dynamic K.A. Bowman, J.W. Tschanz, S.-L. Lu, P. Aseron,
energy still assured a quadratic benefit when M. Khellah, A. Raychowdhury, B. Geuskens,
downscaling VDD. When a near-threshold volt- C. Tokunaga, C. Wilkerson, T. Karnik, V. De, A
age is targeted and further voltage scaling needs 45 nm resilient microprocessor core for dynamic vari-
ation tolerance. IEEE J. Solid-State Circuits 46(1),
to be applied, the MEP needs to be dynamically 194–208 (2011)
moved to the left to make the energy curve T. Burd, T. Pering, A. Stratakos, R. Brodersen, A dynamic
steeper, and again achieve a nearly-quadratic voltage scaled microprocessor system, in IEEE ISSCC
energy benefit. We believe that this is one of Dig. Tech. Papers (Feb. 2015), pp. 294–295
A. Chandrakasan, D. Daly, D. Finchelstein, J. Kwong,
the fundamental challenges that needs to be Y. Ramadass, M. Sinangil, V. Sze, N. Verma,
addressed to further improve the energy effi- Technologies for ultradynamic voltage scaling. Proc.
ciency of low-voltage integrated systems for IoT. IEEE 98(2), 191–214 (2010)
D. Chinnery, K. Keutzer, Closing the Power Gap between
ASIC & Custom (Springer, Berlin, 2007)
Acknowledgement The authors acknowledge the kind J. Crop, E. Krimer, N. Moezzi-Madani, R. Pawlowski,
support by the MOE2014-T2-2-158 grant from the T. Ruggeri, P. Chiang, M. Erez, Error detection and
Singaporean Ministry of Education. recovery techniques for variation-aware CMOS com-
puting: a comprehensive review. J. Low Power Elec-
tron. Appl. 1, 334–356 (2011)
S. Das, C. Tokunaga, S. Pant, W.-H. Ma, S. Kalaiselvan,
References K. Lai, D.M. Bull, D.T. Blaauw, Razor II: in situ error
detection and correction for PVT and SER tolerance.
F. Abouzeid, S. Clerc, B. Pelloux-Prayer, F. Argoud, IEEE J. Solid-State Circuits 44(1), 32–48 (2009)
P. Roche, 28 nm CMOS, energy efficient and G. De Micheli, Synthesis and Optimization of Digital
variability tolerant, 350 mV-to-1.0 V, 10 MHz/ Circuits (McGraw Hill, New York, 1994)
700MHz, 252bits frame error-decoder, in Proceedings R. Dreslinski, M. Wieckowski, D. Blaauw, D. Sylvester,
of ESSCIRC 2012 (Bordeaux, France, Sept. 2012), T. Mudge, Near-threshold computing: reclaiming
pp. 153–156 Moore’s law through energy efficient integrated
M. Alioto, G. Scotti, A. Trifiletti, A novel framework to circuits. Proc. IEEE 98(2), 253–266 (2010)
estimate the path delay variability via the fan-out-of-4 C. Enz, E. Vittoz, Charge-Based MOS Transistor
metric. IEEE Trans. Circuits Syst—Part I (in press) Modeling: The EKV Model for Low-Power and RF
M. Alioto, Understanding DC behavior of subthreshold IC Design (Wiley, New York, 2006)
CMOS logic through closed-form analysis. IEEE D. Ernst, N.S. Kim, S. Das, S. Pant, R. Rao, T. Pham,
Trans. Circuits Syst.—part I 57(7), 1597–1607 (2010) C. Ziesler, D. Blaauw, T. Austin, K. Flautner,
M. Alioto, Ultra-low power VLSI circuit design T. Mudge, Razor: a low-power pipeline based on
demystified and explained: a tutorial. IEEE Trans. circuit-level timing speculation, in Proceedings of
Circuits Syst.—part I (invited) 59(1), 3–29 (2012) MICRO-36 (Dec. 2003), pp. 7–18
M. Alioto, Challenges and techniques for ultra-low volt- D. Flynn, R. Aitken, A. Gibbons, K. Shi, Low Power
age logic with nearly-minimum energy. in Short Methodology Manual (Springer, New York, 2007)
course at VLSI Symposium 2014, Hawaii 10 June 2014 M. Fojtik, D. Fick, Y. Kim, N. Pinckney, D. Harris,
M. Alioto, G. Palumbo, M. Pennisi, Understanding the D. Blaauw, D. Sylvester, Bubble razor: eliminating
effect of process variations on the delay of static and timing margins in an ARM Cortex-M3 processor in
Domino logic. IEEE Trans. VLSI Syst. 18(5), 45 nm CMOS using architecturally independent error
697–710 (2010) detection and correction. IEEE J. Solid-State Circuits
M. Alioto, Guest editorial for the special issue on “Ultra- 48(1), 66–81 (2013)
low-voltage VLSI circuits and systems for green com- L. Freyman, D. Fick, M. Alioto, D. Blaauw, D. Sylvester,
puting. IEEE Trans. Circuits Systems—part II 59(12), A 346 μm2 VCO-based, reference-free, self-timed
849–852 (2012) sensor interface for cubic-millimeter sensor nodes in
M. Alioto, E. Consoli, G. Palumbo, Flip-Flop Design in 28 nm CMOS. IEEE J. Solid-State Circuits 49(11),
Nanometer CMOS—From High Speed to Low Energy 2462–2473 (2014)
(Springer, Berlin, 2015) F. Frustaci, M. Khayatzadeh, D. Blaauw, D. Sylvester,
Z. Bo, D. Blaauw, D. Sylvester, K. Flautner, Theoretical M. Alioto, SRAM for error-tolerant applications with
and practical limits of dynamic voltage scaling, in dynamic energy-quality management in 28 nm
Proceedings of DAC (2004), pp. 868–873 CMOS. IEEE J. Solid-State Circuits 50(3),
K.A. Bowman, J.W. Tschanz, N.S. Kim, J.C. Lee, 1310–1323 (2015)
C.B. Wilkerson, S.-L. Lu, T. Karnik, V. De, Energy-
4 Near-Threshold Digital Circuits for Nearly-Minimum Energy Processing 147

F. Frustaci, D. Blaauw, D. Sylvester, M. Alioto, Approxi- variable-precision floating-point fused multiply-add


mate SRAMs with dynamic energy-quality manage- unit with certainty tracking in 32 nm CMOS, in IEEE
ment. IEEE Trans. VLSI Syst. 24(6), 2128–2141 ISSCC Dig. Tech. Papers, (Feb. 2012), pp. 182–183
(2016) M. Khayatzadeh, M. Saligane, J. Wang, M. Alioto,
G. Gammie, N. Ickes, M. Sinangil, R. Rithe, J. Gu, D. Blaauw, D. Sylvester, A reconfigurable dual port
A. Wang, H. Mair, S. Datla, R. Bing, S. Honnavara- memory with error detection and correction in 28 nm
Prasad, L. Ho, G. Baldwin, D. Buss, A. Chandrakasan, FDSOI, in IEEE ISSCC Dig. Tech. Papers (Feb.
U. Ko, A 28 nm 0.6 V low-power DSP for mobile 2016), pp. 310–311
applications, in ISSCC Digest of Technical Papers Y. Kim, W. Jung, I. Lee, Q. Dong, M. Henry,
(ISSCC) (San Francisco, Feb. 2011) D. Sylvester, D. Blaauw, A static contention-free
T. Gemmeke, M. Ashouei, B. Liu, M. Meixner, T.G. Noll, single-phase-clocked 24T Flip-Flop in 45 nm for
H. de Groot, Cell libraries for robust low-voltage low-power applications. in IEEE ISSCC Dig. Tech.
operation in nanometer technologies. Solid-State Papers (Feb. 2014)
Electron. 84, 132–141 (2013) S. Kim, M. Seok, Reconfigurable interconnect-driving
J. Gregg, T.W. Chen, Post silicon power/performance technique for ultra-dynamic-voltage-scaling systems,
optimization in the presence of process variations in IEEE ACM International Symposium on Low Power
using individual well-adaptive body biasing. IEEE Electronics and Design (ISLPED) (2014)
Trans. VLSI Syst. 15(3), 366–376 (2007) I. Kwon, S. Kim, D. Fick, M. Kim, Y.-P. Chen,
S. Hanson, B. Zhai, D. Blaauw, D. Sylvester, A. Bryant, D. Sylvester, Razor-lite: a light-weight register for
X. Wang, Energy optimality and variability in sub- error detection by observing virtual supply rails.
threshold design, in Proceedings of ISLPED 2006 IEEE J. Solid-State Circuits 49(9), 2054–2066 (2014)
(2006), pp. 363–365 L. Leem, H. Cho, J. Bau, Q.A. Jacobson, S. Mitra, ERSA:
S. Hanson, B. Zhai, K. Bernstein, D. Blaauw, A. Bryant, error resilient system architecture for probabilistic
L. Chang, K.K. Das, W. Haensch, E.J. Nowak, applications, in Proceedings of DATE 2010 (Dresden,
D.M. Sylvester, Ultralow-voltage, minimum-energy Germany, Mar. 2010), pp. 1560–1565
CMOS. IBM J. Res. & Dev. 50(4/5) (2006), L. Lin, S. Jain, M. Alioto, Reconfigurable clock networks
pp. 469–490 for random skew mitigation from sub-threshold to
S. Hanson, B. Zhai, M. Seok, B. Cline, K. Zhou, nominal voltage, in IEEE ISSCC Dig. Tech. Papers
M. Singhal, M. Minuth, J. Olson, L. Nazhandali, (Feb. 2017)
T. Austin, D. Sylvester, D. Blaauw, Exploring M. Alioto, G. Scotti, A. Trifiletti, A novel framework to
variability and performance in a sub-200-mV proces- estimate the path delay variability via the Fan-Out-of-
sor. IEEE J. Solid-State Circuits 43(4), 881–890 4 metric. IEEE Trans. Circuits Syst.—part I
(2008) D. Markovic, V. Stojanovic, B. Nikolic, M.A. Horowitz,
D. Harris, R. Ho, G.-Y. Wei, M. Horowitz, The Fanout- R.W. Brodersen, Methods for true energy-
of-4 inverter delay metric, unpublished manuscript performance optimization. IEEE J. Solid-State
http://citeseerx.ist.psu.edu/viewdoc/download? Circuits 39(8), 1282–1293 (2004)
doi¼10.1.1.68.831&rep¼rep1&type¼pdf S.M. Martin, K. Flautner, T. Mudge, D. Blaauw, Com-
S. Hsu, A. Agarwal, M. Anders, S. Mathew, H. Kaul, bined dynamic voltage scaling and adaptive body
F. Sheikh, R. Krishnamurthy, A 280 mV-to-1.1 V biasing for lower power microprocessors under
256b reconfigurable SIMD vector permutation engine dynamic workloads, in Proceedings of ICCAD’02
with 2-dimensional shuffle in 22 nm CMOS, in ISSCC (Nov. 2002), pp. 721–725
Digest of Technical Papers (ISSCC) (San Francisco, M. Meijer, J. Pineda de Gyvez, Body-bias-driven design
Feb. 2012) strategy for area- and performance-efficient CMOS
International Technology Roadmap for Semiconductors: circuits. IEEE Trans. VLSI Syst. 20(1), 42–51 (2012)
2013 edition. http://www.itrs.net (2013) M. Merrett, Y. Wang, M. Alioto, M. Zwolinski, Design
D. Jacquet et al., 2.6GHz ultra-wide voltage range energy metrics for RTL level estimation of delay variability
efficient dual A9 in 28 nm UTBB FD-SOI, in IEEE due to intradie (random) variations, in Proceedings of
Symposium on VLSI Circuits Dig. Tech. Papers (June ISCAS 2010 (Paris (France), May 2010),
2013) pp. 2498–2501
S. Jain et al., A 280 mV-to-1.2 V wide-operating-range A. Muramatsu, T. Yasufuku, M. Nomura, M. Takamiya,
IA-32 processor in 32 nm CMOS, in IEEE ISSCC Dig. H. Shinohara, T. Sakurai, 12% power reduction by
Tech. Papers (Feb. 2012), pp. 66–67 within-functional-block fine-grained adaptive dual
D. Jeon, M. Seok, C. Chakrabarti, D. Blaauw, supply voltage control in logic circuits with 42 voltage
D. Sylvester, A super-pipelined energy efficient sub- domains, in 37th European Solid-State Circuits Con-
threshold 240 MS/s FFT core in 65 nm CMOS. IEEE ference (ESSCIRC) (Helsinki (Finland), Sep. 2011),
J. Solid-State Circuits 47(1), 23–34 (2013) pp. 191–194
H. Kaul, M.A. Anders, S.K. Mathew, S.K. Hsu, J. Myers, A. Savanth, D. Howard, R. Gaddh, P. Prabhat,
A. Agarwal, F. Sheikh, R.K. Krishnamurthy, D. Flynn, An 80nW retention 11.7pJ/cycle active
S. Borkar, A 1.45 GHz 52-to-162GFLOPS/W sub-threshold ARM Cortex®-M0+ sub-system in
148 M. Alioto

65 nm CMOS for WSN applications, in IEEE ISSCC of International Symposium on Microarchitectures


Dig. Tech. Papers (Feb. 2015), pp. 144–145 (2002), pp. 333–344
S. Narayanan, J. Sartori, R. Kumar, D.L. Jones, Scalable I. Sutherland, B. Sproull, D. Harris, Logical Effort:
stochastic processors, in Proceedings of DATE 2010 Designing Fast CMOS Circuits (Morgan-Kaufmann,
(Dresden, Germany, Mar. 2010), pp. 335–338 Burlington, 1999)
S. Narendra, A. Chandrakasan (eds.), Leakage in Nano- C. Tokunaga, J.F. Ryan, C. Augustine, J.P. Kulkarni, Y.-
meter CMOS Technologies (Springer, Berlin, 2006) C. Shih, S.T. Kim, R. Jain, K. Bowman,
K. Nose, T. Sakurai, Optimization of VDD and VTH for A. Raychowdhury, M.M. Khellah, J.W. Tschanz,
low-power and high-speed applications, in V. De, A graphics execution core in 22 nm CMOS
Proceedings of ASPDAC (Jan. 2000), pp. 469–474 featuring adaptive clocking, selective boosting and
M. Olivieri, G. Scotti, A. Trifiletti, A novel yield optimi- state-retentive sleep, in ISSCC Digest of Technical
zation technique for digital CMOS circuits design by Papers (ISSCC) (San Francisco, CA, Feb. 2014)
means of process parameters run-time estimation and J.R. Tolbert, X. Zhao, S.K. Lim, S. Mukhopadhyay, Anal-
body bias active control. IEEE Trans. VLSI Syst. 13 ysis and design of energy and slew aware subthreshold
(5), 630–638 (2005) clock systems. IEEE Trans. CAD 30(9), 1348–1358
M.M. Orshansky, S. Nassif, D. Boning, Design for (2011)
Manufacturability and Statistical Design (Springer, J.W. Tschanz, J.T. Kao, S.G. Narendra, R. Nair,
Berlin, 2008) D.A. Antoniadis, A.P. Chandrakasan, V. De, Adaptive
D. Patil, M. Horowitz, Joint supply, threshold voltage and body bias for reducing impacts of die-to-die and
sizing optimization for design of robust digital within-die parameter variations on microprocessor
circuits. http://vlsiweb.stanford.edu/papers/ frequency and leakage. IEEE J. Solid-State Cicuits
JointVddVthSizing.pdf 37(11), 1396–1042 (2002)
M. Pelgrom, A. Duinmaijer, A. Welbers, Matching Y. Tsividis, Operational Modeling of the MOS Transistor,
properties of MOS transistors. IEEE J. Solid-State 2nd edn. (McGraw-Hill, New York, 1999)
Circuits 24(1), 1433–1439 (1989) R.E. Walpole, R.H. Myers, S.L. Myers, K. Ye, Probabil-
M. Putic, L. Di, B. H. Calhoun, J. Lach, Panoptic DVS: a ity & Statistics for Engineers & Scientists (Prentice
fine-grained dynamic voltage scaling framework for Hall, Englewood Cliffs, 2006)
energy scalable CMOS design, in Proceedings of A. Wang, A. Chandrakasan, 180-mV subthreshold FFT
ICCD 2009 (Lake Tahoe, CA, Oct. 2009), processor using a minimum energy design methodol-
pp. 491–497 ogy. IEEE J. Solid-State Circuits 40(1), 310–319 (2005)
B. Raghunathan, Y. Turakhia, S. Garg, D. Marculescu, A. Wang, B.H. Calhoun, A. Chandrakasan, Sub-threshold
Cherry-picking: exploiting process variations in dark- design for ultra low-power systems (Springer, Berlin,
silicon homogeneous chip multi-processors, in 2006)
Proceedings of DATE 2013 (Grenoble, France, Mar. J. Wang, N. Pinckney, D. Blaauw, D. Sylvester,
2013), pp. 39–44 Reconfigurable self-timed regenerators for wide-
Y.K. Ramadass, A.P. Chandrakasan, Minimum energy range voltage scaled interconnect, in Proceedings of
tracking loop with embedded DC–DC converter ASSCC 2015 (Nov. 2015)
enabling ultra-low-voltage operation down to N. Weste, D. Harris, CMOS VLSI Design, 4th edn.
250 mV in 65 nm CMOS. IEEE J. Solid-state Circuits (Pearson Education, Upper Saddle River, 2011)
43(1), 256–265 (2008) R. Wilson et al., A 460 MHz at 397 mV, 2.6 GHz at 1.3 V,
T. Sakurai, R. Newton, Alpha-power law MOSFET model 32b VLIW DSP, embedding FMAX tracking, in IEEE
and its applications to CMOS inverter delay and other ISSCC Dig. Tech. Papers (Feb. 2014), pp. 452–453
formulas. IEEE J. Solid-State Circuits 25(2), 584–594 T. Xanthopoulos, Clocking in Modern VLSI Systems
(1990) (Springer, New York, 2009)
W. Sansen, Analog Design Essentials (Springer, M. Yip, A. Chandrakasan, A resolution-reconfigurable
New York, 2006) 5-to-10 b 0.4-to-1 V power scalable SAR ADC, in
M. Seok, D. Blaauw, D. Sylvester, Robust clock network IEEE ISSCC Dig. Tech. Papers (2011), pp. 190–191
design methodology for ultra-low voltage operations, Y. Zhang, M. Khayatzadeh, K. Yang, M. Saligane,
in IEEE Transactions on Emerging Selected Topics M. Alioto, D. Blaauw, D. Sylvester, iRazor:
Circuits Systems, vol. 1(2) (2011) 3-transistor current-based error detection and correc-
F. Sheikh, S. Mathew, M. Anders, H. Kaul, S. Hsu, tion in an ARM Cortex-R4 processor, in IEEE ISSCC
A. Agarwal, R. Krishnamurthy, S. Borkar, A 2.05 Dig. Tech. Papers (Feb. 2016), pp. 160–161
GVertices/s 151 mW lighting accelerator for 3D X. Zhao, J.R. Tolbert, S. Mukhopadhyay, S.K. Lim,
graphics vertex and pixel shading in 32 nm CMOS, Variation-aware clock network design methodology
in IEEE ISSCC Dig. Tech. Papers (Feb. 2012), for ultralow voltage (ULV) circuits. IEEE Trans.
pp. 178–179 CAD 31(8), 1222–1234 (2012)
V. Srinivasan, D. Brooks, M. Gschwind, P. Bose, W. Zhao, Y. Ha, M. Alioto, Novel self-body-biasing and
V. Zyuban, P.N. Strenski, P.G. Emma, Optimizing statistical design for near-threshold circuits with ultra
pipelines for power and performance, in Proceedings energy-efficient AES as case study. IEEE Trans. VLSI
Syst. 23(8), 1390–1401 (2015)
Energy Efficient Volatile Memory
Circuits for the IoT Era 5
Jaydeep P. Kulkarni, James W. Tschanz, and Vivek K. De

This chapter addresses the challenges involved of personal computers, desktop machines,
in designing energy efficient embedded Static smartphones, wearables and many units
Random Access Memory (SRAM) circuits for connected to the internet; collectively known as
the IoT era. It discusses memory design for the Internet of Things (IoT). This dramatic
wide voltage range operation using 6 Transistor growth in compute devices results in a data
(6T), 8T and 10T bitcells and novel circuit explosion (known as ‘data deluge’) and would
assist techniques. In addition, it discusses future require millions of Zettabytes of memory in
memory designs using emerging nano-wire FET, order to perform this large scale of computing
Tunnel FET, III-V FET, and monolithic 3-D (Semiconductor Industry Association 2015).
technologies. Therefore, memories play a critical role in future
energy efficient computing systems. This chapter
addresses the challenges involved in designing
5.1 Introduction and Challenges energy efficient embedded memory circuits
in Embedded Memory Design operating across a wide voltage range, and also
for IoT presents design examples using emerging device
technologies.
5.1.1 Introduction

Technological advances and form factor driven


cost reduction have resulted in tremendous
5.1.2 SRAM Scaling Trends
growth of computing devices. As shown in
Aggressive scaling of transistor dimensions
Fig. 5.1, the number of internet-connected
with each technology generation has resulted in
devices deployed worldwide is projected to
increased integration density and improved
grow to 50 billion by 2020.
device performance. The SRAM bitcell area has
If continued, this trend might result in tens
been scaled by ~0.5 over each process genera-
of billions of computing systems consisting
tion as shown in Fig. 5.2. This area scaling is
achieved by various lithographic and circuit
innovations such as thin-cell layouts, high-K
metal gate technology, tri-gate geometry,
J.P. Kulkarni (*) • J.W. Tschanz • V.K. De
leakage-reduction techniques, and low voltage
Circuit Research Lab, Intel Corporation,
Hillsboro, OR, USA assist techniques (Wang et al. 2012; Yoshinobu
e-mail: jaydeep.p.kulkarni@intel.com et al. 2003).

# Springer International Publishing AG 2017 149


M. Alioto (ed.), Enabling the Internet of Things, DOI 10.1007/978-3-319-51482-6_5
150 J.P. Kulkarni et al.

Fig. 5.1 Projected growth of IoT devices and the required memory capacity

Fig. 5.2 6T Static Random Access Memory (SRAM) bitcell scaling trend

5.1.3 SRAM Design Octagon known as VMIN) using various read/write assist
techniques. Another important aspect is leakage
Energy-efficient SRAM design in nanoscale reduction during active as well as idle mode. This
technology needs to address several process, is particularly important for IoT designs that in
design, architecture, and reliability issues as general have very strict power budgets and at the
shown in Fig. 5.3. Process technology issues same time can exhibit very low array activity due
relate to the restricted design rules, diffusion- to duty-cycled or burst-mode styles of operation.
notch free layout, proximity effects, stress Reliability is also an important aspect of mod-
effects, and threshold voltage (VT) optimization. ern SRAM design. The effects of bias tempera-
It impacts the 6T bitcell sizing/area, read/write ture instability (BTI), time-dependent dielectric
stability, 6T+ bitcells, and single/multi-port breakdown (TDDB), hot carrier injection (HCI),
bitcell requirements. These process optimi- random telegraph noise (RTN), erratic bits, and
zations and design requirements govern the radiation-induced soft errors limit the operating
array efficiency and bit density. voltage range and lifetime of the SRAM array.
Circuit techniques focus on lowering the min- Architectural techniques such as redundancy,
imum successful operating supply voltage (also parity protection, error correcting codes (ECC),
5 Energy Efficient Volatile Memory Circuits for the IoT Era 151

Fig. 5.3 Energy-efficient SRAM design requirements

and reconfigurable caches are used along with voltage (VMIN) (Mukhopadhyay et al. 2005;
the circuit techniques to address these reliability Khellah et al. 2008).
issues. In this chapter, we focus on the circuit The ‘de facto’ memory bitcell used in modern
techniques for enabling energy-efficient SRAM SRAM designs is a 6-transistor cell consisting of a
operation across a wide voltage range specifi- cross-coupled inverter pair. SRAM cells are sized
cally needed for IoT applications. to satisfy the conflicting design requirements of
read stability and write-ability. For a read stability
optimized bitcell as shown Fig. 5.4, pass gate
5.1.4 SRAM Parametric Failures transistors (AXL, AXR) are sized weaker than
and VMIN the pull-up and pull-down transistor.
During a read operation, when a wordline
Supply voltage scaling has remained the major (WL) is asserted, the voltage at the node storing
focus of energy-efficient memory design. How- logic 0 (Node VL in left Fig. 5.4) rises above VSS
ever, as the supply voltage is reduced the sensitiv- due to voltage divider action between the pass-gate
ity of the circuit parameters to process variations AXL and pull-down NL transistors. At nominal
increases. Nanoscale SRAM bitcells having supply voltage, the VL node voltage rise during
minimum-sized transistors are vulnerable to read operation is not significant enough to flip the
inter-die as well as intra-die process variations. bitcell contents (shown in blue color). However
Intra-die process variations include random dop- due to process variations, especially for a bitcell
ant fluctuation (RDF), line edge roughness (LER), operating at lower supply voltages, the voltage rise
and other manufacturing variations that may result at VL node can be higher than the trip point of the
in a threshold voltage mismatch between adjacent other inverter and can flip the VR node, resulting in
transistors in a memory cell (Bhavnagarwala et al. a read failure event (shown in red color). Read
2001). Coupled with inter-die and intra-die pro- failure can also occur during a dummy-read sce-
cess variations, supply voltage scaling is limited nario (half-select) in a column-interleaved design.
due to various memory failures (i.e. read failure, For the write operation, the design require-
retention failure, access time failure, and write ment is that the bitcell nodes should flip easily.
failure) also known as minimum operating In a write-optimized bitcell, the pass gate is
152 J.P. Kulkarni et al.

Fig. 5.4 Read and write failure events in a 6T SRAM bitcell

stronger than the pull-up and pull-down devices. bitcells (bitcell-VCC/VSS, WL, BL) appropriately
At nominal supply voltage, when write data is to favor a read or a write operation.
applied on the bitline (BL) and bitline comple-
ment (BR), the pass gate connected to the
grounded bitline (BR) pulls down the bitcell 5.2.1 Read Assist Techniques
node (VR in right Fig. 5.4) below the trip point
of the other inverter (VL) resulting in the flip of As explained in the earlier section, a read failure is
the storage bit (node VL) and a successful write caused by increase in the access transistor drive
operation. However due to process variations strength compared to the cross-coupled inverter
especially at lower supply voltage, the pass gate pair due to low voltage operation and/or process
AXR can be weaker than the pull-up device PR variations. The read assist circuits try to reinforce
and may not lower the VR node voltage below the access transistors to be weaker than the cross-
switching threshold of the other side inverter coupled inverter pair. The node biasing techniques
resulting in a write failure event. Write failure include raising the bitcell-VCC and/or lowering
can also happen if the wordline pulse is not long bitcell-VSS of the cross-coupled inverter pair.
enough for the bitcell to flip the internal nodes. These techniques also include wordline under-
drive (WLUD) or operating the entire bitcell at a
higher supply voltage compared to the peripheral
circuits (Mann et al. 2010; Khellah et al. 2008)
5.2 6T SRAM Circuit Techniques (Figs. 5.5 and 5.6).

To improve the operating voltage range of


SRAM arrays in the presence of process 5.2.2 Write Assist Techniques
variations and to satisfy the conflicting design
requirement of read vs. write stability, various Contrary to a read operation, a write failure is
circuit assist techniques have been explored. The caused by decrease in the access transistor drive
key idea is to bias different nodes of the 6T strength compared to the cross-coupled inverter
5 Energy Efficient Volatile Memory Circuits for the IoT Era 153

Fig. 5.5 6T SRAM read assist techniques

Fig. 5.6 6T SRAM write assist techniques

pair due to low voltage operation and/or process requires power and area hungry VMIN assist
variations. The write assist circuits try to rein- techniques. The operating voltage can be reduced
force the access transistor to be stronger than the substantially by adding a dedicated read port
cross-coupled inverter pair. The node biasing using two extra transistors. This 8T bitcell
techniques include lowering the bitcell-VCC (Fig. 5.7) is commonly used in microprocessor
and/or raising the bitcell-VSS of the cross- cores for performance-critical low-level caches
coupled inverter pair. Write assist techniques and multi-ported register-file arrays.
also include wordline boosting or negative The 8T bitcell offers fast read and write oper-
bitline approaches (Mann et al. 2010; Khellah ation, dual-port capability, and generally lower
et al. 2008). VMIN than the 6T SRAM bitcell. By using a
decoupled single-ended read port with domino-
style hierarchal read bit-line, the 8T cell features
5.3 8T SRAM Circuit Techniques a fast read evaluation path without causing
access disturbance that limits read-VMIN in the
5.3.1 Benefits of Decoupled Read/ 6T bitcell. Using the 8T cell in a half-select-free
Write Operation architecture eliminates pseudo-reads during par-
tial writes, hence enabling write-VMIN optimiza-
The fundamental conflicting design constraints tion independent of read. It is well known that
of read-stability vs. write-ability in the 6T both with-in-die (WID) and die-to-die (D2D)
SRAM bitcell limits low voltage operation and device parameter variations are getting worse
154 J.P. Kulkarni et al.

RWL0

VCC VCC
WR WR “1”
contention PU completion

Local RBL (LBL)


Nx
“0" “1" “0" “1"

RD Port

PCH
WWL0
RD contention
bit0

VCC
bit1
domino
WBL#

WBL
keeper

Global RBL (GBL)


bit7

write driver
local
merge

Fig. 5.7 8T bitcell array organization with non-interleaved columns

with feature size scaling. Unfortunately, the typ- 5.3.2 Wordline Boosting as VMIN
ical approach of sizing up the 8T read and write Assist Technique
ports to mitigate process variation has limited
VMIN returns. Wordline boosting is an effective technique for
In the read case, using larger NMOS read port lowering the read and write VMIN of 8T bitcell
helps when reading a “1” by reducing contention arrays with minimal area and power overheads.
with the PMOS domino keeper on the local The key idea is to selectively boost critical VMIN-
bit-line (LBL). To compensate for degraded limiting nodes of the 8T bitcell, allowing the
noise margin that comes with a larger (and leak- majority of the array peripheral circuits and
ier) read port, the PMOS keeper needs to be remaining logic block to operate at much lower
relatively upsized for good reading of a “0”, voltage thereby lowering the overall chip VMIN.
resulting in diminishing read-VMIN returns with There are various ways to achieve a boosted
continued upsizing. Similarly, contention voltage for the wordline, such as charge-pump
between PMOS pull-up (PU) and the NMOS based boosting, capacitive coupling, and using
pass (NX) impacting write-VMIN is reduced by separate high-VCC rail. All three schemes rely
sizing up NX. However, PU needs to be relatively on one or more combinations of (1) read
sized up as well since at a given point a weak PU wordline (RWL) boosting, (2) write wordline
will limit the completion of write within a given (WWL) boosting, and/or bitcell boosting.
WL pulse. Only 10% total VMIN reduction is Boosting RWL and bitcell-VCC enable larger
attained through optimal cell upsizing at a cost read “ON” current without forcing a larger
of ~25% increase in array area (Raychowdhury PMOS keeper. Boosting WWL helps write-
et al. 2010). Therefore, read and write assis cir- VMIN for two reasons—improving contention
cuit that can achieve VMIN reduction at a minimal without upsizing NX (or lowering its VTH), and
area impact are necessary. improving completion by writing a “1” from the
5 Energy Efficient Volatile Memory Circuits for the IoT Era 155

other side. Unlike VCC collapse, WWL boosting The actual boosting ratio is lower, however, as
does not degrade the dynamic retention margin determined by ILOAD from all active and inactive
of unselected cells on the same column. level shifters, boosting clock (BCLK) frequency
(FBCLK), and boosting capacitance (CCP). To
minimize the load current requirement on the
5.3.2.1 Charge Pump Based Wordline
charge pump design, a two-stage level shifter is
Boosting
used as the wordline driver. Unlike the conven-
In this approach, an embedded charge pump is
tional (DCVS) level shifter, where a “0”-
used to generate a higher voltage to boost RWL
-to-VBOOST transition is all supplied by the
and WWL (VBOOST generation) (Raychowdhury
VBOOST rail, the two-stage level shifter performs
et al. 2010). Figure 5.8 shows a 2 KB 8T SRAM
this transition in two steps. In the first step,
bank with a locally-integrated charge pump
“0”-to-VCC is supplied by MP1 at which point
(CP) and associated wordline level shifter
MP2 kicks in to supply the remaining VCC-to-
circuits. The CP itself is divided into ten identical
VBOOST. This significantly reduces the maximum
units placed in the layout slices created by the
charge pump load current allowing the boosting
level shifters in the global IO of the 2 KB macro.
ratio to increase to ~1.6, compared to only
The boosting ratio is adjusted based on the built-
1.1 using DCVS level shifter.
in read-ability and write-ability sensor. The ideal
Figure 5.9 shows measurement results from a
boosting ratio (VBOOST/VCC) under no load cur-
45 nm test chip achieving VMIN improvement of
rent (ILOAD) is 2VCC.

Fig. 5.8 8T SRAM array with distributed charge pump and two-step level shifters

Fig. 5.9 Charge pump based boosting: Measured bit failures vs. VCC extrapolated to 1 MB array
156 J.P. Kulkarni et al.

140 mV (120 mV) for read (write) for a 1 MB WWL driver (C1), while the second is at the
target array size. The area overhead due to WWL interface to the cell’s NMOS WR pass
distributed charge pump and two-stage level devices (C2 and C3).
shifters is significant (around ~25%). To enable use of the first capacitance, the
input of the WWL driver is asserted normally
5.3.2.2 Capacitive Coupling with Self- (high-to-low) to create a 0 ! VCC transition on
Induced-Collapse the WWL. After a short delay, the input is
The adoption of charge pump based boosting de-asserted (goes from low to high) turning off
requires careful design of the charge pump and the top PMOS (but without turning on the bottom
two-stage level shifters. Furthermore, dyna- NMOS)-effectively floating the WWL. This in
mically turning the charge pump off (or putting turn excites the G-to-D capacitance from the
it in a drowsy mode) during inactive periods is PMOS transistor of the WL driver creating
necessary for net power savings. As an alterna- about 3–5% coupling to the floating WWL. To
tive to the charge bump, the capacitive coupling enable use of the second capacitance, both WBL
(CC) scheme achieves write-wordline (WWL) and WBLx are pre-discharged and, depending on
boosting using write-bitline (WBL) to WWL data polarity; one of the bit-lines makes a
coupling. This eliminates the need for a power- 0 ! VCC transition. This rising transition on
hungry charge pump as well as complex level the bitline and the internal bitcell node is capaci-
shifters. The basic idea is to take advantage of the tively coupled to the floated WWL, boosting it by
large G-S/D capacitance to create a boosted volt- ~20% of VCC. This scheme is scalable to any
age on the WWL without using any charge number of bits per WWL with the scaling of the
pump, level shifter or high voltage supply WWL driver size and the per-bit coupling
(Kulkarni et al. 2010). There are two locations capacitance.
on the WWL where this capacitance can be A beneficial side effect of pre-discharging
found (Fig. 5.10). The first is at the WWL inter- both the WBL and WBLx prior to the write
face to the PMOS/NMOS devices of the final operation is a self-induced collapse (SIC) of the

Fig. 5.10 Capacitive coupled WWL boosting with self-induced-collapse (SIC)


5 Energy Efficient Volatile Memory Circuits for the IoT Era 157

virtual bitcell voltage when the WWL is asserted. SIC-only mode (Fig. 5.12). Both SIC and CC
While SIC is inherent with the CC boost boost incur an increase in array power when run
technique, it can also be used alone as a at the nominal voltage as baseline due to addi-
low-overhead write-VMIN reduction technique, tional switching of the WBLs, bitcell VCC, and
and is an effective alternative when WWL overhead circuitry. However, both techniques
boosting violates gate-oxide reliability limits at enable VMIN scaling beyond the baseline. Total
high voltage. The SIC-only mode can be enabled array power savings when operating at lower
by keeping the boost signal high (Fig. 5.10), VMIN are 12% for SIC and 27% for CC boost
ensuring that the WWL is not floated or boosted (Fig. 5.12).
during the write operation.
Figure 5.11 shows the measured write failure
rate versus supply voltage for CC boost and 5.3.2.3 Dual-VCC Design
SIC-only modes. The slope of the write PFAIL Dual-VCC based boosting selectively increases
curve is governed by the SIC magnitude, WWL the voltage of critical nodes in an 8T-bitcell
boost ratio, and how late the WBL transitions with while incurring no array area overhead. A sepa-
respect to WWL activation. Extrapolating PFAIL rate voltage VBOOST  VMAX, supplied exter-
data to 1 MB array size demonstrates 140 mV nally or generated locally from a fixed high
reduction in write-VMIN for CC boost and input voltage rail (VIN) using a step-down volt-
80 mV reduction for SIC-only mode at optimal age regulator (VR), is used to “boost” selected
timing settings at 1.6 GHz operating frequency. read/write wordlines (R/WWLs) and cell-VCC
At lower frequencies, VMIN savings increase (during read only) as shown in Fig. 5.13
to 180 mV for CC boost and 130 mV for (Kulkarni et al. 2013).

Fig. 5.11 Measured bit failure rate vs. VCC for capacitive coupling and self-induced collapse technique

Fig. 5.12 Measured VMIN improvement with frequency and power measurement
158 J.P. Kulkarni et al.

Fig. 5.13 Dual-VCC approach and array floorplan

Fig. 5.14 Measured read and write failures vs. VCC for varying boosting levels

All remaining array circuits such as R/ single-VCC design. During a write operation,
WWL pre-decoder, pre-charge logic, local selected bitcells remain at VCC while the
and global bitline (LBL/GBL) sensing, timer, WWL is boosted to mitigate contention
and column- I/O drivers are connected to the between the pass NMOS and pull-up PMOS
variable VCC  VMAX that is shared with core in the bitcell. WWL boosting also aids write
logic operating across a wide voltage range. completion by passing a strong “1” through the
By decoupling the VMIN-limiting 8T bitcell pass NMOS. A dynamic level-shifting NAND
from remaining array and core logic, overall WL decoder replaces the static single-VCC
chip VMIN can be reduced, thus improving NAND implementation while fitting in the
energy efficiency. During a read operation, same area.
selected RWL and associated bitcells are Measurement results from a 22 nm tri-gate
switched to Vboost to enable overdrive of the CMOS process show 130 mV read-VMIN
read-port transistor stack. This alleviates improvement and 290 mV write-VMIN improve-
keeper contention and also improves bitline ment when extrapolated to a 1 MB target array
evaluation delay compared to the baseline size at 1.6 GHz (Fig. 5.14). As boosting
5 Energy Efficient Volatile Memory Circuits for the IoT Era 159

Fig. 5.15 Measured total power vs. VMIN and total power vs. supply voltage

magnitude is increased, the array VMIN is pro- path pipeline stage delay, and timing errors due
gressively reduced and achieves an optimum of to dynamic variations are detected by double-
130 mV improvement for 150 mV of boosting sampling the TRC outputs. The key requirement
operating at 1.6 GHz, resulting in 27% lower is that the TRC must always fail before the criti-
power. Any further boosting does not improve cal path fails. These circuit techniques are used in
the array VMIN as it is now limited by peripheral conjunction with architecture features to support
circuits, and results in an increase in power dissi- instruction replay (to recover from a timing
pation. Operation of the dual-VCC 8T bitcell error) or dynamic clocking techniques that pre-
SRAM across a wide voltage range is achieved vent failure of the critical paths.
by gradually increasing VBOOST value as VCC is The alternative in-situ approach for timing
scaled down (Fig. 5.15). error detection uses Error Detection Sequentials
(EDS) in the critical paths of the pipeline stage.
Timing errors are detected by a double-sampling
5.3.3 Adaptive and Resilient mechanism using a flip-flop and a latch (Bow-
Techniques man et al. 2009; Ernst et al. 2003). These error
detection techniques, however, cannot be
In modern microprocessor design, the register directly used for two-phase precharge-evaluate
file operating voltage (V) and frequency (F) are domino read critical paths in high-performance
limited by the delay of the precharge-evaluate RF arrays since the data outputs are valid only
read critical path. Furthermore, additional V/F during the evaluate phase. For error detection in
guardbands are applied to account for worst- RF arrays, replica-based techniques such as Tun-
case dynamic variations such as voltage droops, able Replica Bits (TRB) have been proposed
temperature fluctuations, and aging-induced deg- (Raychowdhury et al. 2011). In this approach, a
radation. However, since most systems usually set of replica memory bits are tuned at test time
operate at nominal conditions, the fixed V/F so that in the presence of dynamic variations the
guardbands for infrequent dynamic variations TRBs fail before the worst-case memory bit fails.
significantly limit the best-achievable perfor- An alternate approach implements in-situ
mance and energy efficiency. To reduce these Timing Margin Detector (TMD) and Timing
guardbands in flip-flop based static CMOS logic Error Detector (TED) circuits for domino read
units, replica-based approaches such as Tunable paths in 8T-bitcell arrays (Kulkarni et al. 2015).
Replica Circuits (TRC) have been proposed TMD circuit enables voltage and frequency
(Tschanz et al. 2009). In this approach, a set of adaptation to low-frequency voltage variations,
replica circuits are calibrated to match the critical temperature and aging, as well as to excessive
160 J.P. Kulkarni et al.

Fig. 5.16 In-situ timing margin and timing error detection along with error compaction circuits for high-performance
adaptive and resilient domino register file design

persistent timing errors produced by certain data incur 6–13% area overhead and 0.2–0.3% power
access patterns. The timing margin is detected by overhead for a 4 KB sub-array, but this area
double-sampling the array read output and its overhead can be amortized over a larger
delayed version at the same clock edge as array size.
shown in Fig. 5.16. Figure 5.17 shows the read timing error
TED circuits enable resiliency to timing errors measurements at 440 mV under voltage and
triggered by local high-frequency voltage droops temperature fluctuations. The read failure
and nominally random data access in the pres- measurements are extrapolated to 106 to estimate
ence of within-die (WID) delay variations. The the required frequency guardbands for a 1 Mb
timing errors are detected by double-sampling target array size. With 10% voltage droop, the
the read output with a flop+latch comparison operating frequency is lowered by 8%. With tem-
(Fig. 5.16). The sensing errors in the precharge/ perature variation from 100 to 25  C, the fre-
evaluate domino read path are converted into quency further reduces by 7%. The aging
timing errors using conditional delayed bitline guardband is estimated to be 3% based on a rep-
precharge without affecting the subsequent resentative ring oscillator delay degradation.
precharge operation. An error compaction circuit Therefore a baseline design with process, volt-
combines the error output from every bit slice to age, temperature and aging guardband would oper-
generate a one-bit error-compact signal for the ate at 860 MHz at 440 mV. As frequency is
entire register file array. TMD/TED techniques pushed higher using the TMD + TED scheme,
5 Energy Efficient Volatile Memory Circuits for the IoT Era 161

Fig. 5.17 Measured read


failures vs. frequency and
throughput improvement
with adaptive and resilient
approach

throughput first increases proportionally, and then block along with the domino register file
peaks when the error rate and the corresponding operating at the same VCC and frequency. The
recovery overheads become too large. As the TRC + EDS address the dynamic variations in
recovery overhead increases, throughput improve- the static CMOS logic block. For the domino RF
ment further reduces. The maximum frequency array, TMD addresses the slow variations and
can be improved by 36–39% using the TMD/TED generates an output to the V/F adaptation con-
based V/F adaptation and error recovery via troller. The adaptation controller modulates the
replay, depending on the number of recovery over- clock generator and/or voltage regulator to adapt
head cycles. With TED, frequency can be pushed the voltage and/or frequency applied to the
by 6%, thus compensating for WID variation design. On the other hand, TED addresses the
impacts across random bitcell access patterns. fast dynamic variations that occur too quickly
These 22 nm tri-gate CMOS testchip mea- to be mitigated by V/F adaptation. The TED
surement results demonstrate that the throughput circuits give an output to the error response con-
gain is eventually limited by the replay troller which initiates a replay mechanism at
overheads. Maximum achievable throughput increased supply voltage or reduced clock fre-
gain observed is 21%. For a given throughput quency. An error-rate tracker is also included
(0.9 GOPS) the energy efficiency is improved to monitor the error rate. If the error rate exceeds
by 65–67% with a peak energy efficiency a certain threshold, a signal (ERTe) is sent to
409 GOPS/W. the adaptation controller which adapts voltage
Figure 5.18 shows the unified framework for and frequency to minimize the overheads due
the adaptive and resilient static CMOS logic to frequent replay mechanisms. In this way,
162 J.P. Kulkarni et al.

Fig. 5.18 Unified framework for adaptive and resilient logic+register file array operating on same voltage and
frequency domain

energy-efficient high-performance 8T SRAM 10T bitcells are similar to the single-ended


based register files featuring in-situ timing mar- 8T bitcell except for the read port configurations
gin and timing error detection circuits can enable as shown in Fig. 5.19. Additional transistors are
an adaptive and resilient framework for the entire used to control for the read bitline leakage
block operating on the same voltage and fre- (Calhoun et al. 2006; Kim et al. 2007). Noguchi
quency domain. et al. have proposed a single-ended transmission
gate 10T bitcell (Noguchi et al. 2008) in which
the bitcell contents are buffered using an
5.4 10T SRAM for Sub-threshold inverter and then transferred to the read bitline
Operation whenever the bitcell is accessed. The use of
a transmission gate eliminates domino-style
Most IoT systems exhibit a burst mode of opera- read-bitline sensing. Thus read bitline does not
tion followed by long idle periods. For always- require precharge and keeper transistors. Also
ON blocks performing continuous sensing, if the same data is accessed in consecutive
minimizing the state retention leakage becomes read cycles, the charging/discharging of the
an important design consideration. For such read-bitline is reduced. A differential 10T bitcell
always-ON blocks, ultra-low voltage memory with two separate ports for read-disturb-free
operation is critical for extending the battery operation has also been reported (Noguchi et al.
lifetime. Various SRAM bitcells have been 2008). Chang et al. have proposed a read-
explored for sub-threshold region operation. disturb-free differential 10T bitcell which is
suitable for bit-interleaved architecture (Chang
et al. 2008). A similar 10T cell with a column-
5.4.1 Sub-threshold SRAMs assist technique is also reported (Okumura et al.
2009). However, series-connected write
For sub-threshold operation, various 6T, 8T access transistors degrade the write-ability of
bitcells with decoupled read and write operations the bitcell and require write-assist circuits such
have been proposed (Zhai et al. 2007; Chang as wordline boosting for a successful write
et al. 2007; Verma et al. 2007). Single-ended operation.
5 Energy Efficient Volatile Memory Circuits for the IoT Era 163

Fig. 5.19 10T sub-threshold SRAM bitcell topologies

Fig. 5.20 10T Schmitt Trigger-1 (ST-1 on left) and Schmitt Trigger-2 (ST-2 on right) SRAM bitcell topologies

5.4.2 Schmitt Trigger SRAMs 1 wordline (WL) and 2 bit-lines (BL/BR).


Transistors PL-NL1-NL2-NFL form one ST
To improve the stability of the cross-coupled inverter while PR-NR1-NR2-NFR form another
inverter pair for ultra-low voltage state retention ST inverter. Feedback transistors NFL/NFR raise
in always-On domains of an IoT system, Schmitt the switching threshold of the inverter during the
Trigger based SRAM topologies have been pro- 0 ! 1 input transition enabling the Schmitt Trig-
posed (Kulkarni et al. 2007; Kulkarni and Roy ger action. The positive feedback from
2012). A Schmitt trigger is used to modulate the NFL/NFR adaptively changes the switching
switching threshold of an inverter depending on threshold of the inverter depending on the direc-
the direction of the input transition. In the tion of the input transition. During a read opera-
Schmitt trigger SRAM bitcells, the feedback tion, (with VL ¼ 0 and VR ¼ 1, for example)
mechanism is used only in the pull-down path. due to voltage divider action between the access
Figure 5.20 (left side) shows the schematics of transistor and the pull down NMOS, the voltage
Schmitt Trigger-1 (ST-1) bitcell. ST-1 bitcell of VL node rises. If this voltage is greater than
utilizes differential sensing with 10 transistors, the switching threshold (trip point) of the VR
164 J.P. Kulkarni et al.

inverter, the contents of the cell can get flipped 5.4.3 Static Memory Arrays
resulting in a read failure event. In order to avoid
a read failure, the feedback mechanism should Another approach to achieve very low voltage
increase the switching threshold of the inverter operation is to design the SRAM bitcell to avoid
PR-NR1-NR2. Transistor NFR raises the voltage contention during the write operation, and to
at node VNR and increases the switching thresh- employ hierarchical read bitline to avoid leakage
old of the inverter storing ‘1’. Thus Schmitt due to unselected bits in order to achieve high
trigger action is used to preserve the logic ‘1’ ION/IOFF ratio. Worst-case process variations
state of the memory cell. make it difficult to satisfy both read and write
Figure 5.20 (right side) shows the schematics conditions at sub-threshold voltages. A latch-
of the Schmitt Trigger-2 (ST-2) bitcell utilizing based write scheme with C2MOS tri-state
differential sensing with 10 transistors, two inverters can be more effective for sub-threshold
wordlines (WL/WWL) and two bit-lines operation (Wang and Chandrakasann 2005).
(BL/BR). The WL signal is asserted during read The read operation of the memory in
as well as the write operation, while WWL signal sub-threshold can be challenging due to bitline
is asserted during the write operation. During the leakage, where the leakage through the pull-
retention mode, both WL and WWL are OFF. In down devices causes the dynamic bitline to
the ST-2 bitcell, feedback is provided by separate drop. Because clock speed is not the key metric
control signal (WL) unlike the ST-1 bitcell, in this application, a sense-amplifier-based read-
wherein feedback is provided by the internal bitline for fast read accesses is not needed. In a
nodes. In the ST-1 bitcell, the feedback mecha- conventional dynamic bitline design, the keeper/
nism is effective as long as the storage node precharge transistor needs to be upsized to miti-
voltages are maintained. Once the storage nodes gate the leakage due to unselected bitcells. How-
start transitioning from one state to another state, ever it reduces effective read current due to lower
the feedback mechanism is lost. To improve the IREAD/IOFF ratio (Fig. 5.21).
feedback mechanism, a separate control signal Instead, a hierarchical read-bitline which
WL is employed for achieving stronger feed- segments the bitline by using a 2-to-1
back. This enables robust operation compared multiplexers can be used. Segmenting reduces
to 6T and ST-1 bitcells operating at very low parallel leakage for each level of the hierarchy
supply voltages. and the effect of process variations is mitigated.

Fig. 5.21 Static memory design using latch bitcell and hierarchical mux based read path
5 Energy Efficient Volatile Memory Circuits for the IoT Era 165

These multiplexers can be designed to avoid nanometer scale hard mask for subsequent etch-
stacked devices and sneak leakage paths by ing process. Mask-less lithography reduces the
inserting inverters between each level of number of process steps, while photoresist-free
hierarchy. technology is more immune to light/electron
interference, thus resulting in less proximity
effects and better spacing resolution. A Dynamic
5.5 SRAMs Using Emerging VDD Regulator (DVR) is used to improve the
Technologies bitcell stability. DVR increases bitcell voltage
during read operation and lowers the read disturb
Many alternative technologies such as nanowire voltage.
Field Effect Transistors (FET), Tunnel FETs,
and III–V semiconductor FETs having superior
Ion/Ioff characteristics compared to the silicon 5.5.2 Tunnel FET
MOSFTs have been explored for energy efficient
low voltage SRAM designs. The Tunneling Field Effect Transistor (TFET) is
a quantum mechanical device in which electron
transport is governed by the quantum tunneling
5.5.1 Nanowire FET across the source-channel junction instead of
thermionic barrier modulation as in MOSFETs.
Nanowire transistors with gate all around struc- Figure 5.23 shows 0.069 μm2 hybrid 6T SRAM
ture exhibit significant reduction in VT variation bitcell schematic and SEM fabricated in 65 nm
with lower channel dopant concentration as well process technology (Nirschl et al. 2005). Pull-
as superior short-channel control. Figure 5.22 down (PD) NMOS devices are formed using
shows nano-wire FET TEM cross-section and TFETs. Pass gate (PG) transistor requires bi
0.039 μm2 6T SRAM bitcell SEM image (Chen directional current conduction capability due to
et al. 2009). Nano-wire FET is fabricated using read-0 vs. write-0 scenarios. Hence TFETs are
Nano Injection Lithography (NIL) technique not used for PG but only for PD devices.
which doesn’t use photoresist as well as any An all-TFET SRAM design could be chal-
masks. This patterning technology utilizes lenging due to the unidirectional current conduc-
gas-phase reaction activated with finely- tion in TFETs, complicating its use as a SRAM
controlled electron beam to deposit the desired access transistor. Dual-port 8T SRAM bitcells

Fig. 5.22 Nanowire FET structure, 6T SRAM bitcell SEM image and read stability analysis
166 J.P. Kulkarni et al.

Fig. 5.23 TFET based 6T SRAM schematics and SEM image

Fig. 5.24 TFET based Schmitt Trigger-2 bitcell and read stability comparison

with two TFET access transistors having oppo- performance yet low-voltage SRAM designs.
site current directionality are used for read and Figure 5.25 shows Indium Antimonide (InSb)
write operation. Schmitt Trigger-2 bitcell topol- semiconductor having highest electron mobility
ogy can be realized with TFETs as shown in used to form NFET device. This could be used to
Figure 5.24 (Saripalli et al. 2011). Statistical form the read port of the hybrid 8T SRAM bitcell
simulations show extreme low voltage read oper- (Kulkarni and Roy 2008). The higher Ion current
ation for TFET based ST-2 bitcell compared to in InSb can be leveraged to improve the read
the CMOS based 6T, 10T and TFET dual port 8T bitline delay up to 60% compared to the Silicon
bitcells. Therefore, TFET devices with better CMOS based 8T SRAM.
Ion/Ioff at low supply voltages could be a suit-
able candidate for ultra-low voltage SRAM
designs in always-ON modules of an IoT system.
5.5.4 Monolithic 3-D Integration

The demand for higher SRAM density is continu-


5.5.3 III–V Quantum Well FET ally growing due to increased design complexity
(application processor, modem, and graphics on
Compound semiconductor-based quantum well the same die) and also due to increased die cost
FETs are a promising candidate to realize high- in dimensionally-scaled advanced CMOS
5 Energy Efficient Volatile Memory Circuits for the IoT Era 167

Fig. 5.25 InSb based quantum well FET structure, hybrid Si+InSb 8T SRAM bitcell and bitline delay comparison

2.0

1.6
Output, V
1.2

0.8

0.4 282mV

0.0
0.0 0.4 0.8 1.2 1.6 2.0

Input, V
SEM Cross section of the S3S cell Static Noise Margin Characteristics

Fig. 5.26 Monolithic SRAM bitcell SEM image and read stability measurements

technologies. Towards this goal, many approaches dynamic operating range of the memory by
for monolithic 3-D integration of SRAM bitcells enhancing the periphery to apply optimum access
have been reported. Figure 5.26 shows the scan- voltages (dynamic assist, dual-VCC arrays), or
ning electron microscope (SEM) cross section of optimizing the bitcell itself for low-voltage
Stacked Single crystal Silicon (S3) SRAM cell operation (10T bitcells, static memory arrays),
fabricated using 65 nm process technology achiev- or by embedding resiliency techniques to
ing 0.16 μm2 bitcell area showing almost 3 better dynamically detect and respond to variations.
bitcell density compared to the conventional These techniques differ widely in terms of area
65 nm planar bitcell (Jung et al. 2005). Load overhead, complexity, ease of integration, and
PMOS and pass gate NMOS transistors are stacked power and voltage reduction that can be
on the planar pull-down transistors in different achieved. Therefore it is important to analyze
layers. The Static Noise Margin (SNM) achieved the unique requirements for each memory array
is 282 mV at VCC ¼ 1.2 V operation. in a particular application before selecting the
best implementation for that array (Table 5.1).
In this section we give a high-level summary
5.6 Summary of the key metrics for the techniques which have
been detailed in this chapter. In addition to these
In this chapter we have shown that there are a circuit techniques, emerging technologies (such
wide variety of circuit and device techniques that as nanowires, tunnel FETs, and quantum well
can be applied for energy-efficient, low-voltage FETs) can also provide power/performance
memory arrays. These techniques improve the
168 J.P. Kulkarni et al.

Table 5.1 Qualitative comparison of various 6T, 8T, 10T SRAM VMIN and guardband reduction techniques
Technique Ref. Array type Key goal Overheada Example resultsa
Read/write assist Khellah et al. (2008) 6T SRAM VMIN Varies Varies
and Mann et al. (2010) reduction
Wordline boosting: Raychowdhury et al. 8T SRAM VMIN ~25% area 120–140 mV
charge pump (2010) reduction VMIN reduction
Wordline boosting: Kulkarni et al. (2010) 8T SRAM VMIN 5–11% area 140 mV write
capacitive coupling reduction VMIN reduction
Dual-Vcc design Kulkarni et al. (2013) 8T SRAM VMIN Extra supply rail 130 mV VMIN,
reduction 27% lower power
Subthreshold Verma et al. (2007); 10T SRAM VMIN ~50% area Sub-500 mV
SRAMs Calhoun et al. (2006); bitcells reduction operation
and Kim et al. (2007)
Schmitt trigger Kulkarni et al. (2007) 10T SRAM VMIN ~100% area ~100 mV lower
SRAM and Kulkarni and Roy bitcells reduction VMIN
(2012)
Static memory Wang and Latch- VMIN >2–3 area Sub-500 mV
Chandrakasann (2005) based reduction operation
bitcell
Resiliency: tunable Raychowdhury 6T, 8T, 10T VMIN, ~5% area 9% VMIN, 7.5%
replica bits et al. (2011) guardband power reduction
reduction
Resiliency: in-situ Kulkarni et al. (2015) 8T VMIN, 6–13% area ~36% frequency
timing monitors guardband increase
reduction
a
Note: overheads and results for these techniques are strongly dependent on process technology, array design
characteristics, optimization target, etc. The numbers shown here are meant to be only representative examples

improvements at the low voltages required for The techniques discussed here also point to
many IoT devices. the need for cross-layer optimization to reap
maximum benefit for these advanced memory
techniques while minimizing cost. IoT SoCs
5.7 Trends and Perspectives can contain hundreds of embedded memory
arrays, each with its own unique size, arrange-
The IoT domain presents both a challenge ment, access pattern, and power/performance
and opportunity for advances in low-power, requirements. Detailed architecture and system
low-voltage memory arrays. Power requirements simulations are needed early in the design phase
for many IoT devices are extremely low, requir- to determine the optimum method for imple-
ing memory arrays with low standby power as menting each memory array. Dynamic assist or
well as low access (read and write) power. Volt- multi-voltage techniques require area-efficient
age scaling of traditional SRAM arrays has and energy-efficient power delivery circuits
slowed or even stopped with technology scaling, such as on-die voltage generators, charge
which points to the need for advanced circuit pumps, and dynamic power management. Resil-
techniques for energy efficient and reliable iency techniques that detect timing margins or
on-die memory for the IoT space. At the same timing errors show the promise to drastically
time, new and emerging workloads of special reduce array operating voltages, but require
importance to IoT-for example, always-on architecture support (such as instruction replay)
audio and visual recognition workloads, cogni- to obtain the full benefit. Looking further towards
tion, etc.-require ever-increasing amounts of the future, dramatic gains in memory energy
memory storage and bandwidth. Clearly, efficiency can be obtained by over-scaling the
advancements in memory design are needed to voltage and allowing a small number of bits to
enable these usages for IoT. fail. Traditionally, these failures are either
5 Energy Efficient Volatile Memory Circuits for the IoT Era 169

avoided or handled via error-correction tech- injection lithography, nanowire channel, and full TiN
niques such as ECC for large memory arrays. gate, in Proceedings of International Electron Device
Meeting (IEDM) (2009), pp. 958–960
However, there are a class of emerging work- D. Ernst, N.S. Kim, S. Das, S. Pant, R. Rao, T. Pham,
loads such as convolutional neural networks C. Ziesler, D. Blaauw, T. Austin, K. Flautner,
that can tolerate infrequent errors in certain T. Mudge, Razor: a low-power pipeline based on
computations. These types of workloads, coupled circuit-level timing speculation, IEEE/ACM MICRO-
36 (2003), pp. 7–18
with the necessary resiliency and monitoring S.-M. Jung, Y. Rah, T. Ha, H. Park, C. Chang, S. Lee,
techniques to ensure memory operation within J. Yun, W. Cho, H. Lim, J. Park, J. Jeong, B. Son,
the desired bit error range, could provide the J. Jang, B. Choi, H. Cho, K. Kim, Highly cost effective
next big gains in energy efficient operation. and high performance 65 nm S3 (Stacked Single-
crystal Si) SRAM technology with 25F2, 0.16 μm2
cell and doubly stacked SSTFT cell transistors for
Acknowledgments Authors would like to thank ultra high density and high speed applications, Sym-
Muhammad Khellah, Arijit Raychowdhury, Keith Bow- posium on VLSI Technology (2005), pp. 220–221
man, Carlos Tokunaga, Dinesh Somasekhar, Bibiche M.M. Khellah, A. Keshavarzi, D. Somasekhar, T. Karnik,
Geuskens, Tanay Karnik, and Shekhar Borkar for insight- V. De, Read and write circuit assist techniques for
ful technical discussions. Authors would also like to thank improving Vccmin of dense 6T SRAM cell, in
Greg Taylor, Richard Forand and Matthew Haycock for Proceedings of the International Conference on
encouragement and support. This research was, in part, Integrated circuit Design and Technology (2008),
funded by the U.S. Government (DARPA). The views pp. 185–189
and conclusions contained in this document are those of T.-H. Kim, J. Liu, J. Keane, C.-H. Kim, A high-density
the authors and should not be interpreted as representing subthreshold sram with data-independent bitline leak-
the official policies, either expressed or implied, of the age and virtual ground replica scheme, in Proceedings
U.S. Government. of the International Solid State Circuits Conference
(2007), pp. 330–331
J.P. Kulkarni, K. Roy, Technology circuit co-design for
ultra-fast InSb quantum well transistors. IEEE Trans.
References Electron Dev. 55, 2537–2545 (2008)
J.P. Kulkarni, K. Roy, Ultralow-voltage process-
A. Bhavnagarwala, X. Tang, J. Meindl, The impact of variation-tolerant schmitt-trigger-based SRAM
intrinsic device fluctuations on CMOS SRAM cell design. IEEE Trans. VLSI Syst. 20(2), 319–332
stability. IEEE J. Solid State Circ. 36(4), 658–665 (2012)
(2001) J.P. Kulkarni, K. Kim, K. Roy, A 160 mV robust schmitt
K. Bowman, J. Tschanz, C. Wilkerson, S.-L Lu, trigger based sub-threshold SRAM. IEEE J. Solid
T. Karnik, V. De, S. Borkar, Circuit techniques for State Circ. 42(10), 2304–2313 (2007)
dynamic variation tolerance, in Proceedings of the J. Kulkarni, B. Geuskens, T. Karnik, M. Khellah,
46th Annual Design Automation Conference (2009), J. Tschanz, V. De, Capacitive-coupling wordline
pp. 4–7 boosting with self-induced VCC collapse for write
B.H. Calhoun, A.P. Chandrakasan, A 256 kb VMIN reduction in 22-nm 8T SRAM, International
Sub-threshold SRAM in 65 nm CMOS, in Solid State Circuits Conference (ISSCC) (2010),
Proceedings of the International Solid State Circuits pp. 234–235
Conference (2006), pp. 628–629 J.P. Kulkarni, M. Khellah, J. Tschanz, B. Geuskens,
L. Chang, Y. Nakamura, R.K. Montoye, J. Sawada, R. Jain, S. Kim, V. De, Dual-Vcc 8T-bitcell SRAM
A.K. Martin, K. Kinoshita, F.H. Gebara, array in 22 nm Tri-gate CMOS for energy efficient
K.B. Agarwal, D.J. Acharyya, W. Haensch, operation across wide dynamic voltage range, VLSI
K. Hosokawa, D. Jamsek, A 5.3 GHz 8T-SRAM Circuit Symposium (VLSI Symp) (2013), pp. C126–
with operation down to 0.41 V in 65 nm CMOS, in C127
Proceedings of the VLSI Circuit Symposium (2007), J. P. Kulkarni, C. Tokunaga, P. Aseron, T. Nguyen Jr.,
pp. 252–253 C. Augustine, J. Tschanz, V. De, A 409 GOPS/W
I. Chang, J.-J. Kim, S. Park, K. Roy, A 32 kb 10T sub- adaptive & resilient domino register file in 22 nm
threshold SRAM array with bit-interleaving and dif- Tri-Gate CMOS featuring in-situ timing margin &
ferential read scheme in 90 nm CMOS, in Proceedings error detection for tolerance to within-die variation,
of the International Solid State Circuits Conference voltage droop, temperature & aging, International
(2008), pp. 628–629 Solid State Circuits Conference (ISSCC) (2015),
H.-Y. Chen, C.-C. Chen, F.-K. Hsueh, J.-T. Liu, C.-Y. pp. 82–83
Shen, C.-C. Hsu, S.-L. Shy, B.-T. Lin, H.-T. Chuang, R.W. Mann, J. Wang, S. Nalam, S. Khanna, G. Braceras,
C.-S. Wu, C. Hu, C.-C. Huang, F.-L. Yang, 16 nm H. Pilo, B.H. Calhoun, Impact of circuit assist
functional 0.039 μm2 6T-SRAM cell with nano
170 J.P. Kulkarni et al.

methods on margin and performance in 6T SRAM. replica bits for dynamic variation tolerance in 8T
Solid State Electron. 54(11), 1398–1407 (2010) SRAM arrays, IEEE Journal of Solid State Circuits,
S. Mukhopadhyay, H. Mahmoodi, K. Roy, Modeling of 46(4), (2011)
failure probability and statistical design of SRAM V. Saripalli, S. Datta, V. Narayanan, J.P. Kulkarni,
array for yield enhancement in nanoscaled CMOS. Variation-tolerant ultra low-power hetero-junction
IEEE Trans. Comput. Aided Des. 24(12), 1859–1880 tunnel FET SRAM Design, 7th International Sympo-
(2005) sium on Nanoscale Architectures (NANOARCH)
H. Noguchi, S. Okumura, Y. Iguchi, H. Fujiwara, (2011), pp. 45–52
Y. Morita, K. Nii, H. Kawaguchi, M. Yoshimoto, Semiconductor Industry Association (SIA) and Semicon-
Which is the best dual-port SRAM in 45-nm process ductor Research Corporation (SRC) report on
technology?—8T, 10T single end, and 10T differen- “Rebooting the IT revolution” (2015)
tial, IEEE International Conference on Integrated J. Tschanz, K. Bowman, S. Walstra, M. Agostinelli,
Circuit Design and Technology (2008), pp. 55–58 T. Karnik, V. De, Tunable replica circuits and adap-
Th. Nirschl, St. Henzler, J. Fischer, A. Bargagli-Stoffi, tive voltage-frequency techniques for dynamic volt-
M. Fulde, M. Sterkel, P. Teichmann, U. Schaper, age, temperature, and aging variation tolerance,
J. Einfeld, C. Linnenbank, J. Sedlmeir, C. Weber, Digest of Technical Papers, Symposium on VLSI
R. Heinrich, M. Ostermayr, A. Olbrich, B. Dobler, Circuits (2009), pp. 112–113
E. Ruderer, R. Kakoschke, K. Schrüfer, N. Verma, A.P. Chandrakasan, 65 nm 8T Sub-Vt SRAM
G. Georgakos, W. Hansch, D. Schmitt-Landsiedel, employing sense-amplifier redundancy, in
The 65 nm tunneling field effect transistor (TFET) Proceedings of International Solid State Circuits Con-
0.68 μm2 6T memory cell and multi-Vth device, 35th ference (2007), pp. 328–329
European Solid-State Device Research Conference A. Wang, A. Chandrakasann, A 180-mV subthreshold
(2005), pp. 173–176 FFT processor using a minimum energy design meth-
S. Okumura, Y. Iguchi, S. Yoshimoto, H. Fujiwara, odology. IEEE J. Solid State Circ. 40(1), 310–319
H. Noguchi, K. Nii, H. Kawaguchi, M. Yoshimoto, (2005)
A 0.56-V 128 kb 10T SRAM using Column Line Y. Wang, Robust SRAM Design in Nanoscale CMOS
Assist (CLA) scheme, in Proceedings of the Interna- Circuit and Technology, ISSCC Forum on embedded
tional Symposium on Quality Electronics Design memories (2012)
(ISQED) (2009), pp. 659–663 N. Yoshinobu, H. Masahi, K. Takayuki, K. Itoh, Review
A. Raychowdhury, B. Geuskens, J. Kulkarni, J. Tschanz, and future prospects of low-voltage RAM circuits.
K. Bowman, T. Karnik, S.-L. Lu, V. De, M. Khellah, IBM J. Res. Dev. 47(5/6), 525–552 (2003)
PVT & Aging Adaptive Word-Line Boosting for 8T B. Zhai, D. Blaauw, D. Sylvester, S. Hanson, A
SRAM Power reduction, International Solid State sub-200 mV 6T SRAM in 0.13 μm CMOS, in
Circuits Conference (ISSCC) (2010) Proceedings of the International Solid State Circuits
A. Raychowdhury, B. Geuskens, K. Bowman, J. Tschanz, Conference (2007), pp. 332–333
S.-L. Lu, T. Karnik, M. Khellah, V. De, Tunable
On-Chip Non-volatile Memory
for Ultra-Low Power Operation 6
Meng-Fan Chang

This chapter addresses trends and challenges in battery-powered or energy-harvester-powered


the development of on-chip (embedded) IoTs or wearable devices equipped with nanome-
non-volatile memory (NVM) for ultra-low ter chips, which are particularly susceptible to
power operation. Various NVM technologies leakage current. Many energy-efficient chips
have been introduced, including Flash, employ on-chip (embedded) nonvolatile memory
OTP/MTP, resistive RAM, and phase-change (eNVM) for the storage of programs and critical
memory (PCM). In the following, we examine data while in power-off mode (Yano et al. 2012;
some of the challenges in the design of circuits Hatanaka et al. 2007; Hidaka et al. 2011; Zwerg
used for read and write operations. Future trends et al. 2011; Wang et al. 2012c, 2014; Chang et al.
in ultra-low-power NVM are also discussed. 2011a, b, 2012a, 2014b, 2015b; Chiu et al. 2010;
Yamamoto et al. 2009; Huang et al. 2014;
Eshraghian et al. 2010; Matsunaga et al. 2012;
6.1 Operational Requirements Li et al. 2014). At present, embedded Flash
of On-Chip NVM for IoT (eFlash) is the most mature and reliable solution
for the mass production of eNVM. Several other
In this section, we outline the requirements of emerging memory technologies, such as Phase
on-chip NVM for IoT devices under low-power Change Memory (PCM) (Wen et al. 2011; Rizzi
operation and provide a review of current state- et al. 2014; Pozidis et al. 2013; Boniardi et al.
of-the-art on-chip NVMs. 2014), Spin-Torque Transfer Magnetic RAM
(STT-MRAM) (Kim et al. 2011; Kitagawa et al.
2012; Wang et al. 2012b; Tsunoda et al. 2013),
6.1.1 Operation of On-Chip NVM and ReRAM devices (memristor or resistive
for IoT and Normally-Off RAM, ReRAM) (Chang et al. 2011a, b, 2012a,
Applications 2013a, b, c, 2014a, b, c, 2015b; Chiu et al. 2010;
Yamamoto et al. 2009; Huang et al. 2014;
Numerous energy-efficient systems employ Eshraghian et al. 2010; Banno et al. 2014; Xue
intelligent power on-off schemes to reduce sys- et al. 2013; Wataru et al. 2011; Fackenthal et al.
tem standby power, as shown in Fig. 6.1. This is a 2014; Liu et al. 2013b; Kawahara et al. 2012;
particularly important issue when dealing with Wang et al. 2012a; Koveshnikov et al. 2012;
Zhang et al. 2014; Lee et al. 2010; Hsieh et al.
M.-F. Chang (*) 2013; Sakotsubo et al. 2010; Jo et al. 2014; Serb
National Tsing Hua University, Hsinchu City, Taiwan et al. 2014; Yang et al. 2012; Nardi et al. 2011),
e-mail: mfchang@ee.nthu.edu.tw

# Springer International Publishing AG 2017 171


M. Alioto (ed.), Enabling the Internet of Things, DOI 10.1007/978-3-319-51482-6_6
172 M.-F. Chang

Fig. 6.1 Underlying concepts in the use of (a) low-voltage or (b) power interrupt to reduce power consumption

stored in eNVM is uploaded to on-chip instruction


buffer (Frustaci et al. 2014; Chen et al. 2014; Oh
et al. 2014; Hamzaoglu et al. 2014; Song et al.
2014), whereas the data is uploaded to an on-chip
data buffer, such as SRAM or DRAM (Frustaci
et al. 2014; Chen et al. 2014; Oh et al. 2014;
Hamzaoglu et al. 2014; Song et al. 2014). In com-
putation mode, the volatile memory macros pro-
vide access to program-code and perform data
buffering while the eNVM is used for run-time
Fig. 6.2 Operation of NVM in high-performance IoT data back-up. During power-down procedures, all
devices critical data is flushed to eNVM.
As shown in Fig. 6.3, most low-cost IoT
also are good candidates for IoT devices. In devices use eNVM for power-off storage as
Sect. 6.2, we discuss on-chip NVM devices for well as power-on program-code access as a
IoT applications. means of reducing reduce chip area. Eliminating
As shown in Fig. 6.2, most high performance the SRAM instruction macro means that eNVM
IoT (HP-IoT) devices require large-capacity must perform frequent-read and infrequent-write
eNVM for the storage of long stretches of actions, thereby necessitating a reduction in the
program-code and large volumes of data. power consumption associated with read
This commonly involves the use of two discrete operations in order to reduce overall power
volatile memory devices (i.e., SRAM or eDRAM consumption.
macros) to provide access to instructions and data. Figure 6.4 illustrates the standby energy con-
During power-on procedures, program-code sumption of a SRAM macro at the data-retention
6 On-Chip Non-volatile Memory for Ultra-Low Power Operation 173

voltage (DRV) and that of the memory power against a low-voltage on-chip SRAM during
interrupt approach (eNVM-based two-macro sleep-mode, where VDD is biased at DRV.
scheme) for a given memory capacity. Standby BET is shorter in more advanced nanometer
energy consumption of two-macro scheme technologies due to an increase in SRAM leak-
includes the energy consumed by data backup age current. This makes the two-macro eNVM-
operations (SRAM read out + eNVM write in) based two-macro scheme more energy efficient
as well as in the restore operation (eNVM read than DRV-biased SRAM in cases of extended
out + SRAM write in). Compared with an system standby period.
SRAM macro held at a DRV, the memory
power interrupt approach (one power-off and
one-power on operation) using a eNVM-based
6.1.2 Requirements of On-Chip NVM
two-macro scheme consumes less standby
for Ultra-Low Power Operation
energy when the standby period exceeds the
break-even time (BET). BET refers to the time
In this subsection, we discuss the requirements of
required for a eNVM-based two-macro device to
on-chip NVM to reduce the power consumption
break even in power consumption when balanced
of various applications involving IoT devices.
Figure 6.5 presents a simplified illustration of
a typical on-chip NVM macro, which includes a
cell array, row drivers, input-output (IO) circuits,
timing controller, and high-voltage generator.
In a cell array, each NVM cell is connected to
three signals: word-line (WL), source-line (SL),
and bit-line (BL). The row drivers include
row-decoders, level-shifters, and drivers for the
word-line and source-lines.
The IO circuit includes column multiplexors
Fig. 6.3 Operation of NVM in low-cost IoT devices as well as read-path and write-path sub-circuits.

Fig. 6.4 Standby energy consumption of a SRAM macro and two-macro schemes
174 M.-F. Chang

Fig. 6.6 Power breakdown in read operation of an NVM


macro

control signal toggling and data access for a


Fig. 6.5 Simplified structure of typical on-chip NVM processor. As shown in Fig. 6.7, NVM with
macros low-voltage read operations further reduces
power consumption, particularly in IoT devices
The read-path includes output buffers and sense with frequent-read-seldom-write operations.
amplifiers. The write-path includes write-drivers In Sect. 6.3, we discuss various design
and input buffers. challenges and outline a number of solutions
Most high-voltage generators employ an that have been proposed for low-power read
on-chip charge-pump circuit to generate the schemes.
required voltage, which is higher than the supply Figure 6.8 illustrates the normalized power
voltage (VDD) for write operations in NVM breakdown in the write operation of a typical
devices. embedded Flash memory device. Most NVM
As discussed in Sect. 6.1.1, low-cost IoT devices require high voltages and excessive
devices require eNVM to reduce power consump- time for write operations, such that most of the
tion when performing read operations. Figure 6.6 write power is consumed by sub-blocks using
presents the normalized power breakdown of a DC-current or through continuous toggling
typical read operation in an embedded Flash throughout the write period. The sub-blocks that
memory. The following approaches can be used consume much of the write power include
to reduce read power: (1) low bit-line bias row-drivers (level-shifters), write-drivers, and
(precharge) voltage to suppress power consump- the charge-pump circuit. This means that an
tion by the cell array. (2) Small input-offset sense on-chip NVM with low write power should
amplifier to reduce the DC-current consumption employ low write-energy NVM devices as well
time and read delay time. as means of suppressing power consumption of
Many energy-efficient and energy-harvesting- sub-blocks.
powered devices (Zwerg et al. 2011; Wang et al. Power integrity is another important concern
2012c; Tang et al. 2014) must deal with a limited for most chips (Jain et al. 2011; Chang et al.
supply voltage and strict power budget. 2015e), particularly those used in low-voltage
Implementing low VDD is an effective approach IoT devices. Supply noise can induce functional
to reduce dynamic power consumption through failures in supply-sensitive circuits, such as
the suppression of voltage swing associated with on-chip mixed-signal and memory circuits
6 On-Chip Non-volatile Memory for Ultra-Low Power Operation 175

Fig. 6.7 Use of low-voltage to reduce dynamic power consumption in IoT chips

6.2.1 Embedded Flash (eFlash)


Memory

On-chip Flash memory cells employ two basic


structures: split-gate or stack-gate. Both of these
structures feature a floating-gate (FG) beneath a
control-gate (CG) for the storage of charge used
to alter the threshold voltage (VTH-MC) of an
N-channel transistor memory cell. The charge
can be stored in the form of electrons or holes
(Zhang 2009). Flash memory performs two write
operations: program and erase. Program
operations involve the injection of charge into
the FG to achieve a high VTH-MC in order to
yield a tiny amount of cell current (ICELL-PRG)
Fig. 6.8 Normalized power breakdown of write opera- in read mode. Erase operations remove the
tion in a typical embedded Flash memory device
negative charge from the FG in order to reduce
VTH-MC and generate substantial cell current
(Chang and Yang 2009; Larsson et al. 2001a;
(ICELL) in read mode. These cell structures
Mezhiba and Friedman 2004). Supply noise can
employ either a floating-poly-gate or charge-
also affect program-verify operations and the
trapping technology (Kuo et al. 1998; Kamiya
threshold voltage distribution of NVM cells.
et al. 1982; Kianian et al. 1994; Liu et al. 1997;
These issues underline the importance of devel-
Takahashi et al. 1999; Eitan et al. 1999; Chen
oping low peak-current CPs for 3D-ICs and
et al. 1997; Lee et al. 2006a; Yater et al. 2007).
3D-memory modules.
Figure 6.9 presents a simplified illustration of
In Sect. 6.4, we discuss the design challenges
a typical one-transistor (1T) stack-gate Flash
and outline a number of solutions that have been
device in which the control-gate and floating-
proposed for low-power write schemes.
gate are stacked vertically. The 1T stack-gate
achieves a compact cell area and is commonly
6.2 On-Chip NVM Devices utilized in discrete memory products, such as
NOR and 2D NAND Flash memories.
In this section, we introduce various mass- In program operations (Zhang 2009), a
production-ready on-chip NVM devices for IoT portion of the channel hot electron (CHE),
applications, including embedded Flash and one- generated by the source-to-drain current, is
time-programmable (OTP) devices. Two injected into the FG through the bottom oxide
emerging memory devices with high signal in order to negatively charge the FG, thereby
ratio, resistive memory (ReRAM) and phase- increasing the VTH-MC. In erase operations, high
change memory (PCM), are also introduced. voltage is applied to the channel (substrate) to
176 M.-F. Chang

Fig. 6.9 Simplified structure of typical one-transistor (1T) stack-gate Flash device

Fig. 6.10 Simplified structure of (a) conventional and (b) recently developed split-gate Flash cells

initiate the Fowler-Nordheim (FN) electron commodity NVMs, thereby precluding the use
tunneling mechanism between the FG and the of complex program-verify schemes to avoid
channel. The negative charge is then removed the issue of over-erase.
from the FG through the bottom tunneling Since the early 1980s, numerous split-gate
oxide in order to lower the VTH-MC. cells (Kianian et al. 1994; Yater et al. 2007;
This 1T stack-gate cell features a compact cell Van Houdt et al. 1992; Ahn et al. 1998; Huang
size but suffers from issues related to (1) high et al. 2000; Mih et al. 2000; Tsouhlarakis et al.
program current and (2) over-erase. High pro- 2001; Wang et al. 2005; Saha 2007; Cho et al.
gram current necessitates the consumption of 2006; Kotov et al. 2013; Kono et al. 2013) have
high write power as well as a charge-pump cir- been proposed. Figure 6.10 illustrates the
cuit with large area overhead. Over-erase results simplified structure of conventional and recently
in ultra-low VTH-MC and considerable bit-line developed split-gate Flash cells. In conventional
leakage current from unselected erased cells dur- split-gate cells, a portion of the floating gate is
ing read operations, thereby degrading the sens- placed beneath the control gate, such that the
ing margin. Unfortunately, most on-chip NVM channel of a memory cell transistor is controlled
devices have less memory capacity than do by the control gate (wordline) as well as the
6 On-Chip Non-volatile Memory for Ultra-Low Power Operation 177

floating gate. This causes the split-gate Flash OTP cells cannot be used for more than 100
memory cell to act as two transistors operating program/erase cycles.
in serial, equivalent to 1.5T per cell. This 1.5T Two types of OTP cells are commonly used
serial-transistor structure makes it possible to for logic processes: floating-gate (FG) and anti-
turn off unselected erased cells using the control fuse (AF) based OTP.
gate, even in cases of a negative VTH-MC caused Logic process compatibility means that
by the floating gate. Unlike stack-gate cells, split- FG-based OTP (FG-OTP) cannot use stack-gate
gate cells are immune to over-erase, which or split-gate structures, as in eFlash. Thus, OTP
eliminates the need to boost the wordline usually uses a two-transistor (2T) cell structure,
(WL) voltage for fast read operations. In recent in which one transistor functions as the FG, while
split-gate Flash cells, most of the area of the the other transistor functions as the control gate
floating-gate is located beneath the control gate. (CG). The program operation of FG-OTP usually
A side-wall select gate (SG) or erase gate uses a channel hot electron injection (CHI)
(EG) can also be added to facilitate the inclusion mechanism for the injection of negative charge
of high-voltage and low-voltage areas within a into the FG, such that the FG-transistor ends up
split-gate cell (Kono et al. 2013). with a high threshold voltage (VTH). The erase
A source-side injection (SSI) program scheme operation of FG-OTP uses either UV light or the
reduces the current required by split-gate cells as Fowler-Nordheim (FN) electron tunneling mech-
well as the power consumed in program anism to reduce the negative charge on the FG,
operations. Moreover, the small gap between which means that a FG-transistor has low VTH.
the floating gate and control gate enables rapid Due to the non-permanence of programming
erase operations via poly-to-poly FN tunneling behavior, most FG-OTP devices are capable of
operations (Kianian et al. 1994) or hot hole injec- multiple program/erase cycles. Figure 6.11
tion via band-to-band tunneling (BTBT) (Eitan presents an example of FG-based OTP.
et al. 2000). Small program current, fast erase Anti-fuse based OTP (AF-OTP) is another
operations, and immunity to over-erase have popular solution for cost-sensitive chips that do
made split-gate flash cells popular for on-chip not require repeated program operations. Most
(embedded) applications, particularly those AF-OTP cells initially have high impedance
used in low-power applications. between the gate and channel. During program
operations, a strong electrical field is applied
across the gate oxide. With sufficient program
6.2.2 Embedded OTP/MTP Memory time and electric field, the gate oxide breaks
down, thereby reducing the impedance between
For IoT devices that do not require frequent code the gate and channel. This gate oxide breakdown
modification, low-cost logic-process-compatible structure provides highly secure mechanism for
on-chip one-time-programmable (OTP) provide the storage of information. Once the information
a viable alternative for on-chip NVM. Generally, is stored, it is nearly impossible for it to be

Fig. 6.11 Typical structures used in FG-OTP cells


178 M.-F. Chang

WL Pass transistor WLr WLp


Programmable element
Floated
Sidewall spacer Sidewall spacer BL
Poly Gate Poly Gate terminal
BL Polysilicon

Gate Oxide N+ Body N+ Body N+

BOX
Diffusion Diffusion
Field Oxide Field Oxide Substrate

Fig. 6.12 Typical structure used in AF-OTP cells

Fig. 6.13 Write time vs. write voltage in recent NVM devices

changed without completely destroying the data. memory (PCM) (Wen et al. 2011; Rizzi et al.
However, the gate oxide cannot be recovered 2014; Pozidis et al. 2013; Boniardi et al. 2014),
following breakdown; therefore, the AF-OTP resistive RAM (ReRAM) (Chang et al. 2011a, b,
can be programmed only once. This limits the 2012a, 2013a, b, c, 2014a, b, c, 2015b; Chiu et al.
testing coverage of any chip using AF-OTP. 2010; Yamamoto et al. 2009; Huang et al. 2014;
Figure 6.12 presents several examples of Eshraghian et al. 2010; Banno et al. 2014; Xue
AF-based OTP. et al. 2013; Wataru et al. 2011; Fackenthal et al.
2014; Liu et al. 2013b; Kawahara et al. 2012;
Wang et al. 2012a; Koveshnikov et al. 2012;
6.2.3 Emerging Resistive-Type Zhang et al. 2014; Lee et al. 2010; Hsieh et al.
ReRAM for Embedded 2013; Sakotsubo et al. 2010; Jo et al. 2014; Serb
Applications et al. 2014; Yang et al. 2012; Nardi et al. 2011),
and spin-torque transfer magnetic RAM
As Fig. 6.13 shows, a variety of emerging (STT-MRAM) (Kim et al. 2011; Kitagawa et al.
resistive-type memory devices have been pro- 2012; Wang et al. 2012b; Tsunoda et al. 2013).
posed to achieve faster write times than those of PCM and ReRAM both feature a high resistance-
on-chip flash memory with lower write-voltage ratio (R-ratio) but suffer from limited endurance.
and write-energy consumption. Those resistive- STT-MRAM has high endurance but a small
type memory devices include phase change R-ratio (or TMR ratio). The R-ratio
6 On-Chip Non-volatile Memory for Ultra-Low Power Operation 179

(¼RHRS/RLRS) represents the ratio between the in bipolar ReRAM devices (Chang et al. 2013a,
cell resistance values in high resistance state b, 2014a; Banno et al. 2014; Xue et al. 2013;
(HRS, RHRS) vs. the values in a low resistance Wataru et al. 2011; Fackenthal et al. 2014; Liu
state (LRS, RLRS). In the following subsection, et al. 2013b; Kawahara et al. 2012; Wang et al.
we introduce two high R-ratio resistive-type 2012a; Koveshnikov et al. 2012; Zhang et al.
NVM devices, PCM and ReRAM. 2014; Lee et al. 2010; Jo et al. 2014; Serb et al.
ReRAM and PCM devices are capable of 2014; Yang et al. 2012; Nardi et al. 2011) is
performing two direct overwrite operations: achieved by providing different voltage
SET and RESET. The SET operation changes polarities across the two terminals (top and bot-
the ReRAM device from a high-resistive state tom plates) for SET and RESET operations. The
(HRS) to a low-resistive state (LRS). The switching of cell resistance state in unipolar
RESET operation changes the ReRAM/PCM ReRAM devices (Chang et al. 2013c, 2014a, c;
device from low resistance (RL, LRS) to high Hsieh et al. 2013; Sakotsubo et al. 2010) is
resistance (RH, HRS). For most embedded achieved by providing the same voltage
applications, ReRAM/PCM macros employ a polarities but with different amplitudes across
one-transistor one-resistor (1T1R) structure. the two terminals of the ReRAM device for
The materials used in ReRAM devices vary SET and RESET operations.
widely with regard to write-speed and write- It should be noted that in bipolar-ReRAM
current performance. The write behavior of most cells, the polarity of the bias voltage for read
ReRAM devices is either bipolar or unipolar. operations is the same as for either a SET or
Figure 6.14 illustrrates the simplified cell RESET operation. As a result, bipolar ReRAM
structure of 1T1R ReRAM cells using back-end- devices suffer from read disturbance when under
of-line (BEOL) processing. Bipolar and unipolar high stress voltage (VR) and/or extended stress
ReRAM devices can be designed with the same time (TR). As shown in Fig. 6.16, the cell read
cell array structure, in which all 1T1R ReRAM current (ICELL) of an HFO-based bipolar ReRAM
cells in a given row share the same word-line device (Lee et al. 2010) gradually decreases with
(WL) and source-lines (SLs), while all cells in a an increase in TR at VR ¼ 0.5 V. This means that
given column share the same bit-line (BL). LRS resistance increases slowly (is disturbed)
Figure 6.15 presents the IV-curves and bias during the read process. Low bit-line bias voltage
conditions of bipolar and unipolar ReRAM (VBL) is required to prevent read disturbance
devices. The switching of cell resistance states when reading from bipolar ReRAM cells.

Fig. 6.14 Simplified


diagrams of (a) cell
structure and (b) mini-array
of 1T1R ReRAM cells
180 M.-F. Chang

Fig. 6.15 IV-curves of (a)


bipolar and (b) unipolar
ReRAM devices

Fig. 6.17 Typical current–voltage (I–V) curves of PCM


Fig. 6.16 Cell read current vs. stress time of HFO-based devices
ReRAM device under various stress voltages
of PCM were first reported in the 1960s
6.2.4 Emerging Resistive-Type (Ovshinsky 1968), only in the past two decades
ReRAM for Embedded has it become possible to produce PCM chip-
Applications: PCM level devices with performance comparable to
that of other contemporary memory technologies
Phase Change Memory (PCM) cells are based on (Lai et al. 2003). In this subsection, we introduce
a chalcogenide compound produced from Ge, Sb, the basic operations of PCM and outline a num-
and Te (GST). These devices exploit the large ber of recent developments.
resistance contrast between crystalline and amor- Figure 6.17 presents typical current–voltage
phous phases to enable the storage of data. (I–V) curves of a PCM device, revealing two
Despite the fact that the fundamental principles important characteristics: (1) threshold switching
6 On-Chip Non-volatile Memory for Ultra-Low Power Operation 181

Fig. 6.18 Conventional


mushroom-type PCM cell
and conceptual illustration
of SET and RESET
operations in PCM

and (2) phase transformation. Threshold researchers have reported that critical device per-
switching occurs when the applied voltage formance indicators (e.g., switching speed and
increases beyond the voltage snapback point at thermal stability) are largely a function of Ge-Sb-
Vth, whereupon the material enters a conductive Te composition and doping. Those studies sought
state due to impact ionization and carrier recom- to engineer GeSbTe ternary alloys along isoelec-
bination (Pirovano et al. 2004). Phase transition tronic tie lines and the Ge/Sb2Te3 tie lines in the
to a crystalline or amorphous state depends on search for high performance materials (Cheng
programming pulse conditions and can be et al. 2011, 2012).
induced by a further increase in the current
amplitude.
The transition from an amorphous state to a 6.3 Read Circuits for On-Chip NVM
crystalline state is commonly referred to as a
SET operation, the reverse of which is a In this section, we outline the general behavior
RESET operation. Figure 6.18 presents a concep- and introduce the key sub-circuits used in the
tual illustration of SET, RESET, and Read read operations of on-chip NVM. The read
operations. In a SET operation, the application operations in this section include current-mode
of an electrical pulse heats the PCM cell to above and voltage-mode read schemes. We also explore
its crystallization temperature (TCRYS), where- various challenges in the development of read
upon the slow-quench time allows the GST mate- circuits for low-power on-chip NVM and review
rial to return to a crystalline state. Conversely, a number of advanced techniques used in the
RESET operations involve heating the PCM cell development of read circuits.
to above its melting temperature (TMELT) and
abruptly cutting off the heat source in order to
melt-quench the material, thereby leaving it in an
amorphous state. For read operations, cell resis- 6.3.1 General Operation and Sub-
tance is measured by passing an electrical current circuits for Read Operations
small enough to prevent disturbing the existing
state of the material. On-chip NVM commonly employs one of two
Researchers have recently focused on improv- read schemes: voltage mode and current mode
ing device characteristics through the use of sensing (Chang et al. 2015a).
material engineering and innovative device
structures. Those structures include (a) conven- 6.3.1.1 Current-Mode Sensing
tional mushroom-type device structures (Shih Figure 6.19 illustrates the circuit structure and
et al. 2008), (b) dash-type confined cells waveforms of a typical current-mode sensing
(Im et al. 2008), (c) thermally-confined bottom scheme, in which the bit-line (BL) voltage is
electrodes (Wu et al. 2011), and (d) sidewall bot- maintained at a constant value (BL clamping
tom electrodes (Wu et al. 2015). Several voltage or VPRE) throughout the sensing period.
182 M.-F. Chang

Fig. 6.19 (a) Circuit structure and (b) waveforms of typical current-mode sensing scheme

The operation of current-mode sensing pro- consequently faster bit-line bias speeds, at the
ceeds through three phases: bit-line precharge, cost of additional area and power overhead.
current development, and current comparison.
In the bit-line precharge phase (CP1), a 6.3.1.2 Voltage-Mode Sensing
precharge transistor is turned on to raise the Figure 6.21 presents the read path and a concep-
bit-line voltage from 0 V to the precharge voltage tual waveforms used in a typical voltage-mode
(VPRE) or bit-line clamping voltage (VBL-CLP). In sensing scheme. The operation of voltage-mode
the ICELL development phase (CP2), VBL-CLP sensing proceeds through three phases: bit-line
varies the cell read current (ICELL) in accordance precharge, bit-line voltage development, and
with the resistance of the accessed memory cell. voltage comparison. In the bit-line precharge
The ICELL of a logic-0 cell (ICELL-0) is larger than phase (VP1), a precharge transistor is turned on
that of a logic-1 cell (ICELL-1) under the same to raise the BL voltage from 0 V to a precharge
VPRE or VBL-CLP. Finally, in the current compari- voltage (VPRE). In the bit-line developing phase
son phase (CP3), a current-mode sense amplifier (VP2), bit-line voltage (VBL) begins falling in
(CSA) is used to compare the ICELL values using accordance with the cell resistive state of the
a reference current (IREF), whereupon a digital accessed memory cell. When a logic-1 (pro-
output is generated at data-out (DOUT). gram/HRS) cell is read, the cell read current
Figure 6.20 presents the two bit-line clamping (ICELL) of the logic-1 cell (ICELL-1) is small,
methods commonly used for current-mode sens- such that the bit-line is maintained near VPRE.
ing: static and dynamic. In static clamping, a When a logic-0 (erase/LRS) cell is read, a larger
clamping transistor (CLP) is controlled through ICELL (ICELL-0) causes the bit-line to drop more
the application of constant VBL-CLP. In dynamic rapidly, thereby generating a bit-line voltage
bias clamping, the bit-line voltage is amplified swing (VBLS) exceeding that of the logic-1 cell.
and fed back to the gate of the CLP. Dynamic Finally, in the voltage comparison phase (VP3), a
biasing provides a larger CLP gate swing and voltage-mode sense amplifier (VSA) is used to
6 On-Chip Non-volatile Memory for Ultra-Low Power Operation 183

Fig. 6.20 Typical (a) static and (b) dynamic bit-line clamping schemes

Fig. 6.21 (a) Circuit structure and (b) waveforms of typical voltage-mode sensing scheme
184 M.-F. Chang

compare VBL with a reference voltage (VREF), (IBL-LEAK) from unselected cells as well as the
whereupon a digital output is generated at data- input offset of the CSA. However, process varia-
out (DOUT). tion can lead to local fluctuations in VBL-CLP.

Generation of Reference Current Vs. Process


6.3.2 Challenges in Current-Mode Variation
Read Schemes and Examples As described above, data stored in a flash cell is
accessed using a current-sense amplifier (CSA)
In this subsection, we explore some of the to compare the cell current (ICELL) of the
challenges associated with the development of accessed NVM cell with a reference current
current-mode read schemes aimed at achieving (IREF). However, a fixed IREF is unable to match
small cell current, low-voltage operation, and the ICELL fluctuations across various processes,
high-speed read operations. Examples of voltages and temperature (PVT) conditions. This
advance circuit techniques are also reviewed. has led to the application of replica cell schemes
(Kobayashi et al. 1990; Bauer et al. 1995;
6.3.2.1 Challenges in Current-Mode Micheloni et al. 2003; Le et al. 2004; Ogura
Sensing et al. 2006; Chung et al. 2007; Chang and Shen
2009) to enable the generation of a IREF capable
BL Clamping Voltage Vs. Read-Disturb of tracking variations in ICELL across a range of
Current mode sensing requitres that the bit-line PVT conditions. Figure 6.22 presents a nominal
be clamped at a constant voltage (VBL-CLP). replica-cell current sensing circuit for eNVM.
Limitations in clamping voltage are determined This replica array comprises inactive as well as
by the read disturb behavior of eNVM devices. active replica eNVM cells. An active replica cell
To ensure robust, high-speed read operations, emulates regular eNVM cells and provides a
VBL-CLP should be as high as possible in replica cell current, referred to as ICELL-REPLICA.
order to compensate for bit-line leakage current Inactive replica cells preserve neighboring

Fig. 6.22 Nominal


replica-cell current sensing
circuit for eNVM
WL Driver

Main Cell Array

Replica
Dummy Control

Array
Column MUX

BLC BLC BLC


VREF (IREF-SA = SR x IREF )

CM CM

IREF-SA ICELL IREF-SA ICELL Current Mirror


(CM)
SA SA

DOUT[0] DOUT[7]
6 On-Chip Non-volatile Memory for Ultra-Low Power Operation 185

geometric patterns, as in a regular cell array, in 6.3.2.2 Advanced Circuit Techniques


order to minimize mismatch between ICELL and for Current-Mode Read
ICELL-REPLICA. Inactive replica cells also provide Operations
the same parasitic load for a replica bit-line as for
a regular bit-line. Generation of Reference Current
In cases where a hard defect or significant with Narrow Distribution
process variation occurs in a replica cell, the Most floating-gate (FG)-based replica cells
resulting reference current (IREF) would have connect selected floating gates to their control
inaccurate values or vary considerably across gates at the edge of the replica array. This is
dies. This would subsequently leads to read done to avoid the need for additional erase cycles
failure or long read access times and increases and thereby maintain the voltage potential of the
power consumption. Accordingly, the generation floating gate in a replica cell following multiple
of read reference current is a crucial issue in the access cycles. The sharp corners and special
design of eNVM. shape of the floating-gate layer in split-gate
flash technology make it difficult to maintain
uniformity and low resistance in the metal-to-
Low-Voltage Operation Vs. Voltage
FG contacts and floating-gate poly. Therefore,
Head Room
the resistance between the control-gate line
Figure 6.23 outlines the concept of voltage bud-
(dummy word-line) and the floating gate
get in a current-mode sensing scheme. Efforts to
(RDWL-FG) of accessed replica cells varies across
reduce the supply voltage (VDD) in current-
rows/dies. Differences in RDWL-FG affect the
mode sensing are constrained by the voltage
turn-on times of the selected replica cell and
headroom of bit-line voltage clamper (BLC)
produce fluctuations in read reference current
and the diode-connected current-mirror (CM)
(IREF) settling time, which can result in
circuitry. As shown in Fig. 6.22, a 3 μA bit-line
ringing at the data-out (DOUT), long-address
current requires 400 mV of headroom for a
access times (TAC), and increased power con-
typical 65 nm current-mirror circuit (Chang et al.
sumption. FG-always-high and pre-stable current
2013c). Low VDD suppresses the effective volt-
sensing (PSCS) schemes have been proposed
age on the BL (VBL) as well as the voltage across
to overcome RDWL-FG induced fluctuations
the eNVM device (VNVM). This leads to small cell
in IREF settling time, as shown in Fig. 6.24.
read current, which greatly reduces the sensing
margins and read speeds when VDD is low.

Fig. 6.23 (a) Concept of voltage budget and (b) bit-line voltage vs. bit-line current in current-mode sensing scheme
186 M.-F. Chang

Fig. 6.24 (a) FG-always-high and (b) pre-stable current sensing (PSCS) replica cell schemes
6 On-Chip Non-volatile Memory for Ultra-Low Power Operation 187

Fig. 6.25 IREF-generation schemes based on replica-cells for resistive-type memory: (a) k-parallel-cell, (b) high-low-
average, and (c) parallel-series reference cell (PSRC) schemes

FG-always-high (Chih et al. 2002) is a hidden in the ending period of an erase/program


straightforward circuit in which the floating cycle prior to subsequent read cycles. The replica
gate is kept high at all times, making this a cells in PSCS undergo turn-on only in read
non-switching approach. However, this cycles; therefore, they experience less voltage-
approach exposes the FG-oxide of a replica stress and consume less power than the case in
cell to longer voltage-stress times than that of the FG-always-high scheme. Moreover, with
regular Flash cells, which are not exposed to multiple replica-cells turning on simultaneously
high FG voltage in most applications. This (Chang and Shen 2009), the distribution of IREF
means that accessed replica cells in the FG- can be narrowed.
always-high scheme are more susceptible to Several IREF-generation schemes based on
the effects of aging than are nominal cells. replica-cells have been proposed for resistive-
Moreover, the read reference current generation type emerging memory, including k-parallel-cell,
circuit in the FG-always-high scheme consumes high-low-average, and parallel-series reference
power even during non-read cycles (erase and cell (PSRC), as shown in Fig. 6.25. The k-parallel-
program cycles), due to the fact that one of the cell scheme provides the average ICELL of k replica
replica cells is always on. LRS cells. High-low-average schemes provide the
In the PSCS scheme (Chang and Shen 2009), average ICELL of LRS and HRS cells, as long as the
the turn-on of IREF generation and the switching R-ratio is small. However, in cases of large varia-
of the floating gate in the replica-cell occur in the tion in cell resistance, k-parallel-cell and high-
final period of an erase/program cycle or during low-average schemes both suffer from wide IREF
the power-up period, rather than at the beginning distribution, due to the fact that IREF is dominated
of every read cycle, as in conventional flash by the ultra-low-resistance of the replica cell
memories. The time required to complete the (a tail bit). In contrast, PSRC eliminates the depen-
erase/program operation is generally far longer dence of IREF on a single tail-bit reference cell by
than the worst-case settling time of IREF in having two parallel LRS sets connected in serial.
foundry-provided eFlash macros (Datasheet This enables the PSRC scheme to narrow the
2001, 2005). Thus, the FG switching time of distribution of IREF associated with variations in
replica cells and the settling time of IREF can be RHRS/RLRS.
188 M.-F. Chang

Current-Mode Sensing with Small high-speed sensing without the need for
Input-Offset run-time offset-cancellation operations. A digital
Several CSA schemes with small input-offset offset cancellation CSA (DOC-CSA) (Kono et al.
have been proposed for high-speed or small 2013) was proposed for the application of addi-
cell-current eNVMs. tional offset-cancel current (Ioc) to the differen-
Figure 6.26 presents a small-offset CSA tial inputs (LBLL or LBLR) of the comparator.
sampling scheme using threshold voltage (VTH) The amount of Ioc is digitally calibrated
(Javanifard et al. 2008). The VTH of the NMOS according to the mismatch of the comparator in
latch (M1/M2) is stored in capacitors C1 and C2 the wafer testing stage. In the calibration-based
to provide offset compensation. This scheme asymmetric-voltage-biased CSA (AVB-CSA)
succeeds in achieving small input offset but (Chang et al. 2015c) in Fig. 6.28b, a pair of
requires considerable area overhead due to its asymmetric precharge voltages (VAP-CP and
use of numerous switches. VAP-RP) are applied to the differential node
Figure 6.27 presents two CSA schemes based (CP and RP nodes) of a latch-type comparator
on current-sampling (IS-CSA). As shown in prior to the sensing operation. VAP-CP and VAP-RP
Fig. 6.27a, unlike the VTH-sampling scheme, are calibrated during wafer testing in accordance
IS-CSA (Chang et al. 2013a, b) employs two with the offset voltage of the read-path detected
capacitors (C1/C2) to store the gate-source volt- at nodes CP and RP. Accordingly, AVB-CSA
age difference (VGS) of critical transistors suppresses the offset caused by the comparator
(M1/M2) in order to yield the ICELL and IREF. as well as the offset produced along the entire
IS-CSA then uses the sampled ICELL and IREF for read-path (IOS-SUM).
second-stage amplification operations, which are
insensitive to variations in the VTH of M1/M2. Low-Voltage Current-Mode Sensing
As with IS-CSA, time-differential IS-CSA Figure 6.29 illustrates a low-voltage, body-drain
(TD-IS-CSA) (Jefremow et al. 2013) uses a sin- driven (BDD) CSA (Chang et al. 2013c). A
gle capacitor for the sampling of IREF, as shown BDD-based current mirror (CM) circuit requires
in Fig. 6.27b. In the subsequent timing phase, smaller voltage headroom than do conventional
TD-IS-CSA compares the sampled IREF with diode-connected current mirrors. The small volt-
ICELL in order to achieve robust sensing with age headroom on read circuits enables BDD-CSA
small input offset. to assign most of the voltage budget to the voltage
Figure 6.28 shows two digital-calibration- across NVM devices, thereby producing ICELL
based CSAs (DC-CSA) designed to enable sufficient for sensing even at low VDD.

Fig. 6.26 Small-offset CSA using threshold-voltage (VTH) sampling scheme


6 On-Chip Non-volatile Memory for Ultra-Low Power Operation 189

Fig. 6.27 Two CSAs based on current-sampling (IS-CSA): (a) two-capacitor IS-CSA and (b) single-capacitor time-
differential IS-CSA

6.3.3 Challenges in Voltage-Mode 6.3.3.1 Challenges in Voltage-Mode


Read Schemes and Examples Sensing

In this subsection, we investigate some of the On-Off Current Ratio Vs. Sensing Margin
challenges associated with the development of SRAM and logic-ROM are able to generate the
voltage-mode read schemes for small cell cur- large bit-line voltage swings required for read-
rent, low-voltage operation, and high-speed read. 0 operations, while maintaining near-zero
Examples of advanced circuit techniques are also bit-line voltage swings for read-1 operations.
reviewed. This can be attributed to their small read-1 cell
190 M.-F. Chang

Fig. 6.28 Two digital-calibration based CSAs (DC-CSA): (a) digital offset cancellation CSA (DOC-CSA), and
(b) asymmetric-voltage-biased CSA (AVB-CSA)

current (ICELL-1) and high on-off current ratio Process variation subjects the tail-bits of
(I-ratio ¼ ICELL-0/ICELL-1) between read-1 and logic-1 eNVM cells to higher ICELL-1, compared
read-0 (ICELL-0) cell currents. Thus, most to those of typical cells, which tends to be close
SRAM and logic-ROM devices use conventional to zero and far below ICELL-0. This is because tail
voltage-mode sense amplifiers (VSA) for differ- eFlash cells have a lower threshold voltage
ential sensing or single-end sensing using a ref- (VTHP) for programmed cells, or tail resistive
erence voltage (VREF). In typical VSA, the VREF eNVM cells have low HRS resistance. As
is placed at the mid-point between the bit-line shown in Fig. 6.30, eNVM suffers from small
voltages when reading tail logic-0 and logic-1 difference in bit-line voltage between tail bits
cells. Thus, the sensing margin (VSM) is half when reading logic-0 and logic-1 cells. Same as
(or less) of the ΔVBLS_TAIL. read-0 operation, the bit-line that is used to read a
6 On-Chip Non-volatile Memory for Ultra-Low Power Operation 191

Fig. 6.29 Low voltage, body-drain driven CSA (BDD-CSA)

Fig. 6.30 (a) Variations in BL voltage swing during read operations; (b) timing breakdown vs. VDD in a typical VSA

tail logic-1 cell has a greater drop in voltage novel VSAs (Takeuchi et al. 2007; Byeon et al.
(below the bit-line precharge voltage, VPRE) 2005; Chang et al. 2009b; Marotta et al. 2010),
with an increase in bit-line developing time which assume that the read-1 bit-line voltage is
(TBL). Read-1 and read-0 operations both maintained at close to VPRE or VDD.
undergo dynamic voltage swings on bit-lines
proportional to TBL; therefore, eNVM is unable Read Access Time Vs. VDD
to use conventional fixed-value VREF schemes Figure 6.30b presents a breakdown in the timing
(Verma and Chandrakasan 2008; Kim et al. of a typical voltage-mode sensing scheme
2009; Zhai et al. 2008; Chang et al. 2009a; employing a 65 nm eNVM memory macro with
Morita et al. 2007; Chen et al. 2006) or some a bit-line length of 512 cells. Lowering the VDD
192 M.-F. Chang

reduces the driving strength of the transistors, utilizes only 0.5  ΔVBLS_TAIL for the sensing
which can lead to an increase in macro-level margin, SSC-VSA has VREF placed below the
read access times (TAC) and a reduction in VSM. bit-line voltage of the tail-LRS cell and uses a
It is worth noting that under a low VDD, TAC is switch-capacitor-like circuit to enable the full
dominated by the bit-line precharge time (TPRE) utilization of the ΔVBLS_TAIL for the sensing mar-
and bit-line developing time (TBL), rather than gin (VSM). This increment in VSM (nearly double)
the response time of a VSA. This effect is more leads to a reduction in VDDmin and makes it
obvious in macros with a long BL length. Thus, possible to achieve read speeds exceeding those
efforts to improve the speed and yield of voltage- of conventional VSA at low VDD.
mode sensing under low VDD should focus on
reducing the TBL while maintaining a VSM suffi-
cient to resist the input offset voltage (VOS) of
6.4 Write Circuits for On-Chip NVM
the VSA.
In this section, we outline the general behavior
6.3.3.2 Advanced Circuit Techniques and key sub-circuits used in the write
for Voltage-Mode Read operations of on-chip NVM. The read operations
Operations discussed in this section include current-mode
and voltage-mode read schemes. We also explore
Low-Voltage Voltage-Mode Sensing various challenges in the development of
Figure 6.31 illustrates the concept of low-VDD read circuits for low-power on-chip NVM and
swing-sample-and-couple (SSC) VSA (Chang review a number of advanced read circuit
et al. 2014c). Unlike conventional VSA, which techniques.

Fig. 6.31 (a) Conceptual illustration and (b) circuitry of a low-VDD swing-sample-and-couple (SSC) VSA
6 On-Chip Non-volatile Memory for Ultra-Low Power Operation 193

6.4.1 General Operations limited and somewhat inefficient. This means


and Sub-Circuits that the current load on the high-voltage path
for Write Operations should be minimized in order to reduce power
consumption by the eNVM macro. The resulting
Figure 6.32 illustrates the structure of the write- low-power eNVM requires low-DC/leakage cur-
path in a typical eNVM macro. The key rent for consumption by the level-shifters and
sub-circuits used for write operations include write-drivers. A small-peak-current energy-
level-shifters, write-drivers, and charge-pump efficient CP is also important when low-power
circuits. eNVM macros are embedded on a chip.
Most eNVMs use multiple voltages higher
than the supply-voltage (VDD) for write
operations. A charge-pump (CP) circuit is used 6.4.2 Challenges in Level-Shifters
for the conversion of VDD to several high volt-
age levels in order to provide the output current Level shifters are embedded in eNVMs for the
required for write operations. Write drivers are conversion of logic signals (0 V or VDD)
employed only to provide the write voltage generated in row decoders to either 0 V or
required for the cell array. The drivers for voltages higher than VDD (VDDH). The
word-lines (WL) or source-lines (SL), which VDDH could be any write voltage required for
are used in read as well as write operations, NVM devices. Generally, level-shifters (LSs)
vary in their output voltages across operation pose challenges in the following two areas:
modes. As a result, level-shifters are commonly (1) operations with a low input voltage (VDDL)
used for WL/SL drivers in NVMs. and high VDDH during write operations;
It should be noted that the high voltages and (2) large DC current consumption by conven-
currents generated by charge pump circuits are tional low-VDDL level shifters in unselected
rows, resulting in significant charge loss in the
VDDH path. These difficulties can lead to an
increase in the current-load and power consump-
tion of on-chip charge-pump circuits, particularly
in NVMs with a large number of rows.
In the following, we examine a number of
existing level-shifters:

6.4.2.1 Half-Latch Level Shifter (HL-LS)


Figure 6.33 illustrates the circuit and waveform
of common half-latch level shifter (HL-LS)
(Wooters et al. 2010; Chang et al. 2011c),
which includes two pull-down NMOS (MNL
and MNR) and a pair of cross-coupled PMOS
(MPL and MPR) transistors as well as an input
signal inverter (INV1) and an output buffer
(BUF). When SEL ¼ 1, MNL is on, which pulls
down the voltage at node VOB (VOB) to 0 V, This
turns on MPR and raises the voltage at node VOT
(VOT) to VDDH. Thereafter, MPR remains off
due to cross-coupled feedback, wherein the out-
put (VO) of HL-LS is VDDH. When SELb ¼ 1,
Fig. 6.32 Structure of write-path in a typical eNVM the turning on of MNR creates a discharge current
macro (ID) that lowers VOT to below VDDH. If ID is
194 M.-F. Chang

Delay

VDDH VDDH VDDH


SEL VDDL
VSS

MPL MPR VDDH VDDH VDDH


VOT Current
Fighting
VOT VSS
VOB VO VDDH
VO
SEL MNL MNR CLoad
VSS
VDDL
VDDH
VOT
Current Fighting Fail
VSS

Fig. 6.33 Circuit and waveform of a half-latch level shifter (HL-LS)

Delay

VDDH
VDDH VDDH SEL VDDL
VSS
VM
MPL MPR VDDH VDDH VDDH
VM
VOT
VO
VSS
SEL MNL MNR CLoad
VDDH
VDDL VO

VSS

Fig. 6.34 Circuit and waveform of a current-mirror based level shifter (CM-LS)

large, the fight for current between MPR and MNR transistors, an inverter, and an output buffer
can cause MNR to pull down VOT to 0 V. When (BUF). When SEL ¼ 1, the fact that MNL is on
VDD is low (i.e., near or below the threshold lowers the voltage at node VM (VM) and then
voltage of MNL/MNR), ID is too small to win the turns on MPR. This, in turn, raises the voltage at
fight for current over MPL/MPR. This prevents node VOT (VOT) to VDDH. When SEL ¼ 0,
HL-LS from applying low VDDL and requires MNL is off and VM is slowly pulled down to
that ultra-large transistors be used for MNL/MNR VDDH-VTH-MPL by MPL, which significantly
in HL-LS. reduces the current flowing through MPR. This
enables MNR to pull down VOT to below the trig-
6.4.2.2 Current-Mirror Based Level Shifter point of the output buffer without incurring seri-
(CM-LS) ous current-fighting behavior. This makes it pos-
As shown in Fig. 6.34, current-mirror (CM)- sible for CM-LS to achieve a low VDDL using a
based level shifters (Wooters et al. 2010) smaller MNL/MNR than that required for HL-LS.
(CM-LS) were developed to reduce VDDL for However, DC current remains in selected and
low-VDD applications. CM-LS comprises two unselected CM-LSs. When SEL ¼ 1 (selected
pull-down NMOS (MNL and MNR) transistors, a row), the DC current flows through MPL and
pair of PMOS current-mirror (MPL and MPR) MNL (IMPL). When, SEL ¼ 0 (unselected rows),
6 On-Chip Non-volatile Memory for Ultra-Low Power Operation 195

DC leakage current flows through the weakly- When SEL ¼ 1, MNL is on, which pulls down
turn-off MPR (IMPL) because its gate voltage is VM to 0 V, which turns on MPR, thereby charging
VDDH-VTH-MPL. In the case of an eNVM with the voltage at node VOT (VOT) to VDDH. When
k rows, the DC leakage current on the VDDH VM ¼ 0 V, VM becomes the drain terminal of
path consumed by CM-LS is “IMPL + (k1) MPS, causing MPS to form a diode-connected
•IMPL”. When k is large, the large current-load structure, in which VML is equal to the VTH of
on VDDH prevents the on-chip charge-pump MPS (VTH-MPS). “VOT ¼ VDDH” causes MPM to
from providing current and voltage sufficient turn off, which subsequently cuts off the DC
for write operations. A Wilson CM LS current flowing through MPL. When SEL ¼ 0,
(Lütkemeier et al. 2010) has been proposed for SELb ¼ 1 and MNR turn on, thereby pulling
the suppression of DC current at SEL ¼ 1; how- down VOT. MPM is then turned on slightly and
ever, DC leakage current at SEL ¼ 0 still VM is charged by MPL. Meanwhile, MPS enters
remains. cut-off mode because in this operation, its source
terminal is VM. VM then rises to a level
approaching VDDH, which is higher than the
6.4.2.3 Pseudo-Diode-Mirrored Level
level of CM-LS, thereby strongly suppressing
Shifter (PDM-LS)
the DC leakage current flowing through MPR.
A pseudo-diode-mirrored level shifter (PDM-LS)
Thus, by cutting off the DC current at SEL ¼ 1
was proposed (Chang et al. 2014) to achieve low
and suppressing DC leakage current at SEL ¼ 0,
VDDL while suppressing DC current, as shown
the PDM-LS is able to operate consuming far less
in Fig. 6.35. The PDM-LS consists of two pull-
DC current than that required for conventional
down NMOS (MNL and MNR) transistors, a
CM-LS, while maintaining a low minimum
pseudo-diode PMOS transistor (MPS), a current-
VDDL (VDDLmin).
cut-off PMOS transistor (MPM), and pseudo
PMOS current-mirror (MPL and MPR) transistors,
an inverter, and a buffer. One source/drain termi-
nal of MPS is connected to the gate of MPL (VML), 6.4.3 Challenges in Write-Drivers
while the gate and the other source/drain terminal
of MPS are connected to the gate of MPR and the In a typical eNVM cell array, the cells selected
drain of MNL (node VM), respectively. MPM is for write operations are in the same row and
placed between MPL and MNL and controlled share the same WL and SL. As a result, all cells
by VOT. in that row have the same voltage bias on WL

VDDH VDDH
Delay

VML VDDH
MPL MPR
SEL VDDL
MPS
VSS
MPM VDDH VDDH
VDDH
VM VOT VM
VO
VSS
SEL MNL MNR CLoad
VDDH
VDDL
VO

VSS

Fig. 6.35 Circuit and waveform of a pseudo-diode-mirrored level shifter (PDM-LS)


196 M.-F. Chang

25

Percentage(%)
20
15
10
5

2 4 6 8 10
Programming time(a.u.)
(a)

SL VSET
voltage

TSET[K]

TSET[1]
SL
current
1st bit 2nd bit K-th bit Time
switch switch switch

(b) (c)
Fig. 6.36 Write times of logic-process (a) OTP and (b) resistive-type eNVM devices; (b) waveforms associated with
the voltage and current in a high-voltage path (VDDH) during program operations

and SL. As shown in Fig. 6.36a, process variation Bit-by-bit write-termination schemes (Halupka
can cause considerable variability in the write et al. 2010; Xue et al. 2013; Chang et al. 2014c)
times required for NVM devices, such as pro- have been proposed for the suppression of IDC-CELL
gram times (TPROG) or SET (TSET)/RESET by monitoring the switching of each accessed
(TRESET) times. Covering the tail bits (with eNVM device and then initiating the termination
slower TPROG/TSET/TRESET) requires that the volt- of the Program/SET operation as soon as it has
age bias for word-line and source-line be been successfully written. In the following,
maintained until the tail bits have completed we review three write-termination schemes: (1)
logic-state switching. For OTP and resistive- OP-based current-mode, (2) negative-resistance
type memory, the switching of an eNVM device based current-mode write-termination, and (3)
from HRS to LRS (program or SET operation) voltage-mode write-termination.
causes a considerable flow of DC current (IDC-
CELL) between the two terminals of an NVM
device. As a result, eNVM devices with shorter 6.4.3.1 OP-Based Current-Mode Write-
TPROG/TSET consume large IDC-CELL from the Termination (OP-CWT) Scheme
high-voltage supply source (VDDH). As shown In (Xue et al. 2013), a current-mode write-termi-
in Fig. 6.36b, a large IDC-CELL results in a waste nation (OP-CWT) scheme based on operational
of energy and degrades the output voltage of the amplifiers was proposed to perform termination
on-chip voltage generator as well as the source- operations for the SET and RESET operations of
line voltage. This can lead to write failure in cells resistive NVM devices, as shown in Fig. 6.37.
(tail bits) requiring longer TPROG/TSET. This scheme uses operational amplifiers (OP),
6 On-Chip Non-volatile Memory for Ultra-Low Power Operation 197

RCELL

VREF IWRITE
VBIAS

Control VREF
VR

Polarity SEL Memory Cell Array FB

Fig. 6.37 Circuits and waveforms of OP based current-mode write-termination (OP-CWT) scheme

SL BL
Cell
WL

NVM
IWRITE
IWRITE w/o NR
-R Driver

W1
Time

VBIAS VBL
W0 -R
W1
Time
Fig. 6.38 Circuits and waveforms of negative-resistance based current-mode write-termination (NR-CWT) scheme

current-mirror (CM) circuits, and bias circuits to NR-CWT employs a current-mirror based
monitor the write current during SET and RESET negative-resistance (R) circuit to provide feed-
operations. The high-voltage bias circuit is back for the modulation of write current. When
disabled to terminate the SET operation of an writing ‘1’ (LRS ! HRS), the WR1 signal is
accessed cell. This OP-CWT scheme is effective high, the WR0 signal is low, and the current
in suppressing IDC-SET; however, the use of flows through the NVM cell from SL to
Ops/CMs incurs significant area overhead, slows BL. When writing ‘0’ (switch HRS to LRS),
the response time, and leads to the consumption WR1 signal is low, the WR0 signal is high, and
of considerable quantities of DC current. the current flows through the NVM cell from BL to
SL. The –R driver reflects into the bit-line (BL) a
quantity of current proportional to the cell resis-
6.4.3.2 Negative-Resistance Based
tance of the NVM cell (RNVM). Once the write ‘0’
Current-Mode Write-Termination
is completed and RNVM reaches the LRS resistance
(NR-CWT) Scheme
value, the increase in current flowing through
Figure 6.38 illustrates the negative-resistance
the NVM cell pulls down the BL voltage. A
based current-mode write-termination (NR-
lower BL voltage reduces the flow of current
CWT) scheme proposed in (Halupka et al. 2010).
198 M.-F. Chang

Memory Cell
SL
Array SEL

WL
SL
NVM

NVM

NVM

NVM
WL

DL<K>
MUX MUX
DL<K> DL<K+1>
DL<K+1>

P1 P1 FB<K>
FB
NSD NRSD SEL
FB
NSD NRSD SEL FB<K+1>
VRE F N1 VRE F N1
ISL w/o WT
ISL

Fig. 6.39 Circuits and waveforms of voltage-mode write-termination scheme

through the current mirror, which lowers the BL to several high voltage levels (VDDH1,
voltage considerably, thereby forming a positive VDDH2, . . .) in order to provide write voltage
feedback loop. The current mirror and positive and output current sufficient for write
feedback loop moderate (rather than terminate) operations. Unfortunately, the inefficiency of
the write current, because current continues voltage conversion leads to the consumption
flowing through the R driver, even after the of a great deal of power. Thus, the charge
write-0 operation is complete. pump circuit consumes a significant portion of
the write power used in an eNVM macro.
6.4.3.3 Voltage-Mode Write-Termination Many IoT devices use on-chip NVM macros
(VWT) Scheme integrated with on-chip sensors, mixed-signal
Figure 6.39 illustrates the voltage-mode write- blocks, and SRAM macros (Chang and Yang
termination (VWT) scheme devised in (Chang 2009; Larsson et al. 2001b; Mezhiba and
et al. 2014c) for the SET operation. This scheme Friedman 2004), all of which are sensitive to
uses only four transistors to monitor bit-line volt- supply noise. In most IoT chips, on-chip
age and reuses the IDC-CELL to raise the bit-line sub-blocks share the same power supply source
voltage in order to terminate the SET operation (i.e., battery, energy harvesters, or power
of an accessed cell. The use of voltage-mode generators). Supply noise can induce functional
operation and reuse of IDC-SET minimize the failures in supply-sensitive sub-blocks and even
area overhead, reduce the power consumption, affect write operations and threshold voltage dis-
and improve the response time of write- tribution in NVM cells (Chang and Shen 2009;
termination operations. Chang et al. 2015c). Supply noise can also
degrade the minimum VDD (VDDmin) of the
sub-blocks on a chip. Charge-pump circuits
6.4.4 Challenges in Charge-Pump require a pumping-capacitor (PC) with large-
Circuits capacitance for each stage, in order to obtain a
sufficiently large output current (IOUT). This
eNVMs use an on-chip charge pump (CP) for greatly increases the switching power required
the conversion of power supply voltage (VDD)
6 On-Chip Non-volatile Memory for Ultra-Low Power Operation 199

for the clock buffers. The switching of the clock 6.4.4.2 Four-Phase Charge-Pump (4P-CP)
buffer also generates large peak current, which Circuit
results in the generation of a great deal of supply Figure 6.41 illustrates a four-phase Dickson-type
noise. Thus, the charge-pump circuits in CPs (4P-CPs), which was developed to increase
low-power eNVMs face two major challenges: the overall power efficiency. A nominal 4P-CP
increasing power efficiency, and reducing the comprises two pumping-stages (PSs) and four
consumption of peak current in order to avoid clock phases (P1 ~ P4) in a single block. Each
inducing noise at the chip level. pumping-stage includes a diode-connected
In the following, we review a number of NMOS transistor (NSW) for charge transfer and
charge-pumps commonly used for eNVM a gate switch (GSW) to control the gate of NSW
macros. (node GNSW). The 4P-CP eliminates the VTH
drop between the source and drain terminals of
6.4.4.1 Two-Phase Charge-Pump (2P-CP) each NSW by applying a voltage higher than that
Circuit of 2P-CP at the gate of NSW (VGNSW).
Figure 6.40 illustrates a conventional two-phase Figure 6.41b illustrates the relationship
Dickson-type charge-pump (2P-CP) circuit between P1 ~ P4 in a nominal 4P-CP. The
(Dickson 1976; Palumbo and Pappalardo 2009; overlapping period between P1-rising and
Witters et al. 1989; Kuriyama et al. 1992; Atsumi P3-falling is listed as TOVER-L. The overlapping
et al. 1994). In a 2P-CP, each pumping stage period between P3-rising and P1-falling is listed
(PS) includes a diode-connected NMOS transis- as TOVER-R. When P1 rises, the voltage at node-A
tor (NSW) and a pumping-capacitor (PC). Even (VA) increases. During TOVER-L period, VA is
and odd stages employ opposite clock phases transferred to the VGNSW of the even stage
(P1 and P2) for the alternate transfer of boosted (VGNSW-E). At this point, P4 rises to pump
charge to subsequent stages. A 2P-CP VGNSW-E to a voltage higher than VA, thereby
necessitates a threshold-voltage (VTH) difference allowing the transfer of the charge at node-A to
between the source and drain terminals of the node-B without a drop in VTH. During TOVER-R
NSW and is therefore limited with regard to period, P1 is pulled down to discharge VGSW-E in
maximum pumping output voltage (VOUT) and order to turn off the NSW of the even stage
power efficiency. (NSW-E). At the same time, the VGSW of the

NSW1 NSW2 NSW3 NSWK


VDD

A B
VOUT

CLK CLKB CLKB


3*VDD-2*VTH
2*VDD-VTH
2*VDD-2*VTH
Node B
VDD-VTH
Node A

Fig. 6.40 Circuit and waveforms of typical two-phase charge-pumps (2P-CP)


200 M.-F. Chang

P2 P4 P2 P4

VDD

GSW1 GSW2 GSW3 GSW4

VOUT
NSW1 A NSW2 B NSW3 C NSW4 D

P1 P3 P1 P3
3*VDD

2*VDD
Node B
VDD
Node A

P1

P3

P2

P4

Fig. 6.41 Circuit and waveforms of typical four-phase charge-pump (4P-CP)

odd stage (VAGSW-O) of the following block (i.e., transfer operation, the gate voltage of an NMOS
BK2) is raised by node-B of the previous block pass-gate is one VDD higher than its terminal
(i.e., BK1). connected to the previous stage. As a result, the
In conventional 4P-CP, TOVER-L and TOVER-L NMOS is on, thereby providing efficient charge
are given the same period in order to ensure transfer behavior, while the PMOS is off, thereby
consistency in VA-to-VGSW and cross-stage cutting off the charge flow-back path.
charge transfer times across stages within a sin-
gle block (i.e., PS1-PS2 or PS3-PS4) or across
blocks (i.e., PS2-PS3). 6.4.4.4 Two-Step Clocking Charge-Pump
(TSC-CP) Circuit
6.4.4.3 Cross-Coupled Charge-Pump Figure 6.43 illustrates the two-step clocking
(CC-CP) Circuit charge-pump (TSC-CP) scheme presented in
As shown in Fig. 6.42, the cross-coupled charge- (Lauterbach et al. 2000). Conventional CPs use
pump (CC-CP) proposed in (Ker et al. 2006) single voltage steps for pumping clocks wherein
comprises two branches (branch A and B). Each the energy delivered by the clock buffers equals
branch comprises k pairs of NMOS-PMOS (Q  VDD). The TSC-CP scheme employs two
(MN-MP) connected in serial. The NMOS- voltage steps for the pumping clock, which means
PMOS pairs are cross-couple connected between that the energy delivered by the clock buffers
the two branches. Branches A and B alternate in in TSC-CP is equal to (3/4  Q  VDD ¼ 1/2 
the pumping of output voltage. During the charge Q  VDD + Q  VDDQ + 1/2  Q  VDD).
6 On-Chip Non-volatile Memory for Ultra-Low Power Operation 201

3*VDD
2*VDD
Node B
VDD
Node A
CLK CLKB CLK CLKB

A B C D

VDD
VOUT

E F G H

CLKB CLK CLKB CLK


3*VDD
2*VDD
F node
VDD
Node E

Fig. 6.42 Circuit and waveforms of a typical cross-couple charge-pump (CC-CP)

This results in energy consumption below that of a symmetrical. Thus, the timing relationship
conventional CP. between P1 ~ P4 in each pumping block should
The use of equalization between the be fixed in order to avoid charge-feedback activ-
up-pumping capacitor (i.e., C1) and down- ity within a single pumping block (intra-block) or
pumping capacitor (i.e., C2) in neighboring between pumping blocks (inter-block).
stages means that the TSC-CP needs to charge As shown in Fig. 6.44, a split asymmetrically
the up-pumping capacitor using VDD/2, instead shifted clocking (SASC) scheme was proposed in
of VDD, as in conventional CPs. This leads to a Chang et al. (2015f) to (1) prevent simultaneous
further reduction in power consumption. Com- switching activities across pumping blocks and
bining two-step clocking with capacitor equali- (2) maintain the same timing relationship
zation makes it possible for the TSC-CP to between the local four-phase clocking of each
reduce energy consumption far below that of pumping block.
conventional CP. Splitting the four-phase clocks between
pumping blocks enables the discrete control of
6.4.4.5 Low-Peak-Current Split- charge boosting (pumping) processes at inter-
asymmetrically-Shifted Clocking block and intra-block levels. The local
Charge-Pump (SASC-CP) Circuit four-phase clocks of each pumping block switch
One effective approach to reduce the peak cur- after those of the previous pumping block
rent of a charge-pump involves preventing the (its input stage) due to TDELAY. To prevent a
simultaneous occurrence of all switching-related drop in the pumping efficiency of each pumping
activities in multiple pumping blocks. However, block, SASC-CP employs different timing for
the clock phases of all pumping blocks in con- TOVER-R and TOVER-L. It should be noted that
ventional charge-pumps are shared and in SASC-CP, TOVER-R is not as critical as it
202 M.-F. Chang

n1 n2 n1 n2

VDD

V OUT

CP1
CP2
Tristate Driver 1 Tristate Driver 2

n1 n1

n1 n2
n2 n2

n1

n2

t12

Cp 1

Cp 2

Fig. 6.43 (a) Circuit and (b) waveform of two-step clocking charge-pump (TSC-CP) scheme

would be in conventional charge-pumps. This


makes it possible to employ a shorter period for 6.5 Perspectives and Trends
TOVER-R than that used in conventional charge-
pumps without affecting intra-block pumping In this section, we review various state-of-the-art
behavior. The timing relationship between on-chip NVMs and discuss recent trends in
TOVER-R and TOVER-L is presented as TOVER-L ¼ low-power on-chip NVM for IoT applications.
TDELAY + TOVER-R. We also explore extended applications of NVM
Figure 6.44b presents the clock generation devices, beyond the issue of memory usage for
scheme for SASC-CP using one clock generator IoT applications.
for the global clocks (GP1 ~ GP4). A local delay
buffer (LDB) is placed above each pumping
block to provide local four-phase clocks 6.5.1 State-of-the-Art On-Chip NVMs
(P1 ~ P4). Accordingly, the RC delays of and Trends in On-Chip NVM
P1 ~ P4 are far smaller than in conventional for IoT Devices
charge-pumps. This also makes it possible to
use a higher clock frequency with SASC-CP IoT devices employ a wide range of technology
than would be possible with conventional nodes to accommodate a wide range of
charge-pumps. applications and cost structures. Figure 6.45
6 On-Chip Non-volatile Memory for Ultra-Low Power Operation 203

Local Delay Buffer (TDELAY )

GP1~GP4

P1[0]~P4[0] P1[1]~P4[1] P1[n]~P4[n]

Block[0] Block[1] Block[n]

Global 4-Phase
Clock Generator
(a)
Local Delay Buffer(TDELAY)

GP1~GP4

P1[0]~P4[0] P1[1]~P4[1] P1[n]~P4[n]

Block[0] Block[1]
Global 4-Phase P2[0] P4[0] P2[1] P4[1]
Clock Generator
VDD

GSW1 GSW2 GSW3 GSW4

NSW1 A NSW2 B NSW3 C NSW4


VOUT
D

P1[0] P3[0] P1[1] P3[1]

(b)
T OVER-L1 T OVER-R1 T DELAY

P1_1

P3_1

P1_2

P3_2

P1_3

P3_3
T OVER-R2 T DELAY

(c)
Fig. 6.44 (a) Concept, (b) circuit, (c) waveform and layout used in split asymmetrically shifted (SAS) clocking
scheme
204 M.-F. Chang

Fig. 6.45 Read power


vs. technology nodes for
eFlash and emerging
memory solutions

presents the read energy versus technology nodes slightly smaller than that of eFlash macros at the
of recent eFlash macros (Jefremow et al. 2012; nodes of more mature processes (i.e., 0.25 μm
Liu et al. 2013a; Kono et al. 2013, 2014; Cho node). The small scaling ratio in write energy for
et al. 2013; Yu et al. 2013; Taito et al. 2015; eFlash across process nodes is due to the fact that
Yamauchi et al. 2015a, b). State-of-the-art similar write (program and erase) mechanisms
on-chip eFlash macros (at 28 nm nodes) consume are used for program and erase operations. In
less read energy than do the eFlash macros at the contrast, some emerging resistive-type memories
nodes of mature processes (i.e., 0.25 μm node). require far less write energy than does eFlash,
This significant reduction in read energy is due due to lower write voltages and faster write
primarily to the scaling of devices and supply times.
voltages (VDD). The state-of-the-art processes used in eFlash
Due to their low write voltage requirements, memory are a few generations behind the most
ReRAM and PCM macros do not require the advanced logic processes, due to the lengthy time
high-voltage transistors found in eFlash macros required for the development of processes to deal
for peripheral circuits. This makes it possible for with the multiple gates (floating gate, control
ReRAM and PCM macros (Chien et al. 2010; gate, select gate, and erase gate) found in eFlash
Chang et al. 2012b, 2014c; Fackenthal et al. cells. With the migration of advanced logic pro-
2014; De Sandre et al. 2010; Chung et al. 2011; cesses away from planar transistors to FinFET
Close et al. 2011; Choi et al. 2012) to employ a transistors, it is becoming increasingly difficult
lower supply voltage (VDD) for read operations to integrate conventional floating-gate based
and thereby achieve a read energy lower than that eFlash technology with FinFET processes. This
of eFlash macros. As a result, ReRAM and PCM has led to the need for on-chip NVM solutions
macros are more practical than eFlash for IoT that are logic-process compatible with applica-
devices using low supply voltages to reduce read bility to nanometer processes. ReRAM and PCM
energy consumption. are promising eNVM solutions for advanced
Figure 6.46 presents the write energy versus nanometer processes due to their inclusion of
technology nodes of recently reported NVM back-end processes (post-BEOL processes) to
macros. The write energy consumed by state-of- decouple NVM device processing from front-
the-art on-chip eFlash macros (at 28 nm nodes) is end processes (transistors).
6 On-Chip Non-volatile Memory for Ultra-Low Power Operation 205

Fig. 6.46 Write power


versus technology nodes
for eFlash and emerging
memory solutions

6.5.2 Extension of NVM devices single cell and are capable of block-level parallel
to IoT Devices data transfer with store/restore operations faster
than that found in two-macro schemes. Fig-
Recent emerging memory devices have been ure 6.47 presents several examples of recent
employed as NVM macros and combined with nvSRAM macros.
SRAM or TCAM to produce hybrid-mode Conventional SRAM-based TCAM (Hayashi
memories, such as nonvolatile SRAM et al. 2013; Arsovski et al. 2013; Yabuuchi et al.
(nvSRAM) (Ohsawa et al. 2013; Wang et al. 2014; Do et al. 2014; Jeloka et al. 2015; Nii et al.
2006; Sheu et al. 2013; Yamamoto et al. 2009; 2015) uses two 6T-SRAMs as content-storage
Chiu et al. 2010; Wei et al. 2014; Lee et al. 2015) units and 4T comparison logics to perform
and nonvolatile TCAM (nvTCAM) (Li et al. data-matching comparisons. Figure 6.48 shows
2013; Matsunaga et al. 2012; Huang et al. 2014; several recent nvTCAM macros. The use of
Chang et al. 2015d). NVM as a content storage unit makes it possible
As discussed in Sect. 6.1, many energy- to reduce the search energy, standby power, and
efficient chips employ SRAM for computing area of nvTCAM. It also enables search speeds
and NVM for power-off storage in order to comparable to those of SRAM-based TCAM.
reduce standby current. Unfortunately, this Moreover, the NVM devices in nvTCAM are
two-macro (SRAM+NVM) scheme consumes a able to perform storage as well as comparison
great deal of energy during power interruptions functions, thereby eliminating the need for the
and results in slow store (power-off) and restore movement of data between volatile and NVM
(power-on) operations, due to the word-by-word devises, as in nvSRAM. This one-macro
serial transfer of data. nvTACM scheme enables fast low-power
nvSRAMs perform bit-to-bit data transfer power-off and power-on operations for IoT
between SRAM and NVM devices within a devices.
206 M.-F. Chang

Fig. 6.47 Several examples of recent nvSRAM cells

Fig. 6.48 Several recent nvTCAM cells

M. Boniardi, A. Redaelli, C. Cupeta, et al., Optimization


References metrics for Phase Change Memory (PCM) cell
architectures, IEEE International Electron Devices
B.J. Ahn, J.H. Sone, J.W. Kim, I.H. Choi, D.M. Kim, A Meeting (IEDM) Dig. Tech. Papers (2014),
simple and efficient self-limiting erase scheme for pp. 29.1.1–29.1.4
high performance split-gate flash memory cells. D.S. Byeon, S.S. Lee, Y.H. Lim et al., An 8 Gb multi-
IEEE Electron Device Lett. 19, 438–440 (1998) level NAND flash memory with 63 nm STI CMOS
I. Arsovski, T. Hebig, D. Dobson et al., A 32 nm 0.58-fJ/ process technology. ISSCC Dig. Tech. Pap. 1, 46–47
bit/search 1-GHz ternary content addressable memory (2005)
compiler using silicon-aware early-predict late-correct M.F. Chang, S.J. Shen, A process variation tolerant
sensing with embedded deep-trench capacitor noise embedded split-gate flash memory using pre-stable
mitigation. IEEE J. Solid State Circuits 48(4), current sensing scheme. IEEE J. Solid-State Circuits
932–939 (2013) 44(3), 987–994 (2009)
S. Atsumi, M. Kuriyama, A. Umezawa et al., A 16-Mb M.-F. Chang, S.-M. Yang, Analysis and reduction of
Flash EEPROM with a new self-data-refresh scheme supply noise fluctuations induced by embedded
for a sector erase operation. IEEE J. Solid State ROM. IEEE Trans. VLSI Syst. 17(6), 758–769 (2009)
Circuits 29(4), 461–468 (1994) I.J. Chang, J.J. Kim, S.P. Park et al., A 32 kb 10T
N. Banno, M. Tada, T. Sakamoto, et al., A fast and sub-threshold SRAM array with bit-interleaving and
low-voltage Cu complementary-atom- switch 1 Mb differential read scheme in 90 nm CMOS. IEEE
array with high-temperature retention, IEEE Sympo- J. Solid State Circuits 44(2), 650–658 (2009a)
sium on VLSI Technology (VLSIT) Dig. Tech. Papers S.H. Chang, S.K. Lee, S.J. Park, et al., A 48 nm 32 Gb
(2014), pp. 1–2 8-level NAND flash memory with 5.5 MB/s program
M. Bauer, R. Alexis, G. Atwood, et al., A multilevel-cell throughput, IEEE International Solid-State Circuits
32 Mb flash memory, IEEE International Solid-State Conference (ISSCC) Dig. Tech. Papers (2009),
Circuits Conference (ISSCC) Dig. Tech. Papers, pp. 240–241, 241a
15–17 Feb. 1995 (1995), pp.132–133
6 On-Chip Non-volatile Memory for Ultra-Low Power Operation 207

M.-F. Chang, P.-F. Chiu, S.-S. Sheu, Circuit design M.F. Chang et al., A low-power subthreshold-to-
challenges in embedded memory and resistive RAM superthreshold level-shifter for sub-0.5V embedded
(RRAM) for mobile SoC and 3D-IC, in Proceedings resistive RAM (ReRAM) macro in ultra low-voltage
of the IEEE Asia and South Pacific Design Automation chips, in 2014 IEEE Asia Pacific Conference on
Conference (ASP-DAC) (2011), pp. 197–203 Circuits and Systems (APCCAS), Ishigaki, 2014, pp.
M.-F. Chang, P.-F. Chiu, W.-C. Wu, C.-H. Chuang, S.-S. 695–698
Sheu, Challenges and trends in low-power 3D M.F. Chang, A. Lee, P.C. Chen et al., Challenges and
die-stacked IC designs using RAM, memristor logic, circuit techniques for energy-efficient on-chip nonvol-
and resistive memory (ReRAM), in IEEE Interna- atile memory using memristive devices. IEEE
tional Conference on ASIC (ASICON) (2011), J. Emerg. Select. Top. Circuits Syst. 5(2), 183–193
pp. 299–302 (2015a)
I.J. Chang, J.J. Kim, K. Kim et al., Robust level converter M.-F. Chang, A. Lee, C.-C. Lin, et al. Read circuits for
for sub-threshold/super-threshold operation: 100 mV resistive memory (ReRAM) and memristor-based
to 2.5 V. IEEE TransactionsVLSI Syst. 19(8), nonvolatile Logics, in Proceedings of the IEEE Asia
1429–1437 (2011c) and South Pacific Design Automation Conference
M.-F. Chang, C.-H. Chuang, M.-P. Chen, et al. (ASP-DAC) (2015), pp. 569–574
Endurance-aware circuit designs of nonvolatile logic M.F. Chang, Y.F. Lin, Y.C. Liu et al., An asymmetric-
and nonvolatile SRAM using resistive memory voltage-biased current-mode sensing scheme for fast-
(memristor) device, in Proceedings of the IEEE Asia read embedded flash macros. IEEE J. Solid State
and South Pacific Design Automation Conference Circuits 50(9), 2188–2198 (2015c)
(2012), pp. 329–334 M.F. Chang, C.C. Lin, A. Lee, et al., A 3T1R nonvolatile
M.F. Chang, C.W. Wu, C.C. Kuo, et al., A 0.5 V 4 Mb TCAM using MLC ReRAM with Sub-1 ns search
logic-process compatible embedded resistive RAM time, IEEE International on Solid-State Circuits Con-
(ReRAM) in 65 nm CMOS using low-voltage cur- ference (ISSCC) Dig. Tech. Papers (2015), pp. 1–3
rent-mode sensing scheme with 45 ns random read M.-F. Chang, W.-Y. Lu, S.-J. Shen, M.-P. Chen, C.-S.
time, IEEE International Solid-State Circuits Confer- Lin, S.-S. Sheu, C.-H. Hung, Y.-S. Yang, Y.-J. Kuo,
ence (ISSCC) Dig. Tech. Papers (2012), pp. 434–436 S.-N. Hung, H.-T. Lue, C.-H. Shen, J.-M. Shieh,
M.-F. Chang, S.-S. Sheu, K.-F. Lin, et al., A high-speed Supply-variation-resilient nonvolatile 3D IC and 3D
7.2-ns read-write random access 4-Mb embedded memory using low peak-current on-chip charge-pump
Resistive RAM (ReRAM) macro using process-varia- circuits, in Proceedings of the IEEE Electron Devices
tion- tolerant current-mode read schemes, IEEE Jour- and Solid-State Circuits (EDSSC) (2015), pp. 118–121
nal of Solid-State Circuits (2013), pp. 878–891 M.F. Chang, W.Y. Lu, S.J. Shen, et al., Supply-variation-
M.F. Chang, S.J. Shen, C.C. Liu et al., An offset-tolerant resilient nonvolatile 3D IC and 3D memory using low
fast-random-read current-sampling-based sense peak-current on-chip charge-pump circuits, in
amplifier for small-cell-current nonvolatile memory. Proceedings of the IEEE Conference on Electron
IEEE J. Solid State Circuits 48(3), 864–877 (2013b) Devices and Solid-State Circuits (2015), pp. 118–121
M.F. Chang, C.W. Wu, C.C. Kuo et al., A low-voltage W.-M. Chen, C. Swift, D. Roberts, K. Forbes, J. Higman,
bulk-drain-driven read scheme for Sub-0.5 V 4 Mb B. Maiti, W. Paulson, K.-T. Chang, A novel flash
65 nm logic-process compatible embedded resistive memory device with split gate source side injection
RAM (ReRAM) macro. J. Solid-State Circuits 48(9), and ono charge storage stack (SPIN), Symp. VLSI
2250–2259 (2013c) Technology Dig. Tech. Papers (1997), pp. 63–64
M.-F. Chang, C.-C. Kuo, S.-S. Sheu, et al. Area-efficient J.H. Chen, L.T. Clark, T.H. Chen, An ultra-low-power
embedded Resistive RAM (ReRAM) macros using memory with a subthreshold power supply voltage.
logic-process Vertical-Parasitic-BJT (VPBJT) IEEE J. Solid State Circuits 41(10), 2344–2353 (2006)
switches and read-disturb-free temperature-aware cur- Y.-H. Chen, W.-M. Chan, W.-C. Wu, et al., A 16 nm
rent-mode read scheme, IEEE J. Solid-State Circuits 128 Mb SRAM in high-κ metal-gate FinFET technol-
(JSSC) (2014), pp. 908–916 ogy with write-assist circuitry for low-VMIN
M.-F. Chang, A. Lee, C.-C. Kuo, et al. Challenges at applications, in IEEE International Solid-State
circuit designs for resistive-type nonvolatile memory Circuits Conference (ISSCC) Dig. Tech. Papers
and nonvolatile logics in mobile and cloud (2014), pp. 238–239
applications, in Proceedings of the IEEE International H.Y. Cheng, T.H. Hsu, S. Raoux, J.Y. Wu, P.Y. Du,
Conference on Solid-State and Integrated Circuit M. Breitwisch, Y. Zhu, E.K. Lai, E. Joseph,
Technology (ICSICT) (2014), pp. 1–4 S. Mittal, R. Cheek, A. Schrott, S.C. Lai, H.L. Lung,
M.F. Chang, J.J. Wu, T.F. Chien, et al., 19.4 embedded C. Lam, A high performance phase change memory
1 Mb ReRAM in 28 nm CMOS with 0.27-to-1 V read with fast switching speed and high temperature reten-
using swing-sample-and-couple sense amplifier and tion by engineering the GexSbyTez phase change
self-boost-write-termination scheme, IEEE Interna- material, IEEE International Electron Devices
tional Solid-State Circuits Conference (ISSCC) Dig. Meeting (IEDM) Dig. Tech. Papers (2011),
Tech. Papers (2014), pp. 332–333 pp. 3.4.1–3.4.4
208 M.-F. Chang

H.Y. Cheng, J.Y. Wu, R. Cheek, S. Raoux, M. BrightSky, multiplier technique. IEEE J. Solid State Circuits 11
D. Garbin, S. Kim, T.H. Hsu, Y. Zhu, E.K. Lai, (3), 374–378 (1976)
E. Joseph, A. Schrott, S.C. Lai, A. Ray, H.L. Lung, A.T. Do, C. Yin, K. Velayudhan et al., 0.77 fJ/bit/search
C. Lam, A thermally robust phase change memory by content addressable memory using small match line
engineering the Ge/N concentration in (Ge, N)xSbyTez swing and automated background checking scheme
phase change material, IEEE International Electron for variation tolerance. IEEE J. Solid State Circuits
Devices Meeting (IEDM) Dig. Tech. Papers (2012), 49(7), 1487–1498 (2014)
pp. 31.1.1–31.1.4 B. Eitan, P. Pavan, I. Bloom, E. Aloni, A. Frommer,
W.C. Chien, Y.R. Chen, Y.C. Chen, et al., A forming-free D. Finzi, Can NROM, a 2-bit, trapping storage BVN
WOx resistive memory using a novel self-aligned field cell, give a real challenge to floating gate cells, in
enhancement feature with excellent reliability and Proc. Int. Conf. Solid State Devices and Materials
scalability, IEEE International Electron Devices (1999), pp. 522–524
Meeting (IEDM) Dig. Tech. Papers (2010), B. Eitan, P. Pavan, I. Bloom, E. Aloni, A. Frommer,
pp. 19.2.1–19.2.4 D. Finzi, NROM: a novel localized trapping, 2-bit
Y.D. Chih, C.H. Wang, C.H. Kuo, Reference cell circuit nonvolatile memory cell. IEEE Electron Device Lett.
for split gate flash memory. U.S. Patent 6,396,740, 28 21(11), 543–545 (2000)
May 2002 K. Eshraghian, K.-C. Cho, O. Kavehei, et al., Memristor
P.-F. Chiu, M.-F. Chang, S.-S. Sheu, et al., A low store MOS Content Addressable Memory (MCAM): hybrid
energy, low VDDmin, nonvolatile 8T2R SRAM with architecture for future high performance search
3D stacked RRAM devices for low power mobile engines, IEEE Transactions on Very Large Scale Inte-
applications, Symposium on VLSI Circuits (VLSIC) gration (VLSI) Systems (2010), pp. 1407–1417
Dig. Tech. Papers (2010), pp. 229–230 R. Fackenthal, M. Kitagawa, W. Otsuka, et al., 19.7 A
C.Y.-S. Cho, M.-J. Chen, C.-F. Chen, P. Tuntasood, D.-F. 16 Gb ReRAM with 200 MB/s write and 1 GB/s read
Fan, T.-Y. Liu, A novel self-aligned highly reliable in 27 nm technology, in IEEE International Solid-
sidewall split-gate flash memory. IEEE Trans. Elec- State Circuits Conference (ISSCC) Dig. Tech. Papers
tron Dev. 53, 465–473 (2006) (2014), pp. 338–339
C.Y.-S. Cho, J.C. Wang, L. Huang, et al., A 55-nm, 0.86- F. Frustaci, M. Khayatzadeh, D. Blaauw, D. Sylvester,
Volt operation, 75 MHz high speed, 96 μA/MHz low M. Alioto, A 32 kb SRAM for error-free and error-
power, wide voltage supply range 2 M-bit split-gate tolerant applications with dynamic energy- quality
embedded Flash, in Proceedings of the IEEE Interna- management in 28 nm CMOS, IEEE International
tional Symposium on VLSI Design, Automation, and Solid-State Circuits Conference (ISSCC) Dig. Tech.
Test (VLSI-DAT) (2013), pp. 1–4 Papers (2014), pp. 244–245
Y.D. Choi, I.H. Song, M.H. Park, et al., A 20 nm 1.8 V D. Halupka, S. Huda, W. Song, et al., Negative-resistance
8 Gb PRAM with 40 MB/s program bandwidth, IEEE read and write schemes for STT-MRAM in 0.13 μm
International Solid-State Circuits Conference CMOS, IEEE International Solid-State Circuits Con-
(ISSCC) Dig. Tech. Papers (2012), pp. 46–48 ference (ISSCC) Dig. Tech. Papers (2010),
C.C. Chung, H.C. Lin, Y.T. Lin, A multilevel read and pp. 256–257
verifying scheme for Bi-NAND flash memories. IEEE F. Hamzaoglu, U. Arslan, N. Bisnik, et al. A 1 Gb 2 GHz
J. Solid State Circuits 42(5), 1180–1188 (2007) embedded DRAM in 22 nm Tri-Gate CMOS tech-
H. Chung, B.H. Jeong, B.J. Min, et al., A 58 nm 1.8 V nology, IEEE International Solid-State Circuits
1 Gb PRAM with 6.4 MB/s program BW, IEEE Inter- Conference (ISSCC) Dig. Tech. Papers (2014),
national Solid-State Circuits Conference (ISSCC) pp. 230–231
Dig. Tech. Papers (2011), pp. 500–502 M. Hatanaka, H. Hidaka, Value creation in SOC/MCU
G.F. Close, U. Frey, J. Morrish, et al., A 512 Mb phase- applications by embedded nonvolatile memory
change memory (PCM) in 90 nm CMOS achieving evolutions, in Proceedings of the IEEE Asia Solid-
2b/cell, Symposium on VLSI Circuits (VLSIC) Dig. State Circuits Conf. (A-SSCC) (2007), pp. 38–42
Tech. Papers (2011), pp. 202–203 I. Hayashi, T. Amano, N. Watanabe et al., A 250-MHz
Datasheet, “sfc 0064_08b9_he” Taiwan Semiconductor 18-Mb full ternary CAM with low-voltage matchline
Manufacturing Company (TSMC) (2001) sensing scheme in 65-nm CMOS. IEEE J. Solid State
Datasheet, AF64K8AF25, v1.0 1st Silicon Sdn. Bhd. Circuits 48(11), 2671–2680 (2013)
(X-Fab) (2005) H. Hidaka, Evolution of embedded flash memory technol-
G. De Sandre, L. Bettini, A. Pirola, et al., A 90 nm 4 Mb ogy for MCU, in IEEE International Conference on IC
embedded phase-change memory with 1.2 V 12 ns Design & Technology (ICICDT) (2011), pp. 1–4
read access time and 1 MB/s write throughput, IEEE M.-C. Hsieh, Y.-C. Liao, Y.-W. Chin, et al. Ultra high
International Solid-State Circuits Conference density 3D via RRAM in pure 28 nm CMOS process,
(ISSCC) Dig. Tech. Papers (2010), pp. 268–269 IEEE International Electron Devices Meeting (IEDM)
J.F. Dickson, On-chip high-voltage generation in MNOS Dig. Tech. Papers (2013), pp. 10.3.1–10.3.4
integrated circuits using an improved voltage
6 On-Chip Non-volatile Memory for Ultra-Low Power Operation 209

K.-C. Huang et al., The impacts pf control gate voltage on S. Kianian, A. Levi, D. Lee, Y. W. Hu, A novel 3 volts-
the cycling endurance of split gate flash memory. only, small sector erase, high density flash E PROM,
IEEE Electron Device Lett. 21, 359–361 (2000) Symp. VLSI Technology Dig. Tech. Papers (1994),
L.Y. Huang, M.F. Chang, C.H. Chuang et al., ReRAM- pp. 71–72
based 4T2R nonvolatile TCAM with 7x NVM-stress T.H. Kim, J. Liu, C.H. Kim, A voltage scalable 0.26 V,
reduction, and 4x improvement in speed-wordlength- 64 kb 8T SRAM with Vmin lowering techniques and
capacity for normally-off instant-on filter-based deep sleep mode. IEEE J. Solid State Circuits 44(6),
search engines used in big-data processing, Sympo- 1785–1795 (2009)
sium on VLSI Circuits Dig. Tech. Papers (2014), W.J. Kim, J.H. Jeong, Y. Kim, et al., Extended scalability
pp. 1–2 of perpendicular STT-MRAM towards sub-20 nm
D.H. Im, J.I. Lee, S.L. Cho, H.G. An, D.H. Kim, I.S. Kim, MTJ node, in IEEE International Electron Devices
H. Park, D.H. Ahn, H. Horii, S.O. Park, U-in Chung, Meeting (IEDM) Dig. Tech. Papers (2011),
J.T. Moon, A unified 7.5 nm dash-type confined cell pp. 24.1.1–24.1.4
for high performance PRAM device, IEEE Interna- E. Kitagawa, S. Fujita, K. Nomura, et al., Impact of
tional Electron Devices Meeting (IEDM) Dig. Tech. ultra-low power and fast write operation of advanced
Papers (2008), pp. 1–4 perpendicular MTJ on power reduction for high-
P. Jain, D. Jiao, X. Wang, C.H. Kim, Measurement, anal- performance mobile CPU, IEEE International
ysis and improvement of supply noise in 3D ICs, Electron Devices Meeting (IEDM) Dig. Tech. Papers
Symposium on VLSI Circuits Dig. Tech. Papers (2012), pp. 29.4.1–29.4.4
(2011), pp. 46–47 K. Kobayashi, T. Nakayama, Y. Miyawaki et al., A high-
J. Javanifard, T. Tanadi, H. Giduturi, et al., A 45 nm self- speed parallel sensing architecture for multi-megabit
aligned-contact process 1 Gb NOR flash with 5 MB/ flash E2PROMs. IEEE J. Solid State Circuits 25(1),
s program speed, IEEE International Solid-State 79–83 (1990)
Circuits Conference (ISSCC) Dig. Tech. Papers T. Kono, T. Ito, T. Tsuruda, T. Nishiyama, T. Nagasawa,
(2008), pp. 424–624 T. Ogawa, Y. Kawashima, H. Hidaka, T. Yamauchi,
M. Jefremow, T. Kern, U. Backhausen, et al., Bitline- 40-nm embedded split-gate MONOS (SG-MONOS)
capacitance-cancelation sensing scheme with 11 ns flash macros for automotive with 160-MHz random
read latency and maximum read throughput of access for code and endurance over 10 M cycles for
2.9 GB/s in 65 nm embedded flash for automotive, data at the junction temperature of 170  C. IEEE
IEEE International Solid-State Circuits Conference J. Solid State Circuits 49, 154–166 (2013)
(ISSCC) Dig. Tech. Papers (2012), pp. 428–430 T. Kono, T. Ito, T. Tsuruda et al., 40-nm embedded Split-
M. Jefremow, T. Kern, W. Allers, et al., Time-differential Gate MONOS (SG-MONOS) flash macros for auto-
sense amplifier for sub-80 mV bitline voltage embed- motive with 160-MHz random access for code and
ded STT-MRAM in 40 nm CMOS, IEEE Interna- endurance over 10 M cycles for data at the junction
tional Solid-State Circuits Conference (ISSCC) Dig. temperature of 170 C. IEEE J. Solid State Circuits 49
Tech. Papers (2013), pp. 216–217 (1), 154–166 (2014)
S. Jeloka, N. Akesh, D. Sylvester, et al., A configurable A. Kotov, Three generations of Embedded SuperFlash
TCAM/BCAM/SRAM using 28 nm push-rule 6T bit split gate cell: scaling progress and challenges, Leti
cell, Symposium on VLSI Circuits (2015), pp. C272– Innovation Days–Memory Workshop (2013)
C273 S. Koveshnikov, K. Matthews, K. Min, et al. Real-time
S.H. Jo, T. Kumar, S. Narayanan, W.D. Lu, H. Nazarian, study of switching kinetics in integrated 1T/ HfOx1R
3D-stackable crossbar resistive memory based on RRAM: intrinsic tunability of set/reset voltage and
Field Assisted Superlinear Threshold (FAST) selector, trade-off with switching time, IEEE International
IEEE International Electron Devices Meeting (IEDM) Electron Devices Meeting (IEDM) Dig. Tech. Papers
Dig. Tech. Papers (2014), pp. 6.7.1–6.7.4 (2012), pp. 20.4.1–20.4.3
M. Kamiya, Y. Kojima, Y. Kato, K. Tanaka, Y. Hayashi, C. Kuo, D. Chrudimsky, T. Jew, C. Gallun, J. Choy,
EPROM CellWith High Gate Injection Efficiency, B. Wang, S. Pessoney, A 32-Bit RISC microcontroller
IEEE International Electron Devices Meeting with 448 K bytes of embedded flash memory, Int.
(IEDM) Dig. Tech. Papers (1982), pp. 741–744 NonVolatile Memory Technol. Conference (1998),
A. Kawahara, R. Azuma, Y. Ikeda, et al. An 8 Mb multi- pp. 28–33
layered cross-point ReRAM macro with 443 MB/ M. Kuriyama, S. Atsumi, A. Umezawa, et al., A 5 V-only
s write throughput, IEEE International Solid-State 0.6 μm flash EEPROM with row decoder scheme in
Circuits Conference (ISSCC) Dig. Tech. Papers triple-well structure, IEEE International Solid-State
(2012), pp. 432–434 Circuits Conference (ISSCC) Dig. Tech. Papers
M.D. Ker, S.L. Chen, C.S. Tsai et al., Design of charge (1992), pp. 152–153
pump circuit with consideration of gate-oxide reliabil- S. Lai, Current status of the phase change memory and its
ity in low-voltage CMOS processes. IEEE J. Solid future, IEEE International Electron Devices Meeting
State Circuits 41(5), 1100–1107 (2006) (IEDM) Dig. Tech. Papers (2003), pp. 10.1.1–10.1.4
210 M.-F. Chang

P. Larsson, Measurements and analysis of PLL jitter State Circuits Conference (ISSCC) Dig. Tech. Papers
caused by digital switching noise, IEEE Journal of (2010), pp. 444–445
Solid-State Circuits (2001), pp. 1113–1119 S. Matsunaga, S. Miura, H. Honjou, et al., A 3.14 μm2
P. Larsson, Measurements and analysis of PLL jitter 4T-2MTJ-cell fully parallel TCAM based on nonvola-
caused by digital switching noise, IEEE J. Solid- tile logic-in-memory architecture, Symposium on VLSI
State Circuits (2001), pp. 1113–1119 Circuits (VLSIC) (2012), pp. 44–45
C. Lauterbach, W. Weber, D. Romer et al., Charge shar- A.V. Mezhiba, E.G. Friedman, Scaling trends of on-chip
ing concept and new clocking scheme for power effi- power distribution noise. IEEE Trans. VLSI Syst 12
ciency and electromagnetic emission improvement of (4), 386–394 (2004)
boosted charge pumps. IEEE J. Solid State Circuits 35 R. Micheloni, L. Crippa, M. Sangalli et al., The flash
(5), 719–723 (2000) memory read path: building blocks and critical
B.Q. Le, M. Achter, C.G. Chng et al., Virtual-ground aspects. IEEE Proc. 91(4), 537–553 (2003)
sensing techniques for a 49-ns/200-MHz access time R. Mih et al., 0.18 m modular triple self-aligned embed-
1.8-V 256-Mb 2-bit-per-cell flash memory. IEEE ded split-gate flash memory, in Symp. VLSI Technol-
J. Solid State Circuits 39(11), 2014–2023 (2004) ogy Dig. Tech. Papers, 2000, pp. 120–121
J.Y. Lee, S.E. Kim, S.J. Song et al., A regulated charge Y. Morita, H. Fujiwara, H. Noguchi, et al., An area-
pump with small ripple voltage and fast start-up. IEEE conscious low-voltage-oriented 8T-SRAM design
J. Solid State Circuits 41(2), 425–432 (2006a) under DVS environment, Symposium on VLSI Circuits
H.-Y. Lee, Y.-S. Chen, P.-S. Chen, et al., Comprehen- Dig. Tech. Papers (2007), pp. 256–257
sively study of read disturb immunity and optimal read F. Nardi, S. Balatti, S. Larentis, D. Ielmini, Complemen-
scheme for high speed HfOx based RRAM with a Ti tary switching in metal oxides: Toward diode-less
layer, in Proceedings of the IEEE Symposium on VLSI crossbar RRAMs, IEEE International Electron
Technology, Systems and Applications (VLSI-TSA) Devices Meeting (IEDM) Dig. Tech. Papers (2011),
(2010), pp. 132–133 pp. 31.1.1–31.1.4
A. Lee, M.F. Chang, C.C. Lin, et al., RRAM-based 7T1R K. Nii, K. Yamaguchi, M. Yabuuchi, et al., Silicon
nonvolatile SRAM with 2x reduction in store energy measurements of characteristics for passgate/pull-
and 94x reduction in restore energy for frequent-off down/pull-up MOSs and search MOS in a 28 nm
instant-on applications, Symposium on VLSI Circuits HKMG TCAM bitcell, in Proceedings of the Interna-
(VLSI Circuits) (2015), pp. C76–C77 tional Conference on Microelectronic Test Structures
J. Li, R. Montoye, M. Ishii, et al., 1 Mb 0.41 μm2 2T-2R (ICMTS) (2015), pp. 200–203
cell nonvolatile TCAM with two-bit encoding and T. Ogura, M. Hosoda, T. Ogawa et al., A 1.8-V 256-Mb
clocked self-referenced sensing, Symposium on VLSI multilevel cell nor flash memory with BGO function.
Technology (VLSIT) (2013), pp. C104–C105 IEEE J. Solid State Circuits 41(11), 2589–2600 (2006)
J. Li, R.K. Montoye, M. Ishii, L. Chang, 1 Mb 0.41 μm2 T.-Y. Oh, H. Chung, Y.-C. Cho, et al. A 3.2 Gb/s/pin 8 Gb
2T-2R cell nonvolatile TCAM with two-bit encoding 1.0 V LPDDR4 SDRAM with integrated ECC engine
and clocked self-referenced sensing. IEEE J. Solid for sub-1 V DRAM core operation, in IEEE Interna-
State Circuits 49, 896–907 (2014) tional Solid-State Circuits Conference (ISSCC) Dig.
W. Liu, K.T. Chang, C. Cavins, B. Luderman, C. Swift, Tech. Papers (2014), pp. 430–431
K.M. Chang, B. Morton, G. Espinor, S. Ledford, A T. Ohsawa, H. Koike, S. Miura et al., A 1 Mb nonvolatile
2-Transistor Source-Select (2TS) flash EEPROM for embedded memory using 4T2MTJ cell with 32 b fine-
1.8 V-Only applications, Non-Volatile Semiconductor grained power gating scheme. IEEE J. Solid State
Memory Worshop (1997), pp.4.1.1–4.1.3 Circuits 48(6), 1511–1520 (2013)
Y.C. Liu, M.F. Chang, Y.F. Lin, et al., An embedded flash S.R. Ovshinsky, Reversible electrical switching phenom-
macro with sub-4 ns random-read-access using ena in disordered structure. Phys. Rev. Lett. 21(20),
asymmetric-voltage-biased current-mode sensing 1450–1455 (1968)
scheme, in Proceedings of the IEEE Asian Solid- G. Palumbo, D. Pappalardo, Charge pump circuits: an
State Circuits Conference (A-SSCC) (2013), overview on design strategies and topologies. IEEE
pp. 241–244 Circuits Syst. Mag. 10(1), 31–45 (2009)
T.Y. Liu, T.H. Yan, R. Scheuerlein, et al., A 130.7 mm2 A. Pirovano, A.L. Lacaita, A. Benvenuti, F. Pellizzer,
2-layer 32 Gb ReRAM memory device in 24 nm tech- R. Bez, Electronic switching in phase-change
nology, IEEE International Solid-State Circuits Confer- memories. IEEE Trans. Electron Dev. 51(3),
ence (ISSCC) Dig. Tech. Papers (2013), pp. 210–211 452–459 (2004)
S. Lütkemeier, U. Ruckert et al., A subthreshold to above- H. Pozidis, N. Papandreou, A. Sebastian, et al., Reliable
threshold level shifter comprising a wilson current MLC data storage and retention in phase-change
mirror. IEEE Trans. Circuits Syst. II: Exp. Brief. 57 memory after endurance cycling, in Proceedings of
(9), 721–724 (2010) the IEEE International Memory Workshop (IMW)
G.G. Marotta, A. Macerola, A. d’Alessandro, et al., A (2013), pp. 100–103
3 bit/cell 32 Gb NAND flash memory at 34 nm with M. Rizzi, N. Ciocchini, S. Caravati, et al., Statistics of set
6 MB/s program throughput and with dynamic 2 b/cell transition in phase change memory (PCM) arrays,
blocks configuration mode for a program throughput IEEE International Electron Devices Meeting
increase up to 13 MB/s, IEEE International Solid- (IEDM) Dig. Tech. Papers (2014)
6 On-Chip Non-volatile Memory for Ultra-Low Power Operation 211

S.K. Saha, Design considerations for sub-90-nm split-gate J. Van Houdt, P. Heremans, L. Deferns, G. Groeseneken,
flash-memory cells. IEEE Trans. Electron Dev. 54, H.E. Maes, Analysis of the enhanced hot-electron
465–473 (2007) injection in split-gate transistors useful for EEPROM
Y. Sakotsubo, M. Terai, S. Kotsuji, et al., A new approach applications. IEEE Trans. Electron Dev. 39,
for improving operating margin of unipolar ReRAM 1150–1156 (1992)
using local minimum of reset voltage, IEEE Sympo- N. Verma, A.P. Chandrakasan, A 256 kb 65 nm 8T sub-
sium on VLSI Technology (VLSIT) Dig. Tech. Papers threshold SRAM employing sense-amplifier redun-
(2010), pp. 87–88 dancy. IEEE J. Solid State Circuits 43(1), 141–149
A. Serb, R. Berdan, A. Khiat, C. Papavassiliou, (2008)
T. Prodromakis, Live demonstration: a versatile, Y.-H. Wang, M.-C. Wu, C.-J. Lin et al., An analytical
low-cost platform for testing large ReRAM cross-bar programming model for the draincoupling source-side
arrays, in Proceedings of the International Symposium injection split gate flash EEPROM. IEEE Trans. Elec-
on Circuits and Systems (ISCAS) (2014), pp. 441 tron Dev. 52, 385–391 (2005)
S.S. Sheu, C.C. Kuo, M.F. Chang, et al., A ReRAM W. Wang, A. Gibby, Z. Wang, et al., Nonvolatile SRAM
integrated 7T2R non-volatile SRAM for normally-off Cell, IEEE International Electron Devices Meeting
computing application, in Proceedings of the IEEE (IEDM) Dig. Tech. Papers (2006), pp. 1–4
Asian Solid-State Circuits Conference (A-SSCC) X.P. Wang, Z. Fang, X. Li, et al., Highly compact 1T-1R
(2013), pp. 245–248 architecture (4F2 footprint) involving fully CMOS
Y.H. Shih, J.Y. Wu, B. Rajendran, M.H. Lee, R. Cheek, compatible vertical GAA nano-pillar transistors and
M. Lamorey, M. Breitwisch, Y. Zhu, E.K. Lai, oxide-based RRAM cells exhibiting excellent NVM
C.F. Chen, E. Stinzianni, A. Schrott, E. Joseph, properties and ultra-low power operation, IEEE Inter-
R. Dasaka, S. Raoux, H.L. Lung, C. Lam, Mechanisms national Electron Devices Meeting (IEDM) Dig. Tech.
of retention loss in Ge2Sb2Te5-based phase-change Papers (2012), pp. 20.6.1–20.6.4
memory, IEEE Electron Devices Meeting (IEDM) Y.-H. Wang, S.-H. Huang, D.-Y. Wang, et al., Impact of
Dig. Tech. Papers (2008), pp. 1–4 stray field on the switching properties of perpendicular
T. Song, W. Rim, J. Jung, et al., A 14 nm FinFET 128 Mb MTJ for scaled MRAM, IEEE International Electron
6T SRAM with VMIN enhancement techniques for Devices Meeting (IEDM) Dig. Tech. Papers (2012),
low-power applications, IEEE International Solid- pp. 29.2.1–29.2.4
State Circuits Conference (ISSCC) Dig. Tech. Papers Y. Wang, Y. Liu, S. Li, D. Zhang, B. Zhao, M.-F. Chiang,
(2014), pp. 232–233 Y. Yan, B. Sai, H. Yang, A 3μs wake-up time nonvol-
Y. Taito, M. Nakano, H. Okimoto, et al., 7.3 A 28 nm atile processor based on ferroelectric flip-flops, in
embedded SG-MONOS flash macro for automotive Proceedings of the European Solid-State Circuits
achieving 200 MHz read operation and 2.0 MB/S Conference (ESSCIRC) (2012), pp. 149–152
write throughput at Ti, of 170  C, IEEE International Y. Wang, Y. Liu, S. Li, X. Sheng, D. Zhang, M.-F. Chiang,
Solid-State Circuits Conference (ISSCC) Dig. Tech. B. Sai, X.-S. Hu, H. Yang, PaCC: A parallel compare
Papers (2015), pp. 1–3 and compress codec for area reduction in nonvolatile
K. Takahashi, H. Doi, N. Tamura, K. Mimuro, processors, IEEE Transactions on Very Large Scale
T. Hashizume, Y. Moriyama, Y. Okuda, A 0.9 V Integration (VLSI) Systems (2014), pp. 1491–1505
operation 2-transistor flash memory for embedded O. Wataru, K. Miyata, M. Kitagawa, et al., A 4 Mb
logic LSIs, Symp. VLSI Technology Dig. Tech. Ppaers conductive-bridge resistive memory with 2.3 GB/
(1999), pp. 21–22 s read-throughput and 216 MB/s program-throughput,
K. Takeuchi, Y. Kameda, S. Fujimura et al., A 56-nm IEEE International Solid-State Circuits Conference
CMOS 99-mm2 8-Gb Multi-Level NAND Flash (ISSCC) Dig. Tech. Papers (2011), pp. 210–211
Memory With 10-MB/s Program Throughput. IEEE W. Wei, K. Namba, J. Han et al., Design of a nonvolatile
J. Solid State Circuits 42(1), 219–232 (2007) 7T1R SRAM cell for instant-on operation. IEEE
K.-T. Tang, S.-W. Chiu, C.-H. Shih, et al., A 0.5 V Trans. Nanotechnol. 13(5), 905–916 (2014)
1.27 mW nose-on-a-chip for rapid diagnosis of C.-Y. Wen, J. Li, S. Kim, M. Breitwisch, C. Lam,
ventilator-associated pneumonia, IEEE International J. Paramesh, L.T. Pileggi, A non-volatile look-up
Solid-State Circuits Conference (ISSCC) (2014), table design using PCM (phase-change memory)
pp. 1–2, pp. 420–421 cells, Symposium on VLSI Circuits (VLSIC) Dig.
J. Tsouhlarakis, G. Vanhorebeek, G. Vehoeven et al., A Tech. Papers (2011), pp. 302–303
flash memory technology with quasi-virtual ground J.S. Witters, G. Groeseneken, H.E. Maes, Analysis and
array for low-cost embedded applications. IEEE modeling of on-chip high-voltage generator circuit for
J. Solid State Circuits 36(6), 969–978 (2001) use in EEPROM circuits. IEEE J. Solid State Circuits
K. Tsunoda, M. Aoki, H. Noshiro, et al. Highly manufac- 24, 1372–1380 (1989)
turable multi-level perpendicular MTJ with a single S.N. Wooters, B.H. Calhoun, T.N. Blalock et al., An
top-pinned layer and multiple barrier/free layers, in energy-efficient subthreshold level converter in
IEEE International Electron Devices Meeting (IEDM) 130-nm CMOS. IEEE Trans. Circuits Syst. II: Exp.
Dig. Tech. Papers (2013), pp. 3.3.1–3.3.4 Brief. 57(4), 290–294 (2010)
212 M.-F. Chang

J.Y. Wu, M. Breitwisch, S. Kim, T.H. Hsu, R. Cheek, Technology (VLSIT), Dig. Tech. Papers (2015),
P.Y. Du, J. Li, E.K. Lai, Y. Zhu, T.Y. Wang, pp. T80–T81
H.Y. Cheng, A. Schrott, E.A. Joseph, R. Dasaka, J.J. Yang et al., Engineering nonlinearity into memristors
S. Raoux, M.H. Lee, H.L. Lung, C. Lam, A low for passive crossbar applications. Appl. Phys. Lett.
power phase change memory using thermally confined 100, 113501 (2012)
TaN/TiN bottom electrode, IEEE International Elec- Y. Yano, Take the expressway to go greener, in IEEE
tron Devices Meeting (IEDM) Dig. Tech. Papers International Solid-State Circuits Conference
(2011), pp. 3.2.1–3.2.4 (ISSCC) Dig. Tech. Papers (2012), pp. 24–30
J.Y. Wu, W.S. Khwa, M.H. Lee, H.P. Li, S.C. Lai, J.A. Yater, S.T. Kang, R. Steimle, C.M. Hong,
T.H. Su, M.L. Wei, T.Y. Wang, M. BrightSky, B. Winstead, M. Herrick, G. Chindalore, Optimization
T.S. Chen, W.C. Chien, S. Kim, R. Cheek, of 90 nm split gate nanocrystal non-volatile memory,
H.Y. Cheng, E.K. Lai, Y. Zhu, H.L. Lung, C. Lam, in Proceedings of the Non-Volatile Semiconductor
Greater than 2-bits/cell MLC storage for ultra high Memory Workshop (2007), pp. 77–78
density phase change memory using a novel sensing H.C. Yu, K.F. Lin, K.C. Lin, et al., A 180 MHz direct
scheme, Symposium on VLSI Technology (VLSI Tech- access read 4.6 Mb embedded flash in 90 nm technol-
nology) (2015), pp. T94–T95 ogy operating under wide range power supply from
X.Y. Xue, W.X. Jian, J.G. Yang et al., A 0.13 μm 8 Mb 2.1 V to 3.6 V, in Proceedings of the IEEE Interna-
Logic-Based CuxSiyO ReRAM With Self-Adaptive tional Symposium on VLSI Design, Automation, and
Operation for Yield Enhancement and Power Reduc- Test (VLSI-DAT) (2013), pp. 1–4
tion. IEEE J. Solid State Circuits 48(5), 1315–1322 B. Zhai, S. Hanson, D. Blaauw et al., A variation-tolerant
(2013) sub-200 mV 6-T subthreshold SRAM. IEEE J. Solid
M. Yabuuchi, Y. Tsukamoto, M. Morimoto, et al., 13.3 State Circuits 43(10), 2338–2348 (2008)
20 nm high-density single-port and dual-port SRAMs K. Zhang, Embedded memories for nano-scale VLSIs
with wordline-voltage-adjustment system for read/ (Springer, New York, 2009)
write assists, IEEE International Solid-State Circuits L. Zhang, B. Govoreanu, B. Redolfi, et al., High-drive
Conference (ISSCC) Digest of Technical Papers current (>1 MA/cm2) and highly nonlinear (>103)
(2014), pp. 234–235 TiN/Amorphous-Silicon/TiN scalable bidirectional
S. Yamamoto, Y. Shuto, S. Sugahara, Nonvolatile SRAM selector with excellent reliability and its variability
(NV-SRAM) using functional MOSFET merged with impact on the 1S1R array performance, IEEE Interna-
resistive switching devices, in Proceedings of the tional Electron Devices Meeting (IEDM) Dig. Tech.
IEEE Custom Integrated Circuits Conference (CICC) Papers (2014), pp. 6.8.1–6.8.4
(2009), pp. 531–534 M. Zwerg, A. Baumann, R. Kuhn, M. Arnold, R. Nerlich,
T. Yamauchi, Prospect of embedded non-volatile memory An 82 μA/MHz microcontroller with embedded
in the smart society, in Proceeding of the IEEE Inter- FeRAM for energy-harvesting applications, IEEE
national Symposium on VLSI Technology, Systems International Solid-State Circuits Conference
and Application (VLSI-TSA) (2015), pp. 1–2 (ISSCC) Dig. Tech. Papers (2011), pp. 334–336
T. Yamauchi, H. Kondo, K. Nii, Automotive low power
technology for IoT society, Symposium on VLSI
On-Chip Non-volatile STT-MRAM
for Zero-Standby Power 7
Xuanyao Fong and Kaushik Roy

In this Chapter, we present spin-transfer torque random access for IoT applications (Yamauchi
magnetic random access memory (STT-MRAM) et al. 2015). However, STT-MRAM is especially
suitable for IoT applications. Its ability to operate attractive as compared to ReRAM and eFlash due
at low supply voltages, non-volatility, good endur- to its relative ease of integration into the back-
ance, and small bit-cell footprint are especially end-of-line (BEOL) in the CMOS fabrication
attractive for IoT applications in which low energy process, ability to operate at <1.2 V supply
consumption is crucial. We will present the voltages, <100 ns read and write delays, good
fundamentals of STT-MRAM. The design of the endurance (>1014 cycles) and bit-cell footprint as
STT-MRAM storage device, memory bit-cell and small as 40 F2 (F is the smallest feature size of the
memory array architecture are also discussed to CMOS technology) (ITRS Roadmap 2014). In
highlight the benefits STT-MRAM brings to IoT this chapter, we discuss the modeling, design,
applications, as well as the design issues that need and optimization of STT-MRAMs and present
to be considered. We then present a device/circuit/ some of the potential benefits in relation to IoT
architecture co-design approach for STT-MRAM. applications. As we will see later in this chapter, a
Finally, we will discuss the trends in STT-MRAM highly desirable trait of STT-MRAM for IoT
and give some perspectives on the future of applications is that they only need to be powered
STT-MRAM design. when they are being accessed, which results in
zero-standby power.

7.1 Introduction
7.2 The Magnetic Tunnel Junction
As discussed in Chap. 6, several non-volatile and STT-MRAM
memory technologies such as embedded flash
(eFlash) and resistive RAM (ReRAM) are avail- The storage device in STT-MRAM is the
able for implementing memories with truly magnetic tunnel junction or MTJ. The structure
of a typical MTJ with in-plane magnetic anisot-
ropy is shown in Fig. 7.1. The MTJ may be
X. Fong (*) visualized as a stack consisting of two nano-
National University of Singapore, Singapore 119077, magnets sandwiching a tunneling oxide barrier
Singapore (usually AlOx or more commonly MgO). One
e-mail: elefongx@nus.edu.sg
nano-magnet is a soft ferromagnetic layer used
K. Roy to store the information (also called the “free”
Purdue University, West Lafayette, IN 47907, USA

# Springer International Publishing AG 2017 213


M. Alioto (ed.), Enabling the Internet of Things, DOI 10.1007/978-3-319-51482-6_7
214 X. Fong and K. Roy

Fig. 7.1 The storage device in the STT-MRAM is the parallel and anti-parallel MTJ configurations. The
magnetic tunnel junction (MTJ). The typical stack struc- directions of programming current flow through the
ture of an MTJ with in-plane anisotropy is shown. It is bit-cell are shown using colored arrows whereas the
easier to understand the operation of an MTJ by only black arrows indicate possible magnetization directions
considering the simple tri-layer stack illustrating the of magnetic layers in the MTJ

HSHIFT ¼ HSHIFT  |HC,1|). However, the ther-


mal stability and retention time of the MTJ is
determined by min (|HC,1|,| HC,2|). Thus, non-zero
HSHIFT reduces the thermal stability and reten-
tion time of the MTJ.
An example of a SAF layer is shown in
Fig. 7.1. The anti-ferromagnetic (AFM) layer
pins the magnetization of the bottom pinning
layer via exchange bias effect (Nogués et al.
2005). The top pinning layer is anti-
ferromagnetically coupled to the bottom pinning
layer via interlayer exchange coupling with a
non-magnetic spacer such as Ru (Parkin et al.
1990). Exchange coupling increases the mag-
netic field needed to switch the magnetizations
of the coupled pinning layers and effectively pins
the magnetizations of both ferromagnetic layers.
Fig. 7.2 The M-H loop of an MTJ without SAF based This allows the magnetization of the top pinning
pinned layer exhibits a shifted hysteresis loop as layer to be used as a reference.
illustrated here, which corresponds to a degraded
retention time The MTJ is designed to be switchable
between two stable states. When the magnetiza-
tion directions of both the free and the pinned
layer) whereas the other nano-magnet is a hard layers point in the same direction, the MTJ con-
ferromagnetic layer for use as a reference layer figuration is called the “parallel” state (P). When
(also called the “fixed” or “pinned” layer). instead the magnetization directions of the free
In many MTJ stacks, a synthetic anti- layer and the pinned layer point in opposite
ferromagnetic (SAF) layer is used for the pinned directions, the MTJ configuration is called
layer to reduce the stray magnetic field. The stray the “anti-parallel” state (AP). An important met-
magnetic field may shift the M-H loop of the ric for the MTJ is its resistance-area (RA) product
MTJ as illustrated in Fig. 7.2. The M-H loop (Huai 2008). The RA product of the MTJ varies
(M: magnetization, H: applied magnetic field) exponentially with the thickness of the tunneling
exhibits a horizontal shift, HSHIFT. HC,1 and oxide barrier (tMgO) since the mechanism for
HC,2 are symmetric about HSHIFT (i.e., HC,2  electron transport is tunneling. The MTJ
7 On-Chip Non-volatile STT-MRAM for Zero-Standby Power 215

resistance, RMTJ, depends linearly on the cross- where μ0 is the permeability of vacuum, and Msat
sectional area of the MTJ (AMTJ) similar to an and VFL are the saturation magnetization and
Ohmic conductor when tMgO is constant. RMTJ volume of the nano-magnet, respectively. EB of
also depends on the relative magnetic polariza- the pinned layer in the MTJ is engineered to be
tion of the free layer with respect to the pinned much larger than that of the free layer in the MTJ
layer. The dependence of RMTJ on magnetic so that the pinned layer magnetization direction
polarization arises due to the difference in den- is fixed. As such, only the magnetization direc-
sity of states around the Fermi energy, EF, in the tion of the free layer can change during opera-
ferromagnetic layers (Datta et al. 2012). When tion. The most common form of anisotropy
the MTJ is in the P configuration, the density of engineered into the free layer of an MTJ is the
states of like-spins around EF is very high in the uniaxial anisotropy. This causes the magnetiza-
ferromagnetic layers. Conversely, the density of tion of the magnetic layers to have a preferential
states of like-spins around EF in the ferromag- alignment axis—the magnetization will align
netic layers is very low when the MTJ is in AP along this axis when no external stimulus is pres-
configuration. Thus, RMTJ is low in the P config- ent. The uniaxial anisotropy energy density, Ku2,
uration (RMTJ ¼ RP ¼ RL) and high in the AP and EB of the nano-magnet are related by
configuration (RMTJ ¼ RAP ¼ RH). This differ-
K u2 V FL ¼ EB ð7:4Þ
ence in RMTJ, termed the “tunneling magneto-
resistance ratio” (or TMR), is quantified as Hence, Ku2VFL must be kept constant to main-
RAP  RP tain the same thermal stability when the volume
TMR ¼  100% ð7:1Þ of the nano-magnet is reduced. We will discuss
RP
this in more detail in the later sections. In the
and is an important metric for the performance of presence of a stray magnetic field that shifts the
MTJs as memory elements. A larger TMR also M-H loop of the MTJ by HSHIFT (as in Fig. 7.2),
means that MTJ states can be distinguished more the effective barrier height becomes
easily. Binary data may then be represented and
stored as the resistance state of the MTJ. EB ¼ 0:5μ0 Msat V FL ðH k  H SHIFT Þ ð7:5Þ
The magnetic layers in an MTJ, which may be Hence, the free layer of the MTJ needs to be
considered as nano-magnets, are engineered with engineered with a larger Ku2 to compensate for
anisotropy energies to satisfy thermal stability the barrier height degradation due to HSHIFT.
and data retention requirements. The required
energy barrier height (EB, in Joules) and reten-
tion time (τ, in seconds) must satisfy:
  7.2.1 Spin-Transfer Torque
EB mτ
Δ ¼ > ln ð7:2Þ
kB T τ0 ln 2 Nano-scale MTJs may be switched using the
spin-transfer torque (STT) phenomenon shown
where kB is the Boltzmann constant, T is the
in Fig. 7.3a, which was theoretically predicted
temperature in Kelvin, m is the number of bits
by Slonczewski and Berger independently in
in the memory and τ0 is the characteristic time in
1996 (Berger 1996; Slonczewski 1996). Since
seconds (τ0  1 ns). Δ ¼ 40.66 corresponds to a
then many experiments have observed STT
retention time of about 10 years for m ¼ 1. In
switching (Huai et al. 2004; Katine et al. 2000;
real STT-MRAM arrays, Δ > 70 may be
Myers 1999). STT arises due to the spin property
required (Naeimi et al. 2013). The minimum
of electrons. When the majority of electron spins
magnetic field for switching a nano-magnet, Hk,
in a nano-magnet are aligned in a particular direc-
and the energy barrier are related by (Sun 2000):
tion, the magnetization of the nano-magnet also
2EB points in that direction. This is illustrated by the
Hk ¼ ð7:3Þ
μ0 Msat V FL different density of states for different electron
216 X. Fong and K. Roy

(a)
Slonczewski-like
torque
torque
transmitted
electrons
incident
me m
electrons

(b) E

EF

spin-splitting energy
EC0

D (E) D (E)

Fig. 7.3 (a) When an electron carries spin polarization direction of the electron. This occurs because magnetism
that is non-collinear with the magnetization of the nano- is a result of unequal spin density of states in the nano-
magnet it is incident on, it exchanges spin angular magnet as illustrated in (b)—the flow of electrons with
momentum with the nano-magnet. Consequently, spin- other spin directions perturb the total spin populations in
transfer torque (decomposed into Slonczewski-like and the nano-magnet, which manifests as a torque on the
field-like components) is exerted on the magnetization magnetization of the nano-magnet
of the nano-magnet to align it into the spin polarization

spins in Fig. 7.3b. When an electron is incident on current in which the majority of electrons are
a nano-magnet, it experiences an exchange spin polarized in a particular direction) transfer
field—the same field that aligns the spin their spin momentum to the nano-magnet and in
directions of all electrons in the ferromagnet—if the process exerts a torque on the magnetization
its spin polarization direction is non-collinear of the nano-magnet that tries to align the magne-
with the magnetization direction of the nano- tization direction of the nano-magnet with the
magnet. The exchange field exerts a torque on spin polarization direction of the spin-polarized
the spin polarization of the electron, aligning it current. The magnetization direction of the nano-
with the magnetization direction of the nano- magnet may be switched if the torque exerted is
magnet. Hence, when current flows through the large enough to overcome all other energies in the
MTJ, the ferromagnetic layers also act as spin nano-magnet. Furthermore, the rate of spin
filters that polarize the spin direction of electrons momentum transfer and the torque exerted are
constituting the current flow. Whereas the proportional to the rate of electron flow or the
exchange field exerts torque on the spin polariza- current, and determine the switching time. The
tion of the electron, due to conservation of spin current or current density needed to achieve a
angular momentum, an equal and opposite torque specific switching time is the critical current, IC,
(spin-transfer torque) is exerted on the magneti- or critical current density, JC. It is often easier to
zation of the nano-magnet to align it with the spin analyze the MTJ in terms of the intrinsic
polarization direction of the electron. Hence, the switching current density, JC0, given as (Huai
electrons in a spin-polarized current (i.e., a 2008; Sun 2000)
7 On-Chip Non-volatile STT-MRAM for Zero-Standby Power 217

2eαMsat tFL H eff efficient than that for anti-parallelizing the


J C0 ¼ ð7:6Þ
ℏη MTJ configuration (i.e., IC and JC are asym-
metric and depends on switching direction
Here, e is the elementary charge, α and tFL are (Datta et al. 2012; Ikeda et al. 2010). It has
the Gilbert damping constant and thickness of the been reported that JC for anti-parallelizing the
free layer, respectively. ℏ is the reduced Planck MTJ can be 10%–200% larger than for
constant, η is the spin polarization efficiency parallelizing the MTJ (Ikeda et al. 2010;
factor, and Heff is the effective magnetic field Kishi et al. 2008).
spin-transfer torque must overcome when As just mentioned, spin-transfer torque, τ STT,
switching the free layer magnetization. is only exerted when the spin polarization direc-
In an MTJ, it is easier for spin-transfer tion, p, of the electrons incident on the nano-
torque to switch the free layer than to switch magnet is non-collinear with the magnetization
the pinned layer because the EB of the pinned direction of the nano-magnet, m. It can be shown
layer is much higher than that of the free layer. that (Salahuddin et al. 2008)
Let us consider what happens when electrons
are flowing from the pinned layer to the free τSTT ¼ β½εðm  p  mÞ þ ε0 ðp  mÞ
layer in an MTJ. The pinned layer spin- ð7:7Þ
polarizes the incoming electrons which then
flow into the free layer. These electrons are where β is a scalar factor proportional to the
spin-polarized in the direction of the pinned current flowing through the MTJ. ε and ε0 are
layer magnetization and transfer their spin scalar factors proportional to the strength of
momentum to the free layer. The spin-transfer Slonczewski-like and field-like torques, respec-
torque exerted on the free layer tries to align tively. In an MTJ, the spin polarization direction
the free layer magnetization with that of the of the electrons is pointing in the magnetization
pinned layer (i.e., MTJ is switched into the P direction of the pinned layer or that of the free
configuration). Consider when electrons flow layer. The magnetization directions of the two
from the free layer to the pinned layer instead. magnetic layers are collinear to maximize the
Electrons entering the free layer from the TMR and the distinguishability between the P
non-magnetic metal interconnect are not spin- and AP configurations of the MTJ. Thus, τ STT
polarized and can have any spin direction. should be negligible. However, thermal effects
Electrons spin-polarized in direction of the perturb the magnetization of the nano-magnets in
pinned layer magnetization are able to tunnel the MTJ, causing them to be non-collinear and
across the oxide easily. Those with the oppo- τ STT can be large enough to overcome all other
site spin polarization may not tunnel across the energies in the free layer of the MTJ. Hence,
oxide easily and accumulate in the free layer. spin-transfer torque switching of the MTJ is a
These electrons transfer their spin angular stochastic process since thermal effects are ran-
momentum to the free layer and exert a torque dom in nature. The thermal effect may be
that aligns the direction of free layer magneti- modeled as a fluctuating magnetic field written
zation opposite to that of the pinned layer (i.e., as (Brown 1963)
MTJ is being switched into the AP configura- sffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi
tion). Consequently, the spin directions of the αkB T
HEFF ¼ ξ ð7:8Þ
electrons become aligned with the magnetiza- γμ0 Msat V FL δt
tion direction of the pinned layer and they may
then easily tunnel across the oxide. From this ξ is a 3-vector whose components are indepen-
discussion, we can see that the process of dent Gaussian random variables with zero mean
parallelizing the MTJ configuration is more and unit variance. γ is the gyromagnetic ratio
218 X. Fong and K. Roy

Fig. 7.4 This graph 3


illustrates an example of

Dynamic Switching
the pulse width, τ, Thermal
2.5
Activation
dependence of JC needed to
program an MTJ with 0.5
2
success probability. The

J (τ) / JC0
spin torque generated at
Precessional
low current densities is still 1.5
Switching
able to switch the MTJ state

C
due to heating of the MTJ, J
1 C0
which increases the thermal
effects
0.5

0
−1 0 1 2 3 4
10 10 10 10 10 10
Pulse Width, τ (ns)

(17.6 MHz/Oe) and δt is the constant time step thermal effects to sufficiently reduce the effec-
used in numerical simulation of the MTJ tive barrier height.
dynamics.
Also, it was found that JC depends on the
pulse width, τ (Huai 2008). An example of this 7.2.2 Integrating MTJ with CMOS
is illustrated in Fig. 7.4. Three switching regimes Technology
can be observed: the precessional, thermal and
dynamic regimes. In the precessional regime, When designing STT-MRAM arrays, the fabri-
τSTT completely overcomes all other energies in cation steps need to be understood in order to
the free layer to switch the MTJ. Thermal effects understand the impact of design choices on the
only affect the magnetization direction of the characteristics of the MTJ, the impact of the
free layer in the MTJ just prior to onset of JC. bit-cell topology on the bit-cell footprint, layout
Thus, the dependence of programming failure on of the memory array in terms of area overhead,
JC is determined by the distribution of free layer and performance and energy overheads due to
magnetization direction just prior to applying the increased parasitics. Figure 7.5 illustrates one
switching current pulse. In the thermal regime, method of integrating MTJs into the back end
τSTT alone is unable to overcome all other of the CMOS fabrication process (back-end-of-
energies in the free layer. However, thermal line, BEOL) developed by Qualcomm (Kang
effects, which are also increased due to Joule et al. 2014). The MTJs are placed in between
heating by the current flowing through the MTJ, metal layers that form the interconnects of the
assists τSTT in switching the MTJ configuration. integrated circuit (IC). After the chemical
The dependence of programming failure on JC in mechanical polishing (CMP) step (step 1) to
the thermal regime is determined by the random expose the metal layer on which the MTJs are
time needed for thermal effects to sufficiently to be placed, the layers constituting the MTJ
reduce the effective barrier height such that stack are deposited (steps 2 and 3). The MTJ
τSTT can switch the MTJ configuration. In the stack consists of many thin layers including
dynamic regime, the dependence of program- those needed to form the top and bottom
ming failure on JC is determined by both the electrodes (TE and BE, respectively), the seed
distribution of free layer magnetization prior to layer for growing high quality magnetic thin
applying JC and the random time needed for films, the magnetic layers needed to form the
7 On-Chip Non-volatile STT-MRAM for Zero-Standby Power 219

Fig. 7.5 An example flow for fabricating magnetic tun- top electrode, ILD: interlayer dielectric, CMP: chemical
nel junctions in the back-end-of-line (BEOL) of the mechanical polishing
CMOS fabrication process. BE: bottom electrode, TE:

pinned and free layers of the MTJs, and also the next metal layer is then deposited (step 10) and a
tunnel barrier. In Fig. 7.5, the seed layer and CMP step (step 11) is performed to complete the
layers constituting the pinned layer of the MTJ definition of the metal interconnect layer. The
are deposited on the bottom electrode next ILD layer is then deposited (step 12) and
(BE) before the tunneling oxide layer. The first the rest of the BEOL fabrication process
lithographic step is then applied to pattern the continues.
MTJ pillars using the BE as the etch stop layer
(step 4). Since the thin film layers constituting
the MTJ stack are very sensitive to particle 7.2.3 Impact of STT-MRAM
contaminants, a dielectric passivation layer is Integration on IC Design
immediately deposited after MTJ patterning
(step 5) to protect them. Particle contamination The STT-MRAM array may be fabricated
can be detrimental to MTJ characteristics. A together with other CMOS circuits in an system-
CMP step is performed to expose the capping on-chip (SoC) and hence the placement of the
layer of the MTJs after the dielectric passivation MTJ layers needs to account for the interconnect
layer is deposited. Thereafter, the TE layer is wiring in the rest of the silicon wafer. For exam-
deposited to contact the capping layer of the ple, increasing the separation between the lower
MTJs (step 6). The TE and BE of individual metal layers to accommodate the MTJ may
MTJ pillars are then defined by lithography and increase parasitics in the interconnects in other
etch (step 7). The next interlayer dielectric (ILD) parts of the IC, which can negatively affect the
layer is then deposited (step 8). Lithography and overall performance of the IC. The MTJs may be
etching are then performed to define the next formed between the higher interconnect metal
layer of interconnect (step 9). This layer also layers (such as between the last 1 metal layer
serves as an electrical connection to the MTJs. and the first 2 interconnect metal layer) where
Note that the material for the TE layer have been the separation between layers across the IC is
chosen so that it may be used as an etch stop. The large enough accommodate the MTJs. As a
220 X. Fong and K. Roy

result, the minimum pitch between MTJs and footprint of the STT-MRAM bit-cell that
hence the integration density of STT-MRAM, requires such an MTJ connection. An alternative
may be limited by the minimum pitch between scheme is to swap the order of pinned layer and
metal interconnects. Furthermore, the via free layer deposition but the impact on the MTJ
parasitics between the MTJ and the transistors characteristics depends on the fabrication
below may be significantly increased if the MTJ process.
is placed too high in the interconnect layers.
Hence, process for integrating MTJs may
become an important factor not only in determin- 7.3 Design of the STT-MRAM
ing the characteristics of the MTJ, but also in the Bit-Cell
overall performance of the IC.
The order of stack deposition may also impact Figure 7.7 shows the topology of the basic
the footprint of the STT-MRAM bit-cell, as we one-transistor one-MTJ (1T-1M) STT-MRAM
shall see in the next section. The MTJ pinned bit-cell. One electrode of the MTJ is connected
layer stack may be deposited first followed by the to the bit line (BL) while the other electrode is
MgO and then the free layer stack. Consider if connected to the access transistor (ATx). ATx is
the pinned layer of the MTJ is connected to the also connected to the source line (SL) as shown,
transistor layer below as shown in Fig. 7.6a. This and the word line (WL) is connected to the gate
may be achieved by placing the MTJ on top of a of ATx. WL is used to turn ATx ON and OFF.
stack of vias. If the free layer is to be connected The 1T-1M STT-MRAM bit-cells can have two
to the transistor below instead as Fig. 7.6b configurations (Kishi et al. 2008; Lin et al. 2009):
illustrates, the metal layer on top of the TE the “standard” connection (SC) as shown in
needs to be extended to one side and then Fig. 7.7a, and the “reversed” connection
connected to the transistor below through a (RC) shown in Fig. 7.7b. The bit-cell is accessed
stack of vias. The additional area needed to for read and for write operations by charging WL
accommodate this extension may increase the to VDD to turn ATx ON. Read operation may then
be performed by sensing the resistance between
BL and SL. Write operations are performed by
setting the voltages on BL and SL to the values
shown in Fig. 7.7. Let us consider the SC config-
uration for example. A bit-cell having MTJ in P
configuration stores ‘0’ whereas a bit-cell having
MTJ in AP configuration stores ‘1’. The bit-cell
is programmed with ‘0’ by applying VDD and
GND to BL and SL, respectively. The electrons
constituting the write current flow from the
pinned layer into the free layer of the MTJ to
parallelize the MTJ configuration. A ‘1’ is
programmed into the bit-cell by applying VDD
and GND to SL and BL, respectively. The
electrons flow from the free layer to the pinned
layer in this configuration, and exert torque that
anti-parallelizes the MTJ configuration. Note
Fig. 7.6 The bottom pinned MTJ may have its pinned that the size of ATx, the value of VDD, and the
layer directly connected to the transistor below as shown in write current pulse width are all designed such
(a). If the free layer is to be connected instead and the order
that the current flowing through MTJ during
of layer deposition cannot be altered, the metal layer above
TE has to be extended, as shown in (b), and connected to write operations is larger than IC. Compared to
the transistor below through a stack of vias the SC bit-cell configuration, the connections of
7 On-Chip Non-volatile STT-MRAM for Zero-Standby Power 221

Fig. 7.7 The topology of the 1T-1M STT-MRAM access transistor. The voltages on the control lines of the
bit-cell is shown. The (a) “standard” and (b) “reversed” bit-cell corresponding to various operations of each
connection differs in the way the MTJ is connected to the bit-cell configuration are shown in the tables

the MTJ are swapped in the RC bit-cell configu- 7.3.1 Design Issues of the 1T-1M
ration. Hence, the voltages on BL and SL for STT-MRAM Bit-Cell
write operations of the RC bit-cell configuration
are also swapped as compared to the SC bit-cell Let us now discuss the major design issues of
configuration. 1T-1M STT-MRAM bit-cells. The source degen-
During read operations, a sense amplifier is eration of ATx during write operations, shared
used to sense the MTJ state in the STT-MRAM read and write current paths, and single-ended
bit-cell through BL or through SL. Also, a con- sensing scheme are the three major issues in
stant voltage or constant current scheme may be 1T-1M STT-MRAM bit-cell. As we shall see,
used to sense RMTJ (Dorrance et al. 2011; Fong these design issues have conflicting design
et al. 2012). In the constant voltage scheme, a requirements, which constrain the design space
fixed voltage, VRD, is applied across the bit line of 1T-1M STT-MRAM bit-cell. If not carefully
and the source line of the STT-MRAM bit-cell taken care of, these design issues may lead to
and the resulting current flowing through the excessive energy consumption, and degraded
MTJ, IMTJ, is compared to a reference current, STT-MRAM performance and reliability.
IREF. IMTJ can be either higher or lower than
IREF, depending on the resistance state of the 7.3.1.1 Source Degeneration of ATx
MTJ. The advantage of the constant voltage In order to program the 1T-1M STT-MRAM
scheme is that IMTJ during read operations bit-cell, bi-directional write current flow is
may be amplified in the sense amplifier to required. Figure 7.8 shows the voltages on the
improve sensing speed. The disadvantage is circuit nodes of the bit-cell during write
that the result of the sensing needs to be operations. When current flows from BL to SL
converted into an output voltage. In the constant as illustrated in Fig. 7.8a, the overdrive voltage
current scheme, a fixed current, IRD is passed of ATx (VGS) is at VDD. When the direction of
through the MTJ and the voltage developed current flow is reversed as shown in Fig. 7.8b, the
across the bit line and the source line, VBC, is voltage on the source terminal of ATx is at GND
compared with a reference voltage, VREF. The < VS < VDD. As such, VGS < VDD and the drive
constant current scheme has the advantage that strength of ATx is weakened. Furthermore, the
the result of the sensing is the voltage domain asymmetry in ATx drive strength may be
does not need to be converted. Furthermore, exacerbated by the asymmetry in IC of the MTJ.
since the MTJ is programmed by passing cur- The size of ATx may need to be enlarged to
rent through it as we will see later, IRD may be ensure successful write operations when passing
limited to prevent accidental programming of current from SL to BL. This increases the write
the MTJ during read operations or read-disturb current supplied when write current is flowing
failures. However, the |VBC-VREF| signal from BL to SL, which increases energy con-
generated by IRD may be too small to be easily sumption and may degrade the reliability of the
sensed. MTJ tunneling oxide barrier. The reliability of
222 X. Fong and K. Roy

Fig. 7.8 (a) The overdrive


voltage, VGS, of the access
transistor is at VDD when
write current flows from
BL to SL. (b) VGS < VDD
when write current flows
from SL to BL, and the
drive strength of the access
transistor is reduced

the MTJ tunneling oxide barrier is crucial for failure may significantly constrain the design
maintaining the large TMR required for reliably space. For example, EB may be increased to
distinguishing between RP and RAP. reduce disturb failures. Doing so increases IC,
resulting in increased write currents. Hence,
design choices must be made carefully to achieve
7.3.1.2 Shared Read and Write Current
an optimum 1T-1M STT-MRAM design.
Paths
During write operations, current needs to be
passed through the MTJ to exert STT so as to 7.3.1.3 Single-Ended Sensing Scheme
switch the MTJ configuration. During read The sensing scheme for 1T-1M STT-MRAM
operations, current flows through the MTJ bit-cells described in the earlier sections uses
regardless of whether the sensing scheme used single-ended sensing—the voltage across or cur-
is the constant voltage scheme or constant cur- rent through the bit-cell is compared with a com-
rent scheme. Hence, the current flowing through mon reference during sensing. Under process
the MTJ needs to be limited to avoid disturb variations, the resistance of MTJs will deviate
failures. A disturb failure occurs when the MTJ from the nominal value as depicted by the scatter
is accidentally programmed during a read opera- plot in Fig. 7.9. Some bit-cells having MTJ in P
tion. Since the circuit nodes in the bit-cell need to configuration may have read current smaller than
be charged or discharged during sensing the reference current, or read voltage larger than
operations, limiting the amount of current flow the reference voltage. Also, some bit-cells having
through the bit-cell limits the rate at which these MTJ in AP configuration may have read current
circuit nodes are charged or discharged. As a larger than the reference current, or read voltage
consequence, the sensing speed is reduced. Fur- smaller than the reference voltage. The sense
thermore, limiting the amount of current flowing amplifier will sense the bit-cells as having MTJ
through the bit-cell during sensing may also in AP configuration in the former case whereas
reduce the signal margin available for the sense the bit-cells are sensed as having MTJ in P con-
amplifier to determine the MTJ state. As a result, figuration in the latter case. This is also called
the sense amplifier may not be able to reliably decision failure.
distinguish between RP and RAP.
Note that it may be more difficult to limit the
current flowing through the MTJ during read 7.3.2 Scalability of STT-MRAM
operation if the ATx is enlarged to mitigate the
source degeneration problem discussed previ- When designing STT-MRAM, it is crucial for
ously. The conflicting requirements for STT-MRAM designers to understand how the
mitigating source degeneration and disturb size of the access transistor and the MTJ impact
7 On-Chip Non-volatile STT-MRAM for Zero-Standby Power 223

For MTJs with IMA, the shape of the nano-


magnets are engineered to achieve the required
energy barrier height (Devolder 2011). In these
nano-magnets, the width is defined by the limits
of the process technology. The length and thick-
ness of the nano-magnet are varied to engineer its
energy barrier. However, the aspect ratio (AR ¼
length/width) is usually in the range of 2–3 to
keep the MTJ footprint small and for
manufacturability (Apalkov et al. 2010). The
thickness of the nano-magnet is then varied to
meet thermal stability requirements. On the other
hand, the nano-magnets in MTJs with PMA have
Fig. 7.9 Under process variations, the read currents other anisotropies, such as interfacial perpendic-
through the 1T-1M STT-MRAM bit-cells may be ular anisotropy (Ikeda et al. 2010) or magneto-
distributed as this example plot shows. Each data point crystalline anisotropy (Apalkov et al. 2010), that
on the plot is the read current through a bit-cell when its
MTJ is in the P configuration and the AP configuration.
overcome the shape anisotropy. The nano-
The bit-cells falling to the right of and below IREF will not magnets in MTJs with PMA may be modeled as
be correctly sensed by the sense amplifier having only uniaxial anisotropy and
2K u2
the characteristics of the STT-MRAM bit-cell. Hk ¼ ð7:10Þ
μ0 Msat
Furthermore, it is desirable to scale down the
size of the STT-MRAM bit-cell to achieve high where Ku2 is the effective uniaxial anisotropy
bit density and reduce the cost per bit. The size of constant.
the MTJ may need to be varied to achieve this Scaling analysis of the footprint of IMA and
while also ensuring that its resistance is not too PMA based MTJs conclude that IMA based MTJ
large for the access transistor to provide adequate has excellent scalability down to 10–20 nm
write current. widths for thermal stability factor, Δ ¼ 60
MTJs may be engineered with two flavors of (Apalkov et al. 2010; Devolder 2011). Below
magnetic anisotropy for satisfying thermal stabil- this critical width, the geometry of the free
ity and retention time requirements—perpendic- layer causes the anisotropy energies in the
ular and in-plane magnetic anisotropy (PMA and nano-magnet to favor a PMA configuration.
IMA, respectively). The magnetization of the Also, the critical width is increased if the
magnetic layers in the MTJ with IMA points required thermal stability factor is larger. This
within the plane of the wafer on which the MTJ has motivated a shift in research focus toward
is deposited. On the other hand, the magnetiza- PMA based MTJs.
tion of the magnetic layers in the MTJ with PMA Another crucial characteristic of the PMA
points perpendicular to the plane of the wafer. In based MTJ is that its IC0 is constant under size
both flavors of MTJs, the minimum magnetic scaling (Apalkov et al. 2010). Thus, the voltage
field to switch the MTJ configuration, Hk, and required to switch the MTJ, VC0, scales as
the energy barrier, EB, are related by (Sun 2000):
4  RA  I C0
2EB V C0 ¼ ð7:11Þ
Hk ¼ ð7:9Þ πw2
μ0 Msat V FL
where a cylindrical MTJ with diameter w is
where μ0 is the permeability of vacuum, and Msat assumed. The supply voltage required is
and VFL are the saturation magnetization and expected to scale down together with the
volume of the nano-magnet, respectively. CMOS technology node (Roadmap 2014),
224 X. Fong and K. Roy

which is important for IoT and other applications scale down IC0 as the footprint of the MTJ is
requiring low energy consumption. Hence, the scaled down. The options for doing so, from
RA product of the MTJ needs to be scaled down analyzing Eq. (7.6), are to: (1) decrease the
(by reducing the thickness of the tunneling oxide damping factor, α, (2) decrease Msat, or (3) to
barrier) so that the access transistor is able to improve the polarization efficiency factor, η.
supply sufficient write current to the MTJ. How- Reducing Msat may require a change of material
ever, it was shown that the TMR of the MTJ may for the free layer of the MTJ, whereas increasing
be substantially degraded when its RA product is η also requires careful optimization of the inter-
reduced (Yuasa et al. 2004). face between the free layer and tunneling oxide
Another important consideration when scaling barrier in the MTJ. It was found recently that α
down the RA product of the MTJ is the available and the interfacial magnetic anisotropy energy
voltage signal for the sense amplifier to distin- density may be optimized in a PMA based MTJ
guish between a bit-cell storing an MTJ with P having an FeB free layer sandwiched between
configuration from one storing an MTJ with AP two MgO based tunneling oxide barriers (Tsunegi
configuration. Let us now derive a minimum et al. 2014). Reducing α is promising because
magneto-resistance (MR) ratio of the bit-cell both the randomness of STT switching and IC0
needed to meet sensing requirements when the are reduced. The magnitude of the random ther-
pffiffiffi
resistance of the access transistor is neglected. mal field depends on α (see Eq. (7.8)) and hence,
The inclusion of the resistance of the access tran- reducing α also reduces the stochasticity of STT
sistor will increase the MR ratio we calculate. We based switching in MTJs. Thus, recent research
will assume a constant current sensing scheme so works are focused on reducing α. However, the
as to limit read-disturb failures. Under this STT-MRAM designer must ensure that the
scheme, the current being passed through the required MR ratio to meet sensing requirements
bit-cell during sensing is limited to a fraction of is achievable when IC0 is reduced.
IC0. The voltage developed across the bit-cell is
compared to a reference voltage by a sense ampli-
fier to determine the configuration of the MTJ. 7.3.3 Alternative STT-MRAM Bit-Cell
Ordinarily, the difference in voltage that the sense Topologies
amplifier sees needs to be at least 50 mV. This
means the difference in voltage due to the differ- As we saw in the preceding sections, one of the
ence in MTJ resistance is δV  0.1. Hence, for a most challenging aspect of STT-MRAM design is
cylindrical MTJ with diameter w, the MR ratio the conflicting design requirements for read
has to satisfy: operations and for write operations—improving
write operation in the 1T-1M STT-MRAM
0:1ð0:25πw2 Þ
MR ratio  ð7:12Þ bit-cell requires degrading the read operation and
κI C0 RA
vice versa. However, alternative bit-cell
κIC0 is the current passed through the bit-cell to topologies may open new pathways for optimizing
sense the MTJ configuration. For a cylindrical the design of STT-MRAM. In the following
MTJ with 50 nm diameter and κIC0 ¼ 5 μA, the sections, we will present several alternative
MR ratio needs to be larger than 300% for STT-MRAM bit-cell topologies that are able to
RA  13 Ω-μm2. Fortunately, the required MR mitigate the conflicting design requirements in
ratio decreases rapidly when the MTJ diameter is STT-MRAM.
scaled down. For w  30 nm and RA  6 Ω-μm2,
the required MR ratio is less than 250%. Note that 7.3.3.1 The 2T-1M STT-MRAM Bit-Cell
in the preceding analysis, we have assumed that The motivation for using the 2T-1M
IC0 does not scale down with w. STT-MRAM bit-cell topology may be under-
An alternative scheme that relaxes the require- stood by first analyzing the 1T-1M STT-MRAM
ment to scale the RA product of the MTJ is to bit-cell. The major failure mechanisms in the
7 On-Chip Non-volatile STT-MRAM for Zero-Standby Power 225

1T-1M STT-MRAM bit-cell are decision, dis- probability. As a result, the minimum PDECISION
turb, and write failures (Fong et al. 2012). Deci- cannot be achieved.
sion and disturb failures were briefly described in Alternatively, the design constraint can be
Sect. 7.3.1. Write failure occurs when the MTJ relaxed by noting that multi-finger transistors
configuration is not successfully programmed are typically used to implement very wide
during the write operation. The optimization transistors. Multi-finger transistors are multiple
methodology proposed in (Fong et al. 2012) transistors connected in such a way that their
sizes the width of the access transistor (ATx) so gate, source, and drain terminals are shared.
as to optimize the total failure probability as When multi-finger transistors are used in the
shown in Fig. 7.10. In this example, the STT-MRAM bit-cell design, the effective access
STT-MRAM bit-cell is designed using a com- transistor width may be varied using two word
mercial 45 nm CMOS technology with MTJs that lines instead of one as illustrated in Fig. 7.11,
have 40  100 nm elliptical cross-section area. which results in the 2T-1M STT-MRAM bit-cell
The read operation may be designed to select the topology (Li et al. 2010). During write operations,
dominant read failure mechanism. For example, both word lines are turned ON and OFF in unison
disturb failure is the dominant read failure mech- to supply maximum current through the MTJ.
anism when the read operation uses a constant During read operation, Word Line 2 keeps M2
voltage sensing scheme that has a large read OFF whereas Word Line 1 switches M1 ON and
voltage. In the example shown in Fig. 7.10, the OFF. The width of M1 may then be optimized for
decision failure probability, PDECISION, is decision failures (908 nm in our example), while
minimized when ATx width is 908 nm. How- the width of M2 is made as large as required to
ever, when the dominant mechanism for read fulfill write failure, array area, and array capacity
failure is decision failure, the ATx width needs requirements. Hence, the PWRITE of the 2T-1MTJ
to be increased to reduce the write failure proba- bit-cell may be optimized without worsening
bility (PWRITE) and optimize the total failure PDECISION.

Fig. 7.10 The read-decision, read-disturb, and write fail- failure probability is significantly larger than that for read
ure rates of an example 1T-1M STT-MRAM bit-cell operations. Based on the optimization methodology pro-
design are shown on the left. The optimum width for the posed in (Fong et al. 2012), the optimum ATx width is
access transistor (ATx) depends on whether the dominant 1829 nm, as shown on the right, when decision failure is
failure mechanism for read operations is decision failure the dominant read failure mechanism
or disturb failure. When the ATx width is small, write
226 X. Fong and K. Roy

Fig. 7.11 The schematic


of the (a) 1T-1M and (b)
2T-1M STT-MRAM
bit-cells. If the 1T-1M
STT-MRAM bit-cell
requires an ATx with large
width, the ATx may be
formed by connecting two
ATx’s with smaller width
in parallel. The 2T-1M
STT-MRAM topology is
implemented by
connecting the gates of the
ATx’s to different word
lines, enabling independent
control to each ATx

7.3.3.2 Nonvolatile SRAM Using MTJs fast read access delays. Another advantage of
The write current requirement of the NV-SRAM is that only the bit-cells that are
STT-MRAM bit-cells presented thus far may be being accessed need to be powered ON—the
relaxed by using a longer write current pulse rest may be powered OFF to save on standby
width. Consequently, the write access delays of power, which is highly desirable for IoT
these bit-cells may be very long (possibly  1 applications.
μs), and are unsuitable for IoT applications that Although NV-SRAMs are able to achieve
require faster write access delays. Several alter- short access delays, their restore operations
native STT-MRAM bit-cell structures (see depend on the ability of the MTJs to skew the
Fig. 7.12) have been proposed in the literature cross-coupled inverters. Hence, the proposed
to target IoT applications that require faster NV-SRAM bit-cells may be very sensitive to
access delays (Abe et al. 2004; Ohsawa et al. mismatches in the characteristics of the MTJ
2012; Yamamoto and Sugahara 2009). The pro- and the CMOS transistors. Furthermore, the
posed bit-cell structures are similar to the con- sizes of the transistors in the bit-cells shown in
ventional 6T SRAM bit-cell structure shown in Fig. 7.12 may need to be enlarged to meet the
Fig. 7.12. MTJ0 and MTJ1, which store comple- write current requirements. The proposed
mentary data (i.e., one stores ‘0’ and the other NV-SRAM bit-cells illustrated in Fig. 7.12 also
stores ‘1’), are used to skew the cross-coupled need extra transistors as compared to the 1T-1M
inverters in the SRAM cell to implement a STT-MRAM bit-cell. Moreover, both MTJs in
non-volatile SRAM (NV-SRAM) bit-cell. When the NV-SRAM bit-cells need to be programmed
the NV-SRAM cell is powered off, the MTJs are which results in high write energy consumption.
first programmed with the state of the cell. When
the NV-SRAM is powered back on later, the
MTJs skew the cross-coupled inverters so that 7.4 STT-MRAM Array Architecture
the state the NV-SRAM returns to the state and Layout
before it was powered off. The advantage of
NV-SRAM is that since the SRAM bit-cell topol- In the preceding sections, the design of
ogy is preserved, only the array peripheral cir- STT-MRAM bit-cells and the MTJ have been
cuitry for write operations need to be modified to discussed. We have seen that at the bit-cell
meet write requirements. Furthermore, the differ- level, it is extremely challenging to design
ential nature of the bit-cell structure enables fast STT-MRAM to meet the performance achiev-
self-referenced differential read operations for able by 6T SRAMs. The write and read access
7 On-Chip Non-volatile STT-MRAM for Zero-Standby Power 227

WL WL
MTJ0 MTJ1 PL MTJ0

MTJ1
BL BLB
BL BLB

(a) (b)

WL

WL

BL BLB

BL BLB
CTRL
MTJ0

MTJ1
SLL SLR
(c) (d)

Fig. 7.12 The structures of NV-SRAMs proposed (a) in (Abe et al. 2004), (b) in (Ohsawa et al. 2012), and (c) in
(Yamamoto and Sugahara 2009) are shown. The structure of (d) a volatile 6T SRAM is also shown for comparison

delays may be limited by large IC needed and the word also selects the columns to be accessed in
single-ended nature of the sensing scheme, the array.
respectively. Although the STT-MRAM bit-cell The schematic and corresponding layout of an
does not outperform 6T SRAM, STT-MRAM array of 1T-1M STT-MRAM bit-cells are shown
arrays have several advantages over their 6T in Fig. 7.14a, b, respectively. The bit-cells are
SRAM counterparts (Park et al. 2012). arranged into rows and columns where a word
The logical organization of memories stores line turns on the ATx’s of bit-cells connected to
data words sequentially as shown in Fig. 7.13a. the same row. The bit-cells connected along the
This may result in a memory footprint that is not column stores the data bit of the data word stored
feasible for physical implementation. Alterna- in the corresponding row. Due to the nonvolatile
tively, several data words are stored in every nature of STT-MRAM, only the accessed
row as shown in Fig. 7.13b. Bit-interleaving bit-cells need to be powered ON. Power is not
(Park et al. 2011) is used to reduce the wiring in supplied to the rest of the bit-cells, which saves
the peripheral circuitry that selects the array leakage power consumption in the array. Note
columns during data access. Note that that the STT-MRAM array based on NV-SRAM
bit-interleaving is commonly applied to 6T bit-cells may also be organized in a similar man-
SRAM arrays to mitigate soft errors due to parti- ner so that they may be powered OFF when idle.
cle strikes. When a data word is being accessed, Now, let us compare the access operation of a
the word line corresponding to the address of the 6T SRAM array to that of the STT-MRAM array.
data word is turned ON. The address of the data Figure 7.15a illustrates the access operation to a
228 X. Fong and K. Roy

Fig. 7.13 An example showing the (a) logical and (b) physically as a group) is used to reduce the wiring to the
physical organization of a memory consisting of N 8-bit column selection multiplexers shown in (b). These
wide data words (Bit0 to Bit7). Each row stores 4 data multiplexers are two-way multiplexers that allow Bit0 to
words. Bit-interleaving (i.e., the bit-cells storing the same Bit7 to behave as input and output
bit position of each data word are located together

Fig. 7.14 This figure illustrates the (a) schematic and (b) Note that every SL is shared between two columns to
array layout of an example array of 1T-1M STT-MRAM reduce the total array area. In this layout, the pitch between
bit-cells. An individual 1T-1M STT-MRAM bit-cell is MTJs limits the lowest achievable layout area
shown on the left. Only 6 rows  4 columns are shown.
7 On-Chip Non-volatile STT-MRAM for Zero-Standby Power 229

MC HS HS HS MC HS HS HS MC HS HS HS

b
MC CELLS TO READ OR WRITE HS HALF SELECTED CELLS (PSEUDO READ)

MC MC MC

Fig. 7.15 (a) 6T SRAM arrays have the half select issue condition to prevent disturb failures. (b) Non-volatile
because power must be continuously supplied to all STT-MRAM cells do not have the half select issue
bit-cells to retain data. Thus, unaccessed columns in the because power is only supplied to bit-cells that are being
selected row need to be biased in the pseudo-read accessed

row in a 6T SRAM array. When a row is STT-MRAM based cache may have as much as
accessed, only the columns corresponding to the 3 higher capacity than its 6T SRAM counter-
word being selected are accessed. The bit lines part at the same bit-cell array footprint as shown
connected to the other columns must be in Fig. 7.16a. As the cache capacity increases, the
precharged to put bit-cells connected to them in read and write latencies of the 6T SRAM based
the pseudo-read condition. Doing so ensures that cache increases faster than the STT-MRAM
the data stored in those bit-cells are not acciden- based cache as shown in Fig. 7.16b, c. This is
tally overwritten (i.e., the half-select issue). This because the access latencies are dominated by the
is required because the SRAM bit-cells need to delays needed to charge and discharge parasitic
be continuously powered to retain data. Since capacitances in the cache. The 1T-1M
STT-MRAM is nonvolatile, power does not STT-MRAM has a smaller bit-cell footprint
need to be supplied to the bit-cells that are not than the 6T SRAM bit-cell. Hence, the parasitic
accessed as Fig. 7.15b shows. Hence, the half- line capacitances increase much more quickly in
select issue present in the 6T SRAM array is 6T SRAM cache than the STT-MRAM cache
absent in the STT-MRAM array. Furthermore, when its capacity is increased. Consequently,
only the peripheral circuitry consumes standby the read and write energies of 6T SRAM cache
power, leading to significant energy savings that increases faster than their 1T-1M STT-MRAM
is desirable for IoT application. counterpart as plotted in Fig. 7.16d, e. The
Further comparisons show that although the increase in read and write energies of 6T
6T SRAM bit-cell outperforms the 1T-1M SRAM cache with increasing cache capacity is
STT-MRAM bit-cell, the 1T-1M STT-MRAM exacerbated by the half-select issue discussed
based cache may outperform the 6T SRAM earlier. The biggest advantage of the 1T-1M
based cache (Park et al. 2012). This is illustrated STT-MRAM based cache over its 6T SRAM
by the cache comparisons reproduced from (Park counterpart is the lower leakage power consump-
et al. 2012) in Fig. 7.16. First, the 1T-1M tion graphed in Fig. 7.16f.
230

Fig. 7.16 (a) Array area of SRAM and STT-MRAM based data caches (4-way, 64B read energy per operation and (e) write energy per operation, and (f) total leakage
cache line, B ¼ Byte, M ¼ Mega Byte), (b) read latency and (c) write latency, (d) power
X. Fong and K. Roy
7 On-Chip Non-volatile STT-MRAM for Zero-Standby Power 231

Fig. 7.17 Total energy


consumption versus cache
utilization for SRAM and
for STT-MRAM based
caches for various levels of
cache utilization. Note that
due to the difference in
cache capacity, the cache
utilization of the
STT-MRAM based caches
(SC and RC) is only a
quarter of that in the 6T
SRAM based cache is the
data stored is at most
0.5 MB

Real world applications affect how the cache that need to be charged and discharged during
is accessed (number of read and write operations, memory accesses (Zhao et al. 2012). Techniques
number of cycles during which the cache is idle, have also been proposed in the literature to
etc.) as well as the cache utilization (amount of improve the energy efficiency of write operations
data used by the program that is stored in the in STT-MRAM, which may also improve the
cache as a fraction of the total cache capacity). reliability of the tunneling oxide barrier in the
The comparison between 1T-1M STT-MRAM MTJ as explained in Sect. 7.3.1. Examples of
based cache and 6T SRAM based cache in such techniques include the stretched write
(Park et al. 2012) analyzes the performance of cycle (Augustine et al. 2012), the balanced
the caches using SPEC2000 benchmarks that write scheme (Lee et al. 2012), and write optimi-
emulate real world applications. The total energy zation techniques discussed in (Kim et al. 2012).
consumption versus cache utilization for caches The STT-MRAM array architecture may also be
based on 6T SRAM, standard connection modified to implement early write termination
(SC) and reversed connection (RC) 1T-1M (Zhou et al. 2009), partial line update (Park
STT-MRAM bit-cells is plotted in Fig. 7.17. et al. 2012), and write biasing techniques (Ahn
Consider when the data that needs to be stored et al. 2013; Jung et al. 2013; Mao et al. 2014;
in cache is 0.1 MB. This corresponds to 20% Rasquinha et al. 2010; Wang et al. 2013b). The
cache utilization in the 6T SRAM based cache objective of these array architecture
whereas the cache utilization is only 5% in the modifications is to ensure that write operations
STT-MRAM based caches. The results show that into the STT-MRAM array are performed only
the total energy consumption of the STT-MRAM when they are required, and in an energy efficient
based caches is 25%–35% lower than that of the manner. Some proposed array architecture
6T SRAM based cache. design techniques even exploit the asymmetry
At this juncture, we would like to highlight in write characteristics of STT-MRAM bit-cells
several design techniques proposed in the litera- to improve the write energy efficiency of the
ture to further reduce the energy consumption of STT-MRAM array (Kwon et al. 2013; Sun et al.
STT-MRAM arrays. The common source line 2012).
technique was proposed to reduce the bit-cell Although significant improvements to
footprint of the 1T-1M STT-MRAM bit-cell, STT-MRAM may be achieved using the design
which also reduces the parasitic line capacitances techniques presented earlier, they may not be the
232 X. Fong and K. Roy

most optimum. In the next section, we present a STT-MRAM may be designed to fulfill the
device/circuit/array architecture co-design tech- design requirements of memories for IoT
nique that optimizes the design of the applications. The design strategy proposed in
STT-MRAM array for energy efficiency in the (Pajouhi et al. 2015) exploits the fact that the
presence of process variations. This technique retention time requirement in STT-MRAM is
leverages insights from earlier work that reten- tunable (either by changing the shape anisotropy
tion time requirements may be relaxed in certain or interfacial anisotropy of MTJs with in-plane
applications (Jog et al. 2012; Li et al. 2013a, b; and perpendicular magnetic anisotropy, respec-
Sun et al. 2011, 2014; Smullen et al. 2011) and tively). In the proposed approach, STT-MRAM
that error-correcting codes (ECC) may be used to is designed with a retention time target that is
mitigate retention time errors as well as write significantly relaxed from the conventional target
failures and disturb failures that are caused by of 10 years so as to save write energy. Note that it
thermal effects (Xu et al. 2009). This is particu- is difficult to achieve 10 year retention time
larly useful for caches used in some IoT target for large STT-MRAM arrays (Kwon
applications. et al. 2015; Naeimi et al. 2013). Moreover, the
disturb failure probability of STT-MRAM is
increased when the retention time target is
7.5 ECC and Device/Circuit/ reduced. The design methodology proposed in
Architecture Co-design (Pajouhi et al. 2015) mitigates the increased dis-
for Low-Power and Higher turb failure rate using error correction codes
Reliability (ECC). ECC may also be used to mitigate
failures caused by process variations (Kwon
It is important to note that many IoT applications et al. 2015; Xu et al. 2009; Yang et al. 2015).
do not need 10 years of data retention time, To exemplify the aforementioned STT-MRAM
which is the requirement of most other memory design approach, let us consider an example of
applications. For example, the IoT system may STT-MRAM cache design with 10 years retention
periodically wake up from sleep mode, sample time requirement as done in (Pajouhi et al. 2015).
new data, process the new data with that gathered The graph of 10 year retention failure probability
when it was previously awake, store results and versus the thermal stability factor for different
needed data in memory, and discard old data memory word lengths shown in Fig. 7.18a imply
before returning to sleep mode. In such cases, the tradeoff between retention time and IC0,
the memory only needs to retain data long derived as:
enough so that it is still available when the sys- eα  
tem next wakes up. To save on leakage energy, a I C0 ¼ 4ΔkB T þ H Demag μ0 Msat V ð7:13Þ
ℏη
nonvolatile memory technology is desirable for
implementing caches for these applications.
Here, α, Msat, and V are the Gilbert damping
Although dynamic RAM (DRAM) may be
factor, saturation magnetization, and volume,
used, it may need to be frequently refreshed
respectively, of the free layer of the MTJ. η is
because its retention time is tens to hundreds of
the spin polarization efficiency describing the
milliseconds, whereas the retention time for IoT
spin-transfer torque generated per unit current.
applications maybe several days to several
ℏ, e and μ0 are the reduced Planck’s constant,
weeks. Hence, DRAM based caches may lead
elementary charge and permeability of free
to high energy consumption in systems for IoT
space, respectively. HDemag is the demagnetizing
applications. Another requirement for the mem-
field that spin-transfer torque needs to
ory system is that the any overhead in energy or
overcome in the free layer of the MTJ. In the
latency due to transitioning between sleep and
MTJ with perpendicular magnetic anisotropy
active modes must be sufficiently low compared
(PMA), HDemagμ0MsatV  EB and hence the
to the leakage energy savings.
7 On-Chip Non-volatile STT-MRAM for Zero-Standby Power 233

100 100

Write Failure Probability


10−2
10 Years Retention
Failure Probability 10−2

10−4
10−4
−6 length = 16 45kBT
10
length = 32 50kBT
−8 10−6
10 length = 64 60kBT
length = 128
10−10 10−8
40 45 50 55 60 0.01 0.02 0.03
Δ Bit−cell Area (μm2)
(a) (b)

Fig. 7.18 (a) The retention failure probability versus thermal stability factor for different word lengths. (b) The write
failure probability versus bit-cell layout area for different thermal stability factors

STT-MRAM array. Figure 7.19 graphs the read


100 failure probability versus bit-cell layout area for
VREAD = 0.1 V
STT-MRAM bit-cells designed with MTJs hav-
VREAD = 0.3 V ing Δ ¼ 45. The constant voltage scheme is
used, and the read failure probabilities are
graphed for bit-cell read voltage of VREAD ¼ 0.1
V and VREAD ¼ 0.3 V. The word line voltage is
PFAIL

10−2 VDD ¼ 1.0 V. In the bit-cell layouts used, the


bit-cell area is dominated by the access transis-
tor. Thus, the resistance of the access transistor is
large compared to that of the MTJ when the
bit-cell area is small. This results in significant
read failures because the bit-cell resistance when
10−4
0 0.01 0.02 0.03 0.04 the MTJ is in P configuration is not substantially
different than when the MTJ is in AP configura-
Bit−cell Layout Area (μm2) tion (also called decision failures). When the
Fig. 7.19 The probability of read failure for constant bit-cell area is large, the resistance of the MTJ
voltage sensing scheme with VREAD ¼ 0.1 and 0.3 V is dominant compared to that of the access tran-
sistor. If VREAD is large, the state of the MTJ may
be accidentally switched during read operations
demagnetizing field term in Eq. (7.13) may be (also called disturb failures). In Fig. 7.19, VREAD
neglected. Due to reduced IC0 by reducing Δ, the ¼ 0.3 V results in significant number of disturb
write failure probability of an STT-MRAM failures when the bit-cell area is larger than
bit-cell decreases under the same bit-cell bias ~0.025 μm2. This is not the case when VREAD
conditions as graphed in Fig. 7.18b. Note that ¼ 0.1 V, and the read failure probability
the following analysis assumes that the MTJ approaches the minimum due to decision failure.
resistance is independent of Δ, which may not To simplify the following analysis, we will con-
be generally true. sider the case when VREAD ¼ 0.1 V. Other
The biasing conditions of the bit-cell also parameters of the memory array that we analyze
determine the read failure probability of the are listed in Table 7.1.
234 X. Fong and K. Roy

When the STT-MRAM is designed without probability of the bit-cells using them is limited
ECC and MTJs of different Δ may be chosen, by retention failure. If the bit-cell area can be
the optimum Δ depends on the bit-cell layout larger than ~0.025 μm2, retention failure is the
area (Pajouhi et al. 2015). Figure 7.20a graphs dominant failure mechanism and MTJ with
the total failure probability of the STT-MRAM higher Δ is preferred. This result implies that at
bit-cell versus its layout area for different Δ. fixed STT-MRAM array area budget, the desired
When the bit-cell area is small, write failure is cache capacity and the total array failure proba-
dominant because the access transistor is unable bility needs to be jointly considered to select the
to supply enough write current to the MTJ. The Δ for the MTJ.
total failure probabilities approach the minimum The graph of total failure probability versus
determined by retention failure as the layout area bit-cell layout area of STT-MRAM arrays
of the bit-cell is increased. For bit-cell area rang- implemented with different ECC schemes and
ing from ~0.0175 to ~0.025 μm2, bit-cell having Δ of the MTJ is shown in Fig. 7.20b. A trend
MTJs with Δ ¼ 60 has higher write failure prob- similar to Fig. 7.20a may be observed. The ECC
ability than those having MTJs with Δ ¼ 50. schemes encode data words as code words before
Hence, write failure is still dominant and a writing into the array so as to correct for bit
small Δ is preferred. However, MTJ with errors due to read, write or retention failures.
Δ ¼ 45 is not suitable because the total failure The length of these code words are longer than
that of data words as shown in Table 7.1. Due to
Table 7.1 Parameters of the memory array used in our stronger error correction capability of the Bose-
example Chaudhuri-Hocquenghem (BCH) code utilized,
CMOS Technology 32 nm the minimum achievable total failure probability
MTJ cross-section Equivalent to 64  64 nm for STT-MRAM implemented with BCH code
Total Capacity 64 Mega-byte may be lower than that implemented with Ham-
Length of stored data 64 bits ming code. Note that the minimum achievable
word total failure probability in the STT-MRAM
Hamming Code 71 bit code words, 1 bit error
arrays implemented with ECC schemes is much
correction
BCH Code 78 bit code words, 2 bit error
lower than the STT-MRAM implemented with-
correction out ECC. As the bit-cell layout area is increased,
the point at which the total failure probability of

100 100
Total Failure Probability
Total Failure Probability

10−1 10−3

10−2 10−6 Δ=50, Hamming code,


1 correct
Δ=50, BCH code,
Δ=45 −9
2 correct
−3
10 10 Δ=60, Hamming code,
Δ=50 1 correct
Δ=60, BCH code, 2
Δ=60 correct
−4 −12
10 10
0.01 0.02 0.03 0.01 0.02 0.03 0.04
Effective Bit−cell Area (μm2) Effective Bit−cell Area (μm2)
(a) (b)
Fig. 7.20 Total failure probability of STT-MRAM arrays implemented (a) without ECC, and (b) with ECC based on
Hamming code and BCH code with single- and double-error correction capability, respectively
7 On-Chip Non-volatile STT-MRAM for Zero-Standby Power 235

Table 7.2 Possible design choices based on device/circuit/array co-design methodology presented
1 Minimize failure probability, no bit-cell Use MTJ with Δ ¼ 60, use BCH code ECC
area constraint
2 Bit-cell layout area < 0.0225 μm2, minimize failure Use MTJ with Δ ¼ 50, use Hamming code ECC
probability
3 0.025 μm2 < bit-cell layout area < 0.0285 μm2, Use MTJ with Δ ¼ 50, use BCH code ECC
minimize failure probability

the arrays start decreasing depends on the Δ of budget, total failure probability, and desired
the MTJ used, as well as the ability to correct for capacity, the optimum STT-MRAM array design
bit errors. For the arrays implemented without can be selected. Table 7.2 lists some possible
ECC, the one using MTJs with Δ ¼ 45 are more considerations and the optimum STT-MRAM
severely affected by retention failures than the array design corresponding to them. A similar
array using MTJs with Δ ¼ 50. Consequently, it analysis based on the flow we just described
is not as apparent that the total failure probability may be applied to design STT-MRAM arrays
of the array using MTJs with Δ ¼ 45 decreases that fulfill the requirements of IoT systems.
earlier than the array using MTJs with Δ ¼ 50.
Figure 7.20b also shows there are differences
between the STT-MRAM arrays implemented 7.6 Other Uses of MTJ for Sensing,
with Hamming code and those implemented Random Number Generation
with BCH code. One difference is the point at
which the total failure probability starts decreas- We have discussed the design and optimization
ing. As mentioned earlier, write failure is domi- of STT-MRAM for IoT applications and
nant when the bit-cell layout area is small. Since presented several unique characteristics of
fewer bits needs to be written into the array STT-MRAM earlier. Several works in the litera-
implemented with Hamming code for ECC as ture exploit these unique characteristics to
compared to the array implemented with BCH embed additional functionality within the
code for ECC, the probability of write failure is STT-MRAM array while incurring minimal
also smaller. Thus, the total failure probability of area overhead and near zero performance pen-
the array implemented with Hamming code as alty (Ahn et al. 2013; Fukushima et al. 2014,
the ECC scheme decreases earlier than in the 2015; Fong et al. 2015; Lee et al. 2013; Zhang
array implemented with BCH code as the ECC et al. 2014a, b, 2015). These additional function-
scheme at the same Δ. ality are particularly interesting for ultralow
Another difference is the point at which reten- power IoT systems. In the following sections,
tion failure starts becoming dominant compared we will briefly discuss three design techniques
to write failures. As Fig. 7.20b shows, in the that embed new functionality in STT-MRAM.
STT-MRAM arrays using MTJs with Δ ¼ 60, We discuss how STT-MRAM may be used as
retention failure becomes dominant when the truly random number generator (TRNG) and as
bit-cell layout area is larger than 0.03 μm2. physically unclonable function (PUF). TRNGs
Retention failure becomes dominant in the and PUFs may be used as on-chip security hard-
STT-MRAM arrays using MTJs with Δ ¼ 50 ware, which is needed in IoT applications.
when the bit-cell layout area is larger than Finally, we present a methodology for embed-
~0.025 μm2. Note that for the STT-MRAM ding read-only memory (ROM) in an
implemented without ECC, retention failure STT-MRAM array. The embedded ROM stores
becomes dominant when the bit-cell area is data which may be different from that stored in
larger than ~0.0225 and ~0.0275 μm2 for the RAM, and may be used to accelerate functions
MTJs with Δ ¼ 50 and Δ ¼ 60, respectively. that use ROM, such as many digital signal
Hence, by considering the array layout area processing (DSP) functions.
236 X. Fong and K. Roy

7.6.1 STT-MRAM as True Random Thereafter, the complementary data is written to


Number Generators every bit-cell in the row (all ‘1’ if all ‘0’ were
written previously and vice versa). When com-
The STT-MRAM write process is stochastic as plementary data is being written, the current sup-
we have discussed in Sect. 7.2.1 and may be plied to each bit-cell is such that the state of the
exploited for random number generation. Since bit-cell has 50% probability of switching. After
thermal disturbance is used as the entropy source, the write operation is completed, the data stored
the STT-MRAM bit-cell may be used as a truly in the bit-cells is read out and supplied to the
random number generator (TRNG) (Ahn et al. circuitry requesting the random number. Finally,
2013; Fukushima et al. 2014). On-chip security the data stored in the buffer is written back into
hardware, which is crucial for IoT applications, the bit-cells to restore them to the previous state.
may use TRNGs for generating high quality Note that three write operations are required
encryption keys. Hence, STT-MRAM that can in the TRNG scheme just described. The energy
also function as high quality TRNG with little consumption of this TRNG scheme may be very
overhead may be a very cost effective solution to high due to the high write energy of STT-MRAM
enable on-chip security hardware suitable for IoT bit-cells, which was discussed in Sect. 7.4. Fig-
applications. ure 7.21b shows that the overall write energy
The operation concept of the STT-MRAM may be reduced by reducing the number of
based TRNG illustrated in Fig. 7.21a is as write operations (Ahn et al. 2013). Once data
follows. When a random number is requested, has been read out of the bit-cells and stored in a
data stored in a row of STT-MRAM bit-cells is buffer, we may immediately generate the random
read out and stored in a buffer. Next, the random number by overwriting the bit-cells storing ‘0’
number is generated by performing a special with ‘1’, and those storing ‘1’ with ‘0’. The
write operation to the same row of bit-cells. current supplied to each bit-cell during this
During this step, a predetermined data (either write operation is such that the probability of
all ‘0’ or all ‘1’) is first written to every bit-cell write failure in each bit-cell is 50%. After the
in the row with 100% probability of success. random number is generated, we compare it to

Fig. 7.21 These steps show how STT-MRAM may be number generation, which incurs large energy dissipation.
used as a truly random number generator by exploiting the (b) The initialization step in (a) is redundant and may be
stochastic nature of its write operation. (a) The MTJ may eliminated to reduce energy consumption
be initialized to some predetermined state prior to random
7 On-Chip Non-volatile STT-MRAM for Zero-Standby Power 237

the data stored in the buffer to identify the exploited to implement a memory-based physi-
bit-cells whose data that has been modified by cally unclonable function (PUF) (Zhang et al.
the random number generation process. The data 2014a, b, 2015). PUFs may be used as on-chip
in these bit-cells are then restored from the security hardware that generates chip-unique
buffer. keys for IoT applications. The input and output
Although the STT-MRAM based TRNG may of the PUF are called the challenge and response,
generate very high quality random numbers (Ahn respectively. Since it is desired that the PUF
et al. 2013), it may be very challenging to choose generate chip-unique keys, the set of challenge-
the write current needed to implement a process response pairs (CRPs) of a PUF must be unique
variation tolerant random number generation (the uniqueness criterion). During operation, the
process. Due to process variations, the amount set of CRPs for each PUF is fixed (the reliability
of write current supplied to each bit-cell during criterion). To ensure that the PUF is indeed
the random number generation process may unclonable, the response of a PUF to any chal-
result in switching probabilities that are not lenge exploits variations in the fabrication pro-
exactly 50%. This issue may be exacerbated if cess so that it is random and not known at design
the TRNG scheme with fewer write operations is time (the randomness criterion).
used. Depending on the degradation of the A memory-based PUF (MemPUF) may be
switching probability in the random number gen- implemented using 1T-1M STT-MRAM
eration process, strong postprocessing techniques bit-cells by exploiting the single-ended nature
may be required to ensure the quality of random of the sensing scheme (Zhang et al. 2014a).
numbers generated by STT-MRAM based Every bit-cell in the STT-MRAM array may be
TRNGs is sufficiently high (Dichtl 2008; grouped into pairs where one is the data cell and
Lacharme 2008; Von Neumann 1951). the other is the reference cell as shown in
Fig. 7.22a. The challenge is given as the address
to the STT-MRAM array and the response will
7.6.2 Physically Unclonable be generated. In the flow chart shown in
Functions Using STT-MRAM Fig. 7.22b, the data stored in the STT-MRAM
is first copied to a buffer when the array operates
Another interesting aspect of STT-MRAM is that as a PUF. When a challenge is given and a
the process variations in the bit-cells may be response is required, the data and reference

Fig. 7.22 (a) The schematic showing how an The flow chart shows how the PUF functionality may be
STT-MRAM array consisting of a data array and a refer- used without overwriting the data stored in the
ence array, which are identical 1T-1M bit-cell arrays, may STT-MRAM array
be used as a physically unclonable function (PUF). (b)
238 X. Fong and K. Roy

cells selected by the challenge are written with may be exploited to expand the set of CRPs to
‘0’. The response is generated by comparing the improve its resilience against attacks.
resistance of the data cell with its companion Although promising as on-chip security hard-
reference cell. Since the data cell and reference ware for IoT applications, there are several
cell are storing the same values, the result of the disadvantages of the MemPUF scheme just
resistance comparison depends on the process described. First, the reliability of the MemPUF
variations in the bit-cells. The corresponding bit may be degraded by disturb failures. The currents
in the response is ‘0’ if the resistance of the data flowing through the bit-cells during response
cell is smaller and ‘1’ otherwise. When the generation must be sufficiently small that the
STT-MRAM resumes operation as RAM, the states of the data cell and the reference cell are
data stored in the buffer is restored into the array. not accidentally switched. Next, to utilize the
The abovementioned scheme may not be reli- MemPUF as a RAM as well, a small buffer is
able due to thermal perturbations. Since the data needed. Prior to enrolment, the data stored in the
and reference cells store the same value during selected data and reference cells are first copied
response generation, thermal perturbations in the into the buffer. The data in the buffer is restored
magnetic layers of the MTJs in the bit-cells, to the data and reference cells once the MemPUF
which are random by nature, may affect the result response is no longer needed, incurring addi-
of the resistance comparison. As a result, the tional write energy consumption.
response to the challenge may be unreliable
(i.e., response to the challenge is not determin-
istic). The proposed STT-MRAM based 7.6.3 Embedding Read-Only Memory
MemPUF in (Zhang et al. 2014a) overcomes in STT-MRAM
this design issue by using two operational
phases: the enrolment and the regeneration As illustrated earlier in Fig. 7.14, every column
phase. During the enrolment phase, every in the STT-MRAM array consists of a bit line to
bit-cell in the array is written with ‘0’. Then, which bit-cells in the column are connected
the resistance of every data cell is compared to. The physical connection of the STT-MRAM
with its companion reference cell. The cell with bit-cells in the array may be exploited such that
the larger resistance is overwritten with a ‘1’ and every bit-cell stores a RAM bit in their constitu-
the enrolment phase ends. During the regenera- ent MTJ as well as a read-only memory (ROM)
tion phase, the resistance of the data cell selected data (Fong et al. 2015; Lee et al. 2013). The
by the challenge is compared to that of its com- embedding of ROM in a column of the array is
panion reference cell. The result of the compari- enabled by an additional bit line as illustrated by
son gives the corresponding bit of the response to Fig. 7.23. The layout of an example STT-MRAM
the challenge. Hence, the operation of the pro- bit-cell array without and with embedded ROM
posed STT-MRAM based MemPUF must begin is shown in Fig. 7.24a, b, respectively. Note that
with an enrolment phase. The reliability of the when the size of the access transistor is large, the
proposed STT-MRAM based MemPUF is additional bit line may be placed without chang-
enhanced during the enrolment phase by ensur- ing the footprint of the bit-cell array. Each
ing that the data cell and reference cell stores bit-cell in the column is connected to either
complementary data. Furthermore, the CRPs BL0 or BL1. Bit-cells connected to BL0 store a
generated may depend on whether ‘1’ or ‘0’ ROM value of ‘0’ whereas those connected to
was written during the first step of the enrolment BL1 store a ROM value of ‘1’.
phase. If the TMR variation is sufficiently large In the ROM-embedded STT-MRAM array
and random, CRPs generated by writing ‘1’ dur- shown in Fig. 7.23, a pair of pass transistors
ing the first step in the enrolment phase may be connects BL0 and BL1 to the RAM mode periph-
uncorrelated with those generated by writing ‘0’ eral circuitry. During RAM mode operations, the
during the first step in the enrolment phase. This pair of pass transistors are turned ON. Hence,
ROM Sense Amplifier RAM Sense Amplifier
Data RAMOut
ROMOut
DataB EnRAM
+ -
V+ V- Reference
Generator
BL1
BL0
RMTJ RMTJ RMTJ RMTJ RMTJ RMTJ RMTJ

Read Bias Write WriteEn


Generator Driver DataIn
SL
ReadEn
VDD
RCLK
M1
Data M2 M6 DataB
DataB M10
M11
Data
V+ V- VPre M3 M7 VPre
M4
M8
REN
V- V+

Fig. 7.23 This schematic shows a column of an or BL1 is used to store ROM data in each bit-cell. RAM
STT-MRAM array with embedded read-only memory data is stored using the MTJ in each bit-cell and hence can
(ROM). The physical connection of each bit-cell to BL0 be different from the ROM data

Fig. 7.24 The layout of an STT-MRAM bit-cell array (a) without embedded ROM and (b) with embedded ROM.
ROM data is stored as the via connection from the MTJ to BL0 or BL1
240 X. Fong and K. Roy

Data RAMOut
ROMOut
DataB EnRAM
Reference
Generator

Read Bias Write WriteEn


Generator Driver DataIn

ReadEn

Data ROMOut RAMOut


DataB EnRAM
Reference
Generator

Read Bias Write WriteEn


Generator Driver DataIn

ReadEn

Fig. 7.25 Current path through the ROM-embedded STT-MRAM array during ROM operation when the bit-cell
selected stores (a) ROM data ‘0’ and (b) ROM data ‘1’

BL0 and BL1 are electrically shorted during charged to VDD by the cross-coupled inverter
RAM mode operations. Consequently, the action in the latch. Note that because RAM
peripheral circuitry is able to access any selected mode operations require BL0 and BL1 to be
bit-cell regardless of whether it is connected to electrically connected whereas ROM mode
BL0 or BL1. During ROM mode operations, the operations require BL0 and BL1 to be electri-
pair of pass transistors is turned OFF. The SL of cally disconnected, RAM mode and ROM mode
the column is then grounded. The access transis- operations to the same column of bit-cells cannot
tor of the selected bit-cell is turned ON next, occur simultaneously.
followed by the ROM mode sensing latch. The ROM embedded in the STT-MRAM may
Figure 7.25a, b show the current path in the be used to store instructions and data closer to the
ROM-embedded STT-MRAM array during processor core, and applications utilizing the
ROM operation when a bit-cell storing ROM ROM may be accelerated even though RAM
data ‘0’ and ROM data ‘1’ is selected, respec- mode and ROM mode operations cannot occur
tively. Note that of the two bit lines, the one simultaneously. Note that without the embedded
connected to the selected bit-cell has a much ROM, instructions and data are stored off-chip
smaller resistance to SL compared to the other and need to be fetched on-chip before the proces-
bit line. Hence, during ROM mode sensing oper- sor core may operate on them (Fong et al. 2015;
ation, the output of the sensing latch that senses Lee et al. 2013). The cost of accessing off-chip
the bit line connected to the selected bit-cell is memory may greatly outweigh the cost due to not
easily discharged to GND whereas the other is being able to perform RAM mode and ROM
7 On-Chip Non-volatile STT-MRAM for Zero-Standby Power 241

mode operations simultaneously. This is the case Chien 2013) and the multi-terminal device based
in the benchmark programs analyzed in (Fong on spin Hall effect or spin-orbit torque (Kim
et al. 2015; Lee et al. 2013) and it was observed et al. 2013; Wang et al. 2013a) have garnered
that applications exploiting the ROM embedded much research interest recently.
in STT-MRAM may be accelerated by as much The two-terminal device exploiting voltage-
as 30%. controlled magnetic anisotropy has a stack
composition that is very similar to that in the
conventional MTJ structure we presented earlier
7.7 Perspectives and Trends (Shiota et al. 2012; Wang et al. 2012, 2013a;
Wang and Chien 2013). Hence, the device inte-
As we saw in this chapter, STT-MRAM may be gration issues are not significantly different from
suitable for a wide array of IoT applications. that for the conventional MTJ. It was found that
Comparisons of STT-MRAM with other eNVM the perpendicular magnetic anisotropy of the
technologies in Chap. 6 show that the low volt- CoFeB free layer in the MTJ may be modulated
age and high speed of STT-MRAM is suitable for by the voltage-induced electric-field at the
IoT applications. Thus, STT-MRAM may be CoFeB/MgO interface. When the applied voltage
useful for implementing memory systems for reduces the perpendicular magnetic anisotropy,
IoT applications with unstable power supplies. the hysteresis loop of the CoFeB free layer
The non-volatility and relatively fast access narrows. Due to the stray field from the pinned
speed of STT-MRAM may also enable IoT layer, the hysteresis loop is not centered about
systems that can quickly transition between zero applied magnetic field. Hence, the applied
sleep and active states, which significantly voltage may narrow the hysteresis loop to the
improves energy efficiency. Another attractive- point that the stray field from the pinned layer
ness of STT-MRAM is that its unique aligns the free layer magnetization parallel to
characteristics may also be exploited to embed that of the pinned layer. The magnetization of
functionality, which is useful for IoT the free layer may be aligned anti-parallel to that
applications, in the memory array with few over- of the pinned layer using a magnetic field (Wang
head as discussed in Sect. 7.6. However, such et al. 2012).
cost savings must be weighed against the Alternatively, current-induced spin-transfer
increase in STT-MRAM array design complexity torque, just like in the conventional MTJ, may
as well as the implications on array test method- be used to switch the state of the MTJ (Wang and
ology and failure rate. Chien 2013). When this is the case, the polarity
The immediate STT-MRAM design of the voltage applied across the MTJ must be
challenges that need to be addressed are the designed such that the current flowing through
high write energy consumption, and the severely the MTJ tries to anti-parallelize the MTJ config-
limited design space imposed by the uration. The magnitude of the applied voltage is
two-terminal nature of the MTJ (Augustine then used to determine whether the stable MTJ
et al. 2010, 2011; Mojumder and Roy 2012). state is anti-parallel (for small applied voltage) or
Several works investigated alternative spintronic parallel (for large applied voltage) (Wang and
device structures that are based on the MTJ con- Chien 2013). A major disadvantage of this
cept (Braganca et al. 2009; Fong and Roy 2013; scheme is that although the current required to
Fong et al. 2013, 2014; Huda and Sheikholeslami program the MTJ is reduced, there is a significant
2013; Kim et al. 2013; Mojumder et al. 2011a, b, increase in MTJ resistance. As a result, the volt-
2013; Shiota et al. 2012; Wang et al. 2012, age needed to program the MTJ may be signifi-
2013a; Wang and Chien 2013). Of these device cantly larger than that available for IoT
proposals, the two-terminal device based on applications. Furthermore, this may limit the
voltage-controlled magnetic anisotropy (Shiota write energy reduction. Hence, the major focus
et al. 2012; Wang et al. 2012, 2013a; Wang and of research on voltage-controlled magnetic
242 X. Fong and K. Roy

z LFM WFM
x AS
y
m tFM

FM
tHM Aq
JS Jq
HM
LHM
WHM
JS = θSH (σ × Jq)
(a) (b)
Fig. 7.26 (a) Current-induced torque is exerted on a electron may be realigned by the spin Hall effect in the
magnetic layer adjacent to a heavy metal as shown here. heavy metal and gets reinjected into the magnetic layer.
(b) One possible mechanism is the spin Hall effect. After Hence, an electron may transfer multiple units of spin
exchanging spin angular momentum with the adjacent angular momentum to the magnetic layer
magnetic layer, the spin polarization direction of an

anisotropy is on devising a method and/or device Hence, an electron may transfer multiple units of
structure that allows the MTJ state to be switched spin angular momentum to the free layer of the
without the need for current flow or magnetic MTJ as illustrated by Fig. 7.26b. Note that since
fields. the free layer of the MTJ is metallic, some charge
Unlike the data storage devices using voltage- current is shunted through it. Due to structural
controlled magnetic anisotropy, those utilizing asymmetry and strong spin-orbit interaction,
spin-orbit torque or spin Hall effect are still electrons at the interface between HM and the
current-driven by nature (Kim et al. 2013; free layer may also experience a Rashba field
Wang et al. 2013a). In these devices, the MTJ (Gambardella and Miron 2011). The spin Hall
grown on top of a heavy metal (or HM, such as Pt effect and Rashba field contributes to the spin-
(Miron et al. 2011), β-Ta (Liu et al. 2012), β-W orbit torque that is exerted on the magnetization
(Pai et al. 2012), CuBi (Niimi et al. 2012), or of the free layer of the MTJ.
CuIr (Niimi et al. 2011)) with its free layer in The spin Hall effect is promising for reducing
direct contact with HM. Consider the structure the write energy in STT-MRAM because the spin
shown in Fig. 7.26a. When charge current flows current generated, Is, can be controlled by engi-
in HM, a spin current (i.e., a current in which all neering the geometry of the structure in
constituent electrons are identically spin- Fig. 7.26a. Is is given as:
polarized) that flows perpendicular to the direc-
tion of charge current flow is generated via the As
I s ¼ θSH Iq σ ð7:14Þ
spin Hall effect. In this spin current, electrons Aq
carrying opposite spin polarizations flow in
opposite directions. Furthermore, the spin polar- where Aq and As are as shown in Fig. 7.26a, σ is
ization direction of the electron is perpendicular the electron spin polarization direction, and θSH
to both charge and spin current flow directions. is the spin Hall angle of the HM. It was also
When the HM is in contact with the free layer of experimentally demonstrated that spin-orbit
an MTJ, the spin current can get injected into the torque may be used to switch the magnetization
free layer of the MTJ from HM, which exerts of a free layer with PMA (Yu et al. 2014). Fur-
spin-transfer torque on the magnetization of the thermore, the current for programming the MTJ
free layer. After transferring spin angular is not passed through the tunneling oxide of the
momentum to the free layer of the MTJ, the MTJ and hence the reliability of the tunnel oxide
spin polarization of the electron may get barrier is improved. This is crucial in
realigned and injected into the free layer again. maintaining high TMR. The resistance of the
7 On-Chip Non-volatile STT-MRAM for Zero-Standby Power 243

HM layer may also be reduced to lower the T. Devolder, Scalability of magnetic random access
voltage required to drive the write current, memories based on an in-plane magnetized free
layer. Appl. Phys. Exp. 4(9), 093001 (2011)
which is desirable for reducing write energy. M. Dichtl, Bad and good ways of post-processing biased
Because the device structure that enables spin- physical random numbers, in Fast Software Encryp-
orbit torque is compatible with that having tion (Springer, Berlin, 2008), pp. 137–152
voltage-controlled magnetic anisotropy, a device R. Dorrance, F. Ren, Y. Toriyama, A. A. Hafez, C. K.
Yang, D. Markovic, Scalability and design-space anal-
structure that utilizes voltage-controlled mag- ysis of a 1T-1MTJ memory cell, in IEEE/ACM Int.
netic anisotropy for assisting spin-orbit torque Symp. Nanoscale Archit. (2011), pp. 32–36
switching may be the key to achieving X. Fong, K. Roy, Complementary polarizers STT-MRAM
STT-MRAM for ultralow power IoT (CPSTT) for on-chip caches. IEEE Electron Device
Lett. 34(2), 232–234 (2013)
applications. X. Fong, S.H. Choday, K. Roy, Bit-cell level optimization
for non-volatile memories using magnetic tunnel
junctions and spin-transfer torque switching. IEEE
Trans. Nanotechnol. 11(1), 172–181 (2012)
References X. Fong, K. Roy, Low-power robust complementary
polarizer STTMRAM (CPSTT) for on-chip caches,
K. Abe, S. Fujita, T. H. Lee, Architecture of three- in 5th IEEE Int. Mem. Work. (2013), pp. 88–91
dimensional circuit using nanoscale memory devices, X. Fong, R. Venkatesan, A. Raghunathan, K. Roy,
Eur. Micro Nano Syst., Noisy le Grand, France, 2004. Non-volatile complementary polarizer spin-transfer
TIMA, Grenoble, France (2004), pp. 225–229 torque on-chip caches: a device/circuit/systems per-
J. Ahn, S. Yoo, K. Choi, Write intensity prediction for spective. IEEE Trans. Magn. 50(10), 1–11 (2014)
energy-efficient non-volatile caches, Int. Symp. Low X. Fong, R. Venkatesan, D. Lee, A. Raghunathan, K. Roy,
Power Electron. Des. (2013), pp. 223–228 Embedding read-only memory in spin-transfer torque
D. Apalkov, S. Watts, A. Driskill-Smith, E. Chen, mram based on-chip caches. IEEE Trans. Very Large
Z. Diao, V. Nikitin, Comparison of scaling of Scale Integr. Syst. 24(3), 992–1002 (2016)
in-plane and perpendicular spin transfer switching A. Fukushima, T. Seki, K. Yakushiji, H. Kubota,
technologies by micromagnetic simulation. IEEE H. Imamura, S. Yuasa, K. Ando, Spin dice: a scalable
Trans. Magn. 46(6), 2240–2243 (2010) truly random number generator based on spintronics.
C. Augustine, A. Raychowdhury, D. Somasekhar, Appl. Phys. Exp. 7(8), 083001 (2014)
J. Tschanz, K. Roy, V. K. De, Numerical analysis of A. Fukushima, K. Yakushiji, H. Kubota, S. Yuasa, Spin
typical STT-MTJ stacks for 1T-1R memory arrays, dice (physical random number generator using spin
Int. Electron Devices Meet. (2010), pp. 22.7.1–22.7.4 torque switching) and its thermal response, in IEEE
C. Augustine, A. Raychowdhury, D. Somasekhar, Magn. Conf. (2015), pp. 1–1
J. Tschanz, V. De, K. Roy, Design space exploration P. Gambardella, I.M. Miron, Current-induced spin-orbit
of typical stt mtj stacks inmemory arrays in the pres- torques. Philos. Trans. A. Math. Phys. Eng. Sci. 369,
ence of variability and disturbances. IEEE Trans. 3175–3197 (2011)
Electron Devices 58(12), 4333–4343 (2011) Y. Huai, Spin-transfer torque MRAM (STT-MRAM):
C. Augustine, N.N. Mojumder, X. Fong, S.H. Choday, challenges and prospects. AAPPS Bull. 18(6), 33–40
S.P. Park, K. Roy, Spin-transfer torque mrams for low (2008)
power memories: perspective and prospective. IEEE Y. Huai, F. Albert, P. Nguyen, M. Pakala, T. Valet, Obser-
Sens. J. 12(4), 756–766 (2012) vation of spin-transfer switching in deep submicron-
L. Berger, Emission of spin waves by a magnetic multi- sized and low-resistance magnetic tunnel junctions.
layer traversed by a current. Phys. Rev. B 54(13), Appl. Phys. Lett. 84(16), 3118–3120 (2004)
9353–9358 (1996) S. Huda, A. Sheikholeslami, A novel STT-MRAM cell
P.M. Braganca, J.A. Katine, N.C. Emley, D. Mauri, with disturbance-free read operation. IEEE Trans.
J.R. Childress, P.M. Rice, E. Delenia, D.C. Ralph, Circ. Syst. I, Reg. Papers 60(6), 1534–1547 (2013)
R.A. Buhrman, A three-terminal approach to develop- S. Ikeda, K. Miura, H. Yamamoto, K. Mizunuma,
ing spin-torque written magnetic random access mem- H.D. Gan, M. Endo, S. Kanai, J. Hayakawa,
ory cells. IEEE Trans. Nanotechnol. 8(2), 190–195 F. Matsukura, H. Ohno, A perpendicular-anisotropy
(2009) CoFeB-MgO magnetic tunnel junction. Nat. Mater. 9
W.F. Brown, Thermal fluctuations of a single-domain (9), 721–724 (2010)
particle. Phys. Rev. 130(5), 1677–1686 (1963) A. Jog, A. K Mishra, C. Xu, Y. Xie, V. Narayanan,
D. Datta, B. Behin-Aein, S. Datta, S. Salahuddin, Voltage R. Iyer, C. R. Das, Cache revive: architecting volatile
asymmetry of spin-transfer torques. IEEE Trans. STT-RAM caches for enhanced performance in
Nanotechnol. 11(2), 261–272 (2012) CMPs, in Proceedings of the 49th Annu. Des. Autom.
Conf.—DAC’12 (2012), p. 243
244 X. Fong and K. Roy

J. Jung, Y. Nakata, M. Yoshimoto, H. Kawaguchi, cache-coherence-enabled adaptive refresh. ACM


Energy-efficient spin-transfer torque RAM cache Trans. Des. Autom. Electron. Syst. 19(1), 1–23
exploiting additional all-zero-data flags, in Int. Symp. (2013b)
Qual. Electron. Des. (2013), pp. 216–222 C. J. Lin, S. H. Kang, Y. J. Wang, K. Lee, X. Zhu, W. C.
S. H. Kang, X. Li, S. Gu, K. Lee, X. Zhu, STT MRAM Chen, X. Li, W. N. Hsu, Y. C. Kao, M. T. Liu,
magnetic tunnel junction architecture and integration M. Nowak, N. Yu, 45 nm low power CMOS logic
(2014) compatible embedded STT MRAM utilizing a
J. Katine, F. Albert, R. Buhrman, E. Myers, D. Ralph, reverse-connection 1T/1MTJ cell, in 2009 I.E. Int.
Current-driven magnetization reversal and spin-wave Electron Devices Meet. (2009), pp. 1–4
excitations in Co/Cu/Co pillars. Phys. Rev. Lett. 84 L. Liu, C.-F. Pai, Y. Li, H.W. Tseng, D.C. Ralph,
(14), 3149–3152 (2000) R.A. Buhrman, Spin-torque switching with the giant
Y. Kim, S. K. Gupta, S. P. Park, G. Panagopoulos, K. Roy, spin hall effect of tantalum. Science 336(6081),
Write-optimized reliable design of STT MRAM, in 555–558 (2012)
Proceedings of the 2012 ACM/IEEE Int. Symp. Low M. Mao, G. Sun, Y. Li, A. K. Jones, Y. Chen, Prefetching
power Electron. Des.—ISLPED’12 (2012), p. 3 techniques for STT-RAM based last-level cache in
Y. Kim, S.H. Choday, K. Roy, DSH-MRAM: differential CMP systems, in 2014 19th Asia South Pacific Des.
spin hall MRAM for on-chip memories. IEEE Elec- Autom. Conf. (2014), pp. 67–72
tron Device Lett. 34(10), 1259–1261 (2013) I.M. Miron, K. Garello, G. Gaudin, P.-J. Zermatten,
T. Kishi, H. Yoda, T. Kai, T. Nagase, E. Kitagawa, M.V. Costache, S. Auffret, S. Bandiera, B. Rodmacq,
M. Yoshikawa, K. Nishiyama, T. Daibou, A. Schuhl, P. Gambardella, Perpendicular switching
M. Nagamine, M. Amano, S. Takahashi, of a single ferromagnetic layer induced by in-plane
M. Nakayama, N. Shimomura, H. Aikawa, current injection. Nature 476(7359), 189–193 (2011)
S. Ikegawa, S. Yuasa, K. Yakushiji, H. Kubota, N.N. Mojumder, K. Roy, Proposal for switching current
A. Fukushima, M. Oogane, T. Miyazaki, K. Ando, reduction using reference layer with tilted magnetic
Lower-current and fast switching of a perpendicular anisotropy in magnetic tunnel junctions for spin-
TMR for high speed and high density spin-transfer- transfer torque (STT) MRAM. IEEE Trans. Electron
torque MRAM, in 2008 I.E. Int. Electron Devices Devices 59(11), 3054–3060 (2012)
Meet. (2008), pp. 1–4 N.N. Mojumder, S.K. Gupta, S.H. Choday, D.E. Nikonov,
K. Kwon, S.H. Choday, Y. Kim, K. Roy, AWARE K. Roy, A three-terminal dual-pillar STT-MRAM for
(Asymmetric Write Architecture with REdundant high-performance robust memory applications. IEEE
blocks): a high write speed STTMRAM cache archi- Trans. Electr. Devices 58(5), 1508–1516 (2011a)
tecture. IEEE Trans. Very Large Scale Integr. Syst. 1, N. N. Mojumder, S. K. Gupta, K. Roy, Dual pillar spin
1–1 (2013) transfer torque MRAM with tilted magnetic anisot-
K. Kwon, X. Fong, P. Wijesinghe, P. Panda, K. Roy, ropy for fast and error-free switching and near-disturb-
High-density & robust STT-MRAM array through free read operations, in 69th Device Res. Conf. (2011),
device/circuit/architecture interactions. IEEE Trans. pp. 67–68
Nanotechnol., 14(6), 1024–1034 (2015) N.N. Mojumder, X. Fong, C. Augustine, S.K. Gupta,
P. Lacharme, Post-processing functions for a biased phys- S.H. Choday, K. Roy, Dual pillar spin-transfer torque
ical random number generator, in Fast Software mrams for low power applications. ACM J. Emerg.
Encryption (Springer, Berlin, 2008), pp. 334–342 Technol. Comput. Syst. 9(2), 1–17 (2013)
D. Lee, S. K. Gupta, K. Roy, High-performance E.B. Myers, Current-induced switching of domains in
lowenergy STT MRAM based on balanced write magnetic multilayer devices. Science 285(5429),
scheme, in Proceedings of the 2012 ACM/IEEE Int. 867–870 (1999)
Symp. Low power Electron. Des.—ISLPED’12 (2012), H. Naeimi, C. Augustine, A. Raychowdhury, S. Lu,
p. 9 J. Tschanz, STTRAM scaling and retention failure.
D. Lee, X. Fong, K. Roy, R-MRAM: A ROM-embedded Intel Technol. J. 17(1), 54–75 (2013)
STT MRAM cache. IEEE Electron Device Lett. 34 Y. Niimi, M. Morota, D.H. Wei, C. Deranlot, M. Basletic,
(10), 1256–1258 (2013) A. Hamzic, A. Fert, Y. Otani, Extrinsic spin Hall
J. Li, P. Ndai, A. Goel, S. Salahuddin, K. Roy, Design effect induced by iridium impurities in copper. Phys.
paradigm for robust spin-torque transfer magnetic Rev. Lett. 106(12), 126601 (2011)
RAM (STT MRAM) from circuit/architecture per- Y. Niimi, Y. Kawanishi, D.H. Wei, C. Deranlot,
spective. IEEE Trans. Very Large Scale Integr. Syst. H.X. Yang, M. Chshiev, T. Valet, A. Fert, Y. Otani,
18(12), 1710–1723 (2010) Giant spin Hall effect induced by skew scattering from
Q. Li, J. Li, L. Shi, C. J. Xue, Y. Chen, Y. He, Compiler- bismuth impurities inside thin film CuBi alloys. Phys.
assisted refresh minimization for volatile STT-RAM Rev. Lett. 109, 156602 (2012)
cache, in 2013 18th Asia South Pacific Des. Autom. J. Nogués, J. Sort, V. Langlais, V. Skumryev, S. Suriñach,
Conf. (2013), pp. 273–278 J.S. Muñoz, M.D. Baró, Exchange bias in
J. Li, L. Shi, Q. Li, C.J. Xue, Y. Chen, Y. Xu, W. Wang, nanostructures. Phys. Rep. 422(3), 65–117 (2005)
Low-energy volatile STT-RAM cache design using
7 On-Chip Non-volatile STT-MRAM for Zero-Standby Power 245

T. Ohsawa, H. Koike, S. Miura, H. Honjo, K. Tokutome, G. Sun, Y. Zhang, Y. Wang, Y. Chen, Improving energy
S. Ikeda, T. Hanyu, H. Ohno, T. Endoh, 1Mb efficiency of write-asymmetric memories by log style
4T-2MTJ nonvolatile STT-RAM for embedded write, in Proceeding of the 2012 ACM/IEEE Int.
memories using 32b fine-grained power gating tech- Symp. Low power Electron. Des., ISLPED’12 (2012),
nique with 1.0ns/200ps wake-up/power-off times, in pp. 173–178
2012 Symp. VLSI Circuits (2012), pp. 46–47 Z. Sun, X. Bi, H. Li, W.-F. Wong, X. Zhu, STT-RAM
C.F. Pai, L. Liu, Y. Li, H.W. Tseng, D.C. Ralph, cache hierarchy with multiretention MTJ designs.
R.A. Buhrman, Spin transfer torque devices utilizing IEEE Trans. Very Large Scale Integr. Syst. 22(6),
the giant spin Hall effect of tungsten. Appl. Phys. Lett. 1281–1293 (2014)
101(12), 1–5 (2012) S. Tsunegi, H. Kubota, S. Tamaru, K. Yakushiji,
Z. Pajouhi, X. Fong, K. Roy, Device/Circuit/Architecture M. Konoto, A. Fukushima, T. Taniguchi, H. Arai,
Co-Design of Reliable STT-MRAM, in Proc. 2015 H. Imamura, S. Yuasa, Damping parameter and inter-
Des. Autom. Test Eur. Conf. Exhib. (2015), facial perpendicular magnetic anisotropy of FeB
pp. 1437–1442 nanopillar sandwiched between MgO barrier and cap
S. P. Park, S. Y. Kim, D. Lee, J.-J. Kim, W. P. Griffin, layers in magnetic tunnel junctions. Appl. Phys. Exp.
K. Roy, Column-selection-enabled 8T SRAM array 7(3), 033004 (2014)
with 1R/1W multi-port operation for DVFS-enabled J. Von Neumann, Various techniques used in connection
processors, in IEEE/ACM Int. Symp. Low Power Elec- with random digit. Natl. Bur. Stand. Appl. Math. Ser.
tron. Des. (2011), pp. 303–308 12, 36–38 (1951)
S. P. Park, S. Gupta, N. Mojumder, A. Raghunathan, W.G. Wang, C.L. Chien, Voltage-induced switching in
K. Roy, Future cache design using STT MRAMs for magnetic tunnel junctions with perpendicular mag-
improved energy efficiency, in Proceedings of the netic anisotropy. J. Phys. D. Appl. Phys. 46(7),
49th Annu. Des. Autom. Conf.—DAC’12 (2012), 74004 (2013)
p. 492 W.-G. Wang, M. Li, S. Hageman, C.L. Chien, Electric-
S.S.P. Parkin, N. More, K.P. Roche, Oscillations in field assisted switching in magnetic tunnel junctions.
exchange coupling and magnetoresistance in metallic Nat. Mater. 11(1), 64–68 (2012)
superlattice structures: Co/Ru, Co/Cr, and K.L. Wang, J.G. Alzate, P. Khalili Amiri, Low-power
Fe/Cr. Phys. Rev. Lett. 64(19), 2304–2307 (1990) non-volatile spintronic memory: STT-RAM and
M. Rasquinha, D. Choudhary, S. Chatterjee, beyond. J. Phys. D. Appl. Phys. 46(7), 074003 (2013a)
S. Mukhopadhyay, S. Yalamanchili, An energy effi- J. Wang, X. Dong, Y. Xie, OAP: an obstruction-aware
cient cache design using spin torque transfer (STT) cache management policy for STT-RAM last-level
RAM, in Proceedings of the 16th ACM/IEEE Int. caches, in Des. Autom. Test Eur. Conf. Exhib.
Symp. Low power Electron. Des.—ISLPED’10 (2013), pp. 847–852
(2010), p. 389 W. Xu, Y. Chen, X. Wang, T. Zhang, Improving STT
ITRS Roadmap (2014) MRAM storage density through smaller-than-worst-
S. Salahuddin, D. Datta, S. Datta, Key role of non equi- case transistor sizing, in Des. Autom. Conf. (2009),
librium spin density in determining spin torque, in pp. 87–90
2008 Device Res. Conf. (2008), pp. 161–162 S. Yamamoto, S. Sugahara, Nonvolatile static random
Y. Shiota, T. Nozaki, F. Bonell, S. Murakami, T. Shinjo, access memory using magnetic tunnel junctions with
Y. Suzuki, Induction of coherent magnetization current-induced magnetization switching architecture.
switching in a few atomic layers of FeCo using volt- Jpn. J. Appl. Phys. 48(4), 043001 (2009)
age pulses. Nat. Mater. 11(1), 39–43 (2012) T. Yamauchi, Prospect of embedded non-volatile memory
J.C. Slonczewski, Current-driven excitation of magnetic in the smart society, in 2015 Int. Symp. VLSI Technol.
multilayers. J. Magn. Magn. Mater. 159, L1–L7 Syst. Appl. (2015), pp. 1–2
(1996) J. Yang, B. Geller, M. Li, T. Zhang, An information
C. W. Smullen, V. Mohan, A. Nigam, S. Gurumurthi, theory perspective for the binary STT-MRAM cell
M. R. Stan, Relaxing non-volatility for fast and operation channel, in IEEE Trans. Very Large Scale
energy-efficient STT-RAM caches, in 2011 I.E. 17th Integr. Syst. (2015), pp. 1–1
Int. Symp. High Perform. Comput. Archit. (2011), G. Yu, P. Upadhyaya, Y. Fan, J.G. Alzate, W. Jiang,
pp. 50–61 K.L. Wong, S. Takei, S.A. Bender, L.-T. Chang,
J. Sun, Spin-current interaction with a monodomain mag- Y. Jiang, M. Lang, J. Tang, Y. Wang,
netic body: a model study. Phys. Rev. B 62(1), Y. Tserkovnyak, P.K. Amiri, K.L. Wang, Switching
570–578 (2000) of perpendicular magnetization by spin-orbit torques
Z. Sun, X. Bi, H. Li, W.-F. Wong, Z.-L. Ong, X. Zhu, in the absence of external magnetic fields. Nat.
W. Wu, Multi retention level STT-RAM cache Nanotechnol. 9, 548–554 (2014)
designs with a dynamic refresh scheme, in S. Yuasa, T. Nagahama, A. Fukushima, Y. Suzuki,
Proceedings of the 44th Annu. IEEE/ACM Int. Symp. K. Ando, Giant room-temperature magnetoresistance
Microarchitecture—MICRO’11 (2011), p. 329 in single-crystal Fe/MgO/Fe magnetic tunnel
junctions. Nat. Mater. 3(12), 868–871 (2004)
246 X. Fong and K. Roy

L. Zhang, X. Fong, C.-H. Chang, Z. H. Kong, K. Roy, dual-mode applications: data storage and key genera-
Feasibility study of emerging non-volatile memory tor. IEEE Trans. Comput. Des. Integr. Circuits Syst.
based physical unclonable functions, in 2014 I.E. 6th 34(7), 1176–1187 (2015)
Int. Mem. Work. (2014), pp. 1–4 B. Zhao, J. Yang, Y. Zhang, Y. Chen, H. Li, Architecting
L. Zhang, X. Fong, C.-H. Chang, Z. H. Kong, K. Roy, a common source-line array for bipolar non-volatile
Highly reliable memory-based physical unclonable memory devices, in 2012 Des. Autom. Test Eur. Conf.
function using spin-transfer torque MRAM, in Exhib. (2012), pp. 1451–1454
2014 I.E. Int. Symp. Circuits Syst. (2014), P. Zhou, B. Zhao, J. Yang, Y. Zhang, Energy reduction for
pp. 2169–2172 STT-RAM using early write termination, in IEEE/
L. Zhang, X. Fong, C.-H. Chang, Z.H. Kong, K. Roy, ACM Int. Conf. Comput. Des.—Dig. Tech. Pap.
Optimizating emerging nonvolatile memories for (2009), pp. 264–268
Security Down to the Hardware Level
8
Anastacia Alvarez and Massimo Alioto

This chapter introduces the concept of Physically oftentimes data needs to be unreadable from an
Unclonable Functions (PUFs), their prospects for unintended receiver. Accordingly, IoT requires
hardware security in IoT devices, and their inter- security to be assured down to the hardware
action with traditional cryptography. Section 8.1 level, as the authenticity and the integrity need
summarizes the background on PUFs, whereas to be assured also in terms of the hardware imple-
Sect. 8.2 covers the metrics that are commonly mentation of each IoT node (i.e., each node needs
used to evaluate PUF performance. Such metrics to be confirmed to be authentic and intact, while
are used to comparatively review the state of the signaling if it has been counterfeited or
art on PUFs in Sect. 8.3. Section 8.4 covers tampered with).
vulnerabilities to malicious attacks attempting In the recent past, Physically Unclonable
to clone or mimic a PUF. In the last section, we Functions (PUFs) have emerged as potentially
introduce the novel concept of PUF-enhanced highly secure and lightweight solution to ensure
cryptography as a promising direction aiming to data and hardware security, assuring trustworthi-
merge PUFs and cryptography in a cohesive ness down to the chip level (Mathew et al. 2014;
framework for IoT hardware-level security. Maes et al. 2012; Rosenblatt et al. 2013; Su et al.
2007; Maes 2012). A PUF is a function that maps
an input (digital) challenge to an output (digital)
8.1 Physically Unclonable response in a repeatable but unpredictable man-
Functions for IoT ner, leveraging on chip-specific random process
variations. PUFs are sometimes referred to as
The spatial pervasiveness and the prospectively “silicon biometrics”, i.e. something equivalent
very large number of deployed nodes monitoring to a “chip fingerprint” that is unique for each
the environment, people, shared resources and die. As such, it eliminates the need to store any
goods, makes security a fundamental challenge key, as the latter is naturally generated and
in IoT. Serious security issues are indeed arising embedded into the chip during its manufacturing.
in terms of data authenticity, integrity and confi- This avoids the need for key programming (e.g.,
dentiality. Indeed, it is typically necessary to via fuses or e-Flash), and makes IoT nodes less
assure that the data and the sender are legitimate, prone to the many existing attacks that uncover
the data has been sent uncorrupted, and the content of memories (Nedospasov et al.
2013), as discussed below.
A. Alvarez (*) • M. Alioto PUFs are used for chip identification and
National University of Singapore, Singapore, Singapore authentication (Rosenblatt et al. 2013; Su et al.
e-mail: anastacia@u.nus.edu

# Springer International Publishing AG 2017 247


M. Alioto (ed.), Enabling the Internet of Things, DOI 10.1007/978-3-319-51482-6_8
248 A. Alvarez and M. Alioto

2007; Maes 2012; Alvarez et al. 2015; Gassend (NVM). Unfortunately, storing the key off chip
et al. 2002), secure key storage and lightweight or in an on-chip NVM facilitates the recovery of
encryption (Mathew et al. 2014; Xu et al. 2014), the key by other parties. Indeed, several studies
hardware-entangled cryptography (Sadeghi and have shown that NVM are prone to attacks and
Naccache 2010) and identification of malicious easy to read out (Samyde et al. 2002;
hardware (Maes 2013). Chip identification and K€ommerling et al. 1999). PUFs replace the con-
authentication are typically performed by prelim- ventional key storage, and hence offer superior
inarily storing all challenge-response pairs robustness against invasive attacks, as they do
(CRPs) of the chip PUF in a secure database, not really store information since they recreate
during a first enrollment phase. These (or a sub- the keys only while the chip is being powered on.
set thereof) are used to verify the response of the
chip to a given challenge during in-field opera-
tion, making sure not to reuse CRPs to reduce 8.2 PUF Properties and Metrics
susceptibility to cloning, and counteract replay
attacks. Figure 8.1 shows an illustration of the Ideally, an array of PUF bitcells generate chip-
enrollment process and chip authentication. specific keys that are:
To keep data secure during transmission, it is
typically encrypted using a key that is stored • unpredictable, leveraging on on-chip random
externally, or in an on-chip non-volatile memory process variations

Fig. 8.1 Illustration of


typical chip enrollment and
subsequent in-field
authentication using
challenge-response pairs
(CRPs) from PUFs
8 Security Down to the Hardware Level 249

• repeatable, by amplifying random variations, In detail, any PUF output should ideally
while rejecting global variations and noise remain the same under fluctuating environmental
(Maes et al. 2012) conditions (e.g., voltage, temperature), and at
• not directly accessible or measurable exter- any process corner. Actual PUFs are not able to
nally, once the enrollment phase is completed. provide perfectly stable outputs, due to
non-perfect rejection of noise, global and envi-
There are two main types of PUFs: weak ronmental variations. Stability is measured by
PUFs and strong PUFs. Weak PUFs have limited counting all bits that become unstable across
number of challenge-response pairs, making repeated PUF evaluations and environmental
them equivalent to random key generators that conditions, within the specified range of voltage
are typically used for encryption and decryption. and temperature of operation.
Weak PUFs essentially provide chip ID, whereas Repeatability (or reproducibility) and unique-
strong PUFs offer a very large number of ness are measured from the Hamming Distance
challenge-response pairs (CRPs), each for (HD) across several measurements of PUF keys.
one-time use. Given the long lifespan required Such measurements are compared to a reference
by IoT applications, PUFs with very large num- “golden” key (Maes 2013) that is taken as the
ber of CRPs (and therefore large area) are very first measurement under nominal conditions.
expensive and typically infeasible. As a numeri- Repeatability is the average intra-PUF Hamming
cal example, Table 8.1 shows an example of the Distance (HD) between the golden key and sev-
cost for a PUF with 256-bit key in 65 nm eral key evaluations with the same challenge in
(Alvarez et al. 2015; Alvarez et al. 2016), the same chip, under different environmental
whose cost invariably exceeds the overall IoT conditions. By definition, highly reproducible
node cost. PUFs should have low intra-PUF Hamming dis-
Given the fundamental PUF properties, such tance (ideally zero). Uniqueness, on the other
as stability, repeatability, uniqueness and hand, is taken as the average inter-PUF HD
randomness (Maes 2013), and knowing the sta- between the golden key and key evaluations
tistical nature of process variations, several from different chips under the same PUF input
metrics have been introduced to quantify the (Mathew et al. 2014). The inter-PUF HD should
quality of PUF bitcells. In the following, such be close to the ideal value equal to half the length
metrics are summarized in Table 8.2, where typ- of the PUF key (e.g., the ideal inter-PUF HD of a
ical values based on current literature are also 256-bit key is 128). Alternatively, the fractional
reported. Hamming Distance (FHD) can be used to

Table 8.1 Example of SRAM PUF silicon cost (assumed to be 5 cents/mm2, with area/bit representative of very dense
SRAM PUFs)
(Encrypted) data transmitted every PUF capacity (MB) PUF area (mm2) Silicon cost (US$)
1h 5 24 1.2
10 min 32 147 7.4
1 min 320 1478 74

Table 8.2 PUF metrics and typical values


Metric Measured by Typical value Ideal value
Stability Unstable bits 1–60% 0
Repeatability Intra-PUF FHD 0.8–15% 0
Uniqueness Inter-PUF FHD 30–60% 50%
Identifiability Inter/intra HD 5–80 1
Randomness 0/1 bias 40–60% 50%
250 A. Alvarez and M. Alioto

quantify reproducibility and uniqueness (Maes are minimized. Type I error is the false positive,
et al. 2012), where the Hamming distance is where an invalid key is accepted as a valid one.
simply expressed as a percentage of the key Type II error, on the other hand, is the false
length, or the number of bits N in a PUF key negative, where a valid key is discarded as an
(ideal inter-PUF FHD is 50%). Identifiability invalid one.
quantifies the distinguishability of a PUF Regarding chip authentication, false rejection
instance to other instances, and is loosely taken rate (FRR) and false acceptance rate (FAR) can
as the ratio of the inter-PUF and intra-PUF HD be used as relevant metrics to assess its quality
(on the assumption that it is both repeatable and and security level (Stanzione et al. 2009; Lim
unique), where a larger value is desired (Maes et al. 2005). Referring to Fig. 8.2, FRR
2013; Mathew et al. 2014; Yang et al. 2015). corresponds to the probability of having an out-
Figure 8.2 shows an example of probability put with FHD under the false negative area,
distribution function of reproducibility (intra- whereas FAR corresponds to the area under the
PUF FHD) and uniqueness (inter-PUF FHD). A false positive area. Accordingly, the PUF yield
perfectly identifiable PUF ideally has no inter- Y can be defined as the probability that no
section between the inter-PUF and intra-PUF authentication error occurs during the lifetime
curves, which means that a single PUF response of a given PUF chip, i.e. Y ¼ 1  FAR  FRR.
is enough to determine whether the chip is The bit error rate (BER) or the percentage of
authentic or not. In practical cases, the two unstable bits can also be used as a metric of the
curves in Fig. 8.2 have an intersection, and an quality of chip authentication, when the whole
optimal decision threshold needs to be chosen to array is considered, rather than dividing the array
determine whether a given PUF is identifiable. into keys of length N.
As shown in Fig. 8.2, such decision threshold is Another important property of PUFs is the
set by the point where Type I and Type II errors randomness of its responses, as needed to ensure

Fig. 8.2 Sample Inter- and


Intra-PUF FHD showing
decision threshold and
Type I (false positive) and
Type II (false negative)
errors
8 Security Down to the Hardware Level 251

their unpredictability. Randomness is routinely IoT devices are tightly energy constrained,
quantified through the statistical characterization since they are either battery operated or energy
in terms of 0/1 bias (defined as the probability of harvested, hence the energy consumption of the
having a 1 in a PUF output bit (Yu et al. 2012)), PUF is another important metric. To abstract the
the entropy (Mathew et al. 2014), and more thor- energy from the PUF organization and size, the
oughly through the NIST randomness tests most commonly adopted metric is the energy per
(Rukhin et al. 2010) (see below). To quantify bit, obtained by dividing the average energy per
the randomness of PUF responses across differ- access by the number of bits within the key. The
ent positions of PUF bitcell within the die, the energy per bit of existing PUFs typically ranges
autocorrelation function (ACF) is routinely used from tens of fJ/bit to tens of pJ/bit (Alvarez et al.
to detect repeating or correlated patterns among 2015, 2016).
different responses (Mathew et al. 2014; Yang Due to the stringent cost requirement in IoT
et al. 2015). The correlation between PUF output nodes (including silicon area), another important
bits is generally due to layout-dependent PUF metric is the effective area per bit, as
variations (Alvarez et al. 2015, 2016; Li et al. obtained by considering the actual number of
2015). Visually, randomness can be represented available PUF bits obtained after removing
in the form of the speckle diagram shown in unstable bits, and including the area cost of the
Fig. 8.3, where each pixel represents a PUF circuitry performing post-processing on the raw
bitcell and the PUF output 0’s (1’s) are PUF output (see later). Robustness to ageing and
represented with black (white) pixels. In chip lifetime are assessed through accelerated
Fig. 8.3, the distribution looks somewhat random ageing tests (Puntin et al. 2008; Stanzione et al.
(i.e., there are no clear patterns) and the 0/1 bias 2009; Selimis et al. 2011). Modeling complexity,
is also close to ideal value of 0.5. in terms of the number of brute force trials
The NIST statistical test suite (Rukhin et al. needed to model the PUF, can likewise be used
2010) is a set of tests to quantify the randomness to characterize PUFs (Stanzione et al. 2009).
of a stream of bits. Version 2.1.2 contains
15 tests, each one exercising one property to
test randomness. The simplest test is the fre- 8.3 PUF Topologies and State
quency test, which computes the 0/1 ratio of the of the Art
whole bitstream. For each of the tests, certain
parameters need to be preliminarily set (e.g., The concept of PUFs have been introduced in the
length of bitstream n, block size M ). Table 8.3 early 2000s, and they have been initially referred
shows the complete list of the tests and to as ICID (Lofstrom et al. 2000), Physical
parameters to be set. One-Way Functions (POWF) (Pappu et al.

Fig. 8.3 Sample speckle diagram from (Alvarez et al. 2015)


252 A. Alvarez and M. Alioto

Table 8.3 NIST statistical test suite


Minimum stream
NIST test Description length n Other parameters
Frequency Test Takes ratio of number of 1’s and 0’s 100 –
Frequency test within a Ratio of 1’s and 0’s with M-bit block 100 M  20
block M > 0.01n
Runs test Relative oscillation of bit stream 100 –
Longest Run of Ones Length of longest consecutive 1’s with a 128 M (set based on
block present n)
Binary Matrix Run Rank of disjoint sub-matrix 38  M  Q M, Q
DFT Detect periodic features 103 –
Non-overlapping Detect occurrence of patterns in an m-bit 106 m ¼ [2,10]
Template window
Overlapping Template Detects occurrence of patterns, with 106 m ¼ [2,10]
overlaps included
Universal Statistical Test Number of bits between matching patterns 387,840 L ¼ [6,16]
Q ¼ 10 * 2L
Linear Complexity Test Length of equivalent LFSR 106 M
Serial Test Detect frequency of overlapping patterns – m < log2n  2
Approximate Entropy Detect frequency of overlapping patterns – m < log2n  5
Cumulative Sums Random walk 100 –
Random Excursions Test Random walk cycle 106 –
Random Excursions Deviations from a random walk 106 –
Variant Test

Fig. 8.4 Illustration of the Physical One-Way Function from a non-homogenous material (Pappu et al. 2002)

2002), or Physical Random Functions (PRF) leveraging the low-cost and high-volume capa-
(Gassend et al. 2002), among others. ICID uses bility of CMOS chips.
an array of MOSFET to generate the random Most of the existing silicon PUFs can be clas-
values from random process mismatch, via FET sified as either delay-based or memory-based
drain current. The physical one-way function PUFs (Gassend et al. 2002; Guajardo et al.
was proposed as a solution to the need for a 2007; Suh et al. 2007; Kumar et al. 2008). In
one-way function (easy to evaluate but difficult delay-based PUFs, bits are generated by compar-
to invert) for cryptographic applications. The ing the delay of two nominally identical paths.
approach uses a laser to scatter light through an The sign of the random delay difference between
inhomogeneous structure (at some precise angle, the two delays determines the output bit. One of
which serves as the challenge), as shown in the earliest implementation of such a concept is
Fig. 8.4. The resulting optical speckle diagram the ring oscillator (RO) PUF (Gassend et al.
is hashed to obtain the key. Most of the literature 2002; Suh et al. 2007), whose digital output is
has then reverted to silicon-based solutions, determined by the relative frequency of each
8 Security Down to the Hardware Level 253

selected pair of nominally identical ring on the oscillation collapse in an even-stage ring
oscillators. Figure 8.5a shows the general dia- of delay-adjustable stages. The delay is set by an
gram of a ring oscillator PUF, where the chal- applied input (PUF challenge) via inverter rep-
lenge selects two of the available ring oscillators, lica multiplexing. The native instability of PUF
and the corresponding response depends on outputs was substantially reduced at the cost of
whether the frequency of the first selected oscil- much higher energy and the need for CTAT
lator is greater than the second or not. Knowing biasing.
that these inverter chain ring oscillators tend to All above delay-based PUFs are also intrinsi-
be very sensitive to environmental conditions, cally vulnerable to PUF modeling attacks, which
several techniques have been introduced to can capture and clone the content of the entire
improve the high native instability rate, and PUF with very low effort. Indeed, the PUF output
poor statistical quality of this pair-wise compari- is dictated by the overall PUF delay, which is in
son. Some of these techniques include the adop- turn defined by the sum of the delays of cascaded
tion of k-sum or 1-out-of-k masking techniques stages. Since each stage delay is fixed (although
(Suh et al. 2007; Lee et al. 2004). unpredictable), identifying all stage delays from
Another delay-based PUF is the arbiter PUF the analysis of the PUF outputs entails only a
(Suh et al. 2007; Lim et al. 2005; Lee et al. 2004), linear complexity, making the PUF easy to
as shown in Fig. 8.5b. It compares the delay of clone (Rührmair et al. 2010).
two delay lines, and suffers from the same In memory-based PUFs, a bistable structure of
limitations as the RO PUF (Gassend et al. 2002; two cross-coupled inverters is used to generate
Lee et al. 2004). A improved delay-based version the output bits. They leverage on the natural
was recently proposed (Yang et al. 2015), based tendency of cross-coupled inverters to resolve

Fig. 8.5 Delay-based PUFs: (a) ring oscillator (RO) PUF (Suh et al. 2007), (b) arbiter PUF (Lee et al. 2004)
254 A. Alvarez and M. Alioto

to a preferred state at the power-up, as deter- found in other PUFs that rely on the same princi-
mined by their asymmetry due to random ple, such as senseamp (Bhargava et al. 2010;
variations (Suh et al. 2007). For example, Bhargava and Mai 2014). For such PUFs, reason-
SRAM PUFs leverage this property in SRAM able levels of stability are typically achieved
bitcells (Guajardo et al. 2007; Holcomb et al. through substantial temporal redundancy at the
2009). Other similar PUFs are the Latch PUF expense of energy consumption (Helinski et al.
(Su et al. 2007), DFF PUF (Maes et al. 2008), 2009). Other proposed PUFs are based on (1) the
butterfly PUF (Kumar et al. 2008), and the glitch generated in digital paths, although they
buskeeper PUF (Simons et al. 2012), which is suffer from high instability rates (Suzuki et al.
similar to the SRAM PUF albeit without the 2010), (2) the difference in leakage current
write ability, as access transistors are removed generated by nominally identical transistors,
since PUF bitcells are read-only. The butterfly although at the cost of large energy due to the
PUF (Kumar et al. 2008) follows the same con- necessary circuitry for current/voltage references
cept of leveraging on the unstable state of cross- and opamp (Ganta et al. 2011), (3) DRAM errors
coupled inverters. It was proposed for implemen- under different wordline voltage, although such
tation in an FPGA and uses the available cross- PUFs are highly vulnerable to non-invasive
coupled latches instead of inverters, as shown in attacks (Rosenblatt et al. 2013), (4) open or
Fig. 8.6. The operation starts by asserting the short connection in vias (Choi et al. 2014),
excite signal, thereby forcing the PUF to be in (5) oxide breakdown in OTP ROMs (Liu et al.
the unstable state. This signal is then released and 2010), (6) capacitance mismatch (Tuyls et al.
after a few clock cycles, out signal settles to its 2006; Roy et al. 2009; Wan et al. 2015), (7) the
natural stable state determined by the random variations in the supply network resistance,
variations in the related logic gates. although this requires the generation of very
The recent literature on memory-based PUFs large currents (Helinski et al. 2009).
and their experimental characterization has A hybrid PUF was proposed in (Satpathy et al.
shown that PUFs typically have poor stability 2014; Mathew et al. 2014, 2016) combining
(Schrijen et al. 2012), and are highly vulnerable delay and metastability as sources of
to semi-invasive attacks such as electrical and randomness. The basic bitcell is shown in
optical probing (Nedospasov et al. 2013). The Fig. 8.7, where bistability is forced through the
same vulnerability to semi-invasive attacks is pre-charge transistors controlled by clk0 and
clk1. The randomness in delay is introduced
through the clock skew between clk0 and clk1.
In order to reduce unstable bits, significant tem-
poral majority voting is employed. Soft dark bit
masking was also used in (Satpathy et al. 2014)
by modulating the load in the bit and bit’, and
masking bits that become unstable with the
change in the load. Indeed, load modulation sim-
ply injects controlled perturbation in the stability
of the PUF bitcell, which in turn permits to
identify the truly stable bitcells that do not
change output even in the presence of such
perturbation.
In order to achieve adequate native stability in
spite of voltage and temperature fluctuations,
authors in (Li et al. 2015) proposed to use a
proportional-to-absolute-temperature (PTAT) as
Fig. 8.6 Butterfly PUF (Kumar et al. 2008) a bitcell. Figure 8.8 shows the bitcell and the
8 Security Down to the Hardware Level 255

Fig. 8.7 Metastability-


based PUF (Mathew et al.
2014)

Fig. 8.8 PTAT-based PUF (Li et al. 2015)

architecture and principle of operation of the operation and low native instability rate was
PTAT-based PUF. As seen in the figure, the recently proposed, which relies on the amplifica-
PUF bitcell output is determined by the sign of tion of random transistor mismatch through two
the difference between Out_l and Out_r, both of complementary current mirrors. Figure 8.9
which are independent of voltage and tempera- shows two implementations of the same general
ture. Aside from the high resiliency against volt- concept. Figure 8.9a shows the INV_PUF bitcell
age and temperature variation, another implementation of this concept, which comprises
noteworthy feature of this PUF is its good area the cascode current mirrors M1–M4 and M5–
efficiency (only 727 F2/bit, being F the mini- M8. The two 1:1 current mirrors see the same
mum feature size of the adopted process), as current flowing through their respective input
enabled by the shared header per column. transistors (M3 and M5), and tend to mirror it
A class of static mono-stable PUFs (Alvarez to their output transistors (M4 and M6, respec-
et al. 2015, 2016) for extremely low energy tively). Without mismatch, M4 and M6 would
256 A. Alvarez and M. Alioto

Fig. 8.9 Static mono-stable PUFs (Alvarez et al. 2016) (a) INV_PUF and (b) SA_PUF

conduct the same saturation current (I M4, SAT and A more complete list of fabricated PUFs can
I M6, SAT ), and node Y would assume the same be found in the new public PUF database (Alioto
voltage as node X (e.g., V DD =2), due to the and Alvarez 2016). Extracted trends in terms of
symmetry of the topology in Fig. 8.9a. However, native instability rate, area, and energy are shown
random mismatch between M1–M2 and M7–M8 in Fig. 8.10. From Fig. 8.10a, the metastability-
makes these currents unpredictably different. based PUFs have the worst native instability rate,
The large output small-signal impedance RY at while the monostable PUFs exhibit the best
node Y (Fig. 8.9a) translates the difference in native instability rate. The high native instability
such currents into a large voltage deviation. rate in metastability-based PUFs is reduced
Accordingly, the voltage at node Y becomes through post-processing and other stability
essentially VDD if I M4, SAT - I M6, SAT >0, or ground enhancement techniques that increase testing
if I M4, SAT - I M6, SAT <0. Thus a digital output that is time (i.e., cost). For the rest of the PUFs, the
dominantly defined by the random mismatch native instability rate has slightly increased over
between the two current mirrors is generated. In the years.
Fig. 8.9b, the alternative SA_PUF topology adds From Fig. 8.10b, the area per bit is highest for
a sense amplifier (transistors M9–M13) after delay-based PUFs, due to the large number of
M1–M8 to further increase the voltage gain stages required to (1) limit the oscillation fre-
(and thus reduce the number of unstable bits) quency to acceptable values that can be distin-
and introduce additional random mismatch guished by the subsequent circuitry, (2) to
through the sense amp offset. mitigate the instability rate of individual ring
Table 8.4 compares various PUFs in terms of oscillators via k-sum or 1-out-of-k masking
the above mentioned metrics, with the best (Suh et al. 2007; Lee et al. 2004). In general,
performing PUF for each metric being the area efficiency of PUF bitcells has improved
highlighted in bold. As can be seen from this over time, especially due to the adoption of more
table, most PUFs are typically affected by rela- digital approaches that offer better density than
tively poor randomness/statistical quality (see, analog ones. Analog PUF bitcells have an oppo-
e.g., bias) (Maes et al. 2012; Maes 2013; Yu site trend, as their area tends to increase over
et al. 2012) and up to 30% instability rate time, when area is normalized to the square of
(Maes et al. 2012; Alvarez et al. 2015, 2016; the minimum feature size of the technology. This
Bhargava and Mai 2014). is mostly because of their analog nature, which
8

Table 8.4 Comparison of different PUFS


ICID Latch Mathew PTAT Yang INV_PUF
(Lofstrom et al. Arbiter (Lim (Su et al. et al. (Li et al. et al. (Alvarez et al.
PUF SRAM_PUF* RO_PUF* 2000) et al. 2005) 2007) (2014) 2015) (2015) 2016)
Technology 65 nm 65 nm 0.35 μm 0.18 μm 0.13 μm 22 nm 65 nm 40 nm 65 nm
Stability (% native unstable 16.66 18.16 1.3 9.8 3.04 30 7.1 12.5 2.34
bits at nominal condition)
Security Down to the Hardware Level

V-T variation 0.6–1 V 0.4–0.5 V 1.1–5 V, 25 1.8 V 2%, 0.9–1.2 V 0.7–0.9 V 0.7–0.9 V 0.6–1 V,
to 250  C 27–70 C 25–85  C
% error with VT variation 55.73 53.9 5 4.82 5.46875 30 12.5 5.72
Energy (pJ/bit) 1.1 0.4748 8333.3333 0.17125 0.93 0.19 1.1 17.75 0.015
Area (F2/bit) 306 39,000 1708 708,403 4369 9628 727 2062 6000
Randomness 0.6141 0.5023 0.4805 0.4928 0.5016
(bias ¼ probablity of 1)
Uniqueness (mean inter-PUF 0.3321 0.4738 0.4911 0.3800 0.5055 0.5100 0.5001 0.5007 0.5014
FHD)
Repeatability (mean intra-PUF 0.0602 0.0458 0.0134 0.0268 0.0057 0.0101 0.0034
FHD)
Identifiability (inter-PUF/intra- 6 10 37 19 88 50 149
PUF FHD)
NIST test PASS PASS PASS PASS
Entropy 0.9903 0.9947 0.9997 0.9998 0.9967
Autocorrelation function @ 0.0156 0.0884 0.01 0.0188 0.0283 0.0363
95% confidence
*Data taken from (Alvarez et al. 2016)
257
258 A. Alvarez and M. Alioto

Fig. 8.10 Trend of (a)


native instability rate, (b)
area per bit (normalized to
F2, F being the minimum
feature size of the CMOS
process), (c) energy per bit
for different PUFs
implemented in custom
PUF chips (Alioto and
Alvarez 2016) (normalized
to the energy Einv
consumed by a minimum-
sized inverter in a single
transition)
8 Security Down to the Hardware Level 259

does not really enable shrinking with finer (Yu et al. 2011), Code-Offset Syndrome
technologies. (Gassend et al. 2002; Yu and Devadas 2010;
From Fig. 8.10c, the energy per bit is improv- Suh et al. 2007; Dodis et al. 2008; Paral et al.
ing, thanks to the adoption of more energy-aware 2011; Eiroa et al. 2012), pattern matching
PUFs. The circuit improvements in terms of techniques (Paral et al. 2011), and fuzzy
energy definitely dominate the benefits of mere extractors (Selimis et al. 2011; Dodis et al.
technology scaling. This is shown by Fig. 8.10c, 2008)
which plots the energy normalized to the energy • temporal majority voting across repeated PUF
consumed by a minimum-sized inverter in the readings, which typically slow down and
same technology, and hence represents a increase the energy per access by more than
technology-independent metric. Interestingly, an order of magnitude (Mathew et al. 2014,
from Fig. 8.10c delay-based PUFs are an excep- 2016; Li et al. 2015)
tion, as they tend to have larger energy per bit • on-the-fly PUF bitcell masking (Satpathy
over the years. This is due to the need for a larger et al. 2014), and PUF redundancy (Yang
number of ring oscillators or oscillations to main- et al. 2015; Suh et al. 2007), which skips the
tain acceptable stability, in spite of the progres- bitcells that are found to be unstable at testing
sively worse native stability in 8.10a. time by storing the bit error map in an addi-
Some prior work enables the capability to tional volatile memory array (Mathew et al.
assure a well-defined stability safety margin at 2014; Bhargava and Mai 2014; Karpinskyy
the output word level (Yu et al. 2012), as a form et al. 2016); this approach may introduce sig-
of robustness assurance against individual bit nificant area/energy overhead, and consider-
instability. Other prior work focuses on improv- ably widens the opportunities to perform
ing the stability of PUF bitcells without quantita- successful invasive attacks (e.g., interfering
tive stability assurance at run-time. For example, with PUF operation by writing on the addi-
introducing burn-in hardening in (Mathew et al. tional memory).
2014) improves stability at the expense of signif-
icantly longer testing time, which conflicts with Figure 8.11 shows an example where ECC is
the very low cost requirement of IoT nodes (see used to improve the reliability of the PUF
Chap. 1). Another way to improve the statistical (Gassend et al. 2002). In this implementation,
quality and suppress a limited number of unsta- redundant information is generated for each
ble bits is through digital post-processing, at the challenge-response pair, to allow the correction
expense of substantially larger silicon area and of the PUF output. The ECC overhead is
energy. The post-processing block can be a mix- ~14 kgates, which is about an order of magnitude
ture of the following techniques: bigger than the PUF array itself. Similarly, in
(Rahman et al. 2014), ECC encoder was shown
• Error Correcting Code (ECC), which to have an area of ~3–12 kgates, with the ECC
introduces a large area/energy overhead espe- decoder requiring an even larger area of ~20–-
cially for high levels of targeted security, as 75 kgates. Detection of instability was proposed
its complexity grows exponentially in in (Karpinskyy et al. 2016) during the PUF
applications requiring wider PUF outputs; response generation, and in (Satpathy et al.
post-processing also leaks information and 2014) at boot time, as depicted in Fig. 8.12.
makes the PUF more vulnerable to physical
attacks (Bhargava and Mai 2014). Various
ECCs were used (Yu et al. 2012), such as 2D
8.4 PUF Vulnerability Analysis
Hamming (Gassend et al. 2002), BCH (Suh
et al. 2007; Mathew et al. 2016), two-stage
Existing PUF solutions suffer from various
ECC (B€ osch et al. 2008), soft-decision ECC
limitations that have limited their adoption in
(Maes et al. 2009a, b), Index-Based Syndrome
260 A. Alvarez and M. Alioto

Fig. 8.11 Block diagram of an improved PUF that utilizes ECC to improve the PUF reliability (Gassend et al. 2002)

Fig. 8.12 Possible circuits for runtime error detection: (a) glitch detector from (Karpinskyy et al. 2016); (b) dark bit
masking from (Satpathy et al. 2014)

real products. Indeed, to date there are only a few of output statistics and randomness. Hence, they
PUFs available in the market, such as (ICTK, typically require multiple silicon runs to reliably
Co. Ltd. 2014; Intrinsic-ID 2016; Invia PUF assess a given design. Glitch based PUFs are not
2016; QuantumTrace 2013; Verayo Inc. 2013). yet mature, and are well known to be rather
For example, delay- and metastability-based unstable and complex. Leakage-based PUFs are
PUFs are very sensitive to voltage/temperature sensitive to environmental variations and need
variations, aging and noise (Maes et al. 2012), extra circuitry for voltage and current biasing,
and are very hard to verify at design time in terms which are prohibitively costly in terms of area,
8 Security Down to the Hardware Level 261

Fig. 8.13 Attacks to PUFs


versus level of invasiveness
and level of abstraction

energy and design effort. Memory-based PUFs attacks). Delay-based PUFs are prone to these
are strongly technology dependent and hence not types of attacks. In (Rührmair et al. 2013), for
technology-portable, and suffer from bit flipping example, they were able to show more than 95%
and data remanence (Eiroa et al. 2012). DRAM prediction rate using machine learning to model
error maps are well known to have obvious cor- the arbiter and ring oscillator PUFs.
relation between different responses, thus drasti- For identification purposes, a “strong PUF”
cally weakening the unpredictability of with large number of CRPs is clearly needed to
responses (Rosenblatt et al. 2013). Both delay- limit the effectiveness of man-in-the-middle
and supply network-based PUFs are very vulner- cryptanalytic attacks, but unfortunately all prac-
able to modeling attacks (Rührmair et al. 2013). tically viable strong PUFs are very vulnerable to
The trustworthiness of a PUF is defined by its modeling attacks (Sadeghi and Naccache 2010;
resistance to attacks that aim to impersonate, Helinski et al. 2009), and hence unsuitable for
replicate or recover portions of the PUF bits. moderate-to-high levels of security.
These attacks can be active (injecting fault into Side-channel attacks aim to identify the PUF
the design) or passive (simply observing). They key employed by the chip through its correlation
can also be classified as invasive (meaning with the measured power consumption (e.g.,
requiring depackaging the chip to see the design DPA (Kocher et al. 1999), correlation attacks
or probe internal signals) or non-invasive. The (Brier et al. 2004) and Leakage Power Analysis
most representative attacks to PUFs are (Alioto et al. 2014, 2010a; Giorgetti et al. 2007)
summarized in Fig. 8.13, from the least to the or electromagnetic emissions (Mangard et al.
most invasive. Modeling attacks are passive 2007)). These attacks are performed on the exe-
non-invasive, and involve only the observation cution of the cryptographic algorithm that uses
of transmitted information and successive trials the key that the attacker is trying to retrieve.
to impersonate the device by leveraging the small Differential Power Analysis (DPA) was pro-
search PUF key space or poor randomness, or by posed in 1998 (Kocher et al. 1999). Figure 8.14
recording and reusing previous CRPs (Sadeghi shows a sample trace of the whole DES opera-
and Naccache 2010) (e.g., man-in-the-middle tion, where the 16 repetitive pulses correspond to
262 A. Alvarez and M. Alioto

4.25

Current (mA)
4.0
3.75
3.5
3.25
3.0
2.75

0 0.8 1.6 2.4 3.2 4.0 4.8 5.6 6.4 7.2 8.0
Time (mS)

Fig. 8.14 SPA traces showing 16 rounds of DES

7.0 Jump Taken


6.0
5.0
4.0
3.0
Current (mA)

2.0
1.0
0.0
6.0 Jump Not Taken
5.0
4.0
3.0
2.0
1.0
0.0
1 2 3 4 5 6 7
Time (in 3.5714MHz clock cycles)

Fig. 8.15 SPA single round trace showing difference in power consumption

the 16 rounds of DES. Measurements are required trials. For each trial, the cipher’s power
performed during encryption (or decryption) consumption estimated under two key bit guesses
under different plaintexts, in order to see the is compared, or correlated to the actual power.
difference in power consumptions in the different The difference of two traces (see, e.g., Fig. 8.15)
stages of the algorithm. A closer view of the has a spike in cycle 6 when such key bit is used in
same pulse in Fig. 8.14 could also reveal more the algorithm execution, and is almost zero else-
about the data, as shown in Fig. 8.15. From the where. Similar procedure is applied to specific
figure, the difference in power consumption in operations in the algorithm in order to help iden-
cycle 6 is due to a jump instruction (where jump tify whether the initially assumed key is correct
is taken in the top plot and not taken in the or not. A power model for DPA attacks on
bottom plot). The idea of DPA is to retrieve the symmetric-key cryptographic algorithms imple-
key of the cipher using divide and conquer mented using static logic was proposed in (Alioto
approach by guessing portions of the key, et al. 2010b). The model predicts the effective-
thereby breaking the exponential complexity of ness of DPA attacks and the conditions for which
deciphering the key and reducing the number of the circuit becomes vulnerable to these attacks.
8 Security Down to the Hardware Level 263

As DPA attacks target the cryptographic core would be needed to meet a given level of trust-
rather than the PUF itself, one way to prevent worthiness, with each being allocated to the
them is to mask the power consumption of dif- appropriate level of abstraction to meet a security
ferent operations. This can be done through a level target at low cost.
change in the algorithm, masking data (Canright
et al. 2008), or resorting to codeword encoding
(Merli et al. 2013), among others. The use of 8.5 Novel Concept of PUF-
different logic styles, such as the sense amplifier Enhanced Cryptography,
based logic (SABL) (Tiri et al. 2002) and dual- Trends and Perspectives
rail pre-charge (DRP) circuits, such as wave dif-
ferential dynamic logic (WDDL) (Tiri et al. Despite the recent and broad interest in PUFs,
2004), masked dual-rail pre-charge logic they have made a very limited impact on real
(MDPL) (Popp et al. 2005) and dual-rail random applications to date due to several challenges
switching logic (DRSL) (Chen et al. 2006), was that seriously hinder PUF trustworthiness, as
shown to be effective in masking operations by above mentioned and summarized below:
maintaining almost the same power consumption
regardless of the operation and processed data • PUF responses can be very unstable (i.e., not
(Schaumont et al. 2007; Monteiro et al. 2011). repeatable), thus requiring a large cost in
Semi-invasive attacks (e.g., fault attacks) aim terms of testing time (post-silicon masking,
to interfere with the circuit operation by PUF hardening), area or energy overhead
introducing glitches and injecting faults that (to suppress unstable bits at design time, and
expose data that would otherwise be securely to include expensive post-processing blocks).
processed internally. Fault injection and timing Additional post-processing circuits also make
attacks can be avoided by consistency checking, PUF more vulnerable to physical attacks (e.g.,
at the expense of additional runtime and/or area. side-channel, probing), as they become part of
Laser scanning was used in (Holcomb et al. the PUF itself, and hence introduce additional
2009) and (Nedospasov et al. 2013) to read out backdoors to the PUF
the state values of memory elements. • there is no available methodology to system-
Invasive attacks aim to physically observe atically assure a level of security (e.g., bit
(e.g., reverse engineering) or modify the chip randomness, stability) at design/verification
physical implementation (e.g., probing, fibbing), time. This prohibits the provability of PUF
and can be counteracted through secure trustworthiness at design time, requires
coprocessors (Smith and Weingart 1999). This repeated silicon runs to converge to a targeted
solution, however, is very expensive both in and provable security level, and drastically
terms of area and energy, as the coprocessor has prolongs the design cycle and time to market
to be powered at all times. The work in (Wan • existing solutions only target a specific type of
et al. 2015) proposes to counteract invasive security threat and level of abstraction (mostly
attack by laying out metal wires on top of trans- circuit level). Hence, they cannot address the
mission lines to switch capacitor circuitry. By security challenges posed by different types of
doing this, the capacitance of the sampling attacks, and do not introduce across-level
capacitor changes during invasive attacks, solutions (e.g., from PUF layout to architec-
making the PUF output invalid. ture and software)
The targeted level of trustworthiness defines • the PUF design is essentially a manual design
an adequate set of attacks that need to be process, which prohibits design automation
counteracted in a PUF design, although to date and technology portability, and entails very
only individual and fragmented techniques have low design productivity
been proposed. As research challenge in the area • specifically for IoT nodes, constant node-to-
of PUFs, a comprehensive set of techniques node communications and data transmission
264 A. Alvarez and M. Alioto

Fig. 8.16 (a) PUF as key generator, (b) crypto-core-based strong PUF

requires regular authentication. This Instead, a more efficient security scheme is


translates into and unacceptably large number introduced in Fig. 8.18. In this “PUF-enabled
of CRPs and therefore PUF capacity (see node-to-node communication” scheme, secure
Table 8.1). Public-key cryptography to estab- PUF key exchange is enabled at the authentica-
lish trust cannot mitigate such problem, due to tion phase through cryptography. After one-time
its prohibitive area and energy cost in IoT authentication, both nodes can communicate
nodes. with each other securely through encryption and
decryption using the exchanged keys, and with-
So far, we have treated PUFs as secure ran- out server assistance (therefore not needing a
dom key generators for chip identification large CRP database). This makes communication
through conventional challenge-response pairs, over complex networks scalable, as the database
as illustrated in Fig. 8.16a. A more promising is involved only at the first communication
approach implements a strong PUF through a between nodes. As can be seen in the figure,
crypto-core (e.g., AES) using the PUF key as node-to-node communication is simplified
encryption key, and then treating input/output through the joint use of PUF and cryptography,
values as CRPs (Bhargava and Mai 2014), as which permit to securely exchange keys over an
illustrated in Fig. 8.16b. In (Bhargava and Mai insecure channel, and avoiding the very energy-
2014), the introduction of AES increased the area and area-hungry public-key cryptography. Such
by 4.6 compared to the PUF array, but interesting and synergistic use of PUFs and cryp-
increased the number of available CRPs expo- tography is here introduced and named “PUF-
nentially. By using an AES design that is enhanced cryptography”.
designed specifically for IoT applications (Zhao Another interesting ramification of PUF-
et al. 2015), the power consumption of AES is enhanced cryptography is the ability to substan-
well below 1 μW and the area cost is reduced by tially strengthen the security of a crypto-core
3, thus becoming very affordable for the same against cryptanalytic attacks, by appropriately
exponential increase in CRPs. embedding a PUF into it. As illustrated in
For IoT node-to-node communications, the Fig. 8.19, PUF-enhanced cryptography goes
concept of combining a PUF with a crypto-core beyond the traditional scheme of securely storing
can also be used to reduce the circuit complexity a single crypto-key, and permits to extend the
and energy required for continuous authentica- crypto-key compared to the size imposed by the
tion, thereby reducing the required PUF capacity crypto-algorithm, thus making it stronger against
at a given level of security. Conventional node- cryptoanalytic attacks. Traditionally, key exten-
to-node communication is illustrated in Fig. 8.17, sion is not possible since its length is dictated by
where CRPs are used to authenticate both nodes the encryption standard. However, in PUF-
each time data is transferred between them. enhanced cryptography, a PUF with capacity
8 Security Down to the Hardware Level 265

Fig. 8.17 Conventional node-to-node data transfer through server, which needs to constantly assist the two nodes
during their communications

larger than the key is used to generate repeatable machine, which generates time-varying challenges
but unpredictable new keys that are combined with to a PUF, or a lightweight cipher itself (Shiozaki
the conventional user key to generate the fixed et al. 2015). As a result, as opposed to the tradi-
length enhanced key used by the on-chip crypto- tional scheme that uses a single private key, the
core. To this aim, the key enhancer in Fig. 8.19 is PUF-enhanced cryptography scheme in Fig. 8.19
introduced to dynamically concatenate the user actually uses a larger set of keys, whose number is
and PUF keys, and then compress them into the basically limited by the desired PUF capacity.
pre-defined length. Although the key enhancer in From an attacker point of view, guessing the
Fig. 8.19 is shown to be outside the crypto-core private crypto-key of a typical cryptography sys-
(i.e., without interfering with conventional opera- tem requires an effort that is (exponentially)
tion), it can also extend to the inside of the latter, defined by the size of the single key size. Instead,
and operate across several blocks of plaintext. The in the PUF-enhanced cryptography scheme in
encryption sequence is initialized by the user key, Fig. 8.19, the search space for the crypto-key is
and then managed by a key enhancer. The key enlarged by the capacity of the PUF, thus easily
enhancer can likewise be a simple finite state making the key search unfeasible even under
266 A. Alvarez and M. Alioto

Fig. 8.18 PUF-enabled key exchange and node-to-node communication

Fig. 8.19 The new concept of PUF-enhanced cryptography (key is continuously enriched with PUF key).

very powerful equipment and computing the security of an existing algorithm with (1) lim-
resources. In practical cases, the PUF-enhanced ited area cost, thanks to the exponential increase
cryptography permits to drastically strengthen of the size of the key search space, under PUF
8 Security Down to the Hardware Level 267

capacity extension, and (2) no throughput pen- Acknowledgement The authors acknowledge the kind
alty, since the generation of the PUF output is support by the MOE2016-T2-1-150 grant from the
Singaporean Ministry of Education.
generally much faster than encryption. When
using PUFs like in (Alvarez et al. 2016), the
latter property is enabled by the intrinsically
high speed of the PUF architecture, since PUF References
bits are always available at the output and only
need to be routed to the circuitry that consumes M. Alioto, L. Giancane, G. Scotti, A. Trifiletti, Leakage
power analysis attacks: a novel class of attacks to
them.
nanometer cryptographic circuits. IEEE Trans.
The above mentioned dynamic change of the Circuits Syst. I 57(2), 355–367 (2010a)
key over time is a tool to improve the strength of M. Alioto, M. Poli, S. Rocchi, Differential power analysis
PUF-enhanced cryptography against crypto- attacks to precharged buses: a general analysis for
symmetric-key cryptographic algorithms. IEEE
analytic attacks. In the case of IoT devices rely-
Trans. Dependable Secur. Comput. 7(3), 226–239
ing on energy harvesting, changing keys (2010b)
becomes a necessity as dictated by the availabil- M. Alioto, S. Bongiovanni, M. Djukanovic, G. Scotti,
ity of supply. For example, in (Aysu and A. Trifiletti, Effectiveness of leakage power analysis
attacks on DPA-resistant logic styles under process
Schaumont 2016) key generation is divided into
variations. IEEE Trans. Circuits Syst. I Regul. Pap.
several phases and precomputation is done when- 61(2), 429–442 (2014)
ever supply available, and intermediate results A. Alvarez, W. Zhao, M. Alioto, 15 fJ/bit static physically
are stored, for use in the next phase. unclonable functions for secure chip identification
with <2% native bit instability and 140x inter/intra
In summary, PUF-enhanced cryptography
puf hamming distance separation in 65 nm. IEEE Int.
permits to drastically enhance the security of a Solid-State Circuits Conf. 5, 256–258 (2015)
crypto-core by leveraging its synergy with a A.B. Alvarez, W. Zhao, M. Alioto, Static physically
PUF, to generate time-varying crypto-keys unclonable functions for secure chip identification
with 1.9–5.8% native bit instability at 0.6–1 V and
instead of having a fixed one. In addition, the
15fJ/bit in 65 nm. IEEE J. Solid State Circuits 60(5),
adoption of such PUF to enhance the crypto- 1–4 (2016)
algorithm also permits to easily scale up the A. Aysu, P. Schaumont, Precomputation methods for
level of security on demand. Indeed, the level hash-based signatures on energy-harvesting platforms.
IEEE Trans. Comput. 65(9), 2925–2931 (2016)
of security defines the number of PUF words
M. Bhargava, K. Mai, An efficient reliable PUF-based
that are needed, and hence it only affects the cryptographic key generator in 65 nm CMOS. Design
periodicity of the key enhancer for a given PUF Autom. Test Europe Conf. Exhibition 1, 1–6 (2014)
capacity. Also, the PUF unambiguously M. Bhargava, C. Cakir, K. Mai, Attack resistant sense
amplifier based PUFs (SA-PUF) with deterministic
authenticates the die that the crypto-core runs
and controllable reliability of PUF responses, in
on. In addition, the addition of a PUF to a IEEE International Symposium on Hardware-
crypto-core generally entails a very small energy Oriented Security and Trust (HOST) (2010)
overhead, as the energy per bit of a PUF is pp. 106–111
C. B€osch, J. Guajardo, A.R. Sadeghi, J. Shokrollahi,
typically two to three orders of magnitude
P. Tuyls, Efficient helper data key extractor on
smaller than a crypto-core. Very similar FPGAs. Lect. Notes Comput. Sci. (including Subser.
considerations hold for the area efficiency. Lect. Notes Artif. Intell. Lect. Notes Bioinformatics)
These features are particularly interesting in the 5154 LNCS, 181–197 (2008)
E. Brier, C. Clavier, F. Olivier, Correlation power analy-
context of the Internet of Things, as they make
sis with a leakage model, in Cryptographic Hardware
crypto-algorithms and crypto-cores affordable in and Embedded Systems (2004), pp. 16–29
terms of area and energy, thus enabling continu- D. Canright, L. Batina, A very compact ‘perfectly
ous and ubiquitous security. When a much higher masked’ S-box for AES, in Lecture Notes in Computer
Science (2008), pp. 446–459
level of security is occasionally needed, the PUF
Z. Chen, Y. Zhou, Dual-rail random switching logic: a
enhancement permits to further scale it up at a countermeasure to reduce side channel leakage, in
very low area/energy cost. Cryptographic Hardware and Embedded Systems
(CHES) (2006), pp. 242–254
268 A. Alvarez and M. Alioto

B.D. Choi, T.W. Kim, D.K. Kim, Zero bit error rate ID Security and Trust (HOST) (2008), no. 71369,
generation circuit using via formation probability in pp. 67–70
0.18 μm CMOS process. IET J. Mag. 50(12), 876–877 J.W. Lee, B. Gassend, G.E. Suh, M. van Dijk, S. Devadas,
(2014) A technique to build a secret key in integrated circuits
Y. Dodis, R. Ostrovsky, L. Reyzin, A. Smith, Fuzzy for identification and authentication applications, in
extractors: how to generate strong keys from Symposium on VLSI Circuits (2004), pp. 176–179
biometrics and other noisy data. SIAM J. Comput. J. Li, M. Seok, A 3.07 μm^2/bitcell physically unclonable
38(1), 97–139 (2008) function with 3.5% and 1% bit-instability across 0 to
S. Eiroa, J. Castro, M. Martı́nez-Rodrı́guez, E. Tena, 80  C and 0.6 to 1.2 V in a 65 nm CMOS, in IEEE
P. Brox, I. Baturone, Reducing bit flipping problems Symposium on VLSI Circuits, Digest of Technical
in SRAM physical unclonable functions for chip iden- Papers (2015), pp. 250–251
tification, in IEEE International Conference on Elec- D. Lim, J.W. Lee, B. Gassend, G.E. Suh, M. Van Dijk,
tronics, Circuits, and Systems (ICECS) (2012), S. Devadas, Extracting secret keys from integrated
pp. 392–395 circuits. IEEE Trans. Very Large Scale Integr. Syst.
D. Ganta, V. Vivekraja, K. Priya, L. Nazhandali, A highly 13(10), 1200–1205 (2005)
stable leakage-based silicon physical unclonable N. Liu, S. Hanson, D. Sylvester, D. Blaauw, OxID:
functions, in International Conference on VLSI on-chip one-time random ID generation using oxide
Design (2011), pp. 135–140 breakdown, in Symposium on VLSI Circuits (2010),
B. Gassend, D. Clarke, M. van Dijk, S. Devadas, Silicon pp. 231–232
physical random functions, in ACM Conference on K. Lofstrom, W.R. Daasch, D. Taylor, IC identification
Computer and Communications Security (CCS) circuit using device mismatch. IEEE Int. Solid-State
(2002), p. 148 Circuits Conf. 46(8), 1999–2000 (2000)
J. Giorgetti, G. Scotti, A. Simonetti, A. Trifiletti, Analysis M. Alioto, A. Alvarez, Physically Unclonable Function
of data dependence of leakage current in CMOS cryp- database (2016), http://www.green-ic.org/pufdb
tographic hardware, in Great Lakes Symposium on R. Maes, Physically Unclonable Functions: Construction,
VLSI (GLSVLSI) (2007), pp. 78–83 Properties and Applications (Springer, London, 2013)
J. Guajardo, S.S. Kumar, G. Schrijen, P. Tuyls, FPGA R. Maes, Physically unclonable functions : constructions,
intrinsic PUFs and their use for IP protection, in Lec- properties and applications. Katholieke Universiteit
ture Notes in Computer Science, ed. by P. Paillier, Leuven (2012)
I. Verbauwhede (Springer, Heidelberg, 2007), R. Maes, P. Tuyls, I. Verbauwhede, Intrinsic PUFs from
pp. 63–80 flip-flops on reconfigurable devices, in Workshop on
R. Helinski, D. Acharyya, J. Plusquellic, A physical Information and System Security (2008), no. 71369,
unclonable function defined using power distribution pp. 1–17
system equivalent resistance variations, in ACM/IEEE R. Maes, P. Tuyls, I. Verbauwhede, A soft decision helper
Design Automation Conference (2009), pp. 676–681 data algorithm for SRAM PUFs, in IEEE International
D.E. Holcomb, W.P. Burleson, K. Fu, Power-up SRAM Symposium on Information Theory (2009),
state as an identifying fingerprint and source of true pp. 2101–2105
random numbers. IEEE Trans. Comput. 58(9), R. Maes, P. Tuyls, I. Verbauwhede, “Low-overhead
1198–1210 (2009) implementation of a soft decision helper data algo-
ICTK, Co. Ltd. (2014), http://www.ictk.com/ rithm for SRAM PUFs, in Cryptographic Hardware
servicenproduct/puf and Embedded Systems (CHES) (2009), pp. 1–15
Intrinsic-ID, SRAM PUF: the secure silicon fingerprint, in R. Maes, V. Rozic, I. Verbauwhede, P. Koeberl, E. van
White Paper (2016) der Sluis V. can der Leest, Experimental evaluation of
Invia PUF IP (2016), http://invia.fr/infrastructure/physi physically unclonable functions in 65 nm CMOS, in
cal-unclonable-function-PUF.aspx European Solid State Circuit Conference (ESSCIRC)
B. Karpinskyy, Y. Lee, Y. Choi, Y. Kim, M. Noh, S. Lee, (2012), pp. 486–489
Physically unclonable function for secure key genera- S. Mangard, E. Oswald, T. Popp, Power Analysis Attacks:
tion with a key error rate of 2E-38 in 45 nm smart-card Revealing the Secrets of Smart Cards (Springer,
chips, in IEEE International Solid-State Circuits Con- New York, 2007)
ference (ISSC) (2016), pp. 158–160 S.K. Mathew, S.K. Satpathy, M.A. Anders, H. Kaul,
P. Kocher, J. Ja, B. Jun, Differential power analysis. Lect. S.K. Hsu, A. Agarwal, G.K. Chen, R.J. Parker,
Notes Comput. Sci. 1666, 388–397 (1999) R.K. Krishnamurthy, V. De, A 0.19pJ/b PVT-
O. K€ommerling, M.G. Kuhn, Design principles for variation-tolerant hybrid physically unclonable func-
tamper-resistant smartcard processors, in USENIX tion circuit for 100% stable secure key generation in
Workshop on Smartcard Technology (1999), pp. 9–20 22 nm CMOS. Digest Tech. Pap. - IEEE Int. Solid-
S.S. Kumar, J. Guajardo, R. Maes, G. Schrijen, P. Tuyls, State Circuits Conf. 2(c), 278–280 (2014)
The butterfly PUF protecting IP on every FPGA, in S. Mathew, S. Satpathy, V. Suresh, M. Anders, H. Kaul,
IEEE International Workshop on Hardware-Oriented A. Agarwal, S. Hsu, G. Chen, R. Krishnamurthy,
V. De, A 4fJ/bit delay-hardened Physically unclonable
8 Security Down to the Hardware Level 269

function circuit with selective bit destabilization in cryptographic applications. Natl. Inst. Stand. Technol.
14 nm trti-gate CMOS, in Symposium on VLSI 800–22(Rev 1a), 131 (2010)
Circuits (2016), pp. 248–249 A.-R. Sadeghi, D. Naccache (eds.), Towards Hardware-
D. Merli, F. Stumpf, G. Sigl, Protecting PUF error correc- Intrinsic Security: Foundations and Practice
tion by codeword masking (2013), pp. 1–16 (Springer, Berlin, 2010)
C. Monteiro, Y. Takahashi, T. Sekine, “Resistance against D. Samyde, S. Skorobogatov, R. Anderson, J.-J.
power analysis attacks on adiabatic dynamic and adi- Quisquater, On a new way to read data from memory,
abatic differential logics for smart cards, in Interna- in International IEEE Security in Storage Workshop
tional Symposium on Intelligent Signal Processing (2002), pp. 65–69
and Communication Systems (ISPACS) (2011), S. Satpathy, S. Mathew, J. Li, P. Koeberl, M. Anders,
pp. 1–5 H. Kaul, G. Chen, A. Agarwal, S. Hsu,
D. Nedospasov, J.P. Seifert, C. Helfmeier, C. Boit, Inva- R. Krishnamurthy, 13fJ/bit probing-resilient 250 K
sive PUF analysis, in Workshop on Fault Diagnosis PUF array with soft dark-bit masking for 1.94%
and Tolerance in Cryptography (FDTC) (2013), bit-error in 22 nm tri-gate CMOS,” in European
pp. 30–38 Solid State Circuit Conference (ESSCIRC) (2014),
R. Pappu, B. Recht, J. Taylor, N. Gershenfeld, Physical pp. 239–242
one-way functions. Science 297, 2026–2030 (2002) P. Schaumont, K. Tiri, Masking and dual-rail logic don’t
Z.S. Paral, S. Devadas, Reliable and efficient PUF-based add up, in Cryptographic Hardware and Embedded
key generation using pattern matching, in IEEE Inter- Systems (CHES) (2007), pp. 95–106
national Symposium on Hardware-Oriented Security G.-J. Schrijen, V. Van Der Leest, Comparative analysis of
and Trust (HOST) (2011), no. 978, pp. 128–133 SRAM memories used as PUF primitives, in Design,
T. Popp, S. Mangard, Masked dual-rail pre-charge logic: Automation & Test in Europe Conference & Exhibi-
DPA-resistance without routing constraints, in Cryp- tion (DATE) (2012), pp. 1319–1324
tographic Hardware and Embedded Systems (CHES) G. Selimis, M. Konijnenburg, M. Ashouei, J. Huisken,
(2005), pp. 172–186 H. De Groot, V. Van Der Leest, G.J. Schrijen, M. Van
D. Puntin, S. Stanzione, G. Iannaccone, CMOS Hulst, P. Tuyls, “Evaluation of 90nm 6T-SRAM as
unclonable system for secure authentication based on physical unclonable function for secure key genera-
device variability, in European Solid State Circuit tion in wireless sensor nodes, Proceedings of IEEE
Conference (ESSCIRC) (2008), pp. 130–133 International Symposium on Circuits Systems (2011),
QuantumTrace, LLC PUF IP Product (2013), http://www. pp. 567–570
quantumtrace.com/Products/IP/PUF%20IP/ M. Shiozaki, T. Kubota, T. Nakai, A. Takeuchi,
M.T. Rahman, D. Forte, J. Fahrny, M. Tehranipoor, T. Nishimura, T. Fujino, Tamper-resistant authentica-
ARO-PUF: an aging-resistant ring oscillator PUF tion system with side-channel attack resistant AES and
design, in Design, Automation & Test in Europe Con- PUF using MDR-ROM. in IEEE International Sympo-
ference & Exhibition (DATE) (2014), pp. 1–6 sium on Circuits and Systems (ISCAS) (2015),
S. Rosenblatt, D. Fainstein, A. Cestero, J. Safran, pp. 1462–1465
N. Robson, T. Kirihata, S.S. Iyer, Field tolerant P. Simons, E. Van Der Sluis, V. Van Der Leest,
dynamic intrinsic chip ID using 32 nm high-K/metal Buskeeper PUFs, a promising alternative to D Flip-
gate SOI embedded DRAM. IEEE J. Solid State Flop PUFs, in IEEE International Symposium on
Circuits 48(4), 940–947 (2013) Hardware-Oriented Security and Trust (HOST)
D. Roy, J.H. Klootwijk, N.A.M. Verhaegh, (2012), pp. 7–12
H.H.A.J. Roosen, R.A.M. Wolters, Comb capacitor S.W. Smith, S. Weingart, Building a high-performance,
structures for on-chip physical uncloneable function. programmable secure coprocessor. Comput. Networks
IEEE Trans. Semicond. Manuf. 22(1), 96–102 (2009) 31(8), 831–860 (1999)
U. Rührmair, F. Sehnke, J. S€ olter, G. Dror, S. Devadas, S. Stanzione, G. Iannaccone, Silicon physical unclonable
J. ürgen Schmidhuber, Modeling attacks on physical function resistant to a 10^25-trial brute force attack in
unclonable functions, in Proceedings of ACM Confer- 90 nm CMOS, in Symposium on VLSI Circuits (2009),
ence on Computer and Communications Security pp. 116–117
(2010), pp. 237–249 Y. Su, J. Holleman B. Otis, A 1.6pJ/bit 96% stable chip-
U. Rührmair, J. S€olter, F. Sehnke, X. Xu, A. Mahmoud, ID generating circuit using process variations, in
V. Stoyanova, G. Dror, J. Schmidhuber, W. Burleson, Digest of Technical Papers - IEEE International
S. Devadas, PUF modeling attacks on simulated and Solid-State Circuits Conference (ISSCC) (2007),
silicon data. IEEE Trans. Inf. Forensics Secur. 8(11), pp. 406–408
1876–1891 (2013) G.E. Suh, S. Devadas, Physical unclonable functions for
A. Rukhin, J. Soto, J. Nechvatal, M. Smid, E. Barker, device authentication and secret key generation, in
S. Leigh, M. Levenson, M. Vangel, D. Banks, ACM/IEEE Design Automation Conference (2007),
A. Heckert, J. Dray, S. Vo, A statistical test suite for pp. 9–14
random and pseudorandom number generators for
270 A. Alvarez and M. Alioto

G.E. Suh, C.W. O’Donnell, S. Devadas, Aegis: a single- T. Xu, J.B. Wendt, M. Potkonjak, Matched digital PUFs
chip secure processor. IEEE Des. Test Comput. 24(6), for low power security in implantable medical
570–580 (2007b) devices, 2014 I.E. International Conference on
D. Suzuki, K. Shimizu, The glitch PUF: a new delay-PUF Healthcare Informatics (2014), pp. 33–38
architecture exploiting glitch shapes, in Workshop on K. Yang, Q. Dong, D. Blaauw, D. Sylvester, A physically
Cryptographic Hardware and Embedded Systems unclonable function with BER < 10^-8 for robust
(CHES) (2010), pp. 366–382 chip authentication using oscillator collapse in 40 nm
K. Tiri, M. Akmal, I. Verbauwhede, A dynamic and CMOS, in IEEE International Solid-State Circuits
differential CMOS logic with signal independent Conference (ISSCC) (2015), pp. 254–256
power consumption to withstand differential power M.M. Yu, S. Devadas, Secure and robust error correction
analysis on smart cards, in European Solid-State for physical unclonable functions. IEEE Des. Test
Circuits Conference (ESSCIRC) (2002), pp. 403–406 Comput. 27(1), 48–65 (2010)
K. Tiri, I. Verbauwhede, A logic level design methodol- M.M. Yu, D.M. Raihi, R. Sowell, S. Devadas, Lightweight
ogy for a secure DPA resistant ASIC or FPGA imple- and secure PUF key storage using limits of machine
mentation, in Design, Automation & Test in Europe learning, in Workshop on Cryptographic Hardware and
Conference & Exhibition (DATE) (2004), pp. 246–251 Embedded Systems (2011), pp. 358–373
P. Tuyls, G.-J. Schrijen, B. Škorić, J. van Geloven, M.M. Yu, R. Sowell, A. Singh, D.M. Raihi, S. Devadas,
N. Verhaegh, R. Wolters, Read-proof hardware from Performance metrics and empirical results of a PUF
protective coatings, in Cryptographic Hardware and cryptographic key generation ASIC, in IEEE Interna-
Embedded Systems (CHES) (2006), pp. 369–383 tional Symposium on Hardware-Oriented Security
Verayo Inc. (2013), http://www.verayo.com/tech.php and Trust (HOST) (2012), pp. 108–115
M. Wan, Z. He, S. Han, K. Dai, X. Zou, An invasive- W. Zhao, Y. Ha, M. Alioto, Novel self-body-biasing and
attack-resistant PUF based on switched-capacitor cir- statistical design for near-threshold circuits with ultra
cuit. IEEE Trans. Circuits Syst. I 62(8), 2024–2034 energy-efficient AES as case study. IEEE Trans. VLSI
(2015) Systems 23(8), 1390–1401 (2015)
Design Methodologies for IoT Systems
on a Chip 9
David Flynn, James Myers, and Seng Toh

This chapter addresses the approaches and rent in between periodic sensing, data processing
methodologies appropriate to energy-constrained and data transmission: this is annotated
SoC design, implementation and verification STANDBY in the figure, and in many systems
using standard multi-voltage Electronic Design this is the predominant impact on battery life.
Automation tools, rather than resorting to full-
custom circuit approaches. The Physical-IP 1. The sensing activity is normally periodic and
libraries, memories and power-management triggered by some form of real-time sample
components required to address both active- request at a controlled data collection rate.
mode energy and deep-sleep state retention The height and width are conceptually
power are introduced, followed by a case study marked as the activities labeled SENSE in
addressing the specific challenges of optimizing the figure.
a micro-processor subsystem for Near- and 2. Some form of data processing step, such as
Sub-Threshold Voltage operation. As well as filtering, or anomaly or limit detection, is
system level power management the implemen- often initiated after a certain number of
tation and verification of clock distribution and samples have been buffered. The duration
system timing closure are covered in detail. and current profile may be data-dependent,
and is shown annotated as PROCESS.
3. A Wireless Sensor Node will typically pack-
9.1 Example Activity Profiles age or compress data to minimize the energy
for IoT Sensor Nodes required to transmit the data, and in many
systems the transmission time-slots are
For IOT “edge-nodes” such as Wireless Sensor pre-scheduled at specific times dependent on
Nodes the typical activity and power profile is the wireless access protocol and scheduler
shown in Fig. 9.1. (maybe in a sensor hub or base-station) at a
The height of the bar indicates the relative rate that is independent of the data-sampling
current consumed or power dissipated and the rate. This is shown with arbitrary power and
width is indicative of the duration. The system duration as TRANSMIT in the figure.
is typically optimized for minimum residual cur-
Regardless of the specific waveform ampli-
tude and activity profiles the key elements
D. Flynn (*) • J. Myers • S. Toh required to be minimized in design and imple-
ARM Ltd, Cambridge, UK mentation are:
e-mail: David.Flynn@arm.com

# Springer International Publishing AG 2017 271


M. Alioto (ed.), Enabling the Internet of Things, DOI 10.1007/978-3-319-51482-6_9
272 D. Flynn et al.

Fig. 9.1 Example active and standby profile for wireless sensor node

• Leakage and state retention energy—the inte- • The PMOS and NMOS power gates are
gration of power over time for the Standby optimized for ION/IOFF ratio, but the series
component when the main circuitry is on-resistance typically impacts circuit perfor-
inactive. mance despite the off-current savings so in
• Dynamic energy consumption when clocks industrial usage only one is typically used.
are enabled to specific components required Footer NMOS power gating is shown in (b).
for active computation or communication.
• Peak active current—which is often the limit- To mitigate peak current inrush when turning
ing factor in small on-chip power regulation on the power gate there are various ways to build
schemes. resistive turn-on networks but one approach is to
• Both active and leakage power consumption support a threshold-voltage drop using a PMOS
for the “always-on” circuitry such as the timer transistor in the case of footer-switched VVSS
or Real-Time Clock (RTC) that provides the rail, as shown in (c). A logic-0 drive on both
wake-up event scheduling. PWR and nDROWSE controls enables this
mode of operation.

9.2 Static Power Reduction 9.2.2 Power Gating and Well-Bias

9.2.1 Power Gating In the case of full MTCMOS power gating, as


shown in Fig. 9.3a with the P- and N-wells are
The primary technique for static power reduction explicitly annotated, the VVDD and VVSS vir-
is power gating (Mutoh et al. 1996), which is tual switched rails collapse towards a mid-rail
well supported in Multi-Voltage EDA tools. voltage with symmetric reverse bias to the switch
Figure 9.2 illustrates the theory and practice: P- and N-channel logic transistors. With the addi-
tion of “drowsy” threshold-voltage transistors it
• Early academic research focused on Multi- is possible to provide a mode which holds the
Threshold CMOS power gating, MTCMOS logic sub-threshold with symmetric well bias
(a), where high-threshold “header” and (DROWSE¼1, nDROWSE¼0) and can support
“footer” switches are added to create gated quick wake from sleep with 3 lower wake
“virtual” VDD and VSS rails, labeled VVDD energy (Mistry et al. 2014).
and VVSS. The logic is powered when PWR This is a special case of well-bias for standby
control to footer is logic-1 and nPWR to which is simple to implement without requiring
header switch is logic-0. multi-dimensional standard cell characterization
9 Design Methodologies for IoT Systems on a Chip 273

Fig. 9.2 Power gating (a) MTCMOS, (b) footer, (c) “drowsy” scaling

Fig. 9.3 (a) Power gating


and well-bias, (b) drowsy
rail power gating

as needed for forward-body-bias active modes. 9.2.3 Boosted-Gate Switches


While traditional reverse-body-bias techniques for Low-Voltage Power Gating
are effective in older bulk process nodes it is a
challenge in smaller IoT designs to generate As discussed in Sect. 5.5, in near- and
boosted P-well and N-well voltages without sub-threshold designs the voltage headroom to
expending more active power than the leakage drive power gate controls with logic-level volt-
that is saved. age swings results in compromised ION switch
274 D. Flynn et al.

Fig. 9.4 (a) BG-CMOS footer power gating, (b) footer power gating cell

behavior. An effective technique to ensure highly but signals at the boundaries collapse to
effective power gating for reduced voltage logic non-logic levels. In order to prevent crow-bar
rails is to use high threshold power gates with currents flowing in logic down-stream of a
boosted gate voltage (Stan 1998). An optimal power gated block or region, specialized isola-
implementation can be achieved by building the tion or clamp cells are provided which are
control buffering and the footer power gates with powered from the global VDD and VSS rails
Thick-Gate-Oxide (TGO) devices which can be and provide the equivalent of AND- or OR-
operated from the unregulated battery (or super- gate signal clamping. IEEE1801 power intent
capacitor in the case of battery-less systems) supports explicit association of high or low iso-
voltage rail, shown as VBAT in Fig. 9.4a. This lation signals to interface nets (IEEE Standard
requires care in the implementation flow to han- for Design and Verification of Low Power
dle the extra higher-voltage rail distribution, but Integrated Circuits). Example cells are shown in
this is all low-current from a control signal per- Fig. 9.5.
spective and results in highly effective
distributed sub-circuit power gating.
Figure 9.4b also shows an example of the
9.2.5 State Retention with Power
standard cell abstraction used for power gates
Gating
that can be cleanly deployed in EDA implemen-
tation flows. The switch shares the VDD supply
Power gating provides leakage power savings by
row architecture but connects the standard-cell
switching off a sub-circuit, but any state is lost
ground rail as a switched virtual VVSS track.
and the sub-circuit needs to be reset or
The global VSS supply is via-ed down to the
reinitialized after the power is turned back
power gate from the thick-metal ground mesh,
on. For circuits such as fixed function
and the switch is laid out as multiple fingers of
accelerators with only transient state this is usu-
switch where device length and width are tuned
ally acceptable. But in many cases the loss of all
for best ION/IOFF ratio.
current state is too costly such that either the
architectural state must be saved away before
9.2.4 Clamping and Isolation power gating and restored after power is turned
on (which may have considerable latency and
Power gating provides effective leakage current energy cost), or the state is preserved in-place
reduction when sub-circuits are powered down, and maintained for a much smaller leakage cost.
9 Design Methodologies for IoT Systems on a Chip 275

Fig. 9.5 (a) Clamp-low


cell, (b) clamp-high cell

which can be either fully on (PWR_R¼1), drowsy


voltage scaled (PWR_R¼0, nDROWSE_R¼0) or
fully off (PWR_R¼0, nDROWSE_R¼1). Careful
validation of the voltage scaled retention reliabil-
ity must be evaluated for the technology used, but
this can provide valuable retention current savings
for designs with a large number of registers.
Figure 9.7b illustrates a further optimization
where only the slave latch portion of the master-
slave flip-flop register is retained, while the input
stage, master latch and output driver are all
powered from the switched standard-cell VVSS
rail, but a separate drowsy-voltage-scaled reten-
tion virtual ground rail is provided to the slave
Fig. 9.6 (a) Always-on register, (b) independently
latch only (Flynn et al. 2012). While retention
gated state retention
registers and power gating have been shown to
save 95% of the standby leakage power, drowsy
Figure 9.6 illustrates the basic approach of either retention can achieve a further 50% reduction.
providing “always-on” power to register state In both cases there is a need to add minimal
(not supplied from the gated virtual standard clamping circuitry around the register or slave
cell rails) in (a), or adding a second gated reten- latch to protect this for floating inputs, especially
tion rail “VVSS_R” which can be independently clocks and resets, and this can be added to the
controlled by the PWR_R control signal to allow cell or supported across a group of registers con-
state to be retained for certain periods and turned trolled by a “retention isolation” signal.
off fully for deep sleep modes.

9.2.7 Physical Considerations


9.2.6 State Retention with Power for Independently Gated State
Gating Optimizations Retention

To minimize the state retention currents incurred While power intent formats such as CPF and
in standby mode states when retaining state is UPF (IEEE 1801) support an arbitrary number
required there are optimizations that can be appro- of switched supplies, there is an assumption of a
priate depending on the voltage headroom and single default supply pair for each power domain.
register stability at scaled voltages (Kumagai Power aware EDA tools expect this to be routed
et al. 1998). Figure 9.7 shows the concept applied as the standard cell main rail. While always-on
to standard master-slave registers (a). A drowsy buffers (for buffering power gate, clamp or reten-
state virtual ground rail, labeled DVSS is shown tion controls) are a well understood exception
276 D. Flynn et al.

Fig. 9.7 (a) Drowsy voltage retention, (b) slave latch drowsy retention

Fig. 9.8 ARM Cortex-M0+ power domain cell placement highlighting power gates (a) retention power gates
implemented as top/bottom rows, (b) retention power gates implemented as columns with dedicated standard cell

where power supply effectively bypasses the prevent VVSS and VVSS_R from shorting
local power gates, this is more complex for inde- together. For smaller power domains this is best
pendently gated state retention as two or more implemented as top and bottom rows shown in
switches must be placed within a single floorplan Fig. 9.8a, but larger domains may also require
region. pairs of dedicated rows through the center to
The scheme illustrated in Fig. 9.6b can be reduce power grid voltage droop.
implemented in two ways. The main logic The other option is to use an alternate power
power gate is usually distributed throughout the gate cell, which does not drive onto the standard
floorplan as many small power gate cells, such cell main rail but a retention rail instead. This has
that the default VVSS for stateless logic can be the advantage of allowing distributed placement
driven onto the standard cell rails most effec- as with the logic power gate, shown as additional
tively. If the retention power gate is to be half-density columns in Fig. 9.8b.
implemented with the same library power gate In power domains with a high concentration
cell, then it must be confined to dedicated rows to of retention registers it may be beneficial to route
9 Design Methodologies for IoT Systems on a Chip 277

this retention rail across the entire design, In the figure a power-gating acknowledge sig-
although this consumes an entire routing track nal is shown which is valuable to ensure that the
and may not be feasible with the highest density timing required to power down and back up the
six or seven track standard cell libraries where switched network is handled correctly by design
pin access is a challenge. In other cases it may be (e.g. to avoid the condition when a wake up occurs
best to connect power gate VVSS_R directly up just after the power gating is turned off and power
to a power grid and use on-demand routing to is un-driven momentarily). Although comparators
connect down to individual standard cells as or Schmitt circuits may be used to assert this
required. PWR_ACK when the virtual rail has been charged
Note that the drowsy schemes illustrated in to a target voltage, in practice it is usually the
Fig. 9.7 intentionally short the outputs of the output of a delay line, synchronous counter, or
PMOS & NMOS footers, so the floorplan does power gate daisy-chain (Shi et al. 2006).
not increase in complexity beyond the switched Larger designs may suffer from high in-rush
retention case. currents and ground bounce when powering up
too quickly, which can cause corruption of
retained state or timing errors in active blocks.
9.2.8 Sequencing of Power Gating Analysis of this in-rush current is therefore an
Controls important sign-off step for low power designs
and is well supported by EDA tools. Where
Power gating typically has to be controlled by in-rush currents are found to be unacceptably
a state machine powered in a relatively high (a typical target is no greater than active
always-on voltage domain. The control signals peak currents, although tighter constraints may
that are required to drive the power gating and be required for some classes of design), the
isolation clamping described previously need most common mitigation approach is to stagger
to have explicit sequencing and these are the power gate enables, such that a fraction of power
ports that get connected to the IEEE 1801 gates are used to initialize the virtual rail more
inferred power controls (Keating et al. 2007). slowly. Drowsy power gates (as mentioned in
Figure 9.9 illustrates the standard sequencing Sect. 9.2.2) are another good option.
into power-gated mode and then waking up On a request to wake-up, power is requested,
back to active mode. On a request to sleep, and only when valid is the reset de-asserted, the
first the clock is stopped, then the reset isolation clamping turned off and the clock
asserted (not strictly necessary but maintains finally re-enabled. In the case of state retention
symmetry), the isolation clamping of outputs power gating the retain control signal timing is
is asserted and then the power-gating network often similar to the isolation NCLAMP control
turned off. waveform.

Fig. 9.9 Example clock, clamp, reset and power gate control sequencing
278 D. Flynn et al.

9.3 Active Power Reduction balancing attributes to EDA tools to support


clean static timing analysis and clock tree bal-
The management of active power is mainly ancing with such gated clocks.
focused on addressing terms in the familiar Such ICG elements can also be instantiated in
CMOS dynamic power proportionality to CV2F designs to support high-level architectural clock
equation. The capacitance term, C, is minimized gating where the designer can determine where
by striving to keep the circuit small and simple, clock segments can be individually gated explic-
balancing drive strength and keeping signal rout- itly at system level.
ing capacitance to a minimum. The frequency
term, F, is addressed by optimizing the circuit
implementation for peak required performance 9.3.2 Voltage Scaling
and then factoring in both architectural and
inferred clock gating to suppress dynamic Today’s production microcontrollers rarely sup-
power dissipation whenever possible. The volt- port dynamic voltage and frequency scaling
age term, V, is the most valuable control knob (DVFS) due to the complexities of lightweight
given the square-law contribution; in IoT OS/SW scheduling, interfacing to off-chip volt-
applications the primary focus is to work with age regulators, managing transition periods,
low voltage technology, with under-driven identifying optimal voltage-frequency pairs, lim-
super-threshold libraries and memories, or more ited super-threshold voltage headroom, and
specialized near-threshold or sub-threshold more. But it is clear that this will be a key area
robust physical IP. of improvement for future IoT edge-node
applications, enabled by integrated voltage
regulators (see Chap. 10) and the increased ver-
9.3.1 Clock Gating satility offered by near- and sub-threshold
designs.
EDA tools are able to provide transparent clock The physical IP to support this includes cell-
gating where common sub-expressions in the libraries that are optimized for the constrained
enable terms of synchronous logic are coded in voltage headroom. For near- and sub-threshold
a clean synthesizable style. Figure 9.10 illustrates operation this usually implies constrained cell
the basic scheme for determining groups of architectures, which avoid small length/width
registers that share an enable term where the devices and minimize transistor stack depths,
state is defined as sampling or re-circulating and register latch and memory bit-cells in partic-
values. The multi-bit registers are shielded from ular need to be designed with increased
the high-toggle-rate clock, marked in red by the robustness.
inference of a latch and AND gate structure that The only additional cells required are in the
suppresses clock pulses in cycles when the EN form of level-shifters that manage the voltage
term is inactive (Fig. 9.10b). Figure 9.10c shows drive from low voltages up to higher voltage
the cell abstraction for an Integrated Clock Gate domains or input/output interface drivers. There
(ICG) that provides the timing and clock are three types of interface with specialized level

Fig. 9.10 Clock gating


inference and abstraction
9 Design Methodologies for IoT Systems on a Chip 279

shifters for each: guaranteed low voltage input to power without sacrificing performance, espe-
high voltage output; guaranteed high voltage cially at the minimum energy point where leak-
input to low voltage output (which may be sim- age energy is strongly dependent on
ply re-characterized buffer cells to enable STA performance. This sub-chapter describes how
tools to estimate timing correctly); and interfaces conventional EDA flows may be adapted to
where input/output voltages scale independently achieve a minimum energy design. The impact
and may be high/low, low/high, or the same. of key decisions such as standard cell choice and
EDA tools can infer the appropriate level shifters clock design methodology on minimizing energy
from IEEE 1801 power intent definition and cell- and cost are also evaluated. Results in this sec-
library attributes. tion are derived from a 65 nm R&D
sub-threshold ARM® Cortex®-M0+ WSN
processing sub-system with prototype 300 mV
9.3.3 Wake-Up and Power physical IP.
Management Circuits

A special case of active circuits that need to be


9.4.1 Implementation Flow
always-on relative to the processing subsystems
are the power management state machines and
The majority of the implementation methodol-
wake-up sources such as Real-Time-Clock
ogy is unchanged from a standard EDA flow and
(RTC) alarms. RTC circuits are typically clocked
no custom tools are required at any stage. Power
as low as 32 KHz, which dramatically reduces
aware verification of the design is performed
dynamic power compared to other logic running
using a gate level simulator together with UPF
at MHz. Always-on leakage is still a concern and
power intent. The flow modifications identified
should be reduced by aggressive gate-count
in Fig. 9.11 will be covered in detail below, with
reduction and implementation with the lowest
the exception of design-for-test and placement
leakage devices available. Simple libraries of
steps that are not unusual for a highly power
TGO gates and registers can be beneficial here
gated design.
as these demonstrate up to 100 reduction in
leakage compared to regular threshold thin-
oxide devices (Taki et al. 2011). The most com-
pelling benefit of TGO libraries however is that 9.4.2 Energy Reporting
they can run directly from unregulated battery
voltages, thereby allowing all voltage regulators Energy is the most important metric in this
to be shutdown in deep sleep modes and saving design and needs to be reported in all optimiza-
regulator losses which can be significant under tion steps. The tools however only report power.
very light loading. The calculation is simple so custom reporting can
be implemented. Leakage energy/cycle is leak-
age power integrated over the minimal clock
9.4 Automated Minimum Energy period. The subtlety here is that increased leak-
Design age power is acceptable, if the corresponding
speedup is greater than or equal in magnitude.
Conventional EDA tools for automated synthe- Dynamic energy/cycle is simply dynamic power
sis, place, and route are usually optimized to divided by clock frequency. The libraries in this
produce designs with maximum performance or example were characterized at five voltages,
minimum power. Minimum energy design in which allow the majority of the voltage-energy
general requires achieving both minimum curve to be interpolated.
280 D. Flynn et al.

Fig. 9.11 EDA flow with


key modifications for
minimum energy design

9.4.3 VT Selection Leakage/


Performance Tradeoff

Regular VT (RVT) and low VT (LVT) gates


exhibit an 8 difference in performance when
operating at sub-threshold voltages while leak-
age power scales by almost 20 (Fig. 9.12). The
amplified performance difference deviates from
observed behavior at nominal voltages where
performance typically improves by 50% with a
10 leakage power increase by switching to a Fig. 9.12 Leakage vs. frequency comparison of various
lower VT choice. Traditional leakage recovery standard cell choices
flows that trade-off timing slack on each timing
path for leakage reduction are not effective in higher leakage. Using these MCK cells sparingly
sub-threshold design as the number of cell on the design helps improve performance,
swaps that can be taken on each path are limited. resulting in lower leakage energy.
Our front-end and back-end flows utilize only
RVT gates in order to minimize system leakage.
We also utilize an RVT mixed-channel kit 9.4.4 Cell Choices During
(MCK) which has higher performance gates, Optimization Flow
achieved by optimizing the gate lengths of the
transistors. The MCK library cells achieve an This design ends up with 9.27% of MCK cells
average 12% performance improvement at 3 (by area) when no constraints are placed on
9 Design Methodologies for IoT Systems on a Chip 281

Table 9.1 Comparison of unconstrained vs. 5% MCK


usage constraint
5% MCK
Unconstrained constraint
Normalized area 1 1.025
Normalized dynamic 1 1.060
energy
Normalized leakage 1 1.029
energy
Normalized performance 1 1.013
MCK cell area 9.27% 5.95%
Fig. 9.13 Leakage vs. frequency comparison of various
standard cell choices

percentage of MCK cell usage. MCK cell usage expected to require multi-corner optimization to
constraints applied during synthesis results in a ensure good performance across corners.
design that has higher area and energy A study was conducted into multi-corner
(Table 9.1). This is likely caused by the leakage setup optimization vs. single corner setup opti-
optimization algorithms in the synthesis tool that mization to determine whether multi-corner opti-
try as hard as possible to first implement the mization is an absolute necessity. Hold timing
design using high VT cells before enabling a was still optimized across all corners in these
minimal amount of low VT cells. This high effort experiments. Figure 9.14 presents the resulting
algorithm results in area increase and worse leakage and dynamic energy observed
dynamic power. We also experimented with (normalized to single corner setup at SS
ECO leakage recovery in timing signoff tools 1.08 V). Good correlation is observed between
but only observed a handful of cell swaps, the choice of setup corner and optimized leakage
resulting in miniscule leakage recovery. Based energy especially at voltages below 0.6 V. For
on these experiments, we decided to adopt a example, SS 0.54 V minimizes leakage energy
simple flow and enable unconstrained use of around 0.54 V while TT 0.3 V minimizes leakage
MCK cells beginning from synthesis. at 0.3 V. The better performance at the respective
voltages results in lower leakage energy, as leak-
age power is integrated over a shorter period of
time. Multi-corner setup optimization appears to
9.4.5 Optimization Corner Selection produce results that are close to minimum leak-
for Minimum Energy age energy across voltages because multi-corner
and Runtime Impact optimization strives to optimize performance
across all voltages. Figure 9.14 also plots the
Battery powered WSN are required to scale volt-
Synthesis runtime for the various single corner
age and frequency actively in order to meet
and multi-corner optimizations. Multi-corner
throughput or latency requirements as well as to
optimization incurs a 4 runtime over single
minimize energy during low periods of activity.
corner, which is quite reasonable considering
These circuits are therefore required to be func-
the various corners that are being considered
tional as VDD is scaled from nominal voltages
simultaneously.
down to sub-threshold voltages. Figure 9.13 plots
the relative frequency of a design at TT and SS
global process corners as supply voltage is
scaled. Frequency degrades by 4000 across 9.5 Clock Distribution
the span of operating voltages while performance
degrades by 5 between TT and SS corners. Clock distribution is extremely challenging in
DVFS across such wide operating conditions is sub-threshold voltage design due to increased
282 D. Flynn et al.

1.05
Dynamic
1.00
Relative Energy

0.95 SS 1.08V
SS 0.54V
0.90
TT 0.3V
0.85
Multi Corner
0.80
0.3 0.5 0.7 0.9 1.1
VDD (V)
1.10
Static
1.05
Relative Energy

1.00
Fig. 9.15 Comparison of single corner vs. multi-corner
optimization
0.95 SS 1.08V
0.90 SS 0.54V Monte Carlo analysis. The spread in delay
0.85 TT 0.3V increases as voltage is scaled down to
0.80 Multi Corner sub-threshold voltages because transistor perfor-
0.3 0.5 0.7 0.9 1.1
VDD (V)
mance is exponentially dependent on threshold
voltage. Foundries typically specify OCV derates
250 that are relevant to Vnom +/ 10%. Larger OCV
Runtime (minutes)

200 derates are required at sub-threshold voltages in


150 SS 1.08V order to margin for the worse variation. An OCV
SS 0.54V derate is derived from statistical data by
100
multiplying the worst observed sigma by 3.
TT 0.3V
50 Clocks are typically distributed using
Multi Corner
0 synthesized tree structures or more structured
Synthesis networks like H-trees (Jain 2012) or meshes
(SolvNet 2014). Synthesized clock trees are
Fig. 9.14 Comparison of power and runtime effects of
designed and optimized automatically by the
single corner vs. multi-corner optimization
tools but tend to exhibit worse performance com-
on-chip-variation (OCV), larger clock latency pared to more structured networks. Clock meshes
due to slow buffers, and the requirement for are quite attractive in sub-threshold design
minimizing energy. because they can be used to distribute the clock
signal across the entire chip without OCV
impact. Figure 9.16 illustrates the clock mesh
structure that was investigated. Clock meshes
9.5.1 OCV Characterization essentially distribute a clock source across an
entire portion of the design. This is usually
OCV impacts clock distribution by introducing achieved using a pre-tree. The outputs of the
variation in arrival times to registers and leaf nodes of the pre-tree are shorted together
memories. Margins are used to design for this on the clock mesh to reduce the skew of the
variation, resulting in performance degradation clock signal distributed by the pre-tree, creating
or data-path upsizing to meet performance an almost-ideal clock signal that spans the entire
targets. design area. The key to achieving best results
Figure 9.15 presents variation in delay from a clock mesh is to reduce the number of
through a chain of inverters across different gates between the mesh and the clock sinks. We
dies. This analysis was performed using SPICE used all clock gates to accomplish this final step,
simulation of an extracted netlist of inverters and forcing dummy clock gates on clock sinks that
9 Design Methodologies for IoT Systems on a Chip 283

Fig. 9.16 Clock mesh


structure

Fig. 9.17 Sub-system standard cell area floorplan, annotated with mesh structure

are not gated. The effective latency of this clock were placed in the middle section of the floorplan
structure is therefore only one gate. Figure 9.17 is because this area corresponds to a power domain
a layout view of the clock mesh with respective that is on in all active modes but can be switched
drivers indicated by respective symbols off during retention and power down modes. This
corresponding to Fig. 9.16. All pre-tree drivers ensures that the clock mesh is always alive
284 D. Flynn et al.

regardless of the power state combinations of the and ICG from the clock cell list reduced sigma to
system. Unfortunately, this also implies that the 9%. The area penalty incurred from not using X1
clock mesh is always consuming dynamic and cells is worth the reduction in sigma because the
leakage power. clock tree accounts for less than 1% of total area.
Sigma of clock arrival time is further reduced by
introducing tighter max-transition constraints on
the clock tree.
9.5.2 Clock Tree Synthesis
Automated clock tree synthesis algorithms are
Methodology
typically designed to improve clock-related
metrics (latency and skew) at a particular corner.
Clock trees are the easiest way to distribute
Reducing skew is usually accomplished by
clocks with minimal dynamic power and area.
adding additional gates to balance delays
Clock trees however suffer from more variability
between paths. We analyzed clock tree metrics
due to the lack of regularity in the structure. This
observed when using SS 1.08 V CTS corner
section demonstrates best practices for clock tree
vs. SS 0.54 V CTS corner (Table 9.2) to deter-
synthesis, especially targeting sub-threshold
mine which strategy produces the best results.
minimum energy clock trees.
Building the clock tree at SS 0.54 V provides
Transistor variability is inversely proportional
minimum clock skew at sub-threshold voltages,
to gate area. Low drive-strength clock cells are
compared to the SS 1.08 V CTS corner. This
therefore expected to exhibit larger variability.
tighter skew was achieved by padding with
Figure 9.18 plots results obtained from OCV
more clock cells, resulting in 2 larger area.
characterization of the clock tree. Clock arrival
This clock tree however does not scale well
sigma of up to 12% was observed in a clock tree
with voltage and exhibits a clock skew of 8.9%
that was not optimized (equating to 36% OCV
clock period when measured at the SS 1.08 V
derate!). Eliminating X1 drive inverters, buffers,
corner. The clock tree constructed at the SS
1.08 V corner results in a clock tree that exhibits
consistent skew across operating voltages and
minimizes area. We have also limited clock tree
fanout to 32 which minimizes interconnect delay
and helps ensure consistent scaling of all clock
endpoints across all operating conditions.
LVT clock cells present an interesting option
especially for sub-threshold design due to the
8 better performance (only ~50% better perfor-
mance is typically observed at nominal voltages).
These faster cells will result in up to 8 lower
clock latency, which reduces the impact of OCV
by the same magnitude. An LVT clock tree will
exhibit up to 10 higher leakage power than an
Fig. 9.18 Clock end-point variability across different RVT clock tree but the improvement in perfor-
optimizations (TT 0.4 V)
mance can potentially offset the leakage power

Table 9.2 Comparison of clock tree metrics synthesized at different CTS corners
Clock skew at respective corner as % of clock period
Target CTS corner SS 1.08 V SS 0.54 V TT 0.4 V Area (μm2) Depth
SS 1.08 V 1.5 2.5 2.1 1587 5–14
SS 0.54 V 8.9 0.9 1.2 3414 13–18
9 Design Methodologies for IoT Systems on a Chip 285

increase, resulting in a net reduction in leakage RVT tree because the transitions are much
energy. sharper, resulting in lower short-circuit current.
Figure 9.19 presents the dynamic energy and Leakage energy of the clock network however is
leakage energy of the clock distribution network almost 10 higher than the RVT tree. Table 9.3
implemented using different strategies. The LVT presents some additional metrics measured from
tree has lower dynamic energy compared to the our WSN sub-system implemented using differ-
ent clock strategies. The clock latency of the
25 LVT tree is 4% of the clock period while the
Clock Dynamic Energy
20 RVT tree latency is 18% of the clock period.
Energy (pJ)

15 Note that the LVT tree only achieves a design


RVT Tree
10 that is at-most 4% faster than the RVT tree, even
LVT Tree
5 though clock latency is much lower. The WSN
RVT Mesh
0 sub-system we have designed is too small to
0.3 0.5 0.7 0.9 1.1
VDD (V)
realize the benefits of an LVT clock tree with
lower latency. We expect larger designs, where
1
Clock Leakage Energy clock latency is a larger fraction of clock period,
0.1 to exhibit higher performance improvements
Energy (pJ)

RVT Tree from an LVT tree due to reduced impact of


0.01
LVT Tree OCV. Another thing to consider with LVT tree
0.001 RVT Mesh is the cost of the additional VT implant. This
0.0001 could potentially tip the scale in favor of an
0.3 0.5 0.7 0.9 1.1
VDD (V)
RVT tree especially in IOT systems where cost
1.05 is also important.
Frequency
Our analysis of an RVT clock mesh
Relative Frequency

1.04
1.03 implemented on the sub-system indicates that
RVT Tree
1.02 the clock mesh consumes significantly more
LVT Tree
1.01 dynamic energy than clock trees. This is because
RVT Mesh
1 a large portion of the clock structure is always
0.99 running and can never be gated. Leakage energy
0.3 0.5 0.7 0.9 1.1
VDD (V) of the mesh is slightly higher than an RVT tree
due to all the clock gates driving the final stage.
Fig. 9.19 Comparison of clock energy and frequency Effective clock latency of the mesh is compara-
across clock tree implementations
ble to an LVT tree. The OCV derate of the mesh

Table 9.3 Comparison of WSN sub-system energy and performance across clock implementations
TT 0.4 V 25C RVT tree LVT tree RVT mesh
Frequency (kHz) 835 865 847
Dynamic energy (pJ) 8.075 8.025 9.448
Leakage energy (pJ) 0.284 0.311 0.289
Total energy (pJ) 8.359 8.336 9.737
Clock latency (ns) 216 43 163
Effective latency (ns) 216 43 48.1
Clock skew (ns) 77.6 14.7 17.7
Total clock cell area (μm2) 5940 4999 (0.8% total cell area) 9640
Number of stages 5–13 7–15 1
OCV derate % (TT 0.4 V) 19.5 22.5 66.0
Clock leakage energy (pJ) 0.00436 0.0425 0.00640
Clock dynamic energy (pJ) 0.757 0.628 1.89
286 D. Flynn et al.

(applied only to final ICGs) is much larger than References


the trees due to the shallow effective clock depth
and poor transition on the clock mesh. The RVT D. Flynn, High performance state retention with power
clock mesh could be a lower cost alternative to a gating applied to CPU subsystems—design
approaches and silicon evaluation, HotChips24 2012
LVT tree especially for larger designs. Archives HC24.30.p10 (Poster) (2012)
IEEE Standard for Design and Verification of Low Power
Integrated Circuits, https://standards.ieee.org/findstds/
9.6 Perspectives and Trends standard/1801-2013.html
M. Keating, D. Flynn, A. Gibbons, R. Aitken, K. Shi, Low
Power Methodology Manual—For System-on-Chip
The current trend in the industry is towards Design (Springer, New York, 2007). ISBN 978-0-
enabling near-threshold SoCs to be easily and 387-71818-7
safely designed, implemented, verified and S. Jain, A 280 mV-to-1.2 V wide-operating-range IA-32
processor in 32 nm CMOS, in ISSCC, 2012
optimized. This is not an easy task however, K. Kumagai, H. Iwaki, H. Yoshida, H. Suzuki,
nor can it be done in isolation. IP providers will T. Yamada, S. Kurosawa, A novel powering-down
have to offer new logical and/or physical IP, scheme for low VT CMOS circuits, in VLSI Circuits,
EDA tools will need enhancement to support 1998. Digest of Technical Papers. 1998 Symposium
on, Jun 1998 (1998), pp. 44–45
energy optimization and handling of large J.N. Mistry, J. Myers, B.M. Al-Hashimi, D. Flynn,
variation, and silicon foundries will have to pro- J. Biggs, G.M. Merrett, Active mode subclock power
vide qualified models. The challenge is in gating. IEEE Trans. VLSI Syst. 22(9), 1898–1908
coordinating all of these elements, but there is a (2014)
S. Mutoh et al. A 1v multi-threshold voltage CMOS DSP
real desire and demand for progress, such that the with an efficient power management technique for
authors are confident that key barriers will be mobile phone applications, ISSCC1996 (1996),
overcome in the next few years. pp. 168–169
Looking beyond near threshold, it is clear K. Shi, D. Howard, Challenges in sleep transistor design
and implementation in low-power designs, in Design
that there are many innovative and exciting Automation Conference, 2006 43rd ACM/IEEE
approaches for optimized IoT designs (2006), pp. 113–116
(sub-threshold, adaptive systems, asynchro- M. Stan, Low-threshold CMOS circuits with low standby
nous, drowsy power gating, non-volatile logic, current, in Proceedings of the International Sympo-
sium on Low-Power Electronics and Design (IEEE/
etc. . .). Similarly to near-threshold there is often ACM, Monterey, CA, 1998), pp. 97–99
an IP & EDA barrier to the adoption of these SolvNet: Power Compiler I-2013.12 Update Training (ID:
cutting edge techniques. Unlike near-threshold 1506611) (Synopsys Inc., 2014)
however, there can also be an analysis barrier— D. Taki, T. Shiozawa, K. Ito, Y. Shiba, K. Horisaki,
H. Kajihara, T. Yamagishi, M. Sekiya, A. Yamaga,
the system-level cost/benefit tradeoffs of such T. Fujita, H. Hara, M. Kuwahara, T. Fujisawa,
techniques can be complicated to predict and Y. Unekawa, A 7uW deep-sleep, ultra low-power
model. Without progress in system-level explo- WLAN baseband LSI for mobile applications, in
ration and design methodology, otherwise very Cool Chips XIV, 2011 IEEE, 20–22 April (2011),
pp. 1–3
beneficial technology will continue to be
overlooked and fail to gain critical mass or
wide adoption.
Power Management Circuit Design
for IoT Nodes 10
D. Brian Ma and Yan Lu

This chapter addresses the fundamental and so on. By taking advantage of available
structures and operation principles of power information and powerful computing capacity of
management circuits that are commonly used in the internet, IoT brings significant social benefits
IoT applications. Following a brief discussion on to human life of civilization. Generally speaking,
system design considerations, the chapter any electronic device with sensing and/or com-
reviews the essential power circuit topologies munication ability can work as an IoT node. Apart
with discussions on key control and operation from well-studied devices such as mobile phones
principles, performance features and drawbacks. and computers, it is highly anticipated that count-
With the focus on IoT applications, state-of-the- less IoT nodes are low-power smart sensors and
art works and promising design trends are actuators. For reliability and cost concerns, these
addressed. devices are preferred to operate autonomously.
Hence, the energy used to power these nodes is
most likely scavenged from ambient environ-
10.1 System Design Overview ment, instead of traditional batteries. The main
reason is that many IoT nodes have stringent
Internet of things (IoT) has gained significant requirement on system volume, whereas
research interests in recent years. It is widely batteries can be too bulky for them. In addi-
believed that the ever-expending network infra- tion, a battery can only store limited amount of
structure of IoT enables unprecedented sensing energy and thus requires frequent recharge.
and control capability over a vast number of inter- Due to aging effect, after certain number of
ested objects, with applications including, but not recharge, a battery has to be replaced. With
limited to, home environmental control, remote countless nodes in the network, maintenance
healthcare, industrial process monitoring, smart cost can be formidably high. On the other
buildings and cities, intelligent transportation, hand, harvesting renewable energy eliminates
battery replacement cost and is environment
friendly. However, in contrast to a battery-
powered system, a self-powered device deals
D. Brian Ma (*) with a much unstable energy source that is
University of Texas at Dallas, Richardson, sensitive to ambient environment. Effective
TX 75080, USA measures must be taken at device, circuit and
e-mail: d.ma@utdallas.edu
system levels to ensure acceptable reliability,
Y. Lu energy efficiency and system lifetime.
University of Macau, Macao, China

# Springer International Publishing AG 2017 287


M. Alioto (ed.), Enabling the Internet of Things, DOI 10.1007/978-3-319-51482-6_10
288 D. Brian Ma and Y. Lu

Energy
Sensing
Power Management Controller

Renewable Energy Source Storage


Load
Sensing
Sensing
Psource
Power
Pstore Delivery
Energy
Pload
Harvesting

Load Application
Energy Storage

Power Flow Signal Flow

Fig. 10.1 Power management system block diagram for self-powered device

Figure 10.1 illustrates the generic block dia- is the energy storage module. Because operation
gram of a power management system for self- conditions of both energy sources and load
powered applications. It is made up with five applications vary continuously in such
major components: energy source(s), energy applications, an energy storage module usually
harvesting module, energy storage module, serves as an “energy buffer”. When the harvested
power delivery module and power management power is greater than the load power, the exces-
controller. In order to extract the energy from a sive power is stored for future use. Due to vary-
renewable energy source and convert the ing intensity and availability of the energy, a
extracted energy into electronic format, energy renewable energy source can either fluctuate sig-
harvesting module becomes an essential part of nificantly or offer very low voltage/power level.
power management system. To maximize the They cannot be directly used to power an elec-
harvesting efficiency, some best energy tronic device. Power conditioning module is thus
harvesting systems undergo a software-hardware needed to improve the power deliver quality and
co-design process. The software part contributes provide a relatively stable power supply to load
well-planned harvesting scheme that guides the applications at a desired voltage level. If the
transducer to extract the highest amount of power quality of the power is still not acceptable, a
from the energy source. Because of the high power regulation stage would be added. The
diversity of energy forms and harvesting two stages together accomplish the power deliv-
transducers, harvesting schemes can be signifi- ery to the load application. Here, the load appli-
cantly different. For example, maximum power cation refers to all the electronic modules that
point tracking (MPPT) techniques (Esram and demand electrical power. In particular to IoT
Chapman 2007) are highly popular for photovol- applications, as addressed in Sect. 2.1 of
taic solar energy harvesting, whereas resonant Chap. 2, it includes processors/microcontroller
frequency and impedance matching schemes are units, communication unit as well as sensors/
common for piezoelectric energy harvesting actuators. In a sophisticated system, dictated by
(Ottman et al. 2003; Sankman and Ma 2015). intelligent power management schemes such as
As a result, the hardware design of energy dynamic voltage/frequency scaling (DVFS), the
harvesting system represents a vast variety of power delivery can be expended into a sophisti-
circuit implementations defined by their software cated power management platform with multiple
schemes. Hence, a separate chapter (Chap. 11) is power domains, leveraged by multiple supply
dedicated to this topic particularly. Another voltage levels. In this situation, power manage-
important part of the power management system ment is strategically planned by a global system
10 Power Management Circuit Design for IoT Nodes 289

power optimizer (Sect. 2.7 in Chap. 2), and is switches, and linear (power) regulators, in
often application-specific. which power transistors work as voltage- or
In this chapter, we discuss the power circuits current-controlled current sources. Furthermore,
that are essential for power delivery and power based on the type of energy storage elements,
management. The discussion starts with a gen- nonlinear power converters include both inductor
eral design consideration on the system level, based and switched-capacitor based power
followed by a comprehensive review of funda- converters. In general, voltage regulation in a
mental power circuit topologies and operations. nonlinear power converter involves at least two
Tailored for the applications like IoT nodes, processes: charging and discharging the energy
state-of-the-art works are introduced. The chap- storage element(s). The purpose of charging pro-
ter ends with a brief discussion on future trends. cess is to extract the energy from input power
source and then temporarily store the energy in
the energy storage elements. The discharging
10.2 Fundamentals of Power process is to release the stored energy to load
Converters application in a desired power format specified
by the output voltage and current levels. If the
Functional modules in IoT power management energy storage element is an inductor or a trans-
circuits are built largely based on DC-DC power former, the converter is called as a switch mode
converters. Such a converter converts an unregu- power converter or a switching converter in
lated DC power source to a much regulated one short. If the element is a capacitor, the converter
with relatively constant voltage at the output. is then named as switched-capacitor (SC) power
The output voltage level can be either higher or converter.
lower than the input, with either the same or A linear regulator regulates its output voltage
opposite polarity. The stability of the output volt- by controlling the ON-resistance of its power
age is a strong indicator of performance. Because transistor, and thus does not generate any output
the input voltage can fluctuate with ambient envi- ripples, which can be caused by switching action
ronment or battery charge/discharge activities in nonlinear power converters. Hence, a linear
(which are detailed in Chap. 15), the ability of regulator does not require large output capacitor
the converter to maintain a constant output volt- for ripple filtering and its loop bandwidth is in
age is highly expected. This is measured by the general higher than its nonlinear counterparts.
line regulation, which is defined as the ratio of Consequently, it achieves faster transient
the output voltage change ΔVOUT to the input response. Moreover, a linear regulator is often
voltage change ΔVIN. In addition, at the output, employed as a post-regulation stage to suppress
if the load current IOUT varies, the converter the ripples generated by nonlinear power
should keep VOUT constant. This behavior is converters, and provides a clean supply to noise
measured by the load regulation, which is defined sensitive loads. The major drawback of linear
as the ratio of the output voltage change ΔVOUT regulators is that its power transistor bears
to the load current change ΔIOUT. In order to non-zero voltage and current simultaneously,
accomplish superior line and load regulation, a and thus introduces unavoidable power loss.
DC-DC converter requires a controller, which And this loss is linearly proportional to the drop-
adaptively controls the power flow according to out voltage from VIN to VOUT.
instant input voltage and load current conditions. For electromagnetic wave or vibration energy
The ultimate goal is to retain stable output volt- harvesting systems, AC input energy needs to be
age at the desired level. converted to DC format, which necessitates the
Depending on the operation of power process of rectification. Rectification can be sim-
transistors, DC-DC converters can be ply achieved by using a diode that only allows
categorized as nonlinear power converters, in the current to flow in one direction and blocks it
which power transistors operate as power from going reversely. Low turn-on voltage drop
290 D. Brian Ma and Y. Lu

Schottky diode has been conventionally used in input voltage. If the output voltage should be
discrete-type AC-DC converters for rectification. higher than the input voltage, then a boost con-
However, it is not commonly available in stan- verter in Fig. 10.2b can be used. For both
dard CMOS processes, and requires additional converters in Fig. 10.2c, d, the magnitude of the
mask and thus cost during manufacture. Low output voltage can be either higher or lower than
cost CMOS rectifier with power transistors the input’s. However, the polarity of the output
replacing the passive diodes gives promising voltage is opposite to its input in Fig. 10.2c, but
performances in the above mentioned keeps the same in Fig. 10.2d.
applications, especially at low-input voltage In order to provide a constant VOUT, the ON
conditions. times of the switches should be meticulously
controlled. Figure 10.3 illustrates such control
actions using a buck converter. Compared to the
10.2.1 Switching Converters buck converter in Fig. 10.2a, the main switch S1
is implemented with a transistor MP, whereas the
Figure 10.2 illustrates the power stage circuits of secondary switch S2 is replaced by a diode. MP is
four most common switching converters—the switched ON and OFF periodically, with a
buck, the boost, the flyback and the switching period T. Assume both the transistor
non-inverting buck-boost switching converters. and the diode are ideal, and have zero DC resis-
Each consists of an inductor L, a series of tance when turned-on. When Mp is ON, it shorts
power switches, and a capacitor CL at VOUT. the input VIN to the circuit node VX, causing a
The inductor L serves as the energy storage ele- positive inductor voltage vL ¼ VINVOUT, if VIN
ment. The capacitor CL filters out switching rip- is greater than VOUT. The inductor L is energized
ple noise and keeps VOUT constant. The switches, and the current iL rises. Here, the ON-time of Mp
implemented by power transistors or diodes, are is measured as DT, where D is called as the duty
periodically turned on and off to charge and ratio. During DT, the energy from the power
discharge the inductor L accordingly. Depending source is stored in the inductor L. When DT
on the input and output voltage levels, a expires, MP is shut off and the diode is forced
switching converter can be constructed with dif- ON to short VX to ground. Hence, vL ¼
ferent topologies. A buck converter in Fig. 10.2a VINVOUT. The energy stored in the inductor L
provides a regulated output voltage lower than its is released to the load and iL drops accordingly.

L L
VOUT VOUT
S1 S2

S2 CL RL VIN S1 CL RL

(a) (b)

L
VOUT VOUT
S1 S2 S1 S4

L CL RL VIN S2 S3 CL RL

(c) (d)

Fig. 10.2 Power stage circuit topologies of switching converters for (a) buck, (b) boost, (c) flyback, and (d)
non-inverting buck-boost voltage conversion
10 Power Management Circuit Design for IoT Nodes 291

Fig. 10.3 Switching iL


operation in a buck L VOUT
MP
switching converter
vL

VIN CL RL

VL
VIN−VOUT

0 t

−VOUT

IL

VOUT/RL

0 DT T t
MP MP MP MP
ON OFF ON OFF

Clearly, if the energy charged into the inductor is In Fig. 10.4b, iL is employed as the feedback
equal to the energy demand by the load RL, the signal. Hence, it is a current-mode control. It
converter enters the steady state. A constant out- should be noted that not all switching converters
put VOUT ¼ DVIN is achieved. operate with constant switching period T. For
It is apparent that the voltage regulation of a example, in a voltage-mode hysteretic controlled
switching converter is accomplished by converter of Fig. 10.5, each switching period is
controlling the duty ratio D, so that the average not initiated by a clock periodically. Instead, the
of iL is equal to the load current VROUT
L
. The control voltage changes at VOUT determine the status of
method can be a feedback control, or a MP as well as the switching period T.
feedforward control, or a combination of both. With all the converters in Figs. 10.4 and 10.5
With a constant switching frequency, Fig. 10.4 managed by feedback controls, a converter can
shows a buck converter with two most com- also be regulated with a feedforward control by
monly exploited pulse-width modulation using the input voltage VIN as the control variable
(PWM) feedback controllers. As the converter (Smedley and Cuk 1995), as illustrated in
contains two reactive components L and CL, Fig. 10.6a. One problem for a feedforward con-
either of parameters iL and VOUT can serve as a trolled converter, however, is that the output
control variable to regulate the converter. In VOUT is not accurately regulated due to the
Fig. 10.4a, VOUT is used as the feedback signal absence of its own feedback signal. The problem
to determine the duty ratio of MP. The control is can be overcome by employing an additional
thus considered as voltage-mode control. feedback control to improve the regulation
292 D. Brian Ma and Y. Lu

Fig. 10.4 Switching buck L VOUT


MP
converter with (a) voltage-
mode PWM control, and
(b) current-mode PWM
control
CL RL

Rfb1

Q R
Vref Rfb2
Q S CLK Ramp
Signal
(a)

iL
MP VOUT
L

CL RL

Rfb1

Q R
Vref
Rfb2
iL
Q S CLK

Rfb3

(b)

accuracy. Figure 10.6b depicts an example of converters can serve as an alternative in this
such Ma et al. (2004). situation. Figure 10.7 shows a common SC
power converter—voltage doubler, which
achieves a voltage conversion gain (CG) of
10.2.2 Switched-Capacitor Power 2. Instead of using an inductor L in a switching
Converters converter, a capacitor CP is used as energy stor-
age element. The converter is controlled by a pair
For switching converters introduced in of complementary clock signals, ϕ1 and ϕ2.
Sect. 10.2.1, the use of inductive devices usually When ϕ1 ¼ 1, the switches S1 and S2 are ON,
leads to large electromagnetic interference the pumping capacitor CP is charged to VIN.
(EMI), which is not desirable for noise sensitive Energy from the input source is temporarily
applications. Switched-capacitor (SC) power stored in CP. When ϕ2 ¼ 1, S1 and S2 are OFF
Fig. 10.5 A switching iL VOUT
converter with voltage- MP
mode hysteretic control L

VIN CL RL

Rfb1

VH
VL Rfb2

Fig. 10.6 (a) A switching L


control with one cycle MP VX
VOUT
control (Smedley and Cuk
1995), and (b) a dual-loop
one-cycle controller
(Ma et al. 2004) VIN CL RL

Vref
Q R

Q S CLK

(a)
D2

C D VIN
RC 1−D−D2

VOUT
VMP R1
R1
Vref
D2

RC

R1
R1

(b)
294 D. Brian Ma and Y. Lu

and S3 and S4 is ON. The energy stored in CP is that it is important to ensure that there are
discharged to the load RL. At this moment, VOUT non-overlapping time periods between the two
is equal to the sum of the voltage across CP clock signals to avoid large shoot-through
(¼VIN) and the input voltage VIN, making VOUT currents.
¼ 2VIN. The circuit is thus called voltage dou- Compared to a switching converter, a SC
bler. The control clock signals, ϕ1 and ϕ2, can be power converter possesses a fundamental differ-
generated using the circuit in Fig. 10.7b. Note ence in circuit control. The output voltage of a
switching converter, as discussed earlier, is
f1 regulated by adjusting the duty ratio D of the
S1
main power switch. By varying D, the amount
f2 f2 of energy per switching period is adjusted. The
VOUT
S3 S4 output voltage is dynamically controlled accord-
CP ingly. However, the duty ratios of the power
switches in Fig. 10.7a are all fixed at 50%.
S2 f1 CL RL
Hence, control methods used in switching
converters may not be applicable for SC power
converters. In fact, many of SC power converters
still remain open-loop today. However, without
(a) the presence of feedback control, the output VOUT
suffers significant variations and can become
CLK fluctuated during load transient periods. A com-
mon remedy is to add a linear regulator, as
illustrated in Fig. 10.8. The feedback loop of
f1 the linear regulator mitigates the load transient
f2 challenge by isolating the SC converter from the
load. However, because the SC converter has no
mechanism to adaptively adjust its load current,
if a large load current increase is imposed by RL,
(b)
the linear regulator will draw a significant current
from VOUT. As a result, VOUT drops significantly.
Fig. 10.7 (a) SC voltage doubler, with (b) control clock This forces the load supply voltage VOUT,REG
circuit

MP3
f1 MP4 VOUT Linear Regulator
MP3 VREF
CP
f2 f2
VOUT,REG
f1 MN2
COUT
Rfb1
RL CL
SC Voltage Doubler Rfb2

Fig. 10.8 SC voltage doubler followed by a linear regulator for post-regulation


10 Power Management Circuit Design for IoT Nodes 295

Fig. 10.9 A SC power


converter with voltage- SC Power Stage
VOUT
mode hysteretic control

CL RL
VIN Rfb1

Non-overlapping VH
Clock Generator
Rfb2
VL

Fig. 10.10 (a) Series regulator and (b) shunt regulator with voltage reference

drops as well. Hence, the problem is not funda- converter decreases, allowing more current to
mentally solved. be delivered until it meets the new load demand.
Since the duty ratio of the SC power However, as switching frequency increases, the
converters cannot be changed, the switching fre- output resistance of a SC power converter
quency, which is equal to 1/T, is used occasion- becomes a weak fucntion of the switching fre-
ally for closed-loop control. Figure 10.9 quency. Therefore, the hysteretic control is less
demonstrates a circuit implementation of such. effective at heavy load.
Because the equivalent small signal output resis-
tance of a SC power converter is reversely pro-
portional to the switching frequency, the output 10.2.3 Linear Regulators
voltage VOUT can be regulated by adaptively
adjusting the switching frequency through the Both DC-DC and AC-DC converters in power
hysteretic feedback control. For example, when management units generate high levels of
load current increases, VOUT drops due to CL switching noise. On the other hand, analog linear
dischaging, which triggers a higher switching regulators, as shown in Fig. 10.10, can filter out
frequency. The equivalent resistance of the the supply noise and provide a clean supply
296 D. Brian Ma and Y. Lu

voltage to drive noise sensitive circuits in wire- one clocked comparator, one bi-directional
less communication front-end systems and criti- shift-register array, and one power transistor
cal paths in VLSI chips, or simply provide a array. The comparator compares VREF to VOUT
voltage step-down function in low-cost in every clock cycle to decide whether the shift
applications. One major difference between a register output bits D[1:n] shift to left (add a “1”)
switch mode power converter and a linear regu- or to right (add a “0”). D[1:n] controls the num-
lator is the nature of energy conversion. The ber of turned-on power transistors and conse-
power converters use inductors or capacitors to quently the total output current. The clocked
store the energy in one phase and release the comparator can operate at low supply voltage
energy in the other phase, while the linear (0.5 V for example) and consumes no DC
regulators just simply consume or dump the power, while the other digital cells can operate
extra energy. Thus, the efficiency of a linear at low voltage as well. This feature gains the
series regulator simply approximately equals to digital LDO regulator potential of being
VOUT/VIN, assuming the quiescent current con- implemented in low-power low-voltage IoT
sumed by the error amplifier is negligible. The nodes.
dropout voltage is defined as VIN,MINVOUT, The transient response time of the digital
where VIN,MIN is the minimum input voltage LDO regulator is proportional to its clock fre-
that can maintain the desired output voltage at quency and the size of each power transistor.
full load. Therefore, low quiescent current Thus, coarse-fine tuning and adaptive clock
low-dropout (LDO) regulators are favorable in techniques can be used to improve its transient
an IoT node due to their low-cost, ripple free, fast response without increasing the standby power
transient response and good power supply ripple (Huang et al. 2016). Similar to digitally con-
rejection characteristics. trolled switching converters, digital LDO
Since the energy source in energy harvesting regulators also suffer from limit cycle oscillation
scenarios has large variations, when the input due to finite resolution of the output current. This
current is too high, one more important function problem can be mitigated by adding a one-bit
for the shunt regulator is to bypass the extra strength feedforward path bypassing the shift
energy to ground to prevent device from break- register or by introducing a dead zone to the
ing down. When an energy storage component is comparator (Huang et al. 2016).
available in the system shown in Fig. 10.1, the
shunt regulator is then not needed.
It is also worth noting that digital linear regu- 10.2.4 AC-DC Converter
lator becomes popular in recent years due to its
low-voltage operation feature. The operation As IoT nodes can also be powered by wireless
principle is straightforward and can be illustrated power transfer (WPT), AC-DC converter (recti-
in Fig. 10.11. The digital LDO regulator employs fier) is a vital block in both near-field and
far-field WPT systems. Note that, rectifiers
operating at different frequency ranges, input
amplitudes, and output power levels, need spe-
cific considerations and dedicated architectures.
Thus, researchers have made tremendous efforts
to develop wide-range high-efficiency and area-
efficient rectifiers in the past decade (Cheng et al.
2016; Guo and Lee 2009; Lu et al. 2011, 2013;
Lu and Ki 2014; Lam et al. 2006; Lee and
Ghovanloo 2012).
Fig. 10.11 A regular digitally controlled LDO regulator Diode, the simplest semiconductor device,
(Okuma et al. 2010) can be easily used to convert AC power into
10 Power Management Circuit Design for IoT Nodes 297

Fig. 10.12 Schematics of (a) half-wave rectifier and (b) full-wave rectifier

DC power as shown in Fig. 10.12. One terminal In standard CMOS processes, the diodes can
of the AC source is tied to ground in the half- be simply replaced by diode-connected MOS
wave rectifier as shown in Fig. 10.12a, while transistors. This is a cost-effective way to imple-
both AC terminals are floating in the full-wave ment the passive rectifier. However, the maxi-
rectifier as shown in Fig. 10.12b. The voltage mum achievable transconductance gm of the
difference between VAC1 and VAC2, VAC, is a MOS transistors is smaller than that of a real
sinusoidal wave (could be a distorted wave in PN junction diode or a Schottky diode. For this
practice). The half-wave rectifier delivers current reason, a large VGS for MOS transistors is
from the AC source to the DC output VDC once in required especially when delivering a large cur-
every cycle when VAC1 is higher than VDC, while rent, meaning that it is less efficient than diodes.
the full-wave rectifier delivers current to the out- If low-threshold voltage transistors are used to
put at both positive and negative peak points of reduce the VGS, reverse (sub-threshold) leakage
VAC. That means the time interval for charge current would increase accordingly, and the
transfer of a half-wave rectifier is only half of stand-by time would be reduced.
that of a full-wave rectifier. Thus, in delivering For domestic electrical appliances that can use
the same load current with the same output an AC supply voltage of 220 VRMS or 110 VRMS,
capacitor, the output ripple of the half-wave rec- a passive reconfigurable rectifier (universal recti-
tifier is basically two times higher than that of the fier) with 1X (rectifier) mode and 2X (voltage
full-wave rectifier. More importantly, when the doubler) mode is commonly used as shown in
input energy is limited like the energy harvesting Fig. 10.13. To cater for coupling coefficient
cases, the equivalent input impedance of the rec- k variations with distance and/or orientation in
tifier is different between the half-wave and the WPT applications, a reconfigurable rectifier may
full-wave cases since they draw different current be able to increase VDC without increasing the
from the source. This feature could be a tuning transmitted power. The similar idea was
factor for the impedance matching between the employed in the loosely coupled inductive
AC energy source and the load. power link in Lu et al. (2013) and Lee and
298 D. Brian Ma and Y. Lu

Fig. 10.13 Reconfigurable rectifier (universal rectifier) with 1X (rectifier) mode and 2X (voltage doubler) mode

S2 Energy S3
Storage

S1 L S4

S6 S5
Energy
Battery Load C1 C2 Load Battery
Energy
Source Source

Fig. 10.14 A bi-directional power management unit for self-powered system

Ghovanloo (2012) to extend the coupling range. battery-powered systems. Because power
Note that, the input impedance of the 1X mode generated through energy harvesting
ideally is four times larger than that of the 2X mechanisms is typically low, any power loss
mode when driving the same resistive load. matters in these low power devices. Energy con-
Because the 2X mode ideally provides twice version efficiency thus rises as the number one
output voltage and has twice input-to-output cur- design priority. In order to improve the effi-
rent ratio, making four times input current differ- ciency, traditional techniques used for individual
ence comparing to the 1X mode. This impedance power converters do not suffice. The optimiza-
transfer characteristic should be taken into con- tion has to be done from system level and employ
sideration when designing the matching network. much more environmental adaptive operation
schemes and circuit architectures. Self-adaptive
power circuit architectures thus have been
10.3 Control Schemes and Design reported (Chen et al. 2010; Sankman et al. 2011).
Trends Figure 10.14 illustrates an example of such
self-adaptive power management, which has
10.3.1 Self-Adaptive Architecture two distinct differences from a traditional
power converter. First, an energy storage unit is
As many IoT nodes are self-powered and rely on included to maximize the utilization of energy
renewable energy sources, power management and provide a viable way to recycle the energy
circuits face more challenges than in traditional when excessive. Second, power flow is
10 Power Management Circuit Design for IoT Nodes 299

bi-directional with flexibility of flipping input 10.3.2 Interleaving Multiple-Phase


and output power ports. An advantage of this Operation
circuit is that only one inductor is required, lead-
ing to a low profile implementation in compari- There are three main reasons to consider
son with traditional circuits. In the meantime, the multiple-phase operation in a power converter:
single-stage architecture benefits the power effi- transient response, ripple reduction and reliabil-
ciency. Figure 10.15 illustrates the adaptive ity. When the load current changes drastically, it
reconfiguration Sankman et al. (2011) of the takes time for a converter to respond due to circuit
circuit in Fig. 10.14. In this description, the left delay and limited loop-gain bandwidth, causing
power port is connected to a power source, voltage drooping at the output. If loop bandwidth
whereas the right one is to the load. This situation is very limited, the drooping can be severe and the
will change and the power flow becomes converter takes a long time to recover. A straight-
reversed, when the source and load swap the forward way to mitigate this issue is to employ
ports due to energy availability and load condi- parallel architecture. For example, in Fig. 10.17,
tion. For each direction of power flow, it consists if there are four identical converters are
of four operation modes, as shown in Fig. 10.15. connected in parallel, the current handing capac-
In the Power Mode (Fig. 10.15a), when the ity is increased by four times, with equivalently
harvested power is greater than the load power, four times increase on loop bandwidth. In theory,
switches S2 and S3 are off to disconnect the an N parallel-cell converter is N times faster than
energy storage element CS, while the switch a single cell converter (Kassakian et al. 1990;
pairs S1/S5 and S4/S6 are complementarily turned Perreault and Kassakian 1997).
on and off to deliver energy to the load. When the Interleaving multiple-phase operation also
harvested power drops too low, CS is discharged yields great benefit on ripple cancellation. As
to power the load application in Power Hungry shown in Fig. 10.17, when each cell converter
Mode. In this mode, S1 and S3 are off while operates at the same switching frequency but is
switch pairs S2/S5 and S4/S6 are turned on and displaced in phase in each switching period T,
off as depicted in Fig. 10.15b. In Fig. 10.15c, if due to harmonic cancellation, both voltage and
the harvested input power is in excess of load current ripples at the input and output are
power, the power circuit self-configures into a reduced in magnitude and ripple frequency is
maximum power point (MPP) tracker to extract pushed to higher level. One critical issue in
power into CS. In this Storage Mode, S4 is off to implementing the interleaving multiple-phase
isolate the load and switch pairs S1/S5 and S3/S6 operation is to ensure the phase displacement of
are complementarily turned on and off to deliver each cell is equal, which is also called phase
harvested energy to CS. Finally, the dynamic synchronization (Song et al. 2014). A potential
voltage scaling (DVS) configuration is shown in phase error can lead to imbalanced current shar-
Fig. 10.15d, in which S1 is off to isolate the input ing among the cells (Fig. 10.18), causing hot
power source, and switch pairs S2/S5 and S4/S6 spots and jeopardizing reliability.
are turned on and off to recycle energy from CO Another benefit of interleaving multiple-
to CS. Thus, VOUT is reduced to save energy. phase operation is the reliability improvement.
Once the new VOUT is reached, this Energy If all the cell converters are controlled indepen-
Recycling Mode can be terminated. dently, then the converter as a whole is not cen-
Similarly, self-adaptive architectures can be trally controlled. The distributed architecture and
adopted in SC power converters (Ma and operation itself leads to high tolerance to cell
Bondade 2013). Figure 10.16 illustrates a failures. Furthermore, similar concept can be
reconfigurable SC converter, which applied to each-phase cell converter. Because of
accomplishes both step-up and step down voltage large sizes of power devices in the cell, they can
conversions and is compatible with bi-directional be constituted with a series of paralleled small
power flow operation as well. power devices (also called sub-cells), such as
300 D. Brian Ma and Y. Lu

Fig. 10.15 (a) Power Charge Path


Mode, (b) Power Hungry CS
Discharge Path
Mode, (c) Storage Mode, S2 S3
and (d) Energy Recycling
Mode S1 L S4

S5 Load
Source CIN S6 CO Application

(a) Power Mode

Charge Path
Discharge Path CS
S2 S3
S1 L S4

S5 Load
Source CIN S6 CO Application

(b) Power Hungry Mode

Charge Path
Discharge Path CS
S2 S3
S1 L S4

S5 Load
Source CIN S6 CO Application

(c) Storage Mode

Charge Path
Discharge Path CS
S2 S3
L S4

S5 Load
Source CIN S6 CO Application

(d) Energy Recycling Mode


10 Power Management Circuit Design for IoT Nodes 301

VOUT VOUT
S1 S9 S2 S10 S1 S9 S2 S10

C1 C2 C1 C2
VIN CL RL VIN CL RL

S5 S3 S7 S6 S4 S8 S5 S3 S7 S6 S4 S8

Φ=1 Φ= 1
Step-up Conversion
CG=2
VOUT
S1 S9 S2 S10

C1 C2
VIN CL RL

S5 S3 S7 S6 S4 S8

Step-down Conversion
CG=1/2

VOUT VOUT
S1 S9 S2 S10 S1 S9 S2 S10

C1 C2 C1 C2
VIN CL RL VIN CL RL

S5 S3 S7 S6 S4 S8 S5 S3 S7 S6 S4 S8

Φ=1 Φ= 1

Fig. 10.16 Reconfiguring a SC power converter with CG ¼ 2 or 1/2

IL1
VSW1 L1 IL2
IL3
S1 IL1 IL4
t
VSW1
VOUT
t
VSW2
t
VIN CL VSW3
VSW4 L4 t
VSW4
S4 IL4 t

VOUT

t
T

Fig. 10.17 Illustration of a multiple-phase switching converter


302 D. Brian Ma and Y. Lu

Fig. 10.18 Imbalanced IL1


current sharing in a
multiple-phase switching D IL1
converter IL1
IL2
IL3
IL4
t

VSW1
t
VSW2
t
VSW3
t
VSW4
t

VOUT

t
T

VIN
fA
A-1 A-2 A-3 A-4

fA
fB
B-1 B-2 B-3 B-4
fB
VOUT

fC fC
CL RL
C-1 C-2 C-3 C-4

fD

fD
D-1 D-2 D-3 D-4

Fig. 10.19 Reconfigurable multiple-phase SC power converter (Ma and Bondade 2013)
10 Power Management Circuit Design for IoT Nodes 303

VDD1

Power Management System VDD2

VDD3
VIN

Control Bus

Load Load Load


Application Application Application
A B A

Fig. 10.20 DVFS based power management system with multiple power supplies

A-1 to D-4 in a multiple-phase SC power con- It directs optimal voltages and frequencies of
verter in Fig. 10.19 (Ma and Bondade 2013). Due operation that allow each load application to
to the difficulty of implementation, phase dis- complete the task right before the deadline, in
placement cannot be too small in interleaving order to reduce the energy consumption to
operation. Hence, the sub-cells (such as A-1 to minimum.
A-4) retain the same phase (ϕA) in each cell (cell As power supply voltage is a critical control
A). When one sub-cell fails to function, the other variable in DVFS techniques, traditional battery-
three carry over the total load current designated powered fixed power supply does not suffice.
to the cell, without affecting the operation of Instead, it usually relies on power converters to
others. In addition, the sub-cell arrangement intelligently generate the desired voltages. One
also facilitates dynamic power component sizing, common way to do so is to generate multiple
benefiting the efficiency of the converter. discrete supply voltages. As illustrated in
Fig. 10.20, based on the power budget and
processing deadlines, the DVFS-based power
10.3.3 DVFS Capability management system instructs the load
applications to switch to the supplies that meet
In order to mitigate the power crisis in electronic the optimization of power and performance.
devices, it has been widely believed that power However, if multiple power converters are
management should be performed not only on employed for such a purpose, system cost, volume
device and circuit level, but also on system and design complexity increase drastically. The
level. Accordingly, numerous system power use of multiple and bulky power inductors and
management techniques have been reported. To capacitors require extra PCB area, IC pins, bond-
illustrate the effectiveness of the techniques as ing pads. To mitigate this issue, single-inductor
such, one example is given in Table 15.5 of multiple-output (SIMO) power converters were
Chap. 15 by Cymbet Corporation with up to developed (Ki and Ma 2001; Ma and Bondade
10.11 power saving ratio. Among these power 2010; Ma et al. 2003a, b). As illustrated in
management techniques, a most popular and Fig. 10.21, a SIMO converter only requires one
effective one is dynamic voltage/frequency scal- single inductor and could save the number of
ing (DVFS) (Burd et al. 2000; Cho and Chang power switches by up to 50% and is controlled
2007; Luo et al. 2007). In this technique, power by a single controller. The cost-effectiveness
delivered to each load application is varied adap- makes it a popular choice in recent years (Chen
tively according to the actual load demand. et al. 2012; Zhang and Ma 2014, 2015).
304 D. Brian Ma and Y. Lu

Fig. 10.21 A SIMO VDD1


power converter for DVFS L
power management (Ki and VDD2
Ma 2001)
VSource VDDn

Feedback/Phase Load
Controller Application

Feedback Control Signal DVFS Supply Selection Signal

Fig. 10.22 DVFS based


Power Management System
power management system
using a variable-output VO
buck converter Loop Filter DC-DC Converter VCO
fVCO
C
Idd
∑ Vdd
Application Load

Apart from multiple supply implementations, Unlike traditional fixed-output converters that
DVFS techniques can also be accomplished using are optimized and compensated at a fixed voltage
a single power converter with variable output. For level, variable output causes stability issues dur-
the sake of hardware cost and form factor, this is ing large dynamic transient periods. Robust con-
clearly more economical, because it only requires trol methods (Smedley and Cuk 1995; Song et al.
one single power converter. However, because of 2014; Utkin 1993; Wang and Ma 2012) are
the output of the power converter has to continu- expected for such applications.
ously vary in a relatively wide voltage range and
reasonable transient speed, the design could face
some serious challenges. With a wide output 10.3.4 CMOS Rectifier
range, SC power converters and linear regulators
likely suffer from low efficiency. Even for a CMOS rectifier that does not need off-chip diodes
switching converter, as depicted in Fig. 10.22, and extra mask is a low-cost solution for the IoT,
its transient response time has to be largely short- without degrading performances. Two important
ened to reduce the latency and computing errors parameters in evaluating a rectifier are the voltage
in the application load, which can be a processor conversion ratio M and the power conversion
running at a much higher operation frequency. efficiency (PCE). M is defined as
10 Power Management Circuit Design for IoT Nodes 305

Fig. 10.23 An active full-wave rectifier with simulated AC current waveforms showing the reverse current problem in
active rectifier

V DC For the passive rectifiers, as discussed in


M¼ ; ð10:1Þ
jV AC j Sect. 10.2.3, the diode voltage drop VD of 0.7 V
for typical diodes and 0.3 V for Schottky diodes,
where jVACj is the amplitude of the input AC limits the voltage conversion ratio (M ) and the
signal to the rectifier, and VDC is the averaged power conversion efficiency (PCE). An active
rectified output DC voltage. The PCE of an rectifier that only use CMOS transistors is
AC-DC converter is defined as shown in Fig. 10.23. The high-side diodes in
passive rectifier are replaced by two cross-
POUT V 2 =RL coupled PMOS switches, and the low-side diodes
PCE ¼ ¼ ð t0 þNT DC
PIN are replaced by two comparator-controlled
1
NT V AC ðtÞ  I AC ðtÞdt;
t0 NMOS switches (active diodes). In this configu-
ð10:2Þ ration, the voltage drops are reduced from 2VD to
2VDS (VDS is the turn-on voltage of the power
where T is the period of the input sinusoidal switches). The operation principle is described as
wave, N is the number of cycles that are follows: when VAC2VAC1 > jVtPj (threshold
integrated for PIN calculation, and VAC(t) and voltage of MP1,2), MP1 is turned on and VAC2 ¼
IAC(t) are the instant voltage and current of the VDC; and then VAC1 swings below the ground
AC source. voltage, the comparator CMP1 turns on the
306 D. Brian Ma and Y. Lu

Fig. 10.24 Comparator schemes for reverse current control, and the hysteretic comparator

switch MN1, IAC1 charges up VDC by VAC. After hand, the falling edge delay tpHL forces the power
VAC1 swings above zero, MN1 is then turned off NMOS transistors MN1,2 to turn off late, and the
by CMP1. During the next half of the AC input charge of the output capacitor will flow back to
cycle, the other half of the rectification circuit ground through MN1,2, resulting in reverse leak-
will conduct in a similar fashion as described age current. However, MN1,2 have to be large to
above. Thus, by replacing the four diodes of a handle large output current, increasing the
full-wave rectifier with power transistors, M is response time of the active diodes. This problem
significantly increased especially when the input is more pronounced when the operation fre-
amplitude is low. However, when operating at a quency is required to be high and the input ampli-
high frequency such as 13.56 MHz, the compar- tude jVACj is low, because delay time of
ator delay and gate-drive buffer delay would comparators and buffers are inversely propor-
affect the efficiency of the rectifier. tional to the supply voltage.
As demonstrated by the simulated IAC Comparators with unbalanced bias currents or
waveforms, the reverse current will occur if the asymmetric differential input are used to set an
large power switches are not turned off immedi- artificial input offset voltage to compensate for
ately when VAC1 or VAC2 is higher than the the delay and to turn the power switches on and
ground voltage. Thus, in this architecture, the off properly. Prior reverse current control
main losses are conduction loss, switching loss schemes fall into one of the cases sketched in
and comparator static power, if the reverse cur- Fig. 10.24. The symbol of artificial input offset
rent is well-controlled. with an enable pin is defined as: when the enable
Many schemes were proposed to compensate bit is high, a non-zero offset voltage is introduced
the delay of active diodes (Cheng et al. 2016; to the comparator; and when the enable bit is
Guo and Lee 2009; Lu and Ki 2014; Lam et al. low, the offset voltage is zero (a wire). In Case
2006; Lu et al. 2013; Lee and Ghovanloo 2012). 0, the comparators have no artificial offset
The rising edge propagation delay of the compar- (reverse current occurs due to delays). In Case
ator and buffer, tpLH, will shorten the current 1, the comparators have fixed artificial offset
conduction time Δt, limiting the highest opera- such that the power switches are turned off ear-
tion frequency of the active rectifier. On the other lier, but turned on later. In Case 2, the
10 Power Management Circuit Design for IoT Nodes 307

comparators have dynamic artificial offset such artificial offset would disappear right after the
that the power switches are only turned off ear- power NMOS transistor is turned off. If the
lier. In Case 3, the comparators have dynamic RCC transistor turns on prematurely (for exam-
artificial offset at both edges such that the power ple, due to process variations) and turns off the
switches are turned on and off earlier. Case 4 is a active diode while jVACj is still higher than VDC,
hysteretic comparator that will not be used in this the comparator will go high again in the same
scenario, it is listed here only as a complemen- cycle. Simulation waveforms of the described
tary example of Case 3. scenario are shown in Fig. 10.26. The efficiency
To compare the performance of the above would deteriorate with this multiple-pulsing
comparator schemes on the low-voltage active problem, especially in light load condition
rectifiers especially, the metrics Crest Factor when switching loss dominates. Thus, SR latches
(CF) used in evaluating electrical appliances were used to realize a one-shot per cycle logic to
could be also employed here. CF is defined as handle this multiple-pulsing problem (Lu et al.
the ratio between the peak current delivered to 2011).
the load and the corresponding RMS current Case 3 was realized in Lee and Ghovanloo
(2012), where it uses self-biased active diodes
IP similar to Lam et al. (2006), with an offset-
CF ¼ : ð10:3Þ
I RMS control function for the comparators to compen-
sate for both turn-on and turn-off delays such that
A higher CF means a higher peak current for Δt3 could be maximized (lowest CF). It suffers
the same load condition, and larger power from the same and even worse multiple-pulsing
transistors are thus needed to reduce the VDS for problem as (Lam et al. 2006), as the dynamic
achieving high voltage conversion ratios. Rout- offset (offset voltages) transition of the offset-
ing metal needs to be wider for a higher CF as control circuit is unstable: when the comparator
well. The CF for each case is considered as outputs a 1(0), the dynamic offset flips the output
below. to 0(1), and this positive feedback makes the
The scenarios of current conduction of Case comparator undergoes self-oscillation. This phe-
0 to Case 3 are sketched in Fig. 10.25, and nomenon is what a hysteretic comparator is
discussed as follows. Case 1 is implemented in designed to avoid. To make the scheme work,
Guo and Lee (2009), with constant offset Lee and Ghovanloo (2012) adds a delay cell tdp
introduced to the comparators using unbalanced in the offset-control path in addition to its cali-
bias currents. The power NMOS switches are bration bits. If the delay time tdp is large enough
turned off earlier by td to eliminate the reverse (comparable to Δt), the comparator could be
current; however, they are turned on later by 2td stable, but then the power switch would be turned
than the ideal case, and the conduction time Δt1 on for at least the duration of tdp that limits the
is 2td shorter. For the same load current, the peak minimum conduction time (Δtmin > tdp). This
current Ip1 has to be higher, limiting its operation property makes its operation more like a
at a higher frequency. The delays get worse when constant-on time control at light load condition.
jVACj is low. As a result, its light load efficiency is degraded.
In Lam et al. (2006), self-biased active diodes Case 3 was also realized in Cheng et al. (2016)
were employed, and to reduce or eliminate td, a that a hard one-shot logic allowing the power
reverse current control (RCC) scheme (Case 2) is transistor only being turned-on once per cycle is
introduced. However, both the bias current and used to avoid the multiple-pulsing problem. Fur-
the operation of the RCC transistor are highly thermore, a near-optimum switched-offset com-
affected by jVACj and process variations, making pensation was implemented for both on- and
the rectifier hard to be optimized over a wide off-delays with additional voltage sensing feed-
input range. Moreover, as the reverse current back loops. Consequently, the reverse current in
control is realized by a time-varying offset, the Cheng et al. (2016) is eliminated elegantly.
308 D. Brian Ma and Y. Lu

Fig. 10.25 Conceptual


waveforms of the input/
output voltages, power
NMOS gate voltage and
conducted currents for
active rectifiers with
different comparators

Fig. 10.26 Simulated


waveforms of multiple-
pulsing problem associated
with dynamic offset
schemes
10 Power Management Circuit Design for IoT Nodes 309

Fig. 10.27 Schematic of


the multi-stage step-up
rectifier

To summarize, Case 2 has longer Δt than Case perspective. Since the RF input power level usu-
1, and in the ideal situation, Case 3 has the largest ally is low, multiple step-up stages are needed to
Δt. However, a comparator with both boost the output voltage. A Dickson type multi-
compensated turn-on and turn-off delays is logi- stage rectifier is shown in Fig. 10.27. Each stage is
cally unstable: the hysteresis goes the opposite a voltage doubler, thus the ideal no-load output
direction as a normal hysteretic comparator goes, voltage of an N-stage converter is 2 N times of the
and the robustness of the rectifier is degraded, AC input amplitude. An interesting and useful
special logics are needed for a stable operation. characteristic of the multi-stage rectifier is that
There are some other variations of the CMOS the number of stage N does not affect the maxi-
active rectifiers been proposed (Radecki et al. mum achievable PCE (Yi et al. 2007). Because
2011; Xu et al. 2014) for near-field WPT. For each stage draws current from the input source in
example, to operate at a relatively high WPT the charging phase of each cycle, and the stages
frequency (150 MHz), the synchronous rectifier are stacked in the discharging phase for higher
(Radecki et al. 2011) uses the delay locked-loop output voltage. Each stage operates in a similar
(DLL) to synchronize with the AC input wave- way as the single-stage rectifier, thus the maxi-
form and to generate discrete timing slots for mum PCE is basically the same.
controlling the power switches. In Chung et al. As mentioned above, diode drop is the main
(2012), instead of using the voltage or phase constrain that limits the low-voltage operation of
offset to compensate the comparator delay, a pos- the rectifiers, this problem becomes even severe
itive feedback loop was used to shorten the delay. in the RF-DC senarios. Thus, the cross-
RF energy harvesting needs the rectifier to connected (CC) CMOS rectifier structure
operate in the ultra-high frequency (UHF) range, (Facen et al. 2006), as shown in Fig. 10.28a, is
thus it is commonly known as RF-DC conversion. a commonly used topology for its low-voltage
When the AC-DC conversion considers the and auto-switching characteristics. However, a
problems in voltage domain, the RF-DC conver- high leakage current will occur at high input
sion prefers to look into circuit in a power conditions because the PMOS and NMOS will
310 D. Brian Ma and Y. Lu

Fig. 10.28 Schematics of (a) cross-connected rectifier and (b) rectifier with VTH compensation

be turned-on simultaneously during the transition signals that make the power supply rejection
period, which is a similar problem to the shoot- ratio of the input buffer a very important specifi-
through current in a CMOS inverter. To achieve cation. Alternatively, reducing the supply noise
high heavy load efficiencies requires larger tran- from power source itself is one equivalent way to
sistor sizes for the rectifier, which will cause reduce the effect of supply noise.
more reverse leakage current. Larger transistor For an LDO regulator, the largest capacitors
size will also increase the parasitic loss during are the load capacitor CL and the parasitic gate
the step up conversion. It has been demonstrated capacitor CG of the power transistor. Hence,
that the voltage conversion ratio of the CC topol- there are at least two low-frequency (LF) poles
ogy could be larger than 80% in Nakamoto et al. on the left-half-plane (LHP): the pole at the out-
(2007), when the input amplitude of the rectifier put node pO, and the pole at the gate of the power
was higher than 150 mV. Evidently, the CC MOS pG, as sketched in Fig. 10.29 with either pO
rectifier provides higher peak PCE than the or pG being the dominant pole. The pole pO
Dickson-type rectifiers. However, its high-PCE would shift to a lower frequency when the load
range is narrow due to the above mentioned resistance increases and vice versa. For stability
leakage current problem. A high-PCE range consideration, pO moves to lower frequency as
extension technique that employed one high- the load current decreases, which makes the pO
power path and one low-power path with auto- dominant case more stable. On the other hand,
selection can be used to achieve a wider input the zero-load condition is seldom discussed in
power range (Lu et al. 2017). As shown in many output-capacitor-less designs for stability
Fig. 10.28b, another solution (Stoopman et al. reason. Instead, a minimum load current is
2014) that has much less leakage current is to needed to satisfy stability requirements.
internally generate a VTH compensation voltage By using most of the available capacitance at
to bias the diode-connected MOSFET. But large the output node, pO is designed to be the domi-
values of RB and CB are required because the nant pole. Then, a larger load capacitor could
power budget for bias current is tiny, then con- better filter out power supply noise and glitches
siderably large area would be occupied in multi- from the pre-stage DC-DC or AC-DC converters,
stage topology. and serves as a buffer for load-transient current
variation, resulting in a cleaner VOUT. Also, as
shown in Fig. 10.29c, because the output voltage
10.3.5 Supply Noise Rejection is well regulated by the control loop at low fre-
quency, and the noise is bypassed to ground by
There are many sensory circuits exist in the IoT CL at high frequency, the pO dominant case could
nodes, and most of them are sensing single-ended provide a good power supply noise rejection
10 Power Management Circuit Design for IoT Nodes 311

Fig. 10.29 Conceptual frequency response of generic gain with pG being its dominant pole, (c) PSR with pO
LDO regulators with two low frequency poles: (a) open being its dominant pole, and (d) PSR with pG being its
loop-gain with pO being its dominant pole, (b) open loop- dominant pole

(PSR) across the full-spectrum (Lu et al. 2015). supply noise with little area and power
On the other hand, as shown in Fig. 10.29d, the overhead.
worst case PSR for the pG dominant case would
occur at medium frequency (Gupta et al. 2004).
Thus, increasing both the load capacitance and 10.3.6 Passive Components
the loop bandwidth (that is, the unity-gain fre-
quency, UGF) would improve the PSR of an As mentioned above, passive components
LDO regulator. including inductors and capacitors are used in
Meanwhile, PSR can also be enhanced by power converters for energy storage and transfer
circuit techniques, like in Park et al. (2014), to purposes. However, passive components are chip
replicate the effects from the CG and to cancel or PCB area consuming, and have non-idealities
out the input noise. And also, larger dropout that limit the performances of the converters.
voltage usually could give better PSR perfor- Circuit models of both practical inductor and
mance but will sacrifice the efficiency capacitor are given in Fig. 10.30. For the inductor
(Zhan and Ki 2014). In a step-down DC-DC model, it includes a parasitic capacitor CP which
converter in cascade with LDO topology, using decides the self-resonant frequency (SRF) of the
NMOS as the power transistor driven by the employed inductor. When the inductor operates
DC-DC input voltage (Lu et al. 2016) could be at frequencies higher than the SRF, the parasitic
an easy but effective solution to reject power capacitor path provides lower impedance com-
pared to the inductor path, it starts to act like a
312 D. Brian Ma and Y. Lu

Fig. 10.30 Circuit models of (a) a practical inductor and (b) a practical capacitor with paracitics

Table 10.1 Examples of state-of-the-art power inductors


L DCR SRF IRMS Price
Pattern Model number Dimensions (mm) (μH) (Ω) (MHz) (mA) (US$)
PFL1005 (Coilcraft PFL1005 1.14  0.64  0.7 1 0.97 460 310 <0.2
Datasheet 2015)
PFL2010 (Coilcraft PFL2010 2.20  1.45  1.0 4.7 0.62 66 420 <0.2
Datasheet 2015)

LPS3008 (Coilcraft LPS3008 2.95  2.95  0.8 180 9.0 10 120 <0.28
Datasheet 2015)

MSS1038 (Coilcraft 10.2  10.0  3.8 1000 2.8 3 330 <0.51


MSS1038 Datasheet 2016)

capacitor. Meanwhile, the DCR of the inductor area and performance degradation on inductance
will generate I2R conduction loss. Under the density and quality factor Q (Boggetto et al.
same technology, the value of DCR basically 2002; Gardner et al. 2009; Sullivan et al. 2009).
increases with the inductance, because more As the alternative, efforts of using package
windings are needed for larger inductance. To bond-wire inductors (Huang and Mok 2013;
reduce DCR, thick wires should be used. But Song et al. 2016; Wu et al. 2014) through System
then, the component size will be increased. in Package (SiP) have been reported to avoid the
Thus, there are tradeoffs between the device cost and performance penalty of on-chip
size, the current rating, and the operating fre- inductors, while keeping low system profile.
quency. Since the inductor has poor scalability This is also consistent with the global system
due to its magnetic property, it has always been miniaturization trends and technologies
one of the main limiting factors for the addressed in Chap. 16.
miniaturized devices such as IoT nodes. A com- For the capacitor model, parasitic equivalent
parison of state-of-the-art power inductor series resistors (ESRs) and inductors (ESLs) are
candidates for IoT nodes on their size, electrical unavoidable. Both ESR and ESL will generate
parameters, and cost is given in Table 10.1. unwanted additional voltage ripples when the
Historically, it has been a challenge to inte- capacitor is being charged or discharged. Also,
grate power inductors into silicon technology. there will be leakage current conducts through
Major concerns lie in compatibility with CMOS the two terminals of a capacitor. This leakage
technology, increased fabrication cost and chip phenomenon can be modeled as a large resistor
10 Power Management Circuit Design for IoT Nodes 313

Table 10.2 Examples of state-of-the-art capacitors


Voltage
Dimensions rating Leakage Price
Pattern Type (mm) C (μF) (V) (μA) (US$)
Ceramic Capacitor 1  0.5  0.5 1 25 n/a <0.04
Capacitor Array 3.2  1.6  1.35 41 10 n/a <0.15

Super Capacitor (Maxwell HC 12  8  8 10 6


2.7 6 <1
series Ultracapacitors Datasheet
2016)

RLEAK across the capacitor. Table 10.2 For the IoT applications in particular, when
summaries some state-of-the-art capacitors that selecting the topology of power converters, if
can be used in IoT nodes. The capacitor array the application has rigorous constraints on
which contains two or four separate capacitors device volume and switching noise, switched-
provides a compact PCB solution, while the capacitor power converters are more
super capacitor provides super capacity but preferred for fully integration when compared
suffers from considerable leakage current. to switching converters, because the required
For on-chip implementation, parasitic capacitor value of a SC power converter
capacitors (CP1 and CP2) coupled to the silicon decreases linearly with the output power. If the
substrate would be larger compared to the requirement on device volume is loose,
off-chip case. Because the on-chip distances switching converters are more popular for high
between the capacitor plates and the substrate efficiency and flexibly wider output range. Lin-
are short. Nevertheless, the capacitance density ear regulator is another indispensable block in
has been significantly increased either in power management modules for its clean, area-
advanced nanometer CMOS processes efficient, and supply ripple rejection
(El-Damak et al. 2013; Jain et al. 2014) or in characteristics. Also, a linear regulator with
the post-CMOS deep trench technology (Johari fast adaptive output would be favorable for
and Ayazi 2009). Thus, the power density of SC low-duty-ratio IoT nodes. Lastly, wireless
power converters has been largely increased in powering with near-field or far-field operation
recent years. is the major intentional (man-made) energy
source for passive or active wireless nodes for
IoT applications. CMOS rectifier is the most
10.4 Perspectives and Trends cost-effective and efficient way to convert the
received AC power into DC outputs. A wireless
As semiconductor technology enters the nano power receiver that can acquire multiple fre-
regime in recent years, power crisis continues to quency bands is in demand since there are vari-
grow as the number one challenge for system ous ambient RF energy sources and a couple of
design. The impacts of power management and wireless power standards co-exist.
power circuits have been appreciated far beyond In conclusion, power management has
the traditional domain of power electronics. become an essential measure to accomplish the
Power circuits, as a hardware platform that optimal performance in modern electronic sys-
enables efficient power management, should be tem designs. Cross-layer power management at
clearly understood by all the design engineers device, circuit and system level represents an
from topology to performance matrix. inevitable trend and leads to a broad spectrum
314 D. Brian Ma and Y. Lu

of research opportunities. At device level, wide IEEE J. Solid-State Circuits 47(10), 2496–2504
bandgap and 3D power devices that enable much (2012)
Coilcraft LPS3008 Datasheet, revised Sep 2015, http://
faster and more efficient operation are in high www.coilcraft.com/pdfs/LPS3008.pdf
demand. Heterogeneous integration of these Coilcraft MSS1038 Datasheet, revised Mar 2016, http://
devices with mainstream silicon technologies as www.coilcraft.com/pdfs/mss1038.pdf
well as their interface circuits determines the Coilcraft PFL1005 Datasheet, revised Sep 2015, http://
www.coilcraft.com/pdfs/pfl1005.pdf
timeframe of commercialization. With progres- Coilcraft PFL2010 Datasheet, revised Sep 2015, http://
sive development on fundamental power circuits, www.coilcraft.com/pdfs/pfl2010.pdf
power circuits in the future will be highly D. El-Damak, S. Bandyopadhyay, A. Chandrakasan, A
diversified and system-oriented. Adaptation, effi- 93% efficiency reconfigurable switched-capacitor
DC-DC converter using on-chip ferroelectric
ciency and reliability are the key elements in capacitors, in IEEE International Solid-State Circuits
evaluating the performances of these circuits. Conference (ISSCC), Feb 2013, pp. 374–375
At the system and application level, the function T. Esram, P.L. Chapman, Comparison of photovoltaic
of a power circuit should no longer be limited to array maximum power point tracking techniques.
IEEE Trans. Energy Convers. 22(2), 439–449 (2007)
voltage regulation. It will bridge global system A. Facen, A. Boni, Power supply generation in CMOS
power management strategies with local device passive UHF RFID tags, in Ph.D. Research in Micro-
operation to accomplish near-optimal energy electronics and Electronics (2006), pp. 33–36
efficiency, maximized lifetime and highly ele- D.S. Gardner, G. Schrom, F. Paillet et al., Review of
on-chip inductor structures with magnetic films.
vated reliability. IEEE Trans. Mag. 45(10), 4760–4766 (2009)
S. Guo, H. Lee, An efficiency-enhanced CMOS rectifier
with unbalanced-biased comparators for
transcutaneous-powered high-current implants. IEEE
References J. Solid-State Circuits 44(6), 1796–1804 (2009)
V. Gupta, G.A. Rincon-Mora, P. Raha, Analysis and
J.M. Boggetto, Y. Lembeye, J.P. Ferrieux, Y. Avenas, design of monolithic, high PSR, linear regulators for
Micro fabricated power inductors on silicon. IEEE SoC applications, in Proceedings of IEEE Interna-
Power Electron. Spec. Conf. 3, 1225–1229 (2002) tional SOC Conference, Sep 2004, pp. 311–315
T.D. Burd, T.A. Pering, A.J. Stratakos, R.W. Brodersen, C. Huang, P.K.T. Mok, A 100 MHz 82.4% efficiency
A dynamically voltage scaled microprocessor system. package-bondwire based four-phase fully-integrated
IEEE J. Solid-State Circuits 35(11), 1571–1580 buck converter with flying capacitor for area reduc-
(2000) tion. IEEE J. Solid-State Circuits 48(12), 2977–2988
H. Chen, B. Wei, D. Ma, Energy storage and management (2013)
system with carbon nanotube supercapacitor and R. Jain, B.M. Geuskens, S.T. Kim et al., A 0.45–1 V fully-
multi-directional power delivery capability for auton- integrated distributed switched capacitor DC-DC con-
omous wireless sensor nodes. IEEE Trans. Power verter with high density MIM capacitor in 22 nm
Electron. 25(12), 897–2909 (2010) tri-gate CMOS. IEEE J. Solid-State Circuits 49(4),
H. Chen, Y. Zhang, D. Ma, A SIMO parallel-string driver 917–927 (2014)
IC for dimmable LED backlighting with local bus H. Johari, F. Ayazi, High-density embedded deep trench
voltage optimization and single time-shared regula- capacitors in silicon with enhanced breakdown volt-
tion loop. IEEE Trans. Power Electron. 27(1), age. IEEE Trans. Compon. Packag. Technol. 32(4),
452–462 (2012) 808–815 (2009)
L. Cheng, W.H. Ki, Y. Lu, T.S. Yim, Adaptive on/off J.G. Kassakian, High frequency switching and distributed
delay-compensated active rectifiers for wireless power conversion in power electronic systems, in 6th Con-
transfer systems. IEEE J. Solid-State Circuits 51(3), ference on Power Electronics and Motion Control,
712–723 (2016) Budapest, Hungary, 1990, pp. 990–994
Y. Cho, N. Chang, Energy-aware clock-frequency assign- W.-H. Ki, D. Ma, Single inductor multiple output
ment in microprocessors and memory devices for switching converters. IEEE Power Electron. Spec.
dynamic voltage scaling. IEEE Trans. Comput.- Conf. 1, 226–231 (2001). Vancouver
Aided Des. Integr. Circuits Syst. 26(6), 1030–1040 Y.-H. Lam, W.-H. Ki, C.-Y. Tsui, Integrated low-loss
(2007) CMOS active rectifier for wirelessly powered devices.
H. Chung, A. Radecki, N. Miura, H. Ishikuro, T. Kuroda, IEEE Trans. Circuits Syst. II: Express Briefs 53(12),
A 0.025–0.45 W 60%-efficiency inductive-coupling 1378–1382 (2006)
power transceiver with 5-bit dual-frequency H.-M. Lee, M. Ghovanloo, An adaptive reconfigurable
feedforward control for non-contact memory cards. active voltage doubler/rectifier for extended-range
10 Power Management Circuit Design for IoT Nodes 315

inductive power transmission. IEEE Trans. Circuits Y. Okuma, K. Ishida, Y. Ryu, X. Zhang, P.-H. Chen,
Syst. II: Express Briefs 59(8), 481–485 (2012) K. Watanabe, M. Takamiya, T. Sakurai, 0.5-V input
Y. Lu, W.-H. Ki, A 13.56 MHz CMOS active rectifier digital LDO with 98.7% current efficiency and 2.7-μA
with switched-offset and compensated biasing for bio- quiescent current in 65 nm CMOS, in IEEE Custom
medical wireless power transfer systems. IEEE Trans. Integrated Circuits Conference (CICC), Sep 2010,
Biomed. Circuits Syst. 8, 334–344 (2014) pp. 1–4
Y. Lu, W.-H. Ki, J. Yi, A 13.56 MHz CMOS rectifier with G.K. Ottman, H.F. Hofmann, G.A. Lesieutre, Optimized
switched-offset for reversion current control, in IEEE piezoelectric energy harvesting circuit using step-
Symposium on VLSI Circuits (VLSIC), Jun 2011, down converter in discontinuous conduction mode.
pp. 246–247 IEEE Trans. Power Electron. 18(2), 696–703 (2003)
Y. Lu, X. Li, W.-H. Ki, C.-Y. Tsui, C.P. Yue, A C.-J. Park, M. Onabajo, J. Silva-Martinez, External
13.56 MHz fully integrated 1X/2X active rectifier capacitor-less low drop-out regulator with 25 dB supe-
with compensated bias current for inductively rior power supply rejection in the 0.4–4 MHz range.
powered devices, in IEEE International Solid-State IEEE J. Solid-State Circuits 49(2), 486–501 (2014)
Circuits Conference (ISSCC), Feb 2013, pp. 66–67 D.J. Perreault, J.G. Kassakian, Distributed interleaving of
Y. Lu, Y. Wang, Q. Pan, W.-H. Ki, C.P. Yue, A fully- paralleled power converters. IEEE Trans. Circuits
integrated low-dropout regulator with full-spectrum Syst. I 44(8), 728–734 (1997)
power supply rejection. IEEE Trans. Circuits Syst. I: A. Radecki, H. Chung, Y. Yoshida, N. Miura, T. Shidei,
Reg. Pap. 62(3), 707–716 (2015) H. Ishikuro, T. Kuroda, 6 W/25mm2 inductive power
Y. Lu, W. Ki, C. Patrick Yue, An NMOS-LDO regulated transfer for non-contact wafer-level testing, in IEEE
switched-capacitor DC–DC converter with fast- International Solid-State Circuits Conference
response adaptive-phase digital control. IEEE Trans. (ISSCC), Feb 2011, pp. 230–232
Power Electron. 31(2), 1294–1303 (2016) J. Sankman, D. Ma, A 12-μW to 1.1-mW AIM piezoelec-
J. Luo, N.K. Jha, L.S. Peh, Simultaneous dynamic voltage tric energy harvester for time-varying vibrations with
scaling of processors and communication links in real- 450 nA IQ. IEEE Trans. Power Electron. 30(2),
time distributed embedded systems. IEEE Trans. Very 632–643 (2015)
Large Scale Integr. (VLSI) Syst. 15(4), 427–437 (2007) J. Sankman, H. Chen, D. Ma, Supercapacitor-based
M. Huang, Y. Lu, S.-W. Sin, S.-P. U, R. Martins, A fully- reconfigurable energy management unit for autono-
integrated digital LDO with coarse-fine-tuning and mous wireless sensor nodes, in IEEE International
burst-mode operation. IEEE Trans. Circuits Syst. II: Symposium on Circuits and Systems, pp. 2541–2544,
Exp. Briefs 63(7), 683–687 (2016) Rio de Janeiro, Brazil, May 2011
M. Huang, Y. Lu, S.-W. Sin, S.-P. U, R. Martins, W.-H. K. Smedley, S. Cuk, One-cycle control of switching
Ki, Limit cycle oscillation reduction for digital low converters. IEEE Trans. Power Electron. 10(6),
dropout regulators. IEEE Trans. Circuits Syst. II: Exp. 625–633 (1995)
Briefs 63(9), 903–907 (2016) M. Song, J. Sankman, D. Ma, A 6-A, 40-MHz four-phase
D. Ma, R. Bondade, Enabling efficient power manage- ZDS hysteretic DC-DC converter with 118 mV droop
ment on silicon. IEEE Circuits Syst. Mag. 10(1), and 230 ns response time for a 5A/5 ns load transient,
14–29 (2010) in International Solid-State Circuits Conference,
D. Ma, R. Bondade, Reconfigurable Switched-Capacitor pp. 80–81, San Francisco, CA, Feb 2014
Power Converters, 1st edn. (Springer, New York, M. Song, J. Sankman, J. Lee, D. Ma, A 200-MHz 4-phase
2013). ISBN 978-1-4614-4186-1 and 978-1-4614- fully integrated voltage regulator with local ground
4187-8 (e-Book) sensing dual loop ZDS hysteretic control using
D. Ma, W.-H. Ki, C.-Y. Tsui, P.K.T. Mok, Single- 6.5nH package bondwire inductors on 65 nm Bulk
inductor dual-output CMOS switching converters in CMOS, in IEEE/ACM Design Automation
discontinuous-conduction mode with time- Conference-Asia & South Pacific, Jan 2016, pp. 9–10
multiplexing control. IEEE J. Solid-State Circuits 38 M. Stoopman, S. Keyrouz, H.J. Visser, K. Philips,
(1), 89–100 (2003a) W.A. Serdijn, Co-design of a CMOS rectifier and
D. Ma, W.-H. Ki, C.-Y. Tsui, A pseudo-CCM/DCM small loop antenna for highly sensitive RF energy
SIMO switching converter with freewheel switching. harvesters. IEEE J. Solid-State Circuits 49(3),
IEEE J. Solid-State Circuits 38(6), 1007–1014 622–634 (2014)
(2003b) C.R. Sullivan, Integrating magnetics for on-chip power:
D. Ma, W.-H. Ki, C.-Y. Tsui, An integrated one-cycle challenges and opportunties, in IEEE Custom
control buck converter with adaptive output and dual- Integrated Circuits Conference, 2009, pp. 291–298
loop output error correction. IEEE J. Solid-State Maxwell HC series Ultracapacitors Datasheet, www.max
Circuits 39(1), 140–149 (2004) well.com/images/documents/hcseries_ds_1013793-9.
H. Nakamoto, D. Yamazaki, T. Yamamoto et al., A pas- pdf. Accessed May 2016
sive UHF RF identification CMOS tag IC using ferro- V.I. Utkin, Sliding mode control design principles and
electric RAM in 0.35-μm technology. IEEE J. Solid- applications to electric drives. IEEE Trans. Ind. Elec-
State Circuits 42(1), 101–110 (2007) tron. 40(1), 23–36 (1993)
316 D. Brian Ma and Y. Lu

Y. Wang, D. Ma, A 450-mV single fuel cell power man- J. Yi, W.-H. Ki, C.-Y. Tsui, Analysis and design strategy
agement unit with switch-mode quasi-V2 hysteretic of UHF micro-power CMOS rectifiers for micro-
control and automatic start-up on standard 0.35 μm sensor and RFID applications. IEEE Trans. Circuits
CMOS process. IEEE J. Solid-State Circuits 47(9), Syst. I: Reg. Papers 54(1), 153–166 (2007)
2216–2226 (2012) C. Zhan, W.-H. Ki, Analysis and design of output-
H. Wu, D.S. Gardner, C. Lv, et al., Integration of mag- capacitor-free low-dropout regulators with low quies-
netic materials into package RF and power indcutors cent current and high power supply rejection. IEEE
on organic substrates for system in package (SiP) Trans. Circuits Syst. I: Reg. Pap. 61(2), 625–636
applications, in IEEE Electronic Components and (2014)
Technology Conference, May 2014, pp. 1290–1295 Y. Zhang, D. Ma, A fast-response hybrid SIMO power
H. Xu, M. Lorenz, U. Bihr, J. Anders, M. Ortmanns, converter with adaptive current compensation and
Wide-band efficiency-enhanced CMOS rectifier, in minimized cross-regulation. IEEE J. Solid-State
IEEE International Symposium on Circuits and Circuits 49(5), 1242–1255 (2014)
Systems (ISCAS), Jun 2014, pp. 614–617 Y. Zhang, D. Ma, A single-stage solar-powered LED driver
Y. Lu, H. Dai, M. Huang et al., A wide input range dual- using power channel time multiplexing technique. IEEE
path CMOS rectifier for RF energy harvesting, in Trans. Power Electron. 30(7), 3772–3780 (2015)
IEEE Transactions on Circuits and Systems II:
Express Briefs, 64, (2017)
Energy Harvesting
11
Ying-Khai Teh and Philip K.T. Mok

This chapter complements other chapters on bat- 11.1 Available Energy Sources
tery technologies and on-chip DC-DC conver- for IoT
sion, and mainly addresses the challenges of
designing low power IoT nodes that are powered 11.1.1 Classification of Energy Source
by energy harvesting sources, which is the key Characteristics
enabling technology to extend battery life and and Classification
minimize manual battery maintenance, using in
situ power extraction from the surrounding. Advancement in the thin-film fabrication has
Energy harvesting options, circuit concepts, successfully produced new generation of energy
considerations and trade-offs regarding circuit harvesters at reasonable cost, comes with com-
topology, passive component and CMOS process pact size and able to generate microwatts range
are surveyed. In particular, recent circuit of power via surrounding events. An overview of
solutions involving non-conventional power energy harvesters commercially available are
management schemes specifically catering for given in this section, which include thermoelec-
energy-harvesting-assisted IoT systems will be tric generator (TEG), microbial fuel cell (MFC),
discussed and compared. photovoltaic cell (PVC) and piezoelectric har-
vester (PEH). Key classification matrix to char-
acterize the energy sources are defined and
outlined, i.e. power profile (current-voltage) and
availability (continuous-time, discrete-time), and
physical profile. Energy sources which are highly
volatile in nature (photovoltaic cell (PVC) in
rough, cloudy environment or piezoelectric har-
vester (PEH) installed in poor vibration fre-
quency band) are considered as discrete-time
Y.-K. Teh (DT) energy source; which have high output
Hong Kong University of Science and Technology, voltage once ready, but suffers intermittency in
Clear Water Bay, Hong Kong
term of power availability.
San Diego State University, San Diego, CA, USA Energy sources such as body-powered TEG
P.K.T. Mok (*) and MFC are considered as continuous-time
Hong Kong University of Science and Technology, (CT) energy source (Bandyopadhyay and
Clear Water Bay, Hong Kong
Chandrakasan 2012; Danesh and Long 2011).
e-mail: eemok@ust.hk

# Springer International Publishing AG 2017 317


M. Alioto (ed.), Enabling the Internet of Things, DOI 10.1007/978-3-319-51482-6_11
318 Y.-K. Teh and P.K.T. Mok

Although the output voltage is low (<300 mV), difference). State-of-the-art TEGs are fabricated
the available energy is stable and omnipresence. using thin-film deposition technology of Bi2Te3
Table 11.1 classifies these four energy material. Such improvements are desirable for
harvesters into according to voltage-current- further system miniaturization. The new genera-
time matrix. Table 11.2 summarizes key device tion TEG (Micropelt) has smaller device area
characteristics of four different energy with increased power density (100 μW/mm2),
harvesters under low output power condition compared to conventional bulk material based
(less than 100 μW) i.e. TEG only has 2  C of TEGs (Marlow Industries) (13 μW/mm2) at tem-
temperature difference between the Peltier perature difference of 10  C. Internal resistance
junctions, MFC with 50 ml of biofuel, PVC is (RTEG) of new generation TEG is 400 Ω, which is
illuminated at 200 lux (indoor lighting condi- 80 times larger whereas the absolute output
tion) and PEH is vibrated at 0.1 g. Figure 11.1 power for conventional TEG is 8 times higher.
shows the power density of different energy The output power and power density comparisons
sources, assuming all are at maximum power of the two TEGs are shown in Fig. 11.2.
point. TEG which has the best power density Simplified circuit model of TEG (Micropelt),
performance has attracted the most research and its current-voltage characteristics are shown
attentions in recent decade. in Fig. 11.3.
Besides thermoelectric solution, another
emerging power source is microbial fuel cells
(MFC). MFC harvests electricity from organic
11.1.2 Continuous-Time Energy
dispose such as waste water, which is essen-
Sources (TEG and MFC)
tially by-product of catalytic activities of
microorganisms over various organic substrates.
TEG is an interesting energy harvester for its
The nature where MFC derives its energy
continuous operation against environmental con-
enables the niche application of waste water
dition change, due to omnipresence of thermal
treatment and water quality monitor systems.
energy gradient (which manifests as temperature

Table 11.1 Classifications of energy sources


Type TEG (Micropelt) MFC Wang et al. (2013) PVC (Enocean) PEH (Microgen)
Voltage Low Low High High
Current High High Low Low
Time CT CT DT DT
CT, continuous time; DT, discrete time

Table 11.2 Device characteristics of energy sources


MFC Wang et al. PEH
Type TEG (Micropelt) (2013) PVC (Enocean) (Microgen)
VOC (max) 0.3 V 0.6 V 4V 8 V (rectified)
ISC (max) 750 μA (@ΔT ¼ 2  C) 2.7 mA 7 μA (@200 lux) 14 μA (@0.1 g)
VMPPa 0.15 V (@ΔT ¼ 2  C) 0.35 V 3 V (@200 lux) 4 V (@0.1 g)
PMPPb 56 μW (@ΔT ¼ 2  C) 440 μW 14 μW (@200 lux) 56 μW (@0.1 g)
Dimension 431 37  37  37 35  13  1 15  15  6
(L  W  H mm3)
Power density (μW/mm3) 4.67 8.68  103 0.03 0.04
a
VMPP ¼ Operating voltage of energy source at maximum power point
b
PMPP ¼ Output power of energy source at maximum power point
11 Energy Harvesting 319

Current state-of-the-art MFCs are fabricated 11.1.3 Discrete-Time Energy Sources


using emerging material such as graphene (PVC and PEH)
(Wang et al. 2013). It is interesting to note
that thin-film TEG operating at low temperature Photovoltaic cell (PVC) is capable of converting
difference (4–5  C) and graphene-based MFC light directly into electric energy. Most PVCs
have similar electrical characteristics, as shown consist of silicon material. Due to the high cost
in Fig. 11.4. Nonetheless MFC’s power density of the vacuum-based fabrication process of such
is approximately three orders lower than TEG’s. solar cells, research and development has been
Electrically both energy sources can be modeled directed toward the invention and development
as an ideal voltage source (VGEN) in series with of thin film inorganic and emerging inorganic/
internal resistance (RS) of a few hundred ohms. organic thin film PV solar cells in an attempt to
fabricate solar cells out of less-expensive
materials and processes. Light is often present
for a prolonged period, during which the light
energy can be accumulated in storage
mechanisms and then used when needed.
Simplified circuit model of PVC in (Enocean)
and typical output behavior is shown in Fig. 11.5.
Piezoelectric harvester (PEH) with size
smaller than a US quarter coin and internal para-
sitic capacitance of a few nano Farad can be
manufactured via commercial CMOS process
(Microgen). The generated power is in
alternating current (AC) mode and thus requires
to be rectified before use. Simplified circuit
model of PEH with rectified output in
(Microgen) and its typical output behavior is
shown in Fig. 11.6. However, as miniaturization
continues, the mechanical vibration bandwidth
Fig. 11.1 Power density of TEG (Micropelt), MFC for smaller devices is also getting smaller
(Wang et al. 2013), PVC (Enocean) and PEH (Microgen) (a few Hertz). Therefore PEH is sensitive

Fig. 11.2 Power and Power density of conventional TEG (TEG-I (Marlow Industries)) and new generation TEG
(TEG-II (Micropelt))
320 Y.-K. Teh and P.K.T. Mok

Fig. 11.3 Simplified


equivalent circuit model
and output characteristics
of TEG (Micropelt)

Fig. 11.4 I–V characteristics of TEG (Micropelt) at ΔT ¼ 4  C, ΔT ¼ 5  C and MFC (Wang et al. 2013)
11 Energy Harvesting 321

Fig. 11.5 Simplified equivalent circuit model of PVC and its output characteristics (Enocean)

towards environment stimuli and the output The fundamental difference between battery
power profile is highly volatile. power source and energy harvesting source is
that energy harvesters have significantly higher
source resistance (much higher than 1 Ω) com-
11.2 Comparison of Battery- pared to battery (less than 1 Ω). However, in
Powered and Energy- battery powered application, charge stored in
Harvested Systems non-rechargeable battery is limited. Before the
battery stored charge depletes, power sourcing
In this section, the difference of battery-powered capability of a battery is much higher than energy
and energy-harvested systems primarily lies on harvesting source in general (Table 11.3).
the known and unknown voltage-current profile, For energy harvesting source, supply of elec-
due to different source resistance is highlighted. trical charge is in theory unlimited and most of
322 Y.-K. Teh and P.K.T. Mok

Fig. 11.6 Simplified


equivalent circuit model of
PEH and its output
characteristics (Microgen)

Table 11.3 Energy harvesting source vs. battery


Energy harvesting source Battery
High internal resistance Low internal resistance
Unlimited charge in theory Finite charge
Limited instantaneous power (Duty cycled operation) High instantaneous power (Duty cycled operation is optional)
DC-DC converter functions: DC-DC converter functions:
– Output voltage regulation – Output voltage regulation
– Maximum power point tracking (MPPT) – No MPPT necessary

the time the energy harvester is charging up the additional charge harvested will be dissipated
storage capacitor. Before storage capacitor volt- by shunt regulator. If a series regulator is used
age reaches the designed threshold, the regulator instead, the system is prone to overvoltage fail-
is turned off. Having a shunt regulator in this ure as excess charge will continue to accumulate
scenario basically acts as charge release valve. and build up storage capacitor voltage. Overvolt-
The objective is to sense the storage capacitor age protection should be prioritized over power
voltage, once the set threshold is reached, conversion efficiency in this scenario. Hence the
11 Energy Harvesting 323

automatic protection against overvoltage func- 11.3 Energy Harvesting and Power
tion of a shunt regulator is indeed attractive for Conditioning Circuits
energy harvesting application. High voltage
headroom generated by PVC and PEH output Multi-level supply voltage (VDD) is common
can be absorbed by a properly designed shunt for IoT systems implemented using low power
regulator such as (Sarkar and Chakrabartty modern CMOS process. External I/O interface
2013), allowing storage capacitor to be directly circuits typically works at 3 V (5 V for legacy
charged by these sources. devices) and internal digital core circuits
One key challenge is the design of interfacing typically works at 1 V. Modules such as sensors,
power management circuit that efficiently analog-to-digital-converters (ADC) and amp-
combines many different sources of energy into lifiers always prefer higher voltage headroom to
one. Combining the power from variable sources achieve better resolution, gain and linearity (Mak
conventionally requires a sophisticated control and Martins 2010). For battery-powered app-
system as demonstrated in (Bandyopadhyay and lications, such multiple output voltage can be
Chandrakasan 2012). The control itself requires realized using Single-Inductor-Multiple-Output
additional overhead power, i.e. first sense the (SIMO) DC-DC converter (Jing and Mok 2013;
availability of power from each source, then Ma et al. 2003a, b) or multiple series regulator
compute and optimize the multi-source time i.e. low dropout regulator (LDO) (Ho and Mok
multiplexing. 2012; Zhan and Ki 2012, 2014). The conven-
Another challenge is the limit imposed by the tional approaches become unrealistic when the
low supply voltages in standard CMOS input power approaches microwatts range as
technologies. These constrain the output voltage the overhead power is too high to jeopardize
to be within the defined supply voltage. At higher the entire system operation. Design requirements
voltages outside of the nominal range, break- of energy harvesting system to power a
down mechanisms and hot carrier effects will wireless sensor IoT node and the associated
be present. Hence the power management circuit challenges hitherto discussed are summarized in
must prevent the overvoltage problem. On the Table 11.4.
other end of the spectrum, some energy sources In previous sections, electrical characteristics
have very low input voltage (<0.2 V), which are of various energy harvesting source and storage
lower than transistor threshold voltage. At capacitors were reviewed. Based on these
sub-threshold region, transistors are not fully
turned on and conventional voltage converter
topologies are no longer valid or suffer poor Table 11.4 Summaries of design requirements and
conversion efficiency. challenges of IOT wireless sensor node
Typical power consumption pattern of an IoT Input voltage 0–8 V
sensor node is low average power (no active Input current 0–1 mA
power consumption except watch dog timer) Input power 0–1 mW
but occasional occurrence of high concentrated Dual output 1 V (Digital), 3 V (Analog, RF and
burst of power to accomplish a task such as voltage sensors)
activating external sensors or sending radio Design features – Smallest input voltage and power
required possible
packets. If these bursts occur with a low duty
– Smallest overhead and leakage
cycle such that the total energy needed for a power possible
burst can be accumulated between bursts then – Highest voltage conversion ratio
the output can be maintained entirely by the and efficiency possible
energy harvesters. An ideal energy harvesting – Self-start-up during cold-start
scheme is required to maintain low leakage state
power of harvested energy during idle period – Over-voltage protection
– Maximum power point tracking
yet able to supply high pulsed power on demand.
324 Y.-K. Teh and P.K.T. Mok

properties, this chapter surveys integral system interrupting energy harvesting flow since circuit
schemes to accommodate multiple types of does not need to pause and sample the energy
energy sources simultaneously and minimize harvester source open circuit voltage from time
storage leakage loss. Each circuit block that can to time.
be utilized to implement such systems i.e. linear Table 11.5 summarizes the characteristics of
regulator, switching and switched-capacitor different DC-DC converter circuits according
DC-DC converters are stepped through briefly to conversion possible, peak power efficiency
in sequential manner. and output voltage control that can be used to
Different power converter control schemes construct a multi-input-multi-output energy
will have different dynamic power loss. As rule harvesting power management systems for IoT
of thumb, ZCS control is desired to maximize node.
power efficiency. Pulse frequency modulation
(PFM) has superior power efficiency than pulse-
width modulation (PWM) schemes at light-load. 11.3.1 Linear Regulator in Energy
Comparator is one of the most power-consuming Harvesting Application
circuit blocks in power converter. Designers
should use low bias current or clocked compara- As discussed in Chap. 10, there are two possible
tor to save quiescent power. For energy implementations for linear regulators i.e. series
harvesting source which is steady against envi- and shunt regulators. In principle these circuits
ronmental change, maximum power point can be designed based on the functional blocks
tracking (MPPT) algorithm can be executed in a shown in Figs. 11.7 and 11.8 respectively by
low duty cycle fashion to save computation assuming the internal resistance of energy
power. For energy harvesting source with harvesting source to be series resistance RS
known, fixed source resistance, timing-based needed in any typical linear regulator realization.
impedance matching MPPT technique can be Series voltage regulator is more power efficient
employed with a comparator-less fully digital but lacks the ability to prevent over-voltage.
approach. This MPPT technique also avoids Therefore if shunt regulator is used, constant

Table 11.5 Summary of different DC-DC converters


Type Conversion direction (Up/Down) Peak power efficiency Output voltage
Linear regulator Down Low Adjustable by reference voltage
Switching converter Up/Down High Adjustable by clock duty cycle
Switched- capacitor Up/Down Medium Adjustable by clock duty cycle

Fig. 11.7 Basic working principle of series voltage regulator


11 Energy Harvesting 325

Fig. 11.8 Basic working principle of shunt voltage regulator

Fig. 11.9 Reconfigurable


switch for photovoltaic
energy harvesting

power will be drawn from battery despite the For good regulation, VREF must be stable across
load power being low. This is disaster especially VDD variations. Voltage Sample element
for low duty long working hour IoT application monitors VDD and translates it into a level equal
because battery charge depletes quickly. This to the VREF for a desired regulated voltage.
scenario is exactly how the conventional wisdom Variations in VDD cause feedback which changes
i.e. series regulator is more efficient is being VSAMPLE to some value greater than or less than
derived. Shunt voltage regulator is less power the VREF. ΔVSAMPLE causes comparator to drive
efficient but it can protect the circuit from over- series or shunt element which would respond
voltage, which is useful when interfacing with appropriately to correct the output voltage
high voltage low input energy source. However, change.
linear regulator can only step down input voltage Besides the textbook classic series and shunt
to generate a lower output voltage. The best voltage regulator, there is also a unique case of
conversion power efficiency possible is VO/VS, linear (resistive) based power converter circuit
which implies that this type of regulator is con- worth mentioning, which is a reconfigurable
sidered efficient only when the ratio of conver- switch to control an array of PVCs (Lee et al.
sion is close to unity. 2016). By using low impedance switches and
Voltage Reference element forms the founda- basic digital logic control, as shown in
tion of linear regulator since VDD must be either Fig. 11.9, high power conversion efficiency
equal or a multiple of reference voltage VREF. (>90%) is achieved to control PVC array since
326 Y.-K. Teh and P.K.T. Mok

there is essentially zero switching loss. This cir-


cuit works under the assumption that the total,
stacked output voltage of PVC array is always
sufficient to direct charge a rechargeable battery.

11.3.2 Switching Converter


and Switched-Capacitor
(SC) Power Converter in Energy
Harvesting Application

As discussed in Chap. 10, switching converter


involves three different electrical components
i.e. switches SN, inductor LN, diode DN, with
the objective of converting a given supply volt-
age VIN, to output voltage VOUT. There are three
basic configurations possible, as illustrated in
Fig. 10.2 of previous chapter. Buck converter
shown in Fig. 11.2a (Chan and Mok 2011), is
suitable for high voltage energy sources (PVC
and PEH) where VOUT < VIN. Figure 11.2b is a
Boost converter shown in Fig. 11.2b (Man et al.
2008), is suitable for low voltage energy
sources (TEG and MFC) where VOUT > VIN. Fig. 11.10 Basic working principle of switched-
capacitor DC-DC converter for (a) step-down, (b) step-
Buck-boost converter Chew et al. (2013), up, (c) invert
shown in Fig. 11.2c is suitable for hybrid energy
harvester (combination of both high and low
voltage sources) where VOUT can be larger than regulator and switching. Similar to the switching
or less than VIN. All output-input voltage rela- counterpart, as shown in Fig. 11.10, there are
tionship of these converters can be adjusted by three basic switched-capacitor configurations,
changing the switching clock duty cycle. The which allow designers to step down (Jiang et al.
advantage of switching converter is the high 2014), step up (Su and Ki 2007) and invert
conversion power efficiency, which can be up (Wang and Wu 1997) a given voltage. Typically
to 90%, achievable across different conversion 50% duty 2-phase (Φ1, Φ2) clock is used. The
ratio. The disadvantage is the requirement of invert charge pump is useful in generating nega-
off-chip inductor, which is bulky and expensive. tive voltage, which is instrumental in
Fully integrated on-chip switching converter is a sub-threshold level power converter to suppress
popular on-going research topic (Huang and Mok start-up system leakage (Kim et al. 2014).
2013a, b).
An alternative to using inductor is by shifting
to the other reactive circuit element, 11.3.3 Maximum Power Point
i.e. capacitor. Switched-capacitor (SC) power Tracking in Energy Harvesting
converter is also known as charge pump. The
main advantage is that fully integrated on-chip Due to the volatile nature of energy harvesting
implementation is possible and bi-directionality source, it is important that the interfacing power
of conversion. For applications where bulky management unit maximizes the charge extrac-
off-chip components must be avoided, charge tion from the source, to minimize system down
pump is a good compromise between linear time before the loading IoT node is ready to
11 Energy Harvesting 327

Fig. 11.11 Conventional


maximum power point
tracking algorithm

perform the next duty-cycled workload. The constant. To ensure absolute variability is less
maximum power point is reached when the inter- than 5%, simulations based on the assumption
nal resistance of energy harvesting source is RS varies by 20% (an excessive approximation),
matched to the effective load connected. PVC the tracked MPP voltage is only off by 5%. For
often has non-linear peak output power point very high performance system, designers need to
but TEG is linear, where the maximum power incorporate additional temperature sensor and
point is always half of the open-circuit voltage. A look-up table containing the temperature varia-
typical conventional maximum power point tion information RS to compensate for all process,
tracking (MPPT) algorithm for thermoelectric voltage and temperature (PVT) variation
harvesting, as shown in Fig. 11.11 for a switching accurately. Since typical sampling pulse is of a
converter (Im et al. 2012) requires sampling of few hundreds μs to ms range, digital clock can be
TEG open circuit voltage (VOC), and then run at MHz (100s of ns) clock range. The power
modifies the DC-DC converter duty cycle to consumption of such on-chip MPPT circuit is
change the input voltage to match the 0.5 VOC. estimated to be within the ballpark of 100 μW
This approach needs to periodically switch off when active. If MPPT algorithm is allowed to be
converter so that VOC can be sampled and run at low duty cycle (once for every few second,
updated according to environmental change. since temperature change is typically slow), aver-
The alternative, perturb and observe (P&O) age power can be reduced to microwatts range.
MPPT method is not commonly used in Another continuous-time MPPT method, as
microwatts systems due to the high power con- shown in Fig. 11.13, is to observe the output
sumption of voltage/current sensor. power via sampling the voltage accumulated
Since internal resistance RS is a known con- over an on-chip capacitor. Within fixed clock
stant value with TEG, we can achieve period, the charge pump clock varies according
continuous-time tracking of MPP by setting the to input voltage and hence the voltage-domain
input impedance to match RS, without the need to sensing is shifted to time-domain sensing, i.e. the
sample VOC. The associated algorithm is shown higher power harvester is translated to shorter
in Fig. 11.12. This idea effectively leverages on rising time. This approach effectively
the advantage of a clock-power stage integrated circumvents the power hungry voltage/current
boost converter reported in (Teh and Mok sensor problem (Liu et al. 2016).
2014a).
The input impedance of this boost converter
can be derived from timing of the self-generating 11.3.4 Unique “Investment” Concept
pulse. This assumption is valid as the temperature in Piezoelectric Harvesting
variation of RS of thin-film TEG (Micropelt) is
less than 1% of its nominal value, even when ΔT Kinetic energy in motion is attractive because
is up to 60  C. However, since ΔT in typical vibrations are abundant in the environment. This
energy harvesting application is small, such vari- is why piezoelectric transducers are popular
ance is acceptable and RS can be regarded today, and because they generate more power
328 Y.-K. Teh and P.K.T. Mok

Fig. 11.12 Continuous-


time maximum power point
tracking algorithm for
switching topology based
on (Teh and Mok 2014a)

Fig. 11.13 Time-domain maximum power point tracking algorithm for switched-capacitor power converter based on
(Liu et al. 2016)

from motion under similar space constraints work. This way, the circuit eventually draws div-
than their electrostatic and electromagnetic idend from the investment, i.e. extracting more
counterparts. In piezoelectric harvesting, there is power from the transducer after investing fixed
a unique harvesting technique that is not present amount of battery energy into PEH (Kwon and
in other energy harvesting modes which is the Rincon-Mora 2014). The operation waveforms of
concept of investment. The idea is to strengthen such piezoelectric harvesting mechanism are
the electrostatic force against which vibrations shown in Fig. 11.14.
11 Energy Harvesting 329

2
VPZ(PK)+

0
[V]

tVIB VPC
-2
VPZ

VPZ(PK)-
-4

-6
0 1 2 3 4 5 6 7 [ms]
40
0.5pÖ(LHCPZ)
iL
Current [mA]

20 tH- tCHC
-iBAT
0
-iBAT
-20 t1 tH+ tPC
iL
-40 Invest Harvest
0 2 4 6 8 [ms] 0 2 4 6 8 [ms]

Fig. 11.14 Energy investing concept in piezoelectric harvesting (Kwon and Rincon-Mora 2014)

excitation voltage during shock, and then, as


11.4 Adaptation to Source and Load shown in Fig. 11.15. The idea is to stack multiple
capacitors in series during high voltage transient,
11.4.1 One-Shot Input Voltage and then redistribute the charge accumulated in
in Piezoelectric Harvesting parallel by reconfiguring the stacked capacitors,
subsequently lowering the accumulated voltage.
Typically, piezoelectric harvesters generate the
most power when they vibrate at their resonant
frequency. Unfortunately, motion is not always
consistent or periodic. In many applications, in 11.4.2 Shunt Regulator as Protection
fact, vibrate in response to one-shot shocks, or Circuit
repeated impact. Many related work such as
(Kwon and Rincon-Mora 2014) discussed earlier Under extreme low input stimulus as shown in
can still function with one-shot or shocks but are Table 11.2, the output power from miniaturized
not optimized for such operation mode. One PVC and PEH is of microwatts range, albeit
interesting work (Yang et al. 2015a) is to use output voltage at open circuit condition of these
switched capacitor topology to absorb the high sources can be as high as 8 V. Considering the
330 Y.-K. Teh and P.K.T. Mok

Fig. 11.15 Series-parallel


switched-capacitor to
absorb high voltage over-
shoot (Yang et al. 2015a)

burst mode nature of incoming power, isolation has higher gain which helps shunting extra current
diodes and interfacing power IC are especially to ground. In the default implementation, RCOMP
vulnerable to breakdown. In standard CMOS represents the equivalent resistance of transistor
0.13-μm process, the nominal voltage is only drain-body junction leakage component. To fur-
1.2 V for thin oxide transistor and 3.3 V for ther adjust the targeted output voltage after fabri-
thick oxide transistor. Obviously input protection cation, the value of resistor RCOMP can be adjusted
circuit needs to be included to avoid overvoltage by having an additional bond pad to access the
breakdown caused by such high harvester output VSHUNT node and connect an additional off-chip
voltage. Conventionally series regulator such as resistor. This off-chip resistor is only able to com-
low dropout regulator (LDO) is preferred over pensate for the process variation whenever RCOMP
shunt regulator for higher power efficiency but is higher than the simulated value. Note that this
this topology does not provide any input protec- circuit is susceptible to CMOS process variation
tion. When input voltage is as high as 8 V, the and can only offer coarse voltage regulation with
differential voltage across the dropout transistor 100 mV margin of error.
is 5 V, if the output voltage is set to 3 V; this
condition exceeds the drop out transistor break-
down limit. Shunt regulator is a straight forward 11.4.3 Multiple Input Multiple Output
solution to protect the load from the sudden volt- Voltage Requirements
age surge of the energy harvesting source and
this cause reduction in power efficiency due to Existing energy harvesting systems is commonly
the constant quiescent power drawn by a shunt powered by single energy harvesting source with
regulator. photovoltaic cell being the most popular and
To build a low quiescent shunt regulator, matured technology. Nevertheless a single
Schmitt-Trigger as shown in Fig. 11.16 can be source energy harvesting system is often inter-
deemed as the ideal candidate to optimize power mittent and unpredictable in term of power
consumption in this context (Teh and Mok generated. Instead of increasing storage capacitor
2014b). Conventional “problem” i.e. more gain or battery size, recent research direction is to
for positive cycle and less gain for negative pursue energy harvesting schemes using multiple
cycle is paradoxically again becomes favorable energy sources. The key challenge is the design
for shunt regulator application. Negative cycle of of interfacing power management circuit that
input voltage implies insufficient power from efficiently combines many energy sources into
energy harvesting source; subsequently lowering common storage device. Combining the power
gain of Schmitt-Trigger and cause more current to from multiple time-varying sources typically
be diverted to the load instead of being shunted to requires a sophisticated control system.
ground via PSHUNT. During positive cycle of input A capacitor is analogous to storage reservoir
voltage (more harvested power), Schmitt-Trigger for electrical charge carriers, as a water bucket to
11 Energy Harvesting 331

Fig. 11.16 Digital logic based implementation of shunt voltage regulator (Teh and Mok 2014b)

water flow. The capacitor leakage current can be applied voltage characteristics of such emerging
compared to the water seeping through the water capacitor type is shown in Fig. 11.17, of which
bucket. A capacitor will never become fully leakage current is exponentially dependent on
charged if the leakage current is greater than or DC bias.
equal to the supply current. Designer needs to Connecting capacitors in series results in the
take extra caution especially when these voltage being split between the capacitors and in
capacitors are to be charged by microwatts turn this is influenced by the leakage current
power source as the charging current are compa- difference between the individual capacitors in
rable to leakage current. Recently, molecularly a series, as shown in Fig. 11.18. The leakage
thin film capacitor of 28 nm thickness (claimed current component can be modeled as a parallel
to be better than graphene) and capacitance den- resistor to the capacitor. The leakage current
sity of 27.5 μF/cm2, which is a factor of three differences become apparent when the circuits
orders better than commercial products in mar- are activated in the form of overvoltage on the
ket, was fabricated using oxide nanosheets component with the lowest leakage current.
(Wang et al. 2014). Leakage current versus Since considerable fluctuations are found
332 Y.-K. Teh and P.K.T. Mok

between individual capacitors (even from the the system since leakage current can vary up to
very same production run) in terms of their leak- two order of difference.
age currents and capacitance, it is possible that The simplest form of balancing circuit is a
large voltage differences may occur and one fixed value resistor parallel to each capacitor.
capacitor in the stacks will sustain higher voltage However, this method merely provides balanced
despite all having the same nominal capacitance. voltage but not voltage regulation. To obtain
It is therefore important that both capacitance both balancing and regulated voltage, a low qui-
and leakage current differences are balanced in escent current shunt regulator (with minimal
leakage current in its internal voltage reference
and parasitic junction diodes) is a better candi-
date. A proposed scheme essentially emulates a
water bucket fountain by envisioning capacitor
as individual water bucket; as illustrated in
Fig. 11.18 (Teh and Mok 2014b). Incoming
charge from high voltage low current (HVLI)
energy harvesting sources i.e. PVC and PEH
will charge up stack capacitors directly. Once
capacitor voltage reaches the set value, excess
charge will be pushed to next capacitor below,
and eventually to ground if all capacitors are
fully charged to preset ΔV value i.e. 1 V. This
operation is similar to push pull shunt regulation
discussed in (Alon and Horowitz 2008). Diodes
can be used to isolate incoming power of PVC
Fig. 11.17 Leakage current versus voltage profiles of the and PEH.
ultrathin capacitor (Wang et al. 2014)

Fig. 11.18 Stacked capacitor energy harvesting scheme (Teh and Mok 2014b)
11 Energy Harvesting 333

time-multiplexing nature and the control is


deemed complicated.

11.5 Voltage Limits and Cold Start

The caveat of employing the new generation,


seemingly advantageous TEG is the increased
RTEG, which has lower output current. Such
device characteristics put further design con-
straint, on top of the low voltage start-up opera-
tion, which is already a challenging task to solve.
The main problem for a boost converter circuit to
Fig. 11.19 Single-inductor multiple-input-multiple-out- self-start-up is the limited voltage level
put energy harvesting scheme (Bandyopadhyay and generated by the TEG source, which is often
Chandrakasan 2012) below CMOS transistor’s threshold voltage (typ-
ically 300–500 mV). For body-wearable applica-
By combining the idea of divided series tion with temperature difference of 1–2  C,
capacitor voltage to provide multiple voltage minimum output voltage can be as low as
level, by connecting multiple capacitors in series, 25 mV. Therefore circuit designers face a
multiple voltage levels can be generated. Top catch-22 situation, where the dilemma is to boot-
capacitor (3 V) can power high voltage circuits strap the boost converter in generating an output
(I/O, sensors, ADC, analog and RF) and bottom voltage of 1 V or more to power up other periph-
capacitor (1 V) can power low voltage low power eral circuits such as sensors, analog-to-digital
digital circuits. In this proposed scheme, high converter, digital baseband and RF transceiver.
incoming power can be driven directly by the Since voltage up conversion always requires
primary power source when it is available. oscillation clock (which is generated on-chip or
However when ambient power is insufficient, off-chip), as shown in Figs. 11.20 and 11.22.
capacitor stacks will act as “flying-battery” con- Therefore the solution to start-up problem funda-
figuration (Alon and Horowitz 2008) to supply mentally boils down to designing a sub-threshold
high power, short interval required by IoT sensor oscillation clock generator at the lowest voltage
node since each capacitor-shunt regulator pair possible. Many recent innovations in energy
forms a standalone “battery”. All capacitors in harvesting therefore focus on optimization of
the stack therefore share a common DC current cold-start conditions in no-battery or dead-
when IoT node draws current. Additional capac- battery situations.
itor can be installed in parallel fashion if greater The minimum operation voltage of a binary
energy is demanded. switching signal transfer limit is known as the
Besides the stacked capacitor approach, Meindl limit (Meindl and Davis 2000) where the
another popular research direction such as theoretical minimum operation supply voltage
(Bandyopadhyay and Chandrakasan 2012) is to (VDD) is 36 mV (where 1’s and 0’s in signals
pursue energy harvesting schemes using multiple are discernible) for ring oscillator implemented
energy sources and simultaneously generate mul- in standard CMOS technology. Another subtle
tiple output voltage, using single inductor as the start-up problem arises for systems employing
core power conversion element, shown in TEG with high RTEG, significant voltage drop
Fig. 11.19. However, for such single inductor across RTEG and low current reduces transistor’s
multi-input-multi-output topology, only one gm. This jeopardizes the feedback gain required
source can be polled at a time due to the implicit by oscillators, which need relatively high current
334 Y.-K. Teh and P.K.T. Mok

Fig. 11.20 (a) RF-pulse driven charge pump (b) On-chip clock driven charge pump

to function despite the low VDD requirement reverse bias for CMOS 130 nm process is
(Teh and Mok 2014a). approximately 0.3 V, as shown in Fig. 11.21.
To accomplish voltage boost conversion, lin- The control circuit of the microwatts power
ear regulator which is purely resistive is out of single inductor switching converter is mainly
consideration since it can only perform voltage digital-based, using an on-chip oscillator and
down conversion. Switching converter has the counter controlled pulse width to achieve zero
lowest start-up voltage (21 mV) and has the current switching (ZCS) as shown in Fig. 11.22.
best power efficiency at the cost of bulky exter- Oscillator generates a clock signal, with a partic-
nal inductor. To achieve the lowest start-up volt- ular frequency and duty cycle, which drives the
age, low threshold or zero threshold transistors low-side switch, SWLS. The same clock signal is
should be used. However, power efficiency at then delayed and used to generate pulses that
higher output voltage will be lowered due to the drive the high-side switch SWHS. The pulse
transistor off-state leakage loss, which is width is controlled by digital counter value.
inversely proportional to threshold voltage. Adaptive dead-time control is employed to
From empirical data reported in recent works, minimized synchronization mismatch loss. The
CMOS 130 nm process seems to be the optimal minimum start-up voltage of early low voltage
choice of technology. Adopting more expensive TEG-powered switching converter is around
CMOS process with smaller transistor size, 600 mV. However, the peak power efficiency is
renders diminishing return-over-cost for the high, reaching 75% and lowest input voltage is
sake of achieving lower start-up voltage or better 20 mV once activated (Carlson et al. 2010).
power efficiency due to higher transistor leakage Transformer-based switching converters using
loss. In systems with negative supply voltage either high turns-ratio (Im et al. 2012) or unity
(Teh and Mok 2014b), transistor leakage loss turn-ratio pulse transformer (Teh and Mok
can be reduced by applying negative gate volt- 2014a) further reduce the start-up voltage down
age. However, larger magnitude of negative bias to 21 mV, due to minimal overhead control cir-
does not suppress leakage current further due to cuit required and the embedment of oscillator
gate-induced-drain-leakage (GIDL) effect into the power stage. With an extra diode and
(Chatterjee et al. 2003). For example, the optimal output capacitor, (Teh and Mok 2014a) can be
11 Energy Harvesting 335

Fig. 11.21 Off-state


leakage current versus
reverse bias voltage for
CMOS 130 nm transistors

conductance loss (Kim et al. 2014). To eliminate


the overhead of separate oscillator circuit, a self-
oscillating SC converter is proposed (Jung et al.
2014). Hybrid converter refers to systems which
employ both switching and SC converters in the
system. LC-tank oscillator is used to generate
sinusoidal clock (Bender Machado et al. 2014;
Weng et al. 2013), which then drives a Dickson
Fig. 11.22 Zero Current Switching (ZCS) control charge pump similar to those found in RFID tag
RF-DC rectifier (Yi et al. 2009). (Fuketa et al.
reconfigured to give bipolar outputs up to 3 V 2014) shows a fully integrated version using simi-
(Teh and Mok 2014b). lar concept using on-chip transformers. Systems
Compared to the conventionally one-stage in (Chen et al. 2012b; Shrivastava et al. 2014) first
structure of switching converter, SC power con- self-starts using SC converter, which then drive
verter usually comprises multiple passive diode- another switching converter to provide the final
based or active gate driven stages due to the output voltage. In (Bandyopadhyay et al. 2014),
topology-constrained, finite voltage gain per primary power converter is switching converter-
stage, as shown in Fig. 11.20. Similar to based but SC converter is used to generate sepa-
switching converter, the minimum start-up of rate voltage to reduce high-side switch leakage. In
switched-capacitor power converter is bound by (Teh et al. 2014), SC converter is used to supple-
the minimal supply voltage required by its ment a SI converter, i.e. to transfer extra energy
on-chip oscillation clock generation. By having not consumed by the load to another secondary
more switches and sub-threshold gate drive volt- storage capacitor as backup.
age, SC power converter’s power efficiency at Different self-start-up mechanisms reported
low voltage is lower than switching converter, in recent boost converter works are illustrated
especially for the fully integrated versions due to in Fig. 11.23. Performance of the representative
poor on-chip passives. Design techniques pro- papers of each start-up mechanism such as mini-
posed to improve power efficiency of SC con- mum start-up voltage, peak power efficiency and
verter include adaptive dead-time control which component requirements are summarized in
alleviates shoot-through loss, negative voltage Table 11.6. An interesting trend observed in
gate drive and substrate bias control to reduce research efforts to reduce start-up power is that
336 Y.-K. Teh and P.K.T. Mok

Fig. 11.23 Different self-start-up boost converter circuit topologies

Table 11.6 Performance summary of different topologies


Performance
Min. start-up Peak power CMOS
Start-up mechanism & converter type voltage efficiency External components process
Ring oscillator 220 mV/ 83%/58% 10 μH inductor 130 nm
SC + Switching converter Shrivastava et al. 70 mV
(2014)/Goeppert et al. (2015)
Ring oscillator 120 mV 39% None 65 nm
SC converter Chen et al. (2012a)
Ring oscillator 150 mV 73% 6  10 nF capacitors 130 nm
SC converter Kim et al. (2014)
LC oscillator+Switching converter Weng et al. 50 mV 73% 2  2 μH, 27 μH, 65 nm
(2013) 100 μH inductors
Enhanced swing LC oscillator+SC converter 10 mV NA 2  9.5 μH, 130 nm
Bender Machado et al. (2014) 2  950 μH inductors
LC-transfromer oscillator+SC converter Fuketa 85 mV 1.5% None 65 nm
et al. (2014)
Transformer-LC oscillator+Switching 40 mV 61% 1:60 transformer 130 nm
converter Im et al. (2012)
Pulse transformer oscillator+Switching 21 mV 74% 1:1 transformer 130 nm
converter Teh and Mok (2014a)
11 Energy Harvesting 337

Fig. 11.24 (a) Leakage-based thyristor (b) Parasitic BJT to form equivalent thyristor (c) thyristor I-V curve (d)
thyristor-based relaxation oscillator

non-conventional parasitic element in the CMOS


process, such as a leakage based thyristor, using
parasitic BJT junctions in the CMOS process, as
shown in Fig. 11.24 is being used to design
on-chip oscillator. The power consumption falls
within the ballpark of sub-nanowatts (Liu et al.
2016). Ring oscillator based on a leakage-
optimized Schmitt-Trigger cell shown in
Fig. 11.25 is also proven the capability to start
oscillating at only 70 mV supply voltage
(Goeppert et al. 2015), without additional
trimming as shown in (Chen et al. 2012b).

11.6 Passives in Energy Harvesting

Transformer-based systems often yield the best


(lowest) self-start-up voltage performance, the Fig. 11.25 Leakage-optimized Schmitt-Trigger cell
inter-winding capacitance and series resistance (Goeppert et al. 2015)
338 Y.-K. Teh and P.K.T. Mok

Table 11.7 Electrical characteristics of high turns-ratio vs. unity turn-ratio transformer
Self-start-up boost LC transformer oscillation Pulse transformer oscillation
converter topology
Transformer type Non-standard, vendor specific Standard (Ethernet application)
Coilcraft LPR6235 Murata 78601/9C
Turns-ratio 1:100 1:1
Coil resistance 316 Ω 1.3 Ω
Inter-winding capacitance 638 pF 121 pF
Inductance 7.5 μH:75 mH 10 mH:10 mH
Coupling co-efficient (1.0 ¼ perfect) 0.95 0.9999
PCB board space 6  6 mm 9  9 mm

of transformer coils are increased as transformer


size is shrunk, as shown in Table 11.7. High 11.7 Perspectives and Trends
turns-ratio transformer also has more flux leak- on Energy Harvested IoT
age than pulse transformer besides the increased Nodes
parasitic capacitance and resistance. On-chip
transformer has lower quality factor (below 20) This chapter has presented a general overview of
compared to off-chip inductors which is 80 or low power energy harvesting and the associated
higher (Moazenzadeh et al. 2015). Another power management circuit techniques reported
important external component is Schottky in the recent decade. Designers need to trade
diode, where lower turn-on voltage and smaller off the choice of circuit topology, external
junction capacitance is always desired. Besides, component and CMOS process, based on the
due to high voltage conversion ratio (high SI application requirement i.e. minimal start-up
boost converter duty cycle), output capacitor is voltage, power efficiency and system size. The
mostly in self-discharge state during the time availability and power profile (determined
switching cycle, making capacitor leakage loss by source resistance) of energy harvesting source
an important consideration. Capacitor with used in the application also limits the choice of
higher rated voltage should be used to reduce circuit topologies. Further innovation and major
capacitor leakage loss which is exponentially breakthrough will require first, the advancement
dependent on biased voltage. Further miniaturi- of CMOS process with better leakage reduction
zation of the system size is yet another research and second, miniaturized integrated magnetic
direction. For instance, advanced three- and capacitive components with higher density
dimension die stacking packaging technology and quality factor.
(Yu et al. 2011) can further reduce system The development of next generation energy
board area. Novel wire bonding techniques harvesters and energy storage using flexible
(Raimann et al. 2012) and new transformer mate- materials is an emerging research topic (Koo
rial (Camarda et al. 2015; Raimann et al. 2012) to et al. 2012; Zhong et al. 2014). Despite enjoying
realize miniaturized transformer on PCB would the convenience of being wearable and bendable,
also be interesting research topics to explore. these emerging devices also pose new design
On-chip capacitors are often associated with bot- challenges for circuit designers, due to the inter-
tom plate parasitic that degrades the switched- changeable roles of energy storage and
capacitor power efficiency. However, the avail- generators in operation, and also the fluctuating
ability of deep trench capacitor has managed to electrical characteristics. In recent years, 130 nm
push the power efficiency up to 90% levels (Jung and 65 nm technology are commonly chosen for
et al. 2014). prototype implementation as both are considered
11 Energy Harvesting 339

as optimum choice of CMOS technology for M. Bender Machado, M. Sawan, M. Cherem Schneider,
ultra-low voltage circuit design. Advanced C. Galup-Montoro, 10 mV—1V step-up converter for
energy harvesting applications, in Proceedings of the
CMOS technologies with shorter channel length Symposium on Integrated Circuits and Systems Design
have diminishing returns in advantages from the (SBCCI) (2014), pp. 1–5.
perspective of low voltage analog circuits A. Camarda, A. Romani, E. Macrelli, M. Tartagni, A
(Galup-Montoro et al. 2012). However, it is still 32 mV/69 mV input voltage booster based on a piezo-
electric transformer for energy harvesting
necessary to explore digital-based reprogramma- applications. Sensor. Actuat. A Phys. 232, 341–352
ble power management integrated circuits such (2015)
as digital LDO (Yang et al. 2015b) which are E.J. Carlson, K. Strunz, B.P. Otis, A 20mV input boost
capable of adapting to such unique device converter with efficient digital control for thermoelec-
tric energy harvesting. IEEE J. Solid State Circuits 45
characteristics during in-circuit operation, espe- (4), 741–750 (2010)
cially under sub-1 V operations. Further study in M.P. Chan, P.K.T. Mok, Design and implementation of
designing an advanced reconfigurable charge fully integrated digitally controlled current-mode
pump topologies (Jung et al. 2014) which have buck converter. IEEE Trans. Circuits Syst. I 58(8),
1980–1991 (2011)
multiple output, self-driven clock and balances B. Chatterjee, M. Sachdev, S. Hsu, R. Krishnamurthy,
bi-directionality of power flow would also be S. Borkar, Effectiveness and scaling trends of leakage
instrumental. control techniques for sub-130 nm CMOS
The control of multi-input-multi-output technologies, in Proceedings of the International.
Symposium on Low Power Electronics and Design
energy harvesting system typically requires addi- (ISLPED) (2003), pp. 122–127
tional overhead power, i.e. first sense the avail- P.-H. Chen, K. Ishida, X. Zhang, Y. Okuma, Y. Ryu,
ability of power from each source, then compute M. Takamiya, T. Sakurai, A 120-mV input, fully
and optimize the multi-source time multiplexing integrated dual-mode charge pump in 65-nm CMOS
for thermoelectric energy harvester, in Asia and South
on the power converter. Similar to scarcity prob- Pacific Design Automation Conference (ASP-DAC)
lem in wireless communication channel alloca- (2012), pp. 469–470.
tion, it is foreseen that game theoretic strategies P.-H. Chen, X. Zhang, K. Ishida, Y. Okuma, Y. Ryu,
and models can be employed to analyse M. Takamiya, T. Sakurai, An 80 mV startup dual-
mode boost converter by charge-pumped pulse gener-
harvested power supply and demand budget, ator and threshold voltage tuned oscillator with hot
and to optimize the converter utilization time carrier injection. IEEE J. Solid State Circuits 47(11),
allocation problem in a multi-source power con- 2554–2562 (2012b)
verter. Hence a software-hardware co-design K.W.R. Chew, Z. Sun, H. Tang, L. Siek, A 400 nW
single-inductor dual-input-tri-output DC-DC buck-
power management approach is foreseen to be boost converter with maximum power point tracking
the future trend of energy harvested IoT nodes. for indoor photovoltaic energy harvesting, in IEEE
International Solid-State Circuits Conference Digest
of Technical Papers (2013), pp. 68–69.
M. Danesh, J.R. Long, An autonomous wireless sensor
References node incorporating a solar cell antenna for energy
harvesting. IEEE Trans. Microw. Theory Tech. 59
(12), 3546–3555 (2011)
E. Alon, M. Horowitz, Integrated regulation for energy-
Enocean, Solar Cells ECS300 and ECS310 [Online].
efficient digital circuits. IEEE J. Solid State Circuits
http://www.enocean.com/en/enocean_modules/ecs-
43(8), 1795–1807 (2008)
300-data-sheet-pdf
S. Bandyopadhyay, A.P. Chandrakasan, Platform archi-
H. Fuketa, Y. Momiyama, A. Okamoto, T. Sakata,
tecture for solar, thermal and vibration energy com-
M. Takamiya, T. Sakurai, An 85-mV input, 50-μs
bining with MPPT and single inductor. IEEE J. Solid
startup fully integrated voltage multiplier with passive
State Circuits 47(9), 2199–2215 (2012)
clock boost using on-chip transformers for energy
S. Bandyopadhyay, P.P. Mercier, A.C. Lysaght,
harvesting, in Proceedings of the European Solid
K.M. Stankovic, A.P. Chandrakasan, A 1.1 nW energy
State Circuits Conference (ESSCIRC) (2014),
harvesting system with 544pW quiescent power for
pp.263–266
next-generation implants, in Proceedings of the IEEE
C. Galup-Montoro, M.C. Schneider, M.B. Machado,
International. Solid State-Circuits Conference
Ultra-low voltage operation of CMOS analog circuits:
(ISSCC) (2014), pp. 396–397
340 Y.-K. Teh and P.K.T. Mok

amplifiers, oscillators, and rectifiers. IEEE Trans. X. Liu, E. Sanchez-Sinencio, A single-cycle MPPT
Circuits Syst. II Exp. Briefs 59(12), 932–936 (2012) charge-pump energy harvester using a thyristor-
J. Goeppert, Y. Manoli, Inductive boost converter for based vco without storage capacitor, in IEEE Interna-
thermoelectric energy harvesting with fully integrated tional Solid-State Circuits Conference Digest of Tech-
start-up at 70 mV, in Proceedings of the European nical Papers (2016), pp. 364–365
Solid State Circuits Conference (ESSCIRC) (2015), D. Ma, W.H. Ki, C.Y. Tsui, P.K.T. Mok, Single-inductor
pp. 233–236. multiple-output switching converters with time-
E.N.Y. Ho, P.K.T. Mok, Wide-loading-range fully multiplexing control in discontinuous conduction
integrated LDR with a power-supply ripple injection mode. IEEE J. Solid State Circuits 38(1), 89–100
filter. IEEE Trans. Circuits Syst. II 59(6), 356–360 (2003a)
(2012) D. Ma, W.H. Ki, C.Y. Tsui, A pseudo-CCM/DCM SIMO
C. Huang, P.K.T. Mok, A 100 MHz 82.4% efficiency switching converter with freewheel switching. IEEE
package-bondwire based four-phase fully-integrated J. Solid State Circuits 38(6), 1007–1014 (2003b)
buck converter with flying capacitor for area reduc- P.I. Mak, R.P. Martins, High-/mixed-voltage RF and ana-
tion. IEEE J. Solid State Circuits 48(12), 2977–2988 log CMOS circuits come of age. IEEE Circuits Syst.
(2013a) Mag. 10(4), 27–39 (2010)
C. Huang, P.K.T. Mok, An 84.7% efficiency 100-MHz T.Y. Man, P.K.T. Mok, M. Chan, A 0.9-V input
package bondwire-based fully integrated buck con- discontinuous-conduction-mode boost converter with
verter with precise dcm operation and enhanced CMOS-control rectifier. IEEE J. Solid State Circuits
light-load efficiency. IEEE J. Solid State Circuits 48 43(9), 2036–2046 (2008)
(11), 2595–2607 (2013b) Marlow Industries, TG12-2.5 Datasheet [Online]. http://
J.P. Im, S.W. Wang, S.T. Ryu, G.H. Cho, A 40 mV www.marlow.com/media/marlow/product/downloads/
transformer-reuse self-start-up boost converter with tg12-2-5-01l/TG12-2.5.pdf
mppt control for thermoelectric energy harvesting. J.D. Meindl, A.J. Davis, The fundamental limit on binary
IEEE J. Solid State Circuits 47(12), 3055–3067 (2012) switching energy for terascale integration (TSI). IEEE
J. Jiang, Y. Lu, W.H. Ki, Analysis of two-phase on-chip J. Solid State Circuits 35(10), 1515–1516 (2000)
step-down switched capacitor power converters, in Microgen, MEMS-based vibrational energy harvesting
IEEE Asia Pacific Conference on Circuits and Systems [Online]. http://www.microgensystems.co
(APCCAS), Ishigaki Island, Okinawa, Japan (2014), Micropelt, MPG-D751 Datasheet [Online]. http://www.
pp. 575–578 micropelt.com/down/datasheet_mpg_d651_d751.pdf
X. Jing, P.K.T. Mok, Power loss and switching noise A. Moazenzadeh, F.S. Sandoval, N. Spengler, V. Badilita,
reduction techniques for single-inductor multiple-out- U. Wallrabe, 3-D Microtransformers for DC-DC
put regulator. IEEE Trans. Circuits Syst. I 60(10), on-chip power conversion. IEEE Trans. Power Elec-
2788–2798 (2013) tron. 30(9), p5088 (2015)
W. Jung, S. Oh, S. Bang, Y. Lee, D. Sylvester, D. Blaauw, M. Raimann, A. Peter, D. Mager, U. Wallrabe,
A 3nW fully integrated energy harvester based on self- J.G. Korvink, Microtransformer-based isolated signal
oscillating switched-capacitor DC-DC converter, in and power transmission. IEEE Trans. Power Electron.
Proceedingds of the IEEE International Solid State- 27(9), 3996–4004 (2012)
Circuits Conference (ISSCC) (2014), pp. 398–399 P. Sarkar, S. Chakrabartty, Compressive self-powering of
J. Kim, P.K.T. Mok, C. Kim, A 0.15V-input energy- Piezo-floating-gate mechanical impact detectors.
harvesting charge pump with switching body biasing IEEE Trans. Circuits Syst. Regul. Pap. 60(9),
and adaptive dead-time for efficiency improvement, 2311–2320 (2013)
in Proceedings of the IEEE Internatioanl Solid A. Shrivastava, D. Wentzloff, B.H. Calhoun, A 10mV-
State-Circuits Conference (ISSCC) (2014), input boost converter with inductor peak current con-
pp. 394–395 trol and zero detection for thermoelectric energy
M. Koo, K. Park, S. Lee, M. Suh, D.Y. Jeon, J.W. Choi, harvesting, in Proceedings of the IEEE Custom
K. Kang, K.J. Lee, Bendable inorganic thin-film bat- Integrated Circuits Conference (CICC) (2014),
tery for fully flexible electronic systems. Nano Lett. 12 pp. 1–4
(9), 4810–4816 (2012) F. Su, W.H. Ki, Design strategy for step-up charge pumps
D. Kwon, G.A. Rincon-Mora, A single-inductor 0.35-μm with variable integer conversion ratios. IEEE Trans.
CMOS energy-investing piezoelectric harvester. IEEE Circuits Syst. II 54(5), 417–421 (2007)
J. Solid State Circuits 49(10), 2277–2291 (2014) Y.K. Teh, P.K.T. Mok, Design of transformer-based boost
I. Lee, W. Lim, A. Teran, J. Philips, D. Sylvester, converter for high internal resistance energy
D. Blaauw, A > 78% efficient light harvester over harvesting sources with 21 mV self-start-up voltage
100-to-100klux with reconfigurable PV-cell network and 74% power efficiency. IEEE J. Solid State
and MPPT circuit, in IEEE International Solid-State Circuits 49(11), 2694–2704 (2014a)
Circuits Conference Digest of Technical Papers Y.K. Teh, P.K.T. Mok, A stacked capacitor multi-
(2016), pp. 370–372 microwatts source energy harvesting scheme with
11 Energy Harvesting 341

86 mV minimum input voltage and 3 V bipolar F. Yang, P.K.T. Mok, A 0.6-1V input capacitor-less asyn-
output voltage. IEEE J. Emerging Sel. Top. Circuits chronous digital LDO with fast transient response
Syst. 4(3), 313–323 (2014b) achieving 9.5b over 500mA loading range in 65-nm
Y.K. Teh, P.K.T. Mok, A bipolar output voltage pulse CMOS, in European Solid State Circuits Conference
transformer boost converter with charge pump (ESSCIRC), Graz, Austria (2015), pp. 180–183
assisted shunt regulator for thermoelectric energy J. Yi, W.H. Ki, P.K.T. Mok, C.Y. Tsui, Dual-power-path
harvesting, in Proceedings of the IEEE Midwest Sym- RF-DC multi-output power management unit for
posium on Circuits and Systems (MWSCAS) (2014), RFID tags, in Proceedings of the IEEE Symposium
pp. 37–40 on VLSI Circuits (VLSIC) (2009), pp. 200–201
C.C. Wang, J.C. Wu, Efficiency improvement in charge A. Yu, J.H. Lau, S.W. Ho, A. Kumar, W.Y. Hnin, W.S. Lee,
pump circuits. IEEE J. Solid State Circuits 32(6), M.C. Jong, V.N. Sekhar, V. Kripesh, D. Pinjala,
852–860 (1997) S. Chen, C.-F. Chan, C.-C. Chao, C.-H. Chiu, C.-M.
H. Wang, G. Wang, Y. Ling, F. Qian, Y. Song, X. Lu, Huang, C. Chen, Fabrication of high aspect ratio TSV
S. Chen, Y. Tong, Y. Li, High power density microbial and assembly with fine-pitch low-cost solder
fuel cell with flexible 3D graphene-nickle foam as microbump for Si interposer technology with high-
anode. Nanoscale 5(21), 10283–10290 (2013) density interconnects. IEEE Trans. Compon. Packag.
C. Wang, M. Osada, Y. Ebina, B.-W. Li, K. Akatsuka, Manuf. Technol. 1(9), 1336–1344 (2011)
K. Fukuda, W. Sugimoto, R. Ma, T. Sasaki, C. Zhan, W.H. Ki, An output-capacitor-free adaptively
All-nanosheet ultrathin capacitors assembled layer- biased low-dropout regulator with subthreshold
by-layer via solution-based processes. ACS Nano 8 undershoot-reduction for SoC. IEEE Trans. Circuits
(3), 2658–2666 (2014). doi:10.1021/nn406367p Syst. I 59(5), 1119–1131 (2012)
P.S. Weng, H.Y. Tang, P.C. Ku, L.H. Lu, 50 mV-input C. Zhan, W.H. Ki, Analysis and design of output-
batteryless boost converter for thermal energy capacitor-free low-dropout regulators with low quies-
harvesting. IEEE J. Solid State Circuits 48(4), cent current and high power supply rejection. IEEE
333–341 (2013) Trans. Circuits Syst. I 61(2), 625–636 (2014)
J. Yang, M. Lee, M.J. Park, S.Y. Jung, J. Kim, A 2.5-V, J. Zhong, Y. Zhang, Q. Zhong, Q. Hu, B. Hu, Z.L. Wang,
160-μJ-output piezoelectric energy harvester and J. Zhou, Fiber-based generator for wearable electron-
power management IC for Batteryless Wireless ics and mobile medication. ACS Nano 8(6),
Switch (BWS) applications, in 2015 Symposium on 6273–6280 (2014). doi:10.1021/nn501732z
VLSI Circuits Digest of Technical Papers (2015),
pp. 282–283
Ultra-Low Power Analog Interfaces
for IoT 12
Jerald Yoo

This chapter addresses the challenges and design 12.1.1 Unique Environment
strategies in Analog Front-End (AFE) interface
circuit design with an umbrella of IoT. A strin- Let us cover in details several important
gent energy constraint in IoT means the circuit challenges that Interface circuit faces under IoT
specification must take into account the energy- environment. These challenges lead to the design
efficient operation. Also, at the same constraint, requirements and trade-offs.
the dynamic and static offset/noise compensation
Limited Power/Energy Source: As an example,
should be done effectively.
a state-of-the-art MEMS piezoelectric power
generator, thermoelectric generators (TEG) or
photovoltaic cells (PV) is capable of generating
12.1 Introduction
~100s of μW in a form factor smaller than 1 cm3;
let us not forget that these energy harvesting
Internet-of-Things (IoT) has variety of
sources are, most of time, not “always available”,
applications, including but not limited to, bio-
in other words, average power that we can draw
medical sensors, MEMS, environmental sensors
may be even smaller (See Table 12.1 for more
and sensor networks. The interface circuits
details). On the other hand, if we choose a coin-
bridges between the physical world and the elec-
cell battery, we have tens of mAh in the budget,
trical signal, and it is a crucial component in IoT
which is still very limited.
system. Common constraints on these
applications are limited form factor and the con- Noise and Offset: IoT interface circuit may have
sequent limited power/energy sources. There- near DC up to hundreds of kHz as its bandwidth.
fore, it is very important we understand the In this bandwidth, there are intruders that we
constraints and draw the specification of the refer to as “in-band noise”. As shown in
interface circuit that meets the IoT needs. Fig. 12.1, these include 1/f noise, 50/60 Hz
noise, thermal noise and static/dynamic offset.
In many cases, the noise well embeds the weak
signal. Hence noise-aware design is of crucial.
1/f Noise: Flicker noise, often referred to as 1/f
noise, is unavoidable in all CMOS amplifiers,
J. Yoo (*) caused by charge carriers randomly trapped/
Electrical and Computer Engineering, National
University of Singapore, Singapore 119077, Singapore
released between the gate oxide and the substrate.
e-mail: jyoo@nus.edu.sg

# Springer International Publishing AG 2017 343


M. Alioto (ed.), Enabling the Internet of Things, DOI 10.1007/978-3-319-51482-6_12
344 J. Yoo

Table 12.1 Estimated power output values per harvesting methods (Table from Belleville et al. (2009))
Source Source characteristics Physical efficiency Harvested power
Photovoltaic
Office 0.1 mW/cm2 10–24% 10 μW/cm2
Outdoor 100 mW/cm2 10 mW/cm2
Vibration/motion
Human 0.5 m@1 Hz Maximum power 4 μW/cm2
1 m/s2@50 Hz is source dependent
Industry 1 m@5 Hz 100 μW/cm2
10 m/s2@1 kHz
Thermal energy
Human 20 mW/cm2 0.10% 25 μW/cm2
Industry 100 mW/cm2 3% 1–10 mW/cm2
RF
GSM 900 MHz 0.3–0.03 μW/cm2 50% 0.1 μW/cm2
1800 MHz 0.1–0.01 μW/cm2

50/60
104 Interferences
Signal Magnitude [mV]

103 Respiration ECG

102
EMG
10 1 TIV @ 0.5mAp-p

100 Electrode
offset 1/f noise
10-1
0.1 1.0 10 100 1000
Frequency [Hz]

Fig. 12.1 Example IoT noise and signal bandwidth: case of wearable sensor applications (figure courtesy of Dr. Long
Yan, Samsung Electronics)

The average power of 1/f noise is given by 12.1.2 Design Requirement


Eq. (12.1): and Performance Metrics
κ 1
V 2n ¼  ð12:1Þ System Resolution: Depending on what are the
Cox WL f
back-end and post-processing needs, we can
The existence of 1/f noise is problematic decide the system resolution. For example, a wear-
because it overlaps with many IoT application’s able physiological sensor may require >10b signal
signal bandwidth; specifically, the 1/f nature of the accuracy, therefore the A/D Converter (ADC)
noise makes the dominant component near DC. As should have >10b resolution.
Eq. (12.1) shows, in order to decease the 1/f, we ADC Reference Voltage and Interface Amplifier
need to have larger device (with larger WL), or Gain: Once the ADC bit widths is decided, the
use dynamic offset cancellation technique, which Least-Significant Bit (LSB) system resolution can
will be discussed in detail in Sect. 12.3. be calculated by the ADC reference voltage. For
12 Ultra-Low Power Analog Interfaces for IoT 345

example, if 1.10 V is used as a ADC voltage consumption of the amplifier, dependent on the
reference, and the ADC is 10b, then the LSB noise spectral density requirement.
resolution becomes 1.07 mV. At this point, if we To probe this further, we can start the noise
want to have the minimal signal level that we can analysis with a Bipolar Junction Transistors
detect to be 1 μV, the interface amplifier need to (BJTs), which has a lower noise spectral density
have at least 60 dB gain; a 60 dB gain means we RTI than CMOS transistors have for a given bias
need a multi-stage amplifier. Note that the mini- current. For a single common-emitter amplifier, a
mal signal level is related to the noise floor, where short-circuit voltage noise density is well known
Signal-to-Noise Ratio (SNR) becomes 1. as in Eq. (12.3):
sffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi
Sampling Rate: ADC’s sampling speed is strictly
πqV 2t BW
related to the signal processing throughput at the Vni, RMS ¼ ð12:3Þ
back-end. Also, for the AFE perspective, this IC
impacts the anti-aliasing filter specification. Con-
Where Vni,RMS is the RMS value of the noise
sidering the stringent energy/power and form
RTI integrated over a bandwidth of BW, IC is the
factor (area), we need to choose the right order
collector current, q is the charge on an electron,
(roll-off factor). Choosing too high order will
and Vt is the thermal voltage of kT/q. With this,
increase the power and area consumption.
we can now compare the performance of the
Common Mode Rejection Ratio (CMRR): The different amplifiers to that of the theoretical
interface circuitry may be composed of off-chip limit of a single BJT; this is called the Noise
devices such as capacitors and resistors in the Efficiency Factor (NEF) (Steyaert et al. 1987):
signal path. In such cases, CMRR has two rffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi
components: on-chip AFE (CMRRAFE) and the I overall
NEF ¼ Vni, RMS ð12:4Þ
off-chip interface (CMRRIF), where the overall πV t 4kT BW
CMRRSYS is determined by Eq. (12.2) (Yoo 2014):
The NEF shows the trade-off between the
1 1 1 current consumption and the noise of an ampli-
¼ þ ð12:2Þ fier. Note that the NEF of a single BJT having
CMRRSYS CMRRAFE CMRRIF
only thermal noise equals to 1, where a differen-
It should be noted that in general off-chip tial BJT is √2 ¼ 1.414. Since we know that
devices have limited matching, which makes it CMOS amplifier in general have worse noise
challenging to have CMRRIF of over 60 dB; con- spectral density for a given bias current, we can
sequently, as Eq. (12.2) shows, however high the expect that NEF of a CMOS amplifier will be
CMRRAFE would be, the CMRRIF will determine higher than that of a BJT.
the overall CMRRSYS. The message here is clear:
when it comes to IoT system, integrate as many
components on-chip as possible.
12.2 Ultra Low Power/Energy
Noise Efficiency Factor (NEF): One of the per- Interface Design
formance metric we can use for the IoT AFE is
NEF. Due to variety of gains that different We should keep in mind that in IoT applications,
amplifiers have, it is not fair to compare the output we have a very tight power and energy con-
noise directly. Instead, we divide the output noise straint. As probed earlier, we have largely two
with the gain of the amplifier, and this is called the scenarios in power source: energy harvesting/
Referred-to-Input (RTI) noise. Note that noise scavenging, and battery. With this in mind, in
RTI exists regardless of technology node we use, this Sect. 12.2, we will take a closer look at the
and there is a theoretical lower limit on the power interface circuit design, in power and energy
346 J. Yoo

perspective. We will also cover design strategies average power consumption) of the circuity will
for low-cost, energy-efficient filter. determine the system lifetime. This means, when
the battery is the power supply, then minimizing
the “energy consumption”, or the average power
12.2.1 Power Vs. Energy: Choosing consumption, should be the target goal. System-
the Right Goal atic approach to achieving energy efficiency was
already covered in Sects. 1.3 and 2.3. Also, in
When an energy harvesting/scavenging is used Sect. 12.3, we will see the design strategies for
for the IoT power supply, it is very important to power- and energy-efficient analog interface cir-
design the circuitry not to exceed the instanta- cuit design.
neous current driving capability of the power
source. Table 12.1 shows the estimated output
values of energy harvesting (for different 12.2.2 Top-Down Approach for Power
sources). Depending on the IoT application, the and Energy Efficiency
harvested power amount varies significantly. We
should not forget that these amount are optimistic Especially for IoT system/sensor, it is very
values on average; for example, in PV, if there is important we design the system in to-down
an instantaneous blockage of the cell, the approach to meet the power and energy require-
generated power will drop abruptly. Therefore, ment. This means, when designing such system,
if the circuit needs continuous operation, the we should answer the following questions, from
designer may want to introduce a secondary top to bottom:
energy storage such as super capacitor or battery.
Nevertheless, when using the harvesting/scav- • What is the application we are targeting for?
enging as the main source, keeping the “peak • What is the energy source? (Harvesting/Scav-
power consumption” of the circuitry well below enging/Battery/Hybrid/etc.)
that of the generator’s limit should be the key • What are the system components that will be
design target. This is aligned with what we saw included/excluded? (analog interface, filters,
from Sect. 1.3.1. voltage reference, analog-to-digital converter,
With this in mind, it is helpful to see what will post processing, communication block,
be the consequences of designing lower power memory)
interface circuit. First of all, the amplifier will be • What is the available power budget for each
impacted significantly. The bias current cannot block?
exceed the limit set by the supply, therefore the • What is the performance requirement for each
thermal noise will increase. Available power block?
level of μW order means the transistors will • What are the constraints/bottleneck in achiev-
likely operate in weak inversion (or - ing such performance under such
sub-threshold), which will be prone to Voltage environment?
and Temperature variation. Also, the impact of • Defining the specification of each block, and
dynamic offset will become more serious. start designing.
Strategies to overcome these issues will be
described in detail in Sect. 12.3. Without considering the system perspective,
Now let us consider the case where we use one might end up designing the lowest-power
battery as the main power supply. A coin cell consuming analog interface circuit, but that
battery, or a small flexible battery, will have the block is only a small fraction of overall system
capacity of tens to hundreds of millamp-hour power consumption; as a result, in such cases, the
(mAh). Compared to energy harvesting/scaveng- total system power consumption would not
ing case, we have more room for maximum decrease much. This can be avoided by breaking
allowable current. However, as we discussed in down the system power budget from the
Sect. 1.3.1, now the energy consumption (or the beginning.
12 Ultra-Low Power Analog Interfaces for IoT 347

12.2.3 Small Form Factor, Efficient R3


Filter Design
R1
+ +
Because of the very strict power budget in IoT,
Vin A1 Vout
often the filter design becomes an issue in interface
circuits. Not only the power and energy, but also - - R4
R2 A1diff=
the area is of concern. Most of time we do not have R2
luxury of using high order filters in such area- and R4
power- constrained system. Therefore, approaching
from a system perspective is very crucial. Fig. 12.2 Single-amp IA
As previously described in Sect. 12.1, this
becomes particularly important in dynamic offset
Vin+ +
compensation circuit, where extensive filtering is R3 R4
A1
needed. Here are some strategies we can take for
-
an area- and energy-efficient filter design: R5
-
RG A3 Vout
• Active filtering of ripple-induced noise by real-
+
time sampling and subtracting (Fan et al. R6
-
2011): suitable for an area-constrained environ-
A2
ment. The amplifier features a ripple-reduction Vin- + R1 R2
loop (see Fig. 12.18) to sample the noise and
subtract it in real time; by doing so, we can Fig. 12.3 Three-amp IA
avoid using area-consuming high-order filters.
• Balance of analog interface and digital
Single-Amp IA: The simplest IA is a Single-amp
processing filtering (Altaf et al. 2015; Yoo
IA shown in Fig. 12.2. The gain is defined by
et al. 2013): Choose the oversampling rate
R4/R2 with a large input voltage range. How-
and chopping frequency to use only 2nd
ever, this structure has a fundamental problem:
order filter in analog interface; then the digital
low input impedance. The input impedance is
processor also aids filtering.
determined by R1 and R2, but we cannot increase
them too much, because otherwise the gain will
be limited. Hence, loading effect may become an
12.3 Noise-Aware Interface Circuit issue in IoT applications. More importantly, in
Design IoT, with stringent low power requirement (low
bias current), the thermal noise may be large
In this subsection, we will look into details of the enough to corrupt the input signal, and there is
interface circuit design for IoT. We will start no mitigation technique of it.
with basic instrumentation amplifier (IA), com-
Three-Amp IA: The classical 3-amp IA shown in
pare the strengths and weaknesses. After that, we
Fig. 12.3 solves the low input impedance prob-
will see the two widely used dynamic offset
lem of a Single-amp IA. This is done by adding
compensation techniques: Auto-Zeroing and
buffer amplifiers (A1 and A2) to each input of the
Chopper-Stabilization.
differential amplifier (A3).
Here, the gain is defined as shown in Eq. (12.5):
  
12.3.1 Instrumentation Amplifier R5 þ R6 R4
Gain ¼ 1 þ ð12:5Þ
Basics RG R3

Now let us look into the Instrumentation Amplifier Note that when RG is removed, A1 and A2 will
(IA) basics. Understanding the IA basics is crucial operate as a unity-gain buffer amplifier. By adding
for expanding the perimeter to IoT applications. RG, and additional gain is achieved. On top of
348 J. Yoo

Cfb
+ + - +
Vin A1 Vo1 A2 Vout
- - + - 2-Stage OTA
CINT
Cin
Fig. 12.4 CMRR of a multi-stage amplifier Gm1 Gm2

that, RG is free from matching issues, since only G=Cin/Cfb


V-to-I I-to-V
one resistor is added. This is why this structure is
widely adopted to-date in IA domain.
Fig. 12.5 IA with a capacitive gain element
However, for IoT, this is still not enough due
to lack of dynamic offset removal—the offsets
This gives us an important message: in a multi-
will be a problem on top of that, with such low
stage amplifier, the CMRR of the first stage will
power requirement, the bias current cannot be
dominate the overall CMRR performance! Hence,
high, and subsequently noise is significant.
in an IoT AFE/IA design, it is very important we
CMRR of a multi-stage IA: It is also worth give an attention on the low-noise yet high-
remembering that in a multi-stage amplifier, CMRR performance of the first stage amplifier.
CMRR of each stage will impact differently in
Capacitive Coupled IA (CCIA): Noting that in
overall CMRR performance. Let us assume that
general capacitors have better matching than the
the 2-stage amplifier shown in Fig. 12.4 has gain
resistor has, the amplifier shown in Fig. 12.5
of A1 and A2, with CMRR of each stage to be
utilizes capacitive gain element (Harrison and
CMRR1 and CMRR2, respectively.
Charles 2003). Here, the gain is defined as Cin/
Now let us assume the input voltage of V in Cfb. This type of an amplifier is called Capacitive
¼ V id  CMRR
V ic
1
is applied, where Vid and Vic are Coupled IA (CCIA).
the differential and common mode component of Using capacitive gain element has another
the input, respectively. Then, in Fig. 12.4, Vo1 very strong advantage: Cin blocks DC and there-
can be expressed as V o1 ¼ A1  V in . Therefore, fore the amplifier accepts rail-to-rail input. Since
Vout will become: then, this simple-but-powerful idea is widely
  adopted in IA domain.
V ic Aforementioned amplifier types—single-
V out ¼ A2 V 1 
CMRR2 amp, three-amp or CCIA—has a fundamental
limit, when it comes to IoT applications: static
¼ ðA1  A2Þ  V id
  and dynamic offset. None of above amplifiers
A1  A2 A2 have the offset mitigation scheme, and especially
 þ  V ic ð12:6Þ
CMRR1 CMRR2 in low power (with low bias current) environ-
ment, the dynamic offset including 1/f noise
Observing Eq. (12.6), we find that the becomes a series issue.
differential gain is ðA1  A2Þ and the common-
 
mode gain becomes A1A2
CMRR1 þ A2
CMRR2 . Since
CMRR is the ratio of differential gain to the 12.3.2 Auto-Zeroing
common mode gain, now the CMRR of the
overall amplifier becomes as in Eq. (12.7): Aforementioned issues in IA—especially the
dynamic offset—must be mitigated in IoT
1 1 1
¼ þ applications. One technique to overcome the
CMRRamp, overall CMRR1 A1  CMRR2
issue is Auto-Zeroing (Enz and Temes 1996).
ð12:7Þ The concept of the auto-zeroing is simple:
12 Ultra-Low Power Analog Interfaces for IoT 349

Fig. 12.6 Auto-zeroing concept: sampling phase (left), and amplification phase (right) (figure from Makinwa et al.
(2007))

sample the offset (from either input or output),


and then subtract it. This is illustrated in
Fig. 12.6.
In auto-zeroing, there are two phases: Sam-
pling Phase (shown in left), and Amplification
Phase (right). Assuming an ideal amplifier in
Fig. 12.6, during the sampling phase S1 and S2
are closed (forming a buffer amplifier), and the
offset of the amplifier is sampled at the output
and stored into the storage capacitance Caz.
Alternatively, in other structures, intermediate
or input offset can be sampled. Fig. 12.7 Auto-zeroing in signal processing perspective
Now in the amplification phase, S1 and S2 are (figure from Makinwa et al. (2007))
open, and S3 is closed. Since the Caz stores the
offset value, and is connected to the negative
phase, the higher frequency component (mostly
input, the offset is removed and only the signal
thermal noise) will be under sampled, and this will
component will be amplified.
be folded into baseband during amplification.
The auto-zeroing can be explained in sampled
Regardless of the limitation, the auto-zeroing
signal perspective as well. Let us consider the
can be widely used in IoT applications, especially
auto-zeroing system as a sample-and-hold (S&H)
those with sampled system or with applications
structure as shown in Fig. 12.7. Here, the
having single-ended signals as their inputs.
Vn, az zðf Þ ¼ V n ðf Þ  ð1  H ðf ÞÞ, where H( f ) is
the transfer function of the S&H block. S&H is
gate function in time domain, and therefore
H ðf Þ ¼ sincðπf =f S Þ. Noting that sinc function 12.3.3 Chopper Stabilization
acts as a low-pass filter (LPF) with passband in
DC, 1-H( f ) is a high-pass filter (HPF). This is Chopper-stabilization (Enz and Temes 1996) is
why the auto-zeroing mitigates the 1/f noise (near another commonly used dynamic offset cancella-
DC) and the other offset as well. tion technique that suits well with IoT application.
Auto-zeroing also has one limitation and that is In a nutshell, chopper stabilization modulates the
the noise folding into baseband. This is due to input signal into higher frequency before the 1/f
limited bandwidth of the S&H; during the sampling and other dynamic offset corrupts the signal.
350 J. Yoo

Figure 12.8 illustrates a chopper-stabilized Figure 12.9 shows the same in frequency
amplifier in time domain. It is composed of a domain. Again, the signal component is denoted
differential amplifier (Amp), a modulator/ by blue (and the gray shade in spectrum), and the
demodulator pair, and a low pass filter (LPF). aggressors are in red. Since the demodulation
The signal drawn in blue, denoted as VIN, is leaves harmonics at higher frequencies, careful
differential. As the input chopper swaps VIN+ design of LPF is needed. fchop should be higher
and VIN with the chopping frequency fchop, the than the 1/f corner frequency, in order to suffi-
input of the differential amplifier will be ampli- ciently mitigate it. One may think that choosing
tude modulated (shown as a square wave in higher fchop will ease the filter specification,
Fig. 12.8 at the amplifier input). After the ampli- because Fig. 12.9 implies having baseband as far
fier, the modulated input signal is amplified, and away from fchop will enable using lower order
the dynamic offset such as 1/f noise, denoted by filters. However, doing so will also introduce
red dotted line, is added to the output. Then the more frequent chopping induced spikes due to
“modulated and amplified” input signal, plus the charge injection, which in turn translates to a resid-
noise, are demodulated at the frequency fchop. At ual offset. Therefore, balancing the filtering speci-
this stage, the signal is demodulated (with an fication and chopping frequency becomes a crucial
image at 2  fchop), and the noise is “up design choice! It should be also noted that the
modulated” at fchop. After filtering out the higher choosing fchop is also dependent on the 1/f corner
frequency component, clean amplified signal is frequency and the Gain Bandwidth (GBW) of the
obtained. This is just as in amplitude modulation/ amplifier—the amplifier BW should be suffi-
demodulation. ciently higher than the fchop to avoid gain errors.

VMOD VDEMOD LPF


VIN VOUT
Amp

Chopping
Clock fchop

Fig. 12.8 chopper-stabilization (time domain)

VMOD VDEMOD LPF


VIN VOUT
Amp

Chopping
Clock fchop
dB dB dB dB

offset &noise LPF


signal signal
signal modulated
modulated residual
offset & noise
Signal noise & offset

f/fch f/fch f/fch f/fch


1 3 5 1 3 5 1 3 5 1 3 5

Fig. 12.9 Chopper-stabilization (frequency domain)


12 Ultra-Low Power Analog Interfaces for IoT 351

VMOD

t T 2T Time (t)

fchop

T 2T Time (t)
VDEMOD Residual
Offset at VOUT

T 2T Time (t)

Fig. 12.10 Chopper-induced residual offset (figure from Fig. 12.12 Delayed sampling (figure from Makinwa
Yan et al. (2010)) et al. (2007))

such spikes are from parasitic capacitance


At Input
(1/f Noise) between gate and drain/source; hence, using
Noise Power [dB]

smaller switch will reduce the parasitic capaci-


After IA tance, consequently reduce the spike strength.
(Residual Offset)
However, using smaller switch means the on
Output resistance of the switch becomes higher, which
maybe an issue for a given application.
To mitigate the residual offset in more active
Thermal
Noise Floor way, delayed sampling (Menolfi et al. 1999) can
be used. Shown in Fig. 12.12, a demodulator
1/f Corner Frequency [Hz] matched to a modulator is operated by a “delayed
clock” with respect to the modulator clock. The
Fig. 12.11 Noise power in spectrum
amount of delay is controlled in a such way that
the demodulated spikes, when averaged out, will
Figures 12.10 and 12.11 denotes the chopper- have DC offset of nearly zero. Controlling the
induced spikes and its residual offset. An ideal delayed timing is difficult, however, due to PVT
chopper switch would not induce such issues; variation and load-dependent spike amount.
however, in real world, due to the parasitic Another technique is nested chopping (Bakker
capacitance between the gate and source/drain, et al. 2000), as shown in Fig. 12.13. The inner
clock feedthrough and the chopper induced chopper, operating at fCHOP_H, performs the 1/f
spikes will show up at the chopper output, as in and offset mitigation. This will generate
VMOD. After the demodulation and the LPF, chopper-induced spikes. The outer chopper,
these spikes will show up as a DC offset—in operating at fCHOP_L, embraces the spikes to
this text, we denote this as the “residual DC alternate and swap the group of spikes, as
offset”. Of course, the higher the fchop, the more shown in the figure. This will effectively average
frequent the spikes happen, and the more residual out the spikes, thereby reducing the residual DC
DC offset we will get. offset. We should keep in mind, however, that
The residual DC offset will incur the distor- the nested chopping will reduce the bandwidth
tion and gain error, hence we need to mitigate considerably, because now the signal bandwidth
it. First of all, we should remember the source of is bound by fCHOP_L, but not by fCHOP_H.
352 J. Yoo

Fig. 12.13 Nested

LPF
chopping (figure from Yan Vin Amp Vout
et al. (2010))

CHOP_L CHOP_H

Noise PSD [dBV/÷Hz]


Outer 2fCHOP_LVspiket
chopper

Thermal Noise Floor “Average-out”


Frequency [Hz]
the chopping spike

Fig. 12.14 Chopper-Stabilized Capacitive-Coupled IA (CS-CCIA) with chopper switch moved before the DC
blocking capacitance (Denison et al. 2007)

12.3.4 State-of-the-Art IA Design (Denison et al. 2007). By doing so, any mismatch
on Ci can be modulated and mitigated. With the
In this subsection, we will explorer several state- capacitive-coupled structure with capacitive gain
of-the-art IA design, adopting low-noise, element, the amplifier has very high CMRR
low-power and low-energy topologies. Pros and (100 dB), low integrated input referred noise
cons of each design is also presented. (0.98 μVrms for 0.5–100 Hz), and well-defined
gain with excellent high-pass corner is obtained.
12.3.4.1 Chopper-Stabilized Capacitive- One limitation of the strcuture, though, is the
Coupled IA (CS-CCIA) with exitance of chopper at the input stage drops the
Chopper Ahead of Input differential input impedance (Zin,diff) of
Capacitor: High CMRR, Low the amplifier; the parasitic capacitance within the
Noise chopper switch, combined with high frequency
The amplifier shown in Fig. 12.14 has the input switching, degrades the Zin,diff significantly. As a
chopper prior to DC the input capacitor (Ci) result, the amplifier has only Zin,diff of 8 MΩ.
12 Ultra-Low Power Analog Interfaces for IoT 353

Fig. 12.15 Chopper-stabilized resistive IA with DC servo loop

This is perfectly okay in the application of the rail input is required, this structure may impose
amplifier (Denison et al. 2007), which is an major limitation.
implantable device; however, for other IoT
applications, this may not be suitable due to load-
12.3.4.3 Chopper-Stabilized Capacitive-
ing effect from input side.
Coupled IA (CS-CCIA)
with Chopping at Virtual
12.3.4.2 Chopper-Stabilized Resistive IA: Ground: High Input Impedance
DC Offset Removal with To improve the limited Zin,diff of a CS-CCIA, a
a Servo Loop groundbreaking structure was proposed in
Figure 12.15 shows a chopper-stabilized IA that (Verma et al. 2010). Shown in Fig. 12.16, the
uses resistive gain element (Yazicioglu et al. amplifier moves the chopper to the input virtual
2011). It introduces a IA offset reduction loop ground, which boosts the Zin,diff to a very high
(GM1) to reduce the IA offset. Additionally, it value. Also note that now the Zin,diff will include
samples the output DC level with a DC servo the impedance of the capacitors CIN as well.
loop (GM2), which provides a negative feedback Therefore, this structure is free from loading
to stabilize and mitigate the DC offset. This effect, and ideal for many IoT applications.
smart approach mitigates the chopper induced Also, using capacitive gain component means it
offset as well as the DC offset caused by mis- has well defined gain of CIN/CFB. Additionally,
match at the sensor interface (in this example, the bias resistors are implemented by switched
DC offset mainly due to electrode mismatch). As capacitors with high resistance (15 GΩ) to save
a result, the amplifier shows excellent low noise power. The amplifier shows excellent low noise
performance (noise floor of 60 nV/√Hz), high performance (noise floor of 60 nV/√Hz), high
input impedance (>100 MΩ), well defined gain input impedance (>700 MΩ), well defined gain
and a DC offset removal. and a rail-to-rail input.
One downside of the structure is the limited The only downside of the strcuture, though, is
DC headroom by the current-based DC servo that the off-chip capacitor CIN. Since CIN are
loop; in some IoT applications, where rail-to- outside the chopper modulation, any mismatch
354 J. Yoo

in the devices will affect the CMRR; as we 12.3.4.4 Chopper-Stabilized Capacitive-


discussed in Sub-section 12.1 and in Eq. (12.2), Coupled IA (CS-CCIA) with
the offchip device typically has worse matching Active Filtering, Impedance
than the on-chip device has. As a result, the Boosting Loop and a Ripple
CMRR of the overall system in (Verma et al. Reduction Loop: Low Noise,
2010) is limited at 60 dB. High Input Impedance
and Small Area
In order to resolve the limited Zin,diff of a
CS-CCIA with chopper at input stage, the
CS-CCIA in Fig. 12.17 introduces a ground-
breaking concept of using positive feedback
impedance boosting loop [FSH14]. The amplifier
samples the output and forms a positive feedback
loop (PFL), which aids the input signal current.
Hence, the amount of current needed to achieve
desired output swing is dropped, and Zin,diff is
effectively boosted. It inherits all the features of
CS-CCIA, so it has extremely low-power
(1.8 μW), low-noise (60 nV/√Hz) operation,
and therefore ideal for many IoT applications.
To further suit for an area-constrained envi-
ronment, a ripple-reduction loop can be used
(Wu et al. 2009). As shown in Fig. 12.18, the
amplifier samples the output of the CS-CCIA
(with the chopper-induced ripple denoted as
Fig. 12.16 chopping at virtual ground to improve input Vout,ripple). It is then passed through C4, CH6
impedance then to Gm6/Cint to a residual DC offset. This is

Fig. 12.17 Improving Zin,diff with a positive feedback impedance boosting loop
12 Ultra-Low Power Analog Interfaces for IoT 355

Fig. 12.18 Reducing the chopper-induced ripple with ripple-reduction loop, removes the area-consuming high-order
filters

then given as a negative feedback at the input of the DC servo loop is primarily used to remove
the amplifier, which effectively removes the the DC component, and hence the passband of
residual DC offset at the output stage. With this the loop itself is can be extremely low.
scheme, use of high-order passive filter can be Additionally, the amplifier has a ripple reduc-
removed, and significant area reduction can be tion loop and an impedance boosting loop; there-
achieved. This scheme is particularly helpful in fore, the amplifier achieves extremely low
IoT where stringent power and area filtering is power (1.1 μA), low noise (0.81 μVrms for
needed. 0.5–100 Hz) while having high input impedance
(>500 MΩ).
12.3.4.5 Chopper-Stabilized Capacitive-
Coupled IA (CS-CCIA) with An
Additional Chopper at DC Servo 12.4 Summary and Future
Loop: Lower Noise, Small Area Perspectives
and High Input Impedance
As we have seen so far, using DC servo loop to So far we have explored the challenges,
remove DC offset is an effective means to requirements and design techniques of analog
remove DC offset (Fan et al. 2011; Verma et al. interface circuit for IoT. In this section, we will
2010; Yoo et al. 2013), but an issue with such look into future perspectives of the IA for IoT,
structure is that the DC servo loop feedback is and summarize what we have covered so far.
applied to an “up modulated” signal; this means,
even after the demodulation, the DC servo loop-
induced noise cannot be removed. To solve this 12.4.1 Future Trends on IoT Interface
issue, along with dynamic offset cancellation, the Circuits
amplifier shown in Fig. 12.19 uses another chop-
per (operating at lower frequency fL) around the With more and more IoT devices being used,
DC servo loop to effectively remove the elevated recent research on analog interface circuits for
noise (Altaf and Yoo 2016). Unlike the nested IoT shows direction towards power and area
chopper case, this additional chopper does not efficiency with system-level design consider-
affect the main amplifier signal bandwidth, since ation; here, we will see two good examples.
356 J. Yoo

Fig. 12.19 CS-CCIA with ripple-reduction loop, impedance boosting loop and chopped DC servo loop

12.4.1.1 Bulk Switching


Chopper-stabilized IAs described in previous
section may suffer from low input impedance,
which causes loading effect with high-
impedance signal source. We have seen the
amplifiers with positive-feedback impedance
boosting loop (Fan et al. 2011; Yoo et al. 2013)
may mitigate the issue, but at the cost of addi-
tional control circuitry and careful design of sta-
bility test. To overcome this, a new concept of
Bulk Switching was recently introduced (Han
et al. 2015). Figure 12.20 shows the concept
applied to a CS-CCIA.
The amplifier core (PMOS input pair) has its
bulk terminal switched on and off with fre-
quency fs. As we can observe from the figure,
the gate of the input pairs are free from chopper
switch, so the input impedance of the amplifier Fig. 12.20 Bulk switching CS-CCIA
is very high. With the bulk of the input pair
switched on and off, the transistor is also turned
on and off, thereby decreasing the 1/f noise. while consuming only 3.96 μW from 1.2 V
This smart approach solves the low input supply. The amplifier occupies 0.053 mm2 in
impedance issue of chopper-stabilized ampli- 65 nm CMOS, which is by far one of the
fiers with chopper switch at the input, without smallest in IAs with this performance.
adding any complex circuitry. As a result, the The bulk switching is a good example of the
measurement shows the amplifier has NEF research directions in IA for IoT, where area- and
of 2.2, input referred noise of 0.75 μVrms over power-efficiency are both an important design
1–200 Hz for an 100 kΩ source impedance, metric.
12 Ultra-Low Power Analog Interfaces for IoT 357

Fig. 12.21 Channel sharing: Dual Channel Charge Recycled (DCCR) CS-CCIA

12.4.1.2 Channel Sharing current consumption (per channel) and in area


An aggressive approach of saving power and (per channel), respectively, when compared to
area in an IA was recently proposed (Altaf et al. using two separate IAs. As a result, it comes
2015). The Dual Channel Charge Recycled only 1.62 μW/channel, integrated input referred
(DCCR) CS-CCIA, switches among two chan- noise of 0.9 μVrms (0.5–100 Hz), NEF of 3.29/
nel, as shown in Fig. 12.21. The idea is to have an channel, CMRR of 97 dB while occupying
IA shared and time-multiplexed among two sep- 0.716 mm2 in 0.18 μm CMOS.
arate channels. The channel sharing IA is a breakthrough
To avoid bias settling issues, the amplifier concept in IA for IoT; this is another good exam-
core (OTA) samples and stores the bias point. ple where system-level design consideration and
That is, when CDSL_CH1 is in amplification phase, ideas, balanced with the IA’s circuit idea,
the CDSL_CH2 and the OTA internal bias storing improves the overall system performance.
capacitance hold its bias point; in the next phase,
CDSL_CH1 holds its bias point and CDSL_CH2
resumes its amplification. This effectively 12.4.2 Summary
“recycles” the bias current of OTA, thereby
decreasing the power consumption significantly. As we have seen so far, designing an analog
An advantage of such this method is that we interface for IoT faces many challenges, espe-
do not need to have a separate circuitry to multi- cially towards the strict power, energy and area
plex the channels. Rather, the Dual Channel constraint. It is very important that system-level
Chopper (DC_CHOP) (Fig. 12.22) is employed. consideration is made prior to designing each
The DC_CHOP uses two gated clocks, which are component. Therefore, top-down approach is
originated from a single chopper clock, to per- highly recommended.
form chopping function and multiplexing func- IoT covers many different applications, from
tion at once! environmental sensing/monitoring to wearable
The implementation and measurement results healthcare applications, with sensing/processing/
(Fig. 12.23) show that there is minimal impact on communication/storing functions. Each applica-
integrated noise (10% increase) while amplifying tion has different needs, and it is crucial that the
two channels with a single amplifier. The input is IA performance meets the proper target within the
180-degrees out-of-phase between channels (one application. These metrics include system resolu-
of the worst case for multiplexing). The improve- tion, sampling rate, CMRR, power/energy con-
ment is very clear: 43% and 28% reduction in sumption, noise performance, NEF and area.
358 J. Yoo

Fig. 12.22 Dual channel chopper switch

CS-CCIA
-6
10 DCCS CS-CCIA
Input RTI (V/÷Hz)

0.80uVrms
-7 0.90uVrms
10

10-8

10-9
0.01 0.1 1 10 100
Frequency(Hz)

Fig. 12.23 Time-domain output of out-of-phase inputs, and the IA’s input-referred noise

future trends in devices, technology, and systems,


References CATRENE Report on Energy Autonomous Systems
(2009)
M.A.B. Altaf, J. Yoo, A 1.83μJ/classification, 8-channel, T. Denison, K. Consoer, W. Santa, A.-T. Avestruz,
patient-specific epileptic seizure classification SoC J. Cooley, A. Kelly, A 2μW 100nV/√ Hz chopper-
using a non-linear support vector machine. IEEE stabilized instrumentation amplifier for chronic mea-
Trans. Biomed. Circuits Syst. 10(1), 49–60 (2016) surement of neural field potentials. IEEE J. Solid State
M.A.B. Altaf, C. Zhang, J. Yoo, A 16-channel patient- Circuits 42(12), 2934–2945 (2007)
specific seizure onset and termination detection SoC C.C. Enz, G.C. Temes, Circuit techniques for reducing the
with impedance-adaptive transcranial electrical stim- effects of op-amp imperfections: autozeroing,
ulator. IEEE J. Solid State Circuits 50(11), 2728–2740 correlated double sampling, and chopper stabilization.
(2015) Proc. IEEE 84(11), 1584–1614 (1996)
A. Bakker, K. Thiele, J.H. Huijsing, A CMOS nested Q. Fan, F. Sebastiano, J.H. Huijsing, K.A.A. Makinwa, A
chopper instrumentation amplifier with 100nV offset. 1.8μW 60nV/sqrtHz capacitively-coupled chopper
IEEE J. Solid State Circuits 35(12), 1877–1883 (2000) instrumentation amplifier in 65nm CMOS for wireless
M. Belleville, E. Cantatore, H. Fanet, P. Fiorini, P. Nicole, sensor nodes. IEEE J. Solid State Circuits 46(7),
M. Pelgrom, C. Piguet, R. Hahn, C. Van Hoof, 1534–1543 (2011)
R. Vullers, M. Tartagni, Energy autonomous systems:
12 Ultra-Low Power Analog Interfaces for IoT 359

M. Han, B. Ki, Y.-A. Chen, H. Lee, S.-H. Park, R. Wu, K.A.A. Makinwa, J.H. Huijsing, A chopper
E. Cheong, J. Hong, G. Han, Y. Chae, Bulk switching current-feedback instrumentation amplifier with a
instrumentation amplifier for a high-impedance source 1mHz 1/f noise corner and an AC-coupled ripple
in neural signal recording. IEEE Trans. Circuits Syst. reduction loop. IEEE J. Solid State Circuits 44(12),
Exp. Briefs 62(2), 194–198 (2015) 3232–3243 (2009)
R.R. Harrison, C. Charles, A low-power low-noise CMOS L. Yan, J. Yoo, B. Kim, H.-J. Yoo, A 0.5-μVrms 12-μW
amplifier for neural recording applications. IEEE wirelessly powered patch-type healthcare sensor for
J. Solid State Circuits 38(6), 958–965 (2003) wearable body sensor network. IEEE J. Solid State
K.A.A. Makinwa, Dynamic-offset cancellation Circuits 45(11), 2356–2365 (2010)
techniques in CMOS, in IEEE International Solid- R.F. Yazicioglu, S. Kim, T. Torfs, H. Kim, C. Van Hoof,
State Circuits Conference, Tutorial-04 (2007) A 30μW analog signal processor ASIC for portable
C. Menolfi, Q. Huang, A chopper modulated instrumenta- biopotential signal monitoring. IEEE J. Solid State
tion amplifier with first-order low-pass filter and Circuits 46(1), 209–223 (2011)
delayed modulation scheme, in Proceedings of the J. Yoo, L. Yan, D. El-Damak, M.A.B. Altaf, A.H. Shoeb,
IEEE European Solid-State Circuits Conference A.P. Chandrakasan, An 8-channel scalable EEG
(1999), pp. 54–57 acquisition SoC with patient-specific seizure classifi-
M. Steyaert, W. Sansen, C. Zhongyuan, A micropower cation and recording processor. IEEE J. Solid State
low-noise monolithic instrumentation amplifier for Circuits 48(1), 214–228 (2013)
medical purposes. IEEE J. Solid State Circuits J. Yoo, Design strategies for wearable sensor interface
SC-22, 1163–1168 (1987) circuits: from electrodes to signal processing, in IEEE
N. Verma, A. Shoeb, J. Bohorquez, J. Dawson, J. Guttag, International Solid-State Circuits Conference, Short
A.P. Chandrakasan, A micro-power EEG acquisition Course on Biomedical and Sensor Interface Circuits,
SoC with integrated feature extraction processor for a 13 Feb (2014)
chronic seizure detection system. IEEE J. Solid State
Circuits 45(4), 804–816 (2010)
Ultra-Low Power Analog-Digital
Converters for IoT 13
Pieter Harpe

This chapter addresses ADCs for IoT nodes, to-Noise-and-Distortion-Ratio), i.e., the ratio
which are needed to digitize sensor information between signal power and power caused by all
before processing, storage or wireless transmis- forms of noise and distortion (Pelgrom 2017).
sion. ADCs are also required for the radio com- The SNDR can also be recalculated to ENOB
munication channel. This chapter focusses on (Effective-Number-Of-Bits) with equation (13.1).
successive approximation (SAR) ADCs, a popu- Note that an ideal Nyquist-rate ADC with N-bit
lar architecture for IoT thanks to their high output codes will have an ENOB equal to N.
power-efficiency. After deriving requirements From this, the maximum theoretical SNDR of
for IoT, the design basics of SAR ADCs are such an ADC can be determined as well. In
discussed, followed by various design examples reality, ENOB and SNDR will be lower than
to illustrate key enabling techniques. this theoretical upper bound.
SNDR ¼ 6:02  ENOB þ 1:76 ½dB ð13:1Þ
13.1 ADC Requirements for IoT In terms of speed, two metrics are relevant: the
sample rate fs of the ADC and the signal band-
This section first reviews basic ADC definitions, width (BW) that can be converted. For a Nyquist
and then discusses ADC requirements in the con- rate ADC, fs will be twice the value of BW to
text of IoT nodes. State-of-the-art ADCs are satisfy the Nyquist criterion. Oversampled ADCs
highlighted and the SAR architecture is selected have an fs that is much higher than BW to allow
for further elaboration in this chapter thanks to its for instance noise-shaping techniques. In this
suitability for IoT. chapter, fs,Nyq denotes the equivalent Nyquist
The function of an Analog-to-Digital Con- sample-rate of an ADC, which is thus twice
verter (ADC) is to convert analog information BW. The last performance parameter of an ADC
into the digital domain. The three most important is its power consumption, here denoted with
performance metrics of ADCs are accuracy, P. Combining the above three metrics, different
speed and power consumption. While the Figure-Of-Merits can be calculated to evaluate the
ADC’s accuracy can be expressed in different overall performance with a single number.
ways, a common way is to use SNDR (Signal- Two commonly used FOMs are the FOMW and
FOMS as shown in equations (13.2) and (13.3),
P. Harpe (*) respectively. Both FOMs show the trade-off
Eindhoven University of Technology, Eindhoven, between power, accuracy, and speed, albeit with
Netherlands
e-mail: p.j.a.harpe@tue.nl different weighting functions.

# Springer International Publishing AG 2017 361


M. Alioto (ed.), Enabling the Internet of Things, DOI 10.1007/978-3-319-51482-6_13
362 P. Harpe

Fig. 13.1 ADCs in IoT


nodes: wireless frontends
and sensor interfaces
RF
ADC
front-end
DSP
Analog
Sensor ADC
front-end

P the ADC. For sensors, some applications require


FOMW ¼ ½J=conversion-step
f s, Nyq  2ENOB only quasi-static conversion, for instance in case
of temperature, pressure, light, or humidity mon-
ð13:2Þ
itoring. Many other sensors have relatively low
 
f s, Nyq speed of operation and require ADCs up to a few
FOMS ¼ SNDR þ 10  log ½dB ð13:3Þ kS/s, for instance in case of bio-potential record-
2P
ing, accelerometers, or gyroscopes. Lastly, some
Inside IoT nodes, data converters are gener- particular sensors, such as image sensors, could
ally found in two locations (Fig. 13.1): inside the require a very high conversion bandwidth.
baseband of the wireless receiver and as the final In summary, most IoT applications can be
stage of the sensor interface. Due to energy covered with ADCs with relatively low speed
constraints, ADCs in early-generation IoT (DC up to a few MS/s) and moderate resolution
nodes are typically optimized for low power (ENOB around 7 to 12-bit). With today’s CMOS
consumption by reducing accuracy and speed technology, such speed of operation is straight-
down to the minimum acceptable performance. forward to achieve, and hence the focus in design
For instance, 8/9-bit ADCs with up to a few MS/s can be mostly on power efficiency.
sample rate are used in low-power proprietary Figure 13.2 gives an overview of state-of-the-
standard ISM-band radios (Vidojkovic et al. art ADCs (1997–2016, (Murmann 2016)) in
2011) or in standard compliant BLE (Bluetooth terms of efficiency (energy consumption per
Low Energy) radios (Liu et al. 2015; Prummel sample) versus SNDR. It can be seen that for
et al. 2015; Sano et al. 2015). For sensor the resolutions of interest, Successive Approxi-
interfaces, traditional high-precision applications mation Register (SAR) ADCs are the most power
typically use ADCs with 12 to 16-bit resolution efficient solution. Therefore, the remainder of
(Makinwa et al. 2015), but IoT systems have this chapter focusses on SAR ADCs.
sacrificed precision to achieve power reduction. As stated before, IoT nodes have minimized
For instance, 7 to 10-bit performance was pro- ADC accuracy and speed to save power. How-
posed in systems used for IoT and wearable ever, this could for instance reduce the interfer-
sensing applications (Harpe et al. 2015; Yip ence tolerance of the radio, or it may give
et al. 2011). With the progress of technology, a suboptimal sensor performance. For those
trend towards higher resolutions like 10 to 12-bit reasons, higher performance ADCs are
can be seen more recently (Konijnenburg et al. demanded. Fortunately, technology scaling and
2016). The required speed of IoT ADCs strongly design techniques enable this demand. As shown
depends on the application. In wireless commu- in Fig. 13.3, it can be observed that the power-
nication, this depends on the signal bandwidth efficiency of ADCs has been improving steadily
which is typically in the kHz or low-MHz range, over the last decades, reaching more than a factor
requiring up to a few MS/s of sample rate from 1000 of power reduction in 20 years. Thanks to
13 Ultra-Low Power Analog-Digital Converters for IoT 363

1.E+05

1.E+04

1.E+03
P/fsnyq [pJ]

1.E+02

Flash and Folding ADCs


1.E+01 Pipeline ADCs
Sigma Delta ADCs
Other ADCs
1.E+00 SAR ADCs
FOMW = 2fJ/conv.step
Typical IoT area FOMS = 176dB
1.E-01
20 30 40 50 60 70 80 90 100 110

SNDR @ fin,hf [dB]

Fig. 13.2 State-of-the-art ADC power-efficiency versus SNDR, data from (Murmann 1997)

100000
State-of-the-art ADCs 1997 - 2016
Trend
10000
FOMW [fJ/conv.step]

1000

100

10

0.1
1995 2000 2005 2010 2015 2020
Year

Fig. 13.3 Improvement of ADC power-efficiency over time, data from (Murmann 1997)

this trend, future IoT nodes will either benefit and instantaneous operation are preferred. SAR
from lower power consumption or improved pre- ADCs suit well in this context, because they can
cision/speed. be made without static bias currents, allowing
Besides the critical trade-off between power, automatic power-down in standby phases. More-
precision, and speed, that lead to the selection of over, as they are Nyquist converters, a single
the SAR architecture, there are several other conversion can be made instantaneously
aspects that are relevant in the context of IoT. (as opposed to oversampled converters that could
First, IoT nodes are often operated on-demand, suffer from memory or start-up effects).
implying that the node is inactive for a long Reconfigurability, i.e., the ability to operate effi-
time, and only operational in short bursts. For ciently at different speeds and resolutions, is
this reason, ADCs with low standby consumption another key parameter that is favorable for IoT.
364 P. Harpe

For instance, this allows to digitize signals with a output code, to the sampled voltage Vs. After N
variable speed or resolution, dependent on the cycles of approximation, VDAC will be very close
activity, application, or environmental conditions. to Vs, implying that Dout is the output code
In the remainder of this chapter, Sect. 13.2 describing the sampled input Vin. As shown in
describes the general design aspects of low- the timing diagram, sampling takes place in a
power SAR ADCs. The main requirements of first clock cycle, after which there are N clock
IoT are discussed by means of design examples, cycles to determine the N bits. After this, the final
namely: low power consumption and duty cycling code is obtained and the operation repeats for the
(Sect. 13.3), developments towards higher pre- next input sample. In order to implement this
cision (Sect. 13.4), and reconfigurability (Sect. ADC, a T&H, comparator, logic and DAC are
13.5). Section 13.6 elaborates on voltage refe- needed. These components will be described in
rences that are needed for the ADC, and Sect. the following paragraphs.
13.7 discusses perspectives and trends in the
field of ADCs for IoT.
13.2.2 Track&Hold

13.2 Basics of SAR ADC Design The basic T&H topology that is often used in
SAR ADCs is shown in Fig. 13.5a. Vin is sam-
In this section, the basics of low-power SAR pled on capacitor Cs at the sampling rate fs. When
ADC design are discussed, addressing all major the control signal is high, the output tracks in
components in the system. the input signal; when the control signal is low,
the signal is held on the capacitor. Crucial for the
T&H performance is the implementation of the
13.2.1 SAR Principle of Operation switch. The most straightforward way is to use an
NMOS (or PMOS) device as switch (Fig. 13.5b).
Figure 13.4 shows the topology and timing dia- However, assuming the control signal going to
gram of a SAR ADC where an analog input the gate of the NMOS is limited by the supply
voltage Vin is first sampled at the sample clock voltage, this NMOS device cannot conduct for
fs using a Track & Hold (T&H). Next, the SAR rail-to-rail input signals: the NMOS only
logic performs a binary search algorithm to find conducts when Vin < VDDVtn, where VDD
the N-bit output code Dout that matches the sam- is the supply voltage and Vtn the threshold volt-
pled input voltage. This is done by comparing age of the NMOS. Similarly, a PMOS only
VDAC, the voltage corresponding to the estimated conducts when Vin > |Vtp|, where Vtp is the
PMOS threshold voltage.
fclk A conventional solution to this problem is to
VS use a CMOS switch, where an NMOS and PMOS
Vin T&H Logic device are connected in parallel. In this way, as

fs
VDAC N-bit (a) (b)
DAC Dout
fs fs

fclk Vin VS Vin VS


CS CS
fs
bit 1 2 3 ··· N 1
Fig. 13.5 Basic T&H topology (a) and T&H with
Fig. 13.4 Basic SAR ADC topology and timing diagram NMOS switch (b)
13 Ultra-Low Power Analog-Digital Converters for IoT 365

long as |Vtp| + Vtn < VDD, a rail-to-rail input Rleak


swing can be supported. Unfortunately, VDD
tends to scale down faster than the threshold Vin VS
voltages in advanced processes, making it more Cds CS
and more difficult to satisfy this requirement.
Moreover, IoT nodes may prefer to use a reduced
supply voltage to save power, making it even Fig. 13.6 T&H leakage and capacitive coupling in the
harder to reach this requirement. hold mode
Due to the above issues, techniques such as
clock boosting (Cho and Gray 1995) and are relatively mild in IoT nodes, boosted or
bootstrapping (Abo and Gray 1999) are often bootstrapped switches can usually achieve suffi-
used. Both techniques essentially use the circuit cient linearity. In some cases, a CMOS switch
of Fig. 13.5b with an NMOS only, but they might also be adequate. Only if the supply voltage
control the gate with a voltage beyond VDD to is scaled down too much, the above problems
allow a rail-to-rail input swing and a better become a limiting factor. As example of what is
switch conductivity. In clock boosting, a fixed feasible, the T&H in Harpe et al. (2014) achieves a
voltage multiplier is used to increase the gate linearity well above 80 dB for a signal bandwidth
voltage from VDD to kVDD, where k is typically of 16 kHz using a 0.8 V supply, a 65 nm technol-
between 1.5 and 2x of gain. In bootstrapping, a ogy and a clock-boosted switch.
level-shifter is used to lift the gate voltage to The last two problems, leakage and capacitive
VDD + Vin, making the gate-source voltage coupling, occur when the T&H is in the hold
always equal to VDD. A limitation of boosting mode. Ideally there should an infinite impedance
and bootstrapped techniques is that they rely on a between the input and output node of the switch
charge pump. For ADCs operating at extremely during hold mode. However, there might be
low speed or on-demand (which could be the drain-to-source leakage from the transistor
case in IoT nodes), these charge pumps could (modeled by a resistance Rleak) and capacitive
fail to operate properly due to charge leakage. coupling from the layout (modeled by Cds), as
In such cases, a CMOS switch could be shown in Fig. 13.6. As a result, the output voltage
preferable. is not completely isolated from the input signal
Having discussed the basic topologies of the during the conversion, which can cause distor-
T&H, the most important imperfections are tion in the output code (Harpe et al. 2011). In
discussed next, namely: sampling noise, particular in advanced CMOS nodes, it can be
on-resistance, charge injection, leakage and important to simulate these effects to verify the
capacitive coupling. impact on the performance.
At the moment the switch samples the input
voltage on capacitor Cs, the thermal noise pro-
duced by the transistor will also be sampled on 13.2.3 DAC
the capacitor. This leads to an output noise power
Pnth of kT/Cs, where k is the Boltzmann constant In this subsection, the design of the feedback
and T the temperature. For ADCs with higher DAC is discussed. While there are various possi-
SNDR, a lower Pnth can be tolerated, implying ble implementations, the most typical solution is
that they will need a larger Cs. Typical a capacitive charge-redistribution DAC, using
low-power SAR ADCs can have a Cs in the voltage-mode operation. In the context of IoT,
order of 0.3–1 pF (Harpe et al. 2013, 2015). one important advantage of a switched-capacitor
Two important imperfections causing distor- DAC is that its power consumption is fully
tion are the finite on-resistance and the charge dynamic. It thus scales intrinsically with the sam-
injection of the sampling switch (Pelgrom 2017). pling rate and it allows on-demand operation and
However, as the speed and resolution requirements standby, simply by enabling/disabling the control
366 P. Harpe

Fig. 13.7 3-bit charge- (a) (b)


redistribution DAC (a) and
switch simplification (b) VDAC

C2 C1 C0 C0

VREF
d2 d1 d0 VREF d0
VREF
Ci = 2i·C 0 d0

signals. Figure 13.7a shows a single-ended 3-bit ground levels, as this simplifies the design of the
example of such a DAC. It is composed of switches in Fig. 13.7a. At reduced supplies it is
binary-scaled capacitors Ci, controlled by a digi- particularly difficult to make a well-conducting
tal binary code d2:0. The digital control can switch for signal levels in the middle between
switch the bottom plate of each capacitor supply and ground. On the other hand, the ground
between various reference voltages, in this exam- level can be easily switched with an NMOS
ple GND and VREF. If code di switches capacitor while the supply level can be easily switched
Ci from GND to VREF, an output voltage step with a PMOS device. By doing so, the switching
ΔVDAC of Ci/CS∙VREF is induced, where Cs is the network simplifies to a digital-style inverter,
sum of all capacitors (8Ci in this example). Based where the control signal is applied to the input
on superposition, the overall DAC voltage is of the inverter, and the reference voltage VREF
hence proportional to the value of the binary (often equal to VDD) is the supply of the inverter
code: (Fig. 13.7b).
The selection of the value of the unit capacitor
C0 XN1 i
V DAC ¼ V REF d 2 ½V
i¼0 i
ð13:4Þ (C0), implying a total DAC capacitance CS equal
CS to 2N∙C0, is the most critical decision in the DAC
Note that the DAC is shown here as a stand- design process. In the following, the various
alone component. However, as will be explained considerations are discussed.
in Sect. 13.3, the DAC often acts as sampling In terms of noise, the DAC has several
capacitance of the T&H as well, such that a contributions. First, when the DAC is reused as
dedicated CS (as in Fig. 13.5) is not needed sampling capacitor, it will exhibit kT/C-noise as
anymore. explained in Sect. 13.2.2. Furthermore, noise
The above circuit shows only one example of from the reference voltages and noise from the
a 3-bit DAC that creates a transfer function as in DAC switches contribute additional noise to the
equation (13.4). In reality, different topologies DAC’s output. In terms of linearity, the ideal
with different switching schemes can achieve transfer function is given in equation (13.4), but
the same transfer function while saving power this is only valid if the capacitors Ci are perfectly
or resources by using for instance multiple binary scaled. In reality, each capacitor element
references, semi-differential switching schemes, C0 will experience a random mismatch σ0, caus-
split capacitors, or charge recycling schemes. ing the function to become non-linear and thus
Without being complete, a few examples are: introducing INL/DNL errors and loss of SNDR.
(Ginsburg et al. 2006; Hariprasath et al. 2010; Generally speaking, a larger value of CS is
Liu et al. 2010; Liou et al. 2013; Liu et al. 2016b; required to suppress noise and mismatch.
Tai et al. 2014; Zhu et al. 2010). These schemes As opposed to the above, power consumption
can save a substantial amount of power, and are and chip area benefit from using smaller
thus very relevant to IoT nodes. capacitors. To first order, both power and chip
In terms of reference voltages, a trend is to use area are linearly related to the total capacitance
only voltages close (or equal) to the supply and CS, suggesting one should reduce its value as
13 Ultra-Low Power Analog-Digital Converters for IoT 367

much as possible down to noise and linearity become signal-independent and remain around
limitations. In practice, the smallest possible ele- mid-supply.
ment can be dictated by the minimum element The most important non-idealities of the com-
available from the foundry’s library. A solution parator are its input-referred noise, offset and
to that is to develop custom designed sub-fF decision time. Comparator noise can lead to deci-
capacitors (Harpe et al. 2011) which can save sion errors once the input signal magnitude is
power, while having relatively good matching similar to the noise level. For that reason, com-
(given the capacitor value), and a small form- parator noise is as important as T&H noise and
factor layout. However, the chip area will also be quantization noise. As the comparator is often
related to the number of interconnections and critical for the overall ADC’s power consump-
spacing requirements, which will be proportional tion, the power-efficiency is a particular point of
to the number of control signals (N), or the num- attention. Fundamentally, each 6 dB noise
ber of elements (2N). As such, for higher resolu- improvement costs a factor of four in power
tion ADCs, the area might be dictated by the consumption. However, several strategies can
large number of units (2N). This can be reduced be employed to optimize the efficiency. For
by using a split-capacitor array (Agnes et al. instance, the dynamic topology in van Elzakker
2008) or by using multiple layout units (Harpe et al. (2010) achieves relatively good efficiency,
et al. 2014). which is further improved by biasing the critical
devices in sub-threshold. A further enhancement
of the above circuit is made in Liu et al. (2016)
13.2.4 Comparator where a factor of two in power is saved by using
both phases for amplification rather than wasting
In this subsection, the design of the comparator is one for reset. Another approach is to minimize
discussed. Conceptually, in a single-ended case, the supply voltage, as the power of a dynamic
the comparator compares the sampled input volt- circuit scales with the square of the supply.
age VS against the DAC voltage VDAC, as Besides circuit innovations, system-level
illustrated in Fig. 13.8a. Comparators in SAR solutions such as in Harpe et al. (2012, 2013)
ADCs are usually dynamic, i.e., they perform a have helped to reduce the comparator power
comparison and a reset for each clock period consumption by using adaptive performance dur-
fcmp. Similar to the DAC, this allows convenient ing the conversion.
power scaling dependent on the sample rate. In A second imperfection of the comparator is its -
most practical implementations, the signals VS input-referred offset. Fortunately, the comparator’s
and VDAC are differential. To avoid a 4-input offset is equivalent to an ADC input-referred offset
comparator, VDAC is usually subtracted from VS and does not introduce distortion. For IoT
prior to the comparator. Then, a 2-input compar- applications where the signal being converted is
ator can simply decide the sign of VSVDAC as not containing DC information, this input-referred
shown in Fig 13.8b to obtain the required infor- offset might be ignored. For applications where the
mation. An advantage of doing so is that the offset is critical, for instance an offset calibration
input common mode of the comparator can could be performed, or a system-level chopping

Fig. 13.8 Dynamic (a) (b)


comparator in single-ended
(a) and differential case (b)
VS VS > VDAC ? + VS - VDAC > 0?
VS - VDAC
VDAC -

fcmp fcmp
368 P. Harpe

technique could be applied to cancel the ADC’s A first string (or register) is acting as a thermom-
offset (Harpe et al. 2014). eter counter to memorize in which phase of the
The third consideration for the comparator is conversion process the SAR ADC is. Assuming
it speed. As soon as the comparator is triggered that the ADC uses N + 1 clock cycles (1 for
by the clock signal, it still takes a certain delay tracking and N for the N-bit SAR conversion),
until the comparator reaches its decision. This the thermometer counter will count from 0 up to
delay depends on the input signal magnitude N. A second register contains the actual DAC
and increases for smaller input signals (Pelgrom code that will ultimately compose the ADC out-
2017). The clock signal should thus allow suffi- put code at the end of the conversion. Besides
cient time to reach a decision. Moreover, the these two registers, additional combinational
comparator also needs a sufficiently long reset logic needs to be added to generate various inter-
phase to reset the comparator to the initial condi- nal signals to control the comparator, T&H and
tion. If this time is too short, the next decision DAC, based on the state of the registers. For
can be affected by the previous one, causing simplification this is not shown.
signal dependency and hence non-linearity. The basic operation of the registers is shown
While IoT applications are not very demanding in the timing diagram of Fig. 13.9b. The external
in terms of speed, the comparator delay will clock fclk, at N + 1 times the sample rate fs,
increase substantially when lowering the supply drives the thermometer counter, thus creating
voltage. Moreover, when operating in sub- N + 1 counter values from 0 to N. Combinational
threshold, the delay can also be severely affected logic will combine fclk with the counter value to
over PVT variations, thus requiring sufficient generate a sampling clock fs in the first counter
design margin with respect to the timing. As cycle (counter ¼ 0) and to generate N compara-
discussed in the next paragraph, asynchronous tor clock cycles (fcmp) in the other counter cycles.
timing can alleviate the comparator delay varia- The incrementing counter also strobes the data
tion to some extent. register, addressing each bit one by one (starting
with the MSB down to the LSB), to store the
comparator output in the correct data bit. The
13.2.5 Logic data register drives the DAC, but some
additional combinational logic might be needed,
The logic in a SAR ADC is conventionally build dependent on the switching scheme used in
around two strings of flip-flops (Fig. 13.9a). the DAC.

Fig. 13.9 SAR logic core (a)


(a) and timing diagram (b)
D Q D Q D Q D Q D Q Counter
fclk
Comparator output
D Q D Q D Q
Data
(SAR)
Bit 1 Bit 2 Bit N

(b)
fclk

Counter 0 1 2 3 ··· N 0 1
fs
fcmp
bit 1 2 3 ··· N 1
13 Ultra-Low Power Analog-Digital Converters for IoT 369

The above solution has two drawbacks. First, single (on-demand) conversion. Examples of
the logic requires an external clock which is asynchronous timing are for instance given in
N + 1 times faster than the actual sample rate. Chen and Brodersen (2006), Harpe et al. (2011,
The higher speed could increase system-level 2013), and detailed logic implementations are
power consumption, but it also complicates described in Harpe et al. (2011, 2012).
duty-cycled operation as one has to keep track
of the number of cycles of fclk. A second disad-
vantage is in terms of comparator metastability
13.3 An Ultra-Low Power SAR ADC
handling. As explained in Sect. 13.2.4, the com-
parator decision time depends on the signal mag-
This section discusses a design example of an
nitude applied to the comparator. Ultimately,
ultra-low power SAR ADC with 10 bit resolution
when the signal becomes very small, the compar-
and a variable sampling rate from DC to 100kS/s,
ator is close to metastability where the decision
consuming down to 0.15 nW of power (Harpe
can take a long time. During the N cycles of the
et al. 2015, 2016). While originally developed
SAR conversion, the input signal to the compar-
for low-power bio-potential recording, the
ator will vary substantially. As a result, the com-
specifications are suitable for versatile sensing
parator decision time varies significantly from
applications where especially the power con-
cycle to cycle. However, with a fixed fclk rate,
sumption is critical. Thanks to nW-level opera-
the maximum fclk is limited by the slowest deci-
tion, this ADC allows extremely small form-
sion time of the comparator. Since in most cases
factor devices powered by energy harvesting or
the comparator is much faster, this implies that
tiny batteries.
time is wasted in the other cycles.
The topology of the ADC, which is rather
A solution to the above problems is to use
standard, is shown in Fig. 13.11. To save
asynchronous timing inside the SAR ADC. In
power, the nominal supply is reduced to 0.6 V,
this case, as sketched in Fig. 13.10, only an
well below the regular 1.2 V supply of the
external clock at the sample rate fs is provided.
applied 65 nm CMOS technology. At the same
The comparator clock fcmp is internally
time, the supply of 0.6 V is still sufficiently high
generated, usually by means of a feedback loop
such that conventional circuits can operate
that automatically produces N comparator
correctly. The sampling switches are clock-
cycles. The actual cycle time can be made vari-
boosted to achieve sufficient linearity (Cho and
able by waiting exactly until the comparator
Gray 1995). As mentioned earlier, the T&H has
decision has been made. As a result, some cycles
no explicit sampling capacitance, but the input
will be faster and others slower, to accommodate
signal is sampled directly on the DAC capacitor
the timing variation of the comparator without
array instead. Because all circuits are
wasting time. By doing so, the average clock
dynamically biased, the power scales inherently
cycle is now set by the average delay of the
with the sample frequency. To reduce the leak-
comparator rather than the worst-case delay,
age power consumption, high threshold-voltage
hence improving speed and improving the ability
transistors are used.
to deal with metastability issues. This also
simplifies the system-level design, as a single
T&H clock ADC clock
clock pulse on fs is now sufficient to perform a
Boosting Comp. Delay Enable
clock
Ready CMOS
Analog
fs input Decision out logic

fcmp OUTP 10b


DAC Digital
bit 1 2 3 ··· N 1 OUTN output

Fig. 13.10 Example of asynchronous timing Fig. 13.11 10 bit asynchronous SAR ADC
370 P. Harpe

Fig. 13.12 Asynchronous On-demand conversion


on-demand operation of
the ADC ADC clock
Zoom-in
~5μs
T&H clock Tracking
Comparator enable
Comparator clock

(a) (b) structures. Normally, a 10-bit binary scaled


array would be needed. In this case, the 10 bit
3b unary + 7b binary Unit capacitor
OUTP
is segmented in 3 MSBs and 7 LSBs. The LSBs
7x are binary coded and controlled by code B<6:0>,
32fF 16~0.25fF
the 3 MSBs are thermometer encoded and there-
T<6:0> B<6:0> fore control 7 identically sized capacitors, con-
trolled by thermometer code T<6:0>. Using
{

2X, pseudo differential Top view, M6/M7 stacked


thermometric coding has two advantages: it
Fig. 13.13 Implementation of the DAC (a) and the saves switching energy in the DAC and it reduces
capacitor elements (b) the maximum DNL error as the worst-case num-
ber of switching elements is reduced (Harpe
The ADC only requires a single external clock et al. 2013).
at the sample rate fs. An internal loop around the To maximize the dynamic range and to
comparator creates the clock that is needed for simplify the ADC system integration, the DAC
the comparator and logic. As shown in the timing only uses ground and VDD as its reference
diagram in Fig. 13.12, a rising edge of the exter- voltages, as explained before in Fig. 13.7b.
nal ADC clock initiates a complete conversion. Because of that, the ADC requires only a single
The logic enables the feedback loop around the supply voltage that is used by all components.
comparator (Fig. 13.11): this loop will start the To save power in the DAC, as well as in the
comparator clock first. As soon as the comparator (external) reference buffer and in the analog
has resolved a decision, its Ready output will be buffer driving the ADC, the capacitance of the
engaged. By means of the inverted delay, this DAC is minimized as much as possible by devel-
Ready signal will disable the comparator clock. oping custom designed fringing capacitors. Sim-
As a result, the comparator is reset and its Ready ilar to Harpe et al. (2011), this work uses
will switch off again. As the Ready turns off, this capacitive elements as shown in Fig 13.13b
will initiate a next comparison clock cycle. In with a unit value of 0.25fF. To improve capacitor
this way, the consecutive bit cycles are generated density, two metal layers (6 and 7) are stacked.
until all 10 bits are resolved. At that point, the The lower layers (1 to 5) are not used, because
logic disables the feedback loop, produces an they would increase the parasitic capacitance to
ADC output code, returns the DAC to tracking the substrate, causing an attenuation of the signal
mode, and then places the ADC into standby. As range. By using these small capacitors, the total
the timing diagram shows, the entire conversion input capacitance of the DAC is only 0.3 pF,
takes approximately 5 μs, after which the ADC which leads to a kT/C-limited SNR of approxi-
returns to sleep. Since the ADC only requires a mately 66 dB, which is still sufficient for a
single rising clock-edge on its external clock to 10 bit ADC.
trigger a conversion, the ADC is very suitable for Figure 13.14 presents the die photo of the
on-demand operation. ADC, implemented in a 65 nm CMOS process
The implementation of the capacitive DAC is and occupying 180 μm∙80 μm. The chip area is
shown in Fig. 13.13a. The differential DAC is mostly dominated by the capacitive DAC. How-
implemented by two identical single-ended ever, a large part of the DAC area (~75 %) is due
13 Ultra-Low Power Analog-Digital Converters for IoT 371

9.5
9.4 f s= 100kS/s

ENOB [bit]
9.3 VDD = 0.6V

9.2
9.1 ENOB = 9.2bit
9.0
1 10 50
Input frequency [kHz]
Fig. 13.14 Die photo of the ADC in 65 nm CMOS Fig. 13.16 Measured ENOB of the ADC

1
leakage of 0.15 nW is measured by disabling the
INL [LSB]

clock altogether and measuring the supply cur-


0
rent. From the simulated breakdown (post-
layout), it is clear that the comparator contributes
-1
1 most to the overall power. The DAC contribution
DNL [LSB]

is small thanks to the small unit capacitors, even


0 though the DAC switching scheme was not
optimized.
-1 0 1023 Table 13.1 shows a performance summary
Code
and comparison to prior-art. The efficiency is
= Thermometer code transition comparable to state-of-the-art, but not as good
Fig. 13.15 Measured INL/DNL of the ADC as (Tai et al. 2014). However, as can be seen
from Fig. 13.2, this is still one of the few ADCs
to interconnections that were not really under the indicated FOMW trend line of 2 fJ/
optimized for chip area, while the capacitors conversion-step. Another advantage of the pro-
only take <20 % of the total DAC area. posed ADC is that it has the lowest leakage
The precision of the ADC is verified by mea- power, allowing to maintain power efficiency
suring the INL, DNL and ENOB. The INL and even when the sample rate is reduced to well
DNL are shown in Fig. 13.15. Both parameters below 1 kS/s.
remain within 1LSB. As expected, the largest This section described a 10-bit ADC with a
DNL errors (and thus the largest discontinuities versatile sampling rate. As shown, the architec-
in the INL) happen at the thermometer code ture and implementation are relatively basic.
transitions. Thanks to technology scaling, the simple archi-
The ENOB, measured as a function of the tecture and circuits, the small unit capacitors, and
input signal frequency at a sample rate of a reduced supply voltage, this still allows to
100 kS/s is given in Fig. 13.16. The performance achieve state-of-the-art power efficiency. More-
is constant over the entire Nyquist zone, showing over, by using only a single supply and a single
the ADC has sufficient bandwidth despite the clock-edge for triggering conversion, the ADC is
reduced supply voltage. simple to integrate and use in an IoT system.
Lastly, Fig. 13.17 displays the measured
power consumption versus sampling rate
together with the simulated power breakdown at 13.4 A High-Precision SAR ADC
1 kS/s. It can be seen that the power scales
proportional to the sample rate. Due to The previous section described a SAR ADC with
limitations of the measurement setup, the lowest 10 bit resolution. While this can be sufficient for
frequency measured is around 0.2kS/s. A standby basic sensing applications, other applications
372 P. Harpe

Fig. 13.17 Measured 100 Power at VDD = 0.6V:


power consumption and 0.15nW at 0S/s

Power [nW]
simulated power 10 1nW at 1kS/s
breakdown of the ADC 88nW at 100kS/s
1
Breakdown at 1kS/s:
0.1 5% T&H 57% Comp.
0.1 1 10 100 22% DAC 16% Logic
Sampling rate [kS/s]

Table 13.1 ADC performance summary and comparison


This work, Harpe et al. (2015) Zhang et al. (2012) Harpe et al. (2013) Tai et al. (2014)
Process [nm] 65 65 65 40
Supply [V] 0.6 0.7 0.6 0.45
Total power [nW] 88 3 72 84
Leakage power [nW] 0.15 0.67 0.4 N/A
Resolution [bit] 10 10 10 10
ENOB [bit] 9.2 9.1 9.4 8.95
Sample rate [kS/s] 100 1 40 200
FOMW [fJ/conv.step] 1.5 5.5 2.7 0.85

could demand higher precision. Therefore, this of a SAR ADC while maintaining a constant
section describes a SAR ADC with a relatively FOMS. However, correlated errors (such as dis-
high precision, selectable from 67.8 dB up to tortion and 1/f noise) cannot be solved by
79.1 dB of SNDR (Harpe et al. 2014). oversampling.
In order to increase a SAR ADC’s SNDR, the The second technique being applied is
main challenge is to reduce noise and distortion system-level chopping. Chopping is a known
contributions while maintaining power- technique to mitigate offset and 1/f noise
efficiency. In this design, a combination of problems of amplifiers, and can be applied like-
oversampling, chopping, dithering and data- wise to an ADC. Figure 13.18 shows a simplified
driven noise-reduction is applied to achieve explanation of an ADC converting an analog
this. These techniques will be described one input X to a digital code Y. As shown in the
by one. upper graph, the ADC might add signal-
Oversampling is a known technique to dependent harmonic distortion (HD(X)), offset
improve the SNR (Signal-to-Noise-Ratio) in a (O) and 1/f noise (1/f). In the second graph,
signal band of interest by sampling faster than system-level chopping is applied to this ADC.
the Nyquist rate. Given that the sampling rate is The input signal is modulated with a chopping
increased with a certain oversampling ratio clock (fc) before the conversion, and
(OSR), the in-band noise power will be reduced demodulated by the same clock at the output of
by the same factor: the ADC. If fc is set to half of fs, this means that
the system is transparent in the odd clock cycles,
Pn, total
Pn, inband ¼ ð13:5Þ while the input and output signals are inverted in
OSR the even clock cycles. The equations in
This implies that every factor 4 of Fig. 13.18 show how the output codes in the
oversampling reduces the in-band noise by two phases (Y1 and Y2) depend on the input
6 dB, thus improving the SNR by 6 dB at the and various imperfections.
cost of 4x higher speed and power. Oversampling Next, the harmonic distortion is separated in
allows to mitigate all random noise contributions two components, namely the even-order and the
13 Ultra-Low Power Analog-Digital Converters for IoT 373

Fig. 13.18 System-level


chopping of an ADC ADC

= + + + 1/

ADC
1
=
2

Odd cycles Even cycles

+1 ADC +1 1 -1 ADC -1 2

1 = + + + 1/ 2 = − (− + − + + 1/ )

odd-order distortion components, denoted by causing an improvement of local linearity. How-


HDe(X) and HDo(X), respectively. This means ever, it should be noted that since the dither is
that the relations in equations (13.6) and (13.7) limited in amplitude, it cannot solve global
hold. non-linearities.
Figure 13.19a shows the implementation of
HDðXÞ ¼ HDe ðXÞ þ HDo ðXÞ ð13:6Þ
the ADC including the chopping and dithering
( techniques. It supports a native resolution of
HDe ðXÞ ¼ HDe ðXÞ
ð13:7Þ 12 or 14-bit. To integrate chopping, there are
HDo ðXÞ ¼ HDo ðXÞ two sets of sampling switches, allowing to sam-
ple the input signal alternatingly in the normal
As a final step, the average value of Y1 and Y2
way (φ1) or with inverted polarity (φ2). The
can be determined, which yields the result in
second chopper, which is in the digital domain,
equation (13.8). The result shows that chopping
simply needs to forward the output bits in one
removes offset, 1/f noise and even-order distor-
clock cycle, and invert the output bits in the next
tion when looking to the average of codes (or in
clock cycle.
reality: when looking to the low-frequency part
The dithering technique is implemented
of the spectrum). In fact, these imperfections are
inside the DAC, as illustrated in Fig. 13.19b.
not removed, but modulated to the chopping
The actual circuit is differential, but only one
frequency. If oversampling is applied together
half is shown here. Similar to the ADC design
with chopping, this implies that offset, 1/f noise
discussed in Sect. 13.3, the lower bits (9 down to
and even-order distortion will be moved out of
0) are binary encoded, while the upper 4 bits are
band, hence allowing to improve the SNDR of a
unary encoded (requiring 15 identical-sized
SAR ADC.
capacitors). The dither signal can be injected
Y1 þ Y2 with a capacitive network connected to the
¼ X þ HDo ðXÞ ð13:8Þ DAC. In time, the input signal is first sampled
2
on the top plates of all the capacitors. Then, the
The third technique used to improve linearity delayed sample clock will switch the dither
is dithering. It can be observed (for instance in capacitors, causing a dither value to be added to
Fig. 13.15) that the ADC distortion is often rather the sampled input signal. After that, the normal
irregular due to capacitor mismatch, causing conversion starts, such that the ADC converts the
strong local variations. Dithering aims to smooth dithered input signal. Additional details on this
out these irregularities by adding an amount of technique are explained in Harpe et al. (2014).
dither to the input signal before conversion. In The implementation of the DAC capacitors
this way, the local irregularities can be averaged, leads to a practical challenge. Since the
374 P. Harpe

(a)
ϕ1
ϕ2 Logic ϕ1
Analog ϕ2 12/14bit
input ϕ1 output
DAC + dither

(b)
15x DAC Dither circuit
out

T<14:0> B<9> B<0> 2b/4b Counter delayed fs


4b unary + 10b binary

Fig. 13.19 Implementation of chopping technique (a) and DAC with dithering (b)

Table 13.2 DAC capacitor implementation saving almost a factor of 16 in the number of
devices and thus making the layout more simple
Unit capacitor Number of units
and compact. The total DAC capacitance is 9 pF,
T<14:0> 8.8fF 15  64 (960 in total)
B<9:4> 8.8fF 32, . . ., 1 (63 in total) which leads to about 85 dB of kT/C-related SNR
B<3> 4.4fF 1 at 0.8 V supply.
B<2> 2.2fF 1 The last technique applied in this design is
B<1> 1.1fF 1 Data-Driven Noise-Reduction (DDNR). Its goal
B<0> 1.1fF ½ is to improve the comparator noise level in a
Total 9 pF 1027 more efficient way than by simple analog circuit
scaling which costs 4x in power for a 6 dB better
ADC has 14-bit of resolution, it theoretically SNR. This is critical in low-power ADCs, as the
needs 214 ¼ 16,384 capacitive elements, which comparator can dominate the overall ADC power
is unpractical. A split capacitor array (Agnes consumption (e.g., 57 % in Fig. 13.17). When
et al. 2008) could reduce this, but may lead to looking to Fig. 13.20a, it can be observed that
additional non-linearities. In this case, the num- during the SAR conversion, the input signal mag-
ber of units is reduced by using multiple layout nitude applied to the comparator is often large,
units, of respectively 8.8, 4.4, 2.2 and 1.1fF. and only in a few cycles it will be small (in the
Table 13.2 gives an overview of how the DAC order of an LSB). Comparator noise can cause
capacitors are composed. The largest capacitors decision errors, but this will only happen if the
(for the thermometer bits and the binary bits input signal is in the same order of magnitude as
9 down to 4) are composed of 8.8fF layout the noise. For those cycles where the input is very
elements to reduce the number of devices. The large, the comparator noise is in fact not critical
lower bits (bit 3, 2, and 1) use the down-scaled at all. Thus, rather than using a comparator with a
layout elements of 4.4, 2.2 and 1.1fF. Lastly, bit fixed noise level, DDNR saves power by
0 effectively uses half a unit of 1.1fF. This is modifying the comparator noise level on-the-fly
done by placing a capacitor of 1.1fF on the posi- for each cycle, dependent on the input signal
tive side of the DAC array, and no capacitor on magnitude: for large signals a high noise level
the negative side of the differential topology. In is tolerated to save power, and for small signals
this way, the effective value is 0.55fF when the noise level is reduced to achieve better preci-
looking to the differential operation. The slight sion. In order to implement this concept, two
common-mode imbalance is irrelevant as this is components are needed: (1) it should be detected
very small. From Table 13.2 it is clear that the (within each cycle) whether the input signal mag-
number of units is reduced to 1027 in this way, nitude is large or small; (2) the noise-level of the
13 Ultra-Low Power Analog-Digital Converters for IoT 375

(a) (b) and the SAR continues with the next bit cycle.
Comparator input However, if the decision was slower than the
during approximation reference delay, four additional comparisons are
Vin τdelay performed on the same input signal. The majority
Vin

vote on five decisions is taken and forwarded to


Clock
0 the SAR logic. Only then will the SAR logic
* * * Delay
time τref proceed to the next bit cycle.
cell
* = Small input magnitude A die photo of the implemented ADC is
shown in Fig. 13.21. The largest portion of the
indicated area is occupied by supply decoupling
Fig. 13.20 Comparator input during conversion (a) and capacitors and the DAC. The ADC operates at a
decision time monitor (b)
nominal supply of 0.8 V, and can work either in
12 bit or 14 bit Nyquist mode, or in oversampling
comparator should be tunable instantaneously modes. 12 bit resolution is simply implemented
during the conversion process. by skipping the last 2 conversion cycles and by
To address the first problem, recall that the disabling majority voting to save power. When
comparator decision time is related to the input oversampling is enabled, the chopping and dith-
signal magnitude: smaller signals lead to a longer ering techniques can be turned on to improve the
decision time (Sect. 13.2.4). Thus, by observing in-band SNDR.
the decision time, the input signal magnitude can Figure 13.22 shows an example of a measured
be classified. As shown in Fig. 13.20b, a tunable output spectrum, in this case in 14 bit 16x
delay cell is triggered together with the compar- oversampling mode with 128kS/s sampling rate.
ator. By comparing the delay of the comparator The in-band SNDR is 80.0 dB while the linearity
(τdelay) against this reference delay (τref), it can reaches 87.5 dB and the ENOB is 13 bit. In this
be decided whether the input signal was large or mode, the power breakdown (based on post-layout
small. In practice, by tuning the reference delay simulations) is as follows: DAC 51 %, compara-
cell by means of feedback, the reference delay tor 40 %, logic 4 %, chopped T&H 3 %, dithering
τref can be stabilized to a desired value regardless 2 %. Figure 13.23 shows the SNDR of this ADC
of PVT variations (Harpe et al. 2014). in different modes of operation (12 bit/14 bit,
The second problem is to tune the noise-level Nyquist and oversampling), and an overall perfor-
of the comparator dynamically. While this could mance summary is given in Table 13.3. Compared
be done by tuning the analog circuit, this is cum- to prior-art (Fig. 13.2), this work achieves state-
bersome and could induce new errors (such as of-the-art power-efficiency (FOMS from 173.8 to
offset variations). Therefore, a digitally-intensive 176.8 dB) and also enables a relatively high
solution is applied by majority voting on a SNDR up to 79.1 dB.
repeated set of comparator decisions to enhance The SAR ADC presented in this section
the effective noise level. For instance, when the shows several simple features that enhance the
same comparator decision is repeated five times, SNDR while maintaining state-of-the-art power-
and the majority vote of those five decisions is efficiency. Also, the ADC can cover different
used as final output, this effectively reduces the performance settings (67.8 dB to 79.1 dB of
input-referred noise by 6 dB, as shown in the SNDR) to allow a flexible trade-off between
presentation of Harpe et al. (2013). In fact, this performance and power.
is a bit similar to oversampling, where a higher
sample rate is used to reduce in-band noise.
Combining the above two components, the 13.5 A Reconfigurable SAR ADC
overall approach is as follows: each SAR cycle
starts with a single comparison. If the decision The design examples of the previous sections are
time is faster than the reference delay, this deci- mostly relevant for sensor interfaces because of
sion is immediately forwarded to the SAR logic their sample rates in the kHz range. However,
376 P. Harpe

Fig. 13.21 Die photo of


the ADC in 65 nm CMOS

Power [dBFS]

Fig. 13.22 Measured 0


output spectrum of the -40 SNDR = 80.0dB ENOB = 13bit
ADC in 14 bit mode with SFDR = 87.5dB
-80
16x OSR
-120
0 1 2 3 4
Frequency [kHz]

Fig. 13.23 Measured 90


SNDR of the ADC in 85 14b 4x/16x OSR mode
SNDR [dB]

12 and 14 bit mode, with 80


and without oversampling 75
70
65 12b/14b Nyquist mode
60
10 100 1k 10k
Input frequency [Hz]

Table 13.3 ADC performance summary similar SAR topologies can be used easily to
Process [nm] 65 operate in the MHz range, allowing to re-use
Supply [V] 0.8 the converter in low power radios. In this section,
Resolution [bit] 12 14 a SAR ADC with flexible resolution (7, 8, 9, or
Sample rate [kS/s] 32 32 128 128 10 bit) and a variable speed (DC to 2 MS/s) is
OSR – – 4x 16x
described, allowing to re-use the design for either
Bandwidth [kHz] 16 16 16 4
low-power sensing or low-power narrow-band
SNDR [dB] 67.8 69.7 76.1 79.1
Power [μW] .310 .352 1.367 1.370
communication (Harpe et al. 2012).
FOMW [fJ/conv.step] 4.8 4.4 8.2 23.2 Figure 13.24 shows the architecture of the
FOMS [dB] 174.9 176.3 176.8 173.8 reconfigurable SAR ADC. The basic architecture
13 Ultra-Low Power Analog-Digital Converters for IoT 377

fs Reconfiguration switch
VS
Async.
Vin T&H VDAC
logic

VDAC 7, 8, 9, or 10-bit CN-1 CN-2 CN-3 C0


Ci = 2i·C0
DAC Dout VDD CS ª 2N·C0

Fig. 13.24 Architecture of the reconfigurable SAR ADC Fig. 13.25 Implementation of the resolution
reconfigurable DAC

is similar to the previous design examples, using reconfiguration switch is permanently connected.
asynchronous clocking and a single supply for all In this case, the DAC has a resolution of N bit, and
components (including the DAC reference volt- the power consumption is proportional to the total
age). In order to make the sample rate capacitance CS, which is equal to 2N∙C0. To
reconfigurable, the ADC is implemented with reduce the DAC’s resolution, the reconfiguration
dynamic circuits, similar to the examples switch can be disconnected. In this situation, the
discussed earlier. Therefore, the main attention largest two capacitors (CN1 and CN2) are perm-
in this section is on how to reconfigure the ADC anently disconnected and do not contribute to the
resolution while maintaining power-efficiency. DAC’s resolution or power consumption. Thus,
For the digital logic inside a SAR ADC, the the DAC has now N2 bit of resolution, while the
complexity and power consumption theoretically effective capacitance is reduced to 2N2∙C0. This
scale linear with the resolution N, as can be seen implies that the power consumption is exponen-
from the diagram in Fig. 13.9. To implement tially scaled down with a factor 22, following the
reconfigurable logic where the resolution can be exponential scaling requirement in N.
adapted on-the-fly is straightforward: for In theory, it would be possible to add 3 recon-
instance, if 10 registers are implemented, up to figuration switches, such that DAC resolutions of
10 bit ADC resolution can be supported. If only 7, 8, 9, and 10 bit can be implemented with a
7 out of the 10 registers are activated, the ADC relative power consumption of 100 %, 50 %,
resolution is effectively reduced from 10 to 7 bit, 25 % and 12.5 %, respectively. However, for
while the power is scaled down proportionally as sake of simplicity, the implementation is limited
well, following the expected linear trend with to have only 2 modes of operation as shown in
N. Since the ideal power-scaling trend can be the graph, with a factor of 4 in power scaling.
achieved with this reconfigurable implementa- The DAC is set to 8-bit mode to support ADC
tion, the power efficiency is maintained for the resolutions of 7 and 8 bit, and it is set to 10-bit
logic. mode to support ADC resolutions of 9 and 10 bit.
As opposed to the power consumption of the The unit capacitance (C0) is 0.6fF and
digital part, the power consumption of the com- implemented similar to Fig. 13.13b. Note that
parator and DAC should theoretically scale expo- the proposed technique manages to scale the
nentially with N, because these blocks are DAC power with 2N, which achieves constant-
limited by physical constraints such as noise FOMW scaling. However, the scaling is still not
and matching. To maintain the best possible optimal, as the DAC’s noise limit would theoret-
ADC efficiency throughout a reconfigurable ically allow scaling with 4N.
range of resolutions, these analog blocks need For the comparator, the design is based on the
to be reconfigured in such a way that they dynamic two-stage topology proposed in van
achieve exponential power scaling with N. Elzakker et al. (2010). In that design, a first
Figure 13.25 shows the implementation of the stage acts as a dynamic pre-amplifier, while the
reconfigurable DAC. First, it is assumed that the second stage implements a latch. As the first
378 P. Harpe

stage contributes most of the comparator noise, it hardly helps, because by then the power is
also dominates the total power consumption. dominated by other components. Further
Hence, it is sufficient to reconfigure the dynamic up-scaling of C also has a limited impact,
pre-amplifier only. The pre-amplifier (Fig. 13.26) because by then the noise is dominated by other
is enabled when the CLK signal is turned on: a contributors.
current will start to flow through the tail transis- So far, the reconfigurability of logic, DAC and
tor. Dependent on the differential input signal, comparator has been explained. A last feature
the differential pair will generate a differential integrated in this ADC is redundancy in order
output current that will be integrated on the load to save power. Rather than using a binary-scaled
capacitors, thus creating a dynamically amplified DAC which requires N cycles to find the N-bit
output voltage. As analyzed in van Elzakker et al. output code, this ADC uses N + 1 cycles and a
(2010), the effective input-referred noise voltage non-binary DAC. This redundancy allows to
of this stage is inversely proportional to √C, relax precision requirements in the early conver-
while the power consumption is given by sion cycles, as the redundancy can solve errors in
2∙C∙VDD2. As 1 bit additional resolution requires the later conversion cycles. As shown in Giannini
a 6 dB better SNR, this can be achieved by et al. (2008), this can be exploited to save power:
including a 4x larger C. In this way, the capacitor Giannini et al. (2008) uses a noisy low-power
size and power consumption scale with 4N, fol- comparator in the first SAR cycles to save
lowing the ideal noise-power trade-off. The power, and a precise higher-power comparator
implemented pre-amplifier has therefore a pro- in the last few cycles to obtain the required pre-
grammable load capacitance. This is done by cision. While (Giannini et al. 2008) required two
adding a small array of capacitors and switches separate comparators, this work can achieve the
such that the value can be programmed digitally. same result by simply reconfiguring the load
Four different noise/power settings are capacitance of the comparator during the conver-
supported, with which the power can be scaled sion process. More details can be found in Harpe
almost by 4x while the noise voltage scales about et al. (2012).
2x. A wider scaling range would be preferable, The ADC was implemented in a 90 nm
but is hard to achieve: further down-scaling of C CMOS technology (Fig. 13.27) and operates

VDD
C CLK C
Vout- Vout+
Vin+ Vin-

CLK Reconfigurable
load capacitors

Fig. 13.26 Implementation of the resolution


reconfigurable comparator Fig. 13.27 Die photo of the ADC in 90 nm CMOS

Fig. 13.28 Measured 10


10bit
9 9bit
ENOB [bit]

ENOB of the ADC for


different resolutions 8 8bit
7 7bit
6 ENOB: 6.94 ~ 9.30bit
5
0.3 Input frequency [MHz] 1.0
13 Ultra-Low Power Analog-Digital Converters for IoT 379

Fig. 13.29 Measured 10μ


7bit 8bit 9bit 10bit

Power [W]
power consumption for 1μ
different resolutions and 2nW leakage
100n
sampling rates
10n
1.61 ~ 3.56 μW @ 2MS/s
1n
100 1k 10k 100k 1M
Sample rate [S/s]

Table 13.4 ADC performance summary


13.6 On-Chip Voltage References
Process [nm] 90
Supply [V] 0.7
The previous sections described low-power
Sample rate [MS/s] 2
ADCs. To use such ADCs, analog signal condi-
Resolution [bit] 7 8 9 10
ENOB [bit] 6.94 7.81 8.70 9.30
tioning (discussed in Chap. 12) and references
Power [μW] 1.61 1.77 2.72 3.56 are also required. As seen from the basic topol-
FOMW [fJ/conv.step] 6.6 3.9 3.3 2.8 ogy (Sect. 13.2), the SAR ADC usually only
requires a voltage reference for the DAC that
sets the full-scale range of the ADC (equation
from a 0.7 V supply. Figure 13.28 shows the (13.4)). Moreover, as can be observed from the
measured ENOB as function of the input fre- design examples (Sects. 13.3–13.5), this DAC
quency, while operating at 2 MS/s. The ENOB reference voltage is often equal to the VDD of
for the different resolutions (7, 8, 9, and 10 bit) the other circuit blocks. Therefore, this section
varies from 6.94 to 9.30 bit. focusses on Reference Voltage Generators
Figure 13.29 shows the measured power con- (RVG) and their application to ADCs for IoT.
sumption versus sampling rate for the different The reference voltage provided to the DAC can
resolutions. The power scales linear with the experience several different types of imper-
sample rate. The minimum power is 2 nW, fections, here classified as random variations and
caused by leakage. The power also scales with systematic variations. Random variations happen
resolution, but this is not very well visible due to for instance due to thermal noise, or due to inci-
the logarithmic axis in the graph. At 2 MS/s, the dental glitches/disturbances. Since these effects
power scales from 1.61 up to 3.56 μW when the are directly seen by the DAC, they will immedi-
resolution is changed from 7 to 10 bit. ately modulate with the signal being digitized.
A summary of the performance is given in However, if the problem is truly random, it can
Table 13.4. The ADC achieves good power effi- be filtered out either by filtering the reference with
ciency for all resolutions and covers a useful a by-pass capacitor, or by filtering the digitized
resolution and speed range for IoT. For instance, samples in the digital domain. Systematic
for biopotential or environmental monitoring, the variations of the reference voltage could happen
ADC could operate at 10 bit resolution and due to gradual temperature, process or supply
1kS/s, where it consumes only 4nW. For a voltage variations. These errors might drift over
low-power radio, it could work at 8 bit resolution time but are usually strongly correlated from sam-
and 2MS/s where it consumes 1.61 μW. To ple to sample. As such, they cannot be filtered out.
expand the application range, it would be inter- However, it depends on the application if these
esting for future work to include higher errors are a problem or not. For instance, ratio-
resolutions for more precise sensing and higher metric measurements can be insensitive to the
speeds to support more advanced wireless com- precise reference voltage. Or, in case of a wireless
munication standards. link, the absolute amplitude of a received signal
380 P. Harpe

Table 13.5 Examples of low-power voltage references


Magnelli et al. (2011) Seok et al. (2012) Dong et al. (2016) Liu et al. (2016)
Process [nm] 180 130 180 65
Minimum supply [V] 0.45 0.5 1.2 0.62
Power [nW] 2.6 0.0022 0.114 2.5–25
Line sensitivity [%/V] 0.44 0.033 0.38 0.07
PSRR @100Hz [dB] 45 53 42 62
Temp. coefficient [ppm/ C] 165 231 124 108
Temp. range [ C] 0 to 125 20 to 80 40 to 85 25 to 110

does not matter in itself. Communication can still


≥0.8V
be reliable if the reference is unprecise, as long as VDDraw
it is stable during the time of a transmission.
However, for applications where the absolute VREF VDDADC
RVG LDO
voltage of the sensed signal is relevant, the refer- 0.4V
0.6V
ence voltage needs to be tightly controlled. Duty
In case a reference voltage needs to be cycling Vin Dout
generated, several requirements of IoT should ADC
be noted. Low power consumption is mandatory.
To save further power, low-voltage operation Fig. 13.30 Duty-cycled Voltage Reference Generator
and the ability for duty-cycling are preferred as (RVG), LDO and ADC
well. Lastly, to integrate the references together
with the rest of the IoT system, technology
portability to scaled CMOS nodes is also rele- are very different. Seok et al. (2012) uses the
vant. Table 13.5 shows several state-of-the-art threshold difference between a native and a
examples of low-power voltage references. thick-oxide device, Dong et al. (2016) uses the
They all manage to achieve pW to nW levels of difference caused by the body bias effect, and
power consumption, albeit with different perfor- Liu et al. (2016) uses the difference based on
mance characteristics. The line sensitivity and different ion implant levels in two thin-oxide
PSRR describe how sensitive the generated transistors. While these designs prove that
reference is with respect to variations or dis- low-power reference generation is feasible,
turbances of the input voltage. The temperature these circuits have a very high output impedance
coefficient describes the sensitivity to tempera- and are thus not able to directly drive the DAC,
ture variations in the indicated temperature as the DAC requires substantial power from the
range. Most designs, except (Dong et al. 2016), reference. For that reason, a buffer or low-drop-
can operate at sub-1 V supplies. Liu et al. (2016) out regulator (LDO) is still required to connect
is the only design in this list that is integrated in the RVG to the ADC.
an advanced CMOS node, and the only one that A complete system example, published in Liu
includes duty-cycling ability. On the other hand, et al. (2016), is shown in Fig. 13.30. Here, the
the power consumption of the other designs is so RVG provides a reference voltage to an LDO.
low that duty-cycling is not even necessary. In The LDO multiplies the reference and powers the
Magnelli et al. (2011), the generated reference is entire 10bit ADC. A raw supply of at least 0.8 V
determined by the threshold voltage of a single is needed to generate a reference voltage of 0.4 V
MOS transistor. In Dong et al. (2016), Liu et al. and an ADC supply of 0.6 V. To save reference
(2016), Seok et al. (2012) the generated reference power, the RVG can be duty-cycled. A S&H at
is dependent on the threshold voltage difference the RVG output is included so that the generated
between two MOS transistors. However, their reference is continuously available for the LDO
actual principles and circuit implementations even when the RVG is powered down.
13 Ultra-Low Power Analog-Digital Converters for IoT 381

Table 13.6 ADC performance summary and comparison


Harpe et al. (2015), Section 13.3 Liu et al. (2016), Section 13.6
Process [nm] 65 65
Supply [V] 0.6 0.8
Resolution [bit] 10 10
ENOB [bit] 9.2 9.1
Sample rate [kS/s] 100 80
Total power [nW] 88 106
FOMW [fJ/conv.step] 1.5 2.4
Including RVG and LDO NO YES

The overall system was measured while the could be saved, or advantage can be taken from
RVG was duty-cycled at 10 %. Table 13.6 shows improved speed and precision at existing power
the measured performance and compares it levels. However, due to technology scaling, leak-
against the very similar 10 bit ADC that was age starts to dominate the overall power con-
discussed in Sect. 13.3. The ENOB of both sumption for ADCs that are heavily duty-
designs is similar, while the FOMW increased cycled. For instance for quasi-static monitoring,
from 1.5 to 2.4fJ/conversion-step. The increase the leakage power will be higher than the active
of FOMW is because this design includes the power, requiring leakage mitigation techniques
power of the RVG and LDO, and it uses a higher to maintain efficiency.
supply voltage. Nonetheless, this example shows Present state-of-the-art ADCs for IoT can be
that an ADC including reference generation can so power-efficient, that the bottleneck in terms of
be power-efficient. It also confirms that the ADC power is usually not inside the ADC anymore,
is still dominant over the RVG and LDO in terms but in the components that surround the ADC.
of power consumption. For instance the reference voltage generation, the
analog input buffer, or the anti-aliasing filter
could consume similar or more power than the
13.7 Perspectives and Trends ADC itself. This is particularly true in
on-demand sensing applications as the ADC
In this chapter, ADCs for IoT nodes were can be duty-cycled easily but the other analog
discussed. The SAR ADC is a suitable solution blocks often experience static consumption that
in this context, allowing very low power con- does not scale down with the activity. Hence,
sumption with suitable speed and precision for research is needed in those circuits to take full
most applications. Several examples described advantage of low-power ADCs at the system
techniques to achieve state-of-the-art in terms level.
of efficiency and precision. As also shown, the Lastly, while the simple SAR architecture
SAR ADC can deal well with modern suits well to IoT requirements, other topologies
technologies, and allows operation at low supply should not be discarded. Hybrid architectures are
levels. The dynamic power consumption of SAR becoming more and more popular in recent
ADCs and the Nyquist operation enable auto- years. For instance, Shu et al. (2016) combines
matic power scaling with the sample rate and a SAR structure with noise-shaping and
on-demand operation. Techniques to implement mismatch-shaping techniques, allowing to reach
versatility in the ADC’s resolution to expand the a far greater SNDR (101 dB) than typical SAR
application range of a single design were also ADCs with state-of-the-art power-efficiency
introduced. (FOMS ¼ 180 dB). The power-efficiency of
In the future, the ongoing improvement of Sigma-Delta Modulators is also improving,
ADC power-efficiency (illustrated in Fig. 13.3) such as Billa et al. (2016) which achieves an
will enable further benefits for IoT. Either power SNDR of 98.5 dB for 24 kHz signal bandwidth
382 P. Harpe

with a FOMS of 177.8 dB. Hybrid and Sigma- P. Harpe, H. Gao, R. van Dommele, E. Cantatore, A. van
Delta converters especially stand out for high- Roermund, A 3 nW signal-acquisition IC integrating
an amplifier with 2.1 NEF and a 1.5fJ/conv-step ADC.
precision applications that cannot be covered ISSCC Dig. Tech. Papers (Feb. 2015), pp. 382–383
easily with pure SAR ADCs. P. Harpe, H. Gao, R. van Dommele, E. Cantatore,
A.H.M. van Roermund, A 0.20 mm2 3 nW signal
acquisition IC for miniature sensor nodes in 65 nm
CMOS. IEEE J. Solid-State Circuits 51(1), 240–248
References (2016)
M. Konijnenburg, S. Stanzione, L. Yan et al., A battery-
powered efficient multi-sensor acquisition system
A.M. Abo, P.R. Gray, A 1.5-V, 10-bit, 14.3-MS/s CMOS
with simultaneous ECG, BIO-Z, GSR, and PPG.
pipeline analog-to-digital converter. IEEE J. Solid-
ISSCC Dig. Tech. Papers (Feb. 2016), pp. 480–481
State Circuits 34(5), 599–606 (1999)
C.-Y. Liou, C.-C. Hsieh, A 2.4-to-5.2fJ/conversion-step
A. Agnes, E. Bonizzoni, P. Malcovati, F. Maloberti, A
10b 0.5-to-4MS/s SAR ADC with Charge-Average
9.4-ENOB 1V 3.8 μW 100kS/s SAR ADC with time-
Switching DAC in 90 nm CMOS, in ISSCC Dig.
domain comparator. ISSCC Dig. Tech. Papers (Feb.
Tech. Papers (Feb. 2013), pp. 280–281
2008), pp. 246–247
C.-C. Liu, S.-J. Chang, G.-Y. Huang, Y.-Z. Lin, A 10-bit
S. Billa, A. Sukumaran, S. Pavan, A 280 μW 24 kHz-BW
50-MS/s SAR ADC with a monotonic capacitor
98.5 dB-SNDR chopped single-bit CT ΔΣM achiev-
switching procedure. IEEE J. Solid-State Circuits 45
ing <10Hz 1/f noise corner without chopping
(4), 731–740 (2010)
artifacts. ISSCC Dig. Tech. Papers (Feb. 2016),
Y.-H. Liu, C. Bachmann, X. Wang et al., A 3.7 mW-RX
pp. 276–277
4.4 mW-TX fully integrated bluetooth-low-energy/
S.-W.M. Chen, R.W. Brodersen, A 6-bit 600-MS/s 5.3-
IEEE802.15.4/proprietary SoC with an ADPLL-
mW asynchronous ADC in 0.13-μm CMOS. IEEE
based fast frequency offset compensation in 40 nm
J. Solid-State Circuits 41(12), 2669–2680 (2006)
CMOS. ISSCC Dig. Tech. Papers (Feb. 2015),
T.B. Cho, P.R. Gray, A 10 b, 20 M sample/s, 35 mW
pp. 236–237
pipeline A/D converter. IEEE J. Solid-State Circuits
M. Liu, K. Pelzers, R. van Dommele, A. van Roermund,
30(3), 166–172 (1995)
P. Harpe, A 106nW 10 b 80 kS/s SAR ADC with duty-
Q. Dong, K. Yang, D. Blaauw, D. Sylvester, A 114‐pW
cycled reference generation in 65 nm CMOS. IEEE
PMOS‐only, trim‐free voltage reference with 0.26%
J. Solid State Circuits 51(10), 2435–2445 (2016a)
within‐wafer inaccuracy for nW systems, in IEEE
M. Liu, A. van Roermund, P. Harpe, A 7.1fJ/conv.-step
Symposium on VLSI Circuits (June 2016)
88dB-SFDR 12b SAR ADC with energy-efficient
V. Giannini, P. Nuzzo, V. Chironi et al., An 820μW 9b
swap-to-reset, in European Solid-State Circuits Con-
40MS/s noise–tolerant dynamic-SAR ADC in 90 nm
ference (Sep. 2016b), pp. 409–412
digital CMOS. ISSCC Dig. Tech. Papers (Feb. 2008),
L. Magnelli, F. Crupi, P. Corsonello, C. Pace,
pp. 238–239
G. Iannaccone, A 2.6 nW, 0.45 V temperature-
B.P. Ginsburg, A.P. Chandrakasan, A 500 MS/s 5b ADC
compensated subthreshold CMOS voltage reference.
in 65 nm CMOS. in IEEE Symposium on VLSI Circuits
IEEE J. Solid-State Circuits 46(2), 465–474 (2011)
(June 2006), pp. 140–141
K.A.A. Makinwa, A. Baschirotto, P. Harpe (eds.), Effi-
V. Hariprasath, J. Guerber, S.-H. Lee, U.-K. Moon,
cient Sensor Interfaces, Advanced Amplifiers and Low
Merged capacitor switching based SAR ADC with
Power RF Systems—Advances in Analog Circuit
highest switching energy-efficiency. IET Electron.
Design 2015 (Springer, Berlin, 2016), ISBN 978-3-
Lett. 46(9), 620–621 (2010)
319-21184-8
P. Harpe, C. Zhou, Y. Bi et al., A 26 μW 8 bit 10 MS/s
B. Murmann, ADC Performance Survey 1997–2016.
asynchronous SAR ADC for low energy radios. IEEE
http://web.stanford.edu/~murmann/adcsurvey.html
J. Solid-State Circuits 46(7), 1585–1595 (2011)
M. Pelgrom, Analog-to-Digital Conversion (Springer,
P. Harpe, G. Dolmans, K. Philips, H. de Groot, A 0.7 V
Berlin, 2017) ISBN 978-3-319-44970-8
7-to-10 bit 0-to-2 MS/s flexible SAR ADC for ultra
J. Prummel, M. Papamichail, M. Ancis, A 10mW
low-power wireless sensor nodes. in European Solid-
bluetooth low-energy transceiver with on-chip
State Circuits Conference (Sep. 2012), pp. 373–376
matching, in ISSCC Dig. Tech. Papers (Feb. 2015),
P. Harpe, E. Cantatore, A. van Roermund, A 2.2/2.7fJ/
pp. 238–239
conversion-step 10/12b 40kS/s SAR ADC with data-
T. Sano, M. Mizokami, H. Matsui, A 6.3mW BLE trans-
driven noise reduction. ISSCC Dig. Tech. Papers (Feb.
ceiver embedded RX image-rejection filter and TX
2013), pp. 270–271
harmonic-suppression filter reusing on-chip matching
P. Harpe, E. Cantatore, A. van Roermund, An
network, in ISSCC Dig. Tech. Papers (Feb. 2015),
oversampled 12/14b SAR ADC with noise reduction
pp. 240–241
and linearity enhancements achieving up to 79.1dB
M. Seok, G. Kim, D. Blaauw, D. Sylvester, A portable
SNDR. ISSCC Dig. Tech. Papers (Feb. 2014),
2-transistor picowatt temperature-compensated
pp. 194–195
13 Ultra-Low Power Analog-Digital Converters for IoT 383

voltage reference operating at 0.5 V. IEEE J. Solid- M. Vidojkovic, X. Huang, P. Harpe et al., A 2.4 GHz ULP
State Circuits 47(10), 2534–2545 (2012) OOK single-chip transceiver for healthcare
Y.-S. Shu, L.-T. Kuo, T.-Y. Lo, An oversampling SAR applications, in ISSCC Dig. Tech. Papers (Feb.
ADC with DAC mismatch error shaping achieving 2011), pp. 458–459
105dB SFDR and 101dB SNDR over 1 kHz BW in M. Yip, A.P. Chandrakasan, A resolution-
55 nm CMOS, in ISSCC Dig. Tech. Papers (Feb. reconfigurable 5-to-10b 0.4-to-1V power scalable
2016), pp. 458–459 SAR ADC, in ISSCC Dig. Tech. Papers (Feb.
H.-Y. Tai, Y.-S. Hu, H.-W. Chen, H.-S. Chen, A 0.85fJ/ 2011), pp. 190–191
conversion-step 10b 200kS/s subranging SAR ADC in D. Zhang, A. Alvandpour, A 3-nW 9.1-ENOB SAR ADC
40 nm CMOS, in ISSCC Dig. Tech. Papers (Feb. at 0.7 V and 1 kS/s, in European Solid-State Circuits
2014), pp. 196–197 Conference (Sep. 2012), pp. 369–372
M. van Elzakker, E. van Tuijl, P. Geraedts et al., A 10-bit Y. Zhu, C.-H. Chan, U.-F. Chio et al., A 10-bit 100-MS/
charge-redistribution ADC consuming 1.9 μW at s reference-free SAR ADC in 90 nm CMOS. IEEE
1 MS/s. IEEE J. Solid-State Circuits 45(5), J. Solid-State Circuits 45(6), 1111–1121 (2010)
1007–1015 (2010)
Circuit Techniques for IoT-Enabling
Short-Range ULP Radios 14
Pui-In Mak, Zhicheng Lin, and Rui Paulo Martins

This chapter addresses the design of cost-aware (e.g., sub-GHz and 2.4-GHz ISM bands), while
ultra-low-power (ULP) radios for both 2.4-GHz occupying a small die area and entailing a mini-
and sub-GHz ISM bands. Starting from the sys- mum number of external components. These
tem aspects that provide the essential insights, next-generation ULP radios will be decisive for
effective circuit techniques are presented to a wide variety of products that have strong com-
improve the radio performances and power effi- petition among cost, performance and time-to-
ciency, while minimizing the die area and num- market. Nevertheless, the tradeoff analysis
ber of external components. between cost, size and power for an ULP wire-
less link can involve many parameters that must
be co-designed, implying that deeper understand-
14.1 ULP Wireless Nodes in the IoT ing of the system aspects and effective circuit
Landscape techniques are both essential to reach an opti-
mum solution.
Smart cities, environmental monitoring, energy
management and healthcare systems, just to
name a few, are all inside the gigantic landscape 14.2 System Aspects of Short-Range
of Internet of Things (IoT) (Stankovic 2014) or ULP Radios
Internet of Everything (IoE). The estimated IoT
market by 2020 will be close to hundreds of Focusing on short-range connectivity with a RF
billion dollars (annually ~16 billions). To accel- link budget of ~80 to 90 dB, the physical (PHY)
erate the proliferation of IoT products in different layer specifications of Zigbee and BLE are not
application sectors, it is opportune to develop particularly tough for modern RF skills. Yet,
ultra-low-cost software-defined ULP radios traditional textbook RF and analog techniques
that are flexible to support different data rates can unlikely help to bring down the radio’s
(e.g., from kb/s to a few Mb/s), different power by orders of magnitude, while allowing it
standards [e.g., ZigBee and Bluetooth Low to be universal enough to serve multiple bands
Energy (BLE)] and a wide range of frequency without resorting from costly external
components. The following sub-sections briefly
P.-I. Mak (*) • Z. Lin • R.P. Martins discuss the PHY layer of Zigbee and BLE
University of Macau, Macau, China standards. The pros and cons of opting different
Instituto Superior Técnico, Universidade de Lisboa, frequency bands and supply voltages (VDD) are
Lisbon, Portugal
e-mail: pimak@umac.mo

# Springer International Publishing AG 2017 385


M. Alioto (ed.), Enabling the Internet of Things, DOI 10.1007/978-3-319-51482-6_14
386 P.-I. Mak et al.

Table 14.1 Key PHY specifications of ZigBee and BLE standards


ZigBee BLE
Frequency (GHz) 0.3, 0.4, 0.7, 0.9, 2.4 2.4
Bandwidth (MHz) 2 1
Channel Spacing (MHz) 5 2
Modulation BPSK, OQPSK GFSK
Range (m) 10 to 200 10 to 100
Data Rate (Mbps) 0.25 1
Network Topology Star/Mesh Star/P2P

also mentioned; all are correlated to the overall specifications of ZigBee and BLE standards are
cost, size and power efficiency of the radios. summarized in Table 14.1.
BLE is a prospective short-range wireless
standard ratified in 2009. It supports 40 channels
14.2.1 ZigBee and Bluetooth Low in the 2.4-GHz band, each of which is 2-MHz
Energy (BLE) Standards wide. It is based on Gaussian frequency-shift
keying (GFSK) modulation with an index of
Both ZigBee and BLE standards are suitable for 0.5. The state-of-the-art 2.4-GHz receiver (Liu
short-range ULP communication as they draw et al. 2014) achieves an energy efficiency of
low peak and average power. Their key features ~1.2 nJ/b at 2 Mb/s. Unsurprisingly, >40% of
are briefed next. the receiver power is dissipated by the forefront
ZigBee was developed as a wireless personal- low-noise amplifier (LNA) and mixer to maxi-
area network (WPAN) standard with the IEEE mize the sensitivity (92 dBm). Such a high
802.15.4 to define the PHY and Media Access sensitivity seems overkill, but it is indeed effec-
Control (MAC) layers. It can operate at a very tive to reduce the power consumption of the
low duty cycle (<1%) and is allowed in three transmitter which normally has a lower energy
different frequency bands. The first band efficiency to fulfill an RF link budget of ~90 dB.
(868 MHz) is for Europe only offering only a Thus, it is highly desired to develop circuit
single channel. It supports a low bit rate of techniques for better LNA, mixer and voltage-
20 kbps using binary phase-shift keying (BPSK) controlled oscillator (VCO) for a better overall
modulation. The second band (915 MHz) permits energy efficiency. In fact, for the 2.4-GHz band,
10 channels and is widely adopted in North Zigbee and BLE share a similar PHY, and mod-
America, Australia, New Zealand, and some ern solutions can easily support both. For the
countries in South America. Each channel sub-GHz bands, multi-band operation poses
supports 40 kbps using BPSK modulation. The additional challenges. To achieve this without
third band is 2.4 GHz available worldwide, and leveraging the cost, a fully-integrated RF-tunable
has a total of 16 channels with 250 kbps each. ULP radio will be of great relevance.
Unlike the sub-GHz bands, this third band
exploits offset quadrature phase-shift keying
(OQPSK) with half sine-wave shaping for its 14.2.2 Cost, Size and Power
modulation. Beyond these three bands, the
IEEE 802.15.4c/d study groups also considered Ultra-scaled CMOS technologies are still the best
to open 314 to 316 MHz, 430 to 434 MHz, and platform for full integration of ULP radios that
779 to 787 MHz bands for use in China, and have RF (transceiver), analog (sensor and power
950 to 956 MHz for use in Japan. Obviously, an management) and digital (microcontroller and
international market will be opened if the ULP memory). Established technology nodes (e.g.,
radio can be reconfigured to support multiple 90 or 65 nm) are regaining lots of interest for
bands from sub-GHz to 2.4 GHz. The key PHY low-cost fast-to-market IoT products, as they can
14 Circuit Techniques for IoT-Enabling Short-Range ULP Radios 387

leverage more reasonably between the radio will entail an increment of power budget
manufacturing cost, development time and (by 8.5 dB), in order to match the range of a
power consumption. Apparently, the system 900-MHz radio. Besides, biological tissues
cost and size can be optimized by reducing the absorb RF energy as a function of frequency.
chip area, number of external components and Lower frequencies can penetrate the body easily
battery volume that depends on the targeted life- without being absorbed, meaning a better RF link
time of the system. Although using on-chip for sub-GHz when compared to 2.4 GHz for
passives (inductors and transformers) can help body-area networks.
to reduce the VDD and system power, we will Interference—The 2.4-GHz band has a high
describe later that recent cost-aware ULP RF and chance to come across interferences due to the
analog circuits can balance better the power, chip co-existence of other wireless standards,
area and cost. For instance, a fully-integrated degrading the link reliability. For example, the
input matching network not only can reduce the IEEE 802.11 (WiFi) can transmit an output
cost and system form factor of an ULP receiver, power 10x to 100x higher than the ZigBee.
but also enhance its power efficiency. Also, the Signals from Bluetooth-enabled computer, cell
matching network can offer passive pre-gain to phone peripherals and microwave ovens can
enhance the sensitivity of the receiver. Other also be considered as “jammers”, which have a
low-power techniques such as current-reuse and much lower output power. Sub-GHz ISM bands
function-reuse are will be introduced later in this are mostly used for proprietary low-duty-cycle
chapter. links and are not as likely to interfere with each
other. A quieter spectrum means easier
transmissions and fewer retries, which is more
efficient to save the battery power. In fact, due to
14.2.3 Frequency Bands: 2.4 GHz
the limited power budget, it is hard for an ULP
vs. Sub-GHz
radio to tolerate large out-of-channel blockers.
Antenna size—One disadvantage of sub-GHz
Most existing ULP radios were designed for the
operation is the larger antenna size since most
2.4-GHz band as it is available worldwide and
antenna types are designed to be resonant at their
has a smaller antenna size suitable for integra-
intended operation frequency. Since the antenna
tion. Yet, the sub-GHz ISM bands offer other
size is inversely proportional to the frequency, a
advantages such as longer propagation distance
small wireless node would prefer the 2.4-GHz
and less interference that are worth to be consid-
band. Communication distance, low potential
ered when the power budget is the priority.
interference and low power consumption are the
Communication range—In highly congested
obvious advantages of the sub-GHz bands.
environments, the 2.4-GHz signal can weaken
rapidly, which adversely affects the signal qual-
ity. To quantify the influence of frequency on
14.2.4 Supply Voltage (VDD)
path loss with respect to the wavelength λ, we
can use the simplified Friis transmission
To minimize the system size, short-range ULP
equation,
radios should run preferably from a tiny battery,
 
4πd thus sub-2 V supply voltages are highly desired.
L ¼ 20log10 ð14:1Þ
λ Radios that work down to 1.2 V allow extra
flexibility in sensors’ design and reduce the
Hence, it can be calculated that the path loss at power management constraints (Rajan 2012).
2.4 GHz is 8.5 dB higher than that at 900 MHz. Besides, low peak current and sub-1 V VDD
This translates into a 2.67 longer range for a also benefit wireless sensors that run from
900-MHz radio. Since the range almost doubles harvested energy sources which will enhance
with every 6 dB increment of power, a 2.4-GHz flexibility, lower the maintenance cost, and
388 P.-I. Mak et al.

open up more applications. For example, on-chip 14.3 Current-Reuse ULP Receiver
solar cells only can provide an output voltage Techniques for the 2.4-GHz
between 200 and 900 mV, while thermoelectric ISM Band
generators exhibit an even lower VDD
(50–300 mV) (Bandyopadhyay et al. 2011). Nanoscale CMOS offers sufficiently high ft and
Although boost converters can be employed to low Vt favoring the design of ULP receivers via
boost up the output voltage, their efficiency is stacking the RF-to-baseband (BB) functions in
still quite limited (~75%). Besides, a low peak one cell, while sharing the smallest possible bias
current consumption will ease the design of the current. Also, the signals can be conveyed in the
power management. Furthermore, radio current domain to enhance the area efficiency
operating at higher VDD is only required when (i.e., no AC-coupling capacitor), RF bandwidth
a higher output power is entailed. This is not the and linearity at those inner nodes. The proposed
case for short-range communications, as the out- Zigbee receiver (Lin et al. 2013, 2014a) is
put power rarely exceeds 0 dBm. Thus, a low inspired by the above hypothesis, and its block
VDD is in general the simplest way to reduce the diagram is depicted in Fig. 14.1.
power consumption at the system level. The single-ended RF input (VRF) is taken by a
In a low VDD design, however, due to the low-Q input-matching network before reaching
limited dynamic range, for the given parameters the Balun-LNA-I/Q-Mixer (Blixer). Merging the
such as third-order intercept point (IIP3), noise- Blixer with the hybrid filter not only saves
figure (NF), gain etc., the current should be larger power, but also reduces the voltage swing at
than that with a high VDD. For example, for the internal nodes benefitting the linearity. The wide-
given NF requirement, the current-reuse P-type band input-matching network is also responsible
metal-oxide-semiconductor (PMOS) and N-type for the passive pre-gain to reduce the NF. Unlike
metal-oxide-semiconductor (NMOS) self-biased the LMV cell that only can utilize single-
amplifier with a VDD of 1 V consumes half of balanced mixers (Tedeschi et al. 2010), here the
the current of a single NMOS (or PMOS) without balun-LNA featuring a differential output
current-reuse and with a VDD of 0.5 V. This (iLNA) allows the use of double-balanced
constraint is even tighter if a small chip area mixers (DBMs). Driven by a 4-phase 25% LO,
and/or no/limited external components are the I/Q-DBMs with a large output resistance
imposed for cost reduction. As an example, robustly correct the differential imbalances of
inductors/transformers can help to boost the iLNA. The balanced BB currents (iMIX,I and
operating frequency and bias the circuit with iMIX,Q) are then filtered directly in the current
lower voltage headroom and noise. If inductors/ domain by a current-mode Biquad stacked atop
transformers cannot be used due to the limited the DBM. The Biquad features in-band noise-
area budget, only resistors or transistors can be shaping centered at the desired intermediate fre-
adopted instead. This imposes a hard trade-off quency (IF, 2 MHz). Only the filtered output
with IIP3, NF and bandwidth. Thus, to balance currents (irLPF,I and  irLPF,Q) are returned as
the VDD, current, area and external components voltages (Vo,I and Vo,Q) through the
with the key performance metrics (NF and out- complex-pole load, which performs both image
of-band (OB) IIP3), effective system-to-circuit- rejection and channel selection. Out of the
level co-design, RF and analog circuit techniques current-reuse path there is a high-swing vari-
become highly important and correlated. The able-gain amplifier (VGA). It essentially deals
next two sections present the key circuit with the gain loss of its succeeding 3-stage pas-
techniques applied into two state-of-the-art sive RC-CR polyphase filter (PPF), which is
cost-aware ULP receivers: one for the 2.4-GHz responsible for large and robust image rejection
band and one for the sub-GHz bands. over mismatches and process variations. The
14 Circuit Techniques for IoT-Enabling Short-Range ULP Radios 389

BB Circuitry
VDD12 3-Stage Inverter 50Ω
VGA RC-CRPPF Amplifier Buffer
Vo,I
Complex Pole Complex Pole
VBuf,I
-IF +IF -IF +IF -IF +IF VBuf,Q
Hybrid Vo,Q
irLPF,I irLPF,Q
Filter Biquad Biquad
Signal Response
NoiseResponse LO
-IF +IF -IF +IF Generator
Current
iMIX,I iMIX,Q LO 4 2 1
Reuse BUF ÷2 LOext
2

MUX
DIV1 Buffer
iLNA iLNA 2
Blixer 4
÷2 4
÷2 2
Vf-tune
DIV2 DIV1 10GHz
VCO
LOIp
LOQp For test under an
LOIn Integrated VCO
WidebandI nput LOQn
Matching Network
25%DutyCycle OnChip
VRF

Fig. 14.1 Proposed RF-to-BB-current-reuse ULP 2.4-GHz receiver

final stage is an inverter amplifier before 50-Ω for M1. Rp is the parallel shunt resistance of LM.
test buffering. The 4-phase 25% LO can be Cp stands for the parasitic capacitance from the
generated by an external 4.8-GHz reference pad and ESD diodes. Rin and Cin are the equiva-
(LOext) after a divide-by-2 (DIV1) that features lent resistance and capacitance at node Vin,
50%-input 25%-output, or from an integrated respectively. R’in is the downconversion resis-
10-GHz VCO after DIV1 and DIV2 (25%-input tance of Rin.
25%-output) for additional testability. LBW is the bondwire inductance and Rs is the
source resistance. To simplify the analysis, we
first omit LBW and Cin, so that LM, Cp, CM, RS
14.3.1 Circuit Implementation and RT (¼ Rp//Rin) together form a tapped capac-
itor facilitating the input matching. Generally,
Wideband Input-Matching Network—As shown S11  –10 dB is required and the desired value
in Fig. 14.2a, a low-Q inductor (LM) and 2 tapped of R’in is from 26 to 97 Ω over the frequency
capacitors (Cp and CM) can be employed for band of interest. Thus, given the RT and CM
impedance down-conversion resonant and pas- values, the tolerable Cp can be derived from
 2
sive pre-gain. A high-Q inductor is unnecessary R,in ¼ RT CMCþCM
. The pre-gain value (Apre,
p
since the Q of the LC matching is dominated by
V2 V2
the low input resistance of the LNA. Thus, a amp)from VRF to Vin is derived from 2RinT ¼ 2RRFS ,
qffiffiffiffi
low-Q inductor results in area savings, while which can be simplified as Apre, amp ¼ RRTS . The –
averting the need of an external inductor for
cost savings. LM also serves as the bias inductor 3-dB bandwidth of Apre,amp is related to the
390 P.-I. Mak et al.

VDD12

RL Simplified Load RL 0
Vo,Ip Vo,In

S 11 (dB)
-10 Cp = 1 pF
Cin = 400 fF
CM = 1.5 pF
-20 Lbw = 0.5 nH LM = 4.16 nH
DBM LO M5 M6 LOIn M7 M8 LOIp Lbw = 1.5 nH Rp = 600Ω
Ip
-30 Lbw = 2.5 nH Rin = 200Ω

IBIAS iLNAp iLNAn IBIAS 1 1.5 2 2.5 3 3.5 4


Input RFF requency (GHz)
Q Channel Vb,LNA
C3 C2 (b)

Balun-LNA
M1
5.6
AGB Vb,LNA 0.3m
Rin’ Ceq

NFtotal (dB)
W
Rin 5.4 Power of AGB
LBW CM C1
VRF Vin 5.2
M2 0.4m
W
VDD06 5 0.9m
RS Cp W
LM Rp Cin 0.6m
W
VS M3 4.8
1 1.5 2 2.5 3
BB Frequency (MHz)
Wideband Input M4
Matching Network (c)
(a)

Fig. 14.2 (a) Proposed wideband input matching network, balun-LNA and I/Q-DBMs (Q channel is omitted and the
load is simplified as RL). (b) Variation of S11-bandwidth with bondwire inductance LBW. (c) Power of AGB versus NF

network’s quality factor (Qn) as given by: asymmetric CG-CS transconductances and
ω0
Qn ¼ 2ωR0TLM ¼ ω3dB , with ω0 ¼ pffiffiffiffiffiffiffiffiffiffi
1 ffi and loads make the output balancing not wideband
LM CEQ
C M Cp
consistent. In Blaakmeer et al. (2008b), output
CEQ ¼ CM þCp . balancing is achieved by scaling M5–8 with
In our design (RT ¼ 150 Ω, CM ¼ 1.5 pF, cross-connection at BB, but that is incompatible
LM ¼ 4.16 nH, Rp ¼ 600 Ω, Cp ¼ 1 pF and Rin with this work that includes a hybrid filter. In Mak
¼ 200 Ω), Apre,amp has a passband gain of and Martins (2011), by introducing an
~4.7 dB over a 2.4-GHz bandwidth (at RF ¼ 2.4 AC-coupled CS branch and a differential current
GHz) under a low Qn of 1. Thus, the tolerable Cp balancer (DCB), the same load is allowed for both
is sufficiently wide (0.37 to 2.1 pF). The low-Q CS and CG branches for wideband output balanc-
LM is extremely compact (0.048 mm2) in the ing. Thus, the NF of such a balun-LNA can be
layout and induces a small parasitic capacitance optimized independently. This technique is trans-
(~260 fF, part of Cin). Figure 14.2b demonstrates ferred to this ULP design, but only with the I/Q-
the robustness of S11-bandwidth against LBW DBMs inherently serves as the DCB, avoiding a
from 0.5 to 2.5 nH. The variation of Cin to S11- high VDD (Mak and Martins 2011). The detailed
bandwidth was also studied. From simulations, schematic is depicted in Fig. 14.2a. To maximize
the tolerable Cin is 300 to 500 f. at the voltage headroom, M1 (with gm,CG) and M2
LBW ¼ 1.5 nH. (with gm,CS) were sized with non-minimum chan-
Balun-LNA—The common-gate (CG) nel length (L ¼ 0.18 μm) to lower their VT. The
common-source (CS) balun-LNA (Blaakmeer AC-coupled gain stage is a self-biased inverter
et al. 2008a) avoids the off-chip balun and amplifier (AGB) powered at 0.6-V (VDD06) to
achieves a low NF by noise canceling, but the enhance its transconductance (gm,AGB)-to-current
14 Circuit Techniques for IoT-Enabling Short-Range ULP Radios 391

ratio. It gain-boosts the CS branch while creating a cancellation of M1 should not lead to the lowest
loop gain around M1 to enhance its effective total NF (NFtotal). In fact, one can get a more
transconductance under less bias current (IBIAS). optimized Gm,CS (via gm,AGB) for stronger reduc-
This scheme also allows the same IBIAS for both tion of noise from Gm,CS and RL, instead of that
M1 and M2, requiring no scaling of load (i.e., only from M1. Although this noise-canceling principle
RL). Furthermore, a small IBIAS lowers the supply has been discussed in Bruccoleri et al. (2004) for
requirement, making a 1.2-V supply (VDD12) still its single-ended LNA, the output balancing was
adequate for the Blixer and hybrid filter, while not a concern there. Here, the optimization pro-
relaxing the required LO swing (LOIP and LOIn). cess is alleviated since the output balancing and
C1–3 for biasing are typical metal-oxide-metal NF are decoupled. The simulated NFtotal up to the
(MoM) capacitors to minimize the parasitics. Vo,Ip and Vo,In nodes against the power given to
The balun-LNA features partial-noise cancel- the AGB is given in Fig. 14.2c. NFtotal is reduced
ing. To simplify the study, we ignore the noise from 5.5 dB at 0.3 mW to 4.9 dB at 0.6 mW, but
induced by DBM (M5–M8) and the effect of is back to 5 dB at 0.9 mW. Due to the use of
channel-length modulation. The noise transfer passive pre-gain and a larger Rp that is ~3 times
function (TF) of M1’s noise (In,CG) to the BB of Rin, the noise contribution of the inductor is
differential output (Vo,Ip – Vo,In) can be derived <1% from simulations. The simulated NF at the
when LOIp is high, and the input impedance is outputs of the LNA and test buffer are 5.3 and
matched, 6.6 dB, respectively.
Double-Balanced Mixers Offering Output
1
TFIn, CG ¼  ðRL  Rin Gm, CS RL Þ ð14:2Þ Balancing—The output balancing is inherently
2 done by the I/Q-DBMs under a 4-phase 25%
where Gm,CS ¼ gm,CS + gm,AGB. The noise of M1 LO. For simplicity, this principle is described
can be fully canceled if RinGm,CS ¼ 1 is satisfied. for the I channel only under a 2-phase 50% LO,
However, Rin  200 Ω is desired for input as shown in Fig. 14.3, where the load is
matching at low power. Thus, Gm,CS should be simplified as RL. During the first-half LO cycle
5 mS, rendering the noises of Gm,CS and RL still when LOIp is high, iLNAp goes up and appears at
significant. Thus, device sizings for full noise Vo,Ip while iLNAn goes down and appears at Vo,In.

After One LO Cycle


Vo,Ip
VDD12 VDD12
Vo,In
RL Simplified Load RL RL Simplified Load RL
Vo,Ip Vo,In Vo,Ip Vo,In
iMIX,Ip iMIX,In iMIX,Ip iMIX,In

LOIp M5 M6 LOIn M7 M8 LOIp LOIp M5 M6 LOIn M7 M8 LOIp

iLNAp 1st-Half iLNAn iLNAp 2nd-Half iLNAn


LO Cycle LO Cycle
IBIAS IBIAS IBIAS IBIAS

(a) (b)
Fig. 14.3 Operation of the I-channel DBM. It inherently offers output balancing after averaging in one LO cycle as
shown in their (a) 1st-half LO cycle and (b) 2nd-half LO cycle
392 P.-I. Mak et al.

In the second-half LO cycle, both of the currents’ can correct perfectly the gain and phase errors
sign and current paths of iLNAp and iLNAn are from the balun-LNA, independent of its different
flipped. Thus, when they are summed at the out- output impedances from the CG and CS
put during the whole LO cycle, the output bal- branches. In fact, even if the conversion gain of
ancing is robust, thanks to the large output the 2 mixer pairs (M5, M8 and M6, M7) does not
resistance (9 kΩ) of M5-M8 enabled by the very match (e.g., due to non-50% LO duty cycle), the
small IBIAS (85 μA). To analytically prove the double-balanced operation can still generate bal-
principle, we let iLNAp ¼ αIA cos ðωs t þ φ1 Þ and anced outputs (confirmed by simulations). Of
iLNAn ¼ IA cos ðωs t þ φ2 Þ, where IA is the course, the output impedance of the DBM can
amplitude, ωs is the input signal frequency, α is be affected by that of the balun-LNA
the unbalanced gain factor and φ1 and φ2 are [Fig. 14.2a], but is highly desensitized due to
their arbitrary initial phases. When there is suffi- the small size of RL (i.e., the input impedance
cient filtering to remove the high-order terms, we of the hybrid filter) originally aimed for current-
can deduce the BB currents iMIX,Ip and iMIX,In as mode operation. Thus, the intrinsic imbalance
given by, between Vo,lp and Vo,ln is negligibly small (con-
firmed by simulations).
2
iMIX, Ip ¼ αIA cos ðωs t þ φ1 Þ For devices sizing, a longer channel length
π (L ¼ 0.18 μm) is preferred for M5–8 to reduce
2
 cos ω0 t þ IA cos ðωs t þ φ2 Þ  cos ω0 t their 1/f noise and Vt. Hard-switch mixing helps
π
αIA to desensitize the I/Q-DBMs to LO gain error,
¼ cos ðωs t  ω0 t þ φ1 Þ
π leaving the image rejection ratio (IRR) mainly
IA determined by the LO phase error that is a
þ cos ðω0 t  ωs t þ φ2 Þ
π tradeoff with the LO-path power. Here, the
ð14:3Þ targeted LO phase error is relaxed to ~4o, as
letting the BB circuitry (i.e., the complex-pole
IA
iMIX, In ¼  cos ðωs t  ω0 t þ φ2 Þ load and 3-stage RC-CR PPF) to handle the IRR
π
αIA ð14:4Þ is more power efficient.
 cos ðω0 t  ωs t þ φ1 Þ Hybrid Filter 1st Half—Current-Mode Biquad
π
¼ iMIX, Ip with IF Noise-Shaping—The current-mode Biquad
[Fig. 14.4a] proposed in Pirola et al. (2010) is an
and a consistent proof for I/Q-DBMs under a
excellent candidate for current-reuse with the
4-phase 25% LO is obtained. Ideally, the DBM

irLPF,I VDD12 in,out


Vb Vb Vb in,out gmf
(gm,act) =
Mf3 Mf4 Mi3 Mi4
Cf2/2 in,Mf1 gmf –(1+gmfZP)(gmf+sCf2)
(gmf) Mf3
Cf2
Ci/2 -1 1
ZP = //sLact//Rsf
sCf1
(gm,act) in,Mf1 Lact ≈ Ci
Mi1 Mi2
Mf1 Cf1/2 Mf2 Mf1 2
gm,act
(gmf)
Cf1 Rsf Lact iMIX,I
iMIX,I Active Inductor (2Lact)

(a) (b)
Fig. 14.4 (a) Proposed IF-noise-shaping Biquad and (b) its small-signal equivalent circuit showing the noise TF of Mf1
14 Circuit Techniques for IoT-Enabling Short-Range ULP Radios 393

For Rsf >> sLact

A: ˶ ื 0.1˶or ZP ≈ sLact
A B C D E
sLact Without Lact
B: 0.1˶or < ˶ < ˶or ZP ≈ 1 LactCf1
1+ s2LactCf1

Ind
Resonant

uct
in,out

iv
C: ˶ = ˶or ZP = Rsf With Lact

e
0.6
in,Mf1
sLact itiv
e
Noise
D: ˶or < ˶ < 10˶or ZP ≈ pac
0.2 Ca
1+ s2LactCf1 Response
1 1 2 3 4 5
E: ˶ ุ 10˶or ZP ≈
sCf1 Frequency (MHz)

(a) (b)
i
Fig. 14.5 (a) Equivalent impedance of ZP versus ωor, and (b) simulated noise TF of inn,,Mf1
out
with and without Lact

Blixer for channel selection. However, this


Biquad only can generate a noise-shaping zero 5.7
iBIQ,I iMIX,I Without Lact
spanning from DC to e 2π0:1QB ω0B MHz for
NFtotal (dB)

Rbiq Lbiq Cf1 Rsf Lact


Mf1–Mf2, where QB and ω0B are the Biquad’s 5.6
quality factor and –3-dB cutoff frequency,
5.5 With Lact
respectively. This noise shaping is hence ineffec-
tive for our low-IF design having a passband 5.4
from ω1 to ω2 (¼ ω0B), where ω1 > 0.1QBω0B. 1 1.5 2 2.5 3
To address this issue, an active inductor (Lact) is BB Frequency (MHz)
added at the sources of Mf1–Mf2. The LactCf1
resonator shifts the noise-shaping zero to the Fig. 14.6 Simulated NFTotal (at Vo,lp and Vo,ln) with and
desired IF. The cross-diode connection between without Lact
Mi1–Mi4 (all with gm,act) emulate Lact  Ci/gm,act
2
(Ler et al. 2008; Chen et al. 2012). The small- in-band noise. At even higher frequencies, the
signal equivalent circuit to calculate the noise output noise decreases due to Cf2, being the
TF of in,Mf1/in,out is shown in Fig. 14.4b. The same as its original form (Pirola et al. 2010).
approximated impedance of ZP in different The signal TF can be derived from Fig. 14.6.
frequencies related to ω0r is summarized in Here RL ¼ g1 , Lbiq ¼ gC2f2 . For an effective
mf mf
Fig. 14.5a, where ω0r ¼ ω1 þω 2
2
is the resonant improvement of NF, Lact  Lbiq should be
frequency of LactCf1 at IF. The simulated in,Mf1/ made. The simulated NFtotal at Vo,Ip and Vo,In
in,out is shown in Fig. 14.5b. At the low fre- with and without the Lact is shown Fig. 14.6,
quency range, ZP behaves inductively, showing about 0.1 dB improvement at the TT
degenerating further in,Mf1 when the frequency corner (reasonable contribution for a BB circuit).
is increased. At the resonant frequency, ZP ¼ For the SS and FF corners, the NF improvement
Rsf, where Rsf is the parallel impedance of the reduces to 0.04 and 0.05 dB, respectively. These
active inductor’s shunt resistance and DBM’s results are expected due to the fact that at the FF
output resistance. The latter is much higher corner, the noise contribution of the BB is less
when compared with RL thereby suppressing in, significant due to a larger bias current; while at
Mf1. At the high frequency range, ZP is more the SS corner, the IF noise-shaping circuit will
capacitive dominated by Cf1. It implies in,Mf1 can add more noise by itself, offsetting the NF
be leaked to the output via Cf1, penalizing the improvement. Here Mf1–Mf4 use isolated P-well
394 P.-I. Mak et al.

VDD12
MC ML ML MC MC ML ML MC
Vo,Qn Vo,Qp Vo,Ip Vo,In

CL/2 RL CL/2
Vo,Ip Vo,In Vo,Qp Vo,Qn
irLPF,I irLPF,Q
(a)

irLPF,I Vo,I Im
1
RL CL +IF
gm,McRL
1
Re
gm,Mc -gm,Mc
RLCL
irLPF,Q Vo,Q Vo,I RL
=
RL CL irLPF,I 1 + sRLCL – jgm,McRL

Complex-Pole Load
(b)

Fig. 14.7 (a) Proposed complex-pole load and (b) its small-signal equivalent circuit and pole plot

for bulk-source connection, avoiding the body


35
effect while lowering their VT.
Hybrid-Filter Gain Response

Hybrid Filter 2nd Half—Complex-Pole 5.2 dB


12 dB
Load—Unlike most active mixers or the original 25
Blixer (Blaakmeer et al. 2008a, b) that only use a
RC load, the proposed “load” synthesizes a 1st-
(dB)

29 dB
order complex pole at the positive IF (+IF) for 15
channel selection and image rejection. The cir-
cuit implementation and principle are shown in
Fig. 14.7a and b, respectively. The real part (RL) 5
is obtained from the diode-connected ML,
whereas the imaginary part (gm,Mc) is from the fLO-10 fLO-5 fLO fLO+5 fLO+10
I/Q-cross-connected MC. The entire hybrid filter Input RF Frequency (MHz)
offers 5.2-dB IRR, and 12-dB (29-dB) adjacent
(alternate) channel rejection as shown in Fig. 14.8 Simulated hybrid-filter gain response
Fig. 14.8 (the channel spacing is 5 MHz). Similar
to gm-C filters the center frequency is defined by VGA formed with ML [Fig. 14.7a] and a seg-
gm,McRL.. When sizing the –3-dB bandwidth, the mented MVGA (Fig. 14.9), offering gain controls
output conductances of MC and ML should be with a 6-dB step size. To enhance the gain preci-
taken into account. sion, the bias current through MVGA is kept con-
Current-Mirror VGA and RC-CR PPF—Out- stant, so as its output impedance. With the gain
side the current-reuse path, Vo,I and Vo,Q are switching of MVGA, the input-referred noise of
AC-coupled to a high swing current-mirror MVGA will vary. However, when the RF signal
14 Circuit Techniques for IoT-Enabling Short-Range ULP Radios 395

VDD12

Vo,Ip MVGA Vo,In Vo,Qp MVGA Vo,Qn

VBB1,Ip VBB1,In VBB1,Qp VBB1,Qn

(a)
3-Stage Inverter 50Ω
RC-CR PPF Amplifier Buffer

VBB1,Ip VBuf,Ip

VBB1,Qp VBuf,Qp

VBB1,In VBuf,In

VBB1,Qn VBuf,Qn

(b)
Fig. 14.9 Schematics of the BB (a) VGA, and (b) 3-stage RC-CR PPF, inverter amplifier and 50-Ω buffer

rffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi
σ 2 σ 2ffi
level is low the gain of the VGA should be high, σðImage OutÞ R C
rendering the gain switching not influencing the ¼ 0:25 þ
Desired Out R C
receiver’s sensitivity. The VGA is responsible
ð14:5Þ
for compensating the gain loss (30 dB) of the
3-stage passive RC-CR PPF that provides robust Accordingly, the matching of the resistors (σR)
image rejection of >50 dB (corner simulations). and capacitors (σC) can be relaxed to 0.9%
With the hybrid filter rejecting the out-band (2.93%) for 40-dB (30-dB) IRR with a 3σ yield.
blockers the linearity of the VGA is further Here, ~150-kΩ resistors are chosen to ease the
relaxed, so as its power budget (192 μW, limited layout with a single capacitor size (470 fF), bal-
by the noise and gain requirements). ancing the noise, area and IRR. The simulated
A 3-stage RC-CR PPF can robustly meet the worst IRR is 36 dB without LO mismatch, and
required IRR in the image band (i.e., the –IF), still over 27 dB at a 4o LO phase error checked by
and cover the ratio of maximum to minimum 100x Monte-Carlo simulations. Furthermore, if
signal frequencies (Kaykovuori et al. 2008; the 5-dB IRR offered by the complex-pole load is
Behbahani et al. 2001). In our design, the added the minimum IRR of the IF chain should
expected IRR is 30 to 40 dB and the ratio of be 32 dB. The final stage before 50-Ω output
frequency of the image band is fmax/fmin (¼3). buffering is a self-biased inverter amplifier
However, counting the RC variations as large as (power ¼ 144 μW), which embeds one more
25%, the conservative Δfeff ¼ fmax_eff/fmin_eff real pole for filtering. The simulated overall IF
should be close to 5. The selected RC values are gain response is shown in Fig. 14.10, where the
guided by Behbahani et al. (2001) notches at DC offered by the AC-coupling
396 P.-I. Mak et al.

network, and around the –IF offered by the generation. The loss of its LC tank is
3-stage RC-CR PPF, are visible. The IRR is compensated by complementary NMOS-PMOS
about 57 dB [¼52 dB (RC-CR PPF) + 5 dB negative transconductors.
(complex-pole load)] under an ideal 4-phase The divider chain [Fig. 14.11a] cascades two
25% LO for the image band from [fLO – 3, fLO types of div-by-2 circuits (DIV1 and DIV2) to
– 1] MHz. generate the desired 4-phase 25% LO, from a
VCO and Dividers and LO Buffers—To fully 2-phase 50% output of the VCO. The two latches
benefit the speed and low-Vt advantages of fine (D1 and D2) are employed to build DIV1 that can
linewidth CMOS, the entire LO path is powered directly generate a 25% output from a 50% input
at a lower supply of 0.6 V to reduce the dynamic (Razavi et al. 1995), resulting in power savings
power. For additional testability, an on-chip due to less internal logic operation (i.e., AND
VCO is integrated. It is optimized at ~10 GHz gates) and load capacitances. Each latch consists
to save area and allows division by 4 for I/Q of two sense devices, a regenerative loop and two
pull up devices. For 25%-input 25%-output divi-
sion, DIV2 is proposed that it can be directly
65
interfaced with DIV1. The 25% output of DIV1
Overall IF GainResponse (dB)

55 are combined by MD1 to MD4 to generate a 50%


45
clock signal for D3 and D4.
For testing under an external LOext source at
35 3 notches from 4.8 GHz, another set of D1 and D2 is adopted.
the RC-CR PPF
25 The output of these two sets of clocks are com-
AC-coupled
at VGA
bined by transmission gates and then selected.
15 Although their transistor sizes can be reduced
5 aggressively to save power, their drivability and
robustness in process corners can be degraded.
fLO-10 fLO-5 fLO fLO+5 fLO+10
From simulations, the sizing can be properly
Input RF Frequency (MHz) optimized. The four buffers (Buf1–4) serve to
reshape the pulses from DIV2 and enhance the
Fig. 14.10 Simulated overall IF gain response

DIV1 DIV2
Buf1
CLKin
LOIpB
CLKin CLKin X Y X Y
Buf2
X X
CLK Y CLK CLK2 CLK1 CLK2 CLK1 LOInB
D Q D Q X D Q D Q
D1 D2 D3 D4 Buf3
D Q D Q X D Q D Q
Y LOQpB Y Y
Buf4 X+Y
LOQnB

D1, D2 D3, D4 X +Y

VDD06 VDD06
D D
LOIpB LOInB
CLK
Q Q
Q Q
CLK2 LOQpB LOQnB
D D CLK1 CLK1
MD1 MD2 MD3 MD4

(a) (b)

Fig. 14.11 (a) Schematics of DIV1 and DIV2, and (b) their timing diagrams
14 Circuit Techniques for IoT-Enabling Short-Range ULP Radios 397

9.2 62
Optimum
Region 61
0.6Vpp

Gain (dB)
8.8
0.4Vpp
NF (dB)
60 Vb,LO
Buf1-4 CLO
4 4
Rb
8.4 59 LOIp,n
CMIX,in LOQp,n
58 CL,eq
8
Post-Layout Simulation
57
0.15 0.25 0.35 0.45 0.55
Magnitude of LOIp,n & LOQp,n (Vpp)

(a) (b)
Fig. 14.12 (a) Post-layout simulation of NF and gain versus LO’s amplitude, and (b) additional CLO generates the
optimum LO’s amplitude

Fig. 14.13 Chip micrograph of the receiver. It was tested under CoB and CQFP44 packaging. No external component
is entailed for input matching

drivability. The timing diagram is shown in 14.3.2 Experimental Results


Fig. 14.11b. Due to the very small IBIAS for the
I/Q-DBMs, a LO amplitude of around 0.4 Vpp is The ZigBee receiver was fabricated in 65-nm
found to be more optimized in terms of NF and CMOS (Fig. 14.13) and optimized with dual
gain as simulated and shown in Fig. 14.12a. To supplies (1.2 V: Blixer + hybrid filter, 0.6 V:
gain benefits from it CLO is added to realize a LO and BB circuitries). The die area is 0.24 mm
2
capacitor divider with CMIX,in (input capacitance (0.3 mm2) without (with) counting the LC-tank
of the mixer) as shown in Fig. 14.12b. This act VCO. Since there is no frequency synthesizer
brings down the equivalent load (CL,eq) of Buf1–4 integrated, the results in Fig. 14.14a–d were
by ~33%. measured under LOext for accuracy and data
398 P.-I. Mak et al.

(a) (b)
-5
S11<-10 dB
-10 50 Gain
CQFP

Gain and NF (dB)


- 15
S11 (dB)

- 20 CoB 30

-25

-30 10 NF

-35
2.2 2.4 2.6 2.8 3 3.2 3.4 3.6 2.2 2.4 2.6 2.8 3
Input RF Frequency (GHz) Input RF Frequency (GHz)
(c) (d)
60
Max. Gain

Output IF Gain Response (dB)


30 IIP3 = -6 dBm =57dB
50

40 IRR=36dB
Pout (dBm)

-10
30
-50
20

10
-90
0
-50 -40 -30 -20 -10 0 fLO-10 fLO-5 fLO fLO+5 fLO+10
Pin (dBm) Input RF Frequency (MHz)

Fig. 14.14 Measured (a) S11, (b) wide band gain and NF, (c) IIP3out-band, and (d) low-IF filtering profile

repeatability. The S11-BW (<–10 dB) is image band under an ideal LO, the measured
~1.3 GHz for both chip-on-board (CoB) and notches are merged. Similar to Behbahani et al.
CQFP-packaged tests [Fig. 14.14a], which (2001), this discrepancy is likely due to the LO
proves its immunity to board parasitics and pack- gain and phase mismatches, and the matching
aging variations. The gain (55–57 dB) and NF and variations of the RC-CR networks. The lay-
(8.3–11.3 dB) are also wideband consistent out design is similar to Behbahani et al. (2001)
[Fig. 14.14b]. The gain peak at around that uses dummy to balance the parasitic
2.4–2.5 GHz is from the passive pre-gain. Fol- capacitances. The filtering rejection profile is
lowing the linearity test profile of Tedeschi et al. ~80 dB/decade. The spurious free dynamic
(2010), two tones at [LO + 12 MHz, LO + 22 range (SFDR) is close to 60 dB according to
MHz] are applied, measuring an IIP3out-band of – Tedeschi et al. (2010),
6 dBm [Fig. 14.14c] at the maximum gain of
2ðPIIP3 þ 174dBm  NF  10logBWÞ
57 dB (there is 24-dB gain loss in Fig. 14.14c SFDR ¼
associated with the test buffer and used 1:8 trans- 3
former). This high IIP3 is due to the direct  SNRmin
current-mode filtering at the mixer’s output ð14:6Þ
before signal amplification. The asymmetric IF
response [Fig. 14.14d] shows 22-dB (43-dB) where SNRmin ¼ 4 dB is the minimum signal-to-
rejection at the adjacent (alternate) channel, and noise ratio required by the application, and BW
36-dB IRR. Differing from the simulated IF fre- ¼ 2 MHz is the channel bandwidth.
quency response that has three notches at the
14 Circuit Techniques for IoT-Enabling Short-Range ULP Radios 399

The receiver was further tested at lower volt- is a tradeoff with the phase noise, which
age supplies as summarized in Table 14.2. Only measures—114 dBc/Hz at 3.5 MHz that has an
the NF degrades more noticeably, the IIP3, IRR enough margin to the specifications (Liscidini
and BB gain are almost secured. The better IIP3 et al. 2008) [Fig. 14.15a]. Porting it to the simu-
for 0.6-V/1-V operation is mainly due to the lation results, it can be found that the
narrower –3-dB bandwidth of the hybrid filter. corresponding VCO’s output swing is 0.34 Vpp
For the 0.5-V/1-V operation, the degradation of and the total LO-path power is 1.7 mW (VCO +
IIP3out-band is likely due to the distortion dividers + BUFs). Such an output swing is ade-
generated by AGB. Both cases draw very low quate to lock DIV1 as shown in its simulated
power down to 0.8 mW, being comparable with sensitivity curve [Fig. 14.15b].
other ULP designs such as Herberg et al. (2011). The chip summary and performance
The LC-tank VCO was tested separately. Its benchmarks are given in Table 14.3, where
power budget is related with its output swing and (Tedeschi et al. 2010) is a current-reuse architec-
ture and (Zhang et al. 2013) is an ultra-low-
voltage design. For this work, the results
Table 14.2 Key performances of the receiver at differ- measured under a 10-GHz on-chip VCO are
ent supply voltages also included for completeness, but they are
Supply voltage (V) 0.6/1.2 0.6/1 0.5/1 more sensitive to test uncertainties. The degraded
Power (mW) 1.7 1.2 0.8 NF and IRR are mainly due to the phase noise of
Gain (dB) 57 58 57.5 the free-running VCO. In both cases, this work
IIP3out-band (dBm) 6 4 8 succeeds in advancing the IIP3out-band, power and
NF (dB) 8.5 11.3 12
area efficiencies, while achieving a wideband S11
IRR (dB) 36 38 35
with zero external components. Particularly,

Fig. 14.15 (a) The (a)


measured phase noise has 2.3
enough margin to the -102 [Spec.]
Phase Noise @ 3.5MHz

specifications. From
-106 2.1

Power (mW)
simulations, it can be
(dBc/Hz)

shown that it is a tradeoff


with the power budget -110 LOG’s Total
Measured 1.9
according to the VCO’s Power (Simulated)
-114
output swing. (b)
Simulated sensitivity curve 1.7
-118 Phase Noise (Simulated)
of DIV1 showing its small
input-voltage requirement 0.32 0.36 0.4 0.44
at ~10 GHz VCO Output Swing (Vpp)

(b)
0.7
10 GHz
Input Swing of DIV1 (Vpp)

LC VCO DIV1 DIV2 BUF


2 4 4 4 To
0.5 ÷2 ÷2 Mixers

0.3

0.1 DIV1's Sensitivity Curve

0 5 10 15 20 25
Input Frequency of DIV1 (GHz)
400 P.-I. Mak et al.

Table 14.3 Performance summary and benchmark with the state-of-the-art


This work (Lin et al. 2013, JSSC ’10 (Tedeschi
2014a) et al. 2010) ISSCC ’13 (Zhang et al. 2013)
Application ZigBee ZigBee Energy Harvesting
Architecture Blixer + Hybrid- LMV LNA + Mixer + Frequency-
Filter + Passive RC-CR PPF Cell + Complex translated IF Filter
Filter
BB Filtering 1 Biquad + 4 complex poles 3 complex poles 2 real poles
External I/P matching zero 1 inductor, 1 capacitor 2 capacitors, 1 inductor
components
S11 < –10 dB 1300 (2.25 to 3.55 GHz) <300 (2.3 to 2.6 GHz) >600 (<2 to 2.6 GHz)
Bandwidth (MHz)
Integrated VCO No Yes Yes Yes
Gain (dB) 57 55 75 83
Phase Noise (dBc/Hz) NA 115 @ 116 @ 3.5 MHz 112.8 @ 1 MHz
3.5 MHz
NF (dB) 8.5 9 9 6.1
IIP3out-band (dBm) 6 6 12.5 21.5
IRR (dB) 36 (worst of 28 35 N/A
5 chips)
SFDR (dB) 60.3 60 55.5 51.6
LO-to-RF Leak (dBm) 61 61 60 N/A
Power (mW) 1.7a 2.7 3.6 1.6
Active Area (mm2) 0.24 0.3 0.35 2.5
Supply Voltage (V) 0.6/1.2 1.2 0.3
Technology 65 nm CMOS 90 nm CMOS 65 nm CMOS
a
Breakdown: 1 mW: Blixer + hybrid filter + BB circuitry, 0.7 mW: DIV1 + LO Buffers

when comparing with the recent work (Zhang 14.4.1 Receiver Architecture
et al. 2013), this work achieves 8x less area and
15.5 dBm higher IIP3, together with stronger BB Specifically, the gain-boosted 4-path SC network
channel selectivity. [Fig. 14.16a] separates the output of each gain
stage Gm (Gm has a transconductance of gm3,
output resistance of 4RL, and feedback resistor
14.4 Function-Reuse ULP Receiver of RF3) with capacitor Co that is an open circuit at
Techniques for the Sub-GHz BB. The I/Q BB signals at VB1,I and VB1,Q are
ISM Bands further amplified along the Path C [Fig. 14.16b]
by each Gm stage. With the memory effect of the
Differing from the previous design that is for capacitors, the functional view of the gain
single band, the function-reuse receiver (Lin response is shown in Fig. 14.16c. In order to
et al. 2014b, c) to be described here can flexibly achieve current-reuse between the RF LNA and
support multiple bands (433/860/915/960 MHz) BB amplifiers without increasing the VDD, the
and can operate at a single low VDD. It features a circuit published in Han and Gharpurey (2008)
gain-boosted N-path switched-capacitor with an active mixer has a similar function. How-
(SC) network embedded into a function-reuse ever, the BB NF behavior and the RF filtering
RF front-end, offering concurrent RF (common- behavior are different from the N-path passive
mode) and BB (differential-mode) amplification, mixer applied here that is at the feedback path.
LO-defined RF filtering, and input impedance For the BB amplifiers, it is one Gm with one RF3,
matching with zero external components. The balancing the BB gain and OB-IIP3. After con-
details are presented next. sidering that the BB amplifiers have been
14 Circuit Techniques for IoT-Enabling Short-Range ULP Radios 401

(a) (b) Common-Mode RF Signal B


Common-Mode RF Signal A Differential BB Signal C
Q Channel Q Channel
LO1 I Channel LO1 I Channel

VB1,I+ Gm VB2,I+ VB1,I+ Gm VB2,I+


Rs Ci Co Vo Rs Ci Co Vo
Vi Vi Virtual
ground
VRF Ci Co VRF Ci Co
VB1,I- Gm VB2,I- VB1,I- Gm VB2,I-

LO3 LO3

(c)
Virtual Blocks
Q Channel
RF3 I Channel
4-Path SC VB1,I+
Network Gm VB2,I+
LO1 Ci Co
Rs Vi Vo
4Gmm Buf

VRF LO3 Ci Co
Gm VB2,I-
RF3/4 VB1,I-
High
Impedance RF3

Fig. 14.16 (a) Function-reuse receiver embedding a gain-boosted 4-path SC network and its operation for RF signal,
(b) BB signals and (c) its functional view to model the gain response

absorbed in the LNA, the I/Q passive mixers and implementation of the function-reuse receiver
capacitors absorbed by the 4-path SC network, that has AC-coupling. For the NF difference
the blocks after the LNA can be assumed virtual. (ΔNF), with a large (small) RF3, it is ~0.8 dB
These virtual blocks reduce the power, area and (3.5 dB) as compared in Fig. 14.18a, b. This is
NF. To validate the above viewpoint, the gain due to the lower gain at the LNA’s output, forc-
and noise performances under two sets of RF3 are ing the input-referred noise from the
simulated. Here, the virtual blocks in Fig. 14.16c downconversion passive mixers and the BB
are implemented with physical transistors and amplifiers to increase with a small RF3. Either
capacitors for the BB amplifiers and the mixers with a small or large RF3, it is noteworthy that the
while the buffer is ideal. Thus, the power of the variation of BB NF is small (i.e., for RF3 ¼ 20
modeled receiver is at least 2x larger than the kΩ it is 3.6 dB while for RF3 ¼ 150 kΩ it is
proposed receiver. For the IB BB gain at VB2,I 3.4 dB), because the BB NTF has a weak relation
(VB2,Q) between the proposed function-reuse with RF3. It also indicates that the BB NTF is
receiver and its functional view, the difference weakly related with the gain at the LNA’s output,
is only 1 dB at a large RF3 of 150 kΩ which is dissimilar to the usual receiver where
[Fig. 14.17a]. For a small RF3, the gain error the NF should be small when the LNA’s gain is
goes up to 2 dB [Fig. 14.17b], which is due to large. Similarly, the NF at the LNA’s output
the gain difference between the model of the (now shown) can be larger than that at BB due
N-path tunable LNA [Fig. 14.16c] and the to the different NTFs.
402 P.-I. Mak et al.

Fig. 14.17 Simulated BB (a) (b)


gain response of the 50
Large RF3 = 150 kΩ Small RF3 = 20 kΩ
30

BB Gain (dB)
BB Gain (dB)
function-reuse receiver and Function-Reuse Function-Reuse
its functional view with (a) 30 Receiver Receiver
Its Functional
a large RF3 and (b) a small Its Functional 10 View
RF3 10 View

-10 -10

0 20 40 60 80 100 0 20 40 60 80 100
BB Frequency (MHz) BB Frequency (MHz)

Fig. 14.18 Simulated BB (a) (b)


NF of the function-reuse
7 14 Small RF3 = 20 kΩ
receiver and its functional Large RF3 = 150 kΩ
ΔNF = 0.8 dB 12 ΔNF = 3.5 dB
BB NF (dB)

view with (a) a large RF3

BB NF (dB)
6 10
and (b) a small RF3
5 8
Its Functional View
6
4 Its Functional View ΔNF
ΔNF 4
Function-Reuse Receiver Function-Reuse Receiver
3 2
0 2 4 6 8 10 0 2 4 6 8 10
BB Frequency (MHz) BB Frequency (MHz)

14.4.2 Low-Voltage Current-Reuse tuning range). The –R cell using cross-coupled


VCO-Filter transistors is added at VF1,I+ and VF1,I- to boost
the BB gain without loss of voltage headroom.
In order to further optimize the power, the VCO For the BB complex poles, A2,5 and Cf1 deter-
is designed to current-reuse with the BB complex mine the real part while A3,6 and Cf1 yield the
low-IF filter (Fig. 14.19). The negative imaginary part. There are three similar stages
transconductor of the VCO is divided into multi- cascaded for higher channel selectivity and
ple Mv cells. The aim is to distribute the bias image rejection ratio (IRR). Rblk and Cblk were
current of the VCO to all BB gain stages (A1, added to avoid the large input capacitance of A1,4
A2. . .A18) that implement the BB filter. For the from degrading the gain of the front-end.
VCO, MV operates at the frequency of 2fs or 4fs
for a div-by-2 or div-by-4 circuit. Thus, the VCO
signal leaked to the source nodes of MV (VF1,I+,
14.4.3 Experimental Results
VF1,I-) is pushed to very high frequencies (4fs or
8fs) and can be easily filtered by the BB
Two versions of the receiver were fabricated in
capacitors. For the filter’s gain stages such as
65-nm CMOS (Fig. 14.20) and optimized with a
A1, Mb (gMb) is loaded by an impedance of ~1/
single 0.5-V VDD. With (without) the LC tank
2gMv when Lp can be considered as a short circuit
for the VCO, the die area is 0.2 mm2 (0.1 mm2).
at BB. Thus, A1 has a ratio-based voltage gain of
Since the measurement results of both are simi-
roughly gMb/gMv, or as given by 4TgMb/GmT,
lar, only those measured with VCO in
where GmT is the total trans-conductance for the
Fig. 14.21a–d are reported here. From 433 to
VCO tank. The latter shows how the distribution
960 MHz, the measured BB gain is 50  2 dB.
factor T can enlarge the BB gain, but is a tradeoff
Two tones at [fs + 12 MHz, fs + 22 MHz] are
with its input-referred noise and can add more
applied, measuring an OB-IIP3 of –20.5  1.5
layout parasitics to Vvcop,n (i.e., narrower VCO’s
dBm at the maximum gain. The IRR is
14 Circuit Techniques for IoT-Enabling Short-Range ULP Radios 403

4 4
÷2 ÷2 430 ~ 480 MHz fs LC VCO with
0.5 V
4 distributed Mv
LP ÷2 860 ~ 960 MHz fs cells as the loads
of A1, A2...A18
VVar VVCOp VVCOn
Cvar

Mv Cell Mv Cell Mv Cell Mv Cell Mv Cell


Mv Mv Mv Mv
for A4-6 for A7-9 for A10-12 for A13-15 for A16-18

Mv Cell for A1-3


VF1,I+ VF1,I-
1st Complex Pole

-R Cell Rblk Cblk


VB2,I+ VF1,I+
A1 A2 Cf1
VB2,I- VF1,I-
Mb Mb
2nd & 3rd
A6 A3
Poles
VB2,Q+ VF1,Q+
A1-3 VB2,Q- A4 A5 Cf1
VF1,Q-

Complex Low-IF gm-C Filter


Fig. 14.19 Proposed low-voltage current-reuse VCO-filter

Fig. 14.20 Chip


micrograph of the function-
reuse receiver with a
LC-tank for the VCO (left)
and without it (right)

20.5  0.5 dB due to the low-Q of the Since the VCO is current-reuse with the filter, it
VCO-filter. The IIP3 is mainly limited by the is interesting to study its phase noise with the BB
VCO-filter. The measured NF is 8.1  0.6 dB. signal amplitude. For negligible phase noise
404 P.-I. Mak et al.

915MHz
960MHz
z
z

MH
MH
(a) (b)

860
433
60 -15 - 100

Phase Noise (dBc/Hz)


50 -17 - 105 VCO Phase
Gain
Gain, NF, IRR (dB)

OB-IIP3 (dBm)
-19 Noise @ 3.5MHz
40
OB-IIP3 - 110
30 -21
- 115
20 -23
IRR
10 NF -25 - 120 fLO=860MHz
0 -27 - 125
400 600 800 1000 10 60 110
RF Frequency (GHz) BB Signal Swing (mVpp)
915MHz
960MHz
z
z

MH
MH

860
433

(c) (d)
2 -110
Power (mW), S11 (dB)

Phase Noise (dBc/Hz)


fLO=860MHz

Gain Response (dB)


0 Power 48
-2 -115 AC-coupled
VCO Phase
-4 28
Noise @ 3.5MHz
-6 -120
-8 8
S11
-10 -125
400 600 800 1000 -10 -5 0 5 10
RF Frequency (GHz) BB Frequency (MHz)

Fig. 14.21 Measured key performance metrics: (a) Gain, NF, IRR and OB-IIP3. (b) VCO phase noise versus BB
signal swing. (c) S11, power and VCO phase @ 3.5-MHz offset. (d) BB complex gain response centered at –2-MHz IF

degradation, the BB signal swing should be <60 RF filtering [Fig. 14.22a]. For an offset fre-
mVpp, which can be managed by variable gain quency of 60 MHz, the P1dB is –20 dBm, limited
control. If a 60-mVpp BB signal is insufficient for by the current-reuse VCO-filter. For the blocker
demodulation, a simple gain stage (e.g., inverter NF, with a single tone at 50 MHz, the blocker NF
amplifier) can be added after the filter to enlarge is almost unchanged for the blocker <–35 dBm.
the gain and output swing. The total power of the With a blocker power of –20 dBm, the NF is
receiver is 1.15 mW (0.3 mW for the LNA + BB increased to ~14 dB [Fig. 14.22b].
amplifiers and 0.65 mW for VCO-filter and This work is compared with the prior art in
0.2 mW for the divider), while the phase noise Table 14.4, where (Lin et al. 2013) is the current-
is –117.4  1.7 dBc/Hz at 3.5-MHz frequency reuse architectures described previously, while
offset. The S11 is below –8 dB across the whole (Zhang et al. 2013) is the cascade architecture
band. The asymmetric IF response shows 24-dB with ULV supply for energy harvesting. For this
(41-dB) rejection at the adjacent (alternate) work, the results measured under an external LO
channel. are also included for completeness. In both cases,
To study the RF filtering behavior, the P1dB this work succeeds in advancing the power and
and blocker NF are measured. For the in-band area efficiencies with multi-band convergence,
signal, the P1dB is –55 dBm while with a fre- while achieving tunable S11 with zero external
quency offset frequency of 20 MHz, it increases components. When comparing with the most
to –35 dBm, which is mainly due to the double- recent ULV design (Zhang et al. 2013), this
14 Circuit Techniques for IoT-Enabling Short-Range ULP Radios 405

Fig. 14.22 Measured (a) (a) (b)


P1dB versus input offset - 15 20
frequency and (b) blocker
NF versus input power -25 -20 dBm

Blocker NF (dB)
15

P1dB (dBm)
-35 14 dB
-35 dBm

-45
10
-55 -55 dBm

- 65 5
0 20 40 60 80 100 -55 -45 -35 -25 -15
Offset Frequency (MHz) Input Power (dBm)

Table 14.4 Performance summary and benchmark with the state-of-the-art RXs
ISSCC ’13 (Lin et al. ISSCC ’13 (Zhang et al.
This work (Lin et al. 2014b) 2013) 2013)
Application 433/860/915/960 MHz (ZigBee/ 2.4 GHz (ZigBee/IEEE 2.4 GHz (Energy
IEEE802.15.4c/d) 802.15.4) Harvesting)
Architecture Function-Reuse RF Front-End + N- Blixer + Hybrid CG LNA + Passive
path Tunable LNA + Current-Reuse Filter + Passive RC-CR Mixers + N-Path SC IF
VCO-Filter Filter + LC VCO Filter + LC VCO
BB Filter 3 complex poles 1 Biq., 4 complex poles 2 real poles
Input matching On-chip N-path SC (tunable by LO, On-chip LC (fixed, low Q) Off-chip LC (fixed, low Q)
technique high Q)
External zero zero 2 caps, 1 inductor
components
Input matching 433 to 960 MHz (tunable by LO) 2.25 to 3.55 GHz (fixed) ~2 to 2.6 GHz (fixed)
BW and
tunability
Active Area 0.2 (0.1a) 0.3 2.5
(mm2)
Power 1.15  0.05 @ 0.5 V 2.7 @ 0.6/1.2 V 1.6 @ 0.3 V
(mW) @VDD
Gain (dB) 50  2 (51  3a) 55 83
NF (dB) 8.1  0.6 (8  1a) 9 6.1
OB-IIP3 20.5  1.5 (23  1a) 6 21.5
(dBm)
IRR (dB) 20.5  0.5 (21  0.5a) 28 N/A
VCO Phase 117.4  1.7 @ 3.5 MHz 115 @ 3.5 MHz 112 @ 1 MHz
Noise
(dBc/Hz)
Technology 65 nm CMOS 65 nm CMOS 65 nm CMOS
a
Results measured from the test kit that has no VCO

work saves more than 10x of area while scalable output power (Pout) and system effi-
supporting multi-band operation with zero exter- ciency for ZigBee and other Internet-of-Things
nal components. wireless solutions.
To improve the system efficiency of a TX,
which normally consists of a VCO (or DCO)
14.5 Sub-1 V ULP 2.4GHz and a power amplifier, it is worth to consider a
Transmitter current-reuse topology between the two blocks.
The current-reuse VCO-PA (Li et al. 2015) has
This section briefly covers an ongoing-design of demonstrated good system efficiency (17.5%),
a sub-1 V ULP 2.4-GHz transmitter (TX) with but has a limited Pout (-1 dBm) after stacking.
406 P.-I. Mak et al.

Fig. 14.23 Class-F VCO VGB VDD


(Babaie et al. 2013) Moderately-Coupled k
Cs Ls Lp Cp
|Zd|
Vg Vd
k
Zd
M1 Id I
IRVF IRVF
AC GND

Fig. 14.24 Ongoing work VGB VDD GND


on a Class-F VCO driving k1
up the antenna (Peng et al.
2016) C3 L3 L1 C1 L2 C2 RL=50Ω
Vg Vd Vout (Antenna)
k2
Zd
M1 Id

Table 14.5 Brief summary of this work with respect to the state-of-the-art TXs
ISSCC ’15 (Liu et al. ISSCC ’15 (Prummel et al.
Parameters This work (Peng et al. 2017) 2015a) 2015)
Applications 2.4 GHz ZigBee 2.4 GHz BLE/ZigBee/2 M 2.4 GHz BLE
Proprietary
Architectures Class-F VCO Class-D PA + LC-DCO Class-D PA + LC-VCO
On-chip inductor or 1 (Shared by VCO, PA and 2 (1 for LC-DCO, 1 for O/P 3 (1 for LC-VCO, 2 for O/P
transformer O/P Matching) Matching) Matching)

Thus, our recent work (Peng et al. 2017) reported a class-F VCO that can directly up the antenna
a function-reuse DCO-PA that not only shares (Fig. 14.24). This scheme reuses the amplifying
bias current, but also upholds a full VDD for both device (M1) to unify the DCO and PA functions
the DCO and PA operation. for power and area savings (Table 14.5).
Here, the basic circuit is inspired by the class- L1–3 form a transformer enabling self-
F oscillator (Fig. 14.23), which is attractive for oscillation and boosts up Vg by step-up ratioing
its high FoM of 192 dBc/Hz (Babaie et al. 2013). (L3 > L1), and output-impedance matching to
It features a resonant tank (LpCp, LsCs) with a deliver adequate Pout (>0 dBm) to RL. The cou-
moderately-coupling k to peak up the drain pling coefficients can be customized for optimum
impedance (Zd) at both 1st and 3rd harmonics, performance.
resulting in a pseudo-square drain voltage (Vd) to
reduce the oscillator’s impulse sensitivity func-
tion. Although Vd and the drain current (Id) are 14.6 Conclusions and Future
alike a square-wave, the tank’s resonant response Perspectives
recovers a sine Vg by suppressing the harmonic
components, while offering a passive gain to Vg This chapter described the system aspects and
under step-up ratioing (Ls > Lp). These circuit techniques of building cost-aware ultra-
properties are preserved when it is modified as low-power (ULP) radios for both 2.4-GHz and
14 Circuit Techniques for IoT-Enabling Short-Range ULP Radios 407

sub-GHz ISM bands. We demonstrated that filtering technique (Lin et al. 2014c) can be a
effective co-design between the system and cir- helpful technique to enhance the resilience of
cuit levels, as well as RF and analog block levels, the receiver. For the state-of-the-art transmitters,
are decisive to concurrently balance the cost their power efficiency is still not that high (Liu
and power budgets, while keeping up the radio et al. 2015a) at a 0-dBm output power. Thus, it is
performance without resorting from external worth to revisit the design of ULP PA and VCO
components. Other state-of-the-art techniques as described in (Peng et al. 2017).
are summarized in Lin et al. (2016) and Yu LO generation can consume significant power
et al. (2017). and area when approaching multi-band opera-
For future development, to cope with the fast tion. For example, for a universal radio to cover
market shift and many upcoming applications, the 2.4 GHz and sub-GHz ISM bands, the tuning
multi-band multi-standard ULP radios with flex- range of the VCO should be 57% if a 2.4-GHz
ible data rate will become promising for the VCO is selected and followed by a divide-by-4
future IoT growth. In addition to the obvious circuit. Such a wide tuning range should con-
goal of high energy efficiency during the active sume more power than the single-band design.
mode of the radios, they should also be designed In fact, from area and tuning range’s viewpoint, a
for very low sleep/leakage power, preferably in ring oscillator can be more attractive. However,
the range of pW (Paidimarri et al. 2015), such to meet the required phase noise, ULP consump-
that after heavily duty cycling, the average sys- tion is still challenging. Time-interleaved ring
tem power can be minimized. The technology oscillator (Yin et al. 2016a, b) with effective
choices also offer the flexibility of using low-Vt phase noise reduction offers the potential to alle-
thin-oxide transistors for core circuits such that a viate this tradeoff.
lower VDD can be used to save power, whereas
high-Vt or thick-oxide transistors can be used to Acknowledgement This research is funded by the
suppress the sleep current. Mixed-VDD design Macau Science and Technology Development Fund
(FDCT)—SKL Fund, and the University of Macau—
can be a chip-level strategy for power savings
MYRG2015-00040-FST.
(Mak and Martins 2012).
To save power, it is possible to avoid the RF
PLL for channelization by using a temperature-
compensated thin-film bulk-acoustic-wave reso- References
nator (FBAR) that assists the RF-to-IF
downconversion. Thus, the channel selection M. Babaie, R. Staszewski, Third-harmonic injection tech-
nique applied to a 5.87-to-7.56 GHz 65nm class-F
can be delayed to lower frequency (Wang et al. oscillator with 192 dBc/Hz FoM. ISSCC Dig. Tech.
2014). Papers (Feb. 2013), pp. 348–349
The average power is critical for a long bat- S. Bandyopadhyay, A. Chandrakasan, Platform architec-
tery lifetime, as it will dominate the maintenance ture for solar, thermal and vibration energy combining
with MPPT and single inductor, in Proceedings Sym-
cost of massive-scale wireless sensor networks. posium on VLSI Circuits (2011), pp. 238–239
This fact urges the need of highly autonomous F. Behbahani, Y. Kishigami, J. Leete, A.A. Abidi, CMOS
ULP radios that can survive with mainly/only mixers and polyphase filters for large image rejection.
energy harvesting. To this point, fully-integrated IEEE J. Solid-State Circuits 36, 873–887 (2001)
S. Blaakmeer, E. Klumperink, D. Leenaerts, B. Nauta,
ULP power management units and multi-source Wideband balun-LNA with simultaneous output bal-
energy harvesters will be of great importance ancing, noise-canceling and distortion-canceling.
(Masuch et al. 2013). IEEE J. Solid-State Circuits 43, 1341–1350 (2008a)
For the transceiver, although the sensitivity of S. Blaakmeer, E. Klumperink, D. Leenaerts, B. Nauta,
The blixer, a wideband Balun-LNA-I/Q-mixer topol-
a state-of-the-art ULP receiver is better than ogy. IEEE J. Solid-State Circuits 43, 2706–2715
90 dBm (Liu et al. 2014), their tolerability to (2008b)
large out-of-band blockers should have room to F. Bruccoleri, E. Klumperink, B. Nauta, Wide-band
be further improved. The gain-boosted N-path CMOS low-noise amplifier exploiting thermal noise
408 P.-I. Mak et al.

canceling. IEEE J. Solid-State Circuits 39, 275–282 W.-H. Yu, H. Yi, P.-I. Mak, J. Yin, R.P. Martins, A 0.18 V
(2004) 382 μW bluetooth low-energy (BLE) receiver with
Y. Chen, P.-I. Mak, L. Zhang, Y. Wang, A 0.07mm2, 1.33 nW sleep power for energy-harvesting
2mW, 75MHz-IF, 4th-order BPF using a source- applications in 28 nm CMOS, in IEEE ISSCC Dig.,
follower-based resonator in 90 nm CMOS. IET Elec- Tech Papers, 2017, to appear
tron. Lett. 48, 552–554 (2012) P.-I. Mak, R.P. Martins, A 0.46-mm2 4-dB NF unified
J. Han, R. Gharpurey, Recursive receiver down- receiver front-end for full-band mobile TV in 65-nm
converters with multiband feedback and gain-reuse. CMOS. IEEE J. Solid-State Circuits 46, 1970–1984
IEEE J. Solid-State Circuits 43, 1119–1131 (2008) (2011)
A.C. Herberg, T.W. Brown, T.S. Fiez, K. Mayaram, P.-I. Mak, R.P. Martins, High-/Mixed-Voltage Analog and
A250-mV, 352-μW GPS receiver RF front-end in RF Circuit Techniques for Nanoscale CMOS. Series of
130-nm CMOS. IEEE J. Solid-State Circuits 46(4), Analog Circuits and Signal Processing (ACSP)
938–949 (2011) (Springer, Berlin, 2012)
J. Kaykovuori, K. Stadius, J. Ryynanen, Analysis and J. Masuch, M. Delgado-Restituto, Ultra Low Power
design of passive polyphase filters. IEEE Trans. Transceiver for Wireless Body Area Networks, Series
Circuits Syst. I, Reg. Papers 55, 3023–3037 (2008) of Analog Circuits and Signal Processing (ACSP)
C.L. Ler, A.K. A’ain, A.V. Kordesh, CMOS source (Springer, Berlin, 2013)
degenerated differential active inductor. IET Electron. A. Paidimarri, N. Ickes, A. P. Chandrakasan, A +10 dBm
Lett. 44, 196–197 (2008) 2.4 GHz transmitter with sub-400pW leakage and
C. Li, A. Liscidini, A current re-use PA-VCO cell for 43.7% system efficiency. ISSCC Dig. Tech. Papers
low-power BLE transmitters, in European Solid- (Feb. 2015), pp. 246–247
State Circuits Conference, Paper No. 1198 (Sept. X. Peng, J. Yin, P.-I. Mak, W.-H. Yu, R.P. Martins, A 2.4-
2015) GHz ZigBee Transmitter using a function-reuse Class-
Z. Lin, P.-I. Mak, R.P. Martins, A 1.7 mW 0.22 mm2 F DCO-PA and an ADPLL Achieving 22.6% (14.5%)
2.4 GHz ZigBee RX exploiting a current-reuse system efficiency at 6-dBm (0-dBm) Pout. IEEE J.
Blixer + Hybrid filter topology in 65nm CMOS. Solid-State Circuits (2017), to appear
ISSCC Dig. Tech. Papers (Feb. 2013), pp. 448–449 A. Pirola, A. Liscidini, R. Castello, Current-mode,
Z. Lin, P.-I. Mak, R.P. Martins, A 2.4-GHz ZigBee WCDMA channel filter with in-band noise shaping.
receiver exploiting an RF-to-BB-current-reuse IEEE J. Solid-State Circuits 45, 1770–1780 (2010)
Blixer + Hybrid filter topology in 65-nm J. Prummel, M. Papamichail, M. Ancis et al., A 10 mW
CMOS. IEEE J. Solid-State Circuits 49, 1333–1344 bluetooth low-energy transceiver with on-chip
(2014a) matching. ISSCC Dig. Tech. Papers (Feb. 2015),
Z. Lin, P.-I. Mak, R. P. Martins, A 0.5V 1.15mW 0.2mm2 pp. 238–239
Sub-GHz ZigBee receiver supporting 433/860/915/ R. Rajan, Ultra-Low Power Short-Range Radio Trans-
960MHz ISM bands with zero external components. ceiver (Microsemi Corporation, Aliso Viejo, CA,
ISSCC Dig. Tech. Papers (Feb. 2014b), pp. 164–165 2012)
Z. Lin, P.-I. Mak, R.P. Martins, A Sub-GHz multi-ISM- B. Razavi, K.F. Lee, R.H. Yan, Design of high-speed,
band ZigBee receiver using function-reuse and gain- low-power frequency dividers and phase-locked
boosted N-path techniques for IoT applications. IEEE loops in deep submicron CMOS. IEEE J. Solid-State
J. Solid-State Circuits 49, 2990–3004 (2014b) Circuits 30, 101–109 (1995)
Z. Lin, P.I. Mak, R.P. Martins, Ultra-Low-Power and J.A. Stankovic, Research directions for the internet of
Ultra-Low-Cost Short-Range Wireless Receivers in things. IEEE Internet Things J. 1(1), 3–9 (2014)
Nanoscale CMOS. Series of Analog Circuits and Sig- M. Tedeschi, A. Liscidini, R. Castello, Low-power quad-
nal Processing (ACSP) (Springer, Berlin, 2016) rature receivers for ZigBee (IEEE 802.15.4)
A. Liscidini, M. Tedeschi, R. Castello, A 2.4GHz 3.6mW applications. IEEE J. Solid-State Circuits 45,
0.35mm2 quadrature front-end RX for ZigBee and 1710–1719 (2010)
WPAN applications. ISSCC Dig. Tech. Papers (Feb. K. Wang, J. Koo, R. Ruby, B. Otis, A 1.8 mW PLL-free
2008), pp. 370–371 channelized 2.4 GHz ZigBee receiver utilizing fixed-
Y.-H. Liu, A. Ba, J. van den Heuvel et al., A 1.2 nJ/bit LO temperature-compensated FBAR resonator.
2.4 GHz receiver with a sliding-IF phase-to-digital ISSCC Dig. Tech. Papers (Feb. 2014), pp. 21–22
converter for wireless personal/body area networks. J. Yin, P.-I. Mak, F. Maloberti, R.P. Martins, A time-
IEEE ISSCC Dig. Tech Papers (Feb. 2014), interleaved ring-VCO with reduced 1/f3 phase noise
pp. 166–167 corner, extended tuning range and inherent divided
Y.-H. Liu, C. Bachmann, X. Wang et al., A 3.7mW-RX output. IEEE J. Solid-State Circuits 51, 2979–2991
4.4mW-TX fully integrated bluetooth low-energy/ (2016)
IEEE802.15.4/proprietray SoC with an ADPLL- F. Zhang, K. Wang, J. Koo, Y. Miyahara, B. Otis, A 1.6
based fast frequency offset compensation in 40 nm mW 300 mV supply 2.4 GHz receiver with –94 dBm
CMOS. ISSCC Dig. Tech. Papers (Feb. 2015), sensitivity for energy-harvesting applications. ISSCC
pp. 236–237 Dig. Tech. Papers (Feb. 2013), pp. 456–457
Battery Technologies for IoT
15
Jeff Sather

Although IoT devices appear in myriad physical functionality will determine the energy and
configurations and serve countless purposes, the power requirements, as well as whether a pri-
battery requirements for any particular category mary or rechargeable battery is better suited for
of IoT devices can be evaluated by recognizing the application. Duty cycle of radio transmit/
their physical, electrical, and functional elements receive pulses, communication protocols, sensor
as follows: activity, and activation of visual indicators are
some of the most power consumptive factors
• physical size—typically miniature assemblies influencing the battery capacity needed in order
having a battery compartment, socket, or to operate for an acceptable period between
wiring harness for connection to the power replacement or recharge.
source; Using this broad framework of what an IoT
• operating environment—wide range of node does and how it performs those functions,
conditions, from indoor, outdoor, body-worn and with an understanding of the power
and implantable, to extreme environments; requirements of the disparate components and
• functionality—sensing, data collection, and system as a whole, a survey of power sources
radio communications; can be undertaken, comparing how each delivers
• components—sensors, radios, microcon- in terms of performance, cost, size, and other
trollers/microprocessors, including human parameters. Of course, not all IoT devices need
interfaces in some cases; a battery or other portable power source. How-
• purpose—remote monitoring/diagnostic, pre- ever, a significant number of devices are
ventive maintenance, controls, gathering sta- deployed where batteries are the only viable
tistics, machine-to-machine communication. power source. In some cases, rechargeable
batteries are used as the main power source,
The physical size of the IoT device might in where a recharging source is available and prac-
turn limit the physical size of the battery that can tical, or as a backup power source in the event of
be considered. The operating environment will main power loss—where main power can be a
dictate whether the battery options are limited to larger battery, grid power, or a renewable supply
those having industrial, automotive, or commer- such as photovoltaic cells. In many other
cial temperature ratings. Components and embodiments, a non-rechargeable battery is the
more useful power source, as there will be no
J. Sather (*) practical means of replenishing the energy in a
Cymbet Corporation, Elk River, MN, USA discharged battery.
e-mail: jsather@cymbet.com

# Springer International Publishing AG 2017 409


M. Alioto (ed.), Enabling the Internet of Things, DOI 10.1007/978-3-319-51482-6_15
410 J. Sather

Because IoT devices encompass any number 15.1 An Overview of Primary


of combinations of active and passive and Secondary Battery
components, are configured in untold shapes Chemistries
and sizes, operate over a range of power con-
sumption modes, and can have product operating Batteries are designed as either non-rechargeable
lives of days to years, the battery specifications or rechargeable cells—referred to as primary and
required to support these features and functions secondary, cells, respectively. Within each of
can vary almost as widely as the devices them- these categories, chemistries vary widely, from
selves. Whatever the IoT implementation, it is alkaline chemistry in the ubiquitous primary
important to select the battery that meets mini- batteries such as AA, AAA, C, and D-cells, and
mum performance objectives under all possible coin variety including the commonly used “CR”
operating conditions, will last the intended life of family (e.g., CR2032), to secondary batteries like
the product or, in the case where battery replace- Li-ion cylindrical (e.g., 18650) and Li-polymer
ment is expected, can be replaced with minimal pouch cells. Each chemistry and construction
expense, difficulty, and in compliance with dis- method has its tradeoffs in terms of performance
posal regulations. This chapter presents the and cost, with lithium-based primary and second-
characteristics of batteries that, because of their ary cells of various types having gained wide-
size, voltage, power delivery, and cost profiles, spread use in recent years, owing to their
are candidates as power sources in IoT devices. relatively high energy density, higher (relative
Special-use batteries such as those used specifi- to alkaline cells) nominal voltage of 3.0–3.6 V
cally in implantable medical devices and in many of the cell types, and typically low self-
aerospace applications; stationary batteries; and discharge. Cells are available in consumer and
batteries having chemical compositions no lon- industrial grade, offering a range of performance
ger in widespread use, will not be covered in this in parameters such as cycle life, self-discharge,
chapter. A much wider selection of batteries than pulse current delivery, and cost. An overview of
those covered here are commercially available chemistries in widespread use, their common
for every imaginable application. Aside from embodiments, and noteworthy characteristics, is
references to materials comprising the positive given in Table 15.1.
(cathode) and negative (anode) electrodes of dif- Storage capacities of the batteries listed range
ferent battery chemistries, fundamental battery from a few tens of microampere-hours, to several
principles and electrochemical reactions will ampere-hours.
not be discussed. Throughout this chapter, the
terms “cell” and “battery” will be used inter-
changeably, although in the strictest sense they
are distinct terms. 15.2 Voltage, Power, and Battery
Battery designations as defined by active Capacity Demands in IoT
material composition—negative electrode, elec- Applications
trolyte, and positive electrode—are categorized
by both ANSI and IEC standards, yet each of Given the range of features, functions, and
those standards uses its own nomenclature. applications in which IoT devices are deployed,
Designations used in this chapter will refer to each category of devices, and different models
the letter codes in some cases, but also the more within a category, will impose different demands
common parlance of, for example, “AA alkaline” on the battery, not only in operating voltage and
or “lithium coin cell” when generalities will current, but in physical size constraints and
suffice. When making specific technical product life.
comparisons, the more precise nomenclature Components deployed in IoT products operate
will be used. from a range of supply voltages and present load
15
Battery Technologies for IoT

Table 15.1 A selection of battery types commonly used in portable applications and IoT devices
Anode Cathode
Designator/ Common (negative (positive Nominal Standard continuous Shelf/operating life in Temp. range
common name type electrode) electrode) Rechargeable? volgate (V) load rating (typical) years (room temp.) (discharge) ( C)
CR Coin Li MnO2 No 3 C/100 10 30 to +60
BR Coin Li CFx No 3 C/1000 10 30 to +80
ML Coin Li-Al MnxOy Yes 3 C/300 10 20 to +60
AA alkaline Cylindrical Zn MnO2 No 1.5 C/10 7–10 20 to +55
(LR6)
AA lithium Cylindrical Li FeS2 No 1.5 1C 20 40 to +60
(FR6)
Li-ion (18650) Cylindrical C LiCoO2 Yes 3.6 1C 5 20 to +60
Li thionyl Coin and Li SOCl2 No 3.6 C/10,000 20+ 55 to +85
chloride cylindrical
Solid state thin Thin film/ Li LiCoO2 Yes 3.8 2C 20+ 40 to +70
film planar
411
412 J. Sather

AA Alkaline (LR6) Discharge Characteristics (0.8V Cutoff Voltage)


Fig. 15.1 Discharge
capacity as a function of 3500
discharge rate for standard
3000 Mfr. 1
AA alkaline (LR6)
batteries 2500 Mfr. 2

Capacity (mAh)
2000

1500

1000

500

0
10 100 1000
Discharge Current (mA)

AA Alkaline (LR6) Discharge Characteristics


Fig. 15.2 Voltage drop as
3500
a function of discharge rate Mfr. 1 to 0.8V
for typical AA alkaline 3000 Mfr. 2 to 0.8V
(LR6) batteries Mfr. 1 to 1.0V
2500
Capacity (mAh)

Mfr. 2 to 1.0V
2000 Mfr. 1 to 1.2V

1500 Mfr. 2 to 1.2V

1000

500

0
10 100 1000
Discharge Current (mA)

currents spanning a wide range—from a few tens capacity is specified, but the internal resistance
of nanoamperes for sleep currents, memory reten- of the battery has an attendant voltage drop
tion, and low power clocks, to hundreds of under high current loads that can cause the sys-
milliamperes or in extreme cases, amperes needed tem to drop out of regulation or below the
to operate radios, actuators, and other mechanical minimum operating voltage of some device
devices. Pulsed loads can rapidly diminish the components, resulting in a loss of device
capacity of many battery chemistries. Figure 15.1 functionality—either momentarily or perma-
illustrates the diminished battery capacity as a nently (i.e., requiring a system reset). Note
function of load current for AA alkaline cells in Fig. 15.2 (http://data.energizer.com/PDFs/
from two manufacturers (http://data.energizer. EN91.pdf; http://na.industrial.panasonic.com/
com/PDFs/EN91.pdf; http://na.industrial.pana sites/default/pidsa/files/1.5vseries_datasheets_
sonic.com/sites/default/pidsa/files/1.5vseries_data merged.pdf) the capacity of the same AA alka-
sheets_merged.pdf). line cells from Fig. 15.1, for a given discharge
When considering battery lifetime for the IoT current and cutoff voltage. In systems using
device, the average and peak loads to be placed alkaline or other lower voltage cells, a series
on the battery must therefore be taken into combination of two will be used. Hence,
account. the additive effects of the voltage drop will be
Not only do batteries deliver less total capac- proportional to the number of series-connected
ity when discharged above the rate at which the cells.
15 Battery Technologies for IoT 413

15.3 Attributes of Batteries that is, selecting the battery that fits within a
Commonly Used in IoT pre-defined compartment, board footprint, or
Applications height constraint. Conversely, the size of the
IoT device is in some cases determined largely
Given there are literally thousands of by the physical size of the battery that has suffi-
combinations of battery shapes, sizes, capacities, cient storage capacity and other attributes that
chemistries, and other parameters to choose from serve the power and energy needs of the IoT
when selecting the appropriate battery for an IoT device. Fortunately for product designers,
device, it can be a daunting exercise to cull batteries are available in sizes ranging from a
through all the options and select the battery that cubic millimeter upwards, with operating
has just the right combination of specifications voltages from <1 to >4 V for single cells and
meeting the performance and cost objectives of tens of volts for larger battery packs. Storage
the product designer, manufacturer, and end user capacity ratings span more than five orders of
alike. It is all too easy then to rely on a default magnitude, with discharge currents ranging
selection process based on what has been used in from less than C/10,000 to 5C. The C-rate is the
similar designs, a battery type that is known and rate at which the battery is being discharged,
understood, or what is least expensive and just notionally defined as the battery discharge cur-
enough to do the job. With engineering time rent divided by the current draw under which the
constraints always an implicitly or explicitly battery delivers its rated capacity in 1 h—
understood as being part of the design process, although the rated capacity of most batteries
these approaches to battery selection are not nec- today is at discharge rates much slower than
essarily unjustified. However, such approaches do 1 h. Moreover, a range of shapes and sizes can
not always lead to the optimum result when the be found in both primary and secondary batteries.
interests of all parties are taken into account. The In this section, the emphasis will be on describ-
battery in the system might last well longer than ing batteries that are physically no larger than
the product life demands, sometimes resulting in a ~25 cm3. In addition to larger battery sizes being
larger or more expensive battery being used that available, battery packs are also configured to
otherwise needed. Conversely, the battery might serve applications requiring higher voltage,
fail prematurely as a result of some mode or higher power, and/or higher storage capacity.
another of system operation demanding more Table 15.2 lists the attributes of several com-
from the battery than its intended use. As with mercially available battery types, in formats that
most engineering exercises, there are tradeoffs can serve myriad IoT device sizes.
between thoroughly investigating all options and
getting the job done on time and on budget.
This section endeavors to provide guidance 15.3.2 Comparisons of Energy
into the variety of batteries available to the and Power Densities
designer, along with their respective advantages
and limitations relative to one another in terms of When selecting a battery for a particular system,
performance, durability, cost, and other designers usually want to get as much energy as
parameters important in IoT device design and possible in the smallest volume, other
operation in the field. considerations—such as voltage, safety, cost,
reliability—aside. Energy density is one of the
more important specifications of a battery, yet
15.3.1 Battery Shapes and Sizes it is not usually specified explicitly in data
sheets. It is, rather, implicit in the rated capacity
Because IoT devices take on a wide range of of a battery of a given physical size. Energy
physical embodiments, selection of a suitable density is measured in volumetric and gravimet-
battery often comes down to its physical size— ric terms. Volumetric energy density is measured
414 J. Sather

Table 15.2 Commonly used battery types and their general characteristics
Typical
Common Depictions (not to Storage capacity discharge Discharge
Battery type name scale) range (mAh) rate voltage (V)
Coin/button CR, ML, 1–1000 C/100 1.5–3.3
LR

Cylindrical (Li-ion, 18650, 25–5000 C/10–C/2 1.2–3.7


alkaline, NiMH, NiCd) AAA

Prismatic Li-ion 9V 200–2000 C/10–C/2 3.6

Lithium polymer Li-poly 5–500 C/10–1C 3.6

Solid state/thin film EnerChip, 0.01–20 2C 3.8


EnFilm

in units of watt-hours per liter (Wh/L), or its that could otherwise operate as efficiently at a
equivalent in milliwatt-hours per cubic centime- lower voltage, or, if the battery voltage is being
ter (mWh/cm3). Gravimetric energy density is a routed through a linear regulator, for example,
measure of the amount of energy stored per unit that wastes energy in regulating to a lower volt-
of mass and has units of watt-hours per kg age. In these cases, charge density is a truer
(Wh/kg), or milliwatt-hours per gram (mWh/g). metric of available capacity. Batteries are
Gravimetric energy density is a more critical specified in amp-hours and not watt-hours.
parameter in electric vehicles, where the battery Energy density varies widely, as a function of
power must propel its own weight as well as that chemistry, physical form, packaging, and to
of the vehicle itself, and in mobile devices, where some extent by manufacturer within a given bat-
weight presents a burden to humans. In stationary tery type. Volumetric densities range from as low
devices, gravimetric energy density is usually not as 10 Wh/L, to over 1000 Wh/L. (Zn-air
critical. This section will therefore focus on vol- batteries, of the type commonly used in hearing
umetric energy density of various types of aids, have energy densities exceeding 1200 Wh/
batteries. L in many cases. But these cells are impractical
Energy density can be a misleading parameter in most IoT devices because the operational life
depending on the output voltage of the battery is only a few days once the sealing tab is
and the voltage required by the circuit to operate. removed.) The highest energy densities are typi-
That is, sometimes charge density is a more cally found in primary cells, with lithium thionyl
useful measure, especially if the battery voltage chloride (LiSOCl2) cells boasting energy
is applied directly to a circuit or circuit element densities >1200 Wh/L in some cell designs.
15 Battery Technologies for IoT 415

Of course, in a rechargeable cell, the rated capac- And while commercially available Li-ion and
ity of a battery is multiplied n-fold, n being the other battery charge/management integrated
number of charge-discharge cycles available to circuits are readily available to the designer,
the user or system over the life of the products. In there is additional cost and component count
that sense, the energy density of rechargeable associated with implementing the battery man-
cells, when measured as lifecycle capacity agement circuit and not all IoT device
(discussed later), is usually much higher than applications will have a charging source that
that of primary cells. Again, not all systems can meets the requirements of constant current/con-
take advantage of the rechargeable aspect of a stant voltage (CC/CV) battery charging.
battery. And this is true for many IoT devices. Other secondary cells, such as ML/MT/NBL/
Batteries nearing the physically smaller end of VL coin cells, have less stringent charging
the spectrum tend to have lower energy density, requirements than do larger Li-ion cells. For
as the packaging an inactive components of the example, 18650 Li-ion cells require CC/CV
cell tend to occupy a larger percentage of the charge profiles, while ML coin cells, for exam-
total battery volume. In some cases, the size of ple, accept constant voltage, requiring only a
the IoT device might be constrained in the foot- current limiting resistor and/or protection diode.
print, but not so much in height. In such cases, Moreover, a restricted charging temperature
the amount of energy per unit of board space is range of Li-ion cylindrical cells, from 0 to
more important the total volume occupied by the 45  C, can make them impractical in some
battery. applications.
Table 15.3 depicts a cross-section of primary In cases where a battery meets most system
and secondary coin and cylindrical cells and their requirements, save the ability to deliver sufficient
respective energy densities and related metrics. current to send short-duration radio
Battery size and associated energy density transmissions, for example, a low ESR (Equiva-
assume an amount of area occupied by the bat- lent Series Resistance) capacitor can be deployed
tery on a circuit board. In other words, the coin to deliver pulsed currents without forcing the
and cylindrical cells are squared off in the x-y designer to upsize the battery to deliver the
plane, as the region in proximity to the battery is power. When using a supporting capacitor to
typically unavailable for other components deliver pulse currents, it is important to recog-
because of mounting tabs, sockets, or other nize and account for the fact that the capacitor
keep-out provisions for manual assembly. There- will have a leakage parameter associated with it,
fore, the actual energy density of the cells dependent on capacitor material, construction,
represented will be slightly greater than that ambient temperature, and other factors. The cir-
shown. Figures are exclusive of battery sockets cuit designer must therefore consider whether the
and holders needed to secure the battery, and capacitor ought to be electrically isolated from
connectors, harnesses, and cables used to make the battery or other components when not
the electrical connection from battery to circuit. actively delivering current.
The storage capacity and volumetric energy In contrast to energy density, power density is
density for several battery types are depicted in the measure of how much current a battery
Fig. 15.3. Generally, primary (non-rechargeable) sources relative to its volume or mass. As with
cells have considerably higher volumetric energy energy density, power density varies by chemis-
density than the secondary (rechargeable) cells. try and construction. It is also highly dependent
With rechargeable cells—provided some on temperature, increasing (decreasing) on the
number of charge-discharge cycles can be prac- order of 2 per 10  C increase (decrease) in
tically exploited—the additional cycles are then temperature depending on the chemistry.
effective multipliers to the energy density. Some Some cells are designed to deliver relatively
secondary batteries, however, require more low currents (e.g., C/100, C/1000, C/10,000) for
sophisticated charge control circuitry than others. periods of months or years, while other cell types
416

Table 15.3 Parameters of commonly used batteries, including storage capacity, energy density, and related characteristics
Recharge Energy Volumetric Volumetric Effective
Typical Nominal cycles (to rated Effective Physical density per energy charge volumetric
Battery capacity Rechargeable? voltage depth-of- capacity Dimensions volume Footprint unit area density density energy density
example (mAh) (Y/N) (V) discharge) (mAh) (x, y, z; mm)a (mm3) (mm2) (mWh/cm2) (mAh/cm3) (mWh/cm3) (mWh/cm3)b
BR2032 190 N 3 0 190 20  20  3.2 1280 400 142.50 445.3 148.44 445
CR2032 225 N 3 0 225 20  20  3.2 1280 400 168.75 527.3 175.78 527
CR2450 620 N 3 0 225 24  24  5.0 2880 576 322.92 645.8 215.28 234
VL621 1.5 Y 3 1000 150 6.8  6.8  2.15 99.4 46.2 9.73 45.3 15.09 4526
ML614 3.4 Y 3 1000 340 6  6  1.4 50.4 36 28.33 202.4 67.46 20,238
18650 3000 Y 3.6 500 1,200,000 18  65  18 21,060 11,700 92.31 512.8 142.45 205,128
Li-ion
AA 2100 N 3.6 0 2100 14.5  50.5  14.5 10,618 7323 103.24 712.0 197.78 712
thionyl
chloride
AA 3000 N 1.5 0 3000 14.5  50.5  14.5 10,618 7323 61.45 423.8 282.55 424
lithium
AA 2800 N 1.5 0 2800 14.5  50.5  14.5 10,618 7323 57.36 395.6 263.71 396
alkaline
(LR)
a
x-y dimensions represent occupied board space (rectangular rather than diametrical); energy density is calculated using those same dimensions
b
Effective energy density is calculated from actual capcity, physical size, and number of recharge cycles available, at rated depth-of-discharge
J. Sather
15 Battery Technologies for IoT 417

Capacity and Energy Density Common


Fig. 15.3 Illustration of
Coin and Cylindrical Cells
storage capacity and 3500 800
volumetric energy density

Volumetric Energy Density (mAh/cm3)


of commonly used batteries 700
3000

600

Typical Capacity (mAh)


2500
500
2000
400
1500
300
1000 Typical Capacity
200
Energy Density
500 100

0 0
BR CR CR VL M 18 AA AA AA
20 20 24 62 L6 65
1 14 0 Th Li Al
32 32 50 Li io th ka
-io ny iu lin
lC m e
n (L
hl R
or )
id
e

can source peak currents of 2C or more Note that in some cases while the discharge
without suffering significant capacity loss. capacity might remain relatively constant over as
The profiles of Fig. 15.4 represent the discharge much as a 100 range of discharge currents, the
rate performance at room temperature, to the total energy delivered is not the same from low-
rated discharge voltage, for seven battery est discharge current to highest, as the cell volt-
types—four primary cell types and three age is markedly lower at the higher discharge
secondary: current. Depending on the series/parallel cell
combination, if any, this effect could force a
• AA alkaline (LR6) (http://data.energizer.com/ designer to add series- or parallel-connected
PDFs/EN91.pdf) cells to sustain a minimum voltage at worst-
• AA lithium (FR6) (http://data.energizer.com/ case current draw. Moreover, it is obvious that
PDFs/l91.pdf) not all AA cells are the same in this regard.
• AA thionyl chloride (www.tadiranbat.com/ Lithium primary cells have relatively flat dis-
pdf.php?id¼TL-5104) charge profiles over most of the discharge
• 18650 Li-ion rechargeable cylindrical (http:// range, while alkaline cells sag measurably even
industrial.panasonic.com/lecs/www-data/ at relatively modest discharge currents. The
pdf2/ACI4000/ACI4000CE17.pdf) thionyl chloride battery is noted for its exception-
• CR2032 primary coin (http://na.industrial. ally flat discharge profile—albeit at very low
panasonic.com/sites/default/pidsa/files/ C-rates. An adverse characteristic of the thionyl
crseries_datasheets_merged.pdf) chloride cell is the onset of a voltage recovery
• ML1220 rechargeable coin (http://na.indus after long-term storage due to passivation of the
trial.panasonic.com/sites/default/pidsa/files/ lithium electrode.
mlseries_datasheets_merged.pdf) The voltage decay, while often limiting the
• Solid state thin film lithium micro-battery ability to deliver the last portion of capacity at a
(http://www.cymbet.com/pdfs/DS-72-41.pdf) usable voltage, can be used to advantage as a low
voltage threshold detection level in circuits
418 J. Sather

Fig. 15.4 Effects of a


Battery Discharge Capacity as a Function of Discharge Rate
discharge rate on battery
discharge capacity for 110%
several cell types, shown as
a function of (a) C-rate and
(b) discharge current
100%

Normalized Discharge Capacity (% of Rated)


90%

80%

AA Alkaline (LR6)
70%
AA Lithium (FR6)
AA Thionyl Chloride

18650 Li-ion
60% CR2032 Coin

ML1220 Coin
CBC050 Solid State
50%
0.1 1 10 100 1000 10000 100000 1000000
1/C-Rate (Hours Under Load)

b Battery Discharge Capacity as a Function of Discharge Rate


10000
Discharge Capacity (mAh)

1000

100
AA Alkaline (LR6)
AA Lithium (FR6)
AA Thionyl Chloride
18650 Li-ion
CR2032 Coin
ML1220 Coin

10
0.001 0.01 0.1 1 10 100 1000 10000
Discharge Current (mA)
15 Battery Technologies for IoT 419

employing low battery indicators or for internal At a given discharge rate, it is not unusual for
circuit housekeeping of system components that there to be an order of magnitude or more dispar-
might otherwise undergo chaotic operation or ity in discharge capacity over the rated operating
catastrophic shutdowns if the battery were to temperature range. Some types of cells perform
fall below a critical voltage level without warn- better than others in retaining capacity over the
ing. Fuel gauges, while readily available for specified temperature range.
many types of batteries, can present a load to Note in some cases the competing effects of
the battery that, in IoT applications operating higher self-discharge and lower impedance at
from small batteries, is non-negligible. Because higher temperature, actually resulting in a dimin-
battery terminal voltages do not drop rapidly ished discharge capacity—relative to room tem-
under light loads, in devices where it is important perature capacity—at the specified upper
to monitor battery terminal voltage periodically temperature limit over a wide range of discharge
to anticipate power loss, it is practical to use low rates.
power sampling circuits that measure the termi- Manufacturers of certain types of batteries
nal voltage once every several seconds or (e.g., alkaline cells) will often provide test results
minutes depending on operating conditions. for their cells as tested in accordance with
Power density is not only a function of battery standardized IEC/ANSI test regimens. Such
chemistry and cell type, but also of operating tests mimic applications in remote controls,
temperature for a given battery. The discharge radios, lighting, and others. Battery loads in
curves of Fig. 15.5 demonstrate the dependence these tests drain the battery over the course of
of temperature on battery performance for a vari- several hours to several days. Although the stan-
ety of cells: dard suite of IEC/ANSI battery tests are
performed under significantly higher loads than
• AA lithium (L91) those encountered in most long operating life IoT
• AA alkaline (LR6) from two manufacturers devices, comparing standardized test results
• CR2032 primary coin cell among different battery types, and from different
• AA thionyl chloride (ER14505) manufacturers, can be helpful in selecting the
• 18650 Li-ion best candidate from a particular battery category.

Temperature-Dependent Discharge Capacity at Various Rates


Fig. 15.5 Representative 110%
battery discharge
characteristics as a function
Normalized Discharge Capacity (% of Rated)

100%
of operating temperature
and discharge rate for 90%
various cell types
80%

70%

60%

50%
AA Alkaline (LR6) at C/10 (to 0.8V)

40% AA Lithium (FR6) at C/10 (to 0.8V)


AA Lithium (FR6) at C/3 (to 0.8V)
30% CR2032 at C/100 (to 2V)
CR2032 at C/1000 (to 2V)
20%
CR2032 at C/2000 (to 2V)

10%
-40 -20 0 20 40 60 80
Operating Temperature (°C)
420 J. Sather

Battery Voltage During Discharge - Various Chemistries


Fig. 15.6 Battery voltage 4.5
as a function of delivered
capacity
4

3.5

Battery Voltage (V)


2.5

1.5

1 18650 Li-ion at C/5


AA Thionyl Chloride at C/4000
AA Lithium (FR6) at C/300
0.5 CR2032 at C/1000
Solide State Lithium Micro-Battery
0
0% 20% 40% 60% 80% 100%
Discharge Capacity

15.3.3 Voltage Profiles and Load Cell Resistance as a Function of State of Charge
5
Requirements
Relative Cell Resistance

4
Every battery type has a well characterized volt-
3
age discharge profile under the rated conditions
of temperature and discharge rate. Typical volt- 2
age profiles for common battery types discharged
at room temperature are shown in Fig. 15.6. 1
As described earlier, battery discharge
profiles vary by chemistry, state-of-charge, 0
100 80 60 40 20 0
operating temperature, and discharge rate, in Battery State of Charge (%)
addition to other factors. Active circuit elements
require a minimum operating voltage to perform Fig. 15.7 Cell resistance as a function of state-of-charge
their one or several functions. Some functions
can be performed at a lower voltage than others battery life and circuit performance in the case of
within the same integrated circuit, but all such momentary voltage drops if load current exceed
elements still require a minimum voltage and the power rating of the battery.
therefore the power source must deliver adequate From Fig. 15.7, a generic representation of cell
operating current at that voltage in order for that resistance vs. state of charge, note how rapidly the
circuit to operate according to specification. The cell resistance climbs as the battery reaches its
battery capacity available to the system is final ~10% of stored charge (http://data.ener
that capacity delivered at a voltage above the gizer.com/PDFs/l91.pdf). While the cell might
minimum operating voltage of the several be sustaining an adequate voltage level all the
components comprising the IoT device. way to the end of life, should a high current load
Pulse currents drawn from batteries in IoT be presented during this high resistance phase of
devices can have unanticipated consequences to discharge, there could be a loss of power and
15 Battery Technologies for IoT 421

circuit operation before any low voltage detection While cold temperature reduces power deliv-
circuit can react and issue the necessary alerts to ery, storage and operation at elevated tempera-
the system controller, or the controller circuit has ture ages a battery as significantly as any other
time to perform any essential housekeeping oper- factor. Periodic excursions to high temperature,
ation prior to power loss. if kept to limited few in number and duration
Batteries also have a characteristic pulse over the product life, can be accounted for in
response as a function of the electrochemical estimating battery life. Extended periods at
processes and parasitic elements within the cell. temperatures nearing the upper specified temper-
Therefore, it is sometimes necessary to support ature range of a battery can lead to premature
the battery with a low ESR capacitor to avoid failure. It is therefore important to understand
effects of such behavior on circuit operation. the likely operating temperature of the IoT
Proper circuit design, with a view to worst-case device and, where possible, thermally shield or
circuit loads, will prevent such disruptions from isolate the battery from heat-generating sources
ever occurring by properly sizing the battery— in the device. Using a rule of thumb of a
accounting for aging, state-of-charge, and other 2 reduction in life per 10  C increase in tem-
factors affecting battery impedance. perature, the designer might have the luxury of
putting a larger capacity battery into the system
to compensate for aging effects of elevated tem-
15.3.4 Operating Life, Wear-Out perature on the battery.
Mechanisms, and Factors The rate at which a battery is discharged also
Affecting Battery Aging has affects the battery lifetime. Whether
discharged under constant resistance, constant
Battery performance includes more than simply power, or constant current, the performance will
delivering the necessary operating currents at a vary accordingly. When testing a battery for a
minimum voltage, having a meaningful number particular application, it is therefore advisable to
of recharge cycles (where applicable), and test it under conditions—discharge rate, temper-
operating in a temperature range spanning the ature, etc.—as similar as possible to the load
expected conditions of the IoT device. Battery expected during operation in the specific device
longevity under these conditions is essential to and application.
not only the performance of the IoT device, but Batteries are generally connected electrically
also its practicality in deployment. An otherwise and physical by one of four methods:
perfectly useful battery will have limited appli-
cability if its service life falls short of the • Socket or holder soldered to the circuit board,
expected operational life of the device it powers. with metal tabs making electrical connection
This is especially true where replacement of the to the battery terminals.
battery is difficult, expensive, or simply not pos- • Wire harness with a snap-on connection to the
sible. When selecting the proper battery for an battery terminals, as commonly used with 9 V
IoT device, the system designer should consider batteries, for example.
how the typical and extreme operating conditions • Spring-load receptacle to house cylindrical
are likely to affect battery life. Important batteries such as AA and AAA.
parameters determining battery longevity and • Solder connection of the battery directly to
long-term performance are: the circuit board by way of through-hole
tabs, or surface mount soldering of welded
• Temperature tabs on batter, to solder pads on the circuit
• State-of-charge, overcharge, and over- board.
discharge
• Magnitude of load currents Each of these connection and mounting
• Reliability of connectors, sockets, methods has associated reliability, cost, and physi-
compartments, and holders cal size and weight advantages and disadvantages.
422 J. Sather

Corrosion and breakage of the socket, holder, and satisfactory battery from a performance stand-
mounting tabs can occur in applications where point might be excluded from contention because
mechanical shock and vibration occur, or in high of cost. The cost of the battery is not always so
humidity or other corrosive atmospheres. When simple to determine, as the purchase price of the
selecting a proper mounting device or electrical battery itself is usually not the only cost that
connector, these environmental factors should be comes into play.
considered. Some connectors are specially Table 15.4 lists typical battery pricing as pur-
designed for ruggedized applications. chased through a component distributor or other
Self-discharge of a cell is a function of mate- volume outlet, made by reputable manufacturers,
rial, construction, and varies by manufacturer for a in modest volumes of 1000–5000 pieces (http://
given cell type. Self-discharge rates vary from as www.digikey.com/product-search/en?keywords¼
little as 0.5% per year for solid state batteries, to battery). Naturally, prices vary within any given
several percent per month for other chemistries. battery subcategory, depending on manufacturer,
Primary batteries typically have lower self- cells adorned with or without mounting tabs,
discharge rates than secondary cells. The self- accompanying charging circuit (where applica-
discharge of a battery becomes more important ble), temperature ratings, and other parameters.
in long life applications, as the self-discharge con- Figure 15.8 depicts the data from the prior
tribution to total capacity loss competes increas- table, excluding the two most expensive (on a
ingly with the load placed on the battery. In the price/mAh measure) in order to expand the right-
case of secondary cells, self-discharge can hand vertical axis to show differences among
become a significant contributor to capacity loss pricing of the other cells.
if long periods of time elapse between recharge Note the wide range of absolute prices, as well
events. Self-discharge increases with increasing as the wide range of prices per unit of energy.
temperature, although the battery will have higher While selecting the lowest price battery type that
current and capacity delivering capability at a meets the minimum energy, voltage, and power
higher temperature due to decreased internal requirement of the system might at first be
impedance as the temperature increases. appealing, other factors should be considered.
For example, note the large disparity in price
between the BR2032 and CR2032 cells. Both
15.3.5 Cost of Batteries, Holders, have a nominal voltage of 3 V and storage capac-
Connectors, Integration, ity of ~200 mAh. But the chemistries differ and
and Replacement upon closer view of the respective data sheets, it
is seen that the upper temperature limit of the BR
As part of the battery selection process, cost must series is higher than that of the CR series cells.
be weighed along with the technical This might be important in light of the applica-
specifications. In some cases, an otherwise tion environment, for example if the IoT device

Table 15.4 Battery pricing in quantities of thousands of pieces, as might be purchased through component distributors
Battery example Typical capacity (mAh) Rechargeable? (Y/N) Price ($US) Price/mAh ($US/mAh)
BR2032 190 N $0.60 $0.0032
CR2032 225 N $0.15 $0.0007
CR2450 620 N $0.60 $0.0010
VL621 1.5 Y $2.35 $1.5667
ML614 3.4 Y $1.10 $0.3235
18650 Li-ion 3000 Y $2.00 $0.0007
AA thionyl chloride 2100 N $3.50 $0.0017
AA lithium 3000 N $1.80 $0.0006
AA alkaline (LR) 2800 N $0.20 $0.0001
15 Battery Technologies for IoT 423

Battery Price Comparison


Fig. 15.8 Absolute price

Battery Price per mAh Capacity ($US/mAh)


$4.00 $0.0035
and price per unit of energy
of commonly used batteries $3.50 $0.0030

Actual Price, in Volume ($US)


$3.00
$0.0025

$2.50 Battery Price


Battery Price / Energy $0.0020
$2.00
$0.0015
$1.50

$0.0010
$1.00

$0.50 $0.0005

$- $-
BR CR CR 18 AA AA AA
20 20 24 65 Th Li Al
32 32 50 0 io th ka
Li ny iu lin
-io m e
n lC
hl (L
or R)
id
e

Fig. 15.9 Cycle life graph Cycle Life Characteristics


of 18,650 cylindrical cell
Charge: CC-CV 0.7C (max) 4.20V,55mA cut-off at 25˚C
3500 Discharge: CC 1C,2.50V cut-off at 25˚C

3000
CAPACITY (mAh)

2500

2000

1500

1000

500

0
0 100 200 300 400 500 600
CYCLE COUNT Source: Panasonic

operates in an industrial setting where the device documents to find critical specifications of a bat-
will be exposed to elevated temperature on a tery from a specific manufacturer. This can be
periodic or continuous basis. Another example especially true when seeking longevity data,
is that of the AA lithium cells vs. the AA alkaline aging characteristics, and related data.
cell. The AA lithium cell is priced considerably It is also interesting to see that the recharge-
higher, yet note that not only is the operating able 18650 Li-ion cell is cost competitive with
temperature range of that cell wider than that of the other cells when measuring price per unit of
the alkaline cell, but referring back to Fig. 15.4, energy. And that is before factoring in the addi-
the high current discharge performance of the tional capacity that can be utilized with each of
lithium cell greatly exceeds that of the the several hundred recharge cycles available
alkaline cell. from the 18650 cell; that is, of course, if the
Again, those parametric differences might or IoT design and application can in fact take
might not be important. With some battery prod- advantage of the rechargeable feature of the
uct data sheets, not all potentially critical 18650 cell as depicted in Fig. 15.9.
parameters are routinely specified. Designers When weighing the cost of a battery, it is
might be forced to turn to quality and reliability necessary to factor in the cost of the compart-
reports, user manuals, and other technical ment, holder, connector, or other component
424 J. Sather

necessary for securing the battery and making be considered when selecting the battery. Note
electrical connection to the battery. In addition, that in some cases, the cost of the holder—
assembly costs should be considered, as they assembly costs excluded—can be as much or
vary as a function of the type of battery and more as the cost of the battery itself.
connection scheme used.
If the battery will likely have to be replaced
during product life, the replacement cost will
depend on many factors, including accessibility 15.4 Battery and Power
to the device; labor cost in the case of industrial Management Topologies
devices; possible equipment down time resulting
from the IoT device being out of service, etc. IoT devices have many components, and while
Where batteries are soldered onto or into the many integrated circuits have common operating
circuit board, additional effort can be expended voltage ranges based on the technology node in
replacing such batteries. On the other hand, when which they are designed, there are often different
batteries which can be soldered to the circuit minimum/maximum voltage specifications among
board using automated assembly equipment are the disparate components of the system. Conse-
used, there can be an up-front cost savings in quently, there is often a need for having multiple
manufacturing. regulated supplies—1.8 V, 3.3 V, 5 V—on board,
In devices where rechargeable batteries are all of which might be sourced from a single bat-
used, cost of battery management circuits and tery or battery pack. Whenever DC-DC
attendant board space must be considered. converters or other power converters are used in
Battery holders, connectors, compartments, a system, their power conversion efficiencies
and the need for waterproof enclosures in some under the range of load conditions must be
devices, will all add component and assembly accounted for when estimating the battery storage
costs to the manufacture of the device. capacity needed for the product. This is true
Figure 15.10 illustrates nominal pricing of an whether the system is operating from a recharge-
assortment of battery holders, sockets, and able or primary battery, and whether or not
connectors. Again, the cost of soldering wires, energy harvesting circuits (discussed elsewhere
making through-hole solder connections, and in this chapter and more thoroughly in Chap. 11)
inserting cells into holders or sockets must also are used.

AAA holder
$0.77 18650 holder
$1.03 9V battery connector
and wires
$0.24

ML621 coin cell ML1220 coin cell


20mm coin 20mm coin $1.26
cell holder cell holder $1.08
(Source: Panasonic) (Source: Panasonic)
$0.22 $0.41

Fig. 15.10 Assortment of battery holders, sockets, and cells with solderable tabs, with approximate prices for each
15 Battery Technologies for IoT 425

15.4.1 Battery Life Extension Using life. Avoiding complete stack initializations and
Efficient Power Management endlessly running loops in MCUs can also add
Architectures appreciably to battery life.
Another approach to extending battery life is
Not only are the batteries themselves important to use power manager integrated circuits that put
considerations when identifying a power source components in deep sleep or hibernation modes
for a system, but the power management archi- whenever the functions those components per-
tecture is equally critical to system performance. form are not needed. Optionally, the power man-
While many ICs operate over a wide range of ager, by way of a high side or low side switch,
voltages, it is often the case the battery can’t can completely disconnect those components
directly supply power to all of the components from the supply or ground rail for a significant
in a system. Step-up or step-down converters are portion of time and reconnect them periodically
often needed, with attendant losses of energy in or upon occurrence of an interrupt or other event.
the conversion process. A schematic example of this concept is shown in
Power management techniques including Fig. 15.11 (http://www.cymbet.com/pdfs/AN-
dynamic voltage scaling; extensive use of the 1059.pdf).
various power savings modes resident within Typical microcontroller-based devices where
microcontrollers and microprocessors—such as energy conservation is an issue spend a large part
hibernation/sleep modes and clock speed reduc- of their time in a low-power “sleep” state. This is
tion—can have substantial impacts on battery a state where the microcontroller is not running

Fig. 15.11 Isolation and


control of active system VCC VCC VCC
CBC34803
components from the 12C MicroController
+
power supply for the Sensor
/Switch
purpose of reducing power
consumption and extending VSS
nlRQ2
VSS VSS
battery life using (top)
switched VSS, (middle)
switched VCC/VDD, and
(bottom) interrupt
configurations. (Courtesy
of Cymbet Corporation)

VCC VCC VCC


CBC34803
+ 12C MicroController
Sensor
/Switch

nlRQ2
VSS VSS VSS

CBC34803 VCC VCC VCC


+ 12C MicroController
Sensor
INT /Switch
nlRQ2
VSS VSS VSS
426 J. Sather

but is waiting for an interrupt from either a sensor periodic wakeups to the microcontroller and
or a timer. When one of these interrupts is issued associated circuitry. This way the entire system
the microcontroller goes to a higher power active is totally asleep for the majority of time and the
mode to process the event and then goes back to microcontroller is only awakened for short
sleep. The active mode operation may include periods to determine if it needs to service a sen-
processing the sensor data and then may operate sor or switch.
an actuator or send a message. Many times a Power savings are determined by accounting
message may be sent via a low-power radio pro- for and managing several variables, including
tocol which requires significant processing active run times; sleep periods; active time/
cycles to correctly operate the protocol stack. sleep time; and number of instructions executed
The amount of processing depends greatly on by the microcontroller. In some applications,
the complexity of the protocol. these techniques can extend battery life by
The average power consumption of the sys- 2 or much more, as given by the power savings
tem is the sleep power times the percentage of ratios in Table 15.5 (http://www.cymbet.com/
time the system is asleep plus the active power pdfs/AN-1059.pdf). The result can be the use of
times the percentage of time the system is active a much smaller, and/or less expensive battery
divided by 100. than would otherwise be required to meet prod-
uct performance and lifetime objectives.
Pavg: ¼ Psleep  %time asleep þ Pactive The power savings technique described in the
%time activeÞ=100 previous section works well with systems where
the sleep power is a large or significant contribu-
Minimizing the largest of these terms provides tor to the average power. In systems where a
the greatest power savings. In some cases the radio is used the power for the radio and the
active power term is much larger than the sleep power used by the microcontroller to run the
power term either because the power per event is software stack can be large. Many radio
large or the active power events happen very standards are complicated and use a sophisticated
often. stack to manage the protocol. These stack
When the system that has a large sleep power implementations often have significant runtimes
compared to its active power is asleep there is an to initialize their internal memory structures.
opportunity to reduce power by placing the Often these stacks are supplied by the radio
microcontroller in its lowest power mode while chip vendor and therefore are attractive since
using the standalone timer functions to provide they save software development time. If the

Table 15.5 Power savings possible when utilizing a power manager to isolate a microcontrollers from the power
supply (Courtesy of Cymbet Corporation)
Power savings with periodic interrupt
Active Sleep Active current Active Power
Original sleep μA/ runtime in period in when running time/sleep Number of savings
current in μA MHz MHz ms ms (μA) time instructions ratio
0.6 200 1 0.1 16 200.000 0.0063 100 0.47
0.6 200 1 0.1 33 200.000 0.0030 100 0.97
0.6 200 1 0.05 33 200.000 0.0015 50 1.89
0.6 200 1 0.1 100 200.000 0.0010 100 2.80
0.6 200 1 0.05 100 200.000 0.0005 50 5.26
0.6 200 1 0.1 250 200.000 0.0004 100 6.38
0.6 200 1 0.05 250 200.000 0.0002 50 11.11
0.6 200 1 0.2 250 200.000 0.0008 200 3.45
1.6 200 1 0.3 250 200.000 0.0012 300 6.30
15 Battery Technologies for IoT 427

stack needs a long time to initialize, then the product, beyond which the preferred battery can
previous power savings technique only makes store, retain for that product life, and deliver as
sense if the radio operates very infrequently. needed? Once answered, the next two-part ques-
A typical radio application might require, say, tion is: Is that amount of energy harvestable
25 mA for 25 ms ¼ 625 μA-s for each transmis- under expected operating conditions and, if so,
sion. A simple stack may take essentially no time at what cost? The return on an energy harvesting
to initialize but a sophisticated one may take investment is not always clearly measurable
upwards of 400 ms of processing time at 5 mA, when designing a new IoT product. However,
or 2000 μA-s to initialize. This highlights the we can at least lay out the basic concomitant
importance of either a simple stack initialization elements of energy harvesting architectures,
or a microcontroller that has a low power mode their practical limits from a technical perspec-
to retain the initialization in RAM at low power. tive, and rough costs of implementing such
Chapter 10 provides a more thorough analysis circuits.
of this topic. Energy harvesting circuits can take many
forms, from simple, widely deployed solar-
powered devices charging a capacitor, to more
15.4.2 Using Primary and Secondary elaborate power collection and energy manage-
Batteries with Energy ment schemes utilizing peak power tracking
Harvesting (EH) Sources circuits and algorithms relying on incoming
power from transducers having characteristic
In cases where an IoT device demands more from impedance and power profiles according to their
a battery than what is commercially available— design, materials, and construction.
or available at an acceptable cost—energy Beginning with a system-level view, a block
harvesting can be brought to bear as a supple- diagram of a conceptual IoT node is illustrated in
mentary source of power. There are, of course, Fig. 15.12, including the basic elements of the
attendant design and component costs when energy harvesting module.
implementing energy harvesting circuits. The While core IoT functions vary by application,
designer must consider whether the costs and it is not uncommon to find basic sensor, MCU,
complexities associated with energy harvesting and often, radio elements, as shown. Likewise, a
circuits can be justified over the life of the IoT device equipped with energy harvesting circuitry
device. And the advantage of using energy will normally include—in addition to the power
harvesting might be apparent to one party in the transducer itself—a power conversion circuit and
supply chain that extends from IoT device energy storage device, either a battery or ade-
designer, device manufacturer, to intermediate quately sized capacitor. When sizing the battery
distribution or other sales outlet, to end user— in an energy harvesting design, it is important to
and/or redound to their benefit—, but not to keep in mind that having a bulky, medium to
others involved in the transaction. Perhaps the high capacity energy storage cell is counterpro-
end user would bear the expense of replacing a ductive if the transducer can never replenish it in
dead battery, but isn’t willing to pay the addi- practical operating environments. It is analogous
tional cost of having energy harvesting circuits to to trying to fill a swimming pool one glass of
minimize or eliminate the need for battery water at a time. Similarly, if the power consump-
replacement. Or, to take another example, per- tion of the device exceeds the average power
haps the IoT device is an intrusion sensor that is production of the harvesting source and
offered by a company with a service contract, associated conversion circuitry, eventually the
who in turn can make or lose money with each battery will become depleted and will be of little
battery replacement service call. to no use in the long run.
A simple first question to ask is: How much IoT devices generally are comprised
extra energy is needed throughout the life of the of several of the elements, or functions
428 J. Sather

Fig. 15.12 Block diagram


of sensor node
incorporating an energy
harvesting module

Table 15.6 Devices typically suitable to be powered by energy harvesting sources


<100 nW <1 μW <10 μW <100 μW <1 mW <10 mW <100 mW <1 W <10 W
Real-time MCU Watch/ RFID tag Sensors/ Wireless Bluetooth GPS Mobile
clock standby calculator remotes sensors transceiver phone
← Practical, indoor energy harvesting using low cost transducers !

(e.g., microcontroller/microprocessor, sensor, achieve intended results. A glimpse of the most


radio), listed in Table 15.6, not all of which are conventional transducers, their limitations,
operating at any given time, perhaps, but in some challenges, and typical performance measures
combination present a load to the battery and are listed in the following table. Such transducers
incoming power from the harvester. It is impor- have found their way into remote applications as
tant to know not only the average power con- a means of reducing or eliminating the need for
sumption in various IoT operating modes, but batter replacement over the life of the product. In
peak loads also, as the combination of the some manifestations, the energy harvesting cir-
harvested power and battery must together sup- cuit serves as way to extend the life of a primary
port these loads. Depending on the battery battery by delivering power directly to the circuit
choice, input power from the harvesting trans- when sufficient incoming power from light, tem-
ducer under worst-case ambient conditions (e.g., perature differential, vibration, or other sources
times when minimum or no light is imparted to a is available from the environment. In these
solar cell), and other factors, the battery itself implementations, the energy harvesting circuit
might be relied upon to support the peak loads. is connected as a logical ‘OR’ with the primary
If the battery itself does not have sufficient power battery. In others, the energy harvesting circuit is
delivery under these worst-case conditions, a low used to regularly replenish the charge in a sec-
ESR capacitor might be necessary as part of the ondary battery when needed, and powers the
power delivery chain. circuit directly at times, when surplus power is
By definition, energy harvesting requires a available from the transducer and intermediate
transducer to convert energy from one form to conversion circuitry—whether through a boost,
another, so it can be stored or delivered directly buck, linear regulator, or other power conversion
to the system load. Transducers come in a range circuit.
of technologies, shapes, sizes, electrical With some exceptions, where the IoT device
characteristics, and suitability for any number is also a wearable, energy harvesting is often not
of operating environments. However, not all practical because of the difficulty in collecting
transducers are practical in terms of cost, power ambient energy from body-worn transducers.
generation, or in the physical size needed to One such exception is the use of a piezoelectric
15 Battery Technologies for IoT 429

or inductive element buried in a shoe that their characteristics, are listed in Table 15.7.
generates energy with each impact. And of Transducers are available from a variety of
course, for many years low power watches have manufacturers.
incorporated solar cells in the watch face. Conceptually, the approach to a power effi-
Aside from technical limitations of generating cient energy harvesting circuit—at least the
and converting sufficient power on a predictable, front-end portion of it—is to match the imped-
regular basis, there are issues with commercial ance of the transducer to the load in order to
availability of many transducer types. Solar—or maximize power transfer. While easier idealized
photovoltaic cells—are the most readily avail- than implemented in some cases, the basic circuit
able in a range of configurations, conversion structure is shown in Fig. 15.13.
efficiencies, and prices. But of course ambient The transducer characteristics shown in the
light is not always practical in the application preceding table must therefore be understood by
setting. Battery charge time is important, espe- the designer so the energy harvesting circuit
cially if the rechargeable cell is completely elements can efficiently collect their output
depleted during the period while input power is power. The circuit must also be adaptable as
not available to/from the transducer. If the cell environmental conditions change and the trans-
takes many hours to be recharged—either ducer output in turn varies in its response. A
because of its inherent characteristics, or because critical piece of energy harvesting design that is
only small amounts of energy are available to often overlooked is the start-up phase of opera-
trickle charge the cell—then not much stored tion. Because the transducers are usually high
energy will be available in the event it is impedance devices (compared with batteries),
discharged and is relied on again a short time they cannot source much current. When sourcing
after it has been depleted. Rechargeable coin low current loads and trickle charging batteries,
cells can take hours for a full recharge, while this is not necessarily a problem. However, when
some solid state batteries can be recharged to starting from 0 V or being sourced by a slowly
80% capacity in a few minutes—provided suffi- rising voltage delivered by some transducers,
cient input power is available to charge the cell. active components like microcontrollers and
The typical power transducers available to the radios, if supplied by a current-starved source,
energy harvesting circuit designer, along with might not make it through the power-up state and

Table 15.7 Conventional energy harvesting transducers, their characteristics, and challenges presented to designers
Energy Typical Normalized
source Challenge impedance Typical voltage Typical power output cost
Light Conform to small Varies with light DC: 0.5–5 V 10 μW–15 mW $0.50–
surface area; wide input—low kΩ [depends on (outdoor: $10.00
input voltage range to 10s of kΩ number of cells in 0.15–15 mW) (indoor:
array] <500 μW)
Vibrational Variability of Constant AC: 10s of volts 1 μW–20 mW $2.50–
vibrational impedance—10s $50.00
frequency of kΩ to 100 kΩ
Thermal Small thermal Constant DC: 10s of mV to 0.5–10 mW (20  C $1.00–
gradients: efficient impedance— 10 V gradient) $30.00
heat sinking 1 Ω to 100s of
Ωs
RF and Coupling and Constant AC: Varies with Wide range $0.50–
inductive rectification impedance—low distance and $25.00
kΩs power 0.5–5 V
430 J. Sather

Fig. 15.13 Circuit model


representing power
conversion using matched
impedances

into normal operation. This is because there can sleep state long enough for the energy harvester
be relatively high currents being drawn internally to replenish the energy storage device. If the
to these components as the various transistors power budget is not exceeded during this phase,
and circuits are transitioning into their respective the system can continue with its initialization.
normal operating regimes. If the EH design does Next, the main initialization of the system,
not account for these transition currents by way radio links, analog circuits, and so forth, can
of sourcing current from a pre-charged capaci- begin. Care should be taken to ensure that the
tor—or using the on-board battery if available— power consumed during the time the system is on
the system can stall and never get successfully during this phase does not exceed the power
through the power-up phase. Use of a power budget. Several sleep cycles might be needed to
supply supervisor integrated circuit can be used ‘stair step’ the system up to its main operational
to switch on the input and output stages quickly state.
once the pre-charged capacitor has been charged In most system power budgets, the peak power
to a sufficient voltage to operate the MCU and drawn is not as critical as the length of time the
other active components. Capacitance can be power is required—in other words, the energy
sized according to the amount of energy needed budget. Careful selection of the message protocol
to power up the MCU, radio, or other component, for the RF link can have a significant impact on
or according to the amount of energy needed to the overall power budget. In many cases, using
send a radio transmission, for example, in the higher power analog circuits that settle quickly
case of pre-charging a capacitor at the output after being turned on, and which are turned off
stage. immediately thereafter, can decrease the overall
While it is relatively straightforward to calcu- energy consumed. Microcontroller clock fre-
late a power budget and design a system to work quency can also have a significant impact on the
within the constraints of the power and energy power budget. In some applications it might be
available, it is easy to overlook the power advantageous to use a higher microcontroller
required to initialize the system to a known clock frequency to reduce the time the microcon-
state and to complete the radio link with the troller and peripheral circuits are active. Avoid
host system or peer nodes in a mesh network. using circuits that bias microcontroller digital
The initialization phase can sometimes take 2–3 inputs to mid-level voltages; this can cause signif-
times the power needed for steady state opera- icant amounts of parasitic currents to flow. Use
tion. Ideally, the hardware should be in a low large value (10–22 MΩ) pull-up/down resistors
power state when the system power-on reset is where possible. However, be aware that high cir-
in its active state. If this is not possible, the cuit impedances coupled with parasitic capaci-
microcontroller should place the hardware in a tance can make for a slow rise/fall time that can
low power state as soon as possible. After this is place the voltage on the microcontroller inputs at
done, the microcontroller should be put into a mid-levels, resulting in parasitic current flow. One
15 Battery Technologies for IoT 431

solution to the problem is to enable the internal cycles in the case of secondary batteries.
pull-up/down resistor of the microcontroller input As an example scenario, suppose a photovol-
to force the input to a known state, then disable the taic (i.e., solar) cell is the transducer, to which an
resistor when it is time to check the state of the off-the-shelf converter circuit is added, and the
line. If using the microcontroller’s internal pull- harvester is used as a battery life extender by way
up/down resistors on the inputs to bias push- of powering the load when delivering sufficient
button switches in a polled system, leave power, and relying on the battery when ample
the pull-up/down resistors disabled and enable harvested power is not available. The battery is
the resistors only while checking the state of not rechargeable in this case. Further, suppose
the input port. Alternatively, in an interrupt-driven the solar cell is a low cost amorphous silicon
system, disable the pull-up/down resistor within type, capable of generating 10 μW/cm2 of cell
the first few instructions in the interrupt service area under modest indoor lighting conditions of
routine. Enable the pull-up/down resistor only 400 Lux. A cell 30 cm2 in size will then produce
after checking that the switch has been opened. 300 μW of power at, say, 3 V. This leaves 80 μA
Microcontroller pull-up/down resistors are typi- of current available to power the load, assuming
cally less than 100 kΩ and will present a mean- 80% conversion efficiency through the DC-DC
ingful load on the system if left on continuously conversion. Let’s further assume that 80 μA is
while a switch is activated for any significant sufficient average current to keep the system
length of time. When using external pull-up/ running continuously.
down resistors, bias the resistors not with the Over a 10-year life, producing power 12 h per
power rail, but with a microcontroller port instead. day, the total energy (or charge) generated is
As there are always cost tradeoffs or addi- 3.5 Ah. Now, what was the cost to generate that
tional expenses in design complexity, versus energy? The solar cell is on the order of $2.50 in
simply using a larger battery or replacing a bat- modest volumes. The DC-DC converter and ancil-
tery in coming years, it is important to assess lary components is perhaps $1.50. Assume for the
these costs before accepting that energy sake of simplicity that the incremental design
harvesting is a necessary or desirable feature of effort is negligible because the cost is spread over
an IoT design. The return on investment depends a large number of units sold. Neglect also the cost
in part on the total energy budget for the system of additional board space for the harvesting
and whether a larger battery delivering to this components, as well as the solar cell housing.
energy budget can physically fit within the Total cost of the harvesting components is then
confines of the IoT device. The various costs $4.00. That cost over 3.5 Ah is $1.14/Ah, or
are presented below, using cost per unit of energy $0.0011/mAh. This is on par with the cost of
for an energy harvesting design in comparison to alkaline and lithium primary cells, although in
that of the several battery options available and this simplistic analysis the cost of battery
as discussed earlier. Considering the energy har- connectors, holders, compartments, and any other
vester as a variable capacity battery, the cost per overhead associated with the battery is neglected,
unit of energy generated can be calculated and because in this scenario a battery is needed in
compared with that of a battery as discussed addition to the EH circuit—it is just a matter of
above. In the case of rechargeable batteries, the which size/capacity battery is going to be needed.
cost of the charging circuitry must be included. In a case where a rechargeable battery is used
The cost of a battery is in the range of $0.001– to receive harvested energy and deliver it to the
$0.005 per mAh of capacity, including the cumu- load when energy is unavailable for harvesting,
lative capacity of multiple charge-discharge the battery lifecycle capacity can be calculated to
432 J. Sather

determine whether its aggregate storage capacity such, a candidate rechargeable battery rated for
over the rated number of charge-discharge 10- to 20-year life, recharged daily, is generally
cycles, at the rated depth-of-discharge, is ade- more expensive than is a consumer grade battery
quate for the task. The lifecycle capacity is cal- of similar capacity. Small solid state batteries,
culated as follows: with recharge cycle life ratings to several thou-
sand cycles, are now commercially available and
Lifecycle capacity ¼ Rated battery capacity  suitable as supplemental power sources in IoT
Rated charge-discharge cycle life  Depth- designs where energy harvesting is implemented.
of-discharge corresponding to rated cycle life Two energy harvesting scenarios will now be
considered, the objective being to calculate the
Example:
battery lifetime. Several commercially available
ML1220 rechargeable coin cell rechargeable batteries are compared, with a
Rated capacity ¼ 17 mAh range of IoT device power consumption. Rated
Charge-discharge cycles ¼ 1000 cycles capacity and cycle life at specified depth-of-dis-
Depth-of-discharge ¼ 10% charge for each battery type are used to estimate
Lifecycle capacity ¼ 17 mAh  1000 cycles operating life, assuming negligible self-
 10%/cycle discharge and ignoring irrecoverable charge
¼1.7 Ah. loss mechanisms such as mechanical wear-out
of gaskets (where applicable), degradation of
An obvious problem encountered when electrolyte, high temperature aging effects, and
analyzing an energy harvesting system that other losses not directly induced by charge-
includes a rechargeable battery, is the difficulty discharge cycling of the cell. Such losses cannot
in confining battery discharge to the be ignored, especially when the battery is being
recommended depth-of-discharge on each cycle used over the course of many years and in harsh
with adding a monitor circuit that depletes sig- environments. However, the aging mechanisms
nificant battery capacity in the process. As differ widely by not only the type of battery, but
described earlier, it is sometimes practical to by construction methods particular to a given
make infrequent measurements of the battery manufacturer, materials that comprise the bat-
voltage, sacrificing precision and real-time mon- tery, and manufacturing process. Therefore, this
itoring for the sake of energy conservation. And exercise seeks to calculate the ideal lifetime of a
of course, cutting off the battery from the circuit battery under the two energy harvesting
after it is been depleted just 10% from a full scenarios:
charge means that only 1/10th the battery capac-
ity is available to operate the system on any given 1. Recharge the battery during the 12-h period in
charge-discharge cycle. Discharging the battery which the IoT device is being powered by an
to deeper discharge depths will cut into the num- energy harvesting transducer, and deliver
ber of charge-discharge cycles available. There- power to the IoT device directly from the
fore, the designer must make a tradeoff between battery for 12 h per day, while the power
charge available per cycle and the number of transducer is not generating power.
cycles available over the life of the product. 2. Recharge the battery only when the battery is
Flat discharge profiles make this measurement depleted to its rated depth-of-discharge. This
process more difficult to control, as charge assumes a coulomb counter or other fuel
removed does not correlate strongly with battery gauging method is used to measure battery
voltage over narrow segments of the discharge capacity, to signal when the battery has
profile. reached its deepest rated depth-of-discharge
With the exception of some industrial grade and requires a recharge. A 1-h recharge time
cells, most rechargeable batteries are not rated to is assumed in this scenario. Recharging might
be recharged more than about 1000 cycles. As be accomplished through a near-field
15 Battery Technologies for IoT 433

Idealized Operating Life of Rechargeable Batteries


Fig. 15.14 Battery (12 hours charging, 12 hours discharging per day)
operating life in a system 25
operating continuously on a
12-h charge, 12-h discharge

Operating Life (years)


20
cycle Average Power
Consumption
15
1uW
10uW
10
50uW

100uW
5

0
500mAh Solid State ML414 1mAh Coin ML621 6mAh Coin ML1220 17mAh Coin
Battery Cell Cell Cell

communication and inductive power transfer storage devices are accompanied by internal
method when an IoT device is retrieved from leakage currents that have to be accounted for
the field for data transfer to a computer, for in the average power consumption figures.
example; or, by way of a direct connection to Figure 15.14 represents the operating life of
a portable terminal periodically for the pur- various batteries under a range of IoT device load
pose of transferring information from the IoT conditions, from 1 to 100 μW average power
device, at which time the device is recharged consumption, using the first scenario described
and redeployed to the field. above, whereby the battery is charged 12 h each
day, and discharged 12 h each day. Results are
Battery lifetime depends on the number of calculated based on room temperature operation
rated charge-discharge cycles available—which and typical battery specifications.
varies as a function of depth-of-discharge—and It is clear that, absent the permanent energy
the frequency at which the cell is being cycled. In losses arising from aging mechanisms not related
scenario (1), the average power consumption will to battery cycling, an operating life greater than
determine to what depth-of-discharge the battery 10 years is possible with all batteries compared
is drained on each charge-discharge cycle, which in this example.
in turn affects the number of charge-discharge Calculating battery life using the second sce-
cycles available. As rechargeable battery nario, whereby the battery is taken to a rated
specifications do not declare cycle life as a func- depth-of-discharge before being recharged,
tion of depth-of-discharge at more than one or results as shown in Fig. 15.15 are obtained.
two depths, extrapolations and interpolations will Depending on the type of device and method
be made for the sake of this exercise in order to used to recharge the battery, in the case of the
calculate the operating life in scenario (1). Sce- smaller capacity batteries it might be impractical
nario (2) is more straightforward, as the depth-of- to recharge the battery as frequently as required
discharge used in this example will be that as by the higher operating power of 100 μW.
explicitly defined in the respective product Again, this is an idealized situation that does
specifications. Another assumption is that the not account for self-discharge and other aging
cell resistance of each battery considered in mechanisms. The projected operating lifetimes
these examples is low enough to power the sys- are unrealistic in commercial batteries.
tem that an additional energy storage element— Rechargeable batteries, with the exception of
such as a low ESR capacitor—is not needed to solid state batteries, are usually not expected to
supply currents to IoT device elements that retain their energy for more than 10 years.
would otherwise consume higher power than Note how the scenario two results differ vastly
the battery itself can deliver. Supplemental from the first scenario. This is primarily the result
434 J. Sather

Idealized Operating Life of Rechargeable Batteries


Fig. 15.15 Battery (recharge when depleted to rated depth-of-discharge)
operating life in a system in 1000
which the battery is
recharged only upon

Operating Life (years)


depletion to rated depth-of- 100
Average Power
discharge (DoD) Consumption
1uW
10
10uW
50uW
1 100uW

0
500mAh Solid State ML414 1mAh Coin ML621 6mAh Coin ML1220 17mAh Coin
Battery (80% DoD) Cell (20% DoD) Cell (10% DoD) Cell (10% DoD)

of the battery cycle life not being linear with • If a battery fails, what are the consequences of
depth-of-discharge, especially when the battery not having the auxiliary power source the
is under a charging voltage 50% of the time as in energy harvesting supply affords?
the first scenario. Continuously applying a charg- • How reliable and consistent is the source of
ing voltage to the battery has the effect of reduc- ambient power to supply the energy
ing battery cycle life and retained capacity. It is harvesting transducer? How reliable is the
better to remove the charging voltage from the transducer itself?
cell when it is fully charged. Doing so might not • How impractical or expensive is the effort of
be practical in some system configurations. In replacing and disposing of the battery if it dies
other words, keeping a battery at or close to a before end of product life? Or does battery
full state of charge has more deleterious effects failure in fact determine the end of the product
to its service life than maintaining a lower state life?
of charge. • Can a lower cost battery type be used if
By comparing the energy consumed over a energy harvesting is used to supplement the
10-year period, for example, one can ascertain energy needs?
whether a primary battery can be found that has • Could use of energy harvesting in an IoT
ample storage capacity, sufficiently low self- device lead to fewer batteries in the waste
discharge, acceptably small physical size, low stream? If so, can a business case be made to
enough cost, and other attributes which could justify on implementation of EH on this basis?
make it a suitable candidate for the application. • Are there advantages in product marketing
A rule of thumb: every 1 μA of operating current and possibly increased customer adoption
results in 10 mAh of battery capacity consumed through the use and promotion of a “green”
per year. This rule accounts for self-discharge of design—whether actual or perceived?
typical primary cells.
Therefore, other factors have to be considered With the thousands of IoT device formats,
when making the decisions of whether or not to applications, and use cases to draw from, it is
implement energy harvesting, such as: impossible to say for any given device whether it
is a good candidate for energy harvesting without
• Will a larger battery fit within the product size knowing more about the specifics of the applica-
constraints? Is there an incremental cost of tion and system operation. However, there are
having a larger, or other, compartment, general categories of devices that can be classi-
holder, or socket to hold the battery? fied as likely or unlikely candidates for energy
15 Battery Technologies for IoT 435

Table 15.8 Sample IoT devices and their suitability for use with energy harvesting sources
Device/
application Features EH transducer Good fit for EH? Why? Why not?
Home Wireless; bluetooth radio Solar Yes. Low power consumption and regular
thermostat availability of ambient light
Indoor air CO2, CO, humidity, Solar Perhaps. Some sensors draw more power
quality temperature sensors than can be replenished with harvested power
monitor
Fitness Accelerometer; memory; Thermoelectric or No. High current draw from “always on”
band/ processor solar operation; inconsistent ambient power;
monitor on-body power transducers usually
impractical
Smartwatch CPU; real-time operating Solar No. Power consumption of the many
system; radios; MEMS, functions overwhelm power production of
other sensors; display small EH transducers
Indoor Wireless switch Solar; electromagnetic In some cases. Push-button switch or ambient
lighting light power the transmission
controller
Medical Temperature, pulse rate Solar; electromagnetic Solar cell or near-field inductive (using NFC
patch monitors; NFC power) to recharge battery during use, device
communication interrogation, or as needed
Water meter Flow meter; wireless Thermoelectric Yes. Harvest energy from temperature
transmitter differential of water pipe to ambient air
Outdoor Humidity, temperature Solar, electromagnetic Yes, provided energy consumption is not too
sensor sensors; wireless transmitter (wind generator) high to be stored in battery for operation in
the dark
Animal Various sensors; GPS Solar In some cases. GPS power consumption
monitoring requires relatively large solar cell
and tracking
Wireless Vehicle detection sensors; Solar Yes. Daylight provides ample energy to
parking radio communication monitor and communicate. Industrial battery
meter needed for communication and operation in
the dark
Smart door Keypad; real-time clock; Solar; possibly Yes. Low power consumption, except
lock low power processor; data piezoelectric solenoid latch mechanism where used;
logger depending on latch ambient light is regularly available
mechanism
Factory Proximity, motion sensors; Solar, piezoelectric Yes, if ambient energy sources are available
automation radio link
Machine Vibration, temperature, Solar, thermoelectric, Practical in some cases, depending on
monitor other sensors; radio link piezoelectric ambient power available for harvesting
Bluetooth Sensor, radio, low power Various Yes. Low power wireless link and MCU
smart MCU modes are very energy efficient
beacons
Security Proximity sensor; wireless Solar Yes. Ambient light is available to recharge
monitor transmitter battery and supply energy during daytime
operation
436 J. Sather

harvesting, at least from a technical standpoint, if to maintain real-time clock or memory in the
not necessarily in cost-effectiveness. Table 15.8 event of main power loss—super-capacitors are
lists several candidate devices/applications and suitable substitutes for batteries. More often,
whether and why they might or might not be capacitors are used to augment a battery where
suitable for harvesting ambient energy. higher current loads demand more current than a
This concludes the brief exploration of the battery can deliver, or more than a battery can
potential for energy harvesting in IoT devices. deliver without falling below a critical voltage or
See Chap. 11 for a more thorough treatment of having its useful life diminished as a result of
this topic. delivering current well in excess of its rated
output. Another limitation of super-capacitors is
the inherently linear discharge profile that results
15.5 Comparisons of Batteries in a steadily decaying voltage as charge is
to Other Energy Storage removed. Consequently, not all charge within
Devices: Performance and Cost the capacitor can be accessed by the load at a
useful voltage.
While batteries are often the power source of The self-discharge of super-capacitors is usu-
choice for wireless and mobile applications, ally orders of magnitude higher than that of
other storage devices are occasionally used as batteries. For that reason—in addition to having
the main energy storage device in the system, much lower energy density than batteries—
or, more commonly, as a supplemental energy super-capacitors are largely excluded as viable
storage device to complement the characteristics main power sources in IoT devices. There are
of the battery—such as delivering higher power hybrid battery/capacitor devices—lithium-ion
form a capacitor in systems where pulsed loads capacitors—that have properties resembling that
are found. In other cases, supplemental power of lithium ion batteries at one electrode and elec-
sources might be used with batteries to the oppo- tric double-layer capacitors at the other elec-
site effect, whereby a fuel cell, for example, trode. These devices achieve higher cell
produces a continuous supply of energy until voltage, higher energy density, and lower self-
the fuel source is depleted, but might not have discharge than pure electric double-layer
sufficient drive current capability to operate the capacitors, while retaining relatively good
circuit directly. The fuel cell might then be used power density. Capacitances range from a few
to charge a secondary battery, which in turn millifarads to hundreds of farads.
delivers higher power to the load.

15.5.2 Fuel Cells


15.5.1 Super-Capacitors
As fuel cell technology has migrated to the realm
Electric double-layer capacitors (EDLCs), com- of portable applications, coupled with develop-
monly known as super-capacitors and ultra- ment of metal/air batteries, the distinction
capacitors, are typically low resistance devices between fuel cells and batteries is sometimes
used to deliver high power pulse currents. Their not so clear. Depending on the amount of fuel
volumetric energy density is much lower than available in the fuel cell, the operating life,
that of most batteries, but their power density is power requirements, and whether the IoT device
usually considerably higher than that of batteries. operates continuously, in special cases a fuel cell
Capacitance of these devices ranges from several can serve as a practical power source. State-of-
millifarads (several microampere-hours equiva- the-art technology in portable fuel cells achieves
lent) to thousands of farads. energy density comparable to common batteries.
In cases where small amounts of energy are However, the cost of fuel cells is generally much
needed—such as in a backup power source used higher than batteries having similar energy
15 Battery Technologies for IoT 437

Table 15.9 Standards and publications related to battery safety and transportation
Standards body Publication Title/purpose
American National ANSI C18.1M Portable primary cells and batteries with aqueous electrolyte—
Standards Institute general and specifications
ANSI C18.2M Portable rechargeable cells and batteries—general and
specifications
ANSIC18.3M Portable lithium primary cells and batteries—general and
specifications
International IEC 60086 Primary batteries—safety standard for lithium batteries
Electrotechnical IEC 61960 Secondary lithium cells and batteries for portable applications
Commission IEC61809 Product safety standard for sealed alkaline secondary cells and
batteries
IEC62133 Secondary cells and batteries containing alkaline or other
non-acid electrolytes—safety requirements for portable sealed
secondary cells, and for batteries made from them, for use in
portable applications
Underwriters Laboratories UL1642 Standard for household and commercial batteries (superseded by
IEC 62133)
UL2054 Standard for lithium batteries (cells)
Occupational Safety and Various OSHA- Safety and health in battery manufacturing
Health Administration approved state plans

storage capacity as a small fuel cell cartridge, for Some battery chemistries and types are
example. Portable fuel cells generally have a inherently more hazardous than others in terms
narrow effective operating temperature range, of being flammable, prone to exploding, toxic to
thus limiting their practical use in many devices. humans, and harmful to the environment in dis-
Restrictions that might apply to the transport of posed of improperly.
fuel cells aboard commercial aircraft can limit Table 15.9 is a partial list of bodies issuing
their viability in mobile devices. safety standards, example publications, and the
Portable—and micro—fuel cells having general purpose of each (Linden and Reddy 2002).
power output less than 5 W have been developed There are numerous independent, accredited
for applications in military systems, personal laboratories providing services for testing to the
electronics, and toys. The cost of fuel cells is various published standards. Because standards
not yet competitive with commodity batteries, are updated regularly, it is important to always
in part due to the expense of the components refer to the latest edition of a particular standard
comprising the fuel cell, but also due in large when testing a device or product for compliance.
part to the fact that fuel cells have generally not In addition to the batteries being inherently more
yet reached commercial production status (Lin- or less safe in their materials, chemical reactions,
den and Reddy 2002). and construction, external charging circuits, protec-
tion devices, and enclosures are important aspects
of product design as it pertains to batteries.
15.5.3 Safety, Transportation, Problems of thermal runaway, short-circuits, care-
and Disposal of Batteries: less handling, external application of electrical
Regulations and Trends supplies to the battery, and protection against heat
sources are all important considerations for
As batteries become increasingly energy dense designers of battery-powered products.
and find greater use in an increasingly mobile For designers developing products to be used
world, standards, directives, and regulations in explosive atmospheres, compliance with IEC
pertaining to their use, transport, and disposal are standard 60079-11—pertaining to “intrinsically
being developed and adopted by commissions. safe” equipment and the components making up
438 J. Sather

that equipment—might be necessary. Batteries Capacity ¼ 3 Ah ! lithium content ¼ 0.9 g


used in such equipment must therefore pass lithium.
tests related to spark generation, leaking,
venting, and generation of electrical and thermal As batteries of all types become more widely
energy. used in stationary and especially portable equip-
Transportation of batteries on commercial ment and devices, rules governing their disposal
carriers is coming under increasing scrutiny and have been enacted and continue to be developed.
regulation as safety concerns are elevated in the For example, the EU Battery Directive is a com-
wake of several high profile incidents aboard prehensive regulatory document that regulates
airliners, in particular. Battery fires are highly the manufacture and disposal of batteries and
reactive, with incredible amounts of energy accumulators in the European Union.
liberated over a brief time, resulting in high Enforcement of disposal regulations can be
temperatures and violent reactions. Conse- difficult, with billions of battery-operated
quently, restrictions are being proposed and devices in use throughout the world and compli-
enacted by government bodies, trade ance not always foremost in the minds of users
associations, and commercial enterprises to pro- when disposing of batteries specifically, or the
tect the safety of those in proximity to high devices containing them. Achieving battery
capacity batteries of the most energetic recycling objectives relies on public cooperation
chemistries. and participation.
Regulations have been developed that govern Cytotoxicity of batteries is also of ongoing
battery packaging to prevent batteries from concern as more and more battery-powered med-
becoming shorted across their terminals or from ical (implantable and non-implantable) devices
one cell to another during transport, and from are being designed and used in everyday life.
leaking or exploding if dropped or otherwise Some types of solid state batteries have been
mishandled. There are restrictions in the amount tested and shown to be non-cytotoxic.
of lithium that can be transported aboard aircraft
or other carriers without taking special
precautions. Regulations of particular types or
15.6 New Battery Technologies,
capacities (amount of lithium, for example) of
Trends, and Opportunities
cells subject them to restrictions or prohibitions
in IoT Applications
on passenger-carrying aircraft. Various agencies
have published guidelines or regulations, includ-
Since the introduction of Li-ion batteries in par-
ing U.S. Department of Transportation (DOT),
ticular, great strides have been made in energy
Federal Aviation Administration (FAA), United
density, the array of cell types available, rate
Nations (UN) and International Air Transport
capability, package sizes, and other important
Association (IATA). International transportation
attributes. Yet, as with electronic components
regulations can be found in publications UN
generally, ever-increasing demands are placed
3480/3481 and UN 3090/3091.
on batteries. Consumers want more, for less, as
Equivalent lithium content of a lithium-ion or
with most products. And while the trend in pric-
lithium polymer battery is calculated based on
ing per unit of energy has been steadily down-
the ampere-hour rating of the cell, battery, or
ward in recent years, improvements in energy
battery pack, as follows:
density have not kept pace with the proliferation
of device functions and capabilities of products
Lithium content (in grams) ¼ rated capacity (in
such as IoT devices. While the power consump-
ampere-hours) per cell  0.3  number of
tion—active, quiescent, leakage—for a
cells in the battery or pack.
Example: 18650 Li-ion cell
15 Battery Technologies for IoT 439

particular function is generally much lower than shapes to be designed and integrated with porta-
what is was just a few years ago, the inexorable ble electronic products including IoT devices.
tendency, precipitated by user demand, is for Thin film and thick film printed batteries are
designers to make the next generation of devices finding applications in wearable devices and
more feature rich than the preceding generation, other devices incorporating wireless sensors,
and thus the demand on batteries tends to radios, and processing capabilities requiring just
increase, not diminish. a few milliampere-hours of storage capacity in
In response, scores of organizations in gov- thin, usually flexible, formats.
ernment, academia, and private industry are Many challenges face the battery industry,
investing large sums of money and scientific including pricing pressures driven by production
and engineering talent into developing new bat- of commoditized cells from manufacturers in low
tery chemistries, materials, and manufacturing cost regions; quality concerns over products from
techniques in an effort to bring batteries with certain manufacturers; a finite number of
advanced capabilities to market for any number elements on the periodic table from which to
of applications. Efforts encompass every aspect develop new chemistries and the aforementioned
of battery design, materials development, and increasing demands in performance generated by
construction. Advances range from anode devel- the vast proliferation of electronic appliances of
opment—such as replacing carbon-based anodes every sort.
with silicon and lithium anodes in Li-ion Numerous trade associations and annual
batteries—to higher capacity and higher power conferences are dedicated to advancing battery
cathode materials, to new formulations in elec- technology and making improvements in energy
trolyte composition, including new polymers and density, safety, manufacturing, and in
solid state materials. Li-ion materials tolerant to incorporating materials more benign to the envi-
higher charging voltages are also in ronment through the lifecycle of materials
development. extraction, battery production, disposal, and
As described earlier, the energy density of recycling. Through their common objectives,
advanced Li-ion rechargeable cells is on the these organizations serve a purpose in continuing
order of 650 Wh/L, and just above that in to advance battery technology and products to
some primary lithium cells. On the horizon, meet the growing needs of consumers world-
rechargeable cells exceeding 1000 Wh/L are wide, while also continuing to recognize and
now in development. New rechargeable address the concerns over depletion of resources,
chemistries—using zinc rather than lithium— disposal hazards, and effects on the environment
are in development, and lithium polymer cells generally.
are now commercially available in myriad sizes Forecasts for unit volumes of IoT devices
and storage capacities. deployed in the coming years vary widely, with
Significant resources are being deployed in some forecasters projecting upwards of 100 bil-
the development and commercialization of solid lion sensors to be made and sold each year by
state batteries, promising higher energy density, 2020. While not all of these sensors will be used
a greater number of charge-discharge cycles, and in IoT devices per se, supporting a sizable frac-
improved safety relative to batteries employing tion of these devices in battery-powered systems
liquid and polymeric materials in their construc- will require a significant increase in the number
tion. Such solid state batteries have already made of batteries or other suitable energy storage
their way to the market in the form of thin film devices to be manufactured each year. Should
micro-batteries—allowing for special battery such markets and applications develop in line
with such forecasts, and to the extent that those
440 J. Sather

devices are “unplugged” from wall power, there Energizer AA Alkaline (EN91), http://data.energizer.
will be opportunity and incentives for battery com/PDFs/EN91.pdf. Accessed Nov 2015
Energizer AA Lithium (L91), http://data.energizer.com/
technologists and manufacturers alike to develop PDFs/l91.pdf. Accessed Nov 2015
and introduce battery technologies of including D. Linden, T. Reddy, Handbook of Batteries, 3rd edn.
more of the type in widespread use today, to (McGraw-Hill, New York, 2002)
nascent battery technologies currently in Panasonic AA Alkaline (LR6), http://na.industrial.
panasonic.com/sites/default/pidsa/files/1.5vseries_
development. datasheets_merged.pdf. Accessed Nov 2015
Panasonic CR2032 Manganese Dioxide Lithium Coin
Battery, http://na.industrial.panasonic.com/sites/
References default/pidsa/files/crseries_datasheets_merged.pdf.
Accesed Nov 2015
Panasonic Manganese Lithium Coin Battery, http://na.
Cymbet Corporation AN-1059: Extend Battery Life by industrial.panasonic.com/sites/default/pidsa/files/
Reducing System Power Using the EnerChip RTC, mlseries_datasheets_merged.pdf. Retrieved Nov 2015
http://www.cymbet.com/pdfs/AN-1059.pdf. Accessed Panasonic NCR18650 Li-ion, http://industrial.panasonic.
Dec 2015 com/lecs/www-data/pdf2/ACI4000/ACI4000CE17.
Cymbet Corporation Solid State Rechargeable Battery, pdf. Accessed Nov 2015
http://www.cymbet.com/pdfs/DS-72-41.pdf. Accesed Tadiran Model TL-5104 data sheet, www.tadiranbat.com/
Dec 2015 pdf.php?id¼TL-5104. Accessed Nov 2015
Digi-Key Online Catalog, http://www.digikey.com/prod
uct-search/en?keywords¼battery. Accessed Nov 2015
System Packaging and Assembly
in IoT Nodes 16
You Qian and Chengkuo Lee

The internet of things is the networks of physical the physical world and computational systems,
“thing” embedded with sensors, integrated and resulting in improved accuracy and effi-
circuits and power source. Thus, packaging all ciency. Having a massive amount of sensors
these elements into one IoT nodes is essential for and other elements communicated in wireless
promoting this technology. In this chapter, some manner has emerged as a disruptive wireless
of the popular and commercially available pack- sensor network technology in data collection
aging technologies such as wire bonding, flip- between every IoT node. Integration or packag-
chip bonding, tape automated bonding, etc. are ing of the above mentioned components
reviewed. Secondly, wafer level packaging is influences the overall system performance, cost
emphasized as it can be used as a low cost pack- and time-to-market.
aging technology, usually because the packaging From application point of view, the IoT allow
cost is much higher than other cost associated users to view the phenomena of interest from
with device manufacturing. Recent progress on multiple yet simultaneous vantage points, and
packaging for wearable electronics, silicon Pho- with the vast aspect of applications. More specifi-
tonics and energy harvesters are also included. In cally, such features enable users to simultaneously
the later part, infrared sensors have been used to monitor the health of civil structures, e.g. from
demonstrate the packaging constraints imposed bridges, dams to tunnels and buildings. The IoT
by device performance and application nodes can be used to localize and to measure the
requirements, various packaging solutions avail- condition of transport goods, thereby optimizing
able at an individual sensor level and compli- the logistics cost and energy consumption.
cated array level. Besides, the same platform technology realizes
wearable (or embedded) miniature sensors which
promote higher-quality medical procedures
16.1 Introduction and research, as well as next-generation preventa-
tive healthcare solutions. All these different
The internet of Things (IoT) is the network applications require careful design of the packag-
majorly comprises of multiple miniaturized ing, to allow the sensors to monitor the environ-
sensors, actuators, integrated circuit and power ment effectively, and to provide the whole node
source. It creates a seamless connection between with a reliable working condition.
To implement the above-mentioned app-
Y. Qian • C. Lee (*) lications, advances in hardware technology and
National University of Singapore, Singapore, Singapore engineering design have led to significant
e-mail: elelc@nus.edu.sg

# Springer International Publishing AG 2017 441


M. Alioto (ed.), Enabling the Internet of Things, DOI 10.1007/978-3-319-51482-6_16
442 Y. Qian and C. Lee

reductions in size, power consumption, and cost In view of great impact made by IoT, we will
of all the basic components of the IoT nodes introduce the prevailing packaging technology
including sensors, digital circuitry, and wireless available in microelectronics and MEMS
communications. In particular, microelectrome- industries first, followed by discussion on design,
chanical systems (MEMS) technology has requirements and technology developed for
enabled various kinds of sensors in micrometer MEMS sensors in case study manner. At the
scale, which are more sensitive, occupy signifi- end of this chapter, the state-of-the-art MEMS
cantly less area and requires very lower power packaging technology will be provided for
than conventional sensors. For example, MEMS understanding the packaging technology trend
based gas sensors, pressure sensors and air flow in the applications of IoT.
sensors only demand 1–2 μW for operation (Guo
et al. 2012; Lou et al. 2012; Zhang et al. 2012).
While most of MEMS sensors are fabricated on 16.2 Packaging Technology
silicon substrates with the potential of monolithic
integration with integrated circuits (ICs), MEMS Electronics Packaging is the engineering of
sensors make the distributed IoT become establishing interconnections with electrical
feasible. components, managing heat dissipation, and pro-
Another major part of a IoT node is battery. tection of functional elements. The continuing
However, due to use of large batteries as the scaling trend in microelectronics/integrated cir-
power source, single packaged chip for entire cuit (IC) technology has a significant impact on
sensor node is not possible. Compared with tra- the different packaging technologies. Consider-
ditional batteries, MEMS energy harvesters ing the microelectronics manufacturing flow, the
seems to be a viable candidate, because of its first step is the fabrication processes involved at
compactness and unlimited lifetime. Addition- semiconductor wafer fabs, where ICs are
ally, Energy harvesters can be made with similar completed with wafer-level fabrication, inspec-
process like MEMS sensors and provide the tion and known-good-die test. Secondly each
entire IoT node as a single packaged chip. wafer is then mounted on a special sticky tape
Thus, various levels of system integration and sawed into individual dices. Then the back-
involve MEMS based sensor, chip for computa- end microelectronics manufacturing flow takes
tion work and communication are likely assem- place at the packaging house. This manu-
bled individually packaged components on a facturing flow includes the next few steps:
single board (Printed Circuit Board level packag- (1) one selected good die, i.e., the chip, is
ing) or housing all components in a single pack- attached to a carrier, e.g. a ceramic package, a
age (System-in-Package or System-on-Chip metal header, or a premolded plastic lead frame
approach) have been demonstrated. Packaging with an embedded metal lead frame; (2) electrical
is inevitable; as it provides a significant improve- interconnects between the IC chip and carrier are
ment in device performance and ensures long made by wire bonding, flip chip, or another
term reliability of components in it nodes. Pack- method; (3) either having plastic molded over
aging of IoT nodes and/or its components are the assembled chip on the plastic lead frame, or
more challenging, because of the variety of having a ceramic, metal, glass, or plastic cap
components, like sensors, actuators, Integrated sealed on the assembled ceramic package or
circuit (IC) controllers, etc., that make up such metal header is used to provide the required
nodes, where such elements require different reliability; (4) a final test and calibration is
condition of the packaging. Numerous packaging made to conclude the processes. The above pack-
solutions have also been explored for both com- aging is widely considered as the first level of
ponent level and system level packaging of IoT packaging. The modern first-level packages
nodes. might not only contain one but several IC chips
16 System Packaging and Assembly in IoT Nodes 443

and are then referred to as multichip modules integration effort of more and more functions
(MCM). Single-chip modules and MCM are on a single chip based on IC fabrication technol-
assembled together with other components, ogy, some other elements and components made
such as resistors and capacitors, onto a printed of non-silicon materials could be integrated
circuit board (PCB) via soldering process, i.e., together by system-in-a-package (SiP) solutions.
surface mount technology (SMT) process. This Figure 16.1b shows a prototype of wireless sen-
PCB package is considered as the second level of sor node under research based on a SiP approach.
package. Moreover, the motherboard which is Significant size reduction shows that the SiP
used in the personal computer (PC) providing approach is a promising solution for heteroge-
sockets for various PCB cards to be plugged in; neous integration in the fabrication of wireless
is considered as the third level package. Gener- sensor nodes.
ally speaking, an electronic system, like a PC or a
mobile phone, is a device with electronics in the
third level package assembled in a single 16.2.1 Current Commercial Packaging
housing. Technology for Internet
In order to realize sensor nodes which are of Things
truly ubiquitous, we really rely on the advanced
packaging technology to fabricate sensor nodes Packaging is one of the most essential and
in a very compact format and at a low cost way. expensive part of the sensor nodes in the IoT
A wireless sensor node typically consists of one system. The main challenge comes from the
or more sensors, an IC controller, a wireless packaging controller IC with different MEMS
transceiver, an antenna and a battery, or even sensor and battery. Even though, IC packaging
better, an energy scavenger. To precisely report is a well-developed technology, the MEMS
real-world variables, sensors are required to packaging is relatively new and is more difficult
detect pressure, temperature, heat, flow, force, owing to moving structural elements. Yet
vibration, acceleration, shock, torque, humidity, another complication for MEMS sensor packag-
and strain. In the applications such as security, ing is that, most MEMS sensors should interact
surveillance and traffic control, video and images with the environment and transfer the coupled
are taken by image sensors. Some of these energy effectively and at the same time has to
sensors are commercialized in MEMS based minimize noise and cross talk interference.
solutions, whereas others are conventional Hence, not only electrical signals need to be
solutions that have been available for decades, routed through the package, but also the sensing
for example, ceramics and polymer sensors. energy has to be effectively coupled to MEMS
Ranging from IC, passive elements (like devices. Different MEMS sensors demand for
capacitors, resistors and inductors), and power different packaging techniques which are
source like battery to various kinds of sensors, depending on the physical domain signals to
different fabrication technologies are required to be sensed (Bloss 2012; Sparjs 2001). The
make these components in rather different unique packaging requirements for each indi-
manufacturing lines. Thus the current cost- vidual MEMS sensor lead to additional com-
effective way of making a wireless sensor node plexity of designing a common packaging
is to assemble and integrate the above-mentioned platform technology among different MEMS
ICs, various components and sensors in a PCB sensors at the first place. Furthermore, the
package, i.e., the second level package as shown mechanical requirements for a sensor package
in Fig. 16.1a. Heterogeneous integration is the are typically much more stringent than the pack-
technical term to describe the technology for aging standard of ICs, because most of the
integrating multifunctional elements and devices MEMS devices have fragile suspended
into a compact standalone system. While system- structures, and so even a small stress or strain
on-chip (SoC) technology represents the can influence the performance or the long term
444 Y. Qian and C. Lee

Fig. 16.1 (a) A wireless sensor node in production and three-axis gyroscope and a 32-bit microcontroller unit
realized in the second level package, i.e., the PCB package; (MCU), i.e., a processor. This nine-degree-of-freedom
(b) a prototype of miniaturized wireless sensor node in (DoF) inertial system represents a fully integrated solution
system-in-package (SiP) format from IMEC Research Cen- for various applications such as virtual reality, augmented
ter, Belgium; (c) a state-of-the-art inertial sensor module in reality, image stabilization, human machine interfaces,
a system-on-board (SoB) package format from STMicroe- robotics and inertial body tracking [Courtesy: ST
lectronics. It integrates multiple sensors with a powerful Microelectronics]
computational core: a six-axis geomagnetic module, a

reliability of the device. In the worst case, it packaging criterion demanded by different
may also cause the device completely malfunc- components in the IoT, thus post a major road-
tion (Lau et al. 2010). MEMS sensors are block in the development of a common packag-
designed for sensing various physical ing platform technology that could provide
parameters, so in order to achieve certain device higher reliability and lower cost devices.
specifications, appropriate choice of materials is The sensor packaging design is considered
inevitable, which can introduce more constrains equivalently important as the design of the
to packaging specifications. Secondly, as sensors itself. The sensor packaging majorly
advancement in IoT, MEMS actuators are also influences the performance of the device, espe-
integrated into the unified standalone system, cially with respect to factors such as long-term
along with the sensors and controller IC, to drift, stability and reliability. More importantly,
autonomously control the system by physically the cost of the package and its development will
reacting to the decision signal from the control- often be significantly higher than that of the
ler. This feature of integrating actuators has MEMS device fabrication. Hence packaging of
greatly enhanced their capabilities and range integrated systems like IoT is extremely crucial
of applications. This also means additional for high performance, low cost and high reliable
complexity in overall system packaging product. Researches on heterogeneous integra-
capabilities. These are inherent problems of tion for building innovative standalone systems
16 System Packaging and Assembly in IoT Nodes 445

Fig. 16.2 Block diagram


showing the different levels
of possible system
integration (Reichl et al.
2009)

are gaining more interests globally. Various 16.2.1.2 First Level Packaging
packaging techniques and procedures are suc- The first level packaging refers to tlhe encapsu-
cessfully demonstrated for Internet of Things. lation of discrete functional chips onto individual
Packaging can be achieved at different levels packages. The fabricated wafers from the semi-
depending upon the number of different func- conductor wafer fabrication facility are tested
tional chips that are packaged in a single housing and passed the known-good-die inspection step,
as shown in Fig. 16.2. Heterogeneous integration and then are diced into individual dies for further
can be achieved using different level of packag- packaging. In first level packaging, the passed
ing as follows: die is placed onto the package carrier holder
and is electrically connected to the carrier pins.
16.2.1.1 Integration/Assembly First level of packaging is also known as “Single
of Multiple Chips (or) PCB Chip Packaging”. Single chip packages can be
Level Packaging categorized in terms of various aspects, such as
The integration and assembly of individually package material (plastic, ceramic, or metal),
packaged chips on to a dedicated Printed Circuit interconnect arrangement (peripheral or area
Board (PCB) is the most straight-forward means array interconnects), assembly technology
of integrated system packaging. Hence it is known (surface-mount technology, chip-stacking,
as “PCB level packaging”. The main advantage of multi-chip-module, or other package formats).
this approach is that each component can be pack- Some of the main stream single chip packaging
aged with the best suitable technology possible, techniques are described briefly,
and it enables the best of individual functional
blocks to get together. The complexity level for 1. Single in line packages (SIPs): SIPs have
integrating and assembling the overall system is pins for through-hole assembly, only along
low. Additionally, each component can be bought one side of the package. SIPs are generally
from various vendors and integrated into the PCB plastic packages and mostly used for memory
later. This packaging approach is highly compati- modules packaging. They are also limited to
ble with functionality upgrade and design application demanding fewer input/output
changes, while the lead time to market is signifi- (I/O) lead pins.
cantly minimal. However, the weakness of PCB 2. Dual in line packages (DIPs): DIPs provides a
level packaging is large footprint of the packaged rectangular housing with two parallel rows of
systems. The different levels of integrated electrical connecting pins. The package is a
standalone sensor network packages as shown in through-hole mounted package. They usually
Fig. 16.2 and mentioned in previous sections will made of ceramics or plastics. The DIP is a
be briefly discussed in next sections. commodity through-hole package with up to
446 Y. Qian and C. Lee

84 pins and a most common lead pitch of solder balls, which are popularly used for
2.54 mm. permanent mounting of devices like
3. Small outline packages (SOPs): A small- microprocessors. A BGA can provide more
outline package is a lead frame-based plastic interconnection pins than the number can be
package with leads on two sides of the body, put on a dual in-line or flat package. The
and is well-adopted in SMT packaging lines. whole bottom surface of the device can be
SOP is 30–50% less area than an equivalent used, instead of just the perimeter. The leads
DIP, with a typical thickness that is 70% less. are also on average shorter than with a
They are generally available in the same pin perimeter-only type, leading to better perfor-
outs as their counterpart DIPs. SOP is the mance at high speeds. Soldering of BGA
most common package type for low pin devices requires precise control and is usually
count ICs and MEMS pressure sensors. done by automated processes. Plastic BGAs
4. Quad flat packages (QFPs): Quad flat (PBGAs) based on laminated multilayer plas-
packages are SMT compatible packages with tic substrates and ceramic BGAs (CBGAs)
leads extending out from all four sides of the based on multilayer ceramic substrates are
package. QFP is available as ceramic (CQFP) available. The chip interconnection is accom-
or lead frame-based molded plastic (PQFP) plished either by wire bonding or by flip chip
packages. The lead pins are placed close and bonding.
are very thin to achieve maximum I/O pins.
5. Chip carriers (CCs): Chip carriers are also The advancement in the single-chip package
surface-mountable packages with peripheral is focused towards increased I/O counts and
leads on all four sides like the QFP reduced package form factor for both ICs and
counterparts. However, in contrast to the MEMS packaging as shown in Fig. 16.3a, b
QFP which has gull wing leads, chip carriers respectively. These advancements have led to
feature either J-leads or no leads at all. The the high density, low cost and much efficient
leadless chip carriers (LLCCs) have a single chip packages. Even though the
laminated ceramic substrate, leaded chip standard packaging evolved to provide more
carriers (LDCCs) with J-leads are either lead I/O pins with reduced footprint, the major
frame based molded plastic packages or uti- technological improvement that aided this
lize ceramic substrates. growth was the bonding techniques. Some of
6. Pin grid arrays (PGAs): PGAs are through- the well-established techniques are discussed
hole packages with an area array of intercon- below:
nect pins. The package is square or roughly
square, and the pins are arranged in a regular 1. Wire bonding
array on the backside of the package with Wire bonding is the process of providing elec-
either laminated ceramic (CPGA) or laminated trical connection between the chip and the
plastic (PPGA) substrates. CPGA packages are external lead pins of the chip carrier by
available in two configurations—cavity-up or using fine bonding wires. The wire used in
cavity-down. In the cavity up configuration, wire bonding is usually made either of gold
the IC is on the opposite side as the pin array, (Au) or aluminum (Al), although copper
while the IC is placed on the same side as the (Cu) wire bonding is starting to gain a foot-
pin array in the cavity down configuration. hold in the semiconductor manufacturing
PGAs allow for more pins per IC than older industry.
packages such as DIP. Intel Corporation During gold ball wire bonding, a gold ball
adopted PGA technology for several is first formed by melting the free end of the
generations of its microprocessors. wire. The gold ball is then brought into con-
7. Ball grid arrays (BGAs): A BGAs are the tact with the bond pad. Once the contact is
surface-mountable versions of the PGA with made, desired amounts of pressure, heat,
an area array of interconnect using attached and/or ultrasonic forces are then applied to
16 System Packaging and Assembly in IoT Nodes 447

Fig. 16.3 Trend in packaging industry: (a) general packaging trend and (b) MEMS packaging trend [Courtesy: MEMS
Journal]

a b
Step  Step  Step  Step 

Complete 1st Clamps open Complete 2nd Clamps close and


bond and move to bond form tall
shape loop

Fig. 16.4 (a) Schematic demonstration of gold ball wire Delft Institute of Microsystems and Nanoelectronics
bonding process [Courtesy: Freescale Semiconductor]; (DIMES) Technology Center, Netherlands]
(b) SEM photo of ball bonded gold wire [Courtesy:

the ball for a specific amount of time, and to is then run to the corresponding lead finger;
form the weld between the ball and the bond against which it is again pressed. The second
pad. The wire is then run to the corresponding bond is again formed by applying ultrasonic
finger of the lead frame, forming a gradual arc energy to the wire. The wire is then broken off
or “loop” between the bond pad and the lead by clamping and movement of the wire
finger. Pressure and ultrasonic forces are (Fig. 16.5).
applied to the wire to form the second bond Among the aforementioned wire bonding
this time with the lead finger. Finally, the wire techniques, gold ball bonding is much faster
bonding machine breaks the wire in prepara- than aluminium wedge bonding, and so are
tion for the next wire bonding cycle extensively used in plastic packaging. Unfor-
(Fig. 16.4). tunately, gold ball bonding on Al bond pads
During aluminum wedge wire bonding, a cannot be used in hermetic packages, primar-
clamped aluminum wire is brought in contact ily because the high sealing temperatures
with the aluminum bond pad. Ultrasonic (400–450  C) used for these packages will
energy is then applied to the wire for a specific tremendously accelerate the formation of
duration while being held down by a specific Au-Al intermetallics, leading to early life
amount of force, forming the first wedge bond failures. Gold ball bonding on gold bond
between the wire and the bond pad. The wire pads, however, may be employed in hermetic
448 Y. Qian and C. Lee

a Step  Step  Step  Step 


b

Rotate to position Clamps open Complete Clamps close


and complete 1st and move to 2nd bond and form tail
bond shape loop

Fig. 16.5 (a) Schematic demonstration of Al wedge wire Delft Institute of Microsystems and Nanoelectronics
bonding process [Courtesy: Freescale Semiconductor]; (DIMES) Technology Center, Netherlands]
(b) SEM photo of wedge bonded Al wires [Courtesy:

a b c
Underfill Applicator Underfill

Die Passivation Layer


Bump
BLM
Adhesive
Al Capture Pad
Substrate Chip
Via

Pb/Sn C4 “Bump” Conductive


Epoxy Encapsulant Particle
Au/Ni
Substrate Via Substrate
Ceramic Substrate

Substrate

Fig. 16.6 Schematic demonstration of (a) flip chip bumping; (b) flip chip underfilling; (c) flip-chip assembly process
from ACI technologies Inc

packages. Alternatively, the Al-Al ultrasonic stability and also enable heat transfer to sub-
wedge bonding does not require heat to facili- strate. Flip chip assembly is quite different from
tate the bonding process, unlike the Au-Al the traditional approach. Flip chip assembly
thermosonic ball bonding. The Al bond pad consists of three major steps (Gianchandani
is also harder than the Au ball bond, making et al. 2008; Lau 1996; Zhong et al. 2007):
good bonding between them through purely (a) Flip-Chip Bumping—The process of
ultrasonic means, thereby avoiding any ther- forming bump on a bond pad of the die
mal damage caused to wires, bond pads, or as shown in Fig. 16.6a. This can be done
silicon substrates. by various techniques like Under Bump
2. Flip-Chip Bonding Metallization (UBM), Plated Bumping,
Flip-Chip bonding is the process of flipping the Stud Bumping, and Adhesive Bumping.
chip over and bonding it onto a substrate, board, (b) Flipping and attaching the chip to the
or carrier in a ‘face-down’ manner. Prior to the chip carrier.
bonding process, conductive bumps are pre- (c) Flip-Chip Underfilling—It is a process of
cisely formed in the desired target positions filling the open spaces between the chip
on the chip surface, so while mounting the and the substrate or board with a
chip, it has to be faced down to enable electrical non-conductive but mechanically protec-
connection to the chip carrier. These bumps tive material. Needle dispensation
provide electrical connection, mechanical followed by thermal curing to form
16 System Packaging and Assembly in IoT Nodes 449

Fig. 16.7 Various commercially available flip-chip packages with different interconnect formats offered by Amkor
Technology (a) fcCSP package, (b) FC ceramic package (c) SuperFCTM

permanent bond is used, as shown in higher cost and time. Thus, tape automated
Fig. 16.6b. bonding is a better alternative to conventional
The major advantage of flip-chip bonding wire bonding if very fine bond pitch, reduced
approach is that, their size is much smaller die size, and higher chip density are desired. It
than their conventional counterparts, as the is also the technique of choice when dealing
chips do not require wire bonds. Additionally, with applications that need the circuits to be
all problems related to inductance and capaci- flexible.
tance that are associated with bond wires are
eliminated and so provides improved perfor-
mance. Flip-chip bonding is one the most Second Level Packaging
preferred techniques in chip-scale packaging In the second level of packaging, the packaged
and is commercially available with lot of var- chips with different functionalities are mounted
iation from number of interconnects to the on the printed circuit board, in which the electri-
substrate, package material, etc., as shown in cal routing between discrete components are
Fig. 16.7. designed and printed beforehand. The mounted
3. Tape Automated Bonding chips and other electronic components (i.e.
Tape Automated Bonding (TAB) is the pro- resistors, capacitors, inductors, power source
cess of mounting a die on a flexible tape made ports and other leads) are electrically connected
of polymer material. Initially conductive through soldering process. Various PCB level
bumps or balls are formed on the targeted integrated IoT has been reported by research
locations on the chip. Then the bumped chip groups all over the world. Some of the PCB
is connected to fine conductors on the tape assembled sensor nodes are shown in Fig. 16.9.
(known as inner lead bonds). The other side PCB level module integration has the advan-
of tape (known as outer lead bonds) connects tage of providing best possible packaging
the tape to the external circuit. Sometimes the demanded by individual chips and later
tape on which the die is bonded already integrating to realize the functionality of the
contains the other chips (Fig. 16.8). entire system, which in this case is the IoT. The
The main advantages of tape automated PCB technology and MCM integration at PCB
bonding include: it allows the use of smaller level are well established, and such platform
bond pads and finer bonding pitch; and the provides higher yield and flexibility to design
bond pads can be all over the die therefore change or modification in particular module.
increasing the possible I/O count of a given However, the biggest disadvantage of this
die size. The unique feature of TAB is that it kind of PCB level integration is larger footprint
allows circuits to be physically flexible. How- of the final product, and increased capacitive and
ever, on the other hand, the tailoring of the inductive interference from the routing wires. In
tape and the entire process may demand order to reduce the overall footprint, more
450 Y. Qian and C. Lee

Fig. 16.8 (a) Schematic drawings of tape automated interconnects based application; (c) shows real tape
bonding (TAB) and its respective cross section; (b) sche- automated bonded devices for flexible application
matic demonstration of using TAB for flexible

Fig. 16.9 Various sensor nodes based on PCB level demonstration for WSN (Reichl et al. 2009) and (c) com-
packaging—(a) in-plane packaging of discrete compo- pact PCB level packaged WSN [Courtesy: Wireless
nent of WSN, (b) effective 3D PCB level assembly Sensornets Laboratory, Western Michigan University]

interestingly the individual chips can also be 16.2.1.3 Multiple Chip Integration
placed on either side of the PCB, which straight (or) System in Package
away reduces the footprint to half but it also Even though PCB level integration is the popular
increases the packaging complexity at the same choice of technology for system integration, var-
time. Despite the above mentioned challenge, ious other technologies are also gaining interests
this 3D integration of ICs on a PCB has been and are explored actively worldwide. For Inter-
embraced by a lot of companies and PCB level net of Things, which employ synchronous func-
packaging remains as the leading technology for tioning of multiple sensors, IC controller,
heterogeneous integration. For example, the discrete electrical components and power source,
Google NEXUS 7 tablet computer has the vari- the most attractive scheme is the System-in-
ous functionality chips, assembled onto a PCB in Package (SiP). As mentioned in the introduction
3D manner as shown in Fig. 16.10. of this chapter, different functional dies are
16 System Packaging and Assembly in IoT Nodes 451

Fig. 16.10 The communication board of Google significant reduction in footprint [Courtesy: Google
NEXUS 7 tablet computer has various individual chip NEXUS 7]
assembled on either side of the PCB, which provides

manufactured independently, but are packaged in More aggressive scheme of high density inte-
a single carrier. By this means, the footprint can gration is the stacked SiP approach, where indi-
be drastically reduced. Various techniques are vidual dies are placed on top of one-another. This
available for packaging multiple functional dies effectively reduces the overall footprint of the
on to a single package. There two major ways of system. Different approaches are proposed to
achieving SiP—(1) Planar SiP packaging, where realize this concept. Some of them are:
individual chips are placed alongside laterally to
each other in a housing and (2) Stacked SiP 1. Stacked Packages—Different packaged chips
packaging, where individual dies are placed one are placed on top of each other and are wire
on top of another in a vertical fashion in a hous- bonded to bond pads of the carrier substrate
ing (Goldstein 2001). (see Fig. 16.12a).
In the case of 2D planar SiP approach, the 2. Stacked Chip-Scale Package—More compact
individual functional dies are picked and placed approach is a package that bare dies and are
onto a dedicated carrier one by one after the stacked and wire-bonded to bond pads that
known-good-die testing. Once the dies are placed connect to the leads of solder balls on the
side by side onto the carrier, they are wire substrate. This package is further connected
bonded to the bond pads of the lead pins of the to a PCB to form a complicated system via the
package, which is followed by encapsulation of solder balls based on SMT manufacturing
the entire system under a single package. The lines (see Fig. 16.12b).
schematics and commercialized 2D planar SiP 3. Folded Package—It is an advanced way of
based pressure sensor from Infineon achieving smaller package, i.e., chips are
Technologies are shown in Fig. 16.11. The initially placed on a flat polyimide tape,
main advantage of 2D planar SiP technique is followed by folding the flexible tape to yield
the reduced footprint as compared to PCB level a stacked package of small footprint (see
packaging. Fig. 16.12c).
452 Y. Qian and C. Lee

(a)

Encapsulation
(metal lead,
Glob-top molding compound, PCB
material

ASIC die
Interface circuit
Wire-bond interconnect Pressure sensor
MEMS die (b)
Pressure port
Die attach material

Substrate
(PCB laminate,
ceramic, leadframe)

Fig. 16.11 Schematic demonstration of system in package (Yole Developpement 2012) and SiP assembled device
[Courtesy: Infineon Technologies]

a b

Die Wire

Solder ball
Individual package Package substrate
Stacked Packages Stacked Chip-Scale Package

c d
Flexible tape IC

Folded Packages Bus metal

System-in-a-Cube

Fig. 16.12 Schematic demonstration various packaging folded packages and (d) system-in-a-cube approach to
approaches to achieve 3D integration of individual chips provide electrical interconnects through vertical bus
(a) stacked packages, (b) stacked chip level packages, (c) lines (Goldstein 2001)
16 System Packaging and Assembly in IoT Nodes 453

a b c

Die Wire

Solder ball
Package substrate

Fig. 16.13 (a) Conceptual drawing of 3D stacking of SEM photo of fabricated ST microelectronics gyroscope
individual chips in a package (Goldstein 2001); (b) con- demonstrating the concept of chip level stacking in a
ceptual drawing of 3D chip level stacking in a package for package [Courtesy: ST Microelectronics]
an inertial sensor modfule [Courtesy: InvenSense Inc]; (c)

4. System-in-a-cube—The densest package so 2010). Wafer level packaging has the ability to
far relies on a stack of epoxy layers and enable true integration of wafer fabs, packaging,
embedded chips. Metal interconnects to the test, and burn-in at wafer level in order to stream-
layers are made along the sides of the stack line the manufacturing process undergone by a
(see Fig. 16.12d). device from bare silicon substrate to customer
shipment. Hence WLP can be considered as the
Even though the technologies enabling 3D extension of the wafer fabrication processes to
stacking looks fascinating in terms of foot print include device interconnection and device pro-
reduction, the sole purpose of integrating the mul- tection processes. There are various methods of
tiple sensor chips with windows to interact with the achieving WLP as shown in Fig. 16.14.
environment will be limited or impossible to WLP also provides significant performance
achieve in some case. Thus, for the packaging, improvement over chip level packaging with
the planar and stacked chip level system-in-pack- almost zero increment in footprint, due to pack-
age looks very attractive (Fig. 16.13). age. This enables WLP to revolutionize the next
Without comprising with the high cost and generation IC, MEMS and integrated system
lead time associated with system-on-a-chip packaging and opened up whole new areas of
development, manufacturing SiP has the added applications as shown in Fig. 16.15.
benefit of compatibility with die design changes The ultimate advantage of WLP is that it
and integration of various die technologies (e.g., combines the possibly smallest implementation
Si, GaAs, SiGe, SOI, MEMS). Thus, together form of packaging through 3D hybrid integration
with SoC technology, interconnect and packag- of die stacking, with the economic benefits of
ing technologies are key enabling technologies wafer batch processing. The two most attractive
for realizing smart SiP solutions. WLP approaches for packaging Internet of
Things are discussed briefly.

16.2.2 Advanced Packaging 16.2.2.1 Wafer to Wafer (W2W) Bonding


Technology for Internet Wafer to Wafer bonding is the process of firmly
of Things joining two wafers (or more) to create a stacked
and bonded wafer. Wafer bonding is used in
More advanced packaging involves packaging of the MEMS field to build 3D microstructures
dies at wafer level and is commonly known as and to manufacture silicon-on-insulator (SOI)
“Wafer Level Packaging (WLP)” technology substrates. Wafer bonding techniques are not
(Esashi 2008; Ko and Chen 2010; Lau et al. only used to bond two silicon wafers to each
454 Y. Qian and C. Lee

Fig. 16.14 Different


designs of 3D MEMS
packaging for the MEMS
wafer, ASIC wader and
cavity cap wafer (Esashi
2008)

Fig. 16.15 Application scope of wafer level packaging technology (Yole Developpement 2012)
16 System Packaging and Assembly in IoT Nodes 455

Fig. 16.16 Essential elements of wafer bonding: wafer preparation, external activation and wafer bonding
(Gianchandani et al. 2008)

other, but can be employed for different are—(a) Fusion bonding—where very high
substrates, including glass wafers or non-silicon temperature is applied after intimate contact
semiconductor (e.g., GaAs) wafers as well. In to form bonding process and is usually used
general, two wafers are processed, one with func- for Si-Si wafer bonding and (b) Anodic or
tional devices and other with capping cavity. One field assisted bonding where a high electrical
wafer is flipped over and aligned onto the other bias is applied to form the bonding and is
wafer and bonded together through different usually done is case of glass to conductive
bonding techniques as shown in Fig. 16.16. wafer bonding.
There are two main kinds of wafer bonding 2. Intermediated Wafer Bonding (IWB): In
processes—direct wafer bonding and intermediated wafer bonding approach, an
intermediated wafer bonding. The most suitable intermediate layer or glue is used in between
choice of wafer bonding technology depends on two wafers to permit their bonding and per-
the particular applications and the materials manent attachment. The intermediate layer
involved. The two broad categories of W2W can be metals, semiconductors, inorganic
bonding approaches explained briefly below: insulators (like glass, oxides, or nitrides), or
organic insulators (like polymers, epoxies,
1. Direct wafer bonding (DWB): In direct and other adhesives). Some of the popular
wafer bonding approach, the two wafer MWB techniques are—(a) Eutectic bonding
surfaces are cleaned and polished to high (a thin gold layer as an adhesive layer to bond
degree of smoothness and are brought into silicon wafers), (b) Glass-frit bonding (uses a
intimate contact, thus allowing the atoms on glass frit seal at the interface between the
the surfaces of these wafers to covalently wafers), (c) Solder-based thermocompression
bond together and form a permanent bond. bonding (Bonding at room temperature by
The most popular subcategories of DWB
456 Y. Qian and C. Lee

Fig. 16.17 (a) Photo of a successfully bonded wafer Au UBM with the bonding temperature of 180  C (Yu et al.
with 100% yield (Esashi 2008); (b) scanning acoustic 2009); (c) interfacial microstructure of the joint bonding
micrograph (SAM) of bonded devices using Ti/Cu/Ni/ at 180  C—center region of the joint (Yu et al. 2009)

using solder as interface layer and with appli- bonding process. Stable and high-temperature
cation of very high pressure), and (d) Adhe- resistant bonding interfaces rely on the formation
sive bonding (polymer-based adhesive of IMCs from reaction between a low-melting-
bonding) (Lindroos et al. 2009; Niklaus et al. point (LMP) component, such as In or Sn, and a
2006). high-melting-point (HMP) component, such as
Au, Ag, or Cu. In the material selection for the
It is very important to provide complementary LMP component, In-Sn alloy is more preferred
metal oxide semiconductor (CMOS) compatible over single component solder of Sn or In, owing
wafer bonding technology to drive the cost down. to the low eutectic temperature (118  C) and
However, the traditional silicon fusion based better wettability with various commonly avail-
direct bonding requires annealing temperatures able substrates. For HMP component, Cu is more
around 1000  C, and so it is not compatible with preferred as the use of Au would lead to a high
standard CMOS metallization. However differ- cost for wafer bonding, and also Cu is used
ent approaches are demonstrated using widely in modern IC packaging technology and
intermediated wafer bonding technology. The in 12-in. CMOS process technology. More
low temperature bonding approach that gained a importantly it is much cheaper than Au. Hence
lot of research attention was the formation of the In-Sn of eutectic composition for Cu-based met-
bonding interface at low temperature with allization provides a low cost alternative
low-temperature solder (e.g., indium) for hermet- (Fig. 16.17).
ically sealed packaging. The key advantage of In order to achieve high yield of good dies
this technology is that low bonding temperature through wafer bonding technology, issues of
is achieved because of low melting point (LMP) wafer warpage and topography differences
solder; however, the same package can withstand along the bonding rings across a wafer can be
much higher temperature because of the high- overcome by a solder-reflow step when sufficient
melting-temperature intermetallics compounds solder is placed in the bonding rings. However,
(IMCs) formed at the interface during the the LMP solder has to be fully consumed by
16 System Packaging and Assembly in IoT Nodes 457

Fig. 16.18 Schematic drawing of the role of a buffer layer dissolved into solder; (d) high-temperature IMC
layer during bonding: (a) before bonding; (b) liquid sol- joint finally formed during bonding. Redrawn from
der formed at the beginning of wafer bonding; (c) buffer (Yu et al. 2009)

reacting with HMP metals during the bonding Since the buffer layer saves the low-temperature
and annealing steps to form reliable and high materials for a successful solder-reflow step dur-
temperature-resistant IMC phases at the bonding ing bonding, the production yield of wafer bond-
rings for hermetic sealing applications. Hence ing is improved significantly. Ni is well known as
reasonable thickness of solder layers is critical. a barrier layer for Sn-based solder and Cu sub-
One of the major issues with this approach is the strate, and at the same time, it can easily diffuse
interdiffusion of material into other layers. Con- into Cu-based IMCs. For In/Sn/Cu systems after
sidering an example of microstructure of Sn/In/ eutectic bonding, all low-temperature phase will
Au/Cu metallization, the diffusion rates between be converted to high-temperature IMCs.
low-temperature solders and HMP metal According to the ternary-phase diagram of
substrates, e.g. Cu and Au, are very high. A Cu-In-Sn, the calculated thickness ratio of Cu to
portion of the as-deposited solders on the HMP In-Sn needs to be larger than 0.5 to form Cu6(Sn,
metal would be consumed before the wafer- In)5 compounds. To protect In-Sn from
bonding step; thereby forming Cu6(Sn,In)5 oxidization, a thin Au layer is deposited on top
ternary phase at the Cu substrate side due to of the solder layers. Since In will form Au-In
interdiffusion. Similarly, AuIn phase is formed IMCs with Au, there will be two kinds of IMCs
before the bonding step completed due to the fast in the final seal joint—Cu6(Sn,In)5 and Au-In. To
diffusion of Au into In. These compounds may get a robust joint, the Au layer should be as thin
not be high temperature resistant and will as possible to reduce the volume of the Au-In
adversely affect the long time reliability of pack- IMC (Fig. 16.18).
aged devices. Therefore, it is very important to
prevent the interdiffusion between LMP and Through Silicon Vias (TSVs)
HMP components during the fabrication process Another key technology that essentially made
and room-temperature storage. One way to pre- wafer bonding technology achievable was the
vent interdiffusion is to include a thin buffer through silicon vias (TSV), by which electrical
layer between the Cu substrate and the solder connections can be made between two bonded
layers such that interdiffusion is inhibited, wafers without the need for wire bonding or
whereas this thin buffer layer preferentially bump based bonding.
dissolves into the melted solder quickly at the For 3D IC packaging, TSVs are the most
beginning of soldering reaction. Then the diffu- important enabling technology. TSVs provide
sion between the liquid solder and the Cu starts, advanced SiP solutions such as C2C (Chip-to-
and finally, all solder was converted into IMCs. Chip), C2W (Chip-to-wafer), and W2W
458 Y. Qian and C. Lee

Fig. 16.19 (a) SEM image showing the via cavity for- Cu, followed by CMP planarization and (d) process flow
mation by DRIE process, (b) SEM image of oxide/barrier for TSV formation, Cu filling followed by metallization
layer deposition, (c) SEM image showing via filled with through TSV (Esashi 2008)

stacking, WLP; with the shortest electrical path step by the BOSCH etch process, (2) Via taper-
(vertical electrical feedthrough) between two ing step by a controlled isotropic etch process,
sides of a silicon chip. The most critical step in and (3) Corner-rounding step by a global isotro-
TSV technology is via formation, because it pic etch process as shown in Fig. 16.19a. Once
demands very high aspect ratio via cavity to the vias are formed, the thermal oxidation is used
achieve maximum number of interconnects to provide the electrical isolation layer, followed
between the two chips. Via cavity in silicon by deposition of Ti barrier and copper seed layer
wafer can be formed by either wet or dry etching as shown in Fig. 16.19b. Now the via cavity is
process. At times, laser drilling is also used to filled through copper electroplating process,
achieve high aspect ratio via cavity. However, followed by Copper (Cu) planarization through
the most popular technology to form via cavities Chemical Mechanical Polishing (CMP) (see
on wafer level is by deep reactive-ion etching Fig. 16.19c). Once the Cu vias are formed, the
(DRIE), which is a highly anisotropic etching wafer is attached to a supporting wafer using a
process. The silicon vias are formed in temporary adhesive material, followed back-
inductively-coupled plasma (ICP)-based DRIE grinding (thinning) and backside processing to
system based on process comprising specially expose the vias. The bonded wafers then are
designed switched steps of etch and passivation processed for back-end metallization/UBM/sol-
processes, also known as the BOSCH process. der bumps. The backside UBM is electroplated
The high-aspect-ratio tapered silicon vias are copper. After all these processes, the bonded
formed by three independently controlled pro- wafers are debonded by sliding the wafers apart
cess steps: (1) Straight vertical via formation at higher temperature (200  C). Now the TSV
16 System Packaging and Assembly in IoT Nodes 459

Fig. 16.20 (a) Process


flow to achieve vertical
feedthrough on either side
of the Si wafer without via
cavity filling and (b)
optical microscope image
of the highly conformal
metal vertical feedthrough
without filling via cavity
(Kuhmann et al. 1999)

ASIC (Application specific Integrated circuit) TSV enables high performance devices, com-
wafer is ready for the MEMS device (C2W) pared to its alternative 3D packaging technology
bonding or the MEMS wafer (W2W) bonding such as package-on-package or stacked chip
(Chen et al. 2008; Lau et al. 2010; Lindroos approach, because of the higher density and
et al. 2009). shorter length of the vertical electrical
Alternatively, vertical feedthrough is also feedthrough. Additional TSV technology also
possible by conformal electroplating of thin enables zero footprint increment due to chip
metal, rather than filling up the entire via cavity. stacking and perfectly flat package with minimal
The process of achieving the above mentioned step height. TSVs could complement the W2W
vertical feedthrough is shown in Fig. 16.20a. The bonding technology to provide highly efficient
Si wafer is etched using DRIE process with a and low cost product.
hard mask. Then a conformal coating of thin
metal is achieved through electroplating, thus
providing electrical path between either sides of 16.2.2.2 Thin Film Encapsulation
the Si wafer. This approach looks attractive, The emerging MEMS technology in the recent
however the current carrying capability through years has open up advanced packaging technique
these vias are limited as compared to entirely that can potentially enable better performance
metal filled via. and lower manufacturing cost. One such essen-
Another means of vertical feedthrough is the tial prerequisite is MEMS encapsulation which
Si V-groove vias. At first, the via cavity is can be imperative to the device’s capability and
formed through anisotropic wet-etched Si reliability. For example, increased reliability and
V-grooves followed by dielectric material coat- higher Q-factor performance are reported by
using vacuum based encapsulation on MEMS
ing and metal layer deposition (Fig. 16.21). The
resonator (Candler et al. 2003; Pourkamali et al.
process of achieving V-grooves vias by utilizing
2007; Rais-Zadeh et al. 2007) and energy har-
anisotropic wet etch is shown in Fig. 16.22d. The
vester (Xie et al. 2010). In today’s context,
footprint of through-wafer V-groove vias is rela-
established manufacturing technologies in
tively larger, owing to the larger angle of the MEMS encapsulation are mostly performed at
V-grooves as shown in Fig. 16.22a. The size of chip level back-end assembly, which means
the through-wafer V-groove via is determined by each device has its own dedicated packaging
the thickness of wafer, hence using a thin device where the housings or encapsulations are mostly
layer of a silicon-on-insulator (SOI) wafer, the consists of polymers and molding material. Con-
V-groove vias can be shrunk to a very small area sequently, this method leads to inefficient pro-
as shown in Fig. 16.22b, c. duction and higher manufacturing cost. In
TSV is the key enabling technology to suc- contrast, some other wafer level packaging tech-
cessfully implement W2W bonding technology. nique, for example, glass frit is utilized in wafer
460 Y. Qian and C. Lee

1)
(a) (d)
Evaporation of plating
base (Ti/Au)

2)
Mold 1: thin Eagle resist
Patterning: Feesthroughs
Cu plating

3)
Strip of Eagle resist
Electrical interconnects µ-via Mold 2: thick Engle resist
Pattering: TSM
Ni/Au plating

4)
(b) Strip of Eagle resist
Solder ball Mold 3: thick Eagle resist
V-groove µ-via Patterning: UBM, bumpd
Ni/Sn plating
(c) SMD pad 5)

Strip of Eagle resist


Strip of plating base

Sealing ring µ-via


Sio2 Cu
Ti/Au, Au Ni
100µm BHT-15 00 w Signal A = SE2 Date : 18 Jan 2006
Meg= 335x
WD- 20mm Plating No -E 451 Time : 14.50
Eagle resist Sn

Fig. 16.21 (a) Optical microscope image of the electri- SEM of sowing the μ-via (Shiv et al. 2006); (d) process
cal interconnections and the μ-via [Hymite Corporation]; flow for V groove feedthrough with electroplated photo-
(b) schematics of V groove μ-via (Shiv et al. 2006); (c) resist and metal (Kuhmann et al. 1999)

Fig. 16.22 Wafer level packaging approach to form thin film encapsulation. (a, b) High temperature WLE device
[Courtesy: SiTime Corporation]

level spin coating and encapsulation using bond- complicated machinery. However, this technique
ing scheme as reported by Knechtel et al. is not vacuum based and often creates excessive
(Knechtel et al. 2006) may be a viable solution particle count resulting in difficulty to keep the
as the process is low cost and does not involved wafer in pristine condition. In fact, vacuum based
16 System Packaging and Assembly in IoT Nodes 461

Fig. 16.23 Fabrication


process flow of an epi-seal
devices. (a) Si deep trench
etching. (b) SiO2 overfill to
seal trench. (c) Via
definition with device
release. (d) Si epitaxy
sealing and electrode
isolation. (e) Passivation
opening. (f) Al metallization
(Soon et al. 2014)

sealing is one of the most significant features in access to the SiO2 underneath. Then, isotropic
wafer level encapsulation (WLE). vapor hydrofluoric acid is used to release the
Thin Film Encapsulation (TFE) consists of active parts. Subsequently the release hole is
using a sacrificial material to cover the MEMS sealed by a second layer of Si epitaxy. During
device and a deposited thin film as capping layer the sealing, the wafer is exposed to high temper-
to encapsulate the MEMS device, followed by ature with hydrogen gas content and this process
removal of the sacrificial layer by wet etching or removes most impurities and polymers as
vapor-phase etching through the release holes on reported (Hong et al. 2013). Subsequently, top
the capping layer and then sealing up the release Si is isolated with SiO2 and finally the contact
holes by deposition of a thin film sealing layer. pads are open and metalized with aluminum. The
The deposition of the sealing layer is usually finish chip has flat topography and only the metal
carried out in a reaction chamber, which is pads are visible.
under vacuum and this ensures that the sealed This high temperature process (1080  C for
cavity of the MEMS device can be preserved at the epitaxy) provides a MEMS-first front-end
low vacuum. wafer-level packaging, which completely
One of the pioneer works come from the defines, releases, and seals the microstructures
epi-seal process. It involves deposition of a in the front-end processing steps. Thus, it
20 μm layer of epi-polysilicon over unreleased provides reliable wafer-level vacuum packaging
devices to act as a sealing cap, and a final seal of at low cost. One advantage of using a high-
the parts in 7 mbar (700 Pa) vacuum as shown in temperature sealing process (i.e., Epitaxial Si
Fig. 16.22a, b. A more detailed example of pro- deposition) is that the absorbed molecules inside
cess can be found in Fig. 16.23. Initial silicon on the cavity will be removed during the high-
insulator (SOI) wafer of 40 μm thick device layer temperature sealing step, just as in the hot baking
with 1 μm buried oxide is used, where the active treatment. Thus, it enables the high vacuum of
parts are defined by deep reactive ion etching the sealed cavity. The derived products exhibit
(DRIE) in this silicon device layer. After that, excellent performance in terms of ultra-low long-
sacrificial SiO2 is overfill the trenches. It is then term drift of device features and good stability
etched to define where the electrode should be against temperature variation.
interfaced, at the same time the opening area The effect of this epi-seal process to the
serves as the anchor area for the following Si reliability of MEMS in vacuum have been
epitaxy capping. The release hole is patterned investigated by several groups (Jin et al. 2003;
and etches in the Si capping layer to provide Shea et al. 2006). Below we show a report of
462 Y. Qian and C. Lee

Fig. 16.24 (a) Exploded view of the Si-to-Si MEMS encapsulated device is flat with minimal topology (Soon
switch with encapsulation layers. (b) X-ray image et al. 2015). (c) Scanning electron micrograph (SEM) of a
showing the encapsulated three terminal device consisting curved beam Si-to-Si MEMS switch. The operation of the
of a movable curve beam (source), control terminal (gate) switch is shown where VG is used to control the actuation
and contact terminal (drain). (Inset) top view of the fin- with gate terminal and VS is used to detect signal from
ished chip with metalized aluminum pad. The source to drain

increased MEMS switch reliability under and drain (fixed) defined on silicon-on-insulator
epi-sealed vacuum environment. Si based MEMS wafer. The movable beam (source terminal) can
switch working under air ambient normally shows be electrostatically pulled in to contact the drain
the reliability of only ~10 cycles. With epi-seal terminal by applying a gate voltage between the
process, the reliability of the MEMS switch is gate and source terminal. Subsequently, electrical
dramatically improved as the device is working signals can pass from source to drain. The reliabil-
under vacuum environment with prevention of ity of the MEMS switch is characterized by repeat-
oxidation at the contacts. edly turn on and off the gate terminal, while
The epi-sealed MEMS switch with the entire monitoring the contact resistance between source
moving structure and all electrodes are and drain terminal.
encapsulated in a vacuum environment with elec- The measurement shows even under
trical interconnects built through the encapsulation accelerated conditions, while the device operates
to the chip surface and metalized by aluminum under 400  C, Fig. 16.25 there is no degradation
(Fig. 16.24). These terminals consist of source after 106 cycles. The epi-seal process, not only
(movable beam, fixed at one end), gate (fixed) elevate the reliability of Si-based MEMS switch
16 System Packaging and Assembly in IoT Nodes 463

from 10 cycles to 106 cycles, it also makes the an insulation layer of silicon Si3N4 is plasma
MEMS switch technology become feasible and enhanced chemical vapor deposition (PECVD)
practical for future development. deposited, followed by 4 μm of low stress silicon
However, the high-temperature epitaxial Si oxide (SiO2) sacrificial layer in the same tool. The
sealing process limits material selection for insulation layer of Si3N4 acts as an etch stopper
MEMS devices. Some metals and low- during the sacrificial layer etching at the same
temperature oxides cannot sustain their material time it is also the dielectric layer which is very
characteristics under such high temperatures. As commonly used in MEMS devices, in order to
shown in Fig. 16.26, a low-temperature process prevent leakage into the bulks Si. The sacrificial
has been proposed to provide a low temperature SiO2 then defines the entire area where the MEMS
alternative WLE process (Soon et al. 2012). First, devices should be encapsulated. Next, a thinner
layer of 1 μm creates a spacer area larger than then
first sacrificial layer. A dual layer deposition of
101 1 μm Si3N4 and 0.2 μm of amorphous Si is depos-
ited to complete the top housing and defines the
100 structure for the encapsulation. This layer is pat-
Resistance (GΩ)

Off Resistance
On Resistance terned with release holes. After that, the wafer is
dipped into buffered oxide etcher (BOE) wet etch-
10-1 On : 4MΩ ing to remove the sacrificial oxide. After the cav-
Off : 5-9 GΩ ity is formed, Al is sputtered to seal the etch hole.
10-2 The vacuum level is around 0.01 mTorr during the
deposition. In other mean, the encapsulation itself
is assumed to be around the same order of vacuum
10-3
0 2 4 6 8 10 level as the chamber.
No.of cycles (106) Another approach of low temperature WLE
using vapor hydrofluoric acid to release the
Fig. 16.25 Contact resistance of on and off state for one
million cycles devices instead of BOE is shown in Fig. 16.27.

a b
PROCESS FLOW
MEMS DEVICE Sealed hole

(1) 4umlow stress SiO2 on Si3N4 (4) Etch holes for release

(2) 1umlow stress SiO2 as stepper (5) Etch sacrifical SiO2 layer

(3) 1umSi3N4 + 2000A amorphos silicon (6) Seal the holes with Al sputtering Cross-section

Fig. 16.26 (a) Schematics of process flow for low temperature wafer level MEMS encapsulation. (b) Encapsulation
using Al with sealed opening and cross section of encapsulation (Soon et al. 2012)
464 Y. Qian and C. Lee

b
a

Si SiO2 Mo AIN AJ

Fig. 16.27 (a) Process flow for wafer level encapsulated microelectromechanical system (N/MEMS) switches
metallic device (b) cross-section of an encapsulated 2015)
metallic device (Qian and Development of nano/

Vapor hydrofluoric acid release is less likely to a timely manner. The field of wearable sensors
have stiction. Commonly used metal in CMOS has developed substantially over the past decade,
process like Al cannot survive in BOE but it does with most solutions targeting healthcare
not react with Vapor hydrofluoric acid at all. applications via the transduction of physical
Figure 16.27a shows the process flow, after for- parameters such as heart rate (Chiarugi et al.
mation of the active devices, the whole wafer is 2008; Paradiso et al. 2005), respiration rate
covered with SiO2, followed by etching to define (Chiarugi et al. 2008; Merritt et al. 2009; Rovira
the active area. A 500 nm AlN layer is deposited et al. 2011), oxygenation of the blood (Rothmaier
by sputtering and pattern for releasing hole. The et al. 2008), skin temperature (Jung et al. 2006),
devices are then released by VHF. The sealing is bodily motion (Hasegawa et al. 2008; Lorussi
conducted by 1 μm SiO2 deposition. Lastly, the et al. 2004; Lorussi et al. 2005), brain activity
metal pad is opened for Al interconnection. (Devot et al. 2007), and blood pressure (Espina
There are several advantages of low tempera- et al. 2006). As a component of biotelemetry
ture wafer level encapsulation technology, such systems, these sensors leverage the wireless
as (1) Low-temperature processing—the technol- infrastructure to form a IoT node capable of
ogy is suitable for packaging of MEMS devices relaying the physiological or environmental
that are sensitive to high temperatures and ther- information to the personnel in real-time. Novel
mally induced residual stress; (2) Low cost and fabrication concepts have been developed
simple packaging. The process does not require whereby robust sensors and biosensors are
cap-to-wafer alignment or a wafer-bonding step; implemented directly on flexible substrate such
and (3) High-yield process. that, sensors are able to withstand the rigors of
field-based use and exhibit uncompromising
16.2.2.3 Packaging for Wearable, resiliency to various forms of mechanical strain
Flexible Electronics and deformation that are typically encountered
Recent advances and developments in the field of during routine use.
wearable sensors with emphasis on devices that Flexible substrates such as Polyimide
are able to perform highly-sensitive analysis. (Kapton), polyethylene naphthalate (PEN), poly-
Novel fabrication methodologies have resulted ethylene terephthalate (PET), polytetrafluor-
in the demonstration of sensors able multiple oethylene (Teflon), among others have long
physical measurements (i.e. heart rate, EEG, been employed in the printed electronics indus-
ECG, etc.). Thereby providing added dimensions try. Owing to their intrinsic plasticity,
of rich, analytical information to the personnel in hydrophobicity, excellent dielectric and insulate
16 System Packaging and Assembly in IoT Nodes 465

(a) SU-8 packaging Schottky diodes Stimulating electrode (b) Bottom ASIC
gold disk Diodes Spiral inductor
ASIC chip

Inductor

Silver epoxy
Contact pad 300 mm

Fig. 16.28 (a) Schematic diagrams of SU-8-packaged biphasic stimulating electrodes; (b) Photomicrographs of
wireless neurostimulator showing a receiver coil, two the fabricated wireless neurostimulator
rectifying Schottky diodes, a stimulating ASIC chip, and

properties, thermal stability and a low coefficient ICs and flexible substrate are accomplished by
of thermal expansion, structural resiliency conductive epoxy, solder paste, and anisotropic
against repeated deformities forces, and compat- conductive film (ACF) respectively.
ibility with roll-to-roll fabrication processes, First, we introduce a fully integrated mini-
flexible substrates have served as the platform mally invasive compact polymer based wireless
of choice in various demonstrations of the wear- neurostimulator (Fig. 16.28). These microprobes
able electronics. on neurostimulator have been designed to be
Advances in microelectromechanical systems implanted to act as an interface between neurons
(MEMS) now allow nearly seamless integration and electronics for the stable detection of neural
of an array of sensors to collect information. spike signals or for efficient stimulation of
Microcontrollers have shrunk to the point that neurons. To avoid these issues, wireless opera-
they can be hidden away neatly and consume tion of the implanted neural stimulation devices
very little power. Wireless communication and is highly desirable. In order to realize such a
low power standards allow for a network of inde- wireless neural stimulator, wireless power trans-
pendently operating components to easily inte- mission from the outside of the body of the
grate with each other. Low-power LEDs and implanted devices in the body is used.
flexible display technology are close to bringing The designed wireless neurostimulator was
science fiction into reality. With very limited fabricated by SU-8-based surface micromachining
board space available, the packaging of the wear- technique (Fig. 16.29). Where the spiral inductor
able electronics is heavily relying on advanced for wireless power transmit is embedded inside the
technology like wafer level packaging (WLP) or Su-8 layer, only leave via on the Top Su-8 layer.
System-in-Package (SiP). The via is filled with 15-μm-thick electroplated
In the meantime, many concepts of how to gold. Then A 156-μm-deep SU-8 socket structure
integrate electronics are still new and very was then patterned in order to make sure that both
much in the prototyping phase. Form factor the diodes and the ASIC are completely enclosed
plays a major role in wearable technology—not by SU-8 socket. For bonding the diodes and ASIC
just for the processor, but also the entire system. board, a small amount of biocompatible conduc-
Chip-scale and wafer-based packaging is natu- tive epoxy was applied to the contact pads of the
rally found a home in wearables with early sockets. The ASIC chip and diodes were slid into
adopters. Package size and integration of com- the sockets with their contact pads side down.
mon functions are also important to wearables. Because solvent inside the biocompatible conduc-
Here, three cases are introduced to the advanced tive epoxy can affect the subsequent SU-8 process,
packaging technology for wearable, or flexible it should be completely evaporated. Thus, the sam-
electronics application. The bonding between ple was cured in a 95  C convection oven for 5 h to
466 Y. Qian and C. Lee

Fig. 16.29 Fabrication process flow of the wireless spiral inductor; (f) 156-μm-thick SU-8 sockets; (g) diodes
neurostimulator: (a) patterning of the bottom gold elec- and ASIC chip packaging into the sockets; (h) 100-μm-
trode; (b) formation of the base layer of the SU-8 socket thick SU-8 protection layer with the output via hole; (i)
platform; (c) conductor traces of the spiral inductor and patterning of the top gold stimulating electrode; and (j)
the contact pads; (d) SU-8 insulation on top of the spiral device completely released from the substrate
inductor with via holes; (e) conductor bridge over the

thoroughly evaporate the solvent of the conductive sweat, and surface temperature, as read by an
epoxy. Subsequently, the device was thoroughly Android smartphone app. All circuitry is solder-
sealed by 100-μm thick SU-8, and one of the via reflow integrated on a standard Cu/Polyimide
holes were created through the SU-8 sealing for flexible-electronic layer including an antenna,
the purpose of the output pad connection. but while also allowing electroplating for simple
The second case is an adhesive radio- integration of exotic metals for sensing
frequency identification (RFID) sensor bandage, electrodes.
which can be made completely intimate with Fabrication begins at the flexible electronic
human skin, a distinct advantage for chronologi- circuit board with a sheet of Dupont Pyralux
cal monitoring of biomarkers in sweat AC (18-μm-thick Cu foil clad to 12 μm Kapton)
(Fig. 16.30). In this demonstration, a commercial for the Cu wire. The electrodes of the sensor are
RFID chip is adapted with minimum components fabricated by electrodeposition of Pd and Ag on
to allow potentiometric sensing of solutes in the previously defined bare copper electrodes.
16 System Packaging and Assembly in IoT Nodes 467

Fig. 16.30 (a) Circuit layer layout detailing antenna coil geometry and electronics placement. (b) Device photographs
to illustrate electronic and sensor features

An alloy of Sn/Ag (96.5/3.5%) no-clean solder compress interconnection techniques. First, an


paste (Superior Flux & Mfg. Co. P/N: 3033-85) ACF is laminated to a substrate. Pressure and
is applied to each pad using an air-metered temperature are applied during the lamination
syringe dispenser, or higher volume fabrication process to ensure positioning accuracy, unifor-
by stencil printing. The chips are then placed, the mity, etc. Then, the bumps on integrated circuits
flexible circuit board is then placed in a program- (ICs) are aligned with the electrode prepared on
mable, convection reflow oven that applies a the flexible substrates. Finally, the IC chip is
heating profile according to the solder supplier pressed onto the substrate at a specified high tem-
specification. The reflow oven is connected to a perature and pressure. The conductive particles
pressurized N2 gas supply, which is used to purge are trapped between the bumps and tracks, while
the sample chamber and prevent oxidation of the the adhesive resin is squeezed out. The intercon-
Cu surfaces. nections are established by the compressive force
In applications where fluid contact could between the electrodes due to the shrinkage of the
occur, such as sweat sensing, the electronics adhesive after curing. Consequently, the electrical
must be hermetically sealed. This is easily conduction is restricted to the z-direction and the
achieved by masking the sensor electrodes with electrical isolation is maintained in the x–y plane.
Kapton Tape (silicone adhesive), and conformal Additionally, it is necessary to know the deforma-
coating the flexible circuit and electronics with tion of conductive particles during die mounting
10 μm of Parylene. Ideally, this is performed process. Generally, the conductive particles con-
before sensor functionalization, to ensure that sist of a polymer core material coated with a thin
the Kapton masking tape does not damage or metal layer (nickel and/or gold), and then
contaminate the final sensor surfaces. insulated again with a polymer material on its
Another commercially available method for exterior surface. The conductive particles in an
the packaging of flexible electronics is using ACF are of ball shape before die bonding. When
anisotropic conductive film (ACF). The principles the IC chip is pressed onto the glass in a high
of the ACF are that the electrical connections are temperature and pressure, the particles are
established through conductive particles and the entrapped and deformed between the bumps of
mechanical interconnections are maintained by the chip and the glass substrate, and then the
the cured adhesive. For example, Fig. 16.31 external insulation layer of the particle is damaged
shows a typical process based on ACF’s thermo- and the interior metal layer is exposed. Thus, the
468 Y. Qian and C. Lee

Fig. 16.31 Anisotropic Transparent ACF


conductive film Z
interconnection process

circuit X
Heat Y
Pre-bonding substrate
+
Pressure
Insulating layer
Un-deformed particles

Au layer

Chip

Bump on chip

Top protective film


Exposed metal layer

Heat
Bonding
+
Pressure
Deformed particles
Polymer core
Damaged insulating layer

Fig. 16.32 (a) A fabricated device under a stretching deformation. (b) Images of use of an epidermal hydration sensor
(Huang et al. 2013)

electrical connections are established through multiplexed mapping. The devices contain
conductive particles. miniaturized capacitive electrodes arranged in a
The ACF is capable of bonding ICs on a matrix format. The flexible electronics are a
flexible substrate, it is also possible to connect sandwiched structure Polyimide (PI; 1 μm
two flexible electrodes. As Fig. 16.32 presents a thick), Au (400 nm) and another layer of PI
hydration monitor that uses ultrathin, stretchable (1 μm). Selected regions of PI are removed by
sheets with arrays of embedded impedance reactive ion etching (RIE) to form via contacts.
sensors for precise measurement and spatially To obtain the signal from the capacitive
16 System Packaging and Assembly in IoT Nodes 469

electrodes, a flexible wire is attached to the func- the same semiconductor equipment and processes.
tional electronics with ACF. As a result, the next generation of optical
components needs to be low cost and compatible
16.2.2.4 Packaging with Silicon with high-volume manufacturing.
Photonics An obvious way to achieve this requirement is
It’s no secret that this IoT benefits photonics prod- to increase the integration of optical devices. In
uct manufacturers; specifically, those companies spite of many technology developments since the
in the optical fiber communications business, beginning of optical communications, one should
RFID and bar-code reader technology suppliers, admit that the integration level has remained low
imaging and sensor companies, and hosts of other compared to the microelectronic technology, due
high-technology equipment suppliers. Exponen- to various reasons, particularly the size of
tially increasing Internet traffic along with the integrated optics components and the heteroge-
IoT will place a huge burden on next generation, neity of photonic materials.
cloud resident data centers. The new requirements Consider a packaging between photonics and
include: higher system performance, coping with ICs, the 3-D technology is attractive since it
higher power consumption via more effective allows two or more different process
cooling concepts, faster interconnect speeds technologies to be stacked and interconnected.
(between components, modules, cards, and 3-D integration can be realized in various
racks). The challenge for designers is to provide technologies. The 3-D technology allows two or
faster compute/storage/networking systems with more different process technologies to be stacked
more effective bandwidth and performance. and interconnected. In an example of advanced
Silicon photonics is a new technology that mixed material short-wave infrared focal plane
should at least enable electronics and optics to be arrays Error! Reference source not found (Keast
integrated on the same optoelectronic circuit chip, et al. 2008), the 3-D technology enables complex
leading to the production of low-cost devices on focal architectures with ideal fill factors and a
silicon wafers by using standard processes from broad range of spectral sensitivities. 3-D integra-
the microelectronics industry. For more than tion technology has been used to fabricate four
30 years, the optical interconnect solutions have different focal planes, including: a two-tier
been gradually implemented from very long to 64  64 imager with fully parallel per-pixel
short distances due to the continually increasing A/D conversion; a three-tier 640  480 imager
bandwidth demand. As a result, optical fiber consisting of an imaging tier, an A/D conversion
networks have been progressively deployed, in tier, and a digital-signal-processing tier; a
long links first, down to the enterprise local area two-tier 1024  1024 pixel, imaging module
network, metropolitan, and access networks. Com- for tiling large mosaic focal planes 3; and a
pared with the electrical interconnect solutions, three-tier Geiger-mode avalanche photodiode
optical links exhibit several demonstrated (APD) 3-D light detection and ranging
advantages, such as low signal attenuation, lower (LIDAR) array (Fig. 16.33).
dispersion, and crosstalk leading to superior band- In IoT application, system complexity and
width by distance products. Another significant functionality are increasing. Everything from
advantage of the optical solutions is the immunity front end through signal processing and storage
of the signal transmission to electromagnetic inter- to sensors and actuators, are housing with
ference, which makes them very well suited to integrated energy sources. This can be satisfied
wireless systems in IoT. However, optical links by heterogeneous integration (see Fig. 16.34) of
require more research effort. Silicon photonics, different technologies, leading to the best
using highly confined optical modes in a silicon compromise in systems functionality and cost of
waveguide, appears as a unique opportunity to ownership for higher multifunctional converging
cope with this integration challenge. It will allow systems. Heterogeneous integration (Reichl et al.
manufacturers to build optical components using 2005) requires a set of technologies like thinning
470 Y. Qian and C. Lee

Fig. 16.33 Cross-


sectional SEM of a single
pixel from a functional
three-tier Geiger-mode
avalanche photodiode
(APD) 3-D LIDAR array
(Keast et al. 2008)

Fig. 16.34 Conceptual structure of 3-D heterogeneous optoelectronic integrated system-on-silicon for intelligent
vehicle system with variable signal-processing function depending on moving speed (Lee et al. 2009b)
16 System Packaging and Assembly in IoT Nodes 471

of wafers and components, vertical system harvesters seems to be a viable candidate as a


integration, functional layer technologies, assem- power source for WSN application, because of its
bly of thin components, thin interconnect compactness, environment friendliness, unlim-
technologies, wafer molding, bumping/ball plac- ited lifetime, and autonomy (Xie et al. 2009).
ing, and dicing. All these technologies need to be Additionally, Energy harvesters can be made
optimized and adapted into a modular integrated with similar process like MEMS sensors and
process flow. hence enables technologies like SiP to provide
System-on-silicon for intelligent vehicle sys- the entire IoT node as a single packaged chip.
tem with variable signal-processing functions Energy harvesters can ultimately open up the
depending on moving speed is shown in room for providing SoC level of integration and
Fig. 16.34. The proposed system including WLP of the SoC based IoT module.
image sensor stacked with Analog-to-digital con- The power unit of a wireless sensor node may
verter is for high-performance image processing. be supported by a power scavenging unit such as
MEMS sensor, optical sensor, and RF IC are for batteries or solar cells. In some application
high-sensitivity sensing of moving speed. Three- scenarios, replenishment of power resources
dimensional memory and 3-D processor are for might be impossible. Therefore, sensor nodes
high-performance computing. Optical intercon- are implemented in a “deploy and forget” sce-
nection is for high-speed information network- nario since the cost of battery replacement will
ing. Microfluidic channels are for heat sinks for be prohibitive and impractical (Mathúna et al.
high power consumption. Such heterogeneous 2008). Sensor node lifetime, therefore, shows a
integration of LSI, MEMS, and optoelectronic strong dependence on battery lifetime. Not to
devices is very difficult due to incompatible mention that the sensors and electronics of a
processes. wireless sensor node will be far smaller than 1 cm
3
, in this case, the battery would dominate the
16.2.2.5 Packaging for Energy system volume. Therefore, the development of
Harvesters alternative power sources for wireless sensor
The advancement in packaging of semiconductor nodes is acute. Significant research is ongoing to
chips has been keenly focused reducing on the deliver power from the environment using energy
overall product size by integrating more func- harvesting technology such as solar energy,
tional units into a single package. In case of vibration/motion energy, thermal gradient, et al.
packaging of IoT, which employs multiple func- (Niyato et al. 2007; Raghunathan et al. 2006),
tional units, the packaging trend has been shown which can deliver energy directly to a wireless
from the PCB level system packaging to more sensor load or to a storage element such as a
complex multichip packaging like SiP solutions rechargeable battery or capacitor. Some of the
to highly complicated SoC approach. This inte- major benefits of energy harvesting technology
gration of multiple functional devices into a sin- for WSNs are stated as (Thomas et al. 2006).
gle package as in the case of SiP or into a single Firstly, energy harvesting solution can reduce
chip like SoC also provides means of achieving the dependency on battery power and provide
high performance devices with improved reli- long-term solutions. With the advancement of
ability and at a lower cost. Currently approaches microelectronics, the power consumptions of sen-
like SiP look very attractive for IoT as it can sor nodes have been reduced significantly. Hence
incorporate most of the functional units like con- harvested ambient environmental energy may be
troller IC, MEMS sensors and actuators into a sufficient to replace battery completely. Sec-
single package. However due to use of large ondly, energy harvesting solution would reduce
batteries as the power source of a sensor network installation and maintenance cost. Self-powered
to operate autonomously, single packaged chip sensor nodes do not require power cables wiring
for entire sensor node is not possible. Compared and conduits, hence they are very easy to install.
with traditional batteries, MEMS energy The heavy installation cost can be reduced greatly
472 Y. Qian and C. Lee

(Kheng et al. 2010). Clearly, it can be deduced An example of a wideband energy harvesting
that energy harvesting technology is a promising system (Liu et al. 2012a) is illustrated in
solution to power WSNs for extended operation Fig. 16.35a. It contains a low-resonant-frequency
with the supplement of the energy storage cantilever (denoted as LRF cantilever) and a
devices. high-resonant-frequency cantilever (denoted as
There are various sources of energy available HRF cantilever). The LRF cantilever as shown
for energy harvesting, and indeed, many works in Fig. 16.35b comprises of a silicon supporting
have been presented on generating electrical beam deposited with a PZT layer and silicon
energy from solar energy (Lee et al. 1994; inertial mass. The PZT layer consists of ten PZT
Randall 2003), temperature gradients (B€ottner stripes electrically isolated from one another. The
et al. 2002; Stordeur et al. 1997; Xie et al. HRF cantilever (Fig. 16.35c) has similar
2010), ambient radio frequency (RF) (Hudak dimensions and configuration with that of the
and Amatucci 2008; Rabaey et al. 2000), LRF cantilever, but without an inertial mass.
vibrations (Mitcheson et al. 2004; Roundy et al. The resonant frequencies of the LRF and HRF
2002), and human motions (Shenck and Paradiso cantilevers are 36 Hz and 618 Hz, respectively.
2001). Thermoelectric power generators (TPGs) The LRF and HRF cantilevers are assembled with
scavenge energy from the body heat or any heat a pre-determined stop-spacing. The vibration
sources available in environment, where there is amplitude of the LRF cantilever is larger than
a temperature gradient to generate electricity. the stop-spacing and it is able to respond to ambi-
Ideally, all devices lose power as heat and ent vibrations nearby its resonant frequency and
hence this heat can be used collectively to gener- further extend over a wider frequency bandwidth
ate electricity in turn. Researchers have by the HRF cantilever which acts as a frequency
demonstrated that a TPG chip of 1 cm  1 cm up-converter (FUC) stopper. In the meantime, the
size at 5 K temperature difference across two HRF cantilever is triggered by the inertial mass of
sides of device provides an open-circuit voltage the LRF cantilever into a high frequency self-
of 16.7 V, and equivalent output power of oscillation. Due to the piezoelectric effect, the
1.3 μW under matched load resistance (Xie electric current is generated once the PZT layer
et al. 2009; Xie et al. 2011). Such TPGs are on both LRF and HRF cantilevers deform. The
quite attractive to be considered as a promising advantages of this assembly strategy in
power source in the WSN nodes. employing the HRF cantilever as a FUC stopper
Ambient vibrations are present in many are that it protects the excitation oscillator from
environments, such as air and water flows, damage during vibration and broadens the opera-
structures (building, towers, bridges, etc.), indus- tion range. The system also converts ambient low
trial machines, equipment and transportations frequency vibration to a high frequency self-
(washing machines, machines tools, oscillation. Using MEMS technology, the pro-
automobiles, air planes, etc.), movement of posed system is applied to a piezoelectric-based
human bodies (walking, running, pulse beats, energy harvesting system by the use of cantilever
and blood flow) and so on. Vibration-based beams deposited with piezoelectric PZT material.
energy harvesting techniques convert energy in On the other hand, a demonstration of wafer
the form of mechanical vibrations in the ambient level packaging using thin film encapsulation for
environment into electricity energy by using pie- thermoelectric power generator, shows the ease
zoelectric, electromagnetic, and electrostatic of process integration of WLE technology for
mechanisms (Beeby et al. 2006; Liu et al. packaging with the MEMS process technology.
2012a; Liu et al. 2012b; Liu et al. 2012c; Lee The main process steps and images are illustrated
et al. 2009a; Liu et al. 2011a; Liu et al. 2011b; in Fig. 16.36. The 0.7 μm-thick poly-Si is depos-
Liu et al. 2012d; Yang and Lee 2009; Yang and ited at 580  C in furnace by LPCVD as the
Lee 2010; Yang et al. 2012; Yang et al. 2010a; thermoelectric layer, followed by dry etching of
Yang et al. 2010b). the poly-Si layer to form thermoelectric leg
16 System Packaging and Assembly in IoT Nodes 473

Fig. 16.35 Piezoelectric


based wideband energy
harvesting system (Liu
et al. 2012a)

patterns with 5 μm in width and 16 μm in length last step, 0.7 μm-thick aluminum is deposited to
(see Fig. 16.36). Afterwards, n-type and p-type cover the whole device area as the heat sink layer
thermoelectric legs are formed by implanting one (see Fig. 16.36g). The cross section of the
half of poly-Si with phosphorous ions at 180 keV fabricated device in (see Fig. 16.36j) shows that
energy and other half with boron ions using the thermal legs are embedded into the top
80 keV in the other half to form the p-type cavities and the bottom cavities. The sealed bot-
thermoelectric legs respectively. The doping tom cavity is about 30 μm deep and 42 μm wide,
dose of 1016 cm 2 is used in both the cases. while the encapsulated top cavity is 2 μm high
The doped poly-Si is annealed in a furnace at and 35 μm wide. The lengths of both bottom and
1000  C for 30 min to activate the implanted top cavities are about 670 μm.
dopants. SiO2 insulating layer deposition is car- Thus from the above example of thermoelectric
ried out, following which contact via opening, power generator, the flexibility of adopting WLE
aluminum layer deposition and aluminum etch- onto the device fabrication process is quite straight
ing is performed to form electrical interconnects forward and does not demand any more
between the p-type and n-type thermoelectric complications. This WLE technology can be read-
legs in series (see Fig. 16.36b). The bottom cav- ily used for encapsulation of IR sensors as well.
ity and top cavity are created with the Hence from the presented case study example, it
pre-described process of wafer-level sealing and can be seen that the advancement in MEMS fabri-
encapsulation—first perforating the holes, cation and packaging technology has improved the
followed by DRIE, and isotropic etching silicon possibility of providing many functional ICs along
(see Fig. 16.36c), sealing bottom cavities (see with the MEMS sensors and actuators to be
Fig. 16.36d), patterning USG sacrificial layer, fabricated onto a single wafer and diced. After
opening etching holes (see Fig. 16.36e), and dicing, the chips can be effectively packaged into
finally removing the USG sacrificial layer and a single package to provide extremely small and
sealing the top cavities (see Fig. 16.36f). In the efficient IoT node at relative low cost.
474 Y. Qian and C. Lee

Fig. 16.36 Process flow to


fabricate thermoelectric
power generator and
corresponding device
images (OM or SEM)
are shown in right
(Xie et al. 2011)

from 3 to 5 μm and (3) long-wave infrared


16.3 Case Study of Infrared Sensor (LWIR) from 8 to 12 μm. Infrared sensors have
Packaging widespread applications including security, sur-
veillance, search and rescue, firefighting, night
In this section, we will consider the packaging of vision, traffic systems, law enforcement, process
infrared (IR) sensors as the case to understand the control, and predictive maintenance. Basically
basic functionality of a MEMS sensor and its IR detectors are classified into two major clas-
packaging requirements, followed by technolog- ses—photon detectors and thermal detectors. The
ical improvements and finally state-of-art pack- photon detectors utilize quantum well structures
aging technology. band gap designed for particular IR wavelength,
The infrared spectrum is part of the electro- and has very sensitivity and shorter response
magnetic spectrum, with wavelengths ranging time, but demands constant cooling hence
from about 1 to 1000 μm. The most attractive making the whole system heavy and costly. On
application lies especially in the specific wave- the other hand, the thermal detectors work on the
length regimes ranging from 1 to 12 μm, and are principle of change of temperature dependent
categorized as (1) short-wave infrared (SWIR) material and physical characteristics (like resis-
from 1 to 3 μm; (2) mid wave infrared (MWIR) tance, ferroelectricity, pyroelectricity, coefficient
16 System Packaging and Assembly in IoT Nodes 475

a c
Unit Cell Design
IR Radiation
Pt=2.3 torr

Thermal conductance [W/ °c]


Microbridge

10-5
Pitch
Gt

10-6 Gs
CMOS input cell
Gr
Gq

10-7
10-5 10-4 10-3 10-2 10-1 100 101 102 103
Pressure [Torr]
b d
50
radiation

40
Responsitity (V/W)
Gradiation
30 Xe
Gsolid
F-22
20
Air
Gsolid 10

Ggas 0
0.01 0.1 1 10 100
Pressure (torr)

Fig. 16.37 (a) Schematics of IR sensor array and a air convection, radiation and overall thermal conductance
single unit cell design (Kohin et al. 2004); (b) schematics with respect to pressure level in the sensor package (Chou
showing the incoming radiation and the various thermal et al. 1995); (d) shows the dependence of IR pixel
loss path in a single IR sensor pixel structure; (c) shows responsivity with respect to pressure level in the package
the dependence of thermal loss due to solid conduction, with different gases (Liddiard 1984)

of thermal expansion, resonant frequency, etc.) loss path—(1) radiation loss due to the heated
of the devices and have lower sensitivity and membrane and defines the fundamental limit of
longer response time, but doesn’t demand the thermal loss, (2) convection loss due to the
cooling and so are more cost effective. With the gas present in between the active sensor mem-
advancement of MEMS technology, thermal IR brane to outside environment or substrate and
detectors are made with very high sensitivity and (3) conduction loss due to the mechanical leg to
lower response time. Most of the IR detectors support the sensor active membrane. It can be
consist of the following parts, (1) absorber mem- seen from Fig. 16.37c, d, that air convection loss
brane to absorb IR radiation, (2) sensor structure is the dominating factor for the overall loss for a
to convert IR radiation to different read out pressure level of more than 0.1 Torr. It is highly
parameter, (3) long mechanical legs to provide desirable to operate the IR detectors in low pres-
high thermal isolation from substrate and sure; thus making vacuum packaging inevitable
mechanical stability at the same time. One of for IR sensors.
the most important considerations for IR The commercially available IR detector usually
detectors is the thermal losses. As shown in involves four levels of packaging and assembly:
Fig. 16.37b, there are three possible thermal (1) Pixel level packaging—vacuum packaging of
476 Y. Qian and C. Lee

Fig. 16.38 Various levels of device assembly and integration for IR sensor product (Yole Developpement 2012)

individual pixel is achieved through metal can packaging technique, the IR sensor array
micromachining or wafer bonding techniques; is mounted on a metal plate with a metal can
(2) Array level packaging—vacuum packaging of crimped on top of it. The metal package can be
an array device is also possible through wafer attached to a heat sink, there by dissipating, the
bonding technique and generic chip level packag- excess heat from the sensor array, which is highly
ing technologies; (3) Integration of vacuum pack- significant for high sensitive IR sensors. Thermal
aged IR detector array with electronic circuitry on compounds are also used to improve heat transfer
PCB level to perform various readout operations between the device case and the heat sink. Since the
and pixel to pixel compensation schemes and may device case is one of the electrical connections, an
also include IR lenses for focusing; (4) Assembly insulator may be required to electrically isolate the
of integrated PCBs on to a compact casing to component from the heat sink. Insulating washers
deliver the final product that can be easily handled. made of mica or other materials with good thermal
The four levels of packaging and assembling for conductivity are often used (Fig. 16.39).
IR detectors is shown in Fig. 16.38. The most attractive feature of metal can pack-
aging is that almost all the surface of the package
are metal, hence provides better heat dissipation
16.3.1 Conventional Packaging for IR and durability than most of other available pack-
Sensors aging technologies. Additionally, Metal-glass
seal is hermetic to protect the semiconductor
The conventional packaging at chip level involves from liquids and gases. However, on the other
packaging individual die in a chip carrier using hand, metal can packaging is more expensive
either a silicon or germanium window. In this than the counterpart solutions using plastic or
approach, the die are mounted in a chip carrier, ceramics. Also, achieving high level vacuum
usually with a solder perform attachment. The within the package is quite challenging and so
package and lid are placed in a vacuum chamber the gas convection in the metal can package
and vacuum baked at elevated temperature to limits the detector’s performance.
outgas the package and lid, prior to sealing the Due to the limitations of metal can packaging,
package. Once outgassing is complete, the lid is ceramic and metal based flip chip packaging
sealed to the package using a solder seal. became extremely popular as higher levels of
Depending upon the application, different packag- vacuum could be achieved with these
ing approaches are considered like metal can technologies (Hunter et al. 2008). The flip-chip
packaging or ceramic packaging. packaging provided smaller packaging, and also
Metal packaging is usually used in applications enabled much more components to be integrated
like automotives, where good heat conduction and into a single package as shown in Fig. 16.40.
durability at higher temperature are desired. In the Some of the key components are IR window
16 System Packaging and Assembly in IoT Nodes 477

Fig. 16.39 (a) Schematic demonstration of metal can technologies and (c) shows metal can packaged thermo-
packaging process for IR sensor array (Xu et al. 2010), pile based IR sensor array (Du and Lee 2002)
(b) metal can packaged IR sensor by OMRON

with antireflection coating to allow IR radiation 16.3.2 Wafer Level Packaging for IR
to reach the IR sensor array without getting Sensors
absorbed or reflected. Another important compo-
nent is the thermistor, which enables to deter- Wafer level packaging of IR sensors involves
mine the temperature of the sensor array addition of extra process technology to the sensor
dynamically and regulate the temperature in the fabrication process in order to achieve packaging
package by providing feedback input to the ther- of IR sensor in array or pixel level. The most
moelectric cooler. The thermoelectric cooler exciting advantage provided by WLP for IR sen-
(or TE stabilizer) is one of the key components sor is that very high vacuum levels can be
in photon IR detectors, as smallest increase in achieved without the use of external components
temperature can elevate the thermal noise in the like getters. Additional WLP reduces the cost
system thereby making device performance dete- and time to market by a large margin. Certain
riorate significantly. As IR sensors need high application demands the use of particular wave-
level vacuum packaging, the key challenge was length IR signal, in which case the optical filter
with the outgassing of the deposited material wafer can be directly bonded with the IR sensor
layers after vacuum packaging. Hence electri- to miniaturize the overall sensor module.
cally activated getters were used to restore back Recently, low-temperature wafer-level bonding
the vacuum level inside the package even after technologies have also been proposed for optical
sealing it. The getter is a material, which went device fabrication and packaging.
provided with certain amount of energy, reacts In the wafer bonding technology, the active IR
with the gas molecules in the package, thereby sensor array is fabricated on a silicon MEMS
providing a higher degree of vacuum. wafer and corresponding companion silicon lid
478 Y. Qian and C. Lee

Antireflected
a germanium window b
Lower periphery metalized
to premit soldering

Microbolometer array
Pads for Al bonds

Thermistor TE power leads

Metalized for solder

Pads for Al bonds


TE stabilizer
(BeO plates) 88 pins

AIO frame
c
Mounting holes

Copper/Tungsten
baseplate
OFHC copper
pumpout tube
Zr getter inside
pumpout tube

Crimp seal

Fig. 16.40 (a) Vacuum package for vanadium oxide packaged IR imager [Courtesy: ULIS-IR] and (c) metal
based microbolometer uncooled infrared focal plane packaged IR imaging module from Multispectral Imaging
array (IRFPA) (Capper and Elliott 2013), (b) completely Inc (Hunter et al. 2008)

wafer is fabricated with the seal ring to cover the work has been discussed. Following which, the
entire sensor array in the first wafer. The lid various packaging technologies were
wafer provides the IR window for the sensor explained as a case study of packaging
array. Vacuum packaging is carried out by plac- MEMS based IR sensor. Finally, the more
ing the MEMS wafer and lid wafer in a vacuum futuristic approach of using energy harvesting
chamber, vacuum-baking them at elevated as power source and ultimately integrating the
temperature to outgas, and finally sealing the whole IoT system within a single package has
wafers using either a solder seal or a been explained. Thus it is can be confidently
thermocompression bond seal as in Fig. 16.41 stated that the increasing efforts with keen
and 16.42. focus of semiconductor manufacturing and
packaging industries in “More than Moore”
technological roadmap; by integrating multi-
ple functional chips either through packaging
16.4 Summary
(SiP approach) or in process level (SoC
approach) will enable in achieving extremely
In summary, the importance of packaging of
small IoT modules at lower cost, thereby
individual chips or as a system is explained.
making it a disruptive technology for a wide
Various conventional and more advanced
range of applications.
WLP technologies for autonomous sensor net-
16 System Packaging and Assembly in IoT Nodes 479

Fig. 16.41 (a) Schematic demonstration of wafer level the filter cap wafer using Au-Au thermocompression
packaging of IR sensor using wafer bonding approach bonding (Xu et al. 2010) and (c) wafer level, vacuum
(inset: schematic cross section showing the vacuum pack- packaged thermopile based IR sensor later assembled
age IR sensor array) (Xu et al. 2010) (b) into a metal can package (inset: wafer level packaged
(i) micromachined IR sensor fabrication process, single unit array of IR sensor with the close up view of
(ii) filter cap wafer fabrication process, (iii) wafer-level thermopile structure) (Xu et al. 2011)
packaging of the sensor by bonding the sensor wafer with

Fig. 16.42 (a) The final product of wafer level packaged close up view of wafer level packaged single IR sensor
IR sensor array [Courtesy: Raytheon Company]; (b) array (Xu et al. 2010)
wafer level packaged IR sensors in a 6 in. wafer and (c)
480 Y. Qian and C. Lee

References produced by artificial hollow fiber. J. Micromech.


Microeng. 18(8), 085014 (2008)
V. Hong, B. Lee, D.L. Christensen, D. Heinz, E.J. Ng,
S.P. Beeby, M.J. Tudor, N. White, Energy harvesting
C.H. Ahn, Y. Yang, T.W. Kenny, XY and Z-axis
vibration sources for microsystems applications.
capacitive accelerometers packaged in an ultra-clean
Meas. Sci. Technol. 17(12), R175 (2006)
hermetic environment, in 2013 Transducers and
R. Bloss, Numerous recent shows highlight the latest in
Eurosensors XXVII: The 17th International Confer-
sensor innovations. Sens. Rev. 32(2), 101–105 (2012)
ence on Solid-State Sensors, Actuators and
H. B€ottner, Thermoelectric micro devices: current state,
Microsystems (TRANSDUCERS & EUROSENSORS
recent developments and future aspects for technolog-
XXVII) (2013), pp. 618–621
ical progress and applications, in Twenty-First Inter-
X. Huang, H. Cheng, K. Chen, Y. Zhang, Y. Zhang,
national Conference on Thermoelectrics, 2002.
Y. Liu, C. Zhu, S.-C. Ouyang, G.-W. Kong, C. Yu,
Proceedings ICT’02 (2002), pp. 511–518
Epidermal impedance sensing sheets for precision
R.N. Candler, W.-T. Park, H. Li, G. Yama, A. Partridge,
hydration assessment and spatial mapping. IEEE
M. Lutz, T.W. Kenny, Single wafer encapsulation of
Trans. Biomed. Eng. 60(10), 2848–2857 (2013)
MEMS devices. IEEE Trans. Adv. Packag. 26(3),
N.S. Hudak, G.G. Amatucci, Small-scale energy
227–232 (2003)
harvesting through thermoelectric, vibration, and
P. Capper, C. Elliott, Infrared Detectors and Emitters:
radiofrequency power conversion. J. Appl. Phys. 103
Materials and Devices (Springer US, New York,
(10), 101301 (2008)
2013)
S.R. Hunter, G.S. Maurer, G. Simelgor, S. Radhakrishnan,
K. Chen, C. Premachandran, K. Choi, C. Ong, X. Ling,
J. Gray, K. Bachir, T. Pennell, M. Bauer, U. Jagadish,
A. Ratmin, M. Pa, J. Lau, C2W low temperature
Development and optimization of microcantilever
bonding method for MEMS applications, in IEEE
based IR imaging arrays, in SPIE Defense and Secu-
Proceedings of Electronics Packaging Technology
rity Symposium (2008), pp. 694013–694013-12
Conference, Singapore, 2008 (2008), pp. 1–7
Y. Jin, Z. Wang, P. Lim, D. Pan, J. Wei, J. Wong, MEMS
F. Chiarugi, I. Karatzanis, G. Zacharioudakis, P. Meriggi,
vacuum packaging technology and applications, in
F. Rizzo, M. Stratakis, S. Louloudakis, C. Biniaris,
2003 5th Conference (EPTC 2003) Electronics Pack-
M. Valentini, M. Di Rienzo, Measurement of heart
aging Technology (2003), pp. 301–306
rate and respiratory rate using a textile-based wearable
S. Jung, T. Ji, V.K. Varadan, Point-of-care temperature
device in heart failure patients. Comput. Cardiol. 35,
and respiration monitoring sensors for smart fabric
901–904 (2008)
applications. Smart Mater. Struct. 15(6), 1872 (2006)
B. Chou, J.-S. Shic, Y.-M. Chen, A highly sensitive Pirani
C.L. Keast, B. Aull, J. Burns, C. Chen, J. Knecht,
vacuum gauge, in The 8th International Conference
B. Tyrrell, K. Warner, B. Wheeler,
on Solid-State Sensors and Actuators, 1995 and
V. Suntharaligam, P. Wyatt, Three-dimensional inte-
Eurosensors IX. Transducers’ 95 (1995), pp. 167–170
gration technology for advanced focal planes, in MRS
S. Devot, A.M. Bianchi, E. Naujokat, M.O. Mendez,
Proceedings (2008), pp. 1112-E01-02
A. Brauers, S. Cerutti, Sleep monitoring through a
T.Y. Kheng, Analysis, design and implementation of
textile recording system, in 29th Annual International
energy harvesting systems for wireless sensor nodes,
Conference of the IEEE Engineering in Medicine and
PhD Thesis, National University of Singapore,
Biology Society. EMBS 2007 (2007), pp. 2560–2563
Singapore (2010)
C.-H. Du, C. Lee, Characterization of thermopile based on
R. Knechtel, M. Wiemer, J. Fr€ omel, Wafer level encapsu-
complementary metal-oxide-semiconductor (CMOS)
lation of microsystems using glass frit bonding.
materials and post CMOS micromachining. Jpn.
Microsyst. Technol. 12(5), 468–472 (2006)
J. Appl. Phys. 41(6S), 4340 (2002)
C.-T. Ko, K.-N. Chen, Wafer-level bonding/stacking
M. Esashi, Wafer level packaging of MEMS.
technology for 3D integration. Microelectron. Reliab.
J. Micromech. Microeng. 18(7), 073001 (2008)
50(4), 481–488 (2010)
J. Espina, T. Falck, J. Muehlsteff, X. Aubert, Wireless
M. Kohin, N.R. Butler, Performance limits of uncooled
body sensor network for continuous cuff-less blood
VOx microbolometer focal plane arrays, in Defense
pressure monitoring, in 3rd IEEE/EMBS International
and Security (2004), pp. 447–453
Summer School on Medical Devices and Biosensors,
J.F. Kuhmann, M. Heschel, S. Bouwstra, F. Baleras,
2006 (2006), pp. 11–15
C. Massit, Through wafer interconnects and flip-chip
Y.B. Gianchandani, O. Tabata, H.P. Zappe, Comprehen-
bonding: a toolbox for advanced hybrid technologies
sive Microsystems (Elsevier, Amsterdam, 2008)
for MEMS, in 13th European Conference on Solid-
H. Goldstein, Packages go vertical. IEEE Spectr. 38(8),
State Transducers, 1999 (1999), pp. 265–272
46–51 (2001)
J.H. Lau, Flip Chip Technologies (McGraw-Hill,
H. Guo, L. Lou, X. Chen, C. Lee, PDMS-coated
New York, 1996)
piezoresistive NEMS diaphragm for chloroform
J.H. Lau, C. Lee, C.S. Premachandran, Y. Aibin, Advanced
vapor detection. IEEE Electron Device Lett. 33(7),
MEMS Packaging (McGraw-Hill Professional, 2010)
1078–1080 (2012)
J.B. Lee, Z. Chen, M.G. Allen, A. Rohatgi, R. Ayra, A
Y. Hasegawa, M. Shikida, D. Ogura, Y. Suzuki, K. Sato,
high voltage solar cell array as an electrostatic MEMS
Fabrication of a wearable fabric tactile sensor
power supply, in MEMS’94, Proceedings, IEEE
16 System Packaging and Assembly in IoT Nodes 481

Workshop on Micro Electro Mechanical Systems C.Ó. Mathúna, T. O’Donnell, R.V. Martinez-Catala,
(1994), pp. 331–336 J. Rohan, B. O’Flynn, Energy scavenging for long-
C. Lee, Y.M. Lim, B. Yang, R.K. Kotlanka, C.-H. Heng, term deployable wireless sensor networks. Talanta 75
J.H. He, M. Tang, J. Xie, H. Feng, Theoretical com- (3), 613–623 (2008)
parison of the energy harvesting capability among C.R. Merritt, H.T. Nagle, E. Grant, Textile-based capaci-
various electrostatic mechanisms from structure tive sensors for respiration monitoring. IEEE Sensors
aspect. Sens. Actuators, A 156(1), 208–216 (2009a) J. 9(1), 71–78 (2009)
K.-W. Lee, A. Noriki, K. Kiyoyama, S. Kanno, P.D. Mitcheson, T.C. Green, E.M. Yeatman,
R. Kobayashi, W.-C. Jeong, J.-C. Bea, A.S. Holmes, Architectures for vibration-driven
T. Fukushima, T. Tanaka, M. Koyanagi, 3D heteroge- micropower generators. J. Microelectromech. Syst.
neous opto-electronic integration technology for 13(3), 429–440 (2004)
system-on-silicon (SOS), in 2009 I.E. International F. Niklaus, G. Stemme, J.-Q. Lu, R. Gutmann, Adhesive
Electron Devices Meeting (IEDM) (2009) pp. 1–4 wafer bonding. J. Appl. Phys. 99(3), 031101 (2006)
K. Liddiard, Thin-film resistance bolometer IR detectors. D. Niyato, E. Hossain, M.M. Rashid, V.K. Bhargava,
Infrared Phys. 24(1), 57–64 (1984) Wireless sensor networks with energy harvesting
V. Lindroos, S. Franssila, M. Tilli, M. Paulasto-Krockel, technologies: a game-theoretic approach to optimal
A. Lehto, T. Motooka, V.-M. Airaksinen, Handbook energy management. IEEE Wirel. Commun. 14(4),
of Silicon Based MEMS Materials and Technologies 90–96 (2007)
(Elsevier, Amsterdam, 2009) R. Paradiso, G. Loriga, N. Taccini, A wearable health care
H. Liu, C.J. Tay, C. Quan, T. Kobayashi, C. Lee, Piezo- system based on knitted integrated sensors. IEEE
electric MEMS energy harvester for low-frequency Trans. Inf. Technol. Biomed. 9(3), 337–344 (2005)
vibrations with wideband operation range and steadily S. Pourkamali, F. Ayazi, Wafer-level encapsulation and
increased output power. J. Microelectromech. Syst. 20 sealing of electrostatic HARPSS transducers, in
(5), 1131–1142 (2011a) Sensors, 2007 IEEE (2007), pp. 49–52
H. Liu, C.J. Tay, C. Quan, T. Kobayashi, C. Lee, A Y. Qian, Development of nano/microelectromechanical
scrape-through piezoelectric MEMS energy harvester system (N/MEMS) switches, PhD Thesis, National
with frequency broadband and up-conversion University of Singapore, Singapore (2015)
behaviors. Microsyst. Technol. 17(12), 1747–1754 J.M. Rabaey, M.J. Ammer, J.L. da Silva, D. Patel,
(2011b) S. Roundy, PicoRadio supports ad hoc ultra-low
H. Liu, C. Lee, T. Kobayashi, C.J. Tay, C. Quan, Investi- power wireless networking. Computer 33(7), 42–48
gation of a MEMS piezoelectric energy harvester sys- (2000)
tem with a frequency-widened-bandwidth mechanism V. Raghunathan, S. Ganeriwal, M. Srivastava, Emerging
introduced by mechanical stoppers. Smart Mater. techniques for long lived wireless sensor networks.
Struct. 21(3), 035005 (2012a) IEEE Commun. Mag. 44(4), 108–114 (2006)
H. Liu, C. Lee, T. Kobayashi, C.J. Tay, C. Quan, A new M. Rais-Zadeh, H.M. Lavasani, F. Ayazi, CMOS-
S-shaped MEMS PZT cantilever for energy harvesting compatible encapsulated silver bandpass filters, in
from low frequency vibrations below 30 Hz. IEEE Microwave Symposium (IMS’07) (2007),
Microsyst. Technol. 18(4), 497–506 (2012b) pp. 1301–1304
H. Liu, C. Lee, T. Kobayashi, C.J. Tay, C. Quan, Piezo- J.F. Randall, On the use of photovoltaic ambient energy
electric MEMS-based wideband energy harvesting sources for powering indoor electronic devices.
systems using a frequency-up-conversion cantilever Thesis, EPFL, Switzerland (2003)
stopper. Sens. Actuators, A 186, 242–248 (2012c) H. Reichl, R. Aschenbrenner, H. Potter, Hetero System
H. Liu, S. Zhang, R. Kathiresan, T. Kobayashi, C. Lee, Integration-Enabling Technology for Future Products
Development of piezoelectric microcantilever flow (Productions, Frankfurt/Main, 2005) pp. 45–49
sensor with wind-driven energy harvesting capability. H. Reichl, R. Aschenbrenner, M. T€ opper, H. P€otter, Het-
Appl. Phys. Lett. 100(22), 223905 (2012d) erogeneous integration: building the foundation for
F. Lorussi, W. Rocchia, E.P. Scilingo, A. Tognetti, D. De innovative products, in More than Moore (Springer,
Rossi, Wearable, redundant fabric-based sensor arrays New York, 2009) pp. 279–303
for reconstruction of body segment posture. IEEE M. Rothmaier, B. Selm, S. Spichtig, D. Haensse, M. Wolf,
Sensors J. 4(6), 807–818 (2004) Photonic textiles for pulse oximetry. Opt. Express 16
F. Lorussi, E.P. Scilingo, M. Tesconi, A. Tognetti, (17), 12973–12986 (2008)
D.D. Rossi, Strain sensing fabric for hand posture S. Roundy, P.K. Wright, K.S. Pister, Micro-electrostatic
and gesture monitoring. IEEE Trans. Inf. Technol. vibration-to-electricity converters, in ASME 2002
Biomed. 9(3), 372–381 (2005) International Mechanical Engineering Congress and
L. Lou, S. Zhang, W.-T. Park, J. Tsai, D.-L. Kwong, Exposition (2002), pp. 487–496
C. Lee, Optimization of NEMS pressure sensors with C. Rovira, S. Coyle, B. Corcoran, D. Diamond,
a multilayered diaphragm using silicon nanowires as F. Stroiescu, K. Daly, Integration of textile-based
piezoresistive sensing elements. J. Micromech. sensors and Shimmer for breathing rate and volume
Microeng. 22(5), 055012 (2012) measurement, in 2011 5th International Conference
M.P. Market, Technology trend report, Yole on Pervasive Computing Technologies for Healthcare
Développement, 2012 (PervasiveHealth) (2011), pp. 238–241
482 Y. Qian and C. Lee

H.R. Shea, Reliability of MEMS for space applications, in power generators. J. Microelectromech. Syst. 19(2),
Moems-mems 2006 micro and nanofabrication 317–324 (2010)
(2006), pp. 61110A–61110A-10 J. Xie, C. Lee, M.-F. Wang, H. Feng, Seal and encapsulate
N.S. Shenck, J.A. Paradiso, Energy scavenging with shoe- cavities for complementary metal-oxide-semiconduc-
mounted piezoelectrics. IEEE Micro 21(3), 30–42 tor microelectromechanical system thermoelectric
(2001) power generators. J. Vac. Sci. Technol. B 29(2),
L. Shiv, M. Heschel, H. Korth, S. Weichel, R. Hauffe, 021401 (2011)
A. Kilian, B. Semak, M. Houlberg, P. Egginton, D. Xu, E. Jing, B. Xiong, Y. Wang, Wafer-level vacuum
A. Hase, Ultra thin hermetic wafer level, chip scale packaging of micromachined thermoelectric IR sensors.
package, in 2006 Proceedings of 56th Electronic IEEE Trans. Adv. Packag. 33(4), 904–911 (2010)
Components and Technology Conference (2006), 7 pp D. Xu, B. Xiong, Y. Wang, Micromachined thermopile IR
B.W. Soon, N. Singh, J.M. Tsai, C.-K. Lee, Vacuum detector module with high performance. IEEE Photon.
based wafer level encapsulation (WLE) of MEMS Technol. Lett. 23(3), 149–151 (2011)
using physical vapor deposition (PVD), in 2012 I.E. B. Yang, C.K. Lee, A wideband electromagnetic energy
14th Electronics Packaging Technology Conference harvester for random vibration sources. Adv. Mater.
(EPTC) (2012), pp. 342–345 Res. 74, 165–168 (2009)
B.W. Soon, E.J. Ng, V. Hong, Y. Yang, C.H. Ahn, B. Yang, C. Lee, Non-resonant electromagnetic wideband
Y. Qian, T.W. Kenny, C. Lee, Fabrication and charac- energy harvesting mechanism for low frequency
terization of a vacuum encapsulated curved beam vibrations. Microsyst. Technol. 16(6), 961–966 (2010)
switch for harsh environment application. B. Yang, C. Lee, W.L. Kee, S.P. Lim, Hybrid energy
J. Microelectromech. Syst. 23(5), 1121–1130 (2014) harvester based on piezoelectric and electromagnetic
B.W. Soon, Y. Qian, E.J. Ng, V. Hong, Y. Yang, mechanisms. J. Micro/Nanolithogr. MEMS MOEMS,
C.H. Ahn, T.W. Kenny, C. Lee, Investigation of a 9(2): 023002–023002-10 (2010)
vacuum encapsulated Si-to-Si contact microswitch B. Yang, C. Lee, R.K. Kotlanka, J. Xie, S.P. Lim, A
operated from 60  C to 400  C. J. Microelectromech. MEMS rotary comb mechanism for harvesting the
Syst. 24(6), 1906–1915 (2015) kinetic energy of planar vibrations. J. Micromech.
D. Sparjs, Packaging of microsystems for harsh Microeng. 20(6), 065017 (2010b)
environments. IEEE Instrum. Meas. Mag. 4(3), B. Yang, C. Lee, G.W. Ho, W.L. Ong, J. Liu, C. Yang,
30–33 (2001) Modeling and experimental study of a low-frequency-
M. Stordeur, I. Stark, Low power thermoelectric vibration-based power generator using ZnO nanowire
generator-self-sufficient energy supply for micro arrays. J. Microelectromech. Syst. 21(4), 776–778
systems, in Proceedings ICT’97. XVI International (2012)
Conference on Thermoelectrics, 1997 (1997), D.-Q. Yu, C. Lee, L.L. Yan, W.K. Choi, A. Yu, J.H. Lau,
pp. 575–577 The role of Ni buffer layer on high yield low tempera-
J.P. Thomas, M.A. Qidwai, J.C. Kellogg, Energy scav- ture hermetic wafer bonding using In/Sn/Cu metalli-
enging for small-scale unmanned systems. J. Power zation. Appl. Phys. Lett. 94(3), 034105 (2009)
Sources 159(2), 1494–1509 (2006) S. Zhang, L. Lou, C. Lee, Piezoresistive silicon nanowire
J. Xie, C. Lee, M.-F. Wang, Y. Liu, H. Feng, Characteri- based nanoelectromechanical system cantilever air
zation of heavily doped polysilicon films for CMOS- flow sensor. Appl. Phys. Lett. 100(2), 023111 (2012)
MEMS thermoelectric power generators. Z. Zhong, T. Tee, J. Luan, Recent advances in wire
J. Micromech. Microeng. 19(12), 125029 (2009) bonding, flip chip and lead-free solder for advanced
J. Xie, C. Lee, H. Feng, Design, fabrication, and charac- microelectronics packaging. Microelectron. Int. 24(3),
terization of CMOS MEMS-based thermoelectric 18–26 (2007)
An IPv6 Energy-Harvested WSN
Demonstrator Compatible with Indoor 17
Applications

Pascal Urard, Liviu Varga, Mališa Vučinić,


and Roberto Guizzetti

In this chapter, we present the results of a achieved with commercial, secured-IPv6


research project that was developed in ST with networks in a few years. The idea of a self-
key partners over several years. It is an energy- powered WSN has been identified in 2007 during
harvested, self-healing, secured IPv6 Wireless a boosters-of-innovation session. In 2010, the RF
Sensor & Actuator Network (WSAN), designed power consumption reached such a low level that
for indoor environments. Pairing and installation we started the project R&D. It has been held in
have been taken into account to propose a ST in 2014, and closed at the end of 2015 in the
complete system solution. GreenNet has university labs. The project enabled a vision of
demonstrated that it is possible to achieve auton- the upcoming IoT low-power nodes and the
omy with upcoming radio generations. However, upcoming low-power roadmaps to serve this
this is possible only if certain cross-layer market.
optimizations are performed throughout the soft- GreenNet is a Wireless Sensor and Actuator
ware stack. Details and results are provided along Network enabling true End-to-End IPv6 con-
the chapter. nectivity, where some network nodes can be
energy-harvested and capable of a fully auto-
nomous operation. This is done without sacri-
17.1 GreenNet at a Glance ficing security features required for most of the
applications running on WSANs. Figure 17.1.
17.1.1 Introduction depicts vision of the final application where
autonomous GreenNet nodes communicate
The system solution named GreenNet project is a through a gateway with hosts located anywhere
prototype developed by STMicroelectronics in in the Internet.
partnership with the Laboratoire d’Informatique
de Grenoble (LIG) and few other partners. It has
not been commercialized; however, the level of
performance and efficiency reached by this dem-
17.1.2 Overview of GreenNet System
onstrator are still among the best published.
Features
GreenNet demonstrates today what could be
Bi-directional wireless network. GreenNet is a
wireless network of sensors and actuators
P. Urard (*) • L. Varga • M. Vučinić • R. Guizzetti
operating in 2.4 GHz ISM band. It primarily
STMicroelectronics, Crolles, France
e-mail: urard.pascal@gmail.com; liviu.varga@gmail. targets low-cost networks, such as home or
com; malishav@gmail.com; roberto.guizzetti@st.com building automation, inventory management,

# Springer International Publishing AG 2017 483


M. Alioto (ed.), Enabling the Internet of Things, DOI 10.1007/978-3-319-51482-6_17
484 P. Urard et al.

Fig. 17.1 Vision of a possible final application based on a GreenNet network

infrastructure monitoring, where wired solutions technology, and on the other hand, various
are neither cost effective, nor flexible. GreenNet low-power hardware wireless solutions.
ensures true bidirectional communication among
User-friendly experience. Based on IEEE
nodes or between a node and end-user application
802.15.4 PHY+MAC standard, GreenNet targets
(e.g. running on PC, smartphone or tablet),
self-organizing, scalable, self-healing networks
allowing remote sensor or actuator control.
making final user experience quite simple from
IoT ready. GreenNet is designed to support true installation or maintenance perspectives.
End-to-End connectivity over IPv6. This feature
Security. GreenNet protects against network
is foreseen as a key market shaper enabling on
threats by implementing security features that
one hand, massive development of interoperable
guarantee data confidentiality, data integrity,
applications that are agnostic of wireless
host and data-origin authentication and network
17 An IPv6 Energy-Harvested WSN Demonstrator Compatible with Indoor Applications 485

availability, both in the presence of external and


internal attackers.
Based on open standards. The goal of GreenNet
is to counteract the current market status quo
where many proprietary solutions already take
place, being themselves a de-facto limiting
factors of market development. For that reason,
GreenNet develops a fully optimized network
stack based on open-source implementations.
Sustainable energy. GreenNet is designed to
target operation of energy-harvested nodes. The
goal is achieved by combining both advanced
hardware design techniques (namely on the
radio chip and energy charger chip level) and
relevant standard and communication protocol
options. Altogether, this yields a system where
a temperature sensor node can, for instance, send
a reading with a 5 min period, just by being
exposed to 100Lux Compact Fluorescent Light
(CFL) during 6 h a day. This amount of light
corresponds to conditions typically found bellow
an office desk. With more energy (i.e: more
light), a node can be sustainable for years. Exper- Fig. 17.2 GreenNet Network Topology
iment has also demonstrated that such a node is
able to operate during more than 75 days in the
dark from a fully charged battery. the network is built automatically, without any
programming needed from the user. Each node in
the network selects the best available parent
17.1.3 Network Topology at a Glance based on the advertisements in its radio vicinity.

Network description. GreenNet network is


based on a cluster-tree topology depicted in
17.2 GreenNet Hardware
Fig. 17.2. Nodes in the network correspond to
different types of devices. The Network Coordi-
17.2.1 Introduction
nator (dark blue) controls the entire network and
provides external connectivity. Network Coordi-
We showed the layout of GreenNet hardware
nator is also called Edge Router (ER) or Sink.
demonstrator boards in Fig. 2.5 in Chap. 2. The
Leaf nodes (yellow) are either sensors or
three core components of the board are: an
actuators. Leaf nodes can attach to the Network
STM32L1 (STMicroelectronics) microcontrol-
Coordinator or to Hop Routers (light blue) if the
ler; an advanced low-power radio transceiver
Network Coordinator is outside of the wireless
compliant with IEEE 802.15.4(/e) standard; and
range. Hop Routers (light blue) ensure connec-
a low leakage battery charger. Apart from the
tivity between Leaves and the Network Coordi-
three core chips, an STM32F101 microcontroller
nator allowing physical network expansion.
provides USB interface during software develop-
Self-organizing network. GreenNet is a Self- ment. GreenNet board is powered by a low
Organizing Network (SON), which means that capacity Li-Ion cell, which is recharged from a
486 P. Urard et al.

Fig. 17.3 GreenNet hardware demonstrator

small photovoltaic panel. Three LEDs are used 17.2.2.2 Radio


for control and debug. GreenNet boards communicate wirelessly
For demonstration purposes, the board through a low-power radio transceiver, compli-
includes various ST sensors: accelerometer, ant with IEEE 802.15.4 standard and its “/e”
magnetometer, gyroscope, pressure, tempera- amendment. The radio chip implements the ana-
ture, and light sensors. An NFC EEPROM log interface operating in 2.4 GHz band as well
(M24LR64E-R datasheet) is added to demon- as a dedicated hardware in charge of the MAC
strate user-friendly bootstrapping operations. operations as defined by the IEEE 802.15.4 stan-
GreenNet can support additional sensors through dard. Advanced MAC features implemented in
the use of a dedicated port extension. The same hardware lighten the load of the microcontroller.
hardware board (see Fig. 17.3) can be In addition to the nominal 250 Kbits/s data rate
programmed as a leaf node, hop router or net- proposed by the IEEE 802.15.4 standard, the chip
work coordinator node (also called Edge router). also provides a proprietary extension to increase
The same board can as well be programmed with the data rate up to 2 Mbits/s.
several applications using different MEMs.

17.2.2.3 Power Management


17.2.2 Functional Description Power Management (PM) delivers power supply
to the rest of the system: main MCU (STM32L1)
17.2.2.1 Microcontroller & MEMs, 2.4 GHz radio. It also plays the role
GreenNet boards use the STM32L1 microcon- of battery charger: it insures solar-cell manage-
troller to control and communicate with the ment, battery charging and battery safety
radio transceiver, sensors and actuators. (Todeschini and Dimensionnement energetique
STM32L1 is an ultra-low-power 32-bit micro- de reseaux de capteurs ultra-compacts auto-
controller based on ARM Cortex-M3 core. We nomes en energie. PhD thesis, Supelec 2014).
chose the version with 256 KB Flash memory The battery is thus connected directly to this
and 32 KB RAM for data. This part was selected chip (see Fig. 17.4). PM chip limits the charging
for its low power properties. current to a maximum of 2 mA to preserve the
17 An IPv6 Energy-Harvested WSN Demonstrator Compatible with Indoor Applications 487

Fig. 17.4 GreenNet node


power management
architecture

battery. It stops charging when battery reaches lifetime of an operating node placed in harsh
3.2 V (high cut-off) and stops the full system if conditions during a long period.
battery goes lower than 2.35 V (low cut-off). The When a node is connected to USB for debug
PM core area is 0.4 mm2 in ST M10ULP (90 nm purpose, only STM32F1 microcontroller is
embedded NVM CMOS process) to allow a pos- powered by USB (default mode) unless a connec-
sible future integration in a SoC. PM quiescent tor allows the PM to charge (optional mode).
current in dark conditions is 130 nA. The energy
harvester is a photovoltaic (PV) cell: a standard
17.2.2.4 USB Interface
amorphous cell from a PV cell provider. Its
The USB interface is built from a physical USB
active area is 16 cm2, and its open circuit voltage
connector and an ARM-based 32-bit microcon-
is Voc ¼ 4.4 V at 200Lux. It has been
troller (STM32F101) that implements the bridge
co-designed with our PM to provide the best
towards the main STM32L1. This interface
possible energy efficiency in the critical zone:
supports the following functions:
between 70 and 200 Lux CFL light (indoor
conditions). The maximum energy efficiency
– Energy-supply of the node during develop-
(MPP) is obtained around 200Lux. Charging
ment and debug phase.
voltage varies from 2.4 V under 20Lux, to 3 V
– Bidirectional communication with PC for
under 200Lux, demonstrating a global efficiency
on-chip and serial port debugging, boot-
of the full system (PV + PM + battery) above
strapping configuration, and flashing of
86.4% at 200Lux (Todeschini et al. 2013).
STM32L1 program memory with firmware
Given the good results of this architecture, the
image.
added value of a more complex MPP algorithm
– Connection to the gateway or a PC, in the case
has not been demonstrated. PM chip is able to
the node is programmed as an Edge Router.
perform battery charging at a minimum current
of 290 nA, obtained with only a few Lux. This
corresponds to ~1 μW depending on the voltage 17.2.2.5 Sensors
of the battery. This is not sufficient to operate in a GreenNet node includes six types of sensors with
sustainable way but it enables to extend the key metrics summarized in Table 17.1.
488 P. Urard et al.

Table 17.1 Summary of the embedded sensors


Digital/analog
Type Reference output Dynamic ranges Accuracy
3D Gyroscope L3GD20 Digital 250 dps 16-bit-rate value data output
500 dps
2000 dps
Temperature sensor STLM20 Analog 40 to 85  C 1.5  C at NTC ¼ 25  C
<2.5  C at ETC
3D Accelerometer LSM303DLHC Digital Acceleration 16 bit data output
3D Magnetometer 2 g/4 g /
8 g/16 g
Magnetic field
1.3 Gs
1.9 Gs
2.5 Gs
4.0 Gs
4.7 Gs
5.6 Gs
8.1 Gs
Pressure sensor LP331AP Analog 260–1260 mbar 0.020 mbar rms
Light sensor Custom design Analog 20 lux to 1 sun 15% (between 20 lux &
7000 lux, indoor)

17.2.2.6 NFC and tradeoffs for energy efficient protocols


An NFC EEPROM (M24LR64E-R datasheet) enabling energy-harvested Wireless Sensor
enables user-friendly read/write access to the Networks.
main memory. The NFC transceiver is based on For communication, GreenNet nodes use the
ISO-15693 communication standard. Network IEEE 802.15.4 standard in its beacon-enabled
pre-shared (security) keys, radio channels and mode that enables efficient duty cycling. They
similar parameters can be configured before the sleep most of the time and wake up only to per-
deployment over NFC. For instance, a user can form their application task, such as sensing of a
copy the parameters from the Edge Router with a physical quantity and the radio transmission of the
smartphone and write them to each node that is measured value. Sleep periods correspond to the
about to join the network. low current draw, mainly caused by leakage
currents while the active periods require more
current due to the involvement of the micropro-
cessor and radio transceiver. The average current
17.3 Energy-Sustainable Operation:
consumption depends on the duty cycle—the
Intersection of Network
more time node spends sleeping, less current it
Protocols and Low-Power
consumes on average. We depict this operation for
Hardware
two types of network devices in Fig. 17.5.
Duty cycles can typically vary between 10%
17.3.1 Duty-Cycled Operations
(typical value for routers) down to 0.01% for leaf
nodes.
Low-power hardware design is a necessary but
In order to reach the balance between con-
not a sufficient condition to reach sustainable
sumed and harvested energy, the total energy
operation of a network with energy-harvested
consumption of the node while sleeping must
nodes. Optimization of average current con-
be in the range of μAs. To reach this goal, not
sumption takes place both at the hardware and
only the radio and all the peripherals including
network level. Romaniello’s PhD thesis
sensors must be put in the lowest possible
(Romaniello et al. 2015) describes explorations
17 An IPv6 Energy-Harvested WSN Demonstrator Compatible with Indoor Applications 489

Router Deep Sleep


Duty cycle
99% Radio & Micro OFF
100% - 1%
1% ON

Leaf
Duty cycle 99% OFF
10% - 0.01%

time

Fig. 17.5 Duty-cycled network operation

Fig. 17.6 Duty-cycle/


luminosity dependency
(simulation)

consumption mode, but also the microcontroller. sleep, the duration of the sleep period must be
This has an important impact on the network long enough to recharge the battery for the
stack, as the hardware delays of these operations subsequent operation. The interval depends on
can be significant. We always need to maintain light conditions and the type of measurement,
positive (among a certain duration) the energy as estimated in Fig. 17.6.
balance between harvested energy while sleeping Even though GreenNet targets indoor appli-
versus the energy needed for an active operation, cations with light intensity as low as 100 Lux,
as depicted in Fig. 2.4 in Chap. 2. there could be cases where energy provided by
PV panel at this intensity is not enough for a
particular application. In those scenarios, nodes
17.3.2 Relationship Between Light can either be connected to an external power
Conditions and Duty Cycle source or to a larger battery. GreenNet communi-
cation stack allows the networks to be comprised
Light conditions impact the allowed duty cycle. of both duty-cycled and always-on nodes.
The less light there is, the longer the node has to
sleep to recharge its battery. To enable auton-
omy, power consumption has to be balanced by 17.4 GreenNet Software
harvested power on a daily or week basis, Communication Stack
depending on the use case. If a node has to send
some measurement at a regular time interval, it 17.4.1 Introduction
has to stay awake some period to sense and
transmit. Typical value would be around 30 ms In order to reach true end-to-end IPv6 connectiv-
but it varies with different sensors and applica- ity and still operate in energy-constrained
tion requirements. When the node goes back to conditions, GreenNet uses a protocol stack
490 P. Urard et al.

Fig. 17.7 GreenNet protocol stack based on Open Standards

built upon 6LoWPAN (Montenegro et al. 2007) only available in Europe and has a single fre-
that optimizes IPv6 signaling overhead. quency channel available.
Figure 17.7 summarizes the GreenNet protocol IEEE 802.15.4 physical layer at 2.4 GHz uses
stack and the main points addressed in its DSSS technique for robustness: each 4 bits of
implementation. data are encoded as 32 chips (physical bits)
The implementation of this communication (Palattella et al. 2013). This helps recover from
stack was developed under Contiki environment errors caused by narrow band interference.
(Swedish ICT; Dunkels et al. 2004). The follow- O-QPSK modulation is then used and results in
ing sections provide a brief overview of each physical rate of 2 Mchips/s and effective data
abstraction layer and how it relates to GreenNet. rate of 250 kb/s.
Before the upper-layer information can be
exchanged, transmitter starts by sending a pre-
17.4.2 I.E. 802.15.4 PHY Layer (L1) amble, a pre-defined sequence of ones and zeros
that allows the receiver to synchronize. Trans-
IEEE 802.15.4 is arguably the most prominent mission of the preamble lasts 128 us, and is
standard (IEEE 2011) in low-power technology followed by start frame delimiter (SFD), another
and the one that was chosen for GreenNet proj- pre-defined sequence. SFD signals that the
ect, as outlined in Chap. 3 of Vučinić’s PhD subsequent byte corresponds to the physical-
thesis (Vučinić et al. 2015c) and detailed hereaf- layer payload. First byte of the payload indicates
ter. IEEE 802.15.4 standard specifies multiple the length of the encapsulated radio frame. IEEE
physical layers that can be used in different 802.15.4 specifies that the maximum length
parts of the world, depending on local frame, i.e. link-layer Maximum Transmission
regulations. The physical layer in the ISM band Unit (MTU), is 127 bytes.
at 2.4 GHz guarantees worldwide use free of any
licensing requirements, which is in practice the
most widely deployed and naturally supported 17.4.3 IEEE 802.15.4 Beacon-Enabled
with GreenNet’s radio transceiver. MAC Layer L2
The 2.405–2.480 GHz band is split into 16 fre-
quency channels that are 5 MHz apart. Each 17.4.3.1 Superframe Concept
channel is only 2 MHz wide, with the remaining Harvested nodes have very stringent constraints in
band used as a guard against adjacent-channel terms of available energy. Enabling IPv6 commu-
interference. The band at 868 MHz with better nication in this case requires the selection of an
propagation characteristics is also popular but is ultra-low-power MAC protocol. This point
17 An IPv6 Energy-Harvested WSN Demonstrator Compatible with Indoor Applications 491

Fig. 17.8 Time-beaconed


superframe as described in
IEEE 802.15.4-2006

Fig. 17.9 Incoming and outgoing superframe structure in IEEE 802.15.4 standard

together with the main requirement of using only 17.4.3.2 Network Association
standardized solutions at all levels brought to the and Disassociation
choice of using the CSMA/CA slotted option Also part of the MAC layer is the association and
part of IEEE 802.15.4 beacon-enabled standard disassociation procedure (how a device joins the
(IEEE 2011). At MAC (L2) level GreenNet stack network) which is specified by IEEE 802.15.4.
is therefore using Superframe structure as speci- The network association implemented in
fied in the IEEE specification. See Fig. 17.8. GreenNet stack respects this standard.
The standard divides time in Superframes
with regular beacons sent at a Beacon Interval
17.4.3.3 Incoming and Outgoing
(BI). Superframe is itself divided in
Active Period Allocation
Each Network Coordinator or Router sends its
– an Active Period (SD) during which commu-
own beacons to its sons further down in the tree
nication takes place with CSMA/CA mecha-
(see also Sect. 17.1.3), and can decide where to
nism for collision avoidance during
place its active period into the Beacon Interval
• Contention Active Period (CAP)
(BI). While a Leaf node only receives incoming
• Contention Free Period (CFP) which
beacons from its parent router, a router first
remains unused in GreenNet
receives incoming beacons from its parent (Net-
– an Inactive Period during which the node may
work Coordinator or other Router) and then
sleep.
sends beacons to its children (Leaf nodes or
other Routers), see Fig. 17.9.
Beacon Intervals (BI) and Active Period
IEEE 802.15.4 does not specify how the
(SD) durations can be changed according to the
active period can be allocated; this is left to the
IEEE-802.15.4 specifications between 15.36 ms
implementation choice (Active Period Manager).
up to ~4.2 min1 allowing the use of various duty-
In order to guarantee by construction that
cycles (ON/ON + SLEEP ratio) which best adapt
there is no overlap between active periods of
to available energy.
each Router in the network, an algorithm has
been implemented. It is based on Zigbee
Distributed Address Allocation Algorithm
1
Minimum Active Period ¼ 60 Symbols  16 μs  16
(DAAA). Active Period start time for each poten-
slots ¼ 15.36 ms
Maximum Beacon Interval ¼ 214 Symbols  16 μs  16 tial router is a function of hereafter Cskip(d)
slots ¼ ~4.2 min. computed according to the following formula
492 P. Urard et al.

(Cm: max number of children, Lm: max depth, 17.4.4 Routing Layer L3
Rm: max number of routers, d: actual depth of the
router): 17.4.4.1 Introduction
As this was introduced in Sect. 17.1, the
1  RmLmd
Cskipðd Þ ¼ assuming Cm GreenNet nodes can be of the following types
1  Rm (see Fig. 17.10):
¼ Rm with Rm > 1 f or 1 < d
 Lm – Edge Router or Network Coordinator is the
central node (a.k.a sink). It is connected to the
The above algorithm is a centralized mecha- gateway of the network (this is a FFD)
nism, which does not take into account, the radio – Routers act as a relay for the other nodes, to
range of the nodes. More flexible active period propagate the data in the whole network. They
allocation algorithms can be implemented. can also themselves sense, capture and send
their own data (this is a FFD)
17.4.3.4 Data Exchange – Leaf nodes are at the bottom of the tree, sens-
Between Devices ing and sending their own data (this is a RFD)
It should be noted that, based on IEEE 802.15.4
beaconed standard, Network Coordinator or Routing consists in finding the best route
Router never initiate a transmission, only reply (according to some objective function) from
to solicitations from children nodes. They notify any network node to the edge router. Chapter 7
children nodes by signaling in beacon frames of Varga’s PhD thesis (Varga et al. 2015b)
their corresponding address, together with a flag describes the innovative routing solution in
that they have a packet pending. Once the chil- GreenNet, which is particularly suited for
dren node wakes up and parses the beacon, it can harvested nodes. The solution will be described
solicit the data packet by transmitting a data hereafter.
request. Such a mode of operation allows the
children devices to skip beacons and sleep exten- 17.4.4.2 Limitations of RPL
sive periods of time. In case of traffic going Routing Protocol
upwards (from children towards the sink), the for Harvested Nodes
nodes simply transmit the packet within the IETF Routing Over Low power and Lossy
corresponding CAP, using the CSMA/CA networks (ROLL) Working Group specified
algorithm. RPL (Winter et al. 2012) routing protocol.

Fig. 17.10 IP based


GreenNet network example
17 An IPv6 Energy-Harvested WSN Demonstrator Compatible with Indoor Applications 493

Unfortunately, RPL is not well suited for energy- – Sends routing info in beacon payloads
harvested nodes having very stringent energy (piggyback)
constraints. – Keeps Routers by construction duty-cycled
Major drawbacks of RPL are (see (Clausen – Makes route request initiated ONLY by the
et al. 2011) for further details): Edge Router
– Avoids Broadcast messages issue
– DIO (DODAG2 Information Object) are – The following sub-sections give more details
descending messages sent periodically to cre- on how this innovative routing solution is
ate upward routes. With RPL these messages working.
may be fragmented
– RPL is engineered with preamble sampling in
mind i.e. DIO are sent with a Trickle timer to
17.4.4.4 Upward Route Request
avoid collisions and lower their global send-
In LOADng, when a node does not know the
ing rate, which solves the problem only par-
route to a destination, it initiates a route request
tially and makes the protocol unnecessarily
(RREQ), causing a flooding in the network with
complex
broadcast messages (as discussed above). In
– In storing mode3 problem of memory (routing
GreenNet stack, only the Edge Router (ER) can
table size) and DAO4 message footprint
initiate a route request.
– Complexity (executable RAM size:
As can be seen from Fig. 17.11 below, if a node
~50 Kbytes5)
A does not know a destination address B then:
– Routers MUST be always-on (i.e. you cannot
use duty-cycled routers) to keep track of the
– It sends data packet up to its coordinator until
radio environment to ensure proper trickle
it reaches ER (action 1 in Fig. 17.11)
functioning.
– If the ER does not know the destination either,
it will send down a broadcast RREQ which is
recursively broadcasted by each router of
17.4.4.3 Optimized Solution for Routing interest6 (action 2 in Fig. 17.11)
of Harvested Nodes
– An optimized solution was found by
implementing a cross layer (L2/L3) construc-
tion of the Cluster Tree (La et al. 2013),
exploiting the best of both RPL (Winter
et al. 2012) and LOADng (Clausen et al.
2015) routing protocols. This innovative
implemented solution
– Keeps the network structure built by RPL
(Collection Tree)
– Uses very compact code with few parameters

2
DODAG : Destination Object Directed Assisted Graph.
3
A node uses routing tables for forwarding packets to Fig. 17.11 Route request customization in GreenNet
leafs. stack
4
DAO (Destination Advertisement Object): upward
messages sent to create downward routes (see Winter
et al. 2012). 6
During this flooding, each router creates also a reverse
5
We did not succeed in implementing RPL and path inserting in its routing table the next hop to ER
ContikiMAC (Dunkels et al. 2011) within a typical (actually next-hop in upward direction is always the
8 KB RAM low End MCU. parent).
494 P. Urard et al.

Fig. 17.12 For broadcast


(D-1) router behaves as a
proxy for its children
leaves

– When destination node receives it, it replies – Unicast i.e. when we want to send a data (or a
with a route replay (RREP) that will go up to configuration) to a single node.
the ER through the same path7 (action 3 in
Fig. 17.11) For Broadcast (see Fig. 17.12), GreenNet
– Once the RREP arrives to the ER the path has routing solution is such that the router at depth
been properly created (on all nodes of (D-1) in the tree (i.e. the latest one before the
interest) sleeping Leaf), behaves as a proxy, i.e. replies on
– From this time on, each time A wants to send behalf of its sleeping children, to the RREQ.
a packet to B the same path (through ER) as For Unicast, implemented solution tends to
created by this RREQ/RREP, mechanism will avoid the drawback described hereafter: when
be used: this avoids the overhead caused by packets have to be sent to a sleeping leaf (i.e. a
any routers broadcasting RREQ to anyone low duty cycle leaf programmed for skipping
(as required in LOADng). beacons), only the first packet (i.e. that one
received immediately after the beacon) is likely
to arrive to destination. The next ones have a
high probability to be lost (even though a buffer
17.4.4.5 Downward Routes Toward
mechanism could be applied which anyway
Sleeping Nodes (Broadcast
would be limited by the RAM size).
and Unicast)
GreenNet implemented solution tends to
Wireless Sensor Networks are typically built
avoid this limitation (see Fig. 17.13): when a
to sense, collect and transmit information to a
packet arrives to a sleeping leaf, RDC layer
Sink (i.e. Edge Router) which is then possibly
stops skipping beacons for a certain (program-
connected to a gateway. The inherent tree of
mable) number of times (assuming that other
WSN is used to easily collect information up to
messages could arrive after the first one just
the sink.
received).
It is more complex to send data down into the
tree (i.e. configuration request to a leaf and/or
control an actuator) especially when a leaf is
most of the time sleeping. 17.4.5 Application Layers (L5–L7)
In the downward routes, we have to further
categorize in: 17.4.5.1 CoAP
IETF CoRE working group standardized CoAP
– Broadcast like the RREQ of previous section (Shelby et al. 2014) to answer the needs of IoT
devices and replace the more energy hungry
7
During this phase each router creates also the downward HTTP. CoAP runs over User Datagram Protocol
route for the destination of interest. (UDP) rather than over Transmission Control
17 An IPv6 Energy-Harvested WSN Demonstrator Compatible with Indoor Applications 495

Fig. 17.13 After receiving Unicast message, some beacons are NOT skipped

Table 17.2 Comparison of communication overhead semantics. The term “server” should by no means
between HTTP and CoAP
be associated with powerful machines hosting
HTTP/TCP CoAP/UDP HTTP servers in the Internet, as in our context
# Messages 9 2 CoAP “server” corresponds to a GreenNet board
# Bytes 681 111 exposing its resources to CoAP clients
RTT (ms) 7.2 4.4
(e.g. gateways, smart phones, or other GreenNet
nodes).
Protocol (TCP) that is used in the traditional GreenNet uses two open source imple-
Internet. We show an example in Table 17.2 mentations of CoAP. Erbium (“C”) is used for
how CoAP/UDP is more efficient than HTTP/ GreenNet nodes while Californium (‘Java’) is
TCP, in terms of the number of messages, total used for the gateway.
byte overhead and round trip time (RTT) for the
same application-level transaction.
CoAP is not a blind compression of HTTP, 17.4.5.2 ZigBee Smart Energy
although it is commonly referred to as such due Profile (SEP2.0)
to the same RESTful design and purpose. Zigbee Alliance specified SEP2.0 (ZigBee Alli-
Instead, CoAP implements a subset of HTTP ance), which is a software layer standardizing, at
features relevant to IoT devices and further the application level, data exchanges between the
specializes for constrained environments by network nodes so that these exchanges remain
leveraging typical traffic patterns and by independent of the lower-layer protocols. Differ-
minimizing overhead. ent profiles were standardized for various market
CoAP leverages the traditional client-server segments. A prototype subset of the SEP2.0
architecture where a client interested in a given Function Sets has been implemented under
resource makes a request towards the server, Contiki for GreenNet nodes, while a more
hosting that resource. The server prepares a exhaustive version is used for the gateway, run-
response that potentially encapsulates a resource ning in Java.
representation corresponding to the requested When a new node joins the network, it triggers
resource. For example, a request to resource a SEP2.0 Node Registration phase towards the
“temperature” would result in response con- gateway (“End Device” message). Then, each
taining temperature measurement in the payload. node sends upwards few status parameters, one
There could also be a request to change the value of them being its parent address. The information
of some resource on the server, in which case the from each node will be used to update the graph
response typically acknowledges the change. The in the gateway GUI and corresponds to the Status
separation of client and server roles in IoT Information Exchange Discovery phase. This is
environments is not as straightforward as in the followed by Function Sets Discovery phase,
traditional Internet. The same device often Resource Discovery phase and Data Exchange
executes both roles, depending on the application phase.
496 P. Urard et al.

Fig. 17.14 SEP registration procedure example

An example of SEP registration phase is For that reason, we refer to security at L2 as


described in Fig. 17.14. hop-by-hop security. An attacker can physically
compromise one of the nodes in the network
and therefore obtain access to the information
17.5 Security originating at other nodes through the
compromised node. We refer to such an attacker
As a wireless network, GreenNet is prone to as an insider. In addition, an insider can
both passive (e.g. eavesdropping) and active easily tamper with existing information flows or
(e.g. packet injection, replay) attacks. Network inject false readings to make damage to the
security protocols and mechanisms use crypto- system.
graphic techniques to protect against such Defense against inside attackers is end-to-end
threats, for example by encrypting and/or security that protects information travelling
authenticating packets. By protecting informa- end-to-end, such that no intermediary node on
tion at different abstraction layers and by the path has access to it. In practice, it means that
controlling access to cryptographic keys, it is parts of the packet that correspond to the appli-
possible to selectively place trust on different cation are encrypted and authenticated with cryp-
nodes in the network. tographic keys different from those used at L2.
For example, security at L2 protects the com- Figure 17.15 depicts how security at different
munication over the radio channel and provides layers protects information flows in GreenNet
the first level of protection against external networks.
attackers. Since GreenNet nodes typically send Additionally, GreenNet may leverage an
data to the gateway, that means that every node advanced security scheme OSCAR that protects
on the multihop path has to decrypt and encrypt application data, while it is stored at inter-
the packet as it travels towards its destination. mediaries, such as untrusted Cloud servers.
17 An IPv6 Energy-Harvested WSN Demonstrator Compatible with Indoor Applications 497

Fig. 17.15 Security embedded in L2 (hop-by-hop) and L4 (end-to-end) layers

17.5.1 Hop-by-Hop Security (L2) CCM is a particular mode of operation that


uses CTR mode for encryption and CBC-MAC
Security mechanisms at L2 are tightly bound to for authentication. It encrypts and/or
the radio technology, which in GreenNet case authenticates a sequence of plaintext bytes.8
corresponds to IEEE 802.15.4 standard. The When applied to a radio frame, this means that
standard completely relies on symmetric crypto- CCM can encrypt the MAC payload while keeping
graphic primitives to provide confidentiality, the MAC header intact (Sciancalepore et al. 2016).
data origin authentication and protection from When used to authenticate a radio frame, CCM
replay attacks on a per frame basis. The standard calculates a Message Integrity Code (MIC)
delegates host authentication and key exchange over the complete frame. This MIC is truncated
functions to upper layers of the network stack. to the desired length (4, 8 or 16 bytes), and
Since hop-by-hop security is only the first layer appended at the end of the radio frame.
of protection, GreenNet uses group keys: all Each frame secured with a given key must use
nodes in the network share the keys used for a different nonce. Encrypting two frames with
protection of radio frames. the same nonce has severe consequences on
The basic cryptographic primitive of IEEE security: plaintext of both frames may be easily
802.15.4 standard is Advanced Encryption Stan- recovered. By constructing the nonce with a
dard (AES), a block cipher that encrypts blocks monotonic counter, it is possible to ensure replay
of 16 bytes. To use AES in practice, it is neces- protection for two communicating nodes.
sary to apply different modes of operation for
protection of messages longer than 16 bytes.
IEEE 802.15.4 standard uses the CCM mode
8
(AES-CCM) to encrypt and authenticate radio While the upper bound on message length for CCM does
frames in the same time. exist, for practical purposes in IEEE 802.15.4 WSN it can
be considered arbitrarily large.
498 P. Urard et al.

Fig. 17.16 Message


exchanges during DTLS Client Server
handshake to agree on a ClientHello
common key
HelloVerifyRequest [cookie]
ClientHello [cookie]
ServerHello
(Certificate)
ServerKeyExchange
(CertificateRequest)
ServerHelloDone
(Certificate)
ClientKeyExchange
(CertificateVerify)
ChangeCipherSpec
Finished
ChangeCipherSpec
Finished
Application Data

17.5.2 End-to-End Security handshake is performed a limited number of


times during the node lifetime due to the signifi-
Securing end-to-end traffic is often achieved by cant performance issues.
transferring data over a secure logical channel
between the two communicating end-points.
The de-facto security standard for the Internet
application traffic is SSL and its IETF successor 17.6 Protecting Application Data
TLS (Dierks and Rescorla 2008). TLS was
designed for client-server applications that oper- In addition to end-to-end security, many use
ate over a reliable transport. To establish a secure cases require access control and authorization
channel, a client and a server first perform the (Seitz et al. 2016), but sensor nodes are not
TLS handshake during which they authenticate access control decision makers for the data they
each other and derive the symmetric keys to use generate—access control is rather a policy of the
during the session. network owner (Varga et al. 2015a). Recognizing
DTLS (Rescorla et al. 2012) is an extension of significant performance issues of the DTLS
TLS for datagram transport and runs over UDP handshake in radio duty-cycled networks, we
rather than TCP. Like TLS, DTLS protects have jointly addressed the problems of end-to-
the payload with encryption and authentication. end security and authorization by proposing
Figure 17.16 illustrates the exchanges during a OSCAR, a stateless content-centric security
DTLS handshake. architecture (Vučinić et al. 2015a). The main
Our evaluations of DTLS in the context of idea behind OSCAR is to protect the data rather
GreenNet and other MAC protocols (Vučinić than the communication channel, which allows
et al. 2015b), however, suggest that its usage is the secured data to be cached at intermediaries
acceptable for applications for which a DTLS without the need for decryption (Vučinić et al.
17 An IPv6 Energy-Harvested WSN Demonstrator Compatible with Indoor Applications 499

2015c). OSCAR relies on Authorization Servers harsh conditions is not sufficient to ensure
that provide clients with Access Secrets required sustainability. It has to be combined with cross-
to request resources from constrained nodes. the-board optimizations:
Nodes reply with the requested resources that
are signed and encrypted. The scheme intrinsi- – optimization of the hardware (Sect. 17.2)
cally supports multicast and asynchronous – optimization of the software stack described
traffic. The architecture achieves end-to-end in Sect. 17.4.
authenticity even in the presence of application-
level gateways that do not allow direct commu- The system total power consumption can be
nication from the outside of a sensor network, measured in real time, thanks to some board
for security reasons. OSCAR leverages secure debugging capability (Fig. 17.3).
channels established by means of DTLS and Figure 17.17 shows a typical active part of a
stays compatible with DTLS-only networks, typical super-frame for a self-powered tempera-
but uses it in a way that does not require the ture sensor node. We can detail wake-up (A),
repetition of expensive handshakes (Varga et al. sensor measurement (B) with in parallel radio
2015a). calibration, auto-synchronization through beacon
reception (C) with in parallel IPv6 frame prepa-
ration and security encryption, channel sensing
17.7 Technical Evaluation using CSMA-CA in 802.15.4 (D) and 802.15.4
frame Tx (E), frame acknowledgement by the
17.7.1 Total Power Measurement network (F), next wake-up preparation and back
to sleep (G). The (C) section is longer than bea-
We have seen in Sect. 17.3 that the GreenNet con reception to anticipate low oscillator errors.
node is able to charge its embedded battery Despite the length of the active period, it only
starting from 290 nA—corresponding to 20lux. represents an integrated charge of 192 μC, so
However, the capability to operate harvesting in around 576 μJ

Fig. 17.17 802.15.4-


beaconned active part
details
500 P. Urard et al.

17.7.2 Contributors to Energy directly affects the radio transmission/reception


Consumption and the Cost cost. Cryptographic processing causes a notable
of Hop-by-Hop Security increase in CPU usage from No Security to
ENC-MIC-32 modes. The LPM drop observed
We evaluate the main contributors to the overall with increasing security level is due to the over-
energy consumption in steady state over a 6-h head of time (and so overhead of energy con-
period, using software estimation techniques sumption) during other phases.
(Dunkels et al. 2007). We focus on the energy Figure 17.18 presents the average power con-
spent during four main phases: sumption measured over the experiment duration
of 6 h. It is interesting to note in Fig. 17.18 the
1. The energy spent during sleep mode (leakages amount of power drawn while the radio trans-
of the MCU and the rest of the board)—Low ceiver is in receive mode. Although the applica-
Power Mode (LPM); tion executing during the experiments induced
2. The energy spent during active computations typical convergecast traffic, with no packets
(CPU); going in the downward direction to the nodes,
3. The energy spent while the radio transceiver power drawn while receiving is three times
is in receive mode (RX); higher compared to power drawn while transmit-
4. The energy spent while the radio transceiver ting. Our node sleeps most of the time, but just
is in transmit mode (TX). before transmitting the sensor reading, it needs to
resynchronize to the rest of the network by
Then we compare the cost of security in dif- receiving a beacon frame at a precise interval.
ferent security modes (ENC-MIC-xx). The Delays caused by hardware imperfections, such
higher the ENC-MIC value, the higher the secu- as crystal oscillator warm up time, radio
rity level and lower the probability of blind
guessing the MIC value. All evaluated modes Table 17.3 Steady-state contributors to energy con-
include both encryption and authentication. sumption of GreenNet temperature sensor
We note in Table 17.3 that as much as 69% of LPM CPU RX TX
energy is spent during sleep mode. We see that Security level (%) (%) (%) (%)
the percentage of energy spent on radio commu- No security 69.77 15.77 10.92 3.54
nication increases with the security level. A ENC-MIC-32 68.57 16.35 11.30 3.78
higher security level corresponds to a longer ENC-MIC-64 67.62 16.63 11.69 4.06
MIC and a larger signaling overhead, which ENC-MIC-128 67.11 16.80 11.81 4.28

Fig. 17.18 Average 10


Average Power Concumption (µW)

power consumption of No Security


GreenNet temperature 9
ENC-MIC-32
sensor 8
ENC-MIC-64
7 ENC-MIC-128
6
5
4
3
2
1
0
LPM CPU RX TX TOTAL
17 An IPv6 Energy-Harvested WSN Demonstrator Compatible with Indoor Applications 501

Fig. 17.19 Energetic cost 30


of hop-by-hop (IEEE ENC-MIC-32
802.15.4) security for ENC-MIC-64
25
GreenNet temperature
sensor ENC-MIC-128

Percentage (%)
20

15

10

0
CPU RX TX OVERALL

calibration, clock drifts, and similar, induce DIO DODAG Information Object
margins that are an important component in the DODAG Destination Object Directed Assisted
overall power consumption. Roughly, only 15% Graph
of the receive mode consumption comes from the DTLS Datagram Transport Layer Security
actual physical reception of a beacon frame, ER Edge Router
whose size increases with the security level, FFD Full Functionality Device
while the rest is due to idle listening. GUI Graphical User Interface
We summarize in Fig. 17.19 the overall IETF Internet Engineering Task Force
energetic cost of security for a GreenNet tem- IoT Internet of Things
perature sensor. Overall cost of IEEE 802.15.4 LLN Low power and Lossy Network
hop-by-hop security in our scenario ranges from Lux SI unit for luminous emittance
1.75% to 3.96%, depending on the security level. LOADng 6LoWPAN Ad hoc On-Demand dis-
For GreenNet, this result directly corresponds to tance vector routiNG
the requirement that 1.75–3.96% extra energy MAC Medium Access Control
needs to be harvested from the environment. NDEF NFC Data Exchange Format
NFC Near Field Communication
PAN Personal Area Network
Glossary PHY PHYsical link (ie: the HW radio)
PV Photo Voltaic cell
AES-CCM Advanced Encryption Standard RDC Radio Duty Cycle
(Counter with CBC MAC) REST Representational State Transfer
BI Beacon Interval RFD Reduced Functionality Device
CoAP Constrained Application Protocol ROLL Routing Over Low power and Lossy
CoRE Constrained RESTfull Environments networks
CSMA/CA Carrier Sense Mulitple Access with RPL IPv6 Routing Protocol for LLNs
Collision Avoidance RREQ Route Request
DAAA Distributed Address Allocation RREP Route Replay
Algorithm RTT Round Trip Time
DAO Destination Advertisement Object SD Superfame duration
502 P. Urard et al.

SON Self Organizing Network E. Rescorla, N. Modadugu, Datagram Transport Layer


TLS Transport Security Layer Security Version 1.2. RFC 6347 (2012)
G. Romaniello, Energy Efficient Protocols for Harvested
WSAN Wireless Sensor & Actuator Network Wireless Sensor Networks. PhD thesis, Grenoble Alps
6LoWPAN IPv6 over Low power Wireless University, March 2015
Personal Area Networks S. Sciancalepore, M. Vucinic, G. Piro, G. Boggia,
T. Watteyne, Link-layer security in TSCH networks:
effect on slot duration. Trans. Emerg. Telecommun.
Technol. (2016)
References L. Seitz, S. Gerdes, G. Selander, M. Mani, S. Kumar, Use
Cases for Authentication and Authorization in
M24LR64E-R datasheet: 64-Kbit EEPROM with Constrained Environments. RFC 7744 (2016)
password protection, dual interface & energy Z. Shelby, K. Hartke, C. Bormann, Constrained Applica-
harvesting: 400 kHz I2C bus & ISO 15693 RF protocol tion Protocol (CoAP). RFC 7252 (2014)
at 13.56 MHz. http://www.st.com/internet/com/TECH STMicroelectronics. STM32L1 Series. http://www.st.
NICAL_RESOURCES/TECHNICAL_LITERATURE/ com/stm32l1
DATASHEET/DM00047008.pdf Swedish ICT. The Contiki OS. http://www.contiki-os.org/
T. Clausen, U. Herberg, M. Philipp, A critical evaluation F. Todeschini, Dimensionnement energetique de reseaux
of the ipv6 routing protocol for low power and lossy de capteurs ultra-compacts autonomes en energie.
networks (rpl), in Proceedings of the 2011 I.E. 7th PhD thesis, Supelec, 2014
International Conference on Wireless and Mobile F. Todeschini, C. Planat, P. Milazzo, S. Tricomi, P. Urard,
Computing, Networking and Communications P. Benabes, A nano quiescent current power manage-
(WiMob) (IEEE, 2011), pp. 365–372 ment for autonomous wireless sensor network, in
T.H. Clausen, A.C. de Verdiere, J. Yi, A. Niktash, Proceedings of the IEEE 20th International Confer-
Y. Igarashi, H. Satoh, U. Herberg, C. Lavenu, ence on Electronics, Circuits, and Systems (ICECS)
T. Lys, J. Dean, The Lightweight On-demand Ad hoc December 2013 (2013), pp. 815–818
Distance-Vector Routing Protocol—Next Generation L.-O. Varga, G. Romaniello, M. Vučinić, M. Favre,
(LOADng). Draft-clausen-lln-loadng-13. Work in A. Banciu, R. Guizzetti, C. Planat, P. Urard,
progress (2015) M. Heusse, F. Rousseau, O. Alphand, E. Duble,
T. Dierks, E. Rescorla, The Transport Layer Security A. Duda, GreenNet: an energy harvesting IP-enabled
(TLS) Protocol Version 1.2. RFC 5246 (2008) wireless sensor network. IEEE Internet Things
A. Dunkels, B. Gronvall, T. Voigt, Contiki—a light- J. (2015)
weight and flexible operating system for tiny L.-O. Varga, Multi-hop Energy Harvesting Wireless Sen-
networked sensors, in Proceedings of the 29th Annual sor Networks: Routing and Low Duty-cycle Link
IEEE International Conference on Local Computer Layer. PhD thesis, Grenoble Alps University,
Networks, 2004 (2004), pp. 455–462 December 2015
A. Dunkels, F. Osterlind, N. Tsiftes, Z. He, Software- M. Vučinić, B. Tourancheau, F. Rousseau, A. Duda,
based on-line energy estimation for sensor nodes, in L. Damon, R. Guizzetti, OSCAR: Object security
Proceedings of the 4th Workshop on Embedded architecture for the Internet of Things. Elsevier Ad
Networked Sensors, EmNets ’07 (ACM, New York, Hoc Netw 32, 3–16 (2015a)
NY, 2007), pp. 28–32 M. Vučinić, B. Tourancheau, T. Watteyne, F. Rousseau,
A. Dunkels, The ContikiMAC radio duty cycling proto- A. Duda, R. Guizzetti, L. Damon, DTLS Performance
col. Technical report, Swedish Institute of Computer in duty-cycled networks, in Proceedings of the IEEE
Science (SICS) (2011) 26th Annual International Symposium on Personal,
IEEE. 802.15.4-2011: IEEE standard for local and metro- Indoor, and Mobile Radio Communication (PIMRC),
politan area networks—Part 15.4: Low-Rate Wireless August 2015 (2015), pp. 1–6
Personal Area Networks (LR-WPANs) (2011) M. Vučinić, Architectures and Protocols for Secure and
C.-A. La, M. Heusse, A. Duda, Link reversal and reactive Energy-Efficient Integration of Wireless Sensor
routing in low power and lossy networks, in Networks with the Internet of Things. PhD thesis,
Proceedings of the 2013 I.E. 24th International Sym- Grenoble Alps University, November 2015
posium on Personal Indoor and Mobile Radio T. Winter, P. Thubert, A. Brandt, J.W. Hui, R. Kelsey,
Communications (PIMRC) (2013), pp. 3386–3390. P. Levis, K.S.J. Pister, R. Struik, J. Philippe Vasseur,
G. Montenegro, N. Kushalnagar, J. Hui, D. Culler, Trans- R.K. Alexander, RPL: IPv6 Routing Protocol for
mission of IPv6 packets over IEEE 802.15.4 networks. Low-Power and Lossy Networks. RFC 6550, March
RFC 4944 (2007) 2012
M.R. Palattella, N. Accettura, X. Vilajosana, T. Watteyne, ZigBee Alliance. ZigBee Smart Energy Profile 2.0 http://
L.A. Grieco, G. Boggia, M. Dohler, Standardized proto- www.zigbee.org/zigbee-for-developers/application
col stack for the internet of (important) things. Commun. standards/zigbeesmartenergy/
Surv. Tutorials IEEE 15(3), 1389–1406 (2013)
Ferro-Electric RAM Based
Microcontrollers: Ultra-Low Power 18
Intelligence for the Internet of Things

Sudhanshu Khanna, Mark Jung, Michael Zwerg,


and Steven Bartling

Microcontrollers (MCUs) serve a central role techniques that help achieve the low power metrics
in the design of IoT leaf nodes. A typical micro- in TI microcontrollers.
controller comprises of a processing core, program
and data memory, serial communication
interfaces, general purpose IO (GPIOs) ports, 18.1 MSP430FR5xx/6xx: 130 nm
comparators and ADCs, clock generation and FRAM Based Microcontrollers
power regulators. Microcontrollers have modest Optimized for IoT Applications
clock frequencies and memory capacities, keeping
the IC cost low. High level of integration helps This section discusses the design of the TI
improves performance, lowers power and helps FRAM Microcontroller series MSP430FR5xx/
achieve a small form factor. All of these are critical 6xx. We discuss a cross-hierarchy optimization
in the design of ICs for IoT applications. This process starting from technology node selection,
chapter focuses on the design of low power digital and analog circuit techniques, optimized
microcontrollers using two Texas Instruments digital design methodologies, and analog design
(TI) microcontrollers as examples. Both microcon- styles. The biggest knob in power optimization is
trollers feature embedded Ferro-electric RAM system design: we see how smart peripherals and
(FRAM) as a low power non-volatile unified data low power modes help achieve the ultra-low
and program memory. Low write power power (ULP) goals in this MCU. FRAM is a
non-volatile memory with high write endurance key enabler for IoT applications that need data
(number of write cycles) is critical in IoT logging. In this section we describe the technical
applications that feature data logging. FRAM has reasons and discuss FRAM operation,
low power writes at 1.5 V, practically unlimited advantages and limitations.
write endurance (>1015), and high yields with The MSP430FR5xx/6xx family of microcon-
millions of shipped parts. In the following sections trollers is designed for applications that demand
we describe analog, digital and system design ultra-low active and standby energy. Hence, it is
tailored for applications that require low average
and peak currents. With active currents in a range
S. Khanna (*) • M. Zwerg • S. Bartling
Texas Instruments, Dallas of 100 μA/MHz and standby currents of 450 nA
e-mail: skhanna@ti.com; m-zwerg@ti.com; including supply-voltage supervision and active
bartling@ti.com real-time clock (RTC), it serves the requirements
M. Jung of typical IoT applications. The FR5xx provides
Texas Instruments, Deutschland GmbH seven low power modes that address different
e-mail: m-jung@ti.com

# Springer International Publishing AG 2017 503


M. Alioto (ed.), Enabling the Internet of Things, DOI 10.1007/978-3-319-51482-6_18
504 S. Khanna et al.

Table 18.1 Typical ULP applications


Duty cycle
Application Active (%) LPM (%) Period Avg. IDD IDD dominated by:
Sensor hub 10 90 1s 89.6 μA Active
Fitness tracker 1 99 1 ms 52.9 μA Wakeup frequency
Blood glucose meter 0.4 99.6 3/day 4.7 μA LPM
Heat cost allocator 0.01 99.99 4s 5.3 μA LPM at high temp

scenarios: They serve for both applications with temperature leakage of the process, causing
frequent sleep/active transitions in the order of IDD to be higher than that of blood glucose
hundreds of microseconds with fast wakeup metering. This shows how temperature is also a
requirements, and applications with long sleep factor in application IDD. Thus an IoT MCU has
periods in the order of seconds or more. to be designed considering not only the duty
Power consumption in IoT applications varies cycle range across applications, but also the
due to several factors. Table 18.1 lists some application time period, and temperature range.
typical ULP applications. The duty cycle (active A main differentiator of the FR5xx is the use
to low power mode ratio) amongst these of FRAM which has a significantly lower write
applications varies by orders of magnitude. The access time and energy compared to flash mem-
longer an application stays in low power mode, ory. In typical IoT applications this shortens the
the lower the average current (IDD) should data logging phase and hence the active mode
be. Nevertheless, the average current consump- time. Due to the 1.5 V write voltage of the
tion is not a direct function of the duty cycle. The FRAM, no charge pump is required.
sensor hub takes sensor measurements and does The power supply architecture of the FR5xx is
further processing on the sensor data. In this case optimized for the requirements of the low-power
the average current is clearly dominated by the modes: Switchable low drop-out regulators
active mode current, with a high 10% duty cycle. (LDO) supply dedicated FRAM, debug, real-
However, while the fitness tracker application time-clock and core domains on demand. The
has a much lower duty cycle than the sensor core domain is subdivided into further switchable
hub (1% vs. 10%), the reduction in IDD is not power gated domains to reduce leakage currents
proportional. The small wakeup period in the during low-power modes. The predictive core
fitness tracker (1 ms) prevents the MCU from LDO quickly adapts to the load requirements of
going into the lowest low power mode. This the digital system which optimizes the bias cur-
shows that the time and energy cost of going rent of the LDO depending on the current load
into a low power mode is also important. Other scenario.
ULP applications like blood glucose metering In addition the digital multi-VT standard cells
remain in the deepest low power mode for a enable optimum balance between leakage and
majority of their lifetime (low duty cycle and active mode performance. Along with the variety
long time periods) and are dominated by the of provided slow 9 kHz up to fast 16 MHz clocks
leakage current of the MCU. Blood glucose and the extensive use of common low power
metering devices turn on a few times a day, but techniques like clock or bus gating, the design
stay active for several minutes. Here the MCU is optimized for low energy through all levels.
can make best use of the deepest low power By means of the Texas Instruments
mode. The heat cost allocator is a special case, MSP430FR59xx which is part of the FR5xx plat-
where the application is exposed to high form, it will be demonstrated how the ambitious
temperatures for most of its lifetime. Even low power requirements of an IoT application
though it has the lowest duty cycle, the low impacts the architecture and design of a micro-
power current is dominated by the high controller unit. The FR59xx family is developed
18 Ferro-Electric RAM Based Microcontrollers. . . 505

Fig. 18.1 FR59xx mixed-signal microcontroller block diagram (MSP430 FR59xx Mixed Signal Microcontrollers) and
chip photo (Baumann et al. 2013)

in a 130 nm CMOS process which fulfills the 18.1.1 Low Power Modes
requirements of a highly integrated ultra-low
leakage process. Figure 18.1 represents the The MSP430FR5xx microcontrollers provide
high-level block diagram of the FR59xx devices seven different low-power modes (LPM) which
and a chip photo of an example die of the all target different application scenarios. Each of
FR59xx family. the modes is a tradeoff between low bias and
FR59xx consists of: leakage currents versus quick wakeup times for
the system and its components. Table 18.2
• The digital MSP430 core and debug unit, summarizes the main characteristics of the
• the analog core and peripheral components low-power modes. For instance, LPM0 has all
like the power management system and high-frequency peripherals available and keeps
clock oscillators, the core LDO active. This results in minimum
• the digital peripherals and accelerators, and wakeup times. However, it comes along with rel-
• the FRAM and SRAM memories for code and atively high bias currents e.g. of the power man-
data storage. agement system or the active oscillators. In
contrast, LPM3 only allows for low frequency
All peripherals are optimized for ultra-low peripherals up to 50 kHz which can bring down
energy operation required by IoT applications. typical overall current consumption to only
Accelerators like the AES or the CRC reduce 400 nA. Power gating with full data retention
the encoding time and therefore enable a reduc- results in low digital leakage even at high
tion of active mode time. For example, AES temperatures. LPM3.5 and LPM4.5 provide low-
hardware encryption is 32 times faster compared est possible system currents, as the entire digital
to a software implementation on an MSP430 core is powered down. The penalty is a significant
(AES for Texas Instruments MSP430 Microcon- wakeup time increase, as the entire system has to
troller; MSP430 FR59xx Mixed Signal reboot after the low power mode wakeup. In most
Microcontrollers). Mixed-signal peripherals like instances LPM0 and LPM1 are best suited for
the ADC12 consume only 145 μA and are able to short sleep phases, while LPM3-LPM4.5 show
run during standby mode without the need for strength in case of long inactive periods.
waking up the CPU core. For a holistic comparison of the LPM current
consumption, the wakeup energy must be
506 S. Khanna et al.

Table 18.2 Power mode characteristics of the MSP430FR59xx devices (MSP430 FR59xx Mixed Signal
Microcontrollers)
Active LPM0/LPM1 LPM2 LPM3/LPM4 LPM3.5/LPM4.5
Description CPU Off Standby Standby/off RTC/Shutdown
Max. frequency 16 MHz 16 MHz 50 kHz 50 kHz/0 kHz 50 kHz/0 kHz
Typical currents ~100 μA/MHz 70 μA@1 MHz/ Downto 0.7 μA Downto Downto
35 μA@1 MHz 0.4 μA/0.3 μA 0.25 μA/0.02 μA
Typical wakeup Instant/6 μs 6 μs 7 μs 250 μs
1000 μs
CPU On Off Off Off Off
FRAM On Off or Off Off Off
standby/off
HF peripherals Yes Yes Off Off Off
LF peripherals Yes Yes Yes Yes/off Yes/off
Dig. power gating No No No Yes, full retention Yes, no retention

Fig. 18.2 FR59xx mixed-signal microcontroller block diagram (MSP430 FR59xx Mixed Signal Microcontrollers)

considered. Figure 18.2 shows that each of the (MSP430FR2xx Mixed Signal Microcontrollers)
power modes has a range in which it provides and MSP430F5xx (MSP430 F5xx Mixed Signal
optimum average current consumption. The Microcontrollers) as a crystal replacement. We
intersection of two curves means that the two mention it here to supplement the discussion on
low power modes show same current consump- tradeoffs in oscillator choices).
tion for the given frequency. The VLO is the slowest clock but also lowest
power oscillator in the system. The VLO is
always on and is mainly used by the power man-
18.1.2 Oscillators agement for internal low frequency
housekeeping. For example in low power mode
Clock sources are critical in ULP applications. the bandgap reference can be operated in a
The MSP430FR5xx microcontroller platform pulsed sample and hold scheme which drastically
provides different clock sources with varying reduces the average power consumption of the
frequency, power and startup time tradeoffs as voltage reference. In the FR57xx devices the
shown in Table 18.3 (*the REFO is not present in VLO helps to reduce overall system power by
MSP430FR59xx but is present in MSP430FR2xx providing a clock for the sample and hold circuit.
18 Ferro-Electric RAM Based Microcontrollers. . . 507

Table 18.3 Types of oscillators in MSP430FR5xx


VLO XTAL REFO* MODO DCO
Frequency 9 kHz 32 kHz 32 kHz 4.8 MHz 1–24 MHz
IDD 100 nA 200 nA 3 μA 25 μA (~15 μA @ 8 MHz)
(~23 μA @ 16 MHz)
Temp drift 0.5%/ C 0.03 ppm/ C 0.01%/ C 0.08%/ C 0.01%/ C
Supply drift 4%/V None 1%/V 1.4%/V 2%/V
Startup time Always on 1s 50 μs 100 ns 5 μs

The VLO can also be beneficial for other operation rather than bursting the system using
modules in the power modes LPM3 and LPM4. high frequency activity bursts. For minimum
For example it can be used as the clock source for energy it is better to run the oscillator as slow
the watch dog timer or LCD module. Both are as possible, rather than sub-dividing a higher,
low frequency peripherals that can easily tolerate fixed frequency oscillator clock. From power
the wider temperature and supply drift. Since the and startup time it fits between the REFO and
VLO is always on in the FR57xx family, it is also MODO. It has a faster startup time than the
considered a failsafe clock in this family. For REFO, but not as quickly as the MODO. On the
example it is used to detect whether the crystal other hand its current consumption is better as
oscillator has stopped working and in case of a the MODO for the same frequency and therefore
failure situation it will switch to the VLO clock it is better suited for the active mode system
so the system can avoid total failure. In this way clock.
modules will always see a clock source and con- The REFO is used as an on-chip replacement
tinue working. for the 32 kHz crystal oscillator on MSP430F5xx
The MODO is used as a fixed mid-range fre- and MSP430FR2xx. It is not as accurate as an
quency oscillator on demand. It has fast start up external crystal but can be used for clocking
time and thus can be activated for just for a few peripherals in low power modes, for example in
clock cycles to save power. For example, it is timers for accurate milli second or second count.
used for active ADC sampling or for noise filter- The UART can use the REFO for low baudrate
ing on the external reset pin. The high active communication (max 9600 baud). The main
power of this oscillator is the tradeoff for the advantage of the REFO in comparison to the
short on-time. Since the frequency is fixed and crystal oscillator is the overall system cost. In
aligned across device families, peripheral many low cost applications the crystal is an
modules can make use of this clock without the expensive component and on the other hand the
need to adjust clock dividers to match the DCO accuracy of the crystal is not required. Therefore
selected system frequency. This is important the REFO is a cost effective replacement for a
because the user can change the DCO frequency 32 kHz crystal, but cannot match an external
by software according to the application require- crystal in low power and accuracy. In
ment. But this should not impact modules that MSP430FR59xx, accuracy is prioritized over
require a constant frequency. cost and hence the XTAL is used instead of the
The DCO serves the purpose of a high fre- REFO. In MSP430FR59xx another clock, which
quency clock with configurable frequency. The is a divided by 128 version of MODOSC,
DCO is optimized for a wide tuning range with a replaces the REFO.
fairly low active current at the cost of startup time.
The DCO is optimized for a wide operating
frequency range to support different kind of 18.1.3 Power Supply Architecture
applications. With a current limited energy
source it might be better to avoid high peak The power architecture of the FR59xx is designed
currents by choosing a low frequency continuous to consume minimum possible power in each of
508 S. Khanna et al.

Fig. 18.3 Power Supply Architecture of the MSP430FR59xx devices (Baumann et al. 2013)

the active and low power states. It is illustrated in The digital core logic is supplied by the Low
Fig. 18.3. The power management unit (left block Power Core LDO during standby and off modes.
of Fig. 18.3) is one of the central components for At these power modes, the voltage and current
the ULP design of the microcontroller. It contains biasing circuits rely on a sampled mode concept
five low drop-out regulators (LDOs) which are (Fig. 18.4). This is based on the idea that the
enabled and configured depending on the current charges of a capacitor are frequently refreshed
active or low power mode. to ensure robust bias and reference conditions.
All of the LDOs can be disabled to reduce This short refresh cycle is reflected by the “On”
analog bias and digital leakage currents during state in the diagram of Fig. 18.4. In sampled
defined device states: For instance, the supply for mode, the reference and supply-voltage supervi-
the digital RTC logic is turned off during sor consume less than 200 nA. This stands in
LPM4.5, the supply for the FRAM from LPM0 contrast to the static full accuracy mode used in
onwards, and the debug LDO during user appli- active mode which requires currents in the range
cation mode. In LPM3.5 the RTC LDO is the of few μA.
only active regulator which supplies the RTC and
the 32 kHz oscillator which is required for the
RTC operation. All LDO output supplies use 18.1.4 Digital Core System
internal capacitors which reduces the number of
required supply pins and external components. In The digital core logic of the FR59xx devices,
addition, external voltage fail mechanisms have shown in the core domain box in Fig. 18.1, is
to be considered only for the main VDD supply. partitioned into an always-on domain, which is
During active mode, LPM0 and LPM1, the always powered if one of the core LDOs is
digital core and peripheral components are sup- active, and several power gated domains that
plied by the Predictive Core LDO. The supported can be switched off during LPM. While the
LDO load and hence the inherent LDO current is core domain is always switched off in the higher
adapted depending on the actual system activity, LPMs, the peripheral domains can remain active
which is determined by the number and fre- in case they are requested by one of the modules.
quency of the active clocks (Lueders et al. 2014). In order to enable faster wakeup from low power
18 Ferro-Electric RAM Based Microcontrollers. . . 509

Fig. 18.4 Switched mode reference generation of the FR59xx (Baumann et al. 2013)

Fig. 18.5 Low power mode current trend for the described architecture

mode, current sources (shown in the bottom part current is dominated by the quiescent
of Fig. 18.3) recharge the switchable power currents consumed by the analog reference and
domains in parallel to the LDO. Once the domain supervision circuits. The advantage of power gat-
supply hits a defined voltage, the current ing is becomes obvious with increasing
sources are disabled again to keep quiescent temperatures, when the overall system currents
currents low. are dominated by digital leakage. For reference,
Digital currents are dominated by leakage the green curve presents the power-off mode of a
during standby modes, especially when the device with active brownout, which only retains
device operates at higher temperatures. Fig- the wake-up capability via external
ure 18.5 shows the typical currents trend over I/O. Reference system and LDOs are switched off.
increasing temperatures (Baumann et al. 2013). Digital power gating becomes even more
The red curve stands for a system in standby attractive in mixed-VT designs, which make use
mode with no power gating, so all digital logic in of standard cells from different Vt classes. Cells
the core domain is powered. At low temperature with higher Vt (HVt) transistors are used to sig-
the total microcontroller current is just slightly nificantly reduce leakage currents, however they
higher compared to the same system in standby tend to be slower compared to cells with standard
mode with full power gating of all switchable Vt (SVt) transistors. Hence, SVt cells are usually
domains. This is due to the fact that the system’s used in timing critical logic. If all the timing
510 S. Khanna et al.

critical logic can be moved to the switched core


domain which is turned off during the leakage
sensitive standby modes, the system leakage can
be reduced without impacting system perfor-
mance in active mode.
However, digital power gating comes along
with additional area requirements with for the
power switches, logic retention signal isolation
and additional floorplanning constraints. For
logic retention, the data in the sequential cells
like flops or latches has to be powered even when
the rest of the domain is unpowered. Isolation Fig. 18.6 Structure of the PZT crystal
cells are placed at the boundary of an unpowered
and a powered domain to avoid floating inputs materials, and thus the name ferroelectric RAM.
into the powered domain when the switched However, unlike ferroelectric magnetic
domain is shut off. The floorplan usually materials, FRAM is immune to magnetic fields
segregates the domains, which introduces place- and alpha particles. Figure 18.7 shows the hys-
ment and logic optimization complexities. These teretic behavior of a ferroelectric (FE) capacitor.
additional requirements and limitations are The materials to manufacture FRAM are CMOS
strongly dependent on the design architecture, compatible, and add only two extra masks to the
e.g. number of domains, number of ports baseline CMOS process. This makes FRAM
between domains, number of retainable cells or processing much lower cost as compared to
maximum domain currents. A typical area pen- flash memory, which requires around 12 addi-
alty for the described system can be in between tional masks. TI has been mass producing
10 and 20%. standalone and embedded FRAM products for
more than 10 years with millions of shipped
parts demonstrating high yields and robust
18.1.5 Ferro-Electric RAM operation.
Fecaps (ferroelectric capacitors) can be writ-
In this section we describe the operation, ten fast (<10 ns), at low voltage (1.5 V), and
advantages and limitations of Ferroelectric without high write currents. This makes FRAM
RAM (FRAM). much more like SRAM than flash in its write
Data is stored in FRAM by electrically characteristics. Since the write mechanism
switching ions between stable positions in a involves only atomic movements, there is no
Lead Zirconium Titanate (PZT) crystal as wear and tear involved, resulting in high write
shown in Fig. 18.6. Movement of the Ti/Zr ion endurance (>1015 cycles). Fast, low voltage, low
is under the influence of an applied electric field current writes, and near-infinite write endurance
and once the field is removed, the ion retains its makes FRAM ideal for IoT applications that
position. Thus, based on the applied electric field need data logging. This is in contrast with flash
each PZT molecule has a stored polarization memory which requires multiple high voltages,
state. Similar to magnetic domains in a magnet, large write currents, and is also limited to ~10 K
domains in FRAM material align when an elec- write cycles. The high write energy in flash mem-
tric field is applied. When the field is removed, ory is a fundamental limitation in IoT
some domains get depolarized, but most domains applications.
retain polarization state, resulting in a net spon- Figure 18.7 shows the circuit diagram of the
taneous polarization. Thus the FRAM material simplest FRAM bitcell. It consists of an access
behaves as a programmable capacitor with hys- transistor and the Fecap. This is very similar to a
teretic behavior similar to ferroelectric magnetic DRAM bitcell. The difference is the Plateline
18 Ferro-Electric RAM Based Microcontrollers. . . 511

Read
Read signal
PL margin
Fe-Cap Q

Q PL
Linear
BL Cap
Write-0 Write-1 V(Q) = VDD * (CFE / CFE + CBL )
VDD VSS

VSS VDD
Capacitive Divider Based Read
Low Cap High Cap
Value Value

Q Q QB Q QB

Single Ended Differential 2C Differential 4C


1C

Area Read Signal Margin

Fig. 18.7 Hysteretic behavior of fecaps (top left), writing (top left) and reading (top right) a fecap. Structure of a
FRAM bitcell (bottom left)

(PL) terminal connected to the bottom plate of Read and write speed and energy are almost
the Fecap. To write a Fecap to a state of “1” a the same for FRAM. TI FRAM in 130 nm works
voltage of 1.5 V is applied at bit-line (BL) and at 8 MHz for both read and write operations. The
0 V at PL. To write a “0” 0 V is applied at the BL big enabler for IoT applications is that FRAM
and 1.5 V at the PL. For a NMOS pass gate, the write energy per bit is ~250 lower than flash.
word-line (WL) signal must be boosted for a Read energy is ~2 lower as well. With low
short period of time (few ns) to let the 1.5 V power and fast reads and writes, FRAM can be
signal pass while writing a “1”. Applying these used as a unified memory for both program code
voltages stores a high or low capacitance in the and data logging. Data retention in FRAM is also
Fecaps. To read the bitcell, the BL is floated and resistant to electric/magnetic fields as well as
PL is toggled from 0 V to 1.5 V as shown in radiation. Since FRAM state is not stored as a
Fig. 18.7. This forms a capacitor divider between charge, alpha particles are not likely to cause bits
the Fecap and the parasitic BL capacitor. In dif- to flip and the FRAM Soft Error Rate (SER) is
ferential FRAM schemes, each bitcell consists of below detectable limits.
two copies of the circuit and when the differen-
tial bitcell is read, BL and BLB develop voltages
near VDD/2 when PL is toggled low to high. 18.2 Compute Through Power Loss
Since the two sides of the bitcell have Fecaps in Using Non-Volatile Logic
low and high states, the voltage on BL and BLB
differ slightly, and a sense amplifier is used to IoT applications are duty cycled and use small
read the signal. Toggling the BL and PL is the batteries or energy harvesting energy sources.
most energy intensive operations since Fecaps Thus, lowering energy consumption (dynamic,
are dense capacitors. leakage and mode transition) is critical in
512 S. Khanna et al.

achieving the power and lifetime goals. As default values. For ULP applications the parasitic
described in the previous sections, ULP MCUs wakeup time and energy can be prohibitive. For
use process technologies that are specially simple applications like a temperature sensor,
optimized for low leakage. Voltage is lowered boot-up is the major component of the overall
to the minimum possible to meet latency. application active energy and time cost. Further,
Non-volatile FRAM is used instead of SRAM so real time applications like circuit breakers that
that in standby mode intermediate data and sensor need fast response times cannot use such a low
data can be retained with zero energy consump- power mode at all due to the long boot-up time.
tion. HVT logic and power gates are used exten- The mode transition can thus be a limiter in
sively to selectively reduce leakage currents with energy harvesting applications where the energy
lesser impact to drive strengths. Clock gating is source is erratic and energy or latency is critical
used to only enable switching in required sections (Zwerg et. al. 2011).
of logic. Sampled mode analog blocks are used to To prevent having to boot-up upon every
reduce quiescent currents. In addition, MCUs wakeup, FF state retention is used. Here, a por-
implement multiple low power modes that tion of each flip-flop is powered by a separate
tradeoff standby power for wakeup time by incre- “always-on” supply which is not power gated
mentally turning off different parts of the SoC. even in standby mode. This helps decrease
While dynamic and leakage power is critical wakeup time and energy by eliminating need
metrics in an IoT MCU, energy and time con- for boot-up but it increases the standby power
sumed in mode transitions and MCU wakeup are due to leakage in “always-on” portion of reten-
important metrics too. MCUs use thousands of tion flip-flops and the LDO generating the
flip-flops (FFs) as pipeline registers, general pur- “always-on” supply. In this manner, existing
pose registers and status flags. FFs are also used low power modes trade off standby power,
to save system configuration (available memory, wakeup time, and design volatility, but no single
peripheral count) and modes, analog calibration technology provides all of the above benefits
and trim settings. Unlike non-volatile memory together. Also, in energy harvesting applications,
like FRAM, flip-flops are volatile in nature, there may not be an always on supply available,
meaning that must have a constant power supply so retention may not be possible at all (without
to retain their logic state. Thus, when a MCU using supercapacitors or re-chargeable batteries).
(with volatile or even non-volatile FRAM) In this section, we describe non-volatile logic
faces a power interruption or is put in a power (NVL). NVL helps achieve standby power levels
gated low power mode, all flip-flops and latches close to a fully power gated mode while still
in a system are reset. This means that after each having fast wakeup similar to case where reten-
wakeup event or mode transition from a power tion is used. We describe a custom MCU that can
gated state to active mode, the MCU must seamlessly “compute through power loss” by
re-boot. Bootup not only consumes precious backing up all flip-slops and latches in the
energy resources, it can also be limiting in real MCU before a power loss. This also helps
time applications where the system must respond achieve lower application level power because
quickly to external events. the MCU can now be power gated with lower
Table 18.4 shows how the boot-up process energy and time overheads. We discuss different
takes 10,000s of cycles, 10s of milliseconds and architectures to enable compute through power
100s of nJ of energy. Boot-up is costly for cata- loss and describe our implementation in detail
log MCUs because they are built for a range of along with silicon measurements.
applications and cost points and are thus The NVL MCU achieves zero low power
configured and trimmed at boot. This “factory mode current while also having an ultrafast
boot” is then followed by the customer’s boot- sub-400 ns wakeup time. Non-volatile FRAM
up code or C-initialization with the variables and mini-arrays distributed throughout the logic
data structures used in the program are set to their domain of the MCU snapshot the state of all
18 Ferro-Electric RAM Based Microcontrollers. . . 513

Table 18.4 Wakeup times of typical MSP430 applications

• Brownout Application (1) C-init (2) User-Init (3)


• BG & REF
Power
• LDO
DC/DC FET 50us 23us
Management
USB FET BIOS 29ms 11ms
• Trim & Calibration ULP Bench 2.9ms 18us
• Access Protection
Boot Code TI-RTOS 44ms 1ms
• Debug Support
Graphics Lib
101ms 298us
• Variable init 400x240 pix.
• Queue/Buffer init (1) data based on real MSP430 applications
C-Init
User-Init • Peripheral init (2) C-init executes at default frequency (MSP430: 1MHz)
(3) User-init executes after clock config (here 16Mhz)

• Work Load Task


Application

sequential elements before the MCU goes into a


low power mode. A high bandwidth parallel con-
nection between the flip-flops and mini-arrays
helps achieve fast wakeup time. The MCU also
has ferroelectric RAM (FRAM) as a unified data
and program memory that retains data in the
absence of a supply voltage. Thus during a low
power mode, the MCU can be fully power gated,
and no power is required to maintain system
state. When the MCU wakes up from a low
power mode, data from NVL arrays is transferred
back to the flip-flops and latches, and the MCU
can resume operation from where it left off. Fig. 18.8 System architecture showing the 32b core,
Upon wakeup no bootup, C-initialization or peripherals, bus interface and the NVL controller. NVL
user-initialization is required. arrays are dispersed in the chip
Figure 18.8 shows the system block diagram
of the 1.5 V 8 MHz 75 μA/MHz NVL MCU with Section 18.2.1 introduces the NVL mini-array
zero standby power and 384 ns wakeup time. The architecture and compares it with existing NVL
MCU is built using a High VT (HVT) 130 nm architectures. Section 18.2.2 describes the NVL
CMOS process. The MCU contains a 32 bit core, bitcell and array periphery and summarizes the
64 KB FRAM (unified program and data mem- performance and power results from the
ory), 8 KB SRAM (for test only), 10 KB ROM, NVL MCU.
UART and SPI peripherals, and a serial wire
debug interface. Note that there is no power
management and analog blocks in this MCU. 18.2.1 NVL Mini-Array Architecture
So, low power mode current and wakeup time
contributions from these blocks would need to be Figure 18.9 shows the functional diagram of an
added to get a representative number for a full NVL MCU.
MCU SoC.
514 S. Khanna et al.

Supply Supply
Detection Bad /
Supply Good
Condition
Power External
Management Interrupt
Power State
Voltage
Source Machines
Regulator/LDO
Software
Backup /
Interrupt
SoC Restore
Supply Data

NVL Controller

FRAM
Digital Section Program and
Logic, Flip-flops, Stacks Data Memory
Backup
Program and
Data saved
NVL Arrays

Restore

Fig. 18.9 NVL SoC functional diagram. In this version of the NVL SoC, voltage regulation, supply detection and PM
state machines are off-chip

When the supply detection (SD) circuit resumed, following a similar procedure, data
detects a supply droop, the switch at the output from NVL arrays is restored back into the system
of the LDO is turned off to prevent spurious flip-flops, and system operation resumes from
voltages from reaching the MCU core. The sup- where it left off. No boot-up is required. We
ply capacitor holds enough charge to support the rely on off-chip components to regulate the
state backup. Then the SD activates the power power supply for this version of the chip.
management (PM) state machines with a supply NVL implementations reported so far (Wang
bad signal. Following this, data from all flip-flops et al. 2012; Qazi et al. 2013; Sakimura et al.
and latches in the MCU is backed up into NVL 2014; Liu et al. 2016) propose adding
arrays. Since the program and data memory is non-volatile elements in flip-flops (FF).
already non-volatile (FRAM), no action needs to Implementing NVL at the FF level is a yield
be taken for the memory backup/retention. In this risk because non-volatile elements (like ferro-
way the entire system state is saved before power electric capacitors) are not qualified to operate
loss. In the case of software based power inter- in the irregular environment of standard cell lay-
rupt, the same sequence is followed, except that out. Adding non-volatile elements to each FF
the supply switch is turned off last. The NVL also makes the FF large because read/write/test
controller orchestrates data backup and restore, circuits must be duplicated in each FF and cannot
and controls communication between the system be shared across multiple bits. Further, test
flip-flops and the NVL arrays. When power is related reference voltages and a high voltage
18 Ferro-Electric RAM Based Microcontrollers. . . 515

Fig. 18.10 (a) NVL mini-


array architecture. Arrays
dispersed throughout the
SoC serve 256 neighboring
FFs. A central NVL
controller synchronizes
communications between
the arrays and the FFs. (b)
NVL FF with an additional
data and enable port as
compared to a regular FF
(c) 256FF are connected to
a NVL array using a 32b
wide 8–1 mux

supply have to be routed to every FF in the signals to and from all FFs in the system have to
design. Finally, spreading non-volatile elements be routed to a central location. Limited band-
makes test complex and results in high test time. width and unscalable nature of a centralized
A centralized NVL implementation helps over- NVL architecture makes it less attractive for the
come the above yield, density and test issues. most energy and latency critical applications.
However, a centralized array results in limited We use a novel NVL implementation called
bandwidth and a slow wakeup time. For exam- the NVL Mini-Array Architecture (Khanna et al.
ple, the 64 KB FRAM we use in our MCU can 2014) (shown in Fig. 18.10a). Custom 256b
read/write 64 bits of data at 8 MHz. This FRAM mini-arrays are used to backup FF data.
translates to a theoretical minimum 5 μs wakeup Multiple such arrays are dispersed throughout the
time for our 32-bit core (which has 2537 FFs). If digital area with each array serving 256 neighbor-
a more complex core like ARM Cortex M3 or ing FFs. This approach allows us to amortize
MSP430 is used, the FF count and wakeup time fecap related read, write and test circuitry over
would be 3–4 the above number. Routing is 256 bits in the mini-array. Testability and bit
also an issue in the centralized approach since level screening features available in a regular
516 S. Khanna et al.

FRAM array can be similarly implemented for Figure 18.10c shows the schematic of the
such a mini-array. This is critical for making a NVL flip-flop. The NVL FF is a modified reten-
viable product. Being in a regular array layout tion FF. The difference between a regular FF and
environment also drives fecap yield higher. Since a retention FF is that the slave latch (shaded) of a
each mini-array has a 32b interface and all mini- retention FF is powered by a separate supply.
arrays can be accessed in parallel, system backup This allows the primary logic supply to be turned
and restore is fast and wakeup time remains off during sleep mode to save leakage power. At
independent of the total FF count. Thus, the the same time, state of the FF is retained by only
mini-array architecture provides speed similar keeping the minimum sized slave latch powered.
to a FF level implementation with the yield, The NVL FF modifies a retention FF by adding
density, and test advantages of a centralized two new ports ND and NU to the slave stage as
NVL implementation. shown in red. During regular operation, NU ¼ L
Figure 18.10b shows how NVL arrays connect and ND ¼ X. Here, the NVL FF behaves like a
to FFs. An NVL array is organized into 8 rows of regular retention FF. After coming back from a
32 bits each. During a data backup, the NVL sleep mode, data must be transferred from the
controller walks through all the Mux_Sel NVL array into the NVL FF. To accomplish this,
combinations transferring data from 256 FFs to NU is pulsed high for a cycle while ND has valid
8 rows of an NVL array. To enable transferring data. RET is kept high at this time to allow ND to
data from NVL array to FFs we add an additional propagate to QB (CLK maybe X because the
data port to a regular FF as shown in Fig. 18.10c. logic domain may be off while restoring data).
Upon wakeup, the NVL controller reads the NVL When NU goes low, the latch closes and ND is
arrays row by row while simultaneously pulsing latched as QB. Since inverted version of ND is
appropriate NU signals, thereby transferring data available at Q, an inverter is placed in the path
from the arrays to the FFs. The system CLK is between the NVL array and the NVL FF. We
kept low during backup and can be in any state carefully add circuits only to the slave latch so
during restore. RET is kept high during restore. that they are not on the critical CLK to Q path of
This direct parallel connection between FFs and the FF and are minimum sized. The NVL FF is
NVL arrays helps achieve a fast wakeup time. As 19% larger, 2% slower and has 7% higher power
MCU FF count increases, array count also than a retention FF. At the system level perfor-
increases, keeping bandwidth and wakeup time mance and power impact is not perceptible.
independent of the FF count. Also, since the Moreover, these numbers are for F1 FF. For
MCU core is not involved in any data transfer, higher drive strengths the percent increase is
this NVL implementation is design agnostic and lower.
easy to add in any existing core or ASIC.
Conventional flip-flops have a single data
input port. To insert data from NVL arrays back 18.2.2 NVL Bitcell Design
into a flip-flop, we added a second data input port
(ND) and an enable port (NU) to a regular FF. As In this section we describe the NVL bitcell. Sig-
shown in Fig. 18.10b, the 32 bit data output bus nal margin is a critical metric for evaluating
of the NVL array is connected to all 8 FF groups. robustness of a ferroelectric capacitor (FeCap)
When row 0 of the NVL is being read, NU ports based bitcell. Large non-volatile memory arrays
of the FF group 0 are enabled. In this fashion, all implement Error Correction Codes (ECC) to
eight rows are sequentially read. Again, the NVL detect and correct bit errors after a read opera-
controller generates the NU signals and tion. However, ECC is impractical to implement
synchronizes data transfer between the NVL in a small 256b mini-array. In absence of ECC
array and FFs. We call the modified flip-flop an the bitcell inherently must have a high signal
NVL flip-flop. margin. Thus we use a custom bitcell rather
18 Ferro-Electric RAM Based Microcontrollers. . . 517

Fig. 18.11 An FRAM


bitcell with two FeCaps
using BL capacitance as
load. An NVL bitcell uses
four active FeCaps resulting
in 2X the signal margin.
Local SA helps avoid charge
sharing between storage
nodes and the bitlines.
Bitcell read waveforms are
shown with modifications
for signal margin
measurement in dotted
lines. During a normal read,
Q and QB develop a
nominal signal of ~0.5 V
and ~1 V, respectively.
Pre-charging Q to 0.3 V
reduces signal differential
from 0.5 V to 0.2 V

than a conventional FRAM bitcell. Figure 18.11 The NVL bitcell is read using the waveform
shows a conventional FRAM bitcell, the NVL sequence shown in Fig. 18.11. NVL control
bitcell and NVL read and signal margin measure- signals are generated at clock edges of a high
ment waveforms. frequency NVL clock. Q and QB are initially
A ferroelectric capacitor is a programmable discharged. PL1 is toggled high, which causes a
hysteretic capacitor whose value is set during a signal to develop on Q (~0.5 V nominal) and QB
write based on the voltage polarity applied across (~1 V nominal). This differential (of 0.5 V) is
the FeCap. In this way, a FeCap can be set to a then converted to a full swing signal by the sense
low or high capacitance state. During a read, the amplifier (SA). Each bitcell has a local SA to
FRAM bitcell thus behaves like a capacitive avoid loss of signal due to charge sharing with
divider with the BL capacitor as a load. When bitlines. Loss due to charge sharing is small in
PL is toggled, a signal develops on BL and BLB nominal conditions, but can be significant for tail
based on the value stored in the FeCaps. To get bits. Signal margin is measured by pre-charging
higher signal margin, we use an active FeCap the side storing a “0” to a disturb voltage. In the
load rather than the constant BL load. Data is example shown in the figure, Q stores a “0”. To
stored in all four FeCaps rather than just two, measure signal margin, it is pre-charged to a
resulting in almost double the signal margin as disturb voltage (0.3 V here). Then, when PL1 is
compared to an FRAM bitcell. Figure 18.12 toggled, the signal that develops on Q is the sum
shows the array periphery. The NVL array of the disturb voltage and the signal that would
consists of the bitcell array, row drivers, IO have developed during a normal read
circuits and a control and timing block. (0.3 V + 0.5 V). The signal on the other side
518 S. Khanna et al.

Write Drivers RD latch and Scan flop


LCH-EN
STORE Vcon_a BL
IN BL SCLK
M1 L1
SO V_Off SI OUT
BLB
SCLKB SCLK PASSB
L2 SO
STORE Vcon_b

BL Disturb TGs

All 0/1 Write Drivers


Parallel Test Read Logic
ALL_1_A ALL_0_B
M1 M3
corr_1 * corr_1 corr_0 * corr_0
BL BLB G1 G0
ALL_0_A ALL_1_B OUT OUT
M2 M4
* from (n-1)th IO

Fig. 18.12 IO circuits used in the NVL array periphery. implement fast on-chip NVL test. Implementing all
Read/write circuits are amortized over eight rows. Scan these circuits in every FF is prohibitive, making the
flop helps provide for a dedicated NVL scan chain. BL mini-array architecture the most appropriate for a practi-
disturb TGs, and All 0/1 write and read circuits help cal NVL implementation

remains the same at 1.0 V. Hence, we are able to Fig. 18.10b) are not an RTL construct and are
compress the differential available to the SA added to the design as a post synthesis step. This
(from 0.5 V to 0.2 V). Signal margin for a bitcell is very similar to adding scan to the design: only
is defined as the highest disturb voltage for a a scan controller and scan related IOs and
correct read. interfaces are added to the RTL, but the actual
During a write operation, store drivers in the scan chains and scan FFs are added only after
array periphery are enabled and drive BL and synthesis. We have designed custom scripts that
BLB to the correct logic. In the waveform swap regular FFs with NVL FFs and connect
sequence, first PL1 and PL2 are pulsed high. Q them to NVL arrays in a physical aware manner
and QB couple high due to capacitance of the during place and route.
FeCaps. Next, the pass gates are enabled and Q Table 18.5 shows measured results from the
and QB are driven to the correct logic values. 1.5 V HVT NVL MCU. The MCU runs at a
Then the SA is enabled which assists the store maximum frequency of 8 MHz (at the slow cor-
driver in holding Q and QB. At this time, the ner) and consumes 75 μA/MHz and 170 μA/MHz
FeCaps on “0” side get written. Then the PL1 and while running code from SRAM and FRAM,
PL2 and driven back low. At this time, the respectively. NVL operates on a separate clock
FeCaps in the “1” side get written. This marks that can run up to 125 MHz resulting in an
the end of the write cycle. ultrafast 384 ns wakeup time. In comparison,
NVL insertion into a design starts by addition conventional MCUs take 100s of microseconds
of the NVL controller, NVL array and mux to bootup from a power gated state. NVL has a
instantiations and an NVL controller interface one-time energy cost of backing up and restoring
on the power management module (PMM) RTL FF data. However, without NVL the sleep mode
of the design. No changes are required to the core leakage due to the powered retention FF is
or peripherals. Note that at this point, FFs are not 280 nW at 85C and 1740 nW at 125C. This
connected to NVL arrays. NVL FFs (which have results in a breakeven time of 34 ms and
two data inputs as opposed to one, as show in 5.5 ms, respectively. This is much smaller than
18 Ferro-Electric RAM Based Microcontrollers. . . 519

Table 18.5 NVL MCU simulation and measurement are diverse and can vary in duty cycle, time
results period of usage and operating temperature
Regular SoC operation range. An IoT MCU design must be optimized
Technology node 130 nm, HVT for all such usage scenarios. MCUs are highly
Clock frequency 8 MHz @ 1.5 V integrated and thus a flexible but efficient system
Energy consumption SRAM: 75 μW/MHz design is key to minimizing energy and latencies.
(code in SRAM/FRAM) FRAM: 170 μW/MHz
In addition, innovations and optimizations are
NVL sleep mode Lkg power 0.000 nW
Retention mode Lkg power 280 nW @ 85C
needed at all design hierarchies to get to the
(only slave stage powered) 1740 nW @ 125C lowest energy and power targets. TI 430FR5XX
NVL operation devices feature multiple low power modes to
Clock frequency 125 MHz accommodate applications with varying duty
Chip backup cost 7.2 nJ, 320 ns cycles. A range of LDOs and oscillators with
Chip restore cost 2.4 nJ, 384 ns different performance vs. power tradeoffs help
Breakeven time (based on 34 ms @ 85C achieve optimal performance and power number
leakage above) 5.45 ms @ 125C for each mode without compromising the other.
Smart peripherals like the ADC, AES operate in
low power mode and can alert the MCU core to
32b CORE and
PERIPHERALS wake up if needed. Technology node selection,
transistor optimization, extensive clock gating,
10 NVL
Mini power gating, mixed Vt implementation, and
Arrays retention flip-flops are used to minimize digital
64KB FRAM
logic dynamic and leakage power. Some analog
10KB
ROM blocks in the chip are on even in the lowest
8KB power mode. Sampled analog techniques are
SRAM used to reduce quiescent current in these blocks.
This shows the range of optimizations needed to
achieve low power targets in a ULP MCU.
Fig. 18.13 NVL MCU die photo On-chip Ferro-electric RAM (FRAM) offers
non-volatility and low power, fast, random
typical sleep times in most ULP applications.
access write and read operation. Write energy
Moreover, we have not taken into account poten-
per bit is 250 lower than flash memory,
tial saving made possible by turning off the
enabling data logging in energy constrained IoT
on-chip LDO that is usually part of an MCU
applications. TI FRAM MCUs offer the com-
SoC. Figure 18.13 shows the die photo. Since
bined benefits of an optimized ULP implementa-
our MCU only has no power regulation, we
tion and FRAM flexibility.
achieve zero standby power. In a complete SoC
Mode transition time and energy restrict the
the RTC, and some power management may be
use of the lowest power modes. Power gating the
always on, but even then switching off the digital
entire chip helps lower leakage, but requires a
helps reduce standby power by about 40%
costly bootup process. Non-volatile logic helps
according to our system level estimates. Area
lower leakage by transferring the contents of flip-
overhead at the SoC level is estimated to be
flops into local NVL arrays before a power inter-
~10%.
ruption. Upon wakeup, the data is transferred
back, and no bootup is required. In this way,
NVL helps lower the energy cost and breakeven
18.3 Conclusion time of using the lowest lower mode in an MCU.
Further, the mini-array architecture helps provide
Leakage power, dynamic power and mode tran- a practical implantation of NVL as compared to
sition time and energy are all critical in an MCU other solutions that have yield, test time and area
targeted for IoT applications. IoT applications penalties.
520 S. Khanna et al.

References M. Lueders et al., Architectural and circuit design


techniques for power management of ultra-low-
power MCU systems. IEEE Trans. Very Large Scale
AES for Texas Instruments MSP430 Microcontroller.
Integr. Syst. 22(11), 2287–2296 (2014)
http://jcewww.iaik.tu-graz.ac.at/index.php/sic/
MSP430 FR2xx Mixed Signal Microcontrollers: http://
Products/Crypto_Software_for_Microcontrollers/
www.ti.com/lit/ds/symlink/msp430fr2533.pdf
AES_for_Texas_Instruments_MSP430_
MSP430 FR54xx Mixed Signal Microcontrollers: http://
Microcontrollers
www.ti.com/lit/ds/symlink/msp430f5438a.pdf
A. Baumann et al., A MCU platform with embedded
MSP430 FR59xx Mixed Signal Microcontrollers: http://
FRAM achieving 350 nA current consumption in
www.ti.com/lit/ds/symlink/msp430fr5969.pdf and
real-time clock mode with full state retention and
http://www.ti.com/lit/ug/slau367g/slau367g.pdf
6.5 μs system wakeup time, in VLSI Circuits
M. Qazi, A. Amerasekera, A.P. Chandrakasan, A 3.4pJ
(VLSIC), 2013 Symposium on, Kyoto, 2013 (2013),
FeRAM-enabled D flip-flop in 0.13 μm CMOS for
pp. C202–C203
nonvolatile processing in digital systems, in Solid-
M. Zwerg et al., An 82 uA/MHz Microcontroller with
State Circuits Conference Digest of Technical Papers
embedded FeRAM for energy harvesting applications,
(ISSCC), 2013 I.E. International, 17–21 Feb. 2013
in International Solid-State Circuit Conference
(2013), pp. 192–193
(ISSCC), San Francisco, CA, 2011, pp. 334–335
N. Sakimura et al., 10.5 A 90 nm 20 MHz fully nonvola-
S. Khanna, S.C. Bartling, M. Clinton, S. Summerfelt,
tile microcontroller for standby-power-critical
J.A. Rodriguez, H.P. McAdams, An FRAM-based
applications, in Solid-State Circuits Conference
nonvolatile logic MCU SoC exhibiting 100% digital
Digest of Technical Papers (ISSCC), 2014 I.E. Inter-
state retention at 0 V achieving zero leakage With
national, San Francisco, CA, 2014 (2014),
400-ns wakeup time for ULP applications. IEEE
pp. 184–185
J. Solid-State Circ. 49(1), 95–106 (2014)
Y. Wang, Y. Liu, S. Li, D. Zhang, B. Zhao, M.F. Chiang,
Y. Liu et al., 4.7 A 65 nm ReRAM-enabled nonvolatile
Y. Yan, B. Sai, H. Yang, A 3us wake-up time nonvol-
processor with 6 reduction in restore time and 4
atile processor based on ferroelectric flip-flops, in
higher clock frequency using adaptive data retention
Proceedings of the ESSCIRC (ESSCIRC), 17–21
and self-write-termination nonvolatile logic, in 2016 I.
Sept. 2012 (2012), pp. 149–152
E. International Solid-State Circuits Conference
(ISSCC), San Francisco, CA, 2016 (2016), pp. 84–86

You might also like