Ieee Computationalintelligence 202108userupload - in

Organising Committee
General Chairs
IEEE Symposium Series on Computational Intelligence
Keeley Crockett, UK
Sanaz Mostaghim, Germany IEEE SSCI 2021
Program Chairs
Alice Smith, USA
December 4th - 7th 2021, Orlando, Florida, USA
Carlos A. Coello Coello, Mexico
IEEE SSCI is an established flagship annual international series of symposia on
Finance Chair
computational intelligence sponsored by the IEEE Computational Intelligence Society. IEEE
Piero Bonissone, USA
SSCI 2021 promotes and stimulates discussion on the latest theory, algorithms,
Publications Chairs
applications and emerging topics on computational intelligence across several unique
Dipti Srinivasan, Singapore
Anna Wilbik, Netherlands symposia, providing an opportunity for cross-fertilization of research and future
collaborations. For more information visit http://attend.ieee.org/ssci-2021
Conflict of Interest Chair
Marley Vellasco, Brazil
Registration Chairs
IEEE Computational Intelligence Symposia on
Julia Cheung, Taiwan Adaptive Dynamic Programming and Explainable Data Analytics in Computational
Steven Corns, USA
Reinforcement Learning (IEEE ADPRL) Intelligence (IEEE EDACI)
Keynote Chairs CI in Agriculture (IEEE CIAg) Evolutionary Learning (IEEE EL)
Bernadette Bouchon-Meunier,
CI for Brain Computer Interfaces (IEEE Evolutionary Neural Architecture Search and
France
Gary Yen, USA CIBCI) Applications (IEEE ENASA)
Manuel Roveri, Italy Computational Intelligence in Big Data (IEEE Evolutionary Scheduling and Combinatorial
Special Session Chair CIBD) Optimisation (IEEE ESCO)
Marde Helbig, Australia CI in Biometrics and Identity Management Ethical, Social and Legal Implications of
Tutorial Chairs (IEEE CIBIM) Artificial Intelligence (IEEE ETHAI)
Sansanee Auephanwiriyakul, CI in Control and Automation (IEEE CICA) CI in Feature Analysis, Selection and
Thailand CI in Healthcare and E-health (IEEE Learning in Image and Pattern Recognition
Daniel Ashlock, Canada CICARE) (IEEE FASLIP)
Manuel Roveri, Italy CI in Cyber Security (IEEE CICS) Foundations of CI (IEEE FOCI)
Submission Chair CI in Data Mining (IEEE CIDM) Evolvable Systems (IEEE ICES)
Nicolò Navarin, Italy CI in Dynamic and Uncertain Environments Immune Computation (IEEE IComputation)
Local Organizing Chair (IEEE CIDUE) Intelligent Agents (IEEE IA)
Zhen Ni, USA CI and Ensemble Learning (IEEE CIEL) Multi-agent System Coordination and
Travel Grant Chair CI for Engineering Solutions (IEEE CIES) Optimization (IEEE MASCO)
Pauline Haddow, Norway CI for Human-like Intelligence (IEEE CIHLI) Model-Based Evolutionary Algorithms (IEEE
Webmaster Chair CI in IoT and Smart Cities (IEEE CIIoT) MBEA)
Jen-Wei Huang CI for Multimedia Signal and Vision Multicriteria Decision-Making (IEEE MCDM)
Processing (IEEE CIMSIVP) Nature-Inspired Computation in Engineering
Whova Chair
Albert Lam, Hong Kong CI in Remote Sensing (IEEE CIRS) (IEEE NICE)
CI for Security and Defence Applications Robotic Intelligence in Informationally
Conference Activities Chair
Bing Xue, New Zealand (IEEE CISDA) Structured Space (IEEE RiiSS)
Deep Learning (IEEE DL) Cooperative Metaheuristics (IEEE SCM)
Social Event Chair
Jo-Ellen Synder, USA Evolving and Autonomous Learning Systems Differential Evolution (IEEE SDE)
(IEEE EALS) Swarm Intelligence Symposium (IEEE SIS)
Publicity Chairs
Jialin Liu, China
Joao Carvalho, Portugal
Matt Garrat, Australia
Important Dates Call for Papers
Special Session Proposals: May 28th 2021 Papers for IEEE SSCI 2021 should be submitted
Tutorial Proposals: May 28th 2021 electronically using the conference website: http://
Paper Submissions: August 6th 2021 attend.ieee.org/ssci-2021 and will be reviewed by
experts in the fields and ranked based on the crite-
Paper Acceptance: September 17th 2021
ria of originality, significance, quality and clarity.
Camera ready Paper: October 15th 2021
Call for Tutorials Call for Special Sessions

IEEE SSCI 2021 will feature tutorials covering fun- Special session proposals should include title, a de-
damental and advanced topics in computational in- scription of the scope, a list of topics covered, a list
telligence. A tutorial proposal should include title, of potential contributors, and names, affiliations,
short introduction to the topic, an outline of the tu- websites, and bios of special session organizers.
torial, length of the tutorial, level and names, affilia-
Digital Object Identifier 10.1109/MCI.2021.3084497

Volume 16 Number 3 ❏ August 2021
www.ieee-cis.org
Features
10 Evolutionary Multi-Objective Model
Compression for Deep Neural Networks
by Zhehui Wang, Tao Luo, Miqing Li, Joey Tianyi Zhou,
Rick Siow Mong Goh, and Liangli Zhen
22
Fast and Unsupervised Neural Architecture
Evolution for Visual Representation Learning
by Song Xue, Hanlin Chen, Chunyu Xie, Baochang Zhang,
Xuan Gong, and David Doermann
33
Self-Supervised Representation Learning
for Evolutionary Neural Architecture Search
by Chen Wei,Yiping Tang, Chuang Niu, Haihong Hu,Yue Wang, and Jimin Liang
50
Forecasting Wind Speed Time Series
Via Dendritic Neural Regression
on the cover
©SHUTTERSTOCK.COM/ANDREY SUSLOV
by Junkai Ji, Minhui Dong, Qiuzhen Lin, and Kay Chen Tan
67
A Self-Adaptive Mutation Neural Architecture
Search Algorithm Based on Blocks
by Yu Xue,Yankang Wang, Jiayu Liang, and Adam Slowik
Promoting Sustainable Forestry
SFI-01681
IEEE Computational Intelligence Magazine (ISSN

Columns
1556-603X) is published quarterly by The Institute of
Electrical and Electronics Engineers, Inc. Headquarters:
79 Application Notes
3 Park Avenue, 17th Floor, New York, NY 10016-5997, Multi-View Feature Construction Using Genetic Programming
U.S.A. +1 212 419 7900. Responsibility for the contents for Rolling Bearing Fault Diagnosis
rests upon the authors and not upon the IEEE, the Soci-
by Bo Peng,Ying Bi, Bing Xue, Mengjie Zhang, and Shuting Wan
ety, or its members. The magazine is a membership benefit
of the IEEE Computational Intelligence Society, and sub-
scriptions are included in Society fee. Replacement copies
for members are available for US$20 (one copy only).
Nonmembers can purchase individual copies for
US$220.00. Nonmember subscription prices are available
on request. Copyright and Reprint Permissions:
Abstracting is permitted with credit to the source. Librar-
ies are permitted to photocopy beyond the limits of the
U.S. Copyright law for private use of patrons: 1) those
post-1977 articles that carry a code at the bottom of the Departments
first page, provided the per-copy fee is paid through the 2 Editor’s Remarks 8 Guest Editorial
Copyright Clearance Center, 222 Rosewood Drive, Dan-
vers, MA 01970, U.S.A.; and 2) pre-1978 articles without 3 President’s Message by Yanan Sun, Mengjie Zhang,
fee. For other copying, reprint, or republication permis-
by Bernadette and Gary G.Yen
sion, write to: Copyrights and Permissions Department,
IEEE Service Center, 445 Hoes Lane, Piscataway NJ Bouchon-Meunier 96 Conference Calendar
08854 U.S.A. Copyright © 2021 by The Institute of Elec-
5 Publication Spotlight by Marley Vellasco
trical and Electronics Engineers, Inc. All rights reserved.
Periodicals postage paid at New York, NY and at addi- by Haibo He, Jon Garibaldi,
tional mailing offices. Postmaster: Send address changes to Carlos A. Coello Coello,
IEEE Computational Intelligence Magazine, IEEE, 445
Hoes Lane, Piscataway, NJ 08854-1331 U.S.A. PRINT- Julian Togelius,Yaochu Jin,
ED IN U.S.A. Canadian GST #125634188. Yew Soon Ong, and Hussein Abbass
Digital Object Identifier 10.1109/MCI.2021.3056259 AUGUST 2021 | IEEE COMPUTATIONAL INTELLIGENCE MAGAZINE 1
CIM Editorial Board Chuan-Kang Ting Editor’s
Editor’s
Editor-in-Chief
Chuan-Kang Ting
National Tsing Hua University, TAIWAN Remarks
National Tsing Hua University
Department of Power Mechanical Engineering
No. 101, Section 2, Kuang-Fu Road
Hsinchu 30013, TAIWAN
(Phone) +886-3-5742611
(Email) ckting@pme.nthu.edu.tw
Founding Editor-in-Chief
Gary G.Yen, Oklahoma State University, USA
The Evolution of Neural Networks
Past Editors-in-Chief
Kay Chen Tan, Hong Kong Polytechnic
University, HONG KONG
Hisao Ishibuchi, Southern University of Science and
Technology, CHINA
Editors-At-Large
Piero P. Bonissone, Piero P Bonissone Analytics
LLC, USA
David B. Fogel, Natural Selection, Inc., USA
Vincenzo Piuri, University of Milan, ITALY
Marios M. Polycarpou, University of Cyprus,
The focus on evolutionary neural architecture search in this
issue has driven me to ponder the evolution of biological
neural networks, or rather, the human brain. Researchers in
neuroanatomy found that the cognitive and mental development
of humans is attributed to the increase of our brain size through-
CYPRUS
Jacek M. Zurada, University of Louisville, USA out evolution. Yet bigger is not always better. It was also discov-
ered that, once the brain reaches a certain size, further growth
Associate Editors
José M. Alonso, University of Santiago de will only render the brain less efficient. In addition, the brain is limited and affect-
Compostela, SPAIN ed by its inherent architecture and signal processing time. These discoveries in neu-
Erik Cambria, Nanyang Technological University,
SINGAPORE roscience are interestingly analogous to the advances in neural networks and
Liang Feng, Chongqing University, CHINA evolutionary computation. Stacking of perceptrons empowers artificial neural net-
Barbara Hammer, Bielefeld University, GERMANY
Eyke Hüllermeier, University of Munich, GERMANY works to solve complex problems but detracts from their efficiency. Indeed, the
Sheng Li, University of Georgia, USA human brain evolution and the artificial neural network evolution both face the
Hsuan-Tien Lin, National Taiwan University,
TAIWAN trade-off between capacity and efficiency. The workings of nature truly give us a
Hongfu Liu, Brandeis University, USA lot to mull over.
Zhen Ni, Florida Atlantic University, USA
Yusuke Nojima, Osaka Prefecture University, JAPAN This issue includes five Features articles. The first article proposes an evolutionary
Nelishia Pillay, University of Pretoria, SOUTH AFRICA multi-objective model compression approach to simultaneously optimize the model
Danil Prokhorov, Toyota R&D, USA
Kai Qin, Swinburne University of Technology, size and accuracy of a deep neural network. The second and third articles adopt the
AUSTRALIA notion of self-supervised learning in the neural architecture evolution, resulting in
Rong Qu, University of Nottingham, UK
Ming Shao, University of Massachusetts state-of-the-art performance. To improve the accuracy of wind speed forecasting, the
Dartmouth, USA fourth article uses an evolutionary algorithm to optimize the architecture of dendritic
Vincent S. Tseng, National Chiao Tung University,
TAIWAN neural regression. The fifth article presents a self-adaptive mutation strategy for block-
Kyriakos G.Vamvoudakis, Georgia Tech, USA based evolutionary neural architecture search on convolutional neural networks.
Nishchal K.Verma, Indian Institute of Technology
Kanpur, INDIA In the Columns, the article proposes a multi-view feature construction method
Handing Wang, Xidian University, CHINA based on genetic programming and ensemble techniques. The approach can automati-
Dongrui Wu, Huazhong University of Science and
Technology, CHINA cally generate high-level features from multiple views; moreover, a new fitness func-
Bing Xue, Victoria University of Wellington, tion is developed to improve the discriminability of these features. Experimental
NEW ZEALAND
results show the effectiveness of the proposed approach in rolling bearing fault diag-
IEEE Periodicals/ nosis with a limited amount of data.
Magazines Department
Managing Editor, Mark Gallaher
We hope you will enjoy the articles in this issue. If you have any suggestions or
Senior Managing Editor, Geri Krolin-Taylor feedback for this magazine, please do not hesitate to contact me at ckting@pme.nthu
Senior Art Director, Janet Dudar .edu.tw.
Associate Art Director, Gail A. Schnitzer
Production Coordinator, Theresa L. Smith
Director, Business Development—
Media & Advertising, Mark David
Advertising Production Manager,
Felicia Spagnoli
Production Director, Peter M. Tuohy
Editorial Services Director, Kevin Lisankie
Senior Director, Publishing Operations,
Dawn Melley
IEEE prohibits discrimination, harassment, and bullying.

For more information, visit http://www.ieee.org/web/
abou-tus/whatis/policies/p9-26.html.
Digital Object Identifier 10.1109/MCI.2021.3056260 Date of current version: 15 July 2021
2 IEEE COMPUTATIONAL INTELLIGENCE MAGAZINE | AUGUST 2021

President’s CIS Society Officers
Message
Bernadette Bouchon-Meunier
CNRS—Sorbonne Université, President – Bernadette Bouchon-Meunier,
FRANCE Sorbonne Université, FRANCE
President Elect – Jim Keller, University of
Missouri, USA
Vice President – Conferences – Marley M. B. R.
Vellasco, Pontifical Catholic University of
Rio de Janeiro, BRAZIL
Vice President – Education – Pau-Choo (Julia)
Chung, National Cheng Kung University,TAIWAN
Ethics, Diversity and Consciousness in AI Vice President – Finances – Pablo A. Estévez,
University of Chile, CHILE
Vice President – Members Activities – Sanaz
Mostaghim, Otto von Guericke University of
Magdeburg, GERMANY
E
Vice President – Publications – Kay Chen Tan,
thics in Artificial Intelligence (AI) is discussed everywhere, Hong Kong Polytechnic University, HONG KONG
in governmental circles as well as scientific forums. It is very Vice President – Technical Activities –
often associated with the two widely used concepts of Luis Magdalena, Universidad Politecnica
de Madrid, SPAIN
diversity and eXplainable Artificial Intelligence (XAI). The latter
was promoted by DARPA1, and it is expected to propose meth- Publication Editors
ods preserving the rights of users to understand how the AI sys- IEEE Transactions on Neural Networks
tems work and why decisions are made. Computational and Learning Systems
Haibo He, University of Rhode Island, USA
Intelligence methods have much to contribute to this effort to ensure that AI sys- IEEE Transactions on Fuzzy Systems
tems are ethical. Jon Garibaldi, University of Nottingham, UK
In July 2020, the European Union has published an Assessment List for Trustwor- IEEE Transactions on Evolutionary Computation
thy Artificial Intelligence (ALTAI)2 covering a wide range of ethical issues. Among the Carlos A. Coello Coello, CINVESTAV-IPN,
MEXICO
requirements listed, we find transparency, that encompasses the traceability of the data IEEE Transactions on Games
and model used to make a decision, the explainability of decisions and the communi- Julian Togelius, New York University, USA
cation with the user. Transparency mainly corresponds to the general approach of XAI IEEE Transactions on Cognitive and
Developmental Systems
and Computational Intelligence has much to do to participate in the effort of trans-
Yaochu Jin, University of Surrey, UK
parency of XAI, by means of the various methods of automatic generation of explana- IEEE Transactions on Emerging Topics in
tions, the use of counterfactuals, the help of visualization or the construction of fuzzy Computational Intelligence
systems providing interpretability and expressiveness, for example. Yew Soon Ong, Nanyang Technological University,
SINGAPORE
Another ALTAI requirement sounds familiar to us, dealing with diversity, non-dis-
IEEE Transactions on Artificial Intelligence
crimination and avoidance of unfair bias. Such concerns are at the heart of the IEEE Hussein Abbass, University of New South Wales,
Computational Intelligence’s actions, which is very committed to their respect in its AUSTRALIA
scientific and associative life. It is therefore natural to also consider these properties as
Administrative Committee
essential for all algorithms and products that are parts of AI systems, not only in the
collection of data and choice of attributes leading to a decision, but also in the ability Term ending in 2021:
David Fogel, Natural Selection, Inc., USA
of all categories of users to interact with the AI system regardless of their characteris- Barbara Hammer, Bielefeld University,
tics. Privacy and data protection is another aspect of the respect of human users. GERMANY
Other requirements focus more on the cognitive components of the construction Yonghong (Catherine) Huang,
of AI systems, the effect they can have on human users, the possible governance of McAfee LLC, USA
Xin Yao, Southern University of Science
human agents in AI systems and the integration of AI systems in the human society. and Technology, CHINA
The technical robustness of the AI system, its reliability, its security and its capacity to Jacek M. Zurada, University of Louisville, USA
identify and mitigate risks in a transparent way are not less important. Term ending in 2022:
Machine ethics and the notions of responsibility of AI systems or artificial moral Cesare Alippi, Politecnico di Milano, ITALY
agents3 are also topics of research that deserve a lot of attention. Consciousness in James C. Bezdek, USA
Gary Fogel, Natural Selection, Inc., USA
AI has been debated for a long time4 and it can take a form different from human Yaochu Jin, University of Surrey, UK
consciousness5. Multidisciplinarity, and in particular the cooperation with cognitive Alice E. Smith, Auburn University, USA
Term ending in 2023:
Piero P. Bonissone, Piero P Bonissone Analytics
1
https://www.cc.gatech.edu/~alanwags/DLAI2016/(Gunning)%20IJCAI-16%20DLAI%20WS.pdf LLC, USA
2
https://digital-strategy.ec.europa.eu/en/library/ethics-guidelines-trustworthy-ai Oscar Cordón, University of Granada, SPAIN
3
https://plato.stanford.edu/entries/ethics-ai/#MachEthi
4 Pauline Haddow, Norwegian University of Science
McCarthy, John (1996). Making robots conscious of their mental states. In S. Muggleton (ed.), _Machine Intelligence 15_.
Oxford University Press. and Technology, NORWAY
5
Pitrat, Jacques (2009). Artificial Beings : The Conscience of a Conscious Machine. Wiley, 2009 Haibo He, University of Rhode Island, USA
Hisao Ishibuchi, Southern University of Science
and Technology, CHINA
Date of current version: 15 July 2021 Digital Object Identifier 10.1109/MCI.2021.3056261
AUGUST 2021 | IEEE COMPUTATIONAL INTELLIGENCE MAGAZINE 3

IEEE Consortium On The Landscape
A special issue of the IEEE Computational Intelligence of AI Safety (CLAIS) and, on a more
Magazine dedicated to Explainable and Trustworthy specific level, it participates in the IEEE
Brain Community.
Artificial Intelligence will be published in 2022. I am convinced that the IEEE CIS
needs to pay even more attention to
these crucial issues. At a time when the
sciences and computational neurosci- Technical Committee task forces and IEEE CIS is strengthening its ties to
ence, is important in this regard. special issues of CIS publications are industry by creating a Committee for
The Computational Intelligence also focused on XAI. In particular, a Industrial and Governmental Activities
society (CIS) is at the heart of all these special issue of the IEEE Computation- starting in January 2022, the trust-
aspects of ethics in AI. They represent al Intelligence Magazine dedicated to worthiness of Artificial Intelligence
fascinating challenges for researchers Explainable and Trustworthy Artificial appears even more as a key factor in
and they can highlight the power of all Intelligence will be published in 2022. the realization and success of industrial
Computational Intelligence methods: The IEEE CIS Cognitive and Develop- products based on Computational
neural networks, learning methods, mental Systems Technical Committee Intelligence methods.
fuzzy systems and evolutionary compu- and the IEEE Transactions on Cogni- If you are interested in these topics
tation. Efforts are already done through tive and Developmental Systems also or any other activity of the IEEE CIS,
the IEEE CIS Task Force on Ethical address some of these concerns on a you can always contact me at b.bouchon
and Social Implications of Computa- cognitive basis. Moreover, the IEEE -meunier@ieee.org.
tional Intelligence. Several other CIS CIS is the oversight committee for the
TAP.
CONNECT.
NETWORK.
SHARE.
Connect to IEEE–no matter where you are–with the IEEE App.
Stay up-to-date Schedule, manage, or Get geo and interest-based
with the latest news join meetups virtually recommendations
Read and download Create a personalized Locate IEEE members by location,

your IEEE magazines experience interests, and affiliations

Haibo He, University of Rhode Island, USA Publication
Jon Garibaldi, University of Nottingham, UK Spotlight
Carlos A. Coello Coello, CINVESTAV-IPN, MEXICO
Julian Togelius, New York University, USA
Yaochu Jin, University of Surrey, UK
Yew Soon Ong, Nanyang Technological University, SINGAPORE
Hussein Abbass, University of New South Wales, AUSTRALIA
CIS Publication Spotlight
IEEE Transactions on Neural IEEE Transactions on

Networks and Learning Systems Fuzzy Systems
A Comprehensive Survey on Graph An Effective Multiresolution Hierarchical

Neural Networks, by Z. Wu, S. Pan, F. Granular Representation Based Classifier
Chen, G. Long, C. Zhang, and P. S. Using General Fuzzy Min-Max Neural
Yu, IEEE Transactions on Neural Net- Network, by T. T. Khuat, F. Chen,
works and Learning Systems, Vol. 32, ©SHUTTERSTOCK.COM/BEERENTOCHTER and B. Gabrys, IEEE Transactions on
No. 1, January 2021, pp. 4–24. Fuzzy Systems, Vol. 29, No. 2, Feb-
across various domains and summarize ruary 2021, pp. 427–441.
Digital Object Identifier: 10.1109/ the open-source codes, benchmark data
TNNLS.2020.2978386 sets, and model evaluation of GNNs. Digital Object Identifier: 10.1109/
“Deep learning has revolutionized Finally, we propose potential research TFUZZ.2019.2956917
many machine learning tasks in recent directions in this rapidly growing field.” “Motivated by the practical demands
years, ranging from image classification for simplification of data toward being
and video processing to speech recogni- A Survey of the Usages of Deep Learn- consistent with human thinking and
tion and natural language understanding. ing for Natural Language Processing, by problem-solving, as well as tolerance of
The data in these tasks are typically repre- D. W. Otter, J. R. Medina, and J. K. uncertainty, information granules are
sented in the Euclidean space. However, Kalita, IEEE Transactions on Neural becoming important entities in data pro-
there is an increasing number of applica- Networks and Learning Systems, cessing at different levels of data abstrac-
tions, where data are generated from non- Vol. 32, No. 2, February 2021, pp. tion. This article proposes a method to
Euclidean domains and are represented as 604–624. construct classifiers from multiresolution
graphs with complex relationships and hierarchical granular representations using
interdependency between objects. The Digital Object Identifier: 10.1109/ hyperbox fuzzy sets. The proposed
complexity of graph data has imposed sig- TNNLS.2020.2979670 approach forms a series of granular infer-
nificant challenges on the existing “Over the last several years, the field ences hierarchically through many levels
machine learning algorithms. Recently, of natural language processing has been of abstraction. An attractive characteristic
many studies on extending deep learning propelled forward by an explosion in the of our classifier is that it can maintain a
approaches for graph data have emerged. use of deep learning models. This article high accuracy in comparison to other
In this article, we provide a comprehen- provides a brief introduction to the field fuzzy min-max models at a low degree of
sive overview of graph neural networks and a quick overview of deep learning granularity based on reusing the knowl-
(GNNs) in data mining and machine architectures and methods. It then sifts edge learned from lower levels of abstrac-
learning fields.We propose a new taxono- through the plethora of recent studies tion. In addition, our approach can
my to divide the state-of-the-art GNNs and summarizes a large assortment of reduce the data size significantly as well as
into four categories, namely, recurrent relevant contributions. Analyzed research handle the uncertainty and incomplete-
GNNs, convolutional GNNs, graph auto- areas include several core linguistic pro- ness associated with data in real-world
encoders, and spatial-temporal GNNs. We cessing issues in addition to many appli- applications. The construction process of
further discuss the applications of GNNs cations of computational linguistics. A the classifier consists of two phases. The
discussion of the current state of the art first phase is to formulate the model at the
Digital Object Identifier 10.1109/MCI.2021.3084390 is then provided along with recommen- greatest level of granularity, while the later
Date of current version: 15 July 2021 dations for future research in the field.” stage aims to reduce the complexity of

the constructed model and deduce it from Digital Object Identifier: 10.1109/ “This paper proposes a framework
data at higher abstraction levels. Experi- TEVC.2020.3020046 for the procedural generation of level and
mental analyses conducted comprehen- “Multiobjective optimization is ruleset components of games via a surro-
sively on both synthetic and real datasets increasingly used in engineering to design gate model that assesses their quality and
indicated the efficiency of our method in new systems and to identify design trad- complementarity. The surrogate model
terms of training time and predictive per- eoffs. Yet, design problems often have combines level and ruleset elements as
formance in comparison to other types of objective functions and constraints that are input and gameplay outcomes as output,
fuzzy min-max neural networks and expensive and highly nonlinear. Combina- thus constructing a mapping between
common machine learning algorithms.” tions of these features lead to poor conver- three different facets of games. Using this
gence and diversity loss with common model as a surrogate for expensive game-
Discrete and Smoothed Resampling algorithms that have not been specifically play simulations, a search-based generator
Methods for Interval-Valued Fuzzy designed for constrained optimization. can adapt content toward a target game-
Numbers, by M. Romaniuk and O. Constrained benchmark problems exist, play outcome. Using a shooter game as
Hryniewicz, IEEE Transactions on but they do not necessarily represent the the target domain, this paper explores
Fuzzy Systems, Vol. 29, No. 3, challenges of engineering problems. In this how parameters of the players’ character
March 2021, pp. 599–611. article, a framework to design electro- classes can be mapped to both the level’s
mechanical actuators, called multiobjective representation and the gameplay out-
Digital Object Identifier: 10.1109/ design of actuators (MODAct), is present- comes of balance and match duration.
TFUZZ.2019.2957253 ed and 20 constrained multiobjective opti- The surrogate model is built on a deep
“In this article, we propose two new mization test problems are derived from learning architecture, trained on a large
resampling algorithms for the simulation the framework with a specific focus on corpus of randomly generated sets of lev-
of bootstrap-like samples of interval-val- constraints. The full source code is made els, classes, and simulations from game
ued fuzzy numbers (IVFNs). These available to ease its use. The effects of the playing agents. Results show that a
methods (namely, the d-method and the constraints are analyzed through their search-based generative approach can
s-method) reuse a primary sample (an impact on the Pareto front as well as on adapt character classes, levels, or both
initial set) of IVFNs to generate a sec- the convergence performance. A con- toward designer-specified targets. The
ondary sample, which also consists of straint landscape analysis approach is fol- model can thus act as a design assistant or
this type of fuzzy numbers, and simulta- lowed and extended with three new be integrated in a mixed-initiative tool.
neously utilize existing dependencies in metrics to characterize the search and Most importantly, the combination of
pairs of some characteristic points of objective spaces. The features of MODAct three game facets into the model allows it
IVFNs. During a corresponding resam- are compared to existing test suites to to identify the synergies between levels,
pling step, a non-parametric approach is highlight the differences. In addition, a rules, and gameplay and orchestrate the
used. Additionally, we apply a widely convergence analysis using NSGA-II, generation of the former two toward
used assumption about the Gaussian ker- NSGA-III, and C-TAEA on MODAct desired outcomes.”
nel densities. The proposed methods in and existing test suites suggests that the
some way resemble Efron’s bootstrap, design problems are indeed difficult due to IEEE Transactions on Cognitive
but, contrary to this classical approach, the constraints. In particular, the number and Developmental Systems
they generate “not exactly the same as of simultaneously violated constraints in
previous” IVFNs, so it leads to a greater newly generated solutions seems key in Exoskeleton Online Learning and
diversity of the obtained secondary sam- understanding the convergence challenges. Estimation of Human Walking
ple. We also numerically check the qual- Thus, MODAct offers an efficient frame- Intention Based on Dynamical
ity of the introduced methods using a work to analyze and handle constraints in Movement Primitives, by S. Qiu, W.
few more statistically oriented approach- future optimization algorithm design.” Guo, D. Caldwell, and F. Chen, IEEE
es together with four similarity measures Transactions on Cognitive and Develop-
and three types of IVFNs.” IEEE Transactions on Games mental Systems, Vol. 13, No. 1, March
2021, pp. 67–79.
IEEE Transactions on Evolutionary A Multifaceted Surrogate Model for
Computation Search-Based Procedural Content Gener- Digital Object Identifier: 10.1109/
ation, by D. Karavolos, A. Liapis, and TCDS.2020.2968845
Realistic Constrained Multiobjective G. Yannakakis, IEEE Transactions on “Human walking intention estimation
Optimization Benchmark Problems Games, Vol. 13, No. 1, March 2021, is a critical step for the active assistance
From Design, by C. Picard and pp. 11–22. control of lower limb exoskeleton, be
J. Schiffmann, IEEE Transactions on cause the purpose of active assistance
Evolutionary Computation, Vol. 25, No. 2, Digital Object Identifier: 10.1109/ control is human motion assistance rather
April 2021, pp. 234–246. TG.2019.2931044 than human motion tracking. Complying

with human walking intention is the is proposed in this article. The algorithm delay caused by sensor signal filtering.
basic requirement of human walking is based on the dynamical movement Hence, the proposed algorithm is suit-
assistance. Hence, the human walking primitives model which is used for able for the active walking assistance
intention must be estimated first to online learning and predicting human control of exoskeleton.”
ensure the exoskeleton will not impede joint trajectory which is substituted into
human motion. Actually, estimating the human dynamics model to estimate IEEE Transactions on Emerging
human walking intention is to estimate the human joint torque. The results of Topics in Computational Intelligence
human joint torque during walking. In human walking experiments demon-
order to estimate a smooth personalized strate that the proposed algorithm can Evaluating Evolving Structure in Stream-
human joint torque profile, an online not only predict a smooth human joint ing Data With Modified Dunn’s Indices,
learning and prediction algorithm of trajectory and joint torque profile in real
human joint trajectory and joint torque time but also compensate the phase (continued on page 94)
Call for Papers for Journal Special Issues

Special Issue on “Causal Discovery and Causality-Inspired Machine Learning”
Journal: IEEE Transactions on Neural Networks and Learning Systems
Guest Editors: Kun Zhang, Ilya Shpitser, Sara Magliacane, Davide Bacciu, Fei Wu, Changshui Zhang, and
Peter Spirtes
Submission Deadline: October 22, 2021
https://cis.ieee.org/images/files/Publications/TNNLS/special-issues/CFP_TNNLS_SI_Causal_Discovery_and_Caus
ality-Inspired_Machine_Learning.pdf
Special Issue on “Cyborg Intelligence: Human Enhancement with Fuzzy Sets”

Journal: IEEE Transactions on Fuzzy Systems
Guest Editors: Zhijun Li, Jian Huang, Hang Su, and Zhaojie Ju
Submission Deadline: September 30, 2021
https://cis.ieee.org/images/files/Publications/TFS/special-issues/Special_Issue_on_Cyborg_Intelligence.pdf
Special Issue on “Recent Advances in Fuzzy-based Intelligent IoT and Cyber-physical

Systems”
Journal: IEEE Transactions on Fuzzy Systems
Guest Editors: Mainak Adhikari, Varun G Menon, Jessie Park, and Danda Rawat
Submission Deadline: October 31, 2021
https://cis.ieee.org/images/files/Publications/TFS/special-issues/TFS_SI_-_Recent_Advances_in_Fuzzy-based_Intel
ligent_IoT_and_Cyber-physical_Systems.pdf
Special Issue on “Benchmarking Sampling-Based Optimization Heuristics: Methodology

and Software”
Journal: IEEE Transactions on Evolutionary Computation
Guest Editors: Thomas Bäck, Carola Doerr, Bernhard Sendhoff, and Thomas Stützle
Submission Deadline: August 31, 2021
https://cis.ieee.org/images/files/Documents/call-for-papers/tevc/cfp-bench-updated-may2021.pdf
Special Issue on “Computational Intelligence to Edge AI for Ubiquitous IoT Systems”

Journal: IEEE Transactions on Emerging Topics in Computational Intelligence
Guest Editors: Honghao Gao, Biao Luo, Ramón J. Durán Barroso, and Walayat Hussain
Submission Deadline: November 20, 2021
https://cis.ieee.org/images/files/Publications/TETCI/SI23_CFP_EAI-IoT.pdf
Special Issue on “Security and Privacy of Machine Learning”

Journal: IEEE Transactions on Artificial Intelligence
Guest Editors: Wenjian Luo, Yaochu Jin, and Catherine Huang
Submission Deadline: August 15, 2021
https://cis.ieee.org/publications/ieee-transactions-on-artificial-intelligence/special-issues

Guest Yanan Sun
Editorial Sichuan University, CHINA
Mengjie Zhang
Victoria University of Wellington,
NEW ZEALAND
Gary G. Yen
Oklahoma State University, USA
Evolutionary Neural Architecture Search and Applications
D eep neural networks (DNNs) have This special issue has brought together The second paper, titled “Fast and
shown significantly promising per- researchers to report state-of-the-art con- Unsupervised Neural Architecture Evolu-
formance in addressing real-world tributions on the latest research and devel- tion for Visual Representation Learning”
problems, such as image recognition, nat- opment, up-to-date issues, challenges, and by S. Xue et al., proposes the FaUNAE
ural language processing and self-driving. applications in the field of ENAS. Follow- algorithm to improve the unsupervised
The achievements of DNNs owe largely ing a rigorous peer review process, five visual representation learning ability
to their deep architectures. However, papers have been accepted for publication against the supervised peer competitors.
designing an optimal deep architecture for in this special issue. Specifically, FaUNAE employed the evo-
a particular problem requires rich domain The first paper included in the special lutionary algorithm to search for promis-
knowledge on both the investigated data issue is entitled “Evolutionary Multi- ing neural architectures from an existing
and the neural network domains, which is Objective Model Compression for Deep architecture designed by experts or exist-
not necessarily held by the end-users. Neural Networks” authored by Z.Wang et ing NAS algorithms that focus on the
Neural architecture search (NAS), as an al., which aims at accelerating the inference transferability from a small dataset to a
emerging technique to automatically speed of DNNs by optimizing the model larger dataset. To reduce the search cost
design the optimal deep architectures size and accuracy simultaneously with evo- and enhance the search efficiency, the
without requiring such expertise, is draw- lutionary algorithms.The architecture pop- prior knowledge and the inferior, as well
ing increasing attention from industry and ulation evolution was employed to explore as least promising operations, were used
academia. However, NAS is theoretically a and exploit the network space of pruning during the evolutionary process. In addi-
non-convex and non-differentiable opti- and quantization. In addition, a two-stage tion, the contrast-loss function was uti-
mization problem, and existing methods co-optimizing strategy of pruning and lized as the evaluation metric in a
are incapable of well addressing it. Evolu- quantization was proposed to significantly student-teacher framework to achieve the
tionary computation approaches, particu- reduce time cost during the architecture self-supervised evolution. FaUNAE was
larly genetic algorithms, particle swarm search process. To further lower energy evaluated on four widely used large-scale
optimization and genetic programming, consumption and reduce the model size, benchmark datasets. The results demon-
have shown superiority in addressing real- various dataflow designs and parameter strated the effectiveness of FaUNAE
world problems due largely to their pow- coding schemes were also considered in the upon various downstream applications,
erful abilities in searching for global optimization process. Unlike most related including object recognition, object
optima, dealing with non-convex/non- research solely focusing on reducing the detection, and instance segmentation.
differentiable problems, and requiring no model size while maintaining the model The third paper entitled “Self-Super-
rich domain knowledge. In this regard, accuracy, the work in this paper can achieve vised Representation Learning for Evolu-
deep neural architecture designed by evo- a trade-off between different model sizes tionary Neural Architecture Search” by
lutionary computation approaches, so and model accuracies, which meets the C. Wei et al. proposes a novel performance
called evolutionary neural architecture requirements for most edge devices in real- predictor, aiming to enhance the efficien-
search (ENAS), have attracted the interest world scenarios. The experimental results cy of ENAS algorithms. Specifically, the
of many researchers. demonstrated that the proposed algorithm ENAS algorithms are often computation-
can obtain a broad range of compact ally expensive in practice because hun-
Digital Object Identifier 10.1109/MCI.2021.3084391 DNNs for diverse memory usage and dreds of DNNs needed to be trained
Date of current version: 15 July 2021 energy consumption requirements. during the search process. Performance

preceptors are a kind of regression model results revealed that DNR can provide population degradation and slow conver-
that can directly predict the performance of superior performance compared to its gence. In addition, motivated by the
a neural network without any training and competitors, and DNR-SMS was an effi- recent advances of DNN models, the
is a hot research topic among the commu- cient tool for wind speed prediction. building blocks of DenseNet and ResNet
nity. However, the training of performance The fifth paper, “A Self-Adaptive were also designed as the search units of
predictors requires a large number of well- Mutation Neural Architecture Search SaMuNet. The performance of SaMuNet
trained neural networks, which is often Algorithm Based on Blocks” by Y. Xue et was compared with 17 peer competitors
scarce in practice. To address this issue, this al., proposes the algorithm named SaMu- on CIFAR10 and CIFAR100 bench-
paper first developed a new encoding strat- Net to tackle the problem of “loss of mark datasets. The results demonstrated
egy of architectures for calculating the experience” caused by ENAS algorithms that SaMuNet can outperform most of
graph edit distances among different archi- in their early search stage. Specifically, them in terms of both the classification
tectures. Also, two self-supervised learning most of the existing ENAS algorithms accuracy and the consumed computa-
methods were designed to improve the pre- mainly focused on investigating the search tional resource.
diction performance by learning the mean- space or evaluation strategy to enhance The guest editors of this special issue
ingful representations of neural architectures. their performance. However, the search would like to thank Prof. Chuan-Kang
This can help to enhance the quality of the strategy also plays an important role dur- Ting, the Editor-in-Chief of IEEE Com-
training data fed to the performance pre- ing the whole process. In addition, the putational Intelligence Magazine, for his
dictors. The experiments demonstrated information of individuals in different great support in initiating and developing
promising performance of the proposed generations are also crucial to the perfor- this special issue together. Many thanks to
performance predictor against the peer mance of the corresponding evolutionary all members of the editorial team for their
competitors. In addition, the proposed per- algorithm, which is ignored by most kind support during the editing process of
formance predictor is integrated into an existing ENAS algorithms. To address this special issue. Last but not least, we
ENAS algorithm for validation, and the both issues, SaMuNet incorporated a self- would also like to thank the authors for
results also showed its superiority of search- adaptive mutation component into the submitting their valuable research out-
ing for promising neural architectures. framework of the evolutionary algorithm comes as well as the reviewers who have
In the fourth paper, “Forecasting Wind to effectively search for the neural archi- critically evaluated the papers. We sincere-
Speed Time Series Via Dendritic Neural tectures. Furthermore, a semi-complete ly hope and expect that readers will find
Regression” by J. Ji et al., proposes a binary competition selection strategy was this special issue useful.
regressive version of the dendritic neuron also designed into SaMuNet to prevent
model (DNM), i.e., dendritic neural
regression (DNR), to forecast wind
power. Particularly, wind energy is one of
the fastest-growing green energy resourc-
es. Precise forecasting wind power is cru-
cial in planning the power system and
operating the wind farm. However, since
the wind speed time series is with chaotic
properties and high volatility, traditional
methods are incapable of producing satis-
factory forecasts. DNM is a plausible bio-
logical neural model and has the potential
to forecast wind power well. However,
DNM is originally designed for classifica-
tion problems.To this end, DNR is devel- Introducing IEEE Collabratec™
oped based on DNM to forecast the The premier networking and collaboration site
wind power that is a regression task. Like for technology professionals around the world.
other neural network-based models, the
performance of DNR is also voluntary to
IEEE Collabratec is a new, integrated online community
its architecture design. To address this
where IEEE members, researchers, authors, and
problem, the states of matter search (SMS)
algorithm is used to search for the prom- technology professionals with similar fields of interest
ising architecture of DNR without much can network and collaborate, as well as create and
manual effort. The experiments were manage content all in one place!
conducted on two benchmark datasets
with two different time intervals. The
Learn about IEEE Collabratec at
ieee-collabratec.ieee.org
©SHUTTERSTOCK.COM/DMITRIY PRAYZEL
Evolutionary Multi-Objective
Model Compression
for Deep Neural Networks
Compression (EMOMC), is proposed to optimize energy effi-
Zhehui Wang and Tao Luo ciency (or model size) and accuracy simultaneously. Specifically,
Agency for Science, Technology and Research (A*STAR), the network pruning and quantization space are explored and
SINGAPORE exploited by using architecture population evolution. Further-
Miqing Li, more, by taking advantage of the orthogonality between pruning
University of Birmingham, UK and quantization, a two-stage pruning and quantization co-opti-
mization strategy is developed, which considerably reduces time
Joey Tianyi Zhou, Rick Siow Mong Goh, cost of the architecture search. Lastly, different dataflow designs
and Liangli Zhen and parameter coding schemes are considered in the optimiza-
Agency for Science, Technology and Research (A*STAR), tion process since they have a significant impact on energy con-
SINGAPORE sumption and the model size. Owing to the cooperation of the
evolution between different architectures in the population, a set
Abstract—While deep neural networks (DNNs) deliver state-of- of compact DNNs that offer trade-offs on different objectives
the-art accuracy on various applications from face recognition to (e.g., accuracy, energy efficiency and model size) can be obtained
language translation, it comes at the cost of high computational in a single run. Unlike most existing approaches designed to
and space complexity, hindering their deployment on edge reduce the size of weight parameters with no significant loss of
devices. To enable efficient processing of DNNs in inference, a accuracy, the proposed method aims to achieve a trade-off
novel approach, called Evolutionary Multi-Objective Model between desirable objectives, for meeting different requirements
of various edge devices. Experimental results demonstrate that the
proposed approach can obtain a diverse population of compact
Date of current version: 15 July 2021 Corresponding author: Liangli Zhen (e-mail: zhenll@ihpc.a-star.edu.sg).
10 IEEE COMPUTATIONAL INTELLIGENCE MAGAZINE | AUGUST 2021 1556-603X/21©2021IEEE

DNNs that are suitable for a broad range of different memory evolutionary population. Owing to the cooperation and interplay
usage and energy consumption requirements. Under negligible of the evolution between different architectures in the population,
accuracy loss, EMOMC improves the energy efficiency and a set of compact DNNs that offer trade-offs on different objectives
model compression rate of VGG-16 on CIFAR-10 by a factor of (e.g., accuracy, energy efficiency, and model size) can be obtained in
more than 8.9 # and 2.4 #, respectively. a single run. Unlike most existing approaches which aim to reduce
the size of weight parameters or the energy consumption with no
D
I. Introduction significant loss of accuracy, the proposed approach attempts to
eep neural networks (DNNs) are artificial neural net- achieve a good balance between desired objectives, for meeting the
works with more than three layers (i.e., more than one requirements of different edge devices. Experimental results dem-
hidden layer), which progressively extract higher-level onstrate that the proposed approach can obtain a diverse popula-
features from the raw input in the learning process. tion of compact DNNs for customized requirements of accuracy,
They have delivered the state-of-the-art accuracy on various memory capacity, and energy consumption.
real-world problems, such as image classification, face recogni- The novelty and main contributions of this work can be
tion, and language translation [1]. The superior accuracy of summarized as follows:
DNNs, however, comes at the cost of high computational and ❏❏ The model compression problem is formulated as a multi-
space complexity. For example, the VGG-16 model [2] has about objective problem. The optimal solutions are searched in the
138 million parameters, which requires over 500 MB memory network pruning and quantization space using a popula-
for storage and 15.5G multiply-and-accumulates (MACs) to tion-based algorithm.
process an input image with 224 × 224 pixels. In myriad applica- ❏❏ To speed up the population evolution, a two-stage pruning/
tion scenarios, it is desirable to make the inference on edge quantization co-optimization strategy is developed based on
devices rather than on cloud, for reducing the latency and the orthogonality between pruning and quantization.
dependency on connectivity and improving privacy and security. ❏❏ The trade-offs between accuracy, energy efficiency, and
Many of the edge devices that draw the DNNs inference have model size in model compression are explored by consider-
stringent limitations on energy consumption, memory capacity, ing different dataflow designs and parameter coding schemes.
etc. The large-scale DNNs [3], [4] are usually difficult to be The experimental results demonstrate that the proposed
deployed on edge devices, thus hindering their wide application. method can obtain a set of diverse Pareto optimal solutions in
Efficient processing of DNNs for inference has become increas- a single run. Also, it achieves a considerably higher energy
ingly important for the deployment on edge devices. For generat- efficiency than current state-of-the-art methods.
ing efficient DNNs, many neural architecture search (NAS)
approaches have been developed in recent years [5]–[7]. One way II. Preliminaries
of carrying out NAS is to search from scratch [8], [9]. In contrast, Network pruning and quantization are two commonly used
model compression1 [10] searches for the optimal networks starting model compression techniques to improve the energy efficiency
from a well-trained network. For instance, to reduce the storage in model inference and/or to shrink the size of the model.
requirement of DNNs, Han et al. proposed a three-stage pipeline Moreover, the dataflow design employed by edge devices and the
(i.e., pruning, trained quantization, and Huffman coding) to com- coding scheme applied to store the weight matrix both have a
press redundant weights [10]. Wang et al. suggested removing significant impact on the performance of model compression.
redundant convolution filters to reduce the model size [11]. Rather
than reducing the model size, a few attempts [12], [13] are conduct- A. Network Pruning and Quantization
ed to compress DNNs directly by taking the energy consumption For making the training easy, the networks are usually over-
as the feedback signals. They have achieved promising results in parameterized [14]. Pruning is a widely-used model compression
reducing the size of weight parameters (or energy consumption). technique that can effectively reduce the energy consumption of
However, these approaches require the model to achieve approxi- edge devices and shrink the model size [10]. Network pruning
mately no loss of accuracy, rendering the solution less flexible. removes some of the redundant parameters in the network by
In practice, different users often have distinct preferences on setting their values as zeros. A well-trained neural network usually
desirable objectives, e.g., accuracy, model size, energy efficiency, and contains a large number of weights whose values are relatively
latency, when they select the optimal DNN model for their appli- small compared to other parameters. In most cases, these parame-
cations. In this paper, a novel approach, called Evolutionary Multi- ters are not particularly important when performing model infer-
Objective Model Compression (EMOMC), is proposed to ence. Hence, one can sort all the parameters in the model and
optimize energy efficiency/model size and accuracy simultaneous- replace those parameters with the least absolute values by zeros,
ly. By considering network pruning and quantization, the model while the accuracy of the model can still be maintained. For
compression is formulated as a multi-objective problem under dif- instance, the pruning amount to be 33%, then one-third of the
ferent dataflow designs and parameter coding schemes. Each can- parameters in the model will be replaced by zeros. In the infer-
didate architecture can be regarded as an individual in the ence process, if the processing elements (PEs)2 whose input
1
The technique aims to shrink the size of the neural network model without a signifi-
2
cant drop of accuracy. The PE is a basic unit to conduct computation in processors.

typical convolutional layer. It contains six
Dataflow design is one of the most important features loops, each of which corresponds to one
in hardware accelerators, which allows the system to dimension either in the weight filter or in the
feature map. More specifically, C O and C I are
reuse the data among different processing elements. the numbers of output and input channels, X
and Y are the width and height of the feature
weight parameters are zeros, the computation process can be map, and FX and FY are the width and height of the weight
skipped in those PEs, thus reducing energy consumption. filter. In each iteration of the innermost loop, a basic arithmetic
Quantization is another critical model compression tech- operation called multiply-accumulate (MAC) is performed. In
nique that is used to accelerate DNNs and reduce model size one convolutional layer, there are C O ·C I ·X·Y·FX ·FY MAC
[10]. It involves mapping data to a small set of quantization lev- operations in total.
els and aims at minimizing the error between the reconstructed In typical hardware accelerators,3 there are a set of process-
data from the quantization levels and the original data. The ing elements. Each PE can execute one MAC operation inde-
quantization level reflects the precision and ultimately the pendently. How to map the MAC operation into each PE and
number of bits representing a parameter. After the quantization, how the data flow among those PEs become key consider-
the low precision parameters may still store enough informa- ations in the design of hardware accelerators. Theoretically,
tion for model inference, and the accuracy of the model can be there are many mapping methods, resulting in different data-
maintained. In practical implementations, if the weights are flow designs. For example, suppose that the device has an array
quantized, one can use multipliers with simpler structures, thus of PEs, one can unroll any one of the loops in Algorithm 1
reducing energy consumption. For instance, a high precision and map each iteration in the unrolled loop into each PE of
parameter with 32-bit float point (32FP) data type requires the array. Similarly, if the device has a matrix of PEs, one can
23 bit × 23 bit multipliers. Such type of multiplier contains unroll any two loops in Algorithm 1 and map the MAC oper-
506 adders in total. If quantizing the activation from 32FP to ations into each PE in the matrix. Thus, with six loops as
16-bit float point (16FP) and quantizing the weights from shown in Algorithm 1, there are C 26 = 15 possible dataflow
32FP to 8-bit integer (8INT), only 10 bit × 8 bit multipliers designs in total. To simplify the problem, in this work, only four
are required, each of which contains only 72 adders in total. popular dataflow designs are evaluated, as shown in Table I.
The fewer adders in multipliers, the lower energy consumption These dataflow designs are named as A : B, where A and B
for computation. stand for the names of the two unrolled loops.
Pruning and quantization can reduce not only the energy Figure 1 shows the schematic diagram of those four popu-
consumption on the computation process but also the energy lar dataflow designs, where only four PEs are involved in each
consumption on the data movement, which is roughly propor- example. Each PE contains one multiplier and one adder,
tional to the total amount of data transmitted from the memo- which can execute one MAC operation each time. The PE also
ry module in terms of bits [15]. For instance, if pruning 80% of contains register files, which can temporarily store input or
the parameters in the model and quantizing all the parameters output data. In X:Y, the MAC operation results are stored in
from 16 bits to 8 bits, then about 90% of the energy consump- registers at output ports of PEs. At each iteration, the last MAC
tion on data movement can be reduced. operation result is read from registers. In C I :C O, at each itera-
tion, the input feature map is reused C O times, and C I MAC
B. Dataflow Design in Hardware Accelerators operation results are summed up. In FX :FY , FX ·FY weights are
The dataflow design decides how data is reused among differ- stored in registers at input ports of PEs. At each iteration,
ent PEs. Since a large portion of the energy consumption of FX ·FY MAC operation results are summed up. In X :FX , FX
hardware accelerators is on the data movement, the dataflow weights are stored in registers at input ports of PEs. At each
design needs to be considered when optimizing the energy iteration, the weights are reused X times, and FX MAC opera-
efficiency. Algorithm 1 shows the computation procedure of a tion results are summed up.
Algorithm 1 Computation of a typical convolutional layer. TABLE I Popular dataflow designs

that are applied in literature.
for co in range(CO) do
DATAFLOW APPLIED BY DATAFLOW APPLIED BY
for ci in range(CI) do
X :Y [16] [17] CI : CO [18] [19]
for x in range(X ) do
FX : FY [20] X : FX [21] [22]
for y in range(Y ) do
for fx from -(FX -1)/2 to (FX -1)/2 do
for fy from -(FY -1)/2 to (FY -1)/2 do
3
O[co][x][y] The devices that are specialized to execute a certain task, such as the graphics pro-
cessing unit (GPU), field-programmable gate array(FPGA), and application-specific
+ = I[ci][x + fx][y + fy] # W [co][ci][fx][fy] integrated circuit (ASIC).

C. Coding Scheme for the
Parameters I0 I1 I0
After pruning, the filter in the model
becomes a sparse matrix, which means
W0 O0 W1 O1 W0 W10
that it contains plenty of zero elements.
To store the sparse matrix into the
I2 I3
memory, many coding schemes are
developed, and the choice of the cod-
ing scheme mainly depends on the W2 O2 W3 O3 W1 W11
O0 O1
characteristics of the matrix. In this
work, three coding schemes are consid- Dataflow - X :Y Dataflow - C l :C o
ered. The first one is the normal coding
scheme, where it stores those zero ele- I0 I1 I0 I1
ments in the same way as those non-
zero elements. In other words, it keeps W0 W1 W0
the space for all the zero elements in
the matrix. Therefore, the storage size of I2 I3 I1 I2
the normal coding scheme is N $ q i,
where N is the total number of weight W2 W3 W1
elements in the matrix and q i is the O0 O1
quantization depth of the weight. O
Dataflow - Fx:FY Dataflow - X:F x
In some cases, it is a waste of the
memory capacity to store all these zero Processing Element Register File Data Direction
weights. For saving the memory space,
various new coding schemes are pro-
FIGURE 1 Examples of the four popular dataflow designs [15]. For simplicity, only four PEs are
posed to shrink the size of the sparse shown in each example. W k /W kk, Ik, and Ok are the elements in the weight, the input feature
matrix. One common coding scheme is map, and the output feature map, respectively.
called the Coordinate (COO) coding.
In the COO coding scheme, the non-zero elements are stored, model size, number of floating-point operations per second
along with the row index and the column index, and zero ele- (FLOPs), latency, and energy efficiency. The initial intention of
ments are ignored. Another popular coding scheme is called the model compression is to alleviate the on-chip storage limit for
compressed sparse row (CSR) coding. In the CSR coding complicated CNN models [10]. Since then, many approaches
scheme, only the values of the non-zero elements are stored have been proposed to shrink the model size of CNNs [23],
along with the column index and the row offset. To further [24]. There are two major branches in this area. The first branch
save the memory space, one version of the CSR coding focuses on the computation cost, and they target the number of
scheme only stores the relative distance between two non-zero FLOPs [25]. For example, Li et al. [26] proposed to prune
elements, which is shown in Figure 2. For this example, the whole filters from CNNs, avoiding sparse connectivity patterns
weight matrix has eight elements, and three of them are non- and reducing the computational cost significantly. Lemaire et al.
zero elements. Only the values of these three elements and proposed a budgeted regularized pruning framework for deep
their relative positions in the array are stored. The second non- CNNs [25], which makes the compressed model less computa-
zero element is three slots away from the first non-zero ele- tion-intensive. The second branch targets the inference speed
ment, and the integer 2 is recorded as a relative row index for [27], [28]. For instance, He et al. leveraged reinforcement
the second element. It is assumed that each non-zero element
requires three bits to store the relative row index. If one non-
zero element is far away from the previous element, zero-pad- Distance = 2 Distance = 3
ding elements are inserted into the array to avoid overflow.
Therefore, the storage size of the CSR coding scheme with NA 2 3
relative positions is n· (p i + 3), where n is the number of non-
zero elements and q i is the quantization depth of the weight.
XXX 0 0 XXX 0 0 0 XXX
III. Related Work
W0 W1 W2 W3 W4 W5 W6 W7
A. Model Compression
FIGURE 2 The CSR coding scheme with relative positions, which
Model compression aims to compress and accelerate DNN stores the values of non-zero elements and their relative positions in
models. Different approaches target different objectives, such as the array.

In the literature, many approaches have
Evolutionary multi-objective optimization has been been developed to solve MOPs since the
widely used to search for the optimal solutions, 1950s [31]. Among them, evolutionary algo-
rithms (EAs) stand out thanks to the nature of
in the presence of trade-offs between multiple population-based search that aims to approxi-
conflicting objectives. mate the whole Pareto front in a single execu-
tion. Also, EAs are typically exempt from the
characteristics of the PF than conventional math-
learning to provide the model compression policy, which can ematical programming techniques [31]. They can handle the
accelerate the inference on mobiles considerably. MOPs with discontinuous and non-convex PFs well.
Recently, edge devices have become increasingly popular Since the seminal work, called Vector Evaluated Genetic
for AI applications. However, considering the large amount of Algorithm (VEGA) [32], was proposed by Schaffer in 1985, a
energy consumed for the model inference, the deployment of large number of multi-objective evolutionary algorithms
CNN on edge devices becomes challenging. To solve this (MOEAs) have been developed and adopted in various appli-
problem, some scholars proposed model compression cations. In MOEAs, the selection strategy of individuals in the
approaches to reduce the energy consumption directly, using population plays a key role in the evolutionary process. Since
quantization [13] and/or pruning [12] techniques. In [12], an the optimal solutions are those non-dominated to each other
energy-aware network pruning approach is proposed to reduce in the whole search space, Pareto dominance naturally becomes
the overall energy across all layers by 3.7 # for AlexNet [29] a viable criterion for selecting promising solutions during the
and 1.6 # for GoogLeNet [30]. evolutionary process. The Pareto dominance criterion, however,
From the above, it can be seen that model compression is may fail to provide sufficient selection pressure, making the
essentially a multi-objective optimization problem, with several algorithm hard to converge. This situation can be usually
objectives to be considered, including accuracy, energy con- encountered when the objective space is enormous, e.g., in
sumption, model size, etc. Previous studies rarely deal with many-objective optimization problems [33]–[35]. To push the
multiple objectives at the same time. A common way adopted population towards the PF, Goldberg proposed a mechanism
in literature is to optimize only one of the objectives while set- called Pareto ranking [36] for the selection in MOEAs. A niche
ting the remaining ones to be hard constraints. In this work, the method is then used in the Nondominated Sorting Genetic
evolutionary multi-objective optimization technique is applied Algorithm (NSGA) [37] to maintain stable sub-populations.
to tackle these objectives simultaneously. Later on, in its new version Nondominated Sorting Genetic
Algorithm-II (NSGA-II) [38], a crowding degree comparison
B. Evolutionary Multi-Objective Optimization operator is adopted to make the ranking scheme more effective
In the real-world systems, there exist plenty of problems having and efficient. NSGA-II is widely used to solve MOPs, despite
two or more (often conflicting) objectives which one needs to its limitations in handling the MOPs with more than three
consider simultaneously. Such problems are called the multi- objectives [39]. Recently, many MOEAs tend to consider other
objective optimization problems (MOPs). Without loss of gen- selection strategies since they may converge fast towards the PF,
erality, a multi-objective optimization problem (MOP) can be such as indicator-based MOEAs, decomposition-based
formulated as the following minimization problem: MOEAs, and bi-goal criterion MOEAs [33].
Recently, there have been a few attempts to exploit
min F (x) = (f1 (x), f2 (x), f, fM (x)) T
x MOEAs to search for efficient neural architectures. For
s.t. g j (x) # 0, j ! {1, 2, f, J}, instance, Lu et al. proposed a method, called NSGA-Net [40],

h k (x) = 0, k ! {1, 2, f, K}, which formulates the neural architecture search as a multi-
x ! X, (1) objective problem and uses the NSGA-II algorithm to solve it.
where J denotes the number of inequality constraints, K is the NSGA-Net considers two objectives: the classification error
number of equality constraints, X 3 R n is the decision space, and the computation cost (measured by the number of MACs).
x = (x 1, x 2, f, x n) T is a candidate solution, and F : X " R M It has achieved promising results compared with other neural
consists of M (conflicting) objective functions. architecture search methods, e.g., DARTS [5] and ENAS [41],
Let a and b be two feasible solutions for an MOP defined in on the CIFAR-10 dataset [42].
Equ at ion (1), one can say that a dominates b if This work studies how the evolutionary multi-objective
6u fu (a) # fu (b) and 7v fv (a) 1 fv (b), where u, v ! {1, 2, f, M}. (EMO) method can be used in model compression, given its
A solution is Pareto optimal if it is not dominated by any other multi-objective nature.
solutions. Due to the conflict of the objectives in MOPs, there
are a set of Pareto optimal solutions, which represent the best IV. Our Proposed Method
possible trade-offs among different objectives. The optimal solu- In real-world applications, users usually have different preferenc-
tion set in the decision space is called the Pareto set (PS), and its es on the prediction model’s objectives, including accuracy,
mapping in the objective space is called the Pareto front (PF). energy efficiency, model size, etc. In this section, the evolutionary

multi-objective model compression method is presented. The the pruning amount p, the quantization depth q and the coding
model compression problem is formulated as a multi-objective scheme c is defined as
problem (MOP), which has several objective functions and con-
Model Size = f3 (p, q, c) .(4)
straints [43]. Then, an evolutionary algorithm is adopted to solve
the MOP. The goal of the optimization is to find a set of Pareto There are L + 3 variables, and L denotes the number of lay-
optimal solutions that represents various trade-offs on the ers in the original model. The value of the variable p is a real
desired objectives, thus enabling the deployment of the AI mod- number that indicates the pruning amount in all the layers of
els on edge devices with different resource constraints. the model. The value of the variable q i is an integer that
reflects the quantization depth in the i-th layer of the model.
A. Problem Formulation The constraints on these variables are as follows:
This work aims to compress a well-trained model to achieve
high accuracy, low energy consumption, and low model size. p l # p # p u,
By providing different pruning amount, p, and quantization q l # q i # q u,

depth, q, the compressed model should result in different accu- d ! {d 1, d 2, d 3, d 4},
racy, energy consumption, and model size. The goal of the opti- c ! {c 1, c 2, c 3}, (5)
mization is to reduce the energy consumption or model size
while at the same time making the accuracy of the model as where p l and p u are the upper and lower bounds of the prun-
high as possible. The relationship between the accuracy, the ing amount, q l and q u are the upper and lower bounds of the
pr uning amount, p, and the quantization depth quantization depth, d 1, d 2, d 3 , and d 4 correspond to the four
q = [q 1, q 2, f, q L] is denoted as dataflow designs of X: Y, C I : C O, FX : FY and X: FX , and c 1, c 2
and c 3 indicate three parameter coding schemes of the normal
Accuracy = f1 (p, q), (2)
coding, COO and CSR, respectively. In this work, the pruning
where f1 (·) represents the accuracy score of the model amount is assumed to be from 0% to 100%, and the quantiza-
obtained by pruning p of the weight parameters in each layer tion depth of each layer ranges from 1 bit to 23 bits.
of the original model, then quantizing the parameters in the Two bi-objective optimization problems are studied. In the
i-th layer with the depth of q i bits, and L denotes the number first problem, it explores possible combinations of pruning
of layers in the original model. amount and quantization depth, and aims to maximize the
The energy consumption of the inference is constrained by model accuracy f1 and minimize the energy consumption f2,
the battery’s capacitance of edge devices. Exceeding the ener- assuming the dataflow design to be d. Mathematically, the
gy budget of the edge device will greatly limit the implemen- bi-objective problem can be formulated as following:
tation of AI applications. From the perspective of users, it is
p l # p # p u,
' s.t. * q l # q i # q u,
usually acceptable to trade a bit of loss of accuracy for a large max f1 (p, q),
(6)
amount of reduction on energy consumption, especially for min f2 (p, q,d ),
d ! {d 1, d 2, d 3, d 4} .
edge devices. For a trained model, the energy consumption in
inference is also related to the exact dataflow design d applied The second bi-objective problem considers to maximize
on the edge devices. The relationship among the pruning the accuracy f1 and minimize the model size f3 simultaneous-
amount p, the quantization depth q, and the dataflow design d ly, assuming the coding scheme to be c, namely, the following
is denoted as follows: problem:
p l # p # p u,
s.t. * q l # q i # q u, (7)
Energy = f2 (p, q, d ) .(3)
'
max f1 (p, q),

The model size is constrained by the capacities of on-chip min f3 (p, q, c),
c ! {c 1, c 2, c 3} .
memory modules in edge devices. If the model size exceeds
the limitation, the model inference procedure requires to load Note that this work formulates two bi-objective optimiza-
and save weights/features maps through the off-chip memory. tion problems rather than a three-objective optimization prob-
Given the fact that off-chip memory access consumes much lem. There are two reasons. Firstly, if one optimizes the energy
larger energy consumption than the on-chip memory access consumption and the model size simultaneously (i.e., different
[10], the energy consumption of the inference process increas- dataflow designs and different coding schemes will be consid-
es tremendously. Furthermore, the app stores are sensitive to ered at the same time), the decision space will be increased
the size of the binary files, e.g., App Store has the restriction considerably, making the optimization much harder and con-
“apps above 100 MB will not download until you connect to suming more computation resource. Secondly, as the evalua-
Wi-Fi” [10]. Hence, it is important to shrink the size of the tion of each individual has a high computational cost, the
model and to make sure that the entire model can be fit into population size cannot be a large number. Typically, the pop-
the memory constraint of the edge devices. For a given model, ulation size is set to be smaller than 100. A three-objective
the model size highly depends on the coding scheme c applied space will lead to the solution set to be much more sparse
to store the weights. The relationship between the model size, than a bi-objective space.

B. Multi-Objective Optimization and Speedup A challenge in the multi-step pruning process is that it usu-
Instead of pruning the model directly in one step, a more effec- ally has high computational complexity. Specifically, each step
tive approach employed is to prune the model in multiple steps. requires fine-tuning the model by one or several epochs. If one
If pruning the model in one step, the accuracy will decrease attempts to find the optimal pruning amount and quantization
apparently, and it will be too difficult to restore the model [44]. depth for a model, the multi-step pruning process will consid-
Figure 3 demonstrates the comparison between the multi-step erably delay the optimization progress. To obtain the accuracy
pruning method and the single-step pruning method. The of the compressed model at a given pruning amount and quan-
model compression of a well-trained VGG-16 model is tested tization depth, the model needs to be compressed first, which
on the CIFAR-10 dataset [15], [42]. For the multi-step prun- usually includes many training epochs. Due to the large search
ing, it gradually increases the pruning amount from 0 to 95% space, it is almost impossible to pre-store all the compressed
in 32 steps. In each step, the model is pruned partially and re- models under any combinations of pruning amount and quan-
trained by one epoch. In the single-step pruning, the model is tization depth. For example, the parameters in each layer of the
pruned by 95% immediately, and then re-trained by 32 epochs. model can be quantized from 23 bits to 1 bit. The pruning
As shown in Figure 3, it can be seen that the multi-step prun- amount in each layer can range from 0 to 100%. In general, an
ing method outperforms the single-step pruning method in L-layer model can have 100 # 23 L possible combinations of
terms of accuracy with a large margin. pruning amount and quantization depth, assuming 1% pruning
amount granularity.
The EMO technique is adopted to solve this problem.
1 1
However, since an evolutionary algorithm is essentially a sto-
Non-Zero Weights (%)
chastic search, it may need thousands of trials (candidate solu-

0.75 0.75
tions) to find a high-quality solution. Once a new solution
Accuracy
(architecture) is produced, it takes a substantial amount of time

0.5 0.5
to perform the training for the evaluation. Consequently, it
0.25 0.25
may make the EMO-based search impossible.
To address this issue, by taking advantage of the orthogonal-
0 0
ity between pruning and quantization [45], a two-stage prun-
0 4 8 12 16 20 24 28 32 ing and quantization co-optimization method is proposed,
Time (Epoch) which can effectively reduce the computational cost. Specifical-
Multi-Step Pruning ly, the optimization process is divided into two stages. In the
Single-Step Pruning first stage, it prunes the model by multiple independent loops.
In each loop, it starts from a well-trained model, prunes the
FIGURE 3 The comparison between multi-step pruning and single-step model with a different pruning amount, fine-tunes the model,
pruning, tested on CIFAR-10 using VGG-16 (figure adopted from [15]). and saves the pruned model into a library. The set of pruning
amounts cover all the possible pruning
amounts which can be referenced by
the multi-objective solver. This is to
Pruning Amount p Pruning Amount p Quantizing Depth q guarantee that no pruning process is
required in the second stage. In the sec-
ond stage, the multi-objective solver
Prune Model Pre-Pruned Models Load Quantize Model starts to explore the design space and
Based on p Library Based on q tries to find the optimal combinations
Save of pruning amount and quantization
depth. During this process, the solver
Pre-Pruned Models
Energy Estimator Model Inference needs to know the accuracy, energy
Library
consumption, and model size under a
given combination of pruning amount
Energy and quantization depth. At this step, one
p=p+ρ Accuracy
Consumption just needs to load the corresponding
pruned model from the library and
quantize it.
(a) (b) Figure 4 shows an overview of the
proposed approach. Instead of pruning
FIGURE 4 The process of the proposed two-stage pruning and quantization co-optimization and quantizing the models at the same
method. (a) Stage-I: prune the model and save the pre-pruned models to the library; (b) Stage-
II: load the pre-pruned model from the library, quantize the parameter, and calculate the accuracy time, these two actions are taken into
as well as the energy consumption/the model size. two different stages. In the first stage, it

only prunes the model. Specifically, assuming t
as granularity, it prunes the model by 100/t Pruning and quantization are two popular techniques
times. At the i-th time step, it starts from the in DNN model compression, which show substantial
well-trained model and prunes the model
gradually using the multi-step pruning method,
improvement in energy efficiency.
until the target pruning amount reaches
i $ t. After that, it saves the compressed model into a pre-pruned first place on the image localization and the second place on
models library. In the second stage, it loads one of the pre- the image classification task in the ImageNet Large-Scale Visual
pruned models from the library based on the required pruning Recognition Challenge (ILSVRC) in 2014. LeNet-5 is a sim-
amount p, and then quantizes the parameters on the pre-pruned ple network for handwritten and machine-printed character
model based on the required quantization depth q. Since prun- recognition. It consists of only two sets of convolutional and
ing and quantization are two orthogonal operations, the final average pooling layers, followed by a flattening convolutional
compressed model will be equivalent to the compressed model layer, two fully-connected layers, and a Softmax classifier.
that is pruned and quantized at the same time. Lastly, it obtains MobileNet and VGG-16 are tested for color image classifica-
the accuracy by performing the model inference and read the tion on the CIFAR-10 dataset [42], and LeNet-5 is applied to
energy consumption from an energy estimator. recognize handwriting digits in the MNIST dataset [47].
The proposed approach can efficiently speed up the optimiza-
tion process.To obtain the accuracy and energy consumption under A. Experimental Setting
a given pruning amount and quantization depth, it does not need The NSGA-II algorithm in the python-based tool Pymoo [43]
to fine-tune the model anymore. Before the optimization process, it is used to solve the formulated multi-objective problem. The
completes the procedures in stage-I and saves only 100 compressed neural network is implemented in PyTorch4. During the net-
models into the library, assuming 1% granularity. The number of work training, the initial learning rate is set to be 0.01, and it
saved models is much less than 100 # 23 L , i.e., the number of pos- decays by half every 30 epochs. The batch size is set to be 256.
sible compressed models in the whole exploration space. For each During the multi-objective optimization process, the popula-
combination of pruning amount and quantization depth, the time tion size is set to be 40, and it runs 250 generations in each
cost of evaluating the individual is roughly equal to the inference execution. The multi-objective optimization and network
time cost of the model. training are performed on an NVIDIA Titan Xp graphics
processing unit (GPU) card. Four dataflow designs are consid-
V. Experimental Results and Analysis ered as they are the most commonly used dataflow designs
The proposed method is evaluated on three baseline CNN X : Y, C I : C O, FX : FY , and X : FX . The resource requirement is
models: MobileNet [46],VGG-16 [2] and LeNet-5 [47], which calculated based on the Xilinx Virtex UltraScale FPGA and the
have different characteristics. MobileNet is a neural network energy consumption from the Xilinx XPE toolkit [48]. In the
specially designed for mobile and embedded vision applica-
tions. VGG is a typical deep neural network, which was in the 4
PyTorch Open Source Toolkit at https://github.com/pytorch/pytorch.
X :Y (3.4) CI :CO (3.2) X :Y (23.1) CI :CO (21.2) X :Y (31.2) CI :CO (33.1)

FX :FY (4.6) X:FX (4.7) FX :FY (24) X:FX (22.8) FX :FY (31.5) X:FX (28.1)
1 2.5 6
Energy Consumption (mJ)
0.8 2 5
0.6 1.5 4
0.4 1 3
0.2 0.5 2
0 0 1
0.7 0.75 0.8 0.85 0.9 0.75 0.8 0.85 0.9 0.95 0.8 0.85 0.9 0.95 1
Accuracy Accuracy Accuracy
MobileNet VGG-16 LeNet-5
FIGURE 5 The solution sets obtained from the bi-objective optimization of accuracy and energy consumption on CIFAR-10 (MobileNet and VGG-
16) and MNIST (LeNet-5). The four different dataflow designs are marked with different colors. In the legends, the quoted number after the data-
flow design indicates its energy consumption (mJ) on the original model before the model compression.

implementation, the multipliers and adders are implemented under the four dataflow designs. For example, under the
on LUTs (lookup tables). An M # N multiplier requires dataflow design of X : Y, the accuracy scores of MobileNet
M/2 # (N + 1) LUTs [49]. To save the memory space, there is range from around 75% to 90%, and the energy consump-
no need to keep the feature map in local memory after the tion from around 0.2 mJ to 0.58 mJ. It offers the right
computation of each layer. Hence, the size of the local memory trade-offs between the two objectives for meeting the con-
modules must support the weights in all layers and the tempo- straints of various edge devices.
rary feature maps. The pruning and quantization approaches are ❏❏ From the perspective of energy consumption, if searching
described in [10]. For pruning, the , 1 -norm based unstructured solutions from the one with the highest energy consumption
pruning method is adopted and a mask is added to filter out to the one with the lowest energy consumption, the loss on
the pruned weight. For quantization, the linear (uniform) accuracy is negligible at the first few points. For instance,
quantization method is adopted and a scaling factor is used to under the dataflow design of X : Y, the energy consumption
lower the precision of the weights. of VGG-16 decreases from around 2.3 mJ to 0.5 mJ with an
accuracy drop less than 2%. However, after a certain thresh-
B. Bi-Objective Optimization of old, the accuracy loss becomes extremely large. By consider-
Accuracy and Energy Consumption ing the model’s accuracy, if searching for solutions from the
Due to the proposed two-stage pruning and quantization co- one with the highest accuracy to the one with the lowest
optimization method, one can complete the model compres- accuracy, the reduction of energy consumption is remarkable
sion and obtain the solution set efficiently. The entire at the first few points. However, after a certain threshold, the
optimization process includes two stages. The first stage is for energy consumption becomes relatively stable.
the pre-processing, which takes around 24 hours. The second ❏❏ Different models prefer different dataflow designs. Specifi-
stage is for the multi-objective optimization. The solver can cally, C I : C O achieves the highest energy efficiency among
generate optimal solutions within one hour by using a single the four dataflow designs for MobileNet. However, it is
NVIDIA Titan Xp graphics processing unit (GPU) card. inferior to other dataflow designs for VGG-16. The reason is
Figure 5 shows the solution sets obtained from the bi-objec- that the convolution layers of different models have different
tive optimization of accuracy and energy consumption, shapes. In addition to energy consumption, the latency and
under the four different dataflow designs. Each point in the cost of edge devices also depend on dataflow designs. The
figure corresponds to one compressed model in the solution selection of dataflow designs involves many factors, which
set obtained by the bi-objective optimization. From the makes it very difficult in practice. This work explores the
results, one can see that: optimization results on the four popular dataflow designs.
❏❏ The points marked in different colors cover a large range of
accuracy scores and energy consumption, which means that C. Bi-Objective Optimization of Accuracy and Model Size
EMOMC obtains a solution set with a high diversity for Figure 6 demonstrates the solution sets obtained from the bi-
the model compression of the three baseline CNN models, objective optimization of accuracy and model size, under three
COO CSR-Relative COO CSR-Relative COO CSR-Relative

Normal (12.5) Normal (59.5) Normal (240)
7.5 25 125
Model Size (MBytes)
Model Size (MBytes)
Model Size (MBytes)
6 20 100
4.5 15 75
3 10 50
1.5 5 25
0 0 0
0.7 0.75 0.8 0.85 0.9 0.75 0.8 0.85 0.9 0.95 0.8 0.85 0.9 0.95 1
FIGURE 6 The solution sets obtained from the bi-objective optimization of accuracy and model size on CIFAR-10 (MobileNet and VGG-16) and
MNIST (LeNet-5). The three different coding schemes are marked with different colors. In the legends, the quoted number after normal coding
scheme indicates the size of the original model before the model compression.

10 125 10
8 100 8
Normalized Score
Normalized Score
Normalized Score
6 75 6
4 50 r /τ = 5 4 r /τ = 5
r/τ = 5
r /τ = 1 r /τ = 1
2 r/τ = 1 25 2
r /τ = 0.2 r /τ = 0.2
r/τ = 0.2
0 0 0
0.7 0.75 0.8 0.85 0.9 0.75 0.8 0.85 0.9 0.95 0.8 0.85 0.9 0.95 1
FIGURE 7 The aggregation scores on CIFAR-10 (MobileNet and VGG-16) and MNIST (LeNet-5) under three different reward over penalty ratios.
The scores are individually normalized by the aggregation score obtained by the uncompressed models.
different parameter coding schemes. Each point stands for one where f1 is the accuracy of the model, and f2 is the corre-
compressed model in the solution set obtained by EMOMC. sponding energy consumption. When classifying an image, if
The results demonstrate that in terms of diversity the solution the result is correct, a reward r can be obtained; otherwise, a pen-
sets show similar patterns with the bi-objective optimization of alty x is performed. By giving a fixed amount of energy budget,
accuracy and energy consumption. the number of images that can be classified is inversely propor-
Furthermore, it can be observed that although COO and tional to the energy consumed per image f2 . From Equation (8),
CSR are developed to store a sparse matrix, sometimes they it can be seen that one of the key parameters in this aggregation
do not save the memory space for the compressed models, score system is the ratio between the reward and the penalty r/x ,
compared with the normal coding scheme. For example, if which indicates the significance of accuracy. The selection of the
pursuing high model accuracy, the normal coding scheme is optimal solution highly depends on the ratio r/x .
the best one among the three coding schemes for
MoblieNet. The reason is that although the COO and CSR
coding schemes only store non-zero elements, they still need
Ratio of Energy Cons.
several extra bits to record the position of each non-zero ele- 22

ment. If attempting to keep the model accuracy at a high 21 X :Y CI :CO
level, the compression rate cannot be high, making the 1 FX :FY X:FX
memory space saved from the sparsity of the filter less than
2–1
the overhead of those extra bits. In this case, the normal cod-
2–2
ing scheme is a better choice. However, if allowing a certain 0.70 0.75 0.80 0.85 0.90 0.95
level of accuracy loss, then CSR is the best among the three Accuracy
coding schemes.
FIGURE 8 The energy consumption of VGG-16 over the energy con-
sumption of MobileNet under different accuracy scores on CIFAR-10.
D. Aggregation of Accuracy and Energy Efficiency
Theoretically, higher accuracy comes with higher energy con-
sumption. Most previous model compression approaches only
allow a negligible loss of accuracy. For applications on edge
COO CSR-Relative
devices, it will be acceptable to sacrifice a little bit of accuracy
Normal
to achieve substantial improvement in energy efficiency. For
Ratio of Model Size
23
VGG-16, as shown in Figure 5, if 2% of accuracy loss is 22
acceptable, the energy consumption can be reduced by around 21
1
80%. In the solution sets displayed in Figure 5, there are some
2–1
knee points if considering the balance of both the model accu- 2–2
racy and the energy consumption. To help users select the 2–3
0.70 0.75 0.80 0.85 0.90 0.95
model for deployment on edge devices, a new metric called
Accuracy
aggregation score is defined as:
FIGURE 9 The model size of VGG-16 over the model size of MobileNet
AScore = (f1 $ r + (1 - f1) $ x) /f2, (8) under different accuracy scores on CIFAR-10.

TABLE II Energy consumption comparison of the compressed models obtained by EMOMC
and the peer methods for VGG-16 on CIFAR-10.
ACCURACY ENERGY CONSUMPTION (MJ) EFFICIENCY IMPROVEMENT (×) AGGREGATION SCORE
METHOD LOSS X :Y CI : CO FX : FY X : FX X :Y CI : CO FX : FY X : FX r/x = 5.0 r/x = 1.0 r/ x = 0.2

EMOMC 0.3% 1.7 1.7 2.3 2.6 14.0 12.2 10.4 8.9 125.6 97.0 30.0
(OURS)
PRUNING 0.2% 18.5 19.6 18.5 19.3 1.2 1.1 1.3 1.2 1.2 1.2 1.4
FILTERS [26]
PLAY AND 0.1% 9.5 12.6 9.5 11.2 2.4 1.7 2.5 2.0 2.2 2.2 2.5
PRUNE [50]
occupies around 50% less memory

TABLE III Model size comparison of the compressed models obtained by MOMC
and the peer methods for VGG-16 on CIFAR-10. space than the MobileNet when the
accuracy is below 88%. This observa-
ACCURACY MODEL SIZE (MB) COMPRESSION RATE (×)
tion shows that although MobileNet is
METHOD LOSS NORMAL CSR COO NORMAL CSR COO designed for computation efficiency,
EMOMC (OURS) 0.3% 9.8 8.0 24.5 6.1 7.4 2.4 one should select a compressed model
PRUNING 0.2% 34.7 37.9 57.3 1.7 1.6 1.0 from a more complex neural network
FILTERS [26] such as VGG-16. It is more efficient
PLAY AND 0.1% 15.1 16.5 24.6 3.9 3.6 2.4 than the compressed model from simpler
PRUNE [50]
neural networks, in terms of energy effi-
ciency and model size. The reason is that
the number of the parameters or the
Figure 7 displays the aggregation scores of different solu- precision of the parameters in VGG-16 can be lower than those
tions under three different values of r/x . Each curve is of MobileNet after the mode compression.
plotted based on the results from one execution of the
multi-objective optimization. As the multi-objective solver F. Comparison to the State-of-the-Art
generates discrete points (i.e., solutions), they are plotted in a Tables II and III report the results of different model com-
line using a smooth function called cspline, which connects pression methods, in terms of the energy consumption and
consecutive points by natural cubic splines after rendering the aggregation scores, and model size, respectively. Table II
the data monotonic. The scores are individually normalized shows that under negligible accuracy loss (typically, less than
by the aggregation score obtained by original uncompressed 0.5% accuracy loss), EMOMC improves the energy efficiency
models. From the results, it can be observed that most accu- and model compression rate by a factor of 11.4 # and 5.3 #,
racy-energy curves have one peak point. For the complex on average. There are two reasons for such improvements.
neural networks such as VGG-16, the highest scores will be Firstly, the evolutionary multi-objective solver optimizes the
125 # higher than the uncompressed model due to its high problem generation by generation. By allowing a certain
compression rate. For the simpler networks such as range of accuracy loss, it can generate many intermediate
MobileNet and LeNet, the highest aggregation score is results, and these results contribute to the improvements in
around 9 # more than the original model. energy efficiency or compression rate. Compared with previ-
ous methods which take accuracy loss as a hard constraint,
E. On Selection of Neural Networks EMOMC is more likely to find better results. Secondly, the
On the same dataset, the selection of an optimal neural net- exploration space of the model compression process is signifi-
work depends on how one compresses the model. For instance, cantly reduced by adopting both pruning and quantization
MobileNet is specially designed for computation efficiency; techniques. Without the proposed two-stage pruning and
although its accuracy is slightly lower than VGG-16, it uses quantization co-optimization strategy, previous approaches
much fewer hardware resources than VGG-16, in terms of suffer from too high computation cost to explore and exploit
energy efficiency and model size. However, this statement is such a huge search space. In addition to energy efficiency and
true only for the original uncompressed MobileNet and VGG- compression rate, the proposed method also shows an average
16. After model compression, VGG-16 may be more efficient 84.2 # improvement on aggregation scores.
than MobileNet. Figures 8 and 9 show the ratios of energy In practice, one needs to select an optimal solution
consumption and model size between VGG-16 and MobileNet, (from the solution set obtained by an EMO algorithm) for
in four dataflow designs and three coding schemes. The results the machine learning task on a specific device. For instance,
show that apart from dataflow design C I : C O and the normal after solving the bi-objective optimization problem of accura-
coding scheme,VGG-16 consumes around 50% less energy and cy and energy efficiency, a set of solutions can be obtained

which trade-off the two objectives. The constraint on the [18] N. P. Jouppi et al., “In-datacenter performance analysis of a tensor processing unit,”
in Proc. Annu. Int. Symp. Comput. Architecture, 2017, pp. 1–12.
energy can be calculated based on the energy capacity and [19] M. Alwani, H. Chen, M. Ferdman, and P. Milder, “Fused-layer CNN accelerators,”
battery life. Then, the solution that achieves the highest accu- in Proc. Annu. IEEE/ACM Int. Symp. Microarchitecture, 2016, pp. 1–12.
[20] J. Qiu et al., “Going deeper with embedded FPGA platform for convolutional neu-
racy will be selected as the optimal solution for the task. ral network,” in Proc. ACM/SIGDA Int. Symp. Field-Programmable Gate Arrays, 2016,
Alternatively, one can select the knee point from the solution pp. 26–35.
[21] Y.-H. Chen et al., “Eyeriss: A spatial architecture for energy-efficient dataf low for
set as the preferred solution. convolutional neural networks,” ACM SIGARCH Comput. Architecture News, vol. 44, no.
3, pp. 367–379, 2016. doi: 10.1145/3007787.3001177.
[22] H. Li, X. Fan, L. Jiao, W. Cao, X. Zhou, and L. Wang, “A high performance FPGA-
VI. Conclusion based accelerator for large-scale convolutional neural networks,” in Proc. Int. Conf. Field
In this paper, an evolutionary multi-objective model compres- Programmable Logic Appl., 2016, pp. 1–9.
[23] Y. Guo, A. Yao, and Y. Chen, “Dynamic network surgery for efficient DNNs,” in
sion approach is proposed to accelerate and compress DNNs by Proc. Adv. Neural Inf. Process. Syst., 2016, pp. 1379–1387.
optimizing multiple objectives (e.g., accuracy, energy efficiency, [24] F. Manessi, A. Rozza, S. Bianco, P. Napoletano, and R. Schettini, “Automated
pruning for deep neural network compression,” in Proc. Int. Conf. Pattern Recognit., 2018,
and model size) simultaneously. As the evaluation of each archi- pp. 657–664.
tecture is extremely time-consuming during the evolution, a [25] C. Lemaire, A. Achkar, and P.-M. Jodoin, “Structured pruning of neural networks
with budget-aware regularization,” in Proc. IEEE Conf. Comput. Vision Pattern Recognit.,
two-stage pruning and optimization co-optimization strategy is 2019, pp. 9108–9116.
developed to speed up the architecture searching process. [26] H. Li, A. Kadav, I. Durdanovic, H. Samet, and H. P. Graf, “Pruning filters for ef-
ficient convnets,” 2016. [Online]. Available: https://arxiv.org/abs/1608.08710
Extensive experimental results demonstrate that the proposed [27] Y. He, J. Lin, Z. Liu, H. Wang, L.-J. Li, and S. Han, “AMC: AutoML for model
method can obtain a set of diverse networks in a single execu- compression and acceleration on mobile devices,” in Proc. Eur. Conf. Comput. Vision, 2018,
pp. 784–800.
tion. Furthermore, the proposed method outperforms the peer [28] Z. Liu, J. Xu, X. Peng, and R. Xiong, “Frequency-domain dynamic pruning for con-
methods in terms of energy efficiency and model size for model volutional neural networks,” in Proc. Adv. Neural Inf. Process. Syst., 2018, pp. 1043–1053.
[29] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet classification with deep
compression of three popular DNNs. convolutional neural networks,” in Proc. Adv. Neural Inf. Process. Syst., 2012, pp. 1097–1105.
[30] C. Szegedy et al., “Going deeper with convolutions,” in Proc. IEEE Conf. Comput.
Vision Pattern Recognit., 2015, pp. 1–9.
Acknowledgement [31] K. Deb, “Multi-objective optimization,” in Search Methodologies. Springer-Verlag,
This work is partly supported by the Agency for Science, Tech- 2014, pp. 403–449.
[32] J. D. Schaffer, “Multiple objective optimization with vector evaluated genetic algo-
nology and Research (A*STAR) under its AME Programmatic rithms,” in Proc. Int. Conf. Genetic Algorithms, 1985, pp. 93–100.
Funding Scheme (No. A18A1b0045 and No. A1687b0033). [33] B. Li, J. Li, K. Tang, and X. Yao, “Many-objective evolutionary algorithms: A sur-
vey,” ACM Comput. Surv., vol. 48, no. 1, 2015. doi: 10.1145/2792984.
[34] L. Zhen, M. Li, R. Cheng, D. Peng, and X. Yao, “Adjusting parallel coordinates for
References investigating multi-objective search,” in Proc. Int. Conf. Simulated Evol. Learn., Shenzhen,
[1] I. Goodfellow, Y. Bengio, A. Courville, and Y. Bengio, Deep Learning, vol. 1. Cam- China, 2017, pp. 224–235.
bridge MA: MIT Press, 2016, no. 2. [35] M. Li, L. Zhen, and X. Yao, “How to read many-objective solution sets in parallel
[2] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale coordinates,” IEEE Comput. Intell. Mag., vol. 12, no. 4, pp. 88–100, 2017. doi: 10.1109/
image recognition,” 2014. [Online]. Available: https://arxiv.org/abs/1804.09081 MCI.2017.2742869.
[3] Z. Yang, Z. Dai, Y. Yang, J. Carbonell, R. R. Salakhutdinov, and Q. V. Le, “XLNet: [36] D. E. Goldberg, Genetic Algorithms in Search, Optimization and Machine Learning. Addi-
Generalized autoregressive pretraining for language understanding,” in Adv. Neural Inf. son-Wesley Longman Publishing, 1989.
Process. Syst., 2019, pp. 5753–5763. [37] N. Srinivas and K. Deb, “Muiltiobjective optimization using nondominated sorting
[4] L. Zhen, P. Hu, X. Peng, R. S. M. Goh, and J. T. Zhou, “Deep multimodal transfer in genetic algorithms,” Evol. Comput., vol. 2, no. 3, pp. 221–248, 1994. doi: 10.1162/
learning for cross-modal retrieval,” IEEE Trans. Neural Netw. Learn. Syst., to be published. evco.1994.2.3.221.
doi: 10.1109/TNNLS.2020.3029181. [38] K. Deb, A. Pratap, S. Agarwal, and T. Meyarivan, “A fast and elitist multiobjective
[5] H. Liu, K. Simonyan, and Y. Yang, “DARTS: Differentiable architecture search,” in genetic algorithm: NSGA-II,” IEEE Trans. Evol. Comput., vol. 6, no. 2, pp. 182–197,
Proc. Int. Conf. Learn. Representations, 2018. 2002. doi: 10.1109/4235.996017.
[6] Y. Sun, H. Wang, B. Xue, Y. Jin, G. G. Yen, and M. Zhang, “Surrogate-assisted evolution- [39] C. A. C. Coello, S. G. Brambila, J. F. Gamboa, M. G. C. Tapia, and R. H. Gómez,
ary deep learning using an end-to-end random forest-based performance predictor,” IEEE “Evolutionary multiobjective optimization: Open research areas and some challenges ly-
Trans. Evol. Comput., vol. 24, no. 2, pp. 350–364, 2020. doi: 10.1109/TEVC.2019.2924461. ing ahead,” Complex Intell. Syst., vol. 6, no. 2, pp. 221–236, 2020.
[7] Y. Sun, G. G. Yen, and Z. Yi, “Evolving unsupervized deep neural networks for learn- [40] Z. Lu, I. Whalen, V. Boddeti, Y. Dhebar, K. Deb, E. Goodman, and W. Banzhaf,
ing meaningful representations,” IEEE Trans. Evol. Comput., vol. 23, no. 1, pp. 89–103, “NSGA-Net: Neural architecture search using multi-objective genetic algorithm,” in
2018. doi: 10.1109/TEVC.2018.2808689. Proc. Genetic Evol. Comput. Conf., 2019, pp. 419–427.
[8] Y. Sun, B. Xue, M. Zhang, and G. G. Yen, “Completely automated CNN architec- [41] H. Pham, M. Guan, B. Zoph, Q. Le, and J. Dean, “Efficient neural architecture
ture design based on blocks,” IEEE Trans. Neural Netw. Learn. Syst., vol. 31, no. 4, pp. search via parameters sharing,” in Proc. Int. Conf. Mach. Learn., 2018, pp. 4095–4104.
1242–1254, 2020. doi: 10.1109/TNNLS.2019.2919608. [42] A. Krizhevsky, V. Nair, and G. Hinton, “CIFAR-10 (Canadian Institute for Advanced
[9] Y. Sun, B. Xue, M. Zhang, and G. G. Yen, “Evolving deep convolutional neural networks Research),” Tech. Rep., 2010. [Online]. Available: http://www.cs.toronto.edu/kriz/
for image classification,” IEEE Trans. Evol. Comput., vol. 24, no. 2, pp. 394–407, 2020. cifar.html
[10] S. Han, H. Mao, and W. J. Dally, “Deep compression: Compressing deep neural net- [43] J. Blank and K. Deb, “Pymoo: Multi-objective optimization in python,” IEEE Access,
works with pruning, trained quantization and Huffman coding,” 2015. [Online]. Avail- vol. 8, pp. 89,497–89,509, 2020.
able: https://arxiv.org/abs/1510.00149 [44] M. Zhu and S. Gupta, “To prune, or not to prune: Exploring the efficacy of prun-
[11] Y. Wang, C. Xu, J. Qiu, C. Xu, and D. Tao, “Towards evolutionary compression,” ing for model compression,” 2017. [Online]. Available: https://arxiv.org/abs/1710.01878
in Proc. ACM SIGKDD Int. Conf. Knowl. Discovery Data Mining, 2018, pp. 2476–2485. [45] Y. Cheng, D. Wang, P. Zhou, and T. Zhang, “A survey of model compression and
[12] T.-J. Yang, Y.-H. Chen, and V. Sze, “Designing energy-efficient convolutional neu- acceleration for deep neural networks,” 2017. [Online]. Available: https://arxiv.org/
ral networks using energy-aware pruning,” in Proc. IEEE Conf. Comput. Vision Pattern abs/1710.09282
Recognit., 2017, pp. 5687–5695. [46] A. G. Howard et al., “Mobilenets: Efficient convolutional neural networks for mobile
[13] K. Wang, Z. Liu, Y. Lin, J. Lin, and S. Han, “HAQ: Hardware-aware automated quan- vision applications,” 2017. [Online]. Available: https://arxiv.org/abs/1704.04861
tization with mixed precision,” in Proc. IEEE Conf. Comput. Vision Pattern Recognit., 2019. [47] Y. LeCun, L. Bottou, Y. Bengio, P. Haffner et al., “Gradient-based learning applied
[14] V. Sze, Y.-H. Chen, T.-J. Yang, and J. S. Emer, “Efficient processing of deep neural to document recognition,” Proc. IEEE, vol. 86, no. 11, pp. 2278–2324, 1998. doi: 10.1109/
networks: A tutorial and survey,” Proc. IEEE, vol. 105, no. 12, pp. 2295–2329, 2017. doi: 5.726791.
10.1109/JPROC.2017.2761740. [48] Xilinx, “Vivado design suite user guide,” Technical Publication, 2018.
[15] Z. Wang, T. Luo, J. T. Zhou, and R. S. M. Goh, “EDCompress: Energy-aware model [49] E. G. Walters, “Array multipliers for high throughput in Xilinx FPGAs with 6-input
compression with dataf low,” 2020. [Online]. Available: https://arxiv.org/abs/2006.04588 LUTs,” Computers, vol. 5, no. 4, p. 20, 2016. doi: 10.3390/computers5040020.
[16] Z. Du et al., “ShiDianNao: Shifting vision processing closer to the sensor,” in Proc. [50] P. Singh, V. K. Verma, P. Rai, and V. P. Namboodiri, “Play and Prune: Adaptive
Annu. Int. Symp. Computer Architecture, 2015, pp. 92–104. filter pruning for deep model compression,” 2019. [Online]. Available: https://arxiv.org/
[17] M. Song et al., “Towards efficient microarchitectural design for accelerating unsu- abs/1905.04446
pervized GAN-based deep learning,” in Proc. IEEE Int. Symp. High Performance Comput.
Architecture, 2018, pp. 66–77.

©SHUTTERSTOCK.COM/OMELCHENKO
Fast and Unsupervised
Neural Architecture Evolution
for Visual Representation
Learning
Abstract—Unsupervised visual representation learning is one

Song Xue, and Hanlin Chen of the hottest topics in computer vision, yet performance still
Beihang University, CHINA lags behind compared with the best supervised learning meth-
Chunyu Xie ods. At the same time, neural architecture search (NAS) has pro-
Qihoo 360 AI Research, CHINA; Beihang University, CHINA duced state-of-the-art results on various visual tasks. It is a
natural idea to explore NAS as a way to improve unsupervised
Baochang Zhang representation learning, yet it remains largely unexplored. In this
Beihang University, CHINA paper, we propose a Fast and Unsupervised Neural Architecture
Xuan Gong, and David Doermann Evolution (FaUNAE) method to evolve an existing architecture,
University at Buffalo, USA manually constructed or the result of NAS on a small dataset, to
a new architecture that can operate on a larger dataset. This par-
tial optimization can utilize prior knowledge to reduce search
cost and improve search efficiency. The evolution is self-super-
vised where the contrast loss is used as the evaluation metric in
Digital Object Identifier 10.1109/MCI.2021.3084394 Corresponding author: Baochang Zhang (e-mail: bczhang@buaa.edu.cn).
Date of current version: 15 July 2021 Co-first authors: Song Xue, and Hanlin Chen

a student-teacher framework. By eliminating the inferior or [5] are inefficient and can not deal effectively with unsupervised
least promising operations, the evolutionary process is greatly representation learning. Our approach is extremely efficient with
accelerated. Experimental results show that we achieve state-of- a complexity of O (n 2) where n is the size of the operation space.
the-art performance for downstream applications, such as object Here we propose our Fast and Unsupervised Neural Architecture
recognition, object detection, and instance segmentation. Evolution (FaUNAE) method to search architectures for repre-
sentation learning. Although UnNAS [6] discusses the value of a
L
I. Introduction label and discovers that labels are not necessary for NAS, it cannot
earning high-level representations from labeled data and solve the aforementioned problems because it is computationally
deep learning models in an end-to-end manner is one of expensive and is trained using supervised learning for real applica-
the biggest successes in computer vision in recent history. tions. FaUNAE is introduced to evolve an architecture from an
These techniques make manually specified features largely existing architecture manually designed or searched from one
redundant and have greatly improved the state-of-the-art for many small-scale dataset on another large-scale dataset. This partial opti-
real-world applications, such as medical imaging, security, and mization can utilize the existing models to reduce the search cost
autonomous driving. However, there are still many challenges. For and improve search efficiency. The strategy is more practical for
example, a learned representation from supervised learning for real applications, as it can efficiently adapt to the new scenarios’
image classification may lack information such as texture, which minimal data labeling requirements.
matters little for classification but can be more relevant for later First, we adopt a trial-and-test method to evolve the initial
tasks. Yet adding it makes the representation less general, and it architecture, which is more efficient than the traditional evolution
might be irrelevant for tasks such as image captioning. Thus, methods, which are computationally expensive and require large
improving representation learning requires features to be focused amounts of labeled data. Second, we note that the quality of the
on solving a specific task. Unsupervised learning is an important architecture is hard to estimate due to the absence of labeled data.
stepping stone towards robust and generic representation learning To address this, we explore contrastive loss [1] as the evaluation
[1]. The main challenge is the significant performance gap com- metric for the operation evaluation. Although our method is built
pared to supervised learning. Neural architecture search (NAS) has based on contrastive loss [1], we model our method on the teach-
shown extraordinary potential in deep learning due to the cus- er-student framework to mimic supervised learning and then
tomization of the network for target data and tasks. Intuitively, estimate the operation performance even without annotations.
using the target data and searching the tasks directly without a Then the architecture can be evolved based on the estimated per-
proxy gap will result in the least domain bias. For performance rea- formance. Third, we address the fact that one bottleneck in NAS
sons however, early NAS [2] methods only searched small datasets. is its explosive search space of up to 148. The search space issue is
Later, transfer methods were utilized to search for an optimal even more challenging for unsupervised NAS built on an ambig-
architecture on one dataset, then adapt the architecture to work on uous performance estimation that further deteriorates the training
larger datasets [3]. ProxylessNAS [4] was proposed to search larger process. To address this issue, we build our search algorithm based
datasets after directly learning the architecture for the target task on the principles of survival of the fittest and elimination of the
and hardware, instead of using a proxy. Recent advances in NAS inferior. This significantly improves search efficiency. Our frame-
show a surge of interest in neural architecture evolution (NAE) work is shown in Fig. 1. Our contributions are as follows:
[3], [5]. NAE first searches for an appropriate architecture on a ❏❏ We propose unsupervised neural architecture evolution,
small dataset and then transfers the architecture to larger dataset, by which utilizes prior knowledge to search for an architecture
simply adjusting weights. One key to the success of NAS or NAE using large-scale unlabeled data.
is the use of large-scale data, suggesting that more data leads to bet- ❏❏ We design a new search block and limit the model size
ter performance.The problem, however, is that the cost of labeling according to the initial architecture during evolution to deal
these larger datasets may become prohibitive as their size grows. with the huge search space of ResNet. The search space is
In scenarios where we can not obtain sufficient annotations, further reduced through a contrastive loss in a teacher-stu-
self-supervised learning is a popular approach to leverage the dent framework by abandoning operations with less poten-
mutual information of unlabeled training data. However, the per- tial to significantly improve the search efficiency.
formance of unsupervised methods is still unsatisfactory compared ❏❏ Extensive experiments demonstrate that the proposed
with the supervised methods. One obstacle is that only the param- methods achieve better performance than prior art on Ima-
eters are learned using conventional self-supervised methods. To geNet, PASCAL VOC, and COCO.
address the performance gap, a natural idea is to explore the use of
NAE to optimize the architecture along with parameter training. II. Related Work
Tackling these bottlenecks will allow us to design and implement
efficient deep learning systems that will help to address a variety of A. Unsupervised Learning
practical applications. Specifically, we can initialize with an archi- Recent progress on unsupervised/self-supervised1 learning
tecture found using NAS on a small supervised dataset and then began with artificially designed pretext tasks, such as patch
evolve the architecture on a larger dataset using unsupervised
learning. Currently, existing architecture evolution methods [3], 1
Self-supervised learning is a form of unsupervised learning.

relative location prediction [7] and rotation prediction [8]. Large series of anchors as samples and then classifies the samples
scale experiments on these pretext tasks [9] reveal that neither is through the convolutional neural network. Common algo-
the ranking of architectures consistent across different methods rithms include R-CNN [20], Fast R-CNN [21], and Faster
nor is the ranking of methods consistent across architectures. Thus, R-CNN [22]. The one-stage methods use a single shot detector,
pretext tasks for self-supervised learning should be considered in and the methods are divided into anchor-based method and
conjunction with their underlying architectures. anchor-free methods. The anchor-free method does not generate
Self-supervised representation learning can also be formu- anchors and directly transforms the problem of anchor positioning
lated as learning an embedding where the semantically similar into regression. Common algorithms are YOLO [23], SSD [24]
images are very close, and the semantically different images are and so on. Compared with object detection, instance segmentation
far away. One method is to contrast positive pairs with negative requires more stringent requirements. It not only requires seg-
pairs. Contrastive learning [10] learns mappings that are invari- menting the object’s mask and identifying its category but also
ant to certain transformations of the inputs. He, et al. [1] build a needs to distinguish different instances in the same category. The
dynamic dictionary with a queue and a moving-averaged main algorithms include Mask R-CNN [25], SOLO [26], and
encoder using a contrastive loss. He, et al. [1] update the dic- SOLOv2 [27]. For object detection and segmentation, we mainly
tionary in the form of a queue, decoupling the size of the dic- use Faster R-CNN and Mask R-CNN. Faster R-CNN proposes
tionary and the batch size, so that the size of the dictionary the Region Proposal Network on the basis of Fast R-CNN and
does not need to be constrained by batch size. In addition, the integrates feature extraction, region proposal, bounding box
Momentum-based Moving Average keeps the encoder updated regression (rect refine), and classification into one network, thereby
all the time. Subsequently, Chen, et al. [11] combine the meth- effectively improving detection accuracy and detection efficiency.
ods of [1], [12], and use an MLP projection head and more Mask R-CNN is a framework for instance segmentation, which
data Augmentation to further improve performance. adds a branch to Faster R-CNN for segmentation. In addition,
Mask R-CNN introduces a region of interest (ROI) alignment to
B. Neural Architecture Search replace ROI pooling in Faster R-CNN, which improves the
With the rapid development of computer vision, various net- accuracy of the mask.
work structures have achieved significant performance improve-
ments, but most of them are manually designed. Recently, a D. Neural Architecture Evolution
method called NAS has attracted more and more attention, Before the advent of the term “deep learning,” neuroevolution
which can automate neural net architecture design [2], [13]. [28], [29] used genetic algorithms to change the topology along
NAS performs full-fledged training and validation over thou- with hyperparameters and connection weights of artificial neu-
sands of architectures from scratch. Traversing all candidate ral networks (ANNs) by allowing high performing structures to
architectures in a very large search space requires tremendous reproduce and mutate. Recent works focus on evolving deep
computation and memory resources. A cell-based architecture neural architecture on a larger scale. Real, et al. [13] start small
[3] is widely used to reduce the size of search space by stacking and mutates architectures to navigate large search spaces.
cells to make NAS more efficient. However, the search space is Encouraged by the idea of stacking the same modules in many
still noncontinuous and high-dimensional. A one-shot architec- successful architectures such as ResNet and DenseNet, a variant
ture search method [14], [15] is proposed for learning efficiency, of this method [30] shows better results. It repeatedly evolves
making it possible to identify an optimal architecture within a small neural network modules by using a larger hand-coded
few GPU days. Pham, et al. [15], share parameters among child blueprint. Similarly, CoDeepNEAT [31] evolves the topology,
models and trains a controller to search the network architec- the components, and hyperparameters simultaneously for
ture. Liu, et al. [14] introduce a differentiable framework that cheaper computation. Cai, et al. [5] use reinforcement learning
enables NAS to use gradient descent for search. Due to the as a meta-controller and Bi-LSTM as an encoder for function-
considerable memory requirement of [14], [16] proposes a par- preserving transformations to make the network wider or
tial channel connection method, which reduces memory con- deeper. Zoph, et al. [3] use a recurrent neural network as a con-
sumption by sampling a small part of the network and improves troller to learn an architectural building block (cell) on a small
search efficiency. Chen, et al. [17] successfully apply NAS to dataset and then transfer the block to a larger dataset. Hierarchi-
model defense, sampling the network through the combination cal genetic representation is proposed to mutate from a low
of valid accuracy and bandit, and achieve good results in adver- level to a high level for complex topologies [32].
sarial training. The use of NAS for unsupervised learning is a Although a lot of advances have been made in NAE,
new research hotspot, which needs further exploration. few works have investigated unsupervised learning, partially
because of the inefficiency in the search process. We design
C. Object Detection and Segmentation a new search block and limit the model size according to
Object detection [18], [19] is one of the basic tasks of comput- the initial architecture during evolution. The search effi-
er vision. Object detection algorithms typically include one- ciency of our method can also be significantly improved
stage methods and two-stage methods. The two-stage methods by abandoning operations with less potential in terms of
use a sliding window detector. First, the algorithm generates a contrastive loss.

III. FaUNAE
In this section, we elaborate on how our network efficiently
FIGURE 1 The main framework of the proposed Teacher-Student search strategy. Left: Search space of FaUNAE. Middle: Evolution process of FaUNAE. Right: FaUNAE reduces the search space by
SAConv
5×5
adapts to new datasets with unsupervised learning. We first
introduce an evolution strategy to effectively evolve the
SAConv
architecture from an existing architecture to a new one. We
3×3
Search Space
then train our network by exploiting the contrastive learn-
OUTPUT
INPUT
ing to discriminate between positive samples and negative
samples. Our framework is shown in Fig. 1.
Conv
5×5
A. Search Space
For different datasets, our architecture space is different. For
Conv
3×3
CIFAR10 [33], we search for two kinds of computation
Ω ← Ω – {arg max w(ok,n)}

cells, normal cells and reduction cells, to build the final
i
2logN
architecture. The reduction cells are located at 1/3 and 2/3
saki
Search Space
of the total depth of the network, and the rest are normal
w(ok,n) = l(ok,n) –
Reduce
k
cells. Following [14], the set of operations include skip
connection (identity), 3 # 3 convolutions, 5 # 5 convolu-
i
tions, 3 # 3 depth-wise separable convolutions, 5 # 5
i
depth-wise separable convolutions,3 # 3 max pooling,
i
Contrastive
i
3 # 3 average pooling, and zero. The search space of a cell
Loss
consists of operations on all edges.
On ImageNet, we have experimentally determined that
s0
s1
s2
t
for unsupervised learning, ResNet [34] is better than cell-
based methods for building an architecture space. We denote
Teacher
Student
SAConv
SAConv
this space as {X i}, where i represents a given block. Rather 5×5
5×5
than repeating the bottleneck (building block in ResNet)
with various operations, however, we allow a set of search
SAConv
SAConv
3×3
3×3
blocks shown in Fig.2 (a) with various operations, including
OUTPUT
OUTPUT
traditional convolution with kernel size {3, 5}, split-atten-
INPUT
INPUT
tion convolution (SAConv) [35] with kernel sizes {3, 5} and
radixes {2, 4}. SAConv is a Split-Attention block [35] which
Conv
Conv
5×5
5×5
incorporates feature map split attention within the individu-
al network blocks. Each block divides the feature map into
Conv
Conv
several groups and fine-grained subgroups or splits, where

3×3
3×3
the feature representation of each group is determined by

the weighted combination of its split representations. This
reduces the model size by sharing the 1 # 1 convolution to
Mutation and Train
improve efficiency. To enable a direct trade-off between

K Times
depth and block size (indicated by the parameters of the

selected operations), we initiate a deeper over-parameterized
network and allow a block to be skipped by adding the
eliminating the operations with the least potential.
identity operation to the candidate set of its mixed opera-

tion. So the set of operations X i in the ith block consists of
M = 7 operations. With a limited model size, the network
SAConv
5×5
can either choose to be shallower by skipping more blocks

and using larger blocks or choose to be deeper by keeping
SAConv
more blocks with smaller size.

3×3
Search Space
The initial structure a 0 is first manually designed (e.g.,

OUTPUT
INPUT
ResNet-50, without weight parameters) or searched for

by another NAS (e.g., ProxylessNAS [4]) on different
Conv
5×5
datasets in a supervised manner2, then remapped to the

search space to accelerate the evolution process and make
use of prior knowledge.
Conv
3×3
2
The evolution is based on unsupervised learning.

B. Evolution Z 1 - , oi = oi
]] e k, n k, n - 1
The evolutionary strategy is summarized in Alg. 1. Unlike Amoe- p mt ô k, nh = [ sa ik
f / sa ikl pe, otherwise
i 1 1- (1)
baNet [36], which evaluates the performance of the sub-networks ]K-1
sampled from the search space in a population, our method evolves \ kl
the operation in each block in a trial-and-test manner. We first where sa k represents the sampling times of the operation o ik .
i
mutate the operation based on its mutation probability, followed The operation o ik represents the kth operation in the ith block
by an evaluation step to make sure the mutation is ultimately used. and n represents current number of evolutions or the index of
Mutation: An initial structure a 0 is manually designed (e.g., epoch. In general, the operation in each block is kept constant
ResNet-50)3 or searched by another NAS (e.g., ProxylessNAS with a probability 1 - e in the beginning. For the mutation,
[4]) on a different dataset using supervised learning. The initial younger (less sample time) operations are more likely to be
subnetwork fis, which is generated by searching over-parameter- selected, with a probability of e, Intuitively, keeping the opera-
ized network based on a 0, is then trained using (6) for k steps to tion constant can be thought of as providing exploitation,
obtain the evaluation metric l ô ikh . A new architecture a n (o k, n) is while mutations provide exploration [36]. We use two main
then constructed from the old architecture a n - 1 (o k, n - 1) by a mutations that we call the depth mutation and the op muta-
transformation or a mutation. The mutation probability p mt is tion, as in AmoebaNet [36], to modify the structure generated
defined as by the search space described above.
The operation mutation pays attention to the selection of
3
No weight parameters. operations in each block. Once the operation in a block is
Performance Evaluation
InfoNCE Loss
t = fθt (xt) s0, s1, s2...
Reduce t = fθt (xt) s0, s1, s2...
Teacher Student
Teacher EMA Student

Search Space
FaUNAE TS Contrastive Learning
(a)
Conv Conv
1×1 1×1
Conv Conv Conv SAConv SAConv

3×3 fmap in Memory
3×3 5×5 3×3 5×5
fmap not in Memory
Conv Conv
1×1 1×1
1) Bottleneck 2) Search Block

(b)
FIGURE 2 (a) The main framework of the Teacher-Student model, which focuses on both the unsupervised neural architecture evolution (left)
and contrastive learning (right). (b) Compared with the original bottleneck (1) in ResNet, a new search block is designed for FaUNAE (2).” fmap
in memory” means that we select the operation to participate in the feature map calculation.

chosen to mutate, the mutation picks one of the other opera- We use contrastive loss to match an encoded query s with the
tions based on (1). The depth mutation can change the depth encoded key dictionary to train the visual representation student
of the subnetwork in the over-parameterized network by set- model. The value of the contrastive loss is lower when s and t are
ting the operation of one block to the ‘Identity’ operation. To from the same (positive) sample and higher when s and t are
search more efficiently and evaluate operations that are more from different (negative) samples. The contrastive loss is also
reasonable, we limit the model size as a restriction metric. The deployed in FaUNAE as a measure to guide the evolution of the
structure can then evolve into a subnetwork that has the same structure to obtain the optimal structure based on the unlabeled
computational burden. The probability of the restriction metric dataset. InfoNCE [39] shown in Fig.2 (a), measures the similarity
p irm is defined as using the dot product and is used as our evaluation metric:
- exp MS (o ik, n) exp (s $ t + /x)

p rm (o ik, n) = L = - log , (6)
/ exp MS (o ikl,n) ,(2)
N
/ exp (s $ t n /x)
kl n=0
i
where MS (o ) represents the number of parameters of the kth
k, n where x is a temperature hyperparameter per [40] and t + rep-
operation in the ith block and n represents the index of epoch resents the feature calculated from the same sample with s.
or current number of evolutions. The final evolution probabili- InfoNCE is over one positive and M negative samples. Our
ty p that combines p mt and p rm is defined as method is general and can be based on other contrastive loss
functions [37], [40]–[42]. Following [1], [43], the teacher
p (o ik, n) = m 1 ) p mt (o ik, n) + (1 - m 1) ) p rm (o ik, n), (3) model is updated as an exponential moving average (EMA) of
the student model:
where m 1 is a hyperparameter.
Mutation validation: After each evolution, the subnet- it = m ) i t + (1 - m) ) i s, (7)
work is trained using (6), and the loss is used as the evaluation
metric. We observe the current validation loss a and accordingly where i s and i t are the weight of the student model and the
update the loss l (o ik, n), which historically indicates the valida- teacher model, respectively, updated by back-propagation in
(i, j)
tion loss of all the sampled operations o k as contrastive learning, and m ! [0, 1) is a smoothing coefficient
hyperparameter.
l (o ik, n) = m 2 ) l (o ik, n - 1) + (1 - m 2) ) a, (4)
where m 2 is a hyperparameter. If the operation which is mutat-

Algorithm 1 FaUNAE.
ed performs better (less loss), we apply it as the base of the next
evolution; otherwise, we use the original operation as the base Input: Training data, validation data, and the initial structure a 0
of the next evolution. Parameter: Searching hypergraph
Output: Optimal structure a
o ik, n = )
o ik, n, l (o ik, n) 1 l (o ik, n - 1)
(5) 1: Let n = 1.
o ik, n - 1, else
2: while (K >1) do
3: for t = 1, ..., T epoch do
C. Contrastive Learning 4: Evolve architecture a n from the old architecture a n - 1
Contrastive learning [37] greatly improves the performance of based on the evolution probability p using (3);
unsupervised visual representation learning. The goal is to keep 5: Construct the Teacher model and Student model with
the negative sample pairs away and the positive samples close in the same architecture a n, and then train Student mod-
the latent space. Prior works [1], [38] usually investigate contras- els by gradient descent and update Teacher model by
tive learning by exploring the sample pairs calculated from the EMA using (7);
encoder and the momentum encoder [1]. Based on the investi- 6: Get the evaluation loss on the validation data using (6);
gation, we reformulate the unsupervised/self-supervised NAS as Use (4) to calculate the performance and assign it to all
a teacher-student model, as shown in Fig. 2 (a). Following [1], sampled operations;
we build dynamic dictionaries, and the “keys” t in the dictionary Update a n using (5);
are sampled from data and are represented by the teacher net- 7: end for
work. In general, the key representation is t = fit (x t), where 8: if t == K ) E then
fit (.) = f (o ik, n; i t; .) is a teacher network, and x t is a key sample. 9: Update w ô ki , nh using (9);
Likewise, the “query” x s is represented by s = fis (x s), where 10: Reduce the search space: X i ! X i - " argmax w ô ki , nh,;
fis = f (o ik, n; i s; .) is a student network. Unsupervised learning 11: K ! K - 1;
k
performs dictionary look-up by training a network of students. 12: end if

The student model and teacher model are subnetworks of NAE 13: end while
from the over-parameterized network described in 3.1. 14: return a

D. Fast Evolution by Eliminating Operations After | X i | ) E epochs, we remove the operations in each
One of the most challenging aspects of NAS lies in the ineffi- block based on performances (loss) and the sampling times.
cient search process, and we address this issue by eliminating Inspired by the UCB algorithm, we define the combination of
the operations with the least potential. The multi-armed bandit the two as
problem is similar to the NAS problem. The Upper Confi-
dence Bound (UCB) is used for dealing with the exploration- 2 log N
w (o ik, n) = l (o ik, n) - , (9)
exploitation dilemma in the multi-armed bandit problem. sa ik
2 log N
s = rk + , (8) where N is the total number of evolutions and mutations, and
nk
sa ik is the number of times the kth operation of the ith block
where rk represents the average reward obtained by arm k. n k has been selected. The first item l (o ik, n) in (9) is calculated based
represents the number of times arm k has been selected in the on an accumulation of the validation loss, and the second term
total number of trials N. is the exploration term. The operation with the maximum w
Airplane
Automobile
Bird
Cat
Deer
Dog
Frog
Horse
Ship
Truck
FIGURE 3 Some pictures and corresponding labels of the CIFAR10 dataset.

for every block is abandoned. This means that the operations 2) ImageNet
that are given more opportunities but result in poor perfor- ImageNet is currently the world’s largest image recognition
mance are removed. In order to improve search efficiency, we database. It is an ongoing research effort aimed at providing an
eliminate the operation with the maximum w for every block easy-to-access image database for researchers around the world.
simultaneously. With this strategy, the search space which has v At present, there are tens of millions of images in ImageNet,
blocks is significantly reduced from | X i | v to (| X i | - 1) v, and more than 20000 categories.
and the reduced space becomes
3) PASCAL VOC
VOC2007 contains 9963 annotated pictures and 20 categories
X i ! X i - { argmax w (o ik, n)} .(10)
k split into training, validation, and testing data with 24,640
objects marked. There are 11530 images in VOC2012. For the
The reduction procedure is carried out repeatedly until the detection task, the trainval of VOC2012 has 11540 images and
optimal structure is obtained when only one operation is left in a total of 27450 objects.
each block.
4) MS COCO
IV. Experiments The coco dataset is widely used in areas such as object detec-
In this section, we describe experiments on the ImageNet and tion and instance segmentation. COCO2017 contains 118,287
CIFAR10 datasets and compare FaUNAE with other NAS training images, 5000 validation images, and 40670 testing,
methods and human-designed networks. In addition, we with more than 80 object categories.
describe experiments on the PASCAL VOC and COCO data-
sets. The evolved architecture on ImageNet is applied as the B. Evolution and Training Protocol
backbone of object detection. The evolution and training protocol used in our experiments
are described in this section. We first set a global average pool-
A. Datasets ing and a 2-layer MLP [12] head (hidden layer 2048-d, with
ReLU), which has a fixed-dimensional output (128-d [40])
1) CIFAR-10 after the hypernet and searched network. The temperature x in
CIFAR-10 (Fig. 3) is a dataset containing 60,000 images. 6 is set to 0.2 [40], and the smoothing coefficient hyperparam-
Fifty thousand pictures are used for the training set, and the eter m in (7) is set as 0.999. We use the same data enhancement
remaining ten thousand are used for the test set. Each photo settings as MoCoV2 [11]
is a 32 × 32 color photo, each pixel includes three values of Experiments first evolve the initial structure a 0 on an
RGB, and the value range is 0-255. All photos belong to 10 over-parameterized network, which uses ResNet50 as the
different categories. backbone to build the architecture space (details can be found
skip_connect
conv_3×3 sep_conv_3×3
N_{–1} N_4
max_pool_3×3 sep_conv_5×5 N_3
N_1
conv_3×3 skip_connect Output
skip_connect N_2
N_{0}
(a)
conv_3×3
N_{–1} avg_pool_3×3 skip_connect N_4
N_1
conv_3×3
avg_pool_3×3 Output
conv_5×5
N_2 conv_5×5
N_{0} skip_connect
N_3
(b)
FIGURE 4 Detailed structure of the best cells discovered on CIFAR-10. (a) normal cells found on CIFAR10. (b) reduction cells found on CIFAR10.

in 3.1) on ImageNet. We set the initial structure a 0 as a ran- train an epoch, we immediately verify the performance on the
dom structure, ResNet50, and the structure searched by Proe- validation set. In the experiment using ImageNet, we perform
xyless on ImageNet100, respectively, to show the importance unsupervised pretraining and then train a supervised linear clas-
of prior knowledge. During the architecture search, the 128M sifier for 100 epochs. Finally, we report top-1 classification
training samples of ImageNet are divided into two subsets, accuracy on the validation set.
80% for the training set, and the remainder for the validation
set for mutation validation and search space reduction. We set 1) Results on CIFAR10
the channel as half of that of ResNet50 for efficiency and pay On CIFAR10, we use an experimental setting described in
attention to the evolution of the operation rather than the MoCo v1 [1], not the settings used in v2 [11] for ImageNet. In
channel. We set the hyperparameter E = 3, so the number of addition, we use cell-based search space. In the search process,
the epoch of the search is R Mm = 2 k ) E. The hyperparameter m 1 we search for normal cells and reduction cells. The total num-
and m 2 are set to 0.9 and 0.3. After evolution, we train the ber of layers is set to 6, and reduction cells are located at 1/3
searched network on ImageNet in an unsupervised manner and 2/3. We set the intermediate nodes to M = 4, the momen-
for 200 epochs. tum to 0.9, the weight decay to 3 # 10 -4 and the initial learn-
ing rate to 0.025 described in [14]. We use 90% of the training
C. Results for Classification set as the training data in the search phase, and 10% of the
In this section, we perform experiments on CIFAR10 and training set as the validation data. After the search, we train the
ImageNet. In the experiment using CIFAR10, every time we final architecture for 600 epochs on CIFAR10. We set the total
number of cell layers to 10 and the
batch size to 128.
TABLE 1 Comparison under the linear classification protocol on CIFAR10.
We use two NVIDIA Titan V GPUs
SEARCH to search approximately 1.5 hours. We
COST (GPU SEARCH
ARCHITECTURE METHOD ACCURACY (%) PARAMS (M) DAYS) METHOD
set the initial number of channels to 36
and 100 for better performance. It can
AMDIMSMALL AMDIM [44] 89.5 194 — Manual
clearly be seen in Table 1 whether
AMDIMLARGE AMDIM [44] 91.2 626 — Manual
it is FaUNAE (channel = 36) or
RESNET18 MOCO V1 [1] 84.09 11.1 — Manual FaUNAE (channel = 100), the model
FAUNAE size of FaUNAE is smaller than that
(CHANNEL = 36) MOCO V1 88.83 9.1 0.125 Evolution
of AMDIM large and AMDIM small.
FAUNAE Note that the results in Table 1 are
(CHANNEL = 100) MOCO V1 92.26 69.8 0.125 Evolution
our best, but they are unstable and
we need additional trials to obtain
TABLE 2 Comparisons under the linear classification protocol on ImageNet. sults. Compared with AMDIM large,
re
FaUNAE (channel = 100) achieves
ACCURACY PARAMS SEARCH COST SEARCH
ARCHITECTURE METHOD (%) (M) (GPU DAYS) METHOD
not only a better performance (91.2
vs. 92.26), but also has fewer parame-
RESNET50 INSTDISC [40] 54.0 24 — Manual
ters (626M vs. 69.8M). When com-
RESNET50 LOCALAGG [45] 58.8 24 — Manual
pared with ResNet18, FaUNAE is
RESNET101 CPC V1 [42] 48.7 28 — Manual slightly better than ResNet in terms
RESNET170WIDER CPC V2 [46] 65.9 303 — Manual of model size (11.1M vs. 9.1M), and
RESNET50L+AB CMC [47] 64.1 47 — Manual FaUNAE improved performance by
AMDIMSMALL AMDIM [44] 63.5 194 — Manual 4.74% (84.09 vs. 88.83).
AMDIMLARGE AMDIM [44] 68.1 626 — Manual
2) Results on ImageNet
RESNET50 MOCO V1 [1] 60.6 24 — Manual
During the search, we set an initial
RESNET50 MOCO V2 [11] 67.5 24 — Manual
learning rate of 0.03, a momentum
RESNET50 SIMCLR [12] 66.6 24 — Manual of 0.9, a weight decay of 0.0001, and
RANDOM MOCO V2 66.2 23 — Random then we use 8 Tesla V100 GPUs to
PROXYLESSNAS MOCO V2 67.8 23 23.1 Gradient-base search approximately 46 hours. In
FAUNAE the process of training a supervised
(RANDOM) MOCO V2 67.4 24 15.3 Evolution linear classifier, we set the initial
FAUNAE learning rate as 30 and the weight
(RESNET50) MOCO V2 67.8 24 15.3 Evolution decay 0 as described in [1]. Table 2
FAUNAE shows that FaUNAE outperforms
(PROXYLESSNAS) MOCO V2 68.3 30 15.3 Evolution
ResNet50, ResNet101, ResNet170,

32×112×112
3×224×224
128×28×28
256×14×14
64×56×56
512×7×7
SAConv4 3×3
SAConv2 3×3
SAConv4 5×5
SAConv4 3×3
SAConv2 3×3
SAConv4 3×3
SAConv4 3×3
SAConv4 3×3
SAConv4 3×3
SAConv4 3×3
SAConv4 3×3
SAConv2 3×3
SAConv4 3×3
SAConv2 3×3
SAConv4 3×3
Conv 3×3
Conv 3×3
FIGURE 5 Detailed structures of the best structure discovered on ImageNet. “SAConv2” and “SAConv4” denote split-attention bottleneck convolu-
tion layer with radix of 2 and 4, respectively.
and AMDIM small/large with higher accuracy. Compared with 20]). Next, we perform experiments on the COCO [18] data
AMDIM large, which is by far the best method of manual set and the PASCAL VOC [19] data set, and compared FaUNAE
design that we know of, FaUNAE (ProxylessNAS) achieves with other methods.
not only a better performance (68.1 vs. 68.3), but the
model size was also reduced by about 20 times (626M vs. 1) Results on PASCAL VOC
30M). FaUNAE also performs better than the structure We use Faster R-CNN [22] as the detector and the evalu-
sampled randomly from the search space described in 3.1 ated network obtained on ImageNet as the backbone,
on Top-1 accuracy (68.3 vs. 66.2). When compared with with batch normalization tuned, implemented in [48]. All
other NAS methods which use the same search space, such layers are fine-tuned end-to-end. We fine-tune for 18k
as ProxylessNAS, FaUNAE obtains a better performance iterations (~18 epochs) on PASCAL VOC trainval07+12.
with higher accuracy (67.8 vs. 68.3) and is much faster We evaluate the default VOC metric of AP50 on the VOC
(23.1 vs. 15.3 GPU days). In addition, we calculated the test2007 set.
FLOPs of different models separately. The FLOPs of In Table 3, we can see the results on trainval07+12. FaU-
ResNet50, FaUNAE (ResNet50) and FaUNAE (Proxyless- NAE with ImageNet is better than ResNet50 with the same
NAS) are respectively: 4.11G, 3.87G and 4.08G. When the method (MoCo v2): up to + 0.9 AP50, + 2.3 AP, and + 2.6
FLOPs are approximately the same, FaUNAE (Proxyless- AP75. Also, our FaUNAE is better than the ImageNet super-
NAS) achieves better performance. vised counterparts, which shows the usefulness and effective-
We also set different initial structures a 0 including random ness of unsupervised learning.
structure, ResNet50, and the structure searched by Proxyless-
NAS on ImageNet100. As shown in Table 2, we find that the
better the initial structure, the better the performance, which TABLE 3 Results under object detection on PASCAL VOC
with Faster R-CNN.
shows the importance of the prior knowledge. For the struc-
ture (Fig. 5) obtained by FaUNAE on ImageNet, we find ARCHITECTURE METHOD AP50 AP AP75
that the structure on unsupervised learning prefers a small RESNET50 Super. 81.3 53.5 58.8
kernel size and a split-attention convolution [35], which also RESNET50 MOCO V1 [1] 81.5 55.9 62.6
shows the effectiveness of split-attention convolution and the
RESNET50 MOCO V2 [11] 82.4 57.0 63.6
rationality of FaUNAE. We searched the architecture on Ima-
FAUNAE MOCO V2 83.3 59.3 66.2
geNet for many times, which needs further research in our
future work.
D. Results on Object Detection TABLE 4 Results of object detection and instance segmentation on COCO with
and Segmentation Mask R-CNN. APbb means bounding-box AP and APmk means mask AP.
Learning transferable features is the main ARCHITECTURE METHOD APbb APbb APbb APmk mk
AP50 mk
AP75
50 75
goal of unsupervised learning. When used as
RESNET50 Super. 40.0 59.9 43.1 34.7 56.5 36.9
fine-tuning initialization for object detection
RESNET50 MOCO V1 [1] 40.7 60.5 44.1 35.4 57.3 37.6
and segmentation, ImageNet supervised pre-
training is the most influential (e.g., [21, 22, FAUNAE MOCO V2 43.1 63.0 47.2 37.7 60.2 40.6

2) Results on COCO [13] E. Real, S. Moore, A. Selle, S. Saxena, Y. L. Suematsu, J. Tan, Q. Le, and A. Kurakin,
“Large-scale evolution of image classifiers,” in Proc. ICML, pp. 2902–2911.
We apply Mask R-CNN [25] with the C4 backbone as the [14] H. Liu, K. Simonyan, and Y. Yang, “Darts: Differentiable architecture search,” in
detector. The rest of the settings are the same as those for the Proc. ICLR, 2019.
[15] H. Pham, M. Y. Guan, B. Zoph, Q. V. Le, and J. Dean, “Efficient neural architecture
PASCAL VOC experiment. In Table 4, we can see the results search via parameter sharing,” in Proc. ICML, 2018, pp. 4092–4101.
on the COCO dataset with the C4 backbone. With the 2 # [16] Y. Xu, L. Xie, X. Zhang, X. Chen, G.-J. Qi, Q. Tian, and H. Xiong, “Pc-darts: Par-
tial channel connections for memory-efficient architecture search,” in Proc. ICLR, 2020.
schedule, FaUNAE is better than its ImageNet supervised [17] H. Chen, B. Zhang, S. Xue, X. Gong, H. Liu, R. Ji, and D. Doermann, “Anti-bandit
counterpart on all metrics. Due to the absent result of MoCo neural architecture search for model defense,” 2020, arXiv:2008.00698.
[18] T. Lin, M. Maire, S. J. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C.
v2 [11], we do not compare it with our FaUNAE. We run their L. Zitnick, “Microsoft COCO: Common objects in context,” in Proc. ECCV, Springer-
code for this comparison, which is even worse than v1. Also, Verlag, 2014, vol. 8693, pp. 740–755.
[19] M. Everingham, L. V. Gool, C. K. I. Williams, J. M. Winn, and A. Zisserman, “The
FaUNAE is better than ResNet50 trained with unsupervised pascal visual object classes (VOC) challenge,” IJCV, vol. 88, no. 2, pp. 303–338, 2010. doi:
MoCo v1 [1]. 10.1007/s11263-009-0275-4.
[20] R. B. Girshick, J. Donahue, T. Darrell, and J. Malik, “Rich feature hierarchies for
accurate object detection and semantic segmentation,” in Proc. CVPR, 2014, pp. 580–587.
V. Conclusions [21] R. Girshick, “Fast R-CNN,” in Proc. ICCV, 2015, pp. 1440–1448.
[22] S. Ren, K. He, R. Girshick, and J. Sun, “Faster R-CNN: Towards real-time object
We proposed a fast and unsupervised neural architecture evolu- detection with region proposal networks,” in Proc. NeurIPS, 2015, pp. 91–99.
tion (FaUNAE) to evolve an existing architecture manually [23] J. Redmon, S. K. Divvala, R. B. Girshick, and A. Farhadi, “You only look once:
Unified, real-time object detection,” in Proc. CVPR, Las Vegas, NV, 2016, pp. 779–788.
designed or searched for on one small dataset to a new archi- [24] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. E. Reed, C. Fu, and A. C. Berg,
tecture on another larger dataset. The evolution is self-super- “SSD: Single shot multibox detector,” in Proc. Comput. Vision - ECCV 2016 - 14th Eur.
Conf., Amsterdam, The Netherlands, Springer-Verlag, 2016, vol. 9905, pp. 21–37.
vised, where contrast loss is used as the evaluation metric in the [25] K. He, G. Gkioxari, P. Dollár, and R. Girshick, “Mask R-CNN,” in Proc. ICCV,
student-teacher framework. The evolution process is signifi- 2017, pp. 2980–2988.
[26] X. Wang, T. Kong, C. Shen, Y. Jiang, and L. Li, “SOLO: Segmenting objects by
cantly accelerated by eliminating the inferior or the least prom- locations,” CoRR, vol. abs/1912.04488, 2019.
ising operation. Experimental results demonstrate the [27] X. Wang, R. Zhang, T. Kong, L. Li, and C. Shen, “SOLOv2: Dynamic, faster and
stronger,” 2020, arXiv:2003.10152.
effectiveness of our unsupervised NAE for the downstream [28] D. Floreano, P. Durr, and C. Mattiussi, “Neuroevolution: from architectures to
applications, such as object recognition, object detection, and learning,” Evol. Intell., vol. 1, pp. 47–62, 2008. doi: 10.1007/s12065-007-0002-4.
[29] K. O. Stanley and R. Miikkulainen, “Evolving neural networks through augmenting
instance segmentation. FaUNAE addresses core technical chal- topologies,” MIT Press Journals, 2002.
lenges in improving the performance of unsupervised deep [30] E. Real, A. Aggarwal, Y. Huang, and Q. V. Le, “Regularized evolution for image
classifier architecture search,” in Proc. AAAI, 2019, pp. 4780–4789.
learning to make it more scalable and applicable, which will [31] R. Miikkulainen et al., “Evolving deep neural networks,” 2017, arXiv:1703.00548.
ultimately impact many meaningful applications. [32] H. Liu, K. Simonyan, O. Vinyals, C. Fernando, and K. Kavukcuoglu, “Hierarchical
representations for efficient architecture search,” in Proc. ICLR, 2018.
[33] A. Krizhevsky et al., “Learning multiple layers of features from tiny images,” 2009.
Acknowledgments [34] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,”
in Proc. CVPR, 2016, pp. 770–778.
This study was supported by Grant NO.2019JZZY011101 [35] H. Zhang et al., “Resnest: Split-attention networks,” 2020, arXiv:2004.08955.
from the Key Research and Development Program of Shan- [36] E. Real, A. Aggarwal, Y. Huang, and Q. V. Le, “Regularized evolution for image clas-
sifier architecture search,” in Proc. AAAI, 2019, pp. 4780–4789. Doi: 10.1609/aaai.v33
dong Province to Dianmin Sun. This work was supported in i01.33014780.
part by the National Natural Science Foundation of China [37] R. Hadsell, S. Chopra, and Y. LeCun, “Dimensionality reduction by learning an
invariant mapping,” in Proc. CVPR, 2006, pp. 1735–1742.
under Grant 62076016 and 61876015. [38] Y. Tian, D. Krishnan, and P. Isola, “Contrastive representation distillation,” in Proc.
ICLR, 2020.
[39] A. v. d. Oord, Y. Li, and O. Vinyals, “Representation learning with contrastive pre-
References dictive coding,” 2018, arXiv:1807.03748.
[1] K. He, H. Fan, Y. Wu, S. Xie, and R. Girshick, “Momentum contrast for unsupervised [40] Z. Wu, Y. Xiong, S. X. Yu, and D. Lin, “Unsupervised feature learning via non-
visual representation learning,” in Proc. CVPR, 2020, pp. 9726–9735. parametric instance discrimination,” in Proc. CVPR, 2018, pp. 3733–3742.
[2] B. Zoph and Q. V. Le, “Neural architecture search with reinforcement learning,” in [41] X. Wang and A. Gupta, “Unsupervised learning of visual representations using vid-
Proc. ICLR, Toulon, France, Apr. 24–26, 2017. eos,” in Proc. CVPR, 2015, pp. 2794–2802.
[3] B. Zoph, V. Vasudevan, J. Shlens, and Q. V. Le, “Learning transferable architectures [42] R. D. Hjelm, A. Fedorov, S. Lavoie-Marchildon, K. Grewal, P. Bachman, A.
for scalable image recognition,” in Proc. CVPR, 2018, pp. 8697–8710. Trischler, and Y. Bengio, “Learning deep representations by mutual information estima-
[4] H. Cai, L. Zhu, and S. Han, “ProxylessNAS: Direct neural architecture search on tion and maximization,” in Proc. ICLR, 2019.
target task and hardware,” in Proc. ICLR, 2019. [43] A. Tarvainen and H. Valpola, “Mean teachers are better role models: Weight-aver-
[5] H. Cai, T. Chen, W. Zhang, Y. Yu, and J. Wang, “Efficient architecture search by aged consistency targets improve semi-supervised deep learning results,” in Proc. NIPS,
network transformation,” in Proc. AAAI, 2018, pp. 2787–2794. 2017.
[6] C. Liu, P. Dollár, K. He, R. Girshick, A. Yuille, and S. Xie, “Are labels necessary for [44] P. Bachman, R. D. Hjelm, and W. Buchwalter, “Learning representations by maxi-
neural architecture search?” 2020, arXiv:2003.12056. mizing mutual information across views,” in Proc. NeurIPS, 2019, pp. 15,509–15,519.
[7] C. Doersch, A. Gupta, and A. A. Efros, “Unsupervised visual representation learning [45] C. Zhuang, A. L. Zhai, and D. Yamins, “Local aggregation for unsupervised learning
by context prediction,” in Proc. ICCV, 2015, pp. 1422–1430. of visual embeddings,” in Proc. ICCV, 2019, pp. 6001–6011.
[8] S. Gidaris, P. Singh, and N. Komodakis, “Unsupervised representation learning by [46] O. J. Hénaff, A. Srinivas, J. De Fauw, A. Razavi, C. Doersch, S. Eslami, and A. v. d.
predicting image rotations,” in Proc. ICLR, 2018. Oord, “Data-efficient image recognition with contrastive predictive coding,” 2019,
[9] A. Kolesnikov, X. Zhai, and L. Beyer, “Revisiting self-supervised visual representa- arXiv:1905.09272.
tion learning,” in Proc. CVPR. [47] Y. Tian, D. Krishnan, and P. Isola, “Contrastive multiview coding,” 2019, arX-
[10] R. Hadsell, S. Chopra, and Y. LeCun, “Dimensionality reduction by learning an iv:1906.05849.
invariant mapping,” in Proc. CVPR, 2006. [48] Y. Wu, A. Kirillov, F. Massa, W.-Y. Lo, and R. Girshick, “Detectron2.” 2019.
[11] X. Chen, H. Fan, R. Girshick, and K. He, “Improved baselines with momentum https://github.com/facebookresearch/detectron2
contrastive learning,” 2020, arXiv:2003.04297.
[12] T. Chen, S. Kornblith, M. Norouzi, and G. Hinton, “A simple framework for con-
trastive learning of visual representations,” 2020, arXiv:2002.05709.

Self-Supervised Representation
Learning for Evolutionary
Neural Architecture Search
©SHUTTERSTOCK.COM/YURCHANKA SIARHEI
Abstract—Recently proposed neural architecture search (NAS)

Chen Wei algorithms adopt neural predictors to accelerate architecture
Xidian University and Xi’an University of search. The capability of neural predictors to accurately predict
Posts & Telecommunications, CHINA the performance metrics of the neural architecture is critical to
Yiping Tang NAS, but obtaining training datasets for neural predictors is often
Xidian University, CHINA time-consuming. How to obtain a neural predictor with high
prediction accuracy using a small amount of training data is a
Chuang Niu central problem to neural predictor-based NAS. Here, a new
Rensselaer Polytechnic Institute, USA architecture encoding scheme is first devised to calculate the
Haihong Hu,Yue Wang, and Jimin Liang graph edit distance of neural architectures, which overcomes the
Xidian University, CHINA drawbacks of existing vector-based architecture encod-
ing schemes. To enhance the predictive performance of neural
Date of current version: 15 July 2021 Corresponding Author: Jimin Liang (email: jimleung@mail.xidian.edu.cn).
1556-603X/21©2021IEEE AUGUST 2021 | IEEE COMPUTATIONAL INTELLIGENCE MAGAZINE 33

predictors, two self-supervised learning methods are proposed to trained neural predictor producing a less effective representation
pre-train the architecture embedding part of n eural predictors to of the input neural architecture. Discriminative unsupervised rep-
generate a meaningful representation of neural architectures. resentation learning, also known as self-supervised learning,
The first method designs a graph neural network-based model requires designing a pretext task [15], [16] from an unlabeled
with two independent branches and utilizes the graph edit dis- dataset and using it as the supervision to learn meaningful feature
tance of two different neural architectures as a supervision to representation. Inspired by previous findings that “close by” archi-
force the model to generate meaningful architecture representa- tectures tend to have similar performance metrics [17], [18], this
tions. Inspired by contrastive learning, the second method pres- paper uses the graph edit distance (GED) as supervision for self-
ents a new contrastive learning algorithm that utilizes a central supervised learning, because GED can reflect the distance of dif-
feature vector as a proxy to contrast positive pairs against nega- ferent neural architectures in the search space. Commonly used
tive pairs. Experimental results illustrate that the pre-trained GED is computed based on the graph encoding of two different
neural predictors can achieve comparable or superior perfor- neural architectures (adjacency matrices and node operations)
mance compared with their supervised counterparts using only [17], but this scheme cannot identify isomorphic graphs. Path-
half of the training samples. The effectiveness of the proposed based encoding [3] is another commonly used neural architecture
methods is further validated by integrating the pre-trained neu- encoding scheme, but it cannot recognize the position of each
ral predictors into a neural predictor guided evolutionary neural operation in the neural architecture, e.g., two different operations
architecture search (NPENAS) algorithm, which achieves state- in a neural architecture may have the same path-encoding vectors.
of-the-art p erformance on NASBench-101, NASBench-201, To overcome the above drawbacks, this paper proposes a new
and DARTS benchmarks. neural architecture encoding scheme, a position-aware path-based
encoding, which can recognize the positions of different opera-
N
I. Introduction tions in the neural architecture and efficiently identify unique
eural architecture search (NAS) refers to the use of cer- neural architectures.
tain search strategies to find the best performing neural Since different pretext tasks may lead to different feature rep-
architecture in a pre-defined search space with minimal resentations, two self-supervised learning methods are proposed
search costs [1]. The search strategies sample potentially from two different perspectives to improve the feature representa-
promising neural architectures from the search space and perfor- tion of neural architectures, and to investigate the effect of differ-
mance metrics of the sampled architectures obtained from time- ent pretext tasks on the predictive performance of neural
consuming training and validation procedures, which are used to predictors. The first method utilizes a handcrafted pretext task,
optimize the search strategies.To alleviate the time cost of training while the second one learns feature representation by contrasting
and validation procedures, some recently proposed NAS search positive pairs against negative pairs.
strategies employed neural predictors to accelerate the perfor- The pretext task of the first self-supervised learning method
mance estimation of the sampled architectures [2]–[6]. The capa- is to predict the normalized GED of two different neural archi-
bility of neural predictors to accurately predict the performance tectures in the search space. A graph neural network-based
of the sampled architectures is critical to downstream search strat- model with two independent identical branches is devised and
egies [2], [5]–[8]. Because of the significant time cost of obtaining the concatenation of the output features from both branches is
labeled training samples, acquiring accurate neural predictors used to predict the normalized GED. After the self-supervised
using a fewer number of training samples is one of the key issues pre-training, only one branch of the model is adopted to build
in NAS methods employing neural predictors. the neural predictor. This method is termed as self-supervised
Self-supervised representation learning, a type of unsupervised regression learning.
representation learning, has been successfully applied in areas such The second self-supervised learning method is inspired by the
as image classification [9], [10] and natural language processing prevalently contrastive learning for image classification [9], [10],
[11]. If a model is pre-trained by self-supervised representation [12], which maximizes the agreement between differently aug-
learning and then fine-tuned by supervised learning using a few mented views of the same image via a contrastive loss in latent
labeled training data, then it is highly likely to outperform its space [9]. Since there is no guarantee that a neural architecture
supervised counterparts [9], [10], [12]. In this paper, self-super- and its transformed form will have the same performance met-
vised representation learning is investigated and applied to the rics, it is not reasonable to directly apply contrastive learning to
NAS domain to enhance the performance of neural predictors NAS. This paper proposes a new contrastive learning algorithm,
built from graph neural networks [13] and employed in the termed self-supervised central contrastive learning, that uses the
downstream evolutionary search strategy. feature vector of a neural architecture and its nearby neural
Effective unsupervised representation learning falls into one of architectures’ feature vectors (with small GEDs) to build a cen-
two categories: generative or discriminative [9]. Existing unsuper- tral feature vector. Then, the contrastive loss is utilized to tightly
vised representation learning methods for NAS [8], [14] belong to aggregate the feature vectors of the architecture and its nearby
the generative category. Their learning objective is to compel the architectures onto the central feature vector and push the fea-
neural predictor to correctly reconstruct the input neural archi- ture vectors of other neural architectures away from the central
tecture, but it has limited relevance to NAS.This may result in the feature vector.

After self-supervised pre-training, two neural predictors are proxy dataset and proxy architecture, early stopping, inheriting
built by connecting a fully connected layer to the architecture weights from a trained architecture, and weight sharing [1]. A
embedding modules of the pre-trained models. Finally, the pre- neural predictor that is employed to estimate the performance
trained neural predictors are respectively integrated into a neural metrics of the neural network architectures can also be recog-
predictor guided evolutionary neural architecture search (NPE- nized as a kind of performance estimation strategy. Recently,
NAS) algorithm [6] to verify their performance. many search strategies have adopted neural predictors to explore
Our main contributions can be summarized as follows. the search space [2]–[4], [6]–[8], [34], [35].The capability of neural
❏❏ A new position-aware path-based neural architecture encod- predictors to accurately predict the performance of neural archi-
ing scheme is devised to overcome the drawbacks of adjacen- tectures is critical to the search strategies using neural predictors.
cy matrix encoding and path-based encoding methods. The The neural predictors are trained on a dataset consisting of a
experimental results illustrate its superiority in identifying number of neural architectures, along with corresponding perfor-
unique neural architectures. mance metrics, which are obtained through time-consuming
❏❏ A self-supervised regression learning method is proposed, training and validation procedures. Wen et al. [7] designed an
which defines a pretext task to predict the normalized GED architecture encoding method and proposed a neural predictor
of two different neural architectures and design a graph neural composed of Graph Convolutional Networks (GCNs). BRP-
network-based model with two independent identical NAS [35] adopts GCNs to construct a latency neural predictor
branches to learn meaningful representation of neural archi- and employs transfer learning to transfer knowledge from the
tectures. The neural predictor pre-trained by this method trained latency predictor to a binary relation predictor, which is
achieves its performance upper bound with a small search used to rank neural architectures. GATES [5] views the embed-
budget and a few training epochs. In the best case, it can ding of neural architectures as the information flow in the archi-
achieve better performance only using half of the search bud- tectures and presents two different encoding processes: operation
get compared to its supervised counterparts. on node and operation on edge. Each operation node employs a
❏❏ A self-supervised central contrastive learning algorithm is pro- soft attention mask to enhance the input features, and the output
posed, which forces neural architectures with small GED to of the output node is used as the embedding of the neural archi-
lean closer together in the feature space, while neural architec- tecture. SemiNAS [34] presents a semi-supervised iterative train-
tures with large GED are divided further apart.The pre-trained ing scheme to reduce the number of architecture-accuracy data
neural predictor fine-tuned with a quarter of the search budget pairs required to train a high-performance neural predictor. Semi-
can achieve comparable performance to its supervised counter- NAS first trains the neural predictor with a small number of
parts; with the same search budget, the fine-tuned neural pre- architecture-accuracy data pairs, then utilizes the trained neural
dictor outperforms its supervised counterparts by about 1.5 predictor to predict the performance of a large number of archi-
times. The proposed central contrastive learning algorithm can tectures, and finally adds the generated pseudo data pairs to the
also be extended to the domain of graph unsupervised repre- original training set to update the neural predictor.
sentation learning without any modifications. In this paper, self-supervised representation learning is applied
❏❏ Incorporating the pre-trained neural predictors, NPENAS to the NAS domain. Two self-supervised representation learning
achieves state-of-the-art performance on NASBench-101 [17], methods are proposed to improve the feature representation of
NASBench-201 [19], and DARTS [20] benchmarks. On neural predictors, thus enhancing the prediction performance of
NASBench-101 and NASBench-201 search space, the neural predictors.
searched neural architectures even achieve comparable or equal
results to the ORACLE baseline (performance upper bound). B. Neural Architecture Encoding Scheme
Neural architecture is usually defined as a direct acyclic graph
II. Related Work (DAG).The adjacency matrix of the graph is used to represent the
connections of operations, and the nodes are used to represent the
A. Neural Architecture Search operations. The commonly used neural architecture encoding
Due to the huge size of the pre-defined search space, NAS usually schemes can be categorized into the vector encoding scheme and
searches for potential superiority neural network architectures by graph encoding scheme.
utilizing a search strategy. Reinforcement learning (RL) [2], [21]– Adjacency matrix encoding [4], [29], [36] and path-based
[23], evolutionary algorithms [6], [24]–[28], gradient-based meth- encoding [3] are two frequently used vector encoding schemes.
ods [20], [29]–[32], and Bayesian optimization (BO) [3], [6], [33] The adjacency matrix encoding is the concatenation of the flat-
are the commonly used search strategies. A search strategy adjusts tened adjacency matrix and the one-hot encoding vector of each
itself by exploiting the selected neural architectures’ performance node, but it cannot identify the isomorphic graphs [8]. Path-based
metrics to explore the search space better. encoding is the encoding of the input-to-output paths of the
As it is time-consuming to estimate the performance metrics neural architecture, but as demonstrated in the Supplementary
of a given neural architecture through training and validation pro- Materials1, this scheme cannot recognize the position of
cedures, many performance estimation strategies are proposed to
speed up this task. Commonly used strategies include using a 1
The supplementary material is available at https://github.com/auroua/SSNENAS.

operations in the neural architecture.The graph encoding scheme self-supervised central contrastive learning, to learn meaningful
represents the neural architecture by its adjacency matrix and the representation of neural architectures. To the best of our knowl-
one-hot encoding of each node. edge, this is the first study applying contrastive learning to the
In this paper, a new vector encoding scheme denoted as posi- NAS domain.
tion-aware path-based encoding is proposed. This encoding
scheme can identify different graphs more efficiently and recog- III. Methodology
nize the position of operations in the neural architecture. A graph To enhance the prediction performance of neural predictors, two
neural network is adopted to embed the neural architectures into self-supervised representation learning methods are proposed to
feature space. Since the graph encoding scheme cannot identify improve the feature representation ability of the neural predictors.
isomorphic graphs [8], position-aware path-based encoding is A new neural architecture encoding scheme is designed to calcu-
used to filter out isomorphic graphs. It is also utilized to calculate late the GED of graphs in Section III-B. The self-supervised
the GED of different neural architectures. regression learning that utilizes a carefully designed model with
two independent identical graph neural network branches to pre-
C. Unsupervised Representation Learning for NAS dict the GED of neural architectures is discussed in Section III-C.
Unsupervised representation learning methods fall into two cate- The self-supervised central contrastive learning is introduced in
gories: generative and discriminative [9]. The learning objective Section III-D. The utilization of the pre-trained neural predictors
of existing generative unsupervised learning methods for NAS, for the downstream search strategies is elaborated in Section III-E.
arc2vec [8], and NASGEM [14] is to reconstruct the input neu-
ral architectures using an encoder-decoder network, which has A. Problem Formulation
little relevance to NAS. Moreover, arc2vec [8] adopts the varia- In a pre-defined search space S, a neural architecture s can be rep-
tional autoencoder [37] to embed the input neural architectures resented as a DAG
into a high dimensional continuous feature space, and the feature
space is assumed to follow Gaussian distribution. Since there is s = ^V, E h, s ! S,(1)
no guarantee that the real underlying distribution of the feature
space is Gaussian, this assumption may harm the representation where V = " v i ,i = 1: H is the set of nodes representing operations
of neural architectures. NASGEM [14] adds a similarity loss to in s, E = " v i, v j ,i, j = 1: H is the set of edges describing the connec-
improve the feature representation. However, the similarity loss tion of operations, and H is the number of nodes.
only considers the adjacency matrix of the input neural architec- The prediction process of neural predictor can be formu-
ture and ignores the node operations, resulting in the failure to lated as
identify isomorphic graphs.
In this paper, two self-supervised learning methods are pro- yt = f ^ s h,(2)
posed for NAS. The first one is inspired by unsupervised graph
representation learning. GMNs [38] adopts the graph neural net- where f is the neural predictor, and it takes a neural architecture s
work as building blocks and presents a cross-graph attention- as input and predicts the performance metric yt of s.
based mechanism to predict the similarity of the two input
graphs. SimGNN [39] takes two graphs as input, embeds the B. Position-Aware Path-Based Encoding
graph and each node of the graph into the feature space using a Since the proposed self-supervised learning methods utilize GED
graph convolutional neural network, and then uses graph feature to measure the similarity of different neural architectures, it is crit-
similarity and node feature similarity to predict the similarity of ical to calculate GED effectively. This paper presents a new vector
the input graphs. UGRAPHEMB [40] takes two graphs as input, encoding scheme, position-aware path-based encoding, which
adopts the graph isomorphism network (GIN) [41] to embed the improves path-based encoding [3] by recording the position of
input graphs into feature space, and utilizes a multi-scale node each operation in the path.The scheme consists of two steps: gen-
attention mechanism to predict the similarity of the input graphs. erating the position-aware path-based encoding vectors for the
Our work is similar to UGRAPHEMB, but it designs a new input-to-output paths of the neural architecture, and concatenat-
neural network model without using complex multi-scale node ing the vectors of all of the paths.
attention and applies unsupervised learning to the field of neural As shown in Eq. 1, a neural architecture can be defined by
architecture representation learning. DAG with its nodes representing the operations in the neural
The second method is inspired by contrastive learning for architecture. DAG consists of an input node, some operation
image classification. The contrastive learning used in image classi- nodes, and an output node connected in sequence.The adjacency
fication forces the image and its transformations to be similar in matrix of DAG is used to represent the connections of the differ-
the feature space [9], [10], [42]. Since there is no guarantee that a ent nodes. Since each node in DAG has a fixed position, each
neural architecture and its transformed form will have the same node is assigned with a unique index, which implies that each
performance metrics, it is not reasonable to directly apply contras- operation associated with the node has a unique index.
tive learning for image classification to the NAS domain. This NASBench-101 [17] is a widely used NAS search space. It
paper proposes a new contrastive learning algorithm, contains three different operations: convolution 3 # 3,

convolution 1 # 1, and max-pooling 3 # 3. Figure 1a illustrates a 3) Finally, concatenate the sorted input-to-output paths’ posi-
neural architecture in NASBench-101 search space and uses tion-aware path-based encoding vectors.
green and red lines to indicate two different input-to-output The vector length of path-based encoding [3] increases expo-
paths. The operations and their corresponding indices of the nentially with the number of operation nodes, whereas the vector
neural architecture are shown in Figure 1b. Figure 1c demon- length of the position-aware path-based encoding increases lin-
strates two input-to-output paths of the neural architecture in early with the number of input-to-output paths. Therefore, the
Figure 1a. Unlike path-based encoding [3], when all the input- position-aware path-based encoding is a more efficient vector
to-output paths are extracted from the neural architecture, the encoding scheme than path-based encoding. As the number of
index of operations in the input-to-output paths is also record- input-to-output paths may be different in different neural archi-
ed.The position-aware path-based encoding of the two different tectures, the short vectors will be padded with zeros to keep all
paths in Figure 1c is indicated in Figure 1d.The vector length of vectors the same length.
each input-to-output path is fixed, which equals the multiplica- Since only the structural information of neural architectures is
tion of the number of operation nodes and the number of considered, the position-aware path-based encoding scheme does
operation types. All the operation nodes are traversed in the not cover other important properties of neural architectures such
neural architecture. If an operation node appears in the input- as input resolution and kernel numbers. However, the position-
to-output path, then the operation is represented by its one-hot aware path-based encoding scheme is flexible and scalable. With
operation type vector; otherwise, it is represented by a zero vec- some slight modifications, the current encoding scheme can
tor. Since there are three different operations in NAS- accommodate other neural architecture settings. A modified posi-
Bench-101, the length of the one-hot operation vector and the tion-aware path-based encoding scheme, including the neural
zero vector is three. architecture’s input resolution and kernel numbers, is presented in
The final encoding vector is the concatenation of the posi- the Supplementary Materials.
tion-aware path-based encoding vectors for all the input-to-out-
put paths in the neural architecture. In order to maintain the C. Self-Supervised Regression Learning
consistency of the connection, the following operations are per- The pretext task of the proposed self-supervised regression learn-
formed sequentially: ing is to predict the normalized GED of the two input neural
1) Firstly, sort all the input-to-output paths in ascending order architectures. GED is defined as
by path length.
K
2) Secondly, sort all the input-to-output paths of the same path
GED ^s i, s j h = / p ki - p kj , s i, s j ! S,(3)
length in ascending order by the operation index. k=i
[Input, Conv 3 × 3, Conv 3 × 3,Conv 1 × 1, Max-Pool, Output] Input-to-Output Path 1 Encoding

Output 3×3
Node [1,0,0] [0,0,0] [0,1,0] [0,0,0]
Input-to-Output Path 2 Encoding
[Input, 1, 2, 3, 4, Output] [1,0,0] [1,0,0] [0,1,0] [0,0,1]
Max-Pool
Conv 3×3 (b) (d)
1×1
Input Conv Conv Output

Conv Node 3×3 1×1 Node
3×3
Input-to-Output
Path 1 1 3
Conv
3×3
Input Conv Conv Conv Max-Pool Output
Node 3×3 3×3 1×1 3×3 Node
Input Input-to-Output
Node Path 2 1 2 3 4
(a) (c)
FIGURE 1 Overview of the position-aware path-based encoding. (a) A neural architecture in NASBench-101 search space. The green and red
lines indicate two input-to-output paths. (b) Operations and their corresponding unique indices. (c) Two different input-to-output paths and
their operation indices. (d) Position-aware path-based encoding of the two input-to-output paths in (c).

where p i and p j are the position-aware path-based encoding After the self-supervised pre-training, any branch of frl can
vectors of architecture s i and s j, p ki and p kj are the kth elements be selected to embed the neural architectures into the feature
of the position-aware path-based encoding vector p i and p j, and space. A neural predictor is constructed by connecting a fully
K is the vector length. connected layer to the architecture embedding module (as
Following [39], the normalized GED is defined as illustrated in the red rectangle of Figure 2) of the pre-trained
GED ^s i, s j h
models. Regression loss is employed to fine-tune the neural
nGED ^s i, s j h = exp -dist, where dist = ,(4) predictor. The parameters of the neural predictor, denoted as w,
V
are optimized as
where V is the number of nodes in the neural architectures
and exp represents the exponential function with base e. The w ) = argmin
w
/ ^ f^s ih - y ih2,(6)
si ! S
performance of normalized GED and non-normalized GED
are compared on NASBench-201 and the result shows that where y i is the performance metric of s i.
the normalized GED performs slightly better than non-nor- The structure of self-supervised regression learning method is
malized GED. similar to the binary relation predictor proposed in BRP-NAS
As the architecture in search space is represented as a DAG, it [35]. However, the binary relation predictor requires the perfor-
is straightforward to adopt graph neural networks to aggregate mance metrics of neural architectures to learn how to rank
features for each node and generate the graph embedding by architectures, and the performance metrics are very time-con-
averaging the nodes’ features. In this paper, both the self-super- suming to obtain.The self-supervised regression learning method
vised models and neural predictors utilize the spatial-based graph uses GED as supervision to learn a meaningful representation of
neural network GIN layers. the neural architectures, and the time-cost of computing GED
Since the pretext task is to predict the normalized GED of can be neglected.
two different neural architectures, a regression model frl consist-
ing of two independent identical graph neural network branches D. Self-Supervised Central Contrastive Learning
is designed, as shown in Figure 2. Each branch is composed of This paper proposes a central contrastive learning algorithm to
three sequentially connected GIN layers and a global mean pool- force neural architectures with a small GED to lean closer togeth-
ing (GMP) layer. The GMP layer outputs the mean of the node er in the feature space, while neural architectures with a large
features of the last GIN layer.The outputs of the two branches are GED are divided further apart.
concatenated, and then sent to two sequentially connected fully As illustrated in Figure 3, a graph neural network model fccl is
connected layers to predict the two input architectures’ normal- developed to embed the neural architecture into feature space.
ized GED. The regression loss function to optimize the parame- Following SimCLR [9], fccl consists of a neural architecture
ters w rl of frl is formulated as embedding module and two fully connected layers. For a fair
comparison, the architecture embedding module is identical to
w )rl = argmin / ^ frl ^s i, s j h - nGED ^s i, s j hh2.(5) that of frl .
Given a batch of neural architectures S b = " s k ,Nk = 1 and a neu-
w rl (s i , s j ) ! S
ral architecture s i ! S b, the minimum GED of s i from all other

architectures is denoted as g min . The neural architectures whose
Architecture GED from si is equal to g min, together with s i, constitute the
Embedding
positive sample set S pos of s i . The set of neural architectures in
Performance
GMP
GIN
GIN
GIN
FC
Input Predictor this batch but not in S pos is denoted as S neg . The model fccl is
used to embed all of the neural architectures, and the feature vec-
pred
FC
FC
tor sets of S pos and S neg are denoted as E pos and E neg, respective-
ly. A central vector e c is calculated as the average of all of the
GMP
GIN
GIN
GIN
feature vectors in E pos . The contrastive loss is used to aggregate all

FC
Input
Architecture of the feature vectors in E pos to the central vector e c and push
Embedding the feature vectors in E neg far away from e c . The detailed proce-
dure of central contrastive learning is summarized in Algorithm 1.
FIGURE 2 Structure of the regression model frl. An example of the central contrastive learning is illustrated in the
Supplementary Materials.
To reduce the interaction between the central vectors, a cen-
Architecture Embedding tral vector regularization term is added to the loss function that
forces each pair of the central vectors to be orthogonal. The cen-
Embedding
GMP
GIN
GIN
GIN
tral vector regularization term is defined as

FC
FC
Input Feature
M M
L reg = 1 / / 1 [ j ! i] e i< e j,(7)
FIGURE 3 Structure of the feature embedding model fccl. 2 i=0 j=0

where M is the number of central vectors, and e i and e j are the paper, the pre-trained neural predictors are integrated respectively
central vectors. with NPENAS to illustrate the performance gains that result
After the self-supervised central contrastive pre-training, a from applying self-supervised learning to NAS.
fully connected layer is appended to the architecture embed- Since our experiments demonstrate that the neural predictor
ding module to predict the input neural architecture’s perfor- built from a self-supervised pre-trained model can significantly
mance. The regression loss in Eq. 6 is adopted to fine-tune the outperform its supervised counterpart and achieve comparable
neural predictor. performance with smaller training datasets, the NPENAS method
is modified to utilize only a fixed search budget to carry out the
E. Fixed Budget NPENAS neural architecture search. The fixed budget NPENAS is summa-
NPENAS [6] combines the evolutionary search strategy with a rized in Algorithm 2, which is modified from NPENAS and only
neural predictor and utilizes the neural predictor to guide the the differences are presented.
evolutionary search strategy to explore the search space. In this
IV. Experiments and Analysis
In this section, experiments are conducted to illustrate that the
performance of the proposed neural predictors can be significant-
Algorithm 1 Central Contrastive Learning. ly improved by self-supervised learning. The benefit of self-super-
1: Input: batch size N, number of training architectures M and vised learning for NAS is also demonstrated by integrating the
M # N, temperature x, regularization weight m, model fccl . pre-trained neural predictors with NPENAS.
2: for sampled minibatch S b = " s k ,Nk = 1 do All experiments are implemented in Pytorch [43]. GIN is
3: E c = Q implemented using the publicly available graph neural network
4: for all t ! " 1, f, M , do library pytorch_geometric [44].The code of this paper is provided
5: Randomly draw one neural architecture s i ! S b . in reference [45].
6: g min = min GED (s i, s j) where j = " 1, f, i -1, i +1, f, N ,
# GED: Eq. 3 A. Benchmark Datasets
7: E pos = Q and E neg = Q All experiments are performed on NASBench-101 [17], NAS-
8: for all j ! " 1, f, i - 1, i + 1, N , do Bench-201 [19], and DARTS [32] search spaces.
9: e j = fccl (s j)
10: if GED (s i, s j) == g min then a) NASBench-101
11: E pos ! E pos , e j NASBench-101 [17] contains 423k neural architectures, and each
12: else architecture is trained three times on the CIFAR-10 [46] training
13: E neg ! E neg , e j datasets independently.The structure of the neural architectures, as
14: end well as their validation accuracies, and test accuracies correspond-
15: end for ing to the three independent trainings on CIFAR-10 are report-
16: E pos ! E pos , e i ed. The architecture in this search space is defined by DAG, utilizing
nodes to represent the operations of the neural architecture and
17: e c = 1 ) / e # vector average
E pos e ! Epos
18: E c ! E c , e c
19: for all idx, e p ! E pos do # idx is the index of e p in Epos
Algorithm 2 Fixed Budget NPENAS.
20: E pair = Q
21: sim p, c = e <p e c / ^x e p e c h Input: initial population size n 0 , initial population D = " (s i, y i),
1:
22: E pair ! E pair , sim p, c i = 1, 2, f, n 0 ,, neural predictor f, number of the total
23: for all e n ! E neg do searched architectures total_num, number of the evaluated
24: sim n, c = e <n e c / ^x e n ec h architectures (budget) to fine-tune neural predictor ft_num.
25: E pair ! E pair , sim n,c for n from n 0 to total_num do
2:
26: end for 3: if n # ft_num then
exp (sim p, c)
27: l t, idx = - log 4: Initialize the weights of neural predictor f with the
/ exp (sim vec)
weights from the pre-trained model.
sim vec ! E pair
28: end for 5: Fine-tune the neural predictor f with dataset D = " (s i, y i),
29: l t = / l t, idx i = 1, 2, f, n , .
idx
6: end
30: end for
M 7: Utilize the neural predictor f to guide the evolutionary neu-
31: L = 1 / l t + m L reg (E c) # Lreg: Eq. 7
M t=1 ral architecture search. # Detailed code can be found in
32: Update model fccl to minimize L. Algorithm 2 of NPENAS [6].
33: end for end for
8:
34: return model fccl Output: s ) = argmin ( y i), (s i, y i) ! D
9:

using the adjacency matrix to represent the connection of differ- unique topology structures [17], [19], respectively. All neural
ent operations. Only convolution 1 # 1, convolution 3 # 3, and architectures are encoded separately using the path-based
max-pooling 3 # 3 are allowed to be used to build the neural encoding and position-aware path-based encoding schemes,
architectures. The best architecture achieves a mean test error of and then the numbers of unique neural architecture encodings
5.68%, and the mean test error of the architecture with the best are counted respectively.
validation error is 5.77%. As shown in Table I, the position-aware path-based encoding
can identify all the unique neural architectures in the NAS-
b) NASBench-201 Bench-101 search space, while path-based encoding maps many
NASBench-201 [19] is a recently proposed NAS benchmark, and unique neural architectures to the same encodings. In the NAS-
it contains 15.6k trained architectures for image classification. Bench-201 search space, both encoding schemes fail to identify
Each architecture is trained once on the CIFAR-10, CIFAR-100 the 6466 unique neural architectures. The failure of the position-
[46], and ImageNet-16-120, and ImageNet-16-120 is a down- aware path-based encoding scheme is attributed to the presence
sampled variant of ImageNet [47].The structure of each architec- of skip connection operations and zeroize operations in the
ture and its evaluation details such as training error, validation NASBench-201 search space. This is a limitation of the proposed
error, and test error of each architecture are reported. Each archi- encoding scheme.
tecture is defined by a DAG, utilizing nodes to represent the fea- Since the vector length of position-aware path-based encod-
ture maps and using the edges to represent the operation. The ing only increases linearly with the number of input-to-output
convolution 1 # 1, convolution 3 # 3, average pooling 3 # 3, paths, it is a more efficient encoding method than path-based
skip connection, and zeroize operation are allowed to be used to encoding. For the NASBench-101, NASBench-201, and
construct the neural architectures. On CIFAR-10, CIFAR-100, DARTS search spaces, the encoding vector lengths of the posi-
and ImageNet-16-120, the best test percentage errors are 8.48%, tion-aware path-based encoding scheme are 120, 96 and 1224,
26.49%, and 52.69%, respectively. On CIFAR-10, CIFAR-100, respectively, while those of the path-based encoding are 364, 5461
and ImageNet-16-120, architectures with the best validation and 18724, respectively.
error achieve the test percentage errors of 8.91%, 26.49%, and
53.8%, respectively. C. Prediction Analysis
c) DARTS 1) Model Details

Unlike the two search spaces above, DARTS [32] is a more realis- This paper first utilizes self-supervised regression learning to train
tic-sized search space that contains 10 18 neural architectures. The the model frl in Figure 2 and self-supervised central contrastive
architectures in DARTS are not evaluated by the training and val- learning to train the model fccl in Figure 3. The architecture
idation procedures. The optimal normal and reduction cells need embedding module consists of three sequentially connected
to be found in the search space, and the final neural network is GIN layers. The hidden layer size of the GIN layer is 32, and
composed of sequentially connected normal and reduction cells. each GIN layer is followed with a batch normalization and a
Each cell has two inputs and four nodes, each with two opera- ReLU layer. The hidden dimension size of the fully connected
tions. Each node can use the following operations: 3 # 3 and layer of the model frl and fccl is 16 and 8, respectively. After self-
5 # 5 separable convolutions, 3 # 3 and 5 # 5 dilated separable supervised pre-training, the neural predictors are constructed by
convolutions, 3 # 3 max pooling, 3 # 3 average pooling, identity, connecting the pre-trained architecture embedding modules
and zero. with a single fully connected layer with the hidden dimension
size of 8. The neural predictors constructed by the architecture
B. Position-Aware Path-Based Encoding Analysis embedding modules of frl and fccl are denoted as SS-RL and
In this section, the improvement and limitation of the posi- SS-CCL, respectively.
tion-aware path-based encoding scheme are analyzed. Experi- The same neural architecture encoding method as NPE-
ments on NASBench-101 and NASBench-201 search spaces NAS is employed [6]. The architecture in NASBench-101 is
are conducted to confirm the proposed encoding scheme’s represented by a 7 # 7 upper triangle adjacency matrix and a
ability to uniquely identify neural architectures. NAS- collection of 6-dimensional one-hot encoded node features,
Bench-101 and NASBench-201 contain 423k and 6466 and that in the NASBench-201 is represented by an 8 # 8
upper triangle matrix and several 8-dimensional one-hot
encoded node features.
TABLE I Unique encoding vectors of different encoding
methods.
2) Training Details
PATH- POSITION-AWARE UNIQUE Self-supervised regression learning requires the construction of
SEARCH SPACE BASED PATH-BASED ARCHS* paired training samples from the training data, and the number of
NASBENCH-101 170K 423K 423K paired training samples increases by the square of the number of
NASBENCH-201 8046 9741 6466 training data. When the number of neural architectures in the
*ARCHITECTURES is shortened as ARCHS. search space is large, it is impractical to train the self-supervised

regression learning model using all paired training samples within results are averaged over 40 independent runnings using differ-
a reasonable time cost. For example, on NASBench-101, it costs ent random seeds.
about 1.7 minutes to train the self-supervised regression learning
model if 380k (the number equals to 90% of the total number of 4) Results
architectures) of the paired training samples are used. If the model The predictive performance measurements of the neural predic-
is trained with all the paired samples ^^1 + 423kh # 423k # 0.5h tors on NASBench-101 and NASBench-201 are shown in Fig-
for 300 epochs, it will take about 252 years. Therefore, in our ure 4 and Figure 5, respectively.
experiments, a fixed number of paired training samples are ran- On the NASBench-101 search space, SS-RL achieves its best
domly selected to pre-train the model frl . The number is set to performance with fewer training epochs and gradually decreases
90% of the total number of the neural architectures in the search with more training epochs, and finally drops to a range of 0.2 to
space, i.e., 380k for NASBench-101 and 13.7k for NAS- 0.3. The supervised neural predictor performs significantly better
Bench-201. The training epoch is 300 and the batch size is 64. than SS-RL when the training epoch is above 150 and the
The Adam optimizer [48] is employed to optimize the parameters search budget is more than 100. The predictive performance of
of the model frl, the initial learning rate is 5e - 4, and the SS-CCL is significantly better than that of SS-RL and the super-
weight decay is 1e - 4. A cosine learning rate schedule [49] vised neural predictor. It increases with the number of training
without restart is adopted to anneal down the learning rate to epochs, and approaches saturation when the search budget
zero. On NASBench-201, the initial learning rate is 1e - 4 and it exceeds 150. In addition, SS-CCL achieves better performance
is trained for 1000 epochs. Other training details are the same as than the supervised neural predictor while using half of the
NASBench-101. number of training neural architectures. In extreme cases,
The self-supervised central contrastive learning utilizes all of SS-CCL achieves comparable performance to the supervised
the architectures in NASBench-101 to pre-train the model fccl . neural predictor using only a quarter of the training neural archi-
The training epoch is 300, the regularization weight m is 0.5, and tectures (Figure 4e and Figure 4f).
the temperature x is 0.07. The batch size is 140k, and the train- On the NASBench-201 search space, SS-RL and SS-CCL
ing architectures are 140k. When pre-training on NAS- have comparable performance and both consistently outper-
Bench-201, the batch size is 10k, and the training architectures form the supervised neural predictor. SS-CCL slightly out-
are 1k. For both search spaces, the initial learning rate is 5e - 3. performs SS-RL when trained for more than 150 epochs. As
Other training details like the optimizer, weight decay, and the shown in Figure 5b and Figure 5f, even with fewer training
learning rate schedule are identical with those for self-supervised epochs, SS-RL and SS-CCL can approach their optimal per-
regression learning. for mance. When trained over 200 epochs, SS-RL and
The initial learning rates of the supervised neural predictor SS-CCL can outperform the supervised neural predictor using
on NASBench-101 and NASBench-201 are 5e - 3 and only a quarter of the training samples (Figure 5e, Figure 5f).
1e - 3, respectively. The other training details of the supervised Nvidia TITAN V GPUs are used to train the two self-super-
neural predictor are the same as self-supervised central contras- vised representation learning methods and report the time-cost in
tive learning. Table II. When trained on the NASBench-101 search space with
After pre-training, the neural architectures and their corre- the same epochs, the time cost of self-supervised central contras-
sponding validation accuracies are used to fine-tune the neural tive learning is about six times higher than that of self-supervised
predictors. SS-RL and SS-CCL are fine-tuned with an initial regression learning. On the small search space of NASBench-201,
learning rate of 5e - 5 and 5e - 3, respectively. The weight their training time costs are comparable and both methods are
decay is 1e - 4. The optimizer and the learning schedule are the quite efficient in training.
same as the self-supervised pre-training. The above results show that the performance of SS-RL on
the small search space of NASBench-201 is comparable to that of
3) Setup
The search budget and training epochs of neural predictors
directly affect the time cost of NAS. The supervised neural TABLE II Time cost of self-supervised training.
predictor, SS-RL and SS-CCL are compared under the search SEARCH TRAINING BATCH GPU
budgets of 20, 50, 100, 150, and 200. According to a search METHODS SPACE EPOCHS SIZE DAYS
budget, neural architectures are randomly selected from the Self-Supervised

Regression 101* 300 — 0.92
search space as the training dataset, and the remaining neural
architectures constitute the test dataset. To illustrate the effect Self-Supervised Cen-
tral-Contras* 101* 300 14K 5.97
of the training epochs, the neural predictors with different
search budgets are trained under 50, 100, 150, 200, 250, and Self-Supervised
Regression 201* 1000 — 0.05
300 training epochs. After fine-tuning, the correlation between
Self-Supervised Cen-
the validation accuracy of the neural architectures and their tral-Contras* 201* 300 10K 0.02
performance predicted by the neural predictors is evaluated
*NASBench-101, NASBench-201 and Central Contrastive are shortened as 101, 201
using the Kendall tau rank correlation. All the experimental and Central-Contras, respectively.

Training Epoch: 50 Training Epoch: 100
0.6 0.6
0.5 0.5
Kendall Tau Correlation

0.4 0.4
0.3 0.3
0.2 0.2
0.1 0.1
0.0 0.0
20 50 100 150 200 20 50 100 150 200
Search Budget Search Budget
(a) (b)
0.6 0.6
0.5 0.5
0.4 0.4
0.3 0.3
0.2 0.2
0.1 0.1
0.0 0.0
20 50 100 150 200 20 50 100 150 200
(c) (d)
0.6 0.6
0.5 0.5
0.4 0.4
0.3 0.3
0.2 0.2
0.1 0.1
0.0 0.0
20 50 100 150 200 20 50 100 150 200
(e) (f)
SUPERVISED SS-RL SS-CCL
FIGURE 4 Predictive performance of neural predictors on NASBench-101 with different training epochs.

0.6 0.6
0.5 0.5

0.4 0.4
0.3 0.3
0.2 0.2
0.1 0.1
0.0 0.0
20 50 100 150 200 20 50 100 150 200
(a) (b)
0.6 0.6
0.5 0.5
0.4 0.4
0.3 0.3
0.2 0.2
0.1 0.1
0.0 0.0
20 50 100 150 200 20 50 100 150 200
(c) (d)
0.6 0.6
0.5 0.5
0.4 0.4
0.3 0.3
0.2 0.2
0.1 0.1
0.0 0.0
20 50 100 150 200 20 50 100 150 200
(e) (f)
SUPERVISED SS-RL SS-CCL
FIGURE 5 Predictive performance of neural predictors on NASBench-201 with different training epochs.

SS-CCL, but on the large search space of NASBench-101, its As illustrated in Figure 7, the neural predictor SS-CCL
performance drops significantly. We conjecture that this is because achieves the best performance for all of the search budgets.
SS-RL does not get enough paired training samples and training The supervised neural predictor outperforms SemiNAS and
time, so it is difficult to learn to generate meaningful feature rep- MLP when the search budget is less than 100, but it is sur-
resentations. This suggests that self-supervised regression learn- passed by MLP and NP-NAS when the search budget is
ing may not be suitable for large search space. Unlike greater than 100. Since self-supervised regression learning is
self-supervised regression learning, self-supervised central con- not efficient for learning meaningful representations when the
trastive learning does not need paired training samples. The search space has numerous neural architectures, the perfor-
optimization purpose of central contrastive learning is more mance of SS-RL is only slightly better than that of SemiNAS,
intuitive, which pulls similar neural architectures to gather close- which has the worst performance.
ly in the feature space and push dissimilar neural architectures
far away. When the search space is large, we advocate the adop- F. Fixed Budget NPENAS
tion of the self-supervised central contrastive learning method,
and self-supervised regression learning can be considered when 1) Setup
the search space is small. The pre-trained neural predictors are integrated with NPENAS
[6] and the integration of SS-RL and SS-CCL with NPENAS
D. Effect of Batch Size are denoted as NPENAS-SSRL and NPENAS-SSCCL, respec-
Since the number of negative pairs of central contrastive learning tively. The fixed budget version of NPENAS-SSRL and NPE-
is determined by the batch size, in this section, experiments are NAS-SSCCL are denoted as NPENAS-SSRL-FIXED and
conducted to investigate the effect of batch size on the perfor- NPENAS-SSCCL-FIXED, respectively. The experimental set-
mance of neural predictors. The batch size N is set to 10k, 40k, tings are the same as NPENAS [6]. The algorithms of random
70k, and 100k, and the corresponding neural predictors are search (RS) [50], regularized evolutionary (REA) [25], BANAN-
denoted as SS-CCL-10k, SS-CCL-40k, SS-CCL-70k, and SS- AS [3] with path-based encoding (BANANAS-PE), BANANAS
CCL-100k, respectively. The number of training architectures M with position-aware path-based encoding (BANANAS-PAPE),
is set to half of the batch size. To compare the performance of NPENAS-NP [6], and NPENAS-NP with a fixed search budget
neural predictors with larger M, SS-CCL pre-trained with (NPENAS-NP-FIXED) are compared to illustrate the benefit of
N = 140k and M = 140k is also included, and it is denoted as the two self-supervised learning methods for NAS. Each algo-
SS-CCL-140k. The initial learning rate of the above neural pre- rithm has a search budget of 150 and 100 on the NASBench-101
dictors is set to 5e - 3, and all the experiments in this section are and NASBench-201 search space, respectively. The fixed search
carried out on NASBench-101. Other experimental settings are budgets for NASBench-101 and NASBench-201 are set to 90
the same as in Section IV-C. All results are averaged over 40 inde- and 50, respectively. All the experimental results are averaged over
pendent runnings using different seeds. 600 independent trails, every update of the population, each algo-
As shown in Figure 6, SS-CCL outperforms the supervised rithm returns the architecture with the lowest validation error so
neural predictor despite the batch size used. When the search far and reports its test error, so there are 15 or 10 best architec-
budget is greater than 100 and the training epoch is greater tures in total. The methods proposed in this paper are also com-
than 100, the performance of SS-CCL trained with different pared with arc2vec [8] that is a recently proposed unsupervised
batch sizes tends to be the same (Figure 6c-6f). Unlike the representation learning for NAS. As the search strategies employ
findings in the study of contrast learning-based image classifica- the neural architectures’ validation error to explore the search
tion [9], i.e., where model performance increases consistently space, a reasonable best performance of NAS is the test error of
with batch size, using larger batch sizes does not improve the the neural architecture that has the best validation error in the
performance of SS-CCL. The results demonstrate that SS-CCL search space, which is called the ORACLE baseline [7]. The
is not sensitive to batch size when the search budget is large. ORACLE baseline is used as the upper bound of performance.
Therefore, when using the self-supervised central contrastive Since self-supervised central contrastive learning is more suitable
learning method to pre-train the neural predictor’s embedding for a large search space, the experiments on the DARTS search
part, a relatively small batch size can be selected to obtain better space are conducted only using the self-supervised central contras-
memory efficiency. tive learning method. As the DARTS search space contains 1018
neural architectures, it is not possible to train the self-supervised
E. Predictive Performance Comparison central contrastive representation learning model using all of the
The neural predictors proposed in this paper are compared with neural architectures.The model is trained with 50k randomly sam-
SemiNAS [34], Multilayer Perceptron (MLP) [4], and that pro- pled neural architectures.The training details on the DARTS search
posed in Wen et al. [7] (denoted as NP-NAS). The comparison is space are the same as those on NASBench-101. After self-super-
performed on NASBench-101, and all the predictors are trained vised pre-training, NPENAS-SSCCL-FIXED is adopted to search
for 200 epochs under the search budgets of 20, 50, 100, and 200. architectures in the DARTS search space. The NPENAS-SSCCL-
The experimental results are averaged over 40 independent run- FIXED has a search budget of 100 but can only evaluate 70
nings using different random seeds. searched architectures. When the number of searched architectures

0.6 0.6
0.5 0.5

0.4 0.4
0.3 0.3
0.2 0.2
0.1 0.1
0.0 0.0
20 50 100 150 200 20 50 100 150 200
(a) (b)
0.6 0.6
0.5 0.5
0.4 0.4
0.3 0.3
0.2 0.2
0.1 0.1
0.0 0.0
20 50 100 150 200 20 50 100 150 200
(c) (d)
0.6 0.6
0.5 0.5
0.4 0.4
0.3 0.3
0.2 0.2
0.1 0.1
0.0 0.0
20 50 100 150 200 20 50 100 150 200
(e) (f)
SUPERVISED SS-CCL-10K SS-CCL-40K SS-CCL-70K SS-CCL-100K SS-CCL-140K
FIGURE 6 Predictive performance of neural predictors pre-trained by different batch size N on NASBench-101.

Training Epoch: 200
0.6 7.4
RS
Testing Error of Best Neural Net

7.2 REA
0.5 BANANAS-PE
7 BANANAS-PAPE
0.4 NPENAS-NP
6.8 NPENAS-SSRL
NPENAS-SSCCL
0.3 6.6
6.4
0.2 MLP
NP_NAS 6.2
SS-CCL
0.1 SS-RL 6
SUPERVISED
SemiNAS 5.8
0.0
10 30 50 70 90 110 130 150
20 50 100 200 Number of Samples
Search Budget
FIGURE 7 Performance comparison of different neural predictors. FIGURE 8 Performance comparison of NAS algorithms on NASBench-101.
TABLE III Performance comparison of NAS algorithms on NASBench-101.
METHODS SEARCH BUDGET TEST ERR (%) AVG ARCHITECTURE EMBEDDING SEARCH METHOD
RS [50] 150 6.42 ± 0.20 – RANDOM SEARCH
REA [25] 150 6.32 ± 0.23 DISCRETE EVOLUTION
BANANAS-PE [3] 150 5.90 ± 0.15 SUPERVISED BAYESIAN OPTIMIZATION
BANANAS-AE [3] 150 5.85 ± 0.14 SUPERVISED BAYESIAN OPTIMIZATION
BANANAS-PAPE [3] 150 5.86 ± 0.14 SUPERVISED BAYESIAN OPTIMIZATION
NPENAS-NP [6] 150 5.83 ± 0.11 SUPERVISED EVOLUTION
†
NPENAS-NP-FIXED [6] 90 5.90 ± 0.16 SUPERVISED EVOLUTION
ARCH2VEC-RL [8] 400 5.90 UNSUPERVISED REINFORCE
ARCH2VEC-BO [8] 400 5.95 UNSUPERVISED BAYESIAN OPTIMIZATION
NPENAS-SSRL 150 5.86 ± 0.14 SELF-SUPERVISED EVOLUTION
NPENAS-SSRL-FIXED 90† 5.88 ± 0.15 SELF-SUPERVISED EVOLUTION
NPENAS-SSCCL 150 5.83 ± 0.11 SELF-SUPERVISED EVOLUTION
NPENAS-SSCCL-FIXED 90† 5.85 ± 0.13 SELF-SUPERVISED EVOLUTION
†
The neural predictor is trained with 90 evaluated neural architectures, while other algorithms use 150 neural architectures for evaluation.
TABLE IV Impact of fixed search budget on NASBench-101. is greater than 70, the performance of the extra sampled neural
architectures is predicted by the neural predictor. Upon comple-
METHODS SEARCH BUDGET† TEST ERR (%) AVG
tion of the search, the best performing neural architecture is
NPENAS-SSRL 20 6.04 ± 0.25 selected for evaluation on the CIFAR-10 dataset, and the archi-
NPENAS-SSRL 50 5.94 ± 0.18 tecture is evaluated five times with different seeds. All the other
NPENAS-SSRL 80 5.87 ± 0.14 evaluation settings are the same as DARTS [32]. These experi-
NPENAS-SSRL 110 5.86 ± 0.11 ments are executed on two Nvidia RTX 2080Ti GPUs.
NPENAS-SSRL 150 5.86 ± 0.14
2) NAS Results on NASBench-101
NPENAS-SSCCL 20 5.99 ± 0.21
The performance of different NAS algorithms on the NAS-
Bench-101 benchmark is illustrated in Figure 8, and the quantita-
tive comparison is also provided in Table III. Except for RS and
NPENAS-SSCCL 110 5.83 ± 0.12 REA, the other algorithms achieve comparable p erformance
NPENAS-SSCCL 150 5.83 ± 0.12 on NASBench-101 (Figure 8), while NPENAS-SSCCL and
†
The neural predictor is trained with the given number of search budgets. NPENAS-NP have the best performance overall, as shown in

Table III. The performance of BANANAS using the position-
10.4 aware path-based encoding exceeds that of using path-based
RS encoding, proving that the proposed position-aware path-based
10.2 REA
BANANAS-PE
encoding is an efficient and effective encoding scheme. In addi-
10 BANANAS-PAPE tion, the performance of NPENAS-NP improves from 5.86% [6]
NPENAS-NP to 5.83% after filtering out isomorphic graphs using the proposed
9.8 NPENAS-SSRL
NPENAS-SSCCL position-aware path-based encoding. Table III also shows that
9.6 NPENAS methods using the proposed self-supervised pre-
9.4 trained neural predictors (NPENAS-SSRL and NPENAS-SSC-
CL) have only a slight performance drop when the search budget
9.2 is reduced from 150 to 90, but still outperform the unsupervised
9 arch2vec that uses a large search budget of 400.
The impact of the search budget on NPENAS-SSRL and
8.8
10 20 30 40 50 60 70 80 90 100
NPENAS-SSCCL methods are shown in Table IV. The perfor-
Number of Samples mance of NPENAS-SSRL continues to improve as the search
budget increases, while NPENAS-SSCCL achieves its best per-
FIGURE 9 Performance comparison of NAS algorithms on NASBench-201 formance using only 80 evaluated neural architectures. These
for CIFAR10 classification. results are consistent with those in Section IV-C and again show
31.5 56.9
RS RS
31
REA REA
56.4
30.5 BANANAS-PE BANANAS-PE
BANANAS-PAPE BANANAS-PAPE
30 NPENAS-NP 55.9 NPENAS-NP
29.5 NPENAS-SSRL NPENAS-SSRL
NPENAS-SSCCL 55.4 NPENAS-SSCCL
29
28.5 54.9
28
54.4
27.5
27 53.9
26.5 53.4
10 20 30 40 50 60 70 80 90 100 10 20 30 40 50 60 70 80 90 100
Number of Samples Number of Samples
FIGURE 10 Performance comparison of NAS algorithms on NASBench-201 FIGURE 11 Performance comparison of NAS algorithms on NASBench-201
for CIFAR-100 classification. for ImageNet-16-120 classification.
TABLE V Performance comparison of NAS algorithms on NASBench-201.
METHODS SEARCH BUDGET TEST ERR (%) AVG CIFAR-10 TEST ERR (%) AVG CIFAR-100 TEST ERR (%) AVG IMAGENET-16-120
RS [50] 100 9.27 ± 0.32 28.55 ± 0.86 54.65 ± 0.80
REA [25] 100 8.98 ± 0.24 27.20 ± 0.89 53.89 ± 0.71
BANANAS-PE [3] 100 9.06 ± 0.31 27.41 ± 1.03 53.85 ± 0.64
BANANAS-AE [3] 100 8.95 ± 0.14 26.77 ± 0.67 53.67 ± 0.34
BANANAS-PAPE [3] 100 8.93 ± 0.16 26.86 ± 0.67 53.71 ± 0.46
NPENAS-NP [6] 100 8.95 ± 0.13 26.70 ± 0.64 53.90 ± 0.59
NPENAS-NP-FIXED [6] 50† 8.95 ± 0.15 26.72 ± 0.57 53.84 ± 0.57
NPENAS-SSRL 100 8.94 ± 0.11 26.55 ± 0.34 53.75 ± 0.51
NPENAS-SSRL-FIXED 50† 8.92 ± 0.11 26.57 ± 0.30 53.63 ± 0.38
NPENAS-SSCCL 100 8.94 ± 0.09 26.50 ± 0.12 53.80 ± 0.36
NPENAS-SSCCL-FIXED 50† 8.93 ± 0.10 26.58 ± 0.33 53.66 ± 0.36
†
The neural predictor is trained with 50 evaluated neural architectures, while other algorithms use 100 neural architectures for evaluation.

TABLE VI Performance comparison of NAS algorithms on DART for CIFAR-10 Classification.
MODEL PARAMS (M) ERR(%) AVG ERR(%) BEST NO. OF SAMPLES EVALUATED GPU DAYS
NASNET-A [23] 3.3 – 2.65 20000 1800
NAONET [51] 10.6 – 3.18 1000 200
ASHA [50] 2.2 – 2.85 700 9
DARTS [20] 3.3 – 3.00 ± 0.14 – 1.5
BANANAS [3] 3.6 2.64 ± 0.05 2.57 100 11.8
GATES [5] 4.1 – 2.58 800 –
ARCH2VEC-RL [8] 3.3 2.65 ± 0.05 2.60 100 9.5
ARCH2VEC-BO [8] 3.6 2.56 ± 0.05 2.48 100 10.5
NPENAS-NP [6] 3.5 2.54 ± 0.10 2.44 100 1.8
NPENAS-SSCCL-FIXED 3.9 2.49 ± 0.06 2.41 70 1.6
that the neural predictor SS-CCL outperforms SS-RL on the NASBench-101, the performance of BANANAS using the posi-
large search space of NASBench-101. tion-aware path-based encoding exceeds that of using path-based
encoding. Furthermore, it can be seen from Figures 9-11 that the
3) NAS Results on NASBench-201 performance of our methods improves faster as the search budget
The above algorithms are compared on CIFAR-10, CIFAR-100, increases. Due to space limitation, more comparisons are attached
and ImageNet-16-120 on NASBench-201, and the results are in the Supplementary Materials.
shown in Figure 9, Figure 10 and Figure 11, respectively. The
quantitative comparison is presented in Table V. As arc2vec [8] does 4) NAS Results on DARTS
not report queries on this benchmark, it is not compared here. As shown in Table VI, NPENAS-SSCCL-FIXED achieves the
Table V shows that the methods proposed in this paper obtain best performance compared with the recently proposed NAS
the best performance on all three datasets. Specifically, NPENAS- algorithms, and its search speed is nearly the same as the gradient-
SSCCL achieves the best performance on both CIFAR-100 and based method DARTS [20].The searched normal cell and reduc-
ImageNet-16-120, almost reaching the ORACLE baseline on tion cell are illustrated in Figure 12.
CIFAR-100 (26.5% vs. 26.49%). In particular, the performance
of NPENAS-SSCCL is the same as the ORACLE baseline on V. Conclusion
ImageNet-16-120. On CIFAR-10, NPENAS-SSRL-FIXED This paper presents a new neural architecture encoding scheme,
achieves the best performance comparable to the ORACLE position-aware path-based encoding, to calculate the GED of
baseline (8.92% vs 8.91%). In addition, like the results on neural architectures. To enhance the performance of neural
predictors, two self-supervised learn-
ing methods are proposed to pre-train
the neural predictors’ architecture
sep_conv_3×3 embedding modules to generate
skip_connect meaningful representation of neural
c_{k-2} sep_conv_3×3
2 architectures. Extensive experiments
0
sep_conv_5×5 demonstrate the superiority of the
c_{k}
sep_conv_3×3 self-supervised pre-training. The
1 dil_conv_5×5
3 results advocate the adoption of the
sep_conv_3×3
c_{k-1} sep_conv_3×3 self-supervised central contrastive rep-
(a) resentation learning method, while
self-supervised regression learning
none can be considered when the search
0
c_{k-1} avg_pool_3×3 sep_conv_5×5 space is small. When integrating the
sep_conv_5×5
max_pool_3×3
3 pre-trained neural predictors with
c_{k}
NPENAS, it achieves state-of-the-art
c_{k-2} 1 max_pool_3×3 performance on the NASBench-101,
none 2
max_pool_3×3 NASBench-201 and DARTS search
space. Since neural predictors can be
(b) combined with different search strate-
gies, the proposed self-supervised
FIGURE 12 (a) The normal cell and (b) reduction cell searched by NPENAS-SSCCL-FIXED on DARTS. representation learning methods are

applicable to any search strategy that employs graph neural [21] B. Zoph and Q.V. Le, “Neural architecture search with reinforcement learning,” in Proc. 5th Int.
Conf. Learning Representations, ICLR 2017,Toulon, France, Apr. 24–26, 2017.
networks as neural predictors. [22] H. Pham, M.Y. Guan, B. Zoph, Q.V. Le, and J. Dean, “Efficient neural architecture search via
For future works, combining the pre-trained neural predictors parameter sharing,” in Proc. 35th Int. Conf. Machine Learning, ICML 2018, Stockholmsmässan, Stock-
holm, Sweden, July 10–15, 2018, vol. 80, pp. 4092–4101.
to other neural predictor-based NAS to verify their generalization [23] B. Zoph,V.Vasudevan, J. Shlens, and Q.V. Le,“Learning transferable architectures for scalable im-
ability is worth further investigation. Extending the integration of age recognition,” in Proc. IEEE Conf. Comput.Vision and Pattern Recogn., CVPR 2018, Salt Lake City,
UT, June 18–22, 2018, pp. 8697–8710.
pre-trained neural predictors with NPENAS to other tasks, such [24] E. Real et al.,“Large-scale evolution of image classifiers,” in Proc. 34th Int. Conf. Machine Learning,
as image segmentation, object detection and natural language ICML 2017, Sydney, NSW, Australia, Aug. 6–11, 2017, vol. 70, pp. 2902–2911.
[25] E. Real, A. Aggarwal, Y. Huang, and Q. V. Le, “Regularized evolution for im-
processing, is also promising research. age classifier architecture search,” in Proc. 33rd AAAI Conf. Artif. Intell., AAAI 2019, 31st
Innovative Appl.Artif. Intell. Conf., IAAI 2019, 9th AAAI Symp. Educational Adv.Artif. Intell., EAAI 2019,
Honolulu, HI, Jan. 27–Feb. 1, 2019, pp. 4780–4789.
VI. Acknowledgment [26]Y. Sun, B. Xue, M. Zhang, G. G.Yen, and J. Lv,“Automatically designing CNN architectures using
This work was supported in part by the National Natural Science the genetic algorithm for image classification,” IEEE Trans. Cybernetics, 2020, pp. 1–15. doi: 10.1109/
TCYB.2020.2983860.
Foundation of China under Grants Nos 61976167, U19B2030 [27] Y. Sun, B. Xue, M. Zhang, and G. G. Yen, “Completely automated CNN archi-
and 11727813, the Key Research and Development Program in tecture design based on blocks,” IEEE Trans. Neural Netw. Learning Syst., vol. 31, no. 4,
pp. 1242–1254, 2020. doi: 10.1109/TNNLS.2019.2919608.
the Shaanxi Province of China under Grant 2021GY-082, and [28] Y. Sun, B. Xue, M. Zhang, and G. G.Yen,“Evolving deep convolutional neural networks for im-
the Xi’an Science and Technology Program under Grant age classification,” IEEE Trans. Evolutionary Comput., vol. 24, no. 2, pp. 394–407, 2020.
[29] H. Zhou, M.Yang, J. Wang, and W. Pan, “Bayesnas: A Bayesian approach for neural architecture
201809170CX11JC12. The authors would like to thank Dr. search,” in Proc. 36th Int. Conf. Machine Learning, ICML 2019, June 9–15, 2019, Long Beach, CA,
Karen von Deneen for her professional language editing. 2019, vol. 97, pp. 7603–7613.
[30] S. Xie, H. Zheng, C. Liu, and L. Lin,“SNAS: Stochastic neural architecture search,” in Proc. 7th Int.
Conf. Learning Representations, ICLR 2019, New Orleans, LA, May 6–9, 2019.
References [31] X. Chen, L. Xie, J.Wu, and Q.Tian, “Progressive differentiable architecture search: Bridging the
[1] T. Elsken, J. H. Metzen, and F. Hutter,“Neural architecture search: A survey,” J. Mach. Learning Res., depth gap between search and evaluation,” pp. 1294–1303, 2019.
vol. 20, pp. 55:1–55:21, 2018. [32]Y. Xu et al.,“PC-DARTS: Partial channel connections for memory-efficient architecture search,”
[2] C. Liu et al., “Progressive neural architecture search,” in Proc. Computer Vision—ECCV 2018— in 8th Int. Conf. Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, Apr. 26–30, 2020.
15th European Conf., Munich, Part I, Sept. 8–14, 2018, vol. 11205, pp. 19–35. [33] K. Kandasamy,W. Neiswanger, J. Schneider, B. Poczos, and E. P. Xing,“Neural architecture search
[3] C. White, W. Neiswanger, and Y. Savani, “Bananas: Bayesian optimization with neu- with Bayesian optimisation and optimal transport,” in Proc. Adv. Neural Information Process. Syst. 31:
ral architectures for neural architecture search,” 2019. [Online]. Available: https://arxiv Annu. Conf. Neural Information Process. Syst. 2018, NeurIPS 2018, Montréal, Canada, Dec. 3–8, 2018,
.org/abs/1910.11858 pp. 2016–2025.
[4] L.Wang,Y. Zhao,Y. Jinnai,Y.Tian, and R. Fonseca, “Neural architecture search using deep neural [34] R. Luo, X. Tan, R. Wang, T. Qin, E. Chen, and T. Liu, “Semi-supervised neural architecture
networks and Monte Carlo tree search,” in Proc. 34th AAAI Conf. Artif. Intell. (AAAI 2020) 32nd In- search,” in Proc. Adv. Neural Information Processing Syst. 33: Annu. Conf. Neural Information Process. Syst.
novative Appl.Artif. Intell. Conf. (IAAI 2020) 10th AAAI Symp. Educational Adv.Artif. Intell., EAAI 2020, 2020, NeurIPS 2020, Dec. 6–12, 2020.
New York, Feb. 7–12, 2020, pp. 9983–9991. [35] L. Dudziak, T. C. P. Chau, M. S. Abdelfattah, R. Lee, H. Kim, and N. D. Lane, “BRP-NAS:
[5] X. Ning, Y. Zheng, T. Zhao, Y. Wang, and H. Yang, “A generic graph-based neural architecture Prediction-based NAS using GCNS,” in Proc. Adv. Neural Inf. Process. Syst. 33: Annu. Conf. Neural Inf.
encoding scheme for predictor-based NAS,” in Proc. Comput.Vision—ECCV 2020—16th European Process. Syst. 2020, NeurIPS 2020, Dec. 6–12, 2020.
Conf., Glasgow, U.K., Aug. 23–28, 2020, pp. 189–204. [36] B. Baker, O. Gupta, R. Raskar, and N. Naik, “Accelerating neural architecture search using
[6] C. Wei, C. Niu,Y. Tang,Y. Wang, H. Hu, and J. Min Liang, “NPENAS: Neural predictor guided performance prediction,” in Proc. 6th Int. Conf. Learning Representations, ICLR 2018,Vancouver, BC,
evolution for neural architecture search,” 2020. [Online].Available: https://arxiv.org/abs/2003.12857 Canada, Apr. 30–May 3, 2018.
[7] W. Wen, H. Liu,Y. Chen, H. H. Li, G. Bender, and P. Kindermans, “Neural predictor for neural [37] D. P. Kingma and M.Welling, “Auto-encoding variational bayes,” in Proc. 2nd Int. Conf. Learning
architecture search,” in Proc. Comput. Vision—ECCV 2020—16th European Conf., Glasgow, U.K., Representations, ICLR 2014, Banff, AB, Canada, Apr. 14–16, 2014.
Aug. 23–28, 2020, pp. 660–676. [38]Y. Li, C. Gu,T. Dullien, O.Vinyals, and P. Kohli,“Graph matching networks for learning the simi-
[8] S.Yan,Y. Zheng,W. Ao, X. Zeng, and M. Zhang, “Does unsupervised architecture representation larity of graph structured objects,” in Proc. 36th Int. Conf. Machine Learning, ICML 2019, Long Beach,
learning help neural architecture search?” in Proc. Adv. Neural Inf. Process. Syst. 33: Annu. Conf. Neural CA, June 9–15, 2019, pp. 3835–3845.
Inf. Process. Syst. 2020, NeurIPS 2020, Dec. 6–12, 2020. [39] Y. Bai, H. Ding, S. Bian,T. Chen,Y. Sun, and W.Wang,“SIMGNN:A neural network approach to
[9] T. Chen, S. Kornblith, M. Norouzi, and G. E. Hinton, “A simple framework for con- fast graph similarity computation,” in Proc. 12th ACM Int. Conf.Web Search and Data Mining,WSDM
trastive learning of visual representations,” 2020. [Online]. Available: https://arxiv 2019, Melbourne,VIC, Australia, Feb. 11–15, 2019, pp. 384–392.
.org/abs/2002.05709 [40] Y. Bai et al.,“Unsupervised inductive graph-level representation learning via graph-graph prox-
[10] K. He, H. Fan,Y. Wu, S. Xie, and R. B. Girshick, “Momentum contrast for unsupervised visual imity,” in Proc. 28th Int. Joint Conf. Artif. Intell., IJCAI 2019, Macao, China, Aug. 10–16, 2019, pp.
representation learning,” in Proc. IEEE/CVF Conf. Comput.Vision and Pattern Recogn., CVPR 2020, 1988–1994.
Seattle,WA, June 13–19, 2020, pp. 9726–9735. [41] K. Xu,W. Hu, J. Leskovec, and S. Jegelka,“How powerful are graph neural networks?” in Proc. 7th
[11] J. Devlin, M. Chang, K. Lee, and K.Toutanova,“BERT: pre-training of deep bidirectional trans- Int. Conf. Learning Representations, ICLR 2019, New Orleans, LA, May 6–9, 2019.
formers for language understanding,” in Proc. Conf. North American Chapter Assoc. Comput. Linguistics: [42] M. Caron, I. Misra, J. Mairal, P. Goyal, P. Bojanowski, and A. Joulin, “Unsupervised learning
Human Language Technol. (NAACL-HLT 2019), Minneapolis, MN, June 2–7, 2019, pp. 4171–4186. of visual features by contrasting cluster assignments,” 2020. [Online]. Available: https://arxiv.org/
[12] Y. M. Asano, C. Rupprecht, and A.Vedaldi, “Self-labelling via simultaneous clustering and repre- abs/2006.09882
sentation learning,” in Proc. 8th Int. Conf. Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, [43] A. Paszke et al., “Pytorch: An imperative style, high-performance deep learning library,” in Proc.
Apr. 26–30, 2020. Adv. Neural Information Processing Syst. 32: Annu. Conf. Neural Information Process. Syst. 2019, NeurIPS
[13] Z.Wu, S. Pan, F. Chen, G. Long, C. Zhang, and P. S.Yu,“A comprehensive survey on graph neural 2019, Dec. 8–14, 2019,Vancouver, BC, Canada, 2019, pp. 8024–8035.
networks,” IEEE Trans. Neural Netw. Learning Syst., 2020. [44] M. Fey and J. E. Lenssen,“Fast graph representation learning with PyTorch Geometric,” in Proc.
[14] H. Cheng et al., “NASGEM: Neural architecture search via graph embedding method,” 2020. Int. Conf. Learning Representations Workshop Representation Learning Graphs and Manifolds, 2019.
[Online]. Available: https://arxiv.org/abs/2007.04452 [45] C. Wei, “Code for self-supervised representation learning for evolutionary neural architecture
[15] M. Noroozi and P. Favaro, “Unsupervised learning of visual representations by solving jigsaw search,” 2020. [Online]. Available: https://github.com/auroua/SSNENAS
puzzles,” in Proc. Comput.Vision—ECCV 2016—14th European Conf., Amsterdam,The Netherlands, [46] A. Krizhevsky and G. E. Hinton, “Learning multiple layers of features from tiny images,” Tech.
Oct. 11–14, 2016, pp. 69–84. Rep., 2009.
[16] S. Gidaris, P. Singh, and N. Komodakis, “Unsupervised representation learning by predicting [47] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional
image rotations,” in Proc. 6th Int. Conf. Learning Representations, ICLR 2018,Vancouver, BC, Canada, neural networks,” in Proc. Adv. Neural Information Process. Syst. 25: 26th Annu. Conf. Neural Information
April 30–May 3, 2018. Process. Syst. 2012. Proc. Meeting, Lake Tahoe, NV, Dec. 3–6, 2012, pp. 1106–1114.
[17] C.Ying, A. Klein, E. Christiansen, E. Real, K. Murphy, and F. Hutter,“NAS-bench-101:Towards [48] D. P. Kingma and J. Ba,“Adam:A method for stochastic optimization,” CoRR, vol. abs/1412.6980,
reproducible neural architecture search,” in Proc. 36th Int. Conf. Machine Learning, ICML 2019, June 2014.
9–15, 2019, vol. 97, pp. 7105–7114. [49] I. Loshchilov and F. Hutter,“SGDR: Stochastic gradient descent with warm restarts,” in Proc. 5th
[18] J.You, J. Leskovec, K. He, and S. Xie,“Graph structure of neural networks,” in Proc. 37th Int. Conf. Int. Conf. Learning Representations, ICLR 2017,Toulon, France, Apr. 24–26, 2017, 2017.
Machine Learning, ICML 2020, July 12–18, 2020. [50] L. Li and A. Talwalkar, “Random search and reproducibility for neural architecture search,” in
[19] X. Dong and Y.Yang, “NAS-Bench-201: Extending the scope of reproducible neural architec- Proc. 35th Conf. Uncertainty Artif. Intell., UAI 2019,Tel Aviv, Israel, July 22–25, 2019, p. 129.
ture search,” in Proc. 8th Int. Conf. Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, Apr. [51] R. Luo, F.Tian,T. Qin, and T.-Y. Liu, “Neural architecture optimization,” in Proc. Adv. Neural Inf.
26–30, 2020. Process. Syst. 31: Annu. Conf. Neural Inf. Process. Syst. 2018, NeurIPS 2018, Dec. 3–8, 2018, Montréal,
[20] H. Liu, K. Simonyan, and Y.Yang, “DARTS: Differentiable architecture search,” in Proc. 7th Int. Canada, 2018, pp. 7827–7838.
Conf. Learning Representations, ICLR 2019, New Orleans, LA, May 6–9, 2019.

©SHUTTERSTOCK.COM/JIMMOYHT
Forecasting Wind Speed
Time Series Via Dendritic
Neural Regression
series. To address this issue, the use of artificial neural net-
Junkai Ji, Minhui Dong, and Qiuzhen Lin works has attracted increasing attention owing to their sig-
Shenzhen University, CHINA nificantly enhanced prediction accuracy. Based on these
Kay Chen Tan considerations, a novel neural model with a dynamic den-
The Hong Kong Polytechnic University, HONG KONG SAR. drite structure, known as the dendritic neuron model
(DNM), can be adopted for wind speed time series predic-
tion. The DNM is a plausible biological neural model that
Abstract—Wind energy is considered one of the fastest was originally designed for classification problems; accord-
growing renewable (‘green’) energy resources. Precise wind ingly, this study proposes the use of a regressive version of
power forecasting is imperative to ensure reliable power sys- the DNM, named dendritic neural regression (DNR), in
tem planning and wind farm operation. However, traditional which the dendrite strength of each branch is considered. To
methods cannot yield satisfactory forecasts because of the enhance the prediction performance, the recently proposed
chaotic properties and high volatility of wind speed time states of matter search (SMS) optimization algorithm is used
to optimize the neural architecture for DNR. By virtue of
the powerful search ability of the SMS algorithm, DNR can
Date of current version: 15 July 2021 Corresponding Author: Qiuzhen Lin (e-mail: qiuzhlin@szu.edu.cn).

efficiently capture the nonlinear correlations among distinct follows a normal distribution, although this assumption is not
features and dendritic branches. Extensive experimental valid in all cases. Moreover, statistical models have linear
results and statistical tests demonstrate that compared with correlation structures and always yield large errors when
other state-of-the-art prediction techniques, DNR can applied to intermittent and stochastic wind speed series [17].
achieve highly competitive results in wind speed forecasting. The third category pertains to artificial intelligence mod-
els, which can effectively overcome the shortcomings of the
W
I. Introduction abovementioned methods. Considering the nonlinear nature
ith the rapid development of society, it has of wind speed series, artificial intelligence algorithms that are
become challenging for the remaining fossil designed for effectively solving nonlinear problems, such as
fuel supply to meet societal requirements world- artificial neural networks (ANNs) [16] and support vector
wide; thus, the demand for sustainable renewable machines (SVMs) [18], are suitable for wind speed forecast-
energy is growing. In recent years, wind energy, as a sustain- ing. A previous comparison of prediction performance has
able renewable energy source, has attracted particular atten- demonstrated that artificial intelligence algorithms are faster
tion. Wind is an abundant, pollution-free energy source that and more accurate than statistical models [19]. SVMs are
is widely distributed, has few geographical restrictions, and commonly used in prediction frameworks, and they can out-
can be easily used anywhere. Building wind farms is one of perform ANNs in certain cases. However, the performance
the main challenges in the use of wind energy. Nevertheless, of an SVM is limited by its penalty settings and kernel
the cost of building wind farms is low, no additional energy parameters; consequently, algorithms for tuning these hyper-
source is required, and such farms do not impact the sur- parameters are necessary [20]. For example, a genetic algo-
rounding environment. rithm was employed to enhance the prediction results of an
Since the energy supplied by wind farms mainly depends SVM in [21]; a reduced SVM with feature selection, trained
on the wind power, precise wind power forecasting is imper- using the particle swarm optimization algorithm, was used to
ative to enable reliable power system planning and wind optimize the parameters of an SVM in [22]; and the perfor-
farm operation [1, 2]. Wind power is closely related to wind mance of an SVM was enhanced using the cuckoo search
speed, and these two parameters share several characteristics, algorithm in [23]. In addition to SVMs, an increasing num-
such as randomness, uncontrollability and intermittency [3]. ber of ANNs and their variants have been proposed for wind
Because of these characteristics, wind farm management is speed prediction. For instance, a backpropagation (BP) neu-
extremely challenging. Calculating the energy production ral network was employed to forecast a wind speed series in
of a wind farm is essential for assessing the economic feasi- [24], a combination of an ANN and Markov chains was pro-
bility of such a project prior to construction planning [4]. posed for forecasting in [25], and a functional network was
Therefore, accurate wind speed prediction is being strongly utilized for multistep wind speed prediction in [26]. Further-
prioritized in this context [5]. More accurate forecasting more, a fine-tuned long short-term memory (LSTM) neural
capabilities correspond to larger reductions in the construc- network hybridized with the crow search algorithm, the
tion costs of wind farms [6]. wavelet transform and feature selection was applied for short-
To date, various models have been used for wind speed term wind speed forecasting in [27]. In [28], a hybrid model
prediction. These methods can be classified into three cate- involving a causal convolutional network and a gated recur-
gories: physical models, statistical models, and artificial rent unit architecture was used in wind speed prediction.
intelligence models [7]. Examples of physical models In general, physical models and traditional statistical mod-
include the Mesoscale Model Version 5 [8] and the Weather els both have several limitations pertaining to the precision
Research and Forecasting Model [9]. These numerical pre- and robustness of wind speed time series prediction, whereas
diction models can achieve satisfactory performance in artificial intelligence models can effectively overcome these
long-term wind speed prediction; however, such models problems to offer more powerful prediction performance.
require complex atmospheric information pertaining to Based on these considerations, the dendritic neuron model
pressure, temperature and other environmental factors and (DNM), which was recently developed based on inspiration
exhibit high computational complexity [10], [11]. from biological neurons in vivo [29], is adopted in this study
Compared to physical models, statistical models are more for the prediction of wind speed time series. In the DNM,
widely used to forecast wind speed. Examples of such models synaptic nonlinearity is implemented in a dendritic structure
include direct random time series models [12], autoregressive to effectively solve linearly inseparable problems, and this
models [13] and autoregressive integrated moving average model has been applied to a variety of complex continuous
(ARIMA) models [14]. ARIMA models are regarded as a functions [30]–[32]. The original DNM was specifically
typical class of statistical models, and their prediction perfor- designed for classification problems. By discarding the unnec-
mance in short-term wind speed forecasting has been veri- essary synapses and dendritic branches in the DNM, diverse
fied [15]. However, in general, the prediction performance of dendritic structures can be produced to pursue an extremely
statistical models is f lawed [16] because most statistical mod- high classification speed for each task. Notably, however, the
els are based on the assumption that the wind speed series structure of the original DNM is extremely simple, and it

its plastic neural architecture can efficiently
The demand for sustainable renewable energy is capture the nonlinear correlations among
growing. Wind energy, as a sustainable renewable distinct features and different dendritic
branches. In addition, a one-dimensional
energy source, has attracted particular attention. wind speed time series typically appears to
exhibit intermittent and random features
ignores the mechanism by which the signal transformation since it contains complex information projected from a high-
strengths of the dendritic branches vary with their thickness er-dimensional space. Thus, the phase space should be recon-
[33]. Therefore, this study proposes the use of a variant of the structed based on Takens’ theorem [44]. The time delays and
DNM called dendritic neural regression (DNR), in which the numbers of embedding dimensions for a wind speed time
intensity of each dendritic signal is considered to enhance the series can be calculated using a mutual information (MI)
regression ability. approach and the false nearest neighbors (FNN) algorithm,
The neural architecture considerably inf luences the model respectively. To evaluate the forecasting results of the DNR-
performance of ANNs [34]. Accordingly, evolutionary algo- SMS method, a comprehensive experiment has been con-
rithms have been widely used to realize neural architecture ducted in which the performance of the DNR approach has
search tasks for performance optimization. For instance, Sun been compared with that of other commonly used time series
et al. proposed an automatic method of designing convolu- prediction algorithms.
tional neural network (CNN) architectures for solving The main contributions of this study are as follows:
image classification problems based on genetic algorithms First, a novel DNR model that considers the dendrite
[35]. Lu et al. used a multiobjective evolutionary algorithm strength of each individual branch is proposed. This
for CNN architecture design [36]. In contrast to that of approach can significantly enhance the regression perfor-
conventional ANNs, the neural architecture of a DNR mance for wind speed forecasting. Second, due to its pow-
model is determined by the values of the weights and erful search ability, the SMS algorithm is used to optimize
thresholds rather than by hyperparameters. Subsequently, an the neural architecture of the DNR model. The SMS algo-
inherent pruning mechanism can be implemented to simpli- rithm can escape from local minima more effectively than
fy the DNR model to produce unique neural architectures gradient-based optimization algorithms can. Third, the
for particular real-world tasks, as described in our previous results of extensive experiments demonstrate that a DNR
research [29], [37]. To further enhance the performance of model trained using the SMS algorithm can achieve highly
wind speed forecasting, the recently proposed states of mat- competitive performance in wind speed forecasting com-
ter search (SMS) algorithm [38] can be used to optimize the pared with other state-of-the-art prediction techniques.
neural architecture of the DNR model. The SMS algorithm The remainder of the paper is organized as follows: Section
is a global search algorithm that can escape from local mini- II describes the DNR approach. Section III introduces the
ma more effectively than gradient-based optimization algo- process of wind speed prediction. Section IV presents and
rithms. In fact, many researchers have attempted to employ discusses the experimental results. Finally, concluding
evolutionary algorithms to enhance the performance of remarks are presented in Section V.
ANNs for chaotic time series prediction. For example, par-
ticle swarm optimization has been introduced into echo II. Dendritic Neural Regression
state networks as a pretraining tool to optimize the In this study, we demonstrate the utilization of DNR for
untrained weights to address time series forecasting prob- wind speed prediction and evaluate its performance at vari-
lems [39]. Furthermore, a genetic algorithm has been intro- ous time scales. As illustrated on the left side of Fig. 1, the
duced into the Elman neural network to optimize the neural architecture for DNR includes four layers. The first
connection weights and thresholds to prevent the optimiza- layer is the synaptic layer, which represents the specific tissue
tion from becoming trapped in local minima and enhance that receives electrical or chemical signals from other neu-
the training speed and success rate [40]. Moreover, a modi- rons. The second layer is the dendrite layer, which consists of
fied cuckoo search algorithm has been used to optimize many branches that integrate the output signals from the
wavelet neural networks to achieve higher generalization synaptic layer. The third layer is the membrane layer, which
capabilities in chaotic time series forecasting [41]. In addi- sums the outputs of the dendritic layer and transfers the result
tion, an evolving fuzzy neural network predictor has been to the next layer. The final layer is the cell body (soma),
proposed to effectively capture the dynamic properties of which compares the signal against a given threshold. If the
multidimensional datasets and accurately track the system signal is larger than the threshold, the soma fires; otherwise,
characteristics [42], while in [43], a novel strategy was pro- no action is performed.
posed for evolving the structure of deep recurrent neural
networks by means of ant colony optimization. A. Synaptic Layer
Compared with other ANNs, a DNR model can help Synapses are a kind of neural tissue that conveys information
enhance the performance of wind speed prediction because between dendrites and axons or among dendrites of different

Condition Cases
Branch-m
Constant 1 Condition
Synapse Synapse Synapse Dendrite
Learning Algorithm Constant 0 Condition
Direct Condition
Inverse Condition
Branch-3
Synapse Synapse Synapse Dendrite

Direct Connection Inverse Connection
1 1
0.8 0.8
0.6 0.6
y
y
Branch-2 0.4 0.4

0.2 0.2
0 0
Synapse Synapse Synapse Dendrite –2 –1.5 –1 –0.5 0 0.5 1 1.5 2 –2 –1.5 –1 –0.5 0 0.5 1 1.5 2
x x
(a) (b)
Constant 1 Connection
1 1
Branch-1 0.8 0.8
0.6 0.6
y
y
0.4 0.4
Synapse Synapse Synapse Dendrite 0.2 0.2
0 0
–2 –1.5 –1 –0.5 0 0.5 1 1.5 2 –2 –1.5 –1 –0.5 0 0.5 1 1.5 2
(c)
Constant 0 Connection
1 1
0.8 0.8
X1 0.6 0.6
y
X2 Xn
y
0.4 0.4
Soma 0.2 0.2
0 0
–2 –1.5 –1 –0.5 0 0.5 1 1.5 2 –2 –1.5 –1 –0.5 0 0.5 1 1.5 2
Neuron Architecture x x
(d)
Condition Functions
AUGUST 2021 | IEEE COMPUTATIONAL INTELLIGENCE MAGAZINE

53
FIGURE 1 Illustration of the DNR architecture.
DNM. Therefore, in DNR, each u m is
The neural architecture of DNR consists of four layers, regarded as a parameter that needs to be
namely the synaptic layer, dendritic layer, membrane optimized via a learning algorithm; accord-
ingly, these values are specified differently
layer, and soma layer. for different problems.
neurons. These elements are distributed throughout the den- D. Cell Body (Soma)
dritic tree and possess various receptors for specific ions. The soma fires depending on whether the membrane poten-
Depending on the potential of the ions entering a receptor, tial exceeds a given threshold. This process can be mathemat-
the synapse changes its connection state and enters either an ically described as a sigmoid operation on the product terms,
excitatory or inhibitory state [45]. The process of signal as follows:
transmission can be described using the following equation:
O= 1 , (4)
1 1 + e -k (V - qs)
y im = , (1)
1 + e -k (w im xi - qim)
where V represents the output of the membrane layer and k
where x i represents the i-th input feature, whose range is and qs are positive constant hyperparameters.
[0,1], with i ! 61, 2, f, I @; y im is the output of the i-th syn-
apse on the m-th dendritic branch, with m ! 61, 2, f, M @; k E. Connection Cases
is a hyperparameter that is a positive constant; and w im and On the right side of Fig. 1, the six functions of the synaptic
q im are connection parameters that represent a weight and a layer are illustrated for various combinations of w im and q im.
threshold, respectively. To obtain the appropriate values for According to these six different functions, the connection
each problem, these connection parameters in a DNR model states of the synaptic layer can be divided into four main cate-
can be trained using a learning algorithm. gories, defined as follows: constant 1 connections, whose
parameters satisfy w im 1 0 1 q im or 0 1 w im 1 q im , implying
B. Dendrite Layer that regardless of the input, the output is always excitatory;
Each branch in the dendrite layer receives the output signals constant 0 connections, whose parameters satisfy q im 1 w im 1 0
from all synapses on that branch. The nonlinear relationship or q im 1 0 1 w im , implying that regardless of the input, the
among these signals plays a key role in neural information output remains inhibitory; excitatory connections, whose
processing for several sensory systems in biological networks, parameters satisfy 0 1 q im 1 w im , implying that the input and
such as the visual and auditory systems [46], [47]; in DNR, output are directly correlated; and inhibitory connections,
this relationship can be expressed in terms of multiplication whose parameters satisfy w im 1 q im 1 0, implying that the
operations. Let Z m represent the output of the m-th dendritic input and output are inversely correlated.
branch. The equation for a dendritic branch can be expressed
as follows: F. Learning Algorithm
Because of the multiplication operations applied in the den-
I drite layer, the parameter space of the model appears to be
Z m = % y im .(2)
i=1 extremely large and complicated. Additionally, weights are
added to the output of each dendritic branch in the DNR
C. Membrane Layer model, leading to an increase in the dimensionality of the
The membrane layer combines all outputs from the dendrite parameter space, which further increases the difficulty of
layer through a summation operation. Let V represent the out- optimizing the parameters. Consequently, it is difficult to
put of the membrane layer. The corresponding equation can perfectly train the parameters when using the traditional BP
be expressed as follows: algorithm. Therefore, in this study, the SMS algorithm is
adopted as a more suitable global optimization algorithm to
M
optimize the DNR model. The SMS algorithm is an evolu-
V= / û m ) Z mh,(3) tionary algorithm that mimics the variation in the states of
m=1
matter. Compared with the traditional BP algorithm, the
where u m represents the strength of the m-th dendritic SMS algorithm exhibits a higher search ability, is less likely to
branch. This value is constant and is always set to 1 for each fall into local optima, and seldom results in overfitting. In this
branch in the original DNM to simplify the neural architec- subsection, this algorithm is described in detail.
ture to accelerate the computation process [29]. However, in The process of searching for the best solution in the SMS
reality, the thicknesses and signal transformation strengths of algorithm can be expressed as a series of physical motions
the dendritic branches vary; thus, using a uniform u m value among molecules, which mimic the state transformations of
for all branches may degrade the regression ability of the matter [38]. Specifically, the SMS algorithm can be divided

into three phases: the gas phase, the liquid
phase, and the solid phase. The gas phase The process of searching for the best solution in the
corresponds to the previous solution set. In SMS algorithm can be expressed as a series of physical
the gas state, the molecules are far from one
another and widely distributed, and intense
motions among molecules, which mimic the state
motions and collisions occur among them. transformations of matter.
The liquid phase represents the intermediate
solution set. In the liquid state, the distances between mole- denote the corresponding distance threshold, which is calcu-
cules are considerably smaller than those in the gas state, the lated as follows:
distribution is relatively concentrated, and the movements are
/ ^b mhigh - b mlowh
n
relatively limited. Finally, the solid phase represents the later
solution set. In the solid state, the distances between mole- r = m=1 ) b, (8)
n
cules are minimal, the distribution is highly concentrated,
and only a small amount of motion occurs. where b is a constant positive value with a range of [0,1].
In each phase, three operations are executed between the If the distance between two individuals is less than r, they
molecules, defined as follows: will undergo collision, and this phenomenon can be
(1) Direction Vector Operator: The purpose of the direc- expressed as follows:
tion vector operator is to push other individuals to move
toward the best individual in the population. Let P i repre- d m = d i AND d i = d m .(9)
sent a member of the population, and let p best represent the
current best individual. P best attracts other members of the (3) Random Behavior: Following the rules of the SMS
population, causing them to move toward P best. Let d i be the algorithm, during the iterative process, each individual will
direction vector of the i-th member. The corresponding likely exhibit random behavior, i.e., the elements of the indi-
equation is as follows: viduals will randomly change. The function for this random
behavior can be expressed as follows:
d ti + 1 = d ti ) c 1 - t m ) 0.5 + P best - Pi , (5)
]] b m + Rand 2 ) ^b m - b m h,
Z low high low
epoch P best - Pi
t+1
P i, m = [ with probability H, (10)
\ P i, m , with probability ^1 - H h,
where t denotes the current iteration number and epoch ] t+1
denotes the total number of iterations. The obtained value d i
is used to calculate the velocity vector of the i-th member of where H represents the occurrence probability of the random
the population, as follows: behavior.
The entire iterative process in the SMS algorithm is
/ ^b high
m - bm h
n
low divided into three phases by establishing different c, a, b
and H values. In accordance with [38], we adopt the parame-
vi = di ) m = 1 ) c, (6)
n ters presented in Table II for the three phases. The complete
SMS process is described in Algorithm 1.
where c is a constant positive value with a range of [0,1] and
n represents the dimensionality of each member, namely, the III. Wind Speed Forecasting Framework
number of parameters in the model. The upper and lower The process of applying DNR for wind speed prediction is
bounds on the m-th parameter are b high m and b low
m , respectively. illustrated in Fig. 2. First, the time delay and embedding
Next, we can employ v i to update the position p i, where the dimensionality of the series are calculated to reconstruct the
update function is defined as follows: phase space. Next, the maximum Lyapunov exponent is cal-
culated from the time delay and embedding dimensionality.
P it,+m1 = p ti, m + v m ) Rand 1 ) ^b mhigh - b mlowh ) a, (7) When this exponent exceeds zero, the wind speed series
data are regarded as a chaotic time series. Subsequently, the
where Rand 1 is a random number between 0 and 1 and a is a
constant positive value with a range of [0,1].
(2) Collision Operator: When two molecules approach
TABLE I Parameter settings of the SMS algorithm
each other, they may collide. The collision operator emu- in different states.
lates this process; if two individuals have collided, their
STATE DURATION c a b PROBABILITY H
direction vectors are exchanged, which ensures the diversi-
ty of the population and prevents premature convergence in GAS 50% 0.8 0.8 [0.8,1] 0.9
the evolutionary process. Collision behavior occurs when LIQUID 40% 0.4 0.2 [0.0,0.6] 0.2
two individuals are sufficiently close to each other. Let r SOLID 10% 0.1 0 [0.0,0.1] 0.0

wind speed time series data are processed through a normal- been calculated using a certain approach, a group of new
ization operation. Finally, the DNR model can be employed vectors can be constructed as follows:
for wind speed prediction.
J x1 x2 f x N - 1 - x (m - 1)N
K O
x x f x N - 1 - x (m - 2)O
X =K
x+1 x+2
A. Phase Space Reconstruction ,(11)
K h h h h O
It is difficult to forecast the future trend of a chaotic time K O
series such as a wind speed series because of its irregulari- Lx x (m - 1) + 1 x x (m - 1) + 2 f xN - 1 P
ty. A chaotic time series can be considered to represent a T = ^x x (m - 1) + 2, x x (m - 1) + 3, f, x N h, (12)
type of random motion in a definite dynamical system.
Thus, we need to restore the original dynamical system where X and T are the input signals of the neural model and
of the chaotic time series. The most common method the target outputs, respectively. The matrix X can efficiently
to accomplish this is the delayed coordinate approach describe the original dynamical system if the appropriate val-
proposed by Takens [44]. For a chaotic time ser ies ues of the time delay x and embedding dimensionality m are
" x (i ) ; i = 1, 2, f, N ,, under the assumption that the time selected. The operation of constructing X from a time series
delay x and vector embedding dimensionality m have " x (i ) ; i = 1, 2, ..., N , is termed phase space reconstruction.
The core problem in performing this operation is to obtain
the appropriate values for x and m. Takens only proved the exis-
tence of the time delay and embedding dimensionality in theory
Algorithm 1 Pseudocode for the SMS algorithm. and did not provide a method for determining their values. In
Input: P opulation size N, Number of dimensions n, Maximum general, a time series always contains finite sequences with
number of iterations epoch. noise. There is no single method that can accurately obtain the
Result: Best solution P best . time delay and embedding dimensionality for every time series;
begin instead, the appropriate algorithm must be chosen depending
Initialize the population X = " P1, f, PN ,, the direction vector set on the actual situation. In this study, the FNN algorithm is
D 0 = " d 01, f, d 0N ,, the maximum number of iterations epoch = utilized to calculate the embedding dimensionality [48], and
1000, the state count phase = 1, and the current iteration the MI algorithm is employed to obtain the time delay [49].
number t = 1; The details of these two methods are presented in the two
repeat subsequent subsections.
if phase == 1 then
c = 0.8, a = 0.8, b = 0.9, H = 0.9, Dend = epoch ) 0.5; B. Mutual Information Algorithm
if phase == 2 then The MI algorithm is one of the most widely used methods for cal-
c = 0.4, a = 0.2, b = 0.5, H = 0.2, Dend = epoch ) 0.9;
culating the time delay of a time series and is based on the theory
of mutual information. The information entropy, H(X ), which is
if phase == 3 then
used to represent the degree of uncertainty of X, can be expressed
c = 0.1, a = 0.0, b = 0.0, H = 0.0, Dend = epoch;
as follows:
for t, t 1 = Dend, t = t + 1 do
Evaluate the fitness of the population n
F (X ) = " f (P1), f, f (PN) ,; H (X ) = - / P (x i) log P (x i), (13)
i=1
Set the individual with the best fitness as P best;
/* Perform direction vector operations */ where X denotes discrete random variables, P (x i) is the prob-
Calculate the new direction vector set D t with Eq. (5); ability of the occurrence of event x, and n is the total number
Use D t to calculate the velocity vector set V = " v 1, f, v N , of states x. H (X ; Y ) represents the conditional information
with Eq. (6); entropy and can be expressed as
Update each individual P n with Eq. (7);
/* Perform collision operations */
Calculate the threshold r using Eq. (8); n m
H (X ; Y ) = - / / P (y j ) P (x i ; y j ) log P (x i ; y j ),(14)
Calculate the distance between each pair of points; if the i=1 j=1
distance is less than r, exchange the direction vectors of
the points, as indicated in Eq. (9); where m represents the total number of states y, P (y j ) is the
/* Perform random behavior */ probability of event y occurring alone, and P (x i ; y j) denotes
Draw a random number RA in the range of [0,1]; if RA is the conditional probability of event x occurring under the
less than H, execute random behavior with Eq. (10); condition of the occurrence of event y. The MI entropy,
phase = phase + 1; I(X, Y ), can be calculated as
until phase 2 3;
return the best solution P best . I (X, Y ) = H (X ) + H (Y ) - H (X, Y ), (15)

where H(X, Y ) represents the joint information of X and Y If R i (m + 1) is considerably larger than R i (m), the two
and can be calculated using the following expression: points are considered to be false nearest neighbor points.
By modifying this equation, a more convenient function
n m for judging false nearest neighbor points can be obtained
H (X, Y ) = - / / P (x i, y j) log P (x i, y j).(16) as follows:
i=1 j=1
z i ( i + m x ) - z j ( j + mx )
For the time series " x (i ) ; i = 1, 2, f, N ,, P (x l) and
Rx = .(20)
R i ( m)
P (x l + x) are the probabilities that x l and x l + x, respectively,
appear in " x (i ) ; i = 1, 2, f, N , . Based on these definitions, R x is compared against a given threshold i, whose range is
the MI entropy I (x l, x l + x) for a time delay x can be specified set to [10, 50]. If R x is larger than this threshold, the points are
as follows: determined to be false nearest neighbor points. For a real-
world chaotic time series, such as a wind speed series, the ini-
I (x) = I (x l, x l + x) = H (x l) + H (x l + x) - H (x l, x l + x).(17) tial number of embedding dimensions is usually set to 2. Once
the proportion of false nearest neighbor points is less than 5%,
According to the MI algorithm, the value of x when the corresponding number of embedding dimensions is select-
I (x) reaches a local minimum for the first time is taken as ed as the final solution. In certain extreme cases, the propor-
the final solution. tion of false nearest neighbor points cannot decrease to 5%; in
such a case, the critical dimensionality at which the number of
C. False Nearest Neighbors Algorithm false nearest neighbor points stops decreasing is considered the
The FNN algorithm, which was proposed by Kennel in final solution.
1992 [48], is used to calculate the embedding dimensionali-
ty for a chaotic time series. This algorithm is based on the D. Maximum Lyapunov Exponent
premise that a chaotic time series, such as a series of wind Before a machine learning technique is employed to forecast
speed data, can be regarded as a set of continuously varying the future wind speed, the chaotic characteristics of the histor-
particles in a high-dimensional space mapped to a one- ical wind speed series must be confirmed. Lyapunov expo-
dimensional space. If the number of embedding dimensions nents are commonly used for this purpose. In 1983, it was
is too small, the particles will be compressed and folded proven that if at least one of the Lyapunov exponents in a
onto one another due to the insufficient extent of their spa- dynamical system is positive, the corresponding time series can
tial orbits, meaning that two adjacent points in the one- be considered chaotic [50]. Therefore, the maximum Lyapu-
dimensional space may cor respond to t wo particles nov exponent of the time series needs to be calculated. If this
separated by a large distance in the high-dimensional space. exponent exceeds zero, the time series is considered to have
Two such adjacent particles are def ined as false nearest chaotic characteristics. Among the various approaches for cal-
neighbor points. In this case, the embedding dimensionali- culating Lyapunov exponents, Wolf ’s method [51], which is
ty should be gradually increased to fully expand the spatial based on the phase space reconstruction theory of Takens, is
orbits. With an increase in the number of embedding considered to be the most effective.
dimensions, the particles are expected to gradually separate, For a series X (t ) = (x (t ), x (x + t ), f, x ((m - 1)x + t ) re
and the number of false nearest neighbor points is expected constructed using Takens’ theory, as described above, the
to gradually decrease. Once the embedding dimensionality distance L i between X (t i) and the closest point X (t n) can be
is set to a sufficient value such that all false nearest neighbor expressed as follows:
points are eliminated, the corresponding solution is consid-
ered to be the optimal solution. L i = X (t n) - X (t i) .(21)
Suppose that z i (m) = (y (i ), y (x + i ), f, y ((m - 1) x + i ))
is a vector in an m-dimensional phase space and that According to Wolf ’s method, the maximum Lyapunov
z j(m) is the nearest neighbor point of z i; the distance exponent can be calculated as
between z i(m) and z j (m) can be calculated using the fol- M
lowing equation: m max = 1 / ln L i , (22)
t M - t 0 i = 0 L li
R i (m) = z i (m) - z j (m) .(18) where t 0 and t M represent the initial time and the f inal
time, respectively, and M = N - (m - 1) x. In addition,
This distance changes as the number of dimensions the interval for time series prediction can be calculated
increases. Thus, the updated distance between z i (m + 1) and as follows [52], [53]:
z j (m + 1) can be expressed as follows:
Tt = 1 .(23)
R i2 (m + 1) = R i2 (m) + z i (i + mx) - z j ( j + mx ) .(19) m max

IV. Experiments and Discussion
It is difficult to forecast the future trend of a chaotic
time series such as a wind speed series because of A. Experimental Setup
The wind speed data used for experimental
its irregularity. But it can be considered to represent a analysis in this study pertain to a wind farm
type of random motion in a definite dynamical system. in Galicia and are available for free at http://
www.evwind.es. To comprehensively evaluate
the prediction performance of DNR, two sets of
This equation indicates that larger and smaller m max values wind speed data with different time intervals, specifically, ten
correspond to closer and more distant predictions, respectively. minutes and one day, were collected; these datasets are illus-
trated in Figs. 3(a) and (b), respectively. The two datasets were
E. Normalization used to verify long-term and short-term predictions. Each
For a raw time series, the first task is normalization. Normalization dataset contained 1,150 data points. The collected data were
is a necessary and useful preprocessing step in the field of ANNs. By divided into two subsets: training data and test data. The
normalizing the data, all features can be mapped to the same scale, training data were used to train the DNR parameters, and the
thereby reducing the computational cost and increasing the conver- test data were utilized to evaluate the performance of the
gence speed of optimization algorithms. In this study, the one- trained model in terms of several indicators. The use of differ-
dimensional wind speed time series " x (i ) ; i = 1, 2, f, N , is ent ratios of training data to test data can help determine the
normalized using the min-max method, as follows: dependence of an algorithm on the number of training data;
thus, three ratios were considered in our experiments: 1:1, 4:1
x t - MIN (x t)
Yt = , (24) and 9:1. All experiments were conducted on a personal PC
MAX (x t) - MIN (x t)
with a 2.90 GHz Intel(R) Core i7 CPU and 16 GB of memory
where MAX (X t) and MIN (X t) represent the maximum and using MATLAB R2018b.
minimum values of the input vector, respectively. Using Eq.
(24), both the training and test samples are normalized to the B. Evaluation Indicators
range [0,1]. In addition to min-max normalization, other To fairly evaluate the performance of each algorithm, several
methods, such as mean and variance normalization and simple evaluation methods were adopted in this study. The details of
normalization, are also commonly used [54], [55]. these methods are given as follows.
Start
No Chaotic
No
Estimate Whether
It Is Chaotic
Wind Speed Time Series
Yes
Divide Into the Training and Implement the

Testing Datasets Normalization Operator
Calculate Time Delay and

Train DNR on the Training
Embedding Dimension
Dataset
Conduct Phase Space

Reconstruction Evaluate DNR on the Test
Dataset
Calculate Maximum
Lyapunov Exponent
End
FIGURE 2 Flowchart of DNR-based wind speed forecasting.

(1) Evaluation Indicators: To compare the prediction Before employing the DNR model to forecast the wind
performance of the algorithms, four widely used indica- speed, three hyperparameters, M, pop, and epoch, which could
tors were evaluated in our experiments: the mean-squared inf luence the performance of the proposed algorithm, were
error (MSE), the mean absolute error (MAE), the root- established. M denotes the number of dendritic branches, and
mean-squared error (RMSE) and the correlation coeffi- Pop and epoch represent the population size and maximum iter-
cient of the predictions (R). These indicators are defined ation time, respectively, for the SMS algorithm. It should be
as follows: noted that determining the best hyperparameter combination
is a difficult and time-consuming task. Although several
n
approaches have been proposed for the automatic tuning of
MSE = 1 / (O - T )2, (25)
n i=1 hyperparameters, the results are comparable to those of the tri-
n al-and-error approach. Based on empirical evidence, k, qs, M,
MAE = 1 / ;T - O ; , (26) pop, and epoch were set to 6, 0.8, 9, 100 and 1000, respectively,
n i=1
in this experiment.
n
1 / (O - T )2 , (27) Moreover, several commonly used prediction techniques
RMSE =
n i=1 were compared with the proposed model, including the
n
/ (T - Tr ) (O - Or )
i =1
R= n n , (28) Wind Speed Time Series Curve (One Day)
25
/ (T - Tr )2 / (O - Or )2 20
i =1 i =1 15
M/S
10
5
where O is the prediction output and T is the actual 0
0 200 400 600 800 1,000 1,200
result. MSE, MAE, and RMSE represent different kinds
of errors between the outputs and targets. Smaller indi- Wind Speed Time Series Curve (Ten Minutes)
16
cator values correspond to higher prediction perfor- 12
mance. R represents the degree of fit between the output
M/S
8
curve of a prediction algorithm and the original curve. 4
0
If R is equal to 1, the predicted curve fits the original 0 200 400 600 800 1,000 1,200
curve perfectly.
(2) Nonparametric Statistical Test: This test detects wheth- FIGURE 3 Wind speed curves for long-term and short-term predictions.
er a significant difference exists between the proposed meth-
od and another algorithm. In this study, the Wilcoxon rank
sum test [56], [57] was conducted using the KEEL software TABLE II Results for the time delays, embedding
[58]. The significance level was set to 5%. If the p-value of the dimensionalities and maximum Lyapunov exponents of
wind speed series at different time scales.
Wilcoxon rank sum test was less than 5%, a significant differ-
ence in performance was considered to exist between the two MAXIMUM
TIME EMBEDDING LYAPUNOV
algorithms being compared. TIME DELAY DIMENSIONALITY EXPONENT
(3) Relative Graphs: Graphs related to an experiment INTERVAL (x) (m) (mmax) CHAOTIC
enable more intuitive observation of the experimental ONE DAY 2 4 0.1181 YES
results. The following graphs were generated in this study. TEN MIN 5 7 0.0330 YES
First, fit graphs and correlation coefficient graphs were plot-
ted to visualize how well the predicted results matched the
target values. Second, a convergence graph was produced to TABLE III Parameter settings of algorithms for predicting
illustrate the speed and stability of the proposed model dur- wind speed time series.
ing the convergence process. ALGORITHM PARAMETERS
MLP HIDDENLAYER = 10, LEARNINGRATE = 0.01,
C. Performance Comparison and Analysis EPOCH = 1000
Following the approaches described above, the time delays, ENN LEARNINGRATE = 0.01, EPOCH = 1000
embedding dimensionalities and maximum Lyapunov SVR LINEAR, POLYNOMIAL, RBF AND SIGMOID KERNELS
exponents of the wind speed series at different time scales were
LSTM HIDDENUNITS = 200, EPOCH = 1000
calculated, and the results are presented in Table II. The maxi-
DNM-BP LEARNINGRATE = 0.01, EPOCH = 1000
mum Lyapunov exponents presented in Table II imply that the
two wind speed time series are chaotic. Next, phase space EDNM POPSIZE = 100, EPOCH = 1000
reconstruction was conducted based on the time delay and DNR-BP LEARNINGRATE = 0.01, EPOCH = 1000
embedding dimensionality. DNR-SMS POPSIZE = 100, EPOCH = 1000

TABLE IV Long-term prediction performance of all models for the experimental wind speed time series.
LONG-TERM WIND SPEED TIME SERIES WITH A PARTITION RATIO OF 1:1
MSE (MEAN ± STD RMSE (MEAN ± STD MAE (MEAN ± STD R (MEAN ± STD
ALGORITHM (OR SD)) P-VALUE (OR SD)) P-VALUE (OR SD)) P-VALUE (OR SD))
ENN 1.19E-01±3.77E-01 9.13E-07 2.28E-01±2.59E-01 9.13E-07 1.99E+00±3.50E-01 9.13E-07 4.01E-01±1.26E-01
MLP 2.86E-02±5.42E-03 9.13E-07 1.68E-01±1.54E-02 9.13E-07 2.24E+00±2.18E-01 9.12E-07 2.94E-01±1.18E-01
DT 2.33E-02±0.00E+00 9.13E-07 1.53E-01±0.00E+00 9.12E-07 2.04E+00±0.00E+00 9.12E-07 4.45E-01±0.00E+00
SVR-L 1.66E-02±0.00E+00 1.00E+00 1.29E-01±0.00E+00 1.00E+00 1.68E+00±0.00E+00 1.00E+00 5.59E-01±0.00E+00
SVR-P 2.07E-02±0.00E+00 9.13E-07 1.44E-01±0.00E+00 9.12E-07 1.98E+00±0.00E+00 9.12E-07 4.56E-01±0.00E+00
SVR-R 1.65E-02±0.00E+00 1.00E+00 1.28E-01±0.00E+00 1.00E+00 1.70E+00±0.00E+00 1.00E+00 5.68E-01±0.00E+00
SVR-S 1.68E-02±0.00E+00 9.94E-01 1.30E-01±0.00E+00 9.93E-01 1.69E+00±0.00E+00 1.00E+00 5.51E-01±0.00E+00
LSTM 1.99E-02±2.10E-04 9.13E-07 1.41E-01±7.48E-04 9.11E-07 1.95E+00±1.27E-02 9.12E-07 4.90E-01±1.25E-02
DNM-BP 1.70E-02±3.63E-04 5.69E-01 1.30E-01±1.39E-03 5.69E-01 1.70E+00±4.40E-02 9.41E-01 5.50E-01±1.97E-02
EDNM 1.72E-02±4.20E-04 2.10E-03 1.31E-01±1.60E-03 2.10E-03 1.72E+00±3.48E-02 2.11E-01 5.50E-01±1.35E-02
DNR-BP 1.75E-02±1.86E-03 3.04E-01 1.32E-01±6.59E-03 3.04E-01 1.73E+00±9.29E-02 5.73E-01 5.41E-01±7.08E-02
DNR-SMS 1.69E-02±2.04E-04 - 1.30E-01±7.85E-04 - 1.72E+00±1.07E-02 - 5.53E-01±5.73E-03
ENN 3.15E-02±4.48E-02 9.13E-07 1.62E-01±7.20E-02 9.13E-07 1.83E+00±1.21E-01 9.13E-07 3.92E-01±7.47E-02
MLP 2.72E-02±4.10E-03 9.13E-07 1.64E-01±1.24E-02 9.13E-07 2.12E+00±1.63E-01 9.13E-07 2.83E-01±1.21E-01
DT 2.57E-02±0.00E+00 9.11E-07 1.60E-01±0.00E+00 9.12E-07 2.13E+00±0.00E+00 9.12E-07 3.54E-01±0.00E+00
SVR-L 1.73E-02±0.00E+00 3.32E-06 1.32E-01±0.00E+00 3.32E-06 1.69E+00±0.00E+00 1.37E-06 5.01E-01±0.00E+00
SVR-R 1.71E-02±0.00E+00 3.75E-01 1.31E-01±0.00E+00 3.83E-01 1.68E+00±0.00E+00 8.35E-05 5.08E-01±0.00E+00
SVR-S 1.75E-02±0.00E+00 9.11E-07 1.32E-01±0.00E+00 9.12E-07 1.70E+00±0.00E+00 9.12E-07 4.94E-01±0.00E+00
DNR-SMS 1.71E-02±1.84E-04 - 1.31E-01±7.05E-04 - 1.67E+00±9.46E-03 - 5.14E-01±6.22E-03
ENN 2.70E-02±3.88E-03 1.01E-06 1.64E-01±1.10E-02 1.01E-06 2.01E+00±7.23E-02 9.13E-07 4.05E-01±5.49E-02
MLP 3.94E-02±6.43E-03 9.13E-07 1.98E-01±1.60E-02 9.13E-07 2.56E+00±2.31E-01 9.13E-07 2.05E-01±1.03E-01
DT 3.36E-02±0.00E+00 9.12E-07 1.83E-01±0.00E+00 9.12E-07 2.36E+00±0.00E+00 9.12E-07 3.31E-01±0.00E+00
SVR-L 2.25E-02±0.00E+00 4.88E-06 1.50E-01±0.00E+00 4.88E-06 1.89E+00±0.00E+00 9.12E-07 4.76E-01±0.00E+00
SVR-R 2.22E-02±0.00E+00 1.38E-01 1.49E-01±0.00E+00 1.33E-01 1.88E+00±0.00E+00 4.88E-06 4.85E-01±0.00E+00
SVR-S 2.28E-02±0.00E+00 9.12E-07 1.51E-01±0.00E+00 9.12E-07 1.91E+00±0.00E+00 9.12E-07 4.68E-01±0.00E+00
DNR-SMS 2.21E-02±3.55E-04 - 1.49E-01±1.20E-03 - 1.86E+00±1.35E-02 - 4.94E-01±9.41E-03

TABLE V Short-term prediction performance of all models for the experimental wind speed time series.
SHORT-TERM WIND SPEED TIME SERIES WITH A PARTITION RATIO OF 1:1
ENN 4.35E-02±6.37E-02 9.13E-07 1.80E-01±1.06E-01 9.13E-07 9.84E-01±2.42E-01 9.13E-07 6.03E-01±2.53E-01
MLP 2.87E-02±1.08E-02 9.13E-07 1.66E-01±3.15E-02 9.13E-07 1.21E+00±2.24E-01 9.13E-07 5.75E-01±2.01E-01
DT 1.21E-02±0.00E+00 9.13E-07 1.10E-01±0.00E+00 9.13E-07 7.94E-01±0.00E+00 9.13E-07 8.09E-01±0.00E+00
SVR-L 4.69E-03±0.00E+00 1.00E+00 6.85E-02±0.00E+00 1.00E+00 4.89E-01±0.00E+00 1.00E+00 9.26E-01±0.00E+00
SVR-R 5.85E-03±0.00E+00 8.59E-06 7.65E-02±0.00E+00 8.59E-06 5.50E-01±0.00E+00 1.37E-06 9.25E-01±0.00E+00
SVR-S 5.22E-03±0.00E+00 9.97E-01 7.22E-02±0.00E+00 9.97E-01 5.18E-01±0.00E+00 9.81E-01 9.15E-01±0.00E+00
LSTM 1.47E-02±1.99E-03 9.13E-07 1.21E-01±8.00E-03 9.13E-07 9.29E-01±6.56E-02 9.13E-07 8.63E-01±5.28E-02
DNM-BP 5.96E-03±3.04E-03 4.19E-01 7.59E-02±1.44E-02 4.67E-01 5.41E-01±1.08E-01 5.08E-01 9.15E-01±5.45E-02
EDNM 2.26E-02±2.96E-02 1.24E-06 1.30E-01±7.56E-02 1.24E-06 7.24E-01±2.95E-01 4.44E-06 7.25E-01±2.65E-01
DNR-BP 9.19E-03±1.98E-02 4.51E-01 8.29E-02±4.82E-02 4.75E-01 6.03E-01±3.84E-01 3.11E-01 8.99E-01±1.38E-01
DNR-SMS 5.43E-03±3.54E-04 - 7.36E-02±2.41E-03 - 5.24E-01±1.58E-02 - 9.21E-01±5.28E-03
ENN 1.03E-02±3.13E-02 9.13E-07 7.87E-02±6.42E-02 9.13E-07 5.12E-01±6.91E-02 9.13E-07 8.86E-01±1.24E-01
MLP 1.80E-02±7.20E-03 9.13E-07 1.32E-01±2.66E-02 9.13E-07 1.01E+00±2.14E-01 9.13E-07 6.03E-01±1.78E-01
DT 5.89E-03±0.00E+00 9.13E-07 7.68E-02±0.00E+00 9.13E-07 5.90E-01±0.00E+00 9.13E-07 8.82E-01±0.00E+00
SVR-L 3.42E-03±0.00E+00 3.02E-06 5.85E-02±0.00E+00 2.48E-06 4.48E-01±0.00E+00 2.48E-06 9.35E-01±0.00E+00
SVR-P 1.49E-02±0.00E+00 9.13E-07 1.22E-01±0.00E+00 9.13E-07 9.48E-01±0.00E+00 9.13E-07 7.73E-01±0.00E+00
DNR-SMS 3.15E-03±1.50E-04 - 5.61E-02±1.32E-03 - 4.27E-01±1.14E-02 - 9.36E-01±3.12E-03
ENN 4.47E-03±4.25E-04 9.13E-07 6.68E-02±3.12E-03 9.13E-07 4.98E-01±2.39E-02 9.13E-07 8.77E-01±1.21E-02
MLP 1.92E-02±4.41E-03 9.13E-07 1.37E-01±1.70E-02 9.13E-07 1.08E+00±1.46E-01 9.13E-07 4.58E-01±1.36E-01
DT 6.73E-03±0.00E+00 9.13E-07 8.20E-02±0.00E+00 9.13E-07 6.23E-01±0.00E+00 9.13E-07 8.14E-01±0.00E+00
SVR-L 3.81E-03±0.00E+00 9.13E-07 6.18E-02±0.00E+00 9.13E-07 4.64E-01±0.00E+00 9.13E-07 8.93E-01±0.00E+00
SVR-P 9.06E-03±0.00E+00 9.13E-07 9.52E-02±0.00E+00 9.13E-07 7.75E-01±0.00E+00 9.13E-07 8.24E-01±0.00E+00
DNR-SMS 3.50E-03±7.30E-05 - 5.91E-02±6.17E-04 - 4.46E-01±7.34E-03 - 9.03E-01±2.07E-03

Elman neural network (ENN) [59]; a multilayer perceptron specifically, the original DNM, the DNM trained with the
(MLP); a decision tree (DT) model; support vector regres- L-SHADE algorithm (EDNM), and a DNR model trained
sion with a linear kernel (SVR-l), a polynomial kernel with the BP algorithm. The parameter settings for all algo-
(SVR-p), a radial basis function (RBF) kernel (SVR-r), and rithms are listed in Table III. To ensure fair comparisons,
a sigmoid kernel (SVR-s) [60]; a long short-term memory the experiments with all algorithms were conducted 30
(LSTM) network; and several variants of the DNM, times for each time series.
Since the size of the training dataset can impact the fore-
casting accuracy, the prediction results of DNR-SMS were
evaluated on wind speed data with different partition ratios
0.4 between the training and test datasets, specifically, 1:1, 4:1
One-Day 1:1 and 9:1. The average and standard deviation for each
0.3 One-Day 4:1
One-Day 9:1 approach are summarized in Tables IV and V. Note that
MSE
0.2 Ten-Min 1:1 MSE and RMSE were calculated using the normalized data,
Ten-Min 4:1 whereas MAE and R were calculated using the raw data.
0.1 Ten-Min 9:1
Tables IV and V show that for most algorithms, the predic-
0 tion results for both the long-term and short-term wind
0 100 200 300 400 500
Learning Epoch
speed time series with a partition ratio of 4:1 were superior to
those for a ratio of 1:1; moreover, the prediction results for a
FIGURE 4 Convergence curves of the DNR model for the experimental partition ratio of 9:1 were inferior to those for 4:1. These
wind speed time series. f indings suggest that merely increasing the number of
Output ~= 0.34 ∗ Target + 4.6

Training and Prediction (One-Day 1:1) 2 4 6 8 10 12 14 16
The Values of InFlow (M/S)
20
Data 16
Actual Value Training Fit
15 Prediction 14
Predictive Value Y=T 12
10 10
8
5 6
4
0 2
0 200 400 600 800 1,000 1,200 Target
(a)

20 16
Actual Value Training Prediction Data 14
15 Predictive Value Fit
Y=T 12
10 10
8
5 6
4
0 2
0 200 400 600 800 1,000 1,200 Target
(b)

Output ~= 0.27 ∗ Target + 5
20 16
Actual Value Data
Training Prediction Fit 14
15 Predictive Value
Y=T 12
10 10
8
5 6
4
0 2
0 200 400 600 800 1,000 1,200 Target
(c)
FIGURE 5 Training and prediction results and predicted correlation coefficients of DNR-SMS for the wind speed time series with an interval of
one day and partition ratios of (a) 1:1, (b) 4:1, and (c) 9:1.

training samples does not always enhance
the prediction performance and that a par- To fairly evaluate the performance of each algorithm,
tition ratio of 4:1 may be a suitable choice. several evaluation methods were adopted in this
In addition, for a partition ratio of 1:1,
SVR-r and SVR-l yielded superior results
study, such as the evaluation indicators, nonparametric
on both the long-term and short-term statistical test, and relative graphs.
wind speed data compared to those
obtained using the other algorithms. Howev-
er, when the partition ratio was 4:1 or 9:1, DNR-SMS SVR-r and SVR-l outperform DNR-SMS for the ratio of
achieved the highest prediction performance among the test- 1:1, the corresponding p-values are larger than 5%, indicat-
ed algorithms. These findings imply that the training pro- ing that the performance is not significantly higher than
cesses for SVR-r and SVR-l require less data than those for that of DNR-SMS. However, for ratios of 4:1 and 9:1,
other algorithms. Thus, when the length of the wind speed DNR-SMS can produce significantly superior results com-
time series is insufficient, SVR-r and SVR-l can provide pared to the other algorithms, such as the ENN, the MLP,
more satisfactory results. However, if an adequate amount of the various SVR methods and the LSTM network, based
wind speed time series data is available, DNR can demon- on the corresponding p-values. Moreover, compared with
strate highly competitive performance. the prediction performance of the original DNM-BP algo-
To identify the signif icant differences between the rithm and EDNM, that of the proposed DNR model is
compared algorithms, the p-values of the Wilcoxon rank significantly higher, which implies that considering the
sum test are also provided in Tables IV and V. Although strength of each dendritic branch individually can enhance
Training and Prediction (Ten-Min 1:1) 1 2 3 4 5 6 7
Output ~= 0.74 ∗ Target + 1

10 7
Data
Actual Value Training Prediction Fit 6
8 Predictive Value Y=T 5
6
4
4 3
2 2
1
0
0 200 400 600 800 1,000 1,200 Target
(a)
Training and Prediction (Ten-Min 4:1) 1 2 3 4 5 6

10
Actual Value Training Data
Data 6
8 Prediction Fit
Fit
Predictive Value 5
Y == TT
Y
6 4
4 3
2 2
0 1
0 200 400 600 800 1,000 1,200 Target
(b)
Training and Prediction (Ten-Min 9:1) 2 3 4 5 6

10
Actual Value Data
Data
8 Training Prediction Fit
Fit 6
Predictive Value
Y
Y == TT
6 5
4 4
2 3
0 2
0 200 400 600 800 1,000 1,200 Target
(c)
FIGURE 6 Training and prediction results and predicted correlation coefficients of DNR-SMS for the wind speed time series with an interval of
ten minutes and partition ratios of (a) 1:1, (b) 4:1, and (c) 9:1.

and short-term wind speed prediction,
Experimental results demonstrate that the DNR respectively. Figures 5 and 6 indicate that
can provide satisfactory and stable forecasting long-term wind speed time series prediction
is more complex than short-term prediction.
performances for wind speed time series. Although the forecast curves of DNR-SMS
closely fit the actual curves for both long-
the regression ability. Thus, DNR is more suitable for term and short-term wind speed prediction with different
solving prediction problems than the DNM. In addition, a partition ratios, the long-term prediction performance of
comparison of DNR-SMS and DNR-BP is presented in DNR-SMS is higher than the short-term prediction perfor-
Tables IV and V. DNR-SMS is significantly superior to mance, as indicated by the correlation coefficient graphs. In
DNR-BP in most cases, suggesting that the SMS algo- addition, for both long-term and short-term wind speed pre-
rithm, as a global optimization approach that does not eas- diction, the performance of the DNR model trained using
ily become trapped in local minima, can offer enhanced the SMS algorithm was compared with that of DNR models
optimization performance compared to that achieved using trained using other state-of-the-art evolutionary algorithms,
the BP algorithm. namely, a genetic algorithm (GA), particle swarm optimiza-
Furthermore, the convergence curves of DNR-SMS for tion (PSO), adaptive differential evolution with an optional
the wind speed time ser ies are provided in Fig. 4. external archive (JADE) [61] and differential evolution with
Although the maximum number of iterations of the SMS success-history-based parameter adaptation and linear popula-
algorithm was set to 1000, convergence was attained in less tion size reduction (L-SHADE) [62]. The experimental
than 200 iterations. Thus, the DNR model trained using results are presented in Tables VI and VII. The SMS algo-
the SMS algorithm could achieve prompt convergence, rithm notably outperforms the other algorithms. Conse-
indicating that the SMS algorithm is an efficient optimiza- quently, the SMS algorithm is recommended as the main
tion tool for real-world prediction problems. In addition, optimization approach for DNR. These experimental results
Figures 5 and 6 show the fit graphs and corresponding demonstrate that the proposed approach can enable satisfacto-
correlation coefficient graphs of DNR-SMS for long-term ry and stable forecasting for wind speed time series.
TABLE VI Long-term prediction performance of DNR models trained using different evolution algorithms
for the experimental wind speed time series.
DNR-GA 1.91E-02±1.55E-03 1.01E-06 1.38E-01±5.50E-03 1.01E-06 1.84E+00±1.16E-01 4.89E-06 5.02E-01±5.98E-02
DNR-PSO 1.73E-02±6.64E-04 1.01E-02 1.32E-01±2.50E-03 1.06E-02 1.72E+00±1.73E-02 3.48E-01 5.46E-01±1.68E-02
DNR-JADE 1.74E-02±6.89E-04 9.13E-07 1.32E-01±2.59E-03 9.12E-07 1.73E+00±4.62E-02 9.12E-07 5.46E-01±2.10E-02
DNR-L-SHADE 1.81E-02±2.15E-03 1.46E-02 1.34E-01±7.48E-03 1.54E-02 1.76E+00±1.08E-01 8.41E-02 5.27E-01±6.00E-02
DNR-SMS 1.69E-02±2.04E-04 - 1.30E-01±7.85E-04 - 1.72E+00±1.07E-02 - 5.53E-01±5.73E-03
DNR-SMS 1.71E-02±1.84E-04 - 1.31E-01±7.05E-04 - 1.67E+00±9.46E-03 - 5.14E-01±6.22E-03
DNR-SMS 2.21E-02±3.55E-04 - 1.49E-01±1.20E-03 - 1.86E+00±1.35E-02 - 4.94E-01±9.41E-03

TABLE VII Short-term prediction performance of DNR models trained using different evolution algorithms
for the experimental wind speed time series.
DNR-GA 1.14E-02±4.36E-03 1.12E-06 1.05E-01±1.94E-02 1.12E-06 7.56E-01±1.47E-01 1.24E-06 8.46E-01±7.47E-02
DNR-PSO 1.09E-02±1.51E-02 1.37E-06 9.63E-02±4.08E-02 1.37E-06 6.11E-01±1.49E-01 3.02E-06 8.46E-01±1.52E-01
DNR-JADE 8.55E-03±5.98E-03 2.54E-04 8.91E-02±2.47E-02 2.75E-04 5.99E-01±1.13E-01 2.12E-03 8.80E-01±9.16E-02
DNR-L-SHADE 1.27E-02±1.50E-02 9.13E-07 1.05E-01±4.06E-02 9.13E-07 6.41E-01±1.44E-01 1.12E-06 8.11E-01±1.45E-01
DNR-SMS 5.43E-03±3.54E-04 - 7.36E-02±2.41E-03 - 5.24E-01±1.58E-02 - 9.21E-01±5.28E-03
DNR-SMS 3.15E-03±1.50E-04 - 5.61E-02±1.32E-03 - 4.27E-01±1.14E-02 - 9.36E-01±3.12E-03
DNR-SMS 3.50E-03±7.30E-05 - 5.91E-02±6.17E-04 - 4.46E-01±7.34E-03 - 9.03E-01±2.07E-03
V. Conclusion model. Second, DNR-SMS can be applied to solve other pre-

A novel DNR method inspired by biological neurons is diction problems, such as container throughput forecasting and
employed to predict wind speed time series. To improve the power load prediction.
prediction results, the neural architecture for DNR is trained
using the SMS algorithm, which is a global optimization algo- Acknowledgments
rithm with powerful search capabilities. Before the prediction This work was supported by a Project of the Guangdong Basic and
process, the time delay and the number of embedding dimen- Applied Basic Research Fund (No. 2019A1515111139), by the
sions are calculated using the MI algorithm and the FNN National Natural Science Foundation of China under Grants
algorithm, respectively. Based on the results, phase space 61876110 and 61876162, by a Key Program of the Joint Funds
reconstruction of the wind speed data is performed, and the of the National Natural Science Foundation of China under
maximum Lyapunov exponent is calculated to detect the cha- Grant U1713212, and in part by the Research Grants Council
otic nature of the wind speed series. Comprehensive experi- of the Hong Kong Special Administrative Region, China,
ments conducted to evaluate the effectiveness of the proposed under Grant PolyU11202418 and Grant PolyU11209219.
DNR-SMS approach are reported, in which several common-
ly used prediction approaches are considered for comparison References
[1] W. Yang, J. Wang, H. Lu, T. Niu, and P. Du, “Hybrid wind energy forecasting and
and four diverse evaluation indicators are employed. The analysis system based on divide and conquer scheme: A case study in China,” J. Cleaner
experimental results and statistical tests demonstrate that DNR Production, vol. 222, pp. 942–959, 2019. doi: 10.1016/j.jclepro.2019.03.036.
[2] C. Tian, Y. Hao, and J. Hu, “A novel wind speed forecasting system based on hybrid
can outperform the other compared approaches. The proposed data preprocessing and multi-objective optimization,” Appl. Energy, vol. 231, pp. 301–319,
DNR-SMS model is thus an efficient tool for wind speed 2018. doi: 10.1016/j.apenergy.2018.09.012.
[3] H. Li, J. Wang, H. Lu, and Z. Guo, “Research and application of a combined model
prediction due to its high prediction accuracy and robust sta- based on variable weight for short term wind speed forecasting,” Renewable Energy, vol.
bility. Further research can be conducted in the following 116, pp. 669–684, 2018. doi: 10.1016/j.renene.2017.09.089.
[4] M. Imal, M. Sekkeli, C. Yildiz, and F. Kececioglu, “Wind energy potential estimation
directions. First, other optimization algorithms can be evaluat- and evaluation of electricity generation in Kahramanmaras, Turkey,” Energy Educ. Sci.
ed to enhance the forecasting performance of the DNR Technol. A, Energy Sci. Res., vol. 30, no. 1, pp. 661–672, 2012.

[5] J. Zhao, Z.-H. Guo, Z.-Y. Su, Z.-Y. Zhao, X. Xiao, and F. Liu, “An improved multi- [34] T. Elsken, J. H. Metzen, F. Hutter et al., “Neural architecture search: A survey.” J.
step forecasting model based on WRF ensembles and creative fuzzy systems for wind Mach. Learn. Res., vol. 20, no. 55, pp. 1–21, 2019.
speed,” Appl. Energy, vol. 162, pp. 808–826, 2016. doi: 10.1016/j.apenergy.2015.10.145. [35] Y. Sun, B. Xue, M. Zhang, G. G. Yen, and J. Lv, “Automatically designing CNN
[6] C. Y ild iz, M. Tekin, A. Gani, Ö. F. Keçeciog˘lu, H. Aç ikgöz, and M. S¸ekkeli, “A architectures using the genetic algorithm for image classification,” IEEE Trans. Cybern.,
day-ahead wind power scenario generation, reduction, and quality test tool,” Sustainabil- vol. 50, no. 9, pp. 3840–3854, 2020. doi: 10.1109/TCYB.2020.2983860.
ity, vol. 9, no. 5, p. 864, 2017. doi: 10.3390/su9050864. [36] Z. Lu, I. Whalen, Y. Dhebar, K. Deb, E. Goodman, W. Banzhaf, and V. N. Boddeti,
[7] D. Wang, H. Luo, O. Grunder, and Y. Lin, “Multi-step ahead wind speed forecasting using an “Multi-objective evolutionary design of deep convolutional neural networks for image
improved wavelet neural network combining variational mode decomposition and phase space re- classification,” IEEE Trans. Evol. Comput., 2020. doi: 10.1109/TEVC.2020.3024708.
construction,” Renewable Energy, vol. 113, pp. 1345–1358, 2017. doi: 10.1016/j.renene.2017.06.095. [37] J. Ji, Y. Tang, L. Ma, J. Li, Q. Lin, Z. Tang, and Y. Todo, “Accuracy versus simplifica-
[8] J. Wang, P. Du, T. Niu, and W. Yang, “A novel hybrid system based on a new proposed tion in an approximate logic neural model,” IEEE Trans. Neural Netw. Learn. Syst., 2020.
algorithm–multi-objective whale optimization algorithm for wind speed forecasting,” [38] E. Cuevas, A. Echavarría, and M. A. Ramírez-Ortegón, “An optimization algorithm
Appl. Energy, vol. 208, pp. 344–360, 2017. doi: 10.1016/j.apenergy.2017.10.031. inspired by the states of matter that improves the balance between exploration and exploi-
[9] J. Song, J. Wang, and H. Lu, “A novel combined model based on advanced opti- tation,” Appl. Intell., vol. 40, no. 2, pp. 256–272, 2014. doi: 10.1007/s10489-013-0458-0.
mization algorithm for short-term wind speed forecasting,” Appl. Energy, vol. 215, pp. [39] N. Chouikhi, B. Ammar, N. Rokbani, and A. M. Alimi, “PSO-based analysis of
643–658, 2018. doi: 10.1016/j.apenergy.2018.02.070. echo state network parameters for time series forecasting,” Appl. Soft Comput., vol. 55, pp.
[10] T. Howard and P. Clark, “Correction and downscaling of NWP wind speed fore- 211–225, 2017. doi: 10.1016/j.asoc.2017.01.049.
casts,” Meteorol. Appl., J. Forecasting, Pract. Appl., Train. Techn. Model., vol. 14, no. 2, pp. [40] W. Jia, D. Zhao, Y. Zheng, and S. Hou, “A novel optimized GA–Elman neural net-
105–116, 2007. doi: 10.1002/met.12. work algorithm,” Neural Comput. Appl., vol. 31, no. 2, pp. 449–459, 2019. doi: 10.1007/
[11] P. Du, J. Wang, W. Yang, and T. Niu, “A novel hybrid model for short-term wind power s00521-017-3076-7.
forecasting,” Appl. Soft Comput., vol. 80, pp. 93–106, 2019. doi: 10.1016/j.asoc.2019.03.035. [41] P. Ong and Z. Zainuddin, “Optimizing wavelet neural networks using modified
[12] Y. Wang, J. Wang, G. Zhao, and Y. Dong, “Application of residual modification ap- cuckoo search for multi-step ahead chaotic time series prediction,” Appl. Soft Comput.,
proach in seasonal ARIMA for electricity demand forecasting: A case study of China,” vol. 80, pp. 374–386, 2019. doi: 10.1016/j.asoc.2019.04.016.
Energy Pol., vol. 48, pp. 284–294, 2012. doi: 10.1016/j.enpol.2012.05.026. [42] D. Z. Li, W. Wang, and F. Ismail, “An evolving fuzzy neural predictor for multi-
[13] J. Wang, J. Heng, L. Xiao, and C. Wang, “Research and application of a combined dimensional system state forecasting,” Neurocomputing, vol. 145, pp. 381–391, 2014. doi:
model based on multi-objective optimization for multi-step ahead wind speed forecast- 10.1016/j.neucom.2014.05.014.
ing,” Energy, vol. 125, pp. 591–613, 2017. doi: 10.1016/j.energy.2017.02.150. [43] T. Desell, S. Clachar, J. Higgins, and B. Wild, “Evolving deep recurrent neural
[14] R. G. Kavasseri and K. Seetharaman, “Day-ahead wind speed forecasting using f-ARIMA networks using ant colony optimization,” in Proc. Eur. Conf. Evol. Comput. Combinatorial
models,” Renewable Energy, vol. 34, no. 5, pp. 1388–1393, 2009. doi: 10.1016/j.renene.2008.09.006. Optimization. Springer-Verlag, 2015, pp. 86–98.
[15] R. Li and Y. Jin, “A wind speed interval prediction system based on multi-objective [44] F. Takens, “Detecting strange attractors in turbulence,” in Dynamical Systems and
optimization for machine learning method,” Appl. Energy, vol. 228, pp. 2207–2220, 2018. Turbulence, Warwick 1980. SpringerVerlag, 1981, pp. 366–381.
doi: 10.1016/j.apenergy.2018.07.032. [45] C. Koch, Biophysics of Computation: Information Processing in Single Neurons. London:
[16] H. Liu, Z. Duan, Y. Li, and H. Lu, “A novel ensemble model of different mother Oxford Univ. Press, 1998.
wavelets for wind speed multi-step forecasting,” Appl. Energy, vol. 228, pp. 1783–1800, [46] E. Salinas and L. Abbott, “A model of multiplicative neural responses in parietal
2018. doi: 10.1016/j.apenergy.2018.07.050. cortex,” Proc. Nat. Acad. Sci., vol. 93, no. 21, pp. 11,956–11,961, 1996. doi: 10.1073/
[17] Q. Zhou, C. Wang, and G. Zhang, “Hybrid forecasting system based on an optimal pnas.93.21.11956.
model selection strategy for different wind speed forecasting problems,” Appl. Energy, vol. [47] F. Gabbiani, H. G. Krapp, C. Koch, and G. Laurent, “Multiplicative computation
250, pp. 1559–1580, 2019. doi: 10.1016/j.apenergy.2019.05.016. in a visual neuron sensitive to looming,” Nature, vol. 420, no. 6913, pp. 320–324, 2002.
[18] Y. Hao and C. Tian, “A novel two-stage forecasting model based on error factor and [48] M. B. Kennel, R. Brown, and H. D. Abarbanel, “Determining embedding dimension
ensemble method for multi-step wind power forecasting,” Appl. Energy, vol. 238, pp. for phase-space reconstruction using a geometrical construction,” Phys. Rev. A, vol. 45,
368–383, 2019. doi: 10.1016/j.apenergy.2019.01.063. no. 6, p. 3403, 1992. doi: 10.1103/PhysRevA.45.3403.
[19] A. Sfetsos, “A comparison of various forecasting techniques applied to mean hourly [49] A. M. Fraser and H. L. Swinney, “Independent coordinates for strange attractors
wind speed time series,” Renewable Energy, vol. 21, no. 1, pp. 23–35, 2000. doi: 10.1016/ from mutual information,” Phys. Rev. A, vol. 33, no. 2, p. 1134, 1986. doi: 10.1103/
S0960-1481(99)00125-1. PhysRevA.33.1134.
[20] Y. Wang, Y. Yu, S. Cao, X. Zhang, and S. Gao, “A review of applications of artificial [50] C. Grebogi, E. Ott, and J. A. Yorke, “Crises, sudden changes in chaotic attractors,
intelligent algorithms in wind farms,” Artif. Intell. Rev., pp. 1–54, 2019. doi: 10.1007/ and transient chaos,” Phys. D, Nonlinear Phenomena, vol. 7, nos. 1–3, pp. 181–200, 1983.
s10462-019-09768-7. doi: 10.1016/0167-2789(83)90126-4.
[21] A. Liu, Y. Xue, J. HU, and L. LIU, “Ultra-short-term wind power forecasting based [51] A. Wolf, J. B. Swift, H. L. Swinney, and J. A. Vastano, “Determining Lyapunov
on SVM optimized by GA,” Power Syst. Protection Control, vol. 43, no. 2, pp. 90–95, 2015. exponents from a time series,” Phys. D, Nonlinear Phenomena, vol. 16, no. 3, pp. 285–317,
[22] X. Kong, X. Liu, R. Shi, and K. Y. Lee, “Wind speed prediction using reduced sup- 1985. doi: 10.1016/0167-2789(85)90011-9.
port vector machines with feature selection,” Neurocomputing, vol. 169, pp. 449–456, 2015. [52] P. Shang, X. Na, and S. Kamae, “Chaotic analysis of time series in the sediment
doi: 10.1016/j.neucom.2014.09.090. transport phenomenon,” Chaos, Solitons Fractals, vol. 41, no. 1, pp. 368–379, 2009. doi:
[23] P. Jiang, Y. Wang, and J. Wang, “Short-term wind speed forecasting using a hybrid 10.1016/j.chaos.2008.01.014.
model,” Energy, vol. 119, pp. 561–577, 2017. doi: 10.1016/j.energy.2016.10.040. [53] H. Abarbanel, Analysis of Observed Chaotic Data. Springer Science & Business Media,
[24] Z.-h. Guo, J. Wu, H.-y. Lu, and J.-z. Wang, “A case study on a hybrid wind speed 2012.
forecasting method using bp neural network,” Knowl.-Based Syst., vol. 24, no. 7, pp. [54] J. Sola and J. Sevilla, “Importance of input data normalization for the application of
1048–1056, 2011. neural networks to complex industrial problems,” IEEE Trans. Nuclear Sci., vol. 44, no. 3,
[25] A. Ahmed and M. Khalid, “An intelligent framework for short-term multi-step wind pp. 1464–1468, 1997. doi: 10.1109/23.589532.
speed forecasting based on functional networks,” Appl. Energy, vol. 225, pp. 902–911, [55] G. Zhang, B. E. Patuwo, and M. Y. Hu, “Forecasting with artificial neural networks:
2018. doi: 10.1016/j.apenergy.2018.04.101. The state of the art,” Int. J. Forecasting, vol. 14, no. 1, pp. 35–62, 1998. doi: 10.1016/
[26] S. P. Kani and M. Ardehali, “Very short-term wind speed prediction: A new artifi- S0169-2070(97)00044-7.
cial neural network–Markov chain model,” Energy Convers. Manage., vol. 52, no. 1, pp. [56] S. García, D. Molina, M. Lozano, and F. Herrera, “A study on the use of non-
738–745, 2011. doi: 10.1016/j.enconman.2010.07.053. parametric tests for analyzing the evolutionary algorithms’ behaviour: a case study on the
[27] G. Memarzadeh and F. Keynia, “A new short-term wind speed forecasting method CEC–2005 special session on real parameter optimization,” J. Heurist., vol. 15, no. 6, p.
based on fine-tuned LSTM neural network and optimal input sets,” Energy Convers. Man- 617, 2009. doi: 10.1007/s10732-008-9080-4.
age., vol. 213, p. 112,824, 2020. [57] J. Derrac, S. García, D. Molina, and F. Herrera, “A practical tutorial on the use
[28] G. Zhang and D. Liu, “Causal convolutional gated recurrent unit network with of nonparametric statistical tests as a methodology for comparing evolutionary and
multiple decomposition methods for short-term wind speed forecasting,” Energy Convers. swarm intelligence algorithms,” Swarm Evol. Comput., vol. 1, no. 1, pp. 3–18, 2011. doi:
Manage., vol. 226, p. 113,500, 2020. 10.1016/j.swevo.2011.02.002.
[29] J. Ji, S. Gao, J. Cheng, Z. Tang, and Y. Todo, “An approximate logic neuron model [58] J. Alcalá-Fdez et al., “Keel: a software tool to assess evolutionary algorithms for
with a dendritic structure,” Neurocomputing, vol. 173, pp. 1775–1783, 2016. doi: 10.1016/j. data mining problems,” Soft Comput., vol. 13, no. 3, pp. 307–318, 2009. doi: 10.1007/
neucom.2015.09.052. s00500-008-0323-y.
[30] T. Zhou, S. Gao, J. Wang, C. Chu, Y. Todo, and Z. Tang, “Financial time series [59] J. L. Elman, “Finding structure in time,” Cognit. Sci., vol. 14, no. 2, pp. 179–211,
prediction using a dendritic neuron model,” Knowl.-Based Syst., vol. 105, pp. 214–224, 1990. doi: 10.1207/s15516709cog1402_1.
2016. doi: 10.1016/j.knosys.2016.05.031. [60] C.-C. Chang and C.-J. Lin, “LIBSVM: A library for support vector machines,” ACM
[31] W. Chen, J. Sun, S. Gao, J.-J. Cheng, J. Wang, and Y. Todo, “Using a single dendritic Trans. Intell. Syst. Technol., vol. 2, pp. 27:1–27:27, 2011. [Online]. Available: http://www
neuron to forecast tourist arrivals to Japan,” IEICE Trans. Inf. Syst., vol. 100, no. 1, pp. .csie.ntu.edu.tw/~cjlin/libsvm
190–202, 2017. doi: 10.1587/transinf.2016EDP7152. [61] J. Zhang and A. C. Sanderson, “Jade: adaptive differential evolution with optional
[32] J. Ji, S. Song, Y. Tang, S. Gao, Z. Tang, and Y. Todo, “Approximate logic neu- external archive,” IEEE Trans. Evol. Comput., vol. 13, no. 5, pp. 945–958, 2009. doi:
ron model trained by states of matter search algorithm,” Knowl.-Based Syst., vol. 163, pp. 10.1109/TEVC.2009.2014613.
120–130, 2019. doi: 10.1016/j.knosys.2018.08.020. [62] R. Tanabe and A. S. Fukunaga, “Improving the search performance of shade using
[33] J. K. Makara, A. Losonczy, Q. Wen, and J. C. Magee, “Experience-dependent com- linear population size reduction,” in Proc. 2014 IEEE Congr. Evol. Comput. (CEC), pp.
partmentalized dendritic plasticity in rat hippocampal ca1 pyramidal neurons,” Nature 1658–1665. doi: 10.1109/CEC.2014.6900380.
Neurosci., vol. 12, no. 12, p. 1485, 2009. doi: 10.1038/nn.2428.

Michael Margaliot
Tel Aviv University, Israel
A Self-Adaptive Mutation
©SHUTTERSTOCK.COM/BETTERVECTOR
Neural Architecture Search
Algorithm Based on Blocks
researchers proposed the neural architecture search, which

Yu Xue and Yankang Wang searches CNN architecture automatically, to satisfy dif-
Nanjing University of Information Science and Technology, CHINA ferent requirements. However, the blindness of the search
Jiayu Liang strategy causes a ‘loss of experience’ in the early stage of
Tiangong University, CHINA the search process, and ultimately affects the results of the
later stage. In this paper, we propose a self-adaptive
Adam Slowik mutation neural architecture search algorithm based on
Koszalin University of Technology, POLAND
ResNet blocks and DenseNet blocks. The self-adaptive
mutation strategy makes the algorithm adaptively adjust
the mutation strategies during the evolution process to
Abstract—Recently, convolutional neural networks (CNNs) achieve better exploration. In addition, the whole search pro-
have achieved great success in the field of artificial intelligence, cess is fully automatic, and users do not need expert knowl-
including speech recognition, image recognition, and natural edge about CNNs architecture design. In this paper, the
language processing. CNN architecture plays a key role in proposed algorithm is compared with 17 state-of-the-art
CNNs’ performance. Most previous CNN architectures are algorithms, including manually designed CNN and automatic
hand-crafted, which requires designers to have rich expert search algorithms on CIFAR10 and CIFAR100. The results
domain knowledge. The trial-and-error process consumes a lot indicate that the proposed algorithm outperforms the com-
of time and computing resources. To solve this problem, petitors in terms of classification performance and consumes
fewer computing resources.

Date of current version: 15 July 2021 Corresponding Author: Yu Xue (e-mail: xueyu@nuist.edu.cn).

D
I. Introduction proposed, such as genetic algorithm (GA) [21], particle swarm
eep learning has achieved great success in many optimization (PSO) [22], and artificial ant colony algorithm
fields [1], such as speech recognition [2], semantic [23]. Due to the gradient-free characteristic, the EC technique
segmentation [3], [4], image recognition [5], [6], and has been widely used in optimization problems [24] and even in
natural language processing [7]. With the excellent black box problems that do not have mathematical equations
performance in these fields, convolutional neural networks explicitly expressed. Therefore, EC techniques have been used
(CNNs) have become one of the most widely used models in to solve NAS problems.
deep learning [8]. In general, conventional CNNs consist of In fact, two decades ago, the EC technique was already used to
convolutional layers, pooling layers, and fully-connected layers. optimize neural networks and it was called neuroevolution [25].
Typical CNNs include AlexNet [9], VGG [10], GoogleNet The goal is to use the EC technique to optimize the architecture
[11], ResNet [5], and DenseNet [12]. Although these networks and weights of neural networks. Neuroevolution is proposed only
have achieved huge accuracy improvements in vision tasks, the for small or medium scale neural networks [16]. With the devel-
design of the CNN architectures is still a difficult task because opment of neural networks, the CNNs based on deep neural net-
of the large number of parameters [13]. These models are all works (DNNs) have been proposed and widely used in computer
hand-crafted and they cannot learn the architecture by them- vision tasks. However, the CNNs have a large number of parame-
selves, thus, designers need to have much expert knowledge in ters.Therefore, to solve the problem of designing CNNs’ architec-
CNN architecture design [14]. Moreover, the design of CNN tures, the concept of ENAS was proposed in the evolutionary
architecture is guided by problems; that is, the architectures of computation community. According to references, the first ENAS
CNNs are decided by different data distributions, and a manu- work was completed by Google’s Real et al. [26] and they pro-
ally designed architecture lacks flexibility. The two main fac- posed a LargeEvo algorithm. The LargeEvo algorithm uses the
tors that affect the performance of CNNs are architectures GA to design CNNs’ architectures. However, they only use muta-
and weights. For weight optimization, the gradient descent tion operators. Experimental results have proven the effectiveness
algorithm has been proved to offer a significant advantage [15]. of LargeEvo on CIFAR10 and CIFAR100 [27]. LargeEvo direct-
Unfortunately, for the optimization of CNNs’ architectures, ly evolves basic units in open space, then samples and evaluates
there are no explicit functions to directly calculate the optimal each candidate solution, which consumes a lot of computational
architecture [16]. To solve the problems above, Google resources. Since Real et al. [26] used layers as the search space, this
researchers Zoph et al. [17] proposed the concept of the neural space contains a relatively simple architecture, LargeEvo initializes
architecture search (NAS) for deep neural networks. Following the model with the basic layers, and each individual contains only
this concept, a large number of researchers have paid attention a fully connected layer. Xie et al. [28] searched through the same
to this field, and it has become one of the most popular space, and their results indicated that this trivial search space is
research topics in the automatic machine learning community. capable of evolving a competitive architecture.Yet, the large search
The purpose of NAS is to automatically search for parameters space brings high computational cost. To overcome this phenom-
such as the optimal architectures of CNNs so that CNNs can enon, some methods have been proposed to constrain the search
outperform those that are hand-crafted. Moreover, the NAS space, such as block-based design methods. These methods com-
can reduce the high cost of trial-and-error design. bine the state-of-the-art architecture in the initial state, so it can
The existing NAS algorithms can be categorized into three greatly reduce the search space and maintain the performance.
types: 1) reinforcement learning based methods [18]; 2) gradient Many block-based methods have been proposed, such as ResNet
based methods [19]; 3) evolutionary computation based [20] blocks [5], DenseNet blocks [12], Inception blocks [11]. Block-
(ENAS) methods. Among these methods, the NAS method based methods are quite promising because of the fewer parame-
based on reinforcement learning requires a large number of ters which greatly speed up the search process.
computing resources, i.e., thousands of graphics processing units Some researchers directly used aforementioned blocks to
(GPUs) run on a medium-sized dataset such as CIFAR10 for design CNN architectures [29], [30], whereas others have pro-
more than 10 days. Zoph et al. [17] proposed a NAS method posed different kinds of blocks. For example, Chen et al. [31]
based on reinforcement learning. They used 800 GPUs and proposed eight kinds of blocks, including ResNet block and
spent 28 days completing the entire search process, which is not Inception block, and encoded them into 3-bit strings. After that,
practical for most interested users. The gradient-based method is they used the Hamming distance to identify blocks that have
more efficient than the reinforcement learning method, and one similar performance. Song et al. [32] proposed three types of
of the most famous works is proposed by Liu et al. [19]. Because residual dense blocks, which greatly constrain the search space
of the lack of theoretical support, the architectures searched by and reduce computational consumption. After determining the
them are inadequate and unstable, and the process of construct- search space, different evolution operators were employed to
ing a cell consumes significant computing resources and requires generate new architectures. Sun et al. [14] used polynomial
much prior knowledge. The ENAS algorithms use the evolu- mutation on the encoded information which is represented by
tionary computation (EC) technique to design the CNN archi- real numbers. To make mutations less random, Lorenzo et al.
tecture. In particular, the EC is a population-based technique to [33] proposed a new Gaussian mutation guided by Gaussian
obtain the global optimum. Many EC techniques have been regression. Gaussian regression can predict the architectures

which have good performance, and can also search and sample very important. A method with a large number of constraints
in a search space where the fitness value may be better. can easily obtain a good architecture, but it also will be unable
The studies above do not effectively record the evolution infor- to find a novel architecture. In addition, the size of the search
mation, which leads to their inability to guide the entire search pro- space greatly affects the efficiency of search techniques. With
cess based on past experience. Most of them tend to focus on the the development of the NAS, more researchers use block-based
improvement of the search space and evaluation methods, but the techniques to define the search space [35], [36], which com-
search strategy also affects an algorithm when finding a competitive bines various types of layers as the basic unit. These block-based
architecture. At the same time, the general search strategy loses the techniques have good performance and fewer parameters are
‘experience’ in the entire search process, resulting in the algorithm required to build the architecture.
being unable to obtain more favorable architectures. This paper ResNet and DenseNet are two kinds of DNNs with bet-
focuses on finding a suitable way to guide the whole search process. ter performance in hand-designed CNNs. Their success
For this purpose, this paper proposes a self-adaptive mutation based depends on the construction of ResNet blocks and
on blocks to design CNN architectures. The self-adaptive mecha- DenseNet blocks, and the block-based method has been
nism has been widely used in EC, but no one has applied it to the proven to have better effectiveness than the design method
NAS. In addition, the traditional mutation strategy does not focus which is composed of layers [32], [37], [38]. This design
on retaining the evolution information. To choose an appropriate method can solve the problem of gradient disappearance
mutation strategy, the self-adaptive mechanism is added into the [39]. Therefore, the block-based search space can achieve
evolutionary process. To prevent the phenomenon of population promising prospects. In this paper, ResNet blocks and
degradation and slow convergence, a semi-complete binary com- DenseNet blocks are used as the basic units to determine the
petition selection strategy has also been designed and it can suffi- search space. For the convenience of representation, ResNet
ciently prevent the promising individuals in the population from blocks and DenseNet blocks are respectively called RBs and
being easily abandoned. Moreover, the ENAS algorithm proposed DBs. The units composed of RBs or DBs are called RUs or
in this paper is completely automatic, which means that it does not DUs, respectively. Figure 1 shows the structure of an RU.
require the user to have rich expert knowledge about CNN archi- Figure 2 shows the internal architecture of an RB. In this
tecture design. RB, there are three convolutional layers, which are called
The rest of the paper is organized as follows: Section II intro- conv1, conv2, and conv3. In conv1, 1 × 1 convolution ker-
duces related work. Section III describes the proposed algorithm in nels are used to decrease the number of channels of input
details. Section IV introduces the designs of the experiment and features and the computing complexity. In conv2, the 3 × 3
Section V analyzes the experimental results. Finally, Section VI pro- filters are used to learn the characteristics of the input. In
vides our conclusions and directions for future work. conv3, the filters with the size of 1 × 1 are used to increase
the number of output channels of the previous convolu-
II. Related Work tional layer, so that the number of input and output chan-
nels is consistent throughout the entire RB. In addition, a
A. Genetic Algorithm
The GA is an evolutionary algorithm inspired by the natural
selection. Because it has gradient-free characteristics, it can be
used to optimize nonconvex and nondifferentiable problems RU
[21], [34]. In addition, the GA has the characteristic of being
insensitive to local optima, which enables it to find global opti- RB RB RB
ma. The GA usually uses a series of biologically inspired opera-
tors to solve optimization problems, such as crossover and
mutation. Generally speaking, the main steps of GA are as fol-
FIGURE 1 An example of the ResNet Unit.
lows: 1) initialize a population; 2) evaluate fitness; 3) execute
crossover and mutation operators; 4) evaluate individuals; 5)
environmental selection. Among them, steps 3) - 5) are con-
trolled by a loop, and the stop condition is whether the maxi- RB
mum generation is reached.
In this paper, each individual represents the architecture of
Conv2
Conv1
Conv3
one CNN, and the fitness is the classification accuracy of each

Output
Input
CNN on the validation dataset. The higher the classification

accuracy, the higher the fitness.
B. Block-Based Design Method

Because a constraint represents human intervention in the FIGURE 2 An example of the ResNet block with three convolutional
search space, the suitable constraints on the coding space are layers and one skip connection.

C. Self-Adaptive Mechanism
The existing NAS algorithms can be categorized In recent years, self-adaptive mechanisms have
into three types: 1) reinforcement learning based been increasingly used in the evolutionary
computation community. Moreover, many
methods; 2) gradient based methods; 3) evolutionary EC algorithms based on self-adaptive mecha-
computation based (ENAS) methods. nisms have been proposed [34]. In the early
years of such research, Qin et al. [40] intro-
duced the self-adaptive mechanism into the
skip-connection is used to connect the input layer and the differential evolution (DE) algorithm, combining different
output layer, and when the sizes of input and output chan- types of DE, and proposed a new algorithm called self-adap-
nels are different, some 1 × 1 filters are added to adjust the tive DE. The experimental results showed that self-adaptive
size of channels so that the size of the input channels is the DE can obtain higher quality solutions. In recent years, Xue
same as the size of output channels. Figure 3 shows an example et al. [24] applied a self-adaptive PSO to solve large-scale fea-
of a DB. As shown in the figure, three convolutional layers ture selection problems, and experimental results showed its
are used to form a DB (the number of convolutional layers effectiveness in reducing redundant features. In the field of
in each DB can be set by the users). In this DB, the input of neural networks, adaptive mechanisms, such as adaptive gra-
each convolutional layer contains the output of each previ- dient algorithm [41], adadelta [42], and adam [43] have been
ous layer. In addition, to better control the input and output applied for selecting the learning rate. These algorithms
size of each convolutional layer, a parameter k is used. For can adjust the learning rate to meet the needs of different
example, if the input size of the first layer is a, then the out- learning rates in different training stages. To the best of our
put size of the first layer is a+k, the input size of the second knowledge, there is no work that introduces the adaptive
layer is a+k, and the output size of the second layer is a+2k. mechanism into the CNN architecture design. Therefore, this
Through similar operations, some DBs can be constructed. paper proposes a self-adaptive mutation method to search the
The two hierarchical representation block-based methods CNNs’ architectures so that the algorithm can adaptively
above can not only reduce the search space, but also enable choose appropriate mutation operations, such as adding,
deeper convolutional layers to learn features in the shallow removing, and modifying units.
layers. This hierarchical representation of input can also
improve the classification accuracy and make the search pro- III. Details of SaMuNet
cess more flexible [37]. In this section, the framework of the algorithm is described,
and its main components are discussed in details. A self-adaptive
mutation network (SaMuNet) is proposed in this paper. The
proposed NAS method is mainly used in image classification
DB tasks. For clarity, the frequently used notations and their
descriptions are shown in Table I.
Conv1
Conv2
Conv3
Output
Input
A. Framework of SaMuNet
The framework of SaMuNet is shown in Algorithm 1, first, a
population with size N is initialized. Then, each individual in the
initialized population is evaluated. After that, the architecture of
FIGURE 3 An example of the DenseNet block with three convolutional
the CNN is evolved by the crossover and self-adaptive mutation
layers. operators. A new population will be generated by the environ-
ment selection operator. This
process continues until the
TABLE I Notations and Corresponding Descriptions. maximum generation number
is reached. Finally, the best
NOTATIONS DESCRIPTIONS NOTATIONS DESCRIPTIONS
generated individual is decod-
N Population size P Current population
ed into a CNN with the best
nlter Maximal iteration Q Offspring population architecture. In the following
m Crossover probability nsflag Success matrix in a generation parts, encoding strategy, popu-
n Mutation probability nfflag Fail matrix in a generation lation initialization, individual
Dtrain Train datasets S Success matrix in defined generation evaluation, crossover opera-
tors, self-adaptive mutation
Dvalid Validation datasets F Fail matrix in defined generation
mechanism, and the proposed
p1, p2 Two parent individuals COGS Candidate offspring generate strategy
environment selection opera-
q1, q2 Two offspring individuals NS Number of strategies
tor are introduced.

B. Encoding Strategy
Encoding is very important when solving In this paper, each individual represents the architecture
practical problems, and different problems of one CNN, and the fitness is the classification
require different coding strategies. For the
EC technique, encoding strategies are gener-
accuracy of each CNN on the validation dataset. The
ally divided into two types: direct encoding higher the classification accuracy, the higher the fitness.
and indirect encoding. Direct encoding
directly encodes the precise information into
each individual. Indirect encoding is different from directly use layers as basic units. Inspired by the success of ResNet [5]
representing the weight or connection of the network, and it and DenseNet [12], in this paper, RUs, DUs, and PUs are used as
usually represents each individual through a new generation the basic units. RUs are composed of RBs, DUs are composed
rule [16]. This encoding method is widely used in neuroevolu- of DBs, and PUs are composed of pooling layers. The proposed
tion [44]. Because of the increasing number of network layers algorithm does not optimize the fully-connected layer.The main
in CNNs, the number of parameters that need to be opti- reason is that the neurons will cause overfitting due to the char-
mized has also increased exponentially. The backpropagation acteristics of the fully-connected layers [45]. To alleviate this
algorithm [15] has been proven to be superior in optimizing phenomenon, operations such as dropout [46] need to be used,
weights and it is directly applied to the weight optimization of but this will increase the number of parameters that need to be
the CNN. Thus, it is not necessary to consider the optimiza- optimized and increase the number of computing resources.
tion of the weights of CNN, and only some hyperparameters
need to be optimized.
To make operators such as crossover and mutation easier to Algorithm 1 Framework of SaMuNet.
implement and maintain the flexibility of CNNs’ architectures,
this paper uses a direct variable-length encoding strategy to Input: The population size (N), the number of maximal iteration
represent CNN architecture. This encoding strategy uses basic (nIter), the crossover probability (m), the mutation probability
units RU, DU and PU. For the RU, a random number is used (μ), the current iteration (t).
to represent the amount of RBs, and each RB contains at least O utput: The best individual with the architecture of
one 3 × 3 convolution kernel. Each RB also includes a skip the CNN.
connection operation. 1: Randomly initialize population P0;
For the DU, a random number is used to represent the number 2: Evaluate each individual;
of DBs, and each DB also contains at least one 3 × 3 convolu- 3: while t < nIter do
tion kernel. The difference between DU and RU is that the skip 4: Qt ! {};
connections in the DB are densely connected (i.e., the input of 5: while Q t < N do
each layer contains all the previous layer outputs). Meanwhile, 6: Select two parents p1, p2 from Pt by binary tourna-
each PU includes only one pooling layer of either type: average ment selection;
pooling layer and maximum pooling layer. In summary, RUs, 7: Generate two offspring q1, q2 from parents p1, p2 by
DUs and PUs are composed of RBs, DBs and pooling layers, crossover and mutation;
respectively. For different RBs, the unknown parameters are the 8: Qt ! Qt ∪ q1 ∪ q2;
number of input and output channels. For different DBs, in addi- 9: end while
tion to the number of input and output channels, the unknown 10: Evaluate the individuals in Qt and update the probabili-
parameters include a growth rate k (to calculate the number of ties of strategies in mutation strategy pool (Adding,
input channels for each layer). To express individual information Removing, Replacing) by using self-adaptive mechanism;
and make the genetic operator more flexible, this article uses the 11: Utilize the proposed environment selection strategy to
variable-length encoding strategy [15]. As shown in Figure 4, an select N individuals from Pt ∪ Qt and generate the new
individual contains RUs, DUs, and PUs. In the later stage of the population Pt + 1;
evolution process, the block-based method in the figure is used 12: t = t + 1;
as the basic unit. 13: end while
14: Output the best individual with the best architecture of
C. Initialization of the Population the CNN;
Population initialization is the first stage in the evolution process,
and it involves a random initialization method that obeys a uni-
form distribution for each individual. As described in Part A,
each individual represents a CNN architecture. To find the best Individual:
individual, GA with self-adaptive mutation is used in the evolu- RU DU PU RU DU PU PU RU DU PU
tion process. The proposed algorithm uses a block-based design
method, which is different from traditional NAS methods that FIGURE 4 The architecture of an individual.

The method of population initialization is presented in E. Crossover Operator
Algorithm 2. For each individual, we need to randomly gen- The offspring inherit the genes from the parents and are
erate a positive integer k to determine the length of the indi- expected to obtain a higher fitness than the parents. Therefore,
vidual (i.e., the number of units that need to be initialized), individuals with higher fitness values should be selected as par-
and then randomly generate a vector with length k. The vec- ents. In this paper, the binary competition strategy is used to
tor stores each unit that needs to be initialized. Then, the sec- select parents [47]. This strategy first randomly selects two
ond random number is used to determine whether to individuals from the population, and then it chooses the fitter
generate RUs, DUs, or PUs. Note that, in line 7, the number of the two as a parent. The second parent is selected in the
of PUs is limited. This is because the function of the pooling same way.
layer is to halve the size of the input, and this operation is After selecting the parents, the crossover operator which has
widely used in many state-of-the-art CNNs [5], [10]–[12]. local ability is performed. To make each individual have better
However, if an individual starts with the pooling layer, a lot of flexibility, the fixed-length coding strategy is replaced by the
information will be lost. In addition, if a 64 × 64 input image variable-length coding strategy. Because of the different lengths
passes through seven pooling layers continuously, the output of individuals, the traditional crossover operator is no longer
size will be 1 × 1; this operation will lead to logic errors and is directly applicable.Yet, the crossover operator plays a key role in
thus disallowed. the search process; a modified single point crossover operator is
used based on variable-length individuals. The specific opera-
D. Evaluation of Individuals tion is shown in Figure 5.
The fitness value of each individual indicates the individual’s It should be noted that the number of input channels at the
adaptivity to the environment. In the proposed algorithm, the splicing place after the crossover needs to be adjusted, and the
fitness value is the classification accuracy on the image datasets.
In EC, individuals with higher fitness values have a higher
probability of generating offspring with better performance
than themselves. In this paper, to obtain the fitness value, the Algorithm 3 Evaluation of the Individuals.
proposed algorithm decodes each individual into a CNN I nput: The population Pt for fitness evaluation, training dataset
architecture and trains it on the training set, then tests the Dtrain, validation dataset Dvalid, i = 0.
trained architecture on the validation set. In Algorithm 3, each Output: The fitness of the population.
individual is decoded into a CNN architecture and its weights 1: for each individual in Pt do
are initialized, then the CNN will be trained until the iteration 2: Decode the encoded information to the architecture of
reaches its maximum value. Finally, the trained CNN model CNN;
searched by the proposed algorithm is put on the validation 3: Initialize the weights of CNN;
dataset to obtain the fitness value. 4: while i < epoch do
5: Train the CNN;
6: i + + ;
Algorithm 2 Initialize Population. 7: end while
8: Evaluate the accuracy for the CNN on the Dvalid;
Input: The population size (N), the maximal PUs poolmax, 9: end for
j = 0, i = 0.
10: Output the fitness values of the individuals in Pt;
Output: The initialized population P0.
1: P0 ! {};
2: while i < N do
3: Randomly generate a positive number k to decide the
length of individual;
4: Initialize an array with the size of 1# k; Parent 1: RU DU PU RU DU PU PU DU PU
5: for j < k do
6: U ! Randomly choose one unit from {PU, RU, DU};
7: if The last unit of U is not PU and the number of PU
Parent 2: RU DU DU PU RU DU PU RU
is less than poolmax then
8: U ! randomly choose one unit from {RU, DU};
9: end if
10: Put the information of U into the jth element of a; Offspring 1: RU DU PU RU RU DU PU RU
11: end for
12: P0 ! P0 ∪ a; Offspring 2: RU DU DU RU DU PU PU DU PU
13: i = i + 1;
14: end while FIGURE 5 The crossover operator based on variable-length encoding
15: Output the initialized population P0; strategy.

unit number also needs to be modified. The pseudo-codes of becoming 0. Next, the generation probability of a new COGS
the crossover operator are described in Algorithm 4. A number is calculated using the following formulas:
is randomly generated to determine whether to perform the
crossover operator or not. If it is determined to perform the Ng
crossover operator, the binary competition strategy is used to S 3q = S 2q e S 2q + / Fk,q o (3)

k=1
select parents firstly, then two crossover points are randomly
Nq
generated for the parents. Finally, the offspring are generated by
the crossover operator.
Pq = S 3q / S 3q (4)
q=1
F. Self-Adaptive Mechanism Strategy Based on Blocks where S 3q represents the new probability of the qth COGS
The mutation operator usually explores the search space to after the update. In addition, this probability needs to be nor-
obtain individuals with promising performance. The pro- malized by Equation (4). The above equations are used to
posed algorithm uses three types of mutation methods, i.e., update the probability of each COGS according to the perfor-
adding, removing, and replacing. Adding refers to adding a mance of the offspring. Obviously, the better the performance
unit at the selected location of the individual, removing of a COGS, the higher the probability that it will be selected
refers to removing the current unit at the designated location during the entire evolutionary process.
of the individual, and replacing refers to randomly replacing In Algorithm 5, the implementation of the self-adaptive
a new type of unit with a new one at the selected location of mutation strategy is shown in details. First, three opera-
the individual. tions are added to the strategy pool (line 1). Then, a ran-
The self-adaptive mechanism has been widely used in the dom number is generated to determine whether to perform
evolutionary computation community [24], [34]. In this mutation operations. If a mutation occurs, a block opera-
paper, a self-adaptive mutation strategy pool with three candi- tion is selected to generate new offspring. The information
date offspring generation strategies (COGSs) is proposed. This of generating offspring is used to update the probability
mechanism enables the algorithm to adaptively find mutation of the COGS. Finally, the mutated offspring population
operators that are more suitable for the current task. Thus it is output.
can determine a more suitable generation strategy to match
the evolution process at different stages. G. Environmental Selection Based on Semi-Complete
The self-adaptive mutation operator mechanism is described Binary Competition
as follows: First, each COGS has an initial probability 1/N s, In the environmental selection stage, suitable individuals need to
where Ns represents the number of COGS in the strategy be selected from the population to serve as the next generation
pool. Let Pq represent the probability of the qth strategy being population. The simplest way to accomplish this is to select the
selected (q = 1, 2,…, Ns). Next, the roulette strategy is used to individual with the highest fitness value from the population
select a COGS. Through this strategy, a new individual is gen- each time, but this will cause the population to lose diversity
erated, and it is compared with its parent. The result is record- and fall into a local optimum [48], [49].
ed in the initial zero matrix nsflag i, s and nfflag i, s (i= 1, 2,…, N, For NAS applications, researchers have proposed many
s = 1, 2,…, Ns) where N represents the number of individuals selection strategies. Real et al. [50] used aging evolution to
in the population. After a generation, the sums of each col-
umn of nsflag i, s and nfflag i, s are calculated, and they are
recorded in two new matrices S k, q and Fk, q (k= 1,…, N g,
Algorithm 4 Crossover operator of SaMuNet.
q = 1,…, Ns, where Ng represents the updated period). S
records the number of successes of the qth strategy in the kth Input: Two parents p1, p2, the crossover probability m.
generation, and F records the number of failures of qth strate- Output: Two offspring q1, q2.
gy in the kth generation. After a generation, both nsflag N, Ns 1: Randomly generate a number v in range of [0,1];
and nsflag N, Ns are reset. 2: if v < m then
After Ng generations, the total number of successes and fail- 3: Select two parents p1, p2 from Pt by binary tourna-
ures of all COGSs is calculated, and the probability of each ment selection;
COGS is updated as follows: 4: Randomly generate two integer numbers to separate p1,
p2 into two parts, respectively;
Ng
S 1q = / S k,q (1) 5: Combine the first part of p1 and the second part of p2;
k=1 6: Combine the first part of p2 and the second part of p1;
S 2q = )
m, if S 1q = 0 7: else
1 (2) 8: q1 ! p1;
S , otherwise
q
9: q2 ! p2;
where q represents the strategy in the qth strategy pool, and m is 10: end if
set to 0.0001 to prevent the probability of some strategies from 11: Output two offspring q1, q2;

kill the oldest individual from the population, thus avoiding randomly generated, then k individuals with the highest fit-
the premature convergence problem. Zhu et al. [51] com- ness values are selected from the parent population and the
bined two methods while removing the oldest and worst offspring population. We retain the k individuals and use
individuals. In addition, Xie et al. [28] used the roulette strat- the binary competition strategy to select the needed indi-
egy to assign probabilities to each individual. Individuals with viduals. In this way, it is ensured that the excellent indi-
higher fitness values have a higher probability of survival. Sun viduals in the population will not be eliminated. This
et al. [14], [29] used a binary competition strategy for envi- method, which incorporates the ‘elite’ [52] mechanism in
ronmental selection. GAs, can effectively prevent the population from degener-
In the traditional binary competition, two individuals are ating during evolution.
randomly selected; the individual with lower fitness value is Algorithm 6 shows the details of the proposed environ-
eliminated. This means that if two good individuals are mental selection based on the semi-complete binary competi-
selected, the relatively bad one will be killed. This process tion. First, randomly generate a number k from (0, N/2], select
can have a negative effect on the population. Therefore, this k best individuals from the parent population and the offspring
paper proposes an environmental selection operator based on population, and then use the binary competition strategy to
a semi-complete binary competition. The implementation of select N – k from the remaining individuals. Finally, the new
the method is shown as follows: First, a number k is population is obtained.
IV. Experiment Design

This section verifies the effectiveness of the CNN architecture
Algorithm 5 Self-adaptive mutation operator of SaMuNet.
searched by our proposed algorithm on image classification
Input: The maximal generation N, number of strategies N s = 3 , tasks. We first introduce the benchmark datasets (in Part A),
the mutation probability μ, the number of offspring q, i = 0, cur- then introduce the comparison algorithms (in Part B), and
rent iteration (cIter), the number of iterations (Iter), flagIter = 0, finally describe the parameter setting (in Part C).
Ng = 5.
Output: The population of mutated offspring. A. Benchmark Datasets
1: Put three operation {Adding, Removing, Replacing} into CIFAR10 and CIFAR100 are color image datasets that are
strategy pool; close to universal objects. For state-of-the-art CNNs,
2: while i < Iter do CIFAR10 and CIFAR100 are the most widely used bench-
3: for j < q do mark datasets [27]. Therefore, this paper uses CIFAR10
4: Randomly generate a number r in range from [0,1]; and CIFAR100 to verify the effectiveness of our pro-
5: if r < μ then posed algorithm.
6: Choose one strategy with roulette from strategy CIFAR10 contains a total of 10 classes of RGB color imag-
pool to generate q new
j ; es: airplane, automobile, bird, cat, deer, dog, frog, horse, ship and
7: if the new offspring q new j is better than qj then truck. The size of each image is 32 × 32. Each class has 6,000
8: nsflag i, s = 1; images. CIFAR10 has a total of 50,000 training images and
9: else
10: nfflag i, s = 1;
11: end if Algorithm 6 Environmental selection based on semi-complete
binary competition.
12: j = j + 1;
13: end if Input: The parent population Pt, the offspring population Qt, the
14: end for population size N.
15: S N g # N s = sum of each column in nsflag i, s; Output: The next population Pt+1.
16: FN g # N s = sum of each column in nfflag i, s; 1: Pt + 1 ! { }, U t ! { }, q ! { };
17: Reset matrices nsflag i, s and nfflag i, s; 2: Randomly generate a number k in range of (0,N/2];
18: if cIter – flagIter = = Ng then 3: while j < k do
19: tempS N g # N s = sum of each column in S N g # N s; 4: p ! Select the k individuals with highest values from
20: tempFN g # N s = sum of each column in FN g # N s; Pt , U t;
21: U p d a t e t h e p r o b a b i l i t y : P1 # N s = (tempS N g # N s) / 5: Pt + 1 ! Pt + 1 , P;
(temp S N g # N s + FN g # N sh; 6: end while
22: Reset matrices S N g # Nq and FN g # Nq; 7: for j = k + 1 to N do
23: flagIter = cIter; 8: P1, P2 ! Randomly selected two individuals from Ut;
24: end if 9: q ! Select the one with higher fitness from Ut;
25: i = i + 1; 10: Pt + 1 ! Pt + 1 , q;
26: end while 11: end for
27: Output the population of mutated offspring; 12: Output the next population Pt+1;

10,000 test images. We show the examples of CIFAR10 in this paper is one GeForce GTX 2080Ti, and the video mem-
Figure 6. In contrast, CIFAR100 has 100 classes, each with ory is 11 GB, some of the parameters can be adjusted by the
500 training images and 100 test images. In the experiment, in user. Because of the device’s video memory limitation, out of
order to match the conventions of state-of-the-art CNNs, each memory error may occur if the parameter setting is too large.
side of each image is filled with four zeros, and then the image
is randomly cropped to its original size. After that, each side is V. Experimental Results and Analysis
horizontally flipped randomly, and the images are input finally This paper evaluates the performance of the algorithms using
in the proposed algorithm. not only the classification accuracy, but also the number of
parameters (i.e., model size) and GPU/Days. According to the
B. Comparison Algorithms concept of GPU/Days proposed by some previous researchers,
To verify the effectiveness of the proposed algorithm, many the time consumption in the final search process of the algo-
state-of-the-art algorithms are used for comparison. Ac rithm can be calculated as follows: If n GTX 1080Ti cards are
cording to the literature [29], the existing state-of-the-art used to run on the CIFAR10 dataset for m days, the computa-
CNNs are divided into the following three types: The first tional resource consumed by the algorithm on CIFAR10 is
type is purely hand-designed CNNs, which require a lot of m ) n GPU/Days. In addition, the proposed algorithm uses the
exper t knowledge, such as VGG [10], ResNet [5], standard in [29] in which Sun et al. divided the current state-
DenseNet [12], Network in Network [53], Highway Net- of-the-art CNNs into three categories: hand-crafted, semi-
work [54], Maxout [55], and ALL-CNN [56]. The purpose automatic, and completely automatic.
of choosing these hand-designed networks is to prove the
superiority of the automatically searched network archi- A. Classification Performance of the
tecture. The second type of network is designed with a Algorithms on CIFAR10
semi-automatic architecture, such as Genetic CNN [28], Table II shows the performance comparison of the proposed
Hierarchical Evolution [57], and EAS [58]. The third is a algorithm SaMuNet and the comparison algorithms on
fully-automatically designed network architecture, such as CIFAR10 and CIFAR100. The second and third columns rep-
Large-scale Evolution [26], NAS [17], MetaQNN [59], resent the validation classification error rates of different CNNs
AE-CNN [29], Firefly-CNN [60], EPSO-CNN [61], and on CIFAR10 and CIFAR100, respectively. The fourth column
NSGANet [62] which is based on a multi-objective NAS represents the sizes of the final models. The fifth column repre-
algorithm [63]. sents the sizes of GPU/Days consumed by each algorithm, and
the sixth column is the constructive method of each model.
C. Parameter Setting The ‘-’ in the table indicates that this indicator has not been
We compare the existing state-of-the-art algorithms by reported in the relevant reference.
consulting the results presented in other papers. The reason As shown in Table II, compared with the purely hand-craft-
for this is that the algorithms in other papers generally use ed model, which includes ResNet and DenseNet, the classifi-
the best accuracy, which saves a lot of time. In this paper, cation accuracy of SaMuNet on CIFAR10 has achieved the
because the proposed algorithm is based on GA architec- second lowest error rate, which is a little higher than Firefly-
ture search [29], the population size is set to 20, the cross- CNN. However, the number of parameters from the final
over probability is set to 0.9, and the three mutation model which is found by our algorithm is much lower than
operation probabilities (adding, removing, and replacing) Firefly-CNN. The proposed algorithm has an error rate 1.6%
are set to 0.4, 0.3, and 0.3, respectively. Following the rec-
ommendation of the machine learning community, the
datasets are randomly divided into one-fifth as validation
Airplane
datasets. Finally, all classification error rates are obtained on
the same validation datasets.
In this paper, when training the obtained neural network,
Adam is used as the optimizer to learn the weights of the neu- Automobile
ral network [43]. The number of epochs is set to 250, the initial
learning rate is set to 0.1, and the batch size is set to 80, the
learning rate is adjusted according to the epoch.
In addition, this paper uses a block-based design method.
Each block and its parameters in the block are set as follows: Horse
The maximum number of RUs, DUs, and PUs is set to three,
and the maximum number of RBs in each RU is set to four.
The selectable parameters in the DB are set to 12, 20, and 40.
The maximum number of convolutional layers is set to five in
the DB. Note that, because the experimental device used in FIGURE 6 Three examples from CIFAR10.

lower than the best model (DenseNet, which is hand-crafted all models have higher classification errors on CIFAR100. The
on CIFAR10). Since the hand-crafted model does not have classification error rate of our proposed algorithm on
the process of searching the model, its GPU/Days is represent- CIFAR100 is 20.2%.
ed by ‘-’. Compared with the semi-automatic architecture Compared with the hand-crafted networks, the error rate of
design methods, our proposed algorithm has the lowest classi- our proposed algorithm is 4.2% lower than DenseNet which is
fication error rate on CIFAR10, and consumes less GPU/ the best of the hand-crafted models. Compared to other hand-
Days than Hierarchical Evolution. Compared with the com- crafted models, SaMuNet has remarkable advantages; in partic-
pletely automatic search algorithm, the proposed algorithm ular, when compared to Maxout, the error rate of SaMuNet is
has a classification 1.8% lower than Large-scale Evolution, 18.6% lower. In addition, the table shows that the automatically
2.4% lower than NAS, 3.2% lower than MetaQNN, and 0.7% searched architectures generally have a lower classification error
lower than AE-CNN. Since our proposed algorithm and AE- rate than the manually designed networks. In addition, expert
CNN are both block-based constructive methods and both knowledge is not required when designing architectures
involve GAs, this also proves the effectiveness of our proposed for SaMuNet.
adaptive mutation strategy. Compared with the semi-automatic and completely auto-
It can be seen that our algorithm is less time-consum- matic architecture design methods, SaMuNet also achieved the
ing than most algorithms, except for EPSOCNN. Because lowest error rate. Among them, the classification error rate of
EPSOCNN uses a multi-server distributed computing meth- our proposed algorithm is 0.6% lower than that of AE-CNN.
od, it is faster than our algorithm which did not use a weight This also proves the effectiveness of the adaptive mutation strat-
sharing strategy. egy proposed in this paper. Compared with the Large-Scale
Evolution and NAS algorithms, SaMuNet is dozens of times
B. The Classification Performance of the faster. Because NAS is based on reinforcement learning, it usu-
Algorithms on CIFAR100 ally consumes more computing resources. In general, SaMuNet
As one of the most widely used datasets for machine learning has achieved very promising performance in classification accu-
algorithms, the classification task on CIFAR100 is more diffi- racy, and the consumption of computing resources is relatively
cult than that on CIFAR10. It can be seen from Table II that lower than its comparison algorithms.
TABLE II THE PERFORMANCE OF ALGORITHMS ON CIFAR10 AND CIFAR100.
DATASETS CIFAR10(%) CIFAR100(%) PARAMETERS GPU/DAYS NETWORK CONSTRUCTIONS

VGG [10] 6.66 28.05 20.04M – Hand-crafted
ResNet [5] (depth=1,202) 7.93 27.82 1.7M – Hand-crafted
DenseNet (K=12) [12] 5.24 24.42 1.0M – Hand-crafted
Maxout [55] 9.3 38.6 – – Hand-crafted
Network in Network [53] 8.81 35.68 – – Hand-crafted
Highway Network [54] 7.72 32.39 – – Hand-crafted
ALL-CNN [56] 7.25 33.71 – – Hand-crafted
Genetic CNN [28] 7.1 29.05 – 17 Semi-automatic search
Hierarchical [57] Evolution 3.63 – – 300 Semi-automatic search
EAS [58] 4.23 – 23.4M 10 Semi-automatic search
Large-Scale Evolution [26] 5.4 – 5.4M 2,750 Automatic search
Large-Scale Evolution – 23 40.4M 2,750 Automatic search
NAS [17] 6.01 – 2.5M 22,400 Automatic search
MetaQNN [59] 6.92 27.14 – 100 Automatic search
AE-CNN [29] 4.3 – 2.0M 27 Automatic search
AE-CNN – 20.85 5.4M 36 Automatic search
NSGANet [62] 4.67 – 0.2M 27 Automatic search
NSGANet – 25.17 0.2M 27 Automatic search
Firefly-CNN [60] 3.3 22.3 3.21M – Automatic search
EPSO-CNN [61] 3.69 – 6.77M 4– Automatic search
SaMuNet (ours) 3.6 – 1.5M 23 Automatic search
SaMuNet (ours) – 20.2 4.6M 25 Automatic search

C. Evolution Process on CIFAR10
To prove the effectiveness of the self-adaptive The simplest way to choose parents is to select the
mutation mechanism and semi-complete bina- individual with the highest fitness value from the
ry competition strategy proposed in this paper,
the classification accuracy of the final searched
population each time, but this will cause the population
individuals is collected, and plot the statistical to lose diversity and fall into a local optimum.
results. In this paper, the population size is 20,
and the total number of generations is 20.
Figure 7 compares the proposed algorithm and GA. To CNN architecture design process, and uses an adaptive muta-
intuitively show how the self-adaptive mutation mechanism tion strategy to make the algorithm adaptively select mutation
guides the evolution process, a tendency line is drawn by con- strategies in different evolution stages, so that the algorithm can
necting the averages over ten points before and after the cur- guide the search process effectively. In addition, a semi-com-
rent point. It is noted that the accuracy of some individuals is plete binary competition strategy is also designed for environ-
0% in Figure 7. Because the variable-length encoding strategy mental selection. This strategy can better retain the elites. It can
is used, the length that the offspring individuals generated by be seen from the experimental results that our proposed algo-
the crossover individuals is too long. Due to the memory limi- rithm achieves better results in comparison with different
tation of our experimental equipment, an out of memory hand-crafted, semi-automatic, and completely automatic struc-
occurs. As shown in Figure 7, for SaMuNet, from the 1st gen- tural design methods in terms of accuracy and consumption of
eration to the 7th generation, the average classification accuracy computing resources. SaMuNet still consumes a lot of comput-
rises faster. From the 12th generation, the average classification ing resources, thus, in the future, our team will be committed
accuracy is stable at about 90%, and the gap between individu- to accelerating the process of evaluating individuals, and reduc-
als narrows, which shows that the algorithm is gradually con- ing consumption of resources.
verging. SaMuNet can achieve about 96% accuracy in the last
generation. For the comparison method GA, due to the lack of
guidance from the self-adaptive mechanism, the individual
1.0
accuracy rises slowly. After the 12th generation, the traditional
GA gradually converges too, but the final convergence accura- 0.8
Fitness (Accuracy)
cy is only about 94%. The analysis proves that the self-adaptive

mechanism has better performance in the later stage, and it can 0.6
also guide the later evolution. Tendency (SaMuNet)
In addition, to prove the effectiveness of the semi-complete 0.4 Tendency (GA)
binary competition strategy, this paper makes a comparison Individual (SaMuNet)
between SaMuNet with semi-complete binary competition 0.2
Individual (GA)
and SaMuNet with the binary competition algorithm. The
0.0
conditions remain unchanged from the previous experiment. It
0
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
can be seen from Figure 8 that, compared with the binary
competition, the semi-complete binary competition strategy Generations
appears to be more stable throughout the entire evolution pro-
FIGURE 7 Scatter diagram of SaMuNet and GA.
cess. The competition method tends to retain bad individuals.
Especially after the 15th generation, there are some poor indi-
viduals with an accuracy of 80%.
The self-adaptive mutation mechanism proposed in this 1.0
paper can effectively improve the search efficiency of the
algorithm, and guide the later evolution through the experi- 0.8
Fitness (Accuracy)
ence accumulated in the early stage. In addition, the pro-

0.6
posed semi-complete binary competition strategy can
effectively maintain population stability and prevent popula-
0.4 SaMuNet With Semi-Complete
tion degradation.
Binary Competition
0.2 SaMuNet With Binary Competition
VI. Conclusions and Future Work
The goal of this paper is to design a CNN architecture search 0.0
algorithm by using SaMuNet with an adaptive mutation strate-
0
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
gy and a semi-complete binary competition strategy. This goal Generations

is successfully achieved on CIFAR10 and CIFAR100. This
paper uses a block-based design method to accelerate the entire FIGURE 8 Scatter diagram of semi-completely binary competition.

VII. Acknowledgments [30] L. Frachon, W. Pang, and G. M Coghill, “Immunecs: Neural committee search by an
artificial immune system,” 2019, arXiv:1911.07729.
This work was partially supported by the National Natural Sci- [31] Z. Chen, Y. Zhou, and Z. Huang, “Auto-creation of effective neural network archi-
ence Foundation of China (61876089, 61876185, 61902281), the tecture by evolutionary algorithm and ResNet for image classification,” in Proc. IEEE Int.
Conf. Systems, Man Cybernetics (SMC), 2019, pp. 3895–3900. doi: 10.1109/SMC.2019.8914267.
Engineering Research Center of Digital Forensics, Ministry of [32] D. Song, C. Xu, X. Jia, Y. Chen, C. Xu, and Y. Wang, “Efficient residual dense
Education, the Priority Academic Program Development of Jiang- block search for image super-resolution,” in Assoc. the Adv. Artif. Intell., 2020, pp. 12,007–
12,014. doi: 10.1609/aaai.v34i07.6877.
su Higher Education Institutions, and the Postgraduate Research & [33] P. R. Lorenzo and J. Nalepa, “Memetic evolution of deep neural networks,” in Proc.
Practice Innovation Program of Jiangsu Province (KYCX21_1019). Genetic Evolutionary Comput. Conf., 2018, pp. 505–512.
[34] Y. Xue, J. Jiang, B. Zhao, and T. Ma, “A self-adaptive artificial bee colony algorithm
based on global best for global optimization,” Soft Comput., vol. 22, no. 9, pp. 2935–2952,
References 2018. doi: 10.1007/s00500-017-2547-1.
[1] I. Al Ridhawi, S. Otoum, M. Aloqaily, and A. Boukerche, “Generalizing AI: Chal- [35] M. Suganuma, M. Kobayashi, S. Shirakawa, and T. Nagao, “Evolution of deep con-
lenges and opportunities for plug and play AI solutions,” IEEE Netw., 2020. doi: 10.1109/ volutional neural networks using Cartesian genetic programming,” Evolut. Comput., vol.
MNET.011.2000371. 28, no. 1, pp. 141–163, 2020. doi: 10.1162/evco_a_00253.
[2] G. Hinton et al., “Deep neural networks for acoustic modeling in speech recognition: [36] K. Chen and W. Pang, “Immunetnas: An immune-network approach for searching
The shared views of four research groups,” IEEE Signal Process. Mag., vol. 29, no. 6, pp. convolutional neural network architectures,” 2020, arXiv:2002.12704.
82–97, 2012. [37] A. Emin Orhan and X. Pitkow, “Skip connections eliminate singularities,” 2017,
[3] J. Long, E. Shelhamer, and T. Darrell, “Fully convolutional networks for semantic arXiv:1701.09175.
segmentation,” in Proc. IEEE Conf. Comput. Vision Pattern Recogn., 2015, pp. 3431–3440. [38] T. Elsken, J.-H. Metzen, and F. Hutter, “Simple and efficient architecture search for
[4] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille, “Deeplab: convolutional neural networks,” 2017, arXiv:1711.04528.
Semantic image segmentation with deep convolutional nets, atrous convolution, and fully [39] Y. Bengio, P. Simard, and P. Frasconi, “Learning long-term dependencies with gradi-
connected CRFs,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 40, no. 4, pp. 834–848, ent descent is difficult,” IEEE Trans. Neural Netw., vol. 5, no. 2, pp. 157–166, 1994. doi:
2017. doi: 10.1109/TPAMI.2017.2699184. 10.1109/72.279181.
[5] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” [40] A. Kai Qin, V. L. Huang, and P. N. Suganthan, “Differential evolution algorithm
in Proc. IEEE Conf. Comput. Vision Pattern Recogn., 2016, pp. 770–778. with strategy adaptation for global numerical optimization,” IEEE Trans. Evolut. Comput.,
[6] X. Yao and Md M. Islam, “Evolving artificial neural network ensembles,” IEEE Com- vol. 13, no. 2, pp. 398–417, 2008. doi: 10.1109/TEVC.2008.927706.
put. Intelli. Mag., vol. 3, no. 1, pp. 31–42, 2008. doi: 10.1109/MCI.2007.913386. [41] M. D. Zeiler, “ADADELTA: an adaptive learning rate method,” 2012, arXiv:1212.5701.
[7] R. Collobert and J. Weston, “A unified architecture for natural language processing: [42] J. Duchi, E. Hazan, and Y. Singer, “Adaptive subgradient methods for online learning
Deep neural networks with multitask learning. in Proc. 25th Int. Conf. Machine Learning, and stochastic optimization,” J. Machine Learning Res., vol. 12, no. 7, pp. 2121–2159, 2011.
2008, pp. 160–167. [43] D. P. Kingma and J. Ba, “ADAM: A method for stochastic optimization,” 2014,
[8] O. Vinyals, A. Toshev, S. Bengio, and D. Erhan, “Show and tell: A neural image cap- arXiv:1412.6980.
tion generator,” in Proc. IEEE Conf. Comput. Vision Pattern Recogn., 2015, pp. 3156–3164. [44] K. O. Stanley, J. Clune, J. Lehman, and R. Miikkulainen, “Designing neural net-
[9] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep works through neuroevolution,” Nat. Mach. Intell., vol. 1, no. 1, pp. 24–35, 2019. doi:
convolutional neural networks,” in Proc. Adv. Neural Information Process. Syst., 2012, pp. 10.1038/s42256-018-0006-z.
1097–1105. [45] G. C. Cawley and N. L. C. Talbot, “On over-fitting in model selection and sub-
[10] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale sequent selection bias in performance evaluation,” J. Machine Learning Res., vol. 11, pp.
image recognition,” 2014, arXiv:1409.1556. 2079–2107, 2010.
[11] C. Szegedy et al., “Going deeper with convolutions,” in Proc. IEEE Conf. Comput. [46] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Drop-
Vision and Pattern Recogn., 2015, pp. 1–9. out: A simple way to prevent neural networks from overfitting,” J. Machine Learning Res.,
[12] G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger, “Densely connected vol. 15, no. 1, pp. 1929–1958, 2014.
convolutional networks,” in Proc. IEEE Conf. Comput. Vision and Pattern Recogn., 2017, [47] B. L. Miller et al., “Genetic algorithms, tournament selection, and the effects of
pp. 4700–4708. noise,” Complex Syst., vol. 9, no. 3, pp. 193–212, 1995.
[13] M. Suganuma, S. Shirakawa, and T. Nagao, “A genetic programming approach to de- [48] Y. Leung, Y. Gao, and Z.-B. Xu, “Degree of population diversity—A perspective on
signing convolutional neural network architectures,” in Proc. Genetic Evolutionary Comput. premature convergence in genetic algorithms and its Markov chain analysis,” IEEE Trans.
Conf., 2017, pp. 497–504. doi: 10.1145/3071178.3071229. Neural Netw., vol. 8, no. 5, pp. 1165–1176, 1997. doi: 10.1109/72.623217.
[14] Y. Sun, B. Xue, M. Zhang, and G. G Yen, “Evolving deep convolutional neural [49] L. Davis, “Handbook of genetic algorithms,” 1991.
networks for image classification,” IEEE Trans. Evolutionary Comput., vol. 24, no. 2, pp. [50] E. Real, A. Aggarwal, Y. Huang, and Q. V Le, “Regularized evolution for image
394–407, 2019. doi: 10.1109/TEVC.2019.2916183. classifier architecture search,” in Proc. AAAI Conf. Artif. Intell., vol. 33, pp. 4780–4789,
[15] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, “Learning representa- 2019. doi: 10.1609/aaai.v33i01.33014780.
tions by back-propagating errors,” Nature, vol. 323, no. 6088, pp. 533–536, 1986. doi: [51] H. Zhu, Z. An, C. Yang, K. Xu, E. Zhao, and Y. Xu, “EENA: Efficient evolution of
10.1038/323533a0. neural architecture,” in Proc. IEEE Int. Conf. Comput. Vision Workshops, 2019, pp. 1–13.
[16] Y. Liu, Y. Sun, B. Xue, M. Zhang, and G. Yen, “A survey on evolutionary neural [52] J. A. Vasconcelos, J. A. Ramirez, R. H. C. Takahashi, and R. R. Saldanha, “Im-
architecture search,” 2020, arXiv:2008.10937. provements in genetic algorithms,” IEEE Trans. Magn., vol. 37, no. 5, pp. 3414–3417,
[17] B. Zoph and Q. V. Le, “Neural architecture search with reinforcement learning,” 2001. doi: 10.1109/20.952626.
2016, arXiv:1611.01578. [53] M. Lin, Q. Chen, and S. Yan, “Network in network,”, 2013, arXiv:1312.4400.
[18] L. P. Kaelbling, M. L. Littman, and A. W. Moore, “Reinforcement learning: A sur- [54] R. K. Srivastava, K. Greff, and J. Schmidhuber, “Highway networks,” 2015, arX-
vey,” J. Artif. Intell. Res., vol. 4, pp. 237–285, 1996. iv:1505.00387.
[19] H. Liu, K. Simonyan, and Y. Yang, “Darts: Differentiable architecture search,” 2018, [55] I. Goodfellow, D. Warde-Farley, M. Mirza, A. Courville, and Y. Bengio, “Maxout
arXiv:1806.09055 networks,” in Proc. Int. Conf. Machine Learning, 2013, pp. 1319–1327.
[20] T. Bäck, D. B. Fogel, and Z. Michalewicz, “Handbook of evolutionary computa- [56] J. T. Springenberg, A. Dosovitskiy, T. Brox, and M. Riedmiller, “Striving for sim-
tion,” Release, vol. 97, no. 1, p. B1, 1997. plicity: The all convolutional net,” 2014, arXiv:1412.6806.
[21] K. Deb, A. Pratap, S. Agarwal, and T. A. M. T. Meyarivan, “A fast and elitist multi- [57] H. Liu, K. Simonyan, O. Vinyals, C. Fernando, and K. Kavukcuoglu, “Hierarchical
objective genetic algorithm: NSGA-II,” IEEE Trans. Evolutionary Comput., vol. 6, no. 2, representations for efficient architecture search,” 2017. arXiv:1711.00436.
pp. 182–197, 2002. [58] H. Cai, T. Chen, W. Zhang, Y. Yu, and J. Wang, “Efficient architecture search by
[22] J. Kennedy and R. Eberhart, “Particle swarm optimization,” in Proc. ICNN’95—Int. network transformation,” 2017, arXiv:1707.04873.
Conf. Neural Netw., vol. 4, pp. 1942–1948, 1995. [59] B. Baker, O. Gupta, N. Naik, and R. Raskar, “Designing neural network architec-
[23] M. Dorigo, M. Birattari, and T. Stutzle, “Ant colony optimization,” IEEE Comput. tures using reinforcement learning,” 2016, arXiv:1611.02167.
Intell. Mag., vol. 1, no. 4, pp. 28–39, 2006. doi: 10.1109/MCI.2006.329691. [60] A. I. Sharaf and E.-S. F. Radwan, “An automated approach for developing a convo-
[24] Y. Xue, B. Xue, and M. Zhang, “Self-adaptive particle swarm optimization for large- lutional neural network using a modified firef ly algorithm for image classification,” in
scale feature selection in classification,” ACM Trans. Knowl. Discovery Data (TKDD), vol. Applications of Firefly Algorithm and Its Variants. Springer-Verlag, 2020, pp. 99–118.
13, no. 5, pp. 1–27, 2019. doi: 10.1145/3340848. [61] B. Wang, B. Xue, and M. Zhang, “Particle swarm optimisation for evolving deep neural
[25] D. Floreano, P. Dürr, and C. Mattiussi, “Neuroevolution: From architectures to networks for image classification by evolving and stacking transferable blocks,” in Proc. IEEE
learning,” Evolution. Intell., vol. 1, no. 1, pp. 47–62, 2008. Congr. Evolut. Comput. (CEC), 2020, pp. 1–8. doi: 10.1109/CEC48606.2020.9185541.
[26] E. Real et al., “Large-scale evolution of image classifiers,” 2017, arXiv:1703.01041. [62] Z. Lu, I. Whalen, V. Boddeti, Y. Dhebar, K. Deb, E. Goodman, and W. Banzhaf,
[27] A. Krizhevsky et al., “Learning multiple layers of features from tiny images,” 2009. “NSGA-NET: Neural architecture search using multi-objective genetic algorithm,” in
[28] L. Xie and A. Yuille, “Genetic CNN,” in Proc. IEEE Int. Conf. Comput. Vision, 2017, Proc. Genetic Evolut. Comput. Conf., 2019, pp. 419–427.
pp. 1379–1388. [63] Y. Xue, Y. Tang, X. Xu, J. Liang, and F. Neri, “Multi-objective feature selection with
[29] Y. Sun, B. Xue, M. Zhang, and G. G Yen, “Completely automated CNN architec- missing data in classification,” IEEE Trans. Emerg. Topics Comput. Intell. doi: 10.1109/
ture design based on blocks,” IEEE Trans. Neural Netw. Learning Syst., vol. 31, no. 4, pp. TETCI.2021.3074147.
1242–1254, 2019. doi: 10.1109/TNNLS.2019.2919608.

Bo Peng Application
North China Electric Power University, CHINA Notes
Ying Bi, Bing Xue, Mengjie Zhang
Victoria University of Wellington, NEW ZEALAND
Shuting Wan
North China Electric Power University, CHINA
Multi-View Feature Construction Using Genetic Programming

for Rolling Bearing Fault Diagnosis
Abstract failures and the percentage of failures
R olling bearing fault diagno-

sis is an important task in
mechanical engineering.
Existing methods have several limi-
tations, such as requiring domain
caused by rolling bearing damage
reaches nearly 30% [3]. Therefore, it
is important to identify the faults of
rolling bearing to monitor bearing
status, ensure machinery safety, reduce
knowledge and a large number of economic losses, and avoid casualties.
training samples. To address these In recent years, many methods
limitations, this paper proposes a have been developed for rolling
new diagnosis approach, i.e., multi- bearing fault diagnosis. Because the
view feature construction based on vibration signals are easy to collect
genetic programming with the idea ©SHUTTERSTOCK.COM/SERGEY TARASOV
and often contain information of
of ensemble learning (MFCGPE), the running states, the studies on
to automatically construct high-level fea- results show that MFCGPE achieves using vibration signals for fault diagnosis
tures from multiple views and build an higher diagnostic accuracy than all the have gained much attention. The col-
effective ensemble for identifying different compared methods on the three datasets lected vibration signals often have back-
fault types using a small number of train- with a small number of training samples. ground noise or information loss. A
ing samples.The MFCGPE approach uses variety of signal processing methods, such
a new program structure to automatically I. Introduction as wavelet transform [4], empirical mode
construct a flexible number of features Rolling bearings are important support- decomposition [5], fast spectral kurtosis
from every single view. A new fitness ing equipment and have been used in all [6], maximum correlated kurtosis decon-
function based on accuracy and distance is kinds of rotating machinery, such as elec- volution [7], and variational mode decom-
developed in MFCGPE to improve the tric motors, turbine generators, and high- position [8], have been used to suppress
discriminability of the constructed fea- speed trains [1]. The performance of the interference of noise and harmonics
tures. To further improve the generaliza- rolling bearings can be affected by the and strengthen signal characteristics.
tion performance, an ensemble of classifiers high temperature, high pressure, and alter- Experts have conducted the spectrum
based on k-nearest neighbor is created by nating load during the equipment opera- analysis on the processed signals and rec-
using the constructed features from every tion [2]. The rolling bearings are often ognized the fault characteristic frequency
single view. Three bearing datasets and 19 inevitably damaged as the running time for fault diagnosis [9]. However, these
competitive methods are used to validate increases [2]. The damage of rolling bear- methods do not provide satisfactory results
the effectiveness of the new approach.The ings will lead to mechanical equipment and require extensive domain expertise,
which is time-consuming and expensive.
Digital Object Identifier 10.1109/MCI.2021.3084495 Corresponding Author: Ying Bi (e-mail: ying.bi@ecs
To automatically identify faults,
Date of current version: 15 July 2021 .vuw.ac.nz). methods based on machine learning

have been developed for rolling bearing often use a large number of training sam- constructed features are effective. Owing
feature analysis and fault diagnosis. These ples to build models/classifiers for fault to these advantages, many GP methods
methods extract various types of features diagnosis. In real-world scenarios, the have been developed to feature construc-
from the vibration signals and use tradi- training samples are often difficult to tion in various problems and achieved
tional classification algorithms to per- obtain and require extensive manual effort promising results [36]–[38].
form fault diagnosis [10]. The commonly to label. Furthermore, designing an effec- However, existing GP methods need
used features include simple statistical tive architecture for the NN model typi- several improvements to construct effec-
features of time-domain and frequency- cally requires rich expertise in the tive features for fault diagnosis using a
domain, and the non-linear evaluation problem and NN domains. Therefore, it is small number of training samples. First,
indicators, such as fractal dimension [11], necessary to develop a new intelligent the features extracted from the vibration
Lyapunov exponent [12], and entropy- fault diagnosis method that can achieve signals can be multiple views, such as
based features [13]–[15]. Typically, a large good performance using a small number time-domain view and the frequency-
number of features are extracted to of training samples. domain view [39]. Each view represents
describe the vibration signal and these Typically, the feature quality is impor- different characteristics of the data.
features may contain redundant or irrele- tant for effective rolling bearing fault Existing methods often simply concate-
vant features, which may reduce the fault diagnosis. Traditional methods use many nate all the features from different views,
diagnosis accuracy. Therefore, feature ways, such as signal processing, feature which may not be effective. GP has sel-
selection methods, such as max-relevance selection, and feature learning, to improve dom been developed for constructing
and min-redundancy [16], ReliefF [17], the quality of the extracted features [15], high-level features from different views
Laplacian score [18], principal compo- [21], [25]. Feature construction is an (i.e., multi-view feature construction).
nent analysis [19], local discriminant effective way to generate new informa- Second, existing fitness measures of GP
analysis [20], and margin fisher analysis tive and high-level features from the for feature construction often use classi-
[21], have been employed to select a subset original low-level features [31]. As shown fication accuracy, which may cause the
of important features for effective fault in existing work [11]–[15], the features overfitting issue, particularly when the
diagnosis. To further improve the fault of vibration signals can be extracted from number of training samples is small.
diagnosis performance, different classifi- multiple views and each view represents Third, the generalization performance
cation methods have been explored for different characteristics. However, feature of the features constructed by GP can
different tasks, including k-nearest neigh- construction, particularly constructing be further improved by constructing
bor (KNN) [22], artificial neural net- features from multiple views, which can ensembles of classifiers for classification.
work (ANN) [23], extreme learning create informative features to improve However, this has seldom been explored.
machine (ELM) [24], and support vector the fault diagnosis, has not been exten- Therefore, this paper develops a new GP
machine (SVM) [25]. However, these sively explored in this field. approach to address these limitations.
methods typically contain several steps of Evolutionary computation (EC) meth- The goal of this paper is to propose a
signal processing, feature design, feature ods ordinarily do not use extensive new intelligent approach, i.e., multi-
selection, and classifier learning. These domain knowledge to find solutions and view feature construction based on GP
steps need to be appropriately connected have been successfully applied to many with the idea of ensemble learning
to achieve accurate fault diagnosis. In difficult problems [32]–[34]. Genetic pro- (hereafter called MFCGPE), to rolling
addition, most of these steps require gramming (GP) is an EC technique that bearing fault diagnosis using a small
domain expertise, which causes the diag- has been widely used as one of the most number of training samples. The pro-
nosis method to be effective only on a popular feature construction methods [31], posed approach is able to construct a
specific fault diagnosis task and unable to [35]. Unlike other EC methods, which use flexible number of high-level features
be generalized to other even similar tasks. a fixed-length representation, GP has a from every single view and build an
Deep learning is an advanced machine variable-length tree-based representation, effective ensemble using the constructed
learning approach that has been applied enabling it to automatically construct multi-view features to achieve high gen-
to fault diagnosis [26], [27]. Most deep high-level features from low-level features eralization performance when the num-
learning methods are based on neural net- in a more flexible way [35]. For feature ber of training samples is small. The
works (NNs). Different types of neural construction, GP typically evolves models performance of MFCGPE will be tested
networks, such as deep belief networks, that consist of a set of operators (such as +, on three datasets of varying difficulty
sparse autoencoders, and convolutional − and ×) and the original features. The and compared with 19 competitive
neural networks, have also been devel- original features are operated by these methods. Further analysis will be con-
oped for effective fault diagnosis [28]– operators and new features are then gener- ducted to show the effectiveness of the
[30]. These methods can automatically ated. The tree-based solutions of GP constructed features and the ensemble.
learn features from the vibration signals potentially have high interpretability to The main contr ibutions of this
and train classifiers for effective fault diag- provide insights into what features are paper are summarized in the follow-
nosis. However, the NN-based methods important for construction and why the ing four aspects.

1) A new program structure is devel- variable-length representation [40]. An generated using genetic operators, i.e.,
oped in MFCGPE to allow it to individual of GP is typically represented Elitism, Crossover, and Mutation operators.
construct high-level features from using a tree-based structure, as shown in The Elitism operation is to copy the best
multiple views. More importantly, Figure 1. This example tree/program individuals from the current generation
the number of constructed features consists of internal nodes (the functions to the next generation. A number of
can be adaptively determined with- or operators selected from the function individuals are selected based on their
out being pre-defined. set) and leaf nodes (the arguments or fitness values via Tournament selection to
2) A new fitness function based on constants selected from the terminal set). be used as parents to generate new
accuracy and distance is proposed in This example tree can be mathematical- individuals using Crossover and Muta-
MFCGPE to enable the constructed ly expressed as 5 ' (y + 8) + (2 # x), tion operators. The Crossover operation
features to be accurate and discrimi- where +, ×, and ' (protected division, is to exchange the subtrees of two
native when the number of training return 0 if the divisor is 0) are the inter- parents to generate new offspring.
samples is small. The new fitness nal nodes, and x, y, 2, 5, and 8 are the The Mutation operation is to random-
function is able to maximize the leaf nodes. This equation can also be ly delete a subtree of the parent and
classification performance and the treated as a newly constructed feature, grow a new subtree from that node. The
inter-class distances of training sam- where x and y are two original features. new population of individuals are evalu-
ples, and minimize the intra-class GP can automatically select functions ated and evolved generation by genera-
distances of training samples. from the function set and terminals from tion. When a termination criterion is
3) An ensemble is created using the fea- the terminal set to evolve the best trees
tures constructed by MFCGPE from for solving a task via an evolutionary pro-
multiple views and using KNN to cess. Figure 2 shows the overall evolu- +
make predictions for unseen test sam- tionary process of GP. First, a population
÷ ×
ples. Using ensemble for classification of computer programs are randomly ini-
can further improve the generalization tialized in the search space. Then, each 5 + 2 x
performance, particularly when the individual (program) in the population is
number of training samples is small. evaluated using a fitness function and y 8
4) A new intelligent diagnosis approach, assigned a fitness value. At each genera-
namely MFCGPE, is developed to tion, a new population of individuals are FIGURE 1 An example program of GP.
achieve effective fault diagnosis of roll-
ing bearings with the use of a small
number of training samples. MFCGPE Start
can achieve better results than 19 com-
petitive methods on three datasets. Spe-
Initial Population
cifically, MFCGPE can achieve a
maximal and average accuracy of 100%
and above 99% on three fault datasets
with only five training samples per class.
The rest of this paper is organized as Generate New Population
follows: Section II briefly introduces the Crossover
GP algorithm and reviews its applications Fitness Function
on feature construction and fault diagno-
sis. Section III describes the MFCGPE No
approach in detail. Section IV designs the
Terminate Mutation Elitism
experiments. The experimental results are
analyzed and compared in Section V. Sec- Yes
tion VI further analyses the proposed
Best Individual
approach. Section VII presents conclu-
sions and future work.
II. Background and Related Work
A. Genetic Programming (GP)

Unlike other EC techniques such as End
genetic algorithms (GAs) that have a
fixed-length representation, GP has a FIGURE 2 Evolutionary process of GP.

satisfied, the evolutionary process stops, C. GP for Fault Diagnosis number of training samples. In general, a
and the best individual is obtained. To the best of our knowledge, GP has small number of training samples can
rarely been applied to fault diagnosis of not well represent the class distribution
B. GP for Feature Construction mechanical equipment. In [48], GP was information comprehensively and often
In recent years, many GP based feature used as a binary classifier for fault diag- leads to poor generalization perfor-
construction methods have been pro- nosis of rolling bearing. The classifica- mance. To address the above issues, this
posed, in which the arithmetic and logi- tion performance of GP with the use of paper proposes a new GP based fault
cal operators are usually used as the statistical features, spectral features, and diagnosis approach (i.e., MFCGPE) to
function set, and the low-level features the combination of statistical and spec- adaptively construct a flexible number
are usually used as the terminal set. tral features are compared. The results of informative features from multiple
Otero et al. [41] used information gain showed that the combined features views for representing sample compre-
ratio as the fitness function of GP to achieved better fault classification accu- hensively, and create an ensemble using
construct features. Muharram et al. [42] racy. Guo et al. [49] proposed a GP- these constructed features for effective
comprehensively compared four differ- based rolling bearing fault diagnosis fault diagnosis with the use of a small
ent fitness functions based on informa- method, in which a fitness function number of training samples.
tion gain in GP for feature construction. based on Fisher criteria was developed
Guo et al. [43], [44] used the Fisher cri- to enable GP to construct the high III. The Proposed Approach
teria and its improved version as the fit- order of moments of vibration signals as In this section, the details of the MFC-
ness function of GP. Neshatian et al. informative features. This method used GPE fault diagnosis approach will be
[45] proposed a multiple feature con- two classification algorithms for multi- introduced, including the algorithm
struction method for symbolic learning class fault classification. Xuan et al. [50] overview, the program structure, the
classifiers, where the constructed fea- proposed a gear fault diagnosis method function set, the terminal set, the fitness
tures are evaluated by a fitness function that combines GP and SVM, where a function, and ensemble construction for
that maximizes the purity of the class distance measure based fitness function fault diagnosis.
interval. The above methods belong to was developed and the power spectral
the filter-based feature construction features of vibration signals were used to A. Overview of MFCGPE
methods. Guo et al. [36] proposed a construct high-level features. These two Figure 3 shows the overall structure of
method based on GP and KNN to clas- methods only construct a single feature MFCGPE to rolling bearing fault diag-
sify EEG signals, which achieved a clas- for fault diagnosis, which may not be nosis with a small number of training
sification accuracy of 99% on one effective when the machine working samples. First, the collected vibration
dataset. Bi et al. [37] proposed a GP conditions become more complicated. signals of rolling bearings are trans-
method combined with image-related In summary, although these methods formed into three different-view fea-
operators and SVM for image classifica- successfully show that GP offers possi- tures by calculating the statistical values
tion, which obtained better accuracy bilities for dealing with fault diagnosis, of time waveform and frequency spec-
than the deep learning methods on there are still some problems that need trum. The 16 time-domain features
some datasets. Aslam et al. [38] com- to be addressed. The GP based feature (TDF), the 13 frequency-domain fea-
bined GP and KNN for automatic construction methods in [45]–[47] have tures (FDF), and the combination of
modulation classification, and the meth- discussed the effect of the number of TDF and FDF (named TFDF features)
od achieved better classification perfor- constructed features on classification are represented by View1, View2, and
mance. The above methods belong to performance. However, these methods View3, respectively. Second, the trans-
the wrapper-based feature construction set the number of constructed features formed data set is divided into the train-
methods. Tran et al. [46] developed a fit- according to prior knowledge and mul- ing set and the test set, and three
ness function using the classification tiple trials, therefore the adaptability of independent GPs are utilized to con-
accuracy and a distance measure to them is poor. The GP based fault diag- struct high-level features of each single-
improve the performance of the con- nosis method in [48]–[50] only used the view feature set, respectively. The
structed features on high-dimensional features of a single view for feature con- program structure and the function set
data classification tasks and discussed the struction. However, since the features of used for feature construction from dif-
impact of the number of constructed different views have both internal rela- ferent views are the same, but the termi-
features on classification accuracy. Ma et tions and interval differences, using only nal set is different. Only the training set
al. [47] designed a hybrid fitness func- single view features may ignore the is used for the evolutionary process of
tion that combined information gain characteristics of the samples. Moreover, GP to construct high-level features that
ratio and the error rate of a classification these GP based fault diagnosis methods are expected to have a small intra-class
algorithm, and proposed a feature con- use sufficient training data to construct distance and a large inter-class distance
struction strategy that obtained multiple features and perform classification and for effective fault diagnosis. Third, the
high-level features using a single GP. do not consider the scenario of a small constructed features are provided as the

inputs of classification algorithms for outputs/features can be further used for without presetting the number of con-
fault diagnosis. We use the features con- feature construction. To achieve feature structed features.
structed from each view to feed three combination, two new operators (FC 2
classifiers/KNNs individually, which are and FC m) are designed to combine D. Terminal Set
combined via majority voting to form multiple features into a vector, where The terminal set of MFCGPE is com-
an ensemble to make a good prediction the inputs of FC 2 are two features, and posed of the low-level features extract-
for the unseen test set. the inputs of FC m can be one feature, ed from the vibration signals. The
one vector, or two vectors. The use of terminal sets for feature construction
B. Program Structure these feature combination operators from different views are different. For
To enable MFCGPE to construct a flex- enables MFCGPE to adaptively con- View1 (only consider the time-domain
ible number of high-level features, a struct multiple high-level features characteristics of rolling bearing signals),
new program structure with the input,
feature construction, feature combina-
tion, and output layers are developed
based on strongly-typed GP (STGP) Vibration Signal Multi-View Feature Set
[51]. These layers are connected in a S1
16 View1 Features T1×16 F1×13 TF1×29
bottom-up manner. Each layer has a S2
...
...
...
specific function and type constraint. ...
13 View2 Features Tp×16 Fp×13 TFp×29
The input layer takes the original fea- Sp
...
...
...
tures as inputs, which are the terminals
...
Sq 29 View3 Features Tq ×16 Fq ×13 TFq ×29

of GP. The feature construction layer
transforms the original features into
Step 1: Original Feature Extraction
high-level features. The feature combi-
nation layer combines multiple con-
structed features into a vector to Train Set Test Set
comprehensively describe the sample.
S1 ... ... ... S1 ... ... ...
The output layer returns the construct-
ed feature vector as outputs for fault
...
...
...
...
...
...
...
...
diagnosis. The tree depths of the input Sm ... ... ... Sn ... ... ...
and output layers are one, and those of
the feature construction and feature
Evolutionary Learning
combination layers are automatically
adjusted according to the given task. GP1 GP2 GP3
Figure 4 shows the program structure
and an example program of MFCGPE,
respectively. This example program rep- Best Model
resents a feature vector containing three
constructed features. The detailed
description of the operators and the ter-
minals will be described in the follow-
Step 2: Multi-View Feature Construction
ing subsections.
C. Function Set Training Set and Test Set With the Constructed Features
The function set of MFCGPE is com-
posed of two types of operators, i.e., the
Classifier 1 Classifier 2 Classifier 3
feature construction operators and the
feature combination operators. Table I
lists the detailed information of the Majority Voting
function set. Four arithmetic operators
including +, −, ×, and ' are for feature
construction, where ' is protected by Final Diagnosis
returning 0 if the divisor is 0. The inputs Step 3: Ensemble Diagnosis
of +, −, ×, and ' operators are two fea-
tures, and the output of them is a new
feature. It should be noted that their FIGURE 3 Overall structure of the MFCGPE fault diagnosis approach.

the terminal set contains 16 statistical where s(i) represents the time signal, variance, maximum value, minimum
features T1 + T16, which are extracted and N represents the number of the sig- value, peak-to-peak value, waveform
through performing the simple mathe- nal data points. T1 + T16 represent the index, peak index, pulse index, margin
matical calculation on the amplitude of vibration signal’s mean value, standard index, skewness index and kurtosis
the raw vibration signals [25]. Table II deviation, square root amplitude, abso- index, respectively.
lists the terminal set used under View1, lute mean value, skewness, kurtosis, For View2 (only consider the fre-
quency-domain characteristics of rolling
bearing signals), the terminal set con-
tains 13 statistical features F1 + F13,
Output
which are extracted through performing
the mathematical calculation on the
Output FCm
amplitude of frequency spectrum that
FC2 – was obtained through Fast Fourier
Feature Combination Transform for the vibration signals [25].
+ – ÷ ÷
Table III lists the terminal set used
Feature Construction under View2, where f ( j ) represents the
× T5 + T4 T14 × T9 T11
frequency spectrum of time signal s(i),
T1 T8 × T5 T2 T7 M represents the number of spectrum
Input
lines, and A j represents the frequency
T14 T3 value of the jth spectrum line. F1 repre-
sents the energy of the vibration signal
FIGURE 4 Program structure of MFCGPE (left) and an example program (right) that can be in frequency-domain, F2 + F5 represent
evolved by MFCPGE. the position variation of the main fre-
quency band, and F6 + F13 represent the
TABLE I Function set of GP.
concentration and dispersion of the fre-
quency spectrum.
SYMBOL INPUT OUTPUT DESCRIPTION For View3 (considering both the time-
+ 2 Features 1 Feature Addition domain and frequency-domain characteris-
− 2 Features 1 Feature Subtraction tics of rolling bearing signals), the terminal
× 2 Features 1 Feature Multiplication set contains 29 features (TF1 + TF29 ) com-
' 2 Features 1 Feature Protected division
posed by the features of View1 and View2,
where the features TF1 + TF16 are equal to
FC 2 2 Features 1 Vector Concatenation
the features T1 + T16 in View1, and the fea-
FC m 2 Vectors or 1 Vector/Feature 1 Vector Concatenation
tures TF17 + TF29 are equal to the fea-
tures F1 + F13 in View2. There are two
reasons for using TDF features and FDF
TABLE II Terminal set corresponding to View1.
features to form the multi-view features.
SYMBOL FORMULA SYMBOL FORMULA One is that the extraction of TDF and
T1 N
1 / s (i ) T9 min s (i ) FDF features is simple statistical calcula-
N i=1 tions, while the extraction of the time-fre-
T2 N
1 / [s (i ) - T ] 2 T10 T8 - T9 quency domain features typically needs
N - 1 i=1 1
complicated signal processing opera-
T3 N 2 T11 T2
tions and manual participation. The
eN s (i ) o
1 / T4 other is that [48] shows that using the
i=1
combination of the TDF and FDF fea-
T4 N
1 / s (i ) T12 T8
N i=1 F2 tures for feature construction is effective
for constructing new features. Different
T5 N
1 / (s (i ))3 T13 T8 from most of the existing works, the
N i=1 F4
proposed method constructs features
T6 N
1 / (s (i ))4 T14 T8 from these three views, individually,
N i=1 T3
which aims to generate a group of
T7 N
1 / (s (i ))2 T15 T5 effective and diverse features to com-
N i=1 ( T7 ) 3 prehensively describe a sample. The
T8 max s (i ) T16 T6 effectiveness and diversity of the con-
( T7 ) 2 structed features can also improve the

high generalization performance of the scheme. That is, the new generated fea- ence value of the inter-class and intra-
ensemble, which will be described in tures and labels of the training set are class distances into the range of [0, 1].
the later subsection. divided into 5 folds, evenly. Each time, n
one fold is used as the sub-test set, and d (X k, X l) = / (x ko - x lo) 2 (2)
E. Fitness Function the other two folds are used as the sub- o=1
An effective fitness function is crucial training set. The average value of the Ni Ni
D intra (X i) = 1 / / d (X k , X l )
(i) (i)
for guiding GP to construct high-level three sub-test sets is set as the diagno- Ni Ni k = 1 l = 1
features for fault diagnosis. When the sis accuracy. (3)
number of training samples is small, the The Dist is calculated according to Ni Nj
D inter (X i, X j) = 1 / / d (X k , X l )
(i) (j)
features constructed by GP can easily Equations (2)–(5), which evaluate the Ni N j k = 1 l = 1
achieve good classification performance distance of the training samples with the (4)
on the training set, but poor generaliza- constructed features. The calculation of Dist = 1
tion performance on the test set. To Dist is based on the Euclidean distance. 1 + e - (min (Dinter) - max (Dintra))
(5)
address this issue, a new fitness function For the given samples X k = {x k1, x k2,
based on the classification accuracy and f, x kn} and X l = {x l1, x l2, f, x l n}, the The proposed fitness function opti-
the distance measure of the training Euclidean distance between them can mizes the classification accuracy and the
samples is proposed to guide the search be calculated as Equation (2), where x ko distances of the training samples, simul-
of MFCGPE. To calculate the diagnosis and x lo represent the features of samples taneously. When the fitness value
accuracy, KNN is used to perform clas- X k and X l, respectively, and n is the approaches one, it indicates that the fea-
sification based on the constructed fea- number of features. The calculation of tures constructed by GP have the best
tures. The reason for using KNN is that the intra-class and inter-class distances is classification performance and the train-
it is a simple classification algorithm, based on Equation (3) and Equation ing samples have a small intra-class dis-
easy to implement, and treats each (4), where X represents the samples in tance and a large inter-class distance.
feature equally without any feature one single class, and N is the number of
weighting or selection [22], [52]. With samples in the X class. Equation (5) is F. Ensemble for Fault Diagnosis
the use of KNN, MFCGPE can auto- the calculation of Dist, where the sigmoid The MFCGPE system is able to find
matically construct discriminative fea- function is used to transform the differ- the three best GP trees/programs that
tures and avoid redundant or irrelevant
features. The distance measure is to
TABLE III Terminal set corresponding to View2.
minimize the intra-class distance and
maximize the inter-class distance of the SYMBOL FORMULA SYMBOL FORMULA
training samples based on the con- F1 M F8 M
structed features. Using such a distance / f( j) / [f ( j ) - F1]4

j=1 j=1
measure will group the samples in the M M (F6) 2

same class and enlarge the differences of F2 M F9 M
/ (A j f ( j )) / [(A j - F5) 2 f ( j )]
the samples in different classes, which j=1 j=1
M
potentially improves the discriminabili- M
/ f( j)
j=1
ty of the constructed features. This also
helps to improve the effectiveness of F3 M F10 F9
/ [A 2j f ( j )] F2
KNN as it is based on distance. Based j=1
M
on the above analysis, the new fitness / f( j)
j=1
function to be maximised is formulated
F4 M F11 M
as follows. / [A 4j f ( j )] / [(A j - F2) 3 f ( j )]
j=1 j=1
(Acc + Dist) M M (F9) 3

Fit = (1) / [A 2j f ( j )]
2 j=1
F5 M F12 M
where Acc represents the diagnosis accu- / [A 2j f ( j )] / [(A j - F2) 4 f ( j )]
j=1 j=1
racy of KNN using the constructed fea- M M M (F9) 4
= / (A j f ( j ))G= / f ( j )G
4
tures and Dist represents the distance j=1 j=1
measure. Only the training set is used in F6 M F13 M
the fitness evaluation step, and the con- / [f ( j ) - F1] 2 /[ A j - F2 f ( j )]

j=1 j=1
structed features are transformed into M-1 M F9
the range of [0, 1] through the min-max F7 M — —
normalization method. The Acc is calcu- / [f ( j ) - F1] 3
j=1
lated using the 5-fold cross-validation M ( F6 )3

randomly selected as the training set,
and the remaining samples are used
View1 GP
KNN as the test set. The three datasets are
Features Construct Features
described as follows.
1) NCEPU [25]: It is a bearing fault
Raw View2 GP Majority Final
Signal Features Construct Features KNN Voting Diagnosis
dataset collected by North China
Electric Power University (NCEPU)
and contains vibration signals of six
View3 GP fault types. Figure 6 shows the used
KNN
Features Construct Features
test rig of NCEPU. The six rolling
bearing running conditions are the
FIGURE 5 Architecture of ensemble fault diagnosis system. normal state (NOR), inner ring fault
(IRF), outer ring fault (ORF), rolling
from different views. This constructed element fault (REF), inner & outer
ensemble is applied to predict the class ring compound fault (IOCF), and
labels for the unseen samples in the test rolling element & outer ring com-
set. In the process, the majority voting pound fault (ROCF), which are sim-
algorithm is employed because it is a ulated by manufacturing defects on
simple and common method to com- normal bearings using electrical dis-
prehensively consider the results charge machining (EDM) technolo-
obtained by the multiple base classifiers gy. All vibration signals are collected
FIGURE 6 NCEPU test rig.
[53]. For a test sample, among the class with a sampling frequency of 12,000
labels/fault types obtained by three Hz. The first 102,400 data points of
construct features from different views. KNNs, the fault type with the largest vibration signals under each running
Since the high-level features are con- number of votes is chosen as the final condition are divided into 50 sam-
structed from different views, they are fault type. ples on average and there is no over-
diverse and effective. To effectively use lap between each sample. Table IV
the constructed features, an ensemble is IV. Experimental Design lists the detailed information of the
created for fault diagnosis. Consistent NCEPU dataset. Figure 7 shows the
with the fitness evaluation process, the A. Dataset Description time domain waveform of vibration
ensemble uses three KNNs as the base The mechanical equipment typically signals under six running conditions
classifiers, and each KNN uses the fea- operates normally, and the failure data in NCEPU.
tures constructed from a single view. that can be collected is limited. Collect- 2) CWRU [54]: It is a bearing fault
The overall fault diagnosis system using ing a large number of labeled samples dataset collected by Case Western
ensemble is illustrated in Figure 5. requires high cost and a long period of Reserve University (CWRU) and
The training set (including the trans- time. Therefore, the goal of this paper is contains vibration signals of ten
formed feature sets from three different to deal with fault diagnosis using the fault types and fault degrees. Fig-
views and the corresponding class labels) limited number of training samples. In ure 8 shows the used test rig of CWRU.
are fed into three KNNs, and these the experiments, three datasets with a All faults are generated by manufac-
three KNNs are used as base classifiers small number of training samples are turing defects on the inner ring,
of an ensemble. This ensemble will have employed to evaluate the effectiveness outer ring, and rolling element sur-
high generalization performance since of MFCGPE. In each dataset, only five faces using EDM, and the same fault
the inputs are the features constructed samples of each running condition are category with three defect diame-
ters (0.007, 0.014, and 0.021 inches,
TABLE IV Description of the NCEPU dataset.
respectively). The ten different roll-
ing bearing running conditions are
CLASS TRAINING TEST the normal state (NOR), three kinds
LABEL RUNNING CONDITION SAMPLES SAMPLES
of inner ring fault (IRF07, IRF14,
1 Normal 5 45
and IRF21), three kinds of outer
2 Inner ring fault 5 45 ring fault (ORF07, ORF14, and
3 Outer ring fault 5 45 ORF21), and three kinds of rolling
4 Rolling element fault 5 45 element fault (REF07, REF14, and
5 Inner and outer ring compound fault 5 45 REF21), respectively. IRF, ORF, and
6 Rolling element and outer ring 5 45 REF indicate inner ring fault, outer
compound fault ring fault, and rolling element fault,

respectively. 07, 14, and 21 indicate
that the defect diameter is 0.007, NOR IRF
50 300
0.014, and 0.021 inches, respectively.
0 0
The vibration signals of driver-end –50 –300
bearing under 1,797 r/min are col- 0 0.08 0.16 0 0.08 0.16
lected with a sampling frequency of
Amplitude (m/s2)
ORF REF
12,000 Hz. Similar to the NCEPU 400 500
dataset, the first 102,400 data points 0 0
of vibration signals under each run- –400 –500
ning condition are divided into 50 0 0.08 0.16 0 0.08 0.16
samples on average. Five samples of IOCF ROCF
400 200
each running condition are ran-
0 0
domly selected to form the training
–400 –200
set, and the remaining samples are 0 0.08 0.16 0 0.08 0.16
used as the test set. Table V lists the Time (s) Time (s)
detailed information of the CWRU
dataset. Figure 9 shows the time FIGURE 7 Time domain waveform of vibration signals under six running conditions in NCEPU.
domain waveform of vibration sig-
nals under ten running conditions domain waveform of vibration sig- sification. The TDF and FDF features
in CWRU. nals under four running conditions have been described in Section III-D.
3) X JTU [55]: It is a run-to-failure in XJTU. The numbers of features in MDF,
bearing fault dataset collected by MMSDE, and IMDE are 37, 20, and 20,
Xi’an Jiaotong University (XJTU), B. Comparison Methods respectively. The comparisons aim to
in which the reason for fault occur- To verify the effectiveness of MFCGPE, investigate whether the features con-
rence is the natural damage as the four categories of competitive methods, structed by MFCGPE are more effective
running time increases. Figure 10 i.e., 19 methods, are employed for com- for fault diagnosis than these manually
shows the used test rig of XJTU. In parisons. The first category includes five crafted features.
the experiments, the vibration sig- different classification algorithms, i.e., The third category contains six GP
nals of the test bearings are collect- KNN, SVM, Naive Bayes (NB), logistic based methods using TDF, FDF, and
ed with a sampling frequency of regression (LR), and multilayer percep- TFDF (described in Section III-D) as
25,600 Hz until the bearing fails, tron (MLP), which use raw signals
and the time interval for each sam- amplitude to train the classifiers for fault
pling is one minute. The vibration classification. The second category are
signals of the last two recorded data KNN using five manually crafted fea-
file of the Bearing 2_1, Bearing tures, i.e., TDF, FDF, multi-domain fea-
2_2, and Bearing 2_3 datasets are tures (MDF) [25], modified multi-scale
used as the analyzed data. The fault symbolic dynamic entropy (MMSDE)
types are inner ring fault (IRF), [14], and improved multi-scale disper-
outer ring fault (ORF), and cage sion entropy (IMDE) [15], for fault clas- FIGURE 8 CWRU test rig.
fault (CF). The signals of the first
two recorded data file of the Bear-
TABLE V Description of the CWRU dataset.
ing 2_1 dataset is used as the nor-
mal (NOR) vibration signals for CLASS DEFECT TRAINING TEST
LABEL RUNNING CONDITION SIZE (IN) SAMPLES SAMPLES
analysis. Every vibration signal
contains 32,768 data points and is 1 Normal 0 5 45
divided into 32 samples on average 2 Inner ring fault 0.007 5 45
for conducting the experiments. 3 Inner ring fault 0.014 5 45
Each sample contains 2,048 data 4 Inner ring fault 0.021 5 45
points. Five samples under each 5 Outer ring fault 0.007 5 45
running condition are randomly
6 Outer ring fault 0.014 5 45
selected to form the training set,
7 Outer ring fault 0.021 5 45
and the remaining samples are used
as the test set.Table VI lists the 8 Rolling element fault 0.007 5 45
detailed information of the XJTU 9 Rolling element fault 0.014 5 45

dataset. Figure 11 shows the time 10 Rolling element fault 0.021 5 45

MFCGPE can outperform other GP
NOR IRF07 based feature construction methods
0.5 2
that construct features.
0 0
The fourth category of the compari-
–0.5 –2
0 0.08 0.16 0 0.08 0.16 son methods are three ensemble diagno-
sis methods (i.e., OFE, GPSE, and
IRF14 IRF21
5 5 GPME). These methods use three
0 0 KNNs as base classifiers and the majori-
–5 –5 ty voting algorithm to build an ensem-
0 0.08 0.16 0 0.08 0.16 ble. In OFE (i.e., ensemble using the
Amplitude (m/s2)
ORF07 ORF14 original features), the TDF, FDF, and

1 2
TFDF features are fed into three KNNs,
0 0
respectively. In GPSE (i.e., ensemble
–1 –2
0 0.08 0.16 0 0.08 0.16 using the features constructed by GPS),
the features constructed by GPS-TDF,
ORF21 REF07 GPS-FDF, and GPS-TFDF are fed into
1 5
0 0 three KNNs to build an ensemble. In
–1 –5 GPME (i.e., ensemble using the features
0 0.08 0.16 0 0.08 0.16 constructed by GPM), the features con-
REF14 REF21
structed by GPM-TDF, GPM-FDF, and
0.5 10 GPM-TFDF are fed into three KNNs,
0 0 respectively. The comparisons aim to
–0.5 –10 investigate whether MFCGPE can beat
0 0.08 0.16 0 0.08 0.16
Time (s) Time (s) the ensemble diagnosis methods that use
the original features and the features
FIGURE 9 Time domain waveform of vibration signals under ten running conditions in CWRU.
constructed by GP.
GPS methods (using GP to construct a C. Parameter Settings

single feature) and the GPM methods In MFCGPE, the GP related parameters
(using GP to construct multiple fea- are common settings and refer to those
tures). The fitness function of these in [56]. The population size and the
methods is the same as MFCGPE. The maximal number of generations are set
comparisons can investigate the effect as 100 and 50, respectively. The rates of
of the number of constructed features crossover, mutation, and elitism are set as
on diagnosis accuracy. The GPS/GPM- 0.8, 0.19 and 0.01, respectively. The tree
TDF, GPS/GPM-FDF, and GPS/ depth is between 2-6. The tree genera-
GPM-TFDF methods use the TDF, tion method is ramped half-and-half.
FDF, and TFDF features for feature The tournament selection with size 5 is
construction, respectively. According to used to select parents for crossover and
FIGURE 10 XJTU test rig.
[46], the number of features construct- mutation operations. The parameters of
ed by the GPM based methods is the the GP based comparison methods are
same as the number of the classes in the same as MFCGPE.
the input features to construct high- the dataset. For example, GPM-TDF, The GP based methods are imple-
level features. The constructed features GPM-FDF, and GPM-TFDF construct mented using DEAP [57] and the clas-
are used as the inputs of the KNN clas- 10 new features for classifying the sification algorithms are implemented
sifier for fault classification. Two types CWRU dataset with 10 classes. The using scikit-learn [58]. The number of
of GP based methods are used, i.e., the comparisons aim to investigate whether the neighbors in KNN is set as 3
according to [22]. The parameters in
TABLE VI Description of the XJTU dataset. the classification algorithms are the
default values of the scikit-learn pack-
CLASS LABEL RUNNING CONDITION TRAINING SAMPLES TEST SAMPLES
age. The parameter values of the meth-
1 Normal 5 27
ods using manually extracted features
2 Inner ring fault 5 27 are set according to [14], [15], [25].
3 Outer ring fault 5 27 Each method has been executed 30
4 Cage fault 5 27 independent runs on each dataset with

different random seeds. The evolution- the significance test results on each data- maximal and average accuracy of
ary process of GP and the classifier set is listed in the last row of Table VII. 52.78%, which is better than other clas-
training step only use the training set. Rows 1-5 of Table VII list the classi- sifiers. Compared with these five meth-
The fault diagnosis results of the test fication results of the five traditional ods, MFCGPE achieves much higher
sets are reported. classifiers using raw signals amplitude accuracy, i.e., over 99%, on the three
(RSA). It can be seen that the diagnosis datasets. The results show that construct-
V. Results and Discussions accuracy of these methods is very low. ing high-level features is very important
This section discusses and analyses the On the NCEPU and CWRU datasets, for fault diagnosis of rolling bearings.
fault diagnosis results obtained by MFC- MLP achieves better results than KNN, Rows 6 to 10 of Table VII list the
GPE and the 19 comparison methods LR, SVM, and NB. Specifically, MLP classification results of the KNN classifi-
on the three different datasets with a small achieves an average accuracy of 22.01% er using five different types of manually
number of training samples. Table VII on NCEPU and of 21.92% on CWRU. crafted features (i.e., TDF, FDF, MDF,
lists the classification accuracy of the 20 On the XJTU dataset, NB achieves a MMSDE, and IMDE). It can be seen
methods, including the maximum
(Max) value, average (Avg) value and
standard deviation (Std) of the accuracy NOR IRF
5 50
of the 30 runs, where the best results of
0 0
each dataset are highlighted in bold.
Amplitude (g)
–5 –50
Wilcoxon rank-sum test with a 5% sig- 0 0.04 0.08 0 0.04 0.08
nificance level is employed to evaluate ORF CF
30 50
the significant difference in performance
0 0
improvement of MFCGPE compared to
–30 –50
a method. In Table VII, the “+” symbol 0 0.04 0.08 0 0.04 0.08
indicates that the performance of MFC- Time (s) Time (s)
GPE is significantly better than the
comparison method. The summary of FIGURE 11 Time domain waveform of vibration signals under four running conditions in XJTU.
TABLE VII Diagnosis accuracy (%) of MFCGPE and the comparison methods on the NCEPU, CWRU, and XJTU datasets.
ROW METHOD NCEPU CWRU XJTU
MAX AVG±STD MAX AVG±STD MAX AVG±STD

1 RSA+KNN 16.67 16.67±0.00+ 10.00 10.00±0.00+ 25.00 25.00±0.00+
2 RSA+LR 17.04 17.04±0.00+ 21.33 21.33±0.00+ 28.70 28.70±0.00+
3 RSA+SVM 16.67 16.67±0.00+ 16.44 16.44±0.00+ 28.70 28.70±0.00+
4 RSA+NB 17.41 17.41±0.00+ 23.33 23.33±0.00+ 52.78 52.78±0.00+
5 RSA+MLP 31.85 22.01±4.24+ 32.22 21.92±1.10+ 37.96 31.45±3.26+
6 TDF 79.26 79.26±0.00+ 76.67 76.67±0.00+ 87.96 87.96±0.00+
7 FDF 82.96 82.96±0.00+ 83.56 83.56±0.00+ 97.22 97.22±0.00+
8 MDF 88.15 88.15±0.00+ 90.22 90.22±0.00+ 75.93 75.93±0.00+
9 MMSDE 87.78 87.78±0.00+ 92.67 92.67±0.00+ 83.33 83.33±0.00+
10 IMDE 91.11 91.11±0.00+ 85.78 85.78±0.00+ 82.41 82.41±0.00+
11 GPS-TDF 90.74 87.83±2.76+ 88.89 84.62±2.56+ 91.67 88.22±3.71+
12 GPS-FDF 92.59 89.42±3.78+ 91.78 87.52±3.08+ 99.07 96.45±2.36+
13 GPS-TFDF 94.81 90.73±3.53+ 93.78 90.17±2.91+ 100.0 96.67±3.37+
14 GPM-TDF 94.44 92.37±1.94+ 96.22 93.37±2.42+ 95.37 93.09±2.98+
15 GPM-FDF 95.92 93.30±2.17+ 97.11 95.34±1.58+ 100.0 97.87±2.36+
16 GPM-TFDF 97.03 94.56±2.35+ 97.56 96.16±1.76+ 100.0 98.21±1.97+
17 OFE 89.25 89.25±0.00+ 90.44 90.44±0.00+ 95.37 95.37±0.00+
18 GPSE 95.55 93.84±1.56+ 94.46 92.81±1.45+ 100.0 98.37±1.81+
19 GPME 98.51 97.47±1.12+ 98.45 97.65±0.85+ 100.0 99.02±1.09+
20 MFCGPE 100.0 99.56±0.30 100.0 99.23±0.67 100.0 99.61±0.26
21 OVERALL 19+ 19+ 19+

that the accuracy of these five methods TFDF achieves an average diagnosis achieve a comprehensive description of
is higher than the KNN, LR, SVM, NB, accuracy of 94.56%, 96.16%, and the raw signal but also avoid redundancy
and MLP methods using raw signals 98.21% on NCEPU, CWRU and and information interference. The fitness
amplitude. Among these five methods XJTU, respectively. Compared with the function based on accuracy and distance
using the manually crafted features, the GPS-based methods, the GPM-based enables MFCGPE to construct new
IMDE method achieves the best results methods achieve higher average accura- effective features that achieve a small
(the maximal and average diagnosis cy and smaller standard deviation values. intra-class distance and a large inter-class
accuracy are 91.11%) on NCEPU, the These results indicate that using multi- distance of the training samples, which
MMSDE method achieves the best ple constructed high-level features is can improve the generalization perfor-
results (the maximal and average diag- more effective than using a single con- mance on the unseen test set. MFCGPE
nosis accuracy are 92.67%) on CWRU, structed high-level feature to improve constructs multiple high-level features
and the FDF method achieves the best diagnosis performance. Compared with from different views, individually, and
results (the maximal and average diag- these three methods, MFCGPE achieves constructs an effective ensemble based
nosis accuracy are 97.22%) on XJTU. better performance on the three datasets. on these constructed features through
The results show that it is necessary to MFCGPE builds an effective ensemble the majority voting, which further
extract a set of effective features accord- using the features constructed from improves the effectiveness of the diag-
ing to the datasets to perform fault multiple views, which allows it to obtain nostic approach and avoids the overfit-
diagnosis because the performance of higher generalization performance than ting issue caused by using a small
the manually extracted features varies these methods using a single classifier. number of training samples. Thus, MFC-
with datasets. Compared with these Rows 17 to 19 of Table VII are the GPE is a practical and promising
methods, MFCGPE achieves much bet- classification results of the ensemble approach to engineering applications.
ter performance by automatically con- diagnosis methods, i.e., OFE, GSE, and
structed multi-view high-level features GPME. It can be found that on the VI. Further Analysis
and building an ensemble for fault diag- NCEPU, CWRU, and XJTU datasets, This section further analyses the evolved
nosis. The adaptability of MFCGPE is the diagnostic performance of OFE, GP tree/models, the features construct-
much higher than these five methods as GPSE, and GPME is ranked the third, ed by GP from different views, and the
it achieves the best results on all the the second, and the first, respectively. constructed ensemble to provide insights
three datasets. Compared with these three methods, into why the new approach is effective.
Rows 11 to 13 of Table VII are the MFCGPE achieves higher average accu-
classification results of the GP based racy and smaller standard deviation values. A. Example Trees/Models
comparison methods for constructing a The reason is that MFCGPE has a new Figure 12 shows the best trees/models
single feature, i.e., GPS-TDF, GPS-FDF, program structure, function set, terminal evolved by MFCGPE from the three
and GPS-TFDF. Among these methods, set, and fitness function, which allow it to different views on the NCEPU dataset,
GPS-TFDF achieves the best results on construct more discriminative features where the white oval nodes are the fea-
all three datasets by obtaining an aver- and build a more effective ensemble to ture combination operators, the white
age accuracy of 90.73% on NCEPU, obtain better diagnosis accuracy. circle nodes are the feature construction
90.17% on CWRU and 96.67% on Row 20 and Row 21 of Table VII operators, and the orange, blue, and pink
XJTU. Compared with these three lists the classification results and the sig- rectangle nodes are the features of View1,
methods, MFCGPE achieves better and nificance test results of the MFCGPE View2, and View3, respectively.
stable classification performance. The approach, respectively. On the NCEPU, As it can be seen from Figure 12, for
results show that only constructing one CWRU, and XJTU dataset, MFCGPE View1, six distinguished features among
high-level feature may not be effective achieves the maximal accuracy of 100% the input 16 features are selected to
for multi-class fault diagnosis. Com- and the average accuracy of above 99%. construct four high-level features. For
pared with these methods, MFCGPE is The results of the significance tests View2, six distinguished features among
able to construct a flexible number of show that the diagnosis performance of the 13 input features are selected to
effective features from every single view MFCGPE is significantly better than construct four high-level features. For
for fault diagnosis. the 19 comparison methods on the View3, five distinguished features among
Rows 14 to 16 of Table VII are the three datasets. the 29 input features are selected to
classification results of the GP based In summary, the results show that construct three high-level features. This
comparison methods that construct MFCGPE is able to achieve excellent indicates that MFCGPE can automati-
multiple features for fault diagnosis, i.e., fault diagnosis performance on three cally select the useful features and deter-
GPM-TDF, GPM-FDF, and GPM- datasets with a small number of training mine the number of constructed
TFDF. Among these three methods, samples. MFCGPE can adaptively con- features, which avoid information inter-
GPM-TFDF achieves better results on struct a flexible number of features for ference and feature redundancy. The
all three datasets. Specifically, GPM- fault diagnosis, which can not only depths of the trees evolved from View1,

FCm
FCm ×
FC2 F11 F5
FCm
× – ×
FC2 FC2
F5 – F5 F5 – –
÷ ÷ ÷ ÷ FCm
F8 F1 F7 – F12 F5
T2 – T10 T5 ÷ T10 T2 T5 – FC2
F8 –
T10 ÷ T7 × + – TF7 TF28 + ÷
T7 T13 T5 T9 F1 F12 F8 F1 TF3 TF17 TF2 TF28
(a) (b) (c)
FIGURE 12 Example trees evolved by MFCGPE on the NCEPU dataset. (a) Best tree obtained under View1 (b) Best tree obtained under View2 (c)
Best tree obtained under View3.
View2, and View3 are 5, 7, and 3, respec-

tively, which indicates that MFCGPE View1 View2 View3
can adaptively evolve the optimal vari-
able-length models with the use of the
time-domain, frequency-domain,
time&frequency-domain characteristics
of vibration signals. The above analysis
shows that MFCGPE has the ability to
learn effective and diverse features from
multi-view input features to capture dif- (a)
ferent patterns of the vibration signal,
which is beneficial in describing data View1 View2 View3
characteristics comprehensively.
B. Visualization of the GP-Constructed

Features
To better demonstrate that the evolved
models/trees shown in Figure 12 are
effective for fault diagnosis, the original
features and the newly constructed fea- (b)
tures are visualized for comparisons. The
features of the samples in the test set are NOR IRF ORF REF IOCF ROCF
visualized by using t-distributed stochas-
tic neighbor embedding (t-SNE) [59] FIGURE 13 Feature visualization on the NCEPU dataset using t-SNE. (a) Original features
method. Figure 13 shows the visualiza- (b) GP constructed features.
tion results, where six colored points

represent six running conditions of the As shown in Figure 13(a), using the other classes are also overlapping. These
rolling bearing, i.e., normal (NOR), original features of View1 for visualiza- visualization results indicate that the
inner ring fault (IRF), outer ring fault tion, the points in different classes are all original features of vibration signals in
(ORF), rolling element fault (REF), overlapping; when using the original different running conditions cannot
inner & outer ring compound fault features of View2 and View3 for visual- effectively separate the fault types into
(IOCF), and rolling element & outer ization, the points in the NOR class are different classes to achieve a high diag-
ring compound fault (ROCF ). clustered separately, but the points in nostic accuracy. As shown in Figure 13(b),

using the features constructed from C. Diagnosis Results Using the samples to be misclassified, where two
View1 for visualization, few points in the GP-Constructed Features and the samples in the IRF class are identified as
NOR, IRF, and REF classes are overlap- Constructed Ensemble the NOR class and five samples in the
ping, and the other colored points are To analyze the effectiveness of the con- REF class are identified as the IRF class.
gathered together and do not overlap. structed features of Figure 13(b) and Figure 14(b) shows the diagnosis results
Using the features constructed from the built ensemble, Figure 14 shows of using the features constructed from
View2 for visualization, only a few points their diagnosis results and the results of View2. It shows that two types of com-
in the IOCF and ROCF classes are over- the built ensemble, where the blue box pound faults are misclassified, i.e., two
lapping. Using the features constructed represents the true label of the sample, samples in the IOCF class are classified
from View3 for visualization, only a few the red asterisks represent the predicted into the ROCF class and three samples
points in the ORF and ROCF classes are label of the sample, and the class labels 1, in the ROCF class are identified as the
overlapping. Compared with the original 2, 3, 4, 5, and 6 represent the NOR, IRF, IOCF class. Figure 14(c) shows the diag-
features, these newly constructed features ORF, REF, IOCF, and ROCF running nosis results of using the features con-
have a good similarity among the same conditions, respectively. The overlap of structed from View3, where three samples
class and have a big difference between boxes and asterisks indicates that the in the ORF class are identified as the
different classes. These visualization diagnosis is correct. No-overlapping of ROCF class. The diagnosis accuracy of
results indicate that the constructed boxes and asterisks indicates that the using the TDF, FDF, and TFDF based
features can effectively represent the diagnosis is wrong. constructed features are 97.40%, 98.14%
vibration signals in different running It can be seen from Figure 14(a) that and 98.88%, respectively. The ensemble
conditions, which makes fault diagnosis using the features constructed from diagnosis accuracy is obtained by inte-
simpler and more accurate. View1 for fault diagnosis will cause seven grating the three diagnosis results via the
majority voting method and shown in
Figure 14(d), in which no samples are
View1 View2 misclassified. Obviously, the ensemble
6 6 strategy improves the diagnosis accuracy.
5 5 By analyzing the best models evolved
by MFCGPE, the feature construction
Class Label
Class Label
4 4 process is clearly displayed. In addition,

3 3 the visualization results and the diagnosis
results of the newly constructed features
2 2 are demonstrated for interpretability.
1 1
Owing to the program structure and the
0 45 90 135 180 225 270 0 45 90 135 180 225 270 new fitness function, MFCGPE con-
Sample Number Sample Number structs a flexible number of diverse and
(a) (b) effective high-level features from differ-
ent views. The constructed features can
View3 Ensemble
6 6 make the classification of different fault
types easier and more accurate. Owing
5 5 to the ensemble diagnosis strategy, MFC-
Class Label
Class Label
4 4
GPE constructs multi-view features and
gains a higher diagnosis accuracy by
3 3 using an ensemble built from the features
constructed from different views.
2 2
1 1 VII. Conclusions
0 45 90 135 180 225 270 45 90 135 180 225 270 The goal of this paper was to develop a
Sample Number Sample Number new GP-based approach to achieving
(c) (d)
effective fault types diagnosis of rolling
Ture bearings using a small number of train-
Predict ing samples. This goal has been suc-
cessfully achieved by developing the
FIGURE 14 Diagnosis results on the NCEPU dataset using the constructed features of differ- MFCGPE approach. A new GP pro-
ent views and the constructed ensemble. (a) Diagnosis results of using the features construct- gram structure, a function set, and a
ed on View1 (b) Diagnosis results of using the features constructed on View2 (c) Diagnosis
results of using the features constructed on View3 (d) Diagnosis results of using the construct- terminal set were developed to enable
ed ensemble. MFECPE to construct a flexible number

of high-level features from three different References [19] Y. Gu, X. Zhou, D. Yu, and Y. Shen, “Fault diagno-
[1] R. Liu, B. Yang, E. Zio, and X. Chen, “Artificial sis method of rolling bearing using principal component
views of features. A new fitness function intelligence for fault diagnosis of rotating machinery: A analysis and support vector machine,” J. Mech. Sci. Tech-
based on the measures of accuracy and review,” Mech. Syst. Signal Process., vol. 108, pp. 33–47, nol., vol. 32, no. 11, pp. 5079–5088, 2018. doi: 10.1007/
2018. doi: 10.1016/j.ymssp.2018.02.016. s12206-018-1004-0.
distance was developed to enable these [2] H. Li, T. Liu, X. Wu, and Q. Chen, “Enhanced fre- [20] J. Yu, “Machinery fault diagnosis using joint global
newly constructed features to be accu- quency band entropy method for fault feature extrac- and local/nonlocal discriminant analysis with selective
tion of rolling element bearings,” IEEE Trans. Ind. In- ensemble learning,” J. Sound Vib., vol. 382, pp. 340–356,
rate and discriminative. To improve the form., vol. 16, no. 9, pp. 5780–5791, 2020. doi: 10.1109/ 2016. doi: 10.1016/j.jsv.2016.06.046.
diagnosis performance, an ensemble clas- TII.2019.2957936. [21] X. Zhao and M. Jia, “Fault diagnosis of rolling bear-
[3] Y. Zhang and R. Randall, “Rolling element bearing based on feature reduction with global-local margin
sifier was created by using constructed ing fault diagnosis based on the combination of genetic fisher analysis,” Neurocomputing, vol. 315, pp. 447–464,
features from multiple views and using algorithms and fast kurtogram,” Mech. Syst. Signal Pro- 2018. doi: 10.1016/j.neucom.2018.07.038.
cess., vol. 23, no. 5, pp. 1509–1517, 2009. doi: 10.1016/j. [22] J. Tian, C. Morillo, M. H. Azarian, and M. Pecht,
KNN. With these designs, MFCGPE can ymssp.2009.02.003. “Motor bearing fault detection using spectral kurto-
not only automatically select and con- [4] R. Yan, R. X. Gao, and X. Chen, “Wavelets for sis-based feature extraction coupled with k-nearest
fault diagnosis of rotary machines: A review with ap- neighbor distance analysis,” IEEE Trans. Ind. Electron., vol.
struct informative and discriminative fea- plications,” Signal Process., vol. 96, pp. 1–15, 2014. doi: 63, no. 3, pp. 1793–1803, 2016. doi: 10.1109/TIE.2015.
tures from different views but also build 10.1016/j.sigpro.2013.04.015. 2509913.
[5] Y. Lei, J. Lin, Z. He, and M. Zuo, “A review on em- [23] M. Unal, M. Onat, M. Demetgul, and H. Kucuk,
an effective ensemble to achieve a high pirical mode decomposition in fault diagnosis of rotating “Fault diagnosis of rolling bearings using a genetic al-
generalization performance. machinery,” Mech. Syst. Signal Process., vol. 35, nos. 1–2, gorithm optimized neural network,” Measurement,
pp. 108–126, 2013. doi: 10.1016/j.ymssp.2012.09.015. vol. 58, pp. 187–196, 2014. doi: 10.1016/j.measure
The effectiveness of MFCGPE was [6] J. Antoni, “The spectral kurtosis: a useful tool for ment.2014.08.041.
evaluated on three rolling bearing fault characterising non-stationary signals,” Mech. Syst. Signal [24] M. Luo, C. Li, X. Zhang, R. Li, and X. An, “Com-
Process., vol. 20, no. 2, pp. 282–307, 2006. doi: 10.1016/j. pound feature selection and parameter optimization of
datasets and compared with the 19 com- ymssp.2004.09.001. ELM for fault diagnosis of rolling element bearings,”
petitive methods. The results showed that [7] G. L. McDonald, Q. Zhao, and M. J. Zuo, “Maxi- ISA Trans., vol. 65, pp. 556–566, 2016. doi: 10.1016/j.
mum correlated kurtosis deconvolution and application isatra.2016.08.022.
MFCGPE achieved the best diagnosis on gear tooth chip fault detection,” Mech. Syst. Signal [25] X. Yan and M. Jia, “A novel optimized svm clas-
accuracy on the three datasets among all Process., vol. 33, pp. 237–255, 2012. doi: 10.1016/j.yms- sification algorithm with multi-domain feature and its
sp.2012.06.010. application to fault diagnosis of rolling bearing,” Neu-
the methods. The highlight of MFCGPE [8] K. Dragomiretskiy and D. Zosso, “Variational mode rocomputing, vol. 313, pp. 47–64, 2018. doi: 10.1016/j.
was that multiple discriminative features decomposition,” IEEE Trans. Signal Process., vol. 62, no. neucom.2018.05.002.
3, pp. 531–544, 2013. doi: 10.1109/TSP.2013.2288675. [26] M. He and D. He, “Deep learning-based approach
were automatically constructed using [9] R. B. Randall, J. Antoni, and S. Chobsaard, “The rela- for bearing fault diagnosis,” IEEE Trans. Ind. Appl., vol.
MFCGPE, and the ensemble diagnosis tionship between spectral correlation and envelope analysis 53, no. 3, pp. 3057–3065, 2017. doi: 10.1109/TIA.2017.
in the diagnostics of bearing faults and other cyclostationary 2661250.
can address the issue of poor generaliza- machine signals,” Mech. Syst. Signal Process., vol. 15, no. 5, [27] S. Zhang, S. Zhang, B. Wang, and T. G. Habetler,
tion of the diagnosis model caused by pp. 945–962, 2001. doi: 10.1006/mssp.2001.1415. “Deep learning algorithms for bearing fault diagnos-
[10] S. Maurya, V. Singh, N. K. Verma, and C. K. tics—A comprehensive review,” IEEE Access, vol. 8,
using a small number of training samples. Mechefske, “Condition-based monitoring in variable pp. 29,857–29,881, 2020. doi: 10.1109/ACCESS.2020.
This paper shows that the proposed machine running conditions using low-level knowledge 2972859.
transfer with DNN,” IEEE Trans. Autom. Sci. Eng., 2020. [28] W. Deng, H. Liu, J. Xu, H. Zhao, and Y. Song, “An
MFCGPE approach is effective for fault [11] J. Yang, Y. Zhang, and Y. Zhu, “Intelligent fault improved quantum-inspired differential evolution al-
diagnosis of rolling bearings. In addition diagnosis of rolling element bearing based on SVMS gorithm for deep belief network,” IEEE Trans. Instrum.
and fractal dimension,” Mech. Syst. Signal Process., vol. Meas., vol. 69, no. 10, pp. 7319–7327, 2020.
to fault diagnosis, remaining life predic- 21, no. 5, pp. 2012–2024, 2007. doi: 10.1016/j.yms [29] S. Maurya, V. Singh, and N. K. Verma, “Condi-
tion can also help to analyze rolling sp.2006.10.005. tion monitoring of machines using fused features from
[12] W. Caesarendra, B. Kosasih, A. K. Tieu, and C. A. EMD-based local energy with DNN,” IEEE Sens. J., vol.
bearing degradation. In the future, we Moodie, “Application of the largest Lyapunov exponent 20, no. 15, pp. 8316–8327, 2019. doi: 10.1109/JSEN.
will investigate how GP is used to con- algorithm for feature extraction in low speed slew bearing 2019.2927754.
condition monitoring,” Mech. Syst. Signal Process., vol. 50, [30] C. Cheng, B. Zhou, G. Ma, D. Wu, and Y. Yuan,
struct the health index to predict the pp. 116–138, 2015. doi: 10.1016/j.ymssp.2014.05.021. “Wasserstein distance based deep adversarial transfer
remaining life of the rolling bearing. [13] J. Zheng, H. Pan, and J. Cheng, “Rolling bearing learning for intelligent fault diagnosis with unlabeled or
fault detection and diagnosis based on composite mul- insufficient labeled data,” Neurocomputing, vol. 409, pp.
tiscale fuzzy entropy and ensemble support vector ma- 35–45, 2020. doi: 10.1016/j.neucom.2020.05.040.
Acknowledgments chines,” Mech. Syst. Signal Process., vol. 85, pp. 746–759, [31] H. Al-Sahaf et al., “A survey on evolutionary ma-
2017. doi: 10.1016/j.ymssp.2016.09.010. chine learning,” J. Roy. Soc. N. Z., vol. 49, no. 2, pp.
This work was supported in part by the [14] Y. Li, Y. Yang, G. Li, M. Xu, and W. Huang, “A fault 205–228, 2019. doi: 10.1080/03036758.2019.1609052.
National Natural Science Foundation of diagnosis scheme for planetary gearboxes using modi- [32] W. Deng, J. Xu, X.-Z. Gao, and H. Zhao, “An en-
fied multi-scale symbolic dynamic entropy and MRMR hanced MSIQDE algorithm with novel multiple strate-
China under grant 51777075, the Natu- feature selection,” Mech. Syst. Signal Process., vol. 91, pp. gies for global optimization problems,” IEEE Trans. Syst.
ral Science Foundation of Hebei Prov- 295–312, 2017. doi: 10.1016/j.ymssp.2016.12.040. Man. C., 2020.
[15] X. Yan and M. Jia, “Intelligent fault diagnosis of ro- [33] W. Deng, J. Xu, Y. Song, and H. Zhao, “Differen-
ince under grant E2019502064, the tating machinery using improved multiscale dispersion tial evolution algorithm with wavelet basis function
Marsden Fund of New Zealand Govern- entropy and MRMR feature selection,” Knowl. Based and optimal mutation strategy for complex optimization
Syst., vol. 163, pp. 450–471, 2019. doi: 10.1016/j.kno problem,” Appl. Soft Comput., 2020.
ment under Contracts VUW1509 and sys.2018.09.004. [34] Y. Song et al., “MPPCEDE: Multi-population paral-
VUW1615, the Science for Technologi- [16] V. Singh and N. K. Verma, “Intelligent condition- lel co-evolutionary differential evolution for parameter
based monitoring techniques for bearing fault diagnosis,” optimization,” Energ. Convers. Manage., vol. 228, 2021.
cal Innovation Challenge (SfTI) fund IEEE Sens. J., 2020. [35] K. Krawiec, “Genetic programming-based construc-
under grant E3603/2903, the University [17] X. Zhang, Q. Zhang, M. Chen, Y. Sun, X. Qin, and tion of features for machine learning and knowledge dis-
H. Li, “A two-stage feature selection and intelligent fault covery tasks,” GPEM, vol. 3, no. 4, pp. 329–343, 2002.
Research Fund at Victoria University of diagnosis method for rotating machinery using hybrid [36] L. Guo, D. Rivero, J. Dorado, C. R. Munteanu, and
Wellington grant number 223805/3986, filter and wrapper method,” Neurocomputing, vol. 275, pp. A. Pazos, “Automatic feature extraction using genetic
2426–2439, 2018. doi: 10.1016/j.neucom.2017.11.016. programming: An application to epileptic EEG clas-
MBIE Data Science SSIF Fund under [18] Z. Huo, Y. Zhang, L. Shu, and M. Gallimore, “A sification,” Expert Syst. Appl., vol. 38, no. 8, pp. 10,425–
the contract RTVU1914, and National new bearing fault diagnosis method based on fine-to- 10,436, 2011.
coarse multiscale permutation entropy, Laplacian score [37] Y. Bi, B. Xue, and M. Zhang, “Genetic program-
Natural Science Foundation of China and SVM,” IEEE Access, vol. 7, pp. 17,050–17,066, 2019. ming with image-related operators and a flexible program
(NSFC) under Grant 61876169. doi: 10.1109/ACCESS.2019.2893497. structure for feature learning in image classification,”

IEEE Trans. Evol. Comput., vol. 25, no. 1, pp. 87–101, gramming based on the fisher criterion,” Expert Syst., [51] D. J. Montana, “Strongly typed genetic program-
2021. doi: 10.1109/TEVC.2020.3002229. vol. 25, no. 5, pp. 444–459, 2008. doi: 10.1111/j.1468 ming,” Evol. Comput., vol. 3, no. 2, pp. 199–230, 1995.
[38] M. W. Aslam, Z. Zhu, and A. K. Nandi, “Automatic -0394.2008.00451.x. doi: 10.1162/evco.1995.3.2.199.
modulation classification using combination of genetic [45] K. Neshatian, M. Zhang, and P. Andreae, “A filter [52] Y. Lei, B. Yang, X. Jiang, F. Jia, N. Li, and A. K.
programming and KNN,” IEEE Trans. Wirel. Com- approach to multiple feature construction for symbolic Nandi, “Applications of machine learning to machine
mun., vol. 11, no. 8, pp. 2742–2750, 2012. doi: 10.1109/ learning classifiers using genetic programming,” IEEE fault diagnosis: A review and roadmap,” Mech. Syst. Signal
TWC.2012.060412.110460. Trans. Evol. Comput., vol. 16, no. 5, pp. 645–661, 2012. Process., vol. 138, 2020.
[39] Z. Gao, C. Cecati, and S. X. Ding, “A survey of fault doi: 10.1109/TEVC.2011.2166158. [53] X. Dong, Z. Yu, W. Cao, Y. Shi, and Q. Ma, “A survey
diagnosis and fault-tolerant techniques—Part I: Fault di- [46] B. Tran, B. Xue, and M. Zhang, “Genetic pro- on ensemble learning,” Front. Comput. Sci., pp. 1–18, 2020.
agnosis with model-based and signal-based approaches,” gramming for multiple-feature construction on high- [54] “Case western reserve university bearing data cen-
IEEE Trans. Ind. Electron., vol. 62, no. 6, pp. 3757–3767, dimensional classification,” Pattern Recogn., vol. 93, pp. ter.” Accessed: Apr. 2018. [Online]. http://csegroups
2015. doi: 10.1109/TIE.2015.2417501. 404–417, 2019. doi: 10.1016/j.patcog.2019.05.006. .case.edu/bearingdatacenter/home/
[40] J. R. Koza, Genetic Programming: On the Programming [47] J. Ma and G. Teng, “A hybrid multiple feature con- [55] B. Wang, Y. Lei, N. Li, and N. Li, “A hybrid prog-
of Computers by Means of Natural Selection. Cambridge, struction approach for classification using genetic pro- nostics approach for estimating remaining useful life of
MA: MIT Press, 1992. gramming,” Appl. Soft Comput., vol. 80, pp. 687–699, rolling element bearings,” IEEE Trans. Reliab., vol. 69, no.
[41] F. E. Otero, M. M. Silva, A. A. Freitas, and J. C. 2019. doi: 10.1016/j.asoc.2019.04.039. 1, pp. 401–412, 2020. doi: 10.1109/TR.2018.2882682.
Nievola, “Genetic programming for attribute construc- [48] L. Zhang, L. B. Jack, and A. K. Nandi, “Fault de- [56] Y. Bi, B. Xue, and M. Zhang, “An effective feature
tion in data mining,” in Proc. EuroGP. Berlin: Springer- tection using genetic programming,” Mech. Syst. Signal learning approach using genetic programming with im-
Verlag, 2003, pp. 384–393. Process., vol. 19, no. 2, pp. 271–289, 2005. doi: 10.1016/j. age descriptors for image classification [research fron-
[42] M. Muharram and G. D. Smith, “Evolutionary ymssp.2004.03.002. tier],” IEEE Comput. Intell. Mag., vol. 15, no. 2, pp. 65–
constructive induction,” IEEE Trans. Knowl. Data Eng., [49] H. Guo, L. B. Jack, and A. K. Nandi, “Feature gen- 77, 2020. doi: 10.1109/MCI.2020.2976186.
vol. 17, no. 11, pp. 1518–1528, 2005. doi: 10.1109/ eration using genetic programming with application to [57] F.-A. Fortin et al., “DEAP: Evolutionary algorithms
TKDE.2005.182. fault classification,” IEEE Trans. Syst., Man, Cybern. C, made easy,” J. Mach. Learn. Res., vol. 13, pp. 2171–2175, 2012.
[43] H. Guo and A. K. Nandi, “Breast cancer diagnosis Cybern., vol. 35, no. 1, pp. 89–99, 2005. doi: 10.1109/ [58] F. Pedregosa et al., “Scikit-learn: Machine learning in
using genetic programming generated feature,” Pattern TSMCB.2004.841426. Python,” J. Mach. Learn. Res., vol. 12, pp. 2825–2830, 2011.
Recogn., vol. 39, no. 5, pp. 980–987, 2006. doi: 10.1016/j. [50] J. Xuan, H. Jiang, T. Shi, and G. Liao, “Gear fault [59] L. v. d. Maaten and G. Hinton, “Visualizing data using T-
patcog.2005.10.001. classification using genetic programming and support vec- SNE,” J. Mach. Learn. Res., vol. 9, pp. 2579–2605, Nov. 2008.
[44] H. Guo, Q. Zhang, and A. K. Nandi, “Feature ex- tor machines,” Int. J. Inf. Technol., vol. 11, no. 9, pp. 18–27,
traction and dimensionality reduction by genetic pro- 2005.

Publication Spotlight (continued from page 7)
by O. A. Ibrahim, J. M. Keller, and J. C. data stream evolves, flag outliers, and where objects are temporarily stored, the
Bezdek, IEEE Transactions on Emerging most importantly, spawn new emerging system can extract properties of the stored
Topics in Computational Intelligence, Vol. structures. We compare the two new objects related to the task. We propose a
5, No. 2, April 2021, pp. 262–273. indices to the incremental Xie-Beni and deep reinforcement learning (RL) neural
Davies-Boudin indices, which to our network to learn affordances, i.e., a
Digital Object Identifier: 10.1109/ knowledge offer the only comparable sequence of actions to manipulate these
TETCI.2019.2909521 approach, with numerical examples on a objects. The RL network and object
“Dunn’s internal cluster validity variety of synthetic and real datasets.” detector are trained alternatively. After
index is used to assess partition quality both the network and detector are
and identify a “best” crisp c-partition of IEEE Transactions on Artificial trained, the objects and their affordances
n objects built from static data sets. This Intelligence are transferred to an external memory.
index is quite sensitive to inliers and They are then utilized when the same
outliers in the input data, so a subse- Procedural Memory Augmented Deep objects are detected in input frames. Here,
quent study developed a family of 17 Reinforcement Learning, by Y. Ma, J. we use a combination of a dictionary and
generalized Dunn’s indices that extend Brooks, H. Li, and J. C. Principe, IEEE a linked list for the external memory that
and improve the original measure in Transactions on Artificial Intelligence, Vol. can be accessed by either content or tem-
various ways. This paper presents online 1, No. 2, Oct 2020, pp. 105–120. poral order. This dual access is motivated
versions of two modified generalized by the temporal property of human pro-
Dunn’s indices that can be used for the Digital Object Identifier: 10.1109/ cedural memory. The proposed memory-
dynamic evaluation of an evolving (clus- TAI.2021.3054722 augmented RL framework br ings
ter) structure in streaming data. We “Inspired by the human brain, we pro- advantages of transferability, explainability
argue that this method is a good way to pose an external memory-augmented and computational efficiency with respect
monitor the ongoing performance of decision-making architecture for video to conventional deep learning architec-
streaming clustering algorithms, and we processing. A self-organizing object detec- tures. We validate the framework on the
illustrate several types of inferences that tor is employed as a frontend to decon- video game Super Mario Brothers to
can be drawn from such indices. Stream- struct the environment. This is done by show superiority to some classical deep
ing clustering algorithms are incremen- extracting events from the flow of time RL architectures and exemplify these
tal, process incoming data points only and detecting objects within the frames. three advantages.”
once and then discard them, adapt as the By employing an extra working memory

We have 30
million reasons to
be proud.
Thanks to our donors, supporters and volunteers who answered the call of the
Realize the Full Potential of IEEE Campaign,
helping impact lives around the world through the power of
technology and education.
Illuminate Educate Engage Energize
Realize Your Impact

Learn how: ieeefoundation.org/campaign
Conference Marley Vellasco
Calendar Pontifícia Universidade
Católica do Rio de Janeiro,
BRAZIL
T 6th South-East Europe Design T 5th Asian Conference on

* Denotes a CIS-Sponsored Conference
Automation, Computer Engineering, Artificial Intelligence (ACAIT 2021)
T Denotes a CIS Technical
Computer Networks and Social Media October 29–31, 2021
Co-Sponsored Conference
Conference (SEEDA CECNSM 2021) Place: Haikou, China
September 24–26, 2021 General Chairs: Qionghai Dai, Cesare
* 2021 IEEE Conference on Games Place: Preveza, Greece Alippi, Jong-Hwan Kim
(IEEE CoG 2021) General Co-Chairs: Markos G.Tsipouras, Website: http://www.acait.cn/
August 17–20, 2021 Alexandros T.Tzallas, Michael F. Dossis
Place: Virtual (Copenhagen, Denmark) Website: https://seeda2021.uowm.gr/ T 2021 International Conference
General Co-Chairs: Miguel Sicart & on Process Mining (ICPM 2021)
Paolo Burelli * 2021 IEEE International October 31–November 4, 2021
Website: https://ieee-cog.org/2021/ Conference on Data Science Place: Eindhoven, The Netherlands
index.html and Advanced Analytics General Co-Chairs: Boudewijn
(IEEE DSAA 2021) van Dongen
T 2nd International Conference October 6–9, 2021 Website: https://icpmconference.org/
on Industrial Artificial Intelligence Place: Porto, Portugal 2021/
(IAI 2021) General Co-Chairs: João Gama &
August 20–22, 2021 Francisco Herrera * 2021 IEEE Latin American
Place: Shenyang, China Website: https://dsaa2021.dcc.fc.up.pt/ Conference on Computational
General Chair: Yaochu Jin Intelligence (IEEE LA-CCI 2021)
Website: http://conf.kzgc.com.cn/ * 2021 IEEE Conference on November 2–4, 2021
iai2021 Computational Intelligence in Place: Temuco, Chile
Bioinformatics and Computational General Co-Chairs: Millaray Curilem,
* 2021 IEEE International Biology (IEEE CIBCB 2021) Doris Saez and Nelson Aros
Conference on Development October 13–15, 2021 Website: http://la-cci.org/
and Learning (IEEE ICDL 2021) Place: Melbourne, Australia
August 23–26, 2021 General Chairs: Madhu Chetty T International Workshop on
Place: Beijing, China Website: http://federation.edu.au/cibcb2021 Semantic and Social Media
General Co-Chairs: Dingsheng Luo and Adaptation and Personalization
Angelo Cangelosi * 2021 IEEE Smart World (SMAP 2021)
Website: https://icdl-2021.org Conference (IEEE SWC 2021) November 4–5, 2021
October 18–21, 2021 Place: Virtual
T 2021 International Conference Place: Atlanta, USA General Chair: Phivos Mylonas
on Emerging Techniques in General Co-Chairs:Yi Pan, Rajshekhar Website: https://hilab.di.ionio.gr/
Computational Intelligence Sunderraman,Yanqing Zhang smap2021/
(ICETCI 2021) Website: https://grid.cs.gsu.edu/
August 25–27, 2021 ~smartworld2021/index.php * 2021 IEEE Symposium
Place: Hyderabad, India Series on Computational
General Chair: Atul Negi, Naresh T 5th International Conference Intelligence (IEEE SSCI 2021)
Mallenahalli, Akira Hirose on Intelligent Computing in December 5–8, 2021
Website: http://ietcint.com Data Sciences (ICDS 2021) Place: Orlando, FL, USA
October 20–22, 2021 General Co-Chairs: Sanaz Mostaghim
Place: Fez, Morocco and Keeley Crockett
General Chairs: Cesare Alippi, Ahmed Website: https://attend.ieee.org/
Azough,Youness Oubenaalla ssci-2021/
Digital Object Identifier 10.1109/MCI.2021.3084496 Website: http://www.researchnetwork
Date of current version: 15 July 2021 .ma/icds2021/

Share Your
Preprint Research
with the World!
TechRxiv is a free preprint server for unpublished

research in electrical engineering, computer
science, and related technology. Powered by
IEEE, TechRxiv provides researchers across a
broad range of fields the opportunity to share
early results of their work ahead of formal
peer review and publication.
BENEFITS:
• Rapidly disseminate your research findings
• Gather feedback from fellow researchers
• Find potential collaborators in the

scientific community
• Establish the precedence of a discovery
• Document research results in advance

of publication
Upload your unpublished research today!
Follow @TechRxiv_org
Learn more techrxiv.org Powered by IEEE
IEEE Computational Intelligence Society Publications
IEEE Transactions on  IEEE Transactions on Evolutionary 
IEEE Transactions on Fuzzy Systems
Neural Networks and Learning Systems Computation
Editor-in-Chief:  Haibo He Editor-in-Chief:  Jonathan Garibaldi Editor-in-Chief:  Carlos A. Coello Coello
Impact Factor: 8.793 Impact Factor: 9.518 Impact Factor: 11.169
Eigenfactor: 0.04821 Eigenfactor: 0.02514 Eigenfactor: 0.0117
Article Influence Score: 2.584 Article Influence Score: 2.203 Article Influence Score: 3.059
Ranking (JCR 2019): Ranking (JCR 2019): Ranking (JCR 2019):
#3 Computer Science—Theory & Methods #7 Computer Science—Artificial Intelligence #2 Computer Science—Theory & Methods
#3 Computer Science—Artificial Intelligence
#3 Computer Science—Hardware & Architecture #10 Computer Science—Electrical and Electronic Engineering
IEEE Transactions on Emerging Topics in 
IEEE Computational Intelligence Magazine IEEE Transactions on Artificial Intelligence
Computational Intelligence
Editor-in-Chief:  Chuan-Kang Ting Editor-in-Chief:  Yew Soon Ong Editor-in-Chief:  Hussein Abbass
Impact Factor: 9.083 Indexing: Scopus Sole transactions for all AI topics.
Eigenfactor: 0.00277 Uniquely communicate AI benefits for every
Publishes original articles on emerging
Article Influence Score: 2.468 paper using published open-access impact
aspects of computational intelligence,
statements.
Ranking (JCR 2019): including theory, applications, and surveys.
DBLB indexed.
#9 Computer Science—Artificial Intelligence
IEEE Transactions on  IEEE Press Books
IEEE Transactions on  Games
Cognitive and Developmental Systems Computational Intelligence Series
Editor-in-Chief:  Yaochu Jin Editor-in-Chief:  Julian Togelius Editor-in-Chief:  David Fogel
Impact Factor: 2.667 Impact Factor: 1.886
CIS Liaison:  Alice Smith
Eigenfactor: 0.00123 Eigenfactor: 0.00019
Article Influence Score: 0.694 Article Influence Score: 0.449 The IEEE Press Series on Computational
Focuses on advances in the study of Publishes original high-quality articles Intelligence includes books on neural, fuzzy,
development and cognition in natural covering scientific, technical, and and evolutionary computation, and related
(humans, animals) and artificial (robots, engineering aspects of games. technologies, of interest to the engineering
agents) systems. and scientific communities.
Additionally, IEEE CIS technically co-sponsors the following other titles; The IEEE Transactions on Smart Grid, the IEEE Transactions on Big Data, the IEEE 
Transactions on Nanobioscience, the IEEE Transactions on Information Forensics and Security, and the IEEE Transactions on Affective Computing.

Ieee Computationalintelligence 202108userupload - in

Uploaded by

Document Information

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Ieee Computationalintelligence 202108userupload - in

Uploaded by

Copyright:

Available Formats

Organising Committee

Call for Tutorials Call for Special Sessions

Digital Object Identifier 10.1109/MCI.2021.3084497

Promoting Sustainable Forestry

IEEE Computational Intelligence Magazine (ISSN

IEEE prohibits discrimination, harassment, and bullying.

2 IEEE COMPUTATIONAL INTELLIGENCE MAGAZINE | AUGUST 2021

 AUGUST 2021 | IEEE COMPUTATIONAL INTELLIGENCE MAGAZINE 3

Read and download Create a personalized Locate IEEE members by location,

4 IEEE COMPUTATIONAL INTELLIGENCE MAGAZINE | AUGUST 2021

CIS Publication Spotlight

IEEE Transactions on Neural IEEE Transactions on

A Comprehensive Survey on Graph An Effective Multiresolution Hierarchical

AUGUST 2021 | IEEE COMPUTATIONAL INTELLIGENCE MAGAZINE 5

6 IEEE COMPUTATIONAL INTELLIGENCE MAGAZINE | AUGUST 2021

Call for Papers for Journal Special Issues

Special Issue on “Cyborg Intelligence: Human Enhancement with Fuzzy Sets”

Special Issue on “Recent Advances in Fuzzy-based Intelligent IoT and Cyber-physical

Special Issue on “Benchmarking Sampling-Based Optimization Heuristics: Methodology

Special Issue on “Computational Intelligence to Edge AI for Ubiquitous IoT Systems”

Special Issue on “Security and Privacy of Machine Learning”

Digital Object Identifier 10.1109/MCI.2021.3084499

AUGUST 2021 | IEEE COMPUTATIONAL INTELLIGENCE MAGAZINE 7

Evolutionary Neural Architecture Search and Applications

8 IEEE COMPUTATIONAL INTELLIGENCE MAGAZINE | AUGUST 2021

10 IEEE COMPUTATIONAL INTELLIGENCE MAGAZINE | AUGUST 2021 1556-603X/21©2021IEEE

AUGUST 2021 | IEEE COMPUTATIONAL INTELLIGENCE MAGAZINE 11

Algorithm 1 Computation of a typical convolutional layer. TABLE I Popular dataflow designs

12 IEEE COMPUTATIONAL INTELLIGENCE MAGAZINE | AUGUST 2021

AUGUST 2021 | IEEE COMPUTATIONAL INTELLIGENCE MAGAZINE 13

14 IEEE COMPUTATIONAL INTELLIGENCE MAGAZINE | AUGUST 2021

AUGUST 2021 | IEEE COMPUTATIONAL INTELLIGENCE MAGAZINE 15

chastic search, it may need thousands of trials (candidate solu-

(architecture) is produced, it takes a substantial amount of time

16 IEEE COMPUTATIONAL INTELLIGENCE MAGAZINE | AUGUST 2021

X :Y (3.4) CI :CO (3.2) X :Y (23.1) CI :CO (21.2) X :Y (31.2) CI :CO (33.1)

Energy Consumption (mJ)

Energy Consumption (mJ)

AUGUST 2021 | IEEE COMPUTATIONAL INTELLIGENCE MAGAZINE 17

COO CSR-Relative COO CSR-Relative COO CSR-Relative

Model Size (MBytes)

Model Size (MBytes)

18 IEEE COMPUTATIONAL INTELLIGENCE MAGAZINE | AUGUST 2021

several extra bits to record the position of each non-zero ele- 22

AUGUST 2021 | IEEE COMPUTATIONAL INTELLIGENCE MAGAZINE 19

ACCURACY ENERGY CONSUMPTION (MJ) EFFICIENCY IMPROVEMENT (×) AGGREGATION SCORE

METHOD LOSS X :Y CI : CO FX : FY X : FX X :Y CI : CO FX : FY X : FX r/x = 5.0 r/x = 1.0 r/ x = 0.2

occupies around 50% less memory

20 IEEE COMPUTATIONAL INTELLIGENCE MAGAZINE | AUGUST 2021

AUGUST 2021 | IEEE COMPUTATIONAL INTELLIGENCE MAGAZINE 21

Abstract—Unsupervised visual representation learning is one

22 IEEE COMPUTATIONAL INTELLIGENCE MAGAZINE | AUGUST 2021 1556-603X/21©2021IEEE

AUGUST 2021 | IEEE COMPUTATIONAL INTELLIGENCE MAGAZINE 23

24 IEEE COMPUTATIONAL INTELLIGENCE MAGAZINE | AUGUST 2021

Ω ← Ω – {arg max w(ok,n)}

several groups and fine-grained subgroups or splits, where

the feature representation of each group is determined by

improve efficiency. To enable a direct trade-off between

depth and block size (indicated by the parameters of the

identity operation to the candidate set of its mixed opera-

can either choose to be shallower by skipping more blocks

more blocks with smaller size.

The initial structure a 0 is first manually designed (e.g.,

AUGUST 2021 | IEEE COMPUTATIONAL INTELLIGENCE MAGAZINE 3