
2019 IEEE International Conference on Fuzzy Systems

FUZZ-IEEE 2019
June 23-26, 2019
The 2019 IEEE International Conference on Fuzzy Systems (FUZZ-IEEE 2019), the world-leading event focusing on the theory and application of fuzzy logic, will be held in New Orleans, Louisiana, USA. Nicknamed the "Big Easy," New Orleans is known for its round-the-clock nightlife, vibrant live-music scene, and spicy Cajun cuisine. It is located on the Mississippi River, near the Gulf of Mexico, and is a popular tourist destination for all ages.

FUZZ-IEEE 2019 will be hosted at the JW Marriott, a premier conference venue nestled in the heart of the world-famous French Quarter. You will be steps away from some of New Orleans's most iconic nightlife and restaurants. Take a walk outside and visit Jackson Square, shop at the lively French Market, or dance your way through Bourbon Street.

FUZZ-IEEE 2019 will represent a unique meeting point for scientists and engineers, from academia and industry, to interact and discuss the latest enhancements and innovations in the field. The topics of the conference will cover all aspects of theory and applications of fuzzy logic and its hybridisations with other artificial and computational intelligence methods. In particular, FUZZ-IEEE 2019 topics include, but are not limited to:

Mathematical and theoretical foundations of fuzzy sets, fuzzy measures, and fuzzy integrals
Fuzzy control, robotics, sensors, fuzzy hardware and architectures
Fuzzy data analysis, fuzzy clustering, classification and pattern recognition
Type-2 fuzzy sets, computing with words, and granular computing
Fuzzy systems with big data and cloud computing, fuzzy analytics, and visualization
Fuzzy systems design and optimization
Fuzzy decision analysis, multi-criteria decision making and decision support
Fuzzy logic and its applications to industrial engineering
Fuzzy modeling, identification, and fault detection
Fuzzy information processing, information extraction, and fusion
Fuzzy web engineering, information retrieval, text mining, and social network analysis
Fuzzy image, speech and signal processing, vision, and multimedia data analysis
Fuzzy databases and information retrieval
Rough sets, imprecise probabilities, possibilistic approaches
Industrial, financial, and medical applications
Fuzzy logic applications in civil engineering and GIS
Fuzzy sets and soft computing in social sciences
Linguistic summarization, natural language processing
Computational intelligence in security
Hardware and software for fuzzy systems and logic
Fuzzy Markup Language and standard technologies for fuzzy systems
Adaptive, hierarchical, and hybrid neuro- and evolutionary-fuzzy systems

The conference will include regular oral and poster presentations, an elevator pitch competition, tutorials, panels, special sessions, and keynote presentations. Full details of the submission process for papers, tutorials, and panels will be made available on the conference website: http://www.fuzzieee.org

Organizing Committee
General Chairs: Tim Havens, USA; Jim Keller, USA
Program Chairs: Alina Zare, USA; Derek Anderson, USA
Special Sessions Chairs: Christian Wagner, UK; Thomas Runkler, Germany
Tutorials Chairs: Qiang Shen, UK; Jesus Chamorro, Spain
Keynotes Chair: Robert John, UK
Posters Chair: Marek Reformat, Canada
Finance Chair: Mihail Popescu, USA
Conflict-of-Interest Chairs: Sansanee Auephanwiriyakul, Thailand; CT Lin, Australia
Competitions Chairs: Christophe Marsala, France; Mika Sato-Ilic, Japan
Panel Sessions Chair: Humberto Bustince, Spain
Publications Chairs: Anna Wilbik, Netherlands; Tim Wilkin, Australia
Registrations Chair: Marie-Jeanne Lesot, France
Local Arrangement Chairs: Fred Petry, USA; Paul Elmore, USA
Publicity Chair: Daniel Sanchez, Spain
Web Chair: Tony Pinar, USA

Important dates
Deadline for special session, tutorial, competition, and panel session proposals: October 8, 2018
Notification of acceptance for tutorials, special sessions, and panels: November 2, 2018
Deadline for full paper submission: January 11, 2019
Notification of paper acceptance: March 4, 2019
Deadline for camera-ready paper submission: April 1, 2019
Deadline for early registration: April 5, 2019
Conference: June 23-26, 2019

http://www.fuzzieee.org

Digital Object Identifier 10.1109/MCI.2018.2840740


Volume 13 Number 3 ❏ August 2018
www.ieee-cis.org

Features
12 Identifying DNA Methylation Modules Associated with a Cancer by Probabilistic Evolutionary Learning
   by Je-Keun Rhee, Soo-Jin Kim, and Byoung-Tak Zhang

20 Augmentation of Physician Assessments with Multi-Omics Enhances Predictability of Drug Response: A Case Study of Major Depressive Disorder
   by Arjun Athreya, Ravishankar Iyer, Drew Neavin, Liewei Wang, Richard Weinshilboum, Rima Kaddurah-Daouk, John Rush, Mark Frye, and William Bobo

Columns
32 Research Frontier
   Optimal Weighted Extreme Learning Machine for Imbalanced Learning with Differential Evolution
   by JongHyok Ri, Liang Liu, Yong Liu, Huifeng Wu, Wenliang Huang, and Hun Kim
   Learning Without External Reward
   by Haibo He and Xiangnan Zhong

55 Review Article
   Recent Trends in Deep Learning Based Natural Language Processing
   by Tom Young, Devamanyu Hazarika, Soujanya Poria, and Erik Cambria

On the cover: ©istockphoto.com/kirstypargeter

Departments
2 Editor's Remarks
3 President's Message, by Nikhil R. Pal
5 Conference Reports: Conference Report on IEEE Computational Intelligence Society ExCom Workshop 2018, by Jie Lu
8 Publication Spotlight, by Haibo He, Jon Garibaldi, Kay Chen Tan, Julian Togelius, Yaochu Jin, and Yew Soon Ong
10 Guest Editorial: Computational Intelligence Techniques in Bioinformatics and Bioengineering, by Richard Allmendinger, Daniel Ashlock, and Sansanee Auephanwiriyakul
76 Conference Calendar, by Bernadette Bouchon-Meunier

IEEE Computational Intelligence Magazine (ISSN 1556-603X) is published quarterly by The Institute of Electrical and Electronics Engineers, Inc. Headquarters: 3 Park Avenue, 17th
Floor, New York, NY 10016-5997, U.S.A. +1 212 419 7900. Responsibility for the contents rests upon the authors and not upon the IEEE, the Society, or its members. The magazine is a
membership benefit of the IEEE Computational Intelligence Society, and subscriptions are included in Society fee. Replacement copies for members are available for US$20 (one copy only).
Nonmembers can purchase individual copies for US$201.00. Nonmember subscription prices are available on request. Copyright and Reprint Permissions: Abstracting is permitted with
credit to the source. Libraries are permitted to photocopy beyond the limits of the U.S. Copyright law for private use of patrons: 1) those post-1977 articles that carry a code at the bottom of
the first page, provided the per-copy fee is paid through the Copyright Clearance Center, 222 Rosewood Drive, Danvers, MA 01970, U.S.A.; and 2) pre-1978 articles without fee. For other
copying, reprint, or republication permission, write to: Copyrights and Permissions Department, IEEE Service Center, 445 Hoes Lane, Piscataway NJ 08854 U.S.A. Copyright © 2018 by The
Institute of Electrical and Electronics Engineers, Inc. All rights reserved. Periodicals postage paid at New York, NY and at additional mailing offices. Postmaster: Send address changes to
IEEE Computational Intelligence Magazine, IEEE, 445 Hoes Lane, Piscataway, NJ 08854-1331 U.S.A. Printed in U.S.A. Canadian GST #125634188.

Digital Object Identifier 10.1109/MCI.2017.2770275
Editor's Remarks
Hisao Ishibuchi
Southern University of Science and Technology, CHINA
Osaka Prefecture University, JAPAN

One Year in China

Since April 2017, I have been working in China. I live in an apartment on the university campus. When I moved to China last year, I knew only two Chinese words, "ni hao (hello)" and "xie xie (thank you)". Surprisingly, I have yet to learn any other Chinese words; I rely on my secretary's help, for which I am very thankful. My biggest achievement in China has been my successful weight loss: I have lost more than 10 kg! I eat three meals a day at the university's canteen. It opens seven days a week but only for five hours a day: two hours for breakfast, and one and a half hours each for lunch and dinner. Because of this, my daily schedule runs like clockwork around the canteen's opening hours. The large and hilly campus also lets me exercise regularly for my weight control.

Recently I visited Xiamen for a conference. Three undergraduate students came to see me at the high-speed train station with a large name board (see the photo below). I enjoyed dinner and a one-hour metro ride to the conference hotel. Thank you to the student who brought the name board back to the university. On the last day in Xiamen, they took me to the train station and gave me some local cakes and nuts. One of the students spoke excellent Japanese. Her Japanese was much better than my English. When I received a phone call from her before meeting them, I thought that the call came from Japan. She was in Japan for one year when she was a high-school student. I was so impressed by her language learning ability while comparing it to my progress with Chinese after one year. The other student also showed me her amazing photos of her hometown about 600 km north of Xi'an. She told me that she often enjoyed camel rides when she was a small girl. Next year, we will host the IEEE SSCI 2019 conference in Xiamen. I am looking forward to seeing all of you and those students again.

The feature topic of the current CIM issue is "Bioinformatics and Bioengineering." This has been an interesting and challenging application field for CI techniques. I hope you will enjoy all of the articles in this issue.

With three first-year undergraduate students in Xiamen

CIM Editorial Board

Editor-in-Chief
Hisao Ishibuchi
Southern University of Science and Technology
Department of Computer Science and Engineering
Shenzhen, Guangdong, CHINA
Osaka Prefecture University
Department of Computer Science
Sakai, Osaka 599-8531, JAPAN
(Email) hisaoi@cs.osakafu-u.ac.jp

Founding Editor-in-Chief
Gary G. Yen, Oklahoma State University, USA

Past Editor-in-Chief
Kay Chen Tan, City University of Hong Kong, HONG KONG

Editors-At-Large
Piero P. Bonissone, Piero P. Bonissone Analytics LLC, USA
David B. Fogel, Natural Selection, Inc., USA
Vincenzo Piuri, University of Milan, ITALY
Marios M. Polycarpou, University of Cyprus, CYPRUS
Jacek M. Zurada, University of Louisville, USA

Associate Editors
José M. Alonso, University of Santiago de Compostela, SPAIN
Erik Cambria, NTU, SINGAPORE
Raymond Chiong, The University of Newcastle, AUSTRALIA
Yun Fu, Northeastern University, USA
Robert Golan, DBmind Technologies Inc., USA
Roderich Gross, The University of Sheffield, UK
Amir Hussain, University of Stirling, UK
John McCall, Robert Gordon University, UK
Zhen Ni, South Dakota State University, USA
Yusuke Nojima, OPU, JAPAN
Nelishia Pillay, University of Pretoria, SOUTH AFRICA
Rong Qu, University of Nottingham, UK
Dipti Srinivasan, NUS, SINGAPORE
Ricardo H. C. Takahashi, UFMG, BRAZIL
Kyriakos G. Vamvoudakis, Virginia Tech, USA
Nishchal K. Verma, IIT Kanpur, INDIA
Dongrui Wu, DataNova, USA
Bing Xue, Victoria University of Wellington, NEW ZEALAND
Dongbin Zhao, Chinese Academy of Sciences, CHINA

IEEE Periodicals/Magazines Department
Editorial/Production Associate, Heather Hilton
Senior Art Director, Janet Dudar
Associate Art Director, Gail A. Schnitzer
Production Coordinator, Theresa L. Smith
Director, Business Development—Media & Advertising, Mark David
Advertising Production Manager, Felicia Spagnoli
Production Director, Peter M. Tuohy
Editorial Services Director, Kevin Lisankie
Staff Director, Publishing Operations, Dawn M. Melley

IEEE prohibits discrimination, harassment, and bullying. For more information, visit http://www.ieee.org/web/aboutus/whatis/policies/p9-26.html.

Digital Object Identifier 10.1109/MCI.2017.2770276
Digital Object Identifier 10.1109/MCI.2018.2840642
Date of publication: 18 July 2018


President's Message
Nikhil R. Pal
Indian Statistical Institute, INDIA

Sharing Some Experience

We held our society's first Executive Committee (ExCom) meeting of this year on April 8, 2018, in Sydney, Australia. Although I had attended several ExCom meetings as the Vice-President for Publications and as the President-Elect, this was my first meeting as the President and hence was of a special importance to me. We had a long but very successful meeting. I take this opportunity to sincerely thank all members of the ExCom for their very active participation and for providing useful suggestions, which will certainly help to take the society ahead.

In our society we make use of every opportunity to strengthen our interaction with people, and it is not restricted to the members of the Computational Intelligence Society (CIS). Keeping this in mind, we carefully choose the locations for our ExCom meetings and usually organize one (and sometimes even two) one-day workshops on computational intelligence and its applications in the region where the meeting is held. To the extent I know, this initiative was started in 2011, and so far it has been found to be very successful in achieving its intended goals of allowing us to reach out to students, researchers, and practitioners. Generally we find a "champion" from a local university/institute or from the local CIS chapter, if any, to organize the event. Every ExCom member who attends the ExCom meeting in person serves as a speaker at the workshop. In addition, to strengthen our bonding with the local researchers as well as to get a better idea of some of the research activities that the local researchers are engaged in, we often invite a few speakers from the neighboring institutes/universities. In the past years we organized very successful events in different countries including Brazil, Cyprus, Peru, Spain, Ecuador, Chile, and Argentina. This year, after the ExCom meeting, we organized two such workshops, one in Sydney (on April 9) and the other in Canberra (on April 10). In Sydney, the workshop was organized by the University of Technology Sydney (UTS) under the leadership of Prof. Jie Lu. It was a very well-attended event—more than 100 researchers/students participated in this workshop. The second workshop was organized by the University of New South Wales (UNSW) Canberra, and Prof. Hussein Abbass took the lead in organizing this event. This was also a well-attended one, with more than 80 participants. For this event, in addition to the ExCom members, Prof. Jason Scholz, Adjunct Professor, University of New South Wales in Canberra and the Chief Scientist and Engineer, Defence Cooperative Research Centre (DCRC), was invited to give a talk. He spoke on "Trusted Autonomous Systems Defence CRC". These workshops covered a wide spectrum of related topics of contemporary interest including deep learning, recognition technology, machine learning, big data challenges in astrophysics, fuzzy information processing, and human-machine systems. In both workshops the participants were very enthusiastic and interactive. An important aspect of the Canberra program was that some young high school students were invited to attend the talk by Prof. Bernadette Bouchon-Meunier. Her talk was appropriately adapted to suit the high school students. After the talk, the students were given an opportunity to closely interact with Prof. Bernadette Bouchon-Meunier and Prof. Julia Chung (both are CIS ExCom members). This unique initiative was taken to motivate young minds to get interested in engineering, particularly in computational intelligence. Prof. Abbass deserves special thanks for this innovative idea. Such an interaction with high school students can have a long-lasting effect. In the future, we plan to take up more such initiatives to make high school students interested in engineering and technology.

To conclude, I would like to remind you that our favorite SSCI 2018 (IEEE Symposium Series on Computational Intelligence, 2018) will be held in Bengaluru, India during November 18–21, 2018. I look forward to seeing you all there.

Prof. Bernadette Bouchon-Meunier giving her talk at the UTS, Sydney workshop.
Prof. Pablo Estevez giving his talk at the UNSW, Canberra workshop.

CIS Society Officers
President – Nikhil R. Pal, Indian Statistical Institute, INDIA
Past President – Pablo A. Estévez, University of Chile, CHILE
Vice President – Conferences – Bernadette Bouchon-Meunier, University Pierre et Marie Curie, FRANCE
Vice President – Education – Simon M. Lucas, Queen Mary University of London, UK
Vice President – Finances – Enrique H. Ruspini, USA
Vice President – Members Activities – Pau-Choo (Julia) Chung, National Cheng Kung University, TAIWAN
Vice President – Publications – Jim Keller, University of Missouri, USA
Vice President – Technical Activities – Hussein Abbass, University of New South Wales, AUSTRALIA

Publication Editors
IEEE Transactions on Neural Networks and Learning Systems – Haibo He, University of Rhode Island, USA
IEEE Transactions on Fuzzy Systems – Jon Garibaldi, University of Nottingham, UK
IEEE Transactions on Evolutionary Computation – Kay Chen Tan, City University of Hong Kong, HONG KONG
IEEE Transactions on Games – Julian Togelius, New York University, USA
IEEE Transactions on Cognitive and Developmental Systems – Yaochu Jin, University of Surrey, UK
IEEE Transactions on Emerging Topics in Computational Intelligence – Yew Soon Ong, Nanyang Technological University, SINGAPORE

Administrative Committee
Term ending in 2018: Piero P. Bonissone, Piero P. Bonissone Analytics LLC, USA; Carlos A. Coello Coello, CINVESTAV-IPN, MEXICO; Barbara Hammer, Bielefeld University, GERMANY; Alice E. Smith, Auburn University, USA; Jacek M. Zurada, University of Louisville, USA
Term ending in 2019: Cesare Alippi, Politecnico di Milano, ITALY; James C. Bezdek, USA; Gary B. Fogel, Natural Selection, Inc., USA; Hisao Ishibuchi, Southern University of Science and Technology, CHINA, and Osaka Prefecture University, JAPAN; Kay Chen Tan, City University of Hong Kong, HONG KONG
Term ending in 2020: Janusz Kacprzyk, Polish Academy of Sciences, POLAND; Sanaz Mostaghim, Otto von Guericke University of Magdeburg, GERMANY; Christian Wagner, University of Nottingham, UK; Ronald R. Yager, Iona College, USA; Gary G. Yen, Oklahoma State University, USA

Digital Object Identifier 10.1109/MCI.2018.2840643
Digital Object Identifier 10.1109/MCI.2018.2840198
Date of publication: 18 July 2018



Conference Reports
Jie Lu
Faculty of Engineering and Information Technology, University of Technology Sydney, AUSTRALIA

Conference Report on IEEE Computational Intelligence Society ExCom Workshop 2018

On Monday 9th April 2018, the IEEE Computational Intelligence Society (CIS) ExCom Workshop was hosted by the Faculty of Engineering and Information Technology (FEIT) at the University of Technology Sydney (UTS), chaired by Associate Dean Research Excellence, Distinguished Professor Jie Lu.

Over 120 academic staff, industry partners and PhD students from UTS, University of NSW, University of Sydney and University of Wollongong, along with IEEE CIS members in the NSW Chapter, attended the workshop. The UTS IEEE Student Board were also involved in the facilitation of the event and on hand on the day to answer any questions about IEEE.

In her opening speech, Dist. Prof. Jie Lu gave an introduction to the IEEE CIS and its members, mentioning the vast reach of IEEE and the IEEE CIS across the world, its international standing as a leading society for technical professionals and as a world-leading publisher in the CI field.

Following this, the six IEEE CIS ExCom members and world-renowned scholars from institutions in France, Taiwan, India, Chile, America and Australia presented the latest theoretical and application developments in CI. During these talks, the speakers covered many interesting areas and applications of CI, from neural networks to classification, from fuzzy systems to autonomous behaviour, and from astronomy to bush-fires.

The talks gave the audience an opportunity to hear insights from academics at the top of their fields, including significant contributions to new knowledge and applications with real-world impact. The audience were inspired, captivated and delighted to hear from renowned professors on such pertinent topics, which encouraged a lot of interesting questions.

The program ran as follows:
Prof. Nikhil R. Pal, Electronics and Communication Sciences Unit, Indian Statistical Institute and current President of IEEE CIS, spoke on "The open world classification problem: making neural networks say don't know."
Prof. Bernadette Bouchon-Meunier, Emeritus Director of Research, National Centre for Scientific Research, Sorbonne University, spoke on "Fuzzy approaches of information quality."
Prof. Pablo A. Estévez, Electrical Engineering, Universidad de Chile and co-founder of the Millennium Institute of Astrophysics, spoke on "Big data challenges and deep learning applications to astronomy."
Dist. Prof. James M. Keller, Electrical Engineering and Computer Science, R. L. Tatum Professor, University of Missouri, spoke on "Recognition technology: Lotfi's look to the future from the late 1990s."
Dist. Prof. Pau-Choo (Julia) Chung, Director General of the Department of Information and Technology Education, Ministry of Education, Taiwan, spoke on "An overview to deep learning development."
Prof. Hussein Abbass, School of Engineering and Information Technology, UNSW Canberra, spoke on "Characterising human and autonomous systems behaviour in deceptive and noisy environments."

The workshop was very successful and offered excellent opportunities for IEEE CIS to promote its research developments. It provided a high-level international forum for scientists, engineers, students and educators to learn state-of-the-art research and applications in computational intelligence.

Prof. Nikhil R. Pal presenting to staff, students, and industry at the University of Technology Sydney.

Digital Object Identifier 10.1109/MCI.2018.2840644
Date of publication: 18 July 2018

************************

More questions for the CI experts.


While the IEEE CIS ExCom members were visiting for the workshop, UTS also took the opportunity to ask them a few questions about computational intelligence, and captured it on film. The resulting answers were very insightful. Transcripts of some of these answers are included below.

From left to right: Dist. Prof. CT Lin, Dist. Prof. Jie Lu (UTS), Prof. Nikhil R. Pal (President), Prof. Pablo Estevez, Dist. Prof. James Keller, Dist. Prof. Pau-Choo (Julia) Chung, Prof. Bernadette Bouchon-Meunier, Jo-Ellen Snyder, guest (IEEE Computational Intelligence Society).
Members of the IEEE Computational Intelligence Society with UTS hosts, IEEE Student Board volunteers, staff, and students.

Q: What is the difference between Artificial Intelligence (AI) and Computational Intelligence (CI)?
Answer from Prof. Nikhil R. Pal
I think we need to look at the history. AI formally started in the late 1950's, and at that time the emphasis was on symbolic information processing, logical reasoning—you know, the success stories of expert systems and other things. At that time they (AI) did not give much emphasis on learning or reasoning based on data. Research in computational neural networks (NN) started earlier—the first computational model of a neuron was proposed in the 1940's. At the beginning AI people never considered NN a part of AI, but with time, when CI and in particular NN, which is one of the major pillars of CI, started giving really superb results, everybody [now] considers NN, especially deep NN [deep learning], as a part of AI.

Q: What have been some of the most successful applications of AI?
Answer from Dist. Prof. Pau-Choo (Julia) Chung
Currently [CI is] used in image understanding, large image understanding, autonomous systems and smart-environment understanding [and many other applications]. In this kind of environment it's hard for humans to know what kind of rule you can embed, so you can use the algorithm very precisely to do that. In that case our society [has built] up this technique, to let the algorithm learn the situation, learn the environment and learn from the data so that it will become smarter and out-perform the traditional approaches.

Q: What are the current challenges in AI and robotics?
Answer from Prof. Bernadette Bouchon-Meunier


I think it is the human component of robots which is not so obvious, for instance: to understand the feelings, to understand the subtlety of natural language (which is not obvious). Robots are able to perform [normal tasks] very efficiently, but as far as we go to human-like behaviour, it is more and more difficult, and this is where fuzzy logic is very useful in managing a kind of subjective, informal way of interacting with people.

Q: What are the new research directions and applications in CI?
Answer from Prof. Pablo A. Estévez
There are several areas; in my case I work with NN, so the idea is to classify patterns, but the traditional way to do this is with labels. It is like teaching a kid to read: we do this by telling the kid how to pronounce the words, and then after repeating many times the kid learns. This is the same way for machines. So we are talking about machine learning. You can teach a machine to do something, but the problem is when you need labels; this is not useful when you have, say, 100 million of data—there is no way humans will label that. So one important area is to do the same without labelling: to be able to recognize the environment and the patterns without humans telling the machine what is there.

Answer from Dist. Prof. James Keller
The population of the world is getting older, so more and more resources are being pushed into that arena, so that's where I think a lot of activity is going. And then of course you have all the issues of cyber security and physical security—e.g. landmines and parameters detection. But with all the cyber-attacks going on, there is going to need to be an enormous amount of work done to try to keep our [personal] data secure.

Answer from Prof. Hussein Abbass
The future is with connected data, but what we need to start thinking about is [how] to connect the smartness that exists all over the place, and almost independent of each other. How we can actually leverage what we call distributed intelligence. Once we are able to connect this intelligence we will quite quickly see lots of advantages in society.



Publication
Spotlight
Haibo He, Jon Garibaldi, Kay Chen Tan,
Julian Togelius, Yaochu Jin, and
Yew Soon Ong

CIS Publication Spotlight

Digital Object Identifier 10.1109/MCI.2018.2840645
Date of publication: 18 July 2018

IEEE Transactions on Neural Networks and Learning Systems

Robust C-Loss Kernel Classifiers, by G. Xu, B. Hu, and J. C. Principe, IEEE Transactions on Neural Networks and Learning Systems, Vol. 29, No. 3, March 2018, pp. 510–522.
Digital Object Identifier: 10.1109/TNNLS.2016.2637351
"The correntropy-induced loss (C-loss) function has the nice property of being robust to outliers. In this paper, we study the C-loss kernel classifier with the Tikhonov regularization term, which is used to avoid overfitting. After using the half-quadratic optimization algorithm, which converges much faster than the gradient optimization algorithm, we find out that the resulting C-loss kernel classifier is equivalent to an iterative weighted least square support vector machine (LS-SVM). This relationship helps explain the robustness of iterative weighted LS-SVM from the correntropy and density estimation perspectives. On the large-scale data sets which have low-rank Gram matrices, we suggest to use incomplete Cholesky decomposition to speed up the training process. Moreover, we use the representer theorem to improve the sparseness of the resulting C-loss kernel classifier. Experimental results confirm that our methods are more robust to outliers than the existing common classifiers."

Multicolumn RBF Network, by A. O. Hoori and Y. Motai, IEEE Transactions on Neural Networks and Learning Systems, Vol. 29, No. 4, April 2018, pp. 766–778.
Digital Object Identifier: 10.1109/TNNLS.2017.2650865
"This paper proposes the multicolumn RBF network (MCRN) as a method to improve the accuracy and speed of a traditional radial basis function network (RBFN). The RBFN, as a fully connected artificial neural network (ANN), suffers from costly kernel inner-product calculations due to the use of many instances as the centers of hidden units. This issue is not critical for small datasets, as adding more hidden units will not burden the computation time. However, for larger datasets, the RBFN requires many hidden units with several kernel computations to generalize the problem. The MCRN mechanism is constructed based on dividing a dataset into smaller subsets using the k-d tree algorithm. N resultant subsets are considered as separate training datasets to train N individual RBFNs. Those small RBFNs are stacked in parallel and bulged into the MCRN structure during testing. The MCRN is considered as a well-developed and easy-to-use parallel structure, because each individual ANN has been trained on its own subsets and is completely separate from the other ANNs. This parallelized structure reduces the testing time compared with that of a single but larger RBFN, which cannot be easily parallelized due to its fully connected structure. Small informative subsets provide the MCRN with a regional experience to specify the problem instead of generalizing it. The MCRN has been tested on many benchmark datasets and has shown better accuracy and great improvements in training and testing times compared with a single RBFN. The MCRN also shows good results compared with those of some machine learning techniques, such as the support vector machine and k-nearest neighbors."

IEEE Transactions on Fuzzy Systems

Analysis and Design of Functionally Weighted Single-Input-Rule-Modules Connected Fuzzy Inference Systems, by C. Li, J. Gao, J. Yi, and G. Zhang, IEEE Transactions on Fuzzy Systems, Vol. 26, No. 1, February 2018, pp. 56–71.
Digital Object Identifier: 10.1109/TFUZZ.2016.2637369
"Single-input-rule-modules (SIRMs) can efficiently solve the fuzzy rule explosion phenomenon, which usually occurs in the multivariable modeling and/or control applications. However, the performance of SIRM connected fuzzy inference systems (SIRM-FIS) is limited due to its simple input-output mapping. In this paper, to further enhance the performance of SIRM-FIS, a functionally weighted SIRM-FIS (FWSIRM-FIS), which adopts multi-variable functional weights to measure the important degrees of the SIRMs, is presented. Then, in order to show the fundamental differences of



the methods, various properties of a range of SIRM-FIS models are explored. These properties demonstrate that the proposed FWSIRM-FIS has more general and complex input-output mapping than the existing SIRMs methods. Furthermore, based on the least-squares method, a novel data-driven optimization method is presented for parameter learning, which can also be used to optimize the parameters of the other models. Due to the properties of the least-squares method, the proposed parameter learning algorithm can overcome the drawbacks of gradient-based parameter learning methods and obtain both smallest training errors and smallest parameters. Finally, examples and detailed comparisons are given. Simulation results show that FWSIRM-FIS can obtain better performance than the other methods, and, compared with some well-known methods, can achieve similar or better performance with fewer parameters and faster training speed."

On Distributed Fuzzy Decision Trees for Big Data, by A. Segatori, F. Marcelloni, and W. Pedrycz, IEEE Transactions on Fuzzy Systems, Vol. 26, No. 1, February 2018, pp. 174–192.
Digital Object Identifier: 10.1109/TFUZZ.2016.2646746
"Fuzzy decision trees (FDTs) have shown to be an effective solution in the framework of fuzzy classification. The approaches proposed so far to FDT learning, however, have generally neglected time and space requirements. In this paper, the authors propose a distributed FDT learning scheme shaped according to the MapReduce programming model for generating both binary and multiway FDTs from big data. The scheme relies on a novel distributed fuzzy discretizer that generates a strong fuzzy partition for each continuous attribute based on fuzzy information entropy. The fuzzy partitions are, therefore, used as an input to the FDT learning algorithm, which employs fuzzy information gain for selecting the attributes at the decision nodes. The authors have implemented the FDT learning scheme on the Apache Spark framework. They have used ten real-world publicly available big datasets for evaluating the behavior of the scheme along three dimensions: 1) performance in terms of classification accuracy, model complexity, and execution time; 2) scalability varying the number of computing units; and 3) ability to efficiently accommodate an increasing dataset size. They have demonstrated that the proposed scheme turns out to be suitable for managing big datasets even with a modest commodity hardware support. Finally, they have used the distributed decision tree learning algorithm implemented in the MLLib library and the Chi-FRBCS-BigData algorithm, a MapReduce distributed fuzzy rule-based classification system, for comparative analysis."

IEEE Transactions on Evolutionary Computation

A Review on Bilevel Optimization: From Classical to Evolutionary Approaches and Applications, by A. Sinha, P. Malo, and K. Deb, IEEE Transactions on Evolutionary Computation, Vol. 22, No. 2, April 2018, pp. 276–295.
Digital Object Identifier: 10.1109/TEVC.2017.2712906
"Bilevel optimization is defined as a mathematical program, where an optimization problem contains another optimization problem as a constraint. These problems have received significant attention from the mathematical programming community. Only limited work exists on bilevel problems using evolutionary computation techniques; however, recently there has been an increasing interest due to the proliferation of practical applications and the potential of evolutionary algorithms in tackling these problems. This paper provides a comprehensive review on bilevel optimization from the basic principles to solution strategies; both classical and evolutionary. A number of potential application problems are also discussed. To offer the readers insights on the prominent developments in the field of bilevel optimization, the authors have performed an automated text-analysis of an extended list of papers published on bilevel optimization to date. This paper should motivate evolutionary computation researchers to pay more attention to this practical yet challenging area."

IEEE Transactions on Games

Informed Hybrid Game Tree Search for General Video Game Playing, by T. Joppen, M. U. Moneke, N. Schröder, C. Wirth, and J. Fürnkranz, IEEE Transactions on Games, Vol. 10, No. 1, March 2018, pp. 78–90.
Digital Object Identifier: 10.1109/TCIAIG.2017.2722235
"In this paper, we introduce a universal game playing agent that is able to successfully play a wide variety of video games. It combines the strengths of Monte Carlo tree search with conventional heuristic search into a single hybrid search agent, which is able to select the appropriate strategy based on its observations about the game dynamics. In particular, the agent learns a knowledge base which provides the agent with information such as an approximate transition function, the type of agents and objects that participate in the game and the possible effects of interacting with them, heuristics for focusing and pruning the search, and more. This hybrid strategy proved to be successful in the 2015 General Video Game Competition, in which our agent emerged as the clear winner."

IEEE Transactions on Cognitive and Developmental Systems

Bootstrapping Q-Learning for Robotics From Neuro-Evolution Results, by M. Zimmer and S. Doncieux, IEEE Transactions on Cognitive and Developmental Systems, Vol. 10, No. 1, March 2018, pp. 102–119.
Digital Object Identifier: 10.1109/TCDS.2016.2628817

(continued on page 11)



Guest Editorial
Richard Allmendinger, University of Manchester, UK
Daniel Ashlock, University of Guelph, Canada
Sansanee Auephanwiriyakul, Chiang Mai University, Thailand

Computational Intelligence Techniques in Bioinformatics and Bioengineering

Digital Object Identifier 10.1109/MCI.2018.2840658
Date of publication: 18 July 2018

The field around Bioinformatics and Bioengineering is rich and includes important and diverse problems, such as protein structure prediction, systems and synthetic biology, feature discovery and induction, healthcare informatics, biomarker discovery, and development of personalized medicine and treatment, amongst others. Many of the above problems can be framed as optimization, modeling and/or learning problems that are too difficult to tackle via classical techniques. Consequently, a consensus is emerging that current state-of-the-art approaches, such as sampling-based schemes in the Rosetta suite for macromolecular modeling, or classical mathematical programming methods, have reached a saturation point and are not very effective for the vast multi-modal landscapes encountered in this domain.

Over the past decade, it has become clear that increases in computational power alone will not be sufficient to combat this problem, and that there is a need for the development of specialized search and learning procedures that exploit problem-specific features and are capable of reusing information gathered during the problem solving procedure. The field around optimization and learning via computational intelligence offers a repertoire of candidate techniques for global optimization and learning, as well as a rich body of theoretical and empirical work relating to their tuning and performance in different problem domains. Large parts of this expertise are yet to make their debut in the domain of bioinformatics and bioengineering, as the knowledge exchange between the two fields has been limited. It is only very recently that this boundary has started to break down, and promising preliminary applications have underlined the potential of this research direction. Given this, a special issue in a popular journal, such as IEEE Computational Intelligence Magazine, is particularly timely and will help further draw attention to this emerging research area.

The computational intelligence community in bioinformatics and bioengineering is fragmented and large. The aim of the special issue is to capture some of the ongoing interdisciplinary research that draws upon joint expertise in the domains of optimization and learning via computational intelligence techniques and bioinformatics and bioengineering. After a rigorous review process, two papers were selected for publication in the special issue.

The first paper, "Identifying DNA methylation modules associated with a cancer by probabilistic evolutionary learning" by Je-Keun Rhee, Soo-Jin Kim, and Byoung-Tak Zhang, aims at improving our understanding of the effects of DNA methylation on complex diseases. In particular, the paper focuses on identifying multiple interactions of many DNA methylation sites in the context of cancer. The authors demonstrate that computational intelligence (more precisely, an estimation of distribution algorithm-based evolutionary algorithm) can be used to identify high-order interactions of DNA methylated sites that are potentially relevant to a disease. The methodology has been validated successfully on array- and sequencing-based high-throughput DNA methylation profiling datasets.

The second paper, "Augmentation of physician assessments with multi-omics enhances predictability of drug response: A case study of major depressive disorder" by Arjun Athreya, Ravishankar Iyer, Drew Neavin, Liewei Wang, Richard Weinshilboum, Rima Kaddurah-Daouk, John Rush, Mark Frye, and William Bobo, proposes a learning-augmented clinical assessment workflow to sequentially augment a physician's assessment of patients' symptoms and their socio-demographic measures with heterogeneous biological measures to accurately predict treatment outcomes using machine learning and computational



intelligence. Using real data from a clinical trial as a case study, the paper demonstrates that the proposed approach can yield significant improvements in the prediction accuracy for antidepressant treatment outcomes in patients with major depressive disorder, compared to using only a physician's assessment as the predictor. In other words, the paper argues that a properly tuned prediction model can be used to assess the therapeutic efficacy for a new patient prior to treatment. Ultimately, the approach proposed may find applications beyond psychiatry, for example, for predicting treatment outcomes for other medical conditions, such as migraine headaches or rheumatoid arthritis.

We would like to use this opportunity to thank all the authors for submitting their high quality papers to the special issue, and all the reviewers for their invaluable contribution in assessing the submissions. Our final thanks go to the Editor-in-Chief of IEEE CIM, Hisao Ishibuchi, for the opportunity to publish the special issue and his support and advice throughout the process.

Publication Spotlight (continued from page 9)

“Reinforcement learning (RL) prob- IEEE Transactions on Emerging vates advanced optimizers that mimic
lems are hard to solve in a robotics con- Topics in Computational human cognitive capabilities; leveraging
text as classical algorithms rely on Intelligence on what has been seen before to accel-
discrete representations of actions and erate the search toward optimal solu-
states, but in robotics both are continu- Insights on Transfer Optimization: tions of never before seen tasks. With
ous. A discrete set of actions and states Because Experience is the Best Teacher, this in mind, this paper sheds light on
can be defined, but it requires an exper- by A. Gupta, Y. S. Ong, and L. Feng, recent research advances in the field of
tise that may not be available, in particu- IEEE Transactions on Emerging Topics in global black-box optimization that
lar in open environments. It is proposed Computational Intelligence, Vol. 2, No. champion the theme of automatic knowl-
to define a process to make a robot 1, February 2018, pp. 51–64. edge transfer across problems. We intro-
build its own representation for an RL duce a general formalization of transfer
algorithm. The principle is to first use a Digital Object Identifier: 10.1109/ optimization, based on which the con-
direct policy search in the sensori-motor TETCI.2017.2769104 ceptual realizations of the paradigm are
space, i.e., with no predefined discrete “Traditional optimization solvers classified into three distinct categories,
sets of states nor actions, and then extract tend to start the search from scratch by namely sequential transfer, multitasking,
from the corresponding learning traces assuming zero prior knowledge about and multiform optimization. In addition,
discrete actions and identify the relevant the task at hand. Generally speaking, the we carry out a survey of different meth-
dimensions of the state to estimate the capabilities of solvers do not automati- odological perspectives spanning Bayes-
value function. Once this is done, the cally grow with experience. In contrast, ian optimization and nature-inspired
robot can apply RL: 1) to be more however, humans routinely make use of computational intelligence procedures
robust to new domains and, if required a pool of knowledge drawn from past for efficient encoding and transfer of
and 2) to learn faster than a direct policy experiences whenever faced with a new knowledge building blocks. Finally, real-
search. This approach allows to take the task. This is often an effective approach world applications of the techniques are
best of both worlds: first learning in a in practice as real-world problems sel- identified, demonstrating the future
continuous space to avoid the need of a dom exist in isolation. Similarly, practi- impact of optimization engines that
specific representation, but at a price of a cally useful artificial systems are expected evolve as better problem-solvers over
long learning process and a poor gener- to face a large number of problems in time by learning from the past and from
alization, and then learning with an their lifetime, many of which will either one another.”
adapted representation to be faster and be repetitive or share domain-specific
more robust.” similarities. This view naturally moti-



Identifying DNA Methylation Modules Associated with a Cancer by Probabilistic Evolutionary Learning

Je-Keun Rhee
Cancer Research Institute, College of Medicine, Catholic University of Korea, Seoul, KOREA

Soo-Jin Kim
Research Institute of Agriculture and Life Sciences, College of Agriculture and Life Sciences, Seoul National University, Seoul, KOREA

Byoung-Tak Zhang
School of Computer Science & Engineering, Seoul National University, Seoul, KOREA

Abstract—DNA methylation leads to inhibition of downstream gene expression. Recently, considerable studies have been made to determine the effects of DNA methylation on complex disease. However, further studies are necessary to find the multiple interactions of many DNA methylation sites and their association with cancer. Here, to assess DNA methylation modules potentially relevant to disease, we use an Estimation of Distribution Algorithm (EDA) to identify high-order interactions of DNA methylated sites (or modules) that are potentially relevant to disease. The method builds a probabilistic dependency model to produce a solution that is a set of discriminative methylation sites. The algorithm is applied to array- and sequencing-based high-throughput DNA methylation profiling datasets. The experimental results show that it is able to identify DNA methylation modules for cancer.



I. Introduction

Genomic studies mainly aim to find genetic markers that are associated with a phenotype. Based on DNA sequences, researchers have searched for causal effects on biological processes including gene regulatory mechanisms and diseases. Although several risk factors have been identified by the association studies, the genetic variants do not fully explain the abnormal regulation because the biological regulatory mechanism can be affected by many other factors, as well as DNA sequence modification [1]–[4].

Epigenomics refers to the study of regulation of various genomic functions that are controlled by another partially stable modification, but not DNA sequence variants [5]. Among these, DNA methylation, which typically occurs at CpG dinucleotides catalyzed by DNA methyltransferase, is a crucial epigenetic regulatory mechanism in cellular processes. DNA methylation of CpG sites mostly causes silencing of the downstream gene. The enrichment of the differentially methylated DNA fractions can contribute to specific abnormalities, including complex diseases [6]–[8]. In particular, with the advent of array and next generation sequencing (NGS) technology, many researchers have carried out genome-wide DNA methylation profiling studies [9]–[11], and the genome-wide studies have reported that many genomic regions are differentially methylated in normal and abnormal cells [12]–[14].

However, a complex disease is caused by a combination of dysregulatory effects of multiple genes [15]–[17]. That is, errors of biological processes are not caused by the alteration of an individual methylation level. Recently, Easwaran et al. hypothesized that DNA hypermethylation modules preferentially target important developmental regulators in embryonic stem cells [18]. They found a set of genes whose DNA methylation contributed to the stem-like state of cancer. Horvath et al. studied aging effects of DNA methylation and identified co-methylated modules related to aging in the human brain and blood tissue [19]. Zhang and Huang investigated the DNA co-methylation patterns frequently observed in cancer [20].

Here, we identify combinatorial modules of DNA methylation sites associated with human diseases using an evolutionary learning approach (Figure 1). Evolutionary algorithms can approximate solutions well for a variety of problems [21]–[25]. They generate a new population through iterative updates and selection using a guided search process in a feature space. We utilized an Estimation of Distribution Algorithm (EDA)-based learning approach to identify combinations of cancer-related DNA methylation sites. In the EDA, the population is evolved according to the probabilistic distribution in selected individuals without conventional genetic operators such as crossover and mutation. As a result, the EDA can provide answers in combinatorial optimization problems [26]–[29]. EDA-based methods have been previously applied in several biological studies and have offered promising results for complex problems where other methods failed to find a good solution [30]–[32].

We investigated DNA methylation modules relevant to cancer, using the DNA methylation profiling datasets produced by array- and sequencing-based approaches. The experimental results showed that our method could identify DNA methylation modules related to cancer.

II. Methods

A. Evolutionary Learning Procedure to Identify a Set of DNA Methylation Sites Associated with a Disease

EDAs evolve a population to find the optimal solution probabilistically. The initial population is constructed by randomly selecting individuals. The individuals represent higher order interactions of the methylated sites. The population size m is decided empirically, and the initial weight w_j of the individual j (0 < j < m) is randomly assigned with a small value (-1 < w_j < 1).

In the evolutionary process, each individual is evaluated for how discriminative the interaction is for the datasets. Better individuals are then selected, and a dependency tree is built by fitting to the selected individuals. New individuals of the next generation are generated using the probability distribution within the tree structure, and replace the previous individuals. The overall procedure is as follows (a code sketch is given after Figure 1):

Step 1) Set g ← 0
Step 2) Initialize population X(g) by random generation
Step 3) Evaluate individuals in X(g)
Step 4) Select a set of individuals by tournament selection from X(g)
Step 5) Construct a dependency tree G(g) by measuring Kullback-Leibler divergence between variables
Step 6) Learn parameters using a probability distribution of the set selected at Step 4
Step 7) Generate new individuals by sampling with the joint distribution from G(g), and create a new population X(g+1)
Step 8) Set g ← g + 1
Step 9) If the termination criterion is not met, go to Step 3

Further details for Steps 3 and 5 are explained in the following sections.

B. Learning Dependency Tree

The dependency tree is built from the selected individuals by searching conditional dependencies between random variables. The model is then optimized by a series of incremental updates [33], [34], as follows:

Suppose that X is a population and X = {X_1, X_2, ..., X_n} represents a vector of variables with n features, i.e., DNA methylation sites. The probability distribution is denoted by a joint probability P(X_1, X_2, ..., X_n) as:

Digital Object Identifier 10.1109/MCI.2018.2840659
Corresponding Author: Byoung-Tak Zhang (Email: btzhang@bi.snu.ac.kr)
Date of publication: 18 July 2018



Figure 1 Schematic overview for probabilistic evolutionary learning to identify DNA methylation modules. (Panels: Population, Evaluation against the Training Datasets, Selection, Dependency Graphs, and Sampling/Generation of the New Population.)
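The following Python sketch illustrates the Steps 1-9 loop of Section II-A under stated assumptions: individuals are binary vectors marking which methylation sites belong to a candidate module, and the helpers evaluate_fitness, learn_dependency_tree, and sample_from_tree are placeholders elaborated in the sketches given after Sections II-B and II-C. The function names, population size, generation count, and tournament size are illustrative choices, not the authors' settings.

```python
import numpy as np

def tournament_select(population, fitness, rng, k=2):
    """Step 4: binary tournaments; draw k random individuals, keep the fitter one."""
    idx = rng.integers(0, len(population), size=(len(population) // 2, k))
    winners = idx[np.arange(len(idx)), fitness[idx].argmax(axis=1)]
    return population[winners]

def run_eda(data, labels, n_sites, pop_size=100, n_generations=60, seed=0):
    """Sketch of the EDA of Section II-A: evolve binary site-inclusion vectors."""
    rng = np.random.default_rng(seed)
    population = rng.integers(0, 2, size=(pop_size, n_sites))   # Step 2: random initial population
    best, best_fit = None, -np.inf
    for g in range(n_generations):                              # Steps 1, 8, 9: generation loop
        # Step 3: score how discriminative each candidate module is (Eqs. (7)-(9))
        fitness = np.array([evaluate_fitness(ind, data, labels) for ind in population])
        if fitness.max() > best_fit:
            best, best_fit = population[fitness.argmax()].copy(), fitness.max()
        selected = tournament_select(population, fitness, rng)  # Step 4
        tree = learn_dependency_tree(selected)                  # Steps 5-6: structure + parameters
        population = sample_from_tree(tree, pop_size, rng)      # Step 7: replacement population
    return best, best_fit
```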

P(X) = P(X_1, X_2, ..., X_n) = P(X_1 | X_2, ..., X_n) P(X_2 | X_3, ..., X_n) ... P(X_{n-1} | X_n) P(X_n).   (1)

However, it is hard to measure all the joint probabilities exactly when n, the number of variables, is large. Thus it is necessary to approximate the probability distribution. In this study, we used a dependency tree, and the distribution is approximated as follows:

P(X_1, X_2, ..., X_n) = P(X_r) \prod_{i \ne r} P(X_i | X_{pa(i)}),   (2)

where X_1, X_2, ..., X_n are random variables, r is the index of a root node, and pa(i) denotes the index of the parent node of X_i. The tree structure is built by searching based on Kullback-Leibler divergence between two random variables. The dependency graph is constructed optimally in a direction to maximize total mutual information as follows:

argmax_{r, pa} \prod_{i \ne r} I(X_i; X_{pa(i)}),   (3)

I(X_i; X_{pa(i)}) = \sum_x \sum_y P(X_i = x, X_{pa(i)} = y) \log [ P(X_i = x, X_{pa(i)} = y) / ( P(X_i = x) P(X_{pa(i)} = y) ) ].   (4)

The complete graph G is searched for the maximum spanning tree, and then the best dependency tree is constructed.

For parameter learning, the most likely values are calculated from the frequencies in the selected individuals. That is, the model parameters are represented as marginal probabilities in a root node and conditional probabilities in the other nodes. The marginal probabilities in the root nodes and the conditional probabilities in the child nodes are calculated by Eqs. (5) and (6), respectively, as follows:

P(X_r = x) = c(X_r = x) / N,   (5)

P(X_i = x | X_{pa(i)} = y) = c(X_i = x, X_{pa(i)} = y) / c(X_{pa(i)} = y),   (6)

where c is the count of a variable X with a specific value and N is the total number of individuals.
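A small sketch of Steps 5-7, under the same binary-individual assumption as above: Eq. (4) is estimated from counts, the tree of Eqs. (2)-(3) is grown as a maximum spanning tree over pairwise mutual information (a Chow-Liu style construction), the probability tables of Eqs. (5)-(6) are tabulated, and new individuals are drawn by ancestral sampling. Taking variable 0 as the root is a simplification made here, not a choice stated in the paper.

```python
import numpy as np

def mutual_information(xi, xj):
    """Eq. (4): mutual information between two binary variables, from empirical counts."""
    mi = 0.0
    for a in (0, 1):
        for b in (0, 1):
            p_ab = np.mean((xi == a) & (xj == b))
            p_a, p_b = np.mean(xi == a), np.mean(xj == b)
            if p_ab > 0:
                mi += p_ab * np.log(p_ab / (p_a * p_b))
    return mi

def learn_dependency_tree(selected):
    """Steps 5-6 / Eqs. (2)-(6): maximum spanning tree over mutual information,
    plus counted marginal (root) and conditional (child) probability tables."""
    n = selected.shape[1]
    mi = np.array([[mutual_information(selected[:, i], selected[:, j]) if i != j else 0.0
                    for j in range(n)] for i in range(n)])
    parent, order = np.full(n, -1), [0]              # variable 0 taken as the root
    in_tree, rest = {0}, set(range(1, n))
    while rest:                                      # Prim's algorithm on the complete graph
        i, j = max(((i, j) for i in in_tree for j in rest), key=lambda e: mi[e])
        parent[j] = i
        in_tree.add(j); rest.remove(j); order.append(j)
    cpt = {0: np.array([np.mean(selected[:, 0] == v) for v in (0, 1)])}   # Eq. (5)
    for j in range(1, n):
        pa = selected[:, parent[j]]
        cpt[j] = np.array([[np.mean(selected[pa == y, j] == x) for x in (0, 1)]
                           for y in (0, 1)])                               # Eq. (6)
    return parent, cpt, order

def sample_from_tree(tree, pop_size, rng):
    """Step 7: ancestral sampling of a new population from the tree-structured joint."""
    parent, cpt, order = tree
    samples = np.zeros((pop_size, len(parent)), dtype=int)
    for j in order:
        if parent[j] == -1:
            p1 = np.full(pop_size, cpt[j][1])         # root marginal P(X_r = 1)
        else:
            p1 = cpt[j][samples[:, parent[j]], 1]     # P(X_j = 1 | parent value)
        # unseen parent values yield nan counts; treat them as probability 0 in this sketch
        samples[:, j] = rng.random(pop_size) < np.nan_to_num(p1)
    return samples
```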



C. Fitness Evaluation in a Population
The fitness function represents how informative the chromosome is for classifying the samples. That is, the fitness of an individual is evaluated by measuring the classification accuracy for the interaction of its features. To determine and update the fitness for each individual, we introduce a gradient-descent rule for the training data D as follows:

w_i = w_i + \eta\, (t_j - f(D_j))\, v_{ji}, \quad (7)

where w_i is the weight value for the i-th feature and t_j is the target class of the j-th training instance D_j. \eta is the learning rate and v_{ji} is the value of the i-th attribute in the j-th instance. f(D_j) is the output value of the j-th training instance predicted by our model and is determined as follows:

f(D_j) = \begin{cases} 1, & \text{if } \sum_{i=0}^{n} w_i \cdot v_{ji} > 0, \\ -1, & \text{otherwise.} \end{cases} \quad (8)

The difference between the predictions and the target values specified in the training sequence is used to represent the error of the current weight vector. The target function is optimized to minimize the classification error. The weight values are evaluated against a sequence of training samples and are updated to improve the classification accuracy. The weight updates are repeated until they converge after a number of epochs.
Using this learning scheme, we identify the most informative individuals for classification, i.e., those whose weights have large absolute values. In addition, it is preferable to find a DNA methylation module whose number of features is small. Finally, the fitness function for the k-th individual X_k, Fitness(X_k), is defined as follows:

Fitness(X_k) = Acc(X_k) - Order(X_k), \quad (9)

where Acc(X_k) is the classification accuracy on the training datasets and Order(X_k) denotes the number of methylation sites selected in the individual X_k.
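Eqs. (7)–(9) amount to a perceptron-style weight update followed by a fitness score that rewards accuracy and penalizes module size. The sketch below is a minimal illustration, not the authors' code: `D` is the training matrix, `individual` is a boolean mask over features, targets `t` are coded as ±1, and the scaling of the Order(X_k) penalty (`order_weight`) is our assumption, since the text does not state how the two terms of Eq. (9) are balanced.

```python
import numpy as np

def update_weights(w, D, t, eta=0.01):
    """One epoch of the update in Eq. (7): w_i <- w_i + eta * (t_j - f(D_j)) * v_ji."""
    for j in range(D.shape[0]):
        f_j = 1 if np.dot(w, D[j]) > 0 else -1      # prediction of Eq. (8)
        w = w + eta * (t[j] - f_j) * D[j]
    return w

def fitness(individual, D, t, n_epochs=50, eta=0.01, order_weight=1e-3):
    """Fitness of Eq. (9): training accuracy minus a penalty on module size."""
    X = D[:, individual]                 # restrict to the sites in the module
    w = np.zeros(X.shape[1])
    for _ in range(n_epochs):
        w = update_weights(w, X, t, eta)
    preds = np.where(X @ w > 0, 1, -1)
    acc = np.mean(preds == t)
    order = int(individual.sum())        # Order(X_k): number of selected sites
    return acc - order_weight * order
```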

D. Dataset
High-throughput DNA methylation profiles of large genomic regions can be produced by both array and NGS technologies. We applied our approach to these two types of datasets. The array data were generated by the Illumina Infinium 27k Human DNA Methylation BeadChip for surveying genome-wide DNA methylation profiles in breast cancer and normal samples [35]. We downloaded the dataset from Gene Expression Omnibus accession number GSE32393 and removed the samples with missing values. Sequence-based datasets were produced by MethylCap-seq in matched normal and colorectal cancer samples and collected at GSE39068 [36]. Normalization and preprocessing were carried out using the approaches detailed by Simmer et al. [36]. The DNA methylation levels of the two datasets were represented as beta-values, which are bounded between 0 (unmethylated) and 1 (totally methylated).

III. Results

A. DNA Methylation Module Associated with Breast Cancer
This analysis was carried out on DNA methylation profiling datasets that experimentally measured methylation statuses using the DNA Methylation BeadChip [35]. We extracted DNA methylation profiles on chromosome 17 from breast cancer and normal samples. The data used in our experiment consist of 99 samples in total (82 cancer and 17 normal) with 1,587 features. Figure 2 shows the learning curves of the evolutionary process. The fitness value improved as the number of generations increased. We introduced a term in the fitness function for the number of methylation sites in order to find individuals of shorter length;

therefore, the order decreased during the learning process (Figure 2(b)). After convergence, six sites were selected for the discrimination. These six sites were related to the genes KIAA1267, CD79B, ALOX12, TMEM98, KRT19, and FOXJ1 (Table 1). ALOX12 has a role in the growth of breast cancer, and its inhibition may be a strategy for inhibiting tumor growth [37]. The gene can also be used as a serum marker for breast cancer [38].

FIGURE 2 Learning curves using the breast cancer datasets. The x-axis is the number of generations and the y-axis shows (a) fitness values and (b) the number of methylation sites.

TABLE 1 Finally Selected Methylation Sites.
ID          POSITION   GENE      CGI LOCATION
cg02301815  41605268   KIAA1267  41605074-41605445
cg07973967  59363339   CD79B     25467633-25468370
cg08946332  6840612    ALOX12    6839463-6841283
cg11833861  28279748   TMEM98    28278827-28279833
cg16585619  36938776   KRT19     No CGI*
cg24164563  71647990   FOXJ1     71647419-71649480
*This site is not located within a CGI.

In addition, it has been reported that hypermethylation of ALOX12 is associated with cancer [39]–[42]. Indeed, the ALOX12 gene is closely related to apoptosis, and alterations in its expression caused by DNA methylation can cause a malfunction in cell death [43]–[45]. Therefore, it is reasonable to hypothesize that a change of methylation in this gene is linked to cancer, including breast tumors. KRT19 is a well-known marker for breast cancer [46], [47], and the KRT19 promoter can be aberrantly methylated in cancer cell lines [48]. The CD79B gene has also been shown to be related to breast cancer in several studies [49], [50]. FOXJ1, a member of the forkhead box (FOX) family, may function as a tumor suppressor gene in breast cancer [51]; FOXJ1 is hypermethylated and silenced in breast cancer cell lines [52]. TMEM98 is a transmembrane protein. Recently, Grimm et al. investigated transmembrane proteins specific to cancer cells and showed that such proteins can be targets for antibodies and may serve as biomarkers for tumor diagnosis, prognosis, and treatment [53]. The function of KIAA1267 is as yet unclear, but this gene encodes KAT8 regulatory NSL complex subunit 1, and KAT8 regulates p53, a tumor suppressor gene [54], [55]. Our results suggest that KIAA1267 can also have a role in breast cancer.
To verify that our method produces good classification performance in general, we calculated the classification performance after randomly separating the original dataset into training and test datasets. Table 2 shows the average accuracy, sensitivity, and specificity over 20 repetitions of random splitting, measured by conventional classification algorithms.

TABLE 2 Classification Performance by Splitting Training and Test Data.
ALGORITHM            ACCURACY  SENSITIVITY  SPECIFICITY
Logistic Regression  0.947     0.919        0.768
SVM                  0.908     0.975        0.476
Decision Tree        0.928     0.894        0.768
Naive Bayes          0.935     0.928        0.772

TABLE 3 Classification Performance Using the Selected Sites and Randomly Selected Sites.
ALGORITHM            FEATURE*  ACCURACY  SENSITIVITY  SPECIFICITY
Logistic Regression  Selected  0.939     0.987        0.762
                     f = 5     0.834     0.968        0.191
                     f = 6     0.839     0.967        0.224
                     f = 10    0.855     0.949        0.405
                     f = 20    0.893     0.950        0.621
SVM                  Selected  0.929     0.941        0.857
                     f = 5     0.829     0.999        0.008
                     f = 6     0.830     0.998        0.018
                     f = 10    0.833     0.995        0.054
                     f = 20    0.867     0.986        0.304
Decision Tree        Selected  0.939     0.952        0.867
                     f = 5     0.822     0.936        0.269
                     f = 6     0.822     0.930        0.302
                     f = 10    0.826     0.908        0.431
                     f = 20    0.849     0.910        0.555
Naive Bayes          Selected  0.919     0.951        0.765
                     f = 5     0.774     0.817        0.568
                     f = 6     0.769     0.802        0.613
                     f = 10    0.795     0.804        0.753
                     f = 20    0.837     0.843        0.810
*In the Feature column, f is the number of randomly selected sites, and "Selected" means the sites selected by our method.
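The protocol behind Tables 2 and 3 is repeated random train/test splitting with accuracy, sensitivity, and specificity. A sketch of that evaluation is given below, assuming scikit-learn, a feature matrix X restricted to the chosen sites, and labels y with 1 = cancer and 0 = normal; logistic regression is used here purely as one of the comparison classifiers, not as the paper's exact setup.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix

def repeated_split_performance(X, y, n_repeats=20, test_size=0.3, seed=0):
    """Average accuracy, sensitivity, and specificity over repeated random splits."""
    rng = np.random.RandomState(seed)
    acc, sens, spec = [], [], []
    for _ in range(n_repeats):
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, test_size=test_size, stratify=y,
            random_state=rng.randint(1 << 30))
        clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
        tn, fp, fn, tp = confusion_matrix(y_te, clf.predict(X_te)).ravel()
        acc.append((tp + tn) / (tp + tn + fp + fn))
        sens.append(tp / (tp + fn))       # sensitivity = recall on cancer class
        spec.append(tn / (tn + fp))       # specificity = recall on normal class
    return np.mean(acc), np.mean(sens), np.mean(spec)
```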



Our algorithm showed good classification results even on the independent test set. For further verification, we randomly extracted methylation sites, with 100 repetitions, and then measured the classification performance on each dataset by 10-fold cross-validation. Table 3 shows that our method produced better results than the others, regardless of the number of randomly selected sites. In particular, the specificity obtained using the sites selected by our method was much better than the others, even though the original data were highly imbalanced.

B. Modules Associated with Colorectal Cancer Using High-Throughput Sequencing Data
Recently, high-throughput sequencing technologies have been used to determine DNA methylation profiles. We applied our method to the sequencing-based methylation profile datasets produced by Simmer et al. [36].
The experiments were carried out using 25 cancer and 25 normal samples with 10,393 genomic regions on chromosome 17. Figure 3 depicts the improvement of the fitness in the iterative learning procedure using these datasets; finally, 348 regions were selected to discriminate colorectal cancer and normal samples after convergence. Table 4 shows the average classification performance by 10-fold cross-validation using the selected sites.

FIGURE 3 Learning curve using the colorectal cancer datasets. The x-axis is the number of generations and the y-axis is the fitness value.

TABLE 4 Classification Performance Using Only the Selected Sites in Colorectal Cancer.
ALGORITHM            ACCURACY  SENSITIVITY  SPECIFICITY
Logistic Regression  0.900     0.920        0.880
SVM                  0.940     0.960        0.920
Decision Tree        0.640     0.680        0.600
Naive Bayes          0.900     0.920        0.880

We annotated the 348 selected regions using GPAT [56] and investigated which genes were located close to the selected regions. We determined which genes were enriched within KEGG pathways, using the genes whose transcription start sites are located within 5,000 bp of the selected genomic regions [57], [58]. Table 5 summarizes the significantly enriched pathways with low p-values and shows that most of these are closely associated with cancer-related networks. Note that the enriched signaling pathways were related to colorectal cancer.

TABLE 5 Enriched Gene Sets in Colorectal Cancer Data.
GENE SET                                    p-VALUE   FDR q-VALUE
Non-small cell lung cancer                  2.61e-05  4.25e-03
Glioma                                      4.56e-05  4.25e-03
Neurotrophin signaling pathway              3.25e-04  1.85e-02
Pathways in cancer                          3.99e-04  1.85e-02
Wnt signaling pathway                       5.52e-04  2.05e-02
Aldosterone-regulated sodium reabsorption   9.09e-04  2.22e-02
Endocytosis                                 9.62e-04  2.22e-02
Vasopressin-regulated water reabsorption    9.97e-04  2.22e-02
Chemokine signaling pathway                 1.07e-03  2.22e-02
Focal adhesion                              1.26e-03  2.34e-02
Endometrial cancer                          1.39e-03  2.35e-02
Basal cell carcinoma                        1.55e-03  2.41e-02
Colorectal cancer                           1.97e-03  2.73e-02
Pancreatic cancer                           2.50e-03  2.73e-02
Melanoma                                    2.57e-03  2.73e-02
Chronic myeloid leukemia                    2.72e-03  2.73e-02
Cytokine-cytokine receptor interaction      2.82e-03  2.73e-02
MAPK signaling pathway                      2.82e-03  2.73e-02
Phosphatidylinositol signaling system       2.94e-03  2.73e-02
VEGF signaling pathway                      2.94e-03  2.73e-02
Fc epsilon RI signaling pathway             3.17e-03  2.81e-02
Small cell lung cancer                      3.58e-03  2.98e-02
ErbB signaling pathway                      3.83e-03  2.98e-02
Apoptosis                                   3.92e-03  2.98e-02
Prostate cancer                             4.01e-03  2.98e-02
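The enrichment values in Table 5 were computed with the gene-set tools of [57], [58]. As an illustration only, the core over-representation test for a single pathway can be approximated by a hypergeometric test, as in this sketch; the gene identifiers and pathway sets are placeholders, not data from this study.

```python
from scipy.stats import hypergeom

def pathway_enrichment(selected_genes, pathway_genes, background_genes):
    """Hypergeometric p-value for over-representation of one pathway among the
    genes located near the selected methylated regions."""
    background = set(background_genes)
    selected = set(selected_genes) & background
    pathway = set(pathway_genes) & background
    overlap = len(selected & pathway)
    M = len(background)        # population size (all annotated genes)
    n = len(pathway)           # genes belonging to the pathway
    N = len(selected)          # genes near the selected regions
    # P(X >= overlap) under the hypergeometric null
    return hypergeom.sf(overlap - 1, M, n, N)
```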



In colorectal cancer, the roles of the Wnt signaling pathway and the MAPK signaling pathway have been studied intensively [59]–[62]. Genetic mutations affecting the pathway components and the alteration of their expression can enhance tumorigenicity in cancer cells. In addition, the neurotrophin signaling pathway could be related to the growth of colorectal cancer cells [63], and the chemokine signaling pathway suppresses colorectal cancer metastasis [64], [65]. The phosphatidylinositol signaling pathway plays an important role in the growth, survival, and metabolism of cancer cells, and targeting this pathway has the potential to lead to treatments for colorectal cancer [66], [67]. VEGF and ErbB may be valid therapeutic targets for patients with colorectal cancer [68]–[71].

IV. Discussion and Conclusion
DNA methylation may be significantly associated with complex diseases, and many genomic regions are differentially methylated in various cancers compared to normal samples. In this study, we presented a method to identify combinatorial effects of DNA methylation at multiple sites. From a systematic perspective, the relationship between DNA methylation regions and a specific disease is learned by the presented probabilistic evolutionary learning method. The fitness value of a DNA methylation module measures the level of its response to the cancer. From a computational viewpoint, our method can solve problems with a large number of features by identifying modules that are both compact and provide high coverage of cancer-related genes. Applying our method to breast cancer and colorectal cancer data produced by high-throughput technologies, we detected cancer-related modules that were confirmed by the literature and by functional enrichment analysis. Interestingly, we observed that the selected regions were located around genes that are significantly enriched in cancer-related gene set categories, which provides evidence that the modules identified in our study are biologically meaningful.
Moreover, from the results for the array-based dataset, we could obtain good accuracy with a very small number of random features. However, the specificity was very low in the experiments with random features. This result suggests that our method can produce well-balanced classification performance even on a highly imbalanced dataset, whereas conventional classifiers do not work well under imbalanced circumstances. Also, in the second experiment, using the NGS-based dataset with a large number of features and a small sample size, our method could find informative DNA methylation sites with good classification performance, even though the decision tree, which requires each value to be discretized, showed relatively lower results.
Studies on DNA methylation could reveal the process of tumorigenesis as well as identify biomarkers. Our approach, which identifies multiple DNA methylation sites that might be epigenetically regulated, could provide a useful strategy to detect epigenetic associations related to cancer. By applying our method to array- and NGS-based data, we showed that it is applicable to a variety of data types and various disease contexts. Moreover, recent studies suggest a complex relationship between genetic variation and DNA methylation, and systems genetics and epigenetics approaches are required to examine these relationships. Although our framework is based on DNA methylation profile datasets, it could be used to identify the combinatorial association of various factors, including gene expression levels, microRNAs, copy number variations, genetic variations, and environmental factors. The integration of a variety of data would provide the basis for new hypotheses and experimental approaches in the modeling of a complex disease. Moreover, the systematic identification of causal factors and modules would provide insights into the mechanisms underlying complex diseases and help to develop efficient therapies or effective drugs.
In summary, we presented a method for searching for higher-order interactions of DNA methylation sites using a probabilistic evolutionary learning method. We also examined the potential for combined effects of various sites on the genome. The results suggest that the alteration of DNA methylation at multiple sites affects cancer. Similar to genome-wide association studies, our approach provides an opportunity to capture the complex and multifactorial relationships among DNA methylation sites and to find new factors for future study. Therefore, our approach would facilitate a comprehensive analysis of genome-wide DNA methylation datasets and help interpret the effects of DNA methylation at multiple sites.

Acknowledgment
This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Ministry of Science and ICT, Republic of Korea (grant nos. NRF-2015R1C1A1A01053824, NRF-2018R1C1B6005304, NRF-2016R1D1A1B03935676, and NRF-2018R1D1A1B07050393).

References
[1] P. A. Jones and S. B. Baylin, "The epigenomics of cancer," Cell, vol. 128, no. 4, pp. 683–692, 2007.
[2] B. Sadikovic, K. Al-Romaih, J. Squire, and M. Zielenska, "Cause and consequences of genetic and epigenetic alterations in human cancer," Current Genomics, vol. 9, no. 6, pp. 394–408, 2008.
[3] A. E. Handel, G. C. Ebers, and S. V. Ramagopalan, "Epigenetics: Molecular mechanisms and implications for disease," Trends Mol. Med., vol. 16, no. 1, pp. 7–16, 2010.
[4] J. Sandoval and M. Esteller, "Cancer epigenomics: Beyond genomics," Current Opinion Genetics Develop., vol. 22, no. 1, pp. 50–55, 2012.
[5] L. Bonetta, "Epigenomics: Detailed analysis," Nature, vol. 454, pp. 795–798, 2008.
[6] K. Robertson, "DNA methylation and human disease," Nature Rev. Genetics, vol. 6, pp. 597–610, 2005.
[7] A. Portela and M. Esteller, "Epigenetic modifications and human disease," Nature Biotechnol., vol. 28, no. 10, pp. 1057–1068, 2010.
[8] P. Jones, "Functions of DNA methylation: Islands, start sites, gene bodies and beyond," Nature Rev. Genetics, vol. 13, no. 7, pp. 484–492, 2012.
[9] P. Laird, "Principles and challenges of genomewide DNA methylation analysis," Nature Rev. Genetics, vol. 11, no. 3, pp. 191–203, 2010.
[10] V. K. Hill, C. Ricketts, I. Bieche, S. Vacher, D. Gentle, C. Lewis, E. R. Maher, and F. Latif, "Genome-wide DNA methylation profiling of CpG islands in breast cancer identifies novel genes associated with tumorigenicity," Cancer Res., vol. 71, no. 8, pp. 2988–2999, 2011.
[11] J.-K. Rhee, K. Kim, H. Chae, J. Evans, P. Yan, B.-T. Zhang, J. Gray, P. Spellman, T. H.-M. Huang, K. P. Nephew, and S. Kim, "Integrated analysis of genome-wide DNA methylation and gene expression profiles in molecular subtypes of breast cancer," Nucl. Acids Res., vol. 41, no. 18, pp. 8464–8474, 2013.
[12] H. Cheung, T. Lee, A. Davis, D. Taft, O. Rennert, and W. Chan, "Genome-wide DNA methylation profiling reveals novel epigenetically regulated genes and non-coding RNAs in human testicular cancer," Br. J. Cancer, vol. 102, no. 2, pp. 419–427, 2010.
[13] G. Toperoff, D. Aran, J. D. Kark, M. Rosenberg, T. Dubnikov, B. Nissan, J. Wainstein, Y. Friedlander, E. Levy-Lahad, B. Glaser, and A. Hellman, "Genome-wide survey reveals predisposing diabetes type 2-related DNA methylation variations in human peripheral blood," Hum. Mol. Genetics, vol. 21, no. 2, pp. 371–383, 2012.
[14] B. A. Walker, C. P. Wardell, L. Chiecchio, E. M. Smith, K. D. Boyd, A. Neri, F. E. Davies, F. M. Ross, and G. J. Morgan, "Aberrant global methylation patterns affect the molecular pathogenesis and prognosis of multiple myeloma," Blood, vol. 117, no. 2, pp. 553–562, 2011.
[15] J. N. Hirschhorn and M. J. Daly, "Genome-wide association studies for common diseases and complex traits," Nature Rev. Genetics, vol. 6, no. 2, pp. 95–108, 2005.
[16] A. Janssens and C. van Duijn, "Genome-based prediction of common diseases: Advances and prospects," Hum. Mol. Genetics, vol. 17, pp. R166–R173, 2008.
[17] A. Kiezun, K. Garimella, R. Do, N. O. Stitziel, B. M. Neale, P. J. McLaren, N. Gupta, P. Sklar, P. F. Sullivan, J. L. Moran, C. M. Hultman, P. Lichtenstein, P. Magnusson, T. Lehner, Y. Y. Shugart, A. L. Price, P. I. de Bakker, S. M. Purcell, and S. R. Sunyaev, "Exome sequencing and the genetic basis of complex traits," Nature Genetics, vol. 44, no. 6, pp. 623–630, 2012.
[18] H. Easwaran, S. Johnstone, L. Van Neste, J. Ohm, T. Mosbruger, Q. Wang, M. Aryee, P. Joyce, N. Ahuja, D. Weisenberger, E. Collisson, J. Zhu, S. Yegnasubramanian, W. Matsui, and S. Baylin, "A DNA hypermethylation module for the stem/progenitor cell signature of cancer," Genome Res., vol. 22, pp. 837–849, 2012.
[19] S. Horvath, Y. Zhang, P. Langfelder, R. Kahn, M. Boks, K. van Eijk, L. van den Berg, and R. Ophoff, "Aging effects on DNA methylation modules in human brain and blood tissue," Genome Biol., vol. 13, no. 10, p. R97, 2012.
[20] J. Zhang and K. Huang, "Pan-cancer analysis of frequent DNA co-methylation patterns reveals consistent epigenetic landscape changes in multiple cancers," BMC Genomics, vol. 18, no. 1, p. 1045, 2017.
[21] M. Kumar, M. Husian, N. Upreti, and D. Gupta, "Genetic algorithm: Review and application," Int. J. Inf. Technol. Knowl. Manage., vol. 2, no. 2, pp. 451–454, 2010.
[22] K. Deb and R. Datta, "A fast and accurate solution of constrained optimization problems using a hybrid bi-objective and penalty function approach," in Proc. IEEE Congr. Evolutionary Computation, 2010, pp. 1–8.
[23] J.-G. Joung, S.-J. Kim, S.-Y. Shin, and B.-T. Zhang, "A probabilistic coevolutionary biclustering algorithm for discovering coherent patterns in gene expression dataset," BMC Bioinform., vol. 13, no. Suppl 17, p. S12, 2012.
[24] R. Wang, R. C. Purshouse, and P. J. Fleming, "On finding well-spread Pareto optimal solutions by preference-inspired co-evolutionary algorithm," in Proc. 15th Annu. Conf. Genetic and Evolutionary Computation, New York, NY, 2013, pp. 695–702.
[25] S.-J. Kim, J.-W. Ha, and B.-T. Zhang, "Bayesian evolutionary hypergraph learning for predicting cancer clinical outcomes," J. Biomed. Inform., vol. 49, pp. 101–111, 2014.
[26] T. Chen, P. Lehre, K. Tang, and X. Yao, "When is an estimation of distribution algorithm better than an evolutionary algorithm?" in Proc. IEEE Congr. Evolutionary Computation, 2009, pp. 1470–1477.
[27] A. Zhou, Q. Zhang, and Y. Jin, "Approximating the set of Pareto-optimal solutions in both the decision and objective spaces by an estimation of distribution algorithm," IEEE Trans. Evol. Comput., vol. 13, no. 5, pp. 1167–1189, 2009.
[28] V. Shim, K. Tan, J. Chia, and A. Al Mamun, "Multi-objective optimization with estimation of distribution algorithm in a noisy environment," Evol. Comput., vol. 21, no. 1, pp. 149–177, 2013.
[29] J. Ceberio, E. Irurozqui, A. Mendiburu, and J. Lozano, "A distance-based ranking model estimation of distribution algorithm for the flowshop scheduling problem," IEEE Trans. Evol. Comput., vol. 18, no. 2, pp. 286–300, 2014.
[30] S. Pal, S. Bandyopadhyay, and S. Ray, "Evolutionary computation in bioinformatics: A review," IEEE Trans. Syst., Man, Cybern. C, Appl. Rev., vol. 36, no. 5, pp. 601–615, 2006.
[31] R. Santana, A. Mendiburu, N. Zaitlen, E. Eskin, and J. Lozano, "Multi-marker tagging single nucleotide polymorphism selection using estimation of distribution algorithms," Artif. Intell. Med., vol. 50, no. 3, pp. 193–201, 2010.
[32] K. Shelke, S. Jayaraman, S. Ghosh, and J. Valadi, "Hybrid feature selection and peptide binding affinity prediction using an EDA based algorithm," in Proc. IEEE Congr. Evolutionary Computation, 2013, pp. 2384–2389.
[33] M. Pelikan, "Implementation of the dependency-tree estimation of distribution algorithm in C++," 2006.
[34] M. Pelikan, S. Tsutsui, and R. Kalapala, "Dependency trees, permutations, and quadratic assignment problem," in Proc. 9th Annu. Conf. Genetic and Evolutionary Computation, New York, NY, 2007, pp. 629–629.
[35] J. Zhuang, A. Jones, S.-H. Lee, E. Ng, H. Fiegl, M. Zikan, D. Cibula, A. Sargent, H. B. Salvesen, I. J. Jacobs, H. C. Kitchener, A. E. Teschendorff, and M. Widschwendter, "The dynamics and prognostic potential of DNA methylation changes at stem cell gene loci in women's cancer," PLoS Genetics, vol. 8, no. 2, p. e1002517, 2012.
[36] F. Simmer, A. Brinkman, Y. Assenov, F. Matarese, A. Kaan, L. Sabatino, A. Villanueva, D. Huertas, M. Esteller, T. Lengauer, C. Bock, V. Colantuoni, L. Altucci, and H. Stunnenberg, "Comparative genome-wide DNA methylation analysis of colorectal tumor and matched normal tissues," Epigenetics, vol. 7, no. 12, pp. 1355–1367, 2012.
[37] A. Kumar Singh, R. Singh, F. Naz, S. S. Chauhan, A. Dinda, A. A. Shukla, K. Gill, V. Kapoor, and S. Dey, "Structure based design and synthesis of peptide inhibitor of human LOX-12: In vitro and in vivo analysis of a novel therapeutic agent for breast cancer," PLoS One, vol. 7, no. 2, p. e32521, 2012.
[38] A. Singh, S. Kant, R. Parshad, N. Banerjee, and S. Dey, "Evaluation of human LOX-12 as a serum marker for breast cancer," Biochem. Biophys. Res. Commun., vol. 414, no. 2, pp. 304–308, 2011.
[39] A. C. Tan, A. Jimeno, S. H. Lin, J. Wheelhouse, F. Chan, A. Solomon, N. Rajeshkumar, B. Rubio-Viqueira, and M. Hidalgo, "Characterizing DNA methylation patterns in pancreatic cancer genome," Mol. Oncol., vol. 3, no. 5, pp. 425–438, 2009.
[40] S. Alvarez, J. Suela, A. Valencia, A. Fernández, M. Wunderlich, X. Agirre, F. Prósper, J. I. Martín-Subero, A. Maiques, F. Acquadro, S. Rodriguez Perales, M. J. Calasanz, J. Roman-Gómez, R. Siebert, J. C. Mulloy, J. Cervera, M. A. Sanz, M. Esteller, and J. C. Cigudosa, "DNA methylation profiles and their relationship with cytogenetic status in adult acute myeloid leukemia," PLoS One, vol. 5, no. 8, p. e12197, 2010.
[41] O. Ammerpohl, J. Pratschke, C. Schafmayer, A. Haake, W. Faber, O. von Kampen, M. Brosch, B. Sipos, W. von Schönfels, K. Balschun, C. Röcken, A. Arlt, B. Schniewind, J. Grauholm, H. Kalthoff, P. Neuhaus, F. Stickel, S. Schreiber, T. Becker, R. Siebert, and J. Hampe, "Distinct DNA methylation patterns in cirrhotic liver and hepatocellular carcinoma," Int. J. Cancer, vol. 130, no. 6, pp. 1319–1328, 2012.
[42] R. S. Ohgami, L. Ma, L. Ren, O. K. Weinberg, M. Seetharam, J. R. Gotlib, and D. A. Arber, "DNA methylation analysis of ALOX12 and GSTM1 in acute myeloid leukaemia identifies prognostically significant groups," Br. J. Haematol., vol. 159, no. 2, pp. 182–190, 2012.
[43] X.-Z. Ding, C. A. Kuszynski, T. H. El-Metwally, and T. E. Adrian, "Lipoxygenase inhibition induced apoptosis, morphological changes, and carbonic anhydrase expression in human pancreatic cancer cells," Biochem. Biophys. Res. Commun., vol. 266, no. 2, pp. 392–399, 1999.
[44] G. P. Pidgeon, M. Kandouz, A. Meram, and K. V. Honn, "Mechanisms controlling cell cycle arrest and induction of apoptosis after 12-lipoxygenase inhibition in prostate cancer cells," Cancer Res., vol. 62, no. 9, pp. 2721–2727, 2002.
[45] G. P. Pidgeon, K. Tang, Y. L. Cai, E. Piasentin, and K. V. Honn, "Overexpression of platelet-type 12-lipoxygenase promotes tumor cell survival by enhancing αvβ3 and αvβ5 integrin expression," Cancer Res., vol. 63, no. 14, pp. 4258–4267, 2003.
[46] A. Ring, I. E. Smith, and M. Dowsett, "Circulating tumour cells in breast cancer," Lancet Oncol., vol. 5, no. 2, pp. 79–88, 2004.
[47] M. Lacroix, "Significance, detection and markers of disseminated breast cancer cells," Endocrine-Relat. Cancer, vol. 13, no. 4, pp. 1033–1067, 2006.
[48] M. Morris, D. Gentle, M. Abdulrahman, N. Clarke, M. Brown, T. Kishida, M. Yao, B. Teh, F. Latif, and E. R. Maher, "Functional epigenomics approach to identify methylated candidate tumour suppressor genes in renal cell carcinoma," Br. J. Cancer, vol. 98, no. 2, pp. 496–501, 2008.
[49] R. Ellsworth, C. Heckman, J. Seebach, L. Field, B. Love, J. Hooke, and C. Shriver, "Identification of a gene expression breast cancer nodal metastasis profile," J. Clin. Oncol., vol. 26, no. 15 Suppl, p. 1022, 2008.
[50] A. Prat, J. S. Parker, O. Karginova, C. Fan, C. Livasy, J. I. Herschkowitz, X. He, and C. M. Perou, "Phenotypic and molecular characterization of the claudin-low intrinsic subtype of breast cancer," Breast Cancer Res., vol. 12, no. 5, p. R68, 2010.
[51] B. C. Jackson, C. Carpenter, D. W. Nebert, and V. Vasiliou, "Update of human and mouse forkhead box (FOX) gene families," Hum. Genomics, vol. 4, pp. 345–352, 2010.
[52] B. Demircan, L. M. Dyer, M. Gerace, E. K. Lobenhofer, K. D. Robertson, and K. D. Brown, "Comparative epigenomics of human and mouse mammary tumors," Genes Chromosomes Cancer, vol. 48, no. 1, pp. 83–97, 2009.
[53] D. Grimm, J. Bauer, J. Pietsch, M. Infanger, J. Eucker, C. Eilles, and J. Schoenberger, "Diagnostic and therapeutic use of membrane proteins in cancer cells," Current Med. Chem., vol. 18, no. 2, pp. 176–190, 2011.
[54] X. Li, L. Wu, C. A. S. Corsa, S. Kunkel, and Y. Dou, "Two mammalian MOF complexes regulate transcription activation by distinct mechanisms," Mol. Cell, vol. 36, no. 2, pp. 290–301, 2009.
[55] S. Zhang, X. Liu, Y. Zhang, Y. Cheng, and Y. Li, "RNAi screening identifies KAT8 as a key molecule important for cancer cell survival," Int. J. Clin. Exp. Pathol., vol. 6, no. 5, pp. 870–877, 2013.
[56] A. Krebs, M. Frontini, and L. Tora, "GPAT: Retrieval of genomic annotation from large genomic position datasets," BMC Bioinform., vol. 9, no. 1, p. 533, 2008.
[57] A. Subramanian, P. Tamayo, V. K. Mootha, S. Mukherjee, B. L. Ebert, M. A. Gillette, A. Paulovich, S. L. Pomeroy, T. R. Golub, E. S. Lander, and J. P. Mesirov, "Gene set enrichment analysis: A knowledge-based approach for interpreting genome-wide expression profiles," Proc. Natl. Acad. Sci. USA, vol. 102, no. 43, pp. 15545–15550, 2005.
[58] A. Liberzon, A. Subramanian, R. Pinchback, H. Thorvaldsdóttir, P. Tamayo, and J. P. Mesirov, "Molecular signatures database (MSigDB) 3.0," Bioinformatics, vol. 27, no. 12, pp. 1739–1740, 2011.
[59] E. Å. Jansson, A. Are, G. Greicius, I.-C. Kuo, D. Kelly, V. Arulampalam, and S. Pettersson, "The Wnt/β-catenin signaling pathway targets PPARγ activity in colon cancer cells," Proc. Natl. Acad. Sci. USA, vol. 102, no. 5, pp. 1460–1465, 2005.
[60] S. Segditsas and I. Tomlinson, "Colorectal cancer and genetic alterations in the WNT pathway," Oncogene, vol. 25, no. 57, pp. 7531–7537, 2006.
[61] J. Y. Fang and B. C. Richardson, "The MAPK signalling pathways and colorectal cancer," Lancet Oncol., vol. 6, no. 5, pp. 322–327, 2005.
[62] M. L. Slattery, A. Lundgreen, and R. K. Wolff, "MAP kinase genes and colon and rectal cancer," Carcinogenesis, vol. 33, no. 12, pp. 2398–2408, 2012.
[63] H. Akil, A. Perraud, C. Mélin, M.-O. Jauberteau, and M. Mathonnet, "Fine-tuning roles of endogenous brain-derived neurotrophic factor, TrkB and sortilin in colorectal cancer cell survival," PLoS One, vol. 6, no. 9, p. e25097, 2011.
[64] T. Kitamura, T. Fujishita, P. Loetscher, L. Revesz, H. Hashida, S. Kizaka-Kondoh, M. Aoki, and M. M. Taketo, "Inactivation of chemokine (C-C motif) receptor 1 (CCR1) suppresses colon cancer liver metastasis by blocking accumulation of immature myeloid cells in a mouse model," Proc. Natl. Acad. Sci. USA, vol. 107, no. 29, pp. 13063–13068, 2010.
[65] H. J. Chen, R. Edwards, S. Tucci, P. Bu, J. Milsom, S. Lee, W. Edelmann, Z. H. Gümüs, X. Shen, and S. Lipkin, "Chemokine 25-induced signaling suppresses colon cancer invasion and metastasis," J. Clin. Invest., vol. 122, no. 9, pp. 3184–3196, 2012.
[66] D. W. Parsons, T.-L. Wang, Y. Samuels, A. Bardelli, J. M. Cummins, L. DeLong, N. Silliman, J. Ptak, S. Szabo, J. K. Willson, S. Markowitz, K. W. Kinzler, B. Vogelstein, C. Lengauer, and V. E. Velculescu, "Colorectal cancer: Mutations in a signalling pathway," Nature, vol. 436, no. 7052, pp. 792–792, 2005.
[67] T. Yuan and L. Cantley, "PI3K pathway alterations in cancer: Variations on a theme," Oncogene, vol. 27, no. 41, pp. 5497–5510, 2008.
[68] L. M. Ellis and D. J. Hicklin, "VEGF-targeted therapy: Mechanisms of anti-tumour activity," Nature Rev. Cancer, vol. 8, no. 8, pp. 579–591, 2008.
[69] T. Winder and H.-J. Lenz, "Vascular endothelial growth factor and epidermal growth factor signaling pathways as therapeutic targets for colorectal cancer," Gastroenterology, vol. 138, no. 6, pp. 2163–2176, 2010.
[70] R. Roskoski, Jr., "The ERBB/HER receptor protein-tyrosine kinases and cancer," Biochem. Biophys. Res. Commun., vol. 319, no. 1, pp. 1–11, 2004.
[71] J. Spano, R. Fagard, J.-C. Soria, O. Rixe, D. Khayat, and G. Milano, "Epidermal growth factor receptor signaling in colorectal cancer: Preclinical data and therapeutic perspectives," Ann. Oncol., vol. 16, no. 2, pp. 189–194, 2005.



Augmentation of Physician Assessments with Multi-Omics Enhances Predictability of Drug Response: A Case Study of Major Depressive Disorder
Arjun Athreya and Ravishankar Iyer
Department of Electrical and Computer Engineering, Univ. of Illinois at Urbana-Champaign, IL, USA

Drew Neavin, Liewei Wang and Richard Weinshilboum


Department of Molecular Pharmacology and Experimental Therapeutics, Mayo Clinic, MN, USA

Rima Kaddurah-Daouk and John Rush


Department of Psychiatry and Behavioral Sciences, Duke University, NC, USA

Mark Frye
Department of Psychiatry and Psychology, Mayo Clinic, MN, USA

William Bobo
Department of Psychiatry and Psychology, Mayo Clinic, FL, USA



Abstract—This work proposes a "learning-augmented clinical assessment" workflow to sequentially augment physician assessments of patients' symptoms and their socio-demographic measures with heterogeneous biological measures to accurately predict treatment outcomes using machine learning. Across many psychiatric illnesses, ranging from major depressive disorder to schizophrenia, symptom severity assessments are subjective and do not include biological measures, making predictability in eventual treatment outcomes a challenge. Using data from the Mayo Clinic PGRN-AMPS SSRI trial as a case study, this work demonstrates a significant improvement in the prediction accuracy for antidepressant treatment outcomes in patients with major depressive disorder, from 35% to 80% individualized by patient, compared to using only a physician's assessment as the predictors. This improvement is achieved through an iterative overlay of biological measures, starting with metabolites (blood measures modulated by drug action) associated with symptom severity, and then adding in genes associated with metabolomic concentrations. Hence, therapeutic efficacy for a new patient can be assessed prior to treatment, using prediction models that take as inputs selected biological measures and physicians' assessments of depression severity. Of broader significance extending beyond psychiatry, the approach presented in this work can potentially be applied to predicting treatment outcomes for other medical conditions, such as migraine headaches or rheumatoid arthritis, for which patients are treated according to subject-reported assessments of symptom severity.

I. Introduction
In diseases that are characterized by the complex phenotypes (traits) listed in Table 1, such as psychiatric disorders, inflammatory diseases, and migraines, therapeutic/treatment decisions are primarily based on the subject-reported/physician-rated severity of symptoms (which are an example of complex phenotypes/traits) in conjunction with standard social/demographic factors. The ability of these measures to predict therapeutic success is slightly better than chance [1], [2], and is largely limited by the lack of biological measures that reflect the underlying molecular mechanisms of therapeutic agents (e.g., drugs) and could therefore potentially serve as stronger predictors of therapeutic outcomes. The key contribution of this work in addressing that limitation is a "learning-augmented clinical assessment" workflow to sequentially augment physicians' assessments of subject-specific ratings of symptoms with heterogeneous biological measures (such as metabolomics and genomics) to significantly enhance the predictability of drug treatment outcomes, as shown in Fig. 1. As a case study, the workflow proposed in this work demonstrates improved predictability in antidepressant treatment outcomes of patients with major depressive disorder (MDD) by using biological measures (metabolomics and genomics) derived from peripheral blood to augment the severity measures and other sociodemographic factors currently used in clinical practice as predictor variables.

TABLE 1 Efforts in integrating multiple measures to predict clinical outcomes in diseases characterized by complex phenotypes.

DISEASE: MDD. PRIOR WORK: Chekroud et al. [1], Iniesta et al. [2]. PREDICTOR VARIABLES: clinical measures (non-biomarkers): yes; non-drug biomarkers: no; drug-based biomarkers: no. COMMENT: Ability to establish cross-trial prediction of clinical outcomes using only clinical measures.

DISEASE: MDD. PRIOR WORK: This work. PREDICTOR VARIABLES: clinical measures: yes; non-drug biomarkers: targeted metabolomics and genomics; drug-based biomarkers: targeted metabolomics and genomics. COMMENT: Augmenting existing clinical measures with functionally validated biomarkers associated with disease pathophysiology or drug mechanisms improves predictability in treatment outcomes.

DISEASE: MDD. PRIOR WORK: Redlich et al. [3]. PREDICTOR VARIABLES: clinical measures: yes; non-drug biomarkers: magnetic resonance imaging (MRI); drug-based biomarkers: no. COMMENT: Establishing the ability of imaging data to predict clinical outcomes.

DISEASE: Schizophrenia. PRIOR WORK: Koutsouleris et al. [4]. PREDICTOR VARIABLES: clinical measures: yes; non-drug biomarkers: no; drug-based biomarkers: no. COMMENT: Ability to establish cross-trial prediction of clinical outcomes using only clinical measures.

DISEASE: Bipolar disorder. PRIOR WORK: Tighe et al. [5]. PREDICTOR VARIABLES: clinical measures: yes; non-drug biomarkers: functional MRI; drug-based biomarkers: no. COMMENT: Establishing the ability of imaging and transcriptomics data to predict clinical outcomes.

DISEASE: Rheumatoid arthritis. PRIOR WORK: Wijbrandts et al. [6]. PREDICTOR VARIABLES: clinical measures: yes; non-drug biomarkers: no; drug-based biomarkers: targeted transcriptomics. COMMENT: Ability to predict treatment outcomes using clinical measures and transcriptome variations associated with outcomes, but biomarkers need replication and functional validation.


FIGURE 1 The proposed analyses to establish predictability of clinical outcomes at eight weeks (trial patient stratification by increasing symptom severity at baseline, four weeks, and eight weeks, followed by antidepressant treatment outcome prediction from genomics and metabolomics measures).

This improvement in predictive accuracy of treatment outcomes motivates the need for developing antidepressant-specific prediction models, so that the choice of antidepressant can be based on the highest likelihood of remission of depressive symptoms. Choosing antidepressants that maximize therapeutic success marks a major shift from the current "try and wait" approach, which often requires multiple trials of antidepressants before patients achieve remission of their depressive symptoms.
To demonstrate the improved predictability in treatment outcomes, the workflow was developed using data from the clinical trial of the Mayo Clinic Pharmacogenomics Research Network Antidepressant Medical Pharmacogenomic Study (Mayo PGRN-AMPS) [7], which is the largest single-center selective serotonin reuptake inhibitor (SSRI) trial that has been conducted in the United States. 603 patients completed the trial. They were administered citalopram/escitalopram (commonly prescribed SSRIs) for eight weeks, and psychiatric assessments of depression severity at baseline (pre-treatment), four weeks, and eight weeks were conducted by a clinician using the Quick Inventory of Depressive Symptomatology (QIDS-C). In this trial, biological measures for 290 of the 603 patients included genome-wide association study (GWAS) genotype data that, after imputation, included approximately 7 million



single-nucleotide polymorphisms (SNPs) (G), and plasma metabolomic concentrations (B in Table 2) for 31 metabolites (M) from patients at three time-points of the trial (baseline, four weeks, and eight weeks). Through augmentation of those biological measures with psychiatric assessments and sociodemographic factors as predictor variables, the prediction accuracy of antidepressant treatment outcomes in MDD patients improved from 35% to 80% relative to the use of clinical measures alone as the predictor variables.

TABLE 2 Data (D = [S : C : B]).
Total patients: 603. Men: 222 in total, 99 with omics. Women: 381 in total, 191 with omics.
Social and demographic data (S), collected only at baseline: age (in years); body mass index (BMI, in kg/m2); depression in {parents, siblings, children}; bipolar disorder in {parents, siblings, children}; alcohol abuse by {parents, siblings, children}; drug abuse by {parents, siblings, children}; seasonal pattern in symptom occurrence; history of psychotherapy.
Depressive severity assessment (C): clinician-rated Quick Inventory of Depressive Symptomatology (QIDS-C) questionnaire (16 questions); QIDS-C total score.
Biological data (B): M, 31 metabolites from the HPLC LCECA platform; G, 7 million single-nucleotide polymorphism genotypes.

The formalism for integrating multiple biological measures in this case study is as follows and is illustrated in Fig. 2. Just as tumor subtypes serve as a foundation for integrating biological measures in oncology, our formalism first established patient subtypes/stratification C by using mixture-model-based unsupervised learning techniques. In the first layer of overlaying the biological measures, a set of metabolites m ∈ M were identified based on significant associations of their concentrations with symptom severity in the previously inferred patient stratification. In the second layer of the overlay of biological measures, in what is referred to as a metabolomics-informed-genomics approach, we used GWAS to identify SNPs g ∈ G that are associated with concentrations of the metabolites comprising m. Through iterative overlaying of biological measures, starting with metabolites (blood measures reflecting drug action) associated with depressive severity and then adding in the genes associated with metabolomic concentrations, the biological measures became more closely associated with the molecular mechanisms of antidepressant response. Finally, out of the more than 7 million possible predictor variables, the proposed approach identified about 65 predictor variables that comprised (1) SNPs (g) identified by the GWAS based on metabolomic concentrations, (2) metabolites (m) whose concentrations are significantly associated with depression severity in patient clusters, and (3) clinical measures (as shown in Table 2). Thus we made the size of the predictor data computationally tractable to predict clinical outcomes ŷ by using supervised learning methods F(m, g, S, C, y), where y is the treatment outcome labels of the training data.

FIGURE 2 The proposed approach to integrating multiple omics (metabolomics and genomics) measures: formulation of multi-omic integration, patient stratification, metabolomic association with symptom severity in patient clusters, SNP associations with metabolomic concentrations, and prediction of antidepressant treatment outcomes.
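The overlay just described reduces to two filters and one supervised model: keep metabolites whose association with severity passes the Bonferroni threshold 0.05/|M|, keep SNPs whose GWAS p-value against those metabolites passes 1E-06, and fit F over the reduced predictor set. The sketch below mirrors only those stated thresholds; the association test (Pearson correlation here), data layout, and classifier are stand-ins of ours, not the authors' implementation.

```python
import numpy as np
from scipy.stats import pearsonr
from sklearn.ensemble import GradientBoostingClassifier

def select_metabolites(metab, severity):
    """Keep metabolites whose correlation with severity passes 0.05/|M| (Bonferroni)."""
    n_metab = metab.shape[1]
    keep = []
    for i in range(n_metab):
        _, p = pearsonr(metab[:, i], severity)
        if p <= 0.05 / n_metab:
            keep.append(i)
    return keep

def select_snps(gwas_pvalues, threshold=1e-6):
    """Keep SNPs whose GWAS p-value against the selected metabolites passes 1E-06.

    `gwas_pvalues` is assumed to be a dict mapping SNP id -> p-value, produced by
    an external GWAS of the selected metabolite concentrations."""
    return [snp for snp, p in gwas_pvalues.items() if p <= threshold]

def fit_outcome_model(clinical, metab_selected, snp_selected, outcome):
    """Supervised model over the reduced predictor set [S : C : m : g]; a GBM is
    used here only as an example classifier."""
    X = np.hstack([clinical, metab_selected, snp_selected])   # same row order assumed
    return GradientBoostingClassifier().fit(X, outcome)
```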
The overarching significance of this work is suggested by the success of analogous "precision medicine" approaches in breast cancer therapeutics. Today, treatment strategies for each breast cancer patient are tailored to the tumor's specific molecular characteristics. That successful approach is facilitated by the close association of the phenotype (which is the molecular characteristics of the tumor, such as whether it is estrogen-receptor-positive (ER-positive), human-epidermal-growth-factor-receptor-2-amplified (HER2-amplified), and/or triple-negative) with a set of biomarkers, such as hormone receptors (e.g., ER), genes, and their SNPs, which, when taken together, can be prognostic of treatment outcomes [8]. However, in the study of treatment outcomes in patients with MDD (as for other diseases with complex phenotypes), some interesting key observations can be made. First, GWAS have often failed to associate SNPs with complex and non-binary phenotypes defined as, for example, "Did patients achieve a 50% reduction in baseline



patient-reported/clinician-recorded symptoms?" As a consequence, it is acknowledged that methods of integrating widely heterogeneous biological measures without a priori biological knowledge become computationally intractable as the number of study variables increases to the order of millions [9], [10]. Second, the predictability of antidepressant treatment outcomes when clinical measures alone are used is at best slightly better than chance [1], [2], [11]–[15]. Third, antidepressant medications such as selective serotonin reuptake inhibitors (SSRIs) are the standard of care for drug therapy in adults with MDD, but less than half of patients have favorable outcomes from this treatment [11]. In light of these observations, if learning techniques could more accurately predict treatment outcomes in patients with MDD by integrating a few biological measures prognostic of antidepressant treatment's success with routine clinical measures, the impact would be far-reaching, because MDD affects over 350 million patients worldwide and is expected to be the leading cause of disabilities globally by 2030 [16]–[18].
The approach proposed in this work will also have biological significance and clinical utility through further methodological innovations. For example, clusters inferred in this work can serve as the basis for using probabilistic graphical models to study the longitudinal behavior of depressive symptoms during antidepressant treatment; we have previously referred to such behavior as symptom dynamics [19]. From a clinical perspective, this method can be applied to other medical conditions as shown in Table 1, such as rheumatoid arthritis and migraine headaches, for which patients are subtyped by the degree of swelling of their joints (which does not directly reflect a specific mechanistic biomarker), and by the pain ratings reported by patients using a scale similar to the QIDS-C scale that is used in this work to rate depressive symptoms [20]–[22]. Furthermore, the approach proposed in this work could also be used to augment biological measures with clinical measures in other psychiatric conditions, such as bipolar disorder and schizophrenia.

II. Related Work
Multi-omic integration has been broadly pursued in the context of precision medicine, for which patients are subtyped by the underlying molecular characteristics of their disease.

A. Subtyping in Precision Medicine
Complexity in diseases is due to the manifestation of factors from both within and outside the genome. Alterations in many different molecular pathways may lead to similar disease phenotypes [23]. Therefore, where possible, disease subtypes can be defined by identifying different biological mechanisms that result in the same phenotypes. Such subtypes may present unique predispositions to benefit from different therapeutic options [24]. Hence, for the success of precision medicine initiatives for any disease, it is necessary to map therapeutic options to subtypes (phenotypes) by using molecular mechanisms of the therapeutic agent [24]. Until two decades ago, all breast cancers were treated in the same way, and the survival rate was much lower than it is today. However, with growing access to omics data from trials across the globe, it is now possible for breast oncologists to identify different kinds of tumor based on molecular characteristics, e.g., estrogen-receptor-positive (ER-positive), human-epidermal-growth-factor-receptor-2-amplified (HER2-amplified), or triple-negative. Today, therapeutic options for breast cancer are tailored to those observed molecular characteristics of the tumor. As a result, the survival rate among patients with breast cancer has increased significantly [8]. Similar approaches have been taken for other aspects of oncology and also for autism, and are currently being explored for neurodegenerative diseases such as Alzheimer's disease [25], [26].

B. Multi-omics Integration
Integration of multi-modal biological data (multi-omics data) has been proposed in the context of breast cancer, diabetes, glioblastoma, and other diseases [9], [27]–[31]. For example, imaging data have been combined with gene expression and patient demographics data to better predict the prevalence of cancer [31]. Such work has looked at identifying unique biological signatures or biomarkers from each of the types of static data, and then building a function that linearly or nonlinearly combines the data into a rule-based decision system [9]. Previous efforts have either used a sequential approach to integrate omics measures (assuming causal relationships) [32], [33] or used simultaneous approaches when studying gene-gene interactions [34]. Another class of methods for integrating data has used network-based approaches that build on correlation analyses. Finally, based on knowledge of prior distributions in omics measures, Bayesian and knowledge-boosting approaches, in both network and network-free settings, have been useful for incrementally growing the interactions between multi-omics measures [35]. For all these aforementioned approaches to be insightful and capable of handling the high-dimensional nature of multi-omics measures, a necessary condition is the availability of a priori biological knowledge of disease/patient subtypes or the phenotype [9], [10]. In the absence of this a priori biological knowledge, handling large volumes of data with millions of features (e.g., 7 million SNPs across the whole genome in each patient) is computationally intractable [9], [10].
In addressing the limitations of the aforementioned work, this work makes significant contributions towards the integration of multi-omics measures for diseases with complex phenotypes, using MDD as a case study.
1) Although this work has not subtyped MDD patients by their molecular characteristics, replication of patient stratification for patients across independent trials with statistically identical symptom severity profiles is an important step forward in the field of psychiatry. The ability to replicate stratification is important given the widely acknowledged heterogeneity in depressive symptom profiles and drug response phenotypes [16]–[18].
2) We assert that the stratification established in this work serves to overlay multiple biological measures that could potentially improve predictability of treatment outcomes.


3) We demonstrate that this approach to integrating multi-omics measures provides mechanistic insights into drug response that can be experimentally established in clinical laboratories.

III. Data
The Mayo PGRN-AMPS trial (NCT 00613470) was designed to assess the clinical outcomes of adults (aged 18–84 years) with non-psychotic MDD after four and eight weeks of open-label treatment with citalopram or escitalopram, and to examine the metabolomic and genomic factors associated with those outcomes [7]. Subjects were recruited from primary and specialty care settings in and near Rochester, MN, from March 2005 to May 2013. All psychiatric diagnoses were confirmed at the screening visit using modules of the Structured Clinical Interview for DSM-IV (SCID) administered by trained clinical research staff. The data D = [S : C : B] analyzed in this work comprise social and demographic variables (S), clinical measures (C), and biological measures (B) and are listed in Table 2. The social and demographic data (S) were assessed only at baseline. The treatment outcomes were established using the 16-item, clinician-rated version of the Quick Inventory of Depressive Symptomatology (QIDS-C) at baseline, four weeks, and eight weeks; the results comprise the clinical data C, which include the responses to the 16 QIDS-C questions and the total QIDS-C score of the symptom severity [36]. Biological measures for 290 of the 603 patients in this trial included GWAS genotype data that, after imputation, included approximately 7 million SNPs (G), and plasma metabolomic concentrations (B in Table 2) for 31 metabolites (M) taken from patients at three time-points of the trial (baseline, four weeks, and eight weeks). Samples were assayed on a high-performance liquid chromatography (HPLC) electrochemical coulometric array (LCECA) platform to obtain standardized measures of the concentrations of the metabolites.

Clinical Definitions
Response is defined as a 50% reduction in baseline symptoms as measured at four weeks or eight weeks. If the total QIDS-C score measured at eight weeks is less than or equal to 5, then the patient is said to have achieved remission.
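These two outcome labels are simple functions of the clinician-rated QIDS-C totals; a literal encoding (ours, for illustration only):

```python
def response_and_remission(qids_baseline, qids_week8, qids_week4=None):
    """Response: >= 50% drop from baseline at four or eight weeks.
    Remission: total QIDS-C score at eight weeks <= 5."""
    followups = [qids_week8] + ([qids_week4] if qids_week4 is not None else [])
    response = any(score <= 0.5 * qids_baseline for score in followups)
    remission = qids_week8 <= 5
    return response, remission
```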
IV. Patient Stratification and Omics Association

A. Patient Stratification
Precision medicine in practice optimizes therapeutic options based on patient "subtypes" defined by specific biological/clinical characteristics. Currently, there are no established mechanisms by which patients with MDD are subtyped/stratified at baseline (prior to treatment) or during treatment, although at the end-point of the trial patients are triaged into remitters, responders without remission, and non-responders. In the absence of a predefined stratification of patients, as previously demonstrated in detail in our earlier work [19], we use unsupervised learning to algorithmically infer patient clusters. Based on existing knowledge of gender differences in response to antidepressants [37], the patients were stratified separately by gender.

Observation
The p-value from the Shapiro-Wilk test of the total QIDS-C score, from all three time-points of the trial and in both men and women, was less than the significance level (α = 0.05). This means that the symptom severity scores were not normally distributed, as we rejected the null hypothesis of the Shapiro-Wilk test (i.e., that the data are normally distributed).

Approach
The fact that the levels of symptom severity are not normally distributed meant that the k-means clustering algorithm would not be suitable here. Without loss of generality, under the assumption that the data (x, the total QIDS-C score, which is the sum of a group of individual depression items in the QIDS-C scale) were distributed as a mixture of Gaussians (modeled using a Gaussian mixture model (GMM)), we developed the patient stratification workflow (Algorithm 1). Starting with the assumption that the data have at least two components in the GMM, we used the expectation-maximization (EM) algorithm to estimate the sufficient-statistics parameters of the Gaussian components (mean μ and variance σ²) of the GMM, as shown in Fig. 3(a). 10,000 samples were randomly drawn from the inferred distributions (generateSamples). Next, the Kolmogorov-Smirnov test (ks.test) was used to test whether the distribution of the generated data was statistically similar to that of the original data. If the p-value (p) was less than the significance level (α = 0.05), we rejected the null hypothesis that the two distributions were similar. If that happened, the number of components was increased by one, and the similarity of the two distributions was tested again. Once we obtained the minimum number of components K in the GMM for which the generated and input data's distributions were similar, K clusters C = {C_k; ∀k ∈ 1:K}, ordered by the increasing means (μ_k) of the components, were the outputs of the workflow [38].

Algorithm 1 Patient Stratification.
Input: x ← total QIDS-C scores
1:  k ← 2
2:  C ← ∅
3:  α ← 0.05
4:  p ← 0
5:  while p ≤ α do
6:      {μ, σ²} ← EM(x, k)
7:      x′ ← generateSamples(μ, σ²)
8:      p ← ks.test(x, x′)
9:      if p > significanceLevel then
10:         C ← gmmCluster(μ, σ²)
11:     end if
12:     k ← k + 1
13: end while
Output: C
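Algorithm 1 can be prototyped with standard tools: EM for the GMM, sampling from the fitted mixture, and a two-sample Kolmogorov-Smirnov test, followed by assigning each patient to the most probable component. The sketch below is our reading of the algorithm, assuming scikit-learn and SciPy; it is not the authors' code, and x is the vector of total QIDS-C scores.

```python
import numpy as np
from scipy.stats import ks_2samp
from sklearn.mixture import GaussianMixture

def stratify_patients(x, alpha=0.05, k_start=2, k_max=10, n_samples=10_000, seed=0):
    """Increase the number of GMM components until samples drawn from the fitted
    mixture are statistically indistinguishable from the data (KS test), then
    assign each patient to the most probable component, with clusters ordered
    by increasing component mean."""
    x = np.asarray(x, dtype=float).reshape(-1, 1)
    for k in range(k_start, k_max + 1):
        gmm = GaussianMixture(n_components=k, random_state=seed).fit(x)
        x_gen, _ = gmm.sample(n_samples)
        p = ks_2samp(x.ravel(), x_gen.ravel()).pvalue
        if p > alpha:                          # generated and input distributions agree
            labels = gmm.predict(x)            # most probable component per patient
            order = np.argsort(gmm.means_.ravel())
            relabel = {old: new for new, old in enumerate(order)}
            return np.array([relabel[l] for l in labels]), gmm
    raise RuntimeError("no k <= k_max reproduced the data distribution")
```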



FIGURE 3 Estimating the parameters of a Gaussian mixture model for identifying clusters in the data. (a) The inference of the mixture components comprising the distribution of baseline total QIDS-C symptom severity scores. (b) Distribution of symptom severity within the clusters inferred using the sufficient statistics of the components inferred in (a).

Patients were assigned to the component C that maximized the likelihood L(x) given the component's sufficient statistics (gmmCluster), as illustrated in Fig. 3(b) and described as

C = argmax_{k ∈ [1:K]} L_k(x), where L_k(x) = N(x; μ_k, σ_k²).   (1)

Results
At each time-point t ∈ {b (baseline), f (four weeks), e (eight weeks)}, we found three clusters of men and women by using the proposed stratification process. Clusters at the baseline are C_b = {C_1^b, C_2^b, C_3^b}, at four weeks are C_f = {C_1^f, C_2^f, C_3^f}, and at eight weeks are C_e = {C_1^e, C_2^e, C_3^e}.
The clinical value of the clustering behavior is that C_1^e in both men and women captures all patients who achieved remission at the end of eight weeks. Furthermore, C_2^e in both men and women comprised patients who demonstrated response but did not achieve remission. Finally, patients in C_3^f (both men and women) did not exhibit response or achieve remission. The same workflow demonstrated identical patient stratification in the Sequenced Treatment Alternatives to Relieve Depression (STAR*D) trial [39]; i.e., the Kolmogorov-Smirnov test for the symptom severity scores between clusters of similar average symptom severity had a p-value > 0.8 [19]. From the analytics perspective, the significance of the replication of patient stratification in two independent clinical trials is that the clustering behavior followed the existing definitions of clinical outcomes in psychiatry.

B. Omics Associations
Baseline concentrations of key metabolites m = {m ∈ M : p.value[m, x(C)] ≤ 0.05/|M|}, such as serotonin, kynurenine, tryptophan, tyrosine, and paraxanthine, were significantly correlated with depression severity in all three clusters at eight weeks. These correlations were biologically significant because they have been associated with MDD treatment and response, as these metabolites are related to the monoamine neurotransmitter pathways [40], [41]. Furthermore, SNPs g = {g ∈ G : p.value[GWAS(m)] ≤ 1E-06} in the TSPAN5 (rs10516436), AHR (rs17137566), ERICH3 (rs696692), and DEFB1 (rs5743467, rs2741130, rs2702877) genes have been associated through use of GWAS with concentrations of kynurenine and serotonin [42], [43]. The associations of these biological measures laid the foundation for assessing whether they could improve the predictability of clinical outcomes when combined with traditional clinical, social, and demographic variables.
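The Bonferroni-corrected correlation screen described above can be written in a few lines; this is an illustrative sketch under our own assumptions (a pandas DataFrame of metabolite concentrations and a Series of eight-week severity scores, both hypothetical names), not the study pipeline.

# Keep metabolites whose correlation with symptom severity survives alpha / |M|.
import pandas as pd
from scipy.stats import pearsonr

def bonferroni_screen(metabolites: pd.DataFrame, severity: pd.Series, alpha: float = 0.05):
    threshold = alpha / metabolites.shape[1]
    kept = {}
    for name, series in metabolites.items():
        r, p = pearsonr(series, severity)
        if p <= threshold:
            kept[name] = (r, p)
    return kept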
V. Using Baseline Data to Predict Clinical Outcomes
A recent publication proposed a prediction model that uses elastic-net regularization for feature selection and a gradient boosting machine (GBM) for classification, but only using baseline social, demographic, and clinical data [S:C] from the STAR*D trial [1]. While their prediction accuracies were better than chance, the authors acknowledged the limitations of their work, which suggests that it might be worthwhile to study whether the addition of baseline biological measures together with the social, demographic, and clinical data would increase the predictability of the clinical outcomes. With access to metabolomics and genomics data in a smaller cohort of the Mayo PGRN-AMPS trial, we set out to answer the following questions (illustrated in Fig. 4):
1) Would augmenting social, demographic, and clinical data [S:C] with metabolomics data improve the prediction accuracies of treatment outcomes over using only social, demographic, and clinical data [S:C] as predictor variables?



2) Would augmenting social, demographic, and clinical data [S:C] with metabolomics and genomics data improve the prediction accuracies of treatment outcomes over using only social, demographic, and clinical data [S:C] as predictor variables?
3) If the predictions improved as a result of augmenting existing clinical measures with biological measures, how many of the top predictors were biological measures?

Feature Selection and Choice of Classifiers
Three classes of classifiers are used in this work, including kernel, linear, and ensemble methods. For predicting outcomes using baseline clinical, social, demographic, and metabolomic data, we used support vector machines with linear kernels (SVM-Linear) and support vector machines that use radial-basis kernels (SVM-RBF) as kernel methods [44]; a generalized linear model (GLM) as a linear method [45]; and GBMs as an ensemble method [46]. As the creators of those methods have indicated, each of those broader types has its own merits, mathematical nuances, and complexities, and all of them have been used in other classification applications, such as in Kaggle competitions [47]. To use all of the omics and clinical, social, and demographic data to predict outcomes, we used nonparametric classifiers such as SVM-RBF and random forests, as they are better suited to handling correlated features [48] and have been used in predicting treatment outcomes in other psychiatric diseases such as schizophrenia.

In addition to elastic-net regularization, recursive feature elimination (i.e., a wrapper method) was also used for the GLM and GBM classifiers; that made it possible to estimate the model performance not only by optimizing the parameters of the model, but also by searching for the right set of predictor variables. Based on our datasets, the prediction performance did not vary significantly with or without the use of any of the feature selection methods; the prediction accuracy remained within 4%. This observation could also be due in part to the reasonably small number of predictor variables.

To minimize the effects of overfitting and information leak, nested cross-validation (nested-CV) with five repeats was used to train the classifiers. In each repeat, data were randomized, and the nested-CV comprised an outer loop and an inner loop. The outer loop used a fivefold cross-validation to split the data into training data (80% of the data) and testing data (the remaining 20%). The inner loop used the training data to train the classifier by using a tenfold cross-validation, and the trained classifier was tested on the testing data. To minimize the effects of class imbalance (i.e., unequal numbers of responders (60%) and non-responders (40%)) in the training data, we used the synthetic minority over-sampling technique (SMOTE) algorithm [49], which simulated patient profiles of the under-sampled class and up-sampled that class to ensure that the two classes had equal sizes. Prediction performance was reported using several metrics (AUC, sensitivity, and specificity), and the statistical significance of the classifier's accuracy was established using the null information rate (NIR, the prevalence of the class with the largest number of samples), which served as a proxy for chance.
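The nested-CV protocol can be outlined as follows. This is an illustrative sketch under our own assumptions (binary labels, an RBF-SVM as a stand-in classifier, imbalanced-learn's SMOTE inside the training folds), not the study code; X, y, and param_grid are placeholders.

# Nested CV: 5 repeats of an outer 5-fold split with an inner 10-fold grid search,
# with SMOTE applied only to the training folds so oversampling never leaks into the test fold.
import numpy as np
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.svm import SVC

def nested_cv_auc(X, y, param_grid, n_repeats=5, seed=0):
    aucs = []
    for r in range(n_repeats):
        outer = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed + r)
        inner = StratifiedKFold(n_splits=10, shuffle=True, random_state=seed + r)
        model = Pipeline([("smote", SMOTE(random_state=seed)),
                          ("clf", SVC(kernel="rbf", probability=True))])
        search = GridSearchCV(model, param_grid, cv=inner, scoring="roc_auc")
        aucs.extend(cross_val_score(search, X, y, cv=outer, scoring="roc_auc"))
    return float(np.mean(aucs))

For example, param_grid = {"clf__C": [0.1, 1, 10]} would tune the SVM's regularization in the inner loop; running the function once with X = [S:C] and once with X = [S:C:B] gives the with/without-biological-measures comparison described below.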

FIGURE 4 The proposed analyses to establish improved predictability in antidepressant treatment outcomes by augmenting the clinicians' assessments with biological measures. (The diagram shows clinical assessments (depression severity, social and demographic data, patient history), metabolomics, and genomics feeding predictive models, with the question of whether accuracy improves at each augmentation.)



Training with and Without Biological Measures
In order to quantitatively assess the benefit of adding biological measures to predict outcomes ŷ, we trained classifiers F(m, g, S, C, y) using (1) baseline clinical data that included only social and demographic data, X = [S:C]; (2) all baseline data (including metabolomics and genomics data), X = [S:C:B], where B = [m, g]; and (3) training labels y for treatment outcomes. Metabolites (m) whose baseline concentrations were correlated with the symptom severity at eight weeks, and SNPs (g) associated with their concentrations, were then normalized along with clinical data in order to train the chosen supervised learning methods. It is important to note that several other researchers have proposed the combination of other modalities of biological data [28]-[30], but it remains to be explored whether combination becomes less effective when patient-reported data are used, since there is considerable heterogeneity in subject-reported measures. Therefore, to the best of our knowledge, this is the first time that quantified biological measures comprising metabolomics and genomics measures have been integrated for analyses with the clinical measures of psychiatric assessments that comprise demographic data and patient-provided responses to symptom questionnaires (such as QIDS-C). For all the classifiers, we compared the AUC, in addition to the generalized prediction accuracies, to see whether the same model's predictive ability improved with the addition of metabolomics data. Further, if the predictability improved, we extracted the top five predictors of the model that provided the best balance of accuracy and AUC to see whether the top predictors were dominated by the metabolomics.

A. Results
As shown in Table 3, for both men and women and for both the outcomes response and remission, there was a 30% improvement in the overall accuracy and corresponding AUC. The highlighted columns in Table 3 indicate the best-performing models with the metabolomics data included; four out of the top five predictors are metabolites, indicating that their addition to the prediction model likely explains the increase in the predictability of the outcomes. As shown in Table 4, there was a further improvement of at least 5% in the AUC and corresponding accuracy when genomics data were integrated with the metabolomics, clinical, social, and demographic data. We have two observations about the inclusion of biological measures in all these predictions. First, the top predictors of outcomes when biological measures were used were different in men and women, likely pointing to different biological mechanisms determining how men and women respond to the same antidepressant. Second, except for the variable seasonal pattern and the involvement item in the QIDS-C scale, no other clinical/demographic measures were predictive of outcomes. Finally, it is biologically significant that many of the top predictor metabolites identified in this work are known to be correlated with mood in the behavioral sciences, which has additional promising implications, as discussed next.

VI. Broader Significance
The approach presented in this work provides opportunities for further methodological extensions and innovations that will allow for longitudinal analyses of therapeutic agents (such as the analysis done for antidepressants in this work). Our work also has broader clinical significance based on knowledge gained through the analyses conducted in this work.

A. Methodological Extensions
Patient stratification across the time-points of a clinical trial could serve as the intelligence needed to build a probabilistic graph in which stratification could be used as the nodes of the graph, so that one can study the longitudinal effects of therapeutic agents during treatment. By using the proportions of patients who traverse between clusters of consecutive time-points (transition probabilities), we can use optimizations such as the Viterbi algorithm or forward algorithm to establish the "most likely" paths that patients will traverse during the treatment. Our previous work has demonstrated this capability by using factor graphs to formalize the depressive symptom severity data during the antidepressant treatment's timeline, and then using the forward algorithm to establish the "most likely" paths (referred to as symptom dynamics) that patients will traverse based on changes in their depression severity during treatment [19].
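To make the transition-probability view concrete, the following sketch estimates a cluster-to-cluster transition matrix from consecutive time-points and ranks whole trajectories by likelihood. It is our own minimal illustration (hypothetical variable names, brute-force scoring rather than the factor-graph formulation of [19]).

# Estimate transition probabilities between clusters and rank trajectories.
import numpy as np
from itertools import product

def transition_matrix(labels_t, labels_t1, k=3):
    """Row-normalized counts of patients moving from cluster i at time t to j at t+1.
    Assumes every cluster at time t is non-empty."""
    counts = np.zeros((k, k))
    for i, j in zip(labels_t, labels_t1):
        counts[i, j] += 1
    return counts / counts.sum(axis=1, keepdims=True)

def most_likely_paths(start_probs, transitions, top=3):
    """Score all cluster trajectories; tractable for 3 clusters and 3 visits."""
    k = len(start_probs)
    scored = []
    for path in product(range(k), repeat=len(transitions) + 1):
        p = start_probs[path[0]]
        for step, (a, b) in enumerate(zip(path[:-1], path[1:])):
            p *= transitions[step][a, b]
        scored.append((p, path))
    return sorted(scored, reverse=True)[:top]

Here transitions would be, for example, the two matrices estimated from baseline-to-four-week and four-week-to-eight-week cluster labels.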
same model’s predictive ability improved with the addition of
metabolomics data. Further, if the predictability improved, we B. In Psychiatry
extracted the top five predictors of the model that provided the Patient stratification conforming to known psychiatric definitions
best balance of accuracy and AUC to see whether the top pre- of outcomes, and replicated in both the Mayo PGRN-AMPS
dictors were dominated by the metabolomics. and STAR*D trials, provided the first show of confidence in our
ability to model symptom responses to citalopram/escitalopram
A. Results (antidepressant) treatment in depressed patients. Our previous
As shown in Table 3, for both men and women and for both work demonstrated its biological significance through its
the outcomes response and remission, there was a 30% improve- improvement of predictability by integrating metabolomics data
ment in the overall accuracy and corresponding AUC. The with clinical measures, because metabolites, such as serotonin and
highlighted columns in Table 3 indicate the best-performing kynurenine, were among the top predictors [19]. That develop-
models with the metabolomics data included; four out of the ment was important because for decades, the treatment of MDD
top five predictors are metabolites, indicating that their addition has focused on biogenic amine neurotransmitter pathways, i.e.,
to the prediction model likely explains the increase in the pre- the synthesis and metabolism of catecholamines (such as norepi-
dictability of the outcomes. As shown in Table 4, there was a nephrine) and indoleamines (such as serotonin) [40], [41]. Fur-
further improvement of at least 5% in the AUC and corre- thermore, the existing body of knowledge fits well with the
sponding accuracy when genomics data were integrated with findings of our study; note that the metabolites listed in Table 3
the metabolomics, clinical, social, and demographic data. We include serotonin (5HT) itself as well as two metabolites from the
have two observations about the inclusion of biological mea- competing tryptophan metabolism pathway (KYN and 3OHKYN)
sures in all these predictions. First, the top predictors of out- and the major catecholamine metabolite (MHPG), which are
comes when biological measures were used were different in known to play a role in behavior.
men and women, likely pointing to different biological mecha- The addition of genomics data with metabolomics and
nisms determining how men and women respond to the same clinical measures as predictor variables has further improved the
antidepressant. Second, except for the variable seasonal pattern predictability of antidepressant outcomes, as shown in Table 4.
and the involvement item in the QIDS-C scale, no other clini- This result raises the question of whether the genes associated
cal/demographic measures were predictive of outcomes. Finally, with serotonin and kynurenine are also extending their effects
it is biologically significant that many of the top predictor in other metabolites used in this study through other mecha-
metabolites identified in this work are known to be correlated nisms. The improvement in predictability should motivate



TABLE 3 Clinical outcome prediction performance with metabolomics data in the Mayo Clinic PGRN-AMPS trial. Expansions of the abbreviations of the top predictors are as follows: ATOCO is (+)-alpha-tocopherol; URIC is uric acid; QIDSC-1 is sleep-onset insomnia [36]; KYN is kynurenine; 3OHKY is 3-hydroxykynurenine; AMTRP is alpha-methyltryptophan; I3PA is indole-3-propionic acid; GTOCO3 is (+)-gamma-tocopherol (redox state #3); 5HT is serotonin; MHPG is methoxy-hydroxyphenyl glycol; MET is methionine; QIDS-13 is involvement [36]; HGA is homogentisic acid; PARAXAN is 1,7-dimethylxanthine. In each block, the first four model columns use clinical data only and the last four use clinical and metabolomics data.

MEN, RESPONSE (top predictors: ATOCO, URIC, QIDS-1, KYN, 3OHKY)
Model            SVM-RBF  SVM-Linear  GLM    GBM    | SVM-RBF  SVM-Linear  GLM    GBM
Accuracy (%)     28.2     32          52     40     | 48       48          64     48
Sensitivity (%)  0        16.67       16.67  33.33  | 33.33    33.33       50     33.33
Specificity (%)  53.5     46.15       84.62  46.15  | 61.54    61.54       61.54  61.54
AUC              0.64     0.60        0.63   0.54   | 0.53     0.53        0.68   0.5

MEN, REMISSION (top predictors: AMTRP, I3PA, drug dosage, GTOCO3, 5HT)
Model            SVM-RBF  SVM-Linear  GLM    GBM    | SVM-RBF  SVM-Linear  GLM    GBM
Accuracy (%)     28       44          44     48     | 64       68          64     45.65
Sensitivity (%)  38.46    38          53.85  46.15  | 76.52    76          76.92  65.22
Specificity (%)  16.67    50          33.33  50     | 50       50          50     26.09
AUC              0.8      0.6         0.67   0.6    | 0.76     0.78        0.62   0.6

WOMEN, RESPONSE (top predictors: seasonal pattern, 5HT, MHPG, MET, QIDS-13)
Model            SVM-RBF  SVM-Linear  GLM    GBM    | SVM-RBF  SVM-Linear  GLM    GBM
Accuracy (%)     52.08    52.08       54.17  50     | 41.3     72.33       64.58  41.67
Sensitivity (%)  18.18    18.18       27.27  18.18  | 34.78    18.18       36.36  0
Specificity (%)  80.72    80.76       76.92  76.9   | 47.83    92.83       88.46  76
AUC              0.60     0.59        0.63   0.63   | 0.69     0.74        0.68   0.51

WOMEN, REMISSION (top predictors: 5HT, HGA, 3OHKY, seasonal pattern, PARAXAN)
Model            SVM-RBF  SVM-Linear  GLM    GBM    | SVM-RBF  SVM-Linear  GLM    GBM
Accuracy (%)     34.78    50          45.65  36.96  | 41.3     54.33       52.17  45.65
Sensitivity (%)  26.09    65.22       56.52  47.83  | 34.78    56.52       76.92  65.22
Specificity (%)  43.48    34.78       34.78  26.09  | 47.83    52.17       50     26.09
AUC              0.64     0.52        0.58   0.58   | 0.56     0.53        0.53   0.53

researchers and clinicians to collect more biological measures for psychiatric diseases other than major depressive disorder (e.g., bipolar disorder, schizophrenia, and various dementias) that would not only help subtype or stratify patients by their symptom severity profiles, but also combine biological characteristics that would enable treatment strategies closer to the kinds used in breast cancer therapeutics.

C. In Pharmacogenomics Research
Pharmacogenomics research focuses on understanding the interplay between drug effects and functions of the genome. In this context, we reflect on the improvements in breast cancer therapeutics wherein treatment selection is based on molecular characteristics of the tumor. In diseases such as major depressive disorder, for which such biologically based subtyping is not possible, the approach described in this work for stratifying patients using symptomatic characteristics is of immense value to associated pharmacogenomics research. In particular, trials could be designed in which multi-omics (metabolomics, genomics, etc.) and other biological measures (neuroimaging, electrophysiology, etc.) could be collected that help to establish biological associations with patients stratified using the proposed approach. Then, longitudinal effects of the drug on these biomarkers can be used to study why patients either respond well to the intervention or do not. Furthermore, as already demonstrated in this work, associations of biological markers with inferred patient stratification can provide improved predictability for treatment outcomes.



D. In Other Clinical Applications
From a biological mechanistic perspective, on one end of the clinical spectrum there are diseases such as breast cancer for which treatment is based on tumor subtypes. At the other end of the spectrum are neuropsychiatric diseases such as MDD for which subtyping of individual patients is possible based on their reported symptom severity, as demonstrated in this work. Between these two extremes are migraine headaches and inflammatory diseases, such as rheumatoid arthritis (RA), in which patients are subtyped by the degree of swelling of their joints (which does not directly reflect a specific mechanistic biomarker) and by pain ratings reported by the patients using validated scales that are similar to the QIDS-C scale used to rate depressive symptoms in this work [20]-[22]. Therefore, there is sufficient heterogeneity in RA symptoms to make treatment response phenotypes so complex that the methodological innovation presented in this work could be used to overlay biological measures that provide a significant mechanistic perspective. This approach could then be tested for the prediction of outcomes in response to the drug therapy for migraine headaches or RA, as additional examples of a possible broader application of the approach described here.

VII. Conclusions, Limitations, and Future Work
Use of data-driven approaches such as the one proposed in this work provides a way to identify and combine a small set of targeted biological measures to augment physician assessments to accurately predict clinical outcomes for new patients prior to initiating treatment. Use of such a small set of measures (as opposed to millions of variables, as with genome-wide genotype data) makes inference of novel biology, or accurate prediction of clinical outcomes in diseases with complex phenotypes, computationally tractable. Of further significance is the ability to replicate patient stratification in independent clinical trial datasets while also having clinical validity in the context of studying antidepressant response in patients with major depressive disorder. The stratification that serves as the basis for sequential overlaying of biological measures has not only improved the predictability of antidepressant outcomes, but also shown that top predictors have mechanistic associations with antidepressant response. These findings together motivate the use of this approach for other common diseases, such as rheumatoid arthritis or migraine headaches, for which a similar complexity in phenotype is seen, and will also motivate researchers and clinicians to collect additional biological measures for other psychiatric diseases for which the methods proposed in this work could identify novel mechanisms of therapeutic efficacy. Furthermore, the workflow could be further enhanced by considering other omics data such as transcriptomics and/or proteomics, in addition to profiling of the microbiome. While the work has shown promise in our ability to combine multiple biological measures, our future work will focus on establishing replicability of the findings in external datasets of MDD patients treated with citalopram/escitalopram and other antidepressants. Such replication in findings would represent a strong foundation for investigating biological factors associated with pathophysiology of this disease with heterogeneous disease states, and additional mechanisms of drug action.

TABLE 4 Clinical outcome prediction performance when clinical measures were combined with metabolomics and genomics data from the Mayo Clinic PGRN-AMPS trial. Expansions of the abbreviations of the top predictors are as follows: 5HT is serotonin; MHPG is methoxy-hydroxyphenyl glycol; MET is methionine; QIDSC-13 is involvement [36]; HGA is homogentisic acid; 3OHKY is 3-hydroxykynurenine; PARAXAN is 1,7-dimethylxanthine; ATOCO is (+)-alpha-tocopherol; URIC is uric acid; 4HPLA is 4-hydroxyphenyllactic acid; 4HBAC is 4-hydroxybenzoic acid; XAN is xanthine; 4HPAC is 4-hydroxyphenylacetic acid; CYS is cysteine; I3PA is indole-3-propionic acid; DEFB1_1 is DEFB1 SNP rs5743467.

OUTCOME: RESPONSE
                       Men                        Women
Model           SVM-RBF  Random forest     SVM-RBF  Random forest
Accuracy (%)    76       80                75       71
Sensitivity     0.66     0.83              0.5      0.6
Specificity     0.84     0.77              0.9      0.88
AUC             0.82     0.88              0.88     0.73
Top predictors  URIC, ATOCO, 3OHKY, I3PA, DEFB1_1 (men); MHPG, MET, 5HT, 4HPAC, XAN (women)

OUTCOME: REMISSION
                       Men                        Women
Model           SVM-RBF  Random forest     SVM-RBF  Random forest
Accuracy (%)    68.75    80                86.7     73.7
Sensitivity     0.63     0.9               0.73     0.78
Specificity     0.74     0.7               0.8      0.68
AUC             0.74     0.87              0.9      0.88
Top predictors  4HPLA, I3PA, 3OHKY, URIC, QIDSC-13 (men); PARAXAN, 5HT, 4HBAC, CYS, 3OHKY (women)

VIII. Acknowledgments
This material is based upon work partially supported by a Mayo Clinic and Illinois Alliance Fellowship for Technology-Based Healthcare Research; a CompGen Fellowship; an IBM Faculty Award; the National Science Foundation (NSF) under grant CNS 13-37732; the National Institutes of Health (NIH) under grants U19 GM61388, U54 GM114838, RO1 GM28157, R24 GM078233, T32 GM072474 and RC2 GM092729; and The Mayo Clinic Center for Individualized Medicine. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the NSF and NIH. We thank IBM,



Intel, and Xilinx for hardware donations, which were used to [22] M. Kosinski, M. Bayliss, J. Bjorner, J. Ware, W. Garber, A. Batenhorst, R. Cady, C.
Dahlöf, A. Dowson, and S. Tepper, “A six-item short-form survey for measuring headache
develop and test the methods. We thank Jenny Applequist for her impact: The hit-6,” Quality Life Res., vol. 12, no. 8, pp. 963–974, Dec. 2003.
help in preparing the manuscript. Finally, we thank the MDD [23] D. S. Charney and H. K. Manji, “Life stress, genes, and depression: Multiple pathways
lead to increased risk and new opportunities for intervention,” Sci. STKE, vol. 2004, no.
patients who participated in the PGRN-AMPS and STAR*D 225, p. re5, Mar. 2004.
SSRI study and the psychiatrists who cared for them. [24] R. Weinshilboum and L. Wang, “Pharmacogenomics: Bench to bedside,” Nature Rev.
Drug Discovery, vol. 3, no. 9, pp. 739–748, Sept. 2004.
[25] D. H. Geschwind and M. W. State, “Gene hunting in autism spectrum disorder: On
References the path to precision medicine,” Lancet Neurol., vol. 14, no. 11, pp. 1109–1120, Nov. 2015.
[1] A. M. Chekroud, R. J. Zotti, Z. Shehzad, R. Gueorguieva, M. K. Johnson, M. H. [26] R. Higdon, R. K. Earl, L. Stanberry, C. M. Hudac, E. Montague, E. Stewart, I.
Trivedi, T. D. Cannon, J. H. Krystal, and P. R. Corlett, “Cross-trial prediction of treat- Janko, J. Choiniere, W. Broomall, N. Kolker, et al., “The promise of multi-omics and
ment outcome in depression: A machine learning approach,” Lancet Psychiatry, vol. 3, no. 3, clinical data integration to identify and target personalized healthcare approaches in au-
pp. 243–250, Apr. 2016. tism spectrum disorders,” Omics: J. Integrative Biol., vol. 19, no. 4, pp. 197–208, Apr. 2015.
[2] R. Iniesta, K. Malki, W. Maier, M. Rietschel, O. Mors, J. Hauser, N. Henigsberg, [27] W. Zhang, F. Li, and L. Nie, “Integrating multiple omics analysis for microbial biolo-
M. Z. Dernovsek, D. Souery, D. Stahl, et al., “Combining clinical variables to optimize gy: Application and methodologies,” Microbiology, vol. 156, no. 2, pp. 287–301, Feb. 2010.
prediction of antidepressant treatment outcomes,” J. Psychiatric Res., vol. 78, pp. 94–102, [28] D. Lahat, T. Adali, and C. Jutten, “Multimodal data fusion: An overview of methods,
2016. challenges, and prospects,” Proc. IEEE, vol. 103, no. 9, pp. 1449–1477, Sept. 2015.
[3] R. Redlich, N. Opel, D. Grotegerd, K. Dohm, D. Zaremba, C. Burger, S. Munker, [29] B. Ray, M. Henaff, S. Ma, E. Efstathiadis, E. R. Peskin, M. Picone, T. Poli, C. F.
L. Muhlmann, P. Wahl, W. Heindel, et al., “Prediction of individual response to electro- Aliferis, and A. Statnikov, “Information content and analysis methods for multi-modal
convulsive therapy via machine learning on structural magnetic resonance imaging data,” high-throughput biomedical data,” Sci. Rep., vol. 4, p. 4411, Mar. 2014.
JAMA Psychiatry, vol. 73, no. 6, pp. 557–564, July 2016. [30] P.-Y. Wu, C.-W. Cheng, C. D. Kaddi, J. Venugopalan, R. Hoffman, and M. D.
[4] N. Koutsouleris, R. S. Kahn, A. M. Chekroud, S. Leucht, P. Falkai, T. Wobrock, E. Wang, “–Omic and electronic health record big data analytics for precision medicine,”
M. Derks, W. W. Fleischhacker, and A. Hasan, “Multisite prediction of 4-week and 52- IEEE Trans. Biomed. Eng., vol. 64, no. 2, pp. 263–273, Feb. 2017.
week treatment outcomes in patients with first-episode psychosis: A machine learning [31] J. Kong, L. A. Cooper, F. Wang, D. A. Gutman, J. Gao, C. Chisolm, A. Sharma, T.
approach,” Lancet Psychiatry, vol. 3, no. 10, pp. 935–946, Oct. 2016. Pan, E. G. Van Meir, T. M. Kurc, et al., “Integrative, multimodal analysis of glioblas-
[5] S. K. Tighe, P. B. Mahon, and J. B. Potash, “Predictors of lithium response in bipolar toma using TCGA molecular data, pathology images, and clinical outcomes,” IEEE Trans.
disorder,” Therapeutic Advances Chronic Disease, vol. 2, no. 3, pp. 209–226, May 2011. Biomed. Eng., vol. 58, no. 12, pp. 3469–3474, Dec. 2011.
[6] C. Wijbrandts and P. Tak, “Prediction of response to targeted treatment in rheumatoid [32] B. P. Coe, E. A. Vucic, R. Chari, W. L. Lam, and W. W. Lockwood, “An integrative
arthritis,” Mayo Clin. Proc., vol. 92, no. 7, pp. 1129–1143, July 2017. multi-dimensional genetic and epigenetic strategy to identify aberrant genes and pathways
[7] D. A. Mrazek, J. M. Biernacka, D. J. O’kane, J. L. Black, J. M. Cunningham, M. S. in cancer,” BMC Syst. Biol., vol. 4, no. 1, p. 67, Dec. 2010.
Drews, K. A. Snyder, S. R. Stevens, A. J. Rush, and R. M. Weinshilboum, “CYP2C19 [33] M. R. Aure, I. Steinfeld, L. O. Baumbusch, K. Liestøl, D. Lipson, S. Nyberg, B.
variation and citalopram response,” Pharmacogenetics Genomics, vol. 21, no. 1, p. 1, Jan. Naume, K. K. Sahlberg, V. N. Kristensen, A.-L. Børresen-Dale, et al., “Identifying in-
2011. trans process associated genes in breast cancer by integrated analysis of copy number and
[8] A. A. Onitilo, J. M. Engel, R. T. Greenlee, and B. N. Mukesh, “Breast cancer subtypes expression data,” PLoS One, vol. 8, no. 1, p. e53014, Jan. 2013.
based on ER/PR and Her2 expression: Comparison of clinicopathologic features and [34] B. Wang, A. M. Mezlini, F. Demir, M. Fiume, Z. Tu, M. Brudno, B. Haibe-Kains,
survival,” Clin. Med. Res., vol. 7, no. 1–2, pp. 4–13, June 2009. and A. Goldenberg, “Similarity network fusion for aggregating data types on a genomic
[9] M. D. Ritchie, E. R. Holzinger, R. Li, S. A. Pendergrass, and D. Kim, “Methods of scale,” Nature Methods, vol. 11, no. 3, pp. 333–337, Mar. 2014.
integrating data to uncover genotype-phenotype interactions,” Nature Rev. Genetics, vol. [35] D. Kim, J.-G. Joung, K.-A. Sohn, H. Shin, Y. R. Park, M. D. Ritchie, and J. H.
16, no. 2, pp. 85–97, Feb. 2015. Kim, “Knowledge boosting: A graph-based integration approach with multi-omics data
[10] M. Bersanelli, E. Mosca, D. Remondini, E. Giampieri, C. Sala, G. Castellani, and L. and genomic knowledge for cancer clinical outcome prediction,” J. Amer. Med. Informatics
Milanesi, “Methods for the integration of multi-omics data: Mathematical aspects,” BMC Assoc., vol. 22, no. 1, pp. 109–120, July 2014.
Bioinformatics, vol. 17, no. Suppl 2, p. 15, Dec. 2016. [36] A. J. Rush, M. H. Trivedi, H. M. Ibrahim, T. J. Carmody, B. Arnow, D. N. Klein, J.
[11] M. H. Trivedi, M. Fava, S. R. Wisniewski, M. E. Thase, F. Quitkin, D. Warden, C. Markowitz, P. T. Ninan, S. Kornstein, R. Manber, et al., “The 16-item quick inven-
L. Ritz, A. A. Nierenberg, B. D. Lebowitz, M. M. Biggs, et al., “Medication augmenta- tory of depressive symptomatology (QIDS), clinician rating (QIDS-C), and self-report
tion after the failure of SSRIs for depression,” New Eng. J. Med., vol. 354, no. 12, pp. (QIDS-SR): A psychometric evaluation in patients with chronic major depression,” Biol.
1243–1252, Mar. 2006. Psychiatry, vol. 54, no. 5, pp. 573–583, Sept. 2003.
[12] F. A. Jain, A. M. Hunter, J. O. Brooks, and A. F. Leuchter, “Predictive socioeconomic [37] M. Piccinelli and G. Wilkinson, “Gender differences in depression critical review,”
and clinical profiles of antidepressant response and remission,” Depression Anxiety, vol. 30, Br. J. Psychiatry, vol. 177, no. 6, pp. 486–492, Dec. 2000.
no. 7, pp. 624–630, July 2013. [38] J. D. Banfield and A. E. Raftery, “Model-based Gaussian and non-Gaussian cluster-
[13] R. Hirschfeld, J. M. Russell, P. L. Delgado, J. Fawcett, R. A. Friedman, W. M. ing,” Biometrics, pp. 803–821, Sept. 1993.
Harrison, L. M. Koran, I. W. Miller, M. E. Thase, R. H. Howland, et al., “Predictors of [39] M. H. Trivedi, A. J. Rush, S. R. Wisniewski, A. A. Nierenberg, D. Warden, L. Ritz,
response to acute treatment of chronic and double depression with sertraline or imipra- G. Norquist, R. H. Howland, B. Lebowitz, P. J. McGrath, et al., “Evaluation of outcomes
mine,” J. Clin. Psychiatry, vol. 59, pp. 669–675, July 1998. with citalopram for depression using measurement-based care in STAR*D: Implications
[14] R. M. Bagby, A. G. Ryder, and C. Cristi, “Psychosocial and clinical predictors of for clinical practice,” Amer. J. Psychiatry, vol. 163, no. 1, pp. 28–40, Jan. 2006.
response to pharmacotherapy for depression,” J. Psychiatry Neurosci., vol. 27, no. 4, pp. [40] J. J. Schildkraut, “Neuropsychopharmacology and the affective disorders,” New Eng.
250, July 2002. J. Med., vol. 281, no. 6, pp. 302–308, Aug. 1969.
[15] A. C. Altamura, C. Montresor, D. Salvadori, and E. Mundo, “Does comorbid sub- [41] J. Axelrod and R. Weinshilboum, “Catecholamines,” New Eng. J. Med., vol. 287, no.
threshold anxiety affect clinical presentation and treatment response in depression? A 5, pp. 237–242, Aug. 1972.
preliminary 12-month naturalistic study,” Int. J. Neuropsychopharmacol., vol. 7, no. 4, pp. [42] M. Gupta, D. Neavin, D. Liu, J. Biernacka, D. Hall-Flavin, W. V. Bobo, M. A. Frye,
481–487, Dec. 2004. M. Skime, G. D. Jenkins, A. Batzler, et al., “TSPAN5, ERICH3 and selective serotonin
[16] N. Olchanski, M. M. Myers, M. Halseth, P. L. Cyr, L. Bockstedt, T. F. Goss, and R. reuptake inhibitors in major depressive disorder: Pharmacometabolomics-informed phar-
H. Howland, “The economic burden of treatment-resistant depression,” Clin. Therapeu- macogenomics,” Mol. Psychiatry, Dec. 2016.
tics, vol. 35, no. 4, pp. 512–522, Apr. 2013. [43] D. Liu, B. Ray, D. R. Neavin, J. Zhang, A. P. Athreya, J. M. Biernacka, W. V. Bobo,
[17] K. Martinowich, D. Jimenez, C. Zarate, and H. Manji, “Rapid antidepressant effects: D. K. Hall-Flavin, M. K. Skime, H. Zhu, et al., “Beta-defensin 1, aryl hydrocarbon
Moving right along,” Mol. Psychiatry, vol. 18, no. 8, pp. 856–863, Aug. 2013. receptor and plasma kynurenine in major depressive disorder: Metabolomics-informed
[18] R. C. Kessler, H. S. Akiskal, M. Ames, H. Birnbaum, P. Greenberg, R. M. A. genomics,” Translational Psychiatry, vol. 8, no. 1, p. 10, Jan. 2018.
Hirschfeld, R. Jin, K. R. Merikangas, G. E. Simon, and P. S. Wang, “Prevalence and ef- [44] B. Scholkopf and A. J. Smola, Learning with Kernels: Support Vector Machines, Regular-
fects of mood disorders on work performance in a nationally representative sample of US ization, Optimization, and Beyond. Cambridge, MA: MIT Press, 2001.
workers,” Amer. J. Psychiatry, vol. 163, no. 9, pp. 1561–1568, Sept. 2006. [45] J. Friedman, T. Hastie, and R. Tibshirani, “Regularization paths for generalized
[19] A. P. Athreya, S. S. Banerjee, D. Neavin, R. Kaddurah-Daouk, A. J. Rush, M. A. linear models via coordinate descent,” J. Stat. Softw., vol. 33, no. 1, p. 1, 2010.
Frye, L. Wang, R. M. Weinshilboum, W. V. Bobo, and R. K. Iyer, “Data-driven longitu- [46] J. H. Friedman, “Stochastic gradient boosting,” Computational Statist. Data Anal., vol.
dinal modeling and prediction of symptom dynamics in major depressive disorder: Inte- 38, no. 4, pp. 367–378, Feb. 2002.
grating factor graphs and learning methods,” in Proc. IEEE Conf. Computational Intelligence [47] J. Friedman, T. Hastie, and R. Tibshirani, The Elements of Statistical Learning. Springer
Bioinformatics and Computational Biology, Aug. 2017, pp. 1–9. Series in Statistics, vol. 2. Berlin: Springer, 2009.
[20] W. Bardwell, P. Nicassio, M. Weisman, R. Gevirtz, and D. Bazzo, “Rheumatoid [48] Y. Tang, Y.-Q. Zhang, N. V. Chawla, and S. Krasser, “SVMs modeling for highly
arthritis severity scale: A brief, physician-completed scale not confounded by patient imbalanced classification,” IEEE Trans. Syst., Man, Cybern., Part B, vol. 39, no. 1, pp.
self-report of psychological functioning,” Rheumatology, vol. 41, no. 1, pp. 38–45, Jan. 281–288, Feb. 2009.
2002. [49] N. V. Chawla, K. W. Bowyer, L. O. Hall, and W. P. Kegelmeyer, “SMOTE: Synthet-
[21] W. Kwong and D. Pathak, “Validation of the eleven-point pain scale in the mea- ic minority over-sampling technique,” J. Artif. Intell. Res., vol. 16, pp. 321–357, June 2002.
surement of migraine headache pain,” Cephalalgia, vol. 27, no. 4, pp. 336–342, Apr.
2007.



Research Frontier

JongHyok Ri
Institute of Cyber-Systems and Control, Zhejiang University, Zhejiang, China
Institute of Information Technology, Kim Il Song University, Pyongyang, DPR of Korea

Liang Liu
Institute of Cyber-Systems and Control, Zhejiang University, Zhejiang, China

Yong Liu
State Key Laboratory of Industrial Control Technology and Institute
of Cyber-Systems and Control, Zhejiang University, Zhejiang, China

Huifeng Wu
College of Computer Science and Technology, Zhejiang Dianzi University,
Zhejiang, China

Wenliang Huang
China Unicom Ltd. Beijing, China

Hun Kim
Department of Computer Science, Kim Il Song University, DPR of Korea

Optimal Weighted Extreme Learning Machine for Imbalanced Learning with Differential Evolution

Digital Object Identifier 10.1109/MCI.2018.2840707
Date of publication: 18 July 2018
Corresponding Author: Yong Liu (Email: yongliu@iipc.zju.edu.cn).

Abstract
In this paper, we present a formal model for the optimal weighted extreme learning machine (ELM) on imbalanced learning. Our model regards the optimal weighted ELM as an optimization problem to find the best weight matrix. We propose an approximate search algorithm, named weighted ELM with differential evolution (DE), that is a competitive stochastic search technique, to solve the optimization problem of the proposed formal imbalanced learning model. We perform experiments on standard imbalanced classification datasets which consist of 39 binary datasets and 3 multiclass datasets. The results show a significant performance improvement over standard ELM, with an average Gmean improvement of 10.15% on binary datasets and 1.48% on multiclass datasets, which are also better than other state-of-the-art methods. We also demonstrate that our proposed algorithm can achieve high accuracy in representation learning by performing experiments on MNIST, CIFAR-10, and YouTube-8M, with feature representation from convolutional neural networks.

I. Introduction
Extreme learning machine (ELM) [1]-[4] is an effective and efficient machine learning technique that has attracted attention in various fields. The essential advantage of ELM is that the hidden neuron parameters are randomly assigned, which may be independent of training data, and the output weights can be analytically decided by the Moore-Penrose generalized inverse [5], [6]. Thus, it provides simpler and faster implementation than other machine learning techniques.
In recent years, the imbalanced learning problem [7]-[10] has drawn a significant amount of interest from academia, industry, and government funding agencies as data continues to accumulate. As most of the standard learning algorithms assume the distributions among all classes are equal, equal misclassification costs are acceptable for algorithms learning balanced datasets. However, when dealing with a complex imbalanced dataset, standard algorithms [2]-[4] will fail to represent the distribution characteristics of the data and thus will lead to unfavorable accuracies across the data classes [11]. Unfortunately, classic ELM does not solve the problem of imbalanced data



distribution. When using classic ELM for imbalanced data, the majority class tends to push the separating boundary toward the minority side to gain better classification results for itself. Therefore, data in the minority class will be easily misclassified. The most straightforward solution for imbalanced data learning is to assign the misclassification cost inversely to the class distribution, which may be simply calculated as the number of samples in each class. Thus Zong et al. [7] proposed weighted ELM to overcome the disadvantages of the original ELM for an imbalanced data problem. Their solution is based on the work regarding weighted regularized ELM presented by Toh [9] and Deng et al. [10], and the key essence of weighted ELM in Zong et al. [7] is to assign an extra weight to each sample to strengthen the impact of the minority class while weakening the relative impact of the majority class. Experimental results in their work showed superior performance of weighted ELM compared with original ELM on various imbalanced datasets. However, the two weighting schemes used in their approach can only obtain empirical sub-optima, and global optima cannot be guaranteed. Following work [8] introduced the boosting method to obtain better weighting schemes; however, how to set the optimal weighting scheme remains an open problem.
In this paper, we present a formal model for optimal weighted ELM applied to imbalanced learning. Our model assumes that the optimal weighted ELM can be presented as an optimization problem to search the optimal weight matrix for the weighted ELM. We also present an approximate solution for the optimal weight matrix searching problem based on differential evolution (DE) [12], which is a competitive stochastic search technique that performs well in various standard test functions and real-world optimization problems. We also evaluated the effectiveness of our learning machine algorithm by performing experiments on various standard classification datasets, which consist of 39 binary datasets and 3 multiclass datasets: the MNIST [13], CIFAR-10 [14], and YouTube-8M [15] datasets.
The contributions of our approach are as follows:
❏ We present a formal mathematical model to obtain the optimal weighting scheme. We introduce DE to calculate the approximate optimal solution.
❏ Our approach can achieve significant improvement in classification performance compared with other state-of-the-art methods on various imbalanced datasets.
❏ Our approach can narrow the search range greatly compared with other state-of-the-art methods, which indicates our approach may be more efficient for the practical imbalanced learning problem.
The remaining sections are organized as follows. Sections II and III introduce the theoretical background of ELM and related ELM methods on imbalanced learning. Section IV presents the proposed method. Section V reports the experimental results and performance analysis. Finally, the conclusions are summarized in Section VI.

II. Theoretical Background
This section introduces a brief theoretical background of ELM.
ELM [1], [2] was originally proposed for single-hidden layer feedforward neural networks (SLFNs) and then extended to 'generalized' SLFNs where the hidden layer does not require tuning [3]. The main feature of ELM is the random generation of hidden nodes, which may be independent of the training data, and the analytical calculation of output weights by the Moore-Penrose generalized inverse [3]. The hidden layer output (with l nodes) can be presented by a row vector h(x) = [h_1(x), ..., h_l(x)], where x is the input sample. Given n training samples (x_i, t_i), the mathematical model of the SLFNs is

Hβ = T,   (1)

where H is the hidden layer output matrix, β is the output weight, and T is the target vector.
The least squares solution with minimal norm is analytically determined using the Moore-Penrose "generalized" inverse [3], [16] as follows:

n < l:  β = H†T = H^T (I/C + H H^T)^{-1} T
l < n:  β = H†T = (I/C + H^T H)^{-1} H^T T.   (2)

ELM can also be explained from the optimization view. ELM tries to minimize both ‖Hβ - T‖² and ‖β‖. Therefore, a solution for formula (1) can be obtained [2] from

Minimize:   L_ELM = (1/2) ‖β‖² + C (1/2) Σ_{i=1}^{n} ‖ξ_i‖²
Subject to: h(x_i) β = t_i^T - ξ_i^T,  i = 1, ..., n,   (3)

where ξ_i = [ξ_{i,1}, ..., ξ_{i,m}] is the training error vector of the m output nodes corresponding to training sample x_i. C is the trade-off regularization parameter between the minimization of training errors and the maximization of the marginal distance. Based on the Karush-Kuhn-Tucker (KKT) theorem [16], we can solve the optimization problem of formula (3) and obtain the same solution as formula (2).
Given a new sample x, the output function of ELM is obtained as follows:

f(x) = h(x) H^T (I/C + H H^T)^{-1} T,  when n < l
f(x) = h(x) (I/C + H^T H)^{-1} H^T T,  when n ≥ l.   (4)

Here, f(x) = [f_1(x), ..., f_m(x)] is the output function vector. Users may determine the prediction label of x as follows:

label(x) = argmax_{i ∈ [1, ..., m]} f_i(x).   (5)
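To ground equations (1)-(5), the following NumPy sketch implements the regularized ELM solution for the n ≥ l case. It is a minimal illustration under our own assumptions (sigmoid hidden layer, one-hot targets, hypothetical class and parameter names), not the authors' code.

# Minimal regularized ELM following (2)-(5): random hidden layer, closed-form output weights.
import numpy as np

class ELM:
    def __init__(self, n_hidden=100, C=1.0, seed=0):
        self.l, self.C, self.rng = n_hidden, C, np.random.default_rng(seed)

    def _h(self, X):
        # Fixed random input weights/biases with a sigmoid additive feature mapping.
        return 1.0 / (1.0 + np.exp(-(X @ self.W + self.b)))

    def fit(self, X, T):
        # X: (n, d) inputs; T: (n, m) one-hot targets.
        d = X.shape[1]
        self.W = self.rng.standard_normal((d, self.l))
        self.b = self.rng.standard_normal(self.l)
        H = self._h(X)
        # beta = (I/C + H^T H)^{-1} H^T T, the n >= l case of (2).
        A = np.eye(self.l) / self.C + H.T @ H
        self.beta = np.linalg.solve(A, H.T @ T)
        return self

    def predict(self, X):
        # label(x) = argmax_i f_i(x), as in (5).
        return (self._h(X) @ self.beta).argmax(axis=1)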



Recently, further focus has been placed on ELM algorithms and applications [17]-[19], and some advanced algorithms [20]-[23] have been proposed to improve performance. In [20], a Bayesian-based ELM (BELM), which incorporates the advantages of both ELM and Bayesian models, is proposed. It can build the corresponding confidence interval without using additional methods such as bootstrap, and requires low computational cost.
Bai et al. [21] proposed a sparse ELM (S-ELM) by replacing the equality constraints in the traditional ELM model with inequality constraints, which can reduce the storage space and testing time. Bai et al. [22] also proposed an S-ELM for regression analysis that can reduce the storage space and testing time. In addition, they have developed an efficient training algorithm based on iterative computation, which scales quadratically with respect to the number of training samples. In [23], a random projection-based ELM (RP-ELM), which is estimated on the analysis of the random projection feature mapping schema in the ELM, is proposed. RP-ELM can significantly reduce the number of neurons in the hidden layer without affecting the accuracy of the generalization performance. As a result, the final learning machine will benefit from a considerable simplification in the feature-mapping stage.

III. Related ELM Methods on Imbalanced Learning

A. Weighted ELM [7]
The main goal of the ELM classifier is to find a boundary to separate data from two or multiple parts with maximal margin distance between any two parts. For imbalanced data, this separating boundary tends to be pushed toward the side of the minority class, so that the minority classes are easily misclassified. To resolve this issue, weighted ELM (WELM) [7] has been recently proposed.
WELM improves the classification performance for data with imbalanced class distribution while maintaining the advantages of the original ELM stated above. Specifically, each training sample is assigned an extra weight. Mathematically, an n × n diagonal matrix W associated with every training sample x_i is defined. Usually, if x_i comes from a minority class, the associated weight w_ii is relatively larger than that of samples from a majority class. Therefore, the impact of the minority class is strengthened while the relative impact of the majority class is weakened. Considering the diagonal weight matrix W, the optimization formula of ELM can be revised [7] as

Minimize:   L_WELM = (1/2) ‖β‖² + C (1/2) Σ_{i=1}^{n} w_ii ‖ξ_i‖²
Subject to: h(x_i) β = t_i^T - ξ_i^T,  i = 1, ..., n.   (6)

According to the KKT theorem [16], the solution to formula (6) is

β = H†T = H^T (I/C + W H H^T)^{-1} W T,  when n < l
β = H†T = (I/C + H^T W H)^{-1} H^T W T,  when n ≥ l.   (7)

They also proposed two empirical weighting schemes [7] as follows:

W1: w_ii = 1 / #t_i
W2: w_ii = 0.618 / #t_i  if #t_i > AVG(#t_i);  w_ii = 1 / #t_i  if #t_i ≤ AVG(#t_i),   (8)

where #t_i is the number of samples belonging to class t_i, and AVG(#t_i) represents the average number of samples over all classes.
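Building on the ELM sketch above, the next snippet adds the per-sample weighting of (6)-(8). Again, this is an illustrative reading (integer class labels, the W1 scheme by default, the n ≥ l solution), not the reference implementation.

# Weighted ELM: per-class weights W1 (1/#t_i) or W2 (0.618/#t_i for over-represented classes),
# plugged into beta = (I/C + H^T W H)^{-1} H^T W T from (7).
import numpy as np

def class_weights(y, scheme="W1"):
    # y: integer class labels 0..m-1, one per training sample.
    counts = np.bincount(y).astype(float)
    w = 1.0 / counts[y]                       # W1
    if scheme == "W2":
        heavy = counts[y] > counts.mean()     # classes larger than the average class size
        w[heavy] = 0.618 / counts[y][heavy]
    return w                                   # diagonal of W, one entry per sample

def welm_beta(H, T, y, C=1.0, scheme="W1"):
    w = class_weights(y, scheme)
    WH = H * w[:, None]                        # W @ H for diagonal W
    A = np.eye(H.shape[1]) / C + H.T @ WH
    return np.linalg.solve(A, H.T @ (T * w[:, None]))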
Although WELM proposes two weighting schemes in formula (8) for imbalanced data, these matrices are not optimal, which can be shown by the following toy experiment. In the toy experiment, we first randomly generate an imbalanced dataset containing two categories with a category ratio of 10:1, and we also apply the standard ELM, WELM with W1, WELM with W2, and two other WELMs with randomly selected weights. The original dataset and the classification results are shown in Fig. 1, where the ground-truth labels of every data instance are denoted by different symbols and the classification results are denoted by different colors. The results in Fig. 1 show that although WELM with the W1 and W2 policies can achieve better results than the ELM method on this dataset, their performances are worse than those of the two WELMs with randomly selected weights, which indicates the policies with W1 and W2 in WELM are not optimal. In this paper, these two WELM classifiers are referred to as WELM-W1 and WELM-W2.
WELM [7] can improve the performance of ELM greatly for imbalanced data. The classification error of the class with fewer elements was reduced by setting an unequal cost distribution for each class with the weight matrix. However, the approach relied on empirical weighting schemes, which were designed according to the element number of each class. Thus, it can also be called experienced WELM.

B. Boosting Weighted ELM [8]
To overcome this shortcoming of experienced WELM, boosting-based WELM (BWELM) [8] was proposed. BWELM tries to determine the optimal weight matrix using an AdaBoost algorithm [24].
Inspired by the distribution weights updating mechanism of AdaBoost, they embedded WELM seamlessly into a modified AdaBoost framework. Intuitively, the distribution weights in AdaBoost, which reflect the importance of training samples, are input into WELM as training sample weights. Furthermore, such training sample weights are dynamically updated during iterations of AdaBoost.
Considering the characteristics of imbalanced learning, they modify the original AdaBoost.M1 [24] in two aspects. First, the initial distribution weights are set to be asymmetric to make AdaBoost converge at a faster speed.



FIGURE 1 Classification results produced by ELM, WELM-W1, WELM-W2, and WELM with randomly selected weights on a randomly generated dataset with an imbalance ratio of 1:10. Panels: (a) original data; (b) ELM (Gmean: 95.3463); (c) WELM-W1 (Gmean: 98.5347); (d) WELM-W2 (Gmean: 98.9949); (e) randomly selected weight (1) (Gmean: 99.6357); (f) randomly selected weight (2) (Gmean: 99.7269). Blue circles and red circles represent the correctly and incorrectly classified instances in the majority class, respectively. Red crosses and blue crosses represent the correctly and incorrectly classified instances in the minority class, respectively.

This results in a boosting classifier with a smaller number of WELM classifiers and can save much computational time. Second, the distribution weights are updated separately for different classes to avoid destroying the distribution weights' asymmetry.
In this paper, we also address the problem of obtaining the best weight matrix for the ELM-based imbalanced learning problem, and we propose a DE-based WELM (DE-WELM) for imbalanced dataset learning.
DE was first presented by Storn and Price [12], and it has been recognized as a powerful method for solving optimization problems. It resembles the structure of evolutionary algorithms, but differs from their traditional versions in the generation of new candidate solutions and the use of a greedy selection scheme. As shown in previous research [25], DE is far more efficient and robust (with respect to reproducing the results in several runs) compared to other evolutionary computation algorithms such as particle swarm optimization (PSO) [26] and evolutionary programming (EP) [27]. In addition, it has few parameters to set, and the same settings can be adapted to many different problems. Thus, this method has been actively used in various fields [28]-[30]. Some works [31]-[33] have focused on updating ELM using DE. However, they only use DE to search the hidden neuron parameters instead of searching the weight matrix for imbalanced learning.
As parameters such as NP, F, and CR in the DE algorithm are critical for its performance, there are also many improved DE algorithms [34]-[40] and [41], which can adaptively control those parameters.

IV. Our Approach
Although the two weighting schemes in formula (8) can obtain superior results compared with the original ELM, they are only empirical schemes and cannot guarantee the optima, which was shown in the toy experiment. In this section, we present the formal mathematical model to obtain the optimal weighting scheme, and we also introduce the DE method [12] to calculate the approximate optimal weight matrix used in WELM, as the calculation for the formal model is infeasible.

A. Mathematical Model for Optimal Weighted Scheme
The problem of calculating the optimal weight matrix in WELM can be formally defined as follows. Given training data X = {(x_i, t_i), t_i ∈ {1, 2, ..., m}, i = 1, 2, ..., n}, with m classes in X, the optimal weight matrix can be obtained by optimizing the following formula:



W* = argmin_W (1 - Gmean(Hβ(X, W), T))
Subject to:
β(X, W) = H^T (I/C + W H H^T)^{-1} W T,  when n < l
β(X, W) = (I/C + H^T W H)^{-1} H^T W T,  when n ≥ l,   (9)

where H is the hidden layer output matrix, T is the target vector, and l is the number of hidden layer nodes in WELM. As β is the output weight computed in WELM using W as the weight matrix, this formula can be considered as a function of W and X. Gmean is a conventional evaluation metric in the case of imbalanced learning; it is the geometric mean of the recall values of all m classes and is defined as follows:

Gmean(Y, T) = ( Π_{j=1}^{m} q_j / p_j )^{1/m},   (10)

where Y is the class prediction vector, q_j is the number of elements belonging to class j correctly classified among Y, and p_j is the number of elements belonging to class j among T. Although we may use the Levenberg-Marquardt (LM) algorithm [42] to search for the optimal solution to this problem, the results are quite sensitive to the initial values, so we introduce the DE algorithm to calculate the approximate optimal solution.

B. DE-based Approximate Weight Calculation Algorithm for Weighted ELM
DE is a method that optimizes a problem by iteratively trying to improve a candidate solution with regard to a given measure of quality; it has been successfully applied in many domains [28]-[30]. There are three major control factors in the DE algorithm, i.e., the population size (NP), scaling factor (F), and crossover rate (CR). There are also three operations involved, i.e., the mutation operation, crossover operation, and selection operation. We present our approximate weight calculation algorithm based on the three factors and three operations in DE. There are two stages in our algorithm: the initial stage and the update stage.

1) Initial Stage
The initial stage generates an initial population that contains NP candidates for the weight matrix W, and searches for the best weight from the current candidate set.
First, the values in training set X are normalized to [-1, 1], and we also select a subset X_vd ⊂ X as the validation set, which is used to avoid overfitting. The initial population contains NP candidate weight matrices; NP - 2 candidates are generated randomly in [0, 1], and the remaining two candidates are set as W1 and W2, respectively, which are presented in WELM [7]. Thus the initial population W_R consists of NP n-dimensional vectors as follows:

W_R = {W_i,R | W_i,R = (w¹_i,R, w²_i,R, ..., wⁿ_i,R), i = 1, 2, ..., NP - 2} ∪ {W1, W2}.   (11)

Note that the weight matrix in WELM is a diagonal matrix, so all elements are zero except for the diagonal elements. Therefore, we will express a diagonal matrix using a vector.
Second, we transform W_i,R (i = 1, 2, ..., NP) into a weight matrix diag(W_i,R), where diag(·) maps an input vector to the diagonal matrix whose diagonal elements are the vector's entries, and compute the error E_{W_i,R} (i = 1, ..., NP) of the WELM classifier on validation set X_vd; the candidate optimal weight vector which produces the lowest error is defined as W_best,R:

E_{W_i,R} = 1 - Gmean(Hβ(X_vd, diag(W_i,R)), T)
W_best,R = argmin_{W_i,R} E_{W_i,R},  i = 1, 2, ..., NP.   (12)

2) Update Stage
The update stage generates a new candidate population based on the results of the previous iteration and performs the operations of mutation, crossover, and selection.
The input of the update stage is the candidate population of the previous stage and the candidate optimal weight vector from the previous stage, and the output of the update stage is a candidate population for the next stage and a candidate optimal weight vector for the next stage.
First, we transform candidate population W_R into mutant population V_R with the mutation operation. Each mutant vector of V_R is calculated according to the following equation:

V_R = {V_i,R | V_i,R = W_i,R + F (W_best,R - W_i,R) + F (W_ri1,R - W_ri2,R), i = 1, 2, ..., NP}.   (13)

The indices r_i1 and r_i2 are mutually exclusive integers randomly generated within the range [1, NP], which are also different from the index i. Indices are randomly generated once for each mutant vector. The scaling factor F is a positive control parameter for scaling the differences among vectors.
(LM) algorithm [42] to search for the control parameter for scaling the differ-
optimal solution to this problem, the W R = {W i, R W i, R = (w 1i, R, w i2, R, ..., w ni, R), ences among vectors.
results are quite sensitive to the initial i = 1, 2, ..., NP - 2} , {W 1, W 2}. After the mutation operation, we
values, so we introduce the DE algo- (11) apply the crossover operation to each
rithm to calculate the approximate opti- pair of target vector W i, R and its corre-
mal solution. Note that the weight matrix in WELM sponding mutant vector Vi, R to gener-
is a diagonal matrix, so all elements are zero ate a trial vector U i, R = (u 1i, R, u 2i, R, ..., u ni, R)
B. DE-based Approximate except for the diagonal elements. There- as follows,
Weight Calculation Algorithm for fore we will express a diagonal matrix
Weighted ELM using a vector.
1
diag(.) is the function to map input vectors to a diago-
DE is a method that optimizes a prob- Second, we transfor m W i, R (i = nal matrix as diagonal elements and all elements are
lem by iteratively trying to improve a 1, 2, ..., NP ) into a weight matr ix zero except for the diagonal elements.

36 IEEE ComputatIonal IntEllIgEnCE magazInE | auguSt 2018


After the mutation operation, we apply the crossover operation to each pair of target vector W_{i,R} and its corresponding mutant vector V_{i,R} to generate a trial vector U_{i,R} = (u_{i,R}^{1}, u_{i,R}^{2}, ..., u_{i,R}^{n}) as follows,

u_{i,R}^{j} = \begin{cases} v_{i,R}^{j}, & \text{if } \mathrm{rand}_j[0,1) \le CR \text{ or } j = j_{rand} \\ w_{i,R}^{j}, & \text{otherwise} \end{cases} \quad (j = 1, 2, \ldots, n;\ i = 1, 2, \ldots, NP)    (14)

Here, if the values of the newly generated trial vectors exceed the corresponding upper and lower bounds, they are re-normalized into the range [0, 1]. CR is a user-defined constant within the range [0, 1], which controls the fraction of parameter values copied from the mutant vector. j_rand is a randomly chosen integer in the range [1, n]. This operation copies the jth parameter of the mutant vector V_{i,R} to the corresponding element in the trial vector U_{i,R} if (rand_j[0,1) <= CR) or (j = j_rand); otherwise, the parameter is copied from the corresponding target vector W_{i,R}. The condition j = j_rand is introduced to ensure that the trial vector U_{i,R} differs from its corresponding target vector W_{i,R} by at least one parameter.
We then use diag(U_{i,R}) (i = 1, 2, ..., NP) as a weight matrix to compute the error of the WELM classifier on the validation set X_vd. The element with the lowest error is defined as U_{best,R}. We can calculate the error of U_{i,R} as follows,

E_{U_{i,R}} = 1 - \mathrm{Gmean}\big( H_{\beta}(X_{vd}, \mathrm{diag}(U_{i,R})),\ T \big), \quad i = 1, \ldots, NP.    (15)

Then the next candidate population W_{R+1} = {W_{i,R+1}, i = 1, 2, ..., NP} and its best candidate can be decided according to the following select operation,

W_{i,R+1} = \begin{cases} U_{i,R}, & \text{if } E_{U_{i,R}} \le E_{W_{i,R}} \\ W_{i,R}, & \text{otherwise} \end{cases} \qquad W_{best,R+1} = \begin{cases} W_{best,R}, & \text{if } E_{W_{best,R}} < E_{U_{best,R}} \\ U_{best,R}, & \text{otherwise.} \end{cases}    (16)

We compare the E_{U_{i,R}} of each trial vector U_{i,R} with the E_{W_{i,R}} of its corresponding target vector in the current population. If the trial vector has an error value less than or equal to that of the corresponding target vector, the trial vector replaces the target vector and enters the next candidate population. Otherwise, the target vector remains in the population for the next generation.
The update stage is executed iteratively several times, and we obtain the approximate optimal weight matrix diag(W_{best,R_max}) once the algorithm completes, where R_max is the number of iterations of the DE algorithm.
In short, our method obtains the initial population and the candidate weight vectors in the initial stage and calculates an approximate optimized matrix by updating the population using the DE algorithm in the update stage. The pseudo-code of our proposed algorithm is given in Algorithm 1.

Algorithm 1 DE-WELM.
Input: Training set X = {(x_i, t_i), t_i in {1, 2, ..., m}, i = 1, 2, ..., n}, R_max.
Output: Approximate optimal weight matrix W*.
1: All data values x_i of the training set X are normalized in [-1, 1]; then select the validation set X_vd. Set R = 0.
2: Generate the initial population W_R = {W_{i,R}, i = 1, 2, ..., NP} by formula (11).
3: Compute the error E_{W_{i,R}} (i = 1, ..., NP) of the WELM classifier with weight matrix diag(W_{i,R}) on the validation set X_vd and the candidate optimal weight vector W_{best,R} by formula (12).
4: while R < R_max do
5:   Compute the trial vectors U_{i,R} (i = 1, ..., NP) from W_R by formulas (13) and (14).
6:   Compute the error E_{U_{i,R}} (i = 1, ..., NP) of the WELM classifier with weight matrix diag(U_{i,R}) on the validation set X_vd by formula (15).
7:   Obtain the next-stage population W_{R+1} and the candidate optimal weight vector of the next stage W_{best,R+1} by formula (16).
8:   R = R + 1.
9: end while
10: W* = diag(W_{best,R_max}).

V. Performance Evaluation
In this section, we conduct comparison experiments to evaluate the classification capability of our DE-WELM on imbalanced learning problems, and analyses of the results are also presented. We compare the performance of our proposed method with four state-of-the-art methods, i.e., the Softmax classifier [43], the classic ELM classifier [1], the experienced WELM classifier [7], and the BWELM classifier [8].
We select a subset of the KEEL (Knowledge Extraction based on Evolutionary Learning) imbalanced dataset repository² to evaluate the methods. We also use the MNIST dataset [13] (handwritten digit images), the CIFAR-10 dataset [14] (tiny images), and the YouTube-8M database in our experiments. The results are averaged over ten runs. Brief descriptions of the datasets are provided in the following subsections.
In our experiments, we also introduce the imbalanced ratio (IR) [7] to quantitatively measure the imbalance degree of a dataset,

\text{Binary: } IR = \frac{\#(+1)}{\#(-1)}, \qquad \text{Multiclass: } IR = \frac{\min_i \#(t_i)}{\max_i \#(t_i)},\ i = 1, \ldots, m.    (17)

Above, #(+1) is the number of samples in the minority class, #(-1) is the number of samples in the majority class, and #(t_i) is the number of samples in class t_i for a multiclass dataset.
The attributes of all the datasets are normalized to [-1, 1].
In ELM theory, there are many choices of feature mapping. In our experiments, we use the sigmoid additive feature mapping function [4], which is a popular choice among researchers.

² http://sci2s.ugr.es/keel/study.php?cod=24
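Putting formulas (11), (13), (14), and (16) together with Algorithm 1, the DE update can be sketched as follows. This is our illustrative reconstruction, not the authors' implementation: the validation error of (15) is abstracted into a user-supplied callable evaluate_error, and names such as de_welm_weights are ours.

```python
import numpy as np

rng = np.random.default_rng(0)

def de_welm_weights(evaluate_error, w1, w2, NP=50, F=1.0, CR=0.8, R_max=10):
    """Approximate optimal WELM weight vector via differential evolution.

    evaluate_error(w) -> scalar: validation error 1 - Gmean of a WELM trained
    with diag(w), i.e. Eq. (15).  w1, w2: the two experienced weighting
    schemes of WELM [7], length-n vectors included in the initial population (Eq. (11)).
    """
    n = len(w1)
    # Initial population: NP - 2 random candidates in [0, 1] plus W1 and W2.
    pop = np.vstack([rng.random((NP - 2, n)), w1, w2])
    errors = np.array([evaluate_error(w) for w in pop])
    best = np.argmin(errors)                     # candidate optimal weight vector

    for _ in range(R_max):
        trials = np.empty_like(pop)
        for i in range(NP):
            # Mutation, Eq. (13): r1, r2 distinct from each other and from i.
            r1, r2 = rng.choice([k for k in range(NP) if k != i], size=2, replace=False)
            v = pop[i] + F * (pop[best] - pop[i]) + F * (pop[r1] - pop[r2])
            # Crossover, Eq. (14): copy from the mutant with probability CR,
            # forcing at least one coordinate (j_rand) to come from it.
            j_rand = rng.integers(n)
            mask = rng.random(n) <= CR
            mask[j_rand] = True
            u = np.where(mask, v, pop[i])
            # Out-of-bound values are re-normalized into [0, 1] (here simply clipped).
            trials[i] = np.clip(u, 0.0, 1.0)
        trial_errors = np.array([evaluate_error(u) for u in trials])
        # Selection, Eq. (16): keep the trial if it is no worse than its target.
        improve = trial_errors <= errors
        pop[improve] = trials[improve]
        errors[improve] = trial_errors[improve]
        best = np.argmin(errors)

    return pop[best]    # diag(pop[best]) is the approximate optimal weight matrix W*
```

In practice, evaluate_error would train a WELM with diag(w) on the training set and return 1 − Gmean on X_vd; only the DE mechanics are shown here.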


There are two parameters to tune for ELM, i.e., the trade-off constant C and the number of hidden nodes l. In our experiments, these two parameters for DE-WELM are set as follows,

C \in \{2^{-6}, 2^{-4}, 2^{-2}, 2^{0}, 2^{2}, 2^{4}, 2^{6}\}, \qquad l \in \{10, 110, 210, \ldots, 810, 910\}.    (18)

We set the control parameters to NP = 50, F = 1, CR = 0.8 in our approach. The number of iterations R_max is set to 10. The validation set is randomly assigned as a subset with half the elements of the training set.
In the experiments involving ELM, experienced WELM [7], and BWELM, the grid searches for C and l are conducted over {2^{-18}, 2^{-16}, ..., 2^{48}, 2^{50}} and {10, 20, ..., 990, 1000}, respectively. The search range of m for the Softmax classifier is set as {-8, -7, -6, ..., 1, 2, 3}.
To compare our algorithm with other computational methods, we also introduce an approximate weight search method based on PSO [44], [45]. Particle swarm optimization is one of the most popular nature-inspired metaheuristic optimization algorithms, developed by Kennedy and Eberhart in 1995 [44], [45]. Since its development, many variants have also been developed for solving practical optimization issues [46]–[48]. In our paper, we use the PSO proposed in [26].
In the following experiments involving the PSO method, the two acceleration parameters c_1 and c_2 are set to 2, the inertia factor w is set to [0.4, 0.9], and the maximal number of iterations is set to 50. The size of the initial population (swarm), that is, the initial set of candidate weights, is set to 50; W_1 and W_2, which are presented in [7], are included in the initial population (swarm), and the remaining elements are randomly generated in [0, 1]. The grid searches for C and l are also conducted over {2^{-18}, 2^{-16}, ..., 2^{48}, 2^{50}} and {10, 20, ..., 990, 1000}, respectively. We call this algorithm "PSO-WELM" in the following sections.

A. Dataset Specification

1) Standard Classification Datasets
Similar to the data specifications mentioned in the experienced WELM [7] and BWELM [8], we have selected 39 binary datasets and 3 multiclass datasets with different degrees of imbalance from the KEEL dataset repository. Details of the datasets used in our experiments are shown in Tables 1 and 2, where the number of attributes (#ATTRI), the number of classes (#CLASS), the number of training samples (#TRAIN), the number of test samples (#TEST), and IR are listed.

TABLE 1 Specification of binary classification problems.
DATASETS          #ATTRI  #TRAIN  #TEST  IR
YEAST05679VS4     8       422     106    0.1047
YEAST1458VS7      8       554     139    0.0453
YEAST1289VS7      8       385     97     0.0327
YEAST1VS7         7       367     92     0.0700
ECOLI1            7       268     68     0.2947
ECOLI2            7       268     68     0.1806
ECOLI3            7       268     68     0.1167
ECOLI4            7       268     68     0.0635
YEAST2VS4         8       411     103    0.1078
YEAST2VS8         8       385     97     0.0434
GLASS0123VS456    9       171     43     0.3053
GLASS016VS2       9       153     39     0.0929
GLASS016VS5       9       147     37     0.0500
PIMA              8       614     154    0.5350
YEAST1            8       1187    297    0.4064
YEAST3            8       1187    297    0.1230
YEAST4            8       1187    297    0.0349
YEAST5            8       1187    297    0.0304
SHUTTLEC0VSC4     9       1463    366    0.0726
SHUTTLEC2VSC4     9       103     26     0.0404
SEGMENT0          19      1846    462    0.1661
WISCONSIN         9       546     137    0.5380
HABERMAN          3       244     62     0.3556
IRIS0             4       120     30     0.5000
VOWEL0            13      790     198    0.1002
NEW-THYROID1      5       172     43     0.1944
NEW-THYROID2      5       172     43     0.1944
PAGE-BLOCKS1      10      377     95     0.0620
GLASS0            9       173     43     0.4786
GLASS1            9       171     43     0.5405
GLASS2            9       171     43     0.0823
GLASS4            9       171     43     0.0621
GLASS5            9       171     43     0.0427
GLASS6            9       171     43     0.1554
VEHICLE0          18      676     170    0.3075
VEHICLE1          18      676     170    0.3439
VEHICLE2          18      676     170    0.3466
VEHICLE3          18      676     170    0.3330
YEAST6            8       1187    297    0.0243
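To make the size of the search spaces concrete, the grids just described can be enumerated directly. The short sketch below (plain Python, our own illustration) counts the (C, l) combinations examined by DE-WELM (Eq. (18)) versus the wider grids used for the baseline methods.

```python
# Hyperparameter grids as described in the text: Eq. (18) for DE-WELM,
# the wider ranges for ELM, experienced WELM, BWELM, and PSO-WELM.
de_welm_C = [2.0 ** k for k in range(-6, 7, 2)]     # {2^-6, 2^-4, ..., 2^6}, 7 values
de_welm_l = list(range(10, 911, 100))               # {10, 110, ..., 910}, 10 values
other_C   = [2.0 ** k for k in range(-18, 51, 2)]   # {2^-18, ..., 2^50}, 35 values
other_l   = list(range(10, 1001, 10))               # {10, 20, ..., 1000}, 100 values

print(len(de_welm_C) * len(de_welm_l))  # 70 (C, l) combinations for DE-WELM
print(len(other_C) * len(other_l))      # 3500 combinations for the other methods
```

These counts (7 × 10 versus 35 × 100) reappear in the temporal complexity analysis later in this section.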



The IR of the binary datasets selected from the KEEL repository varies from 0.0243 to 0.5405, and the IR of the multiclass datasets varies from 0.0061 to 0.5882.

TABLE 2 Specification of multiclass classification problems.
DATASETS      #ATTRI  #CLASS  #TRAIN  #TEST  IR
HAYES-ROTH    4       3       105     27     0.5882
NEW-THYROID   5       3       172     43     0.2066
PAGE-BLOCKS   10      5       438     110    0.0061

2) MNIST Dataset [13]
The MNIST dataset (Mixed National Institute of Standards and Technology database) is a large database of handwritten digits that is commonly used for training various image processing systems. The database is also widely used for training and testing in the field of machine learning [13], [49]–[52]. The MNIST dataset contains 60,000 training images and 10,000 test images, with each image labeled by an integer and 7,000 images per class. Each example is a 28 × 28 single-channel grayscale image. Example images of the MNIST dataset are shown in Fig. 2(a).

Figure 2 Example digital images of the MNIST database (a) and the CIFAR-10 database (b), and a histogram of the number of training videos in each top-level vertical category of YouTube-8M (c).



TABLE 3 Experimental results of binary problems: Gmean (%) in sigmoid mode.
DATASETS          SOFTMAX  ELM    WELM-W1  WELM-W2  BWELM  PSO-WELM  DE-WELM
YEAST05679VS4 62.58 84.13 86.02 80.02 85.56 85.69 86.53

YEAST1458VS7 45.12 61.07 67.10 64.26 68.09 68.55 72.18

YEAST1289VS7 46.52 59.23 75.83 73.92 74.21 78.48 90.16

YEAST1VS7 49.72 74.70 76.65 81.36 86.23 88.24 85.45

ECOLI1 86.40 87.77 90.69 90.26 93.02 90.80 95.13

ECOLI2 91.82 89.44 90.19 88.64 90.80 94.50 95.04

ECOLI3 73.70 77.38 90.17 90.00 93.21 92.84 93.61

ECOLI4 85.15 91.96 97.83 95.90 100 100 99.19

YEAST2VS4 85.28 86.25 91.56 90.02 93.78 97.75 98.01

YEAST2VS8 81.63 72.83 75.56 76.01 81.32 80.32 81.65

GLASS0123VS456 97.10 95.34 95.66 95.55 100 100 100

GLASS016VS2 49.28 67.78 83.77 83.06 92.85 94.11 95.26

GLASS016VS5 100 92.41 98.55 98.70 100 100 100

PIMA 74.32 70.10 74.74 71.51 75.79 74.97 78.06

YEAST1 54.19 63.26 72.57 70.32 73.36 73.37 77.06

YEAST3 73.42 80.75 93.25 91.08 92.52 93.92 94.06

YEAST4 40.54 65.52 87.92 73.92 83.51 93.05 93.01

YEAST5 61.02 81.04 95.39 95.15 98.25 98.79 99.04

SHUTTLEC0VSC4 100 100 100 100 100 100 100

SHUTTLEC2VSC4 100 93.54 100 100 100 100 100

SEGMENT0 100 99.24 99.75 99.70 99.78 100 100

WISCONSIN 95.58 97.85 97.62 96.95 97.34 98.25 98.48

HABERMAN 23.57 49.16 65.11 59.26 73.74 77.24 78.95

IRIS0 100 100 100 100 100 100 100

VOWEL0 85.58 100 100 100 100 100 100

NEW-THYROID1 98.64 98.24 99.44 99.72 100 100 100

NEW-THYROID2 100 95.55 99.72 100 100 100 100

PAGE-BLOCKS1 97.78 99.09 99.43 99.42 100 100 100

GLASS0 74.65 78.51 80.29 81.35 86.66 88.98 87.18

GLASS1 38.29 78.79 78.96 80.87 79.54 84.04 86.09

GLASS2 57.01 90.61 91.33 88.34 90.15 93.07 94.08

GLASS4 69.79 85.72 91.34 91.46 96.45 96.74 97.33

GLASS5 97.47 90.81 95.99 96.60 100 100 100

GLASS6 89.44 94.96 95.72 95.90 100 100 100

VEHICLE0 95.97 98.51 99.12 99.09 96.71 100 99.62

VEHICLE1 69.54 84.14 85.21 82.75 75.98 87.23 87.47

VEHICLE2 95.99 98.43 99.12 98.78 97.31 100 100

VEHICLE3 60.92 78.15 85.13 84.34 85.37 86.52 87.37

YEAST6 70.59 70.77 87.77 88.29 90.68 89.67 90.02

AVERAGE 76.37 84.18 89.60 88.52 91.07 92.49 93.33



3) CIFAR-10 Dataset [14]
The CIFAR-10 dataset is a labeled subset of the 80 Million Tiny Images dataset³, containing 60,000 color images (32 × 32) in 10 classes, with 6,000 images per class. They are split into 50,000 training images and 10,000 test images. Example images of the CIFAR-10 database are shown in Fig. 2(b).

4) YouTube Dataset
The YouTube-8M dataset [15] contains approximately 8 million YouTube videos. Each video is annotated with one or multiple tags for a total of 4,716 classes. All 4,716 classes can be grouped into 24 top-level vertical categories. In Fig. 2(c), we show histograms of the number of training videos in each top-level vertical category.

B. Experimental Results on Standard Classification Datasets
The experimental results of the seven methods (Softmax, ELM, WELM-W1, WELM-W2, BWELM, PSO-WELM, and DE-WELM) on the 39 binary imbalanced datasets and 3 multiclass imbalanced datasets are given in Tables 3 and 4, respectively. The results in both tables show that our DE-WELM performs much better than all the other state-of-the-art methods, and the evaluation metric Gmean can improve by 30% compared to the classic ELM classifier. The average performance metrics also indicate a significant improvement using our approach.

1) Comparison with Softmax Classifier
As can be seen in Tables 3 and 4, our DE-based algorithm achieves satisfactory performance on all datasets compared with the Softmax classifier. Specifically, on the datasets YEAST1289VS7, HABERMAN, YEAST4, GLASS1, and GLASS016VS2, DE-WELM achieves an improvement of more than 40% on Gmean. There are six datasets where both classifiers achieve perfect 100% Gmean values. The Softmax classifier performs much better than the other algorithms, except for our DE-WELM, on the datasets ECOLI2 and GLASS0123VS456, which also indicates that WELM-W1, WELM-W2, and BWELM are not optimal.

2) Comparison with Experienced Weighted ELM
Based on the results in Table 3, our DE-WELM performs much better on all datasets compared with experienced WELM, especially on the datasets YEAST1289VS7, YEAST4, and GLASS016VS2, achieving an improvement of more than 10% on Gmean. Regarding the balanced datasets such as GLASS1, WISCONSIN, PIMA, etc., the experimental results of DE-WELM are also better than those of experienced WELM, which shows that our method is applicable not only to imbalanced datasets, but also to balanced datasets. It also indicates that the two experienced weighting schemes employed in [7] are not optimal.

3) Comparison with Boosting Weighted ELM
As shown in Table 3, DE-WELM achieves superior performance versus BWELM on 39 datasets (92.85%), the exceptions being ECOLI4, YEAST6, and YEAST1VS7. There are 11 datasets (26.19%) where both DE-WELM and BWELM achieve perfect 100% Gmean values, which indicates that optimal weights are reached on those datasets. We further focus on the three datasets where DE-WELM performs worse than BWELM: the Gmean values of our approach there are all only slightly lower (less than 1%) than those of BWELM. The Gmean values of our approach on the other datasets are better than those of BWELM, and on some datasets, such as VEHICLE1, YEAST4, and YEAST1289VS7, they are 10% higher than those of BWELM. The overall comparative results indicate that DE-WELM potentially represents a superior solution as an optimal WELM classifier compared with BWELM. This conclusion is further supported by the results for the average Gmean values.

4) Multiclass Imbalanced Learning
The experimental results on multiclass data are shown in Table 4. The results also show that DE-WELM achieves much better performance compared with the other methods, except on NEW-THYROID, where the Gmean value of our method is 0.04% lower than that of BWELM. The results in Table 4 indicate that our DE-WELM potentially represents the best solution as an optimal WELM classifier on multiclass imbalanced datasets.

TABLE 4 Experimental results of multiclass problems: Gmean (%) in sigmoid mode.
DATASETS     SOFTMAX  ELM    WELM-W1  WELM-W2  BWELM  PSO-WELM  DE-WELM
HAYES-ROTH   42.51    67.94  69.65    63.52    71.65  71.20     72.20
NEW-THYROID  80.69    93.11  92.83    92.58    93.15  93.51     93.11
PAGE-BLOCKS  88.46    94.41  89.71    94.33    93.52  92.85     94.60
AVERAGE      70.55    85.15  84.06    83.47    86.10  85.85     86.63

5) Analysis of the Search Ranges
In ELM-related methods, the best values of the trade-off constant C and the number of hidden nodes l are normally obtained by a grid search policy, which pre-sets search ranges for these two parameters and exhaustively searches all possible combinations. Obviously, the computation efficiency is negatively related to the size of the search ranges, while the probability of obtaining the optimal solution is positively related to the search ranges.

³ http://groups.csail.mit.edu/vision/TinyImages/



At the beginning of this experiment section, we mentioned that the settings of C and l in our DE-WELM are searched over {2^{-6}, 2^{-4}, 2^{-2}, 2^{0}, 2^{2}, 2^{4}, 2^{6}} and {10, 110, 210, ..., 810, 910}, respectively, whereas the settings of C and l in the other methods are searched over {2^{-18}, 2^{-16}, ..., 2^{48}, 2^{50}} and {10, 20, ..., 990, 1000}, respectively. The experimental results in Tables 3 and 4 indicate that DE-WELM uses a smaller search range than the other state-of-the-art methods and still achieves much better performance. Thus, our DE-WELM represents a more efficient optimal WELM classifier.

6) Analysis of the Convergence over Iterations
In this section, we further evaluate the convergence over iterations of our DE-WELM. We carry out two experiments on six datasets, i.e., WISCONSIN, HABERMAN, GLASS5, SEGMENT0, GLASS0, and PIMA. We use two sets of C and l parameters for the two experiments and plot the training Gmean values after every iteration of the DE algorithm. The results are shown in Fig. 3. As shown in Fig. 3, a minimum of 9 iterations is required for the training process to converge on WISCONSIN, 10 on PIMA, 6 on GLASS5, 7 on SEGMENT0, 10 on HABERMAN, and 10 on GLASS0.
The results in Fig. 3 indicate that, although the two sets of parameters are quite different, DE-WELM quickly converges to the optimum within 10 iterations on all the test datasets, which indicates that DE-WELM can easily converge toward the optimal solution. Therefore, we set the number of iterations to R_max = 10 in the previous experiments.

7) Comparison with Other Computational-Method-Based WELM
As shown in Tables 3 and 4, our DE-WELM achieves performance superior to PSO-WELM on 36 datasets (85.71%), the exceptions being ECOLI4, YEAST4, GLASS0, VEHICLE0, YEAST1VS7, and NEW-THYROID. Among all datasets, there are 13 datasets (30.95%) where both methods achieve perfect 100% Gmean values. Comparing with PSO-WELM, there are six datasets (14.29%) (YEAST1458VS7, YEAST1289VS7, ECOLI1, PIMA, YEAST1, GLASS1) where DE-WELM achieves an improvement of over 2% on Gmean, 11 datasets (26.19%) (YEAST05679VS4, ECOLI2, ECOLI3, YEAST2VS8, GLASS016VS2, HABERMAN, and so on) where DE-WELM achieves an improvement greater than 0.5% on Gmean, and six datasets (14.29%) where DE-WELM achieves an improvement of more than 0.14% on Gmean. Compared with PSO-WELM, DE-WELM achieves an improvement above 0.84% in average Gmean for the binary datasets, and an improvement greater than 0.78% in average Gmean for the multiclass datasets.

Figure 3 Gmean values after every iteration of our DE-based method on six standard classification datasets.



8) Temporal Complexity Analysis
Regarding the temporal complexity, assume that the temporal computation cost of a single WELM for given parameters C and l is denoted as t_WELM. The temporal complexities of the WELM-based methods can then be expressed as the number of searches multiplied by the basic temporal computation cost of a single ELM as follows,

T = t_{WELM} \times (P_{size} \times I_{max} \times C_{number} \times L_{number})    (19)

where P_size is the size of the candidate population, I_max is the number of iterations, C_number is the number of possible trade-off constants C, and L_number is the number of possible hidden-node counts l. Considering the above equation and our experiment settings, the temporal complexity of our algorithm is t_WELM × 50 × 10 × 7 × 10 = t_WELM × 35,000, and the temporal complexity of PSO-WELM is t_WELM × 50 × 100 × 35 × 100 = t_WELM × 17,500,000. Considering that BWELM starts from a single initial candidate weight and requires 10 iterations, the temporal complexity of BWELM is t_WELM × 1 × 10 × 35 × 100 = t_WELM × 35,000. That is to say, our DE-WELM is 500 times faster than PSO-WELM and equal to BWELM in terms of temporal complexity.
As mentioned for BWELM in [8], and considering the very fast learning speed of ELM, such costs for DE-WELM are quite acceptable.

9) Analysis of the Convergence with Different Initializations
Both our DE-WELM and PSO-WELM employ candidate weight populations that are randomly generated at the beginning of the algorithm. In the case of BWELM, the initial weights are determined directly by the data. In this section, we further evaluate the convergence for different initial candidate weight populations. We set the same randomly generated initial population for the former two methods and perform comparative experiments on two datasets (ECOLI1, GLASS2). The experiments were carried out ten times with different initializations. At the same time, corresponding experiments on BWELM were also performed. The results can be seen in Fig. 4.
In Fig. 4, the horizontal axis indicates the sequence number of the different initial candidate weight populations, and the vertical axis denotes the resulting Gmean values obtained in each experiment. Among the 10 experiments for ECOLI1, DE-WELM achieved better classification performance than PSO-WELM in 9 experiments (90%), and better than BWELM in 7 experiments (70%). Among the 10 experiments for GLASS2, DE-WELM achieved better classification performance than PSO-WELM in 8 experiments (80%), and better than BWELM in 9 experiments (90%).

C. Experimental Results on the MNIST Dataset
We introduce the LeNet-5 model to evaluate the performance of DE-WELM versus three state-of-the-art methods (the classic ELM classifier, the experienced WELM classifier, and BWELM) on the MNIST dataset. The LeNet-5 model [13] is a classic convolutional neural network (CNN) architecture proposed by LeCun et al. in 1998, which was applied to handwritten digit character recognition.
We perform experiments with the LeNet-5 model, regarding the first few layers of the network as feature extractors, to evaluate DE-WELM.
We first arbitrarily select three datasets that do not overlap and balance among the 70,000 MNIST samples (60,000 training samples, 10,000 test samples). These three datasets are referred to as RES_DATA (representation training set), C_TRAIN (classifier training set), and C_TEST (classifier test set).

Figure 4 Experimental results of the convergence with different weight initializations on two datasets (ECOLI1, GLASS2).



The details of the selected datasets are provided in Table 5, where the number of attributes (#ATTRI), the number of classes (#CLASS), the number of selected data (#SAMPLE), and IR are listed.

TABLE 5 Selected imbalanced datasets from MNIST.
DATASETS  #ATTRI  #CLASS  #SAMPLE  IR
RES_DATA  500     10      13470    0.0848
C_TRAIN   500     10      3329     0.0872
C_TEST    500     10      4991     0.0872

Then we train the LeNet-5 model on RES_DATA with a popular deep learning framework known as Caffe [53]. The parameters we used for training the CNN are the default values in Caffe. The LeNet-5 model outputs feature vectors of the original image data, which have 500 dimensions in its first fully connected layer. We regard the output of the LeNet-5 model's first fully connected layer on C_TRAIN as the training set to evaluate our classifier algorithm, and we take the output of the LeNet-5 model's first fully connected layer on C_TEST as the test set to evaluate our DE-WELM. We also evaluate the classification of LeNet-5 itself using the above two datasets to compare with DE-WELM.
The experimental results of the six methods (LeNet-5, ELM, WELM-W1, WELM-W2, BWELM, and DE-WELM) on the selected datasets are given in Table 7. The parameters used in the experiment were the same as those used for the standard classification problem datasets. Table 7 indicates that DE-WELM achieves better performance than the other methods.

D. Experimental Results on the CIFAR-10 Dataset
We introduce the CifarNet model [14] to evaluate the performance of our DE-WELM against four state-of-the-art methods (ELM, WELM-W1, WELM-W2, and BWELM) on CIFAR-10. Proposed by Alex Krizhevsky, CifarNet is a state-of-the-art method for object classification on the CIFAR-10 dataset. It has three convolution layers and three pooling layers for feature extraction, and a fully connected layer on top for classification. We use the simplest implementation to train a CifarNet model as a baseline model, without any preprocessing such as image translations or transformations.
We perform experiments on CifarNet, regarding the first few layers of the network as feature extractors, to evaluate our classifier algorithm.
We first arbitrarily select three datasets that do not overlap and balance (among the 60,000 images of the CIFAR-10 dataset: 50,000 for the training set, 10,000 for the test set), namely RES_DATA (representation training set), C_TRAIN (classifier training set), and C_TEST (classifier test set). The details of the selected datasets are shown in Table 6, where the number of attributes (#ATTRI), the number of classes (#CLASS), the number of selected data (#SAMPLE), and IR are listed.

TABLE 6 Selected imbalanced datasets from CIFAR-10.
DATASETS  #ATTRI  #CLASS  #SAMPLE  IR
RES_DATA  1024    10      12351    0.0839
C_TRAIN   1024    10      3121     0.0918
C_TEST    1024    10      4679     0.0918

We train CifarNet on RES_DATA using Caffe and its default parameters [53]. We extract and reshape the output of the "Pool3" layer, which has 1024 dimensions, and use the extracted features of the CIFAR-10 dataset to evaluate five classifiers (ELM, WELM-W1, WELM-W2, BWELM, and DE-WELM). We regard the output of the CifarNet model's "Pool3" layer on C_TRAIN as the training set, and the output of the CifarNet model's "Pool3" layer on C_TEST as the test set to evaluate DE-WELM. We also evaluated classification by the CifarNet model itself using the above two datasets to compare with our DE-WELM.
The experimental results of the six methods (CifarNet, ELM, WELM-W1, WELM-W2, BWELM, and DE-WELM) on the selected imbalanced datasets are given in Table 7. The parameters in the experiment were the same as in the previous experiments. As shown in Table 7, DE-WELM achieves better performance than the other methods except for CifarNet.

TABLE 7 Experimental results on MNIST and CIFAR-10: Gmean (%) in sigmoid mode.
DATABASES  LENET-5  CIFARNET  ELM    WELM-W1  WELM-W2  BWELM  DE-WELM
MNIST      96.82    —         95.57  97.58    97.5     97.48  97.72
CIFAR-10   —        58.84     37.33  53.94    55.73    55.54  57.25



Figure 5 Confusion matrices obtained by applying six algorithms (CifarNet, ELM, WELM-W1, WELM-W2, BWELM, DE-WELM) on the CIFAR-10 dataset.



The Gmean value of our approach is 1.59% lower than that of CifarNet with its original Softmax classifier. We can argue that the model has been custom-trained for the Softmax classifier layer in the network training stage, by back-propagation from the classification layer to the feature extraction part. Hence, the original classifier structure works better with the feature extractor than our independent classifier. A relatively similar work to this paper is [54]. The authors compared "one-shot" fine-tuning with Softmax to a multi-stage training method using support vector machines (SVMs), and the results indicated that Softmax slightly outperforms SVMs. On the other hand, the number of parameters in the classifier is much smaller than the number in the feature extraction layers, which leads to a fast recovery for the Softmax layer during the classifier training stage. To some degree, the Softmax layer recalls the samples in the network training set. In the case with an independent classifier (DE-WELM), the results are close to the original classifier performance of CifarNet, indicating the effectiveness and robustness of our algorithm.
In Fig. 5, we show the confusion matrices obtained by applying the six methods (CifarNet, ELM, WELM-W1, WELM-W2, BWELM, and DE-WELM). As can be observed, our method achieves better performance than CifarNet for four classes (bird, dog, frog, truck). It is worth noting here that the "bird" and "truck" classes belong to the minority classes. The sample numbers of "bird" and "truck" in C_TRAIN are 95 and 58, respectively, which are the two smallest classes in C_TRAIN. This shows that our DE-WELM is more suitable for imbalanced data classification.
The experimental results on the selected feature data also show that DE-WELM achieves better performance compared with the other four methods.

E. Experimental Results on the YouTube Dataset
In this dataset, visual and audio features are pre-extracted. The visual features are obtained using the Google Inception CNN, which is pre-trained on ImageNet [55], and those features are then reduced by PCA compression into a 1024-dimensional vector. The audio features are extracted from a pre-trained VGG [56] network. In the official split, the dataset is divided into three parts: 70% for training, 20% for validation, and 10% for testing. In our experiment, we randomly selected 15,801 videos that have unique top-level vertical categories, with visual features from the official dataset, to validate our algorithm on video classification problems. The details of the selected datasets are shown in Table 8.

TABLE 8 Selected imbalanced datasets from YouTube.
DATASETS   #ATTRI  #CLASS  #SAMPLE  IR
TRAINDATA  1024    24      10534    0.0018
TESTDATA   1024    24      5267     0.0018

The experimental results of the five methods, i.e., WELM-W1, WELM-W2, BWELM, PSO-WELM, and DE-WELM, on the selected imbalanced datasets are given in Table 9. The parameters for the experiment were the same as in the experiments on the standard classification datasets. As shown in Table 9, our DE-WELM achieves better performance than the other methods.

TABLE 9 Experimental results on the YouTube dataset.
EXPERIMENT INDEX        WELM-W1  WELM-W2  BWELM     PSO-WELM    DE-WELM
GMEAN (%)               47.78    47.40    46.25     47.84       49.51
TOTAL RUNNING TIME (S)  2287.57  2236.45  23301.35  3535679.05  24805.23

The experimental results indicate that our DE-WELM model is an effective and efficient algorithm for solving the class imbalance problem on varied datasets. However, owing to the embedded optimization search procedure in the algorithm, DE-WELM requires more training time. Table 9 provides the total running time of the various learning algorithms on the YouTube datasets. From Table 9, the execution time of DE-WELM is several times higher than that of WELM-W1 and WELM-W2. As our proposed DE-WELM employs the validation set in the training step, our algorithm is also a little slower than BWELM.

VI. Conclusions
In this paper, we present a formal mathematical model to obtain the optimal weight scheme in WELM for imbalanced learning problems. We also propose DE-WELM to calculate the approximate optimal weight matrix of the model. The effectiveness of DE-WELM is proven by experiments conducted using 39 binary datasets and 3 multiclass datasets with different degrees of imbalance, as well as two large-scale image datasets from MNIST and CIFAR-10 and one large-scale video dataset from YouTube-8M.
Our DE-based approximate weight calculation algorithm requires only 10 iterations to converge to a solution and uses smaller search ranges for the trade-off constant C and the number of hidden nodes l than other state-of-the-art methods. As the initial population in our approach contains the two experienced weighting schemes W1 and W2, our algorithm can preserve the advantages of WELM [7]. We also performed experiments comparing other computational methods, and the experimental results show that DE-WELM can improve overall classification performance significantly with less temporal complexity. Future work will focus on more effectively structuring the initial population of the DE algorithm, and on applying DE-WELM to datasets with large variety in class distributions.



Acknowledgments
This work was supported in part by the National Natural Science Foundation of China under Grants U1509210 and U1609210, and the National Key Research and Development Program of China under Grant 2017YFB1302003.

References
[1] G.-B. Huang, Q.-Y. Zhu, and C. K. Siew, "Extreme learning machine: Theory and applications," Neurocomputing, vol. 70, no. 1–3, pp. 489–501, Dec. 2006.
[2] G.-B. Huang, H. Zhou, X. Ding, and R. Zhang, "Extreme learning machine for regression and multiclass classification," IEEE Trans. Syst., Man, Cybern., Syst. B, vol. 42, no. 2, pp. 513–529, Mar. 2012.
[3] G.-B. Huang and L. Chen, "Convex incremental extreme learning machine," Neurocomputing, vol. 70, no. 16–18, pp. 3056–3062, Aug. 2007.
[4] G.-B. Huang, D. Wang, and Y. Lan, "Extreme learning machines: A survey," Int. J. Mach. Learn. Cybern., vol. 2, no. 2, pp. 107–122, May 2011.
[5] C. R. Rao and S. K. Mitra, Generalized Inverse of Matrices and its Applications. New York, NY: Wiley, 1971.
[6] D. Serre, Matrices: Theory and Applications. New York, NY: Springer-Verlag, 2002.
[7] W. Zong, G.-B. Huang, and Y. Chen, "Weighted extreme learning machine for imbalance learning," Neurocomputing, vol. 101, no. 3, pp. 229–242, Aug. 2013.
[8] K. Li, X. Kong, Z. Lu, L. Wenyin, and J. Yin, "Boosting weighted ELM for imbalanced learning," Neurocomputing, vol. 128, no. 5, pp. 15–21, May 2014.
[9] K.-A. Toh, "Deterministic neural classification," Neural Comput., vol. 20, no. 6, pp. 1565–1595, Apr. 2008.
[10] W. Deng, Q. Zheng, and L. Chen, "Regularized extreme learning machine," in Proc. IEEE Symp. Computational Intelligence and Data Mining, Nashville, TN, Apr. 2009, pp. 389–395.
[11] H. He and E. A. Garcia, "Learning from imbalanced data," IEEE Trans. Knowl. Data Eng., vol. 21, no. 9, pp. 1263–1284, May 2009.
[12] R. Storn and K. Price, "Differential evolution: A simple and efficient adaptive scheme for global optimization over continuous spaces," International Computer Science Institute, Berkeley, CA, Tech. Rep. TR-95-012, May 1995.
[13] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, "Gradient-based learning applied to document recognition," Proc. IEEE, vol. 86, no. 11, pp. 2278–2324, Nov. 1998.
[14] A. Krizhevsky, "Learning multiple layers of features from tiny images," University of Toronto, Toronto, ON, Tech. Rep., Apr. 2009.
[15] S. Abu-El-Haija, N. Kothari, J. Lee, A. Natsev, G. Toderici, B. Varadarajan, and S. Vijayanarasimhan, "Youtube-8m: A large-scale video classification benchmark," CoRR, vol. abs/1609.08675, Nov. 2016.
[16] R. Fletcher, Practical Methods of Optimization, Volume 2: Constrained Optimization. New York, NY: Wiley, 1981.
[17] E. Cambria, G.-B. Huang, L. L. C. Kasun, H. Zhou, C. M. Vong, J. Lin, J. Yin, Z. Cai, Q. Liu, K. Li, V. C. M. Leung, L. Feng, Y.-S. Ong, M.-H. Lim, A. Akusok, A. Lendasse, F. Corona, R. Nian, Y. Miche, P. Gastaldo, R. Zunino, S. Decherchi, X. Yang, K. Mao, B.-S. Oh, J. Jeon, K.-A. Toh, A. B. J. Teoh, J. Kim, H. Yu, Y. Chen, and J. Liu, "Extreme learning machines [trends & controversies]," IEEE Intell. Syst., vol. 28, no. 6, pp. 30–59, Nov. 2013.
[18] G.-B. Huang, E. Cambria, K. A. Toh, B. Widrow, and Z. Xu, "New trends of learning in computational intelligence [guest editorial]," IEEE Comput. Intell. Mag., vol. 10, no. 2, pp. 16–17, May 2015.
[19] I. Chaturvedi, E. Ragusa, P. Gastaldo, R. Zunino, and E. Cambria, "Bayesian network based extreme learning machine for subjectivity detection," J. Franklin Inst., July 2017. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0016003217303009
[20] E. Soria-Olivas, J. Gomez-Sanchis, J. D. Martin, J. Vila-Frances, M. Martinez, J. R. Magdalena, and A. J. Serrano, "BELM: Bayesian extreme learning machine," IEEE Trans. Neural Netw., vol. 22, no. 3, pp. 505–509, Mar. 2011.
[21] Z. Bai, G.-B. Huang, D. Wang, H. Wang, and M. B. Westover, "Sparse extreme learning machine for classification," IEEE Trans. Cybern., vol. 44, no. 10, pp. 1858–1870, Oct. 2014.
[22] Z. Bai, G.-B. Huang, and D. Wang, "Sparse extreme learning machine for regression," in Proc. Extreme Learning Machine, Hangzhou, China, vol. 2, pp. 471–490, Dec. 2015.
[23] P. Gastaldo, R. Zunino, E. Cambria, and S. Decherchi, "Combining ELMs with random projections," IEEE Intell. Syst., vol. 28, no. 6, pp. 46–48, Nov. 2013.
[24] R. E. Schapire, "The boosting approach to machine learning: An overview," in Nonlinear Estimation and Classification. New York, NY: Springer, 2003, pp. 149–171.
[25] J. Vesterstrom and R. Thomsen, "A comparative study of differential evolution, particle swarm optimization, and evolutionary algorithms on numerical benchmark problems," in Proc. IEEE Congr. Evolutionary Computation, Portland, OR, June 2004, vol. 2, pp. 1980–1987.
[26] M. N. Alam, B. Das, and V. Pant, "A comparative study of metaheuristic optimization approaches for directional overcurrent relays coordination," Electr. Power Syst. Res., vol. 128, pp. 39–52, June 2015.
[27] S. M. Lucas, "Evolving finite state transducers: Some initial explorations," in Proc. European Conf. Genetic Programming, Essex, Apr. 2003, pp. 130–141.
[28] J. Ilonen, J.-K. Kamarainen, and J. Lampinen, "Differential evolution training algorithm for feed-forward neural networks," Neural Process. Lett., vol. 17, no. 1, pp. 93–105, Mar. 2003.
[29] R. Storn, "On the usage of differential evolution for function optimization," in Proc. North American Fuzzy Information Processing, Berkeley, CA, June 1996, pp. 519–523.
[30] T. Rogalsky, S. Kocabiyik, and R. W. Derksen, "Differential evolution in aerodynamic optimization," Can. Aeronaut. Space J., vol. 46, no. 4, pp. 183–190, Apr. 2000.
[31] J. Cao, Z. Lin, and G.-B. Huang, "Self-adaptive evolutionary extreme learning machine," Neural Process. Lett., vol. 36, no. 3, pp. 285–305, July 2012.
[32] Q.-Y. Zhu, A. K. Qin, P. N. Suganthan, and G.-B. Huang, "Evolutionary extreme learning machine," Pattern Recog., vol. 38, no. 10, pp. 1759–1763, Oct. 2005.
[33] Y. Qu, C. Shang, W. Wu, and Q. Shen, "Evolutionary fuzzy extreme learning machine for mammographic risk analysis," Int. J. Fuzzy Syst., vol. 13, no. 4, pp. 282–291, Dec. 2011.
[34] V. L. Huang, A. K. Qin, and P. N. Suganthan, "Self-adaptive differential evolution algorithm for constrained real-parameter optimization," in Proc. IEEE Congr. Evolutionary Computation, Vancouver, BC, Canada, July 2006, pp. 17–24.
[35] A. K. Qin, V. L. Huang, and P. N. Suganthan, "Differential evolution algorithm with strategy adaptation for global numerical optimization," IEEE Trans. Evol. Comput., vol. 13, no. 2, pp. 398–417, May 2009.
[36] Y. Wang, Z. Cai, and Q. Zhang, "Differential evolution with composite trial vector generation strategies and control parameters," IEEE Trans. Evol. Comput., vol. 15, no. 1, pp. 55–66, May 2011.
[37] R. A. Sarker, S. M. Elsayed, and T. Ray, "Differential evolution with dynamic parameters selection for optimization problems," IEEE Trans. Evol. Comput., vol. 18, no. 5, pp. 689–707, May 2014.
[38] J. Zhang and A. C. Sanderson, "JADE: Adaptive differential evolution with optional external archive," IEEE Trans. Evol. Comput., vol. 13, no. 5, pp. 945–958, May 2009.
[39] Y. Wang, Z. Cai, S. Member, Q. Zhang, and S. Member, "Differential evolution with composite trial vector generation strategies and control parameters," IEEE Trans. Evol. Comput., vol. 15, no. 1, pp. 55–66, Mar. 2011.
[40] Z. Yang, K. Tang, and X. Yao, "Self-adaptive differential evolution with neighborhood search," in Proc. IEEE Congr. Evolutionary Computation, Hong Kong, China, June 2008, pp. 1110–1116.
[41] S. Das, A. Abraham, U. K. Chakraborty, and A. Konar, "Differential evolution using a neighborhood-based mutation operator," IEEE Trans. Evol. Comput., vol. 13, no. 3, pp. 526–553, June 2009.
[42] D. W. Marquardt, "An algorithm for least-squares estimation of nonlinear parameters," J. Soc. Ind. Appl. Math., vol. 11, no. 2, pp. 431–441, Aug. 1963.
[43] C. M. Bishop, Pattern Recognition and Machine Learning. New York, NY: Springer, 2006.
[44] J. Kennedy and R. Eberhart, "Particle swarm optimization," in Proc. IEEE Int. Conf. Neural Networks, Perth, Australia, Nov. 1995, vol. 4, pp. 1942–1948.
[45] R. Eberhart and J. Kennedy, "A new optimizer using particle swarm theory," in Proc. 6th Int. Symp. Micro Machine and Human Science, Nagoya, Japan, Oct. 1995, pp. 39–43.
[46] J.-H. Seo, C.-H. Im, C.-G. Heo, J.-K. Kim, H.-K. Jung, and C.-G. Lee, "Multimodal function optimization based on particle swarm optimization," IEEE Trans. Magn., vol. 42, no. 4, pp. 1095–1098, Apr. 2006.
[47] Z. H. Zhan, J. Zhang, Y. Li, and Y. H. Shi, "Orthogonal learning particle swarm optimization," IEEE Trans. Evol. Comput., vol. 15, no. 6, pp. 832–847, Dec. 2011.
[48] C. Li, S. Yang, and T. T. Nguyen, "A self-learning particle swarm optimizer for global optimization problems," IEEE Trans. Syst., Man, Cybern., Syst. B, vol. 42, no. 3, pp. 627–646, June 2012.
[49] D. Ciresan, U. Meier, and J. Schmidhuber, "Multi-column deep neural networks for image classification," in Proc. IEEE Int. Conf. Computer Vision and Pattern Recognition, Providence, RI, June 2012, pp. 3642–3649.
[50] D. Keysers, T. Deselaers, C. Gollan, and H. Ney, "Deformation models for image recognition," IEEE Trans. Pattern Anal. Mach. Intell., vol. 29, no. 8, pp. 1422–1435, Nov. 2007.
[51] A. T. Visual, P. Y. Simard, D. Steinkraus, and J. C. Platt, "Best practices for convolutional neural networks," in Proc. IEEE Int. Conf. Document Analysis and Recognition, Edinburgh, Scotland, Aug. 2003, pp. 958–962.
[52] D. C. Ciresan, U. Meier, L. M. Gambardella, and J. Schmidhuber, "Deep big simple neural nets for handwritten digit recognition," Neural Comput., vol. 22, no. 12, pp. 3207–3220, Dec. 2010.
[53] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell, "Caffe: Convolutional architecture for fast feature embedding," in Proc. 22nd ACM Int. Conf. Multimedia, Orlando, FL, Nov. 2014, pp. 675–678.
[54] R. Girshick, "Fast R-CNN," in Proc. IEEE Int. Conf. Computer Vision, Santiago, Chile, Dec. 2015, pp. 1440–1448.
[55] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and F.-F. Li, "Imagenet: A large-scale hierarchical image database," in Proc. IEEE Int. Conf. Computer Vision and Pattern Recognition, Miami, FL, Nov. 2009, pp. 248–255.
[56] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," in Proc. IEEE Int. Conf. Learning Representations, CO, June 2015, pp. 1150–1210.



Haibo He
Department of Electrical, Computer and Biomedical
Engineering, University of Rhode Island, Kingston, RI, USA

Xiangnan Zhong
Department of Electrical Engineering,
University of North Texas, Denton, TX, USA

Learning Without External Reward

Abstract
In the traditional reinforcement learning paradigm, a reward signal is applied to define the goal of the task. Usually, the reward signal is a "hand-crafted" numerical value or a pre-defined function: it tells the agent how good or bad a specific action is. However, we believe there exist situations in which the environment cannot directly provide such a reward signal to the agent. Therefore, the question is whether an agent can still learn without the external reward signal or not. To this end, this article develops a self-learning approach which enables the agent to adaptively develop an internal reward signal based on a given ultimate goal, without requiring an explicit external reward signal from the environment. In this article, we aim to convey the self-learning idea in a broad sense, which could be used in a wide range of existing reinforcement learning and adaptive dynamic programming algorithms and architectures. We describe the idealized forms of this method mathematically, and also demonstrate its effectiveness through a triple-link inverted pendulum case study.

I. Introduction
When we first think about the nature of learning, we probably start with the idea that we learn by interacting with the environment [1]–[3]. For instance, when we try to hold a conversation with others, we need to decide what to say based on the people we are talking to as well as the conversational context. Over the past several decades, many researchers have explored computational approaches to learning from active interactions with the environment, such as reinforcement learning (RL) and adaptive dynamic programming (ADP). Imagine we hope to train a monkey to learn the result of "1+1". At first, we present two cards with the number "1" on them to the monkey: if it picks a card with the number "2" on it from a box, we present a banana as a reward. In this way, although the monkey does not know the exact meaning of the math, it knows that a banana will be given when it picks the appropriate card. Therefore, the banana reward plays an important role in the learning process.
In general, the key element of RL is defined by the reward signal, which is given by the environment [1], [4]–[6]. In order to achieve goals, the agent chooses a set of actions that maximize the expected total rewards it receives over time. Therefore, RL achieves goals by defining the interaction between an agent and its environment in terms of states, actions, and rewards [1], [7]. Recently, the development of deep RL [8]–[10] has attracted increasing attention, especially for the level of intelligence it has achieved.
So far, many RL/ADP designs focus on how to calculate and maximize the cumulative rewards [11]–[15]. Usually, it is assumed that the agent knows what the immediate reward is or how the immediate reward is computed as a function of the actions and the states in which they are taken [16]. There are several approaches in the literature to define such a reward signal. For instance, a typical approach is to use a binary signal, e.g., using a "0" or "−1" to represent "success" or "failure" of an action [17], or a semi-binary reward signal, e.g., using "0, −0.4, −1" as a more informative representation [18]. Another way to define the reward signal is to use a quadratic function based on the system states and actions [19]–[22].
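The three kinds of hand-crafted reward signals mentioned above are easy to write down; the sketch below is purely illustrative (the function names, the grading levels, the sign convention, and the weighting matrices Q and R are our own choices, not from the article).

```python
import numpy as np

def binary_reward(failed):
    """Binary signal: 0 for success, -1 for failure of an action."""
    return -1.0 if failed else 0.0

def semi_binary_reward(level):
    """Semi-binary signal: a slightly more informative grading, e.g. 0, -0.4, -1."""
    return {"good": 0.0, "marginal": -0.4, "failed": -1.0}[level]

def quadratic_reward(x, a, Q, R):
    """Quadratic reward based on system states and actions (larger deviations cost more).
    The negative-cost sign convention here is our illustrative choice."""
    return -(x @ Q @ x + a @ R @ a)

# Example with hypothetical weighting matrices Q and R
x, a = np.array([0.1, -0.2]), np.array([0.05])
print(binary_reward(False), semi_binary_reward("marginal"), quadratic_reward(x, a, np.eye(2), np.eye(1)))
```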

Digital Object Identifier 10.1109/MCI.2018.2840727


Date of publication: 18 July 2018
Corresponding Author: Haibo He (Email: haibohe@uri.edu)



Such a quadratic reward signal is more adaptive and is very popular in system stabilization. Recently, a goal representation heuristic dynamic programming (GrHDP) method [23], [24] has been proposed in the literature. By introducing an internal reward signal, this method can automatically and adaptively provide information to the agent instead of hand-crafting it. Generally, all these existing approaches require the environment to define the reward signal with a priori knowledge or domain expertise. This can be seen as an external teacher or trainer who provides the reward value or function for each action performed by the agent. However, what if this kind of teacher is unavailable, or he/she cannot provide such direct feedback to the agent for some reason? Can the agent still learn by itself in this case under the RL framework?
In this article, by considering the relationship between the goal and the system states and actions, we propose a method that enables the agent to learn only with the ultimate goal but no explicit external reward signals. To this end, the key contribution of this work is the development of a self-learning approach based on the ultimate goal, rather than obtaining the external supervision (reward signals) directly from the environment. We further develop the computational formulation of this self-learning idea based on a specific ADP design, and validate its performance and effectiveness with a triple-link inverted pendulum case study. We would like to note that this article focuses on the self-learning of an agent's own goals, in contrast to observational learning or apprenticeship learning about others' goals such as in inverse RL [25].

II. The Key Idea: Self-Learning Design
In contrast to the traditional RL/ADP design, in which a specific reward signal passes from the environment to the agent to tell the effects ("good" or "bad"), our proposed self-learning ADP method enables the agent to establish the reward signal itself, which is called the internal reward signal in this article. The comparison of the agent-environment interaction in the traditional and self-learning ADP designs is described in Fig. 1. We can observe that, instead of receiving the immediate reward signal from the environment (Fig. 1(a)), the agent in the self-learning ADP method estimates an internal reward s(t) to help achieve the goal (Fig. 1(b)). Hence, the communications between the environment and the agent at each time step are only the states and actions, which is fundamentally different from the existing RL and ADP methods.

Figure 1 The conceptual diagram of agent-environment interaction: (a) Traditional RL/ADP design: the external reinforcement signal r(t) passes from the environment to the agent; (b) Self-learning RL/ADP design: no external reinforcement signal is needed during this process, and the agent estimates an internal reward signal s(t) itself.

Note that, in the traditional RL/ADP design, the use of the reward signal is to define the agent's goal for a task. However, in the self-learning ADP method, the reward signal is unavailable in the interaction. Since the reward signal reflects the evaluation of an action's effect, which is always paired with the goal, the agent in this learning process needs to learn what the reward signal is according to the ultimate goal. The effect of the action is compared to the goal within a common reference frame in order to assess the achievement [26]. The agent then learns how "good" or "bad" the action is at each time step by itself, via the guidance from the ultimate goal. After that, based on the estimated internal reward signal and the system state, the agent generates the control action. That is to say, in order to achieve the ultimate goal, instead of learning to make the decision directly, the agent first needs to learn what the best reward signal is to represent the information upon which to base a certain action.
More specifically, the interaction between the agent and the environment happens sequentially, in discrete time steps. At each time step t, the agent selects an action a(t) according to the representation of the environment x(t). In consequence, the agent finds itself in a new state x(t+1) and then estimates the corresponding internal reward signal s(t+1) at the next time step. For example, when we train a battery-charged robot to collect trash, the robot decides at each time step whether it should move forward to search for more trash or find its way back to its battery charger. Its decision is based on its position and speed. As a consequence of its action and state, the robot estimates the reward signal s(t) = f(x(t), a(t)). Initially, the robot randomly assigns a value s(0), since no prior knowledge is available about what to do. However, after trial-and-error learning, we want the robot to learn how to represent an action in a given state.
Let us stipulate that the agent's goal is to maximize the reward it estimates in the long term. If the reward sequence after time step t is s(t+1), s(t+2), s(t+3), ..., then the value function can be described as

V(t) = s(t) + \gamma s(t+1) + \gamma^{2} s(t+2) + \gamma^{3} s(t+3) + \cdots = s(t) + \gamma V(t+1)    (1)

where 0 < γ < 1 is the discount factor. The discount factor determines whether an immediate reward is more valuable than the rewards received in the far future.


This may reduce the possibility of accessing future rewards. When γ approaches 1, future rewards gain influence in the selection of actions, and the agent therefore becomes more farsighted [1]. We note here that s(t) is the estimated internal reward signal, rather than the external reward signal as in the classic RL/ADP literature [1], [12], [13], [18], [27].

In this way, the optimal value function can be defined as

V*(t) = max_{a(t), s(t)} { s*(t) + γ V*(t+1) }.    (2)

Here s*(t) is the optimal internal reward signal that the agent learns in order to represent how good or bad an action is in a given state. Our objective is to ensure this signal can guide the agent to achieve the goal, i.e.,

s*(t) = argmax_{s(t)} { V*(t) }.    (3)

Then, according to Bellman's optimality principle [28], the control action a(t) can be given as

a*(t) = argmax_{a(t)} { s*(t) + γ V*(t+1) }.    (4)

Note that since the internal reward signal s(t) is estimated by the agent, it is random at first during the training process. However, after trial-and-error learning, the agent can learn how to represent the internal reward signal based on the ultimate goal and update it accordingly. In this way, the internal reward signal is automatically decided within the agent. In Fig. 2, we use the neural network technique as an example to illustrate this idea; in different applications, one can choose any type of function approximator depending on the scenario and/or design. The internal reward signal estimation is achieved by the goal network, with inputs x(t), a(t) and an output s(t). The value function V(t) is generated by the critic network to adjust the updating process. Furthermore, the action network provides the control action. Therefore, instead of obtaining an explicit external reward signal directly from the environment, the agent in the proposed method self-learns an internal reward signal to guide its action based on the ultimate goal. This clearly distinguishes self-learning ADP from the existing ADP methods: (i) no external supervision is needed in this fully autonomous learning, and the basic principle becomes finding an internal reward function which makes the agent achieve the ultimate goal; (ii) the communication burden is reduced, since the interaction between the agent and the environment at each time step only involves the states and actions. This is important when the communication bandwidth or power resources are constrained.
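As a quick, self-contained illustration of the value function in Eq. (1), the sketch below computes V(t) from a finite sequence of internal reward estimates and checks the recursion V(t) = s(t) + γV(t+1). The reward values and the discount factor here are made up for illustration only and are not taken from the article.

```python
import numpy as np

def discounted_values(s, gamma=0.9):
    """V(t) = s(t) + gamma*s(t+1) + gamma^2*s(t+2) + ...,
    computed backwards with the recursion V(t) = s(t) + gamma*V(t+1) (Eq. (1))."""
    V = np.zeros(len(s))
    running = 0.0
    for t in reversed(range(len(s))):
        running = s[t] + gamma * running
        V[t] = running
    return V

# Hypothetical internal-reward estimates produced by a goal network.
s = np.array([0.1, 0.0, -0.2, 0.5])
V = discounted_values(s, gamma=0.9)
# Verify the recursion V(t) = s(t) + gamma*V(t+1) for t = 0..T-2.
assert np.allclose(V[:-1], s[:-1] + 0.9 * V[1:])
```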

Figure 2 Neural network training process. The goal network takes the system state x(t) and control action a(t) as its input, and outputs an estimated internal reinforcement signal s(t). The critic network similarly applies the neural network structure, but takes an additional input s(t) and outputs the value function V(t). Moreover, the action network uses a similar neural network structure with input x(t) and output a(t).



III. Learning Architecture and Implementation

The key to implementing the self-learning ADP design is to estimate Eqs. (2), (3), and (4). In our current research, we use three neural networks to design the goal network, action network, and critic network, as shown in Fig. 3. Here, we assume that the solution of the task exists and is unique. The mathematical analysis of these three neural networks is provided as follows.

A. Goal Network

The goal network estimates the internal reward signal s(t) in Eq. (3) based on the ultimate goal U_c. Since the internal reward signal is a response to the external environment, we set the inputs of the goal network as the state x(t) and action a(t). With this information, the goal network provides an estimation of the internal reward signal in order to optimize the value function V(t). Therefore, the error can be described as

e_g(t) = V(t) - U_c,   E_g(t) = (1/2) e_g²(t)    (5)

where U_c is the ultimate goal. The value of U_c is critical in the design and can vary across applications. We apply backpropagation to train the goal network and update the goal network weights ω_g(t) as follows:

ω_g(t+1) = ω_g(t) - β_g (∂E_g(t) / ∂ω_g(t))    (6)

where β_g is the learning rate of the goal network. Following the chain rule of backpropagation, we further have

∂E_g(t)/∂ω_g(t) = [∂E_g(t)/∂V(t)] [∂V(t)/∂s(t)] [∂s(t)/∂ω_g(t)].    (7)

B. Critic Network

The value function V(t) in Eq. (2) is estimated by the critic network. To closely connect the critic network with the goal network, we set the estimated internal reward signal s(t) as one of the inputs to the critic network. Therefore, the inputs of the critic network include the state x(t), the action a(t), and the internal reward signal s(t). The critic network aims to minimize the following error function over time:

e_c(t) = γV(t) - [V(t-1) - s(t-1)],   E_c(t) = (1/2) e_c²(t).    (8)

Note that, although both V(t) and V(t-1) depend on the critic network's weights ω_c, we do not account for the dependence of V(t-1) on ω_c when minimizing the error in Eq. (8) [16]. Therefore, the updating scheme of the critic network can be defined as

ω_c(t+1) = ω_c(t) - β_c (∂E_c(t) / ∂ω_c(t))    (9)

where β_c is the learning rate of the critic network. The chain rule of backpropagation can be applied to further calculate ∂E_c(t)/∂ω_c(t):

∂E_c(t)/∂ω_c(t) = [∂E_c(t)/∂V(t)] [∂V(t)/∂ω_c(t)].    (10)

C. Action Network

At any time instant, the action network provides the control action a(t) for the system based on the system states x(t). Therefore, we define the error function as e_a(t) = V(t) - U_c and E_a(t) = (1/2) e_a²(t). The weight updating rule can be defined as

ω_a(t+1) = ω_a(t) - β_a (∂E_a(t) / ∂ω_a(t))    (11)

where β_a is the learning rate of the action network. With the chain rule of backpropagation, we have

∂E_a(t)/∂ω_a(t) = [∂E_a(t)/∂V(t)] [∂V(t)/∂a(t)] [∂a(t)/∂ω_a(t)].    (12)

Figure 3 Implementation architecture of the self-learning adaptive dynamic programming design: three neural networks are established as the action network, the critic network, and the goal network.



The algorithm for the proposed self-learning ADP approach is given in Algorithm 1, where step 1 sets up the variables and ingredients of the three neural networks, and steps 2-26 describe the online training process.

Algorithm 1 Self-learning adaptive dynamic programming.

1:  a ← f_a(x_a, ω_a), control action selection;
      f_a: the action network;
      x_a: inputs of the action network, x_a = [x];
      ω_a: weights in the action network;
      a: control action;
    s ← f_g(x_g, ω_g), internal reward signal selection;
      f_g: the goal network;
      x_g: inputs of the goal network, x_g = [x, a];
      ω_g: weights in the goal network;
      s: internal reward signal;
    V ← f_c(x_c, ω_c), value function mapping;
      f_c: the critic network;
      x_c: inputs of the critic network, x_c = [x, a, s];
      ω_c: weights in the critic network;
      V: value function;
    N_a, N_g, and N_c: internal cycles of the action, goal, and critic networks, respectively;
    T_a, T_g, and T_c: internal training error thresholds for the action, goal, and critic networks, respectively;
2:  for 1 to MaxRun do
3:    Initialize ω_a(0), ω_g(0), ω_c(0);
4:    x(t) ← System(x(t-1), a(t-1));   // execute action and obtain current state
5:    // online training of goal network
6:    while (E_g(t) > T_g & cyc < N_g) do
7:      ω_g(t) = ω_g(t) + Δω_g(t) via (6) and (7);   // update the weights recursively
8:      s(t) ← f_g(x_g(t), ω_g(t));
9:      e_g(t) = V(t) - U_c,  E_g(t) = (1/2) e_g²(t);
10:   end while
11:   // online training of critic network
12:   while (E_c(t) > T_c & cyc < N_c) do
13:     ω_c(t) = ω_c(t) + Δω_c(t) via (9) and (10);
14:     s(t) ← f_g(x_g(t), ω_g(t));  V(t) ← f_c(x_c(t), ω_c(t));
15:     e_c(t) = γV(t) - [V(t-1) - s(t-1)],  E_c(t) = (1/2) e_c²(t);
16:   end while
17:   // online training of action network
18:   while (E_a(t) > T_a & cyc < N_a) do
19:     ω_a(t) = ω_a(t) + Δω_a(t) via (11) and (12);
20:     a(t) ← f_a(x_a(t), ω_a(t));  s(t) ← f_g(x_g(t), ω_g(t));  V(t) ← f_c(x_c(t), ω_c(t));
21:     e_a(t) = V(t) - U_c,  E_a(t) = (1/2) e_a²(t);
22:   end while
23:   ω_a(t+1) = ω_a(t);
24:   ω_g(t+1) = ω_g(t);
25:   ω_c(t+1) = ω_c(t);   // update weights through each trial
26: end for

IV. Results and Discussions

We consider a triple-link inverted pendulum case here. As seen in Fig. 4, this pendulum includes three poles connected by three links. This model is highly unstable and exhibits non-negligible nonlinearities; therefore, this benchmark is frequently used to evaluate the performance of new control strategies. The system model and system parameters are identical to those in [18]. In this task, our goal is to balance the inverted pendulum under the following constraints: (1) the cart position should remain within 1.0 m on either side of the track's center point; and (2) each link angle should be within the range of [-20°, 20°] with respect to the vertical axis. If either of these two conditions is violated, we consider that the current controller has failed to accomplish the task.

Figure 4 Triple-link inverted pendulum case and its interaction with the agent.

In our study, the triple-link inverted pendulum is the environment, and the designed controller, which produces the control action, is the agent. The control signal a(t) (in volts) is converted into a force applied by a DC servo motor through an analog amplifier with gain K_s = 24.7125 N/V. Each link only rotates in a vertical plane. At each time step t, the controller receives an eight-dimensional vector which represents the state, i.e., the position of the cart on the track (x), the vertical angles of the three links (θ1, θ2, θ3), the cart velocity (ẋ), and the angular velocities of the three links (θ̇1, θ̇2, θ̇3). On this basis, the controller chooses an action. Note that the sign and magnitude of the action denote the direction and magnitude of the force, respectively. One time step later, as a result of the action, the controller receives a new state vector and estimates the internal reward signal according to the current state and the produced action. Therefore, during this training process, there is no external reward signal transmitted from the environment to the agent.

To test the performance of the self-learning ADP approach, one hundred runs were conducted. We set the initial cart position at the center of the track and its velocity to zero. For different runs, the initial values of the three link angles and angular velocities were chosen randomly within [-1°, 1°] and [-0.5, 0.5] rad/s, respectively. The maximum number of trials in each run was 3000. A run is considered successful if the controller can balance the triple-link inverted pendulum within the assigned number of trials.



The structures of the goal, critic, and action networks established for this case follow Fig. 2, with the number of neurons in each layer being 9-14-1 (i.e., 9 input nodes, 14 hidden nodes, and 1 output node), 10-16-1, and 8-14-1, respectively. Since the optimal equilibrium of all states is near zero, we define the mathematical representation of the ultimate goal in this example as U_c = 0.

Our simulation demonstrated that 92% of the runs resulted in a successful balance. The average number of trials to success was 1071.6. Table 1 shows the comparative results of our method with respect to those reported in [18] and [23].

Table 1 Comparison with existing ADP learning algorithms.

|                       | Self-learning ADP | Traditional ADP [18] | Goal representation ADP [23] |
|-----------------------|-------------------|----------------------|------------------------------|
| Success rate          | 92%               | 97%                  | 99%                          |
| Number of trials      | 1071.6            | 1194                 | 571.4                        |
| Need external rewards | No                | Yes                  | Yes                          |

Table 1 shows that the performance of the proposed self-learning ADP method is not as good as that of the existing ADP methods with pre-defined reward signals. This is because the agent in our approach needs to learn what the reward signal is. However, the key observation from this research is that the learning process of the proposed method can be accomplished without an explicit external reward provided directly by the environment. Instead, this reward is automatically and adaptively learned and developed by the goal network according to the ultimate goal. This is the key fundamental contribution of this article.

To further examine the performance of the self-learning ADP method, a typical trajectory of each state variable for the task is shown in Fig. 5, including (a) the position of the cart on the track; (b)-(d) the vertical angles of the 1st, 2nd, and 3rd links of the pendulum, respectively; (e) the velocity of the cart; and (f)-(h) the angular velocities of the 1st, 2nd, and 3rd links of the pendulum, respectively. We observe that the cart position on the track and all the joint angles of the links are balanced within a small range of the balance point. This indicates that the proposed method can effectively control the system to achieve the desired performance, and that the controller can automatically estimate the effect of an action during the learning process.

Figure 5 Typical trajectory of a successful trial on the triple-link inverted pendulum balancing task during the learning process: (a) the position of the cart, (b) the vertical angle of the 1st link joint to the cart, (c) the vertical angle of the 2nd link joint to the 1st link, (d) the vertical angle of the 3rd link joint to the 2nd link, (e) the cart velocity, (f) the angular velocity of the 1st link joint to the cart, (g) the angular velocity of the 2nd link joint to the 1st link, (h) the angular velocity of the 3rd link joint to the 2nd link.

V. Summary and Conclusion

We have designed a self-learning method without an explicit external reward signal directly given by the environment. Compared with the traditional RL/ADP methods in the literature, the key contribution of our approach is that we introduce a new goal network to automatically and adaptively develop an internal reward signal based on the ultimate goal to facilitate the self-learning process. Therefore, instead of receiving an explicit reward signal directly from the external environment, the agent itself can learn an internal reward signal s(t) by the goal network according to the ultimate goal U_c and guide itself to accomplish the task. This also means that in our approach, only two interaction elements, state x(t) and action a(t), are required at each time step during the learning process. From simulations, we observe that the success rate of the designed self-learning method is lower than that of the traditional ADP methods. This is because the agent in the proposed method needs to learn how to represent the reward signals by itself based on the ultimate goal, rather than receiving them explicitly from an external teacher as in the existing methods.



The implications of self-learning with no external supervision available at each time step during the learning process are significant, resulting in potentially far-reaching applications across a wide range of fields.

Acknowledgment

This work was supported in part by the National Science Foundation (NSF) under CMMI 1526835 and ECCS 1731672.

References
[1] A. G. Barto, Reinforcement Learning: An Introduction. Cambridge, MA: MIT Press, 1998.
[2] R. A. Brooks, "Intelligence without reason," in Proc. 12th Int. Joint Conf. Artificial Intelligence, Sydney, NSW, Australia, Aug. 1991, pp. 569–595.
[3] R. Pfeifer and C. Scheier, Understanding Intelligence. Cambridge, MA: MIT Press, 1999.
[4] F.-Y. Wang, H. Zhang, and D. Liu, "Adaptive dynamic programming: An introduction," IEEE Comput. Intell. Mag., vol. 4, no. 2, pp. 39–47, Apr. 2009.
[5] B. Bathellier, S. P. Tee, C. Hrovat, and S. Rumpel, "A multiplicative reinforcement learning model capturing learning dynamics and interindividual variability in mice," Proc. Natl. Acad. Sci. USA, vol. 110, no. 49, pp. 19950–19955, Nov. 2013.
[6] P. W. Glimcher, "Understanding dopamine and reinforcement learning: The dopamine reward prediction error hypothesis," Proc. Natl. Acad. Sci. USA, vol. 108, no. Suppl. 3, pp. 15647–15654, Mar. 2011.
[7] D. Zhao, Z. Xia, and Q. Zhang, "Model-free optimal control based intelligent cruise control with hardware-in-the-loop demonstration [research frontier]," IEEE Comput. Intell. Mag., vol. 12, no. 2, pp. 56–69, Apr. 2017.
[8] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, et al., "Human-level control through deep reinforcement learning," Nature, vol. 518, no. 7540, pp. 529–533, Feb. 2015.
[9] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. van den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, et al., "Mastering the game of Go with deep neural networks and tree search," Nature, vol. 529, no. 7587, pp. 484–489, Jan. 2016.
[10] D. Silver, J. Schrittwieser, K. Simonyan, I. Antonoglou, A. Huang, A. Guez, T. Hubert, L. Baker, M. Lai, A. Bolton, Y. Chen, T. Lillicrap, F. Hui, L. Sifre, G. van den Driessche, T. Graepel, and D. Hassabis, "Mastering the game of Go without human knowledge," Nature, vol. 550, no. 7676, pp. 354–359, Oct. 2017.
[11] D. P. Bertsekas, Dynamic Programming and Optimal Control. Belmont, MA: Athena Scientific, 1995.
[12] P. J. Werbos, "Intelligence in the brain: A theory of how it works and how to build it," Neural Netw., vol. 22, no. 3, pp. 200–212, Apr. 2009.
[13] P. J. Werbos, "ADP: The key direction for future research in intelligent control and understanding brain intelligence," IEEE Trans. Syst., Man, Cybern. B, Cybern., vol. 38, no. 4, pp. 898–900, Aug. 2008.
[14] F. L. Lewis, D. Vrabie, and K. G. Vamvoudakis, "Reinforcement learning and feedback control: Using natural decision methods to design optimal adaptive controllers," IEEE Control Syst. Mag., vol. 32, no. 6, pp. 76–105, Dec. 2012.
[15] W. B. Powell, Approximate Dynamic Programming: Solving the Curses of Dimensionality. New York: Wiley, 2007.
[16] D. V. Prokhorov and D. C. Wunsch, "Adaptive critic designs," IEEE Trans. Neural Netw., vol. 8, no. 5, pp. 997–1007, Sept. 1997.
[17] F. Liu, J. Sun, J. Si, W. Guo, and S. Mei, "A boundedness result for the direct heuristic dynamic programming," Neural Netw., vol. 32, pp. 229–235, Aug. 2012.
[18] J. Si and Y.-T. Wang, "Online learning control by association and reinforcement," IEEE Trans. Neural Netw., vol. 12, no. 2, pp. 264–276, Mar. 2001.
[19] A. Al-Tamimi, F. L. Lewis, and M. Abu-Khalaf, "Discrete-time nonlinear HJB solution using approximate dynamic programming: Convergence proof," IEEE Trans. Syst., Man, Cybern. B, vol. 38, no. 4, pp. 943–949, June 2008.
[20] F. L. Lewis and K. G. Vamvoudakis, "Reinforcement learning for partially observable dynamic processes: Adaptive dynamic programming using measured output data," IEEE Trans. Syst., Man, Cybern. A, vol. 41, no. 1, pp. 14–25, Feb. 2011.
[21] D. Liu, D. Wang, D. Zhao, Q. Wei, and N. Jin, "Neural-network-based optimal control for a class of unknown discrete-time nonlinear systems using globalized dual heuristic programming," IEEE Trans. Autom. Sci. Eng., vol. 9, no. 3, pp. 628–634, July 2012.
[22] X. Zhong, H. He, H. Zhang, and Z. Wang, "Optimal control for unknown discrete-time nonlinear Markov jump systems using adaptive dynamic programming," IEEE Trans. Neural Netw. Learn. Syst., vol. 25, no. 12, pp. 2141–2155, Mar. 2014.
[23] H. He, Z. Ni, and J. Fu, "A three-network architecture for on-line learning and optimization based on adaptive dynamic programming," Neurocomputing, vol. 78, no. 1, pp. 3–13, Feb. 2012.
[24] Z. Ni, H. He, J. Wen, and X. Xu, "Goal representation heuristic dynamic programming on maze navigation," IEEE Trans. Neural Netw. Learn. Syst., vol. 24, pp. 2038–2050, Dec. 2013.
[25] A. Y. Ng and S. J. Russell, "Algorithms for inverse reinforcement learning," in Proc. 17th Int. Conf. Machine Learning, San Francisco, CA, July 2000, pp. 663–670.
[26] M. Rolf and M. Asada, "Where do goals come from? A generic approach to autonomous goal-system development," arXiv preprint arXiv:1410.5557, Oct. 2014.
[27] X. Zhong, H. He, D. Wang, and Z. Ni, "Model-free adaptive control for unknown nonlinear zero-sum differential game," IEEE Trans. Cybern., vol. 48, no. 5, pp. 1633–1646, May 2018.
[28] R. Bellman, Dynamic Programming. Princeton, NJ: Princeton Univ. Press, 1957.



Review Article

Tom Young, School of Information and Electronics, Beijing Institute of Technology, Beijing, China
Devamanyu Hazarika, School of Computing, National University of Singapore, Singapore
Soujanya Poria, Temasek Laboratories, Nanyang Technological University, Singapore
Erik Cambria, School of Computer Science and Engineering, Nanyang Technological University, Singapore

Recent Trends in Deep Learning Based Natural Language Processing

Digital Object Identifier: 10.1109/MCI.2018.2840738. Date of publication: 18 July 2018. Corresponding author: Erik Cambria (cambria@ntu.edu.sg).

Abstract

Deep learning methods employ multiple processing layers to learn hierarchical representations of data, and have produced state-of-the-art results in many domains. Recently, a variety of model designs and methods have blossomed in the context of natural language processing (NLP). In this paper, we review significant deep learning related models and methods that have been employed for numerous NLP tasks and provide a walk-through of their evolution. We also summarize, compare and contrast the various models and put forward a detailed understanding of the past, present and future of deep learning in NLP.

I. Introduction

Natural language processing (NLP) is a theory-motivated range of computational techniques for the automatic analysis and representation of human language. NLP research has evolved from the era of punch cards and batch processing, in which the analysis of a sentence could take up to 7 minutes, to the era of Google and the likes of it, in which millions of webpages can be processed in less than a second [1]. NLP enables computers to perform a wide range of natural language related tasks at all levels, ranging from parsing and part-of-speech (POS) tagging to machine translation and dialogue systems.

Deep learning architectures and algorithms have already made impressive advances in fields such as computer vision and pattern recognition. Following this trend, recent NLP research is now increasingly focusing on the use of new deep learning methods (see Fig. 1). For decades, machine learning approaches targeting NLP problems have been based on shallow models (e.g., SVM and logistic regression) trained on very high dimensional and sparse features. In the last few years, neural networks based on dense vector representations have been producing superior results on various NLP tasks. This trend is sparked by the success of word embeddings [2], [3] and deep learning methods [4]. Deep learning enables multi-level automatic feature representation learning. In contrast, traditional machine learning based NLP systems rely heavily on hand-crafted features. Such hand-crafted features are time-consuming to design and often incomplete.

Collobert et al. [5] demonstrated that a simple deep learning framework outperforms most state-of-the-art approaches in several NLP tasks such as named-entity recognition (NER), semantic role labeling (SRL), and POS tagging. Since then, numerous complex deep learning based algorithms have been proposed to solve difficult NLP tasks.



Figure 1 Percentage of deep learning papers in ACL, EMNLP, EACL, and NAACL over the last 6 years (long papers).

Figure 2 Distributional vectors exhibit compositionality (e.g., King − Man + Woman ≈ Queen).

We review major deep learning related models and methods applied to natural language tasks, such as convolutional neural networks (CNNs), recurrent neural networks (RNNs), and recursive neural networks. We also discuss memory-augmenting strategies, attention mechanisms, and how unsupervised models, reinforcement learning methods and, recently, deep generative models have been employed for language-related tasks.

To the best of our knowledge, this work is the first of its type to comprehensively cover the most popular deep learning methods in NLP research today. The work by Goldberg [6] only presented the basic principles for applying neural networks to NLP in a tutorial manner. We believe this paper will give readers a more comprehensive idea of current practices in this domain. The structure of the paper is as follows: Section II introduces the concept of distributed representation, the basis of sophisticated deep learning models; next, Sections III, IV, and V discuss popular models such as convolutional, recurrent, and recursive neural networks, as well as their use in various NLP tasks; following, Section VI lists recent applications of reinforcement learning in NLP and new developments in unsupervised sentence representation learning; later, Section VII illustrates the recent trend of coupling deep learning models with memory modules; finally, Section VIII summarizes the performance of a series of deep learning methods on standard datasets for major NLP topics.

II. Distributed Representation

Statistical NLP has emerged as the primary option for modeling complex natural language tasks. However, in its beginning, it often used to suffer from the notorious curse of dimensionality while learning joint probability functions of language models. This led to the motivation of learning distributed representations of words existing in low-dimensional space [7].

A. Word Embeddings

Distributional vectors or word embeddings (Fig. 2) essentially follow the distributional hypothesis, according to which words with similar meanings tend to occur in similar contexts. Thus, these vectors try to capture the characteristics of the neighbors of a word. The main advantage of distributional vectors is that they capture similarity between words. Measuring similarity between vectors is possible using measures such as cosine similarity. Word embeddings are often used as the first data processing layer in a deep learning model. Typically, word embeddings are pre-trained by optimizing an auxiliary objective in a large unlabeled corpus, such as predicting a word based on its context [3], [8], where the learned word vectors can capture general syntactic and semantic information. Thus, these embeddings have proven to be efficient in capturing context similarity and analogies and, due to their smaller dimensionality, they are fast and efficient for computing core NLP tasks.

Over the years, the models that create such embeddings have been shallow neural networks, and there has not been a need for deep networks to create good embeddings. However, deep learning based NLP models invariably represent their words, phrases and even sentences using these embeddings. This is in fact a major difference between traditional word count based models and deep learning based models. Word embeddings have been responsible for state-of-the-art results in a wide range of NLP tasks [9]–[12]. For example, Glorot et al. [13] used embeddings along with stacked denoising autoencoders for domain adaptation in sentiment classification, and Hermann and Blunsom [14] presented combinatory categorial autoencoders to learn the compositionality of sentences. Their wide usage across the recent literature shows their effectiveness and importance in any deep learning model performing an NLP task.

Distributed representations (embeddings) are mainly learned through context. During the 1990s, several research developments [15] marked the foundations of research in distributional semantics. A more detailed summary of these early trends is provided in [16], [17]. Later developments were adaptations of these early works, which led to the creation of topic models like latent Dirichlet allocation [18] and language models [7]. These works laid out the foundations of representation learning.
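As a concrete illustration of the cosine similarity measure mentioned above for comparing embeddings, here is a small sketch; the toy vectors are invented for illustration, and real embeddings typically have hundreds of dimensions.

```python
import numpy as np

def cosine_similarity(u, v):
    """cos(u, v) = u.v / (||u|| * ||v||); close to 1 for semantically similar words."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Toy 4-dimensional "embeddings" (made-up values).
king = np.array([0.9, 0.8, 0.1, 0.0])
queen = np.array([0.85, 0.75, 0.2, 0.05])
apple = np.array([0.0, 0.1, 0.9, 0.8])

print(cosine_similarity(king, queen))   # high: semantically related
print(cosine_similarity(king, apple))   # low: unrelated
```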



In 2003, Bengio et al. [7] proposed a neural language model which learned distributed representations for words (Fig. 3). The authors argued that these word representations, once compiled into sentence representations using the joint probability of word sequences, achieved an exponential number of semantically neighboring sentences. This, in turn, helped in generalization, since unseen sentences could now gather higher confidence if word sequences with similar words (with respect to nearby word representations) had already been seen.

Figure 3 Representation of the neural language model proposed by Bengio et al. [7]. C(i) is the ith word embedding.

Collobert and Weston [19] was the first work to show the utility of pre-trained word embeddings. The authors proposed a neural network architecture that forms the foundation of many current approaches. The work also established word embeddings as a useful tool for NLP tasks. However, the immense popularization of word embeddings was arguably due to [3], who proposed the continuous bag-of-words (CBOW) and skip-gram models to efficiently construct high-quality distributed vector representations. Propelling their popularity was the unexpected side effect of the vectors exhibiting compositionality, i.e., adding two word vectors results in a vector that is a semantic composite of the individual words, e.g., 'man' + 'royal' = 'king'. The theoretical justification for this behavior was recently given by Gittens et al. [20], who stated that compositionality is seen only when certain assumptions are held, e.g., the assumption that words need to be uniformly distributed in the embedding space.

Pennington et al. [21] proposed another famous word embedding method, which is essentially a "count-based" model. Here, the word co-occurrence count matrix is preprocessed by normalizing the counts and log-smoothing them. This matrix is then factorized to get lower-dimensional representations, which is done by minimizing a "reconstruction loss".

Below, we provide a brief description of the word2vec method proposed by Mikolov et al. [3].

B. Word2vec

Word embeddings were revolutionized by Mikolov et al. [3], [8], who proposed the CBOW and skip-gram models. CBOW computes the conditional probability of a target word given the context words surrounding it across a window of size k. On the other hand, the skip-gram model does the exact opposite of the CBOW model, by predicting the surrounding context words given the central target word. The context words are assumed to be located symmetrically to the target words within a distance equal to the window size in both directions. In unsupervised settings, the word embedding dimension is determined by the accuracy of prediction. As the embedding dimension increases, the accuracy of prediction also increases until it converges at some point, which is considered the optimal embedding dimension as it is the shortest without compromising accuracy.

Let us consider a simplified version of the CBOW model where only one word is considered in the context. This essentially replicates a bigram language model.

The CBOW model is a simple fully connected neural network with one hidden layer. The input layer, which takes the one-hot vector of the context word, has V neurons, while the hidden layer has N neurons. The output layer is a softmax over all words in the vocabulary. The layers are connected by weight matrices W ∈ R^{V×N} and W′ ∈ R^{N×V}, respectively. Each word from the vocabulary is finally represented as two learned vectors v_c and v_w, corresponding to context and target word representations, respectively. Thus, the kth word in the vocabulary will have

v_c = W_(k,·) and v_w = W′_(·,k).    (1)

Overall, for any word w_i with given context word c as input,

P(w_i | c) = y_i = e^{u_i} / Σ_{j=1}^{V} e^{u_j},  where u_i = v_{w_i}^T v_c.    (2)

The parameters θ = {V_w, V_c} are learned by defining the objective function as the log-likelihood and finding its gradient as

l(θ) = Σ_{w ∈ Vocabulary} log P(w | c)    (3)

∂l(θ)/∂V_w = V_c (1 − P(w | c)).    (4)

In the general CBOW model, all the one-hot vectors of the context words are taken as input simultaneously, i.e.,

h = W^T (x_1 + x_2 + ⋯ + x_c).    (5)
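A minimal sketch of the CBOW forward computation described above (Eqs. (1), (2), and (5)); the vocabulary size, dimensionality, and random weights are placeholders, and the training step that would follow from Eqs. (3)-(4) is omitted.

```python
import numpy as np

V, N = 10, 5                       # toy vocabulary size and embedding dimension
rng = np.random.default_rng(0)
W = rng.normal(size=(V, N))        # input-to-hidden weights: row k is the context vector v_c of word k
W_prime = rng.normal(size=(N, V))  # hidden-to-output weights: column k is the target vector v_w of word k

def cbow_predict(context_ids):
    """P(w | context) for the CBOW model: h is the sum of the context rows of W
    (Eq. (5)); scores u_i = v_{w_i}^T h; softmax over the vocabulary (Eq. (2))."""
    h = W[context_ids].sum(axis=0)          # hidden layer
    u = W_prime.T @ h                       # one score per vocabulary word
    e = np.exp(u - u.max())                 # numerically stable softmax
    return e / e.sum()

probs = cbow_predict([3])                   # a single context word: the "bigram" case in the text
assert np.isclose(probs.sum(), 1.0)
```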



One limitation of individual word embeddings is their inability to represent phrases, where the combination of two or more words (e.g., idioms like "hot potato" or named entities such as "Boston Globe") does not represent the combination of the meanings of the individual words. One solution to this problem, as explored by Mikolov et al. [3], is to identify such phrases based on word co-occurrence and train embeddings for them separately. More recent methods have explored directly learning n-gram embeddings from unlabeled data [22].

Another limitation comes from learning embeddings based only on a small window of surrounding words: sometimes words such as good and bad share almost the same embedding [23], which is problematic if used in tasks such as sentiment analysis [24]. At times these embeddings cluster semantically-similar words which have opposing sentiment polarities. This leaves the downstream model used for the sentiment analysis task unable to identify these contrasting polarities, leading to poor performance. Tang et al. [25] address this problem by proposing sentiment specific word embeddings (SSWE). The authors incorporate the supervised sentiment polarity of text in their loss functions while learning the embeddings.

A general caveat for word embeddings is that they are highly dependent on the applications in which they are used. Labutov and Lipson [26] proposed task specific embeddings which retrain the word embeddings to align them with the current task space. This is very important, as training embeddings from scratch requires a large amount of time and resources. Mikolov et al. [8] tried to address this issue by proposing negative sampling, which is frequency-based sampling of negative terms while training the word2vec model.

Traditional word embedding algorithms assign a distinct vector to each word. This makes them unable to account for polysemy. In a recent work, Upadhyay et al. [27] provided an innovative way to address this deficit. The authors leveraged multilingual parallel data to learn multi-sense word embeddings. For example, the English word bank, when translated to French, provides two different words, banc and banque, representing financial and geographical meanings, respectively. Such multilingual distributional information helped them in accounting for polysemy.

Table 1 provides a directory of existing frameworks that are frequently used for creating embeddings, which are further incorporated into deep learning models.

Table 1 Frameworks providing embedding tools and methods.

| Framework       | Language | URL                                          |
|-----------------|----------|----------------------------------------------|
| S-Space         | Java     | https://github.com/fozziethebeat/S-Space     |
| SemanticVectors | Java     | https://github.com/semanticvectors/          |
| Gensim          | Python   | https://radimrehurek.com/gensim/             |
| PyDSM           | Python   | https://github.com/jimmycallin/pydsm         |
| DISSECT         | Python   | http://clic.cimec.unitn.it/composes/toolkit/ |
| FastText        | Python   | https://fasttext.cc/                         |

C. Character Embeddings

Word embeddings are able to capture syntactic and semantic information, yet for tasks such as POS tagging and NER, intra-word morphological and shape information can also be very useful. Generally speaking, building natural language understanding systems at the character level has attracted certain research attention [28]–[31]. Better results on morphologically rich languages have been reported for certain NLP tasks. Santos and Guimaraes [30] applied character-level representations, along with word embeddings, for NER, achieving state-of-the-art results in Portuguese and Spanish corpora. Kim et al. [28] showed positive results on building a neural language model using only character embeddings. Ma et al. [32] exploited several embeddings, including character trigrams, to incorporate prototypical and hierarchical information for learning pre-trained label embeddings in the context of NER.

A common phenomenon for languages with large vocabularies is the unknown word, or out-of-vocabulary (OOV), issue. Character embeddings naturally deal with it, since each word is considered as no more than a composition of individual letters. In languages where text is not composed of separated words but of individual characters, and the semantic meaning of a word maps to its compositional characters (such as Chinese), building systems at the character level is a natural choice to avoid word segmentation [33]. Thus, works employing deep learning applications on such languages tend to prefer character embeddings over word vectors [34]. For example, Peng et al. [35] proved that radical-level processing could greatly improve sentiment classification performance. In particular, the authors proposed two types of Chinese radical-based hierarchical embeddings, which incorporate not only semantics at the radical and character level but also sentiment information. Bojanowski et al. [36] also tried to improve the representation of words by using character-level information in morphologically rich languages. They approached the skip-gram method by representing words as bags of character n-grams. Their work thus had the effectiveness of the skip-gram model while also addressing some persistent issues of word embeddings. The method was also fast, which allowed training models on large corpora quickly. Popularly known as FastText, such a method stands out over previous methods in terms of speed, scalability, and effectiveness.

Apart from character embeddings, other approaches have been proposed for OOV handling. Herbelot and Baroni [37] provided OOV handling on-the-fly by initializing the unknown words as the sum of the context words and refining these words with a high learning rate. However, their approach is yet to be tested on typical NLP tasks. Pinter et al. [38] provided an interesting approach of training a character-based model to recreate pre-trained embeddings. This allowed them to learn a compositional mapping from character to word embedding, thus tackling the OOV problem.
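To make the subword idea behind the FastText discussion above concrete, the following sketch extracts the character n-grams of a word. The boundary markers and the n-gram range are illustrative choices, not the exact settings of any particular implementation.

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Return the character n-grams of a word, with '<' and '>' marking
    word boundaries, in the spirit of the subword approach described above."""
    token = f"<{word}>"
    grams = {token}                                   # the full word is also kept as one unit
    for n in range(n_min, n_max + 1):
        grams.update(token[i:i + n] for i in range(len(token) - n + 1))
    return grams

# An out-of-vocabulary word still gets a representation from its pieces;
# a word vector would then be the sum of the (learned) vectors of these n-grams.
print(sorted(char_ngrams("where", n_min=3, n_max=4)))
```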



Despite the ever-growing popularity of distributional vectors, recent discussions on their relevance in the long run have cropped up. For example, Lucy and Gauthier [39] have recently tried to evaluate how well word vectors capture the necessary facets of conceptual meaning. The authors discovered severe limitations in perceptual understanding of the concepts behind the words, which cannot be inferred from distributional semantics alone. A possible direction for mitigating these deficiencies is grounded learning, which has been gaining popularity in this research domain.

III. Convolutional Neural Networks

Following the popularization of word embeddings and their ability to represent words in a distributed space, the need arose for an effective feature function that extracts higher-level features from constituting words or n-grams. These abstract features would then be used for numerous NLP tasks such as sentiment analysis, summarization, machine translation, and question answering (QA). CNNs turned out to be the natural choice given their effectiveness in computer vision tasks [40]–[42].

The use of CNNs for sentence modeling traces back to Collobert and Weston [19]. This work used multi-task learning to output multiple predictions for NLP tasks such as POS tags, chunks, named-entity tags, semantic roles, semantically-similar words, and a language model. A look-up table was used to transform each word into a vector of user-defined dimensions. Thus, an input sequence {s_1, s_2, ..., s_n} of n words was transformed into a series of vectors {w_{s_1}, w_{s_2}, ..., w_{s_n}} by applying the look-up table to each of its words (Fig. 4). This can be thought of as a primitive word embedding method whose weights were learned in the training of the network. In [5], Collobert extended this work to propose a general CNN-based framework to solve a plethora of NLP tasks. Both these works triggered a huge popularization of CNNs amongst NLP researchers. Given that CNNs had already shown their mettle for computer vision tasks, it was easier for people to believe in their performance.

Figure 4 CNN framework used to perform word-wise class prediction, proposed by Collobert and Weston [19].

CNNs have the ability to extract salient n-gram features from the input sentence to create an informative latent semantic representation of the sentence for downstream tasks. This application was pioneered by Collobert et al. [5], Kalchbrenner et al. [43], and Kim [44], which led to a huge proliferation of CNN-based networks in the succeeding literature. Below, we describe the working of a simple CNN-based sentence modeling network.

A. Basic CNN

1) Sentence Modeling

For each sentence, let w_i ∈ R^d represent the word embedding for the ith word in the sentence, where d is the dimension of the word embedding. Given that a sentence has n words, the sentence can now be represented as an embedding matrix W ∈ R^{n×d}. Fig. 5 depicts such a sentence as an input to the CNN framework.

Figure 5 CNN modeling on text.

Let w_{i:i+j} refer to the concatenation of vectors w_i, w_{i+1}, ..., w_j. Convolution is performed on this input embedding layer. It involves a filter k ∈ R^{hd} which is applied to a window of h words to produce a new feature. For example, a feature c_i is generated using the window of words w_{i:i+h-1} by

c_i = f(w_{i:i+h-1} · k^T + b)    (6)

where b ∈ R is the bias term and f is a non-linear activation function, for example the hyperbolic tangent. The filter k is applied to all possible windows using the same weights to create the feature map

c = [c_1, c_2, ..., c_{n-h+1}].    (7)
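A small sketch of Eqs. (6)-(7): one filter of width h slides over the n × d embedding matrix of a toy sentence to produce the feature map c. The sizes and random values are placeholders, and the final max-over-time pooling applied to c is the strategy described next.

```python
import numpy as np

n, d, h = 7, 4, 3                       # toy sentence length, embedding dimension, filter width
rng = np.random.default_rng(1)
Wmat = rng.normal(size=(n, d))          # sentence as an embedding matrix W in R^{n x d}
k = rng.normal(size=h * d)              # one filter k in R^{hd}
b = 0.0                                 # bias term

# Eq. (6): c_i = f(w_{i:i+h-1} . k^T + b), with f = tanh
c = np.array([np.tanh(Wmat[i:i + h].ravel() @ k + b) for i in range(n - h + 1)])

# Eq. (7): the feature map c = [c_1, ..., c_{n-h+1}]; max-over-time pooling keeps its largest value.
c_hat = c.max()
```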



In a CNN, a number of convolutional filters, also called kernels (typically hundreds), of different widths slide over the entire word embedding matrix. Each kernel extracts a specific pattern of n-grams. A convolution layer is usually followed by a max-pooling strategy, ĉ = max{c}, which subsamples the input, typically by applying a max operation to each filter. There are two primary reasons for this strategy.

Firstly, max pooling provides a fixed-length output, which is generally required for classification. Thus, regardless of the size of the filters, max pooling always maps the input to a fixed dimension of outputs. Secondly, it reduces the output's dimensionality while keeping the most salient n-gram features across the whole sentence. This is done in a translation-invariant manner, where each filter is able to extract a particular feature (e.g., negations) from anywhere in the sentence and add it to the final sentence representation.

The word embeddings can be initialized randomly or pre-trained on large unlabeled corpora (as in Section II). The latter option is sometimes found beneficial to performance, especially when the amount of labeled data is limited [44]. This combination of a convolution layer followed by max pooling is often stacked to create deep CNN networks. These sequential convolutions help in improved mining of the sentence to grasp a truly abstract representation comprising rich semantic information. The kernels through deeper convolutions cover a larger part of the sentence until finally covering it fully and creating a global summarization of the sentence features.

2) Window Approach

The above-mentioned architecture allows for modeling of complete sentences into sentence representations. However, many NLP tasks, such as NER, POS tagging, and SRL, require word-based predictions. To adapt CNNs for such tasks, a window approach is used, which assumes that the tag of a word primarily depends on its neighboring words. Thus, for each word, a fixed-size window surrounding it is assumed, and the sub-sentence ranging within the window is considered. A standalone CNN is applied to this sub-sentence as explained earlier, and predictions are attributed to the word in the center of the window. Following this approach, Poria et al. [45] employed a multi-level deep CNN to tag each word in a sentence as a possible aspect or non-aspect. Coupled with a set of linguistic patterns, their ensemble classifier managed to perform well in aspect detection.

The ultimate goal of word-level classification is generally to assign a sequence of labels to the entire sentence. In such cases, structured prediction techniques such as conditional random fields (CRF) are sometimes employed to better capture dependencies between adjacent class labels and finally generate a cohesive label sequence giving the maximum score to the whole sentence [46].

To get a larger contextual range, the classic window approach is often coupled with a time-delay neural network (TDNN) [47]. Here, convolutions are performed across all windows throughout the sequence. These convolutions are generally constrained by defining a kernel having a certain width. Thus, while the classic window approach only considers the words in the window around the word to be labeled, a TDNN considers all windows of words in the sentence at the same time. At times, TDNN layers are also stacked like CNN architectures to extract local features in lower layers and global features in higher layers [5].

B. Applications

In this section, we present some of the crucial works that employed CNNs on NLP tasks to set state-of-the-art benchmarks in their respective times.

Kim [44] explored using the above architecture for a variety of sentence classification tasks, including sentiment, subjectivity, and question type classification, showing competitive results. This work was quickly adapted by researchers given its simple yet effective network. After training for a specific task, the randomly initialized convolutional kernels became specific n-gram feature detectors that were useful for that target task. This simple network, however, had many shortcomings, with the CNN's inability to model long-distance dependencies standing as the main issue.

This issue was partly handled by Kalchbrenner et al. [43], who published a prominent paper where they proposed a dynamic convolutional neural network (DCNN) for semantic modeling of sentences. They proposed a dynamic k-max pooling strategy which, given a sequence p, selects the k most active features. The selection preserves the order of the features but is insensitive to their specific positions (Fig. 6). Built on the concept of the TDNN, they added this dynamic k-max pooling strategy to create a sentence model. This combination allowed filters with small width to span across a long range within the input sentence, thus accumulating crucial information across the sentence. In the induced subgraph (Fig. 6), higher-order features had highly variable ranges that could be either short and focused or as global and long as the input sentence. They applied their model on multiple tasks, including sentiment prediction and question type classification, achieving significant results. Overall, this work commented on the range of individual kernels while trying to model contextual semantics and proposed a way to extend their reach.
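The k-max pooling idea can be sketched as follows. For brevity, k is fixed here, whereas the dynamic version in the DCNN makes k a function of the sentence length and the layer depth.

```python
import numpy as np

def k_max_pooling(features, k):
    """Keep the k most active values of a 1-D feature map, preserving
    their original order (the selection is position-insensitive)."""
    features = np.asarray(features)
    idx = np.sort(np.argsort(features)[-k:])   # indices of the k largest values, back in order
    return features[idx]

print(k_max_pooling([0.1, 0.9, -0.3, 0.7, 0.2, 0.8], k=3))   # -> [0.9 0.7 0.8]
```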



Figure 6 Representation of a DCNN subgraph. With dynamic pooling, a filter with small width at the higher layers can relate phrases far apart in the input sentence. The DCNN was proposed by Kalchbrenner et al. [43].

Tasks involving sentiment analysis also require effective extraction of aspects along with their sentiment polarities [48]. Ruder et al. [49] applied a CNN in which the input concatenated an aspect vector with the word embeddings, obtaining competitive results. The CNN modeling approach varies with the length of the text. Such differences were seen in many works like Johnson and Zhang [22], where performance on longer texts was better than on shorter ones. Wang et al. [50] proposed the usage of CNNs for modeling representations of short texts, which suffer from the lack of available context and, thus, require extra effort to create meaningful representations. The authors proposed semantic clustering, which introduced multi-scale semantic units to be used as external knowledge for the short texts. A CNN was used to combine these units and form the overall representation. In fact, this requirement of high context information can be thought of as a caveat for CNN-based models. NLP tasks involving microtexts using CNN-based methods often require additional information and external knowledge to perform as per expectations. This fact was also observed in [51], where the authors performed sarcasm detection in Twitter texts using a CNN network. Auxiliary support, in the form of pre-trained networks trained on emotion, sentiment, and personality datasets, was used to achieve state-of-the-art performance.

CNNs have also been extensively used in other tasks. For example, Denil et al. [52] applied the DCNN to map meanings of words that constitute a sentence to those of documents for summarization. The DCNN learned convolution filters at both the sentence and document level, hierarchically learning to capture and compose low-level lexical features into high-level semantic concepts. The focal point of this work was the introduction of a novel visualization technique of the learned representations, which provided insights not only into the learning process but also for automatic summarization of texts.

CNN models are also suitable for certain NLP tasks that require semantic matching [53]. A model similar to the above CNN architecture (Fig. 5) was explored in [54] for information retrieval. The CNN was used for projecting queries and documents into a fixed-dimension semantic space, where cosine similarity between the query and documents was used for ranking documents with respect to a specific query. The model attempted to extract rich contextual structures in a query or a document by considering a temporal context window in a word sequence. This captured the contextual features at the word n-gram level. The salient word n-grams are then discovered by the convolution and max-pooling layers and aggregated to form the overall sentence vector.

In the domain of QA, Yih et al. [55] proposed to measure the semantic similarity between a question and entries in a knowledge base (KB) to determine what supporting fact in the KB to look for when answering a question. To create semantic representations, a CNN similar to the one in Fig. 5 was used. Unlike the classification setting, the supervision signal came from positive or negative text pairs (e.g., query-document), instead of class labels. Subsequently, Dong et al. [56] introduced a multi-column CNN (MCCNN) to analyze and understand questions from multiple aspects and create their representations. The MCCNN used multiple column networks to extract information from aspects comprising answer types and context from the input questions. By representing entities and relations in the KB with low-dimensional vectors, they used question-answer pairs to train the CNN model so as to rank candidate answers. Severyn and Moschitti [57] also used a CNN to model optimal representations of question and answer sentences. They proposed additional features in the embeddings in the form of relational information given by matching words between the question and answer pair. These parameters were tuned by the network. This simple network was able to produce results comparable to state-of-the-art methods.

CNNs are wired in a way to capture the most important information in a sentence. Traditional max-pooling strategies perform this in a translation-invariant form. However, this often misses valuable information present in multiple facts within the sentence. To overcome this loss of information for multiple-event modeling, Chen et al. [58] proposed a modified pooling strategy: the dynamic multi-pooling CNN (DMCNN). This strategy uses a novel dynamic multi-pooling layer that, as the name suggests, incorporates event triggers and arguments to reserve more crucial information from the pooling layer.

CNNs inherently provide certain required features like local connectivity, weight sharing, and pooling. This puts forward some degree of invariance which is highly desired in many tasks.



Speech recognition also requires such invariance and, thus, Abdel-Hamid et al. [59] used a hybrid CNN-HMM model which provided invariance to frequency shifts along the frequency axis. This variability is often found in speech signals due to speaker differences. They also performed limited weight sharing which led to a smaller number of pooling parameters, resulting in lower computational complexity. Palaz et al. [60] performed extensive analysis of CNN-based speech recognition systems when given raw speech as input. They showed the ability of CNNs to directly model the relationship between raw input and phones, creating a robust automatic speech recognition system.

Tasks like machine translation require preservation of sequential information and long-term dependency. Thus, structurally they are not well suited for CNN networks, which lack these features. Nevertheless, Tu et al. [61] addressed this task by considering both the semantic similarity of the translation pair and their respective contexts. Although this method did not address the sequence-preservation problem, it allowed them to get competitive results among other benchmarks.

Overall, CNNs are extremely effective in mining semantic clues in contextual windows. However, they are very data-heavy models: they include a large number of trainable parameters which require huge training data. This poses a problem when data are scarce. Another persistent issue with CNNs is their inability to model long-distance contextual information and to preserve sequential order in their representations [43], [61]. Although CNNs provide an effective way to capture n-gram features, which is approximately sufficient in certain sentence classification tasks, their sensitivity to word order is restricted locally and long-term dependencies are typically ignored.

IV. Recurrent Neural Networks
RNNs [62] use the idea of processing sequential information. The term "recurrent" applies as they perform the same computation over each token of the sequence and each step is dependent on the previous computations and results. Generally, a fixed-size vector is produced to represent a sequence by feeding tokens one by one to a recurrent unit. In a way, RNNs have "memory" over previous computations and use this information in current processing. This makes them naturally suited for many NLP tasks such as language modeling [2], [63], [64], machine translation [65]-[67], speech recognition [68]-[71], and image captioning [72]. This made RNNs increasingly popular for NLP applications in recent years.

A. Need for Recurrent Networks
In this section, we analyze the fundamental properties that favored the popularization of RNNs in a multitude of NLP tasks. Given that an RNN performs sequential processing by modeling units in sequence, it has the ability to capture the inherent sequential nature present in language, where units are characters, words or even sentences. Words in a language develop their semantic meaning based on the previous words in the sentence. A simple example illustrating this is the difference in meaning between "dog" and "hot dog". RNNs are tailor-made for modeling such context dependencies in language and similar sequence modeling tasks, which turned out to be a strong motivation for researchers to use RNNs over CNNs in these areas.

Another factor aiding RNNs' suitability for sequence modeling tasks lies in their ability to model variable-length text, including very long sentences, paragraphs and even documents [73]. Unlike CNNs, RNNs have flexible computational steps that provide better modeling capability and create the possibility to capture unbounded context. This ability became one of the selling points of major works using RNNs [74].

Many NLP tasks require semantic modeling over the whole sentence. This involves creating a gist of the sentence in a fixed-dimensional hyperspace. RNNs' ability to summarize sentences led to their increased usage for tasks like machine translation [75], where the whole sentence is summarized to a fixed vector and then mapped back to the variable-length target sequence.

RNNs also provide network support to perform time-distributed joint processing. Most sequence labeling tasks like POS tagging [31] come under this domain. More specific use cases include applications such as multi-label text categorization [76], multimodal sentiment analysis [77]-[79], and subjectivity detection [80].

The above points list some of the focal reasons that motivated researchers to opt for RNNs. However, it would be gravely wrong to conclude that RNNs are superior to other deep networks. Recently, several works provided contrasting evidence on the superiority of CNNs over RNNs. Even in RNN-suited tasks like language modeling, CNNs achieved competitive performance over RNNs [81]. Both CNNs and RNNs have different objectives when modeling a sentence. While RNNs try to create a composition of an arbitrarily long sentence along with unbounded context, CNNs try to extract the most important n-grams. Yin et al. [82] provided interesting insights on the comparative performance between RNNs and CNNs. After testing on multiple NLP tasks that included sentiment classification, QA, and POS tagging, they concluded that there is no clear winner: the performance of each network depends on the global semantics required by the task itself.

Below, we discuss some of the RNN models extensively used in the literature.

B. RNN Models

1) Simple RNN
In the context of NLP, RNNs are primarily based on the Elman network [62] and they are originally three-layer networks. Fig. 7 illustrates a more general RNN which is unfolded across time to accommodate a whole sequence. In the figure, x_t is taken as the input to the network at time step t and s_t represents the hidden state at the same time step. Calculation of s_t is based on the equation:



s_t = f(U x_t + W s_{t-1}).   (8)

Thus, s_t is calculated based on the current input and the previous time step's hidden state. The function f is taken to be a non-linear transformation such as tanh or ReLU, and U, V, W account for weights that are shared across time. In the context of NLP, x_t typically comprises one-hot encodings or embeddings. At times, they can also be abstract representations of textual content. o_t illustrates the output of the network, which is also often subjected to a non-linearity, especially when the network contains further layers downstream.

Figure 7 Simple RNN network.
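As a concrete illustration of Eq. (8), the sketch below (our own construction, not code from the cited works) unrolls an Elman-style RNN over a short sequence of embeddings with NumPy; all dimensions and random weights are assumptions.

import numpy as np

def simple_rnn(inputs, U, W, V):
    """Unroll Eq. (8): s_t = tanh(U x_t + W s_{t-1}), with per-step output o_t = V s_t."""
    d_hidden = W.shape[0]
    s = np.zeros(d_hidden)             # s_0: initial hidden state
    outputs = []
    for x_t in inputs:                 # one token embedding per time step
        s = np.tanh(U @ x_t + W @ s)   # new hidden state (Eq. 8)
        outputs.append(V @ s)          # per-step output, before any softmax
    return outputs, s

# Toy example: 5 time steps, 4-dimensional embeddings, 8 hidden units.
rng = np.random.default_rng(0)
x_seq = rng.normal(size=(5, 4))
U, W, V = rng.normal(size=(8, 4)), rng.normal(size=(8, 8)), rng.normal(size=(3, 8))
outs, last_state = simple_rnn(x_seq, U, W, V)
print(len(outs), last_state.shape)     # 5 outputs, final state of size 8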
The hidden state of the RNN is typically considered to be its most crucial element. As stated before, it can be considered as the network's memory element that accumulates information from other time steps. In practice, however, these simple RNN networks suffer from the infamous vanishing gradient problem, which makes it really hard to learn and tune the parameters of the earlier layers in the network.

This limitation was overcome by various networks such as long short-term memory (LSTM), gated recurrent units (GRUs), and residual networks (ResNets), where the first two are the most used RNN variants in NLP applications.

2) Long Short-Term Memory
LSTM [83], [84] (Fig. 8) has additional "forget" gates over the simple RNN, which allow the error to back-propagate through an unlimited number of time steps. Consisting of three gates: input, forget and output gates, it calculates the hidden state by taking a combination of these three gates as per the equations below:

x = [h_{t-1}; x_t]   (9)
f_t = σ(W_f · x + b_f)   (10)
i_t = σ(W_i · x + b_i)   (11)
o_t = σ(W_o · x + b_o)   (12)
c_t = f_t ⊙ c_{t-1} + i_t ⊙ tanh(W_c · x + b_c)   (13)
h_t = o_t ⊙ tanh(c_t).   (14)
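A minimal NumPy rendering of Eqs. (9)-(14) is sketched below. The weight shapes, the sigmoid helper, and the toy dimensions are our own assumptions; gates operate on the concatenated vector x = [h_{t-1}; x_t] exactly as in the equations.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM step following Eqs. (9)-(14); W maps [h_{t-1}; x_t], b holds biases."""
    x = np.concatenate([h_prev, x_t])                    # Eq. (9)
    f = sigmoid(W["f"] @ x + b["f"])                     # forget gate, Eq. (10)
    i = sigmoid(W["i"] @ x + b["i"])                     # input gate,  Eq. (11)
    o = sigmoid(W["o"] @ x + b["o"])                     # output gate, Eq. (12)
    c = f * c_prev + i * np.tanh(W["c"] @ x + b["c"])    # cell state,  Eq. (13)
    h = o * np.tanh(c)                                   # hidden state, Eq. (14)
    return h, c

# Toy dimensions: 4-d input, 8-d hidden/cell state.
rng = np.random.default_rng(1)
W = {k: rng.normal(size=(8, 12)) for k in "fioc"}
b = {k: np.zeros(8) for k in "fioc"}
h, c = np.zeros(8), np.zeros(8)
for x_t in rng.normal(size=(6, 4)):                      # run over a 6-step sequence
    h, c = lstm_step(x_t, h, c, W, b)
print(h.shape, c.shape)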
3) Gated Recurrent Units
Another gated RNN variant called GRU [75] (Fig. 8), of lesser complexity, was invented with empirically similar performance to LSTM in most tasks. GRU comprises two gates, a reset gate and an update gate, and handles the flow of information like an LSTM but without a memory unit. Thus, it exposes the whole hidden content without any control. GRU can be a more efficient RNN than LSTM. The working of GRU is as follows:

z = σ(U_z · x_t + W_z · h_{t-1})   (15)
r = σ(U_r · x_t + W_r · h_{t-1})   (16)
s_t = tanh(U_s · x_t + W_s · (h_{t-1} ⊙ r))   (17)
h_t = (1 - z) ⊙ s_t + z ⊙ h_{t-1}.   (18)
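The same kind of sketch can be written for Eqs. (15)-(18); again, the weight shapes and toy sequence below are illustrative assumptions rather than any published implementation.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x_t, h_prev, U, W):
    """One GRU step following Eqs. (15)-(18)."""
    z = sigmoid(U["z"] @ x_t + W["z"] @ h_prev)          # update gate, Eq. (15)
    r = sigmoid(U["r"] @ x_t + W["r"] @ h_prev)          # reset gate,  Eq. (16)
    s = np.tanh(U["s"] @ x_t + W["s"] @ (h_prev * r))    # candidate,   Eq. (17)
    return (1.0 - z) * s + z * h_prev                    # new state,   Eq. (18)

rng = np.random.default_rng(2)
U = {k: rng.normal(size=(8, 4)) for k in "zrs"}
W = {k: rng.normal(size=(8, 8)) for k in "zrs"}
h = np.zeros(8)
for x_t in rng.normal(size=(6, 4)):
    h = gru_step(x_t, h, U, W)
print(h.shape)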

Figure 8 Illustration of an LSTM and GRU gate.

Researchers often face the dilemma of choosing the appropriate RNN. This also extends to developers working in NLP. Throughout history, most of the choices over the RNN variant have tended to be heuristic. Chung et al. [74] did a critical comparative evaluation of the three RNN variants mentioned above, although not on NLP tasks. They evaluated their work on tasks relating to polyphonic music modeling and speech signal modeling. Their evaluation clearly demonstrated the superiority of the gated units (LSTM and GRU) over the traditional simple RNN (in their case, using tanh activation). However, they could not reach any concrete conclusion about which of the two gating units was better. This fact has been noted in other works too and, thus, people often rely on other factors like computing power when choosing between the two.

C. Applications

1) RNN for Word-Level Classification
RNNs have had a huge presence in the field of word-level classification. Many of their applications stand as state of the art in their respective tasks. Lample et al. [85] proposed to use bidirectional LSTM for NER.


The network captured arbitrarily long context information around the target word (curbing the limitation of a fixed window size), resulting in two fixed-size vectors, on top of which another fully-connected layer was built. They used a CRF layer at last for the final entity tagging.

RNNs have also shown considerable improvement in language modeling over traditional methods based on count statistics. Pioneering work in this field was done by Graves [86], who introduced the effectiveness of RNNs in modeling complex sequences with long-range context structures. He also proposed deep RNNs where multiple layers of hidden states were used to enhance the modeling. This work established the usage of RNNs on tasks beyond the context of NLP. Later, Sundermeyer et al. [87] compared the gain obtained by replacing a feed-forward neural network with an RNN when conditioning the prediction of a word on the words ahead. In their work, they proposed a typical hierarchy in neural network architectures where feed-forward neural networks gave considerable improvement over traditional count-based language models, which in turn were superseded by RNNs and later by LSTMs. An important point that they mentioned was the applicability of their conclusions to a variety of other tasks such as statistical machine translation [88].

2) RNN for Sentence-Level Classification
Wang et al. [24] proposed encoding entire tweets with LSTM, whose hidden state is used for predicting sentiment polarity. This simple strategy proved competitive with the more complex DCNN structure by Kalchbrenner et al. [43], designed to endow CNN models with the ability to capture long-term dependencies. In a special case studying negation phrases, the authors also showed that the dynamics of LSTM gates can capture the reversal effect of the word "not".

Similar to CNNs, the hidden state of an RNN can also be used for semantic matching between texts. In dialogue systems, Lowe et al. [89] proposed to match a message with candidate responses with Dual-LSTM, which encodes both the message and response as fixed-size vectors and then measures their inner product as the basis to rank candidate responses.

3) RNN for Generating Language
A challenging task in NLP is generating natural language, which is another natural application of RNNs. Conditioned on textual or visual data, deep LSTMs have been shown to generate reasonable task-specific text in tasks such as machine translation, image captioning, etc.

In [67], the authors proposed a general deep LSTM encoder-decoder framework that maps a sequence to another sequence. One LSTM is used to encode the "source" sequence as a fixed-size vector, which can be text in the original language (machine translation), the question to be answered (QA) or the message to be replied to (dialogue systems). The vector is used as the initial state of another LSTM, named the decoder. The decoder generates tokens one by one, while updating its hidden state with the last generated token.
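To make the encoder-decoder recipe concrete, the toy sketch below (our own construction, not the architecture of [67]) reuses a plain tanh recurrence as both encoder and decoder and decodes greedily from the encoder's final state. The vocabulary, dimensions, end-of-sequence handling, and random weights are all assumptions.

import numpy as np

rng = np.random.default_rng(3)
V, D, H, EOS = 20, 8, 16, 0          # toy vocab size, embed dim, hidden dim, end token
E = rng.normal(size=(V, D))          # shared embedding table (assumption)
Ue, We = rng.normal(size=(H, D)), rng.normal(size=(H, H))   # encoder weights
Ud, Wd = rng.normal(size=(H, D)), rng.normal(size=(H, H))   # decoder weights
Wo = rng.normal(size=(V, H))         # projects decoder state to vocabulary logits

def encode(src_ids):
    """Summarize the source sequence into a single fixed-size vector."""
    h = np.zeros(H)
    for t in src_ids:
        h = np.tanh(Ue @ E[t] + We @ h)
    return h

def greedy_decode(h, max_len=10):
    """Generate tokens one by one, feeding back the last generated token."""
    out, prev = [], EOS              # conventionally start from a special token
    for _ in range(max_len):
        h = np.tanh(Ud @ E[prev] + Wd @ h)
        prev = int(np.argmax(Wo @ h))   # greedy choice of the next word
        if prev == EOS:
            break
        out.append(prev)
    return out

print(greedy_decode(encode([5, 7, 3])))   # untrained weights, so tokens are arbitrary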
architectures where feed-forward neural been shown to generate reasonable task- LSTM decoder (Fig. 9).
networks gave considerable improvement specific text in tasks such as machine Visual QA is another task that
over traditional count-based language translation, image captioning, etc. requires language generation based on
models, which in turn were superseded by In [67], the authors proposed a gen- both textual and visual clues. Malinowski
RNNs and later by LSTMs. An important eral deep LSTM encoder-decoder et al. [93] were the first to provide an
point that they mentioned was the appli- framework that maps a sequence to end-to-end deep learning solution
cability of their conclusions to a variety of another sequence. One LSTM is used to where they predicted the answer as a
other tasks such as statistical machine encode the “source” sequence as a fixed- sequence of words conditioned on the
translation [88]. size vector, which can be text in the input image modeled by a CNN and
original language (machine translation), text modeled by an LSTM.
2) Rnn for sentence-level the question to be answered (QA) or the
Classification message to be replied to (dialogue sys- D. Attention Mechanism
Wang et al. [24] proposed encoding entire tems). The vector is used as the initial One potential problem that the tradi-
tweets with LSTM, whose hidden state is state of another LSTM, named the tional encoder-decoder framework faces
is that the encoder at times is forced to
encode information which might not be
fully relevant to the task at hand. The
Output problem arises also if the input is long or
p1 p2 pN – 1 very information-rich and selective
encoding is not possible.
For example, the task of text summa-
Image

CNN
rization can be cast as a sequence-to-
LSTM LSTM LSTM
sequence learning problem, where the
input is the original text and the output
is the condensed version. Intuitively, it is
w1 w2 wN – 1 unrealistic to expect a fixed-size vector to
True Image Description encode all information in a piece of text
whose length can potentially be very
Figure 9 image captioning using CNN image embedder followed by LSTM decoder. This long. Similar problems have also been
architecture was proposed by vinyals et al. [90]. reported in machine translation [94].



In tasks such as text summarization and machine translation, a certain alignment exists between the input text and the output text, which means that each token generation step is highly related to a certain part of the input text. This intuition inspires the attention mechanism. This mechanism attempts to ease the above problems by allowing the decoder to refer back to the input sequence. Specifically, during decoding, in addition to the last hidden state and generated token, the decoder is also conditioned on a "context" vector calculated based on the input hidden state sequence.

Bahdanau et al. [94] first applied the attention mechanism to machine translation, which improved the performance especially for long sequences. In their work, the attention signal over the input hidden state sequence is determined with a multi-layer perceptron by the last hidden state of the decoder. By visualizing the attention signal over the input sequence during each decoding step, a clear alignment between the source and target language can be demonstrated.
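A schematic NumPy version of this MLP-based (additive) attention scoring is given below. The one-hidden-layer scorer, softmax normalization, and dimensions follow the standard textbook formulation rather than the authors' exact code, and all weights are placeholder assumptions.

import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def additive_attention(dec_state, enc_states, Wa, Ua, va):
    """Score each encoder hidden state against the current decoder state with a
    small MLP, then build the weighted "context" vector fed to the decoder."""
    scores = np.array([va @ np.tanh(Wa @ dec_state + Ua @ h) for h in enc_states])
    alpha = softmax(scores)                  # attention weights over the input
    context = alpha @ enc_states             # weighted sum of encoder states
    return context, alpha

rng = np.random.default_rng(4)
enc_states = rng.normal(size=(7, 16))        # 7 source positions, 16-d states
dec_state = rng.normal(size=16)
Wa, Ua, va = rng.normal(size=(32, 16)), rng.normal(size=(32, 16)), rng.normal(size=32)
ctx, alpha = additive_attention(dec_state, enc_states, Wa, Ua, va)
print(ctx.shape, alpha.round(2))             # context vector and its alignment weights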
A similar approach was applied to the task of text summarization by Rush et al. [95], where each output word in the summary was conditioned on the input sentence through an attention mechanism. The authors performed abstractive summarization, which is not very conventional as opposed to extractive summarization, but can be scaled up to large data with minimal linguistic input.

In image captioning, Xu et al. [96] conditioned the LSTM decoder on different parts of the input image during each decoding step. The attention signal was determined by the previous hidden state and CNN features. In [97], the authors cast the syntactical parsing problem as a sequence-to-sequence learning task by linearizing the parsing tree. The attention mechanism proved to be more data-efficient in this work. A further step in referring to the input sequence was to directly copy words or sub-sequences of the input onto the output sequence under a certain condition [98], which was useful in tasks such as dialogue generation and text summarization. Copying or generation was chosen at each time step during decoding [99].

In aspect-based sentiment analysis, Wang et al. [100] proposed an attention-based solution where they used aspect embeddings to provide additional support during classification (Fig. 10). The attention module focused on selective regions of the sentence which affected the aspect to be classified. Recently, Ma et al. [101] augmented LSTM with a hierarchical attention mechanism consisting of a target-level attention and a sentence-level attention to exploit commonsense knowledge for targeted aspect-based sentiment analysis.

Figure 10 Aspect classification using attention. The original attention-based model in this application was proposed by Wang et al. [100].

Given the intuitive applicability of attention modules, they are still being actively investigated by NLP researchers and adopted for an increasing number of applications.

V. Recursive Neural Networks
RNNs represent a natural way to model sequences. Arguably, however, language exhibits a natural recursive structure, where words and sub-phrases combine into phrases in a hierarchical manner. Such structure can be represented by a constituency parsing tree. Thus, tree-structured models have been used to better make use of such syntactic interpretations of sentence structure [4]. Specifically, in a recursive neural network, the representation of each non-terminal node in a parsing tree is determined by the representations of all its children.

A. Basic Model
In this section, we describe the basic structure of recursive neural networks. As shown in Fig. 11, the network g defines a compositional function on the representations of phrases or words (b, c or a, p_1) to compute the representation of a higher-level phrase (p_1 or p_2). The representations of all nodes take the same form.

Figure 11 Recursive neural networks iteratively form high-level representations from lower-level representations.



In [4], the authors described multiple variations of this model. In its simplest form, g is defined as:

p_1 = tanh(W [b; c]),   p_2 = tanh(W [a; p_1])   (19)

in which the representation for each node is a d-dimensional vector and W ∈ R^{D×2D}.

Another variation is the MV-RNN [102]. The idea is to represent every word and phrase as both a matrix and a vector. When two constituents are combined, the matrix of one is multiplied with the vector of the other:

p_1 = tanh(W [Cb; Bc]),   P_1 = tanh(W_M [B; C])   (20)

in which b, c, p_1 ∈ R^D, B, C, P_1 ∈ R^{D×D}, and W_M ∈ R^{D×2D}. Compared to the vanilla form, MV-RNN parameterizes the compositional function with matrices corresponding to the constituents.

The recursive neural tensor network (RNTN) is proposed to introduce more interaction between the input vectors without making the number of parameters exceptionally large like MV-RNN. RNTN is defined by:

p_1 = tanh([b; c]^T V^{[1:D]} [b; c] + W [b; c])   (21)

where V ∈ R^{2D×2D×D} is a tensor that defines multiple bilinear forms.
defines multiple bilinear forms. Reinforcement learning is a method of defines a policy, whose execution results
training an agent to perform discrete in the agent picking an action, which
B. Applications actions before obtaining a reward. In refers to predicting the next word in the
One natural application of recursive NLP, tasks concerning language genera- sequence at each time step. After taking
neural networks is parsing [10]. A scor- tion can sometimes be cast as reinforce- an action the agent updates its internal
ing function is defined on the phrase ment learning problems. state (the hidden units of RNN). Once
representation to calculate the plausibili- In its original formulation, RNN the agent has reached the end of a se-
ty of that phrase. Beam search is usually language generators are typically trained quence, it observes a reward. This reward
applied for searching the best tree. The by maximizing the likelihood of each can be any developer-defined metric



This reward can be any developer-defined metric tailored to a specific task. For example, Li et al. [108] defined three rewards for a generated sentence based on ease of answering, information flow, and semantic coherence.
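The pseudocode-style sketch below shows the gradient estimator that this family of methods relies on. The toy softmax policy, the stand-in reward, the running-average baseline, and all dimensions are illustrative assumptions, not the setup of [107] or [108].

import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

rng = np.random.default_rng(6)
V, H = 6, 8                                  # toy vocabulary and state size
theta = rng.normal(size=(V, H)) * 0.1        # policy parameters: state -> token logits

def sample_episode(state, length=4):
    """Sample a token sequence from the current policy, keeping grad log-probs."""
    tokens, grads = [], []
    for _ in range(length):
        probs = softmax(theta @ state)
        a = rng.choice(V, p=probs)
        # d/dtheta log pi(a|state) for a softmax policy: (onehot(a) - probs) outer state
        grads.append(np.outer(np.eye(V)[a] - probs, state))
        tokens.append(a)
        state = np.tanh(state + 0.1 * a)     # toy state transition
    return tokens, grads

def reward(tokens):
    return float(sum(tokens))                # stand-in for a sequence-level metric (BLEU/ROUGE)

baseline, lr = 0.0, 0.01
for step in range(200):                      # REINFORCE: theta += lr * (R - b) * grad log pi
    toks, grads = sample_episode(rng.normal(size=H))
    R = reward(toks)
    baseline = 0.9 * baseline + 0.1 * R      # running-average baseline reduces variance
    for g in grads:
        theta += lr * (R - baseline) * g
print("last sampled reward:", R)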
There are two well-known shortcomings of reinforcement learning. To make reinforcement learning tractable, it is desired to carefully handle the state and action space [110], [111], which in the end may restrict the expressive power and learning capacity of the model. Secondly, the need for training the reward functions makes such models hard to design and measure at run time [112], [113].

Another approach for sequence-level supervision is to use the adversarial training technique [114], where the training objective for the language generator is to fool another discriminator trained to distinguish generated sequences from real sequences. The generator G and the discriminator D are trained jointly in a min-max game which ideally leads to G generating sequences indistinguishable from real ones. This approach can be seen as a variation of generative adversarial networks in [114], where G and D are conditioned on certain stimuli (for example, the source image in the task of image captioning). In practice, the above scheme can be realized under the reinforcement learning paradigm with policy gradient. For dialogue systems, the discriminator is analogous to a human Turing tester, who discriminates between human and machine-produced dialogues [115].

B. Unsupervised Sentence Representation Learning
Similar to word embeddings, distributed representations for sentences can also be learned in an unsupervised fashion. The result of such unsupervised learning are "sentence encoders", which map arbitrary sentences to fixed-size vectors that can capture their semantic and syntactic properties. Usually an auxiliary task has to be defined for the learning process.

Similar to the skip-gram model [8] for learning word embeddings, the skip-thought model [116] was proposed for learning sentence representations, where the auxiliary task was to predict two adjacent sentences (before and after) based on the given sentence. The seq2seq model was employed for this learning task. One LSTM encoded the sentence to a vector (distributed representation). Two other LSTMs decoded such representation to generate the target sequences. The standard seq2seq training process was used. After training, the encoder could be seen as a generic feature extractor (word embeddings were also learned at the same time).

Kiros et al. [116] verified the quality of the learned sentence encoder on a range of sentence classification tasks, showing competitive results with a simple linear model based on the static feature vectors. However, the sentence encoder can also be fine-tuned in the supervised learning task as part of the classifier. Dai and Le [117] investigated the use of the decoder to reconstruct the encoded sentence itself, which resembled an autoencoder [118].

Language modeling could also be used as an auxiliary task when training LSTM encoders, where the supervision signal came from the prediction of the next token. Dai and Le [117] conducted experiments on initializing LSTM models with learned parameters on a variety of tasks. They showed that pre-training the sentence encoder on a large unsupervised corpus yielded better accuracy than only pre-training word embeddings. Also, predicting the next token turned out to be a worse auxiliary objective than reconstructing the sentence itself, as the LSTM hidden state was only responsible for a rather short-term objective.

C. Deep Generative Models
Recent success in generating realistic images has driven a series of efforts on applying deep generative models to text data. The promise of such research is to discover rich structure in natural language while generating realistic sentences from a latent code space. In this section, we review recent research on achieving this goal with variational autoencoders (VAEs) [119] and generative adversarial networks (GANs) [114].

Standard sentence autoencoders, as in the last section, do not impose any constraint on the latent space; as a result, they fail when generating realistic sentences from arbitrary latent representations [120]. The representations of these sentences may often occupy a small region in the hidden space, and most regions in the hidden space do not necessarily map to a realistic sentence [121]. They cannot be used to assign probabilities to sentences or to sample novel sentences [120].

The VAE imposes a prior distribution on the hidden code space which makes it possible to draw proper samples from the model. It modifies the autoencoder architecture by replacing the deterministic encoder function with a learned posterior recognition model. The model consists of encoder and generator networks which encode data examples to latent representations and generate samples from the latent space, respectively. It is trained by maximizing a variational lower bound on the log-likelihood of observed data under the generative model.

Bowman et al. [120] proposed an RNN-based variational autoencoder generative model that incorporated distributed latent representations of entire sentences (Fig. 12). Unlike vanilla RNN language models, this model worked from an explicit global sentence representation. Samples from the prior over these sentence representations produced diverse and well-formed sentences.

Figure 12 RNN-based VAE network for sentence generation proposed by Bowman et al. [120].
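A compact sketch of the VAE objective (reparameterized sampling plus the KL term against a standard normal prior) is given below. The linear Gaussian encoder, the squared-error stand-in for the decoder log-likelihood, and the dimensions are illustrative assumptions rather than the model of [120].

import numpy as np

rng = np.random.default_rng(7)
D_X, D_Z = 12, 4                              # data and latent dimensionality
W_MU, W_LOGVAR = rng.normal(size=(D_Z, D_X)), rng.normal(size=(D_Z, D_X))
W_DEC = rng.normal(size=(D_X, D_Z))

def elbo(x):
    """Variational lower bound = reconstruction term - KL(q(z|x) || N(0, I))."""
    mu, logvar = W_MU @ x, W_LOGVAR @ x       # posterior recognition model q(z|x)
    eps = rng.normal(size=D_Z)
    z = mu + np.exp(0.5 * logvar) * eps       # reparameterization trick
    recon = -np.sum((x - np.tanh(W_DEC @ z)) ** 2)   # stand-in log-likelihood of x given z
    kl = 0.5 * np.sum(np.exp(logvar) + mu ** 2 - 1.0 - logvar)
    return recon - kl

x = rng.normal(size=D_X)
print("ELBO on one example:", round(elbo(x), 3))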



Hu et al. [122] proposed generating sentences whose attributes are controlled by learning disentangled latent representations with designated semantics. The authors augmented the latent code in the VAE with a set of structured variables, each targeting a salient and independent semantic feature of sentences. The model incorporated VAE and attribute discriminators, in which the VAE component trained the generator to reconstruct real sentences for generating plausible text, while the discriminators forced the generator to produce attributes coherent with the structured code. When trained on a large number of unsupervised sentences and a small number of labeled sentences, Hu et al. [122] showed that the model was able to generate plausible sentences conditioned on two major attributes of English: tense and sentiment.

GAN is another class of generative model composed of two competing networks. A generative neural network decodes a latent representation to a data instance, while the discriminative network is simultaneously taught to discriminate between instances from the true data distribution and synthesized instances produced by the generator. GAN does not explicitly represent the true data distribution p(x).

Zhang et al. [121] proposed a framework for employing LSTM and CNN for adversarial training to generate realistic text. The latent code z was fed to the LSTM generator at every time step. The CNN acted as a binary sentence classifier which discriminated between real data and generated samples. One problem with applying GAN to text is that the gradients from the discriminator cannot properly back-propagate through discrete variables. In [121], this problem was solved by making the word prediction at every time step "soft" at the word embedding space. Yu et al. [123] proposed to bypass this problem by modeling the generator as a stochastic policy. The reward signal came from the GAN discriminator judged on a complete sequence, and was passed back to the intermediate state-action steps using Monte Carlo search.

The evaluation of deep generative models has been challenging. For text, it is possible to create oracle training data from a fixed set of grammars and then evaluate generative models based on whether (or how well) the generated samples agree with the predefined grammar [124]. Another strategy is to evaluate BLEU scores of samples on a large amount of unseen test data. The ability to generate sentences similar to unseen real data is considered a measurement of quality [123].

VII. Memory-Augmented Networks
The attention mechanism stores a series of hidden vectors of the encoder, which the decoder is allowed to access during the generation of each token. Here, the hidden vectors of the encoder can be seen as entries of the model's "internal memory". Recently, there has been a surge of interest in coupling neural networks with a form of memory, which the model can interact with.

In [135], the authors proposed memory networks for QA tasks. In synthetic QA, a series of statements (memory entries) were provided to the model as potential supporting facts to the question. The model learned to retrieve one entry at a time from memory based on the question and previously retrieved memory. In large-scale realistic QA, a large set of commonsense knowledge in the form of (subject, relation, object) triples was used as memory.

Sukhbaatar et al. [136] extended this work and proposed end-to-end memory networks, where memory entries were retrieved in a "soft" manner with an attention mechanism, thus enabling end-to-end training. Multiple rounds (hops) of information retrieval from memory were shown to be essential to good performance, and the model was able to retrieve and reason about several supporting facts to answer a specific question. They also showed a special use of the model for language modeling, where each word in the sentence was seen as a memory entry. With multiple hops, the model yielded results comparable to deep LSTM models.
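The soft, multi-hop memory read at the heart of this family of models can be sketched as below. The two embedding matrices, the two hops, the hop-to-hop mapping R, and the toy dimensions are our own assumptions, following the general recipe of end-to-end memory networks rather than any exact published implementation.

import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

rng = np.random.default_rng(8)
N_MEM, D = 5, 16
memory_in = rng.normal(size=(N_MEM, D))      # embedded facts used for addressing
memory_out = rng.normal(size=(N_MEM, D))     # embedded facts used for the response
R = rng.normal(size=(D, D))                  # maps the controller state between hops

def read(query, hops=2):
    u = query
    for _ in range(hops):                    # each hop attends over all memory entries
        p = softmax(memory_in @ u)           # soft retrieval weights
        o = memory_out.T @ p                 # weighted sum of memory entries
        u = R @ (u + o)                      # updated controller state
    return u

answer_state = read(rng.normal(size=D))
print(answer_state.shape)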
Furthermore, dynamic memory networks (DMN) [128] improved upon previous memory-based models by employing neural network models for input representation, attention, and answer mechanisms. The resulting model was applicable to a wide range of NLP tasks (QA, POS tagging, and sentiment analysis), as every task could be cast to the <memory, question, answer> triple format. Xiong et al. [137] applied the same model to visual QA and proved that the memory module was applicable to visual signals.

VIII. Performance of Different Models on Different NLP Tasks
We summarize the performance of a series of deep learning methods on standard datasets developed in recent years on 7 major NLP topics in Tables 2-7. Our goal is to show the readers common datasets used in the community and state-of-the-art results along with different models.

Table 2 POS tagging (WSJ-PTB, per-token accuracy %).
Paper | Model | Accuracy
Giménez and Márquez [125] | SVM with manual feature pattern | 97.16
Collobert et al. [5] | MLP with word embeddings + CRF | 97.29
Santos and Zadrozny [31] | MLP with character + word embeddings | 97.32
Huang et al. [126] | LSTM | 97.29
Huang et al. [126] | Bidirectional LSTM | 97.40
Huang et al. [126] | LSTM-CRF | 97.54
Huang et al. [126] | Bidirectional LSTM-CRF | 97.55
Andor et al. [127] | Transition-based neural network | 97.45
Kumar et al. [128] | DMN | 97.56

Table 3 Parsing (UAS/LAS = unlabeled/labeled attachment score; WSJ = the Wall Street Journal section of the Penn Treebank).
Parsing type | Paper | Model | WSJ
Dependency parsing | Chen and Manning [129] | Fully-connected NN with features including POS | 91.8/89.6 (UAS/LAS)
Dependency parsing | Weiss et al. [130] | Deep fully-connected NN with features including POS | 94.3/92.4 (UAS/LAS)
Dependency parsing | Dyer et al. [131] | Stack-LSTM | 93.1/90.9 (UAS/LAS)
Dependency parsing | Zhou et al. [132] | Beam contrastive model | 93.31/92.37 (UAS/LAS)
Constituency parsing | Petrov et al. [133] | Probabilistic context-free grammars (PCFG) | 91.8 (F1 score)
Constituency parsing | Socher et al. [10] | Recursive neural networks | 90.29 (F1 score)
Constituency parsing | Zhu et al. [134] | Feature-based transition parsing | 91.3 (F1 score)
Constituency parsing | Vinyals et al. [97] | Seq2seq learning with LSTM + attention | 93.5 (F1 score)

Table 4 Named-entity recognition (CoNLL 2003, F1 %).
Paper | Model | F1
Collobert et al. [5] | MLP with word embeddings + gazetteer | 89.59
Passos et al. [138] | Lexicon-infused phrase embeddings | 90.90
Chiu and Nichols [139] | Bi-LSTM with word + char + lexicon embeddings | 90.77
Luo et al. [140] | Semi-CRF jointly trained with linking | 91.20
Lample et al. [85] | Bi-LSTM-CRF with word + char embeddings | 90.94
Lample et al. [85] | Bi-LSTM with word + char embeddings | 89.15
Strubell et al. [141] | Dilated CNN with CRF | 90.54

A. POS Tagging
The WSJ-PTB (the Wall Street Journal part of the Penn Treebank dataset) corpus contains 1.17 million tokens and has been widely used for developing and evaluating POS tagging systems. Giménez and Márquez [125] employed a one-against-all SVM based on manually-defined features within a seven-word window, in which some basic n-gram patterns were evaluated to form binary features such as "previous word is the", "two preceding tags are DT NN", etc. One characteristic of the POS tagging problem was the strong dependency between adjacent tags. With a simple left-to-right tagging scheme, this method modeled dependencies between adjacent tags only by feature engineering. In an effort to reduce feature engineering, Collobert et al. [5] relied only on word embeddings within the word window with a multi-layer perceptron. Incorporating a CRF was proven useful in [5].



Santos and Zadrozny [31] concatenated word embeddings with character embeddings to better exploit morphological clues. In [31], the authors did not consider a CRF, but since the word-level decision was made on a context window, dependencies between adjacent tags were modeled implicitly. Huang et al. [126] concatenated word embeddings and manually-designed word-level features and employed a bidirectional LSTM to model arbitrarily long context. A series of ablative analyses suggested that bi-directionality and CRF both boosted performance. Andor et al. [127] showed a transition-based approach that produces competitive results with a simple feed-forward neural network. When applied to sequence tagging tasks, DMNs [128] essentially allowed for attending over the context multiple times by treating each RNN hidden state as a memory entry, each time focusing on different parts of the context.

Table 5 Semantic role labeling.
Paper | Model | CoNLL 2005 (F1 %) | CoNLL 2012 (F1 %)
Collobert et al. [5] | CNN with parsing features | 76.06 | -
Täckström et al. [142] | Manual features with DP for inference | 78.6 | 79.4
Zhou and Xu [143] | Bidirectional LSTM | 81.07 | 81.27
He et al. [144] | Bidirectional LSTM with highway connections | 83.2 | 83.4

Table 6 Sentiment classification (SST-1 = Stanford Sentiment Treebank, fine-grained 5 classes, Socher et al. [4]; SST-2 = the binary version of SST-1; numbers are accuracies %).
Paper | Model | SST-1 | SST-2
Socher et al. [4] | Recursive neural tensor network | 45.7 | 85.4
Kim [44] | Multichannel CNN | 47.4 | 88.1
Kalchbrenner et al. [43] | DCNN with k-max pooling | 48.5 | 86.8
Tai et al. [105] | Bidirectional LSTM | 48.5 | 87.2
Le and Mikolov [145] | Paragraph vector | 48.7 | 87.8
Tai et al. [105] | Constituency Tree-LSTM | 51.0 | 88.0
Yu et al. [146] | Tree-LSTM with refined word embeddings | 54.0 | 90.3
Kumar et al. [128] | DMN | 52.1 | 88.6

Table 7 Machine translation (numbers are BLEU scores on WMT2014).
Paper | Model | English-German | English-French
Cho et al. [75] | Phrase table with neural features | - | 34.50
Sutskever et al. [67] | Reranking phrase-based SMT best list with LSTM seq2seq | - | 36.5
Wu et al. [147] | Residual LSTM seq2seq + reinforcement learning refining | 26.30 | 41.16
Gehring et al. [148] | Seq2seq with CNN | 26.36 | 41.29
Vaswani et al. [149] | Attention mechanism | 28.4 | 41.0

B. Parsing
There are two types of parsing: dependency parsing, which connects individual words with their relations, and constituency parsing, which iteratively breaks text into sub-phrases. Transition-based methods are a popular choice since they are linear in the length of the sentence. The parser makes a series of decisions that read words sequentially from a buffer and combine them incrementally into the syntactic structure [129]. At each time step, the decision is made based on a stack containing available tree nodes, a buffer containing unread words and the obtained set of dependency arcs. Chen and Manning [129] modeled the decision making at each time step with a neural network with one hidden layer. The input layer contained embeddings of certain words, POS tags and arc labels, which came from the stack, the buffer and the set of arc labels.

Tu et al. [61] extended the work of Chen and Manning [129] by employing a deeper model with 2 hidden layers. However, both Tu et al. [61] and Chen and Manning [129] relied on manual feature selection from the parser state, and they only took into account a limited number of latest tokens. Dyer et al. [131] proposed stack-LSTMs to model arbitrarily long history. The end pointer of the stack changed position as the stack of tree nodes could be pushed and popped. Zhou et al. [132] integrated beam search and contrastive learning for better optimization.

Transition-based models were applied to constituency parsing as well. Zhu et al. [134] based each transition action on features such as the POS tags and constituent labels of the top few words of the stack and the buffer. By uniquely representing the parsing tree with a linear sequence of labels, Vinyals et al. [97] applied the seq2seq learning method to this problem.



C. Named-Entity Recognition
CoNLL 2003 has been a standard English dataset for NER, which concentrates on four types of named entities: people, locations, organizations and miscellaneous entities. NER is one of the NLP problems where lexicons can be very useful. Collobert et al. [5] first achieved competitive results with neural structures augmented by gazetteer features. Chiu and Nichols [139] concatenated lexicon features, character embeddings and word embeddings and fed them as input to a bidirectional LSTM. On the other hand, Lample et al. [85] only relied on character and word embeddings; by pre-training embeddings on large unsupervised corpora, they achieved competitive results without using any lexicon. Similar to POS tagging, a CRF also boosted the performance of NER, as demonstrated by the comparison in [85]. Overall, we see that bidirectional LSTM with CRF acts as a strong model for NLP problems related to structured prediction.

Passos et al. [138] proposed to modify skip-gram models to better learn entity-type related word embeddings that can leverage information from relevant lexicons. Luo et al. [140] jointly optimized the entities and the linking of entities to a KB. Strubell et al. [141] proposed to use dilated convolutions, defined over a wider effective input width by skipping over certain inputs at a time, for better parallelization and context modeling. The model showed significant speedup while retaining accuracy.

D. Semantic Role Labeling
Semantic role labeling (SRL) aims to discover the predicate-argument structure of each predicate in a sentence. For each target verb (predicate), all constituents in the sentence which take a semantic role of the verb are recognized. Typical semantic arguments include Agent, Patient, Instrument, etc., and also adjuncts such as Locative, Temporal, Manner, Cause, etc. [143]. Table 5 shows the performance of different models on the CoNLL 2005 & 2012 datasets.

Traditional SRL systems consist of several stages: producing a parse tree, identifying which parse tree nodes represent the arguments of a given verb, and finally classifying these nodes to determine the corresponding SRL tags. Each classification process usually entails extracting numerous features and feeding them into statistical models [5].

Given a predicate, Täckström et al. [142] scored a constituent span and its possible role to that predicate with a series of features based on the parse tree. They proposed a dynamic programming algorithm for efficient inference. Collobert et al. [5] achieved comparable results with a convolutional neural network augmented by parsing information provided in the form of additional look-up tables. Zhou and Xu [143] proposed to use bidirectional LSTM to model arbitrarily long context, which proved to be successful even without any parsing tree information. He et al. [144] further extended this work by introducing highway connections [150], more advanced regularization and an ensemble of multiple experts.

E. Sentiment Classification
The Stanford Sentiment Treebank (SST) dataset contains sentences taken from the movie review website Rotten Tomatoes. It was proposed by Pang and Lee [151] and subsequently extended by Socher et al. [4]. The annotation scheme has inspired a new dataset for sentiment analysis, called CMU-MOSI, where sentiment is studied in a multimodal setup [152].

Socher et al. [4] and Tai et al. [105] were both recursive networks that relied on constituency parsing trees. Their difference shows the effectiveness of LSTM over vanilla RNN in modeling sentences. On the other hand, tree-LSTM performed better than linear bidirectional LSTM, implying that tree structures can potentially better capture the syntactic properties of natural sentences. Yu et al. [146] proposed to refine pre-trained word embeddings with a sentiment lexicon, observing improved results based on [105].

Kim [44] and Kalchbrenner et al. [43] both used convolutional layers. The model in [44] was similar to the one in Fig. 5, while Kalchbrenner et al. [43] constructed the model in a hierarchical manner by interweaving k-max pooling layers with convolutional layers.

F. Machine Translation
The phrase-based SMT framework [160] factorized the translation model into the translation probabilities of matching phrases in the source and target sentences. Cho et al. [75] proposed to learn the translation probability of a source phrase to a corresponding target phrase with an RNN encoder-decoder. Such a scheme of scoring phrase pairs improved translation performance. Sutskever et al. [67], on the other hand, re-scored the top 1000 best candidate translations produced by an SMT system with a 4-layer LSTM seq2seq model. Dispensing with the traditional SMT system entirely, Wu et al. [147] trained a deep LSTM network with 8 encoder and 8 decoder layers, with residual connections as well as attention connections. Wu et al. [147] then refined the model by using reinforcement learning to directly optimize BLEU scores, but they found that the improvement in BLEU scores by this method did not reflect in human evaluation of translation quality. Recently, Gehring et al. [148] proposed a CNN-based seq2seq learning model for machine translation. The representation for each word in the input is computed by the CNN in a parallelized style for the attention mechanism. The decoder state is also determined by the CNN with words that are already produced. Vaswani et al. [149] proposed a self-attention-based model and dispensed with convolutions and recurrences entirely.



G. Question Answering
QA problems take many forms. Some rely on large KBs to answer open-domain questions, while others answer a question based on a few sentences or a paragraph (reading comprehension). For the former, we list (see Table 8) several experiments conducted on a large-scale QA dataset introduced by [153], where 14 M commonsense knowledge triples are considered as the KB. Each question can be answered with a single-relation query. For the latter, we consider (see Table 8) the synthetic dataset of bAbI, which requires the model to reason over multiple related facts to produce the right answer. It contains 20 synthetic tasks that test a model's ability to retrieve relevant facts and reason over them. Each task focuses on a different skill such as basic coreference and size reasoning.

The central problem of learning to answer single-relation queries is to find the single supporting fact in the database. Fader et al. [153] proposed to tackle this problem by learning a lexicon that maps natural language patterns to database concepts (entities, relations and question patterns) based on a question paraphrasing dataset. Bordes et al. [154] embedded both questions and KB triples as dense vectors and scored them with an inner product.

Weston et al. [135] took a similar approach by treating the KB as long-term memory, while casting the problem in the framework of a memory network. On the bAbI dataset, Sukhbaatar et al. [136] improved upon the original memory networks model [135] by making the training procedure agnostic of the actual supporting fact, while Kumar et al. [128] used neural sequence models (GRU) instead of neural bag-of-words models as in [136] and [135] to embed memories.

Table 8 Question answering.
Paper | Model | bAbI (mean accuracy %) | Farbes (accuracy %)
Fader et al. [153] | Paraphrase-driven lexicon learning | - | 0.54
Bordes et al. [154] | Weakly supervised embedding | - | 0.73
Weston et al. [135] | Memory networks | 93.3 | 0.83
Sukhbaatar et al. [136] | End-to-end memory networks | 88.4 | -
Kumar et al. [128] | DMN | 93.6 | -

H. Dialogue Systems
Two types of dialogue systems have been developed: generation-based models and retrieval-based models.

In Table 9, the Twitter Conversation Triple Dataset is typically used for evaluating generation-based dialogue systems, containing 3-turn Twitter conversation instances. One commonly used evaluation metric is BLEU [161], although it is commonly acknowledged that most automatic evaluation metrics are not completely reliable for dialogue evaluation and additional human evaluation is often necessary. Ritter et al. [155] employed the phrase-based statistical machine translation (SMT) framework to "translate" the message to its appropriate response. Sordoni et al. [156] reranked the 1000 best responses produced by SMT with a context-sensitive RNN encoder-decoder framework, observing substantial gains. Li et al. [157] reported results on replacing the traditional maximum log-likelihood training objective with the maximum mutual information training objective, in an effort to produce interesting and diverse responses, both of which are tested on a 4-layer LSTM encoder-decoder framework.

Table 9 Dialogue systems.
Paper | Model | Twitter Conversation Triple Dataset (BLEU) | Ubuntu Dialogue Dataset (recall 1@10 %)
Ritter et al. [155] | SMT | 3.60 | -
Sordoni et al. [156] | SMT + neural reranking | 4.44 | -
Li et al. [157] | LSTM seq2seq | 4.51 | -
Li et al. [157] | LSTM seq2seq with MMI objective | 5.22 | -
Lowe et al. [89] | Dual LSTM encoders for semantic matching | - | 55.22
Dodge et al. [158] | Memory networks | - | 63.72
Zhou et al. [159] | Sentence-level CNN-LSTM encoder | - | 66.15



The response retrieval task is defined as selecting the best response from a repository of candidate responses. Such a model can be evaluated by the recall1@k metric, where the ground-truth response is mixed with k - 1 random responses.
sponse is mixed with k - 1 random re- learning models whose internal mem- in vector space models of compositional semantics,” in
sponses. The Ubuntu dialogue dataset ory (bottom-up knowledge learned Proc. 51st Annu. Meeting Association Computational Linguis-
tics, 2013, vol. 1, pp. 894–904.
was constructed by scraping multi-turn from the data) is enr iched with an [15] J. L. Elman, “Distributed representations, simple
Ubuntu trouble-shooting dialogues external memory (top-down knowledge recurrent networks, and grammatical structure,” Mach.
Learn., vol. 7, no. 2–3, pp. 195–225, 1991.
from an online chatroom [89]. Lowe et inherited from a KB). Coupling sym- [16] A. M. Glenberg and D. A. Robertson, “Symbol
al. [89] used LSTMs to encode the mes- bolic and sub-symbolic AI will be key grounding and meaning: A comparison of high-dimen-
sional and embodied theories of meaning,” J. Memory
sage and response, and then inner prod- for stepping forward in the path from Lang., vol. 43, no. 3, pp. 379–401, Oct. 2000.
uct of the two sentence embeddings is NLP to natural language understanding. [17] S. T. Dumais, “Latent semantic analysis,” Annu. Rev.
Inf. Sci. Tech., vol. 38, no. 1, pp. 188–230, Nov. 2004.
used to rank candidates. Relying on machine learning, in fact, is [18] D. M. Blei, A. Y. Ng, and M. I. Jordan, “Latent
Zhou et al. [159] proposed to better good to make a ‘good guess’ based on dirichlet allocation,” J. Mach. Learn. Res., vol. 3, pp.
993–1022, 2003.
exploit the multi-turn nature of human past experience, because sub-symbolic [19] R. Collobert and J. Weston, “A unified architecture
conversation by employing the LSTM methods encode correlation and their for natural language processing: Deep neural networks
with multitask learning,” in Proc. 25th Int. Conf. Machine
encoder on top of sentence-level CNN decision-making process is probabilistic. Learning, 2008, pp. 160–167.
embeddings, similar to [162]. Dodge et Natural language understanding, how- [20] A. Gittens, D. Achlioptas, and M. W. Mahoney,
“Skip-gram-zipf + uniform = vector additivity,” in Proc.
al. [158] cast the problem in the frame- ever, requires much more than that. To 55th Annu. Meeting Association Computational Linguistics,
work of a memory network, where the use Noam Chomsky’s words, “you do 2017, vol. 1, pp. 69–76.
[21] J. Pennington, R. Socher, and C. D. Manning,
past conversation was treated as memory not get discoveries in the sciences by “Glove: Global vectors for word representation,” in Proc.
and the latest utterance was considered taking huge amounts of data, throwing Conf. Empirical Methods Natural Language Processing, 2014,
vol. 14, pp. 1532–1543.
as a “question” to be responded to. The them into a computer and doing statisti- [22] R. Johnson and T. Zhang, “Semi-supervised con-
authors showed that using simple neural cal analysis of them: that’s not the way volutional neural networks for text categorization via
region embedding,” in Proc. Advances Neural Information
bag-of-word embedding for sentences you understand things, you have to have Processing Systems, 2015, pp. 919–927.
can yield competitive results. theoretical insights”. [23] R. Socher, J. Pennington, E. H. Huang, A. Y. Ng,
and C. D. Manning, “Semi-supervised recursive auto-
encoders for predicting sentiment distributions,” in Proc.
IX. Conclusion

Deep learning offers a way to harness large amounts of computation and data with little engineering by hand [163]. With distributed representation, various deep models have become the new state-of-the-art methods for NLP problems. Supervised learning is the most popular practice in recent deep learning research for NLP. In many real-world scenarios, however, we have unlabeled data, which requires advanced unsupervised or semi-supervised approaches. In cases where there is a lack of labeled data for some particular classes, or where a new class appears while testing the model, strategies like zero-shot learning should be employed. These learning schemes are still in their developing phase, but we expect deep learning based NLP research to be driven in the direction of making better use of unlabeled data. We expect such a trend to continue with more and better model designs. We expect to see more NLP applications that employ reinforcement learning methods, e.g., dialogue systems. Finally, we expect to see more deep learning models whose internal memory (bottom-up knowledge learned from the data) is enriched with an external memory (top-down knowledge inherited from a KB). Coupling symbolic and sub-symbolic AI will be key for stepping forward on the path from NLP to natural language understanding. Relying on machine learning, in fact, is good for making a "good guess" based on past experience, because sub-symbolic methods encode correlation and their decision-making process is probabilistic. Natural language understanding, however, requires much more than that. To use Noam Chomsky's words, "you do not get discoveries in the sciences by taking huge amounts of data, throwing them into a computer and doing statistical analysis of them: that's not the way you understand things, you have to have theoretical insights."

References
[1] E. Cambria and B. White, "Jumping NLP curves: A review of natural language processing research," IEEE Comput. Intell. Mag., vol. 9, no. 2, pp. 48–57, May 2014.
[2] T. Mikolov, M. Karafiát, L. Burget, J. Cernocký, and S. Khudanpur, "Recurrent neural network based language model," in Proc. Interspeech, vol. 2, p. 3, 2010.
[3] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean, "Distributed representations of words and phrases and their compositionality," in Proc. Advances Neural Information Processing Systems, 2013, pp. 3111–3119.
[4] R. Socher, A. Perelygin, J. Wu, J. Chuang, C. D. Manning, A. Ng, and C. Potts, "Recursive deep models for semantic compositionality over a sentiment treebank," in Proc. Conf. Empirical Methods Natural Language Processing, 2013, pp. 1631–1642.
[5] R. Collobert, J. Weston, L. Bottou, M. Karlen, K. Kavukcuoglu, and P. Kuksa, "Natural language processing (almost) from scratch," J. Mach. Learn. Res., vol. 12, pp. 2493–2537, Aug. 2011.
[6] Y. Goldberg, "A primer on neural network models for natural language processing," J. Artif. Intell. Res., vol. 57, pp. 345–420, Nov. 2016.
[7] Y. Bengio, R. Ducharme, P. Vincent, and C. Jauvin, "A neural probabilistic language model," J. Mach. Learn. Res., vol. 3, pp. 1137–1155, Feb. 2003.
[8] T. Mikolov, K. Chen, G. Corrado, and J. Dean, "Efficient estimation of word representations in vector space," arXiv Preprint, arXiv:1301.3781, 2013.
[9] J. Weston, S. Bengio, and N. Usunier, "Wsabie: Scaling up to large vocabulary image annotation," in Proc. Int. Joint Conf. Artificial Intelligence, 2011, vol. 11, pp. 2764–2770.
[10] R. Socher, C. C. Lin, C. Manning, and A. Y. Ng, "Parsing natural scenes and natural language with recursive neural networks," in Proc. 28th Int. Conf. Machine Learning, 2011, pp. 129–136.
[11] P. D. Turney and P. Pantel, "From frequency to meaning: Vector space models of semantics," J. Artif. Intell. Res., vol. 37, pp. 141–188, Nov. 2010.
[14] …, "…in vector space models of compositional semantics," in Proc. 51st Annu. Meeting Association Computational Linguistics, 2013, vol. 1, pp. 894–904.
[15] J. L. Elman, "Distributed representations, simple recurrent networks, and grammatical structure," Mach. Learn., vol. 7, no. 2–3, pp. 195–225, 1991.
[16] A. M. Glenberg and D. A. Robertson, "Symbol grounding and meaning: A comparison of high-dimensional and embodied theories of meaning," J. Memory Lang., vol. 43, no. 3, pp. 379–401, Oct. 2000.
[17] S. T. Dumais, "Latent semantic analysis," Annu. Rev. Inf. Sci. Tech., vol. 38, no. 1, pp. 188–230, Nov. 2004.
[18] D. M. Blei, A. Y. Ng, and M. I. Jordan, "Latent dirichlet allocation," J. Mach. Learn. Res., vol. 3, pp. 993–1022, 2003.
[19] R. Collobert and J. Weston, "A unified architecture for natural language processing: Deep neural networks with multitask learning," in Proc. 25th Int. Conf. Machine Learning, 2008, pp. 160–167.
[20] A. Gittens, D. Achlioptas, and M. W. Mahoney, "Skip-gram-zipf + uniform = vector additivity," in Proc. 55th Annu. Meeting Association Computational Linguistics, 2017, vol. 1, pp. 69–76.
[21] J. Pennington, R. Socher, and C. D. Manning, "Glove: Global vectors for word representation," in Proc. Conf. Empirical Methods Natural Language Processing, 2014, vol. 14, pp. 1532–1543.
[22] R. Johnson and T. Zhang, "Semi-supervised convolutional neural networks for text categorization via region embedding," in Proc. Advances Neural Information Processing Systems, 2015, pp. 919–927.
[23] R. Socher, J. Pennington, E. H. Huang, A. Y. Ng, and C. D. Manning, "Semi-supervised recursive auto-encoders for predicting sentiment distributions," in Proc. Conf. Empirical Methods Natural Language Processing, 2011, pp. 151–161.
[24] X. Wang, Y. Liu, C. Sun, B. Wang, and X. Wang, "Predicting polarities of tweets by composing word embeddings with long short-term memory," in Proc. Annu. Meeting Association Computational Linguistics, 2015, pp. 1343–1353.
[25] D. Tang, F. Wei, N. Yang, M. Zhou, T. Liu, and B. Qin, "Learning sentiment-specific word embedding for twitter sentiment classification," in Proc. Annu. Meeting Association Computational Linguistics, 2014, pp. 1555–1565.
[26] I. Labutov and H. Lipson, "Re-embedding words," in Proc. Annu. Meeting Association Computational Linguistics, 2013, pp. 489–493.
[27] S. Upadhyay, K. Chang, M. Taddy, A. Kalai, and J. Zou, "Beyond bilingual: Multi-sense word embeddings using multilingual context," arXiv Preprint, arXiv:1706.08160, 2017.
[28] Y. Kim, Y. Jernite, D. Sontag, and A. M. Rush, "Character-aware neural language models," in Proc. Association Advancement Artificial Intelligence Conf., 2016, pp. 2741–2749.
[29] C. N. Dos Santos and M. Gatti, "Deep convolutional neural networks for sentiment analysis of short texts," in Proc. Int. Conf. Computational Linguistics, 2014, pp. 69–78.
[30] C. N. dos Santos and V. Guimaraes, "Boosting named entity recognition with neural character embeddings," arXiv Preprint, arXiv:1505.05008, 2015.
[31] C. D. Santos and B. Zadrozny, "Learning character-level representations for part-of-speech tagging," in Proc. 31st Int. Conf. Machine Learning, 2014, pp. 1818–1826.
[32] Y. Ma, E. Cambria, and S. Gao, "Label embedding for zero-shot fine-grained named entity typing," in Proc. Int. Conf. Computational Linguistics, Osaka, 2016, pp. 171–180.
[33] X. Chen, L. Xu, Z. Liu, M. Sun, and H. Luan, "Joint learning of character and word embeddings," in Proc. Int. Joint Conf. Artificial Intelligence, 2015, pp. 1236–1242.
[34] X. Zheng, H. Chen, and T. Xu, "Deep learning for chinese word segmentation and pos tagging," in Proc. Conf. Empirical Methods Natural Language Processing, 2013, pp. 647–657.



[35] H. Peng, E. Cambria, and X. Zou, "Radical-based hierarchical embeddings for chinese sentiment analysis at sentence level," in Proc. Int. Florida Artificial Intelligence Research Society Conf., 2017, pp. 347–352.
[36] P. Bojanowski, E. Grave, A. Joulin, and T. Mikolov, "Enriching word vectors with subword information," arXiv Preprint, arXiv:1607.04606, 2016.
[37] A. Herbelot and M. Baroni, "High-risk learning: Acquiring new word vectors from tiny data," arXiv Preprint, arXiv:1707.06556, 2017.
[38] Y. Pinter, R. Guthrie, and J. Eisenstein, "Mimicking word embeddings using subword rnns," arXiv Preprint, arXiv:1707.06961, 2017.
[39] L. Lucy and J. Gauthier, "Are distributional representations ready for the real world? Evaluating word vectors for grounded perceptual meaning," arXiv Preprint, arXiv:1705.11168, 2017.
[40] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "Imagenet classification with deep convolutional neural networks," in Proc. Advances Neural Information Processing Systems, 2012, pp. 1097–1105.
[41] A. Sharif Razavian, H. Azizpour, J. Sullivan, and S. Carlsson, "CNN features off-the-shelf: An astounding baseline for recognition," in Proc. IEEE Conf. Computer Vision and Pattern Recognition Workshops, 2014, pp. 806–813.
[42] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell, "Caffe: Convolutional architecture for fast feature embedding," in Proc. 22nd ACM Int. Conf. Multimedia, 2014, pp. 675–678.
[43] N. Kalchbrenner, E. Grefenstette, and P. Blunsom, "A convolutional neural network for modelling sentences," in Proc. 52nd Annu. Meeting Association Computational Linguistics, 2014, vol. 1, pp. 655–665.
[44] Y. Kim, "Convolutional neural networks for sentence classification," arXiv Preprint, arXiv:1408.5882, 2014.
[45] S. Poria, E. Cambria, and A. Gelbukh, "Aspect extraction for opinion mining with a deep convolutional neural network," Knowl.-Based Syst., vol. 108, pp. 42–49, June 2016.
[46] A. Kirillov, D. Schlesinger, W. Forkel, A. Zelenin, S. Zheng, P. Torr, and C. Rother, "Efficient likelihood learning of a generic CNN-CRF model for semantic segmentation," arXiv Preprint, arXiv:1511.05067, 2015.
[47] A. Waibel, T. Hanazawa, G. Hinton, K. Shikano, and K. J. Lang, "Phoneme recognition using time-delay neural networks," IEEE Trans. Acoust., Speech, Signal Process., vol. 37, no. 3, pp. 328–339, Mar. 1989.
[48] A. Mukherjee and B. Liu, "Aspect extraction through semi-supervised modeling," in Proc. 50th Annu. Meeting Association Computational Linguistics, 2012, pp. 339–348.
[49] S. Ruder, P. Ghaffari, and J. G. Breslin, "Insight-1 at semeval-2016 task 5: Deep learning for multilingual aspect-based sentiment analysis," arXiv Preprint, arXiv:1609.02748, 2016.
[50] P. Wang, J. Xu, B. Xu, C. Liu, H. Zhang, F. Wang, and H. Hao, "Semantic clustering and convolutional neural network for short text categorization," in Proc. Annu. Meeting Association Computational Linguistics, 2015, pp. 352–357.
[51] S. Poria, E. Cambria, D. Hazarika, and P. Vij, "A deeper look into sarcastic tweets using deep convolutional neural networks," in Proc. Int. Conf. Computational Linguistics, 2016, pp. 1601–1612.
[52] M. Denil, A. Demiraj, N. Kalchbrenner, P. Blunsom, and N. de Freitas, "Modelling, visualising and summarising documents with a single convolutional neural network," in Proc. 26th Int. Conf. Computational Linguistics, 2014, pp. 1601–1612.
[53] B. Hu, Z. Lu, H. Li, and Q. Chen, "Convolutional neural network architectures for matching natural language sentences," in Proc. Advances Neural Information Processing Systems, 2014, pp. 2042–2050.
[54] Y. Shen, X. He, J. Gao, L. Deng, and G. Mesnil, "A latent semantic model with convolutional-pooling structure for information retrieval," in Proc. 23rd ACM Int. Conf. Information and Knowledge Management, 2014, pp. 101–110.
[55] W. Yih, X. He, and C. Meek, "Semantic parsing for single-relation question answering," in Proc. Annu. Meeting Association Computational Linguistics, 2014, pp. 643–648.
[56] L. Dong, F. Wei, M. Zhou, and K. Xu, "Question answering over freebase with multi-column convolutional neural networks," in Proc. Annu. Meeting Association Computational Linguistics, 2015, pp. 260–269.
[57] A. Severyn and A. Moschitti, "Modeling relational information in question-answer pairs with convolutional neural networks," arXiv Preprint, arXiv:1604.01178, 2016.
[58] Y. Chen, L. Xu, K. Liu, D. Zeng, and J. Zhao, "Event extraction via dynamic multi-pooling convolutional neural networks," in Proc. Annu. Meeting Association Computational Linguistics, 2015, pp. 167–176.
[59] O. Abdel-Hamid, A.-R. Mohamed, H. Jiang, L. Deng, G. Penn, and D. Yu, "Convolutional neural networks for speech recognition," IEEE Trans. Acoust., Speech, Signal Process., vol. 22, no. 10, pp. 1533–1545, Oct. 2014.
[60] D. Palaz, M. Magimai-Doss, and R. Collobert, "Analysis of CNN-based speech recognition system using raw speech as input," Idiap, Tech. Rep., 2015.
[61] Z. Tu, B. Hu, Z. Lu, and H. Li, "Context-dependent translation selection using convolutional neural network," arXiv Preprint, arXiv:1503.02357, 2015.
[62] J. L. Elman, "Finding structure in time," Cogn. Sci., vol. 14, no. 2, pp. 179–211, 1990.
[63] T. Mikolov, S. Kombrink, L. Burget, J. Černocký, and S. Khudanpur, "Extensions of recurrent neural network language model," in Proc. Int. Conf. Acoustics, Speech and Signal Processing, 2011, pp. 5528–5531.
[64] I. Sutskever, J. Martens, and G. E. Hinton, "Generating text with recurrent neural networks," in Proc. 28th Int. Conf. Machine Learning, 2011, pp. 1017–1024.
[65] S. Liu, N. Yang, M. Li, and M. Zhou, "A recursive recurrent neural network for statistical machine translation," in Proc. 52nd Annu. Meeting Association Computational Linguistics, 2014, pp. 1491–1500.
[66] M. Auli, M. Galley, C. Quirk, and G. Zweig, "Joint language and translation modeling with recurrent neural networks," in Proc. Conf. Empirical Methods Natural Language Processing, 2013, pp. 1044–1054.
[67] I. Sutskever, O. Vinyals, and Q. V. Le, "Sequence to sequence learning with neural networks," in Proc. Advances Neural Information Processing Systems, 2014, pp. 3104–3112.
[68] T. Robinson, M. Hochberg, and S. Renals, "The use of recurrent neural networks in continuous speech recognition," in Proc. Automatic Speech and Speaker Recognition, 1996, pp. 233–258.
[69] A. Graves, A.-R. Mohamed, and G. Hinton, "Speech recognition with deep recurrent neural networks," in Proc. Int. Conf. Acoustics, Speech and Signal Processing, 2013, pp. 6645–6649.
[70] A. Graves and N. Jaitly, "Towards end-to-end speech recognition with recurrent neural networks," in Proc. 31st Int. Conf. Machine Learning, 2014, pp. 1764–1772.
[71] H. Sak, A. Senior, and F. Beaufays, "Long short-term memory based recurrent neural network architectures for large vocabulary speech recognition," arXiv Preprint, arXiv:1402.1128, 2014.
[72] A. Karpathy and L. Fei-Fei, "Deep visual-semantic alignments for generating image descriptions," in Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2015, pp. 3128–3137.
[73] D. Tang, B. Qin, and T. Liu, "Document modeling with gated recurrent neural network for sentiment classification," in Proc. Conf. Empirical Methods Natural Language Processing, 2015, pp. 1422–1432.
[74] J. Chung, C. Gulcehre, K. Cho, and Y. Bengio, "Empirical evaluation of gated recurrent neural networks on sequence modeling," arXiv Preprint, arXiv:1412.3555, 2014.
[75] K. Cho, B. Van Merriënboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio, "Learning phrase representations using RNN encoder-decoder for statistical machine translation," arXiv Preprint, arXiv:1406.1078, 2014.
[76] G. Chen, D. Ye, E. Cambria, J. Chen, and Z. Xing, "Ensemble application of convolutional and recurrent neural networks for multi-label text categorization," in Proc. Int. Joint Conf. Neural Networks, 2017, pp. 2377–2383.
[77] S. Poria, E. Cambria, D. Hazarika, N. Mazumder, A. Zadeh, and L. Morency, "Context-dependent sentiment analysis in user-generated videos," in Proc. Annu. Meeting Association Computational Linguistics, 2017, pp. 873–883.
[78] A. Zadeh, M. Chen, S. Poria, E. Cambria, and L. Morency, "Tensor fusion network for multimodal sentiment analysis," in Proc. Conf. Empirical Methods Natural Language Processing, 2017, pp. 1114–1125.
[79] E. Tong, A. Zadeh, C. Jones, and L.-P. Morency, "Combating human trafficking with deep multimodal models," arXiv Preprint, arXiv:1705.02735, 2017.
[80] I. Chaturvedi, E. Ragusa, P. Gastaldo, R. Zunino, and E. Cambria, "Bayesian network based extreme learning machine for subjectivity detection," J. Franklin Inst., vol. 355, no. 4, pp. 1780–1797, July 2018.
[81] Y. N. Dauphin, A. Fan, M. Auli, and D. Grangier, "Language modeling with gated convolutional networks," arXiv Preprint, arXiv:1612.08083, 2016.
[82] W. Yin, K. Kann, M. Yu, and H. Schütze, "Comparative study of CNN and RNN for natural language processing," arXiv Preprint, arXiv:1702.01923, 2017.
[83] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Comput., vol. 9, no. 8, pp. 1735–1780, 1997.
[84] F. A. Gers, J. Schmidhuber, and F. Cummins, "Learning to forget: Continual prediction with LSTM," in Proc. 9th Int. Conf. Artificial Neural Networks, 1999, pp. 850–855.
[85] G. Lample, M. Ballesteros, S. Subramanian, K. Kawakami, and C. Dyer, "Neural architectures for named entity recognition," arXiv Preprint, arXiv:1603.01360, 2016.
[86] A. Graves, "Generating sequences with recurrent neural networks," arXiv Preprint, arXiv:1308.0850, 2013.
[87] M. Sundermeyer, H. Ney, and R. Schlüter, "From feedforward to recurrent LSTM neural networks for language modeling," IEEE Trans. Audio, Speech, Language Process., vol. 23, no. 3, pp. 517–529, Mar. 2015.
[88] M. Sundermeyer, T. Alkhouli, J. Wuebker, and H. Ney, "Translation modeling with bidirectional recurrent neural networks," in Proc. Conf. Empirical Methods Natural Language Processing, 2014, pp. 14–25.
[89] R. Lowe, N. Pow, I. Serban, and J. Pineau, "The ubuntu dialogue corpus: A large dataset for research in unstructured multi-turn dialogue systems," arXiv Preprint, arXiv:1506.08909, 2015.
[90] O. Vinyals, A. Toshev, S. Bengio, and D. Erhan, "Show and tell: A neural image caption generator," in Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2015, pp. 3156–3164.
[91] O. Vinyals and Q. Le, "A neural conversational model," arXiv Preprint, arXiv:1506.05869, 2015.
[92] J. Li, M. Galley, C. Brockett, G. P. Spithourakis, J. Gao, and B. Dolan, "A persona-based neural conversation model," arXiv Preprint, arXiv:1603.06155, 2016.
[93] M. Malinowski, M. Rohrbach, and M. Fritz, "Ask your neurons: A neural-based approach to answering questions about images," in Proc. IEEE Int. Conf. Computer Vision, 2015, pp. 1–9.
[94] D. Bahdanau, K. Cho, and Y. Bengio, "Neural machine translation by jointly learning to align and translate," arXiv Preprint, arXiv:1409.0473, 2014.
[95] A. M. Rush, S. Chopra, and J. Weston, "A neural attention model for abstractive sentence summarization," arXiv Preprint, arXiv:1509.00685, 2015.
[96] K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhudinov, R. Zemel, and Y. Bengio, "Show, attend and tell: Neural image caption generation with visual attention," in Proc. Int. Conf. Machine Learning, 2015, pp. 2048–2057.
[97] O. Vinyals, Ł. Kaiser, T. Koo, S. Petrov, I. Sutskever, and G. Hinton, "Grammar as a foreign language," in Proc. Advances Neural Information Processing Systems, 2015, pp. 2773–2781.
[98] O. Vinyals, M. Fortunato, and N. Jaitly, "Pointer networks," in Proc. Advances Neural Information Processing Systems, 2015, pp. 2692–2700.
[99] R. Paulus, C. Xiong, and R. Socher, "A deep reinforced model for abstractive summarization," arXiv Preprint, arXiv:1705.04304, 2017.



[100] Y. Wang, M. Huang, X. Zhu, and L. Zhao, "Attention-based LSTM for aspect-level sentiment classification," in Proc. Conf. Empirical Methods Natural Language Processing, 2016, pp. 606–615.
[101] Y. Ma, H. Peng, and E. Cambria, "Targeted aspect-based sentiment analysis via embedding commonsense knowledge into an attentive LSTM," in Proc. Association Advancement Artificial Intelligence Conf., 2018, pp. 5876–5883.
[102] R. Socher, B. Huval, C. D. Manning, and A. Y. Ng, "Semantic compositionality through recursive matrix-vector spaces," in Proc. Joint Conf. Empirical Methods Natural Language Processing and Computational Natural Language Learning, 2012, pp. 1201–1211.
[103] B. Taskar, C. Guestrin, and D. Koller, "Max-margin Markov networks," in Proc. Advances Neural Information Processing Systems, 2004, pp. 25–32.
[104] S. R. Bowman, C. Potts, and C. D. Manning, "Recursive neural networks can learn logical semantics," arXiv Preprint, arXiv:1406.1827, 2014.
[105] K. S. Tai, R. Socher, and C. D. Manning, "Improved semantic representations from tree-structured long short-term memory networks," arXiv Preprint, arXiv:1503.00075, 2015.
[106] S. Bengio, O. Vinyals, N. Jaitly, and N. Shazeer, "Scheduled sampling for sequence prediction with recurrent neural networks," in Proc. Advances Neural Information Processing Systems, 2015, pp. 1171–1179.
[107] M. Ranzato, S. Chopra, M. Auli, and W. Zaremba, "Sequence level training with recurrent neural networks," arXiv Preprint, arXiv:1511.06732, 2015.
[108] J. Li, W. Monroe, A. Ritter, M. Galley, J. Gao, and D. Jurafsky, "Deep reinforcement learning for dialogue generation," arXiv Preprint, arXiv:1606.01541, 2016.
[109] R. J. Williams, "Simple statistical gradient-following algorithms for connectionist reinforcement learning," Mach. Learn., vol. 8, no. 3–4, pp. 229–256, 1992.
[110] S. Young, M. Gašič, S. Keizer, F. Mairesse, J. Schatzmann, B. Thomson, and K. Yu, "The hidden information state model: A practical framework for POMDP-based spoken dialogue management," Comput. Speech Lang., vol. 24, no. 2, pp. 150–174, June 2010.
[111] S. Young, M. Gašič, B. Thomson, and J. D. Williams, "POMDP-based statistical spoken dialog systems: A review," Proc. IEEE, vol. 101, no. 5, pp. 1160–1179, 2013.
[112] P.-H. Su, V. David, D. Kim, T.-H. Wen, and S. Young, "Learning from real users: Rating dialogue success with neural networks for reinforcement learning in spoken dialogue systems," in Proc. Interspeech Conf., 2015, pp. 2007–2011.
[113] P.-H. Su, M. Gasic, N. Mrksic, L. Rojas-Barahona, S. Ultes, D. Vandyke, T. Wen, and S. Young, "On-line active reward learning for policy optimisation in spoken dialogue systems," arXiv Preprint, arXiv:1605.07669, 2016.
[114] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, "Generative adversarial nets," in Proc. Advances Neural Information Processing Systems, 2014, pp. 2672–2680.
[115] J. Li, W. Monroe, T. Shi, A. Ritter, and D. Jurafsky, "Adversarial learning for neural dialogue generation," arXiv Preprint, arXiv:1701.06547, 2017.
[116] R. Kiros, Y. Zhu, R. R. Salakhutdinov, R. Zemel, R. Urtasun, A. Torralba, and S. Fidler, "Skip-thought vectors," in Proc. Advances Neural Information Processing Systems, 2015, pp. 3294–3302.
[117] A. M. Dai and Q. V. Le, "Semi-supervised sequence learning," in Proc. Advances Neural Information Processing Systems, 2015, pp. 3079–3087.
[118] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, "Learning internal representations by error propagation," DTIC Document, Tech. Rep., 1985.
[119] D. P. Kingma and M. Welling, "Auto-encoding variational bayes," arXiv Preprint, arXiv:1312.6114, 2013.
[120] S. R. Bowman, L. Vilnis, O. Vinyals, A. M. Dai, R. Jozefowicz, and S. Bengio, "Generating sentences from a continuous space," arXiv Preprint, arXiv:1511.06349, 2015.
[121] Y. Zhang, Z. Gan, and L. Carin, "Generating text via adversarial training," in Proc. Neural Information Processing Systems Workshop Adversarial Training, 2016.
[122] Z. Hu, Z. Yang, X. Liang, R. Salakhutdinov, and E. P. Xing, "Controllable text generation," arXiv Preprint, arXiv:1703.00955, 2017.
[123] L. Yu, W. Zhang, J. Wang, and Y. Yu, "Seqgan: Sequence generative adversarial nets with policy gradient," in Proc. Association Advancement Artificial Intelligence Conf., 2017, pp. 2852–2858.
[124] S. Rajeswar, S. Subramanian, F. Dutil, C. Pal, and A. Courville, "Adversarial generation of natural language," arXiv Preprint, arXiv:1705.10929, 2017.
[125] J. Giménez and L. Marquez, "Fast and accurate part-of-speech tagging: The SVM approach revisited," Recent Adv. Natural Lang. Process., pp. 153–162, 2004.
[126] Z. Huang, W. Xu, and K. Yu, "Bidirectional LSTM-CRF models for sequence tagging," arXiv Preprint, arXiv:1508.01991, 2015.
[127] D. Andor, C. Alberti, D. Weiss, A. Severyn, A. Presta, K. Ganchev, S. Petrov, and M. Collins, "Globally normalized transition-based neural networks," arXiv Preprint, arXiv:1603.06042, 2016.
[128] A. Kumar, O. Irsoy, P. Ondruska, M. Iyyer, J. Bradbury, I. Gulrajani, V. Zhong, R. Paulus, and R. Socher, "Ask me anything: Dynamic memory networks for natural language processing," in Proc. Int. Conf. Machine Learning, 2016, pp. 1378–1387.
[129] D. Chen and C. D. Manning, "A fast and accurate dependency parser using neural networks," in Proc. Conf. Empirical Methods Natural Language Processing, 2014, pp. 740–750.
[130] D. Weiss, C. Alberti, M. Collins, and S. Petrov, "Structured training for neural network transition-based parsing," arXiv Preprint, arXiv:1506.06158, 2015.
[131] C. Dyer, M. Ballesteros, W. Ling, A. Matthews, and N. A. Smith, "Transition-based dependency parsing with stack long short-term memory," arXiv Preprint, arXiv:1505.08075, 2015.
[132] H. Zhou, Y. Zhang, C. Cheng, S. Huang, X. Dai, and J. Chen, "A neural probabilistic structured-prediction method for transition-based natural language processing," J. Artif. Intell. Res., vol. 58, pp. 703–729, Mar. 2017.
[133] S. Petrov, L. Barrett, R. Thibaux, and D. Klein, "Learning accurate, compact, and interpretable tree annotation," in Proc. 21st Int. Conf. Computational Linguistics, 2006, pp. 433–440.
[134] M. Zhu, Y. Zhang, W. Chen, M. Zhang, and J. Zhu, "Fast and accurate shift-reduce constituent parsing," in Proc. Annu. Meeting Association Computational Linguistics, 2013, pp. 434–443.
[135] J. Weston, S. Chopra, and A. Bordes, "Memory networks," arXiv Preprint, arXiv:1410.3916, 2014.
[136] S. Sukhbaatar, A. Szlam, J. Weston, and R. Fergus, "End-to-end memory networks," in Proc. Advances Neural Information Processing Systems, 2015, pp. 2440–2448.
[137] C. Xiong, S. Merity, and R. Socher, "Dynamic memory networks for visual and textual question answering," in Proc. Int. Conf. Machine Learning, 2016, pp. 2397–2406.
[138] A. Passos, V. Kumar, and A. McCallum, "Lexicon infused phrase embeddings for named entity resolution," arXiv Preprint, arXiv:1404.5367, 2014.
[139] J. P. Chiu and E. Nichols, "Named entity recognition with bidirectional LSTM-CNNs," arXiv Preprint, arXiv:1511.08308, 2015.
[140] G. Luo, X. Huang, C. Lin, and Z. Nie, "Joint named entity recognition and disambiguation," in Proc. Conf. Empirical Methods Natural Language Processing, 2015, pp. 879–880.
[141] E. Strubell, P. Verga, D. Belanger, and A. McCallum, "Fast and accurate sequence labeling with iterated dilated convolutions," arXiv Preprint, arXiv:1702.02098, 2017.
[142] O. Täckström, K. Ganchev, and D. Das, "Efficient inference and structured learning for semantic role labeling," Trans. Assoc. Comput. Linguistics, vol. 3, pp. 29–41, Jan. 2015.
[143] J. Zhou and W. Xu, "End-to-end learning of semantic role labeling using recurrent neural networks," in Proc. Annu. Meeting Association Computational Linguistics, 2015, pp. 1127–1137.
[144] L. He, K. Lee, M. Lewis, and L. Zettlemoyer, "Deep semantic role labeling: What works and what's next," in Proc. Annu. Meeting Association Computational Linguistics, 2017, pp. 473–483.
[145] Q. Le and T. Mikolov, "Distributed representations of sentences and documents," in Proc. 31st Int. Conf. Machine Learning, 2014, pp. 1188–1196.
[146] L. Yu, J. Wang, K. R. Lai, and X. Zhang, "Refining word embeddings for sentiment analysis," in Proc. Conf. Empirical Methods Natural Language Processing, 2017, pp. 545–550.
[147] Y. Wu, M. Schuster, Z. Chen, Q. V. Le, M. Norouzi, W. Macherey, M. Krikun, Y. Cao, Q. Gao, K. Macherey, et al., "Google's neural machine translation system: Bridging the gap between human and machine translation," arXiv Preprint, arXiv:1609.08144, 2016.
[148] J. Gehring, M. Auli, D. Grangier, D. Yarats, and Y. N. Dauphin, "Convolutional sequence to sequence learning," arXiv Preprint, arXiv:1705.03122, 2017.
[149] A. Vaswani, N. Shazeer, N. Parmar, and J. Uszkoreit, "Attention is all you need," arXiv Preprint, arXiv:1706.03762, 2017.
[150] R. K. Srivastava, K. Greff, and J. Schmidhuber, "Training very deep networks," in Proc. Advances Neural Information Processing Systems, 2015, pp. 2377–2385.
[151] B. Pang and L. Lee, "Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales," in Proc. 43rd Annu. Meeting Association Computational Linguistics, 2005, pp. 115–124.
[152] A. Zadeh, R. Zellers, E. Pincus, and L. Morency, "Multimodal sentiment intensity analysis in videos: Facial gestures and verbal messages," IEEE Intell. Syst., vol. 31, no. 6, pp. 82–88, Nov. 2016.
[153] A. Fader, L. S. Zettlemoyer, and O. Etzioni, "Paraphrase-driven learning for open question answering," in Proc. Annu. Meeting Association Computational Linguistics, 2013, pp. 1608–1618.
[154] A. Bordes, J. Weston, and N. Usunier, "Open question answering with weakly supervised embedding models," in Proc. Joint European Conf. Machine Learning and Knowledge Discovery Databases, 2014, pp. 165–180.
[155] A. Ritter, C. Cherry, and W. B. Dolan, "Data-driven response generation in social media," in Proc. Conf. Empirical Methods Natural Language Processing, 2011, pp. 583–593.
[156] A. Sordoni, M. Galley, M. Auli, C. Brockett, Y. Ji, M. Mitchell, J.-Y. Nie, J. Gao, and B. Dolan, "A neural network approach to context-sensitive generation of conversational responses," arXiv Preprint, arXiv:1506.06714, 2015.
[157] J. Li, M. Galley, C. Brockett, J. Gao, and B. Dolan, "A diversity-promoting objective function for neural conversation models," arXiv Preprint, arXiv:1510.03055, 2015.
[158] J. Dodge, A. Gane, X. Zhang, A. Bordes, S. Chopra, A. Miller, A. Szlam, and J. Weston, "Evaluating prerequisite qualities for learning end-to-end dialog systems," arXiv Preprint, arXiv:1511.06931, 2015.
[159] X. Zhou, D. Dong, H. Wu, S. Zhao, D. Yu, H. Tian, X. Liu, and R. Yan, "Multi-view response selection for human-computer conversation," in Proc. Conf. Empirical Methods Natural Language Processing, 2016, pp. 372–381.
[160] P. Koehn, F. J. Och, and D. Marcu, "Statistical phrase-based translation," in Proc. Conf. North American Chapter Association Computational Linguistics, 2003, pp. 48–54.
[161] K. Papineni, S. Roukos, T. Ward, and W. Zhu, "Bleu: A method for automatic evaluation of machine translation," in Proc. 40th Annu. Meeting Association Computational Linguistics, 2002, pp. 311–318.
[162] I. V. Serban, A. Sordoni, Y. Bengio, A. C. Courville, and J. Pineau, "Building end-to-end dialogue systems using generative hierarchical neural network models," in Proc. Association Advancement Artificial Intelligence Conf., 2016, pp. 3776–3784.
[163] Y. LeCun, Y. Bengio, and G. Hinton, "Deep learning," Nature, vol. 521, no. 7553, pp. 436–444, May 2015.
[164] T. Baltrušaitis, C. Ahuja, and L. Morency, "Multimodal machine learning: A survey and taxonomy," arXiv Preprint, arXiv:1705.09406, 2017.



Conference Calendar
Bernadette Bouchon-Meunier, University Pierre et Marie Curie, FRANCE

* Denotes a CIS-Sponsored Conference
∆ Denotes a CIS Technical Co-Sponsored Conference

∆ 13th International Workshop on Semantic and Social Media Adaptation and Personalization (SMAP 2018)
September 6-7, 2018
Place: Zaragoza, Spain
General Chairs: Sergio Ilarri, Fernando Bobillo, Raquel Trillo-Lado, and Martín López-Nores
Website: http://smap2018.unizar.es

* 2018 IEEE International Conference on Data Science and Advanced Analytics (IEEE DSAA 2018)
October 1-4, 2018
Place: Turin, Italy
General Chairs: Francesco Bonchi and Foster Provost
Website: https://dsaa2018.isi.it

* 2018 IEEE Smart World Congress (IEEE SWC 2018)
October 8-12, 2018
Place: Guangzhou, China
General Chairs: Guojun Wang and Yew Soon Ong
Website: http://www.smart-world.org/2018/

* 2018 IEEE Latin American Conference on Computational Intelligence (2018 IEEE LA-CCI)
November 7-9, 2018
Place: Guadalajara, Mexico
General Chairs: Alma Y. Alanis and Marco A. Perez-Cisneros
Website: http://la-cci.org

* 2018 IEEE Symposium Series on Computational Intelligence (IEEE SSCI 2018)
November 18-21, 2018
Place: Bangalore, India
General Co-Chairs: Sundaram Suresh and Koshy George
Website: http://ieee-ssci2018.org

* 2019 IEEE Conference on Computational Intelligence for Financial Engineering and Economics (IEEE CIFEr 2019)
May 4-5, 2019
Place: Shenzhen, China
General Chairs: Hisao Ishibuchi and Dongbin Zhao
Website: TBA

* 2019 IEEE Congress on Evolutionary Computation (IEEE CEC 2019)
June 10-13, 2019
Place: Wellington, New Zealand
General Co-Chairs: Mengjie Zhang and Kay Chen Tan
Website: http://www.cec2019.org

* 2019 IEEE International Conference on Fuzzy Systems (FUZZ-IEEE 2019)
June 23-26, 2019
Place: New Orleans, USA
General Chairs: Timothy C. Havens and James M. Keller
Website: http://www.fuzzieee.org

* 2019 IEEE Conference on Computational Intelligence in Bioinformatics and Computational Biology (IEEE CIBCB 2019)
July 24-27, 2019
Place: Siena, Italy
General Chair: Giuseppe Nicosia
Website: TBA

* The 9th Joint IEEE International Conference on Developmental Learning and Epigenetic Robotics (IEEE ICDL-EpiRob 2019)
August 19-22, 2019
Place: Oslo, Norway
General Chairs: Jim Torresen and Kerstin Dautenhahn
Website: TBA

* 2019 IEEE International Conference on Games (IEEE CoG 2019)
August 20-23, 2019
Place: London, United Kingdom
General Chairs: Diego Perez Liebana and Sanaz Mostaghim
Website: TBA

* 2019 IEEE Latin American Conference on Computational Intelligence (2019 IEEE LA-CCI)
November 11-15, 2019
Place: Guayaquil, Ecuador
General Chair: Otilia Alejandro
Website: http://la-cci.org

* 2019 IEEE Symposium Series on Computational Intelligence (IEEE SSCI 2019)
December 6-9, 2019
Place: Xiamen, China
General Chair: Tingwen Huang
Website: http://ssci2019.org

* 2020 IEEE World Congress on Computational Intelligence (IEEE WCCI 2020)
July 19-24, 2020
Place: Glasgow, UK
General Chairs: Amir Hussain, Marios M. Polycarpou, and Xin Yao
Website: TBA



New Journal Title

In January 2018, the IEEE Transactions on Computational Intelligence and AI in Games changed its name to IEEE Transactions on Games. The new title reflects the widened scope of the Transactions, which welcomes high-quality original scientific, technical, and engineering papers on all forms of games, covering topics such as artificial intelligence, game design, human-computer interaction, education and serious games, software engineering in games, and virtual and augmented reality, among others.


2019 IEEE Congress on Evolutionary Computation (IEEE CEC 2019)
10-13 June 2019, Wellington, New Zealand

The IEEE Congress on Evolutionary Computation (IEEE CEC) is a world-class event in the field of Evolutionary Computation. It provides a forum to bring together researchers and practitioners from all over the world to present and discuss their research findings on Evolutionary Computation.

IEEE CEC 2019 will be held in Wellington, New Zealand. Wellington is known as the 'Coolest Little Capital'. It is famous for a vibrant creative culture fuelled by events and great food. Wellington offers a wide range of cosmopolitan amenities in a downtown that is safe, clean, and pedestrian friendly.

Call for Papers
Papers for IEEE CEC 2019 should be submitted electronically through the Congress website at www.cec2019.org, and will be refereed by experts in the fields and ranked based on the criteria of originality, significance, quality and clarity.

Call for Special Sessions
Special session proposals are invited for IEEE CEC 2019. Special session proposals should include the title, aim and scope (including a list of main topics), and the names of the organizers of the special session, together with a short biography of all organizers. A list of potential contributors will be very helpful. All special session proposals should be submitted to the Special Session Chair: Prof Chuan-Kang Ting (ckting@pme.nthu.edu.tw).

Call for Tutorials
IEEE CEC 2019 solicits proposals for tutorials covering specific topics in Evolutionary Computation. If you are interested in proposing a tutorial, would like to recommend someone who might be interested, or have questions about tutorials, please contact the Tutorial Chair: Prof Xiaodong Li (xiaodong.li@rmit.edu.au).

Call for Competitions
Competitions will be held as part of the Congress. Prospective competition organizers are invited to submit their proposals to the Competition Chair: Dr Jialin Liu (jialin.liu@qmul.ac.uk).

Call for Workshops
Workshops will be held to provide participants with the opportunity to present and discuss novel research ideas on active and emerging topics in Evolutionary Computation. Prospective workshop organizers are invited to submit their proposals to the Workshop Chair: Dr Handing Wang (handing.wang@surrey.ac.uk).

Important Dates
26 October 2018: Special Session Proposal Deadline
26 November 2018: Competition Proposal Deadline
7 January 2019: Paper Submission, Workshop Proposal, and Tutorial Proposal Deadline
7 March 2019: Notification Deadline
31 March 2019: Camera-ready and Early Registration Deadline

Advisory Board: Hussein Abbass, Australia; Kalyanmoy Deb, USA; David Fogel, USA; Kwong Tak Wu Sam, Hong Kong SAR; Simon Lucas, UK; Zbigniew Michalewicz, Australia; Xin Yao, UK; Gary Yen, USA
General Co-Chairs: Mengjie Zhang, New Zealand; Kay Chen Tan, Hong Kong SAR
Program Chair: Carlos A. Coello Coello, Mexico
Technical Co-Chairs: Jürgen Branke, UK; Oscar Cordón, Spain; Hisao Ishibuchi, Japan; Jing Liu, China; Gabriela Ochoa, UK; Dipti Srinivasan, Singapore
Plenary Talk Chair: Yaochu Jin, UK
Special Session Chair: Chuan-Kang Ting, Taiwan
Tutorial Chair: Xiaodong Li, Australia
Competition Chair: Jialin Liu, UK
Workshop Chair: Handing Wang, UK
Submission Chair: Huanhuan Chen, China
Sponsorship Chair: Andy Song, Australia
Poster Chair: Kai Qin, Australia
Publicity Co-Chairs: Stefano Cagnoni, Italy; Anna I. Esparcia-Alcázar, Spain; Emma Hart, UK; Bin Hu, Austria; Sanaz Mostaghim, Germany; Yew Soon Ong, Singapore; Jun Zhang, China
Finance Chair: Bing Xue, New Zealand
Local Organising Co-Chairs: Will Browne, New Zealand; Hui Ma, New Zealand
Registration Chair: Aaron Chen, New Zealand
Proceedings Chair: Yi Mei, New Zealand
Web Masters: Harith Al-Sahaf, Yiming Peng, and Qi Chen

Sponsored by Victoria University of Wellington (Te Whare Wananga o te Upoko o te Ika a Maui)
Find us at www.cec2019.org | admin@cec2019.org
