GoChat: Goal-oriented Chatbots with Hierarchical Reinforcement Learning

Jianfeng Liu, Feiyang Pan*, Ling Luo
Institute of Computing Technology, Chinese Academy of Sciences; UCAS
liujianfeng18s@ict.ac.cn, panfeiyang@ict.ac.cn, luoling18s@ict.ac.cn
* Corresponding author.

ABSTRACT
A chatbot that converses like a human should be goal-oriented (i.e., be purposeful in conversation), which is beyond language generation. However, existing dialogue systems often rely heavily on cumbersome hand-crafted rules or costly labelled datasets to reach their goals. In this paper, we propose Goal-oriented Chatbots (GoChat), a framework for end-to-end training of chatbots to maximize the long-term return from offline multi-turn dialogue datasets. Our framework utilizes hierarchical reinforcement learning (HRL), where the high-level policy guides the conversation towards the final goal by determining sub-goals, and the low-level policy fulfills the sub-goals by generating the corresponding utterance for response. In our experiments on a real-world dialogue dataset for anti-fraud in finance, our approach outperforms previous methods on both the quality of response generation and the success rate of accomplishing the goal.

CCS CONCEPTS
• Computing methodologies → Natural language processing.

KEYWORDS
Goal-oriented Chatbot, Dialogue System, Reinforcement Learning

ACM Reference Format:
Jianfeng Liu, Feiyang Pan, and Ling Luo. 2020. GoChat: Goal-oriented Chatbots with Hierarchical Reinforcement Learning. In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '20), July 25–30, 2020, Virtual Event, China. ACM, New York, NY, USA, 5 pages. https://doi.org/10.1145/3397271.3401250

1 INTRODUCTION
Chatbots, or human-like dialogue systems, play an important role in connecting machine-learning-powered platforms with users. A chatbot that converses like a human should ideally (1) be capable of fluent language generation, and (2) be driven by long-term goals in a multi-turn conversation. In this paper, we develop a practical unified framework for Goal-oriented Chatbots (GoChat for short), which directly addresses both challenges.

Previous work on chatbots can be roughly categorized into two types: open-domain chatbots and task-oriented dialogue systems. Open-domain chatbots are not task-specific and can therefore be trained end-to-end on large corpora. With the development of deep sequence-to-sequence (Seq2Seq) learning, they are modeled as a source-to-target sequence prediction task. However, they cannot support specific tasks by pursuing long-term goals. On the other hand, task-oriented dialogue systems [6, 16] are capable of dealing with domain-specific tasks, e.g., recommending products [16] or booking restaurants [6]. Recently, some dialogue systems have been proposed to bargain on goods [1, 4] or to recommend products [2].

However, designing such task-oriented dialogue systems often requires a lot of hand-crafting, for example, decomposing the task into sub-tasks with manually-designed rules, or using heavily labelled datasets to build a finite-state machine. Such ad-hoc methods are difficult to scale to complex real-world problems.

In light of these observations, in this paper we propose the GoChat framework for end-to-end training of the chatbot to maximize its long-term return, without task-specific design of the dialogue problem. We leverage hierarchical reinforcement learning (HRL), which simultaneously learns a high-level policy to guide the conversation and a low-level policy to generate the response according to the guidance from the high-level policy.

In particular, at each turn of the multi-turn conversation, the high-level policy (the manager) observes the previous conversation as its state, determines a sub-goal as its action, and waits for a reward from the environment representing whether the final goal is accomplished or not. The low-level policy (the worker), on the other hand, observes the state together with the sub-goal from its manager, generates the corresponding utterance for response as its action to fulfill the sub-goal, and receives a reward from the manager. In this way, the goal is decomposed into a hierarchy of goals, so the chatbot can be trained efficiently in an end-to-end manner.

Our dialogue agent is trained with Advantage Actor-Critic (A2C) [8], an algorithm widely used in modern reinforcement learning and its applications [9]. We conducted experiments on a goal-oriented dialogue corpus from real-world financial data, where each dialogue has a binary label indicating whether the goal is achieved. Quantitative indicators as well as human evaluations demonstrate that GoChat significantly outperforms previous methods on both conversation quality and the success rate of achieving the goals.
[Figure 1 omitted: architecture diagram. The environment provides the state s_t and reward r_t to the manager (a hierarchical attention network that outputs the sub-goal g_t); the worker (encoder, context RNN, decoder) generates the response and receives the worker's reward r_t^w.]

Figure 1: The overall GoChat framework for goal-oriented chatbots with hierarchical reinforcement learning. The manager decides the sub-goals for the worker, and the worker generates the response according to the observations and the sub-goal.

2 GOCHAT: GOAL-ORIENTED CHATBOTS

2.1 Problem formulation
Recall that our chatbot is intended to accomplish some goal via multi-turn conversations. We denote a complete dialogue with $T$ turns as $\{u_{h_1}, u_{c_1}, \ldots, u_{h_T}, u_{c_T}\}$, where $u_{h_t}$ and $u_{c_t}$, $t = 1, \ldots, T$, stand for the utterances generated by the human and the chatbot at the $t$-th turn, respectively. Each utterance is a sequence of $n$ tokens (for sentences with fewer than $n$ words, we use zero-padding to complete the sequence). At the $t$-th turn, given the dialogue of the previous $t-1$ turns as well as the $t$-th utterance produced by the human, the chatbot needs to generate $u_{c_t}$ as its response. The target of the chatbot is to accomplish the goal through the whole conversation.

We formulate our problem as a Markov Decision Process (MDP). At time step $t$, the state is defined as $s_t = \{u_{h_1}, u_{c_1}, \ldots, u_{h_t}\}$, and the action is the chatbot's response sequence $u_{c_t}$. For simplicity, we suppose that the reward of accomplishing the goal is given only at the end of the conversation, i.e., $r_T = 1$ if the goal is accomplished and $r_T = -1$ otherwise. The intermediate rewards are always 0, i.e., $r_t = 0$ for all $t < T$.
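To make this terminal-reward scheme concrete, the following sketch (our illustration, not part of the paper's released code; names are hypothetical) turns a dialogue with a binary outcome label into per-turn (state, action, reward) tuples.

```python
from typing import List, Tuple

def dialogue_to_transitions(turns: List[Tuple[str, str]],
                            goal_accomplished: bool):
    """Convert a labeled T-turn dialogue into (state, action, reward) tuples.

    `turns` is a list of (human_utterance, chatbot_utterance) pairs.
    Only the final turn carries the +1/-1 outcome reward; every earlier
    turn receives 0, matching the sparse terminal reward of Section 2.1.
    """
    transitions = []
    state = []                        # s_t: all utterances observed so far
    T = len(turns)
    for t, (u_h, u_c) in enumerate(turns, start=1):
        state = state + [u_h]         # s_t includes the human's t-th utterance
        reward = 0.0
        if t == T:
            reward = 1.0 if goal_accomplished else -1.0
        transitions.append((list(state), u_c, reward))
        state = state + [u_c]         # the chatbot's response extends the context
    return transitions
```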
As the outcome can only be observed after a multi-turn conversation, the decision-making process is challenging. We adopt Hierarchical Reinforcement Learning (HRL) to deal with such a long-horizon problem with sparse and delayed rewards, and propose the GoChat framework to simultaneously learn the high-level and low-level policies with end-to-end training.
2.2 The hierarchy of chatbot policies
The overall framework is illustrated in Figure 1.

At the $t$-th turn, given a state $s_t = \{u_{h_1}, u_{c_1}, \ldots, u_{h_t}\}$, the high-level policy (the manager) first determines a sub-goal $g_t \in \mathbb{R}^K$, which is a one-hot vector drawn from a $K$-dimensional categorical distribution $g_t \sim \pi_\psi(g_t \mid s_t)$. Observing the state and the sub-goal, the low-level worker outputs the response $u_{c_t} \sim \pi_\theta(u_{c_t} \mid s_t, g_t)$ with an auto-regressive Seq2Seq model, i.e.,

$$\pi_\theta\big(u_{c_t} = (\hat{y}_1, \ldots, \hat{y}_n) \mid s_t, g_t\big) = \prod_{\tau=1}^{n} \pi_\theta(\hat{y}_\tau \mid s_t, g_t, \hat{y}_{<\tau}). \quad (1)$$

Then, the manager receives a reward $r_t$ from the dialogue environment (described in Section 2.1), while the worker receives a reward $r_t^w$ from the manager (detailed in Section 2.2.2). Next, we introduce the model structures of the manager and the worker.
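Before detailing the two modules, here is a sketch of one full hierarchical decision step as described above; it is illustrative only, with `manager`, `worker.step`, and `worker.eos_id` as hypothetical interfaces standing in for the HAN manager and VHRED worker introduced below.

```python
import torch
import torch.nn.functional as F

def hierarchical_turn(manager, worker, state_utterances, max_len=30):
    """One GoChat turn: sample a sub-goal g_t from the manager, then let the
    worker decode the response token-by-token conditioned on (s_t, g_t), Eq. (1).

    `manager` maps the state to K sub-goal logits; `worker.step(state, g, prefix)`
    returns next-token logits and `worker.eos_id` marks end-of-sentence.
    """
    logits = manager(state_utterances)                        # shape (K,)
    g_dist = torch.distributions.Categorical(logits=logits)
    g_index = g_dist.sample()
    g_t = F.one_hot(g_index, num_classes=logits.numel()).float()
    log_pi_g = g_dist.log_prob(g_index)                       # log pi_psi(g_t | s_t)

    response, log_pi_u = [], torch.zeros(())
    for _ in range(max_len):                                   # auto-regressive decoding
        token_logits = worker.step(state_utterances, g_t, response)
        token_dist = torch.distributions.Categorical(logits=token_logits)
        y = token_dist.sample()
        log_pi_u = log_pi_u + token_dist.log_prob(y)           # sums to log pi_theta(u_ct | s_t, g_t)
        response.append(int(y))
        if int(y) == worker.eos_id:                            # stop at end-of-sentence
            break
    return g_t, response, log_pi_g, log_pi_u
```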
2.2.1 Manager. We adopt the hierarchical attention network (HAN) [7, 17] as the network structure of our manager, which learns the contextual information by sequentially encoding all the previous dialogue turns in the state. In more detail, each utterance $u_{h_1}, u_{c_1}, \ldots, u_{h_t}$ is first encoded by a word-level encoder with an attention mechanism into a hidden vector $h_{h_1}, h_{c_1}, \ldots, h_{h_t}$, serving as the latent representation of the utterance.

Next, all the derived sentence vectors $h_{h_1}, h_{c_1}, \ldots, h_{h_t}$ are encoded at the dialogue level, again with an attention mechanism, to form a unified dialogue-level hidden representation $H_t$. $H_t$ is thus a high-level feature summarizing the current conversation, which is used to choose the next sub-goal,

$$g_t \sim \pi_\psi(g_t \mid s_t) = \mathrm{Categorical}\big(\mathrm{softmax}(W_g^\top H_t + b_g)\big), \quad (2)$$

where $W_g$ and $b_g$ are trainable parameters.
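A minimal sketch of the manager's sub-goal head in Eq. (2), assuming the HAN layers (elided here) already produce the dialogue representation $H_t$; the dimensions follow the settings reported in Section 3.2.

```python
import torch
import torch.nn as nn

class ManagerHead(nn.Module):
    """Maps the HAN dialogue representation H_t to a categorical
    distribution over K sub-goals, as in Eq. (2)."""

    def __init__(self, dialog_dim: int = 50, num_subgoals: int = 14):
        super().__init__()
        self.linear = nn.Linear(dialog_dim, num_subgoals)   # W_g and b_g of Eq. (2)

    def forward(self, H_t: torch.Tensor) -> torch.distributions.Categorical:
        logits = self.linear(H_t)                            # W_g^T H_t + b_g
        return torch.distributions.Categorical(logits=logits)

# usage: dist = ManagerHead()(H_t); g = dist.sample(); log_pi_g = dist.log_prob(g)
```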
2.2.2 Worker. The worker generates the response $u_{c_t}$ under the high-level guidance from the manager. We adopt the Variational Hierarchical Recurrent Encoder-Decoder (VHRED) [13] as the model structure of the worker, which is a commonly used architecture for multi-turn dialogue systems. As shown in Figure 1, the worker consists of three components for generation, namely the encoder, the context RNN, and the decoder.

At time step $t$, the encoder first encodes the human utterance $u_{h_t}$ into an utterance representation $h_t^w$, which is then fed to a higher-level encoder, the context RNN, to model the long-term aspects of the conversation. The context RNN produces a dialogue representation $h_t^c$ by taking the $t$-th utterance representation $h_t^w$ and the dialogue representation $h_{t-1}^c$ from the previous turn as input. Next, $h_t^c$ is fed to a fully connected layer that predicts the mean $\mu$ and variance $\sigma$ of a diagonal Gaussian distribution serving as the variational posterior approximation for a latent variable $z_t$. The latent $z_t$ is then sampled from the posterior with the reparameterization trick, i.e., $z_t = \mu + \sigma \circ \epsilon$ with $\epsilon \sim \mathcal{N}(0, 1)$. The latent can therefore be seen as drawn conditioned on the state: $z_t \sim \mathcal{N}(\mu(s_t), \sigma(s_t))$.

Finally, the sub-goal from the manager influences the generation process by perturbing the latent $z_t$. We obtain an augmented latent variable by simply concatenating $z_t$ with $g_t$, i.e., $\tilde{z}_t = [z_t, g_t]$. The output utterance $u_{c_t} = [\hat{y}_1, \ldots, \hat{y}_n]$ is then generated by the sequential decoder with $\tilde{z}_t$ and $h_t^c$ as input.
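A sketch of the reparameterized latent and the sub-goal concatenation described above (our illustration; the surrounding VHRED encoder, context RNN, and decoder are omitted, and layer names are hypothetical).

```python
import torch
import torch.nn as nn

class SubgoalLatent(nn.Module):
    """Computes z_t = mu + sigma * eps from the context vector h_t^c and
    appends the one-hot sub-goal g_t, yielding the augmented latent z~_t."""

    def __init__(self, context_dim: int, latent_dim: int):
        super().__init__()
        self.to_mu = nn.Linear(context_dim, latent_dim)
        self.to_logvar = nn.Linear(context_dim, latent_dim)

    def forward(self, h_c: torch.Tensor, g_t: torch.Tensor) -> torch.Tensor:
        mu = self.to_mu(h_c)
        sigma = torch.exp(0.5 * self.to_logvar(h_c))   # diagonal standard deviation
        eps = torch.randn_like(sigma)                   # eps ~ N(0, 1)
        z_t = mu + sigma * eps                          # reparameterization trick
        return torch.cat([z_t, g_t], dim=-1)            # z~_t = [z_t, g_t]
```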
It is challenging to design a reasonable reward function that measures how well the worker has achieved its sub-goal, especially when our agent is trained on off-policy data. In our framework, we construct such a reward directly from the offline dataset as follows. First, before the end-to-end training, each (source, target) pair $(s, u_c)$ is assigned an offline sub-goal vector using unsupervised clustering, so that the dataset becomes a set of tuples of the form $(s, g, u_c)$. Then, during training of the RL agent, given the state $s_t$ and the sub-goal $g_t$ from the manager, we find the $k$ nearest neighbors of $(s_t, g_t)$ in the offline dataset and obtain $k$ reference responses. Finally, we compute the average BLEU score between our generated $u_{c_t}$ and the $k$ references as the reward $r_t^w$.
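A rough sketch of this worker-reward construction, assuming the offline (state, sub-goal) pairs have already been embedded as vectors; the brute-force retrieval and the smoothed BLEU variant are our choices for illustration, not details specified by the paper.

```python
import numpy as np
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def worker_reward(query_vec, generated_tokens, offline_vecs, offline_refs, k=5):
    """r_t^w: average BLEU between the generated response and the references
    attached to the k nearest offline (state, sub-goal) neighbors.

    query_vec    -- vector encoding of (s_t, g_t)
    offline_vecs -- (N, d) array of encodings of offline (s, g) pairs
    offline_refs -- list of N tokenized reference responses u_c
    """
    dists = np.linalg.norm(offline_vecs - query_vec, axis=1)   # brute-force kNN
    nearest = np.argsort(dists)[:k]
    smooth = SmoothingFunction().method1
    scores = [sentence_bleu([offline_refs[i]], generated_tokens,
                            smoothing_function=smooth)
              for i in nearest]
    return float(np.mean(scores))
```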
2.3 End-to-end training of the chatbot
In this part, we describe the training method of our framework.

To accelerate learning, we first use a pre-training step to warm-start the agent. Recall that we have generated offline sub-goals for each source-target pair, so we can pretrain the manager and the worker with supervised Seq2Seq learning.

Then, we use a simulation-based environment to optimize the goal-oriented chatbot end-to-end. Following previous work on model-based RL [10] and RL for dialogue [11], we build the interactive simulator with a user model. Each conversation is initialized with starting utterances randomly sampled from the training set; our agent then chats with a learned user model. The user model has the same network structure as our agent, is trained with supervised Seq2Seq learning on the dataset to predict the human response, and is fixed after training. We limit the simulated multi-turn dialogues to $m$ turns ($2m$ utterances), where $m$ depends on the practical application scenario, to prevent the agents from going off-topic in overlong conversations.

During training of the HRL agent, we simultaneously optimize the manager and the worker to maximize the following objective function

$$J(\pi_\psi, \pi_\theta) = \mathbb{E}_{g_t \sim \pi_\psi(\cdot \mid s_t),\, u_{c_t} \sim \pi_\theta(\cdot \mid s_t, g_t)} \Big[ \sum_{t=1}^{T} \big(\alpha r_t + \beta r_t^w\big) \Big], \quad (3)$$

where the expectation is over conversations induced by the policies, and $\alpha, \beta$ are hyper-parameters balancing the two levels of rewards.

To maximize $J(\pi_\psi, \pi_\theta)$, we use the Advantage Actor-Critic (A2C) [8] algorithm, where the policies are updated according to the estimated advantage function of the chosen actions. For example, for the manager, $\hat{A}(s_t, g_t) = R_t(s_t, g_t) - \hat{V}_\phi(s_t)$, where $R_t = \sum_{\tau=t}^{T} r_\tau$ is the long-term return and $\hat{V}_\phi(s_t)$ is the estimated value function of state $s_t$ with parameters $\phi$. The surrogate loss for policy learning is then

$$\mathcal{L}_\pi = -\hat{\mathbb{E}}_t \Big[ \sum_{t=1}^{T} \big( \alpha \hat{A}_t \log \pi_\psi(g_t \mid s_t) + \beta \hat{A}_t^w \log \pi_\theta(u_{c_t} \mid s_t, g_t) \big) \Big], \quad (4)$$

where $\hat{\mathbb{E}}_t[\cdot]$ denotes the sample mean over rollout dialogues induced by the policies.

We adopt the hierarchical attention network (HAN) as the value network $\hat{V}_\phi(s_t)$, which is trained to minimize the following squared error according to the Bellman equation

$$\mathcal{L}_v = \frac{1}{2} \Big( r_{t+1} + \gamma V_\phi^{\mathrm{targ}}(s_{t+1}) - V_\phi(s_t) \Big)^2, \quad (5)$$

where $V_\phi^{\mathrm{targ}}$ is the target value network of A2C and $\gamma = 0.99$ is the discount factor. Finally, we have the unified loss $\mathcal{L} = \mathcal{L}_\pi + \mathcal{L}_v$.
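For concreteness, a minimal sketch of how the losses in Eqs. (4)-(5) could be assembled from one simulated dialogue. This is our illustration under simplifying assumptions: the paper does not spell out how $\hat{A}_t^w$ is computed, so a baseline-subtracted worker reward is used here as one plausible choice, and `rewards[t]` is taken to be the reward received after acting in $s_t$.

```python
import torch

def gochat_losses(log_pi_g, log_pi_u, rewards, worker_rewards,
                  values, target_values, alpha=1.0, beta=0.001, gamma=0.99):
    """Policy and value losses for one rollout of T turns.

    log_pi_g, log_pi_u -- (T,) log-probs of the manager / worker actions
    rewards            -- (T,) environment rewards (here +-1 only at the last turn)
    worker_rewards     -- (T,) kNN-BLEU rewards r_t^w from the manager
    values             -- (T,) V_phi(s_t) from the HAN critic
    target_values      -- (T+1,) V_phi^targ(s_t), with a trailing 0 after the last state
    """
    # Monte-Carlo returns R_t = sum_{tau >= t} r_tau and advantages A_t = R_t - V(s_t)
    returns = torch.flip(torch.cumsum(torch.flip(rewards, [0]), 0), [0])
    adv_manager = (returns - values).detach()
    # A_t^w is not fully specified in the text; reward minus baseline is one option.
    adv_worker = (worker_rewards - values).detach()

    policy_loss = -(alpha * adv_manager * log_pi_g
                    + beta * adv_worker * log_pi_u).sum()          # Eq. (4)
    td_target = rewards + gamma * target_values[1:]                # Bellman target
    value_loss = 0.5 * ((td_target - values) ** 2).sum()           # Eq. (5)
    return policy_loss + value_loss                                # unified loss L
```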
3 EXPERIMENTS

3.1 Experimental set-up
We conduct experiments on an anti-fraud dialogue system of a large-scale FinTech platform. Our task is to train a chatbot that assists security officers in conversing with suspected scammers. The goal of the chatbot is to find evidence of the scammers' fraud, e.g., to learn their account IDs so that the security officers can take further investigation. We collect a dataset of over four thousand historical dialogues between suspected scammers and security officers, where each dialogue has about 20 turns on average and is labeled only with a binary outcome indicating whether the goal is accomplished. The data is anonymized and contains no personal privacy information.

Given this dataset, we compare GoChat against state-of-the-art multi-turn dialogue methods, including HRED [12], VHRED [13], WSeq [14], HRAN [15], and ReCoSa [18]. We do not compare with other task-oriented approaches, which require detailed annotations on each query-response pair or on dialogue states.

3.2 Implementation details and evaluation
In our experiments, we limit the maximum length of a dialogue to $m = 20$ turns (40 utterances) and adopt 500-dimensional word embeddings for all methods. The hidden sizes of the word-level and dialogue-level encoders in HAN (for both the manager and the critic) are 500 and 50, respectively. The configuration of the worker follows VHRED [13]. We first pretrain the worker on the $(s_t, g_t, u_{c_t})$ tuples described in Section 2.2.2 with supervised Seq2Seq learning, where the offline sub-goal $g_t$ is given by Latent Dirichlet Allocation (LDA) with 14 categories (sub-goals). We set the hyper-parameters in Eq. 3 to $\alpha = 1$ and $\beta = 0.001$. The agent is optimized with Adam [3] and trained for 10 epochs with a learning rate of 0.001.
We adopt human evaluation to examine how well the generated utterances serve the goal. In detail, eight annotators are asked to compare the simulated responses of GoChat and of each baseline given the same context. Annotators give a tie if our method and the baseline both fail or both succeed in generating an utterance beneficial for achieving the goal; otherwise they annotate a win or a lose. We also use BLEU and distinct [5] to evaluate the quality and the diversity of the generated responses, respectively.
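For reference, a small sketch of the distinct-n diversity metric [5] as it is commonly computed (the ratio of unique n-grams to all generated n-grams); minor normalization details may differ from the original evaluation.

```python
def distinct_n(responses, n=1):
    """Distinct-n: number of unique n-grams divided by the total number of
    n-grams over all generated responses (each response is a token list)."""
    total, unique = 0, set()
    for tokens in responses:
        ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
        total += len(ngrams)
        unique.update(ngrams)
    return len(unique) / total if total > 0 else 0.0

# distinct_1 = distinct_n(generated, n=1); distinct_2 = distinct_n(generated, n=2)
```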
Table 1: Quantitative results. D-1 and D-2 stand for Distinct-1 and Distinct-2; Win/Tie/Lose report human evaluation versus GoChat.

Method   Win(%)   Tie(%)   Lose(%)   BLEU   D-1     D-2
HRED     17.3     30.2     52.5      9.21   0.051   0.407
VHRED    22.6     40.0     37.4      9.32   0.053   0.419
WSeq      9.5     15.3     75.2      6.43   0.047   0.372
HRAN     14.9     12.8     72.3      7.18   0.048   0.384
ReCoSa   22.1     32.6     45.3      9.01   0.050   0.415
GoChat    -        -        -        9.70   0.061   0.479
Table 2: Case study: comparison of generated dialogues from our GoChat and the most competitive baseline, VHRED.

Case (A)
Human (Scammer): 亲,您好!欢迎您来到企业招聘兼职中心,请问您是否需要兼职? (Welcome to the Enterprise Part-Time Recruitment Center. Do you need a part-time job?)
VHRED (Officer): 是的 (Yes.)
GoChat (Officer): 需要。有什么兼职可以做呢? (Yes, I need. What kinds of part-time jobs do you have?)

Case (B)
Human (Scammer): 你是谁推荐来贷款的? (Who recommended you here for the loan?)
VHRED (Officer): 从一个朋友那里 (From one of my friends.)
GoChat (Officer): 就是打游戏的时候,人家说可以办理贷款的。请问可以办理嘛? (When I was playing a game, someone said I could apply for a loan. Can I?)

3.3 Quantitative results
We compare our model with five baselines for multi-turn dialogue and report the quantitative indicators and human evaluation in Table 1. In terms of the generation indicators, our proposed GoChat obtains the highest scores on BLEU (9.70), distinct-1 (0.061), and distinct-2 (0.479), which demonstrates that GoChat generates higher-quality and more varied utterances than the baseline methods. From the human evaluation, we observe that GoChat outperforms all the baselines in generating utterances for achieving the goal. Specifically, GoChat is preferred over HRED, VHRED, WSeq, HRAN, and ReCoSa by margins of 35.2%, 14.8%, 65.7%, 57.4%, and 23.2%, respectively. We conjecture that it is the simultaneous training of the high-level and low-level policies that lets GoChat better model the long-term rewards towards the goals while producing high-quality responses.

3.4 Qualitative results
According to Table 1, VHRED is the most competitive baseline. For comparison, we present two representative cases in Table 2. In these cases, both methods generate appropriate responses to the scammer's inquiries given the same dialogue contexts. However, our model is more likely to raise questions and to generate more diverse sentences. Specifically, in both cases the utterances from GoChat, e.g., "What kinds of part-time jobs do you have?" and "Can I?", are more proactive; such responses are more likely to advance the ultimate goal in the long run. Case (B) further shows that GoChat can generate more diverse and human-like responses, e.g., "When I was playing a game", which is convincing and may help dispel the scammer's doubts in anti-fraud dialogue tasks.

4 CONCLUSION
In this paper, we propose GoChat, an end-to-end framework for goal-oriented dialogue systems based on hierarchical reinforcement learning that requires no cumbersome hand-crafting. It uses a high-level policy to determine sub-goals that guide the conversation towards the final goal, and a low-level worker to generate the corresponding responses. Experiments conducted on a real-world dialogue dataset verify the effectiveness of GoChat.

ACKNOWLEDGEMENT
The research work is supported by the National Natural Science Foundation of China under Grants No. 61976204 and U1811461, and by the Project of Youth Innovation Promotion Association CAS. This work is also funded in part by Ant Financial through the Ant Financial Science Funds for Security Research.

REFERENCES
[1] He He, Derek Chen, Anusha Balakrishnan, and Percy Liang. 2018. Decoupling Strategy and Generation in Negotiation Dialogues. In EMNLP. 2333–2343.
[2] Dongyeop Kang, Anusha Balakrishnan, Pararth Shah, Paul A. Crook, Y-Lan Boureau, and Jason Weston. 2019. Recommendation as a Communication Game: Self-Supervised Bot-Play for Goal-oriented Dialogue. In EMNLP. 1951–1961.
[3] Diederik P. Kingma and Jimmy Ba. 2014. Adam: A Method for Stochastic Optimization. arXiv preprint arXiv:1412.6980 (2014).
[4] Mike Lewis, Denis Yarats, Yann Dauphin, Devi Parikh, and Dhruv Batra. 2017. Deal or No Deal? End-to-End Learning of Negotiation Dialogues. In EMNLP. 2443–2453.
[5] Jiwei Li, Michel Galley, Chris Brockett, Jianfeng Gao, and Bill Dolan. 2016. A Diversity-Promoting Objective Function for Neural Conversation Models. In NAACL-HLT. Association for Computational Linguistics, 110–119.
[6] Bing Liu and Ian Lane. 2017. Iterative Policy Learning in End-to-End Trainable Task-Oriented Neural Dialog Models. In ASRU. IEEE, 482–489.
[7] Ling Luo, Xiang Ao, Feiyang Pan, Jin Wang, Tong Zhao, Ningzi Yu, and Qing He. 2018. Beyond Polarity: Interpretable Financial Sentiment Analysis with Hierarchical Query-driven Attention. In IJCAI. 4244–4250.
[8] Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. 2016. Asynchronous Methods for Deep Reinforcement Learning. In ICML. 1928–1937.
[9] Feiyang Pan, Qingpeng Cai, Pingzhong Tang, Fuzhen Zhuang, and Qing He. 2019. Policy Gradients for Contextual Recommendations. In WWW. 1421–1431.
[10] Feiyang Pan, Qingpeng Cai, An-Xiang Zeng, Chun-Xiang Pan, Qing Da, Hualin He, Qing He, and Pingzhong Tang. 2019. Policy Optimization with Model-Based Explorations. In AAAI, Vol. 33. 4675–4682.
[11] Abdelrhman Saleh, Natasha Jaques, Asma Ghandeharioun, Judy Hanwen Shen, and Rosalind Picard. 2019. Hierarchical Reinforcement Learning for Open-Domain Dialog. arXiv preprint arXiv:1909.07547 (2019).
[12] Iulian V. Serban, Alessandro Sordoni, Yoshua Bengio, Aaron Courville, and Joelle Pineau. 2016. Building End-to-End Dialogue Systems Using Generative Hierarchical Neural Network Models. In AAAI.
[13] Iulian Vlad Serban, Alessandro Sordoni, Ryan Lowe, Laurent Charlin, Joelle Pineau, Aaron Courville, and Yoshua Bengio. 2017. A Hierarchical Latent Variable Encoder-Decoder Model for Generating Dialogues. In AAAI. 3295–3301.
[14] Zhiliang Tian, Rui Yan, Lili Mou, Yiping Song, Yansong Feng, and Dongyan Zhao. 2017. How to Make Context More Useful? An Empirical Study on Context-Aware Neural Conversational Models. In ACL (Volume 2: Short Papers). 231–236.
[15] Chen Xing, Yu Wu, Wei Wu, Yalou Huang, and Ming Zhou. 2018. Hierarchical Recurrent Attention Network for Response Generation. In AAAI.
[16] Zhao Yan, Nan Duan, Peng Chen, Ming Zhou, Jianshe Zhou, and Zhoujun Li. 2017. Building Task-Oriented Dialogue Systems for Online Shopping. In AAAI.
[17] Zichao Yang, Diyi Yang, Chris Dyer, Xiaodong He, Alex Smola, and Eduard Hovy. 2016. Hierarchical Attention Networks for Document Classification. In NAACL. 1480–1489.
[18] Hainan Zhang, Yanyan Lan, Liang Pang, Jiafeng Guo, and Xueqi Cheng. 2019. ReCoSa: Detecting the Relevant Contexts with Self-Attention for Multi-turn Dialogue Generation. In ACL. 3721–3730.
