
Asymmetric Non-local Neural Networks for Semantic Segmentation

Zhen Zhu1∗, Mengde Xu1∗, Song Bai2, Tengteng Huang1, Xiang Bai1†
1 Huazhong University of Science and Technology, 2 University of Oxford
{zzhu, mdxu, huangtengtng, xbai}@hust.edu.cn, songbai.site@gmail.com

Abstract

The non-local module works as a particularly useful technique for semantic segmentation while criticized for its prohibitive computation and GPU memory occupation. In this paper, we present Asymmetric Non-local Neural Network for semantic segmentation, which has two prominent components: Asymmetric Pyramid Non-local Block (APNB) and Asymmetric Fusion Non-local Block (AFNB). APNB leverages a pyramid sampling module into the non-local block to largely reduce the computation and memory consumption without sacrificing the performance. AFNB is adapted from APNB to fuse the features of different levels under a sufficient consideration of long range dependencies and thus considerably improves the performance. Extensive experiments on semantic segmentation benchmarks demonstrate the effectiveness and efficiency of our work. In particular, we report the state-of-the-art performance of 81.3 mIoU on the Cityscapes test set. For a 256 × 128 input, APNB is around 6 times faster than a non-local block on GPU while 28 times smaller in GPU running memory occupation. Code is available at: https://github.com/MendelXu/ANN.git.

Figure 1: Architecture of a standard non-local block (a) and the asymmetric non-local block (b). Both blocks build query, key and value branches with 1 × 1 convolutions; in (b) the key and value branches are additionally sampled before the softmax. N = H · W while S ≪ N.

Figure 2: GPU time (≥ 1 ms) comparison of different operations (Matmul, Convolution, Softmax, Batchnorm, Pool, Total) between a generic non-local block and our APNB. The last bin denotes the sum of all the time costs. The size of the inputs for these two blocks is 256 × 128.
∗ Equal contribution    † Corresponding author

1. Introduction

Semantic segmentation is a long-standing challenging task in computer vision, aiming to predict pixel-wise semantic labels in an image accurately. This task is exceptionally important to many real-world applications, such as autonomous driving [27, 28], medical diagnosis [52, 53], etc. In recent years, the development of deep neural networks has encouraged the emergence of a series of works [1, 5, 18, 26, 40, 43, 47]. Shelhamer et al. [26] proposed the seminal work called the Fully Convolutional Network (FCN), which discarded the fully connected layer to support inputs of arbitrary sizes. Since then, a lot of works [5, 18] were inspired to incorporate FCN techniques into deep neural networks. Nonetheless, the segmentation accuracy is still far from satisfactory.

Some recent studies [20, 33, 47] indicate that the performance could be improved by making sufficient use of long range dependencies. However, models that solely rely on convolutions exhibit limited ability in capturing these long range dependencies. A possible reason is that the receptive field of a single convolutional layer is inadequate to cover correlated areas. Choosing a big kernel or composing a very deep network is able to enlarge the receptive field. However, such strategies require extensive computation and parameters, thus being very inefficient [44]. Consequently, several works [33, 47] resort to global operations like non-local means [2] and spatial pyramid pooling [12, 16].

In [33], Wang et al. combined CNNs and traditional non-local means [2] to compose a network module named the non-local block in order to leverage features from all locations in an image. This module improves the performance of existing methods [33]. However, the prohibitive computational cost and vast GPU memory occupation hinder its usage in many real applications. The architecture of a common non-local block [33] is depicted in Fig. 1(a). The block first calculates the similarities of all locations between each other, requiring a matrix multiplication of computational complexity O(CH^2W^2), given an input feature map of size C × H × W. Then it requires another matrix multiplication of computational complexity O(CH^2W^2) to gather the influence of all locations onto themselves. Given the high complexity brought by the matrix multiplications, in this work we are interested in whether there are efficient ways to solve this without sacrificing the performance.

We notice that as long as the outputs of the key branch and the value branch hold the same size, the output size of the non-local block remains unchanged. Considering this, if we could sample only a few representative points from the key branch and the value branch, the time complexity could be significantly decreased without sacrificing the performance. This motivation is demonstrated in Fig. 1 by changing a large value N in the key branch and value branch to a much smaller value S (from (a) to (b)).

In this paper, we propose a simple yet effective non-local module called Asymmetric Pyramid Non-local Block (APNB) to decrease the computation and GPU memory consumption of the standard non-local module [33] with applications to semantic segmentation. Motivated by the spatial pyramid pooling [12, 16, 47] strategy, we propose to embed a pyramid sampling module into non-local blocks, which could largely reduce the computation overhead of matrix multiplications yet provide substantial semantic feature statistics. This spirit is also related to the sub-sampling tricks of [33] (e.g., max pooling). Our experiments suggest that APNB yields much better performance than those sub-sampling tricks with a decent decrease of computation. To better illustrate the boosted efficiency, we compare the GPU times of APNB and a standard non-local block in Fig. 2, averaging the running time of 10 different runs with the same configuration. Our APNB largely reduces the time cost of matrix multiplications, thus being nearly 6 times faster than a non-local block.

Besides, we also adapt APNB to fuse the features of different stages of a deep network, which brings a considerable improvement over the baseline model. We call the adapted block the Asymmetric Fusion Non-local Block (AFNB). AFNB calculates the correlations between every pixel of the low-level and high-level feature maps, yielding a fused feature with long range interactions. Our network is built on a standard ResNet-FCN model by integrating APNB and AFNB together.

The efficacy of the proposed network is evaluated on Cityscapes [9], ADE20K [50] and PASCAL Context [21], achieving state-of-the-art performance of 81.3%, 45.24% and 52.8%, respectively. In terms of time and space efficiency, APNB is around 6 times faster than a non-local block on a GPU while 28 times smaller in GPU running memory occupation for a 256 × 128 input.

2. Related Work

In this section, we briefly review related works on semantic segmentation or scene parsing. Recent advances focus on exploring context information and can be roughly categorized into five directions:

Encoder-Decoder. An encoder generally reduces the spatial size of feature maps to enlarge the receptive field. Then the encoded codes are fed to the decoder, which is responsible for recovering the spatial size of the prediction maps. Long et al. [26] and Noh et al. [22] used deconvolutions to perform the decoding pass. Ronneberger et al. [25] introduced skip-connections to bridge the encoding features to their corresponding decoding features, which could enrich the segmentation output with more details. Zhang et al. [43] introduced a context encoding module to predict semantic category importance and selectively strengthen or weaken class-specific feature maps.

CRF. As a frequently-used operation that could leverage context information in machine learning, the Conditional Random Field [15] meets its new opportunity in combining with CNNs for semantic segmentation [4, 5, 6, 49, 31]. CRF-CNN [49] adopted this strategy, making the deep network end-to-end trainable. Chandra et al. [4] and Vemulapalli et al. [31] integrated Gaussian Conditional Random Fields into CNNs and achieved relatively good results.

Different Convolutions. Chen et al. [5, 6] and Yu et al. [40] adapted generic convolutions to dilated ones, making the networks sensitive to global context semantics and thus improving the performance. Peng et al. [24] found that large kernel convolutions help relieve the contradiction between classification and localization in segmentation.

Spatial Pyramid Pooling. Inspired by the success of spatial pyramid pooling in object detection [12], Chen et al. [6] replaced the pooling layers with dilated convolutions of different sampling weights and built an Atrous Spatial Pyramid Pooling layer (ASPP) to account for multiple scales explicitly. Chen et al. [8] further combined ASPP and the encoder-decoder architecture to leverage the advantages of both and boost the performance considerably. Drawing inspiration from [16], PSPNet [47] conducted spatial pyramid pooling after a specific layer to embed context features of different scales into the networks. Recently, Yang et al. [36] pointed out that the ASPP layer has a restricted receptive field and adapted ASPP to a densely connected version, which helps to overcome such a limitation.
Figure 3: Overview of the proposed Asymmetric Non-local Neural Network (backbone Stage4 and Stage5 features are fused by AFNB, refined by APNB and passed to the classifier; the query/key/value branches use 1 × 1 convolutions and pyramid pooling, and || denotes concatenation). In our implementation, the key branch and the value branch in APNB share the same 1 × 1 convolution and sampling module, which decreases the number of parameters and computation without sacrificing the performance.

Non-local Network. Recently, researchers [20, 33, 47] noticed that skillfully leveraging the long range dependencies brings great benefits to semantic segmentation. Wang et al. [33] proposed a non-local block module combining non-local means with deep networks and showcased its efficacy for segmentation.

Different from these works, our network uniquely incorporates pyramid sampling strategies with non-local blocks to capture the semantic statistics of different scales with only a minor budget of computation, while maintaining the excellent performance of the original non-local modules.

3. Asymmetric Non-local Neural Network

In this section, we first revisit the definition of the non-local block [33] in Sec. 3.1, then detail the proposed Asymmetric Pyramid Non-local Block (APNB) and Asymmetric Fusion Non-local Block (AFNB) in Sec. 3.2 and Sec. 3.3, respectively. While APNB aims to decrease the computational overhead of non-local blocks, AFNB improves the learning capacity of non-local blocks, thereby improving the segmentation performance.

3.1. Revisiting Non-local Block

A typical non-local block [33] is shown in Fig. 1. Consider an input feature X ∈ R^{C×H×W}, where C, W, and H indicate the channel number, spatial width and height, respectively. Three 1 × 1 convolutions W_φ, W_θ, and W_γ are used to transform X into different embeddings φ ∈ R^{Ĉ×H×W}, θ ∈ R^{Ĉ×H×W} and γ ∈ R^{Ĉ×H×W} as

  φ = W_φ(X), θ = W_θ(X), γ = W_γ(X),   (1)

where Ĉ is the channel number of the new embeddings. Next, the three embeddings are flattened to size Ĉ × N, where N represents the total number of spatial locations, that is, N = H · W. Then, the similarity matrix V ∈ R^{N×N} is calculated by a matrix multiplication as

  V = φ^T × θ.   (2)

Afterward, a normalization is applied to V to get a unified similarity matrix

  Ṽ = f(V).   (3)

According to [33], the normalizing function f can take the form of softmax, rescaling, or none. We choose softmax here, which is equivalent to the self-attention mechanism and proved to work well in many tasks such as machine translation [30] and image generation [44]. For every location in γ, the output of the attention layer is

  O = Ṽ × γ^T,   (4)

where O ∈ R^{N×Ĉ}. By referring to the design of the non-local block, the final output is given by

  Y = W_o(O^T) + X   or   Y = cat(W_o(O^T), X),   (5)

where W_o, also implemented by a 1 × 1 convolution, acts as a weighting parameter to adjust the importance of the non-local operation w.r.t. the original input X and, moreover, recovers the channel dimension from Ĉ to C.
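To make the computation above concrete, the following is a minimal PyTorch sketch of such a standard non-local block (our own illustrative code, not the authors' released implementation), assuming the concatenation variant of Eq. (5); the name key_channels corresponds to Ĉ.

import torch
import torch.nn as nn


class NonLocalBlock(nn.Module):
    """Standard non-local block, Eqs. (1)-(5), concatenation variant of Eq. (5)."""

    def __init__(self, in_channels: int = 2048, key_channels: int = 256):
        super().__init__()
        # W_phi, W_theta, W_gamma in Eq. (1): 1x1 convolutions C -> C_hat.
        self.w_phi = nn.Conv2d(in_channels, key_channels, kernel_size=1)
        self.w_theta = nn.Conv2d(in_channels, key_channels, kernel_size=1)
        self.w_gamma = nn.Conv2d(in_channels, key_channels, kernel_size=1)
        # W_o in Eq. (5): recovers the channel dimension C_hat -> C.
        self.w_o = nn.Conv2d(key_channels, in_channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        n = h * w
        # Eq. (1) plus flattening to C_hat x N.
        phi = self.w_phi(x).view(b, -1, n)      # B x C_hat x N
        theta = self.w_theta(x).view(b, -1, n)  # B x C_hat x N
        gamma = self.w_gamma(x).view(b, -1, n)  # B x C_hat x N
        # Eq. (2): V = phi^T x theta, an N x N similarity matrix.
        v = torch.bmm(phi.transpose(1, 2), theta)            # B x N x N
        # Eq. (3): softmax normalization.
        v = torch.softmax(v, dim=-1)
        # Eq. (4): O = V x gamma^T, giving an N x C_hat output.
        o = torch.bmm(v, gamma.transpose(1, 2))               # B x N x C_hat
        o = o.transpose(1, 2).reshape(b, -1, h, w)            # back to B x C_hat x H x W
        # Eq. (5), concatenation variant: cat(W_o(O^T), X).
        return torch.cat([self.w_o(o), x], dim=1)

With C = 2048 and Ĉ = 256 (the values used later in Tab. 1), the two torch.bmm calls are exactly the O(ĈN^2) multiplications of Eq. (2) and Eq. (4).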
3.2. Asymmetric Pyramid Non-local Block

The non-local network is potent to capture the long range dependencies that are crucial for semantic segmentation. However, the non-local operation is very time and memory consuming compared to normal operations in a deep neural network, e.g., convolutions and activation functions.

Motivation and Analysis. By inspecting the general computing flow of a non-local block, one could clearly find that Eq. (2) and Eq. (4) dominate the computation. The time complexities of the two matrix multiplications are both O(ĈN^2) = O(ĈH^2W^2). In semantic segmentation, the output of the network usually has a large resolution to retain detailed semantic features [6, 47]. That means N is large (for example, in our training phase, N = 96 × 96 = 9216). Hence, the large matrix multiplications are the main cause of the inefficiency of a non-local block (see our statistics in Fig. 2). A more straightforward pipeline is given as

  R^{N×Ĉ} × R^{Ĉ×N} → R^{N×N} × R^{N×Ĉ} → R^{N×Ĉ},   (6)

where the first multiplication corresponds to Eq. (2) and the second to Eq. (4). We hold a key yet intuitive observation that by changing N to another number S (S ≪ N), the output size will remain the same, as

  R^{N×Ĉ} × R^{Ĉ×S} → R^{N×S} × R^{S×Ĉ} → R^{N×Ĉ}.   (7)

Returning to the design of the non-local block, changing N to a small number S is equivalent to sampling several representative points from θ and γ instead of feeding all the spatial points, as illustrated in Fig. 1. Consequently, the computational complexity could be considerably decreased.

Solution. Based on the above observation, we propose to add sampling modules P_θ and P_γ after θ and γ to sample several sparse anchor points denoted as θ_P ∈ R^{Ĉ×S} and γ_P ∈ R^{Ĉ×S}, where S is the number of sampled anchor points. Mathematically, this is computed by

  θ_P = P_θ(θ), γ_P = P_γ(γ).   (8)

The similarity matrix V_P between φ and the anchor points θ_P is thus calculated by

  V_P = φ^T × θ_P.   (9)

Note that V_P is an asymmetric matrix of size N × S. V_P then goes through the same normalizing function as a standard non-local block, giving the unified similarity matrix Ṽ_P. The attention output is acquired by

  O_P = Ṽ_P × γ_P^T,   (10)

where the output is of the same size as that of Eq. (4). Following non-local blocks, the final output Y_P is given as

  Y_P = cat(W_o(O_P^T), X).   (11)

Figure 4: Demonstration of the pyramid max or average sampling process (the input feature map passes through four pooling layers, Pool1–Pool4, whose outputs are flattened and concatenated).

The time complexity of such an asymmetric matrix multiplication is only O(ĈNS), significantly lower than O(ĈN^2) in a standard non-local block. It is ideal that S should be much smaller than N. However, it is hard to ensure that when S is small, the performance would not drop too much in the meantime.

As discovered by previous works [16, 47], global and multi-scale representations are useful for categorizing scene semantics. Such representations can be comprehensively carved by Spatial Pyramid Pooling [16], which contains several pooling layers with different output sizes in parallel. In addition to this virtue, spatial pyramid pooling is also parameter-free and very efficient. Therefore, we embed pyramid pooling in the non-local block to enhance the global representations while reducing the computational overhead.

By doing so, we now arrive at the final formulation of the Asymmetric Pyramid Non-local Block (APNB), as given in Fig. 3. As can be seen, our APNB derives from the design of a standard non-local block [33]. A vital change is to add a spatial pyramid pooling module after θ and γ respectively to sample representative anchors. This sampling process is clearly depicted in Fig. 4, where several pooling layers are applied after θ or γ and then the four pooling results are flattened and concatenated to serve as the input to the next layer. We denote the spatial pyramid pooling modules as P_θ^n and P_γ^n, where the superscript n means the width (or height) of the output size of the pooling layer (empirically, the width is equal to the height). In our model, we set n ∈ {1, 3, 6, 8}. Then the total number of the anchor points is

  S = Σ_{n∈{1,3,6,8}} n^2 = 110.   (12)

As a consequence, the complexity of our asymmetric matrix multiplication is only

  T = S / N   (13)

times the complexity of the non-local matrix multiplication. When H = 128 and W = 256, the asymmetric matrix multiplication saves us (256 × 128) / 110 ≈ 298 times of computation (see the results in Fig. 2). Moreover, the spatial pyramid pooling gives sufficient feature statistics about the global scene semantic cues to remedy the potential performance deterioration caused by the decreased computation. We will analyze this in our experiments.
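The sampling and the asymmetric multiplications can be sketched as below (again an illustrative sketch rather than the released code), assuming adaptive average pooling with output sizes {1, 3, 6, 8} and, as noted in Sec. 3.4, a single shared 1 × 1 convolution and sampling module for the key and value branches.

import torch
import torch.nn as nn


class PyramidSampling(nn.Module):
    """P_theta / P_gamma in Eq. (8): pool to several sizes, flatten and concatenate."""

    def __init__(self, output_sizes=(1, 3, 6, 8)):
        super().__init__()
        self.pools = nn.ModuleList([nn.AdaptiveAvgPool2d(s) for s in output_sizes])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        # Each pooled map is flattened to C_hat x n^2; S = sum of n^2 = 110 for (1, 3, 6, 8).
        return torch.cat([p(x).view(b, c, -1) for p in self.pools], dim=2)  # B x C_hat x S


class APNB(nn.Module):
    """Asymmetric Pyramid Non-local Block, Eqs. (8)-(11)."""

    def __init__(self, in_channels: int = 2048, key_channels: int = 256):
        super().__init__()
        self.w_phi = nn.Conv2d(in_channels, key_channels, kernel_size=1)
        # Key and value branches share one 1x1 convolution and one sampling module.
        self.w_theta_gamma = nn.Conv2d(in_channels, key_channels, kernel_size=1)
        self.sample = PyramidSampling((1, 3, 6, 8))
        self.w_o = nn.Conv2d(key_channels, in_channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        n = h * w
        phi = self.w_phi(x).view(b, -1, n)                    # B x C_hat x N
        theta_p = self.sample(self.w_theta_gamma(x))          # B x C_hat x S   (Eq. 8)
        gamma_p = theta_p                                     # shared key/value anchors
        # Eq. (9): V_P = phi^T x theta_P is only N x S instead of N x N.
        v_p = torch.softmax(torch.bmm(phi.transpose(1, 2), theta_p), dim=-1)  # B x N x S
        # Eq. (10): O_P = V_P x gamma_P^T.
        o_p = torch.bmm(v_p, gamma_p.transpose(1, 2))         # B x N x C_hat
        o_p = o_p.transpose(1, 2).reshape(b, -1, h, w)
        # Eq. (11): concatenate with the input.
        return torch.cat([self.w_o(o_p), x], dim=1)

Here S = 1 + 9 + 36 + 64 = 110, so the two batched multiplications cost O(ĈNS) instead of O(ĈN^2), in line with Eq. (13).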
tic segmentation and object tracking as hinted in [16, 18,
where the output is in the same size as that of Eq. (4). Fol-
26, 41, 46, 51]. Common fusing operations such as addi-
lowing non-local blocks, the final output YP is given as
tion/concatenation, are conducted in a pixel-wise and local
YP = cat(Wo (OP T ), X). (11) manner. We provide an alternative that leverages long range

4
We provide an alternative that leverages long range dependencies through a non-local block to fuse multi-level features, called the Fusion Non-local Block (FNB).

A standard non-local block only has one input source, while FNB has two: a high-level feature map X_h ∈ R^{C_h×N_h} and a low-level feature map X_l ∈ R^{C_l×N_l}. N_h and N_l are the numbers of spatial locations of X_h and X_l, respectively. C_h and C_l are the channel numbers of X_h and X_l, respectively. Likewise, 1 × 1 convolutions W_φh, W_θl and W_γl are used to transform X_h and X_l to embeddings φ_h ∈ R^{Ĉ×N_h}, θ_l ∈ R^{Ĉ×N_l} and γ_l ∈ R^{Ĉ×N_l} as

  φ_h = W_φh(X_h), θ_l = W_θl(X_l), γ_l = W_γl(X_l).   (14)

Then, the similarity matrix V_F ∈ R^{N_h×N_l} between φ_h and θ_l is computed by a matrix multiplication

  V_F = φ_h^T × θ_l.   (15)

We also put a normalization upon V_F, resulting in a unified similarity matrix Ṽ_F ∈ R^{N_h×N_l}. Afterward, we integrate Ṽ_F with γ_l through a similar matrix multiplication as in Eq. (4) and Eq. (10), written as

  O_F = Ṽ_F × γ_l^T.   (16)

The output O_F ∈ R^{N_h×Ĉ} reflects the bonus of X_l to X_h, which is carefully selected from all locations in X_l. Likewise, O_F is fed to a 1 × 1 convolution to recover the channel number to C_h. Finally, we have the output as

  Y_F = cat(W_o(O_F^T), X_h).   (17)

Similar to the adaption of APNB w.r.t. the generic non-local block, incorporating spatial pyramid pooling into FNB derives an efficient Asymmetric Fusion Non-local Block (AFNB), as illustrated in Fig. 3. Inheriting the advantages of APNB, AFNB is more efficient than FNB without sacrificing the performance.

3.4. Network Architecture

The overall architecture of our network is depicted in Fig. 3. We choose ResNet-101 [13] as our backbone network, following the choice of most previous works [38, 47, 48]. We remove the last two down-sampling operations and use dilated convolutions instead to hold the feature maps of the last two stages¹ at 1/8 of the input image. Concretely, all the feature maps in the last three stages have the same spatial size. According to our experimental trials, we fuse the features of Stage4 and Stage5 using AFNB. The fused features are thereupon concatenated with the feature maps after Stage5, avoiding situations in which AFNB could not produce accurately strengthened features, particularly when the training just begins, which would degrade the overall performance. Such features, full of rich long range cues from different feature levels, serve as the input to APNB, which then helps to discover the correlations among pixels. As done for AFNB, the output of APNB is also concatenated with its input source. Note that in our implementation of APNB, W_θ and W_γ share parameters in order to save parameters and computation, following the design of [42]. This design does not decrease the performance of APNB. Finally, a classifier follows to produce channel-wise semantic maps that later receive their supervision from the ground truth maps. Note that we also add another supervision to Stage4 following the settings of [47], as it is beneficial to improve the performance.

¹ We refer to the stage with original feature map size 1/16 as Stage4 and the stage with size 1/32 as Stage5.

4. Experiments

To evaluate our method, we carry out detailed experiments on three semantic segmentation datasets: Cityscapes [9], ADE20K [50] and PASCAL Context [21]. We provide more competitive results on NYUD-V2 [29] and COCO-Stuff-10K [3] in the supplementary materials.

4.1. Datasets and Evaluation Metrics

Cityscapes [9] is particularly created for scene parsing, containing 5,000 high-quality finely annotated images and 20,000 coarsely annotated images. All images in this dataset are shot on streets and are of size 2048 × 1024. The finely annotated images are divided into 2,975/500/1,525 splits for training, validation and testing, respectively. The dataset contains 30 annotated classes in total, while only 19 classes are used for evaluation.

ADE20K [50] is a large-scale dataset used in the ImageNet Scene Parsing Challenge 2016, containing up to 150 classes. The dataset is divided into 20K/2K/3K images for training, validation, and testing, respectively. Different from Cityscapes, both scenes and stuff are annotated in this dataset, adding more challenge for participating methods.

PASCAL Context [21] gives the segmentation labels of the whole images from PASCAL VOC 2010, containing 4,998 images for training and 5,105 images for validation. We use the 60-class (59 object categories plus background) annotations for evaluation.

Evaluation Metric. We adopt Mean IoU (mean of class-wise intersection over union) as the evaluation metric for all the datasets.
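For reference, Mean IoU can be computed from a class confusion matrix as in the short helper below (our own sketch rather than any official evaluation script); num_classes would be 19 for Cityscapes, and the ignore_index value of 255 is an assumption.

import numpy as np


def mean_iou(pred: np.ndarray, gt: np.ndarray, num_classes: int, ignore_index: int = 255) -> float:
    """Mean of the class-wise intersection over union, ignoring unlabeled pixels."""
    keep = gt != ignore_index
    gt = gt[keep].astype(np.int64)
    pred = pred[keep].astype(np.int64)
    # Confusion matrix: rows are ground-truth classes, columns are predicted classes.
    conf = np.bincount(gt * num_classes + pred, minlength=num_classes ** 2)
    conf = conf.reshape(num_classes, num_classes).astype(np.float64)
    inter = np.diag(conf)
    union = conf.sum(axis=0) + conf.sum(axis=1) - inter
    iou = inter[union > 0] / union[union > 0]   # only average over classes that appear
    return float(iou.mean())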
4.2. Implementation Details

Training Objectives. Following [47], our model has two supervisions: one after the final output of our model and another at the output layer of Stage4. Therefore, our loss function is composed of two cross entropy losses as

  L = L_final + λ · L_Stage4.   (18)

For L_final, we perform online hard pixel mining, which excels at coping with difficult cases. λ is set to 0.4.
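A sketch of this objective is given below, assuming a plain cross entropy for the Stage4 branch and a simple top-k style hard pixel mining for L_final; the exact OHEM variant and its hyper-parameters (e.g., keep_ratio) in the released code may differ.

import torch
import torch.nn.functional as F


def ohem_cross_entropy(logits, target, ignore_index=255, keep_ratio=0.25):
    """Online hard pixel mining: average the loss over the hardest pixels only (a simplification)."""
    loss = F.cross_entropy(logits, target, ignore_index=ignore_index, reduction="none")
    loss = loss.flatten()
    num_kept = max(1, int(keep_ratio * loss.numel()))
    return loss.topk(num_kept).values.mean()


def total_loss(final_logits, stage4_logits, target, lam=0.4):
    """Eq. (18): L = L_final + lambda * L_Stage4, with lambda = 0.4."""
    l_final = ohem_cross_entropy(final_logits, target)
    l_stage4 = F.cross_entropy(stage4_logits, target, ignore_index=255)
    return l_final + lam * l_stage4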
Training Settings. Our code is based on an open source repository for semantic segmentation [37] using the PyTorch 1.0 framework [23]. The backbone network ResNet-101 is pretrained on ImageNet [10]. We use Stochastic Gradient Descent (SGD) to optimize our network, in which we set the initial learning rate to 0.01 for Cityscapes and PASCAL Context and 0.02 for ADE20K. During training, the learning rate is decayed according to the “poly” learning rate policy, where the learning rate is multiplied by (1 − iter/max_iter)^power with power = 0.9. For Cityscapes, we randomly crop out high-resolution patches of 769 × 769 from the original images as the inputs for training [7, 47], while for ADE20K and PASCAL Context we set the crop size to 520 × 520 and 480 × 480, respectively [43, 47]. For all datasets, we apply random scaling in the range of [0.5, 2.0], random horizontal flipping and random brightness as additional data augmentation methods. The batch size is 8 in the Cityscapes experiments and 16 for the other datasets. We choose the cross-GPU synchronized batch normalization in [43] or apex to synchronize the mean and standard deviation of the batch normalization layers across multiple GPUs. We also apply the auxiliary loss L_Stage4 and the online hard example mining strategy in all the experiments, as their effects for improving the performance are clearly discussed in previous works [47]. We train on the training sets of Cityscapes, ADE20K and PASCAL Context for 60K, 150K and 28K iterations, respectively. All the experiments are conducted using 8× Titan V GPUs.
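The “poly” decay can be expressed as a multiplicative factor on the base learning rate, for example with PyTorch's LambdaLR as sketched below; the repository may implement the decay differently, and the momentum and weight decay values in the comments are illustrative assumptions rather than settings reported in the paper.

import torch
from torch.optim.lr_scheduler import LambdaLR


def poly_lr_scheduler(optimizer: torch.optim.Optimizer, max_iter: int, power: float = 0.9) -> LambdaLR:
    """Multiply the base learning rate by (1 - iter / max_iter) ** power after every iteration."""
    return LambdaLR(optimizer, lr_lambda=lambda it: (1 - it / max_iter) ** power)


# Example with the paper's Cityscapes settings (initial lr 0.01, 60K iterations):
# model = ...  # any nn.Module
# optimizer = torch.optim.SGD(model.parameters(), lr=0.01,
#                             momentum=0.9, weight_decay=5e-4)  # momentum/decay assumed
# scheduler = poly_lr_scheduler(optimizer, max_iter=60000)
# for it in range(60000):
#     ...  # forward / backward / optimizer.step()
#     scheduler.step()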
Inference Settings. For the comparisons with state-of-the-art methods, we apply multi-scale whole-image and left-right flip testing for ADE20K and PASCAL Context, and multi-scale sliding-crop and left-right flip testing for the Cityscapes test set. For quick ablation studies, we only employ single-scale testing on the validation set of Cityscapes by feeding the whole original images.

4.3. Comparisons with Other Methods

4.3.1 Efficiency Comparison with Non-local Block

As discussed in Sec. 3.2, APNB is much more efficient than a standard non-local block. We hereby give a quantitative efficiency comparison between our APNB and a generic non-local block in the following aspects: GFLOPs, GPU memory (MB) and GPU computation time (ms). In our network, the non-local block/APNB receives a 96 × 96 feature map (1/8 of the 769 × 769 input image patch) during training and a 256 × 128 feature map (1/8 of the 2048 × 1024 input image) during single-scale testing on the Cityscapes dataset. Hence, we give the relevant statistics for the two sizes. The testing environment is identical for these two blocks, that is, a Titan Xp GPU under CUDA 9.0 without other ongoing programs. Note that our APNB has four extra adaptive average pooling layers compared to the non-local block, while the other parts are entirely identical. The comparison results are given in Tab. 1. Our APNB is superior to the non-local block in all aspects. Enlarging the input size gives a further edge to our APNB because, in Eq. (13), N grows quadratically while S remains unchanged.

Method | Input size | GFLOPs         | GPU memory (MB) | GPU time (ms)
NB     | 96 × 96    | 58.0           | 609             | 19.5
APNB   | 96 × 96    | 15.5 (↓ 42.5)  | 150 (↓ 459)     | 12.4 (↓ 7.1)
NB     | 256 × 128  | 601.4          | 7797            | 179.4
APNB   | 256 × 128  | 43.5 (↓ 557.9) | 277 (↓ 7520)    | 30.8 (↓ 148.6)

Table 1: Computation and memory statistics comparisons between the non-local block (NB) and our APNB. The channel number of the input feature map X is C = 2048 and that of the embeddings φ, φ_P, etc. is Ĉ = 256. Batch size is 1. Lower values are better.

Besides the comparison of single-block efficiency, we also provide a whole-network efficiency comparison with the two most advanced methods, PSANet [48] and DenseASPP [36], in terms of inference time (s), GPU occupation with batch size set to 1 (MB) and the number of parameters (Million). According to Tab. 2, though our inference time and parameter number are larger than those of DenseASPP [36], the GPU memory occupation is obviously smaller. We attribute this to the different backbone networks: ResNet comparatively contains more parameters and layers, while DenseNet is more GPU memory demanding. When comparing with the previous advanced method PSANet [48], which shares the same backbone network with us, our model is more advantageous in all aspects. This verifies that our network is superior because of the effectiveness of APNB and AFNB rather than just having more parameters than previous works.

Method         | Backbone     | Inf. time (s) | Mem. (MB) | # Param (M)
DenseASPP [36] | DenseNet-161 | 0.568         | 7973      | 35.63
PSANet [48]    | ResNet-101   | 0.672         | 5233      | 102.66
Ours           | ResNet-101   | 0.611         | 3375      | 63.17

Table 2: Time, parameter and GPU memory comparisons based on the whole networks. Inf. time, Mem. and # Param mean inference time, GPU memory occupation and number of parameters, respectively. Results are averaged from feeding ten 2048 × 1024 images.
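GPU-time numbers such as those in Fig. 2 and Tab. 1 can be obtained by synchronizing around the forward pass and averaging several runs; the helper below is a generic sketch of this kind of measurement (ours), not the exact benchmarking script used for the paper.

import time
import torch


@torch.no_grad()
def gpu_time_ms(module: torch.nn.Module, x: torch.Tensor, warmup: int = 5, runs: int = 10) -> float:
    """Average forward time in milliseconds on the current CUDA device."""
    module.eval()
    for _ in range(warmup):          # warm up kernels / cuDNN autotuning
        module(x)
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(runs):
        module(x)
    torch.cuda.synchronize()         # wait for all kernels before stopping the clock
    return (time.perf_counter() - start) * 1000.0 / runs

For instance, feeding a 1 × 2048 × 256 × 128 tensor to a non-local block and to an APNB instance would reproduce the kind of comparison reported in Tab. 1.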
4.3.2 Performance Comparisons

Cityscapes. To compare the performance on the test set of Cityscapes with other methods, we directly train our asymmetric non-local neural network for 120K iterations with only the finely annotated data, including the training and validation sets. As shown in Tab. 3, our method outperforms the previous state-of-the-art methods, attaining a performance of 81.3%. We give several typical qualitative comparisons with other methods in Fig. 5. DeepLab-V3 [7] and PSPNet [47] are somewhat troubled by local inconsistency on large objects like the truck (first row), fence (second row) and building (third row), while our method is not. Besides, our method performs better on very slim objects like the pole (fourth row) as well.

Method          | Backbone       | Val | mIoU (%)
DeepLab-V2 [6]  | ResNet-101     |     | 70.4
RefineNet [18]  | ResNet-101     | ✓   | 73.6
GCN [24]        | ResNet-101     | ✓   | 76.9
DUC [32]        | ResNet-101     | ✓   | 77.6
SAC [45]        | ResNet-101     | ✓   | 78.1
ResNet-38 [34]  | WiderResNet-38 |     | 78.4
PSPNet [47]     | ResNet-101     |     | 78.4
BiSeNet [38]    | ResNet-101     | ✓   | 78.9
AAF [14]        | ResNet-101     | ✓   | 79.1
DFN [39]        | ResNet-101     | ✓   | 79.3
PSANet [48]     | ResNet-101     | ✓   | 80.1
DenseASPP [36]  | DenseNet-161   | ✓   | 80.6
Ours            | ResNet-101     | ✓   | 81.3

Table 3: Comparisons on the test set of Cityscapes with the state-of-the-art methods. Note that the Val column indicates whether the finely annotated validation set data of Cityscapes is included for training.

ADE20K. As is known, ADE20K is challenging due to its various image sizes, its many semantic categories and the gap between its training and validation sets. Even under such circumstances, our method achieves better results than EncNet [43]. It is noteworthy that our result is better than PSPNet [47] even when it uses a deeper backbone, ResNet-269.

Method         | Backbone   | mIoU (%)
RefineNet [18] | ResNet-152 | 40.70
UperNet [35]   | ResNet-101 | 42.65
DSSPN [17]     | ResNet-101 | 43.68
PSANet [48]    | ResNet-101 | 43.77
SAC [45]       | ResNet-101 | 44.30
EncNet [43]    | ResNet-101 | 44.65
PSPNet [47]    | ResNet-101 | 43.29
PSPNet [47]    | ResNet-269 | 44.94
Ours           | ResNet-101 | 45.24

Table 4: Comparisons on the validation set of ADE20K with the state-of-the-art methods.

PASCAL Context. We report the comparison with state-of-the-art methods in Tab. 5. It can be seen that our model achieves the state-of-the-art performance of 52.8%. This result firmly suggests the superiority of our method.

Method         | Backbone   | mIoU (%)
FCN-8s [26]    | –          | 37.8
Piecewise [19] | –          | 43.3
DeepLab-V2 [6] | ResNet-101 | 45.7
RefineNet [18] | ResNet-152 | 47.3
PSPNet [47]    | ResNet-101 | 47.8
CCL [11]       | ResNet-101 | 51.6
EncNet [43]    | ResNet-101 | 51.7
Ours           | ResNet-101 | 52.8

Table 5: Comparisons on the validation set of PASCAL Context with the state-of-the-art methods.

4.4. Ablation Study

In this section, we give extensive experiments to verify the efficacy of our main method. We also present several design choices and show their influence on the results. All the following experiments adopt ResNet-101 as the backbone, trained on the finely annotated training set of Cityscapes for 60K iterations.

Efficacy of APNB and AFNB. Our network has two prominent components: APNB and AFNB. The following evaluates the efficacy of each and of the integration of both. The Baseline network is basically an FCN-like ResNet-101 network with a deep supervision branch. By adding a non-local block (+ NB) before the classifier of the Baseline model, the performance is improved by 2.6% (75.8% → 78.4%), as shown in Tab. 6. By replacing the normal non-local block with our APNB (+ APNB), the performance is slightly better (78.4% → 78.6%).

Method                 | mIoU (%)
Baseline               | 75.8
+ NB                   | 78.4
+ APNB                 | 78.6
+ Common fusion        | 76.5
+ FNB                  | 77.3
+ AFNB                 | 77.1
+ Common fusion + NB   | 79.0
+ FNB + NB             | 79.7
+ AFNB + APNB (Full)   | 79.9

Table 6: Ablation study on the validation set of Cityscapes about APNB and AFNB. “+ Module” means adding “Module” to the Baseline model.

When adding a common fusion module (+ Common fusion) from Stage4 to Stage5, i.e., Stage5 + ReLU(BatchNorm(Conv(Stage4))), to the Baseline model, we also achieve a good improvement compared to the Baseline (75.8% → 76.5%). This phenomenon verifies the usefulness of the strategy that fuses the features from the last two stages. Replacing the common fusion module with our proposed Fusion Non-local Block (+ FNB), the performance is further boosted by 0.8% (76.5% → 77.3%). Likewise, changing FNB to AFNB (+ AFNB) reduces the computation considerably at the cost of a minor performance decrease (77.3% → 77.1%).

To study whether the fusion strategy could further boost the highly competitive + NB model, we add the common fusion to the + NB model (+ Common fusion + NB) and achieve a 0.6% performance improvement (78.4% → 79.0%).
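The “+ Common fusion” baseline above is described as Stage5 + ReLU(BatchNorm(Conv(Stage4))); a literal sketch of that operation follows, where the channel counts and the 1 × 1 kernel are our assumptions.

import torch
import torch.nn as nn


class CommonFusion(nn.Module):
    """Pixel-wise fusion baseline: Stage5 + ReLU(BatchNorm(Conv(Stage4)))."""

    def __init__(self, stage4_channels: int = 1024, stage5_channels: int = 2048):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Conv2d(stage4_channels, stage5_channels, kernel_size=1, bias=False),
            nn.BatchNorm2d(stage5_channels),
            nn.ReLU(inplace=True),
        )

    def forward(self, stage4: torch.Tensor, stage5: torch.Tensor) -> torch.Tensor:
        # Both stages have the same spatial size (Sec. 3.4), so element-wise addition applies directly.
        return stage5 + self.proj(stage4)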
Figure 5: Qualitative comparisons with DeepLab-V3 [7] and PSPNet [47] on Cityscapes (columns: Images, PSPNet, DeepLabV3, Ours, Ground truth; a color legend covers the 19 evaluation classes from road to bicycle). The red circles mark where our model is particularly superior to other methods.

Using both the fusion non-local block and the typical non-local block (+ FNB + NB) improves the performance to 79.7%. The combination of APNB and AFNB, namely our asymmetric non-local neural network (+ AFNB + APNB, Full in Fig. 3), achieves the best performance of 79.9%, demonstrating the efficacy of APNB and AFNB.

Selection of Sampling Methods. As discussed in Sec. 3.2, the selection of the sampling module has a great impact on the performance of APNB. Basic sampling strategies include max, average and random sampling. When integrated into spatial pyramid pooling, three more strategies arise: pyramid max, pyramid average and pyramid random. We therefore conduct several experiments to study their effects when combined with APNB. As shown in Tab. 7, average sampling performs better than max and random sampling, which conforms to the conclusion drawn in [47]. We reckon this is because the resulting anchor points are more informative: each one aggregates information from all input locations inside the average pooling kernel, unlike the other two strategies. This explanation also transfers to the pyramid settings. Comparing average sampling and pyramid sampling under the same number of anchor points (third row vs. the last row), we find that pyramid pooling is a key factor behind the significant performance boost.

Sampling method    n         S    mIoU (%)
random             15        225  78.2
max                15        225  78.1
average            15        225  78.4
pyramid random     1,2,3,6   50   78.8
pyramid max        1,2,3,6   50   79.1
pyramid average    1,2,3,6   50   79.3
pyramid average    1,3,6,8   110  79.9
pyramid average    1,4,8,12  225  80.1

Table 7: Ablation study on the validation set of Cityscapes in terms of sampling methods and the number of anchor points. The "n" column gives the output width/height of a pooling layer; the "S" column gives the total number of anchor points. Note that when implementing random and pyramid random, we use the numpy.random.choice function to randomly sample n^2 anchor points from all possible locations.
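To make the pyramid sampling concrete, the following is a minimal PyTorch-style sketch of an asymmetric non-local block whose keys and values are reduced to S anchor points by pyramid average pooling, using the (1, 3, 6, 8) output sizes from Tab. 7 (S = 1 + 9 + 36 + 64 = 110). The channel widths, the residual connection and other details are illustrative assumptions rather than the exact released implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class PyramidSample(nn.Module):
    """Pyramid average sampling: pool the feature map at several output sizes and
    concatenate the results into S anchor points (S = 110 for sizes (1, 3, 6, 8))."""
    def __init__(self, sizes=(1, 3, 6, 8)):
        super().__init__()
        self.sizes = sizes

    def forward(self, x):
        n, c, _, _ = x.shape
        pooled = [F.adaptive_avg_pool2d(x, s).view(n, c, -1) for s in self.sizes]
        return torch.cat(pooled, dim=2)  # (N, C, S)

class APNBSketch(nn.Module):
    """Simplified sketch of an asymmetric pyramid non-local block: queries come from
    every spatial position, while keys/values are restricted to S pyramid-sampled anchors.
    Channel sizes are illustrative, not the paper's exact configuration."""
    def __init__(self, in_ch=2048, key_ch=256, sizes=(1, 3, 6, 8)):
        super().__init__()
        self.query = nn.Conv2d(in_ch, key_ch, 1)
        self.key = nn.Conv2d(in_ch, key_ch, 1)
        self.value = nn.Conv2d(in_ch, key_ch, 1)
        self.out = nn.Conv2d(key_ch, in_ch, 1)
        self.sample = PyramidSample(sizes)

    def forward(self, x):
        n, _, h, w = x.shape
        q = self.query(x).view(n, -1, h * w).permute(0, 2, 1)          # (N, HW, C')
        k = self.sample(self.key(x))                                    # (N, C', S)
        v = self.sample(self.value(x)).permute(0, 2, 1)                 # (N, S, C')
        attn = torch.softmax(torch.bmm(q, k), dim=-1)                   # (N, HW, S)
        ctx = torch.bmm(attn, v).permute(0, 2, 1).reshape(n, -1, h, w)  # (N, C', H, W)
        return x + self.out(ctx)                                        # residual connection (assumption)

# Example: the attention matrix is HW x 110 instead of HW x HW.
y = APNBSketch()(torch.randn(1, 2048, 65, 65))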
Influence of the Number of Anchor Points. In our case, the output sizes of the pyramid pooling layers determine the total number of anchor points, which in turn influences the efficacy of APNB. To investigate this, we alter the pyramid average pooling output sizes among (1, 2, 3, 6), (1, 3, 6, 8) and (1, 4, 8, 12). As shown in Tab. 7, more anchor points clearly improve the performance at the cost of increased computation. Considering this trade-off between efficacy and efficiency, we choose (1, 3, 6, 8) as our default setting.

5. Conclusion

In this paper, we propose an asymmetric non-local neural network for semantic segmentation. Its core contribution is the asymmetric pyramid non-local block, which dramatically improves the efficiency and decreases the memory consumption of non-local blocks without sacrificing their performance. Besides, we also propose the asymmetric fusion non-local block to fuse features of different levels. It explores the long-range spatial relevance among features of different levels and yields a considerable performance improvement over a strong baseline. Comprehensive experimental results on the Cityscapes, ADE20K and PASCAL Context datasets show that our work achieves new state-of-the-art performance. In the future, we will apply asymmetric non-local neural networks to other vision tasks.
Acknowledgement

This work was supported by NSFC 61573160, and by the National Program for Support of Top-notch Young Professionals and the Program for HUST Academic Frontier Youth Team 2017QYTD08 awarded to Dr. Xiang Bai. We sincerely thank Huawei EI Cloud for their generous grant of GPU usage for this paper. We genuinely thank Ansheng You for his kind help and suggestions throughout the project.

References

[1] Vijay Badrinarayanan, Alex Kendall, and Roberto Cipolla. Segnet: A deep convolutional encoder-decoder architecture for image segmentation. IEEE Trans. Pattern Anal. Mach. Intell., 39(12):2481–2495, 2017.
[2] Antoni Buades, Bartomeu Coll, and Jean-Michel Morel. A non-local algorithm for image denoising. In Proc. CVPR, pages 60–65, 2005.
[3] Holger Caesar, Jasper R. R. Uijlings, and Vittorio Ferrari. Coco-stuff: Thing and stuff classes in context. In Proc. CVPR, pages 1209–1218, 2018.
[4] Siddhartha Chandra and Iasonas Kokkinos. Fast, exact and multi-scale inference for semantic image segmentation with deep gaussian crfs. In Proc. ECCV, pages 402–418, 2016.
[5] Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy, and Alan L. Yuille. Semantic image segmentation with deep convolutional nets and fully connected crfs. In Proc. ICLR, 2015.
[6] Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy, and Alan L. Yuille. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE Trans. Pattern Anal. Mach. Intell., 40(4):834–848, 2018.
[7] Liang-Chieh Chen, George Papandreou, Florian Schroff, and Hartwig Adam. Rethinking atrous convolution for semantic image segmentation. CoRR, abs/1706.05587, 2017.
[8] Liang-Chieh Chen, Yukun Zhu, George Papandreou, Florian Schroff, and Hartwig Adam. Encoder-decoder with atrous separable convolution for semantic image segmentation. In Proc. ECCV, pages 833–851, 2018.
[9] Marius Cordts, Mohamed Omran, Sebastian Ramos, Timo Rehfeld, Markus Enzweiler, Rodrigo Benenson, Uwe Franke, Stefan Roth, and Bernt Schiele. The cityscapes dataset for semantic urban scene understanding. In Proc. CVPR, pages 3213–3223, 2016.
[10] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Fei-Fei Li. Imagenet: A large-scale hierarchical image database. In Proc. CVPR, pages 248–255, 2009.
[11] Henghui Ding, Xudong Jiang, Bing Shuai, Ai Qun Liu, and Gang Wang. Context contrasted feature and gated multi-scale aggregation for scene segmentation. In Proc. CVPR, pages 2393–2402, 2018.
[12] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Spatial pyramid pooling in deep convolutional networks for visual recognition. IEEE Trans. Pattern Anal. Mach. Intell., 37(9):1904–1916, 2015.
[13] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proc. CVPR, pages 770–778, 2016.
[14] Tsung-Wei Ke, Jyh-Jing Hwang, Ziwei Liu, and Stella X. Yu. Adaptive affinity fields for semantic segmentation. In Proc. ECCV, pages 605–621, 2018.
[15] John D. Lafferty, Andrew McCallum, and Fernando C. N. Pereira. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proc. ICML, pages 282–289, 2001.
[16] Svetlana Lazebnik, Cordelia Schmid, and Jean Ponce. Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. In Proc. CVPR, pages 2169–2178, 2006.
[17] Xiaodan Liang, Hongfei Zhou, and Eric Xing. Dynamic-structured semantic propagation network. In Proc. CVPR, pages 752–761, 2018.
[18] Guosheng Lin, Anton Milan, Chunhua Shen, and Ian D. Reid. Refinenet: Multi-path refinement networks for high-resolution semantic segmentation. In Proc. CVPR, pages 5168–5177, 2017.
[19] Guosheng Lin, Chunhua Shen, Anton van den Hengel, and Ian D. Reid. Efficient piecewise training of deep structured models for semantic segmentation. In Proc. CVPR, pages 3194–3203, 2016.
[20] Wei Liu, Andrew Rabinovich, and Alexander C. Berg. Parsenet: Looking wider to see better. CoRR, abs/1506.04579, 2015.
[21] Roozbeh Mottaghi, Xianjie Chen, Xiaobai Liu, Nam-Gyu Cho, Seong-Whan Lee, Sanja Fidler, Raquel Urtasun, and Alan Yuille. The role of context for object detection and semantic segmentation in the wild. In Proc. CVPR, 2014.
[22] Hyeonwoo Noh, Seunghoon Hong, and Bohyung Han. Learning deconvolution network for semantic segmentation. In Proc. ICCV, pages 1520–1528, 2015.
[23] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in pytorch. 2017.
[24] Chao Peng, Xiangyu Zhang, Gang Yu, Guiming Luo, and Jian Sun. Large kernel matters - improve semantic segmentation by global convolutional network. In Proc. CVPR, pages 1743–1751, 2017.
[25] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. In Proc. MICCAI, pages 234–241, 2015.
[26] Evan Shelhamer, Jonathan Long, and Trevor Darrell. Fully convolutional networks for semantic segmentation. IEEE Trans. Pattern Anal. Mach. Intell., 39(4):640–651, 2017.
[27] Mennatullah Siam, Sara Elkerdawy, Martin Jägersand, and Senthil Yogamani. Deep semantic segmentation for automated driving: Taxonomy, roadmap and challenges. In Proc. ITSC, pages 1–8, 2017.
[28] Mennatullah Siam, Mostafa Gamal, Moemen Abdel-Razek, Senthil Yogamani, Martin Jägersand, and Hong Zhang. A comparative study of real-time semantic segmentation for autonomous driving. In CVPR workshop, pages 587–597, 2018.
[29] Nathan Silberman, Derek Hoiem, Pushmeet Kohli, and Rob Fergus. Indoor segmentation and support inference from RGBD images. In Proc. ECCV, pages 746–760, 2012.
[30] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Proc. NeurIPS, pages 6000–6010, 2017.
[31] Raviteja Vemulapalli, Oncel Tuzel, Ming-Yu Liu, and Rama Chellappa. Gaussian conditional random field network for semantic segmentation. In Proc. CVPR, pages 3224–3233, 2016.
[32] Panqu Wang, Pengfei Chen, Ye Yuan, Ding Liu, Zehua Huang, Xiaodi Hou, and Garrison W. Cottrell. Understanding convolution for semantic segmentation. In Proc. WACV, pages 1451–1460, 2018.
[33] Xiaolong Wang, Ross B. Girshick, Abhinav Gupta, and Kaiming He. Non-local neural networks. In Proc. CVPR, pages 7794–7803, 2018.
[34] Zifeng Wu, Chunhua Shen, and Anton van den Hengel. Wider or deeper: Revisiting the resnet model for visual recognition. CoRR, abs/1611.10080, 2016.
[35] Tete Xiao, Yingcheng Liu, Bolei Zhou, Yuning Jiang, and Jian Sun. Unified perceptual parsing for scene understanding. In Proc. ECCV, pages 432–448, 2018.
[36] Maoke Yang, Kun Yu, Chi Zhang, Zhiwei Li, and Kuiyuan Yang. Denseaspp for semantic segmentation in street scenes. In Proc. CVPR, pages 3684–3692, 2018.
[37] Ansheng You, Xiangtai Li, Zhen Zhu, and Yunhai Tong. Torchcv: A pytorch-based framework for deep learning in computer vision. https://github.com/donnyyou/torchcv, 2019.
[38] Changqian Yu, Jingbo Wang, Chao Peng, Changxin Gao, Gang Yu, and Nong Sang. Bisenet: Bilateral segmentation network for real-time semantic segmentation. In Proc. ECCV, pages 334–349, 2018.
[39] Changqian Yu, Jingbo Wang, Chao Peng, Changxin Gao, Gang Yu, and Nong Sang. Learning a discriminative feature network for semantic segmentation. In Proc. CVPR, pages 1857–1866, 2018.
[40] Fisher Yu and Vladlen Koltun. Multi-scale context aggregation by dilated convolutions. CoRR, abs/1511.07122, 2015.
[41] Fisher Yu, Dequan Wang, Evan Shelhamer, and Trevor Darrell. Deep layer aggregation. In Proc. CVPR, pages 2403–2412, 2018.
[42] Yuhui Yuan and Jingdong Wang. Ocnet: Object context network for scene parsing. CoRR, abs/1809.00916, 2018.
[43] Hang Zhang, Kristin J. Dana, Jianping Shi, Zhongyue Zhang, Xiaogang Wang, Ambrish Tyagi, and Amit Agrawal. Context encoding for semantic segmentation. In Proc. CVPR, pages 7151–7160, 2018.
[44] Han Zhang, Ian J. Goodfellow, Dimitris N. Metaxas, and Augustus Odena. Self-attention generative adversarial networks. In Proc. ICML, pages 7354–7363, 2019.
[45] Rui Zhang, Sheng Tang, Yongdong Zhang, Jintao Li, and Shuicheng Yan. Scale-adaptive convolutions for scene parsing. In Proc. ICCV, pages 2050–2058, 2017.
[46] Zhenli Zhang, Xiangyu Zhang, Chao Peng, Xiangyang Xue, and Jian Sun. Exfuse: Enhancing feature fusion for semantic segmentation. In Proc. CVPR, pages 273–288, 2018.
[47] Hengshuang Zhao, Jianping Shi, Xiaojuan Qi, Xiaogang Wang, and Jiaya Jia. Pyramid scene parsing network. In Proc. CVPR, pages 6230–6239, 2017.
[48] Hengshuang Zhao, Yi Zhang, Shu Liu, Jianping Shi, Chen Change Loy, Dahua Lin, and Jiaya Jia. Psanet: Point-wise spatial attention network for scene parsing. In Proc. ECCV, pages 270–286, 2018.
[49] Shuai Zheng, Sadeep Jayasumana, Bernardino Romera-Paredes, Vibhav Vineet, Zhizhong Su, Dalong Du, Chang Huang, and Philip H. S. Torr. Conditional random fields as recurrent neural networks. In Proc. ICCV, pages 1529–1537, 2015.
[50] Bolei Zhou, Hang Zhao, Xavier Puig, Sanja Fidler, Adela Barriuso, and Antonio Torralba. Scene parsing through ADE20K dataset. In Proc. CVPR, pages 5122–5130, 2017.
[51] Yu Zhou, Xiang Bai, Wenyu Liu, and Longin Jan Latecki. Similarity fusion for visual tracking. International Journal of Computer Vision, 118(3):337–363, 2016.
[52] Yuyin Zhou, Zhe Li, Song Bai, Chong Wang, Xinlei Chen, Mei Han, Elliot Fishman, and Alan Yuille. Prior-aware neural network for partially-supervised multi-organ segmentation. In Proc. ICCV, 2019.
[53] Yuyin Zhou, Yan Wang, Peng Tang, Song Bai, Wei Shen, Elliot Fishman, and Alan Yuille. Semi-supervised 3d abdominal multi-organ segmentation via deep multi-planar co-training. In Proc. WACV, pages 121–140, 2019.
A. Quantitative comparisons on COCO-Stuff-10K and NYUD-V2

Following the training and evaluation protocols of DeepLab-V2 [6] and RefineNet [18], our method achieves competitive results on the two datasets using single-scale whole-image testing, as shown in Tab. 8. As NYUD-V2 [29] is a small benchmark and COCO-Stuff-10K is quite challenging and large, these results further verify the effectiveness of our method on both small and large benchmarks.

COCO-Stuff-10K                           NYUD-V2
Method          Backbone    mIoU (%)     Method          Backbone    mIoU (%)
RefineNet [18]  ResNet-101  33.6         Piecewise [19]  VGG16       40.6
CCL [11]        ResNet-101  35.7         RefineNet [18]  ResNet-101  43.6
Ours            ResNet-101  37.2         Ours            ResNet-101  44.4

Table 8: Comparisons on the COCO-Stuff-10K (Left) and NYUD-V2 (Right) datasets. Results of the competing methods are taken from their papers.

B. More ablation results

Qualitative comparisons. We also give qualitative comparisons of our Full (+ AFNB + APNB) method with other variants of our model in Fig. 6. In summary, the Full method shows the best semantic consistency and the fewest inconsistency artifacts, while + AFNB and + APNB fail in some cases. The results also indicate that AFNB and APNB are complementary to each other and that their combination is beneficial to the performance.
Selection of the fusing layers. Fusing features from multiple levels is effective in many computer vision tasks. However, it still requires a lot of trials to find a good combination of fusing layers. For semantic segmentation, the last several layers of the network contain plenty of semantic features, which are critical for better performance. Hence, we only combine features from the shallow layers to the deep layers in a top-down manner; the responses are summed if a certain layer receives multiple fusion inputs. The results are listed in Tab. 9. An obvious conclusion is that fusing only the features of Stage4 and Stage5 brings a considerable improvement, while fusing more layers only hurts the performance.

We offer a possible intuitive explanation: features from the early stages are generic to most tasks, while those from the last two stages are task-specific. Therefore, merging the features of the last several stages is more effective for a specific task. To examine this, in Fig. 7 we compare the feature visualizations of the outputs of all five stages of a network trained on the segmentation benchmark Cityscapes [9] (Lower) with those of a network trained on the classification benchmark ImageNet [10] (Upper), using the same ResNet-101 backbone. Comparing the two networks' visualizations of the same stage, the features are quite similar in the first three stages but differ greatly in the last two, which accords with our explanation. Note that our experimental results partially conform to the conclusions in ExFuse [46], further validating the effectiveness of fusing only the last two stages.

Method      Fusing layers          mIoU (%)
Baseline    –                      75.8
AFNB        4 & 5                  77.1
AFNB        3 & 5, 4 & 5           76.7
AFNB        2 & 5, 3 & 5, 4 & 5    76.2

Table 9: Ablation study on the validation set of Cityscapes in terms of the selection of layers to be fused. "4 & 5" means fusing the features of Stage4 and Stage5; others likewise.
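As a concrete reading of the fusing-layer configurations in Tab. 9, the sketch below shows how the responses of several fusion blocks targeting Stage5 can be summed, e.g., for the "3 & 5, 4 & 5" row. FusionBlock is only a stand-in for AFNB; its structure and the channel widths are illustrative assumptions.

import torch
import torch.nn as nn

class FusionBlock(nn.Module):
    """Stand-in for a fusion block such as AFNB: maps a lower-stage feature to the
    channel width of Stage5. The real block additionally attends from the high-level
    feature to the low-level one; here `high` is accepted but unused."""
    def __init__(self, low_ch, high_ch=2048):
        super().__init__()
        self.proj = nn.Conv2d(low_ch, high_ch, kernel_size=1)

    def forward(self, low, high):
        # Assumes all stages share the same spatial size (dilated backbone).
        return self.proj(low)

class MultiStageFusion(nn.Module):
    """The "3 & 5, 4 & 5" configuration: Stage5 receives two fusion inputs whose
    responses are summed."""
    def __init__(self):
        super().__init__()
        self.fuse_3_5 = FusionBlock(low_ch=512)
        self.fuse_4_5 = FusionBlock(low_ch=1024)

    def forward(self, s3, s4, s5):
        return s5 + self.fuse_3_5(s3, s5) + self.fuse_4_5(s4, s5)

# Dummy usage with assumed ResNet-101 channel widths
s3, s4, s5 = (torch.randn(1, c, 97, 97) for c in (512, 1024, 2048))
out = MultiStageFusion()(s3, s4, s5)  # (1, 2048, 97, 97)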
C. Discussions

We have conducted extensive experiments and summarize here some practical experience with semantic segmentation. As is well known, deep networks are powerful tools but need very careful tuning. The tuning process is especially painful for semantic segmentation, since many factors are involved and training is very time- and resource-consuming. We therefore share some of our experience with interested readers for reference. Please note that these observations are not as thorough as those in the main draft, and some may not suit your cases.

• We found that the type of graphics card, the PyTorch version, the CUDA version and the NVIDIA driver version may influence the performance. In our case, the same model trained on a workstation with 8 × Titan V GPUs was around 0.5 mIoU better than when trained on another machine with 8 × Titan Xp GPUs.

• The architecture of AFNB needs careful modification to work well. For example, we found that adding a batch normalization layer after Wo helps a lot when AFNB is used alone, but may not help on the Full model. We suspect there is a theoretical reason behind this and we are working on the issue. In contrast, we found APNB to be quite stable across a variety of architectures.

• The learning rate seems important for segmentation models: we observe that the performance is boosted tremendously in the last quarter of the training iterations, and changing the initial learning rate has a great impact on the performance. We strongly suggest that those who are new to this task refer to the learning rate configurations of published methods whose architectures are most similar to their own.
[Figure 6 here: example results over the Cityscapes class legend, with columns Images, Baseline, +AFNB, +APNB, Full and Ground truth.]

Figure 6: Qualitative comparisons among the Full model and other variants of our model. The red circles indicate where the Full model is superior to other model variants.
[Figure 7 here: feature visualizations of Stage 1 to Stage 5, ranging from generic to task-specific features.]

Figure 7: Feature visualization of different stages on ResNet-101. The Upper row represents visualizations from the network trained on the classification dataset ImageNet [10]. The Lower row represents visualizations from the network trained on the scene segmentation dataset Cityscapes [9].
• Using more training iterations while keeping the other configurations identical does not always improve the performance. In fact, changing the number of training iterations effectively changes the learning rate schedule, since we mainly adopt the poly learning rate decay method in semantic segmentation.
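For reference, the poly learning rate decay mentioned above is usually implemented as below; the power of 0.9 is a common choice in the segmentation literature and is an assumption here, not a value quoted from this paper.

def poly_lr(base_lr, cur_iter, max_iter, power=0.9):
    """Poly decay: the learning rate reaches zero at max_iter, so enlarging max_iter
    also changes the learning rate seen at every intermediate iteration."""
    return base_lr * (1 - cur_iter / max_iter) ** power

# The same iteration gets a different learning rate under a longer schedule.
print(poly_lr(0.01, 30000, 60000))  # ~0.0054
print(poly_lr(0.01, 30000, 90000))  # ~0.0069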
