You are on page 1of 38

Computer Vision; Image Classi cation; Large Networks

YouTube Playlist

Maziar Raissi

Assistant Professor

Department of Applied Mathematics

University of Colorado Boulder

fi
maziar.raissi@colorado.edu
Introduction to Convolutional Neural Networks
YouTube Playlist

Question: What is an image? Feature Learning


<latexit sha1_base64="iAvc9dHfmsMIDwWN1w4uFg4jY/w=">AAACGXicbVDLSsNAFJ34rPUVdamLwSK4KkkFFTcW3bhswT6gDWUyubFDJ5MwMxFK6cbf8Afc6h+4E7eu/AG/w0mbhW09MHA499zHHD/hTGnH+baWlldW19YLG8XNre2dXXtvv6niVFJo0JjHsu0TBZwJaGimObQTCSTyObT8wW1Wbz2CVCwW93qYgBeRB8FCRok2Us8+qqegMnqFW32iMVOYCMyMC65xzy45ZWcCvEjcnJRQjlrP/ukGMU0jEJpyolTHdRLtjYjUjHIYF7upgoTQgZneMVSQCJQ3mvxijE+MEuAwluYJjSfq344RiZQaRr5xRkT31XwtE/+rdVIdXnojJpJUg6DTRWHKsY5xFgkOmASq+dAQQiUzt2LaJ5JQbYKb2RKo7LRx0QTjzsewSJqVsntePqtXStWbPKICOkTH6BS56AJV0R2qoQai6Am9oFf0Zj1b79aH9Tm1Lll5zwGagfX1C691n/k=</latexit>
<latexit sha1_base64="QLblGEiyqXEVmzsVo9JlXYJGJXk=">AAACGHicbVDLSsNAFJ3UV62vqMtuBovgqiQV1GVREBcuKthaaEOZTG7aoZNJmJkIJXThb/gDbvUP3Ilbd/6A3+E07cK2Hhg4nHPv3MPxE86Udpxvq7Cyura+UdwsbW3v7O7Z+wctFaeSQpPGPJZtnyjgTEBTM82hnUggkc/hwR9eTfyHR5CKxeJejxLwItIXLGSUaCP17HI3/yOTEIyvgehUAr4FIgUT/Z5dcapODrxM3BmpoBkaPfunG8Q0jUBoyolSHddJtJcRqRnlMC51UwUJoUPSh46hgkSgvCwPMMbHRglwGEvzhMa5+ncjI5FSo8g3kxHRA7XoTcT/vE6qwwsvYyJJNQg6PRSmHOsYTxrBAZNANR8ZQqhkJiumAyIJ1aa3uSuBmkQbl0wx7mINy6RVq7pn1dO7WqV+OauoiMroCJ0gF52jOrpBDdREFD2hF/SK3qxn6936sD6nowVrtnOI5mB9/QJLX6EG</latexit>

Answer: An image is a tensor. g✓ : X 2 RH⇥W ⇥C 7! x 2 Rd


<latexit sha1_base64="g9qPiGRY10U36+FRcylhqQn39eA=">AAACGnicbVDLSsNAFJ3UV62vqEsRBovgKiQVVFy1unFZwT6gDWUyndihk5kwM1FK6Mrf8Afc6h+4E7du/AG/w0mbhW29cOFwzr2ce08QM6q0635bhaXlldW14nppY3Nre8fe3WsqkUhMGlgwIdsBUoRRThqaakbasSQoChhpBcPrTG89EKmo4Hd6FBM/QvechhQjbaiefVjj6pHIS1jjkBqNQKoggppwJaTTs8uu404KLgIvB2WQV71n/3T7AicR4RozpFTHc2Ptp0hqihkZl7qJIjHCQ2PUMZCjiCg/nbwxhseG6cNQSNNcwwn7dyNFkVKjKDCTEdIDNa9l5H9aJ9HhhZ9SHifmLzw1ChMGtYBZJrBPJcGajQxAWFJzK8QDJBHWJrkZl77KThuXTDDefAyLoFlxvDPn9LZSrl7lERXBATgCJ8AD56AKbkAdNAAGT+AFvII369l6tz6sz+lowcp39sFMWV+/VICgUg==</latexit>

<latexit sha1_base64="0hPK07QDr9ZadJ6v6irNBezwGaM=">AAACSHicbVBNS8NAEN3Ur1q/oh69LBbBU0lUVDwVeyl4UbG20NSy2WzbpZtN2J2IJeRH+Tf8A+JNL569iTc3tYJWBwYeb94wb54fC67BcZ6swszs3PxCcbG0tLyyumavb1zrKFGUNWgkItXyiWaCS9YADoK1YsVI6AvW9Ie1fN68ZUrzSF7BKGadkPQl73FKwFBd+6zf9WDAgJzgFva4xF5IYOD76WV2k9axBzxkGje/QS3LBbGGCN9Ny4OuXXYqzrjwX+BOQBlN6rxrv3pBRJOQSaCCaN12nRg6KVHAqWBZyUs0iwkdkj5rGyiJsdBJx09neMcwAe5FyrQEPGZ/bqQk1HoU+kaZe9TTs5z8b9ZOoHfcSbmME2CSfh3qJQKbn/MEccAVoyBGBhCquPGK6YAoQsHk/OtKoHNrWckE407H8Bdc71Xcw8r+xUG5ejqJqIi20DbaRS46QlVUR+eogSi6R4/oGb1YD9ab9W59fEkL1mRnE/2qQuETniqyNQ==</latexit>

X 2 RH⇥W ⇥C ! an image
<latexit sha1_base64="67tzYPgWGQJH189svGEeuQIDkhI=">AAACP3icbVDLSsNAFJ3Ud31FXboZLIKrkqioy1I3LlWsLTS1TCbTduhkEmZu1BLyP/6GP+BW/QHdiVt3TmoErV64cDjnPo8fC67BcZ6t0tT0zOzc/EJ5cWl5ZdVeW7/UUaIoa9BIRKrlE80El6wBHARrxYqR0Bes6Q+Pc715zZTmkbyAUcw6IelL3uOUgKG6dr2FPS6xFxIY+H56nl2lJ9gDHjKNm9/gOMOe4v0BEKWiG8OyW0iJxNwMY1nXrjhVZxz4L3ALUEFFnHbtFy+IaBIyCVQQrduuE0MnJQo4FSwre4lmMaFDM7xtoCTmhk46/jXD24YJcC9SJiXgMfuzIyWh1qPQN5X5T3pSy8n/tHYCvaNOymWcAJP0a1EvERginBuHA64YBTEygFDFza2YDogiFIy9v7YEOj8tKxtj3Ekb/oLL3ap7UN0726/U6oVF82gTbaEd5KJDVEMn6BQ1EEV36AE9oifr3nq13qz3r9KSVfRsoF9hfXwCWcmwRw==</latexit>

{(g✓ (Xn ), yn ) : n = 1, . . . , N } ! tabular data (multinomial logistic regression)


<latexit sha1_base64="3ZnMVCOi7AlP+kJUcjzlgX2kUsA=">AAACfHicbZFda9swFIZl76tL95F9Xe1GLBtkLA32NrpRKJTtZlejg6UNxMEcyyeOqCwZ6XhrMGa/sz9g+xllcpqLtd0Bwav36OgcPcoqJR1F0VkQ3rh56/adrbu97Xv3HzzsP3p85ExtBU6EUcZOM3CopMYJSVI4rSxCmSk8zk4+d/njH2idNPo7rSqcl1BouZACyFtp/1fSDJNa52gzCwKbIk1oiQTDaapft2lzmup2xFd+s8c13+fxiCcqN+RG/GvS8sTKYklgrfnJE8JTagiyWoHlORDwYVkrktqUEhRXpvDvkYJbLCy6biTfoT+IxtE6+HURb8SAbeIw7f9JciPqEjUJBc7N4qiieQPW36yw7SW1wwrECRQ481JDiW7erEG1/JV3cr4w1i9NfO3+W9FA6dyqzPzJEmjpruY683+5WU2Lj/NG6qom1OKi0aJWnAzvqPNcWhSkVl6AsLKjIJbgiZP/m0tdcteN1vY8mPgqhuvi6O043h2/+/Z+cPBpg2iLPWcv2JDF7AM7YF/YIZswwX4H28HT4FlwHr4M34Q7F0fDYFPzhF2KcPcvQPrB9Q==</latexit>

| {z }
xn
g✓ ! CNN
<latexit sha1_base64="evuGds2hzXJHQciqvnnvMxOCQv8=">AAACHHicbVDJSgNBEO2JW4xb1KMHG4PgKcyoqMdgLp5CBLNAJoSeTiVp0rPQXaOGIUd/wx/wqn/gTbwK/oDfYWc5mMQHBY/3qqiq50VSaLTtbyu1tLyyupZez2xsbm3vZHf3qjqMFYcKD2Wo6h7TIEUAFRQooR4pYL4noeb1iyO/dg9KizC4w0EETZ91A9ERnKGRWtnDbsvFHiCjrhLdHjKlwgfqIjxiUiyVhq1szs7bY9BF4kxJjkxRbmV/3HbIYx8C5JJp3XDsCJsJUyi4hGHGjTVEjPdZFxqGBswH3UzGjwzpsVHatBMqUwHSsfp3ImG+1gPfM50+w56e90bif14jxs5VMxFBFCMEfLKoE0uKIR2lQttCAUc5MIRxJcytlPeYYhxNdjNb2np02jBjgnHmY1gk1dO8c5E/uz3PFa6nEaXJATkiJ8Qhl6RAbkiZVAgnT+SFvJI369l6tz6sz0lryprO7JMZWF+/F72iew==</latexit>

The aim is to take an image as input and turn it into a vector!


<latexit sha1_base64="24Ff0Yyy7nXMWQOF92jq6WFO2JQ=">AAACPHicbVA9SwNBEN3zM8avqKXNahCswl0EtTNoYxkhX5CEsLe3MUv29o7duUAI+Tn+Df+ArRb2YiO21k4uKUzig4W3b94wM8+PlbTguu/Oyura+sZmZiu7vbO7t587OKzZKDFcVHmkItPwmRVKalEFCUo0YiNY6CtR9/t3k3p9IIyVka7AMBbtkD1q2ZWcAUqd3E2lJyiTIZWWQkSB9fGrqUQbEkuljhNAJaCQGNQBFfQxOhAcInPSyeXdgpuCLhNvRvJkhnIn99kKIp6EQgNXzNqm58bQHjEDkisxzrYSK2LG+zi/iVSzUNj2KD10TM9QCWg3Mvg00FT92zFiobXD0EdnyKBnF2sT8b9aM4HudXuUHis0nw7qJiqNBFOjgTR4rxoiYdxI3JXyHjOMA2Y7NyWwk9XGWQzGW4xhmdSKBe+ycPFQzJduZxFlyDE5JefEI1ekRO5JmVQJJ0/khbySN+fZ+XC+nO+pdcWZ9RyROTg/v3grrZ4=</latexit>

H
<latexit sha1_base64="UNo1T9dTZCrR2zm8p5TSCkm56yE=">AAAB/HicbVDLTsJAFL3FF+ILdemmkZi4Iq0adUl0wxISQRJoyHR6gQnTaTMzNSEN/oBb/QN3xq3/4g/4HU6hCwFPMsnJOffmnjl+zJnSjvNtFdbWNza3itulnd29/YPy4VFbRYmk2KIRj2THJwo5E9jSTHPsxBJJ6HN89Mf3mf/4hFKxSDzoSYxeSIaCDRgl2kjNer9ccarODPYqcXNSgRyNfvmnF0Q0CVFoyolSXdeJtZcSqRnlOC31EoUxoWMyxK6hgoSovHQWdGqfGSWwB5E0T2h7pv7dSEmo1CT0zWRI9Egte5n4n9dN9ODWS5mIE42Czg8NEm7ryM5+bQdMItV8YgihkpmsNh0RSag23SxcCVQWbVoyxbjLNayS9kXVva5eNq8qtbu8oiKcwCmcgws3UIM6NKAFFBBe4BXerGfr3fqwPuejBSvfOYYFWF+/veWVUQ==</latexit>

1 ⇥ 1 Convolution
<latexit sha1_base64="07WnS3P/AiPREp6xTgyMjLuRIV8=">AAACIHicbVBLTsMwFHTKr5RfgSUbixaJVZUUCVhWdMOySPQjtVHlOE5r1bEj26lURTkA1+ACbOEG7BBLOADnwE2zoC0jWRrNvOc3Gi9iVGnb/rIKG5tb2zvF3dLe/sHhUfn4pKNELDFpY8GE7HlIEUY5aWuqGelFkqDQY6TrTZpzvzslUlHBH/UsIm6IRpwGFCNtpGG5Msj+SCTx06oDB5qGREGnCpuCTwWL8ym7ZmeA68TJSQXkaA3LPwNf4DgkXGOGlOo7dqTdBElNMSNpaRArEiE8QSPSN5Qjc9NNsiApvDCKDwMhzeMaZurfjQSFSs1Cz0yGSI/VqjcX//P6sQ5u3YTyKNaE48WhIGZQCzhvBvpUEqzZzBCEJTVZIR4jibA2/S1d8dU8WloyxTirNayTTr3mXNeuHuqVxl1eURGcgXNwCRxwAxrgHrRAG2DwBF7AK3iznq1368P6XIwWrHznFCzB+v4FBUmjYg==</latexit>

W X(i, j), 8(i, j) ! parameter sharing


<latexit sha1_base64="J6hKYf8fAvuXs7UQqV3RMleCA6c=">AAACsXicdVHLbhMxFPUMj5bwCrCDjUWEKFIVzVAEbJAisgFWBZEmVSYE23OTuPF4RvadlsjyX/Ez/ADfgSfNIn1wJUtH5577OuaVkhaT5E8U37h56/bO7p3W3Xv3HzxsP3p8ZMvaCBiIUpVmxJkFJTUMUKKCUWWAFVzBkC/7TX54CsbKUn/HVQWTgs21nEnBMFDT9u+s1jkYbpgAd7wn909e+anLpKZZwXDBufvmf7j+S+/pB7qtHV4roxnKAiztB/22evSfzn2/T7NZaZhSdC2hmZHzBTJjyrPQDH6hq5hhBSAYahfMSD3303Yn6SbroFdBugEdsonDaftvlpeiLkCjUMzacZpUOHHMoBQKfCurLVRMLNkcxgHqMM9O3NpdT18EJqdhyfA00jW7XeFYYe2q4EHZHGYv5xryuty4xtn7iZO6qhG0OB80qxXFkjZfRXNpQKBaBcCEkWFXKoIDTAQvLk7JbbOabwVj0ss2XAVHr7vp2+7B1zed3seNRbvkGXlO9khK3pEe+UQOyYCI6GnUiz5HX+KD+Dj+GfNzaRxtap6QCxEv/wG0StaS</latexit>

Y (i, j) = |{z}
| {z } 0
| {z }
2RC ⇥C
2RC 0 2RC
3 ⇥ 3 Convolution
<latexit sha1_base64="n5CKsI+lGrCsIPvr4FzSVFL2E9Q=">AAACIHicbVBLTsMwFHTKr5RfgSUbixaJVZW0ErCs6IZlkehHaqrKcZzWqmNHtlOpinoArsEF2MIN2CGWcADOgZNmAS0jWRrNvOc3Gi9iVGnb/rQKG5tb2zvF3dLe/sHhUfn4pKtELDHpYMGE7HtIEUY56WiqGelHkqDQY6TnTVup35sRqajgD3oekWGIxpwGFCNtpFG54mZ/JJL4i2oDupqGRMFGFbYEnwkW51N2zc4A14mTkwrI0R6Vv11f4DgkXGOGlBo4dqSHCZKaYkYWJTdWJEJ4isZkYChH5uYwyYIs4IVRfBgIaR7XMFN/byQoVGoeemYyRHqiVr1U/M8bxDq4GSaUR7EmHC8PBTGDWsC0GehTSbBmc0MQltRkhXiCJMLa9Pfniq/SaIuSKcZZrWGddOs156rWuK9Xmrd5RUVwBs7BJXDANWiCO9AGHYDBI3gGL+DVerLerHfrYzlasPKdU/AH1tcPC++jZg==</latexit>

W X X
<latexit sha1_base64="fowvhvbNRw7XNEJtQqPofYe4ch8=">AAAB/HicbVBLSgNBFHwTfzH+oi7dNAbBVZhRUZdBNy4TMB9IhtDT85I06fnQ3SOEIV7Ard7Anbj1Ll7Ac9iTzMIkFjQUVe/xqsuLBVfatr+twtr6xuZWcbu0s7u3f1A+PGqpKJEMmywSkex4VKHgITY11wI7sUQaeALb3vg+89tPKBWPwkc9idEN6DDkA86oNlKj3S9X7Ko9A1klTk4qkKPeL//0/IglAYaaCapU17Fj7aZUas4ETku9RGFM2ZgOsWtoSANUbjoLOiVnRvHJIJLmhZrM1L8bKQ2UmgSemQyoHqllLxP/87qJHty6KQ/jRGPI5ocGiSA6Itmvic8lMi0mhlAmuclK2IhKyrTpZuGKr7Jo05IpxlmuYZW0LqrOdfWycVWp3eUVFeEETuEcHLiBGjxAHZrAAOEFXuHNerberQ/rcz5asPKdY1iA9fUL1c2VYA==</latexit>

<latexit sha1_base64="T0c71ZzNkzcOSZLEKrEhjHgcMM4=">AAAC93icbVHLjtMwFHXCayiPKbBkc0WFWjSlSgoCNkgjumE5IDotakpwHLd1x3Ei2wEqy1u28AfsEFs+hx/gI1jhpF10pr2SpaNzfe65Pk4KzpQOgj+ef+nylavXDq43bty8dfuweefuqcpLSeiQ5DyX4wQrypmgQ800p+NCUpwlnI6Ss0HVH32iUrFcvNOrgk4zPBdsxgjWjoqb/6JSpFQmEhNq3ndYd/nIxiZiAqIM60WSmLf2gxm0rYWXEKkyiw1rQ903j8Nu0A0ja9f8cpffmj3q9I9Yuwv9o2V7vwVEmmVUwcB5bSvHHQWs0ipY7hcPbBeiWS4x51A/ASLJ5guNpcw/u7H0izYFljijmkpQCyyZmNu42Qp6QV2wC8INaKFNncTNv1GakzKjQhOOlZqEQaGnBkvNCKe2EZWKFpic4TmdOCicn5qa+o8sPHRMCm5Jd4SGmt1WGJwptcoSd7N6mLrYq8h9vUmpZy+mhomi1FSQtdGs5KBzqD4cUiYp0XzlACaSuV2BuAQwcVmcd0lVtZptuGDCizHsgtN+L3zWe/Lmaev41SaiA3QfPUAdFKLn6Bi9RidoiIj30fvqffO++yv/h//T/7W+6nsbzT10rvzf/wEuoO4I</latexit>

Y (i, j) = W (2 + i0 , 2 + j 0 ) X(si + i0 , sj + j 0 ), 8(i, j) ! parameter sharing


| {z } | {z }| {z }
0 i 2{ 1,0,1} j 0 2{ 1,0,1}
C
<latexit sha1_base64="j2UR/tI0kCsKRiHNcMVIXKUo83A=">AAAB/HicbVDLTsJAFL3FF+ILdemmkZi4Iq0adUlk4xISQRJoyHR6gQnTaTMzNSEN/oBb/QN3xq3/4g/4HU6hCwFPMsnJOffmnjl+zJnSjvNtFdbWNza3itulnd29/YPy4VFbRYmk2KIRj2THJwo5E9jSTHPsxBJJ6HN89Mf1zH98QqlYJB70JEYvJEPBBowSbaRmvV+uOFVnBnuVuDmpQI5Gv/zTCyKahCg05USpruvE2kuJ1IxynJZ6icKY0DEZYtdQQUJUXjoLOrXPjBLYg0iaJ7Q9U/9upCRUahL6ZjIkeqSWvUz8z+smenDrpUzEiUZB54cGCbd1ZGe/tgMmkWo+MYRQyUxWm46IJFSbbhauBCqLNi2ZYtzlGlZJ+6LqXlcvm1eV2l1eURFO4BTOwYUbqME9NKAFFBBe4BXerGfr3fqwPuejBSvfOYYFWF+/te2VTA==</latexit>

2R C0 2RC 0 ⇥C 2RC
s ! stride
<latexit sha1_base64="pZx4TjTMSIzM9wmSv3mAPUtOnoc=">AAACGHicbVDJSgNBEO1xjXGLesylMQiewoyKegx68RjBLJAMoaenJmnSs9Bdo4YhB3/DH/Cqf+BNvHrzB/wOO8vBJD4oeLxXRVU9L5FCo21/W0vLK6tr67mN/ObW9s5uYW+/ruNUcajxWMaq6TENUkRQQ4ESmokCFnoSGl7/euQ37kFpEUd3OEjADVk3EoHgDI3UKRQ1bSvR7SFTKn6gbYRHzDQq4cOwUyjZZXsMukicKSmRKaqdwk/bj3kaQoRcMq1bjp2gmzGFgksY5tuphoTxPutCy9CIhaDdbPzEkB4ZxadBrExFSMfq34mMhVoPQs90hgx7et4bif95rRSDSzcTUZIiRHyyKEglxZiOEqG+UMBRDgxhXAlzK+U9phhHk9vMFl+PThvmTTDOfAyLpH5Sds7Lp7dnpcrVNKIcKZJDckwcckEq5IZUSY1w8kReyCt5s56td+vD+py0LlnTmQMyA+vrF5b/oTo=</latexit>

<latexit sha1_base64="/xZwqSHLnke7J4r8+stf9NH4Wio=">AAACJnicbVDLTsJAFJ3iC/FVdelmImhgQ1pI1I0JgYUu0cgjgYZMhylMmD4yMzUhDd/gb/gDbvUP3BnjzpXf4bR0IeBZnZxzb+65xw4YFdIwvrTM2vrG5lZ2O7ezu7d/oB8etYUfckxa2Gc+79pIEEY90pJUMtINOEGuzUjHnjRiv/NIuKC+9yCnAbFcNPKoQzGSShropUIDXsNqARbvb+ol6HOYCKYS6gzhCeyfw86YSlIa6HmjbCSAq8RMSR6kaA70n/7Qx6FLPIkZEqJnGoG0IsQlxYzMcv1QkECdQCPSU9RDLhFWlLw0g2dKGUJH5XF8T8JE/bsRIVeIqWurSRfJsVj2YvE/rxdK58qKqBeEknh4fsgJGZQ+jPuBQ8oJlmyqCMKcqqwQjxFHWKoWF64MRRxtllPFmMs1rJJ2pWxelKt3lXytnlaUBSfgFBSBCS5BDdyCJmgBDJ7AC3gFb9qz9q59aJ/z0YyW7hyDBWjfvyAaoLo=</latexit>

C = 3 (RGB) or C = 1 (Black & White)


The convolution operation takes as input an image and outputs another image
<latexit sha1_base64="V6E/f3jed4n00PAtpJVG5Tc0zwQ=">AAACbHicbVFNa9wwEJWdpE23X5umt1AYurSkUIydQtNjaC49ppBNArvLMpbHXbGyZKRxYFn2Z/bQP5Bj/0AvlR0fmqQDgsd7M3qjp7zWynOa/orire2dR493nwyePnv+4uVw79WFt42TNJZWW3eVoyetDI1Zsaar2hFWuabLfHna6pfX5Lyy5pxXNc0q/GFUqSRyoOZDOzVWmYIMw/mCQFpzbXXTamBrcl0XMC7JA3pQpm4Y0IAK11AABdiGAxdUY3lBrlcOVULJRwgMlITcOIIKa/8hmQ9HaZJ2BQ9B1oOR6OtsPryZFlY2VdhQavR+kqU1z9boWElNm8G08VSjXAbXSYAGK/KzdRfMBt4FpoDSunDCCzv234k1Vt6vqjx0VsgLf19ryf9pk4bLL7N1FwcZeWtUNhrYQpsyFMqRZL0KAKVTYVeQC3QoOfzFHZfCt6ttBiGY7H4MD8HFUZJ9Tj59PxqdfO0j2hUH4q04FJk4FifimzgTYyHFT/En2o52ot/x6/ggfnPbGkf9zL64U/H7v2mUu+M=</latexit>

<latexit sha1_base64="/xZwqSHLnke7J4r8+stf9NH4Wio=">AAACJnicbVDLTsJAFJ3iC/FVdelmImhgQ1pI1I0JgYUu0cgjgYZMhylMmD4yMzUhDd/gb/gDbvUP3BnjzpXf4bR0IeBZnZxzb+65xw4YFdIwvrTM2vrG5lZ2O7ezu7d/oB8etYUfckxa2Gc+79pIEEY90pJUMtINOEGuzUjHnjRiv/NIuKC+9yCnAbFcNPKoQzGSShropUIDXsNqARbvb+ol6HOYCKYS6gzhCeyfw86YSlIa6HmjbCSAq8RMSR6kaA70n/7Qx6FLPIkZEqJnGoG0IsQlxYzMcv1QkECdQCPSU9RDLhFWlLw0g2dKGUJH5XF8T8JE/bsRIVeIqWurSRfJsVj2YvE/rxdK58qKqBeEknh4fsgJGZQ+jPuBQ8oJlmyqCMKcqqwQjxFHWKoWF64MRRxtllPFmMs1rJJ2pWxelKt3lXytnlaUBSfgFBSBCS5BDdyCJmgBDJ7AC3gFb9qz9q59aJ/z0YyW7hyDBWjfvyAaoLo=</latexit>

C = 3 (RGB) or C = 1 (Black & White)


Question: What is a pixel? (i.e., the feature maps).
<latexit sha1_base64="Rm5y1/43bVVuQzRNi0BLhYIiTSI=">AAACF3icbZDLSsNAFIYnXmu9Rd3pZrAIrkpSQcWNRTcuW7AXaEOZTE7aoZMLMxOxhIKv4Qu41TdwJ25d+gI+h5M2C9v6w8DHf87hnPndmDOpLOvbWFpeWV1bL2wUN7e2d3bNvf2mjBJBoUEjHom2SyRwFkJDMcWhHQsggcuh5Q5vs3rrAYRkUXivRjE4AemHzGeUKG31zMN6AjLDK9waEIWZxATH7BH4dc8sWWVrIrwIdg4llKvWM3+6XkSTAEJFOZGyY1uxclIiFKMcxsVuIiEmdEj60NEYkgCkk07+MMYn2vGwHwn9QoUn7t+JlARSjgJXdwZEDeR8LTP/q3US5V86KQvjREFIp4v8hGMV4SwQ7DEBVPGRBkIF07diOiCCUKVjm9niyey0cVEHY8/HsAjNStk+L5/VK6XqTR5RAR2hY3SKbHSBqugO1VADUfSEXtArejOejXfjw/icti4Z+cwBmpHx9Qug7Z92</latexit>

<latexit sha1_base64="CMnLzqdN4dICz9MDT04MS12iBaE=">AAACMHicbVDLSgNBEJz1bXxFPXoZDIKCWXYV1KPoxWMEo4EkhN7ZThycnVlmZsWw5EP8DX/Aq/6BnsSLB7/CyeNgjA0NRVV3dVNRKrixQfDuTU3PzM7NLywWlpZXVteK6xvXRmWaYZUpoXQtAoOCS6xabgXWUo2QRAJvorvzvn5zj9pwJa9sN8VmAh3J25yBdVSreNgYeOQa415FKWfTobvod/x9msBDmYKMKTgD6GA5Hep7rWIp8INB0UkQjkCJjKrSKn41YsWyBKVlAoyph0Fqmzloy5nAXqGRGUyB3bkjdQclJGia+eCxHt1xTEzbSruWlg7Y3xs5JMZ0k8hNJmBvzV+tT/6n1TPbPmnmXKaZRcmGh9qZoFbRflI05hqZFV0HgGnufqXsFjQw6/IcuxKb/mu9ggsm/BvDJLg+8MMj//DyoHR6NopogWyRbbJLQnJMTskFqZAqYeSRPJMX8uo9eW/eh/c5HJ3yRjubZKy87x8S3al9</latexit>

Answer: Each pixel is a vector. Pooling (e.g., max- and average-pooling)


<latexit sha1_base64="xKAQD1l7m2XQXn/OBgdTrXd0EUc=">AAACHHicbVDLSgNBEJz1GeMr6tGDg0HwtOxGUPEUFcFjBPOAJITZ2d5kyOzsMjMbDUuO/oY/4FX/wJt4FfwBv8PJ42ASCxqKqm66u7yYM6Ud59taWFxaXlnNrGXXNza3tnM7uxUVJZJCmUY8kjWPKOBMQFkzzaEWSyChx6Hqda+HfrUHUrFI3Ot+DM2QtAULGCXaSK3cwaVQDyAv8A2hHRyzR+CYKUxwD6iOpN3K5R3bGQHPE3dC8miCUiv30/AjmoQgNOVEqbrrxLqZEqkZ5TDINhIFMaFd0oa6oYKEoJrp6JEBPjKKj4NImhIaj9S/EykJleqHnukMie6oWW8o/ufVEx2cN1Mm4kSDoONFQcKxjvAwFewzaf7lfUMIlczcimmHSEK1yW5qi6+Gpw2yJhh3NoZ5UinY7ql9clfIF68mEWXQPjpEx8hFZ6iIblEJlRFFT+gFvaI369l6tz6sz3HrgjWZ2UNTsL5+ARB1oT8=</latexit>

0 0
<latexit sha1_base64="cgfVuOY4+sL/j9cwKAEJ+4tPHy0=">AAACx3icdVHLjtMwFHXCayivAks2V1SoRROqBBCwQRoxG9gNiHYGmhIcx23dcezIdoZWlhf8Hn/AD/AdOGkXM1O4kqWjc699zj3OK860iePfQXjl6rXrN/Zudm7dvnP3Xvf+g7GWtSJ0RCSX6iTHmnIm6Mgww+lJpSguc06P89PDpn98RpVmUnw264pOSzwXbMYINp7Kur/SWhRU5QoTar8OWLR86jKbMgFpic0iz+0n980e9p2Dtw21yizrQ9u3z5IojpLUuQ2/3OXPvf1loIHts36kYbm/7P9PJYJ0JhXmHFovkCo2XxislPwBqaErY4UEXBSssY85VFjhkhq/ocu6vXgYtwW7INmCHtrWUdb9kxaS1CUVhnCs9SSJKzO1WBlGOHWdtNa0wuQUz+nEQ+GF9NS2mTt44pkCvFd/hIGWPX/D4lLrdZn7yWZFfbnXkP/qTWozezO1TFS1oYJshGY1ByOh+UAomKLE8LUHmCifAwGy8CmQJoQLKoVurLmODya5HMMuGD8fJq+GLz6+7B2820a0hx6hx2iAEvQaHaD36AiNEAniYBxkwffwQyjDs3C1GQ2D7Z2H6EKFP/8CfuTb6w==</latexit>

C ), 8(i, j) ! no additional parameters


X(i, j) 2 R , i = 1, . . . , H, j = 1, . . . , W
<latexit sha1_base64="pN90m891Q9LrtyPolhJVpG9dIFc=">AAACOXicbVDLSsNAFJ34rPVVdelmsAgVSklU1IVCsZsuq9gHNLFMJtN27GQSZiZCCf0Zf8MfcKs7l25E3PoDTtosbOuBC4dz7uXee9yQUalM891YWFxaXlnNrGXXNza3tnM7uw0ZRAKTOg5YIFoukoRRTuqKKkZaoSDIdxlpuoNK4jcfiZA04HdqGBLHRz1OuxQjpaVO7rJVoEX4cARtyqHtI9V33fh2dF8pQgqvoFWENvMCJYuwqtumlWYnlzdL5hhwnlgpyYMUtU7u0/YCHPmEK8yQlG3LDJUTI6EoZmSUtSNJQoQHqEfamnLkE+nE4y9H8FArHuwGQhdXcKz+nYiRL+XQd3Vn8oec9RLxP68dqe6FE1MeRopwPFnUjRhUAUwigx4VBCs21ARhQfWtEPeRQFjpYKe2eDI5bZTVwVizMcyTxnHJOiud3Jzmy9dpRBmwDw5AAVjgHJRBFdRAHWDwBF7AK3gzno0P48v4nrQuGOnMHpiC8fMLYoCpEg==</latexit>

Z(i, j) = max max Y (si + i , sj + j


8
| {z } i 2{ 1,0,1} j 2{ 1,0,1} |
0 0 {z }
<latexit sha1_base64="yz21km3On3TQRRRiig/o0Jm19Bs=">AAACSHicbVBNTxsxFPSmH9D0K9BjL08NlVJpFe0GQrlUQnCpxAWkJkTKppHX6yQuXntlvy2NVvuj+Bv8AcStvfTMDXHDCTmUpCNZGs28p3meOJPCYhBce5UnT589X1t/UX356vWbt7WNza7VuWG8w7TUphdTy6VQvIMCJe9lhtM0lvw0Pjuc+ac/ubFCq284zfggpWMlRoJRdNKwdtRrCB9++MA+QSQUREXgQ+hDJBON1odWux2VEBkxniA1Rp9DhPwXFlut9i58gdb3vS1QeRq7iHJYqwfNYA5YJeGC1MkCx8Pa3yjRLE+5Qiaptf0wyHBQUIOCSV5Wo9zyjLIzOuZ9RxVNuR0U80+X8NEpCYy0cU8hzNV/NwqaWjtNYzeZUpzYZW8m/s/r5zjaGxRCZTlyxR6CRrkE1DBrEBJhOEM5dYQyI9ytwCbUUIauhEcpiZ2dVlZdMeFyDauk22qGu83tk536/sGionXynnwgDRKSz2SffCXHpEMYuSBX5Df54116N96td/cwWvEWO+/II1Qq99FormA=</latexit>

X(i, j, c) 2 {0, 1, . . . , 255} ! 256 = 2 numbers 2RC 0 2RC 0


0 0
<latexit sha1_base64="cgfVuOY4+sL/j9cwKAEJ+4tPHy0=">AAACx3icdVHLjtMwFHXCayivAks2V1SoRROqBBCwQRoxG9gNiHYGmhIcx23dcezIdoZWlhf8Hn/AD/AdOGkXM1O4kqWjc699zj3OK860iePfQXjl6rXrN/Zudm7dvnP3Xvf+g7GWtSJ0RCSX6iTHmnIm6Mgww+lJpSguc06P89PDpn98RpVmUnw264pOSzwXbMYINp7Kur/SWhRU5QoTar8OWLR86jKbMgFpic0iz+0n980e9p2Dtw21yizrQ9u3z5IojpLUuQ2/3OXPvf1loIHts36kYbm/7P9PJYJ0JhXmHFovkCo2XxislPwBqaErY4UEXBSssY85VFjhkhq/ocu6vXgYtwW7INmCHtrWUdb9kxaS1CUVhnCs9SSJKzO1WBlGOHWdtNa0wuQUz+nEQ+GF9NS2mTt44pkCvFd/hIGWPX/D4lLrdZn7yWZFfbnXkP/qTWozezO1TFS1oYJshGY1ByOh+UAomKLE8LUHmCifAwGy8CmQJoQLKoVurLmODya5HMMuGD8fJq+GLz6+7B2820a0hx6hx2iAEvQaHaD36AiNEAniYBxkwffwQyjDs3C1GQ2D7Z2H6EKFP/8CfuTb6w==</latexit>

Z(i, j) != normalize
max max Y (si + i , sj + j ), 8(i, j) ! no additional parameters
<latexit sha1_base64="I+cIXyOtu5k4rrMXjER44243/s8=">AAACKnicbVDLSgNBEJz1bXxFPXoZDIIX467viyB68ahgNJCEMDvpJENmZ5aZXjUu+xX+hj/gVf/Am3gVv8NJ3IOvgoaiqpvurjCWwqLvv3ojo2PjE5NT04WZ2bn5heLi0qXVieFQ4VpqUw2ZBSkUVFCghGpsgEWhhKuwdzLwr67BWKHVBfZjaESso0RbcIZOahY3qvSQ1hFuMa1ubu3uZrRuRKeLzBh9kxtKm4hJcQdZs1jyy/4Q9C8JclIiOc6axY96S/MkAoVcMmtrgR9jI2UGBZeQFeqJhZjxHutAzVHFIrCNdPhWRtec0qJtbVwppEP1+0TKImv7Ueg6I4Zd+9sbiP95tQTbB41UqDhBUPxrUTuRFDUdZERbwgBH2XeEcSPcrZR3mWEcXZI/trTs4LSs4IIJfsfwl1xulYO98vb5TunoOI9oiqyQVbJOArJPjsgpOSMVwsk9eSRP5Nl78F68V+/tq3XEy2eWyQ94759oYqfS</latexit>

X = X/255
| {z } i 2{ 1,0,1} j 2{ 1,0,1} |
0 0 {z }
Training CData
<latexit sha1_base64="xLEohGksaBrelpOKInVKXS74iwI=">AAACFXicbVDLSsNAFJ3UV62vqBvBzWARXJWkgros6sJlhbYW2lBuJtN26GQSZiZCCfU3/AG3+gfuxK1rf8DvcJpmYVsPDBzOuXfO5fgxZ0o7zrdVWFldW98obpa2tnd29+z9g5aKEklok0Q8km0fFOVM0KZmmtN2LCmEPqcP/uhm6j88UqlYJBp6HFMvhIFgfUZAG6lnH3WzP1JJg0lDAhNMDPAtaOjZZafiZMDLxM1JGeWo9+yfbhCRJKRCEw5KdVwn1l4KUjPC6aTUTRSNgYxgQDuGCgip8tIsfYJPjRLgfiTNExpn6t+NFEKlxqFvJkPQQ7XoTcX/vE6i+1deykScaCrILKifcKwjPK0DB0xSovnYECCSmVsxGYIEok1pcymBmp42KZli3MUalkmrWnEvKuf31XLtOq+oiI7RCTpDLrpENXSH6qiJCHpCL+gVvVnP1rv1YX3ORgtWvnOI5mB9/QK2zp+i</latexit>

Vector Representation
<latexit sha1_base64="09GsjJo767DJJWgiB0A4NjFj2x4=">AAACHXicbVC7TsMwFHV4lvIqMLJYVEhMVVIkYKxgYSyIPqQ2qhznprXqxJHtIFVRV36DH2CFP2BDrIgf4Dtw0gy05U5H59zHuceLOVPatr+tldW19Y3N0lZ5e2d3b79ycNhWIpEUWlRwIbseUcBZBC3NNIduLIGEHoeON77J9M4jSMVE9KAnMbghGUYsYJRoQw0quJ/vSCX40zZQLSS+B7NBQaSLlqpds/PCy8ApQBUV1RxUfvq+oEloFlBOlOo5dqzdlEjNKIdpuZ8oiAkdkyH0DIxICMpNcxdTfGoYHwfGRSAijXP270RKQqUmoWc6Q6JHalHLyP+0XqKDKzdlUZxoiOjsUJBwrAXOYsE+k+Z7PjGAUMmMV0xHRBKqTXhzV3yVWZuWTTDOYgzLoF2vORe187t6tXFdRFRCx+gEnSEHXaIGukVN1EIUPaEX9IrerGfr3fqwPmetK1Yxc4Tmyvr6BegYo4I=</latexit>

0 C 0
2R 2R
{(Xn , yn ) : n = 1, . . . , N }
<latexit sha1_base64="ruODafN6+oBzwshTFRonCg3dlL8=">AAACHXicbVDLSgMxFM3UV62vUZdugkWoUMqMioogFN24kgq2FjplyGTSNjSTDElGKEO3/oY/4Fb/wJ24FX/A7zDTdmFbD1w4nHMv994TxIwq7TjfVm5hcWl5Jb9aWFvf2Nyyt3caSiQSkzoWTMhmgBRhlJO6ppqRZiwJigJGHoL+deY/PBKpqOD3ehCTdoS6nHYoRtpIvg29tNT0eRkOfH54ATm8hG4ZeiwUWpXhrTf07aJTcUaA88SdkCKYoObbP14ocBIRrjFDSrVcJ9btFElNMSPDgpcoEiPcR13SMpSjiKh2OvpkCA+MEsKOkKa4hiP170SKIqUGUWA6I6R7atbLxP+8VqI75+2U8jjRhOPxok7CoBYwiwWGVBKs2cAQhCU1t0LcQxJhbcKb2hKq7LRhwQTjzsYwTxpHFfe0cnx3UqxeTSLKgz2wD0rABWegCm5ADdQBBk/gBbyCN+vZerc+rM9xa86azOyCKVhfvyBooAY=</latexit>

x ! global average pooling


<latexit sha1_base64="E8hvzd73MN8c8JI3fTZGySHa0Aw=">AAACKHicbVDLSsNAFJ34tr6qLt0MFkEQSqKiLotuXCrYB7Sl3Eyn6eAkE2ZutCXkI/wNf8Ct/oE7cevC73BSs9DqgYHDuY9z5/ixFAZd992ZmZ2bX1hcWi6trK6tb5Q3txpGJZrxOlNS6ZYPhksR8ToKlLwVaw6hL3nTv73I6807ro1Q0Q2OY94NIYjEQDBAK/XKByPa0SIYImit7mkH+QjTQCofJAU7CAGnsVJ2fZD1yhW36k5A/xKvIBVS4KpX/uz0FUtCHiGTYEzbc2PspqBRMMmzUicxPAZ2a03alkYQctNNJ5/K6J5V+nSgtH0R0on6cyKF0Jhx6NvOEHBopmu5+F+tneDgrJuKKE6QR+zbaJBIiormCdG+0JyhHFsCTAt7K2VD0MDQ5vjLpW/y07KSDcabjuEvaRxWvZPq0fVxpXZeRLREdsgu2SceOSU1ckmuSJ0w8kCeyDN5cR6dV+fNef9unXGKmW3yC87HF1Ihp9g=</latexit>

Xn 2 RH⇥W ⇥C ! input image


<latexit sha1_base64="tt9uuCoJbVVGtpzTWLMWRSvQ+Hk=">AAACRHicbVDLSsNAFJ34rPVVdelmsAiuSqKiLkUXdlnFPqCJYTKd2sHJJMzcqCXkk/wNf8CVoFtX7sStOGkr2OqFC4dzn+cEseAabPvZmpqemZ2bLywUF5eWV1ZLa+sNHSWKsjqNRKRaAdFMcMnqwEGwVqwYCQPBmsHNaV5v3jKleSQvoR8zLyTXknc5JWAov3TW8iV2ucmQQC8I0ovsKq1iF3jING7+gNMMu4pf94AoFd0Zlt1DymWcAOZmI8v8Utmu2IPAf4EzAmU0ippfenM7EU1CJoEKonXbsWPwUqKAU8GyoptoFhN6Y5a3DZTEvOGlA8EZ3jZMB3cjZVICHrC/J1ISat0PA9OZy9KTtZz8r9ZOoHvkDYUxSYeHuonAEOHcPdzhilEQfQMIVdz8immPKELBeDx2paPz17KiMcaZtOEvaOxWnIPK3vl++fhkZFEBbaIttIMcdIiOURXVUB1R9ICe0At6tR6td+vD+hy2TlmjmQ00FtbXNxO9sqc=</latexit>

<latexit sha1_base64="jYC4lTAk2nRwWysBZNDP4ACPKj8=">AAACJHicbVC7SgNBFJ2NrxhfUUubwSBqYdiNoJZBG4sEIphESEKYnb3RIbOzy8xdISz5BH/DH7DVP7ATCxtLv8PJo1DjgQuHc1+H48dSGHTdDyczN7+wuJRdzq2srq1v5De3GiZKNIc6j2Skb3xmQAoFdRQo4SbWwEJfQtPvX4z6zXvQRkTqGgcxdEJ2q0RPcIZW6ub32+MbqYZgWE0kiqMKG4CmNbD3Y9SRogfVSu2wmy+4RXcMOku8KSmQKWrd/Fc7iHgSgkIumTEtz42xkzKNgksY5tqJgZjxPruFlqWKhWA66djMkO5ZJaC9SNtSSMfqz42UhcYMQt9OhgzvzN/eSPyv10qwd9ZJhYoTBMUnj3qJpBjRUTo0EBo4yoEljGthvVJ+xzTjaDP89SUwI2vDnA3G+xvDLGmUit5J8fiqVCifTyPKkh2ySw6IR05JmVySGqkTTh7IE3kmL86j8+q8Oe+T0Ywz3dkmv+B8fgOLraUz</latexit>

Multi-Layer Perceptron (MLP)


yn 2 {1, 2, . . . , K} ! labels
<latexit sha1_base64="BkEof47cdeGfWFHLt6Q43V0b+Ow=">AAACMnicbVDLSgNBEJz1bXxFPXoZDIKHEHZ9H0UvghcFo0I2hNnZTjI4O7PM9Kph2S/xN/wBr/oFehPBkx/hJObgq6ChqOqmuytKpbDo+8/eyOjY+MTk1HRpZnZufqG8uHRudWY41LmW2lxGzIIUCuooUMJlaoAlkYSL6Oqw719cg7FCqzPspdBMWEeJtuAMndQqb/daiobCVR5U6UaVhjLWaKv0OCxoaESni8wYfUNDhFvMJYtA2qJVrvg1fwD6lwRDUiFDnLTK72GseZaAQi6ZtY3AT7GZM4OCSyhKYWYhZfyKdaDhqGIJ2GY+eK+ga06JaVsbVwrpQP0+kbPE2l4Suc6EYdf+9vrif14jw/ZeMxcqzRAU/1rUziRFTftZ0VgY4Ch7jjBuhLuV8i4zjKNL9MeW2PZPK0oumOB3DH/J+UYt2Kltnm5V9g+GEU2RFbJK1klAdsk+OSInpE44uSMP5JE8effei/fqvX21jnjDmWXyA97HJ0z3qjA=</latexit>

<latexit sha1_base64="NwDOKo3eoxXVYPCh0r+oRlFOBgQ=">AAACUXicbZBNixNBEIYr49ea9SPq0UtjEBKUMLOKirCw6MWDh1U2yUImDDU9NZtme3qG7hpJGOaP+Tc8efMk6D/wZufj4O5a0PDyvlVU9ZNWWjkOw++d4Nr1Gzdv7d3u7t+5e+9+78HDiStrK2ksS13a0xQdaWVozIo1nVaWsEg1TdPz9+t8+oWsU6U54VVF8wLPjMqVRPZW0jupkiZGXS3wuYhTYmwHy6E4FDHTkhtX5lzgsh1MRVybjGxqUVKzDT/Tx3E7mIileCZw2CbN4dtF63U6THr9cBRuSlwV0U70YVfHSe9nnJWyLsiw1OjcLAornjdoWUlNbTeuHVUoz/GMZl4aLMjNm83vW/HUO5nIS+ufYbFx/51osHBuVaS+s0BeuMvZ2vxfNqs5fzNvlKlqJiO3i/JaCy7FGqXIlCXJeuUFSqv8rUIu0BNiD/zClsytT2u7Hkx0GcNVMTkYRa9GLz697B+92yHag8fwBAYQwWs4gg9wDGOQ8BV+wC/43fnW+RNAEGxbg85u5hFcqGD/L6k5sv8=</latexit>

p↵, (x) = softmax(W ReLU(V x + a) +b) Play Dough


<latexit sha1_base64="AnKNLLbsgt7+GZwvcuchneRdeJQ=">AAACBXicbVDLSsNAFL3xWeur6tLNYBFclaSCuizqwmUF+8A2lMlk0g6dTMLMRAiha3/Arf6BO3Hrd/gDfoeTNgvbemDgcM693DPHizlT2ra/rZXVtfWNzdJWeXtnd2+/cnDYVlEiCW2RiEey62FFORO0pZnmtBtLikOP0443vsn9zhOVikXiQacxdUM8FCxgBGsjPTY5TtFtlAxHg0rVrtlToGXiFKQKBZqDyk/fj0gSUqEJx0r1HDvWboalZoTTSbmfKBpjMsZD2jNU4JAqN5smnqBTo/goiKR5QqOp+ncjw6FSaeiZyRDrkVr0cvE/r5fo4MrNmIgTTQWZHQoSjnSE8u8jn0lKNE8NwUQykxWREZaYaFPS3BVf5dEmZVOMs1jDMmnXa85F7fy+Xm1cFxWV4BhO4AwcuIQG3EETWkBAwAu8wpv1bL1bH9bnbHTFKnaOYA7W1y+DoZkO</latexit>

| {z }
Multi-column Deep Neural Networks
for Image Classi cation YouTube Video

This is the first


time human-
competitive results
are reported on
widely used computer
vision benchmarks. On
the MNIST handwriting
benchmark, this
method is the first
to achieve near-human
performance. On a
traffic sign
recognition benchmark
it outperforms humans
by a factor of two.

prediction of column j

! Latin characters

! Chinese characters

Ciregan, Dan, Ueli Meier, and Jürgen Schmidhuber. "Multi-column deep neural networks for image classi cation." 2012 IEEE conference on computer vision and
pattern recognition. IEEE, 2012.
fi
fi
ImageNet Classi cation with Deep
Convolutional Neural Networks YouTube Playlist

“To learn about thousands of objects from millions of images, we need a model with a large
learning capacity. However, the immense complexity of the object recognition task means that
this problem cannot be speci ed even by a dataset as large as ImageNet, so our model should
also have lots of prior knowledge to compensate for all the data we don’t have.”
“CNNs make strong and mostly correct assumptions about the nature of images (namely,
stationarity of statistics and locality of pixel dependencies).”

X
Architecture: I 2 R224⇥224⇥3 7! p 2 {p 2 R1000 : pi = 1, pi 0 8i}
i
1x1 Conv: X 2 RH⇥W ⇥C 7! Y 2 RH⇥W ⇥F N
X
Y (x, y) = W X(x, y), W 2 RF ⇥C , X(x, y) 2 RC , Y (x, y) 2 RF Loss: L(✓) = log(pin (In ))
n=1
X X
3x3 Conv: Y (x, y) = W (i, j)X(s(x 2) + i, s(y 2) + j), W 2 R3⇥3⇥F ⇥C
i=1,2,3 j=1,2,3

Krizhevsky, Alex, Ilya Sutskever, and Geo rey E. Hinton. "Imagenet classi cation with deep convolutional neural
networks." Advances in neural information processing systems. 2012.
fi
fi
ff
fi
Dropout: A Simple Way to Prevent
Neural Networks from Over tting YouTube Playlist

1. Prevent overfitting (regularization technique)


2. Approximately combine exponentially many
di↵erent neural network architectures efficiently

Dropout with linear regression is


equivalent to ridge regression.

Each hidden unit in a neural network trained


with dropout must learn to work with a
randomly chosen sample of other units
L ! number of hidden layers

Hinton, Geo rey E., et al. "Improving neural networks by preventing co-adaptation of feature detectors." arXiv preprint arXiv:1207.0580 (2012).
Srivastava, Nitish, et al. "Dropout: a simple way to prevent neural networks from over tting." The journal of machine learning research 15.1 (2014): 1929-1958.
ff
fi
fi
Network In Network
YouTube Playlist

1) MLP Convolution Layers (1x1 Convolutions) Input patch centered at location (i, j)

ReLU activation function

k` is used to index the channels of the feature map

2) Global Average Pooling

Lin, Min, Qiang Chen, and Shuicheng Yan. "Network in network." arXiv preprint arXiv:1312.4400 (2013).
Very Deep Convolutional Networks for
Large-Scale Image Recognition YouTube Playlist

Local Response Normalization (LRN):


Create competition for big activities amongst neuron outputs computed using di↵erent kernels.

activity of a neuron computed by applying kernel i


response-normalized activity
at position (x, y) and then applying the ReLU nonlinearity
4
k = 2, n = 5, ↵ = 10 , = 0.75

Data Augmentation:
1) Image translations (random crops) and horizontal re ection

2) Altering the intensities of the RGB channels in training images (object identity is invariant
to changes in the intensity and color of the illumination )

random variable drawn from a Gaussian with mean zero and standard deviation 0.1

R G B T T
Ixy = [Ixy , Ixy , Ixy ] + [p1 , p2 , p3 ][↵1 1 , ↵2 2 , ↵3 3 ]

eigenvector and eigenvalue of the 3 × 3 covariance matrix of RGB pixel values

3) Scale jittering
Each training image is individually rescaled by randomly sampling S from a certain range [Smin , Smax ]
S is the smallest side of an isotropically-rescaled training image.
Simonyan, Karen, and Andrew Zisserman. "Very deep convolutional networks for large-scale image recognition." arXiv preprint arXiv:1409.1556 (2014).
fl
Going Deeper with Convolutions
YouTube Playlist

Szegedy, Christian, et al. "Going deeper with convolutions." Proceedings of the IEEE conference on computer vision and pattern recognition. 2015.
Batch Normalization: Accelerating Deep Network
Training by Reducing Internal Covariate Shift YouTube Playlist

statistics of layer’s inputs change during training Benefits of batch normalization:


z = g(W u + b) 1. higher learning rate
replace 2. less sensitive to initialization
z = g(BN(W u )) 3. less sensitive to activation functions
|{z} 4. regularization e↵ects
mean
Training x z }| { 5. preserve gradient magnitudes (maybe)
x µ(x1 , . . . , xm ) Evolution of input distributions to
BN(x; , , x1 , x2 , . . . , xm ) = p + a typical sigmoid, over the course
| {z } 2 (x , . . . , x ) + ✏
1 m of training, shown as {15, 50, 85}th
mini-batch | {z } percentiles.
variance
x 2 {x1 , x2 , . . . , xm }
d Remove Dropout
xi , , , µ, 2R
Increase learning rate
All operations are element-wise (dimension-wise) Accelerate the learning rate decay
The gradients flow through µ and 2 Reduce the L2 weight regularization
Inference Remove Local Response Normalization
x µ Shu✏e training examples more thoroughly
BN(x; , ) = p +
2+✏

µ = Ex1 ,...,xm [µ(x1 , . . . , xm )]


2 m
= Ex1 ,...,xm [ 2 (x1 , . . . , xm )]
m 1
Convolution
X 2 Rp⇥q⇥d
m0 = mpq ! e↵ective mini-batch size
Batch-norm is done per feature map
Io e, Sergey, and Christian Szegedy. "Batch normalization: Accelerating deep network training by reducing internal covariate shift." arXiv preprint arXiv:1502.03167
(2015).
ff
Delving Deep into Recti ers: Surpassing Human-Level
Performance on ImageNet Classi cation YouTube Playlist

Parametric Rectifiers k ! spatial filter size of the layer Backward Propagation 1


n = k 2 c ! number of connections of a response (1 + a2 )n` Var[w` ] = 1
x` = Ŵ` y` 2
d ! number of filters @E 1
(1 + a2 )n̂` Var[w` ] = 1
W` 2 Rd⇥n ! weights of d filters x` = 2 Rc 2
@x`
b` 2 Rd ! bias gradient at a pixel of this layer
y` ! response at a pixel of the output map @E n̂ n̂ = k 2 d
x` = f (y` 1 ) y` = 2R
@y`
f (yi ) = max(0, yi ) + ai min(0, yi ) ! Parametric ReLU c ` = d` 1
k ⇥ k pixels in d channels
ai = 0 ! ReLU = n` Var[w` ] E[x2` ]
Var[y` ] = n` Var[w` x` ] |{z}
ai = 0.01 ! Leaky ReLU E[w` ]=0
| {z } | {z } Ŵ` 2 Rc⇥n̂ ! reshaped from W`
=E[w`2 ] 6=Var[x` ]
i ! indicates di↵erent channels y` = f 0 (y` ) x`+1
y` ! random variable of each element in y`
@E 1
ai µ ai + ✏ momentum method w` ! random variable of each element in W` E[ y` ] = E[ x`+1 ] = 0
@ai 2
x` ! random variable of each element in x`
for ReLU f 0 (y` ) = 0 or 1 with equal prob.
2 1
E[x` ] = Var[y` 1 ] for ReLU 2 1
2 E[ y` ] = Var[ y` ] = Var[ x`+1 ]
1 2
Var[y` ] = n` Var[y` 1 ]Var[w` ]
2 ! Var[ x` ] = n̂` Var[w` ]Var[ y` ]
L
Y1
Initialization of Filter Weights for Rectifiers 1
Var[yL ] = Var[y1 ] n` Var[w` ] = n̂` Var[w` ]Var[ x`+1 ]
Central idea: investigate the variance `=2
2 2 0 1
of the responses in each layer r BY L
C
1 2 1
y ` = W ` x ` + b` n` Var[w` ] =) w` ⇠ N (0, ) Var[ x2 ] = Var[ xL+1 ] B
@ n̂ ` Var[w ` ] C
A
|2 {z n` 2
} `=2 | {z }
x` 2 Rn ! co-located k ⇥ k pixels in c input channels =1 if = 1, 8`
He, Kaiming, et al. "Delving deep into recti ers: Surpassing human-level performance on imagenet classi cation." Proceedings of the IEEE international conference
on computer vision. 2015.
fi
fi
fi
fi
Rethinking the Inception Architecture
for Computer Vision YouTube Playlist

Mini-network replacing

Inception module that


reduces the grid-size while
expanding the lter banks
the 5 ⇥ 5 convolutions

{
Replace q(k|x) with q 0 (k|x) = (1 ✏)q(k|x) + ✏u(k)
label-smoothing regularization
n=7 x ! training example
H(q 0 , p) = (1 ✏)H(q, p) + ✏H(u, p)
k 2 {1, . . . , K} ! label
K H(u, p) = KL(u||p) + H(u)
X
p(k|x) = exp(zk )/ exp(zi ) u = 1/K
i=1
zi ! logits (un-normalized log-probabilities)
q(k|x) ! ground truth distribution over labels
k ⇤ ! ground truth label

0 if k 6= k ⇤ ;
q(k|x) = k⇤ (k) = ⇤
K 1 if k = k .
X
cross-entropy loss L= q(k|x) log(p(k|x)) = log(p(k ⇤ |x)) ! negative log-likelihood
H(q, p) k=1
Szegedy, Christian, et al. "Rethinking the inception architecture for computer vision." Proceedings of the IEEE conference on computer vision and pattern recognition.
2016.
fi
Training Very Deep Networks
YouTube Video

Highway Networks
<latexit sha1_base64="bltVsaAtI6J+DMBb5J6xxKAs7u8=">AAACBHicbVC7TsMwFHV4lvIKMHaxqJCYqqQDMFawdEJFog+pjSrHuWmtOg/ZDlUUdWDhV1gYQIiVj2Djb3DTDNBypCsdnXOvfe9xY86ksqxvY219Y3Nru7RT3t3bPzg0j447MkoEhTaNeCR6LpHAWQhtxRSHXiyABC6Hrju5mfvdBxCSReG9SmNwAjIKmc8oUVoampVB/kYmwJs12Wg8JSm+BTWNxEQOzapVs3LgVWIXpIoKtIbm18CLaBJAqCgnUvZtK1ZORoRilMOsPEgkxIROyAj6moYkAOlk+QIzfKYVD/uR0BUqnKu/JzISSJkGru4MiBrLZW8u/uf1E+VfORkL40RBSBcf+QnHKsLzRLDHBFDFU00IFUzviumYCEKVzq2sQ7CXT14lnXrNvqjV7+rVxnURRwlV0Ck6Rza6RA3URC3URhQ9omf0it6MJ+PFeDc+Fq1rRjFzgv7A+PwBoyGYug==</latexit>

<latexit sha1_base64="BQDeInhVwFPHb9E1l+LnZNXiusA=">AAAB/XicbVDJSgNBEO2JW4xbXG5eGhMhggwzQdRj0IvHCNkgCaGnU0ma9Cx014hxCP6KFw+KePU/vPk3dpaDRh8UPN6roqqeF0mh0XG+rNTS8srqWno9s7G5tb2T3d2r6TBWHKo8lKFqeEyDFAFUUaCERqSA+Z6Euje8nvj1O1BahEEFRxG0fdYPRE9whkbqZA9aCPeYFIQN9inNe51K/mTcyeYc25mC/iXunOTIHOVO9rPVDXnsQ4BcMq2brhNhO2EKBZcwzrRiDRHjQ9aHpqEB80G3k+n1Y3pslC7thcpUgHSq/pxImK/1yPdMp89woBe9ifif14yxd9lORBDFCAGfLerFkmJIJ1HQrlDAUY4MYVwJcyvlA6YYRxNYxoTgLr78l9SKtntuF2/PcqWreRxpckiOSIG45IKUyA0pkyrh5IE8kRfyaj1az9ab9T5rTVnzmX3yC9bHNx+Mk7c=</latexit>

<latexit sha1_base64="h6NO6gJ/XLhZv/qRNdyEBR0psCs=">AAAB+3icbVDJSgNBEO2JW4xbjEcvjYkQQYaZIOox6MVjhGyQhNDTqSRNeha6ayRhyK948aCIV3/Em39jZzlo4oOCx3tVVNXzIik0Os63ldrY3NreSe9m9vYPDo+yx7m6DmPFocZDGaqmxzRIEUANBUpoRgqY70loeKP7md94AqVFGFRxEkHHZ4NA9AVnaKRuNtdGGGNSFDbYl7RQLVxMu9m8Yztz0HXiLkmeLFHpZr/avZDHPgTIJdO65ToRdhKmUHAJ00w71hAxPmIDaBkaMB90J5nfPqXnRunRfqhMBUjn6u+JhPlaT3zPdPoMh3rVm4n/ea0Y+7edRARRjBDwxaJ+LCmGdBYE7QkFHOXEEMaVMLdSPmSKcTRxZUwI7urL66Rest1ru/R4lS/fLeNIk1NyRorEJTekTB5IhdQIJ2PyTF7JmzW1Xqx362PRmrKWMyfkD6zPH6R8kuI=</latexit>

(i.e., bT ) (i.e., T )
y = H(x; WH ) ! a single layer
<latexit sha1_base64="1tfWuKE2xJ/qwIha+zMYBHgADKM=">AAACGHicbVDLSgNBEJz1bXxFPXoZDIJe4q6ICiKIXnJUMEZIQuiddJLB2dllplddlnyGF3/FiwdFvHrzb5zEHHwVNBRV3XR3hYmSlnz/wxsbn5icmp6ZLczNLywuFZdXLm2cGoFVEavYXIVgUUmNVZKk8CoxCFGosBZenw782g0aK2N9QVmCzQi6WnakAHJSq7id8SNe2bw75LVWZYs3jOz2CIyJb3mD8I5y4FbqrkKuIEPTbxVLftkfgv8lwYiU2AhnreJ7ox2LNEJNQoG19cBPqJmDISkU9guN1GIC4hq6WHdUQ4S2mQ8f6/MNp7R5JzauNPGh+n0ih8jaLApdZwTUs7+9gfifV0+pc9DMpU5SQi2+FnVSxSnmg5R4WxoUpDJHQBjpbuWiBwYEuSwLLoTg98t/yeVOOdgr75zvlo5PRnHMsDW2zjZZwPbZMauwM1Zlgt2zR/bMXrwH78l79d6+Wse80cwq+wHv/RPmlJ8m</latexit>

H ! non-linear transformation parametrized by WH


<latexit sha1_base64="W2Y73U18CYrMhBz+6zjkInDr44I=">AAACLXicbVDLSgMxFM34rPVVdekmWAU3lhkRdSnqossK9gFtKZn0tg1mkiG5o9ahP+TGXxHBRUXc+humj4WvC4HDOefm3nvCWAqLvj/0Zmbn5hcWM0vZ5ZXVtfXcxmbF6sRwKHMttamFzIIUCsooUEItNsCiUEI1vLkY6dVbMFZodY39GJoR6yrREZyho1q5yyJtGNHtITNG39EGwj2mSquD0YfMUDRM2Y420dhPY2ZYBGjEA7Rp2Ke71VZxd9DK5f2CPy76FwRTkCfTKrVyL4225kkECrlk1tYDP8ZmygwKLmGQbSQWYsZvWBfqDio30zbT8bUDuueYNnU7uaeQjtnvHSmLrO1HoXO6rXv2tzYi/9PqCXZOm6lQcYKg+GRQJ5EUNR1FR9vCAEfZd4BxI9yulPdcIBxdwFkXQvD75L+gclgIjguHV0f5s/NpHBmyTXbIPgnICTkjRVIiZcLJI3kmQ/LmPXmv3rv3MbHOeNOeLfKjvM8vDjKpMw==</latexit>

<latexit sha1_base64="gngleh6CltvwxD8+C5iCkautMc8=">AAACeXicbVFNb9NAEF2brxK+AhzhsJAGhQosO0htjxVcciwSaSolUTTejJNV17vW7rglWPkP/Lbe+CNcuLBOfCgpI6309Gbe7MybtFDSURz/CsI7d+/df7D3sPXo8ZOnz9rPX5w5U1qBQ2GUsecpOFRS45AkKTwvLEKeKhylF1/q/OgSrZNGf6NVgdMcFlpmUgB5atb+2R3wiZWLJYG15opPCL9TpY3+WHcEy8mCdpmx+UbAC7CQI1n5A+c8XfH90Wywv271MFpEHzhkmVftajKjlLna1gO/0RsEycumptSiBu9n7U4cxZvgt0HSgA5r4nTWvp7MjShz1CQUODdO4oKmFViSQuG6NSkdFiAuYIFjD7Uf302rjXNr3vXM3A9o/dPEN+xNRQW5c6s89ZV+maXbzdXk/3LjkrLjaSV1URJqsf0oKxUnw+sz8Lm0KEitPABhpZ+Vi6X3VpA/VsubkOyufBuc9aPkMOp/7XdOPjd27LFX7C3rsYQdsRM2YKdsyAT7HbwOusG74E/4JuyFB9vSMGg0L9k/EX76Cz7RwIc=</latexit>

(e.g., affine transformation followed by a non-linear activation function)


y = H(x; WH ) · T (x; WT ) + x · C(x; WC )
<latexit sha1_base64="Tx/LMpguSrrTBAEoM/dSSLAdmuI=">AAACGXicbZDLSgMxFIYz9VbrbdSlm2ARWoQyU0QFEYrddFmhN2iHIZPJtKGZC0lGWoa+hhtfxY0LRVzqyrcx7cxCW38I/PnOOSTndyJGhTSMby23tr6xuZXfLuzs7u0f6IdHHRHGHJM2DlnIew4ShNGAtCWVjPQiTpDvMNJ1xvV5vftAuKBh0JLTiFg+GgbUoxhJhWzdmMJb2ChNbmDXbpQH2A0lbKXXVhmewwlMWT1l9bKtF42KsRBcNWZmiiBT09Y/B26IY58EEjMkRN80ImkliEuKGZkVBrEgEcJjNCR9ZQPkE2Eli81m8EwRF3ohVyeQcEF/TyTIF2LqO6rTR3Iklmtz+F+tH0vv2kpoEMWSBDh9yIsZlCGcxwRdygmWbKoMwpyqv0I8QhxhqcIsqBDM5ZVXTadaMS8r1fuLYu0uiyMPTsApKAETXIEaaIAmaAMMHsEzeAVv2pP2or1rH2lrTstmjsEfaV8/TPWcIw==</latexit>

element-wise multiplication
<latexit sha1_base64="jqZTPQnnHZrzUBFgRha1qEBPS+0=">AAACC3icbVC7SgNBFJ31GeNr1dJmSBBsDLtB1DJoYxnBPCBZwuzkJhky+2DmrhqW9Db+io2FIrb+gJ1/4yTZQhMvDBzOOffeucePpdDoON/W0vLK6tp6biO/ubW9s2vv7dd1lCgONR7JSDV9pkGKEGooUEIzVsACX0LDH15N9MYdKC2i8BZHMXgB64eiJzhDQ3XsQhvhAVOQEECIJ/dCAw0SicIsn3nGHbvolJxp0UXgZqBIsqp27K92N+LJZCCXTOuW68TopUyh4BLG+XaiIWZ8yPrQMjBkAWgvnd4ypkeG6dJepMwLkU7Z3x0pC7QeBb5xBgwHel6bkP9prQR7F14qwjhBCPlsUS+RFCM6CYZ2hQKOcmQA40qYv1I+YIpxNPHlTQju/MmLoF4uuWel8s1psXKZxZEjh6RAjolLzkmFXJMqqRFOHskzeSVv1pP1Yr1bHzPrkpX1HJA/ZX3+AIHsm/w=</latexit>

T ! transform gate
<latexit sha1_base64="KadYP5ne/qGVzRZmTTsZjALivOQ=">AAACDHicbVDLSgNBEJz1GeMr6tHLYBA8hV0R9Rj04jFCHkISpHfSSQZnZ5eZXjUs+QAv/ooXD4p49QO8+TdO4h58FQwUVdX0dIWJkpZ8/8ObmZ2bX1gsLBWXV1bX1ksbm00bp0ZgQ8QqNhchWFRSY4MkKbxIDEIUKmyFV6cTv3WNxspY12mUYDeCgZZ9KYCcdFkq13nHyMGQwJj4hncIbykjA9r2YxPxARCOXcqv+FPwvyTISZnlqF2W3ju9WKQRahIKrG0HfkLdDAxJoXBc7KQWExBXMMC2oxoitN1sesyY7zqlx9129zTxqfp9IoPI2lEUumQENLS/vYn4n9dOqX/czaROUkItvhb1U8Up5pNmeE8aFKRGjoAw0v2ViyEYEOT6K7oSgt8n/yXN/UpwWNk/PyhXT/I6Cmyb7bA9FrAjVmVnrMYaTLA79sCe2LN37z16L97rV3TGy2e22A94b5+eiJv/</latexit>

C ! carry gate
<latexit sha1_base64="bT7DnABtmn1FLbrm3J9RELyWJD8=">AAACCHicbVC7SgNBFJ31GeNr1dLCwSBYhd0gahlMYxnBPCAJYXZykwyZfTBzV12WlDb+io2FIrZ+gp1/4yTZQhMPXDhzzr3MvceLpNDoON/W0vLK6tp6biO/ubW9s2vv7dd1GCsONR7KUDU9pkGKAGooUEIzUsB8T0LDG1UmfuMOlBZhcItJBB2fDQLRF5yhkbr2UYW2lRgMkSkV3tM2wgOm3DwSOmAI465dcIrOFHSRuBkpkAzVrv3V7oU89iFALpnWLdeJsJMyhYJLGOfbsYaI8REbQMvQgPmgO+n0kDE9MUqP9kNlKkA6VX9PpMzXOvE90+kzHOp5byL+57Vi7F92UhFEMULAZx/1Y0kxpJNUaE8o4CgTQxhXwuxK+ZApxtFklzchuPMnL5J6qeieF0s3Z4XyVRZHjhySY3JKXHJByuSaVEmNcPJInskrebOerBfr3fqYtS5Z2cwB+QPr8wcY/JoL</latexit>

C=1 T
<latexit sha1_base64="l3zbcwg3aAJQxfOBfHVN6leZ7Io=">AAAB8HicbVBNSwMxEJ2tX7V+VT16CRbBi2W3iHoRir14rNAvaZeSTbNtaJJdkqxQlv4KLx4U8erP8ea/MW33oK0PBh7vzTAzL4g508Z1v53c2vrG5lZ+u7Czu7d/UDw8aukoUYQ2ScQj1QmwppxJ2jTMcNqJFcUi4LQdjGszv/1ElWaRbJhJTH2Bh5KFjGBjpccaukUeukCNfrHklt050CrxMlKCDPV+8as3iEgiqDSEY627nhsbP8XKMMLptNBLNI0xGeMh7VoqsaDaT+cHT9GZVQYojJQtadBc/T2RYqH1RAS2U2Az0sveTPzP6yYmvPFTJuPEUEkWi8KEIxOh2fdowBQlhk8swUQxeysiI6wwMTajgg3BW355lbQqZe+qXHm4LFXvsjjycAKncA4eXEMV7qEOTSAg4Ble4c1Rzovz7nwsWnNONnMMf+B8/gDylI6N</latexit>

<latexit sha1_base64="n/vD4R9Sr6XuUROe9uG+3sbYwlI=">AAACBHicbVDLSsNAFJ3UV62vqMtuBovQIpSkiLoRim5cVkgf0IYwmU7aoZNJmJlIS+jCjb/ixoUibv0Id/6N0zYLbT1w4XDOvdx7jx8zKpVlfRu5tfWNza38dmFnd2//wDw8askoEZg0ccQi0fGRJIxy0lRUMdKJBUGhz0jbH93O/PYDEZJG3FGTmLghGnAaUIyUljyz6JTHFXgNe5IOQgTLbc+BY3gGfc+peGbJqlpzwFViZ6QEMjQ886vXj3ASEq4wQ1J2bStWboqEopiRaaGXSBIjPEID0tWUo5BIN50/MYWnWunDIBK6uIJz9fdEikIpJ6GvO0OkhnLZm4n/ed1EBVduSnmcKMLxYlGQMKgiOEsE9qkgWLGJJggLqm+FeIgEwkrnVtAh2Msvr5JWrWpfVGv356X6TRZHHhTBCSgDG1yCOrgDDdAEGDyCZ/AK3own48V4Nz4WrTkjmzkGf2B8/gBX+JVa</latexit>

T (x) = (WT x + bT )
<latexit sha1_base64="jf3tihwCQvUVpuyihMoF4AzemUY=">AAACOXicbVBNbxMxEPW2QNNAaVqOXCyiqu0h0W5BbS+VIrhwDFLSRkqidNY7Sax67ZU9W4hW6c/qhX/BDYkLhyLElT+A84FEG55k6c17MxrPizMlHYXh12Bt/dHjJxulzfLTZ1vPtys7u+fO5FZgWxhlbCcGh0pqbJMkhZ3MIqSxwov46t3Mv7hG66TRLZpk2E9hpOVQCiAvDSrNeNDiZ7wW8R7hJypujJ3y2uu/ldSSJCg14Qex9GsSTuYj2MRxGiO/vBRg7WR/n8c4hmtp7OF0UKmG9XAOvkqiJamyJZqDypdeYkSeoiahwLluFGbUL8CSFAqn5V7uMANxBSPseqohRdcv5pdP+V4++9PQWP808bn670QBqXOTNPadKdDYPfRm4v+8bk7D034hdZYTarFYNMyVP5/PYuSJtCjI55JIENaHJLgYgwVBPuyyDyF6ePIqOT+qR8f1ow9vqo23yzhK7CV7xQ5YxE5Yg71nTdZmgt2yb+yO/Qg+B9+Dn8GvRetasJx5we4h+P0H9f+sfQ==</latexit>

bT = 1 or 3 initially (biased towards the “carry” behavior)

<latexit sha1_base64="nBASIm9oRMCGhVPVDfPfdaoa9Lg=">AAAB/XicbVDLSsNAFJ3UV62v+Ni5GSxCBQlJF+qy6MZlBfuANpTJ5KYdOpmEmYlQS/FX3LhQxK3/4c6/cdpmoa0HLhzOuZd77wlSzpR23W+rsLK6tr5R3Cxtbe/s7tn7B02VZJJCgyY8ke2AKOBMQEMzzaGdSiBxwKEVDG+mfusBpGKJuNejFPyY9AWLGCXaSD37qMIccM7xgIUhCJwJptVZzy67jjsDXiZeTsooR71nf3XDhGYxCE05Uarjuan2x0RqRjlMSt1MQUrokPShY6ggMSh/PLt+gk+NEuIokaaExjP198SYxEqN4sB0xkQP1KI3Ff/zOpmOrvwxE2mmQdD5oijjWCd4GgUOmQSq+cgQQiUzt2I6IJJQbQIrmRC8xZeXSbPqeBdO9a5arl3ncRTRMTpBFeShS1RDt6iOGoiiR/SMXtGb9WS9WO/Wx7y1YOUzh+gPrM8fhTKT/g==</latexit>

(i.e., hidden units)


<latexit sha1_base64="vXdWZBb7KDsPBc9rLr988B2h/gQ=">AAAB+3icbVDJSgNBEO2JW4zbGI9eGhMhggwzQdRj0IvHCGaBJISeTiVp0rPQXSMJQ37FiwdFvPoj3vwbO8tBow8KHu9VUVXPj6XQ6LpfVmZtfWNzK7ud29nd2z+wD/N1HSWKQ41HMlJNn2mQIoQaCpTQjBWwwJfQ8Ee3M7/xCEqLKHzASQydgA1C0RecoZG6dr6NMMa0JBxwzmlxUjybdu2C67hz0L/EW5ICWaLatT/bvYgnAYTIJdO65bkxdlKmUHAJ01w70RAzPmIDaBkasgB0J53fPqWnRunRfqRMhUjn6s+JlAVaTwLfdAYMh3rVm4n/ea0E+9edVIRxghDyxaJ+IilGdBYE7QkFHOXEEMaVMLdSPmSKcTRx5UwI3urLf0m97HiXTvn+olC5WcaRJcfkhJSIR65IhdyRKqkRTsbkibyQV2tqPVtv1vuiNWMtZ47IL1gf39z/kwc=</latexit>

(i.e., y)
<latexit sha1_base64="cFVtwIlInLF4qSQZ4Zdj/Ki4EKg=">AAAB+XicbVA9SwNBEN2LXzF+nVraLAYhNuEuhVoGbSwjmA9IjrC3mUuW7O4du3vBcOSf2FgoYus/sfPfuEmu0MQHA4/3ZpiZFyacaeN5305hY3Nre6e4W9rbPzg8co9PWjpOFYUmjXmsOiHRwJmEpmGGQydRQETIoR2O7+Z+ewJKs1g+mmkCgSBDySJGibFS33UrmskhBwxPRCQcLvtu2at6C+B14uekjHI0+u5XbxDTVIA0lBOtu76XmCAjyjDKYVbqpRoSQsdkCF1LJRGgg2xx+QxfWGWAo1jZkgYv1N8TGRFaT0VoOwUxI73qzcX/vG5qopsgYzJJDUi6XBSlHJsYz2PAA6aAGj61hFDF7K2Yjogi1NiwSjYEf/XlddKqVf2rau2hVq7f5nEU0Rk6RxXko2tUR/eogZqIogl6Rq/ozcmcF+fd+Vi2Fpx85hT9gfP5A+awky8=</latexit>

(single example)

Srivastava, Rupesh Kumar, Klaus Gre , and Jürgen Schmidhuber. "Training very deep networks." arXiv preprint arXiv:1507.06228 (2015).
Srivastava, Rupesh Kumar, Klaus Gre , and Jürgen Schmidhuber. "Highway networks." arXiv preprint arXiv:1505.00387 (2015).
ff
ff
Deep Residual Learning for Image Recognition
YouTube Playlist

bottleneck

8
<
:

He, Kaiming, et al. "Deep residual learning for image recognition." Proceedings of the IEEE conference on computer vision and pattern recognition. 2016.
Identity Mappings in Deep Residual Networks
YouTube Playlist

:
<
8
Original

! if f and h are both identity

| {z }

He, Kaiming, et al. "Identity mappings in deep residual networks." European conference on computer vision. Springer, Cham, 2016.
Deep Networks with Stochastic Depth
For each mini-batch, randomly select sets of layers and remove their correspon
<latexit sha1_base64="kwqQc42cmtM+Z/nU8U5L3y4wS5U=">AAACinicbVHLbtNAFB2bR0soEGDJ5ooIiUUb2UFqQWwqQIhlkUhbKYmi8fg6vso8rJlxpSjqx/BL7Pgbrt0soOVKIx+dc5/HRaMpxCz7naT37j94uLf/aPD44MnTZ8PnL86Da73CqXLa+ctCBtRkcRoparxsPEpTaLwo1p87/eIKfSBnf8RNgwsjV5YqUjIytRz+nFtHtkQb4avzgFLVYMjSUSGjqg/BS1s6ozfAI1BF/sQArgItN9wVWAWPxl0hxBrJg3LeY2icLcmuIHJ5qJw3/TSoWqs6EA7BWe65Rmz6tBqBuh0o8qA1NdzGWuxzx8vhKBtnfcBdkO/ASOzibDn8NS+dag33U1qGMMuzJi620kdSGq8H8zZgI9VarnDG0EqDYbHtrbyGN8yUwCvzY0969u+KrTQhbEzBmXxUHW5rHfk/bdbG6v1iS7ZpI1p1M6hqNUQH3X+BkjwfzKaUJJUn3hVULb1UkX0esAn57ZPvgvPJOD8ev/s+GZ1+2tmxL16J1+KtyMWJOBXfxJmYCpXsJUfJcXKSHqST9EP68SY1TXY1L8U/kX75A9SkxsQ=</latexit>

For each mini-batch, randomly select sets of layers and remove their correspond-ing transformation functions, only keeping the identity skip connection.
<latexit sha1_base64="kwqQc42cmtM+Z/nU8U5L3y4wS5U=">AAACinicbVHLbtNAFB2bR0soEGDJ5ooIiUUb2UFqQWwqQIhlkUhbKYmi8fg6vso8rJlxpSjqx/BL7Pgbrt0soOVKIx+dc5/HRaMpxCz7naT37j94uLf/aPD44MnTZ8PnL86Da73CqXLa+ctCBtRkcRoparxsPEpTaLwo1p87/eIKfSBnf8RNgwsjV5YqUjIytRz+nFtHtkQb4avzgFLVYMjSUSGjqg/BS1s6ozfAI1BF/sQArgItN9wVWAWPxl0hxBrJg3LeY2icLcmuIHJ5qJw3/TSoWqs6EA7BWe65Rmz6tBqBuh0o8qA1NdzGWuxzx8vhKBtnfcBdkO/ASOzibDn8NS+dag33U1qGMMuzJi620kdSGq8H8zZgI9VarnDG0EqDYbHtrbyGN8yUwCvzY0969u+KrTQhbEzBmXxUHW5rHfk/bdbG6v1iS7ZpI1p1M6hqNUQH3X+BkjwfzKaUJJUn3hVULb1UkX0esAn57ZPvgvPJOD8ev/s+GZ1+2tmxL16J1+KtyMWJOBXfxJmYCpXsJUfJcXKSHqST9EP68SY1TXY1L8U/kX75A9SkxsQ=</latexit>

ResNet Architecturefunctions, only keeping the identity skip connection.


<latexit sha1_base64="yvqG3ywPmYXLA4+blwrTWcTuMKI=">AAACB3icbVA9SwNBEN2LXzF+RS0FWQyCVbiLoJZRGyuJYhIhOcLe3iRZsvfB7pwQjnQ2/hUbC0Vs/Qt2/hs3lxSa+GDh8d7M7MzzYik02va3lVtYXFpeya8W1tY3NreK2zsNHSWKQ51HMlL3HtMgRQh1FCjhPlbAAk9C0xtcjv3mAygtovAOhzG4AeuFois4QyN1ivvtbEaqwB/dgr4GpOeK9wUCx0RBp1iyy3YGOk+cKSmRKWqd4lfbj3gSQIhcMq1bjh2jmzKFgksYFdqJhpjxAetBy9CQBaDdNNthRA+N4tNupMwLkWbq746UBVoPA89UBgz7etYbi/95rQS7Z24qwjhBCPnko24iKUZ0HAr1hTL3yqEhjCthdqW8zxTjaKIrmBCc2ZPnSaNSdk7KxzeVUvViGkee7JEDckQcckqq5IrUSJ1w8kieySt5s56sF+vd+piU5qxpzy75A+vzB7/cmd8=</latexit>

ing transformation

f` !
<latexit sha1_base64="GszjD2FTJoBPhsKXgZvj+POm77g=">AAAB/HicbVBNS8NAEJ3Ur1q/oj16WSyCp5KoqMeiF48V7Ac0IWy2m3bpZhN2N0oo9a948aCIV3+IN/+N2zYHbX0w8Hhvhpl5YcqZ0o7zbZVWVtfWN8qbla3tnd09e/+grZJMEtoiCU9kN8SKciZoSzPNaTeVFMchp51wdDP1Ow9UKpaIe52n1I/xQLCIEayNFNjVKPAo58iTbDDUWMrkEQV2zak7M6Bl4hakBgWagf3l9ROSxVRowrFSPddJtT/GUjPC6aTiZYqmmIzwgPYMFTimyh/Pjp+gY6P0UZRIU0Kjmfp7YoxjpfI4NJ0x1kO16E3F/7xepqMrf8xEmmkqyHxRlHGkEzRNAvWZpETz3BBMJDO3IjLEEhNt8qqYENzFl5dJ+7TuXtTP7s5rjesijjIcwhGcgAuX0IBbaEILCOTwDK/wZj1ZL9a79TFvLVnFTBX+wPr8AYgBlLQ=</latexit>

Stochastic depth
<latexit sha1_base64="Mxpzg0cJKoCpd5Odz+BTDSSgP9U=">AAACBHicbVDLSsNAFJ34rPUVddlNsAiuSlJBXRbduKxoH9CGMpnctkMnD2ZuhBK6cOOvuHGhiFs/wp1/4yTNQlsPDBzOuY+5x4sFV2jb38bK6tr6xmZpq7y9s7u3bx4ctlWUSAYtFolIdj2qQPAQWshRQDeWQANPQMebXGd+5wGk4lF4j9MY3ICOQj7kjKKWBmaln89IJfizO4zYmCrkzPIhxvHArNo1O4e1TJyCVEmB5sD86vsRSwIIkQmqVM+xY3RTKvVIAbNyP1EQUzahI+hpGtIAlJvmH5hZJ1rxrWEk9QvRytXfHSkNlJoGnq4MKI7VopeJ/3m9BIeXbsrDOEEI2XzRMBEWRlaWiOVzCQzFVBPKJM/O1zFIylDnVtYhOIsnL5N2veac185u69XGVRFHiVTIMTklDrkgDXJDmqRFGHkkz+SVvBlPxovxbnzMS1eMoueI/IHx+QOfO5i3</latexit>

b` ! Bernoulli random variable


<latexit sha1_base64="U1aB5LEv9FgU5XvZGhjtRMzBO5Q=">AAACHHicbVC7SgNBFJ31bXxFLW0Gg2AVdlXUUrSxVDAqZEO4O7lJBmdnlpm70bD4ITb+io2FIjYWgn/jJKbwdWDgcM59zD1JpqSjMPwIxsYnJqemZ2ZLc/MLi0vl5ZVzZ3IrsCaMMvYyAYdKaqyRJIWXmUVIE4UXydXRwL/ooXXS6DPqZ9hIoaNlWwogLzXL20kzRqV4bGWnS2CtueYx4Q0Vh2i1yZWS3IJumZT3wErwc2+b5UpYDYfgf0k0IhU2wkmz/Ba3jMhT1CQUOFePwowaBViSws8rxbnDDMQVdLDuqYYUXaMYHnfLN7zS4m1j/dPEh+r3jgJS5/pp4itToK777Q3E/7x6Tu39RiF1lhNq8bWonStOhg+S4i1pUZDqewLCSv9XLrpgQZDPs+RDiH6f/Jecb1Wj3er26U7l4HAUxwxbY+tsk0Vsjx2wY3bCakywO/bAnthzcB88Bi/B61fpWDDqWWU/ELx/AkDWorw=</latexit>

p` = Pr(b` = 1) ! “survival” probability


<latexit sha1_base64="99RnG4KmeCAM3kcy/PVAm9A6sow=">AAACM3icbVBdSxtBFJ21to3pV7SPvgyGYvoSdm1pfRGkvhSfIhgVsku8O7lJhszuLDN3o2HZ/9SX/hEfCqUPFfHV/+AkbqE1PTBwOOdc7twTZ0pa8v2f3sqT1afPntfW6i9evnr9prG+cWJ1bgR2hVbanMVgUckUuyRJ4VlmEJJY4Wk8OZj7p1M0Vur0mGYZRgmMUjmUAshJ/cZh1g9RKb7HQ8JLKjqmbMV/pOA9D40cjQmM0RdV4vzc5mYqp6C2t3lmdAyxVJJmZb/R9Nv+AnyZBBVpsgqdfuMqHGiRJ5iSUGBtL/AzigowJIXCsh7mFjMQExhhz9EUErRRsbi55O+cMuBDbdxLiS/UvycKSKydJbFLJkBj+9ibi//zejkNd6NCpllOmIqHRcNccdJ8XiAfSIOC1MwREEa6v3IxBgOCXM11V0Lw+ORlcrLTDj61Pxx9bO5/qeqosU22xVosYJ/ZPvvKOqzLBPvGfrDf7Nr77v3ybrzbh+iKV828Zf/Au7sHl3qq+Q==</latexit>

Stochastic depth during testing


<latexit sha1_base64="Imnyrzclak0oCNj+yrExJ+a6NMY=">AAACE3icbZDLSgMxFIYzXmu9VV26CRZBXJSZCuqy6MZlRXuBtpRM5rQNzWSG5IxQhr6DG1/FjQtF3Lpx59uYabvQ1gOBn/8/Jyf5/FgKg6777Swtr6yurec28ptb2zu7hb39uokSzaHGIxnpps8MSKGghgIlNGMNLPQlNPzhdZY3HkAbEal7HMXQCVlfiZ7gDK3VLZy2J3ekGoLxHUZ8wAwKTgOIcUCDRAvVpwjWU/1uoeiW3EnRReHNRJHMqtotfLWDiCchKOSSGdPy3Bg7KdN2g4Rxvp0YiBkfsj60rFQsBNNJJ+8Z02PrBLQXaXsU0on7eyJloTGj0LedIcOBmc8y87+slWDvspMKFScIik8X9RJJMaIZIBoIDRzlyArGtchoWCqacbQY8xaCN//lRVEvl7zz0tltuVi5muHIkUNyRE6IRy5IhdyQKqkRTh7JM3klb86T8+K8Ox/T1iVnNnNA/pTz+QNbfZ8U</latexit>

p` ! expected number of times f` participates in training


<latexit sha1_base64="kqBe2cp9sbW8wAgk8t9Tn/wj930=">AAACOnicbVCxThtBEN0DQohDwAklzSh2JCrrjkghJUqalCBhQLIta249Z6/Y213tziWxLL4rTb4iXQoaChCi5QNYHy4S4EkjPb03o5l5udMqcJr+TZaWV16svlx71Xi9/mZjs/n23XGwlZfUlVZbf5pjIK0MdVmxplPnCctc00l+9nXun3wnH5Q1Rzx1NChxbFShJHKUhs3Dthv2SWvoezWeMHpvf7SBfjqSTCMwVZmTB1sAq5ICtIu6uw0OPSupHHJUlQH2qIwyY2gMm620k9aApyRbkJZY4GDY/NMfWVmVZFhqDKGXpY4Hs3qBpvNGvwrkUJ7hmHqRGoyHDGb16+fwISojKKyPZRhq9d+JGZYhTMs8dpbIk/DYm4vPeb2Ki8+DmTKuYjLyYVFRaWAL8xxhpHyMSE8jQelVvBXkBD3G2HyYh5A9fvkpOd7tZJ86Hw93W/tfFnGsiW3xXuyITOyJffFNHIiukOKXuBBX4jr5nVwmN8ntQ+tSspjZEv8hubsHanOtLg==</latexit>

Huang, Gao, et al. "Deep networks with stochastic depth." European conference on computer vision. Springer, Cham, 2016.
Wide Residual Networks
YouTube Playlist

{z } ! BN-ReLU-Conv
Conv-BN-ReLU
| | {z }
ResNet Identity Mapping

n ! total number of convolutional layers


k ! widening factor (multiplies the number of features in conv layers)
WRN-n-k

` ! deepening factor (number of convs in a block)


Zagoruyko, Sergey, and Nikos Komodakis. "Wide residual networks." arXiv preprint arXiv:1605.07146 (2016).
Aggregated Residual Transformations
for Deep Neural Networks YouTube Playlist

32
X
Wi xi = [W1 , . . . , W32 ] [x1 ; . . . ; x32 ]
|{z} |{z} | {z }| {z }
i=1 2R256⇥4 2R4
2R256⇥128 R128
C
X
split transform
| {z } merge strategy ! inception F(x) = Ti (x) ! “Network in Neuron”
|{z} | {z }
1 ⇥ 1 conv 3 ⇥ 3, 5 ⇥ 5, etc. conv concatenation i=1
D
X C ! cardinality
cardinality: size of the set of transformations C
wi xi ! neuron X
i=1
y =x+ Ti (x)
i=1
x = [x1 , x2 , . . . , xD ]
split: x ! xi
transform: xi wi
X D
aggregate: w i xi
i=1
Xie, Saining, et al. "Aggregated residual transformations for deep neural networks." Proceedings of the IEEE conference on computer vision and pattern recognition.
2017.
Densely Connected Convolutional Networks
YouTube Playlist

x0 ! single image
L ! number of layers
H` (·) ! nonlinear transformation
` = 1, . . . , L ! layer index
x` ! output of the `-th layer 4k feature maps
x` = H` (x` 1 )
x` = H` (x` 1 ) + x` 1
x` = H` ([x0 , x1 , . . . , x` 1 ])
| {z }
growth rate of the network
concatenate k = 12
If each function H` produces k feature-maps,
then the `-th layer has k0 + k(` 1) input feature maps
Huang, Gao, et al. "Densely connected convolutional networks." Proceedings of the IEEE conference on computer vision and pattern recognition. 2017.
Huang, Gao, et al. "Convolutional networks with dense connectivity." IEEE transactions on pattern analysis and machine intelligence (2019).
Inception-v4, Inception-ResNet and
the Impact of Residual Connections on Learning YouTube Playlist

Inception-v4 Inception-Resnet-v1 & -v2 Stem: Inception-v4 Inception-A Inception-A Inception-A


& Inception-Resnet-v2 Inception-v4 Inception-Resnet-v1 Inception-Resnet-v2

Stem:
Inception-Resnet-v1
Refer to the paper for the exact forms of
Inception-B and Inception-C blocks

Szegedy, Christian, et al. "Inception-v4, inception-resnet and the impact of residual connections on learning." arXiv preprint arXiv:1602.07261 (2016).
mixup: Beyond Empirical Risk Minimization
YouTube Playlist

X m
data-agnostic data augmentation
<latexit sha1_base64="F+YSwLidKuXl3fEcLgHx8j62kKk=">AAACG3icbVC7TsMwFHV4lvIqMDJgUSGxUCUdgLGChbFI9CG1UXXjOK1Vx45sB6mqOvIZfAErfAEbYmXgA/gPnDYDbbmSpeNzzr3XPkHCmTau++2srK6tb2wWtorbO7t7+6WDw6aWqSK0QSSXqh2AppwJ2jDMcNpOFIU44LQVDG8zvfVIlWZSPJhRQv0Y+oJFjICxVK90EoKBC8tJbRjB2Q1D2o+pMLmj7FbcaeFl4OWgjPKq90o/3VCSNBtAOGjd8dzE+GNQdjynk2I31TQBMoQ+7VgoIKbaH08/MsFnlglxJJU9wuAp+7djDLHWoziwzhjMQC9qGfmf1klNdO2PmUhSQwWZLYpSjo3EWSo4ZIoSw0cWAFEsi4IMQAExNru5LaHOnjaxuXiLKSyDZrXiXVaq99Vy7SZPqICO0Sk6Rx66QjV0h+qogQh6Qi/oFb05z8678+F8zqwrTt5zhObK+foFYaCiWA==</latexit>

<latexit sha1_base64="D8W73lp6Wi8kFfcjNpBZfgoImXI=">AAACdXicbVHBbtQwEHUChVKgXeCIkCwW0FZCS1Ih4FKpggvHgti20nqJHGeyO1rbiWynNLLyoZw48Q1ccbY50JaRRn7z3oxm9JzXEq1Lkp9RfOv21p272/d27j94uLs3evT4xFaNETATlazMWc4tSNQwc+gknNUGuMolnObrT71+eg7GYqW/ubaGheJLjSUK7gKVjVqmuFsJLv3XLmO6mZT79JCy0nDh086rjjLbqMzjYdp9V5SBlJNywhzKAvxFl+H+azpUbV9RZnC5ctyY6kcQ4MJ5UDWasFDScxSow2vQrrtsNE6mySboTZAOYEyGOM5Gv1lRiUaBdkJya+dpUruF58ahkNDtsMZCzcWaL2EeoOYK7MJvLOroy8AUtKxMSO3ohv13wnNlbavy0NkbYq9rPfk/bd648sPCo64bB1pcLiobSV1Fe79pgQaEk20AXBgMt1Kx4sFeF37lypbC9qf1vqTXXbgJTg6m6bvpwZe346OPg0Pb5Cl5TiYkJe/JEflMjsmMCPIr2op2o73oT/wsfhG/umyNo2HmCbkS8Zu/2M7Abw==</latexit>

1
R⌫ (f ) = `(f (x̃i ), ỹi ) ! empirical vicinal risk
<latexit sha1_base64="NoflTODSggNNWgjoH+b2x8iap34=">AAACKHicbVDdSgJBGJ21P7M/q8tuhiQwCNn1orqUQuhGMMkUdJHZ2VEHZ2aXmdnAln2JHqMn6LaeoLvwNug9GlcvUjswcDjn++Y7HC9kVGnbnliZtfWNza3sdm5nd2//IH949KiCSGLSxAELZNtDijAqSFNTzUg7lARxj5GWN7qd+q0nIhUNxIMeh8TlaCBon2KkjdTLX3TTP2JJ/KTKQyqNw2CDqhGsUUE5fU4HYbHaqJ338gW7ZKeAq8SZkwKYo97L/3T9AEecCI0ZUqrj2KF2YyQ1xYwkuW6kSIjwCA1Ix1CBOFFunCZK4JlRfNgPpHlCw1T9uxEjrtSYe2aSIz1Uy95U/M/rRLp/7cZUhJEmAs8O9SMGdQCnFUGfSoI1GxuCsKQmK8RDJBHWpsiFK76aRktML85yC6vksVxyLkvl+3KhcjNvKAtOwCkoAgdcgQq4A3XQBBi8gDfwDj6sV+vT+rIms9GMNd85Bguwvn8BCoCnPw==</latexit>

Empirical Risk Minimization (ERM) m i=1


f 2 F ! function
<latexit sha1_base64="nT1tgI8P8YNin1Lq6spuw8KxoPA=">AAACKXicbVDLSsNAFJ3UV62vqEs3g0VwY0mKqMuiIC4r2Ac0oUwmk3boZBJmbtQS+hV+hl/gVr/AnboV/8Ok7cK2Hhg4nHMv98zxYsE1WNanUVhaXlldK66XNja3tnfM3b2mjhJFWYNGIlJtj2gmuGQN4CBYO1aMhJ5gLW9wlfute6Y0j+QdDGPmhqQnecApgUzqmicBdrjETkigT4lIr0fYUbzXB6JU9IAdYI+QBomk+fioa5atijUGXiT2lJTRFPWu+eP4EU1CJoEKonXHtmJwU6KAU8FGJSfRLCZ0QHqsk1FJQqbddPytET7KFB8HkcqeBDxW/26kJNR6GHrZZB5fz3u5+J/XSSC4cFMu4wSYpJNDQSIwRDjvCPtcMQpimBFCFc+yYtonilDImpy54us8Wt6LPd/CImlWK/ZZpXp7Wq5dThsqogN0iI6Rjc5RDd2gOmogip7QC3pFb8az8W58GF+T0YIx3dlHMzC+fwE06KiS</latexit>

m
D {(x̃
<latexit sha1_base64="TQ5zwsAnjoYONwArl6G4KOOYe6U=">AAACNHicbZDNSgMxEMezftb6VfXoJVgEBSm7IlqEQlEPHitYK3Trks2mGkyyS5IVS9hX8TF8Aq96F7xJrz6D2XYPVh0Y+PGfGWbmHyaMKu26787U9Mzs3Hxpoby4tLyyWllbv1JxKjFp45jF8jpEijAqSFtTzch1IgniISOd8P40r3ceiFQ0Fpd6kJAeR7eC9ilG2kpBpe5zpO8wYuYsC3yRwuMG9M2OrymLiHnMArpX8MDyrp8Fhja87IYHlapbc0cB/4JXQBUU0QoqQz+KccqJ0Jghpbqem+ieQVJTzEhW9lNFEoTv0S3pWhSIE9Uzow8zuG2VCPZjaVNoOFJ/ThjElRrw0Hbm/6jftVz8r9ZNdb/eM1QkqSYCjxf1UwZ1DHO7YEQlwZoNLCAsqb0V4jskEdbW1IktkcpPy6wv3m8X/sLVfs07rO1fHFSbJ4VDJbAJtsAO8MARaIJz0AJtgMETeAGv4M15dj6cT2c4bp1yipkNMBHO1zdBu6yJ</latexit>

X ! random feature vector := , ỹ )}


<latexit sha1_base64="Gnov9HE+VmSu8PLIfg0X4CAbOSk=">AAACJnicbVBNa9tAEF2lX26atkp7zGWpCZQcjBRK22NILjm6UDsGW5jRamQvWe2K3VFaI/wf8jPyC3JNfkFuIeSWS/9HVrYPzceDgcd7M8zMS0slHUXRbbD24uWr129ab9ffbbz/8DHc/NR3prICe8IoYwcpOFRSY48kKRyUFqFIFR6lxweNf3SC1kmjf9OsxKSAiZa5FEBeGoc7Az6ycjIlsNb84SPCv1Rb0JkpeI5AlUV+goKMnY/DdtSJFuBPSbwibbZCdxz+G2VGVAVqEgqcG8ZRSUkNlqRQOF8fVQ5LEMcwwaGnGgp0Sb34ac63vZLx3FhfmvhC/X+ihsK5WZH6zgJo6h57jficN6wo/5nUUpcVoRbLRXmlOBneBMQzaf2/auYJCCv9rVxMwYIgH+ODLZlrTmtyiR+n8JT0dzvx987ur2/tvf1VQi22xb6wryxmP9geO2Rd1mOCnbJzdsEug7PgKrgObpata8Fq5jN7gODuHt1gp1k=</latexit>

⌫ i i i=1
mixup
<latexit sha1_base64="ENGZe/SC4yHGgdofcsdOEccavn8=">AAACDnicbVBLTsMwFHT4lvILZcnGokJiVSVdAMsKNiyLRD9SG1WO47RWbSeyHdQqyh04AVs4ATvElitwAO6Bk2ZBW0ayNJp5z/M0fsyo0o7zbW1sbm3v7Fb2qvsHh0fH9kmtq6JEYtLBEYtk30eKMCpIR1PNSD+WBHGfkZ4/vcv93hORikbiUc9j4nE0FjSkGGkjjezasPgjlSTIOJ0lMayO7LrTcArAdeKWpA5KtEf2zzCIcMKJ0JghpQauE2svRVJTzEhWHSaKxAhP0ZgMDBWIE+WlRW4GL4wSwDCS5gkNC/XvRoq4UnPum0mO9EStern4nzdIdHjjpVTEiSYCL4LChEEdwbwIGFBJsGZzQxCW1NwK8QRJhLWpayklUPlpmenFXW1hnXSbDfeq0Xxo1lu3ZUMVcAbOwSVwwTVogXvQBh2AwQy8gFfwZj1b79aH9bkY3bDKnVOwBOvrFyMsnNw=</latexit>

Y ! random target vector


<latexit sha1_base64="ShaUlpHpDSl4BbJH6cLDT1SHFRE=">AAACJXicbVDLSgNBEJyNr/iOevQyGAS9hN0g6lH04jGC8UESwuxsJxmcnVlmetWw5Bv8DL/Aq36BNxE8efI/nI05GGNBQ1HVTXdXmEhh0fc/vMLU9MzsXHF+YXFpeWW1tLZ+YXVqONS5ltpchcyCFArqKFDCVWKAxaGEy/DmJPcvb8FYodU59hNoxayrREdwhk5ql3avadOIbg+ZMfqONhHuMTNMRTqmTusC0lvgqM2gXSr7FX8IOkmCESmTEWrt0lcz0jyNQSGXzNpG4CfYyphBwSUMFpqphYTxG9aFhqOKxWBb2fClAd12SkQ72rhSSIfq74mMxdb249B1xgx79q+Xi/95jRQ7h61MqCRFUPxnUSeVFDXN86GRMO5f2XeEcSPcrZT3mGEcXYpjWyKbn5bnEvxNYZJcVCvBfqV6tlc+Oh4lVCSbZIvskIAckCNySmqkTjh5IE/kmbx4j96r9+a9/7QWvNHMBhmD9/kNCTym6w==</latexit>

P (X, Y ) ! joint distribution


<latexit sha1_base64="664Qk806ZO855yS2Ik8+2DSXnxc=">AAACKHicbVDLTgJBEJzFF+Jr1aOXicQEE0J2iVGPRC8eMZGHAUJmhwFGZnc2M70q2fATfoZf4FW/wJvhauJ/OAscBKykk0p1d7q6vFBwDY4ztlIrq2vrG+nNzNb2zu6evX9Q1TJSlFWoFFLVPaKZ4AGrAAfB6qFixPcEq3mD66Rfe2RKcxncwTBkLZ/0At7llICR2na+nKvn709xU/FeH4hS8gk3gT1D/CB5ALhjLCjuRcn0qG1nnYIzAV4m7oxk0Qzltv3T7Ega+SwAKojWDdcJoRUTBZwKNso0I81CQgekxxqGBsRnuhVPvhrhE6N0cFcqU8bJRP27ERNf66HvmUmfQF8v9hLxv14jgu5lK+ZBGAEL6PRQNxIYJE4iMj8rRkEMDSFUceMV0z5RhIIJcu5KRyfWklzcxRSWSbVYcM8LxduzbOlqllAaHaFjlEMuukAldIPKqIIoekFv6B19WK/Wp/VljaejKWu2c4jmYH3/AqDZp7c=</latexit>

` ! loss function (penalizes the di↵erence btw predictions f (x) and


<latexit sha1_base64="oxlB886oFN7c9AA2THG6Pn9zSzk=">AAACVnicbZDPbtNAEMY3LqWl/DPlyGVEihQukV1VwLEqF45FIm1FHEXj9ThZdb1r7Y5pUyvvxmPAA8AVngCxTnOgLSOt9OmbbzSzv7zWynOSfO9FG/c2729tP9h5+Ojxk6fxs90TbxsnaSSttu4sR09aGRqxYk1ntSOsck2n+fn7rn/6hZxX1nziRU2TCmdGlUoiB2saf85Ia8icms0ZnbMXkDFdcqut91A2RnYxGNRkUKsr8sBzgkKVJTkykiDnCwgLC7UKetgrB5ev9wBNsZzG/WSYrAruinQt+mJdx9P4R1ZY2VRkWGr0fpwmNU9adKykpuVO1niqUZ7jjMZBGqzIT9oVgyW8Ck4BpXXhGYaV++9Ei5X3iyoPyQp57m/3OvN/vXHD5btJq0zdcPjx9aKy0cAWOqABhiPJehEESqfCrSDn6FBywH5jS+G70zou6W0Kd8XJ/jB9M9z/eNA/PFoT2hYvxEsxEKl4Kw7FB3EsRkKKr+Kn+CV+9771/kSb0dZ1NOqtZ56LGxXFfwFoKLc+</latexit>

Z actual targets y, for examples (x, y) ⇠ P )


<latexit sha1_base64="L9uru0NVN5Gosfc2INI9EV//Btg=">AAACM3icbZDPSiNBEMZ7dNWsf6Me99JsIihImBFRj0EvHrOwUSEJoaZTExu7Z4buGskw5FF8DJ/Aqz6AeBM9+g7bE3NY/9Tp46sq6qtfmCppyfcfvZnZH3PzC5Wfi0vLK6tr1fWNM5tkRmBbJCoxFyFYVDLGNklSeJEaBB0qPA+vTsr++TUaK5P4L+Up9jQMYxlJAeSsfvWwSziiAgRloDiBGSJZXs/ruzxKDMcR6FShc7ZHu/kO71qpeau+M+5Xa37DnxT/KoKpqLFptfrV1+4gEZnGmIQCazuBn1KvAENSKBwvdjOLKYgrGGLHyRg02l4xeXDMt5wzmASKkpj4xP1/owBtba5DN6mBLu3nXml+1+tkFB31ChmnGWEs3g9FmQOR8JIWH0iDglTuBAgjXVYuLsE4XI7physDW0YruQSfKXwVZ3uN4KCx92e/1jyeEqqwX+w322YBO2RNdsparM0Eu2F37J49eLfek/fsvbyPznjTnU32oby3fydUqhk=</latexit>

<latexit sha1_base64="6gNx7hL5SZGaJU+7wgtaR1Ol2YI=">AAACSnicbVDLSgNBEJyN73fUo5fBIGxAwq6IehFEL54kilEhG8LsbG8yZPbBTK8mLPkrP8Mf0Kv6A97Ei7MxB18NDUV1N11VfiqFRsd5tEoTk1PTM7Nz8wuLS8sr5dW1K51kikODJzJRNz7TIEUMDRQo4SZVwCJfwrXfOynm17egtEjiSxyk0IpYJxah4AwN1S6feRHDLmcyvxjaYZUeUk/ESD2Qktqh3a9u00GVBrRu97cN8JTodJEpldxRD6GPOfRT4AgBVUL3hu1yxak5o6J/gTsGFTKuerv86gUJzyKIkUumddN1UmzlTKHgEobzXqYhZbzHOtA0MGYR6FY+8j2kW4YJaJgo00b0iP1+kbNI60Hkm83Cpf49K8j/Zs0Mw4NWLuI0Q4j516MwkxQTWoRIA6GMaTkwgHEljFbKu0wxE4T6+SXQhbQiF/d3Cn/B1U7N3avtnO9Wjo7HCc2SDbJJbOKSfXJETkmdNAgn9+SJPJMX68F6s96tj6/VkjW+WSc/qjT5CU5wsjA=</latexit>

R(f ) = `(f (x), y)dP (x, y) ! expected risk ↵ ! 0 =) recovering ERM


<latexit sha1_base64="ww9rPbTiHCgiVaYgw9nQtTOpiq4=">AAACL3icbVBNbxMxEPW2tKShLaEce7GIKnGKdtOqcIxASFyQAiIfUjaKZp3JxqrXXtmzQLTKD+Fn8Au4wi9AXBCXHvov8CY5kLZPsvT0ZsZv5iW5ko7C8Hews/tgb/9h7aD+6PDo+HHjyUnfmcIK7AmjjB0m4FBJjT2SpHCYW4QsUThIrl5X9cEntE4a/ZEWOY4zSLWcSQHkpUnjPAaVz4HHVqZzAmvNZx7yWGbeHB2PCb9QaVEY/4nUKX/z4d1y0miGrXAFfpdEG9JkG3Qnjet4akSRoSahwLlRFOY0LsGSFAqX9bhwmIO4ghRHnmrI0I3L1XFLfuaVKZ8Z658mvlL/nyghc26RJb4zA5q727VKvK82Kmj2clxKnReEWqyNZoXiZHiVFJ9KfzephScgrPS7cjEHC4J8nlsuU1etVuUS3U7hLum3W9Flq/3+otl5tUmoxk7ZM/acRewF67C3rMt6TLCv7Dv7wX4G34JfwZ/g77p1J9jMPGVbCG7+ASRmqnQ=</latexit>

P is unknown!
<latexit sha1_base64="7ff5k1Un8H2E/DEhwY7yk7PMEJ8=">AAACC3icbVBLTgJBFOzxi/hh1KWbVjBxRWZYqEuiG5eYyCeBCenp6YEOPd2T/mjIhCN4Ard6AnfGrYfwAN7DBmYhYCUvqVS9l1epMGVUac/7dtbWNza3tgs7xd29/YOSe3jUUsJITJpYMCE7IVKEUU6ammpGOqkkKAkZaYej26nffiRSUcEf9DglQYIGnMYUI22lvluqNCqQKmj4iIsnftp3y17VmwGuEj8nZZCj0Xd/epHAJiFcY4aU6vpeqoMMSU0xI5NizyiSIjxCA9K1lKOEqCCbBZ/Ac6tEMBbSDtdwpv69yFCi1DgJ7WaC9FAte1PxP69rdHwdZJSnRhOO549iw6AWcNoCjKgkWLOxJQhLarNCPEQSYW27WvgSqWm0ie3FX25hlbRqVf+yWruvles3eUMFcALOwAXwwRWogzvQAE2AgQEv4BW8Oc/Ou/PhfM5X15z85hgswPn6BScHmpg=</latexit>

↵ 2 {1/4, 1/2, 1, 2, 4}
<latexit sha1_base64="jUa1qkdXIOruyVAeasv15MlxkLI=">AAACHHicbVDLSgMxFM34rPVVdekmWAQXpZ0pRV0W3bisYB/QGcqdTNqGZjJDkhHK0K2f4Re41S9wJ24FP8D/MNN2YVsP5HI4517uzfFjzpS27W9rbX1jc2s7t5Pf3ds/OCwcHbdUlEhCmyTikez4oChngjY105x2Ykkh9Dlt+6PbzG8/UqlYJB70OKZeCAPB+oyANlKvgF3g8RCwywR2U6dSK2GnUjWlhE2tuZNeoWiX7SnwKnHmpIjmaPQKP24QkSSkQhMOSnUdO9ZeClIzwukk7yaKxkBGMKBdQwWEVHnp9CcTfG6UAPcjaZ7QeKr+nUghVGoc+qYzBD1Uy14m/ud1E92/9lIm4kRTQWaL+gnHOsJZLDhgkhLNx4YAkczciskQJBBtwlvYEqjstCwXZzmFVdKqlp3LcvW+VqzfzBPKoVN0hi6Qg65QHd2hBmoigp7QC3pFb9az9W59WJ+z1jVrPnOCFmB9/QK7sp7R</latexit>

D = {(xi , yi )}ni=1 ! training data


<latexit sha1_base64="1KY0TpbaY687mzoayNO7HLlpmLw=">AAACQXicbVBNS8NAEN34bf2qevSyWAQFKYmIehHED/CoYKvQ1DDZbNvFzSbsTtQS8ov8Gf4Cj+rZgzfx6sWk9qCtDwYe780wM8+PpTBo28/WyOjY+MTk1HRpZnZufqG8uFQ3UaIZr7FIRvrKB8OlULyGAiW/ijWH0Jf80r85KvzLW66NiNQFdmPeDKGtREswwFzyyiduCNhhINPjjO5TN12/98Qm7Xpiw828VOw72bWirhbtDoLW0R11kd9jihqEEqpNA0DIvHLFrto90GHi9EmF9HHmld/cIGJJyBUyCcY0HDvGZgoaBZM8K7mJ4TGwG2jzRk4VhNw00967GV3LlYC2Ip2XQtpTf0+kEBrTDf28s3jODHqF+J/XSLC110yFihPkiv0saiWSYkSL7GggNGcouzkBpkV+K2Ud0MAwT/jPlsAUpxW5OIMpDJP6VtXZqW6db1cODvsJTZEVskrWiUN2yQE5JWekRhh5IE/khbxaj9a79WF9/rSOWP2ZZfIH1tc3eyCxnA==</latexit>

(xi , yi ) ⇠ P, i = 1, . . . , n
<latexit sha1_base64="DmqdnyV15uJOUn2pAGuJEqlE/bc=">AAACH3icbVDLSsNAFJ3UV62vqks3g0WoEEpSRN0IRTcuK9gHNCFMJpN26OTBzEQMoR/gZ/gFbvUL3InbfoD/4aTNwrYeGDiccy73znFjRoU0jKlWWlvf2Nwqb1d2dvf2D6qHR10RJRyTDo5YxPsuEoTRkHQklYz0Y05Q4DLSc8d3ud97IlzQKHyUaUzsAA1D6lOMpJKcaq3+7FAdpg49h5agAWzrkMIbaOrQYl4khQ7zlNEwZoCrxCxIDRRoO9Ufy4twEpBQYoaEGJhGLO0McUkxI5OKlQgSIzxGQzJQNEQBEXY2+8wEninFg37E1QslnKl/JzIUCJEGrkoGSI7EspeL/3mDRPrXdkbDOJEkxPNFfsKgjGDeDPQoJ1iyVBGEOVW3QjxCHGGp+lvY4on8tInqxVxuYZV0mw3zstF8uKi1bouGyuAEnII6MMEVaIF70AYdgMELeAPv4EN71T61L+17Hi1pxcwxWIA2/QUic6C7</latexit>

Xn
<latexit sha1_base64="ss6xc5C+KdVbW7yXnXA0Ii19c44=">AAACaXicbVHBattAEF3JbeMmTeukl9JelpiCA8FIoSS5BEJ76dGFOglYrlitRvbi3ZXYHSUWQl+ZUz4gp35Bb13ZPjRJBxYeb97wZt4mhRQWg+De8zsvXr7a6r7e3nmz+/Zdb2//0ual4TDmuczNdcIsSKFhjAIlXBcGmEokXCWLb23/6gaMFbn+iVUBU8VmWmSCM3RU3FOjOEpBIqOD5RGtDuk5jTLDeB02tW5oZEsV1+I8bH5puhYOlk6zjIVTO1DF4pBGRszmyIzJb2mEsMQaVCGMM5E0dTcYkZStXRP3+sEwWBV9DsIN6JNNjeLeQ5TmvFSgkUtm7SQMCpzWzKDgEprtqLRQML5gM5g4qJkCO61XsTT0s2NSmuXGPY10xf47UTNlbaUSp1QM5/ZpryX/15uUmJ1Na6GLEkHztVFWSoo5bTN2NxvgKCsHGDfC7Ur5nLlU0f3EI5fUtqu1uYRPU3gOLo+H4cnw+MeX/sXXTUJd8okckAEJySm5IN/JiIwJJ3fkj+d7He+3v+d/8D+upb63mXlPHpXf/wtQwLp8</latexit>

1
P (x, y) = (x = xi , y = yi ) ! empirical distribution
Z n i=1 Xn
<latexit sha1_base64="Bzk5t5HAZhp4S99fnVgbyb3IPXA=">AAAChnicbVFdi9NAFJ3Er7p+VQVffPBiEVJYSrKo68tC0Rcfq9jdhaaG6eSmHTqZhJkbbQj5A/5Df4Dv/gQn3SDurhdmOJx7DvfOmVWppKUw/On5N27eun1ncPfg3v0HDx8NHz85tUVlBM5FoQpzvuIWldQ4J0kKz0uDPF8pPFttP3T9s29orCz0F6pLXOZ8rWUmBSdHJcMfcc5pI7hqPrdJnKIiHmRjOIFYaoIYlQqyYDc+hHoMKcx6CQS7w3qvygwXTdQ2uoXYVnnSyJOo/ar/OhPZed0NsZHrDXFjiu8QE+6owbyUxm2iwEi7bZPhKJyE+4LrIOrBiPU1S4a/4rQQVY6ahOLWLqKwpGXDDUmhsD2IK4slF1u+xoWDmudol80+sxZeOSaFrDDuuJfu2X8dDc+trfOVU3YJ2au9jvxfb1FR9m7ZSF1WhFpcDMoqBVRA9wGQSoOCVO0AF0a6XUFsuIuR3DddmpLabrUul+hqCtfB6dEkejs5+vR6NH3fJzRgz9lLFrCIHbMp+8hmbM4E++0981544A/8if/GP76Q+l7vecoulT/9A6vFwto=</latexit>

1
R (f ) = `(f (x), y)dP (x, y) = `(f (xi ), yi ) ! empirical risk
mixup GANs
<latexit sha1_base64="2a6cEp6sQwuuZz7LduUlVbmxuKQ=">AAACEXicbVBLTsMwFHTKr5RfALFiY1EhsaqSLoBlgQWsUJFoi9RGleO4rVXbiWwHUUU5BSdgCydgh9hyAg7APXDSLGjLSJZGM+95nsaPGFXacb6t0tLyyupaeb2ysbm1vWPv7rVVGEtMWjhkoXzwkSKMCtLSVDPyEEmCuM9Ixx9fZX7nkUhFQ3GvJxHxOBoKOqAYaSP17YNe/kciSZBy+hRH8PriVvXtqlNzcsBF4hakCgo0+/ZPLwhxzInQmCGluq4TaS9BUlPMSFrpxYpECI/RkHQNFYgT5SV5dAqPjRLAQSjNExrm6t+NBHGlJtw3kxzpkZr3MvE/rxvrwbmXUBHFmgg8DRrEDOoQZl3AgEqCNZsYgrCk5laIR0girE1jMymByk5LTS/ufAuLpF2vuae1+l292rgsGiqDQ3AEToALzkAD3IAmaAEMEvACXsGb9Wy9Wx/W53S0ZBU7+2AG1tcvpMmeOQ==</latexit>

n i=1
<latexit sha1_base64="fGA6Ris+0IC18Q/ByH6XYuUv4O8=">AAACJnicbVDNTgIxGOziH+LfqkcvjcQEPZBdDuqR6MULCRJBEiCk2y3Q0HY3bdcEN/sOPoZP4FWfwJsx3rz4HnYXDgJO0mQy8339JuOFjCrtOF9WbmV1bX0jv1nY2t7Z3bP3D1oqiCQmTRywQLY9pAijgjQ11Yy0Q0kQ9xi598bXqX//QKSigbjTk5D0OBoKOqAYaSP17bNu9kcsiZ+0KKYCMdigagxrVFBOH7MxWGo1aqd9u+iUnQxwmbgzUgQz1Pv2T9cPcMSJ0JghpTquE+pejKSmmJGk0I0UCREeoyHpGCoQJ6oXZ3kSeGIUHw4CaZ7QMFP/bsSIKzXhnpnkSI/UopeK/3mdSA8uezEVYaSJwNNDg4hBHcC0IOhTSbBmE0MQltRkhXiEJMLa1Dh3xVdptMT04i62sExalbJ7Xq7cVorVq1lDeXAEjkEJuOACVMENqIMmwOAJvIBX8GY9W+/Wh/U5Hc1Zs51DMAfr+xdtWaZs</latexit>

Vicinal Risk Minimization (VRM)


Xn
<latexit sha1_base64="MOWehUX5y2l4mBRaEsR5/JKuK04=">AAACVnicdZDNSsNAFIWn8a/Wv6hLN4NFUJCSiKibQtGNywq2FpsaJpNJO3QyCTMTaYh5Nx9DH0C3+gTitGZhWz1w4XDuvdzL58WMSmVZryVjYXFpeaW8Wllb39jcMrd32jJKBCYtHLFIdDwkCaOctBRVjHRiQVDoMXLnDa/G/btHIiSN+K1KY9ILUZ/TgGKkdOSa903X4cmhoyjzSTbKj2Fh0/wI1qETCIQzO894Dh2ZhG5G63b+wOF/S09w5NJjmLr0yDWrVs2aCM4buzBVUKjpmm+OH+EkJFxhhqTs2lasehkSimJG8oqTSBIjPER90tWWo5DIXjZhkMMDnfgwiIQuruAk/b2RoVDKNPT0ZIjUQM72xuFfvW6igoteRnmcKMLxz6EgYVBFcAwU+lQQrFiqDcKC6l8hHiCNTWnsU1d8OX4t11zsWQrzpn1Ss89qJzen1cZlQagM9sA+OAQ2OAcNcA2aoAUweAbv4AN8ll5KX8aSsfIzapSKnV0wJcP8Bpr1tt8=</latexit>

1
P⌫ (x̃, ỹ) = ⌫(x̃, ỹ|xi , yi )
n i=1
⌫ ! vicinity distribution (measures the probability of finding the
<latexit sha1_base64="/6SKto5FPoMwTF8WdrXOfTZU3nY=">AAACUXicbVBNbxMxEJ0sX6XlI8CRi0WE1FO0WyHgWMGFY5FIWykbRbPe2cSq117Z40K0yi/jZ3DiyIFL+wu4Yac50JaRLD2990bz/KpOK895/nOQ3bl77/6DnYe7e48eP3k6fPb82NvgJE2k1dadVuhJK0MTVqzptHOEbaXppDr7mPSTc3JeWfOFVx3NWlwY1SiJHKn5cFKaIEqnFktG5+xXUTJ94/5cSWUUr0QdIzhVheQW+y2hD4684CWJztkKK6WTzTaiUaZWZpGk9Xw4ysf5ZsRtUGzBCLZzNB/+LmsrQ0uGpUbvp0Xe8axHx0pqWu+WwVOH8gwXNI3QYEt+1m++vxavI1OLxrr4DIsN++9Gj633q7aKzhZ56W9qifyfNg3cvJ/1ynSBycirQ03Qgq1IXcZyHEnWqSWUTsWsQi7RoeTY+LUrtU/RUi/FzRZug+ODcfF2fPD5zejww7ahHXgJr2AfCngHh/AJjmACEr7DL7iAy8GPwZ8MsuzKmg22Oy/g2mR7fwFK2LaE</latexit>

<latexit sha1_base64="+9f8mm8zJjiALxuql0jvGrS2GB8=">AAACS3icbVBNT9tAEF2HUr5KCfTIZdVQiUolsnMAjqhceiNIDSAlUTRej5MV67W1O46wLP8sfkZ/QNUr+QXcEAc2wYfy8aSV3rw3o5l9YaakJd//6zWWPix/XFldW9/4tPl5q7m9c2HT3AjsiVSl5ioEi0pq7JEkhVeZQUhChZfh9encv5yisTLVv6nIcJjAWMtYCiAnjZpnU2koB8VjBMoNHhCYMRLf2x+QVBGWN9UPXtOi+r7HpeY0QT6VQmpJBU/jRU0GXK3Ho2bLb/sL8LckqEmL1eiOmrNBlIo8QU1CgbX9wM9oWIIhKRRW64PcYgbiGsbYd1RDgnZYLj5e8W9OiXicGvc08YX6/0QJibVFErrOBGhiX3tz8T2vn1N8PCylznJCLZ4XxbnilPJ5ijySBgWpwhEQRrpbuZiAAUEu6xdbIjs/rXK5BK9TeEsuOu3gsN0577ROftYJrbJd9pXts4AdsRP2i3VZjwl2y/6xOzbz/nj33oP3+Nza8OqZL+wFGstPn6y0dg==</latexit>

virtual feature-target (x̃, ỹ) in the vicinity of the training <latexit sha1_base64="5W0Ttkbgaf3hdjcUl3JTXqJ0YTI=">AAACHXicbVDLSgNBEJyN7/eqRy+jUTCgYTcH9Sh68ahgNBDD0jvpTYbMPpjpFUPI2c/wC7zqF3gTr+IH+B9OYg5qLGgoqrrp7gozJQ153odTmJicmp6ZnZtfWFxaXnFX165MmmuBVZGqVNdCMKhkglWSpLCWaYQ4VHgddk4H/vUtaiPT5JK6GTZiaCUykgLISoG7GSFQrnGfQLeQ+PbuXSD3eDeQpW2egdSlwC16ZW8IPk78ESmyEc4D9/OmmYo8xoSEAmPqvpdRoweapFDYn7/JDWYgOtDCuqUJxGgaveErfb5jlSaPUm0rIT5Uf070IDamG4e2MwZqm7/eQPzPq+cUHTV6MslywkR8L4pyxSnlg1x4U2oUpLqWgNDS3spFGzQIsun92tI0g9P6Nhf/bwrj5KpS9g/KlYtK8fhklNAs22BbbJf57JAdszN2zqpMsHv2yJ7Ys/PgvDivztt3a8EZzayzX3DevwBFZKGC</latexit>

feature-target (xi , yi ) pair)


⌫(x̃, ỹ|xi , yi ) = N (x̃ xi , 2 ) (ỹ = yi ) ! Gaussian vicinity
<latexit sha1_base64="1LZeQ1hfBrWULrjxyncqgSGoA/o=">AAACg3icbVFda9RAFJ1Eq7V+rfZRhcFF2GK7JEtRXwpFH/RJKrhtYbOGm8ns7qWTSZi5qRtinv2N/gAf/Q9O0qC29cLA4dxz7p05kxQKLQXBD8+/cXPj1u3NO1t3791/8HDw6PGxzUsj5FTkKjenCVipUMspISl5WhgJWaLkSXL2ru2fnEtjMdefqSrkPIOlxgUKIEfFg++RLkcRoUplvW52eQ+r5ts6xl1exbjDD3iUAa0EqPpj81fM93iniSwuM/gy2YlSqQhGf0Y4Y+ePDC5XBMbkX918uab6PZTWImh+jgI1UtXEg2EwDrri10HYgyHr6yge/IzSXJSZ1CQUWDsLg4LmNRhCoWSzFZVWFiDOYClnDmrIpJ3XXWANf+GYlC9y444m3rH/OmrIrK2yxCnbl9urvZb8X29W0uLNvEZdlCS1uFi0KBWnnLfp8xSNFKQqB0AYdHflYgUGBLk/urQlte3V2lzCqylcB8eTcfhqPPm0Pzx82ye0yZ6w52zEQvaaHbIP7IhNmWC/vG3vqffM3/Bf+hN//0Lqe71nm10q/+A3eUjEiQ==</latexit>

<latexit sha1_base64="242uxs5MO1qu43qskWjFvBDWs8I=">AAACN3icbVC7SgNBFJ31bXxFLW0Gg6BN2E2hgo1ooaWCiYEYwt3Zm2RwdnaZuRsJIR/jZ/gFtlpaWSm2/oGzMYVGDwycOedezswJUyUt+f6LNzU9Mzs3v7BYWFpeWV0rrm/UbJIZgVWRqMTUQ7CopMYqSVJYTw1CHCq8Dm9Pc/+6h8bKRF9RP8VmDB0t21IAOalVPNqFrBOjJqk7nLrIyYDU+SUCAn4nqcshiiTJHvIzyKyVoLlOpMW9VrHkl/0R+F8SjEmJjXHRKr7dRInI8jihwNpG4KfUHIAhKRQOCzeZxRTELXSw4aiGGG1zMPrkkO84JeLtxLijiY/UnxsDiK3tx6GbjIG6dtLLxf+8Rkbtw+ZA6jQj1OI7qJ0pTgnPG+ORNChI9R0BYVwTgosuGBDkev2VEtn8aUPXSzDZwl9Sq5SD/XLlslI6Phk3tMC22DbbZQE7YMfsnF2wKhPsnj2yJ/bsPXiv3rv38T065Y13NtkveJ9fGC+tYQ==</latexit>

(augmenting the training data with additive Gaussian noise)


Zhang, Hongyi, et al. "mixup: Beyond empirical risk minimization." arXiv preprint arXiv:1710.09412 (2017).
Accurate, Large Minibatch SGD:
Training ImageNet in 1 Hour YouTube Video

1 X X
<latexit sha1_base64="qMeOAsbzWAJ1ltsPE2/0+5nwexw=">AAACKHicbZDLSsNAFIYn9VbrLerSTbAIFaQkXajLooW6rNReoA1lMjlth04uzEyEEvoSPoZP4FafwJ10K/geTtIsbOuBYX7+c+VzQkaFNM25ltvY3Nreye8W9vYPDo/045O2CCJOoEUCFvCugwUw6kNLUsmgG3LAnsOg40zuk3znGbiggf8kpyHYHh75dEgJlsoa6Ff9dEbMwZ01ZUDGWEhKjDrHLgVfGjUQJPlLzXrtcqAXzbKZhrEurEwUURaNgf7TdwMSeWoCYViInmWG0o4xVzsYzAr9SECIyQSPoKekjz0QdpxeNDMulOMaw4Crpy5I3b8dMfaEmHqOqvSwHIvVXGL+l+tFcnhrx9QPIwk+WSwaRsyQgZEgMlzKgUg2VQITThMeigvHRCqQS1tckZw2U1ysVQrrol0pW9flymOlWL3LCOXRGTpHJWShG1RFD6iBWoigF/SG3tGH9qp9al/afFGa07KeU7QU2vcv0venGw==</latexit>

Stochastic Gradient Descent (SGD)


<latexit sha1_base64="AGZ5HUu5CDfC2ka9u4m0pnuGx5w=">AAACwHicbVHbbtNAEF2bWym3FB55GREjFVAju0IUCZCqgASPQZC2EEfWeLOON1mv3d0xSWr57/gJPoD/wE6DRFtGWunonJmdozNxoaQl3//luNeu37h5a+v29p279+4/6Ow8PLJ5abgY8lzl5iRGK5TUYkiSlDgpjMAsVuI4nr9v9eMfwliZ66+0KsQ4w6mWieRIDRV1fi6iil7Ma3gHi4hgD0JBCGFikFdBXekaQltmUeVDqMQpzOAtzP9ySwilhjBDSjmqql9Hs0bSGCsEtbt8A+u/Z/UzCI2cpoTG5AsISSyp8uYeSBJmbcNCnsCXjx9gISkFJdBoqafQiAK81pAHqCeAkEkt92IknoKVZ6Id87RXR52u3/PXBVdBsAFdtqlB1PkdTnJeZkITV2jtKPALGldoSHIl6u2wtKJAPsepGDVQYybsuFqnXcPThplAkpvmaYI1++9EhZm1qyxuOtto7GWtJf+njUpKXo8rqYuShObni5JSAeXQng4m0ghOatUA5EY2XoGn2FyqyfHiloltrbW5BJdTuAqO9nvBq97+55fdw/4moS32mD1huyxgB+yQfWIDNmTcee4MnG/Od7fvpm7unp63us5m5hG7UO7ZH1lE2H8=</latexit>

1 X 1 X X wt+k = wt ⌘ rl(x; wt+j ) ! k iterations of SGD with learning rate ⌘ and a mini-b <latexit sha1_base64="YWWm3eL1kGTW+oDKdz5nHOSTCN0=">AAACJnicbZDLSgMxFIYzXmu9jbp0EyxC66LMFFFBhKIbFy4q2At0SsmkmTY0kxmSjG2Zzjv4GD6BW30CdyLu3PgeppeFbf0h8POfczgnnxsyKpVlfRlLyyura+upjfTm1vbOrrm3X5FBJDAp44AFouYiSRjlpKyoYqQWCoJ8l5Gq270Z1auPREga8Ac1CEnDR21OPYqR0lHTPLnL9nLwCjqeQDi2k3hYGybQkZHfjPvQoRzWEsiy/ctermlmrLw1Flw09tRkwFSlpvnjtAIc+YQrzJCUddsKVSNGQlHMSJJ2IklChLuoTeracuQT2YjHf0rgsU5a0AuEflzBcfp3Ika+lAPf1Z0+Uh05XxuF/9XqkfIuGjHlYaQIx5NFXsSgCuAIEGxRQbBiA20QFlTfCnEHaTpKY5zZ0pKj0xLNxZ6nsGgqhbx9li/cn2aK11NCKXAIjkAW2OAcFMEtKIEywOAJvIBX8GY8G+/Gh/E5aV0ypjMHYEbG9y9CwKUv</latexit>

<latexit sha1_base64="AGZ5HUu5CDfC2ka9u4m0pnuGx5w=">AAACwHicbVHbbtNAEF2bWym3FB55GREjFVAju0IUCZCqgASPQZC2EEfWeLOON1mv3d0xSWr57/gJPoD/wE6DRFtGWunonJmdozNxoaQl3//luNeu37h5a+v29p279+4/6Ow8PLJ5abgY8lzl5iRGK5TUYkiSlDgpjMAsVuI4nr9v9eMfwliZ66+0KsQ4w6mWieRIDRV1fi6iil7Ma3gHi4hgD0JBCGFikFdBXekaQltmUeVDqMQpzOAtzP9ySwilhjBDSjmqql9Hs0bSGCsEtbt8A+u/Z/UzCI2cpoTG5AsISSyp8uYeSBJmbcNCnsCXjx9gISkFJdBoqafQiAK81pAHqCeAkEkt92IknoKVZ6Id87RXR52u3/PXBVdBsAFdtqlB1PkdTnJeZkITV2jtKPALGldoSHIl6u2wtKJAPsepGDVQYybsuFqnXcPThplAkpvmaYI1++9EhZm1qyxuOtto7GWtJf+njUpKXo8rqYuShObni5JSAeXQng4m0ghOatUA5EY2XoGn2FyqyfHiloltrbW5BJdTuAqO9nvBq97+55fdw/4moS32mD1huyxgB+yQfWIDNmTcee4MnG/Od7fvpm7unp63us5m5hG7UO7ZH1lE2H8=</latexit>

L(w) = wt+k = wl(x;t w)


⌘ rl(x; wt+j ) ! k iterations of SGD n with learning rate ⌘ and a mini-batch size of n
0j<k x2Bj
|X| n
x2X 1 X X
<latexit sha1_base64="zPoFvwKli4pFQDZLyzvhA2eU3Pc=">AAAC1nicbVFdi9NAFJ3Er3X9qvroy8VWWJEtySIqqFBWQR9XtLuFpoSb6aSZ7WQSZ25sa4hv4qt/T9/9H066Fd1dLwycOed+cW5SKmkpCH54/oWLly5f2bq6fe36jZu3OrfvHNqiMlwMeaEKM0rQCiW1GJIkJUalEZgnShwl81etfvRJGCsL/YFWpZjkONMylRzJUXHnZ5Qh1YsmrulR2MBLWMQEu7BmI0HYQJQa5HXY1HPtPrbK4zqASImPcAwvYP6HW0IkNUQ5UsZR1ftNfOwkjYlCUDvL523jhxAZOcsIjSkWEJFYUl1oAZKEWe8DRQrv37yGhaQMlECjpZ6B0wT0/q7UA9RTcG3RzATkUsvdBIlnYOVn0bbozUH3mrjTDfrBOuA8CDegyzZxEHd+RdOCV7nQxBVaOw6DkiY1GpJciWY7qqwokc9xJsYOasyFndTrGzTwwDFTSAvjniZYs/9W1Jhbu8oTl9l6ZM9qLfk/bVxR+mxSS11WJDQ/GZRWCqiA9qAwlUZwUisHkBvpdgWeobuZM/X0lKltV2t9Cc+6cB4c7vXDJ/29d4+7g/2NQ1vsHrvPdljInrIBe8sO2JBxb+ClXuGV/sj/4n/1v52k+t6m5i47Ff733z0l4oQ=</latexit>

0j<k x2Bj
L(w) ! loss function ŵt+1 = wt ⌘ˆ rl(x; wt ) ! one iteration of SGD with learning rate ⌘ˆ and a large
<latexit sha1_base64="Mv6mwwDQhdDEwrOQd+qryPdKqWM=">AAACIXicbVDLTgIxFO34RHyhLt00EhLckBli1CXRjQsXmMgjAUI6pQMNnXbS3hHJhC/wM/wCt/oF7ow749r/cAZmIeBJmpycc27u7XEDwQ3Y9pe1srq2vrGZ2cpu7+zu7ecODutGhZqyGlVC6aZLDBNcshpwEKwZaEZ8V7CGO7xO/MYD04YreQ/jgHV80pfc45RALHVzhdvi6BS3Ne8PgGitRrgN7BEioYzBXihpEpt0c3m7ZE+Bl4mTkjxKUe3mfto9RUOfSaCCGNNy7AA6EdHAqWCTbDs0LCB0SPqsFVNJfGY60fQ7E1yIlR72lI6fBDxV/05ExDdm7Ltx0icwMIteIv7ntULwLjsRl0EITNLZIi8UGBROusE9rhkFMY4JoZrHt2I6IJpQiBuc29IzyWlJL85iC8ukXi4556Xy3Vm+cpU2lEHH6AQVkYMuUAXdoCqqIYqe0At6RW/Ws/VufVifs+iKlc4coTlY37843aTq</latexit>

kn large mini-batch size of kn


<latexit sha1_base64="v5eaRIO29rANtZ7vBx+M0fP2VVw=">AAACIXicbVBNS8NAEN3U7++oRy+LbcGLJSmiHkUvHhVsLbShbLaTdulmE3YnYg39Bf4Mf4FX/QXexJt49n+Y1By0+mDg8d4MM/P8WAqDjvNulWZm5+YXFpeWV1bX1jfsza2miRLNocEjGemWzwxIoaCBAiW0Yg0s9CVc+8Oz3L++AW1EpK5wFIMXsr4SgeAMM6lrVzsIt5hKpvtAQ6HEvs+QD6gRd0CjgFaGVFXGXbvs1JwJ6F/iFqRMClx07c9OL+JJCAq5ZMa0XSdGL2UaBZcwXu4kBmLGh6wP7YwqFoLx0sk7Y1rNlB4NIp2VQjpRf06kLDRmFPpZZ8hwYKa9XPzPaycYHHupUHGCoPj3oiCRFCOaZ0N7QgNHOcoI41pkt1I+YJpxzBL8taVn8tPyXNzpFP6SZr3mHtbqlwflk9MioUWyQ3bJHnHJETkh5+SCNAgn9+SRPJFn68F6sV6tt+/WklXMbJNfsD6+AKJMo+I=</latexit>

w ! weights
<latexit sha1_base64="ThM5TXWr/3ArlmONGwKS713TI38=">AAACGHicbZBLTgJBEIZ78IX4GnWpi47ExBWZIUZdEt24xEQeCRDS0xTQoeeR7hqRTNh4DE/gVk/gzrh15wG8hz3AQsA/6eTLX1Wp6t+LpNDoON9WZmV1bX0ju5nb2t7Z3bP3D6o6jBWHCg9lqOoe0yBFABUUKKEeKWC+J6HmDW7Seu0BlBZhcI+jCFo+6wWiKzhDY7Xt4yFtKtHrI1MqNIzwiMkQUkeP23beKTgT0WVwZ5AnM5Xb9k+zE/LYhwC5ZFo3XCfCVsIUCi5hnGvGGiLGB6wHDYMB80G3kskvxvTUOB3aDZV5AdKJ+3ciYb7WI98znT7Dvl6speZ/tUaM3atWIoIoRgj4dFE3lhRDmkZCO0IBRzkywLgS5lbK+0wxjia4uS0dnZ6W5uIuprAM1WLBvSgU787zpetZQllyRE7IGXHJJSmRW1ImFcLJE3khr+TNerberQ/rc9qasWYzh2RO1tcv6bihpw==</latexit>

0j<k x2Bj
ŵt+1 6= wt+k
<latexit sha1_base64="NZlSzswu73wV8ij3q4xnrz4CWLE=">AAACFnicbVDLSgMxFM34rPVVdaebYBEEocwUUZdFNy4r2Ae0Zchk0jY0kxmTO5YyDPgZfoFb/QJ34tatH+B/mGm7sK0HAueecy/35niR4Bps+9taWl5ZXVvPbeQ3t7Z3dgt7+3UdxoqyGg1FqJoe0UxwyWrAQbBmpBgJPMEa3uAm8xuPTGkeynsYRawTkJ7kXU4JGMktHLb7BJJh6iZw5qS4LdkDHmbFIHULRbtkj4EXiTMlRTRF1S38tP2QxgGTQAXRuuXYEXQSooBTwdJ8O9YsInRAeqxlqCQB051k/IcUnxjFx91QmScBj9W/EwkJtB4FnukMCPT1vJeJ/3mtGLpXnYTLKAYm6WRRNxYYQpwFgn2uGAUxMoRQxc2tmPaJIhRMbDNbfJ2dluXizKewSOrlknNRKt+dFyvX04Ry6Agdo1PkoEtUQbeoimqIoif0gl7Rm/VsvVsf1uekdcmazhygGVhfvxegoAo=</latexit>

X ! labeled training set


<latexit sha1_base64="6GohzKNIHYKDBsoKYMh0H9UoLH8=">AAACJXicbVDLSgNBEJz1GeMr6tHLYBD0EnZF1GPQi8cIJgaSEHpnO8ng7Owy06uGJd/gZ/gFXvULvIngyZP/4W7MwagFDUVVN91dfqykJdd9d2Zm5+YXFgtLxeWV1bX10sZmw0aJEVgXkYpM0weLSmqskySFzdgghL7CK//6LPevbtBYGelLGsbYCaGvZU8KoEzqlvabvG1kf0BgTHTL24R3lCrwUWHAyYDUUve5RRp1S2W34o7B/xJvQspsglq39NkOIpGEqEkosLbluTF1UjAkhcJRsZ1YjEFcQx9bGdUQou2k45dGfDdTAt6LTFaa+Fj9OZFCaO0w9LPOEGhgf3u5+J/XSqh30kmljhNCLb4X9RLFKeJ5PjyQBgWpYUZAGJndysUADAjKUpzaEtj8tDwX73cKf0njoOIdVQ4uDsvV00lCBbbNdtge89gxq7JzVmN1Jtg9e2RP7Nl5cF6cV+ftu3XGmcxssSk4H1/DiabA</latexit>

rl(x; wt ) label
⇡ rl(x; wt+j ) ! assume
<latexit sha1_base64="378l+1EGh2AqVKD1KHFcGS1WhBU=">AAACQ3icbVDLSgMxFM34flt16SZYBEUoMyIquBFd6FLBqtAp5U6atrGZZEjuaMvQT/Iz/AJXgi5duRO3gpnahVUPBA7n3FdOlEhh0fefvJHRsfGJyanpmdm5+YXFwtLypdWpYbzMtNTmOgLLpVC8jAIlv04MhziS/CpqH+f+1S03Vmh1gd2EV2NoKtEQDNBJtcJJqCCSQOVG54De1XCThpAkRnfosJHh1k3PmUY0WwjG6DsaIu9gBtamMe/VCkW/5PdB/5JgQIpkgLNa4TWsa+ZaFTLphlQCP8FqBgYFk7w3E6aWJ8Da0OQVRxXE3Faz/od7dN0pddrQxj2FtK/+7MggtrYbR64yBmzZ314u/udVUmzsVzOhkhS5Yt+LGqmkqGmeHq0LwxnKriPAjHC3UtYCAwxdxkNb6jY/Lc8l+J3CX3K5XQp2S9vnO8XDo0FCU2SVrJENEpA9ckhOyRkpE0buySN5Ji/eg/fmvXsf36Uj3qBnhQzB+/wCHJOx3A==</latexit>

l(x; w) ! loss computed from sample x and its corresponding


<latexit sha1_base64="hsCncrp2aMOa7pXFii5R9K+rJbA=">AAACT3icbVDLThtBEJw1JDzyMnDkMoqJRC7WLooSJC6IXDiCggHJtqze2bY9Yh6rmV6wtfKH8Rkcc+KElHwBN5RZs4fwaGmkUlWNurrSXElPcfw7aiwsvnm7tLyy+u79h4+fmmvrp94WTmBHWGXdeQoelTTYIUkKz3OHoFOFZ+nFz0o/u0TnpTUnNM2xr2Fk5FAKoEANmr/U9mTv6ivvOTkaEzhnr3iPcEKlst5zYXVeEGZ86KzmHnSukG9NtjiYjEuqDM6hz63JpBlxBSmq2aDZitvxfPhLkNSgxeo5GjTvepkVhUZDQoH33STOqV+CIykUzlZ7hcccxAWMsBugAY2+X86Pn/EvgQn5rAvPEJ+z//8oQXs/1WlwaqCxf65V5Gtat6Dhbr+UpirAiMdFw0JxsrxqkmfSoSA1DQCEkyErF2NwICj0/WRL5qtoVS/J8xZegtOddvK9vXP8rbV/UDe0zDbZZ7bNEvaD7bNDdsQ6TLBrdsv+sL/RTXQfPTRqayOqwQZ7Mo2Vf0iTtVw=</latexit>

⌘ˆ = k⌘ =) ŵt+1 ⇡ wt+k
<latexit sha1_base64="z4N0gk0K6eAeNCttDBnMsZOQhtw=">AAACNnicbVDLSgNBEJz1bXxFPXoZDIIghF0RFUEQvXhUMDGQDaF30jFDZh/M9Kph2X/xM/wCr3r14k28+gnOxhyM2jBQXdVN9VSQKGnIdV+dicmp6ZnZufnSwuLS8kp5da1u4lQLrIlYxboRgEElI6yRJIWNRCOEgcLroH9W6Ne3qI2MoysaJNgK4SaSXSmALNUuH/k9oMxHgpwf8z4vEPdlaL3R8KF4l7cz2vFy7kOS6Pie3xV9P2+XK27VHRb/C7wRqLBRXbTL734nFmmIEQkFxjQ9N6FWBpqkUJiX/NRgAqIPN9i0MIIQTSsb/jHnW5bp8G6s7YuID9mfGxmExgzCwE6GQD3zWyvI/7RmSt3DViajJCWMxLdRN1WcYl4ExjtSoyA1sACElvZWLnqgQZCNdcylY4rTily83yn8BfXdqrdf3b3cq5ycjhKaYxtsk20zjx2wE3bOLliNCfbAntgze3EenTfn3fn4Hp1wRjvrbKyczy8wzaz0</latexit>

om sample x and its corresponding label


l(x; w) ! (e.g., cross-entropy + a regularization on w)Gradual Warmup
<latexit sha1_base64="5v2U5CHjJBDP9Pyt4HPFwG6P8M4=">AAACRXicbVA9TxtBEN0DQoAkYKBMs8JEMgqc7lAESDSINKEDCQOSbVlz6/F5xd7taXcOY07+TfwMfgEFDVQp6RBtsmdcBMjTrPT05nNflClpKQjuvInJqQ/TH2dm5z59/jK/UFlcOrE6NwLrQittziKwqGSKdZKk8CwzCEmk8DQ6/1nmTy/QWKnTYxpk2EogTmVXCiAntSsHqna521/jTSPjHoExus+bhJdU1NCP/XUujLZ2A1MyOhvw7xy4wThXYOTVaAR3sdpfXRu2K9XAD0bg70k4JlU2xmG78rvZ0SJP3GyhwNpGGGTUKsCQFAqHc83cYgbiHGJsOJpCgrZVjL485N+c0uFdbdxLiY/UfzsKSKwdJJGrTIB69m2uFP+Xa+TU3WkVMs1ywlS8LOrmipPmpX+8Iw0KUgNHQBjpbuWiBwYEOZdfbenY8rTSl/CtC+/JyaYfbvmbRz+qe/tjh2bYV7bCaixk22yP/WKHrM4Eu2a37J49eDfeo/fkPb+UTnjjnmX2Ct6fv2XkscM=</latexit>

<latexit sha1_base64="WmRxOn24c2wz17DLx7m+ZWI+YTM=">AAACFnicbVDLSsNAFJ34rPUVdaebwSK4KkkX6rLoQpcV7APaUG4mk3bo5MHMRCgh4Gf4BW71C9yJW7d+gP/hJM3Cth4YOJxz75zLcWPOpLKsb2NldW19Y7OyVd3e2d3bNw8OOzJKBKFtEvFI9FyQlLOQthVTnPZiQSFwOe26k5vc7z5SIVkUPqhpTJ0ARiHzGQGlpaF5PCj+SAX1slsBXgIcd0EESYyHZs2qWwXwMrFLUkMlWkPzZ+BFJAloqAgHKfu2FSsnBaEY4TSrDhJJYyATGNG+piEEVDppkZ/hM6142I+EfqHChfp3I4VAymng6skA1Fguern4n9dPlH/lpCyME0VDMgvyE45VhPNCsMcEJYpPNQEimL4VkzEIIErXNpfiyfy0TPdiL7awTDqNun1Rb9w3as3rsqEKOkGn6BzZ6BI10R1qoTYi6Am9oFf0Zjwb78aH8TkbXTHKnSM0B+PrF4d0oEs=</latexit>

ss-entropy + a regularization on w) rl(x; wt ) ⇡ rl(x; wt+j ) ! does not hold initially during training when the network is changing
<latexit sha1_base64="DbiwYHVJHbpfyLFCdUjF8/rRwTw=">AAACiXicbZHfahNBFMZn13+19U/UK/Hm0CBUhJAtokVvir3xsoJpC9kQzs6e7I6ZnVlmzpqEJY/gA/oAPoEv4GyaC9N6YODj+87hDL+T1Vp5Hg5/RfGdu/fuP9h7uH/w6PGTp71nzy+8bZykkbTauqsMPWllaMSKNV3VjrDKNF1m87Muv/xBzitrvvGqpkmFhVEzJZGDNe39TA1mGkEfLT/BYspvIMW6dnYJu0HLb7+vQ+hUUTI6ZxeQMi25zS15MJahtDoHZRQr1HoFeeOUKYAdBi+IRUkGuCQwxAvr5qA8yBJN0YUOa5Xr1Xra6w8Hw03BbZFsRV9s63za+53mVjYVGZYavR8nw5onLTpWUtN6P2081SjnWNA4SIMV+Um7wbaG18HJYWZdeIZh4/470WLl/arKQmeFXPqbWWf+Lxs3PDuZtMrUDZOR14tmjQa20N0AcuVIcgdJoXSBmOxQOJQcLrWzJffd1zouyU0Kt8XF8SB5Pzj++q5/+nlLaE+8EofiSCTigzgVX8S5GAkp/kQvo8OoHx/ESXwSf7xujaPtzAuxU/HZX6LjxtY=</latexit>

x; wt ) ⇡ rl(x; wt+j1) !Xdoes not hold initially during training when the network is changing rapidly <latexit sha1_base64="AfYEUtfP3QV7teRD6dcL7JQ1u/Y=">AAACaXicbVFdaxQxFM1M/aj1o1t9EX25uAgVcZkpooIIpQr6WNFtCzvLcCeb2Q1NMkNyx90lnV/pkz/AJ3+Bb2a2+9APDwQO59ybe3NS1Eo6SpJfUbxx4+at25t3tu7eu/9gu7fz8MhVjeViyCtV2ZMCnVDSiCFJUuKktgJ1ocRxcfqx849/COtkZb7TshZjjVMjS8mRgpT39Dz39DJt4QPMc4JXkAlCyEqL3KetP8s00oyj8gftWQuZa3TuF5BJAxecYBgsFILaXbzv7nkBmZXTGaG11RwyEgvy3z5/avNePxkkK8B1kq5Jn61xmPd+Z5OKN1oY4gqdG6VJTWOPliRXot3KGidq5Kc4FaNADWrhxn4VSwvPgzKBsrLhGIKVerHDo3ZuqYtQ2b3FXfU68X/eqKHy3dhLUzckDD8fVDYKqIIuY5hIKzipZSDIrQy7Ap9hiJTCT1yaMnHdal0u6dUUrpOjvUH6ZrD39XV//2Cd0CZ7yp6xXZayt2yffWGHbMg4+8n+RnG0Ef2Jd+LH8ZPz0jha9zxilxD3/wH49btD</latexit>

wt+1 = wt ⌘ rl(x; wt ) ! SGD Start with a learning rate of ⌘ and increment it by a constant amount at each
<latexit sha1_base64="B1xJMPiQVlm2j/2OvhT18PvbvXk=">AAACmnicbVHLbtNAFB2bAqW8AixhcUWChFhEdnhukKqyAbEpomkrJVF0Pb6ORxnPWDPXoMjKr/BffAD/wTj1grZcaTRH59zXnMlqrTwnye8ovrF389bt/TsHd+/df/Bw8OjxqbeNkzSVVlt3nqEnrQxNWbGm89oRVpmms2z9qdPPfpDzypoT3tS0qHBlVKEkcqCWg19zY5XJyTB8Z3QMPxWXgKAJnVFmBQ6ZwBYwmhPjCNDkoIx0VHUliiHbhGxpjWcMBFa26S4GQlkGndxuEARW6S7fdQL50K9EbrumW/gIa+jbF6ECRm9HQLWVpV8Ohsk42QVcB2kPhqKP4+Xgzzy3sum2kxq9n6VJzYs2vExJTduDeeOpRrnGFc0CNFiRX7Q7H7fwIjA5FNaFE16xY/+taLHyflNlIbNCLv1VrSP/p80aLj4sWmXqhsnIi0FFo4EtdJ8CuXIkWW8CQOlU2BVkiQ5lcOPylNx3q22DL+lVF66D08k4fTeefJsMD496h/bFU/FcvBSpeC8OxWdxLKZCRnvRq+h19CZ+Fh/FX+KvF6lx1Nc8EZciPvkLV3PKLQ==</latexit>

|B|
x2B
iteration until it reaches ⌘ˆ = k⌘ after 5 epochs
B ! mini-batch sampled from X
<latexit sha1_base64="xCExidHpsjW/ddyEDUbAWuxCSYk=">AAACNnicbVDLThtBEJwFwisBDDnmMsIgccHatRBEnBC5cCQSBktey+odt+0R81jN9EKslf8ln5Ev4EquXHJDueYTmDU+8EhLLZWqulXdleVKeorjh2hufuHD4tLyyurHT2vrG7XNrUtvCyewJayyrp2BRyUNtkiSwnbuEHSm8Cq7/lbpVzfovLTmgsY5djUMjRxIARSoXu041UAjAao8nfDUyeGIwDl7y1PCH1RqaeR+BiRG3IPOFfb5wFnNd9o7k16tHjfiafH3IJmBOpvVea/2mPatKDQaEgq87yRxTt0SHEmhcLKaFh5zENcwxE6ABjT6bjn9ccJ3AxPMrQttiE/ZlxslaO/HOguT1Uf+rVaR/9M6BQ2+dktp8oLQiGejQaE4WV4FxvvSoSA1DgCEk+FWLkbgQFCI9ZVL31enVbkkb1N4Dy6bjeSw0fx+UD85nSW0zL6wbbbHEnbETtgZO2ctJthPdsfu2e/oV/Qneoz+Po/ORbOdz+xVRf+eANzIrUc=</latexit>

Batch normalization with large mini-batches


<latexit sha1_base64="ppWdreR4GfJR7rl2m8Odnmb6kEk=">AAACMnicbVDLThtBEJyF8AhPQ465jGJF4oK164PJ0TIXjiDFgGRbVu9s2x4xM7ua6Q0yK/8Jn8EXcIUfgFsU5ZaPyOziQ4xT0kilqurpVsWZko7C8CVYWf2wtr6x+XFre2d3b792cHjp0twK7IpUpfY6BodKGuySJIXXmUXQscKr+Oa09K9+oHUyNd9pmuFAw9jIkRRAXhrWWv3qj8JiMusAiQk3qdWg5F0V4LeSJlyBHSPX0sjjuMygG9bqYSOswJdJNCd1Nsf5sPa7n6Qi12hIKHCuF4UZDQqwJIXC2VY/d5iBuIEx9jw1oNENiuq2Gf/qlYSPUuufIV6p/04UoJ2b6tgnNdDEvfdK8X9eL6fRt0EhTZYTGvG2aJQrTikvy+KJtChITT0BYaW/lYsJWBDkK13YkrjytJnvJXrfwjK5bDaiVqN50ay3O/OGNtln9oUdsYidsDY7Y+esywS7Z4/siT0HD8Fr8DP49RZdCeYzn9gCgj9/AapUrEs=</latexit>

|B| ! mini-batch size


<latexit sha1_base64="V0Xiq0Nw3YVUhEaYro9+eYN1SRY=">AAACLHicbVDLSgNBEJz1bXxFPXoZDIIXw66Iegx68ahgNJANoXfSSQZnZ5eZXjWu+Q0/wy/wql/gRcSjfoe7SQ4aLWgoqrrp7gpiJS257pszMTk1PTM7N19YWFxaXimurl3YKDECqyJSkakFYFFJjVWSpLAWG4QwUHgZXB3n/uU1GisjfU69GBshdLRsSwGUSc2ie++HQF0BKj3q33PfyE6XwJjohvuEt5SGUsudAEh0uZV32G8WS27ZHYD/Jd6IlNgIp83ip9+KRBKiJqHA2rrnxtRIwZAUCvsFP7EYg7iCDtYzqiFE20gHn/X5Vqa0eDsyWWniA/XnRAqhtb0wyDrzN+y4l4v/efWE2oeNVOo4IdRiuKidKE4Rz2PiLWlQkOplBISR2a1cdMGAoCzMX1taNj8tz8UbT+Evudgte/vl3bO9UuVolNAc22CbbJt57IBV2Ak7ZVUm2AN7Ys/sxXl0Xp1352PYOuGMZtbZLzhf3zjaqhw=</latexit>

lB (x; w) ! loss of a single sample x depends on t


<latexit sha1_base64="JWA4mKLEVDCQRYbsg038adFpmxE=">AAACiHicbVHLbtswEKTUV+q+nPZSoJdt7ALpoYYUFE2LXgL30mMK1EkAyzAoamURoUiBXCU2BP1Bf7Af0B/oF5RydMijCxAczMxyF8O0UtJRFP0Ownv3Hzx8tPN48OTps+cvhrsvT5yprcCZMMrYs5Q7VFLjjCQpPKss8jJVeJqef+v00wu0Thr9kzYVLkq+0jKXgpOnlsNfatkkJadCcNVM23Z//RUu30Ni5aogbq25hIRwTY0yzoHJgYOTeqUQHC8rf43XY8iwQp15WQMVXiH/uCMptn6leqsDqUGSg1Jq+SHlJAoYX5s9bpfDUTSJtgV3QdyDEevreDn8k2RG1CVqEoo7N4+jihYNt364wnaQ1A4rLs75Cuceal6iWzTb1Fp455kMcmP90QRb9npHw0vnNmXqnd2S7rbWkf/T5jXlnxeN1FVNqMXVoLxWQAa6L4BMWhSkNh5wYWUXlCi45YL8R92YkrlutS6X+HYKd8HJwST+NDn48XF0NO0T2mFv2B7bZzE7ZEfsOztmMybY3+B18DbYCwdhFB6GX66sYdD3vGI3Kpz+A+Pxxb0=</latexit>

lB (x; w) ! loss of a single sample x depends on the statistic of all samples in its m
<latexit sha1_base64="JWA4mKLEVDCQRYbsg038adFpmxE=">AAACiHicbVHLbtswEKTUV+q+nPZSoJdt7ALpoYYUFE2LXgL30mMK1EkAyzAoamURoUiBXCU2BP1Bf7Af0B/oF5RydMijCxAczMxyF8O0UtJRFP0Ownv3Hzx8tPN48OTps+cvhrsvT5yprcCZMMrYs5Q7VFLjjCQpPKss8jJVeJqef+v00wu0Thr9kzYVLkq+0jKXgpOnlsNfatkkJadCcNVM23Z//RUu30Ni5aogbq25hIRwTY0yzoHJgYOTeqUQHC8rf43XY8iwQp15WQMVXiH/uCMptn6leqsDqUGSg1Jq+SHlJAoYX5s9bpfDUTSJtgV3QdyDEevreDn8k2RG1CVqEoo7N4+jihYNt364wnaQ1A4rLs75Cuceal6iWzTb1Fp455kMcmP90QRb9npHw0vnNmXqnd2S7rbWkf/T5jXlnxeN1FVNqMXVoLxWQAa6L4BMWhSkNh5wYWUXlCi45YL8R92YkrlutS6X+HYKd8HJwST+NDn48XF0NO0T2mFv2B7bZzE7ZEfsOztmMybY3+B18DbYCwdhFB6GX66sYdD3vGI3Kpz+A+Pxxb0=</latexit>

k ! number of GPUs in a server


<latexit sha1_base64="m2HRLv/W+BN47852bPwPihdIkOc=">AAACK3icbVDLSgMxFM3U97vq0k2wCK7qTBF1WXShywrWCm0pmfROG5pJhuSOWoZ+hp/hF7jVL3CluPU/zNQurHogcDjnvnLCRAqLvv/mFWZm5+YXFpeWV1bX1jeKm1vXVqeGQ51rqc1NyCxIoaCOAiXcJAZYHEpohIOz3G/cgrFCqyscJtCOWU+JSHCGTuoUDwa0ZUSvj8wYfUdbCPeYqTQOwVAd0fNa3VKhKKMWjJsz6hRLftkfg/4lwYSUyAS1TvGz1dU8jUEhl8zaZuAn2M6YQcEljJZbqYWE8QHrQdNRxWKw7Wz8sRHdc0qXRtq4p5CO1Z8dGYutHcahq4wZ9u1vLxf/85opRiftTKgkRVD8e1GUSoqa5inRrjDAUQ4dYdwIdyvlfWYYR5fl1JauzU/Lcwl+p/CXXFfKwVG5cnlYqp5OElokO2SX7JOAHJMquSA1UiecPJAn8kxevEfv1Xv3Pr5LC96kZ5tMwfv8AnpxqIo=</latexit>

lB (x; w) ! loss of a single sample x depends on the statistic of all samples in its mini-batch B
<latexit sha1_base64="JWA4mKLEVDCQRYbsg038adFpmxE=">AAACiHicbVHLbtswEKTUV+q+nPZSoJdt7ALpoYYUFE2LXgL30mMK1EkAyzAoamURoUiBXCU2BP1Bf7Af0B/oF5RydMijCxAczMxyF8O0UtJRFP0Ownv3Hzx8tPN48OTps+cvhrsvT5yprcCZMMrYs5Q7VFLjjCQpPKss8jJVeJqef+v00wu0Thr9kzYVLkq+0jKXgpOnlsNfatkkJadCcNVM23Z//RUu30Ni5aogbq25hIRwTY0yzoHJgYOTeqUQHC8rf43XY8iwQp15WQMVXiH/uCMptn6leqsDqUGSg1Jq+SHlJAoYX5s9bpfDUTSJtgV3QdyDEevreDn8k2RG1CVqEoo7N4+jihYNt364wnaQ1A4rLs75Cuceal6iWzTb1Fp455kMcmP90QRb9npHw0vnNmXqnd2S7rbWkf/T5jXlnxeN1FVNqMXVoLxWQAa6L4BMWhSkNh5wYWUXlCi45YL8R92YkrlutS6X+HYKd8HJwST+NDn48XF0NO0T2mFv2B7bZzE7ZEfsOztmMybY3+B18DbYCwdhFB6GX66sYdD3vGI3Kpz+A+Pxxb0=</latexit>

n ! number of samples per GPU


<latexit sha1_base64="GdACsXu+gJ4K5k7l1xoJ/m4oEi0=">AAACKnicbVDLSgNBEJz1/Tbq0ctgEDyFXRH1GPSgxwgmCkkIs5PeZHAey0yvGpb8hZ/hF3jVL/AmXvU/nE1y8FUwUF3VTfdUnErhMAzfgqnpmdm5+YXFpeWV1bX10sZmw5nMcqhzI429jpkDKTTUUaCE69QCU7GEq/jmtPCvbsE6YfQlDlJoK9bTIhGcoZc6pYqmLSt6fWTWmjvaQrjHXGcqBktNQh1TqQRHU1+e1erDTqkcVsIR6F8STUiZTFDrlD5bXcMzBRq5ZM41ozDFds4sCi5huNTKHKSM37AeND3VTIFr56N/DemuV7o0MdY/jXSkfp/ImXJuoGLfqRj23W+vEP/zmhkmx+1c6DRD0Hy8KMkkRUOLkGhXWOAoB54wboW/lfI+s4yjj/LHlq4rTityiX6n8Jc09ivRYWX/4qBcPZkktEC2yQ7ZIxE5IlVyTmqkTjh5IE/kmbwEj8Fr8Ba8j1ungsnMFvmB4OMLFLuoXQ==</latexit>

Baseline: k = 8, n = 32, and |B| = kn = 256.


<latexit sha1_base64="XDynShBaInTvS6TxxADWYeo9/1o=">AAACOXicbZDLSsNAFIYn9V5vUZduBhvBhZQk4gWhIHXjsoK9QFvKZDJth04mYWYilDRP42P4BG515dKFIG59ASc1C209MPDx/+dwzvxexKhUtv1qFBYWl5ZXVteK6xubW9vmzm5DhrHApI5DFoqWhyRhlJO6ooqRViQICjxGmt7oOvOb90RIGvI7NY5IN0ADTvsUI6Wlnlmp5sOX0BrBCrywjqHFNZy4mhD3oTXpBEgNMWJJNZ1oZwQz3z09s8o9s2SX7WnBeXByKIG8aj3zveOHOA4IV5ghKduOHalugoSimJG02IkliRAeoQFpa+QoILKbTL+ZwkOt+LAfCv24glP190SCAinHgac7s4vlrJeJ/3ntWPUvugnlUawIxz+L+jGDKoRZZtCngmDFxhoQFlTfCvEQCYSVTvbPFl9mp6U6F2c2hXlouGXnrOzeuqWrap7QKtgHB+AIOOAcXIEbUAN1gMEDeALP4MV4NN6MD+Pzp7Vg5DN74E8ZX9+/96hV</latexit>

⌘ = 0.1 ! learning rate


<latexit sha1_base64="aD7xlKz5jzhgjaE81YZ/IH7rfFM=">AAACJ3icbVDLSgNBEJz1/Tbq0ctgEAQh7AZRL4LoxWMEo0I2hN5JJxkyO7vM9KphyUf4GX6BV/0Cb6JHD/6Hu0kOmljQUFR1090VxEpact1PZ2p6ZnZufmFxaXlldW29sLF5baPECKyKSEXmNgCLSmqskiSFt7FBCAOFN0H3PPdv7tBYGekr6sVYD6GtZUsKoExqFPZ9JOAn3C153Dey3SEwJrrnPuEDpQrBaKnb3ABhv1EouiV3AD5JvBEpshEqjcK334xEEqImocDamufGVE/BkBQK+0t+YjEG0YU21jKqIURbTwdP9flupjR5KzJZaeID9fdECqG1vTDIOkOgjh33cvE/r5ZQ67ieSh0nhFoMF7USxSnieUK8KQ0KUr2MgDAyu5WLDhgQlOX4Z0vT5qfluXjjKUyS63LJOyyVLw+Kp2ejhBbYNtthe8xjR+yUXbAKqzLBHtkze2GvzpPz5rw7H8PWKWc0s8X+wPn6AY2Bpok=</latexit>

Learning rates for large mini-batches


<latexit sha1_base64="wsFn56bsQfR5wxLLxnNo0QaDYHA=">AAACLHicbVA7TsNAFFzzDf8AJc2KCImGyE4BlBE0FBRBIglSEkXP6+dkxXpt7T4jRVauwTE4AS2cgAYhSjgHjpOC31SjmfcZjZ8oacl1X525+YXFpeXSyura+sbmVnl7p2Xj1AhsiljF5sYHi0pqbJIkhTeJQYh8hW3/9nzit+/QWBnraxol2ItgoGUoBVAu9ctut7iRGQzGlwhGSz3gBggtD2PDFZgB8khqeeQDiSHafrniVt0C/C/xZqTCZmj0yx/dIBZphJqEAms7nptQLwNDUigcr3ZTiwmIWxhgJ6caIrS9rEg15ge5EhRJwlgTL9TvGxlE1o4iP5+MgIb2tzcR//M6KYWnvUzqJCXUYvooTBWnmE9q4oE0KEiNcgLCyDwrF0MwICgv88eXwE6ijfNevN8t/CWtWtU7rtauapX62ayhEttj++yQeeyE1dkFa7AmE+yePbIn9uw8OC/Om/M+HZ1zZju77Aeczy87rKmA</latexit>

<latexit sha1_base64="SVWed4FJaIfB4GNZqMh7VOUttN8=">AAACH3icbVDLTgIxFO3gC/GFunTTSExwQ2ZYqEuiG5eYyCMBQjqdCzS0M5P2jgkhfICf4Re41S9wZ9zyAf6HHZiFgidpcnLOffX4sRQGXXfu5DY2t7Z38ruFvf2Dw6Pi8UnTRInm0OCRjHTbZwakCKGBAiW0Yw1M+RJa/vgu9VtPoI2IwkecxNBTbBiKgeAMrdQvlgKGjMZMMyntEKNoWSUSRSyBGtBp66WtcivuAnSdeBkpkQz1fvG7G0Q8URAil8yYjufG2JsyjYJLmBW6iYGY8TEbQsfSkCkwveniMzN6YZWADiJtX4h0of7umDJlzET5tlIxHJlVLxX/8zoJDm56UxHGCULIl4sGiaQY0TQZGggNHOXEEsa1sLdSPrLBcLQh/NkSmPS0mc3FW01hnTSrFe+qUn2olmq3WUJ5ckbOSZl45JrUyD2pkwbh5Jm8kjfy7rw4H86n87UszTlZzyn5A2f+A0LSo9I=</latexit>

data parallelism (multiple servers) X n


! all distinct subsets of size n drawn from th
<latexit sha1_base64="Do4OkWlsoThMNegjJPdw0fFemnA=">AAACXXicbZA9TxtBEIbXByTEIeCQIkWaEXakVNYdQoQSJQ0lkWKwZDvW3N6cvWJv97Q7B3FO/n35DVRUlGmTNmvjgq+RVnr1zuc+aamV5zi+bkRr6xsvXm6+ar7eerO903q7e+Zt5ST1pNXW9VP0pJWhHivW1C8dYZFqOk8vvi7y55fkvLLmO89KGhU4MSpXEjlY4xb2fxgYOjWZMjpnr2DI9JNr1BqysF0ZyeCr1BN7sDl49YugYzqQObwykDtbAE8JbJigDGpgh8ooM4HQAZ1+Z94ct9pxN14GPBXJSrTFKk7HrdthZmVVkGGp0ftBEpc8qtGxkprmzWHlqUR5gRMaBGmwID+qlyjm8DE4GeTWhWcYlu79jhoL72dFGioL5Kl/nFuYz+UGFedHo1qZsmIy8m5RXoUPW1hwDbQcSdazIFA6FW4FOUWHkgP9B1syvzhtHrgkjyk8FWf73eSwu//toH38ZUVoU3wQe+KTSMRncSxOxKnoCSl+iz/ir/jXuIk2oq1o+640aqx63okHEb3/D8UOuDE=</latexit>

<latexit sha1_base64="vTlClPQmNYtuqLqu5le/hH6Tyts=">AAACTnicbVDLSsNAFJ3UV31XXboZrIKrkhR8IAiiGxcufFWFppSb6U0dOpmEmYlQYv7Lz3Drxp3oF7gTnbRd+LowcDjn3Ln3niARXBvXfXJKY+MTk1Pl6ZnZufmFxcrS8pWOU8WwwWIRq5sANAousWG4EXiTKIQoEHgd9I4K/foOleaxvDT9BFsRdCUPOQNjqXbl/MQ2gqIXDOwXXXqeCtyj6z4aoPvUrXnUDxWw7N6PwNxaU3aY3+dZfWs7t/o3Q4/KIb3erlTdmjso+hd4I1AlozptV178TszSCKVhArRuem5iWhkow5nAfMZPNSbAetDFpoUSItStbHB7Tjcs06FhrOyThg7Y7x0ZRFr3o8A6iwv0b60g/9OaqQl3WxmXSWpQsuGgMBXUxLQIkna4QmZE3wJgittdKbsFG4Wxcf+Y0tHFarnNxfudwl9wVa9527X6Wb16cDhKqExWyRrZJB7ZIQfkmJySBmHkgTyTV/LmPDrvzofzObSWnFHPCvlRpfIXbvWyYQ==</latexit>

|B| kn
Linear Scaling Rule: ⌘ = 0.1 256 = 0.1 256 X n ! all distinct subsets of size n drawn from the original training set X
<latexit sha1_base64="Do4OkWlsoThMNegjJPdw0fFemnA=">AAACXXicbZA9TxtBEIbXByTEIeCQIkWaEXakVNYdQoQSJQ0lkWKwZDvW3N6cvWJv97Q7B3FO/n35DVRUlGmTNmvjgq+RVnr1zuc+aamV5zi+bkRr6xsvXm6+ar7eerO903q7e+Zt5ST1pNXW9VP0pJWhHivW1C8dYZFqOk8vvi7y55fkvLLmO89KGhU4MSpXEjlY4xb2fxgYOjWZMjpnr2DI9JNr1BqysF0ZyeCr1BN7sDl49YugYzqQObwykDtbAE8JbJigDGpgh8ooM4HQAZ1+Z94ct9pxN14GPBXJSrTFKk7HrdthZmVVkGGp0ftBEpc8qtGxkprmzWHlqUR5gRMaBGmwID+qlyjm8DE4GeTWhWcYlu79jhoL72dFGioL5Kl/nFuYz+UGFedHo1qZsmIy8m5RXoUPW1hwDbQcSdazIFA6FW4FOUWHkgP9B1syvzhtHrgkjyk8FWf73eSwu//toH38ZUVoU3wQe+KTSMRncSxOxKnoCSl+iz/ir/jXuIk2oq1o+640aqx63okHEb3/D8UOuDE=</latexit>

<latexit sha1_base64="S4Zg1H6m7KBeumGlG/iiVYDfVic=">AAACGnicbVC7TsMwFHXKq5RXgBEJGVqkslRJB2CsYGFgKBJ9SG1UOY7TWnWcyHZAVdSNz+ALWOEL2BArCx/Af+C0EaItR7J07jn36l4fN2JUKsv6MnJLyyura/n1wsbm1vaOubvXlGEsMGngkIWi7SJJGOWkoahipB0JggKXkZY7vEr91j0Rkob8To0i4gSoz6lPMVJa6pmHpZvyw2kJ4gHifSIhkrDEf8ujnlm0KtYEcJHYGSmCDPWe+d31QhwHhCvMkJQd24qUkyChKGZkXOjGkkQID1GfdDTlKCDSSSb/GMMTrXjQD4V+XMGJ+nciQYGUo8DVnQFSAznvpeJ/XidW/oWTUB7FinA8XeTHDKoQpqFAjwqCFRtpgrCg+tY0AYGw0tHNbPFketpY52LPp7BImtWKfVap3laLtcssoTw4AMegDGxwDmrgGtRBA2DwCJ7BC3g1now34934mLbmjGxmH8zA+PwBmBGfgg==</latexit>

L(w) changes as n changes!


<latexit sha1_base64="5iIo01Iq8d2ljSzBSEqqXKhQRkM=">AAACN3icbVDLSgNBEJz1/Tbq0ctoEPQSdnNQwYtoQI+RGBViCL2zvWbI7Owy0yuEkI/xM/wCr3r05Enx6h84iTn4amgoqrro7gozJS35/rM3Nj4xOTU9Mzs3v7C4tFxYWb2waW4E1kWqUnMVgkUlNdZJksKrzCAkocLLsHM80C9v0ViZ6nPqZthM4EbLWAogR7UKB7U8dCaSaDnoiFclxaCU5WnMK269kWFOGPHaSYVvW0RObeQZZGg2dlqFol/yh8X/gmAEimxU1Vbh9TpKRZ6gJqHA2kbgZ9TsgSEpFPbnrnOLGYgO3GDDQQ0J2mZv+GSfbzkm4nFqXGviQ/a7oweJtd0kdJMJUNv+1gbkf1ojp3i/2ZM6c49q8bUozhWnlA8S45E0KEh1HQBhpLuVizYYEORy/bElsoPT+i6X4HcKf8FFuRTslspn5eLh0SihGbbONtk2C9geO2SnrMrqTLA79sAe2ZN37714b9771+iYN/KssR/lfXwCQNKsPw==</latexit>

Subtleties and Pitfalls of Distributed SGD (see the paper!)


Batch normalization with large mini-batches
<latexit sha1_base64="ppWdreR4GfJR7rl2m8Odnmb6kEk=">AAACMnicbVDLThtBEJyF8AhPQ465jGJF4oK164PJ0TIXjiDFgGRbVu9s2x4xM7ua6Q0yK/8Jn8EXcIUfgFsU5ZaPyOziQ4xT0kilqurpVsWZko7C8CVYWf2wtr6x+XFre2d3b792cHjp0twK7IpUpfY6BodKGuySJIXXmUXQscKr+Oa09K9+oHUyNd9pmuFAw9jIkRRAXhrWWv3qj8JiMusAiQk3qdWg5F0V4LeSJlyBHSPX0sjjuMygG9bqYSOswJdJNCd1Nsf5sPa7n6Qi12hIKHCuF4UZDQqwJIXC2VY/d5iBuIEx9jw1oNENiuq2Gf/qlYSPUuufIV6p/04UoJ2b6tgnNdDEvfdK8X9eL6fRt0EhTZYTGvG2aJQrTikvy+KJtChITT0BYaW/lYsJWBDkK13YkrjytJnvJXrfwjK5bDaiVqN50ay3O/OGNtln9oUdsYidsDY7Y+esywS7Z4/siT0HD8Fr8DP49RZdCeYzn9gCgj9/AapUrEs=</latexit>

Bj , 0  j < k ! sequence of k minibatches of size n


<latexit sha1_base64="ug5XtUL+P88KISbTqefAaFkFjkI=">AAACUnicbVI9bxNBEF2bAEkI4EBJM8JGokDWXYQSihRRaCiDhJ1IPsuaW8/ZG+/tXnbnAHPyP+NnpEkbKRX8glTsOS7Ix0grPb03o5n3tGmhlecoumg0H609fvJ0fWPz2dbzFy9b26/63pZOUk9abd1Jip60MtRjxZpOCkeYp5qO09nnWj/+Ts4ra77xvKBhjhOjMiWRAzVq9ZMceSpRV4eL0ekHiCDRdAansA8zSJyaTBmdsz8gYfrJlaezkowksBl0Zh3IlVEpspySrymvfhF0TGcxarWjbrQsuA/iFWiLVR2NWlfJ2MoyJ8NSo/eDOCp4WKFjJTUtNpPSU4FyhhMaBGgwJz+slv4X8C4wY8isC88wLNn/JyrMvZ/naeis3fq7Wk0+pA1Kzj4NK2WKkoPrm0VZqYEt1GHCWDmSrOcBoHQq3Apyig4lh8hvbRn7+rQ6l/huCvdBf6cb73Z3vn5sHxyuEloXb8Rb8V7EYk8ciC/iSPSEFL/Fpfgj/jbOG9fN8EtuWpuN1cxrcauaW/8Aowazzg==</latexit>

[ The BN statistics should not be computed across all workers, not only for the sake of reducing
communication, but also for maintaining the same underlying loss function being optimized.
<latexit sha1_base64="HiDuPg0AdmYLZo9nymosvt2I7kk=">AAACUXicbVA9b9RAEN0zgYQkwAElzSiXSKlOdoSAgiIKTcogcUmk88ka7419m1vvmt1x4LD8y/gZVCkpaOAX0GFfriAfTxrp6b0ZzcxLS608h+FVL3iw9vDR+sbjza3tJ0+f9Z+/OPW2cpJG0mrrzlP0pJWhESvWdF46wiLVdJbOP3T+2SU5r6z5xIuSJgXmRmVKIrdS0h/FqcplVSZ1CLGmz3AB72HeQFwgzyTq+qhJLiB2Kp8xOme/QMz0lWuNLicolFEpspyBzcCrbwS7czC7TdIfhMNwCbhLohUZiBVOkv6veGplVZBhqdH7cRSWPKnRsZKams248lSinGNO45YaLMhP6uX7Dey1yhQy69oyDEv1/4kaC+8XRdp2dl/5214n3ueNK87eTWplyorJyOtFWaWBLXRZwlQ5kqwXLUHpVHsryBk6lNwmfmPL1HendblEt1O4S04PhtGb4cHH14PDo1VCG+KV2BH7IhJvxaE4FidiJKT4Ln6K3+JP70fvbyCC4Lo16K1mXoobCLb+AYOGtGI=</latexit>

Bj ! large minibatch of size kn


0j<k Goyal, Priya, et al. "Accurate, large minibatch sgd: Training imagenet in 1 hour." arXiv preprint arXiv:1706.02677 (2017).
SGDR: Stochastic Gradient
Descent with Warm Restarts
n ! number of free parameters
<latexit sha1_base64="dCi0LZQgg9EMIPARsmbKEIzhH40=">AAACF3icbVDLSgMxFM34tr5GXboJFsFVmVFRl6IblxVsFdpSMumdNjSTDMkdtQz9Czf+ihsXirjVnX9jWmeh1gOBwzn3kXuiVAqLQfDpTU3PzM7NLyyWlpZXVtf89Y261ZnhUONaanMdMQtSKKihQAnXqQGWRBKuov7ZyL+6AWOFVpc4SKGVsK4SseAMndT2K4o2jej2kBmjb2kT4Q5zlSURGKpjGhsAmjLDEkA3Zdj2y0ElGINOkrAgZVKg2vY/mh3NswQUcsmsbYRBiq2cGRRcwrDUzCykjPdZFxqOKrfItvLxXUO645QOjbVxTyEdqz87cpZYO0giV5kw7Nm/3kj8z2tkGB+3cqHSDEHx70VxJilqOgqJdoQBjnLgCONGuL9S3nMp8FEIJRdC+PfkSVLfq4SHlf2Lg/LJaRHHAtki22SXhOSInJBzUiU1wsk9eSTP5MV78J68V+/tu3TKK3o2yS94718Nb6B/</latexit>

f : Rn ! R
<latexit sha1_base64="TkUgG4EOGrwbUNJftZ37scNlgb4=">AAACEHicbVC7TsMwFHV4lvIKMLJYVAimKgEEiKmChbEg+pCaUDmu01p17Mh2QFXUT2DhV1gYQIiVkY2/wWkzlJYjXenonHt17z1BzKjSjvNjzc0vLC4tF1aKq2vrG5v21nZdiURiUsOCCdkMkCKMclLTVDPSjCVBUcBII+hfZX7jgUhFBb/Tg5j4EepyGlKMtJHa9kF4Ab0I6V4QpLfDew49Sbs9jaQUjxMGbNslp+yMAGeJm5MSyFFt299eR+AkIlxjhpRquU6s/RRJTTEjw6KXKBIj3Edd0jKUo4goPx09NIT7RunAUEhTXMOROjmRokipQRSYzuxENe1l4n9eK9HhuZ9SHieacDxeFCYMagGzdGCHSoI1GxiCsKTmVoh7SCKsTYZFE4I7/fIsqR+V3dPy8c1JqXKZx1EAu2APHAIXnIEKuAZVUAMYPIEX8AberWfr1fqwPsetc1Y+swP+wPr6BZMxnPI=</latexit>

function to be minimized
<latexit sha1_base64="XhPGxVJ2VVphVS++9+2VcL3vNAc=">AAACCHicbVC7SgNBFJ31bXxFLS0cDIJV2FVRy6CNpYJ5QBLC7OSuDs7MLjN3xbiktPFXbCwUsfUT7PwbZ5MUmnhg4HDOvdw5J0yksOj7397U9Mzs3PzCYmFpeWV1rbi+UbNxajhUeSxj0wiZBSk0VFGghEZigKlQQj28Pcv9+h0YK2J9hb0E2opdaxEJztBJneJ2C+EesyjVPBcoxjQEqoQWSjxAt98plvyyPwCdJMGIlMgIF53iV6sb81SBRi6Ztc3AT7CdMYOCS+gXWqmFhPFbdg1NRzVTYNvZIEif7jqlS6PYuKeRDtTfGxlT1vZU6CYVwxs77uXif14zxeiknQmdpAiaDw9Fqczj5q3QrjDAUfYcYdwI91fKb5hhHF13BVdCMB55ktT2y8FR+eDysFQ5HdWxQLbIDtkjATkmFXJOLkiVcPJInskrefOevBfv3fsYjk55o51N8gfe5w9mh5o4</latexit>

xt 2 Rn ! parameter vector at time step t


<latexit sha1_base64="Pkh6J9k8V0YUF6RuoghrZuElfnk=">AAACMnicbVC7TgMxEPTxJrwClDQWAYkqugMElAga6ACRgJQLkc/ZSyx89sneA6JTvomGL0GigAKEaPkInEfBa6vZmV3tzkSpFBZ9/9kbGR0bn5icmi7MzM7NLxQXl6pWZ4ZDhWupzWXELEihoIICJVymBlgSSbiIrg97+sUNGCu0OsdOCvWEtZSIBWfoqEbx+K6BNBSKhgnDdhTlZ90r1xjRaiMzRt/SEOEO85QZlgCCoTfAURvKkKJIgFqElK7hWrdRLPllv1/0LwiGoESGddIoPoZNzbMEFHLJrK0Ffor1nBkUXEK3EGYWUsavWQtqDip339bzvuUuXXdMk8bukVgrpH32+0bOEms7SeQme8bsb61H/qfVMoz36rlQaYag+OBQnEmKmvbyo01hXACy4wDjRrhfKW+7cLjLxhZcCMFvy39BdbMc7JS3TrdL+wfDOKbIClklGyQgu2SfHJETUiGc3JMn8krevAfvxXv3PgajI95wZ5n8KO/zC9l3qxk=</latexit>

rft (xt ) ! gradient information obtained on a relatively small t-th batch of b data-points
<latexit sha1_base64="kiUvbru1VbE18mCdGVnQk3lKVTM=">AAACX3icbVFNb9NAEF2bj5a0tC6cEJcRCVI5NLIBFY4VXDgWibSV4siaXa/jVde71u44NLLyJ7khceGfsElzgJaRVnp6M6N57y1vtfKUpj+j+MHDR493dp8M9vafHhwmR88uvO2ckBNhtXVXHL3UysgJKdLyqnUSG67lJb/+vO5fLqTzyppvtGzlrMG5UZUSSIEqkkVukGuEqqDjm4LeQO7UvCZ0zn6HnOQN9XOHpZKGQJnKumazCJYThpMlBIzgpA70Quol+Aa1hhGNTqgGjiRqsBWM+AhKJDxprTLkV0UyTMfppuA+yLZgyLZ1XiQ/8tKKrgk6hEbvp1na0qxHR0pouRrknZctimucy2mABhvpZ/0mnxW8DkwJQXx4wceG/Xujx8b7ZcPDZLBX+7u9Nfm/3rSj6uOsV6btSBpxe6jqNJCFddhQKicFhVRKhcKpoBVEjQ4FhS8ZhBCyu5bvg4u34+x0/O7r++HZp20cu+wle8WOWcY+sDP2hZ2zCRPsVxRHe9F+9DveiQ/i5HY0jrY7z9k/Fb/4A1yTtX8=</latexit>

on a relatively small t-th batch of b data-points


<latexit sha1_base64="6j6zLH82SFUmpYG64Qskt7FUusE=">AAACFXicbVDLSsNAFJ3UV62vqEs3wSJUkJJUUJdFC3VZqX1AG8pkctsOnTyYmQgl9Cfc+CtuXCjiVnDn3zhJs9DWA5c5nPuc44SMCmma31puZXVtfSO/Wdja3tnd0/cP2iKIOIEWCVjAuw4WwKgPLUklg27IAXsOg44zuUnynQfgggb+vZyGYHt45NMhJVgqaaCf9dMZMQd31pQBGWMhKTHqHLsUfGnUQJDkLTXrtdOBXjTLZgpjmVgZKaIMjYH+1XcDEnlqAmFYiJ5lhtKOMVc7GMwK/UhAiMkEj6CnqI89EHacXjQzTpTiGsOAq1AXpOrvjhh7Qkw9R1V6WI7FYi4R/8v1Ijm8smPqh5EEn8wXDSNmyMBILDJcyoFINlUEE04TP5QvHBOpjCwoE6zFLy+TdqVsXZTP7yrF6nVmRx4doWNUQha6RFV0ixqohQh6RM/oFb1pT9qL9q59zEtzWtZziP5A+/wBCvSerw==</latexit>

Stochastic Gradient Descent (SGD)


xt+1 = xt ⌘t rft (xt )
<latexit sha1_base64="Z4Wyv8jCMuTGkCQM7TPnB+cxhEw=">AAACD3icbZC7SgNBFIZnvcZ4W7W0GQyKIoZdFbURgjaWEYwKSVjOTmZ1yOzsMnNWEpa8gY2vYmOhiK2tnW/j5FJ4+2Hg4z/ncOb8YSqFQc/7dMbGJyanpgszxdm5+YVFd2n50iSZZrzGEpno6xAMl0LxGgqU/DrVHOJQ8quwfdqvX91xbUSiLrCb8mYMN0pEggFaK3A3OkGO236PHtNOgHSHNjiChYaCUAKNAty0/lbglryyNxD9C/4ISmSkauB+NFoJy2KukEkwpu57KTZz0CiY5L1iIzM8BdaGG163qCDmppkP7unRdeu0aJRo+xTSgft9IofYmG4c2s4Y8Nb8rvXN/2r1DKOjZi5UmiFXbLgoyiTFhPbDoS2hOUPZtQBMC/tXym5BA0MbYdGG4P8++S9c7pb9g/Le+X6pcjKKo0BWyRrZJD45JBVyRqqkRhi5J4/kmbw4D86T8+q8DVvHnNHMCvkh5/0LJfKa2w==</latexit>

⌘t ! learning rate
<latexit sha1_base64="l606J9IF17qrLg3reFdZOssTuoA=">AAACEHicbVA9SwNBEN3z2/gVtbRZDKJVuFNRy6CNpYJJhFwIc5tJsri3d+zOqeHIT7Dxr9hYKGJraee/cRNT+PVg4PHeDDPzolRJS77/4U1MTk3PzM7NFxYWl5ZXiqtrNZtkRmBVJCoxlxFYVFJjlSQpvEwNQhwprEdXJ0O/fo3GykRfUD/FZgxdLTtSADmpVdwOkaBFPDSy2yMwJrnhIeEt5QrBaKm73ADhoFUs+WV/BP6XBGNSYmOctYrvYTsRWYyahAJrG4GfUjMHQ1IoHBTCzGIK4gq62HBUQ4y2mY8eGvAtp7R5JzGuNPGR+n0ih9jafhy5zhioZ397Q/E/r5FR56iZS51mhFp8LepkilPCh+nwtjQoSPUdAWGku5WLHhgQ5DIsuBCC3y//JbXdcnBQ3jvfL1WOx3HMsQ22yXZYwA5ZhZ2yM1Zlgt2xB/bEnr1779F78V6/Wie88cw6+wHv7RPUMZ28</latexit>

<latexit sha1_base64="x1H+8zlD1WJ0v3cE1Z4bQ+R8I70=">AAACKXicbVDLTgIxFO3gC/GFunTTSExwQ2Y0UZdETXCJIEIChHQ6F2nsPNLe0ZAJv+PGX3GjiUbd+iN2gIWiN2l6cs69p7fHjaTQaNsfVmZufmFxKbucW1ldW9/Ib25d6zBWHBo8lKFquUyDFAE0UKCEVqSA+a6Epnt7lurNO1BahMEVDiPo+uwmEH3BGRqqly93xh6JAm9Ux5APmEbBaUUxT0CA9Bw0T+97gQPaZMqnNdDIFGparFfOa/u9fMEu2eOif4EzBQUyrWov/9LxQh77xpVLpnXbsSPsJsZScAmjXCfWEDF+y26gbWDAfNDdZLzliO4ZxqP9UJljthqzPycS5ms99F3T6TMc6FktJf/T2jH2T7qJCKIYIeCTh/qxpBjSNDbqCQUc5dAAxpVIMzJZKcbRhJszITizX/4Lrg9KzlHp8PKgUD6dxpElO2SXFIlDjkmZXJAqaRBOHsgTeSVv1qP1bL1bn5PWjDWd2Sa/yvr6BrrLptQ=</latexit>

Stochastic Gradient Descent with Warm Restarts (SGDR)


Gradient Descent with Warm Restarts (SGDR)
i ! indexes a new warm-started run
<latexit sha1_base64="6gD6Mqu8mYq/klXgwsWEDEoewiU=">AAACHHicbVDLSgMxFM34rPVVdekmWAQ3lhkr6rLoxmUF+4C2lEzmtg3NZIbkjm0p/RA3/oobF4q4cSH4N6aPhbYeCBzOPTfJOX4shUHX/XaWlldW19ZTG+nNre2d3czeftlEieZQ4pGMdNVnBqRQUEKBEqqxBhb6Eip+92Y8rzyANiJS9ziIoRGythItwRlaqZnJC1rXot1BpnXUo3WEPg6FCqAPhjKqoEd7TIenxhoQAqoTNWpmsm7OnYAuEm9GsmSGYjPzWQ8inoSgkEtmTM1zY2wM7Y2CSxil64mBmPEua0PNUsVCMI3hJNyIHlsloK1I26OQTtTfG0MWGjMIfesMGXbM/Gws/jerJdi6atiscYKg+PShViIpRnTcFA2EBo5yYAnjWti/Ut5hmnG0faZtCd585EVSPst5F7n83Xm2cD2rI0UOyRE5IR65JAVyS4qkRDh5JM/klbw5T86L8+58TK1LzmzngPyB8/UDtFGiYw==</latexit>

Restart SGD once Ti epochs are performed


<latexit sha1_base64="GOhqlr4VOGqpmb6E8nukL3uAzF4=">AAACFHicbVDLSgMxFM34rPVVdekm2AqCUGYqqMuigi6r9gVtKZn0ThuaSYYkI5ShH+HGX3HjQhG3Ltz5N6aPhbYeCBzOuZeTe/yIM21c99tZWFxaXllNraXXNza3tjM7u1UtY0WhQiWXqu4TDZwJqBhmONQjBST0OdT8/uXIrz2A0kyKshlE0ApJV7CAUWKs1M4c34E2RBl8f32FpaCAc+U2y2GIJO1pTBTgCFQgVQiddibr5t0x8DzxpiSLpii1M1/NjqRxCMJQTrRueG5kWomNY5TDMN2MNUSE9kkXGpYKEoJuJeOjhvjQKh1sk+0TBo/V3xsJCbUehL6dDInp6VlvJP7nNWITnLcSJqLYgKCToCDm2Eg8agh3mAJq+MASQhWzf8W0RxShxvaYtiV4syfPk2oh753mT24L2eLFtI4U2kcH6Ah56AwV0Q0qoQqi6BE9o1f05jw5L8678zEZXXCmO3voD5zPH8EonWk=</latexit>

Tcur ! number of epochs performed since the last restart


<latexit sha1_base64="70JtXjm/Jo+TTyi3AuBqNis9kIY=">AAACNXicbVA9bxNBEN0LAYz5MklJM8JCorLuICKUETQpKIwUJ5Fsy9pbz/lW3o/T7lwS63R/iib/IxUUKRKhtPkLrD8KsPOklZ7em5mdeWmhpKc4/h1tPdp+/ORp41nz+YuXr1633uwce1s6gT1hlXWnKfeopMEeSVJ4WjjkOlV4kk6/zf2TM3ReWnNEswKHmk+MzKTgFKRR6/vRqBKlq2Hg5CQn7pw9hwHhBVWm1Ck6sBlgYUXuoUCXWadxDF4agUA5guKewKEPnVSPWu24Ey8AmyRZkTZboTtqXQ3GVpQaDYkwyfeTuKBhFWZJobBuDkqPBRdTPsF+oIZr9MNqcXUN74MyhrBReIZgof7bUXHt/UynoVJzyv26Nxcf8volZV+GlTRFSWjE8qOsVEAW5hHCWDoUpGaBcOFk2BVEzh0XFIJuhhCS9ZM3yfHHTvK58+nHXvvg6yqOBnvL3rEPLGH77IAdsi7rMcF+sl/sht1Gl9F19Ce6W5ZuRaueXfYfovu/rk+tLg==</latexit>

since the last restart


Tcur is updated at each batch iteration t, it can take fractional values such as 0.1, 0.2, etc.
<latexit sha1_base64="zagi7tL8dSS2mOpdqaVjsdQlbfA=">AAACVnicbVFNbxMxEPVuKS3ha4EjlxEJEodotRsk4FjBhWORmrZSEkWz3tnGite7sscV0Sp/Ei7wU7ggvGkO0DKSref35snj56LVynGW/Yzig3uH94+OHwwePnr85Gny7Pm5a7yVNJWNbuxlgY60MjRlxZouW0tYF5ouivWnXr+4JutUY85409KixiujKiWRA7VM6jnTV+5GZ8tOersdgXLg2xKZSkAGQrmCAjnsisnuTDDi0TgcQaIBxjVBZVH2Cmq4Ru3JgfPBgQ6yNB+HbTIGYplul8kwS7NdwV2Q78FQ7Ot0mXybl430NRmWGp2b5VnLiw4tK6lpO5h7Ry3KNV7RLECDNblFt4tlC68DU0LV2LBMGLdn/3Z0WDu3qYvQWSOv3G2tJ/+nzTxXHxadMq1nMvLmospr4Ab6jKFUliTrTQAorQqzglxhH1L4iUEIIb/95LvgfJLm79K3XybDk4/7OI7FS/FKvBG5eC9OxGdxKqZCiu/iVxRHB9GP6Hd8GB/dtMbR3vNC/FNx8gdhabEf</latexit>

it can take fractional values such as 0.1, 0.2, etc.


i
<latexit sha1_base64="LW3X5iG0sB/MJ8zKwg04agE9MtM=">AAACDXicbZC7SgNBFIZnvcZ4W7W0GYyCVdhVURshaGMZwVwgiWF2cpIMmdldZs6KYckL2PgqNhaK2Nrb+TZOLoUm/jDw8Z9zOHP+IJbCoOd9O3PzC4tLy5mV7Ora+samu7VdNlGiOZR4JCNdDZgBKUIooUAJ1VgDU4GEStC7GtYr96CNiMJb7MfQUKwTirbgDK3VdPeRXlCP1oWy28DQOiBrDr0RpIo9DO5E0815eW8kOgv+BHJkomLT/aq3Ip4oCJFLZkzN92JspEyj4BIG2XpiIGa8xzpQsxgyBaaRjq4Z0APrtGg70vaFSEfu74mUKWP6KrCdimHXTNeG5n+1WoLt80YqwjhBCPl4UTuRFCM6jIa2hAaOsm+BcS3sXynvMs042gCzNgR/+uRZKB/l/dP88c1JrnA5iSNDdskeOSQ+OSMFck2KpEQ4eSTP5JW8OU/Oi/PufIxb55zJzA75I+fzB5RGmqs=</latexit>

t = 0 =) ⌘t = ⌘max
i
<latexit sha1_base64="z2AuyhhVaX7E9Q3w1yfxjP64ZBo=">AAACGnicbZDLSgMxFIYz3q23qks3wSK4KjMq6kYQ3bis0Bt06pBJTzWYZIbkjFiGPocbX8WNC0XciRvfxvSyUOsPgY//nMPJ+eNUCou+/+VNTc/Mzs0vLBaWlldW14rrG3WbZIZDjScyMc2YWZBCQw0FSmimBpiKJTTi2/NBvXEHxopEV7GXQluxay26gjN0VlQMqlGIcI85z0yfntBqJGgolFsNloaALELnDiFXQvevRFQs+WV/KDoJwRhKZKxKVPwIOwnPFGjkklnbCvwU2zkzKLiEfiHMLKSM37JraDnUTIFt58PT+nTHOR3aTYx7GunQ/TmRM2VtT8WuUzG8sX9rA/O/WivD7nE7FzrNEDQfLepmkmJCBznRjjDAUfYcMG6E+yvlN8wwji7Nggsh+HvyJNT3ysFhef/yoHR6No5jgWyRbbJLAnJETskFqZAa4eSBPJEX8uo9es/em/c+ap3yxjOb5Je8z2/Hy6C5</latexit>

Tcur = Ti =) ⌘t = ⌘min
Loshchilov, Ilya, and Frank Hutter. "Sgdr: Stochastic gradient descent with warm restarts." arXiv preprint arXiv:1608.03983 (2016).
Decoupled Weight Decay Regularization

! weight decay
<latexit sha1_base64="7pdDBZiPxE3vsVdsizFcboFxIfc=">AAACCHicbVA9SwNBEN3zM8avqKWFi0GwCncqahm0sYxgPiA5wt7eJFmy98HunDEcKW38KzYWitj6E+z8N+4lKTTxwcDjvRlm5nmxFBpt+9taWFxaXlnNreXXNza3tgs7uzUdJYpDlUcyUg2PaZAihCoKlNCIFbDAk1D3+teZX78HpUUU3uEwBjdg3VB0BGdopHbhoKVEt4dMqWhAWwgPmA4gU6gPnA1H7ULRLtlj0HniTEmRTFFpF75afsSTAELkkmnddOwY3ZQpFFzCKN9KNMSM91kXmoaGLADtpuNHRvTIKD7tRMpUiHSs/p5IWaD1MPBMZ8Cwp2e9TPzPaybYuXRTEcYJQsgnizqJpBjRLBXqCwUc5dAQxpUwt1LeY4pxNNnlTQjO7MvzpHZScs5Lp7dnxfLVNI4c2SeH5Jg45IKUyQ2pkCrh5JE8k1fyZj1ZL9a79TFpXbCmM3vkD6zPH8T/mnU=</latexit>

final test error on CIFAR-10


<latexit sha1_base64="P88UX0y+zSlm48baVtni/OGm3G0=">AAACBXicbVC7SgNBFJ2Nrxhfq5ZaDAbBxrAbQS2jAdEuinlAEsLs5G4yZHZmmZkVQkhj46/YWChi6z/Y+TdOHoUmnupwzr3ce04Qc6aN5307qYXFpeWV9GpmbX1jc8vd3qlomSgKZSq5VLWAaOBMQNkww6EWKyBRwKEa9Iojv/oASjMp7k0/hmZEOoKFjBJjpZa7HzJBODagDQalpMJS4OLN1cXdse+13KyX88bA88SfkiyaotRyvxptSZMIhKGcaF33vdg0B0QZRjkMM41EQ0xoj3SgbqkgEejmYJxiiA+t0sahfSGUwuCx+ntjQCKt+1FgJyNiunrWG4n/efXEhOfNARNxYkDQyaEwsaElHlWC20wBNbxvCaGK2V8x7RJFqLHFZWwJ/mzkeVLJ5/zT3MltPlu4nNaRRnvoAB0hH52hArpGJVRGFD2iZ/SK3pwn58V5dz4moylnurOL/sD5/AEoupcN</latexit>
Loshchilov, Ilya, and Frank Hutter. "Decoupled weight decay regularization." arXiv preprint arXiv:1711.05101 (2017).
Residual Attention Network for Image Classi cation
YouTube Video

Attention Module
<latexit sha1_base64="/iC8fzPgD/TE1QkXLj4Mdkqkn+o=">AAACF3icbVC7TsMwFHXKq5RXgLGLRYXEVCUdgLHAwoJUJPqQ2qhynNvWqvOQ7SBVUQY+gy9ghS9gQ6yMfAD/gZNmoC1HsnR07uv4uBFnUlnWt1FaW9/Y3CpvV3Z29/YPzMOjjgxjQaFNQx6KnkskcBZAWzHFoRcJIL7LoetOb7J69xGEZGHwoGYROD4ZB2zEKFFaGprVQb4jEeClV0pBkMn4LvRiDkOzZtWtHHiV2AWpoQKtofkz8EIa+3oL5UTKvm1FykmIUIxySCuDWEJE6JSMoa9pQHyQTpIbSPGpVjw8CoV+gcK5+nciIb6UM9/VnT5RE7lcy8T/av1YjS6dhAVRrL9H54dGMccqxFki2GMCqOIzTQgVTHvFdEIEoUrntnDFk5m1VOdiL6ewSjqNun1eb9w3as3rIqEyqqITdIZsdIGa6Ba1UBtR9IRe0Ct6M56Nd+PD+Jy3loxi5hgtwPj6BepVoRU=</latexit>

<latexit sha1_base64="Qz7OhACABr4nC0TzhYGYBfQwrE8=">AAACKXicbVDLSgMxFM34rO+qSzfBIuhiykwX6rLoxo1QwVqhDiWTSdvQTDIkN0op/Qo/wy9wq1/gTt2K/2Gm7cJaD1w4nHMv994TZ4IbCIIPb25+YXFpubCyura+sblV3N65McpqyupUCaVvY2KY4JLVgYNgt5lmJI0Fa8S989xv3DNtuJLX0M9YlJKO5G1OCTipVfR9H18S08OxJpJ28WGsAFTq2wyDyvxEPUhsQFsKVrOjVrEUlIMR8CwJJ6SEJqi1it93iaI2ZRKoIMY0wyCDaEA0cCrYcPXOGpYR2iMd1nRUkpSZaDB6a4gPnJLgttKuJOCR+ntiQFJj+mnsOlMCXfPXy8X/vKaF9mk04DKzwCQdL2pb4T7GeUY44ZpREH1HCNXc3Yppl2hCwSU5tSUx+WlDl0v4N4VZclMph8flylWlVD2bJFRAe2gfHaIQnaAqukA1VEcUPaJn9IJevSfvzXv3Psetc95kZhdNwfv6AYBHpvQ=</latexit>

– Mask branch (bottom-up top-down structure)


<latexit sha1_base64="7ujscK+UGYG0E49U3inyjkegKRw=">AAACKXicbVDLTgIxFO3gC/GFunTTSExwAZlhoS6JblxiwiuBCel07kBD25m0HRNC+Ao/wy9wq1/gTt0a/8PyWAh4kiYn59ybe3qChDNtXPfTyWxsbm3vZHdze/sHh0f545OmjlNFoUFjHqt2QDRwJqFhmOHQThQQEXBoBcO7qd96BKVZLOtmlIAvSF+yiFFirNTLl0olXFepHOJAEUkHuJiAimIlNI6AmFQBTlRMQWsm+5e9fMEtuzPgdeItSAEtUOvlf7phTFMB0lBOtO54bmL8MVGGUQ6TXDfVkBA6JH3oWCqJAO2PZ9+a4AurhNiGsU8aPFP/boyJ0HokAjspiBnoVW8q/ud1UhPd+GMmk9SApPNDUcqxifG0IxwyBdTwkSWEKmazYjogilBjm1y6EupptIntxVttYZ00K2Xvqlx5qBSqt4uGsugMnaMi8tA1qqJ7VEMNRNETekGv6M15dt6dD+drPppxFjunaAnO9y8IWKdD</latexit>

– Trunk branch (performs feature processing)


! Attention Residual Learning (ARL)
<latexit sha1_base64="ISx3CrbdzVfp6JKybyLmFefRSjU=">AAACMHicbVC7TgMxEPTxfhOgpLGIkKCJ7hACSh4NBQUgAkhJFO35NomFz3ey94DolB/hM/gCWvgCqBAFDV+BL6TgNZKl0cyud3fCVElLvv/iDQ2PjI6NT0xOTc/Mzs2XFhbPbZIZgVWRqMRchmBRSY1VkqTwMjUIcajwIrw6KPyLazRWJvqMuik2Ymhr2ZICyEnN0mbdyHaHwJjkhtcJbynfI0JduPwUrYwyUPwIwWip23xt7/Rovdcslf2K3wf/S4IBKbMBjpul93qUiCx2/woF1tYCP6VGDoakUNibqmcWUxBX0MaaoxpitI28f12Przol4q3EuKeJ99XvHTnE1nbj0FXGQB372yvE/7xaRq2dRi51mrmDxdegVqY4JbyIikfSoCDVdQSEkW5XLjpgQJAL9MeUyBarFbkEv1P4S843KsFWZeNks7y7P0hogi2zFbbGArbNdtkhO2ZVJtgde2CP7Mm79569V+/tq3TIG/QssR/wPj4Byf2qtA==</latexit>

!mask
<latexit sha1_base64="ISx3CrbdzVfp6JKybyLmFefRSjU=">AAACMHicbVC7TgMxEPTxfhOgpLGIkKCJ7hACSh4NBQUgAkhJFO35NomFz3ey94DolB/hM/gCWvgCqBAFDV+BL6TgNZKl0cyud3fCVElLvv/iDQ2PjI6NT0xOTc/Mzs2XFhbPbZIZgVWRqMRchmBRSY1VkqTwMjUIcajwIrw6KPyLazRWJvqMuik2Ymhr2ZICyEnN0mbdyHaHwJjkhtcJbynfI0JduPwUrYwyUPwIwWip23xt7/Rovdcslf2K3wf/S4IBKbMBjpul93qUiCx2/woF1tYCP6VGDoakUNibqmcWUxBX0MaaoxpitI28f12Przol4q3EuKeJ99XvHTnE1nbj0FXGQB372yvE/7xaRq2dRi51mrmDxdegVqY4JbyIikfSoCDVdQSEkW5XLjpgQJAL9MeUyBarFbkEv1P4S843KsFWZeNks7y7P0hogi2zFbbGArbNdtkhO2ZVJtgde2CP7Mm79569V+/tq3TIG/QssR/wPj4Byf2qtA==</latexit>

Attentiontrunk
Residual Learning (ARL)
<latexit sha1_base64="ZrICqKy6BLXXbJrG5943RXSCnj0=">AAAB/nicbVDLSgMxFL1TX7W+qi7dBIvgqsx0oS6LblxWsA9oh5LJZNrQJDMkGaEMBb/ArX6BO3Hrr/gB/oeZdha29UDgcM693JMTJJxp47rfTmljc2t7p7xb2ds/ODyqHp90dJwqQtsk5rHqBVhTziRtG2Y47SWKYhFw2g0md7nffaJKs1g+mmlCfYFHkkWMYJNLAuvJsFpz6+4caJ14BalBgdaw+jMIY5IKKg3hWOu+5ybGz7AyjHA6qwxSTRNMJnhE+5ZKLKj2s3nWGbqwSoiiWNknDZqrfzcyLLSeisBOCmzGetXLxf+8fmqiGz9jMkkNlWRxKEo5MjHKP45CpigxfGoJJorZrIiMscLE2HqWroQ6jzazvXirLayTTqPuXdUbD41a87ZoqAxncA6X4ME1NOEeWtAGAmN4gVd4c56dd+fD+VyMlpxi5xSW4Hz9AtmhlsA=</latexit> <latexit sha1_base64="T/naqgBifIMOdMz8ava/Pje18Ok=">AAAB/3icbVDLTsJAFL3FF+ILdemmkZi4Ii0LdUl04xITCyTQkOl0gAnTaTNza0IaFn6BW/0Cd8atn+IH+B9OoQsBTzLJyTn35p45QSK4Rsf5tkobm1vbO+Xdyt7+weFR9fikreNUUebRWMSqGxDNBJfMQ46CdRPFSBQI1gkmd7nfeWJK81g+4jRhfkRGkg85JWgkD1UqJ4Nqzak7c9jrxC1IDQq0BtWffhjTNGISqSBa91wnQT8jCjkVbFbpp5olhE7IiPUMlSRi2s/mYWf2hVFCexgr8yTac/XvRkYiradRYCYjgmO96uXif14vxeGNn3GZpMgkXRwapsLG2M5/bodcMYpiagihipusNh0TRSiafpauhDqPNjO9uKstrJN2o+5e1RsPjVrztmioDGdwDpfgwjU04R5a4AEFDi/wCm/Ws/VufVifi9GSVeycwhKsr1/Y55dS</latexit>

i ! ranges over all spatial positions


<latexit sha1_base64="bH40OYPwNHIxfbFxXjGLMsorXpk=">AAACMnicbVDLSgNBEJz1/Tbq0ctgEDyFXRH1KHrxqGASIQmhd9JJBmdnlpleNSz5Ez/DL/CqP6A3EW9+hLMxB18FA0VVFz1dcaqkozB8DiYmp6ZnZufmFxaXlldWS2vrNWcyK7AqjDL2MgaHSmqskiSFl6lFSGKF9fjqpPDr12idNPqCBim2Euhp2ZUCyEvt0r7kTSt7fQJrzQ1vEt5SbkH30HHjgxyU4i7106B4apwsYm7YLpXDSjgC/0uiMSmzMc7apfdmx4gsQU1CgXONKEyplYMlKRQOF5qZwxTEFfSw4amGBF0rH9035Nte6fCusf5p4iP1eyKHxLlBEvvJBKjvfnuF+J/XyKh72MqlTjNCLb4WdTPFyfCiLN6RFgWpgScgrL9dcNEHC4J8pT+2dFzxtaKX6HcLf0lttxLtV3bP98pHx+OG5tgm22I7LGIH7IidsjNWZYLdsQf2yJ6C++AleA3evkYngnFmg/1A8PEJ0I6sbQ==</latexit>

c ! index of the channel


<latexit sha1_base64="Y7Cgm+TmZz0C6ygEbeb2WxVgOE0=">AAACJXicbVDLSgNBEJyN73fUo5fBIOgl7AZRj6IXjxFMDCQhzM72JkNmZ5aZXk0I+QY/wy/wql/gTQRPnvwPZ2MOvgoaiqpuurvCVAqLvv/mFWZm5+YXFpeWV1bX1jeKm1t1qzPDoca11KYRMgtSKKihQAmN1ABLQgnXYf88969vwFih1RUOU2gnrKtELDhDJ3WKB5y2jOj2kBmjb2kLYYAjoSIYUB1T7AHlPaYUyHGnWPLL/gT0LwmmpESmqHaKH61I8ywBhVwya5uBn2J7xAwKLmG83MospIz3WReajiqWgG2PJi+N6Z5TIhpr40ohnajfJ0YssXaYhK4zYdizv71c/M9rZhiftN2HaYag+NeiOJMUNc3zoZEwwFEOHWHcCHdrnoBhHF2KP7ZENj8tzyX4ncJfUq+Ug6Ny5fKwdHo2TWiR7JBdsk8CckxOyQWpkhrh5I48kEfy5N17z96L9/rVWvCmM9vkB7z3T11hpoE=</latexit>

! Mixed Spatial and Channel Attention


<latexit sha1_base64="gnEUJ8fWSXCPoV8TdigIvjrl2zE=">AAACMnicbVDLSgNBEJz1/Tbq0ctgEDyFXZHo0cfFi6BoVEhC6J3tJIOzs8tMrxqW/Imf4Rd41R/Qm4g3P8LZmIOvgoaiuovurjBV0pLvP3sjo2PjE5NT0zOzc/MLi6Wl5XObZEZgTSQqMZchWFRSY40kKbxMDUIcKrwIrw6K/sU1GisTfUa9FJsxdLRsSwHkpFap2jCy0yUwJrnhDcJbyo/kLUb8NHUToDjoiB90QWtUfI8IdeHrt0plv+IPwP+SYEjKbIjjVum9ESUii51fKLC2HvgpNXMwJIXC/kwjs5iCuIIO1h3VEKNt5oP/+nzdKRFvJ8aVJj5QvztyiK3txaGbjIG69nevEP/r1TNq7zRzqdPMPSa+FrUzxSnhRVg8kgYFqZ4jIIx0t3LRBQOCXKQ/tkS2OK3IJfidwl9yvlkJqpXNk63y7v4woSm2ytbYBgvYNttlh+yY1Zhgd+yBPbIn79578V69t6/REW/oWWE/4H18AiM1rAA=</latexit>

Wang, Fei, et al. "Residual attention network for image classi cation." Proceedings of the IEEE conference on computer vision and pattern recognition. 2017.
fi
fi
Squeeze-and-Excitation Networks
YouTube Playlist

H 0 ⇥W 0 ⇥C 0
X2R U 2 RH⇥W ⇥C x̃c = Fscale (uc , sc ) = sc uc
Ftr : X 7! U uc 2 RH⇥W
V = [v1 , . . . , vC ] ! learned set of filter kernels X̃ = [x̃1 , . . . , x̃C ]
vc ! parameters of the c-th filter
U = [u1 , . . . , uC ]
0
XC
uc = v c ⇤ X = vcs ⇤ xs
s=1
1 C0
vc = [vc , . . . , vc ]
1 C0
X = [x , . . . , x ]
Squeeze 1 XH X W

z 2 RC zc = Fsq (uc ) = uc (i, j)


HW i=1 j=1
Excitation
s = Fex (z, W ) = (g(z, W )) = (W2 (W1 z))
C
⇥C C⇥ C
W1 2 R r W2 2 R r

Hu, Jie, Li Shen, and Gang Sun. "Squeeze-and-excitation networks." Proceedings of the IEEE conference on computer vision and pattern recognition. 2018.
CBAM: Convolutional Block Attention Module
YouTube Video

F 2 RC⇥H⇥W ! intermediate feature map


Mc 2 RC⇥1⇥1 ! 1D channel attention map
Ms 2 R1⇥H⇥W ! 2D spatial attention map
F 0 = Mc (F ) F
00 0 0
F = Ms (F ) F

r ! reduction ratio

Woo, Sanghyun, et al. "Cbam: Convolutional block attention module." Proceedings of the European conference on computer vision (ECCV). 2018.
ResNeSt: Split-Attention Networks
ResNeSt = ResNeXt + Squeeze-and-Excitation
<latexit sha1_base64="fFaMv+6YkKNbH1dvOn+QbzMnfs0=">AAACLnicbVDLSgMxFM34tr6qLt0EiyCIZUZF3QhFEVyJr2qhLSWTua3BTGZM7kjr0O/wN/wBt/oHggtxJ36G6diFrwOBwzn3cG+OH0th0HVfnIHBoeGR0bHx3MTk1PRMfnbu3ESJ5lDmkYx0xWcGpFBQRoESKrEGFvoSLvyrvZ5/cQPaiEidYSeGeshaSjQFZ2ilRt6rIbQxPQFzCKdId2jGKkhX6Ol1AnALq0wFq/ttLjCLdBv5glt0M9C/xOuTAunjqJF/rwURT0JQyCUzpuq5MdZTplFwCd1cLTEQM37FWlC1VLEQTD3NvtalS1YJaDPS9imkmfo9kbLQmE7o28mQ4aX57fXE/7xqgs3teipUnCAo/rWomUiKEe31RAOhgaPsWMK4FvZWyi+ZZhxtmz+2BKZ3Wjdni/F+1/CXnK8Vvc3i+vFGobTbr2iMLJBFskw8skVK5IAckTLh5I48kEfy5Nw7z86r8/Y1OuD0M/PkB5yPT9OUqO8=</latexit>

X 2 Rh⇥w⇥c ! input to the split-attention block


<latexit sha1_base64="p4uzQ7ygpc9pyRiMcHBA0exo3xo=">AAACWXicbVDLbtNAFJ24r+CWktIlm1EjJDZENkXAMoJNly0iDylOo/HkJh5lPGPNXJdGln+Ov6i6R2zbL+jYCRJpuNKVjs59nXviTAqLQXDX8HZ29/YPmi/8w6OXx69aJ6/7VueGQ49rqc0wZhakUNBDgRKGmQGWxhIG8eJbVR/cgLFCqx+4zGCcsrkSM8EZOmrSioY0EopGKcMkjovv5XWR0AhFCpb+/At4SSMj5gkyY3TFwi0WQmU5UtQUE6DWacX3DBFUtZfGUvNFOWm1g05QB90G4Rq0yTouJ63f0VTzPHVbuGTWjsIgw3HBDAouofSj3ELG+ILNYeSgYk7duKhdKOlbx0zpTBuXCmnN/jtRsNTaZRq7zupb+7xWkf+rjXKcfRmv/gXFV4dmuax/d5bSqTDAUS4dYNwIp5XyhBnG0Rm/cWVqK2ml74wJn9uwDfofOuGnzvnVx3b369qiJnlDzsg7EpLPpEsuyCXpEU5+kT/kgTw27r2G1/T8VavXWM+cko3wTp8AjX+3yw==</latexit>

k ! cardinality hyperparameter (number of feature map groups)


<latexit sha1_base64="zn9Y/7zTo9iKlPumsuZMj8Ew8YA=">AAACS3icbVDLbhNBEJw1CQnmZeDIZRQLKVysXYKAY0QuOaEg4SSSbVm9s732yPPSTE/AWvmv+I18AFyD+AFuiANjx4e8ShpNqapb3V2lUzJQnv/IWvc2Nu9vbT9oP3z0+MnTzrPnx8FGL7AvrLL+tISAShrskySFp84j6FLhSTk7WPonZ+iDtOYLzR2ONEyMrKUAStK482nGh15OpgTe2698SPiNGgG+kgaUpDmfpibvwINGQs93TdRl+m3NawSKHrkGxyfeRhdeL8adbt7LV+C3SbEmXbbG0bjze1hZETUaEgpCGBS5o1EDnqRQuGgPY0AHYgYTHCRq0hph1KzuXvBXSal4bX16hvhKvdrRgA5hrstUqYGm4aa3FO/yBpHqD6NGGhcJjbgcVEfFyfJliLySHgWpeSIgvEy7cjFNGYkU0fUpVViutminYIqbMdwmx296xbve3ue33f2P64i22Uu2w3ZZwd6zfXbIjlifCfad/WQX7Fd2nv3J/mb/Lktb2brnBbuG1uZ/nZC1XA==</latexit>

r ! radix hyperparameter (number of splits within a cardinal group)


<latexit sha1_base64="BmeL6+5S211MFZZNwsFBFBJXQKc=">AAACUXicbVA9bxNBEB0fX8Hhw0BJs8JCCo11RxBQRtBQBilOItmWNbc351tlb/c0O0dinfzH+BtUdFRI8A/oWDsuSMJIq316b0bz5uWNNUHS9HsvuXX7zt17O/f7uw8ePno8ePL0OPiWNY21t55PcwxkjaOxGLF02jBhnVs6yc8+rvWTL8TBeHcky4ZmNS6cKY1GidR8cMRqymZRCTL7czUVupCOsTAXqort3CBjTUKs9lxb5/H3pQrRmQR1bqQyTqHSyIVxaNWCfdu8Ws0Hw3SUbkrdBNkWDGFbh/PBz2nhdVuTE20xhEmWNjLrkMVoS6v+tA3UoD7DBU0idNFSmHWb61fqZWQKVXqOz4nasP9OdFiHsKzz2FmjVOG6tib/p01aKd/POuOaVsjpy0Vla5V4tY5SFYZJi11GgJpN9Kp0FfPSMa6rW4qwtrbqx2Cy6zHcBMevR9nb0f7nN8ODD9uIduA5vIA9yOAdHMAnOIQxaPgKP+AX/O596/1JIEkuW5PeduYZXKlk9y8berWZ</latexit>

g = kr ! total number of feature groups


<latexit sha1_base64="ejGoTiv5xjZxWzCQHgUAzQYALiw=">AAACNXicbVA9TxtBEN0jfMUEMKFMs4qFlMq6AwRukBBpKEGKAcm2rLn13Hnlvd3T7ixgnfxX+Bv5A2lDnyIdouUvZG1chI8njfT03oxm5qWlko7i+E+08GFxaXll9WNt7dP6xmZ96/OFM94KbAujjL1KwaGSGtskSeFVaRGKVOFlOvo+9S+v0Tpp9A8al9grINcykwIoSP16K+dHfGR518p8SGCtueFdwluqyBAorn2RouUm4xkCeYs8t8aXbtKvN+JmPAN/S5I5abA5zvr1x+7ACF+gJqHAuU4Sl9SrwJIUCie1rndYghhBjp1ANRToetXswwnfCcqAZ8aG0sRn6v8TFRTOjYs0dBZAQ/fam4rveR1PWatXSV16Qi2eF2VecTJ8GhcfSIuC1DgQEFaGW7kYggVBIdQXWwZuetqkFoJJXsfwllzsNpOD5t75fuP4ZB7RKvvCvrJvLGGH7JidsjPWZoLdsV/sN7uPfkZ/o4fo8bl1IZrPbLMXiJ7+AfJ0rK0=</latexit>

Ui = Fi (X), i = 1, . . . , g ! intermediate representation of each group


<latexit sha1_base64="EG94sILbHW7a/4eYDzCGs2DHcho=">AAACYnicbVBNi9RAEO2JX+v4Nese9VA4CCsMQ7LK6kVYFMTjCs7uwGQIlU4l02wnHborq0PIL/QXeBfvXvViZ3YO7q4FBY9XH6/qpbVWjsPw+yC4cfPW7Ts7d4f37j94+Gi0+/jEmcZKmkmjjZ2n6EirimasWNO8toRlquk0PXvf10/PyTplqs+8rmlZYlGpXElkTyUjmiUK3kJcIq8k6vZDl6j9+YsJ9Gw0gVhnht0ECoitKlaM1povEDN95VZVTLakTCETWPK6jireLAaTA6FcQWFNU3fJaBxOw03AdRBtwVhs4zgZ/YwzI5vS75ManVtEYc3LFi0rqakbxo2jGuUZFrTwsMKS3LLd2NHBc89kkBvrs2LYsP9OtFg6ty5T39m/7a7WevJ/tUXD+Zulf7tumCp5IZQ3GthA7y1kypJkvfYApVX+VpArtCi9T5dVMtef1g29MdFVG66Dk4NpdDh9+enV+Ojd1qId8UQ8E/siEq/FkfgojsVMSPFN/BK/xZ/Bj2AY7AZ7F63BYDuzJy5F8PQvvZa5iA==</latexit>

Fi ! 1 ⇥ 1 conv followed by a 3 ⇥ 3 conv


<latexit sha1_base64="ThyJ6CmDcvCbdqHRwH2qsBiZsGg=">AAACT3icbZDPbtNAEMbXKX+aQCG0Ry4rkko9RXZTAccKJMSxINJWiqNovJ4kq653rd1xS2T5vXiNHnvhSPsGvSHWiZFoy0gr/fR9s5qZL8mVdBSGV0Fr49HjJ083251nz7devOy+2j52prACR8IoY08TcKikxhFJUniaW4QsUXiSnH2s/ZNztE4a/Y2WOU4ymGs5kwLIS9Pu1zgDWghQ5adqKnls5XxBYK254DHhdyr7kQeZoeNRnwujz/nMKGUuMOXJkgPvD//6w7VfTbu9cBCuij+EqIEea+po2v0Vp0YUGWoSCpwbR2FOkxIsSaGw6sSFwxzEGcxx7FGDnzYpV7dXfNcrqd/J+qeJr9R/f5SQObfMEt9ZX+rue7X4P29c0Oz9pJQ6Lwi1WA+aFYqT4XWQPJUWBamlBxBW+l25WIAFQT7uO1NSV69WdXww0f0YHsLx/iB6Oxh+Oegdfmgi2mSv2Ru2xyL2jh2yz+yIjZhgP9hPds1ugsvgNvjdalpbQQM77E612n8AKs2zOA==</latexit>

h⇥w⇥(c0 /k)
<latexit sha1_base64="VsARmfhf0n95ZNAQshfW7XQ+ATQ=">AAACKnicbVDNTsJAGNz6i/hX9ehlIzHiQWzVqEeiF49oLJDQSrbLAhu222Z3qyFNn8LX8AW86ht4I16Nz+EWMBHwSzaZzHxfZnb8iFGpLGtgzM0vLC4t51byq2vrG5vm1nZVhrHAxMEhC0XdR5IwyomjqGKkHgmCAp+Rmt+7zvTaIxGShvxe9SPiBajDaZtipDTVNI+cJoUu5dANkOr6fnKXPiRd6CoaEAmffkERHxz3DtOmWbBK1nDgLLDHoADGU2ma324rxHFAuMIMSdmwrUh5CRKKYkbSvBtLEiHcQx3S0JAjbeYlw2+lcF8zLdgOhX5cwSH79yJBgZT9wNebWXg5rWXkf1ojVu1LL6E8ihXheGTUjhlUIcw6gi0qCFasrwHCguqsEHeRQFjpJidcWjKLluZ1MfZ0DbOgelKyz0unt2eF8tW4ohzYBXugCGxwAcrgBlSAAzB4Bq/gDbwbL8aHMTA+R6tzxvhmB0yM8fUDvG+mvA==</latexit>

Ui 2 R
Split Attention
<latexit sha1_base64="mbS3Ud3igkww+YK4GmK0Y1WcyoA=">AAACF3icbVDLSsNAFJ3UV62vqjvdDBbBVUkqqMuqG5cVbSu0oUwmk3bo5MHMjVBCwN/wB9zqH7gTty79Ab/DSZqFbb0wcDjn3nvuHCcSXIFpfhulpeWV1bXyemVjc2t7p7q711FhLClr01CE8sEhigkesDZwEOwhkoz4jmBdZ3yd6d1HJhUPg3uYRMz2yTDgHqcENDWoHvTzHYlkbnqnDQFfArBgKtbMupkXXgRWAWqoqNag+tN3Qxr7epwKolTPMiOwEyKBU8HSSj9WLCJ0TIasp2FAfKbsJPdP8bFmXOyFUr8AcM7+nUiIr9TEd3SnT2Ck5rWM/E/rxeBd2AkPolj/i06NvFhgCHEWCHa5ZBTERANCJde3YjoiklDQsc24uCo7La3oYKz5GBZBp1G3zuqnt41a86qIqIwO0RE6QRY6R010g1qojSh6Qi/oFb0Zz8a78WF8TltLRjGzj2bK+PoFv+qgwg==</latexit>

For notation convenience, let us reuse c = c0 /k so that Ui 2 Rh⇥w⇥c


<latexit sha1_base64="1dq987I5NunIA5JPGfcvf9dKLU8=">AAACYXicbZFNbxMxEIad5asNH13KsZcRCYIDCrtFAi5IFUiIY0GkrZQNkXd2trHitVf2bFG02j/IP+CMxJkrnHDSINGWOb1+Z6x5/TivtfKcJN960bXrN27e2tru375z995OfH/3yNvGIY3RautOculJK0NjVqzppHYkq1zTcb54u+ofn5HzyppPvKxpWslTo0qFkoM1i4t31oGxvD4CWnNGRpFBegqaGBoPjhpPMER4Dfj42WII3gLPJcNwPFOQKQNZJXme5+3H7nM7h4xVRR6+/BXYDWfxIBkl64KrIt2IgdjU4Sz+kRUWm4oMo5beT9Kk5mkrHSvU1PWzEKmWuJCnNAnSyLBo2q5pdPAoOAWU4V2lNQxr998bray8X1Z5mFwF95d7K/N/vUnD5atpq0zdcCB0vqhsNHAAEtBCoRwh62UQEp0KWQHn0knk8AEXthR+Fa3rBzDpZQxXxdH+KH0xev5hf3DwZoNoS+yJh+KJSMVLcSDei0MxFii+ip/il/jd+x5tR3G0ez4a9TZ3HogLFe39Ab5Bt7Y=</latexit>

For notation convenience, let us reuse c = c0 /k so that Ui 2 Rh⇥w⇥c


<latexit sha1_base64="1dq987I5NunIA5JPGfcvf9dKLU8=">AAACYXicbZFNbxMxEIad5asNH13KsZcRCYIDCrtFAi5IFUiIY0GkrZQNkXd2trHitVf2bFG02j/IP+CMxJkrnHDSINGWOb1+Z6x5/TivtfKcJN960bXrN27e2tru375z995OfH/3yNvGIY3RautOculJK0NjVqzppHYkq1zTcb54u+ofn5HzyppPvKxpWslTo0qFkoM1i4t31oGxvD4CWnNGRpFBegqaGBoPjhpPMER4Dfj42WII3gLPJcNwPFOQKQNZJXme5+3H7nM7h4xVRR6+/BXYDWfxIBkl64KrIt2IgdjU4Sz+kRUWm4oMo5beT9Kk5mkrHSvU1PWzEKmWuJCnNAnSyLBo2q5pdPAoOAWU4V2lNQxr998bray8X1Z5mFwF95d7K/N/vUnD5atpq0zdcCB0vqhsNHAAEtBCoRwh62UQEp0KWQHn0knk8AEXthR+Fa3rBzDpZQxXxdH+KH0xev5hf3DwZoNoS+yJh+KJSMVLcSDei0MxFii+ip/il/jd+x5tR3G0ez4a9TZ3HogLFe39Ab5Bt7Y=</latexit>

Xr <latexit sha1_base64="SsOtgily3IVXvCK05izEwDRXYOc=">AAACZ3icbVHLbtNAFJ2YVwmPBpAQEpsLKVIRamTTCthUqmDDski4rRSn1vV4HE869lgz10A08j+y5QeQ+AG2MHGzoC1ndXTOfencrFHSUhj+GATXrt+4eWvj9vDO3Xv3N0cPHh5Z3RouYq6VNicZWqFkLWKSpMRJYwRWmRLH2dmHlX/8RRgrdf2Zlo2YVTivZSE5kpfS0SIpkVzcnS5gHxLbVqmT+1F36kwHceoMbC92opfwCmQHiZHzktAY/RUSEt/IGeGXWVFTPw0KbYBKAVuLrR0qgaPJZY0K5ka3TZeOxuEk7AFXSbQmY7bGYTr6meSat5WfzxVaO43ChmYODUmuRDdMWisa5Gc4F1NPa6yEnbk+kw5eeCXvLyp0TdCr/3Y4rKxdVpmvrJBKe9lbif/zpi0V72ZO1k1Loubni4pWAWlYBQy5NIKTWnqC3Eh/K/ASDXLyb7iwJber07qhDya6HMNVcvR6Er2Z7H7aGx+8X0e0wZ6y52ybRewtO2Af2SGLGWff2W/2Z8AGv4LN4HHw5Lw0GKx7HrELCJ79BX6ZurU=</latexit>

Xr Û j = Ur(j 1)+i ! representation for the j-th cardinal group


<latexit sha1_base64="SsOtgily3IVXvCK05izEwDRXYOc=">AAACZ3icbVHLbtNAFJ2YVwmPBpAQEpsLKVIRamTTCthUqmDDski4rRSn1vV4HE869lgz10A08j+y5QeQ+AG2MHGzoC1ndXTOfencrFHSUhj+GATXrt+4eWvj9vDO3Xv3N0cPHh5Z3RouYq6VNicZWqFkLWKSpMRJYwRWmRLH2dmHlX/8RRgrdf2Zlo2YVTivZSE5kpfS0SIpkVzcnS5gHxLbVqmT+1F36kwHceoMbC92opfwCmQHiZHzktAY/RUSEt/IGeGXWVFTPw0KbYBKAVuLrR0qgaPJZY0K5ka3TZeOxuEk7AFXSbQmY7bGYTr6meSat5WfzxVaO43ChmYODUmuRDdMWisa5Gc4F1NPa6yEnbk+kw5eeCXvLyp0TdCr/3Y4rKxdVpmvrJBKe9lbif/zpi0V72ZO1k1Loubni4pWAWlYBQy5NIKTWnqC3Eh/K/ASDXLyb7iwJber07qhDya6HMNVcvR6Er2Z7H7aGx+8X0e0wZ6y52ybRewtO2Af2SGLGWff2W/2Z8AGv4LN4HHw5Lw0GKx7HrELCJ79BX6ZurU=</latexit>

Û j = Ur(j 1)+i ! representation


i=1 for the j-th cardinal group <latexit sha1_base64="qLNgcW5/xhJxCjLh7AaetTGLqXg=">AAACP3icbVBNS8NAEN34bf2qevQyWAQPpSQq6kUQvXhUsSo0tWw2W7t2swm7E6WE/B//hn/Aq/oH9CZevbmpFfwaGHi8N8ObeUEihUHXfXKGhkdGx8YnJktT0zOzc+X5hVMTp5rxOotlrM8DargUitdRoOTnieY0CiQ/C7r7hX52zbURsTrBXsKbEb1Uoi0YRUu1ynt+h2JWzy+uwBcK/IhiJwiy4/wi64CPIuIGbr4Ay6twBTvgVcGXYYymCt1SqVWuuDW3X/AXeANQIYM6bJWf/TBmacQVMkmNaXhugs2MahRM8rzkp4YnlHXpJW9YqKj1bmb9X3NYsUwI7VjbVgh99vtGRiNjelFgJ4tfzG+tIP/TGim2t5uZUEmKXLFPo3YqAWMogoNQaM5Q9iygTAt7K7AO1ZShjfeHS2iK0/IiGO93DH/B6VrN26ytH21UdvcGEU2QJbJMVolHtsguOSCHpE4YuSX35IE8OnfOi/PqvH2ODjmDnUXyo5z3D9E6rgo=</latexit>

i=1 Û j 2 Rh⇥w⇥c , j = 1, . . . , k
h X w
<latexit sha1_base64="quCH8WHZGF/fJVoUBao6pRJoQ+8=">AAACSnicbVDLSgMxFM3UqrW+qi7dBIugIGVGRd0URDduhCpWhU5bMmnGSZvJDEnGUsJ8lb/hD7hV8APciRszdRa2eiBw7rn3ck+OFzMqlW2/WIWZ4uzcfGmhvLi0vLJaWVu/lVEiMGniiEXi3kOSMMpJU1HFyH0sCAo9Ru68wXnWv3skQtKI36hRTNoheuDUpxgpI3Url7LTh3Xo+gJh7aQ6gMMUujIJu5rWnbQT5EU/K4bQDZDSzbTT36F7/V3oUg7dEKnA8/R12sHdStWu2WPAv8TJSRXkaHQr724vwklIuMIMSdly7Fi1NRKKYkbSsptIEiM8QA+kZShHIZFtPf52CreN0oN+JMzjCo7V3xsahVKOQs9MZh7ldC8T/+u1EuWftDXlcaIIxz+H/IRBFcEsQ9ijgmDFRoYgLKjxCnGATILKJD1xpScza2nZBONMx/CX3O7XnKPawdVh9fQsj6gENsEW2AEOOAan4AI0QBNg8ARewCt4s56tD+vT+voZLVj5zgaYQKH4DUwZswM=</latexit>

1 X
j
s = Û j (i, j) 2 Rc
hw i=1 j=1
global contextual information with embedded channel-wise statistics
<latexit sha1_base64="XcZgHpZK5JAzZ7fhUWwUm6/DL+4=">AAACQHicbVBNSwMxEM3Wr1q/qh69BIvgxbJbQT2KvXisYK1QS8lmp20wmyzJrFpKf5B/wz/g1f4B8SZePZlde7Dqg8DjvZnMzAsTKSz6/sQrzM0vLC4Vl0srq2vrG+XNrSurU8OhybXU5jpkFqRQ0ESBEq4TAywOJbTC23rmt+7AWKHVJQ4T6MSsr0RPcIZO6pbrfalDJinXCuEBU0eF6mkT5z69FzigEIcQRRBRPmBKgTy4FxaoRVdiUXDbLVf8qp+D/iXBlFTIFI1u+fUm0jyNQSGXzNp24CfYGTHjfpMwLt2kFhLGb1kf2o4qFoPtjPJjx3TPKRF1K7qnkObqz44Ri60dxqGrdEcM7G8vE//z2in2TjojoZIUQfHvQb1UUtQ0S45GwgBHOXSEcSPcrlkghnF0+c5MiWy22rjkggl+x/CXXNWqwVH18KJWOT2bRlQkO2SX7JOAHJNTck4apEk4eSTP5IVMvCfvzXv3Pr5LC960Z5vMwPv8AkZ2sdk=</latexit>

Gi (sj ) 2 Rc ! two fully connected layers with ReLU activation


<latexit sha1_base64="7dXYXRZHlfjUz/SD2LyYtUtVwGE=">AAACY3icbZDPbtNAEMY35k9LCjQUbhXSiAipXCK7oMKxggM9cCgVaSvFabRej5Nt17vW7rghsvyGvAAPQB+g13Lo2s2Btoy00qdvZvTN/pJCSUdh+LsTPHj46PHK6pPu2tNnz9d7LzYOnSmtwKEwytjjhDtUUuOQJCk8LizyPFF4lJx9afpH52idNPoHLQoc53yqZSYFJ29Nelmcc5oJrqqv9URuuZPTdxBLDa2dJNVBfSIgtnI6I26tmUNM+JMqmhvISqUWIIzWKAhTUHzhg2AuaQYH+G0IXJA8b3PqSa8fDsK24L6IlqLPlrU/6V3EqRFljpqE4s6NorCgccUtSaGw7salw4KLMz7FkZea5+jGVcujhrfeSSEz1j9N0Lr/blQ8d26RJ36y+aa722vM//VGJWWfxpXURUmoxU2QxwBkoIELqbSehaeSSi6s9LeCmHHrQXgyt1JS15xWdz2Y6C6G++JwexDtDN5//9Df/bxEtMo22Ru2xSL2ke2yPbbPhkywX+ySXbG/nT/BWrARvLoZDTrLnZfsVgWvrwG5rrwO</latexit>

j
<latexit sha1_base64="SNx+gXPX+Ny0pGCPNVFoIs3qx+E=">AAACQHicbVC7TsMwFHV4U14FRhaLCompSgABI4KFERAFpKZUjnvTGhw7sm+AKsoH8Rv8ACv8AGJDrEw4pQOvK13p6Jz7PFEqhUXff/ZGRsfGJyanpiszs3PzC9XFpTOrM8OhwbXU5iJiFqRQ0ECBEi5SAyyJJJxH1welfn4DxgqtTrGfQithXSViwRk6ql09YG1xeUVDoWiYMOxFUX5SXOa8oKER3R4yY/QtDRHuMLc6RsqsFV2VgEJ6C2WFLdrVml/3B0H/gmAIamQYR+3qS9jRPCuHcOkGNgM/xVbODAouoaiEmYWU8WvWhaaDiiVgW/ng2YKuOaZDY21cuiMG7PeOnCXW9pPIVZYP2d9aSf6nNTOMd1u5UGmGoPjXojiTFDUtnaMdYYCj7DvAuBHuVsp7zDCOzt8fWzq2PK2oOGOC3zb8BWcb9WC7vnm8VdvbH1o0RVbIKlknAdkhe+SQHJEG4eSePJIn8uw9eK/em/f+VTriDXuWyY/wPj4BwHOyLQ==</latexit>

ai 2 Rc ! soft assignment weights See the paper for an efficient


r radix-major implementation!
X
<latexit sha1_base64="8UKvWnNEP7R9ddpZUOA2R5TvZ2w=">AAACXHicbVBNS8NAFNzG7/pVFbx4WSxivdRERb0oogc9KlgVmjZsti+6uvlg90UsIf47f4QXj1686t1N24NaHzwYZubxhvETKTTa9mvJGhkdG5+YnCpPz8zOzVcWFq90nCoODR7LWN34TIMUETRQoISbRAELfQnX/sNJoV8/gtIiji6xm0ArZLeRCARnaCiv4jFPtO8pPaAuPCU1N2R4x5nMTnNP1HT7fmNj09Vp6GVi3XicvK2GjUbL+17qIjxh9iyC55wqekgdr1K163Zv6DBwBqBKBnPuVd7dTszTECLkkmnddOwEWxlTKLiEvOymGhLGH9gtNA2MWAi6lfWKyOmaYTo0iJXZCGmP/XmRsVDrbugbZ5Ff/9UK8j+tmWKw38pElKQIEe8/ClJJMaZFq7QjFHCUXQMYV8JkpfyOKcbRdP/rS0cX0fKyKcb5W8MwuNqqO7v17Yud6tHxoKJJskJWSY04ZI8ckTNyThqEkxfyQT7JV+nNGrWmrdm+1SoNbpbIr7GWvwGEVrXH</latexit>

ResNeSt Block
<latexit sha1_base64="vSqGFUfrJt02Cahh9lTce6fxrF4=">AAACFXicbVDLSgMxFM3UV62vqhvBTbAIrspMBXVZ6saV1EdboS0lk7nThmYmQ5IRyjD+hj/gVv/Anbh17Q/4HabtLGzrgcDhnHtzLseNOFPatr+t3NLyyupafr2wsbm1vVPc3WsqEUsKDSq4kA8uUcBZCA3NNIeHSAIJXA4td3g59luPIBUT4b0eRdANSD9kPqNEG6lXPOhM/kgkeOktqGu407jGBR32iiW7bE+AF4mTkRLKUO8VfzqeoHEAoaacKNV27Eh3EyI1oxzSQidWEBE6JH1oGxqSAFQ3maSn+NgoHvaFNC/UeKL+3UhIoNQocM1kQPRAzXtj8T+vHWv/opuwMIo1hHQa5Mcca4HHdWCPSaCajwwhVDJzK6YDIgnVprSZFE+NT0sLphhnvoZF0qyUnbPy6U2lVK1lFeXRITpCJ8hB56iKrlAdNRBFT+gFvaI369l6tz6sz+lozsp29tEMrK9fdqqfew==</latexit>

j j
<latexit sha1_base64="GQIJNDivez4t5m+xrWIObtSiUTg=">AAACOnicbVDLSsNAFJ34tr6qLt0MFqFFrImKuhFEF7pUsCo0NUymNzo6eTBzI5YQf8bf8Afc6sqtC0Hc+gFOahe+DgwczrmXe+b4iRQabfvZ6usfGBwaHhktjY1PTE6Vp2eOdZwqDg0ey1id+kyDFBE0UKCE00QBC30JJ/7VbuGfXIPSIo6OsJNAK2TnkQgEZ2gkr7zFPHF2Sbeos1x16CJ14SapLlE3ZHjBmcz2ck9U9dllrUZr1EW4wexWBLc5VcWOV67YdbsL+pc4PVIhPRx45Ve3HfM0hAi5ZFo3HTvBVsYUCi4hL7mphoTxK3YOTUMjFoJuZd1v5nTBKG0axMq8CGlX/b6RsVDrTuibySK+/u0V4n9eM8Vgs5WJKEkRIv51KEglxZgWndG2UMBRdgxhXAmTlfILphhH0+yPK21dRMtLphjndw1/yfFK3Vmvrx6uVbZ3ehWNkDkyT6rEIRtkm+yTA9IgnNyRB/JInqx768V6s96/Rvus3s4s+QHr4xNDu6rB</latexit>

ai j
= exp(Gi (s ))/ j
exp(Gi0 (s )) if r > 1 ai = 1/(1 + exp( Gi (sj ))) if r = 1
V = concat{V j , j = 1, . . . , k}
<latexit sha1_base64="f41eLxvEWDFIvGiPsRVxjyht6GU=">AAACJ3icbZDLSgMxFIYzXmu9VV26CRZBpJQZFXUjFN24rGAv0Kklk0k1bSYZkjNiGeYdfA1fwK2+gTvRpRufw/Sy8PZD4Oc753BO/iAW3IDrvjtT0zOzc/O5hfzi0vLKamFtvW5UoimrUSWUbgbEMMElqwEHwZqxZiQKBGsE/bNhvXHLtOFKXsIgZu2IXEve5ZSARZ3Cbh2fYB/YHaRUSUszP61f9Uq4Z7lXwr4IFZgS7vtZp1B0y+5I+K/xJqaIJqp2Cp9+qGgSMQlUEGNanhtDOyUaOBUsy/uJYTGhfXLNWtZKEjHTTkd/yvC2JSHuKm2fBDyi3ydSEhkziALbGRG4Mb9rQ/hfrZVA97idchknwCQdL+omAoPCw4BwyDWjIAbWEKq5vRXTG6IJBRvjjy2hGZ6W5W0w3u8Y/pr6Xtk7LO9fHBQrp5OIcmgTbaEd5KEjVEHnqIpqiKJ79Iie0LPz4Lw4r87buHXKmcxsoB9yPr4AXhmk7w==</latexit>

r i0 =1
X
<latexit sha1_base64="GSuJSQ3etg2zPcKB9yV4aikL0NM=">AAACjnicbVFdb9MwFHXCx0b5yuARIV1RAUOIKtnQxkvFBC97HIh2k5o2cpyb1ptjR7bDVln5Efw8/gB/gVecrkhs40qWjs79Oj43rwU3No5/BuGt23fubmze691/8PDR42jrydioRjMcMSWUPsmpQcEljiy3Ak9qjbTKBR7nZ5+7/PF31IYr+c0ua5xWdC55yRm1nsqiH+PZKQwhNU2VOT5M2pkGmnFPjjKnYfv0XfIG3gJvIeUS0oraRZ67r+3MLSC1vEID538B80WazxeWaq06Fi+sO8eOwQLKphMBqgRGdcElFTDXqqlBo1dsUNqVpDaL+vEgXgXcBMka9Mk6jrLoV1oo1lR+AhPUmEkS13bqqLacCWx7aWOwpuyMznHioaRe69StrGvhpWe8NqX9kxZW7L8djlbGLKvcV3Z/N9dzHfm/3KSx5Yep47JuLEp2uahsBFgF3R2g4BqZFUsPKNPcawW2oJoy6691ZUthOmltzxuTXLfhJhjvDJK9we6X9/2DT2uLNskz8oJsk4TskwNySI7IiDDyO3gevApeh1G4Fw7Dj5elYbDueUquRHj4B4Zhx9k=</latexit>

j Y=V+X
<latexit sha1_base64="MTJYf34k/Hvvlztvv2yGeR1DPv8=">AAACBHicbVDLSsNAFL2pr1pfVZduBosgCCWpoG6EohuXFWxtaUOZTCbt0MkkzEyEErr1B9zqH7gTt/6HP+B3OGmzsK0HBg7n3Ms9c7yYM6Vt+9sqrKyurW8UN0tb2zu7e+X9g5aKEklok0Q8km0PK8qZoE3NNKftWFIcepw+eqPbzH98olKxSDzocUzdEA8ECxjB2kidDrpGLXSG2v1yxa7aU6Bl4uSkAjka/fJPz49IElKhCcdKdR071m6KpWaE00mplygaYzLCA9o1VOCQKjedBp6gE6P4KIikeUKjqfp3I8WhUuPQM5Mh1kO16GXif1430cGVmzIRJ5oKMjsUJBzpCGW/Rz6TlGg+NgQTyUxWRIZYYqJNR3NXfJVFm5RMMc5iDcukVas6F9Xz+1qlfpNXVIQjOIZTcOAS6nAHDWgCgRBe4BXerGfr3fqwPmejBSvfOYQ5WF+/j4WXRg==</latexit>

Vj = ai Ur(j 1)+i 2 Rh⇥w⇥c ! weighted fusion of cardinal group representation


i=1 Zhang, Hang, et al. "Resnest: Split-attention networks." arXiv preprint arXiv:2004.08955 (2020).
Random Erasing Data Augmentation
YouTube Video

Zhong, Zhun, et al. "Random Erasing Data Augmentation." AAAI. 2020.


CutMix: Regularization Strategy to Train
Strong Classi ers with Localizable Features
Black pixels or random noise: information loss and inefficiencies during training
<latexit sha1_base64="0rHY6dbN1jxJu4rCf4VgkMxDyUA=">AAACWHicbVDLThtBEGwvJIDzcsgxlxYWUk7WLpGSKCcEF44gxYBkW9bsbK8ZMTuzmp6NYln+uHxG+AC4wh/QNj6ER0stlaqmp6srr63hmKb/Wsna+qvXG5tb7Tdv373/0Pm4fcq+CZr62lsfznPFZI2jfjTR0nkdSFW5pbP88nChn/2mwMa7X3Fa06hSE2dKo1UUatwZDJ03riAX8cAqfYm1+UOW0QcMyhW+QtGZfqJxpQ/VcgqtZ0ZRhaRS/jLkpBmLJhg3wRiUcQLGnW7aS5eFz0G2Al1Y1fG4cz0svG4qcaOtYh5kaR1HMxWi0Zbm7WHDVItJNaGBQKcq4tFsGcIcd4UpUExKyzVL9v+JmaqYp1UuL+WMC36qLciXtEETyx+jmXF1E+XOh0VlYzF6XCSKhQmko50KUDoY8Yr6QgWlo+T+aEvBC2vztgSTPY3hOTjd62Xfel9P9rr7B6uINuEz7MAXyOA77MMRHEMfNPyFG7iFu9ZVAslGsvXwNGmtZj7Bo0q27wF6lrc6</latexit>

CutMix: Patches are cut and pasted among training images where the ground
<latexit sha1_base64="kOrAqxOv7CQnBlnz7K4wLBEucHo=">AAAClnicbVHbihNBEO0Zb2u8RX0RfGkMCz7FmV1Q8UEWF9EXIYLZXUhCqOmpyTTb0z10V+uGkA/x0/wBf0Mrkzy4uxY0HKpO1ak6XbRGB8qyX0l64+at23f27vbu3X/w8FH/8ZOT4KJXOFbOOH9WQECjLY5Jk8Gz1iM0hcHT4vx4Uz/9jj5oZ7/RssVZAwurK62AODXv/5xap22JluRxpC/64p0cAakagwSPUkWSYEvZQiAsJTTOLiR50FYz0DyMiT9qZCrVKBfeRWaTj1RLAwWa7RgwwclGX/CI1rvW+Y04GLOU5LpGJoF0VYfbrf68P8iGWRfyOsh3YCB2MZr3f09Lp2LDtygDIUzyrKXZClhMGVz3pjFgC+qcd54wtNBgmK06C9dynzOlrJznx1502X87VtCEsGwKZjZAdbha2yT/V5tEqt7OVtq2kdCqrVAVTXc4/4cstUdF7ESpQXnNu0pVgwdF/GuXVMqwWW3dY2PyqzZcBycHw/z18PDrweDow86iPfFcvBAvRS7eiCPxWYzEWCjxJ9lPhsmr9Fn6Pv2YftpS02TX81RcinT0F/kZy78=</latexit>

truth labels are also mixed proportionally to the area of the patches
x 2 RW ⇥H⇥C ! training image
<latexit sha1_base64="XuQYsclh2N+nHCpypH33ixmserw=">AAACRXicbVC7TsMwFHV4lvIKMLJYVEhMVQIIGCtYOhZEKVJTKsdxWwvHiewboIryS/wGP8DCACMbG2IFpwQJWq50paNz7vP4seAaHOfJmpqemZ2bLy2UF5eWV1bttfULHSWKsiaNRKQufaKZ4JI1gYNgl7FiJPQFa/nXJ7neumFK80iewzBmnZD0Je9xSsBQXbt+hz0usRcSGPh+epZdpS3sAQ+ZxvUfcJJhT/H+AIhS0a1h2R2koAiXXPYxNyNZ1rUrTtUZBZ4EbgEqqIhG1371gogmIZNABdG67ToxdFKigFPBsrKXaBYTem2Gtw2UxFzSSUcfZ3jbMAHuRcqkBDxif3ekJNR6GPqmMv9Mj2s5+Z/WTqB31Em5jBNgkn4v6iUCQ4Rz+3DAFaMghgYQqri5FdMBUYSCMfnPlkDnp2VlY4w7bsMkuNitugfVvdP9Su24sKiENtEW2kEuOkQ1VEcN1EQU3aNH9IxerAfrzXq3Pr5Lp6yiZwP9CevzCyTTszA=</latexit>

y ! corresponding label
<latexit sha1_base64="5RHyJmki9cdnf6vBSm2tKTJqrNg=">AAACJXicbVDLSgNBEJz1bXytevQyGARPYVdFPYpePEYwUUhCmJ3tJIOzM8tMr7os+QV/wx/wqn/gTQRP3vwOJzEHTSxoKKq66e6KUiksBsGHNzU9Mzs3v7BYWlpeWV3z1zfqVmeGQ41rqc11xCxIoaCGAiVcpwZYEkm4im7OBv7VLRgrtLrEPIVWwrpKdARn6KS2v5vTphHdHjJj9B1tItxjwbUxYFOtYqG6VLIIZL/tl4NKMASdJOGIlMkI1bb/1Yw1zxJQyCWzthEGKbYKZlBwCf1SM7OQMn7DutBwVLEEbKsYftSnO06JaUcbVwrpUP09UbDE2jyJXGfCsGfHvYH4n9fIsHPcKoRKMwTFfxZ1MklR00E8NBYGOMrcEcaNcLdS3mOGcXQh/tkS28Fp/ZILJhyPYZLU9yrhYWX/4qB8cjqKaIFskW2yS0JyRE7IOamSGuHkgTyRZ/LiPXqv3pv3/tM65Y1mNskfeJ/fSLmm1A==</latexit>

<latexit sha1_base64="BqIYfxnezhcVW+m+6kft4MGvxGg=">AAACgHicbVFdaxNBFJ1dv2r8aNQXxZfBREihxN0KKj618UEfK5i2kITl7uxNOnRmdpm5a7Ms++Sv9A8I/gsnmwVt64WBw7nncO89kxZKOoqin0F46/adu/d27vcePHz0eLf/5OmJy0srcCpylduzFBwqaXBKkhSeFRZBpwpP04tPm/7pd7RO5uYbVQUuNKyMXEoB5Kmk/2NucmkyNMQ/o0ELhBy4wUtOFqSRZsVxDbpQyIejOUmVYb1u9nkHq2ZvyNOKi1ynWzVd5n+trnU6b10nR/u8So68HEzWEpMNMdkbjpP+IBpHbfGbIO7AgHV1nPR/zbNclNpvLRQ4N4ujghY1WJJCYdOblw4LEBewwpmHBjS6Rd2G1fDXnsn4Mrf++atb9l9HDdq5SqdeqYHO3fXehvxfb1bS8sOilqYoCY3YDlqWipMPxCfPM2lRkKo8AGGl35WLc7AgyP/PlSmZ26zW9Hww8fUYboKTg3H8bvz268HgcNJFtMNesldsxGL2nh2yL+yYTZlgv4Pd4HnwIgzDUfgmjLfSMOg8z9iVCj/+AaYUv1o=</latexit>

Generate a new training example (x̃, ỹ) by combining two training samples
(xA , yA ) and (xB , yB ).
<latexit sha1_base64="S91TUU0TrL5vj1FAd2s+aTRv1aU=">AAACKHicbVDLSsNAFJ34tr6iLt0MFkERS6KibgQfGzeFCrYKbQiTydQOnWTCzI20hHyEv+EPuNU/cCduXfgdTmoRrR64cDjnXu69J0gE1+A4b9bY+MTk1PTMbGlufmFxyV5eaWiZKsrqVAqpbgKimeAxqwMHwW4SxUgUCHYddM8L//qOKc1lfAX9hHkRuY15m1MCRvLt7RZwEbKsl+NjXMUtGUrAPf8Ub+NNF+/g6ta3dubbZafiDID/EndIymiImm9/tEJJ04jFQAXRuuk6CXgZUcCpYHmplWqWENolt6xpaEwipr1s8FSON4wS4rZUpmLAA/XnREYirftRYDojAh096hXif14zhfaRl/E4SYHF9GtROxUYJC4SwiFXjILoG0Ko4uZWTDtEEQomx19bQl2clpdMMO5oDH9JY7fiHlT2LvfLJ2fDiGbQGlpHm8hFh+gEXaAaqiOK7tEjekLP1oP1Yr1ab1+tY9ZwZhX9gvX+Cfe6pAI=</latexit>

x̃ = M xA + (1 M ) xB
<latexit sha1_base64="hzrOGiC4w1g2yKGEHk+/lsVzApE=">AAACKHicbVDLSsNAFJ34rPUVdelmsAiVYklU1I1Q68ZlBfuAJpTJZNoOnUzCzEQIoR/hb/gDbvUP3Em3LvwOp20E23pg4HDOPdw7x4sYlcqyRsbS8srq2npuI7+5tb2za+7tN2QYC0zqOGShaHlIEkY5qSuqGGlFgqDAY6TpDe7GfvOJCElD/qiSiLgB6nHapRgpLXXMkqMo80maDOENdJgO+ggmnVtYgkUbnv5KJ1qrdsyCVbYmgIvEzkgBZKh1zG/HD3EcEK4wQ1K2bStSboqEopiRYd6JJYkQHqAeaWvKUUCkm04+NYTHWvFhNxT6cQUn6t9EigIpk8DTkwFSfTnvjcX/vHasutduSnkUK8LxdFE3ZlCFcNwQ9KkgWLFEE4QF1bdC3EcCYaV7nNniy/Fpw7wuxp6vYZE0zsr2Zfn84aJQqWYV5cAhOAJFYIMrUAH3oAbqAINn8ArewLvxYnwYn8ZoOrpkZJkDMAPj6wfOWKSB</latexit>

ỹ = yA + (1 )yB
M 2 {0, 1}W ⇥H ! a binary mask indicating where to drop out and fill in from two images
<latexit sha1_base64="zhhETVu+mbv9GgUwAbJEEV6IUcQ=">AAACcXicbVHbihNBEO0Zb7vxFvVJfCk2CIISZlTUx0Vf9kVYwWwWMjHU9NQkTXq6h+4aYxjmQ/UD/AF/wJ5sHtxdCxoOpy6n6nRea+U5SX5G8Y2bt27fOTgc3L13/8HD4aPHZ942TtJEWm3deY6etDI0YcWazmtHWOWapvn6U5+ffifnlTVfeVvTvMKlUaWSyIFaDPkzZMpA1iav0qz71k4hY1WRh5MOMqeWK0bn7Caw9INbhFwZdFuo0K9BmWI3xyxhsyJHwBYKZ2uwDQOaAkqldaiC0tkKeGNBBXXy3WI4SsbJLuA6SPdgJPZxuhj+zgorm4oMS43ez9Kk5nmLjpXU1A2yxlONch2mzwI0GC6Ytzt3OngemLCLdeEZhh37b0eLlffbKg+VFfLKX8315P9ys4bLD/NWmbphMvJCqGx0b0NvNRTKkWS9DQClU2FXkCt0KDl8yCWVwverdYNgTHrVhuvg7PU4fTd+8+Xt6Pjj3qID8UwciRciFe/FsTgRp2IipPgViegwGkR/4qcxxEcXpXG073kiLkX88i9dCLz5</latexit>

⇠ Beta(↵, ↵) ! similar to mixup


<latexit sha1_base64="nUAxmUdAwhjVZ8MDKjBbio/6sE4=">AAACSXicbVBNT9tAEF0HaCFACfTIZUWEBBKK7Ba1PSK49NADlQggxVE0Xk+SFbu2tTtuiSz/Kf4GfwCOcODeG+LUdeIDXyOt9unNe5qZF2VKWvL9W68xN7/w4ePiUnN5ZfXTWmt949SmuRHYFalKzXkEFpVMsEuSFJ5nBkFHCs+ii6Oqf/YHjZVpckKTDPsaRokcSgHkqEHrV6icOAYeWql5SHhJxSESlDshqGwMe3z27/LQyNGYwJj0b61zDqnAcEq5lpd5Vg5abb/jT4u/BUEN2qyu40HrIYxTkWtMSCiwthf4GfULMCSFwrIZ5hYzEBcwwp6DCWi0/WJ6dcm3HRPzYWrcS4hP2eeOArS1Ex05pQYa29e9inyv18tp+KNfyCTLCRMxGzTMVXVnFSGPpUFBauIACCPdrlyMwYAgF/SLKbGtViubLpjgdQxvwemXTvCt8/X3fvvgsI5okW2yLbbDAvadHbCf7Jh1mWBX7IbdsXvv2vvnPXpPM2nDqz2f2YtqzP0H3va0Bg==</latexit>

↵ = 1 =) ⇠ Unif(0, 1)
<latexit sha1_base64="8YgNB8ooAVxJzY+B4PR4Vlh5KOk=">AAACLnicbVDLSgMxFM34rPVVdekmWAQFKTMq6kYQ3bhUsCp0SrmTubXBZGZI7ohl6Hf4G/6AW/0DwYW4Ez/DtHbh60DgcO7j3JwoU9KS7794I6Nj4xOTpany9Mzs3HxlYfHcprkRWBepSs1lBBaVTLBOkhReZgZBRwovouujfv3iBo2VaXJG3QybGq4S2ZYCyEmtShCCyjrA93nAQ6mdJVoeKrcgBh5aqXlIeEtF3Q311vyNYL1Vqfo1fwD+lwRDUmVDnLQq72GcilxjQkKBtY3Az6hZgCEpFPbKYW4xA3ENV9hwNAGNtlkMvtbjq06JeTs17iXEB+r3iQK0tV0duU4N1LG/a33xv1ojp/Zes5BJlhMm4suonStOKe/nxGNpUJDqOgLCSHcrFx0wIMil+cMltv3TemUXTPA7hr/kfLMW7NS2TrerB4fDiEpsma2wNRawXXbAjtkJqzPB7tgDe2RP3r337L16b1+tI95wZon9gPfxCRsUp+Q=</latexit>

B = (rx , ry , rw , rh ) ! bounding box coordinates


<latexit sha1_base64="RSgn2x07nkqP5YTUEcOeuQXdNX4=">AAACQXicbVBNSwMxEM36bf2qevQSLIKClF0V9SKIevCoYFVoy5LNTttgNlmSWW1Z+of8G/4Br/oDBG/i1YvZ2oNfDzK8eTPDTF6USmHR95+8kdGx8YnJqenSzOzc/EJ5cenS6sxwqHEttbmOmAUpFNRQoITr1ABLIglX0c1xUb+6BWOFVhfYS6GZsLYSLcEZOiksnxzRA7puwu4mNWGvCHdF6GzQhhHtDjJj9B1tIHQxj3SmYqHaNNJdyrU2LmEIth+WK37VH4D+JcGQVMgQZ2H5pRFrniWgkEtmbT3wU2zmzKDgEvqlRmYhZfyGtaHuqGIJ2GY++G2frjklpi1t3FNIB+r3iZwl1vaSyHUmDDv2d60Q/6vVM2ztN3Oh0gxB8a9FrUxS1LSwjsbCAEfZc4RxI9ytlHeYYRydwT+2xLY4rV9yxgS/bfhLLreqwW51+3yncng0tGiKrJBVsk4CskcOySk5IzXCyT15JE/k2XvwXr037/2rdcQbziyTH/A+PgGL4rBY</latexit>

<latexit sha1_base64="Fn5R4B6AlXS28ipEQOaLsU585A8=">AAACa3icbZHNSsNAFIUn8b/+VbtTF4NFUNCSqKgboeimywq2FZoSJpOpHZxM4syNWkLe0o0v4M4XcOWkdmFtLwwczncv93ImSATX4Dgflj03v7C4tLxSWl1b39gsb223dZwqylo0FrF6CIhmgkvWAg6CPSSKkSgQrBM83Ra888KU5rG8h2HCehF5lLzPKQFj+WWp/DfsaR5hD9gbZC0D80PnGHeOjrHyhzNZY8Re8TXuGP6sIHPxCfaEWRuSvGADwxozmF+uOjVnVHhauGNRReNq+uVPL4xpGjEJVBCtu66TQC8jCjgVLC95qWYJoU/kkXWNlCRiupeNcsnxgXFC3I+VeRLwyP07kZFI62EUmM6IwED/Z4U5i3VT6F/1Mi6TFJikv4v6qcAQ4yJkHHLFKIihEYQqbm7FdEAUoWC+YmJLqIvT8pIJxv0fw7Ron9bci9rZ3Xm1fjOOaBnton10iFx0ieqogZqohSh6R9/WnDVvfdkVe8fe+221rfFMBU2UffADXVa3ng==</latexit>

p p
rx ⇠ Unif(0, W ), ry ⇠ Unif(0, H), rw = W 1 , rh = H 1
r w rh
<latexit sha1_base64="esqKbyJzEg1wH1/zGfzw3euZH2s=">AAACRHicbVBNb9NAEF2XQkv4CvTIZdQIiQuRTRFwQargQI9FappKcWSN1+Nk1bXX2h23RJZ/En+DP8AJiV576g1xRV2nOdCWkVZ6em9m3uxLK60ch+HPYO3O+t17G5v3ew8ePnr8pP/02aEztZU0kkYbe5SiI61KGrFiTUeVJSxSTeP0+FOnj0/IOmXKA15UNC1wVqpcSWRPJf3PcW5RNjY5BZvM22YMey18gAheQaz9mgwhtmo2Z7TWnELM9JUbaU1VUQboncB2m9qkPwiH4bLgNohWYCBWtZ/0z+PMyLqgkqVG5yZRWPG0QctKamp7ce2oQnmMM5p4WGJBbtosP9zCC89kkBvrX8mwZP+daLBwblGkvrNAnrubWkf+T5vUnL+fNqqsaqZSXhnltQY20KUHmbIkWS88QGmVvxXkHH2C7DO+5pK57rS254OJbsZwGxy+HkZvhztf3gx2P64i2hTPxbZ4KSLxTuyKPbEvRkKKb+KH+CXOgu/BRfA7+HPVuhasZrbEtQr+XgLnJrH5</latexit>

=1 ! cropped area ratio


WH
M ! filling zeros within the bounding box B and ones otherwise
<latexit sha1_base64="xPGLqXPEPd8Ds50vdda7jhnpqaU=">AAACTnicbVBNT9tAFFyHloa00BSOvTw1IHGKbKigRwQXLpVAagApiaL1+jlesd61dp8bgpXfxd/gyqFX+Ae9VWUdcijQkVYazbyvnbhQ0lEY3gWNpTdvl981V1rvP6yufWx/Wj9zprQCe8IoYy9i7lBJjT2SpPCisMjzWOF5fHlU++c/0Tpp9A+aFjjM+VjLVApOXhq1T7/DwMpxRtxaM4EB4RVVqVR+3hiu0RoHE0mZ1EAZQmxKndRObK5g83ATuE7AaHRgvG0n0uFs1O6E3XAOeE2iBemwBU5G7ftBYkSZoyahuHP9KCxoWHFLUiictQalw4KLSz7Gvqea5+iG1fzrM9jySgKpsf5pgrn6b0fFc+emeewrc06Ze+nV4v+8fknpt2EldVESavG0KC0VkIE6R0ikRUFq6gkXVvpbQWTcckE+7WdbElefNmv5YKKXMbwmZzvdaK+7e/q1c3C4iKjJPrMvbJtFbJ8dsGN2wnpMsBv2i92zh+A2+B38Cf4+lTaCRc8Ge4ZG8xF3ibWd</latexit>

Yun, Sangdoo, et al. "Cutmix: Regularization strategy to train strong classi ers with localizable features." Proceedings of the IEEE/CVF International Conference on
Computer Vision. 2019.
fi
fi
Neural Ordinary Di erential Equations
YouTube Playlist

ht+1 = ht + f✓ (ht ) ! residual networks, recurrent neural networks, and normalizing flows (density estimation)
<latexit sha1_base64="W+Q7R42MBLdwAbxbWSk13xtXmZo=">AAAChnicbVHbihNBEO0ZbzHeouKToIVByLJLmPG2vghBX3xcwewuZMLQ01OTadLTPXTXGOMwH+An+gO++gt2snkwuxY0nDqnijqczmolHUXRryC8dv3GzVu92/07d+/dfzB4+OjUmcYKnAqjjD3PuEMlNU5JksLz2iKvMoVn2fLTRj/7htZJo7/SusZ5xRdaFlJw8lQ6+FmmLR3GHXyAMiU4hCJNqETiI98eQGLloiRurVlBQvidWotO5g1XoJFWxi7dEVgUjbWoyXON3ZO4zkEbW3Elf0i9gEKZlYNRjtpJWgM6ktXWyUGXDobRONoWXAXxDgzZrk7Swe8kN6Kp/GGhuHOzOKpp3nJLUijs+knjsOZiyRc481DzCt283UbWwUvP5FAY6583vmX/3Wh55dy6yvykN1i6y9qG/J82a6h4P2+lrhtCLS4OFY0CMrDJH3Lp4yK19oALK71XECW3XJD/pb0rudtY6/o+mPhyDFfB6atx/G78+sub4eTjLqIee8pesBGL2TGbsM/shE2ZYH+CJ8Gz4HnYC8fh2/D4YjQMdjuP2V6Fk7+WpsXH</latexit> <latexit sha1_base64="mL7CCJAv72xGyo9+3G3L6nxbi2Y=">AAACGHicbVC7SgNBFJ31GeMraplmMAjahF0VtQzaWEYwD0hCmJ29SQZnZ5eZu2JYUvgb/oCt/oGd2Nr5A36Hk00KTTwwcDjnXu6Z48dSGHTdL2dhcWl5ZTW3ll/f2NzaLuzs1k2UaA41HslIN31mQAoFNRQooRlrYKEvoeHfXY39xj1oIyJ1i8MYOiHrK9ETnKGVuoViG+EB08MAlBE4pGBQhJl3NOoWSm7ZzUDniTclJTJFtVv4bgcRT0JQyCUzpuW5MXZSplFwCaN8OzEQM37H+tCyVLEQTCfNPjGiB1YJaC/S9imkmfp7I2WhMcPQt5M24MDMemPxP6+VYO+ikwoVJwiKTw71EkkxouNGaCA0cJRDSxjXwmalfMA042h7+3MlMONoo7wtxputYZ7Uj8veWfnk5rRUuZxWlCNFsk8OiUfOSYVckyqpEU4eyTN5Ia/Ok/PmvDsfk9EFZ7qzR/7A+fwBRdWhBQ==</latexit>

Euler discretization of a continuous transformation! Continuous Depth Residual Networks


<latexit sha1_base64="NnL351sQkU2bN/RbbXEG7YZAFGk=">AAACMXicbVDLSgMxFM34rPVVdekmWgRXZUZFXRZFcFnBVqGWkknvaDCTDMmNUEt/xN/wB9zqH3QnLtz4E6bTLrR6IXA4517OyYkzKSyG4SCYmp6ZnZsvLBQXl5ZXVktr6w2rneFQ51pqcx0zC1IoqKNACdeZAZbGEq7i+9OhfvUAxgqtLrGbQStlt0okgjP0VLt0cOYkGNoRlhtA8ZjTVCeUUa4VCuW0sxQNUzbRJs3lrXapHFbCfOhfEI1BmYyn1i593nQ0dyko5JJZ24zCDFs9ZlBwCf3ijbOQMX7PbqHpoWIp2FYv/12f7nimQ727fwppzv686LHU2m4a+02f785OakPyP63pMDlu9YTKHILiI6PESYqaDqvypRjgKLseMG6Ez0r5HTOMoy/0l0vHDqP1i76YaLKGv6CxV4kOK/sXe+XqybiiAtkk22SXROSIVMk5qZE64eSJvJBX8hY8B4PgPfgYrU4F45sN8muCr29bgatk</latexit>

<latexit sha1_base64="tPndAJIykfgGkH4ri+tM4vzYbds=">AAACK3icbVDLSgMxFM34rPVVdekmWARXZUZBXRbrwpWo2FZoS8lkbm1oJhmSG6UM/Qt/wx9wq3/gSnGr3+FM7cLXgcDhnHtzDydMpLDo+y/e1PTM7Nx8YaG4uLS8slpaW29Y7QyHOtdSm6uQWZBCQR0FSrhKDLA4lNAMB7Xcb96AsUKrSxwm0InZtRI9wRlmUrdUaY//SA1Eo5pWKJTTztJjSLBPL8CKyDFJTwFvtRnYYrdU9iv+GPQvCSakTCY465Y+2pHmLgaFXDJrW4GfYCdlBgWXMCq2nYWE8QG7hlZGFYvBdtJxphHdzpSI9rTJnkI6Vr9vpCy2dhiH2WTMsG9/e7n4n9dy2DvspEIlDkHxr0M9JylqmpdEI2GAoxxmhHEjsqyU95lhHLMqf1yJbB5tlBcT/K7hL2nsVoL9yt75brl6NKmoQDbJFtkhATkgVXJCzkidcHJHHsgjefLuvWfv1Xv7Gp3yJjsb5Ae8909kMajU</latexit>

Replace the 6 standard residual blocks in ResNet with ODESolve modules!


<latexit sha1_base64="XaGmOn2N9wsVahlnXVRgoFOqVDk=">AAACRnicbVA9TxtBEJ0z+QDny0lKmkmsSKmsO5BISpQEKTSBkBiQjGXN7Y3xynu7p905Isvyb+Jv8AcoaEiXMh1KmzvjIkCeNNLTezOamZcWRgeJ44uosXTv/oOHyyvNR4+fPH3Wev5iP7jSK+4qZ5w/TCmw0Za7osXwYeGZ8tTwQTr+WPsHJ+yDdva7TAru53Rs9VArkkoatLb3uDCkGGXEuIFByGbkM/QcdFaSwdQ4NQ6oLe5x+MKCP7SMcOfT1jdnThhzl5WGw6tmc9Bqx514DrxLkgVpwwK7g9avo8ypMmcrylAIvSQupD8lL1oZnjWPysAFqTEdc6+ilnIO/en85Rm+qZQMh85XZQXn6r8TU8pDmORp1ZmTjMJtrxb/5/VKGb7vT7UtSmGrrhcNS4PisM4PM+1ZiZlUhJTX1a2oRuRJSZXyjS1ZqE+b1cEkt2O4S/bXOslGZ/3rWnvzwyKiZViF1/AWEngHm/AZdqELCk7hHC7hZ3QW/Y6uoj/XrY1oMfMSbqABfwGxB7CX</latexit>

<latexit sha1_base64="HyU6SA9TkwyAANeL7cdQ3jeulUQ=">AAACFHicbVDLSsNAFJ3UV62vqAsXbgaLUEFKoqJuhKIblxXsA9pSJtNJM3TyYOZGKCG/4Q+41T9wJ27d+wN+h5M2C9t6YOBwzn3NcSLBFVjWt1FYWl5ZXSuulzY2t7Z3zN29pgpjSVmDhiKUbYcoJnjAGsBBsHYkGfEdwVrO6C7zW09MKh4GjzCOWM8nw4C7nBLQUt886A5CSLwU32C33wWPAal4p3DSN8tW1ZoALxI7J2WUo943f/QkGvssACqIUh3biqCXEAmcCpaWurFiEaEjMmQdTQPiM9VLJh9I8bFWBtgNpX4B4In6tyMhvlJj39GVPgFPzXuZ+J/XicG97iU8iGJgAZ0ucmOBIcRZGnjAJaMgxpoQKrm+FVOPSEJBZzazZaCy09KSDsaej2GRNM+q9mX1/OGiXLvNIyqiQ3SEKshGV6iG7lEdNRBFKXpBr+jNeDbejQ/jc1paMPKefTQD4+sXLUCeJg==</latexit>

ḣ = f✓ (h, t)
Replacebackpropagation
the 6 standard through
residualODE
blocks in ResNet with ODESolve modules!
<latexit sha1_base64="XaGmOn2N9wsVahlnXVRgoFOqVDk=">AAACRnicbVA9TxtBEJ0z+QDny0lKmkmsSKmsO5BISpQEKTSBkBiQjGXN7Y3xynu7p905Isvyb+Jv8AcoaEiXMh1KmzvjIkCeNNLTezOamZcWRgeJ44uosXTv/oOHyyvNR4+fPH3Wev5iP7jSK+4qZ5w/TCmw0Za7osXwYeGZ8tTwQTr+WPsHJ+yDdva7TAru53Rs9VArkkoatLb3uDCkGGXEuIFByGbkM/QcdFaSwdQ4NQ6oLe5x+MKCP7SMcOfT1jdnThhzl5WGw6tmc9Bqx514DrxLkgVpwwK7g9avo8ypMmcrylAIvSQupD8lL1oZnjWPysAFqTEdc6+ilnIO/en85Rm+qZQMh85XZQXn6r8TU8pDmORp1ZmTjMJtrxb/5/VKGb7vT7UtSmGrrhcNS4PisM4PM+1ZiZlUhJTX1a2oRuRJSZXyjS1ZqE+b1cEkt2O4S/bXOslGZ/3rWnvzwyKiZViF1/AWEngHm/AZdqELCk7hHC7hZ3QW/Y6uoj/XrY1oMfMSbqABfwGxB7CX</latexit>

solvers
<latexit sha1_base64="ku9ZABlYqWt7IpVoFABLu6BhxUA=">AAACNHicbVDLSgMxFM34rPVVdekmWARXZUZFXYoPcGcFq0JbSiZz2wYzyZDcEcswn+Jv+ANu9QMEd6JLv8FM7cLXgcDh3Me5OWEihUXff/bGxicmp6ZLM+XZufmFxcrS8oXVqeHQ4FpqcxUyC1IoaKBACVeJARaHEi7D68OifnkDxgqtznGQQDtmPSW6gjN0Uqey20pVBCY0jEPWQrjFLGT8OjE6Yb1hD8W+0WmvT0+PjqnVstiW551K1a/5Q9C/JBiRKhmh3qm8tyLN0xgUcsmsbQZ+gu2MGRRcQl5upRYS58x60HRUsRhsOxt+MKfrToloVxv3FNKh+n0iY7G1gzh0nTHDvv1dK8T/as0Uu3vtTKgkRVD8y6ibSoqaFmnRSBjgKAeOMG6Eu5XyPnNhoUvhh0tki9Pysgsm+B3DX3KxWQt2altn29X9g1FEJbJK1sgGCcgu2ScnpE4ahJM78kAeyZN37714r97bV+uYN5pZIT/gfXwCFhStWg==</latexit>

h(0) ! input
<latexit sha1_base64="lB4oZluf6F9pXWxoWMWwgr9DxGY=">AAACGnicbVDLSsNAFJ34rPVVdSnCYBHqpiQq6rLoxmUF+4AmlMlk2g6dTMLMjVpKV/6GP+BW/8CduHXjD/gdTtosbOuBC4dz7uXee/xYcA22/W0tLC4tr6zm1vLrG5tb24Wd3bqOEkVZjUYiUk2faCa4ZDXgIFgzVoyEvmANv3+d+o17pjSP5B0MYuaFpCt5h1MCRmoXDnol+xi7ind7QJSKHrAL7BGGXMYJjNqFol22x8DzxMlIEWWotgs/bhDRJGQSqCBatxw7Bm9IFHAq2CjvJprFhPZJl7UMlSRk2huO3xjhI6MEuBMpUxLwWP07MSSh1oPQN50hgZ6e9VLxP6+VQOfSm7zEJJ0s6iQCQ4TTTHDAFaMgBoYQqri5FdMeUYSCSW5qS6DT00Z5E4wzG8M8qZ+UnfPy6e1ZsXKVRZRD++gQlZCDLlAF3aAqqiGKntALekVv1rP1bn1Yn5PWBSub2UNTsL5+AQ9FoWk=</latexit>

| {z }
h(T ) ! output (black-box di↵erential equation solver)
<latexit sha1_base64="8XaIiG8+BwAd8CjO9ksodKW/E3k=">AAACRHicbVA7SwNBEN7zGeMrammzGIRYGO5U1DJooWWEJApJCHubuWRx7/bcndOEIz/Jv+EfsBK0tbITW3HzKHwNDHx8M/PNzOfHUhh03Sdnanpmdm4+s5BdXFpeWc2trdeMSjSHKldS6SufGZAigioKlHAVa2ChL+HSvz4d1i9vQRuhogr2Y2iGrBOJQHCGlmrlzrqFyg5taNHpItNa3dEGQg9TlWCcIC34kvHrXV/1aFsEAWiIUDBJ4SYZCVCjpJXfGbRyebfojoL+Bd4E5Mkkyq3ca6OteBJaQS6ZMXXPjbGZMo2CSxhkG4mB2O5mHahbGLEQTDMdPTyg25Zp00BpmxHSEft9ImWhMf3Qt50hw675XRuS/9XqCQbHzVRE9neI+HhRkEiKig7dsyZo4Cj7FjCuhb2V8i7TjKP1+MeWthmeNshaY7zfNvwFtb2id1jcvzjIl04mFmXIJtkiBeKRI1Ii56RMqoSTe/JInsmL8+C8Oe/Ox7h1ypnMbJAf4Xx+AYZ/suQ=</latexit>

Reverse-mode automatic di↵erentiation of ODE solutions


<latexit sha1_base64="Rnxsx6Vjx1NuZCv0yroMqqrczq8=">AAACP3icbVDLSsNAFJ34rPUVdelmsAhuLImCuixVwZ1V7APaUiaTGx06yYSZiVBC/sff8Afcqj+gO3HrzknahVYvDJw595x7L8eLOVPacV6tmdm5+YXF0lJ5eWV1bd3e2GwpkUgKTSq4kB2PKOAsgqZmmkMnlkBCj0PbG57m/fY9SMVEdKNHMfRDchuxgFGiDTWw671iRirBz64hV8J+KHzAJNEiNCKKfRYEICHSrPBgEeDLs3OsBE/yvxrYFafqFIX/AncCKmhSjYH91vMFTUIzknKiVNd1Yt1PiTTbOGTlXqIgJnRIbqFrYERCUP20uDPDu4bxcSCkeZHGBfvTkZJQqVHoGaU5/05N93Lyv1430cFJP2VRnGiI6HhRkHCsBc6DMzFIoJqPDCBUsjwZekckodqE9muLr/LTsrIJxp2O4S9oHVTdo+rh1UGlVp9EVELbaAftIRcdoxq6QA3URBQ9oCf0jF6sR+vd+rA+x9IZa+LZQr/K+voGX2uxXQ==</latexit>

Adjoint sensitivity method


<latexit sha1_base64="T3yGFlsR8Jo5RU4sMOnRsaYJEHE=">AAACF3icbZDPSgMxEMaz/rf+q3rTS7AInspuBfVY9eJRwdpCW0o2O7Wx2WRJZgtlKfgavoBXfQNv4tWjL+BzmNY9aPWDwMc3M8zkFyZSWPT9D29mdm5+YXFpubCyura+UdzcurE6NRxqXEttGiGzIIWCGgqU0EgMsDiUUA/75+N6fQDGCq2ucZhAO2a3SnQFZ+iiTnHnNLrTQiG1oKxAMRA4pDFgT0edYskv+xPRvybITYnkuuwUP1uR5mkMCrlk1jYDP8F2xgwKLmFUaKUWEsb77BaazioWg21nkz+M6L5LItrVxj13ziT9OZGx2NphHLrOmGHPTtfG4X+1Zordk3YmVJIiKP69qJtKipqOgdBIGOAoh84wbhwCTnmPGcbRYfu1JbLj00YFByaYxvDX3FTKwVH58KpSqp7liJbILtkjByQgx6RKLsglqRFO7skjeSLP3oP34r16b9+tM14+s01+yXv/AlxboIc=</latexit>

L ! scalar-valued loss function


<latexit sha1_base64="Tp+uhYV6Nmw9OpOk3D7euntzIwU=">AAACLXicbVDLTttAFB1TWiB9YMqymxFRpW4a2QW1XUZ00wULKjWAFFvR9fg6GTGesWauQyMrv9Hf6A90W/6ARaWKJfxGxyELXkca6eice3TvnKxS0lEU/Q1Wnqw+fba2vtF5/uLlq81w6/WRM7UVOBBGGXuSgUMlNQ5IksKTyiKUmcLj7PRL6x9P0Tpp9HeaVZiWMNaykALIS6MwOuCJleMJgbXmjCeEP6hxAhTY91NQNeZcGed4UWvRJuajsBv1ogX4QxIvSZctcTgKr5LciLpETUKBc8M4qihtwJIUCuedpHZYgTiFMQ491VCiS5vFz+b8rVdyXhjrnya+UG8nGiidm5WZnyyBJu6+14qPecOais9pI3VVE2pxs6ioFSfD25p4Li0KUjNPQFjpb+ViAhYE+TLvbMlde9q844uJ79fwkBx96MUfe7vf9rr9/WVF6+wN22HvWMw+sT77yg7ZgAn2k/1mf9h58Cu4CP4FlzejK8Eys83uILj+D1OCqeU=</latexit>

h(t) ! hidden state at time t


<latexit sha1_base64="+mBrvcIPp0n16LF+GCdPQI0HsiQ=">AAACLXicbVDLSgNBEJz1bXxFPXoZjIJewq6KehS9eFQwKiQh9M52soOzs8tMrxqW/Ia/4Q941T/wIIhH/Q0nMQdfBQ1FVTfdXWGmpCXff/FGRsfGJyanpkszs3PzC+XFpXOb5kZgTaQqNZchWFRSY40kKbzMDEISKrwIr476/sU1GitTfUbdDJsJdLRsSwHkpFbZjzdokzeM7MQExqQ3vEF4S0Usowg1twSEHIiTTJCv0VqvVa74VX8A/pcEQ1JhQ5y0yu+NKBV5gpqEAmvrgZ9RswBDUijslRq5xQzEFXSw7qiGBG2zGHzW4+tOiXg7Na408YH6faKAxNpuErrOBCi2v72++J9Xz6m93yykznJCLb4WtXPFKeX9mHgkDQpSXUdAGOlu5SIGA4JcmD+2RLZ/Wq/kggl+x/CXnG9Vg93q9ulO5eBwGNEUW2GrbIMFbI8dsGN2wmpMsDv2wB7Zk3fvPXuv3ttX64g3nFlmP+B9fAIU5qiI</latexit>

@L
<latexit sha1_base64="j6Vl7OgabiiT3oHoVYgiCotMvtA=">AAACQHicbVDLShxBFK028TW+xmTppsgg6GboVlERAuJssnBhIKPC9DDcrq6eKae6qqm6rRma/iB/wx9wm/mB4C5k68rqcSDxcaDgcM69nFsnyqSw6Ptjb+bDx9m5+YXF2tLyyupaff3TudW5YbzNtNTmMgLLpVC8jQIlv8wMhzSS/CIatir/4pobK7T6gaOMd1PoK5EIBuikXr0FW7hNj77SMDHAijADgwIkPS3/8YEbKWloRH+AYIy+oSHyn1hAfKWFwrJXb/hNfwL6lgRT0iBTnPXqv8NYszzlCpkEazuBn2G3qOKY5GUtzC3PgA2hzzuOKki57RaTz5Z00ykxTbRxTyGdqP9vFJBaO0ojN5kCDuxrrxLf8zo5JofdQqgsR67Yc1CSS4qaVs3RWBjOUI4cAWaEu5WyAbjW0PX7IiW21WllzRUTvK7hLTnfaQb7zd3ve43jk2lFC2SDfCFbJCAH5Jh8I2ekTRi5JffkFxl7d96D98f7+zw64013PpMX8B6fADtosMU=</latexit>

a(t) := ! adjoint Continuous Normalizing Flows


<latexit sha1_base64="EFdvCFfQ9Sad20lj40IUzN77SEo=">AAACJXicbVDLSsNAFJ3UV62vqEs3g0XoqiQV1GWxIK6kgn1AG8pkMmmHTmbCzESpob/gb/gDbvUP3Ingyp3fYZJmYVsPDBzOuXfu4bgho0pb1pdRWFldW98obpa2tnd298z9g7YSkcSkhQUTsusiRRjlpKWpZqQbSoICl5GOO26kfueeSEUFv9OTkDgBGnLqU4x0Ig3MSj/7I5bEmzYE15RHIlLwRsgAMfpI+RBeMfGgSgOzbFWtDHCZ2DkpgxzNgfnT9wSOAsI1Zkipnm2F2omR1BQzMi31I0VChMdoSHoJ5SggyomzNFN4kige9IVMHtcwU/9uxChQahK4yWSA9Egteqn4n9eLtH/hxJSHkSYczw75EYNawLQe6FFJsGaThCAsaZIV4hGSCOukxLkrnkqjTdNi7MUalkm7VrXPqqe3tXL9Mq+oCI7AMagAG5yDOrgGTdACGDyBF/AK3oxn4934MD5nowUj3zkEczC+fwHQxaaC</latexit>

@h(t) Change of variable theorem


<latexit sha1_base64="V6nV49yfoeQycJ4eTCfcsIZdjVY=">AAACF3icbVDLSgMxFM34rPVVdaebYBFclRkFdVnsxmUF+4Bayp30ThvMJEOSKZRS8Df8Abf6B+7ErUt/wO8w03ZhqwcCh3Pu4d6cMBHcWN//8paWV1bX1nMb+c2t7Z3dwt5+3ahUM6wxJZRuhmBQcIk1y63AZqIR4lBgI3yoZH5jgNpwJe/sMMF2DD3JI87AOqlTOKz0QfaQqogOQHNwOWr7qDTGnULRL/kT0L8kmJEimaHaKXzfdxVLY5SWCTCmFfiJbY9AW84EjvP3qcEE2AP0sOWohBhNezT5w5ieOKVLI6Xdk5ZO1N+JEcTGDOPQTcZg+2bRy8T/vFZqo6v2iMsktSjZdFGUCmoVzQqhXa6RWTF0BJjm7lbK+qCBWVfb3JauyU4b510xwWINf0n9rBRclM5vz4rl61lFOXJEjskpCcglKZMbUiU1wsgjeSYv5NV78t68d+9jOrrkzTIHZA7e5w9Tm5/n</latexit>

<latexit sha1_base64="5bwyPMgEWwcVZMeFr/nmRsRxtrk=">AAACYHicbVBNS8NAEN3Gr1q/qt70slgEPVgSFfUiFL14VLAqNKVsNpN26eaD3YlY0/xAf4JXD1696s1NW/BzYOHNe/N2huclUmi07eeSNTU9MztXnq8sLC4tr1RX1250nCoOTR7LWN15TIMUETRRoIS7RAELPQm3Xv+80G/vQWkRR9c4SKAdsm4kAsEZGqpT5a4IzR7Q1JVxlyY7jx1nl55+dfYu3Rt3roQAh9RFeMDMB8ypGyjGMzdhCgWTNMi/sDEaXYluD4edas2u26Oif4EzATUyqctO9dX1Y56GECGXTOuWYyfYzoqvuYS84qYaEsb7rAstAyMWgm5nozByum0YnwaxMi9COmK/OzIWaj0IPTMZMuzp31pB/qe1UgxO2pmIkhQh4uNFQSopxrRIlvpCAUc5MIBxJcytlPeYSQhN/j+2+Lo4La+YYJzfMfwFN/t156h+cHVYa5xNIiqTTbJFdohDjkmDXJBL0iScPJE38k4+Si9W2VqxVsejVmniWSc/ytr4BNAkuJI=</latexit>

@f (h, t) @f
<latexit sha1_base64="/6p6DnVUQnbJLAcCf4bmroUfUk0=">AAACNnicbVDLSsNAFJ34tr6qLt0MFkFBS6LiYyGIblwq2FpoarmZTszg5MHMjVBCvsXf8Afc6tqNO3XrJzipBW31wMDh3Me5c7xECo22/WKNjI6NT0xOTZdmZufmF8qLS3Udp4rxGotlrBoeaC5FxGsoUPJGojiEnuRX3u1pUb+640qLOLrEbsJbIdxEwhcM0Ejt8qHbiTGDnB7RLQrXl9T1FbDMTUChAEn9tosBR1gPNnEj/9GDvF2u2FW7B/qXOH1SIX2ct8vvxoulIY+QSdC66dgJtrJiIZM8L7mp5gmwW7jhTUMjCLluZb0v5nTNKB3qx8q8CGlP/T2RQah1N/RMZwgY6OFaIf5Xa6boH7QyESUp8oh9G/mppBjTIi/aEYozlF1DgClhbqUsAJMRmlQHXDq6OC0vmWCc4Rj+kvp21dmr7lzsVo5P+hFNkRWyStaJQ/bJMTkj56RGGLknj+SJPFsP1qv1Zn18t45Y/ZllMgDr8wsTF6wV</latexit>

z1 = f (z0 ), f ! bijective =) log p(z1 ) = log p(z0 ) log det


<latexit sha1_base64="K+fT913tMkHJ297Dbl3SY6Cd8og=">AAACKXicbVDLSgNBEJz1GeMr6tHLYBAihLCrol4E0YtHBZMISVhmJ73JmNkHM73RuOQn/A1/wKv+gTf1KvgdTmIOmljQUFR1093lxVJotO13a2p6ZnZuPrOQXVxaXlnNra1XdJQoDmUeyUhde0yDFCGUUaCE61gBCzwJVa9zNvCrXVBaROEV9mJoBKwVCl9whkZyc8V716HH1C/cu/ZOkfq0rkSrjUyp6JbWEe4w9cQNcBRd6Lu5vF2yh6CTxBmRPBnhws191ZsRTwIIkUumdc2xY2ykTKHgEvrZeqIhZrzDWlAzNGQB6EY6/KpPt43SpH6kTIVIh+rviZQFWvcCz3QGDNt63BuI/3m1BP2jRirCOEEI+c8iP5EUIzqIiDaFMg/LniGMK2FupbzNFONogvyzpakHp/WzJhhnPIZJUtktOQelvcv9/MnpKKIM2SRbpEAcckhOyDm5IGXCyQN5Is/kxXq0Xq036+OndcoazWyQP7A+vwEvu6aU</latexit>

T ✓
ȧ = a @z0
@h T
<latexit sha1_base64="osxPzMIwoYWRpEeozVaJ6BJd9IQ=">AAACSnicbVDRShtBFJ2N2tpYa9RHXy4GIUEIu1ravgiiL30RFIwKSRruTmaTwdmZZeZuY1zyVf6GP+BrC/2AvokvTmKEqj0wcO4593LvnDhT0lEY3gWlufmFd+8XP5SXPi5/Wqmsrp05k1sumtwoYy9idEJJLZokSYmLzApMYyXO48vDiX/+U1gnjT6lUSY6Kfa1TCRH8lK3cnRdo+2oDnvgSR22IYcB1IY/Tp/ruA5tK/sDQmvNENokrqjIFGq0oI1NUclrqfuQKDMcdyvVsBFOAW9JNCNVNsNxt/Kn3TM8T4UmrtC5VhRm1CnQkuRKjMvt3IkM+SX2RctTjalwnWL67TFseaUHibH+aYKp+u9EgalzozT2nSnSwL32JuL/vFZOybdOIXWWk9D8aVGSKyADkwyhJ63gpEaeILfS3wp8gBY5+aRfbOm5yWnjsg8meh3DW3K204i+NHZPPlf3D2YRLbINtslqLGJf2T77zo5Zk3F2w+7YL/Y7uA3+BvfBw1NrKZjNrLMXKM0/AhmCsNY=</latexit>

cubic cost!
<latexit sha1_base64="S3QZVC1xzz/kqdj0C8SQ7ykBr7I=">AAACBnicbVDLSsNAFJ3UV62vqks30SK4KkkFdVl047KCfUAaymQyaYdOZsLMjVBC9/6AW/0Dd+LW3/AH/A4nbRa29cCFwzn3cu89QcKZBsf5tkpr6xubW+Xtys7u3v5B9fCoo2WqCG0TyaXqBVhTzgRtAwNOe4miOA447Qbju9zvPlGlmRSPMEmoH+OhYBEjGIzkkTRgxCZSw+mgWnPqzgz2KnELUkMFWoPqTz+UJI2pAMKx1p7rJOBnWAEjnE4r/VTTBJMxHlLPUIFjqv1sdvLUPjdKaEdSmRJgz9S/ExmOtZ7EgemMMYz0speL/3leCtGNnzGRpEAFmS+KUm6DtPP/7ZApSoBPDMFEMcjfH2GFCZiUFraEOj9tWjHBuMsxrJJOo+5e1S8fGrXmbRFRGZ2gM3SBXHSNmugetVAbESTRC3pFb9az9W59WJ/z1pJVzByjBVhfvzIWmWs=</latexit>

@L z(t + 1) = z(t) + uh(w z(t) + b) ! planar normalizing flow


<latexit sha1_base64="NK/zX5CUTzsB7AcWcXIeU22N3j4=">AAACPXicbVBNSyNBFOxR14+4atY9emkMC/ESZlTUy0JYL3vYg4LRQCaEN52epElP99D9Rg3D/J39G/4Bry7sD1hPste92hMDGrMFDUXVe9TrilIpLPr+b29hcenD8srqWmX948bmVvXT9qXVmWG8xbTUph2B5VIo3kKBkrdTwyGJJL+KRqelf3XNjRVaXeA45d0EBkrEggE6qVdtQv1ij36lYWyA5WEKBgVI+qN45UM3UdDQiMEQwRh9Q0Pkt5iPlL5RRa9a8xv+BHSeBFNSI1Oc9aqPYV+zLOEKmQRrO4GfYjcvw5jkRSXMLE+BjWDAO44qSLjt5pOfFvSLU/o01sY9hXSivt3IIbF2nERuMgEc2vdeKf7P62QYn3RzodIMuWIvQXEmKWpa1kb7wnCGcuwIMCPcrZQNwXWGrtyZlL4tTysqrpjgfQ3z5HK/ERw1Ds4Pa81v04pWyQ7ZJXUSkGPSJN/JGWkRRn6Se/JAfnl33h/vyfv7MrrgTXc+kxl4/54BksCvcQ==</latexit>

a(T ) = ! known <latexit sha1_base64="o73pn0vE5Cx67lwcF1uJ2gwIu7E=">AAACX3icbZBNT9tAEIY3BgqkQA09VVxGjSoFoUZ2WwEXpKi9cASJAFIcovVmnKxYf2h3jBQc/z/+AsdeeuQKR9ZJJL460krvvDOzM3rCTElDnndXcxYWlz4sr6zWP66tb3xyN7fOTJprgR2RqlRfhNygkgl2SJLCi0wjj0OF5+HVn6p+fo3ayDQ5pXGGvZgPExlJwclafTcMZGz3oIFApUPImjdN2vV3duDwhWHT77M0UBjRBHzYhfzyFIJIc1EEGdckuYJR+axvSgi0HI5o0ncbXsubBrwX/lw02DyO++6/YJCKPMaEhOLGdH0vo15RfSwUlvUgN5hxccWH2LUy4TGaXjFlUcI36wwgSrV9CcHUfTlR8NiYcRzazpjTyLytVeb/at2cooNeIZMsJ0zEbFGUK6AUKrAwkBoFqbEVXGhpbwUx4pYPWfyvtgxMdVpZt2D8txjei7MfLX+v9fPkV6P9e45ohW2zr6zJfLbP2uyIHbMOE+yW3bMH9lj76yw7G447a3Vq85nP7FU4X54AX5K2Lw==</latexit>

@h(T ) T @h
=) log p(z(t + 1)) = log p(z(t)) log 1 + u
@L @z
<latexit sha1_base64="p5KpxNBpqcNFkwIIrE6IHgNfL2A=">AAACXnicbZBNT9tAEIY35qOQQknhgsRlRYQEh0Y2oLaXSoheOHAAiQBSHEXjzThZsfaa3TEQWf59/Q299dRbr3BlbSLx1ZFWevXO7LyjJ8qUtOT7vxvezOzc/IeFxebHpeVPK63Pq+dW50ZgV2ilzWUEFpVMsUuSFF5mBiGJFF5EVz+r/sUNGit1ekaTDPsJjFIZSwHkrEELYNvf4T94GBsQRZiBIQmKH5fPeuwmSh4aORoTGKNveUh4R0WkQFx9ifQdH8o4RoNpPY7Xeb2bW61ccjlotf2OXxd/L4KpaLNpnQxaf8OhFnni9gkF1vYCP6N+UV0jFJbNMLeYuWgYYc/JFBK0/aJGUfIt5wx5rI17KfHaffmjgMTaSRK5yQRobN/2KvN/vV5O8fd+IdMsJ0zFU1CcK06aV1wdA4OC1MQJEEa6W7kYg4NKjv6rlKGtTiubDkzwFsN7cb7bCb529k732weHU0QLbINtsm0WsG/sgB2xE9Zlgv1i/9g9e2j88ea9ZW/ladRrTP+ssVflrT8CVn25Cw==</latexit>

a(0) = ! black-box di↵erential equation solver KL(q(x) || p(x)) ! loss function


<latexit sha1_base64="4eDA7xkOVAFJrZh9uLU/HfGDhBA=">AAACN3icbVDLSgMxFM34tr6qLt0Ei1A3ZUZFxZXoRtBFBdsKbSmZNNMGM8mY3FHLOH6Lv+EPuNWtK3fFrX9gpu3C14HA4ZxzuTfHjwQ34Lpvztj4xOTU9Mxsbm5+YXEpv7xSNSrWlFWoEkpf+sQwwSWrAAfBLiPNSOgLVvOvjjO/dsO04UpeQC9izZB0JA84JWClVv6gAewOktOztHhdvNvED/f3DziybBM3NO90gWitbvEwJZQxOIglzWbTVr7gltwB8F/ijUgBjVBu5fuNtqJxyCRQQYype24EzYRo4FSwNNeIDYsIvSIdVrdUkpCZZjL4Y4o3rNLGgdL2ScAD9ftEQkJjeqFvkyGBrvntZeJ/Xj2GYL+ZcBnFwCQdLgpigUHhrDDc5ppRED1LCNXc3oppl2hCwdb6Y0vbZKelOVuM97uGv6S6VfJ2S9vnO4XDo1FFM2gNraMi8tAeOkQnqIwqiKJH9Ixe0Kvz5Lw7fedjGB1zRjOr6Aeczy9huq1h</latexit>

@h(0) Time Series


<latexit sha1_base64="/oaIq2XOBb0WlnQWi7EDP1E2WyU=">AAACFHicbVDLSsNAFJ3UV62vqAsXbgaL4KokFdRl0Y3Lin1BG8pkctsOnTyYmQgl5Df8Abf6B+7ErXt/wO9wkmZhWw8MHM65d87luBFnUlnWt1FaW9/Y3CpvV3Z29/YPzMOjjgxjQaFNQx6KnkskcBZAWzHFoRcJIL7LoetO7zK/+wRCsjBoqVkEjk/GARsxSpSWhubJIP8jEeClLeYDfgTBQFaGZtWqWTnwKrELUkUFmkPzZ+CFNPYhUJQTKfu2FSknIUIxyiGtDGIJEaFTMoa+pgHxQTpJHp7ic614eBQK/QKFc/XvRkJ8KWe+qyd9oiZy2cvE/7x+rEY3TsKCKFYQ0HnQKOZYhThrA3tMAFV8pgmhgulbMZ0QQajSnS2keDI7Lc2KsZdrWCWdes2+ql0+1KuN26KiMjpFZ+gC2egaNdA9aqI2oihFL+gVvRnPxrvxYXzOR0tGsXOMFmB8/QJrI57m</latexit>

target density
<latexit sha1_base64="R+jhTZtMnEnaU3tuPRwwAikqH54=">AAACCXicbVBLSgNBFOyJvxh/UZduGoPgKsxEUJdBNy4jmA8kY+jpeUma9PQM3W+EIeQEXsCt3sCduPUUXsBz2PksTGLBg6LqPV5RQSKFQdf9dnJr6xubW/ntws7u3v5B8fCoYeJUc6jzWMa6FTADUiioo0AJrUQDiwIJzWB4O/GbT6CNiNUDZgn4Eesr0ROcoZUekek+IA1BGYFZt1hyy+4UdJV4c1Iic9S6xZ9OGPM0AoVcMmPanpugP2IaBZcwLnRSAwnjQ9aHtqWKRWD80TT1mJ5ZJaS9WNtRSKfq34sRi4zJosBuRgwHZtmbiP957RR71/5IqCRFUHz2qJdKijGdVEBDoYGjzCxhXAublfIB04yjLWrhS2gm0cYFW4y3XMMqaVTK3mX54r5Sqt7MK8qTE3JKzolHrkiV3JEaqRNONHkhr+TNeXbenQ/nc7aac+Y3x2QBztcvUpebMA==</latexit>

Constant memory cost as a function of depth!


<latexit sha1_base64="ujzjM9tKg8LMV3320TST9mehtYs=">AAACKXicbVDLbhNBEJwNkBhDwhKOXAYsJA6RtZtIJEcruXAMEo4jOZbVO9sbjzKP1UxvpJXln+A3+AGuyR9wA65IfAezzh6wTUkjlaq61TWVlUp6SpKf0dajx0+2dzpPu8+e7+69iF/uX3hbOYFDYZV1lxl4VNLgkCQpvCwdgs4UjrKbs8Yf3aLz0prPVJc40XBtZCEFUJCm8cGZNZ7AENeorau5sJ44eA68qIxohrgteI4lzd5M417ST5bgmyRtSY+1OJ/Gf65yKyqNhoQC78dpUtJkDo6kULjoXlUeSxA3cI3jQA1o9JP58lcL/i4oOS+sCy/kW6r/bsxBe1/rLExqoJlf9xrxf964ouJkMpemrAiNeDhUVIqT5U1FPJcOBak6EBBOhqxczMCBoFDkypXcN9EW3VBMul7DJrk47Kcf+kefDnuD07aiDnvN3rL3LGXHbMA+snM2ZIJ9Yd/YHbuPvkbfox/Rr4fRrajdecVWEP3+CyBqpwg=</latexit>

normalizing flow) and a(T ).


<latexit sha1_base64="ElXDJKuMiqnGrsmoWmQB4N15Wz8=">AAACDnicbVBLSgNBFOzxG+Nvoks3jUFwFWYiqMugG5cRzAeSIfT09CRN+jN09xhiyB28gFu9gTtx6xW8gOewJ5mFSSx4UFS9xysqTBjVxvO+nbX1jc2t7cJOcXdv/+DQLR01tUwVJg0smVTtEGnCqCANQw0j7UQRxENGWuHwNvNbj0RpKsWDGSck4KgvaEwxMlbquSUhFUeMPlHRhzGTI9hzy17FmwGuEj8nZZCj3nN/upHEKSfCYIa07vheYoIJUoZiRqbFbqpJgvAQ9UnHUoE40cFkFn0Kz6wSwVgqO8LAmfr3YoK41mMe2k2OzEAve5n4n9dJTXwdTKhIUkMEnj+KUwaNhFkPMKKKYMPGliCsqM0K8QAphI1ta+FLpLNo06Itxl+uYZU0qxX/snJxXy3XbvKKCuAEnIJz4IMrUAN3oA4aAIMReAGv4M15dt6dD+dzvrrm5DfHYAHO1y/ghpx6</latexit>

<latexit sha1_base64="esaUt63aXY+a1c+BYNec+YSpWtA=">AAACaHicbVHLbtNAFB27QEt4uWWBEJtRE6QiochupcKygg3Lgpq2UhJF1+PreIhnxpq5pkRWPpIlP8CCH2DLTJoFbbnSSEfnPs7RmbyppaM0/RnFW/fuP9jeedh79PjJ02fJ7t65M60VOBKmNvYyB4e11DgiSTVeNhZB5TVe5IuPoX/xDa2TRp/RssGpgrmWpRRAnpoliwnhd+q+oDCqaQn5oDqgNwOeg1hcgS241JykQk5mjlSh5VeSKi7JcSi+GqmJDyBsvOWOwJLUc15ao8KdM38HdBEGPByuZkk/Habr4ndBtgF9tqnTWfJrUhjRKtQkanBunKUNTbsgI2pc9Satw8YbhTmOPdSg0E27dSgr/tozBS+N9c+7XLP/bnSgnFuq3E8qoMrd7gXyf71xS+X7aSd1iEuLa6GyrX1CPCTMC2lRUL30AISV3isXFVgQ5P/hhkrhgrVVzweT3Y7hLjg/HGbHw6PPh/2TD5uIdtgrts8OWMbesRP2iZ2yERPsB/sTsSiKfsdJ/CJ+eT0aR5ud5+xGxft/AZVKtz4=</latexit>

Recompute h(t) backward in time together with its adjoint a(t), starting from h(T
Continuous Normalizing Flows
<latexit sha1_base64="EFdvCFfQ9Sad20lj40IUzN77SEo=">AAACJXicbVDLSsNAFJ3UV62vqEs3g0XoqiQV1GWxIK6kgn1AG8pkMmmHTmbCzESpob/gb/gDbvUP3Ingyp3fYZJmYVsPDBzOuXfu4bgho0pb1pdRWFldW98obpa2tnd298z9g7YSkcSkhQUTsusiRRjlpKWpZqQbSoICl5GOO26kfueeSEUFv9OTkDgBGnLqU4x0Ig3MSj/7I5bEmzYE15RHIlLwRsgAMfpI+RBeMfGgSgOzbFWtDHCZ2DkpgxzNgfnT9wSOAsI1Zkipnm2F2omR1BQzMi31I0VChMdoSHoJ5SggyomzNFN4kige9IVMHtcwU/9uxChQahK4yWSA9Egteqn4n9eLtH/hxJSHkSYczw75EYNawLQe6FFJsGaThCAsaZIV4hGSCOukxLkrnkqjTdNi7MUalkm7VrXPqqe3tXL9Mq+oCI7AMagAG5yDOrgGTdACGDyBF/AK3oxn4934MD5nowUj3zkEczC+fwHQxaaC</latexit>

its adjoint a(t), starting from h(T ) and a(T ).


Z 0 d log p(z(t)) @f
<latexit sha1_base64="qnECW2BJDBkeUo592qOvekvNBJo=">AAACinicbVHtatRAFJ2kftS12q39Jf4Zughb0CWpYi1FKFrBH4IVd9vCZhtuJpPN0MlMmLmpLiGP4AP6Ar6AL+Bkd0G77YWBwzn3zL2cm5RSWAyCX56/dufuvfvrDzoPNx493uxuPTm1ujKMj5iW2pwnYLkUio9QoOTnpeFQJJKfJZcfWv3sihsrtBrirOSTAqZKZIIBOiru/owyA6yOSjAoQNLPzT8cYc4RGvqOvqSRUBjXw+aiDhoKfdy9GNIVaxYvDP3cyS8o7t7yVYo0MmKaIxijvzua/8D6y/FH+k3LK97E3V4wCOZFb4JwCXpkWSdx93eUalYVXCGTYO04DEqc1O1UJnnTiSrLS2CXMOVjBxUU3E7qeWwNfe6YlGbauKeQztn/HTUU1s6KxHUWgLld1VryNm1cYfZ2UgtVVsgVWwzKKklR0/YGNBWGM5QzB4AZ4XalLAcXJrpLXZuS2na1puOCCVdjuAlO9wbhm8Grr697R++XEa2TZ2SH9ElI9skR+UROyIgw8sd76u14PX/D3/MP/MNFq+8tPdvkWvnHfwH/F8aE</latexit>
<latexit sha1_base64="1TlC9jbUYb0bV1fs6DRA2eBx5Pk=">AAACYXicbVFNSyNBEO2M7m7Mfo3x6KUxCBPYDTPusnoRRC97dMF8QCaEnp6a2NjzQXfNYjLMH/QfeBY8e9XT9iQDa9SChtev3usqXgeZFBpd97ZhbWy+e/+hudX6+Onzl6/2dnug01xx6PNUpmoUMA1SJNBHgRJGmQIWBxKGwdVZ1R/+BaVFmlzgPINJzGaJiARnaKipHfphisWipMc0chbfsEt9EZvBoKkfKcaLkPoyndHMWTjY7ZbmjpX4O/URrrFAVToroZ8xhYJJGpX/cWUqu1O74/bcZdHXwKtBh9R1PrXvzVo8jyFBLpnWY8/NcFJUj3IJZcvPNWSMX7EZjA1MWAx6UizTKOm+YUIapcqcBOmSfe4oWKz1PA6MMmZ4qV/2KvKt3jjH6GhSiCTLERK+GhTlkmJKq2hpKBRwlHMDGFfC7Er5JTPZoPmAtSmhrlYrWyYY72UMr8HgoOf96v3487NzclpH1CS7ZI84xCOH5IT8JuekTzi5IQ/kkTw17qwty7baK6nVqD07ZK2s3X9KBbip</latexit>

@L T @f ✓ (h(t), t) ż = f (z, t) =) = tr( )


= a(t) dt ! ODE Solve dt @z(t)
@✓ T @✓
f does not need to be bijective!
<latexit sha1_base64="Vx10z8wCIrTvzVsHcEwFud7OipQ=">AAACH3icbVDLSgNBEOz1GeMr6tHLmCh4CrsK6jHoxaOCiYEYwuxsbxydnVlmZoUQcvc3/AGv+gfexKs/4Hc4m+RgEgsaiupuuqvCVHBjff/bm5tfWFxaLqwUV9fWNzZLW9sNozLNsM6UULoZUoOCS6xbbgU2U400CQXeho8Xef/2CbXhSt7YXorthHYljzmj1kmdUnk/3ieRQkOkskQiRsQqEiIJ+QMyy59wr1Oq+FV/CDJLgjGpwBhXndLPXaRYlqC0TFBjWoGf2nafasuZwEHxLjOYUvZIu9hyVNIETbs/9DIgB06JSKy0K2nJUP270aeJMb0kdJMJtfdmupeL//VamY3P2n0u08yiZKNDcSZyu3kwJOLaGRY9RyjT3P1K2D3VlFkX38SVyOSvDYoumGA6hlnSOKoGJ9Xj66NK7XwcUQF2oQyHEMAp1OASrqAODJ7hFd7g3XvxPrxP72s0OueNd3ZgAt73L6Qcogg=</latexit>

<latexit sha1_base64="Y057faftSfVthrKeoiMHiW+VcB8=">AAACsXiclVHLbtNAFB2bVwmvFHawGREhpVKJbIoKywg20FWRmrYoDuF6fB2PMp6xZq4LkeW/4mf4Ab6DcRIJ2rLhSiMdnfs6c25aKekoin4G4Y2bt27f2bnbu3f/wcNH/d3Hp87UVuBEGGXseQoOldQ4IUkKzyuLUKYKz9Ll+y5/doHWSaNPaFXhrISFlrkUQJ6a93/AlxOe5BZEk1RgSYLi+TyhAgmGxT6nvfZPomj3+f/Ub+iWJ1YuCgJrzTfP4XdqLlCQsS+PQJhUguaVNVktiA8dgc7AZhxqMqUXKXgm8xwtaj+y0+z22nl/EI2idfDrIN6CAdvG8bz/K8mMqEs/RChwbhpHFc2aTqZQ2PaS2mEFYgkLnHqooUQ3a9butvyFZzKeG+ufJr5m/+5ooHRuVaa+0gsu3NVcR/4rN60pfztrpK5qQi02i/JacTK8O5X/uPU2qZUHIKzsvBAFeOvJH/TSlsx10tqeNya+asN1cPpqFB+ODj69HozfbS3aYc/YczZkMXvDxuwDO2YTJoKnwTj4GByFB+Hn8GuYbkrDYNvzhF2KcPkbVbbXDg==</latexit>

T @f✓ (h, t) T @f✓ (h, t)


a ,a ! vector-Jacobian product (standard automatic di↵erentiations)
@h @✓
Chen, Ricky TQ, et al. "Neural ordinary di erential equations." arXiv preprint arXiv:1806.07366 (2018).
ff
ff
Spatial Transformer Networks
YouTube Playlist

Source Target

(a) MNIST distorted with random


translation, scale, rotation, and clutter.

Bilinear sampling

Jaderberg, Max, Karen Simonyan, and Andrew Zisserman. "Spatial transformer networks." Advances in neural information processing systems. 2015.
Dynamic Routing Between Capsules
YouTube Playlist

neuron ! capsule ! layer exp(bij )


| {z } cij = P
group of neurons k exp(bik )
sj ! total input to capsule j bij ! log prob. that capsule i should
vj ! vector output of capsule j be coupled to capsule j
kvj k ! probability that the entity represented by aij = vj · uj|i ! agreement btw current output vj j 2 {1, . . . , 10}
the capsule is present in the current input and the predictions uj|i made by capsule i i 2 {1, . . . , 32 ⇤ 6 ⇤ 6}
vj
! properties of the entity
kvj k
entity: an object or an object part
ksj k2 sj
vj =
1 + ksj k2 ksj k
| {z }
squashing function

Margin loss for digit existence


X
sj = cij uj|i
i
cij ! coupling coefficients
uj|i = Wij ui
uj|i ! prediction vectors from the
capsules in the layer below
ui ! output of a capsule
in the layer below
Sabour, Sara, Nicholas Frosst, and Geo rey E. Hinton. "Dynamic routing between capsules." Advances in neural information processing systems. 2017.
ff
An Image is Worth 16x16 Words:
Transformers for Image Recognition at Scale YouTube Playlist

LN ! layer normalization
<latexit sha1_base64="+Z7d8yz/tWP3U03x+EreZF8At6M=">AAACGXicbVBNTxsxFPSGj9Lw0QDHXqxGSJyi3Qi1PaJy6QEhkAggJavorfOSWPHaK/ttIV3lb/TCX+mlBxDiCKf+mzrJHkpgJEujmTd6fpNkSjoKw79BZWl5ZfXd2vvq+sbm1ofa9s6FM7kV2BJGGXuVgEMlNbZIksKrzCKkicLLZHQ09S9/oHXS6HMaZxinMNCyLwWQl7q1sEN4Q8XxyYR3rBwMCaw113yuKhij5drYFJT8OUtMurV62Ahn4K9JVJI6K3HarT11ekbkKWoSCpxrR2FGcQGWpFA4qXZyhxmIEQyw7amGFF1czC6b8D2v9HjfWP808Zn6f6KA1LlxmvjJFGjoFr2p+JbXzqn/NS6kznJCLeaL+rniZPi0Jt6TFgWpsScgrPR/5WIIFgT5Mqu+hGjx5NfkotmIPjeaZwf1w29lHWvsI/vE9lnEvrBD9p2dshYT7Bf7ze7YfXAb/Akegsf5aCUoM7vsBYLnf74Kogc=</latexit>

MSA ! multiheaded self-attention


<latexit sha1_base64="/9qL1P5j334GoDHAguAZioCJmx0=">AAACIXicbVDJSgNBFOxxN25Rj14ag+DFMCOiObpcvAgRjQpJCG963iSNPQvdb9Qw5Fe8+CtePCiSm/gzdpaDW0FDUfWK16/8VElDrvvhTExOTc/Mzs0XFhaXlleKq2tXJsm0wJpIVKJvfDCoZIw1kqTwJtUIka/w2r89GfjXd6iNTOJL6qbYjKAdy1AKICu1ipUG4QPlZxdHPd7Qst0h0Dq55yM5yhTJDkKAAbc7wh0gwniQ7LWKJbfsDsH/Em9MSmyMaqvYbwSJyCKbFwqMqXtuSs0cNEmhsFdoZAZTELfQxrqlMURomvnwwh7fskrAw0TbFxMfqt8TOUTGdCPfTkZAHfPbG4j/efWMwkozl3Ga2cPEaFGYKU4JH9TFA6lRkOpaAkJL+1cuOqBBkC21YEvwfp/8l1ztlr398u75XunweFzHHNtgm2ybeeyAHbJTVmU1Jtgje2av7M15cl6cd6c/Gp1wxpl19gPO5xfTm6Ut</latexit>

z 2 RN ⇥D
<latexit sha1_base64="RI+WJQ+DrEJwvjroF7GMPt0P17s=">AAACBXicbVA9SwNBEJ3zM8avU0stFoNgFe6CqGVQCyuJYj4gd4a9zSZZsrd37O4J8Uhj41+xsVDE1v9g579xk1yhiQ8GHu/NMDMviDlT2nG+rbn5hcWl5dxKfnVtfWPT3tquqSiRhFZJxCPZCLCinAla1Uxz2oglxWHAaT3on4/8+j2VikXiVg9i6oe4K1iHEayN1LL3HpDHBPJCrHtBkN4M79IrT7OQKnQxbNkFp+iMgWaJm5ECZKi07C+vHZEkpEITjpVquk6s/RRLzQinw7yXKBpj0sdd2jRUYLPHT8dfDNGBUdqoE0lTQqOx+nsixaFSgzAwnaNr1bQ3Ev/zmonunPopE3GiqSCTRZ2EIx2hUSSozSQlmg8MwUQycysiPSwx0Sa4vAnBnX55ltRKRfe4WLo+KpTPsjhysAv7cAgunEAZLqECVSDwCM/wCm/Wk/VivVsfk9Y5K5vZgT+wPn8AFLuYUQ==</latexit>

[q, k, v] = zUqkv , Uqkv 2 RD⇥3Dh


<latexit sha1_base64="8T2Rs4i+UfpKr9YXSner0lfTdEI=">AAACJ3icbVBNS8NAEN34WetX1aOXwSJ4KCVRUS9K0R48VrEqNDFstlu7dLNJdzeFGvJvvPhXvAgqokf/idtawa8HA4/3ZpiZF8ScKW3bb9bY+MTk1HRuJj87N7+wWFhaPldRIgmtk4hH8jLAinImaF0zzellLCkOA04vgs7RwL/oUalYJM50P6ZeiK8FazGCtZH8wkGjW4JOCXoe7MMN1P202+llpS8CLhPghli3gyA9za7SKriahVTBFlT9duYXinbZHgL+EmdEimiEml94dJsRSUIqNOFYqYZjx9pLsdSMcJrl3UTRGJMOvqYNQwU2u7x0+GcG60ZpQiuSpoSGofp9IsWhUv0wMJ2Dk9VvbyD+5zUS3drzUibiRFNBPhe1Eg46gkFo0GSSEs37hmAimbkVSBtLTLSJNm9CcH6//Jecb5adnfLmyXaxcjiKI4dW0RraQA7aRRV0jGqojgi6RffoCT1bd9aD9WK9fraOWaOZFfQD1vsHL+OkXA==</latexit>

p
A = softmax(qk T / Dh ), A 2 RN ⇥N
<latexit sha1_base64="YWLftFLWAKgAi2CmKXDDuBywNKw=">AAACLnicbVBNa9tAEF25bZq6Taq0x16GmoILxZFCaXIJJGkLOYU0xB9g2Wa1XtmLVytldxRihH5RL/0rzSHQhJBrfkZXjg+p3QcDj/dmmJkXplIY9Lw/TuXJ02crz1dfVF++Wlt/7W68aZkk04w3WSIT3Qmp4VIo3kSBkndSzWkcSt4OJ19Lv33OtRGJOsVpynsxHSkRCUbRSgP3+z7sQoD8AnOTRBjTi6J+BpP+KWxCYM405t8G4+LjJ9iHQCgIYorjMMxPin5+ZOdEzA0cFQO35jW8GWCZ+HNSI3McD9zLYJiwLOYKmaTGdH0vxV5ONQomeVENMsNTyiZ0xLuWKmr39PLZuwV8sMoQokTbUggz9fFETmNjpnFoO8tzzaJXiv/zuhlGO71cqDRDrtjDoiiTgAmU2cFQaM5QTi2hTAt7K7Ax1ZShTbhqQ/AXX14mra2G/6Wx9eNzbe9gHscqeUfekzrxyTbZI4fkmDQJIz/Jb3JNbpxfzpVz69w9tFac+cxb8g+c+79uOKeY</latexit>

<latexit sha1_base64="0jnq5Wf1uT1bOfr/6VTKv0ZB4nA=">AAAB+3icbVDLTgJBEJzFF+JrxaOXicQEL2SXGPViAnrxiFEeCWzI7DDAhNlHZnoJuNlf8eJBY7z6I978GwfYg4KVdFKp6k53lxsKrsCyvo3M2vrG5lZ2O7ezu7d/YB7mGyqIJGV1GohAtlyimOA+qwMHwVqhZMRzBWu6o9uZ3xwzqXjgP8I0ZI5HBj7vc0pAS10z3wE2gfihmhSfzvA1ruJx1yxYJWsOvErslBRQilrX/Or0Ahp5zAcqiFJt2wrBiYkETgVLcp1IsZDQERmwtqY+8Zhy4vntCT7VSg/3A6nLBzxXf0/ExFNq6rm60yMwVMveTPzPa0fQv3Ji7ocRMJ8uFvUjgSHAsyBwj0tGQUw1IVRyfSumQyIJBR1XTodgL7+8Shrlkn1RKt+fFyo3aRxZdIxOUBHZ6BJV0B2qoTqiaIKe0St6MxLjxXg3PhatGSOdOUJ/YHz+ACDukzI=</latexit>

SA(z) = Av
MSA(z) = [SA1 (z); SA2 (z); . . . ; SAk (z)]Umsa , Umsa 2 RkDh ⇥D
<latexit sha1_base64="eQNEU1mRX0BqeabSLbRSIE1yM8w=">AAACTXicbZFRSxtBEMf3YtWYVhvtY1+GhoIFCXehqCCCUR98KVhtVMidx97exizZ2zt258R43Bfsi9A3v4UvPlhK6eaM0GoHFn78Z4aZ+W+USWHQdW+d2syr2bn5+kLj9ZvFpbfN5ZUTk+aa8R5LZarPImq4FIr3UKDkZ5nmNIkkP41Ge5P86SXXRqTqG44zHiT0QomBYBStFDZjH/kVFl+Ou+Xq9SfYhv5xN/QsboGFTgW+jFM0lTCyQgC9sEgMLdeeAHyhwE8oDqOoOCrPixHsh0PwUSTcwH4ZNltu260CXoI3hRaZxmHY/OHHKcsTrpBJakzfczMMCqpRMMnLhp8bnlE2ohe8b1FROycoKjdK+GiVGAaptk8hVOrfHQVNjBknka2crGye5ybi/3L9HAebQSFUliNX7HHQIJeAKUyshVhozlCOLVCmhd0V2JBqytB+QMOa4D0/+SWcdNreervz9XNrZ3dqR528Jx/IKvHIBtkhB+SQ9Agj38kdeSA/nRvn3vnl/H4srTnTnnfkn6jN/wHSWLDr</latexit>

<latexit sha1_base64="aSZ7QxdzQdF3QnKDm+JrVgpEJSI=">AAAB8HicbVBNSwMxEJ2tX7V+VT16CRbBU90tol6Eoj14rGA/pF1KNs22oUl2SbJCWforvHhQxKs/x5v/xrTdg7Y+GHi8N8PMvCDmTBvX/XZyK6tr6xv5zcLW9s7uXnH/oKmjRBHaIBGPVDvAmnImacMww2k7VhSLgNNWMLqd+q0nqjSL5IMZx9QXeCBZyAg2Vnqs9YboGtXORr1iyS27M6Bl4mWkBBnqveJXtx+RRFBpCMdadzw3Nn6KlWGE00mhm2gaYzLCA9qxVGJBtZ/ODp6gE6v0URgpW9Kgmfp7IsVC67EIbKfAZqgXvan4n9dJTHjlp0zGiaGSzBeFCUcmQtPvUZ8pSgwfW4KJYvZWRIZYYWJsRgUbgrf48jJpVsreRblyf16q3mRx5OEIjuEUPLiEKtxBHRpAQMAzvMKbo5wX5935mLfmnGzmEP7A+fwBBrKPQQ==</latexit>

Dh = D/k
Inductive Bias in CNNs: translation equivariance and locality
<latexit sha1_base64="ZcGpMYnrl0EiZijMjJyLSpIQ2JQ=">AAACJ3icbVDLSgNBEJz1bXxFPXoZDIKnsJuDigcJ8aKXEMEkQgyhd7ajg7Oz60xvIIT8jRd/xYugInr0T5zEHHwVDBRV1fR0hamSlnz/3Zuanpmdm19YzC0tr6yu5dc3GjbJjMC6SFRiLkKwqKTGOklSeJEahDhU2Axvjkd+s4fGykSfUz/FdgxXWnalAHJSJ390qqNMkOwhr0iwXGp+XK3aQ04GtFXjFMfbTPbASNACOeiIq0SAktTnnXzBL/pj8L8kmJACm6DWyT9dRonIYtQkFFjbCvyU2gMwJIXCYe4ys5iCuIErbDmqIUbbHozvHPIdp0S8mxj3NPGx+n1iALG1/Th0yRjo2v72RuJ/Xiuj7kF7IHWaEWrxtaibKU4JH5XGI2lQkOo7AsJI91cursGAIFdtzpUQ/D75L2mUisFesXRWKpQrkzoW2BbbZrssYPuszE5YjdWZYHfsgT2zF+/ee/Revbev6JQ3mdlkP+B9fALJtaXp</latexit>

Image Patches ⌘ Tokens (Words) in NLP


<latexit sha1_base64="ljPwU10+njLWjQUm5s3XSNH6fuY=">AAACIXicbVDLSgNBEJyN7/iKevQyGIR4CbtBNEfRi4JIBGOEJITZSccMmZ1ZZ3rFsORXvPgrXjwo4k38GSePgxoLGoqqbrq7wlgKi77/6WVmZufmFxaXsssrq2vruY3Na6sTw6HKtdTmJmQWpFBQRYESbmIDLAol1MLeydCv3YOxQqsr7MfQjNitEh3BGTqplSs3EB4wPXMy0ApD3gU7oA24S8Q9HXtXugfK0kJNm7bdo0LRi/PKoJXL+0V/BDpNggnJkwkqrdxHo615EoFCLpm19cCPsZkyg4JLGGQbiYWY8Z47pO6oYhHYZjr6cEB3ndKmHW1cKaQj9edEyiJr+1HoOiOGXfvXG4r/efUEO+VmKlScICg+XtRJJEVNh3HRtjDAUfYdYdwIdyvlXWYYRxdq1oUQ/H15mlyXisFBsXS5nz86nsSxSLbJDimQgBySI3JKKqRKOHkkz+SVvHlP3ov37n2MWzPeZGaL/IL39Q1frqOa</latexit>

x 2 RH⇥W ⇥C ! image
<latexit sha1_base64="o4QZt5HqVIMH2+3+nRA7XqCoa1k=">AAACKHicbVDLSgMxFM3UV62vqks3wSK4KjNF1J3FbrqsYh/QqSWTpm1oJjMkd7RlmM9x46+4EVGkW7/E9CFo64HA4dxH7jleKLgG2x5bqZXVtfWN9GZma3tndy+7f1DTQaQoq9JABKrhEc0El6wKHARrhIoR3xOs7g1Kk3r9gSnNA3kHo5C1fNKTvMspASO1s1dD7HKJXZ9A3/Pi2+Q+LmMXuM80rv+QUoJdxXt9IEoFj0ZlQ4i52cSSdjZn5+0p8DJx5iSH5qi0s29uJ6CRzyRQQbRuOnYIrZgo4FSwJONGmoWEDszypqGSmANa8dRogk+M0sHdQJknAU/V3xMx8bUe+Z7pnBjSi7WJ+F+tGUH3shVzGUbAJJ191I0EhgBPUsMdrhgFMTKEUMXNrZj2iSIUTLYZE4KzaHmZ1Ap55zxfuDnLFa/ncaTRETpGp8hBF6iIyqiCqoiiJ/SC3tGH9Wy9Wp/WeNaasuYzh+gPrK9vZsqm3A==</latexit>

Big Transfer (BiT) ! A ResNet


<latexit sha1_base64="5zhsIqTa4QTmDlQyOU4/9hVZ0VI=">AAACHnicbVDLSgMxFM3Ud31VXboJFqFuykzxtdS6cSUqrQptKZn0ThuayQzJHbUM/RI3/oobF4oIrvRvTB8LtR4IHM45l5t7/FgKg6775WSmpmdm5+YXsotLyyurubX1KxMlmkOVRzLSNz4zIIWCKgqUcBNrYKEv4drvngz861vQRkSqgr0YGiFrKxEIztBKzdxeHeEe07Jo04pmygSgaaEsKjt9Wtei3UGmdXRHR6ljegnmDLDfzOXdojsEnSTemOTJGOfN3Ee9FfEkBIVcMmNqnhtjI2UaBZfQz9YTAzHjXdaGmqWKhWAa6fC8Pt22SosGkbZPIR2qPydSFhrTC32bDBl2zF9vIP7n1RIMDhupUHGCoPhoUZBIihEddEVbQgNH2bOEcS3sXynvMM042kaztgTv78mT5KpU9PaLpYvd/FF5XMc82SRbpEA8ckCOyCk5J1XCyQN5Ii/k1Xl0np03530UzTjjmQ3yC87nNw2UonQ=</latexit>

N ⇥(P 2 C)
<latexit sha1_base64="Kp4NYPU8AQTw/UbyWb6avXPVWQc=">AAACT3icbVFNbxMxEPWGrzZ8BThyGZEitZdoN0LAsaIcOKGASFspm0ZeZzZr1WsbexYSrfYfcoEbf4MLBxDCG/YALSNZenpvRjPvObNKeorjr1HvytVr12/s7PZv3rp95+7g3v1jbyoncCqMMu404x6V1DglSQpPrUNeZgpPsvOjVj/5gM5Lo9/RxuK85Cstcyk4BWoxyNcLC6nUkJaciiyr3zZn9WtISZboYX9yNj46aCB1clUQd858DBKuqfb4vkItEEwOueJEqHEJ45dgOYmiHXXoC24R9tZ7B81iMIxH8bbgMkg6MGRdTRaDL+nSiKpETUJx72dJbGlec0dSKGz6aeXRcnHOVzgLUPNw7rze5tHA48AsITcuPE2wZf+eqHnp/abMQmfr2l/UWvJ/2qyi/Pm8ltpWwa/4syivFJCBNlxYSoeC1CYALpwMt4IouOOCwhf0QwjJRcuXwfF4lDwdjd88GR6+6OLYYQ/ZI7bPEvaMHbJXbMKmTLBP7Bv7wX5Gn6Pv0a9e19qLOvCA/VO93d8iAbNO</latexit>

xp 2 R ! sequence of flattened 2D patches (reshape x)


P ⇥ P ! resolution of each image patch
<latexit sha1_base64="YGVTSR/xz2oQKEZEbuVu4Lw/ERg=">AAACJXicbVDLSgNBEJz1GeMr6tHLYBA8hV0R9eAh6MVjBPOAJITZSW92yOzOMtOrhiU/48Vf8eLBIIInf8XZJAdfBQNFVfd0d/mJFAZd98NZWFxaXlktrBXXNza3tks7uw2jUs2hzpVUuuUzA1LEUEeBElqJBhb5Epr+8Cr3m3egjVDxLY4S6EZsEItAcIZW6pUuarSDIgJDLdFiECLTWt1bER4w02CUTPNKqgIKjIdU2A+AJgx5OO6Vym7FnYL+Jd6clMkctV5p0ukrnkYQI5fMmLbnJtjNmEbBJYyLndRAwvjQjmhbGjO7WDebXjmmh1bp00Bp+2KkU/V7R8YiY0aRbysjhqH57eXif147xeC8m4k4SRFiPhsUpJKionlktC80cJQjSxjXwu5Kecg042iDLdoQvN8n/yWN44p3Wjm+OSlXL+dxFMg+OSBHxCNnpEquSY3UCSeP5Jm8konz5Lw4b877rHTBmffskR9wPr8ALImlsw==</latexit>

HW
<latexit sha1_base64="7N6zly6VnADmJjmpW569Ve/Qee0=">AAACK3icbVBNa9tAEF25aeo4Teq2x16WmEJORjKh7aUQnEtOwYE6NliuWa1H0uLVrtgdJTVC/yeX/pUe0kOb0mv+R9cfh8bOg4HHezPMzItyKSz6/r1Xe7bzfPdFfa+x//Lg8FXz9ZsrqwvDoc+11GYYMQtSKOijQAnD3ADLIgmDaHa28AfXYKzQ6gvOcxhnLFEiFpyhkybN7gX9TMPYMF6e00FV9r52KhoakaTIjNE3NET4hqUBW0gUKqGqyCIwVMc0Z8hTsNWk2fLb/hJ0mwRr0iJr9CbNu3CqeZGBQi6ZtaPAz3FcMoOCS6gaYWEhZ3zGEhg5qlgGdlwuf63oe6dMaayNK4V0qf4/UbLM2nkWuc6MYWo3vYX4lDcqMP40LoXKCwTFV4viQlLUdBEcnQoDHOXcEcaNcLdSnjIXHLp4Gy6EYPPlbXLVaQcf2p3Lk9Zpdx1HnbwjR+SYBOQjOSXnpEf6hJNb8oP8Ir+9795P74/3d9Va89Yzb8kjeA//AMdsqBk=</latexit>

N= 2
! resulting number of patches
P
D ! latent vector size
<latexit sha1_base64="GINUJ9y/cDEi2T17eh/Hs2/C71U=">AAACEHicbVC7TsMwFHV4lvIKMLJYVAimKqkQMFbAwFgk+pDaqHJcp7XqxJF9UyhRP4GFX2FhACFWRjb+BrfNAC13Ojrnvs7xY8E1OM63tbC4tLyymlvLr29sbm3bO7s1LRNFWZVKIVXDJ5oJHrEqcBCsEStGQl+wut+/HOv1AVOay+gWhjHzQtKNeMApAUO17aMr3FK82wOilLzDLWD3kAoCLAI8YBSkwpo/sFHbLjhFZ1J4HrgZKKCsKm37q9WRNAnNIiqI1k3XicFLiQJOBRvlW4lmMaF90mVNAyMSMu2lE0MjfGiYDg7M8UCaRybs74mUhFoPQ990hgR6elYbk/9pzQSCcy/lUZwYh3R6KEgEBonH6eAOV8a0GBpAqOLmV0x7RBEKJsO8CcGdtTwPaqWie1os3ZwUyhdZHDm0jw7QMXLRGSqja1RBVUTRI3pGr+jNerJerHfrY9q6YGUze+hPWZ8/jBmdkA==</latexit>

(P 2 C)⇥D
<latexit sha1_base64="CGQSHpxX7rNArWVVPLW9EpJ0UXQ=">AAACh3icbVFdaxNBFJ1dq7ZRa6qPvlwMQmtL3I2lLYhQrYE+lShNW8hultnJJB06O7PM3JXEZf+KP8o3/42zScS28cKFw7nncL/SXAqLQfDb8x+sPXz0eH2j8eTps83nza0XF1YXhvE+01Kbq5RaLoXifRQo+VVuOM1SyS/Tm5O6fvmdGyu0OsdZzuOMTpQYC0bRUUnz548kgI8wmCZlhHyKJZPU2qr6ANMkH4bQXYBODSI50mgXxBl0Y9iF7l9brp1pD7oQCQVRRvE6Tctv1bDc7g07JzsRioxb+FJL7lhW9We74T950mwF7WAesArCJWiRZfSS5q9opFmRcYXzTQZhkGNcUoOCSV41osLynLIbOuEDBxV1feJyfscK3jhmBGNtXCqEOXvbUdLM2lmWOmU9sb1fq8n/1QYFjo/iUqi8QK7YotG4kIAa6qfASBjOUM4coMwINyuwa2ooQ/e6hjtCeH/lVXDRaYcH7c7X/dbx5+U51skr8ppsk5AckmNySnqkT5i35r313nv7/ob/zj/wjxZS31t6XpI74X/6A/m3wSM=</latexit>

1 2 N
z0 = [xclass ; xp E; xp E; . . . ; xp E] + Epos , E 2 R , Epos 2 R(N +1)⇥D
zl0 = MSA(LN(zl zl = MLP(LN(zl0 )) + zl0 , l = 1, . . . , L 0
<latexit sha1_base64="B+bSfZj0qiL1weZ6SM1kKZqqtkk=">AAAB/XicbVDJSgNBEO1xjXEbl5uXxiDES5gJol6EoBcPQSKYBZIx9HR6kiY9C9014mQI/ooXD4p49T+8+Td2loMmPih4vFdFVT03ElyBZX0bC4tLyyurmbXs+sbm1ra5s1tTYSwpq9JQhLLhEsUED1gVOAjWiCQjvitY3e1fjfz6A5OKh8EdJBFzfNINuMcpAS21zf0EX+AWsEdIyzfD/KBdvreO22bOKlhj4HliT0kOTVFpm1+tTkhjnwVABVGqaVsROCmRwKlgw2wrViwitE+6rKlpQHymnHR8/RAfaaWDvVDqCgCP1d8TKfGVSnxXd/oEemrWG4n/ec0YvHMn5UEUAwvoZJEXCwwhHkWBO1wyCiLRhFDJ9a2Y9ogkFHRgWR2CPfvyPKkVC/ZpoXh7kitdTuPIoAN0iPLIRmeohK5RBVURRQP0jF7Rm/FkvBjvxsekdcGYzuyhPzA+fwDI4JQn</latexit>

<latexit sha1_base64="wOigFzoy7JSFdh4P0RWLP40wicQ=">AAACKnicbZDLSgMxFIYz9V5vVZdugkWsWMtMEXUjtLpxUaWirUJbhkyaamjmQnJGrMM8jxtfxU0XSnHrg5heBG39IfDxn3M4Ob8TCK7ANHtGYmp6ZnZufiG5uLS8sppaW68qP5SUVagvfHnnEMUE91gFOAh2F0hGXEewW6d91q/fPjKpuO/dQCdgDZfce7zFKQFt2ani844t8AmuA3uC6OK6GGeGWLqMM892JPateBfv4R/M4n63lcV10fRBZXHJTqXNnDkQngRrBGk0UtlOdetNn4Yu84AKolTNMgNoREQCp4LFyXqoWEBom9yzmkaPuEw1osGpMd7WThO3fKmfB3jg/p6IiKtUx3V0p0vgQY3X+uZ/tVoIreNGxL0gBObR4aJWKDD4uJ8bbnLJKIiOBkIl13/F9IFIQkGnm9QhWOMnT0I1n7MOc/mrg3ThdBTHPNpEWyiDLHSECugclVEFUfSC3tA7+jBeja7RMz6HrQljNLOB/sj4+gb9MKPh</latexit> <latexit sha1_base64="FZ/eCGDl8HZ2AfpjZFaYPh5OuBs=">AAACI3icbZBNS0JBFIbn2pfZl9WyzZBESiL3SlQEgdSmhYVBfoCKzB1HHZz7wcy5kV38L236K21aFNKmRf+lueqitAMDD+97DmfOa/uCKzDNLyO2sLi0vBJfTaytb2xuJbd3KsoLJGVl6glP1myimOAuKwMHwWq+ZMSxBava/avIrz4wqbjn3sPAZ02HdF3e4ZSAllrJ86eWwBe4AewRwptiaZieYPF2mNbWYSaDj3AEWRz1WVncEG0PVBYXW8mUmTPHhefBmkIKTavUSo4abY8GDnOBCqJU3TJ9aIZEAqeCDRONQDGf0D7psrpGlzhMNcPxjUN8oJU27nhSPxfwWP09ERJHqYFj606HQE/NepH4n1cPoHPWDLnrB8BcOlnUCQQGD0eB4TaXjIIYaCBUcv1XTHtEEgo61oQOwZo9eR4q+Zx1ksvfHacKl9M44mgP7aM0stApKqBrVEJlRNEzekXv6MN4Md6MkfE5aY0Z05ld9KeM7x+hbaEe</latexit>

1 ) + zl 1 ), l = 1, . . . , L y = LN(zL )
Dosovitskiy, Alexey, et al. "An image is worth 16x16 words: Transformers for image recognition at scale." arXiv preprint arXiv:2010.11929 (2020).
MLP-Mixer: An all-MLP Architecture for Vision
YouTube Video

– token-mixing MLP – channel-mixing MLP


<latexit sha1_base64="8NCJZEr5nyew/6tsxXfb5vS+qDg=">AAACEnicbVC7SgNBFJ31GeMramkzGAQtEnZTqGXQxkIhgnlAsoTZyd1kyOzsMjMrhiXfYOOv2FgoYmtl5984m2wREw8MHM45lzv3eBFnStv2j7W0vLK6tp7byG9ube/sFvb2GyqMJYU6DXkoWx5RwJmAumaaQyuSQAKPQ9MbXqV+8wGkYqG416MI3ID0BfMZJdpI3cJpqYR1OARRCtgjE318e1PDRqMDIgTwGbVbKNplewK8SJyMFFEGk//u9EIaByA05USptmNH2k2I1IxyGOc7sYKI0CHpQ9tQQQJQbjI5aYyPjdLDfijNExpP1NmJhARKjQLPJAOiB2reS8X/vHas/Qs3YSKKNQg6XeTH3JSA035wj0mgmo8MIVQy89e0C0moNi3mTQnO/MmLpFEpO2flyl2lWL3M6sihQ3SETpCDzlEVXaMaqiOKntALekPv1rP1an1Yn9PokpXNHKA/sL5+ARQrm9w=</latexit>

S⇥C
X2R
<latexit sha1_base64="cZbqZhENTtXVeZ70QvrdqH28/m0=">AAACMXicbVDLTsJAFJ36Fl9Vl24mEhM3ktYYdUlk49IXQkKR3A4DTJhOm5lblDT8khv/xLhxoTFu/QkHZKHgTW5ycu7znDCRwqDnvTozs3PzC4tLy7mV1bX1DXdz69bEqWa8zGIZ62oIhkuheBkFSl5NNIcolLwSdkvDeqXHtRGxusF+wusRtJVoCQZoqYZ7XqWBUDSIADthmF0N7rJrGqCIuKGlAQ20aHcQtI7vLcsfMLPL5UEPZMqbVKgkRYpgjw0abt4reKOg08AfgzwZx0XDfQ6aMUsjrpBJMKbmewnWM9AomN2XC1LDE2BdaPOahQrsS/VspHhA9yzTpK1Y21RIR+zviQwiY/pRaDuHysxkbUj+V6ul2DqtZyNhXLGfQ61UUozp0D7aFJozlH0LgGlhf6WsAxoYWpNz1gR/UvI0uD0s+MeFw8ujfPFsbMcS2SG7ZJ/45IQUyTm5IGXCyCN5IW/k3XlyXp0P5/OndcYZz2yTP+F8fQMd2qrT</latexit>

! real-valued input table


S ! number of non-overlapping image patches
<latexit sha1_base64="ar45hejTt9i1lv3gdqRKdC4mcus=">AAACJXicbVDLSgMxFM34tr6qLt0Ei+DGMiOiLlwU3bhUtFVoS8mkd9pgJgnJHbUM/Rk3/oobFxYRXPkrprULXwcCh3Pu4eae2EjhMAzfg4nJqemZ2bn5wsLi0vJKcXWt5nRmOVS5ltpex8yBFAqqKFDCtbHA0ljCVXxzMvSvbsE6odUl9gw0U9ZRIhGcoZdaxaML2rCi00Vmrb6jDYR7zFWWxmCpTqjSakf7vGTGCNWhwseBGoa8C67fKpbCcjgC/UuiMSmRMc5axUGjrXmWgkIumXP1KDTYzJlFwSX0C43MgWH8xi+pe6pYCq6Zj67s0y2vtGmirX8K6Uj9nshZ6lwvjf1kyrDrfntD8T+vnmFy2MyFMhmC4l+LkkxS1HRYGW0LCxxlzxPGrfB/pbzLLOPoiy34EqLfJ/8ltd1ytF/ePd8rVY7HdcyRDbJJtklEDkiFnJIzUiWcPJAn8kIGwWPwHLwGb1+jE8E4s05+IPj4BBu3pkM=</latexit>

HW
<latexit sha1_base64="T816eTeoqkx5GMR6vQS18bZklGU=">AAAB/XicbVDLSsNAFL2pr1pf8bFzM1gEVyUpom6EopsuK9oHtLVMppN26GQSZiZCDcFfceNCEbf+hzv/xmmbhbYeuHA4517uvceLOFPacb6t3NLyyupafr2wsbm1vWPv7jVUGEtC6yTkoWx5WFHOBK1rpjltRZLiwOO06Y2uJ37zgUrFQnGnxxHtBnggmM8I1kbq2Qe3CF2iji8xSaqomSa1+3Las4tOyZkCLRI3I0XIUOvZX51+SOKACk04VqrtOpHuJlhqRjhNC51Y0QiTER7QtqECB1R1k+n1KTo2Sh/5oTQlNJqqvycSHCg1DjzTGWA9VPPeRPzPa8fav+gmTESxpoLMFvkxRzpEkyhQn0lKNB8bgolk5lZEhtgkoU1gBROCO//yImmUS+5ZqXxzWqxcZXHk4RCO4ARcOIcKVKEGdSDwCM/wCm/Wk/VivVsfs9aclc3swx9Ynz9t3JPx</latexit>

H ⇥ W ! resolution of the original image


<latexit sha1_base64="c0uEPkpd7qrV5fdwatfo1sL7I8k=">AAACJ3icbVDLSgMxFM34tr5GXboJFsFVmRFRVyK66bKCbYV2KJn0jg1mkiG5o5ahf+PGX3EjqIgu/RPTx8LXgcDh3HNv7j1xJoXFIPjwpqZnZufmFxZLS8srq2v++kbD6txwqHMttbmMmQUpFNRRoITLzABLYwnN+PpsWG/egLFCqwvsZxCl7EqJRHCGTur4x1XaRpGCpU3aNuKqh8wYfetEuMPCgNUyHzqpTij2gGrnEYpJKtwgGHT8clAJRqB/STghZTJBreM/t7ua5yko5JJZ2wqDDKOCGRRcwqDUzi1kjF+74S1HFXOrRcXozgHdcUqXJtq4p5CO1O8dBUut7aexc6YMe/Z3bSj+V2vlmBxFhVBZjqD4+KMklxQ1HYZGu8IAR9l3hHEj3K6U95hhHF20JRdC+Pvkv6SxVwkPKnvn++WT00kcC2SLbJNdEpJDckKqpEbqhJN78kheyKv34D15b9772DrlTXo2yQ94n18ZZ6a7</latexit>

S= 2
P Linear computational complexity
P ⇥ P ! resolution of each patch in the number of input patches
<latexit sha1_base64="44meArKE8XxsFY5DNE++U8rgwvs=">AAACH3icbVDLSgNBEJz1bXxFPXoZDIKnsCsSPYpePEYwUUhC6J30Zgdnd5aZXjUs+RMv/ooXD4qIt/yNk5iDr4KBoqqa6a4wU9KS74+8mdm5+YXFpeXSyura+kZ5c6tpdW4ENoRW2lyHYFHJFBskSeF1ZhCSUOFVeHM29q9u0Vip00saZNhJoJ/KSAogJ3XLtTpvk0zQckeM7McExug7J+I9FQatVvk4yXXEEUTMMyARD7vlil/1J+B/STAlFTZFvVv+aPe0yBNMSSiwthX4GXUKMCSFwmGpnVvMQNxAH1uOpuBW6hST+4Z8zyk9HmnjXkp8on6fKCCxdpCELpkAxfa3Nxb/81o5RcedQqZZTpiKr4+iXHHSfFwW70mDgtTAERBGul25iMGAIFdpyZUQ/D75L2keVINa9eDisHJyOq1jie2wXbbPAnbETtg5q7MGE+yBPbEX9uo9es/em/f+FZ3xpjPb7Ae80SePx6NU</latexit>

and number of pixels


all patches are projected with the same projection matrix
<latexit sha1_base64="ds5FjX1oTicutFq64HhQgRk2T5I=">AAACInicbVDJSgNBEO1xN25Rj14ag+ApzOTgcgt68RjBLJCEUNNT47T2LHTXqCH4LV78FS8eFPUk+DF2FsStoODxqh5V7/mZkoZc992Zmp6ZnZtfWCwsLa+srhXXNxomzbXAukhVqls+GFQywTpJUtjKNELsK2z6l8fDefMKtZFpckb9DLsxnCcylALIUr3iISjFMyARoeGgkWc6vUBBGPBrSRGnCLmB+Iu3Ih4DaXnTK5bcsjsq/hd4E1Bik6r1iq+dIBV5jAkJBca0PTej7gA0SaHwttDJDWYgLuEc2xYm9qrpDkYWb/mOZQIeptp2QnzEflcMIDamH/t2074Xmd+zIfnfrJ1TeNAdyCTLCRMxPhTmilPKh3nxQGprW/UtAKGl/ZWLCDTYiLQp2BC835b/gkal7O2VK6eVUvVoEscC22LbbJd5bJ9V2QmrsToT7I49sCf27Nw7j86L8zZenXImmk32o5yPT8zcpHg=</latexit>

C ! hidden dimension
<latexit sha1_base64="2bG1gB2LtLTWvgj88TdmdU78++M=">AAACDnicbVC7SgNBFJ31GeMramkzGAJWYTeIWgbTWEYwD0hCmJ29SYbMzi4zd9Ww5Ats/BUbC0Vsre38GyePQhMPDBzOuYc79/ixFAZd99tZWV1b39jMbGW3d3b39nMHh3UTJZpDjUcy0k2fGZBCQQ0FSmjGGljoS2j4w8rEb9yBNiJStziKoROyvhI9wRlaqZsrVGhbi/4AmdbRPW0jPGA6EEEAigYiBDVJjru5vFt0p6DLxJuTPJmj2s19tYOIJzaPXDJjWp4bYydlGgWXMM62EwMx40PWh5alioVgOun0nDEtWCWgvUjbp5BO1d+JlIXGjELfToYMB2bRm4j/ea0Ee5edVKg4QVB8tqiXSIoRnXRjD9bAUY4sYVwL+1fKB0wzjrbBrC3BWzx5mdRLRe+8WLo5y5ev5nVkyDE5IafEIxekTK5JldQIJ4/kmbySN+fJeXHenY/Z6IozzxyRP3A+fwD4OJy3</latexit>

! token mixing
<latexit sha1_base64="287xY6GjCWatioS/32Fmj0X6iaA=">AAACCHicbVA9SwNBEN3z2/gVtbRwMQhW4S6IWoo2lhFMFJIj7G0myZK93WN3ThOOlDb+FRsLRWz9CXb+GzfxCjU+GHi8N8PMvCiRwqLvf3ozs3PzC4tLy4WV1bX1jeLmVt3q1HCocS21uYmYBSkU1FCghJvEAIsjCddR/3zsX9+CsUKrKxwmEMasq0RHcIZOahV3m0Z0e8iM0Xe0iTDADHUfFI3FQKjuqFUs+WV/AjpNgpyUSI5qq/jRbGuexqCQS2ZtI/ATDDNmUHAJo0IztZAw3mddaDiqWAw2zCaPjOi+U9q0o40rhXSi/pzIWGztMI5cZ8ywZ/96Y/E/r5Fi5yTMhEpSBMW/F3VSSVHTcSq0LQxwlENHGDfC3Up5jxnG0WVXcCEEf1+eJvVKOTgqVy4PS6dneRxLZIfskQMSkGNySi5IldQIJ/fkkTyTF+/Be/Jevbfv1hkvn9kmv+C9fwH0JpqT</latexit>

! channel mixing
<latexit sha1_base64="cQTaJDyoGSjQku5ImSmtqI3MOjA=">AAACCnicbVA9TwJBEN3zE/ELtbRZJSZW5I4YtSTaWGIiHwkQsrcMsGFv77I7p5ALtY1/xcZCY2z9BXb+Gxe4QsGXTPLy3kxm5vmRFAZd99tZWl5ZXVvPbGQ3t7Z3dnN7+1UTxppDhYcy1HWfGZBCQQUFSqhHGljgS6j5g+uJX7sHbUSo7nAUQStgPSW6gjO0Ujt31NSi10emdfhAmwhDTHifKQWSBmIoVG/czuXdgjsFXSReSvIkRbmd+2p2Qh4HoJBLZkzDcyNsJUyj4BLG2WZsIGJ8wHrQsFSxAEwrmb4ypidW6dBuqG0ppFP190TCAmNGgW87A4Z9M+9NxP+8Rozdy1YiVBQjKD5b1I0lxZBOcqEdoYGjHFnCuBb2VmqD0IyjTS9rQ/DmX14k1WLBOy8Ub8/ypas0jgw5JMfklHjkgpTIDSmTCuHkkTyTV/LmPDkvzrvzMWtdctKZA/IHzucPbjebXw==</latexit>

W1 2 RDS ⇥S , W2 2 RS⇥DS , W3 2 RDC ⇥C , W4 2 RC⇥DC ,


<latexit sha1_base64="plePWbQ4GhzLqYRTwFelK0JYChw=">AAACsHicbZFNb9NAEIbX5quErwBHLiMipCJBZIeq5VgRqnLooRDcRIqNNd6sk6X22todA5Hl38edG/+GtRu+ms7p1bzPzM7MJmUmDXneT8e9dv3GzVs7t3t37t67/6D/8NGZKSrNRcCLrNCzBI3IpBIBScrErNQC8yQT0+R83PrTL0IbWaiPtC5FlONSyVRyJJuK+9+nsQ+hVBDmSKskqT80n+q38QRCkrkwMGlewDQebSF/AMt2yKsruox/Q+MO2dtCxn+7tEho5DJHCLVcrgi1Lr5aX3yj+vjoJIDdY6yMkajgyFoaTuzKqCFQkp43cX/gDb0uYFv4GzFgmziN+z/CRcGrXCjiGRoz972Soho1SZ6JphdWRpTIz3Ep5lYqtGNGdXfwBp7ZzAJSO0RaKIIu+29Fjbkx6zyxZLutuey1yau8eUXp66iWqqxIKH7xUFplQAW0vwcLqQWnbG0Fci3trMBXqJGT/eOePYJ/eeVtcTYa+vvD0fu9weGbzTl22BP2lO0ynx2wQ/aOnbKAceelM3FCJ3JH7syNXbxAXWdT85j9F+7nXy110Uw=</latexit>

! GELU (Gaussian Error Linear Unit)


Tolstikhin, Ilya, et al. "MLP-Mixer: An all-MLP Architecture for Vision." arXiv preprint arXiv:2105.01601 (2021).
Swin Transformer: Hierarchical Vision
Transformer using Shifted Windows
: general-purpose backbone for computer vision
<latexit sha1_base64="+47yqkQB4K84YoWVehdxW+2lWjo=">AAACF3icbVC7SgNBFJ2NrxhfUUubwSDYGHYjqFgFbSwjmAckS5id3E2GzM4sM7OBsOQvbPwVGwtFbLXzb5xsUmjirQ7n3MO95wQxZ9q47reTW1ldW9/Ibxa2tnd294r7Bw0tE0WhTiWXqhUQDZwJqBtmOLRiBSQKODSD4e1Ub45AaSbFgxnH4EekL1jIKDGW6hbL17gPAhThZ3GiYqkBB4QOAykAh1JhKqM4MaDwiOnMUHLLbjZ4GXhzUELzqXWLX52epEkEwlBOtG57bmz8lCjDKIdJoZNoiO1B0oe2hYJEoP00yzXBJ5bpZW+EUhicsb8dKYm0HkeB3YyIGehFbUr+p7UTE175KRPTaILODoUJx0biaUm4xxRQw8cWEKqY/RXTAVGE2iZ0wZbgLUZeBo1K2bson99XStWbeR15dISO0Sny0CWqojtUQ3VE0SN6Rq/ozXlyXpx352O2mnPmnkP0Z5zPHyjUn/M=</latexit>

A masking mechanism is employed to limit self-attention


<latexit sha1_base64="+zKSvhc7bZRf4NtAosVyYtKu+mY=">AAACUnicbVJNTxsxEHVCPyClJS1HLlajSr0Q7VKp9EjppUcqNYCUjSKvdzYZxR8re7ZRFPEbKyEu/BAuPQCzIYcWOpKlpzcfz/PsvDIYKUmuW+2NZ89fvNzc6rzafv1mp/v23Wn0ddAw0N74cJ6rCAYdDAjJwHkVQNncwFk++9bkz35BiOjdT1pUMLJq4rBErYipcRcz59EV4Eh+lVbFGbqJtKCnymG0EqMEWxm/gEKSlwYtkmS1cl8RcRPPyLKO9raqaTWxqZojTdFJUHoqY53vz1nAz8fdXtJPViGfgnQNemIdJ+PuZVZ4XVuW0UbFOEyTikZLFQi1gYtOVkeolJ6pCQwZOmUhjpYrSy7kB2YKWfrAh3dbsX93LJWNcWFzrrSKpvFxriH/lxvWVH4ZLdHxwuD0g1BZm2bvxl9ZYABNZsFA6YB8V8lmBqWJX6HDJqSPV34KTg/66ef+px8HvaPjtR2bYk+8Fx9FKg7FkfguTsRAaPFb3Ihbcde6av1p8y95KG231j274p9ob98Dj521Xg==</latexit>

computation to within each sub-window


Relative position bias
<latexit sha1_base64="a2Y4UKT3Sz1mleftaZZSJ0kKlME=">AAACCnicbVC7TsMwFHXKq5RXgJHFUCExVUmRgLGChbEg+pDaqHKcm9aq85DtVKqiziz8CgsDCLHyBWz8DU6aAVqOdKWjc+617z1uzJlUlvVtlFZW19Y3ypuVre2d3T1z/6Ato0RQaNGIR6LrEgmchdBSTHHoxgJI4HLouOObzO9MQEgWhQ9qGoMTkGHIfEaJ0tLAPO7nb6QCvNk9cK1OAMeRZJmNXUbkwKxaNSsHXiZ2QaqoQHNgfvW9iCYBhIpyImXPtmLlpEQoRjnMKv1EQkzomAyhp2lIApBOmq8xw6da8bAfCV2hwrn6eyIlgZTTwNWdAVEjuehl4n9eL1H+lZOyME4UhHT+kZ9wrCKc5YI9JoAqPtWEUKHPp5iOiCBU6fQqOgR78eRl0q7X7Iva+V292rgu4iijI3SCzpCNLlED3aImaiGKHtEzekVvxpPxYrwbH/PWklHMHKI/MD5/ADfhmzk=</latexit>

Patch Merging
<latexit sha1_base64="s1MymVVWVy48NNhVPHgezhq0QGw=">AAACAXicbVDLSsNAFJ34rPUVdSO4GSyCq5JUUJdFN26ECvYBbSiTyU07dDIJMxOhhLrxV9y4UMStf+HOv3GaZqGtBy4czrl37tzjJ5wp7Tjf1tLyyuraemmjvLm1vbNr7+23VJxKCk0a81h2fKKAMwFNzTSHTiKBRD6Htj+6nvrtB5CKxeJejxPwIjIQLGSUaCP17cNe/kYmIZg0iKZDfAtywMSgb1ecqpMDLxK3IBVUoNG3v3pBTNMIhKacKNV1nUR7GZGaUQ6Tci9VkBA6IgPoGipIBMrL8u0TfGKUAIexNCU0ztXfExmJlBpHvumMiB6qeW8q/ud1Ux1eehkTSapB0NmiMOVYx3gaBw6YBKr52BBCJTN/xXRIJKHahFY2IbjzJy+SVq3qnlfP7mqV+lURRwkdoWN0ilx0geroBjVQE1H0iJ7RK3qznqwX6936mLUuWcXMAfoD6/MH3uyXKA==</latexit>

The first patch merging layer concatenates the features


<latexit sha1_base64="xOs1O0dSTc+2feFrVOctsm0xQjk=">AAADSXicbVJLixNBEJ5JfKzxsVk9einMCCtoSOKiXoTFXDyukOwuJCHU9FQmTXq6h34oIezf8+LNm//BiwdFPFmThGXNWtBQU4/v+6qm0lJJ5zudb3GtfuPmrdt7dxp3791/sN88eHjqTLCChsIoY89TdKSkpqGXXtF5aQmLVNFZuuhX+bOPZJ00euCXJU0KzLWcSYGeQ9ODeDrWRuqMtIfBnGAmrfNQohdzKMjmUuegcEkWhNHcRJqfA1+VEvpgyY3HDTMDQu7IrQkl8FfSg7GXBVf2EtAk83lqbIW1Rib3HFBngCWPuAZAqAZAu+Uyes2QHPWTFxnD6Eo/qqsaskv+NvcP5tKBpSyIrTgdirQCmoE3C+6HdAkIRVBelop2JMJbOEoY5TDpbUIJZOaTdliwPhbN1cxjVKh29myjvSIxwZfBw6VChmAZjjyTMn4/aUNj2mx12p21wXWnu3Va0dZOps2v48yIwJBeKHRu1O2UfrJC66VQdNEYB0cligXmNGJXI+udrNaXcAFPOcKbMZYf/9F19GrHCgvnlkXKlQX6udvNVcH/5UbBz95MVlLzvKTFhmgWVDVodVa8A0vCqyU7KKxkrSDmaFF4Pr5qCd3dka87p71291X75Yde6/jddh170ePoSXQYdaPX0XH0PjqJhpGIP8ff45/xr9qX2o/a79qfTWkt3vY8iv6xev0veUsKEA==</latexit>

of each group of 2 ⇥ 2 neighboring patches, and applies


a linear layer on the 4C-dimensional concatenated features.
This reduces the number of tokens by a multiple of 2 ⇥ 2 = 4
Patch Partition
<latexit sha1_base64="eOr3tWwkS05gbH8JXHRm3rMDXMY=">AAACA3icbVBNS8NAEN3Ur1q/qt70slgETyWpoB6LXjxWsB/QhrLZTNqlm03Y3QglFLz4V7x4UMSrf8Kb/8ZNmoO2Dgw83pt5wzwv5kxp2/62Siura+sb5c3K1vbO7l51/6CjokRSaNOIR7LnEQWcCWhrpjn0Ygkk9Dh0vclNpncfQCoWiXs9jcENyUiwgFGiDTWsHg1yj1SCP2sRTce4RaTxycWaXbfzwsvAKUANFdUaVr8GfkSTEISmnCjVd+xYu2nmRznMKoNEQUzohIygb6AgISg3ze/P8KlhfBxE0rTQOGd/b6QkVGoaemYyJHqsFrWM/E/rJzq4clMm4kSDoPNDQcKxjnAWCPaZBKr51ABCpfmcYjomklBtYquYEJzFl5dBp1F3Lurnd41a87qIo4yO0Qk6Qw66RE10i1qojSh6RM/oFb1ZT9aL9W59zEdLVrFziP6U9fkDqoKYLQ==</latexit>

(2⇥ downsampling of resolution), and the output dimension


4848 ⇥ 44⇥3⇥3
== 44 ⇥
<latexit sha1_base64="Rdj73jrhAsDL8DJq7pSnuiujT0A=">AAACDnicbZDLSsNAFIYn9VbrLerSzWApuCqJBu1GKLpxWcFeoCllMj1th04mYWYilNAncOOruHGhiFvX7nwbp20Ebf1h4OM/53Dm/EHMmdKO82XlVlbX1jfym4Wt7Z3dPXv/oKGiRFKo04hHshUQBZwJqGumObRiCSQMODSD0fW03rwHqVgk7vQ4hk5IBoL1GSXaWF275FXwJfYT0QMZSEIh9bCvWQgKe5MfOuvaRafszISXwc2giDLVuvan34toEoLQlBOl2q4T605KpGaUw6TgJwpiQkdkAG2Dgpg1nXR2zgSXjNPD/UiaJzSeub8nUhIqNQ4D0xkSPVSLtan5X62d6H6lkzIRJxoEnS/qJxzrCE+zwT0mgWo+NkCoZOavmA6JSUWbBAsmBHfx5GVonJbd8/LZrVesXmVx5NEROkYnyEUXqIpuUA3VEUUP6Am9oFfr0Xq23qz3eWvOymYO0R9ZH9+XDJqU</latexit>

is set to 2C.
|| {z
{z}}
patch size
Linear Embedding
<latexit sha1_base64="VSg/mWhBglKFJV0V0anOltXF/2M=">AAACBHicbVDLSsNAFJ3UV62vqMtuBovgqiQV1GVRBBcuKtgHtKFMJrft0MkkzEyEErpw46+4caGIWz/CnX/jNM1CWw8MHM65rzl+zJnSjvNtFVZW19Y3ipulre2d3T17/6ClokRSaNKIR7LjEwWcCWhqpjl0Ygkk9Dm0/fHVzG8/gFQsEvd6EoMXkqFgA0aJNlLfLveyGamEYHprZhCJr0MfgoCJYd+uOFUnA14mbk4qKEejb3/1gogmIQhNOVGq6zqx9lIiNaMcpqVeoiAmdEyG0DVUkBCUl2YHTPGxUQI8iKR5QuNM/d2RklCpSeibypDokVr0ZuJ/XjfRgwsvZSJONAg6XzRIONYRniWCAyaBaj4xhFDJzK2YjogkVJvcSiYEd/HLy6RVq7pn1dO7WqV+mcdRRGV0hE6Qi85RHd2gBmoiih7RM3pFb9aT9WK9Wx/z0oKV9xyiP7A+fwAiVZhn</latexit>

Project to an arbitrary dimension C


<latexit sha1_base64="VVEPBTeV2iN5Aet/5FxQvOrSB4k=">AAACD3icbVC7TsMwFHXKq5RXgJHFogUxVUmRgLGiC2OR6ENqo+rGdVtTx4lsB6mK+gcs/AoLAwixsrLxNzhtBmg5kqWjc++xfY4fcaa043xbuZXVtfWN/GZha3tnd8/eP2iqMJaENkjIQ9n2QVHOBG1opjltR5JC4HPa8se1dN56oFKxUNzpSUS9AIaCDRgBbaSefVqX4T0lGusQg8AgfaYlyAnus4CK1IZLtVKhZxedsjMDXiZuRoooQ71nf3X7IYnNHZpwUKrjOpH2EpCaEU6nhW6saARkDEPaMVRAQJWXzPJM8YlR+ngQSnOExjP1tyOBQKlJ4JvNAPRILc5S8b9ZJ9aDKy9hIoo1FWT+0CDmafi0HBNami54mh6IZOavmIxAAtGmwrQEdzHyMmlWyu5F+fy2UqxeZ3Xk0RE6RmfIRZeoim5QHTUQQY/oGb2iN+vJerHerY/5as7KPIfoD6zPH0E/m4E=</latexit>

Swin Transformer Blocks


<latexit sha1_base64="Q5rRvg7p4GsRL31cVx//65Pofvc=">AAACC3icbVC7TsMwFHXKq5RXgJHFaoXEVCVFAsaqLIxF9CW1UeU4t61Vx4lsB1RF3Vn4FRYGEGLlB9j4G9y0A7QcydLROffax8ePOVPacb6t3Nr6xuZWfruws7u3f2AfHrVUlEgKTRrxSHZ8ooAzAU3NNIdOLIGEPoe2P76e+e17kIpFoqEnMXghGQo2YJRoI/XtYi+7I5UQTO8emMANSYQaRDIEiWs8omPVt0tO2cmAV4m7ICW0QL1vf/WCiCYhCE05UarrOrH2UiI1oxymhV6iICZ0TIbQNVSQEJSXZjmm+NQoATYBzBEaZ+rvjZSESk1C30yGRI/UsjcT//O6iR5ceSkTcaJB0PlDg4RjHeFZMThgEqjmE0MIlcxkxXREJKHa1FcwJbjLX14lrUrZvSif31ZK1dqijjw6QUV0hlx0iaroBtVRE1H0iJ7RK3qznqwX6936mI/mrMXOMfoD6/MHy9ubhQ==</latexit>

LN ! Layer Norm
<latexit sha1_base64="qVenOu9h8p8VuCuCY8gOkU1kURA=">AAACEHicbVA9SwNBEN3z2/h1ammzGESrcKeilkEbCxEFY4RcCHubSbK4d3vszqnhuJ9g41+xsVDE1tLOf+MmXqHGBwOP92aYmRcmUhj0vE9nbHxicmp6ZrY0N7+wuOQur1walWoONa6k0lchMyBFDDUUKOEq0cCiUEI9vD4a+PUb0Eao+AL7CTQj1o1FR3CGVmq5mwHCHWYnpzkNtOj2kGmtbmmhsj5oeqp0lLfcslfxhqCjxC9ImRQ4a7kfQVvxNIIYuWTGNHwvwWbGNAouIS8FqYGE8WvWhYalMYvANLPhQzndsEqbdpS2FSMdqj8nMhYZ049C2xkx7Jm/3kD8z2uk2DloZiJOUoSYfy/qpJKiooN0aFto4Cj7ljCuhb2V8h7TjKPNsGRD8P++PEoutyv+XmXnfLdcPSzimCFrZJ1sEZ/skyo5JmekRji5J4/kmbw4D86T8+q8fbeOOcXMKvkF5/0LsBSdow==</latexit>

<latexit sha1_base64="5yJwGPCyQ/6NPFcVzC299cuj3S0=">AAACI3icbVC7TgJBFJ31ifhCLW0mEhMsILuYqLECbWxIMMgjAUJmZ+/ChNlHZmY1ZMO/2PgrNhYaYmPhvzgLFAreZJKTc+49c++xQ86kMs0vY2V1bX1jM7WV3t7Z3dvPHBw2ZBAJCnUa8EC0bCKBMx/qiikOrVAA8WwOTXt4m+jNRxCSBf6DGoXQ9UjfZy6jRGmql7nuTD1iAc64yXwneMondg6uRFyx/ACIg2vA3XxZKfCTGZxr5iu18lm6l8maBXNaeBlYc5BF86r2MpOOE9DI0z6UEynblhmqbkyEYpTDON2JJISEDkkf2hr6xAPZjaf7jfFplGzlBkI/X+Ep+3siJp6UI8/WnR5RA7moJeR/WjtS7lU3Zn4Y6QPp7CM34lgFOAkMO0wAVXykAaGC6V0xHRBBqNKxJiFYiycvg0axYF0Uzu+L2dLNPI4UOkYnKIcsdIlK6A5VUR1R9Ixe0Tv6MF6MN2NifM5aV4z5zBH6U8b3D8EEo54=</latexit>

Window-based Multi-head Self-Attention (W-MSA)


Compute self-attention within local windows
<latexit sha1_base64="D/H7I7b0Avpenoroq5Z4TqtNAKU=">AAACFnicbVDLTgIxFO3gC/E16tJNIzFxA5nBRF0S2bjERB4JENIpHWjotJP2joQQvsKNv+LGhca4Ne78GzvAQsGTNDk9597b3hPEghvwvG8ns7a+sbmV3c7t7O7tH7iHR3WjEk1ZjSqhdDMghgkuWQ04CNaMNSNRIFgjGFZSv/HAtOFK3sM4Zp2I9CUPOSVgpa5bqKgoToBhOyIsEAAmUwOPOAy4xEJRIuxF9tTI4FzXzXtFbwa8SvwFyaMFql33q91TNInsVCqIMS3fi6EzIRo4FWyaayeGxYQOSZ+1LJUkYqYzma01xWdW6eFQaXsk4Jn6u2NCImPGUWArIwIDs+yl4n9eK4HwujPhMl1c0vlDYSIwKJxmhHtcMwpibAmhmtu/YjogmlCwSaYh+Msrr5J6qehfFi/uSvnyzSKOLDpBp+gc+egKldEtqqIaougRPaNX9OY8OS/Ou/MxL804i55j9AfO5w8aoJ9Q</latexit>

<latexit sha1_base64="Lm8WJp8vQJHVA9J512Mfl/5weqg=">AAACJnicbVDLSgNBEJz1GeMr6tHLYBD0kLAbQb0IiV68CJEYIyQhzM72miGzD2Z6lbDka7z4K148KCLe/BRnkxx8FQwUVd093eXGUmi07Q9rZnZufmExt5RfXlldWy9sbF7rKFEcmjySkbpxmQYpQmiiQAk3sQIWuBJa7uAs81t3oLSIwiscxtAN2G0ofMEZGqlXOOmMZ6QKvFGjL3wEj7ZE6EX39CKRKEp9YB5tgPRLNUQIsy6612iVLhq1/XyvULTL9hj0L3GmpEimqPcKLx0v4klgBnHJtG47dozdlCkUXMIo30k0xIwP2C20DQ1ZALqbjlcc0V2jeNSPlHkh0rH6vSNlgdbDwDWVAcO+/u1l4n9eO0H/uJuKME7MhXzykZ9IihHNMqOeUMBRDg1hXAmzK+V9phhHk2wWgvP75L/kulJ2DssHl5Vi9XQaR45skx2yRxxyRKrknNRJk3DyQJ7IC3m1Hq1n6816n5TOWNOeLfID1ucXDHOkyg==</latexit>

Shifted Window Multi-head Self-Attention (SW-MSA)

Liu, Ze, et al. "Swin transformer: Hierarchical vision transformer using shifted windows."
Proceedings of the IEEE/CVF International Conference on Computer Vision. 2021.
High-Performance Large-Scale Image
Recognition Without Normalization YouTube Playlist

ve Gradient Clipping for Efficient Large-Batch Training


L ! loss
<latexit sha1_base64="+ERs6xVxkv/cm1+mAyfhUVUVcu4=">AAACFXicbVDLSsNAFJ3UV62vqCtxM1gEVyUpoi6Lbly4qGAf0IYymUzaoZNMmLlRSyh+hl/gVr/Anbh17Qf4HyZtFrb1wIXDOfdy7z1uJLgGy/o2CkvLK6trxfXSxubW9o65u9fUMlaUNagUUrVdopngIWsAB8HakWIkcAVrucOrzG/dM6W5DO9gFDEnIP2Q+5wSSKWeeXCDu4r3B0CUkg+4C+wREiG1HvfMslWxJsCLxM5JGeWo98yfridpHLAQqCBad2wrAichCjgVbFzqxppFhA5Jn3VSGpKAaSeZvDDGx6niYV+qtELAE/XvREICrUeBm3YGBAZ63svE/7xODP6Fk/AwioGFdLrIjwUGibM8sMcVoyBGKSFU8fRWTAdEEQppajNbPJ2dluViz6ewSJrVin1Wqd6elmuXeUJFdIiO0Amy0TmqoWtURw1E0RN6Qa/ozXg23o0P43PaWjDymX00A+PrFx66oCQ=</latexit>

✓ ! model parameters
<latexit sha1_base64="pT0tMd5pZHmWcAwrae/w3UcY3pA=">AAACJnicbVDLSsNAFJ34tr6qLt0MFkFclEREXRbduFSwrdCEMpnctIOTTJi5UUvoP/gZfoFb/QJ3Iu7c+B9O2i58HRg4nPs4d06YSWHQdd+dqemZ2bn5hcXK0vLK6lp1faNlVK45NLmSSl+FzIAUKTRRoISrTANLQgnt8Pq0rLdvQBuh0kscZBAkrJeKWHCGVupW93zsAzLqa9HrI9Na3VIf4Q6LREUgacY0SwDthmG3WnPr7gj0L/EmpEYmOO9WP/1I8TyBFLlkxnQ8N8OgYBoFlzCs+LmBjPFr1oOOpak1MkEx+tOQ7lglorHS9qVIR+r3iYIlxgyS0HYmDPvmd60U/6t1coyPg0KkWY6Q8rFRnEuKipYB0Uho4CgHljCuhb2V8r5NgZch/HCJTHlamYv3O4W/pLVf9w7r+xcHtcbJJKEFskW2yS7xyBFpkDNyTpqEk3vySJ7Is/PgvDivztu4dcqZzGySH3A+vgBHT6eY</latexit>

@L
<latexit sha1_base64="ngytrpztVDA6Rs5LAn1zoyMEn3E=">AAACRXicbVBNa9tAEF2laZM6/XDaYy5LTCEnI4XS9hIIyaEt9OBCbAcsY0arkb1kpRW7oyRG6DflZ/QX9JBLeuqxt9Brs7IFbZw+WHi8mdl586JcSUu+f+2tPVp//GRj82lr69nzFy/b268GVhdGYF9opc1pBBaVzLBPkhSe5gYhjRQOo7Pjuj48R2Olzk5onuM4hWkmEymAnDRpf/7ID3iYGBBlmIMhCYp/qf7ykGZIUPHQyOmMwBh94TS8pHJqIJaYET9HQdpUk3bH7/oL8IckaEiHNehN2j/DWIsidX8IBdaOAj+ncVkvFgqrVlhYzEGcwRRHjmaQoh2Xi5Mr/sYpMU+0cc95WKj/TpSQWjtPI9eZAs3saq0W/1cbFZR8GJcyywvCTCwXJYXipHmdH4+lcfequSMgjHReuZiBy49cyve2xLa2VucSrKbwkAz2u8G77v7Xt53DoyahTbbDdtkeC9h7dsg+sR7rM8Gu2Hd2w35437xf3q33e9m65jUzr9k9eH/uAHPYtDQ=</latexit>

G= ! gradient vector
Normalizer-Free ResNets @✓ <latexit sha1_base64="fNwYbH5bi6o3ocUNOSK7o6C3c9Y=">AAACHnicbVDLTgIxFO3gC/E16tJNAzFxI5lhoS6JJsYVQSOPBAjpdC7Q0M5M2o4JTtj7GX6BW/0Cd8atfoD/YRlYCHiSJifn3Nt7cryIM6Ud59vKrKyurW9kN3Nb2zu7e/b+QV2FsaRQoyEPZdMjCjgLoKaZ5tCMJBDhcWh4w6uJ33gAqVgY3OtRBB1B+gHrMUq0kbp2vp3+kUjwx5VQCsLZI8jTawmA70BVQKuuXXCKTgq8TNwZKaAZql37p+2HNBYQaMqJUi3XiXQnIVIzymGca8cKIkKHpA8tQwMiQHWSNMcYHxvFx71QmhdonKp/NxIilBoJz0wKogdq0ZuI/3mtWPcuOgkLolhDQKeHejHHOsSTYrDPJFDNR4YQKpnJiumASEK1qW/uiq8m0camF3exhWVSLxXds2LptlQoX84ayqIjlEcnyEXnqIxuUBXVEEVP6AW9ojfr2Xq3PqzP6WjGmu0cojlYX78ytqPV</latexit>

h`+1 = h` + ↵f ` (h` / ` ) ! residual block


<latexit sha1_base64="K5B9PksITt09h76wyZPy/olI//s=">AAACVnicbVDLTtxAEJx1eAUSMMmRy4gVEhHSxl4hyCUSSi45EikLiPWyao/b69GOPdZMO2Rl+d/yGfAByTV8QRTb+BAeJY2muqpb3aowV9KS5932nBdLyyuray/XN1693txyt9+cWV0YgSOhlTYXIVhUMsMRSVJ4kRuENFR4Hs4/N/75dzRW6uwbLXKcpDDLZCwFUC1N3cvkqgxQqQO/4h95V1T8gAeg8gR4fNUI+0n7vQ9CJGjpOx4YOUsIjNHXPCD8QaVBK6MCFA+VFvNq6va9gdeCPyV+R/qsw+nU/RVEWhQpZiQUWDv2vZwmJRiSQmG1HhQWcxBzmOG4phmkaCdlm0HF92ol4rE29cuIt+r/EyWk1i7SsO5MgRL72GvE57xxQfGHSSmzvCDMxP2iuFCcNG8C5ZE0KEgtagLCyPpWLhIwIKiO/cGWyDanNbn4j1N4Ss6GA/9oMPx62D/51CW0xnbYLttnPjtmJ+wLO2UjJthP9pv9YXe9m95fZ9lZvW91et3MW/YAjvsPMhK2xw==</latexit>

h` ! input to the `-th residual block


<latexit sha1_base64="N0GYejFlQkwhnu2J6RKBqz1K+ZM=">AAACPXicbVC7TsNAEDzzfhOgpDkRkGiI7AgBJY+GMkgEkOIQnc+b+JSzz7pbA5GV7+Ez+AJaEB+QDtHSck5S8FpppdHsrHZnglQKg6775kxMTk3PzM7NLywuLa+sltbWr4zKNIc6V1Lpm4AZkCKBOgqUcJNqYHEg4TronhXz6zvQRqjkEnspNGPWSURbcIaWapVOotvcByn71NeiEyHTWt1TH+EBc5GkGVJUFCOg24Vqew8jqsGIMGOSBlLxbr9VKrsVd1j0L/DGoEzGVWuVBn6oeBZDglwyYxqem2IzZxoFl9Bf8DMDKeNd1oGGhQmLwTTzodU+3bFMSNtK206QDtnvGzmLjenFgVXGDCPze1aQ/80aGbaPmiPLkPDRoXYmh/ZtbjQUGjjKngWMa2F/pTximnG06f64EpritSIX73cKf8FVteIdVKoX++Xj03FCc2STbJFd4pFDckzOSY3UCSeP5Jm8kFfnyRk4787HSDrhjHc2yI9yPr8AydewQA==</latexit>

`
<latexit sha1_base64="lJa3mRo7ucEzuuMFhV9B3fx3EAw=">AAACSnicbVBNT9tAEF2Hr/DZQI9cVg1IXIhshApHBJeeUJAaQIpDNF6P4xXrXWt3TBtZ+Vf8DP4AvdL+gd5QL7VDDgX6pJGe3pvRzLwoV9KR7z96jbn5hcWl5vLK6tr6xofW5talM4UV2BNGGXsdgUMlNfZIksLr3CJkkcKr6Pas9q/u0Dpp9Fca5zjIYKRlIgVQJQ1b58lNGaJSEx5aOUoJrDXfeEj4ncqk0KLu4sJkeUEY82jMKUW+U0/s7FPKLToZF6B4ZEGLdDJstf2OPwV/T4IZabMZusPWrzA2oshQk1DgXD/wcxqUYEkKhZOVsHCYg7iFEfYrqiFDNyinf0/4bqXEPDG2Kk18qv47UULm3DiLqs4MKHVvvVr8n9cvKDkelFLXX2vxsigpFCfD6xB5LC0KUuOKgLCyupWLFCwIqqJ+tSV29Wl1LsHbFN6Ty4NO8LlzcHHYPjmdJdRk2+wT22MBO2In7Avrsh4T7J79YE/sp/fg/faevT8vrQ1vNvORvUJj/i8rgLTT</latexit>

! clipping threshold hyper-parameter f ! function computed by the `-th residual branch <latexit sha1_base64="/NSJwbu+vvsn+Poszn8YtcBZr9Q=">AAACOXicbVDLThsxFPXQ8myBtCzZWI2Qumk0EyFgg4RgwzKVGkDKRJHHc5Ox4rEt+w4QjfI1/Qy+gC2sWLJAQt32B+pJsigJR7J0dO7jXJ/ESOEwDJ+CpQ8fl1dW19Y3Pn3e3Nquffl64XRhObS5ltpeJcyBFAraKFDClbHA8kTCZTI8q+qX12Cd0OoXjgx0czZQoi84Qy/1asex9M0po7EVgwyZtfqGxgi3WHIpjBFqQDGz4DItU5r5DfaHYZblgGDHvVo9bIQT0EUSzUidzNDq1V7iVPMiB4VcMuc6UWiwWzKLgksYb8SFA8P4kA2g46nyPq5bTr45pnteSWlfW/8U0on6/0TJcudGeeI7c4aZm69V4nu1ToH9o24plCkQFJ8a9QtJUdMqM5oKCxzlyBPGrfC3Up75ELjP4K1L6qrTqlyi+RQWyUWzER00mj/36yens4TWyC75Rr6TiBySE3JOWqRNOPlN7skDeQzugufgNfgzbV0KZjM75A2Cv/8AQ5OvtQ==</latexit>

<latexit sha1_base64="ZLTUkgDj4Wl2unre2h9a0KzqURE=">AAACPXicbZC7SgNBFIZnvd+NWtoMBkGbsCuillEbywhGhSSEs7MncXB2dpk5G4xLnsfH8AlsFR8gndjaOhtTeDsw8POf63xhqqQl33/1Jianpmdm5+YXFpeWV1ZLa+uXNsmMwLpIVGKuQ7CopMY6SVJ4nRqEOFR4Fd6eFvmrHhorE31B/RRbMXS17EgB5Kx26bhJeEf5TgoGYiQj7zHilPAQeQ+MBC2Qu4EWTU/qLgfiUkuSoOT9aMLuoF0q+xV/FPyvCMaizMZRa5eGzSgRWYyahAJrG4GfUisHQ1IoHCw0M4spiFvoYsNJ7e6yrXz01QHfdk7EO4lxTxMfud87coit7cehq4yBbuzvXGH+l2tk1Dlq5VKnGaEWX4s6mSpYFNx4JA0KUn0nQBiHQHBx46AJcnR/bIlscVrBJfhN4a+43KsEB5W98/1y9WRMaI5tsi22wwJ2yKrsjNVYnQn2wJ7YM3vxHr2h9+a9f5VOeOOeDfYjvI9PsxSw0A==</latexit>

Adaptive Gradient Clipping for Efficient Large-Batch Training (parametrized to be variance preserving at initialization) <latexit sha1_base64="fprqNr76nncqOD2OtHAfuteJqLs=">AAACRHicbVC7TgMxEPTxDO8AJY1FhERDdJcCKAMRgoICJAJISRTt+fYSC5/Psn1I0Sm/xGfwBTQU0NHRIVqEc6TgNdVodmd3NKES3Fjff/QmJqemZ2ZLc/MLi0vLK+XVtUuTZpphk6Ui1dchGBRcYtNyK/BaaYQkFHgV3jRG86tb1Ian8sIOFHYS6EkecwbWSd3ySbu4kWuMhgcRKMtvkR5riDhKSxuCK8Vlj8appkexsxXyKege7hyCZX16oYFLt9ItV/yqX4D+JcGYVMgYZ93ySztKWZa4g0yAMa3AV7aTg7acCRzOtzODCtgN9LDlqIQETScv0g7pllOiIlWcukCF+t2RQ2LMIAndZgK2b37PRuJ/s1Zm4/1OzqXKLEr29SjOBLUpHdVHI66RWTFwBJjmLitlfdDArCv5x5fIjKINXS/B7xb+kstaNdit1s5rlfrhuKES2SCbZJsEZI/UyQk5I03CyB15IE/k2bv3Xr037/1rdcIbe9bJD3gfn2fcsvc=</latexit>

`
<latexit sha1_base64="yxH0arw6nAS3FaE+Kh6uG6dY5wI=">AAACL3icbVBLSgNBFOyJ//iLunTTGIQEJMyoqBtBdONSwSRCJoaezpuksedD9xsxDjmIx/AEbvUE4kbcuPAW9iRZaGJBQ1H1HvW6vFgKjbb9buWmpmdm5+YX8otLyyurhbX1mo4SxaHKIxmpa49pkCKEKgqUcB0rYIEnoe7dnmV+/Q6UFlF4hb0YmgHrhMIXnKGRWoU9F+Ee0xpT/ZJ/44KUpYdymR7TX/pDeYe6fqSYlDSbaBWKdsUegE4SZ0SKZISLVuHLbUc8CSBELpnWDceOsZkyhYJL6OfdREPM+C3rQMPQkAWgm+ngc326bZQ2NfHmhUgH6u+NlAVa9wLPTAYMu3rcy8T/vEaC/lEzFWGcIIR8GOQnkmJEs6ZoWyjgKHuGMK6EuZXyLlOMo+nzT0pbZ6f1TS/OeAuTpLZbcQ4qu5f7xZPTUUPzZJNskRJxyCE5IefkglQJJ4/kmbyQV+vJerM+rM/haM4a7WyQP7C+fwArB6lD</latexit>

` N ⇥M Var(f (z)) = Var(z), 8`


<latexit sha1_base64="mXhEHB7dBZbFsclZXOFvRv6C3C4=">AAACVnicbVDLTttAFJ2YUij0YcqSzVUDUjeNbFQBS9Ru2FBR1BDUOETjyXU8YjxjzVwXIsv/1s9oP6Dd0i+oGIcsCvRIIx2d+zpz0lJJR1H0sxMsPVl+urL6bG39+YuXr8KN12fOVFZgXxhl7HnKHSqpsU+SFJ6XFnmRKhyklx/b+uAbWieN/kKzEkcFn2qZScHJS+Pw6+AiQaUgkRqSglOepvVpc1F/goRkgQ6OG0isnObErTVXXsVrqq+wVcD3W3kNJgPKEbbbRdvvKAfFZ2ibcdiNetEc8JjEC9JlC5yMw1/JxIiqQE1CceeGcVTSqOaWpFDYrCWVw5KLSz7Foaeae3+jep5BAztemUBmrH+aYK7+O1HzwrlZkfrO9pvuYa0V/1cbVpQdjGqpy4pQi7tDWaWADLSBwkRaFKRmnnBhpfcKIueWC/Kx37syca21Npf4YQqPydluL97r7X5+3z38sEholW2xN+wti9k+O2RH7IT1mWDf2W92w/50fnT+BsvByl1r0FnMbLJ7CMJb16+3Ag==</latexit>

W 2R ! weight matrix of the `-th layer


↵ = 0.2 ! rate at which the variance of the activations increases after each residual b
<latexit sha1_base64="jZI8hWaqj5X7G2pG8OC208pQtFo=">AAACgXicbVFNa9tAEF0p/UjTL6e9NZehppBcjGRCGyiF0F56TKFOApYxo9XIWrxaid2RE1f42B/ZH9BT/0RXig9N0oGFt2/mzVveprVWjqPoVxDuPHj46PHuk72nz56/eDnYf3XuqsZKmshKV/YyRUdaGZqwYk2XtSUsU00X6fJL179YkXWqMt95XdOsxIVRuZLInpoPfiao6wLhE0SjMSRWLQpGa6srSJiuubXIBMhwVShZABcEK7QKjSSo8v6OktWq3+ZAGenNHTnAnMkCoRdZciprUEOqK7mEQ79NGcUKtfrR644288EwGkV9wX0Qb8FQbOtsPvidZJVsSjIsNTo3jaOaZy1aVlLTZi9pHNUol7igqYcGS3Kzto9rA+88k0FeWX8MQ8/+q2ixdG5dpn6yRC7c3V5H/q83bTg/mbXK1A2TkTdGeaOBK+iyh0xZkqzXHqC0PgIJskDrE/Q/dMslc93TulziuyncB+fjUfx+NP52PDz9vE1oVxyIt+JQxOKDOBVfxZmYCCn+BPvBm+Ag3AmPwigc34yGwVbzWtyq8ONflcDDMQ==</latexit>

` N ⇥M `
<latexit sha1_base64="vOqk3Wa1qb9FtMCmcQhPedsV/yg=">AAACU3icbVC5TsNAEN2YKxyBACXNioBEFdkIAWUEBTSggAhBipNovZ4kq6wP7Y4JkeVP4zMoqBEdfAEN65CCa6SRnt5cb54XS6HRtp8L1szs3PxCcXFpeaW0ulZe37jVUaI4NHgkI3XnMQ1ShNBAgRLuYgUs8CQ0veFpXm/eg9IiCm9wHEM7YP1Q9ARnaKhuuXnWcUFK6oqQugHDgeel11knvaQuigA0vcioq0R/gEypaGRYeMC0r5gvIEQ6EjigCnQMHClGdKc5WbeTdcsVu2pPgv4FzhRUyDTq3fKr60c8CcxWLpnWLceOsZ0yhYJLyJbcREPM+JD1oWVgyIy4djoxIKO7hvFpL1ImjaoJ+30iZYHW48AznfmP+nctJ/+rtRLsHbdTEcYJQsi/DvUSmb+au0l9oczncmwA40oYrZQPmGIcjec/rvg6l5b74vx24S+43a86h9X9q4NK7WTqUJFskW2yRxxyRGrknNRJg3DySF7IG3kvPBU+LMua/Wq1CtOZTfIjrNInhza18w==</latexit>

G 2 R ! gradient with respect to W


↵ = 0.2 ! rate at which the variance of the activations increases after q
<latexit sha1_base64="jZI8hWaqj5X7G2pG8OC208pQtFo=">AAACgXicbVFNa9tAEF0p/UjTL6e9NZehppBcjGRCGyiF0F56TKFOApYxo9XIWrxaid2RE1f42B/ZH9BT/0RXig9N0oGFt2/mzVveprVWjqPoVxDuPHj46PHuk72nz56/eDnYf3XuqsZKmshKV/YyRUdaGZqwYk2XtSUsU00X6fJL179YkXWqMt95XdOsxIVRuZLInpoPfiao6wLhE0SjMSRWLQpGa6srSJiuubXIBMhwVShZABcEK7QKjSSo8v6OktWq3+ZAGenNHTnAnMkCoRdZciprUEOqK7mEQ79NGcUKtfrR644288EwGkV9wX0Qb8FQbOtsPvidZJVsSjIsNTo3jaOaZy1aVlLTZi9pHNUol7igqYcGS3Kzto9rA+88k0FeWX8MQ8/+q2ixdG5dpn6yRC7c3V5H/q83bTg/mbXK1A2TkTdGeaOBK+iyh0xZkqzXHqC0PgIJskDrE/Q/dMslc93TulziuyncB+fjUfx+NP52PDz9vE1oVxyIt+JQxOKDOBVfxZmYCCn+BPvBm+Ag3AmPwigc34yGwVbzWtyq8ONflcDDMQ==</latexit>

each residual block (at initialization)


k · kF ! Frobenius norm
<latexit sha1_base64="SdJFntlMr5I9WB9LiNRFl8Mu1AQ=">AAACKXicbVDLSgNBEJz1GeMr6tHLYBC8GHZF1GNQCB4jGBWyIczOdpLB2ZllplcNm3yFn+EXeNUv8KZexf9wN+ag0YKGoqqb7q4glsKi6745U9Mzs3PzhYXi4tLyymppbf3C6sRwaHAttbkKmAUpFDRQoISr2ACLAgmXwfVJ7l/egLFCq3Psx9CKWFeJjuAMM6ld2vUHPg81+oN2jfpGdHvIjNG31Ee4w7RmdABKJJYqbaJhu1R2K+4I9C/xxqRMxqi3S59+qHkSgUIumbVNz42xlTKDgksYFv3EQsz4NetCM6OKRWBb6eitId3OlJB2tMlKIR2pPydSFlnbj4KsM2LYs5NeLv7nNRPsHLVSoeIEQfHvRZ1EUtQ0z4iGwgBH2c8I40Zkt1LeY4ZxzJL8tSW0+Wl5Lt5kCn/JxV7FO6jsne2Xq8fjhApkk2yRHeKRQ1Ilp6ROGoSTe/JInsiz8+C8OK/O+3frlDOe2SC/4Hx8AZRbqMM=</latexit>

<latexit sha1_base64="HYBzBR1ZSiyDh1oIjRoRUFyxKA0=">AAACInicbVDLSgNBEJz1GeNr1aOXwaDES9gNol6EoBePEcwDsjHMTjrJkNmHM71iWPIHfoZf4FW/wJt4Erz6H+4mOZjEgoaiqpvuLjeUQqNlfRkLi0vLK6uZtez6xubWtrmzW9VBpDhUeCADVXeZBil8qKBACfVQAfNcCTW3f5X6tQdQWgT+LQ5CaHqs64uO4AwTqWUeOS4gu3NASnpBHX2vMHYQHjGuMjXM90bO8bBl5qyCNQKdJ/aE5MgE5Zb547QDHnngI5dM64ZthdiMmULBJQyzTqQhZLzPutBIqM880M149M+QHiZKm3YClZSPdKT+nYiZp/XAc5NOj2FPz3qp+J/XiLBz3oyFH0YIPh8v6kSSYkDTcGhbKOAoBwlhXInkVsp7TDGOSYRTW9o6PS3NxZ5NYZ5UiwX7tFC8OcmVLicJZcg+OSB5YpMzUiLXpEwqhJMn8kJeyZvxbLwbH8bnuHXBmMzskSkY37+sY6UX</latexit>

` `)
N M = Var(h
<latexit sha1_base64="46ZLVMZz8uetNKKu94iZ8ymqo3Y=">AAACNXicbVDLTgIxFO3gC/GFunTTSExwQ2aI8bEgIZoYNxpM5JEwMOmUAoXOI23HhAzzLX6GX+BW1y7cqVt/wQ7MQsCTNDn3nHtzb4/tMyqkrr9rqaXlldW19HpmY3Nreye7u1cTXsAxqWKPebxhI0EYdUlVUslIw+cEOTYjdXt4Ffv1R8IF9dwHOfJJy0E9l3YpRlJJVvbCHNfbJmHMHFvXEJagKQLHCmnJiNp3STGIi1uYnzYqcxAdt4tWNqcX9AngIjESkgMJKlb2y+x4OHCIKzFDQjQN3ZetEHFJMSNRxgwE8REeoh5pKuoih4hWOPliBI+U0oFdj6vnSjhR/06EyBFi5Niq00GyL+a9WPzPawaye94KqesHkrh4uqgbMCg9GOcFO5QTLNlIEYQ5VbdC3EccYalSndnSEfFpkcrFmE9hkdSKBeO0ULw/yZUvk4TS4AAcgjwwwBkogxtQAVWAwRN4Aa/gTXvWPrRP7XvamtKSmX0wA+3nF0Voq0g=</latexit>

XX
kW ` kF = (Wij` )2 Scaled Weight Standardization
<latexit sha1_base64="zWDDqNQtVZoLepELN3K9shdHX6I=">AAACK3icbVDLSsNAFJ34rO+qSzeDRXBVExF1WXTjUtFaoS3lZnLbDk4mYeZGrKGf4Wf4BW71C1wpbvsfJmkXvg4MHM65d+7h+LGSllz33Zmanpmdmy8tLC4tr6yuldc3rm2UGIF1EanI3PhgUUmNdZKk8CY2CKGvsOHfnuZ+4w6NlZG+okGM7RB6WnalAMqkTnmvRXhPaav4KTUYDC8FKAx4A2WvT/ySQAdgAvlQLAw75YpbdQvwv8SbkAqb4LxTHrWCSCQhahIKrG16bkztFAxJoXC42EosxiBuoYfNjGoI0bbTIs6Q72RKwLuRyZ4mXqjfN1IIrR2EfjYZAvXtby8X//OaCXWP26nUcUKoxfhQN1GcIp63xANpUJAaZASEkVlWLvpgQFDW5Y8rgc2j5b14v1v4S673q95hdf/ioFI7mTRUYltsm+0yjx2xGjtj56zOBHtkz+yFvTpPzpvz4XyOR6ecyc4m+wFn9AWz2Knf</latexit>

i=1 j=1
kW ` kF
<latexit sha1_base64="vqkkzwhegHfxZ07Xbr6B/wKh4sU=">AAACjXicbVHbbtNAEF2bWymXpvCIkEYkXJ4iuyqFB4QqkCiPRSJNpThE6/XYXnVv2l03RG4+gs/jA/gGXlknkaAtI6105pwdndGZ3AjufJL8jOIbN2/dvrN1d/ve/QcPd3q7j06cbizDEdNC29OcOhRc4chzL/DUWKQyFzjOzz52+vgcreNaffULg1NJK8VLzqgP1Kz3I/P43beDrLSUtdnF+FuGQmQXs0/L0B397QZgrD7nBTqg4Lg0AkEidY1F0CXUeg6yYfVKVFUQK0sLjsqD82hgzoUAVlNVIfg6jFhecUUFzJFXtXcwWDsPlrNePxkmq4LrIN2APtnU8az3Kys0a2TwYoI6N0kT46cttZ4zgcvtrHFoKDujFU4CVFSim7ar6JbwPDAFlNqGF3Zdsf9OtFQ6t5B5+Cmpr91VrSP/p00aX76dtlyZxqNia6OyEeA1dHeAgltkXiwCoMzysGsXTziCD9e65FK4brUul/RqCtfByd4wPRjufdnvH37YJLRFnpBn5BVJyRtySD6TYzIijPyOnkYvopfxTvw6fhe/X3+No83MY3Kp4qM/fGjIuw==</latexit>

`
provides a simple measure of how much a single gradient step will change the original weights W
kG` kF
ch a single gradient step will change the original weights W `
W ` = hG` , h ! learning rate
<latexit sha1_base64="IgEpVWMhEaDfVt0jE8BIBeLSHkU=">AAACOXicbZDPbtNAEMbXpdCQUjBw5LIiQuqBRnZUUS6RIorUHlOJ/JHiEI03k3jV9draHQORlafpY/AEXOHUYw+VKq68QO0khybhk1b66ZsZzewXpkpa8rxrZ+fR7uMne5Wn1f1nB89fuC9fdW2SGYEdkajE9EOwqKTGDklS2E8NQhwq7IWXp2W99w2NlYn+QrMUhzFMtZxIAVRYI7cZfEZFwHtfA1SKN/lRxM8W/J5HPDByGhEYk3znAeEPyhWC0VJPuQHC+citeXVvIb4N/gpqbKX2yL0NxonIYtQkFFg78L2UhjkYkkLhvBpkFlMQlzDFQYEaYrTDfPHNOX9XOGM+SUzxNPGF+3Aih9jaWRwWnTFQZDdrpfm/2iCjycdhLnWaEWqxXDTJFKeEl5nxsTQoSM0KAGFkcSsXERgQVCS7tmVsy9PKXPzNFLah26j7H+qNi+Na69MqoQp7w96yQ+azE9Zi56zNOkywK/aL/WZ/nJ/OjXPn/F227jirmddsTc6/e0tCrXM=</latexit>

k W ` kF kG` kF
<latexit sha1_base64="R8LoJtbk/OiM9KDGPtkyQlv8hb4=">AAACSnicdVDLSgMxFM3UqrW+qi7dBIvgqswUUTdCUVFXUsE+oFNLJr1jQzMPkoxQpv0rP8Mf0K36A+7EjZm2Qh96IHDuuedyb44TciaVab4YqYX04tJyZiW7ura+sZnb2q7KIBIUKjTggag7RAJnPlQUUxzqoQDiORxqTvc86dceQUgW+HeqF0LTIw8+cxklSkut3I3tCkJju29fAFcE1+5t4Nzuty4HWpyo8Cnu4F/z1X+uVi5vFswh8DyxxiSPxii3ch92O6CRB76inEjZsMxQNWMiFKMcBlk7khAS2iUP0NDUJx7IZjz89wDva6WN3UDo5ys8VCcnYuJJ2fMc7fSI6sjZXiL+1WtEyj1pxswPIwU+HS1yI45VgJMQcZsJoIr3NCFUMH0rph2is1E66qktbZmcluRizaYwT6rFgnVUKN4e5ktn44QyaBftoQNkoWNUQteojCqIoif0it7Qu/FsfBpfxvfImjLGMztoCqn0DypstOM=</latexit>

The activation functions are also scaled by a non-linearity specific scalar gain .
<latexit sha1_base64="PbZjPJYFy8A/AdG9MuQ5Zu1VsgA=">AAACXXicbZA9bxNBEIbXB4EQQmKgoKAZ4SDRYN25AMoIGsogxUkk27Lm9ubsVfbjtDsXcTr59/EbqKgoaaFl7+KCJIy00qv3nY/Vk1daBU7T74Pk3v2dBw93H+093n9ycDh8+uwsuNpLmkqnnb/IMZBWlqasWNNF5QlNruk8v/zU5edX5INy9pSbihYGV1aVSiJHaznEOdNXbk/XBChZXfU2lLWVnQiAPgY6OAgSNRWQN4BgnX3bHUSvuIFQkew29i3oYYXKwtF8hcbg0XizHI7ScdoX3BXZVozEtk6Ww5/zwsnakGWpMYRZlla8aNGzkpo2e/M6UIXyElc0i9KiobBoexQbeB2dAkrn47MMvfvvRIsmhMbksdMgr8PtrDP/l81qLj8sWmWrmsnK60NlrYEddFyhUJ4k6yYKlJFL5CHX6CPUSP/GlSJ0X+u4ZLcp3BVnk3H2bjz5Mhkdf9wS2hUvxSvxRmTivTgWn8WJmAopvolf4rf4M/iR7CT7ycF1azLYzjwXNyp58RetRbi4</latexit>

`
=h `
kW kF kW kF The activation functions are also scaled by a non-linearity specific scalar gain .
<latexit sha1_base64="PbZjPJYFy8A/AdG9MuQ5Zu1VsgA=">AAACXXicbZA9bxNBEIbXB4EQQmKgoKAZ4SDRYN25AMoIGsogxUkk27Lm9ubsVfbjtDsXcTr59/EbqKgoaaFl7+KCJIy00qv3nY/Vk1daBU7T74Pk3v2dBw93H+093n9ycDh8+uwsuNpLmkqnnb/IMZBWlqasWNNF5QlNruk8v/zU5edX5INy9pSbihYGV1aVSiJHaznEOdNXbk/XBChZXfU2lLWVnQiAPgY6OAgSNRWQN4BgnX3bHUSvuIFQkew29i3oYYXKwtF8hcbg0XizHI7ScdoX3BXZVozEtk6Ww5/zwsnakGWpMYRZlla8aNGzkpo2e/M6UIXyElc0i9KiobBoexQbeB2dAkrn47MMvfvvRIsmhMbksdMgr8PtrDP/l81qLj8sWmWrmsnK60NlrYEddFyhUJ4k6yYKlJFL5CHX6CPUSP/GlSJ0X+u4ZLcp3BVnk3H2bjz5Mhkdf9wS2hUvxSvxRmTivTgWn8WJmAopvolf4rf4M/iR7CT7ycF1azLYzjwXNyp58RetRbi4</latexit>

k W ` kF
<latexit sha1_base64="KRwVF8Y/3X+FYcLpCyoTmh/iWtg=">AAACXnicbZA9TxtBEIbXRyCEj+CEBinNJCZSisi6c5GkRCFCpCNSjJF8xppbz9kr9vZOu3MB6/D/y1+go6GlTcrsGRd8ZKSV3nlnRjP7JIVWjsPwqhEsPVteeb76Ym19Y/PlVvPV62OXl1ZSV+Y6tycJOtLKUJcVazopLGGWaOolZ/t1vfeLrFO5+cnTggYZjo1KlUT21rCZfE9hN04tyiq+jL+RZoTeaUxax5fDg5k372W7oBxotGP6COcEdFGQZOAJAVtURpkxcA4JyTwjKI1j9Ge8HTZbYTucBzwV0UK0xCKOhs2beJTLMiPDUqNz/SgseFChZSU1zdbi0lGB8gzH1PfSYEZuUM1ZzOC9d0aQ5tY/wzB3709UmDk3zRLfmSFP3ONabf6v1i85/TKolClKJiPvFqWlrn9cg4WRsp6GnnqB0ip/K8gJerDs8T/YMnL1aTPPJXpM4ak47rSjT+3Oj05r7+uC0Kp4I96JDyISn8WeOBRHoiuk+C1uxR/xt3EdrASbwdZda9BYzGyLBxHs/AN5V7ii</latexit>

If kW ` kF
is large, we expect the training to become unstable!
Ensures that the combination of the -scaled activation function and a Scaled Weight
<latexit sha1_base64="VcpJYHDTmvzaJ7oL4wnpNBW9op4=">AAACinicbZHPbtNAEMbX5l8JBQLc4LIiQeJCZOcArXqpCkgci0qaSkkUjddjZ9X9Y+2OI1Irr9D34wF4Ax6AjZMDbRlpV59+84129G1WKekpSX5F8b37Dx4+2nvcebL/9Nnz7ouX597WTuBIWGXdRQYelTQ4IkkKLyqHoDOF4+zy86Y/XqLz0poftKpwpqE0spACKKB593pK+JOar8bXDj2nBVC4kAurM2laE7dFi/rTErSG/gcvQGHOQZBcbh1FbUQrwATOz7aGMcpyQfyMAgWXy6vAFKzQcen5EpwEI5CHdT26pTTlYN2Zd3vJIGmL3xXpTvTYrk7n3d/T3IpaoyGhwPtJmlQ0a8CRFArXnWntsQJxCSVOgjSg0c+aNrc1fxdIzgvrwjHEW/rvRAPa+5XOglMDLfzt3gb+rzepqTiYNdJUNaER24eKWnGyfPMJPJcOBalVECCcDLtysQAXEg1fdeOV3G9WW4dc0tsp3BXnw0H6cTD8Puwdn+wS2mNv2Fv2nqXsEztm39gpGzHB/kSvo17Uj/fjYXwYH22tcbSbecVuVPzlL9akxyY=</latexit>

`
<latexit sha1_base64="Kkzynynl4KXpAAKwpJN5qghEbfo=">AAACMHicbVBLTsMwFHT4/ymwZGPRIrGhSioELBEsYAkSLUhNqRz3pbVw4sh+AaqoF+EYnIAtnABWiAUbToHTdkEpI1kazbynN54gkcKg6747E5NT0zOzc/MLi0vLK6uFtfWaUanmUOVKKn0dMANSxFBFgRKuEw0sCiRcBbcnuX91B9oIFV9iN4FGxNqxCAVnaKVmYe+0KW58kJL6WrQ7yLRW99RHeMCsJEq72KG5oEIaMdTigZZOS71moeiW3T7oOPGGpEiGOG8WvvyW4mkEMXLJjKl7boKNjGkUXEJvwU8NJIzfsjbULY1ZBKaR9X/Xo9tWadFQaftipH3190bGImO6UWAnbcaO+evl4n9ePcXwsJGJOEkRYj44FKaSoqJ5VbQlNHCUXUsY18JmpbzDNONoCx250jJ5tLwX728L46RWKXv75crFXvHoeNjQHNkkW2SHeOSAHJEzck6qhJNH8kxeyKvz5Lw5H87nYHTCGe5skBE43z8S7qm2</latexit>

<latexit sha1_base64="YizidxFdsbfT/jVy2d1FiOCsByE=">AAACHXicbVDLTgJBEJzFF+IL9ehlFEzwINnloB6JXjxiIo8ECJkdemHC7OxmpteEEM5+hl/gVb/Am/Fq/AD/w+FxELSSTipV3enu8mMpDLrul5NaWV1b30hvZra2d3b3svsHNRMlmkOVRzLSDZ8ZkEJBFQVKaMQaWOhLqPuDm4lffwBtRKTucRhDO2Q9JQLBGVqpkz0uJEogzYs8jQKKfaD5FkiZP8c+lWwI+qyTzblFdwr6l3hzkiNzVDrZ71Y34kkICrlkxjQ9N8b2iGkUXMI400oMxIwPWA+alioWgmmPpq+M6alVujSItC2FdKr+nhix0Jhh6NvOkGHfLHsT8T+vmWBw1R4JFScIis8WBYmkGNFJLrQrNHCUQ0sY18LeSnmfacbRprewpWsmp41tLt5yCn9JrVT0Loqlu1KufD1PKE2OyAkpEI9ckjK5JRVSJZw8kmfyQl6dJ+fNeXc+Zq0pZz5zSBbgfP4AM3ag5g==</latexit>

G ! i-th
suresithat the row
combination (unit
of matrixofGthe i of the
-scaled `-th layer)
activation function and a Scaled Weight Standardized layer is variance preserving.
<latexit sha1_base64="lwHYEf7JSy8GrHdeI0RIjnxvDsE=">AAACG3icbVC7TgJBFJ3FF+ILtbRwIjGxIrsUakm0scREHgkQMjtcYMLs7GbmLpFstvQz/AJb/QI7Y2vhB/gfDo9CwJNMcnLOPbl3jh9JYdB1v53M2vrG5lZ2O7ezu7d/kD88qpkw1hyqPJShbvjMgBQKqihQQiPSwAJfQt0f3k78+gi0EaF6wHEE7YD1legJztBKnfxpC+ERE2HoiGnBFAdq8wb0SKh+Mc118gW36E5BV4k3JwUyR6WT/2l1Qx4HoJBLZkzTcyNsJ0yj4BLSXCs2EDE+ZH1oWqpYAKadTD+S0nOrdGkv1PYppFP1byJhgTHjwLeTAcOBWfYm4n9eM8bedTsRKooRFJ8t6sWSYkgnrdCu0MBRji1hXAt7K+UDphlH293Clq6ZnJbaXrzlFlZJrVT0Loul+1KhfDNvKEtOyBm5IB65ImVyRyqkSjh5Ii/klbw5z8678+F8zkYzzjxzTBbgfP0CXB2iTg==</latexit>

Brock, Andrew, et al. "High-performance large-scale image recognition without normalization." arXiv preprint arXiv:2102.06171 (2021).
A ConvNet for the 2020s

Liu, Zhuang, et al. "A ConvNet for the 2020s." arXiv preprint arXiv:2201.03545 (2022).
Questions?
YouTube Playlist

You might also like