
MeshNet: Mesh Neural Network for 3D Shape Representation

Yutong Feng,1 Yifan Feng,2 Haoxuan You,1 Xibin Zhao1∗ , Yue Gao1∗
1
BNRist, KLISS, School of Software, Tsinghua University, China.
2
School of Information Science and Engineering, Xiamen University
{feng-yt15, zxb, gaoyue}@tsinghua.edu.cn, {evanfeng97, haoxuanyou}@gmail.com
∗Corresponding authors.
Copyright © 2019, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

arXiv:1811.11424v1 [cs.CV] 28 Nov 2018

Abstract

Mesh is an important and powerful type of data for 3D shapes and is widely studied in the fields of computer vision and computer graphics. Regarding the task of 3D shape representation, there have been extensive research efforts concentrating on how to represent 3D shapes well using volumetric grid, multi-view and point cloud. However, there has been little effort on using mesh data in recent years, due to the complexity and irregularity of mesh data. In this paper, we propose a mesh neural network, named MeshNet, to learn 3D shape representation from mesh data. In this method, the face-unit and feature splitting are introduced, and a general architecture with available and effective blocks is proposed. In this way, MeshNet is able to solve the complexity and irregularity problems of mesh and conduct 3D shape representation well. We have applied the proposed MeshNet method to the applications of 3D shape classification and retrieval. Experimental results and comparisons with the state-of-the-art methods demonstrate that the proposed MeshNet can achieve satisfying 3D shape classification and retrieval performance, which indicates the effectiveness of the proposed method for 3D shape representation.

[Figure 1: a plot of classification accuracy (Y-axis, roughly 60–92%) against publication year (X-axis, 2003–2018) for methods including SPH, LFD, ShapeNet, FPNN, MVCNN, PointNet, Pairwise, Kd-Net, PointNet++, GVCNN and SO-Net, grouped by data type (mesh, view, volume, point cloud).]

Figure 1: The developing history of 3D shape representation using different types of data. The X-axis indicates the proposed time of each method, and the Y-axis indicates the classification accuracy.
Introduction

Three-dimensional (3D) shape representation is one of the most fundamental topics in the fields of computer vision and computer graphics. In recent years, with the increasing applications of 3D shapes, extensive efforts (Wu et al. 2015; Chang et al. 2015) have been concentrated on 3D shape representation, and the proposed methods have been successfully applied to different tasks, such as classification and retrieval.

For 3D shapes, there are several popular types of data, including volumetric grid, multi-view, point cloud and mesh. With the success of deep learning methods in computer vision, many neural network methods have been introduced to conduct 3D shape representation using volumetric grid (Wu et al. 2015; Maturana and Scherer 2015), multi-view (Su et al. 2015) and point cloud (Qi et al. 2017a). PointNet (Qi et al. 2017a) proposes to learn on point cloud directly and solves the disorder problem with a per-point Multi-Layer Perceptron (MLP) and a symmetry function. As shown in Figure 1, although there have been recent successful methods using volumetric grid, multi-view and point cloud, for mesh data there are only early methods using handcrafted features directly, such as the Spherical Harmonic descriptor (SPH) (Kazhdan, Funkhouser, and Rusinkiewicz 2003), which limits the applications of mesh data.

Mesh data of 3D shapes is a collection of vertices, edges and faces, and is dominantly used in computer graphics for rendering and storing 3D models. Mesh data has the properties of complexity and irregularity. The complexity problem is that a mesh consists of multiple elements, and different types of connections may be defined among them. The irregularity is another challenge for mesh data processing: the number of elements in a mesh may vary dramatically among 3D shapes, and their permutations are arbitrary. In spite of these problems, mesh has a stronger ability for 3D shape description than other types of data. Under such circumstances, how to effectively represent 3D shapes using mesh data is an urgent and challenging task.

In this paper, we present a mesh neural network, named MeshNet, that learns on mesh data directly for 3D shape representation. To deal with the challenges in mesh data processing, the faces are regarded as the unit, and connections between faces sharing common edges are defined, which enables us to solve the complexity and irregularity problem with per-face processes and a symmetry function. Moreover, the feature of faces is split into spatial and structural features. Based on these ideas, we design the network architecture with two blocks, named spatial and structural descriptors, for learning the initial features, and a mesh convolution block for aggregating neighboring features. In this way, the proposed method is able to solve the complexity and irregularity problems of mesh and represent 3D shapes well.
[Figure 2: the MeshNet pipeline. Per-face inputs (center n×3, corner n×9, normal n×3 and neighbor index) are fed into the spatial descriptor (MLP (64, 64), output n×64) and the structural descriptor (face rotate convolution and face kernel correlation, output n×131), followed by two mesh convolution blocks (outputs n×256 and n×512), an MLP (1024) and a global pooling that produce the 1024-dimensional global feature, and an MLP (512, 256, k) that outputs the scores.]

Figure 2: The architecture of MeshNet. The input is a list of faces with initial values, which are fed into the spatial and structural descriptors to generate initial spatial and structural features. The features are then aggregated with neighboring information in the mesh convolution blocks labeled as “Mesh Conv”, and fed into a pooling function to output the global feature used for further tasks. Multi-layer perceptrons are labeled as “mlp”, with the numbers in parentheses indicating the dimensions of the hidden layers and the output layer.

We apply our MeshNet method to the tasks of 3D shape classification and retrieval on the ModelNet40 (Wu et al. 2015) dataset. The experimental results show that MeshNet achieves significant improvement for 3D shape classification and retrieval using mesh data, and comparable performance with recent methods using other types of 3D data.

The key contributions of our work are as follows:

• We propose a neural network using mesh for 3D shape representation and design blocks for capturing and aggregating features of polygon faces in 3D shapes.

• We conduct extensive experiments to evaluate the performance of the proposed method, and the experimental results show that the proposed method performs well on the 3D shape classification and retrieval tasks.

Related Work

Mesh Feature Extraction

There are plenty of handcrafted descriptors that extract features from mesh. Lien and Kajiya calculate moments of each tetrahedron in a mesh (Lien and Kajiya 1984), and Zhang and Chen develop more functions applied to each triangle and add all the resulting values as features (Zhang and Chen 2001). Hubeli and Gross extend the features of surfaces to a multiresolution setting to solve the unstructured problem of mesh data (Hubeli and Gross 2001). In SPH (Kazhdan, Funkhouser, and Rusinkiewicz 2003), a rotation invariant representation is presented based on existing orientation-dependent descriptors. Mesh difference of Gaussians (DOG) (Zaharescu et al. 2009) introduces Gaussian filtering to shape functions. The intrinsic shape context (ISC) descriptor (Kokkinos et al. 2012) develops a generalization to surfaces and solves the problem of orientational ambiguity.

Deep Learning Methods for 3D Shape Representation

With the construction of large-scale 3D model datasets, numerous deep descriptors of 3D shapes have been proposed. Based on the type of data they consume, these methods can be categorized into four types.

Voxel-based method. 3DShapeNets (Wu et al. 2015) and VoxNet (Maturana and Scherer 2015) propose to learn on volumetric grids, which partition the space into regular cubes. However, they introduce extra computation cost due to the sparsity of data, which restricts them from being applied to more complex data. Field probing neural networks (FPNN) (Li et al. 2016), Vote3D (Wang and Posner 2015) and Octree-based convolutional neural networks (OCNN) (Wang et al. 2017) address the sparsity problem, while they are still restricted as the input gets larger.
View-based method. Using 2D images of 3D shapes to represent them is proposed by Multi-view convolutional neural networks (MVCNN) (Su et al. 2015), which aggregate 2D views from a loop around the object and apply a 2D deep learning framework to them. Group-view convolutional neural networks (GVCNN) (Feng et al. 2018) propose a hierarchical framework, which divides views into different groups with different weights to generate a more discriminative descriptor for a 3D shape. This type of method also adds considerable computation cost and is hard to apply to tasks in larger scenes.

Point-based method. Due to the irregularity of the data, point cloud is not suitable for the previous frameworks. PointNet (Qi et al. 2017a) solves this problem with per-point processes and a symmetry function, while it ignores the local information of points. PointNet++ (Qi et al. 2017b) adds aggregation with neighbors to solve this problem. Self-organizing network (SO-Net) (Li, Chen, and Lee 2018), kernel correlation network (KCNet) (Shen et al.) and PointSIFT (Jiang, Wu, and Lu 2018) develop more detailed approaches for capturing local structures with nearest neighbors. Kd-Net (Klokov and Lempitsky 2017) proposes another approach to solve the irregularity problem using a k-d tree.

Fusion method. These methods learn on multiple types of data and fuse their features together. FusionNet (Hegde and Zadeh 2016) uses the volumetric grid and multi-view for classification. Point-view network (PVNet) (You et al. 2018) proposes the embedding attention fusion to exploit both point cloud data and multi-view data.
Method

In this section, we present the design of MeshNet. Firstly, we analyze the properties of mesh, propose the methods for designing the network and reorganize the input data. We then introduce the overall architecture of MeshNet and some blocks for capturing features of faces and aggregating them with neighbor information, which are then discussed in detail.

Overall Design of MeshNet

We first introduce the mesh data and analyze its properties. Mesh data of 3D shapes is a collection of vertices, edges and faces, in which vertices are connected with edges and closed sets of edges form faces. In this paper, we only consider triangular faces. Mesh data is dominantly used for storing and rendering 3D models in computer graphics, because it provides an approximation of the smooth surfaces of objects and simplifies the rendering process. Numerous studies on 3D shapes in the fields of computer graphics and geometric modeling are based on mesh.

Mesh data shows a stronger ability to describe 3D shapes compared with other popular types of data. Volumetric grid and multi-view are data types defined to avoid the irregularity of native data such as mesh and point cloud, while they lose some natural information of the original object. For point cloud, there may be ambiguity caused by random sampling, and the ambiguity is more obvious with fewer points. In contrast, mesh is clearer and loses less natural information. Besides, when capturing local structures, most methods based on point cloud collect the nearest neighbors to approximately construct an adjacency matrix for further processing, while in mesh there are explicit connection relationships that show the local structure clearly. However, mesh data is also more irregular and complex, owing to its multiple types of elements and their varying numbers.

To get full use of the advantages of mesh and solve the problem of its irregularity and complexity, we propose two key design ideas:

• Regard face as the unit. Mesh data consists of multiple elements, and connections may be defined among them. To simplify the data organization, we regard the face as the only unit and define a connection between two faces if they share a common edge. There are several advantages to this simplification. First, one triangular face can connect with no more than three faces, which makes the connection relationship regular and easy to use. More importantly, we can solve the disorder problem with per-face processes and a symmetry function, similar to PointNet (Qi et al. 2017a). And intuitively, a face also contains more information than a vertex or an edge.

• Split feature of face. Though the above simplification enables us to consume mesh data similarly to point-based methods, there are still some differences between the point-unit and the face-unit, because a face contains more information than a point. We only need to know “where you are” for a point, while we also want to know “what you look like” for a face. Correspondingly, we split the feature of faces into a spatial feature and a structural feature, which helps us to capture features more explicitly.

Following the above ideas, we transform the mesh data into a list of faces. For each face, we define its initial values, which are divided into two parts (illustrated in Fig 3):

• Face Information:
  – Center: coordinate of the center point.
  – Corner: vectors from the center point to the three vertices.
  – Normal: the unit normal vector.
• Neighbor Information:
  – Neighbor Index: indexes of the connected faces (filled with the index of the face itself if it connects with fewer than three faces).

Figure 3: Initial values of each face. There are four types of initial values, divided into two parts: center, corner and normal are the face information, and neighbor index is the neighbor information.
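To make the face-unit reorganization concrete, the following is a minimal NumPy sketch (our own illustration; the function name and padding strategy are assumptions, not taken from the paper's released code) that derives the four initial values from a triangular mesh given as a (V, 3) vertex array and an (F, 3) array of vertex indices.

```python
import numpy as np

def build_face_inputs(vertices, faces):
    """vertices: (V, 3) float array; faces: (F, 3) int array of vertex indices.
    Returns center (F, 3), corner (F, 9), normal (F, 3), neighbor_index (F, 3)."""
    n_faces = len(faces)
    tri = vertices[faces]                                     # (F, 3, 3): the three vertices of each face
    center = tri.mean(axis=1)                                 # coordinate of the center point
    corner = (tri - center[:, None, :]).reshape(n_faces, 9)   # vectors from center to the three vertices
    cross = np.cross(tri[:, 1] - tri[:, 0], tri[:, 2] - tri[:, 0])
    normal = cross / (np.linalg.norm(cross, axis=1, keepdims=True) + 1e-12)  # unit normal

    # Faces sharing a common edge are connected; pad with the face's own index
    # when it has fewer than three neighbors.
    edge_to_faces = {}
    for f, (a, b, c) in enumerate(faces):
        for edge in ((a, b), (b, c), (c, a)):
            edge_to_faces.setdefault(frozenset(edge), []).append(f)
    neighbor_index = np.tile(np.arange(n_faces)[:, None], (1, 3))
    for f, (a, b, c) in enumerate(faces):
        adjacent = []
        for edge in ((a, b), (b, c), (c, a)):
            adjacent += [g for g in edge_to_faces[frozenset(edge)] if g != f]
        adjacent = adjacent[:3]
        neighbor_index[f, :len(adjacent)] = adjacent
    return center, corner, normal, neighbor_index
```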
At the end of this section, we present the overall architecture of MeshNet, illustrated in Fig 2. A list of faces with initial values is fed into two blocks, named the spatial descriptor and the structural descriptor, to generate the initial spatial and structural features of the faces. The features are then passed through several mesh convolution blocks to aggregate neighboring information; each block takes features of the two types as input and outputs new features of them. It is noted that all the processes above work on each face respectively and share the same parameters. After these processes, a pooling function is applied to the features of all faces to generate a global feature, which is used for further tasks. The above blocks are discussed in the following sections.
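The per-face, shared-parameter processing and the symmetric pooling described above can be expressed with 1×1 convolutions over the face dimension. The snippet below is a minimal PyTorch-style illustration of this pattern (our own sketch, not the authors' released code); the (64, 64) layers applied to the face centers correspond to the spatial descriptor introduced in the next section.

```python
import torch
import torch.nn as nn

# A "shared MLP" applied to every face independently can be written with 1x1
# convolutions over a (batch, channels, n_faces) tensor, so that all faces
# share the same parameters.
shared_mlp = nn.Sequential(
    nn.Conv1d(3, 64, kernel_size=1), nn.BatchNorm1d(64), nn.ReLU(),
    nn.Conv1d(64, 64, kernel_size=1), nn.BatchNorm1d(64), nn.ReLU(),
)

centers = torch.randn(8, 3, 1024)         # a batch of 8 meshes with 1024 faces each
per_face = shared_mlp(centers)            # (8, 64, 1024): one feature vector per face
global_feature, _ = per_face.max(dim=2)   # symmetric pooling over faces -> (8, 64)
```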

Spatial and Structural Descriptors

We split the feature of faces into a spatial feature and a structural feature. The spatial feature is expected to be relevant to the spatial position of faces, and the structural feature is relevant to the shape information and local structures. In this section, we present the design of two blocks, named the spatial and structural descriptors, for generating the initial spatial and structural features.

Spatial descriptor  The only input value relevant to the spatial position is the center value. In this block, we simply apply a shared MLP to each face's center, similar to the methods based on point cloud, and output the initial spatial feature.
Structural descriptor: face rotate convolution  We propose two types of structural descriptor. The first one is named face rotate convolution, which captures the “inner” structure of faces and focuses on the shape information of faces. The input of this block is the corner value.

The operation of this block is illustrated in Fig 4. Suppose the corner vectors of a face are v_1, v_2, v_3; we define the output value of this block as follows:

g\Big(\frac{1}{3}\big(f(v_1, v_2) + f(v_2, v_3) + f(v_3, v_1)\big)\Big),  (1)

where f(·, ·) : R³ × R³ → R^{K1} and g(·) : R^{K1} → R^{K2} are any valid functions.

This process is similar to a convolution operation, with two vectors as the kernel size, one vector as the stride and K1 as the number of kernels, except that the translation of kernels is replaced by rotation. The kernels, represented by f(·, ·), rotate through the face and work on two vectors each time.

Figure 4: The face rotate convolution block. Kernels rotate through the face and are applied to pairs of corner vectors for the convolution operation.

With the above process, we eliminate the influence caused by the order of processing corners, avoid considering each corner individually, and also leave full space for mining features inside faces. After the rotate convolution, we apply an average pooling and a shared MLP as g(·) to each face, and output features with length K2.
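A minimal PyTorch sketch of Eq. (1) follows. The choice of f and g as two-layer fully-connected networks of widths (32, 32) and (64, 64) mirrors the implementation details given later; class and variable names are ours rather than the paper's released code.

```python
import torch
import torch.nn as nn

class FaceRotateConvolution(nn.Module):
    """Sketch of Eq. (1): a kernel f acts on rotating pairs of corner vectors,
    the three responses are averaged, and a second MLP g maps them to K2 dims."""
    def __init__(self, k1=32, k2=64):
        super().__init__()
        self.f = nn.Sequential(nn.Linear(6, k1), nn.ReLU(), nn.Linear(k1, k1), nn.ReLU())
        self.g = nn.Sequential(nn.Linear(k1, k2), nn.ReLU(), nn.Linear(k2, k2), nn.ReLU())

    def forward(self, corners):
        # corners: (B, n_faces, 9) -> three corner vectors v1, v2, v3 per face
        v1, v2, v3 = corners[..., 0:3], corners[..., 3:6], corners[..., 6:9]
        rotated = (self.f(torch.cat([v1, v2], dim=-1)) +
                   self.f(torch.cat([v2, v3], dim=-1)) +
                   self.f(torch.cat([v3, v1], dim=-1))) / 3.0   # average over the rotating pairs
        return self.g(rotated)                                  # (B, n_faces, k2)

frc = FaceRotateConvolution()
out = frc(torch.randn(2, 1024, 9))
print(out.shape)   # torch.Size([2, 1024, 64])
```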

Structural descriptor: face kernel correlation  Another structural descriptor we design is the face kernel correlation, aiming to capture the “outer” structure of faces and explore the environments where faces locate. The method is inspired by KCNet (Shen et al.), which uses kernel correlation (KC) (Tsin and Kanade 2004) for mining local structures in point clouds. KCNet learns kernels representing different spatial distributions of point sets, and measures the geometric affinities between the kernels and the neighboring points of each point to indicate the local structure. However, this method is also restricted by the ambiguity of point cloud, and may achieve better performance on mesh.

In our face kernel correlation, we select the normal values of each face and its neighbors as the source, and learnable sets of vectors as the reference kernels. Since all the normals we use are unit vectors, we model the vectors of kernels with parameters in the spherical coordinate system, where the parameters (θ, φ) represent the unit vector (x, y, z) in the Euclidean coordinate system:

x = \sin\theta\cos\phi, \quad y = \sin\theta\sin\phi, \quad z = \cos\theta,  (2)

where θ ∈ [0, π] and φ ∈ [0, 2π).

We define the kernel correlation between the i-th face and the k-th kernel as follows:

KC(i, k) = \frac{1}{|N_i||M_k|} \sum_{n \in N_i} \sum_{m \in M_k} K_\sigma(n, m),  (3)

where N_i is the set of normals of the i-th face and its neighbor faces, M_k is the set of normals in the k-th kernel, and K_σ(·, ·) : R³ × R³ → R is the kernel function indicating the affinity between two vectors. In this paper, we generally choose the Gaussian kernel:

K_\sigma(n, m) = \exp\left(-\frac{\|n - m\|^2}{2\sigma^2}\right),  (4)

where ‖·‖ is the length of a vector in the Euclidean space, and σ is the hyper-parameter that controls the kernels' resolving ability, or tolerance to the variation of the sources.

With the above definition, we calculate the kernel correlation between each face and each kernel, and more similar pairs get higher values. Since the parameters of the kernels are learnable, they will converge to some common distributions on the surfaces of 3D shapes and be able to describe the local structures of faces. We set the value of KC(i, k) as the k-th feature of the i-th face. Therefore, with M learnable kernels, we generate features of length M for each face.
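The following is a minimal PyTorch sketch of Eqs. (2)–(4) with M = 64 learnable kernels and σ = 0.2 (the values given in the implementation details); the number of vectors per kernel and the gathering of neighbor normals via the neighbor index are our assumptions for illustration, not a transcription of the authors' code.

```python
import math
import torch
import torch.nn as nn

class FaceKernelCorrelation(nn.Module):
    """Sketch of Eqs. (2)-(4): each kernel is a learnable set of unit vectors
    parameterized by spherical angles and correlated, via a Gaussian kernel,
    with the normals of a face and its three neighbors."""
    def __init__(self, num_kernels=64, points_per_kernel=4, sigma=0.2):
        super().__init__()
        self.sigma = sigma
        self.theta = nn.Parameter(torch.rand(num_kernels, points_per_kernel) * math.pi)
        self.phi = nn.Parameter(torch.rand(num_kernels, points_per_kernel) * 2 * math.pi)

    def forward(self, normals, neighbor_index):
        # normals: (B, n, 3) unit normals; neighbor_index: (B, n, 3) long tensor
        B, n, _ = normals.shape
        idx = neighbor_index.unsqueeze(-1).expand(-1, -1, -1, 3)
        neighbors = torch.gather(normals.unsqueeze(1).expand(-1, n, -1, -1), 2, idx)  # (B, n, 3, 3)
        source = torch.cat([normals.unsqueeze(2), neighbors], dim=2)                  # N_i: (B, n, 4, 3)
        # Eq. (2): kernel vectors from the spherical parameters
        kernels = torch.stack([torch.sin(self.theta) * torch.cos(self.phi),
                               torch.sin(self.theta) * torch.sin(self.phi),
                               torch.cos(self.theta)], dim=-1)                        # M_k: (M, P, 3)
        # Eqs. (3)-(4): average Gaussian affinity over all (n, m) pairs
        diff = source[:, :, :, None, None, :] - kernels[None, None, None, :, :, :]    # (B, n, 4, M, P, 3)
        affinity = torch.exp(-(diff ** 2).sum(-1) / (2 * self.sigma ** 2))
        return affinity.mean(dim=(2, 4))                                              # KC: (B, n, M)

fkc = FaceKernelCorrelation()
kc = fkc(torch.nn.functional.normalize(torch.randn(2, 1024, 3), dim=-1),
         torch.randint(0, 1024, (2, 1024, 3)))
print(kc.shape)   # torch.Size([2, 1024, 64])
```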

Mesh Convolution

The mesh convolution block is designed to expand the receptive field of faces, which denotes the number of faces perceived by each face, by aggregating information from neighboring faces. In this process, features related to spatial positions should not be included directly, because we focus on faces in a local area and should not be influenced by where the area locates. In 2D convolutional neural networks, both the convolution and pooling operations do not introduce any positional information directly while aggregating neighboring pixels' features. Since we have separated out the structural feature, which is irrelevant to positions, we only aggregate it in this block.

Aggregation of the structural feature enables us to capture structures of a wider field around each face. Furthermore, to get a more comprehensive feature, we also combine the spatial and structural features together in this block. The mesh convolution block therefore contains two parts: combination of spatial and structural features and aggregation of structural feature, which respectively output the new spatial and structural features. Fig 5 illustrates the design of this block.

Figure 5: The mesh convolution. “Combination” denotes the combination of the spatial and structural features. “Aggregation” denotes the aggregation of the structural feature, in which “Gathering” denotes the process of getting neighbors' features and “Aggregation for one face” denotes the different methods of aggregating features.

Combination of Spatial and Structural Features  We use one of the most common methods of combining two types of features, which concatenates them together and applies an MLP. As we have mentioned, the combination result, as the new spatial feature, is actually more comprehensive and contains both spatial and structural information. Therefore, in the pipeline of our network, we concatenate all the combination results for generating the global feature.

Aggregation of Structural Feature  With the input structural feature and neighbor index, we aggregate the feature of a face with the features of its connected faces. Several aggregation methods are listed and discussed as follows:

• Average pooling: The average pooling may be the most common aggregation method, which simply calculates the average value of features in each channel. However, this method sometimes weakens the strongly-activated areas of the original features and reduces their distinction.

• Max pooling: Another pooling method is max pooling, which calculates the max value of features in each channel. Max pooling is widely used in 2D and 3D deep learning frameworks for its advantage of maintaining the strong activation of neurons.

• Concatenation: We define the concatenation aggregation, which concatenates the feature of the face with the feature of each neighbor respectively, passes these pairs through a shared MLP and applies a max pooling to the results. This method both keeps the original activation and leaves space for the network to combine neighboring features.

We finally use the concatenation method in this paper. After aggregation, another MLP is applied to further fuse the neighboring features and output the new structural feature.

Implementation Details

Now we present the details of implementing MeshNet, illustrated in Fig 2, including the settings of hyper-parameters and some details of the overall architecture.

The spatial descriptor contains fully-connected layers (64, 64) and outputs an initial spatial feature of length 64. Parameters inside parentheses indicate the dimensions of the layers except the input layer. In the face rotate convolution, we set K1 = 32 and K2 = 64, and correspondingly, the functions f(·, ·) and g(·) are implemented by fully-connected layers (32, 32) and (64, 64). In the face kernel correlation, we set M = 64 (64 kernels) and σ = 0.2.

We parameterize the mesh convolution block with a four-tuple (in1, in2, out1, out2), where in1 and out1 indicate the input and output channels of the spatial feature, and in2 and out2 indicate the same for the structural feature. The two mesh convolution blocks used in the pipeline of MeshNet are configured as (64, 131, 256, 256) and (256, 256, 512, 512).
Experiments

In the experiments, we first apply our network to 3D shape classification and retrieval. Then we conduct detailed ablation experiments to analyze the effectiveness of the blocks in the architecture. We also investigate the robustness to the number of faces and the time and space complexity of our network. Finally, we visualize the structural features from the two structural descriptors.

3D Shape Classification and Retrieval

We apply our network on the ModelNet40 dataset (Wu et al. 2015) for the classification and retrieval tasks. The dataset contains 12,311 mesh models from 40 categories, in which 9,843 models are used for training and 2,468 models for testing. For each model, we simplify the mesh data to no more than 1,024 faces, translate it to the geometric center, and normalize it into a unit sphere. Moreover, we also compute the normal vector and the indexes of the connected faces for each face. During training, we augment the data by jittering the positions of vertices with Gaussian noise with zero mean and 0.01 standard deviation. Since the number of faces varies among models, we randomly fill the list of faces to the length of 1,024 with existing faces for batch training.
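A short NumPy sketch of the preprocessing and augmentation described above (the mesh simplification to at most 1,024 faces is assumed to be done offline with an external tool; function names are ours).

```python
import numpy as np

def normalize_mesh(vertices):
    """Translate to the geometric center and scale into a unit sphere."""
    vertices = vertices - vertices.mean(axis=0)
    return vertices / np.max(np.linalg.norm(vertices, axis=1))

def jitter_vertices(vertices, std=0.01):
    """Training-time augmentation: zero-mean Gaussian noise on vertex positions."""
    return vertices + np.random.normal(0.0, std, size=vertices.shape)

def fill_faces(face_features, target=1024):
    """Randomly repeat existing faces so every sample has `target` faces for batching."""
    n = len(face_features)
    if n >= target:
        return face_features[:target]
    extra = np.random.choice(n, target - n, replace=True)
    return np.concatenate([face_features, face_features[extra]], axis=0)
```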
For classification, we apply fully-connected layers (512, 256, 40) to the global features as the classifier, and add dropout layers with a drop probability of 0.5 before the last two fully-connected layers. For retrieval, we calculate the L2 distances between the global features as similarities and evaluate the results with mean average precision (mAP). We use the SGD optimizer for training, with initial learning rate 0.01, momentum 0.9, weight decay 0.0005 and batch size 64.
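For reference, the classification head and training settings above translate into roughly the following PyTorch configuration (a sketch under the stated hyper-parameters; `meshnet` stands for the feature extractor producing 1024-dimensional global features and is assumed here).

```python
import torch
import torch.nn as nn

# Classifier on the 1024-d global feature: fully-connected layers (512, 256, 40),
# with dropout (p = 0.5) before the last two layers; batch size 64 in the paper.
classifier = nn.Sequential(
    nn.Linear(1024, 512), nn.ReLU(),
    nn.Dropout(p=0.5), nn.Linear(512, 256), nn.ReLU(),
    nn.Dropout(p=0.5), nn.Linear(256, 40),
)

# SGD with the stated hyper-parameters; `meshnet` is the assumed feature extractor.
# optimizer = torch.optim.SGD(list(meshnet.parameters()) + list(classifier.parameters()),
#                             lr=0.01, momentum=0.9, weight_decay=0.0005)

def retrieval_distances(query_feats, gallery_feats):
    """Retrieval: pairwise L2 distances between global features (smaller = more similar)."""
    return torch.cdist(query_feats, gallery_feats, p=2)
```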
Method                                             Modality   Acc (%)   mAP (%)
3DShapeNets (Wu et al. 2015)                       volume     77.3      49.2
VoxNet (Maturana and Scherer 2015)                 volume     83.0      -
FPNN (Li et al. 2016)                              volume     88.4      -
LFD (Chen et al. 2003)                             view       75.5      40.9
MVCNN (Su et al. 2015)                             view       90.1      79.5
Pairwise (Johns, Leutenegger, and Davison 2016)    view       90.7      -
PointNet (Qi et al. 2017a)                         point      89.2      -
PointNet++ (Qi et al. 2017b)                       point      90.7      -
Kd-Net (Klokov and Lempitsky 2017)                 point      91.8      -
KCNet (Shen et al.)                                point      91.0      -
SPH (Kazhdan, Funkhouser, and Rusinkiewicz 2003)   mesh       68.2      33.3
MeshNet                                            mesh       91.9      81.9

Table 1: Classification and retrieval results on ModelNet40.

Table 1 shows the experimental results of classification and retrieval on ModelNet40, comparing our work with representative methods. It is shown that, as a mesh-based representation, MeshNet achieves satisfying performance and makes a great improvement compared with traditional mesh-based methods. It is also comparable with recent deep learning methods based on other types of data.

Our performance is attributed to the following reasons. With the face-unit and per-face processes, MeshNet solves the complexity and irregularity problems of mesh data and makes it suitable for deep learning methods. Though deep learning has a strong ability to capture features, we do not simply apply it, but design blocks to get full use of the rich information in mesh. Splitting features into spatial and structural features enables us to consider the spatial distribution and the local structure of shapes respectively. And the mesh convolution blocks widen the receptive field of faces. Therefore, the proposed method is able to capture detailed features of faces and conduct 3D shape representation well.

On the Effectiveness of Blocks

To analyze the design of the blocks in our architecture and prove their effectiveness, we conduct several ablation experiments, which compare the classification results while varying the settings of the architecture or removing some blocks.

For the spatial descriptor, labeled as “Spatial” in Table 2, we remove it together with the use of the spatial feature in the network, and maintain the aggregation of the structural feature in the mesh convolution.

For the structural descriptor, we first remove the whole of it and use max pooling to aggregate the spatial feature in the mesh convolution. Then we partly remove the face rotate convolution, labeled as “Structural-FRC” in Table 2, or the face kernel correlation, labeled as “Structural-FKC”, and keep the rest of the pipeline to prove the effectiveness of each structural descriptor.

For the mesh convolution, labeled as “Mesh Conv” in Table 2, we remove it and use the initial features to generate the global feature directly. We also explore the effectiveness of different aggregation methods in this block, and compare them in Table 3. The experimental results show that the adopted concatenation method performs better for aggregating neighboring features.

Spatial          X   X   X   X   X
Structural-FRC   X   X   X   X
Structural-FKC   X   X   X   X
Mesh Conv        X   X   X   X   X
Accuracy (%)     83.5   88.2   87.0   89.9   90.4   91.9

Table 2: Classification results of ablation experiments on ModelNet40.

Aggregation Method   Accuracy (%)
Average Pooling      90.7
Max Pooling          91.1
Concatenation        91.9

Table 3: Classification results of different aggregation methods on ModelNet40.

On the Number of Faces

The number of faces in ModelNet40 varies dramatically among models. To explore the robustness of MeshNet to the number of faces, we regroup the test data by the number of faces with an interval of 200. In Table 4, we list the proportion of the number of models in each group, together with the classification results. It is shown that the accuracy is essentially unaffected by the number of faces and shows no downward trend as the number decreases, which proves the great robustness of MeshNet to the number of faces. Specifically, on the 9 models with fewer than 50 faces (the minimum is 10), our network achieves 100% classification accuracy, showing the ability to represent models with extremely few faces.

Number of Faces   Proportion (%)   Accuracy (%)
[1000, 1024)      69.48            92.00
[800, 1000)       6.90             92.35
[600, 800)        4.70             93.10
[400, 600)        6.90             91.76
[200, 400)        6.17             90.79
[0, 200)          5.84             90.97

Table 4: Classification results of groups with different numbers of faces on ModelNet40.

On the Time and Space Complexity

Table 5 compares the time and space complexity of our network with several representative methods based on other types of data for the classification task. The column labeled “#params” shows the total number of parameters in the network and the column labeled “FLOPs/sample” shows the number of floating-point operations conducted for each input sample, representing the space and time complexity respectively.

It is known that methods based on volumetric grid and multi-view introduce extra computation cost, while methods based on point cloud are more efficient. Theoretically, our method works with per-face processes and has a linear complexity in the number of faces. In Table 5, MeshNet shows comparable efficiency with methods based on point cloud in both time and space complexity, leaving enough space for further development.

Method                       #params (M)   FLOPs/sample (M)
PointNet (Qi et al. 2017a)   3.5           440
Subvolume (Qi et al. 2016)   16.6          3633
MVCNN (Su et al. 2015)       60.0          62057
MeshNet                      4.25          509

Table 5: Time and space complexity for classification.

Feature Visualization

To figure out whether the structural descriptors successfully capture the features of faces as expected, we visualize the two types of structural features from the face rotate convolution and the face kernel correlation. We randomly choose several channels of these features, and for each channel, we paint faces with colors of different depth corresponding to their values in this channel.

Figure 6: Feature visualization of structural features. Models in the same column are colored with their values of the same channel in the features. Left: features from the face rotate convolution. Right: features from the face kernel correlation.

The left of Fig 6 visualizes features from the face rotate convolution, which is expected to capture the “inner” features of faces and concern the shapes of them. It is clearly shown that faces with a similar look are also colored similarly, and different channels may be activated by different types of triangular faces.

The visualization results of features from the face kernel correlation are in the right of Fig 6. As we have mentioned, this descriptor captures the “outer” features of each face and is relevant to the whole appearance of the area where the face locates. In the visualization, faces in similar types of areas, such as flat surfaces and steep slopes, tend to have similar features, regardless of their own shapes and sizes.

Conclusions

In this paper, we propose a mesh neural network, named MeshNet, which learns on mesh data directly for 3D shape representation. The proposed method is able to solve the complexity and irregularity problems of mesh data and conduct 3D shape representation well. In this method, the polygon faces are regarded as the unit, and their features are split into spatial and structural features. We also design blocks for capturing and aggregating features of faces. We conduct experiments for 3D shape classification and retrieval and compare our method with the state-of-the-art methods. The experimental results and comparisons demonstrate the effectiveness of the proposed method on 3D shape representation. In the future, the network can be further developed for more computer vision tasks.
Acknowledgments

This work was supported by National Key R&D Program of China (Grant No. 2017YFC0113000), National Natural Science Funds of China (U1701262, 61671267), National Science and Technology Major Project (No. 2016ZX01038101), MIIT IT funds (Research and application of TCN key technologies) of China, and The National Key Technology R&D Program (No. 2015BAG14B01-02).

References

[Chang et al. 2015] Chang, A. X.; Funkhouser, T.; Guibas, L.; Hanrahan, P.; Huang, Q.; Li, Z.; Savarese, S.; Savva, M.; Song, S.; Su, H.; et al. 2015. ShapeNet: An Information-Rich 3D Model Repository. arXiv preprint arXiv:1512.03012.
[Chen et al. 2003] Chen, D.-Y.; Tian, X.-P.; Shen, Y.-T.; and Ouhyoung, M. 2003. On Visual Similarity Based 3D Model Retrieval. In Computer Graphics Forum, volume 22, 223–232. Wiley Online Library.
[Feng et al. 2018] Feng, Y.; Zhang, Z.; Zhao, X.; Ji, R.; and Gao, Y. 2018. GVCNN: Group-View Convolutional Neural Networks for 3D Shape Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 264–272.
[Hegde and Zadeh 2016] Hegde, V., and Zadeh, R. 2016. FusionNet: 3D Object Classification Using Multiple Data Representations. arXiv preprint arXiv:1607.05695.
[Hubeli and Gross 2001] Hubeli, A., and Gross, M. 2001. Multiresolution Feature Extraction for Unstructured Meshes. In Proceedings of the Conference on Visualization, 287–294. IEEE Computer Society.
[Jiang, Wu, and Lu 2018] Jiang, M.; Wu, Y.; and Lu, C. 2018. PointSIFT: A SIFT-like Network Module for 3D Point Cloud Semantic Segmentation. arXiv preprint arXiv:1807.00652.
[Johns, Leutenegger, and Davison 2016] Johns, E.; Leutenegger, S.; and Davison, A. J. 2016. Pairwise Decomposition of Image Sequences for Active Multi-view Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 3813–3822.
[Kazhdan, Funkhouser, and Rusinkiewicz 2003] Kazhdan, M.; Funkhouser, T.; and Rusinkiewicz, S. 2003. Rotation Invariant Spherical Harmonic Representation of 3D Shape Descriptors. In Symposium on Geometry Processing, volume 6, 156–164.
[Klokov and Lempitsky 2017] Klokov, R., and Lempitsky, V. 2017. Escape from Cells: Deep Kd-Networks for the Recognition of 3D Point Cloud Models. In Proceedings of the IEEE International Conference on Computer Vision, 863–872. IEEE.
[Kokkinos et al. 2012] Kokkinos, I.; Bronstein, M. M.; Litman, R.; and Bronstein, A. M. 2012. Intrinsic Shape Context Descriptors for Deformable Shapes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 159–166. IEEE.
[Li et al. 2016] Li, Y.; Pirk, S.; Su, H.; Qi, C. R.; and Guibas, L. J. 2016. FPNN: Field Probing Neural Networks for 3D Data. In Advances in Neural Information Processing Systems, 307–315.
[Li, Chen, and Lee 2018] Li, J.; Chen, B. M.; and Lee, G. H. 2018. SO-Net: Self-Organizing Network for Point Cloud Analysis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 9397–9406.
[Lien and Kajiya 1984] Lien, S.-l., and Kajiya, J. T. 1984. A Symbolic Method for Calculating the Integral Properties of Arbitrary Nonconvex Polyhedra. IEEE Computer Graphics and Applications 4(10):35–42.
[Maturana and Scherer 2015] Maturana, D., and Scherer, S. 2015. VoxNet: A 3D Convolutional Neural Network for Real-time Object Recognition. In 2015 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 922–928. IEEE.
[Qi et al. 2016] Qi, C. R.; Su, H.; Nießner, M.; Dai, A.; Yan, M.; and Guibas, L. J. 2016. Volumetric and Multi-View CNNs for Object Classification on 3D Data. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 5648–5656.
[Qi et al. 2017a] Qi, C. R.; Su, H.; Mo, K.; and Guibas, L. J. 2017a. PointNet: Deep Learning on Point Sets for 3D Classification and Segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 77–85.
[Qi et al. 2017b] Qi, C. R.; Yi, L.; Su, H.; and Guibas, L. J. 2017b. PointNet++: Deep Hierarchical Feature Learning on Point Sets in a Metric Space. In Advances in Neural Information Processing Systems, 5099–5108.
[Shen et al.] Shen, Y.; Feng, C.; Yang, Y.; and Tian, D. Mining Point Cloud Local Structures by Kernel Correlation and Graph Pooling. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, volume 4.
[Su et al. 2015] Su, H.; Maji, S.; Kalogerakis, E.; and Learned-Miller, E. 2015. Multi-View Convolutional Neural Networks for 3D Shape Recognition. In Proceedings of the IEEE International Conference on Computer Vision, 945–953.
[Tsin and Kanade 2004] Tsin, Y., and Kanade, T. 2004. A Correlation-Based Approach to Robust Point Set Registration. In European Conference on Computer Vision, 558–569. Springer.
[Wang and Posner 2015] Wang, D. Z., and Posner, I. 2015. Voting for Voting in Online Point Cloud Object Detection. In Robotics: Science and Systems, volume 1.
[Wang et al. 2017] Wang, P.-S.; Liu, Y.; Guo, Y.-X.; Sun, C.-Y.; and Tong, X. 2017. O-CNN: Octree-based Convolutional Neural Networks for 3D Shape Analysis. ACM Transactions on Graphics (TOG) 36(4):72.
[Wu et al. 2015] Wu, Z.; Song, S.; Khosla, A.; Yu, F.; Zhang, L.; Tang, X.; and Xiao, J. 2015. 3D ShapeNets: A Deep Representation for Volumetric Shapes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 1912–1920.
[You et al. 2018] You, H.; Feng, Y.; Ji, R.; and Gao, Y. 2018.
PVnet: A Joint Convolutional Network of Point Cloud and
Multi-view for 3D Shape Recognition. In Proceedings of the
26th ACM International Conference on Multimedia, 1310–
1318. ACM.
[Zaharescu et al. 2009] Zaharescu, A.; Boyer, E.; Varanasi,
K.; and Horaud, R. 2009. Surface Feature Detection and
Description with Applications to Mesh Matching. In Pro-
ceedings of the IEEE Conference on Computer Vision and
Pattern Recognition, 373–380. IEEE.
[Zhang and Chen 2001] Zhang, C., and Chen, T. 2001. Effi-
cient Feature Extraction for 2D/3D Objects in Mesh Repre-
sentation. In Proceedings of the IEEE International Confer-
ence on Image Processing, volume 3, 935–938. IEEE.
