Fig. 2. Overview of the overall system of our method. In the first frame, the desired object is located manually. In subsequent frames, patches surrounding the object are extracted and fed one by one to the CNN, which regresses the bounding-box location of the object relative to the patch [x, y, width, height]. The center position of the tracked object is then sent to the turret's actuators (pan-tilt) using PID. Best viewed in color.
The search region is obtained by enlarging the surroundings of the object's previous bounding box by a certain number of pixels; this common search-region technique is employed to crop the current frame, as in the sketch below. We assume that the desired object is not occluded and does not move too fast. For faster objects, the search-region size would likely need to be increased, at the cost of a more complex network that requires more computation.
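To make the cropping step concrete, the following is a minimal sketch of the search-region technique described above, assuming the previous bounding box is given in pixel coordinates; the margin size and the function name are illustrative, not values taken from the paper.

```python
import numpy as np

def crop_search_region(frame, prev_box, margin=32):
    """Crop a search region by enlarging the previous bounding box.

    frame:    H x W x 3 image as a numpy array
    prev_box: (x, y, w, h) of the object in the previous frame
    margin:   illustrative number of pixels added on each side
    """
    x, y, w, h = prev_box
    H, W = frame.shape[:2]
    # Enlarge the box by `margin` pixels, clamped to the image borders.
    x0 = max(int(x) - margin, 0)
    y0 = max(int(y) - margin, 0)
    x1 = min(int(x + w) + margin, W)
    y1 = min(int(y + h) + margin, H)
    # Return the crop and its offset so a box regressed relative to the
    # patch can be mapped back to full-frame coordinates.
    return frame[y0:y1, x0:x1], (x0, y0)

# Example on a blank 640x480 frame with a hypothetical previous box.
frame = np.zeros((480, 640, 3), dtype=np.uint8)
patch, offset = crop_search_region(frame, (300, 200, 64, 128))
```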
The network is first pre-trained on the ImageNet dataset; fine-tuning from such a pre-trained model is common practice, as ImageNet is considered a generic object dataset. Layers 1 to 5 are kept fixed to prevent overfitting. A learning rate of 0.00001 is used, and the default values of CaffeNet are taken for the other hyperparameters [32].
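As a rough illustration of this training setup, the sketch below freezes the five convolutional layers of a CaffeNet-style network by zeroing their learning-rate multipliers and sets the stated base learning rate; the file names are placeholders, and this mirrors common Caffe practice rather than the paper's exact scripts.

```python
from caffe.proto import caffe_pb2
from google.protobuf import text_format

# Freeze conv1-conv5 so the ImageNet pre-trained weights stay fixed.
net = caffe_pb2.NetParameter()
with open('train_val.prototxt') as f:      # placeholder file name
    text_format.Merge(f.read(), net)

for layer in net.layer:
    if layer.name in {'conv1', 'conv2', 'conv3', 'conv4', 'conv5'}:
        while len(layer.param) < 2:        # one ParamSpec per blob
            layer.param.add()
        for p in layer.param:              # weights and biases
            p.lr_mult = 0.0
            p.decay_mult = 0.0

with open('train_val_frozen.prototxt', 'w') as f:
    f.write(text_format.MessageToString(net))

# Solver: learning rate 0.00001 as stated; other fields keep defaults.
solver = caffe_pb2.SolverParameter()
solver.base_lr = 0.00001
solver.net = 'train_val_frozen.prototxt'
with open('solver.prototxt', 'w') as f:
    f.write(text_format.MessageToString(solver))
```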
D. Generating bounding boxes

At test time, the desired object is located manually in the first frame; for subsequent frames the system runs automatically, detecting the object. The network produces a bounding box [x, y, w, h] indicating the located object, and we simply draw the bounding box using this information. A PID algorithm is then used to rotate the pan-tilt turret actuators based on the center position of the tracked object's bounding box, as sketched below.
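The step from a detected bounding box to a pan-tilt command could look like the following minimal sketch; the gains, the time step, and the velocity-command interface are assumptions for illustration, since the paper does not list its PID parameters. The resulting pan and tilt commands would then be sent to the simulated turret joints.

```python
class PID:
    """Simple PID controller; the gains are illustrative placeholders."""
    def __init__(self, kp, ki, kd):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.integral = 0.0
        self.prev_error = None

    def step(self, error, dt):
        self.integral += error * dt
        derivative = 0.0 if self.prev_error is None else (error - self.prev_error) / dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative


def track_step(box, frame_w, frame_h, pan_pid, tilt_pid, dt=0.05):
    """Turn a detected box [x, y, w, h] into pan/tilt commands.

    The error is the offset of the box center from the image center;
    pan reacts to the horizontal error, tilt to the vertical one.
    """
    x, y, w, h = box
    err_x = (x + w / 2.0) - frame_w / 2.0
    err_y = (y + h / 2.0) - frame_h / 2.0
    return pan_pid.step(err_x, dt), tilt_pid.step(err_y, dt)


# Example for a 640x480 camera view; gains are arbitrary placeholders.
pan, tilt = PID(0.005, 0.0, 0.001), PID(0.005, 0.0, 0.001)
pan_cmd, tilt_cmd = track_step((300, 220, 40, 80), 640, 480, pan, tilt)
```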
IV. EXPERIMENTAL RESULTS

On a PC with a Core i7 CPU, 16 GB of RAM, and a GPU with 8 GB of memory, training takes about a few hours. A free version of the V-REP robotic simulation system (http://www.coppeliarobotics.com/) is used to test our targeting system. V-REP is connected with the Robot Operating System (ROS, http://www.ros.org/) and the Caffe deep learning library [32]. A KUKA youBot robot serves as the mobile platform, and our turret is mounted on the robot, as shown in Fig. 3. The robot has non-holonomic wheels, and the turret has pan and tilt movements with a pointing gun. A camera is mounted in the same direction as the gun. In this setup, the robot autonomously follows a given line, laid out in a circle, while the turret tracks the given object.

Fig. 3. Simulation setup in our experiment.
The performance of our method is studied for tracking people walking randomly, as shown in Fig. 4. For some frames, our method seems late in moving the turret even though the object has been tracked successfully; this could be attributed to the slow rotation of the motors that actuate the pan-tilt movement. Nevertheless, our method tracks the object fairly well and is notably robust to the size of the object: whether the object is quite far or quite near, the turret is still able to track it correctly. At the moment we have difficulty comparing our system with existing ones because of the different platforms used; different motor specifications (a slow motor versus a fast one) result in different tracking behavior.

Fig. 4. Results of our system for tracking people while the robot base moves in a circle. A window at the bottom-right of each image shows the detected object (green box) from the camera view.
The accuracy of the tracking position while the robot base moves in a circle is shown in Fig. 5. The position of the object in the camera view is used, normalized to [0, 1] on both the x and y axes: point (0, 0) is the top-left of the camera view and (1, 1) is the bottom-right, so good tracking should keep the object at (0.5, 0.5). Our method tracks the object relatively well, since the object position stays quite close to the center (0.5, 0.5). For some positions our method slightly misses the object, which could be attributed to the slow motor and to PID parameters that are likely not optimally set. Overall, however, our method still manages to track without losing the target.

Fig. 5. Normalized position of the tracked object ("Object x" and "Object y") over the frame number while the robot base moves in a circle; the vertical axis shows the normalized position (0 to 0.6).
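For clarity, the normalization used in this evaluation can be written as a small helper; the function name is ours, but the mapping follows the text (top-left (0, 0), bottom-right (1, 1), ideal center (0.5, 0.5)).

```python
def normalized_position(box, frame_w, frame_h):
    """Center of the tracked box [x, y, w, h], normalized per axis to [0, 1].

    (0, 0) is the top-left of the camera view, (1, 1) the bottom-right;
    perfect tracking keeps the returned point at (0.5, 0.5).
    """
    x, y, w, h = box
    return (x + w / 2.0) / frame_w, (y + h / 2.0) / frame_h

# Example: a 64x128 box at (288, 176) in a 640x480 view is exactly centered.
assert normalized_position((288, 176, 64, 128), 640, 480) == (0.5, 0.5)
```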
Tracking a person who is identical to another person is shown in Fig. 6. Our method is still able to track the correct person even though the two are identical. This is because the learned deep network intrinsically learns the motion of the object: objects in video tend to move smoothly rather than randomly, so although the persons are identical (have similar visual features), they have different motion. This differs from previous methods, where object motion is hard-coded into the system instead of learned from data. However, our method is highly likely to fail when the object is widely occluded, as shown in Fig. 7.

Fig. 6. Results of our system for tracking identical persons while the robot base moves in a circle. Note that our method is still able to track the correct person even though they are identical, which could be attributed to learned object motion: our deep network intrinsically learns object motion.

Fig. 7. Failure case when the robot camera view is widely occluded.
V. CONCLUSION

We have presented an automatic targeting system for a gun turret that uses only visual information from a camera to solve the problems of existing works. Note that using many sensors increases the weight of the payload and is not a cheap solution, not to mention the damage and repair costs. Existing works tend to use manually designed features as input to a classification system to perform target tracking, together with a highly complex kinematic and dynamic analysis that is specific to a particular turret. Hand-crafted features can degrade accuracy because their parameters are less than optimal, and they are difficult to implement in practice: the parameters are not learned from training data but are based solely on the engineer's experience and knowledge.

REFERENCES

[1] N. Djelal, N. Mechat, and S. Nadia, “Target tracking by visual servoing,” in Systems, Signals and Devices (SSD), 2011 8th International Multi-Conference on. IEEE, 2011, pp. 1–6.
[2] N. Djelal, N. Saadia, and A. Ramdane-Cherif, “Target tracking based on SURF and image based visual servoing,” in Communications, Computing and Control Applications (CCCA), 2012 2nd International Conference on. IEEE, 2012, pp. 1–5.
[3] E. Iflachah, D. Purnomo, and I. A. Sulistijono, “Coil gun turret control using a camera,” EEPIS Final Project, 2011.
[4] A. M. Idris, K. Hudha, Z. A. Kadir, and N. H. Amer, “Development of target tracking control of gun-turret system,” in Control Conference (ASCC), 2015 10th Asian. IEEE, 2015, pp. 1–5.
[13] C. Farabet, C. Couprie, L. Najman, and Y. LeCun, “Learning hierarchical features for scene labeling,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, no. 8, pp. 1915–1929, 2013.
[14] A. Graves, M. Liwicki, S. Fernández, R. Bertolami, H. Bunke, and J. Schmidhuber, “A novel connectionist system for unconstrained handwriting recognition,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 31, no. 5, pp. 855–868, 2009.
[15] A.-r. Mohamed, G. E. Dahl, and G. Hinton, “Acoustic modeling using deep belief networks,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 20, no. 1, pp. 14–22, 2012.
[16] R. Girshick, “Fast R-CNN,” in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 1440–1448.
[26] …and blink controlled firing system for military tank using labview,” in Intelligent Human Computer Interaction (IHCI), 2012 4th International Conference on. IEEE, 2012, pp. 1–4.
[27] R. Bisewski and P. K. Atrey, “Toward a remote-controlled weapon-equipped camera surveillance system,” in Tools with Artificial Intelligence (ICTAI), 2011 23rd IEEE International Conference on. IEEE, 2011, pp. 1087–1092.
[28] A. Risnumawan, P. Shivakumara, C. S. Chan, and C. L. Tan, “A robust arbitrary text detection system for natural scene images,” Expert Systems with Applications, vol. 41, no. 18, pp. 8027–8048, 2014.
[29] A. Risnumawan and C. S. Chan, “Text detection via edgeless stroke width transform,” in Intelligent Signal Processing and Communication Systems (ISPACS), 2014 International Symposium on. IEEE, 2014, pp. 336–340.
[30] L. Neumann and J. Matas, “Real-time scene text localization and recognition,” in Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, 2012, pp. 3538–3545.
[31] L. Neumann and J. Matas, “On combining multiple segmentations in scene text recognition,” in Document Analysis and Recognition (ICDAR), 2013 12th International Conference on, 2013, pp. 523–527.
[32] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell, “Caffe: Convolutional architecture for fast feature embedding,” in Proceedings of the 22nd ACM international conference on Multimedia. ACM, 2014, pp. 675–678.
[33] A. Risnumawan, I. A. Sulistijono, and J. Abawajy, “Text detection in low resolution scene images using convolutional neural network,”