Article
A Deep Learning Approach to Merge Rule-Based and
Human-Operated Camera Control for Teleoperated
Robotic Systems
Luay Jawad 1, Arshdeep Singh-Chudda 2, Abhishek Shankar 1 and Abhilash Pandya 2,*
1 Department of Computer Science, Wayne State University, Detroit, MI 48202, USA; ljawad@wayne.edu (L.J.);
abhishek.shankar@wayne.edu (A.S.)
2 Department of Electrical and Computer Engineering, Wayne State University, Detroit, MI 48202, USA;
arshdeep@wayne.edu
* Correspondence: apandya@wayne.edu
Abstract: Controlling a laparoscopic camera during robotic surgery represents a multifaceted challenge, demanding considerable physical and cognitive exertion from operators. While manual control presents the advantage of enabling optimal viewing angles, it is offset by its taxing nature. In contrast, current autonomous camera systems offer predictability in tool tracking but are often rigid, lacking the adaptability of human operators. This research investigates the potential of two distinct network architectures: a dense neural network (DNN) and a recurrent network (RNN), both trained using a diverse dataset comprising autonomous and human-driven camera movements. A comparative assessment of network-controlled, autonomous, and human-operated camera systems is conducted to gauge network efficacies. While the dense neural network exhibits proficiency in basic tool tracking, it grapples with inherent architectural limitations that hinder its ability to master the camera's zoom functionality. In stark contrast, the recurrent network excels, demonstrating a capacity to sufficiently replicate the behaviors exhibited by a mixture of both autonomous and human-operated methods. In total, 96.8% of the dense network predictions had up to a one-centimeter error when compared to the test datasets, while the recurrent network achieved a 100% sub-millimeter testing error. This paper trains and evaluates neural networks on autonomous and human behavior data for camera control.
Keywords: robotic surgery; deep neural network; autonomous camera control

Citation: Jawad, L.; Singh-Chudda, A.; Shankar, A.; Pandya, A. A Deep Learning Approach to Merge Rule-Based and Human-Operated Camera Control for Teleoperated Robotic Systems. Robotics 2024, 13, 47. https://doi.org/10.3390/robotics13030047

Academic Editor: Sylwia Łukasik

Received: 23 January 2024; Revised: 29 February 2024; Accepted: 5 March 2024; Published: 11 March 2024

Copyright: © 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

1. Introduction

The advent of robotic surgical systems and their widespread use has changed the way surgery is performed. However, with this new technology comes the drawback of needing to view the surgical scene through a camera. This can lead to several problems, namely depth perception and camera control. In current clinical systems, the only way to control the laparoscopic camera is for the surgeon to press a clutch pedal or button that shifts control over to the camera arm, move the camera, and then resume work with the manipulator tools. This method of camera control requires the surgeon to divert their attention away from the surgery to move the camera. It might also result in one or both tools being out of the camera's field of view at some points due to suboptimal camera positioning. Recently, autonomous camera control systems have been developed that track the tools and keep them in the field of view. Such systems may increase patient safety significantly as the surgeon is always able to see the tools and does not need to interrupt the surgical flow to move the camera. However, they are limited in their capabilities and are unable to adapt to the surgeon's needs. A third method has been developed for camera control: an external operator via joystick. This method shares many of the drawbacks of clutch control, such as suboptimal views, and additionally requires vocalization from the surgeon to communicate with the joystick operator. However, it does outperform both the clutch and autonomous systems in terms of task execution times and does not narrow the camera focus on the tools [1]. It is difficult to articulate the ideal position for the endoscopic camera vocally or algorithmically.
To address this, an autonomous camera controller was developed using deep neural
networks trained on combined datasets from rule-based and human-operated camera
control methods. Deep neural networks have shown an impressive ability to process and
generalize a large quantity of information in a particular setting. When properly trained,
they can maximize and combine the advantages of rule-based autonomous and manual
camera control while minimizing the drawbacks present in each method. This paper
compares two network architectures with the existing system and evaluates their ability to
find an optimal camera position.
2. Background
Making the laparoscopic camera autonomous has been the focus of many research
groups over the years. The general capabilities of these systems involve tracking the
manipulator arms using various methods [2]. In 2020, Eslamian et al. published an
autocamera system that tracks tools using kinematic information from encoders to find the
centroid of the tools and positions the camera such that the centroid is in the center of the
camera image. The zoom level of the system is dependent on the distance between the tools,
the movement of the tools, and time. The autocamera algorithm will move the camera in or
out if the tools are stationary at a preset distance for a fixed amount of time [1]. Da Col et al.
developed a similar system—SCAN (System for Camera Autonomous Navigation)—with
the ability to track the tools using kinematics [3]. Garcia-Peraza-Herrera et al. created
a system that uses the camera image to segment the tools in the field of view and track
them. This system tracks one tool if only one is visible, and tracks both if both are visible.
It also provides the ability to set a dominance factor to focus more on one tool over the
other [4]. Elazzazi et al. extended autocamera systems to include a voice interface which
allows the surgeon to vocally give commands and customize some basic system features [5].
While these systems do allow the operator to avoid context switching and focus solely on
performing the surgery, they are limited by an explicit set of rules that must be hard coded
into the system and often do not match what the operator would prefer.
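As an illustration of how such a hard-coded rule might look, the stationary-tool zoom behavior described for the autocamera system [1] is sketched below. This is a hedged illustration only: the function name, distance thresholds, and dwell time are invented placeholders, not the published values.

```python
import numpy as np

def zoom_command(tool_a, tool_b, still_since_s, now_s,
                 near_mm=30.0, far_mm=80.0, dwell_s=0.25):
    """Return 'in', 'out', or None for a rule-based zoom decision.

    The rule: only zoom once the tools have been stationary for a fixed
    dwell time, then compare their separation against preset distances.
    Thresholds here are illustrative assumptions.
    """
    if now_s - still_since_s < dwell_s:   # tools must be stationary first
        return None
    sep = float(np.linalg.norm(np.asarray(tool_a, dtype=float)
                               - np.asarray(tool_b, dtype=float)))
    if sep < near_mm:
        return "in"    # tools close together: move the camera closer
    if sep > far_mm:
        return "out"   # tools far apart: pull the camera back
    return None
```

The rigidity criticized above is visible here: every behavior hinges on fixed constants that cannot adapt to operator preference.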
Machine learning and deep neural networks are widely being researched as methods
for surgical instrument tracking or segmentation. It was recently proven that the timing
of the camera movements can be predicted using the instrument kinematics and artificial
neural networks, but no camera control was applied [6]. Sarikaya et al. used deep convo-
lutional networks to detect and localize the tools in the camera frame with 91% accuracy
and a detection time of 0.103 s [7]. For actual camera control, Lu et al. developed the SuPer
Deep framework that used a dual neural network approach to estimate the pose of the tools
and create depth maps of deformable tissue. Both networks are variations of the ResNet
CNN architecture; one network localizes the tools in the field of view and the other tracks
and creates a point cloud depth map of the surrounding tissue [8]. Azqueta-Gavaldon et al.
created a CycleGAN and U-Net based model for use on the MiroSurge surgical system.
The authors used CycleGAN to generate images which were then fed to the U-Net model
for segmentation [9]. Wagner et al. developed a cognitive task-specific camera viewpoint
automation system with self-learning capabilities. The cognitive model incorporates per-
ception, interpretation, and action. The environment is perceived through the endoscope
image, and position information for the robot, end effectors, and camera arm. The study
uses an ensemble of random forests to classify camera guidance quality as good, neutral,
or poor, which is a classification system learned from the annotated training data. These
ensembles are then used during inference to sample possible camera positions according to
quality [10]. Li et al. utilized a GMM and YOLO vision model to extract domain knowledge
from recorded surgical data for safe automatic laparoscope control. They developed a
novel algorithm to find the optimal trajectory and target for the laparoscope and achieved
proper field-of-view control. The authors successfully conducted experiments validating their algorithm on the dVRK [11]. Similarly, Liu et al. developed a real-time surgical tool segmentation model using YOLOv5 as the base. On the m2cai16-tool-locations dataset, they achieved a 133 FPS inference speed with a 97.9% mean average precision [12]. These methods share the same advantages as the above autonomous approaches and are more adaptable to different situations, but they require dataset annotation, do not use human demonstrations for training, or suffer from high computational costs due to the use of convolutional networks. The methods discussed in this paper are computationally efficient while maintaining the advantages of rule-based automation and human-operated control. The networks are trained with data gathered from human demonstrations that can be used without annotation.
3. Materials and Methods

This section describes the da Vinci Surgical System and Research Kit (dVRK), the datasets used, the neural network architectures, and the dVRK integration. Python3 was used for the development of the neural networks, employing the Keras framework with a TensorFlow backend [13]. Many of the network hyperparameters given in this section were chosen simply because they produced the best results; as such, this paper does not explore network performance under different hyperparameter values.
3.1. da Vinci Research Kit (dVRK)

The da Vinci Standard Surgical System used for this research consists of the tools or patient side manipulators (PSMs) (2×) and the endoscopic camera manipulator (ECM) (1×) as depicted in Figure 1a. The da Vinci Research Kit (dVRK) seen in Figure 1b provides the control hardware and associated Robot Operating System (ROS) Melodic software interfaces to control the da Vinci Standard robot hardware [14]. This is performed by creating a network of nodes and topics using the framework provided by ROS. It also provides visualization of the three manipulators using ROS and RViz (as seen in Figure 2), which move in sync with their hardware counterparts. The visualization can be used independently of the hardware to develop and test software as well as for the playback of previously recorded manipulator data.
Figure 1. Hardware configuration of the da Vinci Surgical System. (a) The instrument cart with the patient side manipulators (PSMs) and endoscopic camera manipulator (ECM) hardware. (b) The dVRK hardware control tower.
Figure 2. RViz view of the PSMs and ECM, and the camera views from the left (top) and right (bottom) cameras of the ECM.
3.2. Data

The data used to train and evaluate the neural networks were recorded from a user study conducted by our lab and published by Eslamian et al. [1]. The 20-subject study had the da Vinci operators move a wire across a breadboard while the camera was controlled via one of three methods: clutch control using the da Vinci operator, an external human-operated joystick, and the previously mentioned autonomous algorithm (autocamera) by Eslamian et al. which tracks the midpoint of the tools. One camera operator (for joystick control) was well-trained and chosen so that there was a consistent operator across all 20 participants. The task was chosen such that it would require dexterous movement of the tools and handling of suture-like material, thereby closely mimicking a surgical suturing task. (Please reference supplementary materials.)

The results of the study report that the joystick-operated camera control method narrowly edged out the autonomous control method in terms of timing but that the autonomous method was preferred for its ability to track the tools. Both the human-operated joystick and autonomous methods outperformed the clutch-controlled camera. The study demonstrated notable benefits of the predominant camera control techniques. Specifically, the autonomous system excelled in maintaining tool visibility, while the joystick-based human operation used task awareness, resulting in improved movement timing [1]. Consequently, the decision was made to train the networks using a hybrid approach, incorporating data from both autonomous and joystick camera control methods. As such, the training data comprised a 70:30 autonomous–joystick split. Clutch control data were excluded due to being clearly outperformed by the other methods and offering no distinct advantages. The relevant data, captured at 100 Hz, contained only the 3D Cartesian tooltip and camera positions obtained via forward kinematics. The tool and camera positions are with respect to the base of each manipulator arm as seen in Figure 3. The orientation of the manipulators was not used as it has no impact on how the camera controller positions the camera. The data underwent no normalization or standardization so as to not distort the spatial correlations between the datapoints.
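The 70:30 composition described above can be sketched as follows. This is an illustrative assumption of how such a mix might be assembled; the function name and the row-level sampling are hypothetical (the actual work may combine whole recordings rather than individual 100 Hz samples), and no normalization is applied, preserving the spatial scale of the positions.

```python
import numpy as np

def mix_datasets(auto_xyz, joy_xyz, ratio=0.7, seed=0):
    """Combine autonomous and joystick rows so roughly `ratio` of the
    result comes from the autonomous data.

    Each row is assumed to hold the 3D Cartesian PSM tooltip and ECM
    positions for one sample; rows are kept unscaled (no normalization).
    """
    rng = np.random.default_rng(seed)
    n_auto = len(auto_xyz)
    # choose a joystick count so autonomous rows make up `ratio` of the total
    n_joy = min(len(joy_xyz), round(n_auto * (1 - ratio) / ratio))
    joy_sample = joy_xyz[rng.choice(len(joy_xyz), n_joy, replace=False)]
    return np.concatenate([auto_xyz, joy_sample], axis=0)
```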
Figure 3. An RViz visualization of the da Vinci instrument cart showing the tool base/tip frames, the camera base/tip frames, and the world frame.
as a sequence prediction problem. The timesteps were chosen based on the 250 millisecond
time taken for the autonomous system to zoom, with some buffer included. The actual
architecture of the network can be found in Figure A1 and Table 1.
Table 1. GRU network architecture showing the layers, activations, recurrent activations (when
relevant), and number of parameters.
Firstly, the tensor input is fed to a dense layer and a GRU layer separately; both of these layers return tensors of the same dimension, so their results can be added. There are also skip connections and concatenation layers present downstream. These exist for better gradient flow (especially the residual connection from gru_1 to concatenate_1) and from the hypothesis that the network would fit the data better than a purely sequential architecture. The output is the coordinates for the ECM.
The RNN was trained with similar parameters to the dense-only network: mini-batch SGD with the Adam optimizer, MSE loss, and a learning rate of 1 × 10−5. The train, validation, and test split for the 1M+ data points was 60:30:10.
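A hedged sketch of such an architecture in the Keras functional API is shown below. The layer widths, the 25-step window (250 ms at 100 Hz plus buffer), and the exact wiring are assumptions for illustration, not the authors' configuration; only the overall pattern follows the description above: parallel dense and GRU branches of equal dimension added together, a downstream residual tap from gru_1 into concatenate_1, and Adam/MSE training at a 1 × 10−5 learning rate.

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

TIMESTEPS, FEATURES = 25, 6  # assumed: ~250 ms window of two PSM xyz positions

inputs = keras.Input(shape=(TIMESTEPS, FEATURES))

# Input goes to a dense layer and a GRU layer separately; both return
# (timesteps, 64) tensors, so their results can be added.
d = layers.Dense(64, activation="relu")(inputs)
g = layers.GRU(64, return_sequences=True, name="gru_1")(inputs)
x = layers.Add()([d, g])

# Second GRU collapses the sequence; gru_1's final step is concatenated
# back in as a residual connection for better gradient flow.
h = layers.GRU(64, name="gru_2")(x)
g_last = g[:, -1, :]  # residual tap from gru_1's last timestep
x = layers.Concatenate(name="concatenate_1")([h, g_last])
x = layers.Dense(64, activation="relu")(x)

# Output: predicted 3D ECM position.
outputs = layers.Dense(3, name="ecm_position")(x)

model = keras.Model(inputs, outputs)
model.compile(optimizer=keras.optimizers.Adam(learning_rate=1e-5), loss="mse")
```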
this information, a vector (g) is calculated to define the direction that the ECM should be pointing.

$$g = p - k, \quad \text{where } k = \begin{bmatrix} k_x \\ k_y \\ k_z \end{bmatrix} \text{ and } p = \begin{bmatrix} p_x \\ p_y \\ p_z \end{bmatrix} \tag{1}$$

The current direction that the ECM is pointing (c) is the difference between the ECM's current position (b), as calculated using forward kinematics, and the keyhole.

$$c = \begin{bmatrix} b_x \\ b_y \\ b_z \end{bmatrix} - \begin{bmatrix} k_x \\ k_y \\ k_z \end{bmatrix} \tag{2}$$

Next are a series of calculations to determine the angle (θ) between the current (c) and the desired (g) ECM directions, and to calculate the unit vector (u) about which that rotation takes place.

$$\theta = \cos^{-1}(c \cdot g) \tag{3}$$

$$u = \frac{c \times g}{|c \times g|} \tag{4}$$

The skew-symmetric matrix (U) of u is defined below as well.

$$U = \begin{bmatrix} 0 & -u_z & u_y \\ u_z & 0 & -u_x \\ -u_y & u_x & 0 \end{bmatrix} \tag{5}$$

Using the above calculations, the 3 × 3 rotation matrix (R_cg) between the current (c) and desired (g) vectors is determined as follows:

$$R_{cg} = I + (\sin\theta)U + (1 - \cos\theta)U^2, \quad \text{where } I = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix} \tag{6}$$
The desired ECM rotation matrix can now be computed using the current orientation (Rcurrent), as obtained from forward kinematics, and the previously calculated Rcg.
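The rotation described by Equations (1)–(6) can be sketched in NumPy as follows. This is a hedged sketch, not the authors' implementation: the function and variable names are assumptions, the vectors are explicitly normalized before the arccos (Equation (3) presumes unit vectors), and the axis computation assumes c and g are not parallel.

```python
import numpy as np

def desired_ecm_rotation(k, p, b):
    """Rotation taking the current ECM view direction onto the desired one.

    k: keyhole position, p: target the ECM should point at,
    b: current ECM position (all 3-vectors). Assumes c and g are
    not parallel, so the rotation axis in Eq. (4) is well defined.
    """
    g = p - k                                   # desired direction, Eq. (1)
    c = b - k                                   # current direction, Eq. (2)
    g_hat = g / np.linalg.norm(g)               # normalize so arccos is valid
    c_hat = c / np.linalg.norm(c)
    theta = np.arccos(np.clip(c_hat @ g_hat, -1.0, 1.0))   # angle, Eq. (3)
    axis = np.cross(c_hat, g_hat)
    u = axis / np.linalg.norm(axis)             # rotation axis, Eq. (4)
    U = np.array([[0.0, -u[2], u[1]],
                  [u[2], 0.0, -u[0]],
                  [-u[1], u[0], 0.0]])          # skew-symmetric matrix, Eq. (5)
    # Rodrigues' rotation formula, Eq. (6)
    return np.eye(3) + np.sin(theta) * U + (1.0 - np.cos(theta)) * (U @ U)
```

Applying the returned matrix to the normalized current direction yields the desired direction; it can then be composed with Rcurrent from forward kinematics.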
The camera control and neural network classes are separated to allow full modularity with respect to the network being loaded. Any network with any inputs and outputs can be loaded, as long as what is given to and returned from the camera control class remains the same, namely the coordinates of the PSMs and the ECM.
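This separation can be illustrated with a small sketch. The class and method names here are hypothetical, not the paper's actual code, and a trivial midpoint "network" stands in for the trained models; the point is the coordinate-in/coordinate-out contract that makes the networks interchangeable.

```python
from typing import Protocol, Sequence

class CameraPredictor(Protocol):
    """Contract: PSM and ECM coordinates in, predicted ECM position out."""
    def predict_ecm(self, psm1_xyz: Sequence[float],
                    psm2_xyz: Sequence[float],
                    ecm_xyz: Sequence[float]):
        ...

class MidpointPredictor:
    """Stand-in 'network' that simply aims the ECM at the tool midpoint."""
    def predict_ecm(self, psm1_xyz, psm2_xyz, ecm_xyz):
        return [(a + b) / 2 for a, b in zip(psm1_xyz, psm2_xyz)]

class CameraController:
    """Holds any predictor satisfying the contract; swapping networks
    requires no change to the camera control logic itself."""
    def __init__(self, predictor: CameraPredictor):
        self.predictor = predictor

    def step(self, psm1_xyz, psm2_xyz, ecm_xyz):
        return self.predictor.predict_ecm(psm1_xyz, psm2_xyz, ecm_xyz)
```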
4. Results
The performance of the neural networks was evaluated in four ways: first, using the
train–validation loss graphs; second, using direct comparison of the Euclidean distance
between the network-predicted value and the value from the test portion of the dataset;
third, using the visualization of the predicted value and the dataset’s ECM and PSM values;
and fourth, by having the neural network control the camera and evaluating based on
the views from the camera itself and by counting the number of frames in which the
tools were in/out of view. To ensure a fair comparison during the fourth evaluation, a sequence of PSM motions was pre-recorded and played back as necessary for each method
of ECM control. The experiments were performed in simulation to allow for an effective
and accurate interpretation of the data, methods, and results.
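The second evaluation above can be illustrated with a short sketch: per-frame Euclidean distance between predicted and dataset ECM positions, plus the fraction falling at or below a tolerance. The function name and default tolerance are assumptions for illustration.

```python
import numpy as np

def error_summary(pred_mm, truth_mm, tol_mm=10.0):
    """Return per-frame Euclidean errors (mm) and the fraction <= tol_mm.

    pred_mm and truth_mm are (n_frames, 3) arrays of ECM positions.
    """
    err = np.linalg.norm(np.asarray(pred_mm) - np.asarray(truth_mm), axis=1)
    return err, float((err <= tol_mm).mean())
```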
4.1. Dense Neural Network

The train–validation loss graph showed the exponential decay of the loss value until both losses plateaued just above 1 × 10−5 after 40 epochs, after which training was stopped, as seen in Figure 4. This indicated no overfitting and a good generalizability for the network.
Figure 5. Histogram of the dense network error frequency. The error observed here is the Euclidean distance in millimeters between the predicted and dataset-given values for a sample of test data. The bars highlighted in red indicate the number of predictions that were at or below the 10 mm error tolerance (21,851 frames), green is at or below 1 cm (171,953 frames), and total 177,686 frames.
Figure 6. A visualization of the network-predicted ECM position (green), the dataset-given ECM position (red), and the dataset-given PSM positions (blue). Note that the actual ECM/PSMs are only shown for context to allow an uninterrupted view of the predicted/dataset positions.
A full system evaluation with pre-recorded PSM motions was performed and ECM control was given to the dense network. The analysis results had 1333 of 1333 frames with at least one tool in view. The image stills from the full system test in Figure 7 show that the dense network is capable of learning to position the ECM such that the PSMs remain in the field of view, but that it will often zoom in more than is necessary. The narrow focus results in a loss of situational awareness and can be considered detrimental to the operation.

Figure 8. GRU network train–validation loss graph. The orange and blue lines are, respectively, the training and validation losses. The training loss plateaued at a value of 2 × 10−8, and the validation loss plateaued at 6 × 10−8 after 72 epochs.

Figure 9 shows the difference in the Euclidean distance between the GRU network-predicted ECM positions and the dataset-given ECM positions for the same test dataset as in Figure 5. The GRU achieved sub-millimeter accuracy for 100% of the test data.
Figure 9. Histogram of the GRU network error frequency. The error observed here is the Euclidean distance in millimeters between the predicted and dataset-given values for a sample of test data.
Figure 10. Visualization of the GRU predicted ECM position (green), the dataset-given ECM position (red), and the dataset-given PSM positions (blue) over time. Note that the actual ECM/PSMs are only shown for context to allow an uninterrupted view of the predicted/dataset positions.
Table 2. Measured average deep neural network prediction times with and without the ROS network. The ROS time refers to the time it takes for the camera control class to make the service request and send the data + the time it takes for the neural network to make its prediction + the time it takes for the service response to get back to the camera control class.

Network | Base Prediction Time (s) | ROS + Base Prediction Time (s)
Dense   | 0.002                    | 0.013
GRU     | 0.033                    | 0.045
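For context, a base prediction time like those in Table 2 might be measured by averaging repeated calls under a wall-clock timer. This is a hedged sketch of such a measurement, not the authors' procedure; the function name and repeat count are assumptions.

```python
import time

def mean_prediction_time(predict, sample, repeats=100):
    """Average seconds per call of predict(sample) over `repeats` runs.

    `predict` stands in for a loaded network's inference function.
    """
    start = time.perf_counter()
    for _ in range(repeats):
        predict(sample)
    return (time.perf_counter() - start) / repeats
```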
5. Discussion
The results demonstrate that both the dense and recurrent networks were able to learn to appropriately position the ECM based on the positions of the PSMs. The dense network is predictably fast, with an inference speed of 2 ms, and showed good accuracy in providing an appropriate camera position on two of the three axes (X and Y), but it was unable to learn the zoom level (associated with the Z-axis of the camera) due to the temporal nature of that algorithm. This resulted in the camera being either too zoomed in or out, thereby losing some valuable information. On the other hand, the GRU was able to learn both the position and the zoom level and achieved sub-millimeter accuracy when compared to the dataset-given values. This can be attributed to the nature of the GRU cell, which allows it to take the temporal aspect of the zoom into consideration. The GRU-based network is slower, with an average inference time of 33 ms owing to its much more complex architecture, but it may still be fast enough for surgical camera applications.
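The architectural difference behind these results can be sketched in Keras [13]. The layer widths and window length below are illustrative assumptions, not the exact configuration used in the paper; the point is that the dense model maps a single frame of PSM positions to an ECM position with no access to history, while the GRU consumes a window of past frames, which is what lets it capture time-dependent behavior such as the zoom level.

```python
import numpy as np
from keras import layers, models

n_features = 6   # illustrative: x, y, z for each of two PSMs
window = 10      # illustrative look-back length for the recurrent model

# Single-frame model: no history, so temporal behavior is hard to learn.
dense_net = models.Sequential([
    layers.Input(shape=(n_features,)),
    layers.Dense(64, activation="relu"),
    layers.Dense(64, activation="relu"),
    layers.Dense(3),  # predicted ECM position (x, y, z)
])

# Sequence model: the GRU cell carries state across the window,
# letting it exploit the temporal aspect of the tool motion.
gru_net = models.Sequential([
    layers.Input(shape=(window, n_features)),
    layers.GRU(64),
    layers.Dense(3),
])

print(dense_net.predict(np.zeros((1, n_features)), verbose=0).shape)        # (1, 3)
print(gru_net.predict(np.zeros((1, window, n_features)), verbose=0).shape)  # (1, 3)
```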
The deep-learning-based approach developed in this paper is able to imitate the best of both worlds by learning the appropriate camera position based on data from a human-controlled ECM and an autonomous system. Figure 12 shows the difference in the camera position and view when controlled by the neural networks, joystick, and autonomous camera control methods. All methods were given the same set of PSM motions, and videos showing the performance of each can be found in the supplementary materials. As seen in the figure, the joystick operator sometimes allows the tools to go out of the field of view, leading to unsafe conditions, but otherwise retains a good view of the scene and tools. As previously stated, the autocamera is bound by a rigid set of rules, as evidenced by it keeping the PSMs directly in the center of the camera view and having them be the sole focus. In comparison, the dense network keeps the tools in view but fares a little worse than the autocamera algorithm regarding the zoom level, due to the reasons stated above, and may occasionally lose sight of the PSMs while attempting to keep the rest of the scene in view. The GRU network maintains a good view of the tools and the surrounding scene, and, while the viewpoint given by the GRU-controlled network does not always have the tools centered in the view, it does nonetheless always retain a view of them. The snapshots provided in Figures 6, 7 and 10–12 represent broader trends seen in all the camera control methods. Ultimately, the networks, the GRU in particular, have demonstrated the capability to learn safe camera positioning from the autonomous system and perhaps (due to its memory units) better predictive movement timing from the human joystick operator. However, it is inherently difficult to formalize the behavior of humans, especially in this situation, where they may have been reacting to visual stimuli in the camera view. This being the case, the behavior considered to have transferred over to the networks is the enhanced predictive timing of the joystick operator in the study. It is, however, believed that the networks learned other human behaviors that cannot be explicitly formalized in a systematic way. This lack of explainability in neural networks is a topic of much contention and research, especially where it relates to the field of medicine [17–20]. Overall, the ability of neural networks to accurately learn camera positioning is perhaps not that surprising, given that neural networks have been shown to be excellent linear- and nonlinear-function approximators given the right network configuration [21,22].
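The sub-millimeter accuracy claim amounts to comparing predicted ECM positions against the dataset-given ones and checking what fraction of Euclidean errors fall under a threshold. A minimal sketch of that comparison follows; the arrays here are placeholder data, not the study's test set.

```python
import numpy as np

def fraction_within(pred, target, threshold_m):
    """Fraction of predictions whose Euclidean error is below threshold_m (meters)."""
    errors = np.linalg.norm(pred - target, axis=1)
    return float(np.mean(errors < threshold_m))

# Placeholder predicted and dataset-given ECM positions (N x 3, in meters).
rng = np.random.default_rng(0)
target = rng.uniform(-0.1, 0.1, size=(1000, 3))
pred = target + rng.normal(scale=1e-4, size=target.shape)  # ~0.1 mm noise

print(f"within 1 mm: {fraction_within(pred, target, 0.001):.3f}")
print(f"within 1 cm: {fraction_within(pred, target, 0.01):.3f}")
```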
Appendix A
Figure A1. GRU network architecture with activations.
References
1. Eslamian, S.; Reisner, L.A.; Pandya, A.K. Development and evaluation of an autonomous camera control algorithm on the da
Vinci Surgical System. Int. J. Med. Robot. Comput. Assist. Surg. 2020, 16, e2036. [CrossRef] [PubMed]
2. Pandya, A.; Reisner, L.A.; King, B.; Lucas, N.; Composto, A.; Klein, M.; Ellis, R.D. A review of camera viewpoint automation in
robotic and laparoscopic surgery. Robotics 2014, 3, 310–329. [CrossRef]
3. Da Col, T.; Mariani, A.; Deguet, A.; Menciassi, A.; Kazanzides, P.; De Momi, E. Scan: System for camera autonomous navigation
in robotic-assisted surgery. In Proceedings of the 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems
(IROS), Las Vegas, NV, USA, 24 October 2020–24 January 2021; pp. 2996–3002.
4. Garcia-Peraza-Herrera, L.C.; Gruijthuijsen, C.; Borghesan, G.; Reynaerts, D.; Deprest, J.; Ourselin, S.; Vercauteren, T.; Vander
Poorten, E. Robotic Endoscope Control via Autonomous Instrument Tracking. Front. Robot. AI 2022, 9, 832208.
5. Elazzazi, M.; Jawad, L.; Hilfi, M.; Pandya, A. A Natural Language Interface for an Autonomous Camera Control System on the da
Vinci Surgical Robot. Robotics 2022, 11, 40. [CrossRef]
6. Kossowsky, H.; Nisky, I. Predicting the timing of camera movements from the kinematics of instruments in robotic-assisted
surgery using artificial neural networks. IEEE Trans. Med. Robot. Bionics 2022, 4, 391–402. [CrossRef]
7. Sarikaya, D.; Corso, J.J.; Guru, K.A. Detection and localization of robotic tools in robot-assisted surgery videos using deep neural
networks for region proposal and detection. IEEE Trans. Med. Imaging 2017, 36, 1542–1549. [CrossRef]
8. Lu, J.; Jayakumari, A.; Richter, F.; Li, Y.; Yip, M.C. Super deep: A surgical perception framework for robotic tissue manipulation
using deep learning for feature extraction. In Proceedings of the 2021 IEEE International Conference on Robotics and Automation
(ICRA), Xi’an, China, 30 May–5 June 2021; pp. 4783–4789.
9. Azqueta-Gavaldon, I.; Fröhlich, F.; Strobl, K.; Triebel, R. Segmentation of surgical instruments for minimally-invasive robot-
assisted procedures using generative deep neural networks. arXiv 2020, arXiv:2006.03486.
10. Wagner, M.; Bihlmaier, A.; Kenngott, H.G.; Mietkowski, P.; Scheikl, P.M.; Bodenstedt, S.; Schiepe-Tiska, A.; Vetter, J.; Nickel,
F.; Speidel, S. A learning robot for cognitive camera control in minimally invasive surgery. Surg. Endosc. 2021, 35, 5365–5374.
[CrossRef] [PubMed]
11. Li, B.; Lu, Y.; Chen, W.; Lu, B.; Zhong, F.; Dou, Q.; Liu, Y.-H. GMM-based Heuristic Decision Framework for Safe Automated
Laparoscope Control. IEEE Robot. Autom. Lett. 2024, 9, 1969–1976. [CrossRef]
12. Liu, Z.; Zhou, Y.; Zheng, L.; Zhang, G. SINet: A hybrid deep CNN model for real-time detection and segmentation of surgical
instruments. Biomed. Signal Process. Control. 2024, 88, 105670. [CrossRef]
13. Keras. Available online: https://keras.io (accessed on 23 September 2023).
14. Kazanzides, P.; Chen, Z.; Deguet, A.; Fischer, G.S.; Taylor, R.H.; DiMaio, S.P. An open-source research kit for the da Vinci® Surgical
System. In Proceedings of the 2014 IEEE International Conference on Robotics and Automation (ICRA), Hong Kong, China, 31
May–7 June 2014; pp. 6434–6439.
15. Cho, K.; Van Merriënboer, B.; Gulcehre, C.; Bahdanau, D.; Bougares, F.; Schwenk, H.; Bengio, Y. Learning phrase representations
using RNN encoder-decoder for statistical machine translation. arXiv 2014, arXiv:1406.1078.
16. Chung, J.; Gulcehre, C.; Cho, K.; Bengio, Y. Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv
2014, arXiv:1412.3555.
17. Xu, F.; Uszkoreit, H.; Du, Y.; Fan, W.; Zhao, D.; Zhu, J. Explainable AI: A brief survey on history, research areas, approaches and
challenges. In Proceedings of the Natural Language Processing and Chinese Computing: 8th CCF International Conference,
NLPCC 2019, Dunhuang, China, 9–14 October 2019; pp. 563–574.
18. Samek, W.; Wiegand, T.; Müller, K.-R. Explainable artificial intelligence: Understanding, visualizing and interpreting deep
learning models. arXiv 2017, arXiv:1708.08296.
19. Yu, L.; Xiang, W.; Fang, J.; Chen, Y.-P.P.; Zhu, R. A novel explainable neural network for Alzheimer’s disease diagnosis. Pattern
Recognit. 2022, 131, 108876. [CrossRef]
20. Hughes, J.W.; Olgin, J.E.; Avram, R.; Abreau, S.A.; Sittler, T.; Radia, K.; Hsia, H.; Walters, T.; Lee, B.; Gonzalez, J.E. Performance of
a convolutional neural network and explainability technique for 12-lead electrocardiogram interpretation. JAMA Cardiol. 2021, 6,
1285–1295. [CrossRef]
21. Hassoun, M.H. Fundamentals of Artificial Neural Networks; MIT Press: Cambridge, MA, USA, 1995.
22. Hornik, K. Approximation capabilities of multilayer feedforward networks. Neural Netw. 1991, 4, 251–257. [CrossRef]