
Mobile Networks and Applications
https://doi.org/10.1007/s11036-019-01244-4

An Augmented Reality-Based Method for Remote Collaborative Real-Time Assistance: from a System Perspective

Dikai Fang 1 · Huahu Xu 1,2 · Xiaoxian Yang 3,4 · Minjie Bian 1,2

1 School of Computer Engineering and Science, Shanghai University, Shanghai 200444, China
2 Shanghai Shang Da Hai Run Information System Co., Ltd, Shanghai 200444, China
3 School of Computer and Information Engineering, Shanghai Polytechnic University, Shanghai 201209, China
4 Shanghai Key Laboratory of Intelligent Manufacturing and Robotics, Shanghai 200444, China

* Corresponding author: Xiaoxian Yang (xxyang@sspu.edu.cn); Dikai Fang (Fang_dikai@163.com)

© Springer Science+Business Media, LLC, part of Springer Nature 2019

Abstract
To provide remote assistance to people more efficiently, an augmented reality (AR)-based method for remote real-time assistance
for collaboration is proposed. This paper aims to reduce communication barriers and enhance the three-dimensional (3D) feel of
immersive interactions. First, a multiplayer real-time video communication framework with WebRTC is built, which enables
remote experts to observe a first-hand view of an operator’s site. Second, a shared cross-platform virtual whiteboard based on
Canvas, WebSocket and Node.js is developed that enables remote experts to provide visual assistance, such as drawings or text,
and adjust the position of the whiteboard for seamless integration with video. Last, the virtual assistance information provided by
the remote experts is displayed on the screen of AR holographic glasses to enhance the assistance capability of the platform and
enable an expert to explain to an operator how to correctly perform tasks. A hybrid tracking and registration technique based on
natural features and gyroscopes is adopted to estimate the operator’s posture in real time to enable the virtual assistance
information to be perfectly integrated with the real world at all times. An experimental analysis shows that this system is both
practicable and stable and has broad application prospects in many fields.

Keywords Remote assistance · Augmented reality · WebRTC framework · Collaborative operations · Real-time tracking and registration

1 Introduction

In complex and collaborative tasks, local users may need to complete operations in the local environment under the direction of remote experts. A traditional remote assistance system, aided by video, audio, text and other media, provides sufficient information to enable cooperation between the parties on both sides during communication. However, the popularization of the concepts of "Smart City" and "Smart Factory" has caused businesses and government agencies to establish stricter requirements for remote assistance that target a more realistic and three-dimensional (3D) environment. Thus, the creation of a shared "immersion" in a scene, through which all parties involved can communicate and provide hands-on assistance, is a popular research topic and a difficult aspect of remote assistance.

Advances in computer vision and human-computer interaction technology such as augmented reality (AR) [1] provide an opportunity to make remote collaborative assistance significantly more immersive. MG Hanna et al. explored diverse applications of AR for autopsies and pathological examinations using Microsoft HoloLens holographic glasses and confirmed the value of AR holographic glasses in medical diagnosis applications [2]. MB Shenai et al. designed a virtual interactive AR platform by which remote experts can provide remote surgical assistance to local doctors; however, this system exhibited problems related to its complex equipment and poor universality [3]. J Choi and others proposed an AR system named ARClassNote that enabled students and teachers to save and share handwritten notes

with optical perspective display devices, which provides a new approach to AR distance education [4]. These remote assistance methods can be applied to specific areas only, such as telemedicine and distance education, and most of them do not allow multiple people to simultaneously provide remote assistance to the helper. If we can provide a versatile remote assistance method and enhance the immersive experience, satisfaction in such services will be improved.

In this paper, real-time remote assistance is defined as a situation in which two or more people who are not in the same physical space use network and communication devices to transmit data in real time to collaboratively accomplish specific tasks, and a new AR-based method for real-time remote collaborative assistance is proposed. This paper defines the party that is seeking remote assistance as an "operator" and the party that is providing remote assistance support as a "remote expert". The specific process of remote assistance is as follows:

• Operators who wear AR holographic glasses issue remote assistance requests to multiple remote experts from an operation site simultaneously.
• After receiving the request, the remote experts can observe a first-hand view from the operator's perspective on their personal computer (PC) or mobile device and can directly use a mouse or a touchscreen to circle, write text, or draw images anywhere on their video screen.
• The virtual assistance information added by the remote experts is synchronously displayed on the virtual screen of the operator's AR holographic glasses and is seamlessly integrated with the real-world view. Thus, the operator can directly perform the required operation with the aid of the information provided by the remote experts.
• To ensure smooth communication between the two parties, video/voice call functionality is also provided in this method.

An analysis of the experimental results shows that this method successfully reduces the communication barriers between the two parties and improves the efficiency of the remote assistance in most instances. However, for particularly complex collaborative tasks, this method still has some limitations. In general, this system will have an extensive range of applications in the fields of telemedicine, equipment maintenance, disaster relief, and logistics management.

The remainder of the paper is structured as follows: Section 2 discusses the assistance model and the architecture of the real-time remote collaborative assistance method proposed in this paper. Section 3 describes the key implementations in remote collaborative work and analyses the key points and main problems. Section 4 designs and implements the real-time remote system and proves that the method is feasible and effective by analysing the experimental results. Section 5 elaborates on the related work of this research. Section 6 states our conclusions and discusses future work.

2 Assistance model and architecture introduction

2.1 Assistance model

The remote real-time assistance model, which is based on AR technology, includes two roles: operators and remote experts. A one-to-many correspondence exists between an operator and the remote experts, i.e., one operator can request multiple remote experts who can simultaneously provide remote assistance services.

The model consists of the following main modules (refer to Fig. 1). As the basis of the model, the network transmission module realizes the exchange of information between multiple users and the real-time data transmission between the various terminal devices. The transmitted data primarily consist of video data captured by the AR holographic glasses, virtual assistance data provided by the remote experts, and voice call data from both parties. The assistance management and communication module is the core of the model and is mainly used to establish the collaboration channel and manage the behaviours of all parties. The video/audio call module is mainly used to enable remote experts to observe a first-hand view of the operator's site and to provide smooth and fluent speech communication. The assistance information generation module of the remote experts implements a transparent shared whiteboard, integrated with the video, that experts can use to provide virtual assistance. Moreover, the assistance information display and enhancement module of the operator primarily uses a target-tracking registration algorithm to construct a dynamic AR model based on the virtual assistance data provided by the remote experts.

2.2 Architecture

The architecture for implementing remote real-time assistance consists of operators, remote experts, web servers, the network (Internet), AR holographic glasses and other hardware and software devices (refer to Fig. 2).

Using web-based real-time communication (WebRTC) technology [5], peer-to-peer audio and video transmission channels are established between the operator and each remote expert, which enables every expert to observe the operator's view of a scene in real time and supports communication between the operator and the experts. In addition, a virtual shared whiteboard based on Canvas in HTML5 is created and integrated with the video image captured by the AR holographic glasses in the remote expert client, which enables remote experts to freely add text, circle objects, or directly provide other guidance information on the video image.
Fig. 1 Modules of the real-time remote collaborative assistance method

After this information is added, the full-duplex communication channel based on the WebSocket protocol [6] synchronously pushes the virtual guidance information to the screen of the AR holographic glasses and to any of the other experts' clients through an intermediate web server. Then, the virtual guidance information is superimposed on the virtual screen of the AR holographic glasses, which effectively "enhances" the scene that the operator views and provides instructions for performing the related operations.

Fig. 2 Architecture of the remote collaborative real-time assistance method

To account for the operator's head movements, the system uses a hybrid tracking/registration technology based on a combination of natural features and inertial sensors to estimate the operator's position and track line-of-sight changes in real time, which enables the system to dynamically adjust the position of the two-dimensional assistance information on the virtual screen of the AR glasses. This approach ensures the seamless integration of the virtual assistance information with the real scene.

2.3 Halo Mini AR glasses

On July 7, 2016, Shadow Creator Information Technology released a new-generation binocular AR display device named the Halo Mini (HM) [7], as shown in Fig. 3. The HM is equipped with an MTK 64-bit quad-core processor that runs at 1.3 GHz, 4 GB of main memory, 128 GB of storage and a 4000-mAh battery, and it runs Halo UI, a deeply customized Android 5.1 system. Because the HM is equipped with binocular light-conduction transparent holographic lenses with a horizontal field-of-view (FOV) angle of 40°, the wearer perceives an 80-in. high-definition virtual screen with a resolution of 1024 × 768 pixels at a forward distance of 2–3 m. The front and rear of the glasses are equipped with a 13-megapixel and an 8-megapixel high-definition camera, respectively, and the glasses also carry a 1000-Hz high-precision gyroscope, distance sensors, a global positioning system (GPS) receiver and other sensors.

Fig. 3 HM AR glasses

3 Key implementations in remote collaborative work

This section describes the key technologies involved in this remote real-time assistance method and analyses the key points and main problems.

3.1 WebRTC

WebRTC is a new technology that supports real-time voice and video conversations via web browsers. The specific standards were developed by the World Wide Web Consortium (W3C) [8] and the Internet Engineering Task Force (IETF) [9]. Due to their HTML5-based features, WebRTC-based communications have become the primary development trend of future real-time mobile Internet communications. Currently, WebRTC enables web developers to achieve real-time core audio and video communication without requiring the installation of any extensions or plug-ins; it includes audio and video capture, popular codecs, network transmission capabilities and client display. Modern mainstream browsers such as Chrome, Firefox and Opera provide satisfactory support for WebRTC, and WebView, which is built into Android 5.0 and higher versions, supports WebRTC as well. Many scholars have researched and implemented real-time communication schemes based on WebRTC technology. For example, J. Jang-Jaccard et al. proposed a WebRTC-based video conferencing service for telehealth [10]. Linh Van Ma et al. proposed an efficient Session_Weight load balancing and scheduling methodology to improve network performance for a telehealthcare service based on WebRTC [11]. Iván Santos-González et al. presented a complete comparative study of two of the most extensively employed video streaming protocols, RTSP and WebRTC, and verified the feasibility of implementing a high-quality video streaming application on an Android device with WebRTC [12].

The WebRTC architecture consists of three modules: a web application layer, a browser layer and a Web API layer (refer to Fig. 4). The session management/abstract signalling layer implements signalling abstraction, session establishment and management functions. The voice engine includes a series of audio processing technologies, such as the iSAC/iLBC codecs, NetEQ for voice, and echo cancellation/noise reduction technology. The video engine contains a series of video processing frameworks: the VP8 codec, a video jitter buffer, and an image enhancement module. For network transmission, WebRTC adopts the RTP/SRTP protocols for media stream transmission and the interactive connectivity establishment (ICE) framework to traverse private networks with the support of a Session Traversal Utilities for network address translation (NAT) (STUN) or Traversal Using Relays around NAT (TURN) server.

WebRTC aims to embed the multimedia modules, network transmission, session management and signalling abstraction required by real-time communication applications in the web browser, to abstract away differences between underlying hardware implementations and operating systems, and to reduce the cost and workload of application development by providing an API to web developers. Therefore, for a remote assistance system that spans AR holographic glasses, desktop computers, mobile devices and other hardware, it is reasonable to base the real-time audio and video communication module on the WebRTC framework.

The WebRTC Web API layer provides a variety of audio and video interfaces, such as the MediaStream, PeerConnection and DataChannel interfaces [13].
Fig. 4 WebRTC architecture

When using PeerConnection to deliver data, the WebRTC signalling system is needed to coordinate communication by sending control information; these control data are transmitted through the signalling channel provided by the web server, using WebSocket or HTTP. As most users have intranet addresses, a problem arises in which the other party cannot be directly contacted through an external port address during the establishment of the peer-to-peer channels and of the connection between the signalling server and the client. Therefore, NAT traversal technology is necessary to connect browsers in different intranets [14], and WebRTC uses the TURN/STUN/ICE framework as the core of its NAT traversal technology. As shown in Fig. 5, the process of establishing audio and video communication between the two parties is as follows. First, both parties send call and response signals to the signalling server through the PeerConnection interface. Second, the STUN/TURN/ICE server penetrates the firewalls and NATs of both communicating parties, and the MediaStream interface starts collecting data via the audio and video engines. Last, the collected data are transmitted through the DataChannel using the RTP/SRTP protocol.

Fig. 5 Association diagram of the WebRTC modules and the API
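To make this call-setup sequence concrete, the following browser-side sketch (TypeScript) walks through the same steps with the standard WebRTC API. The signalling message format (join/offer/answer/candidate) and the server URL are illustrative assumptions; the paper does not specify its actual wire protocol.

```typescript
// Minimal sketch of the call setup described above. The signalling message
// shape and server URL are assumed for illustration only.
const signalling = new WebSocket('wss://example.com/signal'); // hypothetical URL

const pc = new RTCPeerConnection({
  iceServers: [{ urls: 'stun:stun.l.google.com:19302' }], // public STUN for NAT traversal
});

// Forward locally gathered ICE candidates to the peer via the signalling server.
pc.onicecandidate = (e) => {
  if (e.candidate) signalling.send(JSON.stringify({ type: 'candidate', candidate: e.candidate }));
};

async function startCall(): Promise<void> {
  // Capture the local camera and microphone (MediaStream interface).
  const stream = await navigator.mediaDevices.getUserMedia({ video: true, audio: true });
  stream.getTracks().forEach((t) => pc.addTrack(t, stream));

  // Create and send the SDP offer through the signalling channel.
  const offer = await pc.createOffer();
  await pc.setLocalDescription(offer);
  signalling.send(JSON.stringify({ type: 'offer', sdp: offer }));
}

// Apply the remote answer and ICE candidates relayed by the signalling server.
signalling.onmessage = async (msg) => {
  const data = JSON.parse(msg.data);
  if (data.type === 'answer') await pc.setRemoteDescription(data.sdp);
  else if (data.type === 'candidate') await pc.addIceCandidate(data.candidate);
};
```

In the implemented system, the same exchange runs between the Chrome client of each expert and the operator's Android client, with the Node.js signalling server relaying the messages (see Section 4).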

3.2 Video and whiteboard integration technology

To enable remote experts to provide visual assistance to operators, this remote assistance solution employs a virtual shared whiteboard to generate and share assistance information. On the remote expert side, the virtual transparent whiteboard is superimposed on the video captured by the front-facing camera of the AR holographic glasses. On the operator side, the same whiteboard is displayed on the virtual screen projected by the AR holographic glasses. The whiteboard contents are synchronously updated and seamlessly integrated with the real scene. Many scholars have explored and proposed solutions that involve remote real-time interaction schemes. Swati Ringe et al. developed a web application that enables users to interact and share information via drawings, images and chats to increase collaboration without restricting people to a particular location, operating system, hardware platform or device [15]. Mário Antunes et al. proposed a
telemedicine solution that enables people to send files and freely draw on a shared whiteboard [16]. Victoria Pimentel et al. compared the latency of the WebSocket protocol with that of HTTP polling and long polling and demonstrated that WebSocket is a reasonable choice for implementing near-real-time applications [17].

The whiteboard proposed in this paper is implemented using Canvas in HTML5, WebSocket and Node.js [18], and the detailed integration of the video scene and the whiteboard is shown in Fig. 6. As a transition exists from the 3D spatial information in the AR holographic glasses to the 2D image information in web pages, converting the two to the same coordinate dimension is a core aspect of this research. This paper therefore focuses on the seamless integration of the assistance information created by remote experts with the real-world scene where the operator is located.

Fig. 6 Schematic of video and whiteboard integration
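As an illustration of how such a transparent drawing layer can be realized with Canvas, the sketch below stacks a canvas over the expert's video element and records freehand strokes; the element IDs, brush colour and size are assumptions for illustration only.

```typescript
// Sketch of a transparent whiteboard layered over the expert's video view.
// Element IDs and brush settings are illustrative assumptions.
const video = document.getElementById('remote-video') as HTMLVideoElement;
const board = document.getElementById('whiteboard') as HTMLCanvasElement;
const ctx = board.getContext('2d')!;

// Match the canvas to the video area so drawn pixels align with the scene
// (the exact size follows the calibration derived later in this section).
board.width = video.clientWidth;
board.height = video.clientHeight;
board.style.position = 'absolute'; // stacked directly on top of the video
ctx.strokeStyle = 'red';
ctx.lineWidth = 3;

let drawing = false;
board.addEventListener('pointerdown', (e) => {
  drawing = true;
  ctx.beginPath();
  ctx.moveTo(e.offsetX, e.offsetY);
});
board.addEventListener('pointermove', (e) => {
  if (!drawing) return;
  ctx.lineTo(e.offsetX, e.offsetY); // freehand stroke with mouse or finger
  ctx.stroke();
});
board.addEventListener('pointerup', () => {
  drawing = false;
});
```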
The front high-definition camera is located between the two optically transparent holographic lenses [19] (refer to Fig. 7), and the distance between the two lenses is fixed. Halo Mini holographic glasses use holographic optical waveguide technology for imaging [20] and can project an 80-in. virtual screen at a distance of 2–3 m. The virtual screen's exact width and height are determined by the projection system's internal parameters. When acquiring video data, the focal length of the camera is fixed; therefore, the position and size of the virtual screen projection on the virtual imaging plane are also determined, and the optical centre of the camera, the centre of the virtual screen and the centre of the virtual imaging region lie on a straight line in 3D space.

Fig. 7 Position diagram of the front camera and holographic lens

Fig. 8 Coordinate systems of the camera, virtual screen and virtual imaging plane

As shown in Fig. 8, Oc-XcYcZc represents the spatial coordinate system of the camera on the AR glasses: point Oc coincides with the camera's optical centre, the Yc axis is perpendicular to the top of the camera, and the Zc axis coincides with the camera's optical axis. The camera imaging plane perpendicularly intersects Zc at point Oj; thus, the coordinate system Oj-XjYj can be established in the camera imaging plane. Similarly, the virtual screen plane is parallel to the camera imaging plane and perpendicularly intersects Zc at point Oi; therefore, the coordinate system Oi-XiYi can also be established in the virtual screen plane. As previously mentioned, the distance from the virtual screen plane to the origin Oc is fixed and can be represented by the constant k. The width and height of the virtual screen are also predetermined and can be represented by the constants w1 and h1, respectively. The AR glasses use a fixed-focus camera; consequently, we can assume that the focal length of the camera is the constant f, the image width is the constant w2, and the image height is the constant h2. Thus, the ratio T1 between the image width and the virtual screen width can be calculated:

$$T_1 = w_2 / w_1. \quad (1)$$
Similarly, the ratio T2 between the image height and the virtual screen height can also be calculated:

$$T_2 = h_2 / h_1. \quad (2)$$

Assume that a point P exists in space and that the line connecting it with the camera's optical centre intersects the virtual screen plane and the virtual imaging plane at points Pi(xi, yi) and Pj(xj, yj), respectively. According to the projection transformation theorem, the following formula can be obtained:

$$\frac{y_j}{y_i} = \frac{x_j}{x_i} = \frac{f}{k}. \quad (3)$$

Assume that the width and height of the whiteboard on the web page are w3 and h3, respectively, and that the width and height of the video image are w4 and h4, respectively. To ensure that the video area covered by the transparent whiteboard is exactly the same as the video area in the actual scene perceived by the operator through the virtual screen of the AR glasses, the following expressions should hold:

$$w_4 = w_3 \times T_1 \times \frac{f}{k}, \quad (4)$$

and

$$h_4 = h_3 \times T_2 \times \frac{f}{k}. \quad (5)$$

In addition, the centre of the video image and the centre of the whiteboard must completely coincide. Because the symbols in (4) and (5) are constants, the width and height of the video image viewed by the remote experts can be calculated by this calibration. Based on these solutions, the virtual information added by the remote experts at any position of the video can be displayed on the virtual screen of the AR glasses and accurately overlap with the real scene.
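This calibration can be read as a direct computation, sketched below as a transcription of Eqs. (1)–(5); every constant is a placeholder chosen for illustration, not a measured Halo Mini parameter.

```typescript
// Direct transcription of Eqs. (1)-(5); all constants are illustrative
// placeholders, not actual Halo Mini calibration values.
const k = 2.5; // distance from the optical centre Oc to the virtual screen
const f = 0.9; // fixed focal length of the front camera
const w1 = 1.7, h1 = 1.3;   // virtual screen width and height
const w2 = 1024, h2 = 768;  // captured image width and height
const w3 = 1024, h3 = 768;  // whiteboard width and height on the web page

const T1 = w2 / w1; // Eq. (1): image width / virtual screen width
const T2 = h2 / h1; // Eq. (2): image height / virtual screen height

// Eqs. (4) and (5): size of the video image whose whiteboard-covered area
// matches the scene the operator perceives through the virtual screen.
const w4 = w3 * T1 * (f / k);
const h4 = h3 * T2 * (f / k);

// The video and whiteboard centres must also coincide (see above).
console.log(`expert-side video image: ${w4.toFixed(0)} x ${h4.toFixed(0)}`);
```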
scope and accelerometer in the inertial navigation coordi-
nate system based on the SLAM output when the AR
3.3 Tracking registration technology glasses are static.
3) A least squares calculation is used to estimate the value
The main problem in this research is that the virtual assistance between the scale unit of the visual coordinate system and
information must be moved when the operator’s head rotates the metre scale; this value is considered to be a state
to maintain an appropriate display position on the virtual quantity of the filter.
screen of the AR glasses and seamlessly integrate with the 4) The system state equation and the error state equations of
real scene. The process of AR in this paper differs from a the SINS are established by a nonlinear Kalman filter, and
traditional AR system, as the scene for which the operator is the initial positioning result is employed as a system ob-
seeking help and the objects that must be remotely servation to establish the system observed equation. In
Benhanced^ are uncertain and unpredictable. In addition, a this manner, monocular visual SLAM can be applied to
substantial number of real-time requirements exist for regis- correct the integral error that accumulates in inertial nav-
tration; therefore, a hybrid tracking/registration technology igation over time. The approximate metric scale value can
based on natural features and inertial sensors is employed to be corrected by the acceleration obtained from SINS at
implement the tracking registration module. the filtering step. Additionally, this approach ensures that
The simultaneous localization and mapping (SLAM) tech- the SINS can continuously track the pose of the AR
nique was originally designed to solve the problem of glasses when monocular visual SLAM is unstable.
Fig. 9 Loosely coupled SLAM/SINS combined positioning scheme based on the EKF

To realize 3D mapping in an unfamiliar environment and self-positioning in 3D space, this paper adopts monocular visual SLAM based on the features of the natural environment. To improve the system's real-time performance, the SLAM method is divided into two parallel processes, tracking and mapping, in which one thread individually tracks the camera pose and another thread creates, expands and saves the map based on key frames [28]. Bundle adjustment is used to optimize the calculation results and improve tracking efficiency and mapping accuracy. To obtain the depth information of the scene with the monocular camera, this paper employs an initialization method based on natural textures to calculate the 3D affine transformation matrix of the camera, which guarantees the output of the final metric position. Due to space limitations, the details of the map definition, the tracking and mapping processes, image feature extraction and the use of the image pyramid algorithm to improve accuracy are not described in this paper.

The Halo Mini AR glasses are equipped with a high-speed gyroscope that runs at 1000 Hz and a high-accuracy accelerometer. The inertial sensor coordinate system is shown in Fig. 10. The gyroscope measures the angular velocity around the three axes of inertial navigation, and the accelerometer measures the acceleration along those axes. The pose, speed and position of the AR glasses are estimated by integrating the continuously measured angular velocity and acceleration; due to the integral operation, errors accumulate over time [29]. The SINS/SLAM combined positioning model adopted in this paper effectively overcomes this problem using an EKF framework that consists of two steps, prediction and update. The filter's state vector is

$$X = \left( p_w^{i\,T},\; v_w^{i\,T},\; q_w^{i\,T},\; b_w^T,\; b_a^T,\; \lambda,\; p_i^s,\; q_i^s \right)^T, \quad (6)$$

where $p_w$, $v_w$ and $q_w$ are the position, speed and pose of the AR glasses, respectively; $b_w$ and $b_a$ denote the deviations of the accelerometer and the gyroscope in the inertial navigation system, respectively; $\lambda$ represents an estimate of the ratio between the measurement and the true scale; and $p_i^s$ and $q_i^s$ are calibration parameters that represent the displacement and rotation between the inertial coordinate system and the visual coordinate system, respectively.

Fig. 10 Built-in inertial sensor coordinate system of the AR glasses

The prediction step predicts the state of the system based on the value of the inertial sensor, which serves as an increment for the EKF motion model. The update step applies the result of the visual positioning as an input to the EKF to update the system's state. The SLAM/SINS combined positioning method can track the position and pose of the AR glasses at 30 Hz, which ensures that the positional calculations of the virtual assistance information displayed on the AR glasses occur in real time.
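The prediction/update loop can be sketched structurally as follows. This is a deliberately reduced illustration: the state is cut down to position and velocity with a scalar uncertainty, whereas the real filter propagates the full state of Eq. (6) with proper covariance matrices; all noise parameters here are assumed.

```typescript
// Structural sketch of the loosely coupled SLAM/SINS loop described above.
// The real filter carries the full state of Eq. (6); here the state is
// reduced to position/velocity with a scalar "covariance", purely to show
// the predict/update shape. All noise values are assumed.

interface State { p: number[]; v: number[]; P: number } // position, velocity, uncertainty

// Prediction: integrate the accelerometer increment (SINS side).
function predict(s: State, accel: number[], dt: number, q = 0.01): State {
  const v = s.v.map((vi, i) => vi + accel[i] * dt);
  const p = s.p.map((pi, i) => pi + s.v[i] * dt + 0.5 * accel[i] * dt * dt);
  return { p, v, P: s.P + q }; // process noise grows the uncertainty
}

// Update: correct the drifting inertial estimate with a SLAM position fix.
function update(s: State, slamPos: number[], r = 0.05): State {
  const K = s.P / (s.P + r); // scalar Kalman gain
  const p = s.p.map((pi, i) => pi + K * (slamPos[i] - pi));
  return { p, v: s.v, P: (1 - K) * s.P };
}

// 1000 Hz inertial prediction with 30 Hz visual updates, as in the text:
let state: State = { p: [0, 0, 0], v: [0, 0, 0], P: 1 };
// state = predict(state, imuSample, 1 / 1000);
// state = update(state, slamFix);
```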
4 System design and experiments

Based on the new remote real-time assistance method proposed in the previous sections, this paper implements a remote assistance system that allows remote experts to provide remote assistance to an operator. As shown in Fig. 11, the system requires the implementation of the following three modules: multi-person voice and video communication based on WebRTC, a shared transparent whiteboard based on HTML5, and operator posture tracking based on visual and inertial sensors.

Fig. 11 Modules constituting the system

The multi-person real-time audio and video calling module based on WebRTC is a cross-platform module. The client used by the remote experts is the Chrome browser, and the client used by the operator on the AR glasses is a mobile Android application. In addition, the system's web server is Apache, and its
signalling server is Node.js. Google's public STUN server is used to complete NAT traversal. Both the web server and the signalling server are deployed on a physical machine with the IP address 202.121.199.225, and ports 80 and 3000 are used to transmit data (refer to Fig. 12). To establish a one-to-many service model between an operator and the remote experts, the concept of a "room" is introduced: only an operator and remote experts in the same room can make voice and video calls and use the remote assistance services.

Fig. 12 Multi-person real-time audio and video calling framework
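A minimal sketch of the room concept on the Node.js signalling server is shown below, using the ws WebSocket library; the message and field names are assumptions for illustration, not the system's actual protocol.

```typescript
// Minimal room-management sketch for the Node.js signalling server,
// using the 'ws' library; message/field names are illustrative assumptions.
import { WebSocketServer, WebSocket } from 'ws';

const wss = new WebSocketServer({ port: 3000 }); // port 3000 as in the deployment above
const rooms = new Map<string, Set<WebSocket>>(); // roomId -> members (operator + experts)

wss.on('connection', (socket) => {
  let roomId = '';
  socket.on('message', (raw) => {
    const msg = JSON.parse(raw.toString());
    if (msg.type === 'join') {
      roomId = msg.room;
      if (!rooms.has(roomId)) rooms.set(roomId, new Set());
      rooms.get(roomId)!.add(socket); // only same-room parties may call each other
      return;
    }
    // Relay signalling (offer/answer/candidate) to the other members of the room.
    for (const peer of rooms.get(roomId) ?? []) {
      if (peer !== socket && peer.readyState === WebSocket.OPEN) {
        peer.send(JSON.stringify(msg));
      }
    }
  });
  socket.on('close', () => rooms.get(roomId)?.delete(socket));
});
```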
In this system, the shared whiteboard is implemented based on Canvas, and remote experts can use a mouse or a finger to draw on it. Experts can select the brush type, size and colour; they can also add text, send files and erase content on the whiteboard. After virtual assistance information has been added, the content is converted to a base64 data stream and sent to the Node.js server. After receiving the data stream, the WebSocket server broadcasts the data to the other clients in the same room via WebSocket. The AR glasses worn by the operator convert the received base64 data stream into a picture and display it on the virtual whiteboard in real time. Simultaneously, the target-tracking registration module in the AR glasses estimates the operator's posture in real time and dynamically adjusts the display position of the assistance information on the whiteboard.
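On the expert client, this push path can be sketched as follows: the whiteboard canvas is serialized to a base64 data URL and sent into the room channel, and the operator side decodes it back into a picture. The message shape is an assumed example; the server relay is the broadcast loop in the previous listing.

```typescript
// Expert-client sketch of the whiteboard push path described above:
// serialize the canvas to base64 and send it into the room channel.
// The message shape is an illustrative assumption.
const whiteboard = document.getElementById('whiteboard') as HTMLCanvasElement;
const channel = new WebSocket('ws://202.121.199.225:3000'); // signalling/push server

function pushWhiteboard(roomId: string): void {
  const dataUrl = whiteboard.toDataURL('image/png'); // base64-encoded PNG
  channel.send(JSON.stringify({ type: 'whiteboard', room: roomId, image: dataUrl }));
}

// Operator side (AR glasses client): decode the received base64 stream back
// into a picture and draw it onto the local virtual whiteboard.
channel.onmessage = (msg) => {
  const data = JSON.parse(msg.data);
  if (data.type !== 'whiteboard') return;
  const img = new Image();
  img.onload = () => {
    const ctx = whiteboard.getContext('2d')!;
    ctx.clearRect(0, 0, whiteboard.width, whiteboard.height);
    ctx.drawImage(img, 0, 0, whiteboard.width, whiteboard.height);
  };
  img.src = data.image;
};
```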
Fig. 13 Pictures taken during the experiments

Figure 13a and b show two examples of remote assistance for equipment maintenance using the system implemented in this paper. The operator at the bottom of Fig. 13a is wearing Halo Mini holographic glasses while working in the machine room and sends remote assistance requests to the remote experts. After accepting the request, the remote experts at the top and in the middle of Fig. 13a can observe the operator's view of the scene in real time. In addition, the remote experts can add assistance information, such as circles or text, to the virtual whiteboard; in this case, they add assistance information that explains how to insert the red network cable into the host network port. The added virtual assistance information is synchronously displayed on the virtual screen of the AR holographic glasses worn by the operator to guide the operation. All parties are connected via real-time audio and video communication during this session. In addition, Fig. 13b shows an operator being guided to change the position of the yellow network cable with the help of the remote experts' visual instructions.

As a remote collaborative real-time assistance method, achieving excellent real-time performance with low latency is necessary. The real-time requirements of the system primarily refer to the following two aspects. First, the remote experts can observe the scene at the operator's site with low latency or without delay. Second, the remote expert's request to deliver collaborative information can receive a quick response, and the content can be simultaneously displayed on the screen of the AR glasses. In this paper, the real-time performance of the implemented system is tested using the Wireshark tool. The experimental environment is shown in Table 1, and a minimum network bandwidth of 20 Mbps is guaranteed.

Table 1 Experimental environment
Server: Windows; CPU Xeon E5-2620; 16 GB memory; 1 TB hard disk
Remote expert client (PC): Windows; CPU Intel Core i5-7400; 8 GB memory; 500 GB hard disk
Remote expert client (tablet): Android; CPU Cortex-A73; 4 GB RAM; 128 GB ROM
Operator client (Halo Mini AR glasses): Android; CPU MTK8735; 4 GB RAM; 128 GB ROM

When the audio/video delay time is between 0 and 400 milliseconds, people are not aware of the delay. Figure 14 shows the variation in the maximum audio/video delay time as the number of online users increases.
Fig. 14 Relationship between the maximum audio/video delay time (milliseconds) and the number of users

When the number of online users is less than 80, the maximum audio/video system delay remains less than 300 milliseconds, which ensures that the remote experts can view a scene at the operator's site in real time. As the number of users grows, the delay time increases at an increasingly rapid rate, which depends on the performance of the media server and the occupancy of the network bandwidth. The scene can thus be viewed at the operator's site in real time even when there are dozens of remote experts.

Although "real time" is not clearly quantified, we require that no more than 15 milliseconds elapse between the point at which the remote expert successfully draws the assistance information and the point at which the operator sees that information on the virtual screen. Figure 15 shows the relationship between the average response time of the assistance transaction and the throughput (the number of requests to send assistance information per minute). As the throughput increases, the growth of the average response time slows. When the number of requests to send assistance information is 70 per minute, the average response time of the transaction is less than 5 milliseconds, which indicates that the push method based on WebSocket and Node.js is effective and satisfies the requirements of remote real-time assistance.

To prove that our proposed method can actually improve the efficiency of remote collaboration, a comparative experiment was designed to compare the following three collaboration methods:

• Image- and text-based method: remote experts can send operating instructions based only on images and text to the operator.
• Video-based method: remote experts can see the scene of the operation site and send voice instructions.
• AR-based method (proposed in this paper): multiple remote experts and operators can communicate in real time and add visual guidance directly to the scene based on augmented reality.

Thirty participants were invited to participate in this experiment. To ensure that each participant in the role of operator experienced the same conditions for receiving instructions, the same three remote experts were used for all 30 operators. There were 13 females and 17 males between 17 and 45 years old, all with no background in equipment repair.

Three groups of equipment maintenance tasks were designed at three difficulty levels, i.e., primary, intermediate and advanced, each of which was composed of 15 tasks of comparable difficulty. Additionally, each participant was asked to perform three rounds of experiments according to the three methods mentioned above. Before each round started, the participants selected five tasks from each of the three groups; tasks could not be selected multiple times. Then, the participants completed each task according to the instructions from the remote experts, and each operation could not exceed 60 s. The accuracy of each participant's operation and the time spent were recorded by the organizer. Accuracy was represented by a Boolean value, where 0 indicates an operation error and 1 indicates a correct operation. The time spent is the number of seconds the operator spends completing the task; if the task is not completed successfully, the time spent is recorded as 60 s.

Finally, we obtained 1350 datasets, each of which contains two measured values (i.e., accuracy and time spent).

Fig. 15 Relationship between the average response time (milliseconds) and the throughput
Fig. 16 The average accuracy of the tasks performed by the operators under the collaborative methods analysed

The data can be divided into 9 groups according to the remote collaboration method used and the difficulty of the task. The average accuracy of the data in the 9 groups was calculated separately, and the results are shown in Fig. 16. As seen from the figure, compared with the other two methods, our proposed AR-based method significantly improves the accuracy with which the operator executes tasks according to the instructions of the remote expert. For primary-level tasks, the accuracy improvement is more obvious because the visual AR-based method enables the quick and adequate exchange of visual context-related information, which also shows that our method is very useful for guiding inexperienced operators in simple operations.

Additionally, we classified and calculated the time spent on the operations recorded in the experiment. Figure 17 shows the average time it takes to perform tasks of different difficulties with the collaborative methods analysed. It can be concluded from the graph that the proposed method shortens the time it takes the operator to perform a task; thus, our method helps operators understand the instructions of remote experts more quickly. However, when our method is used to assist with complex, advanced tasks, the improvement is not obvious because the area provided by our virtual artboard is limited, and too much virtual assistance information will interfere with the operator's line of sight and become distracting. Therefore, the experimental analysis shows that the proposed method is effective but has limitations and needs to be optimized.

Fig. 17 The average time spent (s) on the tasks performed by the operators for the collaborative methods analysed

5 Related studies

AR-based remote real-time assistance has become a significant research issue in collaborative work, as it improves the efficiency and immersion of teamwork. We provide a review of the major techniques and studies that are closely related to our work.

Many studies focus on remote real-time collaborative methods in a fixed and single scene, which require expensive hardware and are not versatile. Anton D et al. [30] proposed an augmented telemedicine framework for 3D real-time communication that combines interaction via AR and provides a reasonable method for experts to remotely intervene and guide operators in telemedicine. Hou L et al. [31] analysed the use of two-dimensional drawings for guiding assembly and experimentally evaluated the benefits of AR in assembly; specific findings indicated that the collaborative method based on AR visualization significantly shortened the completion time, the cost of correcting erroneous assembly and the payment to assemblers. Oyekan J et al. [32]

designed a collaborative infrastructure that enables team members to collaborate in the real-time completion of complex tasks and implemented it using human motion capture technology and a synchronous data transfer protocol from computer networks. Although these methods have been successful in various application fields, the limitations of their application scope and the complexity of their system architectures need to be considered.

Many studies focus on the development of mobile remote assistance frameworks based on AR. With the popularity of wearable smart devices, mobile-based collaborative methods will become more popular [33]. Kim Y et al. [34] proposed an AR-based tele-coaching system applied to the game of tennis, referred to as the AR coach, and evaluated players' instruction comprehension when the remote coaching was presented in several modalities, such as visual only and both visual and aural. Cidota M et al. [35] developed an AR framework to support visual communication between a remote user using a laptop and a local user wearing a head-mounted display with an RGB camera; however, multiple remote experts cannot simultaneously assist one local user in that system. Gauglitz S et al. [36] proposed a touchscreen interface for creating freehand drawings as world-stabilized annotations and for virtually navigating a scene reconstructed live in three dimensions. Although the interface can be used for remote assistance, it lacks a sense of immersion and requires users to use their hands.

Compared with the related studies presented in this section, we focus on improving the versatility, immersion and convenience of a remote real-time collaborative assistance method with AR glasses. In addition, a more efficient many-to-one remote assistance model is presented in this paper.

6 Conclusions

In this paper, an AR-based system for achieving remote collaborative real-time assistance is proposed and implemented. The system provides a specific solution for applying AR technology to remote assistance. The method effectively reduces the communication barriers among the parties and improves the efficiency of remote assistance. The solution proposed in this paper can serve as a suitable reference for future studies that involve these problems.

In future studies, the system will be optimized in two respects. First, the signal transmission can be unstable during multi-person video calls. Second, the accuracy of target-tracking registration decreases when the scene becomes too large. We also plan to continually improve the robustness and stability of the system to promote wider application of AR technology in the remote assistance field.

Funding This work is supported by the National Natural Science Foundation of China under Grant No. 61502294, the CERNET Innovation Project under Grant Nos. NGII20170513 and NGII20170206, and the IIOT Innovation and Development Special Foundation of Shanghai under Grant No. 2017-GYHLW-01037.

References

1. Bajura M, Neumann U (1995) Dynamic registration correction in augmented-reality systems[C]. In: Virtual reality international symposium, 1995. Proceedings. IEEE, pp 189–196
2. Hanna MG, Ahmed I, Nine J et al (2018) Augmented reality technology using Microsoft HoloLens in anatomic pathology [J]. Arch Pathol Lab Med 142(5):638–644
3. Shenai MB, Dillavou M, Shum C et al (2011) Virtual interactive presence and augmented reality (VIPAR) for remote surgical assistance [J]. Neurosurgery 68(2):200–207
4. Choi J, Yoon B, Jung C et al (2017) ARClassNote: augmented reality based remote education solution with tag recognition and shared hand-written note[C]. In: IEEE international symposium on mixed and augmented reality. IEEE, pp 303–309
5. WebRTC official website. http://www.webrtc.org/. Accessed 25 June 2018
6. Fette I, Melnikov A (2011) The WebSocket protocol. IETF internet draft, work in progress
7. Halo Mini official website. http://www.shadowcreator.com/. Accessed 10 June 2018
8. W3C WebRTC 1.0: real-time communication between browsers. https://www.w3.org/TR/2018/CR-webrtc-20180621/. Accessed 28 June 2018
9. IETF real-time communication in web-browsers (RTCWEB). http://datatracker.ietf.org/wg/rtcweb/. Accessed 28 June 2018
10. Jang-Jaccard J, Nepal S, Celler B et al (2016) WebRTC-based video conferencing service for telehealth [J]. Computing 98(1-2):169–193
11. Ma LV, Kim J, Park S et al (2016) An efficient Session_Weight load balancing and scheduling methodology for high-quality telehealth care service based on WebRTC [J]. J Supercomput 72(10):3909–3926
12. Santos-González I, Rivero-García A, Molina-Gil J et al (2017) Implementation and analysis of real-time streaming protocols [J]. Sensors 17(4):846
13. Johnston AB, Burnett DC (2012) WebRTC: APIs and RTCWEB protocols of the HTML5 real-time web [M]. Digital Codex LLC
14. Shen-hui C (2013) The study and implementation of NAT traversal technology. Nanjing University of Posts and Telecommunications
15. Ringe S, Kedia R, Poddar A et al (2015) HTML5 based virtual whiteboard for real time interaction [J]. Procedia Comput Sci 49(1):170–177
16. Antunes M, Silva C, Barranca JA (2016) A telemedicine application using WebRTC [J]. Procedia Comput Sci 100:414–420
17. Pimentel V, Nickerson BG (2012) Communicating and displaying real-time data with WebSocket [J]. IEEE Internet Comput 16(4):45–53
18. Tilkov S, Vinoski S (2010) Node.js: using JavaScript to build high-performance network programs [J]. IEEE Internet Comput 14(6):80–83
19. Gabor D (1948) A new microscopic principle [J]. Nature 161(4098):777
20. Zeng F, Zhang X, Zhang J-p et al (2013) Holographic waveguide head-mounted display system design based on prisms-grating structure [J]. Acta Opt Sin 33(9):114–119
21. Anousaki GC, Kyriakopoulos KJ (1999) Simultaneous localization and map building for mobile robot navigation [J]. IEEE Robot Autom Mag 6(3):42–53
22. Davison AJ, Reid ID, Molton ND et al (2007) MonoSLAM: real-time single camera SLAM [J]. IEEE Trans Pattern Anal Mach Intell 29(6):1052–1067
23. Mur-Artal R, Montiel JMM, Tardós JD (2015) ORB-SLAM: a versatile and accurate monocular SLAM system [J]. IEEE Trans Robot 31(5):1147–1163
24. Mirzaei FM, Roumeliotis SI (2007) A Kalman filter-based algorithm for IMU-camera calibration[C]. In: IEEE/RSJ international conference on intelligent robots and systems. IEEE, pp 2427–2434
25. Zhou S-l, Wu X-z, Liu G et al (2016) Integrated navigation method of monocular ORB-SLAM/INS [J]. J Chin Inert Technol 24(5):633–637
26. Sun L, Du J, Qin W (2015) Research on combination positioning based on natural features and gyroscopes for AR on mobile phones[C]. In: International conference on virtual reality and visualization. IEEE, pp 301–307
27. Bailey T, Nieto J, Guivant J et al (2006) Consistency of the EKF-SLAM algorithm[C]. In: IEEE/RSJ international conference on intelligent robots and systems. IEEE, pp 3562–3568
28. Klein G, Murray D (2009) Parallel tracking and mapping on a camera phone[C]. In: IEEE international symposium on mixed and augmented reality. IEEE Computer Society, pp 83–86
29. Davis BS (1998) Using low-cost MEMS accelerometers and gyroscopes as strapdown IMUs on rolling projectiles[C]. In: Position location and navigation symposium, IEEE. IEEE, pp 594–601
30. Anton D, Kurillo G, Yang AY et al (2017) Augmented telemedicine platform for real-time remote medical consultation[M]. In: MultiMedia modeling. Springer International Publishing, pp 77–89
31. Hou L, Wang X, Truijens M (2012) Using augmented reality to facilitate piping assembly: an experiment-based evaluation [J]. J Comput Civ Eng 29(1):05014007
32. Oyekan J, Prabhu V, Tiwari A et al (2017) Remote real-time collaboration through synchronous exchange of digitised human-workpiece interactions [J]. Futur Gener Comput Syst 67:83–93
33. Mimaroğlu O (2014) Collaborative augmented reality [J]. Commun ACM 45(7):64–70
34. Kim Y, Hong S, Kim GJ (2017) Augmented reality-based remote coaching for fast-paced physical task [J]. Virtual Reality 6:1–12
35. Cidota M, Lukosch S, Datcu D et al (2016) Workspace awareness in collaborative AR using HMDs: a user study comparing audio and visual notifications[C]. In: Proceedings of the 7th augmented human international conference. ACM Press, New York
36. Gauglitz S, Nuernberger B, Turk M (2014) In touch with the remote world: remote collaboration with augmented reality drawings and virtual navigation[C]. In: Proceedings of the 20th ACM symposium on virtual reality software and technology. ACM Press, New York, pp 197–205

Publisher's Note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
