Distributed Visual Processing for Augmented Reality
Wai Ho Li
Recent advances have made augmented reality on smartphones possible, but these applications are still constrained by the limited computational power available. This paper presents a system which combines smartphones with networked infrastructure and fixed sensors and shows how these elements can be combined to deliver real-time augmented reality. A key feature of this framework is the asymmetric nature of the distributed computing environment. Smartphones have high bandwidth video cameras but limited computational ability. Our system connects multiple smartphones through relatively low bandwidth network links to a server with large computational resources connected to fixed sensors that observe the environment. By contrast to other systems that use preprocessed static models or markers, our system has the ability to rapidly build dynamic models of the environment on the fly at frame rate. We achieve this by processing data from a Microsoft Kinect to build a trackable point cloud model of each frame. The smartphones process their video camera data on-board to extract their own set of compact and efficient feature descriptors which are sent via WiFi to a server. The server runs computationally intensive algorithms including feature matching, pose estimation and occlusion testing for each smartphone. Our system demonstrates real-time performance for two smartphones.
This work is motivated by the goal of creating a collaborative multi-user Augmented Reality system which combines three elements:
Handheld devices such as smartphones or tablets which contain video cameras that image the environment and display that image with overlaid computer graphics on their screens.
Fixed hardware that can communicate with the mobile devices via a wireless network and can provide large computing resources.
Fixed sensors in the environment that provide information that can be used to localise the mobile devices.

This kind of architecture creates an asymmetric computing environment which raises the key issues of how these components can collaborate to provide an Augmented Reality experience, how the (somewhat) limited bandwidth of the wireless network can best be used and how the computational burden should be distributed between the smartphones and the fixed infrastructure. This has to be organised so as to overcome the constraint of limited computational capability on the mobile devices and also maximise their battery life.
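The bandwidth asymmetry above can be made concrete with a small back-of-the-envelope sketch. The message layouts, field names, and sizes below are illustrative assumptions, not the actual protocol described in this paper; the point is the structure: the phone uplinks a compact packet of keypoints and descriptors instead of raw video, and the server downlinks only a 6-DoF pose.

```python
from dataclasses import dataclass

# Illustrative numbers (assumptions, not the paper's actual figures).
FRAME_W, FRAME_H = 640, 480   # phone camera resolution
BYTES_PER_PIXEL = 2           # rough per-pixel budget for raw video
DESCRIPTOR_BYTES = 64         # one compact binary descriptor
MAX_FEATURES = 200            # features extracted per frame on-phone

@dataclass
class FeaturePacket:
    """Uplink: what the phone sends to the server each frame."""
    frame_id: int
    keypoints: list      # (x, y) pixel coordinates, one per feature
    descriptors: bytes   # MAX_FEATURES * DESCRIPTOR_BYTES

    def wire_size(self) -> int:
        # 4-byte frame id + two 32-bit floats per keypoint + descriptors
        return 4 + 8 * len(self.keypoints) + len(self.descriptors)

@dataclass
class PosePacket:
    """Downlink: the server's reply is just a pose for that frame."""
    frame_id: int
    pose: tuple          # (tx, ty, tz, rx, ry, rz)

    def wire_size(self) -> int:
        return 4 + 6 * 4  # frame id + six 32-bit floats

raw_frame_bytes = FRAME_W * FRAME_H * BYTES_PER_PIXEL
packet = FeaturePacket(
    frame_id=0,
    keypoints=[(0.0, 0.0)] * MAX_FEATURES,
    descriptors=bytes(MAX_FEATURES * DESCRIPTOR_BYTES),
)
# The uplink is a small fraction of streaming raw frames at the same rate.
ratio = packet.wire_size() / raw_frame_bytes
```

At these illustrative numbers the uplink is about 14 KB per frame against roughly 600 KB for a raw frame, so extracting descriptors on-phone is what makes the relatively low bandwidth WiFi link viable.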
We present a system comprising multiple (initially two) smartphones and a desktop PC to which a Microsoft Kinect is attached. The Kinect is used to provide a coordinate frame in which virtual content and the smartphones' positions can be expressed. It also senses the environment as 3D textured surfaces and processes this to generate a trackable point cloud model consisting of an indexed set of visual descriptors.
The model is rebuilt at 30 Hz frame rate and is used to localise the smartphones at interactive frame rates. To our knowledge this is the first time this has been achieved; it is made possible by using a novel rotationally covariant descriptor, RHIPS, which is based on HIPS.
Because we remodel the world dynamically at 30 Hz, our system is not only robust to moving and possibly unmodelled elements of the scene, it is able to use them to aid localisation.
We show how to transform the Kinect's depth map into the viewpoint of a smartphone, thus turning it into a virtual Kinect. This is used to correctly render virtual content with occlusions in the smartphones' viewpoints.
We also use the Kinect's depth map to capture interactions between real and virtual elements (e.g. when a user's hand touches a virtual object).
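The virtual Kinect idea in the list above can be sketched as follows: back-project each Kinect depth pixel to 3D, rigidly transform the points into the phone's camera frame, and re-project them into the phone image with a simple z-buffer. The intrinsics, poses, and image sizes below are made-up placeholders, and this is the standard pinhole-camera pipeline rather than this paper's specific implementation.

```python
import numpy as np

def reproject_depth(depth, K_kinect, K_phone, R, t, out_shape):
    """Turn a Kinect depth map into a depth map seen from the phone.

    depth     : (H, W) array of metres, 0 where the Kinect has no reading
    K_kinect  : 3x3 Kinect intrinsics
    K_phone   : 3x3 phone intrinsics
    R, t      : rigid transform from the Kinect frame to the phone frame
    out_shape : (H', W') of the phone image
    Returns an (H', W') depth map (inf where nothing projects), usable
    as a z-buffer to let real surfaces occlude virtual content.
    """
    H, W = depth.shape
    v, u = np.mgrid[0:H, 0:W]
    valid = depth > 0
    z = depth[valid]
    # Back-project Kinect pixels to 3D points in the Kinect frame.
    x = (u[valid] - K_kinect[0, 2]) * z / K_kinect[0, 0]
    y = (v[valid] - K_kinect[1, 2]) * z / K_kinect[1, 1]
    pts = np.stack([x, y, z], axis=0)          # 3 x N
    # Rigid transform into the phone's camera frame.
    pts_phone = R @ pts + t.reshape(3, 1)
    zp = pts_phone[2]
    front = zp > 0                             # keep points in front of the phone
    # Project into the phone image plane.
    up = K_phone[0, 0] * pts_phone[0, front] / zp[front] + K_phone[0, 2]
    vp = K_phone[1, 1] * pts_phone[1, front] / zp[front] + K_phone[1, 2]
    ui, vi = np.round(up).astype(int), np.round(vp).astype(int)
    Ho, Wo = out_shape
    inside = (ui >= 0) & (ui < Wo) & (vi >= 0) & (vi < Ho)
    out = np.full(out_shape, np.inf)
    # Simple z-buffer: write far points first so near points overwrite them.
    zin = zp[front][inside]
    order = np.argsort(zin)[::-1]
    out[vi[inside][order], ui[inside][order]] = zin[order]
    return out
```

A virtual object's fragment at phone pixel (u, v) with depth d is then drawn only if d < out[v, u]; the same map supports the touch interaction above by checking when real depth comes within a small threshold of a virtual object's surface.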
Figure 1: Distributed visual processing for Multi-user Augmented Reality. Here the system demonstrates a real-time AR service for two phones, tracking both phones and rendering virtual content with occlusions. Note the virtual model of the earth rendered behind the Kinect packaging.
Markerless visual tracking of real scenes is an attractive option [3, 10, 7, 15, 1] for obtaining pose information for Augmented Reality applications for several reasons:
Video cameras are information rich sensors and are already needed where AR is mediated by video feed-through such as on a handheld tablet or smartphone. The use of additional