
CNN-BASED INDOOR OCCUPANT LOCALIZATION VIA ACTIVE SCENE ILLUMINATION

Jinyuan Zhao, Natalia Frumkin, Prakash Ishwar and Janusz Konrad

Boston University
Department of Electrical and Computer Engineering
Boston, MA 02215

ABSTRACT

We propose and study a data-driven approach to indoor occupant localization using a network of single-pixel light sensors and modulated LED light sources. Locations are estimated by processing the sensor data with a simple convolutional neural network (CNN). Unlike previous model-based methods, the proposed approach does not require knowledge of room dimensions or of the locations of LEDs and sensors, nor assumptions about material properties and object heights. We quantitatively validate the performance of our approach in simulated and real-world environments, in private and public scenarios. In Unity3D simulations, compared to the best-performing benchmark method, our approach reduces the average localization error by 47.69% in private scenarios and by 46.99% in public scenarios. Similarly, in a real testbed the error is reduced by 36.54% and 11.46% in private and public scenarios, respectively.

Index Terms— Indoor localization, active scene illumination, convolutional neural network

1. INTRODUCTION

Future smart spaces are envisaged to respond to the needs of occupants and deliver benefits like energy savings, productivity gains and improved health. This requires information about the locations and activities of occupants.

Many traditional indoor localization systems make use of a wearable beacon. These systems require occupants to carry custom-designed electronic devices such as RFID tags [1, 2], badges [3] or a receiver [4, 5, 6], and are therefore intrusive. Systems that do not require a wearable beacon utilize signals that are affected by occupants' activities. They can be classified into passive and active systems based on whether they measure ambient signal characteristics or actively generate signals to be measured. Passive systems measure changes in received signals caused by human activities, including sound [7], airflow [8] and infrared light [9]. These systems are sensitive to environmental conditions like noise or illumination change. Vision-based systems [10, 11, 12, 13] are passive and more accurate, but do not preserve visual privacy. Active systems, on the other hand, generate modulated signals and measure changes in the received signals due to human activities. These include systems using WiFi transmitters [14] and ultrasound transducers [15]. They are more robust to environmental conditions, but are not as accurate as vision-based systems.

To address privacy concerns in vision-based systems, extremely-low-resolution (eLR) sensors have been proposed instead of video cameras. These sensors capture little visual information and have low processing and transmission costs. Both passive and active indoor localization systems have been developed using eLR sensors. Roeper et al. [16] proposed a passive indoor localization system using 6 single-pixel color sensors, but its performance was found to be largely affected by ambient light fluctuations. Wang et al. [17] and Li et al. [18] developed systems combining eLR sensors with modulated LED light sources for occupancy estimation and skeleton reconstruction; these systems are more robust to noise and illumination changes. Building upon Wang et al.'s active scene illumination methodology, we previously developed model-based algorithms [19, 20] that achieved good accuracy in localizing flat objects, but they require explicit knowledge of room dimensions and of the locations of LEDs and sensors, as well as assumptions about material properties.

In this paper, we develop a simple convolutional neural network (CNN) for indoor occupant localization via active scene illumination. Unlike our past model-based approaches [19, 20], the proposed data-driven method does not require knowledge about the room, LEDs, sensors, or object properties. We quantitatively validate the new method's performance with experiments both in simulation and on a physical testbed. We demonstrate that our network outperforms both traditional machine learning models and our best-performing model-based approaches.

2. PROPOSED APPROACH

2.1. Light transport matrix

Our system is composed of an array of modulated light sources (fixtures) and another array of single-pixel visible-light sensors, both mounted on the ceiling and facing vertically downward. The sensors measure incoming luminous flux and are planar, with no lens. The relationship between light modulation and sensor responses is described by a light transport matrix A [17].


Suppose we have Nf LED light sources (fixtures) and Ns light sensors. The luminous flux measured by sensor i can be expressed as follows:

    s(i) = b(i) + Σ_{j=1}^{Nf} f(j) A(i, j),    (1)

where b(i) is the ambient light that arrives at sensor i, f(j) is the relative intensity of light source j scaled to the range [0, 1], and A(i, j) is the unit light contribution of source j to the luminous flux of sensor i. This is illustrated in Figure 1.

Fig. 1: Illustration of light captured by sensor i.

The light transport matrix A is an Ns × Nf matrix whose entries A(i, j) are shown in equation (1). The values of matrix A are determined by the locations and characteristics of light sources and sensors, the room geometry, and the locations, shapes and surface properties of all objects in a room (e.g., floor, furniture, occupants). The presence or movement of an occupant can change the values of A. In practice, matrix A is obtained by modulating light sources (through f(j)) and measuring sensor responses s(i) [19]. Then, in order to localize an occupant, one measures light transport matrices for two different room states: A0 for an empty room, and A for the room with an occupant. The change in the light transport matrix between the two states, ∆A = A − A0, is used as the feature in our data-driven localization algorithm.
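For concreteness, below is a minimal NumPy sketch of this measurement step in which each source is driven one at a time; the hardware handles read_sensors and set_fixtures are hypothetical placeholders, and the modulation scheme actually used is the one described in [19].

import numpy as np

NS, NF = 12, 12  # numbers of sensors and fixtures in our rooms

def measure_transport_matrix(read_sensors, set_fixtures):
    # read_sensors() returns an NS-vector of luminous-flux readings s(i);
    # set_fixtures(f) applies an NF-vector of relative intensities in [0, 1].
    set_fixtures(np.zeros(NF))
    b = read_sensors()                 # ambient-only readings b(i)
    A = np.zeros((NS, NF))
    for j in range(NF):
        f = np.zeros(NF)
        f[j] = 1.0                     # full intensity for source j only
        set_fixtures(f)
        A[:, j] = read_sensors() - b   # by equation (1), s(i) - b(i) = A(i, j)
    return A

# Localization feature: delta_A = A (occupied room) - A0 (empty room).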
2.2. Proposed network

We propose a simple CNN for indoor occupant localization. The network takes the pre-processed light transport matrix change ∆A as the input and produces a 2-dimensional vector (x̂, ŷ) of coordinates of the estimated location.

Pre-processing: We first normalize the matrix ∆A by dividing all its entries by the magnitude of the entry with the largest absolute value:

    ∆A′ = ∆A / max_{i,j} |∆A(i, j)|.    (2)

This ensures that all entries of ∆A′ lie in the interval [−1, 1].

We note that sensors closer to the occupant should have larger reading changes between the empty and occupied room states than those further away. To leverage this spatial relationship, we reshape the Ns × Nf matrix ∆A′ into a 3D tensor of Nf channels. Each channel consists of the entries from one column of ∆A′, corresponding to the contribution of one light source to all sensors, and is reshaped into a 2D matrix to match the spatial ordering of the actual sensor grid on the ceiling. This allows the network to extract the occupant's location from the channel images. Figure 2 shows an example of the original ∆A′ matrix and the reshaped 3D tensor.

Fig. 2: Example of an original ∆A′ matrix (Ns = 12, Nf = 12) and its reshaped 3D tensor form with Nf = 12 channels, each reshaped into a 3 × 4 matrix corresponding to the actual sensor grid. Red boxes correspond to the same entries. The occupant is located at the lower right corner of the floor.

Finally, we upsample the channel images using bilinear interpolation to allow sub-pixel kernel and stride sizes. In Section 3, we study the impact of various upsampling factors on localization performance.
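As a concrete sketch of this pre-processing chain (normalization, reshaping and upsampling), assuming a row-major ordering of the 3 × 4 sensor grid and PyTorch as the numerical backend (neither assumption is mandated by the method):

import torch
import torch.nn.functional as F

def preprocess(delta_A, grid_hw=(3, 4), factor=5):
    # Turn an Ns x Nf matrix delta_A into the CNN input tensor.
    dA = torch.as_tensor(delta_A, dtype=torch.float32)
    dA = dA / dA.abs().max()                  # equation (2): entries in [-1, 1]
    h, w = grid_hw
    # One channel per fixture: column j holds all sensor readings for source j,
    # laid out to match the physical sensor grid (row-major order assumed here).
    x = dA.t().reshape(1, dA.shape[1], h, w)  # (1, Nf, 3, 4)
    # Bilinear interpolation between grid points; for factor 5 each 3 x 4
    # channel becomes 11 x 16, matching the input channel sizes in Table 1.
    size = ((h - 1) * factor + 1, (w - 1) * factor + 1)
    return F.interpolate(x, size=size, mode="bilinear", align_corners=True)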
Network architecture: Our proposed CNN consists of 4 layers: 2 convolutional layers, 1 hidden layer and 1 output layer. The two convolutional layers have 32 and 64 output channels, respectively, and use 2D convolutional kernels. The output of the second convolutional layer is flattened and fully connected to the hidden layer. The hidden layer is fully connected to the output layer, which gives the estimated occupant location (x̂, ŷ). The kernel and stride sizes of the two convolutional layers and the dimension of the hidden layer are chosen according to the upsampling factor of the input tensor (which determines the input size). The loss function is the mean squared localization error (squared Euclidean distance between the estimated and ground-truth locations). To avoid the problem of vanishing gradients, the outputs of the two convolutional layers and the hidden layer are activated by a LeakyReLU function with α = 0.1. Figure 3 illustrates the overall architecture of the CNN, while Table 1 provides detailed parameters.

Fig. 3: Architecture of proposed CNN. See Table 1 for details.
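As an illustration, the configuration for an upsampling factor of 5 (input channel size 11 × 16, cf. Table 1) could be written as follows; PyTorch and the absence of zero padding are assumptions of this sketch, not prescriptions of the method.

import torch.nn as nn

class LocalizationCNN(nn.Module):
    # 4-layer network for upsampling factor 5 (Table 1): two convolutional
    # layers, one hidden fully-connected layer and a 2D output layer.
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(12, 32, kernel_size=4, stride=2),  # Nf = 12 input channels
            nn.LeakyReLU(0.1),
            nn.Conv2d(32, 64, kernel_size=3, stride=1),
            nn.LeakyReLU(0.1),
        )
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 2 * 5, 500),  # 11x16 input -> 4x7 -> 2x5 feature maps
            nn.LeakyReLU(0.1),
            nn.Linear(500, 2),           # estimated location (x, y)
        )

    def forward(self, x):                # x: (batch, 12, 11, 16)
        return self.head(self.features(x))

model, loss_fn = LocalizationCNN(), nn.MSELoss()  # mean squared localization error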

2.3. Benchmarks

We benchmark our proposed CNN against support vector regression (SVR) and K nearest neighbors (KNN) regression. Both methods take the flattened and normalized ∆A′ matrix as the feature vector.

In SVR, we train two regressors (with a Gaussian kernel) to estimate the x̂ and ŷ coordinates separately. We determine the optimal values of the two SVR tuning parameters, namely the box constraint C and the margin of tolerance to errors ε, using a grid search and 5-fold cross-validation.

In KNN regression, we use the Euclidean distance in the feature-vector space as the distance metric. For a test sample, we find the K closest samples (in the Euclidean-distance sense) in the training set. Then, we estimate the location of the test sample using the weighted centroid of the ground-truth locations of the K nearest neighbors, where each weight is the reciprocal of the Euclidean distance. The parameter K is optimized through 5-fold cross-validation.

We also compare CNN localization performance against our best model-based localization algorithm [20]. This algorithm is based on a light reflection model [17] and assumes the floor to be Lambertian and objects to be flat. It first computes the change of floor albedo based on the change in the light transport matrix and the knowledge of room dimensions and locations of sensors and fixtures, and then uses the centroid of the albedo change as the estimated location.
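For reference, a possible scikit-learn realization of these two benchmarks is sketched below; the hyper-parameter grids are illustrative placeholders, not the values tuned in our experiments.

from sklearn.svm import SVR
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import GridSearchCV

# X: flattened, normalized delta-A matrices, shape (n_samples, Ns * Nf);
# Y: ground-truth locations, shape (n_samples, 2).

def fit_svr(X, Y):
    # One Gaussian-kernel SVR per coordinate; C and epsilon chosen by a
    # grid search with 5-fold cross-validation.
    grid = {"C": [0.1, 1, 10, 100], "epsilon": [0.01, 0.1, 1]}
    return [GridSearchCV(SVR(kernel="rbf"), grid, cv=5,
                         scoring="neg_mean_squared_error").fit(X, Y[:, k])
            for k in range(2)]

def fit_knn(X, Y):
    # KNN regression: inverse-distance-weighted centroid of the K neighbors.
    grid = {"n_neighbors": [1, 3, 5, 7, 9]}  # candidate K values
    knn = KNeighborsRegressor(weights="distance", metric="euclidean")
    return GridSearchCV(knn, grid, cv=5,
                        scoring="neg_mean_squared_error").fit(X, Y)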
3. EXPERIMENTAL RESULTS
3.1. Simulation experiments

To validate the performance of our proposed CNN, we collected datasets in both a Unity3D-simulated environment and a real physical testbed. In the Unity3D simulation, we created a room with tables, doors and a screen as furniture, and a window that allows simulated sunlight to illuminate the interior. The size of the room and the placement of the furniture are chosen to best approximate our real testbed room. Part of the empty floor is used as the test area for data collection (2.8m × 2m). We placed 12 LED/sensor pairs on the ceiling on a 3 × 4 grid at a height of 2.71m, simulated in the same way as described in our previous work [20]. To capture the body shape of an occupant more realistically, we used 8 human avatars that differ in height, weight, gender and clothing. All human avatars are in a standing pose. Our simulated room and 8 human avatars are shown in Figure 4.

Fig. 4: Simulated room in Unity3D with test area shown (left) and 8 human avatars used in data collection (right).

We collected two datasets for each human avatar: the training set contains 1,530 samples with ground-truth locations on a 45 × 34 grid with 5cm spacing, and the test set has 63 samples on a 9 × 7 grid with 25cm spacing. There is no overlap between the training and test set grids. The ground-truth location of a human avatar is the projection of its centroid onto the floor. When placing an avatar at each ground-truth location on the grid, we rotated it around the vertical axis by an angle randomly chosen in the range [0°, 360°] to change its orientation. We followed the steps described in our previous work [20] to modulate the LEDs and obtain a light transport matrix A, and the difference matrix ∆A (by subtracting from A the light transport matrix A0 obtained for the empty room).

We considered two different scenarios when training our proposed CNN and the benchmark models: private and public. In a private scenario, like a home, the system can only be used by a small set of people, and therefore we can train a model with data from all users. In a public scenario, like a store, the model cannot be trained on all users since there can always be new users that have never been seen by the system.

In the private scenario, out of the 1,530×8 training samples we use only 50×8 samples (50 random samples for each avatar) for training. All the data-driven models are trained using the same set of samples. Then, we test each model on each avatar's 63 test samples (with 25cm spacing), which are separate from the larger training set of 1,530×8 samples. In the public scenario, we perform a leave-one-person-out cross-validation. We train a model on 7 avatars (50 samples per avatar) and test it on the eighth avatar, and repeat this process 8 times so that each of the avatars is left out for testing. The performance of a model is evaluated in terms of the average localization error.
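The public-scenario protocol can be sketched as follows; datasets and train_model are hypothetical handles for the per-avatar data and for any of the data-driven models, and for simplicity the sketch tests on all samples of the held-out avatar.

import numpy as np

def leave_one_person_out(datasets, train_model, n_train_per_person=50, seed=0):
    # datasets: list of (X, Y) pairs, one per person/avatar;
    # train_model(X, Y): fits a model and returns an object with predict().
    rng = np.random.default_rng(seed)
    errors = []
    for held_out in range(len(datasets)):
        X_tr, Y_tr = [], []
        for p, (X, Y) in enumerate(datasets):
            if p == held_out:
                continue
            idx = rng.choice(len(X), size=n_train_per_person, replace=False)
            X_tr.append(X[idx])
            Y_tr.append(Y[idx])
        model = train_model(np.concatenate(X_tr), np.concatenate(Y_tr))
        X_te, Y_te = datasets[held_out]
        err = np.linalg.norm(model.predict(X_te) - Y_te, axis=1)
        errors.append(err.mean())       # average error for the held-out person
    return float(np.mean(errors))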
We tested several choices of the upsampling factor for the input tensor of our CNN: 1 (no upsampling), 3, 5 and 10. As the input size is scaled, we scale the network to roughly match the input size by changing the network parameters. The network parameters and average localization errors for the different upsampling factors are shown in Table 1. The network with an upsampling factor of 5 performs best in the private scenario, while performing only slightly worse than the best case in the public scenario.

Table 1: Network parameters and average localization errors for different upsampling factors.

    Upsampling factor                          1       3       5       10
    Input channel size                         3×4     7×10    11×16   21×31
    Conv1 kernel                               2×2     4×4     4×4     6×6
    Conv1 stride                               (1,1)   (1,1)   (2,2)   (2,2)
    Conv2 kernel                               2×2     2×2     3×3     4×4
    Conv2 stride                               (1,1)   (1,1)   (1,1)   (2,2)
    Dim. hidden layer                          170     300     500     750
    Avg. localization error - private (cm)     6.43    6.25    5.69    6.47
    Avg. localization error - public (cm)      7.89    7.96    8.04    8.61

Figure 5 shows the average localization errors for each human avatar for the proposed CNN (upsampling factor of 5) and the 3 benchmark methods. The CNN approach reduces the average localization error across all avatars by 47.69% and 46.99% in the private and public scenarios, respectively, compared to the best-performing method among SVR, KNN regression and the model-based algorithm.

Fig. 5: Average localization error for each human avatar for the proposed CNN and 3 benchmark methods in Unity3D simulation: (a) private scenario, (b) public scenario. The model-based algorithm has the same errors in private and public scenarios since it does not need training.
3.2. Testbed experiments

We have also built a physical testbed with the dimensions of a small room to evaluate the performance of the different models on real-world data. We placed 12 LED/sensor pairs on the ceiling at the same locations as those in the simulated room. The LEDs produce 800 lumens of light each and are controlled by PWM signals. We used only the white (unfiltered) channel of Adafruit TCS34725 single-pixel color sensors to measure luminous flux. An MSP432 controller synchronizes the system and controls the states of the LEDs. 12 Arduino Uno R3 boards are used to collect the sensor readings and send the data to a parsing server, which organizes the sensor readings and computes the light transport matrix A. Two views of our physical testbed are shown in Figure 6.

Fig. 6: Two views of our room-scale physical testbed: (a) bottom-up view of the 12 LED/sensor pairs, (b) side view during data collection.
We collected a dataset for 8 different people using this testbed. We marked 63 ground-truth locations (9 × 7 grid) with blue masking tape on a floor area of 2.8m × 2m. When collecting data, we turned on the MSP432 to start modulating the LEDs and asked each person to walk through all ground-truth locations following a zig-zag path. Each person was asked to stand still at each location for 5 light-modulation cycles before moving to the next location. The 5 light transport matrices obtained for each location were then averaged to reduce noise. Before collecting data for each person, we ran the system for several modulation cycles in the empty state and averaged the obtained light transport matrices to obtain A0.
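As an illustration, the per-location processing then reduces to a few lines that reuse the preprocess function and the trained network sketched in Section 2.2 (a simplified view, not our exact pipeline):

import numpy as np
import torch

def localize(location_cycles, A0, model, factor=5):
    # location_cycles: list of Ns x Nf light transport matrices recorded
    # over the 5 modulation cycles at one marked location;
    # A0: empty-room baseline, averaged over several cycles the same way.
    A = np.mean(location_cycles, axis=0)    # average cycles to reduce noise
    delta_A = A - A0                        # change w.r.t. the empty room
    x = preprocess(delta_A, factor=factor)  # normalize, reshape, upsample
    with torch.no_grad():
        return model(x).squeeze(0).numpy()  # estimated (x, y) location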
We also considered both private and public scenarios in the testbed experiments. In the private scenario, we randomly selected 50 samples from each person to form the training set (totaling 400 samples), and used the remaining 13 samples of each person as the test set (totaling 104 samples). In the public scenario, we performed a leave-one-person-out cross-validation by training the models on 7 persons (using 63 × 7 = 441 samples) and testing on the eighth person (63 samples). Figure 7 shows the average localization error for each person in both private and public scenarios. Compared with the best-performing benchmark method, our CNN reduces the average localization error across all individuals by 36.54% in the private scenario and by 11.46% in the public scenario.

Fig. 7: Average localization error for each person for the proposed CNN (upsampling factor of 5) and 3 benchmark methods evaluated on the testbed: (a) private scenario, (b) public scenario.

4. DISCUSSION AND CONCLUSIONS

We have proposed a CNN-based method for indoor occupant localization via active scene illumination. The proposed CNN significantly outperforms SVR and KNN regression in both simulation and physical-testbed experiments. The approach works well even with a small training set of just a few hundred samples because of the small depth of the CNN. Moreover, the proposed CNN also outperforms our best model-based algorithms. This may be attributed to modeling assumptions, e.g., a Lambertian floor and flat objects, which are not always satisfied in a realistic room with human occupants. Our data-driven approach, on the other hand, requires no such assumptions and does not even need to know the locations of the light sources and sensors.

Acknowledgments: We would like to thank William Chen, Hannah Gibson, Tu Timmy Hoang, Dong Hyun Kim and Adam Surette for developing our physical testbed. This work was supported by the NSF under ERC Cooperative Agreement No. EEC-0812056 and Boston University's Undergraduate Research Opportunities Program.

5. REFERENCES

[1] Lionel M Ni, Yunhao Liu, Yiu Cho Lau, and Abhishek P Patil, "Landmarc: indoor location sensing using active rfid," Wireless Networks, vol. 10, no. 6, pp. 701–710, 2004.

[2] Jeffrey Hightower, Roy Want, and Gaetano Borriello, "Spoton: An indoor 3d location sensing technology based on rf signal strength," UW CSE 00-02-02, University of Washington, Department of Computer Science and Engineering, Seattle, WA, vol. 1, 2000.

[3] Roy Want, Andy Hopper, Veronica Falcao, and Jonathan Gibbons, "The active badge location system," ACM Transactions on Information Systems (TOIS), vol. 10, no. 1, pp. 91–102, 1992.

[4] Se-Hoon Yang, Hyun-Seung Kim, Yong-Hwan Son, and Sang-Kook Han, "Three-dimensional visible light indoor localization using aoa and rss with multiple optical receivers," Journal of Lightwave Technology, vol. 32, no. 14, pp. 2480–2485, 2014.

[5] Weizhi Zhang, MI Sakib Chowdhury, and Mohsen Kavehrad, "Asynchronous indoor positioning system based on visible light communications," Optical Engineering, vol. 53, no. 4, pp. 045105, 2014.

[6] Heidi Steendam, "A 3-d positioning algorithm for aoa-based vlp with an aperture-based receiver," IEEE Journal on Selected Areas in Communications, vol. 36, no. 1, pp. 23–33, 2018.

[7] Atri Mandal, Cristina V Lopes, Tony Givargis, Amir Haghighat, Raja Jurdak, and Pierre Baldi, "Beep: 3d indoor positioning using audible sound," in Consumer Communications and Networking Conference, 2005. CCNC 2005. Second IEEE. IEEE, 2005, pp. 348–353.

[8] John Krumm, Ubiquitous Computing Fundamentals, Chapman and Hall/CRC, 2016.

[9] Daniel Hauschildt and Nicolaj Kirchhof, "Advances in thermal infrared localization: Challenges and solutions," in Indoor Positioning and Indoor Navigation (IPIN), 2010 International Conference on. IEEE, 2010, pp. 1–8.

[10] Rafael Munoz-Salinas, R Medina-Carnicer, Francisco José Madrid-Cuevas, and Angel Carmona-Poyato, "Multi-camera people tracking using evidential filters," International Journal of Approximate Reasoning, vol. 50, no. 5, pp. 732–749, 2009.

[11] Valery A Petrushin, Gang Wei, and Anatole V Gershman, "Multiple-camera people localization in an indoor environment," Knowledge and Information Systems, vol. 10, no. 2, pp. 229–241, 2006.

[12] Xue Wang and Sheng Wang, "Collaborative signal processing for target tracking in distributed wireless sensor networks," Journal of Parallel and Distributed Computing, vol. 67, no. 5, pp. 501–515, 2007.

[13] Wojciech Zajdel and Ben JA Kröse, "A sequential bayesian algorithm for surveillance with nonoverlapping cameras," International Journal of Pattern Recognition and Artificial Intelligence, vol. 19, no. 08, pp. 977–996, 2005.

[14] May Moussa and Moustafa Youssef, "Smart devices for smart environments: Device-free passive detection in real environments," in Pervasive Computing and Communications, 2009. PerCom 2009. IEEE International Conference on. IEEE, 2009, pp. 1–6.

[15] Eric A Wan and Anindya S Paul, "A tag-free solution to unobtrusive indoor tracking using wall-mounted ultrasonic transducers," in Indoor Positioning and Indoor Navigation (IPIN), 2010 International Conference on. IEEE, 2010, pp. 1–10.

[16] Douglas Roeper, Jiawei Chen, Janusz Konrad, and Prakash Ishwar, "Privacy-preserving, indoor occupant localization using a network of single-pixel sensors," in Advanced Video and Signal Based Surveillance (AVSS), 2016 13th IEEE International Conference on. IEEE, 2016, pp. 214–220.

[17] Quan Wang, Xinchi Zhang, and Kim L Boyer, "Occupancy distribution estimation for smart light delivery with perturbation-modulated light sensing," Journal of Solid State Lighting, vol. 1, no. 1, pp. 17, 2014.

[18] Tianxing Li, Qiang Liu, and Xia Zhou, "Practical human sensing in the light," in Proceedings of the 14th Annual International Conference on Mobile Systems, Applications, and Services. ACM, 2016, pp. 71–84.

[19] Jinyuan Zhao, Prakash Ishwar, and Janusz Konrad, "Privacy-preserving indoor localization via light transport analysis," in Acoustics, Speech and Signal Processing (ICASSP), 2017 IEEE International Conference on. IEEE, 2017, pp. 3331–3335.

[20] Jinyuan Zhao, Natalia Frumkin, Janusz Konrad, and Prakash Ishwar, "Privacy-preserving indoor localization via active scene illumination," in 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW). IEEE, 2018, pp. 1661–166109.