
Human Sensors

Initial requirements were as follows:

1) Determine a person's position and pose (standing, lying, sitting). The point was to infer, for
example, that if a person got out of bed the light should be turned on, and if they lay down on the
bed it should be turned off.
2) Produce a 3D rendering.
3) Motion detection.
4) Send results via the MQTT protocol (a minimal publishing sketch follows this list).
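
As a minimal sketch of requirement 4, assuming the paho-mqtt client library (1.x API); the broker
address, topic name, and payload fields are hypothetical:

```python
import json

import paho.mqtt.client as mqtt

BROKER_HOST = "localhost"         # hypothetical broker address
TOPIC = "sensors/human/position"  # hypothetical topic name

client = mqtt.Client()  # paho-mqtt 1.x constructor
client.connect(BROKER_HOST, 1883)

# Example payload: position and zone as produced by the pipeline.
payload = {"x": 1.2, "y": 0.4, "z": 2.7, "zone": "bed"}
client.publish(TOPIC, json.dumps(payload))
client.disconnect()
```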

Research was carried out on the classification problem of light on vs. light off, and three related
areas were found: human activity recognition, human action classification, and pose estimation.
Human activity recognition had a number of training datasets, but they turned out to be unsuitable
for the task, since they contained readings from motion sensors and accelerometers in phones and
wristbands rather than image data. Some of the human action classification datasets, however, did
include image data. The pose estimation area was reviewed and tested; long-term testing of various
models revealed quite a few problems that ruled out their further use. Several datasets were
therefore selected from human action classification, on which the classification was planned to be
carried out.
Data for training was being prepared, but soon the focus of the task shifted from classification to
3D reconstruction. Research on this topic showed that, of all the methods, the most suitable and
interesting for the task at hand was constructing a depth map from images obtained from a single
camera, without an additional stereo camera. These maps encode the depth of the various objects in
the image as color intensity, and this information could be used further.
The primary task set for the current stage is to use depth maps to determine a person's position in
three coordinates (x, y, z) and to determine which zone the person is located in, given zones
defined by coordinate ranges. To find the person in the frame, an object detection model must be
used. Various models, the most lightweight ones, were tested and later used in the code.
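
A minimal sketch of that z and zone computation, assuming the depth map is a NumPy array of
relative depth values; the `ZONES` table and its pixel ranges are hypothetical:

```python
import numpy as np

# Hypothetical zone table: name -> (x1, x2, y1, y2) in image coordinates.
ZONES = {"bed": (0, 200, 100, 400), "door": (500, 640, 0, 480)}

def relative_z(depth_map, bbox):
    """Median depth inside the detector's (x1, y1, x2, y2) bounding box."""
    x1, y1, x2, y2 = bbox
    return float(np.median(depth_map[y1:y2, x1:x2]))

def zone_of(bbox):
    """Assign the person to a zone by the center of the bounding box."""
    x1, y1, x2, y2 = bbox
    cx, cy = (x1 + x2) // 2, (y1 + y2) // 2
    for name, (zx1, zx2, zy1, zy2) in ZONES.items():
        if zx1 <= cx < zx2 and zy1 <= cy < zy2:
            return name
    return None
```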
At the moment, a unified script has been created and runs in GPU mode: motion is detected first,
then, after its successful detection, depth maps are generated frame by frame, and then object
detection algorithms are applied. Upon successful detection of one or more people, the relative z
coordinate for each person is calculated from the depth map values inside the bounding box
returned by the detector. FPS is around 3, because the solution is heavy. In addition, an attempt was
made to connect one of the trackers from the OpenCV library, but the computing power of the
video card was not enough for that algorithm either. In the future, trackers based on SORT and
Deep SORT should be considered, since a tracker is necessary: object detection models, especially
the lighter ones, are not accurate enough on their own. The demo is still rough, requiring polishing,
revision, and the inclusion of a few more algorithms, but it shows the main feature of the task. The
motion detection stage that gates the pipeline is sketched below.
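
The script itself is not included in this report; as a sketch of the motion detection stage that gates
the rest of the pipeline, simple frame differencing with OpenCV could look like this:

```python
import cv2

cap = cv2.VideoCapture(0)
prev_gray = None

while True:
    ok, frame = cap.read()
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    gray = cv2.GaussianBlur(gray, (21, 21), 0)
    if prev_gray is None:
        prev_gray = gray
        continue
    # Difference against the previous frame; large contours mean motion.
    delta = cv2.absdiff(prev_gray, gray)
    thresh = cv2.threshold(delta, 25, 255, cv2.THRESH_BINARY)[1]
    thresh = cv2.dilate(thresh, None, iterations=2)
    contours, _ = cv2.findContours(thresh, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_SIMPLE)
    motion = any(cv2.contourArea(c) > 500 for c in contours)
    prev_gray = gray
    if motion:
        pass  # run depth estimation and object detection here
```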
In the near future, the first testing of the first version of the existing code is planned: the rooms
will be divided into zones, and observation of which zone a person is located in will be introduced.
This will show how well the current scheme solves the problem and will also expose the
vulnerabilities and inaccuracies of the models and algorithms used.
The necessary software was also installed on the Maix Dock 2, and a script that detects movement
in the frame was written and successfully launched. The models used in the demo now need to be
ported to the board, but the attempts made so far have not succeeded: the board is still quite new
and there is practically no documentation, which complicates the porting process. Support requests
have been submitted; responses are pending.
Changes starting from the 29th of March
Two test videos were recorded. One was filmed in a narrow room and shows how the system
behaves, and how the depth maps are created, in a very small space with a modest environment not
cluttered with many objects. The other was filmed in an office where one room is a continuation of
another, so the frame has greater depth, and the office is filled with a large number of various
objects. Thus the system had its first tests under different conditions. What results were obtained?
The depth map concept is viable for such tasks, but at this stage it is only a concept, and a very
rough one. The system gives positive results but also makes mistakes, which is not surprising: the
models are not perfect and require revision and additional algorithms. The following issues have
been noticed so far:

1) The model for creating depth maps has several limitations that have to be worked around. The
optimal distance is up to 10 meters (the range the model was trained on in its dataset), but it
performs poorly below 3 meters, rendering everything in approximately the same tones, so an
optimal camera placement has to be found. The second limitation: on grayscale input, when the
depth is under 3 meters, the model produces a less accurate depth map than with the usual
3-channel image. Also, depth maps will have a different tonality in different rooms, which requires
manual adjustment for the placement of objects. For example, during installation of the device and
camera, test depth maps of the room can be taken and the user can be asked, through a simple
interface, to highlight the desired zones before the main algorithms are started; a sketch of such a
setup step follows. If fully automated operation is needed, then locating the desired zone (for
example, a bed) requires segmentation models trained to find household objects (the ideal case), or
at least an object detection model to locate them.
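
As a sketch of that setup step, OpenCV's built-in ROI selector could serve as the simple interface;
the file name and zone naming are hypothetical:

```python
import cv2

# A depth map rendered as an image during installation (hypothetical file).
test_depth_map = cv2.imread("room_depth.png")

# Let the installer drag rectangles over the zones of interest;
# press Enter/Space after each, Esc to finish.
rects = cv2.selectROIs("Mark zones (bed, door, ...)", test_depth_map)
cv2.destroyAllWindows()

# Store as (x1, x2, y1, y2) ranges for the zone lookup used by the pipeline.
zones = {f"zone_{i}": (x, x + w, y, y + h)
         for i, (x, y, w, h) in enumerate(rects)}
```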

2) Object detection produces false positives on objects (jackets, T-shirts, other clothing, or similar
silhouettes) and identifies a person poorly when he or she is lying down or not visible at full
height. Here we need both to improve the model itself (a more accurate, but not overly resource-
intensive, choice) and possibly to write additional algorithms to correct its output. It would also
help to place the camera so that a person is visible at full height in the monitored areas. A partial
solution for when the detector loses the person in some frames is to add a tracker, but that also
consumes quite a lot of resources. One algorithm that helps with this problem is motion detection
tracking, which detects frame-by-frame changes: if the boxes describing the changed regions,
taking the static background into account, approximately coincide in place and size with the box
from the human detection algorithm, then when the detector's box is lost, the motion detection
boxes can be used instead (a sketch of this overlap check follows). However, a tracker is still
indispensable, especially when a person lies down and stops moving; none of these algorithms
recognizes that case.
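
A minimal sketch of the overlap check, assuming boxes as (x1, y1, x2, y2) tuples; the threshold
value is a hypothetical starting point:

```python
def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0

def person_box(od_box, motion_box, last_od_box, thresh=0.3):
    """If the detector lost the person, fall back to the motion box
    when it roughly coincides with the last known detection."""
    if od_box is not None:
        return od_box
    if motion_box and last_od_box and iou(motion_box, last_od_box) > thresh:
        return motion_box
    return None
```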

3) Neither the object detection model nor the depth map model is trained to work with black-and-
white images, so the quality is much worse in night shooting. If the initial product concept is
developed further, it would be worth training custom human detection and depth map models on
black-and-white images.

4) The best solution for detecting a person would still be segmentation, since object detection
algorithms only find a bounding rectangle around the silhouette, which includes the background
surrounding the person. This introduces an error when reading the depth values over the box.
Segmentation would not only detect a person's position but would also cut out the contour,
excluding the background. However, segmentation is the most resource-intensive of these deep
networks; a comparison of the box-based and mask-based depth estimates is sketched below.
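
The difference can be sketched as follows, assuming a binary person mask produced by a
hypothetical segmentation model:

```python
import numpy as np

def depth_from_bbox(depth_map, bbox):
    """Box-based estimate: background pixels inside the box skew the value."""
    x1, y1, x2, y2 = bbox
    return float(np.median(depth_map[y1:y2, x1:x2]))

def depth_from_mask(depth_map, mask):
    """Mask-based estimate: only the person's own pixels contribute."""
    return float(np.median(depth_map[mask > 0]))
```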
5) Speed and resource use. Both the code and the models need further optimization, since the
solution is currently very heavy: FPS on the GPU is 1-3. The situation was slightly improved by a
small optimization: the depth maps are updated only when motion detection reports movement, and
the rest of the time the previous static frame values are reused, as sketched below.
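
A sketch of that caching optimization, where `estimate_depth` stands in for the actual depth model
call:

```python
_cached_depth = None

def depth_for_frame(frame, motion, estimate_depth):
    """Re-run the heavy depth model only when motion was detected;
    otherwise reuse the last static depth map."""
    global _cached_depth
    if motion or _cached_depth is None:
        _cached_depth = estimate_depth(frame)
    return _cached_depth
```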

Further plans: porting (converting) the models to the board and launching them on the hardware.
