
2017 International Conference on Information Technology

Combination of advanced robotics and computer vision for shelf analytics in a retail store

Dr. Gopichand Agnihotram, Navya Vepakomma, Suyog Trivedi, Sumanta Laha,
Nick Isaacs, Srividya Khatravath, Pradeep Naik, Rajesh Kumar
Technovation Center, Wipro Sight, Wipro Limited, Electronic City, Bangalore – 560100, India
e-mail: {Gopichand.agnihotram, Navya.vepakomma, Suyog.trivedi, Sumanta.laha, Nick.isaacs, Sreevidya.khatravath, Pradeep.naik, Rajesh.kumar133}@wipro.com

Abstract— Large-scale retail store associates are constantly faced with the challenge of managing store operations smoothly and maintaining the products in full stock on the product support devices (retail shelves). Keeping track of the quantities of each individual Stock Keeping Unit (retail product), replenishing units when depleted, and identifying and replacing misplaced products are a few of the tasks that require continuous monitoring and a large amount of manual effort. The solution presented here aims at automating these tasks performed by the store associates, thereby reducing the manual effort. The solution uses a Double Robot to patrol the store over a fixed path and capture images of the retail shelves in real time. These images are processed by the developed pipeline to address retail store challenges such as stock-outs and misplaced products. An alert-generating mechanism has also been incorporated into the solution to notify the store associate via email or a text message when a product is completely out of stock or misplaced. The solution combines classification techniques and deep learning techniques with computer vision algorithms to automate these processes in retail stores.

Index Terms— Deep learning, Double Robot, Machine learning, Planogram Compliance, Product stock out, Shelf space monitoring

I. INTRODUCTION

This paper addresses autonomous shelf space monitoring, real-time alerts for potential stock-outs, and automated merchandising compliance using a Double Robot for image capture. An end-to-end solution is presented, with automatic image capture, product detection and recognition, and alert generation for stock-outs and misplaced products. The main retail problems addressed here are reporting the stock-out of products and providing a planogram compliance measure to identify misplaced products.

We have trained the telepresence robot to intelligently capture the shelf images without manual intervention. The Double Robot has been programmed to follow a line and automatically capture images of the retail shelves. Using these images, product detection and identification are carried out with image processing algorithms and supervised learning methods respectively. The actual product placement is compared with a reference planogram to calculate compliance metrics and to label products that are out of stock. The subsequent sections cover the literature survey, system architecture, solution description, results and discussion, summary, and future work.

II. LITERATURE SURVEY

Many research organizations have proposed solutions and filed patents pertaining to shelf monitoring and replenishment. Image processing algorithms on shelf images and sensory environments such as RFID have mainly been used to determine on-shelf availability. In one approach, shelves are detected from hand-held camera images using edge detection and equal-angle wedges centered at the vanishing point [7]. Size- and scale-invariant product recognition and classification based on product package design and arrangement has been used for monitoring retail store shelves [8]. Object recognition algorithms are used for recognizing products: paper [13] places a camera on the shelf that continuously captures shelf images, which are then processed, while paper [2] explains automating store shelf availability with RFID tags and different sensors that track products within their coverage range. An object recognition algorithm [11] with sampled feature pyramids for upscaled and downscaled images, without compromising the performance of the system, provides reports on display compliance in a retail store and raises alerts when specified retail store conditions are met or exceeded. Multiscale and multiclass object recognition [5] with feature contours and chamfer matching has been used with a cascaded sliding-window classifier. The below-mentioned patents have been filed in the shelf analytics space. Patent [1] discloses an image-processing-based technology that extracts the present planogram, compares it with an ideal or previous planogram, and calculates performance measures based on the differences; additionally, a user can be alerted based on the extracted planogram. Patent [3] is an imaging-based system to identify the product as well as the quantity of products in stock using computer vision algorithms.

III. SYSTEM ARCHITECTURE

The overall system architecture is given in Figure 1 and Figure 2 below. In this architecture, the Double Robot (a telepresence robot) is used to capture images of the shelves while going through the aisles of the retail store. The images are stored in a temporary location, from where image metadata is created and pushed to a Kafka topic for further processing. The image metadata is stored in
978-1-5386-2924-6/17 $31.00 © 2017 IEEE
DOI 10.1109/ICIT.2017.13
the repository. Spark Streaming hosts the prebuilt classifier, listens to the Kafka topic for image metadata, and processes the corresponding image.

The self-balancing feature of the Double Robot is used while capturing the images of the shelf. The task is controlled through a driver program which consists of a user interface and a task scheduler. The driver program processes the video feed, detects predefined color markings, and issues instructions to follow the line, capture an image, and upload it to the image landing server. The driver program interacts with the Double Robot controller SDK to control the movement of the robot.

Figure 1: System Architecture

Figure 2: Double Robot – Application Architecture

IV. IMAGE CAPTURING – CROPPING

The solution, beginning with image capture and ending with the cropping of products for stock-out and planogram non-compliance analysis, is explained below.

A. Image Capture with Double Robot

The Double Robot is a telepresence robot developed by Double Robotics [15]. It features a self-balancing base and an iPad mounted at the top to control the base.

We used the SDK provided by Double along with OpenCV to create a driver program that steers the robot based on visual cues. The Double SDK provides access to commands that enable the movement of the robot, and OpenCV is used to process the rear camera feed. Each frame is analyzed to decide the robot's actions. The iPad is mounted upside down on the robot, causing the rear-facing camera to point at the ground. A mirror on the mount in front of the iPad camera, placed at an angle, enables the driver to "see" approximately 1.5 ft of the ground ahead (see Figure 3).

The yellow line is the path the robot must follow. To detect this line, we defined a window a few inches in front of the robot. This window is split into three regions – left, center, and right. The concentration of yellow pixels is calculated for each region, and the region with the highest concentration decides the direction of the robot. If the center region has the highest concentration, the driver instructs the robot to move forward (see Figure 4); similarly, it turns left or right depending on which region has the higher concentration of yellow. Thus, the robot adjusts itself when it goes off course, based on the direction of the yellow line. If the window has a high concentration of red pixels (a marker indicating a shelf's position), the driver tells the robot to turn and capture the shelf image. If none of the regions contain yellow, the robot rotates to search for a region with a high yellow concentration.

Figure 3: Driver view

We placed markers along the line to indicate the locations where images should be captured. Once the driver sees a marker, it stops, faces the shelf, captures the image (see Figure 5), and uploads it to the server for processing in real time. When the capture is complete, the driver rotates the robot to search for the line so it can start moving again and capture the other shelf images.

Figure 4: Robot following the Line

Figure 5: Capturing image of the shelf
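The three-region decision rule described above can be sketched in a few lines. This is a minimal, pure-Python illustration rather than the authors' driver code: the RGB thresholds, window size, and the 50% red-marker cutoff are assumptions made for the example; a real driver would threshold in HSV with OpenCV and pass the resulting action to the Double SDK movement commands.

```python
def classify_pixel(r, g, b):
    """Crude RGB test: 'yellow' = strong red and green, weak blue;
    'red' = strong red only. Thresholds are illustrative."""
    if r > 150 and g > 150 and b < 100:
        return "yellow"
    if r > 150 and g < 100 and b < 100:
        return "red"
    return "other"


def decide(window):
    """window: 2-D list of (r, g, b) pixels for the strip just ahead
    of the robot. Returns 'capture', 'forward', 'left', 'right', or 'search'."""
    width = len(window[0])
    counts = {"left": 0, "center": 0, "right": 0}
    red = 0
    for row in window:
        for x, (r, g, b) in enumerate(row):
            color = classify_pixel(r, g, b)
            if color == "red":
                red += 1
            elif color == "yellow":
                # Split the window into three vertical regions.
                if x < width // 3:
                    counts["left"] += 1
                elif x < 2 * width // 3:
                    counts["center"] += 1
                else:
                    counts["right"] += 1
    total = len(window) * width
    if red > 0.5 * total:          # red marker: stop and capture the shelf
        return "capture"
    if sum(counts.values()) == 0:  # line lost: rotate and search for it
        return "search"
    best = max(counts, key=counts.get)
    return "forward" if best == "center" else best
```

Each returned action would then map onto a robot command (drive forward, turn, or rotate in place) issued through the controller SDK.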

B. Detection of Product Bounding Boxes

The series of images captured by the Double Robot are stitched using the OpenCV stitching algorithm to form an integral image. The integral image is then processed using a specified sequence of computer vision algorithms to detect the various products present on the shelf.

Figure 6: Image of a retail shelf

C. Image Correction and Noise Removal

The image is first checked for blur, which is then corrected. As the image is constructed from a series of images taken while the robot is in motion, some amount of blur correction and noise removal is required. To de-blur the image, a blind deconvolution technique is applied, wherein an impulse response, or point spread function, indicating the amount of blur is first estimated [12]. Deconvolving the image with this calculated function eliminates the blur. In a similar fashion, the image is cleansed of any noise that may be present by calculating the peak signal-to-noise ratio (PSNR) and suppressing the areas of high noise [12]:

PSNR = 10 log10(peakval^2 / MSE)

where MSE is the mean squared error.

D. Horizontal Row Detection

The enhanced and cleansed image is taken forward for product bounding box detection. Initially, the individual rows of the shelf are identified in order to find the appropriate regions of interest in which the products are to be detected. Shelf row detection is performed mainly using a combination of color information and line detection. Continuous color regions are identified by merging neighboring pixels of similar intensity. From the lines present in the image, those coinciding with the regions of continuous color are selected as the rows of the shelf.

E. Vertical Separation between the Boxes

Boundary extraction for each product is carried out by constructing a histogram for each row of the shelf image. The peaks and troughs indicate the separation between products. Vertical line detection [12] is also applied to each row, and by cross-validating the lines against the troughs of the histogram, the edges of the individual products are obtained. To identify vertically stacked products, horizontal line detection is performed within each bounding box. Upon the completion of this process, the bounding box for each individual item on the shelf is detected.

Figure 7: Product detection

F. Product Cropping

Each product is cropped and extracted based on its bounding box, and the product is identified using image classification.

V. PRODUCT CLASSIFICATION – HOG AND SVM

The cropped products from the above step are passed on to the classification module to identify the product. Product classification has been experimented with using machine learning (HOG and SVM) and deep learning using a Convolutional Neural Network (CNN).

A. Image Repository

Images of different products (products available in retail stores, from different categories, for example cereals and soft/energy drinks) were collected from different sources such as ImageNet and retail stores. Images have also been collected for empty shelf areas. The database has been created for 1000 different retail product categories, including an empty-shelf category. The repository contains at least 100 images of each retail product, captured at different angles and in different lighting conditions.

Figure 8: Sample product images in repository

B. Machine Learning – Using HOG and SVM

Using Histogram of Oriented Gradients (HOG) feature extraction on each image, a feature vector has been computed.
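As a concrete sketch of this feature extraction, the following is a deliberately simplified, pure-Python version of the gradient-orientation histogram: [-1, 0, 1] derivative masks, unsigned orientation binning over 0 to 180 degrees, and a per-cell L2 normalization. It illustrates the idea only; the cell size and bin count are common defaults rather than the experiment's exact parameters, and the full method additionally normalizes over multi-cell blocks, as the following subsections describe.

```python
import math

def hog_descriptor(img, cell=8, bins=9):
    """Simplified HOG over a 2-D list of grayscale intensities:
    per-pixel [-1, 0, 1] gradients, unsigned orientation histogram
    per cell, L2-normalized per cell (no block grouping)."""
    h, w = len(img), len(img[0])
    descriptor = []
    for cy in range(0, h - cell + 1, cell):
        for cx in range(0, w - cell + 1, cell):
            hist = [0.0] * bins
            for y in range(cy, cy + cell):
                for x in range(cx, cx + cell):
                    # Centered derivative masks, clamped at the borders.
                    gx = img[y][min(x + 1, w - 1)] - img[y][max(x - 1, 0)]
                    gy = img[min(y + 1, h - 1)][x] - img[max(y - 1, 0)][x]
                    magnitude = math.hypot(gx, gy)
                    angle = math.degrees(math.atan2(gy, gx)) % 180.0  # unsigned
                    hist[min(int(angle / (180.0 / bins)), bins - 1)] += magnitude
            # Per-cell L2 normalization with a small epsilon.
            norm = math.sqrt(sum(v * v for v in hist)) + 1e-6
            descriptor.extend(v / norm for v in hist)
    return descriptor
```

A 32x32 input with 8-pixel cells yields a 16-cell, 144-dimensional vector, which would then be fed to the multi-class SVM.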

Each image has been resized to 32x32 pixels prior to HOG feature extraction. The feature vectors and product labels have been trained using a multi-class support vector machine (SVM) [10].

C. Histogram of Oriented Gradients

HOG is a feature descriptor which counts occurrences of gradient orientations in localized portions of an image. Local object appearance and shape within an image can be described by the distribution of intensity gradients or edge directions [4]. The image is divided into small connected regions called cells, and for the pixels within each cell, a histogram of gradient directions is compiled. The descriptor is the concatenation of these histograms. For improved accuracy, the local histograms can be contrast-normalized by calculating a measure of the intensity across a larger region of the image, called a block, and then using this value to normalize all cells within the block. This normalization results in better invariance to changes in illumination and shadowing.

The key steps in calculating a HOG descriptor are gradient computation, orientation binning, calculating descriptor blocks, and block normalization [9].

D. Gradient Computation

The gradient is computed by applying a 1-D centered, point discrete derivative mask in one or both of the horizontal and vertical directions. Specifically, this method requires filtering the color or intensity data of the image with the filter kernels [-1, 0, 1] and [-1, 0, 1]^T.

E. Orientation Binning

The second step of the calculation is creating the cell histograms. Each pixel within a cell casts a weighted vote for an orientation-based histogram channel based on the values found in the gradient computation. The cells themselves can be either rectangular or radial in shape, and the histogram channels are evenly spread over 0 to 180 degrees or 0 to 360 degrees, depending on whether the gradient is "unsigned" or "signed".

F. Descriptor Blocks

The HOG descriptor is the concatenated vector of the components of the normalized cell histograms from all the block regions. These blocks typically overlap, meaning that each cell contributes more than once to the final descriptor. Two main block geometries exist: rectangular R-HOG blocks and circular C-HOG blocks.

G. Block Normalization

Let v be the non-normalized vector containing all histograms in a given block, ||v||_k be its k-norm for k = 1, 2, and e be some small constant (the exact value, hopefully, is unimportant). Then the normalization factor can be one of the following:

L2 norm: f = v / sqrt(||v||_2^2 + e^2)
L2-hys: the L2 norm, followed by clipping
L1 norm: f = v / (||v||_1 + e)
L1-sqrt: f = v / sqrt(||v||_1 + e)

Visualization of HOG Features:

Figure 9: HOG features visualization

For each product extracted from the product detection module, HOG features with the same parameters are extracted. The product is identified by prediction using the trained SVM model.

Table 1: HOG feature extraction and SVM training details

Number of Classes: 1000
Total Training Samples: 100000
System Configuration: 64-bit, 8 GB RAM system
Validation Accuracy: 71%

VI. PRODUCT CLASSIFICATION – DEEP LEARNING

In recent years, deep learning algorithms have been extensively experimented with for image classification and object detection use cases. The object detection capability of deep neural networks is used here to recognize the different products in the retail scenario.

A. Creating Training Data

To train the convolutional network for the retail scenario, the product images from the image repository are used along with appropriate labels. We randomly sample 90% of the images for the training dataset and 10% for the validation dataset. All images are resized to 32*32 (width*height) pixels during sampling.

Sampling provides the training set T and validation set V as:

T = {(l_1, t_1), (l_2, t_2), ..., (l_n, t_n)}
V = {(l_1, v_1), (l_2, v_2), ..., (l_n, v_n)}

where

l_i = image label, i = 1, 2, ..., n
t_i = training image pixel data (a one-dimensional vector of size 3072 (32 * 32 * 3)), i = 1, 2, ..., n
v_i = validation image pixel data (a one-dimensional vector of size 3072 (32 * 32 * 3)), i = 1, 2, ..., n

This method uses the Caffe framework [6] to implement the convolutional neural network (CNN) model. The model is utilized at a later stage for the detection of products placed in retail stores. This paper uses the Python implementation for the CNN.

B. Model Description

The convolutional neural network architecture, learning algorithm, and hyperparameter settings are described below.

C. Convolutional Neural Networks

A Convolutional Neural Network (CNN) trained via backpropagation is utilized in the current approach to detect the different products placed on the product support devices. Convolutional networks can be trained with both supervised and unsupervised methods; this approach trains the network in a supervised manner. A CNN consists of multiple layers of neurons (input, hidden, and output layers), which are arranged in 3 dimensions: width, height, and depth.

D. Network Architecture

The network uses three main types of layers: the convolution layer, the pooling layer, and the fully-connected layer [6].

INPUT [32x32x3] – This layer holds the raw pixel values of the image, in this case an image of width 32, height 32, and three color channels R, G, B.
CONV layer – This layer applies the convolution operator on the input data.
RELU (rectified-linear and leaky) layer – This layer applies an activation function to the input data.
POOL layer – This layer performs a down-sampling operation and is connected to the CONV layer.
FC (fully-connected) layer – This layer computes the final class scores for all the classes, resulting in a volume of size [1x1x1000], where each of the 1000 numbers corresponds to the class score for that particular class.

E. Learning Algorithm

Given the training image dataset T, we train the CNN model to discriminate between 1000 product classes. ReLU / leaky-ReLU activation layers are used in the forward pass, whereas loss minimization is done using stochastic gradient descent (SGD) with momentum:

V_{t+1} = μ V_t - α ∇L(W_t)
W_{t+1} = W_t + V_{t+1}

where:
∇L(W) = gradient of the loss with respect to the weights W
α = learning rate
μ = momentum
V_t = previous weight update
W_t = current weights

F. Hyperparameter Setting

We started the training with a learning rate of 0.0001, following the 'step learning' policy. This policy reduces the learning rate by the factor 'gamma' after the given step size (number of iterations). We set gamma = 0.1 and a step size of 1000. The weight bias has been initialized to 0.0005. Training gives a validation accuracy of 83%. For computation purposes we benchmarked on 8 GB RAM machines, and we used Spark and Kafka for real-time streaming and product classification.

Table 2: Deep Learning Training details

Number of Classes: 1000
Number of Iterations: 10000
System Configuration: 64-bit, 8 GB RAM system
Validation Accuracy: 83%

VII. FINDING OUT-OF-STOCK PRODUCTS AND CALCULATING THE PLANOGRAM COMPLIANCE

By performing product classification with the HOG-and-SVM and deep learning approaches on each detected product, the product is identified. Areas of the shelf where products are out of stock are also identified, as the model has been trained with images of empty shelf areas. A planogram is constructed from the identified products, including the dimensions of each product and the exact position where it was found. The areas where products are unavailable are likewise recorded with the position and dimensions of the empty area. The actual data is compared with the reference planogram data, and accordingly the names of the products that are out of stock are identified. The position compliance ratio, the product facing compliance ratio, and the overall planogram compliance for each shelf are also generated. In cases of product stock-outs and very low compliance, alerts are generated. In all cases the generated compliance metrics are displayed on the dashboard.

A. Reporting and Alert Generation

Stock-outs and low-compliance cases are sent as email alerts to the store managers. For experimental purposes, Elastic Email [14] was used as the email service provider. A web-based dashboard has also been made available, which plots the number of stock-outs over time as well as the list of products that are out of stock. The dashboard displays the product availability, stock-outs, and compliance metrics at any given point of time.

VIII. RESULTS AND DISCUSSIONS

The two problems mentioned in the introduction have been solved in real time using the architecture given in Section III. The Double Robot captures images in real time and uploads the data into a shared location on the network server. The Java listener described in Section III is configured in such a way that once an image gets uploaded, it generates metadata for that
specific image and uploads the metadata to the Kafka topic. Kafka is a distributed messaging system with fast and highly scalable capability. The Kafka server has been configured with 3 nodes, and each topic has three partitions with a replication factor of two. Here the Java listeners act as the message producers for the system, and Spark Streaming works as the consumer.

Spark is a batch processing platform, whereas Spark Streaming is a near-real-time processing tool that runs on top of the Spark cluster, here with 3 nodes. The Spark Streaming job collects the data through a Spark DStream from the Kafka topic and processes the image metadata by fetching the image data from the data repository. The pre-trained HOG and multi-class SVM model is present in Spark, which helps in classifying the product images coming through Kafka into different categories. For the stock-out problem, as explained in Section IV, using the bounding boxes, the planogram, and product classification from the predefined model in Spark, the products that are out of stock are identified and alerts are sent out in real time. For the issue of misplaced products, the planogram and the Spark pre-trained model are used, and accordingly the misplaced products are reported in real time. The planogram compliance is also calculated.

We have also developed a deep learning based solution for product classification which helps to address the stock-out and compliance problems.

We have deployed this solution for multiple retail giants in the US and have reported stock-outs and compliance in real time, with good accuracy. The table below compares the stock-out and compliance accuracy of the two models for one store.

Table 3: Testing Results – comparison

                         Machine Learning (HOG + SVM)   Deep Learning (CNN)
Stock Out                78%                            85%
Planogram Compliance     70%                            83%

From the testing, it has been observed that using deep learning techniques provides a higher level of accuracy for product classification. The deep learning technique provides about 85% accuracy, compared to machine learning, which provides around 70-80% accuracy. However, as deep learning is computationally very expensive and time consuming, machine learning methods can be explored as the next suitable alternative. Increasing the training data and further fine-tuning of the training parameters can improve the accuracy in both the machine learning and deep learning cases.

IX. SUMMARY

The end-to-end solution described above has been proposed to automate product stock-out checks and planogram compliance checks, which can aid retail store associates in maintaining store operations smoothly. The Double Robot follows a specified line and captures the shelf images. Product detection algorithms are applied on the images and each product is extracted. Machine learning and deep learning techniques have been used to classify each individual product, with deep learning providing the higher accuracy. The actual product placement is compared with the reference planogram details, and accordingly the planogram compliance metric is calculated. The dashboard provides a live update on the current stock and planogram compliance, and an additional alert-generating mechanism has been set up to provide stock-out alerts. Thereby, the store associates can view the alerts and the dashboard in order to handle stock-out situations and planogram non-compliance.

The same technology can be applied to other product management scenarios, such as back-store inventories as well as warehouses, to keep a check on product availability and to manage product orders.

X. FUTURE WORK

Image capture mechanisms can be modified to use other devices, or the Double Robot can be improved to perform shelf image capture without the need for a physical line. Product classification using machine learning and deep learning can be improved by increasing the size of the image repository and providing further product samples for training. Fine-tuning of training and testing parameters can also be experimented upon to provide better accuracies.

The use of Bluetooth beacons to guide the robot around the store, and of additional depth sensors to obtain accurate product counts, is being experimented on. Using a correlation between sensor data and image data, the analysis of stock-outs and planogram compliance can be made more accurate.

REFERENCES

[1] Agata Opalach, Andrew Fano, Fredrik Linaker, Robert Bernard (Robin) Groenevelt, "Planogram extraction based on image processing", US 8189855 B2.
[2] Claire Swedberg, "ShelfX Unveils Store Shelves for Automating Purchases", Rfidjournal.com, 2015.
[3] Craig Hamilton, Wayne Spencer, Alexander Ring, "Method and system for automatically measuring retail store display compliance", US 8429004 B2.
[4] Histogram of Oriented Gradients, https://en.wikipedia.org/wiki/Histogram_of_oriented_gradients
[5] Jamie Shotton, Andrew Blake, Roberto Cipolla, "Multiscale Categorical Object Recognition Using Contour Fragments", IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 30, no. 7, pp. 1270-1281, July 2008.
[6] Yangqing Jia, Evan Shelhamer, Jeff Donahue, Sergey Karayev, Jonathan Long, Ross Girshick, Sergio Guadarrama, Trevor Darrell, "Caffe: Convolutional Architecture for Fast Feature Embedding", 2014, http://caffe.berkeleyvision.org/
[7] Jian Fan, Tong Zhang, "Shelf detection via vanishing point and radial projection", IEEE International Conference on Image Processing (ICIP), 2014.
[8] M. Marder, S. Harary, A. Ribak, Y. Tzur, S. Alpert, A. Tzadok, "Using image analytics to monitor retail store shelves", IBM Journal of Research and Development, vol. 59, no. 2, pp. 547-588, Mar. 2015.
[9] Navneet Dalal, Bill Triggs, "Histograms of Oriented Gradients for Human Detection", 2005.
[10] P. Viola and M. Jones, "Robust real-time object detection", Intl. J. Computer Vision, vol. 57, no. 2, pp. 137-154, 2004.
[11] Piotr Dollár, Ron Appel, Serge Belongie, Pietro Perona, "Fast Feature Pyramids for Object Detection", IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 36, no. 8, Aug. 2014.
[12] Rafael C. Gonzalez, Richard Eugene Woods, Digital Image Processing, Pearson Education India, 3rd Edition, 2009.
[13] Rakesh Satapathy, Srikanth Prahlad, Vijay Kaulgud, "Smart Shelfie – Internet of Shelves: For Higher On-Shelf Availability".
[14] Elastic Email, https://elasticemail.com
[15] Double Robotics, http://www.doublerobotics.com/

