Combination of advanced robotics and computer vision for shelf analytics in a retail store

Dr. Gopichand Agnihotram, Navya Vepakomma, Suyog Trivedi, Sumanta Laha, Nick Isaacs, Srividya Khatravath, Pradeep Naik, Rajesh Kumar
Technovation Center, Wipro Sight, Wipro Limited, Electronic City, Bangalore – 560100, India
e-mail: {Gopichand.agnihotram, Navya.vepakomma, Suyog.trivedi, Sumanta.laha, Nick.isaacs, Sreevidya.khatravath, Pradeep.naik, Rajesh.kumar133}@wipro.com
B. Detection of Product Bounding Boxes

The series of images captured by the Double robot are stitched using the OpenCV stitching algorithm to form an integral image. The integral image is then processed using a specified sequence of computer vision algorithms to detect the various products present on the shelf.

Figure 6: Image of a retail shelf

C. Image Correction and Noise Removal

The image is initially checked for any amount of blur, which is then corrected. As the image is constructed from a series of images taken while the robot is in motion, some amount of blur correction and noise removal is required. To de-blur the image, a blind deconvolution technique is applied, wherein an impulse response or point spread function indicating the amount of blur is first calculated [12]. De-convoluting the image with this calculated function eliminates the blur from the image. In a similar fashion, the image is cleansed of any noise that may be present by calculating the peak signal-to-noise ratio (PSNR) and suppressing the areas of high noise [12]:

PSNR = 10 log10(peakval^2 / MSE)

where MSE is the mean squared error and peakval is the maximum possible pixel value.
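The PSNR check described in this section can be sketched in a few lines of NumPy. This is an illustrative version only; the paper does not specify its implementation, and the example noise level below is an assumption:

```python
import numpy as np

def psnr(reference: np.ndarray, image: np.ndarray, peakval: float = 255.0) -> float:
    """PSNR = 10 * log10(peakval^2 / MSE); low values flag noisy regions."""
    mse = np.mean((reference.astype(np.float64) - image.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")   # identical images, no noise at all
    return 10.0 * np.log10(peakval ** 2 / mse)

# Example: a flat grey patch versus the same patch with additive Gaussian noise.
rng = np.random.default_rng(0)
clean = np.full((32, 32), 128, dtype=np.uint8)
noisy = np.clip(clean + rng.normal(0.0, 10.0, clean.shape), 0, 255).astype(np.uint8)
print(psnr(clean, noisy))  # roughly 28 dB here; the lower the value, the noisier
```

Regions whose PSNR falls below a chosen threshold would be the candidates for noise suppression.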
D. Horizontal Row Detection

The enhanced and cleansed image is taken forward for product bounding box detection. Initially, the individual rows of the shelf are identified in order to find the appropriate regions of interest in which the products are to be detected. Shelf row detection is performed mainly using a combination of color information and line detection. Continuous color regions are identified by merging neighboring pixels of similar intensity. Based on the lines present in the image, the lines coinciding with the regions of continuous color are selected as the rows of the shelf.
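As a rough illustration of the row-detection idea (grouping neighboring image rows of similar intensity and treating sharp intensity changes as shelf-row boundaries), here is a NumPy sketch; the threshold and the synthetic image are assumptions, not the paper's actual parameters:

```python
import numpy as np

def detect_shelf_rows(gray: np.ndarray, threshold: float = 20.0) -> list:
    """Return row indices where the mean intensity of adjacent image rows
    changes sharply - candidate boundaries between shelf rows."""
    row_means = gray.mean(axis=1)            # continuous-color intensity profile
    jumps = np.abs(np.diff(row_means))       # intensity change between adjacent rows
    return [int(i) + 1 for i in np.flatnonzero(jumps > threshold)]

# Synthetic shelf: two dark shelf boards (rows 40-44 and 80-84) on a light background.
shelf = np.full((120, 200), 200.0)
shelf[40:45, :] = 30.0
shelf[80:85, :] = 30.0
print(detect_shelf_rows(shelf))  # [40, 45, 80, 85]
```

In the actual pipeline these intensity-based boundaries would be cross-checked against detected lines, as described above.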
E. Vertical Separation between the Boxes

Boundary extraction for each product is carried out by constructing a histogram for each row of the shelf image. The peaks and troughs indicate the separation between products. Vertical line detection [12] is also applied on each row, and by cross-validating the detected lines against the troughs of the histogram, the edges of the individual products are obtained. To identify vertically stacked products, horizontal line detection is performed within each bounding box. Upon the completion of this process, the bounding box for each individual item on the shelf is detected.

F. Product Cropping

Each product is cropped and extracted based on its bounding box, and the product is identified using image classification.

V. PRODUCT CLASSIFICATION – HOG AND SVM

The cropped products from the above step are passed on to the classification module to identify the product. Product classification has been experimented with using machine learning (HOG and SVM) and deep learning using a Convolutional Neural Network (CNN).

A. Image Repository

Images of different products (products which are available in retail stores, from different categories, for example cereals, soft/energy drinks etc.) were collected from different sources such as ImageNet and retail stores. Images have also been collected for empty shelf areas. The database has been created for 1000 different retail product categories, including an empty-shelf category. The repository contains at least 100 images of each retail product, captured at different angles and in different lighting conditions.

Figure 8: Sample product images in repository

B. Machine Learning – Using HOG and SVM

Using Histogram of Oriented Gradients (HOG) feature extraction on each image, a feature vector has been computed.
Each image has been resized to 32x32 pixels prior to HOG feature extraction. The feature vectors and product labels have been used to train a multi-class support vector machine (SVM) [10].

C. Histogram of Oriented Gradients

HOG is a feature descriptor which counts occurrences of gradient orientation in localized portions of an image. Local object appearance and shape within an image can be described by the distribution of intensity gradients or edge directions [4]. The image is divided into small connected regions called cells, and for the pixels within each cell, a histogram of gradient directions is compiled. The descriptor is the concatenation of these histograms. For improved accuracy, the local histograms can be contrast-normalized by calculating a measure of the intensity across a larger region of the image, called a block, and then using this value to normalize all cells within the block. This normalization results in better invariance to changes in illumination and shadowing.

The key steps in calculating a HOG descriptor are gradient computation, orientation binning, calculating descriptor blocks and block normalization [9].
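To make these steps concrete, here is a compact, unoptimized HOG sketch in NumPy. The 8x8 cells, 9 unsigned bins and single L2-normalized block are common Dalal-Triggs-style defaults, assumed here since the paper does not state its exact parameters:

```python
import numpy as np

def hog_descriptor(gray, cell=8, bins=9):
    """Minimal HOG: gradients -> orientation binning per cell -> L2 block norm."""
    g = gray.astype(np.float64)
    # Gradient computation with the centered [-1, 0, 1] mask.
    gx = np.zeros_like(g); gy = np.zeros_like(g)
    gx[:, 1:-1] = g[:, 2:] - g[:, :-2]
    gy[1:-1, :] = g[2:, :] - g[:-2, :]
    mag = np.hypot(gx, gy)
    ang = np.rad2deg(np.arctan2(gy, gx)) % 180.0      # unsigned: 0-180 degrees
    # Orientation binning: each pixel votes, weighted by gradient magnitude.
    ch, cw = g.shape[0] // cell, g.shape[1] // cell
    hist = np.zeros((ch, cw, bins))
    for i in range(ch):
        for j in range(cw):
            a = ang[i*cell:(i+1)*cell, j*cell:(j+1)*cell]
            m = mag[i*cell:(i+1)*cell, j*cell:(j+1)*cell]
            idx = np.minimum((a / (180.0 / bins)).astype(int), bins - 1)
            for b in range(bins):
                hist[i, j, b] = m[idx == b].sum()
    # Block normalization (one block covering the image, L2 norm).
    v = hist.ravel()
    return v / np.sqrt((v ** 2).sum() + 1e-6)

desc = hog_descriptor(np.tile(np.arange(32.0), (32, 1)))  # 32x32 horizontal ramp
print(desc.shape)  # (144,) - 4x4 cells x 9 bins
```

The resulting fixed-length vector is what gets paired with a product label and fed to the SVM trainer.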
D. Gradient Computation

The gradient is computed by applying a 1-D centered, point discrete derivative mask in one or both of the horizontal and vertical directions. Specifically, this method requires filtering the color or intensity data of the image with the filter kernels [-1, 0, 1] and [-1, 0, 1]^T.

E. Orientation Binning

The second step of the calculation is creating the cell histograms. Each pixel within a cell casts a weighted vote for an orientation-based histogram channel based on the values found in the gradient computation. The cells themselves can be either rectangular or radial in shape, and the histogram channels are evenly spread over 0 to 180 degrees or 0 to 360 degrees, depending on whether the gradient is "unsigned" or "signed".

F. Descriptor Blocks

The HOG descriptor is the concatenated vector of the components of the normalized cell histograms from all the block regions. These blocks typically overlap, meaning that each cell contributes more than once to the final descriptor. Two main block geometries exist: rectangular R-HOG blocks and circular C-HOG blocks.

G. Block Normalization

Let v be the non-normalized vector containing all histograms in a given block, let ||v||_k be its k-norm for k = 1, 2, and let e be some small constant (the exact value, hopefully, is unimportant). Then the normalization factor can be one of the following:

L2 norm: f = v / sqrt(||v||_2^2 + e^2)
L2-hys: the L2 norm followed by clipping
L1 norm: f = v / (||v||_1 + e)
L1 sqrt: f = sqrt(v / (||v||_1 + e))

Figure 9: HOG features visualization

For each product extracted from the product detection module, HOG features with the same parameters are extracted. The product is identified by prediction using the trained SVM model.

Table 1: HOG feature extraction and SVM training details

Number of Classes: 1000
Total Training Samples: 100000
System Configuration: 64-bit, 8GB RAM system
Validation Accuracy: 71%

VI. PRODUCT CLASSIFICATION – DEEP LEARNING

In recent years, deep learning algorithms have been extensively experimented with for image classification and object detection use cases. The object detection capability of deep neural networks is being developed here to recognize different products in the retail scenario.

A. Creating Training Data

To train the convolutional network for the retail scenario, the product images from the Image Repository are used along with appropriate labels. We randomly sample 90% of the images for the training dataset and 10% for the validation dataset. All the images are resized to 32*32 (width*height) pixels during sampling.

Sampling provides the Training Set and Validation Set as:

T = {(l_1, t_1), (l_2, t_2), ..., (l_n, t_n)}
V = {(l_1, v_1), (l_2, v_2), ..., (l_n, v_n)}
where

l_i = image label, i = 1, 2, ..., n
t_i = training image data (a vector of size 3072 (32 * 32 * 3)), i = 1, 2, ..., n
v_i = validation image data (a vector of size 3072 (32 * 32 * 3)), i = 1, 2, ..., n

This method uses the Caffe framework [6] to implement the convolutional neural network (CNN) model. This model is utilized at a later stage for detection of products placed in retail stores. This paper utilizes the Python implementation for the CNN.
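The 90/10 sampling into training and validation sets described above can be sketched as follows (NumPy only; the repository contents and labels here are stand-ins for illustration):

```python
import numpy as np

rng = np.random.default_rng(42)

# Stand-in repository: 100 images flattened to 3072-vectors (32*32*3), with labels.
images = rng.integers(0, 256, size=(100, 32 * 32 * 3))
labels = rng.integers(0, 10, size=100)

# Randomly sample 90% for training and 10% for validation.
perm = rng.permutation(len(images))
split = int(0.9 * len(images))
train_idx, val_idx = perm[:split], perm[split:]
T = list(zip(labels[train_idx], images[train_idx]))   # training set {(l_i, t_i)}
V = list(zip(labels[val_idx], images[val_idx]))       # validation set {(l_i, v_i)}
print(len(T), len(V))  # 90 10
```

Shuffling before splitting keeps the two sets disjoint while preserving the label distribution on average.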
B. Model Description

The convolutional neural network architecture, the learning algorithm and the hyper-parameter settings are described below.

C. Convolutional Neural Networks

A Convolutional Neural Network (CNN) trained via backpropagation is utilized in the current approach to detect the different products placed on the product support devices. Convolutional training can be used in both supervised and unsupervised methods; this approach trains the network in a supervised manner. A CNN consists of multiple layers of neurons (input, hidden and output layers), which are arranged in 3 dimensions: width, height and depth.

D. Network Architecture

The network uses three main types of layers: the Convolution Layer, the Pooling Layer and the Fully-Connected Layer [6].

INPUT [32x32x3] – This layer holds the raw pixel values of the image, in this case an image of width 32, height 32, and three color channels R, G, B.

CONV layer – This layer applies the convolution operator on the input data.

RELU (Rectified-Linear and Leaky) layer – This layer applies an activation function to the input data.

POOL layer – This layer performs a down-sampling operation. It is connected to the Conv layer.

FC (fully-connected) layer – This layer computes the final class scores for all the classes, resulting in a volume of size [1x1x1000], where each of the 1000 numbers corresponds to the class score for that particular class.
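The layer stack described above can be illustrated with a small shape-propagation sketch in pure Python. The kernel size, filter count, stride and padding below are assumptions, since the paper does not list them:

```python
# Propagate an input volume of [32, 32, 3] through the layer types described above.
def conv(shape, filters, kernel=3, stride=1, pad=1):
    """CONV: spatial size from the standard convolution arithmetic."""
    h, w, _ = shape
    out = (h - kernel + 2 * pad) // stride + 1
    return (out, (w - kernel + 2 * pad) // stride + 1, filters)

def pool(shape, size=2, stride=2):
    """POOL: 2x2 down-sampling halves the spatial dimensions."""
    h, w, d = shape
    return ((h - size) // stride + 1, (w - size) // stride + 1, d)

def fc(shape, classes=1000):
    """FC: one score per class, a [1 x 1 x classes] volume."""
    return (1, 1, classes)

shape = (32, 32, 3)              # INPUT: raw pixels, 32x32, RGB
shape = conv(shape, filters=32)  # CONV (RELU is elementwise and keeps the shape)
shape = pool(shape)              # POOL
shape = fc(shape)                # FC: final class scores
print(shape)  # (1, 1, 1000)
```

The final [1x1x1000] volume matches the 1000 product classes described in the text.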
E. Learning Algorithms

Given the training image dataset (T), we train the CNN model to discriminate between 1000 product classes. A ReLU (Rectified-Linear) / Leaky-ReLU activation layer is used for the forward pass, whereas loss minimization is done using Stochastic Gradient Descent (SGD):

V_{t+1} = μ V_t − α ∇L(W_t)
W_{t+1} = W_t + V_{t+1}

where L(W) is the loss function, α is the learning rate, μ is the momentum, V_t is the weight update at iteration t, and W_t denotes the network weights at iteration t.

F. Hyper-parameter Setting

We started the training with a learning rate of 0.0001, which follows the 'Step learning' policy. This policy reduces the learning rate by the factor 'gamma' after the given step size (number of iterations). We set gamma = 0.1 and a step size of 1000. The weight bias has been initialized to 0.0005. Training gives a validation accuracy of 83%. For computation purposes we benchmarked with 8 GB RAM machines, and we used Spark and Kafka for real-time streaming and product classification.

Table 2: Deep Learning Training details

Number of Classes: 1000
Number of Iterations: 10000
System Configuration: 64-bit, 8GB RAM system
Validation Accuracy: 83%

VII. FINDING OUT-OF-STOCK PRODUCTS AND CALCULATING THE PLANOGRAM COMPLIANCE

By performing product classification using the HOG-and-SVM and deep learning approaches on each detected product, the product is identified. Areas of the shelf where products are out of stock are also identified, as the model has been trained with images of empty shelf areas. A planogram is constructed with the identified products, including the dimensions of each product and the exact position where it was found. The areas where products are unavailable are also recorded with the position and dimensions of the empty area. The actual data is compared with the reference planogram data, and accordingly the names of the products that are out of stock are identified. The position compliance ratio, the product facing compliance ratio and the overall planogram compliance for each shelf are also generated. In cases of product stock-outs and very low compliance, alerts are generated. In all cases the generated compliance metrics are displayed on the dashboard.

A. Reporting and Alert Generation

Stock-outs and low compliance cases are sent as email alerts to the store managers. For experimental purposes, Elastic Email [14] was used as the email service provider. A web-based dashboard has also been made available, which plots the number of stock-outs over time as well as the list of products that are out of stock. The dashboard displays the product availability, stock-out and compliance metrics at any given point of time.
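The planogram comparison described above can be sketched as follows. The slot-based planogram representation and the compliance formula here are illustrative assumptions; the paper does not give exact definitions:

```python
# Compare detected shelf contents against a reference planogram.
reference = {                      # slot position -> expected product
    (0, 0): "cereal_a", (0, 1): "cereal_b", (0, 2): "drink_a", (0, 3): "drink_b",
}
detected = {                       # slot position -> classified product
    (0, 0): "cereal_a", (0, 1): "empty_shelf", (0, 2): "drink_a", (0, 3): "drink_a",
}

# Slots classified as empty shelf are out of stock.
out_of_stock = [p for slot, p in reference.items() if detected.get(slot) == "empty_shelf"]
# Position compliance: fraction of slots holding the expected product.
matching = sum(detected.get(slot) == p for slot, p in reference.items())
compliance = matching / len(reference)

print(out_of_stock, compliance)  # ['cereal_b'] 0.5
```

In this toy shelf, one slot is empty (an out-of-stock alert) and one holds a misplaced product, so only half the positions comply.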
VIII. RESULTS AND DISCUSSIONS

The two problems mentioned in the introduction have been solved in real time using the architecture given in section 3. The Double robot captures the image in real time and uploads the data into a shared location on the network server. The Java listeners described in section 3 are configured for the drive in such a way that once an image gets uploaded, they generate metadata for that specific image and upload the metadata to the Kafka topic. Kafka is a distributed messaging system with fast and highly scalable capability. The Kafka server has been configured with 3 nodes, and each topic has three partitions with a replication factor of two. Here the Java listeners act as the message producers for the system, and Spark Streaming works as the consumer.

Spark is a batch processing platform, whereas Spark Streaming is a near real-time processing tool that runs on top of the Spark cluster with 3 nodes. The Spark Streaming job collects the data through a Spark DStream from the Kafka topic and processes the image metadata by fetching the image data from the data repository. In Spark, the pre-trained HOG and multi-class SVM model is present, which helps in classifying the product images coming through Kafka into different categories. For the stock-out problem, as explained in section 4.4, the products that are out of stock are identified using the bounding boxes, the planogram and the product classification from the pre-trained model in Spark, and alerts are sent out in real time. For the issue of misplaced products, the planogram and the Spark pre-trained model are used, and accordingly the misplaced products are reported in real time. The planogram compliance is also calculated.

We have also developed a deep learning based solution for product classification, which helps to address the stock-out and compliance problems.

We have deployed this solution for multiple retail giants in the US and have reported the stock-outs and compliance in real time, with good accuracy. The table given below presents the stock-out and compliance accuracy using the two models for one store.

Table 3: Testing Results – comparison

                        Machine Learning (HOG + SVM)    Deep Learning (CNN)
Stock Out               78%                             85%
Planogram Compliance    70%                             83%

From the testing, it has been observed that using deep learning techniques provides a higher level of accuracy for product classification. The deep learning technique provides about 85% accuracy, compared to machine learning, which provides around 70-80% accuracy. However, as deep learning is computationally very expensive and time consuming, machine learning methods can be explored as the next suitable alternatives. Increasing the training data and further fine-tuning the training parameters can improve the accuracy in both the machine learning and deep learning cases.
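The producer/consumer flow in this section can be sketched with an in-memory queue standing in for the Kafka topic. This is illustrative only; the real system uses Kafka topics with Spark Streaming as the consumer, and all names and paths here are assumptions:

```python
import json
from queue import Queue

topic = Queue()                               # stands in for a Kafka topic

def listener_produce(image_path: str) -> None:
    """Java-listener role: publish image metadata once an upload is seen."""
    topic.put(json.dumps({"image": image_path, "store": "store_1"}))

def streaming_consume(classify) -> list:
    """Spark-Streaming role: drain the topic and classify each referenced image."""
    results = []
    while not topic.empty():
        meta = json.loads(topic.get())
        results.append((meta["image"], classify(meta["image"])))
    return results

listener_produce("/shared/shelf_001.jpg")
listener_produce("/shared/shelf_002.jpg")
print(streaming_consume(lambda path: "cereal_a"))  # metadata paired with predictions
```

Decoupling the producer from the consumer this way is what lets the listeners and the classification cluster scale independently.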
IX. SUMMARY

The end-to-end solution described above has been proposed to automate the product stock-out check and the planogram compliance check, which can aid retail store associates in maintaining store operations smoothly. The Double Robot follows a specified line and captures the shelf images. Product detection algorithms are applied on the images and each product is extracted. Machine learning and deep learning techniques have been used to classify each individual product, with deep learning providing the higher accuracy. The actual product placement is compared with the reference planogram details, and accordingly the planogram compliance metric is calculated. The dashboard provides a live update on the current stock and planogram compliance, and an additional alert generation mechanism has been set up to provide stock-out alerts. Thereby, the store associates can view the alerts and the dashboard in order to handle stock-out situations and planogram non-compliance.

The same technology can be applied to other product management scenarios, such as back-store inventories as well as warehouses, to keep a check on product availability and to manage product orders.

X. FUTURE WORK

Image capture mechanisms can be modified to use other devices, or the Double robot can be improved to perform shelf image capture without the need for a physical line. Product classification using machine learning and deep learning can be improved by increasing the size of the image repository and providing further product samples for training. Fine-tuning of the training and testing parameters can also be experimented with to provide better accuracies.

The use of Bluetooth beacons to guide the robot around the store, and of additional depth sensors to get accurate product counts, is being experimented with. Using a correlation between sensor data and image data, the analysis of stock-outs and planogram compliance can be made more accurate.

REFERENCES

[1] A. Opalach, A. Fano, F. Linaker and R. B. Groenevelt, "Planogram extraction based on image processing", US Patent 8189855 B2.
[2] C. Swedberg, "ShelfX Unveils Store Shelves for Automating Purchases", Rfidjournal.com, 2015.
[3] C. Hamilton, W. Spencer and A. Ring, "Method and system for automatically measuring retail store display compliance", US Patent 8429004 B2.
[4] "Histogram of oriented gradients", https://en.wikipedia.org/wiki/Histogram_of_oriented_gradients
[5] J. Shotton, A. Blake and R. Cipolla, "Multiscale Categorical Object Recognition Using Contour Fragments", IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 30, no. 7, pp. 1270-1281, July 2008.
[6] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama and T. Darrell, "Caffe: Convolutional Architecture for Fast Feature Embedding", 2014, http://caffe.berkeleyvision.org/
[7] J. Fan and T. Zhang, "Shelf detection via vanishing point and radial projection", IEEE International Conference on Image Processing (ICIP), 2014.
[8] M. Marder, S. Harary, A. Ribak, Y. Tzur, S. Alpert and A. Tzadok, "Using image analytics to monitor retail store shelves", IBM Journal of Research and Development, vol. 59, no. 2, pp. 547-588, Mar. 2015.
[9] N. Dalal and B. Triggs, "Histograms of Oriented Gradients for Human Detection", 2005.
[10] P. Viola and M. Jones, "Robust real-time object detection", Intl. J. Computer Vision, vol. 57, no. 2, pp. 137-154, 2004.
[11] P. Dollár, R. Appel, S. Belongie and P. Perona, "Fast Feature Pyramids for Object Detection", IEEE Transactions on Pattern Analysis and Machine Intelligence, Aug. 2014.
[12] R. C. Gonzalez and R. E. Woods, Digital Image Processing, 3rd Edition, Pearson Education India, 2009.
[13] R. Satapathy, S. Prahlad and V. Kaulgud, "Smart Shelfie – Internet of Shelves: For Higher On-Shelf Availability".
[14] Elastic Email, https://elasticemail.com
[15] Double Robotics, http://www.doublerobotics.com/