Sec 2 Team 06

A PROJECT REPORT ON
IMAGE SEGMENTATION USING REGION-BASED OBJECT DETECTOR
A project report submitted in partial fulfilment of the requirements for the award of
the degree of
BACHELOR OF TECHNOLOGY
IN
ELECTRONICS AND COMMUNICATION ENGINEERING
Submitted by
K. SATYA (18JN5A0430)
T. SUSHMI (17JN1A0452)
M. REVATHI (17JN1A0460)
A. AKHILA (17JN1A0457)
A. GEETHA SUREKHA (17JN1A0467)
Under the esteemed guidance of
Smt. S. ANITHA, M.Tech
Assistant professor, Dept. of ECE
KAKINADA INSTITUTE OF ENGINEERING AND TECHNOLOGY for WOMEN
(Approved by AICTE, Affiliated to Jawaharlal Nehru Technological University
Kakinada, Yanam Road, Kornagi-533463)
(2017-2021)
CERTIFICATE
This is to certify that the thesis entitled “IMAGE SEGMENTATION USING
REGION-BASED OBJECT DETECTOR”, submitted by K. SATYA,
T. SUSHMI, M. REVATHI, A. AKHILA and A. GEETHA SUREKHA in
partial fulfilment of the requirements for the award of BACHELOR OF
TECHNOLOGY in ELECTRONICS AND COMMUNICATION
ENGINEERING from KAKINADA INSTITUTE OF ENGINEERING AND
TECHNOLOGY for WOMEN, affiliated to JNTU-KAKINADA, is a record of
bona fide work carried out by them under guidance and supervision. The results
embodied in this thesis have not been submitted to any other university or institute for
the award of any degree.
Project Guide Head of the Department
Smt. S. Anitha, M. Tech, Ms. P. Latha, M. Tech,

Department of ECE Department of ECE
EXTERNAL EXAMINER
ACKNOWLEDGEMENT
It gives us immense pleasure to acknowledge all those who helped us throughout in
making this project a great success.
With profound gratitude we thank Mr. Y RAMA KRISHNA, M. Tech, MBA, Principal,
Kakinada Institute of Engineering and Technology-Women for his timely suggestions,
which helped us to complete this project work successfully.
Our sincere thanks and deep sense of gratitude to Ms. P. Latha, M. Tech, Head of the
Department of ECE, for her valuable guidance in the successful completion of this project.
We express great pleasure in acknowledging our profound sense of gratitude to our project
guide Smt. S. Anitha, M. Tech, Assistant Professor in the ECE Dept., for her valuable
guidance, comments, suggestions and encouragement throughout the course of this
project.
We are thankful to both Teaching and Non-Teaching staff members of ECE department
for their kind cooperation and all sorts of help bringing out this project work successfully.
OUR PROJECT MEMBERS
DECLARATION
We hereby declare that the project work “SEMANTIC IMAGE SEGMENTATION
USING REGION-BASED OBJECT DETECTOR” submitted to
the JNTU Kakinada, is a record of an original work done by us under the guidance of
Smt. S. Anitha, M. Tech, Asst. Professor in Electronics & Communication Engineering.
This project work is submitted in partial fulfilment of the requirement for the award of the
degree of Bachelor of Technology in Electronics & Communication Engineering. The
results embodied in this project report have not been submitted to any other University
or Institute for the award of any degree or diploma.
OUR PROJECT MEMBERS
ABSTRACT
Semantic image segmentation, which has become one of the key applications in the image
processing and computer vision domain, has been used in multiple fields such as medicine
and intelligent transportation. However, current state-of-the-art models use a separate
representation for each task, making joint inference clumsy and leaving the classification of
many parts of the scene ambiguous. In this project, we explore a simple semantic segmentation
approach using a region-based object detector, which only needs bounding box annotations. The
main idea is to use an object detector to classify region proposals and then apply a saliency
detection method to segment the classified proposals.
TABLE OF CONTENTS
CONTENTS PAGE NO
CHAPTER 1: INTRODUCTION
1.1 Introduction 1
1.2 Convolution Networks 2
CHAPTER 2: LITERATURE SURVEY 4
CHAPTER 3: IMPLEMENTATION METHODS
3.1 Introduction 9
3.2 Segmentation approaches 10
3.2.1 Region based Semantic Segmentation 10
3.2.2 R-CNN (Regions with CNN feature) 11
3.2.3 Fully Convolutional Neural Network based Semantic Segmentation 12
3.2.4 Weakly Supervised Semantic Segmentation 13
CHAPTER 4: INTRODUCTION TO IMAGE PROCESSING
& MATLAB
4.1 Introduction to Image processing 15
4.1.1 Region Proposal generation 16
4.1.1 (a) Contour Map Generation 17
4.1.1 (b) Convolutional Encoder – Decoder Network 20
4.1.2 Object detection 22
4.1.3 Object Segmentation 27
4.2 MATLAB 37
4.2.1 Introduction 37
4.2.2 MATLAB’s power of Computational Mathematics 37
4.2.3 Features of MATLAB 38
4.2.4 Uses of MATLAB 39
4.2.5 Environment Set Up 39
4.2.6 Understanding The MATLAB Environment 41
CHAPTER 5: RESULT AND DESCRIPTION

5.1 Flow Chart 44
5.2 Algorithm 45
5.3 Output 46
5.4 Advantages 50
5.5 Applications 51
CONCLUSION 52
FUTURE SCOPE 53
APPENDIX 54
REFERENCES 60
LIST OF FIGURES
S.NO Figure No Figure Name Page no
1 1.1 R-CNN Architecture 11
2 1.2 FCN Architecture 13
3 1.3 Weakly Supervised Segmentation 14
4 2.1 Proposed Block Diagram 15
5 2.2 Multiscale Combinatorial Grouping 17
6 2.3 Directions of Pixels 18
7 2.4 Types Of Contour Pixels 19
(a)Absolute Direction
(b)Relative Direction
(c)Types Of Contour Pixels(I, O, IO)
8 2.5 Flowchart Of Convolutional Encoder-
Decoder Network 21
9 2.6 5 x 5 Feature Map 23
10 2.7 New Feature Map From The Left To 24
Detect The Top Left Corner Of An
Object
11 2.8 9 Score Map 25
12 2.9 Top-Middle Object 26
13 2.10 ROI Pool 27
14 2.11 Saliency detection 30
15 2.12 Example Of Smart Thumbnail Algorithm 31
16 2.13 Example Of Digital Image Processing 32
17 2.14 Example Of Digital Image 33
18 2.15 Example Of Developing a System That 35
Scans Human Face And Opens Any
Kind Of Lock
19 2.16 Example Of Object Rendering 36
20 3.1 MathWorks Installer 40
21 3.2 Installing Pause 40
22 3.3 MATLAB Desktop 41
23 3.4 Current Folder 41
24 3.5 Command Window 42
25 3.6 Workspace 42
26 3.7 Command History 43
27 4.1 Mask-1 46
28 4.2 Mask-2 46
29 4.3 Mask-3 47
30 4.4 Mask-4 47
31 4.5 Contour Map 48
32 4.6 Input Image With Bounding Boxes 48
33 4.7 Final Mask 49
34 4.8 Saliency Map 49
35 4.9 Segmented Image 50
Image Segmentation using Region-based Object Detector
CHAPTER – 1
INTRODUCTION
1.1 INTRODUCTION
Semantic image segmentation, also called pixel-level classification, is the task of
clustering together the parts of an image that belong to the same object class (Thoma2016). Two
other main image tasks are image-level classification and detection. Classification assigns a
single category to the whole image. Detection refers to object localization and recognition. Image
segmentation can be treated as pixel-level prediction because it classifies each pixel into its
category.
Moreover, there is a task named instance segmentation, which joins detection and
segmentation together. Object detection is one of the great challenges of computer vision, having
received continuous attention since the birth of the field. The most common modern approaches
scan the image for candidate objects and score each one. This is typified by the sliding-window
object detection approach, but is also true of most other detection schemes (such as
centroid-based methods or boundary edge methods).
The most successful approaches combine cues from inside the object boundary (local
features) with cues from outside the object (contextual cues). Recent works are adopting a more
holistic approach by combining the output of multiple vision tasks and are reminiscent of some
of the earliest work in computer vision. However, these recent works use a different
representation for each subtask, forcing information sharing to be done through awkward feature
mappings.
KIETW-ECE Page 1
Another difficulty with these approaches is that the subtask representations can be
inconsistent. For example, a bounding-box based object detector includes many pixels within
each candidate detection window that are not part of the object itself. Furthermore, multiple
overlapping candidate detections contain many pixels in common. How these pixels should be
treated is ambiguous in such approaches. A model that uniquely identifies each pixel is not only
more elegant, but is also more likely to produce reliable results since it encodes a bias of the real
world (i.e., a visible pixel belongs to only one object).
Semantic segmentation is a very important topic in computer vision due to its crucial
contribution to image understanding. The task is to assign every single pixel a specific category
label, such as person or car, and can therefore be considered a dense pixel classification problem.
It predicts the label, location and shape of each object, and is thus also called object parsing in
some references. It has a broad range of potential applications, such as automatic driving and
robot sensing. Recently, great progress has been made in semantic image segmentation due to
the rise of deep learning. Specifically, deep convolutional neural networks (CNNs) are used to
extract rich hierarchical semantic features, the lack of which was a bottleneck for traditional
methods.
1.2 CONVOLUTION NETWORKS
CNNs are very effective for image classification and, encouraged by this, researchers
started to apply CNNs to dense prediction problems. In 2015, Long et al. first proposed an
end-to-end fully convolutional network (FCN) for semantic segmentation. However, the obtained
label map is very coarse, as can be seen in the figure, because multiple stages of convolution and
pooling strides reduce the final prediction, typically by a factor of 32 in each dimension. Such a
low-resolution result loses much of the finer image structure.
To overcome this, Noh et al. learn a multi-layer deconvolution network as an up-sampling
operation to increase the resolution of prediction maps. Chen et al. proposed DeepLab,
which employs atrous (or dilated) convolutions to account for larger receptive fields without
downscaling the image.
More recently, the same authors made further improvements and proposed DeepLab v3, which
achieves state-of-the-art performance and is therefore widely applied. Zheng et al. propose a new
type of CNN that combines the strengths of CNNs and Conditional Random Fields (CRFs) to
improve accuracy. Because the fully-connected CRF is time consuming, Chen et al. replaced it with
bilateral filtering with the domain transform.
Recently, more powerful approaches have been proposed. With the development of the Internet of
Things, more and more image data are collected by various image sensors or video sensors.
Before using image data for more complex computer vision tasks, we need to know what objects
are in the image and where they are located. Therefore, object detection has always been an active
research direction in the field of computer vision, and its purpose is to locate and classify objects
in images or videos. Object detection has been widely used in many fields, including intelligent
traffic and human pose estimation. Traditional algorithms solve the detection problem for
images by separating foreground and background in the picture and then manually extracting
foreground features for classification. Foreground extraction algorithms can be divided
into static and dynamic according to the state of the object. The static object detection algorithm
for images usually uses the background subtraction algorithm. The foreground is the part where
the pixel value varies greatly.
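As a rough illustration of background subtraction, the following Python sketch marks as foreground every pixel whose value differs from a reference background image by more than a threshold. It is not part of the project code: the function name `foreground_mask`, the threshold of 30 grey levels and the tiny 4 x 4 test image are assumptions made only for this example.

```python
def foreground_mask(frame, background, threshold=30):
    """Return a mask that is True where the frame differs strongly from the background.

    `frame` and `background` are equally sized lists of rows of grey values.
    The threshold value is a hypothetical choice for this example.
    """
    return [
        [abs(f - b) > threshold for f, b in zip(frame_row, bg_row)]
        for frame_row, bg_row in zip(frame, background)
    ]


# A flat background and a frame in which a bright 2 x 2 object has appeared.
background = [[100] * 4 for _ in range(4)]
frame = [row[:] for row in background]
for r in (1, 2):
    for c in (1, 2):
        frame[r][c] = 200

mask = foreground_mask(frame, background)
for row in mask:
    print("".join("#" if is_fg else "." for is_fg in row))
```

Only the four pixels of the inserted object exceed the threshold, so they form the foreground. Real systems usually estimate the background from several frames instead of using a single reference image.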
CHAPTER 2
LITERATURE SURVEY
Lin et al. present a novel multi-path refinement network called RefineNet that uses all
the information available during the down-sampling process to enable high-resolution
classification with the help of long-range residual connections.
Bertasius et al. introduced a simple yet efficient Convolutional Random Walk
Network to address the issue of poor boundary localization. Although many effective methods
have been explored, it is still very challenging to obtain high-resolution segmentation results,
especially near object boundaries.
Dai et al. used bounding box annotations as an extra supervision for training convolutional
networks to segment semantic regions. Bounding box annotations can be obtained more easily
than masks; although they are less precise, their sheer number may help improve segmentation
performance.
Similarly, Khoreva et al. proposed to recursively train a convnet such that outputs are
improved after each iteration by using bounding box annotations only. Another interesting work
is the scribble-supervised segmentation presented by Lin et al. Scribbles are very widely used in
interactive image segmentation and are more user-friendly than bounding boxes. Bearman et al.
took a step towards stronger supervision for semantic segmentation by pointing. Other forms of
weak supervision have been explored as well, such as eye tracks and noisy web tags. All these
approaches require much less annotation effort during training, but their performance is still
far from that of fully supervised techniques.
Our method inherits features from the sliding-window object detector works, such as
Torralba et al. and Dalal and Triggs, and the multi-class image segmentation work of Shotton et
al. We further incorporate into our model many novel ideas for improving object detection via
scene context. The innovative works that inspire ours include predicting camera viewpoint for
estimating the real world size of object candidates, relating “things” (objects) to nearby “stuff”
(regions), co-occurrence of object classes, and general scene “gist”. Recent works go beyond
simple appearance-based context and show that holistic scene understanding (both geometric
and more general) can significantly improve performance by combining related tasks. These
works use the output of one task (e.g., object detection) to provide features for other related tasks
(e.g., depth perception). While they are appealing in their simplicity, current models are not
tightly coupled and may result in incoherent outputs (e.g., the pixels in a bounding box identified
as “car” by the object detector may be labeled as “sky” by an image segmentation task). In our
method, all tasks use the same region-based representation, which forces consistency between
variables. Intuitively this leads to more robust predictions. The decomposition of a scene into
regions to provide the basis for vision tasks exists in some scene parsing works.
Notably, Tu et al. describe an approach for identifying regions in the scene. Their
approach has only been shown to be effective on text and faces, leaving much of the image
unexplained. Sudderth et al. relate scenes, objects and parts in a single hierarchical framework,
but do not provide an exact segmentation of the image. Gould et al. provide a complete
description of the scene using dynamically evolving decompositions that explain every pixel
(both semantically and geometrically). However, the method cannot distinguish
between foreground objects and often leaves them segmented into multiple dissimilar pieces.
Our work builds on this approach with the aim of classifying objects.
Other works attempt to integrate tasks such as object detection and multi-class image
segmentation into a single CRF model. However, these models either use a different
representation for object and non-object regions or rely on a pixel-level representation. The
former does not enforce label consistency between object bounding boxes and the underlying
pixels, while the latter does not distinguish between adjacent objects of the same class. Recent
work by Gu et al. also uses regions for object detection instead of the traditional sliding-window
approach. However, unlike our method, they use a single over-segmentation of the image and
make the strong assumption that each segment represents a (probabilistically) recognizable
object part. Our method, on the other hand, assembles objects (and background regions) using
segments from multiple different over-segmentations. The multiple over-segmentations avoid
errors made by any one segmentation.
Furthermore, we incorporate background regions which allows us to eliminate large
portions of the image thereby reducing the number of component regions that need to be
considered for each object. Liu et al. use a non-parametric approach to image labeling by
warping a given image onto a large set of labeled images and then combining the results. This is
a very effective approach since it scales easily to a large number of classes. However, the
method does not attempt to understand the scene semantics. In particular, their method is unable
to break the scene into separate objects (e.g., a row of cars will be parsed as a single region) and
cannot capture combinations of classes not present in the training set. As a result, the approach
performs poorly on most foreground object classes.
In recent years, many algorithms have been proposed to address the problem of object
detection. The object detection algorithms based on deep learning can be divided into two-stage
detection algorithms and one-stage detection algorithms. A two-stage algorithm first
generates region proposals and then predicts the bounding box and category of each
region proposal. Girshick et al. proposed the classic regions with convolutional neural networks
(CNN) features (R-CNN) to achieve excellent object detection accuracy by using a deep ConvNet
to classify object proposals, but it is very time-consuming. To solve this problem, Ren et al.
proposed an upgraded version of R-CNN, Faster R-CNN, which innovatively used the region
proposal network (RPN) to directly classify the region proposals in the convolutional neural
network, and achieved the end-to-end goal of the whole detection framework. He et al. proposed
Mask R-CNN on the basis of Faster R-CNN, which added a branch for the segmentation
task and used the detection and segmentation tasks jointly to extract image features and improve
the accuracy of detection. He et al. proposed spatial pyramid pooling networks (SPPNet) to generate
fixed-length representations. Kong et al. proposed HyperNet, which combines the generation of
candidate regions with the detection task to produce fewer candidate regions while ensuring a
higher recall rate. Cai and Vasconcelos proposed Cascade R-CNN to address the problem of
overfitting and quality mismatch. The one-stage detection algorithms do not need to select region
proposals, but use regression to directly calculate the bounding box and object category,
which further reduces the running time. Redmon et al. proposed the you only look once (YOLO)
algorithm to meet the requirements of real-time detection, but the detection accuracy of small
objects is not high.
Liu et al. proposed the single shot multibox detector (SSD) algorithm to predict the
object from multiple feature maps, which largely solved the problem of small object detection.
Lin et al. proposed RetinaNet mainly to address the extreme imbalance in one-stage
algorithms between positive and negative samples and between difficult and easy samples.
Zhang et al. proposed the RefineDet method, which absorbed the advantages of the two-stage
algorithm, so that the one-stage detection algorithm can also have the accuracy of the two-stage
algorithm. Liu et al. proposed RFBNet, which uses dilated (atrous) convolution to improve the
receptive field. Shen et al. proposed the deeply supervised object detector (DSOD) to train neural
networks for detection tasks from scratch, and also introduced the idea of DenseNet, which
greatly reduced the number of parameters. Law and Deng proposed CornerNet to detect an object
bounding box as a pair of keypoints using a single convolutional neural network. To further
improve on CornerNet, Duan et al. proposed CenterNet to detect each object as a triplet of
keypoints. Tian et al. proposed the fully convolutional one-stage object detector (FCOS) to solve
object detection in a per-pixel prediction fashion.
CHAPTER 3
IMPLEMENTATION METHODS
3.1 INTRODUCTION
Image segmentation is useful in many applications. It can identify the regions of interest
in a scene or annotate the data. We categorize the existing segmentation algorithms into
region-based segmentation, data clustering, and edge-based segmentation. Region-based segmentation
includes the seeded and unseeded region growing algorithms, JSEG, and the fast scanning
algorithm. All of them expand each region pixel by pixel based on the pixel value or quantized
value, so that each cluster has a high positional relation. Data clustering is
based on the whole image and considers the distance between data points. The characteristic of
data clustering is that the pixels of a cluster are not necessarily connected.
The basic methods of data clustering can be divided into hierarchical and partitional clustering.
Furthermore, we show an extension of data clustering called the mean shift algorithm, although this
algorithm belongs more properly to density estimation. The last category of segmentation is
edge-based segmentation. This type of segmentation generally applies edge detection or the
concept of edges. The typical one is the watershed algorithm, but it tends to suffer from
over-segmentation, so the use of markers was proposed to improve the watershed
algorithm by smoothing and selecting markers. Finally, we show some applications that apply
segmentation techniques in pre-processing.
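To make the idea of expanding a region pixel by pixel concrete, the following Python sketch shows a minimal seeded region growing. It is an illustration under simplifying assumptions, not the project's implementation: the function name `region_grow`, the use of 4-connectivity, the tolerance of 10 grey levels and the comparison against the seed value (rather than a running region mean) are all choices made for this example.

```python
from collections import deque


def region_grow(image, seed, tolerance=10):
    """Grow a 4-connected region from `seed`.

    A neighbouring pixel joins the region when its grey value is within
    `tolerance` of the seed pixel's value (a simplification; many variants
    compare against the current region mean instead).
    """
    rows, cols = len(image), len(image[0])
    seed_value = image[seed[0]][seed[1]]
    region = {seed}
    queue = deque([seed])
    while queue:
        r, c = queue.popleft()
        for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
            inside = 0 <= nr < rows and 0 <= nc < cols
            if inside and (nr, nc) not in region:
                if abs(image[nr][nc] - seed_value) <= tolerance:
                    region.add((nr, nc))
                    queue.append((nr, nc))
    return region


image = [
    [10, 12, 90, 91],
    [11, 13, 92, 95],
    [10, 80, 85, 94],
]
print(sorted(region_grow(image, (0, 0))))
```

Starting from the top-left pixel, the region collects the five dark pixels and stops at the bright ones, which shows why the pixels of a grown region are always spatially connected, unlike the clusters produced by data clustering.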
3.2 SEGMENTATION APPROACHES
A general semantic segmentation architecture can be broadly thought of as an encoder
network followed by a decoder network:
The encoder is usually a pre-trained classification network such as VGG or ResNet.
The task of the decoder is to semantically project the discriminative features (lower resolution)
learnt by the encoder onto the pixel space (higher resolution) to get a dense classification.
Unlike classification, where the end result of the very deep network is the only important
thing, semantic segmentation requires not only discrimination at the pixel level but also a
mechanism to project the discriminative features learnt at different stages of the encoder onto the
pixel space. Different approaches employ different decoding mechanisms. The three main
approaches are described below.
3.2.1 REGION-BASED SEMANTIC SEGMENTATION
The region-based methods generally follow the “segmentation using recognition”
pipeline, which first extracts free-form regions from an image and describes them, followed by
region-based classification. At test time, the region-based predictions are transformed to pixel
predictions, usually by labeling a pixel according to the highest scoring region that contains it.
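The last step, turning region-level predictions into pixel-level predictions, can be sketched as follows. This is a minimal Python illustration and not the project's code: the function name `regions_to_pixel_labels`, the representation of a region as a set of (row, column) coordinates, and the example labels and scores are assumptions.

```python
def regions_to_pixel_labels(height, width, regions, background="background"):
    """Label each pixel with the class of the highest-scoring region containing it.

    `regions` is a list of (pixels, label, score) tuples, where `pixels` is a
    set of (row, col) coordinates. Pixels covered by no region keep the
    `background` label.
    """
    labels = [[background] * width for _ in range(height)]
    best_score = [[float("-inf")] * width for _ in range(height)]
    for pixels, label, score in regions:
        for r, c in pixels:
            if score > best_score[r][c]:
                best_score[r][c] = score
                labels[r][c] = label
    return labels


# Two overlapping hypothetical regions; pixel (1, 1) belongs to both.
regions = [
    ({(0, 0), (0, 1), (1, 0), (1, 1)}, "car", 0.6),
    ({(1, 1), (1, 2)}, "person", 0.9),
]
for row in regions_to_pixel_labels(2, 3, regions):
    print(row)
```

The pixel shared by both regions receives the label of the higher-scoring one, so every pixel ends up with exactly one label.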
3.2.2 R-CNN (REGIONS WITH CNN FEATURE)
R-CNN is a representative work among the region-based methods. It performs semantic
segmentation based on object detection results. Specifically, R-CNN first uses selective
search to extract a large number of object proposals and then computes CNN features for each
of them.
Fig.1.1 R-CNN Architecture
Finally, it classifies each region using class-specific linear SVMs. Compared with
traditional CNN structures, which are mainly intended for image classification, R-CNN can
address more complicated tasks, such as object detection and image segmentation, and it has
become an important basis for both fields. Moreover, R-CNN can be built on top of any CNN
benchmark structure, such as AlexNet, VGG, GoogLeNet, or ResNet.
For the image segmentation task, R-CNN extracted two types of features for each region,
the full region feature and the foreground feature, and found that concatenating them
as the region feature led to better performance. R-CNN achieved significant
performance improvements by using highly discriminative CNN features. However, it
also suffers from a few drawbacks for the segmentation task:
The feature is not compatible with the segmentation task.
The feature does not contain enough spatial information for precise boundary generation.
Generating segment-based proposals takes time and greatly affects the final performance.
To address these bottlenecks, subsequent work has been proposed, including
SDS, Hypercolumns, and Mask R-CNN.
3.2.3 FULLY CONVOLUTIONAL NETWORK-BASED SEMANTIC SEGMENTATION
The original Fully Convolutional Network (FCN) learns a mapping from pixels to pixels,
without extracting region proposals. The FCN pipeline is an extension of the
classical CNN. The main idea is to make the classical CNN take arbitrary-sized images as input.
The restriction of CNNs to accept and produce labels only for specific-sized inputs comes from
the fully-connected layers, whose dimensions are fixed.
In contrast, FCNs have only convolutional and pooling layers, which give them the
ability to make predictions on arbitrary-sized inputs. One issue with this specific FCN is that
propagating through several alternating convolutional and pooling layers down-samples the
resolution of the output feature maps.
Therefore, the direct predictions of FCN are typically in low resolution, resulting in
relatively fuzzy object boundaries.
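The loss of resolution can be checked with simple arithmetic. The Python sketch below assumes an encoder with five stride-2 pooling stages, which matches the factor of 32 quoted in Section 1.2. The function name `output_size`, the rounding-up convention and the input sizes are assumptions made for this example.

```python
def output_size(input_size, strides):
    """Spatial size of the feature map after a sequence of strided layers.

    Assumes each layer divides the size by its stride and rounds up, as with
    'same'-style padding; other padding conventions shift the result slightly.
    """
    size = input_size
    for stride in strides:
        size = -(-size // stride)  # ceiling division
    return size


strides = [2, 2, 2, 2, 2]  # five stride-2 pooling stages: 2**5 = 32
for side in (224, 500):
    print(side, "->", output_size(side, strides))
```

Under these assumptions a 224 x 224 input yields only a 7 x 7 prediction map, so each predicted cell covers a 32 x 32 block of the original image. This is why object boundaries appear fuzzy after up-sampling.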
Fig.1.2 FCN Architecture
A variety of more advanced FCN-based approaches have been proposed to address this
issue, including SegNet, DeepLab-CRF, and Dilated Convolutions.
3.2.4 WEAKLY SUPERVISED SEMANTIC SEGMENTATION
Most of the relevant methods in semantic segmentation rely on a large number of images
with pixel-wise segmentation masks. However, manually annotating these masks is quite
time-consuming, frustrating and commercially expensive.
Therefore, some weakly supervised methods have recently been proposed, which are
dedicated to performing semantic segmentation using annotated bounding boxes.
For example, BoxSup employed the bounding box annotations as supervision to train
the network and iteratively improve the estimated masks for semantic segmentation.
Fig. 1.3 Weakly Supervised Semantic Segmentation
Simple Does It treated the weak supervision limitation as an issue of input label noise and
explored recursive training as a de-noising strategy.
CHAPTER 4
INTRODUCTION TO IMAGE PROCESSING AND MATLAB
4.1 INTRODUCTION TO IMAGE PROCESSING
The proposed method mainly consists of:
1. Region proposal generation,
2. Object detection, and
3. Object segmentation
Fig. 2.1 Proposed Block Diagram
This project proposes an approach that uses an object detector for semantic segmentation.
In detail, it includes region proposal generation, object detection, and object segmentation. We
first use a proposal generator to get object proposals and their corresponding masks. Then,
we use a region-based object detector to classify them and obtain their category labels. Finally, we
apply a saliency detection method to each object box to get the segmented results, using the
proposal masks as object seeds. The detailed processing pipeline is shown in the figure above.
4.1.1 REGION PROPOSAL GENERATION
Object proposals are very important mid-level representations, which provide
subsequent applications with a set of image regions in which objects might occur. Current
top-performing object detectors, such as Faster R-CNN and R-FCN, all use region proposals. Almost all
object proposal generation methods can be split into two kinds: grouping based and sliding
window based. The first kind can generate relatively accurate object bounding
boxes and masks at the same time, so this project focuses on this type. Experiments in the
literature show that MCG (multiscale combinatorial grouping) gets the best performance among
all low-level feature-based proposal generators. Segments in MCG are merged based on contour
strength. In order to boost the performance of object proposals, we use a powerful CNN-based
contour detection method (Convolutional Encoder-Decoder Network, CEDN) to replace the classic
gPb contour detector in MCG. In MCG, ultrametric contour maps are computed at multiple scales
and then aligned into a single hierarchical segmentation.
Multiscale Combinatorial Grouping
Consider a segmentation of the image into regions that partition its domain $S = \{S_i\}_i$. A
segmentation hierarchy is a family of partitions $\{S^*, S^1, \ldots, S^L\}$ such that: (1) $S^*$ is the
finest set of superpixels, (2) $S^L$ is the complete domain, and (3) regions from coarse levels are
unions of regions from fine levels. A hierarchy where each level $S^i$ is assigned a real-valued
index $\lambda_i$ can be represented by a dendrogram, a region tree where the height of each node is
its index.
Furthermore, it can also be represented as an ultrametric contour map (UCM), an image
obtained by weighting the boundary of each pair of adjacent regions in the hierarchy by the
index at which they are merged.
Fig.2.2 Multiscale Combinatorial Grouping
This representation unifies the problems of contour detection and hierarchical image
segmentation: a threshold at level $\lambda_i$ in the UCM produces the segmentation $S^i$.
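The effect of thresholding a UCM can be sketched in a few lines of Python. The sketch below is only an illustration and is not the MCG implementation: it represents the UCM as a dictionary of boundary strengths between adjacent superpixels and merges superpixels with a union-find structure. The function name `threshold_ucm` and the toy strengths are assumptions.

```python
def threshold_ucm(num_superpixels, boundaries, level):
    """Merge superpixels whose shared boundary is weaker than `level`.

    `boundaries` maps a pair of adjacent superpixel ids (a, b) to the strength
    of their shared boundary in the UCM. Returns one region id per superpixel.
    """
    parent = list(range(num_superpixels))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for (a, b), strength in boundaries.items():
        if strength < level:
            parent[find(a)] = find(b)
    return [find(i) for i in range(num_superpixels)]


# Four superpixels in a row with hypothetical boundary strengths.
boundaries = {(0, 1): 0.1, (1, 2): 0.6, (2, 3): 0.3}
for level in (0.05, 0.2, 0.5, 0.9):
    regions = threshold_ucm(4, boundaries, level)
    print(level, "->", len(set(regions)), "regions")
```

Raising the threshold only ever merges regions, so every coarser segmentation is a union of regions from a finer one, which is the hierarchy property stated above.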
Aligning Segmentation Hierarchies
In order to leverage multi-scale information, our approach combines segmentation
hierarchies computed independently at multiple image resolutions. However, since subsampling
an image removes details and smooths away boundaries, the resulting UCMs are misaligned, as
illustrated in the second panel.
Hierarchy Alignment
We construct a multi-resolution pyramid with N scales by subsampling/supersampling
the original image and applying our single-scale segmenter. In order to preserve thin structures
and details, we declare as the set of possible boundary locations the finest superpixels at the
highest resolution.
Multiscale Hierarchy
After alignment, we have a fixed set of boundary locations, and N strengths for each of
them, coming from the different scales. We formulate this problem as binary boundary
classification and train a classifier that combines these N features into a single probability of
boundary estimation.
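As an illustration, the combination of the N strengths can be sketched with a logistic model. The choice of logistic regression, the weights `w` and the bias `b` are assumptions made for this example; the text above only states that a classifier is trained.

```matlab
% One row per boundary location, one column per scale (N = 3, hypothetical values)
strengths = [0.9 0.8 0.7;
             0.1 0.2 0.0;
             0.5 0.4 0.6];

w = [1.5; 1.0; 0.5];   % hypothetical learned weights, one per scale
b = -1.5;              % hypothetical learned bias

% Logistic function maps the weighted sum to a probability of boundary
p = 1 ./ (1 + exp(-(strengths * w + b)));
isBoundary = p > 0.5;
```

In this toy case only the first location, which is strong at every scale, is classified as a boundary.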
4.1.1.a: CONTOUR MAP GENERATION
The implemented heuristic uses the fact that branches are often only a few pixels in
length and occur towards the middle of contours to make the assumption that the set of two
possible endpoints that are the farthest apart from one another (in terms of contour pixels) are
the two actual endpoints for a sub-contour.
Fig: 2.3 Directions of pixels
Experimentation with this heuristic showed that it produced correct results in nearly
every “well behaved” map.
Running the segmented image through the thinning, Moore contour tracing, and end
point finding algorithms yields each contour in a vectorized form which can then be processed
further. As can be seen in Figure, which shows the results of these steps on the previously shown
segmented image, the results from this step are quite good.
Fig: 2.4 Types of contour pixels. (a) Absolute direction; (b) relative direction; (c) types of contour
pixels: inner corner pixel (I), outer corner pixel (O) and inner-outer corner pixel (IO)
Contour Tracing Algorithms
Let I be a binary digital image with M×N pixels, where the coordinate of the top-leftmost pixel
is (0, 0) and that of the bottom-rightmost pixel is (M−1, N−1). In I, a pixel can be represented as
P = (x, y), x = 0, 1, 2, ···, M−1, y = 0, 1, 2, ···, N−1. Most contour-tracing algorithms use a
tracer T(P, d) with absolute directional information d ∈ {N, NE, E, SE, S, SW, W, NW}, and they
have the following basic sequence:
1. The tracer starts contour tracing at the contour of an object after it saves the starting point
along with its initial direction.
2. The tracer determines the next contour point using its specific rule of following paths
according to the adjacent pixels and then moves to the contour point and changes its absolute
direction.
3. If the tracer reaches the start point, then the trace procedure is terminated.
To determine the next contour point, which may be a contour pixel or pixel corner, the tracer
detects the intensity of its adjacent pixel Pr and the new absolute direction dr for Pr by using
relative direction information r ∈ {front, front-left, left, rear-left, rear, rear-right, right,
front-right}. For example, if the absolute direction of the current tracer T(P, d) is N, the left
direction of the tracer dLeft is W. Similarly, the left pixel of the tracer PLeft is (x−1, y).
Fig. 2.4(a) and (b) show the directional information of the tracer, and Fig. 2.4(c) shows the
different types of contour pixels. The contour pixels can be classified into four types, namely
straight line, inner corner pixel, outer corner pixel and inner-outer corner pixel. In Fig. 2.4(c),
“O” represents the outer corner, “I” represents the inner corner and “IO” represents the
inner-outer corner according to the local pattern of the contour. In this study, we focus on a
contour-tracing algorithm that is suitable for cases that involve a relatively small number of
objects and require real-time tracing, such as augmented reality (AR), mixed reality (MR) and
recognition of image-based codes in small-scale images, e.g., in a mobile computing environment.
Hence, we first introduce and briefly describe the conventional contour-tracing algorithms that
are used in this environment and analyse their tracing accuracy and characteristics.
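The relation between absolute and relative directions can be sketched as a small lookup. The clockwise ordering of the direction tables, the variable names and the sample pixel P = (3, 4) are choices made for this example, with y increasing downwards as in image coordinates.

```matlab
% Absolute directions in clockwise order and their (dx, dy) pixel offsets
dirs = {'N','NE','E','SE','S','SW','W','NW'};
dx   = [ 0   1    1   1    0  -1   -1  -1];
dy   = [-1  -1    0   1    1   1    0  -1];

% Relative directions, each one a further 45 degree clockwise turn from 'front'
rel = {'front','front-right','right','rear-right','rear','rear-left','left','front-left'};

d = find(strcmp(dirs, 'N'));            % tracer currently faces north
r = find(strcmp(rel, 'left')) - 1;      % number of 45 degree clockwise steps
dLeft = mod(d - 1 + r, 8) + 1;          % absolute direction of the tracer's left

P = [3 4];                              % current pixel (x, y), hypothetical
PLeft = P + [dx(dLeft) dy(dLeft)];      % pixel on the tracer's left
```

For a tracer facing N this gives the left direction W and the left pixel (x−1, y), in agreement with the example in the text.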
4.1.1.b: CONVOLUTIONAL ENCODER–DECODER NETWORK
A convolutional encoder–decoder network is a standard network used for tasks requiring
dense pixel-wise predictions like semantic segmentation, computing optical flow and disparity
maps, and contour detection. The encoder in the network computes progressively higher-level
abstract features as the receptive fields in the encoder increase with the depth of the encoder. The
spatial resolution of the feature maps is reduced progressively via a down-sampling operation,
whereas the decoder computes feature maps of progressively increasing resolution via un-
pooling or up-sampling. The network has the ability not only to model features like shape or
appearance of different classes but also to model long-range spatial relationships.
Different variations of the encoder–decoder network have been explored in the literature
for improved performance. Skip connections (Ronneberger et al., 2015) have been used to
recover the fine spatial details during reconstruction which get lost due to successive down-
sampling operations involved in the encoder. Addition of larger context information using
image-level features (Liu et al., 2015), recurrent connections (Pinheiro and Collobert, 2014;
Zheng et al., 2015), and larger convolutional kernels (Peng et al., 2017) has also significantly
improved the accuracy of semantic segmentation.
Fig.2.5 FLOWCHART OF CONVOLUTIONAL ENCODER–DECODER NETWORK
Other methods studied for improving semantic segmentation accuracy include hierarchical
supervision (Chen et al., 2016) and iterative concatenation of feature maps (Jégou et al., 2017).
Our proposed up-sampling idea was inspired by an architecture intended for unsupervised
feature learning. The fundamental aspect of the proposed encoder–decoder network is the
decoding process, which has numerous practical advantages regarding enhancing boundary
delineation and minimizing the total network size for enabling end-to-end training. The key
benefit of such a design is an easy-to-modify encoder–decoder architecture that can be adapted
and changed with very little modification. The encoder produces low-resolution feature maps
for pixel-wise classification. The feature maps produced through the convolution layer are
sparse; they are later convolved with the decoder filters to generate detailed feature maps.
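The idea of a sparse map that is densified by a decoder filter can be sketched as follows. The sketch assumes 2 × 2 max pooling with stored indices and max un-pooling, and it uses a fixed 3 × 3 averaging kernel in place of trained decoder filters; none of these specifics are stated in the text above.

```matlab
x = [ 1  2  5  6;
      3  4  7  8;
      9 10 13 14;
     11 12 15 16];                      % toy 4x4 feature map (hypothetical)

pooled = zeros(2); idx = zeros(2);
for i = 1:2
    for j = 1:2
        blk = x(2*i-1:2*i, 2*j-1:2*j);              % current 2x2 window
        [pooled(i,j), k] = max(blk(:));             % encoder: keep the maximum
        [bi, bj] = ind2sub([2 2], k);               % position inside the window
        idx(i,j) = sub2ind(size(x), 2*i-2+bi, 2*j-2+bj);   % remember where it was
    end
end

up = zeros(size(x));
up(idx) = pooled;                        % decoder: un-pool to a sparse map
dense = conv2(up, ones(3)/9, 'same');    % stand-in decoder filter fills the gaps
```

Only four entries of `up` are nonzero; the convolution spreads them over neighbouring pixels, which is the role the decoder filters play.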
4.1.2 OBJECT DETECTION
Recognizing objects and localizing them is the key to our approach. Recent progress
shows that region-based object detectors achieve state-of-the-art performance. These methods
usually include the following steps: take an image as input, extract some region proposals,
compute semantic features for each proposal using CNNs, and classify each proposal to obtain
its semantic label. With these labels, we only need to segment each object to get the final semantic
segmentation results. Furthermore, we can also get instance segmentation results, which is a more
challenging task than semantic segmentation and is beyond the scope of this report. R-FCN
(Region-based Fully Convolutional Networks) is a new baseline in recent object detection, which is
very efficient by using FCN and powerful by using Residual Networks (ResNets) for feature
extraction.
Region-Based Fully Convolutional Networks (R-FCN)
R-CNN based detectors, like Fast R-CNN or Faster R-CNN, process object detection in 2
stages:
1. Generate region proposals (ROIs).
2. Make classification and localization (bounding box) predictions from the ROIs.
Fast R-CNN computes the feature maps from the whole image once. It then derives the region
proposals (ROIs) from the feature maps directly. For every ROI, no more feature extraction is
needed. That cuts down the process significantly as there are about 2000 ROIs. Following the
same logic, R-FCN improves speed by reducing the amount of work needed for each ROI. The
region-based feature maps are independent of ROIs and can be computed outside each ROI. The
remaining work, which we will discuss later, is much simpler and therefore R-FCN is faster than
Fast R-CNN or Faster R-CNN. Here is the pseudo code for R-FCN for comparison.
R-FCN
Fig. 2.6 5 x 5 feature map
Let’s get into the details and consider a 5 × 5 feature map M with a square object inside.
We divide the square object equally into 3 × 3 regions. Now, we create a new feature map from
M to detect the top left (TL) corner of the square only. The new feature map looks like the one
on the right below. Only the yellow grid cell [2, 2] is activated.
Create a new feature map from the left to detect the top left corner of an object.
Fig. 2.7 New feature map from the left to detect the top left corner of an object
Since we divide the square into 9 parts (top-left TL, top-middle TM, top-right TR, center-left
CL, …, bottom-right BR), we create 9 feature maps, each detecting the corresponding region of
the object.
These feature maps are called position-sensitive score maps because each map detects
(scores) a sub-region of the object.
Generate 9 score maps
Let’s say the dotted red rectangle below is the ROI proposed. We divide it into 3 × 3 regions
and ask how likely it is that each region contains the corresponding part of the object. For example,
how likely is it that the top-left ROI region contains the left eye? We store the results in a 3 × 3
vote array in the right diagram below.
Fig. 2.8 9 SCORE MAP
Apply ROI onto the feature maps to output a 3 x 3 array.
This process to map score maps and ROIs to the vote array is called position-sensitive ROI-pool
which is very similar to the ROI pool in the Fast R-CNN.
For the diagram below:
1. We take the top-left ROI region and map it to the top-left score map (top middle diagram).
2. We compute the average score of the top-left score map bounded by the top-left ROI (blue
rectangle). About 40% of the area inside the blue rectangle has 0 activation and 60% has 100%
activation, i.e. 0.6 on average. So the likelihood that we have detected the top-left part of the
object is 0.6.
3. We store the result (0.6) into array[0][0].
4. We redo it with the top-middle ROI but with the top-middle score map now.
5. The result is computed as 0.55 and stored in array[0][1]. This value indicates the likelihood that
we detected the top-middle part of the object.
Fig. 2.9 Top-Middle Object
Overlay a portion of the ROI onto the corresponding score map to calculate V[i][j]
After calculating all the values for the position-sensitive ROI pool, the class score is the
average of all its elements.
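The position-sensitive ROI pool can be sketched for a single class as follows. The score-map values, the bin size of 5 × 5 cells and the assumption that the ROI covers the whole map are hypothetical; they are chosen so that the top-left vote reproduces the 0.6 of the example above.

```matlab
k = 3;                 % the ROI is divided into k x k regions
binSize = 5;           % each region covers 5 x 5 score-map cells (assumption)

% One score map per object part; all parts fully detected except the top-left
scoreMaps = ones(k*binSize, k*binSize, k*k);
scoreMaps(4:5, 1:5, 1) = 0;        % 40% of the top-left region has 0 activation

V = zeros(k);                       % vote array
for i = 1:k
    for j = 1:k
        rows = (i-1)*binSize + (1:binSize);
        cols = (j-1)*binSize + (1:binSize);
        m = (i-1)*k + j;                    % score map dedicated to region (i, j)
        bin = scoreMaps(rows, cols, m);     % only that map is read for this region
        V(i,j) = mean(bin(:));
    end
end

classScore = mean(V(:));            % average vote over the 3 x 3 array
```

Each region of the ROI reads only its own score map, which is what makes the pooling position-sensitive.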
Let’s say we have C classes to detect. We expand it to C + 1 classes
so we include a new class for the background (non-object).
Fig. 2.10 ROI pool
Each class will have its own 3 × 3 score maps, which gives a total of (C+1) × 3 × 3
score maps. Using its own set of score maps, we predict a class score for each class. Then we
apply a softmax on those scores to compute the probability for each class.
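The softmax step can be sketched as below. The three class scores are hypothetical and stand for C = 2 object classes plus the background class.

```matlab
scores = [0.96 0.30 0.10];        % hypothetical class scores: class 1, class 2, background

e = exp(scores - max(scores));    % subtract the maximum for numerical stability
prob = e / sum(e);                % softmax: probabilities that sum to 1

[~, best] = max(prob);            % index of the most likely class
```

The softmax keeps the ordering of the scores, so the class with the highest averaged vote is also the most probable one.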
4.1.3 OBJECT SEGMENTATION
The main problem is how to classify the overlapping parts among several objects with the same
semantic label. The simplest way to segment each detected object is to output its corresponding
mask as the segmentation result. However, these masks are not accurate enough; they usually miss
some parts of the object. To address this, we introduce a saliency detection method to refine these
masks. The saliency detection approach detects all the salient objects in the form of a saliency map.
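One simple way to picture the refinement is sketched below. The rule used here (add salient pixels that fall inside the detection box to the mask), the saliency threshold of 0.5 and all the data are assumptions for illustration, not the exact rule used in this project.

```matlab
% Coarse object mask that misses part of the object (hypothetical)
mask = logical([0 0 0 0;
                0 1 0 0;
                0 1 0 0;
                0 0 0 0]);

% Saliency map in [0, 1] (hypothetical)
sal = [0.1 0.1 0.1 0.9;
       0.1 0.8 0.9 0.1;
       0.1 0.9 0.7 0.1;
       0.1 0.1 0.1 0.1];

box = [2 2 2 2];                  % detection box as [x y width height], 1-based
inBox = false(size(mask));
inBox(box(2):box(2)+box(4)-1, box(1):box(1)+box(3)-1) = true;

% Keep the original mask and add salient pixels that lie inside the box
refined = mask | (sal > 0.5 & inBox);
```

The salient pixel at the top-right corner is ignored because it lies outside the bounding box, so the box keeps the refinement tied to the detected object.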
In computer vision, a saliency map is an image that shows each pixel's unique quality.
The goal of a saliency map is to simplify and/or change the representation of an image into
something that is more meaningful and easier to analyze. For example, if a pixel has a high grey
level or other unique color quality in a color image, that pixel's quality will show in the saliency
map in an obvious way. Saliency estimation may be viewed as an instance of image segmentation.
In computer vision, image segmentation is the process of partitioning a digital image into multiple
segments (sets of pixels, also known as superpixels). The goal of segmentation is to simplify
and/or change the representation of an image into something that is more meaningful and easier
to analyze. Image segmentation is typically used to locate objects and boundaries (lines, curves,
etc.) in images. More precisely, image segmentation is the process of assigning a label to every
pixel in an image such that pixels with the same label share certain characteristics.
First, we calculate the distance of each pixel to the rest of the pixels in the same frame, where Ii
is the value of pixel i, in the range [0, 255]:
SALS(Ik) = |Ik - I1| + |Ik - I2| + ... + |Ik - IN|
where N is the total number of pixels in the current frame. We can then restructure the formula
by grouping the pixels that have the same value:
SALS(Ik) = ∑ Fn × |Ik - In|
where Fn is the frequency of In and n ranges over [0, 255]. The frequencies are expressed in the
form of a histogram, and computing the histogram takes O(N) time.
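The histogram form of the saliency formula can be sketched as below. The 3 × 3 test image is hypothetical, and the variable names are chosen for this example.

```matlab
img = uint8([0   0 0;
             0 255 0;
             0   0 0]);                  % toy grayscale frame (hypothetical)

vals = double(img(:));
F = accumarray(vals + 1, 1, [256 1]);    % histogram: F(n+1) is the frequency of gray level n

levels = (0:255)';
% salPerLevel(k+1) = sum over n of F(n+1) * |k - n|
salPerLevel = abs(bsxfun(@minus, levels, levels')) * F;

sal = reshape(salPerLevel(vals + 1), size(img));   % look up the saliency of every pixel
```

The single bright pixel gets the value 8 × 255 = 2040 and each dark pixel gets 255, so the pixel that differs most from the rest of the frame is the most salient. The saliency is computed once per gray level (256 values) instead of once per pixel pair.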
The distance between a superpixel pair (i, j) is defined along the shortest path
π = {π(0), ..., π(K)} in the set of paths between them, using R(x, y), the boundary response at
pixel (x, y), and l(i, j).
What is Saliency Detection
Saliency is what stands out to you and how you are able to quickly focus on the most
relevant parts of what you see. In neuroscience, saliency is described as an attention mechanism
in organisms to narrow down to the important parts of what they see.
In UX design, saliency is a feedback loop for understanding which parts of a design are
useful and which are not. Designers use the information they gather from usability and
eye-tracking studies to design better interfaces. Advertisers are well aware that many people
don’t have long attention spans, hence they try to catch the eye of a user with a single glance.
Saliency detection methods are used to better design ads and posters.
Fig. 2.11 Saliency Detection
Saliency detection, essentially, can be used in any area in which you’re trying to
automate the process of understanding what stands out in an image.
Why Saliency Detection
We use saliency detection to make our algorithms smarter. One example of this would be
the Smart Thumbnail Algorithm.
Fig. 2.12 Example of Smart Thumbnail Algorithm
This microservice uses the Saliency Detector algorithm to get information about the
important parts of an image. Using Saliency Detection will make your app/service smarter by
detecting the relevant (salient) parts in your images automatically. You can use this information
to improve your service, and make your app smarter!
DIGITAL IMAGE PROCESSING
Digital image processing deals with manipulation of digital images through a digital
computer. It is a subfield of signals and systems but focuses particularly on images. DIP focuses on
developing a computer system that is able to perform processing on an image. The input of that
system is a digital image; the system processes that image using efficient algorithms and gives
an image as an output. The most common example is Adobe Photoshop, one of the most widely
used applications for processing digital images.
Fig. 2.13 Example of Digital image processing
In the above figure, an image has been captured by a camera and has been sent to a
digital system to remove all the other details, and just focus on the water drop by zooming it in
such a way that the quality of the image remains the same. The digital image processing deals with
developing a digital system that performs operations on a digital image.
What is an Image
An image is nothing more than a two-dimensional signal. It is defined by the
mathematical function f (x, y) where x and y are the horizontal and vertical co-ordinates.
The value of f (x, y) at any point gives the pixel value at that point of the image. The above
figure is an example of a digital image that you are now viewing on your computer screen.
But actually, this image is nothing but a two-dimensional array of numbers
ranging between 0 and 255.
128 30 123
232 123 321
123 77 89
80 255 255
Fig. 2.14 Example of Digital Image
Each number represents the value of the function f (x, y) at a point. In this case the
values 128, 30 and 123 each represent an individual pixel value. The dimensions of the picture are
actually the dimensions of this two-dimensional array.
Relationship between a digital image and a signal
If the image is a two-dimensional array, then what does it have to do with a signal? In
order to understand that, we first need to understand what a signal is.
Signal:
In the physical world, any quantity measurable through time, over space or over any higher
dimension can be taken as a signal. A signal is a mathematical function, and it conveys some
information.
A signal can be one-dimensional, two-dimensional or higher-dimensional. A one-dimensional
signal is a signal that is measured over time. The common example is a voice signal.
Two-dimensional signals are those that are measured over some other physical
quantities. An example of a two-dimensional signal is a digital image.
Relationship
Anything that conveys information or broadcasts a message in the physical world
between two observers is a signal. That includes speech (the human voice) or an image.
When we speak, our voice is converted to a sound wave/signal and transmitted
over time to the person we are speaking to. The same holds for the way a digital camera
works: acquiring an image with a digital camera involves the transfer of a signal from one
part of the system to the other.
How a digital image is formed
Capturing an image with a camera is a physical process. Sunlight is used as a
source of energy. A sensor array is used for the acquisition of the image. When the sunlight
falls upon the object, the amount of light reflected by that object is sensed by the sensors,
and a continuous voltage signal is generated from the sensed data. In order to create a
digital image, we need to convert this data into a digital form. This involves sampling and
quantization. The result of sampling and quantization is a
two-dimensional array or matrix of numbers, which is nothing but a digital image.
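Sampling and quantization can be sketched in a few lines of MATLAB. The continuous intensity function `f`, the 3 × 4 sampling grid and the 8-bit depth are assumptions made for this illustration.

```matlab
% Hypothetical continuous intensity in [0, 1] over the unit square
f = @(x, y) 0.5 + 0.5 * sin(2*pi*x) .* cos(2*pi*y);

% Sampling: evaluate f on a discrete 3 x 4 grid of positions
[x, y] = meshgrid(linspace(0, 1, 4), linspace(0, 1, 3));
samples = f(x, y);

% Quantization: map each sample to one of 256 integer levels (8 bits)
img = uint8(round(samples * 255));
```

`img` is now a 3 × 4 matrix of integers between 0 and 255, i.e. a very small digital image.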
Overlapping fields
Machine/Computer vision
Machine vision or computer vision deals with developing a system in which the input is
an image and the output is some information.
For example: Developing a system that scans human face and opens any kind of lock.
This system would look something like this.
Fig. 2.15 Example of Developing a system that scans human face and opens any kind of lock
Computer graphics
Computer graphics deals with the formation of images from object models, rather than
from images captured by some device. For example: object rendering, i.e. generating an image
from an object model. Such a system would look something like this.
Fig.2.16 Example of Object rendering.
Artificial intelligence
Artificial intelligence is more or less the study of putting human intelligence into
machines. Artificial intelligence has many applications in image processing. For example:
developing computer-aided diagnosis systems that help doctors in interpreting X-ray and
MRI images, etc., and then highlighting conspicuous sections to be examined by the doctor.
Signal processing
Signal processing is an umbrella and image processing lies under it. The amount of light
reflected by an object in the physical world (3D world) passes through the lens of the camera and
becomes a 2D signal, which results in image formation. This image is then digitized using
methods of signal processing, and then this digital image is manipulated in digital image processing.
4.2 MATLAB
4.2.1: INTRODUCTION
MATLAB is a programming language developed by MathWorks. It started out as a
matrix programming language where linear algebra programming was simple. It can be run both
under interactive sessions and as a batch job.
It allows matrix manipulations; plotting of functions and data; implementation of
algorithms; creation of user interfaces; interfacing with programs written in other languages,
including C, C++, Java, and FORTRAN; data analysis; algorithm development; and creation of
models and applications.
It has numerous built-in commands and math functions that help you in mathematical
calculations, generating plots, and performing numerical methods.
4.2.2 MATLAB'S POWER OF COMPUTATIONAL MATHEMATICS
MATLAB is used in every facet of computational mathematics. Following are some of the
mathematical calculations where it is most commonly used −
Dealing with Matrices and Arrays
2-D and 3-D Plotting and graphics
Linear Algebra
Algebraic Equations
Non-linear Functions
Statistics
Data Analysis
Calculus and Differential Equations
Numerical Calculations
Integration
Transforms
Curve Fitting
Various other special functions
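A few of the listed calculations can be tried directly at the command prompt. The numbers below are arbitrary and serve only as an illustration of linear algebra, integration and curve fitting with built-in functions.

```matlab
% Linear algebra: solve the system A*x = b
A = [2 1; 1 3];
b = [3; 5];
x = A \ b;

% Integration: approximate the integral of t^2 over [0, 1] with the trapezoidal rule
t = linspace(0, 1, 101);
area = trapz(t, t.^2);

% Curve fitting: least-squares straight line through three points
p = polyfit([0 1 2], [1 3 5], 1);   % p(1) is the slope, p(2) the intercept
```

The system has the solution x = (0.8, 1.4), the integral is close to 1/3, and the fitted line is y = 2x + 1.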
4.2.3 FEATURES OF MATLAB
Following are the basic features of MATLAB −
It is a high-level language for numerical computation, visualization and application development.
It also provides an interactive environment for iterative exploration, design and problem solving.
It provides vast library of mathematical functions for linear algebra, statistics, Fourier analysis,
filtering, optimization, numerical integration and solving ordinary differential equations.
It provides built-in graphics for visualizing data and tools for creating custom plots.
MATLAB's programming interface gives development tools for improving code quality and
maintainability and for maximizing performance.
It provides tools for building applications with custom graphical interfaces.
It provides functions for integrating MATLAB based algorithms with external applications and
languages such as C, Java, .NET and Microsoft Excel.
4.2.4 USES OF MATLAB
MATLAB is widely used as a computational tool in science and engineering
encompassing the fields of physics, chemistry, math and all engineering streams. It is used in a
range of applications including −
Signal Processing and Communications
Image and Video Processing
Control Systems
Test and Measurement
Computational Finance
Computational Biology
4.2.5 Environment Setup
Setting up the MATLAB environment is a matter of a few clicks. The installer can be
downloaded from the MathWorks website.
MathWorks provides the licensed product, a trial version and a student version as well.
You need to log into the site and wait a little for their approval.
After downloading the installer, the software can be installed in a few clicks.
Fig.3.1 MathWorks Installer
Fig. 3.2 Installing Pause
4.2.6 Understanding the MATLAB Environment
MATLAB development IDE can be launched from the icon created on the desktop. The
main working window in MATLAB is called the desktop. When MATLAB is started, the
desktop appears in its default layout −
Fig. 3.3 MATLAB desk top
The desktop has the following panels −
Current Folder − This panel allows you to access the project folders and files.
Fig. 3.4 Current Folder
Command Window − This is the main area where commands can be entered at the command
line. It is indicated by the command prompt (>>).
Fig. 3.5 Command Window
Workspace − The workspace shows all the variables created and/or imported from files.
Fig. 3.6 Workspace
Command History − This panel shows commands that were entered at the command
line and lets you rerun them.
Fig. 3.7 Command History
CHAPTER 5
RESULT AND DESCRIPTION
5.1 FLOW CHART
Flow chart blocks: Input Image, CEDN, Contour, R-FCN, SCG, Masks, Contour Saliency,
Segmentation result.
5.2 Algorithm:
First, the input image is passed into the segmentation process through the CONVOLUTIONAL
ENCODER–DECODER NETWORK (CEDN). CEDN is a standard network used for tasks
requiring dense pixel-wise predictions like semantic segmentation, computing optical flow and
disparity maps, and contour detection. Different variations of the encoder–decoder network have
been explored in the literature for improved performance.
Then the image is processed into a contour map; contour detection is an important segmentation
technique used for separating an image by boundary or region. Simultaneously, SCG is the
technique that helps to convert objects into masks. After that, the masks are combined with the
contour as a part of image detection by splitting the image as required, mostly by region. During
mask generation, a pixel is kept only when its value satisfies the required condition.
R-FCN applies bounding boxes to produce an accurate image contour. Finally, the
segmentation result is obtained after completing the Contour Saliency step, where the image is
separated along the collinear lines. All of this is done using code written in
MATLAB.
5.3 OUTPUT:
Fig. 4.1 Mask-1
Fig.4.2 Mask-2
Fig. 4.3 Mask-3
Fig.4.4 Mask-4
Fig. 4.5 Contour Map
Fig. 4.6 Input Image With Bounding Boxes
Fig. 4.7 Final Mask
Fig. 4.8 Saliency Map
Fig. 4.9 Segmented Image
5.4 ADVANTAGES
1. It is less time-consuming.
2. It is easy to construct when compared to the previous method.
3. It produces a sufficiently accurate segmentation mask.
4. It is not expensive to obtain.
5. It uses bounding boxes, which make the segmentation process simple and accurate.
5.5 APPLICATIONS
Semantic image segmentation, which has become one of the key applications in the image
processing and computer vision domain, has been used in multiple domains such as
1. Medical area for segmenting tumour size.
2. Medical area for segmenting wound size.
3. Robotics for segmenting bombs.
4. Robotics for segmenting water.
5. Robotics for segmenting hills area.
6. Intelligent transportation for segmenting individual vehicles for counting.
CONCLUSION
Compared with the traditional image semantic segmentation method, the method based
on the convolutional neural network is simple and gives a better segmentation result.
Model fusion helps to achieve high accuracy on
small objects while still achieving high global accuracy.
Image semantic segmentation is a key technology in the field of image processing and
computer vision. It is an important part of how a computer understands image content. The quality
of semantic segmentation plays a crucial role in subsequent tasks such as image understanding,
scene analysis and target tracking.
Therefore, it is of great practical significance to study an effective image semantic
segmentation algorithm. With the continuous development of deep learning, the high accuracy
brought by neural networks has been widely studied and applied in many scenes such as image
recognition and semantic segmentation. Compared with the traditional semantic segmentation
method based on region feature extraction, the image features acquired by the deep
convolutional neural network method have stronger representation ability, so the algorithm has
a better effect. The basic idea of semantic segmentation based on a deep convolutional neural
network is to extract the semantic features of each pixel in the image by using the neural network,
then classify and identify the pixels according to these features, so as to obtain the segmentation
image containing semantic information. Therefore, the core of this method is how to improve the
recognition accuracy of pixels by the network.
FUTURE SCOPE
1. Our project didn't provide Instance Ground Truth.
"Ground truth" refers to information collected on location. Ground truth allows image data to
be related to real features and materials on the ground.
Ground truth also helps with atmospheric correction. Since images from satellites obviously have
to pass through the atmosphere, they can get distorted because of absorption in the atmosphere.
So ground truth can help fully identify objects in satellite photos.
"Ground truth" means a set of measurements that is known to be much more accurate than
measurements from the system you are testing.
For example, suppose you are testing a stereo vision system to see how well it can estimate 3D
positions. The "ground truth" might be the positions given by a laser rangefinder which is known
to be much more accurate than the camera system.
2. It can be extended with end-to-end encryption to secure the image processing results and
to improve the performance of the object detection. We expect these improvements to our
project in the future.
APPENDIX
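clc
close all
clear all
% Ask the user to pick the input image
[fn, pn] = uigetfile('*.*','select ip image');
tic
hi = imread([pn,fn]);
% Third output absorbs the colour channels so that co is the true column count
[ro, co, ~] = size(hi);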
%1.Region proposal generation

imgl=hi;
%mcg starts
im = double((imgl));
im2 = double(imresize((imgl),[512 512]));
im3 = double(imresize((imgl),[825 825]));
[masked_image] = scg(im,1);
[masked_image2] = scg(im2,1);
[masked_image3] = scg(im3,1);
%%
%
figure,imshow(uint8(masked_image.*im));
title('Aligned Hierarchi1');
figure,imshow(uint8(masked_image2.*im2));
figure,imshow(uint8(masked_image3.*im3));
mh1=uint8(imresize(uint8(masked_image.*im),[ro co]));
mh2=uint8(imresize(uint8(masked_image2.*im2),[ro co]));
mh3=uint8(imresize(uint8(masked_image3.*im3),[ro co]));
mh=mh1+mh2+mh3;
figure,imshow(uint8(mh));
title('Region proposal generation opj')
%mcg ends
%CEDN starts
im1=rgb2gray(hi);
im1=medfilt2(im1,[3 3]);
BW = edge(im1,'sobel');
[imx,imy]=size(BW);
msk=[0 0 0 0 0;
0 1 1 1 0;
0 1 1 1 0;
0 1 1 1 0;
0 0 0 0 0;];
B=conv2(double(BW),double(msk));
L = bwlabel(B,8);
mx=max(max(L));
op=hi;
B2=conv2(double(BW),double(msk));
L2 = bwlabel(B2,8);
mx2=max(max(L2));
B3=conv2(double(BW),double(msk));
L3 = bwlabel(B3,8);
mx3=max(max(L3));
[r,c] = find(L==17);
rc = [r c];
[sx sy]=size(rc);
n1=zeros(imx,imy);
for i=1:sx
x1=rc(i,1);
y1=rc(i,2);
n1(x1,y1)=255;
end
figure,imshow(B);
title('contour map');
%CEDN ends
imagen=op;
if size(imagen,3)==3 % RGB image

imagen=rgb2gray(imagen);
end
threshold = graythresh(imagen);
imagen =~im2bw(imagen,threshold);
imagen = bwareaopen(imagen,30);
pause(1)
figure,
imshow(~imagen);
title('INPUT IMAGE WITH Bounding Boxes')
[L Ne]=bwlabel(imagen);
propied=regionprops(L,'BoundingBox');
hold on
%% Plot Bounding Box
for n=1:size(propied,1)
rectangle('Position',propied(n).BoundingBox,'EdgeColor','g','LineWidth',2)
end
hold off
pause (1)
if isdir('networks')==0
mkdir('networks');
end
inputs=dlmread('Inputs1.txt', '\t', 1, 0);

targets=dlmread('Targets1.txt', '\t', 1, 0);
inputs2=dlmread('Inputs2.txt', '\t', 1, 0);
targets16=dlmread('Targets2.txt', '\t', 1, 0);
inputs = inputs';
targets = targets';
inputs2 = inputs2';
targets16 = targets16';
trainFcn = 'trainlm';
for i=1:2 %vary number of hidden layer neurons (here 1 to 2; raise the upper limit to try more)

hiddenLayerSize = i; %number of hidden layer neurons
net = fitnet(hiddenLayerSize,trainFcn); %create a fitting network
net.divideParam.trainRatio = 70/100; %use 70% of data for training
net.divideParam.valRatio = 15/100; %15% for validation
net.divideParam.testRatio = 15/100; %15% for testing
[net,tr] = train(net,inputs,targets); % train the network

outputs = net(inputs(:,tr.testInd)); %simulate 15% test data
outputs2016 = net(inputs2);
rmse15(i)=sqrt(mean((outputs-targets(tr.testInd)).^2));
r15(i)=regression(targets(tr.testInd), outputs);
r2016(i)=regression(targets16, outputs2016);
save(fullfile('networks',['net' num2str(i)]),'net'); %fullfile keeps the path portable across operating systems
end
img=mh;
dim = size(img);
width = dim(2);height = dim(1);
md = min(width, height);%minimum dimension
cform = makecform('srgb2lab');
lab = applycform(img,cform);
l = double(lab(:,:,1));
a = double(lab(:,:,2));
b = double(lab(:,:,3));
sm = zeros(height, width);
off1 = int32(md/2); off2 = int32(md/4); off3 = int32(md/8);
I=imgl;
I = imresize(I,[256,256]);
I = imadjust(I,stretchlim(I));
I_Otsu = im2bw(I,graythresh(I));
I_HIS = rgb2hsi(I);
cform = makecform('srgb2lab');
lab_he = applycform(I,cform);
ab = double(lab_he(:,:,2:3));
nrows = size(ab,1);
ncols = size(ab,2);
ab = reshape(ab,nrows*ncols,2);
nColors = 3;
[cid cce] = contoursailencymerg(ab,nColors);
pixel_labels = reshape(cid,nrows,ncols);
segmented_images = cell(1,3);
rgb_label = repmat(pixel_labels,[1,1,3]);
for k = 1:nColors
colors = I;
colors(rgb_label ~= k) = 0;
segmented_images{k} = colors;
end
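%Multi-scale saliency: contrast of each pixel against the mean Lab colour of three
%surrounding windows (offsets off1, off2, off3). The scale-2 and scale-3 lines were
%truncated in the listing and are assumed to follow the scale-1 pattern.
for j = 1:height
y11 = max(1,j-off1); y12 = min(j+off1,height);
y21 = max(1,j-off2); y22 = min(j+off2,height);
y31 = max(1,j-off3); y32 = min(j+off3,height);
for k = 1:width
x11 = max(1,k-off1); x12 = min(k+off1,width);
x21 = max(1,k-off2); x22 = min(k+off2,width);
x31 = max(1,k-off3); x32 = min(k+off3,width);
lm1 = mean2(l(y11:y12,x11:x12));am1 = mean2(a(y11:y12,x11:x12));bm1 = mean2(b(y11:y12,x11:x12));
lm2 = mean2(l(y21:y22,x21:x22));am2 = mean2(a(y21:y22,x21:x22));bm2 = mean2(b(y21:y22,x21:x22));
lm3 = mean2(l(y31:y32,x31:x32));am3 = mean2(a(y31:y32,x31:x32));bm3 = mean2(b(y31:y32,x31:x32));
cv1 = (l(j,k)-lm1).^2 + (a(j,k)-am1).^2 + (b(j,k)-bm1).^2;
cv2 = (l(j,k)-lm2).^2 + (a(j,k)-am2).^2 + (b(j,k)-bm2).^2;
cv3 = (l(j,k)-lm3).^2 + (a(j,k)-am3).^2 + (b(j,k)-bm3).^2;
sm(j,k) = cv1 + cv2 + cv3;
end
end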
figure,imshow(img);
figure,imshow(sm,[]);
title('saliency map');
for k = 1:nColors
colors = I;
colors(rgb_label ~= k) = 0;
segmented_images{k} = colors;
end
figure, subplot(3,1,1);imshow(segmented_images{1});title('Segment 1');

subplot(3,1,2);imshow(segmented_images{2});title('Segment 2');
subplot(3,1,3);imshow(segmented_images{3});title('Segment 3');
set(gcf, 'Position', get(0,'Screensize'));
REFERENCES
[1] H. G. Barrow and J. M. Tenenbaum. Computational vision. IEEE, 1981.
[2] S. Bileschi and L. Wolf. A unified system for object detection, texture recognition, and context analysis based on the standard model feature set. In BMVC, 2005.
[3] D. Comaniciu and P. Meer. Mean shift: A robust approach toward feature space analysis. PAMI, 2002.
[4] N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. In CVPR, 2005.
[5] V. Ferrari, L. Fevrier, F. Jurie, and C. Schmid. Groups of adjacent contour segments for object detection. PAMI, 2008.
[6] M. Fink and P. Perona. Mutual boosting for contextual inference. In NIPS, 2003.
[7] S. Gould, R. Fulton, and D. Koller. Decomposing a scene into geometric and semantically consistent regions. In ICCV, 2009.
[8] C. Gu, J. J. Lim, P. Arbelaez, and J. Malik. Recognition using regions. In CVPR, 2009.
[9] G. Heitz and D. Koller. Learning spatial context: Using stuff to find things. In ECCV, 2008.
[10] G. Heitz, S. Gould, A. Saxena, and D. Koller. Cascaded classification models: Combining models for holistic scene understanding. In NIPS, 2008.
[11] D. Hoiem, A. A. Efros, and M. Hebert. Closing the loop on scene interpretation. In CVPR, 2008.
[12] D. Hoiem, A. A. Efros, and M. Hebert. Putting objects in perspective. IJCV, 2008.
[13] B. Leibe, A. Leonardis, and B. Schiele. Combined object categorization and segmentation with an implicit shape model. In ECCV, 2004.
[14] C. Liu, J. Yuen, and A. Torralba. Nonparametric scene parsing: Label transfer via dense scene alignment. In CVPR, 2009.
[15] A. Rabinovich, A. Vedaldi, C. Galleguillos, E. Wiewiora, and S. Belongie. Objects in context. In ICCV, 2007.
[16] J. Shotton, J. Winn, C. Rother, and A. Criminisi. TextonBoost: Joint appearance, shape and context modeling for multi-class object recognition and segmentation. In ECCV, 2006.
[17] E. Sudderth, A. Torralba, W. Freeman, and A. Willsky. Describing visual scenes using transformed objects and parts. IJCV, 2007.
[18] A. Torralba, K. P. Murphy, W. T. Freeman, and M. A. Rubin. Context-based vision system for place and object recognition, 2003.
[19] A. Torralba, K. Murphy, and W. Freeman. Sharing features: efficient boosting procedures for multiclass object detection. In CVPR, 2004.
[20] A. Torralba, K. Murphy, and W. Freeman. Contextual models for object detection using boosted random fields. In NIPS, 2004.
[21] Z. Tu, X. Chen, A. L. Yuille, and S.-C. Zhu. Image parsing: Unifying segmentation, detection, and recognition. In ICCV, 2003.
[22] P. Viola and M. J. Jones. Robust real-time face detection. IJCV, 2004.
[23] C. Wojek and B. Schiele. A dynamic conditional random field model for joint labeling of object and scene classes. In ECCV, 2008.