
IMAGE TOKENIZER SERVICE

The general structure of the thesis will have four sections:

1. Introduction
2. Materials and Methods - Describe all the methods, models, and architectures (including auxiliary
architectures) that were referenced for comparison
3. Results - The comparison should be included here, along with the conclusions
4. Discussion - Either state that you had to use this architecture (prerequisite), or describe which other
architectures were considered in the work and why you chose the described architecture.

** These are the major chapters of the work. You don't have to use the exact names, but the structure should
be as mentioned.

1 Abstract
An image tokenizer service, or object detection service, finds real-world objects that are
present in a digital image. Its applications include face tags, customer
segmentation, and personalized ads or content based on the photo tags. With the rapid
development of deep learning, more powerful tools that can learn semantic, high-level, deeper
features have been introduced to address the problems of traditional architectures. These
models differ in network architecture, training strategy, optimization function,
and so on.

The main goals of developing the Image Tokenizer Service are the following:
● Build an image tokenization service (search in photos)
We want to offer our customers the option to search in photos not only by image name but
also by meaning (what the photo represents). This feature will initially be integrated in
Cloud; however, it will also be used in Hosting and can be used in any other application that
stores images.
● SEO
Populate the images with tags for the objects detected in them, in order to increase
website traffic and expose website content to users who might be interested in what we are
offering, by concentrating on search engine optimization (SEO). This can be useful for any
site builder.
● Clustering and segmentation
Extract features from users' images to be used for clustering or segmentation.
● Models
Identify the most suitable models by manually running predictions on the validation set and
calculating metrics on the predictions for different confidence thresholds.

2 Introduction
The image tokenizer service depends on three groups of factors: compositional factors,
semantic factors, and context factors. These can be understood more precisely through the
features they include: size and location, object type and depiction strength, and unusual
object-scene pairs in the image, respectively. Refer to the figure below, which gives a basic idea.

The Image Tokenizer Service will receive an image or a batch of images and will return a map of
tokens and confidence levels. It can also return the density of each token, that is, the share of
the image area covered by that token.
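
A minimal sketch of the response map described above, assuming each detection arrives as a token, a confidence, and a bounding box in pixels. The names `Detection` and `build_token_map` are hypothetical, and approximating density as the summed box area divided by the image area is an assumption made for illustration; the proposal does not fix how density is computed.

```python
from dataclasses import dataclass


@dataclass
class Detection:
    token: str          # predicted class name, e.g. "dog"
    confidence: float   # model confidence in [0, 1]
    box: tuple          # (x_min, y_min, x_max, y_max) in pixels


def build_token_map(detections, image_width, image_height):
    """Return {token: {"confidence": ..., "density": ...}} for one image.

    Confidence is the highest confidence seen for the token. Density is the
    summed box area for the token divided by the image area, capped at 1.0
    (overlapping boxes are not de-duplicated in this sketch).
    """
    image_area = image_width * image_height
    result = {}
    for det in detections:
        x_min, y_min, x_max, y_max = det.box
        box_area = max(0, x_max - x_min) * max(0, y_max - y_min)
        entry = result.setdefault(det.token, {"confidence": 0.0, "density": 0.0})
        entry["confidence"] = max(entry["confidence"], det.confidence)
        entry["density"] = min(1.0, entry["density"] + box_area / image_area)
    return result


# Made-up detections for a 100 x 100 pixel image.
detections = [
    Detection("dog", 0.91, (0, 0, 50, 50)),
    Detection("dog", 0.75, (50, 50, 100, 100)),
    Detection("ball", 0.60, (10, 10, 20, 20)),
]
token_map = build_token_map(detections, image_width=100, image_height=100)
print(token_map)
```

Here the two "dog" boxes together cover half of the image, so the token "dog" gets a density of 0.5 and keeps the higher of its two confidences.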

3 Research Question
The aim is to make this service useful in multiple ways:

● Build an image tokenization service (search in photos)
We want to offer our customers the option to search in photos not only by image name but
also by meaning (what the photo represents). This feature will initially be integrated in
Cloud; however, it will also be used in Hosting and can be used in any other application that
stores images.
● SEO
Populate the images with tags for the objects detected in them, in order to increase
website traffic and expose website content to users who might be interested in what we are
offering, by concentrating on search engine optimization (SEO). This can be useful for any
site builder.
● Clustering and segmentation
Extract features from users' images to be used for clustering or segmentation.
● Models
Identify the most suitable models by manually running predictions on the validation set and
calculating metrics on the predictions for different confidence thresholds.

4 Proposed Architecture

● The entry point of the application will be a REST API that exposes endpoints for real-time
and batch mode.
● The image will be split into parts if it is large or distorted.
● The input image(s) will be broadcast to the OCR and image tokenizer models (YOLOv3,
Inception-ResNet-v2).
● The prediction outputs from all the models will be aggregated by the Predictions Merge into
one prediction map.
● The prediction map will be enhanced with synonyms of the predicted classes.

● If new and better models are developed in the future, we can replace our models (if
the new ones supersede the functionality of the current models) or attach them to the extension
points.
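
The steps above can be sketched as plain functions. This is only an illustration under stated assumptions: the function names, the fixed tile size, the rule of keeping the highest confidence when several models report the same token, and the `SYNONYMS` table are all hypothetical. The proposal does not specify how the Predictions Merge combines overlapping results, and the model outputs below are stand-in dictionaries, not real model calls.

```python
def split_into_tiles(width, height, max_side):
    """Return tile boxes (x_min, y_min, x_max, y_max) that cover the image."""
    tiles = []
    for top in range(0, height, max_side):
        for left in range(0, width, max_side):
            tiles.append((left, top, min(left + max_side, width), min(top + max_side, height)))
    return tiles


def merge_predictions(model_outputs):
    """Merge per-model {token: confidence} maps, keeping the highest confidence."""
    merged = {}
    for output in model_outputs:
        for token, confidence in output.items():
            merged[token] = max(merged.get(token, 0.0), confidence)
    return merged


def add_synonyms(prediction_map, synonyms):
    """Add each synonym with the confidence of the token it was derived from."""
    enhanced = dict(prediction_map)
    for token, confidence in prediction_map.items():
        for synonym in synonyms.get(token, []):
            enhanced[synonym] = max(enhanced.get(synonym, 0.0), confidence)
    return enhanced


# Hypothetical outputs standing in for the detector, classifier, and OCR models.
yolo_output = {"dog": 0.88, "ball": 0.55}
classifier_output = {"dog": 0.93, "grass": 0.40}
ocr_output = {"sale": 0.97}

SYNONYMS = {"dog": ["puppy", "hound"]}  # illustrative lookup table

tiles = split_into_tiles(width=2500, height=1000, max_side=1000)
merged = merge_predictions([yolo_output, classifier_output, ocr_output])
prediction_map = add_synonyms(merged, SYNONYMS)
print(len(tiles), prediction_map)
```

Keeping each stage as a separate function mirrors the extension points in the list: a new model only has to produce another `{token: confidence}` map to be passed to the merge step.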

4.1 CNN Architectures


A Convolutional Neural Network (CNN) is a neural network that contains one or more
convolutional layers. These layers act as spatial filters that are used
for extracting features from pictures. Each layer can loosely be regarded as playing the role of
a combination of hand-crafted features such as Histogram of Oriented Gradients (HOG) and color
histograms. A typical input for a convolutional layer is a 3-dimensional grid
(height, width, channels).
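
As an illustration of a spatial filter applied to a 3-dimensional input grid, here is a minimal pure-Python sketch. The function name `convolve` and the toy image are assumptions made for this example; real CNN layers learn their kernel values and use optimized libraries.

```python
def convolve(image, kernel):
    """Slide a kH x kW x C kernel over an H x W x C image (no padding, stride 1).

    Returns one (H - kH + 1) x (W - kW + 1) feature map. As in most deep
    learning libraries, the kernel is not flipped (cross-correlation).
    """
    height, width, channels = len(image), len(image[0]), len(image[0][0])
    k_height, k_width = len(kernel), len(kernel[0])
    feature_map = []
    for row in range(height - k_height + 1):
        out_row = []
        for col in range(width - k_width + 1):
            total = 0.0
            for i in range(k_height):
                for j in range(k_width):
                    for c in range(channels):
                        total += image[row + i][col + j][c] * kernel[i][j][c]
            out_row.append(total)
        feature_map.append(out_row)
    return feature_map


# 3 x 3 image with one channel: dark on the left, bright in the last column.
image = [[[0.0], [0.0], [1.0]] for _ in range(3)]
# 2 x 2 kernel that responds to a dark-to-bright step from left to right.
kernel = [[[-1.0], [1.0]] for _ in range(2)]

feature_map = convolve(image, kernel)
print(feature_map)
```

The filter returns zero where the image is flat and a positive value where the vertical edge sits, which is the kind of feature a first convolutional layer typically picks up.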

4.2 Model
The CNN model has the benefits of small model size, good energy efficiency, and good accuracy,
because it is fully convolutional and needs only a single forward pass.

A mathematical model or an abstract model architecture (such as YOLOv3 or Inception-ResNet-v2)
will be used.

● You Only Look Once (YOLO) is a single convolutional network that predicts the bounding boxes
and the class probabilities for these boxes.
● Inception-ResNet-v2 is a convolutional neural network that is 164 layers deep and can
classify images into 1000 object categories, such as keyboard, mouse, pencil, and many
animals.
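
The two model types return differently shaped outputs, and both have to be turned into tokens with confidences. The sketch below is an assumption-laden illustration, not the actual post-processing of either model: the function names and the input layout are hypothetical, and scoring a box as objectness times its best class probability is one common convention that is assumed here.

```python
import math


def softmax(logits):
    """Convert raw scores to probabilities that sum to 1."""
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [value / total for value in exps]


def classifier_tokens(logits, class_names, top_k=2):
    """Whole-image classifier: one score vector, keep the top_k classes."""
    probabilities = softmax(logits)
    ranked = sorted(zip(class_names, probabilities), key=lambda pair: pair[1], reverse=True)
    return dict(ranked[:top_k])


def detector_tokens(boxes, class_names):
    """Detector: each box carries an objectness score and class probabilities.

    The box score is objectness times the best class probability (assumed
    convention); per token, the highest box score is kept.
    """
    tokens = {}
    for box in boxes:
        best = max(range(len(class_names)), key=lambda i: box["class_probs"][i])
        score = box["objectness"] * box["class_probs"][best]
        name = class_names[best]
        tokens[name] = max(tokens.get(name, 0.0), score)
    return tokens


CLASS_NAMES = ["keyboard", "mouse", "pencil"]

# Made-up raw outputs, for illustration only.
from_classifier = classifier_tokens([2.0, 1.0, 0.1], CLASS_NAMES)
from_detector = detector_tokens(
    [{"objectness": 0.9, "class_probs": [0.1, 0.8, 0.1]}], CLASS_NAMES
)
print(from_classifier)
print(from_detector)
```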

5 Evaluation of the Thesis


The confidence score returned with each prediction can be used to fine-tune the precision of
the detection. Raising the confidence threshold (so that only the predictions with
higher confidence are selected) improves precision. However, this comes at a cost in the number
of objects detected: increasing the confidence threshold lowers the number of objects
detected (recall).
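
The trade-off can be made concrete with a small sketch. The function name `precision_recall`, the input layout, and the validation data are made up for illustration; treating precision as 1.0 when no prediction passes the threshold is a convention assumed here.

```python
def precision_recall(predictions, ground_truth_count, threshold):
    """Compute precision and recall for one confidence threshold.

    predictions: list of (confidence, is_correct) pairs from the validation set.
    ground_truth_count: number of objects actually present in the validation set.
    """
    kept = [is_correct for confidence, is_correct in predictions if confidence >= threshold]
    true_positives = sum(kept)
    precision = true_positives / len(kept) if kept else 1.0
    recall = true_positives / ground_truth_count if ground_truth_count else 0.0
    return precision, recall


# Made-up validation predictions, for illustration only.
predictions = [
    (0.95, True), (0.90, True), (0.70, True),
    (0.65, False), (0.40, True), (0.30, False),
]
GROUND_TRUTH_COUNT = 5

for threshold in (0.3, 0.6, 0.9):
    precision, recall = precision_recall(predictions, GROUND_TRUTH_COUNT, threshold)
    print(f"threshold={threshold:.1f} precision={precision:.2f} recall={recall:.2f}")
```

With this made-up data, raising the threshold from 0.3 to 0.9 moves precision from 0.67 to 1.00 while recall falls from 0.80 to 0.40.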

6 Expected Outcome
Any cloud provider or web application that works with images can use this service to extract
tokens or tags from images. Searching photos by object is a standard feature
provided by most well-known platforms, such as Google Photos, Apple Photos, OneDrive, and
Amazon Photos.
The Image Tokenizer Service will be used in client applications (Mail or Cloud) to index images
in such a way that if the customer searches Cloud or Mail, the images with similar tokens will be
shown to the customer. This service will be developed to provide an improved customer
experience on the XYZ website.

7 Scope
The scope of this PoC will be to:

● Prove that the Image Tokenizer Service can be implemented with the Machine Learning tools
and technologies currently available on the market.
● Propose an architecture for the MVP. The architecture should be detailed in the TC.
● Estimate the effort to implement the MVP.
● Propose some performance metrics for the MVP.
● Describe the integration of Mail/Cloud/OOE with the Image Tokenizer Service (the API that will
be exposed).
● Extract from images only a limited set of tokens (nouns). The list of supported tokens will be the
union of:
   ● the supported tokens provided by the open models available on the internet
   ● a list of nouns suggested/requested by the Product Owner (if any)

8 Acceptance Criteria
The PoC will be considered done if:

● All the points from the scope are completed and documented.
● A PoC version of the Image Tokenizer Service is deployed in Kubernetes (KOOPA) as a Docker
image.
● The model used for the PoC is able to identify the tokens with an accuracy higher than 80%.
● The accuracy will be measured as follows: an accuracy of 80% means that out of 100 pictures
containing noun "A", the model will tag 80.
● The predictions of the model (tokens or tags) are translated properly into EN, DE, FR, and ES.
● The deployed version of the PoC is available internally (via URL) for testing and feedback. A
very small HTML form should be developed to allow manual upload and testing of the Image
Tokenizer Service.
● All the libraries/tools used are licensed for commercial use and have no known security
vulnerabilities.
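
The accuracy definition in the criteria above can be sketched as a per-token check. The function name `token_accuracy`, the data layout, and the example data are hypothetical. This definition only counts missed tags; it does not penalize tags placed on pictures that do not contain the noun.

```python
def token_accuracy(labelled_images, predicted_tokens, token):
    """Share of images labelled with `token` for which the model also returned it."""
    relevant = [image_id for image_id, labels in labelled_images.items() if token in labels]
    if not relevant:
        return None  # the token does not occur in the validation set
    tagged = sum(1 for image_id in relevant if token in predicted_tokens.get(image_id, set()))
    return tagged / len(relevant)


# 100 hypothetical validation pictures that all contain the noun "cat";
# the stand-in model output tags the first 80 of them.
labelled_images = {f"img_{i}": {"cat"} for i in range(100)}
predicted_tokens = {f"img_{i}": {"cat"} for i in range(80)}

accuracy = token_accuracy(labelled_images, predicted_tokens, "cat")
print(accuracy)          # 0.8
print(accuracy > 0.80)   # False: exactly 80% does not exceed the 80% bar
```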
