
An Industrial Oriented Mini Project / Summer Internship Report

on
IMAGE CAPTION GENERATOR
Submitted in partial fulfillment of the requirements for the award of the degree of

BACHELOR OF TECHNOLOGY

In

COMPUTER SCIENCE & ENGINEERING (DATA SCIENCE)

By

K. SRUJAN REDDY 20BD1A6729


O. SIVA JYOTHIKA 20BD1A6743
P. VARSHA 20BD1A6745
T. NAGA VAISHNAVI 20BD1A6756

Under the guidance of

Priyanka Saxena
Assistant Professor, Department of CSD

DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING (DATA SCIENCE)

KESHAV MEMORIAL INSTITUTE OF TECHNOLOGY


(AN AUTONOMOUS INSTITUTION)
Accredited by NBA & NAAC, Approved by AICTE, Affiliated to JNTUH.
Narayanaguda, Hyderabad, Telangana-29
2023-24
KESHAV MEMORIAL INSTITUTE OF TECHNOLOGY
(AN AUTONOMOUS INSTITUTION)
Accredited by NBA & NAAC, Approved by AICTE, Affiliated to JNTUH.
Narayanaguda, Hyderabad, Telangana-29

DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING

DATA SCIENCE

CERTIFICATE

This is to certify that this is a bona fide record of the project report titled “Image Caption
Generator”, which is being presented as the Industrial Oriented Mini Project / Summer
Internship report by

1. K. SRUJAN REDDY 20BD1A6729

2. O. SIVA JYOTHIKA 20BD1A6743

3. P. VARSHA 20BD1A6745

4. T. NAGA VAISHNAVI 20BD1A6756

in partial fulfillment for the award of the degree of Bachelor of Technology in Computer Science
and Engineering (Data Science), affiliated to Jawaharlal Nehru Technological University
Hyderabad (JNTUH), Hyderabad.

Internal Guide: Ms. Priyanka Saxena

Head of Department: Mr. Anil Kumar



Vision of KMIT

● To be the fountainhead in producing highly skilled, globally competent engineers.

● To produce quality graduates trained in the latest software technologies and related
tools, striving to make India a world leader in software products and services.

Mission of KMIT
● To establish industry-institute interaction to make students ready for the industry.

● To provide exposure to students on the latest hardware and software tools.

● To promote research-based projects/activities in the emerging areas of technology convergence.

● To encourage and enable students to not merely seek jobs from industry but also to create new enterprises.

● To induce a spirit of nationalism that enables students to understand India's challenges and encourages them to develop effective solutions.

● To support the faculty to accelerate their learning curve to deliver excellent service to students.



Vision of CSD Department

● To produce globally competent graduates to meet modern challenges through contemporary knowledge and moral values, committed to building a vibrant nation.

Mission of CSD Department

● To create an academic environment which promotes the intellectual and professional development of students and faculty.

● To impart skills beyond those prescribed by the university to transform students into well-rounded IT professionals.

● To nurture students to be dynamic, industry-ready and to have multidisciplinary skills, including e-learning, blended learning and remote testing, as individuals and as a team.

● To continuously engage in research and project development and the strategic use of emerging technologies to attain self-sustainability.



PROGRAM OUTCOMES (POs)

1. Engineering Knowledge: Apply the knowledge of mathematics, science, engineering fundamentals, and an engineering specialization to the solution of complex engineering problems.

2. Problem Analysis: Identify, formulate, review research literature, and analyze complex engineering problems, reaching substantiated conclusions using first principles of mathematics, natural sciences, and engineering sciences.

3. Design/Development of Solutions: Design solutions for complex engineering problems and design system components or processes that meet the specified needs with appropriate consideration for public health and safety, and cultural, societal, and environmental considerations.

4. Conduct Investigations of Complex Problems: Use research-based knowledge and research methods, including design of experiments, analysis and interpretation of data, and synthesis of the information to provide valid conclusions.

5. Modern Tool Usage: Create, select, and apply appropriate techniques, resources, and modern engineering and IT tools, including prediction and modeling, to complex engineering activities with an understanding of the limitations.

6. The Engineer and Society: Apply reasoning informed by contextual knowledge to assess societal, health, safety, legal and cultural issues and the consequent responsibilities relevant to professional engineering practice.



7. Environment and Sustainability: Understand the impact of professional
engineering solutions in societal and environmental contexts and demonstrate the
knowledge of and need for sustainable development.

8. Ethics: Apply ethical principles and commit to professional ethics and responsibilities and norms of engineering practice.

9. Individual and Teamwork: Function effectively as an individual, and as a member or leader in diverse teams and in multidisciplinary settings.

10. Communication: Communicate effectively on complex engineering activities with the engineering community and with society at large, such as being able to comprehend and write effective reports and design documentation, make effective presentations, and give and receive clear instructions.

11. Project Management and Finance: Demonstrate knowledge and understanding of engineering and management principles and apply these to one's own work, as a member and leader in a team, to manage projects in multidisciplinary environments.

12. Life-Long Learning: Recognize the need for, and have the preparation and ability to engage in, independent and life-long learning in the broadest context of technological change.



PROGRAM SPECIFIC OUTCOMES (PSOs)

PSO1: An ability to analyze common business functions to design and develop appropriate Information Technology solutions for societal upliftment.

PSO2: Shall have expertise in evolving technologies like Python, Machine Learning, Deep Learning, IoT, Data Science, Full Stack Development, Social Networks, Cyber Security, Mobile Apps, CRM, ERP, Big Data, etc.

PROGRAM EDUCATIONAL OBJECTIVES (PEOs)

PEO1: Graduates will have successful careers in computer-related engineering fields or will be able to successfully pursue advanced higher education degrees.

PEO2: Graduates will provide solutions to challenging problems in their profession by applying computer engineering principles.

PEO3: Graduates will engage in life-long learning and professional development by rapidly adapting to the changing work environment.

PEO4: Graduates will communicate effectively, work collaboratively and exhibit high levels of professionalism and ethical responsibility.



PROJECT OUTCOMES

P1: Accurate identification of objects and content within uploaded images.

P2: Generating contextually relevant captions based on image content.

P3: A user-friendly platform for image captioning on various devices.

P4: Caption generation for global accessibility.

MAPPING PROJECT OUTCOMES WITH PROGRAM OUTCOMES

PO   PO1  PO2  PO3  PO4  PO5  PO6  PO7  PO8  PO9  PO10 PO11 PO12

P1   H    H    M    H    M    L    M    H    H    M    H

P2   M    H    H    M    L    M    L    H    M    M    L

P3   H    M    H    M    L    M    L    M    M    M

P4   H    H    M    M    H    L    M    L    H    M    H    M

L – Low, M – Medium, H – High



MAPPING PROJECT OUTCOMES WITH PROGRAM SPECIFIC OUTCOMES

PSO  PSO1 PSO2

P1   H    M

P2   M    M

P3   L    H

P4   M    M

MAPPING PROJECT OUTCOMES WITH PROGRAM EDUCATIONAL OBJECTIVES

PEO  PEO1 PEO2 PEO3 PEO4

P1   H    L    M    M

P2   L    M    H

P3   M    H    M

P4   H    M    L



DECLARATION

We hereby declare that the results embodied in the dissertation entitled “Image Caption

Generator” have been carried out by us together during the academic year 2023-24 in partial

fulfillment of the award of the B.Tech degree in Computer Science and Engineering (Data

Science) from JNTUH. We have not submitted this report to any other university or

organization for the award of any other degree.

Student Name Roll no.

K. SRUJAN REDDY 20BD1A6729


O. SIVA JYOTHIKA 20BD1A6743
P. VARSHA 20BD1A6745

T. NAGA VAISHNAVI 20BD1A6756



ACKNOWLEDGEMENT

We take this opportunity to thank all the people who have rendered their full support to our
project work. We render our thanks to Dr. B L Malleswari, Principal, who encouraged us
to do the Project.

We are grateful to Mr. Neil Gogte, Founder & Director, Mr. S. Nitin, Director, for
facilitating all the amenities required for carrying out this project.

We express our sincere gratitude to Ms. Deepa Ganu, Director Academic, for providing an
excellent environment in the college.

We are also thankful to Dr. G Narender, Head of the Department, for providing us with
time to make this project a success within the given schedule.

We are also thankful to our guide, Ms. Priyanka Saxena, for her valuable guidance and
encouragement given to us throughout the project work.

We would like to thank the entire CSE (Data Science) Department faculty, who helped us directly
and indirectly in the completion of the project.

We sincerely thank our friends and family for their constant motivation during the project
work.

Student Name Roll no.

K. SRUJAN REDDY 20BD1A6729


O. SIVA JYOTHIKA 20BD1A6743
P. VARSHA 20BD1A6745

T. NAGA VAISHNAVI 20BD1A6756



ABSTRACT

Image captioning is a challenging task that has recently gathered widespread interest. The task
involves generating a concise description of an image in natural language and is currently
accomplished by techniques that combine computer vision (CV), natural language processing
(NLP), and machine learning methods. In this report, we present a model that generates a natural
language description of an image. We use an encoder-decoder architecture. The encoder is the
first phase, where CNN models extract the important features from an image. The decoder phase
then uses RNN models such as LSTM or GRU to translate those features into natural language
sentences. We incorporate an attention mechanism while generating captions: for image
captioning, attention tends to focus on specific regions of the image while generating
descriptions. Adding an attention mechanism allows the model to focus on the parts of the input
that are most important for making a prediction, which helps it make more accurate and
informative predictions.



LIST OF ABBREVIATIONS

CV    Computer Vision

CNN   Convolutional Neural Network

UI    User Interface

LSTM  Long Short-Term Memory

NLP   Natural Language Processing



LIST OF DIAGRAMS

S.No Name of the Diagram Page No.

1. Architecture Diagram of the Project 5

2. Sequence Diagram 18

3. Class Diagram 19

4. State Diagram 20

5. Deployment Diagram 20



LIST OF SCREENSHOTS

S.No Name of Screenshot Page No.

1. Login Page 31

2. Sign Up Page 32

3. Home Screen or Main Page 32

4. Result Page 30



CONTENTS

DESCRIPTION PAGE

CHAPTER - 1 1
1. INTRODUCTION 2
1.1 Purpose of the Project 2
1.2 Problems with Existing Systems 3
1.3 Proposed System 3
1.4 Scope of the Project 4
1.5 Architecture Diagram 5
CHAPTER - 2 6
2. LITERATURE SURVEY 7
CHAPTER - 3 9
3. SOFTWARE REQUIREMENT SPECIFICATION 9
3.1 Introduction to SRS 10
3.2 Role of SRS 10
3.3 Functional Requirements 11
3.4 Non-Functional Requirements 12
3.5 Performance Requirements 14
3.6 Software Requirements 14
3.7 Hardware Requirements 15
3.7.1 Client-Side 15
3.7.2 Server-Side 15
CHAPTER - 4 17
4. SYSTEM DESIGN 18
4.1 Introduction to UML 18
4.2 UML Diagrams 18
4.2.1 Sequence Diagram 18
4.2.2 Class Diagram 19
4.2.3 State Chart Diagram 20
4.2.4 Deployment Diagram 21
4.3 Technologies Used 22
CHAPTER - 5 23
5. IMPLEMENTATION 24
5.1 Setting up connections between Flutter and model 24
5.2 Coding the logic 26
5.3 Connecting the dashboard 29
5.4 Screenshots 31
5.5 UI Screenshots 31
CHAPTER - 6 34
6. SOFTWARE TESTING 35
6.1 Introduction 35
6.1.1 Testing Objectives 35
6.1.2 Testing Strategies 36
6.1.3 System Evaluation 37
6.1.4 Testing New System 37
6.2 Test Cases 39
CONCLUSION 42
FUTURE ENHANCEMENTS 43
REFERENCES 44
BIBLIOGRAPHY 45



CHAPTER-1



1. INTRODUCTION

1.1 Purpose of the Project

The primary purpose of the project is to develop a mobile application that leverages deep
learning models and cloud services to enable users to automatically generate descriptive captions
for their images. This application is designed to fulfill the following objectives:
Enhance User Experience: The project aims to enhance the user experience by providing a user-
friendly and intuitive mobile application that allows users to effortlessly add context and
descriptions to their images. Users can create meaningful captions for personal photos, share
captivating content on social media, and support visually impaired individuals in understanding
image content.
Leverage Deep Learning Models: By incorporating the InceptionV3 pre-trained architecture for
image encoding and LSTM models for caption generation, the project harnesses the power of
deep learning and natural language processing to generate contextually relevant and coherent
captions. This enhances the overall quality of image descriptions and user satisfaction.
User Authentication and Data Management: Utilizing Firebase for user authentication and data
storage, the project ensures secure user account management. Users can register, log in, and
securely store their data and generated captions. Firebase simplifies user management and
provides a reliable cloud-based data storage solution.
Cross-Platform Accessibility: The mobile application, built using Flutter, is designed to be
cross-platform, supporting both Android and iOS devices. This broad accessibility ensures that a
wide range of users can benefit from the image captioning capabilities.
Multilingual Support: The project offers the potential to generate captions in multiple
languages, contributing to its inclusivity and making it suitable for a global audience. This
multilingual support ensures that users from diverse linguistic backgrounds can use the
application effectively.
Innovation and Automation: The project represents an innovative solution to the challenges of
adding context and meaning to visual content. By automating the caption generation process,
it saves users time and effort while providing an engaging and creative means of storytelling.

Educational and Assistive Value: The application can have educational and assistive
applications, helping users learn new languages, understand image content, or assist visually
impaired individuals in comprehending images.

1.2 Problems with Existing Systems

Existing systems for image caption generation often face several challenges and
limitations that the Image Caption Generator project aims to address. Many image caption
generators struggle to provide accurate and contextually relevant captions. They often produce
generic or incorrect descriptions that do not effectively convey the content of the image.

Existing systems may have difficulty understanding the nuances of language and context. They
often fail to capture the subtleties and intricacies of images, leading to vague or inappropriate
captions. Some systems may suffer from slow processing times, especially when dealing with a
large number of images or complex visual content. This can result in frustrating delays for users.
Users may have limited control over the generated captions, with little ability to tailor the
descriptions to their specific needs or preferences.

1.3 Proposed System:

The proposed system for the Image Caption Generator is a mobile application developed using
Flutter, which seamlessly integrates with a deep learning model. This deep learning model
employs Inception v3 as the encoder and LSTM (Long Short-Term Memory) as the decoder. This
innovative combination of technologies aims to address the existing limitations in image caption
generation by delivering a user-friendly, accurate, and versatile solution.
The mobile application, built on the Flutter framework, offers an intuitive and accessible platform
for users to upload their images and receive automatically generated captions. This user-friendly
interface ensures that individuals from various backgrounds and expertise levels can effortlessly
harness the power of image captioning to enhance their visual content.
The core of the system lies in the deep learning model.



Inception v3, a state-of-the-art convolutional neural network (CNN), serves as the encoder,
enabling robust and precise image recognition. It has been meticulously trained to analyze the
visual content of uploaded images,
capturing intricate details and context. This recognition process ensures the accuracy of the
generated captions, as it effectively identifies the elements within the images.
The LSTM (Long Short-Term Memory) network, employed as the decoder, complements the
encoder by converting the visual understanding into coherent natural language captions. LSTM's
ability to handle sequential data and maintain context over time facilitates the generation of
contextually relevant descriptions. This ensures that the captions are not only accurate but also
linguistically sound and context-aware.
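
To make this pairing concrete, the following is a minimal sketch, not the project's actual code, of one common way to wire a CNN encoder to an LSTM decoder in Keras (the so-called merge architecture); vocab_size, max_len, and the 256-unit layer sizes are illustrative assumptions.

```python
# Merge-architecture sketch (illustrative, not the project's actual code).
# vocab_size, max_len, and the 256-unit layer sizes are assumptions.
from tensorflow.keras import Model, layers

vocab_size, max_len = 10000, 35  # assumed vocabulary size and caption length

# Encoder branch: a 2048-d InceptionV3 feature vector, projected to 256-d.
img_input = layers.Input(shape=(2048,))
img_embed = layers.Dense(256, activation="relu")(img_input)

# Decoder branch: the caption-so-far as padded word indices, run through an LSTM.
seq_input = layers.Input(shape=(max_len,))
seq_embed = layers.Embedding(vocab_size, 256, mask_zero=True)(seq_input)
seq_state = layers.LSTM(256)(seq_embed)

# Merge both contexts and predict the next word of the caption.
merged = layers.add([img_embed, seq_state])
hidden = layers.Dense(256, activation="relu")(merged)
output = layers.Dense(vocab_size, activation="softmax")(hidden)

model = Model(inputs=[img_input, seq_input], outputs=output)
model.compile(loss="sparse_categorical_crossentropy", optimizer="adam")
```

At training time, the model is fed (image features, partial caption) pairs and learns to predict the next word; at inference time the same model is called repeatedly, as described in Chapter 3.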

1.4 Scope of the Project

Improved Descriptive Quality: Image caption generators with attention mechanisms have a
broader scope in terms of generating high-quality, contextually relevant captions for a wide range
of images. The attention mechanism allows the model to focus on different regions of the image
while generating the caption, resulting in more accurate and detailed descriptions.
Contextual Awareness: These models excel in capturing contextual information within an image.
They can better understand the relationships between objects and scenes, and they can produce
captions that reflect these relationships effectively.
Multimodal Understanding: Image caption generators with attention can seamlessly integrate
information from both the visual (image) and textual (caption) modalities. This makes them well-
suited for applications where cross-modal understanding is crucial.
Adaptability: They can be fine-tuned and adapted to specific domains or applications, allowing
for customization and improved performance in specialized areas.
Improved Training Efficiency: Attention mechanisms often reduce the reliance on large-scale
training datasets because the model can focus on relevant image regions. This can make training
more efficient.
Variety in Captions: The attention mechanism can promote the generation of diverse captions for
the same image, adding richness and variety to the output.
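
As an illustration of the attention mechanism discussed above, here is a compact sketch of additive (Bahdanau-style) attention over image regions, in the spirit of the standard TensorFlow captioning tutorial; the region count and unit sizes are assumptions, not the project's exact configuration.

```python
# Additive (Bahdanau-style) attention sketch over image regions.
# Shapes assumed: features (batch, regions, embed_dim), hidden (batch, units).
import tensorflow as tf
from tensorflow.keras import layers

class BahdanauAttention(layers.Layer):
    def __init__(self, units):
        super().__init__()
        self.W1 = layers.Dense(units)  # projects image region features
        self.W2 = layers.Dense(units)  # projects the decoder hidden state
        self.V = layers.Dense(1)       # scalar relevance score per region

    def call(self, features, hidden):
        hidden_t = tf.expand_dims(hidden, 1)                 # (batch, 1, units)
        scores = self.V(tf.nn.tanh(self.W1(features) + self.W2(hidden_t)))
        weights = tf.nn.softmax(scores, axis=1)              # one weight per region
        context = tf.reduce_sum(weights * features, axis=1)  # weighted region sum
        return context, weights
```

At each decoding step, the context vector (a weighted sum of region features) is fed to the LSTM alongside the previous word, which is what lets the caption focus on different image regions for different words.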



1.5 Architecture Diagram



CHAPTER-2



2. Literature Survey

● Multimodal Recurrent Neural Network (m-RNN):

Language Model Part:

Objective: This component focuses on handling textual information, specifically generating


descriptive sentences or captions.
Recurrent Layers: To capture the sequential and temporal dependencies in sentences, recurrent
layers are employed. These layers allow the model to remember and consider the context of
words that have come before the current word in a sentence.
Vision Part:

Objective: This part deals with visual information, particularly the extraction of features from
images.
Deep Convolutional Neural Network (CNN): A deep CNN is used to process images. CNNs
are known for their ability to automatically learn and extract hierarchical features from images.
In this context, CNN takes an image as input and transforms it into a fixed-size feature
representation.

Multimodal Part:

Objective: The multimodal part bridges the gap between the textual and visual modalities,
enabling the model to jointly consider both types of information.
One-layer Representation: This component connects the language model and the deep CNN. It
essentially combines the output from the language model and the vision part into a unified
representation.
The BLEU score of the above model is around 37.6, and its main disadvantages are:

Limited Contextual Information: m-RNNs typically use recurrent neural networks (RNNs) to
capture temporal dependencies in sentences. However, RNNs have limitations in modeling
long-range dependencies effectively. Attention mechanisms, on the other hand, can focus on
specific regions of the image and relevant parts of the input sentence, providing richer
contextual information.


Difficulty Handling Rare Concepts: Rare or uncommon concepts or objects in images may not be
well handled by multimodal RNNs.

● Show and Tell: A Neural Image Caption Generator:

The model takes two main types of input:

Image Features (from a CNN): The model first processes the input image using a Convolutional
Neural Network (CNN). CNNs are well-suited for extracting hierarchical features from images.
In this case, CNN converts the image into a fixed-length vector representation that captures
important visual information.
Words (from a Vocabulary): The model generates sentences word by word. It takes the previous
words in the sentence (if any) and combines them with the image features to predict the next
word. The words are represented as vectors using an embedding model.
The core of the NIC model is a type of recurrent neural network (RNN) known as Long Short-
Term Memory (LSTM).
The BLEU score of this model is around 42.1.

The main disadvantage of this particular model is its complexity: the NIC model consists of a
Convolutional Neural Network (CNN) for image feature extraction and a Long Short-Term
Memory (LSTM) network for text generation.

● BUTD (Bottom-Up and Top-Down model):

Top-down visual attention mechanisms have been used extensively in image captioning and to
enable deeper image understanding through fine-grained analysis and even multiple steps of
reasoning. In this work, we propose a combined bottom-up and top-down attention mechanism
that enables attention to be calculated at the level of objects and other salient image regions.
This is the natural basis for attention to be considered.
Applying this approach to image captioning, the results on the MSCOCO test server establish a
new state of the art for the task, achieving a BLEU score of 45.7.



CHAPTER-3



3. SOFTWARE REQUIREMENT SPECIFICATION

3.1 Introduction to SRS:

An SRS, or Software Requirements Specification, is a crucial document in the software


development process that serves as a foundation for planning, designing, developing, and testing
a software application. It outlines the detailed functional and non-functional requirements of the
software, providing a comprehensive understanding of what the software should do, how it should
behave, and what its constraints and limitations are. The SRS document is typically created
during the initial phase of a software project and serves as a reference for all stakeholders,
including developers, testers, project managers, and clients.

The Image Caption Generator is a software application that utilizes a sequence-to-sequence


(seq2seq) model with a Convolutional Neural Network (CNN) as an encoder and a Long
Short-Term Memory (LSTM) network as a decoder to generate textual captions for images. The project
also includes a mobile application developed using Flutter, and user data is stored in Firebase for
authentication and registration purposes.

3.2 Role of SRS:

The Software Requirements Specification (SRS) plays a crucial role in the software development
process, serving as a cornerstone for the entire project. Its primary role is to provide a detailed
and comprehensive description of what the software system is supposed to accomplish, how it
should behave, and what its constraints and limitations are.

The SRS serves as the foundation for designing, developing, and testing the software system.



It guides the software development team by specifying what needs to be built and how it should
function.

It clearly defines the scope of the software project by specifying what the software will and will
not do. This helps manage client expectations and avoid scope creep.

The SRS acts as a reference document for managing changes to the project. Any modifications to
the requirements can be documented and evaluated for their impact on the project's timeline and
budget.

Software architects and designers use the SRS as a starting point for designing the system's
architecture, data structures, and user interfaces.

3.3 Functional Requirements:

Image Upload:
Users should be able to upload an image from their device's gallery or capture one using the device's
camera.
Image Preprocessing:
The system should preprocess the uploaded image, ensuring it is compatible with the Inception v3
model. This may involve resizing, normalizing, and formatting the image appropriately.
Inception v3 Integration:
The system should integrate the Inception v3 model to perform image recognition. It should send
the preprocessed image to the model for analysis.
Caption Generation:
Upon analyzing the image, the system should generate a natural language caption that describes the
contents of the image.



The caption should be coherent and contextually relevant.
Display Captions:
The generated caption should be displayed to the user in a user-friendly format on the Flutter
app's user interface.
Caption Sharing:
Users should have the option to share the generated captions on social media or via other
communication channels.
Save Captions:
Users should be able to save the generated captions, allowing them to review or retrieve captions
for previously analyzed images.
Language Support:
The system should support multiple languages for generating captions, and users may have the
option to choose their preferred language.
Feedback Mechanism:
Users should be able to provide feedback on the generated captions to help improve the system's
accuracy.
Accessibility:
The app should be designed to ensure accessibility features, making it usable by people with
disabilities.
Offline Mode:
The app should have an option to perform image captioning in an offline mode when an internet
connection is not available.
User Authentication:
If required, users should be able to create accounts or log in to personalize their experience and
access saved captions from multiple devices.

3.4 Non-functional Requirements:

Performance:



The system should provide image recognition and caption generation within a reasonable
response time, typically under a few seconds.
The system should support multiple concurrent users without a significant decrease in response
time.
The application should be optimized for efficiency to minimize resource consumption.
Scalability:
The system should be designed to handle an increasing number of users and images as the user
base grows.
Availability:
The system should aim for high availability, with minimal downtime for maintenance or updates.
Define a specific uptime percentage, such as 99.9% availability.
Reliability:
The image caption generator should provide accurate and reliable image recognition and caption
generation results.
It should handle potential errors gracefully and recover from failures without data loss.
Security:
The system should implement data encryption during transmission and storage to protect user
images and captions.
Implement user authentication and authorization to control access to sensitive user data.
Comply with relevant data protection and privacy regulations.
Usability:
The user interface should be user-friendly, with clear instructions and intuitive interactions.
Ensure accessibility features for users with disabilities, such as screen readers and voice
commands. Compatibility:
The Flutter app should be compatible with various mobile and web platforms (iOS, Android, web
browsers) and screen sizes.
Language Support:



The system should support multiple languages for user interfaces and generate captions.
Offline Capability:
The app should have the ability to perform image recognition and generate captions in an offline
mode, preserving functionality when there is no internet connection.

3.5 Performance Requirements:

Performance requirements for an image caption generator using Flutter and the Inception v3
model are essential to ensure that the system operates efficiently and meets user expectations.

Response Time:
The system should generate captions for images within a maximum response time of X seconds
(e.g., 2 seconds) under normal operating conditions.
Image Processing Time:
The Inception v3 model should process and recognize an image within a maximum time of X
seconds.
Concurrent Users:
The system should support a minimum of X concurrent users without a significant degradation in
response time. Define the expected level of concurrency.
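
As a rough illustration of how such response-time requirements could be checked, the sketch below fires concurrent requests at the captioning endpoint and reports latencies; the URL, image path, and concurrency level are assumptions.

```python
# Concurrency probe sketch: time N concurrent caption requests.
# URL, image path, and the simulated user count are illustrative assumptions.
import time
from concurrent.futures import ThreadPoolExecutor

import requests

URL = "http://localhost:5000/caption"  # assumed endpoint (see Chapter 5)

def one_request(path="sample.jpg"):
    start = time.perf_counter()
    with open(path, "rb") as f:
        requests.post(URL, files={"image": f})
    return time.perf_counter() - start

if __name__ == "__main__":
    with ThreadPoolExecutor(max_workers=10) as pool:  # 10 simulated users
        latencies = list(pool.map(one_request, ["sample.jpg"] * 50))
    print(f"mean {sum(latencies) / len(latencies):.2f}s, max {max(latencies):.2f}s")
```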

3.6 Software Requirements:

● Flutter to develop the mobile application.

● InceptionV3 pretrained architecture to encode the image.

● LSTM model to generate captions.

● Firebase to store user data for authentication and registration purposes.

Inception V3, a deep convolutional neural network (CNN), serves as the image encoder in image
caption generators. Developed by Google Research, Inception V3 excels at extracting rich visual
features from input images,



thanks to its factorized convolution approach that efficiently reduces the number of trainable
parameters. These features are essential for understanding and representing the visual content of
images, making Inception V3 a cornerstone in the model's architecture.

Complementing this, Long Short-Term Memory (LSTM) networks are employed as decoders for
generating textual descriptions. LSTMs, part of the recurrent neural network family, address the
vanishing gradient problem often encountered in standard RNNs. This makes them well-suited for
sequential data generation tasks. In image captioning, LSTMs sequentially produce words, taking
into account the context from both the image features encoded by Inception V3 and the
previously generated words. The synergy between Inception V3 and LSTM decoders creates a
powerful model that bridges the gap between visual and textual information, generating coherent
and contextually relevant captions for images.
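
The sequential word-by-word decoding described above can be sketched as a simple greedy loop. This assumes a trained merge-style `model` (such as the sketch in Chapter 1), a Keras `tokenizer` fitted on captions wrapped in startseq/endseq markers, and `photo_features` of shape (1, 2048); none of these names are from the project's actual code.

```python
# Greedy decoding sketch (illustrative, not the project's actual code).
import numpy as np
from tensorflow.keras.preprocessing.sequence import pad_sequences

def greedy_caption(model, tokenizer, photo_features, max_len=35):
    caption = "startseq"
    for _ in range(max_len):
        # Encode the caption-so-far as padded word indices.
        seq = tokenizer.texts_to_sequences([caption])[0]
        seq = pad_sequences([seq], maxlen=max_len)
        # Predict the next word from image features plus previous words.
        probs = model.predict([photo_features, seq], verbose=0)
        word = tokenizer.index_word.get(int(np.argmax(probs)))
        if word is None or word == "endseq":
            break
        caption += " " + word
    return caption.replace("startseq", "").strip()
```

Beam search is a common refinement of this loop, but greedy decoding is enough to show how the LSTM conditions each word on both the image features and the previously generated words.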

3.7 Hardware Requirements:

3.7.1 Client-Side (Flutter App):

Mobile Devices (iOS/Android):


Modern smartphones and tablets with sufficient processing power are recommended for running
the Flutter app.
For iOS devices, consider devices with at least an A9 chip or newer.
For Android devices, devices with quad-core processors or better are preferable.

Camera (if image capture is a feature):


Devices with a camera for capturing images. Most modern smartphones and tablets have cameras.

3.7.2 Server-Side (Backend):

The server-side infrastructure should be capable of hosting the backend application, including the
Inception v3 model and image recognition processes.



The server should have a powerful CPU for running the image recognition model efficiently.
Consider a CPU with multiple cores to support concurrent image recognition requests.
Sufficient RAM is essential for loading and running the Inception v3 model. At least 8GB of
RAM is recommended.
To accelerate image recognition, you can use a server with a compatible GPU; graphics
processing units can significantly speed up deep learning tasks.
Storage requirements depend on the number of images to be stored and the size of the Inception
v3 model.
Use fast and reliable storage devices to ensure efficient data access.



CHAPTER-4



4. System Design

4.1 Introduction to UML:

Unified Modeling Language (UML) is a standardized modeling language used in the field of
software engineering to visually represent and document the design and structure of complex
systems. UML provides a set of diagrams and symbols that allow software developers, designers,
and stakeholders to communicate and understand the architecture, behavior, and relationships
within a software system. UML was developed by the Object Management Group (OMG) and has
become a widely accepted and essential tool in the software development process.

UML serves as a powerful tool for visualizing, documenting, and communicating the design and

architecture of software systems. It enables better collaboration among project stakeholders,

leading to more effective software development and maintenance. UML diagrams are an integral
part of the software development process, from initial design and modeling to system
implementation and maintenance.

4.2 UML Diagrams:

4.2.1 SEQUENCE DIAGRAM:



4.2.2 Class Diagram:



4.2.3 State Diagram:



4.2.4 Deployment Diagram:



4.3 Technologies Used:

● Flutter to develop the mobile application.

● InceptionV3 pretrained architecture to encode the image.

● LSTM model to generate captions.

● Firebase to store user data for authentication and registration purposes.

You'll use Flutter, a popular open-source framework for building natively compiled applications
for mobile, web, and desktop from a single codebase.
Flutter allows you to create a cross-platform mobile application that can run on both Android and
iOS devices, ensuring a broad reach.
InceptionV3 is a pre-trained deep learning model used for image recognition. It can analyze and
encode images effectively, extracting meaningful features from them.
In your application, InceptionV3 will take user-uploaded images and provide encoded
representations, which will be used for caption generation.
Long Short-Term Memory (LSTM) is a type of recurrent neural network (RNN) that is suitable
for sequence-to-sequence tasks like natural language processing.
You'll use LSTM to generate textual captions for the images. The encoded image data from
InceptionV3 will serve as input to the LSTM model.
Firebase is a comprehensive mobile and web application development platform provided by
Google.
You'll use Firebase for user authentication, enabling users to create accounts, log in, and secure
their data. Firebase Authentication simplifies the authentication process.
Firebase can also serve as a backend for storing user data, images, and generated captions.
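
On the Python backend, the Firebase ID tokens issued to the Flutter app can be verified with the official firebase_admin SDK. The sketch below is illustrative; the service-account key path is a placeholder.

```python
# Sketch: verifying a Firebase ID token on the backend with the official
# firebase_admin SDK. The service-account key path is a placeholder.
import firebase_admin
from firebase_admin import auth, credentials

cred = credentials.Certificate("serviceAccountKey.json")  # placeholder path
firebase_admin.initialize_app(cred)

def verify_user(id_token: str) -> str:
    """Return the caller's Firebase UID; raises if the token is invalid."""
    decoded = auth.verify_id_token(id_token)
    return decoded["uid"]
```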



CHAPTER-5



5. IMPLEMENTATION

5.1 Setting up connections between Flutter and the deep learning model:

First, you need to have a deep learning model that performs the specific task you want. You
can build your model using popular deep learning frameworks like TensorFlow. Once the
model is trained, save it in a format that can be loaded on the server.

You'll need a backend server that hosts your deep learning model. Common technologies for
building backend servers include Flask (Python), Express (Node.js), Django (Python), or FastAPI
(Python). Your server will be responsible for receiving requests from the Flutter app, running
predictions using the deep learning model, and sending back the results.

Create an API endpoint on your backend server to accept requests from the Flutter app. This
endpoint should allow the app to send data to the server for processing. For example, you might
use HTTP endpoints or RESTful APIs.

Ensure that the data you send between your Flutter app and the backend is in a format that both
can understand. JSON is a common choice for this purpose. You might need to serialize your data
on the app side and deserialize it on the server side.

On the backend server, implement the logic to handle the incoming requests. This includes
deserializing the data, using your deep learning model to make predictions, and sending
the results back to the Flutter app.
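
A minimal sketch of such an endpoint using Flask is shown below. The helpers extract_features() and greedy_caption(), and the caption_model module they are imported from, are hypothetical stand-ins for the project's own code (see the sketches in Chapters 3 and 5).

```python
# Minimal Flask endpoint sketch for the request/response flow described above.
from flask import Flask, request, jsonify
from PIL import Image

# Hypothetical project module providing the trained model and helpers.
from caption_model import extract_features, greedy_caption, model, tokenizer

app = Flask(__name__)

@app.route("/caption", methods=["POST"])
def caption():
    if "image" not in request.files:
        return jsonify(error="no image uploaded"), 400
    img = Image.open(request.files["image"].stream).convert("RGB")
    features = extract_features(img)                   # encode with InceptionV3
    text = greedy_caption(model, tokenizer, features)  # decode with LSTM
    return jsonify(caption=text)

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)
```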

Deploy your backend server to a hosting platform. Common choices include AWS,
Google Cloud, Heroku, or your own server. Make sure the server is accessible via a public
URL.

Test the connection between the Flutter app and the backend to ensure that data is
transferred correctly, and predictions are received as expected. Debug any issues that may
arise.
By following these steps, you can set up connections between your Flutter app and the deep
learning model for image recognition and caption generation. This will enable your app to
provide meaningful captions for uploaded images.

Creating a Flutter application that takes an image, sends it to an Inception V3 model for image
feature extraction, generates a caption using an LSTM decoder, and then displays the image with
the generated caption.

Login Page: Your Flutter application starts with a login page where users can authenticate
themselves if required. This page typically includes fields for entering a username and password,
as well as buttons for signing in. The login page can be implemented using Flutter's TextFormField
widgets for input and ElevatedButton for the sign-in action. Users need to log in to access
the caption generation feature.

Image Selection: After logging in, the user can navigate to a screen where they can select an
image. You can use Flutter's image_picker package to implement image selection. Users can
choose to either pick an image from the gallery or capture a picture using the device's camera.
This screen includes buttons to access the gallery or camera, and a preview of the selected
image.

Image Preprocessing: Once an image is selected or captured, it needs to be preprocessed before


sending it to the Inception V3 model. Preprocessing may involve the following steps:
Resize the image: Inception V3 may have specific input size requirements, so the selected image
should be resized accordingly.
Normalize pixel values: Scale the pixel values of the image so they fall within the range the
model expects (for InceptionV3, roughly -1 to 1).
Encode the image: Use the Inception V3 model to extract features from the image. You can either
use a pre-trained Inception V3 model or fine-tune it on your specific dataset.
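
A sketch of these preprocessing and encoding steps with tf.keras's InceptionV3 utilities might look as follows; note that InceptionV3 expects 299x299 inputs and its preprocess_input scales pixels to roughly [-1, 1].

```python
# Preprocessing/encoding sketch using tf.keras's InceptionV3 utilities.
import numpy as np
from tensorflow.keras.applications.inception_v3 import InceptionV3, preprocess_input

# Headless InceptionV3: a pooled 2048-d feature vector instead of class scores.
encoder = InceptionV3(weights="imagenet", include_top=False, pooling="avg")

def extract_features(pil_image):
    """Resize to InceptionV3's 299x299 input, scale pixels, and encode."""
    img = pil_image.convert("RGB").resize((299, 299))
    x = np.expand_dims(np.asarray(img, dtype=np.float32), axis=0)
    x = preprocess_input(x)               # scales pixels to roughly [-1, 1]
    return encoder.predict(x, verbose=0)  # shape: (1, 2048)
```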
Caption Generation: The encoded image features are passed to the LSTM decoder to generate a
caption. The LSTM generates words one by one, considering both the encoded image features
and the previously generated words. This process continues until an end-of-sentence token is
generated.
The generated caption is a sequence of words.
Display the Image with Caption: The Flutter application then displays the original image along
with the generated caption. You can create a new screen or overlay a UI widget on the image to
show the caption. Utilize Flutter's Image widget for displaying the image and Text widget to
show the caption.

5.2 Coding the logic:
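
The core request/response logic can be illustrated with a short sketch: a test client that posts an image to the backend endpoint from Section 5.1 and prints the returned caption. The URL and multipart field name are assumptions.

```python
# Test-client sketch: post an image to the backend and print the caption.
# The URL and multipart field name are assumptions.
import requests

def request_caption(image_path, url="http://localhost:5000/caption"):
    with open(image_path, "rb") as f:
        resp = requests.post(url, files={"image": f})
    resp.raise_for_status()
    return resp.json()["caption"]

if __name__ == "__main__":
    print(request_caption("sample.jpg"))  # placeholder image path
```

In the Flutter app itself, the equivalent logic is an HTTP multipart POST from the device, with the JSON response decoded into the caption shown on screen.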



5.3 Connecting the dashboard:
Connecting a dashboard in the context of a software application typically involves creating an
interface through which users can interact with the system, access data, and perform various
actions. The dashboard serves as a central hub for monitoring, managing, and visualizing
information.

Decide on the design and layout of your dashboard. Consider what information you want to
display, how you want to visualize the data, and what interactions you want to enable. You can
use UI frameworks like Flutter to create the dashboard.
Your dashboard will need to communicate with the backend server that hosts the deep learning
model.



You can use HTTP requests to send data to the backend and receive predictions or results.
Consider how data will be passed between the dashboard and the backend API.

Build components in your dashboard to display the data and predictions received from the
backend. You might use charts, graphs, tables, or any other appropriate visualization methods to
present the results to users.

Implement user interactions on the dashboard. Users should be able to input data or customize
parameters for the deep learning model. This may involve adding input forms, sliders, or buttons
that allow users to interact with the model.

If your deep learning model provides real-time predictions or updates, consider implementing a
mechanism for real-time data visualization and updates on the dashboard. Technologies like
WebSockets can be helpful for real-time communication.
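
A minimal sketch of such real-time pushing with the Python websockets library is shown below; the host, port, and message payload are assumptions, not the project's actual configuration.

```python
# Real-time push sketch using the Python `websockets` library; the host,
# port, and payload are assumptions.
import asyncio
import json

import websockets

async def push_captions(websocket):
    # In a real system the payload would come from the captioning pipeline;
    # here a placeholder message is sent once per second.
    while True:
        await websocket.send(json.dumps({"image_id": "demo", "caption": "placeholder"}))
        await asyncio.sleep(1)

async def main():
    async with websockets.serve(push_captions, "localhost", 8765):
        await asyncio.Future()  # run until cancelled

if __name__ == "__main__":
    asyncio.run(main())
```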

If your dashboard should only be accessible to authorized users, implement user authentication
and authorization mechanisms. This ensures that only authorized individuals can access and
interact with the dashboard.

Test the dashboard to ensure that it correctly communicates with the backend and displays the
results as intended. Debug any issues that may arise during testing.

Deploy the dashboard to the intended environment, whether it's a web server, cloud platform, or a
local network.
Ensure that the deployment environment can support the technologies used in your dashboard.

Provide training to users on how to use the dashboard effectively.


Create documentation or help resources to assist users in understanding the dashboard's features
and functionality.

Continuously gather feedback from users to identify areas for improvement.


Use feedback to iterate on the dashboard, adding new features and enhancing the user experience.



5.4 Screenshots:

5.5 UI Screenshots:



CHAPTER-6



6. Software Testing

6.1 Introduction

6.1.1 Testing Objectives:

Testing objectives for the Image Caption Generator project, which uses Flutter with the
Inception v3 encoder and LSTM decoder, are crucial to ensure the system's functionality,
reliability, and performance.

Functional Testing:
Objective: Verify that the system functions according to the specified requirements.
Test Use Cases: Test image recognition, caption generation, user authentication, and caption saving
functionality.
Ensure that all features work as expected, including image upload, caption generation, and user
account management.
Performance Testing:
Objective: Evaluate the system's performance under different load conditions.
Test Use Cases: Measure response times, throughput, and scalability under normal and peak usage
scenarios.
Ensure the system can handle a predefined number of concurrent users and process a specific
number of image recognition requests per minute.
Compatibility Testing:
Objective: Ensure that the system functions correctly on different platforms and devices.
Test Use Cases: Test the Flutter application on various mobile devices (iOS and Android) and web
browsers.
Confirm that the system is compatible with different screen sizes and operating systems.
Language and Localization Testing:
Objective: Verify that the system supports multiple languages and localization.
Test Use Cases: Test the application in different languages and ensure that captions can be
generated in various supported languages.



Confirm that the user interface elements are appropriately localized.
Offline Mode Testing:
Objective: Ensure that the system's offline functionality works as expected.
Test Use Cases: Test the application's ability to perform image recognition and caption generation
in offline mode.
Confirm that the system can handle these tasks without an internet connection.

6.1.2 Testing Strategies:

Functional Testing:
Unit Testing: Test individual components, such as the image recognition module and caption
generation module, in isolation to verify their correctness.
Integration Testing: Ensure that different components of the system work seamlessly together,
including the integration between the Flutter app and the deep learning model.
System Testing: Verify the end-to-end functionality of the system, including user interactions,
image upload, recognition, and caption generation.
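
As a minimal illustration of the unit-testing strategy above, the following pytest sketch checks the image-encoding and caption-generation helpers in isolation; the caption_model module and its members are hypothetical stand-ins for the project's own code.

```python
# pytest sketch for the unit-testing strategy above. The caption_model
# module and its members are hypothetical stand-ins for the project's code.
from PIL import Image

from caption_model import extract_features, greedy_caption, model, tokenizer  # hypothetical

def test_feature_vector_shape():
    # A blank 299x299 image should still yield one 2048-d feature vector.
    img = Image.new("RGB", (299, 299))
    assert extract_features(img).shape == (1, 2048)

def test_caption_is_nonempty_string():
    # The decoder should always return a non-empty caption string.
    img = Image.new("RGB", (299, 299))
    text = greedy_caption(model, tokenizer, extract_features(img))
    assert isinstance(text, str) and text
```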
Performance Testing:
Load Testing: Simulate a high volume of concurrent users and image recognition requests to
evaluate the system's response time and scalability.
Stress Testing: Push the system beyond its normal capacity to identify its breaking points and
measure performance under extreme conditions.
Scalability Testing: Assess how the system handles increasing loads by adding resources and
verifying its ability to scale.
Compatibility Testing:
Platform and Device Testing: Test the Flutter app on various mobile devices, operating systems
(iOS and Android), and web browsers to ensure consistent functionality.
Cross-Browser Testing: Ensure that the web version of the app works correctly on different
browsers and platforms.
Language and Localization Testing:



Multilingual Testing: Validate that the system correctly displays content in different languages
and supports caption generation in various languages.
Localization Testing: Confirm that the user interface elements are appropriately localized,
considering date formats, currencies, and cultural differences.

6.1.3 System Evaluation

System evaluation for the Image Caption Generator project involves assessing the system's
performance, functionality, usability, and overall effectiveness. It aims to determine whether the
project has achieved its objectives and if it meets the requirements and expectations of both the
developers and end-users.
Verify that the system successfully recognizes images and generates contextually relevant
captions.
Ensure that all the functional requirements, such as image upload, caption saving, user
authentication, and multilingual support, are met.
Assess the system's response time for image recognition and caption generation under various
load conditions, ensuring it meets performance requirements.
Confirm that the system scales as expected to accommodate growing numbers of users and image
processing requests.
Test the system on various platforms, devices, and web browsers to ensure consistent
functionality and appearance.
Address any compatibility issues that may arise in different environments or screen sizes.
Validate that the system correctly displays content in multiple languages and supports accurate
caption generation in various languages.
Confirm that the localization of the user interface elements is culturally appropriate.

6.1.4 Testing New System



Testing a new system, such as the Image Caption Generator project, is a critical phase in
the software development process. It involves systematically verifying that the system
meets its requirements, functions correctly, and performs reliably.

Begin by reviewing the system's requirements, including functional and non-functional


requirements, to establish a clear understanding of what the system should do. Develop a
comprehensive test plan that outlines the testing objectives, scope, resources, and schedule.
Identify the types of testing to be conducted.

Create test cases for various aspects of the system, including positive and negative scenarios,
boundary cases, and exceptional conditions. Set up the testing environment, including the
hardware, software, and data needed to execute the test cases.

Begin with unit testing, which focuses on testing individual components or functions in isolation.
Verify that each unit works as expected and correct any defects identified.

Conduct integration testing to ensure that the units or components work together seamlessly.
Test the interactions between modules and validate data flows.
Perform system testing to validate the end-to-end functionality of the entire system.
Evaluate the system's compliance with all specified requirements.
Test the system on various platforms, devices, and web browsers to ensure compatibility.
Confirm that the system functions correctly and looks consistent across different environments.
Validate that the system correctly displays content in multiple languages and supports accurate
localization.
Ensure that user interface elements are culturally appropriate.

6.2 Test Cases



Functional Test Case - Image Recognition and Caption Generation:
Test Objective: To ensure that the system accurately recognizes images and generates
contextually relevant captions.
Test Scenario: Upload an image containing various objects.
Test Steps:
Open the application.
Select an image to upload.
Verify that the image is correctly recognized.
Confirm that the generated caption accurately describes the objects in the image.
Repeat the test with different images, including those with complex scenes.

Performance Test Case - Response Time Under Load:
Test Objective: To measure the system's response time for image recognition and caption
generation under a simulated heavy load.
Test Scenario: Simulate a scenario with a high volume of concurrent users and image recognition
requests.
Test Steps:
Simulate concurrent user connections to the system.
Upload multiple images simultaneously.
Measure the response time for image recognition and caption generation.
Ensure that the system meets the specified response time requirements under load.

Usability Test Case - User Interface Evaluation:
Test Objective: To assess the user-friendliness of the application's interface and navigation.
Test Scenario: Involve real users in usability testing to provide feedback.
Test Steps:
Engage representative users in the test.
Ask users to upload images and generate captions.
Collect feedback on the intuitiveness of the user interface and navigation.
Address any usability issues identified during testing.

Security Test Case - Authentication and Authorization:
Test Objective: To verify that user authentication and authorization mechanisms are robust.
Test Scenario: Attempt to access the system without proper authentication.
Test Steps:



Attempt to access restricted features without logging in.
Verify that the system enforces proper authentication.
Test that users can access different system features based on their roles.
Confirm that unauthorized users are denied access to sensitive functionalities.

Compatibility Test Case - Cross-Platform Testing:
Test Objective: To ensure that the system works consistently across different platforms and
devices.
Test Scenario: Test the application on various mobile devices (iOS and Android).
Test Steps:
Install and run the Flutter app on different mobile devices with varying screen sizes and operating
systems.
Access the web version of the application using various web browsers.
Verify that the system's functionality and appearance are consistent on all tested platforms.



CONCLUSION

Concluding the Image Caption Generator project, it is evident that the combination of Flutter,
Inception v3, and LSTM has resulted in a sophisticated and user-friendly system that fulfills its
primary objectives. The project aimed to bridge the gap between images and language, enhancing
the storytelling potential of visual content, and it has largely succeeded in doing so. Through
rigorous testing and evaluation, several key conclusions can be drawn.
First and foremost, the system has proven its functionality, successfully recognizing images and
generating contextually relevant captions. The combination of Inception v3's accurate image
recognition and LSTM's natural language processing capabilities has resulted in a reliable and
accurate caption generation process. Users can confidently rely on the system to describe the
contents of their images effectively.
The conclusion of the project also acknowledges the ongoing nature of software development.
Regular monitoring, user feedback, and continuous improvement processes will be essential to
keep the system current and effective. User acceptance testing has been crucial in ensuring that
the system aligns with user expectations and business requirements, and this feedback-driven
approach will continue to guide the system's evolution.

Overall, the Image Caption Generator project has successfully achieved its primary objectives,
offering a reliable, user-friendly, and scalable solution for image recognition and caption
generation. As it continues to evolve based on user feedback and emerging technologies, the
project holds the promise of further enriching the storytelling potential of visual content for a
wide range of users and applications.



FUTURE ENHANCEMENTS

The introduction of real-time image captioning would be a significant step forward. By


leveraging the power of edge computing and advanced deep learning models, the system could
generate captions for images as they are being captured. This feature would be particularly
valuable for live broadcasts, presentations, and applications that require instant image analysis
and description.

Expanding language support beyond the currently supported languages opens the door to a global
audience. Future enhancements could involve incorporating additional languages, dialects, and
regional variations, making the system more inclusive and accessible to users from diverse
linguistic backgrounds.

Incorporating collaboration and content-sharing features can encourage users to create and
share content with others more easily. The ability to collaborate on captioned images,
co-editing, and content-sharing functionalities can expand the system's utility in various
collaborative settings.



REFERENCES

1. O. Vinyals, A. Toshev, S. Bengio and D. Erhan, “Show and tell: A neural image caption
generator,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition,
pp. 3156-3164, 2015.

2. J. Mao et al., “Deep captioning with multimodal recurrent neural networks (m-RNN),” arXiv
preprint arXiv:1412.6632, 2014.

3. C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens and Z. Wojna, “Rethinking the inception
architecture for computer vision,” in Proceedings of the IEEE Conference on Computer Vision
and Pattern Recognition, pp. 2818-2826, 2016.



BIBLIOGRAPHY

1. Flutter. (2021). “Flutter: Open-source UI software development kit.” Retrieved from
https://flutter.dev/.

2. TensorFlow. (2021). “TensorFlow: An open-source machine learning framework.” Retrieved
from https://www.tensorflow.org/.

3. Keras. (2021). “Keras: An open-source deep learning API.” Retrieved from https://keras.io/.

4. Firebase. (2021). “Firebase: A comprehensive app development platform.” Retrieved from
https://firebase.google.com/.

5. Hochreiter, S., & Schmidhuber, J. (1997). “Long short-term memory.” Neural Computation,
9(8), 1735-1780.

6. Google Research. (2021). “InceptionV3: Classifying ImageNet in TensorFlow.” Retrieved
from https://github.com/tensorflow/models/tree/master/research/inception.

