SEMINAR REPORT
submitted by
AAE20CS015
to
of
Bachelor of Technology
In
Computer Science and Engineering
January, 2023
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
AL AZHAR COLLEGE OF ENGINEERING AND TECHNOLOGY
THODUPUZHA 685 605
CERTIFICATE
This is to certify that the report entitled ‘COMPUTER VISION’ submitted by Mr.
Mohammad Razin Rasheed (AAE20CS015) to the APJ Abdul Kalam Technological
University in partial fulfilment of the requirements for the award of the Degree of Bachelor of
Technology in Computer Science and Engineering is a bonafide record of the seminar carried
out by him under my guidance and supervision. This report in any form has not been
submitted to any other University or Institute for any purpose.
I, MOHAMMED RAZIN RASHEED, hereby declare that this seminar report entitled
COMPUTER VISION is my bonafide work, carried out under the supervision of Kala
O S, Head of the Department. I declare that, to the best of my knowledge, the work reported
herein does not form part of any other project report or dissertation on the basis of which a
degree or award was conferred on an earlier occasion to any other candidate. The content of
this report is not being presented by any other student to this or any other University for the
award of a degree.
Signature:
Signature:
At the very outset, I would like to give the first honours to the Almighty who gave me the
wisdom and knowledge to complete this report.
I would like to thank Mr. D F Melvin Jose, Principal, Al Azhar College of Engineering and
Technology, Thodupuzha for all the support extended during the course of this work.
I would like to thank Ms. Kala O S, Head of the Department and the Seminar Coordinator,
for giving me useful suggestions and for her constant encouragement and guidance throughout
the progress of this work.
Department of Computer Science and Engineering for his valuable suggestions and guidance.
I express my special thanks to my parents and friends who were with me from the start of
the dissertation, for the interesting discussions and the ideas they shared.
In particular, I would like to thank all the other faculty members of the Computer Science
Department and all other people who have helped me in many ways for the successful
completion of this work.
ABSTRACT
Computer vision, situated at the convergence of computer science and artificial intelligence,
has undergone transformative developments. The rise of deep learning, particularly through
convolutional neural networks, has redefined the landscape, propelling image recognition,
coupled with pre-trained models like those on ImageNet, has become a cornerstone,
persist, encompassing issues of data quality and bias, interpretability, and the resilience of
models to adversarial attacks. Ethical concerns surrounding biased datasets and the "black
box" nature of deep learning models necessitate ongoing research. Despite challenges,
computer vision finds pivotal applications in autonomous vehicles, healthcare for medical
gaming, education, and industrial training. The ongoing interplay between advancements,
challenges, and applications positions computer vision as a driving force in the realization of
Programming a computer and designing algorithms for understanding what is in these images
is the field of computer vision. Computer vision powers applications like image search, robot
ACKNOWLEDGEMENT
ABSTRACT
LIST OF FIGURES
ABBREVIATIONS
Chapter 1. INTRODUCTION
1.1 Idea
1.2 Need
1.3 History
1.4 Goal
1.5 Problem statement
Chapter 2. LITERATURE SURVEY
Chapter 3. SYSTEM
3.1 Existing System
3.2 Proposed System
Chapter 4. DESIGN
4.1 Algorithm
4.2 Flow chart
Chapter 5. CONCLUSION
REFERENCES
LIST OF FIGURES
NO TITLE
1 Problem Vision
CHAPTER 1: INTRODUCTION
1.1 Idea
Computer vision is the concept of equipping machines with the capability to interpret and
comprehend visual information akin to human vision. It involves acquiring visual data
through sensors, preprocessing to enhance quality, and extracting relevant features. The crux
lies in developing and training models, often utilizing deep learning, to recognize and
interpret patterns in images or videos.
Across tasks ranging from image classification and object detection to segmentation, computer vision finds
applications in diverse domains such as facial recognition, autonomous navigation, medical
image analysis, and augmented reality. The iterative process of feedback and continuous
learning refines these models, enabling machines to make informed decisions based on visual
input. This transformative idea not only enhances efficiency and automation but also fuels
innovation across industries, fundamentally reshaping the way machines perceive and interact
with the visual world.
1.2 Need
Computer vision is indispensable due to the escalating demand for machines capable of
comprehending and interpreting visual information, paralleling human visual perception. This
technology addresses the imperative for automation and heightened efficiency in industries
by enabling machines to autonomously analyse visual data, fostering streamlined processes in
manufacturing, quality control, and various sectors.
Moreover, the exponential growth of visual data from diverse sources necessitates automated
systems capable of rapid and accurate analysis, a role in which computer vision excels. Its
pivotal role in enhancing user experiences, particularly in augmented reality and virtual
reality applications, is transformative for gaming, education, and training simulations.
In medical diagnostics, security surveillance, and autonomous systems such as vehicles and
drones, computer vision's real-time perception capabilities are foundational for safety and
decision-making. It also contributes to quality control in manufacturing, human-computer
interaction advancements, and accessibility features, making it a key enabler across a
spectrum of industries and applications.
1.3 History
The field of computer vision began with the development of pattern recognition and image
processing techniques.
In 1957, Russell Kirsch and his colleagues built the first image scanner and produced the
first digital image, a crucial step in early image digitization.
In the 1960s, researchers started exploring methods for edge detection and contour analysis,
laying the groundwork for later image analysis techniques.
The 1970s saw the introduction of algorithms for shape analysis and recognition, including
the work on the "Structural Theory of Shape" by Azriel Rosenfeld.
Early computer vision applications included industrial inspection systems for quality control.
David Marr's influential work on computational theories of vision, developed in the late 1970s
and published in his 1982 book "Vision", contributed to the understanding of visual
processing in biological systems.
The 1990s saw the development of models for object recognition and tracking, with the
introduction of keypoint detectors and descriptors.
Hand-crafted local image descriptors, such as SIFT (Scale-Invariant Feature Transform) and
SURF (Speeded Up Robust Features), gained popularity in the 2000s.
The release of the ImageNet database in 2009 gave the field a large benchmark dataset for
the evaluation and comparison of computer vision algorithms.
From 2010, the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) became a
benchmark for evaluating recognition models, including the deep learning models that came to
dominate it.
Transfer learning, where pre-trained models on large datasets are fine-tuned for specific tasks,
became a standard practice, enhancing the efficiency of computer vision systems.
The history of computer vision reflects a continuous progression from early image processing
techniques to the transformative impact of deep learning, positioning it as a pivotal
technology with broad applications across various industries.
1.4 Goal
The primary goal of computer vision is to enable machines to interpret and understand visual
information from the world, mimicking human visual perception. This involves developing
algorithms, models, and systems that can analyze and extract meaningful insights from
images or video data. The overarching objectives of computer vision include:
Image Understanding: Computer vision aims to develop systems that can comprehend
the content of images, recognizing objects, scenes, and patterns within visual data.
Object Recognition and Classification: The goal is to enable machines to identify and
categorize objects within images accurately. This is fundamental for applications like
image search, autonomous navigation, and surveillance.
Image and Video Analysis: The ability to analyze images and videos for various
purposes, including extracting features, detecting motion, tracking objects, and
recognizing temporal patterns, is a key goal in computer vision.
3D Reconstruction: Computer vision endeavors to reconstruct three-dimensional
representations of objects or scenes from two-dimensional images. This is crucial for
applications such as virtual reality, robotics, and autonomous navigation.
Medical Image Analysis: In healthcare, computer vision aims to assist in the analysis
of medical images, including tasks such as tumor detection, organ segmentation, and
disease diagnosis.
Augmented Reality (AR) and Virtual Reality (VR): Computer vision contributes to
creating immersive AR and VR experiences by overlaying digital information on the
real world or generating realistic virtual environments.
1.5 Problem statement
The problem statement in computer vision typically revolves around addressing challenges
and limitations in the interpretation and understanding of visual information by machines.
Key problem areas include:
Image Recognition and Classification: Developing accurate and robust algorithms for
identifying objects, scenes, or patterns within images is a fundamental challenge.
Variability in lighting conditions, viewpoints, and the presence of occlusions can
impede the performance of recognition systems.
Object Detection and Localization: Locating and delineating objects within images or
video frames is a crucial problem. Achieving high precision and recall, especially in
complex scenes with multiple objects, poses a persistent challenge.
Addressing these problem areas requires ongoing research, innovation, and interdisciplinary
collaboration to advance the capabilities of computer vision systems and make them more
robust, interpretable, and applicable across a broad range of domains and scenarios.
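The precision and recall mentioned under object detection and localization can be made concrete with a small sketch. It assumes axis-aligned boxes given as (x1, y1, x2, y2), counts a prediction as correct when its intersection over union (IoU) with a not-yet-matched ground-truth box reaches 0.5, and uses a simple greedy matching. The function names, the threshold, and the sample boxes are illustrative assumptions, not the official protocol of any benchmark.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def precision_recall(predictions, ground_truth, iou_threshold=0.5):
    """Greedy matching: each ground-truth box can be matched at most once."""
    unmatched = list(ground_truth)
    true_positives = 0
    for pred in predictions:
        best = max(unmatched, key=lambda gt: iou(pred, gt), default=None)
        if best is not None and iou(pred, best) >= iou_threshold:
            true_positives += 1
            unmatched.remove(best)
    precision = true_positives / len(predictions) if predictions else 0.0
    recall = true_positives / len(ground_truth) if ground_truth else 0.0
    return precision, recall


# Hypothetical sample data: two real objects, two predicted boxes.
truth_boxes = [(0, 0, 10, 10), (20, 20, 30, 30)]
predicted_boxes = [(1, 1, 10, 10), (50, 50, 60, 60)]
precision, recall = precision_recall(predicted_boxes, truth_boxes)
```

In this sample, one prediction overlaps a ground-truth box with an IoU of 0.81 and the other overlaps nothing, so precision and recall both come out as 0.5. Crowded scenes make the matching step harder, which is one reason the text calls this a persistent challenge.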
CHAPTER 2: LITERATURE SURVEY
As autonomous vehicles continue to evolve, computer vision plays a pivotal role in enabling
perception, decision-making, and navigation capabilities crucial for safe and efficient
operation. The survey covers key topics such as object detection and recognition, semantic
environmental variations, and the interpretability of computer vision models in the context of
autonomous driving.
The survey also discusses notable datasets, benchmarking methodologies, and recent
synthesizing findings from a wide range of scholarly works, this literature survey aims to
provide researchers, practitioners, and stakeholders with valuable insights into the state-of-
the-art, challenges, and future directions in the intersection of computer vision and
autonomous vehicles.
David Marr: Known for his early work on computational theories of vision, Marr laid the
foundation for understanding how the human visual system processes information. His
influential book "Vision" outlined key concepts in computer vision.
Yann LeCun: A pioneer in the field of deep learning, LeCun's work on convolutional neural
networks (CNNs) has been instrumental in the success of deep learning for image recognition
tasks. He is a key figure in the development of modern computer vision techniques.
Geoffrey Hinton: Another luminary in deep learning, Hinton's contributions include the
development of Boltzmann machines and significant advancements in neural network
architectures. His work has had a profound impact on the use of neural networks in computer
vision.
Fei-Fei Li: Renowned for her research in computer vision and machine learning, Li has
contributed to large-scale image recognition datasets like ImageNet. She is also known for
her work in the intersection of computer vision and healthcare.
Andrew Ng: A leading figure in machine learning and co-founder of Google Brain, Ng has
made substantial contributions to computer vision education and research. He has been
involved in projects focusing on deep learning applications in vision.
Richard Szeliski: A computer vision researcher with contributions to structure from motion,
panoramic image stitching, and image-based modeling. His work has been influential in the
development of algorithms for 3D scene reconstruction.
Martial Hebert: Known for his research in computer vision and robotics, Hebert has made
significant contributions to object recognition, visual mapping, and perception for robotic
systems.
Jitendra Malik: Recognized for his contributions to object recognition and scene
understanding, Malik's research spans topics like shape representation, image segmentation,
and the analysis of visual scenes.
Raquel Urtasun: A prominent figure in computer vision and autonomous systems, Urtasun's
work focuses on leveraging computer vision for self-driving cars, with an emphasis on
perception and mapping.
Trevor Darrell: A researcher with extensive contributions to computer vision, Darrell has
worked on topics such as object recognition, human-computer interaction, and machine
learning applications in vision.
Research Directions:
Multi-Sensor Fusion:
Integrating information from diverse sensors, including cameras, LiDAR, radar, and GPS, to
create a more comprehensive perception system.
Continual Learning:
Developing algorithms that can continuously learn and adapt to new scenarios and
environments, reducing the need for frequent updates.
Edge Computing:
Exploring edge computing solutions to process data locally within the vehicle, reducing
reliance on external processing resources and improving response times.
Human-Centric Design:
Regulatory Frameworks:
Long-Term Autonomy:
Exploring approaches for achieving long-term autonomy, including strategies for vehicle
maintenance, system upgrades, and adapting to evolving urban landscapes.
Fig.2 Rising Research Chart
CHAPTER 3: SYSTEM
3.1 Existing System
PyTorch: PyTorch, an open-source machine learning library, is widely adopted for its
dynamic computation graph and ease of use. It is employed in developing computer vision
models, particularly in research settings, due to its flexibility and strong community support.
YOLO (You Only Look Once): YOLO is an object detection system that processes images
in a single pass, making it efficient for real-time applications. YOLO versions, such as
YOLOv3 and YOLOv4, have gained popularity for their speed and accuracy in object
detection tasks.
Faster R-CNN (Region-based Convolutional Neural Network): Faster R-CNN is a popular
object detection framework that combines deep learning with region proposal networks to
achieve high accuracy in object localization and recognition.
Mask R-CNN: An extension of Faster R-CNN, Mask R-CNN adds a segmentation branch,
enabling the model to generate pixel-level masks for object instances. This is particularly
useful in applications requiring precise object segmentation.
ROS (Robot Operating System): While not exclusively a computer vision system, ROS is a
flexible framework for developing robotic systems, and it includes packages for computer
vision tasks. It is widely used in the robotics community for integrating perception with robot
control.
These existing systems provide a foundation for developing computer vision applications
across various domains, from image processing to object detection and recognition, and they
continue to evolve with ongoing research and technological advancements. It's important to
check for the latest developments and updates in this rapidly evolving field.
3.2 Proposed System
EfficientDet:
CLIP:
CLIP is a model developed by OpenAI that learns visual concepts by associating images with
natural language descriptions. It can understand images in the context of textual descriptions,
allowing for a wide range of applications, from image classification to zero-shot learning.
DALL-E:
DALL-E, also from OpenAI, is a generative model capable of creating diverse and creative
images based on textual descriptions. It extends the capabilities of generative models to
create novel images from textual prompts, showcasing the potential of generative AI in visual
creativity.
DETR (Detection Transformer):
DETR is a transformer-based model designed for object detection tasks. It approaches object
detection as a set prediction problem, eliminating the need for anchor boxes and significantly
simplifying the detection pipeline. DETR has demonstrated competitive performance in
object detection benchmarks.
Swin Transformer:
The Swin Transformer is a hierarchical transformer architecture designed for vision tasks. It
computes self-attention within local windows of image patches and shifts the window grid
between successive layers so that information can cross window boundaries, while patch
merging produces feature maps at several scales. Swin Transformer has shown strong
performance in image classification and object detection tasks.
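The window mechanism of the Swin Transformer described above can be illustrated without any deep learning library. The sketch below only shows how a grid of patches is split into windows and how a cyclic shift changes which patches share a window. It contains no attention computation, and the 4 x 4 grid, the window size of 2, and the function names are assumptions chosen for illustration.

```python
def window_partition(grid, m):
    """Split an H x W grid (list of rows) into non-overlapping m x m windows."""
    height, width = len(grid), len(grid[0])
    assert height % m == 0 and width % m == 0, "grid must tile evenly"
    windows = []
    for top in range(0, height, m):
        for left in range(0, width, m):
            windows.append([row[left:left + m] for row in grid[top:top + m]])
    return windows


def cyclic_shift(grid, s):
    """Roll the grid s positions up and s positions left, wrapping around."""
    rolled_rows = grid[s:] + grid[:s]
    return [row[s:] + row[:s] for row in rolled_rows]


# A 4 x 4 grid of patch indices 0..15, standing in for patch embeddings.
patches = [[r * 4 + c for c in range(4)] for r in range(4)]

# Regular windows: attention would be computed inside each 2 x 2 block.
regular = window_partition(patches, 2)

# Shifted windows: shift by half the window size, then partition again.
shifted = window_partition(cyclic_shift(patches, 1), 2)
```

With these inputs the first regular window holds patches 0, 1, 4, 5, while the first shifted window holds patches 5, 6, 9, 10, which come from four different regular windows. This is how alternating the two layouts lets information spread across window boundaries. In the published model, windows that wrap around the image border are additionally masked so that unrelated regions do not attend to each other.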
CHAPTER 4: DESIGN
4.1 Algorithm
Computer vision algorithms encompass a broad range of techniques designed to interpret and
understand visual information. The choice of algorithm depends on the specific task or
application within computer vision. Here are some fundamental computer vision algorithms
categorized by common tasks:
1. Image Preprocessing:
Image Blurring (e.g., Gaussian Blur): Smoothing images to reduce noise and emphasize
important features.
Image Gradients (e.g., Sobel Operator): Calculating gradients to identify edges and changes
in intensity.
Image Thresholding (e.g., Otsu's Thresholding): Segmenting images based on pixel intensity.
2. Feature Extraction:
Speeded Up Robust Features (SURF): Similar to SIFT, but designed for efficiency.
3. Object Detection:
Histogram of Oriented Gradients (HOG): Describing object shapes based on gradients for
pedestrian detection.
Cascade Classifier (e.g., Haar Cascades): Utilizing a cascade of simple classifiers for real-
time object detection.
4. Image Classification:
Convolutional Neural Networks (CNNs): Deep learning networks designed to learn spatial
hierarchies of features from image data.
Residual Networks (ResNet): Introducing residual connections to ease the training of deep
networks.
Transfer Learning: Leveraging pre-trained models on large datasets for specific classification
tasks.
5. Semantic Segmentation:
DeepLab: Employing atrous (dilated) convolutions to enlarge the receptive field without
reducing resolution, for pixel-wise segmentation.
6. Object Tracking:
Kalman Filter: An algorithm for recursive estimation that is widely used for object tracking.
Correlation Filter (e.g., MOSSE): Applying correlation-based tracking for real-time object
tracking.
7. Depth Estimation:
Stereo Vision: Using two or more cameras to estimate depth based on disparities between
corresponding image points.
LiDAR-based Depth Sensing: Leveraging Light Detection and Ranging (LiDAR) technology
for accurate depth information.
8. Face Recognition:
DeepFace and FaceNet: Utilizing deep learning for robust face recognition.
9. Image Stitching:
SIFT-Based Stitching: Matching key points and transforming images to create panoramic
views.
Feature-Based Stitching (e.g., ORB): Utilizing feature extraction and matching for image
alignment.
PoseNet: A deep learning model for estimating the pose (position and orientation) of objects
in images.
Iterative Closest Point (ICP): An algorithm for refining the alignment between 3D models and
point clouds.
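As one concrete instance of the preprocessing techniques listed above, Otsu's thresholding can be sketched in a few lines. The version below is a dependency-free illustration that works on a flat list of 8-bit grey values and picks the threshold that maximises the between-class variance. The function name and the toy pixel values are assumptions. A real application would more likely call an image library.

```python
def otsu_threshold(pixels, levels=256):
    """Return the threshold t that maximises between-class variance.

    Pixels with value <= t form one class, pixels with value > t the other.
    """
    histogram = [0] * levels
    for value in pixels:
        histogram[value] += 1

    total = len(pixels)
    total_sum = sum(level * count for level, count in enumerate(histogram))

    best_threshold, best_variance = 0, -1.0
    weight_low, sum_low = 0, 0
    for t in range(levels):
        weight_low += histogram[t]
        sum_low += t * histogram[t]
        weight_high = total - weight_low
        if weight_low == 0 or weight_high == 0:
            continue  # one class is empty, so there is nothing to separate
        mean_low = sum_low / weight_low
        mean_high = (total_sum - sum_low) / weight_high
        # Between-class variance, up to a constant factor of 1 / total**2.
        variance = weight_low * weight_high * (mean_low - mean_high) ** 2
        if variance > best_variance:
            best_threshold, best_variance = t, variance
    return best_threshold


# Toy "image": a dark cluster near 10 and a bright cluster near 200.
image = [10, 11, 12, 10, 11, 200, 201, 202, 200, 201]
threshold = otsu_threshold(image)
binary = [1 if p > threshold else 0 for p in image]
```

For this toy input the chosen threshold is 12, the largest value of the dark cluster, so the five dark pixels map to 0 and the five bright pixels map to 1.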
These algorithms represent a subset of the diverse techniques within computer vision. The
field continues to evolve with advancements in deep learning, reinforcement learning, and the
integration of multiple sensor modalities, contributing to the development of more
sophisticated and versatile computer vision systems.
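The Kalman filter listed under object tracking can be shown in its simplest form. The sketch below is a scalar filter that estimates a single coordinate of an object from noisy measurements under a constant-position model. Practical trackers usually carry position and velocity in a state vector and use matrix equations, so this is a deliberate simplification, and the noise variances and measurement values are illustrative assumptions.

```python
def kalman_1d(measurements, process_var=1e-3, measurement_var=0.5,
              initial_estimate=0.0, initial_var=1.0):
    """Scalar Kalman filter with a constant-position motion model."""
    estimate, variance = initial_estimate, initial_var
    history = []
    for z in measurements:
        # Predict: the state is assumed unchanged, but uncertainty grows.
        variance += process_var
        # Update: blend prediction and measurement using the Kalman gain.
        gain = variance / (variance + measurement_var)
        estimate += gain * (z - estimate)
        variance *= (1.0 - gain)
        history.append(estimate)
    return history, variance


# Hypothetical noisy readings of an object whose true position is about 5.
readings = [4.8, 5.3, 4.9, 5.1, 5.2, 4.7, 5.0, 5.1]
estimates, final_variance = kalman_1d(readings)
```

Each step first predicts and then corrects. The gain is large while the filter is uncertain and shrinks as evidence accumulates, so early measurements move the estimate strongly and later ones only refine it.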
Fig 4 Image Processing Algorithm
4.2 Flowchart
Simplified flowchart for a generic computer vision application:
Input:
Acquire the input data, which could be images, video frames, or a sequence of frames.
Preprocessing:
Perform preprocessing on the input data to enhance its quality and prepare it for further
analysis.
Common preprocessing steps include:
Image resizing: Adjust the size of images for consistency.
Normalization: Scale pixel values to a standard range.
Noise reduction: Apply filters to reduce noise.
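The three preprocessing steps just listed can be sketched on a tiny grey-scale image stored as a list of rows. This is a minimal, dependency-free illustration. Nearest-neighbour resizing and a 3 x 3 mean filter are assumed here as the simplest representatives of each step, and the function names and the sample image are hypothetical.

```python
def resize_nearest(image, new_h, new_w):
    """Nearest-neighbour resize of a 2D list of pixel values."""
    old_h, old_w = len(image), len(image[0])
    return [[image[r * old_h // new_h][c * old_w // new_w]
             for c in range(new_w)]
            for r in range(new_h)]


def normalize(image, max_value=255):
    """Scale integer pixel values from [0, max_value] to [0.0, 1.0]."""
    return [[p / max_value for p in row] for row in image]


def mean_filter(image):
    """3 x 3 box blur; border pixels average over the neighbours that exist."""
    h, w = len(image), len(image[0])
    out = []
    for r in range(h):
        row = []
        for c in range(w):
            neighbours = [image[rr][cc]
                          for rr in range(max(0, r - 1), min(h, r + 2))
                          for cc in range(max(0, c - 1), min(w, c + 2))]
            row.append(sum(neighbours) / len(neighbours))
        out.append(row)
    return out


raw = [[0, 0, 255, 255],
       [0, 255, 255, 255],   # the 255 at row 1, column 1 is a noisy pixel
       [0, 0, 255, 255],
       [0, 0, 255, 255]]

small = resize_nearest(raw, 2, 2)        # consistent size
scaled = normalize(small)                # values in [0.0, 1.0]
smoothed = mean_filter(normalize(raw))   # noise reduction
```

After smoothing, the noisy pixel drops from 1.0 to 4/9 because it is averaged with its mostly dark neighbours. The same averaging also softens the true edge between the dark and bright halves, which is the usual trade-off of linear smoothing filters.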
Feature Extraction:
Identify relevant features in the input data that are important for the specific computer vision
task.
Common feature extraction techniques include:
Edge detection: Highlighting boundaries in the image.
Keypoint detection: Identifying distinctive points in the image.
Texture analysis: Extracting patterns in the image.
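Edge detection, the first technique listed, can be illustrated with the Sobel operator, which estimates the horizontal and vertical intensity gradients with two 3 x 3 kernels and combines them into a gradient magnitude. The sketch below is a plain-Python illustration on a list-of-rows image. Leaving the one-pixel border at zero is a simplifying assumption, and the helper names are hypothetical.

```python
import math

SOBEL_X = [[-1, 0, 1],
           [-2, 0, 2],
           [-1, 0, 1]]
SOBEL_Y = [[-1, -2, -1],
           [0, 0, 0],
           [1, 2, 1]]


def apply_kernel(image, kernel, r, c):
    """Correlate a 3 x 3 kernel with the neighbourhood centred on (r, c)."""
    return sum(kernel[i][j] * image[r - 1 + i][c - 1 + j]
               for i in range(3) for j in range(3))


def sobel_magnitude(image):
    """Gradient magnitude for interior pixels; the border is left at 0."""
    h, w = len(image), len(image[0])
    out = [[0.0] * w for _ in range(h)]
    for r in range(1, h - 1):
        for c in range(1, w - 1):
            gx = apply_kernel(image, SOBEL_X, r, c)
            gy = apply_kernel(image, SOBEL_Y, r, c)
            out[r][c] = math.hypot(gx, gy)
    return out


# A vertical step edge: dark on the left, bright on the right.
step = [[0, 0, 10, 10]] * 4
edges = sobel_magnitude(step)

# A perfectly flat image has no gradient anywhere.
flat = sobel_magnitude([[7, 7, 7, 7]] * 4)
```

On the step image the interior pixels next to the boundary receive a magnitude of 40, entirely from the horizontal gradient, while the flat image yields 0 everywhere. Thresholding such a magnitude map is a common way to obtain a binary edge map.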
Object Detection/Recognition:
Use algorithms to detect and recognize objects or patterns in the input data.
Common object detection/recognition techniques include:
Object detection models (e.g., YOLO, Faster R-CNN): Locate and classify objects in images.
Template matching: Compare image regions with predefined templates.
Semantic Segmentation:
If necessary, perform semantic segmentation to assign a label to each pixel in the image.
Techniques like Convolutional Neural Networks (CNNs) can be used for pixel-wise
classification.
Decision Making:
Based on the extracted features and detected objects, make decisions or predictions relevant
to the application.
This step may involve the use of machine learning models trained on labeled data.
Post-Processing:
Refine the results obtained from the previous steps to improve accuracy or remove artifacts.
Common post-processing steps include:
Non-maximum suppression: Refining object detection results.
Filtering: Removing outliers or noise.
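Non-maximum suppression, named above as a post-processing step, can be sketched as follows. Detectors often report several overlapping boxes for the same object, and this greedy procedure keeps the highest-scoring box and discards any remaining box that overlaps it too strongly. The (score, box) tuple layout, the 0.5 IoU threshold, and the sample detections are assumptions made for illustration.

```python
def box_iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def non_max_suppression(detections, iou_threshold=0.5):
    """Greedy NMS over a list of (score, box) pairs; returns kept pairs."""
    remaining = sorted(detections, key=lambda d: d[0], reverse=True)
    kept = []
    while remaining:
        best = remaining.pop(0)  # highest remaining score
        kept.append(best)
        # Drop every box that overlaps the kept box too strongly.
        remaining = [d for d in remaining
                     if box_iou(best[1], d[1]) < iou_threshold]
    return kept


# Hypothetical detections: two boxes on the same object, one elsewhere.
detections = [
    (0.8, (12, 12, 52, 52)),
    (0.9, (10, 10, 50, 50)),
    (0.7, (100, 100, 140, 140)),
]
kept = non_max_suppression(detections)
```

Here the 0.8 box overlaps the 0.9 box with an IoU of about 0.82 and is removed, while the distant 0.7 box survives. This sketch ignores class labels. Multi-class detectors commonly apply the same procedure separately per class.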
Output:
Generate the final output, which could include annotated images, labeled objects, or specific
information derived from the input data.
CHAPTER 5: CONCLUSION
The practical applications of computer vision are vast and impactful, ranging from facial
recognition and autonomous vehicles to medical diagnostics and augmented reality. It has
enabled innovations in user interfaces, enhanced accessibility, and contributed to the
development of intelligent systems capable of understanding and interacting with the visual
world.
As research in computer vision continues to push boundaries, the future holds promises of
even more sophisticated algorithms, broader applications, and increased integration with
other emerging technologies. The dynamic nature of the field ensures a constant stream of
innovations, shaping the way machines perceive and interact with the visual world, and
ultimately, contributing to the broader landscape of artificial intelligence and smart
technologies.
In the ever-evolving landscape of technology, computer vision's journey from its early
foundations to the current era of deep learning has been marked by remarkable progress. The
ability to endow machines with visual perception has not only transformed industries but has
also paved the way for novel applications that were once confined to the realm of science
fiction. The increasing reliance on convolutional neural networks, transfer learning, and
advanced architectures has significantly improved the accuracy and efficiency of computer
vision systems. Real-world implementations, such as facial recognition in security systems,
autonomous navigation in vehicles, and medical image analysis for diagnostics, underscore
the tangible impact of computer vision on our daily lives.
However, the journey is far from complete. Challenges persist, ranging from addressing the
ethical implications of widespread surveillance to ensuring the fairness and transparency of
decision-making in machine learning models. As computer vision continues to advance, the
need for interdisciplinary collaboration becomes more apparent, with researchers, engineers,
ethicists, and policymakers working together to navigate the ethical, legal, and societal
implications of this transformative technology. The future of computer vision holds exciting
possibilities, from further refining existing applications to unlocking new frontiers in areas
such as augmented reality, virtual reality, and human-computer interaction. In this era of rapid
technological innovation, computer vision stands as a testament to the remarkable synergy
between human ingenuity and cutting-edge technology.
CHAPTER 6: FUTURE SCOPE
The future scope of computer vision is expansive, with ongoing research and technological
advancements poised to unlock new possibilities and applications. Here are key areas that
indicate the promising future of computer vision:
Computer vision will continue to play a pivotal role in the development of autonomous
systems, including self-driving cars, drones, and robots. Improvements in perception, scene
understanding, and object recognition will contribute to safer and more efficient autonomous
navigation.
Computer vision will enhance AR and VR experiences by enabling more realistic and
interactive virtual environments. This includes accurate object recognition, scene
understanding, and precise tracking of user movements for immersive applications in gaming,
education, and training.
Human-Computer Interaction:
Computer vision technologies will be integral to the evolution of smart cities, enabling
intelligent traffic management, public safety surveillance, and efficient infrastructure
monitoring. Video analytics will play a crucial role in optimizing urban environments.
Advancements in facial recognition, iris scanning, and other biometric technologies will
enhance security systems. Computer vision algorithms will play a crucial role in identifying
and verifying individuals in various applications, from border control to secure access
systems.
Computer vision will continue to optimize industrial processes by automating quality control,
monitoring production lines, and enhancing predictive maintenance. This can lead to
increased efficiency and reduced manufacturing errors.
Computer vision technologies will be leveraged for social good, addressing challenges such
as assistive technologies for people with disabilities, disaster response, and improving living
conditions in underserved communities.
Future research will focus on making computer vision models more interpretable and
explainable, addressing ethical concerns related to bias, fairness, and transparency in
decision-making processes.