
Automated Video Enhancement

Video Processing Tools For Amateur Filmmakers

Himanshu Madan 404034


Sanika Mokashi 404037
Sneha Rangdal 404042
Siddharth Shashidharan 404048

Guide : Prof. Mrs C V Joshi


H.O.D, Dept. of E&TC
College of Engineering, Pune
Mr. Sumeet Kataria
Oneirix Labs
COLLEGE OF ENGINEERING, PUNE
Shivajinagar, Pune - 411 005
(Formerly Government College of Engineering, Pune)

CERTIFICATE

This is to certify that Himanshu Madan [404034], Sanika Mokashi [404037],


Sneha Rangdal [404042], and Siddharth Shashidharan [404048], of final year,
B.Tech, E&TC, have satisfactorily completed their B.Tech project titled
“Automated Video Enhancement” under the guidance of Prof. Mrs. C.V.Joshi,
as a part of the coursework for the year 2007-08.

Prof. Mrs. C.V.Joshi


Internal Guide and H.O.D, Dept. of E&TC
College of Engineering, Pune

II
CERTIFICATE

This is to certify that Himanshu Madan [404034], Sanika Mokashi


[404037], Sneha Rangdal [404042], and Siddharth Shashidharan [404048],
studying in the final year in the Electronics and Telecommunication department
at COEP, were working on their B.Tech project at Oneirix Labs.

The project, titled “Automated Video Enhancement”, is deemed complete.

Mr. Sumeet Kataria of Oneirix Labs was assigned as their guide for the
entire duration of this project.

The aims decided upon at the start of the project were satisfactorily
achieved in good time. Oneirix Labs is pleased with the quality of work and
the ultimate results obtained.

III
ACKNOWLEDGEMENT

We take this opportunity to express our heartfelt gratitude to everyone at


Oneirix Labs, professors and staff of College of Engineering, Pune for their
constant inspiration and guidance.
We would, in particular, like to thank our guide, Sumeet Katariya, for his untiring
efforts, technical advice and guidance throughout the course of this project. We
are also deeply indebted to Prof. Mrs. C.V.Joshi for her support and guidance
derived from the vast pool of her experience and knowledge.

We are extremely grateful to everyone at Oneirix Labs, in particular,


Udayan Kanade, and Sanat Ganu, for their invaluable inputs at critical junctures
during the project work.

We could not have conceived of undertaking a project on such a massive


scale without the encouragement and blessings of everyone at Oneirix Labs, our
professors and parents.

IV
Abstract

Art, it is said, merely reflects life and society. If there is one medium of
expression that has seduced the imagination of the populace, it is the motion
picture. From humble beginnings, the art of film-making is now nearly a century
old.
It comes as little surprise, then, to see a whole host of amateur filmmakers
making home-made films. This has been made possible by the surge in video
and audio recording technologies. Digital video camcorders have been designed
to help such users by adding functionalities such as autofocus and
autoexposure. This, however, does not guarantee the right results all the
time. In fact, a large portion of every film's production time is normally
spent in the post-production stage, with editing required for tasks such as
colour correction, brightness and contrast control, and frame-by-frame editing
of the video.

Powerful computers and commercially available, versatile video-audio
editing software such as Apple's Final Cut Pro or Avid are expensive and
sophisticated tools offering a multitude of services in the post-processing
stage of a film's development. Given the cost and sheer complexity of this
software, its use is often limited to studio houses or editing companies. And
even there, it requires trained professionals who rely heavily on human
perception to produce error-free, professional-looking films.

This requires time and training, and it introduces human error, in that
there will always be a limit to what is perceived as the right brightness
level or colour consistency over the entire duration of a scene.

This is where we come into the picture. The problems with autofocus and
autoexposure in modern digital cameras are discussed in depth in the chapters
to follow.

V
If we understand exactly how images are formed in a video stream, it is
possible to use the concepts of image processing to correct the irregularities
caused by the camera, by using the good images to correct or improve those
lacking in quality, leading to an improved overall video. Additionally, we
seek to make the whole process as automated as possible to eliminate the
inevitable human-perception errors.

For the end user, this translates to shorter hours, more automation, greater
reliability and even more functionality.

VI
Contents Page No

A1. Certificate Of Completion – COEP II


A2. Certificate Of Completion – Sponsors III
A3. Acknowledgement IV
A4. Abstract V
A5. Contents VII
A6. List of Figures IX

1. Need for the Project 01


2. Video Encoding And Decoding Concepts 04
2.1 MPEG Encoding 05
2.2 Software to Extract Frames 08
2.2.1 FFMpeg 08
2.2.2 Frame Dumper 09
3. Problems Faced By Amateurs 10
3.1 Auto Exposure 11
3.1.1 The problems with auto exposure 12
3.2 Autofocus 13
3.2.1 The problems with autofocus 14
4. Post-processing Tools 15
4.1 Apple’s FCP 16
4.2 Avid media composer 16
4.3 Autodesk Smoke 17
4.4 Autodesk Fire 18
5. Our Approach 19
6. Block Diagram 22
7. Motion Estimation 25
7.1 What is Correlation 26
7.2 Algorithm for motion detection 27
7.3 Test Results Of Motion Algorithm 29

VII
7.4 Problems Faced And Their Solution 31
8. Large Space Matrix 33
9. Exposure Correction 37
9.1 A quantitative measure of the exposure in an image 38
9.2 Plotting of the brightness curve 38
9.3 Approach 1: The transfer curve method 40
9.4 Approach 2 42
9.5 Approach 3 44
9.6 Algorithm 45
9.7 Backslash or matrix left division 46
9.8 Pseudo Inverse 47
9.9 Implementation of the algorithm 48
10. Focus 50
10.1 Quantifying focus 51
10.2 Methods to measure the frequency components 51
10.2.1 FFT 51
10.2.2 DCT 53
10.3 Practically applying the above concepts 57
11. Actual Correction 61
11.1 Calculation of weights 64
12. Results 65
13. Conclusion and Future Scope 69
14. References 73

VIII
List of Figures Page No

5.1 Frame 5 containing good information and frame 20 with poor information. 21
6.1 Block Diagram 23
7.1 Illustration for the idea of correlation 27
7.2 Sequences of images showing the apparent motion 29
between them.
7.3 Plot of calculated motion, using the motion estimation 30
Algorithm.
7.4 Frame with very poor information content. 31
7.5 Good block obtained from the entire frame. 32
8.1 Large space matrix with each frame as one dimension. 35
8.2 Combined frames after motion compensation. 35
9.1 Energy pattern of the frames. 39
9.2 Correction needed of the energy pattern of the frames. 39
9.3 Results of transfer curve method. 40
9.4 Frames showing the variations of the pixel intensity 42
in time before the auto exposure of camera settles.
9.5 Independent correction curves of different pixels. 43
9.6 Corrected image showing banding effect. 44
10.1 Demonstration of FFT transforms. 53
10.2 Image converted to its frequency domain. 55
10.3 Demonstration of 2d DCT transforms. 56
10.4 Images with varying focus. 57
10.5 FFT of an image. 58
10.6 DCT of an image. 59
10.7 Focus factor using FFT. 59
10.8 Focus factor using DCT. 60
11.1 Old and new exposure curve and original pixel values. 61
12.1 First Frame. 66

IX
12.2 First Frame After Correction. 66
12.3 Second Frame. 66
12.4 Second Frame After Correction. 66
12.5 Third Frame. 66
12.6 Third Frame After Correction. 66
12.7 Fourth Frame. 67
12.8 Fourth Frame After Correction. 67
12.9 Final Frame 67
12.10 Final Frame After Correction. 67

X
Chapter 1

Need For the Project

-1-
Need For the Project

Technology today has stretched out its long arms to embrace the
populace and empower them to do things once deemed impossible. Nearly
anyone today can shoot and edit films at home with the help of a digital
camera and a computer. These Digicams have spawned a generation of filmmakers
of all hues and styles.

When a motion picture is made, the use of elaborate and technologically


advanced cameras is not the only thing that adds that touch of professionalism.
There are also the elaborate lighting and staging processes and of course, a long
detailed editing stage necessary to bring out the picture on the big screen.
These are luxuries that an amateur armed only with his hardy Digicam cannot
afford. In these cases then, he needs to rely on what little he may have in his
toolkit by means of post-processing tools.
The problems that are done away with by controlling the lighting and other
environmental factors in the case of a full-fledged movie rear their heads when
Digicams are used.

Factors such as the lighting and ambient colour cannot really be controlled
externally by an amateur, who does not have the resources for setting up
expensive studio lights, reflectors and so on. The video obtained from the
Digicam is therefore bound to look rather crude and unedited. The video
editing software available today offers the filmmaker some tools that can be
used to improve the overall appearance of the footage. These are discussed
later.

This software, however, cannot really do much to compensate for the
inherent shortcomings of digital cameras and handycams. Certain problems
arise due to features such as auto-exposure and auto-focus that come with
virtually every digital camera nowadays. They are discussed in detail

-2-
in the pages that follow. The traditional method has always been to rectify or
compensate for these shortcomings at the post-processing stage rather than at
the data acquisition stage itself. Thus, what one hopes to do is to use the video
obtained directly from the camera in its raw form to rectify and create
aesthetically appealing, professional-looking videos.

For now, an attempt is made to show the need for a different, new approach
to fixing those very problems, and to highlight the shortcomings of the
software and video tools in the present market.

-3-
Chapter 2

Video Encoding And Decoding Concepts

-4-
Video Encoding And Decoding Concepts

Historically, video was stored as an analog signal on magnetic tape.


Around the time when the compact disc entered the market as a digital-format
replacement for analog audio, it became feasible to also begin storing and using
video in digital form and a variety of such technologies began to emerge.

We can, in most modern contexts at least, think of most video formats as
consisting of a continuous stream of images played out at a particular rate,
called the 'fps' or frames per second. By keeping this rate high (usually 24
to 30 fps), one uses the persistence of human vision to generate what looks
like a continuous motion picture. Thus, improving the quality of the video
boils down to improving these images.
Most modern cameras store data digitally. MPEG is the standard, or
preferred, format used to record and store videos. This file format too, as
discussed above, may be thought of as frames placed one after the other,
forming a continuous stream of images. Let us now discuss this coding
technique in greater detail:

2.1 MPEG Encoding

MPEG stands for Moving Picture Experts Group, the industry committee
which created the standard. MPEG is, in fact, a whole family of standards for
digital video and audio signals using DCT compression. MPEG-2 is certain to
remain the dominant standard in consumer equipment for the foreseeable
future. MPEG takes the DCT compression algorithm and defines how it is used
to reduce the data rate, and how packets of video and audio data are
multiplexed together in a way that will be understood by an MPEG decoder.

-5-
DCT, or Discrete Cosine Transform to give it its full name, uses the fact
that adjacent pixels in a picture (either physically close in the image
(spatial) or in successive images (temporal)) often have similar values. Small
blocks of 8 x 8 pixels are 'transformed' mathematically in a way that tends to
group the common signal elements in a block together. The DCT does not
directly reduce the data, but the transform tends to concentrate the energy
into the first few coefficients, and many of the higher-frequency coefficients
are often close to zero. Bit-rate reduction is achieved by not transmitting
the higher-frequency elements, which have a high probability of not carrying
useful information.

MPEG's first aim was to define a video coding algorithm for application to
'digital storage media', in particular CD-ROM. Very rapidly, the need for
audio coding was added and the scope was extended from being targeted solely
at CD-ROM to defining a 'generic' algorithm capable of being used by
virtually all applications, from storage-based multimedia systems to
television broadcasting and communications applications such as VoD and
videophones.

Both the MPEG-1 and MPEG-2 standards are split into three main parts:
audio coding, video coding, and system management and multiplexing. MPEG
itself is split into three main sub-groups, one responsible for each part, and
a number of other sub-groups to advise on implementation matters, to perform
subjective tests, and to study the requirements that must be supported.

It is neither cost effective nor an efficient use of bandwidth to support all


the features of the standard in all applications. In order to make the standard
practically useful and enforce interoperability between different implementations
of the standard, MPEG has defined profiles and levels of the full standard.
Roughly speaking, a profile is a sub-set, suitable for a particular application, of
the full possible range of algorithmic tools, and a level is a defined range of
parameter values (such as picture size, for instance) that are reasonable to
implement and practically useful. There are as many as six MPEG-2 profiles,

-6-
though only two are currently relevant to broadcasting: the Main Profile,
which is essentially MPEG-1 extended to take account of interlaced scanning
and which encodes chroma as 4:2:0, and the Professional Profile, which has
4:2:2 chrominance resolution and is designed for production and post-production.

MPEG-2 makes extensive use of motion compensated prediction to


eliminate redundancy. The prediction error remaining after motion compensation
is coded using DCT, followed by quantisation and statistical coding of the
remaining data. MPEG has two types of prediction. The so-called 'P' pictures are
predicted only from pictures that are displayed before the current picture. 'B'
pictures on the other hand are predicted from two pictures, one that is displayed
earlier and one later. In order to do this non-causal prediction the encoder has to
reorder the sequence of pictures before sending them to the decoder and then
the decoder has to return them to the correct display order. B-pictures add
complexity to the system but also produce a significant saving in bit-rate. An
important feature of the MPEG prediction system is the use of 'I frames' that are
coded without motion compensation. These break the chain of predictive coding
so that channel switching can be done with a sufficiently short latency.

The most significant extension of MPEG-2 Main Profile over MPEG-1 is an


improvement in options within a picture that can be used to do motion
compensated prediction of interlaced signals. MPEG-1 treats each picture as a
collection of samples from the same moment in time (known as frame-based
coding). MPEG-2 understands interlace: samples within a frame may come from
two fields that represent different moments in time. Therefore, MPEG-2 has
modes in which the data can be predicted either using one motion vector
giving an offset to a previous frame, or using two vectors giving offsets to
two different fields.

In addition, MPEG-2 extends the capabilities of MPEG-1 to allow:


• Multiple programs with independent time-bases

-7-
• Operation in error-prone environments
• Remultiplexing
• Support for scrambling

2.2 Software to Extract Frames

To extract images (frames) from a video file, we used the following
open-source software tools:

2.2.1 FFmpeg – a cross-platform tool, which we used on Ubuntu

FFmpeg is a collection of software libraries that can record, convert and


stream digital audio and video in numerous formats. It includes libavcodec, an
audio/video codec library used by several other projects, and libavformat, an
audio/video container mux and demux library. The name of the project comes
from the MPEG video standards group, together with "FF" for "fast forward".
The project is made of several components:
• ffmpeg is a command-line tool to convert one video file format to another.
It also supports grabbing and encoding in real time from a TV card.
• ffserver is an HTTP (RTSP is being developed) multimedia streaming
server for live broadcasts. Time shifting of live broadcasts is also
supported.
• ffplay is a simple media player based on SDL and the FFmpeg libraries.
• libavcodec is a library containing all the FFmpeg audio/video encoders
and decoders. Most codecs were developed from scratch to ensure best
performance and high code reusability.
• libavformat is a library containing demuxers and muxers for audio/video
container formats.
• libavutil is a helper library containing routines common to different parts of
FFmpeg.

-8-
• libpostproc is a library containing video postprocessing routines.
• libswscale is a library containing video image scaling routines.
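For frame extraction, a single ffmpeg invocation of the following form can be used (a minimal example; the exact options used in the project are not reproduced here):

Usage:
ffmpeg -i video.mpg frames/frame%04d.jpg

This writes every frame of video.mpg as a numbered .jpg image into the frames directory (which must already exist).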

2.2.2 Frame Dumper – a simple Windows tool for dumping frames as .bmp
images.

It supports all types of video sources as long as they are recognized by
your Windows Media Player. It runs extremely fast and reliably, and is a
helpful tool for video-related research work.
Frame Dumper is a command-line tool, and its usage is very simple:

Usage:
FrameDumper.exe VideoName FromFrameID ToFrameID StepSize [TargetDir]

Parameters:
• VideoName: The source video file name (e.g., video.mpg). Use '\' for path.
• FromFrameID: The starting frame number (indexed from 1).
• ToFrameID: The ending frame number (indexed from 1).
• StepSize: The step size during dumping. 1 for continuous dumping.
• TargetDir: Optional. Specify the dumping directory. Otherwise the current directory is used.

-9-
Chapter 3

Problems Faced By Amateurs

- 10 -
Problems Faced By Amateurs

As discussed in an earlier section, the basic problems that all amateur
users of digital cameras have to deal with are primarily a result of the
camera's auto-focus and auto-exposure features. It is therefore essential to
understand fully what the problems are and why these features tend to create
them.

One of the most noticeable problems affecting amateur videos is the
sudden transition in brightness between subsequent frames. This is especially
jarring when the camera is moved from a poorly illuminated region to a very
bright area or light source, or vice versa. The camera's auto-exposure
algorithm takes some time to adjust, and in the meantime the transition-phase
images can look extremely bright or extremely dark.

3.1 Auto Exposure

Most of the digital cameras available on the market today provide a host of
features, the most common being auto-exposure setting, auto-focus, auto-white
balance, image stabilizers, etc. Exposure, termed very simply, refers to the
amount of light that the camcorder's lens collects; auto exposure is a system that
controls the incoming light to prevent (among other things) over- or under-
exposure. An over-exposed shot looks washed out and overly bright, while an
underexposed shot looks shadowy and dark.

The camcorder's exposure system regulates two things: iris and shutter
speed. The iris diaphragm in the lens controls the amount of light admitted, while
the electronic circuitry referred to as the shutter governs the amount of time the chip
has to respond to the light. By manipulating the lens aperture (and sometimes

- 11 -
the shutter speed), the camcorder does its best to deliver optimum light to the
image sensing chip regardless of lighting conditions.

When you turn on the camera, special circuitry analyzes the amount of
light hitting the chip during the 1/60 second it takes to form an image. If that
amount is greater than the optimum value, which is usually the case, the circuitry
calculates how much to "stop down" (close) the iris diaphragm that sits between
the sensor and the light source. Then it sends the appropriate command to the
circuits controlling the servo motor of the diaphragm. The motor closes the
diaphragm down, light transmission falls to ideal, and the CCD forms a perfectly
exposed image.

3.1.1 The problems with auto exposure

Although the auto-exposure facility in video cameras saves the amateur
photographer a lot of hassle and trouble, it has some obvious disadvantages.
The two main problems associated with it are slow response, and a lack of
intelligence in the algorithms to interpret the image that the light forms,
let alone determine which part of that image to expose properly.

It's slow because exposure adjustments require electro-mechanical


operations, which take time. As a result, the system often can't react fast enough
to changing light to maintain a steady exposure. This problem is especially
evident when we try panning onto someone who walks out from a relatively dark
area into a brightly lit region. The second major problem arises mainly because
of the finite contrast ratio that most camcorders will support. "Contrast" is the
difference in brightness between the lightest and darkest parts of the image. It's
expressed as a ratio, such as four to one, meaning that the brightest point on the
image is four times as light as the darkest. Within the system's contrast range,
details in the light areas ("highlights") and dark areas ("shadows") are distinct and

- 12 -
readable. Above and below that range, highlights "burn out", becoming blobs of
pure white, and shadows "block up", dropping to solid black. Four-to-one is about
as wide a contrast ratio as a video system can maintain. The problem is that we
are generally faced with much higher contrast ratios in the real world, so the
camcorder properly records those parts of the scene that fall within its contrast
range and lets the others burn out or block up. Many algorithms are used,
ranging from simplistic average-brightness methods to sophisticated weighted
averaging that exposes the center portion better. The auto-exposure modes
attempt to make whatever is being metered 18% gray (middle gray). However,
even the best of them suffer from problems, which are especially evident in
videos panned across a wide brightness range.

3.2 Autofocus

Autofocus is a feature of some optical systems that allows them to obtain


(and in some systems to also continuously maintain) correct focus on a subject,
instead of requiring the operator to adjust focus manually. It is that great time
saver that is found in one form or another on most cameras today. In most cases,
it helps improve the quality of the pictures we take.

Autofocus (AF) really could be called power-focus, as it often uses a


computer to run a miniature motor that focuses the lens for you. Focusing is the
moving of the lens in and out until the sharpest possible image of the subject is
projected onto the film. Depending on the distance of the subject from the
camera, the lens has to be a certain distance from the film to form a clear image.
In most modern cameras, autofocus is one of a suite of automatic features that
work together to make picture-taking as easy as possible. These features
include:
• Automatic film advance
• Automatic flash

- 13 -
• Automatic exposure

There are two types of autofocus systems: active and passive. Some
cameras may have a combination of both types, depending on the price of the
camera.

3.2.1 The problems with autofocus

The two main causes of blurred pictures taken via autofocus video cameras are:

• Mistakenly focusing on the background
• Moving the camera while recording the images

The human eye has a rather fast autofocus. For example, hold your hand up near
your face and focus on it, then quickly look at something in the distance: the
distant object appears clear almost immediately. Cameras, however, are not
nearly this quick or this precise.

Autofocus in a video camera is a passive system that also uses the central
portion of the image. Though very convenient for fast shooting, autofocus has
some problems:

• It can be slow to respond.
• It may search back and forth, vainly seeking a subject to focus on.
• It has trouble in low light levels.
• It focuses incorrectly when the intended subject is not in the center of the
image.
• It changes focus when something passes between the subject and the lens.

- 14 -
Chapter 4

Post-processing Tools

- 15 -
Post-processing Tools

Now that we have a fair idea of the problems we are dealing with, let
us look closely at the current market scenario and the post-processing tools
available. Attention is drawn to the cost of each of the tools listed below.

4.1 Apple’s FCP

Final Cut Pro is a professional non-linear editing system developed by


Apple Inc. The program has the ability to edit many digital formats. The system is
only available for Mac OS X version 10.4.9 or later, and is a module of the Final
Cut Studio product. Final Cut Pro has found acceptance among professionals
and a number of broadcast facilities because of its cost effective efficiency as an
off-line editor as much as a digital on-line editor. Final Cut Pro is also very
popular with independent and semi-professional film-makers.

Features
• Broad format support
• Incredible real-time effects
• Comprehensive editing tools
• Expanded power as the hub of Final Cut Studio
• Open, extensible architecture
• Final Cut Pro 6 now also supports mixed video formats (both resolution
and frame rate) in the timeline with real-time support.
• Cost USD $ 1299.

4.2 Avid media composer

Media Composer is a non-linear editing system, released in 1989 on the
Macintosh II as an offline editing system. Its features have since grown to
allow film editing, uncompressed SD video, and high-definition editing and

- 16 -
finishing. Media Composer is Avid's primary editing software solution. The
current version of Media Composer (MCSoft) has the following important features:
• Animatte
• 3D Warp
• Paint
• Live Matte Key
• Tracker
• Timewarps with motion estimation (FluidMotion)
• SpectraMatte (high quality chroma keyer)
• Cost USD $ 1822

4.3 Autodesk Smoke

Autodesk Smoke covers projects such as films, commercials, long-form
episodics, bumpers, and promos on a single system. Autodesk Smoke systems
software is the solution for online editing and creative finishing, providing
real-time, mixed-resolution interactivity for all types of creative editorial
work and client-supervised sessions.

4.4 Autodesk Fire

Autodesk Fire is the industry benchmark for non-compressed, non-linear
editing and finishing systems. Autodesk Fire systems software is used for
creating visual effects. It provides interactive manipulation of high-resolution
film in an advanced 3D DVE and compositing environment.

While these tools listed above have tremendous capabilities, they are found
severely wanting in the following aspects:
• The software itself is extremely expensive, with costs running into a few
lakh INR, making it inaccessible to the populace. In fact one rarely finds

- 17 -
users for these softwares outside of big video editing studios or
workshops.
• Sufficient time and effort needs to be invested in learning and operating
these tools. In effect, to make economic sense and to do justice to the video
itself, professionals who can take advantage of the full array of tools
available in them are needed.
• The process of analysing every frame for colour and brightness itself is
rather tedious and takes many days for a few seconds of footage.
• The most important factor, besides the cost, is that the process is prone
to the limitations of human perception. For example, even a professional
judging two or three different frames for, say, brightness might introduce a
small mismatch in the settings. If these errors are allowed to accumulate,
the whole process may need to be redone after viewing the resultant video.

Therefore, our approach, which is discussed next, involves basic correction
of the video's brightness and an improvement in its aesthetic appeal. An
attempt has been made to ensure that the end user has little to do in terms of
the actual correction itself, which requires sophisticated automation of the
whole process.

- 18 -
Chapter 5

Our Approach

- 19 -
Our Approach

We have, up to this point, discussed the problems that arise out of using
Digicams and why it is difficult to rely on the video editing software already
available in the market to help with the editing. We discuss from here on a
different method to automate and improve the overall video quality, relying on
the familiar procedure of post-processing.

Post-processing works on the raw video file obtained from the
Digicam. Usually, the areas that need substantial improvement arise due to
motion of the camera, as the user moves it to focus on different views. As a
result, these parts of the video (or more precisely, the set of frames involved
in the transition) appear very unappealing when viewed in raw form: a
particular set of frames will appear extremely dark or extremely bright,
depending on the kind of transition.

At the post-processing stage, we can use 'information' present in earlier or
later frames to correct the ones that are bad (usually those belonging to the
time frame of the transition itself). This is discussed below.

More often than not, when there is a drastic change in the brightness as
discussed above, it is either due to the fast motion of the camera or due to the
Digicam’s inability to adjust its focus or brightness fast enough. However, as the
same scene or view is kept in focus for a moderate amount of time after the
transition, the camera eventually adjusts both its exposure and focus to obtain
correct images. We can think of these good frames as those containing a high
amount of information. The transitory frames then are lacking in this same
information.

The method sought to be implemented here involves using the information


from good frames to improve and correct the bad frames. Think of a rather

- 20 -
simple example of a camera initially filming an open street in sunlight. The
camera is now moved to focus on an object located inside the apartment. The
apartment lighting is poorer than the bright sunlit exterior and when the camera
rests upon the object inside, it takes a finite amount of time to adjust to its
ambient luminosity.
Note that in this transition period, there will be a number of frames that
are of poor quality. In this case, the transitory frames will appear dark as the
exposure of the camera is initially set for the street’s luminosity and thus the
camera lens’s aperture would be open only very slightly.
For human eyes there is also this same transition from brightness to relative
darkness, except that our eyes are able to adjust much faster to changing
luminosity and focal distances. Digicams that are unable to do this produce
videos that look unaesthetic and therefore unprofessional.

Frame 5 Frame 20

Figure 5.1: Frame 5 containing good information and frame 20 with poor information.

Note that in the example above, the object inside the apartment that the
camera focuses upon is present both in the bad frames as well as the good ones
with the information. Thus, what is proposed is to find the exact motion between
these frames and restore or fill in information, on a pixel-by-pixel basis. This
however does not mean simply replacing good frames into bad ones. In fact, the
algorithm followed is a lot more complex and described in the sections below.

- 21 -
Chapter 6

Block Diagram

- 22 -
Block Diagram

[Block diagram: the input video file is split into frames (images) using
FFmpeg; the extracted frames go through image processing, which performs
motion estimation and quantifies the exposure and focus of each frame; the
frames are then corrected; the audio, stored separately, is recombined when
the corrected frames are compiled back into a video, giving the enhanced
video output.]

Figure 6.1: Block diagram

- 23 -
The block diagram shown above systematically details the procedure
followed in order to try to automate the process. The MPEG file from the user’s
camera forms the input video file that is split or ripped up into frames using
ffmpeg, which has been discussed before. The individual frames in the .jpg
format are now processed. The correction of these images requires that the
motion between these frames is identified and used. Similarly the focus and
exposure values of these frames are also calculated.

The final stage, post the correction, simply involves recompiling the
frames into a video. This again is done by means of the ffmpeg software. So as
to maintain the sound integrity of the original video, the sound is originally stored
separately and then used again when the final video is being compiled.

The resultant video is of an improved quality, with the few sections that
appeared bad being corrected.

The following pages detail the entire procedure in greater depth.

- 24 -
Chapter 7

Motion Estimation

- 25 -
Motion Estimation

It should by now be obvious why the exact motion between the frames
obtained using ffmpeg needs to be calculated, so as to deal with the various
problems, such as focus and exposure, faced by amateur video makers. As we
are working on Ubuntu, which supports the open-source software Octave
(similar to MATLAB), a matrix-based approach to motion detection was the
obvious choice.

After going through the literature on motion detection, we first tried to
implement motion detection using the built-in correlation function.

7.1 What is Correlation


In simple terms, the correlation of two signals is a measure of the similarity
between them. Mathematically, for two images $A$ and $B$ viewed as vectors, the
correlation and the correlation coefficient are calculated as

$$\mathrm{corr}(A,B) = \sum_{i} A_i B_i, \qquad r = \frac{\sum_{i} A_i B_i}{\|A\|\,\|B\|}$$

Considering two images as n-dimensional vectors (where n is the number of
pixels in each), the concepts of basic geometry can be applied to obtain an
intuitive explanation of correlation. For example, consider two images of size
$N \times N$, which can be represented as $N^2$-dimensional vectors. The degree
of similarity between these two vectors is the projection of one vector on the
other, which can alternatively be viewed as the dot product (scalar product) of
the two vectors. The correlation coefficient is found by dividing the
correlation by the vector lengths, in order to obtain a normalized measure of
the similarity between the two vectors.

- 26 -
7.2 Algorithm for motion detection

Consider two consecutive images from the video. As the frame rate is
generally 24 images per second, the motion between the two images is restricted
to a few pixels. Considering this, the algorithm crops the second image by a few
pixels on each side. The correlation is found by shifting image 2 over image 1,
with the constraint that the correlation always be valid, i.e. there is complete
overlap. This results in less overhead and comparatively decent results, since
the motion is restricted to about 5 pixels, except for extreme cases of really
fast motion. The maximum value of the ratio "cross-correlation / auto-correlation"
determines the movement of the frames with respect to one another. The position
at which this ratio is maximum indicates the 'best match' overlap, which in turn
yields the x and y co-ordinates of the camera motion.

Figure 7.1: Illustration for the idea of correlation

With these basic results, we strove to enhance the algorithm, and after
considerable trial-and-error, we decided to implement the same algorithm, this
time using the filter function instead of the lengthy procedure of shifting the
smaller image over the larger one. The filter function readily allows us to
calculate the correlation of the images, and does so only over the overlap area
when we specify the attribute of ‘valid’.

- 27 -
The filter function is defined to perform the following operation:

Y = filter2(h, X, shape)
It returns the part of Y specified by the shape parameter.
'shape' is a string with one of these values:
• 'full' – returns the full two-dimensional correlation. In this case, Y is larger
than X.
• 'same' (default) – returns the central part of the correlation. In this case, Y is
the same size as X.
• 'valid' – returns only those parts of the correlation that are computed
without zero-padded edges. In this case, Y is smaller than X.
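To make this concrete, the following is a minimal Octave sketch of the correlation-based motion estimate described above. The 5-pixel crop margin, the grayscale conversion and the exact normalisation are illustrative assumptions and need not match the project code exactly:

function [dy, dx] = estimate_motion(frame1, frame2, m)
  if nargin < 3, m = 5; end                 % assumed maximum motion in pixels
  f1 = mean(double(frame1), 3);             % simple grayscale conversion
  f2 = mean(double(frame2), 3);
  tmpl = f2(1+m:end-m, 1+m:end-m);          % second frame cropped on all sides
  % 'valid' restricts the correlation to shifts with complete overlap
  xc = filter2(tmpl, f1, 'valid');                   % cross-correlation
  en = filter2(ones(size(tmpl)), f1.^2, 'valid');    % local energy of frame 1
  [mx, idx] = max(xc(:) ./ en(:));          % best-match ratio
  [r, c] = ind2sub(size(xc), idx);
  dy = r - (m + 1);                         % vertical shift of frame 2 w.r.t. frame 1
  dx = c - (m + 1);                         % horizontal shift
end

For two consecutive frames extracted earlier, a call such as
[dy, dx] = estimate_motion(imread('frame0005.jpg'), imread('frame0006.jpg'))
then returns the shift in pixels (the frame file names here are only placeholders).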

- 28 -
7.3 Test Results Of Motion Algorithm

To demonstrate the use of the motion detection algorithm, these are a few
test images which can simulate a camera moving over a fixed background
towards the bottom right direction.

Figure 7.2: Sequences of images showing the apparent motion between them.

The next logical step was to implement this algorithm in order to prepare an
entire database of the motion for a complete sequence of frames. The database
so formed would be useful while evaluating and enhancing further performance
specifications of the video such as focus, exposure, jitter, etc. The graph below
shows the plot of such a database, for the test images above. As can be readily

- 29 -
seen from the images, the camera pans across the images towards the
downward right direction, which is very evident from the motion graph.

Figure 7.3: Plot of calculated motion, using the motion estimation algorithm.

- 30 -
7.4 Problems Faced And Their Solution

The main problem that was faced during motion detection was due to images
like the ones shown below. Most of the image is dark, with a grayscale value of
0. This results in highly inaccurate correlation coefficient results. To overcome
this problem, we followed an approach of using only “good regions” of the image
to estimate the motion.

Figure 7.4: Frame with very poor information content.

The images were divided into a number of blocks, say 16 (4 × 4). For each
of these blocks, certain parameters are calculated, on the basis of which the
quality of the block is defined. The best block out of these is used, depending on
the quality factor that is assigned to each of them. Motion is estimated only for
this particular block, and the same motion is assumed for the entire frame.

- 31 -
Figure 7.5: Good block obtained from the entire frame.

Parameters used to define the quality factor:


• The mean of the block should be in the linear range (approximately
grayscale values of 50-200).
• The variance of the block should not be too high, implying that most of its
pixels are in the linear range
• These parameters are considered for the r, g and b components
separately.

This approach yields the best of the 16 blocks, to be used to find the
motion between the two particular frames under consideration. This algorithm
was applied to every pair of frames before using correlation to estimate the
motion. The results obtained were very consistent with the visual perception
of motion.
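A minimal Octave sketch of this block-selection step is given below. The 4 × 4 grid, the 50–200 linear range and the variance weighting follow the description above, but the exact scoring used in the project may differ:

function [bi, bj] = best_block(frame, nblocks)
  if nargin < 2, nblocks = 4; end
  frame = double(frame);
  [H, W, C] = size(frame);
  bh = floor(H / nblocks);  bw = floor(W / nblocks);
  bestq = -Inf;  bi = 1;  bj = 1;
  for i = 1:nblocks
    for j = 1:nblocks
      blk = frame((i-1)*bh+1 : i*bh, (j-1)*bw+1 : j*bw, :);
      q = 0;
      for c = 1:C                           % score the r, g and b planes separately
        plane = blk(:,:,c);
        m = mean(plane(:));  v = var(plane(:));
        inrange = (m > 50) && (m < 200);    % mean should lie in the linear range
        q = q + inrange / (1 + v / 1000);   % assumed weighting: penalise high variance
      end
      if q > bestq, bestq = q; bi = i; bj = j; end
    end
  end
end

The returned indices identify the block whose pixels are then used for the correlation-based motion estimate.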

- 32 -
Chapter 8

Large Space Matrix

- 33 -
Large Space Matrix – Obtained From Motion Estimation

At this juncture, it is now clear how to estimate accurately the exact motion
between two consecutive frames. This motion, in the form of x-y co-ordinates can
be obtained now for the entire series of frames that require correction with a few
of those frames containing good information.

The challenge herein is to utilize this motion tool mathematically and


computationally so that a ‘database’ can be created which consists of these
frames with their motion accounted for. The exact nature of this database is
discussed now.

The entire spatial region over which the camera has panned in the course of its
motion can be termed the 'global space' as far as the camera is concerned. A
640 x 480 frame, a part of this global space, is captured by the camera every
1/30th of a second (this is decided by the frame rate of the camera). We can
understand this by analogy with how a panoramic picture is shot: a set of
photos is taken one after the other, and the result is a larger photo containing
information from all the pictures after they have been stitched together
(obviously, after compensating for motion).

Similarly, our large space is defined as frames placed so that each one
corresponds to one plane in a multi-dimensional space. The location of each
frame in this large space is dictated by its motion from the previous frame.
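As a sketch (variable names, anchoring of the first frame at the origin, and the use of NaN to mark 'pixel not present' are assumptions made for illustration), the per-frame offsets from the motion-estimation step can be accumulated into such a structure as follows:

offsets = [0 0; motion_list];          % motion_list: (N-1) x 2 offsets between frames
pos = cumsum(offsets, 1);              % global position of each frame's top-left corner
pos = pos - min(pos, [], 1) + 1;       % shift so that all indices are positive
[H, W] = size(frames{1});  N = numel(frames);
bigH = max(pos(:,1)) + H - 1;  bigW = max(pos(:,2)) + W - 1;
space = nan(bigH, bigW, N);            % one plane per frame; NaN marks 'no pixel here'
for k = 1:N
  space(pos(k,1):pos(k,1)+H-1, pos(k,2):pos(k,2)+W-1, k) = frames{k};
end

Here frames{k} are the (grayscale) frames, and the depth of a point mentioned below is simply the number of non-NaN entries along the third dimension at that location.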

Shown below are some results to better understand the formation of this ‘matrix’:

- 34 -
Figure 8.1: Large space matrix with each frame as one dimension.

Figure 8.2: Combined frames after motion compensation.

When compensated for by the motion between the frames, the matrix formed is shown in
the figure above. Mathematically, each of these pixel values will be a number. The
motion above is purely horizontal. However, motion can exist in both directions.
- 35 -
Thus, the exact x-y motion co-ordinates for the entire series of frames to be
corrected are used to form a database of these values. The database formed is
now in a format that is easily usable and is used in the next stage of brightness
correction.

The following information can now be obtained very easily from the database or
‘large space’ matrix:
• The location of each point in global space, with respect to local camera
frame co-ordinates.
• Information regarding when a pixel was introduced into the database.
• Information regarding the depth of each point, i.e., the number of frames
for which a point stayed in the database.

Clearly, we are now able to determine how long (in terms of the number of
frames) any single point persists in the video. This is what has been referred
to earlier as the depth of the point. Further manipulation of this large
matrix is carried out in the exposure correction stage discussed next.

- 36 -
Chapter 9

Exposure Correction

- 37 -
Exposure Correction

The auto-exposure algorithm implemented in the digital camera takes
some finite time to adjust the camera aperture in accordance with the changing
brightness of the surroundings. This leads to loss of data in the intermediate
frames when the camera moves from a bright area to a relatively darker one, or
vice versa. In other words, the brightness of the intermediate frames decreases
or increases, making the video unpleasant to our eyes, which adapt to brightness
changes much more quickly.

9.1 A quantitative measure of the exposure in an image

From the earlier concept of an image represented as an N-dimensional
vector, we can intuitively understand that the energy contained in a signal is
equal to the summation of the squares of its discrete values over the entire
range. For an image, this yields the energy of the image, which can be
considered a measure of its brightness. The camera exposure will be highly
correlated with the total brightness observed in the image.

9.2 Plotting of the brightness curve

With the above definition of the brightness of an image, we implemented a
simple algorithm to plot the brightness of each image over a series of images
in a video sequence. As expected, this curve shows drastic variations for
videos where the camera is panned from dimly lit to brightly lit areas in a
short time. The curve enables us to zero in on the portion of the video which
needs brightness correction.
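A minimal Octave sketch of this measurement is shown below; the frame file naming is an assumption, while the brightness measure (sum of squared pixel values) follows the definition above:

files = glob('frames/frame*.jpg');            % assumed frame naming
energy = zeros(numel(files), 1);
for k = 1:numel(files)
  img = double(imread(files{k}));
  energy(k) = sum(img(:).^2);                 % image energy = sum of squared pixel values
end
plot(energy);
xlabel('frame number');  ylabel('image energy (brightness measure)');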

- 38 -
The typically observed energy pattern of the frames can be shown as:

Figure 9.1: Energy pattern of the frames.

We propose to make this transition as smooth as possible, as illustrated
below (the figure shows the old brightness curve and the corrected portion of
the curve).

Figure 9.2: Correction needed of the energy pattern of the frames.

- 39 -
9.3 Approach 1: The transfer curve method

• Treat the frames as a whole.
• Increase or decrease the brightness of each frame to make the brightness
curve as smooth as possible.

The simplest ways we attempted to change the brightness of an image were
scaling the total brightness, implementation using a transfer curve, etc. The
problems with these algorithms were quickly evident: scaling resulted in loss
of information, as shown below, and though the transfer curve method showed
better results, there were many problems associated with it too.

We tried to implement transfer curve functions such as the one shown below,
to boost the dark regions of the image and reduce the intensity in the bright
regions. However, these too resulted in a loss of information, especially since
cameras clip all values above a certain threshold to 255 (the maximum
brightness value).
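One simple example of such a curve is a gamma-style mapping. The sketch below is an illustration only: the exponent value and the frame file name are assumptions, and the actual curve used in the project may have had a different shape:

gam = 0.6;                                    % assumed exponent; values < 1 boost dark regions
curve = 255 * ((0:255) / 255) .^ gam;         % lookup table for 8-bit pixel values
img   = double(imread('frame0012.jpg'));      % hypothetical frame
out   = uint8(curve(round(img) + 1));         % apply the transfer curve to every pixel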

Due to this limitation of its range, the transfer curve method tends to
reduce the intensity of portions where it should not really be scaled at all,
e.g. the tube light region in the figures below.

- 40 -
[Figure: output pixel value (O/P) plotted against input pixel value (I/P), 0–255.]

Figure 9.3: Results of transfer curve method.

This approach assumes that the brightness curve followed by each and every
pixel of a frame is the same. However, this is not true. The value (brightness)
of every pixel changes individually, and hence each pixel should be treated
individually.

- 41 -
9.4 Approach 2
• Treat every pixel separately
• Smoothen the brightness curve of every individual pixel

Frame 1 Frame 3 Frame 5

Frame 7 Frame 9 Frame 11

Figure 9.4: Frames showing the variations of the pixel intensity in time
before the auto exposure of camera settles.

From the above images, it can be seen that each pixel needs to
be treated individually for correction. The central tube light region in the
above frames remains at the same grayscale value of 255 (saturated) right from
frame 1 till frame 11. However, the peripheral region starts out at a
saturated value of 255 and ends in frame 11 at a much lower value of around
180. Thus, if we plot the brightness curves for these two regions, they will
be quite different from each other, and consequently the correction to be
implemented has to be different for these pixels.

The reason this happens is that the actual brightness of the tube
light is much higher than that of the peripheral region; however, the camera's
limitations mean it cannot capture this difference at a high exposure setting. Thus

- 42 -
both the regions appear saturated until the auto-exposure of the camera settles
to the correct value for the frame.

Pixel 1

Pixel 5

Figure 9.5: Independent correction curves of different pixels.

Thus, the constraints on the algorithm used to obtain the new exposure values are:

• The difference between the old exposure values and the new exposure values
should be minimal. This constraint makes sure that there is no drastic
difference between the changed value and the old value. Without it, the
optimisation would simply return a straight line, which is the smoothest
possible curve in the time domain, but which would bear no resemblance to the
original images, yielding frames that look either too bright or too dark and
also introducing colour mismatch problems.
• The difference between two consecutive new exposure values should be
minimal. This constraint is responsible for the smoothing of the brightness
curve.

- 43 -
9.5 Approach 3

Testing of this approach led us to observe the problem of ‘Banding’ that we


hadn’t foreseen.

[Figure: correction curves of pixel 1 and pixel 20 plotted against global time, and the corresponding corrected frame.]

Figure 9.6: Corrected image showing banding effect.

The algorithm starts correction only when a new pixel is introduced into the
global matrix. In the above example, pixel 1 is introduced in the first frame
itself, whereas pixel 20 is not introduced till the third frame. Since this
algorithm does not put any constraint on neighbouring pixels, the correction
for a pixel and its neighbour can be different. Every time a series of pixels
is introduced, their correction will start from that frame. It may happen that
there is a considerable difference in the correction of neighbouring pixels if
one of them is introduced later in global time. This results in a banding
structure, as can be observed in the corrected frame shown above.

Thus, an additional constraint is placed on the algorithm to obtain the new exposure values:
• The new exposure values should be minimally deviant in space.

- 44 -
9.6 Algorithm: To Calculate the New Exposure Curve for Each New Pixel

As explained earlier, every new value for exposure is subjected to three


constraints:
• The difference between the old exposure values and the new exposure values
should be minimal.
• The difference between two consecutive new exposure values should be
minimal.
• The new exposure values should be minimally deviant in space.

Now, we have to solve a system

$$A X = B$$

where:
• X: the new exposure value for every pixel (the unknown, obtained by applying
the pseudo-inverse). Order of X: n × 1, where n is the total number of pixels
in the 3D matrix.
• A: the constraint matrix. Order of A: (n × number of constraints) × n.
• B: a column matrix holding the ideal (target) values of the constraints.
Order of B: (n × number of constraints) × 1.

As we can see, there are several constraints on each exposure value; that is,
the number of equations is greater than the number of variables. This results
in an over-determined system, which can be solved in the least-squares sense
using the pseudo-inverse. In practice, the equation is solved using Octave's
built-in matrix left division.
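As a simplified sketch (for a single pixel track only, with the spatial constraint omitted and an assumed relative weighting between the two remaining constraints), the system can be assembled and solved as follows:

eold = pixel_track(:);                 % old exposure values of one pixel over n frames
n = numel(eold);
A1 = speye(n);                         % constraint 1: stay close to the old values
B1 = eold;
% constraint 2: consecutive new values should be close (temporal smoothness)
A2 = sparse([1:n-1, 1:n-1], [1:n-1, 2:n], [ones(1,n-1), -ones(1,n-1)], n-1, n);
B2 = zeros(n-1, 1);
w = 5;                                 % assumed weight of the smoothness constraint
A = [A1; w * A2];
B = [B1; w * B2];
enew = A \ B;                          % least-squares solution via matrix left division

In the full algorithm, the same recipe is extended with one spatial-neighbour equation per pair of adjacent pixels and solved for all pixel tracks simultaneously.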

- 45 -
9.7 Backslash or matrix left division

If A is a square matrix, A\B is roughly the same as inv(A)*B, except that it
is computed in a different way. If A is an n-by-n matrix and B is a column
vector with n components, or a matrix with several such columns, then X = A\B
is the solution to the equation AX = B computed by Gaussian elimination.

A warning message is displayed if A is badly scaled or nearly singular. If A
is an m-by-n matrix with m ~= n and B is a column vector with m components, or
a matrix with several such columns, then X = A\B is the solution in the
least-squares sense to the under- or over-determined system of equations
AX = B. The effective rank, k, of A is determined from the QR decomposition
with pivoting. A solution X is computed that has at most k nonzero components
per column. If k < n, this is usually not the same solution as pinv(A)*B,
which is the least-squares solution with the smallest norm.

- 46 -
9.8 Pseudo Inverse

The inverse $A^{-1}$ of a matrix $A$ exists only if $A$ is square and has full
rank. In this case, $AX = B$ has the solution $X = A^{-1}B$.

A pseudoinverse is a matrix inverse-like object that may be defined for a


complex matrix, even if it is not necessarily square. For any given complex
matrix, it is possible to define many possible pseudoinverses.

If $A$ has full rank ($n$), we define

$$A^{+} = (A^{T} A)^{-1} A^{T}$$

and the solution of $AX = B$ is $X = A^{+} B$.

More accurately, the above is called the Moore-Penrose pseudoinverse.

Calculation
The best way to compute $A^{+}$ is to use the singular value decomposition. With

$$A = U S V^{T},$$

where $U$ ($m \times m$) and $V$ ($n \times n$) are orthogonal and $S$ ($m \times n$)
is diagonal with real, non-negative singular values, we find

$$A^{+} = V (S^{T} S)^{-1} S^{T} U^{T}.$$

If the rank $r$ of $A$ is smaller than $n$, the inverse of $S^{T} S$ does not
exist, and one uses only the first $r$ singular values; $S^{T} S$ then becomes an
$r \times r$ matrix and $U$, $V$ shrink accordingly.
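In Octave this computation is available directly as pinv(A). A sketch of the same calculation written out via the SVD (the tolerance used to decide the effective rank is an assumed, conventional choice) is:

[U, S, V] = svd(A);
s = diag(S);                           % singular values
tol = max(size(A)) * eps(max(s));      % assumed tolerance for the effective rank
r = sum(s > tol);
Aplus = V(:, 1:r) * diag(1 ./ s(1:r)) * U(:, 1:r)';   % Moore-Penrose pseudoinverse
X = Aplus * B;                         % equivalent to pinv(A) * B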

- 47 -
9.9 Implementation of the algorithm

Here we form the constraint matrix A and the ideal-value matrix B.

1. Find a new pixel, determine for how many frames it exists, and obtain its old
exposure values.

2. In order to implement the first constraint:

• $enew_{i,j,n} = eold_{i,j,n}$
• Convert all the pixel values of the 3D matrix (obtained from the motion
detection algorithm) into a column vector. This column vector forms the first
[n × 1] part of the matrix B.
• An [n × n] identity matrix forms the first part of the matrix A, thus
implementing the above equation for every pixel in the 3D matrix. For a 2 × 2
frame at time 0:

$$\begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix}
\begin{bmatrix} enew_{1,1,0} \\ enew_{1,2,0} \\ enew_{2,1,0} \\ enew_{2,2,0} \end{bmatrix} =
\begin{bmatrix} eold_{1,1,0} \\ eold_{1,2,0} \\ eold_{2,1,0} \\ eold_{2,2,0} \end{bmatrix}$$

3. In order to implement the second constraint:

• $enew_{i,j,n} = enew_{i,j,n-1}$
• The part of matrix B corresponding to the second constraint is a column
vector of zeroes.
• Matrix A is formed so as to implement the above equation for every possible
i, j. Each row of this part of matrix A contains a 1 and a -1 at the positions
corresponding to $enew_{i,j,n}$ and $enew_{i,j,n-1}$, implementing the equation
$enew_{i,j,n} - enew_{i,j,n-1} = 0$.

- 48 -
For a single pixel tracked over four frames:

$$\begin{bmatrix} 1 & -1 & 0 & 0 \\ 0 & 1 & -1 & 0 \\ 0 & 0 & 1 & -1 \end{bmatrix}
\begin{bmatrix} enew_{1,1,0} \\ enew_{1,1,1} \\ enew_{1,1,2} \\ enew_{1,1,3} \end{bmatrix} =
\begin{bmatrix} 0 \\ 0 \\ 0 \end{bmatrix}$$

4. In order to implement the third constraint:

• $enew_{i,j,n} - enew_{i+1,j,n} = 0$
• $enew_{i,j,n} - enew_{i,j+1,n} = 0$
• Matrix B is again a column of zeroes, equal in number to the equations being
implemented.
• Every row of matrix A contains a 1 and a -1 at the locations corresponding to
either $enew_{i,j,n}$ and $enew_{i+1,j,n}$, or $enew_{i,j,n}$ and $enew_{i,j+1,n}$,
depending on the equation being implemented, for all possible values of i, j
and n. For the 2 × 2 frame tracked over four frames, grouping the four values
of each pixel into a block and writing $I_4$ for the 4 × 4 identity matrix,
this part of the system is:

$$\begin{bmatrix} I_4 & -I_4 & 0 & 0 \\ 0 & 0 & I_4 & -I_4 \\ I_4 & 0 & -I_4 & 0 \\ 0 & I_4 & 0 & -I_4 \end{bmatrix}
\begin{bmatrix} enew_{1,1,0..3} \\ enew_{1,2,0..3} \\ enew_{2,1,0..3} \\ enew_{2,2,0..3} \end{bmatrix} =
\begin{bmatrix} 0 \\ 0 \\ 0 \\ 0 \end{bmatrix}$$
5. Once we have formed matrix A and B, X can be obtained by pseudo-inverse.

- 49 -
Chapter 10

Focus

- 50 -
Focus

10.1 Quantifying focus: How to quantify the total focus quality of an image

In our effort to understand how the focus for a video varies as images are
taken in rapid succession, we must be able to somehow quantify the focus of an
image. Only then will it be possible to decide if the image is in good focus or not.
This is a rather difficult problem as the focus quality of an image is not a
physically measurable quantity.

What comes to our rescue is the fact that high focus images are images
that have a very clearly defined set of boundaries. That is, in good focus images,
there is a very fine distinction in the details of the objects in the image. This
points to high frequency components at the edges. Now, a measure of these high
frequency components can give you a very decent estimate of the image’s focus
quality.
We have basically worked on two methods for focus quantification (i.e. frequency
analysis):
• The Fast Fourier Transform (FFT) for 2D images
• The Discrete Cosine Transform (DCT) for 2D images
Let us look at both these transforms a little more closely.
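As a preview of what such a measure might look like in Octave, the sketch below computes a focus factor as the fraction of energy in the high-frequency coefficients. The use of the DCT, the cut-off defining 'high frequency' and the frame name are assumptions (dct2 is provided by the Octave-Forge packages):

img = mean(double(imread('frame0012.jpg')), 3);   % hypothetical frame, converted to grayscale
C = dct2(img);                                    % 2-D DCT of the whole image
[H, W] = size(C);
[v, u] = ndgrid(0:H-1, 0:W-1);                    % frequency indices of each coefficient
highband = (u + v) > 0.25 * (H + W);              % assumed high-frequency region
focus_factor = sum(C(highband).^2) / sum(C(:).^2);  % fraction of energy at high frequencies

A sharply focused image is expected to give a larger focus factor than a blurred one.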

10.2 Methods to measure the frequency components: DCT and FFT

10.2.1 FFT

A fast Fourier transform (FFT) is an efficient algorithm to compute the


discrete Fourier transform (DFT) and its inverse. FFTs are of great importance to
a wide variety of applications, from digital signal processing and solving partial
differential equations to algorithms for quick multiplication of large integers.
Let $x_0, \ldots, x_{N-1}$ be complex numbers. The DFT is defined by the formula

$$X_k = \sum_{n=0}^{N-1} x_n \, e^{-\frac{2\pi i}{N} k n}, \qquad k = 0, \ldots, N-1.$$
- 51 -
Evaluating these sums directly would take $O(N^2)$ arithmetical operations.
An FFT is an algorithm to compute the same result in only $O(N \log N)$ operations.
In general, such algorithms depend upon the factorization of N, but (contrary to
popular misconception) there are FFTs with $O(N \log N)$ complexity for all N, even
for prime N.

Many FFT algorithms depend only on the fact that $e^{-2\pi i/N}$ is a primitive
N-th root of unity, and thus can be applied to analogous transforms over any
finite field, such as number-theoretic transforms.

Since the inverse DFT is the same as the DFT, but with the opposite sign
in the exponent and a 1/N factor, any FFT algorithm can easily be adapted for it
as well.

This demonstration shows the FFT of a real image and its basis functions:

Note that in the images below, u* and v* are the coordinates of the pixel
selected with the red cross on F(u,v); the blue cross contributes to the same
frequency.

Figure 10.1: Demonstration of FFT transforms.

10.2.2 DCT

A discrete cosine transform (DCT) is a Fourier-related transform similar to
the discrete Fourier transform (DFT), but using only real numbers. DCTs are
equivalent to DFTs of roughly twice the length, operating on real data with even
symmetry (since the Fourier transform of a real and even function is real and
even), where in some variants the input and/or output data are shifted by half a
sample. There are eight standard DCT variants, of which four are common.

The most common variant of discrete cosine transform is the type-II DCT,
which is often called simply "the DCT"; its inverse, the type-III DCT, is
correspondingly often called simply "the inverse DCT" or "the IDCT".
The DCT, and in particular the DCT-II, is often used in signal and image
processing, especially for lossy data compression, because it has a strong
"energy compaction" property: most of the signal information tends to be
concentrated in a few low-frequency components of the DCT, approaching the
Karhunen-Loève transform (which is optimal in the decorrelation sense) for
signals based on certain limits of Markov processes.

In two dimensions, the DCT-II is computed over N × N blocks.
In the case of JPEG compression, N is typically 8 and the DCT-II formula is
applied to each row and column of the block. The result is an 8 × 8 transform
coefficient array in which the (0,0) element (top-left) is the DC (zero-frequency)
component and entries with increasing vertical and horizontal index values
represent higher vertical and horizontal spatial frequencies.
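To illustrate, a sketch of this JPEG-style 8 × 8 block DCT (assuming MATLAB with the Image Processing Toolbox, or Octave with dct2/idct2 available; the file name is hypothetical) would be:

    img = double(imread('frame.png'));          % hypothetical input frame
    if ndims(img) == 3, img = mean(img, 3); end % crude luminance approximation
    block = img(1:8, 1:8);                      % one 8 x 8 block (top-left)
    C = dct2(block - 128);                      % type-II DCT after JPEG level shift
    C(1, 1)                                     % DC (zero-frequency) coefficient
    recon = idct2(C) + 128;                     % inverse DCT recovers the block
    max(abs(recon(:) - block(:)))               % round-trip error at floating-point level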

Like any Fourier-related transform, discrete cosine transforms (DCTs)
express a function or a signal in terms of a sum of sinusoids with different
frequencies and amplitudes. Like the Discrete Fourier Transform (DFT), a DCT
operates on a function at a finite number of discrete data points. The obvious
distinction between a DCT and a DFT is that the former uses only cosine
functions, while the latter uses both cosines and sines (in the form of complex
exponentials). However, this visible difference is merely a consequence of a
deeper distinction: a DCT implies different boundary conditions than the DFT or
other related transforms. The Fourier-related transforms that operate on a
function over a finite domain, such as the DFT or DCT or a Fourier series, can be
thought of as implicitly defining an extension of that function outside the domain.
That is, once you write a function f(x) as a sum of sinusoids, you can evaluate
that sum at any x, even for x where the original f(x) was not specified. The DFT,

like the Fourier series, implies a periodic extension of the original function. A
DCT, like a cosine transform, implies an even extension of the original function.

The formula for the 2D DCT (type II) is:

    C(u,v) = α(u) α(v) Σ (x = 0 to N−1) Σ (y = 0 to N−1) f(x,y) · cos[ (2x+1)uπ / 2N ] · cos[ (2y+1)vπ / 2N ]

where α(0) = √(1/N) and α(u) = √(2/N) for u > 0.

Four example blocks in the spatial and frequency domains:

Figure 10.2: Image blocks converted to their frequency domain (spatial domain alongside the corresponding frequency domain).

Inverse Discrete Cosine Transform

To rebuild an image in the spatial domain from the frequencies obtained above,
we use the IDCT formula:

    f(x,y) = Σ (u = 0 to N−1) Σ (v = 0 to N−1) α(u) α(v) C(u,v) · cos[ (2x+1)uπ / 2N ] · cos[ (2y+1)vπ / 2N ]

This demonstration shows the DCT of an image:

Figure 10.3: Demonstration of the 2D DCT of an image.

u* and v* are the coordinates of the pixel selected with the red cross on C(u,v).

10.3 Practically applying the above concepts

Carrying out the DCT and FFT operations on images yielded a map of
their frequency components. As expected, there was a larger concentration of
signal strength in the low frequency components; this is an obvious outcome, as
these components represent most of the image's uniform regions. Thus, when
quantifying the focus, we assigned a low weight to these components. Since we
are more interested in the high frequency components, these were assigned a
higher weight. For the 2D FFT, the low frequency components are shifted to the
centre before the weights are assigned; for the DCT, the low frequency
components reside in the upper left corner. The distance of each component
from the origin, i.e. u = v = 0, was used as its weight. A simple weighted
summation then yields the focus factor. To normalize the operation for different
amplitudes and slightly varying images, this focus factor was divided by the sum
of the amplitudes of all the frequency components of the respective transform.
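A minimal sketch of this focus-factor computation using the DCT (our own MATLAB/Octave code; the function name is our own, dct2 is assumed to be available, and the FFT variant differs only in using fft2 with fftshift before weighting) is:

    % focus_factor_dct.m (saved as its own file in MATLAB)
    function f = focus_factor_dct(img)
        img = double(img);
        if ndims(img) == 3, img = mean(img, 3); end   % luminance approximation
        C = abs(dct2(img));                           % magnitude of DCT coefficients
        [V, U] = meshgrid(0:size(C,2)-1, 0:size(C,1)-1);
        W = sqrt(U.^2 + V.^2);                        % distance from u = v = 0
        f = sum(sum(W .* C)) / sum(C(:));             % weighted sum / total amplitude
    end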
Shown below are some sample images used to test the focus-quality estimation
code written in MATLAB.

Figure 10.4: Images with varying focus.

The images shown above are numbered 1 to 6. The first two images are in poor
focus, whereas the 4th and 6th images are of particularly good quality. Both a
2D FFT and a 2D DCT of all the images were computed, first by taking the
luminance component of the RGB images and then performing the required
operations. Shown below is what the DCT and the FFT look like for the last
image. Please note that the absolute value of the 2D DCT or FFT is plotted on a
log scale.

Figure 10.5: FFT of an image.

This is the FFT for the sixth image shown above, shifted to the centre. The
low frequency components are located in the centre. Shown below is the 2D DCT
for the same image. The top-left corner has high-valued low frequency
components, whereas the other corners represent the high frequency
components in their respective directions.

Figure 10.6: DCT of an image.

Figure 10.7: Focus factor using FFT.


Shown above is the output obtained when the 2D FFT operation is performed on
the six images in sequence. Note how the focus factor varies from image to image.

The output for the 2D DCT is as shown below.

Figure 10.8: Focus factor using DCT.

Chapter 11

Actual Correction


Figure 11.1: Old and new exposure curve and original pixel values.

Consider a series of images whose exposure values are given by the red
curve in the diagram above; the new exposure curve obtained is given by the
blue curve.

Now, for a pixel whose values change as 10, 35, 50, 75, … along the global
time axis, the ideal value for that pixel at a particular instant of time will be a
function of the old exposure curve, the new exposure curve, and the future and
past values of that pixel.

At this juncture, we need to decide exactly how much weight should be
given to this global pixel's value in different frames, i.e. along the global time
axis. The implementation of the program becomes substantially simpler if we
express this mathematically as follows:

Let the initial value (actual recorded value) of a pixel be: I
Old exposure value be: E
New exposure value be: Enew
Temporary variable: temp
and resultant (corrected) value be: R

Mathematically, temp can be given as

    temp(t) = (1/N) × Σk [ I(t + k) × E(t + k) / E(t) ]

where k ranges over the depth of that pixel in the 3D global matrix, that is, k takes
all future and past occurrences of that particular pixel, and N is a normalizing
factor:

    N = Σk I(t + k)

The temp value thus obtained corrects the pixel value under the assumption that
the past and future values of the pixel are a function of the old exposure curve
alone and that they are correct.

However, this may not be true. The actual pixel value should be a function of the
new exposure curve as well.

Mathematically,

    R(t) = (1/N) × Σk [ W(t + k) × temp(t) × Enew(t + k) / Enew(t) ]

where k again ranges over the depth of that pixel in the 3D global matrix, that is,
k takes all future and past occurrences of that particular pixel, N is the
normalizing factor

    N = Σk I(t + k)

and W is a weight factor, representing how much trust we should put in the value
obtained at that particular instant.
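A small self-contained sketch of these two formulae for a single pixel track is given below (our own MATLAB/Octave code; the values of I, E, Enew and W are purely illustrative, since in the actual program they come from the 3D global matrix, the exposure curves and the weight calculation of Section 11.1):

    I    = [10 35 50 75];          % recorded values of one global pixel over time
    E    = [1.0 2.0 3.0 4.0];      % old exposure curve (illustrative values)
    Enew = [2.0 2.5 3.0 3.5];      % new exposure curve (illustrative values)
    W    = [0.8 1.0 1.0 0.9];      % trust weights (illustrative values)
    Nfrm = numel(I);
    temp = zeros(1, Nfrm);  R = zeros(1, Nfrm);
    for t = 1:Nfrm
        temp(t) = sum(I .* E ./ E(t)) / sum(I);                 % first formula
    end
    for t = 1:Nfrm
        R(t) = sum(W .* temp(t) .* Enew ./ Enew(t)) / sum(I);   % second formula
    end
    disp(R)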

11.1 Calculation of weights

When we try to change the value of a pixel at a particular instant t, we correct
it using the old and new exposure values of that pixel at future and past instants.
In this process we cannot guarantee that the future and past pixel values will
always be correct, so we have to assign some weight to these calculated values.

The weight assigned is a function of:

1. Value of the pixel:
• In image processing, pixel values that fall in the linear range are supposed
to contain the maximum information. If the pixel value is within the linear
range, more weight is assigned; on the other hand, if it is too low (noise) or
too high (saturated), less weight is assigned.

2. Distance from the pixel:
• If the value of k is small, that is, if the pixel is near the original pixel on the
time axis, it is more likely that the signal values will be similar, so we trust
these pixels more.
• If the value of k is large, that is, if the pixel is far from the original pixel on
the time axis, the signal value is more likely to have changed, so we put less
trust (weight) in these pixel values.

3. Focus quality:
• If the frame containing the pixel is in better focus, more weight is assigned
to that pixel. One possible way of combining these three factors is sketched
after this list.
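Sketched below (our own illustrative MATLAB/Octave code) is one possible way of combining the three factors into a single weight; the threshold values and the exponential fall-off are our own assumptions, not values prescribed by the method:

    % pixel_weight.m (saved as its own file in MATLAB)
    function w = pixel_weight(value, k, focus)
        % 1. Pixel value: favour the linear range, penalise noise and saturation
        w_val   = double(value > 20 && value < 235);
        % 2. Temporal distance: nearer occurrences (small |k|) are trusted more
        w_dist  = exp(-abs(k) / 10);
        % 3. Focus quality: frames in better focus contribute more
        w_focus = focus;              % e.g. the normalised focus factor of Chapter 10
        w = w_val * w_dist * w_focus;
    end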

Thus, we get a corrected value for each pixel.

This process is repeated for every pixel of each and every frame of the video to
give the corrected video.

Chapter 12

Results


Figure 12.1: First Frame Figure 12.2: First Frame After Correction

Figure 12.3: Second Frame Figure 12.4: Second Frame After Correction

Figure 12.5: Third Frame Figure 12.6: Third Frame After Correction
Figure 12.7: Fourth Frame Figure 12.8: Fourth Frame After Correction

Figure 12.9: Final Frame Figure 12.10: Final Frame After Correction

The results for a set of frames are shown above. For the first image in particular,
there is a rather stark improvement in the brightness, while still not jumping
immediately to the level of the last images.

Subsequent images also show an improvement over the originals.

Note, however, that even the corrected images increase in brightness along a
reasonable gradient, so there is no sudden peak in brightness, which would
again look unaesthetic.

To really understand what has transpired here, attention is drawn to the first pair
of original and corrected images. Rather than this simply being a case of
increased brightness, there is a substantial increase in information content as
well: the bag and some details of the clothes now appear clearly right from the
first corrected frame.
In effect, information has been transplanted from later frames, where it is clearly
present, to the earlier frames.

Chapter 13

Conclusion and Future Scope


At this stage, the initial goals laid down have been achieved with good results. In
the results section, the improvement in the frames is clear for all to see. Moreover,
the entire process needed very little intervention from the user, and almost no error
due to human perception was introduced.

What has effectively been achieved here may be explained in a different way.
One may think of the video obtained after post-processing as having been taken
directly by an imaginary camera.
The feature of this camera is that it sets a different shutter time, i.e. it allows a
different exposure value at each and every pixel. This is in stark contrast to a
normal camera, where there is only one exposure setting for the entire frame.
One may imagine a scenario where we pan the camera across a window looking
out onto the bright exterior of a building, with half the window covered by a
curtain with designs on it. Ordinarily, a camera would simply adjust to the bright
light and keep a short shutter time.
This would leave the curtain details in the dark. Imagine, however, that the
camera pans to reveal those details in later frames. The algorithm applied in this
project would be able to rectify even those frames where the sunlit side of the
window appears exceptionally bright.
Hence, we might even be able to see the window brightly lit on one side, and the
curtain in all its glory on the other.

As this was meant to serve as a proof of concept for a larger, more complicated
system, it has served its purpose. The opportunity to improve upon this work,
however, is great, and would probably involve complete automation of the entire
process, with even the process of identifying the bad frames done by the
program.

At the cost of added complexity and computation time, the motion detection
could also be made more accurate by considering factors such as rotation,
foreshortening and zoom. Extending the same idea to colour processing will also
require that the time expended by this algorithm be reduced, as it would involve
additional constraints.

Chapter 14

References


Horn, Berthold K. P. and Schunck, Brian G., "Determining Optical Flow",
Artificial Intelligence Laboratory, Massachusetts Institute of Technology,
Cambridge, MA 02139, U.S.A.

Bruhn, Andrés; Weickert, Joachim; Kohlberger, Timo and Schnörr, Christoph,
"Discontinuity-Preserving Computation of Variational Optic Flow in Real-Time".

Jähne, Bernd, Digital Image Processing.

Gonzalez, Rafael C. (University of Tennessee) and Woods, Richard E.
(MedData Interactive), Digital Image Processing.

Bovik, Al, Handbook of Image and Video Processing.

Pratt, William K., Digital Image Processing.

Weisstein, Eric W., "Pseudoinverse", from MathWorld--A Wolfram Web Resource.

www.wikipedia.com

ffmpeg.mplayerhq.hu

www.mathworks.com

www.octave.org

www.google.com
