
NLP-Enriched Automatic Video Segmentation

Mohannad AlMousa
Department of Electrical and Computer Engineering
Lakehead University
Thunder Bay, ON, Canada
malmous@lakeheadu.ca

Rachid Benlamri
Department of Software Engineering
Lakehead University
Thunder Bay, ON, Canada
rbenlamr@lakeheadu.ca

Richard Khoury
Department of Computer Science and Software Engineering
Université Laval
Québec City, QC, Canada
richard.khoury@ift.ulaval.ca

Abstract— E-learning environments are heavily dependent on videos as the main medium for delivering lectures to learners. Despite the merits of video-based lectures, new challenges can paralyze the learning process. Challenges related to video content accessibility, such as searching, retrieving, explaining, matching, organizing, and even summarizing these contents, significantly limit the potential of video-based learning. In this paper, we propose a novel approach to segment video lectures and integrate Natural Language Processing (NLP) tasks to extract the key linguistic features that exist within the video. We exploit the benefits of visual, audio, and textual features in order to create comprehensive temporal feature vectors for the enhanced segmented video. Afterwards, we apply NLP cosine similarity to cluster and identify the various topics presented in the video. The final product is a set of indexed, vector-based, searchable video segments, each covering a specific topic/subtopic.

Keywords— Video segmentation; content-based; NLP; topic detection.

I. INTRODUCTION

Instructional videos are powerful and expressive learning resources, extensively used in e-learning. Since each video may cover many topics, it is critical for an e-learning environment to have content-based video searching capabilities to meet diverse individual learning needs. To accomplish this, it is critical to be able to semantically segment, annotate, index, and retrieve video content in an efficient and effective manner. Today, the ability to search video-based media is limited to the available metadata attached to it. Therefore, it is paramount to develop new models and techniques to semantically structure video content for a better learning experience.

The objectives of this research are to address the following multimedia retrieval issues: (1) provide a contextual video segmentation approach using the video's visual, audio, and textual features; (2) annotate video-based lectures with their main learning content (i.e., topic/subtopic); and (3) discover a video-based lecture, or a section of a lecture, that meets learners' specific queries.

We addressed the first challenge through an integrated video pre-processing engine. This engine integrates and enhances the performance of many tools, including visual and audio segmentation, Optical Character Recognition (OCR), and transcription tools, in order to create a comprehensive and enhanced segmentation technique. The second challenge was addressed by using a Natural Language Processing (NLP) engine that processes the textual information provided by the pre-processing engine to improve the visual and audio segmentation, and then annotates each segment with its main content. Fig 1 outlines the high-level architecture of the proposed system, highlighting the two main engines as well as the segmentation, annotation, and indexing processes.

Fig 1: High-level architecture of the system containing the two core engines: the pre-processing engine (key frame extraction, shot and audio file extraction, OCR & ASR) and the NLP engine (cleansing, tokenizing, and stemming; TF-IDF; segment cosine-similarity computation; back-chain segment merging), whose outputs are combined through key-frame alignment into the Master Key Frames.

The rest of this paper is organized as follows: Section II presents the latest research dealing with the analysis of video-based lectures. Section III covers the main engines within the system, including the pre-processing engine and the NLP engine, as well as the tools and techniques used within each engine. Finally, in Section IV, we present preliminary results and an evaluation of our system, followed by a conclusion.

II. VIDEO-BASED LECTURES ANALYSIS



Attempts to automatically analyze the content of video-based lectures, and of videos in general, have been around for at least three decades. Doulamis et al. attempted to create a multidimensional feature vector for video frames based on the colors, objects, motions, and textual information within a video [1]. Many video segmentation studies relied solely on linguistic features based on transcribed spoken text. Lin et al. utilized a combination of linguistic features, which included content-based elements (i.e., noun phrases (NP), verb classes (VC), word stems (WS), topic words (TNP), and combined features (NV)). They also included discourse-based features (i.e., pronouns (PN) and cue phrases (CP)) [2]. Zhang and Nunamaker also used an NLP approach to index video for an interactive Learning-by-Asking (LBA) e-learning system. Their system used an NLP question-answering technique whereby they matched the learner's questions with the video's metadata (i.e., title, keywords, content templates, transcript, lecture notes, etc.) and other data about the speaker. Then, they suggested a certain video clip as an answer to the user's question [3]. Kobayashi et al. detected topic changes by analyzing the co-occurrence of words between sentences of the transcript, basing word weights on the Term Frequency - Inverse Document Frequency (TF-IDF) NLP technique [4]. Yang and Meinel applied both Automatic Speech Recognition (ASR) and Optical Character Recognition (OCR) information to retrieve video lectures [5]. A similar approach was implemented by Tuna et al. to create an Indexed Captioned Searchable (ICS) video system. They focused on video lectures within the University of Houston in science, technology, engineering, and mathematics (STEM) coursework, which ensured a good assessment as the system was used by students for several years. One of the main shortcomings of their system was the use of manual transcripts, as their best ASR tool averaged only 68% accuracy [6] [7] [8]. On the other hand, Gandhi, Biswas, and Deshmukh relied only on OCR information, but identified salient words based on additional extracted features (i.e., position, font, style, size, etc.) [9]. Finally, a paper by Shah, Yu, Shaikh, and Zimmermann attempted an automatic linguistic-based video segmentation by leveraging Wikipedia text. Provided with a manual SRT transcript file, they matched sections of the video transcript to Wikipedia articles to identify the best topic match, and presented the timed transcript with its matched topic from the Wikipedia articles [10].

III. VIDEO PREPROCESSING

Automatically extracting video content involves the use of many technologies due to the complexity and richness of such encoded content. Video content contains visual (i.e., images, objects, motion), audio (i.e., tune, transcript), and linguistic (i.e., on-screen text, spoken text) content. Analyzing all visual, audio, and linguistic aspects of a video is a task beyond the scope of this paper. However, we will examine tools to extract selected knowledge from each source. We will then describe a novel integration algorithm that combines the extracted multi-modal features to analyze video content.

The video pre-processing engine, shown in Fig 1, feeds structured content to the NLP engine to cleanse, tokenize, and lemmatize the content in text form. The NLP engine then applies an additional linguistic annotation (i.e., the TF-IDF weight) to the content and creates a temporal feature vector for each segment of the video. This feature vector makes the video segment searchable and comparable. Before providing a detailed description of the above-mentioned process, we first describe the dataset used in this experiment.

In this study we used instructional videos developed for two Software Quality Assurance (SQA) courses, which are both available on YouTube. The first course¹ contains 29 video lectures with an average duration of three minutes each. Its lectures' style consists of visual animation and on-screen text, with a background voice reading the text exactly as it appears on the screen. The second course², however, contains 43 video lectures with an average duration of 53 minutes each. Its lectures' style consists of the instructor presenting the topic while the slides appear on the screen, during which the instructor elaborates on the content of each slide. Both courses are presented in English.

A. Video pre-processing engine:

The first step in creating the temporal feature vector was to extract content features from the video, i.e., visual, audio, on-screen text, and spoken words. The latter two features were extracted using OCR and ASR tools respectively, as explained later in this section.

1) Video-based segmentation:

We started by detecting scene changes based on the mean pixel intensity difference between two consecutive frames, referred to as the histogram detection algorithm. This was done by evaluating and using the following tools, all of which employ the same histogram scene-detection algorithm: FFmpeg³, PySceneDetect⁴, and the Google Cloud Video Intelligence API⁵. Table 1 compares these tools in terms of their implementation and functionality features.

Table 1: Comparison between scene detection tools

                            FFmpeg & FFprobe    PySceneDetect    Google Cloud API
Programming language        C                   Python           N/A
Open source
REST API availability
Various threshold values
Segmentation
Extract shots
Extract audio
Annotations

¹ https://www.youtube.com/watch?v=xCwkjZcEK6w&list=PLXXvO4OXeJrfbPrI0CV-re2VkdiV1dC7X
² http://onlinevideolecture.com/?course_id=1307
³ https://ffmpeg.org/
⁴ https://pyscenedetect.readthedocs.io/en/latest/
⁵ https://cloud.google.com/video-intelligence/
Furthermore, we evaluated the tools' performance using two metrics: speed and accuracy. We measured speed by computing the execution time of each tool on the same video set, while accuracy was measured by comparing the detected scenes to a human gold standard. Fig 2 and Fig 3 illustrate the number of scenes detected and the execution time of each tool at various threshold values, where the threshold is the minimum acceptable scene score. The scene scores are computed based on the histogram variance, that is, the difference in the average intensity of pixels between two consecutive frames. Unfortunately, the Google API does not offer the flexibility of setting or changing a threshold value. Hence, the threshold value in this case is unknown, which is reflected in the results as a dotted line across all values. However, FFmpeg, FFprobe, and PySceneDetect were run at various threshold values for all videos in the dataset.

Fig 2: Number of scenes detected at different threshold values

FFprobe and FFmpeg both use the same decoding algorithm to process video files; hence, they detect the same scene changes. However, executing them with different parameters (i.e., decoding video, audio, or both, counting frames, logging results, etc.) affects their execution time. In our case, FFmpeg performed much faster than FFprobe (refer to the execution time comparison in Fig 3). For accuracy, we evaluated the performance of the tools by comparing their results to human judgment as a gold standard. Fig 4 shows the accuracy of these tools using an ROC curve. It can be seen from the curve that the performance of FFmpeg and FFprobe exceeded that of the other tools at threshold values of 10% and 20%. Consequently, we decided to use the FFmpeg tool due to its superiority in terms of both speed and accuracy.

Fig 3: Execution time based on threshold values

An automatic and accurate selection of a threshold value is crucial to the segmentation process. Although FFmpeg's performance was superior within our video dataset, applying a preset threshold value across all video files is not sensible due to the wide variety of video content and recording/teaching styles found in videos (e.g., PowerPoint slides, speaker, sliding scenes, cut scenes, moving objects, etc.). Therefore, prior to detecting scene changes, we dynamically analyzed each video to determine its own optimum threshold value. Using the FFmpeg tool, we first detected all scene changes with a score above or equal to an initial threshold value of 0.01. We then set an optimal video-based threshold value equal to the average score of all previously-detected scenes.

To determine this dynamic video-based optimum threshold value, we evaluated all scene changes between a 1% match and a 5% match, i.e., initial threshold values of 0.01 and 0.05 respectively. We then computed the average threshold of all the scenes detected at the 1% and 5% levels. After applying this experiment to our dataset, we concluded that the 1% initial threshold provided the best results, and it was therefore adopted in this study. The experimental dataset resulted in optimum threshold values ranging between 0.044 and 0.365.
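As an illustration of this step, the sketch below shows one way the procedure can be scripted: ffprobe's lavfi movie source reports a scene-change score for every candidate frame above the 0.01 floor, and the per-video threshold is then taken as the average of those scores. The exact ffprobe entry names (e.g., pts_time, lavfi.scene_score) may differ slightly between FFmpeg versions; this is a minimal sketch rather than the exact tooling used in our experiments.

```python
import subprocess

def scene_scores(video_path, floor=0.01):
    """Return (timestamp, score) pairs for frames whose scene score is >= floor."""
    # The lavfi 'movie' source with a 'select' filter exposes each selected
    # frame's scene-change score through the lavfi.scene_score frame tag.
    cmd = [
        "ffprobe", "-loglevel", "error",
        "-f", "lavfi", "-i", f"movie={video_path},select=gt(scene\\,{floor})",
        "-show_entries", "frame=pts_time:frame_tags=lavfi.scene_score",
        "-of", "csv=p=0",
    ]
    out = subprocess.run(cmd, capture_output=True, text=True, check=True).stdout
    pairs = []
    for line in out.splitlines():
        parts = line.split(",")
        if len(parts) >= 2:
            try:
                pairs.append((float(parts[0]), float(parts[1])))
            except ValueError:
                pass  # skip malformed lines
    return pairs

def detect_key_frames(video_path, floor=0.01):
    """Keep only the scene changes whose score reaches the per-video threshold,
    defined as the average score of all candidates detected above the floor."""
    scored = scene_scores(video_path, floor)
    if not scored:
        return []
    threshold = sum(score for _, score in scored) / len(scored)
    return [t for t, score in scored if score >= threshold]
```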

Fig 4: ROC curve for the segmentation tools

2) Scene shots (images) & audio files extraction:

The next step consisted of applying two parallel processes: extracting shots (images) from each scene for Optical Character Recognition (OCR) processing, and extracting the audio file for Automatic Speech Recognition (ASR) processing. For both of these steps, we used FFmpeg to extract the media files (images and audio) from the video file. For each scene, we extracted three images (at the beginning, middle, and end of the scene). We subsequently deleted duplicate images within each scene based on the difference of a visual thesaurus constructed from the colors and texture descriptions of each image [11]. The purpose of deleting duplicate images was to reduce the processing time of the OCR tool.
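A minimal sketch of this extraction step is given below; it assumes the scene boundaries computed earlier and uses hypothetical output file names. FFmpeg is invoked once per sampled frame and once for a mono 16 kHz audio track.

```python
import os
import subprocess

def extract_scene_images(video_path, start, end, scene_id, out_dir="shots"):
    """Save three representative frames (beginning, middle, end) of one scene."""
    os.makedirs(out_dir, exist_ok=True)
    samples = (("a", start), ("b", (start + end) / 2), ("c", max(start, end - 0.1)))
    for label, t in samples:
        subprocess.run([
            "ffmpeg", "-loglevel", "error", "-y",
            "-ss", f"{t:.3f}",                 # seek to the sampling point
            "-i", video_path,
            "-frames:v", "1",                  # dump a single frame
            os.path.join(out_dir, f"scene{scene_id:04d}_{label}.jpg"),
        ], check=True)

def extract_audio(video_path, out_path="audio.wav"):
    """Extract a mono 16 kHz PCM track, a common input format for ASR tools."""
    subprocess.run([
        "ffmpeg", "-loglevel", "error", "-y",
        "-i", video_path,
        "-vn", "-ac", "1", "-ar", "16000", "-acodec", "pcm_s16le",
        out_path,
    ], check=True)
```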
3) OCR-based segmentation:

Once we removed all duplicate images from each scene, we processed each of the remaining images to extract text using the Google OCR API. The text detected in all images was then combined per scene and processed by the NLP engine for two main tasks. The first was to create a context vector of text that was fed to the ASR tool to enhance ASR performance. The second, and most important, task was to create the NLP-based video segmentation that is discussed in detail in the "NLP engine" section. The extracted context vector from the OCR tool was fed to the ASR tool (Google Cloud Speech API) as video context to enhance the audio transcription process.
4) Audio-based and ASR-based segmentation:

One of the main advantages of the Google Cloud Speech API is that it provides transcript segmentation. The API's transcript results are divided into segments based on the pauses detected within the speech. The tool generates one or more alternative transcripts for each segment, and the alternative with the highest confidence level was selected as the candidate transcript.
We leveraged this segmentation as an audio-based segmentation (its benefits will be summarized in the key frame alignment step below). Fig 5 outlines the audio-based segmentation resulting from the Google Cloud Speech API, where the vertical white lines represent pauses in the speech and, hence, audio-segment key frames.

Fig 5: Waveform visualization of transcript segmentation
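A minimal sketch of this call is shown below, assuming the google-cloud-speech client library (2.x layout) and an audio file already uploaded to Cloud Storage; the OCR-derived context phrases are passed as a speech context, and each pause-delimited result contributes one audio segment with its highest-confidence transcript.

```python
from google.cloud import speech

def transcribe_with_context(gcs_uri, context_phrases):
    """Transcribe an audio file, biasing recognition with OCR-derived phrases."""
    client = speech.SpeechClient()
    config = speech.RecognitionConfig(
        encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
        sample_rate_hertz=16000,
        language_code="en-US",
        enable_word_time_offsets=True,       # word timings mark segment boundaries
        speech_contexts=[speech.SpeechContext(phrases=context_phrases)],
    )
    audio = speech.RecognitionAudio(uri=gcs_uri)   # e.g. "gs://bucket/lecture.wav"
    response = client.long_running_recognize(config=config, audio=audio).result(timeout=3600)

    segments = []
    for result in response.results:          # one result per pause-delimited chunk
        best = result.alternatives[0]        # highest-confidence alternative
        if not best.words:
            continue
        start = best.words[0].start_time.total_seconds()
        end = best.words[-1].end_time.total_seconds()
        segments.append((start, end, best.transcript, best.confidence))
    return segments
```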
We used the Google Vision API and the Google Cloud Speech API as our OCR and ASR tools respectively, for their simplicity and superiority over other tools. A comprehensive comparison of different OCR and ASR tools is presented in [12] and [13]. Vijayarani and Sakila compared eight OCR tools in terms of the accuracy and error rate of all detected alphanumeric characters and special characters. Although five of these tools, including Google Docs, achieved 100% character accuracy, all eight failed to detect special characters, with a 100% error rate [12]. Këpuska and Bohouta compared three tools: Sphinx4, Google Speech API, and Microsoft API. Based on the results produced by these tools compared to the original transcript, the Google Speech API achieved 91% accuracy, compared to 63% and 82% accuracy for Sphinx4 and the Microsoft API respectively [13].
As a result of the video pre-processing engine, the system produced four different segment vectors:
1) histogram-based video segments,
2) audio-based audio segments,
3) segmented OCR textual content, and
4) segmented ASR transcript/textual content.

Each of the above vectors was represented as a temporal key frame vector, where the first node was segment zero, starting at time zero, and the last node was segment N, ending at the duration of the video. In the middle were all other identified key frames. Note that vectors 3 and 4 were, at this point, replicas of 1 and 2 respectively; however, their values changed once 3 and 4 were processed through the NLP engine.

B. NLP engine:

All the textual content obtained from the video pre-processing step (i.e., the OCR and ASR results) was fed into the NLP engine for linguistic processing.
1) Text cleansing & preparation:

The NLP engine started by applying a string tokenizer, followed by extraneous text cleansing, which consisted of removing any token consisting of a single character, numbers only, or special characters only. Then, for each valid token, we applied a spellchecker to confirm that all words were detected correctly. Any misspelled word was flagged, and the top suggestion from the spellchecker was selected. After a term's spelling was confirmed, a term stemmer was applied (both the spellchecker and the stemmer are based on the NHunspell .NET library⁶). Since NHunspell is based on a limited dictionary file, we tracked all recurring newly-coined terms and updated the dictionary with these terms, as an enhancement to the spellchecker.
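The cleansing rules translate into only a few lines of code. The sketch below is a Python stand-in for our NHunspell-based .NET implementation, with the spellchecker and stemmer left as injectable callables rather than concrete library calls.

```python
import re

WORD_RE = re.compile(r"[a-z]")

def cleanse(text, spellcheck=lambda w: w, stem=lambda w: w):
    """Tokenize and drop single-character, number-only and symbol-only tokens,
    then spell-correct and stem the survivors (both steps are injectable)."""
    terms = []
    for token in text.lower().split():
        if len(token) < 2:             # single characters
            continue
        if token.isdigit():            # numbers only
            continue
        if not WORD_RE.search(token):  # special characters only
            continue
        terms.append(stem(spellcheck(token)))
    return terms
```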

⁶ https://www.nuget.org/packages/NHunspell/
⁷ https://www.wikipedia.org/
2) IDF and TF-IDF weight:

After stemming all terms, we calculate the TF-IDF weight (w_td) using equation (III.1) below. The TF-IDF weights are computed on segment-based vectors, as well as for the entire video as one vector.

    w_td = tf_td * log( N / (1 + n_t) )        (III.1)

where:
    N   : total number of documents
    n_t : number of documents containing term t.

The first part of the equation computes the term frequency (tf_td) within each segment. The second part calculates the IDF (log(N / (1 + n_t))) as an estimated value using the online pages indexed in Google and Wikipedia as a base corpus. We evaluated the IDF value using three sources: https://www.google.com, https://www.wikipedia.org/, and https://en.wikipedia.org. The basic idea was to search one of these websites (Google, Wikipedia (all languages), or English Wikipedia) for a term and capture the estimated number of pages containing that term (n_t), while the value of N represents the number of indexed pages that exist in Google, Wikipedia, or English Wikipedia. Although we previously used Google [14], we were not certain of the size of Google's indexed pages. On the other hand, Wikipedia clearly states the number of Wiki pages (articles) on its site, broken down per language (i.e., 5.5M English pages)⁷. We searched these sites using Google Custom Search, which can limit the search to a single site/domain or subdomain.

To choose a corpus from the three mentioned above, we computed the IDF value for approximately 1000 terms on all three sites (Google, Wikipedia, and English Wikipedia) and, based on the results from each site, we excluded https://en.wikipedia.org (the English Wikipedia subdomain) due to some negative IDF values. A negative IDF value indicates an inaccurate N value (number of indexed pages), that is, an inaccurate search scope. Afterwards, we plotted the results from www.google.com and www.wikipedia.org (all languages) in Fig 6 to confirm that they were correlated and did not conflict with each other. As a result, we made both available in the application and added a configuration option to use either site.
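Equation (III.1) maps directly onto code. In the sketch below, estimate_nt is a placeholder for the Google Custom Search / Wikipedia lookup that returns the estimated number of indexed pages containing a term, and corpus_size stands for N (e.g., the published Wikipedia article count).

```python
import math
from collections import Counter

def tfidf_vector(segment_terms, corpus_size, estimate_nt):
    """Compute w_td = tf_td * log(N / (1 + n_t)) for every term of one segment.

    segment_terms : cleansed, stemmed terms of the segment
    corpus_size   : N, the number of indexed pages of the chosen corpus
    estimate_nt   : callable returning the estimated page count for a term
    """
    tf = Counter(segment_terms)
    return {term: freq * math.log(corpus_size / (1 + estimate_nt(term)))
            for term, freq in tf.items()}
```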
Fig 6: Simplified IDF correlation result between Google and Wikipedia (all languages)

3) Cosine similarity matrix:

The textual data extracted from the OCR and ASR tools were segmented in the same way as the histogram and audio segments. Once these data were cleansed and organized and the TF-IDF weights were computed for all terms, each segment was represented as a vector of terms. The NLP engine then computed the cosine similarity for each pair of OCR and ASR segments. The cosine similarity represents the correlation between segments. This value was used to merge consecutive highly-correlated segments using the back-chain merging algorithm described in Fig 7.

4) Back-chain merging algorithm:

The first important task of the NLP engine was to merge highly-correlated consecutive segments. We achieved this by computing the cosine similarity between all segments. We then compared the cosine similarity value of each segment with the following two segments to remove any fake key frame, i.e., one that did not represent a new topic. Each merged segment was flagged to prevent re-merging, since we applied the back-chain algorithm described in Fig 7 to merge segments. The same algorithm was applied to the ASR segments.

    Having the CosSim[n][n] matrix of n segments
    For i = 0 to n - 1
        For j = i + 1 to i + 2, with j < n
            IF Seg[j] is merged
                Break
            IF Seg[i] is merged
                Find the first unmerged segment Seg[p] in the merged chain
                Add Seg[j] to Seg[p]'s chain
                Mark Seg[j] merged

Fig 7: Back-chain segment similarity merging algorithm
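A compact reading of Fig 7 is sketched below: the pairwise cosine-similarity matrix is built over the segments' TF-IDF vectors, and each segment is merged into the chain of a sufficiently similar segment among the one or two segments immediately before it. The sim_threshold parameter is an illustrative assumption; the cut-off used to decide that two segments are "highly correlated" is not fixed in the pseudocode above.

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse TF-IDF vectors (term -> weight dicts)."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def back_chain_merge(segments, sim_threshold=0.5):
    """Chain each of the next two segments onto segment i's chain head when the
    segments are highly correlated; returns chain[j] = index of j's chain head."""
    n = len(segments)
    sim = [[cosine(segments[i], segments[j]) for j in range(n)] for i in range(n)]
    chain = list(range(n))            # every segment starts as its own chain head
    merged = [False] * n
    for i in range(n):
        for j in range(i + 1, min(i + 3, n)):
            if merged[j]:             # already chained onto an earlier segment
                break
            if sim[i][j] >= sim_threshold:   # assumed "highly correlated" test
                chain[j] = chain[i]   # first unmerged head of i's chain
                merged[j] = True
    return chain
```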
By the end of this process, we had four key frame (segment) vectors: visual-based, audio-based, OCR-based, and ASR-based key frame vectors. In the next section, we illustrate a key frame alignment process that identifies the key frames common to all key frame vectors. We refer to these as Master Key Frames.

C. Key frame alignment:

In this section, we present an enhanced key frame alignment algorithm that aligns the key frames of the visual and audio features with the NLP-processed textual features extracted from the OCR and ASR tools. The algorithm shown in Fig 8 aligns and scores all key frames/segments. Starting with an empty key frame list (KFAlign), we iterate through all audio-based key frames, adding them to the KFAlign list with a score of 1. We subsequently repeat the alignment process for all remaining key frame vectors (i.e., the visual, OCR, and ASR-based segment vectors), matching each key frame against the key frames in KFAlign. If a match is found, we increase that key frame's score by 1; otherwise, we add the key frame of the respective vector with a score of 1. The outcome of the alignment algorithm is a single key frame vector whose nodes score between 1 and 4.

    Declare an empty list KFAlign
    ForEach kfAudio IN the Audio Key Frame Set
        Add kfAudio to KFAlign
        Initialize kfAudio score to 1
    Repeat for every key frame set (KFSet) (Video, OCR, ASR):
        ForEach kf_i IN the Key Frame Set
            ForEach kf_j IN KFAlign
                IF kf_i = kf_j Then            // key exists
                    Increment kf_j score
                    Break
                IF kf_j > kf_i Then            // key does not exist
                    Add kf_i to KFAlign
                    Initialize kf_i score to 1
                    Break

Fig 8: Key frames alignment and scoring algorithm
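A minimal sketch of the alignment and scoring is given below. It assumes key frames are represented by timestamps in seconds and adds a small matching tolerance (tol) of our own, since timestamps produced by different modalities rarely coincide exactly.

```python
def align_key_frames(audio_kf, video_kf, ocr_kf, asr_kf, tol=0.5):
    """Merge four key frame time lists into one sorted list of (time, score),
    where the score (1-4) counts how many modalities share that key frame."""
    aligned = [[t, 1] for t in sorted(audio_kf)]       # seed with audio key frames
    for kf_set in (video_kf, ocr_kf, asr_kf):
        for t in kf_set:
            for node in aligned:
                if abs(node[0] - t) <= tol:            # key frame already exists
                    node[1] += 1
                    break
            else:                                      # key frame does not exist yet
                aligned.append([t, 1])
        aligned.sort(key=lambda node: node[0])
    return [(t, score) for t, score in aligned]
```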
IV. SYSTEM EVALUATION

We evaluated the performance of our system against a gold-standard segmentation. The gold standard was determined by a subject-matter expert for each video lecture. We computed the precision, recall, and F1 score on a test sample of the video lectures. The sample was divided based on the audio and visual quality of the videos (low and high quality). Table 2 reports the average precision, recall, and F1 score for the evaluated dataset. We observed that for low-quality videos the results were not promising, which reflects the shortcomings of the current tools on low-quality video files. However, for videos of acceptable quality, the results were very promising. Our system provides its best F1 score when incorporating the visual and OCR features, followed by incorporating all features, with F1 scores of 0.5272 (row #4) and 0.5012 (row #8) respectively.

We achieved the highest recall with the visual segmentation alone (#3); however, its corresponding precision is comparatively low. In row #4 we combined the visual and OCR features and observed better precision. Similarly, applying OCR to the visual and audio features improves the precision and F1 score (see rows #5 and #7). Finally, employing all four features (visual, audio, OCR, and ASR) yields an F1 score above 0.5. These results indicate that all features play a role in choosing real topic segments. However, the precision did not exceed 0.5 in any case, which signifies that our system is too permissive in what it considers to be a new topic/subtopic, and thus generates many false positives. Therefore, we may need to increase the threshold of difference between segments required to start a new scene, in order to reduce the false positives. ASR in particular seemed to be very sensitive to this problem. Given both the natural variety of spoken language and the words misrecognized by the ASR tools due to noise, low recording quality, etc., the similarity between successive spoken segments was much lower than we expected, and therefore a very large number of false scenes was generated.
Table 2: Evaluation of the automatic video segmentation system

                                         Low Quality Video                          High Quality Video
  Segmentation approach          Avg. Precision  Avg. Recall  Avg. F1-Score  Avg. Precision  Avg. Recall  Avg. F1-Score
  1. (Audio)                          0.0178        0.4422       0.0343          0.0899        0.6783       0.1553
  2. (Audio + ASR)                    0.0203        0.4172       0.0386          0.0982        0.5856       0.1681
  3. (Visual)                         0.1212        0.5828       0.2007          0.3166        0.8252       0.4576
  4. (Visual + OCR)                   0.0400        0.1000       0.0571          0.4064        0.7500       0.5272
  5. (Visual + Audio)                 0.1115        0.2994       0.1625          0.3381        0.6806       0.4518
  6. (Visual + Audio + ASR)           0.0700        0.2772       0.1118          0.2114        0.5961       0.3121
  7. (Visual + Audio + OCR)           0.0000        0.0000       ———             0.4210        0.6183       0.5009
  8. (Visual + Audio + OCR + ASR)     0.0500        0.0250       0.0333          0.4154        0.6317       0.5012
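For reference, the boundary-level scoring behind Table 2 can be expressed as a tolerant matching of detected key frames against the expert gold standard; the two-second tolerance below is an illustrative assumption rather than a value reported with our results.

```python
def evaluate_boundaries(detected, gold, tol=2.0):
    """Precision, recall and F1 of detected key frame times against the expert
    gold standard; a detection is a true positive if it falls within `tol`
    seconds of a not-yet-matched gold boundary."""
    matched = [False] * len(gold)
    tp = 0
    for t in detected:
        for k, g in enumerate(gold):
            if not matched[k] and abs(t - g) <= tol:
                matched[k] = True
                tp += 1
                break
    precision = tp / len(detected) if detected else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```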

V. CONCLUSIONS AND FUTURE WORK

This paper presents a collaborative system that integrates state-of-the-art tools and techniques to segment video lectures based on their content. The main innovation of this paper is to contextualize the video segments by leveraging the visual, audio, and text (on-screen text and transcribed text) features extracted by state-of-the-art tools and NLP techniques.

Another important outcome of this research is that it paves the path for further work, still in the experimental stage, towards full context segmentation leveraging a knowledge-based repository, additional NLP features, and semantic features. Furthermore, the proposed system can be improved by enhancing the NLP segmentation process so that it can introduce new linguistic key frames, similar to the approach in [2] discussed in the literature. Currently, our system employs NLP only to eliminate fake key frames.

REFERENCES

[1] A. Doulamis, Y. Avrithis, N. Doulamis and S. Kollias, "A genetic algorithm for efficient video content representation," in Proceedings of the IMACS/IFAC International Symposium on Soft Computing in Engineering Applications (SOFTCOM 1998), Athens, Greece, 1998.
[2] M. Lin, J. F. Nunamaker, M. Chau and H. Chen, "Segmentation of lecture videos based on text: a method combining multiple linguistic features," in Proceedings of the 37th Annual Hawaii International Conference on System Sciences, 2004.
[3] D. Zhang and J. F. Nunamaker, "A Natural Language Approach to Content-Based Video Indexing and Retrieval for Interactive E-Learning," IEEE Transactions on Multimedia, vol. 6, no. 3, pp. 450-458, 2004.
[4] N. Kobayashi, N. Koyama, H. Shiina and F. Kitagawa, "Extracting Topic Changes through Word Co-occurrence Graphs from Japanese Subtitles of VOD Lecture," in Advanced Applied Informatics (IIAI-AAI), 2012.
[5] H. Yang and C. Meinel, "Content Based Lecture Video Retrieval Using Speech and Video Text Information," IEEE Transactions on Learning Technologies, vol. 7, no. 2, pp. 142-154, 2014.
[6] T. Tuna, J. Subhlok and S. Shah, "Indexing and keyword search to ease navigation in lecture videos," in Applied Imagery Pattern Recognition Workshop (AIPR), 2011 IEEE, 2011.
[7] T. Tuna, M. Joshi, V. Varghese, R. Deshpande, J. Subhlok and R. Verma, "Topic based segmentation of classroom videos," in Frontiers in Education Conference (FIE), 2015.
[8] T. Tuna, J. Subhlok, L. Barker, S. Shah, O. Johnson and C. Hovey, "Indexed Captioned Searchable Videos: A Learning Companion," Journal of Science Education and Technology, vol. 26, no. 1, pp. 82-99, 2017.
[9] A. Gandhi, A. Biswas and O. Deshmukh, "Topic Transition in Educational Videos Using Visually Salient Words," International Educational Data Mining Society, ERIC, 2015.
[10] R. R. Shah, Y. Yu, A. D. Shaikh and R. Zimmermann, "TRACE: Linguistic-Based Approach for Automatic Lecture Video Segmentation Leveraging Wikipedia Texts," in Multimedia (ISM), 2015 IEEE International Symposium, 2015.
[11] E. Spyrou, G. Tolias, P. Mylonas and Y. Avrithis, "Concept detection and keyframe extraction using a visual thesaurus," Multimedia Tools and Applications, vol. 41, no. 3, pp. 337-373, 2009.
[12] S. Vijayarani and A. Sakila, "Performance comparison of OCR Tools," International Journal of UbiComp (IJU), vol. 6, no. 3, 2015.
[13] V. Këpuska and G. Bohouta, "Comparing Speech Recognition Systems (Microsoft API, Google API and CMU Sphinx)," Int. Journal of Engineering Research and Application, vol. 7, no. 3, pp. 20-24, 2017.
[14] M. Al-Mousa and J. Fiaidhi, "Developing a Collaborative MOOC Learning Environment utilizing Video Sharing with Discussion Summarization as Added-Value," International Journal of Multimedia and Ubiquitous Engineering, vol. 9, no. 11, pp. 397-408, 2014.