Abstract— E-learning environments are heavily dependent on videos as the main medium for delivering lectures to learners. Despite the merits of video-based lectures, new challenges can paralyze the learning process. Challenges that deal with video content accessibility, such as searching, retrieving, explaining, matching, organizing, and even summarizing these contents, significantly limit the potential of video-based learning. In this paper, we propose a novel approach to segment video lectures and integrate Natural Language Processing (NLP) tasks to extract key linguistic features that exist within the video. We exploit visual, audio, and textual features in order to create comprehensive temporal feature vectors for the enhanced segmented video. Afterwards, we apply NLP cosine similarity to cluster and identify the various topics presented in the video. The final product is an indexed, vector-based collection of searchable video segments, each covering a specific topic/subtopic.

Keywords-Video segmentation; content-based; NLP; topic detection.

I. INTRODUCTION

Instructional videos are powerful and expressive learning resources, extensively used in e-learning. Since each video may cover many topics, it is critical for an e-learning environment to have content-based video searching capabilities to meet diverse individual learning needs. To accomplish this, it is essential to be able to semantically segment, annotate, index, and retrieve video content in an efficient and effective manner. Today, the ability to search video-based media is limited to the available metadata attached to it. Therefore, it is paramount to develop new models and techniques to semantically structure video content for a better learning experience.
The objectives of this research are to deal with the following multimedia retrieval issues: (1) provide a contextual video segmentation approach using the video's visual, audio, and textual features; (2) annotate video-based lectures with their main learning content (i.e., topic/subtopic); and (3) discover a video-based lecture, or a section of a lecture, that meets learners' specific queries.
We addressed the first challenge through an integrated video pre-processing engine. This engine integrates and enhances the performance of many tools, including visual and audio segmentation, Optical Character Recognition (OCR), and transcription tools, in order to create a comprehensive and enhanced segmentation technique. The second challenge was addressed by using a Natural Language Processing (NLP) engine that processes the textual information provided by the pre-processing engine to improve the visual and audio segmentation, and then annotates each segment with its main content. Fig 1 outlines a high-level architecture of the proposed system, highlighting the two main engines as well as the segmentation, annotation, and indexing processes.

Fig 1: High-level architecture of the system containing the two core engines. The pre-processing engine comprises key frame extraction, audio and shot detection, OCR and ASR transcription, and key-frame alignment (master key frames); the NLP engine comprises cleansing, tokenizing, and stemming, TF-IDF, segment cosine-similarity calculation, and back-chain segments' merging.
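To make the NLP engine's similarity step concrete, the sketch below chains adjacent transcript segments whose TF-IDF cosine similarity exceeds a threshold, mirroring the cleansing/tokenizing, TF-IDF, and segment cosine-similarity stages shown in Fig 1. This is a minimal illustration only; scikit-learn, the merge_similar_segments helper, and the 0.5 threshold are assumptions rather than the system's actual implementation.

# Minimal sketch (not the system's actual code): TF-IDF vectors per segment
# transcript, then back-chain merging of adjacent segments whose cosine
# similarity is above an assumed threshold.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def merge_similar_segments(transcripts, threshold=0.5):
    """Group indices of consecutive segments that share the same topic."""
    if not transcripts:
        return []
    # Cleansing, tokenizing, and stemming would normally precede this step;
    # here we rely on the vectorizer's built-in tokenizer and stop-word list.
    tfidf = TfidfVectorizer(stop_words="english").fit_transform(transcripts)
    groups = [[0]]
    for i in range(1, len(transcripts)):
        similarity = cosine_similarity(tfidf[i - 1], tfidf[i])[0, 0]
        if similarity >= threshold:
            groups[-1].append(i)   # same topic: chain into the previous group
        else:
            groups.append([i])     # topic change: start a new segment group
    return groups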
The rest of this paper is organized as follows: Section II presents the latest research dealing with the analysis of video-based lectures. Section III covers the main engines within the system, including the pre-processing engine and the NLP engine, as well as the tools and techniques used within each engine. Finally, in Section IV, we present preliminary results and an evaluation of our system, followed by a conclusion.

II. VIDEO-BASED LECTURES ANALYSIS

Attempts to automatically analyze the content of video-based lectures, and of videos in general, have been around for at least three decades. Doulamis et al. attempted to create a multidimensional feature vector for video frames based on the colors, objects, motions, and textual information within a video
1 https://www.youtube.com/watch?v=xCwkjZcEK6w&list=PLXXvO4OXeJrfbPrI0CV-re2VkdiV1dC7X
2 http://onlinevideolecture.com/?course_id=1307
3 https://ffmpeg.org/
4 https://pyscenedetect.readthedocs.io/en/latest/
5 https://cloud.google.com/video-intelligence/
minimum acceptable scene score. The scene scores are computed based on the histogram variance, that is, the difference in the average intensity of pixels between two consecutive frames. Unfortunately, the Google API does not offer the flexibility of setting or changing a threshold value. Hence, the threshold value in this case is unknown, and this was reflected in the results as a dotted line across all values. However, FFmpeg, FFprobe, and PySceneDetect were run at various threshold values for all videos in the dataset.
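As an illustration of this score (not the internal implementation of the tools named above), the sketch below computes, for every pair of consecutive frames, the normalized difference in average pixel intensity; OpenCV and the scene_scores helper are assumptions introduced only for this example.

# Illustrative sketch of the scene score described above: the normalized
# difference in average pixel intensity between two consecutive frames.
import cv2

def scene_scores(video_path):
    """Return one score in [0, 1] per consecutive frame pair."""
    cap = cv2.VideoCapture(video_path)
    scores = []
    ok, prev = cap.read()
    while ok:
        ok, frame = cap.read()
        if not ok:
            break
        prev_mean = cv2.cvtColor(prev, cv2.COLOR_BGR2GRAY).mean()
        curr_mean = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY).mean()
        scores.append(abs(float(curr_mean) - float(prev_mean)) / 255.0)
        prev = frame
    cap.release()
    return scores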
a score above or equal to an initial threshold value of 0.01. We then set an optimal video-based threshold value equal to the average score of all previously detected scenes.

In order to determine the dynamic, video-based optimum threshold value, we evaluated all scene changes between a 1% match and a 5% match, corresponding to initial threshold values of 0.01 and 0.05, respectively. We then computed the average threshold of all the scenes detected using the 1% and 5% matches. After applying this experiment to the dataset, we concluded that the optimum threshold value of a 1% match provided the best results. Therefore, that optimum threshold value was adopted in this study. The experimental dataset resulted in optimum threshold values ranging between 0.044 and 0.365.
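The per-video threshold selection described above can be summarized by the following sketch: scenes are first detected with the initial 1% threshold (0.01), and the video-specific optimum threshold is the average score of those detected scenes. The function name and the fallback for videos with no detected scene changes are assumptions for illustration.

def optimal_threshold(scores, initial_threshold=0.01):
    """Average score of the scenes detected at the initial threshold."""
    detected = [s for s in scores if s >= initial_threshold]
    if not detected:
        return initial_threshold  # no scene change detected; keep the initial value
    # On the experimental dataset this average ranged between 0.044 and 0.365.
    return sum(detected) / len(detected)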
6 https://www.nuget.org/packages/NHunspell/
7 https://www.wikipedia.org/
outcome of the alignment algorithm was a single key frame vector whose nodes scored between 1 and 4.