
Localizing and Recognizing Text in Lecture Videos

Kartik Dutta, Minesh Mathew, Praveen Krishnan and C.V. Jawahar

1. Motivation

Instruction through lecture videos is the future of mass education. Text recognition and localization can help automate many aspects of teaching.

2. Applications

• Summarizing instructional videos
• Multimodal (audio, text, video) search
• Automatic indexing and retrieval

3. Contributions

• Release of the LectureVideoDB dataset, a compilation of video frames and their extracted word images from various instructional videos. The videos contain printed and handwritten text in different settings.
• An investigation of how current state-of-the-art methods for word localization, word recognition and word spotting perform in the lecture video setting.
• One of the first works to use deep learning for analyzing lecture videos.

4. LectureVideoDB


• Created from course videos of 24 MOOC courses, from sources such as NPTEL, MIT OCW and Coursera.
• 4 modalities: Blackboard, Paper, Slides, Whiteboard.
• Varying camera angles, varying distances to the source, and resolutions ranging from 320x240 to 1280x720.
• Frames were extracted manually to avoid duplicates.

Split   #Frames   #Words    #Writers
Train      3170    82263          17
Val         549    15379           5
Test       1755    40103          13
Total      5474   137745          35

Type          #Frames   #Words   #Writers
Slides           1145    52225          5
Whiteboard        945    21160          7
Paper            1281    27900          9
Blackboard       2103    36460         14

[Figure: extracted frame images from the four modalities]

5. Word Localization, Spotting and Recognition

Word Localization
• Two different architectures are used to test word localization performance: EAST, released by Zhou et al., CVPR 2017, and TextBoxes++, released by Liao et al., AAAI 2017. (A minimal inference sketch for an EAST-style detector follows the figure below.)

[Figure: the EAST architecture (Zhou et al., CVPR 2017)]
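For concreteness, here is a minimal sketch of running a pretrained EAST-style detector on a lecture frame with OpenCV's dnn module. The model file name, input size and thresholds are illustrative assumptions; the poster does not specify the inference pipeline.

```python
import cv2
import numpy as np

# Assumed frozen EAST model file; not provided by the poster.
net = cv2.dnn.readNet("frozen_east_text_detection.pb")

frame = cv2.imread("lecture_frame.jpg")
orig_h, orig_w = frame.shape[:2]
H, W = 320, 320  # EAST input dimensions must be multiples of 32

blob = cv2.dnn.blobFromImage(frame, 1.0, (W, H),
                             (123.68, 116.78, 103.94), swapRB=True, crop=False)
net.setInput(blob)
scores, geometry = net.forward(["feature_fusion/Conv_7/Sigmoid",
                                "feature_fusion/concat_3"])

# Decode the score and geometry maps into axis-aligned boxes.
conf_thresh, boxes, confidences = 0.5, [], []
for y in range(scores.shape[2]):
    for x in range(scores.shape[3]):
        score = float(scores[0, 0, y, x])
        if score < conf_thresh:
            continue
        # Each output cell corresponds to a 4x4 patch of the input.
        ox, oy = x * 4.0, y * 4.0
        angle = geometry[0, 4, y, x]
        cos, sin = np.cos(angle), np.sin(angle)
        h = geometry[0, 0, y, x] + geometry[0, 2, y, x]
        w = geometry[0, 1, y, x] + geometry[0, 3, y, x]
        ex = int(ox + cos * geometry[0, 1, y, x] + sin * geometry[0, 2, y, x])
        ey = int(oy - sin * geometry[0, 1, y, x] + cos * geometry[0, 2, y, x])
        boxes.append([ex - int(w), ey - int(h), int(w), int(h)])  # x, y, w, h
        confidences.append(score)

# Non-maximum suppression, then rescale boxes to the original frame size.
keep = cv2.dnn.NMSBoxes(boxes, confidences, conf_thresh, 0.4)
sx, sy = orig_w / W, orig_h / H
words = [(int(x * sx), int(y * sy), int(w * sx), int(h * sy))
         for (x, y, w, h) in (boxes[i] for i in np.array(keep).flatten())]
```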
Word Spotting
• Uses the E2E word spotting network of Krishnan et al., DAS 2018.
• Contains 2 parallel streams: one takes the real word image, while the other uses the PHOC features of the label together with a generated synthetic image. (A sketch of the PHOC construction follows.)
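The poster does not define PHOC, so the sketch below shows the standard pyramidal histogram of characters construction (Almazán et al., TPAMI 2014), under assumed choices of alphabet (lowercase letters and digits) and pyramid levels 2-5, a common setting in the word-spotting literature.

```python
import numpy as np

ALPHABET = "abcdefghijklmnopqrstuvwxyz0123456789"  # assumed alphabet
LEVELS = (2, 3, 4, 5)                               # assumed pyramid levels

def build_phoc(word, alphabet=ALPHABET, levels=LEVELS):
    """Binary PHOC vector: for every pyramid level, which characters
    occur in which horizontal region of the word."""
    word = word.lower()
    n = len(word)
    char_index = {c: i for i, c in enumerate(alphabet)}
    parts = []
    for level in levels:
        level_vec = np.zeros((level, len(alphabet)))
        for i, ch in enumerate(word):
            if ch not in char_index:
                continue  # skip characters outside the alphabet
            # Normalized horizontal extent of the i-th character.
            c0, c1 = i / n, (i + 1) / n
            for r in range(level):
                r0, r1 = r / level, (r + 1) / level
                overlap = min(c1, r1) - max(c0, r0)
                # Assign if at least half the character lies in region r.
                if overlap / (c1 - c0) >= 0.5:
                    level_vec[r, char_index[ch]] = 1
        parts.append(level_vec.ravel())
    return np.concatenate(parts)

# e.g. build_phoc("lecture") -> binary vector of length (2+3+4+5)*36 = 504
```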

Word Recognition


• Uses the CRNN word recognition network of Dutta et al., DAS 2018.
• It is a hybrid CNN-RNN neural network. (A minimal sketch follows.)
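The poster names only the hybrid CNN-RNN. Below is a minimal PyTorch sketch of the generic CRNN pattern: a CNN turns a word image into a feature sequence, a bidirectional LSTM models it, and a linear layer emits per-timestep character logits for CTC decoding. Layer sizes are illustrative assumptions, not the configuration of Dutta et al.

```python
import torch
import torch.nn as nn

class CRNN(nn.Module):
    """Minimal CRNN: CNN feature extractor -> BiLSTM -> per-timestep
    character logits, suitable for training with CTC loss."""
    def __init__(self, num_classes, img_height=32):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2, 2),
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2, 2),
            nn.Conv2d(128, 256, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d((2, 1), (2, 1)),   # halve height only, keep width
        )
        feat_h = img_height // 8            # 32 -> 4 after the poolings
        self.rnn = nn.LSTM(256 * feat_h, 256, num_layers=2,
                           bidirectional=True, batch_first=True)
        self.fc = nn.Linear(512, num_classes)  # classes include the CTC blank

    def forward(self, x):                   # x: (B, 1, 32, W)
        f = self.cnn(x)                     # (B, 256, 4, W/4)
        b, c, h, w = f.shape
        f = f.permute(0, 3, 1, 2).reshape(b, w, c * h)  # sequence along width
        out, _ = self.rnn(f)
        return self.fc(out)                 # (B, W/4, num_classes)

# Training uses nn.CTCLoss on log-probabilities, e.g.:
# logits = model(images).log_softmax(2).permute(1, 0, 2)  # (T, B, C) for CTC
# loss = nn.CTCLoss(blank=0)(logits, targets, input_lens, target_lens)
```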
[Figure: cropped word images from the dataset]

6. Benchmarking Results

• Word images from the IAM and MJSynth datasets are used to pre-train the word recognition and spotting models.
• The localizers are trained on the ICDAR 2015 Incidental Scene Text data.

[Figures: quantitative results for localization, spotting and recognition; qualitative results for localization and recognition]

• Recognition and spotting perform best on printed text. Their performance on the other 3 domains, even after fine-tuning, is not close to the current state of the art.
• Localization mostly fails when there is unintended spacing between written words, or when the image resolution is 1280x720 with the text zoomed in.

7. Future Work
• Creating synthetic data and architectures adapted to the lecture video setting.
• Building a state-of-the-art multimodal search engine for instructional videos.

Project Page: https://cvit.iiit.ac.in/research/projects/cvit-projects/lecturevideodb


Acknowledgements: Praveen Krishnan and Minesh Mathew are supported by the TCS Research PhD Fellowship.
