SpeakVision: A Comprehensive Survey on End-to-End Sentence Level Lipreading

Ashwini M Rayannavar; Rakshit Chouhan; Aman Ali Gazi; Maitree Rajesh Patel

Publication Date: 2024/12/09

Abstract: SpeakVision is a speech-reading framework that uses an AI-based model to extract speech from audio-video inputs. An integrated approach that uses both sight and sound is needed when the voice signal is obscured or when the speaker's mouth movements are easier to see than the speech is to hear. SpeakVision uses 3D convolutional layers to extract spatial features, bidirectional LSTMs to capture temporal information, and CTC decoding to generate text. Video preprocessing techniques were applied to optimize model performance, and the model was integrated into an easy-to-use Streamlit interface for interactive visualization.
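
The CTC decoding step named in the abstract can be illustrated with a minimal sketch of greedy (best-path) decoding. This is not the authors' implementation: the vocabulary, the blank index, and the function name `ctc_greedy_decode` are assumptions made for illustration, and the per-frame scores stand in for whatever the recurrent layers would output.

```python
from typing import List, Optional, Sequence

# Hypothetical vocabulary for illustration only; index 0 is assumed to be the CTC blank.
BLANK = 0
VOCAB = ["", "a", "b", "c", " "]


def ctc_greedy_decode(frame_scores: Sequence[Sequence[float]],
                      vocab: Sequence[str] = VOCAB,
                      blank: int = BLANK) -> str:
    """Best-path CTC decoding: argmax per frame, merge repeats, drop blanks."""
    # Pick the highest-scoring symbol index for every video frame.
    best_path = [max(range(len(scores)), key=scores.__getitem__)
                 for scores in frame_scores]

    decoded: List[str] = []
    previous: Optional[int] = None
    for index in best_path:
        # Emit a symbol only when it differs from the previous frame and is not blank.
        if index != previous and index != blank:
            decoded.append(vocab[index])
        previous = index
    return "".join(decoded)
```

A blank between two identical symbols keeps them distinct, which is how a repeated letter survives the merge step. Beam search with a language model is a common alternative to this greedy variant, but the abstract does not say which one SpeakVision uses.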

DOI: https://doi.org/10.5281/zenodo.14330071

PDF: https://ijirst.demo4.arinfotech.co/assets/upload/files/IJISRT24NOV1635.pdf
