Publication Date: 2023/09/08
Abstract: This project proposes an end-to-end deep learning architecture for word-level visual speech recognition that does not require explicit word boundary information. The methodology combines spatiotemporal convolutional layers, Residual Networks (ResNets), and bidirectional Long Short-Term Memory (Bi-LSTM) networks. The system is trained with the CTC loss function, and the data are preprocessed with facial landmark extraction, image cropping, resizing, grayscale conversion, and data augmentation to focus on the mouth region. The model is implemented in TensorFlow and trained with an adaptive learning rate schedule. With this approach, the proposed system performs end-to-end lip reading from video frames and implicitly identifies keywords in utterances. Evaluation with the CTC loss function confirms the model's effectiveness. The results suggest potential applications in dictation, hearing aids, and biometric authentication, advancing visual speech recognition beyond traditional methods and enabling practical applications.
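To make the described pipeline concrete, the following is a minimal TensorFlow/Keras sketch of the architecture the abstract outlines: a spatiotemporal (3D) convolution over mouth-region frames, a per-frame residual block standing in for a full ResNet trunk, stacked Bi-LSTMs, and CTC training with an Adam optimizer on a decaying learning rate schedule. The input shape (75 grayscale frames of 46x140 pixels), the 40-symbol vocabulary, the layer widths, and the ExponentialDecay schedule are illustrative assumptions, not the paper's exact configuration.

```python
import tensorflow as tf
from tensorflow.keras import layers

# Illustrative assumptions (not from the paper): a 75-frame grayscale
# mouth crop of 46x140 pixels and a 40-symbol output vocabulary.
FRAMES, HEIGHT, WIDTH = 75, 46, 140
NUM_CLASSES = 40  # output symbols, including the CTC blank

def residual_block(x, filters):
    """Per-frame 2D residual block, a stand-in for a full ResNet trunk."""
    shortcut = x
    x = layers.TimeDistributed(
        layers.Conv2D(filters, 3, padding="same", activation="relu"))(x)
    x = layers.TimeDistributed(layers.Conv2D(filters, 3, padding="same"))(x)
    if shortcut.shape[-1] != filters:  # match channel count for the skip path
        shortcut = layers.TimeDistributed(layers.Conv2D(filters, 1))(shortcut)
    return layers.ReLU()(layers.Add()([x, shortcut]))

def build_model():
    inputs = layers.Input(shape=(FRAMES, HEIGHT, WIDTH, 1))
    # Spatiotemporal (3D) convolution over the stacked mouth-region frames.
    x = layers.Conv3D(32, (3, 5, 5), padding="same", activation="relu")(inputs)
    x = layers.MaxPool3D(pool_size=(1, 2, 2))(x)  # downsample space, keep time
    x = residual_block(x, 64)
    x = layers.TimeDistributed(layers.GlobalAveragePooling2D())(x)
    # Bidirectional LSTMs model the frame sequence in both directions.
    x = layers.Bidirectional(layers.LSTM(128, return_sequences=True))(x)
    x = layers.Bidirectional(layers.LSTM(128, return_sequences=True))(x)
    # Per-frame logits for CTC; no softmax, since tf.nn.ctc_loss expects logits.
    logits = layers.Dense(NUM_CLASSES)(x)
    return tf.keras.Model(inputs, logits)

model = build_model()

# Adaptive learning rate: Adam with an exponentially decaying schedule.
schedule = tf.keras.optimizers.schedules.ExponentialDecay(1e-4, 10000, 0.9)
optimizer = tf.keras.optimizers.Adam(learning_rate=schedule)

def ctc_loss(labels, logits, label_lengths):
    # CTC aligns per-frame predictions with the label sequence, so no
    # explicit word boundary information is needed during training.
    logit_lengths = tf.fill([tf.shape(logits)[0]], FRAMES)
    return tf.reduce_mean(tf.nn.ctc_loss(
        labels, logits, label_lengths, logit_lengths,
        logits_time_major=False, blank_index=-1))
```

In this sketch, the preprocessing steps the abstract mentions (facial landmark extraction, cropping, resizing, grayscale conversion, augmentation) are assumed to have already produced the fixed-size mouth-crop clips fed to the input layer.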
Keywords: Recurrent Neural Network, Long Short-Term Memory, Graphics Processing Unit, Solid State Drive, Text-to-Speech, Application Programming Interface, Audio-Visual, Lip Reading, Bidirectional Long Short-Term Memory, Graphical User Interface, Red Green Blue, Mean Squared Error, Mean Absolute Error, Adaptive Moment Estimation.
DOI: https://doi.org/10.5281/zenodo.8327791
PDF: https://ijirst.demo4.arinfotech.co/assets/upload/files/IJISRT23AUG1551.pdf