Video Quality Assessment (VQA) using Vision Transformers

Kallam Lalithendar Reddy; Pogaku Sahnaya; Vattikuti Hareen Sai; Gummuluri Venkata Keerthana

Publication Date: 2024/01/18

Abstract: In this paper, we investigate the potential of Vision Transformers in the field of Video Quality Assessment (VQA). Vision Transformers (ViT) were adopted in computer vision based on the working of transformers in Natural Language Processing (NLP) tasks: they internally model the relationships between the input tokens. In NLP the tokens are words, whereas in computer vision the tokens are image patches, so that the model captures the connections between different portions of an image. A ViT-B/16 model pre-trained on ImageNet-1k was used to extract features from the videos, which are then regressed against the videos' MOS scores. The patch embeddings are augmented with positional embeddings and sent to the transformer encoder. The ViT-Base transformer encoder has 12 layers in total; each layer consists of a Layer Norm and a Multi-Head Attention block, followed by another Layer Norm and a Multi-Layer Perceptron (MLP) block. The classifier head of the transformer was removed to obtain the feature vector, since our aim is not classification. Once the features are extracted, a Support Vector Regressor (SVR) with a Radial Basis Function (RBF) kernel is used to assess the video quality.
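To make the encoder structure concrete, the following is a minimal PyTorch sketch of one such pre-norm encoder layer, of the kind stacked 12 times in ViT-Base; the dimensions (768-d embeddings, 12 attention heads, MLP expansion ratio 4) follow the standard ViT-B/16 configuration, and the class name is ours for illustration, not from the paper.

import torch
import torch.nn as nn

class ViTEncoderBlock(nn.Module):
    """One ViT-Base encoder layer: pre-norm Multi-Head Attention followed
    by a pre-norm MLP block, each wrapped in a residual connection."""

    def __init__(self, dim=768, heads=12, mlp_ratio=4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, dim * mlp_ratio),
            nn.GELU(),
            nn.Linear(dim * mlp_ratio, dim),
        )

    def forward(self, x):
        # Self-attention over the patch tokens (queries = keys = values).
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        # Position-wise MLP block.
        x = x + self.mlp(self.norm2(x))
        return x

The overall feature-extraction and regression pipeline can be sketched as below, assuming the timm library for the ImageNet-1k pre-trained ViT-B/16 backbone, OpenCV for frame decoding, and scikit-learn for the SVR; video_paths and mos_scores are placeholders for the KoNViD-1k videos and their MOS labels, and the frame-sampling and average-pooling choices are illustrative assumptions, not necessarily the authors' exact procedure.

import cv2
import numpy as np
import timm
import torch
from sklearn.svm import SVR

# ViT-B/16 pre-trained on ImageNet-1k; num_classes=0 drops the classifier
# head so the forward pass returns the 768-d feature vector, not logits.
backbone = timm.create_model("vit_base_patch16_224", pretrained=True, num_classes=0)
backbone.eval()

IMAGENET_MEAN = torch.tensor([0.485, 0.456, 0.406]).view(3, 1, 1)
IMAGENET_STD = torch.tensor([0.229, 0.224, 0.225]).view(3, 1, 1)

def video_features(path, frame_step=30):
    """Pool per-frame ViT features into one 768-d descriptor per video."""
    cap = cv2.VideoCapture(path)
    feats, idx = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % frame_step == 0:
            # BGR -> RGB, resize to the 224x224 ViT-B/16 input size,
            # scale to [0, 1], and normalize with ImageNet statistics.
            rgb = cv2.cvtColor(cv2.resize(frame, (224, 224)), cv2.COLOR_BGR2RGB)
            x = torch.from_numpy(rgb).float().permute(2, 0, 1) / 255.0
            x = ((x - IMAGENET_MEAN) / IMAGENET_STD).unsqueeze(0)
            with torch.no_grad():
                feats.append(backbone(x).squeeze(0).numpy())
        idx += 1
    cap.release()
    return np.mean(feats, axis=0)

# Regress the pooled features against the MOS labels with an RBF-kernel SVR.
X = np.stack([video_features(p) for p in video_paths])
svr = SVR(kernel="rbf", C=1.0, epsilon=0.1)
svr.fit(X, mos_scores)
predicted_mos = svr.predict(X)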

Keywords: KoNViD-1k Dataset, Vision Transformer, Support Vector Regressor, Attention, Token Embeddings.

DOI: https://doi.org/10.5281/zenodo.10526231

PDF: https://ijirst.demo4.arinfotech.co/assets/upload/files/IJISRT24JAN432.pdf
