Dense Caption Imagining

Vatsal Verma; Darpan Khanna; Gaurvi Vishnoi; Shreyas Raturi


Publication Date: 2023/09/30

Abstract: A great deal of recent research has focused on computer vision and natural language processing. Our work sits at their intersection: generating images from captions. We focus on the low-data regime, using the COCO and CUB datasets, which contain roughly 200k and 11k image-caption pairs, respectively. We use a hierarchical GAN architecture as our baseline [7][24][26]. To improve on this baseline, we try several modifications targeting the up-sampling blocks and adding residual or attention-based layers. We compare the Inception Score of the methods to analyze our results, and we also examine qualitative results to ensure there is minimal mode collapse and memorization. We find that, of all our modifications, replacing the up-sampling technique with a Laplacian pyramid approach built on transposed convolutional layers gives the best results, with only a minimal increase in computation time and memory requirements.
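The sketch below is a minimal, hypothetical illustration (in PyTorch) of the two up-sampling styles the abstract contrasts: a nearest-neighbour up-sample followed by a convolution, as commonly used in StackGAN-style generators, and a learned transposed-convolution block of the kind used in the Laplacian-pyramid variant. Channel counts, kernel sizes, and block structure are illustrative assumptions, not the paper's actual configuration.

```python
# Hypothetical sketch contrasting the baseline and proposed up-sampling blocks.
# Layer sizes here are assumptions for illustration only.
import torch
import torch.nn as nn


def upsample_block_baseline(in_ch: int, out_ch: int) -> nn.Sequential:
    """Nearest-neighbour up-sampling followed by a 3x3 convolution (baseline style)."""
    return nn.Sequential(
        nn.Upsample(scale_factor=2, mode="nearest"),
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1, bias=False),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )


def upsample_block_transposed(in_ch: int, out_ch: int) -> nn.Sequential:
    """Learned 2x up-sampling with a transposed convolution (proposed variant style)."""
    return nn.Sequential(
        nn.ConvTranspose2d(in_ch, out_ch, kernel_size=4, stride=2, padding=1, bias=False),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )


if __name__ == "__main__":
    x = torch.randn(1, 128, 16, 16)            # feature map at 16x16 resolution
    y_base = upsample_block_baseline(128, 64)(x)
    y_tconv = upsample_block_transposed(128, 64)(x)
    print(y_base.shape, y_tconv.shape)         # both: torch.Size([1, 64, 32, 32])
```

Both blocks double the spatial resolution; the transposed-convolution version learns its interpolation kernel, which is the kind of change the abstract reports as yielding the best Inception Score for a minimal cost in computation and memory.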

Keywords: Computer Vision, Natural Language Processing, StackGAN, Image Captioning, Machine Learning, Deep Learning.

DOI: https://doi.org/10.5281/zenodo.8394992

PDF: https://ijirst.demo4.arinfotech.co/assets/upload/files/IJISRT22MAY1170.pdf
