A REVIEW ON AUTOMATIC IMAGE CAPTIONING GENERATION

Authors

  • Rachita Dubey, Computer Science & Engineering, Dr. C. V. Raman University, Bilaspur, Chhattisgarh, India
  • Rohit Miri, Computer Science & Engineering, Dr. C. V. Raman University, Bilaspur, Chhattisgarh, India

DOI:

https://doi.org/10.29121/shodhkosh.v5.i6.2024.4050

Keywords:

Image Captioning, CNN, RNN, DNN, LSTM, MSCOCO, Flickr30K, Flickr8K

Abstract [English]

Image captioning is a popular approach for generating descriptive natural language from images. Assessing an image and then writing an appropriate caption using computer vision approaches is a difficult task in artificial intelligence. The motive of this paper is to review related studies in image captioning. Numerous studies on image captioning have been conducted, yet optimum precision is still needed for accurate and precise captioning. To create well-organized sentences, a system must consider both semantic and syntactic factors. A better caption requires detecting the objects in the image and explaining how they relate to one another, or expressing the activity in accordance with the situation in the image. Image captioning can be accomplished using a variety of machine learning techniques, and numerous studies have used CNN, RNN, DNN, LSTM, and other approaches. Most researchers evaluated their systems on a variety of benchmarks, including Flickr8K, Flickr30K, MSCOCO, and others. Flickr8K, which contains 8,092 images for testing a system's performance, is the most widely used dataset.
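The CNN-plus-LSTM pipeline surveyed in this review can be illustrated with a minimal sketch. This is not any particular author's system: the network below (class name `CaptionModel`, vocabulary size, embedding and hidden dimensions) is entirely hypothetical, and a real captioner would use a pretrained CNN backbone and a trained vocabulary. It shows only the architectural idea the abstract describes: a CNN encodes the image into a feature vector, which seeds an LSTM that predicts caption words step by step.

```python
import torch
import torch.nn as nn

class CaptionModel(nn.Module):
    """Toy CNN-encoder + LSTM-decoder captioner (hypothetical dimensions)."""

    def __init__(self, vocab_size=1000, embed_dim=128, hidden_dim=256):
        super().__init__()
        # Stand-in CNN encoder; real systems use a pretrained network
        # (e.g. a large ImageNet-trained CNN) instead of this tiny stack.
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(16, embed_dim),
        )
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, images, captions):
        # The image feature acts as the first "token" fed to the LSTM.
        feats = self.encoder(images).unsqueeze(1)   # (B, 1, E)
        words = self.embed(captions)                # (B, T, E)
        seq = torch.cat([feats, words], dim=1)      # (B, T+1, E)
        hidden, _ = self.lstm(seq)
        return self.out(hidden)                     # (B, T+1, vocab_size)

model = CaptionModel()
logits = model(torch.randn(2, 3, 64, 64), torch.randint(0, 1000, (2, 5)))
print(logits.shape)  # torch.Size([2, 6, 1000])
```

During training, the per-step logits are compared against the ground-truth caption with cross-entropy loss; at inference, words are sampled or beam-searched one step at a time.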


 

References

Lee, S., & Kim, I. (2018). Multimodal feature learning for video captioning. Mathematical Problems in Engineering, 2018. DOI: https://doi.org/10.1155/2018/3125879


Pranoy Radhakrishnan (IIT Madras), Towards Data Science, Sept 29, 2017.

K. Vijay and D. Ramya, "Generation of caption selection for news images using stemming algorithm," 2015 International Conference on Computation of Power, Energy, Information and Communication (ICCPEIC), 2015, pp. 0536-0540. DOI: https://doi.org/10.1109/ICCPEIC.2015.7259513

X. Wei, Y. Qi, J. Liu and F. Liu, "Image retrieval by dense caption reasoning," 2017 IEEE Visual Communications and Image Processing (VCIP), 2017, pp. 1-4. DOI: https://doi.org/10.1109/VCIP.2017.8305157

N. Yu, X. Hu, B. Song, J. Yang and J. Zhang, "Topic-Oriented Image Captioning Based on Order-Embedding," IEEE Transactions on Image Processing, vol. 28, no. 6, pp. 2743-2754, June 2019. DOI: https://doi.org/10.1109/TIP.2018.2889922

M. Yang et al., "Multitask Learning for Cross-Domain Image Captioning," IEEE Transactions on Multimedia, vol. 21, no. 4, pp. 1047-1061, April 2019. DOI: https://doi.org/10.1109/TMM.2018.2869276

Sharma, G., Kalena, P., Malde, N., Nair, A., & Parkar, S. (2019). Visual Image Caption Generator Using Deep Learning. SSRN Electronic Journal. DOI: https://doi.org/10.2139/ssrn.3368837

R. C. Luo, Y.-T. Hsu, Y.-C. Wen and H.-J. Ye, "Visual Image Caption Generation for Service Robotics and Industrial Applications," 2019 IEEE International Conference on Industrial Cyber Physical Systems (ICPS), 2019, pp. 827-832. DOI: https://doi.org/10.1109/ICPHYS.2019.8780171

Wadhwa, T., Virk, H., Aghav, J., & Borole, S. (2020). Image Captioning using Deep Learning. International Journal for Research in Applied Science & Engineering Technology (IJRASET), 8(VI), 1430-1435. DOI: https://doi.org/10.22214/ijraset.2020.6232

Krishnakumar, B., Kousalya, K., Gokul, S., Karthikeyan, R., & Kaviyarasu, D. (2020). Image Caption Generator Using Deep Learning. International Journal of Advanced Science and Technology, 29(3s), 975-980.

G. Hoxha, F. Melgani and J. Slaghenauffi, "A New CNN-RNN Framework For Remote Sensing Image Captioning," 2020 Mediterranean and Middle-East Geoscience and Remote Sensing Symposium (M2GARSS), 2020, pp. 1-4. DOI: https://doi.org/10.1109/M2GARSS47143.2020.9105191

H. Yanagimoto and M. Shozu, "Multiple Perspective Caption Generation with Attention Mechanism," 2020 9th International Congress on Advanced Applied Informatics (IIAI-AAI), 2020, pp. 110-115. DOI: https://doi.org/10.1109/IIAI-AAI50415.2020.00031

S. Li and L. Huang, "Context-based Image Caption using Deep Learning," 2021 6th International Conference on Intelligent Computing and Signal Processing (ICSP), 2021, pp. 820-823. DOI: https://doi.org/10.1109/ICSP51882.2021.9408871

Panicker, M. J., Upadhayay, V., Sethi, G., & Mathur, V. (2021). Image Caption Generator. International Journal of Innovative Technology and Exploring Engineering (IJITEE), ISSN: 2278-3075, 10(3), January 2021. DOI: https://doi.org/10.35940/ijitee.C8383.0110321



Kesavan, V., Muley, V., & Kolhekar, M. (2019). Deep Learning based Automatic Image Caption Generation. pp. 1-6. DOI: https://doi.org/10.1109/GCAT47503.2019.8978293

C. Amritkar and V. Jabade, "Image Caption Generation Using Deep Learning Technique," 2018 Fourth International Conference on Computing Communication Control and Automation (ICCUBEA), 2018, pp. 1-4. DOI: https://doi.org/10.1109/ICCUBEA.2018.8697360

S. M. Xi and Y. I. Cho, "Image caption automatic generation method based on weighted feature," 2013 13th International Conference on Control, Automation and Systems (ICCAS 2013), 2013, pp. 548-551. DOI: https://doi.org/10.1109/ICCAS.2013.6703998

Mathur, P., Gill, A., Yadav, A., Mishra, A., & Bansode, N. (2017). Camera2Caption: A real-time image caption generator. pp. 1-6. DOI: https://doi.org/10.1109/ICCIDS.2017.8272660

D. Elliott and F. Keller. Image description using visual dependency representations. In EMNLP, 2013. DOI: https://doi.org/10.18653/v1/D13-1128

A. Farhadi, M. Hejrati, M. A. Sadeghi, P. Young, C. Rashtchian, J. Hockenmaier, and D. Forsyth. Every picture tells a story: Generating sentences from images. In ECCV, 2010. DOI: https://doi.org/10.1007/978-3-642-15561-1_2

R. Gerber and H.-H. Nagel. Knowledge representation for the generation of quantified natural language descriptions of vehicle traffic in image sequences. In ICIP. IEEE, 1996.

Y. Gong, L. Wang, M. Hodosh, J. Hockenmaier, and S. Lazebnik. Improving image-sentence embeddings using large weakly annotated photo collections. In ECCV, 2014. DOI: https://doi.org/10.1007/978-3-319-10593-2_35

S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9(8), 1997. DOI: https://doi.org/10.1162/neco.1997.9.8.1735

M. Hodosh, P. Young, and J. Hockenmaier. Framing image description as a ranking task: Data, models and evaluation metrics. JAIR, 47, 2013. DOI: https://doi.org/10.1613/jair.3994

S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In arXiv:1502.03167, 2015.

A. Karpathy, A. Joulin, and L. Fei-Fei. Deep fragment embeddings for bidirectional image sentence mapping. NIPS, 2014.

R. Kiros, R. Salakhutdinov, and R. S. Zemel. Unifying visual-semantic embeddings with multimodal neural language models. In arXiv:1411.2539, 2014.

R. Kiros, R. Salakhutdinov, and R. Zemel. Multimodal neural language models. In NIPS Deep Learning Workshop, 2013.

G. Kulkarni, V. Premraj, S. Dhar, S. Li, Y. Choi, A. C. Berg, and T. L. Berg. Baby talk: Understanding and generating simple image descriptions. In CVPR, 2011. DOI: https://doi.org/10.1109/CVPR.2011.5995466

P. Kuznetsova, V. Ordonez, A. C. Berg, T. L. Berg, and Y. Choi. Collective generation of natural image descriptions. In ACL, 2012.


Published

2024-06-30

How to Cite

Dubey, R., & Miri, R. (2024). A REVIEW ON AUTOMATIC IMAGE CAPTIONING GENERATION. ShodhKosh: Journal of Visual and Performing Arts, 5(6), 473–481. https://doi.org/10.29121/shodhkosh.v5.i6.2024.4050