Section: ENGINEERING AND TECHNOLOGY

Feature extraction semilearning and augmented representation for image captioning in crowd scenes

Khang Tan Tran Minh Nguyen 1, *
  1. University of Information Technology, Viet Nam National University Ho Chi Minh City, Viet Nam
Correspondence to: Khang Tan Tran Minh Nguyen, University of Information Technology, Viet Nam National University Ho Chi Minh City, Viet Nam. Email: khangnttm@uit.edu.vn.
Volume & Issue: Vol. 26 No. 4 (2023) | Page No.: 3128-3138 | DOI: 10.32508/stdj.v26i4.4028
Published: 2023-12-31


Copyright The Author(s) 2023. This article is published with open access by Vietnam National University, Ho Chi Minh City, Vietnam. This article is distributed under the terms of the Creative Commons Attribution License (CC-BY 4.0), which permits any use, distribution, and reproduction in any medium, provided the original author(s) and the source are credited.

Abstract

Image captioning has attracted considerable interest since 2015. The task lies at the intersection of Computer Vision and Natural Language Processing. The problem can be described as follows: given a three-channel RGB image as input, a language model is trained to generate a hypothesis caption that describes the image's context. In this study, we focus on image captioning for images captured in crowd scenes, which is more complicated and challenging. A semilearning feature extraction mechanism is proposed to obtain more valuable high-level feature maps of images, and an augmented approach in the Transformer Encoder is explored to enhance its representation ability. The obtained results are promising and outperform those of other state-of-the-art captioning models on the CrowdCaption dataset.
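The captioning setup described above — image features in, token sequence out — can be sketched as a greedy decoding loop. This is a minimal illustration only: the scorer below is a hypothetical stub standing in for a trained encoder-decoder (e.g., extracted feature maps fed to a Transformer decoder), not the paper's actual model, and the tiny vocabulary is invented for the example.

```python
# Minimal sketch of greedy caption decoding. Assumptions (not from the
# paper): the toy VOCAB, the stub scorer, and the fixed target phrase.
from typing import Callable, List, Sequence

VOCAB = ["<bos>", "<eos>", "a", "crowd", "of", "people"]

def greedy_decode(
    score_next: Callable[[Sequence[float], List[str]], List[float]],
    image_features: Sequence[float],
    max_len: int = 10,
) -> List[str]:
    """Repeatedly pick the highest-scoring next token until <eos>."""
    caption = ["<bos>"]
    for _ in range(max_len):
        scores = score_next(image_features, caption)
        next_token = VOCAB[max(range(len(VOCAB)), key=scores.__getitem__)]
        if next_token == "<eos>":
            break
        caption.append(next_token)
    return caption[1:]  # drop <bos>

def stub_scorer(feats: Sequence[float], prefix: List[str]) -> List[float]:
    """Hypothetical stand-in for the learned distribution p(w_t | image, w_<t):
    deterministically walks through one fixed phrase."""
    phrase = ["a", "crowd", "of", "people", "<eos>"]
    target = phrase[min(len(prefix) - 1, len(phrase) - 1)]
    return [1.0 if w == target else 0.0 for w in VOCAB]

print(greedy_decode(stub_scorer, [0.1, 0.2]))  # → ['a', 'crowd', 'of', 'people']
```

In a real system the stub scorer is replaced by the decoder's softmax over the vocabulary, conditioned on the encoder's image feature maps; beam search is often substituted for the greedy argmax.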
