Open Access




Image captioning has attracted sustained research interest since 2015. The topic lies at the intersection of Computer Vision and Natural Language Processing. The problem can be described as follows: given a three-channel RGB image as input, a language model is trained to generate a hypothesis caption that describes the image's content. In this study, we focus on image captioning for images captured in crowd scenes, which is a more complicated and challenging setting. Specifically, a semilearning feature extraction mechanism is proposed to obtain more informative high-level feature maps from images. Moreover, an augmented approach in the Transformer Encoder is explored to enhance the representation ability. The obtained results are promising and outperform those of other state-of-the-art captioning models on the CrowdCaption dataset.
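To illustrate the general captioning setup the abstract describes (an image encoded into features, then a caption generated token by token), here is a minimal, hypothetical sketch of greedy decoding. The function names and the toy scorer are illustrative stand-ins, not the paper's actual model or the CrowdCaption pipeline.

```python
# Hypothetical sketch of greedy caption decoding. In a real system,
# score_next_token would be a trained Transformer decoder conditioned
# on CNN/Transformer-Encoder image features; here it is a toy stub.

def greedy_decode(image_features, score_next_token, vocab, max_len=20,
                  bos="<bos>", eos="<eos>"):
    """Generate a caption token by token, always picking the
    highest-scoring next word given the image features and the
    partial caption generated so far."""
    caption = [bos]
    for _ in range(max_len):
        # scores: dict mapping candidate token -> score
        scores = score_next_token(image_features, caption)
        next_tok = max(vocab, key=lambda t: scores.get(t, float("-inf")))
        if next_tok == eos:
            break
        caption.append(next_tok)
    return caption[1:]  # drop <bos>


def toy_scorer(feats, partial):
    """Deterministic stand-in for a trained decoder: emits a fixed
    caption one token at a time, then <eos>."""
    order = ["a", "crowd", "of", "people", "<eos>"]
    idx = len(partial) - 1
    target = order[idx] if idx < len(order) else "<eos>"
    return {target: 1.0}


vocab = ["a", "crowd", "of", "people", "<eos>"]
print(greedy_decode("dummy_features", toy_scorer, vocab))
# → ['a', 'crowd', 'of', 'people']
```

In practice, greedy decoding is often replaced by beam search, but the conditioning structure (image features plus partial caption in, next-token scores out) is the same.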

Article Details

Issue: Vol 26 No 4 (2023)
Page No.: 3128-3138
Published: Dec 31, 2023

 Copyright Info


Copyright: The Authors. This is an open access article distributed under the terms of the Creative Commons Attribution License (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

 How to Cite
Nguyen, K. (2023). Feature extraction semilearning and augmented representation for image captioning in crowd scenes. Science and Technology Development Journal, 26(4), 3128-3138.

