Engineering and Technology - Research article (Open Access)

MULTITASK LEARNING BASED ON ATTENTION AND TRANSFORMER MECHANISM FOR EVENT RECOGNITION AND IMPORTANCE IMAGE PREDICTION IN PHOTO ALBUM

Viet Hoai Vo 1, *
Viet Quoc Le 1
  1. Computer Vision Department, University of Science, VNU-HCM
Correspondence to: Viet Hoai Vo, Computer Vision Department, University of Science, VNU-HCM. Email: vhviet@fit.hcmus.edu.vn.
Published: 2025-12-21

This article is published with open access by Viet Nam National University, Ho Chi Minh City, Viet Nam. This article is distributed under the terms of the Creative Commons Attribution License (CC-BY 4.0) which permits any use, distribution, and reproduction in any medium, provided the original author(s) and the source are credited.

Abstract

Cameras and smartphones are now ubiquitous, which makes taking photos effortless. As the number of photos grows daily, organizing them while preserving the meaningful ones becomes increasingly difficult. This is a challenging and attractive problem in computer vision: how can we build a system that identifies the event type of an album, selects the important photos that must be kept, and automatically deletes the rest? Such a system is valuable for reducing storage and for creating story videos. In this work, we design a multitask network architecture that is trained simultaneously for event recognition and image importance prediction, thereby removing the need for event-type information. The architecture combines the strengths of convolutional neural networks for image description with attention and transformer mechanisms for album description to perform both tasks, providing a practical and effective approach with faster prediction times for both image importance and event recognition. Our approach outperforms the state-of-the-art (SOTA) method by 3% on the image importance task and achieves an accuracy of 67.21% on the event recognition task on the ML-CUFED dataset. The results are evaluated with multiple backbones and parameter settings to demonstrate the generalization of the proposed method.
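The data flow described in the abstract can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: it assumes each image has already been encoded into a feature vector by some CNN backbone, uses a single untrained self-attention layer in place of the full attention and transformer modules, and all names (`multitask_forward`, `init_params`, `w_imp`, `w_pool`, `W_evt`), the feature dimension, and the number of event classes are hypothetical. The point is only to show how one shared album representation can feed both an event head and a per-image importance head.

```python
import math
import random

def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [dot(row, x) for row in W]

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def rand_matrix(rows, cols, rng):
    """Random (untrained) weights; a real model would learn these."""
    scale = 1.0 / math.sqrt(cols)
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]

def self_attention(feats, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention across the images of one album."""
    Q = [matvec(Wq, f) for f in feats]
    K = [matvec(Wk, f) for f in feats]
    V = [matvec(Wv, f) for f in feats]
    d = len(Q[0])
    out = []
    for q in Q:
        weights = softmax([dot(q, k) / math.sqrt(d) for k in K])
        out.append([sum(w * v[j] for w, v in zip(weights, V)) for j in range(d)])
    return out

def init_params(dim, num_events, seed=0):
    """Hypothetical parameter set shared by both tasks."""
    rng = random.Random(seed)
    return {
        "Wq": rand_matrix(dim, dim, rng),
        "Wk": rand_matrix(dim, dim, rng),
        "Wv": rand_matrix(dim, dim, rng),
        "w_imp": rand_matrix(1, dim, rng)[0],    # importance head (one score per image)
        "w_pool": rand_matrix(1, dim, rng)[0],   # attention-pooling scorer
        "W_evt": rand_matrix(num_events, dim, rng),  # event head (one logit per event)
    }

def multitask_forward(feats, params):
    """Return (event_logits, importance_scores) for one album of image features."""
    # Shared step: let every image attend to the other images in the album.
    ctx = self_attention(feats, params["Wq"], params["Wk"], params["Wv"])
    # Task 1: a scalar importance score for each contextualized image.
    importance = [dot(params["w_imp"], c) for c in ctx]
    # Attention pooling: weighted average of image vectors gives an album vector.
    pool_w = softmax([dot(params["w_pool"], c) for c in ctx])
    d = len(ctx[0])
    album_vec = [sum(w * c[j] for w, c in zip(pool_w, ctx)) for j in range(d)]
    # Task 2: event logits computed from the album vector only.
    event_logits = matvec(params["W_evt"], album_vec)
    return event_logits, importance

# Usage with made-up data: an album of 5 images, each an 8-dimensional feature.
rng = random.Random(1)
album = [[rng.gauss(0, 1) for _ in range(8)] for _ in range(5)]
params = init_params(dim=8, num_events=4)
event_logits, importance = multitask_forward(album, params)
print(len(event_logits), len(importance))
```

In this sketch the importance head never receives an event label as input, which mirrors the abstract's claim that joint training removes the need for event-type information. In a trained model the two outputs would be optimized with a combined loss, whose exact form is not stated in the abstract.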
