VG-CALF: A vision-guided cross-attention and late-fusion network for radiology images in Medical Visual Question Answering

Lameesa, Aiman; Silpasuwanchai, Chaklam; Alam, Sakib Bin

Please use this identifier to cite or link to this item: http://dspace.aiub.edu:8080/jspui/handle/123456789/2601

Full metadata record

DC Field	Value	Language
dc.contributor.author	Lameesa, Aiman	-
dc.contributor.author	Silpasuwanchai, Chaklam	-
dc.contributor.author	Alam, Sakib Bin	-
dc.date.accessioned	2025-02-26T03:59:32Z	-
dc.date.available	2025-02-26T03:59:32Z	-
dc.date.issued	1-01-14	-
dc.identifier.citation	1	en_US
dc.identifier.uri	http://dspace.aiub.edu:8080/jspui/handle/123456789/2601	-
dc.description.abstract	Image and question matching is essential in Medical Visual Question Answering (MVQA) in order to accurately assess the visual-semantic correspondence between an image and a question. However, recent state-of-the-art methods focus solely on contrastive learning between an entire image and a question. Though contrastive learning successfully models the global relationship between an image and a question, it is less effective at capturing the fine-grained alignments between image regions and question words. In contrast, large-scale pre-training poses significant drawbacks, including extended training times, substantial data volumes, and high computational requirements. To address these challenges, we propose the Vision-Guided Cross-Attention based Late Fusion (VG-CALF) network, which integrates image and question features into a unified deep model without relying on pre-training for MVQA tasks. In our proposed approach, we use self-attention to effectively leverage intra-modal relationships within each modality and implement vision-guided cross-attention to emphasize the inter-modal relationships between image regions and question words. By simultaneously considering intra-modal and inter-modal relationships, our proposed method significantly improves the overall performance of MVQA without the need for pre-training on extensive image-question pairs. Experimental results on benchmark datasets such as SLAKE and VQA-RAD demonstrate that our proposed approach performs competitively with existing state-of-the-art methods.	en_US
dc.language.iso	en_US	en_US
dc.publisher	Elsevier	en_US
dc.subject	Vision-guided	en_US
dc.subject	Cross-attention	en_US
dc.subject	Late-fusion	en_US
dc.subject	Medical visual question answering	en_US
dc.title	VG-CALF: A vision-guided cross-attention and late-fusion network for radiology images in Medical Visual Question Answering	en_US
dc.type	Article	en_US
Appears in Collections:	Publications: Journals
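
The abstract in the record above describes three steps: self-attention within each modality, vision-guided cross-attention between image regions and question words, and late fusion of the results. The following is a minimal sketch of that flow, not the authors' implementation. It assumes that image regions act as queries and question words as keys and values, that fusion is a concatenation of mean-pooled features, and that no learned projection weights are used; the function names (attention, vg_calf_sketch, and the helpers) are hypothetical and chosen only for illustration.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    """Swap rows and columns of a matrix."""
    return [list(col) for col in zip(*a)]


def softmax_rows(scores):
    """Apply a numerically stable softmax to each row."""
    out = []
    for row in scores:
        peak = max(row)
        exps = [math.exp(s - peak) for s in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def attention(queries, keys, values):
    """Scaled dot-product attention; returns (attended values, weights)."""
    dim = len(keys[0])
    scores = matmul(queries, transpose(keys))
    scaled = [[s / math.sqrt(dim) for s in row] for row in scores]
    weights = softmax_rows(scaled)
    return matmul(weights, values), weights


def mean_pool(rows):
    """Average a list of feature vectors into one vector."""
    count = len(rows)
    return [sum(col) / count for col in zip(*rows)]


def vg_calf_sketch(image_regions, question_words):
    """Illustrative pipeline: self-attention, cross-attention, late fusion."""
    # Intra-modal step: each modality attends to itself.
    img_self, _ = attention(image_regions, image_regions, image_regions)
    txt_self, _ = attention(question_words, question_words, question_words)
    # Inter-modal step (assumption): image regions are the queries,
    # question words supply the keys and values.
    cross, weights = attention(img_self, txt_self, txt_self)
    # Late fusion (assumption): concatenate the pooled cross-modal
    # features with the pooled question features.
    fused = mean_pool(cross) + mean_pool(txt_self)
    return fused, weights
```

With 4 image-region vectors and 5 question-word vectors of dimension 8, vg_calf_sketch returns a fused vector of length 16 and a 4-by-5 weight matrix whose rows each sum to 1, giving one distribution over question words per image region.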

Files in This Item:

File	Description	Size	Format
VG-CALF A vision-guided cross-attention and late-fusion network for radiology images in Medical Visual Question Answering.pdf		600.97 kB	Adobe PDF

