Author affiliation: Xi'an Jiaotong University, Dept. of Software Engineering, Xi'an 710049, People's Republic of China
Publication: IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY (IEEE Trans. Circuits Syst. Video Technol.)
Year/Volume/Issue: 2024, Vol. 34, No. 1
Pages: 504-517
Funding: National Natural Science Foundation of China
Keywords: information extraction; cross-modal alignment; hierarchical interaction sampling; prior knowledge
Abstract: Document key information extraction (DKIE) is a crucial topic that aims at automatically comprehending documents with complex formats and layouts (invoices, business insurance, etc.). While pre-trained approaches have shown high performance on many DKIE tasks, they suffer from three major challenges. First, these approaches ignore the ambiguity that arises from similar text representations before cross-modal interaction. Second, they do not align cross-modal representations before cross-modal interaction. Finally, the self-attention layers used in cross-modal interaction incur significant computing costs, making it hard to perform joint representation learning over all negative samples. To address these issues, we propose a Dynamical Cross-Modal Alignment Interaction framework (DCMAI). Specifically, 1) a prior knowledge-guided module is designed to adaptively mine fine-grained visual information that disambiguates similar text representations; 2) a crossover alignment loss is formulated to align cross-modal representations before cross-modal interaction; and 3) a hierarchical interaction sampling scheme is introduced to obtain a small but effective subset of cross-modal negative samples, with a contrastive loss employed to improve joint representation learning. Comprehensive experiments show that the proposed DCMAI outperforms competitive baselines, achieving state-of-the-art results on several public downstream benchmarks. Code will be made publicly available.
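The abstract's third component pairs negative-sample subsampling with a contrastive loss. The record does not give DCMAI's exact formulation, so the following is only a minimal sketch of the general idea: a standard InfoNCE-style contrastive loss where each text embedding is pulled toward its matching visual embedding, while the denominator is computed over a small sampled subset of negatives (`neg_idx` is a hypothetical input standing in for whatever the hierarchical sampling scheme produces) rather than all pairs in the batch.

```python
import numpy as np

def info_nce_loss(text_emb, vis_emb, neg_idx, temperature=0.07):
    """Contrastive loss over a sampled subset of cross-modal negatives.

    Illustrative only -- not the paper's formulation. For each text
    embedding, the matching visual embedding (same row) is the positive;
    `neg_idx[i]` lists the indices of the sampled negative visual
    embeddings for anchor i.
    """
    # L2-normalize so dot products are cosine similarities
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    v = vis_emb / np.linalg.norm(vis_emb, axis=1, keepdims=True)

    losses = []
    for i in range(len(t)):
        pos = t[i] @ v[i] / temperature               # positive pair score
        negs = t[i] @ v[neg_idx[i]].T / temperature   # sampled negatives only
        logits = np.concatenate(([pos], negs))
        # cross-entropy with the positive at index 0
        losses.append(-pos + np.log(np.sum(np.exp(logits))))
    return float(np.mean(losses))

# Toy usage: 4 cross-modal pairs, 2 sampled negatives per anchor.
rng = np.random.default_rng(0)
text = rng.normal(size=(4, 8))
vis = text + 0.1 * rng.normal(size=(4, 8))    # roughly aligned modalities
neg_idx = np.array([[1, 2], [0, 3], [0, 1], [1, 2]])
loss = info_nce_loss(text, vis, neg_idx)
```

Restricting the denominator to a sampled subset is what makes the loss tractable when full pairwise attention over all negatives would be too expensive, which is the motivation the abstract gives for the sampling scheme.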