Author affiliations: School of Vehicle and Mobility, Tsinghua University; School of Information Technology & Management, University of International Business and Economics; Qcraft Inc.; School of Electronic Engineering and Computer Science, Queen Mary University of London
Publication: Science China (Information Sciences) (中国科学:信息科学(英文版))
Year/Volume/Issue: 2025, Vol. 68, No. 2
Pages: 134-150
Subject classification: 082304 [Engineering - Vehicle Operation Engineering]; 08 [Engineering]; 080203 [Engineering - Mechanical Design and Theory]; 080204 [Engineering - Vehicle Engineering]; 0802 [Engineering - Mechanical Engineering]; 0823 [Engineering - Transportation Engineering]
Funding: Supported by the Beijing Higher Education Society under the 2024 General Project Scheme (Grant No. MS2024128) and by the Ningbo Philosophy and Social Science Planning Project, as part of the "Ningbo Development Blue Book 2025" Initiative (Grant No. GL24-16).
Keywords: visual localization; semantic map; bird-eye-view; transformer; pose estimation
Abstract: Accurate localization is fundamental to autonomous driving. Traditional visual localization frameworks approach the semantic map-matching problem with geometric models, which rely on complex parameter tuning and thus hinder large-scale deployment. In this paper, we propose BEV-Locator: an end-to-end visual semantic localization neural network using multi-view camera images. Specifically, a visual BEV (bird-eye-view) encoder extracts and flattens the multi-view images into BEV space, while the semantic map features are structurally embedded as map query sequences. A cross-modal transformer then associates the BEV features with the semantic map queries, and the localization information of the ego-vehicle is recursively queried out by cross-attention modules. Finally, the ego pose is inferred by decoding the transformer outputs. The end-to-end design gives the model broad applicability across different driving environments, including high-speed scenarios. We evaluate the proposed method on the large-scale nuScenes and Qcraft datasets. The experimental results show that BEV-Locator can estimate vehicle poses under versatile scenarios by effectively associating cross-modal information from multi-view images and global semantic maps. The experiments report satisfactory accuracy, with mean absolute errors of 0.052 m, 0.135 m, and 0.251° in lateral translation, longitudinal translation, and heading angle, respectively.
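For orientation, the following PyTorch-style sketch illustrates the data flow described in the abstract (multi-view images flattened into BEV tokens, semantic map elements embedded as a query sequence, cross-attention decoding, and a pose head). All module choices, dimensions, and the (lateral, longitudinal, heading) pose parameterization are illustrative assumptions, not the authors' released implementation.

```python
# Minimal sketch of the BEV-Locator data flow described in the abstract.
# Module names, dimensions, and the pose parameterization are assumptions
# made for illustration only; they are not the paper's implementation.
import torch
import torch.nn as nn

class BEVLocatorSketch(nn.Module):
    def __init__(self, d_model=256, num_map_queries=100, num_layers=6):
        super().__init__()
        # Placeholder image backbone standing in for the visual BEV encoder
        # that extracts and flattens multi-view features into BEV space.
        self.backbone = nn.Conv2d(3, d_model, kernel_size=7, stride=4, padding=3)
        # Semantic map elements structurally embedded as a query sequence.
        self.map_query_embed = nn.Embedding(num_map_queries, d_model)
        # Cross-modal transformer decoder: map queries attend to BEV features.
        decoder_layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(decoder_layer, num_layers=num_layers)
        # Pose head decodes the queried features into (dx, dy, dyaw).
        self.pose_head = nn.Linear(d_model, 3)

    def forward(self, multi_view_images):
        # multi_view_images: (B, N_cams, 3, H, W)
        b, n, c, h, w = multi_view_images.shape
        feats = self.backbone(multi_view_images.flatten(0, 1))   # (B*N, d, h', w')
        bev_tokens = feats.flatten(2).transpose(1, 2)             # (B*N, h'*w', d)
        bev_tokens = bev_tokens.reshape(b, -1, feats.shape[1])    # (B, N*h'*w', d)
        queries = self.map_query_embed.weight.unsqueeze(0).expand(b, -1, -1)
        decoded = self.decoder(tgt=queries, memory=bev_tokens)    # cross-attention
        pose = self.pose_head(decoded.mean(dim=1))                # (B, 3): dx, dy, dyaw
        return pose
```

In practice the placeholder convolutional backbone would be replaced by the paper's visual BEV encoder (including the view transformation to BEV space), and the map queries would be constructed from the actual semantic map elements around the initial pose rather than from a learned embedding table alone.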