Efficient and robust multi-camera 3D object detection in bird-eye-view

Authors: Wang, Yuanlong; Jiang, Hengtao; Chen, Guanying; Zhang, Tong; Zhou, Jiaqing; Qing, Zezheng; Wang, Chunyan; Zhao, Wanzhong

Affiliation: Nanjing Univ Aeronaut & Astronaut, Coll Energy & Power Engn, Nanjing, Jiangsu, Peoples R China

Publication: IMAGE AND VISION COMPUTING (Image Vision Comput)

Year/Volume: 2025, Vol. 154

Subject classification: 0808 [Engineering - Electrical Engineering]; 08 [Engineering]; 0835 [Engineering - Software Engineering]; 0812 [Engineering - Computer Science and Technology]; 0702 [Science - Physics]

Funding: Natural Science Foundation of Jiangsu Province [BK20231448]

Keywords: Multi-camera 3D object detection; Autonomous driving

Abstract: Bird's-eye-view (BEV) representations are increasingly used in autonomous driving perception because they provide a comprehensive, unobstructed view of the vehicle's surroundings. Compared with transformer-based or depth-based methods, ray-transformation-based methods are more efficient and better suited to in-vehicle deployment. However, these methods typically depend on accurate extrinsic camera parameters, making them vulnerable to performance degradation when calibration errors or installation changes occur. In this work, we follow ray-transformation-based methods and propose an extrinsic-parameter-free approach that reduces reliance on accurate offline extrinsic calibration by using a neural network to predict the extrinsic parameters online, which effectively improves the robustness of the model. In addition, we propose a multi-level, multi-scale image encoder to better encode image features and adopt a more intensive temporal fusion strategy. Our framework contains four key designs: (1) a multi-level, multi-scale image encoder that leverages inter-layer and intra-layer multi-scale information for better performance; (2) an extrinsic-parameter-free ray transformation that transfers image features to BEV space and lessens the impact of extrinsic disturbances on the model's detection performance; (3) an intensive temporal fusion strategy that uses motion information from five historical frames; and (4) a high-performance BEV encoder that efficiently reduces the spatial dimensions of a voxel-based feature map and fuses the multi-scale and multi-frame BEV features. Experiments on nuScenes show that our best model (R101@900 x 1600) achieves a competitive 41.7% mAP and 53.8% NDS on the validation set, outperforming several state-of-the-art visual BEV models in 3D object detection.
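The abstract gives no implementation details, but the core ray-transformation idea it describes — casting a ray through each image pixel and accumulating that pixel's feature in a BEV grid cell — can be illustrated with a minimal NumPy sketch. Everything below is an assumption for illustration only: the function name, the ground-plane (z = 0) intersection, and the toy camera parameters are not the authors' method, which among other things predicts the extrinsics R and t with a neural network online rather than taking them as fixed inputs.

```python
import numpy as np

def pixel_rays_to_bev(feat, K, R, t, bev_shape=(32, 32), bev_range=20.0):
    """Toy ray transformation (illustrative only): cast a ray through each
    image pixel, intersect it with the ground plane z = 0 in the world
    frame, and scatter the pixel's scalar feature into a BEV grid cell.
    feat: (H, W) feature map; K: 3x3 intrinsics;
    R: 3x3 camera-to-world rotation; t: (3,) camera position in world."""
    H, W = feat.shape
    bev = np.zeros(bev_shape)
    K_inv = np.linalg.inv(K)
    for v in range(H):
        for u in range(W):
            # Ray direction of this pixel, rotated into the world frame.
            d = R @ (K_inv @ np.array([u + 0.5, v + 0.5, 1.0]))
            if d[2] >= -1e-6:          # ray does not point toward the ground
                continue
            s = -t[2] / d[2]           # scale factor so the ray reaches z = 0
            x, y = t[0] + s * d[0], t[1] + s * d[1]
            # Quantize the ground point into the BEV grid ([-range, range] m).
            i = int((x + bev_range) / (2 * bev_range) * bev_shape[0])
            j = int((y + bev_range) / (2 * bev_range) * bev_shape[1])
            if 0 <= i < bev_shape[0] and 0 <= j < bev_shape[1]:
                bev[i, j] += feat[v, u]
    return bev

# Demo with assumed toy values: a 32x32 feature map of ones and a camera
# 1.5 m above the ground, optical axis level with the world x-axis (z up).
K = np.array([[100.0, 0.0, 16.0], [0.0, 100.0, 16.0], [0.0, 0.0, 1.0]])
R = np.array([[0.0, 0.0, 1.0], [-1.0, 0.0, 0.0], [0.0, -1.0, 0.0]])
t = np.array([0.0, 0.0, 1.5])
bev = pixel_rays_to_bev(np.ones((32, 32)), K, R, t)
```

In this sketch only pixels whose rays point below the horizon contribute, and each contributes to exactly one cell; the paper's extrinsic-parameter-free variant would replace the fixed (R, t) inputs with values regressed online by a network, so that miscalibration or camera displacement degrades the projection less.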
