Search results - Inner Mongolia University Library

32nd ACM International Conference on Multimedia, MM 2024

Authors: Lin, Yitai; Wei, Zhijie; Zhang, Wanfa; Lin, Xiping; Dai, Yudi; Wen, Chenglu; Shen, Siqi; Xu, Lan; Wang, Cheng
Affiliations: Fujian Key Laboratory of Sensing and Computing for Smart Cities, Xiamen University, Xiamen, China; Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University, Xiamen, China; Xiamen University, Xiamen, China; ShanghaiTech University, Shanghai, China

ISBN: (print) 9798400706868

We introduce HmPEAR, a novel dataset crafted for advancing research in 3D Human Pose Estimation (3D HPE) and Human Action Recognition (HAR), with a primary focus on outdoor environments. This dataset offers a synchronized collection of imagery, LiDAR point clouds, 3D human poses, and action categories. In total, the dataset encompasses over 300,000 frames collected from 10 distinct scenes and 25 diverse subjects. Among these, 250,000 frames of data contain 3D human pose annotations captured using an advanced motion capture system and further optimized for accuracy. Furthermore, the dataset annotates 40 types of daily human actions, resulting in over 6,000 action clips. Through extensive experimentation, we have demonstrated the quality of HmPEAR and highlighted the challenges it presents to current methodologies. Additionally, we propose baselines leveraging sequential images and point clouds for 3D HPE and HAR, which underscore the mutual reinforcement between them, highlighting the potential for cross-task synergies. The dataset is available at http://***/hmpear. © 2024 ACM.

Keywords: Human form models

Deep Instruction Tuning for Segment Anything Model

32nd ACM International Conference on Multimedia, MM 2024

Authors: Huang, Xiaorui; Luo, Gen; Zhu, Chaoyang; Tong, Bo; Zhou, Yiyi; Sun, Xiaoshuai; Ji, Rongrong
Affiliations: Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University, Fujian, Xiamen, China; The Department of Computer Science and Engineering, The Hong Kong University of Science and Technology, Hong Kong

ISBN: (print) 9798400706868

Recently, the Segment Anything Model (SAM) has become a research hotspot in the fields of multimedia and computer vision, exhibiting powerful yet versatile capabilities on various (un)conditional image segmentation tasks. Although SAM can support different types of segmentation prompts, we note that, compared to point- and box-guided segmentation, it performs much worse on text-instructed tasks, e.g., referring image segmentation (RIS). In this paper, we argue that deep text instruction tuning is key to mitigating this shortcoming, which is caused by the shallow fusion scheme in its default light-weight mask decoder. To address this issue, we propose two simple yet effective deep instruction tuning (DIT) methods for SAM, one end-to-end and the other layer-wise. With minimal modifications, DITs can directly transform the image encoder of SAM into a stand-alone vision-language learner, in contrast to building another deep fusion branch, maximizing the benefit of its superior segmentation capability. Extensive experiments on three highly competitive benchmark datasets of RIS show that a simple end-to-end DIT can improve SAM by a large margin, while the layer-wise DIT can further boost the performance to state-of-the-art with much less data and training expenditure. Our code is released at: https://***/wysnzzzz/DIT. © 2024 ACM.

Keywords: Deep learning

SimCLIP: Refining Image-Text Alignment with Simple Prompts for Zero-/Few-shot Anomaly Detection

32nd ACM International Conference on Multimedia, MM 2024

Authors: Deng, Chenghao; Xu, Haote; Chen, Xiaolu; Xu, Haodi; Tu, Xiaotong; Ding, Xinghao; Huang, Yue
Affiliations: Institute of Artificial Intelligence, Xiamen University, Xiamen, China; School of Informatics, Xiamen University, Xiamen, China; School of Informatics, Xiamen University, Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Xiamen, China

ISBN: (print) 9798400706868

Recently, large pre-trained vision-language models, such as CLIP, have demonstrated significant potential in zero-/few-shot anomaly detection tasks. However, existing methods not only rely on expert knowledge to manually craft extensive text prompts but also suffer from a misalignment of high-level language features with fine-level vision features in anomaly segmentation tasks. In this paper, we propose a method, named SimCLIP, which focuses on addressing the aforementioned misalignment through bidirectional adaptation of both Multi-Hierarchy Vision Adapter (MHVA) and Implicit Prompt Tuning (IPT). In this way, our approach requires only a simple binary prompt to efficiently accomplish anomaly classification and segmentation tasks in zero-shot scenarios. Furthermore, we introduce its few-shot extension, SimCLIP+, integrating the relational information among vision embeddings and merging the cross-modal synergy information between vision and language to address downstream anomaly detection tasks. Extensive experiments on two challenging datasets show that our method generalizes better than current SOTA approaches. Our code is available at https://***/CH-ORGI/SimCLIP. © 2024 ACM.

Keywords: Zero-shot learning

LightMotion: A Light and Tuning-Free Method for Simulating Camera Motion in Video Generation

arXiv, 2025

Authors: Song, Quanjian; Lin, Zhihang; Zeng, Zhanpeng; Zhang, Ziyue; Cao, Liujuan; Ji, Rongrong
Affiliations: Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University, China

Existing camera motion-controlled video generation methods face computational bottlenecks in fine-tuning and inference. This paper proposes LightMotion, a light and tuning-free method for simulating camera motion in video generation. Operating in the latent space, it eliminates additional fine-tuning, inpainting, and depth estimation, making it more streamlined than existing methods. The endeavors of this paper comprise: (i) The latent space permutation operation effectively simulates various camera motions like panning, zooming, and rotation. (ii) The latent space resampling strategy combines background-aware sampling and cross-frame alignment to accurately fill new perspectives while maintaining coherence across frames. (iii) Our in-depth analysis shows that the permutation and resampling cause an SNR shift in latent space, leading to poor-quality generation. To address this, we propose latent space correction, which reintroduces noise during denoising to mitigate SNR shift and enhance video generation quality. Exhaustive experiments show that our LightMotion outperforms existing methods, both quantitatively and qualitatively. Copyright © 2025, The Authors. All rights reserved.

Keywords: Signal to noise ratio

Evolving High-Quality Rendering and Reconstruction in a Unified Framework with Contribution-Adaptive Regularization

arXiv, 2025

Authors: Shen, You; Zhang, Zhipeng; Li, Xinyang; Qu, Yansong; Lin, Yu; Zhang, Shengchuan; Cao, Liujuan
Affiliations: Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University, China

Representing 3D scenes from multiview images is a core challenge in computer vision and graphics, which requires both precise rendering and accurate reconstruction. Recently, 3D Gaussian Splatting (3DGS) has garnered significant attention for its high-quality rendering and fast inference speed. Yet, due to the unstructured and irregular nature of Gaussian point clouds, ensuring accurate geometry reconstruction remains difficult. Existing methods primarily focus on geometry regularization, with common approaches including primitive-based and dual-model frameworks. However, the former suffers from inherent conflicts between rendering and reconstruction, while the latter is computationally and storage-intensive. To address these challenges, we propose CarGS, a unified model leveraging Contribution-adaptive regularization to achieve simultaneous, high-quality rendering and surface reconstruction. The essence of our framework is learning adaptive contribution for Gaussian primitives by squeezing the knowledge from geometry regularization into a compact MLP. Additionally, we introduce a geometry-guided densification strategy with clues from both normals and Signed Distance Fields (SDF) to improve the capability of capturing high-frequency details. Our design improves the mutual learning of the two tasks, while its unified structure does not require separate models as in dual-model approaches, which keeps it efficient. Extensive experiments demonstrate CarGS’s ability to achieve state-of-the-art (SOTA) results in both rendering fidelity and reconstruction accuracy while maintaining real-time speed and minimal storage size. Copyright © 2025, The Authors. All rights reserved.

Keywords: 3D reconstruction

Training-Free Hierarchical Scene Understanding for Gaussian Splatting with Superpoint Graphs

arXiv, 2025

Authors: Dai, Shaohui; Qu, Yansong; Li, Zheyan; Li, Xinyang; Zhang, Shengchuan; Cao, Liujuan
Affiliations: Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University, China

Bridging natural language and 3D geometry is a crucial step toward flexible, language-driven scene understanding. While recent advances in 3D Gaussian Splatting (3DGS) have enabled fast and high-quality scene reconstruction, research has also explored incorporating open-vocabulary understanding into 3DGS. However, most existing methods require iterative optimization over per-view 2D semantic feature maps, which not only results in inefficiencies but also leads to inconsistent 3D semantics across views. To address these limitations, we introduce a training-free framework that constructs a superpoint graph directly from Gaussian primitives. The superpoint graph partitions the scene into spatially compact and semantically coherent regions, forming view-consistent 3D entities and providing a structured foundation for open-vocabulary understanding. Based on the graph structure, we design an efficient reprojection strategy that lifts 2D semantic features onto the superpoints, avoiding costly multi-view iterative training. The resulting representation ensures strong 3D semantic coherence and naturally supports hierarchical understanding, enabling both coarse- and fine-grained open-vocabulary perception within a unified semantic field. Extensive experiments demonstrate that our method achieves state-of-the-art open-vocabulary segmentation performance, with semantic field reconstruction completed over 30× faster. Our code will be available at https://***/Atrovast/THGS. Copyright © 2025, The Authors. All rights reserved.

Keywords: Gaussian distribution

Director3D: Real-world Camera Trajectory and 3D Scene Generation from Text

38th Conference on Neural Information Processing Systems, NeurIPS 2024

Authors: Li, Xinyang; Lai, Zhangyu; Xu, Linning; Qu, Yansong; Cao, Liujuan; Zhang, Shengchuan; Dai, Bo; Ji, Rongrong
Affiliations: Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University, China; Shanghai Artificial Intelligence Laboratory, China; The Chinese University of Hong Kong, Hong Kong; University of Hong Kong, Hong Kong

Recent advancements in 3D generation have leveraged synthetic datasets with ground truth 3D assets and predefined camera trajectories. However, the potential of adopting real-world datasets, which can produce significantly more realistic 3D scenes, remains largely unexplored. In this work, we delve into the key challenge of the complex and scene-specific camera trajectories found in real-world captures. We introduce Director3D, a robust open-world text-to-3D generation framework, designed to generate both real-world 3D scenes and adaptive camera trajectories. To achieve this, (1) we first utilize a Trajectory Diffusion Transformer, acting as the Cinematographer, to model the distribution of camera trajectories based on textual descriptions. (2) Next, a Gaussian-driven Multi-view Latent Diffusion Model serves as the Decorator, modeling the image sequence distribution given the camera trajectories and texts. This model, fine-tuned from a 2D diffusion model, directly generates pixel-aligned 3D Gaussians as an immediate 3D scene representation for consistent denoising. (3) Lastly, the 3D Gaussians are further refined by a novel SDS++ loss as the Detailer, which incorporates the prior of the 2D diffusion model. Extensive experiments demonstrate that Director3D outperforms existing methods, offering superior performance in real-world 3D generation. © 2024 Neural information processing systems foundation. All rights reserved.

ESR-DDLN: Enhanced Single Image Super-Resolution Via Dual-Domain Learning Network

IEEE International Conference on Multimedia and Expo (ICME)

Authors: Zihao He; Shengchuan Zhang
Affiliations: Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University

ISBN: (digital) 9798350390155

ISBN: (print) 9798350390162

Most existing CNN-based super-resolution (SR) methods focus solely on the spatial domain. We argue that frequency domain details are essential for reconstructing fine textures and patterns. To leverage the frequency information, this paper presents a novel Dual-Domain Learning Network (DDLN) for enhanced image SR. Specifically, DDLN includes Deep Dual-Domain Learning Blocks (DDLB), a Cross Modal Distillation Loss and a pioneering Discriminator. First, DDLB can capture comprehensive image details via simultaneous feature optimization in spatial and frequency domains. Next, the Cross Modal Distillation Loss guides the fusion of spatial and frequency features, enhancing the network’s learning capability. Finally, the pioneering Discriminator with full complex-valued convolution processes images converted from HSV to complex form, boosting SR image quality and realism. Comparative experiments on standard datasets demonstrate significant improvements over current techniques, showcasing the potential of dual-domain approaches in SR and offering novel insights for future research.

Keywords: Measurement; Image quality; Convolution; Frequency-domain analysis; Superresolution; Boosting; Spatial resolution; Standards; Optimization; Image reconstruction

Representation Purification for End-to-End Speech Translation

31st International Conference on Computational Linguistics, COLING 2025

Authors: Zhang, Chengwei; Zhou, Yue; Zhao, Rui; Chen, Yidong; Shi, Xiaodong
Affiliations: School of Informatics, Xiamen University, China; Key Laboratory of Digital Protection and Intelligent Processing of Intangible Cultural Heritage of Fujian and Taiwan, Ministry of Culture and Tourism, China; Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University, Xiamen, China

ISBN: (print) 9798891761964

Speech-to-text translation (ST) is a cross-modal task that involves converting spoken language into text in a different language. Previous research primarily focused on enhancing speech translation by facilitating knowledge transfer from machine translation, exploring various methods to bridge the gap between speech and text modalities. Despite substantial progress made, factors in speech that are not relevant to translation content, such as timbre and rhythm, often limit the efficiency of knowledge transfer. In this paper, we conceptualize speech representation as a combination of content-agnostic and content-relevant factors. We examine the impact of content-agnostic factors on translation performance through preliminary experiments and observe a significant performance deterioration when content-agnostic perturbations are introduced to speech signals. To address this issue, we propose a Speech Representation Purification with Supervision Enhancement (SRPSE) framework, which excludes the content-agnostic components within speech representations to mitigate their negative impact on ST. Experiments on MuST-C and CoVoST-2 datasets demonstrate that SRPSE significantly improves translation performance across all translation directions in three settings and achieves preeminent performance under a transcript-free setting. © 2025 Association for Computational Linguistics.

RG-SAN: Rule-Guided Spatial Awareness Network for End-to-End 3D Referring Expression Segmentation

38th Conference on Neural Information Processing Systems, NeurIPS 2024

Authors: Wu, Changli; Chen, Qi; Ji, Jiayi; Wang, Haowei; Ma, Yiwei; Huang, You; Luo, Gen; Fei, Hao; Sun, Xiaoshuai; Ji, Rongrong
Affiliations: Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University, 361005, China; Shanghai Innovation Institute, Shanghai, China; Youtu Lab, Tencent, Shanghai, China; National University of Singapore, Singapore

3D Referring Expression Segmentation (3D-RES) aims to segment 3D objects by correlating referring expressions with point clouds. However, traditional approaches frequently encounter issues like over-segmentation or mis-segmentation, due to insufficient emphasis on spatial information of instances. In this paper, we introduce a Rule-Guided Spatial Awareness Network (RG-SAN) by utilizing solely the spatial information of the target instance for supervision. This approach enables the network to accurately depict the spatial relationships among all entities described in the text, thus enhancing the reasoning capabilities. The RG-SAN consists of the Text-driven Localization Module (TLM) and the Rule-guided Weak Supervision (RWS) strategy. The TLM initially locates all mentioned instances and iteratively refines their positional information. The RWS strategy, acknowledging that only target objects have supervised positional information, employs dependency tree rules to precisely guide the core instance's positioning. Extensive testing on the ScanRefer benchmark has shown that RG-SAN not only establishes new performance benchmarks, with an mIoU increase of 5.1 points, but also exhibits significant improvements in robustness when processing descriptions with spatial ambiguity. All codes are available at https://***/sosppxo/RG-SAN. © 2024 Neural information processing systems foundation. All rights reserved.
