Search Results - Inner Mongolia University Library

VANER: Leveraging Large Language Model for Versatile and Adaptive Biomedical Named Entity Recognition

27th European Conference on Artificial Intelligence, ECAI 2024

Authors: Bian, Junyi; Zhai, Weiqi; Huang, Xiaodi; Zheng, Jiaxuan; Zhu, Shanfeng. Affiliations: School of Computer Science Fudan University Shanghai 200433 China; Institute of Science and Technology for Brain-Inspired Intelligence Fudan University China; Ministry of Education Shanghai 200433 China; MOE Frontiers Center for Brain Science Fudan University Shanghai 200433 China; Zhangjiang Fudan International Innovation Center Shanghai 200433 China; Shanghai Key Lab of Intelligent Information Processing Fudan University Shanghai 200433 China; School of Computing and Mathematics Charles Sturt University Albury NSW 2640 Australia

ISBN: (print) 9781643685489

The prevalent solution for BioNER involves using representation learning techniques combined with sequence ***, such methods are inherently task-specific, demonstrate poor generalizability, and often require a dedicated model for each *** leverage the versatile capabilities of recent large language models (LLMs), several approaches have explored generative techniques for entity ***, these approaches often fall short compared to previous sequence labeling *** this paper, we utilize the open-source LLM LLaMA2 as the backbone model, and design specific instructions to distinguish between different types of entities and *** combining the LLM's understanding of instructions with sequence labeling techniques, we train a model using a mix of datasets capable of extracting various types of *** that the backbone LLM lacks specialized medical knowledge, we also integrate external entity knowledge bases and employ instruction tuning to enable the model to densely recognize curated *** parameter-efficient training model, VANER, significantly outperforms previous LLM-based *** the first time, as an LLM-based model, VANER surpasses the majority of conventional state-of-the-art BioNER systems, achieving the highest F1 scores across three datasets. © 2024 The Authors.
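The sequence labeling formulation mentioned in this abstract can be illustrated with a minimal sketch that decodes per-token BIO tags into entity spans. The BIO tagging scheme, the tag names (`Gene`, `Disease`), the function name `decode_bio`, and the sample sentence are illustrative assumptions; the record does not state which tagging scheme or label set VANER uses.

```python
def decode_bio(tokens, tags):
    """Collect (entity_type, text) spans from parallel token and BIO tag lists."""
    entities = []
    current_type, current_tokens = None, []
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A B- tag always starts a new span; flush any open one first.
            if current_tokens:
                entities.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = tag[2:], [token]
        elif tag.startswith("I-") and current_type == tag[2:]:
            # An I- tag of the same type extends the open span.
            current_tokens.append(token)
        else:
            # "O", or an I- tag that does not continue the open span,
            # closes the span; a stray I- token itself is dropped.
            if current_tokens:
                entities.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = None, []
    if current_tokens:
        entities.append((current_type, " ".join(current_tokens)))
    return entities


# Hypothetical example sentence and tags, for illustration only.
tokens = ["Mutations", "in", "BRCA1", "cause", "breast", "cancer"]
tags = ["O", "O", "B-Gene", "O", "B-Disease", "I-Disease"]
print(decode_bio(tokens, tags))
```

In a multi-dataset setting like the one the abstract describes, an instruction would presumably select which entity type the tagger should label; that conditioning step is not modeled in this sketch.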

Keywords: labeled data

An End-to-End Text Spotting Model for Vertical and Multi-Line Codes

SSRN, 2023

Authors: Chen, Pingping; You, Suo; Chen, Honghui; Jiang, Mengxi. Affiliations: School of Advanced Manufacturing Fuzhou University Fujian Fuzhou 362251 China; National Joint Engineering Research Center of Video Processing and Communications Fuzhou University Fujian Fuzhou 350108 China; Key Lab for Intelligent Processing and Wireless Transmission of Media Information Fuzhou University Fujian Fuzhou 350108 China

Scene text detection (STR) attracts much attention in computer vision and is widely used in real-time applications. Though many methods have been proposed for horizontal and oriented texts, STR frameworks for spotting vertical and multi-line codes in complex scenarios, such as automatic container code recognition (ACCR), have yet to be fully explored. In this paper, we propose an end-to-end text spotting framework for multi-directional container codes, covering horizontal, vertical, and multi-line (HVM) layouts. We first propose the Self-Spatial Enhancement Module (SSEM) and Self-Channel Enhancement Module (SCEM) for constructing an adaptive inter-domain feature fusion network. Then, a transformer-based branch with Masked RoI is exploited to recognize codes. Finally, we develop a Character Contrastive Learning (CCL) loss to improve the representation of character features. Experimental results show that the proposed method achieves state-of-the-art performance in multi-directional and multi-line code recognition as compared to other methods. In particular, the F-measure and recognition accuracy reach 91.2% and 93.2%, respectively, in real-time ACCR. © 2023, The Authors. All rights reserved.

Keywords: Containers

StableAnimator: High-Quality Identity-Preserving Human Image Animation

arXiv, 2024

Authors: Tu, Shuyuan; Xing, Zhen; Han, Xintong; Cheng, Zhi-Qi; Dai, Qi; Luo, Chong; Wu, Zuxuan. Affiliations: Shanghai Key Lab of Intell. Info. Processing School of CS Fudan University China; Shanghai Collaborative Innovation Center of Intelligent Visual Computing China; Microsoft Research Asia China; Huya Inc; Carnegie Mellon University United States

Current diffusion models for human image animation struggle to ensure identity (ID) consistency. This paper presents StableAnimator, the first end-to-end ID-preserving video diffusion framework, which synthesizes high-quality videos without any post-processing, conditioned on a reference image and a sequence of poses. Building upon a video diffusion model, StableAnimator contains carefully designed modules for both training and inference striving for identity consistency. In particular, StableAnimator begins by computing image and face embeddings with off-the-shelf extractors, respectively, and the face embeddings are further refined by interacting with image embeddings using a global content-aware Face Encoder. Then, StableAnimator introduces a novel distribution-aware ID Adapter that prevents interference caused by temporal layers while preserving ID via alignment. During inference, we propose a novel Hamilton-Jacobi-Bellman (HJB) equation-based optimization to further enhance the face quality. We demonstrate that solving the HJB equation can be integrated into the diffusion denoising process, and the resulting solution constrains the denoising path and thus benefits ID preservation. Experiments on multiple benchmarks show the effectiveness of StableAnimator both qualitatively and quantitatively. Copyright © 2024, The Authors. All rights reserved.

Keywords: Animation

Deepfake Network Architecture Attribution

arXiv, 2022

Authors: Yang, Tianyun; Huang, Ziyao; Cao, Juan; Li, Lei; Li, Xirong. Affiliations: Key Lab of Intelligent Information Processing Institute of Computing Technology CAS Beijing China; University of Chinese Academy of Sciences Beijing China; Key Lab of Data Engineering and Knowledge Engineering Renmin University of China China

With the rapid progress of generation technology, it has become necessary to attribute the origin of fake images. Existing works on fake image attribution perform multi-class classification on several Generative Adversarial Network (GAN) models and obtain high accuracies. While encouraging, these works are restricted to model-level attribution, only capable of handling images generated by seen models with a specific seed, loss and dataset, which is limiting in real-world scenarios where fake images may be generated by privately trained models. This motivates us to ask whether it is possible to attribute fake images to the source models’ architectures even if they are finetuned or retrained under different configurations. In this work, we present the first study on Deepfake Network Architecture Attribution to attribute fake images at the architecture level. Based on an observation that GAN architecture is likely to leave globally consistent fingerprints while traces left by model weights vary in different regions, we provide a simple yet effective solution named DNA-Det for this problem. Extensive experiments on multiple cross-test setups and a large-scale dataset demonstrate the effectiveness of DNA-Det. Our source code and dataset can be found here: https://***/ICTMCG/DNA-Det Copyright © 2022, The Authors. All rights reserved.

Keywords: Network architecture

LVOS: A Benchmark for Large-scale Long-term Video Object Segmentation

arXiv, 2024

Authors: Hong, Lingyi; Liu, Zhongying; Chen, Wenchao; Tan, Chenzhi; Feng, Yuang; Zhou, Xinyu; Guo, Pinxue; Li, Jinglun; Chen, Zhaoyu; Gao, Shuyong; Zhang, Wei; Zhang, Wenqiang. Affiliations: Shanghai Key Laboratory of Intelligent Information Processing School of Computer Science Fudan University Shanghai 200433 China; The Shanghai Engineering Research Center of AI & Robotics Academy for Engineering & Technology Fudan University Shanghai China; Engineering Research Center of AI & Robotics Ministry of Education Academy for Engineering & Technology Fudan University Shanghai China; The Shanghai Key Lab of Intelligent Information Processing School of Computer Science Fudan University Shanghai China

Video object segmentation (VOS) aims to distinguish and track target objects in a video. Despite the excellent performance achieved by off-the-shelf VOS models, existing VOS benchmarks mainly focus on short-term videos lasting about 5 seconds, where objects remain visible most of the time. However, these benchmarks poorly represent practical applications, and the absence of long-term datasets restricts further investigation of VOS in realistic scenarios. Thus, we propose a novel benchmark named LVOS, comprising 720 videos with 296,401 frames and 407,945 high-quality annotations. Videos in LVOS last 1.14 minutes on average, approximately 5 times longer than videos in existing datasets. Each video includes various attributes, especially challenges deriving from the wild, such as long-term reappearing and cross-temporal similar objects. Compared to previous benchmarks, our LVOS better reflects VOS models' performance in real scenarios. Based on LVOS, we evaluate 20 existing VOS models under 4 different settings and conduct a comprehensive analysis. On LVOS, these models suffer a large performance drop, highlighting the challenge of achieving precise tracking and segmentation in real-world scenarios. Attribute-based analysis indicates that the key factor in the accuracy decline is the increased video length, emphasizing LVOS's crucial role. We hope our LVOS can advance the development of VOS in real scenes. Data and code are available at https://***/lvos. ***/. © 2024, CC BY.

Keywords: Benchmarking

ChatVideo: A Tracklet-centric Multimodal and Versatile Video Understanding System

arXiv, 2023

Authors: Wang, Junke; Chen, Dongdong; Luo, Chong; Dai, Xiyang; Yuan, Lu; Wu, Zuxuan; Jiang, Yu-Gang. Affiliations: Shanghai Key Lab of Intell. Info. Processing School of CS Fudan University China; Shanghai Collaborative Innovation Center on Intelligent Visual Computing China; Microsoft Cloud + AI; Microsoft Research Asia

Existing deep video models are limited by specific tasks, fixed input-output spaces, and poor generalization capabilities, making it difficult to deploy them in real-world scenarios. In this paper, we present our vision for multimodal and versatile video understanding and propose a prototype system, ChatVideo. Our system is built upon a tracklet-centric paradigm, which treats tracklets as the basic video unit and employs various Video Foundation Models (ViFMs) to annotate their properties, e.g., appearance and motion. All the detected tracklets are stored in a database and interact with the user through a database manager. We have conducted extensive case studies on different types of in-the-wild videos, which demonstrate the effectiveness of our method in answering various video-related problems. Our project is available at https://***/ChatVideo/ Copyright © 2023, The Authors. All rights reserved.

Learnable Irrelevant Modality Dropout for Multimodal Action Recognition on Modality-Specific Annotated Videos

arXiv, 2022

Authors: Alfasly, Saghir; Lu, Jian; Xu, Chen; Zou, Yuru. Affiliations: Shenzhen Key Laboratory of Advanced Machine Learning and Applications Shenzhen University China; Guangdong Key Laboratory of Intelligent Information Processing Shenzhen China; Pazhou Lab Guangzhou China

With the assumption that a video dataset is multimodality annotated, in which auditory and visual modalities are both labeled or class-relevant, current multimodal methods apply modality fusion or cross-modality attention. However, effectively leveraging the audio modality in vision-specific annotated videos for action recognition is particularly challenging. To tackle this challenge, we propose a novel audio-visual framework that effectively leverages the audio modality in any solely vision-specific annotated dataset. We adopt language models (e.g., BERT) to build a semantic audio-video label dictionary (SAVLD) that maps each video label to its K most relevant audio labels, so that SAVLD serves as a bridge between audio and video datasets. Then, SAVLD along with a pretrained audio multi-label model are used to estimate the audio-visual modality relevance during the training phase. Accordingly, a novel learnable irrelevant modality dropout (IMD) is proposed to completely drop out the irrelevant audio modality and fuse only the relevant modalities. Moreover, we present a new two-stream video Transformer for efficiently modeling the visual modalities. Results on several vision-specific annotated datasets including Kinetics400 and UCF-101 validated our framework as it outperforms most relevant action recognition methods. © 2022, CC BY.
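The label dictionary described in this abstract maps each video label to its K most relevant audio labels. A minimal sketch of that idea, assuming relevance is measured by cosine similarity between label embeddings, might look as follows. The three-dimensional vectors are hand-made placeholders standing in for language model embeddings, and the function names, label names, and similarity measure are illustrative assumptions, not details taken from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def build_label_dictionary(video_emb, audio_emb, k):
    """Map each video label to its k most similar audio labels."""
    dictionary = {}
    for v_label, v_vec in video_emb.items():
        ranked = sorted(
            audio_emb,
            key=lambda a_label: cosine(v_vec, audio_emb[a_label]),
            reverse=True,
        )
        dictionary[v_label] = ranked[:k]
    return dictionary


# Placeholder embeddings (hypothetical); a real system would obtain these
# from a language model applied to the label text.
video_emb = {
    "playing guitar": [0.9, 0.1, 0.0],
    "swimming": [0.0, 0.2, 0.9],
}
audio_emb = {
    "guitar": [1.0, 0.0, 0.0],
    "music": [0.7, 0.3, 0.0],
    "splash": [0.0, 0.1, 1.0],
    "speech": [0.1, 0.9, 0.1],
}

savld = build_label_dictionary(video_emb, audio_emb, k=2)
for video_label, audio_labels in savld.items():
    print(video_label, "->", audio_labels)
```

With such a dictionary, the audio labels predicted by a pretrained audio model for a clip could be compared against the entries for the clip's video label to judge whether the audio is relevant; how the paper actually computes that relevance is not specified in this record.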

Keywords: Semantics

DL2G: Anatomical Landmark Detection with Deep Local Features and Geometric Global Constraint

2024 IEEE International Conference on Bioinformatics and Biomedicine, BIBM 2024

Authors: Wang, Rui; Yang, Wanli; Xiao, Kuntao; Sun, Yi; Sheng, Shurong; Lv, Zhao; Gao, Jiahong. Affiliations: Anhui University Anhui Hefei China; Hefei Comprehensive National Science Center Anhui Province Key Laboratory of Biomedical Imaging and Intelligent Processing Institute of Artificial Intelligence Hefei China; Anhui Province Key Laboratory of Multimodal Cognitive Computation Anhui Hefei China; Peking University Center for MRI Research Academy for Advanced Interdisciplinary Studies Beijing China; Peking University Beijing City Key Lab for Medical Physics and Engineering Institute of Heavy Ion Physics School of Physics Beijing China; Peking University McGovern Institute for Brain Research Beijing China

ISBN: (print) 9798350386226

Anatomical landmark detection, a pivotal research area in medical image processing, holds immense value in surgical navigation, image registration, and related fields. Traditional machine learning methods struggle with generalization and robustness. Current supervised end-to-end approaches lack interpretability in capturing global information, and GPU memory constraints restrict their application to extensive 3D medical image datasets. Moreover, existing methods overlook intrinsic geometric cues within images and point sets. Herein, we introduce a novel anatomical landmark detection framework that integrates deep learning's representation capabilities with global geometric information derived from images and point sets (DL2G). This approach facilitates anatomical landmark localization in a local-to-global fashion. Initially, we train a deep feature descriptor based on self-supervised contrastive learning, which is further used to screen candidate points for the target one by comparing local patch embeddings. Subsequently, geometric constraints constructed from labeled template point sets enable the selection of geometrically consistent matching point sets. Ultimately, the framework utilizes the global mutual information of images to perform landmark localization through iterative optimization. Adhering to the AFID data annotation protocol, we evaluated our method on two public MRI head datasets, OASIS and HCP, for detecting 32 anatomical landmarks. The results convincingly demonstrate the superiority of our method over current state-of-the-art approaches in terms of both accuracy and computational efficiency. © 2024 IEEE.

Keywords: Contrastive Learning

MotionFollower: Editing Video Motion via Lightweight Score-Guided Diffusion

arXiv, 2024

Authors: Tu, Shuyuan; Dai, Qi; Zhang, Zihao; Xie, Sicheng; Cheng, Zhi-Qi; Luo, Chong; Han, Xintong; Wu, Zuxuan; Jiang, Yu-Gang. Affiliations: Shanghai Key Lab of Intell. Info. Processing School of CS Fudan University China; Shanghai Collaborative Innovation Center of Intelligent Visual Computing China; Microsoft Research Asia China; Carnegie Mellon University United States; Huya Inc.

Despite impressive advancements in diffusion-based video editing models in altering video attributes, there has been limited exploration into modifying motion information while preserving the original protagonist's appearance and background. In this paper, we propose MotionFollower, a lightweight score-guided diffusion model for video motion editing. To introduce conditional controls to the denoising process, MotionFollower leverages two of our proposed lightweight signal controllers, one for poses and the other for appearances, both of which consist of convolution blocks without involving heavy attention calculations. Further, we design a score guidance principle based on a two-branch architecture, including the reconstruction and editing branches, which significantly enhances the modeling capability of texture details and complicated backgrounds. Concretely, we enforce several consistency regularizers and losses during the score estimation. The resulting gradients thus inject appropriate guidance to the intermediate latents, forcing the model to preserve the original background details and protagonists' appearances without interfering with the motion modification. Experiments demonstrate the competitive motion editing ability of MotionFollower qualitatively and quantitatively. Compared with MotionEditor, the most advanced motion editing model, MotionFollower achieves an approximately 80% reduction in GPU memory while delivering superior motion editing performance and exclusively supporting large camera movements and *** Codes 68T45, 68T10 Copyright © 2024, The Authors. All rights reserved.

Keywords: Diffusion