Recently, the multimodal large language model (MLLM), represented by GPT-4V, has emerged as a new research hotspot; it uses powerful large language models (LLMs) as a brain to perform multimodal tasks. The surprising emergent capabilities of the MLLM, such as writing stories based on images and optical character recognition–free math reasoning, are rare in traditional multimodal methods, suggesting a potential path to artificial general intelligence. To this end, both academia and industry have endeavored to develop MLLMs that can compete with or even outperform GPT-4V, pushing the limits of research at a surprising speed. In this paper, we aim to trace and summarize the recent progress of MLLMs. First, we present the basic formulation of the MLLM and delineate its related concepts, including architecture, training strategy and data, as well as evaluation. Then, we introduce research topics about how MLLMs can be extended to support more granularity, modalities, languages and scenarios. We continue with multimodal hallucination and extended techniques, including multimodal in-context learning, multimodal chain of thought and LLM-aided visual reasoning. To conclude the paper, we discuss existing challenges and point out promising research directions.
Estimating lighting from standard images can effectively circumvent the need for resource-intensive high-dynamic-range (HDR) lighting ***, this task is often ill-posed and challenging, particularly for indoor scenes, due to the intricacy and ambiguity inherent in various indoor illumination *** propose an innovative transformer-based method called SGformer for lighting estimation through modeling spherical Gaussian (SG) distributions, a compact yet expressive lighting *** from previous approaches, we explore underlying local and global dependencies in lighting features, which are crucial for reliable lighting ***, we investigate the structural relationships spanning various resolutions of SG distributions, ranging from sparse to dense, aiming to enhance structural consistency and curtail potential stochastic noise stemming from independent SG component *** harnessing the synergy of local–global lighting representation learning and incorporating consistency constraints from various SG resolutions, the proposed method yields more accurate lighting estimates, allowing for more realistic lighting effects in object relighting and *** code and model implementing our work can be found at https://***/junhong-jennifer-zhao/SGformer.
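For readers unfamiliar with the lighting model named above, the following is a minimal sketch of how a spherical Gaussian (SG) mixture represents radiance; the lobe count, parameter names, and toy values are illustrative assumptions rather than SGformer's actual configuration.

```python
# Sketch: evaluate a mixture of spherical Gaussian lobes (illustrative parameters).
import numpy as np

def sg_radiance(directions, lobe_axes, sharpness, amplitudes):
    """L(v) = sum_k a_k * exp(lambda_k * (mu_k . v - 1)).

    directions : (N, 3) unit query directions
    lobe_axes  : (K, 3) unit lobe axes mu_k
    sharpness  : (K,)   lobe sharpness lambda_k
    amplitudes : (K, 3) RGB amplitudes a_k
    returns    : (N, 3) RGB radiance per query direction
    """
    cos = directions @ lobe_axes.T                      # (N, K) mu_k . v
    weights = np.exp(sharpness[None, :] * (cos - 1.0))  # (N, K) lobe falloff
    return weights @ amplitudes                         # (N, 3)

# Toy usage: 16 lobes on random axes, queried at 8 directions.
rng = np.random.default_rng(0)
axes = rng.normal(size=(16, 3)); axes /= np.linalg.norm(axes, axis=1, keepdims=True)
dirs = rng.normal(size=(8, 3));  dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)
radiance = sg_radiance(dirs, axes, rng.uniform(5, 50, 16), rng.uniform(0, 1, (16, 3)))
print(radiance.shape)  # (8, 3)
```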
Recently, Generative Adversarial Networks (GANs) have become the mainstream text-to-image (T2I) ***, a standard normal distribution noise of inputs cannot provide sufficient information to synthesize an image that approaches the ground-truth image ***, the multistage generation strategy results in complex T2I ***, this study proposes a novel feature-grounded single-stage T2I model, which considers the “real” distribution learned from training images as one input and introduces a worst-case-optimized similarity measure into the loss function to enhance the model's generation *** results on two benchmark datasets demonstrate the competitive performance of the proposed model in terms of the Frechet inception distance and inception score compared to those of some classical and state-of-the-art models, showing the improved similarities among the generated image, text, and ground truth.
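As a rough illustration of the two ideas the abstract names, the sketch below grounds the generator input in a Gaussian fitted to real image features and adds a worst-case (batch-minimum) similarity term to the loss; the function names, feature dimensions, and the exact similarity form are assumptions, not the paper's implementation.

```python
# Sketch: feature-grounded noise + worst-case similarity term (names are assumptions).
import torch
import torch.nn.functional as F

def grounded_noise(real_features, noise_dim):
    """Sample generator input from a Gaussian fitted to real image features."""
    mu = real_features.mean(dim=0)
    std = real_features.std(dim=0) + 1e-6
    eps = torch.randn(real_features.size(0), noise_dim)
    return mu[:noise_dim] + std[:noise_dim] * eps   # feature-grounded latent codes

def worst_case_similarity_loss(fake_feats, text_feats, real_feats):
    """Penalize the least-similar pair in the batch (worst case), not the average."""
    sim_text = F.cosine_similarity(fake_feats, text_feats, dim=1)   # image-text
    sim_real = F.cosine_similarity(fake_feats, real_feats, dim=1)   # image-ground truth
    worst = torch.minimum(sim_text, sim_real).min()                 # batch worst case
    return 1.0 - worst                                              # maximize worst-case similarity

# Toy usage with random features standing in for encoder outputs.
B, D = 4, 128
z = grounded_noise(torch.randn(B, D), noise_dim=64)
loss = worst_case_similarity_loss(torch.randn(B, D), torch.randn(B, D), torch.randn(B, D))
print(z.shape, float(loss))
```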
Hallucination is a big shadow hanging over the rapidly evolving multimodal large language models (MLLMs), referring to the phenomenon that the generated text is inconsistent with the image content. To mitigate hallucinations, existing studies mainly resort to an instruction-tuning manner that requires retraining the models with specific data. In this paper, we pave a different way, introducing a training-free method named Woodpecker. Like a woodpecker heals trees, it picks out and corrects hallucinations from the generated text. Concretely, Woodpecker consists of five stages: key concept extraction, question formulation, visual knowledge validation, visual claim generation, and hallucination correction. Implemented in a post-remedy manner, Woodpecker can easily serve different MLLMs while remaining interpretable, since the intermediate outputs of the five stages can be inspected. We evaluate Woodpecker both quantitatively and qualitatively and show the huge potential of this new paradigm. On the POPE benchmark, our method obtains a 30.66%/24.33% improvement in accuracy over the baseline MiniGPT-4/mPLUG-Owl. The source code is released at https://***/BradyFU/Woodpecker.
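The five stages read naturally as a pipeline; the skeleton below lays them out in order, with llm, detector, and vqa as placeholder callables. The prompts and interfaces are assumptions for illustration only, not the released Woodpecker code.

```python
# Skeleton of the five stages; llm/detector/vqa are placeholder callables (assumed interfaces).
from dataclasses import dataclass, field

@dataclass
class WoodpeckerTrace:
    key_concepts: list = field(default_factory=list)
    questions: list = field(default_factory=list)
    visual_knowledge: dict = field(default_factory=dict)
    claims: str = ""
    corrected_text: str = ""

def correct_hallucinations(image, mllm_answer, llm, detector, vqa):
    """llm: str -> str, detector: (image, [concept]) -> boxes, vqa: (image, question) -> str."""
    trace = WoodpeckerTrace()
    # 1. Key concept extraction: objects/attributes mentioned in the answer.
    trace.key_concepts = llm(f"List the key objects in: {mllm_answer}").split(", ")
    # 2. Question formulation: probing questions about each concept.
    trace.questions = [llm(f"Write a verification question about '{c}'") for c in trace.key_concepts]
    # 3. Visual knowledge validation: ground the questions with detection/VQA tools.
    trace.visual_knowledge = {q: vqa(image, q) for q in trace.questions}
    trace.visual_knowledge["detections"] = detector(image, trace.key_concepts)
    # 4. Visual claim generation: turn the tool outputs into verified statements.
    trace.claims = llm(f"Turn these findings into factual claims: {trace.visual_knowledge}")
    # 5. Hallucination correction: rewrite the original answer against the claims.
    trace.corrected_text = llm(
        f"Rewrite the answer to be consistent with the claims.\n"
        f"Answer: {mllm_answer}\nClaims: {trace.claims}"
    )
    return trace  # every intermediate stage stays inspectable, as the abstract notes
```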
Convolutional neural networks struggle to accurately handle changes in angles and twists in the direction of images, which affects their ability to recognize patterns based on internal feature levels. In contrast, CapsNet overcomes these limitations by vectorizing information through increased directionality and magnitude, ensuring that spatial information is not overlooked. Therefore, this study proposes a novel expression recognition technique called CAPSULE-VGG, which combines the strengths of CapsNet and convolutional neural networks. By refining and integrating features extracted by a convolutional neural network before introducing them into CapsNet, our model enhances facial recognition capabilities. Compared to traditional neural network models, our approach offers faster training pace, improved convergence speed, and higher accuracy rates approaching stability. Experimental results demonstrate that our method achieves recognition rates of 74.14% for the FER2013 expression dataset and 99.85% for the CK+ expression dataset. By contrasting these findings with those obtained using conventional expression recognition techniques and incorporating CapsNet's advantages, we effectively address issues associated with convolutional neural networks while increasing expression identification accuracy.
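A hedged sketch of the described combination follows, with VGG-style convolutional blocks refining features before they are reshaped into primary capsules and squashed; layer widths, the capsule dimension, and the linear head standing in for routing are illustrative assumptions rather than the paper's configuration.

```python
# Sketch: VGG-style extractor feeding primary capsules (illustrative sizes).
import torch
import torch.nn as nn

def squash(s, dim=-1, eps=1e-8):
    """Capsule squashing: keep the vector's direction, map its length into [0, 1)."""
    norm2 = (s * s).sum(dim=dim, keepdim=True)
    return (norm2 / (1.0 + norm2)) * s / torch.sqrt(norm2 + eps)

class CapsuleVGGStub(nn.Module):
    def __init__(self, num_classes=7, caps_dim=8):
        super().__init__()
        self.features = nn.Sequential(                       # VGG-style feature extractor
            nn.Conv2d(1, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.primary = nn.Conv2d(128, 32 * caps_dim, 3, stride=2, padding=1)
        self.caps_dim = caps_dim
        self.head = nn.Linear(caps_dim, num_classes)         # stand-in for capsule routing

    def forward(self, x):
        f = self.features(x)                                 # refined CNN features
        p = self.primary(f)                                  # primary capsule maps
        caps = p.view(p.size(0), -1, self.caps_dim)          # (B, num_capsules, caps_dim)
        caps = squash(caps)                                  # vectorized pose + magnitude
        return self.head(caps.mean(dim=1))                   # pooled capsules -> expression logits

x = torch.randn(2, 1, 48, 48)                                # FER2013-sized grayscale crops
print(CapsuleVGGStub()(x).shape)                             # torch.Size([2, 7])
```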
Video question answering (VideoQA) is a challenging yet important task that requires a joint understanding of low-level video content and high-level textual semantics. Despite the promising progress of existing efforts, recent studies have revealed that current VideoQA models mostly tend to over-rely on the superficial correlations rooted in dataset bias while overlooking the key video content, thus leading to unreliable results. Effectively understanding and modeling the temporal and semantic characteristics of a given video for robust VideoQA is crucial but, to our knowledge, has not been well investigated. To fill this research gap, we propose a robust VideoQA framework that effectively models cross-modality fusion and forces the model to focus on the temporal and global content of videos when making a QA decision, instead of exploiting shortcuts in datasets. Specifically, we design a self-supervised contrastive learning objective to contrast the positive and negative pairs of multimodal input, where the fused representation of the original multimodal input is enforced to be closer to that of the intervened input based on video perturbation. We expect the fused representation to focus more on the global context of videos rather than on some static keyframes. Moreover, we introduce an effective temporal order regularization to enforce the inherent sequential structure of videos for video representation. We also design a Kullback-Leibler divergence-based perturbation invariance regularization of the predicted answer distribution to improve the robustness of the model against temporal content perturbation of videos. Our method is model-agnostic and readily compatible with various VideoQA backbones. Extensive experimental results and analyses on several public datasets show the advantage of our method over state-of-the-art methods in terms of both accuracy and robustness.
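The three training signals described above can be sketched as follows; the exact loss forms, the temperature, and the head shapes are assumptions rather than the paper's definitions.

```python
# Sketch of the three regularizers; forms and names are assumptions.
import torch
import torch.nn.functional as F

def contrastive_loss(fused_orig, fused_intervened, temperature=0.1):
    """InfoNCE over the batch: the intervened view of the same sample is the positive."""
    a = F.normalize(fused_orig, dim=1)
    b = F.normalize(fused_intervened, dim=1)
    logits = a @ b.t() / temperature                 # (B, B) similarity matrix
    targets = torch.arange(a.size(0))                # diagonal entries are the positives
    return F.cross_entropy(logits, targets)

def temporal_order_loss(order_logits, is_shuffled):
    """Binary cross-entropy on an order-prediction head fed shuffled/ordered clips."""
    return F.binary_cross_entropy_with_logits(order_logits, is_shuffled.float())

def answer_invariance_loss(answer_logits_orig, answer_logits_perturbed):
    """KL divergence between answer distributions before/after temporal perturbation."""
    log_p = F.log_softmax(answer_logits_perturbed, dim=1)
    q = F.softmax(answer_logits_orig, dim=1)
    return F.kl_div(log_p, q, reduction="batchmean")

# Toy usage with random tensors standing in for model outputs.
B, D, A = 8, 256, 100
total = (contrastive_loss(torch.randn(B, D), torch.randn(B, D))
         + temporal_order_loss(torch.randn(B), torch.randint(0, 2, (B,)))
         + answer_invariance_loss(torch.randn(B, A), torch.randn(B, A)))
print(float(total))
```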
Breast mass identification is of great significance for early screening of breast cancer, while existing detection methods have high missed-detection and misdiagnosis rates for small *** propose a small target breast mass detection network named Residual asymmetric dilated convolution-Cross layer attention-Mean standard deviation adaptive selection-You Only Look Once (RCM-YOLO), which improves the identifiability of small masses by increasing the resolution of feature maps, adopts residual asymmetric dilated convolution to expand the receptive field and optimize the number of parameters, and proposes a cross-layer attention that transfers deep semantic information to the shallow layers as auxiliary information to obtain key feature *** the training process, we propose an adaptive positive sample selection algorithm to automatically select positive samples, which considers the statistical features of the intersection over union sets to ensure the validity of the training set and the detection accuracy of the *** verify the performance of our model, we used public datasets to carry out the *** results showed that the mean Average Precision (mAP) of RCM-YOLO reached 90.34%; compared with YOLOv5, the missed detection rate for small masses of RCM-YOLO was reduced to 11%, and the single detection time was reduced to 28 *** detection accuracy and speed can be effectively improved by strengthening the feature expression of small masses and the relationship between *** method can help doctors in batch screening of breast images, and significantly promote the detection rate of small masses and reduce misdiagnosis.
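Adaptive positive sample selection based on IoU statistics can be illustrated with a mean-plus-standard-deviation threshold in the spirit of ATSS; the abstract does not state the exact statistic RCM-YOLO uses, so the sketch below is an assumption.

```python
# Sketch: ATSS-style adaptive positive selection (assumed statistic: mean + std of IoUs).
import numpy as np

def box_iou(boxes_a, boxes_b):
    """IoU between two sets of boxes in (x1, y1, x2, y2) format."""
    lt = np.maximum(boxes_a[:, None, :2], boxes_b[None, :, :2])
    rb = np.minimum(boxes_a[:, None, 2:], boxes_b[None, :, 2:])
    wh = np.clip(rb - lt, 0, None)
    inter = wh[..., 0] * wh[..., 1]
    area_a = (boxes_a[:, 2] - boxes_a[:, 0]) * (boxes_a[:, 3] - boxes_a[:, 1])
    area_b = (boxes_b[:, 2] - boxes_b[:, 0]) * (boxes_b[:, 3] - boxes_b[:, 1])
    return inter / (area_a[:, None] + area_b[None, :] - inter + 1e-9)

def adaptive_positive_selection(anchors, gt_boxes):
    """Return a boolean (num_gt, num_anchors) mask of adaptively selected positives."""
    iou = box_iou(gt_boxes, anchors)                       # (num_gt, num_anchors)
    thr = iou.mean(axis=1, keepdims=True) + iou.std(axis=1, keepdims=True)
    return iou >= thr                                      # per-ground-truth adaptive threshold

# Toy usage: 3 candidate anchors, 1 ground-truth box.
anchors = np.array([[1, 1, 11, 11], [2, 2, 12, 12], [20, 20, 30, 30]], dtype=float)
gt = np.array([[1, 1, 11, 11]], dtype=float)
print(adaptive_positive_selection(anchors, gt))
```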
Software-Defined Networking (SDN) improves network flexibility by decoupling the data plane from the control plane, employing a logically centralized yet physically distributed multi-controller architecture. The optimal p...
Self-supervised graph representation learning has recently shown considerable promise in a range of fields, including bioinformatics and social networks. A large number of graph contrastive learning approaches have shown promising performance for representation learning on graphs, which train models by maximizing agreement between original graphs and their augmented views (i.e., positive views). Unfortunately, these methods usually involve pre-defined augmentation strategies based on the knowledge of human experts. Moreover, these strategies may fail to generate challenging positive views to provide sufficient supervision signals. In this paper, we present a novel approach named graph pooling contrast (GPS) to address these *** by the fact that graph pooling can adaptively coarsen the graph with the removal of redundancy, we rethink graph pooling and leverage it to automatically generate multi-scale positive views with varying emphasis on providing challenging positives and preserving semantics, i.e., a strongly-augmented view and a weakly-augmented view. Then, we incorporate both views into a joint contrastive learning framework with similarity learning and consistency learning, where our pooling module is adversarially trained with respect to the encoder for adversarial robustness. Experiments on twelve datasets covering both graph classification and transfer learning tasks verify the superiority of the proposed method over its counterparts.
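One way to picture the strongly- and weakly-augmented pooled views and the joint similarity/consistency objective is the sketch below; the keep ratios, the one-step mean-readout encoder, and the loss weighting are illustrative assumptions rather than the GPS implementation.

```python
# Sketch: pooled strong/weak views contrasted against the original graph (assumed ratios/encoder).
import torch
import torch.nn.functional as F

def topk_pool(x, adj, scores, ratio):
    """Keep the top-`ratio` fraction of nodes by score; return pooled features and adjacency."""
    k = max(1, int(ratio * x.size(0)))
    idx = scores.topk(k).indices
    return x[idx] * torch.sigmoid(scores[idx]).unsqueeze(1), adj[idx][:, idx]

def graph_embedding(x, adj):
    """One propagation step plus mean readout as a stand-in for the GNN encoder."""
    deg = adj.sum(dim=1, keepdim=True).clamp(min=1)
    return ((adj @ x) / deg).mean(dim=0)

def gps_style_loss(x, adj, scores, weak_ratio=0.8, strong_ratio=0.4):
    z_orig = graph_embedding(x, adj)
    z_weak = graph_embedding(*topk_pool(x, adj, scores, weak_ratio))
    z_strong = graph_embedding(*topk_pool(x, adj, scores, strong_ratio))
    sim = lambda a, b: F.cosine_similarity(a, b, dim=0)
    similarity_term = 2.0 - sim(z_orig, z_weak) - sim(z_orig, z_strong)  # pull views to original
    consistency_term = 1.0 - sim(z_weak, z_strong)                       # keep views consistent
    return similarity_term + consistency_term

# Toy usage: a random 10-node graph with 16-dim features and learnable node scores.
x = torch.randn(10, 16)
adj = (torch.rand(10, 10) > 0.7).float(); adj = ((adj + adj.t()) > 0).float()
scores = torch.randn(10, requires_grad=True)   # adversarial training would update these scores
print(float(gps_style_loss(x, adj, scores)))
```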
Modern software development has moved toward agile growth and rapid delivery, where developers must meet the changing needs of users *** such a situation, plug-and-play Third-Party Libraries (TPLs) introduce a considerable amount of convenience to ***, selecting the exact candidate that meets the project requirements from the countless TPLs is challenging for *** works have considered setting up a personalized recommender system to suggest TPLs for ***, these approaches rarely consider the complex relationships between applications and TPLs, and are unsatisfactory in accuracy, training speed, and convergence *** this paper, we propose a new end-to-end recommendation model called Neighbor Library-Aware Graph Neural Network (NLA-GNN). Unlike previous works, we only initialize one type of node embedding, and construct and update all types of node representations using Graph Neural Networks (GNN). We use a simplified graph convolution operation to alternate the information propagation process to increase the training efficiency and eliminate the heterogeneity of the app-library bipartite graph, thus efficiently modeling the complex high-order relationships between the app and the *** experiments on large-scale real-world datasets demonstrate that NLA-GNN achieves consistent and remarkable improvements over state-of-the-art baselines for TPL recommendation tasks.
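The alternating, simplified propagation over the app-library bipartite graph might look roughly like the following, where only library embeddings are initialized and app representations are derived from their neighbors; the layer count and dot-product scoring are assumptions, not NLA-GNN's exact design.

```python
# Sketch: alternating simplified propagation on the app-library graph (assumed layout).
import torch

def alternate_propagation(lib_emb, interactions, num_layers=2):
    """interactions: (num_apps, num_libs) binary app-library adjacency matrix."""
    app_deg = interactions.sum(dim=1, keepdim=True).clamp(min=1)
    lib_deg = interactions.sum(dim=0, keepdim=True).clamp(min=1).t()
    app_emb = (interactions @ lib_emb) / app_deg           # build the app side from libraries
    for _ in range(num_layers):
        lib_emb = (interactions.t() @ app_emb) / lib_deg   # libraries aggregate their apps
        app_emb = (interactions @ lib_emb) / app_deg       # apps aggregate their libraries
    return app_emb, lib_emb

def recommend(app_emb, lib_emb, app_id, top_k=5):
    """Score libraries for one app by inner product and return the top candidates."""
    scores = lib_emb @ app_emb[app_id]
    return scores.topk(min(top_k, scores.numel())).indices

# Toy usage: 4 apps, 6 libraries, 16-dim library embeddings.
torch.manual_seed(0)
interactions = (torch.rand(4, 6) > 0.5).float()
lib_emb = torch.randn(6, 16)
app_emb, lib_emb = alternate_propagation(lib_emb, interactions)
print(recommend(app_emb, lib_emb, app_id=0))
```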