Recently, the multimodal large language model (MLLM), represented by GPT-4V, has emerged as a new research hotspot; it uses powerful large language models (LLMs) as a brain to perform multimodal tasks. The surprising emergent capabilities of the MLLM, such as writing stories based on images and optical character recognition–free math reasoning, are rare in traditional multimodal methods, suggesting a potential path to artificial general intelligence. To this end, both academia and industry have endeavored to develop MLLMs that can compete with or even outperform GPT-4V, pushing the limits of research at a surprising speed. In this paper, we aim to trace and summarize the recent progress of MLLMs. First, we present the basic formulation of the MLLM and delineate its related concepts, including architecture, training strategy and data, as well as evaluation. Then, we introduce research topics about how MLLMs can be extended to support more granularity, modalities, languages and scenarios. We continue with multimodal hallucination and extended techniques, including multimodal in-context learning, multimodal chain of thought and LLM-aided visual reasoning. To conclude the paper, we discuss existing challenges and point out promising research directions.
Estimating lighting from standard images can effectively circumvent the need for resource-intensive high-dynamic-range (HDR) lighting ***, this task is often ill-posed and challenging, particularly for indoor scenes, due to the intricacy and ambiguity inherent in various indoor illumination *** propose an innovative transformer-based method called SGformer for lighting estimation through modeling spherical Gaussian (SG) distributions, a compact yet expressive lighting *** from previous approaches, we explore underlying local and global dependencies in lighting features, which are crucial for reliable lighting ***, we investigate the structural relationships spanning various resolutions of SG distributions, ranging from sparse to dense, aiming to enhance structural consistency and curtail potential stochastic noise stemming from independent SG component *** harnessing the synergy of local–global lighting representation learning and incorporating consistency constraints from various SG resolutions, the proposed method yields more accurate lighting estimates, allowing for more realistic lighting effects in object relighting and *** code and model implementing our work can be found at https://***/junhong-jennifer-zhao/SGformer.
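For readers unfamiliar with the lighting model named above, the following is a minimal sketch of how a spherical Gaussian (SG) mixture represents radiance; the lobe count, parameter names, and toy values are illustrative assumptions rather than SGformer's actual configuration.

```python
# Sketch: evaluate a mixture of spherical Gaussian lobes (illustrative parameters).
import numpy as np

def sg_radiance(directions, lobe_axes, sharpness, amplitudes):
    """L(v) = sum_k a_k * exp(lambda_k * (mu_k . v - 1)).

    directions : (N, 3) unit query directions
    lobe_axes  : (K, 3) unit lobe axes mu_k
    sharpness  : (K,)   lobe sharpness lambda_k
    amplitudes : (K, 3) RGB amplitudes a_k
    returns    : (N, 3) RGB radiance per query direction
    """
    cos = directions @ lobe_axes.T                      # (N, K) mu_k . v
    weights = np.exp(sharpness[None, :] * (cos - 1.0))  # (N, K) lobe falloff
    return weights @ amplitudes                         # (N, 3)

# Toy usage: 16 lobes on random axes, queried at 8 directions.
rng = np.random.default_rng(0)
axes = rng.normal(size=(16, 3)); axes /= np.linalg.norm(axes, axis=1, keepdims=True)
dirs = rng.normal(size=(8, 3));  dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)
radiance = sg_radiance(dirs, axes, rng.uniform(5, 50, 16), rng.uniform(0, 1, (16, 3)))
print(radiance.shape)  # (8, 3)
```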
Recently, Generative Adversarial Networks (GANs) have become the mainstream text-to-image (T2I) ***, a standard normal distribution noise of inputs cannot provide sufficient information to synthesize an image that approaches the ground-truth image ***, the multistage generation strategy results in complex T2I ***, this study proposes a novel feature-grounded single-stage T2I model, which considers the “real” distribution learned from training images as one input and introduces a worst-case-optimized similarity measure into the loss function to enhance the model's generation *** results on two benchmark datasets demonstrate the competitive performance of the proposed model in terms of the Frechet inception distance and inception score compared to those of some classical and state-of-the-art models, showing the improved similarities among the generated image, text, and ground truth.
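As a rough illustration of the two ideas the abstract names, the sketch below grounds the generator input in a Gaussian fitted to real image features and adds a worst-case (batch-minimum) similarity term to the loss; the function names, feature dimensions, and the exact similarity form are assumptions, not the paper's implementation.

```python
# Sketch: feature-grounded noise + worst-case similarity term (names are assumptions).
import torch
import torch.nn.functional as F

def grounded_noise(real_features, noise_dim):
    """Sample generator input from a Gaussian fitted to real image features."""
    mu = real_features.mean(dim=0)
    std = real_features.std(dim=0) + 1e-6
    eps = torch.randn(real_features.size(0), noise_dim)
    return mu[:noise_dim] + std[:noise_dim] * eps   # feature-grounded latent codes

def worst_case_similarity_loss(fake_feats, text_feats, real_feats):
    """Penalize the least-similar pair in the batch (worst case), not the average."""
    sim_text = F.cosine_similarity(fake_feats, text_feats, dim=1)   # image-text
    sim_real = F.cosine_similarity(fake_feats, real_feats, dim=1)   # image-ground truth
    worst = torch.minimum(sim_text, sim_real).min()                 # batch worst case
    return 1.0 - worst                                              # maximize worst-case similarity

# Toy usage with random features standing in for encoder outputs.
B, D = 4, 128
z = grounded_noise(torch.randn(B, D), noise_dim=64)
loss = worst_case_similarity_loss(torch.randn(B, D), torch.randn(B, D), torch.randn(B, D))
print(z.shape, float(loss))
```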
Hallucination is a big shadow hanging over the rapidly evolving multimodal large language models (MLLMs), referring to the phenomenon that the generated text is inconsistent with the image content. To mitigate hallucinations, existing studies mainly resort to an instruction-tuning manner that requires retraining the models with specific data. In this paper, we pave a different way, introducing a training-free method named Woodpecker. Like a woodpecker heals trees, it picks out and corrects hallucinations from the generated text. Concretely, Woodpecker consists of five stages: key concept extraction, question formulation, visual knowledge validation, visual claim generation, and hallucination correction. Implemented in a post-remedy manner, Woodpecker can easily serve different MLLMs while remaining interpretable, since the intermediate outputs of the five stages can be inspected. We evaluate Woodpecker both quantitatively and qualitatively and show the huge potential of this new paradigm. On the POPE benchmark, our method obtains a 30.66%/24.33% improvement in accuracy over the baseline MiniGPT-4/mPLUG-Owl. The source code is released at https://***/BradyFU/Woodpecker.
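The five stages read naturally as a pipeline; the skeleton below lays them out in order, with llm, detector, and vqa as placeholder callables. The prompts and interfaces are assumptions for illustration only, not the released Woodpecker code.

```python
# Skeleton of the five stages; llm/detector/vqa are placeholder callables (assumed interfaces).
from dataclasses import dataclass, field

@dataclass
class WoodpeckerTrace:
    key_concepts: list = field(default_factory=list)
    questions: list = field(default_factory=list)
    visual_knowledge: dict = field(default_factory=dict)
    claims: str = ""
    corrected_text: str = ""

def correct_hallucinations(image, mllm_answer, llm, detector, vqa):
    """llm: str -> str, detector: (image, [concept]) -> boxes, vqa: (image, question) -> str."""
    trace = WoodpeckerTrace()
    # 1. Key concept extraction: objects/attributes mentioned in the answer.
    trace.key_concepts = llm(f"List the key objects in: {mllm_answer}").split(", ")
    # 2. Question formulation: probing questions about each concept.
    trace.questions = [llm(f"Write a verification question about '{c}'") for c in trace.key_concepts]
    # 3. Visual knowledge validation: ground the questions with detection/VQA tools.
    trace.visual_knowledge = {q: vqa(image, q) for q in trace.questions}
    trace.visual_knowledge["detections"] = detector(image, trace.key_concepts)
    # 4. Visual claim generation: turn the tool outputs into verified statements.
    trace.claims = llm(f"Turn these findings into factual claims: {trace.visual_knowledge}")
    # 5. Hallucination correction: rewrite the original answer against the claims.
    trace.corrected_text = llm(
        f"Rewrite the answer to be consistent with the claims.\n"
        f"Answer: {mllm_answer}\nClaims: {trace.claims}"
    )
    return trace  # every intermediate stage stays inspectable, as the abstract notes
```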
Convolutional neural networks struggle to accurately handle changes in angles and twists in the direction of images, which affects their ability to recognize patterns based on internal feature levels. In contrast, CapsNet overcomes these limitations by vectorizing information through increased directionality and magnitude, ensuring that spatial information is not overlooked. Therefore, this study proposes a novel expression recognition technique called CAPSULE-VGG, which combines the strengths of CapsNet and convolutional neural networks. By refining and integrating features extracted by a convolutional neural network before introducing them into CapsNet, our model enhances facial recognition capabilities. Compared to traditional neural network models, our approach offers faster training pace, improved convergence speed, and higher accuracy rates approaching stability. Experimental results demonstrate that our method achieves recognition rates of 74.14% for the FER2013 expression dataset and 99.85% for the CK+ expression dataset. By contrasting these findings with those obtained using conventional expression recognition techniques and incorporating CapsNet's advantages, we effectively address issues associated with convolutional neural networks while increasing expression identification accuracy.
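A hedged sketch of the described combination follows, with VGG-style convolutional blocks refining features before they are reshaped into primary capsules and squashed; layer widths, the capsule dimension, and the linear head standing in for routing are illustrative assumptions rather than the paper's configuration.

```python
# Sketch: VGG-style extractor feeding primary capsules (illustrative sizes).
import torch
import torch.nn as nn

def squash(s, dim=-1, eps=1e-8):
    """Capsule squashing: keep the vector's direction, map its length into [0, 1)."""
    norm2 = (s * s).sum(dim=dim, keepdim=True)
    return (norm2 / (1.0 + norm2)) * s / torch.sqrt(norm2 + eps)

class CapsuleVGGStub(nn.Module):
    def __init__(self, num_classes=7, caps_dim=8):
        super().__init__()
        self.features = nn.Sequential(                       # VGG-style feature extractor
            nn.Conv2d(1, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.primary = nn.Conv2d(128, 32 * caps_dim, 3, stride=2, padding=1)
        self.caps_dim = caps_dim
        self.head = nn.Linear(caps_dim, num_classes)         # stand-in for capsule routing

    def forward(self, x):
        f = self.features(x)                                 # refined CNN features
        p = self.primary(f)                                  # primary capsule maps
        caps = p.view(p.size(0), -1, self.caps_dim)          # (B, num_capsules, caps_dim)
        caps = squash(caps)                                  # vectorized pose + magnitude
        return self.head(caps.mean(dim=1))                   # pooled capsules -> expression logits

x = torch.randn(2, 1, 48, 48)                                # FER2013-sized grayscale crops
print(CapsuleVGGStub()(x).shape)                             # torch.Size([2, 7])
```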
Video question answering (VideoQA) is a challenging yet important task that requires a joint understanding of low-level video content and high-level textual semantics. Despite the promising progress of existing efforts, recent studies have revealed that current VideoQA models mostly tend to over-rely on the superficial correlations rooted in dataset bias while overlooking the key video content, thus leading to unreliable results. Effectively understanding and modeling the temporal and semantic characteristics of a given video for robust VideoQA is crucial but, to our knowledge, has not been well investigated. To fill this research gap, we propose a robust VideoQA framework that effectively models cross-modality fusion and forces the model to focus on the temporal and global content of videos when making a QA decision, instead of exploiting shortcuts in datasets. Specifically, we design a self-supervised contrastive learning objective to contrast the positive and negative pairs of multimodal input, where the fused representation of the original multimodal input is enforced to be closer to that of the intervened input based on video perturbation. We expect the fused representation to focus more on the global context of videos rather than on some static keyframes. Moreover, we introduce an effective temporal order regularization to enforce the inherent sequential structure of videos for video representation. We also design a Kullback-Leibler divergence-based perturbation invariance regularization of the predicted answer distribution to improve the robustness of the model against temporal content perturbation of videos. Our method is model-agnostic and readily compatible with various VideoQA backbones. Extensive experimental results and analyses on several public datasets show the advantage of our method over state-of-the-art methods in terms of both accuracy and robustness.
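The three training signals described above can be sketched as follows; the exact loss forms, the temperature, and the head shapes are assumptions rather than the paper's definitions.

```python
# Sketch of the three regularizers; forms and names are assumptions.
import torch
import torch.nn.functional as F

def contrastive_loss(fused_orig, fused_intervened, temperature=0.1):
    """InfoNCE over the batch: the intervened view of the same sample is the positive."""
    a = F.normalize(fused_orig, dim=1)
    b = F.normalize(fused_intervened, dim=1)
    logits = a @ b.t() / temperature                 # (B, B) similarity matrix
    targets = torch.arange(a.size(0))                # diagonal entries are the positives
    return F.cross_entropy(logits, targets)

def temporal_order_loss(order_logits, is_shuffled):
    """Binary cross-entropy on an order-prediction head fed shuffled/ordered clips."""
    return F.binary_cross_entropy_with_logits(order_logits, is_shuffled.float())

def answer_invariance_loss(answer_logits_orig, answer_logits_perturbed):
    """KL divergence between answer distributions before/after temporal perturbation."""
    log_p = F.log_softmax(answer_logits_perturbed, dim=1)
    q = F.softmax(answer_logits_orig, dim=1)
    return F.kl_div(log_p, q, reduction="batchmean")

# Toy usage with random tensors standing in for model outputs.
B, D, A = 8, 256, 100
total = (contrastive_loss(torch.randn(B, D), torch.randn(B, D))
         + temporal_order_loss(torch.randn(B), torch.randint(0, 2, (B,)))
         + answer_invariance_loss(torch.randn(B, A), torch.randn(B, A)))
print(float(total))
```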
Breast mass identification is of great significance for early screening of breast cancer, while existing detection methods have high missed-detection and misdiagnosis rates for small *** propose a small target breast mass detection network named Residual asymmetric dilated convolution-Cross layer attention-Mean standard deviation adaptive selection-You Only Look Once (RCM-YOLO), which improves the identifiability of small masses by increasing the resolution of feature maps, adopts residual asymmetric dilated convolution to expand the receptive field and optimize the number of parameters, and proposes a cross-layer attention that transfers deep semantic information to the shallow layers as auxiliary information to obtain key feature *** the training process, we propose an adaptive positive sample selection algorithm to automatically select positive samples, which considers the statistical features of the intersection over union sets to ensure the validity of the training set and the detection accuracy of the *** verify the performance of our model, we used public datasets to carry out the *** results showed that the mean Average Precision (mAP) of RCM-YOLO reached 90.34%; compared with YOLOv5, the missed detection rate for small masses of RCM-YOLO was reduced to 11%, and the single detection time was reduced to 28 *** detection accuracy and speed can be effectively improved by strengthening the feature expression of small masses and the relationship between *** method can help doctors in batch screening of breast images, and significantly promote the detection rate of small masses and reduce misdiagnosis.
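Adaptive positive sample selection based on IoU statistics can be illustrated with a mean-plus-standard-deviation threshold in the spirit of ATSS; the abstract does not state the exact statistic RCM-YOLO uses, so the sketch below is an assumption.

```python
# Sketch: ATSS-style adaptive positive selection (assumed statistic: mean + std of IoUs).
import numpy as np

def box_iou(boxes_a, boxes_b):
    """IoU between two sets of boxes in (x1, y1, x2, y2) format."""
    lt = np.maximum(boxes_a[:, None, :2], boxes_b[None, :, :2])
    rb = np.minimum(boxes_a[:, None, 2:], boxes_b[None, :, 2:])
    wh = np.clip(rb - lt, 0, None)
    inter = wh[..., 0] * wh[..., 1]
    area_a = (boxes_a[:, 2] - boxes_a[:, 0]) * (boxes_a[:, 3] - boxes_a[:, 1])
    area_b = (boxes_b[:, 2] - boxes_b[:, 0]) * (boxes_b[:, 3] - boxes_b[:, 1])
    return inter / (area_a[:, None] + area_b[None, :] - inter + 1e-9)

def adaptive_positive_selection(anchors, gt_boxes):
    """Return a boolean (num_gt, num_anchors) mask of adaptively selected positives."""
    iou = box_iou(gt_boxes, anchors)                       # (num_gt, num_anchors)
    thr = iou.mean(axis=1, keepdims=True) + iou.std(axis=1, keepdims=True)
    return iou >= thr                                      # per-ground-truth adaptive threshold

# Toy usage: 3 candidate anchors, 1 ground-truth box.
anchors = np.array([[1, 1, 11, 11], [2, 2, 12, 12], [20, 20, 30, 30]], dtype=float)
gt = np.array([[1, 1, 11, 11]], dtype=float)
print(adaptive_positive_selection(anchors, gt))
```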
Software-Defined Networking (SDN) improves network flexibility by decoupling the data plane from the control plane, employing a logically centralized yet physically distributed multi-controller architecture. The optimal p...
Self-supervised graph representation learning has recently shown considerable promise in a range of fields, including bioinformatics and social networks. A large number of graph contrastive learning approaches have shown promising performance for representation learning on graphs, which train models by maximizing agreement between original graphs and their augmented views (i.e., positive views). Unfortunately, these methods usually involve pre-defined augmentation strategies based on the knowledge of human experts. Moreover, these strategies may fail to generate challenging positive views to provide sufficient supervision signals. In this paper, we present a novel approach named graph pooling contrast (GPS) to address these *** by the fact that graph pooling can adaptively coarsen the graph with the removal of redundancy, we rethink graph pooling and leverage it to automatically generate multi-scale positive views with varying emphasis on providing challenging positives and preserving semantics, i.e., a strongly-augmented view and a weakly-augmented view. Then, we incorporate both views into a joint contrastive learning framework with similarity learning and consistency learning, where our pooling module is adversarially trained with respect to the encoder for adversarial robustness. Experiments on twelve datasets covering both graph classification and transfer learning tasks verify the superiority of the proposed method over its counterparts.
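One way to picture the strongly- and weakly-augmented pooled views and the joint similarity/consistency objective is the sketch below; the keep ratios, the one-step mean-readout encoder, and the loss weighting are illustrative assumptions rather than the GPS implementation.

```python
# Sketch: pooled strong/weak views contrasted against the original graph (assumed ratios/encoder).
import torch
import torch.nn.functional as F

def topk_pool(x, adj, scores, ratio):
    """Keep the top-`ratio` fraction of nodes by score; return pooled features and adjacency."""
    k = max(1, int(ratio * x.size(0)))
    idx = scores.topk(k).indices
    return x[idx] * torch.sigmoid(scores[idx]).unsqueeze(1), adj[idx][:, idx]

def graph_embedding(x, adj):
    """One propagation step plus mean readout as a stand-in for the GNN encoder."""
    deg = adj.sum(dim=1, keepdim=True).clamp(min=1)
    return ((adj @ x) / deg).mean(dim=0)

def gps_style_loss(x, adj, scores, weak_ratio=0.8, strong_ratio=0.4):
    z_orig = graph_embedding(x, adj)
    z_weak = graph_embedding(*topk_pool(x, adj, scores, weak_ratio))
    z_strong = graph_embedding(*topk_pool(x, adj, scores, strong_ratio))
    sim = lambda a, b: F.cosine_similarity(a, b, dim=0)
    similarity_term = 2.0 - sim(z_orig, z_weak) - sim(z_orig, z_strong)  # pull views to original
    consistency_term = 1.0 - sim(z_weak, z_strong)                       # keep views consistent
    return similarity_term + consistency_term

# Toy usage: a random 10-node graph with 16-dim features and learnable node scores.
x = torch.randn(10, 16)
adj = (torch.rand(10, 10) > 0.7).float(); adj = ((adj + adj.t()) > 0).float()
scores = torch.randn(10, requires_grad=True)   # adversarial training would update these scores
print(float(gps_style_loss(x, adj, scores)))
```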
Modern software development has moved toward agile growth and rapid delivery, where developers must meet the changing needs of users *** such a situation, plug-and-play Third-Party Libraries (TPLs) introduce a considerable amount of convenience to ***, selecting the exact candidate that meets the project requirements from the countless TPLs is challenging for *** works have considered setting up a personalized recommender system to suggest TPLs for ***, these approaches rarely consider the complex relationships between applications and TPLs, and are unsatisfactory in accuracy, training speed, and convergence *** this paper, we propose a new end-to-end recommendation model called Neighbor Library-Aware Graph Neural Network (NLA-GNN). Unlike previous works, we only initialize one type of node embedding, and construct and update all types of node representations using Graph Neural Networks (GNN). We use a simplified graph convolution operation to alternate the information propagation process to increase the training efficiency and eliminate the heterogeneity of the app-library bipartite graph, thus efficiently modeling the complex high-order relationships between the app and the *** experiments on large-scale real-world datasets demonstrate that NLA-GNN achieves consistent and remarkable improvements over state-of-the-art baselines for TPL recommendation tasks.
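The alternating, simplified propagation over the app-library bipartite graph might look roughly like the following, where only library embeddings are initialized and app representations are derived from their neighbors; the layer count and dot-product scoring are assumptions, not NLA-GNN's exact design.

```python
# Sketch: alternating simplified propagation on the app-library graph (assumed layout).
import torch

def alternate_propagation(lib_emb, interactions, num_layers=2):
    """interactions: (num_apps, num_libs) binary app-library adjacency matrix."""
    app_deg = interactions.sum(dim=1, keepdim=True).clamp(min=1)
    lib_deg = interactions.sum(dim=0, keepdim=True).clamp(min=1).t()
    app_emb = (interactions @ lib_emb) / app_deg           # build the app side from libraries
    for _ in range(num_layers):
        lib_emb = (interactions.t() @ app_emb) / lib_deg   # libraries aggregate their apps
        app_emb = (interactions @ lib_emb) / app_deg       # apps aggregate their libraries
    return app_emb, lib_emb

def recommend(app_emb, lib_emb, app_id, top_k=5):
    """Score libraries for one app by inner product and return the top candidates."""
    scores = lib_emb @ app_emb[app_id]
    return scores.topk(min(top_k, scores.numel())).indices

# Toy usage: 4 apps, 6 libraries, 16-dim library embeddings.
torch.manual_seed(0)
interactions = (torch.rand(4, 6) > 0.5).float()
lib_emb = torch.randn(6, 16)
app_emb, lib_emb = alternate_propagation(lib_emb, interactions)
print(recommend(app_emb, lib_emb, app_id=0))
```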