With the exponential growth of big data and advancements in large-scale foundation model techniques, the field of machine learning has embarked on an unprecedented golden era. This period is characterized by significant innovations across various aspects of machine learning, including data exploitation, network architecture development, loss function settings and algorithmic innovation.
Recently, the multimodal large language model (MLLM), represented by GPT-4V, has become a new rising research hotspot, which uses powerful large language models (LLMs) as a brain to perform multimodal tasks. The surprising emergent capabilities of the MLLM, such as writing stories based on images and optical character recognition–free math reasoning, are rare in traditional multimodal methods, suggesting a potential path to artificial general intelligence. To this end, both academia and industry have endeavored to develop MLLMs that can compete with or even outperform GPT-4V, pushing the limit of research at a surprising speed. In this paper, we aim to trace and summarize the recent progress of MLLMs. First, we present the basic formulation of the MLLM and delineate its related concepts, including architecture, training strategy and data, as well as evaluation. Then, we introduce research topics about how MLLMs can be extended to support more granularity, modalities, languages and scenarios. We continue with multimodal hallucination and extended techniques, including multimodal in-context learning, multimodal chain of thought and LLM-aided visual reasoning. To conclude the paper, we discuss existing challenges and point out promising research directions.
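The basic formulation mentioned above usually involves three components: a modality encoder, a connector that projects visual features into the LLM's embedding space, and the LLM backbone itself. The sketch below illustrates only this data flow; all module choices and dimensions are illustrative assumptions, not any specific model's design.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions (assumptions, not taken from any particular MLLM).
VIS_DIM, LLM_DIM, N_PATCH, N_TOK, VOCAB = 64, 128, 9, 5, 1000

# 1) Modality encoder: maps raw image patches to visual features.
W_enc = rng.normal(size=(VIS_DIM, VIS_DIM))
# 2) Connector / projector: aligns visual features with the LLM embedding space.
W_proj = rng.normal(size=(VIS_DIM, LLM_DIM))
# 3) LLM side: a token-embedding table standing in for the language model.
embed = rng.normal(size=(VOCAB, LLM_DIM))

patches = rng.normal(size=(N_PATCH, VIS_DIM))   # image as patch features
text_ids = rng.integers(0, VOCAB, size=N_TOK)   # tokenized text prompt

visual_tokens = (patches @ W_enc) @ W_proj      # (N_PATCH, LLM_DIM)
text_tokens = embed[text_ids]                   # (N_TOK, LLM_DIM)

# The LLM consumes visual tokens prepended to the text tokens.
llm_input = np.concatenate([visual_tokens, text_tokens], axis=0)
print(llm_input.shape)  # (14, 128)
```

In practice the encoder is typically a pretrained vision transformer and the connector is the main trainable alignment module, but the interface shown here (project, then concatenate into one token sequence) is the common pattern.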
In this paper, we consider the exact quantum query complexity of two fundamental symmetric functions: 1) MOD_m^n, which calculates the Hamming weight of an n-bit string modulo m; 2) EXACT_{k,l}^n, which determines whether the Hamming weight of an n-bit string is exactly k or l. Although these two symmetric functions have received considerable attention, their exact quantum query complexities have not been fully characterized. Specifically, our results are as follows: 1) We design an optimal quantum query algorithm to compute MOD_m^n exactly and thus provide a tight characterization of its exact quantum query complexity, which settles a previously open conjecture. Based on this algorithm, we demonstrate that a broad class of symmetric functions is not evasive in the quantum model, i.e., there exist quantum algorithms to compute these functions exactly when the number of queries is less than their input size. 2) By proposing a quantum algorithm that uses the minimum number of queries to compute EXACT_{k,l}^n exactly for some specific values of k and l, we give a tight characterization of its exact quantum query complexity in these scenarios.
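Classically, the two symmetric functions are easy to state in a few lines of Python; the query model itself, of course, counts accesses to individual input bits rather than whole-string evaluations, and the quantum algorithms above need fewer such accesses than this naive full read:

```python
def hamming_weight(x: str) -> int:
    """Number of 1s in a bit string (reads every bit, i.e., n queries)."""
    return x.count("1")

def mod_m(x: str, m: int) -> int:
    """MOD_m^n: the Hamming weight of the n-bit string x, modulo m."""
    return hamming_weight(x) % m

def exact_k_l(x: str, k: int, l: int) -> bool:
    """EXACT_{k,l}^n: is the Hamming weight of x exactly k or exactly l?"""
    return hamming_weight(x) in (k, l)

print(mod_m("10110", 3))         # weight 3, so 3 % 3 = 0
print(exact_k_l("10110", 2, 3))  # True: weight is 3
```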
Exploration strategy design is a challenging problem in reinforcement learning (RL), especially when the environment contains a large state space or sparse rewards. During exploration, the agent tries to discover unexplored (novel) areas or high-reward (quality) areas. Most existing methods perform exploration by only utilizing the novelty of states. The novelty and quality in the neighboring area of the current state have not been well utilized to simultaneously guide the agent's exploration. To address this problem, this paper proposes a novel RL framework, called clustered reinforcement learning (CRL), for efficient exploration. CRL adopts clustering to divide the collected states into several clusters, based on which a bonus reward reflecting both novelty and quality in the neighboring area (cluster) of the current state is given to the agent. CRL leverages these bonus rewards to guide the agent to perform efficient exploration. Moreover, CRL can be combined with existing exploration strategies to improve their performance, as the bonus rewards employed by these existing strategies solely capture the novelty of states. Experiments on four continuous control tasks and six hard-exploration Atari-2600 games show that our method outperforms other state-of-the-art methods, achieving the best performance.
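The cluster-based bonus idea can be sketched in a few lines: states are grouped into clusters, and the bonus for the current state combines how rarely its cluster has been visited (novelty) with the average reward observed in that cluster (quality). The grid-based "clustering", weights, and bonus form below are illustrative assumptions for the sketch, not the paper's exact formulation.

```python
import numpy as np
from collections import defaultdict

class ClusterBonus:
    """Toy cluster-based exploration bonus: novelty + quality per cluster.
    Uses a fixed grid in place of real clustering for simplicity; CRL
    itself clusters the collected states."""

    def __init__(self, cell=1.0, alpha=1.0, beta=0.5):
        self.cell = cell      # grid resolution standing in for clustering
        self.alpha = alpha    # weight on novelty
        self.beta = beta      # weight on quality
        self.counts = defaultdict(int)
        self.reward_sums = defaultdict(float)

    def _cluster(self, state):
        return tuple(np.floor(np.asarray(state) / self.cell).astype(int))

    def bonus(self, state, reward):
        c = self._cluster(state)
        self.counts[c] += 1
        self.reward_sums[c] += reward
        novelty = 1.0 / np.sqrt(self.counts[c])         # rarely visited -> high
        quality = self.reward_sums[c] / self.counts[c]  # mean reward in cluster
        return self.alpha * novelty + self.beta * quality

cb = ClusterBonus()
print(cb.bonus([0.2, 0.3], reward=1.0))       # first visit: 1.0 + 0.5*1.0 = 1.5
print(cb.bonus([0.4, 0.1], reward=0.0) < 1.5) # same cluster revisited: bonus drops
```

The shaped reward the agent would optimize is then the environment reward plus this bonus, which is why the scheme composes naturally with novelty-only exploration strategies.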
The transformer architecture [1] has been widely used for natural language processing (NLP) tasks. Inspired by its excellent performance in NLP, transformer-based models [2, 3] have set many new records in various computer vision tasks. However, most vision transformers (ViTs) suffer from large model sizes, large run-time memory consumption, and high computational costs. Therefore, there is a pressing need to develop and deploy lightweight and efficient vision transformers.
With the rapid development of deep learning, current deep models can learn a fixed number of classes with high performance. However, in our ever-changing world, data often come from an open environment: they may arrive as a stream or be available only temporarily due to privacy issues. As a result, the classification model should learn new classes incrementally instead of restarting the training process.
Call graphs facilitate various tasks in software engineering. However, for the dynamic language Python, the complex language features and external library dependencies pose enormous challenges for building the call gr...
For Unmanned Aerial Vehicles (UAVs) monitoring tasks, capturing high quality images of target objects is important for subsequent recognition. Concerning the problem, many prior works study placement/trajectory planni...
Hallucination is a big shadow hanging over the rapidly evolving multimodal large language models (MLLMs), referring to the phenomenon that the generated text is inconsistent with the image content. To mitigate hallucinations, existing studies mainly resort to an instruction-tuning manner that requires retraining the models with specific data. In this paper, we pave a different way, introducing a training-free method named Woodpecker. Just as a woodpecker heals trees, it picks out and corrects hallucinations from the generated text. Concretely, Woodpecker consists of five stages: key concept extraction, question formulation, visual knowledge validation, visual claim generation, and hallucination correction. Implemented in a post-remedy manner, Woodpecker can easily serve different MLLMs, while being interpretable by accessing the intermediate outputs of the five stages. We evaluate Woodpecker both quantitatively and qualitatively and show the huge potential of this new paradigm. On the POPE benchmark, our method obtains a 30.66%/24.33% improvement in accuracy over the baseline MiniGPT-4/mPLUG-Owl. The source code is released at https://***/BradyFU/Woodpecker.
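The five-stage pipeline lends itself to a simple post-hoc correction skeleton. The toy function below only mirrors the stage ordering described above; everything else is a hypothetical stand-in (a real system would back the stages with an LLM, an open-vocabulary detector, and a VQA model rather than a set of known image facts).

```python
def woodpecker_correct(image_facts, generated_text):
    """Toy post-remedy pipeline in five stages. `image_facts` is a set of
    object names standing in for visual knowledge validation."""
    # 1) Key concept extraction: candidate object mentions in the answer.
    concepts = [w.strip(".,").lower() for w in generated_text.split()]
    # 2) Question formulation: one existence question per concept.
    questions = {c: f"Is there a {c} in the image?" for c in concepts}
    # 3) Visual knowledge validation: answer each question against the image.
    validated = {c: (c in image_facts) for c in questions}
    # 4) Visual claim generation: structured claims about the image.
    claims = [f"{c}: {'present' if ok else 'absent'}" for c, ok in validated.items()]
    # 5) Hallucination correction: drop mentions that fail validation.
    kept = [w for w in generated_text.split() if w.strip(".,").lower() in image_facts]
    return claims, " ".join(kept)

claims, fixed = woodpecker_correct({"dog", "ball"}, "dog cat ball")
print(fixed)  # dog ball
```

Because each stage produces an inspectable intermediate output (questions, validation results, claims), the correction is interpretable in the sense the abstract describes.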
Emotion-cause pair extraction (ECPE) aims to extract all the pairs of emotions and corresponding causes in a document. It generally contains three subtasks: emotion extraction, cause extraction, and causal-relation detection between emotions and causes. Existing works adopt pipelined approaches or multi-task learning to address the ECPE task. However, the pipelined approaches easily suffer from error propagation in real-world scenarios, and multi-task learning cannot optimize all tasks globally, which may lead to suboptimal extraction results. To address these issues, we propose a novel framework, Pairwise Tagging Framework (PTF), tackling the complete emotion-cause pair extraction in one unified tagging task. Unlike prior works, PTF innovatively transforms all subtasks of ECPE, i.e., emotion extraction, cause extraction, and causal-relation detection between emotions and causes, into one unified clause-pair tagging task. With this unified tagging task, we can optimize the ECPE task globally and extract more accurate emotion-cause pairs. To validate the feasibility and effectiveness of PTF, we design an end-to-end PTF-based neural network and conduct experiments on the ECPE benchmark dataset. The experimental results show that our method significantly outperforms pipelined approaches and typical multi-task learning approaches.
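The clause-pair tagging idea can be sketched as follows: every ordered pair of clauses (i, j) gets one tag, and emotion-cause pairs are read directly off the resulting tag grid, with no separate extraction stages to propagate errors between. The two-tag vocabulary below is an illustrative assumption for the sketch, not PTF's exact tagging scheme.

```python
def tag_matrix(n_clauses, emotion_cause_pairs):
    """Build an n x n tag grid: cell (i, j) = 'EC' if clause i carries an
    emotion caused by clause j, else 'O'. A trained model would predict
    this grid; here we construct it from gold pairs for illustration."""
    tags = [["O"] * n_clauses for _ in range(n_clauses)]
    for e, c in emotion_cause_pairs:
        tags[e][c] = "EC"
    return tags

def decode_pairs(tags):
    """Read emotion-cause pairs straight off the tag grid, so emotion
    extraction, cause extraction, and relation detection happen jointly."""
    return [(i, j) for i, row in enumerate(tags)
            for j, t in enumerate(row) if t == "EC"]

grid = tag_matrix(4, [(1, 0), (3, 3)])  # e.g., clause 1's emotion caused by clause 0
print(decode_pairs(grid))  # [(1, 0), (3, 3)]
```

Note that a pair like (3, 3) covers the common case where a clause is both the emotion and its own cause.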