We present Video-LLaMA, a multi-modal framework that empowers Large Language Models (LLMs) with the capability of understanding both visual and auditory content in video. Video-LLaMA bootstraps cross-modal trainin...
ISBN (Print): 9798891760615
Large Language Models (LLMs), through their contextualized representations, have been empirically proven to encapsulate syntactic, semantic, word-sense, and common-sense knowledge. However, there has been limited exploration of their physical reasoning abilities, specifically concerning the crucial attributes for comprehending everyday objects. To address this gap, we introduce NEWTON, a repository and benchmark for evaluating the physical reasoning skills of LLMs. Further, to enable domain-specific adaptation of this benchmark, we present a pipeline that lets researchers generate a variant customized to the objects and attributes relevant to their application. The NEWTON repository comprises a collection of 2800 object-attribute pairs, providing the foundation for generating infinite-scale assessment templates. The NEWTON benchmark consists of 160K QA questions, curated using the NEWTON repository, to investigate the physical reasoning capabilities of several mainstream language models across foundational, explicit, and implicit reasoning tasks. Through extensive empirical analysis, our results highlight the capabilities of LLMs for physical reasoning. We find that LLMs like GPT-4 demonstrate strong reasoning capabilities in scenario-based tasks but exhibit less consistency in object-attribute reasoning than humans (50% vs. 84%). Furthermore, the NEWTON platform demonstrates its potential for evaluating and enhancing language models, paving the way for their integration into physically grounded settings such as robotic manipulation. Project site: https://***
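The abstract does not spell out NEWTON's template format, so the following is only a minimal Python sketch of how a repository of object-attribute pairs could be expanded into comparison-style QA items; the pair values, the template wording, and the `foundational_questions` helper are all hypothetical.

```python
# Minimal sketch of template-based question generation from an
# object-attribute repository, in the spirit of NEWTON's pipeline.
# The entry format and question template are assumptions, not taken
# from the paper.

from itertools import combinations

# Hypothetical repository entries: (object, attribute, ordinal value).
REPOSITORY = [
    ("ceramic mug", "fragility", 4),
    ("steel spoon", "fragility", 1),
    ("rubber ball", "elasticity", 5),
    ("wooden block", "elasticity", 1),
]

def foundational_questions(repo):
    """Yield explicit object-attribute comparison questions."""
    by_attr = {}
    for obj, attr, value in repo:
        by_attr.setdefault(attr, []).append((obj, value))
    for attr, items in by_attr.items():
        for (o1, v1), (o2, v2) in combinations(items, 2):
            if v1 == v2:
                continue  # no ground-truth answer for ties
            yield {
                "question": f"Which object has higher {attr}: {o1} or {o2}?",
                "answer": o1 if v1 > v2 else o2,
            }

for qa in foundational_questions(REPOSITORY):
    print(qa["question"], "->", qa["answer"])
```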
ISBN (Print): 9798891760615
Current language models are mainly trained on snapshots of data gathered at a particular time, which decreases their capability to generalize over time and to model language change. To model the time variable, existing works have explored temporal language models (e.g., TempoBERT) that directly incorporate the timestamp into the training process. While effective to some extent, these methods are limited by the superficial temporal information carried by timestamps, which fails to capture the inherent changes of linguistic components. In this paper, we empirically confirm that the performance of pre-trained language models (PLMs) is closely tied to syntactically changed tokens. Based on this observation, we propose a simple yet effective method named Syntax-Guided Temporal Language Model (SG-TLM), which learns inherent language changes by capturing the intrinsic relationship between the time prefix and tokens with salient syntactic change. Experiments on two datasets and three tasks demonstrate that our model outperforms existing temporal language models in both memorization and generalization capabilities. Extensive results further confirm the effectiveness of our approach across different model frameworks, including both encoder-only and decoder-only models (e.g., LLaMA). Our code is available at https://***/zhaochen0110/TempoLM.
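As a rough illustration of the time-prefix idea, here is a minimal Python sketch that prepends a year token to a training sequence and upweights tokens flagged as syntactically changed; the prefix format, the weighting scheme, and the source of `changed_tokens` are assumptions, not details from the paper.

```python
# Minimal sketch of preparing a time-prefixed training example with a
# per-token loss weight that emphasizes syntactically changed tokens,
# loosely following the SG-TLM idea. All specifics are assumptions.

def build_example(text: str, year: int, changed_tokens: set[str],
                  salient_weight: float = 2.0):
    """Prepend a time prefix and weight tokens with salient syntactic change."""
    words = text.split()
    tokens = [f"<{year}>"] + words
    # Upweight tokens whose syntactic role has drifted over time, so the
    # LM loss focuses on them; the prefix and ordinary tokens keep 1.0.
    weights = [1.0] + [
        salient_weight if w in changed_tokens else 1.0 for w in words
    ]
    return tokens, weights

tokens, weights = build_example(
    "the broadcast went viral overnight",
    year=2016,
    changed_tokens={"viral"},  # e.g., detected via POS drift across corpora
)
print(list(zip(tokens, weights)))
```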
Synthetic data has been proposed as a solution to the scarcity of high-quality data in the training of large language models (LLMs). Studies have shown that synthetic data can effectively improve the per...
ISBN (Print): 9789819794331; 9789819794348
Large language models (LLMs) have exhibited notable general-purpose task-solving abilities in language understanding and generation, including processing recommendation tasks. The majority of existing research relies on training-free recommendation models that treat LLMs as reasoning engines and directly generate the recommendation response. This approach relies heavily on pre-trained knowledge and may incur excessive costs. We therefore propose a two-stage fine-tuning framework leveraging LLaMA2 and GPT-4 knowledge enhancement for recommendation. In particular, we use GPT-4 instruction-following data to tune the LLM in a first-stage instruction-tuning process, achieving lower training costs and better inference performance. In the second stage, through an elaborately designed prompt template, we fine-tune the first-stage LLM in a few-shot setting on interaction sequences built from user ratings. To validate the effectiveness of our framework, we compare against state-of-the-art baseline methods on benchmark datasets. The results demonstrate that our framework has promising recommendation capabilities. Our experiments are executed on a single RTX 4090 with LLaMA2-7B.
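The paper's exact prompt template is not given in the abstract; the sketch below only illustrates the general shape of a second-stage few-shot prompt built from a rated interaction sequence, with hypothetical wording, rating scale, and field names.

```python
# Minimal sketch of formatting a user's rated history and candidate
# items into an instruction prompt for few-shot recommendation
# fine-tuning. The template text is an assumption, not the paper's.

def build_prompt(history: list[tuple[str, int]], candidates: list[str]) -> str:
    """Format a rated interaction sequence as an instruction prompt."""
    lines = ["Below is a user's viewing history with ratings (1-5)."]
    for item, rating in history:
        lines.append(f"- {item}: {rating}")
    lines.append("Which of the following items should be recommended next?")
    lines.append(", ".join(candidates))
    lines.append("Answer:")
    return "\n".join(lines)

prompt = build_prompt(
    history=[("The Matrix", 5), ("Inception", 4), ("Titanic", 2)],
    candidates=["Blade Runner", "The Notebook"],
)
print(prompt)
```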
In NLP, text language models based on words or subwords are known to outperform their character-based counterparts. Yet, in the speech community, the standard input to spoken LMs consists of 20ms- or 40ms-long discrete units (...
Grammar induction has made significant progress in recent years. However, it is not clear how applying induced grammars can enhance practical performance in downstream tasks. In this work, we introduce an u...
ISBN (Print): 9798891760615
Text is ubiquitous in our visual world, conveying crucial information in documents, websites, and everyday photographs. In this work, we propose UReader, a first exploration of universal OCR-free visually-situated language understanding based on a Multimodal Large Language Model (MLLM). By leveraging the shallow text-recognition ability of the MLLM, we finetune only 1.2% of the parameters, and the training cost is much lower than in previous work following domain-specific pretraining and finetuning paradigms. Concretely, UReader is jointly finetuned on a wide range of visually-situated language understanding tasks via a unified instruction format. To enhance visual text and semantic understanding, we further apply two auxiliary tasks with the same format, namely text reading and key-points generation. We design a shape-adaptive cropping module before the encoder-decoder architecture of the MLLM to leverage the frozen low-resolution vision encoder for processing high-resolution images. Without downstream finetuning, our single model achieves state-of-the-art OCR-free performance in 8 out of 10 visually-situated language understanding tasks, across 5 domains: documents, tables, charts, natural images, and webpage screenshots. Code and instruction-tuning datasets are released at https://***/LukeForeverYoung/UReader.
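To make the shape-adaptive cropping idea concrete, here is a minimal Python sketch (using Pillow) that picks the candidate grid whose aspect ratio best matches the input image and cuts it into encoder-sized cells; the grid candidates, cell size, and selection criterion are assumptions, and UReader's actual module likely differs in its details.

```python
# Minimal sketch of shape-adaptive cropping: choose the grid whose
# rows/cols ratio is closest to the image's height/width ratio, then
# split the image into cells sized for a frozen low-resolution encoder.

from PIL import Image

GRIDS = [(1, 1), (1, 2), (2, 1), (2, 2), (1, 3), (3, 1), (2, 3), (3, 2)]

def shape_adaptive_crops(img: Image.Image, cell: int = 224):
    h_ratio = img.height / img.width
    # Pick the grid shape that best preserves the image's aspect ratio.
    rows, cols = min(GRIDS, key=lambda g: abs(g[0] / g[1] - h_ratio))
    resized = img.resize((cols * cell, rows * cell))
    return [
        resized.crop((c * cell, r * cell, (c + 1) * cell, (r + 1) * cell))
        for r in range(rows) for c in range(cols)
    ]

crops = shape_adaptive_crops(Image.new("RGB", (1600, 800)))
print(len(crops), crops[0].size)  # 2 crops of 224x224 for a 2:1 image
```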
Large Language Models (LLMs) and Large Vision-Language Models (LVLMs) exhibit advanced proficiency in language reasoning and comprehension across a wide array of languages. While their performance is notably robust in...
ISBN (Print): 9798891760615
Multiple types of knowledge (e.g., coreference, topics, emotional causes) have been demonstrated effective for emotion detection. However, exploiting this knowledge in Emotion Recognition in Conversations (ERC) has remained largely unexplored due to the lack of annotated data and the high cost involved in obtaining such knowledge. Fortunately, the emergence of Large Language Models (LLMs) holds promise for filling this void. Therefore, we propose a Multiple Knowledge Fusion Model (MKFM) to effectively integrate such knowledge generated by LLMs for ERC, and we empirically study its impact on the model. Experimental results on three public datasets demonstrate the effectiveness of multiple knowledge for ERC. Furthermore, we conduct a detailed analysis of the contribution and complementarity of this knowledge.
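The abstract does not describe how the LLM-generated knowledge is fused, so the following Python sketch simply concatenates several knowledge strings with the target utterance; the prompt wording, the knowledge types, and the placeholder `query_llm` call are all assumptions about the setup, not MKFM's actual fusion mechanism.

```python
# Minimal sketch of assembling multiple LLM-generated knowledge types
# into one ERC input. `query_llm` stands in for any real LLM call;
# fusion by plain concatenation is an assumption.

KNOWLEDGE_PROMPTS = {
    "coreference": "Resolve the pronouns in this dialogue: {dialogue}",
    "topic": "Summarize the topic of this dialogue in one phrase: {dialogue}",
    "emotional_cause": "What event likely causes the last speaker's emotion? {dialogue}",
}

def query_llm(prompt: str) -> str:
    """Placeholder for a real LLM request."""
    return f"<answer to: {prompt[:40]}...>"

def fuse_knowledge(dialogue: str, utterance: str) -> str:
    """Concatenate each knowledge string with the target utterance."""
    pieces = [
        f"[{name}] {query_llm(tmpl.format(dialogue=dialogue))}"
        for name, tmpl in KNOWLEDGE_PROMPTS.items()
    ]
    # The fused text is what the ERC classifier would encode.
    return " ".join(pieces) + f" [utterance] {utterance}"

print(fuse_knowledge("A: I lost my keys. B: Again?!", "Again?!"))
```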