ISBN (digital): 9798350368741
ISBN (print): 9798350368758
Recently, audio generation tasks have attracted considerable research interest. Despite rapid advances in generating high-fidelity audio that is coarsely aligned with the text description, precise temporal controllability remains a challenge, even though it is essential for integrating audio generation into real applications. In this work, we propose a temporally controlled audio generation framework, PicoAudio. It leverages data crawling, segmentation, and filtering to simulate fine-grained, temporally aligned audio-text data. Furthermore, PicoAudio integrates temporal information to guide audio generation through a tailored model design. With the effective text processing capabilities of large language models, PicoAudio can take natural language input and generate audio that aligns well with the temporal description in the input. Both subjective and objective evaluations demonstrate that PicoAudio dramatically surpasses current state-of-the-art generation models in terms of timestamp and occurrence frequency controllability. Generation samples are available at PicoAudio-Demo.
Evaluation plays a pivotal role in enhancing the capabilities of large language models. Previously, numerous methods for evaluating large language models have been proposed in this area. Despite their effectiveness, ...
Computerized adaptive testing (CAT), a tool that can efficiently measure a student's ability, has been widely used in various standardized tests (e.g., GMAT and GRE). The adaptivity of CAT refers to the selection of the most informative questions for each student, which reduces test length. Existing CAT methods do not explicitly target ability estimation accuracy, since a student's true ability is not available as ground truth; therefore, these methods cannot guarantee that the estimate converges to the true ability with such limited responses. In this paper, we analyze the statistical properties of estimation and find a theoretical approximation of the true ability: the ability estimated from full responses to the question bank. Based on this, a Bounded Ability Estimation framework for CAT (BECAT) is proposed in a data-summary manner, which selects a question subset that closely matches the gradient of the full responses. We then develop an expected gradient difference approximation to design a simple greedy selection algorithm, and establish rigorous theoretical and error upper-bound guarantees for its ability estimate. Experiments on both real-world and synthetic datasets show that it can reach the same estimation accuracy using 15% fewer questions on average, significantly reducing test length.
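The idea of greedily selecting a question subset whose gradient matches that of the full question bank can be illustrated with a toy sketch. This is not the BECAT algorithm: it assumes a one-parameter (Rasch) response model and assumes that the per-question gradients are already known, whereas the abstract states that the method relies on an expected gradient difference approximation because responses to unasked questions are unavailable. The function names `irt_gradient` and `greedy_gradient_match` are hypothetical.

```python
import math


def irt_gradient(theta, difficulty, response):
    """Gradient of the Rasch log-likelihood w.r.t. ability for one question.

    Assumes P(correct) = sigmoid(theta - difficulty); response is 0 or 1.
    """
    p_correct = 1.0 / (1.0 + math.exp(-(theta - difficulty)))
    return response - p_correct


def greedy_gradient_match(gradients, k):
    """Greedily pick k question indices whose mean gradient tracks the full mean.

    At each step, add the question that brings the mean gradient of the
    selected subset closest to the mean gradient over the whole bank.
    """
    target = sum(gradients) / len(gradients)
    selected = []
    running_sum = 0.0
    remaining = list(range(len(gradients)))
    for _ in range(min(k, len(gradients))):
        best = min(
            remaining,
            key=lambda j: abs((running_sum + gradients[j]) / (len(selected) + 1) - target),
        )
        selected.append(best)
        running_sum += gradients[best]
        remaining.remove(best)
    return selected
```

Calling `greedy_gradient_match(gradients, k)` with a list of per-question gradients returns the indices of the chosen questions in selection order; the gap between the subset mean and the full mean is the quantity a gradient-matching criterion tries to keep small.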
The dramatically increasing volume of incomplete data makes the imputation models computationally infeasible in many real-life applications. In this paper, we propose an effective scalable imputation system named SCIS...
Hierarchical multi-granularity classification is the task of classifying objects according to multiple levels or granularities. The class hierarchy is vital side information for hierarchical multi-granularity classifi...
ISBN (digital): 9798350368741
ISBN (print): 9798350368758
Recent advances in audio generation have enabled the creation of high-fidelity audio clips from free-form textual descriptions. However, temporal relations, a critical feature of audio content, are currently underrepresented in mainstream models, resulting in imprecise temporal controllability. Specifically, users cannot accurately control the timestamps of sound events using free-form text. One significant challenge is the absence of a high-quality, temporally aligned audio-text dataset, which is essential for training models with temporal control. The more temporally aligned the annotations, the better the models can understand the precise relationship between audio outputs and temporal textual prompts. Therefore, we propose a temporally aligned audio-text dataset, AudioTime. It provides text annotations rich in temporal information such as timestamps, duration, frequency, and ordering, covering almost all aspects of temporal control. Additionally, we offer a comprehensive test set and evaluation metric to assess the temporal control performance of text-to-audio generation models. Examples are available at AudioTime-Demo.
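The four kinds of temporal information named above (timestamps, duration, frequency, and ordering) are related: given timestamped sound events, the other three can be derived. The following is a minimal sketch of that relationship. The tuple layout `(label, onset_sec, offset_sec)` and the function name `summarize_events` are assumptions made for illustration and do not reflect the actual AudioTime annotation format.

```python
from collections import Counter


def summarize_events(events):
    """Derive duration, frequency, and ordering from timestamped events.

    events: list of (label, onset_sec, offset_sec) tuples (hypothetical layout).
    """
    ordered = sorted(events, key=lambda e: e[1])  # sort by onset time

    # Total sounding time per label, in seconds.
    duration = {}
    for label, onset, offset in ordered:
        duration[label] = duration.get(label, 0.0) + (offset - onset)

    # Number of occurrences per label.
    frequency = dict(Counter(label for label, _, _ in ordered))

    # Labels in order of first appearance.
    ordering = []
    for label, _, _ in ordered:
        if label not in ordering:
            ordering.append(label)

    return {"duration": duration, "frequency": frequency, "ordering": ordering}
```

For example, two "dog bark" events and one earlier "door slam" event yield a frequency of 2 for "dog bark" and an ordering that starts with "door slam".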
Mathematical reasoning is one of the crucial abilities of general artificial intelligence, which requires machines to master mathematical logic and knowledge from solving problems. However, existing approaches are not...
In hierarchical classification learning, hierarchical feature selection algorithms play an important role and can be used to address the curse of dimensionality. Existing hierarchical feature selection algorithms b...
ISBN (print): 9781713871088
Rationale extraction can be considered a straightforward method of improving model explainability, where rationales are a subsequence of the original inputs and can be extracted to support the prediction results. Existing methods mainly cascade a selector, which extracts the rationale tokens, with a predictor, which makes the prediction based on the selected tokens. Because previous works ignore the information in non-selected tokens and thus fail to fully exploit the original input, we propose a Disentanglement-Augmented Rationale Extraction (DARE) method, which encapsulates more information from the input to extract rationales. Specifically, it first disentangles the input into rationale representations and non-rationale representations, and then learns more comprehensive rationale representations for extraction by minimizing the mutual information (MI) between the two disentangled representations. In addition, to improve the performance of MI minimization, we develop a new MI estimator by exploring existing MI estimation methods. Extensive experimental results on three real-world datasets and simulation studies clearly validate the effectiveness of our proposed method.
Textile composition identification (TCI) is an essential basic link in the textile industry. Methods based on computer vision or near-infrared (NIR) signal processing have shown potential for the nondestructive TCI ta...