ISBN (Print): 9781728176055
Most deep learning-based speech enhancement methods focus on modeling the complicated relationship between noisy and clean speech without considering noise information. To cope with various complex noise scenes, we introduce a novel enhancement architecture that integrates a deep autoencoder with a neural noise embedding. In this study, a new normalization method, termed conditional layer normalization (CLN), is introduced to improve the generalization of deep learning-based speech enhancement approaches to unseen environments. The noise embedding is passed through the CLN layers to regularize the network for the speech enhancement task, so the proposed network can be adaptively adjusted according to the noise information extracted from the noisy speech input. The overall network is trained in an end-to-end manner, and the experimental results show that the proposed scheme produces satisfactory enhancement performance compared with other methods. The visualization shows that the proposed network captures noise information, which helps improve robustness to unseen environments for speech enhancement.
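The abstract does not give implementation details, but the conditioning mechanism it describes can be sketched as a layer normalization whose scale and bias are predicted from the noise embedding rather than learned as fixed parameters. A minimal PyTorch sketch, with illustrative module and dimension names that are assumptions rather than the paper's own:

import torch
import torch.nn as nn

class ConditionalLayerNorm(nn.Module):
    def __init__(self, feature_dim, embed_dim):
        super().__init__()
        # Plain LayerNorm without its own affine parameters.
        self.norm = nn.LayerNorm(feature_dim, elementwise_affine=False)
        # Scale and bias are generated from the noise embedding instead.
        self.to_scale = nn.Linear(embed_dim, feature_dim)
        self.to_bias = nn.Linear(embed_dim, feature_dim)

    def forward(self, x, noise_embedding):
        # x: (batch, time, feature_dim); noise_embedding: (batch, embed_dim)
        scale = self.to_scale(noise_embedding).unsqueeze(1)
        bias = self.to_bias(noise_embedding).unsqueeze(1)
        return self.norm(x) * (1 + scale) + bias

# Usage: normalize enhancement features conditioned on a per-utterance
# noise vector from an auxiliary noise encoder (hypothetical here).
cln = ConditionalLayerNorm(feature_dim=256, embed_dim=128)
features = torch.randn(4, 100, 256)
noise_emb = torch.randn(4, 128)
out = cln(features, noise_emb)

The (1 + scale) form keeps the layer close to an unconditioned LayerNorm when the projected scale is near zero, a common choice in conditional normalization layers.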
Joint relational triple extraction treats entity recognition and relation extraction as a joint task to extract relational triples, and it is a critical task in information extraction and knowledge graph construction. However, most existing joint models still fall short in extracting overlapping triples. Moreover, these models ignore the trigger words of potential relations during relation detection. To address these two issues, a joint model based on Potential Relation Detection and Conditional Entity Mapping, named PRDCEM, is proposed. Specifically, the proposed model consists of three components corresponding to three subtasks: potential relation detection, candidate entity tagging, and conditional entity mapping. First, a non-autoregressive decoder containing a cross-attention mechanism is applied to detect potential relations. In this way, different potential relations are associated with the corresponding trigger words in the given sentence, and the semantic representations of the trigger words are fully utilized to encode the potential relations. Second, two distinct sequence taggers are employed to extract candidate subjects and objects. Third, an entity mapping module incorporating conditional layer normalization is designed to align the candidate subjects and objects: each candidate subject and each potential relation are combined to form a condition that is incorporated into the sentence representation, which enables the effective extraction of overlapping triples. Finally, a negative sampling strategy is employed in the entity mapping module to mitigate error propagation from the previous two components. Experimental results obtained on two widely used public datasets, against 15 baselines, demonstrate that PRDCEM can effectively extract overlapping triples and achieves improved performance.
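As a concrete illustration of the conditional entity mapping step, the sketch below (an assumption about the design, not the paper's released code) fuses a pooled candidate-subject vector with a relation embedding into a condition that modulates the token representations through conditional layer normalization before object tagging:

import torch
import torch.nn as nn

class ConditionalEntityMapper(nn.Module):
    def __init__(self, hidden_dim, num_relations):
        super().__init__()
        self.rel_embed = nn.Embedding(num_relations, hidden_dim)
        self.norm = nn.LayerNorm(hidden_dim, elementwise_affine=False)
        self.to_scale = nn.Linear(2 * hidden_dim, hidden_dim)
        self.to_bias = nn.Linear(2 * hidden_dim, hidden_dim)
        self.obj_tagger = nn.Linear(hidden_dim, 2)  # object start/end logits

    def forward(self, token_states, subject_vec, relation_id):
        # token_states: (batch, seq_len, hidden_dim) sentence encoding
        # subject_vec:  (batch, hidden_dim) pooled candidate-subject vector
        # relation_id:  (batch,) index of a detected potential relation
        cond = torch.cat([subject_vec, self.rel_embed(relation_id)], dim=-1)
        scale = self.to_scale(cond).unsqueeze(1)
        bias = self.to_bias(cond).unsqueeze(1)
        conditioned = self.norm(token_states) * (1 + scale) + bias
        return torch.sigmoid(self.obj_tagger(conditioned))

Because each (subject, relation) pair yields a different condition, the same sentence can be decoded multiple times under different modulations, which is what lets overlapping triples that share tokens be extracted separately.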
Partial multi-label learning (PML) addresses problems where each instance is assigned a candidate label set and only a subset of these candidate labels is correct. The major challenge of PML is that the training procedure can easily be misguided by noisy labels. Current studies on PML reveal two significant drawbacks. First, most of them do not sufficiently explore complex label correlations, which could improve the effectiveness of label disambiguation. Second, PML models rely heavily on prior assumptions, limiting their applicability to specific scenarios. In this work, we propose a novel PML method based on the Encoder-Decoder Framework (PML-ED) to address these drawbacks. PML-ED first estimates the label probability distribution through a KNN label attention mechanism. It then adopts conditional layer normalization (CLN) to extract high-order label correlations and relaxes the prior assumptions about label noise by introducing a universal encoder-decoder framework. This approach makes PML-ED not only more efficient than state-of-the-art methods but also capable of handling data with heavy label noise across different domains. Experimental results on 28 benchmark datasets demonstrate that PML-ED, when benchmarked against nine leading-edge PML algorithms, achieves the highest average ranking across five evaluation criteria.
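The KNN label attention step can be illustrated with a small sketch; the function name and the dot-product similarity are assumptions for exposition, not details from the paper:

import torch
import torch.nn.functional as F

def knn_label_attention(query_feat, neighbor_feats, neighbor_labels):
    # query_feat:      (feat_dim,)     features of the target instance
    # neighbor_feats:  (k, feat_dim)   features of its k nearest neighbors
    # neighbor_labels: (k, num_labels) 0/1 candidate-label vectors
    scores = neighbor_feats @ query_feat        # similarity to each neighbor
    weights = F.softmax(scores, dim=0)          # attention weights over neighbors
    return weights @ neighbor_labels.float()    # soft label probability distribution

Neighbors that resemble the target instance contribute more to its initial label distribution, giving the encoder-decoder a denoised starting point before CLN models the higher-order label correlations.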
Text-to-speech (TTS) technology is commonly used to generate personalized voices for new speakers. Despite considerable progress in TTS technology, synthesizing high-quality custom voices remains problematic. Fine-tuning a TTS model is a popular approach to this issue. However, fine-tuning must be applied once for every new speaker, which results in both time-consuming model training and excessive storage of TTS model parameters. Therefore, to support a large number of new speakers, a parameter-efficient fine-tuning (PEFT) approach is needed instead of full fine-tuning, along with a way to accommodate multiple speakers with a small number of parameters. To this end, this work first incorporates a low-rank adaptation-based fine-tuning method into the variational inference with adversarial learning for end-to-end text-to-speech (VITS) model. Next, the approach is extended with conditional layer normalization for multi-speaker fine-tuning, and a residual adapter is further applied to the text encoder outputs of the VITS model to improve the intelligibility and naturalness of the personalized speech. The performance of the fine-tuned TTS models with different combinations of fine-tuning modules is evaluated on the Libri-TTS-100, VCTK, and Common Voice datasets, as well as a Korean multi-speaker dataset. Objective and subjective quality comparisons reveal that the proposed approach achieves speech quality comparable to that of a fully fine-tuned model, with around a 90% reduction in the number of model parameters.
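A minimal sketch of the low-rank adaptation (LoRA) component, assuming it wraps individual linear layers of the pretrained VITS model; the rank and scaling values below are illustrative, not the paper's settings:

import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank=8, alpha=16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)              # freeze pretrained weights
        self.down = nn.Linear(base.in_features, rank, bias=False)
        self.up = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.up.weight)           # start as a zero update
        self.scaling = alpha / rank

    def forward(self, x):
        # Frozen base output plus a trainable low-rank residual update.
        return self.base(x) + self.up(self.down(x)) * self.scaling

Only the two small projection matrices are trained per speaker, consistent with the roughly 90% parameter reduction reported above when combined with conditional layer normalization and residual adapters.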