This paper focuses on the image composition of transparent objects, where existing image matting methods suffer from composition errors due to the lack of accurate foreground during the composition process. We propose...
Despite significant progress, the shortage of labeled data and expert knowledge remains a challenge for Fine-grained Visual Classification (FGVC). Some multi-source approaches that incorporate additional modalities, s...
In traditional audio captioning methods, a model is usually trained in a fully supervised manner using a human-annotated dataset containing audio-text pairs and then evaluated on the test set from the same dataset. Such methods have two limitations. First, these methods are often data-hungry and require time-consuming and expensive human annotations to obtain audio-text pairs. Second, these models often suffer from performance degradation in cross-domain scenarios, i.e., when the input audio comes from a different domain than the training set, and this issue has received little attention. To address these issues, we propose a new zero-shot method for audio captioning. Our method is built on the contrastive language-audio pre-training (CLAP) model. During training, the model reconstructs the ground-truth caption using the CLAP text encoder. In the inference stage, the model generates text descriptions from the CLAP audio embeddings of given audio inputs. To enhance the ability of the model in transitioning from text-to-text generation to audio-to-text generation, we propose to use the mixed-augmentations-based soft prompt to learn more robust latent representations, leveraging instance replacement and embedding augmentation. Additionally, we introduce the retrieval-based acoustic-aware hard prompt to improve the cross-domain performance of the model by employing the domain-agnostic label information of sound events. Extensive experiments on AudioCaps and Clotho benchmarks show the effectiveness of our proposed method, which outperforms other zero-shot audio captioning approaches for in-domain scenarios and outperforms the compared methods for cross-domain scenarios, underscoring the generalization ability of our method.
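The retrieval step behind the acoustic-aware hard prompt can be illustrated with a minimal sketch. It assumes that sound-event labels and the input audio have already been embedded in a shared space (for example by CLAP encoders, which are not called here). The function names, the toy embeddings, and the prompt template are hypothetical and are not taken from the paper.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def hard_prompt(audio_emb, label_embs, top_k=2):
    """Retrieve the top_k labels closest to the audio embedding and
    format them as a text prompt (the template is an assumption)."""
    ranked = sorted(
        label_embs,
        key=lambda name: cosine(audio_emb, label_embs[name]),
        reverse=True,
    )
    return "Sound events: " + ", ".join(ranked[:top_k]) + "."

# Toy 2-D embeddings standing in for real audio and text embeddings.
labels = {"dog bark": [0.9, 0.1], "rain": [0.0, 1.0], "speech": [0.6, 0.4]}
prompt = hard_prompt([1.0, 0.0], labels, top_k=2)
```

Because the prompt contains only label text, the same retrieval step applies regardless of which dataset the audio comes from, which is one plausible reading of "domain-agnostic" in the abstract.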
Lightweight image super-resolution aims to reconstruct high-resolution images from low-resolution images using low computational costs. However, existing methods result in the loss of middle-layer features due to acti...
Detecting weak fire, such as overexposed and highly transparent flames, remains a significant challenge in vision-based fire detection. Convolutional Neural Network (CNN) based methods are widely used for automatic fi...
Natural image matting plays a crucial role in numerous real-world applications. Image matting methods based on pixel pair optimization are a type of matting algorithm that has significant advantages in parallel compu...
Document subjectivity analysis has become an important aspect of web text content mining. The problem is similar to traditional text categorization, so many related classification techniques can be adapted to it. There is one significant difference, however: estimating the subjectivity of a document well requires more linguistic or semantic information. In this paper we therefore focus on two aspects. One is how to extract useful and meaningful language features, and the other is how to construct appropriate language models efficiently for this task. For the first issue, we apply a Global-Filtering and Local-Weighting strategy to select and evaluate language features from n-grams of different orders and within various distance windows. For the second issue, we adopt Maximum Entropy (MaxEnt) modeling to construct our language model framework. Besides the classical MaxEnt models, we also construct two kinds of improved models, with Gaussian and exponential priors respectively. Detailed experiments show that, with well-selected and weighted language features, MaxEnt models with exponential priors are significantly more suitable for the text subjectivity analysis task.
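As a rough illustration of the feature space described above, the sketch below collects contiguous n-grams of several orders together with non-adjacent word pairs that fall inside a distance window. The abstract does not define the distance windows or the filtering and weighting steps, so the pair features, the function name, and the parameters `max_order` and `window` are assumptions made for illustration only.

```python
from collections import Counter

def ngram_features(tokens, max_order=2, window=3):
    """Count contiguous n-grams up to max_order, plus non-adjacent
    word pairs whose distance is between 2 and window (assumed scheme)."""
    feats = Counter()
    # Contiguous n-grams of order 1..max_order.
    for n in range(1, max_order + 1):
        for i in range(len(tokens) - n + 1):
            feats[("ngram", tuple(tokens[i:i + n]))] += 1
    # Word pairs separated by 2..window positions, tagged with the distance.
    for i, left in enumerate(tokens):
        last = min(i + window, len(tokens) - 1)
        for j in range(i + 2, last + 1):
            feats[("pair", (left, tokens[j]), j - i)] += 1
    return feats

feats = ngram_features(["i", "really", "love", "it"], max_order=2, window=3)
```

Counts like these would then be filtered and weighted before being passed to a MaxEnt classifier; that stage is not shown here.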
This article proposes a model combination method to enhance the discriminability of generative models. Generative and discriminative models have different optimization objectives, and each has its own advantages and drawbacks. The proposed method aims to strike a balance between the two. It extracts the discriminative parameter from the generative model and generates a new model based on a multi-model combination. The combination weight is determined by the ratio of the inter-class variance to the intra-class variance. The higher the ratio, the greater the weight and the more discriminative the model. Experiments on speech recognition demonstrate that the new model outperforms the model trained with the traditional generative method.
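The variance-ratio weighting can be sketched for one-dimensional values as follows. The abstract states only that the weight grows with the ratio of inter-class to intra-class variance. The function names and the normalization of ratios into weights that sum to one are assumptions, not the authors' formulation.

```python
from statistics import fmean

def variance_ratio(samples_by_class):
    """Inter-class variance divided by intra-class variance.
    samples_by_class maps a class label to a list of scalar values.
    Raises ZeroDivisionError if every class has zero spread."""
    n_total = sum(len(v) for v in samples_by_class.values())
    grand_mean = sum(sum(v) for v in samples_by_class.values()) / n_total
    means = {c: fmean(v) for c, v in samples_by_class.items()}
    # Spread of the class means around the grand mean.
    inter = sum(
        len(v) * (means[c] - grand_mean) ** 2
        for c, v in samples_by_class.items()
    ) / n_total
    # Spread of the samples around their own class mean.
    intra = sum(
        (x - means[c]) ** 2
        for c, v in samples_by_class.items()
        for x in v
    ) / n_total
    return inter / intra

def combination_weights(ratios):
    """Normalize ratios so the weights sum to 1 (assumed scheme)."""
    total = sum(ratios)
    return [r / total for r in ratios]

well_separated = variance_ratio({"a": [0, 2], "b": [10, 12]})
poorly_separated = variance_ratio({"a": [0, 2], "b": [4, 6]})
weights = combination_weights([well_separated, poorly_separated])
```

In this toy case the component whose classes are farther apart receives the larger weight, matching the monotonic relationship described in the abstract.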
In this paper, we present a novel audio fingerprinting method based on N-grams, which can quickly identify a segment of audio even when the audio signal is severely distorted. We use N peaks in the spectrum to form the audio fingerprint, which greatly accelerates retrieval. We use the initial robust peaks to calculate the similarity between the candidates and the input audio, which significantly improves retrieval accuracy. The N-gram method was evaluated on a music database of 10,000 songs. Experimental results show that the proposed approach outperforms two state-of-the-art algorithms (Shazam and Philips Robust Hash) in both effectiveness (retrieval accuracy) and efficiency (average retrieval time).
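One way to picture a peak N-gram fingerprint is the sketch below, which assumes spectral peaks have already been extracted as (time frame, frequency bin) pairs. The key layout, the inverted index, and the offset-voting lookup are illustrative choices and are not details confirmed by the abstract. All function names are hypothetical.

```python
from collections import Counter, defaultdict

def peak_ngrams(peaks, n=3):
    """Yield (key, anchor_time) for every run of n consecutive peaks.
    peaks is a list of (time_frame, freq_bin) sorted by time."""
    for i in range(len(peaks) - n + 1):
        run = peaks[i:i + n]
        t0 = run[0][0]
        freqs = tuple(f for _, f in run)
        deltas = tuple(t - t0 for t, _ in run[1:])  # shift-invariant timing
        yield (freqs, deltas), t0

def build_index(songs, n=3):
    """Map each N-gram key to the (song_id, anchor_time) pairs containing it."""
    index = defaultdict(list)
    for song_id, peaks in songs.items():
        for key, t0 in peak_ngrams(peaks, n):
            index[key].append((song_id, t0))
    return index

def identify(index, query_peaks, n=3):
    """Vote for (song, time offset) pairs; return the best song or None."""
    votes = Counter()
    for key, tq in peak_ngrams(query_peaks, n):
        for song_id, ts in index.get(key, ()):
            votes[(song_id, ts - tq)] += 1
    if not votes:
        return None
    (song_id, _offset), _count = votes.most_common(1)[0]
    return song_id

songs = {
    "A": [(0, 10), (2, 20), (5, 30), (7, 40), (9, 50)],
    "B": [(0, 11), (3, 21), (4, 31), (8, 41)],
}
index = build_index(songs)
# A clip of song A starting two frames in, with times re-based to zero.
match = identify(index, [(0, 20), (3, 30), (5, 40), (7, 50)])
```

Exact-match hashing makes each lookup a dictionary access, which is one reason an N-gram key can speed up retrieval; tolerance to distortion would depend on how the peaks are selected, which this sketch does not model.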
The a priori signal-to-noise ratio (SNR) is one of the most important parameters in short-time spectrum estimation techniques for speech enhancement. A new and convenient algorithm to estimate the a priori SNR is involved ...