Skin cancer is a high-incidence cancer that seriously threatens people's lives and health. Early detection, early diagnosis, and early treatment are among the effective ways to increase the survival rate ...
Document layout analysis is an important part of document information processing systems and is essential for many applications such as optical character recognition (OCR), machine translation, information retrieval, and structured data extraction from documents, as well as for digitizing paper documents and classifying and identifying document image regions. Document images contain a wealth of information, and to automatically extract and classify regions of interest, their layout content is analyzed programmatically before subsequent OCR and automatic transcription. However, existing algorithms still have notable limitations due to diverse document layouts, variations in block positions, inter-class and intra-class variation, and background noise. This paper first summarizes traditional algorithms based on run-length smoothing and projection segmentation, deep learning algorithms using recurrent convolutional neural networks and Siamese networks, and recently proposed algorithms that combine traditional and deep learning. The current mainstream algorithms, the datasets commonly used in deep learning experiments, and how to access them are highlighted, and a comparison of representative algorithms on benchmark datasets, together with experimental results showing good robustness, is given. Finally, future research directions are discussed.
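To make the traditional side of the survey concrete, here is a minimal, hedged sketch of projection-profile segmentation, one of the classical techniques mentioned above. It assumes a binarized page as a NumPy array; the `min_height` and `ink_threshold` parameters are illustrative choices, not values taken from the survey.

```python
import numpy as np

def projection_profile_rows(binary_page, min_height=3, ink_threshold=0):
    """Split a binarized page (ink = 1, background = 0) into horizontal
    text bands using its row-wise projection profile."""
    profile = binary_page.sum(axis=1)          # ink count per row
    bands, start = [], None
    for row, ink in enumerate(profile):
        if ink > ink_threshold and start is None:
            start = row                        # entering a text band
        elif ink <= ink_threshold and start is not None:
            if row - start >= min_height:      # drop very thin bands (noise)
                bands.append((start, row))
            start = None
    if start is not None:                      # band touching the page bottom
        bands.append((start, len(profile)))
    return bands

# Toy usage: a 20x30 "page" with two synthetic text bands.
page = np.zeros((20, 30), dtype=np.uint8)
page[2:7, 3:27] = 1
page[11:17, 3:27] = 1
print(projection_profile_rows(page))           # -> [(2, 7), (11, 17)]
```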
With the development of deep learning, speaker recognition systems have shown increasingly better performance. The generalization ability of the models is also an important aspect of performance evaluation. Typically, a baseline system is used to compare against the improved models to demonstrate performance enhancements. However, we cannot determine the differences in learned voiceprint features between the improved models and the baseline system. This paper introduces an improved speaker recognition system based on the ECAPA-TDNN model. It utilizes stable learning to eliminate sample correlation and employs attribution analysis to compare the differences in voiceprint feature learning between the improved and baseline systems. Experimental results demonstrate that stable learning improves the model’s generalization performance and helps it learn better voiceprint features. The effectiveness and generalization capability of the proposed method are verified through experiments on the VoxCeleb, CNCeleb, and LibriSpeech datasets. This work is important for enhancing speaker recognition performance, analyzing differences in voiceprint feature learning, and promoting advancements in the field.
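The abstract does not spell out the stable-learning formulation, so the following is only a rough sketch of one common variant: learning per-sample weights that decorrelate batch features and then using them to reweight the training loss. The `ecapa_tdnn`, `classifier`, and `criterion` names in the usage comment are hypothetical placeholders, not components taken from the paper.

```python
import torch

def decorrelation_weights(features, steps=100, lr=0.1):
    """Learn per-sample weights that reduce pairwise feature correlation in a
    batch, a simplified form of stable-learning sample reweighting.

    features: (batch, dim) tensor of speaker embeddings.
    Returns a (batch,) tensor of non-negative weights with mean 1.
    """
    logits = torch.zeros(features.size(0), requires_grad=True)
    opt = torch.optim.Adam([logits], lr=lr)
    for _ in range(steps):
        w = torch.softmax(logits, dim=0) * features.size(0)   # mean weight = 1
        mean = (w.unsqueeze(1) * features).mean(dim=0, keepdim=True)
        centered = features - mean
        cov = (w.unsqueeze(1) * centered).t() @ centered / features.size(0)
        off_diag = cov - torch.diag(torch.diag(cov))
        loss = (off_diag ** 2).sum()                          # penalize correlation
        opt.zero_grad()
        loss.backward()
        opt.step()
    return (torch.softmax(logits, dim=0) * features.size(0)).detach()

# Hypothetical usage: reweight the per-utterance speaker-classification loss.
# emb = ecapa_tdnn(waveforms)                     # (batch, dim) embeddings
# w = decorrelation_weights(emb.detach())
# loss = (w * criterion(classifier(emb), labels)).mean()  # criterion with reduction='none'
```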
ISBN (digital): 9781665476744
ISBN (print): 9781665476751
In this paper, we focus on the task of bilingual dictionary induction for the Chinese-Uyghur language pair. Correlating long-distance linguistic information usually requires cross-lingual information as supervision, which in turn relies on parallel corpora to link seed lexicons, and such parallel corpora are expensive to obtain. Text data for the low-resource Uyghur language are available only in small amounts, and its derivational morphology is rich and complex. In bilingual processing, aligning the most similar units and entity stems is the first step, so separating sentences into morpheme sequences is essential for cross-lingual processing tasks. Uyghur words in text consist of stems joined with several suffixes or prefixes, and the rich, complex affix forms produce many derivative words. This easily increases the repetition rate of features in the text, which reduces the efficiency of bilingual dictionary extraction. In this work, we explore resource construction and granularity optimization for minority low-resource languages and learn cross-lingual word embeddings without the supervision of parallel data. We propose a Chinese-Uyghur bilingual dictionary extraction method based on neural cross-lingual word embeddings and a multilingual morphological analyzer. Experiments show that the morpheme-sequence-based approach achieves a significant improvement over the baseline model based on word sequences.
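As a hedged illustration of the cross-lingual word-embedding component, the sketch below shows the standard Procrustes mapping and nearest-neighbour dictionary induction steps used in many unsupervised bilingual-embedding pipelines; the paper's exact training procedure is not given in the abstract, so `src_emb`, `tgt_emb`, and `pairs` are assumed inputs.

```python
import numpy as np

def procrustes(src_emb, tgt_emb, pairs):
    """Orthogonal mapping W minimizing ||X W^T - Y|| over aligned pairs
    (the refinement step used in many cross-lingual embedding methods)."""
    X = src_emb[[i for i, _ in pairs]]
    Y = tgt_emb[[j for _, j in pairs]]
    U, _, Vt = np.linalg.svd(Y.T @ X)
    return U @ Vt

def induce_dictionary(src_emb, tgt_emb, W, k=1):
    """Nearest-neighbour dictionary induction after mapping source units
    into the target embedding space."""
    mapped = src_emb @ W.T
    mapped /= np.linalg.norm(mapped, axis=1, keepdims=True)
    tgt = tgt_emb / np.linalg.norm(tgt_emb, axis=1, keepdims=True)
    sims = mapped @ tgt.T
    return np.argsort(-sims, axis=1)[:, :k]    # top-k target indices per source unit
```

In the morpheme-level setting described above, `src_emb` would hold embeddings of Uyghur morphemes produced by the morphological analyzer and `tgt_emb` the Chinese-side embeddings; alternating the two functions gives the usual self-learning refinement loop.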
Sparse Subspace Clustering (SSC) is integral to image processing, drawing from spectral clustering foundations. However, prevalent methods, relying on an ℓ1-norm constraint, fail to capture nuanced inter-region correlations, affecting segmentation efficacy. To remedy this, we introduce an Adaptive Gaussian Regularization Constrained SSC for enhanced image segmentation. This method begins with superpixel preprocessing to enrich local information. Given the Gaussian nature of the SSC’s sparse coefficient matrix, a Gaussian probability density function is infused as a regularization term, reinforcing regional image ties and facilitating similarity matrix creation. Using spectral clustering, we then define superpixel clusters leading to the final segmentation. When tested against the BSDS500 and SBD datasets and other leading algorithms, our model showcases marked improvements in natural image segmentation.
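A minimal sketch of the overall pipeline, assuming superpixel-level feature vectors as input: self-expressive sparse coding with an ℓ1 (Lasso) penalty, an illustrative Gaussian reweighting standing in for the paper's Gaussian-PDF regularization term (whose exact form the abstract does not give), and spectral clustering of the resulting affinity.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.cluster import SpectralClustering

def ssc_segment(features, n_clusters, alpha=0.01, sigma=1.0):
    """Sparse-subspace clustering of superpixel features.

    features: (n_superpixels, dim) array, e.g. mean colour/texture per superpixel.
    Each sample is expressed as a sparse combination of the others (l1 / Lasso);
    the Gaussian weighting of the coefficients below is illustrative only.
    """
    n = features.shape[0]
    C = np.zeros((n, n))
    for i in range(n):
        idx = [j for j in range(n) if j != i]
        lasso = Lasso(alpha=alpha, max_iter=5000)
        lasso.fit(features[idx].T, features[i])     # self-expressive coding
        C[i, idx] = lasso.coef_
    # Gaussian reweighting of coefficients (zero-mean assumption on C).
    C = np.abs(C) * np.exp(-(C ** 2) / (2 * sigma ** 2))
    W = C + C.T                                     # symmetric, non-negative affinity
    return SpectralClustering(n_clusters=n_clusters,
                              affinity="precomputed").fit_predict(W)
```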
ISBN (digital): 9798350390155
ISBN (print): 9798350390162
In this paper, we propose a novel loss by integrating a deep clustering (DC) loss at the frame level and a speaker recognition loss at the segment level into a single network, without additional data requirements or exhaustive computation. The DC loss implicitly generates soft pseudo-phoneme labels for each frame-level feature, which facilitates extracting more discriminative speaker representations by suppressing phonetic content information. We study the DC loss not only on the acoustic feature but also on features extracted by pre-trained models such as wav2vec 2.0, HuBERT, and WavLM. Experimental results on the VoxCeleb dataset show that systems based on pre-trained model features outperform the one based on the acoustic feature. The proposed loss is significantly effective for systems based on the acoustic feature and yields a marginal improvement for systems based on pre-trained model features.
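A hedged sketch of how such a combined objective can be assembled: the classic deep-clustering affinity loss in its memory-efficient form, with soft assignments to a bank of learnable centroids acting as the pseudo-phoneme labels. The centroid bank, `lambda_dc`, and the AAM-softmax head in the usage comment are assumptions, not details from the paper.

```python
import torch
import torch.nn.functional as F

def deep_clustering_loss(frame_emb, assignments):
    """Deep-clustering objective ||V V^T - A A^T||_F^2 in its memory-efficient
    form, with soft assignments A acting as pseudo-labels.

    frame_emb:   (T, D) frame-level embeddings V.
    assignments: (T, K) soft pseudo-phoneme assignments A (rows sum to 1).
    """
    V = F.normalize(frame_emb, dim=1)
    A = assignments
    return ((V.t() @ V) ** 2).sum() \
         - 2 * ((V.t() @ A) ** 2).sum() \
         + ((A.t() @ A) ** 2).sum()

# One way to combine it with a segment-level speaker loss (lambda_dc, the
# centroid bank and the AAM-softmax head are illustrative placeholders):
# A = torch.softmax(frame_emb @ centroids.t(), dim=1)          # (T, K) soft labels
# loss = aam_softmax(segment_emb, spk_label) + lambda_dc * deep_clustering_loss(frame_emb, A)
```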
Deep neural network models based on the x-vector have become the most popular framework for speaker recognition, and the quality of speaker features (embeddings) is important for open-set tasks such as speaker verification and speaker diarization. Currently, the most popular loss functions are based on margin penalties; however, they only consider enlarging inter-class distances while neglecting to reduce intra-class feature differences. Therefore, we propose a multi-view learning approach that divides the training process into two views at the speaker embedding level. The classification view focuses on distinguishing different speakers, while the clustering view focuses on shrinking the feature boundaries of the same speaker, making intra-class differences smaller. The combined effect of the two views achieves large inter-class distances and small intra-class distances, resulting in more discriminative and stable speaker embeddings. We test the method on both speaker verification and speaker diarization tasks, and the results demonstrate the effectiveness of our approach.
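One way to realize the two views, sketched under assumptions: a cross-entropy/margin classification term for inter-class separation plus a center-loss-style clustering term that pulls embeddings toward their speaker centroid. The paper's exact clustering-view loss is not specified in the abstract, so the center loss here is only a stand-in; `classifier` and `lambda_c` are placeholders.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CenterLoss(nn.Module):
    """Clustering-view term: pull each embedding toward its speaker centroid,
    shrinking intra-class spread."""
    def __init__(self, num_speakers, emb_dim):
        super().__init__()
        self.centers = nn.Parameter(torch.randn(num_speakers, emb_dim))

    def forward(self, emb, labels):
        # Mean squared distance between each embedding and its class center.
        return ((emb - self.centers[labels]) ** 2).sum(dim=1).mean()

# Combined objective: classification view + lambda_c * clustering view.
# logits = classifier(emb)                                  # (batch, num_speakers)
# loss = F.cross_entropy(logits, labels) + lambda_c * center_loss(emb, labels)
```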
We propose a deep neural network with spectrogram matching and mutual attention (SMMA-Net) for audio clue-based target speaker extraction (TSE). To effectively use the auxiliary speech, we propose a spectrogram matching (SM) strategy and a mutual attention (MA) block. We conducted all experiments on the WSJ0-2mix-extr dataset. The ablation and comparison studies verified the effectiveness of the SM strategy and the MA block. The experimental results show that our proposed method outperforms state-of-the-art methods by a sizable margin of 1.3 dB in scale-invariant signal-to-distortion ratio improvement. Additionally, under a similar architecture, SMMA-Net's performance on the TSE task exceeds its performance on the speaker separation task. The main code will be available at https://***/Ht-Xu/SMMA-Net.
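The abstract does not detail the MA block, so the following is a generic stand-in: bidirectional cross-attention between mixture and enrollment features followed by a simple fusion layer, written with standard PyTorch modules. The dimensions and the fusion scheme are assumptions for illustration only.

```python
import torch
import torch.nn as nn

class MutualAttentionBlock(nn.Module):
    """Bidirectional cross-attention between mixture and enrollment features,
    a generic sketch of a mutual-attention block."""
    def __init__(self, dim, heads=4):
        super().__init__()
        self.mix_to_aux = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.aux_to_mix = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.fuse = nn.Linear(2 * dim, dim)

    def forward(self, mix, aux):
        # mix: (B, T_mix, D) mixture features; aux: (B, T_aux, D) enrollment features.
        mix_att, _ = self.mix_to_aux(mix, aux, aux)   # mixture queries enrollment
        aux_att, _ = self.aux_to_mix(aux, mix, mix)   # enrollment queries mixture
        aux_summary = aux_att.mean(dim=1, keepdim=True).expand_as(mix)
        return self.fuse(torch.cat([mix_att, aux_summary], dim=-1))

# block = MutualAttentionBlock(dim=256)
# out = block(torch.randn(2, 200, 256), torch.randn(2, 80, 256))  # -> (2, 200, 256)
```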
In the field of spectral analysis, the common Raman spectral feature selection model can extract features effectively, but it will change the original data *** teacher model assists the student model in distillation t...
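The truncated abstract only indicates that a teacher model assists a student model through distillation; for context, the sketch below is the standard knowledge-distillation objective (soft-target KL term at temperature T plus hard-label cross-entropy), not the paper's Raman-spectrum-specific formulation.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
    """Standard knowledge-distillation loss: temperature-scaled KL divergence
    to the teacher's soft targets plus the usual hard-label cross-entropy."""
    soft = F.kl_div(F.log_softmax(student_logits / T, dim=1),
                    F.softmax(teacher_logits / T, dim=1),
                    reduction="batchmean") * (T * T)
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard
```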