Efficient speaker segmentation and clustering method based on the improved spectral clustering is proposed in this paper. Traditional speaker segmentation and clustering is performed by the hierarchical clustering alg...
详细信息
ISBN:
(纸本)9781457716232
Efficient speaker segmentation and clustering method based on the improved spectral clustering is proposed in this paper. Traditional speaker segmentation and clustering is performed by the hierarchical clustering algorithms with Bayesian information criterion (BIC) metric and cross likelihood ratio (CLR) metric after the speakers are segmented. Since this method has high computational complexity and may result in a suboptimal solution, we use spectral clustering to overcome this problem and improve the performance of clustering algorithm. First the affinity matrix is constructed with the mean supervector feature transformed by KL kernel mapping. And then the scaling parameter is selected adaptively. The experiments performed on the NIST 1998 multi-speaker corpus show that the proposed method outperforms the baseline system.
We report results on speaker diarization of French broadcast news and talk shows on current affairs. This speaker diarization process is a multistage segmentation and clustering system. One of the stages is agglomerat...
详细信息
ISBN:
(纸本)9781424414833
We report results on speaker diarization of French broadcast news and talk shows on current affairs. This speaker diarization process is a multistage segmentation and clustering system. One of the stages is agglomerative clustering using state-of-the-art speaker identification methods (SID). For the GMMs used in this stage, we tried many different feature parameters, including MFCCs, Gaussianized MFCCs, Gaussianized MFCCs with cepstral mean subtraction, and Gaussianized MFCCs with cepstral mean substraction containing only frames with high energy. We found that this last set of feature parameters gave the best results. Compared to Gaussianized MfCCs, these features reduced the diarization error rate (DER) by 12% on a development set and by 19% on a test set. We also combined clusters resulting from Gaussianized and non-Gaussianized feature sets. This cluster combination resulted in another 4% reduction in DER for both the development and the test sets. The best DER we have achieved is 15.4% on the development set, and 14.5% on the test set.
When performing speaker diarization on recordings from meetings, multiple microphones of different qualities are usually available and distributed around the meeting room. Although several approaches have been propose...
详细信息
When performing speaker diarization on recordings from meetings, multiple microphones of different qualities are usually available and distributed around the meeting room. Although several approaches have been proposed in recent years to take advantage of multiple microphones, they are either too computationally expensive and not easily scalable or they cannot outperform the simpler case of using the best single microphone. In this paper, the use of classic acoustic beamforming techniques is proposed together with several novel algorithms to create a complete frontend for speaker diarization in the meeting room domain. New techniques we are presenting include blind reference-channel selection, two-step time delay of arrival (TDOA) Viterbi postprocessing, and a dynamic output signal weighting algorithm, together with using such TDOA values in the diarization to complement the acoustic information. Tests on speaker diarization show a 25% relative improvement on the test set compared to using a single most centrally located microphone. Additional experimental results show improvements using these techniques in a speech recognition task.
We report results on speaker diarization of telephone conversations. This speaker diarization process is similar to the multistage segmentation and clustering system used in broadcast news. It consists of an initial a...
详细信息
We report results on speaker diarization of telephone conversations. This speaker diarization process is similar to the multistage segmentation and clustering system used in broadcast news. It consists of an initial acoustic change point detection algorithm, iterative Viterbi re-segmentation, gender labeling, agglomerative clustering using a Bayesian information criterion (BIC), followed by agglomerative clustering using state-of-the-art speaker identification (SID) methods and Viterbi re-segmentation using Gaussian mixture models (GMMs). We repeat these multistage segmentation and clustering steps twice: once with mel-frequency cepstral coefficients (MFCCs) as feature parameters for the GMMs used in gender labeling, SID, and Viterbi re-segmentation steps and another time with Gaussianized MFCCs as feature parameters for the GMMs used in these three steps. The resulting clusters from the parallel runs are combined in a novel way that leads to a significant reduction in the diarization error rate (DER). On a development set containing 30 telephone conversations, this combination step reduced the DER by 20 %. On another test set containing 30 telephone conversations, this step reduced the DER by 13%. The best error rate we have achieved is 6.7 % on the development set and 9.0 % on the test set.
Accurate modeling of speaker clusters is important in the task of speaker diarization. Creating accurate models involves both selection of the model complexity and optimum training given the data. Using models with fi...
详细信息
ISBN:
(纸本)1424407281
Accurate modeling of speaker clusters is important in the task of speaker diarization. Creating accurate models involves both selection of the model complexity and optimum training given the data. Using models with fixed complexity and trained using the standard EM algorithm poses a risk of overfitting, which can lead to a reduction in diarization performance. In this paper a technique proposed by the author to estimate the complexity of a model is combined with a novel training algorithm called "Cross-Validation EM" to control the number of training iterations. This combination leads to more robust speaker modeling and results in an increase in speaker diarization performance. Tests on the NIST RT (MDM) datasets for meetings show a relative improvement of 10.6% relative on the test set.
We report results on speaker diarization of telephone conversations. This speaker diarization process is similar to the multistage segmentation and clustering system used in broadcast news. It consists of an initial a...
详细信息
ISBN:
(纸本)9781424417452
We report results on speaker diarization of telephone conversations. This speaker diarization process is similar to the multistage segmentation and clustering system used in broadcast news. It consists of an initial acoustic change point detection algorithm, iterative Viterbi re-segmentation, gender labeling, agglomerative clustering using a Bayesian information criterion (BIC), followed by agglomerative clustering using stateof-the-art speaker identification methods (SID) and Viterbi resegmentation using Gaussian mixture models (GMMs). The Viterbi re-segmentation using GNMs is new, and it reduces the diarization error rate (DER) by 10%. We repeat these multistage segmentation and clustering steps twice: once with MFCCs as feature parameters for the GMMs used in gender labeling, SID and Viterbi re-segmentation steps, and another time with Gaussianized MFCCs as feature parameters for the GMMs used in these three steps. The resulting clusters from the parallel runs are combined in a novel way that leads to a significant reduction in the DER. On a development set containing 30 telephone conversations, this combination step reduced the DER by 20%. On another test set containing 30 telephone conversations, this step reduced the DER by 13%. The best error rate we have achieved is 6.7% on the development set, and 9.0% on the test set.
This paper shows research performed into the topic of speaker diarization for multi-speaker environment. It looks into the algorithms and the implementation of an off-line speakersegmentation and indexing system for ...
详细信息
ISBN:
(纸本)9781424418343
This paper shows research performed into the topic of speaker diarization for multi-speaker environment. It looks into the algorithms and the implementation of an off-line speakersegmentation and indexing system for recorded speech data where usually more than one speaker is present. speaker diarization is a well studied topic in the domain of broadcast news recordings. Most of the proposed systems involve hierarchical clustering of the data, where the number of speakers and their identities are known a priori. speaker diarization is the task of assigning a unique label to all speech segments in an audio stream by the same speaker. There are two key challenges: processing speed and robustness in the presence of noise. In this paper we address the robustness issue by using a method already successful in speech recognition application. Using ANS (Autocorrelation-Based Noise Subtraction) for robust genetic algorithm-based speaker diarization, we compare the results with the baseline MFCC-based system in clean and noisy conditions.
Audio diarization is the process of annotating an input audio channel with information that attributes (possibly overlapping) temporal regions of signal energy to their specific sources. These sources can include part...
详细信息
Audio diarization is the process of annotating an input audio channel with information that attributes (possibly overlapping) temporal regions of signal energy to their specific sources. These sources can include particular speakers, music, background noise sources, and other signal source/channel characteristics. Diarization can be used for helping speech recognition, facilitating the searching and indexing of audio archives, and increasing the richness of automatic transcriptions, making them more readable. In this paper, we provide an overview of the approaches currently used in a key area of audio diarization, namely speaker diarization, and discuss their relative merits and limitations. Performances using the different techniques are compared within the framework of the speaker diarization task in the DARPA EARS Rich Transcription evaluations. We also look at how the techniques are being introduced into real broadcast news systems and their portability to other domains and tasks such as meetings and speaker verification.
This paper describes recent advances in speaker diarization with a multistage segmentation and clustering system, which incorporates a speaker identification step. This system builds upon the baseline audio partitione...
详细信息
This paper describes recent advances in speaker diarization with a multistage segmentation and clustering system, which incorporates a speaker identification step. This system builds upon the baseline audio partitioner used in the LIMSI broadcast news transcription system. The baseline partitioner provides a high cluster purity, but has a tendency to split data from speakers with a large quantity of data into several segment clusters. Several improvements to the baseline system have been made. First, the iterative Gaussian mixture model (GNM) clustering has been replaced by a Bayesian information criterion (BIC) agglomerative clustering. Second, an additional clustering stage has been added, using a GMM-based speaker identification method. Finally, a post-processing stage refines the segment boundaries using the output of a transcription system. On the National Institute of Standards and Technology (NIST) RT-04F and ESTER evaluation data, the multistage system reduces the speaker error by over 70% relative to the baseline system, and gives'between 40% and 50% reduction relative to a single-stage BIC clustering system.
This paper summarizes the collaboration of the LIA and CLIPS laboratories on speaker diarization of broadcast news during the spring NIST Rich Transcription 2003 evaluation campaign (NIST-RT'03S). The speaker diar...
详细信息
This paper summarizes the collaboration of the LIA and CLIPS laboratories on speaker diarization of broadcast news during the spring NIST Rich Transcription 2003 evaluation campaign (NIST-RT'03S). The speaker diarization task consists of segmenting a conversation into homogeneous segments which are then grouped into speaker classes. Two approaches are described and compared for speaker diarization. The first one relies on a classical two-step speaker diarization strategy based on a detection of speaker turns followed by a clustering process, while the second one uses an integrated strategy where both segment boundaries and speaker tying of the segments are extracted simultaneously and challenged during the whole process. These two methods are used to investigate various strategies for the fusion of diarization results. Furthermore, segmentation into acoustic macro-classes is proposed and evaluated as a priori step to speaker diarization. The objective is to take advantage of the a priori acoustic information in the diarization process. Along with enriching the resulting segmentation with information about speaker gender, channel quality or background sound, this approach brings gains in speaker diarization performance thanks to the diversity of acoustic conditions found in broadcast news. The last part of this paper describes some ongoing works carried out by the CLIPS and LIA laboratories and presents some results obtained since 2002 on speaker diarization for various corpora. (c) 2005 Elsevier Ltd. All rights reserved.
暂无评论