Conventional audio-visual approaches for active speaker detection (ASD) typically rely on visually pre-extracted face tracks and the corresponding single-channel audio to find the speaker in a video. Therefore, they tend to fail every time the face of the speaker is not visible. We demonstrate that a simple audio convolutional recurrent neural network (CRNN) trained with spatial input features extracted from multichannel audio can perform simultaneous horizontal active speaker detection and localization (ASDL), independently of the visual modality. To address the time and cost of generating ground-truth labels to train such a system, we propose a new self-supervised training pipeline that embraces a "student-teacher" learning approach. A conventional pre-trained active speaker detector is adopted as a "teacher" network to provide the position of the speakers as pseudo-labels. The multichannel audio "student" network is trained to generate the same results. At inference, the student network can generalize and also locate occluded speakers that the teacher network is not able to detect visually, yielding considerable improvements in recall rate. Experiments on the TragicTalkers dataset show that an audio network trained with the proposed self-supervised learning approach can exceed the performance of typical audio-visual methods and produce results competitive with the costly conventional supervised training. We demonstrate that improvements can be achieved when minimal manual supervision is introduced in the learning pipeline. Further gains may be sought with larger training sets and by integrating vision with the multichannel audio system.
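The abstract outlines a "student-teacher" pipeline in which a pre-trained visual active speaker detector supplies horizontal-position pseudo-labels that supervise a multichannel audio CRNN. The PyTorch sketch below illustrates that idea under stated assumptions: the layer sizes, the azimuth-bin target encoding, and names such as AudioCRNN and train_step are illustrative, not the authors' implementation.

```python
# Minimal sketch of the vision-teaches-audio pseudo-labelling idea.
# All names and hyperparameters are illustrative assumptions.
import torch
import torch.nn as nn

class AudioCRNN(nn.Module):
    """Convolutional recurrent 'student' mapping multichannel spatial audio
    features to a per-frame horizontal (azimuth-bin) speaker activity map."""
    def __init__(self, n_feat_channels: int, n_azimuth_bins: int = 64):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(n_feat_channels, 32, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d((1, 4)),                 # pool along the frequency axis
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d((None, 1)),      # keep the time axis, collapse frequency
        )
        self.gru = nn.GRU(64, 128, batch_first=True, bidirectional=True)
        self.head = nn.Linear(256, n_azimuth_bins)

    def forward(self, feats):                     # feats: (B, C, T, F)
        x = self.conv(feats).squeeze(-1)          # (B, 64, T)
        x, _ = self.gru(x.transpose(1, 2))        # (B, T, 256)
        return self.head(x)                       # (B, T, n_azimuth_bins) logits

def train_step(student, optimizer, feats, teacher_pseudo_labels):
    """One self-supervised step: the visual teacher's detections, converted to
    per-frame azimuth-bin targets in [0, 1], supervise the audio student."""
    logits = student(feats)
    loss = nn.functional.binary_cross_entropy_with_logits(
        logits, teacher_pseudo_labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

At inference only the audio student would run, which is why it can still respond to speakers whose faces the visual teacher cannot see.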
ISBN: (print) 9798350323726
This study considers the problem of detecting and locating an active talker's horizontal position from multichannel audio captured by a microphone array. We refer to this as active speaker detection and localization (ASDL). Our goal was to investigate the performance of spatial acoustic features extracted from the multichannel audio as the input of a convolutional recurrent neural network (CRNN), in relation to the number of channels employed and additive noise. To this end, experiments were conducted to compare the generalized cross-correlation with phase transform (GCC-PHAT), the spatial cue-augmented log-spectrogram (SALSA) features, and a recently proposed beamforming method, evaluating their robustness to various noise intensities. The array aperture and sampling density were tested by taking subsets from the 16-microphone array. Results and tests of statistical significance demonstrate the contribution of the microphones to performance on the TragicTalkers dataset, which offers opportunities to investigate audio-visual approaches in the future.
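GCC-PHAT, one of the compared input features, keeps only the phase of the cross-power spectrum between a microphone pair before the inverse transform, which makes the correlation peak at the true time-difference-of-arrival largely independent of the source spectrum. Below is a minimal NumPy sketch of per-frame GCC-PHAT extraction; the frame size, hop, lag range, and the pairwise stacking shown in the comment are illustrative assumptions, not the paper's exact feature configuration.

```python
# Hedged sketch of per-frame GCC-PHAT for one microphone pair.
import numpy as np

def gcc_phat(sig_a, sig_b, n_fft=1024, hop=512, max_lag=32):
    """Return a (n_frames, 2*max_lag+1) matrix of GCC-PHAT values per frame."""
    n_frames = 1 + (len(sig_a) - n_fft) // hop
    window = np.hanning(n_fft)
    out = np.zeros((n_frames, 2 * max_lag + 1))
    for i in range(n_frames):
        fa = np.fft.rfft(window * sig_a[i * hop : i * hop + n_fft])
        fb = np.fft.rfft(window * sig_b[i * hop : i * hop + n_fft])
        cross = fa * np.conj(fb)
        cross /= np.abs(cross) + 1e-12        # PHAT weighting: keep phase only
        cc = np.fft.irfft(cross, n=n_fft)
        # centre the correlation on zero lag and keep +/- max_lag samples
        out[i] = np.concatenate((cc[-max_lag:], cc[: max_lag + 1]))
    return out

# Illustrative usage: stack the GCC-PHAT of every microphone pair of a
# 16-channel recording (audio: (16, n_samples) array) as CRNN input channels.
# feats = np.stack([gcc_phat(audio[i], audio[j])
#                   for i in range(16) for j in range(i + 1, 16)])
```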
This paper presents a self-supervised method for visual detection of the active speaker in a multi-person spoken-interaction scenario. Active speaker detection is a fundamental prerequisite for any artificial cognitive system attempting to acquire language in social settings. The proposed method is intended to complement the acoustic detection of the active speaker, thus improving the system's robustness in noisy conditions. The method can detect an arbitrary number of possibly overlapping active speakers based exclusively on visual information about their faces. Furthermore, the method does not rely on external annotations, thus complying with cognitive development. Instead, the method uses information from the auditory modality to support learning in the visual domain. This paper reports an extensive evaluation of the proposed method using a large multi-person face-to-face interaction dataset. The results show good performance in a speaker-dependent setting. However, in a speaker-independent setting the proposed method yields a significantly lower performance. We believe that the proposed method represents an essential component of any artificial cognitive system or robotic platform engaging in social interactions.
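Here the direction of supervision is reversed relative to the first abstract: audio supports learning in the visual domain. One way to read that, sketched below under loud assumptions, is to let an acoustic voice-activity signal act as the pseudo-label for a per-face visual classifier; the network, the audio_vad input, and every other name are hypothetical and not the paper's actual method.

```python
# Hedged sketch: audio voice activity as pseudo-labels for a visual classifier.
import torch
import torch.nn as nn

class FaceSpeakingClassifier(nn.Module):
    """Tiny CNN predicting whether a cropped face is currently speaking."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(32, 1),
        )

    def forward(self, face_crops):                # (B, 3, H, W)
        return self.net(face_crops).squeeze(-1)   # (B,) logits

def self_supervised_step(model, optimizer, face_crops, audio_vad):
    """audio_vad: (B,) 0/1 pseudo-labels from an acoustic voice-activity
    detector, used instead of manual annotations to train the visual model."""
    logits = model(face_crops)
    loss = nn.functional.binary_cross_entropy_with_logits(
        logits, audio_vad.float())
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```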