Conventional audio-visual approaches for active speaker detection (ASD) typically rely on visually pre-extracted face tracks and the corresponding single-channel audio to find the speaker in a video. Therefore, they tend to fail every time the face of the speaker is not visible. We demonstrate that a simple audio convolutional recurrent neural network (CRNN) trained with spatial input features extracted from multichannel audio can perform simultaneous horizontal active speaker detection and localization (ASDL), independently of the visual modality. To address the time and cost of generating ground-truth labels to train such a system, we propose a new self-supervised training pipeline that embraces a "student-teacher" learning approach. A conventional pre-trained active speaker detector is adopted as a "teacher" network to provide the position of the speakers as pseudo-labels. The multichannel audio "student" network is trained to generate the same results. At inference, the student network can generalize and also locate occluded speakers that the teacher network is not able to detect visually, yielding considerable improvements in recall rate. Experiments on the TragicTalkers dataset show that an audio network trained with the proposed self-supervised learning approach can exceed the performance of typical audio-visual methods and produce results competitive with the costly conventional supervised training. We demonstrate that improvements can be achieved when minimal manual supervision is introduced in the learning pipeline. Further gains may be sought with larger training sets and by integrating vision with the multichannel audio system.
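The abstract outlines a "student-teacher" pipeline in which a pre-trained visual active speaker detector supplies horizontal-position pseudo-labels that supervise a multichannel audio CRNN. The PyTorch sketch below illustrates that idea under stated assumptions: the layer sizes, the azimuth-bin target encoding, and names such as AudioCRNN and train_step are illustrative, not the authors' implementation.

```python
# Minimal sketch of the vision-teaches-audio pseudo-labelling idea.
# All names and hyperparameters are illustrative assumptions.
import torch
import torch.nn as nn

class AudioCRNN(nn.Module):
    """Convolutional recurrent 'student' mapping multichannel spatial audio
    features to a per-frame horizontal (azimuth-bin) speaker activity map."""
    def __init__(self, n_feat_channels: int, n_azimuth_bins: int = 64):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(n_feat_channels, 32, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d((1, 4)),                 # pool along the frequency axis
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d((None, 1)),      # keep the time axis, collapse frequency
        )
        self.gru = nn.GRU(64, 128, batch_first=True, bidirectional=True)
        self.head = nn.Linear(256, n_azimuth_bins)

    def forward(self, feats):                     # feats: (B, C, T, F)
        x = self.conv(feats).squeeze(-1)          # (B, 64, T)
        x, _ = self.gru(x.transpose(1, 2))        # (B, T, 256)
        return self.head(x)                       # (B, T, n_azimuth_bins) logits

def train_step(student, optimizer, feats, teacher_pseudo_labels):
    """One self-supervised step: the visual teacher's detections, converted to
    per-frame azimuth-bin targets in [0, 1], supervise the audio student."""
    logits = student(feats)
    loss = nn.functional.binary_cross_entropy_with_logits(
        logits, teacher_pseudo_labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

At inference only the audio student would run, which is why it can still respond to speakers whose faces the visual teacher cannot see.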
ISBN: (print) 9798350323726
This study considers the problem of detecting and locating an active talker's horizontal position from multichannel audio captured by a microphone array. We refer to this as active speaker detection and localization (ASDL). Our goal was to investigate the performance of spatial acoustic features extracted from the multichannel audio as the input of a convolutional recurrent neural network (CRNN), in relation to the number of channels employed and additive noise. To this end, experiments were conducted to compare the generalized cross-correlation with phase transform (GCC-PHAT), the spatial cue-augmented log-spectrogram (SALSA) features, and a recently proposed beamforming method, evaluating their robustness to various noise intensities. The array aperture and sampling density were tested by taking subsets from the 16-microphone array. Results and tests of statistical significance demonstrate the contribution of the microphones to performance on the TragicTalkers dataset, which offers opportunities to investigate audio-visual approaches in the future.
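GCC-PHAT, one of the compared input features, keeps only the phase of the cross-power spectrum between a microphone pair before the inverse transform, which makes the correlation peak at the true time-difference-of-arrival largely independent of the source spectrum. Below is a minimal NumPy sketch of per-frame GCC-PHAT extraction; the frame size, hop, lag range, and the pairwise stacking shown in the comment are illustrative assumptions, not the paper's exact feature configuration.

```python
# Hedged sketch of per-frame GCC-PHAT for one microphone pair.
import numpy as np

def gcc_phat(sig_a, sig_b, n_fft=1024, hop=512, max_lag=32):
    """Return a (n_frames, 2*max_lag+1) matrix of GCC-PHAT values per frame."""
    n_frames = 1 + (len(sig_a) - n_fft) // hop
    window = np.hanning(n_fft)
    out = np.zeros((n_frames, 2 * max_lag + 1))
    for i in range(n_frames):
        fa = np.fft.rfft(window * sig_a[i * hop : i * hop + n_fft])
        fb = np.fft.rfft(window * sig_b[i * hop : i * hop + n_fft])
        cross = fa * np.conj(fb)
        cross /= np.abs(cross) + 1e-12        # PHAT weighting: keep phase only
        cc = np.fft.irfft(cross, n=n_fft)
        # centre the correlation on zero lag and keep +/- max_lag samples
        out[i] = np.concatenate((cc[-max_lag:], cc[: max_lag + 1]))
    return out

# Illustrative usage: stack the GCC-PHAT of every microphone pair of a
# 16-channel recording (audio: (16, n_samples) array) as CRNN input channels.
# feats = np.stack([gcc_phat(audio[i], audio[j])
#                   for i in range(16) for j in range(i + 1, 16)])
```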
This paper presents a self-supervised method for visual detection of the active speaker in a multi-person spoken-interaction scenario. Active speaker detection is a fundamental prerequisite for any artificial cognitive system attempting to acquire language in social settings. The proposed method is intended to complement the acoustic detection of the active speaker, thus improving the system's robustness in noisy conditions. The method can detect an arbitrary number of possibly overlapping active speakers based exclusively on visual information about their faces. Furthermore, the method does not rely on external annotations, thus complying with cognitive development. Instead, the method uses information from the auditory modality to support learning in the visual domain. This paper reports an extensive evaluation of the proposed method using a large multi-person face-to-face interaction dataset. The results show good performance in a speaker-dependent setting. However, in a speaker-independent setting the proposed method yields a significantly lower performance. We believe that the proposed method represents an essential component of any artificial cognitive system or robotic platform engaging in social interactions.
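Here the direction of supervision is reversed relative to the first abstract: audio supports learning in the visual domain. One way to read that, sketched below under loud assumptions, is to let an acoustic voice-activity signal act as the pseudo-label for a per-face visual classifier; the network, the audio_vad input, and every other name are hypothetical and not the paper's actual method.

```python
# Hedged sketch: audio voice activity as pseudo-labels for a visual classifier.
import torch
import torch.nn as nn

class FaceSpeakingClassifier(nn.Module):
    """Tiny CNN predicting whether a cropped face is currently speaking."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(32, 1),
        )

    def forward(self, face_crops):                # (B, 3, H, W)
        return self.net(face_crops).squeeze(-1)   # (B,) logits

def self_supervised_step(model, optimizer, face_crops, audio_vad):
    """audio_vad: (B,) 0/1 pseudo-labels from an acoustic voice-activity
    detector, used instead of manual annotations to train the visual model."""
    logits = model(face_crops)
    loss = nn.functional.binary_cross_entropy_with_logits(
        logits, audio_vad.float())
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```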