Search Results - Inner Mongolia University Library

IEEE International Conference on Multimedia and Expo (ICME)

Authors: Wang, Jinxin; Guo, Zhongwen; Yang, Chao; Li, Xiaomei; Cui, Ziyuan. Affiliations: Ocean Univ China, Fac Informat Sci & Engn, Qingdao, Peoples R China; Univ Technol Sydney, Sch Comp Sci, Sydney, Australia

ISBN: (print) 9781665468916

Compared to feature or decision fusion, hybrid fusion can improve audio-visual speech recognition accuracy. Existing works mainly focus on designing the multi-modality feature extraction process, interaction, and prediction, neglecting useful multi-modality information and the optimal combination of different predicted results. In this paper, we propose a multi-scale hybrid fusion network (MSHF) for Mandarin audio-visual speech recognition. Our MSHF consists of a feature extraction subnetwork, which uses the proposed multi-scale feature extraction module (MSFE) to obtain multi-scale features, and a hybrid fusion subnetwork, which integrates the intrinsic correlation of different modality information and optimizes the weights of the prediction results for different modalities to achieve the best classification. We further design a feature recognition module (FRM) for accurate audio-visual speech recognition. We conducted experiments on the CAS-VSR-W1k dataset. The experimental results show that the proposed method outperforms the selected competitive baselines and the state-of-the-art, indicating the superiority of our proposed modules.
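The idea of weighting per-modality predictions can be illustrated with a minimal sketch. This is not the MSHF network: it assumes each modality already outputs class probabilities, and it picks a single scalar weight by grid search on held-out data. All function names and toy values are hypothetical.

```python
def fuse(p_audio, p_visual, w):
    """Convex combination of two class-probability vectors (w weights audio)."""
    return [w * a + (1.0 - w) * v for a, v in zip(p_audio, p_visual)]


def accuracy(audio_probs, visual_probs, labels, w):
    """Fraction of samples whose fused arg-max matches the label."""
    correct = 0
    for pa, pv, y in zip(audio_probs, visual_probs, labels):
        fused = fuse(pa, pv, w)
        pred = max(range(len(fused)), key=fused.__getitem__)
        correct += (pred == y)
    return correct / len(labels)


def best_weight(audio_probs, visual_probs, labels, steps=10):
    """Grid-search the audio weight in [0, 1]; ties keep the smallest weight."""
    candidates = [i / steps for i in range(steps + 1)]
    return max(candidates,
               key=lambda w: accuracy(audio_probs, visual_probs, labels, w))


# Toy validation data (made up for illustration): 3 samples, 2 classes.
audio_probs = [[0.9, 0.1], [0.6, 0.4], [0.3, 0.7]]
visual_probs = [[0.2, 0.8], [0.1, 0.9], [0.4, 0.6]]
labels = [0, 1, 1]

w = best_weight(audio_probs, visual_probs, labels)
print(w, accuracy(audio_probs, visual_probs, labels, w))
```

On this toy data neither modality alone classifies all three samples correctly, while an intermediate weight does, which is the motivation for tuning the combination rather than fixing it.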

Keywords: audio-visual recognition deep learning multi-modality feature extraction

An End-to-End Mandarin Audio-Visual Speech Recognition Model with a Feature Enhancement Module

2023 IEEE International Conference on Systems, Man, and Cybernetics, SMC 2023

Authors: Wang, Jinxin; Yang, Chao; Guo, Zhongwen; Li, Xiaomei; Wang, Weigang. Affiliations: Ocean University of China, Faculty of Information Science and Engineering, Qingdao, China; School of Computer Science, University of Technology Sydney, Sydney, Australia

ISBN: (print) 9798350337020

Compared to relying only on audio information, incorporating visual information improves speech recognition accuracy in noisy environments. Existing works tend to design specific architectures for feature extraction, neglecting feature enhancement. In this paper, we propose an end-to-end Mandarin audio-visual speech recognition model with a Feature Enhancement Module. Specifically, we design a Feature Enhancement Module (FEM) that uses deconvolution and up-sampling to obtain the twin enhanced data for generating high-resolution feature representations. We further develop the visual Feature Enhancement Module (visual FEM) and audio Feature Enhancement Module (audio FEM) to enhance feature extraction from both visual data and audio data. We incorporate the proposed modules into the blocks of the Residual Network for accurate audio-visual speech recognition. We conducted experiments on the CAS-VSR-W1k and Chinese Mandarin Lip Reading (CMLR) datasets. The experimental results show that the proposed method outperforms the selected competitive baselines and the state-of-the-art, indicating the superiority of our proposed modules. © 2023 IEEE.

Keywords: audio-visual recognition deep learning feature enhancement extraction multi-modality feature extraction

3D Convolutional Neural Networks for Cross Audio-Visual Matching Recognition

IEEE Access, 2017, vol. 5, pp. 22081-22091

Authors: Torfi, Amirsina; Iranmanesh, Seyed Mehdi; Nasrabadi, Nasser; Dawson, Jeremy. Affiliations: West Virginia Univ, Coll Engn & Mineral Resources, Lane Dept Comp Sci & Elect Engn, Morgantown, WV 26506, USA

Audio-visual recognition (AVR) has been considered as a solution for speech recognition tasks when the audio is corrupted, as well as a visual recognition method used for speaker verification in multi-speaker scenarios. The approach of AVR systems is to leverage the extracted information from one modality to improve the recognition ability of the other modality by complementing the missing information. The essential problem is to find the correspondence between the audio and visual streams, which is the goal of this paper. We propose the use of a coupled 3D convolutional neural network (3D CNN) architecture that can map both modalities into a representation space to evaluate the correspondence of audio-visual streams using the learned multimodal features. The proposed architecture incorporates both spatial and temporal information jointly to effectively find the correlation between temporal information for different modalities. By using a relatively small network architecture and a much smaller data set for training, our proposed method surpasses the performance of the existing similar methods for audio-visual matching, which use 3D CNNs for feature representation. We also demonstrate that an effective pair selection method can significantly increase the performance. The proposed method achieves relative improvements of over 20% on the equal error rate and over 7% on the average precision in comparison to the state-of-the-art method.
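To make the reported metric concrete, the sketch below shows one common way to approximate an equal error rate (EER) from matching scores, and what a "relative improvement" on it means. It assumes higher scores indicate a better audio-visual match; the scores and the two EER values in the example are made up and are not results from the paper.

```python
def equal_error_rate(genuine, impostor):
    """Approximate EER by sweeping a threshold over all observed scores.

    genuine:  scores of matching audio-visual pairs
    impostor: scores of non-matching pairs
    Returns the mean of FAR and FRR at the threshold where they are closest.
    """
    best_gap, eer = None, None
    for t in sorted(set(genuine) | set(impostor)):
        far = sum(s >= t for s in impostor) / len(impostor)  # false accepts
        frr = sum(s < t for s in genuine) / len(genuine)     # false rejects
        gap = abs(far - frr)
        if best_gap is None or gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2
    return eer


def relative_improvement(baseline, new):
    """Relative reduction of an error metric with respect to a baseline."""
    return (baseline - new) / baseline


# Hypothetical scores for illustration only.
genuine = [0.9, 0.8, 0.7, 0.4]
impostor = [0.6, 0.3, 0.2, 0.1]
print(equal_error_rate(genuine, impostor))

# An EER dropping from 0.25 to 0.20 is a 20% relative improvement,
# even though the absolute change is only 5 percentage points.
print(relative_improvement(0.25, 0.20))
```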

Keywords: Convolutional networks 3D architecture deep learning audio-visual recognition

Multimodal Emotion Recognition through Deep Fusion of Audio-Visual Data

26th International Conference on Computer and Information Technology, ICCIT 2023

Authors: Sultana, Tamanna; Jahan, Meskat; Uddin, Md. Kamal; Kobayashi, Yoshinori; Hasan, Mahmudul. Affiliations: Comilla University, Department of Computer Science and Engineering, Cumilla, Bangladesh; NSTU, Department of Computer Science & Telecommunication Engineering, Noakhali, Bangladesh; Saitama University, Interactive Systems Lab., Saitama, Japan

ISBN: (print) 9798350359015

The field of emotion recognition in artificial intelligence focuses on enabling machines to comprehend and react to the range of emotions experienced by humans. This paper presents a novel approach that integrates the Convolutional Neural Network (CNN) with audio and visual modalities. The study employs the RAVDESS database as a resource to train two distinct models for the analysis of both video and audio data. For audio pre-processing, advanced signal-processing techniques are applied to extract relevant elements and capture basic acoustic characteristics. A one-dimensional CNN architecture receives the audio data as input, enabling the model to learn complicated patterns and representations from the audio domain. For video pre-processing, sophisticated algorithms are employed to extract essential facial characteristics. In order to capture the changing periods of facial expressions, the video frames are analyzed using a three-dimensional CNN framework after they have been compressed and converted to grayscale. The fusion technique involves concatenating and extending the outputs of the audio and visual models. The fused features are subsequently sent into a softmax layer, which facilitates the development of a resilient emotion identification system. © 2023 IEEE.
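The fusion step described above (concatenate the two models' outputs, then apply a softmax layer) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the feature vectors, weight matrix, and bias are hypothetical placeholders for what trained 1D and 3D CNNs and a trained output layer would provide.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_and_classify(audio_feat, video_feat, weights, bias):
    """Concatenate per-modality features, apply one linear layer, then softmax.

    weights has one row per class; each row has len(audio_feat) + len(video_feat)
    entries.
    """
    fused = list(audio_feat) + list(video_feat)
    logits = [sum(w * x for w, x in zip(row, fused)) + b
              for row, b in zip(weights, bias)]
    return softmax(logits)


# Hypothetical 2-dimensional features per modality and a 2-class output layer.
audio_feat = [1.0, 0.0]
video_feat = [0.0, 2.0]
weights = [[1.0, 0.0, 0.0, 0.0],   # class 0 looks at the first audio feature
           [0.0, 0.0, 0.0, 1.0]]   # class 1 looks at the second video feature
bias = [0.0, 0.0]

probs = fuse_and_classify(audio_feat, video_feat, weights, bias)
print(probs)
```

Because the output layer sees the concatenated vector, each class score can draw on evidence from either modality, which is what distinguishes this feature-level fusion from combining two separate predictions.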

Keywords: audio-visual recognition Convolution Neural Networks Emotion recognition Multi-modal fusion

Methodologies of Audio-Visual Biometric Performance Evaluation for the H2020 SpeechXRays Project

5th International Conference on Advanced Technologies for Signal and Image Processing (ATSIP)

Authors: Mtibaa, Aymen; Hmani, Mohamed Amine; Petrovska-Delacretaz, Dijana; Boudy, Jerome; Ben Hamida, Ahmed; Bauzou, Claude; Crucianu, Iacob; Markopoulos, Ioannis; Spanakis, Emmanouil; Nicolin, Alexandru; Narr, Christian; Kockmann, Marcel; Perez, Javier. Affiliations: Inst Polytech Paris, Telecom SudParis, Paris, France; Sfax Univ, Ecole Natl Ingenieurs Sfax, ATMS, Sfax, Tunisia; Fdn Res & Technol Hellas, Inst Comp Sci, Athens, Greece; Horia Hulubei Natl Inst Phys & Nucl Engn, Magurele, Romania; IDEMIA, Courbevoie, France; SIVECO, Bucharest, Romania; FORTHNET, Athens, Greece; LumenVox, Berlin, Germany

ISBN: (print) 9781728175133

Biometric recognition is nowadays widely used in different services and applications, making user authentication easier and more secure than traditional authentication systems. Starting from this idea, the EU H2020 SpeechXRays project developed and evaluated in real-life environments a user recognition platform based on face and voice modalities. Since the proposed biometric solution was evaluated in real-life environments where the recorded biometric data was not accessible because of the General Data Protection Regulation (GDPR), the ground truth of the conducted evaluation was not available. To correctly report the performance evaluation, some methodologies were proposed to detect the errors caused by the absence of ground truth. This paper describes the biometric solution provided by the project and presents the biometric performance evaluation carried out in three real-life use case pilots on more than 2,000 users.

Keywords: audio-visual recognition performance evaluation

Audio-Visual Recognition System in Compression Domain

IEEE Transactions on Circuits and Systems for Video Technology, 2011, vol. 21, no. 5, pp. 637-646

Authors: Wong, Yee Wan; Seng, Kah Phooi; Ang, Li-Minn. Affiliations: Taylors Univ Lakeside Campus, Selangor 47500 Darul Ehsan, Malaysia; Univ Nottingham Malaysia Campus, Selangor 43500 Darul Ehsan, Malaysia

This paper presents a highly efficient audio-visual recognition system in the compression domain. For face recognition systems, the multiband feature fusion method selects the wavelet subbands that are invariant to illumination and facial expression variations. These subbands are extracted directly from the inverse quantization in the compression system. By taking the inverse quantized wavelet coefficients of the video as the input, the inverse wavelet transform, which corresponds to image reconstruction, is omitted. As a result, the computational complexity of the conventional video-based face recognition system is reduced. We also present a set of new face localization methods to localize the facial wavelet coefficients from the wavelet subband image. The dual optimal multiband feature fusion method is then used to fuse the two sets of wavelet coefficients and generate the visual scores. Experimental results show that with low computational complexity, the proposed system achieves high recognition accuracy on the UNMC-VIER, CUAVE, and XM2VTS audio-visual databases.

Keywords: audio-visual recognition computational complexity face localization face segmentation video-based face recognition wavelet transform

Amazigh Audiovisual Speech Recognition System Design

Intelligent Systems and Computer Vision (ISCV)

Authors: Addarrazi, Ilham; Satori, Hassan; Satori, Khalid. Affiliations: USMBA, FSDM, Dept Math & Comp Sci, Fes, Morocco; UMP, FPN, Dept Math & Comp Sci, Nador, Morocco

ISBN: (print) 9781509040629

It is well known that speech recognition is a multimodal process which uses information not only from audio but also from vision. This paper describes our experience in designing an audio-visual speech recognition system, which relates the acoustic and the visual information in order to improve the noise robustness of automatic speech recognition. The accuracy rate for face and mouth detection using the Viola-Jones approach was satisfactory, reaching 99% and 96.6% for face and mouth detection respectively.

Keywords: audio-visual recognition Automatic Speech recognition lip reading HMM

MANDARIN AUDIO-VISUAL SPEECH RECOGNITION WITH EFFECTS TO THE NOISE AND EMOTION

International Journal of Innovative Computing Information and Control, 2010, vol. 6, no. 2, pp. 711-723

Authors: Pao, Tsang-Long; Liao, Wen-Yuan; Chen, Yu-Te; Wu, Tsan-Nung. Affiliations: Tatung Univ, Dept Comp Sci & Engn, Taipei 104, Taiwan; DeLin Inst Technol, Dept Comp Sci & Informat Engn, Tucheng City 236, Taipei County, Taiwan

This paper presents a Mandarin audio-visual recognition system dealing with noisy and emotional speech signals. In the proposed approach, we extract the visual features of the lips. These features are very important to the recognition system, especially in noisy conditions or with emotional effects. In this recognition system, we propose to use the weighted-discrete KNN (WD-KNN) as the classifier and compare the results with two popular classifiers, the GMM and HMM, and evaluate their performance by applying them to a Mandarin audio-visual speech corpus. The experimental results of different classifiers at various SNR levels are presented. The results show that using the WD-KNN classifier yields better recognition accuracy than the other classifiers for the Mandarin speech corpus used.

Keywords: audio-visual recognition Feature extraction Gaussian mixture model K-nearest neighbour Hidden Markov model Weighted-discrete KNN

Human Audio-Visual Consonant Recognition Analyzed with Three Bimodal Integration Models

10th INTERSPEECH 2009 Conference

Authors: Ma, Zhanyu; Leijon, Arne. Affiliations: KTH Royal Inst Technol, Sound & Image Proc Lab, Stockholm, Sweden

ISBN: (print) 9781615676927

With A-V recordings, ten normal-hearing people took recognition tests at different signal-to-noise ratios (SNR). The AV recognition results are predicted by the fuzzy logical model of perception (FLMP) and the post-labelling integration model (POSTL). We also applied hidden Markov models (HMMs) and multi-stream HMMs (MSHMMs) for the recognition. As expected, all the models agree qualitatively with the results that the benefit gained from the visual signal is larger at lower acoustic SNRs. However, the FLMP severely overestimates the AV integration result, while the POSTL model underestimates it. Our automatic speech recognizers integrated the audio and visual streams efficiently. The visual automatic speech recognizer could be adjusted to correspond to human visual performance. The MSHMMs combine the audio and visual streams efficiently, but the audio automatic speech recognizer must be further improved to allow precise quantitative comparisons with human audio-visual performance.
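For reference, the FLMP integration rule multiplies the per-modality support for each response alternative and normalises over all alternatives. The sketch below shows only that rule. The support values are made up, and fitting them to listener data, as a study like this would require, is not shown.

```python
def flmp(audio_support, visual_support):
    """Fuzzy logical model of perception: multiplicative integration.

    Each input lists the support (between 0 and 1) that one modality gives
    to each response alternative. The output is the predicted probability of
    each alternative after normalising the products.
    """
    products = [a * v for a, v in zip(audio_support, visual_support)]
    total = sum(products)
    if total == 0:
        raise ValueError("at least one alternative needs non-zero support")
    return [p / total for p in products]


# Hypothetical support for two consonant alternatives.
audio_support = [0.6, 0.4]    # weak auditory evidence, e.g. at low SNR
visual_support = [0.9, 0.1]   # strong visual evidence

prediction = flmp(audio_support, visual_support)
print(prediction)
```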

Keywords: audio-visual recognition Fuzzy Logical Model of Perception Post-Labelling Model Hidden Markov Models Multi-Stream Hidden Markov Models

An audio-visual speech recognition with a new Mandarin audio-visual database

4th International Conference on Cybernetics and Information Technologies, Systems and Applications/5th Int Conf on Computing, Communications and Control Technologies

Authors: Liao, Wen-Yuan; Pao, Tsang-Long; Chen, Yu-Te; Chang, Tsun-Wei. Affiliations: De Lin Inst Technol, Dept Comp Sci & Engn, Taipei, Taiwan; Tatung Univ, Dept Comp Sci & Engn, Taipei, Taiwan

ISBN: (print) 9781934272077

Automatic speech recognition (ASR) by machine has been a goal and an attractive research area for the past several decades. In recent years, there has been growing interest in overcoming certain audio-only recognition problems. Motivated by the multimodal nature of speech, the visual feature is considered to bring in information that does not exist in the acoustic signal and enables improved system performance over audio-only methods. We first introduce the method for extracting the visual features of the lips. In this paper, we compare four different weighting functions in weighted KNN-based classifiers to recognize the ten digits, 0 to 9, from Mandarin audiovisual speech. The classifiers studied include traditional KNN, weighted KNN, and weighted D-KNN. We also create a new audio-visual database in English and Mandarin. We describe this audio-visual database and test it with our proposed system, with some experimental results.
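How a weighting function changes a KNN decision can be shown with a small generic sketch. It is not the weighted D-KNN of the paper, whose definition is not given in this abstract, and the two weighting functions here are common textbook choices rather than the four the authors compared. Feature vectors and labels are hypothetical.

```python
import math
from collections import defaultdict


def uniform(distance):
    """Traditional KNN: every neighbour gets one vote."""
    return 1.0


def inverse_distance(distance, eps=1e-9):
    """Closer neighbours get larger votes; eps avoids division by zero."""
    return 1.0 / (distance + eps)


def weighted_knn(train, query, k, weight_fn=inverse_distance):
    """Classify query by weighted vote among its k nearest training points.

    train is a list of (feature_vector, label) pairs.
    """
    neighbours = sorted((math.dist(x, query), y) for x, y in train)[:k]
    votes = defaultdict(float)
    for distance, label in neighbours:
        votes[label] += weight_fn(distance)
    return max(votes, key=votes.get)


# Hypothetical 2-D lip features with digit-like labels.
train = [((0.0, 0.0), "a"), ((3.0, 0.0), "b"), ((3.0, 1.0), "b")]
query = (1.0, 0.0)

print(weighted_knn(train, query, k=3, weight_fn=uniform))
print(weighted_knn(train, query, k=3, weight_fn=inverse_distance))
```

With uniform votes the two farther "b" points outvote the single near "a" point, while inverse-distance weighting lets the nearest neighbour dominate, so the choice of weighting function can flip the predicted class.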

Keywords: audio-visual recognition feature extraction K nearest neighborhood.
