Search Results - Inner Mongolia University Library

International Conference on Acoustics, Speech, and Signal Processing (ICASSP)

Authors: Kaijun Deng Dezhi Zheng Jindong Xie Jinbao Wang Weicheng Xie Linlin Shen Siyang Song Computer Vision Institute School of Computer Science and Software Engineering Shenzhen University National Engineering Laboratory for Big Data System Computing Technology Shenzhen University Guangdong Provincial Key Laboratory of Intelligent Information Processing Department of Computer Science University of Exeter

ISBN: (digital) 9798350368741

ISBN: (print) 9798350368758

Accurately synthesizing talking face videos and capturing fine facial features for individuals with long hair presents a significant challenge. To tackle these challenges in existing methods, we propose decomposed per-embedding Gaussian fields (DEGSTalk), a 3D Gaussian Splatting (3DGS)-based talking face synthesis method for generating realistic talking faces with long hair. Our DEGSTalk employs Deformable Pre-Embedding Gaussian Fields, which dynamically adjust pre-embedding Gaussian primitives using implicit expression coefficients. This enables precise capture of dynamic facial regions and subtle expressions. Additionally, we propose a Dynamic Hair-Preserving Portrait Rendering technique to enhance the realism of long hair motions in the synthesized videos. Results show that DEGSTalk achieves improved realism and synthesis quality compared to existing approaches, particularly in handling complex facial dynamics and hair preservation. Our code is available at https://***/CVI-SZU/DEGSTalk.

Keywords: Hair; Training; Three-dimensional displays; Dynamics; Signal processing; Rendering (computer graphics); Noise measurement; Speech processing; Faces; Videos

Data-Driven Analysis of Skin Cancer Classification with Convolutional Neural Networks for E-Health Applications

2024 IEEE Global Communications Conference, GLOBECOM 2024

Authors: Ahmed, Imran Ahmad, Misbah Chehri, Abdellah Jeon, Gwanggil Anglia Ruskin University School of Computing and Information Science Cambridge United Kingdom Hartpury University Animal and Agriculture Department Gloucester United Kingdom University of the West of England Centre for Machine Vision Bristol Robotics Laboratory Bristol United Kingdom Department of Mathematics and Computer Science Canada Incheon National University Department of Embedded Systems Engineering Incheon Korea Republic of

ISBN: (print) 9798350351255

This study explores the effectiveness of Convolutional Neural Networks (CNNs) in automatically classifying skin cancer for e-health applications. The trained model showcases impressive performance by leveraging the HAM10000 dataset, which includes a wide range of skin lesion images from seven different classes. The parameters and architecture of the CNN model are presented in a systematic manner, providing valuable insights into the reasoning behind its design. The model is optimized using the Adam optimizer and annealing techniques to ensure efficient convergence. The model's performance is assessed on validation and test datasets, showcasing an accuracy of 78.55% and 76.49%, respectively, for skin cancer classification. This study highlights the significant potential of CNN as a powerful tool for automating the diagnosis of skin cancer, which is in line with the growing trend of using deep learning for medical image analysis. © 2024 IEEE.

Keywords: Electronic health record

An Ensemble Approach to Multi-Class Classification of Vocal Disorders: Laryngocele and Vox Senilis

2024 IEEE International Conference on Intelligent Signal Processing and Effective Communication Technologies, INSPECT 2024

Authors: Bawa, Puneet Kadyan, Virender Mantri, Archana Sethi, Monika Chitkara University Institute of Engineering & Technology Chitkara University Centre of Excellence for Speech and Multimodal Laboratory Punjab India Machine Intelligence Research Centre School of Computer Science UPES Uttarakhand Dehradun India Anurag University Department of Electronics and Communication Engineering Hyderabad India Chitkara University Institute of Engineering & Technology Chitkara University Punjab India

ISBN: (print) 9798350379525

The classification of audio signals has been a significant challenge in machine learning, especially with regard to the early identification of voice disorders. However, traditional techniques based on raw audio feature extraction have been shown to be less effective due to the complex and non-stationary capacities of individuals experiencing speech difficulties. Therefore, the audio information related to two vocal disorders, Laryngocele and Vox Senilis, has been transformed into an informative representation, the Mel-Spectrogram. This helps in capturing both temporal and spectral characteristics in a manner consistent with typical auditory perception. In this study, the standalone Convolutional Neural Network (CNN) and various ensemble methods in conjunction with machine learning techniques (CNN-ML) have been explored for classification of audio signals based on Mel-Spectrograms. The CNN-Support Vector Machine (SVM) based ensemble method has excelled with an accuracy of 96.23%: the CNN extracts hierarchical features from the Mel-Spectrograms, while the SVM component provides robust classification capabilities. Overall, a relative improvement of 41.18% has been observed for CNN-SVM using the Adamax optimizer when compared to the performance of ensemble techniques using the Adam optimizer. © 2024 IEEE.

Keywords: Convolutional neural networks

Human Orientation Estimation Under Partial Observation

IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)

Authors: Jieting Zhao Hanjing Ye Yu Zhan Hao Luan Hong Zhang Department of Electronic and Electrical Engineering SUSTech Shenzhen Key Laboratory of Robotics and Computer Vision Southern University of Science and Technology (SUSTech) Department of Computer Science School of Computing National University of Singapore

ISBN: (digital) 9798350377705

ISBN: (print) 9798350377712

Reliable Human Orientation Estimation (HOE) from a monocular image is critical for autonomous agents to understand human intention. Significant progress has been made in HOE under full observation. However, the existing methods easily make a wrong prediction under partial observation and give it an unexpectedly high confidence. To solve the above problems, this study first develops a method called Part-HOE that estimates orientation from the visible joints of a target person so that it is able to handle partial observation. Subsequently, we introduce a confidence-aware orientation estimation method, enabling more accurate orientation estimation and reasonable confidence estimation under partial observation. The effectiveness of our method is validated on both public and custom-built datasets, and it shows great accuracy and reliability improvement in partial observation scenarios. In particular, we show in real experiments that our method can benefit the robustness and consistency of the Robot Person Following (RPF) task.

Keywords: Accuracy; Filtering; Estimation; Robustness; Autonomous agents; Intelligent robots

Inverse Kinematics of Robotic Manipulators Using a New Learning-by-Example Method

IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)

Authors: Jacket Demby’s Ramy Farag Guilherme N. DeSouza Department of Electrical Engineering and Computer Science (EECS) Vision-Guided and Intelligent Robotics (ViGIR) Laboratory University of Missouri-Columbia Columbia Missouri

ISBN: (digital) 9798350377705

ISBN: (print) 9798350377712

Inverse Kinematics (IK) is one of the most fundamental challenges in robotics. It refers to the process of determining the joint configurations required to achieve the desired position and orientation (pose) of a robot end-effector. Although numerous Data-Driven (DD) IK solvers have demonstrated encouraging results, they have not achieved the same accuracy when compared to other IK methods for complex robot configurations (e.g., numerical methods for higher Degrees of Freedom (DoF)). In this work, we propose a new Learning-by-Example method, and show that such a scheme considerably improves the IK learning results when compared to other DD learners. In our approach, the network input incorporates an example of joint-pose pair along with the query pose to predict the desired robot joint configuration. We show that the example joint-pose pair does not need to be too close to the query; i.e., example and query can be as far as 20 degrees apart in the joint configuration space. Furthermore, we investigate the utilization of residual and dense skip connections in Multilayer Perceptron for DDIK solvers and employ the resulting networks for two redundant robotic manipulators: a 7-DoF-7R commensurate robot and a 7DoF-2RP4R incommensurate robot. Our experimental results show that the resulting DDIK solver can reliably predict IK solutions with accuracy better than 1 mm in position and 1 degree in orientation.

Keywords: Hands; Accuracy; Kinematics; Network architecture; Multilayer perceptrons; Numerical models; Reliability; Collision avoidance; Robots; Intelligent robots

Generative Adversarial Network-Based Voice Synthesis from Spectrograms for Low-Resource Speech Recognition in Mismatched Conditions

15th International Conference on Computing Communication and Networking Technologies, ICCCNT 2024

Authors: Bawa, Puneet Kadyan, Virender Chhabra, Gunjan Chitkara University Institute of Engineering & Technology Chitkara University Centre of Excellence for Speech and Multimodal Laboratory Punjab India University of Petroleum & Energy Studies Machine Intelligence Research Centre School of Computer Science Energy Acres Bidholi Uttarakhand Dehradun 248007 India Department of Computer Science and Engineering Graphic Era Hill University Uttarakhand Dehradun 248007 India Graphic Era Deemed to be University Uttarakhand Dehradun 248007 India

ISBN: (print) 9798350370249

Generative Adversarial Networks (GANs) are increasingly used in speech recognition tasks and have shown promise in speech synthesis, yet their application in low-resource speech systems faces a significant hurdle owing to limited data availability. The progress of effective Automatic Speech Recognition (ASR) systems faces multiple challenges due to a limited range of options and data scarcity, resulting in decreased adaptability and efficiency. This article proposes an innovative approach for integrating GANs to create speech for both adults and children. Experiments have been conducted on using Mel-Spectrograms for synthetic augmentation to address the problem of limited data availability, particularly for low-resource languages and children. The experiments were conducted under both matched and mismatched conditions. The results demonstrate a noteworthy decrease in the Word Error Rate (WER), showcasing the potential of the GAN-based Vocoder model. This leads to an overall Relative Improvement (RI) of 12.74% and 13.95% for the adult and children's ASR systems, respectively. The research has yielded useful insights on the advancement of ASR systems, particularly in relation to the potential benefits of using GAN-based augmentation in real-world scenarios. © 2024 IEEE.

Keywords: Generative Adversarial Network; Mismatched Conditions; Speech Recognition; Vocoder; Voice Synthesis
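
The abstract above reports Relative Improvement (RI) figures without stating the formula. A minimal sketch, assuming the common definition of RI as the WER reduction relative to the baseline WER; the function name and the WER values below are hypothetical and are not taken from the paper.

```python
def relative_improvement(baseline_wer: float, new_wer: float) -> float:
    """Return the WER reduction as a percentage of the baseline WER.

    Assumed definition: RI = (WER_baseline - WER_new) / WER_baseline * 100.
    """
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return (baseline_wer - new_wer) / baseline_wer * 100.0


# Hypothetical WER values (in percent), chosen only to illustrate the arithmetic.
example_ri = relative_improvement(baseline_wer=20.0, new_wer=15.0)
print(f"Relative improvement: {example_ri:.2f}%")
```

A positive value means the augmented system has a lower WER than the baseline; a negative value means it got worse.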

NORPPA: NOvel Ringed Seal Re-Identification by Pelage Pattern Aggregation

IEEE Winter Applications and Computer Vision Workshops (WACVW)

Authors: Ekaterina Nepovinnykh Tuomas Eerola Heikki Kälviäinen Ilia Chelak Department of Computational Engineering School of Engineering Sciences Computer Vision and Pattern Recognition Laboratory (CVPRL) Lappeenranta-Lahti University of Technology LUT Lappeenranta Finland Department of Computer Science Faculty of Science University of Helsinki Helsinki Finland

We propose a method for Saimaa ringed seal (Pusa hispida saimensis) re-identification. Access to large image volumes through camera trapping and crowdsourcing provides novel possibilities for animal conservation and monitoring and calls for automatic methods for analysis, in particular, when re-identifying individual animals from the images. The proposed method NOvel Ringed seal re-identification by Pelage Pattern Aggregation (NORPPA) utilizes the permanent and unique pelage pattern of Saimaa ringed seals and content-based image retrieval techniques. First, the query image is preprocessed, and each seal instance is segmented. Next, the seal's pelage pattern is extracted using a U-net encoder-decoder based method. Then, CNN-based affine invariant features are embedded and aggregated into Fisher Vectors. Finally, the cosine distance between the Fisher Vectors is used to find the best match from a database of known individuals. We perform extensive experiments of various modifications of the method on challenging Saimaa ringed seals re-identification dataset. The proposed method is shown to produce the best re-identification accuracy on our dataset in comparisons with alternative approaches.
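
The final matching step described in the abstract above (cosine distance between aggregated vectors, then picking the closest known individual) can be sketched as follows. This is only an illustration under stated assumptions: the short vectors stand in for real Fisher Vectors, and the names `cosine_distance`, `best_match`, and the seal identifiers are hypothetical, not the authors' code.

```python
import math


def cosine_distance(u, v):
    """Cosine distance: 1 minus the cosine similarity of two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_u * norm_v)


def best_match(query, database):
    """Return the database key whose vector is closest to the query."""
    return min(database, key=lambda name: cosine_distance(query, database[name]))


# Toy 3-dimensional vectors standing in for Fisher Vectors (hypothetical data).
database = {
    "seal_a": [1.0, 0.0, 0.0],
    "seal_b": [0.0, 1.0, 0.0],
    "seal_c": [0.6, 0.8, 0.0],
}
query = [0.9, 0.1, 0.0]
print(best_match(query, database))
```

Because cosine distance ignores vector magnitude, only the direction of the aggregated descriptor affects which individual is returned.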

MULTIMODALITY HELPS FEW-SHOT 3D POINT CLOUD SEMANTIC SEGMENTATION

arXiv, 2024

Authors: An, Zhaochong Sun, Guolei Liu, Yun Li, Runjia Wu, Min Cheng, Ming-Ming Konukoglu, Ender Belongie, Serge Department of Computer Science University of Copenhagen Denmark Computer Vision Laboratory ETH Zurich Switzerland College of Computer Science Nankai University China Department of Engineering Science University of Oxford United Kingdom Institute for Infocomm Research A*STAR Singapore

Few-shot 3D point cloud segmentation (FS-PCS) aims at generalizing models to segment novel categories with minimal annotated support samples. While existing FS-PCS methods have shown promise, they primarily focus on unimodal point cloud inputs, overlooking the potential benefits of leveraging multimodal information. In this paper, we address this gap by introducing a multimodal FS-PCS setup, utilizing textual labels and the potentially available 2D image modality. Under this easy-to-achieve setup, we present the MultiModal Few-Shot SegNet (MM-FSS), a model effectively harnessing complementary information from multiple modalities. MM-FSS employs a shared backbone with two heads to extract intermodal and unimodal visual features, and a pretrained text encoder to generate text embeddings. To fully exploit the multimodal information, we propose a Multimodal Correlation Fusion (MCF) module to generate multimodal correlations, and a Multimodal Semantic Fusion (MSF) module to refine the correlations using text-aware semantic guidance. Additionally, we propose a simple yet effective Test-time Adaptive Cross-modal Calibration (TACC) technique to mitigate training bias, further improving generalization. Experimental results on S3DIS and ScanNet datasets demonstrate significant performance improvements achieved by our method. The efficacy of our approach indicates the benefits of leveraging commonly-ignored free modalities for FS-PCS, providing valuable insights for future research. The code is available at this link. © 2024, CC BY-NC-SA.

Keywords: Semantic Segmentation

HGDiffuser: Efficient Task-Oriented Grasp Generation via Human-Guided Grasp Diffusion Models

arXiv, 2025

Authors: Huang, Dehao Dong, Wenlong Tang, Chao Zhang, Hong Shenzhen Key Laboratory of Robotics and Computer Vision Southern University of Science and Technology Shenzhen China Department of Electronic and Electrical Engineering Southern University of Science and Technology Shenzhen China

Task-oriented grasping (TOG) is essential for robots to perform manipulation tasks, requiring grasps that are both stable and compliant with task-specific constraints. Humans naturally grasp objects in a task-oriented manner to facilitate subsequent manipulation tasks. By leveraging human grasp demonstrations, current methods can generate high-quality robotic parallel-jaw task-oriented grasps for diverse objects and tasks. However, they still encounter challenges in maintaining grasp stability and sampling efficiency. These methods typically rely on a two-stage process: first performing exhaustive task-agnostic grasp sampling in the 6-DoF space, then applying demonstration-induced constraints (e.g., contact regions and wrist orientations) to filter candidates. This leads to inefficiency and potential failure due to the vast sampling space. To address this, we propose the Human-guided Grasp Diffuser (HGDiffuser), a diffusion-based framework that integrates these constraints into a guided sampling process. Through this approach, HGDiffuser directly generates 6-DoF task-oriented grasps in a single stage, eliminating exhaustive task-agnostic sampling. Furthermore, by incorporating Diffusion Transformer (DiT) blocks as the feature backbone, HGDiffuser improves grasp generation quality compared to MLP-based methods. Experimental results demonstrate that our approach significantly improves the efficiency of task-oriented grasp generation, enabling more effective transfer of human grasping strategies to robotic systems. To access the source code and supplementary videos, visit https://***/ view/hgdiffuser. Copyright © 2025, The Authors. All rights reserved.

Keywords: Robotics

Zero-Shot Audio Captioning Using Soft and Hard Prompts

IEEE Transactions on Audio, Speech and Language Processing, 2025, vol. 33, pp. 2045-2058

Authors: Yiming Zhang Xuenan Xu Ruoyi Du Haohe Liu Yuan Dong Zheng-Hua Tan Wenwu Wang Zhanyu Ma Pattern Recognition and Intelligent System Laboratory School of Artificial Intelligence Beijing University of Posts and Telecommunications Beijing China Department of Computer Science and Engineering Shanghai Jiao Tong University Shanghai China Centre for Vision Speech and Signal Processing University of Surrey Guildford U.K. Department of Electronic Systems Aalborg University Aalborg Denmark

In traditional audio captioning methods, a model is usually trained in a fully supervised manner using a human-annotated dataset containing audio-text pairs and then evaluated on the test set from the same dataset. Such methods have two limitations. First, these methods are often data-hungry and require time-consuming and expensive human annotations to obtain audio-text pairs. Second, these models often suffer from performance degradation in cross-domain scenarios, i.e., when the input audio comes from a different domain than the training set, and this issue has received little attention. To address these issues, we propose a new zero-shot method for audio captioning. Our method is built on the contrastive language-audio pre-training (CLAP) model. During training, the model reconstructs the ground-truth caption using the CLAP text encoder. In the inference stage, the model generates text descriptions from the CLAP audio embeddings of given audio inputs. To enhance the ability of the model in transitioning from text-to-text generation to audio-to-text generation, we propose to use the mixed-augmentations-based soft prompt to learn more robust latent representations, leveraging instance replacement and embedding augmentation. Additionally, we introduce the retrieval-based acoustic-aware hard prompt to improve the cross-domain performance of the model by employing the domain-agnostic label information of sound events. Extensive experiments on AudioCaps and Clotho benchmarks show the effectiveness of our proposed method, which outperforms other zero-shot audio captioning approaches for in-domain scenarios and outperforms the compared methods for cross-domain scenarios, underscoring the generalization ability of our method.

Keywords: Training; Decoding; Semantics; Data models; Acoustics; Electronic mail; Benchmark testing; Transformers; Robustness; Perturbation methods
