A work zone is a section of road with closed lanes for maintenance, forcing vehicles to merge and creating congestion bottlenecks on the highway. Emergency Vehicles (EVs) are vital for incident response, with response...
Automation in manufacturing systems is rapidly changing, shifting from rigid solutions towards the use of smart technologies that better support manufacturers in meeting market demand. The implementation of these new ...
Traffic congestion is a major problem in urban areas of India affecting the quality of life for the citizens. Traditional methods of traffic monitoring and management have limitations in terms of accuracy and scalabil...
The alarming surge in cardiovascular diseases, with a particular focus on Coronary Artery Disease (CAD), is causing premature fatalities. This escalation is exacerbating the inefficiency of the diagnostic process, bur...
Image steganography is a technique for encrypting data, allowing covert communication, and enhancing data security inside image files. This study provides an in-depth analysis of image steganography, covering its core...
Audio-visual speech synthesis (AVSS) is an emerging field of study that involves generating synchronized and realistic video of a target speaker based on converted audio inputs of a source speaker. The AVSS method incl...
ISBN: (digital) 9798350368741
ISBN: (print) 9798350368758
Audio-visual speech synthesis (AVSS) is an emerging field of study that involves generating synchronized and realistic video of a target speaker based on converted audio inputs of a source speaker. The AVSS method includes two sequential components: voice conversion (VC) to transform the source speaker’s voice to the target speaker’s voice, and audio-visual synthesis (AVS) to generate synchronized video of the target speaker from the output of the VC model. This paper presents an AVSS approach using a Swin Transformer-based generative adversarial network (GAN) framework. The Swin Transformer is incorporated into the discriminator of both the VC and AVS models. Its hierarchical design and self-attention mechanisms significantly enhance the temporal and spatial coherence of the synthesized outputs, thereby improving the quality and synchronization of both audio and visual components. Moreover, a feature matching loss in the VC model and a temporal coherence loss in the AVS model are incorporated to enhance the quality of the synthesized audio and video outputs. Experimental results demonstrate that the proposed approach significantly outperforms existing techniques in terms of audio quality and visual synchronization, as validated by objective metrics and subjective evaluations. This work advances AVSS, offering improved performance for applications in virtual avatars, dubbing, and human-computer interaction.
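The abstract names a feature matching loss and a temporal coherence loss without defining them. As a rough illustration only, the sketch below uses formulations that are common in GAN work: an L1 distance between discriminator features of real and generated samples, and an L1 distance between the frame-to-frame changes of real and generated video. These definitions, the function names, and the flat-list data layout are assumptions made for illustration; the paper's actual losses may differ.

```python
from typing import List, Sequence

Vector = Sequence[float]


def l1_mean(a: Vector, b: Vector) -> float:
    """Mean absolute difference between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def feature_matching_loss(real_feats: List[Vector], fake_feats: List[Vector]) -> float:
    """Average per-layer L1 distance between discriminator features.

    Assumed formulation: each list entry holds the flattened activations of
    one discriminator layer for a real or generated sample.
    """
    if len(real_feats) != len(fake_feats):
        raise ValueError("both inputs must cover the same layers")
    per_layer = [l1_mean(r, f) for r, f in zip(real_feats, fake_feats)]
    return sum(per_layer) / len(per_layer)


def temporal_coherence_loss(real_frames: List[Vector], fake_frames: List[Vector]) -> float:
    """Average L1 distance between frame-to-frame changes.

    Assumed formulation: each list entry is one flattened video frame, and the
    loss compares how much consecutive frames change in each sequence.
    """
    if len(real_frames) != len(fake_frames) or len(real_frames) < 2:
        raise ValueError("need two sequences of equal length with at least 2 frames")
    total = 0.0
    for t in range(1, len(real_frames)):
        real_delta = [x - y for x, y in zip(real_frames[t], real_frames[t - 1])]
        fake_delta = [x - y for x, y in zip(fake_frames[t], fake_frames[t - 1])]
        total += l1_mean(real_delta, fake_delta)
    return total / (len(real_frames) - 1)
```

Under this assumed formulation, a generated sequence that differs from the real one only by a constant offset has zero temporal coherence loss, because only the change between consecutive frames is compared. A real implementation would compute both terms on tensors inside a deep learning framework and weight them against the adversarial loss.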
Currently, fixed fixtures or a six-coordinate attitude adjustment platform with linear motion and rotation are commonly used in aircraft engine assembly to adjust the engine's attitude, resulting in low attitude a...
This paper presents a deep learning-based solution to the challenge of weed infestation in the context of modern agriculture. We leverage visual similarities of weed species through a hierarchical classification appro...
An investigation was undertaken to assess the effectiveness of item-based collaborative filtering (IBCF) within recommendation systems, particularly in scenarios where user preferences are significantly influenced by ...
Large language models (LLMs) based on the Transformer architecture are designed to understand and generate human-like text by learning patterns and relationships from vast amounts of textual data. These models have be...