ISBN (print): 9781450388078
Procedural content generation via machine learning (PCGML) has demonstrated its usefulness as a content and game creation approach and has been shown to support human creativity. An important facet of creativity is combinational creativity: the recombination, adaptation, and reuse of ideas and concepts between and across domains. In this paper, we present a PCGML approach for level generation that recombines, adapts, and reuses structural patterns from several domains to approximate unseen domains. We extend prior work on example-driven Binary Space Partitioning for recombining and reusing patterns across multiple domains, and incorporate variational autoencoders (VAEs) for generating unseen structures. We evaluate our approach by blending across 7 domains and subsets of those domains. We show that our approach blends domains together while retaining structural components. Additionally, by using different groups of training domains, our approach can generate both 1) levels that reproduce and capture features of a target domain, and 2) levels with vastly different properties from the input domains.
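The example-driven Binary Space Partitioning that this abstract builds on is not detailed here; as a rough, hypothetical sketch of plain BSP (the grid dimensions, size threshold, and split rule are illustrative assumptions, not the authors' configuration), a level region can be recursively divided into leaf slots that structural patterns could later fill:

```python
import random

def bsp_partition(x, y, w, h, max_size, rng):
    """Recursively split a (w x h) region until every leaf fits max_size."""
    if w <= max_size and h <= max_size:
        return [(x, y, w, h)]  # leaf region: a candidate slot for a pattern
    # Split along the longer axis at a random interior point.
    if w >= h:
        cut = rng.randint(1, w - 1)
        return (bsp_partition(x, y, cut, h, max_size, rng)
                + bsp_partition(x + cut, y, w - cut, h, max_size, rng))
    cut = rng.randint(1, h - 1)
    return (bsp_partition(x, y, w, cut, max_size, rng)
            + bsp_partition(x, y + cut, w, h - cut, max_size, rng))

rng = random.Random(0)
leaves = bsp_partition(0, 0, 32, 16, 8, rng)
# The leaves tile the original region exactly.
assert sum(w * h for _, _, w, h in leaves) == 32 * 16
```

In an example-driven variant, each leaf would then be filled with a pattern sampled from one of several training domains, which is where the blending happens.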
Emotional Voice Conversion (EVC) aims to convert the emotional state of speech from one emotion to another while preserving the linguistic information and identity of the speaker. However, many studies are limited by the requirement for parallel speech data between different emotional patterns, which is not widely available in real-life applications. Furthermore, the annotation of emotional data is highly time-consuming and labor-intensive. To address these problems, we propose SGEVC, a novel semi-supervised generative model for emotional voice conversion. We demonstrate that using as little as 1% labeled data is sufficient to achieve EVC. Experimental results show that our proposed model achieves state-of-the-art (SOTA) performance and consistently outperforms EVC baseline frameworks.
An electrocardiogram (ECG) provides crucial information about an individual's health status. Researchers utilize ECG data to develop learners for a variety of tasks, ranging from diagnosing ECG abnormalities to estimating time to death, here modeled as individual survival distributions (ISDs). The way the ECG is represented is important for creating an effective learner. While many traditional ECG-based prediction models rely on hand-crafted features, such as heart rate, this study aims to achieve a better representation. The effectiveness of various ECG-based feature extraction methods, whether supervised or unsupervised, for predicting ISDs has not been explored previously. The study uses a large ECG dataset from 244,077 patients with over 1.6 million 12-lead ECGs, each labeled with the patient's disease as one or more International Classification of Diseases (ICD) codes. We explored extracting high-level features from ECG traces using various approaches, then trained models that used these ECG features (along with age and sex), across a range of training sizes, to estimate patient-specific ISDs. The results showed that the supervised feature extraction method produced ECG features that estimate ISD curves better than ECG features obtained from unsupervised or knowledge-based methods. Supervised ECG features required fewer training instances (as few as 500) to learn ISD models that outperformed the baseline model using only age and sex, whereas unsupervised and knowledge-based ECG features required over 5,000 training samples to do so. The study's findings may assist researchers in selecting the most appropriate approach for extracting high-level features from ECG signals to estimate patient-specific ISD curves.
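As a generic illustration of the ISD target described above (not the authors' actual learner), a discrete individual survival distribution can be obtained from per-interval hazard predictions by chaining survival probabilities; the hazard values below are invented for illustration:

```python
def survival_curve(hazards):
    """Convert per-interval hazard probabilities h_i into a discrete
    individual survival distribution S(t) = prod_{i<=t} (1 - h_i)."""
    surv, s = [], 1.0
    for h in hazards:
        s *= (1.0 - h)
        surv.append(s)
    return surv

# A hypothetical patient whose predicted risk grows over time.
curve = survival_curve([0.05, 0.10, 0.20, 0.40])
# The survival curve is monotonically non-increasing.
assert all(curve[i] >= curve[i + 1] for i in range(len(curve) - 1))
```

Whatever features the extractor produces, age and sex included, a model of this form only needs to output the per-interval hazards; the curve itself follows mechanically.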
ISBN (digital): 9783031442131
ISBN (print): 9783031442124; 9783031442131
Understanding how to learn feature representations for images and generate high-quality images under unsupervised learning is challenging. One of the main difficulties in feature learning has been the problem of posterior collapse in variational inference. This paper proposes a hierarchical aggregated vector-quantized variational autoencoder, called TransVQ-VAE. First, multi-scale feature information based on a hierarchical Transformer is complementarily encoded to represent the global and structural dependencies of the input features. It is then compared against the latent encoding space via a linear difference to reduce the feature dimensionality. Finally, the decoder generates synthetic samples with higher diversity and fidelity than previous models. In addition, we propose a dual self-attention module in the encoding process that uses spatial and channel information to capture distant texture correlations, contributing to the consistency and realism of the generated images. Experimental results on the MNIST, CIFAR-10, CelebA-HQ, and ImageNet datasets show that our approach significantly improves the diversity and visual quality of the generated images.
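The vector-quantization step at the heart of any VQ-VAE-style bottleneck, of which TransVQ-VAE is a variant, is a nearest-codebook lookup; the tiny two-dimensional codebook below is an illustrative assumption, not the paper's configuration:

```python
def quantize(vectors, codebook):
    """Map each encoder output vector to the index of its nearest
    codebook entry (squared Euclidean distance), as in a VQ-VAE bottleneck."""
    def sqdist(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return [min(range(len(codebook)), key=lambda k: sqdist(v, codebook[k]))
            for v in vectors]

codebook = [[0.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
codes = quantize([[0.1, -0.2], [0.9, 1.2]], codebook)
assert codes == [0, 1]
```

In a full model the decoder sees the selected codebook entries rather than the raw encoder outputs, which is what makes the latent space discrete and helps avoid posterior collapse.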
ISBN (digital): 9783031431487
ISBN (print): 9783031431470; 9783031431487
Semantic image synthesis (SIS) refers to the problem of generating realistic imagery given a semantic segmentation mask that defines the spatial layout of object classes. Beyond the quality of the generated images, most approaches in the literature put effort into increasing generation diversity in terms of style, i.e., texture. However, they all neglect a different capability: manipulating the layout provided by the mask. Currently, the only way to do so is manually, by means of graphical user interfaces. In this paper, we describe a network architecture that automatically manipulates or generates the shape of object classes in semantic segmentation masks, with a specific focus on human faces. Our proposed model embeds the mask class-wise into a latent space where each class embedding can be independently edited. A bi-directional LSTM block and a convolutional decoder then output a new, locally manipulated mask. We report quantitative and qualitative results on the CelebMask-HQ dataset, which show that our model can both faithfully reconstruct and modify a segmentation mask at the class level. We also show that our model can be placed before a SIS generator, opening the way to fully automatic control of both shape and texture. Code available at https://***/TFonta/Semantic-VAE.
ISBN (print): 9783031054914; 9783031054907
Textiles are among the common necessities of our lives, and the quality of textile products is closely tied to the quality of the fabric materials, so the fabric and textile industry inspects fabric quality before the materials are processed. Traditionally, defects on the fabric surface were detected by human eyes; inspection standards were unreliable due to inspector fatigue and subjective judgment, and the process consumed considerable labor and time. Automated detection methods are therefore gradually being introduced into the textile industry as one of its important processes. With the rapid advancement of deep learning technology, deep neural networks have brought revolutionary changes to computer vision. This research employs an unsupervised deep learning model that combines a variational autoencoder (VAE) and a generative adversarial network (GAN) to detect fabric defects. The proposed fabric inspection networks, called FINs, use only non-defective fabric data to train the model, which avoids the need of traditional detection methods to collect a large amount of defect data. During model training, we introduce the structural similarity index to help the model learn the defect-free texture characteristics of fabric surfaces. With this method, surface defects can be found and the defective areas repaired; after segmentation, the position of each defect can be marked, and the detection results reach a good degree of accuracy.
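The structural similarity index mentioned above can be illustrated with a simplified global (non-windowed) variant on flat grayscale patches; a real fabric-inspection pipeline would apply a windowed SSIM over full images, and the constants below follow the common defaults rather than anything specified in this abstract:

```python
def ssim(x, y, c1=0.01 ** 2, c2=0.03 ** 2):
    """Global structural similarity between two equal-length grayscale
    patches with pixel values in [0, 1]; 1.0 means identical structure."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    vx = sum((a - mx) ** 2 for a in x) / n
    vy = sum((b - my) ** 2 for b in y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
    return (((2 * mx * my + c1) * (2 * cov + c2))
            / ((mx ** 2 + my ** 2 + c1) * (vx + vy + c2)))

patch = [0.2, 0.4, 0.6, 0.8]
assert abs(ssim(patch, patch) - 1.0) < 1e-9   # identical patches
assert ssim(patch, [0.8, 0.6, 0.4, 0.2]) < 1.0  # reversed gradient
```

Training a reconstruction model against 1 − SSIM rather than plain pixel error is one common way to make it sensitive to texture structure, which is presumably why the authors introduce it.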
ISBN (print): 9798400701061
Timbre is high-dimensional and sensuous, making it difficult for musical-instrument learners to improve their timbre. Some systems exist to improve timbre, but they require expert labeling for timbre evaluation; merely visualizing the results of unsupervised learning, on the other hand, lacks intuitive feedback because human perception is not considered. We therefore employ crossmodal correspondences for intuitive visualization of timbre. We designed TimToShape, a system that visualizes timbre with 2D shapes based on the user's input of timbre-shape correspondences. TimToShape generates a shape morphed by linear interpolation according to the timbre's position in a latent space obtained by unsupervised learning with a variational autoencoder (VAE). We confirmed that people perceived shapes generated by TimToShape as corresponding more closely to timbre than randomly generated shapes. Furthermore, a user study with six violin players revealed that TimToShape was well received in terms of visual clarity and interpretability.
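The shape morphing by linear interpolation described above reduces, in its simplest form, to vertex-wise interpolation between two user-provided anchor shapes, with the interpolation weight derived from the timbre's position in the latent space; the anchor coordinates below are hypothetical:

```python
def morph(shape_a, shape_b, t):
    """Linearly interpolate two 2D shapes (matched vertex lists) by t in [0, 1]."""
    return [((1 - t) * ax + t * bx, (1 - t) * ay + t * by)
            for (ax, ay), (bx, by) in zip(shape_a, shape_b)]

# Hypothetical anchors: a "soft" rounded shape and a "sharp" spiky one.
soft = [(0.0, 1.0), (1.0, 0.0), (0.0, -1.0), (-1.0, 0.0)]
sharp = [(0.0, 2.0), (0.2, 0.0), (0.0, -2.0), (-0.2, 0.0)]
halfway = morph(soft, sharp, 0.5)
assert halfway[0] == (0.0, 1.5)
```

With more than two anchors, the same idea generalizes to a weighted blend whose weights come from distances to anchor points in the VAE latent space.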
ISBN (digital): 9783031282386
ISBN (print): 9783031282379; 9783031282386
Recent progress in pose-estimation methods enables the extraction of sufficiently precise 3D human skeleton data from ordinary videos, which offers great opportunities for a wide range of applications. However, such spatio-temporal data are typically extracted in the form of a continuous skeleton sequence without any information about semantic segmentation or annotation. To make the extracted data reusable for further processing, there is a need to access them based on their content. In this paper, we introduce a universal retrieval approach that compares any two skeleton sequences based on the temporal order and similarities of their underlying segments. The similarity of segments is determined by their content-preserving low-dimensional code representation, which is learned using the variational autoencoder principle in an unsupervised way. The quality of the proposed representation is validated in retrieval and classification scenarios; our proposal outperforms the state-of-the-art approaches in effectiveness and reaches speed-ups of up to 64x on common skeleton sequence datasets.
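The paper compares sequences of segment codes in temporal order; one generic way to do that (a sketch of the general idea, not the authors' exact similarity measure) is dynamic time warping over the low-dimensional codes, shown here with made-up 2-D codes:

```python
def dtw(seq_a, seq_b, dist):
    """Dynamic-time-warping cost between two segment-code sequences:
    respects temporal order while allowing local stretching."""
    inf = float("inf")
    n, m = len(seq_a), len(seq_b)
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = dist(seq_a[i - 1], seq_b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1],
                                 cost[i - 1][j - 1])
    return cost[n][m]

euclid = lambda a, b: sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
a = [(0.0, 0.0), (1.0, 1.0), (2.0, 0.0)]
assert dtw(a, a, euclid) == 0.0           # identical sequences cost nothing
assert dtw(a, [(0.0, 0.0), (2.0, 0.0)], euclid) > 0.0
```

Because each segment is a short fixed-size code rather than raw skeleton frames, this comparison runs on far smaller inputs, which is where the reported speed-ups plausibly come from.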
The end-to-end singing voice synthesis (SVS) model VISinger [1] can achieve better performance than the typical two-stage model with fewer parameters. However, VISinger has several problems: a text-to-phase problem, where the end-to-end model learns a meaningless text-to-phase mapping; a glitch problem, where the harmonic components corresponding to the periodic signal of voiced segments undergo sudden changes with audible artefacts; and a low sampling rate, as 24 kHz does not meet the needs of high-fidelity generation at full-band rates (44.1 kHz or higher). In this paper, we propose VISinger 2, which addresses these issues by integrating digital signal processing (DSP) methods with VISinger. Specifically, inspired by recent advances in differentiable digital signal processing (DDSP) [2], we incorporate a DSP synthesizer into the decoder. The DSP synthesizer consists of a harmonic synthesizer and a noise synthesizer that generate periodic and aperiodic signals, respectively, from the latent representation z in VISinger. It supervises the posterior encoder to extract a latent representation free of phase information, preventing the prior encoder from modelling the text-to-phase mapping. To avoid glitch artefacts, HiFiGAN is modified to accept the waveforms generated by the DSP synthesizer as a condition for producing the singing voice. Moreover, with the improved waveform decoder, VISinger 2 manages to generate 44.1 kHz singing audio with richer expression and better quality. Experiments on the OpenCpop corpus [3] show that VISinger 2 outperforms VISinger, CpopSing, and RefineSinger in both subjective and objective metrics. Our audio samples and source code are available (1).
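The harmonic half of the DSP synthesizer described above is, in DDSP style, a sum of sinusoids at integer multiples of a fundamental frequency. The sketch below omits the noise branch and the time-varying control derived from the latent z; the fundamental, amplitudes, and duration are illustrative assumptions:

```python
import math

def harmonic_synth(f0, amps, sr=44100, dur=0.01):
    """Sum-of-sinusoids harmonic synthesizer: one sinusoid per harmonic
    of f0, weighted by amps, evaluated at sample rate sr for dur seconds."""
    n = int(sr * dur)
    return [sum(a * math.sin(2 * math.pi * f0 * (k + 1) * t / sr)
                for k, a in enumerate(amps))
            for t in range(n)]

# A 220 Hz tone with three decaying harmonics, 10 ms at 44.1 kHz.
wave = harmonic_synth(220.0, [0.6, 0.3, 0.1])
assert len(wave) == 441
```

Because such a signal is periodic and phase-continuous by construction, conditioning the neural vocoder on it is a plausible way to suppress the sudden harmonic changes the abstract calls glitches.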
ISBN (print): 9798400703270
As Internet applications continue to scale up, microservice architecture has become increasingly popular due to its flexibility and logical structure. Anomaly detection in traces that record inter-microservice invocations is essential for diagnosing system failures. Deep learning-based approaches allow accurate modeling of structural features (i.e., call paths) and latency features (i.e., call response times), which can determine whether a particular trace sample is anomalous. However, the point-wise manner employed by these methods incurs substantial detection overhead and is impractical given the massive volume of traces (billion-level). Furthermore, the point-wise approach lacks high-level information, as identical sub-structures across multiple traces may be encoded differently. In this paper, we introduce the first Group-wise Trace anomaly detection algorithm, named GTrace. This method categorizes traces into distinct groups based on their shared substructure, such as the entire tree or a sub-tree. A group-wise variational autoencoder (VAE) is then employed to obtain structural representations. Moreover, the innovative "predicting latency with structure" learning paradigm associates the grouped structure with the latency distribution within each group. The group-wise design enables representation caching and batched inference strategies, which significantly reduce the detection burden on the system. Our comprehensive evaluation reveals that GTrace outperforms state-of-the-art methods in both performance (2.64% to 195.45% improvement in AUC and 2.31% to 40.92% improvement in best F-score) and efficiency (21.9x to 28.2x speedup). We have deployed and assessed the proposed algorithm on eBay's microservices cluster, and our code is available at https://***/NetManAIOps/***.
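The "group by shared substructure" idea behind GTrace can be sketched by canonicalizing each call tree into a structural key, so traces with identical structure share one representation. The service names and tree encoding below are hypothetical, and the real system's grouping and group-wise VAE are far more involved:

```python
def structure_key(node):
    """Canonical key for a call tree node (name, children): the service
    name plus the sorted keys of its children, so invocation order
    does not change the group."""
    name, children = node
    return (name, tuple(sorted(structure_key(c) for c in children)))

def group_traces(traces):
    """Bucket traces by structural key; each bucket shares one encoding."""
    groups = {}
    for trace in traces:
        groups.setdefault(structure_key(trace), []).append(trace)
    return groups

# Two traces with the same call structure (in different order), one different.
t1 = ("gateway", [("auth", []), ("cart", [("db", [])])])
t2 = ("gateway", [("cart", [("db", [])]), ("auth", [])])
t3 = ("gateway", [("auth", [])])
groups = group_traces([t1, t2, t3])
assert len(groups) == 2
```

Caching one representation per key instead of encoding every trace is what makes batched, group-level detection cheaper than the point-wise alternative.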