The data available in the world come in various modalities, such as audio, text, image, and video. Each data modality has different statistical properties. Understanding each modality individually, as well as the relationships between modalities, is vital for a better understanding of the environment surrounding us. Multimodal learning models allow us to process and extract useful information from multimodal sources. For instance, image captioning and text-to-image synthesis are examples of multimodal learning, as both require a mapping between text and images. In this paper, we introduce a research area that has never been explored by the remote sensing community, namely the synthesis of remote sensing images from text descriptions. More specifically, we focus on exploiting ancient text descriptions of geographical areas, inherited from previous civilizations, to generate equivalent remote sensing images. From a methodological perspective, we propose to rely on generative adversarial networks (GANs) to convert the text descriptions into equivalent pixel values. GANs are a recently proposed class of generative models that formulate learning the distribution of a given dataset as an adversarial competition between two networks. The learned distribution is represented by the weights of a deep neural network and can be used to generate further samples. To fulfill the purpose of this paper, we collected satellite images and ancient texts to train the network. We present the results obtained and propose various future research paths that we believe are important to further develop this new research area.
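As a rough illustration of the text-conditioned adversarial setup described above, the sketch below pairs a generator and discriminator that both receive a text embedding. All names, layer sizes, and dimensions are illustrative assumptions, not the paper's architecture.

```python
# Hypothetical sketch of a text-conditioned GAN, assuming the description has
# already been encoded into a fixed-size vector (sizes are assumptions).
import torch
import torch.nn as nn

TEXT_DIM, NOISE_DIM, IMG_PIXELS = 128, 100, 64 * 64 * 3

class Generator(nn.Module):
    def __init__(self):
        super().__init__()
        # Maps a noise vector concatenated with the text embedding to pixel values.
        self.net = nn.Sequential(
            nn.Linear(NOISE_DIM + TEXT_DIM, 512), nn.ReLU(),
            nn.Linear(512, IMG_PIXELS), nn.Tanh(),
        )

    def forward(self, z, text_emb):
        return self.net(torch.cat([z, text_emb], dim=1))

class Discriminator(nn.Module):
    def __init__(self):
        super().__init__()
        # Scores an image/text pair as real or generated.
        self.net = nn.Sequential(
            nn.Linear(IMG_PIXELS + TEXT_DIM, 512), nn.LeakyReLU(0.2),
            nn.Linear(512, 1), nn.Sigmoid(),
        )

    def forward(self, img, text_emb):
        return self.net(torch.cat([img, text_emb], dim=1))

# One adversarial step: the discriminator separates real from generated images,
# the generator tries to fool it, both conditioned on the same text embedding.
G, D = Generator(), Discriminator()
bce = nn.BCELoss()
real_img = torch.rand(8, IMG_PIXELS)   # placeholder satellite images
text_emb = torch.rand(8, TEXT_DIM)     # placeholder encoded descriptions
z = torch.randn(8, NOISE_DIM)
fake_img = G(z, text_emb)

d_loss = bce(D(real_img, text_emb), torch.ones(8, 1)) + \
         bce(D(fake_img.detach(), text_emb), torch.zeros(8, 1))
g_loss = bce(D(fake_img, text_emb), torch.ones(8, 1))
```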
ISBN (print): 9783030276843; 9783030276836
The recent success of Generative Adversarial Networks (GANs) is a result of their ability to generate high-quality images given samples from a latent space. One application of GANs is generating images from a text description, where the text is first encoded and then used for conditioning in the generative model. In addition to text, conditional generative models often use label information for conditioning. Hence, the structure of the metadata and the ontology of the labels are important for such models. In this paper, we propose Ontology Generative Adversarial Networks (O-GANs) to handle the complexities of data with a label ontology. We evaluate our model on a dataset of fashion images with a hierarchical label structure. Our results suggest that incorporating the ontology leads to better image quality as measured by the Fréchet Inception Distance and the Inception Score. Additionally, we show that the O-GAN better matches the generated images to their conditioning text, compared to models that do not incorporate the label ontology.
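For reference, a minimal sketch of the Fréchet Inception Distance used in the evaluation above; `real_feats` and `fake_feats` are assumed to be Inception-network activations of shape (num_images, feature_dim).

```python
# Minimal FID sketch: distance between two Gaussians fitted to feature sets.
import numpy as np
from scipy import linalg

def frechet_inception_distance(real_feats, fake_feats):
    mu_r, mu_f = real_feats.mean(axis=0), fake_feats.mean(axis=0)
    cov_r = np.cov(real_feats, rowvar=False)
    cov_f = np.cov(fake_feats, rowvar=False)
    # Matrix square root of the covariance product; drop tiny imaginary parts.
    covmean = linalg.sqrtm(cov_r @ cov_f)
    if np.iscomplexobj(covmean):
        covmean = covmean.real
    diff = mu_r - mu_f
    return diff @ diff + np.trace(cov_r + cov_f - 2.0 * covmean)
```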
ISBN (print): 9781538691540
Text-to-image synthesis is a research topic that has not yet been addressed by the remote sensing community. It consists in learning a mapping from a text description to image pixels. In this paper, we propose to address this topic for the very first time. More specifically, our objective is to convert ancient text descriptions of geographic areas written by past explorers into equivalent remote sensing images. To this end, we rely on generative adversarial networks (GANs) to learn the mapping. GANs aim to represent the distribution of a dataset through the weights of a deep neural network, which are trained via an adversarial competition between two networks. We collected ancient texts dating back to 7 BC to train our network and obtained interesting results, which form the basis for the future research directions we highlight to advance this new topic.
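For reference, the adversarial competition mentioned above is usually written as a minimax objective. The conditional form below, with a text embedding t fed to both networks, is an assumption about how the conditioning enters; the unconditional objective simply drops t.

```latex
\min_G \max_D \;\;
\mathbb{E}_{(x,t)\sim p_{\mathrm{data}}}\big[\log D(x, t)\big]
+ \mathbb{E}_{z\sim p_z,\; t\sim p_{\mathrm{data}}}\big[\log\big(1 - D(G(z, t), t)\big)\big]
```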
ISBN (print): 9781538662496
Advances in text-to-image synthesis have produced remarkable images from textual descriptions. However, these methods are designed to generate only one object with varying attributes. They face difficulties with complex descriptions containing multiple arbitrary objects, since these require information on the placement and size of each object in the image. Recently, a method that infers object layouts from scene graphs has been proposed as a solution to this problem. However, that method describes the layout using only object labels, which fail to capture the appearance of some objects. Moreover, the model is biased towards generating rectangular-shaped objects in the absence of ground-truth masks. In this paper, we propose an object encoding module to capture object features and use it as additional information for the image generation network. We also introduce a graph-cuts-based segmentation method that can infer the masks of objects from bounding boxes to better model object shapes. Our method produces more discernible images with more realistic shapes compared to the images generated by the current state-of-the-art method.
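As one plausible realization of inferring an object mask from a bounding box with graph cuts, the sketch below uses OpenCV's GrabCut as a stand-in for the segmentation step; the paper's exact procedure may differ.

```python
# Derive a foreground mask from a bounding box via GrabCut (graph-cuts based).
import numpy as np
import cv2

def mask_from_bbox(image_bgr, bbox):
    """bbox is (x, y, width, height) in pixel coordinates."""
    mask = np.zeros(image_bgr.shape[:2], dtype=np.uint8)
    bgd_model = np.zeros((1, 65), dtype=np.float64)
    fgd_model = np.zeros((1, 65), dtype=np.float64)
    # Pixels outside the box are background, inside are probable foreground.
    cv2.grabCut(image_bgr, mask, bbox, bgd_model, fgd_model, 5,
                cv2.GC_INIT_WITH_RECT)
    # Keep definite and probable foreground as the object mask.
    return np.where((mask == cv2.GC_FGD) | (mask == cv2.GC_PR_FGD),
                    1, 0).astype(np.uint8)
```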
ISBN (print): 9781450376273
Generating a sequence of images from a multi-sentence paragraph is a recently proposed task called story visualization. Unlike other single-image generation tasks, it requires keeping global consistency across dynamic scenes and characters throughout the story flow, which is a significant challenge. However, the visual quality and semantic relevance of existing results are not satisfactory on datasets with high semantic complexity, such as the Pororo-SV cartoon dataset. To address this issue, we propose a new story visualization model named PororoGAN, which jointly considers story-to-image-sequence, sentence-to-image, and word-to-image-patch alignment. In particular, we introduce an aligned sentence encoder (ASE) and an attentional word encoder (AWE) to improve global and local relevance, respectively. Additionally, we add an image-patch discriminator to improve the realism of the results. Both quantitative and qualitative studies show that PororoGAN outperforms the state-of-the-art models.
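A minimal sketch of a patch-level discriminator, as one way an image-patch discriminator can be realized (PatchGAN-style); the layer sizes are illustrative assumptions, not the paper's architecture.

```python
# Patch discriminator: outputs a grid of scores, one per local image patch.
import torch
import torch.nn as nn

class PatchDiscriminator(nn.Module):
    def __init__(self, in_channels=3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_channels, 64, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            # One score per spatial location, i.e. per receptive-field patch.
            nn.Conv2d(128, 1, 4, stride=1, padding=1),
        )

    def forward(self, x):
        return self.net(x)  # (batch, 1, H', W') grid of patch scores

scores = PatchDiscriminator()(torch.randn(2, 3, 64, 64))
print(scores.shape)  # each cell judges one local patch as real or fake
```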
ISBN (digital): 9781728121901
ISBN (print): 9781728121918
In this work, we attempt to address the issue of developing a sophisticated text encoder for the retro-remote sensing application. The encoder converts ancient landscape descriptions into a fixed-size vector that adequately represents the available information. This vector is then used as conditioning data for a generative adversarial network (GAN) that synthesizes the equivalent image. We propose using a pre-trained Doc2Vec encoder for text encoding and train a Wasserstein GAN (a GAN variant) to convert landscape descriptions written by travelers and geographers into the equivalent image. Qualitative and quantitative analysis of the generated images indicates the usefulness of the proposed method.
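A hedged sketch of the text-encoding step: a Doc2Vec model (here via gensim) turns a landscape description into a fixed-size vector that could condition the Wasserstein GAN generator. The example corpus, vector size, and training settings are illustrative assumptions, not the paper's setup.

```python
# Encode free-text landscape descriptions into fixed-size conditioning vectors.
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

corpus = [
    "a fertile plain crossed by a wide river with scattered villages",
    "rocky hills descending toward a narrow harbour and olive groves",
]
tagged = [TaggedDocument(words=text.split(), tags=[i])
          for i, text in enumerate(corpus)]

encoder = Doc2Vec(tagged, vector_size=128, min_count=1, epochs=50)

# Fixed-size conditioning vector for a new ancient description.
description = "a walled city on a hill overlooking marshland and the sea"
cond_vector = encoder.infer_vector(description.split())  # shape: (128,)
# cond_vector would be concatenated with the noise input of the WGAN generator.
```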
Generating accurate high-resolution images from text representations is a difficult problem in computer vision with a wide range of practical applications. Text-to-image conversion is not unlike the difficulties inherent in language processing: just as the same meaning can be encoded in two distinct human languages, photographs and text are two distinct encodings of similar data. These remain distinct problems, however, since text-to-image and image-to-text conversions are highly multimodal in nature. This article discusses a proposed model for creating realistic 256 × 256 images from Arabic text descriptions. The relationship between an Arabic word in a sentence and its corresponding component in a picture is introduced in this paper using the DAMSM model. This model trains two neural networks to map image sub-regions and the words of a full Arabic sentence into a shared semantic space, and it performs well as both an Arabic text encoder and an image encoder. We start with the Modified-Arabic dataset and train the model from scratch. The proposed model establishes a new standard for the conversion of Arabic text to realistic pictures. A notable change arises when Arabic is used as the primary language for converting texts to real images. The Inception Score of the newly introduced model is 3.42 ± 0.05 on the CUB dataset.
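For reference, a minimal sketch of the Inception Score cited above; `probs` is assumed to be the (num_images, num_classes) softmax output of an Inception classifier on the generated images.

```python
# Inception Score: exp of the mean KL divergence between per-image class
# predictions and the marginal class distribution over all generated images.
import numpy as np

def inception_score(probs, eps=1e-12):
    p_y = probs.mean(axis=0, keepdims=True)  # marginal class distribution
    kl = np.sum(probs * (np.log(probs + eps) - np.log(p_y + eps)), axis=1)
    return float(np.exp(kl.mean()))
```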