ISBN: 9798350397444 (print)
With the emergence of neural architectures specifically tailored to handle both modalities, cross-modal language and image processing has attracted increasing attention. A major motivation has been the search for a quantum leap in language understanding supported by visual grounding. This work has mostly targeted tasks in which language descriptions of images are to be produced and, vice versa, tasks in which images are to be generated from keywords. Taking a different angle of inquiry, this paper addresses the cross-modal challenge of language-driven image design, focusing on the task of editing an image according to language instructions for modifying it. It also follows a different research path: rather than relying on specifically tailored architectures, the proposed approach uses a general-purpose, suitably instantiated neural architecture of the Transformer class. Experiments with this approach delivered very encouraging results, empirically demonstrating that it is an effective methodology for language-driven image design and a basis for further advances in cross-modal processing and its applications with affordable compute and data.