ISBN (print): 9789819786190; 9789819786206
Fine-grained vision-language retrieval aims to search for corresponding fine-grained images based on a text query, or vice versa. The challenge lies in how to match cross-modal data by learning an effective alignment. This paper proposes a simple yet effective, efficiency-aware method for fine-grained vision-language retrieval based on a global-contextual autoencoder. First, global-contextual features are learned from the images and texts to promote the discriminability of the intra-modality features. Then, to strengthen the semantic relevance among heterogeneous modalities, the method employs a semantic autoencoder. Concretely, the encoder projects the visual features into the semantic space occupied by the textual features. The decoder then applies an additional constraint: it should reconstruct the original visual features. Notably, the autoencoder is linear and symmetric, which makes it practical to scale to large datasets. Comprehensive experiments on two fine-grained tasks show that the proposed method surpasses several state-of-the-art baselines, validating its effectiveness and efficiency.
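The encoder/decoder description in the abstract above can be illustrated with a small sketch. It assumes one common formulation of a linear, tied-weight semantic autoencoder, in which the encoder is a matrix W, the decoder is its transpose, and the loss adds a reconstruction term to a weighted alignment term. The abstract does not state the exact objective, so the loss form, the weight `lam`, and all function names here are assumptions for illustration only.

```python
from typing import List

Vector = List[float]
Matrix = List[Vector]


def matvec(m: Matrix, v: Vector) -> Vector:
    """Multiply a (rows x cols) matrix by a vector of length cols."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def transpose(m: Matrix) -> Matrix:
    """Return the transpose of a matrix given as a list of rows."""
    return [list(col) for col in zip(*m)]


def sae_loss(w: Matrix, x: Vector, s: Vector, lam: float = 1.0) -> float:
    """Loss of a tied-weight linear autoencoder for one (visual, textual) pair.

    Assumed formulation (not taken from the paper):
      encoder: W x    projects the visual feature x into the semantic space
      decoder: W^T s  maps the textual feature s back to the visual space
      loss = ||x - W^T s||^2 + lam * ||W x - s||^2
    """
    encoded = matvec(w, x)
    decoded = matvec(transpose(w), s)
    align = sum((e - t) ** 2 for e, t in zip(encoded, s))
    recon = sum((d - v) ** 2 for d, v in zip(decoded, x))
    return recon + lam * align


# Hypothetical toy data: 3-d visual feature, 2-d textual feature.
w = [[1.0, 0.0, 0.0],
     [0.0, 1.0, 0.0]]
print(sae_loss(w, [1.0, 2.0, 3.0], [1.0, 2.0]))
```

Because the decoder reuses the encoder weights, only one matrix has to be learned, which is one reason a linear, symmetric design can stay cheap on large datasets.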
ISBN (print): 9789819786916; 9789819786923
The Drawing Projection Test (DPT) is an essential tool in art therapy, allowing psychologists to assess participants' mental states through their sketches. Specifically, sketches on the theme of "a person picking an apple from a tree (PPAT)" can reveal whether participants are in mental states such as depression. Compared with scales, the DPT can enrich psychologists' understanding of an individual's mental state. However, interpreting the PPAT is laborious and depends on the experience of the psychologists. To address this issue, we propose an effective identification method to support psychologists in conducting a large-scale automatic DPT. Unlike traditional sketch recognition, the DPT focuses more on the overall evaluation of the sketches, such as color usage and space utilization. Moreover, the PPAT imposes a time limit and prohibits verbal reminders, resulting in low drawing accuracy and a lack of detailed depiction. To address these challenges, we make the following contributions: (1) we provide an experimental environment for automated analysis of PPAT sketches for depression assessment; (2) we offer a Visual-Semantic depression assessment method based on LLM (VS-LLM); (3) we show experimentally that our method improves by 17.6% compared to the psychologist assessment method. We anticipate that this work will contribute to research on mental state assessment based on the recognition of elements in PPAT sketches. Our datasets and code are available at: https://***/wmeiqi/VS-LLM.
ISBN (print): 9789819784899; 9789819784905
The automatic generation of Chinese calligraphy images is a very challenging task because the structure of Chinese characters is very complex. At present, most methods learn the style of the images one by one, meaning that they lack the ability to model the style of the calligraphy from a more macro perspective. To solve these problems, this paper proposes a one-to-many style transfer model, SCGAN, based on a style collection mechanism. Our model can gather information at the collection level to complete the task of Chinese character image generation. The main features of our model are as follows: first, based on the proposed style collection mechanism, our model can collect and transform style features at the collection level; second, we redesigned the structure of the generative adversarial network. Our model can complete the one-to-many style transfer task, which can greatly reduce the workload associated with multi-target style transfer. Compared with other deep learning methods, the results obtained by our method are of higher quality and closer to reality. Experimental results show that our method achieves better performance than other methods in one-to-one and one-to-many Chinese character generation tasks.
ISBN (print): 9789819784868; 9789819784875
In recent years, significant strides have been made in harnessing large language models (LLMs) to leverage various tools across different fields, which greatly expands the application scope of LLMs. However, current research predominantly focuses on LLMs' inherent tool exploitation skills from their training data, leading to higher costs when integrating new tools. Additionally, most studies concentrate on English models, leaving a scarcity of open-source resources for other languages. This study investigates the zero-shot generalization of LLMs in tool usage, with a focus on Chinese models. We introduce AtomTool, an open-source framework for tool acquisition in LLMs, along with a dataset of 16,000 Chinese entries. This work marks the first effort to evaluate zero-shot generalization in Chinese models and provides the first open-source framework and dataset dedicated to tool acquisition in Chinese LLMs. Our experiments show that AtomTool outperforms closed-source models such as ChatGPT in zero-shot generalization in most cases. We also propose a novel dataset construction method and evaluation framework, examining the effects of prompt design and tool quantity on model performance. Overall, our work establishes a solid foundation for advancing tool acquisition in Chinese LLMs.
ISBN (print): 9789819786916; 9789819786923
In StyleGAN, convolution kernels are shaped by both static parameters shared across images and dynamic modulation factors $w^{+} \in W^{+}$ specific to each image. Therefore, the $W^{+}$ space is often used for image inversion and editing. However, a pre-trained model struggles to synthesize out-of-domain images due to the limited capabilities of $W^{+}$ and its resultant kernels, necessitating full fine-tuning or adaptation through a complex hypernetwork. This paper proposes an efficient refining strategy for dynamic kernels. The key idea is to modify kernels by low-rank residuals, learned from the input image or domain guidance. These residuals are generated by matrix multiplication between two sets of tokens with the same number of tokens, which controls the complexity. We validate the refining scheme in image inversion and domain adaptation. In the former task, we design grouped transformer blocks to learn these token sets by one- or two-stage training. In the latter task, token sets are directly optimized to support synthesis in the target domain while preserving the original content. Extensive experiments show that our method achieves low distortion for image inversion and high quality for out-of-domain editing.
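The low-rank residual idea in the abstract above can be sketched as follows. This is only an illustration under stated assumptions: the kernel is treated as a flattened 2-D matrix, each token set holds r tokens, and the residual is the sum of outer products of paired tokens (equivalently, a matrix product whose rank is at most r). The function names and the toy numbers are hypothetical and not taken from the paper.

```python
from typing import List

Matrix = List[List[float]]


def low_rank_residual(left_tokens: Matrix, right_tokens: Matrix) -> Matrix:
    """Build a residual from two token sets with the same number of tokens r.

    left_tokens:  r tokens, each of length `rows`
    right_tokens: r tokens, each of length `cols`
    The result is left^T @ right, a (rows x cols) matrix of rank <= r.
    """
    if len(left_tokens) != len(right_tokens):
        raise ValueError("both token sets must contain the same number of tokens")
    rows, cols = len(left_tokens[0]), len(right_tokens[0])
    return [
        [sum(l[i] * r[j] for l, r in zip(left_tokens, right_tokens))
         for j in range(cols)]
        for i in range(rows)
    ]


def refine_kernel(kernel: Matrix, left_tokens: Matrix, right_tokens: Matrix) -> Matrix:
    """Add the low-rank residual to a (flattened) kernel matrix."""
    residual = low_rank_residual(left_tokens, right_tokens)
    return [[k + d for k, d in zip(k_row, d_row)]
            for k_row, d_row in zip(kernel, residual)]


# Hypothetical toy case: a 2 x 3 kernel refined with r = 1 token per set.
kernel = [[1.0, 1.0, 1.0],
          [1.0, 1.0, 1.0]]
print(refine_kernel(kernel, [[1.0, 2.0]], [[1.0, 0.0, 1.0]]))
```

With r tokens per set, the residual needs r * (rows + cols) values instead of rows * cols, which is how the token count controls the complexity of the refinement in this sketch.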
ISBN (print): 9789819785018; 9789819785025
Recent successes of game AIs such as AlphaGo and AlphaStar, which beat professional human players in the games Go and StarCraft, respectively, mark breakthroughs in intelligent decision-making techniques for complex games. Games studied previously are mostly symmetric in the game-theoretic sense due to their sports or e-sports characteristics. However, games in reality are usually asymmetric because of position-dependent resource imbalance, and they are rarely studied. In this paper, we propose a novel asymmetric game model based on the framework of game-theoretic learning. Specifically, we develop an agent training method with three steps: game model formulation, solution concept definition, and game solution computation. To verify our model, a mini-Wargame is used in our experiment, where the initial number and visual scope are set to be unbalanced. Experiments show that the proposed method is better than popular self-play based methods such as naive self-play and prioritized fictitious self-play. This work provides a game-theoretic view of asymmetric games, and it may attract more interest in the rarely studied asymmetric games.
ISBN (print): 9789819784950; 9789819784967
The ability to translate medical images across different modalities is crucial for synthesizing missing data and aiding in clinical diagnosis. However, existing learning-based techniques have limitations when it comes to capturing cross-modal and global features. These techniques are often tailored to specific pairs of modalities, limiting their practical utility, especially considering the variability of missing modalities in different cases. In this study, we introduce MedPrompt, a multi-task framework designed to efficiently translate diverse modalities. Our framework incorporates the Self-adaptive Prompt Block, which dynamically guides the translation network to handle different modalities effectively. To encode the cross-modal prompt efficiently, we introduce the Prompt Extraction Block and the Prompt Fusion Block. Additionally, we leverage the Transformer model to enhance the extraction of global features across various modalities. Through extensive experimentation involving five datasets and four pairs of modalities, we demonstrate that our proposed model achieves state-of-the-art visual quality and exhibits excellent generalization capability. The results highlight the effectiveness and versatility of MedPrompt in addressing the challenges associated with cross-modal medical image translation.
ISBN (print): 9789819787913; 9789819787920
Predicting the future trajectories of agents in complex traffic scenes is one of the key issues in autonomous driving, requiring reliable and effective predictions for all agents in the scene. Existing trajectory prediction models have achieved high performance on public datasets, but deploying models on vehicles requires both high accuracy and fast computation. It is therefore necessary to balance the complexity of computation and the effectiveness of the structure when designing a model. To address this problem, we propose a lightweight trajectory prediction model, HHATP. Our method is scene-centric: all elements are located in the same coordinate system. We use different encoders for the heterogeneous scene objects, and the encoded results are then fed into a hierarchical attention module, which considers both global and local interaction to model the relationships between elements. Subsequently, a dynamic weight decoder is used to obtain the trajectories of all agents. Our method achieves good accuracy on the Argoverse dataset and enables fast inference.
ISBN (print): 9789819788576; 9789819788583
Camouflaged object detection focuses on segmenting objects concealed within their surroundings. This technology can be applied in various fields such as medical image analysis, wildlife conservation, and autonomous driving. Existing semi-supervised camouflaged object detection methods often suffer from poor network performance due to the accumulation of incorrect pseudo labels, and they fail to fully utilize multi-scale features or account for the diverse scale contexts necessary for camouflaged objects of various sizes. In this paper, we propose an innovative semi-supervised learning strategy. We employ a dual-branch network named CAMNet, utilizing saliency maps corresponding to camouflaged objects to aid detection. We also introduce a Multi-Information Fusion Feature Perception module (MIF) and an Adaptive Receptive Field Selection module (ARFS), which are integrated into the network. Finally, we perform thorough comparative experiments on the R2C7K, COD-Water, and COD-Jungle datasets, showing superior performance compared with current state-of-the-art methods. We also conduct ablation experiments, further confirming the effectiveness of the proposed modules.
ISBN (print): 9789819784868; 9789819784875
Domain adaptation (DA) aims to transfer knowledge from labeled source domains to unlabeled target domains, addressing the challenge of model generalization when there is a distribution mismatch between training and testing data. While many Vision Transformer (ViT)-based methods have been developed for DA, they focus primarily on improving accuracy, with less emphasis on accelerating inference on unlabeled target domains. In this paper, we propose a novel method named Cascaded Adaptive Vision Transformer (CAViT), which dynamically adjusts token counts for each input image by cascading multiple transformers with increasing numbers of tokens. During testing, "easier" images exit early, while "harder" images are processed further until confident predictions are achieved. We further enhance domain adversarial learning by incorporating a token-level domain discriminator in the attention layer, which assigns distinct weights to different patch tokens. This enables the network to learn features with cross-domain transferability and discriminative capabilities, achieving effective feature alignment. Experimental results on three benchmark datasets demonstrate that our method not only improves accuracy but also significantly reduces computational costs.
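The early-exit behaviour described in the abstract above can be illustrated with a small control-flow sketch. It assumes that each cascade stage returns class probabilities and that confidence is measured as the maximum probability compared against a fixed threshold. The abstract does not specify the exit criterion, so the threshold rule, the `cascade_predict` name, and the stand-in stage functions are all hypothetical.

```python
from typing import Callable, List, Sequence, Tuple

# A stage maps an input image to class probabilities. In the setting described
# above, later stages would use more tokens; here they are plain callables.
Stage = Callable[[object], List[float]]


def cascade_predict(image: object,
                    stages: Sequence[Stage],
                    threshold: float = 0.9) -> Tuple[int, int]:
    """Run stages in order and stop at the first confident prediction.

    Returns (predicted_class, index_of_stage_used). The final stage always
    answers, even if its confidence is below the threshold.
    """
    if not stages:
        raise ValueError("at least one stage is required")
    last = len(stages) - 1
    for idx, stage in enumerate(stages):
        probs = stage(image)
        confidence = max(probs)
        if confidence >= threshold or idx == last:
            return probs.index(confidence), idx
    raise RuntimeError("unreachable: the last stage always returns")


# Hypothetical stand-ins for a cheap (few-token) and a costly (many-token) model.
def cheap_stage(_image: object) -> List[float]:
    return [0.6, 0.4]


def costly_stage(_image: object) -> List[float]:
    return [0.05, 0.95]


print(cascade_predict(None, [cheap_stage, costly_stage], threshold=0.9))
```

In this sketch, an image that the cheap stage already classifies confidently never reaches the costly stage, which is where the inference savings would come from.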