ISBN: 9781713899921 (print)
Vision-Language (VL) pre-trained models have shown their superiority on many multimodal tasks. However, the adversarial robustness of such models has not been fully explored. Existing approaches mainly focus on exploring the adversarial robustness under the white-box setting, which is unrealistic. In this paper, we aim to investigate a new yet practical task to craft image and text perturbations using pre-trained VL models to attack black-box fine-tuned models on different downstream tasks. To this end, we propose VLATTACK to generate adversarial samples by fusing perturbations of images and texts from both single-modal and multimodal levels. At the single-modal level, we propose a new blockwise similarity attack (BSA) strategy to learn image perturbations for disrupting universal representations. In addition, we adopt an existing text attack strategy to generate text perturbations independent of the image-modal attack. At the multimodal level, we design a novel iterative cross-search attack (ICSA) method to update adversarial image-text pairs periodically, starting with the outputs from the single-modal level. We conduct extensive experiments to attack five widely-used VL pre-trained models for six tasks. Experimental results show that VLATTACK achieves the highest attack success rates on all tasks compared with state-of-the-art baselines, which reveals a blind spot in the deployment of pre-trained VL models.
ISBN: 9798350379860; 9798350379877 (print)
Potholes can be a nuisance to the vehicle and can affect the decision-making of the intelligent driving system. However, most autonomous driving algorithms are currently trained using datasets collected from normal road conditions, as datasets containing pothole roads are scarce. This limitation reduces the robustness and sensitivity of autonomous driving algorithms in recognizing pothole pavement. To address this limitation, we propose a Pseudo-Samples Generation strategy based on Improved Cycle Generative Adversarial Network (PSG-ICGAN) to enhance the model's accuracy in recognizing pothole pavement. In PSG-ICGAN, we enhance the adversarial sample generation algorithm of CycleGAN to produce pseudo-samples with highly similar semantic information to the original images, yet indistinguishable by the classification model. Adding these challenging pseudo-samples to the training dataset significantly enhances the model's robustness in recognizing pothole road images. In addition, we introduce a channel attention mechanism into the pothole classification model, which helps the model to capture subtle pothole features. Experiments show that our model achieves superior performance on the pothole road dataset.
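The abstract above does not say which channel attention design is used. As a rough illustration only, the sketch below assumes a squeeze-and-excitation style block: each channel is averaged to a scalar, passed through a small bottleneck, and the resulting per-channel weights rescale the feature map. The function name `channel_attention`, the weight matrices, and all sizes are hypothetical and are not taken from the paper.

```python
import math
import random


def channel_attention(features, w_reduce, w_expand):
    """Reweight the channels of `features` (C channels, each an H x W list of lists).

    Assumed squeeze-and-excitation style design, not the paper's confirmed method.
    """
    # Squeeze: global average pooling gives one scalar per channel.
    squeezed = [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in features]
    # Excite, step 1: bottleneck projection followed by ReLU.
    hidden = [max(0.0, sum(w * s for w, s in zip(row, squeezed))) for row in w_reduce]
    # Excite, step 2: expand back to C values and squash each into (0, 1) with a sigmoid.
    weights = [
        1.0 / (1.0 + math.exp(-sum(w * h for w, h in zip(row, hidden))))
        for row in w_expand
    ]
    # Scale: multiply every pixel in a channel by that channel's weight.
    scaled = [[[v * a for v in row] for row in ch] for ch, a in zip(features, weights)]
    return scaled, weights


random.seed(0)
C, H, W, R = 4, 3, 3, 2  # channels, height, width, reduction ratio (all hypothetical)
features = [[[random.gauss(0, 1) for _ in range(W)] for _ in range(H)] for _ in range(C)]
w_reduce = [[random.gauss(0, 0.1) for _ in range(C)] for _ in range(C // R)]
w_expand = [[random.gauss(0, 0.1) for _ in range(C // R)] for _ in range(C)]
scaled, weights = channel_attention(features, w_reduce, w_expand)
```

In a trained classifier the two weight matrices would be learned, so channels that respond to subtle pothole cues could receive larger weights; here they are random and serve only to show the data flow.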
Smart painting software and cross-border fusion based on digital image processing algorithms are studied in this paper. Early non-deep methods use shallow network model architectures, which are difficult to effectively...
This paper focuses on the detection and identification of defects on the end faces of small motor bearings. Bearing defects significantly impact the performance and lifespan of motors, and traditional manual inspectio...
Extraction of image edges is the basis of image processing algorithms, so image edge extraction is of great significance. Nowadays, video image data greatly increases the amount of image data and the process...
The problem of insulator defects in transmission lines caused by adverse environmental conditions poses a significant risk to the safe and stable operation of the power system. In order to improve upon traditional met...
Computer vision and natural language processing researchers have dedicated significant time and energy to the problem of automatically creating image descriptions. In this work, we propose an artificially intelligent ...
Long-range active detection is in wide demand across various fields. Currently, it is still difficult to obtain high-resolution images at long range while ensuring miniaturization of the detection system, because the res...
Creating realistic images from textual descriptions is a major challenge in computer vision and natural language processing. Traditional methods often fail to accurately translate text into visual representations, res...
Most supervised super-resolution (SR) algorithms require paired high-resolution (HR) and low-resolution (LR) images as training samples. However, the network structures trained by supervised algorithms don't adapt...