The creation, expression, and application of dynamic sculpture as public art in public environmental space combines traditional dynamic sculpture art with contemporary public art from a specific angle, and then tries to sort ...
Human Activity Recognition (HAR) has emerged as a promising research topic for smart healthcare owing to the rapid growth of wearable and smart devices in recent years. The significant applications of HAR in ambient as...
In the realm of image processing and pattern recognition, writing with gestures in the air has grown in popularity and difficulty over the past few years. It contributes significantly to the creation of automation pro...
ISBN (digital): 9798350353006
ISBN (print): 9798350353013
Referring Remote Sensing Image Segmentation (RRSIS) is a new challenge that combines computer vision and natural language processing. Traditional Referring Image Segmentation (RIS) approaches have been impeded by the complex spatial scales and orientations found in aerial imagery, leading to suboptimal segmentation results. To address these challenges, we introduce the Rotated Multi-Scale Interaction Network (RMSIN), an innovative approach designed for the unique demands of RRSIS. RMSIN incorporates an Intra-scale Interaction Module (IIM) to effectively address the fine-grained detail required at multiple scales and a Cross-scale Interaction Module (CIM) for integrating these details coherently across the network. Furthermore, RMSIN employs an Adaptive Rotated Convolution (ARC) to account for the diverse orientations of objects, a novel contribution that significantly enhances segmentation accuracy. To assess the efficacy of RMSIN, we have curated an expansive dataset comprising 17,402 image-caption-mask triplets, which is unparalleled in terms of scale and variety. This dataset not only presents the model with a wide range of spatial and rotational scenarios but also establishes a stringent benchmark for the RRSIS task, ensuring a rigorous evaluation of performance. Experimental evaluations demonstrate the exceptional performance of RMSIN, surpassing existing state-of-the-art models by a significant margin. Datasets and code are available at https://***/Lsan2401/RMSIN.
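To make the rotated-convolution idea concrete, the sketch below shows one generic way to implement a rotation-adaptive convolution in PyTorch: a small head predicts a rotation angle from the input, the kernel sampling grid is rotated by that angle, and a standard convolution is applied with the resampled weights. This is an illustration under our own simplifications (one angle per image, a per-sample loop), not the RMSIN implementation; all names here are ours.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdaptiveRotatedConv2d(nn.Module):
    """Generic sketch of a rotation-adaptive convolution (not RMSIN's ARC):
    predict an angle per input, rotate the kernel, then convolve."""

    def __init__(self, in_ch, out_ch, k=3):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_ch, in_ch, k, k) * 0.02)
        # Tiny routing head: pooled features -> one rotation angle per image.
        self.angle_head = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(in_ch, 1)
        )
        self.k = k

    def forward(self, x):
        theta = self.angle_head(x).squeeze(-1)  # (B,) predicted angles
        outs = []
        for b in range(x.size(0)):  # per-sample kernel rotation
            cos, sin = theta[b].cos(), theta[b].sin()
            zero = torch.zeros_like(cos)
            # 2x3 affine matrix that rotates the kernel sampling grid.
            rot = torch.stack([torch.stack([cos, -sin, zero]),
                               torch.stack([sin, cos, zero])])
            rot = rot.unsqueeze(0).expand(self.weight.size(0), -1, -1)
            grid = F.affine_grid(rot, list(self.weight.size()),
                                 align_corners=False)
            w_rot = F.grid_sample(self.weight, grid, align_corners=False)
            outs.append(F.conv2d(x[b:b + 1], w_rot, padding=self.k // 2))
        return torch.cat(outs, dim=0)
```

Rotating the kernel rather than the feature map lets the same filter respond to oriented structures without enumerating a fixed set of discrete rotations.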
ISBN (print): 9781665445092
We aim at accelerating super-resolution (SR) networks on large images (2K-8K). The large images are usually decomposed into small sub-images in practical usage. Based on this processing, we found that different image regions have different restoration difficulties and can be processed by networks with different capacities. Intuitively, smooth areas are easier to super-resolve than complex textures. To utilize this property, we can adopt appropriate SR networks to process different sub-images after the decomposition. On this basis, we propose a new solution pipeline - ClassSR - that combines classification and SR in a unified framework. In particular, it first uses a Class-Module to classify the sub-images into different classes according to restoration difficulty, then applies an SR-Module to perform SR for the different classes. The Class-Module is a conventional classification network, while the SR-Module is a network container that consists of the to-be-accelerated SR network and its simplified versions. We further introduce a new classification method with two losses - Class-Loss and Average-Loss - to produce the classification results. After joint training, a majority of sub-images will pass through smaller networks, so the computational cost can be significantly reduced. Experiments show that our ClassSR can help most existing methods (e.g., FSRCNN, CARN, SRResNet, RCAN) save up to 50% FLOPs on the DIV8K dataset. This general framework can also be applied to other low-level vision tasks.
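To illustrate the routing pipeline, here is a minimal PyTorch sketch: a classifier scores each sub-image's restoration difficulty and dispatches it to one of three SR branches of increasing width. The branches are toy stand-ins (not FSRCNN/CARN/SRResNet/RCAN), the Class-Loss/Average-Loss training is omitted, and every name below is our own.

```python
import torch
import torch.nn as nn

class ClassSRSketch(nn.Module):
    """Toy version of the ClassSR pipeline: classify, then route each
    sub-image to an SR branch sized for its difficulty class."""

    def __init__(self, num_classes=3, scale=4):
        super().__init__()
        # Class-Module stand-in: a small conventional classifier.
        self.classifier = nn.Sequential(
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(16, num_classes),
        )
        # SR-Module stand-in: one branch per class, width grows with difficulty.
        self.branches = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(3, w, 3, padding=1), nn.ReLU(),
                nn.Conv2d(w, 3 * scale * scale, 3, padding=1),
                nn.PixelShuffle(scale),  # upsample by rearranging channels
            )
            for w in (8, 16, 32)
        ])

    def forward(self, sub_images):
        # Hard routing at inference: easy patches take the cheap branch.
        labels = self.classifier(sub_images).argmax(dim=1)
        outs = [self.branches[int(c)](img.unsqueeze(0))
                for img, c in zip(sub_images, labels)]
        return torch.cat(outs, dim=0), labels
```

Because most sub-images in smooth regions get routed to the narrow branch, the average cost per image drops even though the hardest patches still see the full-capacity network.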
Vision Transformers have proven their versatility and utility for complex computer vision tasks, such as land cover segmentation in remote sensing applications. While performing on par with or even outperforming other methods like Convolutional Neural Networks (CNNs), Transformers tend to require even larger datasets with fine-grained annotations (e.g., pixel-level labels for land cover segmentation). To overcome this limitation, we propose a weakly-supervised vision Transformer that leverages image-level labels to learn a semantic segmentation task and reduce the human annotation load. We achieve this by slightly modifying the architecture of the vision Transformer through the use of gating units in each attention head to enforce sparsity during training and thereby retain only the most meaningful heads. This allows us to directly infer pixel-level labels from image-level labels by post-processing the un-pruned attention heads of the model and refining our predictions by iteratively training a segmentation model with high fidelity. Training and evaluation on the DFC2020 dataset show that our method not only generates high-quality segmentation masks using image-level labels, but also performs on par with fully-supervised training relying on pixel-level labels. Finally, our results show that our method is able to perform weakly-supervised semantic segmentation even on small-scale datasets.
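A minimal sketch of the per-head gating idea, assuming a standard ViT attention block: each head's output is scaled by a learnable gate before the heads are mixed, and an L1 penalty on the gates pushes uninformative heads toward zero so they can be pruned. The gate placement and penalty form are our assumptions, not the paper's code.

```python
import math
import torch
import torch.nn as nn

class GatedAttention(nn.Module):
    """ViT-style multi-head self-attention with one learnable gate per
    head; sparse gates identify heads that can be pruned after training."""

    def __init__(self, dim, num_heads):
        super().__init__()
        assert dim % num_heads == 0
        self.h, self.d = num_heads, dim // num_heads
        self.qkv = nn.Linear(dim, 3 * dim)
        self.proj = nn.Linear(dim, dim)
        self.gates = nn.Parameter(torch.ones(num_heads))  # one gate per head

    def forward(self, x):  # x: (B, N, dim) token embeddings
        B, N, _ = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q, k, v = (t.view(B, N, self.h, self.d).transpose(1, 2)
                   for t in (q, k, v))
        attn = (q @ k.transpose(-2, -1)) / math.sqrt(self.d)
        out = attn.softmax(dim=-1) @ v            # (B, h, N, d)
        out = out * self.gates.view(1, -1, 1, 1)  # gate each head's output
        return self.proj(out.transpose(1, 2).reshape(B, N, -1))

    def sparsity_loss(self):
        # L1 term added to the training objective; drives gates toward zero.
        return self.gates.abs().sum()
```

Gating before the output projection keeps each head's contribution separable, which is what makes the surviving heads' attention maps usable for inferring pixel-level labels afterwards.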
Since the low resolution of infrared focal plane arrays may degrade the performance of polarization imaging significantly, it is necessary to study the super-resolution reconstruction method for superior image resolut...
Microstate analysis of EEG signals makes full use of the spatial information of the brain topographic map and reflects the active association of different brain regions. Different from the traditional EEG feature...
ISBN (digital): 9798350365474
ISBN (print): 9798350365481
Geospatial Copilots unlock unprecedented potential for performing Earth Observation (EO) applications through natural language instructions. However, existing agents rely on overly simplified single tasks and template-based prompts, creating a disconnect with real-world scenarios. In this work, we present GeoLLM-Engine, an environment for tool-augmented agents with intricate tasks routinely executed by analysts on remote sensing platforms. We enrich our environment with geospatial API tools, dynamic maps/UIs, and external multimodal knowledge bases to properly gauge an agent’s proficiency in interpreting realistic high-level natural language commands and its functional correctness in task completions. By alleviating overheads typically associated with human-in-the-loop benchmark curation, we harness our massively parallel engine across 100 GPT-4-Turbo nodes, scaling to over half a million diverse multi-tool tasks across 1.1 million satellite images. By moving beyond traditional single-task image-caption paradigms, we investigate state-of-the-art agents and prompting techniques against long-horizon prompts.
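To make "tool-augmented" concrete, the sketch below shows a bare-bones tool-dispatch loop in Python over hypothetical geospatial tools. The tool names, arguments, and canned outputs are invented for illustration; a real GeoLLM-Engine episode would interleave LLM calls that choose each next tool from the observations so far.

```python
from typing import Callable

# Hypothetical tool registry standing in for geospatial API tools;
# both the tools and their outputs are invented for this sketch.
TOOLS: dict[str, Callable[..., str]] = {
    "search_imagery": lambda bbox, date: f"found 12 scenes in {bbox} on {date}",
    "detect_objects": lambda scene_id, cls: f"detected 3 '{cls}' in {scene_id}",
}

def run_plan(plan: list[dict]) -> list[str]:
    """Execute a multi-step tool plan (as an agent would emit it) and
    collect the observations fed back to the model at each step."""
    observations = []
    for step in plan:
        tool = TOOLS[step["tool"]]  # look up the requested tool
        observations.append(tool(**step["args"]))
    return observations

# A toy long-horizon task: locate imagery, then count objects in a scene.
print(run_plan([
    {"tool": "search_imagery",
     "args": {"bbox": "36.7,-119.8,36.9,-119.6", "date": "2023-06-01"}},
    {"tool": "detect_objects",
     "args": {"scene_id": "scene_001", "cls": "aircraft"}},
]))
```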
ISBN (digital): 9798350365474
ISBN (print): 9798350365481
Vision Transformers have demonstrated outstanding performance in computer vision tasks. Nevertheless, this superior performance for large models comes at the expense of increasing memory usage for storing the parameters and intermediate activations. To accelerate model inference, in this work we develop and evaluate integer and mixed-precision kernels in Triton for the efficient execution of two fundamental building blocks of transformers – the linear layer and attention – on graphics processing units (GPUs). On an NVIDIA A100 GPU, our kernel implementations of Vision Transformers achieve a throughput speedup of up to 7x compared with reference kernels in PyTorch floating-point single precision (FP32). Additionally, the top-1 accuracy of the ViT Large model drops by less than one percent on the ImageNet-1K classification task. We also observe up to 6x increased throughput by applying our kernels to the Segment Anything Model image encoder while keeping the mIoU close to the FP32 reference on the COCO2017 dataset for static and dynamic quantization. Furthermore, our kernels demonstrate improved speed compared to the TensorRT INT8 linear layer, and we improve the throughput of the base FP16 (half-precision) Triton attention on average by up to 19 ± 4.01%. We have open-sourced the QAttn framework, which is tightly integrated with the PyTorch quantization workflow: https://***/IBM/qattn.
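As a flavor of what an integer Triton kernel looks like, here is a minimal per-tensor INT8 GEMM sketch: tiles of A and B are loaded as int8, accumulated in int32 with tl.dot, and dequantized once at the end with the two tensor scales. This is a generic illustration, not one of the QAttn kernels; the block sizes and per-tensor quantization scheme are our assumptions.

```python
import torch
import triton
import triton.language as tl

@triton.jit
def int8_gemm_kernel(a_ptr, b_ptr, c_ptr, M, N, K, scale_a, scale_b,
                     BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr,
                     BLOCK_K: tl.constexpr):
    # Each program instance computes one BLOCK_M x BLOCK_N tile of C.
    pid_m = tl.program_id(0)
    pid_n = tl.program_id(1)
    rm = pid_m * BLOCK_M + tl.arange(0, BLOCK_M)
    rn = pid_n * BLOCK_N + tl.arange(0, BLOCK_N)
    acc = tl.zeros((BLOCK_M, BLOCK_N), dtype=tl.int32)  # int32 accumulator
    for k in range(0, K, BLOCK_K):
        rk = k + tl.arange(0, BLOCK_K)
        a = tl.load(a_ptr + rm[:, None] * K + rk[None, :],
                    mask=(rm[:, None] < M) & (rk[None, :] < K), other=0)
        b = tl.load(b_ptr + rk[:, None] * N + rn[None, :],
                    mask=(rk[:, None] < K) & (rn[None, :] < N), other=0)
        acc = tl.dot(a, b, acc)  # int8 x int8 accumulated into int32
    # Dequantize once with the two per-tensor scales and store as FP32.
    c = acc.to(tl.float32) * scale_a * scale_b
    tl.store(c_ptr + rm[:, None] * N + rn[None, :], c,
             mask=(rm[:, None] < M) & (rn[None, :] < N))

def int8_gemm(a_int8, b_int8, scale_a, scale_b):
    """Launch the kernel over a 2D grid of output tiles."""
    M, K = a_int8.shape
    _, N = b_int8.shape
    c = torch.empty((M, N), device=a_int8.device, dtype=torch.float32)
    grid = (triton.cdiv(M, 64), triton.cdiv(N, 64))
    int8_gemm_kernel[grid](a_int8, b_int8, c, M, N, K, scale_a, scale_b,
                           BLOCK_M=64, BLOCK_N=64, BLOCK_K=32)
    return c
```

Keeping the accumulation in int32 and deferring the multiply by the scales to a single epilogue step is what lets integer tensor cores do the bulk of the work while preserving accuracy.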