Vision transformers have shown excellent performance in computer vision tasks. As the computational cost of their self-attention mechanism is high, recent works have tried to replace the self-attention mechanism in visi...
To address the problem of unclear low-voltage topology caused by user information changes, faulty-meter replacement, station-area upgrades, and other factors in current low-voltage distribution networks, this paper designs a ...
ISBN (print): 9781665486415
Vision technology is developing vigorously, but underwater vision still faces many challenges and problems, making it a field with application value and development prospects. Visual information obtained underwater is often color-biased and blurred; it is highly susceptible to the influence of the surrounding environment, and its stability cannot be guaranteed. Current research on and applications of vision technologies in the underwater setting therefore provide ways to solve these problems, spanning image processing, target detection and recognition, and localization and tracking. The purpose of this review is to summarize the development of underwater vision technology and the results achieved so far. Mainstream underwater vision technologies are classified according to the theories or algorithms they use, and recent research progress in each field is introduced in detail. By summarizing and analyzing these results, the applications of each key technology of underwater vision are sorted out, and further development directions are anticipated.
ISBN (print): 9798350360882; 9798350360899
In recent advancements within the domain of Large Language Models (LLMs), there has been a notable emergence of agents capable of addressing Robotic Process Automation (RPA) challenges through enhanced cognitive capabilities and sophisticated reasoning. This development heralds a new era of scalability and human-like adaptability in goal attainment. In this context, we introduce AUTONODE (Autonomous User-interface Transformation through Online Neuro-graphic Operations and Deep Exploration). AUTONODE employs advanced neuro-graphical techniques to facilitate autonomous navigation and task execution on web interfaces, thereby obviating the necessity for predefined scripts or manual intervention. Our engine empowers agents to comprehend and implement complex workflows, adapting to dynamic web environments with unparalleled efficiency. Our methodology synergizes cognitive functionalities with robotic automation, endowing AUTONODE with the ability to learn from experience. We have integrated an exploratory module, DoRA (Discovery and mapping Operation for graph Retrieval Agent), which is instrumental in constructing a knowledge graph that the engine utilizes to optimize its actions and achieve objectives with minimal supervision. The versatility and efficacy of AUTONODE are demonstrated through a series of experiments, highlighting its proficiency in managing a diverse array of web-based tasks, ranging from data extraction to transaction processing. The implementation of our paper can be accessed at: https://***/TransformerOptimus/AutoNode
Large-scale text-to-image diffusion models have demonstrated impressive capabilities for downstream tasks by leveraging strong vision-language alignment from generative pre-training. Recently, a number of works have e...
ISBN (print): 9798350302615
Audio-visual speech enhancement (SE) is the task of reducing the acoustic background noise in a degraded speech signal using both acoustic and visual information. In this work, we study how to incorporate visual information to enhance a speech signal using acoustic beamformers in hearing aids (HAs). Specifically, we first train a deep learning model to estimate a time-frequency mask from audio-visual data. We then apply this mask to estimate the inter-microphone power spectral densities (PSDs) of the clean speech and noise signals. Finally, we use the estimated PSDs to build acoustic beamformers. Assuming that the HA user wears an add-on device comprising a camera pointing at the target speaker, we show that our method can benefit HA systems, especially at low signal-to-noise ratios (SNRs).
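The mask-then-beamform pipeline this abstract describes can be sketched in a few lines of numpy. The mask below is random, standing in for the output of the audio-visual network, and the steering vector is taken as the principal eigenvector of the estimated speech PSD with MVDR weights; these are common but assumed choices, a toy illustration rather than the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
M, F, T = 4, 8, 50  # mics, frequency bins, frames (toy sizes)

# Toy multichannel STFT and a random time-frequency mask;
# in the paper the mask comes from an audio-visual deep network.
Y = rng.standard_normal((M, F, T)) + 1j * rng.standard_normal((M, F, T))
mask = rng.uniform(0.0, 1.0, (F, T))  # ~1 = speech-dominated bin

def masked_psd(Y, m):
    """Mask-weighted spatial PSD matrix per frequency: (F, M, M)."""
    # sum_t m(f,t) y(f,t) y(f,t)^H / sum_t m(f,t)
    num = np.einsum('ft,mft,nft->fmn', m, Y, Y.conj())
    return num / m.sum(axis=1)[:, None, None]

Phi_s = masked_psd(Y, mask)          # clean-speech PSD estimate
Phi_n = masked_psd(Y, 1.0 - mask)    # noise PSD estimate

# MVDR weights w(f) = Phi_n^{-1} a / (a^H Phi_n^{-1} a), with the
# steering vector a estimated as the dominant eigenvector of Phi_s.
W = np.empty((F, M), dtype=complex)
for f in range(F):
    _, vecs = np.linalg.eigh(Phi_s[f])
    a = vecs[:, -1]                          # dominant eigenvector
    Phi_n_inv_a = np.linalg.solve(Phi_n[f], a)
    W[f] = Phi_n_inv_a / (a.conj() @ Phi_n_inv_a)

S_hat = np.einsum('fm,mft->ft', W.conj(), Y)  # beamformed output
print(S_hat.shape)
```

With real data, `Y` would be the hearing-aid microphone STFTs and `mask` the network's per-bin speech presence estimate; everything downstream of the mask is closed-form.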
Precise detection of intruding unmanned aerial vehicles (UAVs) over long distances is of crucial importance for guaranteeing low-altitude security. Although many deep learning-based vision detectors have been developed, the...
ISBN (print): 9781728198354
Self-supervised learning with the Vision Transformer (ViT) has gained much attention recently. Most existing methods rely on either contrastive learning or masked image modeling. The former is suitable for global feature extraction but underperforms in fine-grained tasks; the latter explores the internal structure of images but ignores their high information sparsity and unbalanced information distribution. In this paper, we propose a new approach called Attention-guided Contrastive Masked Image Modeling (ACoMIM), which integrates the merits of both paradigms and leverages the attention mechanism of ViT for effective representation learning. Specifically, it has two pretext tasks: predicting the features of masked regions guided by attention, and comparing the global features of masked and unmasked images. We show that these two pretext tasks complement each other and improve our method's performance. Experiments demonstrate that our model transfers well to various downstream tasks such as classification and object detection. Code is available at https://***/yczhan/ACoMIM.
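How the two pretext tasks could combine can be shown with a minimal numpy sketch: a shared linear map stands in for the ViT encoder, random scores stand in for its attention, and the two losses are a masked-feature MSE plus a cosine-based contrastive term. All shapes, the zero-masking scheme, and the loss forms are assumptions, not the paper's exact formulation.

```python
import numpy as np

rng = np.random.default_rng(1)
N, D = 16, 32  # patches, feature dim (toy)

def encoder(x, W):
    """Stand-in for a ViT: a shared linear map per patch."""
    return x @ W

x = rng.standard_normal((N, D))
W = rng.standard_normal((D, D)) / np.sqrt(D)

# Attention stand-in: mask the highest-"attention" patches.
attn = rng.uniform(size=N)
masked = attn.argsort()[-N // 2:]          # indices to mask
x_masked = x.copy()
x_masked[masked] = 0.0                     # zero out masked patches

z_full = encoder(x, W)
z_mask = encoder(x_masked, W)

# Task 1: predict features of masked regions (MSE against the
# unmasked view's features, as in masked feature modeling).
loss_mim = np.mean((z_mask[masked] - z_full[masked]) ** 2)

# Task 2: contrast global (mean-pooled) features of the two views.
g1, g2 = z_full.mean(0), z_mask.mean(0)
cos = g1 @ g2 / (np.linalg.norm(g1) * np.linalg.norm(g2))
loss_con = 1.0 - cos                       # pull the two views together

loss = loss_mim + loss_con
print(float(loss))
```

In a real setup, both terms would backpropagate into the encoder so that global (contrastive) and local (masked-prediction) signals complement each other, which is the combination the abstract argues for.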
ISBN (print): 9781728198354
Extracting and aggregating feature representations from multiple scales has become key to point cloud classification tasks. The Vision Transformer (ViT) is a representative solution along this line, but it lacks the capability to model detailed multi-scale features and their interactions. In addition, learning efficient and effective representations from point clouds is challenging due to their irregular, unordered, and sparse nature. To tackle these problems, we propose a novel multi-scale representation learning transformer framework that employs various geometric features beyond common Cartesian coordinates. Our approach enriches the description of a point cloud with local geometric relationships and groups them at multiple scales. This scale information is aggregated, and new patches are then extracted to minimize feature overlap. A bottleneck projection head is adopted to enhance the information, and all patches are fed to multi-head attention to capture deep dependencies among representations across patches. Evaluation on public benchmark datasets shows the competitive performance of our framework on point cloud classification.
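The idea of grouping local geometric features at multiple scales can be illustrated with a small numpy sketch. The kNN grouping, the offset-plus-distance features, and max-pool aggregation below are assumed stand-ins for the paper's richer geometric descriptors, chosen only to make the multi-scale pattern concrete.

```python
import numpy as np

rng = np.random.default_rng(2)
P = rng.standard_normal((128, 3))  # toy point cloud, 128 points in 3D

def knn(points, k):
    """Indices of the k nearest neighbours of every point."""
    d2 = ((points[:, None, :] - points[None, :, :]) ** 2).sum(-1)
    return d2.argsort(axis=1)[:, 1:k + 1]  # drop the point itself

def local_geometry(points, k):
    """Per-point features beyond raw Cartesian coordinates:
    relative offsets and Euclidean distances to k neighbours,
    max-pooled over the local group."""
    idx = knn(points, k)
    rel = points[idx] - points[:, None, :]        # (P, k, 3)
    dist = np.linalg.norm(rel, axis=-1, keepdims=True)
    feat = np.concatenate([rel, dist], axis=-1)   # (P, k, 4)
    return feat.max(axis=1)                       # (P, 4)

# Two scales (k = 8 and k = 16), aggregated by concatenation;
# downstream these per-point features would be grouped into patches
# and fed to multi-head attention.
multi_scale = np.concatenate(
    [local_geometry(P, 8), local_geometry(P, 16)], axis=-1)
print(multi_scale.shape)
```

Each point thus carries features from both a tight and a wide neighbourhood, which is the kind of multi-scale interaction the abstract says plain ViT patching misses.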
ISBN (print): 9781728198354
This research presents a new approach to blind single-image transparency separation, a significant challenge in image processing. The proposed framework divides the task into two parallel processes: feature separation and image reconstruction. The feature separation task leverages two deep image prior (DIP) networks to recover the two distinct layers, with an exclusion loss and a deep feature separation loss used to decompose the features. For the image reconstruction task, we minimize the difference between the mixed image and the re-mixed image while also incorporating a regularizer to impose natural priors on each layer. Our results indicate that our method performs comparably to or outperforms state-of-the-art approaches on various image datasets.
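The two-term objective described here, re-mixing reconstruction plus an exclusion penalty on edges shared by both layers, can be sketched as follows. The gradient-product form of the exclusion loss and the 0.1 weighting are assumptions (one common formulation, not necessarily this paper's), and random arrays stand in for the two DIP network outputs.

```python
import numpy as np

rng = np.random.default_rng(3)
H, Wd = 16, 16
mixed = rng.uniform(size=(H, Wd))    # observed mixture
layer1 = rng.uniform(size=(H, Wd))   # stand-ins for the two
layer2 = rng.uniform(size=(H, Wd))   # DIP network outputs

def grad(img):
    """Forward-difference image gradients (gx, gy)."""
    return np.diff(img, axis=1), np.diff(img, axis=0)

# Reconstruction: the re-mixed image should match the observation.
loss_rec = np.mean((layer1 + layer2 - mixed) ** 2)

# Exclusion: penalise edges that appear in both layers at once
# (product of gradient magnitudes at each pixel).
g1x, g1y = grad(layer1)
g2x, g2y = grad(layer2)
loss_excl = np.mean(np.abs(g1x * g2x)) + np.mean(np.abs(g1y * g2y))

loss = loss_rec + 0.1 * loss_excl    # 0.1: assumed weighting
print(float(loss))
```

In the actual framework each layer would be the output of a DIP network and this scalar would be backpropagated through both networks jointly, alongside the deep feature separation loss and natural-image regularizers.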