MeanShift is a popular mode-seeking clustering algorithm used in a wide range of applications in machine learning. However, it is known to be prohibitively slow, with quadratic runtime per iteration. We propose MeanSh...
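The quadratic per-iteration cost arises because each point's update averages over every other point. A minimal NumPy sketch of the naive update (illustrative only; the accelerated method this abstract proposes is truncated above and not reproduced here):

```python
import numpy as np

def mean_shift_step(points, bandwidth):
    """One naive MeanShift iteration: each point moves to the
    Gaussian-weighted mean of all points -- O(n^2) per iteration."""
    diffs = points[:, None, :] - points[None, :, :]          # (n, n, d) pairwise offsets
    sq_dists = (diffs ** 2).sum(-1)                          # (n, n) squared distances
    weights = np.exp(-sq_dists / (2 * bandwidth ** 2))       # Gaussian kernel weights
    return weights @ points / weights.sum(1, keepdims=True)  # weighted means

# Two well-separated 1-D clusters: points drift toward their cluster modes.
pts = np.array([[0.0], [0.1], [10.0], [10.1]])
for _ in range(20):
    pts = mean_shift_step(pts, bandwidth=1.0)
```

The pairwise-distance matrix is exactly the O(n²) bottleneck the abstract refers to: it must be recomputed on every iteration.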
ISBN (digital): 9798350365474
ISBN (print): 9798350365481
In recent years, there has been significant progress in efficient and lightweight image super-resolution, due in part to the design of several powerful and lightweight attention mechanisms that enhance model representation ability. However, the attention maps of most methods are obtained directly from the spatial domain, limiting their upper bound due to the locality of spatial convolutions and limited receptive fields. In this paper, we shift focus to the frequency domain, since the natural global properties of the frequency domain can address this issue. To explore attention maps from the frequency domain perspective, we investigate and correct some misconceptions in existing frequency domain feature processing methods and propose a new frequency domain attention mechanism called frequency-enhanced pixel attention (FPA). Additionally, we use large kernel convolutions and partial convolutions to improve the ability to extract deep features while maintaining a lightweight design. On the basis of these improvements, we propose a large kernel frequency-enhanced network (LKFN) with smaller model size and higher computational efficiency. It can effectively capture long-range dependencies between pixels in a whole image and achieves state-of-the-art performance among existing efficient super-resolution methods.
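The abstract does not specify FPA's internals; a minimal sketch of the general idea of a frequency-domain attention map (hypothetical form, not the paper's exact mechanism) shows why such attention is inherently global: every frequency bin mixes all pixels, so the resulting mask is not limited by a convolution's receptive field.

```python
import numpy as np

def frequency_pixel_attention(feat, weight):
    """Hypothetical frequency-domain pixel attention: modulate the
    feature map's spectrum with a learned per-bin weight (here given),
    transform back, and squash into a (0, 1) spatial attention mask."""
    spec = np.fft.rfft2(feat)                                  # global frequency representation
    spec = spec * weight                                       # per-bin modulation
    attn = 1 / (1 + np.exp(-np.fft.irfft2(spec, feat.shape)))  # sigmoid -> attention mask
    return feat * attn                                         # re-weight every pixel

h, w = 8, 8
feat = np.random.default_rng(0).standard_normal((h, w))
weight = np.ones((h, w // 2 + 1))   # identity modulation for the demo
out = frequency_pixel_attention(feat, weight)
```

With the identity modulation above, the mask reduces to a plain sigmoid of the features; a learned `weight` would reshape the spectrum before the mask is formed.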
ISBN (print): 9781665448994
Due to its relevance in intelligent transportation systems, anomaly detection in traffic videos has recently received much interest. It remains a difficult problem due to a variety of factors influencing the video quality of a real-time traffic feed, such as temperature, perspective, lighting conditions, and so on. Even though state-of-the-art methods perform well on the available benchmark datasets, they need a large amount of external training data as well as substantial computational resources. In this paper, we propose an efficient approach for a video anomaly detection system which is capable of running on edge devices, e.g., on a roadside camera. The proposed approach comprises a pre-processing module that detects changes in the scene and removes the corrupted frames, a two-stage background modelling module and a two-stage object detector. Finally, a backtracking anomaly detection algorithm computes a similarity statistic and decides on the onset time of the anomaly. We also propose a sequential change detection algorithm that can quickly adapt to a new scene and detect changes in the similarity statistic. Experimental results on the Track 4 test set of the 2021 AI City Challenge show the efficacy of the proposed framework as we achieve an F1-score of 0.9157 along with a root mean square error (RMSE) of 8.4027 and are ranked fourth in the competition.
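The abstract does not spell out its sequential change detection algorithm; a classic one-sided CUSUM detector over a similarity statistic illustrates the general idea (the drift term and threshold below are assumptions, not the authors' values):

```python
def cusum_change_detector(stream, drift=0.5, threshold=5.0):
    """One-sided CUSUM: accumulate how far the statistic falls below
    its initial baseline beyond an allowed drift; declare a change once
    the cumulative sum exceeds the threshold. Returns the index of the
    first detected change, or None if no change is detected."""
    baseline = stream[0]
    s = 0.0
    for i, x in enumerate(stream):
        s = max(0.0, s + (baseline - x) - drift)  # only downward deviations accumulate
        if s > threshold:
            return i
    return None

# Similarity statistic drops when an anomaly (e.g., a stalled vehicle) appears.
normal = [10.0] * 20
anomalous = [4.0] * 10
onset = cusum_change_detector(normal + anomalous)
```

Because the sum resets toward zero under normal behaviour, the detector adapts cheaply to a stable scene and reacts within a few frames of a sustained drop.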
Machine learning plays a significant role in enabling data forecasting. At the present time, data analysis plays a very significant part in both the Information Technology (IT) industry an...
ISBN (digital): 9798350365474
ISBN (print): 9798350365481
Multimodal large language models (MLLMs) are designed to process and integrate information from multiple sources, such as text, speech, images, and videos. Despite their success in language understanding, it is critical to evaluate their performance on downstream tasks for better human-centric applications. This paper assesses the application of MLLMs to five crucial abilities for affective computing, spanning visual affective tasks and reasoning tasks. The results show that GPT-4V has high accuracy in facial action unit recognition and micro-expression detection, while its general facial expression recognition performance is not accurate. We also highlight the challenges of achieving fine-grained micro-expression recognition and the potential for further study, and we demonstrate the versatility and potential of GPT-4V for handling advanced tasks in emotion recognition and related fields by integrating it with task-related agents for more complex tasks, such as heart rate estimation through signal processing. In conclusion, this paper provides valuable insights into the potential applications and challenges of MLLMs in human-centric computing. Our interesting examples are at https://***/EnVision-Research/GPT4Affectivity.
ISBN (digital): 9798350365474
ISBN (print): 9798350365481
Artificial intelligence (AI) and autonomous edge computing in space are emerging areas of interest to augment capabilities of nanosatellites, where modern sensors generate orders of magnitude more data than can typically be transmitted to mission control. Here, we present the hardware and software design of an onboard AI subsystem hosted on SpIRIT. The system is optimised for on-board computer vision experiments based on visible light and long wave infrared cameras. This paper highlights the key design choices made to maximise the robustness of the system in harsh space conditions, and their motivation relative to key mission requirements, such as limited compute resources, resilience to cosmic radiation, extreme temperature variations, distribution shifts, and very low transmission bandwidths. The payload, called Loris, consists of six visible light cameras, three infrared cameras, a camera control board and a Graphics Processing Unit (GPU) system-on-module. Loris enables the execution of AI models with on-orbit fine-tuning as well as a next-generation image compression algorithm, including progressive coding. This innovative approach not only enhances the data processing capabilities of nanosatellites but also lays the groundwork for broader applications to remote sensing from space.
Light field (LF) imaging has become increasingly popular in recent years for capturing and processing visual information. A significant challenge in LF processing is super-resolution (SR), which aims to enhance the resolution of low-resolution LF images. This article proposes a new LF image super-resolution (LFSR) approach that leverages the epipolar-spatial relationship within the LF. To train a deep neural network for LFSR, the proposed method involves extracting three types of information from the LF: spatial, horizontal epipolar, and vertical epipolar. Experimental results demonstrate the effectiveness of the proposed approach compared with state-of-the-art (SOTA) methods, as evidenced by quantitative metrics and visual quality. In addition, we conducted ablation studies to assess the effectiveness of each type of information and gain insights into the underlying mechanisms of the proposed method. Our approach achieved competitive results on the NTIRE 2023 Light Field Image Super-Resolution Challenge: our proposed model was ranked 10th on the test set and 6th on the validation set among 148 participants. The paper's code is available at: https://***/ahmeddiefy/EpiS_LFSR.
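The three input types can be illustrated with plain slicing of a 4-D light field array `L[u, v, y, x]` (angular coordinates u, v; spatial coordinates y, x). A sketch of the conventional extraction (the network itself is not shown, and the axis convention is an assumption):

```python
import numpy as np

def extract_lf_inputs(lf, u0, v0, y0, x0):
    """Slice a 4-D light field L[u, v, y, x] into the three inputs:
    a spatial sub-aperture view, a horizontal epipolar-plane image
    (EPI), and a vertical EPI."""
    spatial = lf[u0, v0]               # one sub-aperture image, shape (Y, X)
    horizontal_epi = lf[u0, :, y0, :]  # fix u and y: varies over (v, x)
    vertical_epi = lf[:, v0, :, x0]    # fix v and x: varies over (u, y)
    return spatial, horizontal_epi, vertical_epi

U, V, Y, X = 5, 5, 32, 48
lf = np.random.default_rng(1).standard_normal((U, V, Y, X))
spatial, h_epi, v_epi = extract_lf_inputs(lf, u0=2, v0=2, y0=16, x0=24)
```

The line slopes in the two EPIs encode scene depth, which is what makes them a useful complement to the purely spatial view.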
ISBN (digital): 9798350365474
ISBN (print): 9798350365481
Objects are represented differently in projection-based sensors such as cameras depending on sensor resolution, field of view, and distortion, leading to distorted physical and geometric properties. As a result, sensor data processing depends on these properties. Given the large variation of sensors on the market, an equivariant representation and suitable processing are necessary to become independent of the sensor used. In this work, we propose an extension of conventional image data by an additional channel in which the associated projection properties are encoded. Furthermore, we introduce a SensorConv layer as an extension to the conventional convolution layer. SensorConv enables the use of projection properties in convolutional neural networks. To that end, we propose an architecture for using the SensorConv layer in the Detectron2 [21] framework. We collected a dataset of equirectangular images for our experiments with the CARLA [3] simulator. To analyze multiple sensor models (i.e., sensor intrinsics), we created an augmentation method to emulate a high variability of sensors from the collected equirectangular panoramas. In our experiments, we show that our method can generalize better across different camera sensors.
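The additional-channel idea can be sketched concretely. For an equirectangular image, each column corresponds to an azimuth angle, which can be appended as an extra input channel; the exact projection properties the paper encodes are not specified in the abstract, so the azimuth encoding below is a hypothetical stand-in:

```python
import numpy as np

def append_projection_channel(image):
    """Append a per-pixel azimuth-angle channel to an equirectangular
    image of shape (H, W, C): column x maps linearly to an angle in
    [-pi, pi). A downstream convolution then 'sees' where each pixel
    sits in the projection, independent of the producing sensor."""
    h, w, _ = image.shape
    azimuth = np.linspace(-np.pi, np.pi, w, endpoint=False)  # one angle per column
    channel = np.broadcast_to(azimuth, (h, w))[..., None]    # (H, W, 1)
    return np.concatenate([image, channel], axis=-1)

img = np.zeros((4, 8, 3))
aug = append_projection_channel(img)
```

Cropping or rescaling the image would change the per-pixel angles, which is exactly the sensor variability the extra channel is meant to expose to the network.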
ISBN (digital): 9798350353006
ISBN (print): 9798350353013
Recent advancements in Vision-Language Models (VLMs) have marked a significant leap in bridging the gap between computer vision and natural language processing. However, traditional VLMs, trained through contrastive learning on limited and noisy image-text pairs, often lack the spatial and linguistic understanding to generalize well to dense vision tasks or less common languages. Our approach, Solid Foundation CLIP (SF-CLIP), circumvents this issue by implicitly building on the solid visual and language understanding of foundational models trained on vast amounts of unimodal data. SF-CLIP integrates contrastive image-text pretraining with masked knowledge distillation from large foundational text and vision models. This methodology guides our VLM in developing robust text and image representations. As a result, SF-CLIP shows exceptional zero-shot classification accuracy and enhanced image and text retrieval capabilities, setting a new state of the art for ViT-B/16 trained on YFCC15M and ***. The dense per-patch supervision also enhances our zero-shot and linear probe performance in semantic segmentation tasks. A remarkable aspect of our model is its multilingual proficiency, evidenced by strong retrieval results in multiple languages despite being trained predominantly on English data. We achieve all of these improvements without sacrificing training efficiency through our selective application of masked distillation and the inheritance of teacher word embeddings.
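The masked knowledge distillation term is described only at a high level; one plausible minimal form is a patch-wise MSE against frozen teacher features, averaged over masked positions only (the function name and loss form are assumptions, not the paper's exact objective):

```python
import numpy as np

def masked_distillation_loss(student, teacher, mask):
    """Patch-wise MSE between student and teacher features of shape
    (num_patches, dim), averaged only over positions where mask == 1.
    Unmasked patches contribute nothing, which is what makes the
    distillation 'selective'."""
    per_patch = ((student - teacher) ** 2).mean(axis=-1)  # (num_patches,)
    return float((per_patch * mask).sum() / mask.sum())

rng = np.random.default_rng(0)
teacher_feats = rng.standard_normal((16, 32))  # 16 patches, 32-dim features
student_feats = teacher_feats.copy()
student_feats[8:] += 1.0                       # student diverges on patches 8..15
mask = np.zeros(16)
mask[:8] = 1                                   # distill only the first 8 patches
loss = masked_distillation_loss(student_feats, teacher_feats, mask)
```

Restricting the sum to masked positions is one way such a loss can be applied selectively, matching the abstract's note that masked distillation is applied only where needed to preserve training efficiency.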
Rain removal is an essential task in computer vision, particularly for applications such as autonomous navigation that must function seamlessly during rain. However, most existing single-image deraining algorithms are limited by their inability to generalize to diverse real-world rainy images, the need for real-time processing, and the lack of task-driven metric enhancement. This paper proposes MobileDeRainGAN, an efficient semi-supervised algorithm that addresses these challenges. The proposed approach includes a novel latent bridge network and a multi-scale discriminator that effectively removes rain-related artifacts at different scales. Our cross-domain experiments on the Rain1400 and RainCityscapes datasets demonstrate substantial improvements over state-of-the-art methods in terms of generalization and object detection scores in a semi-supervised setting. Furthermore, our approach is significantly faster and can run in real time even on edge devices. Overall, our proposed MobileDeRainGAN algorithm offers a significant improvement in rain removal performance on real-world images while being efficient, scalable, and suitable for real-world applications.