At present, and increasingly so in the future, much of the captured visual content will not be seen by humans. Instead, it will be used for automated machine-vision analytics and may require only occasional human viewing. Examples of such applications include traffic monitoring, visual surveillance, autonomous navigation, and industrial machine vision. To address such requirements, we develop an end-to-end learned image codec whose latent space is designed to support scalability from simpler to more complicated tasks. The simplest task is assigned to a subset of the latent space (the base layer), while more complicated tasks make use of additional subsets of the latent space, i.e., both the base and enhancement layer(s). For the experiments, we establish a 2-layer and a 3-layer model, each of which offers input reconstruction for human vision plus machine-vision task(s), and compare them with relevant benchmarks. The experiments show that our scalable codecs offer 37%-80% bitrate savings on machine-vision tasks compared to the best alternatives, while being comparable to state-of-the-art image codecs in terms of input reconstruction.
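The layered latent split the abstract describes can be sketched as follows. This is a minimal illustration, not the paper's codec: the channel counts, the 2-layer split point, and the tensor shapes are all assumptions chosen for clarity.

```python
import numpy as np

# Stand-in for an encoder's latent tensor: (channels, height, width).
# 192 channels and a 64-channel base layer are illustrative assumptions.
latent = np.random.randn(192, 16, 16)

BASE_CHANNELS = 64  # base layer: serves the simplest machine-vision task


def base_layer(z):
    """Subset of the latent space decoded for the machine-vision task."""
    return z[:BASE_CHANNELS]


def full_latent(z):
    """Base plus enhancement layer(s), used for input reconstruction."""
    return z


machine_input = base_layer(latent)   # fewer channels -> fewer bits
human_input = full_latent(latent)    # all channels -> full reconstruction
```

The bitrate savings come from transmitting only the base-layer subset when no human viewing is needed.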
Signal capture is at the forefront of perceiving and understanding the environment; thus, imaging plays a pivotal role in mobile vision. Recent unprecedented progress in artificial intelligence (AI) has shown great potential in the development of advanced mobile platforms with new imaging devices. Traditional imaging systems based on the "capturing images first and processing afterward" mechanism cannot meet this explosive demand. On the other hand, computational imaging (CI) systems are designed to capture high-dimensional data in an encoded manner to provide more information for mobile vision systems. Thanks to AI, CI can now be used in real-life systems by integrating deep learning algorithms into the mobile vision platform to achieve a closed loop of intelligent acquisition, processing, and decision-making, thus leading to the next revolution of mobile vision. Starting from the history of mobile vision using digital cameras, this work first introduces the advancement of CI in diverse applications and then conducts a comprehensive review of current research topics combining CI and AI. Although new-generation mobile platforms, represented by smart mobile phones, have deeply integrated CI and AI for better image acquisition and processing, most mobile vision platforms, such as self-driving cars and drones, only loosely connect CI and AI, and are calling for a closer integration. Motivated by this fact, at the end of this work, we propose some potential technologies and disciplines that aid the deep integration of CI and AI and shed light on new directions in the future generation of mobile vision platforms.
Background: Fermented foods are products processed through microbial fermentation and are widely appreciated by consumers around the world for their unique flavors. With advancements in industrial technology and increasing consumer demand, modern techniques are being progressively integrated into the production and quality control of fermented foods to enhance production efficiency and product quality. Among these innovations, computer vision technology stands out as particularly impactful. Scope and approach: This paper provides an overview of the applications of computer vision in the field of fermented foods, focusing on its technical algorithms and applications within the food industry. It outlines the specific uses of computer vision technology across different types of fermented foods and discusses the relevant techniques employed. Finally, this review highlights the transformative potential of adaptive learning and multimodal fusion in addressing current limitations of computer vision for fermented food monitoring. Key findings and conclusions: The adoption of computer vision technology has significantly improved both the efficiency and accuracy of quality control processes in fermented food production. Through non-contact real-time monitoring, researchers can quickly identify the dynamic changes in microorganisms and related parameter indicators during fermentation and evaluate their impact on food quality. These technologies have not only boosted the efficiency of fermented food production but have also enhanced control over product flavor and safety assessments. Despite ongoing challenges in technology implementation and data analysis, the continuous advancements in deep learning and image-processing technologies are expected to increase the impact of computer vision in the field of fermented foods, driving sustainable industry development.
Facial recognition is a widely used process that aims to detect and verify an individual's identity. This technique is employed in various applications, such as image and video analysis, surveillance, and security...
The use of machine vision and deep learning for intelligent industrial inspection has become increasingly important in automating production processes. Although machine-vision approaches are used for industrial inspection, deep learning-based defect segmentation has not been widely studied. While state-of-the-art segmentation methods are often tuned for a specific purpose, extending them to unknown sets or other datasets, such as defect segmentation datasets, requires further analysis. In addition, recent contributions and improvements in image segmentation methods have not been extensively investigated for defect segmentation. To address these problems, we conducted a comparative experimental study of several recent state-of-the-art deep learning-based segmentation methods for steel surface defect segmentation and evaluated them on the basis of segmentation performance, processing time, and computational complexity using two public datasets, NEU-Seg and Severstal Steel Defect Detection (SSDD). In addition, we proposed and trained a hybrid transformer-based encoder with a CNN-based decoder head and achieved state-of-the-art results: a Dice score of 95.22% (NEU-Seg) and 95.55% (SSDD).
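The Dice score used to rank the segmentation methods above is a standard overlap metric between a predicted mask and the ground truth. A minimal sketch (binary masks; the smoothing term `eps` is a common convention, not a detail taken from this study):

```python
import numpy as np


def dice_score(pred, target, eps=1e-7):
    """Dice coefficient between two binary masks:
    2 * |pred & target| / (|pred| + |target|)."""
    pred = pred.astype(bool)
    target = target.astype(bool)
    inter = np.logical_and(pred, target).sum()
    return (2.0 * inter + eps) / (pred.sum() + target.sum() + eps)


pred = np.array([[1, 1, 0], [0, 1, 0]])  # toy predicted defect mask
gt = np.array([[1, 0, 0], [0, 1, 1]])    # toy ground-truth mask
print(round(dice_score(pred, gt), 3))    # 2*2/(3+3) -> 0.667
```

A Dice score of 95.22% therefore means near-complete overlap between predicted and annotated defect regions.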
ISBN:
(Print) 9783031434143; 9783031434150
The vision transformer is a model that breaks down each image into a sequence of tokens with a fixed length and processes them similarly to words in natural language processing. Although increasing the number of tokens typically results in better performance, it also leads to a considerable increase in computational cost. Motivated by the saying "A picture is worth a thousand words," we propose an innovative approach to accelerate the ViT model by shortening long images. Specifically, we introduce a method for adaptively assigning token length for each image at test time to accelerate inference speed. First, we train a Resizable-ViT (ReViT) model capable of processing input with diverse token lengths. Next, we extract token-length labels from ReViT that indicate the minimum number of tokens required to achieve accurate predictions. We then use these labels to train a lightweight Token-Length Assigner (TLA) that allocates the optimal token length for each image during inference. The TLA enables ReViT to process images with the minimum sufficient number of tokens, reducing token numbers in the ViT model and improving inference speed. Our approach is general and compatible with modern vision transformer architectures, significantly reducing computational costs. We verified the effectiveness of our methods on multiple representative ViT models on image classification and action recognition.
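The inference loop the abstract describes can be sketched as below. Both functions are stand-in stubs, not the trained ReViT or TLA: the supported token lengths and the complexity thresholds are illustrative assumptions.

```python
SUPPORTED_LENGTHS = [49, 98, 196]  # assumed token-length options of ReViT


def token_length_assigner(difficulty):
    """Stub TLA: map a per-image 'difficulty' score to the smallest
    token length expected to still yield a correct prediction."""
    if difficulty < 0.3:
        return SUPPORTED_LENGTHS[0]
    if difficulty < 0.7:
        return SUPPORTED_LENGTHS[1]
    return SUPPORTED_LENGTHS[2]


def revit_forward(image, num_tokens):
    """Stub for a Resizable-ViT forward pass at a given token count."""
    return f"prediction@{num_tokens} tokens"


# Easy images run at 49 tokens; only hard ones pay for 196.
for difficulty in (0.1, 0.5, 0.9):
    n = token_length_assigner(difficulty)
    print(revit_forward("img", n))
```

The speedup comes from most images needing far fewer tokens than the fixed-length worst case.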
Automated authoring enables simplified deployment of applications and services for complex use cases, especially in the field of machine learning. This paper presents the development and implementation of a specialized authoring tool that can be used for computer vision applications, enabling automated creation of machine learning services. The proposed authoring tool realizes a microservices architecture to facilitate the conversion and deployment of machine learning inference services, especially in image classification and object detection use cases. The authoring process addresses the interoperability issues commonly faced in machine learning frameworks, leveraging the Open Neural Network Exchange (ONNX) for model conversion into a standardized format. By encapsulating machine learning tools in containerized applications, this authoring tool offers a modular solution that can be easily adapted to various industrial applications. The developed authoring tool integrates the common machine learning frameworks PyTorch and TensorFlow, coupled with DevOps methodologies such as CI/CD, ensuring a robust, maintainable, and user-friendly system that meets the growing needs of machine learning use cases in manufacturing.
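The authoring flow (framework model in, ONNX model out, packaged service out) can be sketched as below. Both functions are illustrative stubs, not the tool's actual API: in practice the conversion step would call a real exporter such as `torch.onnx.export` or `tf2onnx`, and the packaging step would build a container image.

```python
def convert_to_onnx(model_path, framework):
    """Stand-in for ONNX export (e.g. via torch.onnx.export / tf2onnx).
    Raises on frameworks the tool does not integrate."""
    if framework not in ("pytorch", "tensorflow"):
        raise ValueError(f"unsupported framework: {framework}")
    return model_path.rsplit(".", 1)[0] + ".onnx"


def build_service(onnx_path, task):
    """Stand-in for packaging the ONNX model as a containerized
    inference microservice."""
    return {"model": onnx_path, "task": task, "runtime": "onnxruntime"}


svc = build_service(convert_to_onnx("classifier.pt", "pytorch"),
                    task="image-classification")
print(svc["model"])  # classifier.onnx
```

Standardizing on ONNX is what lets one deployment path serve both PyTorch- and TensorFlow-trained models.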
ISBN:
(Print) 9781510666931; 9781510666948
Remote sensing scene classification has been extensively studied for its critical roles in geological survey, oil exploration, traffic management, earthquake prediction, wildfire monitoring, and intelligence monitoring. In the past, machine learning (ML) methods for this task mainly used backbones pretrained in a supervised learning (SL) manner. As Masked Image Modeling (MIM), a self-supervised learning (SSL) technique, has been shown to be a better way to learn visual feature representations, it presents a new opportunity for improving ML performance on the scene classification task. This research explores the potential of MIM-pretrained backbones on four well-known classification datasets: Merced, AID, NWPU-RESISC45, and Optimal-31. Compared to published benchmarks, we show that MIM-pretrained Vision Transformer (ViT) backbones outperform other alternatives (by up to 18% in top-1 accuracy) and that the MIM technique learns better feature representations than its supervised learning counterparts (by up to 5% in top-1 accuracy). Moreover, we show that general-purpose MIM-pretrained ViTs can achieve performance competitive with the specially designed yet complicated Transformer for Remote Sensing (TRS) framework. Our experimental results also provide a performance baseline for future studies.
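The patch masking at the heart of MIM pretraining can be sketched as below: hide a random subset of image patches and train the model to reconstruct them from the visible ones. The 14x14 patch grid and 75% mask ratio are common MAE-style defaults, not settings taken from this study.

```python
import numpy as np

rng = np.random.default_rng(0)
num_patches, mask_ratio = 196, 0.75  # 14x14 patch grid, assumed ratio

# Randomly split patch indices into masked (to reconstruct) and visible
# (fed to the encoder) sets.
num_masked = int(num_patches * mask_ratio)
perm = rng.permutation(num_patches)
masked_idx, visible_idx = perm[:num_masked], perm[num_masked:]

mask = np.zeros(num_patches, dtype=bool)
mask[masked_idx] = True
print(mask.sum(), (~mask).sum())  # 147 masked, 49 visible
```

Reconstructing 147 hidden patches from only 49 visible ones is the pretext task that forces the backbone to learn transferable visual features.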
ISBN:
(Print) 9784885523434
This paper proposes image style transfer with shape preservation for gaze estimation. While several shape-preservation constraints have been proposed, we present additional constraints using (i) dense pixelwise correspondences between the original image and its style-transferred counterpart and (ii) task-driven learning that uses the gaze estimation error to directly improve gaze direction estimation. A variety of experiments against other SOTA methods on publicly available datasets, together with ablation studies, validate the effectiveness of our method.
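A shape-preservation constraint of the kind described can be sketched as a per-pixel penalty between the original image and its style-transferred version. This is a simplified proxy, not the paper's constraint: the paper uses dense pixelwise correspondences, while the stand-in below just compares intensities at the same locations.

```python
import numpy as np


def shape_preservation_loss(original, transferred):
    """Mean absolute per-pixel difference as a stand-in
    shape-preservation penalty (lower = shape better preserved)."""
    return np.abs(original - transferred).mean()


orig = np.array([[0.0, 1.0], [1.0, 0.0]])  # toy 2x2 grayscale image
styl = np.array([[0.1, 0.9], [1.0, 0.2]])  # its style-transferred version
print(round(shape_preservation_loss(orig, styl), 3))  # 0.1
```

In training, such a term is added to the style-transfer objective so that stylization cannot move eye contours, which would otherwise corrupt gaze labels.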
In recent years, the rapid development of computer vision and artificial intelligence has significantly advanced agricultural applications, particularly in the quality detection and grading of navel oranges. This revi...