ISBN (Print): 9781510651531; 9781510651524
The problem of recognizing human actions in video sequences is one of the key areas on the path to developing and deploying computer vision systems in various spheres of life. At the same time, additional sources of information (such as depth sensors and thermal sensors) allow us to obtain more informative features and thus increase the reliability and stability of recognition. In this research, we focus on how to combine the multi-level decomposition of depth and color information to improve on state-of-the-art action recognition methods. We present an algorithm that combines information from visible-light cameras and depth sensors, based on deep learning and the PLIP (parameterized logarithmic image processing) model, which is close to the human visual system's perception. The experimental results on the test dataset confirmed the high efficiency of the proposed action recognition method compared to state-of-the-art methods that use only one image modality (visible or depth).
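For readers unfamiliar with PLIP, its operators have standard closed forms. The following is a minimal NumPy sketch of the gray-tone mapping and the PLIP addition, subtraction, and scalar multiplication; the parameter values M, gamma, and k are illustrative assumptions, not values reported in the paper.

    import numpy as np

    # Illustrative PLIP parameters (assumed, not from the paper).
    M = 256.0           # maximum gray level
    gamma = k = 1026.0  # PLIP model parameters

    def gray_tone(f):
        # Map an intensity image f in [0, M) to PLIP gray-tone space.
        return M - np.asarray(f, dtype=np.float64)

    def plip_add(g1, g2):
        # PLIP addition: g1 (+) g2 = g1 + g2 - g1*g2/gamma.
        return g1 + g2 - (g1 * g2) / gamma

    def plip_sub(g1, g2):
        # PLIP subtraction: g1 (-) g2 = k*(g1 - g2)/(k - g2).
        return k * (g1 - g2) / (k - g2)

    def plip_scale(c, g):
        # PLIP scalar multiplication: c (x) g = gamma - gamma*(1 - g/gamma)**c.
        return gamma - gamma * (1.0 - g / gamma) ** c

Because these operators saturate near the maximum gray level, fusing depth and color responses with plip_add behaves closer to human brightness perception than plain addition, which motivates its use for combining the two modalities.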
Lip reading has gained popularity due to the proliferation of emerging real-world applications. This article provides a comprehensive review of benchmark datasets available for lip-reading applications and of pioneering works that analyze lower facial cues for lip-reading applications. The review is broadly classified into five distinct applications: Lip Reading Biometrics (LRB), Audio Visual Speech Recognition (AVSR), Silent Speech Recognition (SSR), Voice from Lips, and Lip HCI (human-computer interaction). LRB entails extensive research in the fields of authentication and liveness detection. AVSR covers key findings that have contributed significantly to applications such as voice assistants, video-to-text transcription, hearing aids, and pronunciation-correcting systems. SSR analyzes the efforts made toward silent-video-to-text transcription and surveillance camera applications. The Voice from Lips section discusses applications such as voice for the voiceless and vision-infused speech inpainting. In Lip HCI, LR-HCI for smartphones, smart TVs, computers, robots, and musical instruments is reviewed in detail. Comprehensive coverage is given to cutting-edge techniques in computer vision, signal processing, machine learning, and deep learning. The advancements that help systems learn to lip-read and authenticate lip gestures, generate text transcription, synthesize voice based on lip movements, and control systems via lip movements (Lip HCI) are covered. The work concludes by highlighting the limitations of existing frameworks, road maps for each application illustrating the evolution of techniques employed over time, and future research avenues in lip-reading applications.
To meet the requirement of high-precision localization of large-size workpieces in an industrial environment, an improved shape-based matching algorithm is proposed, based on the phase stretch transform and the iterative closest point algorithm. Basler industrial cameras are used to collect images of large-size workpieces, such as glass. The experimental results show that the average localization error is 0.05 ± 0.013 mm, which meets the requirements of practical applications. The algorithm can effectively and accurately localize, with high precision, different positions of multi-directionally transformed objects in industrial environments. (C) 2021 Optical Society of America
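As a rough illustration of the two building blocks named above, the sketch below extracts edge points with a phase-stretch-style kernel and refines alignment with one iterative-closest-point step. The kernel form, the parameters S and W, the threshold, and the function names are assumptions for illustration, not the paper's implementation.

    import numpy as np

    def pst_edges(img, S=0.5, W=20.0, thresh=0.3):
        # Warped-phase kernel applied in the frequency domain; the phase of
        # the output highlights edges (a common phase-stretch-transform variant).
        rows, cols = img.shape
        u = np.fft.fftfreq(rows)[:, None]
        v = np.fft.fftfreq(cols)[None, :]
        r = np.sqrt(u ** 2 + v ** 2)
        phase = W * r * np.arctan(W * r) - 0.5 * np.log1p((W * r) ** 2)
        phase = S * phase / phase.max()
        out = np.fft.ifft2(np.fft.fft2(img) * np.exp(-1j * phase))
        edges = np.angle(out)
        return np.argwhere(edges > thresh * edges.max())  # edge coordinates

    def icp_step(src, dst):
        # One ICP iteration: nearest-neighbor matching, then the SVD (Kabsch)
        # solution for the best rigid transform from src onto dst.
        d2 = ((src[:, None, :] - dst[None, :, :]) ** 2).sum(-1)
        matched = dst[d2.argmin(axis=1)]
        mu_s, mu_d = src.mean(0), matched.mean(0)
        H = (src - mu_s).T @ (matched - mu_d)
        U, _, Vt = np.linalg.svd(H)
        R = Vt.T @ U.T
        if np.linalg.det(R) < 0:   # guard against reflections
            Vt[-1] *= -1
            R = Vt.T @ U.T
        t = mu_d - R @ mu_s
        return src @ R.T + t, R, t

In a shape-based matching pipeline of this kind, the template's edge points would first be matched coarsely, after which icp_step is repeated until the mean residual stops decreasing.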
ISBN (Digital): 9781665453363
ISBN (Print): 9781665453363
With the increasing demand for data processing, approximate computing is widely used in various fault-tolerant applications such as image processing, computer vision, and machine learning. These applications also require a huge number of multiplication operations. In this paper, we mainly target softcore approximate multipliers implemented on FPGAs by encoding the INIT parameter values of the Look-Up Table (LUT) primitives. Three approximate multipliers with an associated carry chain are presented, obtained by removing LUTs from the proposed exact multiplier. An approximate multiplier without a carry chain is also presented to further reduce the multiplier's critical path delay and power consumption. We also present an accuracy-configurable adder to build higher-order approximate multipliers for architectural space exploration. The resolution of the state-of-the-art Mean Relative Error Distance (MRED) versus Power-Delay Product (PDP) Pareto front is improved, and the proposed approximate multiplier achieves 24.4%, 52.9%, and 56.4% reductions in latency, area, and power over the soft multiplier IP core, respectively. Finally, we apply the proposed approximate multiplier designs to image processing and convolutional neural networks (CNNs). Compared to advanced approximate multipliers, they offer lower energy consumption and area while maintaining acceptable quality. Our designs are open-sourced at https://github.com/Naoshangshang96/FPGAbased_approx_mult to assist further reproduction and development.
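To make the accuracy/cost trade-off concrete, here is a small behavioral sketch of a truncation-style approximate multiplier together with the MRED metric quoted above. The truncation scheme, bit widths, and function names are illustrative assumptions and do not reproduce the paper's LUT-level INIT encoding.

    def approx_mult(a, b, width=8, k=4):
        # Accumulate shifted partial products, dropping all partial-product
        # bits below column k (a common approximation that shortens the
        # carry chain in hardware).
        mask = ~((1 << k) - 1)
        acc = 0
        for i in range(width):
            if (b >> i) & 1:
                acc += (a << i) & mask
        return acc

    def mred(width=8, k=4):
        # Mean Relative Error Distance over all nonzero input pairs.
        total, n = 0.0, 0
        for a in range(1, 1 << width):
            for b in range(1, 1 << width):
                exact = a * b
                total += abs(approx_mult(a, b, width, k) - exact) / exact
                n += 1
        return total / n

    print(mred())  # error cost of dropping the 4 low columns

Sweeping k trades MRED against LUT count and carry-chain length, which is the kind of Pareto-front exploration the paper performs at the architectural level.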
The design of machine vision applications enables automatic inspection, measuring systems, and robot guidance. Typical applications of industrial robots are based on non-contact sensors that give the robot information about its environment. A robot's machine vision requires photosensors or video cameras to make intelligent decisions about its localization. Video cameras used as image-capturing equipment are too costly in comparison with optical scanning systems (OSS). The OSS provides spatial coordinate measurements that can be exploited to solve a wide variety of structural problems in real time. Localization and guidance using machine learning (ML) techniques offer advantages because the captured signals can be transformed and reduced for processing, storage, and display. The use of ML algorithms enhances the performance of the optical system for localization and guidance. Feature extraction is an important part of ML techniques, transforming the original raw data onto a low-dimensional subspace while retaining relevant information. This work presents an improvement of an optical system based on the k-nearest neighbor (k-NN) technique to solve the object detection and localization problem. This improvement allows the optical system to discriminate between the reference source and optical noise or interference. The OSS presented in this article has been implemented in structural health monitoring to measure angular position even under varying lighting and weather conditions. The feature extraction techniques used in this article were linear predictive coding (LPC), quartiles (Q_i), and autocorrelation coefficients (ACC). Using k-NN with autocorrelation coefficients and quartiles yielded more than 98% correct classification, with the reference light source as class 1 and a light bulb, acting as optical noise, as class 2.
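A minimal sketch of the feature-extraction plus k-NN stage described above, assuming scikit-learn; the number of autocorrelation lags, the choice of k = 3 neighbors, and the variable names are illustrative, not the article's exact configuration.

    import numpy as np
    from sklearn.neighbors import KNeighborsClassifier

    def features(signal, n_lags=5):
        # Autocorrelation coefficients (ACC) for lags 1..n_lags plus the
        # quartiles Q1, Q2, Q3 of the raw scanner signal.
        s = np.asarray(signal, dtype=float)
        s = s - s.mean()
        denom = (s * s).sum()
        acc = [(s[:-lag] * s[lag:]).sum() / denom
               for lag in range(1, n_lags + 1)]
        q = np.percentile(signal, [25, 50, 75])
        return np.concatenate([acc, q])

    # X_raw: captured photosensor signals; y: 1 = reference source, 2 = noise.
    # X = np.vstack([features(s) for s in X_raw])
    # clf = KNeighborsClassifier(n_neighbors=3).fit(X, y)
    # print(clf.score(np.vstack([features(s) for s in X_test]), y_test))

The low-dimensional feature vector (here eight values per signal) is what allows k-NN to run in real time on the scanner output instead of on the raw waveform.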
This article mainly studies the express-delivery robot task: visual image processing is implemented through HALCON image-processing procedures, and the host computer (PC) software is developed in C#, realizing the courier information acqu...
Style transformation on face images has traditionally been a popular research area in the field of computer vision, and its applications are quite extensive. Currently, the more mainstream schemes include Generative Adversarial Network (GAN)-based image generation and style transformation, as well as the Stable Diffusion method. In 2019, the NVIDIA team proposed StyleGAN, a relatively mature scheme for generating realistic faces as well as blending face features. The whole StyleGAN model is trained on the Flickr-Faces-HQ (FFHQ) dataset, which is large, so the model takes a long time to train. My aim is to build a one-shot stylized face image generator: only one reference face and one stylized face need to be input, and a brand-new face with mixed features can be generated after a short training time. This is inspired by the existing research result JoJoGAN, which learns a style mapper from a single example of the style. JoJoGAN uses a GAN inversion procedure and StyleGAN's style-mixing property to produce a substantial paired dataset from a single style example. This paper improves on JoJoGAN, including improving the encoder that uses the GAN inversion method to generate latent codes for image features, and randomly mixing latent codes to produce a more refined paired dataset.
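The style-mixing step can be sketched directly on StyleGAN W+ codes. The snippet below builds (content, style-mixed) code pairs from a single inverted style code; the layer count, the mixed-layer range, and alpha are illustrative assumptions, and w_style stands for the W+ code recovered by GAN inversion.

    import torch

    def mix_codes(w_style, n_samples=32, n_layers=18, dim=512,
                  mix_layers=range(7, 18), alpha=1.0):
        # Sample fresh latents and broadcast each to W+ (one row per layer).
        w_rand = torch.randn(n_samples, 1, dim).repeat(1, n_layers, 1)
        w_mix = w_rand.clone()
        # Overwrite the chosen layers with the (interpolated) style code.
        for l in mix_layers:
            w_mix[:, l] = (1 - alpha) * w_rand[:, l] + alpha * w_style[l]
        return w_rand, w_mix

    # Feeding w_rand and w_mix through a frozen StyleGAN generator yields the
    # paired dataset on which the stylizing generator is then fine-tuned.

Which layers are mixed, and with what alpha, is exactly the knob that controls how much identity is preserved versus how much style is transferred, which is where the proposed encoder and latent-mixing refinements act.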
ISBN (Digital): 9798331536626
ISBN (Print): 9798331536633
In the last few years, the abundance of available plankton images has significantly increased due to advancements in acquisition system technology. Consequently, interest in automatic plankton image classification has surged. Machine learning algorithms have recently emerged to assist in the analysis of this vast quantity of data, supporting traditional manual processing. However, annotating such data is costly and demands significant time and resources, thus requiring data-efficient machine learning solutions. The typical framework for tackling this issue has been the adoption of supervised ImageNet pre-trained models, fine-tuned on the plankton classification downstream task. Nonetheless, self-supervised pre-training protocols may provide an effective alternative to the supervised approaches using ImageNet, while allowing the exploitation of the increasingly large amount of unannotated plankton data. To the best of our knowledge, no work systematically analyzes the impact of self-supervised pre-training protocols for plankton image classification. To fill this gap, in this paper we present a thorough comparison between in-domain (plankton images) and out-of-domain (ImageNet) supervised and self-supervised pre-training, in terms of the quality of the corresponding embeddings for plankton image classification. We believe that this work may pave the way for further research into self-supervised protocols for the plankton domain, providing a valuable alternative to ImageNet and exploiting the vast amount of unannotated plankton images available.
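One standard way to compare embedding quality across pre-training protocols is a linear probe on frozen features, sketched below with PyTorch/torchvision and scikit-learn. The ResNet-18 backbone and the ImageNet checkpoint are illustrative assumptions; an in-domain or self-supervised checkpoint would be swapped in for the other arms of the comparison.

    import torch
    import torchvision
    from sklearn.linear_model import LogisticRegression

    backbone = torchvision.models.resnet18(weights="IMAGENET1K_V1")
    backbone.fc = torch.nn.Identity()  # expose the 512-d penultimate features
    backbone.eval()

    @torch.no_grad()
    def embed(batch):
        # batch: (N, 3, 224, 224) images, normalized with the checkpoint's stats.
        return backbone(batch).cpu().numpy()

    # probe = LogisticRegression(max_iter=1000).fit(embed(train_x), train_y)
    # print(probe.score(embed(test_x), test_y))

Because the backbone stays frozen, differences in probe accuracy can be attributed to the pre-training protocol rather than to downstream fine-tuning.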
ISBN (Print): 9798891760615
Vision-and-language navigation (VLN) agents are trained to navigate in real-world environments based on natural language instructions. A major challenge in VLN is the limited available training data, which hinders the models' ability to generalize effectively. Previous approaches have attempted to alleviate this issue by using external tools to generate pseudo-labeled data or by integrating web-scale image-text pairs during training. However, these methods often rely on automatically generated or out-of-domain data, leading to challenges such as suboptimal data quality and domain mismatch. In this paper, we introduce a masked path modeling (MPM) objective. MPM pre-trains an agent using self-collected data for subsequent navigation tasks, eliminating the need for external tools. Specifically, our method allows the agent to explore navigation environments and record the paths it traverses alongside the corresponding agent actions. Subsequently, we train the agent on this collected data to reconstruct the original action sequence when given a randomly masked subsequence of the original path. This approach enables the agent to accumulate a diverse and substantial dataset, facilitating the connection between visual observations of paths and the agent's actions, which is the foundation of the VLN task. Importantly, the collected data are in-domain, and the training process avoids synthetic data of uncertain quality, addressing the previous issues. We conduct experiments on various VLN datasets and demonstrate the applicability of MPM across different levels of instruction complexity. Our results exhibit significant improvements in success rates, with enhancements of 1.3%, 1.1%, and 1.2% on the val-unseen split of the Room-to-Room, Room-for-Room, and Room-across-Room datasets, respectively. Additionally, we underscore the adaptability of MPM as well as the potential for additional improvements when the agent is allowed to explore unseen environments prior to testing.
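The core of the MPM objective is easy to state: mask a contiguous span of a self-collected action sequence and train the agent to reconstruct it. A minimal sketch follows, with the mask token, mask ratio, and loss being illustrative assumptions rather than the paper's exact recipe.

    import random
    import torch
    import torch.nn.functional as F

    MASK_ID = 0  # assumed id reserved for the mask token

    def mask_path(actions, mask_ratio=0.5):
        # actions: LongTensor of action ids recorded along one traversed path.
        T = actions.size(0)
        span = max(1, int(T * mask_ratio))
        start = random.randint(0, T - span)
        masked = actions.clone()
        masked[start:start + span] = MASK_ID
        return masked, actions  # (masked input, reconstruction target)

    # logits = agent(visual_obs, masked)   # per-step action logits
    # loss = F.cross_entropy(logits.view(-1, logits.size(-1)), target.view(-1))

Because both the observations and the actions come from the agent's own exploration, the pre-training signal is in-domain by construction, which is the property the paper emphasizes.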