ISBN: 9798350323726 (print)
Recent work has studied text-to-audio synthesis using large amounts of paired text-audio data. However, audio recordings with high-quality text annotations can be difficult to acquire. In this work, we approach text-to-audio synthesis using unlabeled videos and pretrained language-vision models. We propose to learn the desired text-audio correspondence by leveraging the visual modality as a bridge. We train a conditional diffusion model to generate the audio track of a video, given a video frame encoded by a pretrained contrastive language-image pretraining (CLIP) model. At test time, we first explore performing a zero-shot modality transfer and condition the diffusion model with a CLIP-encoded text query. However, we observe a noticeable performance drop with respect to image queries. To close this gap, we further adopt a pretrained diffusion prior model to generate a CLIP image embedding given a CLIP text embedding. Our results show the effectiveness of the proposed method, and that the pretrained diffusion prior can reduce the modality transfer gap. While we focus on text-to-audio synthesis, the proposed model can also generate audio from image queries, and it shows competitive performance against a state-of-the-art image-to-audio synthesis model in a subjective listening test. This study offers a new direction of approaching text-to-audio synthesis that leverages the naturally-occurring audio-visual correspondence in videos and the power of pretrained language-vision models.
In many industrial processes, such as power generation, chemical production, and waste management, accurately monitoring industrial burner flame characteristics is crucial for safe and efficient operation. A key step involves separating the flames from the background through binary segmentation. Decades of machine vision research have produced a wide range of possible solutions, from traditional image processing to traditional machine learning and modern deep learning methods. In this work, we present a comparative study of multiple segmentation approaches, namely Global Thresholding, Region Growing, Support Vector Machines, Random Forest, Multilayer Perceptron, U-Net, and DeepLabv3+, that are evaluated on a public benchmark dataset of industrial burner flames. We provide helpful insights and guidance for researchers and practitioners aiming to select an appropriate approach for the binary segmentation of industrial burner flames and beyond. For the highest accuracy, deep learning is the leading approach, while for fast and simple solutions, traditional image processing techniques remain a viable option.
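As a rough illustration of the simplest approach named in the abstract above, global thresholding can be sketched in a few lines. The toy frame, the threshold value of 128, and the function name `global_threshold` are assumptions made for this example; the abstract does not say how the study chose its threshold.

```python
def global_threshold(image, threshold):
    """Return a binary mask: 1 (flame) where a pixel is brighter than
    the threshold, 0 (background) otherwise."""
    return [[1 if px > threshold else 0 for px in row] for row in image]


# Hypothetical 3x4 grayscale frame; bright values stand in for flame pixels.
frame = [
    [10, 20, 200, 220],
    [15, 180, 250, 30],
    [12, 14, 190, 25],
]

# 128 is an arbitrary illustrative threshold, not a value from the study.
mask = global_threshold(frame, threshold=128)
for row in mask:
    print(row)
```

A single threshold applied to every pixel is what makes the method fast, and also what makes it sensitive to lighting changes that a learned model can absorb.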
Independent adversarial sample detection is an important problem in the field of computer vision and machine learning, especially in the context of the widespread use of deep learning models. This can lead to misclass...
Instance segmentation, an important image processing operation for automation in agriculture, is used to precisely delineate individual objects of interest within images, which provides foundational information for various automated or robotic tasks such as selective harvesting and precision *** study compares the one-stage YOLOv8 and the two-stage Mask R-CNN machine learning models for instance segmentation under varying orchard conditions across two datasets. Dataset 1, collected in dormant season, includes images of dormant apple trees, which were used to train multi-object segmentation models delineating tree branches and *** Dataset 2, collected in the early growing season, includes images of apple tree canopies with green foliage and immature (green) apples (also called fruitlet), which were used to train single-object segmentation models delineating only immature green *** results showed that YOLOv8 performed better than Mask R-CNN, achieving good precision and near-perfect recall across both datasets at a confidence threshold of ***, for Dataset 1, YOLOv8 achieved a precision of 0.90 and a recall of 0.95 for all *** In comparison, Mask R-CNN demonstrated a precision of 0.81 and a recall of 0.81 for the *** Dataset 2, YOLOv8 achieved a precision of 0.93 and a recall of *** Mask R-CNN, in this single-class scenario, achieved a precision of 0.85 and a recall of ***, the inference times for YOLOv8 were 10.9 ms for multi-class segmentation (Dataset 1) and 7.8 ms for single-class segmentation (Dataset 2), compared to 15.6 ms and 12.8 ms achieved by Mask R-CNN's, *** findings show YOLOv8's superior accuracy and efficiency in machine learning applications compared to two-stage models, specifically Mask R-CNN, which suggests its suitability in developing smart and automated orchard operations, particularly when real-time applications are necessary in such cases as robotic harvesting and robotic immature green fruit thin
Image decolorization is an image pre-processing step which is widely used in image analysis, computer vision, and printing applications. The most commonly used methods give each color channel (e.g., the R component in RGB format, or the Y component of an image in CIE-XYZ format) a constant weight without considering image content. This approach is simple and fast, but it may cause significant information loss when images contain too many isoluminant colors. In this paper, we propose a new method which is not only efficient but can also preserve a higher level of image contrast and detail than the traditional methods. It uses information from the cumulative distribution function (CDF) of each color channel to compute a weight for each pixel in each color channel. Then, these weights are used to combine the three color channels (red, green, and blue) to obtain the final grayscale value. The algorithm works in RGB color space directly without any color conversion. In order to evaluate the proposed algorithm objectively, two new metrics are also developed. Experimental results show that the proposed algorithm can run as efficiently as the traditional methods and obtain the best overall performance across four different metrics.
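The information loss that the abstract above attributes to constant-weight conversion can be seen in a small sketch. It uses the widely known Rec. 601 luma weights as the constant weights, which is an assumption, since the abstract does not say which weights the traditional methods use. The two example colors were picked by hand to be nearly isoluminant. The sketch shows only the baseline problem, not the paper's CDF-based weighting, whose formula the abstract does not give.

```python
# Rec. 601 luma weights for (R, G, B); assumed here as a typical
# content-independent weighting.
REC601_WEIGHTS = (0.299, 0.587, 0.114)


def to_gray(pixel, weights=REC601_WEIGHTS):
    """Convert one RGB pixel to an 8-bit gray level with fixed weights."""
    return round(sum(w * c for w, c in zip(weights, pixel)))


# Two clearly different colors chosen to have almost the same luminance.
red = (255, 0, 0)
green = (0, 130, 0)

gray_red = to_gray(red)
gray_green = to_gray(green)
print(gray_red, gray_green)
```

Both colors land on the same gray level, so an edge between a red and a green region would disappear after conversion. A content-aware method aims to keep such regions distinguishable.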
ISBN: 9798350370287; 9798350370713 (print)
In recent years, Transformer models have revolutionized machine learning. While this has produced impressive results in the field of Natural Language Processing, Computer Vision quickly stumbled upon computation and memory problems due to the high resolution and dimensionality of the input data. This is particularly true for video, where the number of tokens increases cubically relative to the frame and temporal resolutions. A first approach to solving this was Vision Transformers, which introduce a partitioning of the input into embedded grid cells, lowering the effective resolution. More recently, Swin Transformers introduced a hierarchical scheme that brought the concepts of pooling and locality to transformers in exchange for much lower computational and memory costs. This work proposes a reformulation of the latter that views Swin Transformers as regular Transformers applied over a quadtree representation of the input, intrinsically providing a wider range of design choices for the attentional mechanism. Compared to similar approaches such as Swin and MaxViT, our method works on the full range of scales while using a single attentional mechanism, allowing us to simultaneously take into account both dense short range and sparse long range dependencies with low computational overhead and without introducing additional sequential operations, thus making full use of GPU parallelism.
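The cubic growth in token count mentioned in the abstract above can be checked with simple arithmetic. The sketch below assumes a video split into non-overlapping space-time patches. The patch size (2, 16, 16), the clip sizes, and the function name `num_tokens` are illustrative assumptions, not values from the paper.

```python
def num_tokens(frames, height, width, patch=(2, 16, 16)):
    """Count tokens when a video is cut into non-overlapping patches of
    size (frames, height, width) = patch. Sizes are assumed to divide evenly."""
    pt, ph, pw = patch
    return (frames // pt) * (height // ph) * (width // pw)


base = num_tokens(16, 224, 224)      # a small clip
doubled = num_tokens(32, 448, 448)   # every dimension doubled

# Dense self-attention compares every token with every other token,
# so its cost scales with the square of the token count.
token_growth = doubled // base
attention_growth = (doubled ** 2) // (base ** 2)
print(base, doubled, token_growth, attention_growth)
```

Doubling the temporal and both spatial resolutions multiplies the token count by 8 and the number of dense attention pairs by 64. This is the pressure that grid partitioning and hierarchical schemes are meant to relieve.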
Diffractive optical elements that divide an input beam into a set of replicas are used in many optical applications ranging from image processing to communications. Their design requires time-consuming optimization processes, which, for a given number of generated beams, are to be separately treated for one-dimensional and two-dimensional cases because the corresponding optimal efficiencies may be different. After generalizing their Fourier treatment, we prove that, once a particular divider has been designed, its transmission function can be used to generate numberless other dividers through affine transforms that preserve the efficiency of the original element without requiring any further optimization. (c) 2024 Optica Publishing Group
Most modern consumer-grade cameras are equipped with a rolling shutter mechanism, which is becoming increasingly important in computer vision, robotics and autonomous driving ***, its temporal-dynamic imaging nature leads to the rolling shutter effect that manifests as geometric *** Over the years, researchers have made significant progress in developing tractable rolling shutter models, optimization methods, and learning approaches, aiming to remove geometry distortion and improve visual *** In this survey, we review the recent advances in rolling shutter cameras from two aspects of motion modeling and deep *** To the best of our knowledge, this is the first comprehensive survey of rolling shutter *** In the part of rolling shutter motion modeling and optimization, the principles of various rolling shutter motion models are elaborated and their typical applications are ***, the applications of deep learning in rolling shutter based image processing are ***, we conclude this survey with discussions on future research directions.
This paper introduces a deep learning-based framework for identifying hand-drawn schematics of power converter circuits and performing automated simulations. The framework employs cutting-edge computer vision-based object detection models, such as YOLOv8, to achieve a high mean average precision (mAP) of 96.7% to accurately identify components. Wire tracing and connectivity are achieved through a combined architecture built upon classical image processing techniques and deep learning approaches. Detailed information extracted from a hand-drawn circuit schematic is used to automatically create its netlist for automated simulation through the SPICE engine. The proposed framework is successfully tested on various nonisolated (buck, boost) and isolated (flyback, full-bridge) converters under both continuous conduction mode (CCM) and discontinuous conduction mode (DCM) operations. In the comprehensive assessment of the entire framework, its efficacy is tested on 140 newly drawn circuit diagrams. The overall accuracy in the generation of netlists reaches a high value of 95.71%, utilizing the robust component detection capabilities of YOLOv8. Moreover, the framework enables the generation of both graphical representations and adjacency matrices for circuit diagrams. This output serves as a valuable dataset generator, contributing to the rapidly advancing domains of machine learning, including graph neural networks and geometric learning, particularly in the application space of power and energy systems. This framework can be further employed as an educational tool, and the ideas introduced can be developed to generate fully automated and efficient power converter designs for real-world applications.
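One possible way to turn a netlist into an adjacency matrix, as the abstract above mentions, is sketched below. The abstract does not describe the framework's actual graph encoding, so several things here are assumptions: the choice to treat electrical nets as graph nodes and components as edges, the simplified buck converter netlist, and the function name `adjacency_matrix`.

```python
def adjacency_matrix(netlist):
    """Build a symmetric net-to-net adjacency matrix from two-terminal
    components given as (name, net_a, net_b). Each entry counts the
    components connecting a pair of nets."""
    nets = sorted({net for _, a, b in netlist for net in (a, b)})
    index = {net: i for i, net in enumerate(nets)}
    matrix = [[0] * len(nets) for _ in nets]
    for _, a, b in netlist:
        i, j = index[a], index[b]
        matrix[i][j] += 1
        matrix[j][i] += 1
    return nets, matrix


# Hypothetical, simplified buck converter; "0" is the ground net.
buck = [
    ("V1", "in", "0"),    # input source
    ("S1", "in", "sw"),   # switch
    ("D1", "0", "sw"),    # freewheeling diode
    ("L1", "sw", "out"),  # inductor
    ("C1", "out", "0"),   # output capacitor
    ("R1", "out", "0"),   # load
]

nets, matrix = adjacency_matrix(buck)
print(nets)
for row in matrix:
    print(row)
```

A matrix of this kind could serve as input to a graph neural network, although a practical encoding would also need to carry component types and values.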
Nowadays, the optimization of the construction site layout of power line tower grouping not only affects the project progress, but also relates to the construction safety. Therefore, this study proposes a machinevisi...