With the development of computer vision, natural language processing, and machine learning technologies, a great number of joint visual-textual applications, such as image captioning, visual question answering, visual grounding, image-text cross-modal retrieval, and text-based image generation, emerged in recent years. They leverage machine learning models as the core module to tackle problems related to the intersection of vision and language. For all these joint visual-textual applications, vision and text modalities interact in three fundamental modes. The first is the "joint learning" mode, which considers both modalities as parallel inputs to jointly predict a target. The second is the "retrieval" mode, which explores the correspondence relation between the two modalities and aims to find the corresponding items that belong to different modalities. The third is the "generation" mode, which focuses on creating and modifying the items of one modality using the input of another modality as guidance. For all the joint visual-textual applications of the three modes, how to effectively "capture" and "attend" to the significant information of the visual and textual inputs is crucial. This thesis develops new "capturing" and "attending" methods to effectively model joint visual-textual applications in the three modes. For the first mode, we focus on a significant social media classification application. A novel bilateral attention model is proposed to classify whether a WeChat Moment is related to business or not based on the Moment's image and text information. For the second mode, we comprehensively investigate the application of image-text cross-modal retrieval on both general and domain-specific tasks. We first explore the general image-text matching task and propose approaches that capture high-performance cross-modal information. We then focus on two domain-specific tasks related to font retrieval and person search. We design methods to further utilize the specia
This paper presents a secure, reconfigurable, hierarchical hardware architecture at the pixel and region level for smart image sensors to accelerate machine vision applications. The design maintains hierarchical processing that begins at the pixel level. It aims to reduce the computational burden on the sequential processor and to increase the confidentiality of the sensor. We achieve this goal by preprocessing the data in parallel with event-based processing within the sensor and extracting the local features, which are then forwarded to an encryption module. After that, an external processor can obtain the encrypted features to complete the vision application. This approach significantly accelerates the vision application by executing the low-level and mid-level image processing applications while simultaneously reducing the data volume at the sensor level. The secure hardware architecture enables the vision application to perform in real time with reliability. This hierarchical processing breaks with traditional sequential image processing and introduces parallelism for machine vision applications. We evaluate the design in FPGA and achieve the GDSII file in the ASIC platform at 800 MHz. Simulation results show that the area overhead and power penalty for adding reconfiguration features stay in an acceptable range. In addition, by removing redundant information, 84.01% and 94.31% of dynamic power can be saved at the pixel level and region level, respectively.
It is essential to find creative solutions to the growing urban problems of traffic congestion and parking issues. By using real-time traffic camera photos with image processing and deep learning algorithms to compute...
ISBN (digital): 9798331531850
ISBN (print): 9798331531867
In a complex semiconductor manufacturing environment, critical dimension scanning electron microscope (CD-SEM) images are captured at the metrology step to monitor structural measurements and detect anomalies in order to meet stringent process control requirements. This paper focuses on advanced CD-SEM image processing and anomaly detection using machine learning and generative AI models, which include computer vision (CV) image processing, Residual Neural Network (ResNet) deep learning, and a Generative Adversarial Network (GAN) model for various use cases. Applying these models during in-line monitoring is crucial for identifying potential process issues and improving yield and quality.
A key element of quality control in manufacturing is Product Inspection, which is a process that allows for verifying a product's quality enabled by activities such as measuring, examining, and testing one or more...
Object classification and detection involve numerous applications like image processing, picture retrieval, security and surveillance, video communication, robot vision and observation. They are often classified based...
ISBN (print): 9784901122207
Event-based vision sensors are a paradigm shift in the way that visual information is obtained and processed. These devices are capable of low-latency transmission of data that represents the scene dynamics. Additionally, their low power consumption makes the sensors popular in power-constrained scenarios such as high-speed robotics or machine vision applications where latency in visual information should be minimal. The core datatype of such vision sensors is the 'event', an asynchronous per-pixel signal indicating a change in light intensity at an instant in time, associated with the spatial location of that pixel on the sensor array. A popular approach to event-based processing is to map events onto a 2D plane over time, which is comparable with traditional imaging techniques. However, this paper presents a disruptive approach to event data processing that uses a tree-based filter framework to directly process raw event data and extract events corresponding to interest point features; this is then combined with a Harris interest point approach to isolate features. We hypothesise that since the tree structure contains the same spatial information as a 2D surface mapping, Harris may be applied directly to the content of the tree, bypassing the need for transformation to the 2D plane. Results illustrate that the proposed approach performs better than other state-of-the-art approaches with limited compromise on the run-time performance.
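For readers unfamiliar with the conventional 2D mapping that the abstract contrasts itself against, the following is a minimal sketch of one common form of it: a grid that keeps the most recent event timestamp per pixel. It is not the paper's tree-based filter. The `Event` fields (`x`, `y`, `t`, `polarity`) and the function name `time_surface` are assumptions made for illustration.

```python
from collections import namedtuple

# Hypothetical event record: pixel coordinates, timestamp, and polarity
# (+1 for an intensity increase, -1 for a decrease).
Event = namedtuple("Event", ["x", "y", "t", "polarity"])


def time_surface(events, width, height):
    """Map events onto a 2D grid holding the latest timestamp per pixel.

    Pixels that never fired stay None. Events outside the grid are ignored.
    """
    surface = [[None] * width for _ in range(height)]
    for ev in events:
        if 0 <= ev.x < width and 0 <= ev.y < height:
            prev = surface[ev.y][ev.x]
            # Keep only the most recent event seen at this pixel.
            if prev is None or ev.t >= prev:
                surface[ev.y][ev.x] = ev.t
    return surface


if __name__ == "__main__":
    demo = [Event(1, 0, 5, 1), Event(1, 0, 9, -1), Event(0, 1, 3, 1)]
    for row in time_surface(demo, width=2, height=2):
        print(row)
```

A frame-style detector such as Harris would then operate on this grid, which is the transformation step the paper proposes to bypass.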
ISBN (digital): 9798350355413
ISBN (print): 9798350355420
Medical image segmentation is crucial for many healthcare applications, and deep learning networks have shown great potential in performing semantic segmentation tasks effectively. However, existing methods often suffer from the loss of important local features. To solve this problem, we introduce a multi-scale and multi-connection feature adaptive fusion method guided by an attention mechanism. The proposed method emphasizes adaptive fusion of multi-scale and multi-connection features, effectively capturing meaningful features at different scales, while also using an attention mechanism to enhance feature representation. Specifically, our approach leverages dense multi-scale skip connections to bridge the semantic gap between the feature maps of the encoder and decoder, where the feature map in each decoder module is connected with each feature map in the encoder modules. In addition, our network incorporates a deep attention block (DCS) in the encoder to capture meaningful local features. In the decoder, we introduce an efficient multi-scale convolution block (MSDC) to refine feature maps by performing deep convolution across multiple scales. Experimental results on the well-known ISIC dataset demonstrate that our approach significantly outperforms the baseline method in the task of medical image segmentation.
ISBN (digital): 9798331533663
ISBN (print): 9798331533670
Extracting text from images is essential in image processing and computer vision, with applications in document digitization and automated text recognition. This paper reviews various text extraction techniques, categorized into thresholding, rough set and fuzzy set methods, clustering, edge detection, and machine learning. Thresholding techniques such as Gaussian, Otsu, adaptive, and double-edge methods are explored. Rough set and fuzzy set methods, which handle uncertainty in image data and improve text segmentation, are reviewed. Clustering techniques, such as K-means and density-based methods, are studied for their effectiveness in grouping pixels for text isolation. Edge detection techniques, including Sobel, Roberts, Canny, Morphological Component Analysis (MCA), and Laplacian, are examined for their role in enhancing text boundary identification. Machine learning approaches, such as Support Vector Machines (SVM), Contrastive Language-Image Pre-training (CLIP), Hidden Markov Models (HMM), and hybrid BiLSTM-CNN models, are analyzed for their ability to improve accuracy in noisy environments. This review compares these techniques, highlighting their strengths, challenges, and applications for text extraction.
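As an illustration of the thresholding family named in this abstract, the following is a minimal sketch of Otsu's method on a flat list of 8-bit grayscale values. It is a textbook formulation, not code from the reviewed paper. The function names `otsu_threshold` and `binarize` are assumptions, and the input is assumed to hold integers in the range 0 to `levels - 1`.

```python
def otsu_threshold(pixels, levels=256):
    """Return the gray level t that maximizes between-class variance.

    Pixels with value <= t form one class, the rest form the other.
    """
    hist = [0] * levels
    for p in pixels:
        hist[p] += 1

    total = len(pixels)
    sum_all = sum(i * h for i, h in enumerate(hist))

    w_bg = 0      # number of pixels at or below the candidate threshold
    sum_bg = 0    # sum of gray values of those pixels
    best_t, best_var = 0, -1.0
    for t in range(levels):
        w_bg += hist[t]
        sum_bg += t * hist[t]
        if w_bg == 0:
            continue          # no pixels in the lower class yet
        w_fg = total - w_bg
        if w_fg == 0:
            break             # all pixels are in the lower class
        mean_bg = sum_bg / w_bg
        mean_fg = (sum_all - sum_bg) / w_fg
        # Between-class variance, up to a constant factor of 1 / total**2.
        var_between = w_bg * w_fg * (mean_bg - mean_fg) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t


def binarize(pixels, threshold):
    """Label each pixel 1 if it is above the threshold, else 0."""
    return [1 if p > threshold else 0 for p in pixels]


if __name__ == "__main__":
    sample = [10] * 50 + [200] * 50   # dark text-like and bright paper-like values
    t = otsu_threshold(sample)
    print(t, sum(binarize(sample, t)))
```

Whether the text ends up as the 0 class or the 1 class depends on whether the document is dark-on-light or light-on-dark, so a real pipeline would need to decide which class to keep.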
This article describes a solution for automating measurements on CNC machines equipped with automatic measurement tools using a vision system that simulates stereoscopic vision. The proposed measurement method allows non-contact determination of the approximate position of the workpiece on the machine in order to fill in the parameters of measuring cycles. A three-stage algorithm for image processing by a software module of such a system is described, including recognition of the area containing the workpiece, automatic correction of input images for successful contour recognition, determination of the contours of the workpieces and their geometric parameters, as well as calculation of the approximate dimensions of the workpieces relative to marks applied to the machine worktable beforehand. The calculated dimensions must be transmitted to the CNC system of the machine to run a parameterized program that calls a measuring cycle, which refines the position of the workpiece on the machine. A model of the vision system has been implemented and tested.