ISBN: (Print) 9798350376043; 9798350376036
This document is a tutorial on how to choose a binarization algorithm for a document image, considering different analysis criteria. This tutorial focuses on the main challenges one can face when dealing with binarization algorithms.
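As one hedged illustration of what such an algorithm looks like, below is a minimal NumPy sketch of Otsu's global thresholding, a classic candidate a tutorial like this would compare; the function name and the synthetic two-level image are illustrative, not taken from the tutorial.

```python
import numpy as np

def otsu_threshold(img):
    # Otsu's method: pick the threshold that maximizes the
    # between-class variance of the grayscale histogram.
    hist, _ = np.histogram(img, bins=256, range=(0, 256))
    total = hist.sum()
    sum_all = np.dot(np.arange(256), hist)
    best_t, best_var = 0, -1.0
    w0, sum0 = 0.0, 0.0
    for t in range(256):
        w0 += hist[t]
        if w0 == 0 or w0 == total:
            continue
        sum0 += t * hist[t]
        m0 = sum0 / w0                       # mean of the dark class
        m1 = (sum_all - sum0) / (total - w0) # mean of the bright class
        var = w0 * (total - w0) * (m0 - m1) ** 2
        if var > best_var:
            best_var, best_t = var, t
    return best_t

# Toy bimodal "document": dark ink at 40, bright paper at 200.
img = np.concatenate([np.full(500, 40), np.full(500, 200)]).astype(np.uint8)
t = otsu_threshold(img)
binary = np.where(img > t, 255, 0)
```

For real documents with uneven illumination, local methods (e.g., Sauvola-style adaptive thresholds) are the usual alternative this global sketch does not cover.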
Data acquisition and analysis are important areas for science, directly related to image reconstruction. Acquired data are often corrupted by various factors, such as external noise sources or noise inherent to the application, but this corruption can be treated mathematically. This work aims to reconstruct images corrupted by Gaussian and Rician noise, using DC programming and a non-convex version of the total variation (TV) model. The tests are performed with a variation of the BDCA algorithm (smoothing of the first DC component) and the nmBDCA algorithm. The obtained results are evaluated both in quality (PSNR and SSIM) and in CPU time, covering medical computed tomography (CT) images and magnetic resonance images (MRI).
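The TV model in question can be sketched independently of the paper's DC decomposition. The following is a plain gradient-descent sketch on a smoothed 1D TV objective (the eps smoothing loosely mirrors the smoothing of the first DC component mentioned above); all parameter values are illustrative assumptions, not the authors' settings.

```python
import numpy as np

def tv_denoise_1d(y, lam=0.2, eps=1e-3, steps=500, lr=0.1):
    # Minimize 0.5*||x - y||^2 + lam * sum_i sqrt((x[i+1]-x[i])^2 + eps)
    # by gradient descent; eps makes the TV term differentiable.
    x = y.astype(float).copy()
    for _ in range(steps):
        dx = np.diff(x)
        g = dx / np.sqrt(dx ** 2 + eps)  # derivative of smoothed |dx|
        grad = x - y                     # data-fidelity gradient
        grad[:-1] -= lam * g             # TV gradient w.r.t. x[i]
        grad[1:] += lam * g              # TV gradient w.r.t. x[i+1]
        x -= lr * grad
    return x
```

TV regularization favors piecewise-constant solutions, which is why it preserves edges better than plain quadratic smoothing; the non-convex variants studied in the paper sharpen this behavior further.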
Effectively distinguishing between images in high visual similarity datasets poses significant challenges, especially with photometric variations, perspective transformations, and/or occlusions. We introduce a novel methodology that fuses local and global feature detection techniques. By integrating local feature analysis with global feature representation based on graph structuring and processing, our approach can capture topological and metric relationships among descriptors. The proposed graph representation is computed using only matching features, hence filtering out irrelevant information and focusing on unique image attributes that favor identification. This study aims to answer how the synergistic combination of these techniques can outperform conventional identification methods on datasets with high visual similarity. We performed experiments showing significant improvements in precision and recall, reflected in the F1-score, of the proposed strategy over purely local-based image identification. The results highlight the potential of hybrid approaches for better image recognition, also revealing that local-based methods can use our proposal as an additional component to obtain improved results.
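The idea of building a graph only over matching features can be sketched as follows, assuming plain brute-force descriptor matching with Lowe's ratio test; the function names and the distance-weighted complete graph are illustrative simplifications, not the authors' exact construction.

```python
import numpy as np

def match_ratio(desc1, desc2, ratio=0.75):
    # Brute-force nearest-neighbor matching with Lowe's ratio test:
    # keep a match only when the best distance clearly beats the second best.
    matches = []
    for i, d in enumerate(desc1):
        dist = np.linalg.norm(desc2 - d, axis=1)
        j, k = np.argsort(dist)[:2]
        if dist[j] < ratio * dist[k]:
            matches.append((i, j))
    return matches

def match_graph(points, matches, side):
    # Complete graph over the MATCHED keypoints only, with edges weighted
    # by Euclidean distance: a metric relation among descriptors that is
    # invariant to a translation of the whole image.
    sel = points[[m[side] for m in matches]]
    diff = sel[:, None, :] - sel[None, :, :]
    return np.linalg.norm(diff, axis=-1)
```

Comparing the two graphs (one per image) then tests geometric consistency of the matches, which is the kind of topological/metric cue the abstract describes.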
Image captioning refers to the process of creating a natural language description for one or more images. This task has several practical applications, from aiding in medical diagnoses through image descriptions to promoting social inclusion by providing visual context to people with impairments. Despite recent progress, especially in English, low-resource languages like Brazilian Portuguese face a shortage of datasets, models, and studies. This work seeks to contribute to this context by fine-tuning and investigating the performance of vision language models based on the Transformer architecture in Brazilian Portuguese. We leverage pre-trained vision model checkpoints (ViT, Swin, and DeiT) and neural language models (BERTimbau, DistilBERTimbau, and GPorTuguese-2). Several experiments were carried out to compare the efficiency of different model combinations using #PraCegoVer-63K, a native Portuguese dataset, and a translated version of the Flickr30K dataset. The experimental results demonstrated that configurations using the Swin, DistilBERTimbau, and GPorTuguese-2 models generally achieved the best outcomes. Furthermore, the #PraCegoVer-63K dataset presents a series of challenges, such as descriptions made up of multiple sentences and the presence of proper names of places and people, which significantly decrease the performance of the investigated models.
Motivated by the inherent data scarcity in the medical domain, this work studies few-shot retinal disease classification using the Brazilian Multilabel Ophthalmological Dataset. We compare different network architectures and non-trivial data augmentations under the application of the Reptile algorithm, conducting quantitative and qualitative analyses. Regarding the architectures, we observe that Swin outperforms ViT and ResNet. We also observe that clever data augmentations not only improve performance but can also generate prediction confidence distributions that are more interpretable and trustworthy. Furthermore, pre-training the models with domain-specific data leads to a superior ability of the models to detect the relevant patterns in the images. Code is available at ***/gabjp/few-shot-BRSET.
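The Reptile algorithm itself is simple enough to sketch. The version below uses toy least-squares tasks in place of retinal images, and all hyperparameter values are illustrative assumptions:

```python
import numpy as np

def reptile_step(w, tasks, inner_lr=0.05, outer_lr=0.5, inner_steps=5):
    # One Reptile meta-iteration: adapt on each task with a few SGD
    # steps, then move the meta-weights toward the adapted weights.
    delta = np.zeros_like(w)
    for X, y in tasks:
        w_task = w.copy()
        for _ in range(inner_steps):
            grad = 2 * X.T @ (X @ w_task - y) / len(y)  # MSE gradient
            w_task -= inner_lr * grad
        delta += (w_task - w) / len(tasks)
    return w + outer_lr * delta
```

Unlike MAML, Reptile needs no second-order derivatives: the meta-gradient is just the displacement produced by ordinary fine-tuning, which is what makes it attractive in few-shot regimes.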
Photographing fiscal receipts has become increasingly common with the rise of online storage and accounting services. However, capturing images in uncontrolled environments often leads to distortions that can compromise Optical Character Recognition (OCR) techniques, rendering the output text unreadable. To address this problem, we propose an expert open-source filtering approach based on low-level features to identify and discard poor-quality fiscal images, select high-quality ones, and flag images that need preparation before OCR. The flagged images undergo a series of enhancement techniques, including homography transformation, super-resolution, noise reduction, sharpness adjustment, morphological operations, and binarization. Our extensive experimental evaluation, executed on a newly proposed labeled dataset of fiscal receipts, shows that the proposed method lowers the average Character Error Rate by up to 11 points compared to baseline methods. Additionally, an ablation study reveals the impact of each image preparation step on accuracy.
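One of the low-level features such a filter could rely on is the variance of the Laplacian, a standard sharpness cue. The sketch below, including the triage function and its thresholds, is an illustrative assumption rather than the paper's actual feature set:

```python
import numpy as np

def laplacian_variance(img):
    # Variance of the discrete Laplacian response: blurred images
    # have weak second derivatives, so they score low.
    lap = (img[:-2, 1:-1] + img[2:, 1:-1] + img[1:-1, :-2]
           + img[1:-1, 2:] - 4 * img[1:-1, 1:-1])
    return float(lap.var())

def triage(img, low=5.0, high=50.0):
    # Hypothetical thresholds: discard very blurry receipts, pass
    # sharp ones straight to OCR, and flag the rest for preparation.
    s = laplacian_variance(img)
    if s < low:
        return "discard"
    return "ok" if s > high else "prepare"
```

In a full pipeline, the "prepare" branch would feed the enhancement chain described above (homography, super-resolution, denoising, binarization) before OCR.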
Image retrieval approaches typically involve two fundamental stages: visual content representation and similarity measurement. Traditional methods rely on pairwise dissimilarity metrics, such as Euclidean distance, which overlook the global structure of datasets. Aiming to address this limitation, various unsupervised post-processing approaches have been developed to redefine similarity measures. Diffusion processes and rank-based methods compute a more effective similarity by considering the relationships among images and the overall dataset structure. However, neither approach is capable of defining novel image representations. This paper aims to overcome this limitation by proposing a novel self-supervised image re-ranking method. The proposed method exploits a hypergraph model, clustering strategies, and Graph Convolutional Networks (GCNs). Initially, an unsupervised rank-based manifold learning method computes global similarities to define small and reliable clusters, which are used as soft labels for training a semi-supervised GCN model. This GCN undergoes a two-stage training process: an initial classification-focused stage followed by a retrieval-focused stage. The final GCN embeddings are employed for retrieval tasks using cosine similarity. An experimental evaluation conducted on four public datasets with three different visual features indicates that the proposed approach outperforms traditional and recent rank-based methods.
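A minimal example of a rank-based similarity that accounts for dataset structure is reciprocal k-nearest-neighbor overlap. The sketch below is a generic illustration of the idea, not the paper's hypergraph-plus-GCN pipeline:

```python
import numpy as np

def reciprocal_knn_similarity(feats, k=3):
    # Rank-based contextual similarity: two images are similar when they
    # share many RECIPROCAL k-nearest neighbors, a global cue that plain
    # pairwise Euclidean distance ignores.
    d = np.linalg.norm(feats[:, None, :] - feats[None, :, :], axis=-1)
    order = np.argsort(d, axis=1)[:, : k + 1]  # self + k nearest
    nn = np.zeros(d.shape, dtype=bool)
    for i, row in enumerate(order):
        nn[i, row] = True
    recip = nn & nn.T                          # mutual neighbors only
    inter = recip.astype(float) @ recip.astype(float).T
    size = recip.sum(axis=1).astype(float)
    union = np.maximum(size[:, None] + size[None, :] - inter, 1.0)
    return inter / union                       # Jaccard over neighbor sets
```

Similarities of this kind are what rank-based post-processing feeds back into retrieval; the paper goes further by also learning new embeddings from them.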
Despite the impressive advances in image understanding approaches, defining similarity among images remains a challenging task, crucial for many applications such as classification and retrieval. Image representation techniques, mainly supported by Convolutional Neural Networks (CNNs) and Transformer-based models, are the main reason for these advances. On the other hand, comparisons are mostly computed with traditional pairwise measures, such as the Euclidean distance, while contextual similarity approaches can lead to effective results in defining similarity between points in high-dimensional spaces. This paper introduces a novel approach to contextual similarity by combining two techniques: neighbor embedding projection methods and rank-based manifold learning. High-dimensional features are projected into a 2D space used for efficient ranking computation. Subsequently, manifold learning methods are exploited for a re-ranking step. An experimental evaluation conducted on different datasets and visual features indicates that the proposed approach leads to significant gains in comparison to the original feature representations and to the neighbor embedding method in isolation.
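The project-then-rank idea can be sketched with PCA standing in for the neighbor-embedding projection (the paper presumably uses a nonlinear method); the manifold-learning re-ranking step is omitted here:

```python
import numpy as np

def project_2d(feats):
    # Stand-in projection: PCA to 2 components via SVD. Neighbor-embedding
    # methods pursue the same goal of a low-dimensional layout in which
    # rankings are cheap to compute.
    centered = feats - feats.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:2].T

def rank_in_2d(proj, query):
    # Ranking computed entirely in the 2D space: distances over 2 dims
    # instead of the original high-dimensional feature space.
    d = np.linalg.norm(proj - proj[query], axis=1)
    return np.argsort(d)
```

The efficiency argument is that once features live in 2D, every query ranking costs O(n) cheap distance evaluations, leaving budget for the subsequent re-ranking.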
The revolutionary advances in image representation have led to impressive progress in many image understanding-related tasks, primarily supported by Convolutional Neural Networks (CNNs) and, more recently, by Transformer models. Despite such advances, assessing the similarity among images for retrieval in unsupervised scenarios remains a challenging task, mostly grounded on traditional pairwise measures, such as the Euclidean distance. The scenario is even more challenging when different visual features are available, requiring the selection and fusion of features without any label information. In this paper, we propose an Unsupervised Dual-Layer Aggregation (UDLA) method, based on contextual similarity approaches, for selecting and fusing CNN and Transformer-based visual features trained through transfer learning. In the first layer, the selected features are fused in pairs with a focus on precision. A subset of pairs is selected for a second-layer aggregation focused on recall. An experimental evaluation conducted on different public datasets showed the effectiveness of the proposed approach, which achieved results significantly superior to the best isolated feature and also superior to a recent fusion approach considered as a baseline.
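The two-layer structure can be caricatured with similarity matrices: a precision-oriented multiplicative fusion of feature pairs, then a recall-oriented average. This is an illustrative reading of the abstract, not the authors' actual UDLA rules:

```python
import numpy as np

def normalize(sim):
    # Rescale a similarity matrix to [0, 1] so features are comparable.
    return (sim - sim.min()) / (sim.max() - sim.min() + 1e-12)

def fuse_pair(sim_a, sim_b):
    # First layer (precision): multiplicative agreement, so a pair of
    # images must look similar under BOTH features to stay similar.
    return normalize(sim_a) * normalize(sim_b)

def aggregate(fused_pairs):
    # Second layer (recall): averaging the selected pairwise fusions
    # recovers pairs that any single product suppressed too harshly.
    return sum(fused_pairs) / len(fused_pairs)
```

The product/average split mirrors the precision/recall split the abstract describes: products only keep consensus, averages tolerate disagreement.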
Traditional convolutional neural networks (CNNs) face significant challenges when applied to omnidirectional images due to the non-uniform sampling inherent in equirectangular projection (ERP). This projection type leads to distortions, particularly near the poles of the ERP image, and fixed-size kernels in planar CNNs are not designed to address this issue. This paper introduces a convolutional block called Spherically-Weighted Horizontally Dilated Convolutions (SWHDC). Our block mitigates distortions during the feature extraction phase by properly weighting dilated convolutions according to the optimal support for each row in the ERP, thus enhancing the ability of a network to process omnidirectional images. We replace the planar convolutions of well-known backbones with our SWHDC block and test its effectiveness on the 3D object classification task, using ERP images as a case study. We considered standard benchmarks and compared the results with state-of-the-art methods that convert 3D objects to single 2D images. The results show that our SWHDC block improves the classification performance of planar CNNs when dealing with ERP images without increasing the number of parameters, outperforming competing methods. Code is available at: https://***/rmstringhini/SWHDC
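The core geometric observation can be sketched numerically: ERP rows away from the equator are horizontally oversampled by a factor of roughly 1/cos(latitude), so a kernel covering a constant solid angle needs proportionally wider horizontal support. The helper names and the rounding of support to a dilation rate below are assumptions, not the paper's exact SWHDC weighting:

```python
import numpy as np

def erp_row_support(height, base_kernel=3):
    # Horizontal support needed at each ERP row so that a base_kernel-wide
    # kernel at the equator covers the same angular extent everywhere.
    rows = np.arange(height)
    phi = (rows + 0.5) / height * np.pi - np.pi / 2  # latitude of each row
    return base_kernel / np.maximum(np.cos(phi), 1e-6)

def dilation_rate(support, base_kernel=3):
    # Nearest integer dilation whose receptive field matches the support:
    # a dilated k-tap kernel spans 1 + (k - 1) * dilation pixels.
    return np.maximum(1, np.round((support - 1) / (base_kernel - 1))).astype(int)
```

Rows near the equator keep dilation 1 (an ordinary planar convolution), while rows near the poles get progressively larger horizontal dilations at no extra parameter cost, which matches the parameter-neutral claim above.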