A pyramidal multi-scale encoder-decoder network, PMED-Net, is proposed for medical image segmentation. Various encoder-decoder variants are in use for segmenting medical images, with U-Net the most widely adopted. However, existing architectures have millions of parameters and require enormous computation, making them memory- and cost-inefficient. To overcome these limitations, we train small networks in cascaded form for coarse-to-fine prediction. The proposed adaptive network extends up to six pyramid levels, and at each level features are extracted at a different scale of the input image. Each lightweight encoder-decoder network is trained independently to minimize its loss, and each succeeding network further refines the previous predictions. The architecture was evaluated and compared on four publicly available medical image segmentation datasets: the International Skin Imaging Collaboration (ISIC) 2018 challenge dataset, a brain tumor dataset, a nuclei dataset, and an X-ray dataset. PMED-Net's results are better than or on par with other state-of-the-art networks in terms of IoU, F1-score, and sensitivity. Moreover, PMED-Net is parameter-efficient: it has 21.3, 21.1, 14.0, 11.6, 11.2, 6.64, and 4.95 times fewer parameters than SegNet, U-Net, BCDU-Net, CU-Net, FCN-8s, ORED-Net, and MultiResUNet, respectively. Pre-trained models, dataset information, and implementation details are available at https://***/kabbas570/Pyramid-Based-encoder-decoder.
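A minimal sketch of the cascaded coarse-to-fine idea described above (not the authors' released code): each pyramid level is a small encoder-decoder that receives a downscaled input together with the upsampled prediction of the previous level. Channel widths, depths, and the three-level setup are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyEncoderDecoder(nn.Module):
    """One lightweight pyramid-level network (illustrative widths)."""
    def __init__(self, in_ch, width=16):
        super().__init__()
        self.enc = nn.Sequential(
            nn.Conv2d(in_ch, width, 3, padding=1), nn.ReLU(inplace=True),
            nn.MaxPool2d(2),
            nn.Conv2d(width, width * 2, 3, padding=1), nn.ReLU(inplace=True),
        )
        self.dec = nn.Sequential(
            nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False),
            nn.Conv2d(width * 2, width, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(width, 1, 1),  # single-channel mask logits
        )

    def forward(self, x):
        return self.dec(self.enc(x))

class CascadedPyramid(nn.Module):
    """Coarse-to-fine cascade: level k refines the upsampled prediction
    of level k-1 at a higher input resolution."""
    def __init__(self, levels=3):
        super().__init__()
        # first level sees only the image; later levels also see the prior mask
        self.nets = nn.ModuleList(
            [TinyEncoderDecoder(3)] + [TinyEncoderDecoder(4) for _ in range(levels - 1)]
        )

    def forward(self, image):
        preds, prev, n = [], None, len(self.nets)
        for k, net in enumerate(self.nets):
            scale = 2 ** (n - 1 - k)          # coarsest level first
            x = F.avg_pool2d(image, scale) if scale > 1 else image
            if prev is not None:
                prev = F.interpolate(prev, size=x.shape[-2:], mode="bilinear",
                                     align_corners=False)
                x = torch.cat([x, prev], dim=1)
            prev = net(x)
            preds.append(prev)  # each level is supervised independently
        return preds
```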
ISBN (print): 9798400704369
The automatic generation of radiological imaging reports aims to produce accurate and coherent clinical descriptions from X-ray images, easing the arduous task of report writing for clinicians and advancing clinical automation. The primary challenge lies in accurately capturing and describing abnormal regions in the images under data-bias conditions, while still generating long texts that cover image details. Existing methods mostly rely on prior knowledge such as medical knowledge graphs, corpora, and image databases to help models generate more precise descriptions, yet they still struggle to identify rare anomalies. To address this, we propose a two-stage training model, CLR2G, based on cross-modal contrastive learning. It delegates the capture of anomalies, particularly those that a generative model trained with cross-entropy loss misses under data bias, to a specialized abnormality-capture component. Specifically, we employ a semantic matching loss to train additional abnormal-image and text encoders through cross-modal contrastive learning, enabling the capture of 13 common anomalies. The anomalous image features, text features, and their confidence probabilities serve as posterior knowledge that helps the model generate accurate reports. Experimental results demonstrate state-of-the-art performance on two widely used public datasets, IU-Xray and MIMIC-CXR.
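One plausible form of the cross-modal contrastive training mentioned above is a symmetric InfoNCE objective over paired image/text embeddings; this is a generic stand-in, not necessarily the paper's exact semantic matching loss, and the temperature value is an assumption.

```python
import torch
import torch.nn.functional as F

def cross_modal_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired image/text embeddings.
    img_emb, txt_emb: (B, D) outputs of the abnormality image/text encoders.
    Illustrative stand-in for the paper's semantic matching loss."""
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    logits = img @ txt.t() / temperature              # (B, B) similarity matrix
    targets = torch.arange(img.size(0), device=img.device)
    loss_i2t = F.cross_entropy(logits, targets)       # image -> matching text
    loss_t2i = F.cross_entropy(logits.t(), targets)   # text  -> matching image
    return 0.5 * (loss_i2t + loss_t2i)
```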
An accurate and timely cracking assessment, covering the presence, location, and geometric features of cracks, is crucial for evaluating concrete wind towers; early crack identification is therefore a critical step in promptly assessing structural integrity. This study proposes an ad-hoc encoder-decoder network based on DeepLabv3+ with depthwise separable convolutions to automatically segment cracks in real-world images captured from various concrete wind towers. The combined advantages of the improved DeepLabv3+ and the lightweight MobileNetV2 make the pairing a suitable benchmark, given its high performance and generality. Four experiments were conducted to determine the model design and assess crack-measurement capability: (1) six parametric tests with various pre-trained base networks and optimizers, (2) the influence of complex background noise (e.g., handwritten script) on crack segmentation performance, (3) comparative studies against cutting-edge pixel-wise segmentation models, and (4) crack feature measurement (length and width). The results demonstrate that DeepLabv3+ with MobileNetV2 can be applied for efficient and accurate crack segmentation in concrete wind towers with complex backgrounds.
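The efficiency ingredient shared by DeepLabv3+ and MobileNetV2 in this abstract is the depthwise separable convolution: a per-channel spatial convolution followed by a 1x1 pointwise convolution. A minimal sketch (normalization and activation choices are illustrative):

```python
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    """Depthwise separable convolution as used in MobileNetV2/DeepLabv3+:
    a per-channel (depthwise) KxK conv followed by a 1x1 (pointwise) conv,
    cutting parameters and FLOPs versus a standard KxK convolution."""
    def __init__(self, in_ch, out_ch, kernel_size=3, stride=1, dilation=1):
        super().__init__()
        pad = dilation * (kernel_size - 1) // 2
        self.depthwise = nn.Conv2d(in_ch, in_ch, kernel_size, stride=stride,
                                   padding=pad, dilation=dilation,
                                   groups=in_ch, bias=False)
        self.pointwise = nn.Conv2d(in_ch, out_ch, 1, bias=False)
        self.bn = nn.BatchNorm2d(out_ch)
        self.act = nn.ReLU6(inplace=True)

    def forward(self, x):
        return self.act(self.bn(self.pointwise(self.depthwise(x))))
```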
Visual question answering (VQA) has recently been attracting attention in remote sensing. However, the proposed solutions remain limited in that existing VQA datasets address closed-ended question-answer queries, which do not necessarily reflect real open-ended scenarios. In this paper, we propose a new dataset, VQA-TextRS, built manually with human annotations and covering various forms of open-ended question-answer pairs. Moreover, we propose an encoder-decoder architecture built on transformers, whose self-attention enables relational learning across positions of the same sequence without the recurrence operations of typical sequence models. We employ vision and natural language processing (NLP) transformers, respectively, to draw visual and textual cues from the image and the corresponding question. A transformer decoder then applies cross-attention to fuse the two modalities; the fused vectors drive the answer-generation process to produce the final output. We demonstrate that plausible results can be obtained in open-ended VQA: for instance, the proposed architecture scores 84.01% accuracy on questions about the presence of objects in the query images.
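A minimal sketch of the fusion stage just described: visual patch tokens serve as decoder memory while the embedded question tokens pass through a standard transformer decoder, whose cross-attention fuses the two modalities. All dimensions and the vocabulary size are assumptions, and the causal mask a real answer generator would need is omitted for brevity.

```python
import torch
import torch.nn as nn

class CrossModalFusionDecoder(nn.Module):
    """Transformer decoder whose cross-attention fuses visual memory
    with textual queries; layer counts/dims are illustrative."""
    def __init__(self, d_model=512, nhead=8, num_layers=4, vocab_size=10000):
        super().__init__()
        layer = nn.TransformerDecoderLayer(d_model, nhead, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers)
        self.out = nn.Linear(d_model, vocab_size)  # answer-token logits

    def forward(self, text_tokens, visual_tokens):
        # text_tokens:   (B, T, d_model) embedded question/answer tokens
        # visual_tokens: (B, N, d_model) patch features from the vision encoder
        fused = self.decoder(tgt=text_tokens, memory=visual_tokens)
        return self.out(fused)
```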
Many image-to-image computer vision approaches have made great progress with end-to-end encoder-decoder frameworks. However, eye fixation prediction, though also an image-to-image task, differs from those tasks in that it focuses on salient regions rather than precise predictions for every pixel, so directly applying an end-to-end encoder-decoder to eye fixation prediction is inappropriate. In addition, although high-level features are important, the contribution of low-level features should also be kept and balanced in a computational model; yet low-level features that attract attention are easily lost while passing through a deep network. Effectively integrating low-level and high-level features to improve eye fixation prediction therefore remains challenging. In this paper, a coarse-to-fine network (CFN) comprising two pathways with different training strategies is proposed: the coarse perceiving network (CFN-Coarse), which can be a simple encoder or any existing pretrained network, captures the distribution of salient regions and generates high-quality feature maps; the fine integrating network (CFN-Fine) freezes the parameters from CFN-Coarse and combines features from deep to shallow in the deconvolution process, adding skip connections between the down-sampling and up-sampling paths to efficiently integrate deep and shallow features. The resulting saliency maps are evaluated on six standard benchmark datasets: SALICON, MIT1003, MIT300, Toronto, OSIE, and SUN500. The results demonstrate that the method surpasses the state-of-the-art accuracy of eye fixation prediction and achieves competitive performance to date under most evaluation metrics on the SALICON Saliency Prediction Challenge (LSUN2017).
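A minimal sketch of the CFN-Fine idea: a decoder that upsamples deep features and fuses them with shallower (frozen) encoder features through skip connections. The encoder feature shapes and channel widths are assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SkipDecoder(nn.Module):
    """Decoder fusing deep-to-shallow encoder features via skip
    connections, in the spirit of CFN-Fine; widths are illustrative."""
    def __init__(self, enc_channels=(64, 128, 256)):
        super().__init__()
        c1, c2, c3 = enc_channels
        self.up2 = nn.Conv2d(c3 + c2, c2, 3, padding=1)
        self.up1 = nn.Conv2d(c2 + c1, c1, 3, padding=1)
        self.head = nn.Conv2d(c1, 1, 1)   # single-channel saliency map

    def forward(self, f1, f2, f3):
        # f1, f2, f3: shallow -> deep encoder features (frozen in CFN-Fine)
        x = F.interpolate(f3, size=f2.shape[-2:], mode="bilinear", align_corners=False)
        x = F.relu(self.up2(torch.cat([x, f2], dim=1)))
        x = F.interpolate(x, size=f1.shape[-2:], mode="bilinear", align_corners=False)
        x = F.relu(self.up1(torch.cat([x, f1], dim=1)))
        return torch.sigmoid(self.head(x))
```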
Current SLAM (simultaneous localization and mapping) systems based on monocular cameras cannot directly obtain depth information; most suffer from scale uncertainty and require initialization. In application scenarios that require navigation and obstacle avoidance, the inability to build dense maps is a further drawback of monocular SLAM. In response, this paper proposes a monocular SLAM system that learns depth estimation with a DenseNet-based convolutional network. We use an encoder-decoder architecture based on transfer learning and convolutional neural networks to estimate depth from monocular RGB images. Combining front-end ORB feature extraction with back-end direct RGB-D bundle adjustment optimization, the system obtains accurate camera poses and achieves dense indoor mapping from the estimated depth. Experimental results show that the monocular depth estimation model achieves good results and is competitive with current popular methods. On this basis, the camera pose estimation error is also smaller than that of traditional monocular SLAM solutions, and the system can complete dense indoor reconstruction, forming a complete SLAM pipeline based on a monocular camera.
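A hedged sketch of the transfer-learning depth estimator: a pretrained DenseNet backbone as the encoder and a simple upsampling decoder regressing per-pixel depth. The DenseNet-169 choice and the decoder layout are assumptions (requires torchvision >= 0.13 for the weights API).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision import models

class DepthNet(nn.Module):
    """Encoder-decoder depth estimator: a pretrained DenseNet encoder
    (transfer learning) plus an illustrative upsampling decoder."""
    def __init__(self):
        super().__init__()
        backbone = models.densenet169(weights=models.DenseNet169_Weights.DEFAULT)
        self.encoder = backbone.features          # (B, 1664, H/32, W/32)
        self.decoder = nn.Sequential(
            nn.Conv2d(1664, 512, 3, padding=1), nn.ReLU(inplace=True),
            nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False),
            nn.Conv2d(512, 128, 3, padding=1), nn.ReLU(inplace=True),
            nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False),
            nn.Conv2d(128, 1, 3, padding=1),       # per-pixel depth prediction
        )

    def forward(self, rgb):
        depth = self.decoder(self.encoder(rgb))
        # upsample back to input resolution for dense mapping
        return F.interpolate(depth, size=rgb.shape[-2:], mode="bilinear",
                             align_corners=False)
```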
Images captured in low-brightness environments often suffer poor visibility and exhibit artifacts such as low brightness, low contrast, and color distortion. These artifacts not only degrade human visual perception but also reduce the performance of computer vision algorithms. Existing deep learning-based enhancement studies are quite slow and usually require substantial hardware, while lightweight approaches do not match the performance of state-of-the-art methods. We therefore propose LiCENt (Light Channel Enhancement Network), a fast and lightweight deep learning algorithm that enhances low-light images using the lightness channel of Hue Saturation Lightness (HSL) color space. LiCENt combines an autoencoder and a convolutional neural network (CNN) in a unified framework that first improves illumination and then recovers the details of the low-light image. Operating on the single lightness channel 'L' of HSL, instead of the traditional RGB channels, reduces the number of learnable parameters by a factor of up to 8.92. LiCENt also benefits from Brilliance Perception Adjustment, which lets the model avoid over-enhancement and color distortion. Experimental results demonstrate that the approach generalizes well to synthetic and natural low-light images and outperforms other methods on qualitative and quantitative metrics.
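A minimal sketch of the single-channel idea: convert RGB to HSL/HLS, enhance only the lightness channel, and recombine with the untouched hue and saturation. The `enhancer` callable is a placeholder for the trained autoencoder+CNN; the gamma-curve stand-in in the usage line is purely illustrative.

```python
import cv2
import numpy as np

def enhance_l_channel(rgb_image, enhancer):
    """Enhance only the lightness channel of HLS space, keeping hue and
    saturation intact, then convert back to RGB. `enhancer` stands in for
    LiCENt's trained network: any callable mapping a float L channel in
    [0, 1] to an enhanced L channel in [0, 1]."""
    hls = cv2.cvtColor(rgb_image, cv2.COLOR_RGB2HLS).astype(np.float32)
    L = hls[..., 1] / 255.0                  # OpenCV HLS: channel 1 is lightness
    hls[..., 1] = np.clip(enhancer(L), 0.0, 1.0) * 255.0
    return cv2.cvtColor(hls.astype(np.uint8), cv2.COLOR_HLS2RGB)

# usage with a trivial stand-in enhancer (gamma curve instead of the network):
# out = enhance_l_channel(img, lambda L: np.power(L, 0.6))
```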
Dehazing refers to removing haze and restoring detail in hazy images. In this paper, we propose ClarifyNet, a novel, end-to-end trainable convolutional neural network for single-image dehazing. We note that a high-pass filter detects sharp edges, texture, and other fine details, whereas a low-pass filter captures color and contrast information. Based on this observation, our key idea is to train ClarifyNet on ground-truth haze-free images together with their low-pass and high-pass filtered versions, using a shared-encoder, multi-decoder model with interconnected parallelization. During training, the haze-free, low-pass, and high-pass images undergo multi-stage filter fusion and attention, and a weighted loss composed of SSIM loss and L1 loss extracts and propagates complementary features. We comprehensively evaluate ClarifyNet on the I-HAZE, O-HAZE, Dense-Haze, NH-HAZE, SOTS-Indoor, SOTS-Outdoor, HSTS, and Middlebury datasets using PSNR and SSIM, comparing against previous works; on most datasets ClarifyNet achieves the highest scores. With EfficientNet-B6 as the backbone, ClarifyNet has 18M parameters (a model size of ~71 MB) and a throughput of 8 frames per second on images of size 2048 x 1024.
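A hedged sketch of the weighted loss named above: a convex combination of (1 - SSIM) and L1, with SSIM computed from average-pooled local statistics. The weighting (alpha) and the 3x3 window are assumptions, not the paper's values.

```python
import torch
import torch.nn.functional as F

def ssim(x, y, C1=0.01 ** 2, C2=0.03 ** 2):
    """Simplified SSIM with 3x3 average-pooled statistics (inputs in [0, 1])."""
    mu_x, mu_y = F.avg_pool2d(x, 3, 1, 1), F.avg_pool2d(y, 3, 1, 1)
    sigma_x = F.avg_pool2d(x * x, 3, 1, 1) - mu_x ** 2
    sigma_y = F.avg_pool2d(y * y, 3, 1, 1) - mu_y ** 2
    sigma_xy = F.avg_pool2d(x * y, 3, 1, 1) - mu_x * mu_y
    num = (2 * mu_x * mu_y + C1) * (2 * sigma_xy + C2)
    den = (mu_x ** 2 + mu_y ** 2 + C1) * (sigma_x + sigma_y + C2)
    return (num / den).mean()

def dehaze_loss(pred, target, alpha=0.85):
    """Weighted combination of SSIM loss and L1 loss; alpha is illustrative."""
    return alpha * (1.0 - ssim(pred, target)) + (1.0 - alpha) * F.l1_loss(pred, target)
```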
Existing skin attribute detection methods usually initialize with a network pre-trained on ImageNet and then fine-tune on the medical target task. However, we argue that such approaches are suboptimal because medical datasets differ greatly from ImageNet and often contain limited training samples. In this work, we propose Task Agnostic Transfer Learning (TATL), a novel framework motivated by dermatologists' behavior in the skincare context. TATL learns an attribute-agnostic segmenter that detects lesion skin regions and then transfers this knowledge to a set of attribute-specific classifiers, one per attribute. Since TATL's attribute-agnostic segmenter only detects skin attribute regions, it enjoys ample data from all attributes, allows knowledge transfer among features, and compensates for the scarcity of training data for rare attributes. We conduct extensive experiments evaluating the proposed transfer learning mechanism with various neural network architectures on two popular skin attribute detection benchmarks. The empirical results show that TATL not only works well with multiple architectures but also achieves state-of-the-art performance, while keeping model and computational complexity minimal. We also provide theoretical insights into why our transfer learning framework performs well in practice.
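A minimal sketch of the transfer step described above: the attribute-agnostic segmenter's encoder weights initialize each attribute-specific model, while the decoder is re-initialized per attribute. The `.encoder`/`.decoder` attribute names and the attribute list are hypothetical, not TATL's actual interface.

```python
import copy
import torch.nn as nn

def build_attribute_models(agnostic_segmenter: nn.Module, attributes):
    """Initialize one attribute-specific model per skin attribute from the
    attribute-agnostic segmenter (illustrative TATL-style transfer).
    Assumes the segmenter exposes .encoder and .decoder submodules."""
    models = {}
    for name in attributes:
        model = copy.deepcopy(agnostic_segmenter)
        # keep the transferred encoder; re-initialize the decoder so each
        # attribute-specific head is trained for its own target
        for m in model.decoder.modules():
            if isinstance(m, (nn.Conv2d, nn.Linear)):
                nn.init.kaiming_normal_(m.weight, nonlinearity="relu")
                if m.bias is not None:
                    nn.init.zeros_(m.bias)
        models[name] = model
    return models

# usage sketch (attribute names are hypothetical):
# per_attr = build_attribute_models(segmenter, ["pigment_network", "streaks"])
```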
As the integration density and design intricacy of semiconductor wafers increase, so do the magnitude and complexity of their defects. Since manual inspection of wafer defects is costly, an automated artificial intelligence (AI) based computer-vision approach is highly desirable. Previous work on defect analysis has several limitations, such as low accuracy and the need for separate models for classification and segmentation; for mixed-type defects, some prior approaches require training one model per defect type, which does not scale. In this paper, we present WaferSegClassNet (WSCN), a novel network based on an encoder-decoder architecture. WSCN performs simultaneous classification and segmentation of both single and mixed-type wafer defects, using a shared encoder for both tasks, which allows training WSCN end-to-end. We use an N-pair contrastive loss to first pretrain the encoder, then a BCE-Dice loss for segmentation and a categorical cross-entropy loss for classification; the N-pair contrastive loss yields a better embedding of wafer maps in the latent space. WSCN has a model size of only 0.51 MB and needs only 0.2M FLOPs, making it much lighter than other state-of-the-art models, and it converges in only 150 epochs, compared to the 4,000 epochs needed by a previous work. We evaluate the model on the MixedWM38 dataset of 38,015 images, achieving an average classification accuracy of 98.2% and a Dice coefficient of 0.9999. We are the first to report segmentation results on MixedWM38. The source code is available at https://***/ckmvigil/WaferSegClassNet.
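A minimal sketch of the shared-encoder design with its two heads and combined losses; channel widths, head layouts, and the loss weighting are illustrative assumptions, not WSCN's released architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SharedEncoderNet(nn.Module):
    """Shared encoder feeding both a segmentation decoder and a
    classification head (illustrative stand-in for WSCN)."""
    def __init__(self, num_classes=38):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(inplace=True), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(inplace=True), nn.MaxPool2d(2),
        )
        self.seg_head = nn.Sequential(
            nn.Upsample(scale_factor=4, mode="bilinear", align_corners=False),
            nn.Conv2d(32, 1, 1),                    # defect-mask logits
        )
        self.cls_head = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, num_classes),
        )

    def forward(self, x):
        z = self.encoder(x)                         # shared representation
        return self.seg_head(z), self.cls_head(z)

def bce_dice_loss(logits, mask, eps=1.0):
    """BCE + soft-Dice loss for the segmentation branch."""
    bce = F.binary_cross_entropy_with_logits(logits, mask)
    p = torch.sigmoid(logits)
    dice = (2 * (p * mask).sum() + eps) / (p.sum() + mask.sum() + eps)
    return bce + (1.0 - dice)

# total loss per batch (after contrastive pretraining of the encoder):
# loss = bce_dice_loss(seg_logits, mask) + F.cross_entropy(cls_logits, label)
```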