Digital X-ray radiography is widely used in clinical diagnosis, and high-quality chest X-ray images help clinicians diagnose diseases accurately. However, quality assessment of chest X-ray images still depends mainly on doctors' subjective evaluation, whose results vary with the evaluators' skill and experience and which carries a heavy workload and many sources of uncertainty. In this paper, we propose a chest X-ray quality assessment method that combines image-text contrastive learning with the fusion of medical domain knowledge. Starting from a model pretrained on contrastive text-image pairs, we fuse large-scale real clinical chest X-rays with the text of their diagnostic reports and fine-tune the model to achieve cross-domain transfer learning. This improves the algorithm's prediction accuracy while avoiding the cost of annotating massive amounts of sample data. Local visual patch features of the X-ray images are aligned with multiple text features so that the visual features capture more fine-grained image information. Theoretical analysis and experimental results show that the contrastive learning algorithm, which fuses triples from a medical knowledge graph with multi-modal chest X-ray data, achieves good accuracy. In addition, the proposed method can be easily extended to other tasks such as multi-lesion segmentation in medical images and disease progression prediction.
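For readers unfamiliar with the image-text contrastive objective this abstract builds on, the following is a minimal sketch of a CLIP-style symmetric InfoNCE loss over paired image/text embeddings. The function name, tensor shapes, and temperature value are illustrative assumptions, not details taken from the paper.

```python
import torch
import torch.nn.functional as F

def itc_loss(image_feats: torch.Tensor, text_feats: torch.Tensor,
             temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss over a batch of paired image/text embeddings.

    image_feats, text_feats: (batch, dim) projections from the two encoders.
    Matched (image_i, text_i) pairs are treated as positives; all other
    in-batch combinations serve as negatives.
    """
    # Cosine similarity via L2-normalized embeddings.
    image_feats = F.normalize(image_feats, dim=-1)
    text_feats = F.normalize(text_feats, dim=-1)
    logits = image_feats @ text_feats.t() / temperature  # (batch, batch)
    targets = torch.arange(logits.size(0), device=logits.device)
    # Pull matched pairs together from both directions (image-to-text
    # and text-to-image), as is standard for contrastive pretraining.
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2
```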
ISBN:
(Print) 9783031417306; 9783031417313
Employing a dictionary can efficiently rectify deviations between the visual prediction and the ground truth in scene text recognition methods. However, because the dictionary is independent of the visual features, it may incorrectly rectify predictions that were already accurate. In this paper, we propose a new dictionary language model built on a Scene Image-Text Matching (SITM) network, which avoids the drawbacks of an explicit dictionary language model: 1) independence from the visual features; 2) noisy choices among candidates. The SITM network accomplishes this by using image-text contrastive (ITC) learning to match an image with its corresponding text among the candidates at the inference stage. ITC is widely used in vision-language learning to pull positive image-text pairs closer in feature space. Inspired by ITC, the SITM network combines the visual features with the text features of all candidates and identifies the candidate with the minimum distance in the feature space. Our lexicon method achieves better results (93.8% accuracy) than the ordinary method (92.1% accuracy) on six mainstream benchmarks. Additionally, we integrate our method with ABINet and establish new state-of-the-art results on several benchmarks.
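As a rough illustration of the inference-stage matching this abstract describes, the sketch below scores each dictionary candidate's text embedding against the image embedding and returns the closest one. All names and shapes here are hypothetical; the actual SITM architecture and feature extraction are defined in the paper, not reproduced here.

```python
import torch
import torch.nn.functional as F

def select_candidate(image_feat: torch.Tensor,
                     candidate_feats: torch.Tensor,
                     candidates: list[str]) -> str:
    """Pick the candidate string whose text embedding lies closest to the
    image embedding (equivalently, has the highest cosine similarity).

    image_feat: (dim,) embedding of the scene-text image.
    candidate_feats: (num_candidates, dim) embeddings of the dictionary
    candidates produced by a text encoder.
    """
    # Broadcast the single image embedding against every candidate.
    sims = F.cosine_similarity(image_feat.unsqueeze(0),
                               candidate_feats, dim=-1)
    # Maximum cosine similarity corresponds to minimum feature distance.
    return candidates[int(sims.argmax())]
```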