The emergence of vision-language models, particularly Contrastive Language-Image Pre-Training (CLIP), has significantly improved performance on numerous visual tasks, demonstrating notable zero-shot transfer abilities. CLIP's remarkable generalization offers substantial innovation potential for smart manufacturing and public-safety surveillance, potentially accelerating the advancement of Industry 5.0. However, most current research focuses on public datasets, with limited investigation into complex industrial scenarios. The semantic structures and image quality of these scenarios differ significantly from the datasets used to train CLIP, limiting its effectiveness in industrial applications. This paper presents Context-Aware Masked CLIP (CAM-CLIP), a framework for high-performance pixel-level semantic parsing of complex industrial scenarios under few-shot conditions. The framework autonomously detects and identifies objects in industrial scenes from textual descriptions, enhancing safety monitoring and anomaly detection. We constructed a dedicated dataset using offshore drilling platforms as a case study and conducted empirical validation. Results demonstrate that CAM-CLIP achieves an mIoU of 80.7 in pixel-level semantic parsing of offshore drilling platforms with a limited sample size, outperforming state-of-the-art methods by 8.47 mIoU. This study extends CLIP's applicability to industrial settings and offers a model for future implementations, advancing semantic parsing in industrial scenarios and promoting the development of intelligent, interpretable systems.
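To make the text-driven parsing idea concrete, the sketch below shows a generic way to classify candidate region masks with CLIP: each masked crop is embedded with the image encoder and matched against text-prompt embeddings by cosine similarity. This is a minimal illustration assuming OpenAI's `clip` package and externally supplied binary masks; the `classify_masks` helper and the label prompts are hypothetical, and the abstract does not describe CAM-CLIP's actual architecture.

```python
# Illustrative sketch only: scoring candidate masks against text prompts with CLIP.
# Assumes the OpenAI `clip` package (pip install git+https://github.com/openai/CLIP)
# and binary masks produced elsewhere; this is not CAM-CLIP's published method.
import numpy as np
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Hypothetical textual descriptions of industrial objects of interest.
labels = [
    "a derrick on an offshore drilling platform",
    "a crane",
    "a worker wearing a safety helmet",
    "background structure",
]
text_tokens = clip.tokenize(labels).to(device)

@torch.no_grad()
def classify_masks(image: Image.Image, masks):
    """Assign each binary mask a label via CLIP similarity of the masked crop.

    `masks` is an iterable of (H, W) boolean numpy arrays over the image.
    """
    text_feat = model.encode_text(text_tokens)
    text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)
    results = []
    for mask in masks:
        # Zero out pixels outside the mask, then crop to its bounding box.
        arr = np.array(image).copy()
        arr[~mask] = 0
        ys, xs = np.where(mask)
        crop = Image.fromarray(arr[ys.min():ys.max() + 1, xs.min():xs.max() + 1])
        # Embed the crop and pick the most similar text prompt.
        img_feat = model.encode_image(preprocess(crop).unsqueeze(0).to(device))
        img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
        sims = (img_feat @ text_feat.T).squeeze(0)
        results.append(labels[sims.argmax().item()])
    return results
```

In this simplified form the mask proposals and the prompt wording carry most of the burden; a context-aware method such as the one the abstract describes would presumably refine both, but those details are not given here.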