ISBN (print): 9798400701245
Multimodal human understanding and analysis is an emerging research area that cuts across several disciplines, including computer vision (CV), natural language processing (NLP), speech processing, human-computer interaction (HCI), and multimedia. Several multimodal learning techniques have recently shown the benefit of combining multiple modalities in image-text, audio-visual, and video representation learning and in various downstream multimodal tasks. At their core, these methods focus on modelling the modalities and their complex interactions by using large amounts of data, different loss functions, and deep neural network architectures. However, many Web and social media applications need to model the human, which includes understanding human behaviour and perception. For this, it becomes important to consider interdisciplinary approaches, including social sciences, semiotics, and psychology. The core challenges are understanding various cross-modal relations, quantifying biases such as social biases, and assessing the applicability of models to real-world problems. Interdisciplinary theories such as semiotics or gestalt psychology can provide additional insights and analysis on perceptual understanding through signs and symbols via multiple modalities. In general, these theories provide a compelling view of multimodality and perception that can further expand computational research and multimedia applications on the Web and social media. The theme of the MUWS workshop, multimodal human understanding, includes various interdisciplinary challenges related to social bias analyses, multimodal representation learning, detection of human impressions or sentiment, hate speech, sarcasm in multimodal data, multimodal rhetoric and semantics, and related topics. The MUWS workshop will be an interactive event and will include keynotes by relevant experts, poster and demo sessions, research presentations, and discussion.
ISBN (digital): 9798331506520
ISBN (print): 9798331506537
This research introduces "Jaddah," an AI-based system for the automated detection of road infrastructure defects using computer vision and machine learning techniques. The system addresses the limitations of traditional road inspection methods, which are often slow and prone to human error. Jaddah provides a mobile application that detects, classifies, and segments road defects at the pixel level. The model is trained on a comprehensive dataset of high-resolution images. The YOLOv8-seg model is used for defect localization and segmentation, so that road defects are identified and categorized accurately. The model reaches 87% mAP50, which indicates reliable defect detection. These results contribute to improved infrastructure maintenance, enhanced road safety, and greater operational efficiency.
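As background on the reported metric: mAP50 counts a predicted instance as correct when its intersection over union (IoU) with a ground-truth instance is at least 0.5. The sketch below illustrates only that matching criterion on toy binary masks. The function names and the masks are illustrative assumptions and are not taken from Jaddah or YOLOv8-seg; a full mAP computation additionally averages precision over recall levels and classes.

```python
def mask_iou(pred, truth):
    """Intersection over union of two equally sized binary masks (lists of rows)."""
    inter = 0
    union = 0
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            if p and t:
                inter += 1
            if p or t:
                union += 1
    # Two empty masks have no overlap to measure; report 0.0 by convention here.
    return inter / union if union else 0.0


def is_match_at_50(pred, truth):
    """True if the prediction would count as a hit at the IoU >= 0.5 threshold."""
    return mask_iou(pred, truth) >= 0.5


# Toy 3x4 masks (hypothetical data): 1 marks a defect pixel.
truth_mask = [
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [0, 0, 0, 0],
]
pred_mask = [
    [0, 1, 1, 1],
    [0, 1, 0, 0],
    [0, 0, 0, 0],
]

iou = mask_iou(pred_mask, truth_mask)  # 3 shared pixels / 5 pixels in the union
```

With these toy masks the prediction shares three pixels with the ground truth out of five in the union, so it passes the 0.5 threshold.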
ISBN (print): 9781665483605
In recent years, traditional image processing techniques have been complemented by novel tools that can handle problems that are difficult for classical vision algorithms. For example, classical image processing algorithms (measurement, feature detection, and many others) require a controlled environment to operate correctly, because illumination, target positioning, and vibration can all influence the scene. Machine learning approaches, on the other hand, have made image processing feasible in non-controlled environments as well. One such application is a household-level leak detector based on processing pictures of a mechanical water meter dial. The proposed research investigates a deep learning approach to detect the minimal movement of the water meter needles caused by water leakage. In particular, a CNN was trained to correlate successive differences between water meter dial images taken with an applied calibrated water flow. From this analysis, it is possible to detect the absence of periods with null consumption and thus detect small water losses.
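The decision logic in the last sentence can be sketched as follows. This is a minimal illustration, not the authors' method: a plain threshold on the mean absolute frame difference stands in for the trained CNN, and the function names, threshold, and window length are all assumptions chosen for the example.

```python
def frame_difference(prev, curr):
    """Mean absolute pixel difference between two equally sized grayscale frames."""
    total = 0
    count = 0
    for prev_row, curr_row in zip(prev, curr):
        for a, b in zip(prev_row, curr_row):
            total += abs(a - b)
            count += 1
    return total / count if count else 0.0


def has_null_consumption_period(frames, motion_threshold, min_still_pairs):
    """True if at least `min_still_pairs` consecutive frame pairs show no movement."""
    still_run = 0
    for prev, curr in zip(frames, frames[1:]):
        if frame_difference(prev, curr) < motion_threshold:
            still_run += 1
            if still_run >= min_still_pairs:
                return True
        else:
            still_run = 0  # the needle moved, so the still period is interrupted
    return False


def leak_suspected(frames, motion_threshold=1.0, min_still_pairs=3):
    """A leak is suspected when the dial never rests for a long enough period."""
    return not has_null_consumption_period(frames, motion_threshold, min_still_pairs)


# Hypothetical 2x2 "dial images": one sequence at rest, one drifting slowly.
resting_frames = [[[10, 10], [10, 10]] for _ in range(5)]
drifting_frames = [[[10 + 5 * i, 10], [10, 10]] for i in range(5)]
```

In this toy data the resting sequence contains a still period and raises no suspicion, while the drifting sequence changes in every frame pair and is flagged.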
ISBN (print): 9781665462839
State-of-the-art approaches in computer vision rely heavily on sufficiently large training datasets. For real-world applications, obtaining such a dataset is usually a tedious task. In this paper, we present a fully automated pipeline to generate a synthetic dataset for instance segmentation in four steps. In contrast to existing work, our pipeline covers every step from data acquisition to the final dataset. We first scrape images of the objects of interest from popular image search engines. Since we rely only on text-based queries, the resulting data comprises a wide variety of images. Hence, image selection is necessary as a second step. This approach of image scraping and selection relaxes the need for a real-world domain-specific dataset that must be either publicly available or created for this purpose. We employ an object-agnostic background removal model and compare three different methods for image selection: object-agnostic pre-processing, manual image selection, and CNN-based image selection. In the third step, we generate random arrangements of the object of interest and distractors on arbitrary backgrounds. Finally, the images are composed by pasting the objects using four different blending methods. We present a case study for our dataset generation approach by considering parcel segmentation. For the evaluation, we created a dataset of parcel photos that were annotated automatically. We find that (1) our dataset generation pipeline allows a successful transfer to real test images (Mask AP 86.2), (2) a very accurate image selection process, in contrast to human intuition, is not crucial and a broader category definition can help to bridge the domain gap, and (3) the use of blending methods is beneficial compared to simple copy-and-paste. We made our full code for scraping, image composition, and training publicly available at https://***/parcel2d.
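The composition step can be illustrated with a small sketch. The abstract does not name its four blending methods, so plain alpha blending is used here purely as an assumed stand-in, and the function name, the grayscale list-of-rows image format, and the 0.5 mask cutoff are hypothetical choices for the example rather than details of the published pipeline.

```python
def paste_object(background, obj, alpha, top, left):
    """Alpha-blend `obj` onto a copy of `background` at (top, left).

    Returns the composed image and a binary instance mask of the same size.
    `alpha` holds per-pixel opacity in [0, 1], e.g. from background removal.
    """
    image = [row[:] for row in background]
    mask = [[0] * len(row) for row in background]
    for i, (obj_row, alpha_row) in enumerate(zip(obj, alpha)):
        for j, (value, a) in enumerate(zip(obj_row, alpha_row)):
            y, x = top + i, left + j
            # Pixels that fall outside the background are simply skipped.
            if 0 <= y < len(image) and 0 <= x < len(image[y]):
                image[y][x] = a * value + (1.0 - a) * image[y][x]
                if a > 0.5:  # assumed cutoff for the automatic annotation
                    mask[y][x] = 1
    return image, mask


# Hypothetical data: a 3x3 black background and a 1x2 object with a soft edge.
background = [[0.0, 0.0, 0.0] for _ in range(3)]
obj = [[100.0, 100.0]]
alpha = [[1.0, 0.5]]

composed, instance_mask = paste_object(background, obj, alpha, top=1, left=1)
```

Because the mask is produced in the same pass as the pasted pixels, the annotation comes for free, which is the property that lets a synthetic pipeline of this kind skip manual labelling.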
The proceedings contain 154 papers. The topics discussed include: feature-driven 3D range geometry compression via spatially-aware depth encoding; open source deep learning inference libraries for autonomous driving systems; problems in image target-based color correction; improvement of aerial image by simulations; recognition-aware learned image compression; artist-specific style transfer for semantic segmentation of paintings: the value of large corpora of surrogate artworks; data visualization of crime data using immersive virtual reality; a comparison of non-experts and experts using DSIS method; contrast enhancement: cross-modal learning approach for medical images; a continuous bitstream-based blind video quality assessment using multi-layer perceptron; correspondences for image and video reconstruction; design and analysis on low-power and low-noise single slope ADC for digital pixel sensors; incremental two-network approach to develop a purity analyzer system for canola seeds; advantage of machine learning over maximum likelihood in limited-angle low-photon x-ray tomography; image montage detection based on image segmentation and robust hashing techniques; chatbot integrated with machine learning deployed in the cloud and performance evaluation; and the relationship between vision and simulated remote vision system air refueling performance.
ISBN (digital): 9798350355413
ISBN (print): 9798350355420
The prevalence of hallucinations in responses generated by large language models (LLMs) poses significant challenges for the reliability of natural language processing applications. This study addresses the detection of such hallucinations through an enhanced RoBERTa-base model, specifically targeting hallucinated responses produced by the Mistral 7B Instruct model. By implementing Low-Rank Adaptation (LoRA) for fine-tuning and incorporating hierarchical multi-head attention and multi-level self-attention weighting mechanisms, we aim to improve both the accuracy of hallucination detection and the interpretability of the model’s decisions. Our experimental results demonstrate that the proposed model significantly outperforms baseline models across various metrics, including accuracy, precision, recall, and area under the curve (AUC). Future research will explore the integration of larger-scale models and additional fine-tuning techniques to further strengthen the model’s capacity for detecting hallucinations, thereby enhancing the reliability of LLM outputs.
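For readers unfamiliar with LoRA, the sketch below shows the general idea on a single linear layer: the pretrained weight matrix W stays frozen, and a trainable low-rank product B·A, scaled by alpha / r, is added to its output. This is a generic illustration in plain Python, not the paper's implementation; the matrix sizes, values, and function names are assumptions made for the example.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def lora_forward(W, A, B, x, alpha):
    """Compute W x + (alpha / r) * B (A x), where r is the rank (rows of A).

    W: frozen pretrained weights, shape (d_out, d_in)
    A: trainable down-projection, shape (r, d_in)
    B: trainable up-projection, shape (d_out, r)
    """
    r = len(A)
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    scale = alpha / r
    return [b + scale * u for b, u in zip(base, update)]


# Hypothetical toy layer: d_in = 3, d_out = 2, rank r = 1.
W = [[1.0, 0.0, 0.0],
     [0.0, 1.0, 0.0]]
A = [[1.0, 1.0, 1.0]]
B = [[0.5], [-0.5]]
x = [1.0, 2.0, 3.0]

adapted = lora_forward(W, A, B, x, alpha=2.0)

# With B set to zero, as at the start of LoRA training, the layer is unchanged.
B_zero = [[0.0], [0.0]]
unchanged = lora_forward(W, A, B_zero, x, alpha=2.0)
```

Only A and B are updated during fine-tuning, so the number of trainable parameters per layer drops from d_out × d_in to r × (d_in + d_out).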
Multilevel thresholding plays a crucial role in image processing, with extensive applications in object detection, machine vision, medical imaging, and traffic control systems. It entails the partitioning of an image ...
Recent Compositional Zero-Shot Learning (CZSL) methods increasingly adopt the pre-trained vision-language models to capture the contextual relations between image and text spaces. However, the single-class-token desig...
ISBN (digital): 9798331530334
ISBN (print): 9798331530341
Image caption generation aims to automatically produce natural language descriptions that match the content of images. It integrates machine vision and natural language processing and has significant theoretical and practical value. Inspired by top-down attention mechanisms, this paper proposes an attention model that uses the output of pretrained object detection networks as prior knowledge about the image to guide the generation of natural language descriptions. By directly feeding the object detection results as attention inputs into the text generation network, the model focuses on the key descriptive regions of an image, which significantly improves performance. On public Chinese image captioning datasets, the model shows substantial advantages in metrics such as BLEU-4 and METEOR.
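The idea of attending over detected regions can be sketched as follows. The abstract does not specify the scoring function, so dot-product scoring followed by a softmax is assumed here for illustration; the function names and the toy feature vectors are hypothetical and do not come from the paper.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, region_features):
    """Weight detected-region features by their similarity to the decoder query.

    Returns the attention weights and the weighted sum of region features,
    which a caption decoder could use as context for the next word.
    """
    scores = [sum(q * f for q, f in zip(query, feat)) for feat in region_features]
    weights = softmax(scores)
    dim = len(region_features[0])
    context = [
        sum(w * feat[d] for w, feat in zip(weights, region_features))
        for d in range(dim)
    ]
    return weights, context


# Hypothetical features for three detected regions and one decoder query.
regions = [
    [1.0, 0.0],
    [0.0, 1.0],
    [0.5, 0.5],
]
query = [2.0, 0.0]

weights, context = attend(query, regions)
```

Here the first region is the most similar to the query and therefore receives the largest weight, which is how the decoder ends up focusing on the regions most relevant to the word being generated.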