ISBN (print): 9783031717062; 9783031717079
The increasing use of Extended Reality (XR) brings a need for more advanced Human-Computer Interaction (HCI) technologies that allow for intuitive and robust collaboration. However, challenges arise in ensuring that interactions are both natural and effective, particularly when incorporating flexible verbal communication. Prior research has explored many multi-modal interaction technologies, yet there remains a need for a framework that enables human-computer communication specifically in XR applications. This work proposes a novel framework that incorporates advanced Natural Language Processing (NLP) to handle flexible verbal communication while ensuring that computer interpretation remains robust. Drawing on prior research that identified limitations of the information conveyed in XR visuals and challenges in using widely available NLP tools, the proposed framework uses a rule-enforced command extraction pipeline to ensure consistent, human-like processing while preserving the flexibility of more open-domain verbal dialogue. The design employs a sequence of NLP techniques for spoken-utterance analysis, language-specific rules to extract usable game commands, and verification of the structured command against a user-reinforced hypergraph data store. The framework and its processing pipeline were evaluated using a wide range of human-like phrasings of commands deemed representative of verbalized instructions for human-computer collaboration in an XR context. Framework performance was then compared against a Large Language Model (LLM) customized to extract commands using the same rules. These results showcase the potential of the proposed framework through demonstrated examples and early results, while also noting areas where performance still needs to be improved.
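To make the described pipeline more concrete, the following is a minimal, hypothetical sketch of a rule-enforced command extraction step, assuming spaCy for dependency parsing. The action/object vocabularies and the simple allow-list standing in for the user-reinforced hypergraph are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of a rule-enforced command extraction step (not the
# paper's implementation). Assumes spaCy for dependency parsing and replaces
# the user-reinforced hypergraph store with plain allow-lists for brevity.
import spacy

nlp = spacy.load("en_core_web_sm")

# Assumed vocabularies; in the framework these would come from the XR
# application and the hypergraph store, not hard-coded sets.
KNOWN_ACTIONS = {"move", "grab", "place", "rotate"}
KNOWN_OBJECTS = {"cube", "lever", "panel"}

def extract_command(utterance: str):
    """Map a spoken utterance to a structured (action, target) command,
    or return None if the rules reject it."""
    doc = nlp(utterance)
    for token in doc:
        # Rule 1: the command head must be a verb whose lemma is a known action.
        if token.pos_ == "VERB" and token.lemma_ in KNOWN_ACTIONS:
            # Rule 2: the verb must govern a direct object naming a known entity.
            targets = [c.lemma_ for c in token.children
                       if c.dep_ == "dobj" and c.lemma_ in KNOWN_OBJECTS]
            if targets:
                return {"action": token.lemma_, "target": targets[0]}
    return None  # verification failed: no consistent command found

print(extract_command("Could you grab the cube for me?"))
# expected: {'action': 'grab', 'target': 'cube'} (parse may vary by model version)
```

Per the abstract, the verification stage would consult the hypergraph store rather than fixed sets, and the resulting structured command would then be handed to the XR application.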
Outcome prediction is crucial for head and neck cancer patients as it can provide prognostic information for early treatment planning. Radiomics methods have been widely used for outcome prediction from medical images...
Real-world Person Re-Identification (Re-ID) presents severe challenges like occlusions and clothing changes, making traditional Re-ID methods fail. Existing occluded Re-ID methods struggle with cloth-changing scenario...
ISBN (print): 9789819620708; 9789819620715
Contrastively pretrained vision-language models (VLMs) such as CLIP have shown impressive zero-shot classification performance without any classification-specific training. They create a common embedding space by contrastively pretraining an image encoder and a text encoder to align positive image-text pairs and repel negative pairs. Zero-shot classification of an image can then be performed by measuring the cosine similarities between the image embedding and the embeddings of texts that describe the classes. However, related works do not address the scenario in which a few image examples are available for some (but not all) classes. In this novel task, which we term variable-shot (v-shot) classification, these models fail due to the modality gap in the embedding space, i.e., the fact that image-to-image similarities are higher than image-to-text similarities. To address this, we propose to enable v-shot capabilities in pretrained VLMs with minimal training complexity by re-projecting the embeddings of frozen pretrained image encoders using a shallow network, RectNet, which we train with both the standard CLIP contrastive loss and a novel modality alignment loss specifically constructed to bridge the modality gap. Finally, we introduce three v-shot classification benchmarks, on which the proposed architecture achieves increases of 32.22%, 29.58% and 45.15% in top-1 classification accuracy, respectively.
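As a rough illustration of the v-shot setup described above (a reading of the abstract, not the authors' code), the sketch below classifies a query image against a mixed set of prototypes: text embeddings for classes without examples and re-projected image prototypes for classes with a few shots. The RectNetSketch architecture, dimensions, and random stand-in embeddings are assumptions; in practice the embeddings would come from a frozen CLIP encoder, and the head would be trained with the contrastive and modality-alignment losses mentioned above (not shown here).

```python
# Minimal sketch of v-shot classification by cosine similarity. Embeddings are
# random stand-ins for outputs of a frozen CLIP image/text encoder; "RectNet"
# is represented by an assumed small MLP since its architecture is not given here.
import torch
import torch.nn as nn
import torch.nn.functional as F

class RectNetSketch(nn.Module):
    """Shallow re-projection head applied to frozen image embeddings."""
    def __init__(self, dim: int = 512, hidden: int = 512):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU(), nn.Linear(hidden, dim))

    def forward(self, x):
        return F.normalize(self.net(x), dim=-1)  # unit-norm re-projected embeddings

def v_shot_logits(image_emb, text_protos, image_protos, rectnet):
    """Cosine-similarity logits against a mixed prototype set: text embeddings
    for classes without examples, re-projected mean image embeddings for
    classes that have a few shots."""
    query = rectnet(image_emb)                     # re-project the query image
    protos = torch.cat([F.normalize(text_protos, dim=-1),
                        rectnet(image_protos)], dim=0)
    return query @ protos.T                        # higher = more similar

# Toy usage: dim 512, 3 text-only classes, 2 classes with image prototypes.
rectnet = RectNetSketch()
logits = v_shot_logits(torch.randn(1, 512), torch.randn(3, 512),
                       torch.randn(2, 512), rectnet)
print(logits.argmax(dim=-1))  # predicted class index among the 5 prototypes
```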
Multi-face tracking (MFT) is a subtask of multi-object tracking (MOT) that focuses on detecting and tracking multiple faces across video frames. Modern MOT trackers adopt the Kalman filter (KF), a linear model that es...
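Since the snippet above is truncated, only the Kalman-filter motion model itself can be illustrated. The sketch below is a generic constant-velocity KF predict/update cycle of the kind used in SORT-style trackers, not this paper's method, and it tracks a 2-D point rather than a full bounding-box state; the noise settings are assumptions.

```python
# Generic constant-velocity Kalman filter for a 2-D detection (illustrative only).
import numpy as np

dt = 1.0                                   # one frame per step
F = np.array([[1, 0, dt, 0],               # state transition over [x, y, vx, vy]
              [0, 1, 0, dt],
              [0, 0, 1,  0],
              [0, 0, 0,  1]], dtype=float)
H = np.array([[1, 0, 0, 0],                # only the (x, y) position is observed
              [0, 1, 0, 0]], dtype=float)
Q = np.eye(4) * 1e-2                       # process noise (assumed)
R = np.eye(2) * 1e-1                       # measurement noise (assumed)

x = np.zeros((4, 1))                       # initial state
P = np.eye(4)                              # initial covariance

def kf_step(x, P, z):
    """One predict + update cycle given a new detection z = [[x], [y]]."""
    # Predict
    x = F @ x
    P = F @ P @ F.T + Q
    # Update
    y = z - H @ x                          # innovation
    S = H @ P @ H.T + R
    K = P @ H.T @ np.linalg.inv(S)         # Kalman gain
    x = x + K @ y
    P = (np.eye(4) - K @ H) @ P
    return x, P

x, P = kf_step(x, P, np.array([[5.0], [3.0]]))
print(x.ravel())                           # estimated position and velocity
```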
In a survey, many questions and candidate answers are listed, and participants select one of the answers for each question. Sometimes, participants may not be willing to select their preferred item among the answers be...
COVID-19 has triggered an intense worldwide search for low molecular weight inhibitors of a number of target proteins of the SARS-CoV-2 coronavirus responsible for the pandemic. However, the problem of creating an eff...
The updated text of the Artificial Intelligence Act (AI Act) has raised some questions about tools using Artificial Intelligence in different fields. In particular, the specimen of AI technology for medical ...
Readability assessment for book-level long texts is widely needed in real educational applications. However, most current research focuses on passage-level readability assessment, and little work has been done to...
This paper introduces the online reasoning platform InfOCF-Web 2.0 that provides easy access to implementations of various inference methods for conditional belief bases. We present an overview of the realization...