The performance of CLIP on the dynamic facial expression recognition (DFER) task does not match the exceptional results it achieves on other CLIP-based classification tasks. While CLIP’s primary objective is to achieve align...
We study the problem of leveraging the syntactic structure of text to enhance pre-trained models such as BERT and RoBERTa. Existing methods utilize syntax of text either in the pre-training stage or in the fine-tuning...
ISBN: (Print) 9781713871088
Traditional machine learning follows a close-set assumption that the training and test set share the same label space. While in many practical scenarios, it is inevitable that some test samples belong to unknown classes (open-set). To fix this issue, Open-Set Recognition (OSR), whose goal is to make correct predictions on both close-set samples and open-set samples, has attracted rising attention. In this direction, the vast majority of literature focuses on the pattern of open-set samples. However, how to evaluate model performance in this challenging task is still unsolved. In this paper, a systematic analysis reveals that most existing metrics are essentially inconsistent with the aforementioned goal of OSR: (1) For metrics extended from close-set classification, such as Open-set F-score, Youden's index, and Normalized Accuracy, a poor open-set prediction can escape from a low performance score with a superior close-set prediction. (2) Novelty detection AUC, which measures the ranking performance between close-set and open-set samples, ignores the close-set performance. To fix these issues, we propose a novel metric named OpenAUC. Compared with existing metrics, OpenAUC enjoys a concise pairwise formulation that evaluates open-set performance and close-set performance in a coupling manner. Further analysis shows that OpenAUC is free from the aforementioned inconsistency properties. Finally, an end-to-end learning method is proposed to minimize the OpenAUC risk, and the experimental results on popular benchmark datasets speak to its effectiveness.
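The pairwise formulation described above can be illustrated with a short NumPy sketch (not the authors' code; the openness score and inputs are hypothetical): each (close-set, open-set) pair counts only when the close-set sample is both classified correctly and ranked as less "open" than the open-set sample, so a strong close-set prediction cannot mask a poor open-set ranking.

```python
import numpy as np

def open_auc(close_scores, close_correct, open_scores):
    """OpenAUC-style pairwise metric (illustrative sketch).

    close_scores  : openness scores of close-set samples (higher = more "open")
    close_correct : bool per close-set sample, True if classified correctly
    open_scores   : openness scores of open-set samples
    """
    close_scores = np.asarray(close_scores, dtype=float)
    open_scores = np.asarray(open_scores, dtype=float)
    close_correct = np.asarray(close_correct, dtype=bool)
    # pairwise: close-set sample ranked as less "open" than the open-set sample
    ranked = close_scores[:, None] < open_scores[None, :]
    # a pair counts only if the close-set sample is ALSO classified correctly
    hits = ranked & close_correct[:, None]
    return hits.sum() / (len(close_scores) * len(open_scores))

# perfect close-set accuracy AND perfect ranking -> 1.0
print(open_auc([0.1, 0.2], [True, True], [0.8, 0.9]))  # 1.0
# correct close-set predictions cannot rescue a bad open-set ranking
print(open_auc([0.9, 0.95], [True, True], [0.1, 0.2]))  # 0.0
```

This coupling is exactly what separates the metric from Novelty-detection AUC (which ignores `close_correct`) and from close-set-style scores (which ignore the ranking).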
The problem of blind image super-resolution aims to recover high-resolution (HR) images from low-resolution (LR) images with unknown degradation modes. Most existing methods model the image degradation process using b...
Scene text recognition (STR) is still a hot research topic in the computer vision field due to its various applications. Existing works mainly focus on learning a general model with a huge number of synthetic text images to recognize unconstrained scene texts, and have achieved substantial progress. However, these methods are not quite applicable in many real-world scenarios where 1) high recognition accuracy is required, while 2) labeled samples are scarce. To tackle this challenging problem, this paper proposes a few-shot adversarial sequence domain adaptation (FASDA) approach to build sequence adaptation between the synthetic source domain (with many synthetic labeled samples) and a specific target domain (with only some or a few real labeled samples). This is done by simultaneously learning each character's feature representation with an attention mechanism and establishing the corresponding character-level latent subspace with adversarial learning. Our approach can maximize the character-level confusion between the source domain and the target domain, thus achieving sequence-level adaptation with even a small number of labeled samples in the target domain. Extensive experiments on various datasets show that our method significantly outperforms the fine-tuning scheme, and obtains performance comparable to state-of-the-art STR methods.
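The adversarial ingredient of such domain adaptation can be sketched minimally, assuming a logistic domain discriminator over toy character features (the dimensions, learning rates, and data below are made up; the paper's actual model uses attention-based sequence features): the discriminator first learns to tell synthetic from real characters apart, and reversed-gradient steps then move the features to confuse it, shrinking the domain gap.

```python
import numpy as np

rng = np.random.default_rng(0)

# toy character-level features (hypothetical 8-dim embeddings)
Xs = rng.normal(0.0, 1.0, (64, 8))        # source domain: synthetic characters
Xt = rng.normal(0.5, 1.0, (64, 8))        # target domain: a few real characters
X = np.vstack([Xs, Xt])
d = np.r_[np.zeros(64), np.ones(64)]      # domain labels: 0 = source, 1 = target

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
w, b = np.zeros(8), 0.0                   # logistic domain discriminator

gap_before = np.linalg.norm(X[:64].mean(0) - X[64:].mean(0))

# step 1: train the discriminator to tell the two domains apart
for _ in range(100):
    p = sigmoid(X @ w + b)
    g = p - d                             # dLoss/dlogit of the logistic loss
    w -= 0.1 * X.T @ g / len(d)
    b -= 0.1 * g.mean()

# step 2: adversarial (reversed-gradient) feature updates - each character
# feature climbs the domain loss, i.e. moves so as to confuse the discriminator
for _ in range(5):
    p = sigmoid(X @ w + b)
    X += 0.1 * (p - d)[:, None] * w       # reversed sign vs. a descent step

gap_after = np.linalg.norm(X[:64].mean(0) - X[64:].mean(0))
print(gap_before, gap_after)              # the domain gap shrinks
```

In the full method the two steps alternate during training, and the confusion is built per character, which is what yields sequence-level adaptation from few target labels.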
We address a challenging problem: recognizing multiple text sequences from an image by pure end-to-end learning. It is twofold: 1) Multiple text sequence recognition. Each image may contain multiple text sequences of different content, location, and orientation; we try to recognize all of these texts in the image. 2) Pure end-to-end (PEE) learning. We solve the problem in a pure end-to-end way, where each training image is labeled only by the text transcripts of the contained sequences, without any geometric annotations. Most existing works recognize multiple text sequences from an image in a non-end-to-end (NEE) or quasi-end-to-end (QEE) way, in which each image is trained with both text transcripts and text locations. Only recently was a PEE method proposed to recognize a text sequence that is split into several lines in an image; however, it cannot be directly applied to recognizing multiple text sequences. So in this paper, we propose a pure end-to-end learning method to recognize multiple text sequences from an image. Our method directly learns the probability distribution of multiple sequences conditioned on each input image, and outputs multiple text transcripts with a well-designed decoding strategy. To evaluate the proposed method, we construct several datasets, mainly based on an existing public dataset and two real application scenarios. Experimental results show that the proposed method can effectively recognize multiple text sequences from images, and outperforms CTC-based and attention-based baseline methods.
An electroencephalogram (EEG)-based brain–computer interface (BCI) speller allows a user to input text to a computer by thought. It is particularly useful to severely disabled individuals, e.g. amyotrophic lateral sclerosis patients, who have no other effective means of communication with another person or a computer. Most studies so far focused on making EEG-based BCI spellers faster and more reliable; however, few have considered their security. This study, for the first time, shows that P300 and steady-state visual evoked potential BCI spellers are very vulnerable, i.e. they can be severely attacked by adversarial perturbations, which are too tiny to be noticed when added to EEG signals, but can mislead the spellers to spell anything the attacker wants. The consequence could range from merely user frustration to severe misdiagnosis in clinical applications. We hope our research can attract more attention to the security of EEG-based BCI spellers, and more broadly, EEG-based BCIs, which has received little attention before.
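The kind of attack described can be illustrated with a hedged FGSM-style sketch on a made-up linear classifier (this is not the authors' attack pipeline; the weights and "EEG" vector below are random stand-ins): the perturbation follows the sign of the score gradient, is scaled just large enough to flip the decision, and remains tiny relative to the unit-variance signal.

```python
import numpy as np

rng = np.random.default_rng(42)

# hypothetical setup: a linear "speller" scoring one EEG epoch for a target flash
w = rng.normal(size=256)                # stand-in for trained classifier weights
x = rng.normal(size=256)                # stand-in for one EEG feature vector

def predict(v):                         # 1 = "target flash", 0 = "non-target"
    return int(v @ w > 0)

# FGSM-style step: for a linear model, d(score)/dx = w, so the sign of the
# gradient is sign(w); scale eps just past the decision boundary
score = x @ w
eps = 1.1 * abs(score) / np.abs(w).sum()
direction = -1.0 if predict(x) == 1 else 1.0
x_adv = x + direction * eps * np.sign(w)

print(predict(x), predict(x_adv))       # the decision flips
print(np.abs(x_adv - x).max())          # per-sample amplitude stays tiny (= eps)
```

Against a deep model the same idea applies with the gradient taken through the network; the point of the sketch is only that a sub-noise-level perturbation suffices to flip the output.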
ISBN: (Print) 9781713871088
Multi-modal knowledge graph embeddings (KGE) have attracted more and more attention in learning representations of entities and relations for link prediction tasks. Different from previous uni-modal KGE approaches, multi-modal KGE can leverage expressive knowledge from a wealth of modalities (image, text, etc.), leading to more comprehensive representations of real-world entities. However, the critical challenge along this course lies in that the multi-modal embedding spaces are usually heterogeneous. In this sense, direct fusion will destroy the inherent spatial structure of different modal embeddings. To overcome this challenge, we revisit multi-modal KGE from a distributional alignment perspective and propose optimal transport knowledge graph embeddings (OTKGE). Specifically, we model the multi-modal fusion procedure as a transport plan moving different modal embeddings to a unified space by minimizing the Wasserstein distance between multi-modal distributions. Theoretically, we show that by minimizing the Wasserstein distance between the individual modalities and the unified embedding space, the final results are guaranteed to maintain consistency and comprehensiveness. Moreover, experimental results on well-established multi-modal knowledge graph completion benchmarks show that our OTKGE achieves state-of-the-art performance.
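The transport-plan idea can be sketched with a standard Sinkhorn iteration for entropic optimal transport (a toy stand-in, not the OTKGE implementation; the embeddings, uniform weights, and regularization strength are made up): the plan moves mass from one modality's embeddings toward the other's, and a barycentric projection then maps one modality into the other's space.

```python
import numpy as np

def sinkhorn(a, b, C, eps=0.1, iters=1000):
    """Entropic-regularized optimal transport plan (illustrative sketch)."""
    K = np.exp(-C / eps)                  # Gibbs kernel of the cost matrix
    u = np.ones_like(a)
    v = np.ones_like(b)
    for _ in range(iters):                # alternating marginal projections
        v = b / (K.T @ u)
        u = a / (K @ v)
    return u[:, None] * K * v[None, :]    # transport plan with marginals a, b

rng = np.random.default_rng(0)
img = rng.normal(0.0, 1.0, (5, 4))        # toy image-modality entity embeddings
txt = rng.normal(2.0, 1.0, (6, 4))        # toy text-modality entity embeddings

a = np.full(5, 1 / 5)                     # uniform weights over image points
b = np.full(6, 1 / 6)                     # uniform weights over text points
C = ((img[:, None, :] - txt[None, :, :]) ** 2).sum(-1)
C /= C.max()                              # normalize cost for numerical stability
P = sinkhorn(a, b, C)

# barycentric projection: transport image embeddings toward the text space
img_mapped = (P / P.sum(1, keepdims=True)) @ txt
print(P.sum(1), P.sum(0))                 # marginals match a and b
```

The entropic regularization (`eps`) trades plan sharpness for stable, fast iterations; smaller `eps` approaches the unregularized Wasserstein plan.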
Sparse Mobile Crowdsensing (Sparse MCS) selects a small subset of sub-areas for data collection and infers the data of the remaining sub-areas from the collected data. Compared with Mobile Crowdsensing (MCS) without data inference, Sparse MCS saves sensing costs while ensuring the quality of global data. However, existing research on Sparse MCS focuses only on selecting a small set of higher-value sub-areas. It neither considers whether the recruited participants can actually collect the data of the required sub-areas, nor accounts for the value of the other data those participants collect along the way. To overcome these limitations of traditional sub-area selection, this paper starts from the perspective of participants and concentrates on the contribution of the data collected by each participant to the entire collection task. The total data contribution of each participant becomes the basis for deciding whether to select that participant, and correspondingly, a new approach to the participant selection problem under Sparse MCS is proposed. Given that each person's daily movement trajectory is basically stable, and that the data collected by different people on their respective trajectories have different values, this paper exploits this regularity and difference to study how to directly recruit participants who can collect high-value data. Furthermore, the participant selection problem considered in this paper is not limited to data collection in the next cycle; instead, some participants are recruited directly to continue the collection task over the next multiple cycles. This multi-cycle participant selection problem can be modeled as a dynamic decision-making problem. Since heuristic strategies may fall into a local optimum, this paper uses reinforcement learning to solve the participant selection problem: we use the participant selection system as an agent of reinforcement learning, and design the
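The reinforcement-learning framing can be caricatured with a stateless epsilon-greedy loop (a deliberately simplified sketch with invented contribution values, far simpler than the paper's full multi-cycle dynamic formulation): the agent repeatedly recruits a candidate participant, observes a noisy data-contribution reward reflecting that person's trajectory, and learns which participant is most valuable to recruit.

```python
import numpy as np

rng = np.random.default_rng(1)

# hypothetical per-cycle data-contribution value of each candidate participant,
# reflecting how much high-value sub-area data their daily trajectory covers
true_value = np.array([0.2, 0.8, 0.5])

Q = np.zeros(3)                           # estimated contribution per participant
alpha, eps_greedy = 0.1, 0.2
for _ in range(2000):
    # epsilon-greedy recruitment over the candidate participants
    a = rng.integers(3) if rng.random() < eps_greedy else int(Q.argmax())
    reward = true_value[a] + rng.normal(0.0, 0.1)  # noisy observed contribution
    Q[a] += alpha * (reward - Q[a])       # incremental value update

print(int(Q.argmax()))                    # the agent settles on participant 1
```

The paper's setting adds what this sketch omits: state (which sub-areas remain uncovered across cycles) and recruitment over multiple future cycles at once, which is why a full reinforcement-learning formulation is used rather than a bandit.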
Chromosome classification is an important but difficult and tedious task in karyotyping. Previous methods only classify manually segmented single chromosomes, which is far from clinical practice. In this work, we propose a detection-based method, DeepACC, to locate and finely classify chromosomes simultaneously based on the whole metaphase image. We first introduce the Additive Angular Margin Loss to enhance the discriminative power of the model. To alleviate batch effects, we transform the decision boundary of each class case-by-case through a Siamese network, which makes full use of the prior knowledge that chromosomes usually appear in pairs. Furthermore, we take the clinical seven-group criteria as prior knowledge and design an additional Group Inner-Adjacency Loss to further reduce inter-class similarities. A private metaphase image dataset from a clinical laboratory is collected and labelled to evaluate the performance. Results show that the new design brings encouraging performance gains compared to state-of-the-art baseline models.
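The Additive Angular Margin Loss can be sketched as follows (an ArcFace-style construction in NumPy; the feature dimension, class count, margin `m`, and scale `s` below are illustrative, not the paper's settings): cosine similarities are computed between L2-normalized features and class weights, and the margin is added to each sample's true-class angle before rescaling, so correct classes must win by an angular margin.

```python
import numpy as np

def arc_margin_logits(features, weights, labels, m=0.5, s=30.0):
    """Additive Angular Margin (ArcFace-style) logits, as a sketch."""
    # cosine similarity between L2-normalized features and class weights
    f = features / np.linalg.norm(features, axis=1, keepdims=True)
    w = weights / np.linalg.norm(weights, axis=1, keepdims=True)
    cos = np.clip(f @ w.T, -1.0, 1.0)
    theta = np.arccos(cos)
    rows = np.arange(len(labels))
    # add the angular margin m to each sample's true-class angle (clipped to pi)
    theta[rows, labels] = np.minimum(theta[rows, labels] + m, np.pi)
    return s * np.cos(theta)              # rescaled logits for softmax training

rng = np.random.default_rng(0)
feats = rng.normal(size=(4, 16))          # toy chromosome features
W = rng.normal(size=(24, 16))             # one weight vector per chromosome class
y = np.array([0, 1, 2, 3])

logits = arc_margin_logits(feats, W, y)
plain = arc_margin_logits(feats, W, y, m=0.0)   # m=0 reduces to scaled cosine
# the margin only lowers true-class logits, tightening the decision boundary
print(np.all(logits[np.arange(4), y] <= plain[np.arange(4), y]))
```

Feeding these logits to an ordinary softmax cross-entropy yields the margin loss; the per-class, case-by-case boundary adjustment via the Siamese network in the paper builds on this same angular formulation.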