There is strong demand for customizing pretrained large text-to-image models, e.g. Stable Diffusion, to generate novel concepts, such as the users themselves. However, concepts newly added by previous customization methods often show weaker combination abilities than the original ones, even when several images are given during training. We thus propose a new personalization method that seamlessly integrates a unique individual into the pre-trained diffusion model using just one facial photograph and only 1024 learnable parameters, in under 3 minutes. As a result, we can generate images of this person in any pose or position, interacting with anyone and doing anything imaginable, from text prompts. To achieve this, we first analyze and build a well-defined celeb basis from the embedding space of the pre-trained large text encoder. Then, given one facial photo as the target identity, we generate its embedding by optimizing the weights of this basis while locking all other parameters. Empowered by the proposed celeb basis, the new identity in our customized model shows better concept combination ability than previous personalization methods. In addition, our model can learn several new identities at once and have them interact with each other, where previous customization models fail. Project page is at: http://***. Code is at: https://***/ygtxr1997/CelebBasis.
Verb errors are among the most common grammar errors made by non-native writers of English. This work focuses on an important type of verb usage error, subject-verb agreement for third-person singular forms, which accounts for a high proportion of the errors made by non-native English learners. Existing work has not provided a satisfactory solution for this task: supervised learning methods usually fail to achieve good enough performance, and rule-based methods depend on advanced linguistic resources such as syntactic parsers. In this paper, we propose a rule-based method to detect and correct these errors. The proposed method relies on a series of rules to automatically locate the subject and predicate in four types of sentences. The evaluation shows that the proposed method achieves state-of-the-art performance with quite limited linguistic resources.
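To make the correction step concrete, the following is a minimal sketch and not the method of the paper. It assumes the subject and the present-tense verb have already been located (the abstract does not give the four sentence-type rules, so they are not reproduced here). The function names, the pronoun lists, and the `plural_noun` flag are all hypothetical.

```python
# Hypothetical sketch: pick the verb form that agrees with an already-located subject.
# The pronoun sets and helper names are illustrative assumptions, not from the paper.

THIRD_SINGULAR_PRONOUNS = {"he", "she", "it"}
OTHER_PRONOUNS = {"i", "you", "we", "they"}


def is_third_person_singular(subject: str, plural_noun: bool = False) -> bool:
    """Decide agreement class; nouns are singular unless flagged plural."""
    s = subject.lower()
    if s in THIRD_SINGULAR_PRONOUNS:
        return True
    if s in OTHER_PRONOUNS:
        return False
    return not plural_noun


def to_third_singular(base: str) -> str:
    """Inflect a base-form verb for third-person singular present tense."""
    if base == "have":
        return "has"
    if base.endswith(("s", "sh", "ch", "x", "z", "o")):
        return base + "es"
    if base.endswith("y") and len(base) > 1 and base[-2] not in "aeiou":
        return base[:-1] + "ies"
    return base + "s"


def correct_verb(subject: str, base: str, plural_noun: bool = False) -> str:
    """Return the present-tense form of `base` that agrees with `subject`."""
    third = is_third_person_singular(subject, plural_noun)
    if base == "be":  # "be" is fully irregular, so handle it separately
        if third:
            return "is"
        return "am" if subject.lower() == "i" else "are"
    return to_third_singular(base) if third else base


if __name__ == "__main__":
    print("This work", correct_verb("work", "focus"))
    print("They", correct_verb("they", "focus"))
```

A full system would also need the detection side, that is, finding the subject and predicate without a syntactic parser, which is the part the abstract attributes to its rule set.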
Person re-identification, as a branch of image retrieval, has an extremely important application in public safety. In the past few decades, researchers have improved its accuracy through a variety of methods, includin...
ISBN (print): 9781713871088
Spiking neural networks (SNNs) are promising brain-inspired energy-efficient models. Recent progress in training methods has enabled successful deep SNNs on large-scale tasks with low latency. In particular, backpropagation through time (BPTT) with surrogate gradients (SG) is widely used to enable models to achieve high performance in a very small number of time steps. However, this comes at the cost of large memory consumption for training, a lack of theoretical clarity for optimization, and inconsistency with the online property of biological learning rules and of rules on neuromorphic hardware. Other works connect the spike representations of SNNs with equivalent artificial neural network formulations and train SNNs by gradients from equivalent mappings to ensure descent directions. But they fail to achieve low latency and are also not online. In this work, we propose online training through time (OTTT) for SNNs, which is derived from BPTT to enable forward-in-time learning by tracking presynaptic activities and leveraging instantaneous loss and gradients. Meanwhile, we theoretically analyze and prove that the gradients of OTTT can provide a similar descent direction for optimization as gradients from the equivalent mapping between spike representations under both feedforward and recurrent conditions. OTTT requires only constant training memory costs, agnostic to the number of time steps, avoiding the significant memory costs of BPTT for GPU training. Furthermore, the update rule of OTTT is in the form of three-factor Hebbian learning, which could pave a path for online on-chip learning. With OTTT, the two mainstream supervised SNN training methods, BPTT with SG and spike representation-based training, are connected for the first time, and in a biologically plausible form. Experiments on CIFAR-10, CIFAR-100, ImageNet, and CIFAR10-DVS demonstrate the superior performance of our method on large-scale static and neuromorphic datasets in a small number of time steps.
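The idea of forward-in-time learning by tracking presynaptic activities can be illustrated with a small sketch. It is an assumption-laden illustration, not the authors' implementation. It assumes a single layer, an exponentially decaying presynaptic trace with a leak factor `lam`, and an externally supplied instantaneous gradient with respect to the membrane potential (`grad_potential`). All names and default values are hypothetical.

```python
# Hypothetical sketch of an online, trace-based weight update for one layer.
# Only the running trace is stored, so memory does not grow with the number of time steps.

def update_trace(trace, spikes, lam):
    """Decay the presynaptic trace and add the current input spikes."""
    return [lam * a + s for a, s in zip(trace, spikes)]


def instantaneous_weight_grad(grad_potential, trace):
    """Outer product of the per-neuron gradient and the presynaptic trace."""
    return [[g * a for a in trace] for g in grad_potential]


def online_step(weights, trace, spikes, grad_potential, lam=0.5, lr=0.1):
    """One time step: update the trace, then apply the instantaneous gradient."""
    trace = update_trace(trace, spikes, lam)
    grad = instantaneous_weight_grad(grad_potential, trace)
    weights = [
        [w - lr * g for w, g in zip(w_row, g_row)]
        for w_row, g_row in zip(weights, grad)
    ]
    return weights, trace


if __name__ == "__main__":
    weights = [[0.0, 0.0]]          # one output neuron, two inputs
    trace = [0.0, 0.0]
    inputs = [[1.0, 0.0], [1.0, 1.0]]
    for spikes in inputs:
        # grad_potential would come from the instantaneous loss; fixed here for illustration
        weights, trace = online_step(weights, trace, spikes, grad_potential=[2.0])
    print(weights, trace)
```

The update multiplies a presynaptic factor (the trace) by a postsynaptic factor carrying a global error signal (the gradient), which is the general shape of the three-factor rule the abstract mentions.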
Binarization of neural networks is a dominant paradigm in neural network compression. The pioneering work BinaryConnect uses the Straight Through Estimator (STE) to mimic the gradients of the sign function, but it also causes a crucial inconsistency problem. Most previous methods design different estimators to replace STE and mitigate it. However, they ignore the fact that reducing the estimating error concomitantly decreases gradient stability. These highly divergent gradients harm model training and increase the risk of vanishing and exploding gradients. To fully take gradient stability into consideration, we present a new perspective on BNN training, regarding it as an equilibrium between the estimating error and the gradient stability. In this view, we first design two indicators to quantitatively demonstrate the equilibrium phenomenon. In addition, to balance the estimating error and the gradient stability well, we revise the original straight through estimator and propose a power-function-based estimator, the Rectified Straight Through Estimator (ReSTE for short). Compared to other estimators, ReSTE is rational and capable of flexibly balancing the estimating error against the gradient stability. Extensive experiments on the CIFAR-10 and ImageNet datasets show that ReSTE performs excellently and surpasses state-of-the-art methods without any auxiliary modules or losses.
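A power-function-based estimator of this kind can be sketched as follows. The abstract does not give the exact formula, so this sketch assumes the backward pass uses the derivative of f(x) = sign(x)|x|^(1/o) with o >= 1, which reduces to the plain STE at o = 1. It also assumes the gradient is clipped at a threshold to keep it bounded near zero. The parameter names `o`, `clip`, and `eps` and their default values are illustrative assumptions.

```python
# Hypothetical sketch: sign in the forward pass, power-function gradient in the backward pass.
# Assumed surrogate: f(x) = sign(x) * |x|**(1/o), so f'(x) = (1/o) * |x|**((1 - o)/o).

def binarize(x: float) -> float:
    """Forward pass: the sign function (zero mapped to +1)."""
    return 1.0 if x >= 0 else -1.0


def power_estimator_grad(x: float, o: float = 2.0, clip: float = 1.5,
                         eps: float = 1e-12) -> float:
    """Surrogate derivative of the sign function, clipped for stability."""
    if o < 1.0:
        raise ValueError("o must be >= 1; o = 1 recovers the plain STE")
    grad = (1.0 / o) * (abs(x) + eps) ** ((1.0 - o) / o)
    return min(grad, clip)  # the raw gradient diverges as x -> 0 when o > 1


def backward(x: float, grad_output: float, o: float = 2.0) -> float:
    """Chain rule: pass the upstream gradient through the surrogate derivative."""
    return grad_output * power_estimator_grad(x, o=o)


if __name__ == "__main__":
    for x in (0.0, 0.25, 1.0):
        print(x, binarize(x), power_estimator_grad(x, o=1.0), power_estimator_grad(x, o=2.0))
```

Under these assumptions, a larger `o` makes the surrogate closer to the sign function (lower estimating error) but sharpens the gradient near zero (lower stability), which is the trade-off the abstract describes.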
Weakly supervised video anomaly detection (WS-VAD) is to distinguish anomalies from normal events based on discriminative representations. Most existing works are limited in insufficient video representations. In this...
Recent sparse detectors with multiple, e.g. six, decoder layers achieve promising performance but incur long inference time due to complex heads. Previous works have explored using dense priors as initialization and built one-decoder-layer detectors. Although they gain remarkable acceleration, their performance still lags behind their six-decoder-layer counterparts by a large margin. In this work, we aim to bridge this performance gap while retaining fast speed. We find that the architecture discrepancy between dense and sparse detectors leads to feature conflict, hampering the performance of one-decoder-layer detectors. We therefore propose the Adaptive Sparse Anchor Generator (ASAG), which predicts dynamic anchors on patches rather than grids in a sparse way, alleviating the feature conflict problem. For each image, ASAG dynamically selects which feature maps and which locations to predict from, forming a fully adaptive way to generate image-specific anchors. Further, a simple and effective Query Weighting method eases the training instability caused by adaptiveness. Extensive experiments show that our method outperforms dense-initialized ones and achieves a better speed-accuracy trade-off. The code is available at https://***/iSEE-Laboratory/ASAG.
ISBN (digital): 9798350390155
ISBN (print): 9798350390162
Aerial person re-identification (AReID) focuses on accurately matching target person images within a UAV camera network. Challenges arise from the broad field of view and arbitrary movement of UAVs, which lead to foreground target rotation and background style variation. Existing AReID methods have provided limited solutions for the former, while the latter remains largely unexplored. This paper proposes a Rotation Exploration Vision Transformer (RoExViT) to tackle both challenges. Specifically, we design Multiple Rotation Tokens (MRT) to explore diverse rotational representations at the feature level, addressing foreground target rotation. To handle background style variation, we propose a Cross-Camera Similarity (CCS) loss to effectively minimize the view gap among different cameras. Furthermore, we propose an Iteratively Adaptive Batch Construction (IABC) strategy to mitigate overfitting on small datasets. Extensive experiments show that our method outperforms state-of-the-art methods on PRAI-1581 and UAV-Human while also exhibiting outstanding performance on Market1501.