Vision Transformers (ViTs) have recently demonstrated remarkable performance in computer vision tasks. However, their parameter-intensive nature and reliance on large amounts of data for effective performance have shi...
Traditional vision-based systems used for automatic gait pathology detection are associated with high cost. However, with the advent of the Microsoft Kinect sensor, researchers have tried to model some low-cost gait assessment systems;...
With the rise in popularity of machine and deep learning models, there is an increased focus on their vulnerability to malicious inputs. These adversarial examples drift model predictions away from the original intent...
This paper reviews the NTIRE 2023 challenge on image super-resolution (×4), focusing on the proposed solutions and results. The task of image super-resolution (SR) is to generate a high-resolution (HR) output fro...
This paper proposes to recognise the true (self-reported) personality from the learned simulation of the target subject’s cognition. This approach builds on the following two findings in cognitive science: (i) human cognition partially determines expressed behaviour and is directly linked to true personality traits; and (ii) in dyadic interactions, individuals’ nonverbal behaviours are influenced by their conversational partner’s behaviours. In this context, we hypothesise that during a dyadic interaction, a target subject’s facial reactions are driven by two main factors, i.e. their internal (person-specific) cognitive process, and the externalised nonverbal behaviours of their conversational partner. Consequently, we propose to represent the person-specific cognition of the target subject (defined as the listener) in the form of a person-specific CNN architecture that has unique architectural parameters and depth, takes the audio-visual non-verbal cues displayed by the conversational partner (defined as the speaker) as input, and is able to reproduce the target subject’s facial reactions. Each person-specific CNN is searched for using Neural Architecture Search (NAS) and a novel adaptive loss function, and is then encoded as a graph representation for recognising the target subject’s true personality. Experimental results show that the produced graph representations are well associated with target subjects’ personality traits in both human-human and human-machine interaction scenarios and outperform the existing approaches with significant advantages. They also demonstrate that the proposed novel strategies, such as the adaptive loss and the end-to-end vertices/edges feature learning, help the proposed approach learn more reliable personality representations. Building on our earlier version of this work, this paper further proposes: (i) assigning a unique depth for each CNN; (ii) a novel end-to-end graph vertex feature learning strategy; (iii) a transformer-bas
The aim of this paper is to study the influence of locality mechanisms in vision transformers. Transformers originated from machine translation and are particularly good at modelling long-range dependencies within a long sequence. Although the global interaction between the token embeddings can be modelled well by the self-attention mechanism of transformers, what is lacking is a locality mechanism for information exchange within a local region. In this paper, the locality mechanism is systematically investigated through carefully designed controlled experiments. We introduce locality into the feed-forward network of vision transformers. This seemingly simple solution is inspired by the comparison between feed-forward networks and inverted residual blocks. The importance of locality mechanisms is validated in two ways: 1) a wide range of design choices (activation function, layer placement, expansion ratio) are available for incorporating locality mechanisms, and proper choices can lead to a performance gain over the baseline; and 2) the same locality mechanism is successfully applied to vision transformers with different architecture designs, which shows the generalization of the locality concept. For ImageNet2012 classification, the locality-enhanced transformers outperform the baselines Swin-T [1], DeiT-T [2] and PVT-T [3] by 1.0%, 2.6% and 3.1%, respectively, with a negligible increase in the number of parameters and computational effort. Code is available at https://***/ofsoundof/LocalViT.
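The idea of placing locality inside the feed-forward network can be illustrated with a small, dependency-free sketch. It is not the authors' code: it assumes the locality operator is a 3×3 depth-wise convolution (the middle layer of an inverted residual block), that only patch tokens on an `h × w` grid are processed (no class token), and that GELU is applied after both the expansion and the depth-wise layers. The names `local_ffn`, `depthwise_conv3x3`, `linear` and `make_weights` are hypothetical and chosen for illustration only.

```python
import math
import random


def gelu(x):
    """Exact GELU activation, computed with the error function."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def linear(tokens, weight):
    """Per-token linear map (equivalent to a 1x1 convolution).

    tokens: list of N vectors of length c_in; weight: c_in x c_out matrix.
    """
    c_out = len(weight[0])
    return [[sum(t[i] * weight[i][j] for i in range(len(t)))
             for j in range(c_out)] for t in tokens]


def depthwise_conv3x3(grid, kernels):
    """Zero-padded 3x3 depth-wise convolution.

    grid[y][x] is a channel vector; kernels[ch] is the 3x3 filter that is
    applied to channel ch only (no mixing across channels).
    """
    h, w, c = len(grid), len(grid[0]), len(grid[0][0])
    out = [[[0.0] * c for _ in range(w)] for _ in range(h)]
    for y in range(h):
        for x in range(w):
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    yy, xx = y + dy, x + dx
                    if 0 <= yy < h and 0 <= xx < w:
                        for ch in range(c):
                            out[y][x][ch] += kernels[ch][dy + 1][dx + 1] * grid[yy][xx][ch]
    return out


def local_ffn(tokens, h, w, w_expand, dw_kernels, w_project):
    """Feed-forward network with a local (depth-wise) layer in the middle.

    Assumed layout: expand -> GELU -> depth-wise 3x3 -> GELU -> project,
    followed by a residual connection to the input tokens.
    """
    hidden = [[gelu(v) for v in t] for t in linear(tokens, w_expand)]
    # Rearrange the token sequence into its 2D patch grid (row-major order).
    grid = [hidden[r * w:(r + 1) * w] for r in range(h)]
    grid = depthwise_conv3x3(grid, dw_kernels)
    # Flatten the grid back into a token sequence.
    hidden = [[gelu(v) for v in t] for row in grid for t in row]
    out = linear(hidden, w_project)
    return [[a + b for a, b in zip(t, o)] for t, o in zip(tokens, out)]


def make_weights(c, expansion=4, seed=0):
    """Random demo weights (hypothetical helper, not trained parameters)."""
    rng = random.Random(seed)
    hid = c * expansion
    w_expand = [[rng.gauss(0, 1) for _ in range(hid)] for _ in range(c)]
    dw_kernels = [[[rng.gauss(0, 1) for _ in range(3)] for _ in range(3)]
                  for _ in range(hid)]
    w_project = [[rng.gauss(0, 1) for _ in range(c)] for _ in range(hid)]
    return w_expand, dw_kernels, w_project


if __name__ == "__main__":
    h, w, c = 4, 4, 3
    w_expand, dw_kernels, w_project = make_weights(c, expansion=2)
    tokens = [[float(i + j) for j in range(c)] for i in range(h * w)]
    out = local_ffn(tokens, h, w, w_expand, dw_kernels, w_project)
    print(len(out), len(out[0]))
```

In this sketch each output token depends only on input tokens inside its 3×3 neighbourhood on the patch grid, whereas a plain feed-forward network treats every token independently. The expansion ratio and the position of the activations are exactly the kind of design choices the abstract lists, so the values used here should be read as placeholders.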
This paper introduces a novel benchmark for efficient up-scaling as part of the NTIRE 2023 Real-Time Image Super-Resolution (RTSR) Challenge, which aimed to upscale images from 720p and 1080p resolution to native 4K (...
Object removal is a technique for removing undesired object(s) from an image and then filling in the empty region(s) such that the modified image is visually plausible. The existing algorithms are unable to provide pro...
Accurate and robust visual object tracking is one of the most challenging and fundamental computer vision problems. It entails estimating the trajectory of the target in an image sequence, given only its initial locat...
Although traditional recommendation models trained on observational interaction data have had a large impact in a wide range of applications, they face bias problems that obscure users’ true intent and thus deteriorate the ...