ISBN (Print): 9783031707889
This research explores the integration and application of advanced deep learning models, specifically Convolutional Neural Networks (CNNs) and Vision Transformer (ViT) models, in the field of age and gender detection. The study begins by outlining the significance and challenges of accurate age and gender detection in domains such as targeted advertising, security, and human-computer interaction. It then delves into the technical aspects of CNNs and ViTs, elucidating their architectures, working principles, and suitability for image-based tasks. The proposed techniques were able to differentiate between the following age groups: 0–15, 15–20, 20–25, 25–30, 30–35, and 40. The purpose is to offer a technique for building accurate classification and age estimation systems that achieve high accuracy by integrating a variety of feature extractors and algorithms. Pre-processing evaluates the raw data, configures it, and transforms it into a standard format. The feature extraction component of the age and gender prediction pipeline is crucial; three extraction methods (ResNet-50, ViT-Small, and ViT-Base) are used, together with CNN and ViT classifiers. Performance assessment, an optional component of pattern recognition system design, focuses on system accuracy. Several approaches can be employed for it, including Mean Absolute Error (MAE), Cumulative Score (CS), leave-one-out cross-validation, and the confusion matrix; in this study, gender and age prediction were evaluated using the confusion matrix and MAE. Python, a high-level, general-purpose interpreted programming language, was used for the implementation. Precision, recall, F1-score, and accuracy were the performance metrics used. A precision of 99% was achieved for male classification…
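To make the evaluation step concrete, here is a minimal sketch of how the reported metrics could be computed with scikit-learn. The predictions below are illustrative placeholders, not the study's data, and the 0/1 gender encoding is an assumption.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             mean_absolute_error, precision_score, recall_score)

# Hypothetical model outputs for a small validation batch (0 = male, 1 = female).
gender_true = np.array([0, 0, 1, 1, 1, 0])
gender_pred = np.array([0, 0, 1, 0, 1, 0])
age_true = np.array([14.0, 22.0, 31.0, 27.0, 18.0, 36.0])    # years
age_pred = np.array([12.5, 24.0, 30.0, 29.5, 17.0, 38.0])

print(confusion_matrix(gender_true, gender_pred))             # gender evaluation
print("precision:", precision_score(gender_true, gender_pred, pos_label=0))
print("recall:   ", recall_score(gender_true, gender_pred, pos_label=0))
print("f1-score: ", f1_score(gender_true, gender_pred, pos_label=0))
print("accuracy: ", accuracy_score(gender_true, gender_pred))
print("age MAE:  ", mean_absolute_error(age_true, age_pred))  # age evaluation
```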
Nowadays, to improve animal well-being in livestock farming applications, a wireless video sensor network (WVSN) can be deployed to detect injuries early and monitor animals. Such networks are composed of small embedded video and camera motes that capture video frames periodically and send them to a specific node called a sink. Sending all the captured images to the sink consumes a lot of energy on every sensor and may cause a bottleneck at the sink level. Energy consumption and bandwidth limitation are two important challenges in WVSNs because of the limited energy resources of the nodes and the scarcity of the medium. In this work, we introduce two mechanisms to reduce the overall number of frames sensed and sent to the sink. The first approach is applied on each sensor node, where the FRABID algorithm, a joint data reduction and frame rate adaptation mechanism covering the sensing and transmission phases, is introduced. This approach reduces the number of sensed frames based on a similarity method: the number of sensed frames is adapted according to the degree of difference between two consecutive sensed frames in each period. This adaptation technique maintains the accuracy of the video while capturing frames holding new information. The approach is validated through simulations using real datasets from video sensors (Wang et al., in: 2014 IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 393-400, 2014). The results show that the amount of sensed data is reduced by more than 70% compared to a recent algorithm in Christian et al. (Multimed Tools Appl 79(3):1801-1819, 2020) while guaranteeing the detection of all critical events at the sensor node level. The second approach exploits the spatio-temporal correlation between neighboring nodes to reduce the number of captured frames. For that purpose, the Synchronization with Frame Rate Adaptation (SFRA) algorithm is introduced, where overlapping nodes capture frames in a synchronized fashion every N - 1 periods…
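The per-node adaptation logic lends itself to a short illustration. The following is a minimal sketch of the idea described above, not the FRABID algorithm itself; the difference measure, thresholds, and rate bounds are all assumptions.

```python
import numpy as np

def frame_difference(a: np.ndarray, b: np.ndarray) -> float:
    """Mean absolute pixel difference between two frames, normalized to [0, 1]."""
    return float(np.mean(np.abs(a.astype(np.float32) - b.astype(np.float32))) / 255.0)

def adapt_frame_rate(rate: int, diff: float, low: float = 0.02, high: float = 0.10,
                     min_rate: int = 1, max_rate: int = 30) -> int:
    """Halve the sensing rate when consecutive frames are near-identical,
    double it when they differ strongly, otherwise keep it unchanged."""
    if diff < low:
        return max(min_rate, rate // 2)   # scene is static: sense less, save energy
    if diff > high:
        return min(max_rate, rate * 2)    # new information: sense more often
    return rate
```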
This paper reports on the NTIRE 2023 Quality Assessment of Video Enhancement Challenge, which will be held in conjunction with the New Trends in Image Restoration and Enhancement Workshop (NTIRE) at CVPR 2023. This ch...
Human pose is typically represented by a coordinate vector of body joints or their heatmap embeddings. While easy for data processing, these representations admit unrealistic pose estimates because of the lack of dependency modeling between the body joints. In this paper, we present a structured representation, named Pose as Compositional Tokens (PCT), to explore the joint dependency. It represents a pose by M discrete tokens, each characterizing a sub-structure with several interdependent joints (see Figure 1). The compositional design enables it to achieve a small reconstruction error at a low cost. We then cast pose estimation as a classification task: in particular, we learn a classifier to predict the categories of the M tokens from an image. A pre-learned decoder network is used to recover the pose from the tokens without further post-processing. We show that it achieves pose estimation results better than or comparable to existing methods in general scenarios, yet continues to work well when occlusion occurs, which is ubiquitous in practice. The code and models are publicly available at https://***/Gengzigang/PCT.
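As a rough illustration of the pipeline, the sketch below predicts categories for M tokens and decodes them into K joint coordinates. All dimensions, the codebook size, and the network shapes are placeholders; PCT's actual architecture differs.

```python
import torch
import torch.nn as nn

M, V, K = 34, 1024, 17                       # tokens per pose, codebook size, joints

class TokenClassifier(nn.Module):
    """Predicts a category in [0, V) for each of the M compositional tokens."""
    def __init__(self, feat_dim: int = 2048):
        super().__init__()
        self.head = nn.Linear(feat_dim, M * V)

    def forward(self, img_feat: torch.Tensor) -> torch.Tensor:   # (B, feat_dim)
        return self.head(img_feat).view(-1, M, V)                # (B, M, V) logits

class PoseDecoder(nn.Module):
    """Pre-learned decoder: maps token categories back to joint coordinates."""
    def __init__(self, emb_dim: int = 128):
        super().__init__()
        self.codebook = nn.Embedding(V, emb_dim)
        self.mlp = nn.Sequential(nn.Linear(M * emb_dim, 512), nn.ReLU(),
                                 nn.Linear(512, K * 2))

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:  # (B, M) ints
        emb = self.codebook(token_ids).flatten(1)
        return self.mlp(emb).view(-1, K, 2)                      # (B, K, 2)

logits = TokenClassifier()(torch.randn(1, 2048))                 # image features
pose = PoseDecoder()(logits.argmax(dim=-1))                      # no post-processing
```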
Implicit neural representation (INR), sometimes also referred to as coordinate-based representation or fitting, has achieved state-of-the-art performance in numerous research fields including computer vision and comput...
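For readers unfamiliar with the term, the toy sketch below fits a small MLP that maps 2D coordinates to RGB values, so the network weights themselves become the representation of the signal. The sizes, target data, and training loop are arbitrary illustrations.

```python
import torch
import torch.nn as nn

class INR(nn.Module):
    """A coordinate-based MLP: (x, y) in [-1, 1]^2 -> RGB."""
    def __init__(self, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(2, hidden), nn.ReLU(),
                                 nn.Linear(hidden, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 3))

    def forward(self, coords: torch.Tensor) -> torch.Tensor:
        return self.net(coords)

model = INR()
coords = torch.rand(1024, 2) * 2 - 1       # query coordinates
target = torch.rand(1024, 3)               # pixel values to fit (placeholder data)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for _ in range(100):                       # "fitting" = training the representation
    loss = ((model(coords) - target) ** 2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
```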
ISBN (Print): 9783031417337; 9783031417344
Table detection and structure recognition are important components of document analysis systems. Deep learning-based transformer models have recently demonstrated significant success in various computer vision and document analysis tasks. In this paper, we introduce PyramidTabNet (PTN), a method that builds upon the convolution-less Pyramid Vision Transformer to detect tables in document images. Furthermore, we present a tabular image generative augmentation technique to effectively train the architecture. The proposed augmentation process consists of three steps, namely clustering, fusion, and patching, for the generation of new document images containing tables. Our proposed pipeline demonstrates significant performance improvements for table detection on several standard datasets and achieves performance comparable to state-of-the-art methods on structure recognition tasks.
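As a rough illustration of the final patching step, the sketch below pastes cropped table images onto a document page and records their bounding boxes as detection labels. The placement policy and all names are assumptions, not the paper's implementation; the clustering and fusion steps are omitted.

```python
import random
from PIL import Image

def patch_tables(page: Image.Image, tables: list, margin: int = 20):
    """Paste cropped table images onto a page; return the page and boxes."""
    page = page.copy()
    boxes, y = [], margin
    for tbl in tables:
        if y + tbl.height + margin > page.height:
            break                                            # page is full
        x = random.randint(margin, max(margin, page.width - tbl.width - margin))
        page.paste(tbl, (x, y))
        boxes.append((x, y, x + tbl.width, y + tbl.height))  # detection label
        y += tbl.height + margin
    return page, boxes
```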
The attention mechanism has been widely used and has achieved good results in many visual tasks. However, computing attention in vision tasks consumes a great deal of memory and time, which is an obvious disadvantage...
In this paper we introduce the Temporo-Spatial Vision Transformer (TSViT), a fully-attentional model for general Satellite Image Time Series (SITS) processing based on the Vision Transformer (ViT). TSViT splits a SITS record into non-overlapping patches in space and time, which are tokenized and subsequently processed by a factorized temporo-spatial encoder. We argue that, in contrast to natural images, a temporal-then-spatial factorization is more intuitive for SITS processing, and we present experimental evidence for this claim. Additionally, we enhance the model's discriminative power by introducing two novel mechanisms for acquisition-time-specific temporal positional encodings and multiple learnable class tokens. The effect of all novel design choices is evaluated through an extensive ablation study. Our proposed architecture achieves state-of-the-art performance, surpassing previous approaches by a significant margin on three publicly available SITS semantic segmentation and classification datasets. All model, training, and evaluation code can be found at https://***/michaeltrs/DeepSatModels.
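The factorization order can be sketched in a few lines: tokens are first processed along the temporal axis for each spatial patch, then across patches. The dimensions and layer counts below are placeholders rather than TSViT's actual configuration.

```python
import torch
import torch.nn as nn

B, T, P, D = 2, 16, 64, 128                         # batch, time, patches, dim
layer = lambda: nn.TransformerEncoderLayer(d_model=D, nhead=4, batch_first=True)
temporal = nn.TransformerEncoder(layer(), num_layers=2)
spatial = nn.TransformerEncoder(layer(), num_layers=2)

tokens = torch.randn(B, T, P, D)                    # tokenized SITS record
x = tokens.permute(0, 2, 1, 3).reshape(B * P, T, D) # attend over time per patch
x = temporal(x)
x = x.mean(dim=1).view(B, P, D)                     # pool the temporal axis
out = spatial(x)                                    # then attend over space
```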
This paper presents a method that effectively combines two prevalent visual recognition methods, i.e., image classification and contrastive language-image pre-training, dubbed iCLIP. Instead of naïve multi-task learning that uses two separate heads for each task, we fuse the two tasks in a deep fashion that adapts image classification to share the same formula and the same model weights with language-image pre-training. To further bridge the two tasks, we propose to enhance the category names in image classification tasks using external knowledge, such as their descriptions in dictionaries. Extensive experiments show that the proposed method combines the advantages of both tasks well: the strong discrimination ability of image classification tasks due to the clean category labels, and the good zero-shot ability of CLIP tasks ascribed to the richer semantics in the text descriptions. In particular, it reaches 82.9% top-1 accuracy on IN-1K, and meanwhile surpasses CLIP by 1.8%, with similar model size, on zero-shot recognition of the Kornblith 12-dataset benchmark. The code and models are publicly available at https://***/weiyx16/iCLIP.
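The shared-formula idea can be sketched as follows: classification logits are computed exactly like CLIP's image-text similarity, with class prompts enriched by dictionary glosses. The encoders, prompt strings, and temperature below are placeholders, not iCLIP's actual implementation.

```python
import torch
import torch.nn.functional as F

def shared_logits(image_encoder, text_encoder, images, class_texts, tau=0.07):
    """One formula for both tasks: cosine similarity between embeddings."""
    img = F.normalize(image_encoder(images), dim=-1)      # (B, D) image features
    txt = F.normalize(text_encoder(class_texts), dim=-1)  # (C, D) class features
    return img @ txt.t() / tau                            # (B, C) logits

# Category names enhanced with dictionary descriptions (illustrative only).
class_texts = [
    "goldfish: a small golden or orange-red freshwater fish",
    "ambulance: a vehicle equipped for taking sick or injured people to hospital",
]
```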
Image-to-image translation is an important and challenging problem in computer vision and image processing. Diffusion models (DMs) have shown great potential for high-quality image synthesis and have achieved competitive performance on the task of image-to-image translation. However, most existing diffusion models treat image-to-image translation as a conditional generation process and suffer heavily from the gap between distinct domains. In this paper, a novel image-to-image translation method based on the Brownian Bridge Diffusion Model (BBDM) is proposed, which models image-to-image translation as a stochastic Brownian bridge process and learns the translation between two domains directly through a bidirectional diffusion process rather than a conditional generation process. To the best of our knowledge, this is the first work to propose a Brownian bridge diffusion process for image-to-image translation. Experimental results on various benchmarks demonstrate that the proposed BBDM model achieves competitive performance through both visual inspection and measurable metrics.
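For intuition, a discrete Brownian bridge step can be written down directly: the state interpolates between the source image x0 and the target-domain image y, with a variance that vanishes at both endpoints. This sketch follows the textbook bridge formula; BBDM's actual variance schedule and training objective may differ.

```python
import math
import torch

def brownian_bridge_sample(x0: torch.Tensor, y: torch.Tensor,
                           t: float, T: float) -> torch.Tensor:
    """Sample x_t of a Brownian bridge pinned at x0 (t = 0) and y (t = T)."""
    m = t / T                             # interpolation coefficient
    var = t * (T - t) / T                 # bridge variance, zero at both ends
    return (1 - m) * x0 + m * y + math.sqrt(var) * torch.randn_like(x0)
```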