ISBN:
(Print) 9783031705519; 9783031705526
This paper describes the "Handwritten Text Recognition in Brazilian Essays - BRESSAY" competition, held at the 18th International Conference on Document Analysis and Recognition (ICDAR 2024). The competition aimed to advance Handwritten Text Recognition (HTR) by addressing challenges specific to Brazilian Portuguese academic essays, such as diverse handwriting styles and document irregularities like smudges and erasures. Participants were encouraged to develop robust algorithms capable of accurately transcribing handwritten text at line, paragraph, and page levels using the new BRESSAY dataset. The competition attracted 14 participants from different countries, with 4 research groups submitting a total of 11 proposals across the three challenges by the end of the competition. These proposals achieved strong recognition rates and demonstrated advances over traditional baseline models through key strategies such as preprocessing techniques, synthetic data approaches, and advanced deep learning models. The evaluation metrics were Character Error Rate (CER) and Word Error Rate (WER), with the best-performing submissions reaching 2.88% CER and 9.39% WER for line-level recognition, 3.75% CER and 10.48% WER for paragraph-level recognition, and 3.77% CER and 10.08% WER for page-level recognition. The competition highlights the potential for continued improvement in HTR and underscores the BRESSAY dataset as a resource for future research. The dataset is available in the repository (https://***/arthurflor23/handwritten-text-recognition).
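For readers unfamiliar with the metrics, CER and WER are the Levenshtein edit distance between the predicted and reference transcriptions, normalized by the reference length, computed over characters and words respectively. A minimal Python sketch (illustrative only; this is not the competition's evaluation toolkit):

```python
def levenshtein(ref, hyp):
    """Edit distance between two sequences (of characters or words)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution
        prev = curr
    return prev[-1]

def cer(ref_text, hyp_text):
    """Character Error Rate: edit distance over reference character count."""
    return levenshtein(ref_text, hyp_text) / max(len(ref_text), 1)

def wer(ref_text, hyp_text):
    """Word Error Rate: edit distance over reference word count."""
    ref_words, hyp_words = ref_text.split(), hyp_text.split()
    return levenshtein(ref_words, hyp_words) / max(len(ref_words), 1)

print(cer("redação", "redacao"))            # 2 substitutions / 7 chars ≈ 0.286
print(wer("texto manuscrito", "texto manuscrita"))  # 1 error / 2 words = 0.5
```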
ISBN:
(Print) 9781665445092
Self-attention techniques, and specifically Transformers, are dominating the field of text processing and are becoming increasingly popular in computer vision classification tasks. In order to visualize the parts of the image that led to a certain classification, existing methods either rely on the obtained attention maps or employ heuristic propagation along the attention graph. In this work, we propose a novel way to compute relevancy for Transformer networks. The method assigns local relevance based on the Deep Taylor Decomposition principle and then propagates these relevancy scores through the layers. This propagation involves attention layers and skip connections, which challenge existing methods. Our solution is based on a specific formulation that is shown to maintain the total relevancy across layers. We benchmark our method on very recent visual Transformer networks, as well as on a text classification problem, and demonstrate a clear advantage over the existing explainability methods.
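To give a sense of how relevancy can be propagated through attention layers while conserving the total, here is a simplified NumPy sketch loosely inspired by the paper's idea; the gradient weighting, positive clipping, and row normalization below are assumptions for illustration, not the authors' exact formulation:

```python
import numpy as np

def propagate_relevancy(attn_maps, attn_grads):
    """Simplified relevancy propagation across Transformer attention layers.

    attn_maps, attn_grads: lists of arrays shaped (heads, tokens, tokens).
    Adding the identity models the skip connection, and row normalization
    keeps each row of R summing to one, i.e. total relevancy is conserved.
    """
    n = attn_maps[0].shape[-1]
    R = np.eye(n)  # each token starts fully relevant to itself
    for A, G in zip(attn_maps, attn_grads):
        # weight attention by its gradient, keep positive contributions,
        # and average over heads (an assumed aggregation rule)
        A_bar = np.clip(A * G, 0, None).mean(axis=0)
        A_bar = A_bar + np.eye(n)                       # skip connection
        A_bar = A_bar / A_bar.sum(-1, keepdims=True)    # conserve relevancy
        R = A_bar @ R
    return R  # e.g. R[0, 1:] ~ relevance of patch tokens to the class token
```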
In this paper, we take an early step towards video representation learning of human actions with the help of large-scale synthetic videos, particularly for human motion representation enhancement. Specifically, we fir...
ISBN:
(Print) 9781665445092
Reducing inconsistencies in the behavior of different versions of an AI system can be as important in practice as reducing its overall error. In image classification, sample-wise inconsistencies appear as "negative flips": a new model incorrectly predicts the output for a test sample that was correctly classified by the old (reference) model. Positive-congruent (PC) training aims at reducing the error rate while at the same time reducing negative flips, thus maximizing congruency with the reference model only on positive predictions, unlike model distillation. We propose a simple approach for PC training, Focal Distillation, which enforces congruence with the reference model by giving more weight to samples that were correctly classified. We also found that, if the reference model itself is chosen as an ensemble of multiple deep neural networks, negative flips can be further reduced without affecting the new model's accuracy.
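To make the idea concrete, here is a hedged PyTorch-style sketch of a focal-distillation loss; the exact weighting `alpha + beta * correct` and the hyperparameter values are assumptions for illustration and may differ from the paper's formulation:

```python
import torch
import torch.nn.functional as F

def focal_distillation_loss(new_logits, ref_logits, labels,
                            alpha=1.0, beta=5.0, temperature=1.0):
    """Cross-entropy plus a distillation term that is up-weighted on samples
    the reference model classified correctly (hypothetical weighting)."""
    ce = F.cross_entropy(new_logits, labels)
    # per-sample KL divergence between new and reference predictions
    kd = F.kl_div(F.log_softmax(new_logits / temperature, dim=-1),
                  F.softmax(ref_logits / temperature, dim=-1),
                  reduction="none").sum(-1)
    # focus the congruence term where the reference model was correct
    correct = (ref_logits.argmax(-1) == labels).float()
    weights = alpha + beta * correct
    return ce + (weights * kd).mean()
```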
ISBN:
(Print) 9781665445092
In this paper, we focus on improving online multi-object tracking (MOT). In particular, we introduce a region-based Siamese Multi-Object Tracking network, which we name SiamMOT. SiamMOT includes a motion model that estimates an instance's movement between two frames so that detected instances can be associated. To explore how motion modelling affects tracking capability, we present two variants of the Siamese tracker, one that models motion implicitly and one that models it explicitly. We carry out extensive quantitative experiments on three different MOT datasets: MOT17, TAO-person and Caltech Roadside Pedestrians, showing the importance of motion modelling for MOT and the ability of SiamMOT to substantially outperform the state of the art. Finally, SiamMOT also outperforms the winners of the ACM MM'20 HiEve Grand Challenge on the HiEve dataset. Moreover, SiamMOT is efficient, running at 17 FPS for 720P videos on a single modern GPU.
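As a concrete but hypothetical illustration of how a motion model supports association (not SiamMOT's actual Siamese architecture), a tracker can score each detection against the motion-predicted location of every track and match greedily:

```python
import numpy as np

def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter + 1e-9)

def associate(tracks, detections, motion_model, iou_thresh=0.3):
    """Greedily match detections to motion-predicted track locations.

    motion_model is any callable mapping a track to its predicted box
    in the new frame (a stand-in for the learned motion model).
    """
    matches, unmatched = [], set(range(len(detections)))
    for t_idx, track in enumerate(tracks):
        pred_box = motion_model(track)
        best, best_iou = None, iou_thresh
        for d_idx in unmatched:
            score = iou(pred_box, detections[d_idx])
            if score > best_iou:
                best, best_iou = d_idx, score
        if best is not None:
            matches.append((t_idx, best))
            unmatched.discard(best)
    return matches, unmatched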
ISBN:
(Print) 9781665445092
This paper deals with the scarcity of data for training optical flow networks, highlighting the limitations of existing sources such as labeled synthetic datasets or unlabeled real videos. Specifically, we introduce a framework to generate accurate ground-truth optical flow annotations quickly and in large amounts from any readily available single real picture. Given an image, we use an off-the-shelf monocular depth estimation network to build a plausible point cloud for the observed scene. Then, we virtually move the camera in the reconstructed environment with known motion vectors and rotation angles, allowing us to synthesize both a novel view and the corresponding optical flow field connecting each pixel in the input image to the one in the new frame. When trained with our data, state-of-the-art optical flow networks achieve superior generalization to unseen real data compared to the same models trained either on annotated synthetic datasets or unlabeled videos, and better specialization if combined with synthetic images.
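The geometry of this data-generation step can be sketched directly: given a depth map, pinhole intrinsics, and a chosen virtual camera motion, each pixel's flow follows from back-projection and re-projection. A minimal NumPy sketch under these assumptions (variable names are illustrative, not from the paper's code):

```python
import numpy as np

def flow_from_depth(depth, K, R, t):
    """Ground-truth optical flow for a virtual camera motion (R, t).

    depth: (H, W) metric depth, e.g. from a monocular network.
    K: (3, 3) pinhole intrinsics. Returns flow of shape (H, W, 2).
    (R, t) maps points from the source to the virtual camera frame.
    """
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).T  # 3 x HW
    # back-project pixels to 3D points in the source camera frame
    pts = np.linalg.inv(K) @ pix * depth.reshape(1, -1)
    # express the points in the virtual camera frame
    pts_new = R @ pts + t.reshape(3, 1)
    # re-project into the virtual view
    proj = K @ pts_new
    uv_new = proj[:2] / proj[2:3]
    return (uv_new - pix[:2]).T.reshape(H, W, 2)

# example: a small lateral camera motion over a flat scene 2 m away
depth = np.full((4, 4), 2.0)
K = np.array([[100.0, 0, 2], [0, 100.0, 2], [0, 0, 1]])
print(flow_from_depth(depth, K, np.eye(3), np.array([0.1, 0, 0])))
```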
Visible-Infrared Person Re-identification (VI-ReID) can effectively improve recognition performance in weak-lighting and nighttime scenes, and is an important research direction in pattern recognition and comp...
ISBN:
(Print) 9781665445092
We study image segmentation from an information-theoretic perspective, proposing a novel adversarial method that performs unsupervised segmentation by partitioning images into maximally independent sets. More specifically, we group image pixels into foreground and background, with the goal of minimizing predictability of one set from the other. An easily computed loss drives a greedy search process to maximize inpainting error over these partitions. Our method does not involve training deep networks, is computationally cheap, class-agnostic, and even applicable in isolation to a single unlabeled image. Experiments demonstrate that it achieves a new state-of-the-art in unsupervised segmentation quality, while being substantially faster and more general than competing approaches.
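In spirit, the objective scores a candidate foreground/background partition by how poorly each region can be inpainted from its complement, and a greedy search keeps mask perturbations that increase this error. A hedged sketch where `inpaint` and `perturb_mask` are hypothetical stand-ins for the paper's components:

```python
import numpy as np

def partition_score(image, mask, inpaint):
    """Total error when inpainting each region from its complement.

    image: (H, W) array; mask: (H, W) boolean foreground mask.
    A high score means foreground and background are hard to predict
    from one another, i.e. the partition is close to independent.
    """
    fg_err = np.abs(image - inpaint(image, mask)) * mask
    bg_err = np.abs(image - inpaint(image, ~mask)) * ~mask
    return fg_err.sum() + bg_err.sum()

def greedy_segment(image, init_mask, inpaint, perturb_mask, steps=100):
    """Greedy search: keep mask changes that increase inpainting error."""
    mask, best = init_mask, partition_score(image, init_mask, inpaint)
    for _ in range(steps):
        cand = perturb_mask(mask)  # e.g. flip a region or superpixel
        score = partition_score(image, cand, inpaint)
        if score > best:
            mask, best = cand, score
    return mask
```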
Leaf recognition is a vital component of plant classification, which is crucial in agricultural automation. Previous studies have employed various machine learning algorithms, ranging from deep learning methods such a...
ISBN:
(Print) 9781665445092
Disentangled representations support a range of downstream tasks including causal reasoning, generative modeling, and fair machine learning. Unfortunately, disentanglement has been shown to be impossible without the incorporation of supervision or inductive bias. Given that supervision is often expensive or infeasible to acquire, we choose to incorporate structural inductive bias and present an unsupervised, deep State-Space Model for Video Disentanglement (VDSM). The model disentangles latent time-invariant and dynamic factors via the incorporation of hierarchical structure with a dynamic prior and a Mixture of Experts decoder. VDSM learns separate disentangled representations for the identity of the object or person in the video and for the action being performed. We evaluate VDSM across a range of qualitative and quantitative tasks including identity and dynamics transfer, sequence generation, Fréchet Inception Distance, and factor classification. VDSM achieves state-of-the-art performance and exceeds adversarial methods, even when those methods use additional supervision.
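As a structural sketch only (the layer sizes and gating scheme below are assumptions, not VDSM's published architecture), a Mixture-of-Experts decoder can let a static identity code select a soft mixture of expert decoders while a separate dynamics code drives the content each expert decodes:

```python
import torch
import torch.nn as nn

class MoEDecoder(nn.Module):
    """Identity code gates a soft mixture of expert decoders;
    the time-varying dynamics code is decoded by every expert."""
    def __init__(self, id_dim=16, dyn_dim=32, out_dim=64 * 64, n_experts=4):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(id_dim, n_experts),
                                  nn.Softmax(dim=-1))
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dyn_dim, 256), nn.ReLU(),
                          nn.Linear(256, out_dim))
            for _ in range(n_experts))

    def forward(self, z_id, z_dyn):
        w = self.gate(z_id)                                          # (B, E)
        outs = torch.stack([e(z_dyn) for e in self.experts], dim=1)  # (B, E, D)
        return (w.unsqueeze(-1) * outs).sum(dim=1)                   # (B, D)

frames = MoEDecoder()(torch.randn(8, 16), torch.randn(8, 32))
print(frames.shape)  # torch.Size([8, 4096])
```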