Search Results - Inner Mongolia University Library

A Riemannian Residual Learning Mechanism for SPD Network

International Joint Conference on Neural Networks (IJCNN)

Authors: Zhenyu Cai, Rui Wang, Tianyang Xu, Xiaojun Wu, Josef Kittler. Affiliations: School of Artificial Intelligence and Computer Science, Jiangnan University, Wuxi, China; Jiangsu Provincial Engineering Laboratory of Pattern Recognition and Computational, Wuxi, China; Centre for Vision, Speech and Signal Processing, University of Surrey, Guildford, U.K.

ISBN: (digital) 9798350359312

ISBN: (print) 9798350359329

The generalization of the Euclidean network paradigm to Riemannian manifolds has attracted much attention in recent years because it offers useful geometric representations for processing manifold-valued data. However, information degradation during the data compression mapping hinders Riemannian networks from going deeper, and very few solutions have been designed specifically for this problem. Given the remarkable success of deep residual learning in Euclidean networks, a novel Riemannian residual learning mechanism (RRLM) is proposed in the context of Symmetric Positive Definite (SPD) manifolds, enabling the characterization of deep spatiotemporal features while preserving the manifold properties. Based on RRLM, a stack of SPD manifold-constrained residual-like blocks is appended to the tail of the original SPDNet (the backbone) to conduct deep Riemannian residual learning. For simplicity, we refer to the resulting architecture as the Riemannian residual SPD network (ResSPDNet). Experimental results on three types of visual classification tasks, i.e., facial emotion recognition, drone recognition, and action recognition, demonstrate that our method can achieve improved accuracy with a deepened network structure.

Keywords: Manifolds; Learning systems; Degradation; Emotion recognition; Visualization; Accuracy; Face recognition

From Recognition to Prediction: Leveraging Sequence Reasoning for Action Anticipation

arXiv, 2024

Authors: Liu, Xin; Hao, Chao; Yu, Zitong; Yue, Huanjing; Yang, Jingyu. Affiliations: School of Electrical and Information Engineering, Tianjin University, China; Computer Vision and Pattern Recognition Laboratory, School of Engineering Sciences, Lappeenranta-Lahti University of Technology LUT, Finland; School of Computing and Information Technology, Great Bay University, China

The action anticipation task refers to predicting what action will happen next based on observed videos, which requires the model to summarize the present well and then reason about the future. Experience and common sense suggest that there is a significant correlation between different actions, which provides valuable prior knowledge for action anticipation. However, previous methods have not effectively modeled this underlying statistical relationship. To address this issue, we propose a novel end-to-end, attention-based video modeling architecture named Anticipation via Recognition and Reasoning (ARR). ARR decomposes the action anticipation task into action recognition and sequence reasoning tasks, and effectively learns the statistical relationship between actions through next action prediction (NAP). Compared with existing temporal aggregation strategies, ARR is able to extract more effective features from the observed videos and make more reasonable predictions. In addition, because relationship modeling requires extensive training data, we propose an approach for unsupervised pre-training of the decoder that leverages the inherent temporal dynamics of video to enhance the reasoning capabilities of the network. Extensive experiments on the EPIC-Kitchens-100, EGTEA Gaze+, and 50Salads datasets demonstrate the efficacy of the proposed methods. The code is available at https://***/linuxsino/ARR. Copyright © 2024, The Authors. All rights reserved.

Keywords: Video analysis

A Dual-Space Framework for General Knowledge Distillation of Large Language Models

arXiv, 2025

Authors: Zhang, Xue; Zhang, Songming; Liang, Yunlong; Meng, Fandong; Chen, Yufeng; Xu, Jinan; Zhou, Jie. Affiliations: School of Computer Science and Technology, Beijing Key Laboratory of Traffic Data Mining and Embodied Intelligence, Beijing Jiaotong University, Beijing 100044, China; Pattern Recognition Center, WeChat AI, Tencent Inc, China

Knowledge distillation (KD) is a promising solution for compressing large language models (LLMs) by transferring their knowledge to smaller models. During this process, white-box KD methods usually minimize the distance between the output distributions of the teacher model and the student model to transfer more information. However, we reveal that the current white-box KD framework exhibits two limitations: a) bridging probability distributions from different output spaces limits the similarity between the teacher model and the student model; b) this framework cannot be applied to LLMs with different vocabularies. One of the root causes of these limitations is that the teacher and student distributions used for KD are produced by different prediction heads, which yield distributions in different output spaces and dimensions. Therefore, in this paper, we propose a dual-space knowledge distillation (DSKD) framework that unifies the prediction heads of the teacher and the student models for KD. Specifically, we first introduce two projectors with ideal initialization to project the teacher/student hidden states into the student/teacher representation spaces. After this, the hidden states from different models can share the same head, which unifies the output spaces of the distributions. Furthermore, we develop an exact token alignment (ETA) algorithm to align the same tokens in two differently tokenized sequences. Based on the above, our DSKD framework is a general KD framework that supports both off-policy and on-policy KD, as well as KD between any two LLMs regardless of their vocabularies. Extensive experiments on instruction-following, mathematical reasoning, and code generation benchmarks show that DSKD significantly outperforms existing methods based on the current white-box KD framework and surpasses other cross-tokenizer KD methods for LLMs with different vocabularies. © 2025, CC BY-NC-ND.
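
The white-box KD objective described in this abstract can be illustrated with a minimal sketch. It assumes forward KL divergence as the distance (one common choice; the abstract does not name a specific distance) and uses plain Python lists as stand-ins for per-token logits. The function names `softmax`, `kl_divergence`, and `white_box_kd_loss` are hypothetical, and this is not the DSKD method itself. The length check mirrors limitation b): the comparison is only defined when both models share one vocabulary.

```python
import math


def softmax(logits, temperature=1.0):
    """Convert logits to a probability distribution (numerically stable)."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def white_box_kd_loss(teacher_logits, student_logits, temperature=1.0):
    """Forward KL between teacher and student output distributions.

    Assumption: both logit vectors index the same vocabulary, which is
    exactly the restriction the abstract identifies in white-box KD.
    """
    if len(teacher_logits) != len(student_logits):
        raise ValueError("teacher and student must share one vocabulary")
    p_teacher = softmax(teacher_logits, temperature)
    q_student = softmax(student_logits, temperature)
    return kl_divergence(p_teacher, q_student)


# Toy logits over a three-token vocabulary (illustrative values only).
print(white_box_kd_loss([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]))
print(white_box_kd_loss([1.0, 2.0, 3.0], [3.0, 2.0, 1.0]))
```

The loss is zero when the two distributions coincide and grows as the student diverges from the teacher; passing logit vectors of different lengths raises an error, which is the situation a cross-tokenizer framework has to handle.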

Keywords: Distribution functions

Answering Diverse Questions via Text Attached with Key Audio-Visual Clues

arXiv, 2024

Authors: Ye, Qilang; Yu, Zitong; Liu, Xin. Affiliations: The School of Computing and Information Technology, Great Bay University, Dongguan 523000, China; The School of Computer Science and Engineering, Chongqing University of Technology, Chongqing 401300, China; The Computer Vision and Pattern Recognition Laboratory, Lappeenranta-Lahti University of Technology LUT, Lappeenranta 53850, Finland

Audio-visual question answering (AVQA) requires reference to video content and auditory information, followed by correlating the question to predict the most precise answer. Although mining deeper layers of audio-visual information to interact with questions facilitates the multimodal fusion process, the redundancy of audio-visual parameters tends to reduce the generalization of the inference engine to multiple question-answer pairs in a single video. Indeed, the naturally heterogeneous relationship between audio-visual content and text makes perfect fusion challenging. To prevent high-level audio-visual semantics from weakening the network's adaptability to diverse question types, we propose a framework that performs mutual correlation distillation (MCD) to aid question inference. MCD is divided into three main steps: 1) first, a residual structure is used to enhance the audio-visual soft associations based on self-attention, and key local audio-visual features relevant to the question context are then captured hierarchically by shared aggregators and coupled, in the form of clues, with specific question vectors; 2) second, knowledge distillation is enforced to align audiovisual-text pairs in a shared latent space to narrow the cross-modal semantic gap; 3) finally, the audio-visual dependencies are decoupled by discarding the decision-level integrations. We evaluate the proposed method on two publicly available datasets containing multiple question-and-answer pairs, i.e., Music-AVQA and AVQA. Experiments show that our method outperforms other state-of-the-art methods, and one interesting finding is that removing deep audio-visual features during inference can effectively mitigate overfitting. The source code is released at http://***/rikeilong/MCD-forAVQA. © 2024, CC BY.

Keywords: Distillation

CLIDSUM: A Benchmark Dataset for Cross-Lingual Dialogue Summarization

2022 Conference on Empirical Methods in Natural Language Processing, EMNLP 2022

Authors: Wang, Jiaan; Meng, Fandong; Lu, Ziyao; Zheng, Duo; Li, Zhixu; Qu, Jianfeng; Zhou, Jie. Affiliations: Pattern Recognition Center, WeChat AI, Tencent Inc, China; School of Computer Science and Technology, Soochow University, Suzhou, China; Shanghai Key Laboratory of Data Science, School of Computer Science, Fudan University, Shanghai, China; Beijing University of Posts and Telecommunications, Beijing, China

We present CLIDSUM, a benchmark dataset for building cross-lingual summarization systems on dialogue documents. It consists of 67k+ dialogue documents and 112k+ annotated summaries in different target languages. Based on the proposed CLIDSUM, we introduce two benchmark settings for supervised and semi-supervised scenarios, respectively. We then build various baseline systems in different paradigms (pipeline and end-to-end) and conduct extensive experiments on CLIDSUM to provide deeper analyses. Furthermore, we propose mDIALBART, which extends mBART via further pre-training, where the multiple objectives help the pre-trained model capture the structural characteristics and key content of dialogues as well as the transformation from the source to the target language. Experimental results show the superiority of mDIALBART, which, as an end-to-end model, outperforms strong pipeline models on CLIDSUM. Finally, we discuss specific challenges that current approaches face in this task and give multiple promising directions for future research. We have released the dataset and code at https://***/krystalan/ClidSum. © 2022 Association for Computational Linguistics.

Keywords: Pipelines

Anomaly Handwritten Text Detection for Automatic Descriptive Answer Evaluation

Proceedings of the 2022 11th International Conference on Computing and Pattern Recognition

Authors: Nilanjana Chatterjee, Palaiahnaakote Shivakumara, Umapada Pal, Tong Lu, Yue Lu. Affiliations: Computer Vision and Pattern Recognition Unit, Indian Statistical Institute, India; Faculty of Computer Science and Information Technology, University of Malaya, Malaysia; National Key Lab for Novel Software Technology, Nanjing University, China; Shanghai Key Laboratory of Multidimensional Information Processing, East China Normal University, China

ISBN: (print) 9781450397056

Although there are advanced technologies for character recognition, automatic descriptive answer evaluation remains an open challenge for the document image analysis community because of the large diversity of handwritten text and of answers to a given question. This paper presents a novel method for detecting anomalous handwritten text in the responses written by students. The method is based on the observation that students who are confident in answering a question usually write legibly and neatly, whereas when they are not confident, their writing is sloppy and may not be easy for the reader to understand. To detect such anomalous handwritten text, we explore a new combination of the Fourier transform and a deep learning model for detecting edges. The resulting edge image preserves the structure of the handwritten text. To extract features for classifying anomalous and normal text, the proposed method studies the behavior of the writing style, especially the variation at ascenders and descenders. To this end, the proposed work draws the principal axis of the edge images, which is invariant to rotation, scaling and, to some extent, distortion. With respect to the principal axis, the proposed method draws the medial axis using the uppermost and lowermost points. The distances between the medial axis and principal axis points are used as the feature vector. The feature vector is then passed to an Artificial Neural Network for classification of anomalous text. The proposed method is evaluated on our own dataset, a standard dataset for gender identification (IAM) and a handwritten forgery detection dataset (ACPR 2019). The results on the different datasets show that the proposed work outperforms the existing methods.

Interactive Semantic Segmentation With Weak Supervision

8th International Conference on Computing and Artificial Intelligence, ICCAI 2022

Authors: Gong, Lei; Wang, Da-Han; Wu, Yun; Ye, Hai-Li; Zhu, Chen-Yan. Affiliations: School of Computer and Information Engineering, Xiamen University of Technology, Xiamen 361024, China; Fujian Key Laboratory of Pattern Recognition and Image Understanding, Xiamen 361024, China; Medical Diagnostic Systems Co. Ltd., Xiamen 361000, China

ISBN: (print) 9781450396110

At present, training the most advanced semantic segmentation models relies mainly on pixel-level annotation, that is, annotating the category of each pixel of an image. Such annotation is usually time-consuming and expensive, especially for specialized applications that require expert annotation. Weakly supervised segmentation using point-level supervision has been investigated, but it suffers from two major problems: the supervision information is quite limited, and the performance is far from that of fully supervised methods. In this paper, we propose a novel interactive image segmentation method based on weak supervision, which allows repeated feedback of easily obtained weakly supervised information and makes more efficient use of the supervision information. In the downstream task (interactive image segmentation), point-level supervision is used many times, which tightens the connection between pixels in the upstream task and improves the segmentation accuracy. First, image-level tags are used to train the classification network. Then pseudo-semantic labels are generated and fed into the interactive segmentation network for training, yielding an almost fully supervised CNN, which further improves performance and provides operability for human-computer interaction. The proposed method achieves promising semantic segmentation results on the PASCAL VOC 2012 dataset that are close to those obtained by strongly supervised segmentation methods. © 2022 ACM.

Keywords: Semantic Segmentation

A Novel Wideband CPW-Fed 5.8GHz RFID Tag Antenna

Chinese Journal of Electronics, 2023, Vol. 21, No. 2, pp. 202-208

Authors: Huihui Li, Xuanqin Mou, Zhen Ji, Hang Yu, Yan Li. Affiliations: Institute of Image Processing and Pattern Recognition, Xi'an Jiaotong University, Xi'an, China; College of Computer Science and Software Engineering, Shenzhen University, Shenzhen, China; Shenzhen Key Laboratory of Embedded System Design, Shenzhen, China

A novel wideband 5.8 GHz CPW-fed antenna is presented for Radio Frequency Identification (RFID) tags. Four U-shaped and four L-shaped branches are used as additional resonators to achieve wideband operation. The proposed antenna was analyzed numerically using the Method of Moments (MOM) and the Finite Element Method (FEM). With the antenna size limited to $30 \times 30\,\text{mm}^{2}$, the −10 dB bandwidth obtained by MOM is 3.235 GHz (5.765 to 9 GHz) and the −9.5 dB bandwidth obtained by FEM is 2.74 GHz (5.32 to 8.06 GHz), corresponding to 55.7% and 47.2% of the 5.8 GHz center frequency, respectively. Moreover, the simulated results show that the proposed antenna has a gain of more than 4.8 dBi and that the radiation pattern is nearly omnidirectional in the H-plane. The measured −10 dB bandwidth is 2.68 GHz (5.63 to 8.31 GHz), or 46.2% of the 5.8 GHz frequency. Furthermore, there are three measured resonant frequencies, at 1.34 GHz, 3.23 GHz and 5.8 GHz, each with a return loss lower than −10 dB. The measured results demonstrate wideband RFID tag antenna performance and are in good agreement with the calculated results.
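
The bandwidth percentages quoted in this abstract follow from dividing each band's width by the 5.8 GHz reference frequency. The short sketch below reproduces that arithmetic from the band edges given in the text; the function name `fractional_bandwidth` is hypothetical and introduced only for illustration.

```python
def fractional_bandwidth(f_low_ghz, f_high_ghz, f_ref_ghz=5.8):
    """Return (bandwidth in GHz, bandwidth as a percentage of f_ref_ghz)."""
    bandwidth = f_high_ghz - f_low_ghz
    return bandwidth, 100.0 * bandwidth / f_ref_ghz


# Band edges (GHz) as stated in the abstract.
bands = [
    ("MOM, -10 dB", 5.765, 9.0),
    ("FEM, -9.5 dB", 5.32, 8.06),
    ("measured, -10 dB", 5.63, 8.31),
]

for label, f_low, f_high in bands:
    bandwidth, percent = fractional_bandwidth(f_low, f_high)
    print(f"{label}: {bandwidth:.3f} GHz, {percent:.1f}% of 5.8 GHz")
```

The FEM and measured figures reproduce the quoted 47.2% and 46.2%. For the MOM band, 3.235 / 5.8 is about 55.78%, so the quoted 55.7% appears to be truncated rather than rounded.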

Keywords: Antenna measurements; Resonant frequency; Loss measurement; Frequency measurement; Finite element analysis; Broadband antennas; RFID tags

Label Words are Anchors: An Information Flow Perspective for Understanding In-Context Learning

arXiv, 2023

Authors: Wang, Lean; Li, Lei; Dai, Damai; Chen, Deli; Zhou, Hao; Meng, Fandong; Zhou, Jie; Sun, Xu. Affiliations: National Key Laboratory for Multimedia Information Processing, School of Computer Science, Peking University, China; Pattern Recognition Center, WeChat AI, Tencent Inc., China

In-context learning (ICL) has emerged as a promising capability of large language models (LLMs): given demonstration examples, they can perform diverse tasks. However, the underlying mechanism of how LLMs learn from the provided context remains under-explored. In this paper, we investigate the working mechanism of ICL through an information flow lens. Our findings reveal that label words in the demonstration examples function as anchors: (1) semantic information aggregates into label word representations during processing in the shallow computation layers; (2) the consolidated information in label words serves as a reference for LLMs' final predictions. Based on these insights, we introduce an anchor re-weighting method to improve ICL performance, a demonstration compression technique to expedite inference, and an analysis framework for diagnosing ICL errors in GPT2-XL. The promising applications of our findings further validate the uncovered ICL working mechanism and pave the way for future studies. © 2023, CC BY.

Keywords: Demonstrations