ISBN (Digital): 9798350386905
ISBN (Print): 9798350386912
This article studies the process of automatic text classification, focusing on two key steps: feature extraction and classification processing. Adopting suitable feature extraction methods and classification models can improve both the accuracy and the efficiency of automatic text classification. The effectiveness of the proposed methods is verified through experiments and evaluations, and a corresponding result evaluation model is proposed to measure classification performance. The findings confirm that selecting appropriate feature extraction methods and classification models enhances the accuracy and effectiveness of text classification. The conclusion summarizes the research contributions and outlines directions and prospects for future work. This study provides a complete pipeline for automatic text classification, offering practical guidance for real-world applications.
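A minimal sketch of the two steps named above, assuming TF-IDF features and a logistic-regression classifier (the abstract does not specify the exact feature extractor or model); a per-class classification report stands in for the proposed result evaluation model:

```python
# Hypothetical sketch: TF-IDF feature extraction + logistic-regression
# classification processing. The exact methods in the paper are not specified.
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

texts = ["great battery life", "screen broke after a week", "fast shipping"]
labels = ["positive", "negative", "positive"]

clf = Pipeline([
    ("features", TfidfVectorizer(ngram_range=(1, 2), min_df=1)),  # feature extraction
    ("model", LogisticRegression(max_iter=1000)),                 # classification processing
])
clf.fit(texts, labels)

# Result evaluation: per-class precision/recall/F1 (on held-out data in practice).
print(classification_report(labels, clf.predict(texts)))
```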
The amount of digital text-based consumer review data has increased dramatically, and many machine learning approaches exist for automated text-based sentiment analysis. Marketing researchers have employed various methods for analyzing text reviews but lack a comprehensive comparison of their performance to guide method selection in future applications. We focus on the fundamental relationship between a consumer's overall empirical evaluation and the text-based explanation of that evaluation, and study the empirical tradeoff between predictive and diagnostic ability when various methods are applied to estimate this relationship. We incorporate methods previously employed in the marketing literature as well as methods that are so far less common there. For generalizability, we analyze 25,241 products in nine product categories and 260,489 reviews across five review platforms. We find that neural network-based machine learning methods, in particular pre-trained versions, offer the most accurate predictions, while topic models such as Latent Dirichlet Allocation offer deeper diagnostics. However, neural network models are not suited for diagnostic purposes, and topic models are ill-equipped for making predictions. Consequently, future selection of methods for processing text reviews is likely to be based on the analyst's goal of prediction versus diagnostics. Published by Elsevier B.V.
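A rough illustration of the prediction-versus-diagnostics tradeoff described above, assuming a generic pre-trained sentiment model for prediction and a small LDA topic model for diagnostics; the model choices and hyperparameters are illustrative, not the paper's setup:

```python
# Hypothetical sketch of the prediction-vs-diagnostics tradeoff.
from transformers import pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

reviews = [
    "Battery lasts two days, very happy with this phone.",
    "Screen cracked in a week and support was unhelpful.",
]

# Prediction: a pre-trained neural model scores overall sentiment directly.
sentiment = pipeline("sentiment-analysis")
print(sentiment(reviews))

# Diagnostics: LDA topics expose what reviewers talk about,
# but they do not predict the overall evaluation.
counts = CountVectorizer(stop_words="english").fit(reviews)
lda = LatentDirichletAllocation(n_components=2, random_state=0)
lda.fit(counts.transform(reviews))
vocab = counts.get_feature_names_out()
for k, topic in enumerate(lda.components_):
    top_words = [vocab[i] for i in topic.argsort()[-5:]]
    print(f"topic {k}: {top_words}")
```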
Natural language processing (NLP) is a very important part of machine learning that can be applied to many real-world applications. Several NLP models with huge training datasets have been proposed. The primary purpose of th...
We address a generalization of the bandit with knapsacks problem, where a learner aims to maximize rewards while satisfying an arbitrary set of long-term constraints. Our goal is to design best-of-both-worlds algorith...
ISBN (Digital): 9798331535087
ISBN (Print): 9798331535094
As an important minority language, Tibetan carries rich cultural information, but related natural language processing research remains scarce. To address this gap, this study selects the first high-quality instruction dataset designed specifically for supervised fine-tuning of Tibetan Large Language Models (LLMs), the TIFD dataset, and for the first time applies a lightweight fine-tuning framework based on Low-Rank Adaptation (LoRA) to systematically evaluate the instruction-following ability of three base model families, GLM-4, Qwen2.5, and Llama-3, on Tibetan instruction tasks from the TIFD dataset. The experimental results show that combining TIFD's structured instruction triples (instruction-input-output) with LoRA significantly improves the models' instruction comprehension and generation ability. The study further reveals that the multitask coverage of the TIFD dataset and the low-rank constraint mechanism of LoRA synergistically optimize the models' handling of complex linguistic phenomena such as the Tibetan honorific system and verb tense. This synergy offers a highly efficient fine-tuning paradigm for low-resource-language NLP. The study not only verifies the general optimization effect of the TIFD dataset across multiple Tibetan base models but also provides empirical evidence for cross-lingual model design.
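A hedged sketch of what LoRA fine-tuning on structured instruction triples could look like with the Hugging Face peft library; the base model name, rank, and target modules below are assumptions rather than the paper's reported configuration:

```python
# Hypothetical LoRA setup for instruction fine-tuning (not the paper's exact config).
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base = "Qwen/Qwen2.5-7B"  # one of the three base model families evaluated
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)

lora_cfg = LoraConfig(
    r=8, lora_alpha=16, lora_dropout=0.05,   # low-rank constraint on the update
    target_modules=["q_proj", "v_proj"],     # attention projections only (assumed)
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()  # only the low-rank adapters are trained

def format_example(ex):
    # Structured instruction triple, as in the TIFD dataset description.
    return (f"### Instruction:\n{ex['instruction']}\n"
            f"### Input:\n{ex['input']}\n"
            f"### Output:\n{ex['output']}")
```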
How often do we come across paragraphs that contain important information but are too long to read? Most people tend to overlook humongous paragraphs at the cost of losing out on crucial information. This leads to...
Autoregressive text generation for low-resource languages, particularly the option of using pre-trained language models, is a relatively under-explored problem. In this paper, we model Math Word Problem (MWP) generat...
ISBN (Print): 9798400701030
Unsupervised text style transfer aims to rewrite a text into a target style while preserving its main content. Traditional methods rely on a fixed-size vector to regulate text style, which makes it difficult to accurately convey the style strength of each individual token. In fact, each token of a text carries a different style intensity and makes a different contribution to the overall style. Our proposed method addresses this issue by assigning an individual style vector to each token in a text, allowing fine-grained control and manipulation of style strength. Additionally, an adversarial training framework integrated with teacher-student learning is introduced to enhance training stability and reduce the complexity of high-dimensional optimization. Experimental results demonstrate the efficacy of our method, with clearly improved style transfer accuracy and content preservation in both two-style and multi-style transfer settings.
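One way to realize per-token style vectors, sketched under assumptions (hidden size, sigmoid gating) not stated in the abstract: a learned style embedding is scaled by a per-token strength and added to each token representation.

```python
# Hypothetical sketch of token-level style injection (not the authors' exact model).
import torch
import torch.nn as nn

class TokenStyleInjector(nn.Module):
    def __init__(self, d_model=512, num_styles=2):
        super().__init__()
        self.style_emb = nn.Embedding(num_styles, d_model)  # one vector per style
        self.strength = nn.Linear(d_model, 1)                # per-token style strength

    def forward(self, token_hidden, style_id):
        # token_hidden: (batch, seq_len, d_model); style_id: (batch,)
        style = self.style_emb(style_id).unsqueeze(1)         # (batch, 1, d_model)
        gate = torch.sigmoid(self.strength(token_hidden))     # (batch, seq_len, 1)
        return token_hidden + gate * style                    # token-specific style vectors

injector = TokenStyleInjector()
hidden = torch.randn(2, 10, 512)
styled = injector(hidden, torch.tensor([0, 1]))
print(styled.shape)  # torch.Size([2, 10, 512])
```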
ISBN (Print): 9783031642982; 9783031642999
In the educational domain, identifying similarity among test items provides various advantages for exam quality management and personalized student learning. Existing studies have mostly relied on student performance data, such as the number of correct or incorrect answers, to measure item similarity. However, nuanced semantic information within the test items has been overlooked, possibly due to the lack of similarity-labeled data: human-annotated educational data demands costly expertise, and items comprising multiple aspects, such as questions and choices, require detailed criteria. In this paper, we introduce the task of aspect-based semantic textual similarity for educational test items (aSTS-EI), in which similarity is assessed by specific aspects within test items, and we present an LLM-guided benchmark dataset. We report baseline performance by extending existing STS methods, setting the groundwork for future aSTS-EI work. In addition, to assist data-scarce settings, we propose a progressive augmentation (ProAug) method, which generates item aspects step by step via recursive prompting. Experimental results indicate that existing STS methods suffice for shorter aspects while underlining the need for specialized approaches for relatively longer aspects. The markedly improved results with ProAug highlight how our augmentation strategy helps overcome data scarcity.
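A hypothetical sketch of progressive augmentation via recursive prompting, with `call_llm`, the aspect list, and the prompt template all placeholders rather than the paper's actual ProAug implementation:

```python
# Hypothetical recursive-prompting loop: each generated aspect is fed back
# into the next prompt so the test item is built step by step.
ASPECTS = ["question stem", "correct choice", "distractor choices", "explanation"]

def call_llm(prompt: str) -> str:
    raise NotImplementedError("plug in your LLM client here")  # placeholder

def proaug_generate(topic: str) -> dict:
    item, context = {}, f"Topic: {topic}"
    for aspect in ASPECTS:
        prompt = (
            f"{context}\n"
            f"Generate the {aspect} for a test item on this topic. "
            f"Keep it consistent with everything above."
        )
        item[aspect] = call_llm(prompt)
        context += f"\n{aspect}: {item[aspect]}"  # recursion: feed the result back in
    return item
```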
ISBN (Digital): 9798350376876
ISBN (Print): 9798350376883
Image captioning has emerged as a rapidly thriving area for the machine learning research community. Generally, image captioning is performed by combining various computer vision features, natural language processing, and machine learning methods, along with additional inputs, to obtain more accurate context-dependent image captions. Bengali is a significant language in India, spoken by approximately 100 million people. Various state-of-the-art methods exist for generating captions in English; however, for Bengali there are very few, and existing English-language methods are not particularly helpful. Moreover, translations from English to Bengali may overlook or misinterpret subtle meanings, tones, or cultural nuances. Therefore, this work proposes a machine-learning model for captioning pictures in Bengali using an attention mechanism. The Flickr-8k dataset, which contains 8,000 images, is used to train the model. The proposed method generates image captions in Bengali and attains a BLEU score of 0.66.
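For context, a corpus-level BLEU score like the reported 0.66 could be computed along these lines; the tokenization and smoothing choices here are assumptions, not the paper's evaluation script:

```python
# Hypothetical BLEU evaluation of generated Bengali captions against references.
from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

# references: one list of reference captions (tokenized) per image;
# hypotheses: one generated caption (tokenized) per image.
references = [[["একটি", "কুকুর", "ঘাসে", "দৌড়াচ্ছে"]]]
hypotheses = [["একটি", "কুকুর", "দৌড়াচ্ছে"]]

smooth = SmoothingFunction().method1
score = corpus_bleu(references, hypotheses, smoothing_function=smooth)
print(f"BLEU: {score:.2f}")
```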