ISBN (digital): 9798331509910
ISBN (print): 9798331509927
Accurately representing categorical data is crucial for enhancing machine learning model performance, especially when dealing with ordinal data, where the order of categories is significant. This work provides a novel use of ordinal encoding as an alternative to traditional one-hot encoding in word embedding methods. We highlight key theoretical results that demonstrate the benefits of ordinal encoding: improved model interpretability, robustness to noise, and effective out-of-vocabulary handling. Through a series of theorems and empirical benchmarks on publicly available datasets, we show that ordinal encoding can significantly enhance predictive accuracy in text classification tasks. Our findings underscore the importance of preserving ordinal relationships in categorical data and position ordinal encoding as a vital methodology for practitioners aiming for optimal performance. This contribution enriches the ongoing discussion in data representation, offering insights into using FastText and Word2Vec with ordinal values instead of one-hot-encoded vectors.
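To make the contrast concrete, here is a minimal sketch of ordinal versus one-hot encoding for an ordered categorical feature. The category ordering below is an illustrative assumption, not taken from the paper, and the helpers are hypothetical.

```python
# Ordinal encoding preserves the order of categories as a single rank;
# one-hot encoding discards that order in favor of indicator vectors.
# The "small < medium < large" ordering is an invented example.

SIZES = ["small", "medium", "large"]  # assumed category ordering

def ordinal_encode(value, categories=SIZES):
    # Map each category to its rank, preserving order information.
    return categories.index(value)

def one_hot_encode(value, categories=SIZES):
    # Map each category to an indicator vector; order is discarded.
    return [1 if c == value else 0 for c in categories]

print(ordinal_encode("medium"))   # 1
print(one_hot_encode("medium"))   # [0, 1, 0]
```

With the ordinal form, "medium" sits numerically between "small" and "large", which is exactly the relationship a one-hot vector cannot express.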
ISBN (print): 9781665494250
With the development of Internet technology, the number of Internet users has grown rapidly, and an enormous amount of data is generated on the Internet every day. At the same time, advances in storage and query technology make it easy to collect massive datasets, but the information value of these data is uneven, and most of them are unlabeled. Traditional supervised learning, however, demands large numbers of labeled samples. Faced with a large pool of unlabeled samples, effective automatic labeling methods are lacking, and manual labeling is costly. If simple random sampling is used to select samples for annotation, noisy examples may be chosen and resources wasted, and low-quality training data can also degrade the prediction accuracy of the model. Meanwhile, traditional deep learning methods train poorly on small labeled training sets. Taking the text sentiment analysis task in natural language processing as its background, this paper uses IMDB film review data as the training and test sets, designs an active learning algorithm based on clustering analysis, and combines it with an appropriate pre-trained fine-tuning model to construct a data enhancement method based on active learning. Experiments show that when the labeled training set is reduced by 90%, the prediction accuracy of the pre-trained model drops by no more than 2%, which verifies the effectiveness of the data enhancement method combining active learning with the pre-trained model.
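The clustering-based selection step can be sketched as follows: cluster the unlabeled pool, then label only the sample nearest each cluster centroid. This is a toy illustration of the general idea, not the paper's algorithm; the tiny k-means, the 2-D points, and the deterministic initialization are all invented for the example.

```python
def dist2(a, b):
    # Squared Euclidean distance between two points.
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, iters=20):
    """Toy k-means over tuples; initialized from the first k points."""
    centroids = list(points[:k])
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda c: dist2(p, centroids[c]))
            clusters[i].append(p)
        centroids = [
            tuple(sum(coord) / len(cl) for coord in zip(*cl)) if cl else centroids[i]
            for i, cl in enumerate(clusters)
        ]
    return centroids

def select_for_labeling(points, k):
    # Label budget = k: pick the sample nearest each centroid as the
    # most representative one to send to a human annotator.
    centroids = kmeans(points, k)
    return [min(points, key=lambda p: dist2(p, c)) for c in centroids]

# Two obvious clusters; one representative per cluster gets labeled.
data = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (5.0, 5.0), (5.1, 4.9), (4.9, 5.2)]
picked = select_for_labeling(data, k=2)
print(picked)
```

In the real system the "points" would be sentence embeddings of unlabeled IMDB reviews, and the selected samples would be annotated and used to fine-tune the pre-trained model.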
ISBN (print): 9780998133157
Clinical notes, which can be embedded into electronic medical records, document patient care delivery and summarize interactions between healthcare providers and patients. These clinical notes directly inform patient care and can also indirectly inform research and quality/safety metrics, among other uses. Recently, some states within the United States of America have required that patients have open access to their clinical notes to improve the exchange of patient information for patient care. Thus, developing methods to assess the cyber risks of clinical notes before sharing and exchanging data is critical. While existing natural language processing techniques are geared to de-identify clinical notes, to the best of our knowledge, few have focused on classifying sensitive-information risk, which is a fundamental step toward developing effective, widespread protection of patient health information. To bridge this gap, this research investigates methods for identifying security/privacy risks within clinical notes. The classification can be used either upstream, to identify areas within notes that likely contain sensitive information, or downstream, to improve the identification of clinical notes that have not been entirely de-identified. We develop several models using unigram and word2vec features with different classifiers to categorize sentence risk. Experiments on the i2b2 de-identification dataset show that the SVM classifier using word2vec features obtained a maximum F1-score of 0.792. Future research involves articulation and differentiation of risk in terms of different global regulatory requirements.
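The word2vec feature step the abstract describes is commonly implemented by mean-pooling per-word vectors into one fixed-length sentence vector, which is then fed to a classifier such as an SVM. The sketch below shows only that pooling step; the 3-dimensional toy embeddings are invented for illustration, not trained vectors.

```python
# Average per-word embeddings into a single sentence vector -- the kind
# of fixed-length feature an SVM sentence-risk classifier consumes.
# TOY_VECS is an invented stand-in for a trained word2vec table.

TOY_VECS = {
    "patient": [0.9, 0.1, 0.0],
    "name":    [0.8, 0.2, 0.1],
    "visited": [0.1, 0.7, 0.3],
    "clinic":  [0.2, 0.6, 0.4],
}

def sentence_vector(tokens, vecs=TOY_VECS, dim=3):
    known = [vecs[t] for t in tokens if t in vecs]
    if not known:
        return [0.0] * dim  # fully out-of-vocabulary sentence
    # Component-wise mean over the known word vectors.
    return [sum(col) / len(known) for col in zip(*known)]

v = sentence_vector(["patient", "name"])
print(v)
```

Sentences mentioning identifier-like words ("patient", "name") land near each other in this space, which is what lets a downstream classifier separate high-risk from low-risk sentences.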
ISBN (print): 9781450394277
Background: Code reviewing is an essential part of software development to ensure software quality. However, the abundance of review tasks and the intensity of the workload for reviewers negatively impact the quality of the reviews. The short review text is often unactionable. Aims: We propose the Example Driven Review Explanation (EDRE) method to facilitate the code review process by adding explanations through examples. EDRE recommends similar code reviews as examples to further explain a review and help a developer understand the received reviews with less communication overhead. Method: Through an empirical study in an industrial setting and by analyzing 3,722 code reviews across three open-source projects, we compared five methods of data retrieval, text classification, and text recommendation. Results: EDRE using TF-IDF word embedding along with an SVM classifier can provide practical examples for each code review with an F-score of 92% and an accuracy of 90%. Conclusions: Example-based explanation is an established method for assisting experts in explaining decisions. EDRE can accurately provide a set of context-specific examples to facilitate the code review process in software teams.
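The retrieval idea behind EDRE can be sketched as TF-IDF vectors plus cosine similarity: represent past reviews and the new review in the same weighted term space, then recommend the closest past review as an example. This is an illustrative toy pipeline, not the exact EDRE implementation; the review texts are invented.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Toy TF-IDF over whitespace tokens (illustrative only)."""
    tokenized = [d.lower().split() for d in docs]
    df = Counter(t for toks in tokenized for t in set(toks))
    n = len(docs)
    vocab = sorted(df)
    idf = {t: math.log(n / df[t]) + 1.0 for t in vocab}
    vecs = []
    for toks in tokenized:
        tf = Counter(toks)
        vecs.append([tf[t] / len(toks) * idf[t] for t in vocab])
    return vecs

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

past_reviews = [
    "please extract this logic into a helper method",
    "consider renaming this variable for clarity",
    "extract duplicated logic into a shared method",
]
new_review = "extract this duplicated logic into a method"

# Vectorize past reviews and the new review together, then recommend
# the most similar past review as an explanatory example.
vecs = tfidf_vectors(past_reviews + [new_review])
query = vecs[-1]
best = max(range(len(past_reviews)), key=lambda i: cosine(vecs[i], query))
print(past_reviews[best])
```

Rarer, more informative terms ("duplicated") get higher IDF weight, so the recommendation favors reviews that share the distinctive vocabulary rather than just common words.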
ISBN (digital): 9798331533557
ISBN (print): 9798331533564
Smart technologies are vital for progress in text processing. They support the enhancement of academic activities' efficiency. For e-assessment, they provide researchers and educators with tools for objective content evaluation and scoring. As demand for quality learning experiences increases, such technologies offer solutions to enrich education and expand understanding of course materials. Automated question-answering scoring (QAS) addresses the challenges of manual grading, which is labor-intensive and prone to mistakes. By integrating computer systems with natural language processing (NLP) and intelligent algorithms, student answers can be analyzed objectively, delivering accurate feedback to help improve skills. In this study, we employed WordNet and TF-IDF for QAS tasks, based on text similarity matching methods. Text pre-processing techniques are utilized for reliable computational analysis of the Arabic language. This study combines two established methods in linguistic computing and text processing to automate the assessment of question answering, using a dataset of questions, teachers' template answers, and students' answers. The results showed that WordNet outperformed TF-IDF on three evaluation metrics.
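The core of similarity-based answer scoring can be illustrated with a deliberately simplified stand-in: compare the student's answer to the teacher's template by token overlap (Jaccard). The real system uses WordNet relations and TF-IDF weights rather than raw overlap, and the answers below are invented.

```python
# Score a student answer against a teacher template by Jaccard overlap
# of normalized tokens -- a simplified proxy for the WordNet / TF-IDF
# similarity matching the paper actually uses.

def normalize(text):
    # Lowercase and strip basic punctuation before tokenizing.
    return set(text.lower().replace(".", "").replace(",", "").split())

def overlap_score(student, template):
    s, t = normalize(student), normalize(template)
    return len(s & t) / len(s | t) if s | t else 0.0

template = "photosynthesis converts light energy into chemical energy"
good = "photosynthesis converts light energy into chemical energy"
weak = "plants need water"

print(round(overlap_score(good, template), 2))  # 1.0
print(round(overlap_score(weak, template), 2))  # 0.0
```

WordNet improves on this by crediting synonyms and related terms (e.g. a student writing "transforms" instead of "converts"), which pure lexical overlap, and to a lesser degree TF-IDF, cannot do.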
Scene Text Recognition (STR) has long been considered an important yet challenging task in the field of computer vision. Recent works have demonstrated that utilizing language information is effective for visually difficult images, such as those with occlusion or blurring. However, the use of language information sometimes leads to an over-correction problem. For out-of-vocabulary samples (e.g., "hou" and "0x4a"), some methods tend to be biased toward the language side and over-correct (e.g., "hou" to "hot"). This imbalance between vision and language has limited the use of such models in practical scenarios, yet it rarely occurs for humans. To address this issue, we rethink the human recognition process and propose a model that behaves in the order of "Read, Spell and Repeat", refining the recognition process cyclically with vision and language information. With this mechanism, our model integrates vision and language information more effectively, achieving higher accuracy with fewer parameters than the baseline and competitive performance with SOTA methods on standard benchmarks.
ISBN (print): 9783031790287; 9783031790294
Speaker diarization, the task of automatically identifying different speakers in audio and video, is frequently performed using probabilistic models and deep learning techniques. However, existing methods usually rely on direct analysis of the audio signal, which presents challenges for languages that lack established diarization methodologies, such as Portuguese. In this article, we propose a new approach to speaker diarization that leverages generative models for automatic speaker identification in Portuguese. We employed two generative models: one for refining the transcribed audio and another for performing the diarization task, as well as a model for initially transcribing the audio. Our method simplifies the diarization process by capturing and analyzing speaker style patterns from transcribed audio and achieves high accuracy without depending on direct signal analysis. This approach not only increases the effectiveness of speaker identification but also extends the usefulness of generative models to new domains. It opens a new perspective for diarization research, especially for the development of accurate systems for under-researched languages in audio and video applications.
ISBN (digital): 9798350390155
ISBN (print): 9798350390162
In their natural form, convolutional neural networks (CNNs) lack interpretability despite their effectiveness in visual categorization. Concept activation vectors (CAVs) offer human-interpretable quantitative explainability, utilizing feature maps from intermediate layers of CNNs. Current concept-based explainability methods assess explainer faithfulness primarily through Fidelity. However, relying solely on this metric has limitations. This study extends the Invertible Concept-based Explainer (ICE) with a new measure of concept consistency. We propose the CoherentICE explainability framework for CNNs, expanding beyond Fidelity. Our analysis, for the first time, highlights that Coherence provides a more reliable faithfulness evaluation for CNNs, supported by empirical validations. Our findings emphasize that accurate concepts are meaningful only when they are consistently accurate, and that consistency improves at deeper CNN layers.
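For readers unfamiliar with CAVs, the basic idea (not the ICE/CoherentICE method itself) can be sketched in a few lines: a concept direction in activation space is estimated from examples with and without the concept, here as a normalized difference of mean activations. The 4-dimensional "activations" below are invented toy data.

```python
import math

# Toy CAV: direction in activation space separating concept examples
# ("striped") from non-concept examples ("random"). Real CAVs are fit
# on intermediate-layer feature maps, often with a linear classifier.

striped = [[0.9, 0.8, 0.1, 0.0], [1.0, 0.7, 0.2, 0.1]]   # concept examples
random_ = [[0.1, 0.2, 0.8, 0.9], [0.0, 0.3, 0.7, 1.0]]   # non-concept examples

def mean(rows):
    return [sum(col) / len(rows) for col in zip(*rows)]

def cav(pos, neg):
    # Normalized difference of class means as the concept direction.
    diff = [p - n for p, n in zip(mean(pos), mean(neg))]
    norm = math.sqrt(sum(d * d for d in diff))
    return [d / norm for d in diff]

def concept_score(activation, direction):
    # Projection of an activation onto the concept direction.
    return sum(a * d for a, d in zip(activation, direction))

v = cav(striped, random_)
print(concept_score([0.95, 0.75, 0.15, 0.05], v) >
      concept_score([0.05, 0.25, 0.75, 0.95], v))  # True
```

Metrics like Fidelity ask whether explanations built from such directions reproduce the model's predictions; the Coherence measure proposed here additionally asks whether a concept scores consistently across the examples that contain it.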
ISBN (digital): 9798350361537
ISBN (print): 9798350361544
In the rapidly evolving domain of natural language processing (NLP), the efficiency of Large Language Models (LLMs) in generating abstractive text summaries plays a pivotal role in information synthesis. This study advances the understanding of LLM performance by conducting a comprehensive evaluation of seven cutting-edge models on four distinct datasets. The models selected for comparison include Distilbart-cnn-12-6, Led-base-16384, Google’s Bigbird, Microsoft’s ProphetNet, Facebook’s BART, T5 fine-tuned, and Google’s PEGASUS. Each model’s summarization prowess is rigorously assessed using a battery of metrics: ROUGE, METEOR, BERTScore, Cosine Similarity, and BLEU. The goal is to discern the intricate relationship between dataset characteristics and model efficacy, delivering insights into the inherent advantages and limitations of each model in handling specific data contexts. The results contribute to a refined understanding of LLM applicability, offering empirical evidence to aid in the selection of the most suitable model for varied summarization needs.
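To make one of the cited metrics concrete, here is a minimal ROUGE-1 F-score (unigram overlap between candidate and reference). Production evaluations use library implementations with stemming and tokenization rules omitted here; the example texts are invented.

```python
from collections import Counter

def rouge1_f(candidate, reference):
    """ROUGE-1 F1: harmonic mean of unigram precision and recall."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # clipped unigram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

reference = "the model summarizes long documents"
candidate = "the model summarizes documents"
print(round(rouge1_f(candidate, reference), 3))  # 0.889
```

Here all four candidate words appear in the reference (precision 1.0), but "long" is missed (recall 0.8), giving F1 = 2·(1.0·0.8)/(1.0+0.8) ≈ 0.889. ROUGE-2 and ROUGE-L extend the same idea to bigrams and longest common subsequences.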
Recently, many works have tried to probe the knowledge in pre-trained language models (PLMs). Most probing works use data from knowledge bases to create a "fill-in-the-blank" task form and probe entity k...