ISBN (digital): 9798350357509
ISBN (print): 9798350357516
Stemming is a vital preprocessing step in natural language processing (NLP) that reduces words to their root forms, enabling efficient text analysis. For Bangla, a morphologically rich language, stemming is particularly challenging due to complex noun and verb inflections. Existing methods often fail to handle compound words, irregular forms, and over- or under-stemming effectively. This paper introduces a hybrid Bangla stemming technique combining numerical mapping, Part-of-Speech (POS) tagging, rule-based methods and the Levenshtein Distance algorithm. Leveraging a 130,000-word dictionary and a multi-layer fallback mechanism, the proposed system improves accuracy and flexibility. Experimental results on 1,000 unique words show 84.6% accuracy and an 82% F1-score, outperforming other Bangla stemmers. The approach demonstrates significant potential for enhancing NLP tasks like text classification and information retrieval. Future research will address limitations such as named entity processing and dictionary expansion, aiming for even greater adaptability and efficiency.
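The Levenshtein Distance step named in the abstract can be illustrated with a minimal Python sketch. The abstract does not say how the distance is used inside the fallback mechanism, so the function names `levenshtein` and `nearest_root`, the `max_distance` threshold, and the toy English word list are all assumptions for illustration, not the paper's implementation.

```python
from typing import Iterable, Optional


def levenshtein(a: str, b: str) -> int:
    """Edit distance (insertions, deletions, substitutions) between a and b."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute, or match at no cost
            ))
        previous = current
    return previous[-1]


def nearest_root(word: str, roots: Iterable[str],
                 max_distance: int = 2) -> Optional[str]:
    """Hypothetical fallback: return the closest dictionary root, or None
    when no root lies within max_distance edits of the word."""
    candidates = list(roots)
    if not candidates:
        return None
    # Ties on distance are broken alphabetically so the result is stable.
    best = min(candidates, key=lambda root: (levenshtein(word, root), root))
    return best if levenshtein(word, best) <= max_distance else None


# Toy dictionary (an assumption; a real system would use Bangla roots).
toy_roots = ["walk", "talk", "wall"]
print(nearest_root("walking", toy_roots, max_distance=3))
```

A threshold of this kind trades recall for precision: a small `max_distance` avoids mapping a word to an unrelated root, at the cost of leaving more words unstemmed.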
ISBN (digital): 9798350363708
ISBN (print): 9798350363715
Large Language Models (LLMs) have demonstrated outstanding performance in natural language processing over the past few years. The adoption of LLMs across digital applications in various walks of life is growing exponentially. However, the mechanisms by which LLMs, like all types of artificial neural networks, yield their results remain opaque. This lack of transparency in decision-making poses risks for the further use of the technology. This paper presents a concise review of approaches to obtaining explanations of LLM outcomes. It examines the principal methods of explainable artificial intelligence that are applied to generate explanations for LLM predictions. An approach for classifying local explanation methods is proposed.
ISBN (print): 9791095546726
Securing sufficient data to enable automatic sign language translation modeling is challenging. The data insufficiency issue exists in both video and text modalities; however, fewer studies have been performed on text data augmentation compared to video data. In this study, we present three methods of augmenting sign language text modality data, comprising 3,052 Gloss-level Korean Sign Language (GKSL) and Word-level Korean Language (WKL) sentence pairs. Using each of the three methods, the following numbers of sentence pairs were created: blank replacement 10,654, sentence paraphrasing 1,494, and synonym replacement 899. Translation experiment results using the augmented data showed that when translating from GKSL to WKL and from WKL to GKSL, Bilingual Evaluation Understudy (BLEU) scores improved by 0.204 and 0.170 respectively, compared to when only the original data was used. The three contributions of this study are as follows. First, we demonstrate that three different augmentation techniques used in existing natural language processing (NLP) can be applied to sign language. Second, we propose an automatic data augmentation method which generates quality data by utilizing the Korean sign language gloss dictionary. Lastly, we publish the Gloss-level Korean Sign Language 13k dataset (GKSL13k), whose data quality has been verified through expert reviews.
ISBN (digital): 9781665457279
ISBN (print): 9781665457279
In this work, a system is created to suggest product/service codes and industrial classification codes for the Thai language. The system can suggest UNSPSC and TSIC codes relevant to query terms via indexing search. Techniques used in this work are based on knowledge of text processing and text similarity, as well as indexing. Through a complexity analysis, the system has been shown to be efficient, as it can retrieve data about 1,000 times faster than traditional methods. Furthermore, Mean Reciprocal Rank (MRR) was employed to evaluate the search results of 1,000 products and services. The results showed that the proposed system achieved an MRR of 0.46, indicating that the relevant search result appears approximately at the second or third rank. Currently, the proposed system has been implemented as a part of the SME registration process on the OSMEP website to help Thai SMEs access government procurement.
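To make the MRR figure concrete, the following Python sketch computes Mean Reciprocal Rank from the rank of the first relevant result per query. The function name `mean_reciprocal_rank`, the convention that a query with no relevant result contributes zero, and the example ranks are assumptions for illustration; the abstract does not describe the evaluation code.

```python
from typing import Optional, Sequence


def mean_reciprocal_rank(first_relevant_ranks: Sequence[Optional[int]]) -> float:
    """Average of 1/rank over all queries.

    Each entry is the 1-based rank of the first relevant result for one
    query, or None when no relevant result was returned (scored as 0).
    """
    if not first_relevant_ranks:
        raise ValueError("at least one query is required")
    total = sum(1.0 / rank for rank in first_relevant_ranks if rank is not None)
    return total / len(first_relevant_ranks)


# Hypothetical ranks for four queries; the last query found nothing relevant.
example_ranks = [1, 2, 3, None]
print(round(mean_reciprocal_rank(example_ranks), 3))
```

The reciprocal of an MRR of 0.46 is about 2.17, which is why the score is read as the relevant result sitting around the second or third position. Strictly, that reciprocal is the harmonic mean of the ranks rather than their arithmetic mean, so it is only a rough guide.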
ISBN (print): 9783031105364; 9783031105357
Question answering, that is, finding the most appropriate answer to a question given by the user as input, is among the important tasks of natural language processing. Many studies have been conducted on question answering, and datasets and methods have been published. The aim of this article is to survey the studies done in question answering and to identify the missing research topics. This literature review seeks to determine the datasets, methods and frameworks used for question answering between 2000 and 2022. From the articles published in this period, 91 papers were selected based on inclusion and exclusion criteria. This systematic literature review comprises components such as research questions, search strategy, inclusion and exclusion criteria, and data extraction. We see that the selected studies focus on four topics: natural language processing, information retrieval, knowledge base, and hybrid approaches.
ISBN (print): 9798350301298
Video-text retrieval is an emerging stream in both the computer vision and natural language processing communities, which aims to find relevant videos given text queries. In this paper, we study a notoriously challenging task, Unsupervised Domain Adaptation Video-text Retrieval (UDAVR), wherein training and testing data come from different distributions. Previous works merely alleviate the domain shift but overlook the pairwise misalignment issue in the target domain, i.e., there exist no semantic relationships between target videos and texts. To tackle this, we propose a novel method named Dual Alignment Domain Adaptation (DADA). Specifically, we first introduce a cross-modal semantic embedding to generate discriminative source features in a joint embedding space. In addition, we utilize video and text domain adaptations to smoothly balance the minimization of the domain shifts. To tackle the pairwise misalignment in the target domain, we propose Dual Alignment Consistency (DAC) to fully exploit the semantic information of both modalities in the target domain. The proposed DAC adaptively aligns the video-text pairs that are more likely to be relevant in the target domain, so that positive pairs increase progressively and the noisy ones can potentially be aligned in later stages. As a result, our method can generate more truly aligned target pairs and ensure the discriminability of target features. Compared with state-of-the-art methods, DADA achieves 20.18% and 18.61% relative improvements on R@1 under the settings of TGIF -> MSR-VTT and TGIF -> MSVD respectively, demonstrating the superiority of our method.
Continual learning (CL) with Vision-language Models (VLMs) has overcome the constraints of traditional CL, which only focuses on previously encountered classes. During the CL of VLMs, we need not only to prevent the c...
ISBN (digital): 9798331534103
ISBN (print): 9798331534110
Passwords are the most widely used authentication method and play a crucial role in the field of information security. In this study, we explore the effectiveness of applying machine learning (ML) and natural language processing (NLP) techniques to password classification. We compare the performance of classifiers built from eight ML techniques and four NLP techniques to classify user-created passwords. The experimental results show that the classifier using a combination of Bag-of-Words and Logistic Regression outperforms the other classifiers, achieving an accuracy of 98.53% and a recall for weak passwords of 99.68%.
Improving the reasoning capabilities of large language models (LLMs) has attracted considerable interest. Recent approaches primarily focus on improving the reasoning process to yield a more precise final answer. Howe...
Fine-tuning is a crucial process for adapting large language models (LLMs) to diverse applications. In certain scenarios, such as multi-tenant serving, deploying multiple LLMs becomes necessary to meet complex demands...