咨询与建议

看过本文的还看了

相关文献

该作者的其他文献

文献详情 >Stylometry-driven framework fo... 收藏

Stylometry-driven framework for Urdu intrinsic plagiarism detection: a comprehensive analysis using machine learning, deep learning, and large language models

作     者:Manzoor, Muhammad Faraz Farooq, Muhammad Shoaib Abid, Adnan 

作者机构:Department of Computer Science University of Management and Technology Lahore Pakistan Department of Data Science Faculty of Computing and Information Technology University of the Punjab Lahore Pakistan 

出 版 物:《Neural Computing and Applications》 (Neural Comput. Appl.)

年 卷 期:2025年第37卷第9期

页      面:6479-6513页

核心收录:

学科分类:0711[理学-系统科学] 08[工学] 0835[工学-软件工程] 0714[理学-统计学(可授理学、经济学学位)] 0701[理学-数学] 0812[工学-计算机科学与技术(可授工学、理学学位)] 

主  题:Deep learning 

摘      要:Detecting plagiarism in documents is a well-established task in natural language processing (NLP). Broadly, plagiarism detection is categorized into two types (1) intrinsic: to check the whole document or all the passages have been written by a single author;(2) extrinsic: where a suspicious document is compared with a given set of source documents to figure out sentences or phrases which appear in both documents. In the pursuit of advancing intrinsic plagiarism detection, this study addresses the critical challenge of intrinsic plagiarism detection in Urdu texts, a language with limited resources for comprehensive language models. Acknowledging the absence of sophisticated large language models (LLMs) tailored for Urdu language, this study explores the application of various machine learning, deep learning, and language models in a novel framework. A set of 43 stylometry features at six granularity levels was meticulously curated, capturing linguistic patterns indicative of plagiarism. The selected models include traditional machine learning approaches such as logistic regression, decision trees, SVM, KNN, Naive Bayes, gradient boosting and voting classifier, deep learning approaches: GRU, BiLSTM, CNN, LSTM, MLP, and large language models: BERT and GPT-2. This research systematically categorizes these features and evaluates their effectiveness, addressing the inherent challenges posed by the limited availability of Urdu-specific language models. Two distinct experiments were conducted to evaluate the impact of the proposed features on classification accuracy. In experiment one, the entire dataset was utilized for classification into intrinsic plagiarized and non-plagiarized documents. Experiment two categorized the dataset into three types based on topics: moral lessons, national celebrities, and national events. Both experiments are thoroughly evaluated through, a fivefold cross-validation analysis. The results show that the random forest classifier achieved an ex

读者评论 与其他读者分享你的观点

用户名:未登录
我的评分