Author affiliation: Univ Lahore, Dept CS & IT, Lahore, Pakistan
Publication: INTERNATIONAL JOURNAL OF DATA SCIENCE AND ANALYTICS (Int. J. Data Sci. Anal.)
Year/Volume/Issue: 2024
Pages: 1-15
Core indexing:
Subject classification: 08 [Engineering]; 0812 [Engineering: Computer Science and Technology (engineering or science degree may be conferred)]
Subjects: Multimodal dataset; Auto regex synthesis; Dataset for Roman Urdu auto regex synthesis; Roman Urdu dataset; Query to regex; Strings to regex
Abstract: Automatic regex synthesis is the generation of regular expressions from user-written natural language descriptions, example strings, or both. Every day, countless regex generation queries are posted on online Q&A platforms such as StackOverflow (https://***) and Quora (https://***). Existing automatic regex synthesis methods require carefully designed, multimodal datasets to perform well. Unfortunately, publicly available datasets, even for resource-rich languages like English, are often model-specific and incomplete, which can limit the efficiency and accuracy of regex synthesis methods. The problem is worse for resource-poor languages such as Standard Urdu and Roman Urdu. In this paper, we present a novel benchmark Roman Urdu dataset with 900 words and a novel Roman Urdu lexicon of 4225 words, both annotated and labeled, to address the unmet needs of regex synthesis methods. Equipping these methods with a well-constructed dataset can lead to more accurate regex generation.
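To make the term "multimodal" concrete, the sketch below shows one way a single dataset record could pair a natural language query with example strings and a target regex, and how such a record could be checked for internal consistency. It is an illustration only: the `RegexSample` class, its field names, and the sample values are hypothetical and are not taken from the paper or its dataset, and the description is written in English rather than Roman Urdu.

```python
import re
from dataclasses import dataclass, field


@dataclass
class RegexSample:
    """Hypothetical record layout; not the schema used in the paper."""
    description: str                                      # natural language query
    regex: str                                            # target regular expression
    positives: list[str] = field(default_factory=list)    # strings that must match
    negatives: list[str] = field(default_factory=list)    # strings that must not match

    def is_consistent(self) -> bool:
        """Return True if the regex accepts every positive and rejects every negative."""
        pattern = re.compile(self.regex)
        accepts_all = all(pattern.fullmatch(s) for s in self.positives)
        rejects_all = not any(pattern.fullmatch(s) for s in self.negatives)
        return accepts_all and rejects_all


# Illustrative values only (assumed, not from the dataset).
sample = RegexSample(
    description="strings made up of three or more digits",
    regex=r"[0-9]{3,}",
    positives=["123", "2024"],
    negatives=["12", "12a4"],
)

# A deliberately mislabeled record: the regex also accepts a negative example.
broken = RegexSample(
    description="strings made up of three or more digits",
    regex=r"[0-9]+",
    positives=["123"],
    negatives=["12"],
)
```

A check like `sample.is_consistent()` is one simple way to catch annotation errors before a record is used to train or evaluate a synthesis method; whole-string matching via `re.fullmatch` is a design choice here, and a dataset that describes substring matches would use `re.search` instead.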