Search results - Inner Mongolia University Library

Improved Script Identification Algorithm Using Unicode-Based Regular Expression Matching Strategy

DATA, 2025, vol. 10, no. 4, p. 43

Authors: Qasim, Mamtimin; Silamu, Wushour. Affiliations: Guangzhou Coll Commerce, Sch Informat Technol & Engn, Guangzhou 511363, Peoples R China; Xinjiang Univ, Sch Comp Sci & Technol, Urumqi 830046, Peoples R China; Key Multilingual Lab Xinjiang, Urumqi 830046, Peoples R China

While script identification is the first step in many natural language processing and text mining tasks, there is at present no open-source script identification algorithm for text. For this reason, we analyze the Unicode encoding of each type of script and construct regular expressions in this study, in order to design an improved script identification algorithm. Because some scripts share common characters, it is impossible to count and summarize them. As a result, some extracted scripts are incomplete, which affects subsequent text processing tasks; furthermore, if a new script identification feature is required, the regular expression for each script must be readjusted. To improve the performance and scalability of script identification, we analyze the encoding range of each script provided on the official Unicode website and identify the shared characters, allowing us to design an improved script identification algorithm. Using this approach, we can fully consider all 169 Unicode script types. The proposed method is scalable and does not require numbers, punctuation marks, or other symbols to be filtered during script identification; furthermore, these items in the text are also included in the script identification results, thus ensuring the integrity of the provided information. The experimental results show that the proposed algorithm performs almost as well as our previous script identification algorithm while providing improvements on its basis.

Keywords: script; script identification; Unicode; language identification
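
The idea in the abstract above, matching runs of text against per-script regular expressions while keeping shared characters such as digits and punctuation in the result, can be sketched as follows. This is a minimal illustration and not the authors' algorithm: the four-entry `SCRIPT_RANGES` table, the `SHARED` character set, and the function names are assumptions made for the example, and the ranges are rough block approximations rather than the full Unicode script data.

```python
import re

# Hypothetical, abbreviated table. Each entry approximates a script by a few
# block ranges; a complete table would be generated from the Unicode
# Character Database (Scripts.txt) rather than written by hand.
SCRIPT_RANGES = {
    "Latin": "A-Za-z\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u024F",
    "Cyrillic": "\u0400-\u04FF",
    "Arabic": "\u0600-\u06FF",
    "Han": "\u4E00-\u9FFF",
}

# Characters shared by many scripts (an assumed set for this sketch).
SHARED = r"0-9\s.,;:!?()\-"


def build_pattern():
    """Build one alternation with a named group per script."""
    parts = []
    for name, ranges in SCRIPT_RANGES.items():
        # A run starts with a character of the script and then absorbs
        # further characters of that script or shared characters.
        parts.append(f"(?P<{name}>[{ranges}][{ranges}{SHARED}]*)")
    return re.compile("|".join(parts))


PATTERN = build_pattern()


def identify_scripts(text):
    """Return (script, run) pairs in order of appearance."""
    # Surrounding whitespace is stripped from each run; shared characters
    # that precede the first script character are skipped in this sketch.
    return [(m.lastgroup, m.group().strip()) for m in PATTERN.finditer(text)]


print(identify_scripts("Hello, мир 2025! 你好"))
```

Adding a script in this sketch means adding one entry to the table; the existing expressions do not need to be touched, which is the kind of scalability the abstract describes.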


Unicode-based method for text steganography with Malayalam text

JOURNAL OF INTELLIGENT & FUZZY SYSTEMS, 2015, vol. 28, no. 4, pp. 1591-1600

Authors: Vidhya, P. M.; Paul, Varghese. Affiliations: MG Univ, Sch Comp Sci, Kottayam 686560, Kerala, India; Cochin Univ Sci & Technol, Dept IT, Cochin 682016, Kerala, India

Recent research on information hiding concentrates mostly on linguistic steganography. Steganography is the art and science of hiding a message inside another message without arousing suspicion, so that the message can only be detected by its intended recipient. Nowadays, attention is focused on local-language, encryption-based steganography to provide high security for sharing secret information. In this paper, a steganography method is proposed for an Indian local language, Malayalam. The proposed method consists of a custom Unicode-based technique with embedding based on indexing, i.e. the original message is encoded to a Malayalam text with custom Unicode values generated for the Malayalam text. After that, an embedding algorithm is designed to mix the encoded original message with the Malayalam text. An experimental study was done to evaluate the efficiency of the proposed approach. A comparison of the proposed method against an existing method revealed that the proposed steganography method is more precise in the encoding process and balanced in the decoding process. The proposed method achieved a precision rate of 0.95 and a decoding rate of 0.81.

Keywords: Information hiding; text steganography; Unicode; index


Unicode search of dirty data, or: How I learned to stop worrying and love Unicode Technical Standard #18

13th Annual DFRWS Conference

Authors: Stewart, Jon; Uckelman, Joel. Affiliation: Lightbox Technol Inc, Arlington, VA 22209, USA

This paper discusses problems arising in digital forensics with regard to Unicode, character encodings, and search. It describes how multipattern search can handle the different text encodings encountered in digital forensics and a number of issues pertaining to proper handling of Unicode in search patterns. Finally, we demonstrate the feasibility of the approach and discuss the integration of our developed search engine, lightgrep, with the popular bulk_extractor tool. (C) 2013 Joel Uckelman and Jon Stewart. Published by Elsevier Ltd. All rights reserved.

Keywords: Unicode; Regular expression; Regex; Search; Digital forensics
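
The core difficulty the abstract above names is that the same search term has a different byte sequence in each text encoding found in raw forensic data. A minimal sketch of that idea follows. It is not lightgrep's implementation: the encoding list and the function names are assumptions for illustration, and the sketch scans the data once per encoding, whereas a multipattern engine would match all byte patterns together.

```python
# Assumed list of encodings to try; a real investigation would choose its own.
ENCODINGS = ("utf-8", "utf-16-le", "utf-16-be", "latin-1")


def encoded_patterns(term, encodings=ENCODINGS):
    """Map each encoding to the byte sequence the term has in that encoding."""
    patterns = {}
    for enc in encodings:
        try:
            patterns[enc] = term.encode(enc)
        except UnicodeEncodeError:
            # The term cannot be represented in this encoding; skip it.
            continue
    return patterns


def search(data, term):
    """Return sorted (offset, encoding) pairs for every occurrence of the term."""
    hits = []
    for enc, needle in encoded_patterns(term).items():
        start = data.find(needle)
        while start != -1:
            hits.append((start, enc))
            start = data.find(needle, start + 1)
    return sorted(hits)


# A small buffer holding the same word in two different encodings.
blob = b"xx" + "caf\u00e9".encode("utf-16-le") + b"--" + "caf\u00e9".encode("utf-8")
print(search(blob, "caf\u00e9"))
```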


Transcoding Unicode characters with AVX-512 instructions

SOFTWARE-PRACTICE & EXPERIENCE, 2023, vol. 53, no. 12, pp. 2430-2462

Authors: Clausecker, Robert; Lemire, Daniel. Affiliations: Zuse Inst Berlin, Berlin, Germany; Univ Quebec TELUQ, DOT Lab Res Ctr, Montreal, PQ H2S 3L5, Canada

Intel includes in its recent processors a powerful set of instructions capable of processing 512-bit registers with a single instruction (AVX-512). Some of these instructions have no equivalent in earlier instruction sets. We leverage these instructions to efficiently transcode strings between the most common formats: UTF-8 and UTF-16. With our novel algorithms, we are often twice as fast as the previous best solutions. For example, we transcode Chinese text from UTF-8 to UTF-16 at more than 5 GiB/s using fewer than 2 CPU instructions per character. To ensure reproducibility, we make our software freely available as an open-source library. Our library is part of the popular *** JavaScript runtime.

Keywords: character encoding; text processing; Unicode; vectorization


Unicode For Hiding Information In A Text Document

14th IEEE International Conference on Application of Information and Communication Technologies (AICT)

Authors: Zaynalov, N. R.; Mavlonov, O. N.; Muhamadiev, A. N.; Dusmurod, Qilichev; Rahmatullaev, I. R. Affiliation: Tashkent Univ Informat Technol, Informat Secur Dept, Samarkand Branch, Samarkand, Uzbekistan

ISBN: (print) 9781728173863

Steganography is a unique approach for developing tools and methods to hide the fact of transmitting a secret message. The first traces of steganographic methods are lost in ancient times. From detective works, various methods of secret writing between the lines of ordinary text are well known: from milk to complex chemical reagents with subsequent processing. Digital steganography is based on hiding or embedding additional information in digital objects, while causing some distortion of these objects. In this case, text, images, audio, video, network packets, and so on can be used as objects or containers. To embed a secret message, steganographic methods rely on redundant container information or properties that the human perception system cannot distinguish. Recently, there has been a lot of research in the field of hiding information in a text container, since text documents are used in many organizations. Based on this, the MS Word document is considered here as a data carrier; it has various parameters, and changing these parameters can achieve data integration. In the same article, we present steganography using invisible Unicode characters of the Space type, but with different encoding. A combined approach for encoding Latin characters is proposed for the effectiveness of the method.

Keywords: Unicode; Unicode characters; steganography; Digital steganography; steganographic methods; hiding information; MS Word document
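
The abstract above mentions hiding data in space characters that look alike but have different code points. One way to picture this is the sketch below. It is an illustration under stated assumptions and not the authors' scheme: the choice of U+0020 for bit 0 and U+00A0 (NO-BREAK SPACE) for bit 1, and the function names, are hypothetical, and the cover text is assumed to contain no U+00A0 of its own.

```python
# Assumed mapping for this sketch: ordinary space carries 0, no-break space carries 1.
ZERO, ONE = "\u0020", "\u00A0"


def embed(cover, bits):
    """Replace the first len(bits) spaces of the cover text according to the bit string."""
    out = []
    i = 0
    for ch in cover:
        if ch == ZERO and i < len(bits):
            out.append(ONE if bits[i] == "1" else ZERO)
            i += 1
        else:
            out.append(ch)
    if i < len(bits):
        raise ValueError("cover text has too few spaces for the payload")
    return "".join(out)


def extract(stego, n_bits):
    """Read the first n_bits space-type characters back as a bit string."""
    bits = ["1" if ch == ONE else "0" for ch in stego if ch in (ZERO, ONE)]
    return "".join(bits[:n_bits])


stego = embed("a b c d e", "1011")
print(extract(stego, 4))
```

The capacity of this sketch is one bit per space in the cover text, which shows why text carriers hold far less than images or audio.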


Unicode at Gigabytes per Second

28th International Symposium on String Processing and Information Retrieval (SPIRE)

Authors: Lemire, Daniel. Affiliation: Univ Quebec TELUQ, DOT Lab Res Ctr, Montreal, PQ, Canada

ISBN: (print) 9783030866914; 9783030866921

We often represent text using Unicode formats (UTF-8 and UTF-16). The UTF-8 format is increasingly popular, especially on the web (XML, HTML, JSON, Rust, Go, Swift, Ruby). The UTF-16 format is most common in Java, .NET, and inside operating systems such as Windows. Software systems frequently have to convert text from one Unicode format to the other. While recent disks have bandwidths of 5 GiB/s or more, conventional approaches transcode non-ASCII text at a fraction of a gigabyte per second. We show that we can validate and transcode Unicode text at gigabytes per second on current systems (x64 and ARM) without sacrificing safety. Our open-source library can be ten times faster than the popular ICU library on non-ASCII strings and even faster on ASCII strings.

Keywords: Unicode; Vectorization; Internationalization


A secure and size efficient algorithm to enhance data hiding capacity and security of cover text by using Unicode

JOURNAL OF KING SAUD UNIVERSITY-COMPUTER AND INFORMATION SCIENCES, 2022, vol. 34, no. 5, pp. 2180-2191

Authors: Ditta, Allah; Azeem, Muhammad; Naseem, Shahid; Rana, Khurram Gulzar; Khan, Muhammad Adnan; Iqbal, Zafar. Affiliations: Univ Educ, Div Sci & Technol, Lahore, Pakistan; Univ Sialkot, Sialkot, Pakistan; Quaid I Azam Univ, Islamabad, Pakistan; Lahore Garrison Univ, Lahore, Pakistan; Bahria Univ Lahore, Lahore, Pakistan

With the advancement of technology, maximizing the data hiding capacity and security of cover objects has become a very challenging task for researchers, particularly for text carriers. A text carrier offers low hiding capacity but makes confidential information harder to detect. This demands novelty in data hiding algorithms. In this regard, a novel algorithm is proposed that uses steganography and cryptography together to enhance the capacity and security of confidential data. The recommended algorithm uses a linguistic steganography method to conceal data in an Arabic text carrier. In the described algorithm, identifying secret information in text files is hard because text has fewer redundant bits than image, audio, and video steganographic media. The current solution uses Unicode characters such as Zero-Width-Character (ZWC) and Zero-Width-Joiner (ZWJ) to hide the secret information. Before the confidential information is hidden, the secret data is encrypted using bit inversion, through which the algorithm achieves high security. It is observed from the simulation results that the proposed algorithm successfully achieved high cover medium capacity, security, and robustness. (C) 2020 The Authors. Published by Elsevier B.V. on behalf of King Saud University.

Keywords: Text Steganography; Data communication; Unicode; Arabic text; Cryptography; Data security
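
The two steps the abstract above names, inverting the bits of the secret and then hiding them as zero-width Unicode characters, can be sketched roughly as follows. This is an illustration and not the published algorithm: the mapping of bit 0 to U+200C (ZERO WIDTH NON-JOINER) and bit 1 to U+200D (ZERO WIDTH JOINER), the decision to append the hidden characters after the cover text, and the function names are all assumptions made for the example.

```python
# Assumed carriers for this sketch: ZWNJ encodes 0, ZWJ encodes 1.
ZW0, ZW1 = "\u200C", "\u200D"


def invert(data):
    """Flip every bit of every byte (applying it twice restores the input)."""
    return bytes(b ^ 0xFF for b in data)


def hide(cover, secret):
    """Append the bit-inverted secret to the cover text as zero-width characters."""
    bits = "".join(f"{b:08b}" for b in invert(secret))
    return cover + "".join(ZW1 if bit == "1" else ZW0 for bit in bits)


def reveal(stego):
    """Collect the zero-width characters, rebuild the bytes, and undo the inversion."""
    bits = "".join("1" if ch == ZW1 else "0" for ch in stego if ch in (ZW0, ZW1))
    raw = bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))
    return invert(raw)


stego = hide("hello", b"hi")
print(len(stego), reveal(stego))
```

Each secret byte costs eight zero-width characters here, so the visible text is unchanged while the string length grows by 8 characters per byte.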


Unicode search of dirty data, or: How I learned to stop worrying and love Unicode Technical Standard #18

DIGITAL INVESTIGATION, 2013, vol. 10, pp. S116-S125

Authors: Stewart, Jon; Uckelman, Joel. Affiliation: Lightbox Technol Inc, Arlington, VA 22209, USA

Keywords: Unicode; Regular expression; Regex; Search; Digital forensics


Transcoding billions of Unicode characters per second with SIMD instructions

SOFTWARE-PRACTICE & EXPERIENCE, 2022, vol. 52, no. 2, pp. 555-575

Authors: Lemire, Daniel; Mula, Wojciech. Affiliations: Univ Quebec TELUQ, DOT Lab Res Ctr, Montreal, PQ H2S 3L5, Canada; 0X80 Pl, Wroclaw, Poland

In software, text is often represented using Unicode formats (UTF-8 and UTF-16). We frequently have to convert text from one format to the other, a process called transcoding. Popular transcoding functions are slower than state-of-the-art disks and networks. These transcoding functions make little use of the single-instruction-multiple-data (SIMD) instructions available on commodity processors. By designing transcoding algorithms for SIMD instructions, we multiply the speed of transcoding on current systems (x64 and ARM). To ensure reproducibility, we make our software freely available as an open source library.

Keywords: character encoding; text processing; Unicode; vectorization
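
To make the term "transcoding" in the abstract above concrete, here is a scalar, one-character-at-a-time sketch of UTF-8 to UTF-16 conversion. It is the conventional kind of baseline that SIMD approaches aim to outperform, not the authors' algorithm. The function name is hypothetical, and the validation is deliberately partial: it checks lead and continuation bytes and the upper code point limit, but it does not reject overlong encodings or encoded surrogates.

```python
def utf8_to_utf16_units(data):
    """Convert UTF-8 bytes to a list of 16-bit UTF-16 code units."""
    units = []
    i = 0
    while i < len(data):
        b0 = data[i]
        # The lead byte determines the sequence length and the first payload bits.
        if b0 < 0x80:
            cp, n = b0, 1
        elif b0 >> 5 == 0b110:
            cp, n = b0 & 0x1F, 2
        elif b0 >> 4 == 0b1110:
            cp, n = b0 & 0x0F, 3
        elif b0 >> 3 == 0b11110:
            cp, n = b0 & 0x07, 4
        else:
            raise ValueError(f"invalid lead byte at offset {i}")
        tail = data[i + 1:i + n]
        if len(tail) != n - 1 or any(b >> 6 != 0b10 for b in tail):
            raise ValueError(f"truncated or malformed sequence at offset {i}")
        # Each continuation byte contributes six more bits.
        for b in tail:
            cp = (cp << 6) | (b & 0x3F)
        if cp > 0x10FFFF:
            raise ValueError(f"code point out of range at offset {i}")
        if cp < 0x10000:
            units.append(cp)
        else:
            # Code points above the Basic Multilingual Plane need a surrogate pair.
            cp -= 0x10000
            units.append(0xD800 | (cp >> 10))
            units.append(0xDC00 | (cp & 0x3FF))
        i += n
    return units


print([hex(u) for u in utf8_to_utf16_units("a\u00e9\u4f60\U0001F600".encode("utf-8"))])
```

The branch on every lead byte is what makes this style slow on non-ASCII text; vectorized designs process many bytes per instruction instead.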


ADAMAS: Interweaving Unicode and color to enhance CAPTCHA security

FUTURE GENERATION COMPUTER SYSTEMS-THE INTERNATIONAL JOURNAL OF ESCIENCE, 2016, vol. 55, pp. 289-310

Authors: Roshanbin, Narges; Miller, James. Affiliation: Univ Alberta, Dept Elect & Comp Engn, Edmonton, AB, Canada

We propose, implement and test a new CAPTCHA, called Adamas, which offers resistance against preprocessing and various forms of segmentation and recognition attacks. The multi-layered security approach employed in this CAPTCHA mainly comes from its use of Unicode as an input space, a virtual keyboard as the input device, homoglyphs and correlated usage of color in foreground and background as well as several layers of randomization that aim to minimize the formation of detectable patterns that can be exploited by machines. A user study conducted to measure the usability of Adamas indicates that its solving accuracy is comparable to major CAPTCHAs in use today and offers insights into factors that affect CAPTCHA usability. (C) 2014 Elsevier B.V. All rights reserved.

Keywords: CAPTCHA; Unicode; Color; Interactive CAPTCHAs; CAPTCHA security; CAPTCHA usability

