While script identification is the first step in many natural language processing and text mining tasks, there is currently no open-source script identification algorithm for text. For this reason, in this study we analyze the Unicode encoding of each type of script and construct regular expressions in order to design an improved script identification algorithm. Because some scripts share common characters, it is impossible to count and summarize them. As a result, some extracted scripts are incomplete, which affects subsequent text processing tasks; furthermore, if a new script identification feature is required, the regular expression for each script must be readjusted. To improve the performance and scalability of script identification, we analyze the encoding range of each script provided on the official Unicode website and identify the shared characters, allowing us to design an improved script identification algorithm. Using this approach, we can fully consider all 169 Unicode script types. The proposed method is scalable and does not require numbers, punctuation marks, or other symbols to be filtered during script identification; furthermore, these items in the text are also included in the script identification results, thus ensuring the integrity of the provided information. The experimental results show that the proposed algorithm performs almost as well as our previous script identification algorithm while improving on it.
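The range-based idea described in this abstract can be illustrated with a minimal sketch. It is not the authors' algorithm: the `SCRIPT_RANGES` table is a small hand-picked subset of code point ranges (the full data lives in the Unicode `Scripts.txt` file), and the rule of attaching shared characters such as digits and punctuation to the neighbouring run is an assumption made here for illustration.

```python
# Hypothetical, deliberately tiny range table; the real Unicode data
# covers far more scripts and many more ranges per script.
SCRIPT_RANGES = [
    (0x0041, 0x005A, "Latin"),     # A-Z
    (0x0061, 0x007A, "Latin"),     # a-z
    (0x0410, 0x044F, "Cyrillic"),  # basic Russian alphabet
    (0x4E00, 0x9FFF, "Han"),       # CJK Unified Ideographs block
]


def script_of(ch):
    """Return the script name for one character, or 'Common' if unknown."""
    cp = ord(ch)
    for lo, hi, name in SCRIPT_RANGES:
        if lo <= cp <= hi:
            return name
    return "Common"  # digits, punctuation, spaces, anything not in the table


def split_by_script(text):
    """Split text into (script, substring) runs without dropping any character.

    Assumption: 'Common' characters join the run they follow, and a leading
    'Common' run adopts the script of the first scripted character after it.
    """
    runs = []
    for ch in text:
        script = script_of(ch)
        if runs and (script == "Common" or script == runs[-1][0]):
            runs[-1][1] += ch
        elif runs and runs[-1][0] == "Common":
            runs[-1][0] = script
            runs[-1][1] += ch
        else:
            runs.append([script, ch])
    return [(script, chunk) for script, chunk in runs]
```

Because no character is filtered out, concatenating the substrings of the returned runs reproduces the input text exactly, which mirrors the integrity property the abstract describes.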
Recent research on information hiding has concentrated mostly on linguistic steganography. Steganography is the art and science of hiding a message inside another message without arousing suspicion, so that the message can be detected only by its intended recipient. Nowadays, attention is focused on steganography based on local-language encryption to provide high security for sharing secret information. In this paper, a steganography method is proposed for an Indian local language, Malayalam. The proposed method consists of a custom Unicode-based technique with index-based embedding, i.e., the original message is encoded to a Malayalam text with custom Unicode values generated for the Malayalam text. An embedding algorithm is then designed to mix the encoded original message with the Malayalam text. An experimental study was done to evaluate the efficiency of the proposed approach. A comparison of the proposed method against an existing method revealed that the proposed steganography method is more precise in the encoding process and balanced in the decoding process. The proposed method achieved a precision rate of 0.95 and a decoding rate of 0.81.
This paper discusses problems arising in digital forensics with regard to Unicode, character encodings, and search. It describes how multipattern search can handle the different text encodings encountered in digital forensics and a number of issues pertaining to proper handling of Unicode in search patterns. Finally, we demonstrate the feasibility of the approach and discuss the integration of our developed search engine, lightgrep, with the popular bulk_extractor tool. (C) 2013 Joel Uckelman and Jon Stewart. Published by Elsevier Ltd. All rights reserved.
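To make the encoding problem concrete, the following sketch searches a raw byte buffer for one keyword under several text encodings. It is a naive illustration only: it scans once per encoding with `bytes.find`, whereas a multipattern engine such as lightgrep matches all patterns in a single pass. The function name `find_in_encodings` and the default encoding list are assumptions made here.

```python
def find_in_encodings(haystack, keyword,
                      encodings=("utf-8", "utf-16-le", "utf-16-be")):
    """Return sorted (offset, encoding) pairs for every hit of keyword.

    The keyword is encoded once per candidate encoding, and each resulting
    byte pattern is searched for in the raw data.
    """
    hits = []
    for enc in encodings:
        needle = keyword.encode(enc)
        start = 0
        while True:
            pos = haystack.find(needle, start)
            if pos == -1:
                break
            hits.append((pos, enc))
            start = pos + 1  # step by one byte so overlapping hits are found
    return sorted(hits)


# Usage: the same word stored twice, once as UTF-8 and once as UTF-16LE.
data = b"\x00\x01" + "naïve".encode("utf-8") + b"\xff" + "naïve".encode("utf-16-le")
print(find_in_encodings(data, "naïve"))
```

A search restricted to a single encoding would miss one of the two occurrences in `data`, which is the situation the abstract describes for forensic images that mix encodings.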
Intel includes in its recent processors a powerful set of instructions capable of processing 512-bit registers with a single instruction (AVX-512). Some of these instructions have no equivalent in earlier instruction sets. We leverage these instructions to efficiently transcode strings between the most common formats: UTF-8 and UTF-16. With our novel algorithms, we are often twice as fast as the previous best solutions. For example, we transcode Chinese text from UTF-8 to UTF-16 at more than 5 GiB/s using fewer than 2 CPU instructions per character. To ensure reproducibility, we make our software freely available as an open-source library. Our library is part of the popular *** JavaScript runtime.
ISBN (print): 9781728173863
Steganography is a unique approach for developing tools and methods to hide the fact that a secret message is being transmitted. The first traces of steganographic methods are lost in ancient times. From detective fiction, various methods of secret writing between the lines of ordinary text are well known, from milk to complex chemical reagents with subsequent processing. Digital steganography is based on hiding or embedding additional information in digital objects, while causing some distortion of these objects. Text, images, audio, video, network packets, and so on can be used as such objects, or containers. To embed a secret message, steganographic methods rely on redundant container information or on properties that the human perception system cannot distinguish. Recently, there has been a lot of research on hiding information in a text container, since text documents are used in many organizations. For this reason, the MS Word document is considered here as a data carrier; it has various parameters, and changing these parameters makes it possible to embed data. In the same article, we present steganography using invisible Unicode characters of the Space type, but with different encoding. A combined approach for encoding Latin characters is proposed to improve the effectiveness of the method.
ISBN (print): 9783030866914; 9783030866921
We often represent text using Unicode formats (UTF-8 and UTF-16). The UTF-8 format is increasingly popular, especially on the web (XML, HTML, JSON, Rust, Go, Swift, Ruby). The UTF-16 format is most common in Java, .NET, and inside operating systems such as Windows. Software systems frequently have to convert text from one Unicode format to the other. While recent disks have bandwidths of 5 GiB/s or more, conventional approaches transcode non-ASCII text at a fraction of a gigabyte per second. We show that we can validate and transcode Unicode text at gigabytes per second on current systems (x64 and ARM) without sacrificing safety. Our open-source library can be ten times faster than the popular ICU library on non-ASCII strings and even faster on ASCII strings.
With the advancement of technology, maximizing the data hiding capacity and security of cover objects has become a very challenging task for researchers, particularly for text carriers. Text carriers offer low hiding capacity but are more secure against the detection of confidential information. This demands novelty in data hiding algorithms. In this regard, a novel algorithm is proposed that uses steganography and cryptography together to enhance the capacity and security of confidential data. The recommended algorithm uses a linguistic steganography method to conceal data in an Arabic text carrier. In the described algorithm, the identification of secret information in text files is hard because text has fewer redundant bits than image, audio, and video steganographic media. The current solution uses Unicode characters such as the Zero-Width-Character (ZWC) and Zero-Width-Joiner (ZWJ) to hide the secret information. Before the confidential information is hidden, the secret data is encrypted using bit inversion, through which the algorithm achieves high security. The simulation results show that the proposed algorithm successfully achieves high cover medium capacity, security, and robustness. (C) 2020 The Authors. Published by Elsevier B.V. on behalf of King Saud University.
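The general mechanism of zero-width embedding with a bit inversion step can be sketched as follows. This is an illustration under stated assumptions, not the algorithm from the paper: the choice of U+200C (ZERO WIDTH NON-JOINER) for bit 0 and U+200D (ZERO WIDTH JOINER) for bit 1, the placement of the hidden characters at the end of the cover text, and the function names `hide` and `reveal` are all hypothetical. The sketch also assumes the cover text contains neither character, which real Arabic text may not satisfy.

```python
ZW0 = "\u200c"  # ZERO WIDTH NON-JOINER, stands for bit 0 (assumed mapping)
ZW1 = "\u200d"  # ZERO WIDTH JOINER, stands for bit 1 (assumed mapping)


def hide(cover, secret):
    """Append the bit-inverted secret to the cover text as invisible characters."""
    data = secret.encode("utf-8")
    # XOR with 0xFF flips every bit of each byte (the bit inversion step).
    bits = "".join(f"{byte ^ 0xFF:08b}" for byte in data)
    return cover + "".join(ZW1 if bit == "1" else ZW0 for bit in bits)


def reveal(stego):
    """Collect the zero-width characters, undo the inversion, decode the secret."""
    bits = "".join("1" if ch == ZW1 else "0"
                   for ch in stego if ch in (ZW0, ZW1))
    data = bytes(int(bits[i:i + 8], 2) ^ 0xFF for i in range(0, len(bits), 8))
    return data.decode("utf-8")


# Usage: the visible text is unchanged; 8 invisible characters are added per byte.
stego = hide("مرحبا", "key")
print(len(stego), reveal(stego))
```

Each secret byte costs eight invisible characters in this sketch, which shows why capacity is a central concern for text carriers.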
In software, text is often represented using Unicode formats (UTF-8 and UTF-16). We frequently have to convert text from one format to the other, a process called transcoding. Popular transcoding functions are slower than state-of-the-art disks and networks. These transcoding functions make little use of the single-instruction-multiple-data (SIMD) instructions available on commodity processors. By designing transcoding algorithms for SIMD instructions, we multiply the speed of transcoding on current systems (x64 and ARM). To ensure reproducibility, we make our software freely available as an open-source library.
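For readers unfamiliar with what transcoding involves, here is a scalar reference sketch of validating UTF-8 input and converting it to UTF-16 code units, one character at a time. It is the conventional byte-by-byte approach that SIMD algorithms aim to outperform, not the method of the paper; the function name `utf8_to_utf16_units` is an assumption made here.

```python
def utf8_to_utf16_units(data):
    """Validate UTF-8 bytes and return the equivalent list of UTF-16 code units."""
    units = []
    i, n = 0, len(data)
    while i < n:
        b0 = data[i]
        # The lead byte gives the sequence length and the smallest code point
        # that may legally use that length (this rejects overlong encodings).
        if b0 < 0x80:
            cp, length, min_cp = b0, 1, 0x0
        elif 0xC2 <= b0 <= 0xDF:
            cp, length, min_cp = b0 & 0x1F, 2, 0x80
        elif 0xE0 <= b0 <= 0xEF:
            cp, length, min_cp = b0 & 0x0F, 3, 0x800
        elif 0xF0 <= b0 <= 0xF4:
            cp, length, min_cp = b0 & 0x07, 4, 0x10000
        else:
            raise ValueError(f"invalid lead byte at offset {i}")
        if i + length > n:
            raise ValueError(f"truncated sequence at offset {i}")
        for b in data[i + 1:i + length]:
            if b & 0xC0 != 0x80:
                raise ValueError(f"invalid continuation byte after offset {i}")
            cp = (cp << 6) | (b & 0x3F)
        if cp < min_cp or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
            raise ValueError(f"invalid code point at offset {i}")
        if cp < 0x10000:
            units.append(cp)  # one 16-bit unit
        else:
            cp -= 0x10000     # supplementary plane: emit a surrogate pair
            units.append(0xD800 | (cp >> 10))
            units.append(0xDC00 | (cp & 0x3FF))
        i += length
    return units


# Usage: 1-, 2-, 3-, and 4-byte UTF-8 sequences in one string.
print([hex(u) for u in utf8_to_utf16_units("aé漢😀".encode("utf-8"))])
```

The data-dependent branch on every lead byte is what makes this style slow on non-ASCII text, and it is the part that vectorized transcoders are designed to avoid.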
We propose, implement and test a new CAPTCHA, called Adamas, which offers resistance against preprocessing and various forms of segmentation and recognition attacks. The multi-layered security approach employed in this CAPTCHA mainly comes from its use of Unicode as an input space, a virtual keyboard as the input device, homoglyphs, and correlated usage of color in foreground and background, as well as several layers of randomization that aim to minimize the formation of detectable patterns that can be exploited by machines. A user study conducted to measure the usability of Adamas indicates that its solving accuracy is comparable to major CAPTCHAs in use today and offers insights into factors that affect CAPTCHA usability. (C) 2014 Elsevier B.V. All rights reserved.