Existing open-source text language identification tools are not ideal for recognizing short texts in Chinese, Japanese, and Korean, and some of them cannot distinguish between Simplified and Traditional Chinese. To im...
ISBN: (print) 9781728188003
Visually similar characters, or homoglyphs, can be used to perform social engineering attacks or to evade spam and plagiarism detectors. It is thus important to understand the capabilities of an attacker to identify homoglyphs - particularly ones that have not been previously spotted - and leverage them in attacks. We investigate a deep-learning model using embedding learning, transfer learning, and augmentation to determine the visual similarity of characters and thereby identify potential homoglyphs. Our approach uniquely takes advantage of weak labels that arise from the fact that most characters are not homoglyphs. Our model drastically outperforms the Normalized Compression Distance approach on pairwise homoglyph identification, for which we achieve an average precision of 0.97. We also present the first attempt at clustering homoglyphs into sets of equivalence classes, which is more efficient than pairwise information for security practitioners who need to quickly look up homoglyphs or normalize confusable string encodings. To measure clustering performance, we propose a metric (mBIOU) building on the classic Intersection-Over-Union (IOU) metric. Our clustering method achieves 0.592 mBIOU, compared to 0.430 for the naive baseline. We also use our model to predict over 8,000 previously unknown homoglyphs, and find good early indications that many of these may be true positives. Source code and the list of predicted homoglyphs are available on GitHub: https://***/PerryXDeng/weaponizing_unicode
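To make the clustering metric more concrete, the following minimal sketch computes the classic Intersection-Over-Union (IOU) between two sets of characters. It does not implement mBIOU, whose definition is not given in the abstract; the function name `iou` and the example clusters are assumptions chosen purely for illustration.

```python
def iou(a: set, b: set) -> float:
    """Classic Intersection-Over-Union (Jaccard index) of two sets."""
    if not a and not b:
        return 1.0  # convention assumed here: two empty sets match perfectly
    return len(a & b) / len(a | b)


# Hypothetical clusters (not taken from the paper):
# Latin capital O, digit zero, Cyrillic capital O (U+041E)
reference_cluster = {"O", "0", "\u041e"}
# Latin capital O, Cyrillic capital O, Latin small o
predicted_cluster = {"O", "\u041e", "o"}

score = iou(reference_cluster, predicted_cluster)
print(score)
```

Here the two clusters share two characters out of four distinct ones, so the score is 2/4.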
This paper proposes a novel scheme in which the key k is generated from discrete logarithms (indices) with respect to a prime modulus p and a base value q, where q is an element of Z(p). The discrete logarithm values are substituted for k in the encryption equation. During decryption, the corresponding k values are used to recover the plaintext. The sender embeds the p and q values along with the encrypted message and transmits it. This obviates the need for sending the full-length key along with the encrypted message. The proposed method ensures higher security in transmission. The strength of the method lies in the difficulty of guessing the p and q values, the fact that the entire key need not be transmitted, and the fact that the full set of ASCII values of the Z(256) plane figures in the encryption process. The paper also discusses the difficulty of using brute force to discover the p and q values. As an extension of this work, the authors are exploring the possibility of using the full set of Unicode values instead of the restricted 8-bit ASCII set. (C) 2008 Elsevier B.V. All rights reserved.
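The index (discrete logarithm) values that the abstract refers to can be illustrated with a small brute-force sketch. It only tabulates the indices for a given p and q; the encryption equation itself is not stated in the abstract and is not reproduced here. The values p = 257 and q = 3, and the function name `index_table`, are assumptions chosen for illustration.

```python
def index_table(p: int, q: int) -> dict[int, int]:
    """Map each value q**k mod p to its smallest exponent k (its index)."""
    table: dict[int, int] = {}
    value = 1
    for k in range(p - 1):
        if value in table:
            break  # the powers of q started repeating: q is not a primitive root
        table[value] = k
        value = (value * q) % p
    return table


# Illustrative parameters (assumed, not from the paper): 3 is a primitive
# root modulo the prime 257, so every value 1..256 receives an index.
P, Q = 257, 3
indices = index_table(P, Q)

print(len(indices))
print(indices[9])  # 9 = 3**2 mod 257
```

With a primitive root as the base, the table covers all nonzero residues, so a receiver who knows p and q can rebuild the same table without being sent the key values.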
The majority of text is stored in UTF-8, which must be validated on ingestion. We present the lookup algorithm, which outperforms UTF-8 validation routines used in many libraries and languages by more than 10 times using commonly available single-instruction-multiple-data (SIMD) instructions. To ensure reproducibility, our work is freely available as open source software.
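For readers unfamiliar with what UTF-8 validation has to check, here is a minimal scalar sketch. It is not the SIMD lookup algorithm presented in the paper; it is a plain byte-by-byte validator, assuming the well-formed byte ranges from the Unicode Standard, and the function name `is_valid_utf8` is hypothetical.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Byte-by-byte UTF-8 validation (scalar sketch, not the SIMD method)."""
    i, n = 0, len(data)
    while i < n:
        b = data[i]
        if b < 0x80:  # ASCII: a single byte
            i += 1
            continue
        # For each lead byte: number of continuation bytes and the allowed
        # range of the FIRST continuation byte. The narrowed ranges reject
        # overlong forms, surrogates, and code points above U+10FFFF.
        if 0xC2 <= b <= 0xDF:
            need, lo, hi = 1, 0x80, 0xBF
        elif b == 0xE0:
            need, lo, hi = 2, 0xA0, 0xBF  # excludes overlong 3-byte forms
        elif b == 0xED:
            need, lo, hi = 2, 0x80, 0x9F  # excludes surrogates U+D800..U+DFFF
        elif 0xE1 <= b <= 0xEF:
            need, lo, hi = 2, 0x80, 0xBF
        elif b == 0xF0:
            need, lo, hi = 3, 0x90, 0xBF  # excludes overlong 4-byte forms
        elif 0xF1 <= b <= 0xF3:
            need, lo, hi = 3, 0x80, 0xBF
        elif b == 0xF4:
            need, lo, hi = 3, 0x80, 0x8F  # excludes values above U+10FFFF
        else:
            return False  # stray continuation byte, or 0xC0, 0xC1, 0xF5..0xFF
        if i + need >= n:
            return False  # sequence is truncated
        if not lo <= data[i + 1] <= hi:
            return False
        for j in range(i + 2, i + need + 1):
            if not 0x80 <= data[j] <= 0xBF:
                return False
        i += need + 1
    return True


print(is_valid_utf8("h\u00e9llo".encode("utf-8")))
print(is_valid_utf8(b"\xc0\x80"))  # overlong encoding of U+0000
```

A validator of this kind handles one byte at a time with data-dependent branches, which is the sort of work that SIMD instructions can instead apply to many bytes at once.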
Three Indic/Indo-Aryan languages (Bengali, Hindi, and Nepali) are explored here at the character level to identify similarities and dissimilarities. Sharing the same root, Sanskrit, the Indic languages have common characteristics, which gives computer and language scientists the opportunity to develop common Natural Language Processing (NLP) techniques or algorithms. With this in mind, we compare and analyze these three languages character by character. As an application of the hypothesis, we also developed a uniform sorting algorithm in two steps: first for the Bengali and Nepali languages only, then extended to Hindi in the second step. Our investigation, using more than 30,000 words from each language, suggests that the algorithm maintains total accuracy with respect to the ordering set by the language authorities of the respective languages, along with good efficiency.
Information systems for Arabic-Kazakh processing must handle the editing and display problems caused by four special vowels: , , and The current solution uses combinations of four alternative vowels (, , , and ) with the character to represent these four special vowels. However, this approach relies on deliberate spelling errors and can cause computer programs to be unable to semantically distinguish the alternative vowels from the original vowels. Moreover, this causes problems in Arabic-Kazakh text-processing applications such as text sorting, script conversion and speech synthesis. We propose a compromise method in which the four special vowels are represented by combinations of themselves with the character and the related editing and display problems are handled using an OpenType font. The relevant glyph layout features in the OpenType font format are compatible with the proposed compromise method. Results from the sorting and classification of 10,000 randomly selected common Arabic-Kazakh words demonstrate that the new method successfully avoids problems caused by letter replacement, including text sorting errors in 2843 of the tested words and ambiguities with the characters , , , and in 3960 of the words. (C) 2017 Elsevier B.V. All rights reserved.
ISBN: (print) 9781450343060
In this note the author describes an initiative to create a keyboard for Android mobile devices that can type characters for a West African language called Kaansa, spoken by perhaps 10,000 Kaan people in Burkina Faso. The Kaan community has only recently established a written orthography and begun formal literacy training for adults and youths. This note examines certain currently available mobile technologies to allow texting in Kaansa and considers future efforts to measure the impact of such technologies on the literacy rate among several demographics.