Vowels in Arabic are optional orthographic symbols written as diacritics above or below letters. In Arabic texts, typically more than 97 percent of written words do not explicitly show any of the vowels they contain; that is to say, depending on the author, genre and field, less than 3 percent of words include any explicit vowel. Although numerous studies have been published on the issue of restoring the omitted vowels in speech technologies, little attention has been given to this problem in papers dedicated to written Arabic technologies. In this research, we present Arabic-Unitex, an Arabic Language Resource, with emphasis on vowel representation and encoding. Specifically, we present two dozen rules formalizing a detailed description of vowel omission in written text. They are typographical rules integrated into large-coverage resources for morphological annotation. For restoring vowels, our resources are capable of identifying words in which the vowels are not shown, as well as words in which the vowels are partially or fully included. By taking these rules into account, our resources are able to compute and restore, for each word form, a list of compatible fully vowelized candidates through omission-tolerant dictionary lookup. In our previous studies, we have proposed a straightforward encoding of taxonomy for verbs (Neme in Proceedings of the International Workshop on Lexical Resources (WoLeR) at ESSLLI, 2011) and broken plurals (Neme and Laporte in Lang Sci, 2013). While traditional morphology is based on derivational rules, our description is based on inflectional ones. The breakthrough lies in the reversal of the traditional root-and-pattern Semitic model into pattern-and-root, giving precedence to patterns over roots. The lexicon is built and updated manually and contains 76,000 fully vowelized lemmas. It is then inflected by means of finite-state transducers (FSTs), generating 6 million forms. The coverage of these inflected forms is extended by forma...
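The omission-tolerant lookup described above can be illustrated with a small sketch. The code below is a hypothetical toy on Latin transliterations, not the Arabic-Unitex FST resources themselves: it strips a made-up vowel set to index a lexicon by consonantal skeleton, then filters candidates against whatever diacritics the input does show.

```python
# Minimal sketch of omission-tolerant dictionary lookup (not the actual
# Arabic-Unitex FST implementation). Vowel diacritics are stripped from each
# fully vowelized lemma to build an index from consonantal skeletons to
# candidate vowelizations; a partially vowelized input matches a candidate
# if its visible vowels agree with that candidate.

VOWELS = set("aui")  # hypothetical stand-in for the Arabic short-vowel diacritics

def skeleton(word):
    """Remove all vowel symbols, keeping only the consonantal skeleton."""
    return "".join(c for c in word if c not in VOWELS)

def compatible(partial, full):
    """True if `partial` can be obtained from `full` by omitting vowels."""
    it = iter(full)
    for c in partial:
        for f in it:
            if f == c:
                break
            if f not in VOWELS:  # skipped a consonant -> mismatch
                return False
        else:
            return False         # `full` exhausted before matching `c`
    return all(f in VOWELS for f in it)  # leftover chars must all be vowels

def build_index(lexicon):
    index = {}
    for lemma in lexicon:
        index.setdefault(skeleton(lemma), []).append(lemma)
    return index

def lookup(index, word):
    """Return fully vowelized candidates compatible with `word`."""
    return [c for c in index.get(skeleton(word), []) if compatible(word, c)]

lexicon = ["kataba", "kutiba", "kutub"]  # 'he wrote', 'was written', 'books'
idx = build_index(lexicon)
print(lookup(idx, "ktb"))    # unvowelized input: all three candidates match
print(lookup(idx, "kutb"))   # partial vowels narrow the candidate list
```

A lookup never fails just because the writer omitted vowels, yet every vowel the writer did include constrains the result, mirroring the behavior the resources are designed to support.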
Neural interfaces of the future will be used to help restore lost sensory, motor, and other capabilities. However, realizing this futuristic promise requires a major leap forward in how electronic devices interface with the nervous system. Next generation neural interfaces must support parallel recording from tens of thousands of electrodes within the form factor and power budget of a fully implanted device, posing a number of significant engineering challenges. In this paper, we exploit sparsity and diversity of neural signals to achieve simultaneous data compression and channel multiplexing for neural recordings. The architecture uses wired-OR interactions within an array of single-slope A/D converters to obtain massively parallel digitization of neural action potentials. The achieved compression is lossy but effective at retaining the critical samples belonging to action potentials, enabling efficient spike sorting and cell type identification. Simulation results of the architecture using data obtained from primate retina ex vivo with a 512-channel electrode array show average compression rates up to ~40 while missing less than 5% of cells. In principle, the techniques presented here could be used to design interfaces to other parts of the nervous system.
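The compression idea, discarding everything but the samples belonging to action potentials, can be caricatured in software. The sketch below is only an illustration of spike-sample retention on synthetic data; the actual architecture performs this in the mixed-signal domain via wired-OR single-slope converters, which this toy does not model.

```python
# Toy illustration of lossy, spike-preserving compression: keep only samples
# whose amplitude exceeds a threshold (putative action-potential samples) and
# store them as (index, value) pairs. A software caricature of the idea, not
# the wired-OR single-slope ADC architecture from the paper.
import random

random.seed(0)
n = 1000
# Synthetic trace: low-amplitude noise with a few large "spikes".
trace = [random.gauss(0, 1) for _ in range(n)]
for t in (100, 400, 700):
    for k in range(5):
        trace[t + k] += 20.0  # inject an action-potential-like deflection

THRESH = 10.0
kept = [(i, v) for i, v in enumerate(trace) if abs(v) > THRESH]

# Each kept sample costs an (index, value) pair, i.e. ~2 words vs 1 word raw.
compression_rate = n / (2 * len(kept))
print(f"kept {len(kept)} of {n} samples, rate ~{compression_rate:.0f}x")

# Every injected spike sample survives compression.
spike_indices = {t + k for t in (100, 400, 700) for k in range(5)}
assert spike_indices <= {i for i, _ in kept}
```

Because spikes are sparse in time, almost all raw samples carry no spike information, which is why rates of this magnitude are achievable while the spike waveforms stay intact.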
A decimal notation satisfies many simple mathematical properties, and it is a useful tool in the analysis of trees. A practical method is presented that compresses the decimal codes while maintaining the fast determination of relations (e.g., ancestor, descendant, brother, etc.). A special node, called a kernel node, including many common subcodes of the other codes, is defined, and a compact data structure using the kernel nodes is presented. Let n (m) be the number of the total (kernel) nodes. It is theoretically proved that encoding a decimal code takes constant time, that the worst-case time complexity of compressing the decimal codes is O(n + m^2), and that the size of the data structure is proportional to m. From experimental results on some hierarchical semantic primitives for natural language processing, it is shown that the ratio m/n becomes extremely small, ranging from 0.047 to 0.13.
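The relation tests that decimal codes make fast can be sketched directly. Below is a generic illustration of Dewey-style decimal codes, not the paper's compressed kernel-node structure: ancestor/descendant reduces to a delimited prefix test, and siblinghood to comparing parent codes.

```python
# Sketch of decimal (Dewey) codes for tree nodes and cheap relation tests.
# A node's code is the path of child indices from the root, so
# ancestor/descendant checks reduce to a delimited prefix test.
def code(path):
    """Build a decimal code from a list of child indices, e.g. [1,2,3] -> '1.2.3'."""
    return ".".join(str(i) for i in path)

def is_ancestor(a, b):
    """True if the node coded `a` is a proper ancestor of the node coded `b`."""
    return b.startswith(a + ".")   # the trailing '.' avoids false prefixes like '1.2' vs '1.25'

def is_sibling(a, b):
    """True if `a` and `b` are distinct children of the same (non-root) parent."""
    return a != b and a.rsplit(".", 1)[0] == b.rsplit(".", 1)[0]

print(is_ancestor(code([1, 2]), code([1, 2, 3])))       # True
print(is_ancestor(code([1, 2]), code([1, 25, 3])))      # False
print(is_sibling(code([1, 2, 3]), code([1, 2, 4])))     # True
```

The paper's contribution is orthogonal to this sketch: it compresses the codes themselves (via kernel nodes holding common subcodes) while preserving exactly these constant-style relation tests.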
The article is devoted to developing a compression method using context modeling of a sequence of bits and wavelet transform, which makes it possible to take into account the specifics and properties of the initial hyperspectral remote sensing data. Two algorithms for compressing hyperspectral data (lossy and lossless) based on wavelet transform are proposed; their distinguishing features are a reduction in the required memory size, acceleration of the search for significant wavelet coefficients using a pyramid of approximating coefficients, and an increase in the compression coefficient. Recommendations for applying these algorithms are formulated. A distinctive feature of the hyperspectral data compression method is the ability to control the compression coefficient through parametric adjustment of the algorithms, application of context modeling, and adaptation to the type of initial data (classical cube or Fourier interferogram). The efficiency of the technique has been confirmed experimentally on examples of compressing classical data and real Fourier interferograms, with compression ratios of 4.1 and 2.4, on par with the best published results, and confirmed analytically by assessing data distortion in the compressed stream.
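The role of significant wavelet coefficients in lossy compression can be shown with a one-level 1-D Haar transform. This toy stands in for the paper's hyperspectral algorithms: zeroing sub-threshold detail coefficients is where the loss (and the compressibility) comes from, and the reconstruction error stays bounded by the threshold.

```python
# Minimal one-level 1-D Haar wavelet illustration of lossy wavelet compression:
# transform, zero out insignificant detail coefficients, reconstruct.
# A toy stand-in for the paper's hyperspectral algorithms.
def haar_forward(x):
    approx = [(a + b) / 2 for a, b in zip(x[::2], x[1::2])]
    detail = [(a - b) / 2 for a, b in zip(x[::2], x[1::2])]
    return approx, detail

def haar_inverse(approx, detail):
    out = []
    for a, d in zip(approx, detail):
        out.extend((a + d, a - d))
    return out

signal = [10.0, 10.2, 10.1, 9.9, 50.0, 50.3, 10.0, 10.1]
approx, detail = haar_forward(signal)

# Keep only "significant" detail coefficients (|d| above a threshold);
# zeroing the rest is where the lossy compression happens.
THRESH = 0.5
detail_c = [d if abs(d) > THRESH else 0.0 for d in detail]

recon = haar_inverse(approx, detail_c)
err = max(abs(a - b) for a, b in zip(signal, recon))
print(f"max reconstruction error: {err:.2f}")  # bounded by the threshold
```

The pyramid of approximating coefficients mentioned above serves to locate the significant coefficients quickly across decomposition levels; this sketch only shows why discarding the insignificant ones is cheap in quality terms.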
A new modular and programmable wireless capsule endoscope is presented in this paper. The capsule system consumes low power and has small physical size. A new image compression algorithm is presented in this paper to ...
Background: With the rapid emergence of RNA databases and newly identified non-coding RNAs, an efficient compression algorithm for RNA sequence and structural information is needed for the storage and analysis of such data. Although several algorithms for compressing DNA sequences have been proposed, none of them are suitable for compressing RNA sequences together with their secondary structures. This kind of compression not only facilitates the maintenance of RNA data, but also supplies a novel way to measure the informational complexity of RNA structural data, raising the possibility of studying the relationship between the functional activities of RNA structures and their complexities, as well as various structural properties of RNA based on compression. Results: RNACompress employs an efficient grammar-based model to compress RNA sequences and their secondary structures. The main goals of this algorithm are twofold: (1) present a robust and effective way for RNA structural data compression; (2) design a suitable model to represent RNA secondary structure as well as derive the informational complexity of the structural data based on compression. Our extensive tests have shown that RNACompress achieves a universally better compression ratio compared with other sequence-specific or common text-specific compression algorithms, such as GenCompress, WinRAR and gzip. Moreover, a test of the activities of distinct GTP-binding RNAs (aptamers) compared with their structural complexity shows that our defined informational complexity can be used to describe how complexity varies with activity. These results lead to an objective means of comparing the functional properties of heteropolymers from the information perspective. Conclusion: A universal algorithm for the compression of RNA secondary structure, as well as the evaluation of its informational complexity, is discussed in this paper. We have developed RNACompress as a useful tool for academic users. Exten...
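The idea of compressing sequence and structure simultaneously can be sketched by fusing each base with its aligned dot-bracket symbol into one stream. RNACompress itself uses a grammar-based model; zlib below is only a stand-in general-purpose compressor, and the compressed size doubles as a crude informational-complexity measure.

```python
# Toy illustration of compressing an RNA sequence together with its
# dot-bracket secondary structure by fusing them into one symbol stream.
# RNACompress uses a grammar-based model; zlib here is only a stand-in
# general-purpose compressor for the ratio computation.
import zlib

seq    = "GGGAAACCC" * 4   # toy sequence (four repeated hairpin stems)
struct = "(((...)))" * 4   # aligned dot-bracket secondary structure
assert len(seq) == len(struct)

# Fuse the base and structure symbol at each position into one combined symbol,
# so one compressor models both information sources jointly.
fused = "".join(b + s for b, s in zip(seq, struct)).encode()

raw = (seq + struct).encode()
compressed_size = len(zlib.compress(fused, 9))
ratio = len(raw) / compressed_size
print(f"compression ratio on the fused stream: {ratio:.2f}")

# The compressed size itself serves as a crude informational-complexity
# estimate of the structured molecule, in the spirit described above.
print(f"informational-complexity proxy: {compressed_size} bytes")
```

A more regular (lower-complexity) structure compresses further, which is exactly the property the paper exploits to relate structural complexity to functional activity.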
Because local slant stacking increases the data dimension in beam migration, the volume of local slant stacks can be enormous and can obstruct efficient data processing. In addition, a proper beam compression algorithm can reduce the computation of ray tracing and beam mapping. Thus, compressing the local slant stacks with high fidelity can improve the efficiency of beam migration. A new approach is proposed to efficiently compress the local slant stacks. This approach combines the estimation of multiple local slopes based on the structure tensor, to reduce the number of slopes, with a sparse representation of the slant-stacked data via matching pursuit decomposition, to reduce the number of temporal samples. Furthermore, a new algorithm to estimate multiple local slopes based on the second-order structure tensor is proposed to handle intersecting events efficiently. Several data examples indicate that the new compression algorithm requires much less storage, while restoring the significant events and tolerating some random noise. The migration results show that this compression algorithm does not noticeably degrade the quality of the beam migration result; it even makes the migration result clearer by suppressing random-noise smearing.
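Matching pursuit, used above to sparsify the slant-stacked traces, can be shown in a minimal generic form. The dictionary here is deliberately trivial (the standard basis), chosen only so the greedy atom-selection loop stays short; it is a textbook sketch, not the paper's beam-compression code.

```python
# Toy matching pursuit: greedily represent a signal as a sparse combination of
# dictionary atoms, illustrating how a trace with few significant samples can
# be stored as a handful of (atom, coefficient) pairs.
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def norm(u):
    return math.sqrt(dot(u, u))

n = 32
# Trivial dictionary of unit-norm "spike" atoms (the standard basis), kept
# deliberately simple; practical uses employ wavelet-like atoms instead.
atoms = [[1.0 if i == k else 0.0 for i in range(n)] for k in range(n)]

signal = [0.0] * n
signal[5], signal[20] = 3.0, -2.0   # sparse signal: two significant samples

residual = list(signal)
rep = []                            # (atom index, coefficient) pairs
for _ in range(2):                  # two greedy pursuit iterations
    k = max(range(n), key=lambda j: abs(dot(residual, atoms[j])))
    c = dot(residual, atoms[k])
    rep.append((k, c))
    residual = [r - c * a for r, a in zip(residual, atoms[k])]

print(sorted(rep))      # [(5, 3.0), (20, -2.0)]
print(norm(residual))   # 0.0: two atoms represent the signal exactly
```

Storing two pairs instead of 32 samples is the temporal-sample reduction the approach relies on; the structure-tensor slope estimation plays the analogous role along the slope dimension.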
Similarity of sequences is a key mathematical notion for classification and phylogenetic studies in biology. It is currently handled primarily via alignments. However, alignment methods seem inadequate for post-genomic studies, since they do not scale well with data set size and they seem confined to genomic and proteomic sequences. Therefore, alignment-free similarity measures are actively pursued. Among those, USM (Universal Similarity Metric) has gained prominence. It is based on the deep theory of Kolmogorov complexity, and universality is its most striking feature. Since it can only be approximated via data compression, USM is a methodology rather than a formula quantifying the similarity of two strings. Three approximations of USM are available, namely UCD (Universal Compression Dissimilarity), NCD (Normalized Compression Dissimilarity) and CD (Compression Dissimilarity). Their applicability and robustness are tested on various data sets, yielding a first massive quantitative estimate that the USM methodology and its approximations are of value. Despite the rich theory developed around USM, its experimental assessment has limitations: only a few data compressors have been tested in conjunction with USM, and mostly at a qualitative level; no comparison among UCD, NCD and CD is available; and no comparison of USM with existing methods, whether based on alignments or not, seems to be available. Results: We experimentally test the USM methodology by using 25 compressors, all three of its known approximations, and six data sets of relevance to molecular biology. This offers the first systematic and quantitative experimental assessment of this methodology, which naturally complements the many theoretical and preliminary experimental results available. Moreover, we compare the USM methodology both with methods based on alignments and with alignment-free ones. We may group our experiments into two sets. The first one, performed via ROC (Receiver Operating Characteristic) analysis, ...
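The three approximations share one recipe: replace the uncomputable Kolmogorov complexity with the output size of a real compressor. A standard sketch of NCD using zlib (a generic illustration, not the authors' exact experimental setup):

```python
# Standard NCD sketch: approximate Kolmogorov complexity C(s) by the size of
# a real compressor's output. For strings x, y:
#   NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y))
# zlib is only one of many possible compressors (the paper tests 25).
import zlib

def C(s):
    """Compressed size of s, a computable stand-in for Kolmogorov complexity."""
    return len(zlib.compress(s.encode(), 9))

def ncd(x, y):
    cx, cy, cxy = C(x), C(y), C(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)

a = "ACGT" * 50
b = "ACGT" * 50       # identical to a: concatenation adds little new information
c = "TTTTGGGG" * 25   # different composition: concatenation compresses worse

print(ncd(a, b) < ncd(a, c))  # similar strings score lower
```

The intuition: if y contains little information beyond x, a compressor that has just seen x encodes y almost for free, so C(xy) barely exceeds C(x) and the dissimilarity is small.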
In recent years, different types of Residual Neural Networks (ResNets, for short) have been introduced to improve the performance of deep Convolutional Neural Networks. To cope with the possible redundancy of the layer structure of ResNets, and to use them on devices with limited computational capabilities, several tools for exploring and compressing such networks have been proposed. In this paper, we provide a contribution in this setting. In particular, we propose an approach for the representation and compression of a ResNet based on the use of a multilayer network, a structure sufficiently powerful to represent and manipulate a ResNet as well as other families of deep neural networks. Our compression approach uses a multilayer network to represent a ResNet and to identify its possibly redundant convolutional layers. Once such layers are identified, it prunes them, along with some related ones, to obtain a new, compressed ResNet. Experimental results demonstrate the suitability and effectiveness of the proposed approach.
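One crude way to flag a redundant residual block, absent the paper's multilayer-network machinery, is to look for a residual branch whose weights contribute almost nothing, so the block reduces to an identity map and can be pruned. The layer names, weight vectors and threshold below are all hypothetical illustrations of that heuristic.

```python
# Crude illustration of redundancy-driven pruning in a residual stack: a
# block whose residual branch has near-zero weight norm acts as an identity
# map and is a pruning candidate. This simple heuristic is only a stand-in
# for the multilayer-network analysis described in the paper.
import math

def l2(ws):
    """Euclidean norm of a flat weight vector."""
    return math.sqrt(sum(w * w for w in ws))

# Hypothetical per-block weight vectors of a tiny ResNet-like stack.
layers = {
    "block1": [0.9, -1.2, 0.4],
    "block2": [1e-4, -2e-4, 5e-5],  # near-zero branch: candidate for pruning
    "block3": [0.7, 0.3, -0.8],
}

THRESH = 1e-2
pruned = {name: ws for name, ws in layers.items() if l2(ws) > THRESH}
print(sorted(pruned))  # ['block1', 'block3']
```

In the compressed network, the skip connection carries the signal past the removed block unchanged, which is why residual architectures tolerate this kind of pruning particularly well.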
In recent years, numerous smart meters have been widely installed to aggregate time-series engineering parameters in the field; this has led to problems of handling big data. The huge volumes of data need to be transmitted, stored, processed and retrieved. Storing and accessing these big data have become expensive in time, space and bandwidth. The aim of the study is to find a solution to these problems. The solution developed in the study is to compress/decompress the engineering parameters. The data format of the variables has three portions: a 128-bit Globally Unique Identifier (GUID), a 64-bit time stamp, and a 64-bit floating-point value. Three encoding/decoding algorithms have been applied and implemented. The approaches have reduced the original historical data size, and thus the storage cost, by 40%. The algorithms' performance has been measured in terms of compression ratio, saving percentage, and compression/decompression time and speed. The decompression process proved faster than the compression process on the historical data.
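One plausible shape for such an encoder, given the 64-bit time stamp and 64-bit value fields described, is delta-encoding the timestamps before handing the stream to a general-purpose compressor. The record layout, sample data and ratios below are illustrative assumptions, not the paper's three algorithms.

```python
# Sketch of one plausible encoding for (timestamp, value) meter records:
# delta-encode the regular 64-bit timestamps, then apply a general-purpose
# compressor. Layout and numbers are illustrative assumptions only.
import struct
import zlib

# Hypothetical meter readings: one sample per second, slowly varying values.
records = [(1_600_000_000 + i, 230.0 + 0.01 * (i % 5)) for i in range(1000)]

# Raw layout: one little-endian int64 timestamp + float64 value per record.
raw = b"".join(struct.pack("<qd", t, v) for t, v in records)

# Delta-encode timestamps: a constant sampling interval collapses to a run
# of identical small deltas, which compresses extremely well.
deltas = [records[0][0]] + [b[0] - a[0] for a, b in zip(records, records[1:])]
encoded = b"".join(struct.pack("<q", d) for d in deltas)
encoded += b"".join(struct.pack("<d", v) for _, v in records)

compressed = zlib.compress(encoded, 9)
ratio = len(raw) / len(compressed)
saving = 1 - len(compressed) / len(raw)
print(f"compression ratio {ratio:.1f}, saving {saving:.0%}")
```

Grouping the deltas and the values into separate runs before compression keeps each run homogeneous, which is the main reason a generic compressor can reach savings in the range the study reports.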