High performance SIMD text processing using the method of parallel bit streams is introduced with a case study of UTF-8 to UTF-16 transcoding. A forward transform converts byte-oriented character stream data into eigh...
详细信息
ISBN:
(纸本)9781595939609
High performance SIMD text processing using the method of parallel bit streams is introduced with a case study of UTF-8 to UTF-16 transcoding. A forward transform converts byte-oriented character stream data into eight parallel bit streams. Decoding, validation and computation of UTF-8 indexed UTF-16 bit streams are performed using bit-parallel logic and shifting operations. Conversion from UTF-8 indexing to UTF-16 indexing is performed using parallel bit deletion. The inverse transform is applied to yield high and low UTF-16 byte streams which are then merged. Combined with optimization techniques for blocks of ASCII data, speed-ups of 3 to 25 times are achieved on commodity processors compared with optimized byte-at-a-time code. Further applications of the method of parallel bit streams to bulk text processing applications are briefly discussed along with future prospects for the combination of intraregister and intrachip parallelism on multicore processors.
The next generations of supercomputers are projected to have hundreds of thousands of processors. However, as the numbers of processors grow, the scalability of applications will be the dominant challenge. This forces...
详细信息
ISBN:
(纸本)9781595939609
The next generations of supercomputers are projected to have hundreds of thousands of processors. However, as the numbers of processors grow, the scalability of applications will be the dominant challenge. This forces us to reexamine some of our fundamental ways that we approach the design and use of parallel languages and runtime systems. In this paper we show how the globally shared arrays in a popular Partitioned Global Address Space (PGAS) language, Unified parallel C (UPC), can be combined with a new collective interface to improve both performance and scalability. This interface allows subsets, or teams, of threads to perform a collective together. As opposed to MPI's communicators, our interface allows set of threads to be placed in teams instantly rather than explicitly constructing communicators, thus allowing for a more dynamic team construction and manipulation. We motivate our ideas with three application kernels: Dense Matrix Multiplication, Dense Cholesky factorization and multidimensional Fourier transforms. We describe how the three aforementioned applications can be succinctly written in UPC thereby aiding productivity. We also show how such an interface allows for scalability by running on up to 16,384 processors on the BlueGene/L. In a few lines of UPC code, we wrote a dense matrix multiply routine achieves 28.8 TFlop/s and a 3D FFT that achieves 2.1 TFlop/s. We analyze our performance results through models and show that the machine resources rather than the interfaces themselves limit the performance.
Abstract machines provide a certain separation between platform-dependent and platform-independent concerns in compilation. Many of the differences between architectures are encapsulated in the specific abstract machi...
详细信息
The importance of tiles or blocks in scientific computing cannot be overstated. Many algorithms, both iterative and recursive, can be expressed naturally if tiles are represented explicitly. From the point of view of ...
详细信息
High performance SIMD text processing using the method of parallel bit streams is introduced with a case study of UTF-8 to UTF-16 transcoding. A forward transform converts byte-oriented character stream data into eigh...
详细信息
Several parallel architectures such as GPUs and the Cell processor have fast explicitly managed on-chip memories, in addition to slow off-chip memory. They also have very high computational power with multiple levels ...
详细信息
The next generations of supercomputers are projected to have hundreds of thousands of processors. However, as the numbers of processors grow, the scalability of applications will be the dominant challenge. This forces...
详细信息
The ZeptoOS project is developing an open-source alternative to the proprietary software stacks available on contemporary massively parallel architectures. The aim is to enable computer science research on these archi...
详细信息
暂无评论