Clusters built from single-core systems are cost-effective in terms of performance improvement and availability. However, hardware constraints limit the performance of single-core systems, making it difficult to meet the increasing performance requirements of diverse general-purpose applications. A promising solution is novel multi-core systems, which extend parallelism to the CPU level by integrating multiple processing units on a single die. This paper uses the finite-difference time-domain (FDTD) algorithm as a case study and designs suitable parallel FDTD algorithms for three architectures: distributed-memory machines with single-core processors, shared-memory machines with dual-core processors, and the Cell Broadband Engine (Cell/B.E.) processor with nine heterogeneous cores. The experimental results show that, at the processor level, the Cell/B.E. processor using 8 SPEs achieves significant speedups of 7.05x over a single-core AMD Opteron processor and 3.37x over a dual-core AMD Opteron processor.
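To make the case study concrete, the following is a minimal serial 1D FDTD update in Python (a sketch under assumptions: the grid size, Courant factor of 0.5, and soft Gaussian source are illustrative, not taken from the paper). The parallel versions for the three architectures would partition the field arrays across cores and exchange one boundary cell per time step.

    import numpy as np

    def fdtd_1d(steps=500, n=400, src=100):
        e = np.zeros(n)       # electric field on the Yee grid
        h = np.zeros(n - 1)   # magnetic field, staggered by half a cell
        for t in range(steps):
            h += 0.5 * (e[1:] - e[:-1])          # H update from curl of E
            e[1:-1] += 0.5 * (h[1:] - h[:-1])    # E update from curl of H
            e[src] += np.exp(-0.5 * ((t - 30.0) / 10.0) ** 2)  # soft source
        return e

    print(fdtd_1d()[:5])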
In this paper, we designed and implemented a High-Level Abstract parallel Programming Platform that relieves the programmer of all the hassle involved in parallel programming. That is, all that is requested of the programmer is to specify the program in a suitable form that hides many of the hardware features. All parallel process control, which is otherwise very challenging, is thus handled by the platform itself. To date, only three parallel programming approaches have been suggested in the literature: implicit, explicit, and systematic parallel programming. Among the paradigms belonging to the third approach, we chose the GAMMA formalism as the backbone of our implementation, mainly for two reasons: first, it uses an unstructured data set, which has the benefit of reducing data dependency to its lowest possible level, and second, program correctness can be easily demonstrated. A GAMMA program is generally defined as a pair of (condition, action), where the elements that fulfill the condition are replaced with the products of the action. The program is naturally and systematically executed in parallel. However, to date, no attempt had been made to provide a physical implementation of the GAMMA formalism. As an application of our platform, we parallelized some classical GIS image decomposition problems. The obtained results showed that, in addition to the ease and abstraction of parallel programming, an almost linear speedup is achieved.
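As a sketch of the GAMMA (condition, action) rewriting model in Python (illustrative only; the platform's actual parallel scheduler is not shown, and a real implementation would fire reactions concurrently rather than sequentially):

    def gamma(multiset, condition, action):
        data = list(multiset)
        changed = True
        while changed:
            changed = False
            for i in range(len(data)):
                for j in range(len(data)):
                    if i != j and condition(data[i], data[j]):
                        a, b = data[i], data[j]
                        # remove the reactants, insert the products
                        data = [x for k, x in enumerate(data) if k not in (i, j)]
                        data.extend(action(a, b))
                        changed = True
                        break
                if changed:
                    break
        return data

    # Classic example: the maximum of a multiset. Any pair (x, y) with
    # x <= y reacts and is replaced by {y}, until one element remains.
    print(gamma([4, 1, 7, 3], lambda x, y: x <= y, lambda x, y: [y]))  # [7]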
From the post-Renaissance period to the twentieth century, the traditional Western corset played an important role in the transformation of Western clothing and was an integral part of its development. The unique form and structure of the Western corset affected women's bodies and endangered their health. This article selects an 18th-century corset in the V&A Museum's collection as the object of study and constructs a virtual model of it in CLO 3D software, which can digitally capture the relationship between the garment and the human body. The effect of the corset on the human body is then explored through more concrete experiments and more intuitive data.
In Software Effort Estimation (SEE) practice, the data drought problem has been plaguing researchers and practitioners. Leveraging heterogeneous SEE data collected by other companies is a feasible solution to relieve ...
ISBN:
(Print) 9781479909735
The continually increasing fault rate in integrated circuits makes processors, especially many-core processors, more susceptible to errors. Meanwhile, most systems and applications do not need full fault coverage, which incurs excessive overhead, so on-demand fault tolerance is desirable for these applications. In this paper, we propose an adaptive, low-overhead fault tolerance mechanism for many-core systems, called Device View Redundancy (DVR). It treats fault tolerance as a device that can be configured and used by an application when high reliability is needed. Moreover, DVR exploits idle resources for low-overhead fault tolerance, based on the observation that the utilization of many-core systems is low due to the lack of parallelism in applications. Finally, experiments show that the performance overhead of DVR is reduced by 16% to 98% compared with full Dual Modular Redundancy (DMR).
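A minimal sketch of the on-demand idea, assuming "fault tolerance as a device" means redundancy is switched on per request (illustrative only; DVR's actual mechanism, which schedules the redundant copy onto idle cores, is not reproduced here):

    def run(task, *args, reliable=False):
        if not reliable:
            return task(*args)   # normal, unprotected execution
        r1 = task(*args)         # primary execution
        r2 = task(*args)         # redundant execution (DMR on demand)
        if r1 != r2:
            raise RuntimeError("DMR mismatch: fault detected")
        return r1

    print(run(sum, [1, 2, 3], reliable=True))  # 6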
ISBN:
(Print) 9783540241287
The idea of virtual backbone routing has been proposed for efficient routing among a set of mobile nodes in wireless ad hoc networks. Virtual backbone routing can reduce communication overhead and speed up the routing process compared with many existing on-demand routing protocols for route discovery. In many studies, a Minimum Connected Dominating Set (MCDS) is used to approximate the virtual backbone in a unit-disk graph. However, finding an MCDS is an NP-hard problem. In this paper, we propose a distributed, three-phase protocol for calculating a CDS. Our new protocol largely reduces the number of nodes in the CDS compared with Wu and Li's method, while the message and time complexities of our approach remain almost the same as theirs. We conduct extensive simulations and show that our protocol consistently outperforms Wu and Li's method. The correctness of our protocol is proved through theoretical analysis.
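For intuition, here is a simple centralized greedy CDS heuristic in Python (an illustrative baseline only; the paper's three-phase distributed protocol and Wu and Li's marking rules are not reproduced):

    def greedy_cds(adj):
        # adj: dict mapping node -> set of neighbors of a connected graph
        nodes = set(adj)
        start = max(nodes, key=lambda v: len(adj[v]))    # highest-degree seed
        cds, covered = {start}, {start} | adj[start]
        while covered != nodes:
            # extend the backbone by the covered node that dominates
            # the most still-uncovered nodes
            frontier = [v for v in covered if adj[v] - covered]
            best = max(frontier, key=lambda v: len(adj[v] - covered))
            cds.add(best)
            covered |= adj[best]
        return cds

    path = {1: {2}, 2: {1, 3}, 3: {2, 4}, 4: {3, 5}, 5: {4}}
    print(sorted(greedy_cds(path)))  # [2, 3, 4]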
ISBN:
(Print) 9783037851555
Data distribution is the basic behavior of P2P applications (file sharing and streaming services) and a key element affecting the performance of P2P systems. However, few research works focus on the data distribution of P2P applications from the perspective of the whole system. In this paper we study data distribution in P2P applications in terms of decreasing the system distribution load. We formally define the distribution load of P2P systems and analyze, by means of mathematical analysis, how to decrease the system load quickly. Moreover, we give a feasible fast distribution algorithm based on our theoretical conclusions. The experimental results show that our algorithm significantly improves data distribution speed and load balance.
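To illustrate why distribution can finish quickly, consider the standard tree-style schedule in which every peer that already holds the data serves one peer that does not, so the number of holders doubles each round (a generic illustration; the paper's algorithm and load model are not shown):

    import math

    def distribution_rounds(n_peers):
        holders, rounds = 1, 0           # the source starts as the only holder
        while holders < n_peers:
            holders = min(2 * holders, n_peers)  # each holder serves one peer
            rounds += 1
        return rounds

    for n in (2, 16, 1000):
        print(n, distribution_rounds(n), math.ceil(math.log2(n)))  # equal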
ISBN:
(Print) 9781479923427
The ability to efficiently switch from one pre-encoded video stream to another is a valuable attribute for a variety of interactive streaming applications, such as switching among streams of the same video encoded at different bit-rates for real-time bandwidth adaptation, or view-switching among videos capturing the same dynamic 3D scene from different viewpoints. It is well known that intra-coded I-frames can be used at switch boundaries to facilitate stream-switching. However, the size of an I-frame is large, making frequent insertion impractical. A recent proposal towards a more efficient stream-switching mechanism is distributed source coding (DSC), which exploits worst-case correlation between a set of potential predictor frames in the decoder buffer (called side information (SI) frames) and a target frame to lower the encoding rate. However, the conventional use of bit-plane and channel coding makes the encoding and decoding complexity of DSC frames large. In this paper, we pursue a novel approach to the stream-switching problem based on the concept of "signal merging", using a piecewise constant (PWC) function as the merge operator. Specifically, we propose a new merge mode for a code block, where for each k-th transform coefficient in the block, we encode appropriate step size and horizontal shift parameters at the encoder, so that the resulting floor function at the decoder maps the corresponding coefficients from any SI frame to the same reconstructed value, resulting in an identically merged signal. The selection of the shift parameter per coefficient, as well as the coding mode (intra or merge) per block, is optimized in a rate-distortion (RD) optimal manner. Experiments show encouraging coding gain over a previous DSC frame implementation at low to mid bitrates, at reduced computational complexity.
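A small numerical sketch of the floor-based merge operator (the parameter names and reconstruction rule below are illustrative assumptions, not the paper's exact formulation): choose a step size at least as large as the worst-case spread of a coefficient across SI frames, so that a single quantization bin covers all its versions.

    import math

    def merge(values, step, shift):
        # all SI versions must fall into the same floor bin
        bins = {math.floor((v + shift) / step) for v in values}
        assert len(bins) == 1, "step/shift too small to merge these values"
        return (bins.pop() + 0.5) * step - shift   # common reconstruction

    si_coeffs = [12.3, 14.1, 13.0]   # k-th coefficient across three SI frames
    step = max(si_coeffs) - min(si_coeffs) + 1.0
    print(merge(si_coeffs, step, shift=-min(si_coeffs)))  # ~13.7, same for all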
We describe a simple, low-level approach for embedding probabilistic programming in a deep learning ecosystem. In particular, we distill probabilistic programming down to a single abstraction—the random variable. Our lightweight implementation in TensorFlow enables numerous applications: a model-parallel variational auto-encoder (VAE) with 2nd-generation tensor processing units (TPUv2s); a data-parallel autoregressive model (Image Transformer) with TPUv2s; and multi-GPU No-U-Turn Sampler (NUTS). For both a state-of-the-art VAE on 64x64 ImageNet and Image Transformer on 256x256 CelebA-HQ, our approach achieves an optimal linear speedup from 1 to 256 TPUv2 chips. With NUTS, we see a 100x speedup on GPUs over Stan and 37x over PyMC3.
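A tiny plain-Python sketch of the random-variable abstraction (the real implementation is on TensorFlow; the class and behavior here are illustrative assumptions): a random variable wraps a sampler, draws a value on construction, and composes with ordinary numeric code.

    import random

    class RandomVariable:
        def __init__(self, sampler, *params):
            self.params = params
            self.value = sampler(*params)   # sample once at construction
        def __add__(self, other):
            v = other.value if isinstance(other, RandomVariable) else other
            return self.value + v

    z = RandomVariable(random.gauss, 0.0, 1.0)   # z ~ Normal(0, 1)
    x = z + 3.0            # downstream code sees the sampled value
    print(x)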
ISBN:
(Digital) 9798331530037
ISBN:
(Print) 9798331530044
Privacy-preserving computation techniques like homomorphic encryption (HE) and secure multi-party computation (SMPC) enhance data security by enabling processing on encrypted data. However, the significant computational and CPU-DRAM data movement overhead resulting from the underlying cryptographic algorithms impedes the adoption of these techniques in practice. Existing approaches focus on improving computational overhead using specialized hardware like GPUs and FPGAs, but these methods still suffer from the same processor-DRAM overhead. Novel hardware technologies that support in-memory processing have the potential to address this problem. Memory-centric computing, or processing-in-memory (PIM), brings computation closer to data by introducing low-power processors called data processing units (DPUs) into memory. Besides its in-memory computation capability, PIM provides extensive parallelism, resulting in significant performance improvement over state-of-the-art approaches. We propose a framework that uses recently available PIM hardware to achieve efficient privacy-preserving computation. Our design consists of a four-layer architecture: (1) an application layer that decouples privacy-preserving applications from the underlying protocols and hardware; (2) a protocol layer that implements existing secure computation protocols (HE and SMPC); (3) a data orchestration layer that leverages data compression techniques to mitigate the data transfer overhead between DPUs and host memory; (4) a computation layer that implements the DPU kernels on which secure computation algorithms are built.
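A skeletal rendering of the four-layer design in Python (class and method names are illustrative assumptions; the framework's real DPU kernels, compression scheme, and protocol APIs are not shown):

    class ComputationLayer:              # (4) DPU kernels, e.g. modular arithmetic
        def modmul(self, a, b, q):
            return (a * b) % q

    class OrchestrationLayer:            # (3) host <-> DPU data movement
        def transfer(self, payload):
            return payload               # placeholder for compression + copy

    class ProtocolLayer:                 # (2) secure protocols over the kernels
        def __init__(self):
            self.compute = ComputationLayer()
            self.orchestrate = OrchestrationLayer()
        def he_multiply(self, ct1, ct2, modulus):
            a = self.orchestrate.transfer(ct1)
            b = self.orchestrate.transfer(ct2)
            return self.compute.modmul(a, b, modulus)

    class ApplicationLayer:              # (1) apps decoupled from protocol/hardware
        def __init__(self, protocol):
            self.protocol = protocol

    app = ApplicationLayer(ProtocolLayer())
    print(app.protocol.he_multiply(7, 9, modulus=97))  # 63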