ISBN: 9781479953424 (print)
The Fast Fourier Transform (FFT) is a key component of many applications, such as wireless communication based on OFDM (Orthogonal Frequency Division Multiplexing). With Cloud Radio Access Networks, implementing FFTs on multiprocessor clusters is a challenging task. For instance, supporting the Long Term Evolution (LTE) protocol requires processing 100 independent FFTs (with sizes ranging from 128 to 2048 points) in 66.7 μs. In this work, seven native FFT candidate implementations are compared. The implementation environments considered are: OpenMP (Open Multi-Processing) on 1 core; MPI (Message Passing Interface) on 1, 2, and 3 cores; hybrid OpenMP+MPI on 1 core and on 3 cores; and MPI on a heterogeneous platform composed of a Xeon Phi and 3 cores. The reported experimental results show that the last of these configurations meets the latency requirements of LTE. It is shown that the OpenMP and MPI paradigms running only on MICs (Many Integrated Cores) cannot fully benefit from the computing capability of many-core architectures. The heterogeneous combination of Xeon+MICs provides better performance.
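To make the latency constraint quoted above concrete, the following Python sketch works out the time available per FFT and the butterfly count of a textbook radix-2 FFT. It is only a back-of-the-envelope illustration: the even split of the batch across workers, the `workers` parameter, and the radix-2 operation count are assumptions of this sketch, not details taken from the paper.

```python
import math

LTE_BUDGET_US = 66.7   # time budget from the abstract, in microseconds
NUM_FFTS = 100         # independent FFTs to finish within that budget


def per_fft_budget_us(workers: int = 1) -> float:
    """Time available per FFT if the batch is split evenly over `workers`.

    `workers` is a hypothetical parameter; the abstract does not say how
    the FFTs are distributed.
    """
    ffts_per_worker = math.ceil(NUM_FFTS / workers)
    return LTE_BUDGET_US / ffts_per_worker


def radix2_butterflies(n: int) -> int:
    """Butterfly count of a textbook radix-2 FFT: (n / 2) * log2(n)."""
    if n < 2 or n & (n - 1):
        raise ValueError("n must be a power of two >= 2")
    return (n // 2) * (n.bit_length() - 1)


if __name__ == "__main__":
    print(f"budget per FFT, 1 worker:  {per_fft_budget_us(1):.3f} us")
    print(f"budget per FFT, 4 workers: {per_fft_budget_us(4):.3f} us")
    print(f"butterflies, 2048 points:  {radix2_butterflies(2048)}")
```

Under these assumptions a single worker has well under a microsecond per FFT, which gives a sense of why the abstract treats the deadline as demanding.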
The Message Passing Interface (MPI) is designed as an architecture-independent interface for parallel programming in the shared-nothing, message-passing paradigm. We briefly summarize the basic requirements for a high-quality implementation of MPI for efficient programming of SMP clusters and related architectures, and discuss possible mild extensions of the topology functionality of MPI which, while retaining a high degree of architecture independence, can make MPI more useful and efficient for message-passing programming of SMP clusters. We show that the discussed extensions can all be implemented on top of MPI with very little environmental support.
ISBN: 9781509008070 (print)
In general, highly parallelized programs may perform better on heterogeneous multiprocessor platforms than on homogeneous ones. OpenCL is one of the standards for parallel programming of heterogeneous multiprocessor platforms, and SPIR (Standard Portable Intermediate Representation) is a portable binary format for representing OpenCL kernel code. However, writing such programs is usually complex and error-prone for most programmers. Therefore, some standards have been proposed to simplify programming on heterogeneous multiprocessor platforms, for example OpenACC (a directive-based parallel programming model). In this paper, we implement our framework on Clang, the C front-end of LLVM, to automatically translate OpenACC to LLVM IR with SPIR kernels. The IR code can then optionally be optimized by the LLVM optimizer, and the host LLVM IR executed by the LLVM JIT compiler. According to the experimental results, our translated programs show significant performance improvements over the corresponding sequential versions for some programs, and perform comparably to their manually written OpenCL versions. Therefore, our design may reduce the difficulty of writing programs for heterogeneous multiprocessor platforms, and the translated OpenCL programs are portable and perform about as well as manual OpenCL programs written by experienced programmers.
ISBN: 9781467362184 (print)
The purpose of the research presented in this paper is to determine whether a parallel scalability model is applicable to Apache Hadoop on a cloud computer. The goal is to identify possible optimizations of map-reduce systems for more efficient computation. The results of the experiment indicate that Hadoop lacks the features needed to fit the model, but that certain optimizations can be made nonetheless.
Current monitor-based systems have some disadvantages for multi-object operations. They require programmers to (1) manually determine the order of locking operations, (2) manually determine the points of execution where threads should signal other threads, and (3) use global locks or perform busy waiting for operations that depend upon a condition that spans multiple objects. Transactional memory systems eliminate the need for explicit locks, but do not support conditional synchronization. They also require the ability to roll back transactions. In this paper, we propose new monitor-based methods that provide automatic signaling for global conditions that span multiple objects. Assuming that the global condition is a Boolean expression of local predicates, our method allows efficient monitoring of the conditions without any need for global locks. Furthermore, our system solves the monitor composition problem without requiring global locks. We have implemented our constructs on top of Java and have evaluated their overhead. Our results show that in most of the test cases, our code is not only simpler but also faster than Java's reentrant lock as well as the Deuce transactional memory system.
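For readers unfamiliar with the problem, the Python sketch below shows the conventional baseline that the abstract criticizes: a condition spanning two objects is guarded by a single global lock, and every writer must remember to signal manually. It is not the paper's method (which avoids global locks and is built on Java); the `Account` class and the function names are hypothetical and exist only for illustration.

```python
import threading


class Account:
    """Hypothetical shared object; any object with mutable state would do."""

    def __init__(self, balance: int) -> None:
        self.balance = balance


# One global lock and condition variable: the baseline approach, NOT the
# lock-free monitoring the abstract proposes.
_global = threading.Condition()


def deposit(acct: Account, amount: int) -> None:
    with _global:
        acct.balance += amount
        # Manual signaling: the programmer must remember this call, and
        # every waiter wakes up to re-check its own predicate.
        _global.notify_all()


def wait_until_total(a: Account, b: Account, target: int, timeout=None) -> bool:
    """Block until a.balance + b.balance >= target (a multi-object condition).

    Returns False if the timeout expires first.
    """
    with _global:
        return _global.wait_for(lambda: a.balance + b.balance >= target, timeout)
```

A waiter calls `wait_until_total(a, b, 7)` while other threads call `deposit`. Omitting `notify_all()` in any writer leaves the waiter blocked, which is the kind of manual bookkeeping that automatic signaling is meant to remove.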
We present PDG, a debugger that allows one to identify erroneous processes without examining the source code. Instead, the interactions between processes are examined by animating program behaviour. For this, we use hierarchical graphical representations that are constructed during top-down program development. During animation, a debugging kernel implementing a record-replay mechanism guarantees reproducible program behaviour. To minimise interference, we are developing a tracing mechanism that decides at run-time which messages are to be traced to guarantee reproducibility. The run-time debugging kernel provides portability by supporting standard portable communication calls. In addition, the kernel itself is easy to port, since all architecture- and communication-protocol-dependent functionality is clearly separated from the generic debugging functionality.
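The general idea of record-replay for message passing can be sketched in a few lines of Python. This is a generic illustration, not PDG's implementation: it assumes that messages from the same sender are delivered in FIFO order, so that logging only the sender of each received message is enough to reproduce the receive order. The function names and the `(sender, payload)` tuple format are hypothetical.

```python
from collections import defaultdict, deque


def record(arrivals):
    """Record phase: log only the sender of each message, in receive order."""
    return [sender for sender, _payload in arrivals]


def replay(arrivals, trace):
    """Replay phase: deliver messages in the recorded sender order.

    `arrivals` may come in a different interleaving than in the recorded
    run; per-sender FIFO order is assumed to be preserved.
    """
    queues = defaultdict(deque)
    for sender, payload in arrivals:
        queues[sender].append(payload)
    return [(sender, queues[sender].popleft()) for sender in trace]


if __name__ == "__main__":
    first_run = [("p1", "a"), ("p2", "x"), ("p1", "b")]
    trace = record(first_run)
    # Same messages, different arrival interleaving in the second run.
    second_run = [("p2", "x"), ("p1", "a"), ("p1", "b")]
    print(replay(second_run, trace) == first_run)
```

Logging only sender identities rather than full message contents keeps the trace small, which relates to the abstract's goal of minimising interference.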
This work seeks to break the sample efficiency bottleneck in parallel large-scale ranking and selection (R&S) problems by leveraging correlation information. We modify the commonly used "divide and conquer"...
ISBN: 9781424465330 (print)
Current Graphics Processing Units (GPUs) offer great potential for speeding up computationally intensive data-parallel applications over traditional parallelization approaches, since GPUs contain many more hardware threads than the computational cores available to common CPU threads. NVIDIA developed a generic GPU programming platform, CUDA, which allows programmers to use the GPU through the C programming language and to parallelize applications in a way similar to traditional multithreading. However, not all applications are suitable for this new platform. Only computationally intensive applications without strong dependencies are good candidates. Although the Advanced Encryption Standard (AES) does not belong to this group because of the light workload of its efficient implementation, this paper proposes an approach that arranges data properly in the different GPU memory spaces, overcomes the extra communication delay, and still turns the GPU into an effective accelerator. Experimental results demonstrate its effectiveness through performance gains and show that GPUs can be used to accelerate more types of applications.
Functions that invoke operations on multiple objects atomically are a useful extension of object-based parallel languages, such as Orca. This paper introduces atomic functions and shows how compile-time information can drive run-time optimizations of such functions.
This paper presents a project whose goal is to make an animation simulating the activity of a mobile robot in a given environment. A parallel is drawn between animation and reactive programming, particularly with the concept of an autonomous agent. The realised animation consists of a virtual world (the environment of the robot) and the robot itself, as an agent acting in this world. The crux of the project is the simulation of the sensors through which the simulated robot is supposed to see its environment. To implement this task, advanced techniques such as ray tracing and radiosity are used. An experimentation platform is designed based on the Nomad 200 robot and its simulator, with the addition of interfaces for the virtual sensors and for the representation on the computer screen.