With the further development and wide acceptance of cloud computing, many companies and universities have decided to take advantage of it in their own data centers, an arrangement known as private clouds. Since private clouds have som...
In contrast with public clouds, private clouds have some unique features, especially with respect to workflow scheduling. The trade-off between power and performance, of course, remains one of the key concerns. Building on our previous research, in this paper we propose a hybrid energy-efficient scheduling algorithm that uses dynamic migration. Experiments show that it not only reduces response time and conserves more energy but also achieves a higher level of load balancing.
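The abstract does not detail the migration policy itself. Purely to illustrate what one dynamic-migration step in such a scheduler could look like, here is a minimal Python sketch; the function name, the load threshold, and the host/task representation are all hypothetical and are not taken from the paper:

```python
def pick_migration(hosts, threshold=0.25):
    """Illustrative dynamic-migration step (not the paper's algorithm):
    if the load gap between the busiest and the idlest host exceeds a
    threshold, move one task from the former to the latter.

    hosts: dict mapping host name -> list of task loads (floats).
    Returns (task, source, destination) or None if already balanced.
    """
    load = {h: sum(ts) for h, ts in hosts.items()}
    hot = max(load, key=load.get)    # most loaded host
    cold = min(load, key=load.get)   # least loaded host
    if load[hot] - load[cold] <= threshold or not hosts[hot]:
        return None                  # imbalance too small to justify a move
    task = min(hosts[hot])           # migrate the smallest task on the hot host
    hosts[hot].remove(task)
    hosts[cold].append(task)
    return (task, hot, cold)

if __name__ == "__main__":
    hosts = {"a": [0.5, 0.4], "b": [0.1]}
    print(pick_migration(hosts))
```

A real scheduler would of course also weigh migration cost and power states; this sketch only shows the rebalancing decision.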
Machine translation (MT), with its broad potential use, has gained increased attention from both researchers and software vendors. To generate high-quality translations, however, MT decoders can be highly computationally intensive. With significant raw computing power, multi-core microprocessors have the potential to speed up MT software on desktop machines. Retrofitting existing MT decoders, however, is a nontrivial task: race conditions and atomicity issues are among the complications that make parallelization difficult. In this article, we show that such difficulties are much easier to overcome when parallelizing a state-of-the-art MT decoder with a process-based parallelization method, called functional task parallelism, than with conventional thread-based methods. We achieve a 7.60× speedup on an 8-core desktop machine while making significantly fewer changes to the original sequential code than multiple threads would require.
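The core idea of process-based parallelism is that worker processes share no mutable state, so the race conditions and atomicity issues mentioned above cannot arise by construction. A minimal Python sketch of this shape (the `decode` function is a trivial stand-in, not the paper's decoder):

```python
from multiprocessing import Pool

def decode(sentence):
    # Stand-in for an MT decoder kernel: here we just reverse the words.
    # The key property is that each call is a pure function of its input,
    # so worker processes share no mutable state and no races can occur.
    return " ".join(reversed(sentence.split()))

def decode_corpus(sentences, workers=8):
    # Each sentence is decoded in a separate OS process; Pool.map gathers
    # results in input order, matching the sequential decoder's output.
    with Pool(workers) as pool:
        return pool.map(decode, sentences)

if __name__ == "__main__":
    print(decode_corpus(["hello world", "machine translation"], workers=2))
```

Because processes communicate only through serialized inputs and outputs, the sequential code needs almost no restructuring, which is the point the article makes against thread-based retrofitting.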
This paper introduces PartitionSim, a parallel simulator for future thousand-core processors with software-managed cache coherence. The purpose of PartitionSim is to improve the simulation performance of many-core architectures at the cost of only a small sacrifice in accuracy. To achieve this goal, we propose a novel technique called timing partition. Timing partition is based on the observation that, in a target system, interacting components communicate with each other and therefore impose simulation synchronization, whereas non-interacting components do not communicate and thus allow asynchronous simulation. It divides the target timing models into two groups: a non-interacting group and an interacting group. Non-interacting timing models are simulated by host threads that synchronize rarely with each other, improving speed with little loss of accuracy, while interacting timing models are simulated by host threads that synchronize strictly with each other to preserve accuracy. Using PartitionSim, we have simulated a target composed of thousands of cores on a 16-core SMP machine. The evaluation results show that PartitionSim scales well, with near-linear speedup and considerable performance (up to 25 MIPS), at the cost of little accuracy loss (0.92% on average).
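The two synchronization regimes can be made concrete with a small scheduling sketch, assuming (this is our illustration, not PartitionSim's implementation) that non-interacting models advance a whole time quantum at once while interacting models advance in lockstep, one cycle at a time:

```python
def timing_partition_schedule(cores, total_cycles, quantum):
    """Return the order in which core timing models are advanced.

    cores: list of (name, interacting_flag) pairs. Interacting cores are
    advanced one cycle at a time in lockstep (strict synchronization);
    non-interacting cores run a whole quantum before re-syncing.
    Each trace entry is (core, start_cycle, end_cycle).
    """
    interacting = [n for n, flag in cores if flag]
    free = [n for n, flag in cores if not flag]
    trace = []
    for start in range(0, total_cycles, quantum):
        end = min(start + quantum, total_cycles)
        # Non-interacting group: each core runs the full quantum alone.
        for n in free:
            trace.append((n, start, end))
        # Interacting group: lockstep, cycle by cycle.
        for cyc in range(start, end):
            for n in interacting:
                trace.append((n, cyc, cyc + 1))
    return trace
```

With a quantum of Q cycles, a non-interacting core incurs one synchronization point per Q cycles instead of one per cycle, which is where the speedup in the abstract comes from.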
With the prevalence of multi-core processors, embedded clusters increasingly deploy SMP nodes to gain more computing power. MPI inter-process communication, a crucial issue, suffers from the tension between high performance and embedded constraints. Moreover, there is a large performance gap between intra- and inter-node communication across different infrastructures. In this paper, we design a virtual communication system called SMVN, which extends the shared-memory mechanism typically used in the intra-node case to the inter-node case. SMVN utilizes the HT inter-chip interconnect interface of Godson-3A SMP nodes to build a mesh topology. It is Ethernet-compatible, simulating the bottom layers of the TCP/IP protocol. With this design, the node interconnection can dispense with NICs, cables, and switches. Furthermore, we exploit a zero-copy scheme and other optimizations to improve performance. We port the MPICH2 library via its socket channel and formulate its process allocation. MPI latency and bandwidth tests show that the performance difference between the two levels is small: the inter-node bandwidth is 27.3 MB/s, which is more than twice the theoretical peak of 100 Mb Ethernet and reaches 84% of the intra-node performance.
Aggressive technology scaling makes chip multiprocessors increasingly error-prone. Core-level fault-tolerant approaches bind two cores to implement redundant execution and error detection. However, as more cores are integrated into one chip, existing static and dynamic binding schemes suffer from a scalability problem when the violation effects caused by external write operations are considered. In this paper, we present a transparent dynamic binding (TDB) mechanism to address this issue. Learning from static binding schemes, we involve the private caches in holding identical data blocks, thereby reducing global master–slave consistency maintenance to the scale of the private caches. With our fault-tolerant cache coherence protocol, TDB satisfies the objective of private cache consistency and therefore provides excellent scalability and flexibility. Experimental results show that, for a set of parallel workloads, the overall performance of our TDB scheme is very close to that of baseline fault-tolerant systems, outperforming dynamic core coupling by 9.2%, 10.4%, 18%, and 37.1% for 4, 8, 16, and 32 cores respectively.
Stencil computations are at the core of a wide range of scientific and engineering applications. Much effort has been put into improving the efficiency of stencil calculations on different platforms, but unfortunately these optimizations are not easy to reuse. In this paper we present PADS, a PAttern-Driven Stencil compiler-based tool with a simple tuning system, to reuse those well-optimized methods and codes. We also suggest extensions to OpenMP that describe high-level data structures in order to facilitate the recognition of various stencil computation patterns. PADS allows programmers to rewrite stencil kernels or to reuse source-to-source translator outputs as optimized stencil template codes with the related tuning parameters. In addition, PADS includes an OpenMP-to-CUDA translator and a code generator that uses the optimized template codes. It also obtains architecture-specific parameters to tune stencils across different GPU platforms. To demonstrate the flexibility and performance portability of our system, we illustrate four different stencil computations: the Laplacian operator with the Jacobi iterative method, the divergence operator, a 3D 25-point stencil, and a 2D heat equation using the ADI method with periodic boundary conditions. PADS succeeds in generating all four stencil codes using different optimization strategies and delivers a promising performance improvement.
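For readers unfamiliar with the pattern, the first of the four examples, a Jacobi sweep of the 5-point Laplacian stencil, is small enough to show in full. This is the textbook kernel, written as a plain Python reference (a real PADS-generated version would of course be a tuned CUDA kernel):

```python
def jacobi_step(grid):
    # One Jacobi sweep of the 5-point Laplacian stencil on a 2D grid:
    # each interior cell becomes the average of its four neighbors,
    # read from the previous iteration; boundary cells are held fixed.
    n, m = len(grid), len(grid[0])
    new = [row[:] for row in grid]
    for i in range(1, n - 1):
        for j in range(1, m - 1):
            new[i][j] = 0.25 * (grid[i - 1][j] + grid[i + 1][j] +
                                grid[i][j - 1] + grid[i][j + 1])
    return new
```

Every cell update reads the same fixed neighbor offsets, which is exactly the regularity that lets a tool like PADS recognize the pattern and apply tiling, blocking, or GPU mappings automatically.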
Coverage models are the main technique for evaluating the thoroughness of the dynamic verification of a Design-under-Verification (DUV). However, rather than achieving high coverage, the essential purpose of verification is to expose as many bugs as possible. In this paper, we propose a novel verification methodology that leverages early bug prediction for a DUV to guide and assess the related verification process. Specifically, this methodology utilizes predictive models built upon artificial neural networks (ANNs), which are capable of modeling the relationship between the high-level attributes of a design and its associated bug information. To evaluate the performance of the constructed predictive model, we conduct experiments on several open-source projects. Moreover, we demonstrate the usability and effectiveness of the proposed methodology by elaborating on experiences from our industrial practice. Finally, we discuss the application of our methodology.
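To make the attributes-to-bugs mapping concrete, here is a deliberately minimal single-neuron predictor trained by gradient descent; the paper's ANNs are larger, and the attribute vectors and training data here are purely illustrative:

```python
import math
import random

def train_predictor(samples, epochs=2000, lr=0.5):
    """Minimal single-neuron stand-in for an ANN bug predictor: maps a
    vector of high-level design attributes (e.g. normalized size, code
    churn) to a bug-proneness score in (0, 1). Illustrative only."""
    random.seed(0)
    dim = len(samples[0][0])
    w = [0.0] * dim
    b = 0.0
    for _ in range(epochs):
        for x, y in samples:
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            p = 1.0 / (1.0 + math.exp(-z))      # sigmoid activation
            g = p - y                           # gradient of log loss
            w = [wi - lr * g * xi for wi, xi in zip(w, x)]
            b -= lr * g
    def predict(x):
        z = sum(wi * xi for wi, xi in zip(w, x)) + b
        return 1.0 / (1.0 + math.exp(-z))
    return predict
```

Given historical modules labeled buggy or clean, the returned `predict` function scores new modules, and verification effort can be steered toward the high-scoring ones, which is the guiding role the methodology assigns to its predictive models.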
As FPGA feature sizes shrink to nanometers, soft errors increasingly become an important concern for SRAM-based FPGAs. Without considering application-level impact, existing reliability-oriented placement and routing approaches analyze the soft error rate (SER) only at the physical level, and consequently complete the design with suboptimal soft error mitigation. Our analysis shows that the statistical variation of the application-level factor is significant. Hence, in this work we first propose a cube-based analysis to efficiently and accurately evaluate the application-level factor. We then propose a cross-layer optimized placement and routing algorithm that reduces the SER by incorporating the application-level and physical-level factors together. Experimental results show that the average difference in the application-level factor between our cube-based method and golden Monte Carlo simulation is less than 0.01. Moreover, compared with the baseline VPR placement and routing technique, the cross-layer optimized algorithm reduces the SER by 14% with no area or performance overhead.
Computer-supported collaborative learning (CSCL) is an emerging branch of the learning sciences concerned with studying how people can learn together with the help of computers. As an indispensable ingredient, computer med...