This paper presents the basic parallel implementation and a variation for matrix-vector multiplication. We evaluated and compared the performance of the two implementations on a cluster of workstations using Message...
To improve the performance of applications on OpenMP/JIAJIA, we present a new abstraction, the Array Relation Vector (ARV), which describes the relation between the data elements of two consistent shared arrays accessed in one computation phase. Based on ARV, we use array grouping to eliminate the pseudo data distribution of small shared data and to improve page locality. Experimental results show that ARV-based array grouping can greatly improve the performance of applications with non-contiguous data access and strict access affinity on an OpenMP/JIAJIA cluster. For applications with small shared arrays, array grouping noticeably improves performance when the number of processors is small.
In this paper we investigate the problem of finding a delay- and degree-bounded maximum-sum-of-nodes application-level multicast tree. We then prove that the problem is NP-hard, and its relationship with the well-studied ...
Program performance optimization often involves choosing the right parameters to minimize the program's runtime. Selecting optimization parameters by means of execution-driven search is guaranteed to find excellent results, because it accurately accounts for all performance components of the target platform. The major drawback of the execution-driven approach, however, is the excessive compilation time due to thousands of runs of the original program. In this article, we propose a novel technique called program reduction transformations to reduce the cost of execution-driven optimization parameter selection. It is based on our observations of the characteristics of scientific applications and of the optimization parameter selection task. The idea is to transform the program before it is used in the execution-driven parameter selection procedure. The transformed program runs in much less time but preserves the parameter selection quality. This technique greatly reduces the time spent evaluating each candidate parameter and makes execution-driven optimization parameter selection affordable. We formulate the theoretical foundation of program reduction transformations and identify several situations, common in scientific applications, where reduction transformations can be legally applied. Experiments on two math kernels and three SPEC benchmarks show that our approach is both feasible and effective.
The paper presents a novel method, namely slicing execution, to verify C programs with respect to temporal safety properties. Its distinguishing feature is that it simulates the execution of only the relevant statements under abstraction criteria and checks the properties on the fly. The abstraction criterion begins with a proper initial set of program variables and may be iteratively refined according to spurious counter-examples. Given that the properties to be verified usually involve only a few variables in practical programs, slicing execution may have the same precision as path-sensitive simulation at a cost close to that of standard flow-sensitive dataflow analysis. The presented method has been used to verify the initial handshake process of the SSL protocol based on the C source code of openssl-0.9.6c. The experimental results confirm our claim and show that slicing execution is not only practical but also effective.
The stride data value predictor is widely used by researchers in data value prediction studies. Compared with context-based hybrid data value predictors, stride data value predictors are simple. But when encountering non-stride repeated sequences, a stride value predictor does not perform as well as a context-based hybrid data value predictor. In this paper, a revised stride data value predictor is introduced. With a small augmentation to a traditional stride data value predictor, the new predictor can make correct predictions on some patterns that could previously be handled only by context-based data value predictors. Simulation results show that the new predictor works well with most value-predictable instructions. Design decisions such as predictor size, the confidence mechanism, and storing partial tags are analyzed.
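As a point of reference for the abstract above, the following is a minimal sketch of a traditional stride predictor, not the revised predictor the paper introduces. The class name `StridePredictor`, the table layout (one `(last_value, stride)` entry per instruction address), and the helper `accuracy` are assumptions made for illustration; confidence counters and partial tags are omitted.

```python
class StridePredictor:
    """Textbook stride predictor: predicts last_value + stride per instruction.

    Hypothetical illustration only; the table is an unbounded dict keyed by
    instruction address (pc), with no confidence mechanism or tags.
    """

    def __init__(self):
        self.table = {}  # pc -> (last_value, stride)

    def predict(self, pc):
        entry = self.table.get(pc)
        if entry is None:
            return None  # no history yet, so no prediction
        last_value, stride = entry
        return last_value + stride

    def update(self, pc, actual):
        entry = self.table.get(pc)
        if entry is None:
            self.table[pc] = (actual, 0)
        else:
            last_value, _ = entry
            # The new stride is the difference between the two latest values.
            self.table[pc] = (actual, actual - last_value)


def accuracy(values, pc=0):
    """Fraction of values predicted correctly for a single instruction."""
    predictor = StridePredictor()
    correct = 0
    for value in values:
        if predictor.predict(pc) == value:
            correct += 1
        predictor.update(pc, value)
    return correct / len(values)


if __name__ == "__main__":
    print(accuracy([1, 2, 3, 4, 5, 6, 7, 8]))     # constant stride
    print(accuracy([1, 5, 2, 1, 5, 2, 1, 5, 2]))  # repeating, non-stride
```

On a constant-stride sequence this sketch predicts correctly once the stride has been learned, while on a repeating non-stride sequence such as 1, 5, 2, 1, 5, 2 it never does, which is the weakness the abstract attributes to plain stride predictors.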
ISBN: 9781595930293 (print)
We introduce a 64-bit ANSI/IEEE Std 754-1985 floating point design of a hardware matrix multiplier optimized for FPGA implementations. A general block matrix multiplication algorithm, applicable to arbitrary matrix sizes, is proposed. The algorithm potentially enables optimum performance by exploiting the data locality and reusability incurred by the general matrix multiplication scheme and considering the limitations of the I/O bandwidth and the local storage volume. We implement a scalable linear array of processing elements (PEs) supporting the proposed algorithm in the Xilinx Virtex II Pro technology. Synthesis results confirm a superior performance-area ratio compared to related recent works. Assuming the same FPGA chip, the same amount of local memory, and the same I/O bandwidth, our design outperforms related proposals by at least 1.7X and up to 18X while consuming the fewest reconfigurable resources. A total of 39 PEs can be integrated into the xc2vp125-7 FPGA, reaching performance of, e.g., 15.6 GFLOPS with 1600 KB local memory and 400 MB/s external memory bandwidth. Copyright 2005 ACM.
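The abstract does not spell out the block algorithm or the PE schedule, so the following is only a generic textbook sketch of blocked matrix multiplication in software, illustrating how tiling reuses data held in limited local storage. The function name `blocked_matmul` and the `block` parameter are assumptions, not names from the paper.

```python
def blocked_matmul(a, b, block=2):
    """Multiply an n-by-k matrix a by a k-by-m matrix b, tile by tile.

    Generic illustration of blocking, not the paper's hardware algorithm.
    Matrices are lists of rows; sizes need not be multiples of `block`.
    """
    n, k, m = len(a), len(b), len(b[0])
    c = [[0.0] * m for _ in range(n)]
    for i0 in range(0, n, block):
        for j0 in range(0, m, block):
            for k0 in range(0, k, block):
                # Work on one tile of a and one tile of b at a time, so each
                # loaded element is reused across the whole tile.
                for i in range(i0, min(i0 + block, n)):
                    for kk in range(k0, min(k0 + block, k)):
                        a_ik = a[i][kk]
                        for j in range(j0, min(j0 + block, m)):
                            c[i][j] += a_ik * b[kk][j]
    return c


if __name__ == "__main__":
    a = [[1, 2, 3], [4, 5, 6]]
    b = [[7, 8], [9, 10], [11, 12]]
    print(blocked_matmul(a, b))
```

In a hardware setting the tile size would be chosen from the local memory capacity and the I/O bandwidth, which are the two constraints the abstract names; here `block` is simply a free parameter.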
The paper presents an agent-oriented programming language SLABSp. It provides caste and scenario mechanisms in a coherent way to support the caste-centric methodology of agent-oriented software development. It uses caste as a modular facility to organize agents into castes and to represent their structure and behavior characteristics. SLABSp also uses scenarios to define agents' behaviors in the context of environment situations. In the paper, the implementation of the language is briefly described. An example of the program is given to illustrate its programming style. Copyright 2005 ACM.
Test oracles are widely used to verify whether a system under test is running as desired. Since the correctness of real-time systems depends on both the logical results of the computation and the time at which those results are produced, an optimized model checking-based method for test oracle generation is proposed to check at run time whether the system traces satisfy their real-time specifications. Inspired by the idea of real-time model checking, the test oracles can be automatically generated in a simpler way from their specifications in the real-time logic MITL_{[0,d]} and modelled by a variant of timed automata. Assertions are chosen to acquire the traces of real-time systems. A case study is presented to demonstrate the usefulness of the method proposed in this paper.
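To make the idea of a run-time oracle over timed traces concrete, here is a small hand-written sketch that checks a single bounded-response property: every trigger event must be followed by a response event within d time units. It is not the automatically generated, timed-automata-based oracle the abstract describes. The function name, the event names, the trace format (time-ordered `(timestamp, event)` pairs), and the choice to fail a trace that ends with an unanswered trigger are all assumptions.

```python
def bounded_response_ok(trace, trigger, response, d):
    """Check that each `trigger` is followed by `response` within d time units.

    Hypothetical monitor for illustration. `trace` is a list of
    (timestamp, event) pairs sorted by timestamp.
    """
    pending = []  # timestamps of triggers still waiting for a response
    for timestamp, event in trace:
        # The oldest waiting trigger has missed its deadline.
        if pending and timestamp - pending[0] > d:
            return False
        if event == response:
            pending.clear()  # one response answers every waiting trigger
        elif event == trigger:
            pending.append(timestamp)
    # Assumption: a trace that ends with an unanswered trigger fails.
    return not pending


if __name__ == "__main__":
    good = [(0, "req"), (3, "ack"), (10, "req"), (14, "ack")]
    late = [(0, "req"), (7, "ack")]
    print(bounded_response_ok(good, "req", "ack", d=5))
    print(bounded_response_ok(late, "req", "ack", d=5))
```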
This paper proposes a generic programmable array processor architecture for a wide variety of approximate string matching algorithms. We describe the architecture of the array and of the cell in detail, showing how they efficiently implement both the preprocessing and searching phases of most string matching algorithms. The architecture also performs approximate string matching for complex patterns that contain don't-care, complement, and class symbols. Our architecture maximizes the strength of VLSI in terms of intensive and pipelined computing and yet circumvents the limitation on communication. It may be adopted as a basic structure for a universal, flexible string matching engine.
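For readers unfamiliar with the pattern types mentioned above, the following is a minimal software sketch of approximate matching with don't-care, complement, and class symbols, using the standard column-by-column dynamic programming recurrence for search with at most k edit errors. It says nothing about the proposed array architecture. The symbol encoding (`("any", "")`, `("class", chars)`, `("not", chars)`) and the function names are assumptions made for illustration.

```python
def symbol_matches(symbol, ch):
    """Test one pattern symbol against one text character.

    Hypothetical encoding: ("any", "") is a don't-care, ("class", chars)
    accepts any character in chars, ("not", chars) accepts any character
    outside chars (complement).
    """
    kind, chars = symbol
    if kind == "any":
        return True
    if kind == "class":
        return ch in chars
    if kind == "not":
        return ch not in chars
    raise ValueError(f"unknown symbol kind: {kind}")


def approx_search(pattern, text, k):
    """Return 0-based end positions where pattern matches with <= k edits."""
    m = len(pattern)
    column = list(range(m + 1))  # column[i]: edits to match pattern[:i]
    hits = []
    for pos, ch in enumerate(text):
        diagonal = column[0]  # row 0 stays 0: a match may start anywhere
        for i in range(1, m + 1):
            previous = column[i]
            cost = 0 if symbol_matches(pattern[i - 1], ch) else 1
            column[i] = min(diagonal + cost,    # match or substitution
                            previous + 1,       # extra character in text
                            column[i - 1] + 1)  # pattern symbol skipped
            diagonal = previous
        if column[m] <= k:
            hits.append(pos)
    return hits


if __name__ == "__main__":
    # 'a', then any character, then a digit
    pattern = [("class", "a"), ("any", ""), ("class", "0123456789")]
    print(approx_search(pattern, "xxab7yy", k=0))
    print(approx_search(pattern, "xxab7yy", k=1))
```

Each cell of a column depends only on its neighbour in the same column and on two cells of the previous column, which is the kind of local dependency that makes such recurrences candidates for pipelined array implementations.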