We have developed an HPF (High Performance Fortran) language processor for the SX-4 series, aimed at distributed memory multiprocessor systems. HPF is a de facto standard data-parallel language aimed mainly at such systems. HPF allows users to develop parallel programs by specifying only how data are mapped onto processors. The HPF compiler partitions the computation among processors based on the specified mapping information and generates the necessary data transfers. Therefore, both how computation is mapped onto processors and how high-speed data transfer is achieved are important for an efficient implementation of an HPF compiler. This paper describes the automatic parallelization and data transfer technology in NEC's HPF language processor. It also discusses the use of shared memory parallelization and vectorization on the SX-4 and SX-5 series.
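To make concrete the idea that a data mapping determines both the partitioning of computation and the required transfers, here is a minimal Python sketch. It assumes a one-dimensional BLOCK distribution and an owner-computes rule applied to the statement A(i) = B(i+1); the function names `block_owner`, `local_iterations`, and `shift_transfers` are hypothetical and are not taken from NEC's HPF processor.

```python
def block_size(n, p):
    """Block size for n elements BLOCK-distributed over p processors: ceil(n / p)."""
    return -(-n // p)


def block_owner(i, n, p):
    """Processor that owns global index i (0-based) under the assumed BLOCK mapping."""
    return i // block_size(n, p)


def local_iterations(rank, n, p):
    """Global indices owned by `rank`; under owner-computes these are its iterations."""
    b = block_size(n, p)
    lo = rank * b
    hi = min(n, lo + b)
    return range(lo, max(lo, hi))  # empty when the processor owns nothing


def shift_transfers(n, p):
    """Transfers needed for A(i) = B(i+1), i = 0 .. n-2, with A and B mapped alike.

    Returns (source, destination, global index of B) triples for every
    right-hand-side element that lives on a different processor than A(i).
    """
    transfers = []
    for rank in range(p):
        for i in local_iterations(rank, n, p):
            if i + 1 >= n:
                continue  # A(n-1) is not assigned by this statement
            src = block_owner(i + 1, n, p)
            if src != rank:
                transfers.append((src, rank, i + 1))
    return transfers


if __name__ == "__main__":
    for src, dst, g in shift_transfers(10, 4):
        print(f"processor {src} sends B[{g}] to processor {dst}")
```

In this sketch only block-boundary elements cross processors, which is why a compiler can often replace per-element transfers with a single message per neighbour.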
An important research topic for parallelizing compilers is how to generate local memory access sequences and communication sets when compiling a data-parallel language into an SPMD (Single Program Multiple Data) program. In this paper, we present a scheme to efficiently enumerate local memory access sequences and to evaluate communication sets. We use a class table to store information extracted from array sections and data distribution patterns. Given array references and data distributions, we can use the class table to generate communication sets in closed form. Furthermore, we derive algorithms for sending and receiving the necessary data between processors. An algorithm for generating the class table is presented; its time complexity is O(s), where s is the array section stride. The technique for generating communication sets for one index variable has been implemented on a DEC Alpha 3000 workstation. The experimental results confirm the advantage of our scheme, especially when the array section stride is larger than the block size. Finally, we adapt our approach to handle array references with multiple index variables. The time complexity for constructing the whole class table is O(s^2).
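For readers unfamiliar with the term, the following Python sketch shows what a local memory access sequence is. It is a brute-force reference enumerator, not the class table algorithm of the paper: it assumes a one-dimensional array with a block-cyclic distribution of block size k over p processors, 0-based indices, and an array section l:u:s with an inclusive upper bound. The names `owner`, `local_address`, and `local_access_sequence` are hypothetical.

```python
def owner(g, p, k):
    """Processor owning global index g under the assumed CYCLIC(k) mapping."""
    return (g // k) % p


def local_address(g, p, k):
    """Offset of global index g in its owner's local memory.

    Each full cycle of p*k global elements contributes k local elements.
    """
    return (g // (p * k)) * k + g % k


def local_access_sequence(m, l, u, s, p, k):
    """Local addresses touched by processor m for the section l:u:s (u inclusive).

    Scans every section element, so the cost grows with the section length;
    table-driven schemes aim to avoid exactly this scan.
    """
    return [local_address(g, p, k)
            for g in range(l, u + 1, s)
            if owner(g, p, k) == m]


if __name__ == "__main__":
    # Section 0:30:3 on two processors with block size 4.
    for m in range(2):
        print(m, local_access_sequence(m, 0, 30, 3, p=2, k=4))
```

The gaps between successive local addresses repeat periodically, and it is this regularity that closed-form and table-based methods exploit instead of testing each element.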