ISBN: 0769509908 (print)
The growing gap in performance between processor and memory speeds has created a problem for data-intensive applications. A recent approach to solving this problem is processor-in-memory (PIM) technology, which integrates a processor on a DRAM memory chip and thereby increases the bandwidth between processor and memory. In this paper, we discuss a PIM-based multiprocessor system, the System Level Intelligent Intensive Computing (SLIIC) Quick Look (QL) board. This system includes eight COTS PIM chips and two FPGA chips that implement a flexible interconnect network. The performance of the SLIIC QL board is measured and analyzed for the distributed corner-turn application. We show that the current SLIIC QL board performs better on the distributed corner-turn application than a PowerPC-based multicomputer that consumes more power and occupies more area. This advantage, which can be achieved in a limited context, demonstrates that even limited COTS PIMs have some advantages for data-intensive computations.
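For readers unfamiliar with the benchmark, a corner turn is a matrix transpose, and in the distributed form each node starts with a block of rows and ends with a block of rows of the transposed matrix, which requires an all-to-all exchange. The following is a minimal single-process sketch of that data movement. It is not the SLIIC QL implementation; the function names, the node count, and the assumption that the matrix dimensions divide evenly by the node count are all illustrative.

```python
# Sketch only: simulates the data movement of a distributed corner turn
# (matrix transpose) across hypothetical nodes inside one process.
# Assumes the row and column counts are divisible by num_nodes.

def distribute_rows(matrix, num_nodes):
    """Split a matrix into contiguous row blocks, one block per node."""
    rows_per_node = len(matrix) // num_nodes
    return [matrix[i * rows_per_node:(i + 1) * rows_per_node]
            for i in range(num_nodes)]

def corner_turn(blocks, num_nodes):
    """All-to-all exchange: node dst ends up owning a row block of the transpose."""
    n_cols = len(blocks[0][0])
    cols_per_node = n_cols // num_nodes
    result = []
    for dst in range(num_nodes):
        col_lo = dst * cols_per_node
        col_hi = col_lo + cols_per_node
        out_block = []
        for c in range(col_lo, col_hi):
            # One output row is one input column, gathered from every source node.
            out_row = []
            for src in range(num_nodes):
                out_row.extend(row[c] for row in blocks[src])
            out_block.append(out_row)
        result.append(out_block)
    return result

matrix = [[r * 4 + c for c in range(4)] for r in range(4)]
blocks = distribute_rows(matrix, 2)
turned = corner_turn(blocks, 2)
```

Every destination node reads a slice from every source node, which is why the operation stresses memory and interconnect bandwidth rather than arithmetic throughput.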
ISBN: 9781424429325 (print)
While parallelism and multi-cores are receiving much attention as a major scalability path, customization is another, orthogonal and complementary, scalability path that can target programs or program sections that are not easily parallelizable. The key assets of customization are cost and power efficiency. The key limitation of customization is flexibility. However, we argue that there is no perfect balance between efficiency and flexibility; each system vendor may want to strike a different balance. In this article, we present a method for achieving any desired balance between flexibility and efficiency by automatically combining any set of individual customization circuits into a larger compound circuit. This circuit is significantly more cost efficient than the simple union of all target circuits, and is configurable to behave as any of the target circuits, while avoiding the routing and configuration cost overhead of FPGAs. The more individual circuits are included, the larger the number of applications that can potentially benefit from this compound customization circuit, realizing flexibility at a minimal cost. Moreover, we observe that the compound circuit cost does not increase in proportion to the number of target applications, due to the wide range of common data-flow and control-flow patterns in programs. Currently, the target individual circuits correspond to loops, like most accelerators in embedded systems, but the aggregation method can accommodate circuits of any size. Using the UTDSP benchmarks and accelerators coupled with an embedded PowerPC405 processor, we show that this approach can yield an average performance improvement of 2.97, while the corresponding synthesized aggregate accelerator is 3 times smaller than the sum of individual accelerators for each target benchmark.
Parallel sequence-search tools are rising in popularity among computational biologists. With the rapid growth of sequence databases, database segmentation is the trend of the future for such search tools. While I/O currently is not a significant bottleneck for parallel sequence-search tools, future technologies including faster processors, customized computational hardware such as FPGAs, improved search algorithms, and exponentially growing databases emphasize an increasing need for efficient parallel I/O in future parallel sequence-search tools. Our paper focuses on examining different I/O strategies for these future tools in a modern parallel file system (PVFS2). Because implementing and comparing various I/O algorithms in every search tool is labor-intensive and time-consuming, we introduce S3aSim, a general simulation framework for sequence search that allows us to quickly implement, test, and profile various I/O strategies. We examine a variety of I/O strategies (e.g., master-writing and various worker-writing strategies using individual and collective I/O methods) for storing result data in sequence-search tools such as mpiBLAST, pioBLAST, and parallel HMMer. Our experiments fully detail the interaction of computing and I/O within a full application simulation, as opposed to typical I/O-only benchmarks.
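The difference between master-writing and worker-writing can be illustrated with a small sketch. In the first, one process gathers all results and writes them; in the second, each worker writes its own results at an offset in the shared output. This is not code from S3aSim or any of the named tools: the function names are hypothetical, an in-memory buffer stands in for the shared file, and the offsets are assumed to come from an exclusive prefix sum over the workers' result sizes.

```python
from itertools import accumulate

def worker_write_offsets(result_sizes):
    """Exclusive prefix sum: the byte offset where each worker's results begin."""
    return list(accumulate(result_sizes, initial=0))[:-1]

def master_writing(results):
    """The master gathers every worker's results and performs one write."""
    return b"".join(results)

def worker_writing(results):
    """Each worker writes its own results at its precomputed offset."""
    sizes = [len(r) for r in results]
    shared_file = bytearray(sum(sizes))  # stand-in for a file on a parallel file system
    for data, offset in zip(results, worker_write_offsets(sizes)):
        shared_file[offset:offset + len(data)] = data
    return bytes(shared_file)

# Hypothetical per-worker result buffers; the third worker found no hits.
results = [b"hitA;", b"hitB;hitC;", b""]
offsets = worker_write_offsets([len(r) for r in results])
```

Both strategies produce the same output bytes. What differs is where the data travels: master-writing funnels everything through one process, while worker-writing spreads the writes across workers at the cost of coordinating offsets.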