ISBN: 9781450370417 (print)
Modern workloads tend to have huge working sets and low locality. Despite this trend, DRAM capacity has not grown enough to accommodate such working sets. Memory management mechanisms optimized for these workloads are therefore widely needed. For such optimizations, knowing the data access pattern of a given workload is essential. However, manually extracting such patterns from huge and complex workloads is exhausting. Worse yet, existing memory access analysis tools incur unacceptably high overheads for unnecessarily detailed analysis results. To mitigate this situation, we introduce a tool designed for data access pattern tracing. Two core mechanisms in this tool, region-based sampling and adaptive region adjustment, allow users to keep the tracing overhead within a bounded range regardless of the size and complexity of the target workloads, while preserving the quality of results. Our empirical evaluations, conducted with 20 realistic workloads, show the high quality, low overhead, and a potential use case of this tool.
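The abstract names the two mechanisms without describing them, so the following is only a minimal sketch of one plausible reading, not the authors' implementation. It assumes that each region is checked through a single randomly chosen page per sampling step (so cost scales with the number of regions, not with memory size), and that regions are merged when their access counts are similar and split again while the total stays under a cap. All names (Region, sample, merge_similar, split_all, trace) and all parameters are hypothetical.

```python
import random
from dataclasses import dataclass


@dataclass
class Region:
    start: int      # first page number (inclusive)
    end: int        # last page number (exclusive)
    hits: int = 0   # sampled accesses seen in the current interval


def sample(regions, was_accessed, rng):
    """Check one random page per region; cost is O(number of regions)."""
    for r in regions:
        page = rng.randrange(r.start, r.end)
        if was_accessed(page):
            r.hits += 1


def merge_similar(regions, threshold):
    """Merge adjacent regions whose hit counts differ by at most threshold."""
    merged = [regions[0]]
    for r in regions[1:]:
        last = merged[-1]
        if last.end == r.start and abs(last.hits - r.hits) <= threshold:
            size_last = last.end - last.start
            size_r = r.end - r.start
            # Size-weighted average keeps the merged count representative.
            last.hits = (last.hits * size_last + r.hits * size_r) // (size_last + size_r)
            last.end = r.end
        else:
            merged.append(r)
    return merged


def split_all(regions, max_regions):
    """Halve every region, but only if the total stays within max_regions."""
    if len(regions) * 2 > max_regions:
        return regions
    out = []
    for r in regions:
        if r.end - r.start < 2:
            out.append(r)
            continue
        mid = (r.start + r.end) // 2
        out.append(Region(r.start, mid, r.hits))
        out.append(Region(mid, r.end, r.hits))
    return out


def trace(num_pages, was_accessed, min_regions=4, max_regions=16,
          intervals=20, samples_per_interval=10, seed=0):
    """Return the last snapshot as a list of (start, end, hits) tuples."""
    rng = random.Random(seed)
    step = num_pages // min_regions
    regions = [Region(i * step, (i + 1) * step) for i in range(min_regions)]
    regions[-1].end = num_pages
    snapshot = []
    for _ in range(intervals):
        for r in regions:
            r.hits = 0
        for _ in range(samples_per_interval):
            sample(regions, was_accessed, rng)
        regions = merge_similar(regions, threshold=1)
        snapshot = [(r.start, r.end, r.hits) for r in regions]
        regions = split_all(regions, max_regions)
    return snapshot
```

In this sketch the number of regions never exceeds max_regions, because splitting is skipped when it would cross the cap and merging only reduces the count. That is how the per-interval sampling work stays bounded even when num_pages grows.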
ISBN: 9781457720642 (print)
Application address streams contain a wealth of information that can be used to characterize the behavior of applications. However, the collection and handling of address streams is complicated by their size and the cost of collecting them. We present PSnAP, a compression scheme specifically designed for capturing the fine-grained patterns that occur in well-structured, memory-intensive, high-performance computing applications. PSnAP profiles are human readable and reveal a great deal of information about the application's memory behavior. In addition to providing insight into application behavior, the profiles can be used to replay a proxy synthetic address stream for analysis. We demonstrate that the synthetic address streams mimic the behavior of the originals very closely.
ISBN: 9781450357616 (print)
As the speed gap between CPU and external memory widens, memory latency has become the dominant performance bottleneck in modern applications. Closely connected are caches, which play an important role in reducing the average memory latency. The way data is accessed strongly influences cache performance. Numerous multimedia algorithms operating on data such as images and videos perform processing over rectangular regions of pixels. If this and other data access patterns are properly exploited, significant performance improvements can be achieved. This paper proposes a prefetch-aware memory system that exploits 2D, stride, and sequential data access patterns in multimedia applications. It aims at reducing the average memory access latency, lowering the number of memory accesses, and utilizing the bandwidth efficiently. Our results reveal a significant average memory access time (AMAT) reduction of 21.2% in the evaluated workloads when the proposed approach is used effectively, compared to the baseline.
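To make the three pattern types concrete, here is a small sketch of how a 2D tile access over a row-major image shows up as a constant stride that a prefetcher could follow. It is an illustration only and not the memory system proposed in the paper. The function names, the row-major layout, and the pitch value used in the usage note are assumptions.

```python
def tile_row_addresses(base, pitch, x, y, height, elem_size=1):
    """Start address of each row of a tile whose top-left pixel is (x, y).

    Assumes a row-major image where consecutive rows are `pitch` bytes apart.
    """
    return [base + (y + row) * pitch + x * elem_size for row in range(height)]


def detect_stride(addresses):
    """Return the constant stride of an address sequence, or None.

    At least three addresses are required so that two equal deltas confirm
    the pattern. A sequential stream is the special case of a small stride.
    """
    if len(addresses) < 3:
        return None
    deltas = {b - a for a, b in zip(addresses, addresses[1:])}
    return deltas.pop() if len(deltas) == 1 else None


def next_prefetches(addresses, degree=2):
    """Predict the next `degree` addresses if a nonzero stride is detected."""
    stride = detect_stride(addresses)
    if stride is None or stride == 0:
        return []
    return [addresses[-1] + stride * k for k in range(1, degree + 1)]
```

For example, with an assumed pitch of 1920 bytes, tile_row_addresses(0, 1920, 8, 4, 3) yields row starts that are exactly one pitch apart, so detect_stride reports the pitch and next_prefetches proposes the following two tile rows.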
ISBN: 9781728124063 (print)
One common characteristic of modern workloads such as cloud, big data, and machine learning is memory intensiveness. Specifically, such workloads tend to have a huge working set and low locality. In particular, working sets are growing so rapidly that they cannot be fully accommodated by DRAM-based main memory. Worse yet, cloud computing systems, which have become pervasive over the past few decades, are continuously reducing the size of DRAM per CPU and encouraging memory overcommitment. Consequently, efficient and effective out-of-core memory management is becoming more important. Though a number of memory management mechanisms for such situations have been proposed, manual analysis and optimization are still required for optimal performance of each workload due to the wide variety of data access patterns. However, existing tools for memory access analysis are not appropriate here because they are not designed to extract the dynamic data access patterns of modern workloads. When those tools are used for this purpose, they incur unacceptably high overheads for unnecessarily accurate analysis results. To mitigate this situation, we introduce a tool designed for this purpose. The tool employs a memory access tracking technique based on the page table entry access bit, which incurs only minimal overhead. It also provides a technique for an effective tradeoff between profiling overhead and output accuracy by dynamically adjusting the number of tracking regions. By adopting this technique, the tool can keep the level of overhead and output accuracy within a user-specified bounded range regardless of the size of the target workloads. The overhead can even be lowered enough for use with online target workloads while still providing useful quality in the extracted data access pattern. The main contributions of this paper are: 1) introduction of a data access pattern profiler tool designed for modern memory-intensive workloads, and 2) empirica
A key to good processor utilization for sparse matrix computations is storing the data in the format that is most conducive to fast access by the memory system. In particular, for sparse matrix triangular solves, the traditional compressed sparse matrix format is poor, and minor adjustments to the data structure can increase the processor utilization dramatically. Such adjustments involve storing the L and U factors separately and storing the U rows 'backwards' so that they are accessed in a simple streaming fashion during the triangular solves. Changes to the PETSc libraries to use this modified storage format resulted in over twice the floating-point rate for some matrices. This improvement can be accounted for by a decrease in the cache misses and TLB (translation lookaside buffer) misses in the modified code.
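The idea of storing the U rows 'backwards' can be sketched in a few lines. This is an illustration of the general technique, not the PETSc data structure: the layout (last row first, diagonal entry leading each row) and the function names are assumptions. Back substitution needs row n-1 first, so placing that row at the front of the arrays lets the solve walk the column and value arrays strictly front to back.

```python
def pack_u_backwards(rows):
    """Pack an upper-triangular matrix with its last row stored first.

    rows[i] is a list of (col, value) pairs for row i with col >= i.
    Every row is assumed to contain its diagonal entry.
    Returns (cols, vals, ptr) where stored row k corresponds to matrix
    row n-1-k and the diagonal is the first entry of each stored row.
    """
    cols, vals, ptr = [], [], [0]
    for i in reversed(range(len(rows))):
        diag = [(c, v) for c, v in rows[i] if c == i]
        off = [(c, v) for c, v in rows[i] if c != i]
        for c, v in diag + off:
            cols.append(c)
            vals.append(v)
        ptr.append(len(cols))
    return cols, vals, ptr


def solve_upper(cols, vals, ptr, b):
    """Solve U x = b by back substitution, streaming forward through the arrays."""
    n = len(b)
    x = [0.0] * n
    for k in range(n):
        i = n - 1 - k                      # matrix row held in stored row k
        start, end = ptr[k], ptr[k + 1]
        s = b[i]
        for j in range(start + 1, end):    # off-diagonal entries of row i
            s -= vals[j] * x[cols[j]]
        x[i] = s / vals[start]             # diagonal is stored first
    return x
```

With a conventional row order the same solve would visit the arrays from the end toward the beginning, which is the access direction the abstract identifies as unfriendly to the memory system.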
As the performance gap between processors and storage devices keeps increasing, I/O performance becomes a critical bottleneck of modern high-performance computing systems. In this paper, we propose a pattern-directed and layout-aware data replication design, named PDLA, to improve the performance of parallel I/O systems. PDLA includes an HDD-based scheme, H-PDLA, and an SSD-based scheme, S-PDLA. For applications with relatively low I/O concurrency, H-PDLA identifies the access patterns of applications and makes a reorganized data replica for each access pattern on HDD-based servers with an optimized data layout. Moreover, to accommodate applications with high I/O concurrency, S-PDLA replicates critical access patterns that can bring performance benefits on SSD-based servers or on HDD-based and SSD-based servers. We have implemented the proposed replication scheme under the MPICH2 library on top of the OrangeFS file system. Experimental results show that H-PDLA can significantly improve the original parallel I/O system performance and demonstrate the advantages of S-PDLA over H-PDLA.
Data access latency, a limiting factor in the performance of chip multiprocessors, grows significantly with the number of cores in nonuniform cache architectures with distributed cache banks. To mitigate this effect, we use a compiler-based approach to leverage data access locality, choose an optimized data placement, and efficiently configure the on-chip network. The proposed experimental compiler framework employs novel compilation techniques to discover and represent multithreaded memory access patterns (MMAPs). At runtime, symbolic MMAPs are resolved and used by a partitioning algorithm to choose a partition of allocated memory blocks among the forked threads in the analyzed application. This partition is used to enforce data ownership by associating the data with the core that executes the thread owning the data. Based on the partition, the communication pattern of the application can be extracted. We demonstrate how this information can be used in an experimental architecture to accelerate applications. In particular, our compiler-assisted data partitioning approach shows a 20 percent speedup over shared caching and a 5 percent speedup over the closest runtime approximation, first touch. By leveraging the communication pattern, we can achieve performance comparable to a system that uses a complex centralized network configuration system at runtime. Thus, our final system saves significant runtime complexity and achieves a 5.1 percent additional speedup through the addition of the reconfigurable network.
Energy efficiency has become one of the most important challenges in designing future computing systems, and the storage system is one of the largest energy consumers within them. This paper proposes an Energy Efficient Disk (EED) drive architecture which integrates a relatively small NAND flash memory into a traditional disk drive to explore the impact of the flash memory on the performance and energy consumption of the disk. The EED monitors data access patterns and moves the frequently accessed data from the magnetic disk to the flash memory. Due to the data migration, most of the data accesses can be satisfied by the flash memory, which extends the idle period of the disk drive and enables the disk drive to stay in a low power state for an extended period of time. Because flash memory consumes considerably less energy than a magnetic disk and its read access is much faster, the EED can save significant amounts of energy while reducing the average response time. Real trace-driven simulations are employed to validate the proposed disk drive architecture. An energy coefficient, which is the product of the average response time and the average energy consumption, is proposed as a performance metric to measure the EED. The simulation results, along with the energy coefficient, show that the EED can achieve an 89.11% energy consumption reduction and a 2.04% average response time reduction with the cello99 trace, a 7.5% energy consumption reduction and a 45.15% average response time reduction with the cello96 trace, and a 20.06% energy consumption reduction and a 6.02% average response time reduction with the TPC-D trace. Traditionally, energy conservation and performance improvement are contradictory. The EED strikes a good balance between conserving energy and improving performance. (c) 2008 Elsevier Inc. All rights reserved.
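Since the energy coefficient is defined as a product, the two reported reductions for each trace can be combined into a single relative figure. The sketch below assumes that both reductions are measured against the same baseline and that the coefficient is computed from the averaged quantities. The function name and that interpretation are assumptions, and the paper may compute or report the metric differently.

```python
def relative_energy_coefficient(energy_reduction, response_time_reduction):
    """Energy coefficient of the EED relative to the baseline disk.

    Both arguments are fractional reductions in [0, 1), for example 0.075
    for 7.5%. Because the coefficient is the product of average response
    time and average energy consumption, the ratio to the baseline is
    (1 - energy_reduction) * (1 - response_time_reduction).
    A result below 1.0 means the combined metric improved.
    """
    return (1.0 - energy_reduction) * (1.0 - response_time_reduction)


# Reductions quoted in the abstract, keyed by trace name.
reported = {
    "cello99": (0.8911, 0.0204),
    "cello96": (0.0750, 0.4515),
    "TPC-D": (0.2006, 0.0602),
}

for trace_name, (energy, latency) in reported.items():
    ratio = relative_energy_coefficient(energy, latency)
    print(f"{trace_name}: relative energy coefficient = {ratio:.4f}")
```

A product metric of this kind rewards a design that improves either quantity without degrading the other, which matches the abstract's framing of balancing energy conservation against performance.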
Due to the widening performance gap between RAM and disk drives, a large number of I/O optimization methods have been proposed and designed to alleviate the impact of this gap. One of the most effective approaches to improving disk access performance is enhancing data locality, because it can increase the hit ratio of the disk cache and reduce the seek time and rotational latency. Disk drives have experienced dramatic development since the first disk drive was announced in 1956. This paper investigates some important characteristics of modern disk drives. Based on these characteristics and the observation that data access on disk drives is highly skewed, the frequently accessed data blocks and the correlated data blocks are clustered into objects and moved to the outer zones of a modern disk drive. The idea attempts to enhance spatial locality, improve the efficiency of aggressive sequential prefetch, and take advantage of Zoned Bit Recording (ZBR). An experimental simulation is employed to investigate the performance gains generated by the enhanced data locality. The performance gains are analyzed by breaking down the disk access time into seek time, rotational latency, data transfer time, and hit ratio of the disk cache. Experimental results provide useful insights into the performance behaviours of a modern disk drive with enhanced data locality. (C) 2009 Elsevier Inc. All rights reserved.
This column completes a two-part exploration into features of application programming interfaces (APIs) that are useful in clouds. The discussion contrasts APIs with other types of interfaces and describes variations on protocols and calling methods, giving examples from physical hardware control to illustrate important features of cloud API design.