As parallel programming becomes mainstream due to multicore processors, the dynamic memory allocators used in C and C++ can suppress the performance of multi-threaded applications if they are not scalable. In this paper, we present a new dynamic memory allocator for multi-threaded applications. The allocator never uses any synchronization in the common case and relies only on lock-free synchronization mechanisms in uncommon cases. Each thread owns a private heap and handles memory requests on that heap. Our allocator is completely synchronization-free when a thread allocates a memory block and deallocates it by itself; synchronization-free means that threads do not communicate with each other at all. On the other hand, if a thread allocates a block and another thread frees it, we use a lock-free stack to atomically return the block to the owner thread's heap and thereby avoid the memory blowup problem. Furthermore, our allocator exploits various memory block caching mechanisms to reduce the latency of memory management: freed blocks and intermediate memory chunks are cached hierarchically in each thread's heap and reused for future allocations. We compare the performance and scalability of our allocator to those of well-known existing multi-threaded memory allocators using eight benchmarks. Experimental results on a 48-core AMD system show that our approach achieves better performance than the other allocators for all benchmarks and is highly scalable with a large number of threads.
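A rough, illustrative C11 sketch of the scheme described in this abstract follows: a private per-thread heap whose local free list needs no synchronization, plus a lock-free stack that other threads push onto when they free a block they do not own. All names (thread_heap, local_alloc, local_free, remote_free) are assumptions for the example, not the paper's actual interfaces, and chunk management from the OS is omitted.

#include <stdatomic.h>
#include <stddef.h>
#include <threads.h>

typedef struct block {
    struct block *next;                  /* links blocks on free lists */
} block_t;

typedef struct thread_heap {
    block_t *local_free;                 /* touched only by the owner: no synchronization */
    _Atomic(block_t *) remote_free;      /* lock-free stack for blocks freed by other threads */
} thread_heap_t;

static thread_local thread_heap_t my_heap;   /* each thread's private heap */

/* Common case: the owning thread allocates with plain loads and stores. */
static void *local_alloc(thread_heap_t *h) {
    block_t *b = h->local_free;
    if (b) { h->local_free = b->next; return b; }
    /* Uncommon case: atomically reclaim the whole list of remotely freed blocks. */
    b = atomic_exchange_explicit(&h->remote_free, NULL, memory_order_acquire);
    if (!b) return NULL;                 /* a real allocator would fetch a fresh chunk here */
    h->local_free = b->next;
    return b;
}

/* Common case: the owning thread frees its own block, again without atomics. */
static void local_free(thread_heap_t *h, void *p) {
    block_t *b = p;
    b->next = h->local_free;
    h->local_free = b;
}

/* Uncommon case: a different thread frees the block; push it onto the owner's
 * lock-free stack so the memory returns to its heap and blowup is avoided. */
static void remote_free(thread_heap_t *owner, void *p) {
    block_t *b = p;
    block_t *head = atomic_load_explicit(&owner->remote_free, memory_order_relaxed);
    do {
        b->next = head;
    } while (!atomic_compare_exchange_weak_explicit(
                 &owner->remote_free, &head, b,
                 memory_order_release, memory_order_relaxed));
}

Because the owner drains the remote stack with a single atomic exchange rather than popping elements one at a time, the sketch sidesteps the ABA problem that a general lock-free pop would have to handle.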
Cache coherent Non-Uniform Memory Access (cc-NUMA) architectures have been widely used for chip multiprocessors (CMPs). However, they require complicated hardware to handle the cache coherence problem properly, and coherence enforcement generates heavy on-chip network traffic. In this work, we propose a simple software-managed coherent memory architecture for many cores. Our memory architecture exploits explicitly addressed local stores. Instead of implementing a complicated cache coherence protocol in hardware, coherence and consistency are supported by software, such as a runtime or an operating system. The local stores, together with the software, leverage conventional caches to make the architecture much simpler and to generate much less network traffic than conventional cc-NUMA-based CMPs. Experimental results indicate that our approach is promising.
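A minimal C sketch of the software-managed coherence pattern implied by this abstract: the runtime copies shared data into an explicitly addressed local store before a core uses it and writes it back at a release point, so no hardware coherence protocol is involved. The primitives ls_get and ls_put are hypothetical names for this example, and memcpy stands in for the DMA-style transfer a real local store would use.

#include <stddef.h>
#include <string.h>
#include <stdio.h>

#define LS_SIZE 1024
static _Alignas(max_align_t) char local_store[LS_SIZE];   /* per-core, explicitly addressed */

/* "Acquire": pull a shared region into the local store before use. */
static void *ls_get(const void *global_addr, size_t off, size_t len) {
    memcpy(local_store + off, global_addr, len);
    return local_store + off;
}

/* "Release": write modified local data back so other cores can see it. */
static void ls_put(void *global_addr, size_t off, size_t len) {
    memcpy(global_addr, local_store + off, len);
}

int main(void) {
    int shared[4] = {1, 2, 3, 4};        /* lives in shared (off-chip) memory */

    /* Software-managed coherence: copy in, compute on the local store, copy back. */
    int *local = ls_get(shared, 0, sizeof shared);
    for (int i = 0; i < 4; i++)
        local[i] *= 10;                  /* all accesses hit the local store */
    ls_put(shared, 0, sizeof shared);

    printf("%d %d %d %d\n", shared[0], shared[1], shared[2], shared[3]);
    return 0;
}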