The behavior of a Boltzmann Machine (BM) according to changes in the parameters that determine its convergence is experimentally analyzed to find a way to accelerate the convergence towards a solution for the given op...
Aggregating the capacity and bandwidth of the commodity disks in the nodes of a cluster provides cost effective and high performance storage systems. Nevertheless, this strategy could be a feasible approach only if th...
Combining optical and acoustic sensors to balance the strengths and weaknesses of each sensor modality is a topic of increasing interest in applications involving autonomous underwater vehicles (AUVs). In this ...
Engineering and scientific applications usually need to manage large quantities of data with different programs. The I/O demands of these applications get higher as they get larger because processor and memory spee...
This paper presents a proposal for a fast on-line map analysis for the RTS game Planet Wars in order to define specialized strategies for an autonomous bot. This analysis is used to tackle two constraints of the game,...
This paper presents an approach to the evolution of the cooperative behaviour of some bots inside the PC game Unreal. We intend to create bots that cooperate as a team trying to beat other teams (composed of human pla...
Equipped with 512-bit wide SIMD instructions and large numbers of computing cores, the emerging x86-based Intel(R) Many Integrated Core (MIC) architecture provides not only high floating-point performance, but also substantial off-chip memory bandwidth. The 3D FFT (three-dimensional fast Fourier transform) is a widely-studied algorithm; however, the conventional algorithm needs to traverse the data array three times. In each pass, it computes multiple 1D FFTs along one of three dimensions, giving rise to plenty of strided memory accesses. In this paper, we propose a two-pass 3D FFT algorithm, which mainly aims to reduce the amount of explicit data transfer between the memory and the on-chip cache. The main idea is to split one dimension into two sub-dimensions, and then combine the transform along each sub-dimension with one of the rest dimensions respectively. The difference in amount of TLB misses resulting from decomposition along different dimensions is analyzed in detail. Multi-level parallelism is leveraged on the many-core system for a high degree of parallelism and better data reuse of local cache. On top of this, a number of optimization techniques, such as memory padding, loop transformation and vectorization, are employed in our implementation to further enhance the performance. We evaluate the algorithm on the Intel(R) Xeon Phi(TM) coprocessor 7110P, and achieve a maximum performance of 136 Gflops with 240 threads in offload mode, which outperforms the vendor-specific Intel(R) MKL library by a factor of up to 2.22X.
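The conventional three-pass scheme that the abstract above takes as its baseline can be sketched as follows. This is only an illustration of the baseline, not the proposed two-pass algorithm: the function names (`dft_1d`, `fft3d_three_pass`, `dft3d_direct`) are hypothetical, a naive O(n^2) DFT stands in for a real 1D FFT, and nested Python lists stand in for a contiguous array whose last index is the unit-stride one.

```python
import cmath


def dft_1d(x):
    """Naive 1D DFT, used here as a stand-in for a real 1D FFT routine."""
    n = len(x)
    return [
        sum(x[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
        for k in range(n)
    ]


def fft3d_three_pass(a):
    """3D transform as three passes of 1D transforms, one pass per dimension."""
    nx, ny, nz = len(a), len(a[0]), len(a[0][0])
    out = [[[complex(v) for v in row] for row in plane] for plane in a]

    # Pass 1: transform along z (the last index, unit stride in a row-major array).
    for i in range(nx):
        for j in range(ny):
            out[i][j] = dft_1d(out[i][j])

    # Pass 2: transform along y (elements are nz apart in a row-major array).
    for i in range(nx):
        for k in range(nz):
            col = dft_1d([out[i][j][k] for j in range(ny)])
            for j in range(ny):
                out[i][j][k] = col[j]

    # Pass 3: transform along x (elements are ny * nz apart in a row-major array).
    for j in range(ny):
        for k in range(nz):
            col = dft_1d([out[i][j][k] for i in range(nx)])
            for i in range(nx):
                out[i][j][k] = col[i]

    return out


def dft3d_direct(a):
    """Direct evaluation of the 3D DFT definition, used only as a reference."""
    nx, ny, nz = len(a), len(a[0]), len(a[0][0])
    w = lambda idx, freq, n: cmath.exp(-2j * cmath.pi * idx * freq / n)
    return [
        [
            [
                sum(
                    a[i][j][k] * w(i, u, nx) * w(j, v, ny) * w(k, t, nz)
                    for i in range(nx)
                    for j in range(ny)
                    for k in range(nz)
                )
                for t in range(nz)
            ]
            for v in range(ny)
        ]
        for u in range(nx)
    ]
```

Passes 2 and 3 gather elements that are far apart in memory, which is the strided access pattern the abstract identifies as the cost of traversing the data three times.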
Batcher (1968) presented several parallel merging algorithms. A new direct algorithm is presented that differs from Batcher's and has better overall properties in terms of simplicity, regularity, symmetry, and generality. A theorem is presented stating that if a general merging algorithm merges all elements of duodirun(N), then it also merges the array formed by cascading two sorted arrays. The theorem is easily confirmed by the zero-one principle. Based on the theorem, the DuoDirun Merging Algorithm (DDMA) can be composed. The purpose of DDMA is to transform arrays in the duodirun structure.
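As a rough illustration of how the zero-one principle is used to check a merging network, the sketch below builds Batcher's classic odd-even merge network (not DDMA, whose construction the abstract does not give) and verifies it on every 0/1 input made of two sorted halves. The helper names are hypothetical, and `n` is assumed to be a power of two.

```python
def oddeven_merge(lo, hi, r):
    """Yield comparators (i, j) of Batcher's odd-even merge on indices lo..hi
    (inclusive), taking every r-th element; both halves must already be sorted."""
    step = r * 2
    if step < hi - lo:
        yield from oddeven_merge(lo, hi, step)       # merge the even subsequence
        yield from oddeven_merge(lo + r, hi, step)   # merge the odd subsequence
        for i in range(lo + r, hi - r, step):        # final clean-up comparators
            yield (i, i + r)
    else:
        yield (lo, lo + r)


def apply_network(values, comparators):
    """Run a fixed comparator sequence; the order of comparisons never depends on the data."""
    out = list(values)
    for i, j in comparators:
        if out[i] > out[j]:
            out[i], out[j] = out[j], out[i]
    return out


def merges_all_zero_one_inputs(n):
    """Zero-one check: try every input of the form (sorted 0/1 half) + (sorted 0/1 half)."""
    half = n // 2
    comparators = list(oddeven_merge(0, n - 1, 1))
    for zeros_left in range(half + 1):
        for zeros_right in range(half + 1):
            left = [0] * zeros_left + [1] * (half - zeros_left)
            right = [0] * zeros_right + [1] * (half - zeros_right)
            merged = apply_network(left + right, comparators)
            if merged != sorted(merged):
                return False
    return True


def merge_with_network(left, right):
    """Merge two equal-length sorted lists whose combined length is a power of two."""
    n = len(left) + len(right)
    return apply_network(list(left) + list(right), oddeven_merge(0, n - 1, 1))
```

Because the comparator sequence is fixed in advance, passing the 0/1 check is enough to conclude that the network merges arbitrary sorted halves; only (n/2 + 1)^2 inputs need to be tried.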
E-health and e-monitoring have become increasingly important areas in recent years, with the recognition of motion, postures and physical exercises among the main topics. In this kind of problem it is common to w...
Growing wire delays will force substantive changes in the designs of large caches. Traditional cache architectures assume that each level in the cache hierarchy has a single, uniform access time. Increases in on-chip communication delays will make the hit time of large on-chip caches a function of a line's physical location within the cache. Consequently, cache access times will become a continuum of latencies rather than a single discrete latency. This nonuniformity can be exploited to provide faster access to cache lines in the portions of the cache that reside closer to the processor. In this paper, we evaluate a series of cache designs that provide fast hits to multi-megabyte cache memories. We first propose physical designs for these Non-Uniform Cache Architectures (NUCAs). We extend these physical designs with logical policies that allow important data to migrate toward the processor within the same level of the cache. We show that, for multi-megabyte level-two caches, an adaptive, dynamic NUCA design achieves 1.5 times the IPC of a Uniform Cache Architecture of any size, outperforms the best static NUCA scheme by 11%, outperforms the best three-level hierarchy by 13% while using less silicon area, and comes within 13% of an ideal, minimal hit latency solution. Copyright 2002 ACM.