ISBN: (print) 9780769549248; 9781467348195
Traditional alias analysis is expensive and ineffective for dynamic optimizations. In practice, dynamic optimization systems perform memory optimizations speculatively and rely on hardware, such as alias registers, to detect memory aliases at runtime. Existing hardware alias detection schemes either cannot scale up to a large number of alias registers or may introduce false positives. Order-based alias detection overcomes these limitations. However, it poses considerable challenges: software must manage the alias register queue efficiently, and the scheme imposes restrictions on optimizations. In this paper, we present SMARQ, a Software-Managed Alias Register Queue, which manages the alias register queue efficiently and supports more aggressive speculative optimizations. We conducted experiments with a dynamic optimization system on a VLIW processor that has 64 alias registers. The experiments on a suite of SPECFP2000 benchmarks show that SMARQ improves overall performance by 39% compared to the case without hardware alias detection. By scaling up to a large number of alias registers (from 16 to 64), SMARQ improves performance by 10%. Compared to a technique with false positives (similar to Itanium), SMARQ improves performance by 13%. To reduce the chance of alias register overflow, the novel alias register allocation algorithm in SMARQ reduces the alias register working set by 74% compared to a straightforward alias register allocation based on program order.
ISBN: (print) 9781467355247
A hardware/software co-designed processor transparently supports a ubiquitous ISA (e.g. x86) with diversified and innovative microarchitectural implementations. It leverages co-designed HW features and dynamic binary translation (DBT) SW to morph existing binary programs to scale performance and save power. On such systems, the portable bytecode of modern dynamic languages (e.g. Java, JavaScript) is first translated into architecture ISA code by just-in-time (JIT) compilation in the bytecode virtual machine, and then into internal implementation ISA code by the DBT. This not only incurs the translation overhead twice, but also causes significant emulation inefficiency, because the DBT does not have the high-level bytecode information. In this paper, we present AccelDroid, which accelerates Android Dalvik bytecode execution on the HW/SW co-designed processor through direct bytecode translation in the DBT. Our experiments on a HW/SW co-designed Transmeta Efficeon machine show that AccelDroid can improve performance by 78% and save energy by 40% for the CaffeineMark 3.0 benchmark suite.
It is already established that, going forward, the roughly 2x/2yr performance improvements delivered over the last two decades will come primarily through parallelism rather than increasing clock frequencies, due to the associated power challenges. Provided software and tools continue to scale well with core and thread count, large core counts bring serious challenges in both the memory hierarchy and interconnect bandwidth on-die, within the package, and off-package. Simulations of anticipated future workloads help isolate where specific bottlenecks are likely to occur. New technologies in both die stacking and package-to-package interconnects will be required. These solutions will bring dramatic changes in the physical layer that may well break backward compatibility. Furthermore, these potential approaches are segment-specific and involve complex tradeoffs among performance, cost, and power. This presentation will explore several approaches, highlighting potential solutions and bandwidth requirements driven by likely future applications.
Program optimization on multi-core systems must preserve the program's memory consistency. This paper studies TSO-preserving binary optimization. We introduce a novel approach to formally model TSO-preserving binary optimization based on the formal TSO memory model. The major contribution of the modeling is a sound and complete algorithm to verify TSO-preserving binary optimization with O(N^2) complexity. We also developed a dynamic binary optimization system to evaluate the performance impact of TSO-preserving optimization. Our experiments show that dynamic binary optimization without memory optimizations can improve performance by 8.1%. TSO-preserving optimizations can further improve performance by 4.8%, for a total of 12.9%. Without the restrictions required for TSO preservation, dynamic binary optimization can improve overall performance by 20.4%.
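To make the idea of checking a reordering against TSO concrete, here is a minimal Python sketch. It is not the verification algorithm from the paper above; it is a simplified pairwise check built only on the well-known property that TSO relaxes just one ordering, an earlier store followed by a later load to a different address. The names `may_reorder_under_tso` and `preserves_tso` are hypothetical, and the model ignores fences, atomic operations, and access elimination.

```python
from itertools import combinations

def may_reorder_under_tso(first, second):
    """Return True if `second` may move ahead of `first` under TSO.

    Each op is a (kind, address) tuple with kind "load" or "store", and
    `first` precedes `second` in program order. Simplifying assumption:
    only store -> load to a different address may be reordered.
    """
    (kind1, addr1), (kind2, addr2) = first, second
    return kind1 == "store" and kind2 == "load" and addr1 != addr2

def preserves_tso(original, optimized_order):
    """Check a reordering pairwise, inspecting O(N^2) pairs.

    `original` is the list of memory ops in program order.
    `optimized_order` lists indices into `original` in their new order.
    """
    position = {idx: pos for pos, idx in enumerate(optimized_order)}
    for i, j in combinations(range(len(original)), 2):
        # i precedes j in program order; check pairs whose order was inverted.
        if position[i] > position[j]:
            if not may_reorder_under_tso(original[i], original[j]):
                return False
    return True

ops = [("store", 0x10), ("load", 0x20), ("load", 0x30)]
hoist_load_ok = preserves_tso(ops, [1, 0, 2])    # load hoisted above a store
swap_loads_ok = preserves_tso(ops, [0, 2, 1])    # two loads swapped
```

Under these assumptions, hoisting the load at `0x20` above the store at `0x10` is accepted, while swapping the two loads is rejected because TSO keeps load-to-load order.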
Bioinformatics applications constitute an emerging data-intensive, high-performance computing (HPC) domain. While there is much research on algorithmic improvements, the actual performance of an application also depends on how well the program maps to the target hardware. This paper presents a performance study of two parallel bioinformatics applications, HMMER (sequence alignment) and SVM-RFE (gene expression analysis), on Intel x86-based hyperthread-capable shared-memory multiprocessor systems. The performance characteristics varied according to the application and the target hardware. For instance, HMMER is compute-intensive and showed better scalability on a 3.0 GHz system than on a 2.2 GHz system. SVM-RFE, however, is memory-intensive and showed better absolute performance on the 2.2 GHz machine, which has better memory bandwidth. Performance is also affected by processor features, e.g. hyperthreading (HT) and prefetching. With HMMER, enabling HT delivered about 75% of the performance gained by doubling the number of CPUs. While load-balancing optimizations can provide a speedup of about 30% for HMMER on a hyperthreading-enabled system, the load balancing has to adapt to the target number of processors and threads. SVM-RFE benefits differently from the same load-balancing and thread-scheduling tuning. We conclude that compiler and runtime optimizations play an important role in achieving the best performance for a given bioinformatics algorithm.
Analyzing static snapshots of massive, graph-structured data cannot keep pace with the growth of social networks, financial transactions, and other valuable data sources. We introduce a framework, STING (Spatio-Temporal Interaction Networks and Graphs), and evaluate its performance on multicore, multisocket Intel®-based platforms. STING achieves rates of around 100 000 edge updates per second on large, dynamic graphs with a single, general data structure. We achieve speedups of up to 1000× over parallel static computation, improve monitoring of a dynamic graph's connected components, and show that an exact algorithm for maintaining local clustering coefficients performs better on Intel-based platforms than our earlier approximate algorithm.
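As an illustration of what exactly maintaining local clustering coefficients under edge updates can look like, the following Python sketch updates per-vertex triangle counts incrementally instead of recomputing them from a static snapshot. It is a sequential toy under stated assumptions (undirected simple graph, in-memory adjacency sets), not STING's data structure or its parallel algorithm; the class name `ClusteringTracker` and its methods are hypothetical.

```python
from collections import defaultdict

class ClusteringTracker:
    """Maintain exact local clustering coefficients under edge updates."""

    def __init__(self):
        self.adj = defaultdict(set)        # vertex -> set of neighbours
        self.triangles = defaultdict(int)  # vertex -> triangles through it

    def insert_edge(self, u, v):
        if u == v or v in self.adj[u]:
            return  # ignore self-loops and duplicate edges
        # Every common neighbour w closes one new triangle (u, v, w).
        common = self.adj[u] & self.adj[v]
        for w in common:
            self.triangles[w] += 1
        self.triangles[u] += len(common)
        self.triangles[v] += len(common)
        self.adj[u].add(v)
        self.adj[v].add(u)

    def remove_edge(self, u, v):
        if v not in self.adj[u]:
            return  # edge not present
        self.adj[u].discard(v)
        self.adj[v].discard(u)
        # Every common neighbour w loses the triangle (u, v, w).
        common = self.adj[u] & self.adj[v]
        for w in common:
            self.triangles[w] -= 1
        self.triangles[u] -= len(common)
        self.triangles[v] -= len(common)

    def coefficient(self, v):
        # Local clustering coefficient: 2 * T(v) / (d(v) * (d(v) - 1)).
        d = len(self.adj[v])
        if d < 2:
            return 0.0
        return 2.0 * self.triangles[v] / (d * (d - 1))

g = ClusteringTracker()
for a, b in [("a", "b"), ("b", "c"), ("a", "c")]:
    g.insert_edge(a, b)
coeff_in_triangle = g.coefficient("a")
```

Each update touches only the two endpoints and their common neighbours, which is why an exact incremental scheme can compete with recomputation on a changing graph.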