ISBN: 9781509008056 (print)
We have been developing a multiprocessor architecture that executes loop iterations speculatively in parallel. In this paper, we present speculative memory (SM), which enables large-scale speculation: the speculative execution of iterations of arbitrary size and duration. With SM, a programmer can explicitly hint that the iterations of a given loop should preferably be executed speculatively in parallel. SM manages multiple values (versions) of speculatively modified data. SM also features memory renaming and delayed execution of program code, which can be viewed as a form of dynamic code migration. These features can remove dependencies between loop iterations or reduce the occurrence of dependency hazards. SM can thus improve the success rate of speculation and, consequently, make it possible to extract more thread-level parallelism than before.
The fork-join framework project is one of the more challenging programming assignments in the computer science curriculum at Virginia Tech. Students in Computer Systems must manage a pool of threads to facilitate the shared execution of dynamically created tasks. This project is difficult because students must overcome the challenges of concurrent programming and conform to the project’s specific semantic requirements.
When working on the project, many students received inconsistent test results and were left confused when debugging. The suggested debugging tool, Helgrind, is a general-purpose thread error detector. It is limited in its ability to help fix bugs because it lacks knowledge of the specific semantic requirements of the fork-join framework. Thus, there is a need for a special-purpose tool tailored for this project.
We implemented Willgrind, a debugging tool that checks the behavior of fork-join frameworks implemented by students through dynamic program analysis. Using the Valgrind framework for instrumentation, checking statements are inserted into the code to detect deadlock, ordering violations, and semantic violations at run-time. Additionally, we extended Willgrind with happens-before based checking in WillgrindPlus. This tool checks for ordering violations that do not manifest themselves in a given execution but could in others.
In a user study, we provided the tools to 85 students in the Spring 2017 semester and collected over 2,000 submissions. The results indicate that the tools are effective at identifying bugs and useful for fixing bugs. This research makes multithreaded programming easier for students and demonstrates that special-purpose debugging tools can be beneficial in computer science education.
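The happens-before based checking mentioned above can be illustrated with vector clocks. The abstract does not say how WillgrindPlus implements it, so the following is only a minimal sketch under that assumption; the class `VectorClock`, the helper `concurrent`, and the thread names are hypothetical. It shows how a read of a task result that is not ordered after the worker's write can be flagged even if the execution happened to produce the right answer.

```python
class VectorClock:
    """Hypothetical vector clock: maps a thread name to a logical counter."""

    def __init__(self, clocks=None):
        self.clocks = dict(clocks or {})

    def tick(self, tid):
        # A local event on thread `tid` advances its own component.
        self.clocks[tid] = self.clocks.get(tid, 0) + 1

    def join(self, other):
        # A synchronization edge (e.g. task hand-off or join) merges clocks.
        for tid, count in other.clocks.items():
            self.clocks[tid] = max(self.clocks.get(tid, 0), count)

    def copy(self):
        return VectorClock(self.clocks)

    def happens_before(self, other):
        # True if self <= other in every component and differs in at least one.
        keys = set(self.clocks) | set(other.clocks)
        le = all(self.clocks.get(k, 0) <= other.clocks.get(k, 0) for k in keys)
        lt = any(self.clocks.get(k, 0) < other.clocks.get(k, 0) for k in keys)
        return le and lt


def concurrent(a, b):
    """Two events are unordered if neither happens before the other."""
    return not a.happens_before(b) and not b.happens_before(a)


main, worker = VectorClock(), VectorClock()

main.tick("main")               # main submits a task
worker.join(main)               # worker picks it up: submit -> run edge
worker.tick("worker")
write_result = worker.copy()    # worker writes the task result

main.tick("main")
read_without_join = main.copy() # main reads the result without joining

main.join(worker)               # main joins the task: run -> join edge
main.tick("main")
read_after_join = main.copy()   # main reads the result after joining

print(concurrent(write_result, read_without_join))   # unordered: potential violation
print(write_result.happens_before(read_after_join))  # ordered: safe
```

The unsynchronized read is reported as concurrent with the write regardless of how the threads were actually interleaved, which is the property that lets a happens-before checker catch violations that do not manifest in a given run.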
ThreadMentor is a multiplatform pedagogical tool designed to ease the difficulty of teaching and learning multithreaded programming. It consists of a C++ class library and a visualization system. The class library supports many thread management functions and synchronization primitives in an object-oriented way, and the visualization system is activated automatically by a user program and shows the inner workings of every thread and every synchronization primitive on the fly. Events can also be saved for playback. In this way, students can visualize the dynamic behavior of a threaded program and the interaction among threads and synchronization primitives.
ISBN: 9780897919845 (print)
The Tera Multithreaded Architecture (MTA) is a radical new architecture intended to revolutionize high-performance computing in both the scientific and commercial marketplaces. Each processor supports 128 threads in hardware. Extremely fast thread switching is used to mask latency in a uniform-access memory system without caching. It is claimed that these hardware characteristics allow compilers to easily transform sequential programs into efficient multithreaded programs for the Tera MTA. In this paper, we attempt to provide an objective initial evaluation of the performance of the Tera multithreaded architecture and programming system for general-purpose applications. The basis of our investigation is two programs from the C3I Parallel Benchmark Suite (C3IPBS). Both of these programs have previously been shown to have the potential for large-scale parallelization. We compare the performance of these programs on (i) a fast uniprocessor, (ii) two conventional shared-memory multiprocessors, and (iii) the first installed Tera MTA (at the San Diego Supercomputer Center). On these platforms, we compare the effectiveness of both automatic and manual parallelization.
This paper considers incorporating a bound-consistency enforcing procedure into an interval branch-and-prune method. A heuristic for deciding when to use the developed operator is proposed. As enforcing bound-consistency is much more time-consuming than applying other narrowing tools, we parallelize the procedure using Intel TBB. Several parallelization variants are considered. This is also a good opportunity for a case study of the performance of the various lock implementations provided in the TBB package. Numerical results for typical benchmark problems are presented and analyzed. A specific lock variant suited to the application is proposed. Performance on two architectures is considered: Intel Xeon and Intel Xeon Phi (MIC). (C) 2017 Elsevier Inc. All rights reserved.
The current trend in research on multithreading processors is toward chip multithreading (CMT), which exploits thread-level parallelism (TLP) and improves the performance of software built on traditional threading components, e.g., Pthreads. Commercially available processors support simultaneous multithreading (SMT) on multicore processors. However, they are based on the conventional sequential execution model and execute multiple threads in parallel under the control of an OS that handles interrupts. Moreover, few languages or programming techniques exist for using multicore processors effectively. We take another approach and develop a multithreading processor dedicated to TLP. Our processor, named Fuce, is based on continuation-based multithreading. A thread is defined as a block of sequentially ordered instructions that is executed without interruption. Every thread execution is triggered only by an event called a continuation. This paper first introduces the continuation-based multithread execution model and its processor architecture, then presents multithreaded programming techniques and the continuation-based multithreading language system CML. Finally, the performance of the Fuce processor is evaluated by means of clock-level software simulation.
The MPI-2 Standard has carefully specified the interaction between MPI and user-created threads. The goal of this specification is to allow users to write multithreaded MPI programs while also allowing MPI implementations to deliver high performance. However, a simple reading of the thread-safety specification does not reveal what its implications are for an implementation and what implementers must be aware (and careful) of. In this paper, we describe and analyze what the MPI Standard says about thread-safety and what it implies for an implementation. We classify the MPI functions based on their thread-safety requirements and discuss several issues to consider when implementing thread-safety in MPI. We use the example of generating new context ids (required for creating new communicators) to demonstrate how a simple solution for the single-threaded case does not naturally extend to the multithreaded case and how a naive thread-safe algorithm can be expensive. We then present an algorithm for generating context ids that works efficiently in both single-threaded and multithreaded cases. (C) 2007 Elsevier B.V. All rights reserved.
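The context-id example above can be made concrete with a small single-threaded simulation. The abstract does not spell out the algorithm, so this is a sketch under the assumption that each process keeps a bitmask of free ids and that the processes agree on an id through a bitwise-AND reduction; `Process`, `allreduce_band`, and `allocate_context_id` are hypothetical names, and no real MPI call is used.

```python
MASK_BITS = 32
FULL_MASK = (1 << MASK_BITS) - 1


class Process:
    """Hypothetical per-process state: bit i set means context id i is free."""

    def __init__(self):
        self.mask = FULL_MASK


def allreduce_band(masks):
    # Stand-in for a collective bitwise-AND reduction over all processes.
    result = FULL_MASK
    for m in masks:
        result &= m
    return result


def allocate_context_id(processes):
    # An id is usable only if it is free on every participating process.
    common = allreduce_band([p.mask for p in processes])
    if common == 0:
        raise RuntimeError("no context id is free on all processes")
    cid = (common & -common).bit_length() - 1  # index of the lowest set bit
    for p in processes:
        p.mask &= ~(1 << cid)                  # mark the id as used everywhere
    return cid


procs = [Process() for _ in range(3)]
procs[0].mask &= ~(1 << 0)  # process 0 already uses id 0 in another communicator
procs[1].mask &= ~(1 << 1)  # process 1 already uses id 1 in another communicator

first = allocate_context_id(procs)
second = allocate_context_id(procs)
print(first, second)
```

In this sketch the read of the local mask, the reduction, and the update form one uninterrupted sequence. With several threads per process, two threads could read the same mask before either updates it, which is the kind of issue the abstract says a naive extension must address.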
A 250-MHz single-chip multiprocessor, which can implement multichannel decoding, encoding, and transcoding of various audio and video standards, was fabricated using 0.25-μm CMOS technology and consumes 2.38 W at 2.5 V. The multiprocessor integrates four processors and a 64-kB shared level-2 cache and exploits the coarse-grained parallelism inherent in audio and video signal processing with multithreaded programming. Three coprocessors and scratch-pad memory have been added to each processing element and perform subword parallel processing, background data transfer, and bitstream processing for audio and video signal processing. Useful skew and clock gating have been utilized to achieve high-speed operation and low power consumption. Consequently, the multiprocessor achieves MPEG2 (MP@HL) video decoding at 20 frames/s.
Exascale computing systems will exhibit high degrees of hierarchical parallelism, with thousands of computing nodes and hundreds of cores per node. Efficiently exploiting hierarchical parallelism is challenging due to load imbalance that arises at multiple levels. OpenMP is the most widely used standard for expressing and exploiting the ever-increasing node-level parallelism. The scheduling options in OpenMP are insufficient to address the load imbalance that arises during the execution of multithreaded applications. The limited scheduling options in OpenMP also hinder research on novel scheduling techniques, which requires comparison with other techniques from the literature. This work introduces LB4OMP, an open-source dynamic load balancing library that implements successful scheduling algorithms from the literature. LB4OMP is a research infrastructure designed to spur and support present and future scheduling research, for the benefit of multithreaded application performance. Through an extensive performance analysis campaign, we assess the effectiveness and demystify the performance of all loop scheduling techniques in the library. We show that, for numerous application-system pairs, the scheduling techniques in LB4OMP outperform the scheduling options in OpenMP. Node-level load balancing using LB4OMP leads to reduced cross-node load imbalance and to improved performance of MPI+OpenMP applications, which is critical for Exascale computing.
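To make the idea of a loop scheduling technique concrete, the sketch below compares a static split with a factoring-style rule known from the loop scheduling literature, in which each batch hands out half of the remaining iterations in equal chunks. The abstract does not list which techniques LB4OMP implements, so this is an illustrative assumption only; `static_chunks` and `fac2_chunks` are hypothetical helper names and not part of any library API.

```python
def static_chunks(n, p):
    """Split n iterations into p near-equal blocks, one per thread."""
    base, extra = divmod(n, p)
    return [base + (1 if i < extra else 0) for i in range(p)]


def fac2_chunks(n, p):
    """Factoring-style chunking: each batch assigns about half of the
    remaining iterations, divided into p equal chunks."""
    chunks = []
    remaining = n
    while remaining > 0:
        size = max(1, -(-remaining // (2 * p)))  # ceil(remaining / (2p))
        for _ in range(p):
            if remaining == 0:
                break
            chunk = min(size, remaining)
            chunks.append(chunk)
            remaining -= chunk
    return chunks


print(static_chunks(100, 4))
print(fac2_chunks(100, 4))
```

Large early chunks keep scheduling overhead low, while the small chunks at the end give threads that finish early something to pick up, which is how dynamic rules of this kind try to reduce load imbalance when iteration costs vary.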
Sorting huge datasets has become essential in many computer applications, such as search engines, databases, and web-based applications, in order to improve search performance. Moreover, given the prevalence of commercial simultaneous multithreading (SMT) architectures, parallel programming with multithreading has become necessary for using all available hardware resources efficiently within one application. In this paper, Quicksort, an efficient and fast sorting algorithm, is implemented as a parallel multithreaded algorithm on an SMT architecture, where virtual parallelization is achieved using the POSIX threads (Pthreads) library. The proposed algorithm is evaluated and compared with its sequential counterpart. The analytical and experimental results show that multithreading is a viable technique for implementing parallel Quicksort efficiently on an SMT architecture: the parallel multithreaded Quicksort outperforms the sequential Quicksort in terms of various performance metrics, including time complexity and speedup.
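The structure of a multithreaded Quicksort can be sketched as follows. The paper uses Pthreads; this sketch uses Python's `threading` module purely to show the shape of the algorithm, and the function names and the `depth` cut-off are assumptions rather than details from the abstract. Because of CPython's global interpreter lock, this version illustrates the structure only and is not expected to reproduce the reported speedup.

```python
import threading


def partition(a, lo, hi):
    """Lomuto partition around a[hi]; returns the pivot's final index."""
    pivot = a[hi]
    i = lo
    for j in range(lo, hi):
        if a[j] < pivot:
            a[i], a[j] = a[j], a[i]
            i += 1
    a[i], a[hi] = a[hi], a[i]
    return i


def parallel_quicksort(a, lo, hi, depth):
    """Sort a[lo..hi] in place; spawn a thread per split while depth > 0."""
    if lo >= hi:
        return
    p = partition(a, lo, hi)
    if depth > 0:
        # The two halves are disjoint, so they can be sorted without locking.
        left = threading.Thread(
            target=parallel_quicksort, args=(a, lo, p - 1, depth - 1)
        )
        left.start()
        parallel_quicksort(a, p + 1, hi, depth - 1)
        left.join()
    else:
        # Below the cut-off, fall back to plain sequential recursion.
        parallel_quicksort(a, lo, p - 1, 0)
        parallel_quicksort(a, p + 1, hi, 0)


data = [5, 2, 9, 1, 5, 6, 0, 3]
parallel_quicksort(data, 0, len(data) - 1, depth=2)
print(data)
```

The `depth` parameter bounds the number of threads at roughly 2^depth, a common way to keep thread-creation overhead from outweighing the work done on small subarrays.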