ISBN (Print): 9781665497473
To amortize the cost of MPI communications, distributed parallel HPC applications can overlap network communications with computations in the hope that this improves overall application performance. When using this technique, computations and communications run at the same time. But computation usually also performs data movements. Since data for computations and data for communications use the same memory system, memory contention may occur when computations are memory-bound and large messages are transmitted through the network at the same time. In this paper we propose a model to predict memory bandwidth for computations and for communications when they are executed side by side, according to data locality and taking contention into account. Building the model allowed us to better understand where the bottlenecks are located in the memory system and which strategies the memory system applies in case of contention. The model was evaluated on many platforms with different characteristics, and showed an average prediction error lower than 4%.
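The abstract does not reproduce the model's equations; the toy sketch below only illustrates the kind of prediction involved, using a hypothetical proportional-sharing rule and made-up bandwidth figures rather than the authors' model.

```python
# Toy illustration of bandwidth prediction under contention.
# The peak bandwidth, demands, and the proportional-sharing rule
# are illustrative assumptions, not the paper's model.

def predict_effective_bandwidth(compute_demand, comm_demand, peak_bw):
    """Return (compute_bw, comm_bw) in GB/s when both streams run side by side."""
    total_demand = compute_demand + comm_demand
    if total_demand <= peak_bw:
        # No contention: each stream gets what it asks for.
        return compute_demand, comm_demand
    # Contention: assume the memory system splits its peak bandwidth
    # in proportion to each stream's standalone demand.
    scale = peak_bw / total_demand
    return compute_demand * scale, comm_demand * scale

# Example: a 60 GB/s memory-bound kernel and a 25 GB/s network transfer
# sharing a socket whose measured peak is 70 GB/s.
print(predict_effective_bandwidth(60.0, 25.0, 70.0))
```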
ISBN (Print): 9798350311990
The growth of online media platforms, particularly multimedia-rich social networks such as YouTube, has resulted in a demand for efficient data collection and analysis techniques. One of the critical data elements for multimedia-rich social platforms is the video transcript, which is not readily available through social platforms. Traditional methods for transcript generation are time-consuming and are challenged by the vast amount of data. This study proposes a methodology that leverages parallel computing and the Python multiprocessing library to improve the speed of transcript collection from YouTube. The methodology utilizes YouTube's Transcript API to extract YouTube-generated transcripts and OpenAI's Whisper model to generate transcriptions for videos without native YouTube transcripts. Additionally, the Googletrans Translation API was used to translate transcriptions from non-English videos. The results showed a significant improvement in processing time and performance, enabling researchers to conduct various studies on a larger scope of YouTube data with ease. With parallel processing, the YouTube Transcript API showed a 2100.88% performance increase, the Whisper model showed a 29.45% improvement, and the Googletrans API showed a 738.46% increase compared to the sequential processing baseline using the same process. The total time consumption was reduced by 25.54%, from 105.64 hours to 78.66 hours. The methodology developed in this study is not limited to YouTube and can be applied to other social media platforms, making it a versatile solution for data collection and analysis.
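The abstract names the libraries but shows no code; a minimal sketch of the parallel collection step, assuming the `youtube_transcript_api` package and an illustrative worker-pool size, might look like the following (the video IDs and error handling are placeholders, not the study's pipeline).

```python
# Minimal sketch: fetch YouTube-generated transcripts in parallel.
# Requires: pip install youtube-transcript-api (classic get_transcript API)
from multiprocessing import Pool
from youtube_transcript_api import YouTubeTranscriptApi

def fetch_transcript(video_id):
    """Return (video_id, transcript text) or (video_id, None) on failure."""
    try:
        segments = YouTubeTranscriptApi.get_transcript(video_id)
        return video_id, " ".join(seg["text"] for seg in segments)
    except Exception:
        # Videos without a native transcript would be routed to Whisper instead.
        return video_id, None

if __name__ == "__main__":
    video_ids = ["dQw4w9WgXcQ", "9bZkp7q19f0"]   # illustrative IDs
    with Pool(processes=8) as pool:              # pool size is an assumption
        results = pool.map(fetch_transcript, video_ids)
    print(results)
```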
ISBN (Print): 9781665497473
Coarse-grained reconfigurable architecture (CGRA) is a promising platform for HPC systems in the post-Moore's era. A single-source programming model is essential for practical heterogeneous computing. However, we do not have a canonical programming model or a frontend compiler for CGRAs. Existing CGRAs are versatile with respect to their execution model, computational capability, and system structure, which magnifies the difficulty of orchestrating the compiler techniques. It consequently forces CGRA designers to develop a compiler from scratch that works only for their own architecture. Such an approach is outdated, given other successful accelerators like GPUs and FPGAs. This paper presents a new CGRA compiler framework in order to reduce the development effort for CGRA applications. OpenMP-annotated codes are fed into the proposed compiler, as recent OpenMP versions support device offloading to accelerators. This property improves the reusability of existing source code for HPC workloads. The design of the compiler is inspired by LLVM, the most widely used compiler framework, so the frontend is built to be architecture-independent. In this work, we demonstrate that the proposed compiler can handle different types of CGRAs without changing the source codes. In addition, we discuss the effect of architecture-independent optimization algorithms. We also provide an open-source implementation of the compiler framework at https://***/hal-lab-u-tokyo/CGRAOmp.
ISBN (Print): 9798350364613; 9798350364606
It has been a decade since the ACM/IEEE CS2013 Curriculum guidelines recommended that all CS students learn about parallel and distributed computing (PDC). But few textbooks for "core" CS courses, especially first-year courses, include coverage of PDC topics. To fill this gap, we have written free, online, beginner- and intermediate-level PDC textbooks, containing interactive C/C++ OpenMP, MPI, mpi4py, CUDA, and OpenACC code examples that students can run and modify directly in the browser. The books address a serious challenge to teaching PDC concepts, namely, easy access to the powerful hardware needed for observing patterns and scalability. This paper describes the content of these textbooks and the underlying infrastructure that makes them possible. We believe the described textbooks fill a critical gap in PDC education and will be very useful for the community.
ISBN (Print): 9798350364613; 9798350364606
Machine-learning (ML) algorithms are finding wide adoption across a rich spectrum of application domains with diverse requirements in terms of performance, power, and cost. These diverse requirements make it necessary to explore a large space of ML architectures and reexamine fundamental computational structures, a process of exploration that is very expensive. To get around the costly computations associated with large data sets and long training times, there have been increasing investments in specialized fixed-function hardware. However, this specialized hardware is expensive and hard to generalize across the spectrum of applications. For our experiments, we focus on a novel, highly parallel, superset ML architecture, and use it to test the capabilities of new coarse-grained FPGAs containing hundreds to thousands of DSP slices with dedicated local storage. These new coarse-grained architectures allow us to achieve ASIC-like clock rates and reductions in power while exploring novel and common ML architectures.
ISBN (Print): 9781665481069
Finding the biconnected components of a graph has a large number of applications in many other graph problems, including planarity testing, computing centrality metrics, finding the (weighted) vertex cover, coloring, and the like. Recent years saw the design of efficient algorithms for this problem across sequential and parallel computational models. However, current algorithms do not work in the setting where the underlying graph changes over time in a dynamic manner via the insertion or deletion of edges. Dynamic algorithms in the sequential setting that obtain the biconnected components of a graph upon insertion or deletion of a single edge have been known for over two decades, but parallel algorithms for this problem are not heavily studied. In this paper, we design shared-memory parallel algorithms that obtain the biconnected components of a graph subsequent to the insertion or deletion of a batch of edges. Our algorithms are hence capable of exploiting the parallelism afforded by a batch of updates. We implement our algorithms on an AMD EPYC 7742 CPU having 128 cores. Our experiments on a collection of 10 real-world graphs from multiple classes indicate that our algorithms outperform state-of-the-art parallel static algorithms.
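The batch-dynamic algorithms themselves are not given in the abstract; for orientation, the static baseline they are compared against amounts to recomputing biconnected components from scratch after every batch of updates, sketched below with `networkx` on a hypothetical graph and batch.

```python
# Static-recompute baseline: apply a batch of edge updates, then recompute
# biconnected components from scratch (the work a batch-dynamic algorithm
# tries to avoid). Requires: pip install networkx
import networkx as nx

def apply_batch_and_recompute(G, inserted_edges, deleted_edges):
    G.add_edges_from(inserted_edges)
    G.remove_edges_from(deleted_edges)
    return list(nx.biconnected_components(G))

G = nx.path_graph(5)                 # hypothetical starting graph: 0-1-2-3-4
batch_insert = [(0, 2), (2, 4)]      # hypothetical batch of edge insertions
print(apply_batch_and_recompute(G, batch_insert, []))
```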
Edge computing plays a pivotal role in IoT applications that require rapid and secure data processing. However, these applications are typically resource-demanding, and the resources available at the edge are often s...
ISBN (Print): 9798350302080
With the accelerating growth of Big Data, real-world graph processing applications now need to tackle graphs with billions of vertices and trillions of edges, thereby increasing the demand for effective solutions to application scalability. Unfortunately, current approaches to implementing these applications on modern HPC systems exhibit poor scale-out performance with increasing numbers of nodes. The scalability challenges for these applications are driven by large data sizes, synchronization overheads, and fine-grained communications with irregular data accesses and poor locality. This paper presents the scalability of a novel Actor-based programming system, which provides a lightweight runtime that supports fine-grained asynchronous execution and automatic message aggregation atop a Partitioned Global Address Space (PGAS) communication layer. Evaluations of the Jaccard Index and PageRank applications on the NERSC Perlmutter system demonstrate nearly perfect scaling up to 1,000 nodes and 64K cores (one-third of the 3,000 nodes targeted for Perlmutter). In addition, our Actor-based implementations of Jaccard Index and PageRank executed with parallel efficiencies of 85.7% and 63.4% for the largest run of 64K cores. This performance represents a 29.6x speedup relative to UPC and OpenSHMEM versions of PageRank.
ISBN (Print): 9798350387186; 9798350387179
Homomorphic encryption (HE) algorithms, particularly the Cheon-Kim-Kim-Song (CKKS) scheme, offer significant potential for secure computation on encrypted data, making them valuable for privacy-preserving machine learning. However, the high latency of large-integer operations in the CKKS algorithm hinders the processing of large datasets and complex computations. This paper proposes a novel strategy that combines lossless data compression techniques with the parallel processing power of graphics processing units to address these challenges. Our approach demonstrably reduces data size by 90% and achieves significant speedups of up to 100 times compared to conventional approaches. This method ensures data confidentiality while mitigating performance bottlenecks in CKKS-based computations, paving the way for more efficient and scalable HE applications.
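The abstract does not detail the compression pipeline; the sketch below only illustrates a host-side lossless-compression step on a stand-in ciphertext buffer using `zlib`, before such a buffer would be shipped to the GPU (the buffer contents and the resulting ratio are illustrative, not the paper's method or results).

```python
# Illustrative host-side step: losslessly compress a serialized ciphertext
# buffer before transferring it to the GPU for CKKS operations.
import zlib
import numpy as np

# Hypothetical stand-in for a serialized CKKS ciphertext (real ciphertexts
# are large arrays of big-integer polynomial coefficients).
ciphertext_bytes = np.zeros(1 << 20, dtype=np.uint64).tobytes()

compressed = zlib.compress(ciphertext_bytes, level=6)
ratio = 100 * (1 - len(compressed) / len(ciphertext_bytes))
print(f"compressed {len(ciphertext_bytes)} -> {len(compressed)} bytes "
      f"({ratio:.1f}% reduction)")
# The compressed buffer is what would be copied to device memory and handled
# by GPU kernels in a compression-plus-GPU approach of this kind.
```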
ISBN (Print): 9781665497473
We identify the graph data structure, frontiers, operators, an iterative loop structure, and convergence conditions as essential components of graph analytics systems based on the native-graph approach. Using these essential components, we propose an abstraction that captures all the significant programming models within graph analytics, such as bulk-synchronous, asynchronous, shared-memory, message-passing, and push vs. pull traversals. Finally, we demonstrate the power of our abstraction with an elegant modern C++ implementation of single-source shortest path and its required components.
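The authors' implementation is in modern C++ and is not reproduced in the abstract; the sketch below re-expresses the named components (graph, frontier, operator, iterative loop, convergence condition) in Python for a push-style single-source shortest path, purely as an illustration of the abstraction rather than the authors' API.

```python
# Frontier-based single-source shortest path, organized around the
# components named in the abstract: graph data structure, frontier,
# operator, iterative loop structure, convergence condition.
import math

def sssp(adjacency, source):
    """adjacency: dict mapping vertex -> list of (neighbor, weight)."""
    dist = {v: math.inf for v in adjacency}       # graph-wide distance state
    dist[source] = 0.0
    frontier = [source]                           # frontier component
    while frontier:                               # convergence: empty frontier
        next_frontier = []
        for u in frontier:                        # advance operator over the frontier
            for v, w in adjacency[u]:
                if dist[u] + w < dist[v]:         # push-style edge relaxation
                    dist[v] = dist[u] + w
                    next_frontier.append(v)
        frontier = next_frontier                  # iterative loop structure
    return dist

graph = {0: [(1, 4.0), (2, 1.0)], 1: [(3, 1.0)],
         2: [(1, 2.0), (3, 5.0)], 3: []}          # hypothetical weighted digraph
print(sssp(graph, 0))
```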