ISBN: 9781450392044 (print)
Performance variance is a serious problem for parallel applications: it can cause performance degradation and make applications' behavior hard to understand. Therefore, detecting and diagnosing performance variance is of crucial importance for users and application developers. However, previous detection approaches either incur too much overhead and hurt applications' performance, or rely on nontrivial source code analysis that is impractical for production-run parallel applications. In this work, we propose VAPRO, a performance variance detection and diagnosis framework for production-run parallel applications. Our approach is based on the important observation that most parallel applications contain code snippets that are repeatedly executed with a fixed workload, which can be used for performance variance detection. To effectively identify these snippets at runtime even without program source code, we introduce the State Transition Graph (STG) to track program execution and then conduct lightweight workload analysis on the STG to locate variance. To diagnose the detected variance, VAPRO uses a progressive diagnosis method based on a hybrid model that combines variance breakdown and statistical analysis. Results show that the performance overhead of VAPRO is only 1.38% on average. VAPRO can detect variance in real applications caused by hardware bugs, memory, and I/O. After fixing the detected variance, the standard deviation of the execution time is reduced by up to 73.5%. Compared with the state-of-the-art variance detection tool based on source code analysis, VAPRO achieves 30.0% higher detection coverage.
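The observation in this abstract, that repeated snippets with a fixed workload should take about the same time, can be illustrated with a minimal Python sketch. This is not VAPRO's implementation: the record layout `(snippet_id, workload, elapsed_seconds)`, the function name `detect_variance`, and the 1.2x threshold are all assumptions made for illustration, and the STG construction and diagnosis steps are omitted.

```python
from collections import defaultdict


def detect_variance(records, threshold=1.2):
    """Flag executions that are slow relative to peers with the same workload.

    records: iterable of (snippet_id, workload, elapsed_seconds) tuples.
    This layout is hypothetical; it stands in for whatever a runtime tracer
    would collect per executed snippet.
    Returns the sorted indices of records flagged as variance.
    """
    # Group executions that ran the same snippet with the same workload.
    groups = defaultdict(list)
    for idx, (snippet, workload, elapsed) in enumerate(records):
        groups[(snippet, workload)].append((idx, elapsed))

    flagged = []
    for runs in groups.values():
        if len(runs) < 2:
            continue  # nothing to compare a single execution against
        best = min(elapsed for _, elapsed in runs)
        for idx, elapsed in runs:
            # Same work, noticeably more time: treat it as variance.
            if elapsed > threshold * best:
                flagged.append(idx)
    return sorted(flagged)


records = [
    ("s1", 100, 1.00),
    ("s1", 100, 1.05),
    ("s1", 100, 2.00),  # same snippet and workload, twice as slow
    ("s1", 200, 2.00),  # different workload, so not comparable
    ("s2", 100, 0.50),
    ("s2", 100, 0.50),
]
flagged = detect_variance(records)
```

Comparing against the fastest execution in each group is one simple choice of baseline; a median or a percentile would be less sensitive to a single unusually fast run.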
ISBN: 9781450392044 (print)
Graph processing, especially high-performance graph traversal, plays an increasingly important role in data analytics. The successor of Sunway TaihuLight, New Sunway, is equipped with nearly 10 PB of memory and over 40 million cores, which brings the opportunity to process graphs with hundreds of trillions of edges. However, a graph of unprecedented scale also brings severe performance challenges, including load imbalance, poor locality, and the irregular access patterns of graph traversal workloads. To address the scalability problem, we propose a novel 3-level degree-aware 1.5D graph partitioning, which benefits from both delegated 1D and 2D partitioning. By delegating extremely heavy vertices globally and other heavy vertices on columns and rows in the process mesh, we break the scalability wall of previous partitioning methods. Together with sub-iteration direction optimization, core-group-aware core subgraph segmenting, and a new on-chip sorting mechanism using RMA, we achieve 180,792 GTEPS on a graph with 281 trillion edges, using 103,912 processors with over 40 million cores, achieving 1.75x performance and 8x capacity compared to the previous state of the art and conforming to the Graph 500 BFS benchmark [14].
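The three degree levels mentioned in this abstract can be sketched as a simple vertex classification in Python. This is only an illustration of the idea of degree-aware levels, not the paper's partitioner: the function name `classify_vertices`, the level labels, and both thresholds are hypothetical, and the actual placement of vertices on the process mesh is not shown.

```python
from collections import Counter


def classify_vertices(edges, heavy_threshold, extreme_threshold):
    """Assign each vertex to one of three degree levels.

    edges: iterable of undirected (u, v) pairs.
    Thresholds are illustrative assumptions, not values from the paper.
    """
    degree = Counter()
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1

    levels = {}
    for vertex, d in degree.items():
        if d >= extreme_threshold:
            levels[vertex] = "extreme"  # would be delegated globally
        elif d >= heavy_threshold:
            levels[vertex] = "heavy"    # would be delegated along mesh rows and columns
        else:
            levels[vertex] = "light"    # would stay with a single owner
    return levels


# A small star around vertex 0, plus a few extra edges at vertex 1.
edges = [(0, i) for i in range(1, 7)] + [(1, 2), (1, 3), (1, 4)]
levels = classify_vertices(edges, heavy_threshold=3, extreme_threshold=6)
```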
ISBN: 9781450392044 (print)
The availability of Non-Volatile Main Memory (NVMM) enables the design of recoverable concurrent algorithms. We study the power of software combining in achieving recoverable synchronization and designing persistent data structures. Software combining is a general synchronization approach that attempts to simulate the ideal world when executing synchronization requests (i.e., requests that must be executed in mutual exclusion). A single thread, called the combiner, executes all active requests, while the rest of the threads wait for the combiner to notify them that their requests have been applied. Software combining significantly decreases the synchronization cost and outperforms many other synchronization techniques in various cases. We identify three persistence principles, crucial for performance, that an algorithm's designer has to take into consideration when designing highly efficient recoverable synchronization protocols or data structures. We illustrate how to make the appropriate design decisions in all stages of devising recoverable combining protocols to respect these principles. Specifically, we present two recoverable software combining protocols, satisfying different progress properties, that are many times faster and have much lower persistence cost than a large collection of existing persistent techniques for achieving scalable synchronization. Based on these protocols, we build fundamental recoverable data structures, such as stacks and queues, that far outperform existing recoverable implementations of such data structures. We also provide the first recoverable implementation of a concurrent heap and present experiments showing that it performs well when the heap is not very large.
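The combining idea described in this abstract (threads announce requests, one combiner applies them all, the others wait for notification) can be sketched in Python as follows. This is a plain, volatile-memory sketch and not either of the paper's protocols: it has no persistence or recovery logic, and the class `CombiningCounter`, its per-thread announce array, and the helper `run_demo` are assumptions made for illustration.

```python
import threading


class CombiningCounter:
    """Fetch-and-add counter where one thread at a time applies all pending requests."""

    def __init__(self, n_threads):
        self.value = 0
        self.lock = threading.Lock()
        self.requests = [None] * n_threads  # announce array, one slot per thread
        self.results = [None] * n_threads
        self.done = [threading.Event() for _ in range(n_threads)]

    def fetch_and_add(self, tid, amount):
        self.done[tid].clear()
        self.requests[tid] = amount  # announce the request
        while not self.done[tid].is_set():
            if self.lock.acquire(blocking=False):
                try:
                    self._combine()  # this thread is the combiner for now
                finally:
                    self.lock.release()
            else:
                # Another thread is combining; wait briefly to be notified.
                self.done[tid].wait(timeout=0.001)
        return self.results[tid]

    def _combine(self):
        # Apply every announced request while holding the lock.
        for i, amount in enumerate(self.requests):
            if amount is not None:
                self.requests[i] = None
                self.results[i] = self.value
                self.value += amount
                self.done[i].set()  # notify the owner of slot i


def run_demo(n_threads=4, ops=100):
    counter = CombiningCounter(n_threads)
    seen = [[] for _ in range(n_threads)]

    def worker(tid):
        for _ in range(ops):
            seen[tid].append(counter.fetch_and_add(tid, 1))

    threads = [threading.Thread(target=worker, args=(t,)) for t in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter.value, sorted(v for per_thread in seen for v in per_thread)


final_value, returned = run_demo()
```

Because each request is applied exactly once under the lock, every call returns a distinct previous value. A recoverable variant would additionally have to decide what to persist and when, which is the subject of the persistence principles the abstract mentions.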
Maintaining a dynamic k-core decomposition is an important problem that identifies dense subgraphs in dynamically changing graphs. Recent work by Liu et al. [SPAA 2022] presents a parallel batch-dynamic algorithm for ...
Concurrent B+trees have been widely used in many systems. With the scale of data requests increasing exponentially, the systems are facing tremendous performance pressure. GPU has shown its potential to accelerate con...
While parallelism remains the main source of performance, architectural implementations and programming models change with each new hardware generation, often leading to costly application re-engineering. Most tools f...
Functional logic languages are a high-level approach to programming by combining the most important declarative features. They abstract from small-step operational details so that programmers can concentrate on the lo...
ISBN: 9798400714436 (print)
Speculative data-parallel algorithms for language recognition have been widely experimented with for various types of finite-state automata (FA), both deterministic (DFA) and nondeterministic (NFA), often derived from regular expressions (RE). Such an algorithm cuts the input string into chunks, independently recognizes each chunk in parallel by means of identical FAs, and finally joins the chunk results and checks their overall consistency. In chunk recognition, it is necessary to speculatively start the FAs in any state, which causes an overhead that reduces the speedup over a serial algorithm. The existing data-parallel DFA-based recognizers suffer from an excessive number of starting states, and the NFA-based ones suffer from the number of nondeterministic transitions.
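The baseline scheme this abstract describes (cut the input into chunks, run each chunk speculatively from every state, then join the results) can be sketched in Python. The DFA below, which accepts strings over {a, b} with an even number of 'a', and all function names are assumptions chosen for illustration. The chunks are processed in a plain loop here; since each call to `run_chunk` is independent, they could be handed to a worker pool instead.

```python
from functools import reduce

# Hypothetical example DFA: accepts strings with an even number of 'a'.
STATES = (0, 1)
START = 0
ACCEPTING = {0}
DELTA = {(0, "a"): 1, (0, "b"): 0, (1, "a"): 0, (1, "b"): 1}


def run_chunk(chunk):
    """Speculatively run the chunk from every state; return a start -> end map."""
    mapping = {}
    for start in STATES:
        state = start
        for ch in chunk:
            state = DELTA[(state, ch)]
        mapping[start] = state
    return mapping


def join(left, right):
    """Compose two chunk maps, where the left chunk precedes the right one."""
    return {s: right[left[s]] for s in STATES}


def split(text, n_chunks):
    """Cut text into at most n_chunks pieces of nearly equal size."""
    size = max(1, -(-len(text) // n_chunks))  # ceiling division
    return [text[i:i + size] for i in range(0, len(text), size)]


def recognize(text, n_chunks=4):
    mappings = [run_chunk(c) for c in split(text, n_chunks)]  # independent per chunk
    identity = {s: s for s in STATES}
    total = reduce(join, mappings, identity)
    return total[START] in ACCEPTING


def recognize_serial(text):
    """Ordinary left-to-right recognition, used as a reference."""
    state = START
    for ch in text:
        state = DELTA[(state, ch)]
    return state in ACCEPTING
```

The inner loop over `STATES` in `run_chunk` is the speculation overhead the abstract refers to: every chunk is scanned once per possible starting state, so the extra work grows with the number of states.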
The proceedings contain 13 papers. The topics discussed include: a guided walk into link key candidate extraction with relational concept analysis; reflections on profiling and cataloguing the content of SPARQL endpoints using SPORTAL; reflections on: modeling linked open statistical data; reflections on: DCAT-AP representation of Czech national open data catalog and its impact; reflections on: deep learning for noise-tolerant RDFS reasoning; reflections on: finding melanoma drugs through a probabilistic knowledge graph; reflections on: knowledge graph fact prediction via knowledge-enriched tensor factorization; the semantic sensor network ontology, revamped; and reflections on: knowmore - knowledge base augmentation with structured web markup.
Many low-level optimizations for NVIDIA GPU can only be implemented in native hardware assembly (SASS). However, programming in SASS is unproductive and not portable. To simplify low-level GPU programming, we present ...