ISBN: 9781479970612 (print)
In this paper, we present a new distributed algorithm for minimizing a sum of not necessarily differentiable convex functions composed with arbitrary linear operators. The overall cost function is assumed to be strongly convex. Each function is associated with a node of a hypergraph, and each node can communicate with neighboring nodes that share the same hyperedge. Our algorithm relies on a primal-dual splitting strategy with established convergence guarantees. We show how it can be implemented efficiently to take full advantage of a multicore architecture. The good numerical performance of the proposed approach is illustrated on a video sequence denoising problem, where a significant speedup is achieved.
This paper presents work-in-progress towards a C++ source-to-source translator that automatically seeks parallelisable code fragments and replaces them with code for a graphics co-processor. We report on our experienc...
ISBN: 9781538653951 (print)
This paper describes how to implement a distributed lambda-calculus interpreter from scratch. First, we describe how to implement a monadic parser; then the Krivine machine is introduced for the interpretation part, and the actor model is used for distribution. In this work we do not provide a general solution for parallelism, but consider particular patterns that can always be parallelized. As a result, a basic extensible implementation of a call-by-name distributed machine is introduced and a prototype is presented. We achieved a computation speed improvement in some cases, but an efficient distributed version was not achieved; the problems are discussed in the evaluation section. This work provides a foundation for further research: once the implementation is complete, it is possible to add concurrency for non-determinism, improve the interpreter using call-by-need semantics, or study optimal automatic parallelization to generalize what can be done efficiently in parallel.
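To make the call-by-name evaluation strategy concrete, here is a minimal sequential sketch of a Krivine machine. It is not the paper's implementation: the tuple encoding of terms with de Bruijn indices, the function name `krivine`, and the `max_steps` guard are all assumptions chosen for illustration, and the parser and actor-based distribution are left out.

```python
# Minimal Krivine machine sketch (illustrative, not the paper's code).
# Terms use de Bruijn indices, encoded as tuples (an assumed encoding):
#   ("var", n)        variable bound by the n-th enclosing lambda
#   ("lam", body)     abstraction
#   ("app", fn, arg)  application

def krivine(term, max_steps=10_000):
    """Reduce a closed term to weak head normal form, call-by-name.

    Returns the final (term, env) closure. max_steps is a hypothetical
    guard so that diverging terms raise instead of looping forever.
    """
    env = ()    # tuple of closures; index 0 is the innermost binder
    stack = []  # pending arguments, each an unevaluated closure (term, env)
    for _ in range(max_steps):
        tag = term[0]
        if tag == "app":
            # Push the argument unevaluated and continue with the function.
            stack.append((term[2], env))
            term = term[1]
        elif tag == "lam":
            if not stack:
                return term, env  # no argument left: weak head normal form
            # Bind the top argument closure to de Bruijn index 0.
            env = (stack.pop(),) + env
            term = term[1]
        elif tag == "var":
            # Enter the closure bound to this variable.
            term, env = env[term[1]]
        else:
            raise ValueError(f"unknown term tag: {tag!r}")
    raise RuntimeError("step limit reached")


# Usage: K I OMEGA reduces to I; OMEGA is never entered under call-by-name.
I = ("lam", ("var", 0))
K = ("lam", ("lam", ("var", 1)))
SELF_APP = ("lam", ("app", ("var", 0), ("var", 0)))
OMEGA = ("app", SELF_APP, SELF_APP)

result_term, result_env = krivine(("app", ("app", K, I), OMEGA))
```

Because arguments are pushed as unevaluated closures, the diverging term `OMEGA` is discarded without ever being reduced, which is the defining behaviour of call-by-name.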
ISBN: 9781538649756 (print)
In this work, we address the challenge of designing an efficient warp scheduler for throughput processors by proposing SAWS (Simple and Adaptive Warp Scheduler). Unlike previous approaches, which target a particular type of application, SAWS considers several simple scheduling algorithms and tries to use the one that best fits each application or each phase within an application. Through detailed simulations we demonstrate that a practical implementation of SAWS can obtain IPC values that closely match those of the best scheduling algorithm in each case.
ISBN: 9781581130102 (print)
Computing radiosity is a very expensive problem in computer graphics. Recent hierarchical methods have greatly sped up the computation, first of diffuse and now also of specular radiosity. We present a parallel algorithm for computing both diffuse and specular radiosity together, and discuss the techniques we used to improve its performance. The algorithm is both irregular and highly unpredictable. Despite this, by carefully designing a parallel algorithm that minimizes synchronization and memory access overhead, and by identifying and correcting several synchronization bottlenecks that we did not anticipate, we were able to obtain speedups of 26.3 on a 32-processor machine with distributed memory and 14.2 on a 16-processor machine with centralized memory. We demonstrate how execution profiles obtained at runtime, for example the time spent waiting at different locks, can be used to significantly improve the performance of complex, irregular parallel applications.
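The reported speedups can be put on a common scale with the standard definition of parallel efficiency, speedup divided by processor count. The short sketch below applies that definition to the two figures quoted in the abstract; the function and variable names are illustrative only.

```python
def parallel_efficiency(speedup, processors):
    """Parallel efficiency: the fraction of ideal linear speedup achieved."""
    return speedup / processors


# (speedup, processor count) pairs as quoted in the abstract.
reported = {
    "distributed memory": (26.3, 32),
    "centralized memory": (14.2, 16),
}

efficiencies = {
    machine: parallel_efficiency(speedup, processors)
    for machine, (speedup, processors) in reported.items()
}

for machine, efficiency in efficiencies.items():
    print(f"{machine}: efficiency {efficiency:.3f}")
```

Under this definition both configurations run at roughly 82% and 89% of ideal linear speedup, respectively.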
ISBN: 9798350387117; 9798350387124 (print)
Graph-processing workloads have become widespread due to their relevance to a wide range of application domains such as network analysis, path-planning, bioinformatics, and machine learning. Graph-processing workloads have massive data footprints that exceed cache storage capacity and exhibit highly irregular memory access patterns due to data-dependent graph traversals. This irregular behaviour causes graph-processing workloads to exhibit poor data locality, undermining their performance. This paper makes two fundamental observations on the memory access patterns of graph-processing workloads: First, conventional cache hierarchies become mostly useless when dealing with graph-processing workloads, since 78.6% of the accesses that miss in the L1 Data Cache (L1D) result in misses in the L2 Cache (L2C) and in the Last Level Cache (LLC), requiring a DRAM access. Second, it is possible to predict whether a memory access will be served by DRAM or not in the context of graph-processing workloads by observing strides between accesses triggered by instructions with the same Program Counter (PC). Our key insight is that bypassing the L2C and the LLC for highly irregular accesses significantly reduces latency cost while also reducing pressure on the lower levels of the cache hierarchy. Based on these observations, this paper proposes the Large Predictor (LP), a low-cost micro-architectural predictor capable of distinguishing between regular and irregular memory accesses. We propose to serve accesses tagged as regular by LP via the standard memory hierarchy, while irregular accesses are served via the Side Data Cache (SDC). The SDC is a private per-core set-associative cache placed alongside the L1D specifically aimed at reducing the latency cost of highly irregular accesses while avoiding polluting the rest of the cache hierarchy with data that exhibits poor locality. SDC coupled with LP yields geometric mean speed-ups of 20.3% and 20.2% on single- and multi-core scenarios, resp
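The idea of classifying accesses by observing strides per Program Counter can be illustrated with a generic stride detector. The sketch below is not the paper's LP design: the table layout, the saturating confidence counter, the `threshold` and `max_confidence` values, and the choice to treat unseen PCs as irregular are all assumptions made for illustration.

```python
class StridePredictor:
    """Toy per-PC stride detector (illustrative; not the paper's LP)."""

    def __init__(self, threshold=2, max_confidence=3):
        # pc -> (last_address, last_stride, confidence); assumed layout.
        self.table = {}
        self.threshold = threshold            # hypothetical cut-off
        self.max_confidence = max_confidence  # hypothetical saturation value

    def access(self, pc, address):
        """Record an access and return True if it is classified as regular."""
        entry = self.table.get(pc)
        if entry is None:
            # First access from this PC: no stride history yet.
            # Treating it as irregular is an assumption of this sketch.
            self.table[pc] = (address, 0, 0)
            return False
        last_address, last_stride, confidence = entry
        stride = address - last_address
        if stride == last_stride:
            # Repeated stride: raise confidence, saturating at the maximum.
            confidence = min(confidence + 1, self.max_confidence)
        else:
            # Stride changed: lower confidence, never below zero.
            confidence = max(confidence - 1, 0)
        self.table[pc] = (address, stride, confidence)
        return confidence >= self.threshold


predictor = StridePredictor()

# A streaming loop: one PC touching consecutive 64-byte lines.
streaming = [predictor.access(0x400, a) for a in (0, 64, 128, 192, 256)]

# A data-dependent traversal: one PC touching unrelated addresses.
pointer_chasing = [predictor.access(0x800, a)
                   for a in (7000, 120, 93400, 48, 5120)]
```

In this toy model the streaming PC is tagged regular once its stride has repeated twice, while the data-dependent PC never accumulates confidence and stays tagged irregular.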
Directed Acyclic Graphs (DAGs) are often used to model circuits and networks. The path length in such DAGs represents circuit or network delays. In the vertex splitting problem, the objective is to determine a minimum...
ISBN: 0818678836 (print)
The paper presents the Abstract Configuration Language (ACL) implemented within the Parallel Objects object-oriented parallel programming environment. ACL defines a set of directives that allow users to specify the allocation needs of their application components without being aware of the architectural details. ACL directives drive the allocation decisions of the run-time support by adapting its general-purpose behaviour to an application's particular allocation needs. The effectiveness of the ACL approach in increasing the performance of parallel applications is confirmed by a testbed application.
ISBN: 9781665464970 (print)
Pharmaceutical parallel trade is a legal trade in European countries, where traders can buy medicinal products in one country and sell them in other countries to make a profit. The pharmaceutical parallel trade market involves players such as manufacturers, wholesalers, parallel traders, pharmacies, and hospitals. Studying and analyzing this market is of significant interest to economists and to the players involved. Agent-based modeling offers a robust algorithmic framework to analyze macroeconomic phenomena through micro-founded models. As an initial step in using agent-based modeling for the parallel trade of pharmaceuticals, we consider a simplified pharmaceutical trading market inspired by available game theory models. In this paper, we develop and describe the implementation of an agent-based model for the pharmaceutical trade market and employ it to run multiple scenarios that are impossible to analyze through game-theoretic models. We then demonstrate how an agent-based model can be used to analyze the market from an economic perspective and how players in this market can use this model in their business decisions.
ISBN: 9781479968909 (print)
The development of complex networked multi-core systems, like compute nodes in the Internet-of-Things, requires new simulation and design concepts. In this paper we present an environment for the asynchronous simulation of networked multi-core systems, based on SystemC. Combined with the open-source machine emulator and virtualizer QEMU, a virtual network is created. The compute nodes behave similarly to recent Systems-on-Chip from Xilinx and Altera, which provide high flexibility by combining an ARM processing system with programmable logic. As an example, we simulate these systems by extending QEMU, following its device model abstraction qdev. The resulting network benefits from execution on different host systems. It is highly scalable and designed for the development of complex networked multi-core systems. For non-distributed execution on one processor, we implemented an alternative communication method which takes only 2/3 of the time of the networked simulation.