Complex networks are a technique for the modeling and analysis of large data sets in many scientific and engineering disciplines. Due to their excessive size conventional algorithms and single core processors struggle...
详细信息
ISBN:
(纸本)9781479904945
Complex networks are a technique for the modeling and analysis of large data sets in many scientific and engineering disciplines. Due to their excessive size conventional algorithms and single core processors struggle with the efficient processing of such networks. Employing multi-core graphic processing units (CPUs) could provide sufficient processing power for the analysis of such networks. However, commonly designed algorithms cannot exploit these massively parallel processing power for the analysis of such networks. In this paper, we present the multi Layer Network Decomposition (MLND) approach which provides a general approach for parallel network analysis using multi-core processors via efficient partitioning and mapping of networks onto GPU architectures. Evaluation using a 336 core GPU graphic card demonstrated a 16x speed-up in complex network analysis relative to a CPU based approach.
Convolutional networks (ConvNets) have become a popular approach to computer vision. Here we consider the parallelization of ConvNet training, which is computationally costly. Our novel parallel algorithm is based on ...
详细信息
Convolutional networks (ConvNets) have become a popular approach to computer vision. Here we consider the parallelization of ConvNet training, which is computationally costly. Our novel parallel algorithm is based on decomposition into a set of tasks, most of which are convolutions or FFTs. Theoretical analysis suggests that linear speedup with the number of processors is attainable. To attain such performance on real shared-memory machines, our algorithm computes convolutions converging on the same node of the network with temporal locality to reduce cache misses, and sums the convergent convolution outputs via an almost wait-free concurrent method to reduce time spent in critical sections. Benchmarking with multi-core CPUs shows speedup roughly equal to the number of physical cores. We also demonstrate 90x speedup on a many-core CPU (Xeon Phi Knights Corner). Our algorithm can be either faster or slower than certain GPU implementations depending on specifics of the network architecture, kernel sizes, and density and size of the output patch. (C) 2017 Elsevier Inc. All rights reserved.
Recent advances in non-volatile main memory (NVM) technology have spurred research on algorithms that are resilient to intermittent failures that cause processes to crash and subsequently restart. In this paper we pre...
详细信息
Recent advances in non-volatile main memory (NVM) technology have spurred research on algorithms that are resilient to intermittent failures that cause processes to crash and subsequently restart. In this paper we present a Recoverable Mutual Exclusion (RME) algorithm that supports abortability. Our algorithm guarantees FCFS and a strong liveness property: processes do not starve even in runs consisting of infinitely many crashes, provided that a process crashes at most a finite number of times in each of its attempts. On DSM and Relaxed-CC multiprocessors, a process incurs O (min(k, log n)) RMRs in a passage and O(f + min (k, log n)) RMRs in an attempt, where n is the number of processes that the algorithm is designed for, k is the point contention of the passage or the attempt, and f is the number of times that p crashes during the attempt. On a Strict CC multiprocessor, the passage and attempt complexities are O(n) and 0(f + n), respectively. Our algorithm uses only the read, write, and CAS operations, which are commonly supported by multiprocessors. Attiya, Hendler, and Woelfel proved that, with any mutual exclusion algorithm, a process incurs at least 52 (log n) RMRs in a passage, if the algorithm uses only the read, write, and CAS operations (in: Proc. of the Fortieth ACM Symposium on Theory of Computing, New York, NY, USA, 2008). This lower bound implies that the worst-case RMR complexity of our algorithm is optimal for the DSM and Relaxed CC multiprocessors. This paper is an expanded version of our conference paper as reported by Jayanti and Joshi (in: Atig and Schwarzmann (eds) Networked Systems. Springer International Publishing, Cham, 2019), which presented the first Recoverable Mutual Exclusion (RME) algorithm that supports abortability. This algorithm from our conference paper (in: Atig and Schwarzmann (eds) Networked Systems. Springer International Publishing, Cham, 2019) admits starvation when there are infinitely many aborts in a run. In this paper,
Mutex locks have traditionally been the most common mechanism for protecting shared data structures in concurrent programs. However, the robustness of such locks against process failures has not been studied thoroughl...
详细信息
Mutex locks have traditionally been the most common mechanism for protecting shared data structures in concurrent programs. However, the robustness of such locks against process failures has not been studied thoroughly. The vast majority of mutex algorithms are designed around the assumption that processes are reliable, meaning that a process may not fail while executing the lock acquisition and release code, or while inside the critical section. If such a failure does occur, then the liveness properties of a conventional mutex lock may cease to hold until the application or operating system intervenes by cleaning up the internal structure of the lock. For example, a process that is attempting to acquire an otherwise starvation-free mutex may be blocked forever waiting for a failed process to release the critical section. Adding to the difficulty, if the failed process recovers and attempts to acquire the same mutex again without appropriate cleanup, then the mutex may become corrupted to the point where it loses safety, notably the mutual exclusion property. We address this challenge by formalizing the problem of recoverable mutual exclusion, and proposing several solutions that vary both in their assumptions regarding hardware support for synchronization, and in their efficiency. Compared to known solutions, our algorithms are more robust as they do not restrict where or when a process may crash, and provide stricter guarantees in terms of efficiency, which we define in terms of remote memory references.
We develop an efficient multicore algorithm, PMS6MC, for the (l;d) -motif discovery problem in which we are to find all strings of length l that appear in every string of a given set of strings with at most d mismatch...
详细信息
We develop an efficient multicore algorithm, PMS6MC, for the (l;d) -motif discovery problem in which we are to find all strings of length l that appear in every string of a given set of strings with at most d mismatches. PMS6MC is based on PMS6, which is currently the fastest single-core algorithm for motif discovery in large instances. The speedup, relative to PMS6, attained by our multicore algorithm ranges from a high of 6.62 for the (17,6) challenging instances to a low of 2.75 for the (13,4) challenging instances on an Intel 6-core system. We estimate that PMS6MC is 2 to 4 times faster than other parallel algorithms for motif search on large instances.
We focus on the constraint-based automated addition of nonmasking and stabilizing fault-tolerance to hierarchical programs. We specify legitimate states of the program in terms of constraints that should be satisfied ...
详细信息
We focus on the constraint-based automated addition of nonmasking and stabilizing fault-tolerance to hierarchical programs. We specify legitimate states of the program in terms of constraints that should be satisfied in those states. To deal with faults that may violate these constraints, we add recovery actions while ensuring interference freedom among the recovery actions added for satisfying different constraints. Since the constraint-based manual design of fault-tolerance is well known, we expect our approach to have a significant benefit in automating the addition of fault-tolerance. We illustrate our algorithm with four case studies: stabilizing mutual exclusion, stabilizing diffusing computation, a data dissemination problem in sensor networks, and tree maintenance. With experimental results, we show that the complexity of our algorithm is reasonable and that it can be reduced using the structure of the hierarchical systems. We also reduced the time complexity of the synthesis using parallelism. We consider two approaches to speedup the synthesis algorithm: first, the use of the multiple constraints that have to be satisfied during synthesis;second, the use of the distributed nature of the programs being synthesized. We show that our approaches provide significant reduction in the synthesis time. To our knowledge, this is the first instance where automated synthesis has been successfully used in synthesizing programs that are correct under fairness assumptions. Moreover, in three of the case studies considered in this paper, the structure of the recovery paths is too complex to permit existing heuristic-based approaches for adding recovery. (C) 2011 Elsevier B.V. All rights reserved.
In light of recent advances in non-volatilemain memory technology, Golab and Ramaraju reformulated the traditional mutex problem into the novel Recoverable Mutual Exclusion (RME) problem. In the best known solution fo...
详细信息
ISBN:
(纸本)9781450362177
In light of recent advances in non-volatilemain memory technology, Golab and Ramaraju reformulated the traditional mutex problem into the novel Recoverable Mutual Exclusion (RME) problem. In the best known solution for RME, due to Golab and Hendler from PODC 2017, a process incurs at most O(log n/log log n) remote memory references (RMRs) per passage on a system with n processes, where a passage is an interval from when a process enters the Try section to when it subsequently returns to Remainder. Their algorithm, however, guarantees this bound only for cache-coherent (CC) multiprocessors, leaving open the question of whether a similar bound is possible for distributed shared memory (DSM) multiprocessors. We answer this question affirmatively by designing an algorithm for a system with n processes, such that, it satisfies the same complexity bound as Golab and Hendler's for both CC and DSM multiprocessors. Our algorithm has some additional advantages over Golab and Hendler's: (i) its Exit section is wait-free, (ii) it uses only the Fetch-and-Store instruction, and (iii) on a CC machine our algorithm needs each process to have a cache of only O(1) words, while their algorithm needs a cache of size that is a function of n.
Recoverable mutual exclusion (RME) is a variation on the classic mutual exclusion (ME) problem that allows processes to crash and recover. The time complexity of RME algorithms is quantified in the same way as for ME,...
详细信息
ISBN:
(纸本)9781450357951
Recoverable mutual exclusion (RME) is a variation on the classic mutual exclusion (ME) problem that allows processes to crash and recover. The time complexity of RME algorithms is quantified in the same way as for ME, namely by counting remote memory references-expensive memory operations that traverse the processor-to-memory interconnect. Prior work has established that the RMR complexity of the RME problem for n processes is T(logn) for the class of algorithms that use read/write registers and single-word comparison primitives such as Compare-And-Swap (Golab and Ramaraju 2016), O(logn/log logn) for the class of algorithms that use read/write registers and additional single-word read-modify-primitives such as Fetch-And-Store (Golab and Hendler 2017), and T(1) for the class of algorithms that use read/write registers and specialized double-word read-modify-write primitives (Golab and Hendler 2017). These complexity bounds hold in a model of computation where processes may fail independently, and where a process that fails while accessing the mutex is required to recover eventually. This body of work leaves open two important questions: (i) what is the tight bound on the RMR complexity of RME for the class of algorithms that use read/write registers and commonly supported single-word read-modify-primitives;and (ii) how is the RMR complexity of RME affected by variations in the failure model? This paper answers both questions partially by showing that RME can be solved using O(1) RMRs per passage in the worst case in a model where failures are system-wide (i.e., all processes crash simultaneously), and processes receive additional information from the environment regarding the occurrence of the failure. The upper bound algorithm we present relies crucially on a novel RMR-efficient barrier that processes use to synchronize recovery actions after each failure. The barrier uses read/write registers and single-word Compare-And-Swap only. Additionally, we present a transfo
A number of NUMA-aware synchronization algorithms have been proposed lately to stress the scalability inefficiencies of existing locks. However their presupposed local lock granularity, a physical processor, is often ...
详细信息
ISBN:
(纸本)9781479961238
A number of NUMA-aware synchronization algorithms have been proposed lately to stress the scalability inefficiencies of existing locks. However their presupposed local lock granularity, a physical processor, is often not the optimum configuration for various workloads. This paper further explores the design space by taking into consideration the physical affinity between the cores within a single processor, and presents FSL to support variable and finely tuned group size for different lock contexts and instances. The new design provides a uniform model for the discussion of affinity locks and can completely subsume the previous NUMA-aware designs because they have only discussed one special case of the model. The interfaces of the new scheme are kernel-compatible and thus largely facilitate kernel incorporation. The investigation with the lock shows that an affinity lock with optimal local lock granularity can outperform its NUMA-aware counterpart by 29.40% and 58.28% at 80 cores with different workloads.
A long-standing open question has been whether lock-freedom and wait-freedom are fundamentally different progress conditions, namely, can the former be provided in situations where the latter cannot? This paper answer...
详细信息
ISBN:
(纸本)9781450357951
A long-standing open question has been whether lock-freedom and wait-freedom are fundamentally different progress conditions, namely, can the former be provided in situations where the latter cannot? This paper answers the question in the affirmative, by proving that there are objects with lock-free implementations, but without wait-free implementations-using objects of any finite power. We precisely define an object called n-process long-lived approximate agreement (n-LLAA), in which two sets of processes associated with two sides, 0 or 1, need to decide on a sequence of increasingly closer outputs. We prove that 2-LLAA has a lock-free implementation using reads and writes only, while n-LLAA has a lock-free implementation using reads, writes and (n - 1)-process consensus objects. In contrast, we prove that there is no wait-free implementation of the n-LLAA object using reads, writes and specific (n - 1)-process consensus objects, called (n - 1)-window registers.
暂无评论