Software distributed shared memory (SDSM) systems provide programmers with a shared memory programming environment across distributed-memory architectures. In contrast to the message passing programming environment, the SDSM can resolve data dependencies within the application without the programmer having to explicitly specify communication. However, this service comes at a cost in performance. It therefore makes sense to use message passing directly when data dependencies are easy to resolve that way; for example, specifying data transfer for large contiguous regions of memory is straightforward. This paper outlines how the Danui SDSM library has been extended to include support for message passing. Four different message passing transfer types are identified, depending on whether the data being sent or received resides in private or globally shared buffers. Transfers between globally shared buffers are further categorized as symmetrical or asymmetrical, depending on whether they correspond to the same region of shared memory. The implications of each transfer type for the memory consistency of the global address space are discussed. Central to the Danui SDSM extension is the use of information provided and implied by message passing operations. The overhead of the implementation is analyzed.
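The four transfer categories can be illustrated with a small classification sketch; the function name and labels are illustrative only, not part of the Danui API:

```python
def classify_transfer(src_is_shared, dst_is_shared, same_region=False):
    """Classify a message-passing transfer by the kind of buffer on each side.

    Transfers between globally shared buffers are further split into
    symmetrical (both sides name the same shared region) and asymmetrical.
    """
    if src_is_shared and dst_is_shared:
        return ("shared->shared (symmetrical)" if same_region
                else "shared->shared (asymmetrical)")
    if src_is_shared:
        return "shared->private"
    if dst_is_shared:
        return "private->shared"
    return "private->private"
```

Only the shared-to-shared cases carry consistency implications for the global address space, which is why the abstract singles them out for the symmetrical/asymmetrical distinction.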
ISBN (digital): 9798350387568
ISBN (print): 9798350387575
The increasing demand for edge computing requires effective Deep Neural Network (DNN) accelerators suitable for resource-limited environments. This paper presents a new method that uses a distributed control methodology for DNN acceleration on edge devices. Our architecture offers significant improvements over a comparable architecture without this feature, including a reduction in memory requirements of up to 7× and speedups of up to 7.42× in DNN processing. Additionally, our design achieves an energy efficiency of up to 4300 MOPS/W, demonstrating its potential to address resource constraints while improving DNN performance on edge platforms.
ISBN (print): 9780999241103
Multiple kernel clustering (MKC) algorithms have been extensively studied and applied to various applications. Although they demonstrate great success in both theory and applications, existing MKC algorithms cannot be applied to large-scale clustering tasks due to: i) the heavy computational cost of calculating the base kernels; and ii) insufficient memory to load the kernel matrices. In this paper, we propose an approximate algorithm to overcome these issues and make MKC applicable to large-scale applications. Specifically, our algorithm trains a deep neural network to regress the indicating matrix generated by MKC algorithms on a small subset, obtains the approximate indicating matrix of the whole data set using the trained network, and finally performs k-means on the output of the network. By mapping features into the indicating matrix directly, our algorithm avoids computing the full kernel matrices, which dramatically decreases the memory requirement. Extensive experiments show that our algorithm consumes less time than most comparable algorithms while achieving performance comparable to that of exact MKC algorithms.
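A minimal sketch of the three-stage pipeline described above, with a linear least-squares model standing in for the paper's deep network and a toy one-hot matrix standing in for the MKC-generated indicating matrix on the subset (all data and names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "large" data set: two well-separated Gaussian blobs of 50 points each.
X = np.vstack([rng.normal(0.0, 0.3, (50, 2)),
               rng.normal(3.0, 0.3, (50, 2))])
Xb = np.hstack([X, np.ones((len(X), 1))])      # features plus a bias column

# Stage 1 (assumed): an MKC algorithm run on a small subset yields an
# indicating matrix; here a one-hot stand-in for the real MKC output.
sub = np.r_[0:10, 50:60]                       # 20-point subset
H_sub = np.zeros((20, 2))
H_sub[:10, 0] = 1.0
H_sub[10:, 1] = 1.0

# Stage 2: regress the indicating matrix from raw features (a linear model
# stands in for the deep network) -- no base kernels are ever computed.
W, *_ = np.linalg.lstsq(Xb[sub], H_sub, rcond=None)

# Stage 3: approximate indicating matrix for ALL points, then k-means on it.
H_full = Xb @ W

def kmeans(Z, seeds, iters=20):
    """Plain Lloyd's algorithm with explicit seed points."""
    C = seeds.copy()
    for _ in range(iters):
        labels = ((Z[:, None, :] - C[None, :, :]) ** 2).sum(-1).argmin(1)
        C = np.array([Z[labels == j].mean(0) for j in range(len(C))])
    return labels

labels = kmeans(H_full, H_full[[0, -1]])       # one seed from each blob
```

The memory saving comes from never materializing the n×n kernel matrices: only the n×k indicating matrix and the regression model are held in memory.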
This paper addresses the self-management of in-memory distributed data grid platforms. A growing number of applications rely on these platforms to speed up access to large sets of data. However, they are complex to manage due to the diversity of configuration and load profiles. The proposed approach employs an adaptation policy expressed in terms of high-level goals to facilitate the task of the system manager and to address the complexity of managing multiple configurations. The approach is validated experimentally using Red Hat's open-source Infinispan platform.
Neural network pruning is an essential technique for reducing the size and complexity of deep neural networks, enabling large-scale models on devices with limited resources. However, existing pruning approaches heavily rely on training data to guide the pruning strategy, making them ineffective for federated learning over distributed and confidential datasets. Additionally, the memory- and computation-intensive pruning process becomes infeasible for resource-constrained devices in federated learning. To address these challenges, we propose FedTiny, a distributed pruning framework for federated learning that generates specialized tiny models for memory- and computing-constrained devices. We introduce two key modules in FedTiny to adaptively search for coarse- and finer-pruned specialized models that fit deployment scenarios with sparse and cheap local computation. First, an adaptive batch normalization selection module is designed to mitigate biases in pruning caused by the heterogeneity of local data. Second, a lightweight progressive pruning module finely prunes the models under strict memory and computational budgets, allowing the pruning policy for each layer to be determined gradually rather than by evaluating the overall model structure. The experimental results demonstrate the effectiveness of FedTiny, which outperforms state-of-the-art approaches, particularly when compressing deep models into extremely sparse tiny models. FedTiny achieves an accuracy improvement of 2.61% while reducing the computational cost by 95.91% and the memory footprint by 94.01% compared to state-of-the-art methods.
Sensor networks are exposed to hostile environments that may cause failures of single nodes and communication links, affecting the whole network. Localizing the cause of a problem in space and time requires collecting diagnostic data from the network. Due to resource and energy constraints, however, it is not possible to continuously collect detailed diagnostic data from all nodes. We therefore propose an incremental approach in which data is first logged to flash memory, and the user can later pose a sequence of diagnostic queries with decreasing scope and increasing level of detail to pinpoint the cause of the problem.
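The incremental-query idea can be sketched as follows; the log format, detail levels, and query interface are hypothetical, chosen only to illustrate "decreasing scope, increasing detail":

```python
# Per-node logs kept in (simulated) flash: (timestamp, level, message),
# where level 0 is a cheap summary and level 1 is detailed diagnostics.
logs = {
    1: [(10, 0, "ok"), (20, 0, "ok")],
    2: [(10, 0, "ok"), (25, 0, "link-down"),
        (25, 1, "neighbor 3 lost, rssi=-97")],
    3: [(12, 0, "ok")],
}

def query(nodes, t_from, t_to, level):
    """Return log entries matching the given node scope, time window,
    and maximum detail level (coarser queries cost less to answer)."""
    return {n: [e for e in logs[n]
                if t_from <= e[0] <= t_to and e[1] <= level]
            for n in nodes}

# Round 1: network-wide summary query flags node 2 as suspicious.
coarse = query([1, 2, 3], 0, 30, level=0)
suspects = [n for n, es in coarse.items() if any(m != "ok" for *_, m in es)]

# Round 2: detailed query restricted to the suspect node and time window.
detail = query(suspects, 20, 30, level=1)
```

Each round transfers only what the previous round justified, which is the point of the incremental approach under energy constraints.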
A loosely coupled ring-connected parallel processing system with distributed shared memory, called KORP (Kobe University Ring-connected Parallel Computer), is being developed. High-speed communication in KORP is performed by dedicated data-transmission hardware called the Stream Controller (SC). The consistency of shared data is maintained by a coherence-protocol controller, a dedicated hardware unit for coherence control. The authors describe the architecture of KORP and its scalable configuration, hyper-KORP. A trace-driven simulator is developed to analyze the distributed shared-memory system of KORP, and an SC performance evaluation by simulation using a parallel benchmark application is presented.
This paper evaluates the use of per-node multi-threading to hide remote memory and synchronization latencies in a software DSM. As with hardware systems, multi-threading in software systems can be used to reduce the costs of remote requests by switching threads when the current thread blocks. We added multi-threading to the CVM software DSM and evaluated its impact on performance for a suite of common shared memory programs. Multi-threading resulted in speed improvements of at least 17% in three of the seven applications in our suite, and lesser improvements in the other applications. However, we found that: good performance is not always achievable transparently for non-trivial applications; multi-threading can negatively interact with DSM operations; multi-threading decreases cache and TLB locality; and any multi-threading speedup is dependent on available work.
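The latency-hiding mechanism, switching to another thread whenever the current one blocks on a remote request, can be sketched with a toy cooperative scheduler (an illustration of the idea only, not the CVM implementation):

```python
from collections import deque

REMOTE = "remote"   # a task yields this marker to signal a blocking fetch

def run(tasks):
    """Round-robin over tasks, switching away whenever one blocks."""
    trace = []                         # (task id, event) per scheduling step
    ready = deque(enumerate(tasks))
    while ready:
        tid, task = ready.popleft()
        try:
            event = next(task)         # run until the task blocks or finishes
            trace.append((tid, event))
            ready.append((tid, task))  # requeue; remote reply assumed by then
        except StopIteration:
            pass                       # task finished; drop it
    return trace

def worker(n_fetches):
    """A thread that issues several blocking remote page fetches."""
    for _ in range(n_fetches):
        yield REMOTE

trace = run([worker(2), worker(3)])
```

While one worker waits on its "remote" fetch, the other runs, so the remote latency is overlapped with useful work; the paper's caveats (cache/TLB locality loss, dependence on available work) are exactly the costs this switching introduces.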
With the dominance of multicore processors, parallel programming has become more important. Transactional memory (TM) is a promising solution to the synchronisation issues that hurt parallel programmers. While there is a great deal of research on the implementation trade-offs of TM, there are few studies of the applications that might use the technique, even though such studies are essential both for providing feedback to TM designers and for helping potential users. This paper takes the first step in this work by presenting our identification of emerging applications for a comprehensive study of TM. The selection spans application domains including popular server/client software, multimedia applications, bioinformatics applications, data mining applications, and other scientific applications, covering most of the dwarfs. A preliminary experiment is also provided to illustrate what can be gained from this work.
Betweenness centrality is a fundamental centrality measure that quantifies how important a node or edge is within a network, based on how often it lies on the shortest paths between all pairs of nodes. In this paper, we develop a scalable distributed algorithm that enables every node in a network to estimate its own betweenness and the betweenness of the edges incident on it with only local interaction, without centralized coordination or high memory usage. The development is based on exploiting various local properties of shortest paths and on formulating and solving an unconstrained distributed optimization problem. We also evaluate the algorithm's performance via simulation on a number of random geometric graphs, showing that it yields betweenness estimates that are fairly accurate in terms of ordering.
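For reference, the exact quantity being estimated can be computed centrally with Brandes' algorithm; a minimal sketch for unweighted, undirected graphs (the paper's contribution is a distributed estimator of this, which this sketch does not reproduce):

```python
from collections import deque

def betweenness(adj):
    """Exact node betweenness (Brandes 2001) for an unweighted,
    undirected graph given as {node: [neighbors]}."""
    bc = {v: 0.0 for v in adj}
    for s in adj:
        # BFS from s, recording shortest-path counts and predecessors.
        sigma = {v: 0 for v in adj}; sigma[s] = 1
        dist = {v: -1 for v in adj}; dist[s] = 0
        preds = {v: [] for v in adj}
        order = []
        q = deque([s])
        while q:
            v = q.popleft()
            order.append(v)
            for w in adj[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    q.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Accumulate pair dependencies in reverse BFS order.
        delta = {v: 0.0 for v in adj}
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                bc[w] += delta[w]
    # Each undirected pair was counted from both endpoints.
    return {v: b / 2 for v, b in bc.items()}
```

On a path 0-1-2-3, the two interior nodes each lie on two shortest paths, so their betweenness is 2; the hub of a 3-leaf star scores 3, one per leaf pair.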