ISBN:
(Print) 9798350383782; 9798350383799
Multivariate time series (MTS) classification has been tackled using various methods, including reservoir computing (RC), which generates efficient vectorized representations such as the reservoir state (RS). RS shines when handling extensive classes or training sets but demands longer processing and substantial memory. Addressing this, we present the Parallel Reservoir Echo State Network (PR-ESN), an optimized parallel training and evaluation algorithm rooted in the ESN principle. It leverages both CPU shared-memory and distributed-memory parallel architectures to efficiently capture the reservoir state's optimal model-space representation, addressing computational challenges in MTS analysis. Distinguishing itself from previous works, PR-ESN combines distributed parallel processing at the network level with shared-memory multiprocessing at the node level. This results in reduced memory requirements and faster processing, making it a significant contribution to the field. Key features include PR-ESN's distributed training and evaluation, shared-memory parallelization, and MSR concatenation for comprehensive analysis of distributed model-space representations. Testing on real-world MTS and benchmark ECG data shows that PR-ESN-based classifiers achieve superior accuracy and faster processing times with optimal memory usage.
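For readers unfamiliar with reservoir-state representations, the sketch below shows, under simplifying assumptions, how a plain echo state network turns one multivariate series into a fixed-size state vector that a downstream classifier could use. The names make_reservoir, reservoir_states, leak, and spectral_radius are illustrative, not PR-ESN's API, and none of the distributed or shared-memory machinery is reproduced here.

```python
# Minimal single-node echo state network (ESN) sketch: turn a multivariate
# time series into a reservoir-state representation that a linear classifier
# can consume. Illustrative only; PR-ESN's parallel training is not shown.
import numpy as np

rng = np.random.default_rng(0)

def make_reservoir(n_inputs, n_reservoir=200, spectral_radius=0.9, density=0.05):
    """Random input and recurrent weights, recurrent part rescaled by spectral radius."""
    W_in = rng.uniform(-0.5, 0.5, size=(n_reservoir, n_inputs))
    W = rng.uniform(-0.5, 0.5, size=(n_reservoir, n_reservoir))
    W[rng.random(W.shape) > density] = 0.0            # sparsify the reservoir
    W *= spectral_radius / max(abs(np.linalg.eigvals(W)))
    return W_in, W

def reservoir_states(X, W_in, W, leak=0.3):
    """Run one MTS (time x variables) through the reservoir; return all states."""
    T, _ = X.shape
    h = np.zeros(W.shape[0])
    H = np.empty((T, W.shape[0]))
    for t in range(T):
        pre = np.tanh(W_in @ X[t] + W @ h)
        h = (1 - leak) * h + leak * pre               # leaky integration
        H[t] = h
    return H

# Example: represent one 3-variable series of length 100 by its last reservoir state.
X = rng.standard_normal((100, 3))
W_in, W = make_reservoir(n_inputs=3)
representation = reservoir_states(X, W_in, W)[-1]     # feed this to a classifier
```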
ISBN:
(Digital) 9798350352917
ISBN:
(Print) 9798350352924; 9798350352917
Large-scale data analytics, scientific simulation, and deep learning codes in HPC perform massive computations on data greatly exceeding the bounds of main memory. These out-of-core algorithms suffer from severe data movement penalties, programming complexity, and limited code reuse. To solve this, HPC sites have steadily increased DRAM capacity. However, this is not sustainable due to financial and environmental costs. A more elegant, low-cost, and portable solution is to expand memory to distributed multi-tiered storage. In this work, we propose MegaMmap: a software distributed shared memory (DSM) that enlarges effective memory capacity through intelligent tiered DRAM and storage management. MegaMmap provides workload-aware data organization, eviction, and prefetching policies to reduce DRAM consumption while ensuring speedy access to critical data. A variety of memory coherence optimizations are provided through an intuitive hinting system. Evaluations show that various workloads can be executed with a fraction of the DRAM while offering competitive performance.
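As a rough illustration of the out-of-core pattern MegaMmap targets, the sketch below streams a file-backed array through a small DRAM working set using numpy.memmap. The file name and chunk size are arbitrary assumptions; nothing here uses MegaMmap's DSM, hinting, or tiering interfaces.

```python
# Sketch of the out-of-core idea: keep a large array on storage and touch it
# through a memory mapping so only the pages in use are resident in DRAM.
import numpy as np

N = 50_000_000                      # ~400 MB of float64, more than we want resident
path = "big_vector.dat"             # hypothetical scratch file on a fast storage tier

# Create the file-backed array once; pages are written through to storage.
x = np.memmap(path, dtype=np.float64, mode="w+", shape=(N,))

# Stream over it in chunks so only a small working set is ever in DRAM; a DSM
# like MegaMmap would additionally prefetch/evict across nodes and tiers.
chunk = 1_000_000
total = 0.0
for start in range(0, N, chunk):
    block = x[start:start + chunk]
    block[:] = np.arange(start, start + block.size)   # touch/initialize these pages
    total += float(block.sum())
    x.flush()                                          # write back dirty pages so the OS can evict them

print(f"checksum over {N} elements with a ~{chunk * 8 / 1e6:.0f} MB working set:", total)
```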
ISBN:
(Digital) 9798350352917
ISBN:
(Print) 9798350352924; 9798350352917
Multiplying two sparse matrices (SpGEMM) is a common computational primitive used in many areas including graph algorithms, bioinformatics, algebraic multigrid solvers, and randomized sketching. Distributed-memory parallel algorithms for SpGEMM have mainly focused on sparsity-oblivious approaches that use 2D and 3D partitioning. Sparsity-aware 1D algorithms can theoretically reduce communication by not fetching nonzeros of the sparse matrices that do not participate in the multiplication. Here, we present a distributed-memory 1D SpGEMM algorithm and implementation. It uses MPI RDMA operations to mitigate the cost of packing/unpacking submatrices for communication, and it uses a block fetching strategy to avoid excessive fine-grained messaging. Our results show that our 1D implementation outperforms state-of-the-art 2D and 3D implementations within CombBLAS for many configurations, inputs, and use cases, while remaining conceptually simpler.
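A minimal, single-process sketch of the sparsity-aware 1D idea, assuming SciPy CSR matrices: each simulated rank owns a block of rows of A and "fetches" only the rows of B whose indices appear among its local nonzero columns. The RDMA block-fetching machinery of the actual implementation is not modeled.

```python
# Sparsity-aware 1D SpGEMM partitioning, simulated serially: a rank never
# touches rows of B that cannot contribute to its local rows of C = A @ B.
import numpy as np
import scipy.sparse as sp

n, nprocs = 1000, 4
A = sp.random(n, n, density=0.005, format="csr", random_state=1)
B = sp.random(n, n, density=0.005, format="csr", random_state=2)

rows_per = n // nprocs
C_blocks = []
for rank in range(nprocs):
    lo = rank * rows_per
    hi = (rank + 1) * rows_per if rank < nprocs - 1 else n
    A_local = A[lo:hi]                               # locally owned row block of A
    needed = np.unique(A_local.indices)              # rows of B that actually participate
    B_needed = B[needed]                             # would be RDMA gets from owner ranks
    # Remap A_local's column indices into the compacted set of fetched B rows.
    remap = -np.ones(n, dtype=np.int64)
    remap[needed] = np.arange(needed.size)
    A_compact = sp.csr_matrix(
        (A_local.data, remap[A_local.indices], A_local.indptr),
        shape=(A_local.shape[0], needed.size),
    )
    C_blocks.append(A_compact @ B_needed)            # local multiply of the row block

C = sp.vstack(C_blocks)
assert abs(C - A @ B).sum() < 1e-9                   # matches the direct product
```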
ISBN:
(Digital) 9798350352917
ISBN:
(Print) 9798350352924; 9798350352917
Deep Learning Recommendation Models (DLRMs) are pivotal in various sectors, yet they are hindered by the high memory demands of embedding tables and the significant communication overhead in distributed training environments. Traditional approaches, like Tensor-Train (TT) decomposition, although effective for compressing these tables, introduce substantial computational burdens. Furthermore, existing frameworks for distributed training are inadequate due to the excessive data exchange requirements. This paper proposes EcoRec, an advanced library designed to expedite the training of DLRMs through a synergistic integration of TT decomposition technology and distributed training. EcoRec introduces a novel computation pattern that eliminates redundancy in TT operations, alongside an efficient multiplication pathway, significantly reducing computation time. Additionally, it provides a unique micro-batching technique with sorted indices to decrease memory demands without additional computational costs. EcoRec also features a novel pipeline training system for embedding layers, ensuring balanced data distribution and enhanced communication efficiency. EcoRec, built on PyTorch and CUDA, has been evaluated on a 32-GPU cluster. The results show EcoRec significantly outperforms the existing ELRec system, achieving up to a 3.1x speedup and a 38.5% reduction in memory requirements. EcoRec marks a notable advancement in high-performance DLRM training.
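To make the TT-compressed embedding idea concrete, the sketch below reconstructs a single embedding row from three small TT cores in numpy. The mode sizes, ranks, and the function name tt_embedding_row are made up for illustration and do not reflect EcoRec's kernels, micro-batching, or pipeline.

```python
# Tensor-Train (TT) compressed embedding lookup sketch: a table of shape
# (I1*I2*I3, J1*J2*J3) is stored as three small cores, and one row is
# reconstructed on demand from one slice of each core.
import numpy as np

rng = np.random.default_rng(0)
I = (8, 10, 12)        # row-index factorization: vocabulary = 8*10*12 = 960
J = (4, 4, 4)          # embedding-dim factorization: dim = 64
R = (1, 16, 16, 1)     # TT ranks

# Core k has shape (R[k], I[k], J[k], R[k+1]); far fewer parameters than 960*64.
cores = [rng.standard_normal((R[k], I[k], J[k], R[k + 1])) * 0.1 for k in range(3)]

def tt_embedding_row(row):
    """Reconstruct one embedding row (length prod(J)) from the TT cores."""
    # Decompose the flat row index into per-mode indices (i1, i2, i3).
    idx = []
    for ik in reversed(I):
        idx.append(row % ik)
        row //= ik
    idx = idx[::-1]
    # Contract the selected slices, accumulating the embedding-dimension modes.
    out = cores[0][:, idx[0], :, :]                  # (1, J1, R1)
    for k in range(1, 3):
        slice_k = cores[k][:, idx[k], :, :]          # (Rk, Jk, Rk+1)
        out = np.einsum("ajb,bkc->ajkc", out, slice_k).reshape(1, -1, R[k + 1])
    return out.reshape(-1)                           # length J1*J2*J3 = 64

print(tt_embedding_row(123).shape)                   # (64,)
```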
ISBN:
(Digital) 9798350352917
ISBN:
(Print) 9798350352924; 9798350352917
Influence maximization (IM) is the problem of finding the k most influential nodes in a graph. We propose distributed-memory parallel algorithms for the two main kernels of a state-of-the-art implementation of one IM algorithm, influence maximization via martingales (IMM). The baseline relies on a bulk-synchronous parallel approach and uses replication to reduce communication and achieve approximate load balance, at the cost of synchronization and high memory requirements. By contrast, our method fully distributes the data, thereby improving memory scalability, and uses fine-grained asynchronous parallelism to improve network utilization, at the cost of doing more communication. We show our design and implementation can achieve up to 29.6x speedup over the MPI-based state of the art on synthetic and real-world network graphs. Moreover, ours is the first implementation that can run IMM to find influencers in the 'twitter' graph (41M nodes and 1.4B edges) in 200 seconds using 8K CPU cores of the NERSC Perlmutter supercomputer.
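The toy sketch below illustrates the reverse-influence-sampling core that IMM builds on: sample reverse-reachable (RR) sets under an independent-cascade model, then greedily pick the k nodes covering the most sets. It is serial and unoptimized; the paper's contribution lies in distributing and overlapping exactly these two kernels, which this sketch does not attempt.

```python
# Serial toy version of the two IMM kernels: RR-set sampling and greedy max cover.
import random
from collections import defaultdict

def sample_rr_set(n, in_neighbors, p=0.1):
    """One RR set: nodes that would influence a random root under live-edge sampling."""
    root = random.randrange(n)
    rr, frontier = {root}, [root]
    while frontier:
        v = frontier.pop()
        for u in in_neighbors.get(v, []):
            if u not in rr and random.random() < p:   # edge (u, v) is "live"
                rr.add(u)
                frontier.append(u)
    return rr

def imm_greedy(n, in_neighbors, k=2, num_rr=20000):
    rr_sets = [sample_rr_set(n, in_neighbors) for _ in range(num_rr)]
    covers = defaultdict(set)                         # node -> indices of RR sets it covers
    for i, rr in enumerate(rr_sets):
        for u in rr:
            covers[u].add(i)
    seeds, covered = [], set()
    for _ in range(k):                                # greedy max coverage
        best = max(covers, key=lambda u: len(covers[u] - covered))
        seeds.append(best)
        covered |= covers[best]
    return seeds

# Toy directed graph given as in-neighbor lists.
in_neighbors = {1: [0], 2: [0, 1], 3: [2], 4: [2, 3]}
print(imm_greedy(n=5, in_neighbors=in_neighbors, k=2))
```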
Large-scale Computational Fluid Dynamics (CFD) simulations are typical HPC applications that require both high memory bandwidth and large memory capacity. However, it is difficult to achieve high performance for such ...
In this paper, we study the partitioning of a context-aware shared memory data structure so that it can be implemented as a distributed data structure running on multiple machines. By context-aware data structures, we...
ISBN:
(Print) 9798350372977; 9798350372984
Spiking Neural Networks (SNNs) have recently been used as a computational model for applications such as deep learning, image recognition and machine learning. Similar to the biological brain, SNN neurons depend on the membrane level to fire an output. If the level exceeds a specified threshold, the neuron sends an output to activate the next neurons. This leads to an unbalanced workload among the neurons. The dynamically changing membrane level is stored inside a neuron. In hardware, this storage can be implemented as a register or on-chip memory, which determines the amount of consumed resources and, in turn, affects the network scalability. SNN accelerators have recently been implemented on UltraScale FPGA devices for high-performance purposes. On-chip memories on these devices are classified as distributed memory, Block RAMs (BRAMs) and Ultra RAMs (URAMs). In this paper, we explored the impact of using different on-chip memories to store the membrane level of SNN neurons. We implemented a parameterizable SpIking Neural networK (SINK) accelerator where the network capacity and weight width are parameters. SINK has the ability to run in four different modes based on the memory type. We ran SINK on a Zynq UltraScale+ ZCU104 FPGA device and measured the utilization of hardware resources (LUTs), registers, memory, power consumption and performance. The results show that URAM can be the best fit to store the membrane level, since it uses 30%, 11% and 2% fewer LUTs, registers and power, respectively, compared with BRAM and distributed memory.
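A software analogue of the membrane-level state discussed above, assuming a simple leaky integrate-and-fire model: each neuron accumulates weighted input spikes into a stored potential and fires when it crosses a threshold. In SINK this per-neuron state is what would be placed in distributed memory, BRAM, or URAM; here it is just a numpy array, and all constants are arbitrary.

```python
# Leaky integrate-and-fire layer sketch: the "membrane" array plays the role of
# the per-neuron register/BRAM/URAM storage discussed in the abstract.
import numpy as np

rng = np.random.default_rng(0)
n_in, n_out, T = 64, 16, 100
W = rng.uniform(0, 0.2, size=(n_out, n_in))   # synaptic weights
threshold, leak = 1.0, 0.9

membrane = np.zeros(n_out)                    # stored membrane level per neuron
spike_counts = np.zeros(n_out, dtype=int)

for t in range(T):
    in_spikes = (rng.random(n_in) < 0.05).astype(float)   # random input spike train
    membrane = leak * membrane + W @ in_spikes             # integrate with leak
    fired = membrane >= threshold                          # threshold comparison
    spike_counts += fired
    membrane[fired] = 0.0                                  # reset neurons that fired

print("output spikes per neuron:", spike_counts)
```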
We develop a distributed-memory parallel algorithm for performing batch updates on streaming graphs, where vertices and edges are continuously added or removed. Our algorithm leverages distributed sparse matrices as t...
ISBN:
(Print) 9798350339864
A distributed persistent key-value store (KVS) plays an important role in today's storage infrastructure. The development of persistent memory (PM) and remote direct memory access (RDMA) makes it possible to build distributed persistent KVSs that provide fast data access. However, prior works focus on either PM-oriented or RDMA-oriented optimizations for key-value stores. We find these optimizations disallow a simple porting of an RDMA-enabled KVS to PM or vice versa. This paper proposes FastStore, a high-performance distributed persistent KVS that fully exploits RDMA features and PM-friendly optimizations. First, FastStore utilizes RDMA-enabled PM exposure to establish direct indexing at the client side to reduce RTTs for reading values. Meanwhile, PM exposure allows PM sharing among cluster nodes, which helps to mitigate attribute-value skewness. Then, FastStore designs a PM-friendly ownership-transferring log and a failure-atomic slotted-page allocator to achieve highly efficient PM management without PM leakage. Finally, FastStore proposes a volatile search key for its B+-tree indexing to reduce excessive PM accesses. We implement FastStore, and the evaluation shows that FastStore outperforms the state-of-the-art ordered KVS Sherman with 2.8x higher throughput and 71.5% fewer RTTs.
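The small simulation below is meant only to show why client-side direct indexing saves round trips: once the client caches an index entry it read directly from the server, a repeated GET needs a single remote read instead of an index lookup plus a value read. The classes RemoteKVS and Client are hypothetical stand-ins; no RDMA, PM, or FastStore code is involved.

```python
# Count "remote reads" with and without a locally cached index entry.
class RemoteKVS:
    """Stand-in for a server's persistent memory: an index plus a value log."""
    def __init__(self):
        self.value_log = []        # append-only "PM" region holding values
        self.index = {}            # key -> offset into value_log
        self.remote_reads = 0      # one-sided reads issued by the client

    def put(self, key, value):
        self.index[key] = len(self.value_log)
        self.value_log.append(value)

    def read_index_entry(self, key):       # costs one remote read
        self.remote_reads += 1
        return self.index.get(key)

    def read_value(self, offset):          # costs one remote read
        self.remote_reads += 1
        return self.value_log[offset]

class Client:
    """Caches index entries locally so repeated GETs need a single remote read."""
    def __init__(self, server):
        self.server = server
        self.index_cache = {}

    def get(self, key):
        if key not in self.index_cache:                       # cold miss: 2 remote reads
            self.index_cache[key] = self.server.read_index_entry(key)
        return self.server.read_value(self.index_cache[key])  # warm hit: 1 remote read

server = RemoteKVS()
server.put("user:42", b"alice")
client = Client(server)
client.get("user:42"); client.get("user:42")
print("remote reads for two GETs:", server.remote_reads)      # 3 instead of 4
```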