In this research, we tackled the complex task of aligning audio and lyrics content automatically. This task entails the precise matching of lyrics with the corresponding audio segments in songs, necessitating the coor...
In light of recent advancements in Internet of Multimedia Things (IoMT) and 5G technology, both the variety and quantity of data have been rapidly increasing. Consequently, handling zero-shot cross-modal retrieval (ZS...
ISBN (print): 9781665473156
For a stream processing system that uses checkpoints as its fault-tolerance method, selecting an appropriate checkpoint period is key to the efficient operation of streaming applications. State-of-the-art stream processing systems currently support only fixed-interval checkpoints, which makes it difficult to strike a good trade-off between fault-tolerance overhead and failure-recovery cost in dynamically changing streaming scenarios. Moreover, in a complex distributed streaming environment, dynamic environmental indicators (e.g., workloads and failure rates) do not match static model assumptions; the dynamics of Twitter's trending-event data, for instance, change quickly. In this paper, we account for the dynamic changes of these environmental indicators and adaptively optimize processing delay and fault-recovery time. We then propose DACM, a method that dynamically adjusts the checkpoint interval via reinforcement learning. DACM adaptively optimizes processing delay and fault-recovery time while avoiding having to model the entire streaming environment. Experiments conducted on the Flink platform show that DACM reduces processing delay by 10% and failure-recovery time by 37% compared with existing checkpoint-interval optimization models.
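As a rough illustration of the reinforcement-learning idea, the following Python sketch tunes a checkpoint interval with tabular Q-learning; the environment hook `measure_cost`, the action set, and all constants are hypothetical stand-ins, not DACM's actual design.

```python
import random
from collections import defaultdict

# Minimal Q-learning sketch of reinforcement-learning-driven checkpoint
# interval tuning in the spirit of DACM. The environment hook
# `measure_cost` is a hypothetical placeholder, not the paper's system.

ACTIONS = (-5.0, 0.0, 5.0)         # shrink / keep / grow the interval (s)
ALPHA, GAMMA, EPS = 0.1, 0.9, 0.1  # learning rate, discount, exploration
Q = defaultdict(float)             # (state, action) -> estimated value

def discretize(load, failure_rate):
    """Bucket continuous metrics into a small discrete state space."""
    return (round(load, 1), round(failure_rate, 2))

def tune_step(interval, load, failure_rate, measure_cost):
    """Pick an interval adjustment, observe its cost, update Q."""
    state = discretize(load, failure_rate)
    action = (random.choice(ACTIONS) if random.random() < EPS
              else max(ACTIONS, key=lambda a: Q[(state, a)]))
    interval = min(300.0, max(1.0, interval + action))
    # measure_cost returns (processing_delay, expected_recovery_time);
    # the reward trades the two objectives off, as DACM does.
    delay, recovery = measure_cost(interval)
    reward = -(delay + recovery)
    # metrics are assumed slow-varying, so the next state is approximated
    # by the current one
    best_next = max(Q[(state, a)] for a in ACTIONS)
    Q[(state, action)] += ALPHA * (reward + GAMMA * best_next
                                   - Q[(state, action)])
    return interval
```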
ISBN (print): 9781665497473
This paper describes the application of code generated by the CAMPARY software to accelerate the solving of linear systems in the least squares sense on Graphics Processing Units (GPUs), in double double, quad double, and octo double precision. The goal is to use accelerators to offset the cost overhead caused by multiple double precision arithmetic. For the blocked Householder QR and the back substitution, of interest are the dimensions at which teraflop performance is attained. The other question of interest is the cost overhead factor incurred each time the precision is doubled. Experimental results are reported on five different NVIDIA GPUs, with a particular focus on the P100 and the V100, both capable of teraflop performance. Thanks to the high Compute to Global Memory Access (CGMA) ratios of multiple double arithmetic, teraflop performance is already attained when running the double double QR on 1,024-by-1,024 matrices, both on the P100 and the V100. For the back substitution, the dimension of the upper triangular system must be as high as 17,920 to reach one teraflop on the V100 in quad double precision, and then only when counting the time spent in the kernels. The lower performance of the back substitution at small dimensions does not prevent teraflop performance of the solver at dimension 1,024, as the time for the QR decomposition dominates. In doubling the precision from double double to quad double and from quad double to octo double, the observed cost overhead factors are lower than the factors predicted by arithmetical operation counts. This observation correlates with the increased performance at increased precision, which is again explained by the high CGMA ratios.
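To make the CGMA argument concrete, here is a minimal Python sketch of the error-free transforms underlying double double addition. CAMPARY itself generates C++/CUDA code, so this only illustrates the arithmetic, assuming IEEE-754 doubles (which Python floats are).

```python
# Error-free transforms behind double double arithmetic. Python floats
# are IEEE-754 doubles, so the sketch is faithful to the arithmetic,
# though CAMPARY's generated GPU code is C++/CUDA, not Python.

def two_sum(a, b):
    """Knuth's TwoSum: returns (s, e) with s = fl(a + b) and
    s + e == a + b exactly (6 floating-point operations)."""
    s = a + b
    v = s - a
    e = (a - (s - v)) + (b - v)
    return s, e

def dd_add(x, y):
    """Add two double-doubles (hi, lo): about 14 flops to produce one
    stored pair, which is why the compute-to-global-memory-access
    (CGMA) ratio grows as the precision is doubled."""
    s, e = two_sum(x[0], y[0])
    e += x[1] + y[1]
    return two_sum(s, e)   # renormalize so hi carries the leading bits

# 1 + 2**-60 does not fit in one double, but fits exactly in two:
print(dd_add((1.0, 0.0), (2.0**-60, 0.0)))  # (1.0, 8.673617379884035e-19)
```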
This work presents a high-speed, high-resolution 1024-input voltage-mode winner-take-all (WTA) circuit suitable for selective-attention-based processing systems. The circuit uses multi-stage WTAs to improve the resolution and s...
ISBN (print): 9781665481069
Apache Spark is a widely used in-memory processing system owing to its high performance. For fast data processing, Spark manages in-memory data, such as cached or shuffle (aggregation and sorting) data, in its own managed memory pools. However, despite this sophisticated memory management scheme, we found that Spark still suffers from out-of-memory (OOM) exceptions and high garbage collection (GC) overhead when wild memory consumers, which are not tracked by Spark and execute external code, use large amounts of memory. To resolve these problems, we propose PokeMem, an enhanced Spark that brings wild memory consumers under management, preventing them from stealthily consuming excessive memory. Our main idea is to open the black box of unmanaged memory regions in external code by providing customized data collections. PokeMem enables fine-grained control of the objects created within running tasks by spilling and reloading the objects of these custom data collections based on memory pressure and access patterns. To further reduce memory pressure, PokeMem exploits pre-built memory estimation models to predict the external code's memory usage and proactively acquires memory before the external code executes; it also monitors JVM heap usage to avoid critical memory pressure. With these techniques, our evaluations show that PokeMem outperforms vanilla Spark with up to 3x faster execution and 3.9x lower GC overhead, and successfully runs workloads, without OOM exceptions, that vanilla Spark fails to run.
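A minimal sketch of the spill-on-pressure collection idea follows, in Python rather than the JVM/Scala setting of the real system; the class name `SpillableList` and the byte budget are invented for illustration.

```python
import pickle, tempfile

class SpillableList:
    """Sketch of a PokeMem-style managed collection: keeps elements in
    memory until a byte budget is hit, then spills the batch to disk and
    reloads it lazily on iteration. The real system works inside the JVM
    with Spark's memory manager; this Python analogue is illustrative."""

    def __init__(self, budget_bytes):
        self.budget = budget_bytes
        self.used = 0
        self.in_mem = []    # objects currently held in memory
        self.spills = []    # paths of spilled batches, oldest first

    def append(self, obj):
        size = len(pickle.dumps(obj))
        if self.used + size > self.budget:
            self._spill()   # react to memory pressure
        self.in_mem.append(obj)
        self.used += size

    def _spill(self):
        """Write the in-memory batch to a temp file and free the memory."""
        with tempfile.NamedTemporaryFile(delete=False) as f:
            pickle.dump(self.in_mem, f)
            self.spills.append(f.name)
        self.in_mem, self.used = [], 0

    def __iter__(self):
        for path in self.spills:          # reload spilled batches lazily
            with open(path, "rb") as f:
                yield from pickle.load(f)
        yield from self.in_mem

xs = SpillableList(budget_bytes=1 << 16)  # ~64 KB budget forces spills
for i in range(100_000):
    xs.append(i)
print(sum(xs))                            # 4999950000
```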
ISBN (print): 9798350326598; 9798350326581
Recent commercial incarnations of processing-in-memory (PIM) maintain the standard DRAM interface and employ all-bank-mode execution to maximize bank-level memory bandwidth. Such synchronized all-bank PIM control can effectively manage conventional dense matrix-vector operations on matrices distributed evenly across banks with lock-step execution. Sparse matrix processing is another critical computation that can benefit significantly from the PIM architecture, but current all-bank PIM control cannot support the diverging execution caused by random sparsity. To accelerate such sparse matrix applications, this paper proposes partially synchronous execution for sparse matrix-vector multiplication (SpMV) and sparse triangular solve (SpTRSV), filling the gap between the practical constraints of PIM and the irregular nature of sparse computation. It allows the processing unit of each bank to diverge in a limited way to handle the irregular execution paths of sparse matrix computation, and it proposes compaction and distribution policies for the input matrix and vector. Beyond SpMV, the paper identifies SpTRSV as another key kernel and proposes its acceleration with PIM technology. The experimental evaluation shows that the new sparse PIM architecture outperforms an NVIDIA GeForce RTX 3080 GPU by 4.43x for SpMV and 3.53x for SpTRSV with a similar amount of DRAM bandwidth.
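The sketch below models why lock-step all-bank control wastes cycles on sparse inputs: rows are interleaved across banks and each nonzero costs one multiply-accumulate, so per-bank work diverges with the sparsity pattern. The bank count is illustrative, not the evaluated hardware, and scipy is assumed available.

```python
import numpy as np
from scipy.sparse import csr_matrix, random as sprand

# Toy model of bank-partitioned SpMV on a PIM device. Rows are interleaved
# round-robin across banks and each nonzero costs one multiply-accumulate
# "cycle" in a bank's processing unit, so the per-bank cycle counts expose
# the irregular work that forces lock-step all-bank control to wait for
# the slowest bank. N_BANKS is illustrative, not the evaluated hardware.

N_BANKS = 16

def banked_spmv(A, x):
    y = np.zeros(A.shape[0])
    cycles = np.zeros(N_BANKS, dtype=int)
    for i in range(A.shape[0]):
        bank = i % N_BANKS                     # round-robin row placement
        lo, hi = A.indptr[i], A.indptr[i + 1]
        y[i] = A.data[lo:hi] @ x[A.indices[lo:hi]]
        cycles[bank] += hi - lo                # one MAC per nonzero
    return y, cycles

A = csr_matrix(sprand(256, 256, density=0.03, random_state=0))
x = np.ones(256)
y, cycles = banked_spmv(A, x)
assert np.allclose(y, A @ x)
# Lock-step all-bank control pays for the slowest bank every round;
# limited divergence lets faster banks proceed toward the balanced mean.
print("slowest bank:", cycles.max(), "cycles; balanced ideal:", cycles.mean())
```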
ISBN (print): 9798350364613; 9798350364606
A totally asynchronous gradient algorithm with a fixed step size is proposed for federated learning. A mathematical model is presented and a convergence result is established. The convergence result is based on the concept of a macro-iteration sequence. The interest of the contribution is in showing that the asynchronous federated learning method converges even when gradients of the loss functions are updated by workers without ordering or synchronization and with possibly unbounded delays.
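The following Python simulation sketches that setting: workers return least-squares gradients computed at stale parameter snapshots, applied in arbitrary order with a fixed step size. The delay model, problem sizes, and step size are made up for illustration and are not the paper's construction.

```python
import random
import numpy as np

# Simulation sketch of totally asynchronous federated gradient descent on
# a least-squares objective. Workers read a (possibly stale) copy of the
# parameters, and their gradients are applied out of order with a fixed
# step size; sizes, delays, and the step are invented for illustration.

random.seed(0)
rng = np.random.default_rng(0)
A = rng.normal(size=(100, 5))
b = rng.normal(size=100)
shards = np.array_split(np.arange(100), 4)   # one data shard per worker
STEP = 0.001

x = np.zeros(5)
pending = []   # (worker, stale snapshot of x): in-flight computations

for t in range(40000):
    if pending and random.random() < 0.55:
        # a gradient arrives, computed at a stale iterate, out of order
        w, x_stale = pending.pop(random.randrange(len(pending)))
        Aw, bw = A[shards[w]], b[shards[w]]
        x -= STEP * Aw.T @ (Aw @ x_stale - bw)   # fixed step, no locking
    else:
        w = random.randrange(4)
        pending.append((w, x.copy()))            # worker reads current x

x_star = np.linalg.lstsq(A, b, rcond=None)[0]
print("async residual  :", np.linalg.norm(A @ x - b))
print("optimal residual:", np.linalg.norm(A @ x_star - b))
```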
ISBN (print): 9798350341355
Soft errors caused by terrestrial neutrons pose a threat to the reliability of safety-critical systems, such as self-driving applications. These applications, often built around neural networks, rely on graphics processing units (GPUs) because of their need for massive parallel computation. While neural networks inherently include redundant computation and possess a certain level of error tolerance, detectable unrecoverable errors (DUEs) can be more detrimental than silent data corruption (SDC), as they can result in temporary service unavailability. This study focuses on illegal memory access, a primary cause of DUEs, and proposes a programming method that can detect illegal addresses. In the single instruction, multiple threads (SIMT) scheme, data addresses are computed regularly from the thread ID, and this regularity is exploited to identify illegal addresses through inter-thread communication. To evaluate the effectiveness of the proposed method, fault injection campaigns were conducted on matrix multiplication, vector addition, and transposition. The experimental results show that the proposed method reduces the DUE rate by 17.3%, 86.8%, and 87.1% for these respective operations.
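A small Python stand-in for one warp illustrates the regularity check (on a real GPU the address exchange would use warp shuffles); the fault model, a single flipped address bit, and all constants are chosen for illustration only.

```python
# One "warp" of 32 lanes computing addresses as base + tid * stride, the
# regular SIMT pattern the method relies on. Exchanging addresses between
# neighboring lanes (a warp shuffle on real hardware) exposes a corrupted
# address before it is dereferenced.

WARP, STRIDE, BASE = 32, 8, 0x10000

def lane_addresses(faulty_lane=None, flipped_bit=20):
    addrs = [BASE + t * STRIDE for t in range(WARP)]
    if faulty_lane is not None:
        addrs[faulty_lane] ^= 1 << flipped_bit   # inject a soft error
    return addrs

def detect_illegal(addrs):
    """Each lane compares its address with its right neighbor's; any
    pair whose difference is not exactly STRIDE flags a fault."""
    return [t for t in range(WARP - 1)
            if addrs[t + 1] - addrs[t] != STRIDE]

assert detect_illegal(lane_addresses()) == []          # clean execution
print(detect_illegal(lane_addresses(faulty_lane=5)))   # -> [4, 5]
```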
With the application of blockchain light nodes in embedded devices, how to alleviate the computing pressure that complex operations such as SPV verification of transactions place on the CPUs of embedded devices, and improve ...
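Although the abstract is truncated, SPV verification itself reduces to checking a Merkle branch against the block-header root; the hedged sketch below shows that core step in Python with Bitcoin-style double SHA-256, independent of whatever offloading scheme the paper proposes.

```python
import hashlib

# SPV verification, the operation light nodes must perform, boils down
# to hashing a transaction up the Merkle tree with the supplied sibling
# hashes and comparing against the root in the block header. This is a
# generic sketch, not the paper's offloading design.

def dhash(data: bytes) -> bytes:
    return hashlib.sha256(hashlib.sha256(data).digest()).digest()

def verify_spv(tx_hash, branch, index, merkle_root):
    """branch: sibling hashes bottom-up; index: tx position in block."""
    h = tx_hash
    for sibling in branch:
        if index & 1:               # we are the right child
            h = dhash(sibling + h)
        else:                       # we are the left child
            h = dhash(h + sibling)
        index >>= 1
    return h == merkle_root

# toy block with four transactions
txs = [dhash(bytes([i])) for i in range(4)]
l01, l23 = dhash(txs[0] + txs[1]), dhash(txs[2] + txs[3])
root = dhash(l01 + l23)
assert verify_spv(txs[2], [txs[3], l01], index=2, merkle_root=root)
```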