ISBN (Print): 9798350338461
The proceedings contain 34 papers. The topics discussed include: EEG based thought-to-text translation via deep learning; penta-band circular patch antenna with partial ground for wireless applications; fingerprint generation and authentication through adaptive convolution generative adversarial network (ADCGAN); go together: bridging the gap between learners and teachers; design, implementation, and power analysis for network-on-chip architectures; SELthA: secure, efficient and lightweight authentication mechanism for unmanned aerial vehicle network; a hybrid statistical model for ultra short term wind speed prediction; a comprehensive study of the role of self-driving vehicles in agriculture: a review; praise or insult? identifying cyberbullying using natural language processing; inductance enhancement using nested inductor topology for RF and voltage regulators applications; comparison of memory-less and memory-based models for short-term solar irradiance forecasting; and exploring the impact of false location identification on the inference of social ties in location-based social networks.
Digital FIR filters can be efficiently implemented using distributed arithmetic (DA). Original DA provides low throughput. Parallel DA has proven to be a promising technique for efficient DA implementation. Block-based...
ISBN (Digital): 9789819708017
ISBN (Print): 9789819708000; 9789819708017
Sparse matrix-vector multiplication (SpMV) is extensively used in scientific computing and often accounts for a significant portion of the overall computational overhead. Therefore, improving the performance of SpMV is crucial. However, sparse matrices exhibit a sporadic and irregular distribution of non-zero elements, resulting in workload imbalance among threads and challenges in vectorization. To address these issues, numerous efforts have focused on optimizing SpMV based on the hardware characteristics of computing platforms. In this paper, we present an optimization of CSR-based SpMV, since the CSR format is the most widely used and is supported by various high-performance sparse computing libraries, on a novel MIMD computing platform, Pezy-SC3s. Based on the hardware characteristics of Pezy-SC3s, we tackle poor data locality, workload imbalance, and vectorization challenges in CSR-based SpMV by employing matrix chunking, applying Atomic Cache for workload scheduling, and utilizing SIMD instructions when performing SpMV. As the first study to investigate SpMV optimization on Pezy-SC3s, we evaluate the performance of our work by comparing it with the CSR-based SpMV and the SpMV provided by Nvidia's cuSPARSE. Through experiments conducted on 2092 matrices obtained from SuiteSparse, we demonstrate that our optimization achieves a maximum speedup of 17.63x and an average of 1.56x over CSR-based SpMV, and an average bandwidth utilization of 35.22% for large-scale matrices (nnz >= 10^6), compared with the 36.17% obtained using cuSPARSE. These results demonstrate that our optimization effectively harnesses the hardware resources of Pezy-SC3s, leading to improved performance of CSR-based SpMV.
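For readers unfamiliar with the baseline the abstract refers to, a minimal sketch of CSR-based SpMV follows (our illustration, not the authors' code; all names are hypothetical). The indirect access to the input vector through the column indices is the source of the poor data locality and vectorization difficulty the paper addresses.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Minimal CSR-based SpMV sketch: row_ptr[i]..row_ptr[i+1] delimits the
// non-zeros of row i, stored in val[] with column indices in col_idx[].
std::vector<double> spmv_csr(const std::vector<std::size_t>& row_ptr,
                             const std::vector<std::size_t>& col_idx,
                             const std::vector<double>& val,
                             const std::vector<double>& x) {
    const std::size_t n_rows = row_ptr.size() - 1;
    std::vector<double> y(n_rows, 0.0);
    for (std::size_t i = 0; i < n_rows; ++i) {
        double sum = 0.0;
        for (std::size_t k = row_ptr[i]; k < row_ptr[i + 1]; ++k)
            sum += val[k] * x[col_idx[k]];  // irregular gather from x: the
                                            // locality problem the paper targets
        y[i] = sum;
    }
    return y;
}
```

Rows with very different non-zero counts give the inner loop very different trip counts, which is why workload scheduling (the paper's Atomic Cache) matters on a many-threaded platform.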
This project develops an innovative Braun Multiplier design to address power consumption and chip area challenges. Integrating a high-speed parallel prefix adder enhances computational speed by leveraging parallel pro...
ISBN (Print): 9783031396977; 9783031396984
Parallel-in-time algorithms provide an additional layer of concurrency for the numerical integration of models based on time-dependent differential equations. Methods like Parareal, which parallelize across multiple time steps, rely on a computationally cheap and coarse integrator to propagate information forward in time, while a parallelizable, expensive fine propagator provides accuracy. Typically, the coarse method is a numerical integrator using lower resolution, reduced order or a simplified model. Our paper proposes to use a physics-informed neural network (PINN) instead. We demonstrate for the Black-Scholes equation, a partial differential equation from computational finance, that Parareal with a PINN coarse propagator provides better speedup than a numerical coarse propagator. Training and evaluating a neural network are both tasks whose computing patterns are well suited for GPUs. By contrast, mesh-based algorithms with their low computational intensity struggle to perform well. We show that moving the coarse propagator PINN to a GPU while running the numerical fine propagator on the CPU further improves Parareal's single-node performance. This suggests that integrating machine learning techniques into parallel-in-time integration methods and exploiting their differences in computing patterns might offer a way to better utilize heterogeneous architectures.
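The Parareal structure the abstract builds on can be sketched for a scalar test ODE dy/dt = lam*y (our illustration under stated assumptions: a one-step Euler coarse propagator G, which the paper replaces with a PINN evaluation, and a many-substep Euler fine propagator F; all names and step counts are hypothetical). The F calls inside one iteration are independent across time slices, which is where the parallelism comes from.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Parareal sketch for dy/dt = lam*y over n_slices time slices of width dt.
// Update rule: u[n+1] <- G(u_new[n]) + F(u[n]) - G(u[n]).
std::vector<double> parareal(double y0, double lam, double dt,
                             int n_slices, int n_iters) {
    auto G = [&](double y) { return y + dt * lam * y; };  // cheap coarse step
    auto F = [&](double y) {                              // expensive fine step
        const int m = 100;
        const double h = dt / m;
        for (int i = 0; i < m; ++i) y += h * lam * y;
        return y;
    };
    std::vector<double> u(n_slices + 1, y0);
    for (int n = 0; n < n_slices; ++n) u[n + 1] = G(u[n]);  // serial init
    for (int k = 0; k < n_iters; ++k) {
        std::vector<double> u_new(u);
        for (int n = 0; n < n_slices; ++n)
            // The F(u[n]) evaluations are independent and would run in
            // parallel; only the coarse correction sweep is serial.
            u_new[n + 1] = G(u_new[n]) + F(u[n]) - G(u[n]);
        u = u_new;
    }
    return u;
}
```

After n_slices iterations Parareal reproduces the serial fine solution exactly; the speedup comes from converging in far fewer iterations, which is why a cheap but accurate coarse propagator (such as the paper's PINN) is decisive.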
ISBN (Print): 9783031388637
In this paper, we consider a fully implicit Stokes solver implementation targeting both GPU and multithreaded CPU architectures. The solver is aimed at the semistructured meshes that often emerge during permeability calculations in geology. The solver consists of four main parts: geometry and topology analysis, linear system construction, linear system solution, and postprocessing. A modified version of the AMGCL library, developed by the authors in earlier research, is used for the solution. Previous experiments showed that the GPU architecture can deliver extremely high performance for such types of problems, especially when the whole stack is implemented on the GPU. However, the GPU memory limitation significantly reduces the available mesh sizes. For some applications, the computation time is not as important as the mesh size. Therefore, it is convenient to have both GPU (for example, CUDA) and multithreaded CPU versions of the same code. A direct code port is time-consuming and error-prone. Several automatic approaches are available: the OpenACC standard, the DVM-system, SYCL, and others. Often, however, these approaches still demand careful programming if one wants to deliver maximum performance for a specific architecture. Some problems (such as the analysis of connected components, in our case) require totally different optimal algorithms for different architectures. Furthermore, native libraries sometimes deliver the best performance and are preferable for specific parts of the solution. For these reasons, we used another approach, based on C++ language facilities such as template programming. The two main components of our approach are array classes and ‘for each’ algorithms. Arrays can be used on both CPU and CUDA architectures and internally select the memory layout that best fits the current architecture (as an ‘array of structures’ or a ‘structure of arrays’). ‘For each’ algorithms generate kernels or parallel cycles that implement parallel processing for ind...
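The array-class plus ‘for each’ idea described above can be sketched with templates (our illustration, not the authors' code; the types and names are hypothetical). The element-access interface is identical for both layouts, so the same kernel body works regardless of which layout the architecture prefers; a CUDA specialization of for_each would launch a device kernel instead of the plain loop.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

enum class Layout { AoS, SoA };

template <Layout L>
struct Vec3Array;

// Array of structures: one contiguous record per element (often better on CPU).
template <>
struct Vec3Array<Layout::AoS> {
    struct V { double x, y, z; };
    std::vector<V> data;
    explicit Vec3Array(std::size_t n) : data(n) {}
    double& x(std::size_t i) { return data[i].x; }
    double& y(std::size_t i) { return data[i].y; }
    double& z(std::size_t i) { return data[i].z; }
    std::size_t size() const { return data.size(); }
};

// Structure of arrays: one contiguous array per field (coalesced on GPU).
template <>
struct Vec3Array<Layout::SoA> {
    std::vector<double> xs, ys, zs;
    explicit Vec3Array(std::size_t n) : xs(n), ys(n), zs(n) {}
    double& x(std::size_t i) { return xs[i]; }
    double& y(std::size_t i) { return ys[i]; }
    double& z(std::size_t i) { return zs[i]; }
    std::size_t size() const { return xs.size(); }
};

// CPU 'for each': a plain (possibly OpenMP-annotated) loop; a CUDA
// specialization would generate a kernel launch with the same kernel body.
template <class Array, class Kernel>
void for_each(Array& a, Kernel k) {
    for (std::size_t i = 0; i < a.size(); ++i) k(a, i);
}
```

A single generic kernel, e.g. `[](auto& arr, std::size_t i) { arr.z(i) = arr.x(i) + arr.y(i); }`, then runs unchanged against either layout.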
Currently, the landscape of computer hardware architecture presents the characteristics of heterogeneity and diversity, prompting widespread attention to cross-platform portable parallel programming techniques. Most e...
Defect detection in manufacturing remains challenging, with traditional methods relying on inflexible, hardcoded image processing techniques. While deep learning approaches show promise, they often lack scalability ac...
In recent years, depression, as a serious mental illness, has received widespread attention from various sectors of society. How to identify depressive emotions in a timely manner and detect depression has become an u...
ISBN (Print): 9783031396977; 9783031396984
Dataflow architectures can achieve much better performance and higher efficiency than general-purpose cores, approaching the performance of a specialized design while retaining programmability. However, dataflow architectures often suffer from low utilization of computational resources when application algorithms are irregular. In this paper, we propose a software-hardware co-design technique that makes both regular and irregular applications efficient on dataflow architectures. First, we dispatch instructions among dataflow graph (DFG) nodes to ensure load balance. Second, we decouple the threads within DFG nodes into consecutive pipeline stages and provide architectural support. By time-multiplexing these stages on each processing element (PE), dataflow hardware can achieve much higher utilization and performance. We show that our method improves performance by a geometric mean of 2.55x (and up to 3.71x) over a conventional dataflow architecture, and by a geometric mean of 1.80x over Plasticine, on a variety of challenging applications.