ISBN (print): 9798350305487
To meet the increased computational demands and stricter power constraints of modern applications, architectures have evolved to include domain-specific accelerators. Designing efficient accelerators requires addressing three main challenges: compute, memory, and control. Moreover, since SoCs usually contain multiple accelerators, selecting the right one for each task also becomes crucial. This is especially relevant for Flexible Processing Units (xPUs), processing units that provide multiple functionalities with the same hardware. While it is possible to use shared support components for all functionalities, doing so leads to sub-optimal performance. In this work, we take one example of such an xPU and analyze the aspects that have not yet been fully addressed, showing that there is more potential to be exploited. By understanding the required memory patterns, we achieve speedup gains of up to 72% compared to using memory support optimized for a different functionality. Furthermore, we present an in-depth analysis of the different functionalities provided by the xPU. We then leverage the insights obtained from this analysis to provide a mechanism that selects the right functionality, maximizing hardware utilization.
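The abstract leaves the functionality-selection mechanism at a high level. As a rough illustration only (the functionality names, the utilization model, and all numbers below are hypothetical, not from the paper), a dispatcher could score each mode of the xPU against a task's memory pattern and pick the one with the highest estimated utilization:

```python
# Hypothetical sketch: choosing among xPU functionalities by estimated
# hardware utilization for a task's memory-access pattern. The profile
# numbers and functionality names are illustrative, not from the paper.
from dataclasses import dataclass

@dataclass
class Functionality:
    name: str
    peak_ops_per_cycle: float      # compute capability of this mode
    preferred_stride: int          # memory stride the mode's support is tuned for

def estimated_utilization(func: Functionality, task_stride: int,
                          task_ops_per_byte: float) -> float:
    """Crude model: utilization drops when the task's stride does not match
    the stride the functionality's memory support is optimized for."""
    stride_penalty = min(func.preferred_stride, task_stride) / max(
        func.preferred_stride, task_stride)
    return min(1.0, task_ops_per_byte / func.peak_ops_per_cycle) * stride_penalty

def select_functionality(funcs, task_stride, task_ops_per_byte):
    # Pick the mode with the highest estimated utilization for this task.
    return max(funcs, key=lambda f: estimated_utilization(
        f, task_stride, task_ops_per_byte))

if __name__ == "__main__":
    modes = [Functionality("gemm-mode", 64.0, 1),
             Functionality("conv-mode", 32.0, 4)]
    best = select_functionality(modes, task_stride=4, task_ops_per_byte=16.0)
    print("selected:", best.name)
```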
ISBN (print): 9798350381603
Cloud computing allows users to access large computing infrastructures quickly. In the high-performance computing (HPC) context, public cloud resources emerge as an economical alternative, allowing institutions and research groups to use highly parallel infrastructures in the cloud. However, parallel runtime systems and software optimizations proposed over the years to improve the performance and scalability of HPC applications targeted traditional on-premise HPC clusters, where developers have direct access to the underlying hardware without any kind of virtualization. In this paper, we analyze the performance and scalability of HPC applications from the NAS Parallel Benchmarks suite when running on a virtualized HPC cluster built on top of Amazon Web Services (AWS), contrasting them with the results obtained with the same applications running on a traditional on-premise HPC cluster from Grid'5000. Our results show that CPU-bound applications achieve similar results on both platforms, whereas communication-bound applications may be impacted by the limited network bandwidth in the cloud. The cloud infrastructure demonstrated better performance under workloads with moderate communication and medium-sized messages.
This research introduces an automated road extraction model utilizing deep learning techniques for high-resolution aerial imagery. Focused on applications in urban planning, disaster management, and logistics, the stu...
ISBN (print): 9783031592348
The proceedings contain 17 papers. The special focus of this conference is on Engineering Interactive Computing Systems. The topics include: Evaluation of a Social Robot System for Performance-Oriented Stroke Therapy; MUMR-MIODMIT: A Generic Architecture Extending Standard Interactive Systems Architecture to Address Engineering Issues for Rehabilitation; Serious Game for Company Governance: Supporting Integration, Prevention of Professional Disintegration and Job Retention of People with Disabilities; Two Concepts of Domain-Specific Languages for Therapists to Control a Humanoid Robot; An Approach to Leverage Artificial Intelligence for Car-Parking Related Mobile Applications; Engineering AI-Similar Designs: Should I Engineer My Interactive System with AI Technologies?; Explaining Through the Right Reasoning Style: Lessons Learnt; Exploring AI-Enhanced Shared Control for an Assistive Robotic Arm; Hidden Figures: Architectural Challenges to Expose Parameters Lost in Code; Not What I was Trained for – Out-of-Distribution-Tests for Interactive AIs; End User Development for Extended Reality; Exertion Trainer: Smartphone Exergame Design to Support Children's Kinesthetic Learning through Playful Feedback; Explaining Temporal Logic Model Checking Counterexamples through the Use of Structured Natural Language; Merging Creativity with Computation in Sketch-to-Code Transitions; UX Data Visualization: Supporting Software Professionals in Exploring Users' Interaction Data.
ISBN (print): 9798350305487
Seismic imaging techniques like Reverse Time Migration (RTM) are time-consuming and data-intensive activities in the field of geophysical exploration. The computational cost associated with the stability and dispersion conditions in the discrete two-way wave equation makes RTM time-consuming. Additionally, RTM is data-intensive due to the need to manage a considerable amount of information, such as the forward-propagated wavefields (source wavefield), to build the final migrated seismic image according to an imaging condition. In this context, we introduce lossy and lossless wavefield compression for parallel multi-core and GPU-based RTM to alleviate the data transfer between processor and disk. We use OpenACC to enable GPU parallelism and the ZFP library, combined with decimation based on the Nyquist sampling theorem, to reduce storage. We experimentally study the effects of wavefield compression for both GPU-based and optimized OpenMP+vectorization RTM versions. The multi-core and GPU-based RTM versions have been linked to the ZFP library to compress the source wavefield on the fly once it has been decimated according to the Nyquist sampling theorem to calculate the imaging condition. This approach can drastically reduce the persistent storage required by the technique. However, it is essential to understand the impact of using compressed wavefields on the migration process that builds the seismic image. In this context, we show how much storage can be reduced without compromising the seismic image's accuracy and quality.
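The storage pipeline described above (decimate the source wavefield to the Nyquist rate of the source's maximum frequency, then compress each retained snapshot before writing it out) can be sketched as follows, assuming the zfpy Python bindings of the ZFP library; the time step, maximum frequency, and tolerance values are illustrative only:

```python
# Sketch, not the paper's code: decimate forward-propagated snapshots to the
# Nyquist rate and compress them with ZFP (lossy, fixed-accuracy mode).
# Assumes the zfpy bindings are installed (pip install zfpy).
import numpy as np
import zfpy

def nyquist_stride(dt: float, f_max: float) -> int:
    """Keep one snapshot per Nyquist interval 1/(2*f_max) of the source's
    maximum frequency; dt is the simulation time step."""
    return max(1, int((1.0 / (2.0 * f_max)) / dt))

def store_wavefield(snapshots, dt, f_max, tolerance=1e-4):
    """Return compressed snapshots kept at the Nyquist rate."""
    stride = nyquist_stride(dt, f_max)
    return [zfpy.compress_numpy(s, tolerance=tolerance)
            for s in snapshots[::stride]]

def load_wavefield(blobs):
    """Decompress the retained snapshots for the imaging condition."""
    return [zfpy.decompress_numpy(b) for b in blobs]

if __name__ == "__main__":
    # Toy wavefield: 200 time steps of a 128x128 grid.
    snaps = np.random.rand(200, 128, 128).astype(np.float32)
    blobs = store_wavefield(snaps, dt=1e-3, f_max=25.0, tolerance=1e-3)
    restored = load_wavefield(blobs)
    print(len(blobs), "snapshots kept,", restored[0].shape)
```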
ISBN (print): 9798350305487
The pursuit of energy efficiency has been driving the development of techniques to optimize hardware resource usage in high-performance computing (HPC) servers. On multicore architectures, thread-level parallelism (TLP) exploitation, dynamic voltage and frequency scaling (DVFS), and uncore frequency scaling (UFS) are three popular methods applied to improve the trade-off between performance and energy consumption, represented by the energy-delay product (EDP). However, the complexity of selecting the optimal configuration (TLP degree, DVFS, and UFS) for each application poses a challenge to software developers and end-users due to the massive number of possible configurations. To tackle this challenge, we propose NeurOPar, an optimization strategy for parallel workloads driven by an artificial neural network (ANN). It uses representative hardware and software metrics to build and train an ANN model that predicts combinations of thread count and core/uncore frequency levels that provide optimal EDP results. Through experiments on four multicore processors using twenty-five applications, we demonstrate that NeurOPar predicts combinations that yield EDP values close to the best ones achieved by an exhaustive search and improve the overall EDP by 42% compared to the default execution of HPC applications. We also show that NeurOPar can enhance the execution of parallel applications without incurring the performance and energy penalties associated with online methods by comparing it with two state-of-the-art strategies.
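The optimization target is the energy-delay product, EDP = energy × execution time. The sketch below illustrates the general idea under stated assumptions (synthetic training data, scikit-learn's MLPRegressor as a stand-in for the paper's ANN, and a hypothetical feature set):

```python
# Sketch of the NeurOPar idea with hypothetical data: train a small neural
# network to predict EDP (energy * runtime) from a configuration
# (thread count, core frequency, uncore frequency) plus an application metric,
# then pick the candidate configuration with the lowest predicted EDP.
import itertools
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)

# Synthetic training set: [threads, core_GHz, uncore_GHz, mem_intensity] -> EDP
X = rng.uniform([1, 1.0, 1.0, 0.0], [32, 3.5, 2.4, 1.0], size=(200, 4))
energy = X[:, 1] ** 2 + 0.5 * X[:, 2] + 0.1 * X[:, 0]     # toy energy model
runtime = (1.0 + X[:, 3]) / (X[:, 0] * X[:, 1])           # toy runtime model
y = energy * runtime                                       # EDP

model = MLPRegressor(hidden_layer_sizes=(32, 32), max_iter=2000,
                     random_state=0).fit(X, y)

def best_configuration(mem_intensity: float):
    """Enumerate candidate (threads, core, uncore) settings and return the
    one with the lowest predicted EDP for this application profile."""
    candidates = [(t, c, u, mem_intensity)
                  for t, c, u in itertools.product([4, 8, 16, 32],
                                                   [1.2, 2.0, 2.8, 3.5],
                                                   [1.2, 1.8, 2.4])]
    preds = model.predict(np.array(candidates))
    return candidates[int(np.argmin(preds))]

print("predicted best config:", best_configuration(mem_intensity=0.7))
```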
ISBN (print): 9798350393132; 9798350393149
Graph neural networks (GNNs), one of the most popular neural network models, are extensively applied in graph-related fields, including drug discovery, recommendation systems, etc. Unsupervised graph learning, one type of GNN training, plays a crucial role in various graph-related missions like node classification and edge prediction. However, with the increasing size of real-world graph datasets, processing such massive graphs in host memory becomes impractical, and GNN training demands a substantial storage volume to accommodate the vast amount of graph data. Consequently, GNN training results in significant I/O migration between the host and storage. Although state-of-the-art frameworks have made strides in mitigating I/O overhead by considering embedding locality, their GNN frameworks still suffer from long training times. In this paper, we propose a fully out-of-core framework, called Celeritas, which speeds up unsupervised GNN training on a single machine by co-designing the GNN algorithm and the storage system. First, based on theoretical analysis, we propose a new partial combination operation to enable embedding updates across GNN layers. This cross-layer computing performs future computation for the embeddings stored in memory to save data migration. Second, due to the dependency between embeddings and edges, we consider their data locality together. Based on the cross-layer computing property, we propose a new loading order to fully utilize the data stored in main memory and save I/O. Finally, a new sampling scheme called two-level sampling, together with a new partition algorithm, is proposed to further reduce data migration and computation overhead while maintaining similar training accuracy. Real-system experiments indicate that the proposed Celeritas can reduce the total training time of different GNN models by 44.76% to 73.85% compared to state-of-the-art schemes for different graph datasets.
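Celeritas' loading order is described only at a high level above. As an illustration of the general idea (this greedy heuristic is an assumption, not the paper's algorithm), one could pick the next partition to load as the one sharing the most cross-partition edges with the partitions already resident in memory:

```python
# Illustrative sketch only (not Celeritas' actual algorithm): choose the order
# in which graph partitions are loaded from disk so that each newly loaded
# partition shares as many cross-partition edges as possible with partitions
# already resident in memory, reducing repeated I/O for embeddings.
from collections import defaultdict

def greedy_loading_order(num_parts, cross_edges, memory_slots):
    """cross_edges[(i, j)] = number of edges between partitions i and j (i < j)."""
    weight = defaultdict(int, cross_edges)

    def affinity(p, resident):
        return sum(weight[tuple(sorted((p, r)))] for r in resident)

    order, resident = [], []
    remaining = set(range(num_parts))
    while remaining:
        nxt = max(remaining, key=lambda p: affinity(p, resident))
        remaining.discard(nxt)
        order.append(nxt)
        resident.append(nxt)
        if len(resident) > memory_slots:   # evict the oldest resident partition
            resident.pop(0)
    return order

if __name__ == "__main__":
    edges = {(0, 1): 50, (1, 2): 40, (0, 3): 5, (2, 3): 30}
    print(greedy_loading_order(4, edges, memory_slots=2))
```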
ISBN (print): 9798350381603
The significant growth in demand for neural network solutions has created an urgent need for efficient implementations across a wide array of environments and platforms. As industries increasingly rely on AI-driven technologies, optimizing the performance and effectiveness of these networks has become crucial. While numerous studies have achieved promising results in this field, the process of fine-tuning and identifying optimal architectures for specific problem domains remains a complex and resource-intensive task. As such, there is a pressing need to explore and evaluate techniques that can improve this optimization process, reducing costs and time-to-deployment while maximizing the overall performance of neural networks. This work focuses on evaluating the optimization process of NetAdapt for two neural networks on an Nvidia Jetson device. We observe a performance decay for the larger network when the algorithm tries to meet the latency constraint. Furthermore, we propose potential alternatives to optimize this tool. In particular, we propose an alternative configuration search procedure that enhances the optimization process, achieving speedups of up to ~7x.
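NetAdapt-style optimization iteratively shrinks a network until it fits a latency budget; the alternative configuration search mentioned above is not detailed in the abstract, so the following is only a generic sketch of such a latency-constrained search loop, with placeholder latency and accuracy models:

```python
# Generic sketch of a NetAdapt-style latency-constrained search loop; the
# latency/accuracy functions below are placeholders, not the tool's API.
def measured_latency(config):
    """Placeholder: on a real device (e.g., a Jetson board) this would run the
    network and time it; here latency is just proportional to channel count."""
    return sum(config)

def short_term_accuracy(config):
    """Placeholder for briefly fine-tuning a candidate and measuring accuracy."""
    return sum(c ** 0.5 for c in config)   # toy proxy: more channels, more accuracy

def netadapt_like_search(config, latency_budget, step=8):
    config = list(config)
    while measured_latency(config) > latency_budget:
        # Propose one candidate per layer: shrink that layer by `step` channels.
        candidates = []
        for i, c in enumerate(config):
            if c > step:
                cand = config.copy()
                cand[i] = c - step
                candidates.append(cand)
        if not candidates:
            break
        # Keep the shrunken candidate with the best short-term accuracy.
        config = max(candidates, key=short_term_accuracy)
    return config

print(netadapt_like_search([64, 128, 256, 512], latency_budget=600))
```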
ISBN (print): 9798350393132; 9798350393149
Processing-using-DRAM (PUD) is a processing-in-memory (PIM) approach that uses a DRAM array's massive internal parallelism to execute very wide (e.g., 16,384- to 262,144-bit-wide) data-parallel operations in a single-instruction multiple-data (SIMD) fashion. However, the large and rigid granularity of DRAM rows limits the effectiveness and applicability of PUD in three ways. First, since applications have varying degrees of SIMD parallelism (often smaller than the DRAM row granularity), PUD execution often leads to underutilization, throughput loss, and energy waste. Second, due to the high area cost of implementing interconnects that connect columns in a wide DRAM row, most PUD architectures are limited to the execution of parallel map operations, where a single operation is performed over equally sized input and output arrays. Third, the need to feed the wide DRAM row with tens of thousands of data elements, combined with the lack of adequate compiler support for PUD systems, creates a programmability barrier, since programmers need to manually extract SIMD parallelism from an application and map computation to the PUD hardware. Our goal is to design a flexible PUD system that overcomes the limitations caused by the large and rigid granularity of PUD. To this end, we propose MIMDRAM, a hardware/software co-designed PUD system that introduces new mechanisms to allocate and control only the necessary resources for a given PUD operation. The key idea of MIMDRAM is to leverage fine-grained DRAM (i.e., the ability to independently access smaller segments of a large DRAM row) for PUD computation. MIMDRAM exploits this key idea to enable a multiple-instruction multiple-data (MIMD) execution model in each DRAM subarray (and SIMD execution within each DRAM row segment). We evaluate MIMDRAM using twelve real-world applications and 495 multi-programmed application mixes. Our evaluation shows that MIMDRAM provides 34x the performance, 14.3x the energy efficiency, and 1.7x the throughput...
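A back-of-the-envelope calculation illustrates the underutilization problem that motivates fine-grained row segments (the row width and segment count below are example values, not MIMDRAM's actual parameters):

```python
# Back-of-the-envelope illustration of why rigid DRAM-row granularity wastes
# PUD throughput, and how fine-grained row segments help. The 65,536-bit row
# and 8 segments are example values, not MIMDRAM's actual parameters.
ROW_BITS = 65_536
SEGMENTS = 8
SEGMENT_BITS = ROW_BITS // SEGMENTS

def full_row_utilization(simd_bits: int) -> float:
    """Conventional PUD: every operation occupies the whole row."""
    return min(simd_bits, ROW_BITS) / ROW_BITS

def segmented_utilization(simd_bits: int) -> float:
    """Fine-grained PUD: allocate only as many segments as the operation needs;
    the remaining segments stay free for other operations (MIMD-style)."""
    needed = -(-min(simd_bits, ROW_BITS) // SEGMENT_BITS)   # ceiling division
    return min(simd_bits, ROW_BITS) / (needed * SEGMENT_BITS)

for width in (2_048, 8_192, 40_000):
    print(f"SIMD width {width:>6}: full-row {full_row_utilization(width):5.1%}, "
          f"segmented {segmented_utilization(width):5.1%}")
```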
ISBN (print): 9798350393132; 9798350393149
With growing problem sizes for GPU computing, multi-GPU systems with fine-grained memory sharing have emerged to improve the current coarse-grained unified memory support based on page migration. Such multi-GPU systems with shared memory pose a new challenge in securing CPU-GPU and inter-GPU communications, as the cost of secure data transfers adds a significant performance overhead. There are two overheads of secure communication in multi-GPU systems: first, extra overhead is added to generate one-time pads (OTPs) for authenticated encryption; second, the security metadata, such as MACs and counters, passed along with encrypted data consumes precious network bandwidth. This study investigates the performance impact of secure communication in multi-GPU systems and evaluates prior CPU-oriented OTP precomputation schemes adapted for multi-GPU systems. Our investigation identifies the challenge posed by the limited OTP buffers for inter-GPU communication and the opportunity to reduce security-metadata traffic given the bursty communication patterns of GPUs. Based on this analysis, this paper proposes a new dynamic OTP buffer allocation technique, which adjusts the buffer assignment for each source-destination pair to reflect the communication patterns. To address the bandwidth problem caused by the extra security metadata, the study employs a dynamic batching scheme that transfers only a single set of metadata for each batched group of data responses. The proposed design constantly tracks the communication pattern from each GPU, periodically adjusts the allocated buffer size, and dynamically forms batches of data transfers. Our evaluation shows that in a 16-GPU system, the proposed scheme can improve performance by 13.2% and 17.5% on average over the prior cached and private schemes, respectively.
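The dynamic OTP buffer allocation is described only at a high level. A minimal sketch of the idea, with a hypothetical interface (not the paper's design), re-divides a fixed pool of precomputed-OTP buffer entries among source-destination pairs in proportion to recently observed traffic:

```python
# Minimal sketch (hypothetical interface, not the paper's design): periodically
# re-divide a fixed pool of precomputed-OTP buffer entries among GPU
# source-destination pairs in proportion to the traffic observed since the
# last adjustment, so busy links get more precomputed pads.
from collections import Counter

class OTPBufferAllocator:
    def __init__(self, pairs, total_entries, min_entries=1):
        self.pairs = list(pairs)
        self.total = total_entries
        self.min = min_entries
        self.traffic = Counter()
        # Start with an even split across all pairs.
        self.alloc = {p: total_entries // len(self.pairs) for p in self.pairs}

    def record_transfer(self, src, dst, nbytes):
        self.traffic[(src, dst)] += nbytes

    def rebalance(self):
        """Called at the end of each epoch: proportional share with a floor."""
        total_traffic = sum(self.traffic.values()) or 1
        spendable = self.total - self.min * len(self.pairs)
        self.alloc = {
            p: self.min + int(spendable * self.traffic[p] / total_traffic)
            for p in self.pairs
        }
        self.traffic.clear()
        return self.alloc

if __name__ == "__main__":
    alloc = OTPBufferAllocator(pairs=[(0, 1), (0, 2), (1, 2)], total_entries=96)
    alloc.record_transfer(0, 1, 4096)
    alloc.record_transfer(0, 2, 1024)
    print(alloc.rebalance())
```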