In the current landscape of advanced natural language processing (NLP), managing GPU memory effectively is crucial. This paper delves into new tokenization methods and data handling to enhance NLP model efficiency, focusing on avoiding "CUDA out of memory" errors. It examines how sophisticated tokenization and managing text lengths in large datasets can boost model performance. These insights are vital for optimizing resources and scaling NLP models, especially with limited GPU memory. The paper also contextualizes NLP challenges, underlining the significance of memory optimization amid growing language model complexity. It reviews key NLP technologies, including transformer models, and addresses their memory-optimization challenges. Moreover, it underscores the paper's role in developing innovative techniques for more effective memory optimization, linking it to ongoing research and trends in NLP. This work aims to advance natural language processing methods and make AI technologies more accessible.
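The idea of managing text lengths to avoid out-of-memory errors can be sketched with length-aware batching: truncate each sequence to a token budget, then group similarly sized sequences so no batch exceeds a fixed token count. This is a generic illustration, not the paper's actual method; the whitespace tokenizer and all parameter names here are stand-ins.

```python
# Hypothetical sketch of length-aware batching, one common way to keep
# peak memory bounded when token counts vary widely across a corpus.
# The tokenizer is a stand-in (whitespace split), not the paper's method.

def tokenize(text, max_len=128):
    """Tokenize and truncate to a fixed budget of tokens."""
    return text.split()[:max_len]

def bucket_by_length(texts, max_tokens_per_batch=256, max_len=128):
    """Group tokenized texts so no batch exceeds a token budget.

    Sorting by length first keeps padding waste low, since each
    batch then contains sequences of similar size.
    """
    tokenized = sorted((tokenize(t, max_len) for t in texts), key=len)
    batches, current, current_tokens = [], [], 0
    for toks in tokenized:
        if current and current_tokens + len(toks) > max_tokens_per_batch:
            batches.append(current)
            current, current_tokens = [], 0
        current.append(toks)
        current_tokens += len(toks)
    if current:
        batches.append(current)
    return batches
```

Because every batch respects the same token budget, peak activation memory stays roughly constant regardless of how long individual documents are.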
The performance of General-Purpose computation on Graphics Processing Units (GPGPU) is heavily dependent on memory access behavior. This sensitivity is due to a combination of the underlying Massively Parallel Processing (MPP) execution model present on GPUs and the lack of architectural support for irregular memory access patterns. Application performance can be significantly improved by applying memory-access-pattern-aware optimizations that exploit knowledge of the characteristics of each access pattern. In this paper, we present an algorithmic methodology to semi-automatically find the best mapping of memory accesses in a serial loop nest to underlying data-parallel architectures, based on a comprehensive static memory access pattern analysis. To that end we present a simple, yet powerful, mathematical model that captures all memory access pattern information present in serial data-parallel loop nests. We then show how this model is used in practice to select the most appropriate memory space for data and to search for an appropriate thread mapping and work-group size from a large design space. To evaluate the effectiveness of our methodology, we report execution speedups on selected benchmark kernels that cover a wide range of memory access patterns commonly found in GPGPU workloads. Our experimental results are reported using the industry-standard heterogeneous programming language, OpenCL, targeting the NVIDIA GT200 architecture.
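The core of such a static analysis can be illustrated with stride classification of affine accesses: for an access like `A[c1*i + c2*j]`, the coefficient of whichever loop index is mapped to adjacent threads determines whether the access is coalesced. The categories and thresholds below are illustrative assumptions, not the paper's exact model.

```python
# Illustrative stride analysis for affine array accesses of the form
# A[c0 + c1*i + c2*j], loosely in the spirit of a static
# access-pattern model. Classification rules are assumptions.

def classify_access(coeffs, thread_dim):
    """Classify an affine access by its stride along the loop
    dimension mapped to consecutive threads.

    coeffs:     loop-index coefficients, e.g. {'i': 64, 'j': 1}
    thread_dim: loop index assigned to adjacent threads
    """
    stride = coeffs.get(thread_dim, 0)
    if stride == 0:
        return "uniform"      # all threads read the same address
    if abs(stride) == 1:
        return "coalesced"    # adjacent threads touch adjacent words
    return "strided"          # candidate for local-memory staging
```

For a row-major access `A[i*64 + j]`, mapping threads along `j` yields a coalesced pattern, while mapping them along `i` yields a strided one, which is exactly the kind of distinction that drives memory-space and thread-mapping choices.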
ISBN:
(Print) 9781665481069
As the depth of neural networks and the scale of data grow, the difficulty of network training also increases. When GPU memory is insufficient, it is challenging to train deeper models. Recent research combines tensor swapping and recomputation techniques to optimize memory usage. However, the complex dependencies of the DNN graph limit the improvement achievable with single-GPU memory optimization. Improper swap decisions can even have negative effects, because the source of a recomputation may already have been swapped out. In this paper, we propose a novel swap-dominated tensor re-generation strategy, called STR, which combines swap and recomputation techniques to find the optimal execution plan for DNN training when memory is limited. We formalize our memory optimization problem with constraints that describe the dependencies of operator calculations and the bandwidth usage of swapping. A host checkpoint mechanism is designed to make full use of swapped tensors, which reduces the cost of recomputation. We also present an approximation method based on a recursive source-tracing procedure to improve optimization efficiency. We implement a prototype of STR as a plugin on TensorFlow. Experimental results show that STR improves throughput by up to 21.3% compared with the state-of-the-art hybrid optimization strategy.
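The swap-versus-recompute trade-off at the heart of such hybrid strategies can be sketched with a toy per-tensor cost model: swapping pays a host-device transfer cost, recomputation pays a re-execution cost. The formulas and default bandwidth/throughput figures below are illustrative assumptions; STR itself solves a constrained optimization over the whole DNN graph rather than deciding tensors independently.

```python
# Toy cost model for choosing, per evicted tensor, between swapping
# to host memory and recomputing from saved inputs. The constants
# (PCIe bandwidth, device throughput) are illustrative assumptions.

def plan_tensor(size_bytes, recompute_flops,
                bandwidth=16e9, device_flops=1e13):
    """Pick the cheaper re-generation strategy for one tensor.

    swap cost:      time to copy the tensor back over the host link
    recompute cost: time to re-run the producing operators
    """
    swap_time = size_bytes / bandwidth
    recompute_time = recompute_flops / device_flops
    return "swap" if swap_time <= recompute_time else "recompute"
```

A small, expensive-to-produce tensor favors swapping, while a large, cheap-to-produce tensor favors recomputation; the graph-level constraints in STR additionally prevent recomputing from sources that have themselves been swapped out.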
ISBN:
(Print) 9798400711381
This paper presents a novel approach to optimizing realistic fur rendering for CG animation using Unreal Engine (UE). We introduce a progressive method combining three key techniques to enhance rendering efficiency while maintaining high quality. These strategies significantly optimize GPU memory use, enabling more complex and realistic results. Our approach streamlines production workflows, reducing both time and costs, and offers broader potential applications in the film industry.
The interaction between buildings and wind significantly impacts the comfort and safety of pedestrians, thereby influencing the sustainability of cities. Computational fluid dynamics (CFD) simulation of wind velocity in urban environments provides valuable insights into building aerodynamics. Traditional CFD solvers are limited by high computational costs, hindering practical engineering applications. Graph neural networks (GNNs) have emerged as a promising approach to accelerate CFD simulations on unstructured meshes. However, their inability to handle large-scale urban wind prediction due to high GPU memory requirements poses a challenge, as GNNs rely on GPUs for fast training and inference. To overcome this limitation, we propose SGMS-GNN, a novel GNN model that accurately and efficiently predicts wind velocity fields in urban environments while maintaining consistent GPU memory usage as the simulation domain increases. We employed a validated CFD model to generate a dataset of wind velocity fields in various urban topologies by simulating wind flow through randomly generated building layouts. Our well-generalized SGMS-GNN demonstrates accurate urban wind field predictions at city scale, achieving a 70% reduction in GPU memory usage compared with other GNN models. Furthermore, the proposed model runs one to two orders of magnitude faster than the CFD model it was trained on.
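The general idea behind keeping GPU memory flat as the domain grows can be sketched as fixed-size chunked processing: split the mesh into bounded-size node ranges with a small halo overlap, so peak memory depends on the chunk size rather than the total node count. This chunking scheme is a generic assumption for illustration, not SGMS-GNN's actual multiscale architecture.

```python
# Minimal sketch of fixed-size subgraph batching: process an n-node
# mesh in chunks of bounded size, with a halo overlap so that edges
# crossing a chunk boundary still see both endpoints. A generic
# device-memory-bounding idea, not SGMS-GNN's actual design.

def subgraph_batches(num_nodes, max_nodes_per_batch, halo=1):
    """Yield (start, end) node ranges of bounded size covering the
    whole mesh, each extended by `halo` nodes on either side."""
    step = max_nodes_per_batch - 2 * halo
    batches = []
    start = 0
    while start < num_nodes:
        lo = max(0, start - halo)
        hi = min(num_nodes, start + step + halo)
        batches.append((lo, hi))
        start += step
    return batches
```

Doubling the domain doubles the number of chunks processed, not the size of any single chunk, so peak GPU memory stays constant while total runtime scales linearly.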
Over the past decade, Graphics Processing Units (GPUs) have revolutionized high-performance computing, playing pivotal roles in advancing fields like IoT, autonomous vehicles, and exascale computing. Despite these advancements, efficiently programming GPUs remains a daunting challenge, often relying on trial-and-error optimization methods. This paper introduces an optimization technique for CUDA programs through a novel data layout strategy, aimed at restructuring memory data arrangement to significantly enhance data access locality. Focusing on the dynamic programming algorithm for chained matrix multiplication, a critical operation across various domains including artificial intelligence (AI), high-performance computing (HPC), and the Internet of Things (IoT), this technique facilitates more localized access. We specifically illustrate the importance of efficient matrix multiplication in these areas, underscoring the technique's broader applicability and its potential to address some of the most pressing computational challenges in GPU-accelerated applications. Our findings reveal a remarkable reduction in memory consumption and a substantial 50% decrease in execution time for CUDA programs utilizing this technique, thereby setting a new benchmark for optimization in GPU computing.
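One concrete layout idea for the matrix-chain DP is to store the upper-triangular cost table diagonal by diagonal, since the algorithm fills one diagonal at a time and consecutive cells on a diagonal are computed by consecutive threads. The index arithmetic below is an illustrative choice showing the principle in plain Python, not the paper's exact CUDA layout.

```python
# Sketch of a diagonal-major layout for the matrix-chain DP table.
# The classic algorithm fills the table one diagonal at a time, so
# storing each diagonal contiguously makes consecutive iterations
# (or threads) touch adjacent memory. Illustrative, not the paper's
# exact layout.

def diag_index(i, j, n):
    """Map cell (i, j), i <= j, of an n x n upper-triangular table
    to a flat offset, grouping cells by diagonal d = j - i."""
    d = j - i
    # Cells on diagonals 0..d-1 come first: sum of (n - k) for k < d.
    before = d * n - d * (d - 1) // 2
    return before + i

def matrix_chain_cost(dims):
    """Minimum scalar multiplications for chaining matrices of
    dimensions dims[k] x dims[k+1], using the diagonal layout."""
    n = len(dims) - 1
    m = [0] * (n * (n + 1) // 2)     # flat diagonal-major table
    for d in range(1, n):            # fill diagonal by diagonal
        for i in range(n - d):
            j = i + d
            m[diag_index(i, j, n)] = min(
                m[diag_index(i, k, n)] + m[diag_index(k + 1, j, n)]
                + dims[i] * dims[k + 1] * dims[j + 1]
                for k in range(i, j)
            )
    return m[diag_index(0, n - 1, n)]
```

In a row-major triangular layout, cells of one diagonal are `n + 1` elements apart; the diagonal-major layout places them one element apart, which is the locality improvement the data layout strategy targets.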