[Auto Generated] Chapter 1. Selecting a PL/M-86 Size Control
1.1 Introduction ........ 1-1
1.2 Making the Selection ........ 1-2
1.2.1 Ramifications of Your Selection ........ 1-2
Restrictions Associated with
ISBN: (print) 9798350388640; 9798350388633
Developing real-time systems applications requires programming paradigms that can handle the specification of concurrent activities and timing constraints, as well as the control of execution on a particular platform. The increasing need for high performance, and the use of fine-grained parallel execution, makes this an even more challenging task. This paper explores the state of the art and challenges in real-time parallel application development, focusing on two research directions: one from the high-performance domain (using OpenMP) and another from the real-time and critical systems field (based on Ada). The paper reviews the features of each approach and highlights remaining open issues.
Unity is a powerful and versatile tool for creating real-time experiments. It includes a built-in compute shader language, a C-like programming language designed for massively parallel general-purpose GPU (GPGPU) computing. However, as Unity is primarily developed for multi-platform game creation, its compute shader language has several limitations, including the lack of multi-GPU computation support and incomplete mathematical function support. To address these limitations, GPU manufacturers have developed specialized programming models, such as CUDA and HIP, which enable developers to leverage the full computational power of modern GPUs. This article introduces an open-source tool designed to bridge the gap between Unity and CUDA, allowing developers to integrate CUDA's capabilities within Unity-based projects. The proposed solution establishes an interoperability framework that facilitates communication between Unity and CUDA. The tool is designed to efficiently transfer data, execute CUDA kernels, and retrieve results, ensuring seamless integration into Unity's rendering and computation pipelines. The tool extends Unity's capabilities by enabling CUDA-based computations, overcoming the inherent limitations of Unity's compute shader language. This integration allows developers to exploit multi-GPU architectures, leverage advanced mathematical functions, and enhance computational performance for real-time applications.
Temporal graphs change with time and have a lifespan associated with each vertex and edge. These graphs are suited to time-respecting algorithms, where the traversed edges must have monotonic timestamps. The Interval-centric Computing Model (ICM) is a distributed programming abstraction for designing such temporal algorithms. There has been little work on supporting time-respecting algorithms at large scales for streaming graphs, which are updated continuously at high rates (millions of updates/s), such as in financial and social networks. In this article, we extend the windowed variant of ICM for incremental computing over streaming graph updates. We formalize the properties of temporal graph algorithms and prove that our model of incremental computing over streaming updates is equivalent to batch execution of ICM. We design TARIS, a novel distributed graph platform that implements these incremental computing features. We use efficient data structures to reduce memory access and enhance locality during graph updates. We also propose scheduling strategies to interleave updates with computing, and streaming strategies to adapt the execution window for incremental computing to variable input rates. Our detailed and rigorous evaluation of temporal algorithms on large-scale graphs with up to 2 B edges shows that TARIS outperforms contemporary baselines, Tink and Gradoop, by 3-4 orders of magnitude, and handles high input rates of 83 K to 587 M mutations/s with latencies on the order of seconds to minutes.
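The time-respecting constraint described above (traversed edges must have monotonic timestamps) can be illustrated with a minimal single-machine sketch; the edge format and the `earliest_arrival` helper below are illustrative assumptions, not TARIS's actual API.

```python
import heapq

def earliest_arrival(edges, source, t0=0):
    """Earliest-arrival times from `source`, following only edges whose
    departure time is >= the arrival time at their tail vertex, so every
    traversed path has monotonically non-decreasing timestamps."""
    # adjacency: u -> list of (departure_time, arrival_time, v)
    adj = {}
    for u, v, dep, arr in edges:
        adj.setdefault(u, []).append((dep, arr, v))
    best = {source: t0}
    heap = [(t0, source)]            # (arrival time, vertex)
    while heap:
        t, u = heapq.heappop(heap)
        if t > best.get(u, float("inf")):
            continue                 # stale entry
        for dep, arr, v in adj.get(u, ()):
            if dep >= t and arr < best.get(v, float("inf")):
                best[v] = arr
                heapq.heappush(heap, (arr, v))
    return best

# Edges are (u, v, departure, arrival). The b->c edge departing at t=1 is
# ignored from b, because a->b only arrives there at t=2.
edges = [("a", "b", 1, 2), ("b", "c", 3, 4), ("b", "c", 1, 1)]
print(earliest_arrival(edges, "a"))  # -> {'a': 0, 'b': 2, 'c': 4}
```

A streaming system like TARIS additionally has to keep such results consistent while edges arrive continuously, which is what the incremental-computing extension of ICM addresses.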
The performance gap between CPU and memory widens continuously. Choosing the best memory layout for each hardware architecture is increasingly important as more and more programs become memory bound. For portable codes that run across heterogeneous hardware architectures, the choice of the memory layout for data structures is ideally decoupled from the rest of a program. This can be accomplished via a zero-runtime-overhead abstraction layer, underneath which memory layouts can be freely exchanged. We present the low-level abstraction of memory access (LLAMA), a C++ library that provides such a data structure abstraction layer with example implementations for multidimensional arrays of nested, structured data. LLAMA provides fully C++ compliant methods for defining and switching custom memory layouts for user-defined data types. The library is extensible with third-party allocators. Providing two close-to-life examples, we show that the LLAMA-generated array of structs and struct of arrays layouts produce identical code with the same performance characteristics as manually written data structures. Integrations into the SPEC CPU(R) lbm benchmark and the particle-in-cell simulation PIConGPU demonstrate LLAMA's abilities in real-world applications. LLAMA's layout-aware copy routines can significantly speed up transfer and reshuffling of data between layouts compared with naive element-wise copying. LLAMA provides a novel tool for the development of high-performance C++ applications in a heterogeneous environment.
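LLAMA's core idea, decoupling the logical view of a record from its physical memory layout, can be mimicked in a few lines of Python (LLAMA itself is a C++ template library; the classes below are an illustrative sketch, not its API):

```python
X, Y = 0, 1  # field indices of a 2-field record

class AoS:
    """Array-of-structs layout: one [x, y] record per element."""
    def __init__(self, n):
        self.data = [[0.0, 0.0] for _ in range(n)]
    def get(self, i, field):
        return self.data[i][field]
    def set(self, i, field, v):
        self.data[i][field] = v

class SoA:
    """Struct-of-arrays layout: one contiguous array per field."""
    def __init__(self, n):
        self.data = [[0.0] * n, [0.0] * n]
    def get(self, i, field):
        return self.data[field][i]
    def set(self, i, field, v):
        self.data[field][i] = v

def kernel(view, n):
    # Layout-agnostic user code: the same loop runs on either layout,
    # so the layout can be swapped without touching the algorithm.
    for i in range(n):
        view.set(i, X, view.get(i, X) + 1.0)

for layout in (AoS(4), SoA(4)):
    kernel(layout, 4)
    assert [layout.get(i, X) for i in range(4)] == [1.0] * 4
```

In LLAMA the indirection is resolved at compile time via C++ templates, so this abstraction carries no runtime cost, unlike the Python sketch.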
ISBN: (print) 9781665457019
Java's Stream API, which makes heavy use of lambda expressions, permits a more declarative way of defining operations on collections than traditional loops. While experimental results suggest that the use of the Stream API has measurable benefits for code readability (in comparison to loops), a remaining question is whether it has other implications. One such implication concerns tooling in general and debugging in particular: while the traditional loop-based approach applies filters one after another to single elements, the Stream API applies filters to whole collections. There are by now dedicated debuggers for the Stream API, but it remains unclear whether such a debugger (on the Stream API) has a measurable benefit over the traditional stepwise debugger (on loops). The present paper introduces a controlled experiment on the debugging of filter operations using a stepwise debugger versus a stream debugger. The results indicate that under the experiment's settings the stream debugger has a significant (p < .001) and large, positive effect (partial eta-squared = .899; M_stepwise/M_stream ≈ 204%). However, the experiment reveals that additional factors interact with the debugger treatment, such as whether or not the failing object is known upfront. The mentioned factor has a strong and large disordinal interaction effect with the debugger (p < .001; partial eta-squared = .928): in case an object that can be used to identify a failing filter is known upfront, the stream debugger is even less efficient than the stepwise debugger (M_stepwise/M_stream ≈ 72%). Hence, while we found an overall positive effect of the stream debugger, the question of whether debugging is easier on loops or streams cannot be answered without taking the other variables into account. Consequently, we see a contribution of the present paper not only in the comparison of different debuggers but in the identification of …
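The distinction the experiment rests on, element-wise filtering in a loop versus stage-wise filtering over a whole collection, can be sketched outside Java as well (the original study uses Java's Stream API; the Python below only illustrates the two evaluation shapes a debugger would step through):

```python
data = [1, 2, 3, 4, 5, 6]

# Loop style: every filter is applied to one element before moving on to
# the next element; a stepwise debugger walks exactly this order.
loop_result = []
for x in data:
    if x % 2 != 0:
        continue        # filter 1: keep even numbers
    if x <= 3:
        continue        # filter 2: keep numbers greater than 3
    loop_result.append(x)

# Pipeline style: each filter conceptually transforms the whole collection;
# a stream debugger shows how the collection shrinks at every stage.
stage1 = [x for x in data if x % 2 == 0]   # after filter 1: [2, 4, 6]
stage2 = [x for x in stage1 if x > 3]      # after filter 2: [4, 6]

assert loop_result == stage2 == [4, 6]
```

Both styles compute the same result; what differs is the intermediate state a debugger can expose, which is the variable the experiment manipulates.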
ISBN: (digital) 9781665484329; (print) 9781665484329
The paper provides a mathematical model of a control systems program. This model, in the abstract setting of category theory, also suggests an architecture for engineering. A polynomial functor is used to specify a Moore machine, and through the fixpoint of this functor we obtain a transducer from a stream of input values to a stream of control values. An implementation using Reactive Extensions (RX) is sketched in a language-independent manner.
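The Moore-machine-as-transducer idea can be made concrete in a few lines: a state, a transition map, and an output map that depends only on the current state, unfolded over an input stream. This is a generic sketch of the construction, not the paper's categorical formulation; the running-sum controller example is hypothetical.

```python
def moore_transducer(delta, out, state, inputs):
    """Unfold a Moore machine over an input stream: at each step, emit the
    output of the *current* state, then transition on the next input.
    This turns a stream of inputs into a stream of control values."""
    for x in inputs:
        yield out(state)
        state = delta(state, x)

# Hypothetical example: state accumulates the input signal, and the
# control output is proportional to the accumulated state.
delta = lambda s, x: s + x   # transition function
out = lambda s: 2 * s        # output map (depends on state only)

print(list(moore_transducer(delta, out, 0, [1, 2, 3])))
# states visited: 0, 1, 3 -> outputs 0, 2, 6
```

Because the output depends only on the state, not on the current input, this is a Moore (rather than Mealy) machine, matching the specification via a polynomial functor.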
ISBN: (digital) 9781665454087; (print) 9781665454087
In this paper, we applied statistical analysis to evaluate the effect of parameters that affect the performance of a GPU-based parallel system. More specifically, we manually split the data to be processed into a number of trials executed inside the kernel, while the rest are passed as parameters to repeated kernel calls. In addition, we used a varying number of threads. For each combination obtained by changing the above parameter values, we measured the speedup as the ratio of the CPU to the GPU code execution time. We also investigated the GPU profiler's metrics to find out whether any of them correlate with speedup. The performance evaluation was based on statistical analysis. Monte Carlo algorithms were used as benchmarks, due to the high degree of parallelism they can incorporate.
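The work split the paper describes, trials performed inside one kernel launch versus the number of launches, can be sketched on the CPU with a Monte Carlo pi estimate (the GPU/CUDA timing itself is hardware-dependent and omitted; `mc_pi` and its parameter names are illustrative, not the paper's code):

```python
import random

def mc_pi(trials_per_call, n_calls, seed=42):
    """Monte Carlo estimate of pi, with the total work split into
    `trials_per_call` trials inside each 'kernel' and `n_calls` launches,
    mirroring the two parameters varied in the experiments."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_calls):              # each iteration = one kernel call
        for _ in range(trials_per_call):  # trials performed inside the kernel
            x, y = rng.random(), rng.random()
            if x * x + y * y <= 1.0:      # point falls inside the unit circle
                hits += 1
    return 4.0 * hits / (trials_per_call * n_calls)

est = mc_pi(10_000, 10)   # 100,000 total trials
assert abs(est - 3.14159) < 0.05
```

On a GPU, the same total trial count can be distributed very differently across these two parameters (and across thread counts), which is exactly the design space whose effect on speedup the paper analyzes statistically.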