ISBN: 9781509012336 (print)
The development of new technologies is ushering in a new era characterized, among other factors, by the rise of sophisticated mobile devices containing CPUs and GPUs. This emerging scenario of heterogeneous mobile architectures raises challenging issues regarding the use of the available computing resources. Such issues are mainly related to the intrinsic complexity of coordinating these processors in order to increase application performance. To address this, this paper presents a high-level programming model for implementing parallel patterns that can be executed in a coordinated way by heterogeneous mobile architectures. A comparative analysis of performance and programming complexity is presented, contrasting code generated automatically from the proposed programming model with low-level, manually optimized implementations.
ISBN: 9780769534237 (print)
This work presents an implementation of the Neocognitron neural network using a high performance computing architecture based on a GPU (Graphics Processing Unit). Neocognitron is an artificial neural network, proposed by Fukushima and collaborators, consisting of several hierarchical stages of neuron layers organized in two-dimensional matrices called cellular planes. For the high performance computation of a face recognition application using Neocognitron, CUDA (Compute Unified Device Architecture) was used as the API (Application Programming Interface) between the CPU and the GPU, an NVIDIA GeForce 8800 GTX with 128 ALUs. The face image databases used were a face database created at UFS-Car and the CMU-PIE (Carnegie Mellon University Pose, Illumination and Expression) database. Load balancing was achieved by mapping cellular connections to threads organized in blocks, following the CUDA philosophy of development. The results showed the feasibility of this type of device as a massively parallel data processing tool, and that the smaller the granularity and the data dependency of the parallel processing, the better its performance.
ISBN: 9781509061082 (print)
A technique to speed up stencil computation is introduced. Computation and data reuse schemes are developed for its application to 1- and 3-dimensional stencils. The approach traverses the data domain fewer times than a state-of-the-art, straightforward iterative stencil implementation would. Performance results are shown for a variety of platforms, exemplifying how it can be straightforwardly applied with existing techniques and frameworks. The technique, named Aggregate Stencil-Loop Iteration (ASLI), works by applying a stencil obtained by convolving the original stencil operator with itself one or more times. This more complex operator creates new opportunities for in-register data reuse and increases the FLOPs-to-load ratio. The total number of FLOPs decreases for 1D but increases for 2D and 3D star-shaped stencils. In both scenarios, speed-up relative to the state-of-the-art is achieved. ASLI is relatively easy to implement and works synergistically with existing methods to optimize stencil computations.
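The core idea of aggregating stencil iterations can be illustrated with a minimal sketch. It is not the authors' implementation: the 3-point weights, the helper names `convolve` and `apply_stencil`, and the naive FLOP count are all assumptions chosen for illustration. The sketch checks that two sweeps of a 1D stencil give the same interior values as one sweep of the stencil convolved with itself.

```python
def convolve(a, b):
    """Full discrete convolution of two weight lists."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            out[i + j] += x * y
    return out


def apply_stencil(u, w):
    """Apply weights w to u, keeping only points the stencil fully covers."""
    n = len(w)
    return [sum(w[k] * u[i + k] for k in range(n))
            for i in range(len(u) - n + 1)]


def naive_flops_per_point(w):
    """Hypothetical cost model: one multiply per weight, adds in between."""
    return 2 * len(w) - 1


# Illustrative 3-point stencil (assumed weights, not from the paper).
w = [0.25, 0.5, 0.25]
# Aggregate stencil: the original operator convolved with itself once.
w2 = convolve(w, w)

u = [float(i * i % 7) for i in range(16)]

two_sweeps = apply_stencil(apply_stencil(u, w), w)  # traverses the data twice
one_sweep = apply_stencil(u, w2)                    # traverses the data once

max_diff = max(abs(a - b) for a, b in zip(two_sweeps, one_sweep))
flops_two_sweeps = 2 * naive_flops_per_point(w)
flops_one_sweep = naive_flops_per_point(w2)
```

Under this naive count, the aggregate 5-point stencil costs 9 operations per point against 10 for two 3-point sweeps, which is in line with the 1D FLOP reduction the abstract reports. Boundary handling is ignored here by keeping interior points only.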
ISBN: 0769507832; 0769507840 (print)
A new RAID-x (redundant array of inexpensive disks at level x) architecture is presented for distributed I/O processing on a serverless cluster of computers. The RAID-x architecture is based on a new concept of orthogonal striping and mirroring (OSM) across all distributed disks in the cluster. The primary advantages of this OSM approach lie in: (1) a significant improvement in parallel I/O bandwidth, (2) hiding disk mirroring overhead in the background, and (3) greatly enhanced scalability and reliability in cluster computing applications. All claimed advantages are substantiated with benchmark performance results on the Trojans cluster built at USC in 1999. Throughout the paper, we discuss the issues of scalable I/O performance, enhanced system reliability, and striped checkpointing on distributed RAID-x in a serverless cluster environment.
ISBN: 9781665451574 (digital); 9781665451574 (print)
In this paper, we present MSLIO, a code to mimic the I/O behavior of multiscale simulations. Such an I/O kernel is useful for HPC research, as it can be executed more easily and more efficiently than the full simulations when researchers are interested in the I/O load only. We validate MSLIO by comparing it to the I/O performance of an actual simulation, and we then use it to test some possible improvements to the output routine of the MHM (Multiscale Hybrid Mixed) library.
ISBN: 9781538677698 (print)
The increasing performance needs of critical real-time embedded systems (CRTES), for instance in the automotive domain, push for the adoption of high-performance hardware from the consumer electronics domain. However, the time-predictability features of such hardware remain largely unexplored. The ARM *** architecture is a good candidate for adoption in the CRTES market (e.g., it has already started being used in the automotive market). In this paper we study ARM ***'s capabilities to meet CRTES requirements. In particular, we perform a qualitative and quantitative assessment of its timing characteristics, focusing on shared multicore resources, and of how this architecture can be reliably used in CRTES.
ISBN: 9781665476522 (print)
Serverless computing has emerged as a popular cloud computing paradigm. Serverless environments are convenient to users and efficient for cloud providers. However, they can induce substantial application execution overheads, especially in applications with many functions. In this paper, we propose to accelerate serverless applications with a novel approach based on software-supported speculative execution of functions. Our proposal is termed Speculative Function-as-a-Service (SpecFaaS). It is inspired by out-of-order execution in modern processors, and is grounded in a characterization analysis of FaaS applications. In SpecFaaS, functions in an application are executed early, speculatively, before their control and data dependences are resolved. Control dependences are predicted like in pipeline branch prediction, and data dependences are speculatively satisfied with memoization. With this support, the execution of downstream functions is overlapped with that of upstream functions, substantially reducing the end-to-end execution time of applications. We prototype SpecFaaS on Apache OpenWhisk, an open-source serverless computing platform. For a set of applications in a warmed-up environment, SpecFaaS attains an average speedup of 4.6x. Further, on average, the application throughput increases by 3.9x and the tail latency decreases by 58.7%.
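The memoization-based data speculation described above can be sketched for a two-function chain. This is a simplified illustration, not the SpecFaaS design or any OpenWhisk API: `run_chain`, the `memo` dictionary, and the toy `upstream` and `downstream` functions are hypothetical names, and the functions are assumed to be free of side effects so that a mispredicted run can simply be discarded.

```python
from concurrent.futures import ThreadPoolExecutor


def run_chain(upstream, downstream, x, memo, pool):
    """Compute downstream(upstream(x)), speculating on upstream's output.

    Returns (result, speculation_hit). The memo maps an upstream input to
    the output last observed for it (assumed side-effect-free functions).
    """
    up_future = pool.submit(upstream, x)
    spec_future = None
    predicted = None
    if x in memo:
        # Launch the downstream function early on the predicted value,
        # overlapping it with the upstream function.
        predicted = memo[x]
        spec_future = pool.submit(downstream, predicted)

    actual = up_future.result()
    memo[x] = actual  # refresh the prediction table with the real output

    if spec_future is not None and predicted == actual:
        # Prediction validated: commit the speculative result.
        return spec_future.result(), True

    if spec_future is not None:
        # Misprediction: wait for the speculative run and discard it.
        spec_future.result()
    # Fall back to the non-speculative path.
    return downstream(actual), False


def upstream(x):
    return x + 1


def downstream(y):
    return y * 2


memo = {}
with ThreadPoolExecutor(max_workers=2) as pool:
    first = run_chain(upstream, downstream, 3, memo, pool)   # cold: no prediction
    second = run_chain(upstream, downstream, 3, memo, pool)  # warm: prediction hits
    memo[3] = 100                                            # force a stale prediction
    third = run_chain(upstream, downstream, 3, memo, pool)   # misprediction, re-executed
```

The sketch covers only data dependences between two functions; the control-dependence prediction mentioned in the abstract, and the handling of functions with side effects, are outside its scope.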
ISBN: 0769522750 (print)
The last decade has seen several changes in the structure and emphasis of enterprise IT systems. Specific infrastructure trends have included the emergence of large consolidated data centers, the adoption of virtualization and modularization, and an increased commoditization of hardware. At the application level, both the workload mix and usage patterns have evolved to an increased emphasis on service-centric computing and SLA-driven performance tuning. These, often dramatic, changes in the enterprise IT landscape motivate equivalent changes in the emphasis of architecture research. In this paper, we summarize some recent trends in enterprise IT systems and discuss the implications for architecture research, suggesting some high-level challenges and open questions for the community to address.
ISBN: 9781665443012 (print)
Deep Learning has shifted the focus of traditional batch workflows to data-driven feature engineering on streaming data. In particular, the execution of Deep Learning workflows presents expectations of near-real-time results with user-defined acceptable accuracy. Meeting the objectives of such applications across heterogeneous resources located at the edge of the network, the core, and in-between requires managing trade-offs between the accuracy and the urgency of the results. However, current data analysis rarely manages the entire Deep Learning pipeline along the data path, making it complex for developers to implement strategies in real-world deployments. Driven by an object detection use case, this paper presents an architecture for time-critical Deep Learning workflows by providing a data-driven scheduling approach to distribute the pipeline across Edge to Cloud resources. Furthermore, it adopts a data management strategy that reduces the resolution of incoming data when potential trade-off optimizations are available. We illustrate the system's viability through a performance evaluation of the object detection use case on the Grid'5000 testbed. We demonstrate that in a multi-user scenario, with a standard frame rate of 25 frames per second, the system speeds up data analysis by up to 54.4% compared to a Cloud-only scenario, with an analysis accuracy higher than a fixed threshold.
ISBN: 9781728114446 (print)
Energy-harvesting devices operate under extremely tight energy constraints. Ensuring forward progress under frequent power outages is paramount. Applications running on these devices are typically amenable to approximation, offering new opportunities to provide better forward progress between power outages. We propose What's Next (WN), a set of anytime approximation techniques for energy harvesting: subword pipelining, subword vectorization and skim points. Skim points fundamentally decouple the checkpoint location from the recovery location upon a power outage. Ultimately, WN transforms processing on energy-harvesting devices from all-or-nothing to as-is computing. We enable an approximate (yet acceptable) result sooner and proceed to the next task when power is restored rather than resume processing from a checkpoint to yield the perfect output. WN yields speedups of 2.26x and 3.02x on non-volatile and checkpoint-based volatile processors, while still producing high-quality outputs.