ISBN: 9781728199245 (print)
In this paper, we introduce XPySom, a new open-source Python implementation of the well-known Self-Organizing Maps (SOM) technique. It is designed to achieve high performance on a single node, exploiting widely available Python libraries for vector processing on multi-core CPUs and GP-GPUs. We present results from an extensive experimental evaluation of XPySom in comparison to widely used open-source SOM implementations, showing that it outperforms the other available alternatives. Indeed, our experimentation carried out using the Extended MNIST open data set shows speed-ups of about 7x and 100x over the best open-source multi-core implementations we could find, with multi-core and GP-GPU acceleration, respectively, while achieving the same accuracy levels in terms of quantization error.
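The quantization error used as the accuracy metric above is commonly defined as the mean distance between each input sample and the weight vector of its best-matching unit. The following is a minimal sketch under that assumption, written in plain Python; the function names are hypothetical and this is not XPySom's API or its vectorized implementation.

```python
import math

def best_matching_unit(sample, weights):
    """Return the index of the codebook (weight) vector closest to `sample`."""
    return min(range(len(weights)), key=lambda i: math.dist(sample, weights[i]))

def quantization_error(samples, weights):
    """Mean Euclidean distance from each sample to its best-matching unit.

    Assumed definition: the abstract does not spell out the formula.
    """
    total = 0.0
    for sample in samples:
        bmu = best_matching_unit(sample, weights)
        total += math.dist(sample, weights[bmu])
    return total / len(samples)

# Tiny illustrative codebook with two units in a 2-D input space.
weights = [[0.0, 0.0], [1.0, 1.0]]
samples = [[0.0, 0.0], [1.0, 0.0]]
error = quantization_error(samples, weights)
```

A lower value means the trained map represents the data more closely, which is why two implementations can be compared on speed once they reach the same quantization error.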
ISBN: 9781665443012 (print)
The proposed FSCHOL framework consists of an FPGA kernel implementing a throughput-optimized hardware architecture for accelerating the supernodal multifrontal algorithm for sparse Cholesky factorization, and a host program implementing a novel scheduling algorithm that finds the optimal execution order of supernode computations for an elimination tree on the FPGA, eliminating the need for off-chip memory access to store intermediate results. Moreover, the proposed scheduling algorithm minimizes on-chip memory requirements for buffering intermediate results by resolving the dependency of parent nodes in an elimination tree through temporal parallelism. Experimental results for factorizing a set of sparse matrices of various sizes from the SuiteSparse Matrix Collection show that the proposed FSCHOL, implemented on an Intel Stratix 10 GX FPGA development board, achieves on average 5.5x and 9.7x higher performance and 10.3x and 24.7x lower energy consumption than implementations of CHOLMOD on an Intel Xeon E5-2637 CPU and an NVIDIA V100 GPU, respectively.
ISBN: 9781509061082 (print)
Numerous applications for mobile devices require 3D vision capabilities, which in turn require depth detection, since this enables the evaluation of an object's distance, position and shape. Despite the increasing popularity of depth detection algorithms, available solutions need expensive hardware and/or additional ASICs, which are not suitable for low-cost commodity hardware devices. In this paper, we propose a low-cost and low-power embedded solution to provide high-speed depth detection. We extend an existing off-the-shelf VLIW image processor and perform algorithmic and architectural optimizations in order to achieve the required real-time performance. Experimental results show that by adding different functional units and adjusting the algorithm to take full advantage of them, a 640x480 image pair with 64 disparities can be processed at 36.75 fps on a single processor instance, which is an improvement of 23% compared to the best state-of-the-art image processor.
ISBN: 9781665451550 (digital)
ISBN: 9781665451550 (print)
The convolution operator is a crucial kernel for many computer vision and signal processing applications that rely on deep learning (DL) technologies. As such, the efficient implementation of this operator has received considerable attention in the past few years for a fair range of processor architectures. In this paper, we follow the technology trend toward integrating long SIMD (single instruction, multiple data) arithmetic units into high-performance multicore processors to analyse the benefits of this type of hardware acceleration for latency-constrained DL workloads. For this purpose, we implement and optimise, for the Fujitsu A64FX processor, three distinct methods for the calculation of the convolution, namely, the lowering approach, a blocked variant of the direct convolution algorithm, and the Winograd minimal filtering algorithm. Our experimental results include an extensive evaluation of the parallel scalability of these three methods and a comparison of their global performance using three popular DL models and a representative dataset.
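The lowering approach named above is usually understood as unfolding the input into a patch matrix (often called im2col) and then computing the convolution as a matrix product. The sketch below illustrates that idea in plain Python, assuming a single channel, stride 1, no padding, and the cross-correlation convention common in DL frameworks. The function names are hypothetical and none of the SIMD or A64FX-specific optimisation from the paper is reproduced.

```python
def im2col(image, kh, kw):
    """Unfold every kh x kw patch of a 2-D image into one row of a patch matrix."""
    h, w = len(image), len(image[0])
    rows = []
    for i in range(h - kh + 1):
        for j in range(w - kw + 1):
            rows.append([image[i + di][j + dj]
                         for di in range(kh) for dj in range(kw)])
    return rows

def conv2d_lowered(image, kernel):
    """Convolution expressed as (patch matrix) x (flattened kernel)."""
    kh, kw = len(kernel), len(kernel[0])
    flat_kernel = [v for row in kernel for v in row]
    out_w = len(image[0]) - kw + 1
    patches = im2col(image, kh, kw)
    # Matrix-vector product: one dot product per output pixel.
    flat_out = [sum(p * k for p, k in zip(patch, flat_kernel))
                for patch in patches]
    # Fold the flat result back into a 2-D output map.
    return [flat_out[r:r + out_w] for r in range(0, len(flat_out), out_w)]

image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
kernel = [[1, 0],
          [0, 1]]
result = conv2d_lowered(image, kernel)
```

The patch matrix duplicates overlapping input elements, so lowering trades extra memory for the ability to reuse a tuned matrix-multiplication routine; the direct and Winograd methods avoid that duplication in different ways.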
ISBN: 9781665451550 (digital)
ISBN: 9781665451550 (print)
Approximate memories provide energy savings or performance improvements at the cost of occasional errors in stored data. Applications that tolerate errors in their data profit from this trade-off by controlling these errors so that they do not affect critical data. This control usually involves programmer intervention with annotations in the source code. To avoid annotations, some techniques protect critical data that are common to many applications, isolating specific memory regions from errors. In this work, we propose and explore alternatives for the protection of application-critical data by managing a supervisor execution environment with an approximate memory system. We expose only dynamically allocated data to errors, with secure data manipulation through an approximate allocation scheme that divides stored data based on the approximation of the heap area. We evaluate six applications with different data access profiles and obtain energy savings of up to 20%.
ISBN: 9781467380119 (print)
Highly available metadata services of distributed file systems are essential to cloud applications. However, existing highly available metadata designs lack client-oriented features that treat metadata discriminately, leading to a single metadata fault domain and low availability. After investigating the workload characteristics of Hadoop, we propose Client-Oriented METadata (COMET), a novel highly available metadata service design that divides and distributes metadata into independent regions in terms of clients. These regions are inherently isolated fault domains, and failures in one region will not break file operations in other regions. A prototype of COMET was implemented based on HDFS, and the experimental results show that COMET can significantly improve the metadata availability of HDFS without obvious performance degradation. It can also deliver scalable performance and faster metadata recovery due to its decentralized architecture.
ISBN: 9781665417303 (print)
With the increased amount of data available for processing, and the increased need to process this data, loosely-coupled batch applications have become very popular. Many batch applications require a high level of processing capacity, which leads to the need for high-performance computing infrastructures. This approach has been used for a long time, mainly for scientific purposes, and has focused on the conventional environments for HPC, namely local clusters and supercomputers. The high-speed networks present in these systems are paramount for the execution of tightly-coupled scientific applications, but are wasted when executing loosely-coupled applications. Cloud infrastructures, on the other hand, provide a more appropriate infrastructure to support such loosely-coupled applications. Unfortunately, the user experience in cloud systems is completely different from that of conventional batch systems, mainly because the infrastructure needs to be deployed and subsequently released to achieve the desired gains. In this paper, we propose the architecture of a batch processing system that takes advantage of common features of cloud infrastructures to minimize cost and waiting time, while providing a user experience that is similar to that of conventional HPC systems.
ISBN: 0769517722 (print)
Recent developments in the international arena have meant that the technology is now mature enough to bring together those required for the implementation of a grid computing facility. This paper examines the requirements and applications for an eScience infrastructure, with particular reference to developments in Europe.
ISBN: 9781479989300 (print)
Applications with large amounts of data, real-time constraints, ultra-low power requirements, and heavy computational complexity present significant challenges for modern computing systems, and often fall within the category of high-performance computing (HPC). As such, computer architects have looked to high-performance single instruction multiple data (SIMD) architectures, such as accelerator-rich platforms, for handling these workloads. However, since the results of these applications do not always require exact precision, approximate computing may also be leveraged. In this work, we introduce BRAINIAC, a heterogeneous platform that combines precise accelerators with neural-network-based approximate accelerators. These reconfigurable accelerators are leveraged in a multi-stage flow that begins with simple approximations and resorts to more complex ones as needed. We employ high-level, application-specific light-weight checks (LWCs) to throttle this multi-stage acceleration flow and reliably ensure user-specified accuracy at runtime. Evaluation of the performance and energy of our heterogeneous platform for error tolerance thresholds of 5%-25% demonstrates an average of 3x gain over computation that only includes precise acceleration, and 15x-35x gain over software-based computation.
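The multi-stage flow described above can be pictured as a cascade: try the cheapest approximation first, run a light-weight check on its output, and escalate only if the check fails, ending with the precise computation. The sketch below is a software-only illustration of that control flow under stated assumptions; the square-root workload, the function names, and the 5% tolerance check are all hypothetical stand-ins and do not reflect BRAINIAC's accelerators or its actual LWCs.

```python
import math

def run_multistage(x, approximators, precise, check):
    """Return the first approximate result that passes `check`, else the precise one."""
    for approx in approximators:  # ordered from cheapest to most complex
        y = approx(x)
        if check(x, y):
            return y
    return precise(x)

def halve(x):
    """Very cheap (and often poor) guess at sqrt(x); illustration only."""
    return x / 2.0

def newton_sqrt(x, steps=2):
    """Costlier approximation: a few Newton iterations starting from x / 2."""
    y = x / 2.0
    for _ in range(steps):
        y = 0.5 * (y + x / y)
    return y

def relative_check(x, y, tol=0.05):
    """Hypothetical light-weight check: is y*y within tol (relative) of x?"""
    return abs(y * y - x) <= tol * abs(x)

def approx_sqrt(x):
    return run_multistage(x, [halve, newton_sqrt], math.sqrt, relative_check)
```

For an input of 4.0 the cheapest stage already passes the check, while 9.0 falls through to the Newton stage; the check costs one multiplication, which is what makes it cheap relative to recomputing precisely.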
ISBN: 9781467380119 (print)
This paper introduces an approach to the design of discrete event simulation experiments aimed at transient performance analysis. Especially in complex, multi-tier applications, the net effects of small delays introduced by buffers, I/O operations, communication latency and averaged measurements may result in significant inertia along the input-output path. In order to bring out these dynamic properties, the simulation experiment should excite the system with a non-stationary workload under controlled conditions. The work discusses the dynamic properties of large-scale distributed computer systems and how these may impact delivered performance. These rationales are explored to motivate a concern-based architecture that captures the elicited requirements. The design approach is systematically formulated and illustrated by a case study on extending a well-known cloud computing simulation framework to provide the intended features. Experimental results of ongoing work are also addressed.