ISBN (digital): 9781665451550
ISBN (print): 9781665451550
The convolution operator is a crucial kernel for many computer vision and signal processing applications that rely on deep learning (DL) technologies. As such, the efficient implementation of this operator has received considerable attention in the past few years for a fair range of processor architectures. In this paper, we follow the technology trend toward integrating long SIMD (single instruction, multiple data) arithmetic units into high-performance multicore processors to analyse the benefits of this type of hardware acceleration for latency-constrained DL workloads. For this purpose, we implement and optimise, for the Fujitsu A64FX processor, three distinct methods for the calculation of the convolution: the lowering approach, a blocked variant of the direct convolution algorithm, and the Winograd minimal filtering algorithm. Our experimental results include an extensive evaluation of the parallel scalability of these three methods and a comparison of their global performance using three popular DL models and a representative dataset.
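The lowering approach mentioned in the abstract reduces convolution to a single matrix multiplication, commonly known as im2col. The following is a minimal single-channel NumPy sketch for illustration only; the function names are mine, and the paper's A64FX-optimised kernels are far more involved (multi-channel, blocked, vectorised).

```python
import numpy as np

def im2col(x, kh, kw):
    """Unfold a (H, W) input into a (kh*kw, out_h*out_w) patch matrix (stride 1, no padding)."""
    H, W = x.shape
    out_h, out_w = H - kh + 1, W - kw + 1
    cols = np.empty((kh * kw, out_h * out_w))
    for i in range(kh):
        for j in range(kw):
            # row (i, j) of the patch matrix holds that kernel position for every output pixel
            cols[i * kw + j] = x[i:i + out_h, j:j + out_w].ravel()
    return cols

def conv2d_lowering(x, k):
    """Convolution (cross-correlation) via lowering: one GEMM after im2col."""
    kh, kw = k.shape
    cols = im2col(x, kh, kw)
    return (k.ravel() @ cols).reshape(x.shape[0] - kh + 1, x.shape[1] - kw + 1)

x = np.arange(16, dtype=float).reshape(4, 4)
k = np.ones((2, 2))
print(conv2d_lowering(x, k))  # 3x3 output; top-left entry is 0+1+4+5 = 10
```

The appeal of lowering is that all the arithmetic lands in one GEMM call, which vendor BLAS libraries already vectorise well on long-SIMD hardware; its cost is the memory blow-up of the patch matrix, which is what motivates the direct and Winograd alternatives.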
ISBN (print): 9781538666142
The proceedings contain 248 papers. The topics discussed include: parallel computing implementation for real-time image dehazing based on dark channel; improved parallel algorithms for sequential minimal optimization of classification problems; heterogeneous assignment of functional units with Gaussian execution time on a tree; high-performance and low-latency vision system with hardware accelerator; merge-based parallel sparse matrix-sparse vector multiplication with a vector architecture; a learning-based adjustment model with genetic algorithm of function point estimation; high-performance implementation of matrix-free Runge-Kutta discontinuous Galerkin method for Euler equations; a step towards Hadoop dynamic scaling; and towards building a distributed data management architecture to integrate multi-source remote sensing big data.
ISBN (print): 9781728161495
Large-scale data centers run latency-critical jobs with quality-of-service (QoS) requirements alongside throughput-oriented background jobs, which need to achieve high performance. Previously proposed methods cannot co-locate multiple latency-critical jobs with multiple background jobs while (1) meeting the QoS requirements of all latency-critical jobs and (2) maximizing the performance of the background jobs. This paper proposes CLITE, a Bayesian-optimization-based multi-resource partitioning technique that achieves these goals.
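The core idea of multi-resource partitioning can be shown as a toy optimisation problem: pick the resource split that maximises background throughput subject to latency-critical QoS. The paper uses Bayesian optimisation to search this space; the exhaustive search, job names, and performance models below are illustrative stand-ins, not CLITE's actual models.

```python
UNITS = 4  # resource units (e.g. cache ways) to split between one LC and one BG job

def qos_ok(lc_units):
    # toy QoS model: the latency-critical job needs at least 2 units
    return lc_units >= 2

def bg_perf(bg_units):
    # toy performance model: background throughput grows with its units
    return bg_units

def best_partition():
    """Maximise BG performance over all splits that satisfy the LC job's QoS."""
    feasible = [(u, UNITS - u) for u in range(UNITS + 1) if qos_ok(u)]
    return max(feasible, key=lambda p: bg_perf(p[1]))

print(best_partition())  # (2, 2): give the LC job just enough, the BG job the rest
```

Exhaustive search is only viable for this toy setting; with many jobs and many resource types the space explodes, which is exactly why a sample-efficient search such as Bayesian optimisation is used.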
ISBN (print): 159593717X
Service discovery is a critical task in distributed computing architectures for finding a particular service instance. Semantic annotations of services help to enrich the service discovery process. Semantic registries are an important component for the discovery of services, and they allow for semantic interoperability through ontology-based query formulation and dynamic mapping of terminologies between system domains. This paper evaluates two semantic registries, an OWLJessKB implementation and instanceStore, to determine their suitability with regard to ontology-loading performance, query response time, and overall scalability for use in mathematical services. Copyright 2007 ACM.
ISBN (print): 0769517722
Skewed-associativity is a technique that reduces the miss ratios of CPU caches by applying different indexing functions to each way of an associative cache. Even though it showed impressive hit/miss statistics, the scheme has not been welcomed by industry, presumably because implementation of the original version is complex and might involve access-time penalties among other costs. This work presents a simplified, easy-to-implement variant that we call minimally-skewed associativity (MSkA). We show that MSkA caches, in many cases, should not incur penalties in either access time or power consumption when compared to set-associative caches of the same associativity. Hit/miss statistics were obtained by means of trace-driven simulations. Miss ratios are not as good as those for full skewing, but they are still advantageous. Minimal skewing is thus proposed as a way to improve the hit/miss performance of caches, often without producing the access-time delays or increases in power consumption that other techniques do (for example, using higher associativities).
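The per-way indexing idea is easy to see in miniature: two addresses that fall into the same set under conventional modulo indexing can land in different sets under a skewed function, so they need not evict each other. The index functions and sizes below are illustrative assumptions, not the paper's MSkA functions.

```python
SETS = 64  # sets per way (power of two)

def index_way0(addr):
    # conventional modulo indexing used by way 0
    return addr % SETS

def index_way1(addr):
    # a simple XOR-based skew for way 1: mix higher address bits into the index
    return (addr ^ (addr >> 6)) % SETS

a, b = 0x100, 0x140
print(index_way0(a) == index_way0(b))  # True: the two addresses conflict in way 0
print(index_way1(a) == index_way1(b))  # False: the skew separates them in way 1
```

Full skewing applies a distinct (and stronger) hash to every way; the "minimal" variant of the paper keeps most ways conventionally indexed to simplify the hardware, at some cost in miss ratio.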
ISBN (print): 1595936734
Fine-Grained Cycle Sharing (FGCS) systems aim at utilizing the large amount of idle computational resources available on the Internet. Such systems allow guest jobs to run on a host if they do not significantly impact the local users of the host. Since the hosts are typically provided voluntarily, their availability fluctuates greatly. To provide fault tolerance to guest jobs without adding significant computational overhead, we propose failure-aware checkpointing techniques that apply knowledge of resource availability to select checkpoint repositories and to determine checkpoint intervals. We present schemes for selecting reliable and efficient repositories from the non-dedicated hosts that contribute their disk storage. These schemes are formulated as 0/1 programming problems that optimize the network overhead of transferring checkpoints and the work lost when a storage host is unavailable at the time it is needed to recover a guest job. We determine the checkpoint interval by comparing the cost of checkpointing immediately with the cost of delaying it to a later time, which is a function of the resource availability. We evaluate these techniques on an FGCS system called iShare, using trace-based simulation. The results show that they achieve better application performance than the prevalent methods, which use checkpointing with a fixed periodicity on dedicated checkpoint servers. Copyright 2007 ACM.
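The checkpoint-interval decision described above reduces to a cost comparison: checkpoint now if the expected work lost by delaying exceeds the checkpointing overhead. The function below is a deliberately simplified sketch of that rule; the failure probability and cost model are illustrative parameters, not the paper's availability model.

```python
def should_checkpoint(ckpt_cost, work_since_last, p_fail):
    """Checkpoint now if the expected loss from delaying exceeds the overhead.

    ckpt_cost:       cost (e.g. seconds) of taking and transferring a checkpoint now
    work_since_last: work (seconds) that would be lost if the host fails before
                     the next checkpoint opportunity
    p_fail:          probability the host becomes unavailable before that opportunity
    """
    expected_loss = p_fail * work_since_last
    return expected_loss > ckpt_cost

# an unreliable host justifies the overhead; a reliable one does not
print(should_checkpoint(5.0, 100.0, 0.10))  # True
print(should_checkpoint(5.0, 100.0, 0.01))  # False
```

In the paper, the availability term is not a fixed constant but is predicted from observed resource-availability traces, which is what makes the checkpointing "failure-aware".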
ISBN (print): 0769520464
In general, two types of resource reservations in computer networks can be distinguished: immediate reservations, which are made in a just-in-time manner, and advance reservations, which allow resources to be reserved long before they are actually used. Advance reservations are especially useful for grid computing, but also for a variety of other applications that require network quality-of-service, such as content distribution networks or even mobile clients, which need advance reservation to support handovers for streaming video. With the emergence of the MPLS standard, explicit routing can be implemented in IP networks as well, overcoming the unpredictable routing behavior that had so far prevented the implementation of advance reservation services. The impact of such advance reservation mechanisms on the performance of the network, with respect to the number of admitted requests and the allocated bandwidth, has so far not been examined in detail. In this paper we show that advance reservations can reduce the performance of the network with respect to both metrics. An analysis of the reasons reveals a fragmentation of the network resources. In advance reservation environments, additional new services can be defined, such as the malleable reservations introduced in this paper, which can lead to increased network performance. Four strategies for scheduling malleable reservations are presented and compared. The results of the comparisons show that some strategies increase resource fragmentation and are therefore unsuitable in the considered environment, while others lead to significantly better network performance. Besides discussing the performance issue, this paper also presents the software architecture of a management system for advance reservations.
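Admission control for advance reservations, and the fragmentation it causes, can be illustrated on a single link with a time-slotted bandwidth profile: a request is admitted only if capacity holds over its whole book-ahead interval. The slot granularity, capacity, and request values below are made up for illustration.

```python
CAPACITY = 100            # link capacity in bandwidth units
alloc = [0] * 24          # bandwidth already reserved in each future time slot

def admit(start, end, bw):
    """Reserve bw over slots [start, end) if every slot has room; else reject."""
    if any(alloc[t] + bw > CAPACITY for t in range(start, end)):
        return False      # one overloaded slot anywhere in the interval blocks it
    for t in range(start, end):
        alloc[t] += bw
    return True

print(admit(0, 4, 60))    # True: slots 0-3 go to 60/100
```

Fragmentation shows up immediately: after the reservation above, a 60-unit request over slots 2-5 is rejected even though most of its interval is empty, because slots 2 and 3 cannot hold a second 60. Malleable reservations relax the fixed interval or rate precisely to fill such gaps.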
ISBN (digital): 9781665451550
ISBN (print): 9781665451550
Approximate memories provide energy savings or performance improvements at the cost of occasional errors in stored data. Applications that tolerate errors in their data profit from this trade-off by controlling these errors so that they do not affect critical data. This control usually involves programmer intervention through annotations in the source code. To avoid annotations, some techniques protect critical data that are common to many applications by isolating specific memory regions from errors. In this work, we propose and explore alternatives for the protection of application-critical data by managing a supervisor execution environment with an approximate memory system. We expose only dynamically allocated data to errors, with secure data manipulation through an approximate allocation scheme that divides stored data based on the approximation of the heap area. We evaluate six applications with different data-access profiles and obtain energy savings of up to 20%.
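The split-heap idea can be sketched as an allocator with two regions: critical data go to a protected region that is never corrupted, while error-tolerant dynamic data go to an approximate region whose reads may occasionally flip a bit. The class, region names, and the error model below are illustrative assumptions, not the paper's allocator.

```python
import random

class SplitHeap:
    """Toy allocator: protected region for critical data, error-exposed region otherwise."""

    def __init__(self, seed=0):
        self.protected = {}     # critical data: never corrupted
        self.approximate = {}   # error-exposed region (approximate memory)
        self.rng = random.Random(seed)

    def alloc(self, key, value, critical):
        (self.protected if critical else self.approximate)[key] = value

    def read(self, key):
        if key in self.protected:
            return self.protected[key]
        value = self.approximate[key]
        if self.rng.random() < 0.01:
            value ^= 1          # model a rare single-bit error on approximate reads
        return value
```

The point of routing the decision through the allocator, rather than through source annotations, is that the supervisor environment can place data without any programmer involvement.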
ISBN (print): 9781728199245
In this paper, we introduce XPySom, a new open-source Python implementation of the well-known Self-Organizing Maps (SOM) technique. It is designed to achieve high performance on a single node, exploiting widely available Python libraries for vector processing on multi-core CPUs and GP-GPUs. We present results from an extensive experimental evaluation of XPySom in comparison to widely used open-source SOM implementations, showing that it outperforms the available alternatives. Indeed, our experimentation carried out using the Extended MNIST open data set shows a speed-up of about 7x with multi-core acceleration and about 100x with GP-GPU acceleration over the best open-source implementations we could find, while achieving the same accuracy levels in terms of quantization error.
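The SOM update that implementations like this vectorise is short: find the best-matching unit (BMU) for a sample, then pull the weights of nearby grid units toward it, scaled by a Gaussian neighbourhood. The sketch below is a generic online SOM step in NumPy, not XPySom's API; the learning rate and neighbourhood width are illustrative.

```python
import numpy as np

def som_step(weights, grid, x, lr=0.5, sigma=1.0):
    """One online SOM update.

    weights: (n_units, dim) codebook vectors
    grid:    (n_units, 2) coordinates of each unit on the map
    x:       (dim,) input sample
    """
    bmu = np.argmin(np.linalg.norm(weights - x, axis=1))  # best-matching unit
    d2 = np.sum((grid - grid[bmu]) ** 2, axis=1)          # squared map distance to BMU
    h = np.exp(-d2 / (2 * sigma ** 2))                    # Gaussian neighbourhood
    return weights + lr * h[:, None] * (x - weights)      # pull neighbours toward x
```

Because the whole step is a handful of array expressions, swapping NumPy for a GPU array library accelerates it with essentially no code changes, which is the single-node performance angle the paper pursues.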
ISBN (print): 9781538677698
Ensembles of the Online Sequential Extreme Learning Machine (OS-ELM) algorithm are suitable for forecasting data streams with concept drifts. Nevertheless, forecasting data streams requires high-performance implementations due to the high rate of incoming samples. In this work, we propose to tune up three ensembles that operate with OS-ELM, using high-performance techniques. We reimplemented them in the C programming language with the Intel MKL and MPI libraries. Intel MKL provides functions that exploit the multithreading features of multicore CPUs, which extends the parallelism to multiprocessor architectures. MPI allows us to parallelize tasks with distributed memory across several processes, which can be allocated within a single computational node or spread over several nodes. In summary, our proposal consists of a two-level parallelization, where we allocate each ensemble model to an MPI process and parallelize the internal functions of each model across a set of threads through Intel MKL. Thus, the objective of this work is to verify whether our proposals provide a significant improvement in execution time compared to the respective conventional serial approaches. For the experiments, we used one synthetic and one real dataset. Experimental results showed that, in general, the high-performance ensembles improve execution time compared with their serial versions, running up to 10-fold faster.
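The two-level structure above can be mimicked in miniature: ensemble members run in separate processes (standing in for the MPI level), while each member's linear algebra runs through NumPy, whose BLAS backend (MKL on many installs) supplies the thread level. The ELM-style member below, its data, and the process count are all illustrative; the paper's C/MKL/MPI ensembles are far more elaborate.

```python
import numpy as np
from concurrent.futures import ProcessPoolExecutor

def train_member(seed):
    """One ELM-style member: random hidden layer, least-squares output weights."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(64, 3))
    y = X @ np.array([1.0, -2.0, 0.5])               # synthetic regression target
    H = np.tanh(X @ rng.normal(size=(3, 16)))        # random (untrained) hidden features
    beta, *_ = np.linalg.lstsq(H, y, rcond=None)     # threaded BLAS/LAPACK works here
    return float(np.mean((H @ beta - y) ** 2))       # member's training MSE

if __name__ == "__main__":
    # process level: one worker per ensemble member, like one MPI rank per model
    with ProcessPoolExecutor(max_workers=2) as pool:
        errors = list(pool.map(train_member, range(4)))
    print(errors)
```

The design point carried over from the paper is the separation of concerns: the outer level scales across members (and nodes), while the inner level lets each member's matrix operations saturate the cores of its own processor.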