ISBN:
(Print) 9781509024537
Video streams, whether on-demand or live, usually have to be converted (i.e., transcoded) to match the characteristics (e.g., spatial resolution) of clients' devices. Transcoding is a computationally expensive operation; therefore, streaming service providers currently store numerous transcoded versions of the same video to serve different types of client devices. However, recent studies show that accesses to video streams follow a long-tail distribution; that is, a few popular videos are frequently accessed while the majority are accessed infrequently. The idea we propose in this research is to transcode the infrequently accessed videos in an on-demand (i.e., lazy) manner. Due to the cost of maintaining infrastructure, streaming service providers (e.g., Netflix) commonly use cloud services. However, the challenge in utilizing cloud services for video transcoding is how to deploy cloud resources in a cost-efficient manner without any major impact on the quality of video streams. To address this challenge, we present an architecture for on-demand transcoding of video streams. The architecture provides a platform for streaming service providers to utilize cloud resources cost-efficiently while respecting the Quality of Service (QoS) requirements of video streams. In particular, the architecture includes a QoS-aware scheduling component that efficiently maps video streams to cloud resources, and a cost-efficient dynamic (i.e., elastic) resource provisioning policy that adapts resource acquisition to the QoS requirements of the video streams.
ISBN:
(Print) 9780769533070
This work proposes a set of requirements for programming emerging FPGA-based high-performance computing systems, and uses them to evaluate a number of existing parallel programming models.
ISBN:
(Print) 0769515703
The paper presents a pragmatic scan partitioning architecture that allows less-than-perfect scan design in high-performance VLSI circuits to cost-effectively achieve test development and manufacturing test goals. The paper then describes an implementation of the architecture on Compaq's Alpha 21364 microprocessor.
ISBN:
(Print) 9781538648193
Optimizing legacy codes to fully exploit the parallelism opportunities provided by modern heterogeneous architectures is a difficult task. Multiple levels of parallelism must be exploited to attain the expected performance. This work describes the lessons learned in the performance optimization of a real-world reservoir engineering application composed of thousands of lines of code. We study the exploitation of the multiple levels of parallelism, showing a possible, although non-trivial, path to extract performance. Our results show that exploiting thread-level parallelism is not always the best path to performance gains. On the other hand, vectorization plays a key role in reducing the execution time of the application.
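As a toy illustration of the abstract's point about vectorization (this is not the paper's reservoir code; NumPy's whole-array operations stand in for the compiler/SIMD vectorization the paper studies):

```python
import math
import numpy as np

# The same reduction written as a scalar loop and in vectorized form.
# On real hardware the vectorized version maps to SIMD units, which is
# the kind of speedup the abstract attributes to vectorization.

def scalar_sum_of_squares(a):
    total = 0.0
    for x in a:                  # one element per iteration
        total += x * x
    return total

def vector_sum_of_squares(a):
    return float(np.dot(a, a))   # whole-array (vector) operation

a = np.arange(10_000, dtype=np.float64)
s, v = scalar_sum_of_squares(a), vector_sum_of_squares(a)
assert math.isclose(s, v, rel_tol=1e-9)   # same result, different execution
```

Both forms compute the same value; the difference the paper measures is in execution time, where the vector form can process several elements per instruction.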
ISBN:
(Print) 9781467327138
The interconnection network is a crucial part of high-performance computer systems. It significantly determines parallel system performance as well as development and operating costs. In this paper we propose an efficient and scalable hierarchical multi-ring interconnection network architecture. To build up the interconnection network, we have designed a suitable switch architecture and implemented a "step-back-on-blocking" flow control algorithm. The architectural model has been verified, and its communication performance parameters evaluated, on the basis of numerous simulation experiments conducted in the OMNeT++ simulation environment.
ISBN:
(Print) 9781479929276
The size of the Last Level Caches (LLC) in multicore architectures is increasing, and so is their power consumption. However, most of this power is wasted on unused or invalid cache lines. For dirty cache lines, the LLC waits until the line is evicted to write it back to memory. Hence, dirty lines compete for memory bandwidth with read requests (prefetch and demand), increasing pressure on the memory controller. This paper proposes a Dead Line and Early Write-Back Predictor (DEWP) to improve the energy efficiency of the LLC. DEWP evicts dead cache lines early with an average accuracy of 94% and only 2% false positives. DEWP also allows scheduling of dirty lines for early eviction, enabling earlier write-backs. Using DEWP over a set of single- and multi-threaded benchmarks, we obtain an average of 61% static energy savings, while maintaining performance, for both inclusive and non-inclusive LLCs.
ISBN:
(Print) 1595937145
The proceedings contain 5 papers. The topics discussed include: enabling rapid development of parallel tree search applications; challenges in executing large parameter sweep studies across widely distributed computing environments; hyperscaling of plasma turbulence simulations in DEISA; WISDOM-II: a large in silico docking effort for finding novel hits against malaria using computational grid infrastructure; and efficient processing of pathological images using the grid: computer-aided prognosis of neuroblastoma.
ISBN:
(Print) 9781728176451
Traditionally, parallel graph analytics workloads have been implemented in systems like Pregel, GraphLab, Galois, and Ligra that support graph data structures and graph operations directly. An alternative approach is to express graph workloads in terms of sparse matrix kernels such as sparse matrix-vector and matrix-matrix multiplication. An API for these kernels has been defined by the GraphBLAS project. The SuiteSparse project has implemented this API on shared-memory platforms, and the LAGraph project is building a library of graph algorithms using this API. How does the matrix-based approach perform compared to the graph-based approach? Our experiments on a 56-core CPU show that for representative graph workloads, LAGraph/SuiteSparse solutions are on average 5x slower than Galois solutions. We argue that this performance gap arises from inherent limitations of a matrix-based API: regardless of which architecture a matrix-based algorithm runs on, it is subject to the same limitations.
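A minimal sketch of the matrix-based approach this abstract discusses, written in plain NumPy rather than the GraphBLAS/SuiteSparse API: breadth-first search expressed as repeated matrix-vector products over the adjacency matrix, with a mask restricting each new frontier to unvisited vertices.

```python
import numpy as np

def bfs_levels(adj, source):
    """BFS levels via matrix-vector products; adj[i, j] == 1 for edge i -> j."""
    n = adj.shape[0]
    level = np.full(n, -1)               # -1 marks unvisited vertices
    frontier = np.zeros(n, dtype=bool)
    frontier[source] = True
    depth = 0
    while frontier.any():
        level[frontier] = depth
        # next frontier = frontier-vector times adjacency matrix,
        # masked so already-visited vertices are dropped
        frontier = ((frontier @ adj) > 0) & (level == -1)
        depth += 1
    return level

# Path graph 0 -> 1 -> 2 plus an unreachable vertex 3.
A = np.zeros((4, 4), dtype=np.int64)
A[0, 1] = A[1, 2] = 1
levels = bfs_levels(A, 0)   # vertex 3 keeps level -1 (unreached)
```

In GraphBLAS terms, the `(level == -1)` test plays the role of the complemented mask argument, and a Boolean semiring would replace the arithmetic product-and-compare used here.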
ISBN:
(Print) 9781509061082
Energy use is now a first-class design constraint in high-performance systems and applications. Improving our understanding of application energy consumption in diverse, heterogeneous systems will be essential to efficient operation. For example, power limits in large-scale parallel and distributed systems will require optimizing performance under energy constraints. However, with increased levels of parallelism, complex memory hierarchies, hardware heterogeneity, and diverse programming models and interfaces, improving performance and energy efficiency simultaneously is exceedingly difficult. Our thesis is that estimating energy use, either a priori or as soon as possible at runtime, will be essential to future systems. Such estimates must adapt to changes in applications across hardware configurations. Existing approaches offer insight and detail, but typically are too cumbersome to enable adaptation at runtime, or lack portability or accuracy. To overcome these limitations, we propose two energy estimation techniques that use the Aspen domain-specific language for performance modeling: ACEE (Algorithmic and Categorical Energy Estimation), a combination of analytical and empirical modeling techniques embedded in a runtime framework that leverages Aspen, and AEEM (Aspen's Embedded Energy Modeling), a system-level coarse-grained energy estimation technique that uses performance modeling from Aspen to generate energy estimates at runtime. This paper presents the methodology of the models and examines their accuracy as well as their advantages and challenges in several use cases.
ISBN:
(Print) 9781467355872
DARPA's Ubiquitous High-Performance Computing (UHPC) program asked researchers to develop computing systems capable of achieving energy efficiencies of 50 GOPS/Watt, assuming 2018-era fabrication technologies. This paper describes Runnemede, the research architecture developed by the Intel-led UHPC team. Runnemede is being developed through a co-design process that considers the hardware, the runtime/OS, and applications simultaneously. Near-threshold voltage operation, fine-grained power and clock management, and separate execution units for runtime and application code are used to reduce energy consumption. Memory energy is minimized through application-managed on-chip memory and direct physical addressing. A hierarchical on-chip network reduces communication energy, and a codelet-based execution model supports extreme parallelism and fine-grained tasks. We present an initial evaluation of Runnemede that shows the design process for our on-chip network, demonstrates 2-4x improvements in memory energy from explicit control of on-chip memory, and illustrates the impact of hardware-software co-design on the energy consumption of a synthetic aperture radar algorithm on our architecture.