ISBN: 9781939133120 (print)
NVMe is designed to unshackle flash from a traditional storage bus by allowing hosts to employ many threads to achieve higher bandwidth. While NVMe enables users to fully exploit all levels of parallelism offered by modern SSDs, current firmware designs are not scalable and have difficulty handling a large number of I/O requests in parallel due to their limited computation power and many hardware ***. We propose DeepFlash, a novel manycore-based storage platform that can process more than a million I/O requests per second (1 MIOPS) while hiding the long latencies imposed by its internal flash media. Inspired by a parallel data analysis system, we design the firmware based on a many-to-many threading model that can be scaled horizontally. The proposed DeepFlash can extract the maximum performance of the underlying flash memory complex by concurrently executing multiple firmware components across many cores within the device. To show its extreme parallel scalability, we implement DeepFlash on a many-core prototype processor that employs dozens of lightweight cores, analyze new challenges arising from parallel I/O processing, and address these challenges by applying concurrency-aware optimizations. Our comprehensive evaluation reveals that DeepFlash can serve around 4.5 GB/s while minimizing the CPU demand on microbenchmarks and real server workloads.
ISBN: 9781728161495 (digital)
ISBN: 9781728161501 (print)
General-purpose hardware accelerators have become major data processing resources in many computing domains. However, the processing capability of hardware accelerators is often limited by costly software interventions and memory copies needed to support compulsory data movement between different processors and solid-state drives (SSDs). This in turn wastes a significant amount of energy in modern accelerated systems. In this work, we propose DRAM-less, a hardware automation approach that precisely integrates many state-of-the-art phase change memory (PRAM) modules into its data processing network to dramatically reduce unnecessary data copies with a minimum of software modifications. We implement a new memory controller that plugs a real 3x nm multi-partition PRAM into 28 nm technology FPGA logic cells, and we integrate its design into a real PCIe accelerator emulation platform. The evaluation results reveal that DRAM-less achieves, on average, 47% better performance than advanced acceleration approaches that use peer-to-peer DMA.
The development of fine-grain multi-threaded program execution models has created an interesting challenge: how to partition a program into threads that can exploit machine parallelism, achieve latency tolerance, and...
This paper makes two important contributions. First, the paper investigates the performance implications of data placement in OpenMP programs running on modern NUMA multiprocessors. Data locality and minimization of the rate of remote memory accesses are critical for sustaining high performance on these systems. We show that due to the low remote-to-local memory access latency ratio of contemporary NUMA architectures, reasonably balanced page placement schemes, such as round-robin or random distribution, incur modest performance losses. Second, the paper presents a transparent, user-level page migration engine with an ability to gain back any performance loss that stems from suboptimal placement of pages in iterative OpenMP programs. The main body of the paper describes how our OpenMP runtime environment uses page migration for implementing implicit data distribution and redistribution schemes without programmer intervention. Our experimental results verify the effectiveness of the proposed framework and provide a proof of concept that it is not necessary to introduce data distribution directives in OpenMP and warrant the simplicity or the portability of the programming model.
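The observation about balanced placement schemes can be illustrated with a back-of-the-envelope latency model. The following is a minimal sketch and not the paper's page migration engine: the function names, the page and node counts, and the latency figures are all hypothetical and chosen only for illustration.

```python
def round_robin_placement(num_pages, num_nodes):
    """Assign page p to NUMA node p mod num_nodes (hypothetical helper)."""
    return [page % num_nodes for page in range(num_pages)]


def average_latency(placement, accessor_node, local_ns, remote_ns):
    """Mean access latency for one thread that touches every page once.

    A page is 'local' when it resides on the accessing thread's node.
    The latency values are illustrative assumptions, not measurements.
    """
    local_pages = sum(1 for node in placement if node == accessor_node)
    remote_pages = len(placement) - local_pages
    total_ns = local_pages * local_ns + remote_pages * remote_ns
    return total_ns / len(placement)


# Hypothetical machine: 4 nodes, remote-to-local latency ratio of 2:1.
placement = round_robin_placement(num_pages=8, num_nodes=4)
avg_ns = average_latency(placement, accessor_node=0, local_ns=100, remote_ns=200)
print(avg_ns)  # 175.0, versus 100.0 if every page were local
```

In this toy model the penalty of round-robin placement is bounded by the remote-to-local ratio, which is why a low ratio keeps the loss modest. A real workload also depends on cache behavior and contention, which the sketch ignores.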
In this paper, we examine the potential of optimization-based computer-assisted proof methods to be applied much more widely than commonly recognized by engineers and computer scientists. More specifically, we contend...
In this paper, we study the lifetime optimization problem in wireless sensor networks using mobile sink nodes. This problem is inherently difficult since we need to consider both sink scheduling and data routing. Through a simple case study we develop a novel notation named the Placement pattern (PP) to bound traffic patterns with candidate locations. This significantly decreases the number of elements needed to be scheduled. Based on the PP, we mathematically formulate this optimization problem as a Mixed-integer non-linear programming (MINLP), which is very tough and time consuming to solve. By proving that the problem is NP-complete, we point out that instead of seeking an optimal algorithm, heuristic algorithms, especially those with performance guarantee, would be much more desirable to develop. Furthermore, in order to help identify performance gains of heuristic algorithms proposed in the future, we develop a Linear programming (LP) formulation which serves as an upper bound by adopting a reformulation and relaxation technique.
An mth-order recurrence problem is defined as the computation of the series x_1, x_2, …, x_N, where x_i = f_i(x_{i-1}, …, x_{i-m}) for some function f_i. This paper uses a technique called recursive doubling in an algorithm for sol...
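To make the idea of recursive doubling concrete, here is a minimal sketch restricted to a first-order linear recurrence, x_i = a_i * x_{i-1} + b_i. This restriction, the function names, and the sequential simulation of each parallel round are assumptions made for illustration; the paper's own algorithm covers the general mth-order case and is not reproduced here. Each step is treated as an affine map, and in every round each map is composed with the one `stride` positions earlier, so all N results are available after about log2(N) rounds.

```python
def solve_sequential(a, b, x0):
    """Reference solution: evaluate x_i = a[i] * x_{i-1} + b[i] one step at a time."""
    xs, x = [], x0
    for ai, bi in zip(a, b):
        x = ai * x + bi
        xs.append(x)
    return xs


def solve_recursive_doubling(a, b, x0):
    """Solve the same recurrence in ceil(log2(n)) rounds of pairwise composition.

    After a round with a given stride, (A[i], B[i]) encodes the affine map
    that carries x_{i - 2*stride} (or x0, once the window reaches the start)
    to x_i. Every update within a round is independent of the others, so a
    parallel machine could perform a whole round in one step; here the round
    is simulated with an ordinary loop.
    """
    n = len(a)
    A, B = list(a), list(b)
    stride = 1
    while stride < n:
        new_A, new_B = A[:], B[:]
        for i in range(stride, n):
            # Compose map i (applied second) with map i - stride (applied first).
            new_A[i] = A[i] * A[i - stride]
            new_B[i] = A[i] * B[i - stride] + B[i]
        A, B = new_A, new_B
        stride *= 2
    return [A[i] * x0 + B[i] for i in range(n)]


a = [2, 3, 1, 2]
b = [1, 0, 4, 2]
print(solve_recursive_doubling(a, b, x0=1))  # [3, 9, 13, 28]
```

The sketch performs more total arithmetic than the sequential loop, which is the usual trade-off: extra work in exchange for a dependency chain of logarithmic rather than linear length.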
ISBN: 9783540481584 (digital)
ISBN: 9783540665496 (print)
Parallel Virtual Machine (PVM) and Message Passing Interface (MPI) are the most frequently used tools for programming according to the message passing paradigm, which is considered one of the best ways to develop parallel applications. This volume comprises 67 revised contributions presented at the Sixth European PVM/MPI Users' Group Meeting, which was held in Barcelona, Spain, 26-29 September 1999. The conference was organized by the Computer Science Department of the Universitat Autònoma de Barcelona. This conference was previously held in Liverpool, UK (1998) and Cracow, Poland (1997). The first three conferences were devoted to PVM and were held at the TU Munich, Germany (1996), ENS Lyon, France (1995), and the University of Rome (1994). This conference has become a forum for users and developers of PVM, MPI, and other message passing environments. Interaction between those groups has proved to be very useful for developing new ideas in parallel computing and for applying some existing ones to new practical fields.
Processing data in storage is an energy-efficient solution to examine massive datasets. However, a general incarnation of such a well-known task-offloading model in a real system is unfortunately unsuccessful due to not only poor performance but also many practical challenges, such as limited processing capabilities and high vulnerabilities at the storage level. We propose DockerSSD, a fully flexible in-storage processing (ISP) model that can run a variety of applications near flash without source-level modification. Specifically, it enables lightweight OS-level virtualization in modern SSDs, which allows the storage intelligence to be well harmonized with existing computing environments and makes ISP even faster. Instead of developing a vendor-specific ISP to offload, DockerSSD can reuse existing Docker images, create containers as self-governing execution objects in storage, and process data directly where they reside in real time. To this end, we design a new communication method and virtual firmware that operate together to download Docker images and manage their container execution without changing the existing storage interface and runtime. We further accelerate ISP and reduce the execution latency by automating container-related network and I/O handling data paths in hardware. Our evaluation shows that DockerSSD is 2.0× faster than state-of-the-art ISP models for workloads with a high volume of system calls or file accesses. Moreover, it demonstrates a reduction in power and energy consumption by 1.6× and 2.3×, respectively.