ISBN: 9781665473156 (print)
NVMeoF is the latest extension of NVMe for remote storage access; it allows remote access to NVMe controllers over high-speed RDMA, FC, and TCP networks. NVMe over TCP (NoT) can build on the large-scale common network infrastructure in datacenters and the standard TCP/IP software protocol stack, which makes it more widely available than NVMe-over-RDMA and its RDMA-specific network infrastructure. However, the processing of read and write I/O at the host and target shows significantly different characteristics and requirements: one side sends the NVMeoF instruction of the request, while the other side sends the requested data. The existing NoT implementation cannot accommodate these different request characteristics in the datacenter, so I/O performance ends up limited by the common processing pipeline and sending strategy. In this paper, we propose RNoT, a transformable queue that applies a differentiated processing scheme to read and write requests in the NoT implementation. Specifically, RNoT defines a switchable working attribute and separates resources for read and write I/O to achieve intra-queue long-term exclusivity, delivers read and write requests to other RNoT queue pairs to achieve inter-queue I/O scheduling, and transfers request commands and data with targeted approaches to achieve short- and long-flow optimization. We implemented RNoT in the Linux kernel and evaluated it using realistic benchmarks and applications. Our experimental results demonstrate that RNoT achieves 30.39% and 29.27% lower latency than i10 and NoT, respectively, and increases IOPS over NoT by up to 41.34% on average. RNoT can thus effectively optimize read and write I/O performance in NoT with a dedicated processing scheme.
ISBN: 0780312813 (print)
In this paper, we extend the eigenfilter approach to solve general least-squares approximation problems with linear constraints. This extension unifies previous work on eigenfilters and many other filter design problems, including spectral/spatial filtering, one-dimensional or multidimensional filters, and data-independent or statistically optimal filtering. With this approach, various filter design problems are transformed into the problem of finding an eigenvector of a positive definite matrix that is determined by the filter design specifications. The approach has the advantage that many filter design constraints can be incorporated easily. A number of design examples are presented to show the usefulness and flexibility of the proposed approach.
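The reduction to an eigenvector problem can be sketched in a few lines. The block below is a generic textbook formulation, not the paper's own derivation: the symbols h (filter coefficients), P (the positive definite matrix built from the specification), C (a matrix of homogeneous linear constraints), and N (an orthonormal basis of the null space of C) are assumptions introduced here for illustration.

```latex
\begin{align}
% Quadratic error measure determined by the design specification
E(\mathbf{h}) &= \mathbf{h}^{T}\mathbf{P}\,\mathbf{h},
  \qquad \mathbf{P} = \mathbf{P}^{T} \succ 0 \\
% A unit-norm constraint excludes the trivial solution h = 0
\min_{\mathbf{h}} \; & \mathbf{h}^{T}\mathbf{P}\,\mathbf{h}
  \quad \text{subject to} \quad \mathbf{h}^{T}\mathbf{h} = 1 \\
% Stationarity of the Lagrangian h^T P h - \lambda (h^T h - 1)
\mathbf{P}\,\mathbf{h} &= \lambda\,\mathbf{h},
  \qquad E(\mathbf{h}) = \lambda \\
% Assumed constraint form C h = 0: substitute h = N a with N^T N = I
\min_{\mathbf{a}} \; & \mathbf{a}^{T}\left(\mathbf{N}^{T}\mathbf{P}\,\mathbf{N}\right)\mathbf{a}
  \quad \text{subject to} \quad \mathbf{a}^{T}\mathbf{a} = 1
\end{align}
```

Under these assumptions the error equals the eigenvalue, so the minimizer is the eigenvector belonging to the smallest eigenvalue of P, or of the reduced matrix when the constraints are present.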
The unidirectional nature of propagation and predictable delays are two characteristics of optically pipelined buses that have made them popular in recent years. Many models have been proposed that use reconfigurable ...
ISBN: 0769523129 (print)
Architectures based on networked servers have become a de facto standard for distributed Virtual Environment (DVE) systems. These systems allow a large number of remote users to share a single 3D virtual scene. To provide quality of service in a DVE system, clients should be assigned to servers taking into account both system throughput and system latency. This highly complex problem is known as the quality of service (QoS) problem. This paper proposes an elitist sexual genetic algorithm for solving the QoS problem in DVE systems. Performance evaluation results show that, owing to its ability both to find good search paths and to maintain diversity and thereby escape local minima, this nature-inspired technique can provide significantly better solutions than other heuristic methods, with shorter execution times. Therefore, the proposed implementation of the GA search method can improve the QoS offered by DVE systems.
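To make the idea of evolving client-to-server assignments concrete, here is a minimal sketch of a plain elitist genetic algorithm. It is not the paper's method: it leaves out the sexual (two-gender) selection scheme, and the `cost` function is a hypothetical stand-in that only measures load imbalance, whereas the abstract refers to throughput and latency. All names and parameters are illustrative assumptions.

```python
import random


def cost(assignment, n_servers):
    """Hypothetical cost: gap between the most and least loaded server."""
    loads = [0] * n_servers
    for server in assignment:
        loads[server] += 1
    return max(loads) - min(loads)


def elitist_ga(n_clients, n_servers, pop_size=30, generations=60,
               elite=2, mutation_rate=0.05, seed=0):
    """Evolve assignments (list index = client, value = server)."""
    rng = random.Random(seed)  # seeded so runs are reproducible

    def fitness(a):
        return cost(a, n_servers)

    population = [[rng.randrange(n_servers) for _ in range(n_clients)]
                  for _ in range(pop_size)]
    history = []  # best cost seen at the start of each generation
    for _ in range(generations):
        population.sort(key=fitness)
        history.append(fitness(population[0]))
        # Elitism: the best individuals survive unchanged.
        next_gen = [ind[:] for ind in population[:elite]]
        while len(next_gen) < pop_size:
            # Tournament selection of two parents.
            p1 = min(rng.sample(population, 3), key=fitness)
            p2 = min(rng.sample(population, 3), key=fitness)
            # One-point crossover followed by per-gene mutation.
            cut = rng.randrange(1, n_clients)
            child = p1[:cut] + p2[cut:]
            child = [rng.randrange(n_servers)
                     if rng.random() < mutation_rate else gene
                     for gene in child]
            next_gen.append(child)
        population = next_gen
    best = min(population, key=fitness)
    history.append(fitness(best))
    return best, history


if __name__ == "__main__":
    best, history = elitist_ga(n_clients=40, n_servers=4)
    print("best cost:", history[-1])
```

Because the elite individuals are copied unchanged into each new generation, the best cost recorded in `history` can never get worse from one generation to the next; that is the property elitism contributes.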
ISBN: 9781479981113 (print)
The Message Passing Interface (MPI) plays an important role in parallel computing, and many parallel applications are implemented as MPI programs. Existing bug detection methods for MPI programs fall short of providing both input coverage and non-determinism coverage, which leads to missed bugs. In this paper, we employ symbolic execution to ensure input coverage, and propose an on-the-fly scheduling algorithm that reduces the interleavings explored for non-determinism coverage while ensuring soundness and completeness. We have implemented our approach as a tool, called MPISE, which can automatically detect deadlocks and runtime bugs in MPI programs. The results of experiments on benchmark programs and real-world MPI programs indicate that MPISE finds bugs effectively and efficiently. In addition, our tool provides diagnostic information and a replay mechanism to help users understand bugs.
ISBN: 0769521320 (print)
The major issue in cluster and grid computing today is efficient resource management. Evaluating scheduling strategies is hard because it is difficult to generate jobs under realistic scenarios. This is true for rigid jobs (where the number of processors is fixed) and even more so for moldable ones. This paper presents an approach to generating realistic workloads for these kinds of jobs. The model we propose is based on the analysis of one year of utilization of the I-cluster, a 225-processor cluster. From this log we extract a typical load for this kind of parallel machine and introduce a way to generate realistic synthetic workloads automatically. This work was done in order to test scheduling strategies that take into account both rigid and moldable jobs, so the workload generator can handle moldable jobs.
ISBN: 9780769530499 (print)
Message passing and distributed shared memory are the two major parallel programming paradigms on clusters. However, these two models have high programming complexity, produce less maintainable parallel code, and are not suitable for multi-core multiprocessor clusters. While object-oriented programming is dominant in serial programming, it has not been well exploited in parallel programming. In this paper we propose an innovative automatic parallelization framework that employs past experience to parallelize serial programs and outputs the parallel code in the form of objects. Supported by a data-driven runtime environment, each parallel task is managed as a thread, exploiting the multiple processing cores on a cluster node. Based on this framework, we have implemented a proof-of-concept parallelizer called PJava to parallelize Java code. The performance benefit of the framework is evaluated through case studies that compare the execution time of the automatically generated PJava code with that of handcrafted JOPI (a Java dialect of MPI) code.
Most statistical software packages implement a broad range of techniques but do so in an ad hoc fashion, leaving users who do not have a broad knowledge of statistics at a disadvantage since they may not understand al...
ISBN: 9781450305525 (print)
I/O data access is a recognized performance bottleneck in high-end computing. Several commercial and research parallel file systems have been developed in recent years to ease this bottleneck. These advanced file systems perform well on some applications but may not perform well on others, and they have not reached their full potential in mitigating the I/O-wall problem. Data access is application dependent. Based on the application-specific optimization principle, in this study we propose a cost-intelligent data access strategy to improve the performance of parallel file systems. We first present a novel model to estimate the data access cost of different data layout policies. Next, we extend the cost model to calculate the overall I/O cost of any given application and choose an appropriate layout policy for that application. A complex application may consist of different data access patterns, and averaging those patterns may not be the best solution for complex applications that do not have a dominant pattern. We therefore further propose a hybrid data replication strategy for such applications, so that a file can have replicas with different layout policies for the best performance. Theoretical analysis and experimental testing have been conducted to verify the proposed cost-intelligent layout approach. Analytical and experimental results show that the proposed cost model is effective and that the application-specific data layout approach achieves up to 74% performance improvement for data-intensive applications.
ISBN: 9781467349529; 9781467349512 (print)
This paper presents a programmable many-core platform containing 64 cores routed in a hierarchical network for biomedical signal processing applications. Individual core processors are based on a RISC architecture with DSP enhancement blocks. Given the number of conditional program loops in DSP applications such as the FFT, additional hardware blocks are added that operate in parallel with each core processor. The two blocks calculate the FFT input addresses and determine whether a conditional loop is necessary. Performing these operations in parallel with the main processor greatly reduces the time to completion for a DSP application. Each processor is implemented in 65 nm CMOS using standard cell libraries. The 64-core platform occupies 19.51 mm² and runs at 1.18 GHz at 1 V. For demonstration, electroencephalogram (EEG) seizure detection and analysis and ultrasound spectral Doppler are mapped onto the cores. The seizure detection and analysis algorithm utilizes 60 processors and takes 890 ns to execute. Spectral Doppler utilizes 29 processors and takes 715 ns to run.
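The abstract does not say how the FFT input addresses are computed, so the following is only a software illustration of the kind of bookkeeping such a hardware block could take off the main processor. It assumes an in-place radix-2 decimation-in-time FFT, which may differ from the platform's actual algorithm; the function name `butterfly_addresses` is hypothetical.

```python
def butterfly_addresses(n):
    """Yield (stage, top, bottom) input addresses for each butterfly of an
    in-place radix-2 decimation-in-time FFT of size n (assumed layout)."""
    if n < 2 or n & (n - 1):
        raise ValueError("n must be a power of two and at least 2")
    half = 1   # distance between the two inputs of a butterfly
    stage = 0
    while half < n:
        # Each group of 2 * half consecutive addresses holds `half` butterflies.
        for start in range(0, n, 2 * half):
            for k in range(half):
                yield stage, start + k, start + k + half
        half *= 2
        stage += 1


if __name__ == "__main__":
    for stage, top, bottom in butterfly_addresses(8):
        print(stage, top, bottom)
```

In software these nested loops and their exit tests run on the same processor as the butterfly arithmetic; producing the address pairs and loop decisions in a separate unit is the sort of overlap the abstract describes.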