ISBN: (Print) 9781509003327
University computer labs face a real challenge: their fixed infrastructure must keep pace with the growing demand for computational resources. The complexity of the computer architecture stems from the large number of complex software tools, virtual machines, and different operating systems, and from the wide range of classes with laboratory lessons, especially for computer engineering students. One solution would be a private, university-owned cloud, but not all universities can afford a modern thin-client solution. This paper presents an inexpensive intermediate step toward working with distributed resources in computer laboratories: a remote-boot system using Clonezilla and Diskless Remote Boot in Linux (DRBL), implemented and tested in a demo system.
ISBN: (Print) 9781728195865
Distributed training of Deep Neural Networks (DNNs) on high-performance computing (HPC) systems is becoming increasingly common. HPC systems dedicated entirely or mainly to Deep Learning (DL) workloads are becoming a reality. The collective communication overhead of averaging weight gradients, e.g., an Allreduce operation, is one of the main factors limiting the scaling of data-parallel training. Several active efforts across different layers of the training stack, including training algorithms, parallelism strategies, communication algorithms, and system design, have been proposed to cope with this communication challenge when scaling distributed training of DNNs. However, even with those methods, communication still becomes a bottleneck with the steady increase in model sizes, e.g., 100-10,000s of MB, and in the number of compute nodes, e.g., 1,000-10,000s of GPUs. In this work, we investigate the benefits of co-designing Allreduce algorithms and the network system. We propose to replace the Fat-tree network topology with a variant of the Distributed Loop Network topology that guarantees a fixed routing-path length between any pair of compute nodes for the communication pattern of the halving-doubling Allreduce algorithm. We also propose a technique to eliminate or mitigate network contention.
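The halving-doubling Allreduce pattern this abstract targets can be sketched as a single-process simulation. This is a minimal illustration of the communication pattern only, not the authors' implementation: it assumes the node count is a power of two and the vector length is divisible by it, and real systems exchange messages (e.g., over MPI) instead of sharing lists.

```python
# Single-process simulation of halving-doubling Allreduce.
# Each "node" holds a vector; afterwards every node holds the elementwise sum.
# Illustrative sketch only -- assumptions: p is a power of two, p divides n.

def halving_doubling_allreduce(vectors):
    p = len(vectors)                      # node count, assumed a power of two
    n = len(vectors[0])                   # vector length, assumed divisible by p
    data = [list(v) for v in vectors]
    owned = [(0, n)] * p                  # (offset, count) each node is reducing

    # Phase 1: reduce-scatter by recursive halving (distances p/2, p/4, ..., 1).
    dist = p // 2
    while dist >= 1:
        for rank in range(p):
            partner = rank ^ dist
            if rank < partner:            # handle each pair once
                off, cnt = owned[rank]    # partners own the same segment
                half = cnt // 2
                # rank keeps the lower half, partner the upper half;
                # each adds the other's contribution on the half it keeps.
                for i in range(off, off + half):
                    data[rank][i] += data[partner][i]
                for i in range(off + half, off + cnt):
                    data[partner][i] += data[rank][i]
                owned[rank] = (off, half)
                owned[partner] = (off + half, cnt - half)
        dist //= 2

    # Phase 2: allgather by recursive doubling (distances 1, 2, ..., p/2).
    dist = 1
    while dist < p:
        for rank in range(p):
            partner = rank ^ dist
            if rank < partner:
                ro, rc = owned[rank]
                po, pc = owned[partner]
                # Exchange the fully reduced segments and merge ownership.
                data[rank][po:po + pc] = data[partner][po:po + pc]
                data[partner][ro:ro + rc] = data[rank][ro:ro + rc]
                owned[rank] = owned[partner] = (min(ro, po), rc + pc)
        dist *= 2
    return data
```

At every step each rank talks to the partner at address `rank ^ dist`, so a node performs log2(p) exchanges per phase; that fixed `rank XOR distance` pattern is exactly what a topology with guaranteed routing-path length can serve without contention.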
In this paper, we present a new fault tolerance system called DejaVu for transparent and automatic checkpointing, migration, and recovery of parallel and distributed applications. DejaVu provides a transparent paralle...
ISBN: (Print) 9780769549699; 9781467360050
The data-intensive applications that will shape computing in the coming decades require scalable architectures that incorporate scalable data and compute resources and can support random requests to unstructured (e.g., logs) and semi-structured (e.g., large graph, XML) data sets. To explore the suitability of FPGAs for these computations, we are constructing an FPGA-based system with a memory capacity of 512 GB from a collection of 32 Virtex-5 FPGAs spread across 8 enclosures. This paper describes our work in exploring alternative interconnect technologies and network topologies for FPGA-based clusters. The diverse interconnects combine inter-enclosure high-speed serial links and wide, single-ended intra-enclosure on-board traces with network topologies that balance network diameter, network throughput, and FPGA resource usage. We discuss the architecture of high-radix routers in FPGAs that optimize for the asymmetry between the inter- and intra-enclosure links. We analyze the various interconnects that aim to efficiently utilize the prototype's total switching capacity of 2.43 Tb/s. The networks we present have aggregate throughputs up to 51.4 GB/s for random traffic, diameters as low as 845 nanoseconds, and consume less than 12% of the FPGAs' logic resources.
iWARP is a set of standards enabling Remote Direct Memory Access (RDMA) over Ethernet. iWARP, which supports RDMA and OS bypass, coupled with TCP/IP Offload Engines, can fully eliminate host-CPU involvement in an Ether...
We describe a shared Simon Fraser University (West-Grid) and Dalhousie (ACEnet) seminar series which is now two years old, and is gradually expanding to include other Canadian universities. More generally we discuss c...
This paper presents CRAC, an environment dedicated to design efficient asynchronous iterative algorithms for a grid architecture. Those algorithms are particularly suited for grid architecture since they naturally all...
The Coarse-Grained Reconfigurable Architecture (CGRA) is considered one of the most promising candidates for big data applications, as it provides significant throughput improvement and high energy efficiency. Unli...
High-radix switches are desirable building blocks for large computer interconnection networks because they are more suitable for converting chip I/O bandwidth into low latency and low cost than low-radix switches [10]. U...
One of the fundamental problems in parallel computing is how to efficiently perform routing in a faulty network, each component of which fails with some probability. This paper presents a comparative performance study ...
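The setting this abstract describes, routing when each component fails independently with some probability, can be illustrated with a small Monte Carlo sketch on a hypercube. The topology, function names, and parameters here are assumptions for illustration, not taken from the paper.

```python
# Monte Carlo sketch: probability that any route survives between two nodes of
# a hypercube in which every intermediate node fails independently with
# probability fail_prob. Illustrative assumptions: hypercube topology and the
# antipodal source/destination pair are ours, not the paper's.
import random
from collections import deque

def hypercube_neighbors(v, dim):
    # Hypercube neighbors differ in exactly one address bit.
    return [v ^ (1 << b) for b in range(dim)]

def path_survives(dim, fail_prob, src, dst, rng):
    # Fail each node except the endpoints independently.
    alive = [True] * (1 << dim)
    for v in range(1 << dim):
        if v not in (src, dst) and rng.random() < fail_prob:
            alive[v] = False
    # BFS over surviving nodes: does any fault-free route remain?
    seen = {src}
    queue = deque([src])
    while queue:
        v = queue.popleft()
        if v == dst:
            return True
        for w in hypercube_neighbors(v, dim):
            if alive[w] and w not in seen:
                seen.add(w)
                queue.append(w)
    return False

def estimate_reliability(dim, fail_prob, trials=2000, seed=1):
    rng = random.Random(seed)
    src, dst = 0, (1 << dim) - 1          # antipodal pair: worst-case distance
    hits = sum(path_survives(dim, fail_prob, src, dst, rng)
               for _ in range(trials))
    return hits / trials
```

Running `estimate_reliability` over a range of failure probabilities (or over different topologies plugged in for `hypercube_neighbors`) is the kind of comparison such a performance study makes.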