Because they exploit the tolerance for imprecision and uncertainty in real-world problems, and because of their robustness and parallelism, artificial neural networks (ANNs) and related techniques have become increasingly important for modeling and optimization in many areas of science and engineering. As a consequence, the market is flooded with new, increasingly technical software and hardware products. This paper presents an analytical overview of the most popular ANNs, in both hardware and software implementations. After an overview of ANNs, the paper discusses global optimization for ANN training, then presents the NOVEL hybrid method and discusses its performance. The paper then discusses the techniques and means for parallelizing neurosimulations of ANNs, both at a high programming level and at a low hardware-emulation level. It then presents vector microprocessor architectures and the Spert II fixed-point system as applied to multimedia and human-machine interfaces. Finally, it introduces the recently explored concept of cellular neural networks (CNNs) and analyzes their performance and operation. The paper closes with conclusions and recommendations.
The performance of parallel sorting is not well understood on hardware cache-coherent shared address space (CC-SAS) multiprocessors, which increasingly dominate the market for tightly-coupled multiprocessing. We study two high-performance parallel sorting algorithms, radix and sample sorting, under three major programming models (load-store CC-SAS, message passing, and the segmented SHMEM model) on a 64-processor SGI Origin2000. We observe surprisingly good speedups on this demanding application. The performance of radix sort is greatly affected by the programming model and the particular implementation used. Sample sort exhibits more uniform performance across programming models on this platform, but it is usually not as good as the best radix sort for larger data sets when each algorithm is allowed to use its best programming model. The best combination of algorithm and programming model is radix sorting under the SHMEM model for larger data sets and sample sorting under CC-SAS for smaller data sets.
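For readers unfamiliar with radix sort, the sketch below shows the three phases that each pass performs: a digit histogram, a prefix sum, and a stable permutation. It is a sequential illustration only, not the implementation evaluated in the paper; the function name `radix_sort` and the parameters `radix_bits` and `key_bits` are assumptions introduced here. In parallel versions the histogram is typically computed per processor and the permutation is the communication-heavy step, which is one plausible reason the programming model matters so much for this algorithm.

```python
from itertools import accumulate


def radix_sort(keys, radix_bits=8, key_bits=32):
    """Sequential LSD radix sort for non-negative integers below 2**key_bits.

    Hypothetical illustration: names and defaults are not from the paper.
    """
    mask = (1 << radix_bits) - 1
    for shift in range(0, key_bits, radix_bits):
        # Phase 1: histogram of the current digit.
        counts = [0] * (mask + 1)
        for k in keys:
            counts[(k >> shift) & mask] += 1

        # Phase 2: exclusive prefix sum gives each digit's first output slot.
        offsets = [0] + list(accumulate(counts))[:-1]

        # Phase 3: stable permutation of the keys into their output slots.
        out = [0] * len(keys)
        for k in keys:
            d = (k >> shift) & mask
            out[offsets[d]] = k
            offsets[d] += 1
        keys = out
    return keys
```

Calling `radix_sort([170, 45, 75, 90, 802, 24, 2, 66])` returns the keys in ascending order after four 8-bit passes.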
MATmarks is an extension of the MATLAB tool that enables shared memory programming on a network of workstations by adding a small set of commands. We present a high-level overview of the MATmarks system, the commands we added to MATLAB, and the performance gains we achieved as a result.
With the advent of Gigabit Ethernet network technology, a new contender for the next-generation cluster interconnect is on the horizon. We benchmark Gigabit Ethernet and compare it to other cluster interconnects, namely Fast Ethernet, Myrinet and Scalable Coherent Interface. For a meaningful comparison, benchmark experiments are carried out at two levels: TCP/IP networking and MPI parallel programming. Using high-end PCs (Pentium III 450 MHz) and standard system software (Linux and MPICH), our results show that with Gigabit Ethernet, end-to-end throughputs of up to 44 MBytes per second can be achieved. A change from Fast Ethernet to Gigabit Ethernet resulted in performance gains of up to a factor of 5.05 for ftp, up to a factor of 3.10 for MPI point-to-point communication and up to a factor of 2.58 for some NAS parallel benchmarks, depending on communication granularity and message size. Further performance gains seem to require improved protocol processing and node hardware (memory bandwidth, file system, PCI bus).
We describe the implementation of a distributed garbage collector for a group of object-oriented databases. We start by considering the issues that led to the choice of algorithm and why garbage collection in a database is more difficult than in memory. We describe the algorithm and how it was implemented in Eiffel, using PVM (Parallel Virtual Machine) and the Versant ODBMS.
Run-time systems are critical to the implementation of concurrent object-oriented programming languages. The paper describes a concurrent object-oriented programming language, Balinda C++, running on a distributed memory system, and its run-time implementation. The run-time system is built on top of the Nexus communication library. The tuplespace is the key to Balinda C++. A distributed tuplespace model is presented to improve data locality. Experiments were carried out to verify the model, and the results indicate that it is effective at improving system performance.
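To make the tuplespace idea concrete, here is a minimal single-process sketch in the style of Linda, the coordination model in which tuplespaces originated. It is not the Balinda C++ API and not the distributed model the paper proposes; the class name `TupleSpace`, the method names `out`, `rd` and `in_`, and the use of `None` as a wildcard are all assumptions made for illustration.

```python
import threading


class TupleSpace:
    """Toy Linda-style tuple space shared by threads in one process."""

    def __init__(self):
        self._tuples = []
        self._cond = threading.Condition()

    def out(self, *tup):
        """Deposit a tuple and wake any waiting readers."""
        with self._cond:
            self._tuples.append(tup)
            self._cond.notify_all()

    @staticmethod
    def _matches(pattern, tup):
        # None in the pattern acts as a wildcard for that field.
        return len(pattern) == len(tup) and all(
            p is None or p == t for p, t in zip(pattern, tup)
        )

    def _find(self, pattern):
        for tup in self._tuples:
            if self._matches(pattern, tup):
                return tup
        return None

    def rd(self, *pattern, timeout=None):
        """Block until a matching tuple exists, then return a copy of it."""
        with self._cond:
            ok = self._cond.wait_for(
                lambda: self._find(pattern) is not None, timeout
            )
            if not ok:
                raise TimeoutError("no matching tuple")
            return self._find(pattern)

    def in_(self, *pattern, timeout=None):
        """Like rd, but also remove the matching tuple from the space."""
        with self._cond:
            ok = self._cond.wait_for(
                lambda: self._find(pattern) is not None, timeout
            )
            if not ok:
                raise TimeoutError("no matching tuple")
            tup = self._find(pattern)
            self._tuples.remove(tup)
            return tup
```

A producer would call `ts.out("job", 1)` and a consumer `ts.in_("job", None)`. In a distributed setting, where each tuple is stored determines how far a matching `rd` or `in_` has to travel, which is the data-locality concern the abstract refers to.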
Parallel simulation has the potential to accelerate the execution of simulation applications. However, developing a parallel discrete-event simulation from scratch requires in-depth knowledge of the mapping process from the physical model to the simulation model, and substantial effort in optimising performance. This paper presents an overview of the SPaDES (Structured Parallel Discrete-Event Simulation) parallel simulation framework. We focus on the performance analysis of SPaDES/C++, an implementation of SPaDES on a distributed-memory Fujitsu AP3000 parallel computer. SPaDES/C++ hides the underlying complex parallel simulation synchronization and parallel programming details from the simulationist. Our empirical results show that the SPaDES framework can deliver good speedup if the process granularity is properly optimised.
We present a new mechanism-oriented memory model called Commit-Reconcile & Fences (CRF) and define it using algebraic rules. Many existing memory models can be described as restricted versions of CRF. The model has been designed so that it is both easy for architects to implement and stable enough to serve as a target machine interface for compilers of high-level languages. The CRF model exposes a semantic notion of caches (saches), and decomposes load and store instructions into finer-grain operations. We sketch how to integrate CRF into modern microprocessors and outline an adaptive coherence protocol to implement CRF in distributed shared-memory systems. CRF offers an upward-compatible way to design next-generation computer systems.
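The decomposition of loads and stores can be illustrated with a toy model. The sketch below reflects one reading of the commit and reconcile operations named in the abstract: a local store only dirties the processor's sache, a commit writes dirty data back to memory, and a reconcile discards a clean (possibly stale) copy so that the next local load refetches it. The class `Sache`, the method names `loadl` and `storel`, and the Clean/Dirty encoding are assumptions of this sketch; the abstract does not give the model's rules, and fences and background cache actions are omitted entirely.

```python
class Sache:
    """Toy per-processor semantic cache over a shared memory dict."""

    def __init__(self, memory):
        self.memory = memory  # shared: address -> value
        self.cells = {}       # local: address -> (value, "Clean" | "Dirty")

    def loadl(self, addr):
        """Local load: read the sache, fetching from memory on a miss."""
        if addr not in self.cells:
            self.cells[addr] = (self.memory[addr], "Clean")
        return self.cells[addr][0]

    def storel(self, addr, value):
        """Local store: update the sache only and mark the cell dirty."""
        self.cells[addr] = (value, "Dirty")

    def commit(self, addr):
        """Make a dirty cell visible by writing it back to memory."""
        if addr in self.cells and self.cells[addr][1] == "Dirty":
            value = self.cells[addr][0]
            self.memory[addr] = value
            self.cells[addr] = (value, "Clean")

    def reconcile(self, addr):
        """Drop a clean copy so the next loadl sees current memory."""
        if addr in self.cells and self.cells[addr][1] == "Clean":
            del self.cells[addr]

    # Conventional instructions as compositions of the finer-grain ones.
    def load(self, addr):
        self.reconcile(addr)
        return self.loadl(addr)

    def store(self, addr, value):
        self.storel(addr, value)
        self.commit(addr)
```

With two `Sache` objects sharing one memory dict, a value written by `storel` on one processor stays invisible to the other until the writer commits and the reader reconciles, which is the kind of reordering freedom a weaker memory model would exploit.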
The IA-64 architecture provides new opportunities and challenges for implementing an improved set of transcendental functions. Using several novel polynomial-based table-driven techniques, we are able to provide new algorithms for the transcendental functions. Major improvements include an accuracy level of about 0.6 ulps (units in the last place) and forward trigonometric functions that have a period of 2π. The accuracy enhancements are achieved at improved speed, yet without an increase in the table size. In this paper, we highlight the key IA-64 architectural features that influenced our designs, and explain the main ideas used in our new algorithms.
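As background, the general shape of a table-driven polynomial method can be sketched for the exponential function: reduce the argument to a small remainder, look up a precomputed power of two, and evaluate a short polynomial on the remainder. This is the generic textbook scheme, not the paper's IA-64 algorithms, and it makes no claim to 0.6 ulp accuracy. The function name `exp_table`, the 32-entry table, and the degree-6 Taylor polynomial are assumptions chosen for illustration; production code would also split ln 2 into high and low parts to keep the reduction accurate for large arguments.

```python
import math

TABLE_BITS = 5
N = 1 << TABLE_BITS                         # 32 table entries (assumed size)
TABLE = [2.0 ** (j / N) for j in range(N)]  # TABLE[j] = 2**(j/N)
LN2_OVER_N = math.log(2.0) / N


def exp_table(x):
    """Approximate exp(x) as 2**k * TABLE[j] * p(r).

    Illustrative only: assumes moderate |x| and ignores overflow,
    underflow, NaN and infinity handling.
    """
    # Range reduction: x = m * (ln 2 / N) + r with |r| <= ln 2 / (2N).
    m = round(x / LN2_OVER_N)
    r = x - m * LN2_OVER_N

    # Split m = k * N + j with 0 <= j < N, so exp(x) = 2**k * 2**(j/N) * exp(r).
    k, j = divmod(m, N)

    # Degree-6 Taylor polynomial for exp(r), evaluated by Horner's rule.
    p = 1.0 + r * (1.0 + r * (0.5 + r * (1 / 6 + r * (1 / 24
        + r * (1 / 120 + r / 720)))))

    # Scale by 2**k exactly with ldexp.
    return math.ldexp(TABLE[j] * p, k)
```

The table keeps the remainder small enough that a low-degree polynomial suffices; enlarging the table shortens the polynomial and vice versa, which is the trade-off behind the abstract's remark about table size.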