Artificial neural networks are highly parallel structures inspired by the human brain. They have been used successfully in many human-like applications, such as pattern recognition. The performance of these networks can be enhanced when they are used in conjunction with equally powerful mathematical tools. In this paper, we use the discrete wavelet transform as a pre-processing tool for two well-known neural classifiers: competitive layer networks and learning vector networks. The wavelet transform was used successfully to approximate the input patterns of the two classifiers, which considerably reduced their input-layer requirements. Such a reduction facilitates cost-effective hardware implementations of artificial neural networks.
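As a rough illustration of how a wavelet approximation can shrink a classifier's input layer, the sketch below keeps only the approximation coefficients of a Haar transform. The choice of the Haar wavelet, the padding rule, and the names haar_approximation, pattern and reduced are assumptions made for this example; the abstract does not say which wavelet family or decomposition depth the authors used.

```python
import math

def haar_approximation(signal, levels=1):
    """Return the Haar approximation coefficients after `levels` decompositions.

    Each level halves the length of the pattern, so a classifier fed with the
    result needs proportionally fewer input units.
    """
    approx = list(signal)
    for _ in range(levels):
        if len(approx) % 2:
            # Assumption: pad odd-length inputs by repeating the last sample.
            approx.append(approx[-1])
        # Haar low-pass filter: scaled pairwise sums.
        approx = [(approx[i] + approx[i + 1]) / math.sqrt(2)
                  for i in range(0, len(approx), 2)]
    return approx

pattern = [4.0, 6.0, 10.0, 12.0, 8.0, 6.0, 5.0, 5.0]  # hypothetical input pattern
reduced = haar_approximation(pattern, levels=2)
print(len(pattern), "->", len(reduced))  # prints: 8 -> 2
```

Two decomposition levels cut the eight-sample pattern to two coefficients, so the input layer in this toy case would shrink by a factor of four.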
Clusters built from single-core systems are a cost-effective way to improve performance and availability. However, hardware constraints limit the performance of single-core systems, so it is difficult to meet the increasing high-performance requirements of diversified applications at different levels of general-purpose computing. A promising solution is the novel multi-core system, which extends parallelism to the CPU level by integrating multiple processing units on a single die. This paper uses the finite-difference time-domain (FDTD) algorithm as a case study, designing suitable parallel FDTD algorithms for three architectures: distributed-memory machines with single-core processors, shared-memory machines with dual-core processors, and the Cell Broadband Engine (Cell/B.E.) processor with nine heterogeneous cores. The experimental results show that, at the processor level, the Cell/B.E. processor using 8 SPEs achieves speedups of 7.05 over an AMD single-core Opteron processor and 3.37 over an AMD dual-core Opteron processor.
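For readers unfamiliar with FDTD, the following is a minimal serial sketch of a one-dimensional update loop in normalized units. It is not the parallel algorithm from the paper: the grid size, Courant number, Gaussian source, and the name fdtd_1d are all assumptions chosen for illustration. Within each half-step every cell depends only on the other field's current values, which is the property a domain-decomposed parallel version can exploit.

```python
import math

def fdtd_1d(n_cells=200, n_steps=60, courant=0.5):
    """Leapfrog update of normalized Ez and Hy fields on a 1D grid."""
    ez = [0.0] * n_cells
    hy = [0.0] * (n_cells - 1)
    src = n_cells // 2
    for t in range(n_steps):
        # Magnetic field update from the spatial difference of Ez.
        for k in range(n_cells - 1):
            hy[k] += courant * (ez[k + 1] - ez[k])
        # Electric field update; the two end cells stay at zero
        # (perfect electric conductor boundaries, an assumption).
        for k in range(1, n_cells - 1):
            ez[k] += courant * (hy[k] - hy[k - 1])
        # Soft Gaussian source injected at the middle of the grid.
        ez[src] += math.exp(-((t - 30) / 10.0) ** 2)
    return ez

field = fdtd_1d()
print(max(abs(v) for v in field))
```

In a distributed-memory version, each process would own a contiguous slice of ez and hy and exchange only the boundary cells with its neighbours after each half-step.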
This paper presents a novel architectural solution to the problem of scalable routing in very large sensor networks. We develop a routing solution, off-network control processing (ONCP), that achieves control scalability in large sensor networks by shifting a certain amount of the routing functions to an "off-network" server. A tiered, hybrid routing approach, consisting of "coarse-grain" global routing and distributed "fine-grain" local routing, is proposed to achieve scalability by avoiding network-wide dissemination of control messages. We present the ONCP architectural concepts and analytically characterize its performance in relation to both flat and hierarchical sensor routing architectures. We also show ns2-based experimental results indicating that, for large sensor networks with realistic data models, the packet drop, latency, and energy performance of ONCP can be significantly better than that of flat and cluster-based protocols.
A ubiquitous processor, HCgorilla, followed the Java CPU for multimedia processing and had built-in RNGs (random number generators) for cipher processing. HCgorilla therefore had an execution stage composed of several units for this sophisticated processing. Because the units of the execution stage were physically separate, each function had a different latency. This required instruction scheduling similar to that of regular superscalar processors. In this paper, we describe an improvement to HCgorilla that solves this issue. Specifically, the execution stage, composed of arithmetic units, is wave-pipelined as a whole. This completely merges the parallel structure, with no physical separation. The waved multifunctional execution unit is effective in realizing wide-range dynamic ILP (instruction-level parallelism) at a rate higher than regular superscalar processors.
ISBN: 9781509030217 (print)
In the last several years, GPU devices have started to evolve into supercomputers. New non-graphics features are rapidly appearing, along with new, more general programming languages. One reason for the quick pace of change is that games and hardware evolve together: hardware vendors review the most popular games, looking for places to add hardware, while game developers review new hardware, looking for places to add more realism. Today, we see both GPU devices and games moving from a model of "looks real" to one of "acts real". One consequence of "acts real" is that evaluating physics, simulations, and artificial intelligence on a GPU is becoming an element of future game programs. We will review the difference between a CPU and a GPU. Then we will describe hardware changes added to the current generation of AMD graphics processors, including the introduction of traditional compute operations such as double precision, scatter/gather, and local memory. Along with new features, we have added new metrics like performance/watt and performance/dollar. The current AMD GPU processor delivers 9 gigaflops/watt and 5 gigaflops/dollar. For the last two generations, each AMD GPU has provided double the performance/watt of the prior machine. We believe the software community needs to become more aware and appreciate these *** this has been a kind of co-evolution and not a process of radical change, current GPU devices have retained a number of odd-sounding transitional features, including fixed functions like memory systems that can do filtering, depth buffers, a rasterizer, and the like. Today, each of these remains because it is important for graphics performance. Software on GPU devices also shows transitional features. As AI/physics virtual reality starts to become important, development frameworks have started to shift. Graphics APIs have added compute shaders. Finally, there has been a set of transitional programs implemented by graphics programmers but whose only re
Summary form only given. Pathologists and cancer biologists rely on tissue and cellular analysis to study cancer expression, genetic profiles, and cellular morphology in order to understand the underlying basis for a disease and to grade the level of disease progression. Conventional analysis of tissue histology and sample cytology includes the following steps: examining the stained tissue or cell smear under a microscope; scoring the expression relative to the most highly expressing (densely stained) area on a predefined scale for normal, cancer, and stromal regions based on the morphology of the tissue; estimating the percentage area of cancer tissue relative to normal tissue and stroma; and multiplying the score by the percentage area of the cancer region and converting it to another predefined scale for statistical analyses. Most of this analysis is done manually or with limited tools to aid the scoring process. Over the last 5 years, automated and semi-automated microscope slide scanners have become available in the marketplace. These scanners rely on sophisticated microscopes and allow for the digitization of the entire sample at varying magnifications. This has led to the emergence of digital pathology and a growing amount of image data. Each digitized sample is typically on the order of 2.7 GB to 10 GB in size, depending on the magnification of the digitizing system, with an image size of 30,000 × 30,000 pixels or larger. Further, current software and methods for automated scoring of tissue are very limited. This has led to an increased interest in identifying novel solutions for automated histology and cytology analysis. In order to achieve high computational accuracy with reasonable turnaround times, novel approaches from the data and resource management perspective are also required to handle the image sizes outlined above. Two developments in computer industry make the current generation of scientists more likely to solve the performance challenges associated with the large ima
Author: El Baz, D. (CNRS, LAAS, 7 Ave Colonel Roche, F-31077 Toulouse 4, France)
ISBN: 9780769527840 (print)
The implementation of parallel asynchronous iterative algorithms on message passing architectures is considered. Several issues related to communication via message passing interfaces or libraries such as MPI-1, MPI-2, PVM, or SHMEM are discussed in this survey paper. Practical implementations are proposed.
ISBN: 9783540729044 (print)
This paper presents a parallel architecture that can simultaneously perform block-matching motion estimation (ME) and the discrete cosine transform (DCT). Because DCT and ME are both processed block by block, it is preferable to put them in one module for resource sharing. Simulation results obtained using Simulink demonstrate that the parallel architecture improves performance, in terms of running time, by 18.6% compared with the conventional sequential architecture.
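To make the block-by-block nature of motion estimation concrete, here is a minimal full-search block-matching sketch using the sum of absolute differences (SAD). It illustrates only the ME half of the architecture and says nothing about how the paper shares resources with the DCT. The SAD cost, the search window, the synthetic frames, and the names sad and best_motion_vector are assumptions for this example.

```python
def sad(cur, ref, cx, cy, rx, ry, size):
    """Sum of absolute differences between two size x size blocks."""
    return sum(abs(cur[cy + r][cx + c] - ref[ry + r][rx + c])
               for r in range(size) for c in range(size))

def best_motion_vector(cur, ref, x, y, size, search):
    """Full search: return (cost, dx, dy) of the best match in the window."""
    height, width = len(ref), len(ref[0])
    best = None
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            rx, ry = x + dx, y + dy
            # Skip candidate blocks that fall outside the reference frame.
            if 0 <= rx <= width - size and 0 <= ry <= height - size:
                cost = sad(cur, ref, x, y, rx, ry, size)
                if best is None or cost < best[0]:
                    best = (cost, dx, dy)
    return best

# Synthetic 12 x 12 reference frame; the current frame shows the same
# content displaced by 2 pixels horizontally and 1 pixel vertically.
ref = [[3 * r * r + 5 * c * c + r * c for c in range(12)] for r in range(12)]
cur = [[ref[r + 1][c + 2] if r + 1 < 12 and c + 2 < 12 else 0
        for c in range(12)] for r in range(12)]
result = best_motion_vector(cur, ref, x=4, y=4, size=4, search=3)
print(result)  # prints: (0, 2, 1)
```

Each candidate block is evaluated independently of the others, so the SAD computations are a natural target for parallel hardware.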
ISBN: 0769522262 (print)
In several digital signal processing algorithms, computational nodes are organized in consecutive stages and data is reordered between these stages. Parallel computation of such algorithms with a reduced number of processing elements implies that several computational nodes are assigned to each element. As a drawback, permutations become more complex and require data storage. In this paper, a systematic design methodology for stride permutation networks is derived. These permutations are represented with Boolean matrices, which are decomposed and mapped directly onto register-based networks. The resulting networks are regular and scalable, and they support any power-of-two stride. In addition, the networks reach the lower bound on the number of registers, indicating area efficiency. Since the proposed methodology is systematic, it can be exploited in automated design generation.
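The sketch below shows what a power-of-two stride permutation does to a data sequence, and checks that it is equivalent to a cyclic rotation of the index bits. The bit rotation is one common way to see why such a permutation can be written as a Boolean matrix acting on the index. The functions stride_permute and rotate_left are illustrative names, not part of the paper's methodology, and the register-based network itself is not modelled here.

```python
def stride_permute(data, stride):
    """Reorder `data` by reading it with the given power-of-two stride."""
    n = len(data)
    if n == 0 or n & (n - 1) or stride <= 0 or stride & (stride - 1) or stride > n:
        raise ValueError("length and stride must be powers of two, stride <= length")
    # Output position i takes the element at i*stride, wrapping around with
    # an offset of one each time the end of the sequence is passed.
    return [data[(i * stride) % n + (i * stride) // n] for i in range(n)]

def rotate_left(index, shift, bits):
    """Cyclically rotate the `bits`-wide binary form of `index` left by `shift`."""
    mask = (1 << bits) - 1
    return ((index << shift) | (index >> (bits - shift))) & mask

print(stride_permute(list(range(8)), 2))  # prints: [0, 2, 4, 6, 1, 3, 5, 7]

# Equivalence check for 16 elements: stride 2**s rotates the 4 index bits by s.
for stride in (1, 2, 4, 8, 16):
    shift = stride.bit_length() - 1
    expected = [rotate_left(i, shift, 4) for i in range(16)]
    assert stride_permute(list(range(16)), stride) == expected
```

Because the index mapping only moves bits around, it is linear over the index bits, which is consistent with the Boolean-matrix view described in the abstract.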
ISBN: 9783540744658 (print)
In this paper we describe the parallelization of two nearest neighbour classification algorithms. Nearest neighbour methods are well-known machine learning techniques. They have been successfully applied to the text categorization task. Based on standard parallel techniques, we propose two versions of each algorithm for message passing architectures. We also include experimental results on a cluster of personal computers using a large text collection. Our algorithms attempt to balance the load among the processors, are portable, and obtain very good speedups and scalability.
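One standard way to parallelize nearest neighbour classification is to partition the training set across workers, let each worker find its local best match, and then reduce the local results to a global one. The sketch below simulates that pattern serially in plain Python. The round-robin partitioning, the squared Euclidean distance, and the names local_nearest and partitioned_1nn are assumptions for illustration; the abstract does not describe the authors' actual partitioning or communication scheme.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def local_nearest(query, shard):
    """Work done by one (simulated) processor on its share of the data."""
    return min((sq_dist(query, vec), label) for vec, label in shard)

def partitioned_1nn(query, training, n_workers):
    """Classify `query` with 1-NN over round-robin shards of `training`."""
    # Round-robin split keeps shard sizes within one item of each other,
    # a simple form of load balancing.
    shards = [training[w::n_workers] for w in range(n_workers)]
    candidates = [local_nearest(query, s) for s in shards if s]
    # Reduction step: in a message passing setting this would be a gather
    # or reduce over (distance, label) pairs.
    return min(candidates)[1]

training = [((0.0, 0.0), "a"), ((1.0, 1.0), "a"), ((5.0, 5.0), "b"),
            ((6.0, 5.0), "b"), ((9.0, 0.0), "c")]  # hypothetical data
print(partitioned_1nn((5.4, 4.8), training, n_workers=3))  # prints: b
```

Only one (distance, label) pair per worker crosses the network in this scheme, so communication cost stays small compared with the distance computations.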