Modern high-energy physics experiments require massively parallel special-purpose computers (triggers) to reduce the extremely large primary dataflow to manageable amounts. We present a prototype processing ASIC intended as the basic computational unit in a first-level calorimeter trigger for the ATLAS collider detector to be built at CERN, Switzerland. The proposed trigger is a compact, highly parallel pipelined system with 4096 systolic processors partitioned into 256 weakly interacting custom-designed ASICs, that is, 16 processors per ASIC. Local results from these ASICs are then merged by a second, less complex type of ASIC. Data is received at 800 Mbit/s by bipolar input circuits, while the processing is performed in CMOS at 320 MHz, using the true single-phase clocking (TSPC) scheme. This method promotes fast and compact implementations well suited to pipelined bit-serial applications. A 0.5 μm BiCMOS process with 4 metal layers was chosen for the implementation.
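The abstract mentions bit-serial processing without describing the arithmetic. As a rough illustration of the general idea only (not the ASIC's actual datapath), the sketch below adds two numbers one bit per clock tick, least significant bit first, carrying a single bit of state between ticks. The function names and the 8-bit width are assumptions chosen for the example.

```python
from typing import List


def to_bits(value: int, width: int) -> List[int]:
    """Return `value` as a list of bits, least significant bit first."""
    return [(value >> i) & 1 for i in range(width)]


def from_bits(bits: List[int]) -> int:
    """Rebuild an integer from an LSB-first list of bits."""
    return sum(bit << i for i, bit in enumerate(bits))


def bit_serial_add(a_bits: List[int], b_bits: List[int]) -> List[int]:
    """Add two equal-length LSB-first bit streams, one bit per 'clock tick'.

    Only one full adder and one carry flip-flop are modelled, which is why
    bit-serial units are compact and easy to pipeline.
    """
    carry = 0
    out = []
    for a, b in zip(a_bits, b_bits):
        out.append(a ^ b ^ carry)                # sum bit for this tick
        carry = (a & b) | (carry & (a ^ b))      # carry kept for the next tick
    out.append(carry)                            # final carry-out
    return out


if __name__ == "__main__":
    total = bit_serial_add(to_bits(13, 8), to_bits(200, 8))
    print(from_bits(total))
```

A hardware unit of this kind needs as many ticks as the word has bits, so throughput comes from running many such units side by side rather than from wide datapaths.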
Pattern Matching (PM) over network packet flows for Network Intrusion Detection/Prevention Systems is becoming increasingly performance sensitive as the data volumes of Internet applications grow rapidly. Meanwhile, modern multicore platforms are becoming competitive in performance with traditional hardware solutions for PM. However, because network flow sizes are unbalanced, the traditional flow-based data-parallel processing/programming model cannot fully exploit the computing power of multicore platforms and scales poorly. In this paper, a novel parallel inspection model, Dynamic Differentiated Distributed Detection (D⁴), is proposed. D⁴ deploys distributed parallel operations by adding one more dimension to workload partitioning/allocation. It proposes an effective and efficient scheme that pre-partitions the pattern set in several candidate ways, called "Detection Modes", and lets multiple candidate PM methods handle the resulting subsets; the most suitable Detection Mode is selected for each incoming flow at run time, and the workload is dynamically allocated among multiple CPU cores. Experimental results on a real-world pattern set and traffic traces show that D⁴ scales much better than traditional schemes by better balancing the load among the processors while avoiding unnecessary overheads.
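The abstract does not give the partitioning scheme or the selection policy, so the following is only a minimal sketch of the idea as described: the pattern set is pre-partitioned in several candidate ways, and one way is picked per flow at run time. The round-robin partitioning, the size threshold, the `idle_cores` input and all function names are hypothetical, and the subsets are scanned sequentially here where a real system would hand each one to a separate core.

```python
from typing import Dict, List, Set, Tuple

Modes = Dict[int, List[List[bytes]]]


def build_modes(patterns: List[bytes], widths=(1, 2, 4)) -> Modes:
    """Pre-partition the pattern set once per candidate width (hypothetical
    'Detection Modes'): mode k splits the patterns into k subsets."""
    return {k: [patterns[i::k] for i in range(k)] for k in widths}


def select_mode(flow_len: int, idle_cores: int, modes: Modes,
                threshold: int = 1024) -> int:
    """Assumed policy: small flows stay in the narrowest mode; large flows
    use the widest mode that fits the currently idle cores."""
    narrowest = min(modes)
    if flow_len < threshold:
        return narrowest
    fitting = [k for k in modes if k <= idle_cores]
    return max(fitting) if fitting else narrowest


def inspect(payload: bytes, modes: Modes, idle_cores: int) -> Tuple[int, Set[bytes]]:
    """Match one flow's payload against every subset of the chosen mode."""
    k = select_mode(len(payload), idle_cores, modes)
    hits: Set[bytes] = set()
    for subset in modes[k]:  # each subset would run on its own core
        hits.update(p for p in subset if p in payload)
    return k, hits


if __name__ == "__main__":
    modes = build_modes([b"attack", b"virus", b"worm", b"exploit"])
    print(inspect(b"a worm", modes, idle_cores=4))
    print(inspect(b"x" * 2000 + b"virus attack", modes, idle_cores=3))
```

Splitting the pattern set, rather than only the flows, is what lets a single large flow be spread over several cores in this sketch.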
The authors present a new template-matching algorithm with good recognition performance. However, this new algorithm exhibits a complex, four-dimensional wavefront architecture. Thus, for VLSI implementation, reduced architectures with fewer connections and processors need to be derived. For this purpose, the authors develop a systematic reduction methodology to manually map wavefront computations from high dimensions to low dimensions. This methodology consists of seven steps. Based on it, the authors derive several two-dimensional architectures for the new template-matching algorithm that are suitable for VLSI implementation, and they have simulated one of these architectures on the Intel Hypercube Machine iPSC/2.
Authors: R. Ammar, B. Qin
U-155, Computer Science & Engineering Department, University of Connecticut, Storrs, CT, USA
A flow-analysis technique for real-time parallel computations at the computation level is presented. The technique is based on reducing the given parallel computation to a sequential one and then applying one of the available flow-analysis techniques for sequential computations. An example is given.
Armstrong III is a 20-node multicomputer that is currently operational. In addition to a RISC processor, each node contains reconfigurable resources implemented with FPGAs. The in-circuit reprogrammability of static-RAM-based FPGAs allows the computational capabilities of a node to be dynamically matched to the computational requirements of an application. Most reconfigurable computers in existence today rely solely on a large number of FPGAs to perform computations. In contrast, the paper demonstrates the utility of a small number of FPGAs coupled to a RISC processor with a simple interconnect. It describes a substantive example application that performs HMM training for speech recognition on the reconfigurable platform.
We discuss the parallelization and distribution of a workflow-based application. We present experimental results and discuss lessons learned. The experimentation is based on a scientific application case study: sequential simulation applied to geostatistics. As a result, we identify a list of issues that are not well supported in the most referenced scientific workflow tools and environments, namely Kepler, and outline relevant research directions.
The use of search algorithms for test data generation has seen many successful results. For structural criteria such as branch coverage, heuristics have been designed to help the search. The most common heuristic is the approach level (usually represented as an integer), which rewards test cases whose executions get close (in the control flow graph) to the target branch. To solve the constraints of the predicates in the control flow graph, the branch distance is commonly employed. These two measures are linearly combined. Because the approach level is more important, the branch distance is normalised, often to the range [0, 1]. In this paper, we analyse different types of normalising functions. We found that the one usually employed in the literature has several flaws. We hence propose a different normalising function that is very simple and does not suffer from these limitations. We carried out empirical and analytical analyses to compare these two functions. In particular, we studied their effect on two commonly used search algorithms, namely Simulated Annealing and Genetic Algorithms.
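The abstract does not name the two normalising functions it compares. As a sketch of how the linear combination works, the example below uses two forms that appear in the search-based testing literature, an exponential one and a rational one; treating them as the pair studied here is an assumption, as are the constants `alpha = 1.001` and `beta = 1.0` and the function names.

```python
from typing import Callable


def norm_exponential(distance: float, alpha: float = 1.001) -> float:
    """Map a non-negative branch distance into [0, 1) as 1 - alpha^(-d)."""
    return 1.0 - alpha ** (-distance)


def norm_rational(distance: float, beta: float = 1.0) -> float:
    """Map a non-negative branch distance into [0, 1) as d / (d + beta)."""
    return distance / (distance + beta)


def fitness(approach_level: int, branch_distance: float,
            normalise: Callable[[float], float] = norm_rational) -> float:
    """Linear combination to be minimised: the integer approach level
    dominates because the normalised branch distance stays below 1."""
    return approach_level + normalise(branch_distance)


if __name__ == "__main__":
    for d in (0.0, 1.0, 10.0, 1000.0):
        print(d, norm_exponential(d), norm_rational(d))
    print(fitness(2, 3.0))
```

Because both functions stay strictly below 1, a test case that is one approach level closer to the target always scores better than one that is further away, whatever their branch distances.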
This paper will describe a 70K-transistor chip fabricated in a 1.25 μm CMOS technology, with a 6.2 × 7.6 mm die size, and featuring data synchronization, pipeline latency compensation and other computational elements.
The authors continue a program of designing multiplicative FFT (fast Fourier transform) algorithms with highly structured dataflow. They take up the case of transform size N = p²q, where p and q are distinct odd primes. Number-theoretic methods are used to decompose the indexing set into orbits based on the multiplicative ring structure of Z/N. A family of variants of the fundamental algorithm is designed, presenting options as to whether additions or multiplications dominate the arithmetic cost.
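The abstract does not spell out the decomposition. One natural reading, taken here as an assumption, is that the index set Z/N is split into orbits of the unit group acting by multiplication. The sketch below computes those orbits by brute force for N = 45 (p = 3, q = 5, values chosen only for illustration); the function name is hypothetical.

```python
from math import gcd
from typing import List


def unit_orbits(n: int) -> List[List[int]]:
    """Partition {0, ..., n-1} into orbits under multiplication by the
    units of Z/n (the residues coprime to n). Assumes n > 1."""
    units = [u for u in range(1, n) if gcd(u, n) == 1]
    seen = set()
    orbits = []
    for x in range(n):
        if x in seen:
            continue
        orbit = sorted({(u * x) % n for u in units})  # everything reachable from x
        seen.update(orbit)
        orbits.append(orbit)
    return orbits


if __name__ == "__main__":
    p, q = 3, 5
    n = p * p * q
    for orbit in unit_orbits(n):
        # All indices in one orbit share the same gcd with n.
        print(gcd(orbit[0], n), len(orbit), orbit)
```

Under this reading, each orbit collects the indices sharing one value of gcd(x, N), so N = p²q yields six orbits, one per divisor of N.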