Inability to hide main memory latency has been increasingly limiting the performance of modern processors. The problem is worse in large-scale shared memory systems, where remote memory latencies are hundreds, and soon thousands, of processor cycles. To mitigate this problem, we propose an intelligent memory and cache coherence controller (AMC) that can execute Active Memory Operations (AMOs). AMOs are select operations sent to and executed on the home memory controller of data. AMOs can eliminate a significant number of coherence messages, minimize intranode and internode memory traffic, and create opportunities for parallelism. Our implementation of AMOs is cache-coherent and requires no changes to the processor core or DRAM chips. In this paper, we present the microarchitecture design of AMC, and the programming model of AMOs. We compare AMOs' performance to that of several other memory architectures on a variety of scientific and commercial benchmarks. Through simulation, we show that AMOs offer dramatic performance improvements for an important set of data-intensive operations, e.g., up to 50x faster barriers, 12x faster spinlocks, 8.5x-15x faster stream/array operations, and 3x faster database queries. We also present an analytical model that can predict the performance benefits of using AMOs with decent accuracy. The silicon cost required to support AMOs is less than 1% of the die area of a typical high performance processor, based on a standard cell implementation.
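As a rough illustration of the programming model summarized above, the C sketch below shows how an AMO-style fetch-and-add could back a sense-reversing barrier. The amo_fetch_add wrapper, which here falls back to an ordinary atomic read-modify-write, stands in for hardware that would forward the operation to the counter's home memory controller; the names and barrier layout are illustrative assumptions, not the paper's actual interface.

    /* Hypothetical AMO-style barrier: the fetch-and-add would be executed at the
     * counter's home memory controller, so the cache line holding the counter
     * never bounces between the participating processors. */
    #include <stdatomic.h>
    #include <stdint.h>

    typedef struct {
        _Atomic uint64_t count;   /* arrivals in the current episode          */
        _Atomic uint64_t sense;   /* advances each time all threads arrive    */
        uint64_t         nthreads;
    } amo_barrier_t;

    /* Stand-in for the AMO intrinsic: an ordinary atomic RMW here; on AMO
     * hardware the operation would be shipped to the home memory controller. */
    static inline uint64_t amo_fetch_add(_Atomic uint64_t *addr, uint64_t v)
    {
        return atomic_fetch_add_explicit(addr, v, memory_order_acq_rel);
    }

    void amo_barrier_wait(amo_barrier_t *b)
    {
        uint64_t my_sense = atomic_load_explicit(&b->sense, memory_order_acquire);

        /* One remote update instead of N processors contending for the line. */
        if (amo_fetch_add(&b->count, 1) == b->nthreads - 1) {
            atomic_store_explicit(&b->count, 0, memory_order_relaxed);
            atomic_fetch_add_explicit(&b->sense, 1, memory_order_release);
        } else {
            while (atomic_load_explicit(&b->sense, memory_order_acquire) == my_sense)
                ;   /* spin until the last arrival advances the sense */
        }
    }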
ISBN (print): 0780387368
In the design of a heterogeneous multiprocessor system on chip, we face a new design problem: scheduler implementation. In this paper, we present an approach to implementing a static scheduler, which controls all the task executions and communication transactions of a system according to a pre-determined schedule. For the scheduler implementation, we consider both intra-processor and inter-processor synchronization. We also consider scheduler overhead, which is often neglected. In particular, we address the issue of centralized implementation versus distributed implementation. We investigate the pros and cons of the two different scheduler implementations. Through experiments with synthetic examples and a real-world multimedia application, we show the effectiveness of our approach.
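A minimal sketch, in C, of what a distributed implementation of such a static scheduler might look like, assuming each processing element walks its own precomputed slot table and synchronizes through shared flags before entries that depend on another processor's output. The table layout and flag protocol are assumptions for illustration, not the paper's design.

    /* Distributed static-scheduler sketch: each processing element (PE) runs
     * its own pre-computed table of entries.  An entry optionally waits for a
     * flag set by another PE (inter-processor sync) before running its task,
     * then sets a flag of its own when the task's output is ready. */
    #include <stdatomic.h>
    #include <stddef.h>

    typedef void (*task_fn)(void *arg);

    typedef struct {
        _Atomic int *wait_flag;   /* NULL if the entry has no remote dependency */
        _Atomic int *done_flag;   /* NULL if no other PE consumes the result    */
        task_fn      task;
        void        *arg;
    } sched_entry;

    /* Executed once per iteration of the static schedule on one PE. */
    void run_static_schedule(const sched_entry *table, size_t n)
    {
        for (size_t i = 0; i < n; i++) {
            if (table[i].wait_flag)                    /* inter-processor sync  */
                while (!atomic_load_explicit(table[i].wait_flag,
                                             memory_order_acquire))
                    ;                                  /* busy-wait on producer */

            table[i].task(table[i].arg);               /* run task in its slot  */

            if (table[i].done_flag)                    /* publish completion    */
                atomic_store_explicit(table[i].done_flag, 1,
                                      memory_order_release);
        }
    }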
ISBN (print): 081944426X
To meet the computational requirements of mid-range and high-end programmable ultrasound systems, multiple processors are currently required. Algorithms optimized specifically for a single processor-based system may not perform well in a multiprocessor environment. They need to be efficiently remapped on multiple processors to take advantage of the increased computing power while minimizing the interprocessor data transfer and the latency between data acquisition and display. In this paper, we describe a multiprocessor-based implementation of scan conversion, a key processing task in an ultrasound system that geometrically transforms the acquired polar ultrasound data to Cartesian coordinates for display. The single processor-based scan conversion algorithm that was reported previously uses inverse mapping for geometric transformation, where the pixel values in the Cartesian display are determined from data in the polar domain. Inverse mapping requires access to a full frame of pre-scan-converted ultrasound data, which in a multiprocessor system can be located across multiple processors, thus requiring a significant amount of interprocessor data communication. Our modified scan conversion algorithm reduces the data movement by performing inverse-mapped scan conversion locally on the polar-domain data present in each processor's memory. Each processor handles a smaller amount of data, thus reducing the latency. The raster pixels generated by each processor are combined later. Interprocessor synchronization is used to ensure that each processor displays data belonging to the same frame. Data overlapping between processors avoids boundary artifacts between regions that are processed on different processors. Using four Hitachi/Equator Technologies' 300-MHz MAP-CA processors, scan conversion requires 5.6 ms for a 600x420 RGB frame, as compared to 14.6 ms using a single processor, and the latency is reduced by 33.3%. We believe that this type of parallel algorithm will
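The C sketch below illustrates the general shape of locally inverse-mapped scan conversion: each processor converts only those output pixels whose polar source samples fall within its (overlapped) slice of scan lines, using bilinear interpolation. The geometry, data layout, and overlap handling here are simplifying assumptions, not the paper's exact algorithm.

    /* Inverse-mapped polar->Cartesian scan conversion over the local slice of
     * pre-scan-converted data held by one processor.  'polar' holds scan lines
     * line_lo..line_hi (including the overlap region), each with n_samples
     * samples in 0..255; pixels whose source samples lie outside that slice
     * are skipped and produced by another processor. */
    #include <math.h>
    #include <stddef.h>

    void scan_convert_local(const float *polar, int n_samples,
                            int line_lo, int line_hi,      /* local lines, inclusive  */
                            float theta0, float dtheta,    /* angle of line 0, step   */
                            float r0, float dr,            /* range of sample 0, step */
                            unsigned char *image, int width, int height,
                            float x0, float y0, float dx, float dy)
    {
        for (int py = 0; py < height; py++) {
            for (int px = 0; px < width; px++) {
                float x  = x0 + px * dx, y = y0 + py * dy;
                float r  = sqrtf(x * x + y * y);
                float th = atan2f(x, y);                /* angle from the probe axis */

                float lf = (th - theta0) / dtheta;      /* fractional scan-line index */
                float sf = (r  - r0) / dr;              /* fractional sample index    */

                /* Convert only pixels whose four source samples are local. */
                if (lf < (float)line_lo || lf + 1.0f > (float)line_hi ||
                    sf < 0.0f || sf + 1.0f > (float)(n_samples - 1))
                    continue;

                int   li = (int)lf, si = (int)sf;
                float a  = lf - li, b  = sf - si;       /* bilinear weights */
                const float *l0 = polar + (size_t)(li - line_lo) * n_samples;
                const float *l1 = l0 + n_samples;
                float v = (1 - a) * ((1 - b) * l0[si] + b * l0[si + 1])
                        +      a  * ((1 - b) * l1[si] + b * l1[si + 1]);

                image[(size_t)py * width + px] = (unsigned char)v;
            }
        }
    }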
Both predictable interprocessor synchronization and fast interrupt response are required for real-time systems constructed using asymmetric shared-memory multiprocessors. This paper points out that conventional spin lock algorithms cannot satisfy both requirements at the same time and describes two spin lock algorithms that have been proposed to solve this problem. These algorithms, extensions of queuing spin locks modified to be preemptable for servicing interrupts, can give upper bounds on the times to acquire and release an interprocessor lock while achieving a fast response to interrupt requests. Performance measurements demonstrate that the algorithms have the required properties. To apply the algorithms to real-time kernels, we also propose an extended algorithm, which is a combination of the two algorithms.
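To make the idea concrete, here is a C sketch of an MCS-style queuing spin lock whose waiters poll for pending interrupts while spinning. The interrupt_pending and service_interrupt hooks are hypothetical, and the paper's algorithms go further, allowing a waiter to leave and rejoin the queue so that servicing an interrupt cannot delay the lock hand-off.

    /* MCS-style queuing spin lock whose waiters poll for pending interrupt
     * requests while spinning; a simplified illustration of the preemptable
     * queuing-lock idea, not the paper's algorithms. */
    #include <stdatomic.h>
    #include <stdbool.h>
    #include <stddef.h>

    typedef struct mcs_node {
        _Atomic(struct mcs_node *) next;
        _Atomic bool               locked;
    } mcs_node;

    typedef _Atomic(mcs_node *) mcs_lock;

    /* Hypothetical hooks: test for and service a pending local interrupt. */
    extern bool interrupt_pending(void);
    extern void service_interrupt(void);

    void mcs_acquire(mcs_lock *lock, mcs_node *me)
    {
        atomic_store_explicit(&me->next, NULL, memory_order_relaxed);
        atomic_store_explicit(&me->locked, true, memory_order_relaxed);

        mcs_node *prev = atomic_exchange_explicit(lock, me, memory_order_acq_rel);
        if (prev == NULL)
            return;                               /* lock was free */

        atomic_store_explicit(&prev->next, me, memory_order_release);
        while (atomic_load_explicit(&me->locked, memory_order_acquire)) {
            if (interrupt_pending())              /* stay responsive while queued */
                service_interrupt();
        }
    }

    void mcs_release(mcs_lock *lock, mcs_node *me)
    {
        mcs_node *succ = atomic_load_explicit(&me->next, memory_order_acquire);
        if (succ == NULL) {
            mcs_node *expected = me;
            if (atomic_compare_exchange_strong_explicit(lock, &expected, NULL,
                    memory_order_acq_rel, memory_order_acquire))
                return;                           /* no successor: lock is free */
            while ((succ = atomic_load_explicit(&me->next,
                                                memory_order_acquire)) == NULL)
                ;                                 /* successor is still linking in */
        }
        atomic_store_explicit(&succ->locked, false, memory_order_release);
    }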