As the computation cost increases to meet the design requirements of computation-intensive applications on today's systems, the pressure to develop high-performance parallel processors on a chip will increase. Network-on-Chip (NoC) techniques that interconnect multiple processing elements with routers are a solution for reducing computation time and power consumption through parallel processing on a chip. The shared communication platform is also essential to meet the scalability and complexity challenges of System-on-Chip (SoC) design. However, not many parallel applications have been studied on such an architecture, and workload characterizations that could guide architecture design optimization have not been researched. In this paper, we study multiple data-parallel applications on a multicore NoC architecture with distributed memory space. We introduce an efficient runtime workload distribution algorithm that balances the workloads of parallel processors and apply it to selected embedded applications. Using our cycle-accurate multicore simulator, we simulated our NoC-enabled multicore architecture model, executed the data-parallel applications on various numbers of processing elements using the proposed runtime load-balancing algorithm, and analyzed performance and communication overheads.
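The abstract does not reproduce the paper's workload-distribution algorithm, so the following is only a minimal Python sketch of the general idea: redistributing data-parallel work items across processing elements in proportion to their measured throughput. The function name rebalance and its inputs are hypothetical and not taken from the paper.

# Illustrative sketch only: a generic runtime workload-distribution step that
# rebalances data-parallel chunks across processing elements (PEs) according to
# their measured throughput. All names and inputs are hypothetical.

def rebalance(chunks, throughput):
    """Assign `chunks` (a list of work items) to PEs in proportion to
    each PE's measured throughput (items processed per unit time)."""
    total = sum(throughput)
    shares = [t / total for t in throughput]          # relative speed of each PE
    assignment = [[] for _ in throughput]
    pe = 0
    budget = shares[0] * len(chunks)                  # items PE 0 should receive
    for chunk in chunks:
        while budget < 1 and pe < len(throughput) - 1:
            pe += 1
            budget += shares[pe] * len(chunks)
        assignment[pe].append(chunk)
        budget -= 1
    return assignment

# Example: 4 PEs where PE 2 is twice as fast as the others.
print(rebalance(list(range(10)), [1.0, 1.0, 2.0, 1.0]))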
The large gap between the speed at which data can be processed and the performance of I/O devices makes the shared storage infrastructure of a cluster a major bottleneck. Parallel file systems try to smooth this difference by distributing data onto several servers, increasing the system's available bandwidth. However, most implementations use a fixed number of I/O servers, defined during the initialization of the system, and cannot add new resources without a complete redistribution of the existing data. With different applications executing at the same time, concurrent access to these resources can aggravate the existing bottleneck, making it very hard to define an initial number of servers that satisfies the performance requirements of different applications. This paper presents a reconfiguration mechanism for the dNFSp file system that uses on-line monitoring of applications' I/O behavior to detect performance contention and dedicate more I/O resources to applications with higher demands. These extra resources are taken from the available nodes of the cluster, using their I/O devices as temporary storage. We show that this strategy is capable of increasing I/O performance by up to 200% for access patterns with short I/O phases and by 47% for longer I/O phases.
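As a rough illustration of the kind of on-line decision rule such a reconfiguration mechanism could use (not dNFSp's actual code), the sketch below grants temporary I/O servers from a pool of free cluster nodes to the applications whose observed bandwidth has degraded the most. All names and thresholds are assumptions.

# Illustrative sketch only: `apps` maps an application id to its observed and
# expected bandwidth (MB/s); `free_nodes` is a pool of cluster nodes whose local
# disks can act as temporary I/O servers.

def reconfigure(apps, free_nodes, contention_threshold=0.5):
    """Return a mapping app -> extra I/O node for applications whose observed
    bandwidth has fallen below `contention_threshold` of what they achieve
    without concurrent access."""
    grants = {}
    # Serve the most penalized applications first.
    ranked = sorted(apps.items(), key=lambda kv: kv[1]["observed"] / kv[1]["expected"])
    for app, bw in ranked:
        if not free_nodes:
            break
        if bw["observed"] / bw["expected"] < contention_threshold:
            grants[app] = free_nodes.pop()   # dedicate a temporary I/O server
    return grants

grants = reconfigure(
    {"appA": {"observed": 30.0, "expected": 100.0},
     "appB": {"observed": 80.0, "expected": 100.0}},
    free_nodes=["node7", "node9"])
print(grants)   # appA is under contention, so it receives a temporary I/O server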
Branch and Bound (B&B) algorithms are highly parallelizable, but they are irregular, and dynamic load balancing techniques have been used to avoid idle processors. In previous work, the authors used a dynamic number of threads at run time, which depends on the measured performance of the application, for just one interval B&B algorithm running on the system. In this way, load balancing is achieved through thread-generation decisions. In this work, we extend the study of these models to non-dedicated systems. In order to have a controlled test bed and comparable results, several instances of the interval global optimization algorithm are executed in the system, with the same model and the same problem to solve. A non-dedicated system is therefore simulated, because the execution of one application affects the execution of the other instances. This paper discusses different methods and models for deciding when a thread should be created. Experiments show which of the proposed methods performs best in terms of maximum running time per application, using the fewest running threads. Following this parallel programming methodology, which is well suited for other B&B codes, applications can adapt their parallelism level to their own performance and the load of the system at run time. This work represents a step forward towards increasing the performance of parallel algorithms running in non-dedicated and heterogeneous systems. The adaptive model discussed in this work is able to reduce the overall execution time for a set of instances of the same application running simultaneously. It also exempts the user from specifying the number of threads each application should use.
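As a hedged illustration only, the sketch below shows one possible run-time rule for deciding whether a B&B worker should create an additional thread; the efficiency metric and threshold are assumptions, not the decision models evaluated in the paper.

# Illustrative sketch only: spawn a new thread if the existing threads are still
# being used efficiently and there is enough pending work to feed one more.

import os

def should_spawn(nodes_last_interval, active_threads, pending_subproblems,
                 baseline_nodes_per_thread, min_efficiency=0.75):
    """Decide whether to create one more worker thread."""
    if pending_subproblems <= active_threads:
        return False                                   # not enough work to share
    if active_threads >= (os.cpu_count() or 1):
        return False                                   # no spare hardware threads
    efficiency = (nodes_last_interval /
                  (active_threads * baseline_nodes_per_thread))
    return efficiency >= min_efficiency                # performance has not degraded

# Example: 3 threads explored 2600 nodes last interval; one thread alone does ~1000.
print(should_spawn(2600, 3, 50, baseline_nodes_per_thread=1000))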
ISBN:
(Print) 9781450305549
The battle of fixed-function devices vs. programmable devices has been won by the programmables. The question facing us now is to determine what kinds of programmability to place on next-generation systems and devices. Research and development on many applications has shown that different kinds of hardware and software programmability succeed for different application classes: powerful, single-thread-optimized CPUs continue to do very well for many applications; the general-purpose GPU is carving a niche in high-throughput, parallel floating-point codes in addition to its home turf of graphics; the FPGA is particularly good at variable bit-size computations and data steering, as well as parallel distributed control of networks. Future systems may well need all three of these types of engines, and perhaps interesting mixtures of them. This is particularly true when we deal with the combined goals of optimizing cost, performance, and power. In this workshop, we will look to the future of the FPGA within these types of 'converged' programmable computing engines, and reflectively ask ourselves: What role can the FPGA play? What future applications in areas such as networking, mobile, and artificial intelligence can be driven by FPGAs? How do FPGAs fit into the architecture realm of CPUs, general-purpose GPUs, and DSPs? How should the designer/programmer express their intent in the most effective way possible? What are the requirements for a compilation and optimization environment that allows FPGAs to intermix within a heterogeneous and converged future? The intent of this workshop is to open a discussion on these questions. There will be a series of short, invited talks interspersed with free and open discussion.
ISBN:
(Print) 9780889868205
We have developed a network (called TPNET) which is adaptable to any parallel processing system. It consists of several core processors and a router. A processing element in the parallel processing system is a processor called TPCORE2, which has been developed by the authors' group. Since this core processor can execute the full transputer instruction set, we can describe a software system using the parallel processing language occam. Occam is theoretically based on a model called Communicating Sequential Processes (CSP). If a parallel system can be described in the occam language and works correctly, it can be regarded as free from the deadlocks and livelocks that may be intrinsically hidden in a parallel system. In this way, we can simply construct a secure parallel processing system. Each processor can be connected to a router, and we can achieve dynamic configuration of the network topology by controlling the router. The basic communication protocol in TPNET is IEEE 1355. An assured and efficient network can be constructed despite the structural simplicity of the protocol. With the characteristics discussed above and with an efficient interrupt processing system in TPCORE2, we propose TPNET as a basic framework for high-performance embedded systems used widely in various industrial fields.
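To give a flavor of the CSP-style message passing that occam programs on TPCORE2 nodes rely on, here is a minimal Python analogue (not occam and not TPNET code) of a producer and a consumer communicating over a blocking channel.

# Illustrative sketch only: a CSP-style producer/consumer pair communicating
# over a bounded, blocking channel, mimicking occam's channel operations.

from queue import Queue
from threading import Thread

channel = Queue(maxsize=1)      # a rendezvous-like channel (bounded, blocking)

def producer():
    for value in range(3):
        channel.put(value)      # occam: chan ! value
    channel.put(None)           # end-of-stream marker

def consumer():
    while True:
        value = channel.get()   # occam: chan ? value
        if value is None:
            break
        print("received", value)

threads = [Thread(target=producer), Thread(target=consumer)]
for t in threads:
    t.start()
for t in threads:
    t.join()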
ISBN:
(Print) 9780889868205
Many attempts have been made to optimize the median filter through both software and hardware approaches. An architectural design of hardware capable of performing real-time median filtering is presented. The architecture uses the histogram approach to calculate the median, while optimizing the sliding-window method to reuse all of its calculations. Data is output row by row and every input pixel is processed only once. The design is independent of window size and image size, and supports adding more processing elements to handle wider images. The control unit design is minimized to enable self-adjustment of plug-and-play processing elements. The architecture is implemented in VHDL and synthesized to a Virtex-2 Pro FPGA. The architecture's performance and operation are compared to previous work.
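As a software analogue of the described hardware (a sketch, not the paper's design), the following Python code applies the classic histogram method to one image row: the window histogram is reused as the window slides, so each pixel enters and leaves the histogram exactly once.

# Illustrative sketch only: histogram-based sliding-window median for one row.
# Assumes 8-bit pixels and an odd window size; image borders are skipped.

def median_filter_row(image, row, win):
    """Median-filter one row of `image` (list of lists of 0..255 values)
    with a win x win window."""
    h = win // 2
    medians = []
    hist = [0] * 256
    # Build the histogram for the first window position in this row.
    for r in range(row - h, row + h + 1):
        for c in range(0, win):
            hist[image[r][c]] += 1

    def median_from_hist():
        count, target = 0, (win * win) // 2 + 1
        for value in range(256):
            count += hist[value]
            if count >= target:
                return value

    medians.append(median_from_hist())
    # Slide right: remove the leftmost column, add the new rightmost column.
    for col in range(h + 1, len(image[0]) - h):
        for r in range(row - h, row + h + 1):
            hist[image[r][col - h - 1]] -= 1
            hist[image[r][col + h]] += 1
        medians.append(median_from_hist())
    return medians

img = [[10, 200, 30, 40, 50],
       [60,  70, 80, 90, 15],
       [25,  35, 45, 55, 65]]
print(median_filter_row(img, 1, 3))   # medians for the interior of row 1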
ISBN:
(Print) 9780889868205
Network fault management systems rely heavily on observed alarms to identify the root causes of network failures. Due to the increasing complexity of modern computer networks, the information carried by these alarms may in fact be vague, imprecise, and inconsistent. Thus, these alarms often possess different diagnostic capabilities and should not be treated equally. In this paper, we propose a new distributed alarm correlation approach that effectively tackles the aforementioned data deficiencies. According to the proposed approach, the managed network is first divided into a disjoint set of management domains, and each domain is assigned an intelligent agent. Within the framework of Dempster-Shafer evidence theory, the intelligent agent perceives each network entity in its domain as a source of information. As such, alarms emitted by these entities are expected to exhibit different information qualities and are assigned different weights accordingly. Based on their weights, the observed alarms are then correlated by their respective agent into a single local fuzzy composite alarm. Since local composite alarms constitute only partial views of the managed network, they are correlated, by a higher management entity, into a global alarm that accurately reflects a comprehensive view of the managed network.
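A small, self-contained illustration of the underlying machinery (not the paper's algorithm): two alarm sources with different weights are discounted and then fused with Dempster's rule of combination over a toy frame of fault hypotheses. The hypotheses, weights, and mass values are made up.

# Illustrative sketch only: weighted discounting plus Dempster's rule of
# combination for two alarm sources over a two-hypothesis frame.

from itertools import product

def discount(mass, weight):
    """Discount a mass function by a source weight in [0, 1]; the removed
    belief is transferred to total ignorance (the whole frame)."""
    frame = frozenset().union(*mass)
    out = {k: weight * v for k, v in mass.items()}
    out[frame] = out.get(frame, 0.0) + (1.0 - weight)
    return out

def combine(m1, m2):
    """Dempster's rule: intersect focal elements and renormalize conflict."""
    joint, conflict = {}, 0.0
    for (a, va), (b, vb) in product(m1.items(), m2.items()):
        inter = a & b
        if inter:
            joint[inter] = joint.get(inter, 0.0) + va * vb
        else:
            conflict += va * vb
    return {k: v / (1.0 - conflict) for k, v in joint.items()}

LINK, ROUTER = frozenset({"link_fault"}), frozenset({"router_fault"})
BOTH = LINK | ROUTER
alarm1 = discount({LINK: 0.8, BOTH: 0.2}, weight=0.9)   # reliable source
alarm2 = discount({ROUTER: 0.6, BOTH: 0.4}, weight=0.5) # noisier source
print(combine(alarm1, alarm2))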
ISBN:
(Print) 9780889868205
A typical group mutual exclusion algorithm among m groups makes use of an m-group coterie, which determines the performance of the algorithm. There are two main performance measures: the availability is the probability that an algorithm tolerates process crash failures, and the concurrency is the number of processes that it allows to access the resources simultaneously. Since non-dominated (ND, for short) m-group coteries (locally) maximize the availability, and their degrees roughly correspond to the concurrency, methods to construct ND m-group coteries with large degrees are sought. Nevertheless, only a few naive methods have been proposed. This paper presents three methods to construct desirable m-group coteries. The first method constructs an ND m-group coterie from a dominated one using the transversal composition. The second one constructs an ND (m - 1)-group coterie from an ND m-group coterie. The last one uses the coterie join operation to produce an ND m-group coterie from an ND coterie and another ND m-group coterie. These methods preserve the degrees of the original m-group coteries.
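For readers unfamiliar with coteries, the short sketch below checks the two defining properties of an ordinary coterie, the building block behind these m-group constructions; it does not implement the paper's composition methods. The majority quorums over five processes form a well-known non-dominated coterie.

# Illustrative sketch only: verify the intersection and minimality properties
# that every (ordinary) coterie must satisfy.

from itertools import combinations

def is_coterie(quorums):
    quorums = [frozenset(q) for q in quorums]
    for q1, q2 in combinations(quorums, 2):
        if not (q1 & q2):                 # intersection property
            return False
        if q1 <= q2 or q2 <= q1:          # minimality
            return False
    return True

# The 3-out-of-5 majority coterie over processes 1..5.
majority = list(combinations(range(1, 6), 3))
print(is_coterie(majority))                          # True
print(is_coterie([{1, 2}, {3, 4}]))                  # False: quorums don't intersect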
ISBN:
(Print) 9781450301787
Active messages have proven to be an effective approach for certain communication problems in high performance computing. Many MPI implementations, as well as runtimes for Partitioned Global Address Space languages, use active messages in their low-level transport layers. However, most active message frameworks have low-level programming interfaces that require significant programming effort to use directly in applications and that also prevent optimization opportunities. In this paper we present AM++, a new user-level library for active messages based on generic programming techniques. Our library allows message handlers to be run in an explicit loop that can be optimized and vectorized by the compiler and that can also be executed in parallel on multicore architectures. Runtime optimizations, such as message combining and filtering, are also provided by the library, removing the need to implement that functionality at the application level. Evaluation of AM++ with distributed-memory graph algorithms shows the usability benefits provided by these library features, as well as their performance advantages.
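As a toy illustration of the concepts named in the abstract (handler registration, message combining, and running handlers in an explicit loop), here is a small Python sketch; it is not the AM++ interface.

# Illustrative sketch only: a toy active-message layer that buffers messages
# per (destination, type) and delivers them through one explicit handler loop.

from collections import defaultdict

class ToyActiveMessages:
    def __init__(self):
        self.handlers = {}
        self.outbox = defaultdict(list)

    def register(self, msg_type, handler):
        self.handlers[msg_type] = handler

    def send(self, dest, msg_type, payload):
        self.outbox[(dest, msg_type)].append(payload)   # combine per destination/type

    def flush(self):
        """Deliver all buffered messages; handlers run in one explicit loop,
        which is what makes vectorized or parallel execution possible."""
        for (dest, msg_type), payloads in self.outbox.items():
            handler = self.handlers[msg_type]
            for payload in payloads:                    # the handler loop
                handler(dest, payload)
        self.outbox.clear()

am = ToyActiveMessages()
am.register("relax", lambda dest, d: print(f"rank {dest}: relax edge {d}"))
am.send(1, "relax", (0, 5))
am.send(1, "relax", (2, 7))
am.flush()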
In this paper, we present a distributed architecture for indexing and serving large and diverse datasets. It incorporates and extends the functionality of Hadoop, the open source MapReduce framework, and of HBase, a d...