ISBN: 9783540695004 (print)
Due to their high-level nature, parallel functional languages provide some advantages for the programmer. Unfortunately, the functional programming community has not paid much attention to some important practical problems, such as debugging parallel programs. In this paper we introduce the first debugger that works with any parallel extension of the functional language Haskell, the de facto standard in the (lazy evaluation) functional programming community. The debugger is implemented as an independent library. Thus, it can be used with any Haskell compiler. Moreover, the debugger can be used to analyze how much speculative work has been done in any program.
ISBN: 9783540695004 (print)
In this paper we provide both a qualitative and a quantitative evaluation of a decoupled multithreaded architecture that uses non-blocking threads. Our architecture is based on simple in-order pipelines and complete decoupling of memory accesses from execution pipelines. We extend the architecture to support thread-level speculation using snooping cache coherency protocols. We evaluate the performance gains from speculation by varying the number of load/store instructions relative to computational instructions, the misspeculation rate, and the degree of thread-level speculation. Our architecture presents a viable alternative to complex superscalar and super-speculative CPUs.
ISBN: 9780769533025 (print)
DPCM (Differential Pulse Code Modulation) coding is widely used in many applications, including lossless JPEG compression. DPCM decoding is inherently a 1-indexed or 2-indexed recurrence relation. Thus, although it is hard to parallelize efficiently, some (N log N) or (log(2) N) algorithms have been studied for an N x N image with N x N or N processors. Recently, commodity microprocessors have been equipped with multiple cores and SMP architectures are used in some PCs, but the degree of parallelism is not so large (up to 80). Thus, it is unrealistic to parallelize the processing of an N x N image with N x N or N processors. In this paper we implement two parallel DPCM algorithms for an N x N image on P processors (P << N): Fat-pipeline and P-scheme. Our experimental results show that both approaches provide a parallelism of about 3.2 with 6 processing cores.
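To illustrate why DPCM decoding is a recurrence, the following is a minimal Python sketch that assumes the simplest possible predictor (the previous sample in a single row). The function names are hypothetical, and the chunked decoder is a generic two-pass prefix-sum scheme for P chunks, not the Fat-pipeline or P-scheme algorithms named in the abstract.

```python
from itertools import accumulate

def dpcm_encode(samples):
    """Residuals against the previous sample (assumed predictor: left neighbour)."""
    prev = 0
    residuals = []
    for s in samples:
        residuals.append(s - prev)
        prev = s
    return residuals

def dpcm_decode(residuals):
    """Sequential decode: x[i] = x[i-1] + e[i], a first-order recurrence."""
    return list(accumulate(residuals))

def dpcm_decode_chunked(residuals, p):
    """Decode in at most p chunks: local prefix sums, then add each chunk's offset.

    The per-chunk prefix sums are independent and could run on separate
    cores; here they run serially to keep the sketch simple.
    """
    n = len(residuals)
    size = max(1, -(-n // p))  # ceil(n / p), at least 1
    chunks = [residuals[i:i + size] for i in range(0, n, size)]
    local = [list(accumulate(c)) for c in chunks]  # independent per chunk
    out, offset = [], 0
    for sums in local:
        out.extend(v + offset for v in sums)  # fix-up pass
        offset += sums[-1]                    # carry the running total forward
    return out

row = [10, 12, 11, 15, 15, 9, 20]
encoded = dpcm_encode(row)
```

Only the short fix-up pass is serial across chunks, which is the usual reason a prefix-sum formulation suits a small number of cores.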
ISBN: 9781424420025 (print)
Graphics processing units (GPUs), originally designed for computer video cards, have emerged as the most powerful chip in a high-performance workstation. Unlike multicore CPU architectures, which currently ship with two or four cores, GPU architectures are "manycore", with hundreds of cores capable of running thousands of threads in parallel. NVIDIA's CUDA is a co-evolved hardware-software architecture that enables high-performance computing developers to harness the tremendous computational power and memory bandwidth of the GPU in a familiar programming environment - the C programming language. We describe the CUDA programming model and motivate its use in the biomedical imaging community.
ISBN: 9781424421787 (print)
This paper considers the problem of digital predistortion of parallel Wiener-type systems using the Recursive Prediction Error Method (RPEM) and the Nonlinear Filtered-x Least Mean Squares (NFxLMS) algorithms. The RPEM algorithm is used to identify the parallel Wiener-type system and the FIR filter that represents the inverse of the linear kernels. Then the estimates of the nonlinear kernels and of the inverse of the linear kernels are used to construct the predistorter, as done in [1]. The NFxLMS algorithm, on the other hand, is used to directly estimate the coefficients of the predistorter, which is modeled using a Volterra series. A comparative simulation study of the two algorithms is given in this paper.
ISBN: 9783540695004 (print)
Parallel computers provide an efficient and economical way to solve large-scale and/or time-constrained scientific, engineering, and industry problems. Consequently, there is a need to predict the performance order of both deterministic and non-deterministic parallel algorithms. Predicting the performance of the traveling salesman problem (TSP) is challenging because similar input data sets may cause significant variability in execution times. The parallel performance of data-dependent algorithms depends on the problem size, the number of processors, and other parameters. Discovering the most important of these other parameters is the key to obtaining a good estimate of the performance order. This paper presents a novel methodology for predicting the performance of a parallel algorithm for solving the TSP. The process explores data in search of patterns and/or relationships, detecting the main parameters that affect performance. Then, it uses the measured values for this limited number of inputs to produce a multiple-linear-regression model. Finally, the regression equation allows for predicting how the algorithm will respond when given new input data sets. The preliminary experimental results are quite promising.
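The regression step described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the two predictors (problem size and number of processors), the function names, and the data are all hypothetical and synthetic, not taken from the paper, and the fit uses plain ordinary least squares via the normal equations.

```python
def fit_linear(rows, targets):
    """Ordinary least squares via the normal equations (X^T X) b = X^T y.

    Assumes the predictors are not collinear; a singular system would
    raise ZeroDivisionError in this simple sketch.
    """
    xs = [[1.0] + list(r) for r in rows]  # prepend intercept column
    k = len(xs[0])
    # Augmented matrix [X^T X | X^T y]
    a = [[sum(x[i] * x[j] for x in xs) for j in range(k)]
         + [sum(x[i] * y for x, y in zip(xs, targets))]
         for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(k):
            if r != col:
                f = a[r][col] / a[col][col]
                a[r] = [v - f * w for v, w in zip(a[r], a[col])]
    return [a[i][k] / a[i][i] for i in range(k)]

def predict(coeffs, row):
    """Evaluate b0 + b1*x1 + ... for one new input."""
    return coeffs[0] + sum(c * v for c, v in zip(coeffs[1:], row))

# Synthetic measurements: (hypothetical problem size, hypothetical processor count)
rows = [(1, 2), (2, 1), (3, 5), (4, 3), (5, 8)]
# Synthetic "execution times" generated from t = 2 + 3*x1 + 0.5*x2
times = [6.0, 8.5, 13.5, 15.5, 21.0]

coeffs = fit_linear(rows, times)
estimate = predict(coeffs, (6, 4))
```

With real measurements the fit would not be exact, and the residuals would indicate whether the selected parameters explain the variability in execution time.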
ISBN: 9783540695004 (print)
In this paper we present a novel and complete approach to encapsulating parallelism for relational database query execution that strives for maximum resource utilization for both CPU and disk activities. Its simple and robust design is capable of modeling intra- and inter-operator parallelism for one or more parallel queries in a most natural way. In addition, encapsulation guarantees that the bulk of relational operators can remain unmodified, as long as their implementation is thread-safe. We will show that, with this approach, the problem of scheduling parallel tasks is generalized, so that it can be safely entrusted to the underlying operating system (OS) without suffering any performance penalties. On the contrary, relocating all scheduling decisions from the DBMS to the OS guarantees a centralized and therefore near-optimal resource allocation (depending on the OS's abilities) for the complete system that is hosting the database server as one of its tasks. Moreover, with this proposal, query parallelization is fully transparent on the SQL interface of the database system. The DB administrator can configure the system for effective parallel query execution by setting two descriptive tuning parameters. A prototype implementation has been integrated into the Transbase (R) relational DBMS engine.
ISBN: 9781424414567 (print)
Since Schnorr-Euchner Sphere Decoding (SE-SD) does not guarantee a fixed throughput, the number of searching cycles of SE-SD must be limited in a practical implementation. Because SE-SD with a runtime constraint suffers degraded performance due to the variance of the searching cycles, this paper proposes an enhanced SE-SD architecture with a small variance of searching cycles for a multi-input multi-output (MIMO) system. The small variance in the number of searching cycles is achieved by applying parallel partial Euclidean distance (PED) calculation units to the one-node-per-cycle architecture. Since the proposed architecture is able to evaluate more child nodes in a single cycle, the average number of processing cycles and the error performance are significantly improved under a per-block runtime constraint. Our proposed parallel architecture roughly doubles the complexity, but it obtains a 2 dB gain in a 4x4 16QAM system when the runtime constraint is 7 cycles.
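As a rough illustration of what a bank of parallel PED units computes, the Python sketch below evaluates the partial Euclidean distance of every child of one tree node and orders the children by increasing PED, as Schnorr-Euchner enumeration does. It assumes the standard formulation with an upper-triangular matrix R from a QR decomposition of the channel. The function name, argument layout, and numbers are hypothetical and do not describe the proposed hardware.

```python
def child_peds(r_row, y_i, parent_ped, decided, candidates):
    """PEDs of all children of one tree node at level i.

    r_row      : row i of upper-triangular R, i.e. R[i][i], R[i][i+1], ...
    y_i        : i-th entry of the rotated receive vector
    parent_ped : PED accumulated on the levels already decided
    decided    : symbols already fixed on levels i+1, i+2, ...
    candidates : constellation points to try on level i
    """
    # Contribution of the symbols that are already fixed
    interference = sum(r * s for r, s in zip(r_row[1:], decided))
    # Each list entry corresponds to one PED unit working on one child
    return [parent_ped + abs(y_i - interference - r_row[0] * c) ** 2
            for c in candidates]

candidates = [-1, 1]
peds = child_peds(r_row=[2, 1], y_i=3, parent_ped=0.5,
                  decided=[1], candidates=candidates)
# Schnorr-Euchner order: visit the child with the smallest PED first
ordered = sorted(zip(peds, candidates))
```

In software the list comprehension runs serially; the point of the sketch is that the per-child computations do not depend on one another, so several can be evaluated in the same cycle.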
ISBN: 9780769532875 (print)
Although the operations in block ciphers are quite complex, the ciphers share many characteristics, including arithmetic units, operation width, parallel data, and ordinal implementation. This makes them well suited to an ASIP (Application Specific Instruction Set Processor) targeted at block ciphers. In this thesis, a reconfigurable processor architecture is proposed. To improve instruction-level parallelism, this thesis puts forward an instruction bundle structure based on the VLIW architecture, which supports word and sub-word parallel processing. For the cipher arithmetic units, we adopt a specific reconfigurable design, so that the architecture is reconfigurable at the instruction level. In addition, to solve the storage and access bottleneck, this thesis adopts a clustered design with two separate register files that store data and subkeys. Furthermore, this scheme reduces energy and clock cycles. A number of algorithms were implemented successfully on the processor. The prototype is realized on an Altera FPGA. Synthesis, placement, and routing of the processor have been accomplished in 0.18 μm CMOS technology using the Design Compiler tool. Compared with other ASIPs targeted at block ciphers, the results show that the processor achieves relatively high performance in processing block cipher algorithms.
ISBN: 9780769534923 (print)
Ubiquitous and pervasive computing systems are characterized by intelligent sensing and computing. These systems seamlessly understand and respond to the environment with little human intervention. Since such systems are required to be small and unobtrusive, embedded systems play an important role in their design. Furthermore, these systems need to run sophisticated applications in a resource-constrained environment. In this paper we focus on computer vision applications in such systems. As these applications require larger memory and are computationally intensive, optimization of these algorithms is imperative. This paper discusses some optimization techniques and their impact on execution time in a complex real-world face tracking example. In certain scenarios, the requirement may be to suggest a hardware architecture for achieving a specific response time. This is especially important for mission-critical applications in the automotive, medical, and defence fields. However, estimating hardware architecture parameters such as core clock frequency, memory requirement, and the optimal number of parallel execution paths for a given application is not straightforward. In this paper, we also present a structured approach to determining the hardware architecture for a driver assistance and safety application with stringent performance constraints.