ISBN (print): 9783319504964; 9783319504957
This paper annotates the English corresponding units of Chinese clauses in Chinese-English translation and analyzes them statistically. Firstly, based on Chinese clause segmentation, we segment the English target text into corresponding units (clauses) to obtain a Chinese-to-English clause-aligned parallel corpus. Then, we annotate the grammatical properties of the English corresponding clauses in the corpus. Finally, by statistically analyzing the annotated corpus, we find the distribution characteristics of the grammatical properties of English corresponding clauses: there are more clauses (1631, 74.41%) than sentences (561, 25.59%); more major clauses (1719, 78.42%) than subordinate clauses (473, 21.58%); within subordinate clauses, more adverbial clauses (392, 82.88%) than attributive clauses (81, 17.12%) and more non-defining clauses (358, 75.69%) than restrictive relative clauses (115, 24.31%); and more simple clauses (1142, 52.1%) than coordinate clauses (1050, 47.9%).
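For reference, the percentage figures above follow directly from the reported category counts; a minimal Python sketch (with a hypothetical label list standing in for the annotated corpus) shows the computation:

```python
from collections import Counter

# Hypothetical annotated corpus: one grammatical-property label per English
# corresponding unit; the counts match those reported in the abstract.
labels = ["clause"] * 1631 + ["sentence"] * 561

def distribution(counts: Counter) -> dict:
    """Return each category's share of the total as a percentage."""
    total = sum(counts.values())
    return {k: round(100 * v / total, 2) for k, v in counts.items()}

print(distribution(Counter(labels)))  # {'clause': 74.41, 'sentence': 25.59}
```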
ISBN (print): 9781467375894
Image processing algorithms are widely used in the automotive field for ADAS (Advanced Driver Assistance System) purposes. To embed these algorithms, semiconductor companies offer heterogeneous architectures composed of different processing units, often including massively parallel computing units. However, embedding complex algorithms on these SoCs (Systems on Chip) remains a difficult task: because of this heterogeneity, it is not easy to decide how to allocate parts of a given algorithm to the processing units of a given SoC. To help the automotive industry embed algorithms on heterogeneous architectures, we propose a novel approach to predict the performance of image processing algorithms on the different computing units of a given heterogeneous SoC. Our methodology predicts a more or less wide interval of execution time, with a degree of confidence, using only a high-level description of the algorithm to embed and a few characteristics of the computing units.
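As a rough illustration of interval prediction from high-level descriptors only, the sketch below uses a simple roofline-style bound; the descriptor fields, unit characteristics, and margin are assumptions for illustration, not the paper's actual model:

```python
# Illustrative only: a roofline-style interval, not the paper's prediction model.
def predict_time_interval(ops, bytes_moved, unit, confidence_margin=0.3):
    """Predict [lower, upper] execution-time bounds (seconds) for one
    computing unit from a high-level algorithm description."""
    compute_time = ops / unit["peak_flops"]        # compute-bound estimate
    memory_time = bytes_moved / unit["bandwidth"]  # memory-bound estimate
    nominal = max(compute_time, memory_time)       # simple roofline lower bound
    return nominal, nominal * (1 + confidence_margin)

gpu_like_unit = {"peak_flops": 2.0e11, "bandwidth": 2.5e10}  # assumed figures
print(predict_time_interval(ops=5e9, bytes_moved=8e8, unit=gpu_like_unit))
```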
Many-core architectures are becoming a major execution platform in order to cope with the increasing number of applications to be executed in parallel. Such an approach is very attractive for offering users high performance. However, it introduces key security challenges, as malicious applications may compromise the whole system. A defense-in-depth approach relying on both hardware and software mechanisms is thus mandatory to increase the level of protection. This work focuses on the Operating System (OS) level and proposes a set of operating system services able to dynamically create physically isolated secure zones for sensitive applications on many-core platforms. These services are integrated into the ALMOS OS deployed on the TSAR many-core architecture, and evaluated in terms of security level and induced performance overhead.
Electronic System Level (ESL) design plays an important role in multi-processor embedded system-on-chip design. Two important steps in this process are the evaluation of a single design configuration and design space exploration. In the first part of the design process, simple high-level analytical models for application mapping and evaluation are used and modified to accelerate the evaluation of a single design configuration. Using the analytical model, the design space is pruned and explored at high speed with low accuracy. In the second part of the design process, two multi-objective optimization algorithms, based on Particle Swarm Optimization and Simulated Annealing, are proposed to explore the pruned design space with higher accuracy, taking advantage of low-level architectural simulation engines. The results obtained by the proposed algorithms provide the designer with more accurate solutions within an acceptable time. Considering the MJPEG application as the case study, each of these methods produces a set of near-optimal points. Simulation results show that the proposed methods lead to near-optimal design configurations with acceptable accuracy in reasonable time.
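To make the second exploration step concrete, the sketch below shows a simulated-annealing walk over a tiny (cores, frequency) design space; the design space, toy cost model, and scalarization weights are hypothetical stand-ins for the paper's simulation-driven multi-objective setup:

```python
import math, random

CORES = [1, 2, 4, 8, 16]
FREQS = [400, 800, 1200, 1600]  # MHz

def cost(cfg):
    """Weighted sum of toy latency and energy estimates for a configuration.
    In the paper, this role is played by architectural simulation."""
    cores, freq = cfg
    latency = 1e6 / (cores * freq)        # toy latency model
    energy = cores * (freq / 1000) ** 2   # toy energy model
    return 0.5 * latency + 0.5 * energy

def neighbor(cfg):
    cores, freq = cfg
    if random.random() < 0.5:
        cores = random.choice(CORES)
    else:
        freq = random.choice(FREQS)
    return (cores, freq)

def anneal(steps=2000, temp=10.0, cooling=0.995):
    current = best = (random.choice(CORES), random.choice(FREQS))
    for _ in range(steps):
        cand = neighbor(current)
        delta = cost(cand) - cost(current)
        if delta < 0 or random.random() < math.exp(-delta / temp):
            current = cand
            if cost(current) < cost(best):
                best = current
        temp *= cooling
    return best, cost(best)

print(anneal())
```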
Current architectures provide many control knobs for reducing the power consumption of applications, such as reducing the number of cores used or scaling down their frequency. However, choosing the right values for these knobs in order to satisfy requirements on performance and/or power consumption is a complex task, and trying all possible combinations of values is unfeasible since it would require too much time. For these reasons, techniques are needed that allow an accurate estimation of the performance and power consumption of an application when a specific configuration of the control knobs is used. Usually, this is done by executing the application with different configurations and using this information to predict its behaviour when the knob values are changed. However, since this is a time-consuming process, we would like to execute the application in the fewest possible number of configurations. In this work, we consider as control knobs the number of cores used by the application and the frequency of these cores. We show that on most PARSEC benchmark programs, by executing the application in 1% of the total possible configurations and applying a multiple linear regression model, we are able to achieve an average accuracy of 96% in predicting its execution time and power consumption in all the other possible knob combinations.
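A minimal sketch of the regression idea, fitting execution time against the two knobs on a small sample of configurations and predicting an unseen one; the sampled data is synthetic and the feature set is a plain linear model (the paper's exact terms may differ):

```python
import numpy as np

rng = np.random.default_rng(0)
cores = rng.integers(1, 17, size=20)      # sampled core counts (the "1%")
freq = rng.uniform(1.2, 3.0, size=20)     # sampled frequencies (GHz)
exec_time = 100.0 / (cores * freq) + rng.normal(0, 0.1, size=20)  # synthetic

# Design matrix with an intercept and the two knob features.
X = np.column_stack([np.ones_like(freq), cores, freq])
coeffs, *_ = np.linalg.lstsq(X, exec_time, rcond=None)

# Predict an unseen configuration, e.g. 8 cores at 2.4 GHz.
unseen = np.array([1.0, 8, 2.4])
print("predicted execution time:", unseen @ coeffs)
```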
ISBN (print): 9781509028610
Convolutional Neural Networks (CNNs) are the state-of-the-art deep learning approach employed in various applications due to their remarkable performance. Convolutions generally dominate the overall computational complexity of CNNs and thus consume most of the computational power in real implementations. In this paper, efficient hardware architectures incorporating the parallel fast finite impulse response (FIR) algorithm (FFA) for CNN convolution implementations are discussed. The theoretical derivation of 3- and 5-parallel FFAs is presented, and the corresponding 3- and 5-parallel fast convolution units (FCUs) are proposed for the most commonly used 3 × 3 and 5 × 5 convolutional kernels in CNNs, respectively. Compared to conventional CNN convolution architectures, the proposed FCUs significantly reduce the number of multiplications used in convolutions. Additionally, the FCUs minimize the number of reads from the feature map memory. Furthermore, a reconfigurable FCU architecture which suits the convolutions of both 3 × 3 and 5 × 5 kernels is proposed. Based on this, an efficient top-level architecture for processing a complete convolutional layer in a CNN is developed. To quantify the benefits of the proposed FCUs, the design of an FCU is coded in RTL and synthesized with TSMC 90 nm CMOS technology. The implementation results demonstrate that 30% and 36% of the computational energy can be saved compared to conventional solutions with 3 × 3 and 5 × 5 kernels in CNNs, respectively.
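The 3- and 5-parallel FFAs derived in the paper generalize the classic 2-parallel fast FIR identity, which trades the four subfilter products of a 2-parallel polyphase FIR for three shared products. A small numerical check of that 2-parallel identity (my own illustration, not the paper's derivation) is:

```python
import numpy as np

# Classic 2-parallel fast FIR identity. Split filter h and input x into
# even/odd polyphase components; the fast form reuses three products
# (p0, p1, p2) instead of the four needed by the direct polyphase form.
rng = np.random.default_rng(1)
h = rng.standard_normal(8)   # even-length filter
x = rng.standard_normal(32)  # even-length input block

h0, h1 = h[0::2], h[1::2]    # even/odd filter taps
x0, x1 = x[0::2], x[1::2]    # even/odd input samples

p0 = np.convolve(h0, x0)             # shared product 1
p1 = np.convolve(h1, x1)             # shared product 2
p2 = np.convolve(h0 + h1, x0 + x1)   # shared product 3

# Direct form of the odd output phase needs two extra products...
y1_direct = np.convolve(h0, x1) + np.convolve(h1, x0)
# ...while the fast form recombines the three shared products.
y1_fast = p2 - p0 - p1
assert np.allclose(y1_direct, y1_fast)
print("2-parallel FFA identity verified")
```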
ISBN (print): 9781509052530
It is now a trend that computing power through parallelism is provided by multi-core systems or heterogeneous architectures for High Performance Computing (HPC) and scientific computing. Although many algorithms have been proposed and implemented using sequential computing, alternative parallel solutions provide more suitable, higher-performance solutions to the same problems. In this paper, three parallelization strategies are proposed and implemented for a dynamic-programming-based cloud smoothing application, using both shared-memory and non-shared-memory approaches. The experiments are performed on the NVIDIA GeForce GT750m and Tesla K20m, two GPU accelerators of the Kepler architecture. A detailed performance analysis is presented, covering partition granularity at the block and thread levels, memory access efficiency, and computational complexity. The evaluations show that the parallel implementations closely approximate the sequential results with high efficiency, and that these strategies can be adopted in similar data analysis and processing applications.
ISBN (print): 9781509039005
Computer hardware is currently moving towards heavily parallelized architectures with multiprocessor, multicore, and chip-multithreaded designs. Cache memory, the fastest component of the memory hierarchy, adapts to this new kind of parallel system in order to provide the promised performance increase. Current cache designs have limitations that can be turned into optimization opportunities in both hardware and software. This paper provides a detailed study of cache performance in multicore processors, considering critical hardware aspects. A new solution is proposed to improve current performance: an optimized replacement policy for the shared cache level. In experiments run on four- and eight-core setups in a multicore simulator, the proposed enhancements achieve up to a 30% execution speed increase over the default setup.
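For context, a replacement policy decides which line to evict when a cache set is full. The sketch below models a generic set-associative cache with plain LRU replacement, i.e. the kind of baseline a shared-level policy would modify; the paper's optimized policy is not reproduced here, and the geometry parameters are arbitrary:

```python
from collections import OrderedDict

class SetAssociativeCache:
    """Toy set-associative cache with LRU replacement (baseline policy only)."""
    def __init__(self, num_sets=64, ways=8, line_size=64):
        self.num_sets, self.ways, self.line_size = num_sets, ways, line_size
        self.sets = [OrderedDict() for _ in range(num_sets)]  # tag -> None

    def access(self, address):
        """Return True on hit, False on miss; evict the LRU line when full."""
        line = address // self.line_size
        index, tag = line % self.num_sets, line // self.num_sets
        ways = self.sets[index]
        if tag in ways:
            ways.move_to_end(tag)      # refresh LRU order on a hit
            return True
        if len(ways) >= self.ways:
            ways.popitem(last=False)   # evict the least recently used line
        ways[tag] = None
        return False

cache = SetAssociativeCache()
hits = sum(cache.access(a) for a in [0, 64, 0, 4096, 64, 0])
print("hits:", hits)  # 3
```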
ISBN (print): 9781479987481
Current research on audio signal processing algorithms for digital hearing aid devices is pushing performance demands ever higher. Nowadays, there is a trend of using several microphones in such systems (e.g., binaural systems) to improve the speech perception of a hearing-impaired person. However, there is a lack of mobile platforms capable of processing such algorithms in real time. This paper presents a new mobile SoC-based evaluation and development platform (including a multichannel audio extension board), designed not only for evaluating new hearing aid signal processing algorithms but also for developing new hardware co-processor architectures that could be integrated into current hearing aid devices to improve their performance with minimal extra energy consumption.
ISBN (print): 9781509028269
Computer scientists and programmers face the difficulty of improving the scalability of their applications while using only conventional programming techniques. As a baseline hypothesis of this paper, we assume that an advanced runtime system can be used to take full advantage of the available parallel resources of a machine in order to achieve the highest possible parallelism. In this paper we present the capabilities of HPX, a distributed runtime system for parallel applications of any scale, to achieve the best possible scalability through asynchronous task execution [1]. OP2 is an active library which provides a framework for the parallel execution of unstructured grid applications on different multi-core/many-core hardware architectures [2]. OP2 generates code which uses OpenMP for loop parallelization within an application code for both single-threaded and multi-threaded machines. In this work we modify the OP2 code generator to target HPX instead of OpenMP, i.e. we port the parallel simulation backend of OP2 to utilize HPX. We compare the performance results of the different parallelization methods using HPX and OpenMP for loop parallelization within the Airfoil application. The results of strong scaling and weak scaling tests for the Airfoil application on one node with up to 32 threads are presented. Using HPX for the parallelization of OP2 gives an improvement in performance of 5%-21%. By modifying the OP2 code generator to use HPX's parallel algorithms, we observe scaling improvements of about 5% compared to OpenMP. To fully exploit the potential of HPX, we adapted the OP2 API to expose a future- and dataflow-based programming model and applied this technique to parallelize the same Airfoil application. We show that the dataflow-oriented programming model, which automatically creates an execution tree representing the algorithmic data dependencies of our application, improves the overall scaling results by about 21% compared to OpenMP. Our results show ...
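As a language-neutral illustration of the future/dataflow idea (HPX itself is a C++ runtime; this sketch does not use its API), each stage becomes a task whose inputs are the futures of the stages it depends on, so independent stages can be scheduled concurrently while dependencies form the execution tree; the kernel functions below are hypothetical stand-ins:

```python
from concurrent.futures import ThreadPoolExecutor

def compute_flux(cell_block):
    return [x * 0.5 for x in cell_block]            # stand-in for a grid kernel

def update_cells(flux_a, flux_b):
    return [a + b for a, b in zip(flux_a, flux_b)]  # depends on both fluxes

with ThreadPoolExecutor() as pool:
    # Independent tasks: may run in parallel.
    f_a = pool.submit(compute_flux, [1.0, 2.0, 3.0])
    f_b = pool.submit(compute_flux, [4.0, 5.0, 6.0])
    # Dependent task: expressed as a continuation over the two futures.
    f_update = pool.submit(lambda: update_cells(f_a.result(), f_b.result()))
    print(f_update.result())
```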