The tremendous number of sensors and smart objects being deployed in the Internet of Things (IoT) gives IT systems the potential to detect and react to live situations. To tap this potential, complex event processing (CEP) systems offer means to efficiently detect event patterns (complex events) in the sensor streams and thereby help realize a "distributed intelligence" in the IoT. With the increasing number of data sources and the increasing volume at which data is produced, parallelizing event detection is crucial to limit the time events must be buffered before they can actually be processed. In this paper, we propose a pattern-sensitive partitioning model for data streams that achieves a high degree of parallelism in detecting event patterns which formerly could be consistently detected only sequentially or at a low parallelization degree. Moreover, we propose methods to dynamically adapt the parallelization degree to limit the buffering imposed on event detection in the presence of dynamic changes to the workload. Extensive evaluations of the system behavior show that the proposed partitioning model allows for a high degree of parallelism and that the proposed adaptation methods are able to meet a buffering limit for event detection under high and dynamic workloads.
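A minimal sketch of the idea behind key-based stream partitioning for parallel pattern detection: events that can jointly form a pattern instance are routed to the same partition, so partitions can be processed independently. The function and key choice are illustrative assumptions, not the paper's actual partitioning model.

```python
from collections import defaultdict

def partition_stream(events, num_workers, key_fn):
    """Assign each event to a partition based on a pattern-relevant key,
    so all events that may jointly form one pattern instance land in the
    same partition and partitions can be processed in parallel."""
    partitions = defaultdict(list)
    for event in events:
        partitions[hash(key_fn(event)) % num_workers].append(event)
    return partitions

# Example: per-sensor threshold patterns, so the sensor id is the key.
events = [{"sensor": "s1", "value": 10}, {"sensor": "s2", "value": 99},
          {"sensor": "s1", "value": 42}]
parts = partition_stream(events, num_workers=2, key_fn=lambda e: e["sensor"])
```

Patterns that correlate events across keys cannot be split this way, which is why pattern-*sensitive* partitioning (choosing the key from the pattern definition) matters.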
In the Big data era, workflow systems must embrace data parallel computing techniques for efficient data analysis and analytics. Here, an easy-to-use, scalable approach is presented to build and execute Big data applications using actor-oriented modeling in data parallel computing. Two bioinformatics use cases for next-generation sequencing data analysis demonstrate the approach's feasibility.
ISBN:
(Print) 9781479956661
Complex Event Processing (CEP) systems enable applications to react to live situations by detecting event patterns (complex events) in data streams. With the increasing number of data sources and the increasing volume at which data is produced, parallelizing event detection is becoming tremendously important to limit the time events must be buffered before they can actually be processed by an event detector, called an event processing operator. In this paper, we propose a pattern-sensitive partitioning model for data streams that achieves a high degree of parallelism for event patterns which formerly could be consistently detected only sequentially or at a low parallelization degree. Moreover, we propose methods to dynamically adapt the parallelization degree to limit the buffering imposed on event detection in the presence of dynamic changes to the workload. Extensive evaluations of the system behavior show that the proposed partitioning model allows for a high degree of parallelism and that the proposed adaptation methods are able to meet a buffering limit for event detection under high and dynamic workloads.
Reducing the effects of off-chip memory access latency is key to efficiently exploiting embedded multi-core platforms. We consider architectures with a multi-core computation fabric that has its own fast, small memory, to which the data blocks to be processed are fetched from external memory by a DMA (direct memory access) engine, employing a double- or multiple-buffering scheme to avoid processor idling. In this paper we focus on application programs that process two-dimensional data arrays, and we automatically determine the size and shape of the portions of the data array that are subject to a single DMA call, based on hardware and application parameters. When the computations on different array elements are completely independent, the asymmetry of the memory structure always favors one-dimensional horizontal pieces of memory, while when the computation on a data element shares some data with its neighbors, there is pressure toward more "square" shapes to reduce the amount of redundant data transfers. We provide an analytic model for this optimization problem and validate our results by running a mean filter application on the au. simulator. (C) 2013 Elsevier B.V. All rights reserved.
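The shape trade-off can be illustrated with a toy cost model (an assumption for illustration, not the paper's analytic model): a block with a halo of shared neighbor data re-fetches the halo that adjacent blocks also transfer, and for a fixed block area a square shape minimizes that redundancy.

```python
def transfer_overhead(w, h, r):
    """Words transferred per useful word when a w-by-h block needs an
    r-wide halo of neighboring elements; the halo is fetched redundantly
    by adjacent blocks."""
    return (w + 2 * r) * (h + 2 * r) / (w * h)

# For a fixed block area of 1024 elements and a halo of r=1, a square
# block transfers less redundant data than a flat horizontal strip:
flat = transfer_overhead(1024, 1, 1)    # 1024x1 horizontal strip
square = transfer_overhead(32, 32, 1)   # 32x32 "square" block
```

With no neighbor sharing (r = 0) the overhead is 1 for any shape, and horizontal strips win purely on memory-layout grounds, matching the abstract's two cases.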
In this paper, we present an FPGA-based fast image warping method that applies data parallelization schemes. Parallelizing accesses to pixels relieves not only the latency problem of warping but also the bandwidth requirements on off-chip memory. The LUT data parallelization scheme efficiently replaces parallel arithmetic operations without increasing either the memory size for LUT entries or the clock frequency. Two implementations with different characteristics demonstrate the effectiveness and efficiency of the proposed method.
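A minimal sketch of LUT-based warping in general (the function names and the trivial mirror warp are illustrative assumptions, not the paper's FPGA design): the per-pixel coordinate arithmetic is precomputed into a lookup table, so the runtime warp reduces to table lookups, which are easy to replicate in parallel.

```python
def build_warp_lut(width, height, warp_fn):
    """Precompute the source coordinate for every destination pixel, so
    the per-pixel warp needs only a table lookup, not arithmetic."""
    return [[warp_fn(x, y) for x in range(width)] for y in range(height)]

def apply_warp(src, lut):
    """Gather source pixels through the LUT (each lookup is independent,
    so rows or pixels can be processed in parallel)."""
    h, w = len(lut), len(lut[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            sx, sy = lut[y][x]
            out[y][x] = src[sy][sx]
    return out

# Example: a horizontal mirror as a trivial "warp".
src = [[1, 2, 3], [4, 5, 6]]
lut = build_warp_lut(3, 2, lambda x, y: (2 - x, y))
mirrored = apply_warp(src, lut)  # [[3, 2, 1], [6, 5, 4]]
```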
In this paper we investigate a general approach to automating some deployment decisions for a certain class of applications on multi-core computers. We consider data-parallelizable programs that use the well-known double-buffering technique to bring data from slow off-chip memory to the local memory of the cores via a DMA (direct memory access) mechanism. Based on the computation time and size of elementary data items, as well as DMA characteristics, we derive optimal and near-optimal values for the number of blocks that should be clustered in a single DMA command. We then extend the results to the case where the computation for one data item needs some data in its neighborhood. In this setting we characterize the performance of several alternative mechanisms for data sharing. Our models are validated experimentally using a cycle-accurate simulator of the Cell Broadband Engine architecture.
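The double-buffering schedule itself can be sketched as follows. This is a sequential schematic of the *ordering* only (on the Cell, `fetch` would be an asynchronous DMA that genuinely overlaps `compute`); the function names are illustrative assumptions.

```python
def process_double_buffered(blocks, fetch, compute):
    """Double buffering: the fetch for block i+1 is issued before
    computing on block i, so on real hardware the DMA transfer and the
    computation overlap and the core never idles waiting for data."""
    results = []
    if not blocks:
        return results
    current = fetch(blocks[0])                # fill the first buffer
    for i in range(len(blocks)):
        # Issue the next fetch early (async DMA on real hardware)...
        nxt = fetch(blocks[i + 1]) if i + 1 < len(blocks) else None
        # ...then compute on the buffer that is already resident.
        results.append(compute(current))
        current = nxt
    return results

results = process_double_buffered([1, 2, 3],
                                  fetch=lambda b: b * 10,
                                  compute=lambda d: d + 1)
```

The paper's question is then how many elementary blocks to cluster per DMA command so that per-transfer overhead is amortized without making buffers too large for local memory.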
ISBN:
(Print) 9783642157592
The feed-forward multi-layer neural networks have significant importance in speech recognition. A new parallel-training tool TNet was designed and optimized for multiprocessor computers. The training acceleration rates are reported on a phoneme-state classification task.
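A minimal sketch of the data-parallel training idea in general (a toy stand-in, not TNet's implementation): each worker computes the gradient on its own shard of the mini-batch, and the averaged gradient updates the shared weights. Here workers run sequentially for clarity; in a real tool they would be threads or processes.

```python
def data_parallel_step(weights, batches, grad_fn, lr):
    """One data-parallel SGD step: each "worker" computes the gradient
    on its own mini-batch shard; the gradients are averaged and applied
    once to the shared weights."""
    grads = [grad_fn(weights, b) for b in batches]      # one per worker
    avg = [sum(g) / len(grads) for g in zip(*grads)]    # all-reduce mean
    return [w - lr * g for w, g in zip(weights, avg)]

# Toy example: two shards whose gradients cancel leave weights unchanged.
new_w = data_parallel_step([1.0], [0.0, 2.0],
                           grad_fn=lambda w, b: [w[0] - b], lr=0.5)
```

Averaging shard gradients makes the step equivalent to one large-batch update, which is what lets the speedup scale with the number of workers.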
ISBN:
(Print) 9781424437566
Multiprocessor platforms are gaining market share as a solution to boost processor performance beyond the technological limitations present in single-processor chips. Multiprocessors in embedded systems also have a future, in particular with applications like SDR (Software Defined Radio), where both high performance and high adaptability are required. Implementing cryptographic algorithms on embedded systems is also a hot topic for the rapidly developing wireless communication networks. In this paper we examine the implementation of the computation-intensive block ciphers AES and TDES on a 16-processor platform implemented on an FPGA. We implemented the CBC operation mode, which suits mass encryption, on the platform and obtained a linear speedup in computation time.
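A sketch of why CBC suits this setting: the chaining c[i] = E(p[i] XOR c[i-1]) is inherently sequential within one stream, so mass encryption parallelizes *across* independent streams, one per core. The XOR "cipher" below is a toy stand-in (NOT real AES/TDES), used only to show the chaining structure.

```python
def cbc_encrypt(blocks, iv, encrypt_block):
    """CBC chaining: c[i] = E(p[i] XOR c[i-1]). Each block depends on
    the previous ciphertext, so one stream cannot be parallelized, but
    independent streams can each be assigned to a different core."""
    out, prev = [], iv
    for p in blocks:
        c = encrypt_block(p ^ prev)
        out.append(c)
        prev = c
    return out

# Toy stand-in block cipher (NOT AES/TDES): XOR with a fixed key byte.
toy_e = lambda b: b ^ 0xA5
streams = [([1, 2, 3], 0), ([7, 8, 9], 0)]
# Independent (stream, IV) pairs could be dispatched one per processor:
ciphertexts = [cbc_encrypt(blocks, iv, toy_e) for blocks, iv in streams]
```

With many independent streams the per-core work is identical and embarrassingly parallel, consistent with the linear speedup the paper reports.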
This master's thesis focuses on parallelizing neural network training for speech recognition. Two parallelization strategies were implemented and compared. The first is data parallelization, splitting the training across several POSIX threads. The second is node parallelization using CUDA, the platform for general-purpose computing on graphics cards. The first strategy achieved a 4x speedup; with CUDA, nearly a 10x speedup was reached. Training used the Stochastic Gradient Descent algorithm with error backpropagation. After a short introduction, the second chapter motivates the work and places the problem in the context of speech recognition. The third chapter is theoretical and discusses neural networks and the training method. The following chapters focus on design and implementation and describe the iterative development of the project. The last, extensive chapter describes the test system and presents the results of the experiments. The conclusion briefly evaluates the achieved results and outlines prospects for further development of the project.