Existing research methods are largely intended to be conducted colocated and synchronously with a study population, but this approach is not feasible with remote or distributed populations. We describe a needs assessm...
The paper presents a parallel implementation of the search for the most similar subsequence in a time series on a computer cluster whose nodes are based on Intel MIC accelerators. The algorithm involves three levels of data parallelism. The first level partitions the time series into equal-length fragments, each of which is processed on a separate node of the cluster; nodes interact using MPI. The second level divides each fragment into equal-length segments, each of which is processed by a separate thread using OpenMP. The third level balances the load between the CPU and the accelerator: the CPU prunes dissimilar subsequences, and the accelerator performs the computationally heavy similarity-measure calculations. The experimental results confirm the efficiency of the algorithm.
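The decomposition described in this abstract can be illustrated with a minimal single-machine sketch. It is not the authors' implementation: there is no MPI, OpenMP, or accelerator offload here, and the similarity measure is assumed to be squared Euclidean distance with early abandoning, which the abstract does not specify. The names `sq_dist`, `search_fragment`, and `best_match` are hypothetical. Python threads only illustrate how the work is split; they do not give a real speedup for this CPU-bound loop.

```python
from concurrent.futures import ThreadPoolExecutor


def sq_dist(a, b, best_so_far):
    """Squared Euclidean distance (an assumed stand-in measure).

    Abandons early and returns infinity once the running sum can no
    longer beat best_so_far; this plays the role of the pruning step.
    """
    total = 0.0
    for x, y in zip(a, b):
        total += (x - y) ** 2
        if total >= best_so_far:
            return float("inf")
    return total


def search_fragment(series, query, start, stop):
    """Best match among subsequences whose start index lies in [start, stop)."""
    m = len(query)
    best_d, best_i = float("inf"), -1
    for i in range(start, stop):
        d = sq_dist(series[i:i + m], query, best_d)
        if d < best_d:
            best_d, best_i = d, i
    return best_d, best_i


def best_match(series, query, workers=4):
    """Split the start positions into equal-length ranges, one per worker.

    Each worker reads up to len(query) - 1 points past its range, so
    subsequences that cross a fragment boundary are not lost.
    """
    n_starts = len(series) - len(query) + 1
    if n_starts <= 0:
        raise ValueError("query is longer than the series")
    step = -(-n_starts // workers)  # ceiling division
    bounds = [(s, min(s + step, n_starts)) for s in range(0, n_starts, step)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(lambda b: search_fragment(series, query, *b), bounds))
    return min(results)  # smallest distance wins; ties go to the lower index
```

For example, `best_match([0, 1, 2, 3, 9, 8, 7, 3, 2, 1], [9, 8, 7])` returns the distance and start index of the exact match at position 4.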
The paper investigates the problem of forming blocks of operations and threads of a parallel algorithm so as to reduce the number of accesses to global memory and to make efficient use of the caches and shared memory of a graphics processor. We formulate and prove statements that estimate the volume of communication transactions generated by alternative block sizes and that minimize the number of cache misses by exploiting the temporal and spatial locality of data. The research is constructive and allows a software implementation for practical use.
ISBN: 9781509018574 (print)
Cloud computing has emerged as a service model that enables on-demand network access to a large number of virtualized resources and applications with minimal management effort and at low cost. The spread of Cloud computing technologies has made it possible to handle complex applications such as Scientific Workflows, which consist of a set of computation-intensive and data-manipulation operations. Thanks to the elasticity of Cloud resources, such Workflows can dynamically provision the compute and storage resources needed to execute their tasks. However, the dynamic nature of the Cloud brings new challenges, as some allocated resources may be overloaded or inaccessible during the execution of the Workflow. Moreover, for data-intensive tasks, the allocation strategy should consider data placement constraints, since data transmission time can increase notably in this case, which in turn increases the overall completion time and cost of the Workflow. Likewise, for computation-intensive tasks, the allocation strategy should consider the type of the allocated virtual machines, more specifically their CPU, memory and network capacities. A critical challenge is therefore how to schedule the Workflow tasks efficiently on Cloud resources so as to optimize the overall quality of service. In this paper, we propose a QoS-aware algorithm for scheduling Scientific Workflows that aims to improve the overall quality of service (QoS) by considering execution time, data transmission time, cost, resource availability and data placement constraints. We extended the parallel Cat Swarm Optimization (PCSO) algorithm to implement our proposed approach. We tested our algorithm on two sample Workflows of different scales and compared the results to those given by the standard PSO, CSO and PCSO algorithms. The results show that our proposed algorithm improves the overall quality of service of the tested Workflows.
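One way to picture how the metrics named in this abstract could be combined is a weighted objective that a swarm optimizer would minimize. The sketch below is purely illustrative and is not the authors' fitness function: the weights, the dictionary fields (`work`, `data_size`, `data_site`, `speed`, `bandwidth`, `price`, `site`, `available`), and the function name `qos_fitness` are all assumptions, and summing per-task times ignores task dependencies and parallel execution.

```python
def qos_fitness(assignment, tasks, vms, weights=(0.4, 0.3, 0.3)):
    """Score a task-to-VM mapping; lower is better (hypothetical model).

    assignment[i] is the index of the VM that runs tasks[i].
    """
    w_time, w_transfer, w_cost = weights
    exec_time = transfer_time = cost = 0.0
    for task, vm_idx in zip(tasks, assignment):
        vm = vms[vm_idx]
        if not vm["available"]:
            # Availability is treated as a hard constraint in this sketch.
            return float("inf")
        run = task["work"] / vm["speed"]
        exec_time += run
        # Data placement: pay a transfer penalty only when the task's
        # input data is stored at a different site from the chosen VM.
        if task["data_site"] != vm["site"]:
            transfer_time += task["data_size"] / vm["bandwidth"]
        cost += run * vm["price"]
    return w_time * exec_time + w_transfer * transfer_time + w_cost * cost
```

An optimizer such as PSO or CSO would evaluate this score for each candidate assignment and keep the lowest-scoring one; the optimizer itself is omitted here.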
The paper considers a parallel algorithm for solving multiextremal optimization problems. It examines the implementation of the algorithm on state-of-the-art computing systems that use the Intel Xeon Phi coprocessor. Two approaches to parallelizing the algorithm are considered, both of which take into account how computationally expensive the objective function is. The speedup of the algorithm using Xeon Phi over the CPU-only version is confirmed experimentally. The computational experiments were carried out on the Lobachevsky supercomputer.
The paper describes methods for substantially speeding up the parallel breadth-first search algorithm. The main obstacles to parallelizing breadth-first search effectively are workload imbalance within computing nodes and the significant volume of data transferred at the end of every iteration of the algorithm. Two methods are proposed to overcome these challenges. The first method distributes the workload between OpenMP threads within a single node. The second method reduces the volume of transferred data by using hybrid graph traversal.
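The abstract does not define "hybrid graph traversal". One common reading, assumed here, is direction-optimizing breadth-first search, which switches between top-down and bottom-up steps depending on the size of the frontier. The sequential sketch below illustrates that idea only; the function name `hybrid_bfs` and the `switch_ratio` threshold are hypothetical, and the OpenMP load balancing and inter-node transfers mentioned in the abstract are not modeled.

```python
def hybrid_bfs(adj, source, switch_ratio=0.25):
    """Return BFS levels for an undirected graph (-1 means unreachable).

    adj is a list of neighbor lists. A level is expanded bottom-up when
    the frontier holds more than switch_ratio of all vertices, and
    top-down otherwise.
    """
    n = len(adj)
    level = [-1] * n
    level[source] = 0
    frontier = {source}
    depth = 0
    while frontier:
        depth += 1
        next_frontier = set()
        if len(frontier) > switch_ratio * n:
            # Bottom-up step: every unvisited vertex checks whether any
            # of its neighbors is in the current frontier.
            for v in range(n):
                if level[v] == -1 and any(u in frontier for u in adj[v]):
                    level[v] = depth
                    next_frontier.add(v)
        else:
            # Top-down step: every frontier vertex visits its neighbors.
            for u in frontier:
                for v in adj[u]:
                    if level[v] == -1:
                        level[v] = depth
                        next_frontier.add(v)
        frontier = next_frontier
    return level
```

Both step types produce the same levels; the choice only changes how many edges are inspected. In a distributed setting, a bottom-up step can be driven by a compact frontier bitmap instead of explicit vertex lists, which is one way such a scheme may reduce the data exchanged per iteration.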
In recent years, the proliferation of highly dynamic graph-structured data streams has fueled the demand for real-time data analytics. For instance, detecting recent trends in social networks enables new applications in ar...
ISBN: 9781509037070 (print)
Safety-critical applications require reliable hardware platforms with deterministic behavior. Given the increasing demand for performance, current single-core solutions are no longer sufficient. Classical multi-core processors are designed for the general application case and provide high performance at the expense of determinism and reliability. In safety-critical applications, all required tasks are already known at development time. They are specified by a system description such as AUTOSAR. Thus, a hardware architecture that provides one core for each task and one physical link for each data exchange between tasks can be derived. However, such a highly application-specific architecture is not available. The latest FPGA technologies now provide enough resources to integrate several soft-core processors in one low-cost chip. Furthermore, the cores and their connections can be arranged flexibly in an FPGA. To bridge the gap between safety-critical applications and FPGAs, this approach provides a toolchain, as an addition to existing AUTOSAR design tools, that automatically generates a specific hardware architecture from the metadata of an AUTOSAR description. By drastically reducing the complexity of the hardware platform, a reconfigurable, reliable, deterministic, distributed (R2-D2) hardware architecture can be created. The results show that safety-critical tasks can be executed deterministically in parallel on one chip and that multiple applications can be mapped to one low-cost FPGA. Furthermore, the latency of the system is reduced considerably, which opens up new application areas.
ISBN: 9781509050819 (print)
The proceedings contain 74 papers. The topics discussed include: runahead cache misses using bloom filter; performance and portability studies with OpenACC accelerated version of GTC-P; outer-loop auto-vectorization for SIMD architectures based on open64 compiler; on routing of multiple concurrent user requests in multi-radio multi-channel wireless mesh networks; managing broadband access network with a SDN-based system; efficient scheduling strategy for mobile charger in wireless rechargeable sensor networks; efficient data retrieval algorithm for multi-request in multi-antenna wireless networks; accurate evaluation of bivariate polynomials; bilateral sampling randomized singular value decomposition; mePaaS: mobile-embedded platform as a service for distributing fog computing to edge nodes; a survey of challenging issues and approaches in mobile cloud computing; energy efficient scheduling of real time tasks on large systems; energy aware scheduling on heterogeneous multiprocessors with DVFS and duplication; green-aware online resource allocation for geo-distributed cloud data centers on multi-source energy; optimal scheduling algorithm of MapReduce tasks based on QoS in the hybrid cloud; towards an efficient maintenance of address space overflow for array based storage system; making user-level VMM for deterministic parallelism nonblocking and efficient; depth feature based accurate saliency detection for 3D images; and a variable Markovian based outlier detection method for multi-dimensional sequence over data stream.
ISBN: 9781509055869 (print)
This paper presents a concurrent implementation of a handwritten digit classifier based on a General Purpose Graphical Processing Unit (GPGPU). Differences in handwriting style make patterns hard to recognize, but a neural network handles the task well. Software packages such as Torch and MATLAB support multiple training algorithms for training a network. Choosing an appropriate training algorithm for a specific application can increase the speed of training. Furthermore, the computational power of GPUs can significantly improve the training and classification speed of a neural network. In this work, the Modified National Institute of Standards and Technology (MNIST) database of handwritten digits is used to train the network. The accuracy and training time of the digit classifier are evaluated for different algorithms, and concurrent training is then performed by exploiting the power of the GPU. The trained parameters are imported and used for concurrent classification with the Compute Unified Device Architecture (CUDA) platform, which can be useful in numerous practical applications. Finally, the results of sequential and concurrent training and classification are compared.