Recently, many studies have mapped cryptography algorithms onto graphics processors (GPUs) and achieved substantial performance gains. This paper does not focus on the performance of a specific program tuned with every available algorithmic optimization, but on the intrinsic architectural reasons within the GPU for this performance improvement. We therefore present a study of several block encryption algorithms (AES, TRI-DES, RC5, TWOFISH, and the chained block ciphers formed by their combinations) running on a GPU using CUDA. We introduce our CUDA implementations and investigate the programs' behavioral characteristics and their impact on performance in four respects. We find that the number of threads used by a CUDA program fundamentally affects overall performance. Many block encryption algorithms benefit from shared memory when its capacity is large enough to hold the lookup tables. Data stored in device memory should be organized deliberately to avoid performance degradation. In addition, communication between host and device may turn out to be the bottleneck of a program. Through these analyses we hope to find an effective way to optimize a CUDA program, as well as to reveal architectural features that would better support block encryption applications.
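As context for the word-oriented, table-friendly operations these ciphers perform per block, a minimal pure-Python sketch of one of them, RC5-32/12, is given below. It is illustrative only and independent of the paper's CUDA kernels; in a CUDA implementation each thread would typically run this per-block loop on its own pair of words.

```python
# Minimal RC5 sketch (w = 32-bit words, r = 12 rounds); illustrative only,
# not the paper's CUDA code. Each call processes one 64-bit block (a, b).
W, R = 32, 12
P, Q = 0xB7E15163, 0x9E3779B9          # RC5 magic constants for w = 32
MASK = (1 << W) - 1

def rotl(x, s):
    s %= W
    return ((x << s) | (x >> (W - s))) & MASK

def rotr(x, s):
    s %= W
    return ((x >> s) | (x << (W - s))) & MASK

def expand_key(key: bytes):
    # Standard RC5 key schedule: mix the user key into table S.
    c = len(key) // 4
    L = [int.from_bytes(key[4 * i:4 * i + 4], "little") for i in range(c)]
    S = [(P + i * Q) & MASK for i in range(2 * R + 2)]
    A = B = i = j = 0
    for _ in range(3 * max(len(S), c)):
        A = S[i] = rotl((S[i] + A + B) & MASK, 3)
        B = L[j] = rotl((L[j] + A + B) & MASK, A + B)
        i, j = (i + 1) % len(S), (j + 1) % c
    return S

def encrypt_block(a, b, S):
    a, b = (a + S[0]) & MASK, (b + S[1]) & MASK
    for i in range(1, R + 1):
        a = (rotl(a ^ b, b) + S[2 * i]) & MASK      # data-dependent rotation
        b = (rotl(b ^ a, a) + S[2 * i + 1]) & MASK
    return a, b

def decrypt_block(a, b, S):
    for i in range(R, 0, -1):
        b = rotr((b - S[2 * i + 1]) & MASK, a) ^ a
        a = rotr((a - S[2 * i]) & MASK, b) ^ b
    return (a - S[0]) & MASK, (b - S[1]) & MASK

S = expand_key(b"0123456789abcdef")
ct = encrypt_block(0x01234567, 0x89ABCDEF, S)
```

A per-thread CUDA kernel would keep S (and any cipher lookup tables) in shared memory when it fits, which is the effect the abstract measures.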
Semi-structured data differs from traditional data models: the data exists prior to any schema, and a semi-structured data model describes the structural information of the data rather than imposing mandatory constraints. Structure discovery in semi-structured data has therefore become the first step of knowledge discovery. In this paper, the concept of hierarchical data is adopted, and a counting principle of "cumulative transformation" together with a hierarchical transactional database is described. Accordingly, an SHDP-mine mining algorithm based on the SHDP-tree and a basic schema for mining semi-structured hierarchical data are presented. Finally, its effectiveness and efficiency are validated both theoretically and by experimental analysis.
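The abstract does not give SHDP-mine's details, but the "cumulative" counting idea over a hierarchy can be sketched: an item occurrence also contributes support to each of its ancestors. The hierarchy and item names below are illustrative assumptions, not the paper's data:

```python
from collections import Counter

# Hypothetical item hierarchy: child -> parent (None marks a root).
parent = {"milk": "dairy", "cheese": "dairy", "dairy": "food",
          "bread": "food", "food": None}

def cumulative_support(transactions):
    """Count each item and, cumulatively, every ancestor it rolls up to."""
    support = Counter()
    for t in transactions:
        seen = set()
        for item in t:
            node = item
            while node is not None and node not in seen:
                seen.add(node)        # count each node once per transaction
                node = parent[node]
        support.update(seen)
    return support

txns = [{"milk", "bread"}, {"cheese"}, {"milk", "cheese"}]
sup = cumulative_support(txns)
```

With these three transactions, "dairy" reaches support 3 even though it never appears literally, which is what lets patterns emerge at higher levels of the hierarchy.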
The paper presents a workflow model with just-in-time selection of services to execute workflow tasks. The suitability of the approach is demonstrated for scientific and business uses in which service availability changes over time, so that services should be chosen at runtime rather than before the workflow is executed. It is shown that for a scientific workflow with repeatable simulations, the algorithm selects services to minimize the workflow execution time. For a business assembly/distribution workflow, the algorithm selects services to minimize the product of the execution time and the sum of service costs for the workflow. These simulations were run in a workflow execution environment implemented by the author and deployed in BeesyCluster. Implementation details and the overhead of the solution are presented.
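The two selection criteria in the abstract can be sketched as a per-task, just-in-time choice made when a task becomes ready. The candidate services and their attributes below are hypothetical, and applying the time-cost product per task is only a heuristic reading of the workflow-level criterion:

```python
def pick_service(candidates, mode):
    """Choose a service for one ready task, just in time.

    candidates: dicts with estimated 'time' and 'cost' for this task.
    mode: 'scientific' minimizes time alone; 'business' minimizes the
    time * cost product (a per-task heuristic for the abstract's
    workflow-level product criterion).
    """
    if mode == "scientific":
        return min(candidates, key=lambda s: s["time"])
    return min(candidates, key=lambda s: s["time"] * s["cost"])

# Hypothetical services currently available for a task:
available = [{"name": "s1", "time": 4.0, "cost": 10.0},
             {"name": "s2", "time": 6.0, "cost": 3.0}]
```

Because the choice is deferred until the task is ready, a service that has become unavailable is simply absent from `candidates`, which is the point of the just-in-time model.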
Applications of different categories inherently contain varying levels of data-, instruction-, and thread-level parallelism. It is important to explore the potential coarse-grain thread-level parallelism in different applications to guide the allocation of computing resources in multicore chips. To date, most in-depth research has concentrated on desktop applications. To fully understand the applicability of thread-level parallelism (TLP) technology, this paper proposes a criterion for selecting the regions to be executed in parallel and analyzes the factors affecting application performance (computation, parallelism coverage, thread size, inter-thread control dependence, and inter-thread data dependence) using our dynamic profiling tool set. It explores the TLP potential in the desktop, multimedia, and high-performance computing (HPC) fields by demonstrating the different speedups that can be achieved with different core counts. The experimental results show that the majority of desktop applications can make effective use of only 2 cores, while most multimedia and HPC applications can use 8-16 cores efficiently under coarse-grain thread-level parallelism. Although TLP technology does not perform well in desktop applications, which suffer from severe data dependences, it is suitable for most multimedia and HPC applications, which have large computations, moderate thread sizes, and dependences that are fuzzy but easy to resolve.
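The reported saturation at 2 cores versus 8-16 cores is consistent with a simple Amdahl-style model, in which an application's parallelism coverage f bounds its achievable speedup on n cores. The coverage values in the comment are illustrative, not the paper's measurements:

```python
def amdahl_speedup(f, n):
    """Ideal speedup on n cores of a program whose fraction f (0..1) of
    execution is parallelizable; the serial remainder (1 - f) dominates as
    n grows, capping the speedup at 1 / (1 - f)."""
    return 1.0 / ((1.0 - f) + f / n)

# e.g. a desktop-like app with f = 0.6 never exceeds 2.5x on any core count,
# while an HPC-like app with f = 0.95 still gains noticeably up to 16 cores.
```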
When the strain tensor, the scalar damage quantity and the Laplacian thereof serve as the state variables of Helmholtz free energy, the general expressions of elasticity-gradient damage constitutive equations are deri...
Due to the high complexity of the required calculations, intelligent routing systems must apply the latest operations research techniques to create routes efficiently. This paper proposes a solution to the Multi Path Orienteering Problem with Time Windows (MPOPTW), which includes multiple paths to move between locations. The main characteristics of the MPOPTW are: the total score collected by visiting locations has to be maximized; not all locations can be visited because of various constraints; and the time required to move from one location to the next varies according to the departure time, simulating public transportation.
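The constraints listed above can be sketched as a feasibility-and-scoring check for a candidate visit order. The time-dependent travel function, windows, and scores below are hypothetical, not an instance from the paper:

```python
def route_score(order, score, window, travel):
    """Total score collected by a candidate visit order, or None if any
    time window is missed. travel(a, b, t) returns the time-dependent
    travel time from a to b when departing at time t (e.g. it can encode
    public-transport schedules)."""
    t, total, here = 0.0, 0, "depot"
    for loc in order:
        t += travel(here, loc, t)
        open_t, close_t = window[loc]
        if t > close_t:
            return None            # arrived after the window closed
        t = max(t, open_t)         # wait for the window to open
        total += score[loc]
        here = loc
    return total

# Hypothetical instance: travel is faster before t = 3 (an "early bus").
travel = lambda a, b, t: 2.0 if t < 3 else 3.0
score = {"a": 3, "b": 5}
window = {"a": (0.0, 10.0), "b": (0.0, 6.0)}
```

A solver for the MPOPTW would search over visit orders (and over the multiple paths between each pair of locations) to maximize this score; the sketch only captures the feasibility and scoring rules stated in the abstract.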
ISBN (print): 9781424444106
Evaluation of high-performance parallel systems is a delicate issue, owing to the difficulty of generating workloads that represent those that will run on actual systems. We survey the workloads most commonly used for performance evaluation in the scope of interconnection network simulation. Aiming to fill the gap between purely synthetic and application-driven workloads, we present a set of synthetic communication micro-kernels that enhance regular synthetic traffic by adding point-to-point causality. They are conceived to stress the interconnection architecture. As an example of the proposed methodology, we use these micro-kernels to evaluate a topological improvement of k-ary n-cubes.
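The point-to-point causality added to synthetic traffic can be sketched as a generator in which each source injects its next message only after its previous one has been delivered. Node count, message count, and the fixed latency below are illustrative assumptions:

```python
import random

def causal_trace(n_nodes, n_msgs, latency, seed=0):
    """Uniform synthetic traffic with point-to-point causality: each node
    injects its next message only after its previous one is delivered,
    so injection reacts to network latency instead of ignoring it."""
    rng = random.Random(seed)
    ready = [0.0] * n_nodes          # earliest injection time per source
    trace = []
    for _ in range(n_msgs):
        src = rng.randrange(n_nodes)
        dst = rng.randrange(n_nodes)
        t_inject = ready[src]
        ready[src] = t_inject + latency   # causality: wait for delivery
        trace.append((t_inject, src, dst))
    return trace

trace = causal_trace(4, 20, 2.0)
```

In a purely synthetic open-loop workload `t_inject` would follow a fixed rate; here a congested (higher-latency) network automatically slows its sources down, which is the causal behavior the micro-kernels add.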
This paper studies the influence that task placement may have on application performance, mainly due to the relationship between communication locality and overhead. This impact is studied for torus and fat-tree topologies. A simulation-based performance study is carried out, using traces of applications and application kernels, to measure the time taken to complete one or several concurrent instances of a given workload. As the purpose of the paper is not to offer a miraculous task placement strategy but to measure the impact that placement has on performance, we selected simple strategies, including random placement. The quantitative results of these experiments show that different workloads exhibit different degrees of sensitivity to placement. Furthermore, both the number of concurrent parallel jobs sharing a machine and the size of its network have a clear impact on the time to complete a given workload. We conclude that the efficient exploitation of a parallel computer requires scheduling policies aware of application behavior and network topology.
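The locality effect can be illustrated by comparing the average hop distance between consecutive ranks of a job on a 2-D torus under a contiguous versus a scattered mapping. The torus size and the nearest-neighbor-rank communication pattern are illustrative assumptions, not the paper's traces:

```python
def torus_hops(p, q, k):
    """Manhattan distance with wraparound between nodes p and q on a k x k torus."""
    return sum(min(abs(a - b), k - abs(a - b)) for a, b in zip(p, q))

def avg_neighbor_distance(placement, k):
    """Average hops between consecutive ranks under a rank -> node mapping."""
    pairs = [(placement[r], placement[r + 1]) for r in range(len(placement) - 1)]
    return sum(torus_hops(p, q, k) for p, q in pairs) / len(pairs)

k = 4
nodes = [(x, y) for x in range(k) for y in range(k)]
row_major = nodes                      # contiguous: nearby ranks on nearby nodes
scattered = nodes[::2] + nodes[1::2]   # a simple stride-2 shuffle
```

Under the contiguous mapping consecutive ranks average 1.2 hops on this torus, while the stride-2 shuffle roughly doubles that; differences of this kind are what make some workloads placement-sensitive.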
Video-on-demand (VOD) is a service that allows users to view any video program from a server at the time of their choice; such services are expected to be popular in future ubiquitous computing environments. Lo...
This paper presents a first approach to determining, before labor and using a set of data taken from the mother, whether a newborn will be macrosomic. Determining whether a newborn is going to be macrosomic is important for planning cesarean sections and for anticipating other complications during labor. The proposed model for classifying the weight is a neural network whose design is based on recent algorithms that allow the network to focus on a particular class. Before proceeding with the design methodology to obtain the models, a preliminary variable-selection step is performed to identify the risk factors and to avoid the curse of dimensionality. A further study addresses the missing values in the database, since the data were not complete for all patients. The results show how useful the incorporation of the missing values into the original data set can be for identifying new risk factors.
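The abstract does not specify how the incomplete records were incorporated; as a generic illustration of one common option (purely an assumption, not the authors' method), a mean-imputation sketch:

```python
def mean_impute(rows):
    """Replace None entries in each column by that column's mean over the
    known values; assumes every column has at least one known value."""
    cols = len(rows[0])
    means = []
    for c in range(cols):
        known = [r[c] for r in rows if r[c] is not None]
        means.append(sum(known) / len(known))
    return [[means[c] if r[c] is None else r[c] for c in range(cols)]
            for r in rows]

# Three hypothetical patient records with two numeric variables each:
filled = mean_impute([[1.0, None], [3.0, 4.0], [None, 6.0]])
```

Imputation keeps incomplete patients in the training set instead of discarding them, which is what allows the extra records to reveal additional risk factors.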