ISBN (print): 9781728176505
Machine learning techniques have been employed in virtually all domains in the past few years. New applications demand the ability to cope with dynamic environments such as data streams with transient behavior. Such environments impose new requirements, such as incrementally processing incoming data instances in a single pass under both memory and time constraints. Furthermore, prediction models often need to adapt to concept drifts observed in non-stationary data streams. Ensemble learning comprises a class of stream mining algorithms that has achieved remarkable prediction performance in this scenario. Implemented as a set of several individual component classifiers whose predictions are combined to classify new incoming instances, ensembles are naturally amenable to task parallelism. Despite their relevance, an efficient implementation of ensemble algorithms remains challenging. For example, the dynamic data structures used to model non-stationary data behavior and detect concept drifts cause inefficient memory usage patterns and poor cache performance in multi-core environments. In this paper, we propose a mini-batching strategy that can significantly reduce cache misses and improve the performance of several ensemble algorithms for stream mining in multi-core environments. We assess our strategy on four state-of-the-art ensemble algorithms using four widely used machine learning benchmark datasets with varied characteristics. Results from two different hardware platforms show speedups of up to 5X on 8-core processors with ensembles of 100 and 150 learners. These benefits come at the cost of changes in predictive performance.
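As a rough sketch of the mini-batching idea (in Python, not the paper's implementation), the loop below buffers incoming instances and lets each ensemble member process a whole batch while its data structures are still cache-resident; the predict_one/learn_one learner interface is a hypothetical stand-in for a stream-learning API.

    def minibatch_stream(stream, ensemble, batch_size=50):
        """Buffer the stream into mini-batches so each learner touches
        many instances consecutively, improving the temporal locality
        of its (dynamic) model state."""
        batch = []
        for x, y in stream:                       # single pass, as required
            batch.append((x, y))
            if len(batch) == batch_size:
                for learner in ensemble:          # parallelizable across learners
                    for xi, yi in batch:
                        learner.predict_one(xi)   # test-then-train protocol
                        learner.learn_one(xi, yi)
                batch.clear()

Without batching, each arriving instance visits all learners in turn, so every learner's state is evicted from cache between consecutive accesses; batching amortizes those misses across batch_size instances.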
OpenMP allows developers to harness the power of shared memory multiprocessing in C and C++ applications, but the performance gained with OpenMP is highly sensitive to the underlying hardware, making performance porta...
Unknown motif finding in a set of DNA sequences is an important step of understanding the functionality of a group of genes and it requires accuracy and efficiency. We propose and present high-performance computation ...
Machine learning algorithms have become a major tool in various applications. The high-performance requirements on large-scale datasets pose a challenge for traditional von Neumann architectures. We present two machine learning implementations and evaluations on PRINS, a novel processing-in-storage system based on resistive content addressable memory (ReCAM). PRINS functions simultaneously as a storage and a massively parallel associative processor. PRINS processing-in-storage resolves the bandwidth wall faced by near-data von Neumann architectures, such as a three-dimensional DRAM-and-CPU stack or an SSD with an embedded CPU, by keeping the computation inside the storage arrays, thus implementing in-data, rather than near-data, processing. We show that a PRINS-based processing-in-storage architecture may outperform existing in-storage designs and accelerator-based designs. Multiple performance comparisons for the ReCAM processing-in-storage implementations of K-means and K-nearest neighbors are performed. Compared platforms include CPU, GPU, FPGA, and the Automata Processor. We show that PRINS may achieve an order-of-magnitude speedup and improved power efficiency relative to all compared platforms.
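To illustrate why K-nearest neighbors maps well to an associative, row-parallel substrate, the NumPy sketch below applies one distance computation to every stored row at once; this only mimics the access pattern and is in no way the ReCAM implementation.

    import numpy as np

    def knn_row_parallel(stored, query, k):
        """Compute the query's distance to every stored row at once
        (vectorized here; row-parallel in an associative processor),
        then select the k nearest. K-means assignment follows the same
        pattern with one distance computation per centroid."""
        dists = np.sum((stored - query) ** 2, axis=1)
        return np.argpartition(dists, k)[:k]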
ISBN (print): 9781450362955
On multi-core processors, contention on shared resources such as the last-level cache (LLC) and memory bandwidth may cause serious performance degradation, which makes efficient resource allocation a critical issue in data centers. Intel recently introduced Memory Bandwidth Allocation (MBA) technology on its Xeon Scalable processors, which makes it possible to allocate memory bandwidth in a real system. However, how to make the most of MBA to improve system performance remains an open question. In this work, (1) we formulate a quantitative relationship between a program's performance and its LLC occupancy and memory request rate on commodity processors; (2) guided by the performance formula, we propose a heuristic bound-aware throttling algorithm to improve system performance; (3) we further develop a hierarchical clustering method to improve the algorithm's efficiency; and (4) we implement these algorithms in EMBA, a low-overhead dynamic memory bandwidth scheduling system that improves performance on Intel commodity processors. The results show that, when multiple programs run simultaneously on a multi-core processor whose memory bandwidth is saturated, programs with high memory bandwidth demand usually use bandwidth inefficiently, compared with programs with medium memory bandwidth demand, from the perspective of CPU performance. By slightly throttling the former's bandwidth, we can significantly improve the performance of the latter. On average, we improve system performance by 36.9% at the expense of 8.6% of bandwidth utilization.
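A hypothetical sketch of the bound-aware throttling heuristic described above (the bw_demand and set_mba_limit members are assumptions for illustration, not EMBA's API): when measured bandwidth saturates, the highest-demand programs are throttled slightly until the saturation clears.

    def bound_aware_throttle(programs, saturation_bw, throttle_pct=70):
        """If aggregate demand saturates memory bandwidth, cap the
        highest-demand programs first; on Linux this could be applied
        by writing an 'MB:' line to a resctrl group's schemata file."""
        total = sum(p.bw_demand for p in programs)
        for p in sorted(programs, key=lambda q: q.bw_demand, reverse=True):
            if total <= saturation_bw:
                break                          # bandwidth no longer saturated
            p.set_mba_limit(throttle_pct)      # hypothetical MBA hook
            total -= p.bw_demand * (1 - throttle_pct / 100)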
The proceedings contain 88 papers. The topics discussed include: analysis of industrial Ethernet used in the active surface system of QTT; an interactive platform of gesture and music based on the Myo armband and Processing; a quantitative study on the color of city landmark landscape architectures; research on risk identification for foreign companies investing in Mongolia's infrastructure construction industry based on complex network technology; application of the multi-dimensional linear fitting method in the establishment of the semi-autogenous grinding mill model; research on the construction and application of a cloud computing experiment platform for computer science general education courses; revisiting the current state-of-the-art multipath routing in ad hoc networks; surface blemishes of aluminum material image recognition based on transfer learning; and the establishment and analysis of a gas yield prediction model.
the skyline query over uncertain data streams, as an important aspect of big data analysis, plays a significant role in various domains like financial data analysis, environmental monitoring, and wireless sensor netwo...
ISBN (digital): 9781728165820
ISBN (print): 9781728165837
Data movement has long been identified as the biggest challenge facing the designers of modern computer systems. To tackle this challenge, many novel data compression algorithms have been developed. Variable-rate compression algorithms are often favored over fixed-rate ones. However, variable-rate decompression is difficult to parallelize. Most existing algorithms adopt a single parallelization strategy suited to a particular hardware platform. Such an approach fails to harness the parallelism found in diverse modern hardware architectures. We propose a parallelization method for tiled variable-rate compression algorithms that consists of multiple strategies that can be applied interchangeably. This allows an algorithm to apply the strategy most suitable for a specific hardware platform. Our strategies are based on generating metadata during encoding, which is used to parallelize the decoding process. To demonstrate the effectiveness of our strategies, we implement them in a state-of-the-art compression algorithm called ZFP. We show that the strategies suited for multicore CPUs differ from the ones suited for GPUs. On a CPU, we achieve a near-optimal decoding speedup with a metadata overhead that is consistently less than 0.04% of the compressed data size. On a GPU, we achieve average decoding rates of up to 100 GiB/s. Our strategies allow the user to make a trade-off between decoding throughput and metadata size overhead.
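The metadata idea generalizes beyond ZFP: if the encoder records each tile's compressed size, a prefix sum over those sizes yields every tile's start offset, and all tiles can then be decoded independently. A minimal Python sketch of this general pattern (threads stand in for the CPU/GPU parallelism of a native implementation; encode_tile and decode_tile are assumed user-supplied codec functions):

    from itertools import accumulate
    from concurrent.futures import ThreadPoolExecutor

    def encode_with_metadata(tiles, encode_tile):
        """Concatenate variable-rate tiles; the per-tile sizes are the
        metadata that later enables parallel decoding."""
        payload, sizes = bytearray(), []
        for t in tiles:
            c = encode_tile(t)
            payload += c
            sizes.append(len(c))
        return bytes(payload), sizes

    def decode_parallel(payload, sizes, decode_tile, workers=8):
        """Prefix-sum the sizes to locate each tile's start offset,
        then decode all tiles independently."""
        offsets = list(accumulate(sizes, initial=0))
        chunks = [payload[offsets[i]:offsets[i + 1]] for i in range(len(sizes))]
        with ThreadPoolExecutor(workers) as pool:
            return list(pool.map(decode_tile, chunks))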
Serverless computing is an emerging cloud computing paradigm with the goal of freeing developers from resource management issues. As of today, serverless computing platforms are mainly used to process computations tri...
ISBN (digital): 9781728164250
ISBN (print): 9781728164267
In recent years, Hub Location Problems (HLPs) have been extended to handle uncertain data, giving rise to Robust HLPs. In a Robust HLP with discrete scenarios, the single set of requests is replaced by a set of discrete scenarios. For example, a scenario can be the collection of requests observed between nodes during a given period of the year. In a robust optimization approach, making appropriate decisions for all scenarios is time-consuming, especially for large HLP instances. The purpose of this study is to show that such problems can be solved in reasonable computing time and with high-quality solutions using the computing power of a GPU. We present a GPU-based approach for solving large Robust HLPs with discrete scenarios. The proposed parallel genetic algorithm returns a robust solution based on the min-max lexicographic criterion, which minimizes the worst cost over all scenarios. Thanks to the performance of our GPU implementation, we solve instances of up to 4000 nodes in a few seconds on an Nvidia Quadro P6000 (3840 cores).
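The min-max lexicographic criterion can be sketched as a fitness function that evaluates a candidate solution under every scenario and compares solutions by their sorted per-scenario costs, worst first (cost here is an assumed user-supplied evaluation function, not the paper's code):

    def robust_fitness(solution, scenarios, cost):
        """Score a candidate by its costs across all scenarios, sorted
        worst-first; Python's tuple ordering then gives the min-max
        lexicographic comparison for free."""
        return tuple(sorted((cost(solution, s) for s in scenarios), reverse=True))

    # The genetic algorithm minimizes this tuple, e.g.:
    # best = min(population, key=lambda s: robust_fitness(s, scenarios, cost))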