ISBN (Print): 9781509053827
The big data era is characterized by the emergence of live data with high volume and fast arrival rates, which poses a new challenge to stream processing applications: how to process unbounded live data in real time with high throughput. The sliding window technique is widely used to handle unbounded live data by storing the most recent history of a stream. However, existing centralized solutions cannot satisfy the requirements for high processing capacity and low latency due to the single-node bottleneck. Moreover, existing studies on distributed windows primarily focus on specific operators, while a general framework for processing various window-based operators is still lacking. In this paper, we first classify window-based operators into two categories: data-independent operators and data-dependent operators. Then, we propose GDSW, a general framework for distributed count-based sliding windows, which can handle both data-independent and data-dependent operators. In addition, to balance system load, we propose a dynamic load-balancing algorithm, DAD, based on buffer usage. Our framework is implemented on Apache Storm 0.10.0. Extensive evaluation shows that GDSW achieves sub-second latency and a 10x improvement in throughput over centralized processing under rapid data rates or large window sizes.
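The distinction between the two operator classes can be illustrated with a minimal count-based sliding window in Python (a sketch of the general idea only, not GDSW's actual API; class and method names are ours):

```python
from collections import deque

class CountSlidingWindow:
    """Count-based sliding window over the N most recent tuples.

    A data-independent operator (e.g. sum) can be maintained
    incrementally in O(1) per tuple; a data-dependent operator
    (e.g. median) must inspect the whole window content on each slide.
    """
    def __init__(self, size):
        self.buf = deque(maxlen=size)
        self.running_sum = 0  # incremental state for the sum operator

    def insert(self, x):
        if len(self.buf) == self.buf.maxlen:
            self.running_sum -= self.buf[0]  # expire the oldest tuple
        self.buf.append(x)
        self.running_sum += x

    def window_sum(self):          # data-independent: O(1) per query
        return self.running_sum

    def window_median(self):       # data-dependent: needs full window
        s = sorted(self.buf)
        n = len(s)
        return s[n // 2] if n % 2 else (s[n // 2 - 1] + s[n // 2]) / 2

w = CountSlidingWindow(4)
for x in range(1, 11):
    w.insert(x)
print(w.window_sum())     # window is [7, 8, 9, 10] -> 34
print(w.window_median())  # -> 8.5
```

The data-independent case is what makes round-robin partitioning across distributed workers cheap; the data-dependent case is why a general framework needs access to the full window content.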
ISBN (Print): 9781509053827
Anomaly detection over multi-dimensional data streams has recently attracted considerable attention in various fields, such as networking, finance, and aerospace. In many cases, anomalies are composed of a sequence of multi-dimensional data items, and it is necessary to detect this type of anomaly accurately and efficiently over the stream. Existing online anomaly detection methods focus only on single-dimensional sequences, and current studies on multi-dimensional sequences mainly target static databases. Anomaly detection for multi-dimensional sequences over a data stream is much harder, due to the complexity of multi-dimensional sequence processing, the dynamic nature of the stream, and the imbalance between normal and abnormal data. Facing these challenges, we propose ADMS, an anomaly detection method for multi-dimensional sequences over data streams based on the cost-sensitive support vector machine (C-SVM). First, to improve accuracy and efficiency, ADMS transforms multi-dimensional sequences into feature vectors in a lossless way and prunes worthless features from these vectors. ADMS then detects abnormal sequences over the dynamically imbalanced stream by testing these vectors online with the C-SVM. Experiments show that ADMS achieves a false negative rate (FNR) below 5% and a false positive rate (FPR) below 7%, and that pruning worthless features improves throughput by 42%. In addition, ADMS performs well when concept drift occurs in the stream.
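The lossless flattening and feature-pruning steps can be sketched as follows. This is an illustration of the idea only: the abstract does not specify ADMS's actual feature construction or pruning criterion, so the constant-column criterion below is our stand-in for "worthless" features.

```python
def sequence_to_vector(seq):
    """Losslessly flatten a multi-dimensional sequence
    (a list of equal-length tuples) into one feature vector."""
    return [v for point in seq for v in point]

def prune_features(vectors):
    """Drop columns that are constant across all vectors
    (a simple stand-in for 'worthless' features); return the
    pruned vectors and the indices of the kept columns."""
    cols = list(zip(*vectors))
    keep = [i for i, c in enumerate(cols) if len(set(c)) > 1]
    return [[vec[i] for i in keep] for vec in vectors], keep

# three 2-step sequences of 2-dimensional points
seqs = [[(1, 0), (2, 0)], [(1, 0), (3, 0)], [(1, 0), (5, 0)]]
vectors = [sequence_to_vector(s) for s in seqs]
pruned, kept = prune_features(vectors)
print(kept)    # only column 2 varies -> [2]
print(pruned)  # [[2], [3], [5]]
```

The pruned vectors would then be fed to a cost-sensitive classifier that weights the rare abnormal class more heavily than the normal class.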
ISBN (Print): 9781509045181
Powering is an important operation in many computation-intensive workloads. This paper investigates the performance of different styles of calculating powering operations at the application level. A series of small benchmark codes that calculate powering operations in different ways are designed, and their performance is evaluated on an Intel Xeon CPU under the Intel compilation environment. The results show that the number of floating-point operations and the associated runtime are sensitive to the value of the exponent Y and to how it is expressed. When Y is an immediate integer whose value is known at compile time, powering costs much less than when Y is an integer variable whose value is only known at runtime. When Y is declared as a real variable, powering is always expensive, whether or not its value equals an integer. Based on these investigations, performance optimizations are applied to a kernel subroutine from a real-world supersonic combustion simulation code that makes intensive use of powering operations. The result shows that the performance of the subroutine is improved by a factor of 13.25 on the Intel Xeon E5-2692 CPU.
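The three exponent styles the paper benchmarks can be written out as follows. This is a semantic illustration in Python; the measured performance effects concern the Intel compiler toolchain, where an immediate integer exponent can be strength-reduced to plain multiplies at compile time, while a runtime exponent falls back to square-and-multiply or the exp/log-based library routine.

```python
import math

def pow_float_exp(x, y):
    # exponent held in a real variable: resolved by the generic
    # library routine (conceptually exp(y * log(x)))
    return math.pow(x, y)

def pow_int_var(x, n):
    # exponent in an integer variable: square-and-multiply at runtime
    result = 1.0
    while n > 0:
        if n & 1:
            result *= x
        x *= x
        n >>= 1
    return result

def pow_immediate_6(x):
    # exponent known "at compile time": strength-reduced to multiplies,
    # mirroring what a compiler emits for an immediate integer exponent
    x2 = x * x
    return x2 * x2 * x2

x = 1.5
print(pow_immediate_6(x))   # 11.390625
print(pow_int_var(x, 6))    # 11.390625, same value via square-and-multiply
print(abs(pow_float_exp(x, 6.0) - 11.390625) < 1e-9)  # True
```

All three compute the same mathematical value; the paper's point is that the cheapest form is only available when the compiler can see the integer exponent.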
ISBN (Print): 9781467390408
The Binary Exchange Algorithm (BEA) always introduces excessive shuffle operations when mapping FFTs onto vector SIMD DSPs, which can greatly restrict overall performance. We propose a novel mod (2^P-1) shuffle function and a Mod-BEA algorithm (MBEA), which halve the shuffle operation count and unify the shuffle mode. This unified shuffle mode inspires a set of novel mod (2^P-1) shuffle memory-access instructions, which eliminate the shuffle operations entirely. Experimental results show that the combination of MBEA and the proposed instructions brings 17.2%-31.4% performance improvement at reasonable hardware cost and compresses code size by about 30%.
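The abstract does not give the shuffle function itself, but mod (2^P-1) arithmetic arises naturally in FFT-style data reordering: the perfect shuffle of 2^P elements maps element i to position 2i mod (2^P-1), with the last element fixed. A sketch of this identity (our illustration, not the paper's instruction set):

```python
def shuffle_mod(data):
    """Perfect shuffle of 2**P elements expressed through a
    mod (2**P - 1) index function: element i moves to position
    2*i mod (2**P - 1); the last element stays in place."""
    n = len(data)
    out = [None] * n
    for i in range(n - 1):
        out[(2 * i) % (n - 1)] = data[i]
    out[n - 1] = data[n - 1]
    return out

def shuffle_interleave(data):
    """Reference implementation: riffle the two halves together."""
    half = len(data) // 2
    out = []
    for a, b in zip(data[:half], data[half:]):
        out += [a, b]
    return out

v = list(range(8))
print(shuffle_mod(v))                            # [0, 4, 1, 5, 2, 6, 3, 7]
print(shuffle_mod(v) == shuffle_interleave(v))   # True
```

Because every stage uses the same mod (2^P-1) index map, the shuffle mode is uniform across FFT stages, which is what makes folding it into a memory-access instruction attractive.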
A wearable device with an ego-centric camera could be the next-generation device for human-computer interaction tasks such as robot control. Hand gesture is a natural mode of egocentric human-computer interaction. In this paper, we present an ego-centric multi-stage hand gesture analysis pipeline for robot control that works robustly in unconstrained environments with varying illumination. In particular, we first propose an adaptive color- and contour-based hand segmentation method to segment the hand region from the egocentric view. We then propose a convex U-shaped curve detection algorithm to precisely detect fingertip positions. In parallel, we utilize convolutional neural networks to recognize hand gestures. Based on these techniques, we combine this hand information to control the robot, and we develop a hand gesture analysis system on an iPhone and a robot arm platform to validate its effectiveness. The results demonstrate that our method can control the robot arm by hand gesture in real time.
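The U-shaped curve detection step can be illustrated on a 1D contour signal (a simplified stand-in for the paper's contour-based algorithm; the depth threshold and the 1D formulation are our assumptions):

```python
def find_u_curves(signal, depth=2):
    """Detect convex U-shaped dips in a 1D contour signal:
    indices that are local minima lying at least `depth` below
    the highest point on each side. In a hand contour, such dips
    correspond to the valleys between extended fingers."""
    dips = []
    for i in range(1, len(signal) - 1):
        if signal[i - 1] > signal[i] < signal[i + 1]:
            left = max(signal[:i])
            right = max(signal[i + 1:])
            if left - signal[i] >= depth and right - signal[i] >= depth:
                dips.append(i)
    return dips

# a toy contour: two deep valleys and one shallow dip that is rejected
contour = [5, 3, 0, 3, 5, 4.5, 5, 3, 1, 4, 5]
print(find_u_curves(contour))  # [2, 8]
```

The real algorithm works on 2D contour points and checks convexity, but the filtering idea, keeping only dips that are deep relative to their surroundings, is the same.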
ISBN (Print): 9781509006212
Temporal alignment aligns two temporal sequences and is quite challenging due to drastic differences among sequences and among source data from different views. Canonical time warping (CTW) has shown great potential in temporal alignment tasks because it reduces data redundancy by transforming high-dimensional data into a lower-dimensional subspace via canonical correlation analysis (CCA). However, CTW cannot uncover the nonlinear structure underlying the dataset. In this paper, we propose an autoencoder-regularized canonical time warping method (AECTW) to overcome this drawback. Specifically, AECTW enhances the lower-dimensional representation of each sequence by incorporating an autoencoder regularization, while revealing the nonlinear structure of the features through an explicit nonlinear transformation. With these strategies, AECTW significantly improves on CTW in temporal alignment tasks. Experiments on both synthetic data and two practical human action datasets demonstrate that AECTW outperforms representative DTW-based methods.
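For reference, the dynamic time warping (DTW) baseline that CTW and AECTW build on can be sketched in a few lines. This is the classic dynamic-programming formulation on 1D sequences, not the AECTW method itself:

```python
def dtw(a, b):
    """Dynamic time warping between two 1D sequences: returns the
    minimal cumulative alignment cost over all monotone warping paths."""
    inf = float("inf")
    n, m = len(a), len(b)
    D = [[inf] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            # extend the cheapest of: insertion, deletion, match
            D[i][j] = cost + min(D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
    return D[n][m]

# the same shape sampled at different speeds aligns with zero cost...
print(dtw([0, 1, 2, 3], [0, 1, 1, 2, 2, 3]))  # 0.0
# ...while genuinely different sequences do not
print(dtw([0, 1, 2, 3], [0, 2, 4, 6]))        # 5.0
```

CTW's contribution is to learn CCA projections jointly with this warping so that multi-view, high-dimensional sequences become comparable; AECTW additionally regularizes those projections with an autoencoder to capture nonlinear structure.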
This paper investigates the problem of maximizing uniform multicast throughput (MUMT) for multi-channel dense wireless sensor networks, where all nodes are located within one-hop transmission range and can communicate with...
As the big data era arrives, it brings new challenges to massive data processing. Combining a GPU and a CPU on one chip is the trend for relieving the pressure of large-scale computing. We found that there are diffe...
In this paper, we present the Tianhe-2 interconnect network and its message passing services. We describe the architecture of the router and network interface chips, and highlight a set of hardware and software features that effectively support high-performance communication, including remote direct memory access, collective-operation optimization, hardware-enabled reliable end-to-end communication, and user-level message passing services. Measured hardware performance results are also presented.
With the integration of physical space and cyberspace, distributing large-scale data to massive, geographically dispersed, and diverse terminals has become a huge challenge. When the data size exceeds what traditional techniques can process, maintaining user quality of service while using system resources efficiently becomes an important concern as resources grow scarce. This paper presents a data-driven mechanism for large-scale data distribution consisting of four core parts: data production, data collection and pre-processing, the data analysis engine, and data consumption. It aims to mine valuable information so as to improve resource-usage efficiency and locate faults accurately in large-scale data distribution systems. We also study data-driven resource scheduling optimization by analyzing system behavior, and data-driven fault location, and we demonstrate the effectiveness of the data-driven approach for optimizing the operation of a large-scale data distribution system.