Non-volatile random-access memory (NVRAM) technology is maturing rapidly, and its byte-persistence feature allows the design of new and efficient fault tolerance mechanisms. In this paper we propose the versionized process (VerP), a new process model based on NVRAM that is natively non-volatile and fault tolerant. We introduce an intermediate software layer that allows us to run a process directly on NVRAM and to put all the process states into NVRAM, and then propose a mechanism to versionize all the process data. Each piece of the process data is given a special version number, which increases with the modification of that piece of data. The version number can effectively help us trace the modification of any data and recover it to a consistent state after a system crash. Compared with traditional checkpoint methods, our work can achieve fine-grained fault tolerance at very little cost.
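The abstract does not include code; the following is a minimal Python sketch of the per-datum versioning idea it describes. All names (VersionedCell, commit, recover) are hypothetical illustrations, not the paper's API.

```python
# Minimal sketch of per-datum versioning (hypothetical names, not the paper's API).
# Each piece of data carries a version number that grows with every modification;
# after a crash, data whose version exceeds the last consistent version is rolled
# back to the last consistent snapshot.

class VersionedCell:
    def __init__(self, value):
        self.version = 0               # increases with every modification
        self.value = value
        self._snapshot = (0, value)    # last version/value known to be consistent

    def write(self, value):
        self.version += 1              # versionize the modification
        self.value = value

    def commit(self):
        # mark the current version as part of a consistent state
        self._snapshot = (self.version, self.value)

    def recover(self):
        # after a crash, discard modifications newer than the last consistent version
        self.version, self.value = self._snapshot


cell = VersionedCell(42)
cell.write(43)
cell.commit()          # version 1 is consistent
cell.write(44)         # version 2 was in flight when the "crash" happened
cell.recover()
assert (cell.version, cell.value) == (1, 43)
```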
Sparse bundle adjustment (SBA) is a key but time- and memory-consuming step in three-dimensional (3D) reconstruction. In this paper, we propose a 3D point-based distributed SBA algorithm (DSBA) to improve the speed and scalability of SBA. The algorithm uses an asynchronously distributed sparse bundle adjustment (A-DSBA) to overlap data communication with equation computation. Compared with the synchronous DSBA mechanism (SDSBA), A-DSBA reduces the running time by 46%. The experimental results on several 3D reconstruction datasets reveal that our distributed algorithm running on eight nodes is up to five times faster than the stand-alone parallel SBA. Furthermore, the speedup of the proposed algorithm (running on eight nodes with 48 cores) is up to 41 times that of the serial SBA (running on a single node).
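To make the asynchronous overlap concrete, here is a heavily simplified Python sketch of the idea of hiding communication behind computation. The helper names and the thread-based exchange are assumptions for illustration; the actual A-DSBA implementation works on distributed normal equations, not toy lists.

```python
# Sketch of the communication/computation overlap in A-DSBA (illustrative only):
# while one iteration's local equations are being solved, the exchange of the
# previous iteration's result proceeds in a background thread.

import threading

def solve_local_equations(block):
    # placeholder for the per-node reduced camera/point system solve
    return sum(block)

def exchange_with_neighbours(result, out):
    # placeholder for non-blocking communication (MPI_Isend/Irecv in practice)
    out.append(result)

received = []
pending = None
for block in ([1, 2], [3, 4], [5, 6]):            # per-iteration local data
    local = solve_local_equations(block)           # computation
    if pending is not None:
        pending.join()                             # finish last iteration's exchange
    pending = threading.Thread(target=exchange_with_neighbours,
                               args=(local, received))
    pending.start()                                # communication overlaps next solve
if pending is not None:
    pending.join()
print(received)
```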
Network monitoring is vital in modern clouds and data center networks for traffic engineering, network diagnosis, and network intrusion detection, which need diverse traffic statistics ranging from flow size distributions ...
As we approach the exascale era in supercomputing, designing a balanced computer system with a powerful computing ability and low power requirements has become increasingly important. The graphics processing unit (GPU) is an accelerator used widely in most recent supercomputers. It adopts a large number of threads to hide long latency with high energy efficiency. In contrast to their powerful computing ability, GPUs have only a few megabytes of fast on-chip memory storage per streaming multiprocessor (SM). The GPU cache is inefficient due to a mismatch between the throughput-oriented execution model and the cache hierarchy design. At the same time, current GPUs fail to handle burst-mode long-access latency due to the GPU's poor warp scheduling methods. Thus, the benefits of the GPU's high computing ability are reduced dramatically by the poor cache management and warp scheduling methods, which limit the system performance and energy efficiency. In this paper, we put forward a coordinated warp scheduling and locality-protected (CWLP) cache allocation scheme to make full use of data locality and hide latency. We first present a locality-protected cache allocation method based on the instruction program counter (LPC) to promote cache performance. Specifically, we use a PC-based locality detector to collect the reuse information of each cache line and employ a prioritised cache allocation unit (PCAU) which coordinates the data reuse information with the time-stamp information to evict the lines with the least reuse possibility. Moreover, the locality information is used by the warp scheduler to create an intelligent warp reordering scheme to capture locality and hide latency. Simulation results show that CWLP provides a speedup of up to 19.8% and an average improvement of 8.8% over the baseline methods.
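The following toy Python model illustrates the flavour of PC-based locality-protected allocation: a per-PC reuse counter estimates how likely lines loaded by that instruction are to be reused, and eviction prefers lines with the lowest expected reuse, breaking ties by the oldest time stamp. The class and policy details are assumptions for illustration, not the simulator or exact policy used in the paper.

```python
# Toy model of locality-protected cache allocation (hypothetical names).

class LocalityProtectedCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.lines = {}          # address -> (loading_pc, last_access_time)
        self.pc_reuse = {}       # pc -> observed reuse count for that instruction
        self.clock = 0

    def access(self, address, pc):
        self.clock += 1
        if address in self.lines:
            # a hit means this PC exhibits locality; raise its reuse score
            self.pc_reuse[pc] = self.pc_reuse.get(pc, 0) + 1
            self.lines[address] = (pc, self.clock)
            return "hit"
        if len(self.lines) >= self.capacity:
            # evict the line with the least expected reuse, oldest first
            victim = min(self.lines,
                         key=lambda a: (self.pc_reuse.get(self.lines[a][0], 0),
                                        self.lines[a][1]))
            del self.lines[victim]
        self.lines[address] = (pc, self.clock)
        return "miss"


cache = LocalityProtectedCache(capacity=2)
print([cache.access(a, pc) for a, pc in [(0x10, 1), (0x20, 2), (0x10, 1), (0x30, 3)]])
```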
Vehicle privacy protection plays a vital role in the releasing or sharing of traffic videos. The license plate, as the identifiable mark of a vehicle, contains its most sensitive information. Therefore, masking license plates is a common way to protect the privacy of the corresponding vehicles. However, in real-world scenarios, it is often hard to locate the small and shifting license plates, and therefore precise and cost-effective privacy protection is quite challenging. To address this problem in surveillance video, we fully explore all available spatio-temporal cues and design a bidirectional Kalman filter model over consecutive frames to locate missing license plates. To verify the effectiveness of the proposed method, we build a new License Plates Privacy-preserving Dataset (LPPD) collected from various scenes with diverse privacy and utility annotations. We demonstrate that the proposed method shows a very promising capability of privacy protection on the real-world dataset without sacrificing its utility.
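A hedged sketch of the bidirectional filtering idea follows: a constant-velocity Kalman filter run over detected plate positions, with missed detections filled by prediction, executed forward and backward in time and averaged. The state model, noise parameters, and fusion rule are illustrative assumptions; the paper's model may differ.

```python
# Bidirectional constant-velocity Kalman filtering over plate positions (1-D sketch).

import numpy as np

def kalman_1d(observations, q=1e-2, r=1.0):
    F = np.array([[1.0, 1.0], [0.0, 1.0]])   # state: position + velocity
    H = np.array([[1.0, 0.0]])
    x = np.array([observations[0] if observations[0] is not None else 0.0, 0.0])
    P = np.eye(2)
    out = []
    for z in observations:
        x = F @ x
        P = F @ P @ F.T + q * np.eye(2)       # predict
        if z is not None:                     # update only when the plate was detected
            S = H @ P @ H.T + r
            K = P @ H.T / S
            x = x + (K * (z - H @ x)).ravel()
            P = (np.eye(2) - K @ H) @ P
        out.append(x[0])
    return np.array(out)

detections = [10.0, 11.1, None, None, 14.2, 15.0]   # plate x-centers; None = missed
forward = kalman_1d(detections)
backward = kalman_1d(detections[::-1])[::-1]
filled = (forward + backward) / 2.0                  # bidirectional estimate
print(np.round(filled, 2))
```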
Convolutional neural network models are widely used in image classification tasks. However, the running time of such models is so long that it does not conform to the strict real-time requirements of mobile devices. ...
Massive multiple-input multiple-output provides improved energy efficiency and spectral efficiency in 5G. However, it requires large-scale matrix computation with tremendous complexity, especially for data detection and precoding. Recently, many detection and precoding methods have been proposed using approximate iteration methods, which meet the demand for precision with low complexity. In this paper, we compare these approximate iteration methods in precision and complexity, and then improve them with iteration refinement at the cost of little extra complexity and no extra hardware resources. By derivation, our proposal is in essence a combination of three approximate iteration methods, and it provides a remarkable precision improvement on the desired vectors. The results show that our proposal provides a 27%-83% normalized mean-squared error improvement of the detection symbol vector and the precoding symbol vector. Moreover, we find that the bit-error rate is mainly controlled by soft-input soft-output Viterbi decoding when approximate iteration methods are used. Further, considering only the effect on soft-input soft-output Viterbi decoding, the simulation results show that using a rough estimate of the filter matrix of minimum mean square error detection to calculate the log-likelihood ratio can provide sufficiently good bit-error rate performance, especially when the ratio of the number of base station antennas to the number of users is not too large.
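For readers unfamiliar with approximate iteration methods for MMSE detection, here is a hedged numpy sketch: the MMSE filter matrix is inverted approximately with a two-term Neumann series, and one iterative-refinement step reduces the residual error. The splitting, number of series terms, and refinement schedule are illustrative choices, not the exact configuration evaluated in the paper.

```python
# MMSE detection with an approximate (Neumann-series) inverse plus refinement.

import numpy as np

rng = np.random.default_rng(0)
B, U, sigma2 = 64, 8, 0.1                      # BS antennas, users, noise variance
H = (rng.standard_normal((B, U)) + 1j * rng.standard_normal((B, U))) / np.sqrt(2)
x_true = rng.choice([1 + 1j, 1 - 1j, -1 + 1j, -1 - 1j], U) / np.sqrt(2)
y = H @ x_true + np.sqrt(sigma2 / 2) * (rng.standard_normal(B) + 1j * rng.standard_normal(B))

A = H.conj().T @ H + sigma2 * np.eye(U)        # MMSE filter matrix
b = H.conj().T @ y

# 2-term Neumann series around the diagonal: A^-1 ~= D^-1 - D^-1 E D^-1
D_inv = np.diag(1.0 / np.diag(A))
E = A - np.diag(np.diag(A))
A_inv_approx = D_inv - D_inv @ E @ D_inv

x_approx = A_inv_approx @ b                                  # approximate detection
x_refined = x_approx + A_inv_approx @ (b - A @ x_approx)     # one refinement step

x_exact = np.linalg.solve(A, b)
for name, x in [("approx", x_approx), ("refined", x_refined)]:
    nmse = np.linalg.norm(x - x_exact) ** 2 / np.linalg.norm(x_exact) ** 2
    print(name, f"NMSE = {nmse:.2e}")          # refinement should lower the NMSE
```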
Mixed-type data are pervasive in real life, but very limited outlier detection methods are available for such data. Some existing methods handle mixed-type data by feature conversion, whereas their performance is downgraded by the information loss and noise caused by the transformation. Another kind of approach separately evaluates outlierness in numerical and categorical features. However, these approaches fail to adequately consider the behaviours of data objects in different feature spaces, often leading to suboptimal results. As for outlier form, both clustered outliers and scattered outliers are contained in many real-world data, but a number of outlier detectors are inherently restricted by their outlier definitions from simultaneously detecting both of them. To address these issues, an unsupervised outlier detection method, MIX, is proposed. MIX constructs a joint learning framework that establishes a cooperation mechanism to make the separate outlier scorings constantly communicate and sufficiently grasp the behaviours of data objects in the other feature space. Specifically, MIX iteratively performs outlier scoring in the numerical and categorical spaces. Each outlier scoring phase can be iteratively and cooperatively enhanced by the prior knowledge given by the other feature space. To target both clustered and scattered outliers, the outlier scoring phases capture the essential characteristic of outliers, i.e., evaluating outlierness via the deviation from the normal model. We show that MIX significantly outperforms eight state-of-the-art outlier detectors on twelve real-world datasets and obtains good scalability.
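The following is a highly simplified Python sketch of the alternating, mutually-informed scoring idea described above: numerical outlierness is measured as deviation from a weighted "normal model", categorical outlierness as weighted rarity of values, and each phase re-weights the other using the latest scores. The scoring functions and coupling here are illustrative assumptions, far simpler than the actual MIX framework.

```python
# Alternating numerical/categorical outlier scoring with a shared weight prior.

import numpy as np

def mix_like_scores(X_num, X_cat, n_iter=5):
    n = len(X_num)
    weights = np.ones(n) / n                              # prior: all objects normal
    for _ in range(n_iter):
        # numerical phase: deviation from the weighted "normal model"
        mu = np.average(X_num, axis=0, weights=weights)
        num_score = np.linalg.norm(X_num - mu, axis=1)
        # categorical phase: weighted rarity of each object's values
        cat_score = np.zeros(n)
        for j in range(X_cat.shape[1]):
            freq = {}
            for v, w in zip(X_cat[:, j], weights):
                freq[v] = freq.get(v, 0.0) + w
            cat_score += np.array([1.0 - freq[v] for v in X_cat[:, j]])
        combined = num_score / num_score.max() + cat_score / cat_score.max()
        weights = 1.0 / (1.0 + combined)                  # outliers get small weight
        weights /= weights.sum()
    return combined

X_num = np.array([[0.0], [0.1], [0.2], [5.0]])            # last object is an outlier
X_cat = np.array([["a"], ["a"], ["a"], ["b"]])
print(np.round(mix_like_scores(X_num, X_cat), 2))
```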
As the parallel scale of HPC applications represented by earth system models becomes larger and the computing cost becomes higher, the performance of HPC applications is increasingly critical. Profiling HPC applications accurately helps to model the applications and find performance bottlenecks. However, due to the complexity of HPC applications, the diversity of programming languages, the differences in individual programming habits, and the multiplicity of architectures, accurate profiling becomes very tough. In this paper, we propose LPerf, a low-overhead and high-accuracy profiler for HPC applications. To reduce the profiling overhead and improve the profiling accuracy, we propose a preprocessing method which can automatically instrument with tunable granularity, thus significantly reducing the run-time overhead of profiling; an aggregated caller-callee relationship which is used to locate the relationships between functions efficiently; and a profiling-aware method which can precisely calculate the running time of functions. The experimental results show that the error rate of profiling reaches 0.02% and the overhead reaches 1.6% in the earth system model named CAS-ESM. Compared with the baselines, the precision, accuracy, and overhead of LPerf reach the state of the art.
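As a rough, language-level analogue of instrumented profiling with an aggregated caller-callee view, the Python sketch below hooks call and return events and accumulates inclusive time per (caller, callee) pair. LPerf itself instruments compiled HPC codes at tunable granularity; this sketch only illustrates the bookkeeping idea.

```python
# Aggregating inclusive time per (caller, callee) pair via call/return events.

import sys
import time
from collections import defaultdict

edge_time = defaultdict(float)     # (caller, callee) -> aggregated inclusive seconds
stack = []                         # (function_name, start_time)

def tracer(frame, event, arg):
    name = frame.f_code.co_name
    if event == "call":
        stack.append((name, time.perf_counter()))
    elif event == "return" and stack:
        callee, start = stack.pop()
        caller = stack[-1][0] if stack else "<root>"
        edge_time[(caller, callee)] += time.perf_counter() - start

def inner():
    time.sleep(0.01)

def outer():
    for _ in range(3):
        inner()

sys.setprofile(tracer)
outer()
sys.setprofile(None)

for (caller, callee), seconds in sorted(edge_time.items()):
    print(f"{caller} -> {callee}: {seconds * 1e3:.1f} ms")
```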
In image classification, Convolutional Neural Network (CNN) models have achieved high performance with the rapid development of deep learning. However, some categories in the image datasets are more difficult to distin...