ISBN (digital): 9798350386059
ISBN (print): 9798350386066
Network policy plays a crucial role in cloud-native networking, especially in multi-tenant scenarios. It provides precise control over connectivity by specifying source and destination endpoints, traffic types, and other criteria to allow or deny traffic. However, manual configuration of these policies introduces the risk of errors, leading to isolation violations or network service unavailability. Therefore, network policy verification is essential for maintaining security and quality of service in cloud-native networking. Currently, a naïve approach checks each policy in the cluster individually, which can take over 100 s for a cluster size of over 100k. Existing verification frameworks, such as Kano and Verikube, improve performance by leveraging pre-filtering and Satisfiability Modulo Theories (SMT) solvers, achieving a 3.12x to 12.99x speedup over the naïve baseline. However, since network policies change rapidly (within 100 ms) in real cloud-native networks, both frameworks still need over 10 s to verify clusters of that size, which is far from satisfactory. To overcome these issues, we propose and implement NPV, a novel network policy verification framework that uses a policy-label pre-filter with bitwise compression. We further enhance the verification algorithm with a policy-namespace divide-and-conquer strategy to improve data-level parallelism. We implement NPV on commodity servers and evaluate it using real network policy datasets. Our experiments indicate that NPV achieves up to a 139.00x to 651.06x improvement in verification time over the state-of-the-art Kano and Verikube, with 65% less memory usage.
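The abstract's two key ingredients are a label pre-filter compressed into bitmasks and a per-namespace divide-and-conquer pass. A minimal Python sketch of how these ideas could fit together, assuming a dict-based policy representation and an external pairwise_check callback (neither is NPV's actual interface):

from collections import defaultdict

def build_bitmasks(policies):
    # Compress each policy's label selector into an integer bitmask
    # over the universe of labels observed in the cluster.
    label_index = {}
    for p in policies:
        for lab in p["labels"]:
            label_index.setdefault(lab, len(label_index))
    for p in policies:
        p["mask"] = 0
        for lab in p["labels"]:
            p["mask"] |= 1 << label_index[lab]

def verify_namespace(policies, pairwise_check):
    # Check every policy pair in one namespace; the bitwise AND pre-filter
    # skips pairs whose selectors share no labels and thus cannot interact.
    conflicts = []
    for i, p in enumerate(policies):
        for q in policies[i + 1:]:
            if p["mask"] & q["mask"] == 0:
                continue                         # pre-filter: no label overlap
            if not pairwise_check(p, q):         # expensive check (e.g., SMT)
                conflicts.append((p["name"], q["name"]))
    return conflicts

def verify_cluster(policies, pairwise_check):
    # Divide and conquer by namespace; each namespace can be verified
    # independently, which exposes data-level parallelism.
    build_bitmasks(policies)
    by_ns = defaultdict(list)
    for p in policies:
        by_ns[p["namespace"]].append(p)
    return {ns: verify_namespace(ps, pairwise_check) for ns, ps in by_ns.items()}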
ISBN (print): 9783030602451; 9783030602444
The B+-tree is an important index in the fields of data warehousing and database management systems. With the development of new hardware technologies, the B+-tree needs to be revisited to take full advantage of hardware resources. In this paper, we focus on optimization techniques that increase the search performance of B+-trees on the coupled CPU-GPU architecture. First, we propose a hierarchical searching approach on the single coupled GPU to efficiently deal with the leaf nodes of B+-trees. It adopts a flexible strategy to determine the number of work items in a work group used to search one key, in order to reduce irregular memory accesses and divergent branches within the work group. Second, we present a co-processing pipeline method on the coupled architecture. The CPU and the integrated GPU process the sorting and searching tasks simultaneously to hide sorting and partial searching latencies. A distribution model is designed to support the workload balance strategy based on real-time performance. Our performance study shows that the hierarchical searching scheme provides an improvement of up to 36% on the GPU compared to the baseline algorithm with a fixed number of work items, and the co-processing pipeline method further increases throughput by a factor of 1.8. To the best of our knowledge, this paper is the first study to consider both the CPU and the coupled GPU to optimize B+-tree searches.
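A minimal Python sketch of the real-time workload distribution idea, splitting each sorted batch of search keys between the integrated GPU and the CPU in proportion to their most recently measured throughputs (function names are illustrative; the loop is shown sequentially for clarity, whereas the paper's pipeline overlaps sorting and searching):

import time

def split_batch(keys, cpu_rate, gpu_rate):
    # Split one sorted batch in proportion to measured throughputs (keys/s).
    cut = int(len(keys) * gpu_rate / (cpu_rate + gpu_rate))
    return keys[:cut], keys[cut:]                # GPU share, CPU share

def timed(search_fn, keys):
    # Run a search routine and return its results plus observed throughput.
    start = time.perf_counter()
    out = search_fn(keys)
    elapsed = max(time.perf_counter() - start, 1e-9)
    return out, max(len(keys), 1) / elapsed

def co_process(batches, gpu_search, cpu_search):
    # Refresh the CPU/GPU split after every batch based on real-time performance.
    cpu_rate = gpu_rate = 1.0                    # start with an even split
    results = []
    for batch in batches:
        gpu_keys, cpu_keys = split_batch(sorted(batch), cpu_rate, gpu_rate)
        gpu_out, gpu_rate = timed(gpu_search, gpu_keys)
        cpu_out, cpu_rate = timed(cpu_search, cpu_keys)
        results.append(list(gpu_out) + list(cpu_out))
    return results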
ISBN (print): 9781728165820
For multi- and many-core CPUs, dynamic voltage and frequency scaling (DVFS) for individual cores provides an effective way to execute applications energy-efficiently. However, this requires additional hardware within the chip that regulates voltage and frequency for each hardware sub-component that can be scaled separately. Because of the significant cost of this control hardware, it is often not realistic to provide such a regulator for each individual core. Instead, chip manufacturers group cores into islands consisting of multiple cores with a common regulator, and energy-optimizing solutions must take this constraint into account when assigning frequencies to jobs and cores. Crown scheduling is a technique for the combined resource allocation, mapping, and discrete DVFS-level selection for actor networks consisting of moldable parallel streaming tasks, targeting energy-efficient execution under a throughput constraint. We extend crown scheduling to compute correct schedules also in the presence of DVFS island constraints. We find that, for most task sets, the crown scheduler computes almost equally good schedules for target architectures with and without island constraints.
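As an illustration of the island constraint (not the paper's integrated optimization), a small Python sketch that repairs a per-core frequency assignment so every core in an island runs at the island's highest requested frequency, which preserves the throughput constraint at the cost of some energy:

def enforce_island_constraint(core_freq, islands):
    # All cores of one island share a regulator, so they must run at a common
    # frequency; taking the island maximum keeps every task fast enough.
    adjusted = dict(core_freq)
    for island in islands:                       # island = list of core ids
        f = max(core_freq[c] for c in island)
        for c in island:
            adjusted[c] = f
    return adjusted

# Illustrative use: 8 cores grouped into two islands of 4 cores each.
freqs = {0: 1.2, 1: 2.0, 2: 1.2, 3: 1.6, 4: 0.8, 5: 0.8, 6: 1.2, 7: 0.8}
print(enforce_island_constraint(freqs, [[0, 1, 2, 3], [4, 5, 6, 7]]))
# Island 0 is raised to 2.0 for all four cores, island 1 to 1.2.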
The performance of parallel algorithms is often inconsistent with their preliminary theoretical analyses. Indeed, the difference is increasing between the ability to theoretically predict the performance of a parallel...
ISBN (print): 9783030389611; 9783030389604
Deep neural network training is a common task that has received increasing attention in recent years and is basically performed with Stochastic Gradient Descent or its variants. Distributed training increases training speed significantly but causes precision loss at the same time. Increasing the batch size can improve training parallelism in distributed training. However, if the batch size is too large, it makes the training process harder and introduces more training error. In this paper, we consider keeping the total batch size fixed and lowering the batch size on each GPU by increasing the number of GPUs in distributed training. We train ResNet-50 [4] on the CIFAR-10 dataset with different optimizers, such as SGD, Adam and NAG. The experimental results show that a large batch size speeds up convergence to some degree. However, if the batch size per GPU is too small, the training process fails to converge. A large number of GPUs, which means a small batch size on each GPU, degrades the training performance in distributed training. We tried several ways to reduce the training error on multiple GPUs. According to our results, increasing the momentum is a well-behaved method for improving training performance in distributed training with many GPUs and a constant large total batch size.
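A minimal Python sketch of the two knobs discussed above, shrinking the per-GPU batch while the total batch size stays constant, and a momentum update whose strength can be raised when the per-GPU batch becomes small (illustrative only; the paper trains ResNet-50 with standard framework optimizers):

import numpy as np

def per_gpu_batches(batch, n_gpus):
    # Keep the total batch size fixed; each extra GPU shrinks the local batch.
    return np.array_split(batch, n_gpus)

def sgd_momentum_step(w, grad, velocity, lr=0.1, momentum=0.9):
    # One SGD-with-momentum update; a larger momentum can partly smooth out
    # the noisier gradients produced by a small per-GPU batch.
    velocity = momentum * velocity - lr * grad
    return w + velocity, velocity

# Illustrative: a total batch of 256 samples spread over 8 GPUs -> 32 per GPU.
batch = np.random.randn(256, 10)
print([shard.shape[0] for shard in per_gpu_batches(batch, 8)])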
ISBN (print): 9781728160443
Manycore architectures are mainly composed of a very large number of computing nodes interconnected by a multiplicity of links, usually forming a NoC-like mesh architecture. High-speed links allow a higher throughput to be obtained but are much more expensive than normal links, making the interconnection of the system a cost/performance trade-off. Simulating such architectures is very important in order to characterise the optimal network topology for a given problem. In this work we introduce SCALPsim, a simulation framework for evaluating routing algorithms and network properties in 1-D, 2-D and 3-D regular mesh topologies that simultaneously use links with different latency and throughput characteristics. These features are particularly interesting in large-scale systems with processing elements grouped into clusters, where communication properties differ largely within and between clusters. This paper presents the framework and an application based on Cellular Self-Organizing Maps (CSOM).
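A minimal Python sketch of the kind of experiment such a simulator supports, assuming XY routing on a 2-D mesh and two link classes whose latencies differ inside and between clusters (the node and cluster encoding is an illustrative assumption, not SCALPsim's API):

def xy_route(src, dst):
    # Dimension-order (XY) routing in a 2-D mesh: move along x first, then y.
    x, y = src
    path = [src]
    while x != dst[0]:
        x += 1 if dst[0] > x else -1
        path.append((x, y))
    while y != dst[1]:
        y += 1 if dst[1] > y else -1
        path.append((x, y))
    return path

def path_latency(path, cluster_of, intra_lat=1.0, inter_lat=4.0):
    # Hops that cross a cluster boundary use the slower (or costlier) link class.
    total = 0.0
    for a, b in zip(path, path[1:]):
        total += intra_lat if cluster_of(a) == cluster_of(b) else inter_lat
    return total

# Illustrative: an 8x8 mesh partitioned into 4x4 clusters.
cluster = lambda node: (node[0] // 4, node[1] // 4)
p = xy_route((1, 1), (6, 5))
print(len(p) - 1, path_latency(p, cluster))      # hop count and total latency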
ISBN (print): 9781665414852
Machine learning is widely used in pattern classification, image processing and speech recognition. Neural architecture search (NAS) can effectively reduce the dependence of machine learning on human experts. Due to the high complexity of NAS, the trade-off between time consumption and classification accuracy is vital. This paper presents APENAS, an asynchronous parallel evolution based multi-objective neural architecture search that uses the classification accuracy and the number of parameters as objectives, encoding network architectures as individuals. To make full use of computing resources, we propose a multi-generation undifferentiated fusion scheme to achieve asynchronous parallel evolution on multiple GPUs or CPUs, which speeds up the NAS process. Accordingly, we propose an election pool and a buffer pool for two-layer filtering of individuals: individuals are sorted in the election pool by non-dominated sorting and filtered in the buffer pool by a roulette algorithm to improve the elitism of the Pareto front. APENAS is evaluated on the CIFAR-10 and CIFAR-100 datasets [25]. The experimental results demonstrate that APENAS achieves 90.05% accuracy on CIFAR-10 with only 0.07 million parameters, which is comparable to the state of the art. In particular, APENAS has high parallel scalability, achieving 92.5% parallel efficiency on 64 nodes.
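A minimal Python sketch of the two-layer filtering idea, an election pool kept non-dominated over the two objectives (higher accuracy, fewer parameters) and a roulette-style draw from it (the individual encoding and weighting are illustrative assumptions, not APENAS's exact scheme):

import random

def dominates(a, b):
    # a dominates b if it is at least as accurate and at least as small,
    # and strictly better in one of the two objectives.
    return (a["acc"] >= b["acc"] and a["params"] <= b["params"]
            and (a["acc"] > b["acc"] or a["params"] < b["params"]))

def election_pool(population):
    # Keep only non-dominated individuals (the first Pareto front).
    return [p for p in population
            if not any(dominates(q, p) for q in population if q is not p)]

def roulette(pool, k):
    # Buffer-pool style roulette selection, here weighted by accuracy.
    weights = [p["acc"] for p in pool]
    return [random.choices(pool, weights=weights)[0] for _ in range(k)]

# Illustrative individuals: accuracy and parameter count in millions.
population = [{"acc": 0.90, "params": 0.07}, {"acc": 0.88, "params": 0.05},
              {"acc": 0.85, "params": 0.20}, {"acc": 0.91, "params": 0.30}]
front = election_pool(population)
print(front, roulette(front, 2))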
High-performance electronics has fueled the rich emergence of medical imaging applications, leading to exponential growth in treatment and diagnostic solutions for various medical problems. High-throughput and energy-efficient systems are required to enable the development of complex medical imaging applications. This article presents an energy-efficient hardware-software (HW-SW) co-design of a scalable and reconfigurable image segmentation/classification streaming-based processing platform, explored at various design abstraction levels. Optimized algorithms and architectural techniques achieve significant savings in energy consumption and operational time. The proposed platform has been implemented on a Xilinx Spartan-6 FPGA board and co-simulated with Xilinx System Generator, enabling real-time processing of CT scans for pulmonary nodule detection. Optimized pipelining and scheduling have minimized the memory requirements to a few kB. A parallel architecture has been employed, achieving 10× higher energy efficiency than its serial counterpart and reducing the execution period by 70×. Clinical validation shows that the parallel architecture introduces a 5-7% error in nodule characteristic determination compared to the serial one.
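As a software analogue of the streaming, memory-minimized processing described above (not the actual FPGA design), a short Python sketch that segments a CT slice row by row while holding only a few rows in a buffer:

def stream_segment(rows, threshold=128, buffer_rows=3):
    # Rows arrive one at a time; only a small ring buffer is kept in memory,
    # mirroring the few-kB on-chip buffering of the streaming architecture.
    buf = []
    for row in rows:
        buf.append([1 if px >= threshold else 0 for px in row])
        if len(buf) > buffer_rows:
            yield buf.pop(0)                     # emit the oldest segmented row
    yield from buf

# Illustrative use on a tiny 4x4 "slice".
slice_rows = [[10, 200, 130, 40], [255, 0, 90, 180],
              [120, 128, 127, 129], [30, 30, 250, 60]]
print(list(stream_segment(slice_rows)))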
Sum-of-Squares polynomial normalizing flows have been proposed recently, without taking into account the convexity property and the geometry of the corresponding parameter space. We develop two gradient flows based on...
ISBN (print): 9783030755485
The proceedings contain 45 papers. The special focus in this conference is on Scale Space and Variational Methods in Computer Vision. The topics include: Multiscale Registration; Challenges for Optical Flow Estimates in Elastography; An Anisotropic Selection Scheme for Variational Optical Flow Methods with Order-Adaptive Regularisation; Low-Rank Registration of Images Captured Under Unknown, Varying Lighting; Towards Efficient Time Stepping for Numerical Shape Correspondence; First Order Locally Orderless Registration; First-Order Geometric Multilevel Optimization for Discrete Tomography; Bregman Proximal Gradient Algorithms for Deep Matrix Factorization; Hessian Initialization Strategies for L-BFGS Solving Non-linear Inverse Problems; Inverse Scale Space Iterations for Non-convex Variational Problems Using Functional Lifting; Quantisation Scale-Spaces; A Scaled and Adaptive FISTA Algorithm for Signal-Dependent Sparse Image Super-Resolution Problems; Convergence Properties of a Randomized Primal-Dual Algorithm with Applications to Parallel MRI; Wasserstein Generative Models for Patch-Based Texture Synthesis; Sketched Learning for Image Denoising; Translating Numerical Concepts for PDEs into Neural Architectures; CLIP: Cheap Lipschitz Training of Neural Networks; Variational Models for Signal Processing with Graph Neural Networks; Synthetic Images as a Regularity Prior for Image Restoration Neural Networks; Geometric Deformation on Objects: Unsupervised Image Manipulation via Conjugation; Learning Local Regularization for Variational Image Restoration; Equivariant Deep Learning via Morphological and Linear Scale Space PDEs on the Space of Positions and Orientations; On the Correspondence Between Replicator Dynamics and Assignment Flows; Learning Linear Assignment Flows for Image Labeling via Exponential Integration; On the Geometric Mechanics of Assignment Flows for Metric Data Labeling; A Deep Image Prior Learning Algorithm for Joint Selective Segmentation and Registration.