Discriminative correlation filter (DCF) is a highly efficient tracking technique that uses circulant-shifted samples of search images to update the template, so the reliability of the input samples determines template qua...
Tensor decomposition and reconstruction attention is a promising global context learning approach because it can remain efficient while avoiding feature compression. To exploit its potential even further in visual tra...
To enhance the query efficiency of relational databases and build a unified computing backend, Meta has developed Velox, a vectorized execution engine library based on columnar storage. Currently, there is no standard...
The task of retrieving and analyzing mass spectra is indispensable for the identification of compounds in mass spectrometry (MS). This methodology is of critical importance as it enables researchers to correlate obser...
With the deepening of research and increasing size of data sets, deep neural networks have become larger and larger. To reduce the training time of large neural networks, researchers propose to optimize neural network...
ISBN:
(Print) 9781665442787
With the deepening of research and the increasing size of data sets, deep neural networks have become larger and larger. To reduce the training time of large neural networks, researchers propose to optimize neural networks at different levels. When performing optimizations, prior knowledge about the execution time of each part of the network can help avoid repeated, time-consuming testing and profiling. However, it is quite challenging to build an accurate iteration-time prediction model, due to the opaque underlying implementations of network operators and the complex architecture of accelerators. In this paper, we propose SEER, an iteration-time prediction model for CNNs targeting GPU platforms. We categorize convolution kernels into three types: Compute-bound, DRAM-bound, and Under-utilized, and then build a performance model for each type. We combine analytical and learning-based models to make the performance model accurate and consistent with the GPU execution model. Experimental results show that our model achieves 14.71% prediction error on convolution kernels and as low as 1.79% prediction error for the overall computation time of one iteration of common CNNs. Besides, when used for selecting the best convolution algorithm, our model shows a 7.14% lower error rate than cuDNN's official algorithm picker.
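The paper's three-way kernel categorization resembles a roofline-style classification. The sketch below illustrates that idea only; the peak-throughput, bandwidth, and occupancy thresholds are hypothetical placeholders, not numbers from SEER:

```python
# Roofline-style kernel categorization sketch. All constants below are
# assumed for illustration and would be calibrated per GPU in practice.

PEAK_FLOPS = 14e12      # assumed peak FP32 throughput (FLOP/s)
PEAK_BW = 900e9         # assumed peak DRAM bandwidth (bytes/s)
UTIL_THRESHOLD = 0.5    # assumed occupancy cutoff for "Under-utilized"

def categorize(flops, bytes_moved, occupancy):
    """Classify a kernel as Compute-bound, DRAM-bound, or Under-utilized."""
    if occupancy < UTIL_THRESHOLD:
        return "Under-utilized"
    intensity = flops / bytes_moved   # arithmetic intensity (FLOP per byte)
    ridge = PEAK_FLOPS / PEAK_BW      # ridge point of the roofline
    return "Compute-bound" if intensity >= ridge else "DRAM-bound"
```

A kernel with high arithmetic intensity and good occupancy lands in the compute-bound bucket; the same kernel launched with too few threads would be classed as under-utilized regardless of intensity.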
A sequent is a pair (Γ, Δ), which is true under an assignment if either some formula in Γ is false, or some formula in Δ is true. In L_(3)-valued propositional logic, a multisequent is a triple Δ∣Θ∣Γ, which i...
A sequent is a pair (Γ, Δ), which is true under an assignment if either some formula in Γ is false or some formula in Δ is true. In L_(3)-valued propositional logic, a multisequent is a triple Δ∣Θ∣Γ, which is true under an assignment if either some formula in Δ has truth-value t, some formula in Θ has truth-value m, or some formula in Γ has truth-value f. There is a sound, complete, and monotonic Gentzen deduction system G for multisequents. Dually, there is a sound, complete, and nonmonotonic Gentzen deduction system G′ for co-multisequents Δ : Θ : Γ. By taking different quantifiers (some or every), there are 8 kinds of definitions of validity of the multisequent Δ∣Θ∣Γ and 8 kinds of definitions of validity of the co-multisequent Δ : Θ : Γ; correspondingly, there are 8 sound and complete Gentzen deduction systems for multisequents and 8 sound and complete Gentzen deduction systems for co-multisequents. Their monotonicity is discussed accordingly.
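The truth conditions above can be written out directly. In this sketch, formulas are abstracted to atoms and an assignment maps each atom to one of the three truth values 't', 'm', 'f'; the function names and the example assignment are illustrative, not from the paper:

```python
# Truth of a sequent (Gamma, Delta) and of a multisequent Delta|Theta|Gamma
# under a three-valued assignment, following the definitions above.

def holds_sequent(gamma, delta, assign):
    """(Γ, Δ) is true if some Γ-formula is false ('f') or some Δ-formula is true ('t')."""
    return any(assign[a] == 'f' for a in gamma) or any(assign[a] == 't' for a in delta)

def holds_multisequent(delta, theta, gamma, assign):
    """Δ∣Θ∣Γ is true if some Δ-formula has value t, some Θ-formula has
    value m, or some Γ-formula has value f."""
    return (any(assign[a] == 't' for a in delta)
            or any(assign[a] == 'm' for a in theta)
            or any(assign[a] == 'f' for a in gamma))

# Example assignment over three atoms.
v = {'p': 't', 'q': 'm', 'r': 'f'}
```

For instance, under `v` the multisequent q∣q∣p is true because q carries the middle value m in the Θ position, while q∣∅∣p is false: q is not t and p is not f.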
Mass spectrometry serves as a pivotal tool for the analysis of small molecules through an examination of their mass-to-charge ratios. Recent advancements in deep learning have markedly enhanced the analysis of mass sp...
Large-scale graphs usually exhibit global sparsity with local cohesiveness, and mining the representative cohesive subgraphs is a fundamental problem in graph analysis. The k-truss is one of the most commonly studied cohesive su...
Large-scale graphs usually exhibit global sparsity with local cohesiveness, and mining the representative cohesive subgraphs is a fundamental problem in graph analysis. The k-truss is one of the most commonly studied cohesive subgraphs, in which each edge is contained in at least k − 2 triangles. A critical issue in mining a k-truss lies in the computation of the trussness of each edge, which is the maximum value of k for which the edge can be in a k-truss. Previous works mostly focus on truss computation in static graphs by sequential algorithms. However, real-world graphs change dynamically all the time. We study distributed truss computation in dynamic graphs in this paper. In particular, we compute the trussness of edges based on the local nature of the k-truss in a synchronized node-centric distributed framework. Iteratively decomposing the trussness of edges by relying only on local topological information is possible with the proposed distributed decomposition algorithm. Moreover, the distributed maintenance algorithm only needs to update a small amount of dynamic information to complete the computation. Extensive experiments have been conducted to show the scalability and efficiency of the proposed algorithm.
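For reference, the trussness definition above admits a simple sequential peeling sketch. This is the classic baseline algorithm, not the paper's distributed node-centric variant: edges whose triangle support cannot sustain a (k+1)-truss are repeatedly removed, and an edge's trussness is the level k at which it is peeled.

```python
# Classic sequential truss decomposition: trussness(e) is the largest k such
# that e lies in a k-truss (every edge of a k-truss is in >= k-2 triangles).
from collections import defaultdict

def trussness(edges):
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v); adj[v].add(u)
    key = lambda u, v: (u, v) if u < v else (v, u)   # normalized edge id
    support = {key(u, v): len(adj[u] & adj[v]) for u, v in edges}
    truss, remaining, k = {}, set(support), 2
    while remaining:
        while True:
            peel = [e for e in remaining if support[e] <= k - 2]
            if not peel:
                break
            for (u, v) in peel:
                remaining.discard((u, v))
                truss[(u, v)] = k
                # each surviving triangle edge over (u, v, w) loses one support
                for w in adj[u] & adj[v]:
                    for f in (key(u, w), key(v, w)):
                        if f in remaining:
                            support[f] -= 1
                adj[u].discard(v); adj[v].discard(u)
        k += 1
    return truss
```

On a triangle with a pendant edge, the pendant edge is peeled at k = 2 (it lies in no triangle) while the three triangle edges get trussness 3.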
The magnetic skyrmion transport driven by a pure voltage-induced strain gradient is proposed and studied via micromagnetic simulations. By combining the skyrmion with a multiferroic heterojunction, a voltage-induced uniaxial strain gr...
The magnetic skyrmion transport driven by a pure voltage-induced strain gradient is proposed and studied via micromagnetic simulations. By combining the skyrmion with a multiferroic heterojunction, a voltage-induced uniaxial strain gradient is adjusted to move the skyrmion. In the system, a pair of short-circuited trapezoidal top electrodes can generate the symmetric strain. Owing to the symmetry of the strain, the magnetic skyrmion can be driven in a linear motion along the middle of the nanostrip without deflection. We calculate the strain distribution generated by the trapezoidal top-electrode pair, and further investigate the influence of the strain intensity as well as the strain gradient on the skyrmion motion. The findings provide a stable and low-energy regulation method for skyrmion transport.
Neural Vector Search (NVS) has exhibited superior search quality over traditional key-based strategies for information retrieval tasks. An effective NVS architecture requires high recall, low latency, and high through...
ISBN:
(Print) 9798331506476
Neural Vector Search (NVS) has exhibited superior search quality over traditional key-based strategies for information retrieval tasks. An effective NVS architecture requires high recall, low latency, and high throughput to enhance user experience and cost-efficiency. However, implementing NVS on existing neural network accelerators and vector search accelerators is sub-optimal due to the separation between the embedding stage and the vector search stage at both the algorithm and architecture levels. Fortunately, we unveil that Product Quantization (PQ) opens up an opportunity to break this separation. However, existing PQ algorithms and accelerators still focus on either the embedding stage or the vector search stage, rather than both simultaneously. Simply combining existing solutions still follows the beaten track of separation and suffers from insufficient parallelization, frequent data-access conflicts, and the absence of scheduling, thus failing to reach optimal recall, latency, and throughput. To this end, we propose a unified and efficient NVS accelerator dubbed NeuVSA, based on an algorithm-architecture co-design philosophy. Specifically, on the algorithm level, we propose a learned PQ-based unified NVS algorithm that consolidates the two separate stages into the same computing and memory-access paradigm. It integrates an end-to-end joint training strategy to learn the optimal codebook and index for enhanced recall and reduced PQ complexity, thus achieving smoother acceleration. On the architecture level, we customize a homogeneous NVS accelerator based on the unified NVS algorithm. Each sub-accelerator is optimized to exploit all the parallelism exposed by the unified algorithm, incorporating a structured index assignment strategy and an elastic on-chip buffer to alleviate buffer conflicts for reduced latency. All sub-accelerators are coordinated by a hardware-aware scheduling strategy for boosted throughput. Experimental results show that the joint training strategy improves recall
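As background on the PQ primitive this abstract builds on: a vector is split into M sub-vectors, each encoded as the id of its nearest sub-codebook centroid, and query-to-database distances are then approximated from the centroids alone. The tiny hand-made codebook below is purely illustrative; NeuVSA's contribution is learning the codebook and index end-to-end, which this sketch does not do.

```python
# Toy Product Quantization: encode vectors as centroid ids per sub-space,
# then score candidates with asymmetric distance computation (ADC).
import numpy as np

M, D = 2, 4                    # 2 sub-spaces over 4-dimensional vectors
codebooks = np.array([         # shape (M, K, D//M): K = 2 centroids per sub-space
    [[0.0, 0.0], [1.0, 1.0]],
    [[0.0, 1.0], [1.0, 0.0]],
])

def encode(x):
    """Encode x as M centroid ids, one per sub-vector."""
    parts = x.reshape(M, D // M)
    return [int(np.argmin(((cb - p) ** 2).sum(axis=1)))
            for cb, p in zip(codebooks, parts)]

def adc_distance(query, code):
    """Squared distance between an exact query and a quantized database vector."""
    parts = query.reshape(M, D // M)
    return sum(((codebooks[m][code[m]] - parts[m]) ** 2).sum() for m in range(M))

db = np.array([[0.9, 1.1, 0.1, 0.9],
               [0.1, 0.0, 0.9, 0.1]])
codes = [encode(x) for x in db]                    # compact database representation
q = np.array([1.0, 1.0, 0.0, 1.0])
best = min(range(len(codes)), key=lambda i: adc_distance(q, codes[i]))
```

Because both encoding (nearest-centroid search) and querying (table lookups over the same centroids) share one computing and memory-access pattern, PQ is a natural bridge between the embedding and vector search stages.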