Monte Carlo (MC) simulation plays a key role in radiotherapy. Since the simulation time of MC programs cannot yet meet clinical requirements, we use the ARM-based FT-2000+ multi-core processor for parallelization, which provides an effective solution for accelerating MC dose calculation. In this paper, we implement and verify FT-DPM, an OpenMP-based MC Dose Planning Method on the FT-2000+. FT-DPM exploits the inherent parallelism of MC simulation and the advantages of the ARM architecture to achieve parallelization on that platform. Meanwhile, we optimize the original DPM program in terms of memory allocation, data structures, and data types. Experiments show that, compared with the original DPM code, FT-DPM produces highly accurate results and reaches a maximum speedup of 155.94 times for the electron case. The parallel program on the FT-2000+ takes only 44.1 seconds to simulate 100 million particle transport histories, showing good potential for clinical application. In addition, the speedup and efficiency of FT-DPM on different core counts are also discussed.
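The parallelization pattern the abstract relies on, independent particle histories split across cores and reduced at the end, can be sketched in a few lines of Python. This is a toy illustration, not the FT-DPM code: the exponential "energy deposit" stands in for real electron/photon transport, and all function names are mine.

```python
import random
from multiprocessing import Pool

def simulate_batch(args):
    # Toy stand-in for a batch of particle histories: each history
    # "deposits" an Exp(1)-distributed energy. Real DPM transport
    # physics is far more involved; only the structure matters here.
    n_histories, seed = args
    rng = random.Random(seed)
    return sum(rng.expovariate(1.0) for _ in range(n_histories))

def parallel_mc_dose(total_histories, n_workers=4):
    # Histories are statistically independent, so they can be split
    # across workers and the partial sums reduced at the end.
    per_worker = total_histories // n_workers
    tasks = [(per_worker, seed) for seed in range(n_workers)]
    with Pool(n_workers) as pool:
        partial = pool.map(simulate_batch, tasks)
    return sum(partial) / (per_worker * n_workers)  # mean deposit per history

if __name__ == "__main__":
    print(parallel_mc_dose(40_000))
```

FT-DPM performs the analogous split with OpenMP threads on the FT-2000+ cores; the key property exploited in both cases is the statistical independence of MC histories.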
Fully capturing contextual information and analyzing the association between entity semantics and types is helpful for the joint extraction task: 1) The context can reflect the part of speech and semantics of an entity. 2) Th...
ISBN:
(digital) 9798331541750
ISBN:
(print) 9798331541767
With the rapid advancement of artificial intelligence, chips have become increasingly important. The emerging RISC-V instruction set is gradually providing powerful computing support for this field. In this context, driven by the computing requirements of deep learning, this paper presents the design of a high-performance floating-point arithmetic logic unit (FALU) that supports calculations on double-precision, single-precision, half-precision, and Bfloat16 data. The design is based on a single-channel algorithm with merged rounding. It improves and implements a composite adder that combines high and low bits, and proposes a tree-like floating-point comparator based on the Kogge-Stone parallel prefix network. To ensure that the FALU components meet the performance requirements, we performed functional verification in the Vivado simulation environment. Operating at 1.47 GHz under a 28 nm CMOS process, the components achieve the predetermined performance targets.
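The tree-like comparator rests on the parallel-prefix idea behind the Kogge-Stone network. A minimal software sketch of that idea (the function names and the +1/-1/0 verdict encoding are mine, not taken from the paper's RTL):

```python
def kogge_stone_prefix(values, op):
    # Inclusive prefix under an associative `op` in ceil(log2 n) rounds,
    # mirroring the wiring of a Kogge-Stone parallel prefix network.
    # Hardware evaluates each round in one logic level; here we loop.
    out = list(values)
    dist = 1
    while dist < len(out):
        out = [op(out[i - dist], out[i]) if i >= dist else out[i]
               for i in range(len(out))]
        dist *= 2
    return out

def compare_bits(a_bits, b_bits):
    # Magnitude comparison of two equal-width bit vectors (MSB first):
    # per-bit verdicts (+1 a>b, -1 a<b, 0 equal) are combined with the
    # associative "first decisive verdict wins" operator; the last
    # prefix element is the overall result.
    verdicts = [(x > y) - (x < y) for x, y in zip(a_bits, b_bits)]
    keep_first = lambda u, v: u if u != 0 else v
    return kogge_stone_prefix(verdicts, keep_first)[-1]
```

In hardware, the ceil(log2 n) combining rounds become log-depth logic levels, which is what makes a tree-like comparator faster than a bit-serial ripple scan.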
Facts in the military field tend to involve elements of time, space, quantity, status, and so on. Existing methods that represent knowledge in the form of triples fail to adequately express these facts, and also cause ob...
A lot of effort has been devoted to the problems of complex relationships and localized cooperation among large numbers of agents in large-scale multi-agent systems. However, global cooperation among all agents is also important, while interactions between agents often happen locally. Enabling agents to learn global and localized cooperation information simultaneously in multi-agent systems is a challenging problem. In this paper, we model the global and localized cooperation among agents with global and localized agent graphs, and propose a novel graph convolutional reinforcement learning mechanism based on these two graphs, which allows each agent to communicate with its neighbors and all agents to cooperate at a high level. Experiments on large-scale multi-agent scenarios in StarCraft II show that our proposed method outperforms state-of-the-art algorithms and allows agents to learn to cooperate efficiently.
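The two-graph idea can be illustrated with a single mean-aggregation message-passing step: running it over a sparse neighbour graph gives the localized view, and over the complete graph the global view. A toy sketch under those assumptions, not the paper's actual network:

```python
def graph_conv_layer(features, neighbours):
    # One mean-aggregation message-passing step: each agent's new
    # embedding averages its own feature vector with its neighbours'.
    # `features[i]` is agent i's vector; `neighbours[i]` its neighbour ids.
    out = []
    for i, own in enumerate(features):
        group = [features[j] for j in neighbours[i]] + [own]
        out.append([sum(col) / len(group) for col in zip(*group)])
    return out

feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
# Localized graph: only adjacent agents exchange messages.
local_view = graph_conv_layer(feats, {0: [1], 1: [0, 2], 2: [1]})
# Global graph: the complete graph, so every agent hears every other.
global_view = graph_conv_layer(feats, {i: [j for j in range(3) if j != i]
                                       for i in range(3)})
# A policy network would consume both embeddings per agent.
```

Stacking such layers over the localized graph propagates information beyond immediate neighbours, while the global-graph pass supplies the all-agent cooperation signal directly.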
Controlled thermonuclear fusion has always been a dream pursued by mankind. However, the physical processes of controlled thermonuclear fusion are complex, requiring numerical simulations with high performance computi...
The correctness and robustness of a neural network model are usually proportional to its depth and width. Neural network models are currently becoming deeper and wider to cope with complex applications, which leads to high memory and compute capacity requirements for the training process. Multi-accelerator parallelism, which deploys multiple accelerators in parallel to train neural networks, is a promising answer to these two challenges. Among the parallel schemes, pipeline parallelism has a great advantage in training speed, but its memory capacity requirements are relatively higher than those of other schemes. To address this challenge, we propose a data transfer mechanism that effectively reduces the peak memory usage of the training process by transferring data in real time. In our experiments, we implement the design and apply it to PipeDream, a mature pipeline parallel scheme. The memory requirement of the training process is reduced by up to 48.5%, and the speed loss is kept within a reasonable range.
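The transfer mechanism can be sketched as bookkeeping: activations are copied to host memory right after the forward pass and fetched back just before the matching backward pass, so device memory never holds all in-flight micro-batches at once. All names below are illustrative, not taken from the PipeDream code:

```python
class ActivationStore:
    # `device` and `host` stand in for accelerator and CPU memory;
    # `peak_device` tracks the maximum number of activation sets
    # resident on the device at any time.
    def __init__(self):
        self.device = {}
        self.host = {}
        self.peak_device = 0

    def forward(self, micro_batch, activations):
        self.device[micro_batch] = activations
        self.peak_device = max(self.peak_device, len(self.device))
        # Transfer out to host as soon as the forward pass is done.
        self.host[micro_batch] = self.device.pop(micro_batch)

    def backward(self, micro_batch):
        # Transfer back in just in time for the backward pass.
        self.device[micro_batch] = self.host.pop(micro_batch)
        self.peak_device = max(self.peak_device, len(self.device))
        return self.device.pop(micro_batch)

store = ActivationStore()
for mb in range(4):            # pipeline warm-up: four forwards in flight
    store.forward(mb, f"acts-{mb}")
for mb in range(4):            # matching backward passes
    store.backward(mb)
# Without offloading, peak residency would be 4; with it, 1.
```

The trade-off the abstract quantifies is exactly the one visible here: lower peak residency in exchange for extra device-host transfer time, which must be kept small enough not to stall the pipeline.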
Knee osteoarthritis (OA) is a common musculoskeletal illness. To solve the problem that inaccurate knee joint localization and inadequate knee OA features extracted from plain radiographs affect the accuracy of knee O...
With the development of Deep Learning (DL), Deep Neural Network (DNN) models have become more complex. At the same time, the development of the Internet makes it easy to obtain large data sets for DL training. Large-scale model parameters and training data enhance the level of AI by improving the accuracy of DNN models. On the other hand, they also present more severe challenges to the hardware training platform, because training a large model needs computing and memory resources that can easily exceed the capacity of a single processor. In this context, integrating more processors into a hierarchical system to conduct distributed training is a direction for the development of training platforms. In distributed training, collective communication operations (including all-to-all, all-reduce, and all-gather) take up a lot of training time, making the interconnection network between computing nodes one of the most critical factors affecting system performance. The hierarchical torus topology, combined with the Ring All-Reduce collective communication algorithm, is one of the current mainstream distributed interconnection networks. However, we believe that its communication performance is not the best. In this work, we first designed a new intra-package communication topology, i.e., the switch-based fully connected topology, which shortens the time consumed by cross-node communication. Then, considering the characteristics of this topology, we carefully devised more efficient all-reduce and all-gather communication algorithms. Finally, combined with the torus topology, we implemented a novel distributed DL training platform. Compared with the hierarchical torus, our platform improves communication efficiency and provides a 1.16-2.68 times speedup in distributed training of DNN models.
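For reference, the baseline Ring All-Reduce that the platform improves on can be simulated in a few lines: each node's gradient is split into n chunks, a reduce-scatter phase of n-1 steps accumulates each chunk around the ring, and an all-gather phase of n-1 steps circulates the finished chunks. A didactic sketch (real implementations overlap communication with computation):

```python
def ring_all_reduce(values):
    # values[i] is node i's gradient vector, pre-split into n chunks
    # (one chunk per node); here each chunk is a single number.
    n = len(values)
    data = [list(v) for v in values]
    # Reduce-scatter: n-1 steps; at step s, node i receives chunk
    # (i - 1 - s) % n from its ring predecessor and accumulates it.
    for s in range(n - 1):
        incoming = [(i, (i - 1 - s) % n, data[(i - 1) % n][(i - 1 - s) % n])
                    for i in range(n)]
        for i, c, val in incoming:
            data[i][c] += val
    # Now node i holds the fully reduced chunk (i + 1) % n.
    # All-gather: n-1 more steps forwarding the finished chunks around.
    for s in range(n - 1):
        incoming = [(i, (i - s) % n, data[(i - 1) % n][(i - s) % n])
                    for i in range(n)]
        for i, c, val in incoming:
            data[i][c] = val
    return data
```

Every step moves exactly one chunk per node to its ring neighbour, which is why the algorithm maps naturally onto torus links; the switch-based fully connected intra-package topology aims to beat this 2(n-1)-step pattern for cross-node traffic.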
ISBN:
(print) 9781665424288
Many anomaly detection applications can provide partially observed anomalies, but only limited work addresses this setting. Additionally, a number of anomaly detectors focus on learning a particular model of the normal/abnormal class. However, the intra-class model might be too complicated to be learned accurately, and it remains a non-trivial task to handle data whose anomalies/inliers follow skewed and heterogeneous distributions. To address these problems, this paper proposes an anomaly detection method that leverages Partially Labeled anomalies via Surrogate supervision-based Deviation learning (termed PLSD). The original supervision (i.e., known anomalies and a set of explored inliers) is transferred into semantically rich surrogate supervision signals (i.e., anomaly-inlier and inlier-inlier classes) via vector concatenation. Different relationships and interactions between anomalies and inliers are then learned directly and efficiently thanks to the neural network's connectivity. Anomaly scoring is performed with the trained network and the high-efficacy inliers. Extensive experiments show that PLSD significantly outperforms state-of-the-art semi-/weakly-supervised anomaly detectors.
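The surrogate-supervision construction, turning instance-level labels into pair-level classes via concatenation, can be sketched as follows. This covers only the pair construction; the deviation network trained on these pairs is omitted, and the helper name is mine:

```python
from itertools import combinations

def surrogate_pairs(anomalies, inliers):
    # Concatenate feature vectors pairwise so that instance-level labels
    # become pair-level classes: anomaly-inlier pairs form one class
    # (label 1), inlier-inlier pairs the other (label 0). A network
    # trained on these pairs then scores a new point by pairing it
    # with trusted inliers.
    positives = [(a + x, 1) for a in anomalies for x in inliers]
    negatives = [(x + y, 0) for x, y in combinations(inliers, 2)]
    return positives + negatives
```

One appeal of the construction is data amplification: k known anomalies and m inliers yield k*m positive pairs, so even a handful of labeled anomalies produces a usable supervised signal.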