ISBN: 9781665462723 (print)
The performance and energy costs of coordinating and performing data movement have led to proposals that add compute units and/or specialized access units to the memory hierarchy. However, current on-chip offload models are restricted to fixed compute and access pattern types, which limits software-driven optimizations and the applicability of such an offload interface to heterogeneous accelerator resources. This paper presents a computation offload interface for multi-core systems augmented with distributed on-chip accelerators. With energy efficiency as the primary goal, we define mechanisms to identify offload partitioning, create a low-overhead execution model to sequence these fine-grained operations, and evaluate a set of workloads to identify the complexity needed to achieve distributed near-data *** demonstrate that our model and interface, combining features of dataflow in parallel with near-data processing engines, can be profitably applied to memory hierarchies augmented with either specialized compute substrates or lightweight near-memory cores. We separate the benefits contributed by each of the following: elevating data access semantics, near-data computation, inter-accelerator coordination, and compute/access logic specialization. Experimental results indicate a geometric mean (energy efficiency improvement; speedup; data movement reduction) of (3.3; 1.59; 2.4)×, (2.46; 1.43; 3.5)× and (1.46; 1.65; 1.48)× compared to an out-of-order processor, a monolithic accelerator with centralized accesses, and a monolithic accelerator with decentralized accesses, respectively. Evaluating both lightweight core and CGRA fabric implementations highlights the flexibility of the model and quantifies the benefits of compute specialization for energy efficiency and speedup at 1.23× and 1.43×, respectively.
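The abstract reports its results as geometric means across workloads. As a reminder of how such a summary is computed, here is a minimal Python sketch. The function name and the input values in the usage note are illustrative assumptions, not data from the paper.

```python
import math


def geometric_mean(ratios):
    """Geometric mean of positive per-workload ratios (e.g. speedups).

    Computed as exp(mean(log(r))), which avoids overflow from multiplying
    many ratios together before taking the n-th root.
    """
    if not ratios or any(r <= 0 for r in ratios):
        raise ValueError("ratios must be a non-empty sequence of positive numbers")
    return math.exp(sum(math.log(r) for r in ratios) / len(ratios))
```

For example, `geometric_mean([2.0, 8.0])` is approximately 4.0, whereas the arithmetic mean of the same hypothetical ratios is 5.0. The geometric mean is the usual choice for ratios because a single large outlier dominates it less.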
Monte Carlo (MC) simulation plays an important part in dose calculation for radiotherapy treatment planning. Because the accuracy of MC simulation depends on the number of simulated particle histories, it is very time-consuming. The Intel Many Integrated Core (MIC) architecture, which consists of more than 50 cores and supports many parallel programming models, provides an efficient alternative for accelerating MC dose calculation. This paper implements the OpenMP-based MC Dose Planning Method (DPM) for radiotherapy treatment problems on the Intel MIC architecture. The implementation has been verified on a target MIC coprocessor with 57 cores. The results demonstrate that the OpenMP-based DPM implementation produces accurate results and achieves a maximum speedup of 10.53× over the original DPM implementation on a Xeon E5-2670 CPU. Additionally, the speedup and efficiency of the implementation on different numbers of MIC cores are reported.
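The parallelization idea described above (independent particle histories divided among workers, with per-worker tallies merged at the end) can be sketched as follows. This is a toy illustration in Python, not the DPM code or its OpenMP implementation: the slab geometry, the one-unit-per-history deposit, and all names and parameters are assumptions made for the example. CPython threads will not show the speedup reported in the abstract, so the sketch only demonstrates the decomposition.

```python
import random
from concurrent.futures import ThreadPoolExecutor


def simulate_histories(n_histories, seed, n_bins=10, depth_cm=10.0,
                       mean_free_path_cm=3.0):
    """Toy tally: each history deposits one unit at its first interaction depth."""
    rng = random.Random(seed)  # private random stream, so workers do not share state
    tally = [0] * n_bins
    bin_width = depth_cm / n_bins
    for _ in range(n_histories):
        depth = rng.expovariate(1.0 / mean_free_path_cm)
        if depth < depth_cm:  # histories that leave the slab deposit nothing
            tally[min(n_bins - 1, int(depth / bin_width))] += 1
    return tally


def parallel_dose(total_histories, n_workers, base_seed=1234):
    """Split histories across workers, then merge the per-worker tallies."""
    # Distribute histories as evenly as possible; the first few workers get one extra.
    counts = [total_histories // n_workers + (1 if i < total_histories % n_workers else 0)
              for i in range(n_workers)]
    seeds = [base_seed + i for i in range(n_workers)]
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partials = list(pool.map(simulate_histories, counts, seeds))
    # Reduction step: sum the tallies bin by bin.
    return [sum(column) for column in zip(*partials)]
```

Giving each worker its own seeded generator keeps the run reproducible for a fixed worker count and avoids contention on a shared random stream; private tallies followed by a final reduction play the same role here that a reduction clause or per-thread buffers would in an OpenMP version.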
ISBN: 9780897913195 (print)
A novel method that provides extremely fast Input/Output (I/O) data transfer to a RISC-like graphics engine is presented. This method employs a combined two-port/two-access and three-port/three-access register file used for concurrent processing and I/O data transfer. The three-port cell employed is only 25% larger than the two-port cell, offering considerable advantages over alternative approaches, such as FIFOs or register *** paper discusses some methods that can provide very fast I/O data transfer in parallel with execution and focuses on the design and implementation of the 2 µm CMOS register file *** Terms - Computer Architecture, Reduced Instruction Set Computers, VLSI Design, Computer Image Generation, Interprocessor Communication.
ISBN: 9798331542559 (digital)
ISBN: 9798331542566 (print)
Communication and information service systems are undergoing rapid advancements driven by the evolution of telecommunications and computer technology. The convergence of these fields has led to significant growth in digital networks, which has spurred extensive research and exploration. However, optimizing the use of these networks at minimal additional cost, while integrating features such as video conferencing, email services, and remote database access, remains a substantial technological challenge. This study addresses the limitations of current mobile communication networks by exploring potential enhancements through the application of smart computing solutions, particularly artificial intelligence (AI). The research methodology involves a comprehensive review of existing mobile communication technologies and their integration with AI. This includes examining advancements in AI and machine learning (ML) as applied to mobile networks, with a focus on improving performance, flexibility, and automation. The results of the review indicate that AI and ML have significant potential to enhance network operations. By leveraging the vast data generated within 5G systems, these technologies can optimize resource utilization, improve service quality, and support diverse emerging applications. Despite these advancements, challenges such as reliability, speed trade-offs, complexity, privacy, and security persist. The findings highlight the need for further research in AI, including deep neural network acceleration, parallel computing, cloud computing, and distributed deep learning systems. Such research is crucial for enabling mobile communication networks to achieve higher throughput and ultra-low latency. In conclusion, while AI-enabled communication systems offer promising solutions to many challenges, simplifying AI implementation is essential for practical deployment. This will ensure that future communication networks can effectively meet the growing demands and evolving
ISBN: 9781467362337 (print)
This paper introduces a scalable solution for distributing content-based video analysis tasks using the emerging MapReduce programming model. Scalable and efficient solutions are needed for this type of task, as the amount of multimedia content is growing at an increasing rate. We present a novel implementation utilizing the popular Apache Hadoop MapReduce framework for both analysis job scheduling and video data distribution. We employ face detection as a case example because it represents a popular visual content analysis task. The main contribution of this paper is the performance evaluation of distribution models for video content processing in various configurations. In our experiments, we compared the performance of our video data distribution method against two alternative solutions on a seven-node cluster. Hadoop's performance overhead in video content analysis was also evaluated. We found Hadoop to be a data-efficient solution with minimal computational overhead for the face detection task.
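To make the MapReduce decomposition of a per-frame analysis task concrete, here is a minimal in-process sketch in Python. It does not use the Hadoop API and is not the authors' implementation: `detect_faces` is a hypothetical stand-in that counts `F` characters in a toy string "frame", and the record layout `(video_id, frame)` is an assumption made for the example.

```python
from collections import defaultdict


def detect_faces(frame):
    """Hypothetical detector stub: counts 'F' markers in a toy string frame."""
    return frame.count("F")


def map_frame(record):
    """Map step: emit (video_id, faces_in_frame) for one input record."""
    video_id, frame = record
    yield video_id, detect_faces(frame)


def shuffle(pairs):
    """Group mapped values by key, as the framework would between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_counts(video_id, counts):
    """Reduce step: total detections per video."""
    return video_id, sum(counts)


def run_job(records):
    """Run map, shuffle, and reduce sequentially over an iterable of records."""
    mapped = (pair for record in records for pair in map_frame(record))
    return dict(reduce_counts(key, values) for key, values in shuffle(mapped).items())
```

Because each frame is mapped independently, the map calls could be spread over cluster nodes; in this toy version they simply run in sequence. Calling `run_job([("a", "..F."), ("a", "FF"), ("b", "...")])` groups the per-frame counts by video before summing them.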