Recently, there has been increasing demand for energy-efficient signal processing in wearable visual-stimuli-based brain-computer interface (V-BCI) devices. To improve the accuracy and reduce the latency of the V-BCI system, the target identification (TI) algorithms that analyze brain signals continue to advance, and energy-efficient accelerator chips that process the various linear algebra operations constituting these algorithms are growing in importance. In this paper, we propose a domain-specific reconfigurable array processor (RAP) with a dynamically reconfigurable, scalable array of five heterogeneous processing elements (PEs) for the energy-efficient acceleration of basic linear algebra subprograms (BLAS) and matrix decompositions. The system-on-chip (SoC) including the proposed RAP was fabricated in 130-nm CMOS technology with an area of 16.87 mm² and measured at 1.0 V and 90 MHz. With an optimized TI algorithm and scalable array processing, the fabricated chip achieved an information transfer rate (ITR) of 139.9 bits/min and a TI accuracy of 95.4%. In addition, the RAP delivers 16.8× higher TI energy efficiency than prior work, reaching 2144.2 bits/min/mW for information transfer processing with the proposed TI algorithm. Through hardware reconfiguration, the RAP supports a greater variety of linear algebra operations and data sizes than prior accelerators.
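For context, the ITR figure quoted above is conventionally computed with the standard Wolpaw formula from the number of selectable targets, the identification accuracy, and the selection time. A minimal sketch follows; the target count and trial length are hypothetical, since the abstract does not state them:

```python
import math

def wolpaw_itr_bits_per_min(n_targets: int, accuracy: float, trial_seconds: float) -> float:
    """Standard Wolpaw ITR: bits per selection, scaled to bits per minute."""
    p = accuracy
    bits = math.log2(n_targets)
    if 0.0 < p < 1.0:
        bits += p * math.log2(p) + (1.0 - p) * math.log2((1.0 - p) / (n_targets - 1))
    return bits * (60.0 / trial_seconds)

# Hypothetical numbers for illustration only: 40 visual targets, 95.4% accuracy,
# 2 s per selection (the abstract gives neither the target count nor the trial length).
print(round(wolpaw_itr_bits_per_min(40, 0.954, 2.0), 1))
```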
Reconfigurable array processors have emerged as a powerful solution for speeding up computationally intensive applications. However, they may suffer from a data access bottleneck as the frequency of memory accesses rises. At present, the distributed cache design in the reconfigurable array processor has a high cache miss rate, and frequent accesses to external memory lead to long memory access delays. To mitigate this problem, we present a Runtime Dynamic Migration Mechanism (RDMM) for the distributed cache of a reconfigurable array processor, exploiting the pronounced locality and high parallelism of its data accesses. Based on how often the array processors access the remote cache, the mechanism dynamically migrates data with high access frequency from the remote cache into a local migration storage table in the processor. A data search strategy based on the migration storage tables then locates data along the shortest path, effectively reducing the access delay of the whole system and increasing the memory bandwidth of the reconfigurable array processor. We test the proposed mechanism on a reconfigurable array processor hardware platform. The experimental results show that RDMM reduces access delay by up to 35.24% compared with the traditional distributed cache at the highest conflict rate. Compared with Refs. [19], [20], [21], and [23], the working frequency is increased by 15%, the hit rate by 6.1%, and the peak bandwidth by about 3×.
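As a concrete illustration of the migration idea, here is a small software sketch, assuming a simple access-count threshold and an LRU-style local table; the class name, capacity, and threshold are illustrative and not taken from the paper:

```python
from collections import OrderedDict

class MigrationTable:
    """Toy model of a local migration storage table: remote-cache lines whose
    access count passes a threshold are copied into a small local table, so
    later lookups hit locally instead of going back to the remote cache."""

    def __init__(self, capacity: int = 16, threshold: int = 4):
        self.capacity = capacity
        self.threshold = threshold
        self.local = OrderedDict()   # address -> data held locally
        self.remote_hits = {}        # address -> remote access count seen so far

    def access(self, addr, remote_cache):
        if addr in self.local:                       # shortest path: local table hit
            self.local.move_to_end(addr)
            return self.local[addr]
        data = remote_cache[addr]                    # otherwise go to the remote cache
        self.remote_hits[addr] = self.remote_hits.get(addr, 0) + 1
        if self.remote_hits[addr] >= self.threshold:
            if len(self.local) >= self.capacity:
                self.local.popitem(last=False)       # evict the least recently used entry
            self.local[addr] = data                  # migrate the hot line locally
        return data

remote = {0x10: "A", 0x20: "B"}
table = MigrationTable(capacity=2, threshold=2)
for _ in range(3):
    table.access(0x10, remote)
print(0x10 in table.local)   # True: the frequently accessed line has migrated locally
```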
ISBN (Print): 9789881476890
An integrated development environment (IDE) is one of the key elements in building the software ecosystem of reconfigurable array processor (RAP) chips. However, porting a conventional IDE is a daunting task because of the complexity of high-level behavior description at the front-end and the special spatial-temporal instructions bound to the hardware, such as branch prediction, out-of-order execution, and SIMD parallelism. Therefore, we propose a hierarchical IDE design method. At the front-end, static backward slicing is introduced to deconstruct the abstract semantics of high-level languages (HLLs) into relatively fixed operations with a simple structure, so that the spatial-temporal features are easy to peel out. At the bottom, the machine instruction sets are encapsulated into instruction groups (IGs), raising the semantic abstraction level of the hardware description. Physical hardware details are separated from the intermediate representation (IR), which improves scalability. Finally, an IDE is developed with this method for high efficiency video coding (HEVC) algorithm mapping. The testing results show that the efficiency of algorithm development is greatly improved while maintaining the same coding quality.
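To illustrate the instruction-group abstraction, the following sketch shows how an IR expressed in IG names could be lowered to concrete machine instructions; the opcodes, operands, and group names are hypothetical, since the paper's instruction set is not given here:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class MachineInstr:
    opcode: str
    operands: List[str]

# Hypothetical encoding: each IG bundles a fixed sequence of machine instructions
# behind one IR-level name, hiding the physical hardware details from the IR.
INSTRUCTION_GROUPS = {
    "IG_LOAD2": [MachineInstr("LD", ["r2", "[a0]"]),
                 MachineInstr("LD", ["r3", "[a1]"])],
    "IG_MAC":   [MachineInstr("MUL", ["r1", "r2", "r3"]),
                 MachineInstr("ADD", ["r0", "r0", "r1"])],
}

def lower(ir_ops: List[str]) -> List[MachineInstr]:
    """Lower an IR expressed in IG names to the underlying machine instructions."""
    out: List[MachineInstr] = []
    for op in ir_ops:
        out.extend(INSTRUCTION_GROUPS[op])
    return out

print(lower(["IG_LOAD2", "IG_MAC"]))
```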
With the rapid growth of computational workloads and power consumption, there is a pressing need for a power-efficient architecture that balances computational efficiency with application flexibility. This paper proposes a programmable, self-reconfigurable array-processor architecture for multimedia applications, consisting of 1024 thin-core processing elements (PEs). Its performance and power dissipation are demonstrated with different multimedia algorithms such as hashing and fractional motion estimation (FME). The results show that the proposed architecture can provide high performance with lower energy consumption through parallel computation.
Unstructured and irregular graph data causes strong randomness and poor locality of data accesses in graph processing. This paper optimizes the depth-branch-resorting algorithm (DBR) and proposes a branch-alternation-resorting algorithm (BAR). To run the algorithm in parallel and improve its efficiency, BAR is mapped onto the reconfigurable array processor (APR-16) to perform vertex reordering, effectively improving the locality of graph data accesses. The BAR algorithm is validated on the GraphBIG framework by running breadth-first search (BFS), single-source shortest path (SSSP), and betweenness centrality (BC) on the reordered datasets. The results show that, compared with the DBR and Corder algorithms, BAR reduces execution time by up to 33.00% and 51.00%, respectively. In terms of data movement, BAR achieves a maximum reduction of 39.00% compared with DBR and 29.66% compared with Corder. In terms of computational complexity, BAR achieves a maximum reduction of 32.56% compared with DBR and 53.05% compared with Corder.
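The abstract does not spell out BAR's reordering rule, so the sketch below uses a plain BFS traversal order as a stand-in to show what vertex reordering does to an adjacency structure: vertices visited together receive nearby ids, which is what improves access locality.

```python
from collections import deque

def bfs_reorder(num_vertices, adjacency):
    """Return an old->new id mapping so that co-visited vertices get nearby ids."""
    new_id, next_id = {}, 0
    for start in range(num_vertices):
        if start in new_id:
            continue
        queue = deque([start])
        new_id[start] = next_id; next_id += 1
        while queue:
            u = queue.popleft()
            for v in adjacency[u]:
                if v not in new_id:
                    new_id[v] = next_id; next_id += 1
                    queue.append(v)
    return new_id

def relabel(adjacency, new_id):
    """Apply the mapping to produce the reordered adjacency lists."""
    out = [[] for _ in adjacency]
    for u, neighbors in enumerate(adjacency):
        out[new_id[u]] = sorted(new_id[v] for v in neighbors)
    return out

graph = [[1, 4], [0, 2], [1, 3], [2], [0]]   # tiny example graph
mapping = bfs_reorder(len(graph), graph)
print(relabel(graph, mapping))
```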
Deep learning algorithms have been widely used in computer vision, natural language processing, and other fields. However, due to the ever-increasing scale of deep learning models, the requirements for storage and computing performance keep rising, and processors based on the von Neumann architecture have gradually exposed significant shortcomings such as high power consumption and long latency. To alleviate this problem, large-scale processing systems are shifting from a traditional computing-centric model to a data-centric model. A near-memory computing array architecture based on a shared buffer is proposed in this paper to improve system performance. It supports instructions with store-calculation integration, reducing data movement between the processor and main memory. Through data reuse, the processing speed of the algorithm is further improved. The proposed architecture is verified and tested through a parallel realization of a convolutional neural network (CNN) algorithm. The experimental results show that, at a frequency of 110 MHz, the speed of a single convolution operation is increased by 66.64% on average compared with a CNN architecture that performs parallel calculations on a field programmable gate array (FPGA). The processing speed of the whole convolution layer is improved by 8.81% compared with a reconfigurable array processor that does not support near-memory computing.
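A minimal sketch of the data-reuse effect that a shared buffer enables for convolution: each input row is loaded once into a small working set and reused by all overlapping kernel positions instead of being refetched from main memory. The sizes and the averaging kernel are arbitrary examples, not the paper's configuration.

```python
import numpy as np

def conv2d_row_reuse(image: np.ndarray, kernel: np.ndarray) -> np.ndarray:
    """Valid 2-D convolution that keeps only kh input rows live and slides them."""
    kh, kw = kernel.shape
    oh, ow = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    rows = [image[r] for r in range(kh)]          # working set held in the shared buffer
    for i in range(oh):
        for j in range(ow):
            window = np.stack([r[j:j + kw] for r in rows])
            out[i, j] = np.sum(window * kernel)   # every row is reused kw*ow times
        if i + kh < image.shape[0]:
            rows.pop(0)                           # slide: drop the oldest row,
            rows.append(image[i + kh])            # fetch exactly one new row
    return out

img = np.arange(36, dtype=float).reshape(6, 6)
k = np.ones((3, 3)) / 9.0
ref = [[np.sum(img[i:i + 3, j:j + 3] * k) for j in range(4)] for i in range(4)]
print(np.allclose(conv2d_row_reuse(img, k), ref))   # True
```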
After the extension of depth modeling mode 4 (DMM-4) in 3D high efficiency video coding (3D-HEVC), the computational complexity increases sharply, which degrades the real-time performance of video coding. To reduce the computational complexity of DMM-4, a simplified hardware-friendly contour prediction algorithm is proposed in this paper. Based on the similarity between the texture and depth maps, the proposed algorithm codes depth blocks directly to calculate edge regions, reducing the number of reference pixels. Verified on the HTM16.1 test sequences, the proposed algorithm reduces coding time by 9.42% compared with the original algorithm. To avoid the time-consuming serial coding on HTM, a parallelization design of the proposed algorithm based on a reconfigurable array processor (DPR-CODEC) is presented. The parallelization design reduces storage access time and configuration time and saves storage space. Verified with a Xilinx Virtex 6 FPGA, experimental results show that the parallelization design is capable of processing HD 1080p at above 30 frames per second. Compared with related work, the scheme reduces LUTs by 42.3%, registers by 85.5%, and overall hardware resources by 66.7%. The data loading speedup ratio of the parallel scheme can reach … On average, the serial/parallel speedup ratio of encoding time for different-sized templates can reach 2.446.
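For background, DMM-4 contour prediction derives a binary partition from the co-located texture block and predicts each depth partition with a constant value. The sketch below uses the texture block's mean as the threshold for simplicity; it illustrates the standard idea, not the paper's simplified hardware-friendly variant.

```python
import numpy as np

def dmm4_contour_partition(texture_block: np.ndarray) -> np.ndarray:
    """Binary partition pattern derived from the co-located texture block."""
    return (texture_block >= texture_block.mean()).astype(np.uint8)

def predict_depth(depth_block: np.ndarray, pattern: np.ndarray) -> np.ndarray:
    """Fill each partition with the mean depth of that region (constant prediction)."""
    pred = np.empty_like(depth_block, dtype=float)
    for region in (0, 1):
        mask = pattern == region
        pred[mask] = depth_block[mask].mean() if mask.any() else 0.0
    return pred

# Toy 4x4 block whose texture edge matches the depth edge, as assumed by DMM-4.
texture = np.array([[10, 12, 80, 85],
                    [11, 14, 82, 88],
                    [12, 15, 84, 90],
                    [13, 16, 86, 92]], dtype=float)
depth = np.array([[30, 30, 120, 120]] * 4, dtype=float)
pattern = dmm4_contour_partition(texture)
print(pattern)
print(predict_depth(depth, pattern))
```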
ISBN (Print): 9798400707674
Although the cross-component linear model used in H.266 chroma coding can increase coding efficiency, it also introduces high complexity. To address this problem, this paper studies the relationship between the texture complexity of a Coding Unit (CU) and the coding modes of the CUs spatially adjacent to it. The article proposes a fast linear-mode prediction algorithm based on the coding modes of adjacent blocks and on texture complexity. First, experimental data are used to determine the association between texture complexity and the intra-frame prediction mode decision. Second, the coding mode of the current block is decided by analyzing the coding modes of its spatial neighbors. Finally, a parallel implementation strategy for DPRAP-based chroma linear intra-frame prediction mapping is developed. According to the testing results, the optimized method reduces execution time by roughly 26.3% compared with the standard algorithm in VTM-9.0.
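A sketch of the kind of early-termination rule the abstract describes, assuming a gradient-based texture-complexity measure and a neighbor-mode check; the threshold, mode names, and voting policy are placeholders rather than the paper's derived parameters. The idea is to skip the costly cross-component mode search when the current CU is smooth and no spatial neighbor chose such a mode.

```python
import numpy as np

CCLM_MODES = {"CCLM_LT", "CCLM_L", "CCLM_T"}   # placeholder names for the three CCLM variants

def texture_complexity(luma_block: np.ndarray) -> float:
    """Mean absolute horizontal + vertical gradient as a cheap complexity measure."""
    gx = np.abs(np.diff(luma_block, axis=1)).mean()
    gy = np.abs(np.diff(luma_block, axis=0)).mean()
    return gx + gy

def try_cclm(luma_block: np.ndarray, neighbor_modes: list, threshold: float = 4.0) -> bool:
    """Return True if the CCLM modes should still be searched for this CU."""
    if any(m in CCLM_MODES for m in neighbor_modes):
        return True                               # a neighbor used CCLM: keep searching
    return texture_complexity(luma_block) >= threshold

flat = np.full((8, 8), 128.0)
print(try_cclm(flat, ["PLANAR", "DC"]))           # False: smooth CU, no CCLM neighbors
```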
A SystemC-based emulator architecture optimized for reconfigurable array processors is proposed. With this novel architecture, the emulator supports performance evaluation, RTL co-simulation, as well as system software development. The emulator has been used in the development of a reconfigurable multimedia application processor (ReMAP), which has been successfully fabricated in an SMIC 0.8 µm process.