ISBN:
(Print) 9781479961245
Advanced SSDs employ a RAM-based write buffer to improve their write performance. The buffer intentionally delays write requests to reduce flash write traffic and reorders them to minimize the cost of garbage collection. This work presents a novel buffer algorithm for page-mapping multichannel SSDs. We propose grouping temporally or spatially correlated buffer pages and writing these grouped buffer pages to the same flash block. This strategy dramatically increases the probability of bulk data invalidations in flash blocks. In multichannel architectures, each channel is assigned its own groups of buffer pages for writing, so channel striping does not divide a group of correlated buffer pages into small pieces. We have conducted simulations and experiments using an SSD simulator and a real SSD platform, respectively. Our results show that our design greatly outperforms existing buffer algorithms.
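The grouping idea can be sketched in a few lines. The function below is an illustration under the simplifying assumption that spatial correlation means LBA adjacency within an aligned region; the name `group_buffer_pages` and the region granularity are hypothetical, not from the paper:

```python
from collections import defaultdict

def group_buffer_pages(dirty_lbas, pages_per_block):
    """Group spatially correlated buffer pages so that each group can be
    flushed to a single flash block (illustrative sketch only)."""
    groups = defaultdict(list)
    for lba in sorted(dirty_lbas):
        # Pages whose LBAs fall in the same block-aligned region are
        # assumed spatially correlated and likely invalidated together.
        groups[lba // pages_per_block].append(lba)
    # Each value is one flush unit destined for one flash block.
    return list(groups.values())
```

Flushing a whole group to one block means a later overwrite of that LBA range invalidates most of the block at once, which is exactly the bulk-invalidation effect the abstract describes.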
ISBN:
(Digital) 9798331509972
ISBN:
(Print) 9798331509989
In today’s industrial landscape, automation has become increasingly vital, particularly in the deployment of robots for tasks such as sorting machine components. The use of robotic systems enhances process accuracy and speed, resulting in significant cost reductions and improved productivity compared to manual labor. This paper aims to design and develop an automated sorting system for various types of mechanical and electrical parts, utilizing image processing and machine vision algorithms with a Delta parallel robot equipped with a two-finger gripper. The target mechanical and electrical parts in this study are screws, nuts, metal washers, rubber washers, retaining rings, rectangular keys, wall plugs, resistors, potentiometers, capacitors, batteries, ICs, and LEDs. The YOLOv3 algorithm and adaptive thresholding are employed to detect and distinguish objects from the background, and size measurement is achieved with the help of a custom marker with known dimensions. In this study, transfer learning based on pre-trained YOLOv3 weights for the COCO dataset is applied. The proposed system attains a final mean Average Precision (mAP@0.5) exceeding 0.95 for part detection using YOLOv3. Additionally, it demonstrates an overall pick-and-place success rate exceeding 90%.
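Marker-based size measurement of the kind described reduces to a pixel-to-millimetre conversion. A minimal sketch, assuming the marker's pixel width is measured in the same image plane as the part (the function name and parameters are illustrative, not the paper's code):

```python
def estimate_size_mm(object_px, marker_px, marker_mm):
    """Convert an object's measured pixel extent to millimetres using a
    reference marker of known physical size visible in the same image."""
    mm_per_px = marker_mm / marker_px  # scale factor from the marker
    return object_px * mm_per_px
```

For example, if a 50 mm marker spans 200 pixels, an object spanning 120 pixels is estimated at 30 mm.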
The embedded and high-performance computing (HPC) sectors, which in the past were completely separate, are now converging under the pressure of two driving forces: the release of less power-hungry server processors and the increased performance of new low-power Systems-on-Chip (SoCs) developed to meet the requirements of the demanding mobile market. This convergence allows porting to low-power embedded architectures applications that were originally confined to traditional HPC systems. In this paper, we present our experience of porting the filtered back-projection algorithm to a low-power, low-cost system-on-chip, the NVIDIA Tegra K1, which is based on a quad-core ARM CPU and an NVIDIA Kepler GPU. The filtered back-projection algorithm is heavily used in 3D tomography reconstruction software. The porting was done using several programming frameworks (OpenMP and CUDA), and multiple versions of the application were developed to exploit both the SoC CPU and GPU. Performance was measured in terms of 2D slices (of a 3D volume) reconstructed per time unit and per energy unit. The results obtained with all the developed versions are reported and compared with those obtained on a typical x86 HPC node accelerated with a recent NVIDIA GPU. The best performance is achieved by combining the OpenMP and CUDA versions of the algorithm. In particular, we found that only three Jetson TK1 boards, equipped with Gigabit Ethernet interconnections, can reconstruct as many images per time unit as a traditional server, using one order of magnitude less energy. The results of this work can be applied, for instance, to the construction of an energy-efficient computing system for a portable tomographic apparatus.
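For reference, a minimal parallel-beam filtered back-projection can be written in a few lines of NumPy. This is an illustrative single-threaded sketch with a simple ramp filter and nearest-neighbour interpolation, not the paper's OpenMP/CUDA implementation:

```python
import numpy as np

def fbp_slice(sinogram, angles_deg):
    """Reconstruct one 2D slice from a parallel-beam sinogram
    (rows = projections, columns = detector bins). Sketch only."""
    n_det = sinogram.shape[1]
    # Ramp filter applied per projection in the Fourier domain.
    freqs = np.fft.fftfreq(n_det)
    filtered = np.real(np.fft.ifft(
        np.fft.fft(sinogram, axis=1) * np.abs(freqs), axis=1))
    size = n_det
    img = np.zeros((size, size))
    coords = np.arange(size) - size / 2
    xx, yy = np.meshgrid(coords, coords)
    for proj, ang in zip(filtered, np.deg2rad(angles_deg)):
        # Detector coordinate of each pixel under this projection angle.
        t = np.round(xx * np.cos(ang) + yy * np.sin(ang) + n_det / 2).astype(int)
        valid = (t >= 0) & (t < n_det)
        img[valid] += proj[t[valid]]
    return img
```

The two hot loops (per-projection filtering and the back-projection accumulation) are what the paper parallelizes across the ARM cores with OpenMP and across the Kepler GPU with CUDA.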
Optimising the execution of Bag-of-Tasks (BoT) applications on the cloud is a hard problem due to the trade-offs between performance and monetary cost. The problem is further complicated when multiple BoT applications need to be executed. In this paper, we propose and implement a heuristic algorithm that schedules the tasks of multiple applications onto different cloud virtual machines in order to maximise performance while satisfying a given budget constraint. Current approaches are limited in task scheduling since they place a limit on the number of cloud resources that can be employed by the applications. The proposed algorithm imposes no such limits and, in comparison with other approaches, achieves on average a 10% performance improvement. The experimental results also highlight that the algorithm yields consistent performance even under low budget constraints, which cannot be achieved by competing approaches.
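A budget-constrained provisioning step of the general kind described can be sketched as a greedy loop. The cost/throughput model and the best-value-first rule below are illustrative assumptions, not the paper's actual heuristic:

```python
def provision_vms(vm_types, budget):
    """Greedy sketch: repeatedly acquire the VM type with the best
    tasks-per-dollar ratio until the budget is exhausted.
    vm_types: list of (cost_per_hour, tasks_per_hour) pairs (hypothetical)."""
    chosen, spent, throughput = [], 0.0, 0.0
    # Best value first: tasks per unit cost.
    ranked = sorted(vm_types, key=lambda t: t[1] / t[0], reverse=True)
    while True:
        affordable = [t for t in ranked if spent + t[0] <= budget]
        if not affordable:
            break
        cost, rate = affordable[0]
        chosen.append((cost, rate))
        spent += cost
        throughput += rate
    return chosen, spent, throughput
```

Because the loop stops only when nothing affordable remains, there is no fixed cap on how many VMs an application may use, mirroring the "no such limits" property the abstract claims for the proposed algorithm.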
The sizes of databases have seen exponential growth in the past, and such growth is expected to accelerate in the future, with the steady drop in storage cost accompanied by a rapid increase in storage capacity. Many years ago, a terabyte database was considered large, but nowadays such databases are sometimes regarded as small, and the daily volumes of data being added to some databases are measured in terabytes. In the future, petabyte and exabyte databases will be common. With such volumes of data, it is evident that the sequential processing paradigm will be unable to cope: for example, even assuming a data rate of 1 terabyte per second, reading through an exabyte database will take over 10 days. To effectively manage such volumes of data, it is necessary to allocate multiple resources, very often massively so. Processing databases of such astronomical proportions requires an understanding of how high-performance systems and parallelism work. Besides the massive volume of data in the database to be processed, some data is distributed across the globe in a Grid environment. These massive data centres are also a part of the emergence of Cloud computing, where data access has shifted from local machines to powerful servers hosting web applications and services, making data access across the Internet using standard web browsers pervasive. This adds another dimension to such systems. This talk, based on our recently published book [1], discusses a fundamental understanding of parallelism in data-intensive applications and demonstrates how to develop faster capabilities to support them. This includes the importance of indexing in parallel systems [2-4], specialized algorithms to support various query processing [5-9], as well as object-oriented schemes [10-12]. Parallelism in databases has been around since the early 1980s, when many researchers in this area aspired to build large special-purpose database machines -- databases employing dedicated specialized
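The scan-time argument is easy to verify with back-of-the-envelope arithmetic (decimal units assumed):

```python
# Sequential-scan time for a database of a given size at a given rate.
def scan_days(db_bytes, rate_bytes_per_s):
    return db_bytes / rate_bytes_per_s / 86400  # 86400 seconds per day

exabyte, petabyte, terabyte = 10**18, 10**15, 10**12
# An exabyte scanned at 1 TB/s takes roughly 11.6 days;
# a petabyte at the same rate takes only about 17 minutes.
print(round(scan_days(exabyte, terabyte), 1))  # -> 11.6
```

This is why only an exabyte-scale database, not a petabyte one, yields the "over 10 days" figure at a 1 TB/s read rate.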
ISBN:
(Digital) 9798350351330
ISBN:
(Print) 9798350351347
Partial shading faults on photovoltaic (PV) modules can lead to power reduction, hot spots, and reduced lifetime. Although shaded modules can be bypassed by the bypass diodes, the peak power produced is lower than the ideal value. In this paper, a differential power processing scheme with a shadow fault detection method is proposed for two parallel-connected PV strings. The method uses the normalized error (DE) between the I-V curve in normal operation and the curve under partial shading conditions to determine whether PV cells are dissipating power. When a shadow fault is detected, the proposed voltage equalizer operates to eliminate the voltage and power imbalance and improve the peak output power. The voltage equalizer uses a series resonant voltage multiplier (SRVM) to achieve this.
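The curve-comparison step can be illustrated with a simple normalized error between the reference and measured current samples taken at the same voltage points. The exact metric and threshold used in the paper may differ, so treat this as a sketch:

```python
def normalized_error(i_reference, i_measured):
    """Normalized error between a reference I-V curve and a measured one,
    both sampled at identical voltage points (illustrative metric)."""
    num = sum((a - b) ** 2 for a, b in zip(i_reference, i_measured))
    den = sum(a ** 2 for a in i_reference)
    return (num / den) ** 0.5  # 0 when the curves match exactly
```

A value near zero indicates the string follows its unshaded reference curve; a large value flags a shading fault and would trigger the voltage equalizer.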
With the growing acceptance of multi-core architectures by industry, devising novel techniques to extract thread-level parallelism from sequential programs has become a fundamental need. The role of the compiler, along with the programming model and architectural innovation, is of utmost importance to fully realize the potential performance benefits of multi-core architectures. This paper evaluates the capabilities and limitations of parallelizing compilers in extracting parallelism automatically from the loops present in sequential programs. Applications from the embedded benchmark suites EEMBC 1.1 and MiBench are analyzed using the Intel C++ 9.1 Compiler for Linux. The contributions of the paper are manifold. Firstly, the paper shows that on average 10% of the loops can be parallelized automatically by the Intel Compiler. Secondly, we show that the auto-parallelizable loops cover only about 12.5% of the total program execution time. Thirdly, we explore the reasons behind the inability of the compiler to auto-parallelize the majority of the loops; we find that on average 37.5% and 8% of the loops cannot be auto-parallelized because of a statically unknown loop trip count and probable data dependence, respectively. Finally, this study identifies the set of loops which comprise most of the execution time of the programs and shows that the compiler, on average, can automatically parallelize about 22% of such loops.
Mixed-signal SoCs have always played an important role in long-haul, high-capacity fiber-optic transmission systems. In the early days of 10 Gbit NRZ transmission, the functions were simple, but implementing the decision device, clock recovery, and de-multiplexing functions at such a high data rate was challenging at the time. In the span of 15 years, single-channel data rates and network capacities have increased 20x, owing to a fundamental change in the way data is modulated and demodulated over the fiber-optic channel. Intensity modulation and direct detection have given way to coherent detection and advanced modulation formats such as polarization-multiplexed QPSK and higher-order QAM. Today, commercial state-of-the-art single-channel data rates have reached 100 Gbit/s, and 400 Gbit/s data rates are being attempted in the laboratory. These astounding achievements would not be possible without parallel advancements in CMOS technology. High-speed A/D and D/A converters implemented in CMOS side-by-side with massively parallel, highly dense digital circuits are a key enabler in delivering the network capacities demanded by network operators and consumers. In this tutorial, we first review the challenges of the fiber-optic channel and introduce the necessary digital signal processing functions that need to be implemented in the SoC. We provide some example implementations of clock recovery, carrier recovery, and equalization algorithms. We also review forward error correction methods adopted in the most state-of-the-art designs. We conclude by discussing the challenges in designing next-generation transceiver ASICs.
ISBN:
(Print) 9781424480784
It is common nowadays for consumer embedded system products to be built on platforms with a System-on-a-Chip (SoC) in which two or more processor cores, not necessarily of the same type, are put into a single chip and form a Chip-level Multi-Processor (CMP) architecture. Although such a platform is capable of achieving high performance at relatively low cost, the CMP system architecture brings new design challenges as well as increased complexity in developing embedded software, especially at the level of the kernel or operating system. This paper presents our experience and some preliminary results from the project of building a multi-kernel embedded system platform for application software running on a newly developed multi-core SoC, the PAC Duo SoC, which is the latest product from the PAC (Parallel Architecture Core) Project initiated by the Industrial Technology Research Institute (ITRI) in Taiwan. The PAC Duo SoC is a chip-level heterogeneous multi-processor SoC composed of one ARM926 core serving as the general-purpose processor (GPP) and two ITRI PAC DSP cores serving as special-purpose processors (SPPs). We ported the Linux operating system to run on the ARM926 processor and the μC/OS-II real-time kernel to run on one PAC DSP core, leaving the other PAC DSP core with the option, for flexibility, of running either μC/OS-II or a different kernel. In addition, an inter-processor communication (IPC) mechanism is developed which not only takes application-specific requirements into account but also takes advantage of hardware features.
ISBN:
(Print) 9781424472611; 9780769540597
As network infrastructures with 10 Gb/s bandwidth and beyond have become pervasive, and as the cost advantages of large commodity-machine clusters continue to increase, research and industry strive to exploit the available processing performance for large-scale database processing tasks. In this work we look at the use of high-speed networks for distributed join processing. We propose Data Roundabout, a lightweight transport layer that uses Remote Direct Memory Access (RDMA) to gain access to the throughput opportunities in modern networks. The essence of Data Roundabout is a ring-shaped network in which each host stores one portion of a large database instance. We leverage the available bandwidth to (continuously) pump data through the high-speed network. Based on Data Roundabout, we demonstrate cyclo-join, which exploits the cycling flow of data to execute distributed joins. The study uses different join algorithms (hash join and sort-merge join) to expose the pitfalls and the advantages of each algorithm in the data-cycling arena. The experiments show the potential of a large distributed main-memory cache glued together with RDMA into a novel distributed database architecture.
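The cyclo-join idea (relation R stays partitioned on the ring while S tuples cycle past every node) can be modelled in a few lines. This toy simulation uses the hash-join variant and is an illustration of the data flow, not the authors' RDMA implementation:

```python
def cyclo_join(r_partitions, s_tuples, key=lambda t: t[0]):
    """Toy model of cyclo-join: R stays partitioned across the ring nodes,
    while S tuples are pumped around the ring and hash-joined at each hop."""
    # Each node builds a hash table over its local R partition once.
    tables = []
    for part in r_partitions:
        table = {}
        for r in part:
            table.setdefault(key(r), []).append(r)
        tables.append(table)
    results = []
    # One full cycle: every S tuple visits every node exactly once,
    # so every (r, s) key match is found without repartitioning R.
    for s in s_tuples:
        for table in tables:
            for r in table.get(key(s), []):
                results.append((r, s))
    return results
```

Note that no R tuple ever moves: only S circulates, which is what lets the ring's aggregate bandwidth, rather than any single node's memory, bound the join.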