Nowadays, heterogeneous embedded platforms are extensively used in various low-latency applications, including the automotive industry, real-time IoT systems, and automated factories. These platforms utilize specific ...
Presentations have become an important part of working life for students, employees, and businesspeople alike. But creating a presentation is a time-consuming process. The proposed method aims to reduce the amo...
In literary research, subjective analysis by researchers and critics using artisanal analog methods is still the mainstream approach. In contrast, text-processing techniques that make full use of machine learning are ...
ISBN (print): 9783031042164; 9783031042157
Big Data is widely used in many applications and fields: artificial information, medical care, business, and much more. Big Data sources are widely distributed and diverse. Therefore, it is essential to guarantee that the data collected and processed are of the highest quality and to handle this large volume of data from different sources with care. Consequently, Big Data quality must be ensured from the very beginning: data collection. This paper provides a viewpoint on the key Big Data collection quality factors that need to be considered every time data are captured, generated, or created. This study proposes a quality model that can help create and measure data collection methods and techniques. However, the quality model is still preliminary and needs further investigation.
Motion and appearance cues play a crucial role in Multi-object Tracking (MOT) algorithms for associating objects across consecutive frames. While most MOT methods prioritize accurate motion modeling and distincti...
Graph algorithms support a broad spectrum of big data applications. A typical approach to scale graph algorithms is to run in a distributed and parallel setting with multiple processing devices. The approach requires ...
ISBN (print): 9781665414555
Hadoop MapReduce is a software framework for processing vast amounts of data in parallel on large clusters. As data size increases, the disk I/O of reducer nodes grows significantly and may cause runtime bottlenecks. In response, we propose Bucket MapReduce, a system that utilizes bucket sort and pipeline parallelism to improve the performance of Hadoop MapReduce. Several parameters may influence the performance of Bucket MapReduce, and we therefore also discuss parameter tuning. In this paper, we perform experiments on the TeraSort benchmark. Bucket MapReduce reduces local disk I/O by 61% and improves the runtime by 1.39x on an 800 GB TeraSort benchmark.
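As a rough illustration of the bucket-sort idea the abstract mentions, the following minimal sketch range-partitions keys into ordered buckets, sorts each bucket independently, and concatenates the results. It is not the Bucket MapReduce implementation: the function names (`choose_splitters`, `bucket_sort`), the sampling strategy, and the `sample_size` parameter are assumptions made for this example only, and everything runs in memory on one machine.

```python
from bisect import bisect_right


def choose_splitters(keys, num_buckets, sample_size=100):
    """Pick up to num_buckets - 1 split points from a small sorted sample.

    Hypothetical helper: sampling every `stride`-th key is an assumption
    of this sketch, not something the abstract specifies.
    """
    stride = max(1, len(keys) // sample_size)
    sample = sorted(keys[::stride])
    step = len(sample) / num_buckets
    # A set removes duplicate split points when the sample has repeated keys.
    return sorted({sample[int(i * step)] for i in range(1, num_buckets)})


def bucket_sort(keys, num_buckets):
    """Range-partition keys into ordered buckets, then sort each bucket."""
    if not keys:
        return []
    splitters = choose_splitters(keys, num_buckets)
    buckets = [[] for _ in range(len(splitters) + 1)]
    for key in keys:
        # Every key in bucket i is <= every key in bucket i + 1.
        buckets[bisect_right(splitters, key)].append(key)
    result = []
    for bucket in buckets:
        # Buckets are independent, so each could be sorted by a separate task.
        result.extend(sorted(bucket))
    return result
```

Because the buckets cover disjoint, ordered key ranges, sorting them separately and concatenating yields a globally sorted sequence without a final merge step.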
ISBN (digital): 9798331509712
ISBN (print): 9798331509729
Large language models (LLMs) have demonstrated remarkable capabilities in various natural language processing tasks. However, they face significant challenges in accessing up-to-date information and providing verifiable responses, particularly in domain-specific contexts. Retrieval-Augmented Generation (RAG) systems have emerged as a promising solution, but they introduce new vulnerabilities related to data integrity, privacy, and centralization. This paper introduces DeRAG (Decentralized Multi-Source Retrieval-Augmented Generation), a novel approach that leverages blockchain technology and advanced cryptographic techniques to address these challenges. DeRAG incorporates a decentralized multi-source data ecosystem, a RAG-Optimized Pyth Consensus protocol for data validation, and a domain-specific validator network structured as a Directed Acyclic Graph. The system also features an innovative RAG Processing Layer with a decentralized index structure based on Distributed Hash Tables and content-addressable storage. We present a comprehensive performance evaluation framework that assesses the system’s scalability, efficiency, output quality, and economic model effectiveness in a decentralized context. Results demonstrate that DeRAG significantly enhances the security, efficiency, and privacy of RAG systems, paving the way for more robust and reliable RAG applications in information retrieval and generation.
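To make the terms "content-addressable storage" and "Distributed Hash Table" concrete, here is a minimal single-process sketch. It is not DeRAG's index structure: the class `ToyDHT`, the node names, the use of SHA-256, and the ring-based placement rule are all assumptions chosen for illustration. Each document is stored under the hash of its own bytes, and the node responsible for it is the first node whose ring position is at or after that hash.

```python
import hashlib
from bisect import bisect_left


def content_id(data: bytes) -> str:
    """Address of a blob: the SHA-256 digest of its contents."""
    return hashlib.sha256(data).hexdigest()


class ToyDHT:
    """Hypothetical in-memory stand-in for a hash ring of storage nodes."""

    def __init__(self, node_names):
        # Each node sits on the ring at the hash of its (assumed) name.
        self.ring = sorted((content_id(n.encode()), n) for n in node_names)
        self.positions = [pos for pos, _ in self.ring]
        self.stores = {n: {} for n in node_names}

    def owner(self, cid: str) -> str:
        # First node clockwise from the content ID, wrapping around the ring.
        index = bisect_left(self.positions, cid) % len(self.ring)
        return self.ring[index][1]

    def put(self, data: bytes) -> str:
        cid = content_id(data)
        self.stores[self.owner(cid)][cid] = data
        return cid

    def get(self, cid: str) -> bytes:
        data = self.stores[self.owner(cid)][cid]
        # Re-hashing on read detects content that no longer matches its address.
        if content_id(data) != cid:
            raise ValueError("content does not match its address")
        return data
```

The property that matters for retrieval is that the address doubles as an integrity check: a reader can verify any returned chunk by hashing it, without trusting the node that served it.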
This paper examines a new system for web data extraction that utilizes an AcDWS approach. The main goal of this system is to collect information related to user-requested products quickly and at scale. In this system,...
ISBN (print): 9783030953911; 9783030953904
Stream applications are widely deployed on the cloud. While modern distributed streaming systems like Flink and Spark Streaming can schedule and execute them efficiently, streaming dataflows often change dynamically, which may cause computation imbalance and back-pressure. We introduce AutoFlow, an automatic, hotspot-aware dynamic load-balancing system for streaming dataflows. It incorporates a centralized scheduler that dynamically monitors the load balance of the entire dataflow and performs state migrations accordingly. The scheduler achieves these two tasks using a simple asynchronous distributed control message mechanism and a hotspot-diminishing algorithm. The timing mechanism supports implicit barriers and highly efficient state migration without global barriers or pauses to operators. It also supports a time-window-based load-balance measurement and feeds the measurements to the hotspot-diminishing algorithm without user intervention. We implemented AutoFlow on top of Ray, an actor-based distributed execution framework. Our evaluation on various streaming benchmark datasets shows that AutoFlow achieves good load balance and incurs low latency overhead under a highly data-skewed workload.
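The abstract does not describe the hotspot-diminishing algorithm itself, so the following is only a generic sketch of the two ideas it names: measuring load over a time window and moving state away from a hot worker. The functions `windowed_load` and `plan_migrations`, the per-key record counts used as the load metric, the greedy move rule, and the `tolerance` parameter are all assumptions of this example, not AutoFlow's design.

```python
def windowed_load(events, now, window):
    """Count records per key with timestamps in the interval (now - window, now].

    `events` is an iterable of (timestamp, key) pairs (assumed format).
    """
    load = {}
    for timestamp, key in events:
        if now - window < timestamp <= now:
            load[key] = load.get(key, 0) + 1
    return load


def plan_migrations(key_load, assignment, workers, tolerance=1.2):
    """Greedily move keys from the hottest worker to the coldest one.

    Stops once the hottest worker is within `tolerance` times the mean load,
    or when no single key move would narrow the gap.
    """
    assignment = dict(assignment)
    load = {w: 0.0 for w in workers}
    for key, worker in assignment.items():
        load[worker] += key_load.get(key, 0.0)
    mean = sum(load.values()) / len(workers)
    moves = []
    while True:
        hot = max(workers, key=lambda w: load[w])
        cold = min(workers, key=lambda w: load[w])
        if load[hot] <= tolerance * mean:
            break
        gap = load[hot] - load[cold]
        # Only keys lighter than the gap reduce the imbalance when moved.
        candidates = [k for k, w in assignment.items()
                      if w == hot and 0 < key_load.get(k, 0.0) < gap]
        if not candidates:
            break
        # Prefer the key whose load is closest to half the gap.
        key = min(candidates, key=lambda k: abs(key_load[k] - gap / 2))
        assignment[key] = cold
        load[hot] -= key_load[key]
        load[cold] += key_load[key]
        moves.append((key, hot, cold))
    return moves, assignment
```

In a real system each planned move would correspond to a state migration for that key; this sketch only produces the plan and leaves the migration mechanics out.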