A large part of data science projects is spent on data engineering. Especially in open data contexts, data quality issues are prevalent and are often tackled by non-professional programmers. We introduce and evaluate Jayvee, a domain-specific language for data engineering aimed at reducing barriers to building data pipelines. We show that a structured DSL can have positive effects on speed, ease of use, and quality for data engineering by non-professional developers. For this, we present an empirical quantitative study, in which we compare the performance of students as proxies for non-professional programmers using Jayvee with Python and Pandas. We search for reasons for the empirical findings using a follow-up interview study on how using a DSL changes how non-professional programmers build data pipelines. Participants solve a subset of tasks faster, more easily, and with higher quality when using Jayvee compared to Python. Interviewees describe tradeoffs regarding the DSL's more limited features, stricter code structure, and explicit descriptions. Jayvee is found to be more approachable, which leads to a more guided development flow. New data engineering languages should provide good tooling and documentation, plan how to visualize intermediate data and consider new development workflows involving tools like ChatGPT to find adoption.
To improve the collection and propagation of hot news, we analyze the demands and framework of current hot news support systems and target the bottleneck in opinion collection and propagation. We then propose a corresponding optimization framework and strategy, and design and implement a distributed hotspot caching system. Building on an existing distributed search engine, a cache server is added between the overall hotspot query processing server and each cluster sub-query processing server, which greatly improves caching and search efficiency. The experimental results show that the proposed distributed caching system and strategy greatly enhance the processing capacity of the hot news system.
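The cache placement described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the class name `HotspotCache`, the function `overall_query`, the LRU eviction policy, and the `backend` callables standing in for cluster sub-query servers are all assumptions made for illustration.

```python
from collections import OrderedDict
from typing import Callable, List


class HotspotCache:
    """Hypothetical LRU cache sitting in front of one cluster sub-query server."""

    def __init__(self, backend: Callable[[str], List[str]], capacity: int = 128) -> None:
        self.backend = backend      # stand-in for a cluster sub-query server (assumption)
        self.capacity = capacity    # maximum number of cached query terms
        self.entries: "OrderedDict[str, List[str]]" = OrderedDict()
        self.hits = 0
        self.misses = 0

    def query(self, term: str) -> List[str]:
        if term in self.entries:
            self.entries.move_to_end(term)    # mark as most recently used
            self.hits += 1
            return self.entries[term]
        self.misses += 1
        result = self.backend(term)           # cache miss: ask the cluster
        self.entries[term] = result
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # evict the least recently used term
        return result


def overall_query(term: str, caches: List[HotspotCache]) -> List[str]:
    """Fan a query out to every cluster through its cache and merge the results."""
    merged: List[str] = []
    for cache in caches:
        merged.extend(cache.query(term))
    return merged
```

In this sketch a repeated hot query is answered from the cache without reaching the cluster, which is the effect the abstract attributes to the added cache server; the real system's eviction and consistency strategy is not specified in the abstract.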
Song recognition refers to automatically recognizing the corresponding song name for an input audio clip. Because of its friendly interactive form and convenience, song recognition has become a hot topic in music retrieval research. However, most existing song recognition methods assume that the collected audio is clean. Unfortunately, in practical applications, the acquisition equipment is often low-cost and the collected audio is seriously polluted by noise, resulting in poor recognition accuracy. To address these problems in data engineering and low-cost microphone scenarios, this paper proposes a deep learning based two-stage song recognition framework. Specifically, a Denoising Auto-Encoder network is first used for speech enhancement to obtain clean audio data. Then, the Con-LSTM network is proposed for clean song recognition. The Con-LSTM network integrates the advantages of the convolutional neural network (CNN) and the recurrent neural network (RNN), and thus has stronger recognition ability. The final experimental results show that the proposed song recognition framework can effectively identify songs collected by low-cost microphones. As such, the proposed framework can be embedded in web of things (WoT) systems to help improve speech recognition tasks, which are essential in many advanced WoT systems.
ISBN (digital): 9798331518578
ISBN (print): 9798331518585
The growing prevalence of data streaming is revolutionising how organisations approach stream processing and real-time analytics, which deliver actionable and immediate insights from the generated data streams. Advancements in data engineering drive stream processing and real-time analytics. The present study explores the transformative impact of real-time analytics by examining its major components: visualisation tools, in-memory storage, stream processing, and data ingestion. These components enable data-driven decision-making, optimised operations, and a personalised user experience. The study examines multinational retail corporations and video streaming services as data engineering cases. The major advantages are competitive and operational efficiencies, enhanced customer experience, and rapid insights from stream processing. The study highlights the tremendous potential of implementing real-time analytics and stream processing, along with the associated challenges: integration complexities, skill gaps, privacy and security concerns, system scalability, data quality, and management of data volume. A stream processing engine is used to achieve higher throughput and lower processing latency, depending on the workload.
ISBN (digital): 9798331518578
ISBN (print): 9798331518585
In the digital era, data flows are saturated, and universal multifunctional systems are being developed to solve problems and optimise computing resources. Information systems are heavily loaded with modern data and a large number of resources. User requests and the heterogeneity of the incoming streams can be evaluated in terms of the different types of multimedia services, their requirements for computing resources, and their performance over the entire data. The heterogeneity of the incoming data flow is considered the distinctive feature of requests in a modern information system that supports different types of multimedia services on a single platform. Large volumes of data and data heterogeneity create numerous problems related to data storage security and the speed of the digital system. To address these challenges, artificial intelligence technology can be used to operate a digital telecommunication complex that processes and stores dynamic, multi-format data flows. Prospects and trends for developing these models can be identified based on their prospective characteristics. The study considers the development of digital communication with a multi-object analytic system for storing and analysing complex data using data engineering. A fuzzy-based model is used for data processing, with an enhanced accuracy of 98%.
ISBN (digital): 9798331542788
ISBN (print): 9798331542795
As data volumes surge, the demand for skilled data engineers increases significantly, underscoring the importance of data engineering. However, learning data engineering skills can be daunting due to the complexity of setting up multiple platforms, an effort that is often unnecessary because companies typically employ other professionals to handle infrastructure. Additionally, data engineering is rarely taught in traditional educational settings, leaving interested students at a disadvantage. To address this, this project aims to develop a web-based platform that simplifies data engineering learning, providing free hands-on experience without complex setups for users from different backgrounds. The platform includes a Large Language Model (LLM)-powered chatbot for real-time guidance, creating an interactive learning environment. Through the platform, users can instantly access the necessary tools and resources. Typically, a single web page contains everything required for a course, streamlining the virtual learning process and reducing setup time.
ISBN (print): 9780738110868
Data engineering is becoming an increasingly important part of scientific discoveries with the adoption of deep learning and machine learning. Data engineering deals with a variety of data formats, storage, data extraction, transformation, and data movements. One goal of data engineering is to transform data from its original form into the vector/matrix/tensor formats accepted by deep learning and machine learning applications. There are many structures, such as tables, graphs, and trees, to represent data in these data engineering phases. Among them, tables are a versatile and commonly used format to load and process data. In this paper, we present a distributed Python API based on table abstraction for representing and processing data. Unlike existing state-of-the-art data engineering tools written purely in Python, our solution adopts high performance compute kernels in C++, with an in-memory table representation and Cython-based Python bindings. In the core system, we use MPI for distributed memory computations with a data-parallel approach for processing large datasets in HPC clusters.
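The table-to-matrix step mentioned in this abstract can be sketched in plain Python. This is only an illustration of the idea, not the API presented in the paper: the `Table` type alias and the function `table_to_matrix` are hypothetical names, and the sketch is single-process with no C++ kernels or MPI.

```python
from typing import Dict, List, Sequence

# Hypothetical columnar table: column name -> list of values (assumption)
Table = Dict[str, List[float]]


def table_to_matrix(table: Table, columns: Sequence[str]) -> List[List[float]]:
    """Return a row-major matrix built from the selected columns."""
    lengths = {len(table[name]) for name in columns}
    if len(lengths) > 1:
        raise ValueError("all selected columns must have the same length")
    # zip(*columns) turns column-wise storage into one list per row
    return [list(row) for row in zip(*(table[name] for name in columns))]
```

For example, selecting columns `a` and `c` from a three-column table yields one two-element row per record, which is the shape a machine learning library would typically accept as input.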
ISBN (print): 9781728180533
The combination of big data and machine learning appears frequently in the manufacturing context. In a modern factory, data is collected everywhere, and it is a challenge for companies to find a way to use the data they produce. A model's quality is strongly dependent on the quality of the training dataset; the data engineer is responsible for the infrastructure, such as providing context and quality input data for machine learning algorithms. In the discussed case study, a data pipeline is introduced as a potential solution. It proposes a strategy that runs through the organization, from the shop floor to decision-makers.