This work presents the DGChain (data-Git- for Blockchain) project. This Python package allows versioncontrol of data in blockchain and IPFS based on a DAO (decentralized autonomous organization) for managing data in ...
详细信息
Machine Learning (ML) systems are gaining popularity, reshaping various domains ranging from customer services to software engineering. The effectiveness of ML systems is dependent on the quality of their training dat...
详细信息
ISBN:
(纸本)9798400705915
Machine Learning (ML) systems are gaining popularity, reshaping various domains ranging from customer services to software engineering. The effectiveness of ML systems is dependent on the quality of their training data. Therefore, practitioners invest substantial time experimenting with different data, parameters, and models to guarantee the quality of the end system. Prior work highlighted unique challenges of developing ML systems, particularly concerning versioning data and models. Recently, various tools such as DVC and MLFlow have emerged to aid developers in the storage and tracking of data. Despite their growing popularity, very little is known about their usage patterns and impact on open-source software (OSS) systems. To address this gap, we conducted an empirical study on 56 GitHub OSS projects that use DVC to understand the DVC usage pattern and the impact of using DVC on the software development process. We found that versioning and tracking is the most adopted DVC feature, being utilized by all 56 projects and being the only adopted feature in 85.7% of them. Furthermore, we found that DVC has a significant impact on the software development process indicators such as the number of created pull requests (PRs), and the number of bug-fix commits. For instance, our findings showed that DVC causes a peak in the number of commits and PRs at the moment of the adoption, followed by a long-term decrease. We believe that our findings can assist practitioners in tailoring tools to better meet user requirements and help organizations realize potential outcomes of adopting such tools.
In this paper, we present a purpose-built data management system, MLdp, for all machine learning (ML) datasets. ML applications pose some unique requirements different from common conventional data processing applicat...
详细信息
ISBN:
(纸本)9781450356435
In this paper, we present a purpose-built data management system, MLdp, for all machine learning (ML) datasets. ML applications pose some unique requirements different from common conventional data processing applications, including but not limited to: data lineage and provenance tracking, rich data semantics and formats, integration with diverse ML frameworks and access patterns, trial-and-error driven data exploration and evolution, rapid experimentation, reproducibility of the model training, strict compliance and privacy regulations, etc. Current ML systems/services, often named MLaaS, to-date focus on the ML algorithms, and offer no integrated data management system. Instead, they require users to bring their own data and to manage their own data on either blob storage or on file systems. The burdens of data management tasks, such as versioning and access control, fall onto the users, and not all compliance features, such as terms of use, privacy measures, and auditing, are available. MLdp offers a minimalist and flexible data model for all varieties of data, strong version management to guarantee re-producibility of ML experiments, and integration with major ML frameworks. MLdp also maintains the data provenance to help users track lineage and dependencies among dataversions and models in their ML pipelines. In addition to table-stake features, such as security, availability and scalability, MLdp's internal design choices are strongly influenced by the goal to support rapid ML experiment iterations, which cycle through data discovery, data exploration, feature engineering, model training, model evaluation, and back to data discovery. The contributions of this paper are: 1) to recognize the needs and to call out the requirements of an ML data platform, 2) to share our experiences in building MLdp by adopting existing database technologies to the new problem as well as by devising new solutions, and 3) to call for actions from our communities on future challeng
暂无评论