SQL is the most commonly used front-end language for data-intensive scalable computing (DISC) applications due to its broad presence in new and legacy workflows and its shallow learning curve. However, DISC-backed SQL introduces several layers of abstraction that significantly reduce the visibility and transparency of workflows, making it challenging for developers to find and fix errors in a query. When a query returns incorrect output, it takes non-trivial effort to comprehend every stage of the query execution and to find the root cause among the input data and the complex SQL query. We aim to bring the benefits of step-through interactive debugging to DISC-powered SQL with DeSQL. Because SQL is declarative, there are no ordered atomic statements on which to place a breakpoint and monitor the flow of data. DeSQL’s automated query decomposition breaks a SQL query into its constituent subqueries, offering natural locations for setting breakpoints and monitoring intermediate data. However, due to advanced query optimization and translation in DISC systems, a user query rarely matches the physical execution, making it challenging to associate subqueries with their intermediate data. DeSQL performs fine-grained taint analysis to dynamically map the subqueries to their intermediate data, while also recognizing subqueries removed by the optimizers. For such subqueries, DeSQL efficiently regenerates the intermediate data from a nearby subquery’s data. On the popular TPC-DS benchmark, DeSQL provides a complete debugging view in 13% less time than the original job time while incurring an average overhead of 10% and retaining Apache Spark’s scalability. In a user study comprising 15 participants engaged in two debugging tasks, we find that participants using DeSQL identify the root cause behind a wrong query output in 74% less time than with de facto manual debugging.
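The idea of using subqueries as breakpoint locations can be illustrated with a minimal sketch. This is not DeSQL's method: it does not use taint analysis or Apache Spark, and instead naively re-executes each subquery on an in-memory SQLite database. The function names `decompose` and `inspect_steps`, the `sales` table, and the sample query are all hypothetical, and the parenthesis matching ignores parentheses inside string literals.

```python
import sqlite3


def decompose(query: str) -> list[str]:
    """Return parenthesized SELECT subqueries (inner ones first), then the full query.

    Assumption: no parentheses occur inside string literals.
    """
    subqueries = []
    open_positions = []  # indices of '(' characters not yet matched
    for i, ch in enumerate(query):
        if ch == "(":
            open_positions.append(i)
        elif ch == ")" and open_positions:
            start = open_positions.pop()
            inner = query[start + 1:i].strip()
            # Keep only parenthesized SELECTs, not calls such as SUM(amount).
            if inner.upper().startswith("SELECT"):
                subqueries.append(inner)
    subqueries.append(query.strip())
    return subqueries


def inspect_steps(conn: sqlite3.Connection, query: str) -> list[tuple[str, list]]:
    """Run every decomposed step and collect its intermediate rows."""
    return [(step, conn.execute(step).fetchall()) for step in decompose(query)]


# Hypothetical sample data and query, used only for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount INTEGER)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("east", 10), ("east", 30), ("west", 5)],
)
QUERY = (
    "SELECT region FROM "
    "(SELECT region, SUM(amount) AS total FROM sales GROUP BY region) "
    "WHERE total > 20"
)

for step, rows in inspect_steps(conn, QUERY):
    print(step)
    print("   ", rows)
```

Re-running each subquery from scratch is the simplest way to see intermediate data, but it repeats work for every step; the abstract describes DeSQL instead mapping subqueries onto data from the actual execution.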
ISBN: 9798350324969 (print)
With the development of big data, machine learning, and AI, existing software engineering techniques must be re-imagined to provide the productivity gains that developers desire. Furthermore, specialized hardware accelerators like GPUs or FPGAs have become a prominent part of the current computing landscape. However, developing heterogeneous applications is limited to a small subset of programmers with specialized hardware knowledge. To improve productivity and performance for data-intensive and compute-intensive development, the software engineering community should now design new waves of refactoring, testing, and debugging tools for big data analytics and heterogeneous application development. In this paper, we give an overview of software development challenges in this new data-intensive scalable computing and heterogeneous computing domain. We describe examples of automated software engineering (debugging, testing, and refactoring) techniques that target this data- and compute-intensive domain and share lessons learned from building these techniques.
ISBN: 9781450341974 (print)
To process massive quantities of data, developers leverage data-intensive scalable computing (DISC) systems such as Apache Spark. In terms of debugging, DISC systems support only postmortem log analysis and do not provide any interactive debugging functionality. This demonstration paper showcases BIGDEBUG, a tool that enhances Apache Spark with a set of interactive debugging features that help users debug their big data applications.
ISBN: 9781450318686 (print)
Map-reduce, the cornerstone computational framework for cloud computing applications, has star appeal to draw students to the study of parallelism. Participants will carry out hands-on exercises designed for students at CS1/intermediate/advanced levels that introduce data-intensive scalable computing concepts, using WebMapReduce (WMR), a simplified open-source interface to the widely used Hadoop map-reduce programming environment. These hands-on exercises enable students to perform data-intensive scalable computations carried out on the most widely deployed map-reduce framework, used by Facebook, Microsoft, Yahoo, and other companies. WMR supports programming in a choice of languages (including Java, Python, C++, C#, Scheme); participants will be able to try exercises with languages of their choice. The workshop includes a brief introduction to direct Hadoop programming and information about access to cluster resources supporting WMR. Workshop materials will reside on ***, along with WMR software. Intended audience: CS instructors. Laptop required (Windows, Mac, or Linux).
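For readers unfamiliar with the map-reduce model mentioned above, a word count is the customary first exercise. The following is a minimal single-process sketch in plain Python; it does not use WMR's or Hadoop's actual interfaces, and the names `mapper`, `reducer`, and `run_map_reduce` are hypothetical stand-ins for the model's three phases (map, group by key, reduce).

```python
from collections import defaultdict
from typing import Callable, Iterable, Iterator


def mapper(line: str) -> Iterator[tuple[str, int]]:
    """Map phase: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def reducer(word: str, counts: Iterable[int]) -> tuple[str, int]:
    """Reduce phase: combine all values emitted for one key."""
    return word, sum(counts)


def run_map_reduce(
    records: Iterable[str],
    map_fn: Callable[[str], Iterator[tuple[str, int]]],
    reduce_fn: Callable[[str, Iterable[int]], tuple[str, int]],
) -> dict[str, int]:
    """Simulate the framework locally: map, group values by key, then reduce."""
    groups: dict[str, list[int]] = defaultdict(list)
    for record in records:
        for key, value in map_fn(record):
            groups[key].append(value)
    # A real cluster would shuffle keys across machines; here we just sort them.
    return dict(reduce_fn(key, values) for key, values in sorted(groups.items()))


if __name__ == "__main__":
    lines = ["the cat sat", "the dog sat down"]
    print(run_map_reduce(lines, mapper, reducer))
```

On a real cluster the framework runs many mapper and reducer instances in parallel on different machines; the student writes only the two small functions, which is the simplification the abstract attributes to WMR.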
ISBN: 9781450305006 (print)
WebMapReduce (WMR) is a strategically simplified user interface for the Hadoop implementation of the map-reduce model for distributed computing on clusters, designed so that novice programmers in introductory CS courses can perform authentic data-intensive scalable computations using the programming language they are learning in their course. WMR currently supports Java, C++, Python, and Scheme computations, can readily be extended to support additional programming languages, and can be configured to adapt to the practices at a particular institution for teaching introductory programming. The open-source system is designed to give beginning CS students experience with parallel computing and exposure to concepts of parallelism, at a wide variety of institutions with diverse curricular choices and cluster resources. Potential applications in courses at all undergraduate levels are indicated, and the implementation of the WMR software is described.