The typical characteristic of today's LAN and WAN environments is one of mixture. Old systems are mixed with a wide variety of new systems. Establishing effective network security in multi-platform, multi-vendor, ...
ISBN (print): 9798350302936
Selecting the right resources for big data analytics jobs is hard because of the wide variety of configuration options like machine type and cluster size. As poor choices can have a significant impact on resource efficiency, cost, and energy usage, automated approaches are gaining popularity. Most existing methods rely on profiling recurring workloads to find near-optimal solutions over time. Due to the cold-start problem, this often leads to lengthy and costly profiling phases. However, big data analytics jobs across users can share many common properties: they often operate on similar infrastructure, using similar algorithms implemented in similar frameworks. The potential in sharing aggregated profiling runs to collaboratively address the cold start problem is largely unexplored. We present Karasu, an approach to more efficient resource configuration profiling that promotes data sharing among users working with similar infrastructures, frameworks, algorithms, or datasets. Karasu trains lightweight performance models using aggregated runtime information of collaborators and combines them into an ensemble method to exploit inherent knowledge of the configuration search space. Moreover, Karasu allows the optimization of multiple objectives simultaneously. Our evaluation is based on performance data from diverse workload executions in a public cloud environment. We show that Karasu is able to significantly boost existing methods in terms of performance, search time, and cost, even when few comparable profiling runs are available that share only partial common characteristics with the target job.
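The abstract stays at the conceptual level; as a rough illustration only, the following Python sketch shows how aggregated profiling runs shared by several collaborators could be turned into an ensemble of lightweight performance models that ranks candidate configurations. The class names, features and cost weighting are hypothetical and are not taken from the Karasu implementation.

```python
# Hypothetical sketch only: an ensemble of lightweight runtime models, one per
# collaborator's shared profiling runs, used to rank candidate cluster
# configurations. Class names, features and the cost weighting are assumptions.
from dataclasses import dataclass
from typing import Dict, List, Tuple

import numpy as np
from sklearn.ensemble import GradientBoostingRegressor


@dataclass(frozen=True)
class Config:
    machine_type: int  # encoded machine type, e.g. an index into a catalogue
    node_count: int


def to_features(cfg: Config) -> List[float]:
    return [float(cfg.machine_type), float(cfg.node_count)]


class CollaborativeEnsemble:
    """Averages per-collaborator runtime models to score unseen configurations."""

    def __init__(self, shared_runs: Dict[str, List[Tuple[Config, float]]]):
        # shared_runs: collaborator id -> list of (config, observed runtime in s)
        self.models = []
        for runs in shared_runs.values():
            X = np.array([to_features(cfg) for cfg, _ in runs])
            y = np.array([runtime for _, runtime in runs])
            self.models.append(GradientBoostingRegressor().fit(X, y))

    def predict_runtime(self, cfg: Config) -> float:
        x = np.array([to_features(cfg)])
        return float(np.mean([m.predict(x)[0] for m in self.models]))

    def best(self, candidates: List[Config], price_per_node_hour: float,
             cost_weight: float = 1.0) -> Config:
        # Toy scalarization of two objectives (runtime and monetary cost);
        # a real optimizer would handle this trade-off more carefully.
        def score(cfg: Config) -> float:
            runtime_s = self.predict_runtime(cfg)
            cost = (runtime_s / 3600.0) * cfg.node_count * price_per_node_hour
            return runtime_s + cost_weight * cost
        return min(candidates, key=score)
```

Given even a few shared runs per collaborator, best() can then pick a reasonable configuration from a candidate grid before any profiling of the target job, which is the cold-start situation the paper addresses.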
ISBN (print): 9781728136028
The article describes the architecture of a big data processing system based on the Apache Hadoop, Apache Flume and Apache Spark toolset. Application of the developed system is shown for the storage and analysis of a dataset containing events generated within GitHub repositories; GitHub is the world's largest web service for version control using Git. System performance results are evaluated using the chosen metrics.
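As a rough sketch of the kind of analysis such a system enables (not code from the article), a PySpark job could aggregate GitHub event records already landed in HDFS; the HDFS path and the "type" field name below are assumptions about the dataset layout.

```python
# Hypothetical sketch, not code from the article: a PySpark job aggregating
# GitHub event records that have already been landed in HDFS (e.g. via Flume).
# The HDFS path and the "type" field name are assumptions about the dataset.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("github-event-stats").getOrCreate()

events = spark.read.json("hdfs:///data/github/events/*.json")

event_counts = (events
                .groupBy("type")          # e.g. PushEvent, IssuesEvent, ForkEvent
                .count()
                .orderBy(F.desc("count")))

event_counts.show(20, truncate=False)
spark.stop()
```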
ISBN (print): 9781538650356
The suffix array is the key to efficient solutions for myriads of string processing problems in different application domains, like data compression, data mining, or bioinformatics. With the rapid growth of available data, suffix array construction algorithms have to be adapted to advanced computational models such as external memory and distributed computing. In this article, we present five suffix array construction algorithms utilizing the new algorithmic big data batch processing framework Thrill, which allows scalable processing of input sizes on distributed systems in orders of magnitude that have not been considered before.
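For readers unfamiliar with the data structure, a compact sequential construction clarifies what the distributed algorithms compute; the prefix-doubling sketch below is a standard O(n log² n) method and is not one of the five Thrill-based algorithms presented in the article.

```python
# Hypothetical sequential illustration (not one of the article's five Thrill
# algorithms): prefix doubling builds the suffix array in O(n log^2 n) time by
# repeatedly sorting suffixes on rank pairs covering 2^k characters.
def suffix_array(s: str) -> list:
    n = len(s)
    if n == 0:
        return []
    sa = list(range(n))
    rank = [ord(c) for c in s]  # initial ranks: first character of each suffix
    k = 1
    while True:
        # Sort suffixes by (rank of first k chars, rank of the next k chars).
        key = lambda i: (rank[i], rank[i + k] if i + k < n else -1)
        sa.sort(key=key)
        # Re-rank: equal keys keep the same rank, new keys increment it.
        new_rank = [0] * n
        for j in range(1, n):
            new_rank[sa[j]] = new_rank[sa[j - 1]] + (key(sa[j]) != key(sa[j - 1]))
        rank = new_rank
        if rank[sa[-1]] == n - 1:  # all ranks distinct: order is final
            break
        k *= 2
    return sa


assert suffix_array("banana") == [5, 3, 1, 0, 4, 2]
```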
The computer systems developed during the 1960s and 1970s made very little impact on management decision-making. Management Information System design was constrained by three factors: the technology was large-scale and inevitably centralised and controlled by data processing staff; the systems were designed by specialist staff who rarely understood the business requirements; and managers themselves had little knowledge or "hands-on" experience of computers. In the 1980s, a greater awareness of the need for planning and better use of personnel information, coupled with the development of distributed processing systems, has presented personnel management with opportunities to use computing technology as a means of increasing the professionalism of practising personnel managers. Effective use will only occur if the implementation of technology is matched by an appraisal of skills and organisation within personnel departments. Staff will need a minimum level of computing expertise and some managers will need skills in modelling, particularly financial modelling. The relationship between personnel and data processing needs careful redefinition to build a link between the two, and data processing staff need to design and communicate an end-user strategy.
ISBN (print): 9781728108582
In our research we built a data processing pipeline for storing railway KPI data based on open-source big data technologies: Apache Hadoop, Kafka, the Karim HDFS Connector, Spark, Airflow and PostgreSQL. The created methodology for data load testing allowed us to iteratively perform data load tests with increasing data sizes, evaluate the required cluster software and hardware resources and, finally, detect bottlenecks in the solution. As a result of the research we proposed an architecture for data processing and storage and gave recommendations on data pipeline optimization. In addition, we calculated approximate cluster machine sizing for the current dataset volume for the data processing and storage services.
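As an illustration of how such a pipeline can be orchestrated (not the authors' actual code), a minimal Airflow DAG could chain an HDFS landing check, a Spark aggregation and the PostgreSQL load; Airflow 2.x is assumed, and all paths, job names and the daily schedule are assumptions.

```python
# Hypothetical sketch (not the authors' pipeline): a minimal Airflow 2.x DAG that
# checks the daily HDFS landing partition, runs a Spark aggregation job and then
# loads the results into PostgreSQL. Paths, script names and schedule are assumed.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="railway_kpi_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    # A Kafka-to-HDFS connector sinks raw events continuously; here we only
    # verify that the daily partition has landed before processing it.
    wait_for_landing = BashOperator(
        task_id="check_hdfs_partition",
        bash_command="hdfs dfs -test -d /data/railway/kpi/{{ ds }}",
    )

    aggregate = BashOperator(
        task_id="spark_aggregate_kpis",
        bash_command=(
            "spark-submit --master yarn aggregate_kpis.py "
            "--input /data/railway/kpi/{{ ds }} --output /data/railway/agg/{{ ds }}"
        ),
    )

    load_postgres = BashOperator(
        task_id="load_into_postgres",
        bash_command="python load_to_postgres.py --partition {{ ds }}",
    )

    wait_for_landing >> aggregate >> load_postgres
```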
Presents major findings from an in-depth multi-company American study on the effective management of the distributed computing environment. The study, led by Cambridge Technology Partners, focuses on the management ap...
Web archives constitute valuable sources for researchers in various disciplines. However, their sheer size, the typically broad scope and their temporal dimension make them difficult to work with. We have identified three views to access and explore Web archives from different perspectives: user-, data- and graph-centric. The natural way to look at the information in a Web archive is through a Web browser, just like the live Web is consumed. This is what we consider the user-centric view. The most commonly used tool to access a Web archive this way is the Wayback Machine, the Internet Archive's replay tool to render archived webpages. To facilitate the discovery of a page if the URL or timestamp of interest is unknown, we propose effective approaches to search Web archives by keyword with a temporal dimension through social bookmarks and labeled hyperlinks. Another way for users to find and access archived pages is through past information on the current Web that is linked to the corresponding evidence in a Web archive. A presented tool for this purpose ensures coherent archived states of webpages related to a common object as rich temporal representations to be referenced and shared. Besides accessing a Web archive by closely reading individual pages like users do, distant reading methods enable analyzing archival collections at scale. This data-centric view enables analysis of the Web and its dynamics itself as well as the contents of archived pages. We address both angles: 1. by presenting a retrospective analysis of crawl metadata on the size, age and growth of a Web dataset, 2. by proposing a programming framework for efficiently processing archival collections. ArchiveSpark operates on standard formats to build research corpora from Web archives and facilitates the process of filtering as well as data extraction and derivation at scale. The third perspective is what we call the graph-centric view. Here, websites, pages or extracted facts are considered nodes in
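ArchiveSpark itself is a Scala/Spark library, so the fragment below is only a generic PySpark illustration of the filter-then-derive pattern on CDX-style capture metadata, not the ArchiveSpark API; the input path and field positions are assumptions.

```python
# Hypothetical sketch of the filter-then-derive pattern on CDX-style capture
# metadata using plain PySpark; this is NOT the ArchiveSpark API. The input
# path and the CDX field positions are assumptions.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("webarchive-crawl-stats").getOrCreate()

# One capture per line; assume the common CDX order:
# urlkey, timestamp, original url, mimetype, status, ...
raw = spark.read.csv("hdfs:///archive/collection.cdx", sep=" ")
cdx = raw.select(
    F.col("_c1").alias("timestamp"),   # 14-digit capture timestamp
    F.col("_c2").alias("url"),
    F.col("_c3").alias("mime"),
)

captures_per_year = (cdx
    .filter(F.col("mime") == "text/html")                 # filtering step
    .withColumn("year", F.substring("timestamp", 1, 4))   # derivation step
    .groupBy("year").count()
    .orderBy("year"))

captures_per_year.show()
spark.stop()
```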