Details
ISBN: (Print) 9781467385664
The Hadoop Distributed File System (HDFS) and the MapReduce model have become the de facto standard for large-scale data organization and analysis. The existing model of data organization and processing in Hadoop using HDFS and MapReduce is ideally tailored for search and data-parallel applications, which have no dependency on neighboring/adjacent data. Many scientific applications, such as image mining, data mining, knowledge data mining, and satellite image processing, depend on adjacent data for processing and analysis. In this paper, we discuss the requirements of overlapped data organization and propose XHAMI, a two-phase extension to HDFS and the MapReduce programming model, to address these requirements. We present the APIs and discuss their implementation specific to the Image Processing (IP) domain in detail, followed by sample case studies of image processing functions along with the results. Although XHAMI introduces small overheads in data storage and input/output operations, it greatly improves system performance and simplifies the application development process. The proposed system works without any changes and with zero overhead for existing MapReduce models, and can be used for many domain-specific applications that require overlapped data.
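The abstract does not show XHAMI's actual APIs, but the core idea of overlapped data organization can be sketched generically: each input split extends a few bytes (or pixel rows) into the next split, so a task processing a boundary region still sees its adjacent data. The function name and parameters below are illustrative, not XHAMI's real interface.

```python
def overlapped_splits(total_size, split_size, overlap):
    """Compute (start, end) byte ranges where each split extends
    `overlap` bytes into the next one, so boundary-adjacent data
    (e.g. neighboring image pixels) stays visible to each map task."""
    splits = []
    start = 0
    while start < total_size:
        # Each split covers its own region plus an overlap tail,
        # clipped at the end of the file.
        end = min(start + split_size + overlap, total_size)
        splits.append((start, end))
        start += split_size
    return splits

# A 100-byte file, 40-byte splits, 8-byte overlap:
print(overlapped_splits(100, 40, 8))  # → [(0, 48), (40, 88), (80, 100)]
```

The overlap duplicates a small fraction of the data across splits, which matches the abstract's observation that the scheme costs a little extra storage and I/O but removes the need for cross-task communication at split boundaries.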
Details
ISBN: (Print) 9780769557854
Visualizing and analyzing large-scale datasets is both critical and challenging, as it requires substantial resources for data processing and storage. While the speed of supercomputers continues to set higher standards, I/O systems have not kept pace, resulting in a significant performance bottleneck. To alleviate the I/O bottleneck for scientific visualization applications, we propose Visualization via a Heterogeneous Distributed Storage Infrastructure (VH-DSI), a solution to improve I/O speed and accelerate overall visualization performance. VH-DSI replaces the traditional parallel file system with a distributed file system to support visualization applications. A new scheduling algorithm, HeterSche, is proposed in VH-DSI to assign computing tasks to data nodes while taking cluster heterogeneity and data locality into account. VH-DSI also includes a design to support POSIX I/O for the distributed file system. The performance evaluation shows that the proposed VH-DSI solution achieves significant performance improvement for visualization applications: compared to the traditional visualization approach, VH-DSI reduces response time by at least 5 times. The HeterSche scheduling algorithm speeds up visualization compared to other scheduling algorithms, especially for large-scale datasets.
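The abstract does not specify HeterSche's internals, so the sketch below shows only the general pattern it names: greedy assignment that restricts each task to nodes holding a replica of its data (locality) and, among those, prefers the node with the smallest estimated finish time given its relative speed (heterogeneity). All inputs and names here are assumptions for illustration.

```python
def schedule(tasks, node_speed):
    """Locality- and heterogeneity-aware greedy scheduling sketch.

    tasks: dict mapping task id -> list of nodes holding its data replicas
    node_speed: dict mapping node -> relative processing speed (higher = faster)
    Returns a dict mapping task id -> chosen node.
    """
    load = {node: 0.0 for node in node_speed}  # units of work assigned so far
    assignment = {}
    for task, replicas in tasks.items():
        # Estimated finish time if this task (1 unit of work) were added:
        # (current load + 1) / speed. Pick the replica node minimizing it.
        best = min(replicas, key=lambda n: (load[n] + 1.0) / node_speed[n])
        assignment[task] = best
        load[best] += 1.0
    return assignment

# Hypothetical cluster: n1 is twice as fast as n2 and n3.
tasks = {"t1": ["n1", "n2"], "t2": ["n1"], "t3": ["n2", "n3"]}
speeds = {"n1": 2.0, "n2": 1.0, "n3": 1.0}
print(schedule(tasks, speeds))  # → {'t1': 'n1', 't2': 'n1', 't3': 'n2'}
```

Restricting candidates to replica holders avoids remote reads entirely, while the speed-weighted load estimate keeps fast nodes busy without overloading them, which is the trade-off the abstract attributes to HeterSche on heterogeneous clusters.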