The ubiquity of Big data has greatly influenced the direction and the development of storage technologies. To meet the needs of storing and analyzing Big data, researchers and administrators have turned to parallel an...
详细信息
ISBN:
(纸本)9781538619964
The ubiquity of Big data has greatly influenced the direction and the development of storage technologies. To meet the needs of storing and analyzing Big data, researchers and administrators have turned to parallel and distributed storage and compute architectures. While the problems of securely and consistently storing and accessing data in large parallel and distributed file systems have been largely addressed in both the research and production systems, efficiently searching across large unstructureddata and metadata has largely been overlooked. According to the International data Corporation, more than 90% of data found in the digital universe is unstructured, emphasizing the importance of developing efficient solutions for querying distributed data. This paper proposes a novel indexing solution, called FusionDex, that provides an efficient model for querying across distributed file systems. FusionDex leverages state-of-the-art, open-source indexing modules as its building blocks to deliver an integrated system for enabling efficient user-specified queries over distributed and unstructureddata. FusionDex has been evaluated on a cluster of 64 nodes, and results show that it outperforms existing tools (in some cases by several orders of magnitude), such as Hadoop Grep and Cloudera search.
Purpose Due to the exceptionally high standards for accuracy and data integrity in scientific regulatory reporting, it is vital that any tool that aims to streamline this process is as efficient or more in gathering d...
详细信息
Purpose Due to the exceptionally high standards for accuracy and data integrity in scientific regulatory reporting, it is vital that any tool that aims to streamline this process is as efficient or more in gathering data as a team of scientists, without higher cost in terms of time or resources. For this reason, an artificial intelligence-based tool with parallel search, document creation, and data integrity review capabilities is being investigated as a potential solution. This paper describes a proof of concept project to develop an AI-based tool to rapidly assemble an end-of-phase 2 (EOP2) briefing document for a potential medicine. We have called the tool an Intelligent Machine for Document Preparation or IMDP. Methods A training corpus of approximately 65,000 pdf documents derived from electronic lab notebooks and technical reports related to five molecules (including Merestinib) was ingested, and prior EOP2 documents from the remaining four molecules was used to generate training questions and answers. Then, an annotation-light natural language processing algorithm analyzed a set of structured and unstructureddata regarding Merestinib. A simple user interface was created allowing scientists to query the system in natural language, and a table builder, image/plot finder, and free-text addition features were added to allow for advanced search without dependence on keywords. Results Three significant innovations were designed-in to improve overall performance as compared to our benchmark solution without sacrificing usability. First, the AI-based IMDP was built to improve accuracy and accelerate document creation with remarkably low amount of training. Second, image search capability was added to enrich the knowledge base, and third, the IMDP was integrated with the existing process rather than adding a step in the workflow. Finally, accuracy and total document creation time were compared with the existing tool (benchmark tool). Our experiments show that the AI-
Vector similarity search (VSS) for unstructured vectors generated via machine learning methods is a promising solution for many applications, such as face search. With increasing awareness and concern about data secur...
详细信息
ISBN:
(数字)9781665462723
ISBN:
(纸本)9781665462723
Vector similarity search (VSS) for unstructured vectors generated via machine learning methods is a promising solution for many applications, such as face search. With increasing awareness and concern about data security requirements, there is a compelling need to store data and process VSS applications locally on edge devices rather than send data to servers for computation. However, the explosive amount of data movement from NAND storage to DRAM across memory hierarchy and data processing of the entire dataset consume enormous energy and require long latency for VSS applications. Specifically, edge devices with insufficient DRAM capacity will trigger data swap and deteriorate the execution performance. To overcome this crucial hurdle, we propose an intelligent cognition engine (ICE) with cognitive 3D NAND, featuring non-volatile in-memory computing (nvIMC) to accelerate the processing, suppress the data movement, and reduce data swap between the processor and storage. This cognitive 3D NAND features digital nvIMC techniques (i.e., ADC/DAC-free approach), high-density 3D NAND, and compatibility with standard 3D NAND products with minor modifications. To facilitate parallel INT8/INT4 vector-vector multiplication (VVM) and mitigate the reliability issue of 3D NAND, we develop a bit-error-tolerance data encoding and a two's complement-based digital accumulator. VVM can support similarity computations (e.g., cosine similarity and Euclidean distance), which are required to search "the most similar data" right where they are stored. In addition, the proposed solution can be realized on edge storage products, e.g., embedded MultiMedia Card (eMMC). The measured and simulated results on real 3D NAND chips show that ICE enhances the system execution time by 17 x to 95 x and energy efficiency by 11 x to 140 x, compared to traditional von Neumann approaches using state-of-the-art edge systems with MobileFaceNet on CASIA-WebFace dataset. To the best of our knowledge, this work
Vector similarity search (VSS) for unstructured vectors generated via machine learning methods is a promising solution for many applications, such as face search. With increasing awareness and concern about data secur...
详细信息
ISBN:
(纸本)9781665462723
Vector similarity search (VSS) for unstructured vectors generated via machine learning methods is a promising solution for many applications, such as face search. With increasing awareness and concern about data security requirements, there is a compelling need to store data and process VSS applications locally on edge devices rather than send data to servers for computation. However, the explosive amount of data movement from NAND storage to DRAM across memory hierarchy and data processing of the entire dataset consume enormous energy and require long latency for VSS applications. Specifically, edge devices with insufficient DRAM capacity will trigger data swap and deteriorate the execution performance. To overcome this crucial hurdle, we propose an intelligent cognition engine (ICE) with cognitive 3D NAND, featuring non-volatile in-memory computing (nvIMC) to accelerate the processing, suppress the data movement, and reduce data swap between the processor and storage. This cognitive 3D NAND features digital nvIMC techniques (i.e., ADC/DAC-free approach), high-density 3D NAND, and compatibility with standard 3D NAND products with minor modifications. To facilitate parallel INT8/INT4 vector-vector multiplication (VVM) and mitigate the reliability issue of 3D NAND, we develop a bit-error-tolerance data encoding and a two's complement-based digital accumulator. VVM can support similarity computations (e.g., cosine similarity and Euclidean distance), which are required to search "the most similar data" right where they are stored. In addition, the proposed solution can be realized on edge storage products, e.g., embedded MultiMedia Card (eMMC). The measured and simulated results on real 3D NAND chips show that ICE enhances the system execution time by 17× to 95× and energy efficiency by 11× to 140×, compared to traditional von Neumann approaches using state-of-the-art edge systems with MobileFaceNet on CASIA-WebFace dataset. To the best of our knowledge, this work demo
暂无评论