ISBN: (print) 9781467301275
Information retrieval and extraction rely on estimating the relevance of words in a large corpus of documents or text. One approach to measuring relevance analyzes the importance of words based on their statistical distribution within a document. A second approach considers their linguistic relevance within a logically perceived context. The literature contains a body of work employing both statistical and contextual approaches. The current challenge is to improve the performance of document analysis and clustering systems. With the massive growth of information and raw data available on the web, such analysis demands more rigorous computation and processing. Because these systems operate on a widely distributed platform, there is an urgent need for techniques that scale their performance across multiple processors. We propose a parallelized strategy to estimate both the statistical and the contextual relevance of words, employing a master-slave configuration on a cluster of processors. Our parallel algorithm has been successfully tested on a self-built Beowulf cluster of ten nodes, showing significant performance improvement over a single-processor implementation, consistent with Amdahl's law.
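For readers unfamiliar with Amdahl's law, the following minimal Python sketch shows the bound it places on speedup for a ten-node cluster. The function name `amdahl_speedup` and the parallel fractions used are illustrative assumptions; they are not measurements or code from this work.

```python
def amdahl_speedup(parallel_fraction: float, processors: int) -> float:
    """Upper bound on speedup under Amdahl's law.

    parallel_fraction: share of the single-processor run time that can be
    parallelized (an assumed value here, not a figure from the paper).
    processors: number of processors sharing the parallel part.
    """
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must lie in [0, 1]")
    if processors < 1:
        raise ValueError("processors must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    # The serial part is unaffected; the parallel part is divided among processors.
    return 1.0 / (serial_fraction + parallel_fraction / processors)


if __name__ == "__main__":
    # Hypothetical parallel fractions, evaluated for a ten-node cluster.
    for fraction in (0.5, 0.9, 0.99):
        print(f"p={fraction:.2f}, n=10 -> speedup {amdahl_speedup(fraction, 10):.2f}")
```

Under these assumed fractions, the bound stays well below 10 unless almost all of the work is parallelizable, which is why the serial share handled by the master matters in a master-slave design.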