ISBN: 9781605580814 (print)
Pre-processing for raster-image-based document segmentation begins with image thresholding, a binarization process that separates foreground from background. In this paper, we compare an existing approach (Otsu), a modified existing approach (Kittler-Illingworth), and a simple peak-based thresholding approach on a set of 982 documents for which ground truth (full text) is available. We use the output of an open source OCR engine that incorporates an adaptive/dynamic thresholder, which can be bypassed in favor of any one of the three global thresholders we tested. This allowed comparison of the three approaches in the aggregate. We then used an independently generated dictionary as a means of characterizing thresholder efficacy. Such an approach, if successful, will provide the means for selecting an optimal thresholder in the absence of a large set of ground-truthed documents. Our preliminary findings indicate that this approach may provide a reliable means for thresholder comparison and eventually preclude the need for time-intensive human ground truthing.
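To make the two ideas in the abstract concrete, the following is a minimal sketch, not the authors' implementation. It shows a textbook Otsu global threshold computed from a gray-level histogram, and one plausible reading of dictionary-based scoring: the fraction of OCR output words found in a word list. The function names (`otsu_threshold`, `binarize`, `dictionary_hit_rate`), the convention that dark pixels are foreground, and the hit-rate metric itself are assumptions made for illustration; the abstract does not specify how the dictionary was applied.

```python
import re


def otsu_threshold(histogram):
    """Return the gray level t that maximizes between-class variance.

    Pixels with value <= t form one class; pixels with value > t form the other.
    `histogram[i]` is the number of pixels with gray level i.
    """
    total = sum(histogram)
    weighted_total = sum(level * count for level, count in enumerate(histogram))
    best_t, best_variance = 0, -1.0
    count_low = 0  # pixels at or below the candidate threshold
    sum_low = 0    # sum of gray levels at or below the candidate threshold
    for t, count in enumerate(histogram):
        count_low += count
        sum_low += t * count
        count_high = total - count_low
        if count_low == 0 or count_high == 0:
            continue  # one class is empty, so the variance is undefined
        mean_low = sum_low / count_low
        mean_high = (weighted_total - sum_low) / count_high
        # Proportional to the between-class variance (constant factor dropped).
        variance = count_low * count_high * (mean_low - mean_high) ** 2
        if variance > best_variance:
            best_t, best_variance = t, variance
    return best_t


def binarize(pixels, t):
    """Map a 2-D list of gray levels to 1 (dark foreground) or 0 (background).

    Assumption: text is darker than the page.
    """
    return [[1 if p <= t else 0 for p in row] for row in pixels]


def dictionary_hit_rate(ocr_text, dictionary):
    """Fraction of alphabetic tokens in the OCR text that appear in `dictionary`.

    Hypothetical metric: a higher rate is taken to suggest a better thresholder.
    """
    words = re.findall(r"[a-z]+", ocr_text.lower())
    if not words:
        return 0.0
    return sum(w in dictionary for w in words) / len(words)
```

In use, each candidate thresholder would binarize the same page, the OCR engine would be run on each binarized result, and the hit rates would be compared across thresholders. No ground-truth transcript is needed for that comparison, which is the property the abstract highlights.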