Finding Images and Line Drawings in Document-Scanning Systems


This work addresses the problem of finding images and line-drawings in scanned pages. It is a crucial processing step in the creation of a large-scale system to detect and index images found in books and historic documents. Within the scanned pages that contain both text and images, the images are found through the use of local-feature extraction, applied across the full scanned page. This is followed by a novel learning system to categorize the local features into either text or image. The discrimination is based on using multiple classifiers trained via stochastic sampling of weak classifiers for each AdaBoost stage. The approach taken in sampling includes stochastic hill climbing across weak detectors, allowing us to reduce our classification error by as much as 25% relative to more naive stochastic sampling. Stochastic hill climbing in the weak classifier space is possible due to the manner in which we parameterize the weak classifier space. Through the use of this system, we improve image detection by finding more line-drawings, graphics, and photographs, as well as reducing the number of spurious detections due to misclassified text, discoloration, and scanning artifacts.