Jump to Content
Ray Smith

Ray Smith

Ray developed the Tesseract OCR engine at HPLabs Bristol for 10 years, followed by a 3 year term developing the text and line drawings pipelines for the HP PrecisionScan product in Greeley, Colorado. After spending a further 7 years developing a new architecture for the Omnipage OCR product for Caere/Scansoft/Nuance, Ray is now at Google, working on Tesseract again.

Research Areas

Authored Publications
Google Publications
Other Publications
Sort By
  • Title
  • Title, descending
  • Year
  • Year, descending
    Improving Book OCR by Adaptive Language and Image Models
    Dar-Shyang Lee
    Proceedings of 2012 10th IAPR International Workshop on Document Analysis Systems, IEEE, pp. 115-119
    Preview abstract In order to cope with the vast diversity of book content and typefaces, it is important for OCR systems to leverage the strong consistency within a book but adapt to variations across books. In this work, we describe a system that combines two parallel correction paths using document-specific image and language models. Each model adapts to shapes and vocabularies within a book to identify inconsistencies as correction hypotheses, but relies on the other for effective cross-validation. Using the open source Tesseract engine as baseline, results on a large dataset of scanned books demonstrate that word error rates can be reduced by 25% using this approach. View details
    Preview abstract Although large language models are used in speech recognition and machine translation applications, OCR systems are “far behind” in their use of language models. The reason for this is not the laggardness of the OCR community, but the fact that, at high accuracies, a frequency-based language model can do more damage than good, unless carefully applied. This paper presents an analysis of this discrepancy with the help of the Google Books n-gram Corpus, and concludes that noisy-channel models that closely model the underlying classifier and segmentation errors are required. View details
    Table Detection in Heterogeneous Documents
    Faisal Shafait
    Document Analysis Systems 2010, ACM International Conference Proceedings series
    Preview abstract Detecting tables in document images is important since not only do tables contain important information, but also most of the layout analysis methods fail in the presence of tables in the document image. Existing approaches for table de- tection mainly focus on detecting tables in single columns of text and do not work reliably on documents with varying layouts. This paper presents a practical algorithm for table detection that works with a high accuracy on documents with varying layouts (company reports, newspaper articles, magazine pages, . . . ). An open source implementation of the algorithm is provided as part of the Tesseract OCR engine. Evaluation of the algorithm on document images from pub- licly available UNLV dataset shows competitive performance in comparison to the table detection module of a commercial OCR system. View details
    Hybrid Page Layout Analysis via Tab-Stop Detection
    Proceedings of the 10th international conference on document analysis and recognition, IEEE (2009)
    Preview abstract A new hybrid page layout analysis algorithm is proposed, which uses bottom-up methods to form an initial data-type hypothesis and locate the tab-stops that were used when the page was formatted. The detected tab-stops are used to deduce the column layout of the page. The column layout is then applied in a top-down manner to impose structure and reading-order on the detected regions. The complete C++ source code implementation is available as part of the Tesseract open source OCR engine at http://code.google.com/p/tesseract-ocr. View details
    Combined Orientation and Script Detection using the Tesseract OCR Engine
    Ranjith Unnikrishnan
    Workshop on Multilingual OCR (MOCR), Proc. 10th Intl. Conf. on Document Analysis and Recognition (ICDAR), (2009)
    Preview abstract This paper proposes a simple but effective algorithm to estimate the script and dominant page orientation of the text contained in an image. A candidate set of shape classes for each script is generated using synthetically rendered text and used to train a fast shape classifier. At run time, the classifier is applied independently to connected components in the image for each possible orientation of the component, and the accumulated confidence scores are used to determine the best estimate of page orientation and script. Results demonstrate the effectiveness of the approach on a dataset of 1846 documents containing a diverse set of images in 14 scripts and any of four possible page orientations. View details
    Adapting the Tesseract Open Source OCR Engine for Multilingual OCR
    Daria Antonova
    Dar-Shyang Lee
    MOCR '09: Proceedings of the International Workshop on Multilingual OCR (2009)
    Preview abstract We describe efforts to adapt the Tesseract open source OCR engine for multiple scripts and languages. Effort has been concentrated on enabling generic multi-lingual operation such that negligible customization is required for a new language beyond providing a corpus of text. Although change was required to various modules, including physical layout analysis, and linguistic post-processing, no change was required to the character classifier beyond changing a few limits. The Tesseract classifier has adapted easily to Simplified Chinese. Test results on English, a mixture of European languages, and Russian, taken from a random sample of books, show a reasonably consistent word error rate between 3.72% and 5.78%, and Simplified Chinese has a character error rate of only 3.77%. ©ACM, 2009. This is the authors’ version of the work. It is posted here by permission of ACM for your personal use. Not for redistribution. The definitive version was published in Proceedings of the International Workshop on Multilingual OCR 2009, Barcelona, Spain July 25, 2009. View details
    An Overview of the Tesseract OCR Engine
    Proc. Ninth Int. Conference on Document Analysis and Recognition (ICDAR), IEEE Computer Society (2007), pp. 629-633
    Preview abstract The Tesseract OCR engine, as was the HP Research Prototype in the UNLV Fourth Annual Test of OCR Accuracy[1], is described in a comprehensive overview. Emphasis is placed on aspects that are novel or at least unusual in an OCR engine, including in particular the line finding, features/classification methods, and the adaptive classifier. View details
    A simple and efficient skew detection algorithm via text row accumulation
    Proceedings 3rd ICDAR'95, IEEE (1995), pp. 1145-1148
    Computer processing of line images: a survey
    Pattern Recogn., vol. 20 (1987), pp. 7-15