Yasuhisa Fujii
Authored Publications
Hierarchical Text Spotter for Joint Text Spotting and Layout Analysis
Winter Conference on Applications of Computer Vision (WACV) 2024 (to appear)
We propose Hierarchical Text Spotter (HTS), the first method for the joint task of word-level text spotting and geometric layout analysis.
HTS can annotate text in images with a hierarchical representation of 4 levels: character, word, line, and paragraph.
The proposed HTS is characterized by two novel components:
(1) a Unified-Detector-Polygon (UDP) that produces Bezier Curve polygons of text lines and an affinity matrix for paragraph grouping between detected lines;
(2) a Line-to-Character-to-Word (L2C2W) recognizer that splits lines into characters and further merges them back into words.
HTS achieves state-of-the-art results on multiple word-level text spotting benchmark datasets as well as geometric layout analysis tasks.
Code will be released upon acceptance.
Chain-of-Table: Evolving Tables in the LLM Reasoning Chain for Table Understanding
Zilong Wang
Hao Zhang
Chun-Liang Li
Jingbo Shang
ICLR (2024)
Table-based reasoning with large language models (LLMs) is a promising direction for tackling many table understanding tasks, such as table-based question answering and fact verification. Compared with generic reasoning, table-based reasoning requires the extraction of underlying semantics from both free-form questions and semi-structured tabular data. Chain-of-Thought and similar approaches incorporate the reasoning chain in the form of textual context, but it is still an open question how to effectively leverage tabular data in the reasoning chain. We propose the Chain-of-Table framework, where tabular data is explicitly used in the reasoning chain as a proxy for intermediate thoughts. Specifically, we guide LLMs using in-context learning to iteratively generate operations and update the table to represent a tabular reasoning chain. LLMs can therefore dynamically plan the next operation based on the results of the previous ones. This continuous evolution of the table forms a chain, showing the reasoning process for a given tabular problem. The chain carries structured information about the intermediate results, enabling more accurate and reliable predictions. Chain-of-Table achieves new state-of-the-art performance on the WikiTQ, FeTaQA, and TabFact benchmarks across multiple LLM choices.
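The iterative plan-and-execute loop described in the abstract can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: `stub_planner` stands in for the in-context LLM planner, and the two table operations are simplified versions of atomic table transformations.

```python
# Minimal sketch of the Chain-of-Table loop: a planner (an LLM in the paper,
# a stub here) picks one table operation per step; each operation rewrites the
# table, and the evolving table is fed back until the planner emits [E] (end).

def select_row(table, col, value):
    header, rows = table
    i = header.index(col)
    return header, [r for r in rows if r[i] == value]

def select_column(table, cols):
    header, rows = table
    idx = [header.index(c) for c in cols]
    return [header[i] for i in idx], [[r[i] for i in idx] for r in rows]

def stub_planner(question, table, history):
    # Hypothetical fixed plan standing in for in-context LLM planning.
    plan = [("select_row", ("country", "France")),
            ("select_column", (["city"],)),
            ("[E]", ())]
    return plan[len(history)]

OPS = {"select_row": select_row, "select_column": select_column}

def chain_of_table(question, table, planner):
    history = []
    while True:
        op, args = planner(question, table, history)
        if op == "[E]":
            return table            # final table carries the answer
        table = OPS[op](table, *args)
        history.append((op, args))

table = (["city", "country"],
         [["Paris", "France"], ["Lyon", "France"], ["Rome", "Italy"]])
final = chain_of_table("Which French cities are listed?", table, stub_planner)
```

In the paper the planner is prompted with the question, the current table, and the operation history, so each step conditions on the intermediate table rather than on free-form text alone.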
Text Reading Order in Uncontrolled Conditions by Sparse Graph Segmentation
International Conference on Document Analysis and Recognition (ICDAR) (2023) (to appear)
Text reading order is a crucial aspect of the output of an OCR engine, with a large impact on downstream tasks. Its difficulty lies in the large variation of domain-specific layout structures, and it is further exacerbated by real-world image degradations such as perspective distortions. We propose a lightweight, scalable and generalizable approach to identify text reading order with a multi-modal, multi-task graph convolutional network (GCN) running on a sparse, layout-based graph. Predictions from the model provide hints of bidimensional relations among text lines and layout region structures, upon which a post-processing cluster-and-sort algorithm generates an ordered sequence of all the text lines. The model is language-agnostic and runs effectively across multi-language datasets that contain various types of images taken in uncontrolled conditions, and it is small enough to be deployed on virtually any platform, including mobile devices.
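The cluster-and-sort idea can be illustrated with a sketch under assumed inputs (this is not the paper's algorithm): pairwise "A reads before B" decisions, which the GCN predicts in the paper and a simple geometric rule supplies here, are aggregated into a global order by counting how many other lines each line precedes.

```python
# Sketch of the cluster-and-sort post-processing: pairwise "reads before"
# relations are turned into a global reading order by counting wins.

def before(a, b):
    # Stub pairwise predictor on (x, y) line centers: higher lines first,
    # ties broken left to right. The paper learns this relation instead.
    return (a[1], a[0]) < (b[1], b[0])

def reading_order(lines):
    wins = {i: sum(before(li, lj) for j, lj in enumerate(lines) if j != i)
            for i, li in enumerate(lines)}
    return sorted(range(len(lines)), key=lambda i: -wins[i])

lines = [(10, 50), (10, 10), (80, 10)]   # (x, y) line centers
order = reading_order(lines)             # indices in reading order
```

Counting wins makes the ordering robust to occasional inconsistent pairwise predictions, which a strict topological sort would not tolerate.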
FormNetV2: Inductive Multimodal Graph Contrastive Learning for Form Document Information Extraction
Chun-Liang Li
Hao Zhang
Xiang Zhang
Nikolai Glushnev
Joshua Ainslie
Nan Hua
ACL (2023)
The recent advent of self-supervised pre-training techniques has led to a surge in the use of multimodal learning in form document understanding. However, existing approaches that extend the mask language modeling to other modalities require careful multi-task tuning, complex reconstruction target designs, or additional pre-training data. In FormNetV2, we introduce a centralized multimodal graph contrastive learning strategy to unify self-supervised pre-training for all modalities in one loss. The graph contrastive objective maximizes the agreement of multimodal representations, providing a natural interplay for all modalities without special customization. In addition, we extract image features within the bounding box that joins a pair of tokens connected by a graph edge, capturing more targeted visual cues without loading a sophisticated and separately pre-trained image embedder. FormNetV2 establishes new state-of-the-art performance on FUNSD, CORD, SROIE and Payment benchmarks with a more compact model size.
Distilling Step-by-Step! Outperforming Larger Language Models with Less Training Data and Smaller Model Sizes
Cheng-Yu Hsieh
Chun-Liang Li
Alexander Ratner
Ranjay Krishna
ACL (2023)
Deploying large language models (LLMs) is challenging because they are memory inefficient and compute-intensive for practical applications. In reaction, researchers train smaller task-specific models by either finetuning with human labels or distilling using LLM-generated labels. However, finetuning and distillation require large amounts of training data to achieve performance comparable to LLMs. We introduce Distilling step-by-step, a new mechanism that (a) trains smaller models that outperform LLMs, and (b) does so with less training data than finetuning or distillation require. Our method extracts LLM rationales as additional supervision for small models within a multi-task training framework. We present three findings across 4 NLP benchmarks: First, compared to both finetuning and distillation, our mechanism achieves better performance with far fewer labeled/unlabeled training examples. Second, compared to LLMs, we achieve better performance using substantially smaller model sizes. Third, we reduce both the model size and the amount of data required to outperform LLMs; our 770M T5 model outperforms the 540B PaLM model using only 80% of available data on a benchmark task.
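The multi-task setup can be illustrated with a toy loss computation. This is a sketch of the general idea only: the student shares one model across a label-prediction task and a rationale-generation task, and the training loss is the label loss plus a weighted rationale loss. The distributions and helper names below are invented for illustration.

```python
# Sketch of the multi-task objective in Distilling step-by-step:
# L = L_label + lambda * L_rationale, with per-token cross-entropy stubs.

import math

def cross_entropy(probs, target_ids):
    # Mean negative log-likelihood of the target tokens.
    return -sum(math.log(p[t]) for p, t in zip(probs, target_ids)) / len(target_ids)

def distill_step_by_step_loss(label_probs, label_ids,
                              rationale_probs, rationale_ids, lam=1.0):
    return (cross_entropy(label_probs, label_ids)
            + lam * cross_entropy(rationale_probs, rationale_ids))

# Toy example: one label token, two rationale tokens.
label_probs = [[0.1, 0.9]]
rationale_probs = [[0.8, 0.2], [0.3, 0.7]]
loss = distill_step_by_step_loss(label_probs, [1], rationale_probs, [0, 1])
```

The rationale task is used only during training; at inference the student predicts labels directly, so the extra supervision costs nothing at deployment time.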
Preview abstract
Language models are useful adjuncts to optical models for producing accurate optical character recognition (OCR) results. One factor which limits the power of language models in this context is the existence of many specialized domains with language statistics very different from those implied by a general language model - think of checks, medical prescriptions, and many other specialized document classes.
This paper introduces an algorithm for efficiently generating and attaching a domain-specific, word-based language model at run time to a general language model in an OCR system. To make the best use of this model, the paper also introduces a modified CTC beam search decoder which effectively allows hypotheses to remain in contention based on possible future completion of vocabulary words.
The result is a substantial reduction in word error rate in recognizing material from specialized domains.
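The "possible future completion" test at the heart of the modified decoder can be illustrated with a trie over the domain lexicon. This is a hedged sketch, not the paper's decoder: it shows only the membership test that would let a beam hypothesis stay in contention; the class and method names are illustrative.

```python
# A hypothesis prefix is kept in the beam not only when it already is a word
# in the domain lexicon, but also when it can still be completed into one.
# A trie over the lexicon answers both questions in one lookup.

class Trie:
    def __init__(self, words):
        self.root = {}
        for w in words:
            node = self.root
            for ch in w:
                node = node.setdefault(ch, {})
            node["$"] = True   # end-of-word marker

    def status(self, prefix):
        node = self.root
        for ch in prefix:
            if ch not in node:
                return "dead"            # can never become a lexicon word
            node = node[ch]
        return "word" if "$" in node else "prefix"

lexicon = Trie(["warfarin", "warfare", "aspirin"])
# During decoding, "warfa" would stay in contention; "warx" would be pruned.
```

A full decoder would combine this test with the CTC posteriors and the general language model score; the trie only decides which hypotheses the vocabulary constraint keeps alive.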
ICDAR 2023 Competition on Hierarchical Text Detection and Recognition
Dmitry Panteleev
International Conference on Document Analysis and Recognition (ICDAR) (2023)
We organize a competition on hierarchical text detection and recognition. The competition aims to promote research into deep learning models and systems that can simultaneously perform text detection and recognition and geometric layout analysis. We present details of the competition organization, including tasks, datasets, evaluations, and schedule. During the competition period (from January 2nd 2023 to April 1st 2023), at least 50 submissions from more than 30 teams were made across the two proposed tasks. Considering the number of teams and submissions, we conclude that the HierText competition has been successfully held. In this report, we also present the competition results and insights drawn from them.
Tool Documentation Enables Zero-Shot Tool-Usage with Large Language Models
Cheng-Yu Hsieh
Si-An Chen
Chun-Liang Li
Alexander Ratner
Ranjay Krishna
arXiv preprint arXiv:2308.00675 (2023)
Today, large language models (LLMs) are taught to use new tools by providing a few demonstrations of each tool's usage. Unfortunately, demonstrations are hard to acquire and can result in undesirable biased usage if the wrong demonstration is chosen. Even in the rare scenario that demonstrations are readily available, there is no principled selection protocol to determine how many and which ones to provide. As tasks grow more complex, the selection search grows combinatorially and invariably becomes intractable. Our work provides an alternative to demonstrations: tool documentation, i.e., descriptions of how each tool is used. We advocate the use of tool documentation over demonstrations and substantiate our claim through three main empirical findings on 6 tasks across both vision and language modalities. First, on existing benchmarks, zero-shot prompts with only tool documentation are sufficient for eliciting proper tool usage, achieving performance on par with few-shot prompts. Second, on a newly collected realistic tool-use dataset with hundreds of available tool APIs, we show that tool documentation is significantly more valuable than demonstrations, with zero-shot documentation significantly outperforming few-shot prompting without documentation. Third, we highlight the benefits of tool documentation by tackling image generation and video tracking using just-released, unseen state-of-the-art models as tools. Finally, we highlight the possibility of using tool documentation to automatically enable new applications: using nothing more than the documentation of GroundingDino, Stable Diffusion, XMem, and SAM, LLMs can re-invent the functionalities of the just-released Grounded-SAM and Track Anything models.
Towards End-to-End Unified Scene Text Detection and Layout Analysis
Dmitry Panteleev
CVPR 2022 (2022)
Scene text detection and document layout analysis have long been treated as two separate tasks in different image domains. In this paper, we bring them together and introduce the task of unified scene text detection and layout analysis. The first hierarchical scene text dataset is introduced to enable this novel research task. We also propose a novel method that is able to simultaneously detect scene text and form text clusters in a unified way. Comprehensive experiments show that our unified model achieves better performance than multiple well-designed baseline methods. Additionally, this model achieves state-of-the-art results on multiple scene text detection datasets without the need for complex post-processing. Dataset and code: https://github.com/google-research-datasets/hiertext.
Post-OCR Paragraph Recognition by Graph Convolutional Networks
Winter Conference on Applications of Computer Vision (WACV) 2022
Paragraphs are an important class of document entities. We propose a new approach for paragraph recognition by spatial graph convolutional networks (GCN) applied on OCR text boxes. Two steps, namely line splitting and line clustering, are performed to extract paragraphs from the lines in OCR results. Each step uses a beta-skeleton graph constructed from bounding boxes, where the graph edges provide efficient support for graph convolution operations. With pure layout input features, the GCN model size is 3~4 orders of magnitude smaller compared to R-CNN based models, while achieving comparable or better accuracies on PubLayNet and other datasets. Furthermore, the GCN models show good generalization from synthetic training data to real-world images, and good adaptivity for variable document styles.
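The graph-construction step can be illustrated for the special case beta = 1, where a beta-skeleton reduces to the Gabriel graph; this sketch assumes box centers as points and is not the paper's exact construction. An edge connects two centers iff no third center lies inside the circle having that pair as its diameter.

```python
# Gabriel-graph sketch of the box graph (beta-skeleton with beta = 1):
# (i, j) is an edge iff no other center lies strictly inside the circle
# whose diameter is the segment from points[i] to points[j].

def gabriel_edges(points):
    edges = []
    for i, p in enumerate(points):
        for j in range(i + 1, len(points)):
            q = points[j]
            mx, my = (p[0] + q[0]) / 2, (p[1] + q[1]) / 2   # circle center
            r2 = ((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2) / 4  # radius^2
            if all(k in (i, j) or (x - mx) ** 2 + (y - my) ** 2 >= r2
                   for k, (x, y) in enumerate(points)):
                edges.append((i, j))
    return edges

centers = [(0, 0), (2, 0), (4, 0)]
edges = gabriel_edges(centers)   # the (0, 2) edge is blocked by the middle point
```

The resulting graph is sparse, which is what keeps the graph convolution operations cheap relative to dense pairwise attention over all boxes.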
FormNet: Structural Encoding beyond Sequential Modeling in Form Document Information Extraction
Chun-Liang Li
Nan Hua
Joshua Ainslie
Association for Computational Linguistics (ACL) (2022)
Sequence modeling has demonstrated state-of-the-art performance on natural language and document understanding tasks. However, it is challenging to correctly serialize tokens in form-like documents in practice due to their variety of layout patterns. We propose FormNet, a structure-aware sequence model to mitigate the suboptimal serialization of forms. First, we design Rich Attention that leverages the spatial relationship between tokens in a form for more precise attention score calculation. Second, we construct Super-Tokens for each word by embedding representations from their neighboring tokens through graph convolutions. FormNet therefore explicitly recovers local syntactic information that may have been lost during serialization. In experiments, FormNet outperforms existing methods with a more compact model size and less pre-training data, establishing new state-of-the-art performance on CORD, FUNSD and Payment benchmarks.
Unified Line and Paragraph Detection by Graph Convolutional Networks
Shuang Liu
International Workshop on Document Analysis System (DAS) (2022)
We formulate the task of detecting lines and paragraphs in a document as a unified two-level clustering problem. Given a set of text detection boxes that roughly correspond to words, a text line is a cluster of boxes and a paragraph is a cluster of lines. These clusters form a two-level tree that represents a major part of the layout of a document. We use a graph convolutional network to predict the relations between text detection boxes and then build both levels of clusters from these predictions. Experimentally, we demonstrate that the unified approach can be highly efficient while still achieving state-of-the-art quality for detecting paragraphs in public benchmarks and real-world images.
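The two-level clustering can be sketched with union-find. This is illustrative only: in the paper the pairwise "same line" and "same paragraph" relations come from the GCN, whereas here they are given directly.

```python
# Two-level clustering sketch: words -> lines -> paragraphs, where each level
# is a union-find grouping over predicted pairwise "same cluster" relations.

def cluster(n, pairs):
    parent = list(range(n))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x
    for a, b in pairs:
        parent[find(a)] = find(b)
    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())

# Four word boxes: words 0-1 share a line, words 2-3 share a line.
lines = cluster(4, [(0, 1), (2, 3)])          # word clusters = lines
# The two lines (indices into `lines`) belong to one paragraph.
paragraphs = cluster(len(lines), [(0, 1)])    # line clusters = paragraphs
```

Running the same clustering routine at both levels is what makes the two tasks share one model and one post-processing step.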
ROPE: Reading Order Equivariant Positional Encoding for Graph-based Document Information Extraction
Chun-Liang Li
Chu Wang
Association for Computational Linguistics (ACL) (2021)
Natural reading orders of words are crucial for information extraction from form-like documents. Despite recent advances in Graph Convolutional Networks (GCNs) on modeling spatial layout patterns of documents, they have limited ability to capture reading orders of given word-level node representations in a graph. We propose Reading Order Equivariant Positional Encoding (ROPE), a new positional encoding technique designed to capture the sequential presentation of words in documents. ROPE generates unique reading order codes for neighboring words relative to the target word given a word-level graph connectivity. We study two fundamental document entity extraction tasks, word labeling and word grouping, on the public FUNSD dataset and a large-scale payment dataset. We show that ROPE consistently improves existing GCNs by a margin of up to 8.4% F1-score.
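A minimal sketch of the reading-order-equivariant idea (illustrative; not the paper's exact coding scheme): neighbors of a target word are coded by their reading-order positions relative to the target, so shifting the whole sequence leaves the codes unchanged.

```python
# Relative reading-order codes for a target word's graph neighbors.
# Equivariance: translating all reading-order indices by a constant
# leaves the codes unchanged.

def rope_codes(target_idx, neighbor_idxs):
    # Offsets of neighbors with respect to the target in reading order;
    # the ranks are what a downstream embedding table would consume
    # (illustrative, not the paper's exact coding).
    offsets = [n - target_idx for n in neighbor_idxs]
    rank = {o: r for r, o in enumerate(sorted(offsets))}
    return [rank[o] for o in offsets]

# Word 5's graph neighbors are words 3, 6 and 9 in reading order.
codes = rope_codes(5, [3, 6, 9])
```

Because the codes depend only on relative positions, the same local pattern of words produces the same encoding wherever it occurs in the document.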
Preview abstract
We propose an end-to-end trainable network that can simultaneously detect and recognize text along arbitrarily curved paths, making substantial progress on the open problem of reading scene text of irregular shape. We formulate arbitrary-shape text detection as an instance segmentation problem; an attention model is then used to decode the textual content of each irregularly shaped text region without rectification. To extract useful irregularly shaped text instance features from image scale features, we propose a simple yet effective RoI masking step. Finally, we show that predictions from an existing multi-step OCR engine can be leveraged as partially labeled training data, which leads to significant improvements in both the detection and recognition accuracy of our model. Our method surpasses the state-of-the-art for end-to-end recognition tasks on the ICDAR15 (straight) benchmark by 4.6%, and on the Total-Text (curved) benchmark by more than 16%.
Preview abstract
Many studies on (Offline) Handwritten Text Recognition (HTR) systems have focused on building state-of-the-art models for line recognition on small corpora. However, adding HTR capability to a large scale multilingual OCR system poses new challenges. This paper addresses three problems in building such systems: data, efficiency, and integration. Firstly, one of the biggest challenges is obtaining sufficient amounts of high quality training data. We address the problem by using online handwriting data collected for a large scale production online handwriting recognition system. We describe our image data generation pipeline and study how online data can be used to build HTR models. We show that the data improve the models significantly under the condition where only a small number of real images is available, which is usually the case for HTR models. It enables us to support a new script at substantially lower cost. Secondly, we propose a line recognition model based on neural networks without recurrent connections. The model achieves a comparable accuracy with LSTM-based models while allowing for better parallelism in training and inference. Finally, we present a simple way to integrate HTR models into an OCR system. These constitute a solution to bring HTR capability into a large scale OCR system.
Sequence-to-Label Script Identification for Multilingual OCR
Jonathan Michael Baccash
Patrick Michael Hurst
Proceedings of the 14th International Conference on Document Analysis and Recognition (ICDAR), IEEE (2017)
We describe a novel line-level script identification method. In multilingual OCR, script identification is a crucial component as it automates the provision of a language hint. Previous work repurposed an OCR model that generates per-character script codes, aggregated by a counting heuristic to obtain a line-level script ID. This baseline has two shortcomings. First, as a sequence-to-sequence model it is more complex than necessary for the sequence-to-label problem of line script ID, making it hard to train and inefficient to run. Second, the counting heuristic may be suboptimal compared to a learned model. We therefore reframe line script identification as a sequence-to-label problem and solve it using two components trained end-to-end: an encoder and a summarizer. The encoder converts a line image into a sequence of features. The summarizer aggregates this sequence to classify the line. We test various summarizers while keeping identical inception-style convolutional networks as encoders. Experiments on scanned books and photos containing 232 languages in 30 scripts show a 16% reduction in script ID error rate compared to the baseline. This improved script ID reduces the character error rate attributable to script misidentification by 33%.
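The encoder-summarizer decomposition can be sketched with toy values. This assumes stubbed encoder features and a mean-pooling summarizer (one of the simplest choices the setup admits); the weights and dimensions below are invented for illustration.

```python
# Sequence-to-label sketch: an encoder (stubbed here) maps a line image to a
# sequence of feature vectors; a summarizer aggregates them into one vector,
# which a linear scorer classifies into a script.

def summarize_mean(features):
    # Average the per-position feature vectors into one line-level vector.
    dim = len(features[0])
    return [sum(f[d] for f in features) / len(features) for d in range(dim)]

def classify(line_vec, weights):
    # Linear scorer over scripts; returns the argmax script index.
    scores = [sum(w * x for w, x in zip(row, line_vec)) for row in weights]
    return scores.index(max(scores))

features = [[1.0, 0.0], [0.8, 0.2], [0.9, 0.1]]   # stub encoder output
weights = [[1.0, 0.0], [0.0, 1.0]]                # 2 scripts x 2 feature dims
script = classify(summarize_mean(features), weights)
```

The paper compares several learned summarizers against simple pooling; the point of the decomposition is that the whole pipeline trains end-to-end against a single line-level label.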
Label Transition and Selection Pruning and Automatic Decoding Parameter Optimization for Time-Synchronous Viterbi Decoding
Dmitriy Genzel
Remco Teunen
13th International Conference on Document Analysis and Recognition (ICDAR), IEEE (2015), pp. 756-760
HMM-based script identification for OCR
Dmitriy Genzel
Remco Teunen
Proceedings of the 4th International Workshop on Multilingual OCR, ACM, New York, NY, US (2013), 2:1-2:5
While current OCR systems are able to recognize text in an increasing number of scripts and languages, typically they still need to be told in advance what those scripts and languages are. We propose an approach that repurposes the same HMM-based system used for OCR for the task of script/language ID, by replacing character labels with script class labels. We apply it in a multi-pass overall OCR process which achieves “universal” OCR over 54 tested languages in 18 distinct scripts, over a wide variety of typefaces in each. For comparison we also consider a brute-force approach, wherein a single HMM-based OCR system is trained to recognize all considered scripts. Results are presented on a large and diverse evaluation set extracted from book images, both for script identification accuracy and for overall OCR accuracy. On this evaluation data, the script ID system achieved a script ID error rate of 1.73% across the 18 distinct scripts. The end-to-end OCR system with the script ID system achieved a character error rate of 4.05%, an increase of 0.77% over the case where the languages are known a priori.