A collaboration between researchers from China’s Beihang University and Microsoft Research Asia has produced TableBank, a new image-based dataset for table detection and recognition built with novel weak supervision from Word and LaTeX documents on the Internet.

The researchers built several strong baselines using state-of-the-art (SOTA) deep neural network models, which should make it easier to apply more deep learning methods to table detection and recognition tasks. TableBank has been open-sourced on GitHub.

“Existing research for image-based table detection and recognition usually fine-tunes pre-trained models on out-of-domain data with a few thousand human labeled examples, which is difficult to generalize on real world applications. With TableBank that contains 417K high-quality labeled tables, we build several strong baselines using state-of-the-art models with deep neural networks.” (arXiv).

Synced invited Christian Beckmann, a Data Scientist in the Innovation Hub of Deutsche Telekom AG and Dean of the Darmstadt School of AI, to share his thoughts on TableBank.

How would you describe TableBank?

TableBank is a high-quality, image-based dataset, with 417k labeled tables and their source documents, that supports research on table detection and recognition using deep learning. The research paper outlines an automated and scalable way to create the dataset through weak supervision, using the markup tags available in Word and LaTeX documents found on the Internet.

It is currently the largest available dataset for this task and might have the same impact on this research field as the well-known ImageNet and COCO datasets had on object detection.
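To make the weak-supervision idea concrete: because LaTeX source marks tables explicitly, table locations can be found without human labeling. The sketch below is only an illustration of that principle, not the authors' pipeline. The helper name `find_table_spans` and the sample document are hypothetical, and the simple pattern assumes `tabular` environments are not nested.

```python
import re

# Hypothetical helper for illustration; not taken from the TableBank code.
# Non-greedy match from \begin{tabular} to the nearest \end{tabular}.
# Assumption: tabular environments are not nested inside each other.
TABULAR = re.compile(r"\\begin\{tabular\}.*?\\end\{tabular\}", re.DOTALL)


def find_table_spans(latex_source):
    """Return (start, end) character offsets of each tabular environment."""
    return [match.span() for match in TABULAR.finditer(latex_source)]


# Made-up sample document containing a single table.
sample = r"""
Some text before the table.
\begin{tabular}{ll}
a & b \\
c & d \\
\end{tabular}
Some text after the table.
"""

spans = find_table_spans(sample)
for start, end in spans:
    # Each span is a candidate table region found purely from the markup.
    print(sample[start:end])
```

In a full pipeline, spans like these would still have to be mapped to pixel coordinates in the rendered page image. That step is outside the scope of this sketch.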

Why does TableBank matter?

Table detection and recognition is an important task in many document analysis applications. Conventional techniques use hand-crafted rules and heuristics based on layout analysis. But these techniques fail to generalize well because they are not robust to variations in the layout of tables.

The rapid advancement of deep learning for object detection has opened up a different approach. Techniques from that field, such as Faster R-CNN, enable a purely data-driven approach to detecting tables without the need for predefined rules.

But currently only small datasets are available for training models for this task. Datasets like UNLV or Marmot contain only a hundred to a few thousand hand-labeled examples that can be used to train such models. To get around the problem of small datasets, current approaches make use of models pretrained on other domains, such as object detection with COCO or ImageNet. In this way, models for tables can be trained by fine-tuning existing models.

The concept outlined in the paper solves both problems: It describes a way to create a large, labeled dataset in an automated and scalable way; and provides a dataset that is big enough to enable research on models specific to the task of table detection and recognition.
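For readers unfamiliar with how data-driven detectors are scored, a common measure in object detection is intersection over union (IoU) between a predicted box and a labeled box. The interview does not say which metric the TableBank baselines use, so the following is a minimal sketch under that assumption. The function name `iou` and the `(x_min, y_min, x_max, y_max)` box format are choices made for this example.

```python
def iou(box_a, box_b):
    """Intersection over union of two axis-aligned boxes.

    Boxes are (x_min, y_min, x_max, y_max); this format is an assumption
    made for this example.
    """
    # Corners of the overlapping rectangle, if there is one.
    ix_min = max(box_a[0], box_b[0])
    iy_min = max(box_a[1], box_b[1])
    ix_max = min(box_a[2], box_b[2])
    iy_max = min(box_a[3], box_b[3])

    # Clamp to zero when the boxes do not overlap.
    inter_w = max(0.0, ix_max - ix_min)
    inter_h = max(0.0, iy_max - iy_min)
    intersection = inter_w * inter_h

    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - intersection

    return intersection / union if union > 0 else 0.0


# A predicted table region compared with a labeled one (made-up coordinates).
predicted = (0, 0, 2, 2)
labeled = (1, 1, 3, 3)
print(iou(predicted, labeled))
```

A detection is typically counted as correct when its IoU with a labeled table exceeds a chosen threshold. The threshold is a design choice and is not specified in this article.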

What impact might this research bring to the research community?

The availability of a large, high-quality dataset is key to the research and development of task-specific models. This research paper outlines how to build such a dataset in an automated and scalable way, and as a result it also makes available a dataset with 417k samples.

These results will help research on new machine-learning-based approaches to the detection and recognition of tables.

Can you identify any bottlenecks in the research?

Using weak supervision enables building a large dataset in an automated and scalable way, but this comes with some drawbacks for quality assurance. Keeping the dataset robust to labeling errors and noise might become more challenging as it grows.

Can you predict any potential future developments related to this research?

As more and more information becomes available in digital formats, there is a growing need to make that information available in a structured and processable way. This enables companies to automate processes, reduce costs and make the most of their data.

The paper TableBank: Table Benchmark for Image-based Table Detection and Recognition is on arXiv.