Kern Roman, Frey Matthias
2015
Table recognition and table extraction are important tasks in information extraction, especially in the domain of schol- arly communication. In this domain tables are commonplace and contain valuable information. Many different automatic approaches for table recognition and extraction exist. Com- mon to many of these approaches is the need for ground truth datasets, to train algorithms or to evaluate the results. In this paper we present the PDF Table Annotator, a web based tool for annotating elements and regions in PDF doc- uments, in particular tables. The annotated data is intended to serve as a ground truth useful to machine learning algo- rithms for detecting table regions and table structure. To make the task of manual table annotation as convenient as possible, the tool is designed to allow an efficient annotation process that may spawn multiple session by multiple users. An evaluation is conducted where we compare our tool to three alternative ways of creating ground truth of tables in documents. Here we found that our tool overall provides an efficient and convenient way to annotate tables. In addition, our tool is particularly suitable for complex table structures, where it provided the lowest annotation time and the highest accuracy. Furthermore, our tool allows to annotate tables following a logical or a functional model. Given that by the use of our tool ground truth datasets for table recognition and extraction are easier to produce, the quality of auto- matic tables extraction should greatly benefit. General