PdfTable: A Unified Toolkit for Deep
Learning-Based Table Extraction
Lei Sheng^1* and Shuai-Shuai Xu^2

^1* Automation Institute, Wuhan University of Technology, 122 Luoshi Road, Wuhan, 430070, Hubei, China.
^2 School of Software, University of Science and Technology of China, No.96, JinZhai Road, Baohe District, Hefei, 230026, Anhui, China.

*Corresponding author(s). E-mail(s): xuanfeng1992@whut.edu.cn;
Contributing authors: sa517432@mail.ustc.edu.cn;
Abstract
Currently, a substantial volume of document data exists in unstructured formats, encompassing Portable Document Format (PDF) files and images. Extracting information from these documents presents formidable challenges due to diverse table styles, complex forms, and the inclusion of different languages. Several open-source toolkits, such as Camelot, pdfplumber, and PaddlePaddle Structure V2 (PP-StructureV2), have been developed to facilitate table extraction from PDFs or images. However, each toolkit has its limitations. Camelot and pdfplumber can only extract tables from digital PDFs and cannot handle image-based PDFs and pictures. On the other hand, PP-StructureV2 can comprehensively extract tables from image-based PDFs and pictures. Nevertheless, it lacks the ability to differentiate between diverse application scenarios, such as wired versus wireless tables and digital versus image-based PDFs. To address these issues, we introduce the PDF table extraction (PdfTable) toolkit. This toolkit integrates numerous open-source models, including seven table recognition models, four optical character recognition (OCR) tools, and three layout analysis models. By refining the PDF table extraction process, PdfTable achieves adaptability across various application scenarios. We substantiate the efficacy of the PdfTable toolkit through verification on a self-labeled wired table dataset and the open-source wireless table dataset PubTabNet. The PdfTable code will be available on GitHub: https://github.com/CycloneBoy/pdftable.
Keywords: Intelligent document analysis, Table structure recognition, Portable Document Format file table extraction, Information extraction
1 Introduction
Portable Document Format (PDF^1) is a file format used to present documents in a hardware- and software-independent manner. It finds widespread application across
various domains, including academic papers and financial report documents. With the
rapid development of document digitization, the automated extraction of informa-
tion from PDFs has gained paramount significance. Consequently, several tools have
emerged to facilitate the conversion of PDFs into easily parseable HTML formats.
However, the intricate structures and diverse styles of tables, coupled with the poten-
tial inclusion of different languages in table contents, present persistent challenges in
table structure recognition (TSR) during the parsing of PDF documents. In response
to this challenge, diverse methods have been proposed to address the complexities
associated with TSR.
Several toolkits are available to directly extract tables from PDFs, including Tabula^2, Camelot^3, and pdfplumber^4. Tabula, Camelot, and pdfplumber primarily employ
rule-based methods for table extraction in digital PDFs, demonstrating inaccuracies when confronted with tables featuring complex cross-row and cross-column styles.
With the rapid development of deep learning, early researchers proposed models such
as DeepDeSRT[1], TableNet[2], and SEM[3] to address table extraction challenges in
image-based documents. However, due to a scarcity of extensively annotated datasets,
the outcomes were less than satisfactory. Recent years have witnessed the intro-
duction of diverse TSR datasets, such as SciTSR[4], TableBank[5], PubTabNet[6],
PubTables-1M[7], and WTW[8]. Models like CascadeTabNet[9], EDD[6], LGPMA[10],
TSRFormer[11], Cycle-CenterNet[8], LORE[12], etc., trained on these datasets have
demonstrated proficient table parsing results. Despite successful parsing in general-
ized table scenarios, these models encounter challenges when applied to real-world
scenarios.
While existing table parsing algorithms perform admirably, there remains a deficiency in open-source tools designed for end-to-end PDF table extraction that address diverse table extraction tasks in practical applications. Baidu's recently open-sourced PP-StructureV2[13] toolkit, employing the SLANet table structure recognition model in conjunction with PaddleOCR[14], has garnered widespread user appreciation for achieving end-to-end table recognition and extraction. Nevertheless, there are notable areas for optimization: (1) the end-to-end table extraction process lacks sufficient subdivision, such as the differentiation between wired and wireless tables, and between text extraction from digital PDFs versus image-based PDFs; (2) each functional module in the recognition process supports a limited number of models, for instance, two layout analysis models, three table recognition models, and one OCR text recognition model; (3) open-source table recognition models commonly employ different frameworks and dependent environments, posing challenges for debugging and reproducibility within a unified environment.
^1 https://en.wikipedia.org/wiki/PDF
^2 https://github.com/tabulapdf/tabula
^3 https://github.com/camelot-dev/camelot
^4 https://github.com/jsvine/pdfplumber
In addressing the aforementioned challenges, we present a novel end-to-end PDF table extraction toolkit called PdfTable. Initially, we partition the table recognition process into distinct modules, including data preprocessing, layout analysis, table structure recognition, table content extraction, and upper-layer application. Diverse open-source algorithms and toolkits are then integrated for each module, with uniform coding implemented in PyTorch[15] to streamline debugging and model integration. Presently, the toolkit encompasses seven table structure recognition algorithms, three layout analysis algorithms, and four mainstream OCR tools. Subsequently, we conduct end-to-end integration and optimization of the table recognition process, ultimately enabling the batch conversion of both digital and scanned PDF documents into HTML or Word formats. PdfTable also facilitates the direct extraction of PDF tables into Excel and supports numerous languages. To validate the toolkit's effectiveness, we annotated a small table dataset within the Chinese financial domain, comprising both digital and scanned PDFs. PdfTable demonstrated commendable performance on this dataset, affirming the efficacy of the toolkit. Concurrently, we evaluated the integration of four wireless table models on the PubTabNet[6] wireless table dataset, with results attesting to the correctness of the model integration.
In summary, our primary contributions can be outlined as follows:
1. Introduction of PdfTable, an end-to-end deep learning-based PDF table extraction
toolkit, supporting the extraction of tables from both digital and scanned PDFs,
encompassing wired and wireless table extraction.
2. Integration of numerous open-source algorithms into our toolkit, encompass-
ing seven table structure recognition models, four mainstream OCR recognition
tools, and three layout analysis models. This integration provides users with a
straightforward and user-friendly API.
3. Validation of the efficacy of the PdfTable toolkit through experiments on a self-labeled small Chinese financial-domain wired table dataset and the wireless table dataset PubTabNet[6]. The experimental results demonstrate the effectiveness and correctness of our toolkit.
2 Related Work
2.1 Document Layout Analysis
Document layout analysis is a basic pre-processing task for modern document
understanding and digitization. It mainly divides documents into different regions,
such as pictures, tables, text and formulas, which can be regarded as a sub-task
of object detection. Presently mainstream methods include object detection-based
models, segmentation-based models and GNN-based methods[16]. DeepDeSRT[1] pio-
neered the use of Faster R-CNN[17] for table detection, achieving commendable results.
With the emergence of more layout analysis datasets, such as PubLayNet[18] and TableBank[5], and of more object detection models, such as Mask R-CNN[19], YOLO[20], and DETR[21], the task of layout analysis has been further developed.
LayoutParser[22] is a unified toolkit for document image analysis based on deep learn-
ing, providing rich pre-trained models and user-friendly APIs. PP-StructureV2[13]
also provides a variety of English and Chinese layout analysis models trained on
PP-YOLOv2[23] and PP-PicoDet[24] models.
2.2 Table Structure Recognition
Historically, early approaches to table recognition primarily employed rule-based
and statistical machine learning methods, often limited by their dependence on the
rigid rectangular layout of tables. Consequently, these methods could only effec-
tively handle straightforward table structures or tables embedded in PDFs. In recent
years, the landscape has shifted towards deep learning-based methods, demonstrating
substantial improvements in accuracy compared to traditional approaches. Broadly
categorized, these contemporary methods fall into three main groups: boundary
extraction-based methods, image-to-markup generated methods, and graph-based
methods.
Boundary extraction-based methods. These methods employ object detection
or semantic segmentation algorithms to initially identify the rows and columns of the
table. Subsequently, the cells of the table are determined through cross-combination
of the identified rows and columns. DeepDeSRT[1] and TableNet[2] leverage Fully Convolutional Network (FCN)-based semantic segmentation models for TSR anal-
ysis. However, the basic FCN faces challenges in accurately recognizing tables with numerous blank cells due to its limited receptive field. To address this limitation, subsequent
researchers proposed enhancements [3, 9, 25]. SEM[3] stands out by integrating visual
and textual information through three independent modules (splitter, embedder, and merger), enabling the extraction of both simple and complex tables. RobusTabNet[11] proposed a new split-and-merge TSR method using a spatial Convolutional Neural Network (CNN) module, which can effectively identify tables with a large number of blank cells and with distortions.
Image-to-markup generation-based methods. These methods transform the
table recognition task into an image-to-markup generation task, directly generating
markup (HTML or LaTeX) to represent the table structure. Leveraging a substan-
tial volume of labeled table data extracted from existing PDFs, web pages, or LaTeX
papers through rules or semi-supervised methods, researchers have proposed many benchmark datasets, such as TABLE2LATEX[26], TableBank[5], and PubTabNet[6]. Additionally, they have organized related competitions, including ICDAR 2019[26] and ICDAR 2021[27],
significantly fostering the rapid development of TSR. TableMaster[28] directly pre-
dicts HTML and text box regression based on MASTER[29], achieving the best
results on the PubTabNet benchmark dataset. SLANet[13] uses PP-LCNet[30] and a
series of optimization strategies to make model inference efficient on the CPU. MTL-
TabNet[31] proposes an end-to-end TSR model that uses a multi-task learning method
to directly solve table structure recognition and table content recognition with one
model. OTSL[32] proposes a novel method for encoding tables, utilizing only five tokens to represent the table structure, thereby reducing the inference time of the image-to-sequence method by approximately half while enhancing model accuracy.
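As an illustration, the following sketch shows a table encoded with the five-token vocabulary described in the OTSL paper [32]; the exact serialization details may differ from the paper:

```python
# Sketch of the OTSL idea, based on the five-token vocabulary of [32]:
#   "C"  starts a new cell            "L"  merges with the cell to the left
#   "U"  merges with the cell above   "X"  merges left and above
#   "NL" ends the current table row
otsl_tokens = [
    "C", "L", "NL",   # row 1: a single cell spanning two columns
    "C", "C", "NL",   # row 2: two ordinary cells
]
# The equivalent HTML table structure would be:
#   <tr><td colspan="2"></td></tr>
#   <tr><td></td><td></td></tr>
```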
Graph-based methods. These methods treat table cells or cell contents as nodes
in a graph, employing graph neural networks (GNN) to predict whether these nodes belong to the same group. GraphTSR[4] takes table cells as input and uses a GNN to predict the relationships between cells, from which the table structure is recovered, achieving good results on the SciTSR[4] dataset. TGRNet[33] proposes an end-to-end table
graph reconstruction network to perform table structure recognition by simultaneously
predicting the physical and logical positions of table cells.
2.3 Optical Character Recognition
Table content recognition is also a crucial phase in the table recognition process.
Tables in digital PDFs can directly read text coordinates and content, while scanned
PDFs usually require an OCR model to extract text. OCR is currently divided into
two primary tasks: text detection and text recognition, each optimized independently.
Additionally, there are also end-to-end recognition models. Since OCR has a wide
range of applications, it has received widespread attention from researchers and indus-
tries, leading to the proposal of numerous models. Notably, the DB[34] detection
model and CRNN[35] recognition model stand out as widely adopted combinations.
Several readily available open-source toolkits (such as PaddleOCR[14], EasyOCR^5, TesseractOCR^6, MMOCR^7, and DuGuangOCR^8) and commercial APIs (Amazon Textract^9, Google Document AI^10, Baidu OCR^11) offer fundamental OCR capabilities.
The majority of these open-source toolkits provide the latest OCR algorithms and pre-
trained models, facilitating convenient direct use or fine-tuning. Due to the distinct
nature of document OCR, it is readily identifiable, and existing open-source toolkits
can effectively fulfill most requirements.
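For instance, obtaining text boxes and content from one of these toolkits typically takes only a few lines; the sketch below uses EasyOCR's documented interface (the image path is a placeholder):

```python
import easyocr

# Load detection + recognition models for simplified Chinese and English.
reader = easyocr.Reader(['ch_sim', 'en'])

# readtext returns a list of (bounding_box, text, confidence) triples; the
# boxes can later be matched against table cells during content extraction.
for box, text, confidence in reader.readtext('table_page.png'):
    print(box, text, confidence)
```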
2.4 PDF To HTML
The conversion of PDF to machine-readable HTML format holds significant impli-
cations. For instance, it can enhance accessibility for individuals who are blind or
visually impaired[36] and contribute to the improved retrieval and dissemination of
academic papers[37, 38]. While several off-the-shelf systems exist for direct PDF-to-
HTML conversion, they exhibit limitations[39–41]. Notably, [37, 38] lack table parsing functionality, converting PDF tables into images for display only. [42] relies on hand-designed rules for table extraction, demonstrating poor generalization. TableParser[41] is a model trained on a weakly supervised dataset constructed from spreadsheets and cannot parse wireless tables or deformed tables. Pdf2htmlEX^12 exclusively converts digital PDFs and fails to convert tables into HTML format.
^5 https://github.com/JaidedAI/EasyOCR
^6 https://github.com/tesseract-ocr/tesseract
^7 https://github.com/open-mmlab/mmocr
^8 https://github.com/AlibabaResearch/AdvancedLiterateMachinery
^9 https://aws.amazon.com/textract/
^10 https://cloud.google.com/document-ai
^11 https://ai.baidu.com/tech/ocr
^12 https://github.com/pdf2htmlEX/pdf2htmlEX
[Fig. 1  System overview of PdfTable: upper-layer applications (PDF table extraction, PDF-to-HTML, PDF-to-Word conversion, and table structure labeling) are built on three model layers: document layout analysis (PicoDet-LCNet, DocXLayout, LayoutParser), table structure recognition (SLANet, Cycle-CenterNet, LGPMA, LORE, TableMaster, MTL-TabNet, LineCell), and text detection and recognition (PaddleOCR, DuGuangOCR, EasyOCR, TesseractOCR).]

The end-to-end
model proposed by Nougat[43], based on a visual Transformer, excels in converting
academic papers into LaTeX format. However, its end-to-end nature necessitates data
collection for retraining when dealing with PDFs in different languages or structures,
imposing certain limitations on its versatility. Despite the effectiveness of these sys-
tems in specific scenarios, a unified PDF-to-HTML conversion tool supporting diverse
languages and document types remains elusive.
3 Design and implementation of PdfTable library
3.1 System Overview
The system overview of PdfTable is illustrated in Figure 1. The core of the entire
system is to provide table parsing algorithms, and it is mainly composed of four modules: the layout analysis module locates tables and images; the table structure recognition module parses table structures; the text detection and recognition module identifies textual content; and the application module handles the conversion of all recognition results into various output types.

[Fig. 2  Table processing pipeline: a preprocessed PDF or image page passes through layout analysis, which separates image, table, and text regions; table regions are routed to wired or wireless table structure extraction; text comes either from direct PDF extraction or from OCR detection and recognition; cell boxes and text are split and matched to produce table HTML, remaining text is merged into paragraphs, and results are exported to HTML, DOCX, or Excel.]

We have standardized the algorithm inter-
face for each module, allowing flexible switching based on the model name to facilitate
user utilization. Since different algorithms rely on different environments and frame-
works, we use the PyTorch[15] framework to reimplement some of the models, eliminating
unnecessary dependency packages.
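To illustrate the kind of name-driven model switching this standardization enables, consider the following usage sketch; the class and parameter names here are illustrative assumptions, not the toolkit's exact API:

```python
# Hypothetical usage sketch of a unified, name-driven interface (the actual
# PdfTable API may differ; see the project repository for details).
from pdftable import PdfTableExtractor  # assumed entry point

extractor = PdfTableExtractor(
    layout_model="picodet",   # or "docx_layout", "layout_parser"
    table_model="slanet",     # or "lore", "lgpma", "table_master", ...
    ocr_engine="paddleocr",   # or "easyocr", "tesseract", "duguang"
    lang="ch",                # document language
)

# Convert a digital or scanned PDF into HTML in one call.
extractor.convert("report.pdf", output_format="html", output_dir="./out")
```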
3.2 PdfTable Parse Pipeline
The framework of table recognition is illustrated in Figure 2. Firstly, the input
image or PDF document is preprocessed through the document preprocessing module, covering steps such as network file download, PDF-to-image conversion, and image orientation correction.

[Fig. 3  Document preprocessing: an input URL or file is downloaded if necessary; PDFs are split into pages, with digital PDFs converted to images and image-based PDFs having their images extracted directly; the images then undergo orientation correction.]

Subsequently, the layout analysis module divides the image into distinct regions
(e.g., pictures, tables, and text) to facilitate subsequent individual processing. The
image area is sent to the image extraction module. Each table area is classified as a wired or wireless table through rules, and its structure is then extracted by the corresponding TSR algorithm. For the text area, extraction
is performed based on the document type. In the case of digital PDFs, text is directly
extracted from the PDF, while OCR is employed for scanned PDFs or images to
identify the corresponding text. Next, the text in the table area is matched with the
table structure to generate table HTML, and other text is consolidated into paragraphs
through the paragraph merging module. Ultimately, the recognized pictures, tables,
and text paragraphs are output into distinct files according to the specified output
format requirements.
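The cell-and-text matching step can be illustrated with a simple overlap heuristic; this is a hedged sketch, and the toolkit's actual rules (e.g., for splitting text boxes that span several cells before assignment) are more involved:

```python
# Sketch: assign recognized text boxes to table cells by area overlap.
def overlap_ratio(cell, box):
    """Fraction of the text box's area that falls inside the cell box."""
    x0, y0 = max(cell[0], box[0]), max(cell[1], box[1])
    x1, y1 = min(cell[2], box[2]), min(cell[3], box[3])
    inter = max(0, x1 - x0) * max(0, y1 - y0)
    box_area = (box[2] - box[0]) * (box[3] - box[1])
    return inter / box_area if box_area else 0.0

def assign_text_to_cells(cells, text_boxes, threshold=0.5):
    """cells: list of (x0, y0, x1, y1); text_boxes: list of (box, text)."""
    assignment = {i: [] for i in range(len(cells))}
    for box, text in text_boxes:
        best = max(range(len(cells)),
                   key=lambda i: overlap_ratio(cells[i], box))
        if overlap_ratio(cells[best], box) >= threshold:
            assignment[best].append(text)
    return assignment
```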
3.3 Module Design
3.3.1 Input preprocessing
The input preprocessing module primarily preprocesses PDFs or images to facili-
tate extraction of subsequent algorithm models. The processing flow chart is shown in
Figure 3. Initially, a determination is made as to whether the input file needs to be downloaded. If the input file is a PDF, it is split into indi-
vidual pages and convert them into images. For digital PDFs, the Ghostscript^13 tool is used for this conversion, whereas for image-based PDFs, direct image extraction is
conducted. Since the current document processing algorithm mainly processes docu-
ments with a 0-degree orientation, the extraction outcome for rotated document
information is suboptimal. Consequently, the orientation of the input document must
be rectified before the subsequent processing stages. The image orientation correc-
tion module incorporates a document orientation classification algorithm^14 (output categories: 0, 90, 180, 270) and a text orientation classification algorithm (output categories: 0, 180) to execute rotation correction based on the document orientation (rotation directions: 0, 90, 180, 270). Concurrently, rules are applied for small-angle rotation correction (rotation angles from -45 to 45 degrees) on documents tilted slightly, ultimately aligning the image to roughly 0 degrees. The pre-processing
module has significantly enhanced our recognition efficacy on irregular documents.
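A minimal sketch of the rule-based small-angle deskew step is shown below, using standard OpenCV operations; this is one common approach, not necessarily the exact procedure used in PdfTable:

```python
import cv2
import numpy as np

def deskew_small_angle(image: np.ndarray) -> np.ndarray:
    """Estimate a small rotation angle from the text foreground and undo it."""
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    # Otsu threshold with inversion so text/border pixels become foreground.
    thresh = cv2.threshold(gray, 0, 255,
                           cv2.THRESH_BINARY_INV | cv2.THRESH_OTSU)[1]
    coords = np.column_stack(np.where(thresh > 0)).astype(np.float32)
    angle = cv2.minAreaRect(coords)[-1]
    # Angle conventions differ across OpenCV versions; map into (-45, 45].
    if angle > 45:
        angle -= 90
    h, w = image.shape[:2]
    matrix = cv2.getRotationMatrix2D((w / 2, h / 2), angle, 1.0)
    return cv2.warpAffine(image, matrix, (w, h),
                          flags=cv2.INTER_CUBIC,
                          borderMode=cv2.BORDER_REPLICATE)
```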
3.3.2 Layout analysis module
The task of layout analysis is to divide the areas in the document image accord-
ing to categories (e.g., text, images, tables, formulas). Currently, mainstream object detection-based models demonstrate commendable performance across various benchmark datasets. In PdfTable, we have incorporated two lightweight layout analysis models, PP-PicoDet[24] and DocXLayout^15, as well as the LayoutParser[22] toolkit. PP-PicoDet is a lightweight object detection model based on PaddleDetection^16, and PP-Structure[13] extends it into Chinese and English layout analysis models and table detection models; we convert these into PyTorch models. DocXLayout is a layout analysis model based on the DLA-34[44] backbone network, provided by Alibaba Research. The LayoutParser[22] toolkit integrates a variety of layout analysis models trained on different datasets. To enhance usability, we provide a standardized interface for invoking the different models.
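As an example of the third option, the following sketch uses LayoutParser's documented model-zoo interface to detect table regions; the model path and score threshold shown are one common choice, not the only one:

```python
import cv2
import layoutparser as lp

# Load a PubLayNet-trained Faster R-CNN from the LayoutParser model zoo.
model = lp.Detectron2LayoutModel(
    "lp://PubLayNet/faster_rcnn_R_50_FPN_3x/config",
    extra_config=["MODEL.ROI_HEADS.SCORE_THRESH_TEST", 0.8],
    label_map={0: "Text", 1: "Title", 2: "List", 3: "Table", 4: "Figure"},
)

image = cv2.imread("page.png")[..., ::-1]  # OpenCV loads BGR; model expects RGB
layout = model.detect(image)

# Keep only detected table regions for the downstream TSR stage.
table_blocks = [block for block in layout if block.type == "Table"]
```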
3.3.3 Table Structure Recognition
Table borders are usually used for visual display of table structures and can also
be used as an important basis for identifying table structures. The current main-
stream method divides tables into two categories (wired tables and wireless tables),
and designs TSR algorithms for processing different types of tables. In the TSR pro-
cessing flow of PdfTable, an initial rule-based method is employed to categorize tables
as either wired or wireless, followed by the application of specific algorithms to iden-
tify each table type. PdfTable currently integrates seven TSR algorithms, offering
flexibility in configuration and utilization.
^13 https://www.ghostscript.com/
^14 https://www.ghostscript.com/
^15 https://github.com/AlibabaResearch/AdvancedLiterateMachinery/tree/main/DocumentUnderstanding/DocXLayout
^16 https://github.com/PaddlePaddle/PaddleDetection
^17 https://github.com/camelot-dev/camelot

Wired tables. Since wired tables have the obvious feature of borders, algorithms can directly identify the borders of a table and then restore the table structure through post-processing. Traditional methods effectively handle most straightforward table scenes. Referring to Camelot^17 and Multi-Type-TD-TSR[45], we implement the LineCell algorithm for extracting table cells based on OpenCV[46].
Firstly, we extract horizontal and vertical line segments, table areas, and line-segment intersections through a series of operations such as binarization, erosion, dilation, and contour search. Then we use the line segments and their intersections to construct table cells. Finally, we use line-segment relationships to merge cells that span rows and columns. Despite their efficacy in simple scenes, traditional methods
exhibit limited generalization due to their dependence on manually set rules. In con-
trast, contemporary approaches leverage deep learning techniques to identify table
edges or cells. Therefore, PdfTable also integrates two latest TSR algorithms, Cycle-
CenterNet[8] and LORE[12]. They adopt different methods to simultaneously predict
the logical structure and physical structure of table cells, and then restore the struc-
ture of the table through simple post-processing operations, which can identify wired
tables in real-world scenarios.
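A condensed sketch of the morphology-based line extraction underlying this style of algorithm is shown below; it is simplified relative to the full LineCell implementation, and the kernel sizes are illustrative:

```python
import cv2
import numpy as np

def extract_table_lines(image: np.ndarray):
    """Extract horizontal/vertical border masks and their intersections."""
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    # Adaptive inverse binarization: table borders become white foreground.
    binary = cv2.adaptiveThreshold(gray, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
                                   cv2.THRESH_BINARY_INV, 15, -2)
    # Erode then dilate with long thin kernels so only straight lines survive.
    h_kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (40, 1))
    v_kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (1, 40))
    horizontal = cv2.dilate(cv2.erode(binary, h_kernel), h_kernel)
    vertical = cv2.dilate(cv2.erode(binary, v_kernel), v_kernel)
    # Line crossings seed the cell grid; contours of the union give table areas.
    intersections = cv2.bitwise_and(horizontal, vertical)
    return horizontal, vertical, intersections
```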
Wireless tables. Wireless tables, distinguished by the absence of table borders, present a more challenging identification task compared to wired tables. Presently,
various methods employ image-to-sequence generation techniques, directly generat-
ing tags and text borders to represent the table structure. Subsequently, the table
is reconstructed by aligning table cells with their respective content positions. We
implemented four such algorithms: SLANet[13], LGPMA[10], TableMaster[28], and
MTL-TabNet[31]. The LORE[12] algorithm, while initially designed for recognizing
wired tables, exhibits the capability to recognize wireless tables as well. This is achieved
by predicting both the logical structure of the table and the physical borders of the
table text.
3.3.4 Text Extraction
In order to completely restore the table, it is also necessary to extract the content
in the table cells (mainly text). The process of extracting table content in PdfTable
is illustrated in the right part of Figure 2, which is mainly divided into PDF text
extraction and OCR text extraction. For digital PDF sources, the existing toolkit
pdfminer.six
18
is utilized to directly extract text coordinates and content; other-
wise, an OCR toolkit is employed. To accommodate multiple languages and diverse
business scenarios, PdfTable integrates several mainstream OCR toolkits, including
PaddleOCR[14], EasyOCR
19
, TesseractOCR
20
and duguangOCR
21
. Combined with
the table structure extracted previously, we split the text boxes across cells, and then
match them with the table structure to generate the final table HTML. Text outside
the table is merged into paragraphs to facilitate subsequent processing.
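For digital PDFs, extracting text together with coordinates via pdfminer.six looks roughly as follows; this is a minimal sketch using pdfminer.six's high-level API, and PdfTable's internal usage may differ:

```python
from pdfminer.high_level import extract_pages
from pdfminer.layout import LTTextContainer

def extract_pdf_text(path: str):
    """Yield (page_number, bbox, text) for every text block in a digital PDF."""
    for page_no, page_layout in enumerate(extract_pages(path), start=1):
        for element in page_layout:
            if isinstance(element, LTTextContainer):
                # bbox is (x0, y0, x1, y1) in PDF points, origin at bottom-left.
                yield page_no, element.bbox, element.get_text().strip()

for page, bbox, text in extract_pdf_text("report.pdf"):
    print(page, bbox, text)
```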
3.3.5 Application
To address diverse application scenarios, we summarize the extracted table struc-
ture, text content and images and uniformly represent them into a PdfCell structure
with coordinate positions and content. This approach facilitates the generation of diverse output formats.

^18 https://github.com/pdfminer/pdfminer.six
^19 https://github.com/JaidedAI/EasyOCR
^20 https://github.com/tesseract-ocr/tesseract
^21 https://github.com/AlibabaResearch/AdvancedLiterateMachinery

Currently, the applications implemented in PdfTable include:
PDF to HTML, PDF to DOCX, and table to Excel. In the future, we will implement additional applications.
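The unified intermediate structure might look like the following sketch; the field names are illustrative assumptions rather than PdfTable's exact definition:

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class PdfCell:
    """Hypothetical sketch of a unified result with position and content.

    Field names are illustrative; the actual PdfTable structure may differ.
    """
    bbox: tuple                     # (x0, y0, x1, y1) position on the page
    kind: str                       # "text", "table_cell", or "image"
    content: str = ""               # recognized text, empty for images
    row: Optional[int] = None       # logical location for table cells
    col: Optional[int] = None
    row_span: int = 1
    col_span: int = 1
    children: List["PdfCell"] = field(default_factory=list)
```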
4 Experiments
The primary objective of PdfTable is to streamline the extraction of tables from
diverse PDF formats. The extraction of tables from PDFs poses challenges in practical
applications due to variations in PDF types (digital or image-based), table categories
(wired or wireless), and the presence of text in multiple languages (English, Chi-
nese or other languages). A singular model proves insufficient for accommodating all
business scenarios. PdfTable overcomes this limitation by integrating multiple mod-
els and allowing the selection of appropriate models for combined extraction based on
distinct business types. Given that the extraction effectiveness of PdfTable depends
on the chosen models and specific application contexts, a comprehensive evaluation
is challenging. Initial assessments were conducted on a common application scenario
involving Chinese wired tables to validate the effectiveness of the PdfTable toolkit.
Additionally, for wireless table recognition, an evaluation on the PubTabNet[6] dataset
was undertaken to verify the correctness of the integrated TSR algorithm.
Table 1  Statistics of the datasets that we use in experiments.

Test Sets       | Pages | Tables
Digital PDF     | 2,589 | 3,709
Image-based PDF | 2,192 | 2,956
PubTabNet       | 9,115 | 9,115
4.1 Datasets and evaluation metrics
To assess PdfTable's capability in extracting wired tables, we curated a dataset of Chinese financial document tables, encompassing both digital and image-based PDFs. The dataset comprises 4,781 pages and 6,665 tables. For
the evaluation of wireless table extraction, we utilized the validation set from the
extensively employed PubTabNet[6] dataset. The details of the data set are shown in
Table 1.
We employ metrics such as Accuracy, Precision, Recall, F1-score, and TEDS-Struct[12, 13, 47] for evaluating table structure recognition. A table is deemed correctly
recognized in Precision calculation when all its cells are accurately identified. TEDS-
Struct is a modified variant of the tree edit distance-based similarity (TEDS)[6] metric,
which disregards the text content within table cells and exclusively evaluates the table
structure.
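For reference, the underlying TEDS metric from [6] is defined over the tree representations of two tables, and TEDS-Struct applies the same formula after discarding cell content:

$$\mathrm{TEDS}(T_a, T_b) = 1 - \frac{\mathrm{EditDist}(T_a, T_b)}{\max(|T_a|, |T_b|)}$$

where $T_a$ and $T_b$ are the tree representations of the two tables, EditDist denotes the tree edit distance, and $|T|$ is the number of nodes in $T$.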
Table 2  Performance on the financial reporting dataset.

Dataset         | Method    | Precision (%) | Recall (%) | F1 (%) | TEDS-Struct (%)
Digital PDF     | LineCell  | 98.5          | 98.2       | 98.4   | 99.5
Digital PDF     | LORE[12]  | 90.5          | 87.7       | 89.1   | 97.2
Digital PDF     | LORE*[12] | 95.2          | 93.2       | 94.2   | 98.4
Image-based PDF | LineCell  | 83.9          | 84.7       | 84.2   | 94.7
Image-based PDF | LORE[12]  | 80.5          | 77.1       | 78.9   | 92.8
Image-based PDF | LORE*[12] | 86.3          | 83.4       | 84.8   | 95.3

Best results are in bold. "*" indicates that the model first uses the layout analysis model to identify the table area and then recognizes the table structure within that area; without "*", the model recognizes the table structure of the entire PDF image directly.
4.2 Experimental results
Wired table results. Table 2 presents the evaluation outcomes for the Chinese financial document table dataset. We choose the LineCell model implemented in this
paper and the latest LORE[12] model for comparison. Additionally, we investigated
the impact of employing the layout analysis model for table area identification in the
LORE[12] model. Analyzing the experimental results, we can find:
Table 3  Comparison with state-of-the-art methods on the PubTabNet dataset.

Method           | Acc (%) | TEDS (%) | TEDS-Struct (%) | Inference time (ms) | Model size (M)
TableMaster[28]  | 77.90   | 96.12    | -               | 2144                | 253
TableMaster*[28] | 78.60   | -        | 97.56           | 2764                | 260
LGPMA[10]        | 65.74   | 94.70    | 96.70           | -                   | 177
LGPMA*[10]       | 65.30   | -        | 96.68           | 345                 | 177
SLANet[13]       | 76.31   | 95.89    | 97.01           | 766                 | 9.2
SLANet*[13]      | 76.03   | -        | 97.33           | 798                 | 9.2
MTL-TabNet[31]   | -       | 96.67    | 97.88           | -                   | 289
MTL-TabNet*[31]  | 79.10   | -        | 98.48           | 4520                | 289

"*" denotes the results of our assessment. Model size refers to the actual physical size of the model. The inference times of some models are quoted from the SLANet[13] paper.
The evaluation metrics exhibit superior performance on digital PDFs compared
to image-based PDFs, with the F1-score demonstrating an 11.2% increase on digital
PDFs. This suggests that table extraction is more challenging in image-based PDFs,
highlighting substantial room for improvement. The LineCell model outperforms the
LORE[12] model by 4.2% in F1-score on digital PDFs, while registering a 0.6% lower
F1-score on image-based PDFs. This indicates that the traditional LineCell model still
has certain advantages in identifying wired tables within PDFs. Notably, the LineCell
model achieves an F1-score of 98.4% on digital PDFs and 84.2% on image-based PDFs,
showcasing its effective identification of wired tables in PDF documents.
By comparing the LORE[12] results with and without the prior layout analysis step, it is evident that employing the layout analysis model before table
structure recognition enhances the F1 score by 5.5% and the TEDS-Struct score by
1.85%. This underscores the effectiveness of incorporating the layout analysis model for
predictive processing, resulting in a notable improvement in table recognition accuracy.
Wireless table results. The PdfTable toolkit incorporates diverse models for wireless table structure recognition. To assess the accuracy of the algorithm integration, we conducted evaluations on the PubTabNet[6] dataset. The experimental results are shown in Table 3, from which the following can be observed:
Comparing the Acc and TEDS-Struct metrics of the four models, the maximum difference between our evaluation results and the original papers' results is 0.7%, falling within an acceptable margin of error. This preliminarily validates the accuracy of the algorithm integration.
From the perspective of inference speed and the Acc metric, SLANet[13] exhibits distinct advantages compared to the other models, achieving an Acc of 76% with an average inference time of 798 ms. Although TableMaster[28] and MTL-TabNet[31] attain higher Acc, their average inference times are considerably longer. Notably, the MTL-TabNet[31] model achieves the best results, but its average inference time is as high as 4520 ms.
The LORE[12] model can also support wireless table recognition, and the TEDS
metric reaches 98.1% on the PubTabNet[6] dataset. However, there are currently prob-
lems with the integration in PdfTable, and the experimental results have not been
entirely replicated. Future optimizations are planned.
4.3 Qualitative Assessment
[Fig. 4  Qualitative results of LineCell and LORE on digital PDFs. Panels (a)/(d): original table images; (b)/(e): LineCell prediction results; (c)/(f): LORE prediction results. The red borders represent the identified cells.]
The qualitative results in Figure 4 show that, in some cases, the LORE[12] model fails to accurately identify some cells in the table, whereas LineCell accurately identifies all cells.
5 Conclusion and future work
In this paper, we introduce a novel end-to-end PDF table extraction toolkit,
PdfTable, designed for seamless table extraction from both digital and image-based
PDFs. The toolkit integrates various existing models, including those for layout analy-
sis, table structure recognition, OCR detection, and OCR recognition. This integration
allows for flexible combinations to adapt to diverse application scenarios. To validate
the efficacy of the PdfTable toolkit, we annotated a small dataset of wired tables.
Concurrently, we evaluated the wireless table recognition model on the PubTabNet[6]
dataset, confirming the accuracy of the algorithm integration. In the future, we will
optimize this toolkit from the following aspects: 1. Developing new algorithms to
differentiate wired tables from wireless tables; 2. Incorporating the ability to fine-
tune integrated models, such as table recognition models; 3. Enhancing the toolkit’s
capacity to recognize wired tables in image-based PDFs.
6 Declarations
Conflict of interest. The authors have no conflicts of interest to declare that are relevant to the content of this article.
Ethics approval. This article has not been submitted to more than one journal for simultaneous consideration. This article is original.
Data availability. The datasets analysed during the current study are available at https://github.com/ibm-aur-nlp/PubTabNet.
Code availability. Code and data used in this paper are publicly available at https://github.com/CycloneBoy/pdftable.
References
[1] Schreiber, S., Agne, S., Wolf, I., Dengel, A., Ahmed, S.: Deepdesrt: Deep learn-
ing for detection and structure recognition of tables in document images. In:
2017 14th IAPR International Conference on Document Analysis and Recogni-
tion (ICDAR), vol. 01, pp. 1162–1167 (2017). https://doi.org/10.1109/ICDAR.2017.192
[2] Paliwal, S., D, V., Rahul, R., Sharma, M., Vig, L.: TableNet: Deep Learning
model for end-to-end Table detection and Tabular data extraction from Scanned
Document Images. arXiv. https://doi.org/10.48550/arXiv.2001.01469. http://arxiv.org/abs/2001.01469. Accessed 2023-05-23
[3] Zhang, Z., Zhang, J., Du, J.: Split, embed and merge: An accurate table structure
recognizer. arXiv. http://arxiv.org/abs/2107.05214. Accessed 2023-07-28
[4] Chi, Z., Huang, H., Xu, H.-D., Yu, H., Yin, W., Mao, X.-L.: Complicated table
structure recognition. arXiv preprint arXiv:1908.04729 (2019)
[5] Li, M., Cui, L., Huang, S., Wei, F., Zhou, M., Li, Z.: Tablebank: Table benchmark
for image-based table detection and recognition. In: Proceedings of the Twelfth
Language Resources and Evaluation Conference, pp. 1918–1925 (2020)
[6] Zhong, X., ShafieiBavani, E., Jimeno Yepes, A.: Image-based table recognition:
Data, model, and evaluation. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M.
(eds.) Computer Vision – ECCV 2020, pp. 564–580. Springer, Cham (2020)
[7] Smock, B., Pesala, R., Abraham, R.: Pubtables-1m: Towards comprehensive
table extraction from unstructured documents. 2022 IEEE/CVF Conference on
Computer Vision and Pattern Recognition (CVPR), 4624–4632 (2021)
[8] Long, R., Wang, W., Xue, N., Gao, F., Yang, Z., Wang, Y., Xia, G.-S.:
Parsing table structures in the wild. In: 2021 IEEE/CVF International Confer-
ence on Computer Vision (ICCV), pp. 924–932 (2021). https://doi.org/10.1109/
ICCV48922.2021.00098
[9] Prasad, D., Gadpal, A., Kapadni, K., Visave, M., Sultanpure, K.: CascadeTabNet:
An approach for end to end table detection and structure recognition from image-
based documents. arXiv. https://doi.org/10.48550/arXiv.2004.12629. http://arxiv.org/abs/2004.12629. Accessed 2023-08-21
[10] Qiao, L., Li, Z., Cheng, Z., Zhang, P., Pu, S., Niu, Y., Ren, W., Tan, W., Wu, F.:
Lgpma: Complicated table structure recognition with local and global pyramid
mask alignment. In: Lladós, J., Lopresti, D., Uchida, S. (eds.) Document Analysis
and Recognition – ICDAR 2021, pp. 99–114. Springer, Cham (2021)
[11] Ma, C., Lin, W., Sun, L., Huo, Q.: Robust table detection and structure recog-
nition from heterogeneous document images. Pattern Recognition 133, 109006
(2023) https://doi.org/10.1016/j.patcog.2022.109006
[12] Xing, H., Gao, F., Long, R., Bu, J., Zheng, Q., Li, L., Yao, C., Yu, Z.: LORE:
logical location regression network for table structure recognition. In: Williams,
B., Chen, Y., Neville, J. (eds.) Thirty-Seventh AAAI Conference on Artificial
Intelligence, AAAI 2023, Thirty-Fifth Conference on Innovative Applications
of Artificial Intelligence, IAAI 2023, Thirteenth Symposium on Educational
Advances in Artificial Intelligence, EAAI 2023, Washington, DC, USA, Febru-
ary 7-14, 2023, pp. 2992–3000. AAAI Press (2023). https://doi.org/10.1609/aaai.v37i3.25402
[13] Li, C., Guo, R., Zhou, J., An, M., Du, Y., Zhu, L., Liu, Y., Hu, X.,
Yu, D.: Pp-structurev2: A stronger document analysis system. arXiv preprint
arXiv:2210.05391 (2022)
[14] Du, Y., Li, C., Guo, R., Yin, X., Liu, W., Zhou, J., Bai, Y., Yu, Z., Yang, Y.,
Dang, Q., et al.: Pp-ocr: A practical ultra lightweight ocr system. arXiv preprint
arXiv:2009.09941 (2020)
[15] Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen,
T., Lin, Z., Gimelshein, N., Antiga, L., et al.: Pytorch: An imperative style, high-
performance deep learning library. Advances in neural information processing
systems 32 (2019)
[16] Wei, S., Xu, N.: PARAGRAPH2GRAPH: A GNN-based framework for layout
paragraph analysis. arXiv. http://arxiv.org/abs/2304.11810. Accessed 2023-08-17
[17] Girshick, R.: Fast r-cnn. In: Proceedings of the IEEE International Conference
on Computer Vision, pp. 1440–1448 (2015)
[18] Zhong, X., Tang, J., Jimeno Yepes, A.: Publaynet: Largest dataset ever for
document layout analysis. In: 2019 International Conference on Document Anal-
ysis and Recognition (ICDAR), pp. 1015–1022 (2019). https://doi.org/10.1109/
ICDAR.2019.00166
[19] He, K., Gkioxari, G., Dollár, P., Girshick, R.: Mask r-cnn. In: Proceedings of the
IEEE International Conference on Computer Vision, pp. 2961–2969 (2017)
[20] Redmon, J., Divvala, S., Girshick, R., Farhadi, A.: You only look once: Unified,
real-time object detection. In: Proceedings of the IEEE Conference on Computer
Vision and Pattern Recognition, pp. 779–788 (2016)
[21] Carion, N., Massa, F., Synnaeve, G., Usunier, N., Kirillov, A., Zagoruyko, S.: End-
to-end object detection with transformers. In: European Conference on Computer
Vision, pp. 213–229 (2020). Springer
[22] Shen, Z., Zhang, R., Dell, M., Lee, B.C.G., Carlson, J., Li, W.: Layoutparser:
A unified toolkit for deep learning based document image analysis. In: Lladós,
J., Lopresti, D., Uchida, S. (eds.) Document Analysis and Recognition – ICDAR
2021, pp. 131–146. Springer, Cham (2021)
[23] PaddlePaddle Authors: PaddleDetection, object detection and instance segmentation toolkit
based on PaddlePaddle. https://github.com/PaddlePaddle/PaddleDetection
(2019)
[24] Yu, G., Chang, Q., Lv, W., Xu, C., Cui, C., Ji, W., Dang, Q., Deng, K., Wang, G.,
Du, Y., et al.: Pp-picodet: A better real-time object detector on mobile devices.
arXiv preprint arXiv:2111.00902 (2021)
[25] Siddiqui, S.A., Fateh, I.A., Rizvi, S.T.R., Dengel, A., Ahmed, S.: Deeptabstr:
Deep learning based table structure recognition. In: 2019 International Conference
on Document Analysis and Recognition (ICDAR), pp. 1403–1409 (2019). https://doi.org/10.1109/ICDAR.2019.00226
[26] Deng, Y., Rosenberg, D., Mann, G.: Challenges in end-to-end neural scien-
tific table recognition. In: 2019 International Conference on Document Analysis
and Recognition (ICDAR), pp. 894–901 (2019). https://doi.org/10.1109/ICDAR.
2019.00148
[27] ICDAR 2021 Competition on Scientific Literature Parsing. https://arxiv.org/abs/2106.14616. Accessed 2023-08-27
[28] Ye, J., Qi, X., He, Y., Chen, Y., Gu, D., Gao, P., Xiao, R.: PingAn-VCGroup’s
Solution for ICDAR 2021 Competition on Scientific Literature Parsing Task B:
Table Recognition to HTML (2021)
[29] Lu, N., Yu, W., Qi, X., Chen, Y., Gong, P., Xiao, R., Bai, X.: Master: Multi-
aspect non-local network for scene text recognition. Pattern Recognition 117, 107980 (2021). https://doi.org/10.1016/j.patcog.2021.107980
[30] Cui, C., Gao, T., Wei, S., Du, Y., Guo, R., Dong, S., Lu, B., Zhou, Y., Lv, X.,
Liu, Q., Hu, X., Yu, D., Ma, Y.: PP-LCNet: A Lightweight CPU Convolutional
Neural Network. https://arxiv.org/abs/2109.15099v1. Accessed 2023-09-27
[31] Ly, N.T., Takasu, A.: An end-to-end multi-task learning model for image-based
table recognition, pp. 626–634 (2023). https://doi.org/10.5220/0011685000003417
[32] Lysak, M., Nassar, A., Livathinos, N., Auer, C., Staar, P.: Optimized table tok-
enization for table structure recognition. In: Fink, G.A., Jain, R., Kise, K.,
Zanibbi, R. (eds.) Document Analysis and Recognition - ICDAR 2023, pp. 37–50.
Springer, Cham (2023)
[33] Xue, W., Yu, B., Wang, W., Tao, D., Li, Q.: Tgrnet: A table graph reconstruction
network for table structure recognition. In: 2021 IEEE/CVF International Con-
ference on Computer Vision (ICCV), pp. 1275–1284. IEEE Computer Society,
Los Alamitos, CA, USA (2021). https://doi.org/10.1109/ICCV48922.2021.00133
. https://doi.ieeecomputersociety.org/10.1109/ICCV48922.2021.00133
[34] Liao, M., Wan, Z., Yao, C., Chen, K., Bai, X.: Real-time scene text detection with
differentiable binarization. In: Proceedings of the AAAI Conference on Artificial
Intelligence, vol. 34, pp. 11474–11481 (2020)
[35] Shi, B., Bai, X., Yao, C.: An end-to-end trainable neural network for image-
based sequence recognition and its application to scene text recognition. IEEE
transactions on pattern analysis and machine intelligence 39(11), 2298–2304
(2016)
[36] Fayyaz, N., Khusro, S., Imranuddin: Enhancing accessibility for the blind and
visually impaired: Presenting semantic information in PDF tables. 35(7), 101617 (2023). https://doi.org/10.1016/j.jksuci.2023.101617. Accessed 2023-07-18
[37] Wang, L.L., Cachola, I., Bragg, J., Cheng, E.Y.-Y., Haupt, C., Latzke, M., Kuehl,
B., Zuylen, M.N., Wagner, L., Weld, D.: SciA11y: Converting scientific papers to
accessible HTML. In: Proceedings of the 23rd International ACM SIGACCESS
Conference on Computers and Accessibility. ASSETS ’21, pp. 1–4. Association
for Computing Machinery. https://doi.org/10.1145/3441852.3476545. https://dl.acm.org/doi/10.1145/3441852.3476545. Accessed 2023-05-10
[38] Ahuja, A.: Analyzing and navigating electronic theses and dissertations.
Virginia Tech. Accessed 2023-07-30
[39] Shigarov, A., Altaev, A., Mikhailov, A., Paramonov, V., Cherkashin, E.: Tab-
byPDF: Web-based system for PDF table extraction. In: Damaševičius, R., Vasiljevienė, G. (eds.) Information and Software Technologies. Communications in Computer and Information Science, pp. 257–269. Springer. https://doi.org/10.1007/978-3-319-99972-2_20
[40] PR, N., Krishnamoorthy, H., Srivatsan, K., Goyal, A., Santhiappan, S.: DEXTER:
An end-to-end system to extract table contents from electronic medical health
documents. arXiv. http://arxiv.org/abs/2207.06823. Accessed 2023-09-18
[41] Rao, S.X., Rausch, J., Egger, P., Zhang, C.: TableParser: Automatic Table Parsing
with Weak Supervision from Spreadsheets. arXiv. https://doi.org/10.48550/arXiv.2201.01654. http://arxiv.org/abs/2201.01654. Accessed 2023-08-30
[42] Namysl, M., Esser, A.M., Behnke, S., Köhler, J.: Flexible Table Recognition
and Semantic Interpretation System. arXiv. http://arxiv.org/abs/2105.11879
Accessed 2023-08-03
[43] Blecher, L., Cucurull, G., Scialom, T., Stojnic, R.: Nougat: Neural Optical Under-
standing for Academic Documents. arXiv. https://doi.org/10.48550/arXiv.2308.13418. http://arxiv.org/abs/2308.13418. Accessed 2023-08-31
[44] Yu, F., Wang, D., Shelhamer, E., Darrell, T.: Deep layer aggregation. In: Pro-
ceedings of the IEEE Conference on Computer Vision and Pattern Recognition,
pp. 2403–2412 (2018)
[45] Fischer, P., Smajic, A., Abrami, G., Mehler, A.: Multi-type-td-tsr – extracting
tables from document images using a multi-stage pipeline for table detection
and table structure recognition: From ocr to structured table representations.
In: Edelkamp, S., M ̈oller, R., Rueckert, E. (eds.) KI 2021: Advances in Artificial
Intelligence, pp. 95–108. Springer, Cham (2021)
[46] Bradski, G.: The opencv library. Dr. Dobb’s Journal: Software Tools for the
Professional Programmer 25(11), 120–123 (2000)
[47] Raja, S., Mondal, A., Jawahar, C.V.: Table Structure Recognition using Top-
Down and Bottom-Up Cues. arXiv. https://doi.org/10.48550/arXiv.2010.04565. http://arxiv.org/abs/2010.04565. Accessed 2023-09-12