Preprint of the paper:
Meuschke, N., Jagdale, A., Spinde, T., Mitrović, J. & Gipp, B., "A Benchmark of PDF Information Extraction Tools Using a Multi-task and Multi-domain Evaluation Framework for Academic Documents", in Information for a Better World: Normality, Virtuality, Physicality, Inclusivity, LNCS, vol. 13972, Cham: Springer Nature Switzerland, 2023, pp. 383–405, DOI: 10.1007/978-3-031-28032-0_31.
A Benchmark of PDF Information Extraction Tools using a Multi-Task and Multi-Domain Evaluation Framework for Academic Documents

Norman Meuschke¹, Apurva Jagdale², Timo Spinde¹, Jelena Mitrović²,³, and Bela Gipp¹

¹ University of Göttingen, 37073 Göttingen, Germany
  {meuschke, spinde, gipp}@uni-goettingen.de
² University of Passau, 94032 Passau, Germany
  {apurva.jagdale, jelena.mitrovic}@uni-passau.de
³ The Institute for Artificial Intelligence R&D of Serbia, 21000 Novi Sad, Serbia
Abstract. Extracting information from academic PDF documents is crucial for numerous indexing, retrieval, and analysis use cases. Choosing the best tool to extract specific content elements is difficult because many, technically diverse tools are available, but recent performance benchmarks are rare. Moreover, such benchmarks typically cover only a few content elements like header metadata or bibliographic references and use smaller datasets from specific academic disciplines. We provide a large and diverse evaluation framework that supports more extraction tasks than most related datasets. Our framework builds upon DocBank, a multi-domain dataset of 1.5M annotated content elements extracted from 500K pages of research papers on arXiv. Using the new framework, we benchmark ten freely available tools in extracting document metadata, bibliographic references, tables, and other content elements from academic PDF documents. GROBID achieves the best metadata and reference extraction results, followed by CERMINE and Science Parse. For table extraction, Adobe Extract outperforms other tools, even though the performance is much lower than for other content elements. All tools struggle to extract lists, footers, and equations. We conclude that more research on improving and combining tools is necessary to achieve satisfactory extraction quality for most content elements. Evaluation datasets and frameworks like the one we present support this line of research. We make our data and code publicly available to contribute toward this goal.

Keywords: PDF · Information Extraction · Benchmark · Evaluation.
1 Introduction
The Portable Document Format (PDF) is the most prevalent encoding for aca-
demic documents. Extracting information from academic PDF documents is cru-
cial for numerous indexing, retrieval, and analysis tasks. Document search, rec-
ommendation, summarization, classification, knowledge base construction, ques-
tion answering, and bibliometric analysis are just a few examples [31].
However, the format’s technical design makes information extraction chal-
lenging. Adobe designed PDF as a platform-independent, fixed-layout format
by extending the PostScript [24] page description language. PDF focuses on
encoding a document’s visual layout to ensure a consistent appearance of the
document across software and hardware platforms but includes little structural
and semantic information on document elements.
Numerous tools for information extraction (IE) from PDF documents have
been presented since the format’s inception in 1993. The development of such
tools has been subject to a fast-paced technological evolution of extraction ap-
proaches from rule-based algorithms, over statistical machine learning (ML) to
deep learning (DL) models (cf. Section 2). Finding the best tool to extract spe-
cific content elements from PDF documents is currently difficult because:
1. Typically, tools only support extracting a subset of the content elements
in academic documents, e.g., title, authors, paragraphs, in-text citations,
captions, tables, figures, equations, or references.
2. Many information extraction tools, e.g., 12 of 35 tools we considered for our
study, are no longer maintained or have become obsolete.
3. Prior evaluations of information extraction tools often consider only specific
content elements or use domain-specific corpora, which makes their results
difficult to compare. Moreover, the most recent comprehensive benchmarks
of information extraction tools were published in 2015 for metadata⁴ [55], 2017 for body text [6], and 2018 for references⁵ [54], respectively. These evaluations do not reflect the latest technological advances in the field.
To alleviate this knowledge gap and facilitate finding the best tool to extract
specific elements from academic PDF documents, we comprehensively evaluate ten state-of-the-art non-commercial tools on eleven content elements, using a dataset of 500K pages from arXiv documents covering multiple fields.
Our code, data, and resources are publicly available at
http://pdf-benchmark.gipplab.org
2 Related Work
This section presents approaches for information extraction from PDF (Sec-
tion 2.1), labeled datasets suitable for training and evaluating PDF information
extraction approaches, and prior evaluations of IE tools (Section 2.2).
2.1 Information Extraction from PDF Documents
Table 1 summarizes publications on PDF information extraction since 1999. For
each publication, the table shows the primary technological approach and the
⁴ For example author(s), title, affiliation(s), address(es), email(s)
⁵ Refers to extracting the components of bibliographic references, e.g., author(s), title, venue, editor(s), volume, issue, page range, year of publication, etc.
Table 1: Publications on information extraction from PDF documents.

Publication¹     Year  Task²       Method             Training Dataset³
Palermo [44]     1999  M, ToC      Rules              100 documents
Klink [27]       2000  M           Rules              979 pages
Giuffrida [18]   2000  M           Rules              1,000 documents
Aiello [2]       2002  RO, Title   Rules              1,000 pages
Mao [37]         2004  M           OCR, Rules         309 documents
Peng [45]        2004  M, R        CRF                CORA (500 refs.)
Day [14]         2007  M, R        Template           160,000 citations
Hetzner [23]     2008  R           HMM                CORA (500 refs.)
Councill [12]    2008  R           CRF                CORA (200 refs.), CiteSeer (200 refs.)
Lopez [36]       2009  B, M, R     CRF, DL            None
Cui [13]         2010  M           HMM                400 documents
Ojokoh [42]      2010  M           HMM                CORA (500 refs.), FLUX-CiM (300 refs.), ManCreat
Kern [25]        2012  M           HMM                E-prints, Mendeley, PubMed (19K entries)
Bast [5]         2013  B, M, R     Rules              DBLP (690 docs.), PubMed (500 docs.)
Souza [53]       2014  M           CRF                100 documents
Anzaroot [3]     2014  R           CRF                UMASS (1,800 refs.)
Vilnis [57]      2015  R           CRF                UMASS (1,800 refs.)
Tkaczyk [55]     2015  B, M, R     CRF, Rules, SVM    CiteSeer (4,000 refs.), CORA (500 refs.), GROTOAP, PMC (53K docs.)
Bhardwaj [7]     2017  R           FCN                5,090 references
Rodrigues [49]   2018  R           BiLSTM             40,000 references
Prasad [46]      2018  M, R        CRF, DL            FLUX-CiM (300 refs.), CiteSeer (4,000 refs.)
Jahongir [4]     2018  M           Rules              10,000 documents
Torre [15]       2018  B, M        Rules              300 documents
Rizvi [47]       2020  R           R-CNN              40,000 references
Hashmi [22]      2020  M           Rules              45 documents
Ahmed [1]        2020  M           Rules              150 documents
Nikolaos [33]    2021  B, M, R     Attention, BiLSTM  3,000 documents

¹ Publications in chronological order; the labels indicate the first author only.
² (B) Body text, (M) Metadata, (R) References, (RO) Reading order, (ToC) Table of contents
³ Domain-specific datasets: Computer Science: CiteSeer [43], CORA [39], DBLP [52], FLUX-CiM [10,11], ManCreat [42]; Health Science: PubMed [40], PMC [41]
training dataset. Eighteen of the 27 approaches (67%) employ machine learning or deep learning (DL) techniques; the remainder use rule-based extraction (Rules).
Early tools rely on manually coded rules [44]. Second-generation tools use sta-
tistical machine learning, e.g., based on Hidden Markov Models (HMM) [8],
Conditional Random Fields (CRF) [29], and maximum entropy [26]. The most
recent information extraction tools employ Transformer models [56].
A preference for—in theory—more flexible and adaptive machine learning
and deep learning techniques over case-specific rule-based algorithms is observ-
able in Table 1. However, many training datasets are domain-specific, e.g., they
exclusively consist of documents from Computer Science or Health Science, and
comprise fewer than 500 documents. These two factors put the generalizability
of the respective IE approaches into question. Notable exceptions like Ojokoh et
al. [42], Kern et al. [25], and Tkaczyk et al. [55] use multiple datasets covering
different domains for training and evaluation. However, these approaches address
specific tasks, i.e., header metadata extraction, reference extraction, or both.
Moreover, a literature survey by Mao et al. shows that most approaches for
text extraction from PDF do not specify the ground-truth data and performance
metrics they use, which impedes performance comparisons [38]. A positive excep-
tion is a publication by Bast et al. [5], which presents a comprehensive evaluation
framework for text extraction from PDF that includes a fine-grained specifica-
tion of the performance measures used.
2.2 Labeled Datasets and Prior Benchmarks
Table 2 summarizes datasets usable for training and evaluating PDF information
extraction approaches grouped by the type of ground-truth labels they offer.
Most datasets exclusively offer labels for document metadata, references, or both.
Table 2: Labeled datasets for information extraction from PDF documents.

Publication¹     Size              Ground-truth Labels
Fan [16]         147 documents     Metadata
Färber [17]      90K documents     References
Grennan [21]     1B references     References
Saier [51,50]    1M documents      References
Ley [30,52]      6M documents      Metadata, references
McCallum [39]    935 documents     Metadata, references
Kyle [34]        8.1M documents    Metadata, references
Ororbia [43]     6M documents      Metadata, references
Bast [6]         12,098 documents  Body text, sections, title
Li [31]          500K pages        Captions, equations, figures, footers, lists, metadata, paragraphs, references, sections, tables

¹ The labels indicate the first author only.
Only the DocBank dataset by Li et al. [31] offers annotations for 12 diverse
content elements in academic documents, including figures, equations, tables,
and captions. Most of these content elements have not been used for bench-
mark evaluations yet. DocBank is comparably large (500K pages from research
papers published on arXiv in a four-year period). A downside of the DocBank
dataset is its coarse-grained labels for references, which do not annotate the
fields of bibliographic entries like the author, publisher, volume, or date, as do
bibliography-specific datasets like unarXive [21] or S2ORC [34].
Table 3 shows PDF information extraction benchmarks performed since 1999.
Few such works exist and were rarely repeated or updated, which is sub-optimal
given that many tools receive updates frequently. Other tools become techno-
logically obsolete or unmaintained. For instance, pdf-extract⁶, lapdftext⁷, PDF-SSA4MET⁸, and PDFMeat⁹ are no longer maintained actively, while ParsCit¹⁰ has been replaced by NeuralParsCit¹¹ and SciWING¹².
Table 3: Benchmark evaluations of PDF information extraction approaches.

Publication¹     Dataset                                          Metrics²   Tools  Labels³
Granitzer [19]   E-prints (2,452 docs.), Mendeley (20,672 docs.)  P, R       2      M
Lipinski [32]    arXiv (1,253 docs.)                              Acc        7      M
Bast [6]         arXiv (12,098 docs.)                             Custom     14     NL, Pa, RO, W
Körner [28]      100 (German docs.)                               P, R, F1   4      Ref
Tkaczyk [54]     9,491 documents                                  P, R, F1   10     Ref
Rizvi [48]       8,766 references                                 F1         4      Ref

¹ The labels indicate the first author only.
² (P) Precision, (R) Recall, (F1) F1-score, (Acc) Accuracy
³ (M) Metadata, (NL) New Line, (Pa) Paragraph, (Ref) Reference, (RO) Reading order, (W) Words
As Table 3 shows, the most extensive dataset used for evaluating PDF infor-
mation extraction tools so far contains approx. 24,000 documents. This number
is small compared to the sizes of datasets available for this task, shown in Ta-
ble 2. Most studies focused on exclusively evaluating metadata and reference
extraction (see also Table 3). An exception is a benchmark by Bast and Korzen
⁶ https://github.com/CrossRef/pdfextract
⁷ https://github.com/BMKEG/lapdftext
⁸ https://github.com/eliask/pdfssa4met
⁹ https://github.com/dimatura/pdfmeat
¹⁰ https://github.com/knmnyn/ParsCit
¹¹ https://github.com/WING-NUS/Neural-ParsCit
¹² https://github.com/abhinavkashyap/sciwing
[6], which evaluated spurious and missing words, paragraphs, and new lines for
14 tools but used a comparably small dataset of approx. 10K documents.
We conclude from our review of related work that (1) recent benchmarks of information extraction tools for PDF are rare, (2) existing benchmarks mostly analyze metadata extraction, (3) rely on small, domain-specific datasets, and (4) include tools that have become obsolete or unmaintained. Moreover, (5) a variety of suitably labeled datasets have not yet been used to evaluate information extraction tools for PDF documents.
Therefore, we see the need for benchmarking state-of-the-art PDF information
extraction tools on a large labeled dataset of academic documents covering mul-
tiple domains and containing diverse content elements.
3 Methodology
This section presents the experimental setup of our study by describing the tools
we evaluate (Section 3.1), the dataset we use (Section 3.2), and the procedure
we follow (Section 3.3).
3.1 Evaluated Tools
We chose ten actively maintained non-commercial open-source tools that we
categorize by extraction tasks.
1. Metadata Extraction includes tools to extract titles, authors, abstracts, and similar document metadata.
2. Reference Extraction comprises tools to access and parse bibliographic reference strings into fields like author names, publication titles, and venue.
3. Table Extraction refers to tools that allow accessing both the structure and data of tables.
4. General Extraction subsumes tools to extract, e.g., paragraphs, sections, figures, captions, equations, lists, or footers.
For each of the tools we evaluate, Table 4 shows the version, supported extraction
task(s), primary technological approach, and output format. Hereafter, we briefly
describe each tool, focusing on its technological approach.
Adobe Extract¹³ is a cloud-based API that allows extracting tables and numerous other content elements subsumed in the general extraction category. The API employs the Adobe Sensei¹⁴ AI and machine learning platform to understand the structure of PDF documents. To evaluate the Adobe Extract API, we used the Adobe PDFServices Python SDK¹⁵ to access the API's services.
Apache Tika¹⁶ allows metadata and content extraction in XML format. We used the tika-python¹⁷ client to access the Tika REST API. Unfortunately, we found that tika-python only supports content (paragraphs) extraction.
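The following is a minimal sketch of how we call Tika through the tika-python client; the file name paper.pdf is a placeholder.

```python
from tika import parser  # tika-python talks to a Tika server in the background

# Parse a PDF; the result is a dict with 'metadata' and 'content' keys.
parsed = parser.from_file("paper.pdf")  # "paper.pdf" is a placeholder path

print(parsed["metadata"].get("Content-Type"))  # e.g., 'application/pdf'
print((parsed["content"] or "")[:500])         # plain-text body returned by Tika
```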
¹³ https://www.adobe.io/apis/documentcloud/dcsdk/pdf-extract.html
¹⁴ https://www.adobe.com/de/sensei.html
¹⁵ https://github.com/adobe/pdfservices-python-sdk-samples
¹⁶ https://tika.apache.org/
¹⁷ https://github.com/chrismattmann/tika-python
Table 4: Overview of evaluated information extraction tools.

Tool            Version  Task¹        Technology                    Output
Adobe Extract   1.0      G, T         Adobe Sensei AI Framework     JSON, XLSX
Apache Tika     2.0.0    G            Apache PDFBox                 TXT
Camelot         0.10.1   T            OpenCV, PDFMiner              CSV, Dataframe
CERMINE         1.13     G, M, R      CRF, iText, Rules, SVM        JATS
GROBID          0.7.0    G, M, R, T   CRF, Deep Learning, Pdfalto   TEI XML
PdfAct          n/a      G, M, R, T   pdftotext, rules              JSON, TXT, XML
PyMuPDF         1.19.1   G            OCR, tesseract                TXT
RefExtract      0.2.5    R            pdftotext, rules              TXT
Science Parse   1.0      G, M, R      CRF, pdffigures2, rules       JSON
Tabula          1.2.1    T            PDFBox, rules                 CSV, Dataframe

¹ (G) General, (M) Metadata, (R) References, (T) Table
Camelot¹⁸ can extract tables using either the Stream or the Lattice mode. The former uses whitespace between cells and the latter table borders for table cell identification. For our experiments, we exclusively use the Stream mode, since our test documents are academic papers, in which tables typically use whitespace instead of cell borders to delineate cells. The Stream mode internally utilizes the PDFMiner library¹⁹ to extract characters that are subsequently grouped into words and sentences using whitespace margins.
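As an illustration, a minimal sketch of invoking Camelot's Stream mode from Python; the file name and page selection are placeholders.

```python
import camelot

# Stream mode infers table cells from whitespace (no ruling lines required).
tables = camelot.read_pdf("paper.pdf", flavor="stream", pages="1-end")

print(len(tables))               # number of detected table regions
df = tables[0].df                # each table is exposed as a pandas DataFrame
tables[0].to_csv("table0.csv")   # export for comparison against the ground truth
```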
CERMINE [55] offers metadata, reference, and general extraction capabilities. The tool employs the iText PDF toolkit²⁰ for character extraction and the Docstrum²¹ image segmentation algorithm for page segmentation of document images. CERMINE uses an SVM classifier implemented using the LibSVM²² library and rule-based algorithms for metadata extraction. For reference extraction, the tool employs k-means clustering and Conditional Random Fields implemented using the MALLET²³ toolkit for sequence labeling. CERMINE returns a single XML file containing the annotations for an entire PDF. We employ the Beautiful Soup²⁴ library to filter CERMINE's output files for the annotations relevant to our evaluation.
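A sketch of how such filtering can look; the tag names follow the JATS/NLM convention used by CERMINE's output, but the specific element names queried here and the file name are assumptions.

```python
from bs4 import BeautifulSoup

# Parse CERMINE's JATS/NLM XML output for one document (path is a placeholder).
with open("paper.cermxml", encoding="utf-8") as fh:
    soup = BeautifulSoup(fh, "xml")  # the "xml" parser requires lxml

# Assumed JATS element names: <article-title> for the title and
# <ref> entries (inside <ref-list>) for bibliographic references.
title = soup.find("article-title")
references = [ref.get_text(" ", strip=True) for ref in soup.find_all("ref")]

print(title.get_text(strip=True) if title else "no title found")
print(len(references), "reference strings")
```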
GROBID²⁵ [35] supports all four extraction tasks. The tool allows using either feature-engineered CRF (default) or a combination of CRF and DL models realized using the DeLFT²⁶ Deep Learning library, which is based on TensorFlow and Keras. GROBID uses a cascade of sequence labeling models for different components. The models in the model cascade use individual label sequencing
¹⁸ https://github.com/camelot-dev/camelot
¹⁹ https://github.com/pdfminer/pdfminer.six
²⁰ https://github.com/itext
²¹ https://github.com/chulwoopack/docstrum
²² https://github.com/cjlin1/libsvm
²³ http://mallet.cs.umass.edu/sequences.php
²⁴ https://www.crummy.com/software/BeautifulSoup/bs4/doc/
²⁵ https://github.com/kermitt2/grobid
²⁶ https://github.com/kermitt2/delft
algorithms and features; some models employ tokenizers. This approach offers
flexibility by allowing model tuning and improves the model’s maintainability.
We evaluate the default CRF model with production settings (a recommended
setting to improve the performance and availability of the GROBID server, ac-
cording to the tool's documentation²⁷).
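A minimal sketch of querying a running GROBID server over its REST interface; the processFulltextDocument endpoint is GROBID's standard full-text service, while the host, port, and file name are placeholders.

```python
import requests

GROBID_URL = "http://localhost:8070/api/processFulltextDocument"  # default local server

# Send one PDF to the GROBID service and receive TEI XML in return.
with open("paper.pdf", "rb") as fh:
    response = requests.post(GROBID_URL, files={"input": fh}, timeout=120)

response.raise_for_status()
tei_xml = response.text  # TEI XML with header metadata, body text, and references
print(tei_xml[:300])
```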
PdfAct, formerly called Icecite [5], is a rule-based tool that supports all four extraction tasks, including the extraction of appendices, acknowledgments, and tables of contents. The tool uses the PDFBox²⁸ and pdftotext²⁹ PDF manipulation and content extraction libraries. We use the tool's JAR release³⁰.
PyMuPDF³¹ extends the MuPDF³² viewer library with font and image extraction, PDF joining, and file embedding. PyMuPDF uses tesseract³³ for OCR. PyMuPDF could not process files whose names include special characters.
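A minimal sketch of the PyMuPDF calls used for plain-text extraction; the file name is a placeholder.

```python
import fitz  # PyMuPDF

# Open a PDF and extract the plain text of every page.
doc = fitz.open("paper.pdf")
pages = [page.get_text("text") for page in doc]  # one string per page
doc.close()

print(len(pages), "pages extracted")
print(pages[0][:300])
```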
RefExtract³⁴ is a reference extraction tool that uses pdftotext³⁵ and regular expressions. RefExtract returns annotations for the entire bibliography of a document. The ground-truth annotations in our dataset (cf. Section 3.2), however, pertain to individual pages of documents and do not always cover the entire document. If ground-truth annotations are only available for a subset of the references in a document, we use regular expressions to filter RefExtract's output to those references with ground-truth labels.
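The sketch below illustrates this post-processing step under stated assumptions: extract_references_from_file is assumed to be RefExtract's Python entry point, and the filtering pattern is a simplified, hypothetical stand-in for the expressions actually used.

```python
import re
from refextract import extract_references_from_file  # assumed entry point

# Extract all reference records RefExtract finds in the document (placeholder path).
references = extract_references_from_file("paper.pdf")

# Keep only references that also appear in the page-level ground truth,
# matched here by a simplified pattern on example author surnames.
ground_truth_authors = ["Hamada", "Shiu"]  # hypothetical example values
pattern = re.compile("|".join(map(re.escape, ground_truth_authors)), re.IGNORECASE)

filtered = [ref for ref in references if pattern.search(str(ref))]
print(len(filtered), "of", len(references), "references kept")
```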
Science Parse³⁶ uses a CRF model trained on data from GROBID to extract the title, author, and references. It also employs a rule-based algorithm by Clark and Divvala [9] to extract sections and paragraphs in JSON format.
Tabula³⁷ is a table extraction tool. Analogous to Camelot, Tabula offers a Stream mode realized using PDFBox and a Lattice mode realized using OpenCV for table cell recognition.
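For illustration, a minimal sketch of calling Tabula's Stream mode through the tabula-py wrapper; the file name is a placeholder and a Java runtime is required.

```python
import tabula  # tabula-py wraps the Java Tabula library

# Stream mode, analogous to Camelot: cells are inferred from whitespace.
tables = tabula.read_pdf("paper.pdf", pages="all", stream=True)

print(len(tables))        # list of pandas DataFrames, one per detected table
print(tables[0].head())   # preview the first extracted table
```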
3.2 Dataset
We use the DocBank³⁸ dataset, created by Li et al. [31], for our experiments. Figure 1 visualizes the process for compiling the dataset. First, the creators gathered arXiv documents for which both the PDF and the LaTeX source code were available. Li et al. then edited the LaTeX code to enable accurate automated annotation of content elements in the PDF version of the documents. For this purpose, they inserted commands that formatted content elements in specific
²⁷ https://GROBID.readthedocs.io/en/latest/Troubleshooting/
²⁸ http://pdfbox.apache.org/
²⁹ https://github.com/jalan/pdftotext
³⁰ https://github.com/ad-freiburg/pdfact
³¹ https://github.com/pymupdf/PyMuPDF
³² https://mupdf.com/
³³ https://github.com/tesseract-ocr/tesseract
³⁴ https://github.com/inspirehep/refextract
³⁵ https://linux.die.net/man1/pdftotext
³⁶ https://github.com/allenai/science-parse
³⁷ https://github.com/chezou/tabula-py
³⁸ https://github.com/doc-analysis/DocBank
Fig. 1: Process for generating the DocBank dataset: data acquisition from arXiv (PDF and .tex), semantic structure detection, and token annotation, e.g., rewriting \section{Section1} to \section{{\color{fontcolor}{Section1}}}, resulting in a tab-separated ground-truth file.
colors. The center part of Figure 3 shows the mapping of content elements to
colors. In the last step, the dataset creators used PDFPlumber³⁹ and PDFMiner to extract and annotate relevant content elements by their color. DocBank provides the annotations as separate files for each document page in the dataset.
Table 5 shows the structure of the tab-separated ground-truth files. Each
line in the file refers to one component on the page and is structured as follows.
Index 0 represents the token itself, e.g., a word. Indices 1-4 denote the bounding
box information of the token, where (x0, y0) represents the top-left and (x1, y1)
the bottom-right corner of the token in the PDF coordinate space. Indices 5-7
reflect the token’s color in RGB notation, index 8 the token’s font, and index 9
the label for the type of the content element. Each ground-truth file adheres to
the naming scheme shown in Figure 2.
Table 5: Structure of DocBank's plaintext ground-truth files.

Index    0      1   2   3   4   5  6  7  8          9
Content  token  x0  y0  x1  y1  R  G  B  font name  label

Source: https://doc-analysis.github.io/docbank-page/index.html.
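To make the layout in Table 5 concrete, the following sketch parses one DocBank ground-truth file into per-token records. The file name is a placeholder, and integer coordinates, integer color values, and lowercase label strings such as "reference" are assumptions about DocBank's released files.

```python
from typing import Dict, List

def read_docbank_page(path: str) -> List[Dict]:
    """Parse a tab-separated DocBank ground-truth file (structure as in Table 5)."""
    tokens = []
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            row = line.rstrip("\n").split("\t")
            if len(row) < 10:
                continue  # skip malformed lines
            tokens.append({
                "token": row[0],
                "bbox": tuple(int(v) for v in row[1:5]),   # x0, y0, x1, y1
                "color": tuple(int(v) for v in row[5:8]),  # R, G, B
                "font": row[8],
                "label": row[9],
            })
    return tokens

# Example: collect all tokens labeled as references on one page (placeholder path).
page_tokens = read_docbank_page("sample_page.txt")
references = [t["token"] for t in page_tokens if t["label"] == "reference"]
```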
Fig. 2: Naming scheme for DocBank’s ground-truth files.
³⁹ https://github.com/jsvine/pdfplumber
The DocBank dataset offers ground-truth annotations for 1.5M content ele-
ments on 500K pages. Li et al. extracted the pages from arXiv papers in Physics,
Mathematics, Computer Science, and numerous other fields published between
2014 and 2018. DocBank’s large size, recency, diversity of included documents,
number of annotated content elements, and high annotation quality due to the
weakly supervised labeling approach make it an ideal choice for our purposes.
3.3 Evaluation Procedure
Fig. 3: Overview of the procedure for comparing content elements extracted by IE tools to the ground-truth annotations and computing evaluation metrics. The pipeline assembles labeled data and PDFs into a PDF Object (annotated data, file name, page number, file path), parses the tool output (TXT, JSON, XML, XLSX) and the ground truth for a selected element (abstract, title, author, caption, equation, list, footer, reference, paragraph, section, table) into data frames with separate or collated tokens, and computes a similarity matrix and the evaluation metrics (Levenshtein Ratio, precision, recall, accuracy, F1 score).
Figure 3 shows our evaluation procedure. First, we select the PDF files whose associated ground-truth files contain relevant labels. For example, we search for ground-truth files containing reference tokens to evaluate reference extraction tools. We include the PDF file, the ground-truth file, the document ID and page number obtainable from the file name (cf. Figure 2), and the file path in a self-defined Python object (see PDF Object in Figure 3).
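A minimal sketch of such a container; the field names mirror the PDF Object shown in Figure 3, while the exact types are assumptions.

```python
from dataclasses import dataclass
import pandas as pd

@dataclass
class PDFObject:
    """Bundles one document page with its ground-truth annotations (cf. Figure 3)."""
    annotated_data: pd.DataFrame  # parsed ground-truth tokens for this page
    file_name: str                # DocBank file name (encodes document ID and page)
    page_number: int
    file_path: str                # path to the corresponding PDF file
```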
Then, the evaluation process splits into two branches whose goal is to create
two pandas data frames—one holding the relevant ground-truth data, and the
other the output of an information extraction tool. For this purpose, both the
ground-truth files and the output files of IE tools are parsed and filtered for
the relevant content elements. For example, to evaluate reference extraction via CERMINE, we exclusively parse reference tags from CERMINE's XML output file into a data frame (see Extracted DF in Figure 3).
Finally, we convert both the ground-truth data frame and the extracted data frame into two formats for comparison and computing performance metrics. The first is the separate tokens format, in which every token is represented as a row in the data frame. The second is the collated tokens format, in which all tokens are combined into a single space-delimited row in the data frame. Separate tokens serve to compute a strict score for token-level extraction quality, whereas collated tokens yield a more lenient score intended to reflect a tool's average extraction quality for a class of content elements. We will explain the idea of both scores and their computation hereafter.
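Both token formats can be derived from the same token list, as sketched below; the column name token is an assumption about the internal data frame layout, and the token values are example data.

```python
import pandas as pd

tokens = ["Yuta", "Hamada", "Gary", "Shiu"]  # example ground-truth tokens

# Separate tokens: one row per token, used for the strict token-level scores.
separate_df = pd.DataFrame({"token": tokens})

# Collated tokens: all tokens joined into a single space-delimited row,
# used for the more lenient Accuracy score.
collated_df = pd.DataFrame({"token": [" ".join(tokens)]})

print(separate_df)
print(collated_df)
```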
We employ the Levenshtein Ratio to quantify the similarity of extracted tokens and the ground-truth data for both the separate tokens and collated tokens format. Equation (1) defines the computation of the Levenshtein distance of the extracted tokens $t_e$ and the ground-truth tokens $t_g$.
$$
\mathrm{lev}_{t_e,t_g}(i,j) =
\begin{cases}
\max(i,j) & \text{if } \min(i,j) = 0,\\[4pt]
\min
\begin{cases}
\mathrm{lev}_{t_e,t_g}(i-1,j) + 1\\
\mathrm{lev}_{t_e,t_g}(i,j-1) + 1\\
\mathrm{lev}_{t_e,t_g}(i-1,j-1) + 1_{(t_{e_i} \neq t_{g_j})}
\end{cases} & \text{otherwise.}
\end{cases}
\tag{1}
$$
Equation (2) defines the derived Levenshtein Ratio score (γ).
$$
\gamma(t_e, t_g) = 1 - \frac{\mathrm{lev}_{t_e,t_g}(i,j)}{|t_e| + |t_g|}
\tag{2}
$$
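A direct transcription of Equations (1) and (2) as a sketch; the released framework may instead rely on an off-the-shelf string-similarity library, whose ratio can differ slightly in how substitutions are weighted.

```python
def levenshtein_distance(t_e: str, t_g: str) -> int:
    """Compute lev_{t_e,t_g}(|t_e|, |t_g|) via the recurrence in Equation (1)."""
    m, n = len(t_e), len(t_g)
    # dp[i][j] holds lev(i, j) for the prefixes t_e[:i] and t_g[:j].
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        for j in range(n + 1):
            if min(i, j) == 0:
                dp[i][j] = max(i, j)
            else:
                cost = 1 if t_e[i - 1] != t_g[j - 1] else 0
                dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                               dp[i][j - 1] + 1,          # insertion
                               dp[i - 1][j - 1] + cost)   # substitution / match
    return dp[m][n]

def levenshtein_ratio(t_e: str, t_g: str) -> float:
    """Levenshtein Ratio gamma as defined in Equation (2)."""
    if not t_e and not t_g:
        return 1.0
    return 1.0 - levenshtein_distance(t_e, t_g) / (len(t_e) + len(t_g))
```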
Equation (3) shows the derivation of the similarity matrix ($\Delta^d$) for a document ($d$), which contains the Levenshtein Ratio ($\gamma$) of every token in the extracted data frame with separate tokens $E^s$ of size $m$ and the ground-truth data frame with separate tokens $G^s$ of size $n$.
$$
\Delta^d_{m \times n} = \gamma\left[E^s_i, G^s_j\right]_{i,j}^{m,n}
\tag{3}
$$
Using the $m \times n$ similarity matrix, we compute the Precision $P^d$ and Recall $R^d$ scores according to Equation (4) and Equation (5), respectively. As the numerator, we use the number of extracted tokens whose Levenshtein Ratio is larger than or equal to 0.7. We chose this threshold for consistency with the experiments by Granitzer et al. [19]. We then compute the $F_1^d$ score according to Equation (6) as a token-level score for a tool's extraction quality.
$$
P^d = \frac{\#\left(\Delta^d_{i,j} \geq 0.7\right)}{m}
\tag{4}
$$
$$
R^d = \frac{\#\left(\Delta^d_{i,j} \geq 0.7\right)}{n}
\tag{5}
$$
$$
F_1^d = \frac{2 \times P^d \times R^d}{P^d + R^d}
\tag{6}
$$
Moreover, we compute the Accuracy score $A^d$ reflecting a tool's average extraction quality for a class of tokens. To obtain $A^d$, we compute the Levenshtein Ratio $\gamma$ of the extracted tokens $E^c$ and ground-truth tokens $G^c$ in the collated tokens format, according to Equation (7).
$$
A^d = \gamma\left[E^c, G^c\right]
\tag{7}
$$
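Putting Equations (3) to (7) together, a sketch of the per-document metric computation; levenshtein_ratio is the function sketched after Equation (2), and the 0.7 threshold follows Granitzer et al. [19].

```python
import numpy as np

THRESHOLD = 0.7  # minimum Levenshtein Ratio counted as a correct extraction

def document_scores(extracted: list, ground_truth: list) -> dict:
    """Compute the similarity matrix (Eq. 3) and P, R, F1, Accuracy (Eq. 4-7)."""
    m, n = len(extracted), len(ground_truth)
    # Similarity matrix Delta^d: ratio of every extracted vs. ground-truth token.
    delta = np.array([[levenshtein_ratio(e, g) for g in ground_truth]
                      for e in extracted])
    hits = int((delta >= THRESHOLD).sum())
    precision = hits / m if m else 0.0
    recall = hits / n if n else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    # Accuracy: ratio of the collated (space-joined) token sequences (Eq. 7).
    accuracy = levenshtein_ratio(" ".join(extracted), " ".join(ground_truth))
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}
```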
Figure 4 and Figure 5 show the similarity matrices for the author names 'Yuta,' 'Hamada,' 'Gary,' and 'Shiu' using separate and collated tokens, respectively. Figure 4 additionally shows an example computation of the Levenshtein Ratio for the strings Gary and Yuta. The strings have a Levenshtein distance of six and a cumulative string length of eight, which results in a Levenshtein Ratio of 0.25 that is entered into the similarity matrix. Figure 5 analogously exemplifies computing the Accuracy score of the two strings using collated tokens.
          Yuta  Hamada  Gary  Shiu1,
Yuta      1.0   0.2     0.25  0.2
Hamada    0.2   1.0     0.2   0.0
Gary      0.25  0.2     1.0   0.0
Shiu      0.25  0.0     0.0   0.8

       Y  u  t  a
    0  1  2  3  4
 G  1  2  3  4  5
 a  2  3  4  5  4
 r  3  4  5  6  5
 y  4  3  4  5  6

Fig. 4: Left: Similarity matrix for author names using separate tokens. Right: Computation of the Levenshtein distance (6) and the optimal edit transcript (yellow highlights) for two author names using dynamic programming.
                         Yuta Hamada Gary Shiu1,
Yuta Hamada Gary Shiu    0.957

Fig. 5: Similarity matrix for two sets of author names using collated tokens.
4 Results
We present the evaluation results grouped by extraction task (see Figures 6–9)
and by tools (see Table 6). This two-fold breakdown of the results facilitates
identifying the best-performing tool for a specific extraction task or content
element and allows for gauging the strengths and weaknesses of tools more easily.
Note that the task-specific result visualizations (Figures 6–9) only include tools
that support the respective extraction task. See Table 4 for an overview of the
evaluated tools and the extraction tasks they support.
Figure 6 shows the cumulative F1 scores of CERMINE, GROBID, PdfAct, and Science Parse for the metadata extraction task, i.e., extracting title, abstract, and authors. Consequently, the best possible cumulative F1 score equals three. Overall, GROBID performs best, achieving a cumulative F1 score of 2.25 and individual F1 scores of 0.91 for title, 0.82 for abstract, and 0.52 for authors. Science Parse (2.03) and CERMINE (1.97) obtain comparable cumulative F1 scores, while PdfAct has the lowest cumulative F1 score of 1.14. However, PdfAct performs second-best for title extraction with an F1 score of 0.85. The performance of all tools is worse for extracting authors than for titles and abstracts. It appears that machine-learning-based approaches like those of CERMINE, GROBID, and Science Parse perform better for metadata extraction than rule-based algorithms like the one implemented in PdfAct⁴⁰.
Figure 7 shows the results for the reference extraction task. With an F1 score of 0.79, GROBID also performs best for this task. CERMINE achieves the second rank with an F1 score of 0.74, while Science Parse and RefExtract share the third rank with identical F1 scores of 0.49. As for the metadata extraction task, PdfAct also achieves the lowest F1 score, 0.15, for reference extraction. While both RefExtract and PdfAct employ pdftotext and regular expressions, GROBID uses a cascade of sequence labeling models⁴¹, each dedicated to a specific document component, which can be the reason for its superior performance [36].
Figure 8 depicts the results for the table extraction task. Adobe Extract outperforms the other tools with an F1 score of 0.47. Camelot (F1 = 0.30), Tabula (F1 = 0.28), and GROBID (F1 = 0.23) perform notably worse than Adobe Extract. Both Camelot and Tabula incorrectly treat two-column articles as tables and table captions as part of the table region, which negatively affects their performance scores. The use of comparable Stream and Lattice modes in Camelot and Tabula (cf. Section 3.1) likely causes the tools' highly similar results. PdfAct did not produce an output for any of our test documents that contain tables, although the tool supposedly supports table extraction. The performance of all tools is significantly lower for table extraction than for other content elements, which is likely caused by the need to extract additional structural information. The difficulty of table extraction is also reflected by numerous issues that users opened on the matter in the GROBID GitHub repository⁴².
⁴⁰ See Table 4 for more information on the tools' extraction approaches.
⁴¹ https://grobid.readthedocs.io/en/latest/Principles/
⁴² https://github.com/kermitt2/grobid/issues/340
              CERMINE  GROBID  PdfAct  Science Parse
Title           0.81    0.91    0.85       0.70
Abstract        0.72    0.82    0.16       0.81
Authors         0.44    0.52    0.13       0.52
Cumulative      1.97    2.25    1.14       2.03

Fig. 6: Results for metadata extraction.
              CERMINE  GROBID  PdfAct  Science Parse  RefExtract
Reference       0.74    0.79    0.15       0.49          0.49

Fig. 7: Results for reference extraction.
          Adobe Extract  Camelot  GROBID  PdfAct  Tabula
Table          0.47        0.30    0.23    0.00    0.28

Fig. 8: Results for table extraction.
Figure 9 visualizes the results for the general extraction task. GROBID achieves the highest cumulative F1 score of 2.38, followed by PdfAct (cumulative F1 = 1.66). The cumulative F1 scores of Science Parse (1.25), which only supports paragraph and section extraction, and CERMINE (1.20) are much lower than GROBID's score and comparable to that of PdfAct. Apache Tika, PyMuPDF, and Adobe Extract can only extract paragraphs.

For paragraph extraction, GROBID (0.90), CERMINE (0.85), and PdfAct (0.85) obtained high F1 scores, with Science Parse (0.76) and Adobe Extract (0.74) following closely. Apache Tika (0.52) and PyMuPDF (0.51) achieved notably lower scores because the tools include other elements like sections, captions, lists, footers, and equations in paragraphs.

Notably, only GROBID achieves a promising F1 score of 0.74 for the extraction of sections. GROBID and PdfAct are the only tools that can partially extract captions. None of the tools is able to extract lists. Only PdfAct supports the extraction of footers but achieves a low F1 score of 0.20. Only GROBID supports equation extraction, but the extraction quality is comparatively low (F1 = 0.25). To reduce the evaluation effort, we first tested the extraction of lists, footers, and equations on a two-month sample of the data covering January and February 2014. If a tool consistently obtained performance scores of 0, we did not continue with its evaluation. Following this procedure, we only evaluated GROBID and PdfAct on the full dataset.
For the general extraction task, GROBID outperforms other tools due to its segmentation model⁴³, which detects the main areas of documents based on layout features. Because imbalanced classes are kept in separate models, frequent content elements like paragraphs do not impact the extraction of rare elements from non-body areas. The cascading models used in GROBID also offer the flexibility to tune each model. Using layout and structure as the basis for the process additionally allows simpler training data to be used.
             Adobe Extract  Apache Tika  CERMINE  GROBID  PdfAct  PyMuPDF  Science Parse
Paragraph        0.74          0.52        0.85    0.90    0.85     0.51       0.76
Section          0.00          0.00        0.35    0.74    0.16     0.00       0.49
Caption          0.00          0.00        0.00    0.49    0.45     0.00       0.00
List             0.00          0.00        0.00    0.00    0.00     0.00       0.00
Footer           0.00          0.00        0.00    0.00    0.20     0.00       0.00
Equation         0.00          0.00        0.00    0.25    0.00     0.00       0.00
Cumulative       0.74          0.52        1.20    2.38    1.66     0.51       1.25

Fig. 9: Results for general data extraction.
The breakdown of results by tools shown in Table 6 underscores the main takeaway from the task-specific results: the tools' results differ greatly for different content elements. No tool performs best for all elements; rather, even tools that perform well overall can fail completely for certain extraction tasks. The large number of content elements whose extraction is either unsupported or only possible in poor quality indicates substantial potential for improvement in future work.
⁴³ https://grobid.readthedocs.io/en/latest/Principles/
Table 6: Results grouped by extraction tool.

Tool¹           Label      # Detected  # Processed²  Acc   F1    P     R
Adobe Extract   Table           1,635          736   0.52  0.47  0.45  0.49
                Paragraph       3,985        3,088   0.85  0.74  0.72  0.76
Apache Tika     Paragraph     339,603      258,582   0.55  0.52  0.43  0.65
Camelot         Table          16,289       11,628   0.27  0.30  0.23  0.44
CERMINE         Title          16,196       14,501   0.84  0.81  0.81  0.81
                Author         19,788       14,797   0.43  0.44  0.44  0.46
                Abstract       19,342       16,716   0.71  0.72  0.68  0.76
                Reference      40,333       35,193   0.80  0.74  0.71  0.77
                Paragraph     361,273      348,160   0.89  0.85  0.83  0.87
                Section       163,077      139,921   0.40  0.35  0.32  0.38
GROBID          Title          16,196       16,018   0.92  0.91  0.91  0.92
                Author         19,788       19,563   0.54  0.52  0.52  0.53
                Abstract       19,342       18,714   0.82  0.82  0.81  0.83
                Reference      40,333       36,020   0.82  0.79  0.79  0.80
                Paragraph     361,273      358,730   0.90  0.90  0.89  0.91
                Section       163,077      163,037   0.77  0.74  0.73  0.76
                Caption        90,606       62,445   0.57  0.49  0.47  0.51
                Table          16,740        8,633   0.24  0.23  0.23  0.23
                Equation      142,736       96,560   0.26  0.25  0.20  0.32
PdfAct          Title          17,670       16,834   0.85  0.85  0.85  0.86
                Author         13,110        2,187   0.14  0.13  0.12  0.18
                Abstract       21,470        4,683   0.17  0.16  0.15  0.20
                Reference      30,263       12,705   0.19  0.15  0.17  0.20
                Paragraph     361,318      357,905   0.85  0.85  0.80  0.89
                Section       129,361       87,605   0.21  0.16  0.12  0.25
                Caption        83,435       53,314   0.45  0.45  0.40  0.52
                Footer         32,457       26,252   0.23  0.20  0.25  0.16
PyMuPDF         Paragraph     339,650      258,383   0.55  0.51  0.41  0.65
RefExtract      Reference      40,333       38,405   0.55  0.49  0.44  0.55
Science Parse   Title          11,696       11,687   0.79  0.70  0.70  0.70
                Author            471          471   0.54  0.52  0.52  0.53
                Abstract       14,150       14,149   0.83  0.81  0.73  0.90
                Reference      40,333       35,200   0.55  0.49  0.49  0.50
                Paragraph     361,318      355,529   0.79  0.76  0.76  0.76
                Section       163,077      158,556   0.54  0.49  0.49  0.50
Tabula          Table          10,361        9,456   0.29  0.28  0.20  0.46

¹ Boldface indicates the best value for each content element type.
² The differences in the number of detected and processed items are due to PDF Read Exceptions or Warnings. We label an item as processed if it has a non-zero F1 score.
5 Conclusion and Future Work
We present an open evaluation framework for information extraction from aca-
demic PDF documents. Our framework uses the DocBank dataset [31] offering
12 types and 1.5M annotated instances of content elements contained in 500K
pages of arXiv papers from multiple disciplines. The dataset is larger and more topically diverse than most related datasets and supports more extraction tasks.
We use the newly developed framework to benchmark the performance of ten
freely available tools in extracting document metadata, bibliographic references,
tables, and other content elements in academic PDF documents. GROBID achieves the best results for the metadata and reference extraction tasks, followed by CERMINE and Science Parse. For table extraction, Adobe Extract outperforms
other tools, even though the performance is much lower than for other content
elements. All tools struggle to extract lists, footers, and equations.
While DocBank covers more disciplines than other datasets, we see further
diversification of the collection in terms of disciplines, document types, and con-
tent elements as a valuable task for future research. Table 2 shows that more
datasets suitable for information extraction from PDF documents are available
but unused thus far. The weakly supervised annotation approach used for creat-
ing the DocBank dataset is transferable to other LaTeX document collections.
Apart from the dataset, our framework can incorporate additional tools and
allows easy replacement of tools in case of updates. We intend to update and
extend our performance benchmark in the future.
The extraction of tables, equations, footers, lists, and similar content ele-
ments poses the toughest challenge for tools in our benchmark. In recent work,
Grennan et al. [20] showed that using synthetic datasets for model training can improve citation parsing. A similar approach could also be a promising direction for improving access to currently hard-to-extract content elements.
Combining extraction approaches could lead to a one-fits-all extraction tool, which we consider desirable. The Sciencebeam-pipelines⁴⁴ project currently undertakes initial steps toward that goal. We hope that our evaluation framework will help to support this line of research by facilitating performance benchmarks of IE tools as part of a continuous development and integration process.

⁴⁴ https://github.com/elifesciences/sciencebeam-pipelines
References
1. Ahmed, M.W., Afzal, M.T.: FLAG-PDFe: Features Oriented Metadata Extraction
Framework for Scientific Publications. IEEE Access 8, 99458–99469 (May 2020).
https://doi.org/10.1109/ACCESS.2020.2997907
2. Aiello, M., Monz, C., Todoran, L., Worring, M.: Document Understanding for
a Broad Class of Documents. International Journal on Document Analysis and
Recognition 5(1) (Aug 2002). https://doi.org/10.1007/s10032-002-0080-x
3. Anzaroot, S., Passos, A., Belanger, D., McCallum, A.: Learning Soft Linear Con-
straints with Application to Citation Field Extraction. In: Proceedings of the 52nd
Annual Meeting of the Association for Computational Linguistics (Volume 1: Long
Papers). pp. 593–602. Association for Computational Linguistics, Baltimore, Mary-
land (2014). https://doi.org/10.3115/v1/P14-1056
4. Azimjonov, J., Alikhanov, J.: Rule Based Metadata Extraction Framework
from Academic Articles. arXiv CoRR 1807.09009v1 [cs.IR], 1–10 (2018).
https://doi.org/10.48550/arXiv.1807.09009
5. Bast, H., Korzen, C.: The Icecite Research Paper Management System. In: Web
Information Systems Engineering – WISE 2013, vol. 8181, pp. 396–409. Springer
Berlin, Heidelberg, Nanjing, China (2013). https://doi.org/10.1007/978-3-642-41154-0_30
6. Bast, H., Korzen, C.: A Benchmark and Evaluation for Text Extraction from PDF.
In: 2017 ACM/IEEE Joint Conference on Digital Libraries (JCDL). pp. 1–10.
IEEE, Toronto, ON, Canada (2017). https://doi.org/10.1109/JCDL.2017.7991564
7. Bhardwaj, A., Mercier, D., Dengel, A., Ahmed, S.: DeepBIBX: Deep Learning
for Image Based Bibliographic Data Extraction. In: Proceedings of the 24th In-
ternational Conference on Neural Information Processing. LNCS, vol. 10635, pp.
286–293. Springer, Guangzhou, China (2017). https://doi.org/10.1007/978-3-319-70096-0_30
8. Borkar, V., Deshmukh, K., Sarawagi, S.: Automatic Segmentation of Text
into Structured Records. SIGMOD Record 30(2), 175–186 (Jun 2001).
https://doi.org/10.1145/376284.375682
9. Clark, C., Divvala, S.: PDFFigures 2.0: Mining Figures from Research Papers. In:
Proceedings of the 16th ACM/IEEE-CS on Joint Conference on Digital Libraries.
pp. 143–152. JCDL ’16, Association for Computing Machinery, New York, NY,
USA (2016). https://doi.org/10.1145/2910896.2910904
10. Cortez, E., da Silva, A.S., Gonçalves, M.A., Mesquita, F., de Moura, E.S.: FLUX-
CIM: Flexible Unsupervised Extraction of Citation Metadata. In: Proceedings
of the 7th ACM/IEEE-CS Joint Conference on Digital Libraries. pp. 215–224.
JCDL ’07, Association for Computing Machinery, New York, NY, USA (2007).
https://doi.org/10.1145/1255175.1255219
11. Cortez, E., da Silva, A.S., Gonçalves, M.A., Mesquita, F., de Moura, E.S.: A
Flexible Approach for Extracting Metadata from Bibliographic Citations. JASIST
60(6), 1144–1158 (Jun 2009). https://doi.org/10.1002/asi.21049
12. Councill, I., Giles, C.L., Kan, M.Y.: ParsCit: an Open-source CRF Reference
String Parsing Package. In: Proceedings of the Sixth International Conference on
Language Resources and Evaluation. European Language Resources Association,
Marrakech, Morocco (2008), https://aclanthology.org/L08-1291/
13. Cui, B.G., Chen, X.: An Improved Hidden Markov Model for Literature Meta-
data Extraction. In: Advanced Intelligent Computing Theories and Applications,
vol. 6215, pp. 205–212. Springer Berlin Heidelberg, Changsha, China (2010).
https://doi.org/10.1007/978-3-642-14922-1_26
14. Day, M.Y., Tsai, R.T.H., Sung, C.L., Hsieh, C.C., Lee, C.W., Wu, S.H., Wu, K.P.,
Ong, C.S., Hsu, W.L.: Reference metadata extraction using a hierarchical knowl-
edge representation framework. Decision Support Systems 43(1), 152–167 (Feb
2007). https://doi.org/10.1016/j.dss.2006.08.006
15. De La Torre, M., Aguirre, C., Anshutz, B., Hsu, W.: MATESC: Metadata-analytic
text extractor and section classifier for scientific publications. In: Proceedings of
the 10th International Joint Conference on Knowledge Discovery, Knowledge En-
gineering and Knowledge Management. vol. 1, pp. 261–267. SciTePress (2018).
https://doi.org/10.5220/0006937702610267
16. Fan, T., Liu, J., Qiu, Y., Jiang, C., Zhang, J., Zhang, W., Wan, J.: PARDA: A
Dataset for Scholarly PDF Document Metadata Extraction Evaluation. In: Col-
laborative Computing: Networking, Applications and Worksharing. pp. 417–431.
Springer, Cham (2019). https://doi.org/10.1007/978-3-030-12981-1_29
17. Färber, M., Thiemann, A., Jatowt, A.: A High-Quality Gold Standard for Citation-
based Tasks. In: Proceedings of the Eleventh International Conference on Language
Resources and Evaluation. European Language Resources Association, Miyazaki,
Japan (2018), https://aclanthology.org/L18-1296
18. Giuffrida, G., Shek, E.C., Yang, J.: Knowledge-Based Metadata Extraction from
PostScript Files. In: Proceedings of the Fifth ACM Conference on Digital Libraries.
pp. 77–84. DL ’00, Association for Computing Machinery, New York, NY, USA
(2000). https://doi.org/10.1145/336597.336639
19. Granitzer, M., Hristakeva, M., Jack, K., Knight, R.: A Comparison of Metadata
Extraction Techniques for Crowdsourced Bibliographic Metadata Management. In:
Proceedings of the 27th Annual ACM Symposium on Applied Computing. pp. 962–
964. SAC ’12, Association for Computing Machinery, New York, NY, USA (2012).
https://doi.org/10.1145/2245276.2245462
20. Grennan, M., Beel, J.: Synthetic vs. Real Reference Strings for Citation Pars-
ing, and the Importance of Re-training and Out-Of-Sample Data for Meaning-
ful Evaluations: Experiments with GROBID, GIANT and CORA. In: Proceed-
ings of the 8th International Workshop on Mining Scientific Publications. pp.
27–35. Association for Computational Linguistics, Wuhan, China (2020), https://aclanthology.org/2020.wosp-1.4
21. Grennan, M., Schibel, M., Collins, A., Beel, J.: GIANT: The 1-Billion Annotated
Synthetic Bibliographic-Reference-String Dataset for Deep Citation Parsing [Data]
(2019). https://doi.org/10.7910/DVN/LXQXAO
22. Hashmi, A.M., Afzal, M.T., ur Rehman, S.: Rule Based Approach to Ex-
tract Metadata from Scientific PDF Documents. In: 2020 5th Interna-
tional Conference on Innovative Technologies in Intelligent Systems and In-
dustrial Applications (CITISIA). pp. 1–4. IEEE, Sydney, Australia (2020).
https://doi.org/10.1109/CITISIA50690.2020.9371784
23. Hetzner, E.: A Simple Method for Citation Metadata Extraction Using Hidden
Markov Models. In: Proceedings of the 8th ACM/IEEE-CS Joint Conference on
Digital Libraries. pp. 280–284. JCDL ’08, Association for Computing Machinery,
New York, NY, USA (2008). https://doi.org/10.1145/1378889.1378937
24. Kasdorf, W.E.: The Columbia Guide to Digital Publishing. Columbia University
Press, USA (2003)
25. Kern, R., Jack, K., Hristakeva, M.: TeamBeam - Meta-Data Extrac-
tion from Scientific Literature. D-Lib Magazine 18(7/8) (Jul 2012).
https://doi.org/10.1045/july2012-kern
26. Klein, D., Manning, C.D.: Conditional Structure versus Conditional Estima-
tion in NLP Models. In: Proceedings of the 2002 Conference on Empirical
Methods in Natural Language Processing (EMNLP). pp. 9–16. Association
for Computational Linguistics, Pennsylvania, Philadelphia, PA, USA (2002).
https://doi.org/10.3115/1118693.1118695
27. Klink, S., Dengel, A., Kieninger, T.: Document Structure Analysis Based on Layout
and Textual Features. In: IAPR International Workshop on Document Analysis
Systems. IAPR, Rio de Janeiro, Brazil (2000)
28. Körner, M., Ghavimi, B., Mayr, P., Hartmann, H., Staab, S.: Evaluating Reference
String Extraction Using Line-Based Conditional Random Fields: A Case Study
with German Language Publications. In: New Trends in Databases and Information
Systems. pp. 137–145. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-
67162-8_15
29. Lafferty, J.D., McCallum, A., Pereira, F.C.N.: Conditional Random Fields: Prob-
abilistic Models for Segmenting and Labeling Sequence Data. In: Proceedings
of the Eighteenth International Conference on Machine Learning. pp. 282–289.
ICML ’01, Morgan Kaufmann Publishers Inc., San Francisco, CA, USA (2001),
https://dl.acm.org/doi/10.5555/645530.655813
30. Ley, M.: DBLP: Some Lessons Learned. Proc. VLDB Endowment 2(2), 1493–1500
(Aug 2009). https://doi.org/10.14778/1687553.1687577
31. Li, M., Xu, Y., Cui, L., Huang, S., Wei, F., Li, Z., Zhou, M.: DocBank: A
Benchmark Dataset for Document Layout Analysis. In: Proceedings of the 28th
International Conference on Computational Linguistics. pp. 949–960. Interna-
tional Committee on Computational Linguistics, Barcelona, Spain (Online) (2020).
https://doi.org/10.18653/v1/2020.coling-main.82
32. Lipinski, M., Yao, K., Breitinger, C., Beel, J., Gipp, B.: Evaluation of Header
Metadata Extraction Approaches and Tools for Scientific PDF Documents. In:
Proceedings of the 13th ACM/IEEE-CS Joint Conference on Digital Libraries. pp.
385–386. JCDL ’13, Association for Computing Machinery, New York, NY, USA
(2013). https://doi.org/10.1145/2467696.2467753
33. Livathinos, N., Berrospi, C., Lysak, M., Kuropiatnyk, V., Nassar, A., Car-
valho, A., Dolfi, M., Auer, C., Dinkla, K., Staar, P.: Robust PDF Doc-
ument Conversion using Recurrent Neural Networks. Proceedings of the
AAAI Conference on Artificial Intelligence 35(17), 15137–15145 (May 2021).
https://doi.org/10.1609/aaai.v35i17.17777
34. Lo, K., Wang, L.L., Neumann, M., Kinney, R., Weld, D.: S2ORC: The Seman-
tic Scholar Open Research Corpus. In: Proceedings of the 58th Annual Meeting
of the Association for Computational Linguistics. pp. 4969–4983. Association for
Computational Linguistics, Online (2020). https://doi.org/10.18653/v1/2020.acl-
main.447
35. Lopez, P.: GROBID (2008), https://github.com/kermitt2/grobid
36. Lopez, P.: GROBID: Combining Automatic Bibliographic Data Recognition and
Term Extraction for Scholarship Publications. In: Research and Advanced Technol-
ogy for Digital Libraries, LNCS, vol. 5714, pp. 473–474. Springer Berlin Heidelberg
(2009). https://doi.org/10.1007/978-3-642-04346-8_62
37. Mao, S., Kim, J., Thoma, G.R.: A Dynamic Feature Generation Sys-
tem for Automated Metadata Extraction in Preservation of Digital Mate-
rials. In: 1st International Workshop on Document Image Analysis for Li-
braries. pp. 225–232. IEEE Computer Society, Palo Alto, CA, USA (2004).
https://doi.org/10.1109/DIAL.2004.1263251
38. Mao, S., Rosenfeld, A., Kanungo, T.: Document structure analysis algorithms: a
literature survey. In: Proceedings Document Recognition and Retrieval X. SPIE
Proceedings, vol. 5010, pp. 197–207. SPIE, Santa Clara, California, USA (Jan
2003). https://doi.org/10.1117/12.476326
39. McCallum, A.K., Nigam, K., Rennie, J., Seymore, K.: Automating the Construc-
tion of Internet Portals with Machine Learning. Information Retrieval 3(2), 127–
163 (Jul 2000). https://doi.org/10.1023/A:1009953814988
40. National Library of Medicine: PubMed, https://pubmed.ncbi.nlm.nih.gov/
41. National Library of Medicine: PubMed Central, https://www.ncbi.nlm.nih.gov/pmc/
42. Ojokoh, B., Zhang, M., Tang, J.: A trigram hidden Markov model for metadata
extraction from heterogeneous references. Information Sciences 181(9), 1538–1551
(May 2011). https://doi.org/10.1016/j.ins.2011.01.014
43. Ororbia, A.G., Wu, J., Khabsa, M., Williams, K., Giles, C.L.: Big Scholarly
Data in CiteSeerX: Information Extraction from the Web. In: Proceedings of
the 24th International Conference on World Wide Web. pp. 597–602. WWW ’15
Companion, Association for Computing Machinery, New York, NY, USA (2015).
https://doi.org/10.1145/2740908.2741736
44. Palmero, G., Dimitriadis, Y.: Structured document labeling and rule extraction
using a new recurrent fuzzy-neural system. In: Proceedings of the Fifth Interna-
tional Conference on Document Analysis and Recognition. pp. 181–184. Springer,
Bangalore, India (1999). https://doi.org/10.1109/ICDAR.1999.791754
45. Peng, F., McCallum, A.: Accurate Information Extraction from Research Pa-
pers using Conditional Random Fields. In: Proceedings of the Human Language
Technology Conference of the North American Chapter of the Association for
Computational Linguistics: HLT-NAACL. pp. 329–336. Association for Compu-
tational Linguistics, Boston, Massachusetts, USA (2004),https://aclanthology.
org/N04-1042
46. Prasad, A., Kaur, M., Kan, M.Y.: Neural ParsCit: a deep learning-based reference
string parser. International Journal on Digital Libraries 19(4), 323–337 (Nov 2018).
https://doi.org/10.1007/s00799-018-0242-1
47. Rizvi, S.T.R., Dengel, A., Ahmed, S.: A Hybrid Approach and Unified Framework
for Bibliographic Reference Extraction. IEEE Access 8, 217231–217245 (Dec 2020).
https://doi.org/10.1109/ACCESS.2020.3042455
48. Rizvi, S.T.R., Lucieri, A., Dengel, A., Ahmed, S.: Benchmarking Object Detection
Networks for Image Based Reference Detection in Document Images. In: 2019
Digital Image Computing: Techniques and Applications (DICTA). pp. 1–8. IEEE,
Perth, WA, Australia (2019). https://doi.org/10.1109/DICTA47822.2019.8945991
49. Rodrigues Alves, D., Colavizza, G., Kaplan, F.: Deep Reference Mining From Schol-
arly Literature in the Arts and Humanities. Frontiers in Research Metrics and
Analytics 3, 21 (Jul 2018). https://doi.org/10.3389/frma.2018.00021
50. Saier, T., Färber, M.: Bibliometric-Enhanced arXiv: A Data Set for Paper-Based
and Citation-Based Tasks. In: Proceedings of the 8th International Workshop
on Bibliometric-enhanced Information Retrieval (BIR). CEUR Workshop Pro-
ceedings, vol. 2345, pp. 14–26. CEUR-WS.org, Cologne, Germany (2019), http://ceur-ws.org/Vol-2345/paper2.pdf
51. Saier, T., Färber, M.: unarXive: A Large Scholarly Data Set with Publications'
Full-Text, Annotated In-Text Citations, and Links to Metadata. Scientometrics
125(3), 3085–3108 (Dec 2020). https://doi.org/10.1007/s11192-020-03382-z
52. Schloss Dagstuhl - Leibniz Center for Informatics, University of Trier: dblp: com-
puter science bibliography, https://dblp.org/
53. Souza, A., Moreira, V., Heuser, C.: ARCTIC: Metadata Extraction from Scientific
Papers in Pdf Using Two-Layer CRF. In: Proceedings of the 2014 ACM Symposium
on Document Engineering. pp. 121–130. DocEng ’14, Association for Computing
Machinery, New York, NY, USA (2014). https://doi.org/10.1145/2644866.2644872
54. Tkaczyk, D., Collins, A., Sheridan, P., Beel, J.: Machine Learning vs. Rules and
Out-of-the-Box vs. Retrained: An Evaluation of Open-Source Bibliographic Ref-
erence and Citation Parsers. In: Proceedings of the 18th ACM/IEEE on Joint
Conference on Digital Libraries. pp. 99–108. JCDL ’18, Association for Computing
Machinery, New York, NY, USA (2018). https://doi.org/10.1145/3197026.3197048
55. Tkaczyk, D., Szostek, P., Fedoryszak, M., Dendek, P.J., Bolikowski, L.: CERMINE:
automatic extraction of structured metadata from scientific literature. Interna-
tional Journal on Document Analysis and Recognition (IJDAR) 18(4), 317–335
(Dec 2015). https://doi.org/10.1007/s10032-015-0249-8
56. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N.,
Kaiser, L., Polosukhin, I.: Attention is All You Need. In: Proceedings of the 31st
International Conference on Neural Information Processing Systems. pp. 6000–
6010. NIPS'17, Curran Associates Inc., Red Hook, NY, USA (2017), https://dl.acm.org/doi/10.5555/3295222.3295349
57. Vilnis, L., Belanger, D., Sheldon, D., McCallum, A.: Bethe Projections for Non-
Local Inference. In: Proceedings of the Thirty-First Conference on Uncertainty in
Artificial Intelligence. pp. 892–901. UAI’15, AUAI Press, Arlington, Virginia, USA
(2015). https://doi.org/10.48550/arXiv.1503.01397