Preprint of the paper:
Meuschke, N., Jagdale, A., Spinde, T., Mitrović, J. & Gipp, B., "A Benchmark of PDF Information Extraction Tools Using a Multi-task and Multi-domain Evaluation Framework for Academic Documents", in Information for a Better World: Normality, Virtuality, Physicality, Inclusivity, LNCS, vol. 13972, Cham: Springer Nature Switzerland, 2023, pp. 383–405, DOI: 10.1007/978-3-031-28032-0_31.
A Benchmark of PDF Information Extraction Tools using a Multi-Task and Multi-Domain Evaluation Framework for Academic Documents

Norman Meuschke¹, Apurva Jagdale², Timo Spinde¹, Jelena Mitrović²,³, and Bela Gipp¹

¹ University of Göttingen, 37073 Göttingen, Germany
  {meuschke, spinde, gipp}@uni-goettingen.de
² University of Passau, 94032 Passau, Germany
  {apurva.jagdale, jelena.mitrovic}@uni-passau.de
³ The Institute for Artificial Intelligence R&D of Serbia, 21000 Novi Sad, Serbia
Abstract. Extracting information from academic PDF documents is crucial for numerous indexing, retrieval, and analysis use cases. Choosing the best tool to extract specific content elements is difficult because many, technically diverse tools are available, but recent performance benchmarks are rare. Moreover, such benchmarks typically cover only a few content elements like header metadata or bibliographic references and use smaller datasets from specific academic disciplines. We provide a large and diverse evaluation framework that supports more extraction tasks than most related datasets. Our framework builds upon DocBank, a multi-domain dataset of 1.5M annotated content elements extracted from 500K pages of research papers on arXiv. Using the new framework, we benchmark ten freely available tools in extracting document metadata, bibliographic references, tables, and other content elements from academic PDF documents. GROBID achieves the best metadata and reference extraction results, followed by CERMINE and Science Parse. For table extraction, Adobe Extract outperforms other tools, even though the performance is much lower than for other content elements. All tools struggle to extract lists, footers, and equations. We conclude that more research on improving and combining tools is necessary to achieve satisfactory extraction quality for most content elements. Evaluation datasets and frameworks like the one we present support this line of research. We make our data and code publicly available to contribute toward this goal.

Keywords: PDF · Information Extraction · Benchmark · Evaluation.
1 Introduction
The Portable Document Format (PDF) is the most prevalent encoding for aca-
demic documents. Extracting information from academic PDF documents is cru-
cial for numerous indexing, retrieval, and analysis tasks. Document search, rec-
ommendation, summarization, classification, knowledge base construction, ques-
tion answering, and bibliometric analysis are just a few examples [31].
However, the format’s technical design makes information extraction chal-
lenging. Adobe designed PDF as a platform-independent, fixed-layout format
by extending the PostScript [24] page description language. PDF focuses on
encoding a document’s visual layout to ensure a consistent appearance of the
document across software and hardware platforms but includes little structural
and semantic information on document elements.
Numerous tools for information extraction (IE) from PDF documents have
been presented since the format’s inception in 1993. The development of such
tools has been subject to a fast-paced technological evolution of extraction ap-
proaches from rule-based algorithms, over statistical machine learning (ML) to
deep learning (DL) models (cf. Section 2). Finding the best tool to extract spe-
cific content elements from PDF documents is currently difficult because:
1. Typically, tools only support extracting a subset of the content elements
in academic documents, e.g., title, authors, paragraphs, in-text citations,
captions, tables, figures, equations, or references.
2. Many information extraction tools, e.g., 12 of 35 tools we considered for our
study, are no longer maintained or have become obsolete.
3. Prior evaluations of information extraction tools often consider only specific
content elements or use domain-specific corpora, which makes their results
difficult to compare. Moreover, the most recent comprehensive benchmarks
of information extraction tools were published in 2015 for metadata⁴ [55], 2017 for body text [6], and 2018 for references⁵ [54], respectively. These evaluations do not reflect the latest technological advances in the field.
To alleviate this knowledge gap and facilitate finding the best tool to extract
specific elements from academic PDF documents, we comprehensively evaluate ten state-of-the-art non-commercial tools on eleven content elements, using a dataset of 500K pages from arXiv documents covering multiple fields.
Our code, data, and resources are publicly available at
http://pdf-benchmark.gipplab.org
2 Related Work
This section presents approaches for information extraction from PDF (Sec-
tion 2.1), labeled datasets suitable for training and evaluating PDF information
extraction approaches, and prior evaluations of IE tools (Section 2.2).
2.1 Information Extraction from PDF Documents
Table 1 summarizes publications on PDF information extraction since 1999. For
each publication, the table shows the primary technological approach and the
⁴ For example author(s), title, affiliation(s), address(es), email(s)
⁵ Refers to extracting the components of bibliographic references, e.g., author(s), title, venue, editor(s), volume, issue, page range, year of publication, etc.
Table 1: Publications on information extraction from PDF documents.

Publication¹     Year  Task²       Method             Training Dataset³
Palermo [44]     1999  M, ToC      Rules              100 documents
Klink [27]       2000  M           Rules              979 pages
Giuffrida [18]   2000  M           Rules              1,000 documents
Aiello [2]       2002  RO, Title   Rules              1,000 pages
Mao [37]         2004  M           OCR, Rules         309 documents
Peng [45]        2004  M, R        CRF                CORA (500 refs.)
Day [14]         2007  M, R        Template           160,000 citations
Hetzner [23]     2008  R           HMM                CORA (500 refs.)
Councill [12]    2008  R           CRF                CORA (200 refs.), CiteSeer (200 refs.)
Lopez [36]       2009  B, M, R     CRF, DL            None
Cui [13]         2010  M           HMM                400 documents
Ojokoh [42]      2010  M           HMM                CORA (500 refs.), FLUX-CiM (300 refs.), ManCreat
Kern [25]        2012  M           HMM                E-prints, Mendeley, PubMed (19K entries)
Bast [5]         2013  B, M, R     Rules              DBLP (690 docs.), PubMed (500 docs.)
Souza [53]       2014  M           CRF                100 documents
Anzaroot [3]     2014  R           CRF                UMASS (1,800 refs.)
Vilnis [57]      2015  R           CRF                UMASS (1,800 refs.)
Tkaczyk [55]     2015  B, M, R     CRF, Rules, SVM    CiteSeer (4,000 refs.), CORA (500 refs.), GROTOAP, PMC (53K docs.)
Bhardwaj [7]     2017  R           FCN                5,090 references
Rodrigues [49]   2018  R           BiLSTM             40,000 references
Prasad [46]      2018  M, R        CRF, DL            FLUX-CiM (300 refs.), CiteSeer (4,000 refs.)
Jahongir [4]     2018  M           Rules              10,000 documents
Torre [15]       2018  B, M        Rules              300 documents
Rizvi [47]       2020  R           R-CNN              40,000 references
Hashmi [22]      2020  M           Rules              45 documents
Ahmed [1]        2020  M           Rules              150 documents
Nikolaos [33]    2021  B, M, R     Attention, BiLSTM  3,000 documents

¹ Publications in chronological order; the labels indicate the first author only.
² (B) Body text, (M) Metadata, (R) References, (RO) Reading order, (ToC) Table of contents
³ Domain-specific datasets: Computer Science: CiteSeer [43], CORA [39], DBLP [52], FLUX-CiM [10,11], ManCreat [42]; Health Science: PubMed [40], PMC [41]
training dataset. Eighteen of the 27 approaches (67%) employ machine learning or deep learning (DL) techniques; the remainder use rule-based extraction (Rules).
Early tools rely on manually coded rules [44]. Second-generation tools use sta-
tistical machine learning, e.g., based on Hidden Markov Models (HMM) [8],
Conditional Random Fields (CRF) [29], and maximum entropy [26]. The most
recent information extraction tools employ Transformer models [56].
A preference for—in theory—more flexible and adaptive machine learning
and deep learning techniques over case-specific rule-based algorithms is observ-
able in Table 1. However, many training datasets are domain-specific, e.g., they
exclusively consist of documents from Computer Science or Health Science, and
comprise fewer than 500 documents. These two factors put the generalizability
of the respective IE approaches into question. Notable exceptions like Ojokoh et
al. [42], Kern et al. [25], and Tkaczyk et al. [55] use multiple datasets covering
different domains for training and evaluation. However, these approaches address
specific tasks, i.e., header metadata extraction, reference extraction, or both.
Moreover, a literature survey by Mao et al. shows that most approaches for
text extraction from PDF do not specify the ground-truth data and performance
metrics they use, which impedes performance comparisons [38]. A positive excep-
tion is a publication by Bast et al. [5], which presents a comprehensive evaluation
framework for text extraction from PDF that includes a fine-grained specifica-
tion of the performance measures used.
2.2 Labeled Datasets and Prior Benchmarks
Table 2 summarizes datasets usable for training and evaluating PDF information
extraction approaches grouped by the type of ground-truth labels they offer.
Most datasets exclusively offer labels for document metadata, references, or both.
Table 2: Labeled datasets for information extraction from PDF documents.

Publication¹     Size              Ground-truth Labels
Fan [16]         147 documents     Metadata
Färber [17]      90K documents     References
Grennan [21]     1B references     References
Saier [51,50]    1M documents      References
Ley [30,52]      6M documents      Metadata, references
McCallum [39]    935 documents     Metadata, references
Kyle [34]        8.1M documents    Metadata, references
Ororbia [43]     6M documents      Metadata, references
Bast [6]         12,098 documents  Body text, sections, title
Li [31]          500K pages        Captions, equations, figures, footers, lists, metadata, paragraphs, references, sections, tables

¹ The labels indicate the first author only.
Only the DocBank dataset by Li et al. [31] offers annotations for 12 diverse
content elements in academic documents, including figures, equations, tables,
and captions. Most of these content elements have not been used for bench-
mark evaluations yet. DocBank is comparably large (500K pages from research
papers published on arXiv in a four-year period). A downside of the DocBank
dataset is its coarse-grained labels for references, which do not annotate the
fields of bibliographic entries like the author, publisher, volume, or date, as do
bibliography-specific datasets like unarXive [21] or S2ORC [34].
Table 3 shows PDF information extraction benchmarks performed since 1999.
Few such works exist and were rarely repeated or updated, which is sub-optimal
given that many tools receive updates frequently. Other tools become techno-
logically obsolete or unmaintained. For instance, pdf-extract⁶, lapdftext⁷, PDF-SSA4MET⁸, and PDFMeat⁹ are no longer maintained actively, while ParsCit¹⁰ has been replaced by NeuralParsCit¹¹ and SciWING¹².
Table 3: Benchmark evaluations of PDF information extraction approaches.

Publication¹     Dataset                                          Metrics²   Tools  Labels³
Granitzer [19]   E-prints (2,452 docs.), Mendeley (20,672 docs.)  P, R       2      M
Lipinski [32]    arXiv (1,253 docs.)                              Acc        7      M
Bast [6]         arXiv (12,098 docs.)                             Custom     14     NL, Pa, RO, W
Körner [28]      100 (German docs.)                               P, R, F1   4      Ref
Tkaczyk [54]     9,491 documents                                  P, R, F1   10     Ref
Rizvi [48]       8,766 references                                 F1         4      Ref

¹ The labels indicate the first author only.
² (P) Precision, (R) Recall, (F1) F1-score, (Acc) Accuracy
³ (M) Metadata, (NL) New Line, (Pa) Paragraph, (Ref) Reference, (RO) Reading order, (W) Words
As Table 3 shows, the most extensive dataset used for evaluating PDF infor-
mation extraction tools so far contains approx. 24,000 documents. This number
is small compared to the sizes of datasets available for this task, shown in Ta-
ble 2. Most studies focused on exclusively evaluating metadata and reference
extraction (see also Table 3). An exception is a benchmark by Bast and Korzen
⁶ https://github.com/CrossRef/pdfextract
⁷ https://github.com/BMKEG/lapdftext
⁸ https://github.com/eliask/pdfssa4met
⁹ https://github.com/dimatura/pdfmeat
¹⁰ https://github.com/knmnyn/ParsCit
¹¹ https://github.com/WING-NUS/Neural-ParsCit
¹² https://github.com/abhinavkashyap/sciwing
[6], which evaluated spurious and missing words, paragraphs, and new lines for
14 tools but used a comparably small dataset of approx. 10K documents.
We conclude from our review of related work that (1) recent benchmarks of information extraction tools for PDF are rare, (2) existing benchmarks mostly analyze metadata extraction, (3) rely on small, domain-specific datasets, and (4) include tools that have become obsolete or unmaintained. Moreover, (5) a variety of suitably labeled datasets have not yet been used to evaluate information extraction tools for PDF documents.
Therefore, we see the need for benchmarking state-of-the-art PDF information
extraction tools on a large labeled dataset of academic documents covering mul-
tiple domains and containing diverse content elements.
3 Methodology
This section presents the experimental setup of our study by describing the tools
we evaluate (Section 3.1), the dataset we use (Section 3.2), and the procedure
we follow (Section 3.3).
3.1 Evaluated Tools
We chose ten actively maintained non-commercial open-source tools that we
categorize by extraction tasks.
1. Metadata Extraction includes tools to extract titles, authors, abstracts, and similar document metadata.
2. Reference Extraction comprises tools to access and parse bibliographic reference strings into fields like author names, publication titles, and venue.
3. Table Extraction refers to tools that allow accessing both the structure and data of tables.
4. General Extraction subsumes tools to extract, e.g., paragraphs, sections, figures, captions, equations, lists, or footers.
For each of the tools we evaluate, Table 4 shows the version, supported extraction
task(s), primary technological approach, and output format. Hereafter, we briefly
describe each tool, focusing on its technological approach.
Adobe Extract¹³ is a cloud-based API that allows extracting tables and numerous other content elements subsumed in the general extraction category. The API employs the Adobe Sensei¹⁴ AI and machine learning platform to understand the structure of PDF documents. To evaluate the Adobe Extract API, we used the Adobe PDFServices Python SDK¹⁵ to access the API's services.
Apache Tika¹⁶ allows metadata and content extraction in XML format. We used the tika-python¹⁷ client to access the Tika REST API. Unfortunately, we found that tika-python only supports content (paragraphs) extraction.
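The following is a minimal sketch of how we call Tika through the tika-python client; the file name paper.pdf is a placeholder.

```python
from tika import parser  # tika-python talks to a Tika server in the background

# Parse a PDF; the result is a dict with 'metadata' and 'content' keys.
parsed = parser.from_file("paper.pdf")  # "paper.pdf" is a placeholder path

print(parsed["metadata"].get("Content-Type"))  # e.g., 'application/pdf'
print((parsed["content"] or "")[:500])         # plain-text body returned by Tika
```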
¹³ https://www.adobe.io/apis/documentcloud/dcsdk/pdf-extract.html
¹⁴ https://www.adobe.com/de/sensei.html
¹⁵ https://github.com/adobe/pdfservices-python-sdk-samples
¹⁶ https://tika.apache.org/
¹⁷ https://github.com/chrismattmann/tika-python
Table 4: Overview of evaluated information extraction tools.

Tool            Version  Task¹        Technology                    Output
Adobe Extract   1.0      G, T         Adobe Sensei AI Framework     JSON, XLSX
Apache Tika     2.0.0    G            Apache PDFBox                 TXT
Camelot         0.10.1   T            OpenCV, PDFMiner              CSV, Dataframe
CERMINE         1.13     G, M, R      CRF, iText, Rules, SVM        JATS
GROBID          0.7.0    G, M, R, T   CRF, Deep Learning, Pdfalto   TEI XML
PdfAct          n/a      G, M, R, T   pdftotext, rules              JSON, TXT, XML
PyMuPDF         1.19.1   G            OCR, tesseract                TXT
RefExtract      0.2.5    R            pdftotext, rules              TXT
Science Parse   1.0      G, M, R      CRF, pdffigures2, rules       JSON
Tabula          1.2.1    T            PDFBox, rules                 CSV, Dataframe

¹ (G) General, (M) Metadata, (R) References, (T) Table
Camelot¹⁸ can extract tables using either the Stream or the Lattice mode. The former uses whitespace between cells and the latter table borders for table cell identification. For our experiments, we exclusively use the Stream mode, since our test documents are academic papers, in which tables typically use whitespace instead of cell borders to delineate cells. The Stream mode internally utilizes the PDFMiner library¹⁹ to extract characters that are subsequently grouped into words and sentences using whitespace margins.
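As an illustration, a minimal sketch of invoking Camelot's Stream mode from Python; the file name and page selection are placeholders.

```python
import camelot

# Stream mode infers table cells from whitespace (no ruling lines required).
tables = camelot.read_pdf("paper.pdf", flavor="stream", pages="1-end")

print(len(tables))               # number of detected table regions
df = tables[0].df                # each table is exposed as a pandas DataFrame
tables[0].to_csv("table0.csv")   # export for comparison against the ground truth
```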
CERMINE [55] offers metadata, reference, and general extraction capabilities. The tool employs the iText PDF toolkit²⁰ for character extraction and the Docstrum²¹ image segmentation algorithm for page segmentation of document images. CERMINE uses an SVM classifier implemented using the LibSVM²² library and rule-based algorithms for metadata extraction. For reference extraction, the tool employs k-means clustering and Conditional Random Fields implemented using the MALLET²³ toolkit for sequence labeling. CERMINE returns a single XML file containing the annotations for an entire PDF. We employ the Beautiful Soup²⁴ library to filter CERMINE's output files for the annotations relevant to our evaluation.
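A sketch of how such filtering can look; the tag names follow the JATS/NLM convention used by CERMINE's output, but the specific element names queried here and the file name are assumptions.

```python
from bs4 import BeautifulSoup

# Parse CERMINE's JATS/NLM XML output for one document (path is a placeholder).
with open("paper.cermxml", encoding="utf-8") as fh:
    soup = BeautifulSoup(fh, "xml")  # the "xml" parser requires lxml

# Assumed JATS element names: <article-title> for the title and
# <ref> entries (inside <ref-list>) for bibliographic references.
title = soup.find("article-title")
references = [ref.get_text(" ", strip=True) for ref in soup.find_all("ref")]

print(title.get_text(strip=True) if title else "no title found")
print(len(references), "reference strings")
```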
GROBID²⁵ [35] supports all four extraction tasks. The tool allows using either feature-engineered CRF (default) or a combination of CRF and DL models realized using the DeLFT²⁶ Deep Learning library, which is based on TensorFlow and Keras. GROBID uses a cascade of sequence labeling models for different components. The models in the model cascade use individual label sequencing
¹⁸ https://github.com/camelot-dev/camelot
¹⁹ https://github.com/pdfminer/pdfminer.six
²⁰ https://github.com/itext
²¹ https://github.com/chulwoopack/docstrum
²² https://github.com/cjlin1/libsvm
²³ http://mallet.cs.umass.edu/sequences.php
²⁴ https://www.crummy.com/software/BeautifulSoup/bs4/doc/
²⁵ https://github.com/kermitt2/grobid
²⁶ https://github.com/kermitt2/delft
algorithms and features; some models employ tokenizers. This approach offers
flexibility by allowing model tuning and improves the model’s maintainability.
We evaluate the default CRF model with production settings (a recommended
setting to improve the performance and availability of the GROBID server, ac-
cording to the tool's documentation²⁷).
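A minimal sketch of querying a running GROBID server over its REST interface; the processFulltextDocument endpoint is GROBID's standard full-text service, while the host, port, and file name are placeholders.

```python
import requests

GROBID_URL = "http://localhost:8070/api/processFulltextDocument"  # default local server

# Send one PDF to the GROBID service and receive TEI XML in return.
with open("paper.pdf", "rb") as fh:
    response = requests.post(GROBID_URL, files={"input": fh}, timeout=120)

response.raise_for_status()
tei_xml = response.text  # TEI XML with header metadata, body text, and references
print(tei_xml[:300])
```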
PdfAct, formerly called Icecite [5], is a rule-based tool that supports all four extraction tasks, including the extraction of appendices, acknowledgments, and tables of contents. The tool uses the PDFBox²⁸ and pdftotext²⁹ PDF manipulation and content extraction libraries. We use the tool's JAR release³⁰.
PyMuPDF³¹ extends the MuPDF³² viewer library with font and image extraction, PDF joining, and file embedding. PyMuPDF uses tesseract³³ for OCR. PyMuPDF could not process files whose names include special characters.
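A minimal sketch of the PyMuPDF calls used for plain-text extraction; the file name is a placeholder.

```python
import fitz  # PyMuPDF

# Open a PDF and extract the plain text of every page.
doc = fitz.open("paper.pdf")
pages = [page.get_text("text") for page in doc]  # one string per page
doc.close()

print(len(pages), "pages extracted")
print(pages[0][:300])
```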
RefExtract³⁴ is a reference extraction tool that uses pdftotext³⁵ and regular expressions. RefExtract returns annotations for the entire bibliography of a document. The ground-truth annotations in our dataset (cf. Section 3.2), however, pertain to individual pages of documents and do not always cover the entire document. If ground-truth annotations are only available for a subset of the references in a document, we use regular expressions to filter RefExtract's output to those references with ground-truth labels.
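The sketch below illustrates this post-processing step under stated assumptions: extract_references_from_file is assumed to be RefExtract's Python entry point, and the filtering pattern is a simplified, hypothetical stand-in for the expressions actually used.

```python
import re
from refextract import extract_references_from_file  # assumed entry point

# Extract all reference records RefExtract finds in the document (placeholder path).
references = extract_references_from_file("paper.pdf")

# Keep only references that also appear in the page-level ground truth,
# matched here by a simplified pattern on example author surnames.
ground_truth_authors = ["Hamada", "Shiu"]  # hypothetical example values
pattern = re.compile("|".join(map(re.escape, ground_truth_authors)), re.IGNORECASE)

filtered = [ref for ref in references if pattern.search(str(ref))]
print(len(filtered), "of", len(references), "references kept")
```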
Science Parse³⁶ uses a CRF model trained on data from GROBID to extract the title, author, and references. It also employs a rule-based algorithm by Clark and Divvala [9] to extract sections and paragraphs in JSON format.
Tabula³⁷ is a table extraction tool. Analogous to Camelot, Tabula offers a Stream mode realized using PDFBox and a Lattice mode realized using OpenCV for table cell recognition.
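For illustration, a minimal sketch of calling Tabula's Stream mode through the tabula-py wrapper; the file name is a placeholder and a Java runtime is required.

```python
import tabula  # tabula-py wraps the Java Tabula library

# Stream mode, analogous to Camelot: cells are inferred from whitespace.
tables = tabula.read_pdf("paper.pdf", pages="all", stream=True)

print(len(tables))        # list of pandas DataFrames, one per detected table
print(tables[0].head())   # preview the first extracted table
```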
3.2 Dataset
We use the DocBank³⁸ dataset, created by Li et al. [31], for our experiments. Figure 1 visualizes the process for compiling the dataset. First, the creators gathered arXiv documents for which both the PDF and the LaTeX source code were available. Li et al. then edited the LaTeX code to enable accurate automated annotation of content elements in the PDF version of the documents. For this purpose, they inserted commands that formatted content elements in specific
²⁷ https://GROBID.readthedocs.io/en/latest/Troubleshooting/
²⁸ http://pdfbox.apache.org/
²⁹ https://github.com/jalan/pdftotext
³⁰ https://github.com/ad-freiburg/pdfact
³¹ https://github.com/pymupdf/PyMuPDF
³² https://mupdf.com/
³³ https://github.com/tesseract-ocr/tesseract
³⁴ https://github.com/inspirehep/refextract
³⁵ https://linux.die.net/man1/pdftotext
³⁶ https://github.com/allenai/science-parse
³⁷ https://github.com/chezou/tabula-py
³⁸ https://github.com/doc-analysis/DocBank
Fig. 1: Process for generating the DocBank dataset: data acquisition from arXiv (PDF and .tex), semantic structure detection, and token annotation, e.g., rewriting \section{Section1} to \section{{\color{fontcolor}{Section1}}}, resulting in a tab-separated ground-truth file.
colors. The center part of Figure 3 shows the mapping of content elements to
colors. In the last step, the dataset creators used PDFPlumber³⁹ and PDFMiner to extract and annotate relevant content elements by their color. DocBank provides the annotations as separate files for each document page in the dataset.
Table 5 shows the structure of the tab-separated ground-truth files. Each
line in the file refers to one component on the page and is structured as follows.
Index 0 represents the token itself, e.g., a word. Indices 1-4 denote the bounding
box information of the token, where (x0, y0) represents the top-left and (x1, y1)
the bottom-right corner of the token in the PDF coordinate space. Indices 5-7
reflect the token’s color in RGB notation, index 8 the token’s font, and index 9
the label for the type of the content element. Each ground-truth file adheres to
the naming scheme shown in Figure 2.
Table 5: Structure of DocBank's plaintext ground-truth files.

Index    0      1   2   3   4   5  6  7  8          9
Content  token  x0  y0  x1  y1  R  G  B  font name  label

Source: https://doc-analysis.github.io/docbank-page/index.html.
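To make the layout in Table 5 concrete, the following sketch parses one DocBank ground-truth file into per-token records. The file name is a placeholder, and integer coordinates, integer color values, and lowercase label strings such as "reference" are assumptions about DocBank's released files.

```python
from typing import Dict, List

def read_docbank_page(path: str) -> List[Dict]:
    """Parse a tab-separated DocBank ground-truth file (structure as in Table 5)."""
    tokens = []
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            row = line.rstrip("\n").split("\t")
            if len(row) < 10:
                continue  # skip malformed lines
            tokens.append({
                "token": row[0],
                "bbox": tuple(int(v) for v in row[1:5]),   # x0, y0, x1, y1
                "color": tuple(int(v) for v in row[5:8]),  # R, G, B
                "font": row[8],
                "label": row[9],
            })
    return tokens

# Example: collect all tokens labeled as references on one page (placeholder path).
page_tokens = read_docbank_page("sample_page.txt")
references = [t["token"] for t in page_tokens if t["label"] == "reference"]
```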
Fig. 2: Naming scheme for DocBank’s ground-truth files.
³⁹ https://github.com/jsvine/pdfplumber
The DocBank dataset offers ground-truth annotations for 1.5M content ele-
ments on 500K pages. Li et al. extracted the pages from arXiv papers in Physics,
Mathematics, Computer Science, and numerous other fields published between
2014 and 2018. DocBank’s large size, recency, diversity of included documents,
number of annotated content elements, and high annotation quality due to the
weakly supervised labeling approach make it an ideal choice for our purposes.
3.3 Evaluation Procedure
Fig. 3: Overview of the procedure for comparing content elements extracted by IE tools to the ground-truth annotations and computing evaluation metrics. The pipeline assembles labeled data and PDFs into a PDF Object (annotated data, file name, page number, file path), parses the tool output (TXT, JSON, XML, XLSX) and the ground truth for a selected element (abstract, title, author, caption, equation, list, footer, reference, paragraph, section, table) into data frames with separate or collated tokens, and computes a similarity matrix and the evaluation metrics (Levenshtein Ratio, precision, recall, accuracy, F1 score).
Figure 3 shows our evaluation procedure. First, we select the PDF files whose associated ground-truth files contain relevant labels. For example, we search for ground-truth files containing reference tokens to evaluate reference extraction tools. We include the PDF file, the ground-truth file, the document ID and page number obtainable from the file name (cf. Figure 2), and the file path in a self-defined Python object (see PDF Object in Figure 3).
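A minimal sketch of such a container; the field names mirror the PDF Object shown in Figure 3, while the exact types are assumptions.

```python
from dataclasses import dataclass
import pandas as pd

@dataclass
class PDFObject:
    """Bundles one document page with its ground-truth annotations (cf. Figure 3)."""
    annotated_data: pd.DataFrame  # parsed ground-truth tokens for this page
    file_name: str                # DocBank file name (encodes document ID and page)
    page_number: int
    file_path: str                # path to the corresponding PDF file
```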
Then, the evaluation process splits into two branches whose goal is to create
two pandas data frames—one holding the relevant ground-truth data, and the
other the output of an information extraction tool. For this purpose, both the
ground-truth files and the output files of IE tools are parsed and filtered for
the relevant content elements. For example, to evaluate reference extraction via CERMINE, we exclusively parse reference tags from CERMINE's XML output file into a data frame (see Extracted DF in Figure 3).
Finally, we convert both the ground-truth data frame and the extracted data frame into two formats for comparison and computing performance metrics. The first is the separate tokens format, in which every token is represented as a row in the data frame. The second is the collated tokens format, in which all tokens are combined into a single space-delimited row in the data frame. Separate tokens serve to compute a strict score for token-level extraction quality, whereas collated tokens yield a more lenient score intended to reflect a tool's average extraction quality for a class of content elements. We will explain the idea of both scores and their computation hereafter.
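Both token formats can be derived from the same token list, as sketched below; the column name token is an assumption about the internal data frame layout, and the token values are example data.

```python
import pandas as pd

tokens = ["Yuta", "Hamada", "Gary", "Shiu"]  # example ground-truth tokens

# Separate tokens: one row per token, used for the strict token-level scores.
separate_df = pd.DataFrame({"token": tokens})

# Collated tokens: all tokens joined into a single space-delimited row,
# used for the more lenient Accuracy score.
collated_df = pd.DataFrame({"token": [" ".join(tokens)]})

print(separate_df)
print(collated_df)
```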
We employ the Levenshtein Ratio to quantify the similarity of extracted tokens and the ground-truth data for both the separate tokens and collated tokens format. Equation (1) defines the computation of the Levenshtein distance of the extracted tokens $t_e$ and the ground-truth tokens $t_g$.
$$
\mathrm{lev}_{t_e,t_g}(i,j) =
\begin{cases}
\max(i,j) & \text{if } \min(i,j) = 0,\\[4pt]
\min
\begin{cases}
\mathrm{lev}_{t_e,t_g}(i-1,j) + 1\\
\mathrm{lev}_{t_e,t_g}(i,j-1) + 1\\
\mathrm{lev}_{t_e,t_g}(i-1,j-1) + 1_{(t_{e_i} \neq t_{g_j})}
\end{cases} & \text{otherwise.}
\end{cases}
\tag{1}
$$
Equation (2) defines the derived Levenshtein Ratio score (γ).
$$
\gamma(t_e, t_g) = 1 - \frac{\mathrm{lev}_{t_e,t_g}(i,j)}{|t_e| + |t_g|}
\tag{2}
$$
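A direct transcription of Equations (1) and (2) as a sketch; the released framework may instead rely on an off-the-shelf string-similarity library, whose ratio can differ slightly in how substitutions are weighted.

```python
def levenshtein_distance(t_e: str, t_g: str) -> int:
    """Compute lev_{t_e,t_g}(|t_e|, |t_g|) via the recurrence in Equation (1)."""
    m, n = len(t_e), len(t_g)
    # dp[i][j] holds lev(i, j) for the prefixes t_e[:i] and t_g[:j].
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        for j in range(n + 1):
            if min(i, j) == 0:
                dp[i][j] = max(i, j)
            else:
                cost = 1 if t_e[i - 1] != t_g[j - 1] else 0
                dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                               dp[i][j - 1] + 1,          # insertion
                               dp[i - 1][j - 1] + cost)   # substitution / match
    return dp[m][n]

def levenshtein_ratio(t_e: str, t_g: str) -> float:
    """Levenshtein Ratio gamma as defined in Equation (2)."""
    if not t_e and not t_g:
        return 1.0
    return 1.0 - levenshtein_distance(t_e, t_g) / (len(t_e) + len(t_g))
```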
Equation (3) shows the derivation of the similarity matrix ($\Delta^d$) for a document ($d$), which contains the Levenshtein Ratio ($\gamma$) of every token in the extracted data frame with separate tokens $E^s$ of size $m$ and the ground-truth data frame with separate tokens $G^s$ of size $n$.
$$
\Delta^d_{m \times n} = \gamma\left[E^s_i, G^s_j\right]_{i,j}^{m,n}
\tag{3}
$$
Using the $m \times n$ similarity matrix, we compute the Precision $P^d$ and Recall $R^d$ scores according to Equation (4) and Equation (5), respectively. As the numerator, we use the number of extracted tokens whose Levenshtein Ratio is larger than or equal to 0.7. We chose this threshold for consistency with the experiments by Granitzer et al. [19]. We then compute the $F_1^d$ score according to Equation (6) as a token-level score for a tool's extraction quality.
$$
P^d = \frac{\#\left(\Delta^d_{i,j} \geq 0.7\right)}{m}
\tag{4}
$$
$$
R^d = \frac{\#\left(\Delta^d_{i,j} \geq 0.7\right)}{n}
\tag{5}
$$
$$
F_1^d = \frac{2 \times P^d \times R^d}{P^d + R^d}
\tag{6}
$$
Moreover, we compute the Accuracy score $A^d$ reflecting a tool's average extraction quality for a class of tokens. To obtain $A^d$, we compute the Levenshtein Ratio $\gamma$ of the extracted tokens $E^c$ and ground-truth tokens $G^c$ in the collated tokens format, according to Equation (7).
$$
A^d = \gamma\left[E^c, G^c\right]
\tag{7}
$$
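Putting Equations (3) to (7) together, a sketch of the per-document metric computation; levenshtein_ratio is the function sketched after Equation (2), and the 0.7 threshold follows Granitzer et al. [19].

```python
import numpy as np

THRESHOLD = 0.7  # minimum Levenshtein Ratio counted as a correct extraction

def document_scores(extracted: list, ground_truth: list) -> dict:
    """Compute the similarity matrix (Eq. 3) and P, R, F1, Accuracy (Eq. 4-7)."""
    m, n = len(extracted), len(ground_truth)
    # Similarity matrix Delta^d: ratio of every extracted vs. ground-truth token.
    delta = np.array([[levenshtein_ratio(e, g) for g in ground_truth]
                      for e in extracted])
    hits = int((delta >= THRESHOLD).sum())
    precision = hits / m if m else 0.0
    recall = hits / n if n else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    # Accuracy: ratio of the collated (space-joined) token sequences (Eq. 7).
    accuracy = levenshtein_ratio(" ".join(extracted), " ".join(ground_truth))
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}
```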
Figure 4 and Figure 5 show the similarity matrices for the author names 'Yuta,' 'Hamada,' 'Gary,' and 'Shiu' using separate and collated tokens, respectively. Figure 4 additionally shows an example computation of the Levenshtein Ratio for the strings Gary and Yuta. The strings have a Levenshtein distance of six and a cumulative string length of eight, which results in a Levenshtein Ratio of 0.25 that is entered into the similarity matrix. Figure 5 analogously exemplifies computing the Accuracy score of the two strings using collated tokens.
          Yuta  Hamada  Gary  Shiu1,
Yuta      1.0   0.2     0.25  0.2
Hamada    0.2   1.0     0.2   0.0
Gary      0.25  0.2     1.0   0.0
Shiu      0.25  0.0     0.0   0.8

       Y  u  t  a
    0  1  2  3  4
 G  1  2  3  4  5
 a  2  3  4  5  4
 r  3  4  5  6  5
 y  4  3  4  5  6

Fig. 4: Left: Similarity matrix for author names using separate tokens. Right: Computation of the Levenshtein distance (6) and the optimal edit transcript (yellow highlights) for two author names using dynamic programming.
                         Yuta Hamada Gary Shiu1,
Yuta Hamada Gary Shiu    0.957

Fig. 5: Similarity matrix for two sets of author names using collated tokens.
4 Results
We present the evaluation results grouped by extraction task (see Figures 6–9)
and by tools (see Table 6). This two-fold breakdown of the results facilitates
identifying the best-performing tool for a specific extraction task or content
element and allows for gauging the strengths and weaknesses of tools more easily.
Note that the task-specific result visualizations (Figures 6–9) only include tools
that support the respective extraction task. See Table 4 for an overview of the
evaluated tools and the extraction tasks they support.
Figure 6 shows the cumulative F1 scores of CERMINE, GROBID, PdfAct, and Science Parse for the metadata extraction task, i.e., extracting title, abstract, and authors. Consequently, the best possible cumulative F1 score equals three. Overall, GROBID performs best, achieving a cumulative F1 score of 2.25 and individual F1 scores of 0.91 for title, 0.82 for abstract, and 0.52 for authors. Science Parse (2.03) and CERMINE (1.97) obtain comparable cumulative F1 scores, while PdfAct has the lowest cumulative F1 score of 1.14. However, PdfAct performs second-best for title extraction with an F1 score of 0.85. The performance of all tools is worse for extracting authors than for titles and abstracts. It appears that machine-learning-based approaches like those of CERMINE, GROBID, and Science Parse perform better for metadata extraction than rule-based algorithms like the one implemented in PdfAct⁴⁰.
Figure 7 shows the results for the reference extraction task. With an F1 score of 0.79, GROBID also performs best for this task. CERMINE achieves the second rank with an F1 score of 0.74, while Science Parse and RefExtract share the third rank with identical F1 scores of 0.49. As for the metadata extraction task, PdfAct also achieves the lowest F1 score, 0.15, for reference extraction. While both RefExtract and PdfAct employ pdftotext and regular expressions, GROBID uses a cascade of sequence labeling models⁴¹, each dedicated to a specific document component, which can be the reason for its superior performance [36].
Figure 8 depicts the results for the table extraction task. Adobe Extract outperforms the other tools with an F1 score of 0.47. Camelot (F1 = 0.30), Tabula (F1 = 0.28), and GROBID (F1 = 0.23) perform notably worse than Adobe Extract. Both Camelot and Tabula incorrectly treat two-column articles as tables and table captions as part of the table region, which negatively affects their performance scores. The use of comparable Stream and Lattice modes in Camelot and Tabula (cf. Section 3.1) likely causes the tools' highly similar results. PdfAct did not produce an output for any of our test documents that contain tables, although the tool supposedly supports table extraction. The performance of all tools is significantly lower for table extraction than for other content elements, which is likely caused by the need to extract additional structural information. The difficulty of table extraction is also reflected by numerous issues that users opened on the matter in the GROBID GitHub repository⁴².
⁴⁰ See Table 4 for more information on the tools' extraction approaches.
⁴¹ https://grobid.readthedocs.io/en/latest/Principles/
⁴² https://github.com/kermitt2/grobid/issues/340
              CERMINE  GROBID  PdfAct  Science Parse
Title           0.81    0.91    0.85       0.70
Abstract        0.72    0.82    0.16       0.81
Authors         0.44    0.52    0.13       0.52
Cumulative      1.97    2.25    1.14       2.03

Fig. 6: Results for metadata extraction.
              CERMINE  GROBID  PdfAct  Science Parse  RefExtract
Reference       0.74    0.79    0.15       0.49          0.49

Fig. 7: Results for reference extraction.
          Adobe Extract  Camelot  GROBID  PdfAct  Tabula
Table          0.47        0.30    0.23    0.00    0.28

Fig. 8: Results for table extraction.
Figure 9 visualizes the results for the general extraction task. GROBID achieves the highest cumulative F1 score of 2.38, followed by PdfAct (cumulative F1 = 1.66). The cumulative F1 scores of Science Parse (1.25), which only supports paragraph and section extraction, and CERMINE (1.20) are much lower than GROBID's score and comparable to that of PdfAct. Apache Tika, PyMuPDF, and Adobe Extract can only extract paragraphs.

For paragraph extraction, GROBID (0.90), CERMINE (0.85), and PdfAct (0.85) obtained high F1 scores, with Science Parse (0.76) and Adobe Extract (0.74) following closely. Apache Tika (0.52) and PyMuPDF (0.51) achieved notably lower scores because the tools include other elements like sections, captions, lists, footers, and equations in paragraphs.

Notably, only GROBID achieves a promising F1 score of 0.74 for the extraction of sections. GROBID and PdfAct are the only tools that can partially extract captions. None of the tools is able to extract lists. Only PdfAct supports the extraction of footers but achieves a low F1 score of 0.20. Only GROBID supports equation extraction, but the extraction quality is comparatively low (F1 = 0.25). To reduce the evaluation effort, we first tested the extraction of lists, footers, and equations on a two-month sample of the data covering January and February 2014. If a tool consistently obtained performance scores of 0, we did not continue with its evaluation. Following this procedure, we only evaluated GROBID and PdfAct on the full dataset.
For the general extraction task, GROBID outperforms other tools due to its segmentation model⁴³, which detects the main areas of documents based on layout features. Because imbalanced classes are kept in separate models, frequent content elements like paragraphs do not impact the extraction of rare elements from non-body areas. The cascading models used in GROBID also offer the flexibility to tune each model. Using layout and structure as the basis for the process additionally allows simpler training data to be used.
             Adobe Extract  Apache Tika  CERMINE  GROBID  PdfAct  PyMuPDF  Science Parse
Paragraph        0.74          0.52        0.85    0.90    0.85     0.51       0.76
Section          0.00          0.00        0.35    0.74    0.16     0.00       0.49
Caption          0.00          0.00        0.00    0.49    0.45     0.00       0.00
List             0.00          0.00        0.00    0.00    0.00     0.00       0.00
Footer           0.00          0.00        0.00    0.00    0.20     0.00       0.00
Equation         0.00          0.00        0.00    0.25    0.00     0.00       0.00
Cumulative       0.74          0.52        1.20    2.38    1.66     0.51       1.25

Fig. 9: Results for general data extraction.
The breakdown of results by tools shown in Table 6 underscores the main takeaway from the task-specific results: the tools' results differ greatly for different content elements. No tool performs best for all elements; rather, even tools that perform well overall can fail completely for certain extraction tasks. The large number of content elements whose extraction is either unsupported or only possible in poor quality indicates substantial potential for improvement in future work.
⁴³ https://grobid.readthedocs.io/en/latest/Principles/
Table 6: Results grouped by extraction tool.

Tool¹           Label      # Detected  # Processed²  Acc   F1    P     R
Adobe Extract   Table           1,635          736   0.52  0.47  0.45  0.49
                Paragraph       3,985        3,088   0.85  0.74  0.72  0.76
Apache Tika     Paragraph     339,603      258,582   0.55  0.52  0.43  0.65
Camelot         Table          16,289       11,628   0.27  0.30  0.23  0.44
CERMINE         Title          16,196       14,501   0.84  0.81  0.81  0.81
                Author         19,788       14,797   0.43  0.44  0.44  0.46
                Abstract       19,342       16,716   0.71  0.72  0.68  0.76
                Reference      40,333       35,193   0.80  0.74  0.71  0.77
                Paragraph     361,273      348,160   0.89  0.85  0.83  0.87
                Section       163,077      139,921   0.40  0.35  0.32  0.38
GROBID          Title          16,196       16,018   0.92  0.91  0.91  0.92
                Author         19,788       19,563   0.54  0.52  0.52  0.53
                Abstract       19,342       18,714   0.82  0.82  0.81  0.83
                Reference      40,333       36,020   0.82  0.79  0.79  0.80
                Paragraph     361,273      358,730   0.90  0.90  0.89  0.91
                Section       163,077      163,037   0.77  0.74  0.73  0.76
                Caption        90,606       62,445   0.57  0.49  0.47  0.51
                Table          16,740        8,633   0.24  0.23  0.23  0.23
                Equation      142,736       96,560   0.26  0.25  0.20  0.32
PdfAct          Title          17,670       16,834   0.85  0.85  0.85  0.86
                Author         13,110        2,187   0.14  0.13  0.12  0.18
                Abstract       21,470        4,683   0.17  0.16  0.15  0.20
                Reference      30,263       12,705   0.19  0.15  0.17  0.20
                Paragraph     361,318      357,905   0.85  0.85  0.80  0.89
                Section       129,361       87,605   0.21  0.16  0.12  0.25
                Caption        83,435       53,314   0.45  0.45  0.40  0.52
                Footer         32,457       26,252   0.23  0.20  0.25  0.16
PyMuPDF         Paragraph     339,650      258,383   0.55  0.51  0.41  0.65
RefExtract      Reference      40,333       38,405   0.55  0.49  0.44  0.55
Science Parse   Title          11,696       11,687   0.79  0.70  0.70  0.70
                Author            471          471   0.54  0.52  0.52  0.53
                Abstract       14,150       14,149   0.83  0.81  0.73  0.90
                Reference      40,333       35,200   0.55  0.49  0.49  0.50
                Paragraph     361,318      355,529   0.79  0.76  0.76  0.76
                Section       163,077      158,556   0.54  0.49  0.49  0.50
Tabula          Table          10,361        9,456   0.29  0.28  0.20  0.46

¹ Boldface indicates the best value for each content element type.
² The differences in the number of detected and processed items are due to PDF Read Exceptions or Warnings. We label an item as processed if it has a non-zero F1 score.
5 Conclusion and Future Work
We present an open evaluation framework for information extraction from aca-
demic PDF documents. Our framework uses the DocBank dataset [31] offering
12 types and 1.5M annotated instances of content elements contained in 500K
pages of arXiv papers from multiple disciplines. The dataset is larger and more topically diverse than most related datasets and supports more extraction tasks.
We use the newly developed framework to benchmark the performance of ten
freely available tools in extracting document metadata, bibliographic references,
tables, and other content elements in academic PDF documents. GROBID achieves the best results for the metadata and reference extraction tasks, followed by CERMINE and Science Parse. For table extraction, Adobe Extract outperforms
other tools, even though the performance is much lower than for other content
elements. All tools struggle to extract lists, footers, and equations.
While DocBank covers more disciplines than other datasets, we see further
diversification of the collection in terms of disciplines, document types, and con-
tent elements as a valuable task for future research. Table 2 shows that more
datasets suitable for information extraction from PDF documents are available
but unused thus far. The weakly supervised annotation approach used for creat-
ing the DocBank dataset is transferable to other LaTeX document collections.
Apart from the dataset, our framework can incorporate additional tools and
allows easy replacement of tools in case of updates. We intend to update and
extend our performance benchmark in the future.
The extraction of tables, equations, footers, lists, and similar content ele-
ments poses the toughest challenge for tools in our benchmark. In recent work,
Grennan et al. [20] showed that using synthetic datasets for model training can improve citation parsing. A similar approach could also be a promising direction for improving access to currently hard-to-extract content elements.
Combining extraction approaches could lead to a one-fits-all extraction tool, which we consider desirable. The Sciencebeam-pipelines⁴⁴ project currently undertakes initial steps toward that goal. We hope that our evaluation framework will help to support this line of research by facilitating performance benchmarks of IE tools as part of a continuous development and integration process.

⁴⁴ https://github.com/elifesciences/sciencebeam-pipelines
References
1. Ahmed, M.W., Afzal, M.T.: FLAG-PDFe: Features Oriented Metadata Extraction
Framework for Scientific Publications. IEEE Access 8, 99458–99469 (May 2020).
https://doi.org/10.1109/ACCESS.2020.2997907
2. Aiello, M., Monz, C., Todoran, L., Worring, M.: Document Understanding for
a Broad Class of Documents. International Journal on Document Analysis and
Recognition 5(1) (Aug 2002). https://doi.org/10.1007/s10032-002-0080-x
3. Anzaroot, S., Passos, A., Belanger, D., McCallum, A.: Learning Soft Linear Con-
straints with Application to Citation Field Extraction. In: Proceedings of the 52nd
Annual Meeting of the Association for Computational Linguistics (Volume 1: Long
Papers). pp. 593–602. Association for Computational Linguistics, Baltimore, Mary-
land (2014). https://doi.org/10.3115/v1/P14-1056
4. Azimjonov, J., Alikhanov, J.: Rule Based Metadata Extraction Framework
from Academic Articles. arXiv CoRR 1807.09009v1 [cs.IR], 1–10 (2018).
https://doi.org/10.48550/arXiv.1807.09009
5. Bast, H., Korzen, C.: The Icecite Research Paper Management System. In: Web
Information Systems Engineering – WISE 2013, vol. 8181, pp. 396–409. Springer
Berlin, Heidelberg, Nanjing, China (2013). https://doi.org/10.1007/978-3-642-41154-0_30
6. Bast, H., Korzen, C.: A Benchmark and Evaluation for Text Extraction from PDF.
In: 2017 ACM/IEEE Joint Conference on Digital Libraries (JCDL). pp. 1–10.
IEEE, Toronto, ON, Canada (2017). https://doi.org/10.1109/JCDL.2017.7991564
7. Bhardwaj, A., Mercier, D., Dengel, A., Ahmed, S.: DeepBIBX: Deep Learning
for Image Based Bibliographic Data Extraction. In: Proceedings of the 24th In-
ternational Conference on Neural Information Processing. LNCS, vol. 10635, pp.
286–293. Springer, Guangzhou, China (2017). https://doi.org/10.1007/978-3-319-70096-0_30
8. Borkar, V., Deshmukh, K., Sarawagi, S.: Automatic Segmentation of Text
into Structured Records. SIGMOD Record 30(2), 175–186 (Jun 2001).
https://doi.org/10.1145/376284.375682
9. Clark, C., Divvala, S.: PDFFigures 2.0: Mining Figures from Research Papers. In:
Proceedings of the 16th ACM/IEEE-CS on Joint Conference on Digital Libraries.
pp. 143–152. JCDL ’16, Association for Computing Machinery, New York, NY,
USA (2016). https://doi.org/10.1145/2910896.2910904
10. Cortez, E., da Silva, A.S., Gonçalves, M.A., Mesquita, F., de Moura, E.S.: FLUX-
CIM: Flexible Unsupervised Extraction of Citation Metadata. In: Proceedings
of the 7th ACM/IEEE-CS Joint Conference on Digital Libraries. pp. 215–224.
JCDL ’07, Association for Computing Machinery, New York, NY, USA (2007).
https://doi.org/10.1145/1255175.1255219
11. Cortez, E., da Silva, A.S., Gonçalves, M.A., Mesquita, F., de Moura, E.S.: A
Flexible Approach for Extracting Metadata from Bibliographic Citations. JASIST
60(6), 1144–1158 (Jun 2009). https://doi.org/10.1002/asi.21049
12. Councill, I., Giles, C.L., Kan, M.Y.: ParsCit: an Open-source CRF Reference
String Parsing Package. In: Proceedings of the Sixth International Conference on
Language Resources and Evaluation. European Language Resources Association,
Marrakech, Morocco (2008), https://aclanthology.org/L08-1291/
13. Cui, B.G., Chen, X.: An Improved Hidden Markov Model for Literature Meta-
data Extraction. In: Advanced Intelligent Computing Theories and Applications,
vol. 6215, pp. 205–212. Springer Berlin Heidelberg, Changsha, China (2010).
https://doi.org/10.1007/978-3-642-14922-1_26
14. Day, M.Y., Tsai, R.T.H., Sung, C.L., Hsieh, C.C., Lee, C.W., Wu, S.H., Wu, K.P.,
Ong, C.S., Hsu, W.L.: Reference metadata extraction using a hierarchical knowl-
edge representation framework. Decision Support Systems 43(1), 152–167 (Feb
2007). https://doi.org/10.1016/j.dss.2006.08.006
15. De La Torre, M., Aguirre, C., Anshutz, B., Hsu, W.: MATESC: Metadata-analytic
text extractor and section classifier for scientific publications. In: Proceedings of
the 10th International Joint Conference on Knowledge Discovery, Knowledge En-
gineering and Knowledge Management. vol. 1, pp. 261–267. SciTePress (2018).
https://doi.org/10.5220/0006937702610267
16. Fan, T., Liu, J., Qiu, Y., Jiang, C., Zhang, J., Zhang, W., Wan, J.: PARDA: A
Dataset for Scholarly PDF Document Metadata Extraction Evaluation. In: Col-
laborative Computing: Networking, Applications and Worksharing. pp. 417–431.
Springer, Cham (2019). https://doi.org/10.1007/978-3-030-12981-1_29
17. Färber, M., Thiemann, A., Jatowt, A.: A High-Quality Gold Standard for Citation-
based Tasks. In: Proceedings of the Eleventh International Conference on Language
Resources and Evaluation. European Language Resources Association, Miyazaki,
Japan (2018), https://aclanthology.org/L18-1296
18. Giuffrida, G., Shek, E.C., Yang, J.: Knowledge-Based Metadata Extraction from
PostScript Files. In: Proceedings of the Fifth ACM Conference on Digital Libraries.
pp. 77–84. DL ’00, Association for Computing Machinery, New York, NY, USA
(2000). https://doi.org/10.1145/336597.336639
19. Granitzer, M., Hristakeva, M., Jack, K., Knight, R.: A Comparison of Metadata
Extraction Techniques for Crowdsourced Bibliographic Metadata Management. In:
Proceedings of the 27th Annual ACM Symposium on Applied Computing. pp. 962–
964. SAC ’12, Association for Computing Machinery, New York, NY, USA (2012).
https://doi.org/10.1145/2245276.2245462
20. Grennan, M., Beel, J.: Synthetic vs. Real Reference Strings for Citation Pars-
ing, and the Importance of Re-training and Out-Of-Sample Data for Meaning-
ful Evaluations: Experiments with GROBID, GIANT and CORA. In: Proceed-
ings of the 8th International Workshop on Mining Scientific Publications. pp.
27–35. Association for Computational Linguistics, Wuhan, China (2020), https://aclanthology.org/2020.wosp-1.4
21. Grennan, M., Schibel, M., Collins, A., Beel, J.: GIANT: The 1-Billion Annotated
Synthetic Bibliographic-Reference-String Dataset for Deep Citation Parsing [Data]
(2019). https://doi.org/10.7910/DVN/LXQXAO
22. Hashmi, A.M., Afzal, M.T., ur Rehman, S.: Rule Based Approach to Ex-
tract Metadata from Scientific PDF Documents. In: 2020 5th Interna-
tional Conference on Innovative Technologies in Intelligent Systems and In-
dustrial Applications (CITISIA). pp. 1–4. IEEE, Sydney, Australia (2020).
https://doi.org/10.1109/CITISIA50690.2020.9371784
23. Hetzner, E.: A Simple Method for Citation Metadata Extraction Using Hidden
Markov Models. In: Proceedings of the 8th ACM/IEEE-CS Joint Conference on
Digital Libraries. pp. 280–284. JCDL ’08, Association for Computing Machinery,
New York, NY, USA (2008). https://doi.org/10.1145/1378889.1378937
24. Kasdorf, W.E.: The Columbia Guide to Digital Publishing. Columbia University
Press, USA (2003)
25. Kern, R., Jack, K., Hristakeva, M.: TeamBeam - Meta-Data Extrac-
tion from Scientific Literature. D-Lib Magazine 18(7/8) (Jul 2012).
https://doi.org/10.1045/july2012-kern
26. Klein, D., Manning, C.D.: Conditional Structure versus Conditional Estima-
tion in NLP Models. In: Proceedings of the 2002 Conference on Empirical
Methods in Natural Language Processing (EMNLP). pp. 9–16. Association
for Computational Linguistics, Pennsylvania, Philadelphia, PA, USA (2002).
https://doi.org/10.3115/1118693.1118695
27. Klink, S., Dengel, A., Kieninger, T.: Document Structure Analysis Based on Layout
and Textual Features. In: IAPR International Workshop on Document Analysis
Systems. IAPR, Rio de Janeiro, Brazil (2000)
28. Körner, M., Ghavimi, B., Mayr, P., Hartmann, H., Staab, S.: Evaluating Reference
String Extraction Using Line-Based Conditional Random Fields: A Case Study
with German Language Publications. In: New Trends in Databases and Information
Systems. pp. 137–145. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-
67162-8_15
29. Lafferty, J.D., McCallum, A., Pereira, F.C.N.: Conditional Random Fields: Prob-
abilistic Models for Segmenting and Labeling Sequence Data. In: Proceedings
of the Eighteenth International Conference on Machine Learning. pp. 282–289.
ICML ’01, Morgan Kaufmann Publishers Inc., San Francisco, CA, USA (2001),
https://dl.acm.org/doi/10.5555/645530.655813
30. Ley, M.: DBLP: Some Lessons Learned. Proc. VLDB Endowment 2(2), 1493–1500
(Aug 2009). https://doi.org/10.14778/1687553.1687577
31. Li, M., Xu, Y., Cui, L., Huang, S., Wei, F., Li, Z., Zhou, M.: DocBank: A
Benchmark Dataset for Document Layout Analysis. In: Proceedings of the 28th
International Conference on Computational Linguistics. pp. 949–960. Interna-
tional Committee on Computational Linguistics, Barcelona, Spain (Online) (2020).
https://doi.org/10.18653/v1/2020.coling-main.82
32. Lipinski, M., Yao, K., Breitinger, C., Beel, J., Gipp, B.: Evaluation of Header
Metadata Extraction Approaches and Tools for Scientific PDF Documents. In:
Proceedings of the 13th ACM/IEEE-CS Joint Conference on Digital Libraries. pp.
385–386. JCDL ’13, Association for Computing Machinery, New York, NY, USA
(2013). https://doi.org/10.1145/2467696.2467753
33. Livathinos, N., Berrospi, C., Lysak, M., Kuropiatnyk, V., Nassar, A., Car-
valho, A., Dolfi, M., Auer, C., Dinkla, K., Staar, P.: Robust PDF Doc-
ument Conversion using Recurrent Neural Networks. Proceedings of the
AAAI Conference on Artificial Intelligence 35(17), 15137–15145 (May 2021).
https://doi.org/10.1609/aaai.v35i17.17777
34. Lo, K., Wang, L.L., Neumann, M., Kinney, R., Weld, D.: S2ORC: The Seman-
tic Scholar Open Research Corpus. In: Proceedings of the 58th Annual Meeting
of the Association for Computational Linguistics. pp. 4969–4983. Association for
Computational Linguistics, Online (2020). https://doi.org/10.18653/v1/2020.acl-
main.447
35. Lopez, P.: GROBID (2008), https://github.com/kermitt2/grobid
36. Lopez, P.: GROBID: Combining Automatic Bibliographic Data Recognition and
Term Extraction for Scholarship Publications. In: Research and Advanced Technol-
ogy for Digital Libraries, LNCS, vol. 5714, pp. 473–474. Springer Berlin Heidelberg
(2009). https://doi.org/10.1007/978-3-642-04346-8_62
37. Mao, S., Kim, J., Thoma, G.R.: A Dynamic Feature Generation Sys-
tem for Automated Metadata Extraction in Preservation of Digital Mate-
rials. In: 1st International Workshop on Document Image Analysis for Li-
braries. pp. 225–232. IEEE Computer Society, Palo Alto, CA, USA (2004).
https://doi.org/10.1109/DIAL.2004.1263251
38. Mao, S., Rosenfeld, A., Kanungo, T.: Document structure analysis algorithms: a
literature survey. In: Proceedings Document Recognition and Retrieval X. SPIE
Proceedings, vol. 5010, pp. 197–207. SPIE, Santa Clara, California, USA (Jan
2003). https://doi.org/10.1117/12.476326
39. McCallum, A.K., Nigam, K., Rennie, J., Seymore, K.: Automating the Construc-
tion of Internet Portals with Machine Learning. Information Retrieval 3(2), 127–
163 (Jul 2000). https://doi.org/10.1023/A:1009953814988
40. National Library of Medicine: PubMed, https://pubmed.ncbi.nlm.nih.gov/
41. National Library of Medicine: PubMed Central, https://www.ncbi.nlm.nih.gov/pmc/
42. Ojokoh, B., Zhang, M., Tang, J.: A trigram hidden Markov model for metadata
extraction from heterogeneous references. Information Sciences 181(9), 1538–1551
(May 2011). https://doi.org/10.1016/j.ins.2011.01.014
43. Ororbia, A.G., Wu, J., Khabsa, M., Williams, K., Giles, C.L.: Big Scholarly
Data in CiteSeerX: Information Extraction from the Web. In: Proceedings of
the 24th International Conference on World Wide Web. pp. 597–602. WWW ’15
Companion, Association for Computing Machinery, New York, NY, USA (2015).
https://doi.org/10.1145/2740908.2741736
44. Palmero, G., Dimitriadis, Y.: Structured document labeling and rule extraction
using a new recurrent fuzzy-neural system. In: Proceedings of the Fifth Interna-
tional Conference on Document Analysis and Recognition. pp. 181–184. Springer,
Bangalore, India (1999). https://doi.org/10.1109/ICDAR.1999.791754
45. Peng, F., McCallum, A.: Accurate Information Extraction from Research Pa-
pers using Conditional Random Fields. In: Proceedings of the Human Language
Technology Conference of the North American Chapter of the Association for
Computational Linguistics: HLT-NAACL. pp. 329–336. Association for Compu-
tational Linguistics, Boston, Massachusetts, USA (2004),https://aclanthology.
org/N04-1042
46. Prasad, A., Kaur, M., Kan, M.Y.: Neural ParsCit: a deep learning-based reference
string parser. International Journal on Digital Libraries 19(4), 323–337 (Nov 2018).
https://doi.org/10.1007/s00799-018-0242-1
47. Rizvi, S.T.R., Dengel, A., Ahmed, S.: A Hybrid Approach and Unified Framework
for Bibliographic Reference Extraction. IEEE Access 8, 217231–217245 (Dec 2020).
https://doi.org/10.1109/ACCESS.2020.3042455
48. Rizvi, S.T.R., Lucieri, A., Dengel, A., Ahmed, S.: Benchmarking Object Detection
Networks for Image Based Reference Detection in Document Images. In: 2019
Digital Image Computing: Techniques and Applications (DICTA). pp. 1–8. IEEE,
Perth, WA, Australia (2019). https://doi.org/10.1109/DICTA47822.2019.8945991
49. Rodrigues Alves, D., Colavizza, G., Kaplan, F.: Deep Reference Mining From Schol-
arly Literature in the Arts and Humanities. Frontiers in Research Metrics and
Analytics 3, 21 (Jul 2018). https://doi.org/10.3389/frma.2018.00021
50. Saier, T., Färber, M.: Bibliometric-Enhanced arXiv: A Data Set for Paper-Based
and Citation-Based Tasks. In: Proceedings of the 8th International Workshop
on Bibliometric-enhanced Information Retrieval (BIR). CEUR Workshop Pro-
ceedings, vol. 2345, pp. 14–26. CEUR-WS.org, Cologne, Germany (2019), http://ceur-ws.org/Vol-2345/paper2.pdf
51. Saier, T., Färber, M.: unarXive: A Large Scholarly Data Set with Publications'
Full-Text, Annotated In-Text Citations, and Links to Metadata. Scientometrics
125(3), 3085–3108 (Dec 2020). https://doi.org/10.1007/s11192-020-03382-z
52. Schloss Dagstuhl - Leibniz Center for Informatics, University of Trier: dblp: com-
puter science bibliography, https://dblp.org/
53. Souza, A., Moreira, V., Heuser, C.: ARCTIC: Metadata Extraction from Scientific
Papers in Pdf Using Two-Layer CRF. In: Proceedings of the 2014 ACM Symposium
on Document Engineering. pp. 121–130. DocEng ’14, Association for Computing
Machinery, New York, NY, USA (2014). https://doi.org/10.1145/2644866.2644872
54. Tkaczyk, D., Collins, A., Sheridan, P., Beel, J.: Machine Learning vs. Rules and
Out-of-the-Box vs. Retrained: An Evaluation of Open-Source Bibliographic Ref-
erence and Citation Parsers. In: Proceedings of the 18th ACM/IEEE on Joint
Conference on Digital Libraries. pp. 99–108. JCDL ’18, Association for Computing
Machinery, New York, NY, USA (2018). https://doi.org/10.1145/3197026.3197048
55. Tkaczyk, D., Szostek, P., Fedoryszak, M., Dendek, P.J., Bolikowski, L.: CERMINE:
automatic extraction of structured metadata from scientific literature. Interna-
tional Journal on Document Analysis and Recognition (IJDAR) 18(4), 317–335
(Dec 2015). https://doi.org/10.1007/s10032-015-0249-8
56. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N.,
Kaiser, L., Polosukhin, I.: Attention is All You Need. In: Proceedings of the 31st
International Conference on Neural Information Processing Systems. pp. 6000–
6010. NIPS'17, Curran Associates Inc., Red Hook, NY, USA (2017), https://dl.acm.org/doi/10.5555/3295222.3295349
57. Vilnis, L., Belanger, D., Sheldon, D., McCallum, A.: Bethe Projections for Non-
Local Inference. In: Proceedings of the Thirty-First Conference on Uncertainty in
Artificial Intelligence. pp. 892–901. UAI’15, AUAI Press, Arlington, Virginia, USA
(2015). https://doi.org/10.48550/arXiv.1503.01397