The UMass citation field extraction dataset provides labels and segments for citations extracted from articles on arXiv.org. Compared with the previous standard dataset for citation field extraction, it contains more than four times as much data, supplies detailed nested labels rather than coarse-grained flat labels, and is drawn from four academic fields rather than one.
In May 2012 we collected 5,000 research papers in PDF format from arXiv.org: 1,250 papers from each of its sections on physics, mathematics, computer science and quantitative biology. The papers represent a variety of formats and styles, including journal pre-prints, conference papers and technical reports. Text and layout information were extracted using our custom-improved pdf2text system. Five citations per PDF were then manually extracted from 1,200 of those papers, resulting in 6,000 unlabeled citation strings. Of these, 1,829 citation strings have been labeled to date.
Please cite this paper if you use this data set in a publication.
Each of these citation strings is labeled hierarchically, demarcating both coarse-grained segments and the fine-grained segments nested within them.
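As a rough illustration of what hierarchical labeling means, the sketch below represents one citation as nested segments and derives a coarse-to-fine label path for each token. The citation string, the label names (authors, person, first, last, title, date, year) and the Segment class are all hypothetical, chosen only for this example; they are not taken from the dataset's actual label set or file format.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Segment:
    """A labeled token span; children are finer-grained spans inside it."""
    label: str
    start: int  # first token index, inclusive
    end: int    # last token index, exclusive
    children: List["Segment"] = field(default_factory=list)


def token_label_paths(segments: List[Segment], n_tokens: int) -> List[str]:
    """Return one 'coarse/fine/...' label path per token, or 'O' if unlabeled."""
    paths: List[List[str]] = [[] for _ in range(n_tokens)]

    def visit(seg: Segment, prefix: List[str]) -> None:
        path = prefix + [seg.label]
        for i in range(seg.start, seg.end):
            paths[i] = path  # nested segments later overwrite this with a longer path
        for child in seg.children:
            visit(child, path)

    for seg in segments:
        visit(seg, [])
    return ["/".join(p) if p else "O" for p in paths]


# Invented citation, tokenized by hand (not a record from the dataset).
tokens = ["J.", "Smith", ",", "Deep", "parsing", ",", "2012", "."]

# Hypothetical hierarchy: coarse segments with fine segments nested inside.
segments = [
    Segment("authors", 0, 2, [
        Segment("person", 0, 2, [
            Segment("first", 0, 1),
            Segment("last", 1, 2),
        ]),
    ]),
    Segment("title", 3, 5),
    Segment("date", 6, 7, [Segment("year", 6, 7)]),
]

labels = token_label_paths(segments, len(tokens))
# A flat, coarse-grained view is just the first component of each path.
coarse = [label.split("/")[0] for label in labels]

for token, label in zip(tokens, labels):
    print(f"{token}\t{label}")
```

Keeping the full path per token means the same annotation can be read at either granularity: the first path component gives a flat coarse-grained labeling, while the full path preserves the nested structure.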
See our ICML 2013 PEER workshop paper for more information: