BibTex Dataset

This is a dataset containing a large number of BibTex files downloaded from the internet. The dataset has been used for large-scale entity resolution (see the publications below).


  • Number of BibTex files: 4,387
  • Number of Papers: 607,335 (correctly parsed)
  • Number of Authors: 1,313,517


The archive file is available here: bibtex.tar.gz (87MB, 385MB uncompressed)

MD5 checksum: 7dfea8b8228dc55b2d6173aa4484becd
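To confirm that a downloaded copy matches the MD5 checksum listed above, something like the following Python sketch could be used. It relies only on the standard library; the function names and the local path "bibtex.tar.gz" are illustrative assumptions, not part of the dataset distribution.

```python
import hashlib

# Checksum published on this page for bibtex.tar.gz.
EXPECTED_MD5 = "7dfea8b8228dc55b2d6173aa4484becd"


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of the file at `path`, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read in 1 MiB chunks so the whole archive is never held in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_archive(path, expected=EXPECTED_MD5):
    """Return True if the file's MD5 digest equals `expected` (case-insensitive)."""
    return md5_of_file(path) == expected.lower()
```

Calling verify_archive("bibtex.tar.gz") on the downloaded file should return True for an intact copy. A False result most likely indicates an incomplete or corrupted download.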

If you use this data in your papers, please use the following citation: bib


This dataset can be processed by any BibTex parser. The Factorie library contains the parser we used (*), along with code to construct the variables and the model for author disambiguation.
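As a quick sanity check before running a full parser, the sketch below walks the archive and roughly counts entries per file using only the Python standard library. It is not the Factorie parser and not a complete BibTex parser: it only looks for lines that open an entry (for example "@article{"), so its counts will not necessarily match the figures above. The function names are hypothetical, and the internal layout of the archive is an assumption; every regular file is treated as BibTex source.

```python
import re
import tarfile

# Matches the start of an entry such as "@article{key," or "@InProceedings(key,".
ENTRY_START = re.compile(r"^[ \t]*@\s*([A-Za-z]+)\s*[{(]", re.MULTILINE)
# Block types that are not bibliographic records.
NON_RECORD_TYPES = {"comment", "string", "preamble"}


def count_entries(text):
    """Roughly count bibliographic entries in one BibTex source string."""
    types = (match.group(1).lower() for match in ENTRY_START.finditer(text))
    return sum(1 for entry_type in types if entry_type not in NON_RECORD_TYPES)


def count_entries_in_archive(path):
    """Return {member name: rough entry count} for regular files in a tar archive."""
    counts = {}
    # "r:*" lets tarfile detect the compression (gzip for a .tar.gz file).
    with tarfile.open(path, "r:*") as archive:
        for member in archive:
            if not member.isfile():
                continue  # skip directories and links
            raw = archive.extractfile(member).read()
            # Files collected from the web vary in encoding; latin-1 decoding
            # never raises, which is enough for counting entry headers.
            counts[member.name] = count_entries(raw.decode("latin-1"))
    return counts
```

For example, sum(count_entries_in_archive("bibtex.tar.gz").values()) gives a rough total across all files. Entries that a real parser rejects as malformed are still counted here, so this is only a coarse check.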


The following papers use this dataset. If you are using this dataset and would like your paper to be added, let us know.


The BibTex files were downloaded and aggregated by Michael Wick. This website is maintained by Michael Wick and Sameer Singh.