A large set of entity mentions from the web, identified and disambiguated using Wikipedia. The data contains ~40 million mentions referring to ~3 million entities, extracted from ~10 million webpages.
Publicly available BibTeX files that can be used for large-scale entity resolution. The dataset contains more than a million author mentions.
Text of citations hand-clustered into groups referring to the same paper.
A citation field extraction dataset that provides labels and segments for citations extracted from articles across many different academic disciplines.
Research paper headers and citations, with labeled segments for authors, title, institutions, venue, date, page numbers and several other fields.
48 emailed seminar announcements, with labeled segments for speaker, title, start-time, and end-time. Labeled by Dayne Freitag.
73,218 UseNet articles from four discussion groups: simulated auto racing, simulated aviation, real autos, and real aviation. I have often used this data for binary classification (separating real from simulated, and auto from aviation), making the point that the same data can be classified in different ways depending on the user's needs. This is especially interesting for semi-supervised learning. This data was gathered by Andrew McCallum while at Just Research.
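As a minimal sketch of how the same four groups support two different binary tasks, the snippet below derives both labelings from a single group identifier. The identifiers (`sim_auto`, `real_aviation`, and so on) and the helper names are hypothetical, chosen only for illustration; the distributed data may name its groups differently.

```python
from collections import Counter

# Hypothetical group identifiers; the actual group names in the data may differ.
GROUPS = ("sim_auto", "sim_aviation", "real_auto", "real_aviation")


def real_vs_simulated(group: str) -> str:
    """Binary label: does the group discuss the real or the simulated activity?"""
    return group.split("_", 1)[0]


def auto_vs_aviation(group: str) -> str:
    """Binary label: is the group about autos or about aviation?"""
    return group.split("_", 1)[1]


def relabel(groups, labeler):
    """Apply one labeling scheme to a sequence of group identifiers."""
    return [labeler(g) for g in groups]


if __name__ == "__main__":
    # A toy sample of article group assignments (not real data).
    sample = ["sim_auto", "real_aviation", "real_auto", "sim_aviation", "real_auto"]
    print(Counter(relabel(sample, real_vs_simulated)))
    print(Counter(relabel(sample, auto_vs_aviation)))
```

The articles themselves never change; only the mapping from group to label does, which is why one collection can serve both classification tasks.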
Research papers classified into a topic hierarchy with 73 leaves. We call this a relational data set, because the citations provide relations among papers.
Corporate web pages classified into a topic hierarchy with about 70 leaves.
About 20,000 UseNet postings from 20 newsgroups. Gathered by Ken Lang at CMU in the mid-1990s. This is the original set, without the various edits made by Jason Rennie and others.