Entity Resolution

Wikilinks Dataset [large-scale entity resolution, cross-document coreference]

Identifying and disambiguating a large set of mentions from the web using Wikipedia. The data contains ~40 million mentions referring to ~3 million entities, extracted from ~10 million webpages.

BibTex Dataset [large-scale entity resolution]

Publicly available BibTeX files that can be used for large-scale entity resolution. The dataset contains more than a million author mentions.

Cora Citation Matching [reference matching, object correspondence]

Text of citations hand-clustered into groups referring to the same paper.

Information Extraction

UMass Citation Field Extraction Dataset

Labels and segments for citations extracted from articles across many different academic disciplines.

Cora Information Extraction

Research paper headers and citations, with labeled segments for authors, title, institutions, venue, date, page numbers and several other fields.

CMU Seminar Announcements

48 emailed seminar announcements, with labeled segments for speaker, title, start-time, end-time. Labeled by Dayne Freitag.

Document Classification

SRAA: Simulated/Real/Aviation/Auto UseNet data

73,218 UseNet articles from four discussion groups: simulated auto racing, simulated aviation, real autos, and real aviation. I have often used this data for binary classification (separating real from simulated, and auto from aviation), making the point that the same data can be classified in different ways depending on the user's needs. This is especially interesting for semi-supervised learning. This data was gathered by Andrew McCallum while at Just Research.
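As a rough illustration of how the same four groups can yield two different binary tasks, here is a minimal Python sketch. The group labels (`sim_auto`, `sim_aviation`, `real_auto`, `real_aviation`), the helper names, and the in-memory `(text, group)` format are all assumptions made for this example; they are not the dataset's actual newsgroup names or file layout.

```python
# Hypothetical group labels of the form "<real|sim>_<auto|aviation>".
# The actual newsgroup names in the data may differ.
GROUPS = ["sim_auto", "sim_aviation", "real_auto", "real_aviation"]


def real_vs_sim(group):
    """Binary label separating real from simulated."""
    return group.split("_")[0]


def auto_vs_aviation(group):
    """Binary label separating auto from aviation."""
    return group.split("_")[1]


def relabel(docs, labeler):
    """Map (text, four-way group) pairs to (text, binary label) pairs."""
    return [(text, labeler(group)) for text, group in docs]


# Toy documents standing in for UseNet articles (placeholder text only).
docs = [
    ("article one", "sim_auto"),
    ("article two", "real_aviation"),
]

task_a = relabel(docs, real_vs_sim)       # real vs. simulated
task_b = relabel(docs, auto_vs_aviation)  # auto vs. aviation
```

The documents themselves are unchanged between `task_a` and `task_b`; only the labeling function differs, which is what lets one corpus serve two classification problems.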

Cora Research Paper Classification [relational document classification]

Research papers classified into a topic hierarchy with 73 leaves. We call this a relational data set, because the citations provide relations among papers.

Industry Sector

Corporate web pages classified into a topic hierarchy with about 70 leaves.

20 Newsgroups

About 20,000 UseNet postings from 20 newsgroups. Gathered by Ken Lang at CMU in the mid-1990s. This is the original set, without the various edits made by Jason Rennie and others.