Entity Resolution

TypeNet [entity typing]

TypeNet is a hierarchical type system for the task of fine-grained entity typing. It contains 1081 Freebase types and 860 WordNet types, organised in a deep hierarchy with an average depth of 7.8.

A dataset for identifying and disambiguating a large set of mentions from the web using Wikipedia. The data contains ~40 million mentions referring to ~3 million entities, extracted from ~10 million webpages.

Rexa v0.1 [author coreference]

Rexa is a dataset of 1459 scientific author records derived from bibliographic data. The dataset is organized into 8 blocks defined by unique first initial and last name combinations.
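The blocking scheme described above can be sketched as follows. This is only an illustration under stated assumptions: the record layout (a first-name string and a last-name string), the function names `block_key` and `build_blocks`, and the sample names are all hypothetical and are not taken from the Rexa files themselves.

```python
from collections import defaultdict


def block_key(first_name, last_name):
    """Hypothetical blocking key: lowercased first initial plus lowercased last name."""
    return (first_name.strip()[:1].lower(), last_name.strip().lower())


def build_blocks(records):
    """Group (first_name, last_name) records that share a blocking key."""
    blocks = defaultdict(list)
    for first, last in records:
        blocks[block_key(first, last)].append((first, last))
    return dict(blocks)


# Made-up author records, for illustration only.
records = [
    ("Jane", "Smith"),
    ("J.", "Smith"),
    ("John", "Smith"),
    ("Alan", "Jones"),
]

blocks = build_blocks(records)
for key, members in sorted(blocks.items()):
    print(key, members)
```

Note that "Jane Smith" and "John Smith" land in the same block. Blocking of this kind only narrows the candidate set; deciding which records within a block refer to the same person is the coreference task itself.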

BibTeX Dataset [large-scale entity resolution]

Publicly available BibTeX files that can be used for large-scale entity resolution. The dataset contains more than a million author mentions.

Cora Citation Matching [reference matching, object correspondence]

Text of citations hand-clustered into groups referring to the same paper.

Information Extraction

Expert Modeling

Relevance judgments between papers and reviewers. Please refer to this paper for more details.

UMass Citation Field Extraction Dataset

This dataset provides labels and segments for citations extracted from articles across many different academic disciplines.

Cora Information Extraction

Research paper headers and citations, with labeled segments for authors, title, institutions, venue, date, page numbers and several other fields.

CMU Seminar Announcements

48 emailed seminar announcements, with labeled segments for speaker, title, start-time, and end-time. Labeled by Dayne Freitag.

Document Classification

SRAA: Simulated/Real/Aviation/Auto UseNet data

73,218 UseNet articles from four discussion groups: simulated auto racing, simulated aviation, real autos, and real aviation. I have often used this data for binary classification (separating real from simulated, or auto from aviation), making the point that the same data can be classified in different ways depending on the user’s needs. This is especially interesting for semi-supervised learning. This data was gathered by Andrew McCallum while at Just Research.
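The two binary views of the same four groups can be sketched as below. This is a minimal illustration, not the dataset's own tooling: the group identifiers (`sim_auto` and so on), the task names, and the `binary_label` function are hypothetical placeholders rather than the actual newsgroup names or file layout.

```python
# Hypothetical identifiers for the four discussion groups, each mapped to
# its (realism, topic) attributes.
GROUPS = {
    "sim_auto": ("simulated", "auto"),
    "sim_aviation": ("simulated", "aviation"),
    "real_auto": ("real", "auto"),
    "real_aviation": ("real", "aviation"),
}


def binary_label(group, task):
    """Return the binary label of a group under one of two labelings."""
    realism, topic = GROUPS[group]
    if task == "real_vs_simulated":
        return realism
    if task == "auto_vs_aviation":
        return topic
    raise ValueError(f"unknown task: {task}")


# The same article receives a different label depending on the task.
for group in GROUPS:
    print(
        group,
        binary_label(group, "real_vs_simulated"),
        binary_label(group, "auto_vs_aviation"),
    )
```

Under either labeling, each class is the union of two of the four groups, so the documents stay the same and only the target changes.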

Cora Research Paper Classification [relational document classification]

Research papers classified into a topic hierarchy with 73 leaves. We call this a relational data set, because the citations provide relations among papers.

Industry Sector

Corporate web pages classified into a topic hierarchy with about 70 leaves.

20 Newsgroups

About 20,000 UseNet postings from 20 newsgroups. Gathered by Ken Lang at CMU in the mid-1990s. This is the original set, without the various edits made by Jason Rennie and others.