This page lists the datasets created or hosted at the lab.

The citation field extraction dataset provides labels and segments for citations extracted from articles in many different academic

Wikilinks Dataset [large-scale entity resolution, cross-document coreference]
Identifying and disambiguating a large set of mentions from the web using Wikipedia. The data contains ~40 million mentions referring to ~3 million entities, extracted from ~10 million webpages.

BibTex Dataset [large-scale entity resolution]
Publicly available BibTex files that can be used for large-scale entity resolution. The dataset contains more than a million author mentions.

SRAA: Simulated/Real/Aviation/Auto UseNet data [document classification]
73,218 UseNet articles from four discussion groups: simulated auto racing, simulated aviation, real autos, and real aviation. I have often used this data for binary classification (separating real from simulated, and auto from aviation), making the point that the same data can be classified in different ways depending on the user's needs. This is especially interesting for semi-supervised learning. This data was gathered by Andrew McCallum while at Just Research.
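As a rough illustration of how the same four groups support two different binary tasks, here is a minimal sketch. The short group names (`sim_auto` and so on), the task names, and the `label` helper are hypothetical and chosen only for this example; the dataset's actual group or directory names may differ.

```python
# Hypothetical short names for the four discussion groups, each mapped to
# its (realism, domain) pair. These names are assumptions for illustration.
GROUPS = {
    "sim_auto": ("simulated", "auto"),
    "sim_aviation": ("simulated", "aviation"),
    "real_auto": ("real", "auto"),
    "real_aviation": ("real", "aviation"),
}


def label(group: str, task: str) -> str:
    """Return the binary label of a group under the chosen task."""
    realism, domain = GROUPS[group]
    if task == "real_vs_simulated":
        return realism
    if task == "auto_vs_aviation":
        return domain
    raise ValueError(f"unknown task: {task}")


# The same article source receives a different label under each task.
for name in GROUPS:
    print(name, label(name, "real_vs_simulated"), label(name, "auto_vs_aviation"))
```

Each article keeps its original group, and only the mapping from group to class changes, so one corpus yields two distinct classification problems.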

Cora Citation Matching [reference matching, object correspondence]
Text of citations hand-clustered into groups referring to the same paper.

Cora Research Paper Classification [relational document classification]
Research papers classified into a topic hierarchy with 73 leaves. We call this a relational dataset because the citations provide relations among papers.

Cora Information Extraction [information extraction]
Research paper headers and citations, with labeled segments for authors, title, institutions, venue, date, page numbers and several other fields.

Frequently Asked Questions [information extraction]
Several UseNet FAQs segmented into questions and answers. Data gathered and labeled by Dayne Freitag and Andrew McCallum.

CMU Seminar Announcements [information extraction]
48 emailed seminar announcements, with labeled segments for speaker, title, start-time, and end-time. Labeled by Dayne Freitag.

Industry Sector [document classification]
Corporate web pages classified into a topic hierarchy with about 70 leaves.

20 Newsgroups [document classification]
About 20,000 UseNet postings from 20 newsgroups. Gathered by Ken Lang at CMU in the mid-1990s. This is the original set, without the various edits made by Jason Rennie and others.