Datasets

This page lists the datasets created or hosted at the lab.

The citation field extraction dataset provides labeled segments for citations extracted from articles in many different academic disciplines.

Wikilinks Dataset [large-scale entity resolution, cross-document coreference]
A dataset for identifying and disambiguating a large set of mentions from the web using Wikipedia. The data contains ~40 million mentions referring to ~3 million entities, extracted from ~10 million webpages.

BibTex Dataset [large-scale entity resolution]
Publicly available BibTeX files that can be used for large-scale entity resolution. The dataset contains more than a million author mentions.

SRAA: Simulated/Real/Aviation/Auto UseNet data [document classification]
73,218 UseNet articles from four discussion groups: simulated auto racing, simulated aviation, real autos, and real aviation. I have often used this data for binary classification (separating real from simulated, and auto from aviation), making the point that the same data can be classified in different ways depending on the user's needs. This is especially interesting for semi-supervised learning. This data was gathered by Andrew McCallum while at Just Research.
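The two binary views of the SRAA data can be illustrated with a minimal sketch. The short group names below (`sim_auto`, `real_aviation`, and so on) are hypothetical placeholders chosen for illustration, not the identifiers used in the distributed files.

```python
# Hypothetical short names for the four discussion groups; the dataset's
# actual group identifiers may differ.
GROUPS = ["sim_auto", "sim_aviation", "real_auto", "real_aviation"]


def real_vs_simulated(group: str) -> str:
    """Binary labeling 1: is the discussion about real or simulated vehicles?"""
    return "real" if group.startswith("real_") else "simulated"


def auto_vs_aviation(group: str) -> str:
    """Binary labeling 2: is the discussion about autos or aviation?"""
    return "auto" if group.endswith("_auto") else "aviation"


# The same four groups yield two different two-class problems.
for g in GROUPS:
    print(g, real_vs_simulated(g), auto_vs_aviation(g))
```

Each labeling puts two of the four groups in each class, so the same articles yield two different two-class problems.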

Cora Citation Matching [reference matching, object correspondence]
Text of citations hand-clustered into groups referring to the same paper.

Cora Research Paper Classification [relational document classification]
Research papers classified into a topic hierarchy with 73 leaves. We call this a relational data set, because the citations provide relations among papers.

Cora Information Extraction [information extraction]
Research paper headers and citations, with labeled segments for authors, title, institutions, venue, date, page numbers and several other fields.

Frequently Asked Questions [information extraction]
Several UseNet FAQs segmented into questions and answers. Data gathered and labeled by Dayne Freitag and Andrew McCallum.

CMU Seminar Announcements [information extraction]
48 emailed seminar announcements, with labeled segments for speaker, title, start-time, and end-time. Labeled by Dayne Freitag.

Industry Sector [document classification]
Corporate web pages classified into a topic hierarchy with about 70 leaves.

20 Newsgroups [document classification]
About 20,000 UseNet postings from 20 newsgroups. Gathered by Ken Lang at CMU in the mid-1990s. This is the original set, without the various edits made by Jason Rennie and others.