This page lists the datasets created or hosted at the lab.
UMass Citation Field Extraction Dataset [information extraction]
This dataset provides labels and segments for citations extracted from articles from many different academic
Wikilinks Dataset [large-scale entity resolution, cross-document coreference]
A dataset for identifying and disambiguating a large set of mentions from the web using Wikipedia. The data contains ~40 million mentions referring to ~3 million entities, extracted from ~10 million webpages.
BibTeX Dataset [large-scale entity resolution]
Publicly available BibTeX files that can be used for large-scale entity resolution. The dataset contains more than a million author mentions.
SRAA: Simulated/Real/Aviation/Auto UseNet data [document classification]
Cora Citation Matching [reference matching, object correspondence]
Cora Research Paper Classification [relational document classification]
Cora Information Extraction [information extraction]
Frequently Asked Questions [information extraction]
CMU Seminar Announcements [information extraction]
Industry Sector [document classification]
20 Newsgroups [document classification]