Structured Prediction Energy Networks (SPENs)

Structured prediction is an important task that appears in many applications, such as multi-label classification, semantic image segmentation, and sequence labeling. In a structured prediction task, the goal is to learn a mapping from input data to an output structure, where the structure is described by a set of output variables. For example, in multi-label classification the input is raw features (e.g., extracted from documents) and the output is a set of active labels, such as the possible categories of each document. Here there is one output variable per label, and a label belongs to the output set when its variable equals one. The goal of structured prediction energy networks (SPENs) is to describe the interactions among input and output variables using deep neural networks. In this setting, however, exact inference is intractable, so SPENs rely on gradient-descent inference to answer queries. In this project, we are exploring different training algorithms for SPENs as well as applying SPENs to interesting problems.
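
As a rough illustration, the sketch below runs projected gradient-descent inference for a toy multi-label SPEN; the architecture, dimensions, and hyperparameters are invented for the example and are not the published models.

    # Toy SPEN for multi-label classification: a local term over per-label scores plus a
    # small "global" network that models label interactions (hypothetical architecture).
    import torch

    torch.manual_seed(0)
    n_features, n_labels, hidden = 100, 20, 50

    local_net = torch.nn.Linear(n_features, n_labels)          # per-label scores from x
    global_net = torch.nn.Sequential(                          # energy of the label vector y
        torch.nn.Linear(n_labels, hidden),
        torch.nn.Softplus(),
        torch.nn.Linear(hidden, 1),
    )

    def energy(x, y):
        # E(x, y) = -<local scores, y> + global energy of the relaxed label vector y
        return -(local_net(x) * y).sum() + global_net(y).squeeze()

    def gradient_descent_inference(x, steps=50, lr=0.1):
        # Relax y to [0, 1]^L and minimize E(x, y) by projected gradient descent.
        y = torch.full((n_labels,), 0.5, requires_grad=True)
        optimizer = torch.optim.SGD([y], lr=lr)
        for _ in range(steps):
            optimizer.zero_grad()
            energy(x, y).backward()
            optimizer.step()
            with torch.no_grad():
                y.clamp_(0.0, 1.0)                              # project back onto [0, 1]^L
        return (y.detach() > 0.5).int()                         # threshold to get the label set

    x = torch.randn(n_features)
    print(gradient_descent_inference(x))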

Selected publications:

  • Pedram Rooshenas, Aishwarya Kamath, Andrew McCallum (2018), “Training Structured Prediction Energy Networks with Indirect Supervision”, In Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics (HLT/NAACL).

  • David Belanger, Bishan Yang, Andrew McCallum (2017), “End-to-End Learning for Structured Prediction Energy Networks”, In Proceedings of the 34th International Conference on Machine Learning (ICML).

  • David Belanger, Andrew McCallum (2016), “Structured Prediction Energy Networks”, In Proceedings of the 33rd International Conference on Machine Learning (ICML).

Representation Learning

We work on building models for commonsense knowledge representation, currently focusing on hierarchical relation learning. We have been exploring different models for learning asymmetric hierarchical relation representations and have proposed an improved model based on previous work. We have also proposed a new evaluation for taxonomy structure learning.
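
As a toy illustration of what an asymmetric hierarchical score looks like, the sketch below uses an order-embedding-style violation penalty for "x IsA y"; the vectors are hand-picked for the example, and this is not the group's proposed model.

    import numpy as np

    emb = {                                    # hand-set, nonnegative coordinates
        "animal": np.array([0.1, 0.4]),
        "dog":    np.array([0.8, 0.9]),
        "car":    np.array([0.9, 0.05]),
    }

    def isa_violation(child, parent):
        # Zero when the child dominates the parent coordinate-wise; note the score is
        # asymmetric, so (x, y) and (y, x) are treated differently.
        return float(np.sum(np.maximum(0.0, emb[parent] - emb[child]) ** 2))

    print(isa_violation("dog", "animal"))      # 0.0: "dog IsA animal" is consistent
    print(isa_violation("animal", "dog"))      # large: the reverse direction is penalized
    print(isa_violation("car", "animal"))      # nonzero: "car IsA animal" is not supported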

Related publications:

  • Xiang Li, Luke Vilnis, Andrew McCallum (2017), Improved Representation Learning for Predicting Commonsense Ontologies, In International Conference on Machine Learning Workshop on Deep Structured Prediction (ICML WS).

  • Luke Vilnis, Andrew McCallum (2015), Word Representations via Gaussian Embedding, In International Conference on Learning Representations (ICLR), Oral Presentation.

Universal Schema

A fundamental task in language understanding is identifying relationships between entities. Most approaches to relation extraction learn a mapping from text to a small, fixed set of predefined relation types. Universal Schema instead learns an embedding space over entity pairs and relations from both structured knowledge bases and raw text, allowing the model to learn arbitrary relationship types with much greater diversity than those within a structured knowledge base alone.
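
A minimal sketch of this idea follows, with made-up entity pairs, textual patterns, and dimensions: each entity pair and each relation (KB type or textual pattern) gets an embedding, and a cell of the entity-pair-by-relation matrix is scored with a dot product.

    import numpy as np

    rng = np.random.default_rng(0)
    dim = 25

    entity_pairs = ["(Obama, Hawaii)", "(Curie, Warsaw)"]
    relations = ["kb:born_in", "text:'X was born in Y'", "text:'X grew up in Y'"]

    pair_emb = {p: rng.normal(scale=0.1, size=dim) for p in entity_pairs}
    rel_emb = {r: rng.normal(scale=0.1, size=dim) for r in relations}

    def score(pair, relation):
        # Probability that the (entity pair, relation) cell of the matrix is true.
        return 1.0 / (1.0 + np.exp(-pair_emb[pair] @ rel_emb[relation]))

    def sgd_step(pair, relation, label, lr=0.5):
        # One logistic-loss gradient step on an observed (or negatively sampled) cell.
        g = score(pair, relation) - label
        p, r = pair_emb[pair], rel_emb[relation]
        pair_emb[pair], rel_emb[relation] = p - lr * g * r, r - lr * g * p

    # Observed cells come from both the KB and raw text; random cells are sampled as negatives.
    observed = [("(Obama, Hawaii)", "kb:born_in", 1.0),
                ("(Obama, Hawaii)", "text:'X was born in Y'", 1.0),
                ("(Curie, Warsaw)", "text:'X was born in Y'", 1.0)]
    for _ in range(200):
        for pair, rel, label in observed:
            sgd_step(pair, rel, label)
        sgd_step(entity_pairs[rng.integers(2)], relations[rng.integers(3)], 0.0)

    # Because the textual pattern ties the two pairs together, the model can now score the
    # KB relation for a pair that was only ever seen in text.
    print(round(score("(Curie, Warsaw)", "kb:born_in"), 3))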

Related publications:

  • Patrick Verga, Arvind Neelakantan and Andrew McCallum (2017), Generalizing to Unseen Entities and Entity Pairs with Row-less Universal Schema, In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics (EACL).

  • Patrick Verga, David Belanger, Emma Strubell, Benjamin Roth, and Andrew McCallum (2016), Multilingual Relation Extraction using Compositional Universal Schema, In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT).

Fast and accurate low-level NLP: Tagging, chunking, parsing…

Techniques for low-level NLP tasks such as part-of-speech tagging, named entity recognition and syntactic dependency parsing are now accurate enough to be of use to practitioners who wish to extract structured information from unstructured text. This can include blog posts and discussion forums on the web, or the text of scientific research papers. Though we now wish to deploy these tools on billions of documents, many of the most accurate models were designed with no regard for computational cost. In response, our work aims to design machine learning algorithms to facilitate fast inference in NLP models while sacrificing as little accuracy as possible.

We focus on two avenues for improving the speed-accuracy trade-off. First, we develop models that quickly build rich representations of tokens in context, which are used as features in a sequential prediction model where sequence labeling is performed as a series of independent multi-class classifications. This approach allows much faster inference than, e.g., structured prediction in a graphical model, while maintaining accuracy via high-quality feature representations that incorporate wide context and a notion of neighboring labels. Second, we unify related NLP tasks into a single end-to-end model that reasons in the joint space of output labels. With this approach we aim to increase accuracy by reducing cascading errors and leveraging shared statistics of co-occurring labels, while decreasing wall-clock runtime by sharing model parameters and computation across tasks.
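
The sketch below illustrates only the broad strokes of the first idea: per-token classification over representations built with a few dilated convolutions, so each token sees wide context without structured decoding. The layer sizes and dilation schedule are arbitrary choices for the example, not the published ID-CNN configuration.

    import torch

    torch.manual_seed(0)
    vocab_size, n_tags, dim = 1000, 9, 64

    embed = torch.nn.Embedding(vocab_size, dim)
    # Dilations 1, 2, 4 with kernel size 3 give each token a 15-token receptive field.
    convs = torch.nn.ModuleList(
        [torch.nn.Conv1d(dim, dim, kernel_size=3, dilation=d, padding=d) for d in (1, 2, 4)]
    )
    classify = torch.nn.Linear(dim, n_tags)        # independent per-token classification, no CRF

    def tag(token_ids):
        h = embed(token_ids).transpose(1, 2)       # (batch, dim, seq_len) for Conv1d
        for conv in convs:
            h = torch.relu(conv(h))                # padding keeps the sequence length unchanged
        logits = classify(h.transpose(1, 2))       # (batch, seq_len, n_tags)
        return logits.argmax(-1)

    sentence = torch.randint(0, vocab_size, (1, 12))   # a batch with one 12-token sentence
    print(tag(sentence))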

Selected publications:

  • Emma Strubell, Patrick Verga, David Belanger and Andrew McCallum (2017), Fast and Accurate Entity Recognition with Iterated Dilated Convolutions, In Conference on Empirical Methods in Natural Language Processing (EMNLP).

  • Emma Strubell, Luke Vilnis, Kate Silverstein and Andrew McCallum (2015), Learning Dynamic Feature Selection for Fast Sequential Prediction, In Annual Meeting of the Association for Computational Linguistics (ACL), Outstanding paper award.

Large-Scale Clustering

Clustering is a core machine learning problem that is central to many application areas. In recent work, we developed an algorithm for clustering datasets that have both a large number of points and a large number of clusters, a regime known as extreme clustering. Our algorithm, called PERCH, consumes data points one at a time and routes them through an incrementally built tree. Crucially, when the algorithm detects sub-optimalities in the tree, it recursively invokes tree rotations to reorganize the tree and improve its balance. In other work, we have developed a framework for gradient-based hierarchical clustering. Unlike hierarchical k-means and other traditional hierarchical clustering algorithms that sequentially bisect the data, our framework optimizes all splits of the hierarchical clustering jointly.
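
A toy sketch of the incremental, tree-based insertion idea follows (nearest-leaf routing only; PERCH's rotation and balancing machinery is omitted, and the data are random).

    import numpy as np

    class Node:
        def __init__(self, point=None):
            self.point = point                 # set only at leaves
            self.children = []

        def leaves(self):
            if self.point is not None:
                return [self]
            return [leaf for child in self.children for leaf in child.leaves()]

    def insert(root, x):
        # Route the new point to its nearest leaf, then split that leaf into an internal
        # node whose two children hold the old point and the new point.
        if root is None:
            return Node(point=x)
        nearest = min(root.leaves(), key=lambda leaf: np.linalg.norm(leaf.point - x))
        old_leaf = Node(point=nearest.point)
        nearest.point, nearest.children = None, [old_leaf, Node(point=x)]
        return root

    rng = np.random.default_rng(0)
    root = None
    for x in rng.normal(size=(10, 2)):
        root = insert(root, x)
    print(len(root.leaves()))                  # 10 leaves, one per inserted point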

Related publications:

  • Ari Kobren*, Nicholas Monath*, Akshay Krishnamurthy, Andrew McCallum (2017). A Hierarchical Algorithm for Extreme Clustering, In The International Conference on Knowledge Discovery and Data Mining (KDD).

  • Nicholas Monath*, Ari Kobren*, Akshay Krishnamurthy, Andrew McCallum (2017). Gradient-based Hierarchical Clustering, In NIPS ‘17 Workshop on Discrete Structures in Machine Learning (NIPS-DISCML).

Entity Resolution with and without Human Interaction

In our ongoing work on author coreference, the goal is to partition a set of mentions, e.g., citations, by author. This problem is made difficult by a number of factors, including inconsistent name translations into English, missing fields (such as first names), and errorful records. Our approach is to learn a model of entity cohesion and then run an efficient, linkage-based inference algorithm that leverages clever nearest-neighbor data structures. We have also begun an exploration of entity resolution with humans in the loop, where we have empirically shown that feedback given with respect to inferred entity attributes is superior to more conventional, mention-level feedback.

Related publications:

  • Ari Kobren, Nicholas Monath, Andrew McCallum (2017). Entity-centric Attribute Feedback for Interactive Knowledge Bases, In NIPS ’17 Workshop on Automated Knowledge Base Construction (NIPS-AKBC).

Distributions over Clusterings

Our research focuses on representing uncertainty in clustering. We propose an inference procedure that computes the partition function, the marginal probability of a cluster, and the MAP clustering, all exactly. Rather than scaling with the Nth Bell number, these exact solutions take time and space proportional to the size of the power set of the N elements, which is substantially smaller, reducing the best known time complexity for this problem by a factor of N.
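
A small sketch of the flavor of such an exact computation follows, using a dynamic program that memoizes one value per subset of the points; the cluster potential here is made up, and this is not the group's implementation.

    from functools import lru_cache
    import math

    points = [0.0, 0.9, 5.0, 5.2]              # toy one-dimensional data
    N = len(points)

    def phi(cluster):
        # Hypothetical cluster potential: exp(-sum of pairwise distances within the cluster).
        spread = sum(abs(points[i] - points[j]) for i in cluster for j in cluster if i < j)
        return math.exp(-spread)

    @lru_cache(maxsize=None)
    def Z(subset):
        # Partition function over all clusterings of `subset` (a frozenset of indices):
        # enumerate the cluster containing the smallest remaining element, recurse on the rest.
        if not subset:
            return 1.0
        first = min(subset)
        rest = sorted(subset - {first})
        total = 0.0
        for mask in range(2 ** len(rest)):
            cluster = frozenset({first} | {rest[i] for i in range(len(rest)) if mask >> i & 1})
            total += phi(cluster) * Z(subset - cluster)
        return total

    print(Z(frozenset(range(N))))              # sums phi over all 15 partitions of the 4 points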

Low-resource NLP

In most domains of scientific literature, we usually do not have a sufficient amount of annotation to train high-quality NLP models such as named entity recognizers, entity linkers, and relation extractors. Our research investigates solutions that exploit the signal in existing data, including unsupervised learning and emphasizing informative parts of datasets during training. Another direction is to collect more data intelligently, either from the web or from humans (i.e., active learning). Examples of publications in this direction are distributional inclusion vector embedding (DIVE), a state-of-the-art unsupervised hypernym detection method; Active Bias, a deep-learning training technique motivated by active learning; and search engine supervision, a data-denoising method for relation extraction.

Related publications:

  • Haw-Shiuan Chang, Amol Agrawal, Ananya Ganesh, Anirudha Desai, Vinayak Mathur and Andrew McCallum (2018), Efficient Graph-based Word Sense Induction by Distributional Inclusion Vector Embeddings, In TextGraphs (NAACL WS).

  • Haw-Shiuan Chang, ZiYun Wang, Luke Vilnis, Andrew McCallum (2018), Distributional Inclusion Vector Embedding for Unsupervised Hypernymy Detection, In Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics (HLT/NAACL).

  • Haw-Shiuan Chang, Erik G. Learned-Miller, Andrew McCallum (2017), Active Bias: Training a More Accurate Neural Network by Emphasizing High Variance Samples, In Advances in Neural Information Processing Systems (NIPS).

  • Haw-Shiuan Chang, Abdurrahman Munir, Ao Liu, Johnny Tian-Zheng Wei, Aaron Traylor, Ajay Nagesh, Nicholas Monath, Patrick Verga, Emma Strubell, Andrew McCallum (2016), Extracting Multilingual Relations under Limited Resources: TAC 2016 Cold-Start KB construction and Slot-Filling using Compositional Universal Schema, In Text Analysis Conference, Knowledge Base Population (TAC/KBP).

2008-Present: Automatic Knowledge Base Construction, with Probabilistic Databases & Human Cooperation

Incorporating uncertainty and probabilistic inference into databases has posed many challenges, often leading to systems that sacrifice modeling power or scalability, or that restrict the class of relational algebra formulae under which they are closed. We have proposed [Wick et al, VLDB, 2009] an alternative approach in which the underlying relational database always represents a single possible world, and an external, unrestricted factor graph encodes a distribution over the set of possible worlds; Markov chain Monte Carlo (MCMC) inference is then used to recover this uncertainty to a desired level of fidelity. Our approach allows the efficient evaluation of arbitrary SQL queries over probabilistic databases with dependencies expressed by unrestricted graphical models (including graphical-model structure that changes during inference).

As a result, state-of-the-art, machine-learning-based information extraction, entity resolution, and schema alignment can be run efficiently “inside” the database. Furthermore, rather than running in pipeline fashion, they can all interoperate in the same scalable infrastructure, imparting the increased accuracy of joint inference. This also provides convenient anytime functionality for KB construction and query processing.

In addition, we advocate an approach to information integration (including knowledge-base construction by human-machine cooperation) in which the canonical “true” entities and relations in the database are always inferred from integrated or human-entered data, never injected directly. Even human “corrections” to the KB are merely additional pieces of probabilistic evidence, allowing our system to reason probabilistically about provenance and trust.

MCMC sampling provides efficiency by hypothesizing modifications to possible worlds rather than generating entire worlds from scratch. Queries are then run over the portions of the world that change, avoiding the onerous cost of running full queries over each sampled world. We leverage materialized-view maintenance techniques to dramatically speed query processing. We demonstrate the system’s ability to answer relational queries with aggregation over one million tuples, and show additional scalability through the use of parallelization.
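
The toy sketch below shows the flavor of this style of inference on a tiny coreference example: the database holds one concrete world, a proposal modifies a small piece of it, and in the real system only the factors touched by the change would need to be re-scored. All names, factors, and scores here are invented for illustration.

    import math
    import random

    random.seed(0)
    mentions = ["A. McCallum", "Andrew McCallum", "A. Smith"]

    def local_factor(i, j, same_entity):
        # Toy log-factor: a shared surname is evidence that two mentions are coreferent.
        surname_match = mentions[i].split()[-1] == mentions[j].split()[-1]
        return (2.0 if surname_match else -2.0) if same_entity else 0.0

    def world_score(world):
        return sum(local_factor(i, j, world[i] == world[j])
                   for i in range(len(world)) for j in range(i + 1, len(world)))

    world = list(range(len(mentions)))         # current possible world: each mention its own entity
    for _ in range(1000):
        proposal = list(world)
        m = random.randrange(len(mentions))
        proposal[m] = random.choice(world)     # propose moving one mention to an existing entity
        delta = world_score(proposal) - world_score(world)
        # Metropolis-Hastings acceptance; the full system re-scores only the touched factors.
        if delta >= 0 or random.random() < math.exp(delta):
            world = proposal
    print(world)                               # e.g. the two McCallum mentions share an entity id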

1997-Present: Extraction, Integration and Mining of Bibliographic Data

Back in 1997 we conceived of and led a project at JustResearch that created Cora, one of the first domain-specific search engines over computer science research papers. You can read more about our research on Cora in our IRJ journal paper or in a paper presented at the AAAI’99 Spring Symposium. The Cora team also included Kamal Nigam, Kristie Seymore, Jason Rennie, Huan Chang and Jason Reed.

More recently we have been working on an enhanced alternative to Google Scholar, CiteSeer, and other digital libraries of the research literature. Our system, called Rexa, automatically extracts a de-duplicated cross-referenced database of not just papers (and references), but also people, grants, publication venues and institutions. We also perform various kinds of topic and bibliometric impact analysis on this data.

2006-2010: Semi-supervised Learning & Alignment Learning in Natural Language

The only way to put natural language learning into the hands of the people is to reduce the burden of labeling training data. Over the years we have worked on various methods of semi-supervised learning that combine small amounts of labeled data with large amounts of unlabeled data. Our most recent work is in Generalized Expectation (GE) criteria, one form of which can be understood as enabling “feature labeling” as opposed to the traditional “instance labeling”.
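
A rough sketch of the “feature labeling” flavor of a GE term follows, with an invented toy model and target distribution (not the published formulation): the criterion compares a human-provided label distribution for a feature with the model's average predicted distribution over unlabeled instances containing that feature.

    import numpy as np

    rng = np.random.default_rng(0)
    n_features, n_labels = 6, 2
    W = rng.normal(scale=0.1, size=(n_features, n_labels))   # toy log-linear model parameters

    def predict(x):
        scores = x @ W
        exp_scores = np.exp(scores - scores.max())
        return exp_scores / exp_scores.sum()

    # Unlabeled documents as binary feature vectors; suppose feature 0 is the word "goal".
    unlabeled = rng.integers(0, 2, size=(50, n_features)).astype(float)

    # Human-provided label distribution for documents containing feature 0,
    # e.g. "about 90% of documents containing this word are about sports".
    target = np.array([0.9, 0.1])

    def ge_term(feature):
        docs = unlabeled[unlabeled[:, feature] == 1.0]
        model_expectation = np.mean([predict(x) for x in docs], axis=0)
        # KL(target || model expectation): the quantity a GE criterion asks the learner to minimize.
        return float(np.sum(target * np.log(target / model_expectation)))

    print(ge_term(0))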

We have also removed the need for human labeled data entirely by leveraging information already in relevant databases, and learning information extractors by discovering CRF-based alignments between database records and unstructured text.

2002-2010: Unified Information Extraction and Data Mining

Although information extraction and data mining appear together in many applications, their interface in most current deployments would better be described as serial juxtaposition than as tight integration. Information extraction populates slots in a database by identifying relevant subsequences of text, but is usually not aware of the emerging patterns and regularities in the database. Data mining methods begin from a populated database, and are often unaware of where the data came from, or its inherent uncertainties. The result is that the accuracy of both suffers, and significant mining of complex text sources is beyond reach.

We have been researching relational probabilistic models that unify extraction and mining, so that by sharing common inference procedures, they can each overcome the weaknesses of the other. For example, data mining run on a partially-filled database can find patterns that provide “top-down” accuracy-improving constraints to information extraction. Information extraction can provide a much richer set of “bottom-up” hypotheses to data mining if the mining is able to handle additional uncertainty information from extraction.

2001-2010: Conditional Random Fields for Relational Data, Approximate Inference and Learning

The above unified processing requires large-scale joint inference that cannot be performed exactly. We have been developing various MCMC inference methods and corresponding learning approaches aimed specifically at extremely large relational-data domains. Our approach, based on Metropolis-Hastings inference and learning by ranking, achieved best-in-the-world coreference resolution on a standard newswire dataset. This work is also quite relevant to recent interest in “combining logic and probability”.

2004-2009: Social Network Analysis with Structured Topic Models

Traditional social network analysis examines the connectivity of entities in a graph. However, in many cases we have data not just about the existence of a graph edge, but also various properties of the nodes and edges, including large quantities of corresponding textual data. We have used Bayesian latent variable models, variants of “topic models” augmented with non-textual variables, to (a) discover roles of people in the sender-receiver structure of a large email collection, (b) discover groups (coalitions) of U.S. senators or U.N. countries from their voting records and the topics of the bills, and (c) discover communities of academic researchers from their papers and the venues in which they publish.

2004-2008: Joint Inference for NLP, Dialog Pragmatics, Perception and Action

As part of a MURI project joint with UPenn, we have begun work on probabilistic modeling of natural language dialog, designing methods that will do joint, unified inference all the way from natural language understanding, through dialog pragmatics, to perception and action in a shared world. This work will leverage our research in large-scale joint inference in CRFs.

2002-2008: Intelligent Understanding of our Email World

As part of the CALO project, we extracted information about people and other entities appearing in email streams, performed large-scale entity resolution, and discovered topics and expertise.

2001-2007: Conditional Probability Models for Sequences and other Relational Data

Back in the 1990s, after having some success using hidden Markov models for information extraction, we found ourselves frustrated by their inability to incorporate many arbitrary, overlapping features of the input sequence, such as capitalization, lexicon memberships, spelling features, and conjunctions of such features in a large window of past and future observations. The same difficulties with non-independent features exist in many generatively-trained models historically used in NLP. We then began work with conditionally-trained probability models that address these problems. Maximum entropy Markov models are locally-normalized conditional sequence models. Finite-state Conditional Random Fields (CRFs) are globally-normalized models. We have also been working with CRFs for coreference and multi-sequence labeling, analogous to conditionally-trained Dynamic Bayesian Networks (DBNs). We now work with even more complex CRFs that integrate logic and probability, as described above.