If you're enrolled in the course, please sign up for the mailing list here. The email address for the list is firstname.lastname@example.org.
The code is on the course's GitHub page.
The course will consist mostly of lectures. Before each lecture, each student is expected to read a paper or book chapter (always linked from here) and send a report by email. Grades will be based on these weekly reports and on a research-level project to be completed by the end of the course. A list of suggested project topics will appear here soon.
Days marked as "project workshop" are days on which no class will be held; instead, Alex will be available to look at the current state of the projects and discuss them.
ML-based English sentence segmenter and Chinese token segmenter
The goal of this project is to implement a supervised machine learning-based sentence segmenter for English, and a supervised machine learning-based token segmenter for Chinese. While seemingly simple, both tasks have many edge cases, and the performance of real systems depends on handling them accurately and consistently. There is very little research on English sentence segmentation, and yet it is nontrivial on many interesting data types.
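To make the edge cases concrete, here is a minimal sketch of sentence segmentation as classification over candidate boundaries. The abbreviation list and the hand-written decision rule are stand-ins for what a supervised system would learn; all names here are hypothetical.

```python
# Toy sketch: treat every '.', '!', '?' as a candidate sentence boundary
# and classify it with simple features.  A learned model would replace
# the hand-written rule in is_boundary with learned weights.
ABBREVIATIONS = {"dr", "mr", "mrs", "st", "etc"}  # toy abbreviation list

def features(text, i):
    """Extract features for the candidate boundary at character i."""
    words_before = text[:i].split()
    prev = words_before[-1].rstrip(".").lower() if words_before else ""
    rest = text[i + 1:].lstrip()
    return {
        "prev_is_abbrev": prev in ABBREVIATIONS,
        "next_is_upper": bool(rest) and rest[0].isupper(),
        "end_of_text": not rest,
    }

def is_boundary(text, i):
    f = features(text, i)
    # Stand-in for a learned decision rule over the features above.
    return (f["next_is_upper"] or f["end_of_text"]) and not f["prev_is_abbrev"]

def split_sentences(text):
    """Split text at every candidate position classified as a boundary."""
    sentences, start = [], 0
    for i, ch in enumerate(text):
        if ch in ".!?" and is_boundary(text, i):
            sentences.append(text[start:i + 1].strip())
            start = i + 1
    if text[start:].strip():
        sentences.append(text[start:].strip())
    return sentences
```

Abbreviations followed by capitalized names ("Dr. Smith") are exactly the kind of edge case that makes the naive period-splitting baseline fail.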
Joint POS tagging and mention boundary/type classification
The goal of this project is to implement a system which jointly predicts part-of-speech tags and mention boundaries. Since both tasks are traditionally handled with sequence-based models, it is simple to design a joint inference algorithm which solves this problem exactly, so the main challenges in this project are doing so efficiently, at not much greater cost than independent inference, and doing so more accurately, using features from the mentions to help tagging and vice versa.
Wikification
The goal of this project is to build a system which, given a document, finds the Wikipedia pages it refers to and labels the specific spans that refer to them with links to the correct Wikipedia page. The main challenges are finding the mentions, disambiguating when there are many possible Wikipedia entities, and doing it all efficiently. We expect that using the output of within-document coreference resolution systems can help disambiguate mentions, as it allows checking the semantic compatibility of the modifiers of other mentions which corefer with the one being linked. It should also be possible to use embeddings of phrases and tokens to improve the accuracy of the semantic classifiers currently used in wikification systems.
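As a minimal illustration of the disambiguation step, one could score each candidate Wikipedia page by the similarity between an embedding of the mention's context and an embedding of the candidate. The function names and toy vectors below are hypothetical; a real wikifier would combine this signal with link priors and the coreference evidence described above.

```python
import math

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def disambiguate(context_vec, candidates):
    """Pick the candidate page whose (assumed pre-trained) embedding
    best matches the mention's context embedding.

    `candidates` maps page titles to embedding vectors.
    """
    return max(candidates, key=lambda page: cosine(context_vec, candidates[page]))
```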
Semantic role labeling
Semantic role labeling is the task of annotating a given sentence with who-did-what-to-whom information. Much of this can be extracted from a dependency parse of the sentence, though state-of-the-art parsers are prone to errors in the more semantic parts of parsing (such as prepositional-phrase attachment). We expect that leveraging a finer-grained notion of semantic compatibility from word embeddings should improve a semantic role labeling system, as should doing inference jointly across an entire corpus using dual decomposition.
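The first step, reading who-did-what-to-whom off a dependency parse, can be sketched as follows. The toy parse representation (one `(head_index, relation)` pair per token) and relation names are assumptions for illustration; real systems consume parser output in a format such as CoNLL.

```python
def extract_roles(tokens, parse):
    """Read (subject, verb, object) triples off subject/object arcs.

    `parse[i]` is a (head_index, relation) pair for token i, with head
    index -1 marking the root.  Relation names here follow a toy scheme.
    """
    triples = []
    for i, (head, rel) in enumerate(parse):
        if rel == "nsubj":
            subj, verb = tokens[i], tokens[head]
            obj = next((tokens[j] for j, (h, r) in enumerate(parse)
                        if h == head and r == "dobj"), None)
            triples.append((subj, verb, obj))
    return triples
```

Cases where the parser mis-attaches a prepositional phrase are exactly where this direct readout breaks down, which is where the embedding-based compatibility signal would come in.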
Distant supervision relation extraction
Given a knowledge base such as Freebase, with entities and their relations, and a large corpus such as Wikipedia or news articles, one can find sentences which express the relations in the knowledge base, and then learn a classifier capable of extracting more relations from text. While the problem of improving the coverage of the relations contained in Freebase has already been successfully tackled, extending this to relations which span multiple hops in Freebase is still an open problem, and both learning and inference can be improved to consider this information.
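The core alignment step of distant supervision can be sketched in a few lines: any sentence containing both entities of a knowledge-base triple is (noisily) labeled with that triple's relation. The simple substring matching below is an assumption for illustration; real systems match entity mentions more carefully and handle the resulting label noise during learning.

```python
def distant_examples(kb, sentences):
    """Generate noisy training examples by aligning KB triples to text.

    `kb` is an iterable of (entity1, relation, entity2) triples; any
    sentence mentioning both entities is labeled with the relation.
    Co-occurrences matching no triple could serve as negative examples.
    """
    examples = []
    for sent in sentences:
        for e1, rel, e2 in kb:
            if e1 in sent and e2 in sent:
                examples.append((sent, e1, e2, rel))
    return examples
```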
Deterministic coreference resolution
While the problem of within-document coreference resolution has traditionally been tackled with machine learning-based approaches, researchers at Stanford have recently published a system which is fully deterministic, quite simple, and yet performs as well as or better than most learning-based approaches. The goal of this project is to reimplement this system, adding to it richer notions of semantic compatibility obtained from word embeddings.
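The Stanford system applies a sequence of increasingly permissive rule-based passes ("sieves"). As a flavor of the approach, here is a minimal sketch of the kind of high-precision first pass one might write; the exact matching rules here are an illustration, not the published system's.

```python
def exact_match_sieve(mentions):
    """Toy first pass of a multi-pass sieve: cluster mentions whose
    surface strings match exactly, case-insensitively.  Later, more
    permissive sieves would relax matching (head match, pronouns, and,
    in this project, embedding-based semantic compatibility)."""
    clusters = {}
    for i, mention in enumerate(mentions):
        clusters.setdefault(mention.lower(), []).append(i)
    return list(clusters.values())
```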
Shallow parsing
The goal of this project is to build a state-of-the-art shallow parser, essentially a modern version of the system presented in the paper we read in class.
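Shallow parsing is typically cast as sequence tagging with BIO labels, so one small but necessary piece is decoding tag sequences back into chunk spans. A sketch, assuming tags of the form "B-NP", "I-NP", "O":

```python
def bio_to_chunks(tags):
    """Convert per-token BIO tags (e.g. from a sequence tagger) into
    (label, start, end) chunk spans, with end exclusive."""
    chunks, start, label = [], None, None
    for i, tag in enumerate(tags + ["O"]):      # sentinel flushes the last chunk
        if tag.startswith("B") or tag == "O":
            if start is not None:
                chunks.append((label, start, i))
                start = None
        if tag.startswith("B"):
            start, label = i, tag[2:]
    return chunks
```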
Error analysis on a full NLP pipeline
While the people doing each individual project should perform error analysis on their own system, it is also interesting to have a more global perspective on what is happening. The goal of this project is to analyze the output of an entire NLP pipeline on interesting out-of-domain data, and to categorize the multiple ways in which it fails, showing why each failure happens and how it can be fixed or improved.