Good decision-making depends on comprehensive, accurate knowledge. But the information relevant to many important decisions in areas such as business, government, medicine and scientific research is massive, and growing at an accelerating pace. Relevant raw data is widely available on the web and in other data sources, but to be useful it usually must be gathered, extracted, organized, and normalized into a queryable, minable knowledge base. Hand-built knowledge bases such as Wikipedia have proven useful, but more than human editing will be necessary to create a wide variety of domain-specific, deeply comprehensive, highly structured knowledge bases. Various automated methods have begun to reach levels of accuracy and scalability that make them applicable to automatically constructing useful knowledge bases from text and other sources. These capabilities have been enabled by research in areas including natural language processing, information extraction, information integration, databases, search and machine learning. In this seminar we will read relevant papers in all these areas, write responses to them, and discuss them. Students will also work together to build a system that constructs a KB of all UMass faculty, postdocs and graduate students, and strives to predict students' year of PhD completion.
- Read approximately four papers per week.
- Write short "reading responses" for each one (what did you like, what did you not like, what questions do you have). Due each Sunday by 5pm.
- Prepare 10-minute presentations for the ~2 papers you will be assigned to cover.
- Participate in discussion.
- Participate in a coordinated effort to build an AKBC system (various levels of participation possible).
1. Introduction to AKBC - Monday September 12, 2011
DBLife: A Community Information Management Platform for the Database Research Community. P. DeRose, W. Shen, F. Chen, Y. Lee, D. Burdick, A. Doan, R. Ramakrishnan. CIDR-07 (demo). DBLife web site. [Andrew]
Open Information Extraction from the Web. Michele Banko, Michael J Cafarella, Stephen Soderland, Matt Broadhead and Oren Etzioni. IJCAI, 2007. [Mike]
Toward an Architecture for Never-Ending Language Learning. A. Carlson, J. Betteridge, B. Kisiel, B. Settles, E.R. Hruschka Jr. and T.M. Mitchell. Proceedings of the Conference on Artificial Intelligence (AAAI), 2010. NELL web site. [Laura]
SOFIE: a self-organizing framework for information extraction. Suchanek, F.M., Sozio, M., and Weikum, G. Proceedings of WWW. 2009. [Andrew]
Other related, non-required reading:
Search needs a shake-up. Oren Etzioni. Nature, 476: 25-26, August 4, 2011.
2. Hand-built KBs - Monday September 19, 2011
A Semantic Web Primer. Grigoris Antoniou and Frank van Harmelen (Chapters 1, 3, 4). [Sebastian]
DBpedia - A crystallization point for the Web of Data. Christian Bizer, Jens Lehmann, Georgi Kobilarov, Sören Auer, Christian Becker, Richard Cyganiak and Sebastian Hellmann. Web Semantics: Science, Services and Agents on the World Wide Web. Volume 7 Issue 3, September, 2009 [Anton]
CYC: A Large-Scale Investment in Knowledge Infrastructure. Douglas B. Lenat. Communications of the ACM, Volume 38 Issue 11, Nov. 1995 [Limin]
Pathway Databases: A Case Study in Computational Symbolic Theories. Peter D. Karp, Science (2001) [Sebastian]
3. Named Entity Extraction - Monday September 26, 2011
Unsupervised named-entity extraction from the web: an experimental study. Oren Etzioni, Michael Cafarella, Doug Downey, Ana-Maria Popescu, Tal Shaked, Stephen Soderland, Daniel S. Weld, and Alexander Yates. Artificial Intelligence, Volume 165, Issue 1, June 2005 (skip section 5) [Jiaping]
Proceedings of the Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora. [Alexandre]
Related, non-required reading:
4. Relation Extraction - Monday October 3, 2011
Kernel Methods for Relation Extraction. Dmitry Zelenko, Chinatsu Aone, Anthony Richardella. Journal of Machine Learning Research, 2003. [Sebastian]
In Proceedings of the Association for Computational Linguistics (ACL), 2011. [Laura]
5. Entity Resolution - Tuesday October 11, 2011
CoNLL-2011 Shared Task: Modeling Unrestricted Coreference in OntoNotes. Sameer S. Pradhan, Lance A. Ramshaw, Mitchell P. Marcus, Martha Palmer, Ralph M. Weischedel, Nianwen Xue. CoNLL 2011. [Jiaping]
6. Schema & Ontology Alignment
A Unified Approach for Schema Matching, Coreference, and Canonicalization. Michael Wick, Khashayar Rohanimanesh, Karl Schultz, Andrew McCallum. In the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), Las Vegas, Nevada, 2008. [Mike]
Related non-required reading:
A Note on the Unification of Information Extraction and Data Mining using Conditional-Probability, Relational Models. Andrew McCallum and David Jensen, IJCAI'03 Workshop on Learning Statistical Models from Relational Data, 2003.
9. Targeted Information Gathering, Resource-bounded IE
[from last session:] Parsing natural scenes and natural language with recursive neural networks. Socher, Lin, Ng, and Manning (ICML 2011). [Alexandre will give a five-minute tutorial on recursive neural networks]
Selecting Actions for Resource-bounded Information Extraction using Reinforcement Learning. Kanani et al., WSDM 2012 (will be distributed to the class)
To Search or to Crawl? Towards a Query Optimizer for Text-Centric Tasks. Ipeirotis et al., SIGMOD, 2006
Researcher affiliation extraction from homepages. Nagy et al., IJCNLP 2009
Related non-required reading:
A. Krause and C. Guestrin. Near-optimal nonmyopic value of information in graphical models. In UAI'05, 2005.
10. Probabilistic Databases
Felix: Scaling Inference for Markov Logic with an Operator-based Approach. Feng Niu, Ce Zhang, Christopher Ré, Jude Shavlik. Technical Report, 2011. http://arxiv.org/pdf/1108.0294v1
[YOU CAN SKIP THE APPENDIX]
11. KB Editing by Humans
12. Large Scale Information Extraction
- Delip Rao, Paul McNamee and Mark Dredze, "Streaming Cross Document Entity Coreference Resolution", in Proceedings of Conference on Computational Linguistics (COLING), 2010 (Jiaping)
- Ryan McDonald, Keith Hall, Gideon Mann, "Distributed Training Strategies for the Structured Perceptron", North American Chapter of the Association for Computational Linguistics (NAACL), 2010 (Brian)
- Sameer Singh, Amarnag Subramanya, Fernando Pereira, Andrew McCallum, "Large-Scale Cross-Document Coreference Using Distributed Inference and Hierarchical Models", Association for Computational Linguistics (ACL), 2011 (Sameer)