CMPSCI 791KC - Automated Knowledge Base Construction

Monday 3:30-5:30pm
Rm 140
3 credits

Course Description

Good decision-making depends on comprehensive, accurate knowledge. But the information relevant to many important decisions in areas such as business, government, medicine and scientific research is massive, and growing at an accelerating pace. Relevant raw data is widely available on the web and in other data sources, but to be useful it usually must be gathered, extracted, organized, and normalized into a queryable, minable knowledge base. Hand-built knowledge bases such as Wikipedia have proven useful, but more than human editing will be necessary to create a wide variety of domain-specific, deeply comprehensive, highly structured knowledge bases. Various automated methods have begun to reach levels of accuracy and scalability that make them applicable to automatically constructing useful knowledge bases from text and other sources. These capabilities have been enabled by research in areas including natural language processing, information extraction, information integration, databases, search and machine learning. In this seminar we will read relevant papers in all these areas, write responses to them, and discuss them. Students will also work together to build a system that constructs a KB of all UMass faculty, postdocs and graduate students, and strives to predict students' year of PhD completion.

Course Work
  • Read approximately four papers per week.
  • Write short "reading responses" for each one (what did you like, what did you not like, what questions do you have).  Due each Sunday by 5pm.
  • Prepare 10-minute presentations for the ~2 papers you will be assigned to cover.
  • Participate in discussion.
  • Participate in a coordinated effort to build an AKBC system (various levels of participation possible).


1. Introduction to AKBC - Monday September 12, 2011

DBLife: A Community Information Management Platform for the Database Research Community. P. DeRose, W. Shen, F. Chen, Y. Lee, D. Burdick, A. Doan, R. Ramakrishnan. CIDR-07 (demo). DBLife web site.  [Andrew]

Open Information Extraction from the Web.  Michele Banko, Michael J Cafarella, Stephen Soderland, Matt Broadhead and Oren Etzioni.  IJCAI, 2007.  [Mike]

Toward an Architecture for Never-Ending Language Learning. A. Carlson, J. Betteridge, B. Kisiel, B. Settles, E.R. Hruschka Jr. and T.M. Mitchell. Proceedings of the Conference on Artificial Intelligence (AAAI), 2010.  NELL web site.  [Laura]

SOFIE: a self-organizing framework for information extraction. Suchanek, F.M., Sozio, M., and Weikum, G. Proceedings of WWW, 2009.  [Andrew]

Other related, non-required reading:

Search needs a shake-up.  Oren Etzioni.  Nature, 476: 25-26, August 4, 2011.

2. Hand-built KBs - Monday September 19, 2011

A Semantic Web Primer.  Grigoris Antoniou and Frank van Harmelen, (Chapters 1,3,4) [Sebastian]

DBpedia - A crystallization point for the Web of Data. Christian Bizer, Jens Lehmann, Georgi Kobilarov, Sören Auer, Christian Becker, Richard Cyganiak and Sebastian Hellmann. Web Semantics: Science, Services and Agents on the World Wide Web, Volume 7, Issue 3, September 2009 [Anton]

CYC: A Large-Scale Investment in Knowledge Infrastructure. Douglas B. Lenat, Communications of the ACM, Volume 38, Issue 11, Nov. 1995 [Limin]

Pathway Databases: A Case Study in Computational Symbolic Theories. Peter D. Karp, Science (2001) [Sebastian]

3. Named Entity Extraction - Monday September 26, 2011
Unsupervised named-entity extraction from the web: an experimental study. Oren Etzioni, Michael Cafarella, Doug Downey, Ana-Maria Popescu, Tal Shaked, Stephen Soderland, Daniel S. Weld, and Alexander Yates, Artificial Intelligence, Volume 165, Issue 1, June 2005 (skip section 5) [Jiaping]

Unsupervised Models for Named Entity Classification. M. Collins and Y. Singer. Proc. of the Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora [Alexandre]

Structured generative models for unsupervised named-entity clustering. Micha Elsner, Eugene Charniak and Mark Johnson, NAACL 2009. [Sebastian]

Related, non-required reading:
CRF Tutorial, Charles Sutton

4. Relation Extraction - Monday October 3, 2011

Kernel Methods for Relation Extraction. Dmitry Zelenko, Chinatsu Aone, Anthony Richardella, Journal of Machine Learning Research, 2003. [Sebastian]

Espresso: Leveraging Generic Patterns for Automatically Harvesting Semantic Relations. Patrick Pantel and Marco Pennacchiotti. In Proceedings of Conference on Computational Linguistics / Association for Computational Linguistics (COLING/ACL-06). pp. 113-120. Sydney, Australia. 2006. [Sebastian]

Learning Semantic Correspondences with Less Supervision. Percy Liang, Michael I. Jordan, Dan Klein, ACL 2009 [Limin]

Knowledge-Based Weak Supervision for Information Extraction of Overlapping Relations. Raphael Hoffmann, Congle Zhang, Xiao Ling, Luke Zettlemoyer, and Daniel S. Weld. In Proceedings of the Association for Computational Linguistics (ACL), 2011. [Laura]

5. Entity Resolution - Tuesday October 11, 2011
Coreference Resolution in a Modular, Entity-Centered Model. Aria Haghighi, Dan Klein, NAACL 2010 [Sameer]

E. Bengtson and D. Roth, Understanding the Value of Features for Coreference Resolution. EMNLP  (2008) [Alexandre]

Overview of the TAC 2010 Knowledge Base Population Track, by Ji, Grishman, Dang, Griffitt, Ellis, 2010 [Sam]

CoNLL-2011 Shared Task: Modeling Unrestricted Coreference in OntoNotes. Pradhan, Sameer S., Ramshaw, Lance A., Marcus, Mitchell P., Palmer, Martha, Weischedel, Ralph M., Xue, Nianwen, CoNLL 2011 [Jiaping]

6. Schema & Ontology Alignment
Semantic integration research in the database community: A brief survey, Anhai Doan, Alon Y. Halevy, AI Magazine, 2005. [David]

Annotating and searching web tables using entities, types and relationships. Girija Limaye, Sunita Sarawagi, and Soumen Chakrabarti. In Proc. of the 36th Int'l Conference on Very Large Databases (VLDB), 2010.  [Kevin]

Statistical Schema Matching across Web Query Interfaces, Bin He and Kevin Chen-Chuan Chang, SIGMOD 2003. [Arti]

A Unified Approach for Schema Matching, Coreference, and Canonicalization. Michael Wick, Khashayar Rohanimanesh, Karl Schultz, Andrew McCallum. In the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), Las Vegas, Nevada, 2008. [Mike]

7. Joint Inference
A Comparison of Loopy Belief Propagation and Dual Decomposition for Integrated CCG Supertagging and Parsing. Auli and Lopez, EMNLP 2011. [Sebastian will give a 5-min tutorial on Dual Decomposition]

Bidirectional joint inference for Entity Resolution and Segmentation using imperatively-defined factor graphs. Sameer, Karl, Andrew. ECML-PKDD 2009. [Karl will give a five-minute tutorial on Metropolis-Hastings]

Global inference for entity and relation identification via a linear programming formulation. Roth and Yih. Book chapter. [David Belanger will give a five-minute tutorial on integer linear programming]

Parsing natural scenes and natural language with recursive neural networks.  Socher, Lin, Ng, and Manning (ICML 2011). [Alexandre will give a five-minute tutorial on recursive neural networks]

Related non-required reading:
A Note on the Unification of Information Extraction and Data Mining using Conditional-Probability, Relational Models. Andrew McCallum and David Jensen, IJCAI'03 Workshop on Learning Statistical Models from Relational Data, 2003.

8. Rule Learning
Learning First-Order Horn Clauses from Web Text, Stefan Schoenmackers, Oren Etzioni, Daniel S. Weld, and Jesse Davis. Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing (EMNLP 2010) [Arti]

Learning the structure of Markov Logic Networks, S. Kok and P. Domingos, ICML 2005  [Sebastian]

A Mutually Beneficial Integration of Data Mining and Information Extraction. Un Yong Nahm and Raymond J. Mooney, AAAI 2000 [James]

[from last session:]
Parsing natural scenes and natural language with recursive neural networks.  Socher, Lin, Ng, and Manning (ICML 2011). [Alexandre will give a five-minute tutorial on recursive neural networks]

9. Targeted Information Gathering, Resource-bounded IE
Selecting Actions for Resource-bounded Information Extraction using Reinforcement Learning. Kanani et al., WSDM 2012 (will be distributed to the class)

To Search or to Crawl? Towards a Query Optimizer for Text-Centric Tasks. Ipeirotis et al., SIGMOD, 2006

Researcher affiliation extraction from homepages. Nagy et al., IJCNLP 2009

Related non-required reading:
A. Krause and C. Guestrin. Near-optimal nonmyopic value of information
in graphical models. In UAI'05, page 05, 2005.

10. Probabilistic Databases
Nilesh N. Dalvi, Christopher Ré, and Dan Suciu. Probabilistic databases: Diamonds in the dirt. Commun. ACM, Volume 52, 2009, pp. 86-94

Scalable probabilistic databases with factor graphs and MCMC. Michael Wick, Andrew McCallum, Gerome Miklau. Very Large Data Bases (VLDB) 2010. 

Felix: Scaling Inference for Markov Logic with an Operator-based Approach, Feng Niu, Ce Zhang, Christopher Ré, Jude Shavlik, Technical Report 2011 [YOU CAN SKIP THE APPENDIX]

11. KB Editing by Humans

Bayesian Knowledge Corroboration with Logical Rules and User Feedback. Gjergji Kasneci, Jurgen Van Gael, Ralf Herbrich, and Thore Graepel. European Conference on Machine Learning (ECML PKDD 2010).

12. Large Scale Information Extraction
  • Delip Rao, Paul McNamee and Mark Dredze, "Streaming Cross Document Entity Coreference Resolution", in Proceedings of Conference on Computational Linguistics (COLING), 2010 PDF (Jiaping)
  • Ryan McDonald, Keith Hall, Gideon Mann, “Distributed Training Strategies for the Structured Perceptron”, North American Chapter of the Association for Computational Linguistics (NAACL), 2010. PDF (Brian)
  • Sameer Singh, Amarnag Subramanya, Fernando Pereira, Andrew McCallum, “Large-Scale Cross-Document Coreference Using Distributed Inference and Hierarchical Models”, Association for Computational Linguistics (ACL), 2011 PDF (Sameer)