Master's Thesis Defense: Weam Abu Zaki

Time: December 1, 2009, 10:30 AM to 11:30 AM

Location: INI Lower Level Conference Room


Presenter: Weam Abu Zaki

Advisor: Professor Tom Mitchell, Fredkin Professor of AI and Machine Learning; Chair, Machine Learning Department, SCS

Reader: Professor William Cohen, Associate Research Professor, Machine Learning Department, SCS

Title: Know Little? Know Much! Iterative Multi-View Category Extraction from the Redundant Web

The abundance of free text on the Internet, contrasted with the difficulty and expense of classifying web content by hand, motivates intelligent algorithms that can autonomously acquire knowledge by analyzing web pages with little human input or attention. An emerging class of technology performs this task efficiently: algorithms that exploit the redundant occurrence of text patterns across a multitude of web pages. Since we tend to represent knowledge as entities of certain types and relations among them, we want to build systems that use this redundancy to populate such a knowledge structure for us. One such rapidly evolving system is the Read The Web (RTW) project at CMU.

In this presentation, working toward the final goal of a continuous learning system, we focus on building an important module for such a system. We investigate the problem of extracting noun phrases of a defined category from a web-scale corpus of redundancy-based noun-phrase/context pairs. We use a probabilistic multi-view iterative algorithm, co-EM (Coupled Expectation Maximization), with an underlying naive Bayes classifier to perform semi-supervised learning of multiple categories starting from very few labeled examples. We first learn each category separately and calibrate the algorithm's ranked output per category, enabling thresholded extraction of additional instances. We then explore coupling the learning of multiple categories through mutual-exclusion constraints among them, and study the effect of this coupling on the overall learning results. We conclude by analyzing our results, relating them to earlier efforts in the literature, and laying out a roadmap for future work.
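To give a rough flavor of the approach described above (this is an illustrative sketch, not the thesis implementation): co-EM treats each item as two views, here the noun phrase's tokens and its context's tokens, and alternates between them, with a weighted naive Bayes model trained on one view producing soft labels that the other view's model is trained on next round. All class names, toy data, and smoothing choices below are invented for the example.

```python
import math
from collections import defaultdict

# Hypothetical two-class setup; the real system learns many categories
# from a web-scale noun-phrase/context corpus.
CLASSES = ("city", "notcity")

def train_nb(examples, soft_labels, view):
    """Weighted naive Bayes over one view (0 = noun-phrase tokens,
    1 = context tokens), with add-one smoothing."""
    prior = {c: 0.0 for c in CLASSES}
    cond = {c: defaultdict(float) for c in CLASSES}
    total = {c: 0.0 for c in CLASSES}
    vocab = set()
    for ex, dist in zip(examples, soft_labels):
        for tok in ex[view]:
            vocab.add(tok)
            for c in CLASSES:
                cond[c][tok] += dist[c]   # fractional token counts
                total[c] += dist[c]
        for c in CLASSES:
            prior[c] += dist[c]
    return prior, cond, total, vocab

def predict(model, tokens):
    """Posterior distribution over CLASSES for one view's token list."""
    prior, cond, total, vocab = model
    v = len(vocab) or 1
    psum = sum(prior.values()) or 1.0
    logp = {}
    for c in CLASSES:
        lp = math.log((prior[c] + 1.0) / (psum + len(CLASSES)))
        for tok in tokens:
            lp += math.log((cond[c][tok] + 1.0) / (total[c] + v))
        logp[c] = lp
    m = max(logp.values())
    exp = {c: math.exp(logp[c] - m) for c in CLASSES}
    z = sum(exp.values())
    return {c: exp[c] / z for c in CLASSES}

def co_em(labeled, unlabeled, iters=6):
    """Alternate views each round: the model trained on one view assigns
    soft labels that the other view's model is trained on next."""
    ex_l = [ex for ex, _ in labeled]
    lab_l = [{c: 1.0 if c == y else 0.0 for c in CLASSES}
             for _, y in labeled]
    soft = None
    for it in range(iters):
        view = it % 2
        if soft is None:                       # bootstrap from seeds only
            model = train_nb(ex_l, lab_l, view)
        else:                                  # seeds + soft-labeled pool
            model = train_nb(ex_l + unlabeled, lab_l + soft, view)
        soft = [predict(model, ex[view]) for ex in unlabeled]
    return soft

# Seed examples: (noun-phrase tokens, context tokens) with a hard label.
seeds = [((["pittsburgh"], ["mayor", "of"]), "city"),
         ((["table"], ["on", "the"]), "notcity")]
# Unlabeled pool: shared contexts let evidence propagate between views.
pool = [(["boston"], ["mayor", "of"]),
        (["chair"], ["on", "the"])]
posteriors = co_em(seeds, pool)
```

The per-category posteriors could then be ranked and thresholded, as the abstract describes; the binary split here merely stands in for the mutual-exclusion coupling among many categories that the thesis studies.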