Doozer: Knowledge Extraction and Domain Model creation from Community-Generated Content

Overview

This project aims to extract entities and relationships from text. The name Doozer recognizes the little monsters from Fraggle Rock who tirelessly build their structures regardless of the hardship they encounter, changes in their environment, and the influence of other monsters. In an environment in which the state of knowledge changes rapidly, we too must constantly update our beliefs. For a knowledge representation approach, this means constantly updating and changing the models that describe our beliefs about the world. The two major parts of the project are:

  1. Domain model generation - extract a hierarchy of terms/instances and topics/classes from the Wikipedia corpus
  2. Pattern-based relationship extraction - extract binary relationships between entities using pattern-vectors

The technologies created in this research are currently used to build a domain model for a Human Performance and Cognition Ontology (HPCO). Snapshots of the created model can be seen here.

Domain model generation

Doozer is an application that generates or extracts a domain model from Wikipedia or other similarly structured knowledge sources. It takes as input an incomplete description of a domain, such as a query or a list of seed concepts. Doozer then expands on these seeds to find related concepts, which are in turn evaluated for how indicative they are of the domain. The output is an extended model that still focuses on the intended domain.
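The expand-then-evaluate cycle described above can be illustrated with a minimal sketch. This is not Doozer's implementation: the function name expand_and_reduce, the link dictionary, the scoring callback, and the toy scores are all hypothetical stand-ins for whatever the system derives from Wikipedia.

```python
from typing import Callable, Dict, Iterable, Set


def expand_and_reduce(
    seeds: Iterable[str],
    links: Dict[str, Set[str]],
    domain_score: Callable[[str], float],
    min_score: float,
    rounds: int = 2,
) -> Set[str]:
    """Grow a concept set from seeds, keeping only domain-indicative concepts."""
    model = set(seeds)
    frontier = set(model)
    for _ in range(rounds):
        # Expand: collect concepts linked from the frontier that are not in the model yet.
        candidates: Set[str] = set()
        for concept in frontier:
            candidates |= links.get(concept, set())
        candidates -= model
        # Reduce: keep only candidates whose domain score reaches the threshold.
        accepted = {c for c in candidates if domain_score(c) >= min_score}
        if not accepted:
            break
        model |= accepted
        frontier = accepted
    return model


# Toy data, invented for illustration only (not taken from Wikipedia or Doozer).
toy_links = {
    "Neoplasm": {"Carcinoma", "Oncology", "Greek language"},
    "Carcinoma": {"Adenocarcinoma", "Epithelium"},
}
toy_scores = {
    "Carcinoma": 0.9,
    "Oncology": 0.8,
    "Adenocarcinoma": 0.85,
    "Epithelium": 0.3,
    "Greek language": 0.05,
}

model = expand_and_reduce(
    ["Neoplasm"], toy_links, lambda c: toy_scores.get(c, 0.0), min_score=0.5
)
print(sorted(model))
```

In this sketch the low-scoring neighbours ("Greek language", "Epithelium") are dropped during the reduce step, so the model grows without drifting away from the seed domain.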

Creating a model for the Neoplasms domain

To evaluate the resulting models against a gold standard taxonomy, we ran several trial versions with different parameters on the following domain description:
seed query: Adenoma Carcinoma Vipoma Fibroma Glucagonoma Glioblastoma Leukemia Lymphoma Melanoma Myoma Neoplasm Papilloma
The Broader Focus Domain is: oncology, medicine
The World View taken is: biology

MeSH-Neoplasms comparison model 1, settings were as follows:

  Number of initial search results: 40
  Expansion threshold: 0.5
  Min p(Domain|Article): 0.1

MeSH-Neoplasms comparison model 2, settings were as follows:

  Number of initial search results: 40
  Expansion threshold: 0.5
  Min p(Domain|Article): 0.4

MeSH-Neoplasms comparison model 3, settings were as follows:

  Number of initial search results: 25
  Expansion threshold: 0.8
  Min p(Domain|Article): 0.5
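To make the effect of the last setting concrete, the sketch below records the three runs as data and applies only the min p(Domain|Article) cutoff to a handful of articles. The RunSettings class, the keep_articles helper, and the example probabilities are assumptions made for illustration; they are not Doozer's code or output, and the other two settings are stored but not interpreted here.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class RunSettings:
    initial_results: int        # number of initial search results
    expansion_threshold: float  # expansion threshold
    min_domain_prob: float      # minimum p(Domain|Article)


# Values copied from the three comparison runs listed above.
RUNS = {
    "model 1": RunSettings(40, 0.5, 0.1),
    "model 2": RunSettings(40, 0.5, 0.4),
    "model 3": RunSettings(25, 0.8, 0.5),
}


def keep_articles(probs: Dict[str, float], settings: RunSettings) -> List[str]:
    """Return the articles whose p(Domain|Article) meets the run's minimum."""
    return sorted(a for a, p in probs.items() if p >= settings.min_domain_prob)


# Hypothetical probabilities, invented for this example.
example_probs = {"Lymphoma": 0.92, "Biopsy": 0.45, "Hospital": 0.15, "Greece": 0.02}

for name, settings in RUNS.items():
    print(name, keep_articles(example_probs, settings))
```

Under these assumed probabilities, the stricter cutoffs of models 2 and 3 retain progressively fewer articles than model 1.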

Evaluation


Evaluation of the above Neoplasm models.

The following model was manually refined from model 3, a process that took about 5 minutes. The example shows that, although it is almost impossible to automatically create a model without any false positives, Doozer greatly reduces the manual labor involved in building these models.
MeSH-Neoplasms comparison model 3 - manually modified


Pattern-based relationship extraction

We learn surface pattern representations for relationship types encountered in Linked Open Data (LOD). These pattern representations are then compared to patterns found between named entities in free text. The named entities can be entered manually, extracted from free text, or taken from automatically generated domain models as described above. Below is an evaluation of the pattern-based relationship extraction algorithm. Shown are precision and recall averaged over all relationship types, as well as the maximum precision and recall, i.e. those of the relationship type that performed best over all testing data.
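One simple way to compare a learned pattern representation with the patterns observed between two entities is cosine similarity over pattern counts. The sketch below assumes this representation; the source does not specify the actual vector weighting or similarity measure, and the relationship names, patterns, counts, and the min_sim cutoff are invented for illustration.

```python
import math
from collections import Counter
from typing import Dict, Optional, Tuple


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse pattern-count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def classify(
    observed: Counter, learned: Dict[str, Counter], min_sim: float = 0.3
) -> Optional[Tuple[str, float]]:
    """Return the best-matching relationship type, or None if nothing is similar enough."""
    best = max(
        ((rel, cosine(observed, vec)) for rel, vec in learned.items()),
        key=lambda pair: pair[1],
        default=None,
    )
    if best is None or best[1] < min_sim:
        return None
    return best


# Hypothetical pattern vectors "learned" from known triples.
learned = {
    "birthPlace": Counter({"was born in": 5, "a native of": 2}),
    "author": Counter({"written by": 4, "'s novel": 3}),
}
# Hypothetical patterns seen between one entity pair in free text.
observed = Counter({"was born in": 2, "grew up in": 1})

print(classify(observed, learned))
```

Here the observed pair shares the pattern "was born in" with the birthPlace vector and nothing with the author vector, so birthPlace is returned; an entity pair with no overlapping patterns falls below the cutoff and yields None.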

Evaluation over a testing set of DBpedia triples

Evaluation over a testing set of UMLS triples
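The averaged and maximum figures reported in these evaluations can be computed from per-relationship counts as in the sketch below. It assumes an unweighted (macro) average over relationship types and takes the maximum of each metric separately; the source does not state which averaging or selection was used, and the counts shown are made up, not results from the DBpedia or UMLS test sets.

```python
from typing import Dict, Tuple


def precision_recall(tp: int, fp: int, fn: int) -> Tuple[float, float]:
    """Precision and recall from true-positive, false-positive, and false-negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def summarize(counts: Dict[str, Tuple[int, int, int]]) -> Dict[str, float]:
    """Macro-average and per-metric maximum over all relationship types."""
    if not counts:
        raise ValueError("at least one relationship type is required")
    scores = {rel: precision_recall(*c) for rel, c in counts.items()}
    precisions = [p for p, _ in scores.values()]
    recalls = [r for _, r in scores.values()]
    return {
        "avg_precision": sum(precisions) / len(precisions),
        "avg_recall": sum(recalls) / len(recalls),
        "max_precision": max(precisions),
        "max_recall": max(recalls),
    }


# Hypothetical (tp, fp, fn) counts per relationship type.
toy_counts = {"birthPlace": (8, 2, 2), "author": (3, 1, 9)}
summary = summarize(toy_counts)
print(summary)
```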

Publications

Christopher Thomas, Pankaj Mehra, Roger Brooks and Amit Sheth, Growing Fields of Interest - Using an Expand and Reduce Strategy for Domain Model Extraction, in the Proceedings of the 2008 IEEE/WIC International Conference on Web Intelligence, 2008

Christopher Thomas, Pankaj Mehra, Wenbo Wang, Amit Sheth, Gerhard Weikum and Victor Chan, Automatic Domain Model Creation Using Pattern-Based Fact Extraction, Knoesis Center technical report.