Domain-specific Knowledge Extraction from the Web of Data

Title: Domain-specific Knowledge Extraction from the Web of Data
Publication Type: Dissertation
Year of Publication: 2018
Authors: Sarasi Lalithsena
Keywords: Domain-specific knowledge extraction

Domain knowledge plays a significant role in powering a number of intelligent applications such as entity recommendation, question answering, data analytics, and knowledge discovery. Recent advances in the Artificial Intelligence and Semantic Web communities have contributed to the representation and creation of this domain knowledge in a machine-readable form. This has resulted in a large collection of structured datasets on the Web, commonly referred to as the Web of data. The Web of data has grown rapidly since its inception, which poses a number of challenges in developing intelligent applications that can benefit from its use. The majority of these applications focus on a particular domain and hence can benefit from the relevant portion of the Web of data. For example, a movie recommendation application predominantly requires knowledge of the movie domain, and a biomedical knowledge discovery application predominantly requires relevant knowledge of genes, proteins, chemicals, disorders, and their interactions. Leveraging the entire Web of data is both unnecessary and computationally intensive, and the irrelevant portion adds noise that may negatively impact the performance of the application. This motivates the need to identify and extract relevant data for domain-specific applications from the Web of data. Therefore, this dissertation studies the problem of domain-specific knowledge extraction from the Web of data.

The rapid growth of the Web of data takes place in three dimensions: 1) the number of knowledge graphs, 2) the size of the individual knowledge graphs, and 3) the domain coverage. For example, the Linked Open Data (LOD), which is a collection of interlinked knowledge graphs on the Web, started with 12 datasets in 2007 and has evolved to more than 1100 datasets in 2017. DBpedia, which is a knowledge graph in the LOD, started with 3 million entities and 400 million relationships in 2012, and has now grown to 38.3 million entities and 3 billion relationships. As we are interested in domain-specific applications and the domain of interest is already known, we propose to use the domain to reduce the other two dimensions of the Web of data. Reducing the first dimension requires reducing the number of knowledge graphs by identifying the knowledge graphs relevant to the domain. However, this may still result in large knowledge graphs such as DBpedia, Freebase, and YAGO that cover multiple domains including our domain of interest. Hence, it is also necessary to reduce the size of these knowledge graphs by identifying the relevant portion of a large knowledge graph. This leads to the two key research problems addressed in this dissertation: (1) Can we identify the relevant knowledge graphs that represent a domain? and (2) Can we identify the relevant portion of cross-domain knowledge graphs to represent the domain?

A solution to the first problem requires automatically identifying the domain represented by each knowledge graph. This can be challenging for several reasons: 1) knowledge graphs represent domains at different levels of abstraction and specificity, 2) a single knowledge graph can represent multiple domains (i.e., cross-domain knowledge graphs), and 3) the domains represented by knowledge graphs keep evolving. We propose to use existing crowd-sourced knowledge bases with their schema to automatically identify the domains, and we show the effectiveness of this approach in finding relevant knowledge graphs for specific domains. The challenge in addressing the second problem is the nature of the relationships connecting entities in these knowledge graphs. There are two types of relationships: 1) hierarchical relationships, and 2) non-hierarchical relationships. While hierarchical relationships connect in-domain and out-of-domain entities using the same relationship type and hence represent uniform semantics, non-hierarchical relationships connect in-domain entities and out-of-domain entities using different relationships, i.e., they capture diverse semantics. We propose both data-driven and knowledge-driven approaches to capture the domain relevancy of both hierarchical and non-hierarchical relationships. The solution encodes human knowledge on domain specificity as probabilistic statements and infers the most probable explanation, which captures the domain specificity of the concepts and relationships of the original knowledge graph. We present use cases related to entity recommendation for multiple domains to show the effectiveness of extracting the domain-specific subgraph. The domain-specific subgraphs extracted by our approach were 80% smaller in terms of the number of paths than the original knowledge graph and resulted in a more than tenfold reduction in the computation time required for the entity recommendation task, yet produced better accuracy.
We believe that this work will have a major impact on the use of knowledge graphs for domain-specific applications, especially given the extensive growth in the creation of knowledge graphs.
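
As a rough illustration of what extracting a domain-specific subgraph can look like, the following minimal Python sketch keeps only the triples reachable from a set of seed entities through relationships judged relevant to the domain. It is not the dissertation's method: the probabilistic inference described above is replaced here by a fixed threshold over hand-assigned relevance scores, and the function name `extract_domain_subgraph`, the toy triples, and the scores are all assumptions made for the example.

```python
from collections import deque


def extract_domain_subgraph(triples, relevance, seeds, threshold=0.5, max_hops=2):
    """Return triples reachable from the seeds via domain-relevant relationships.

    triples   : iterable of (subject, relationship, object) tuples
    relevance : dict mapping relationship -> score in [0, 1] (assumed given)
    seeds     : in-domain entities to start the traversal from
    """
    # Index outgoing edges, dropping relationships below the relevance threshold.
    outgoing = {}
    for s, p, o in triples:
        if relevance.get(p, 0.0) >= threshold:
            outgoing.setdefault(s, []).append((p, o))

    kept = []
    visited = set(seeds)
    queue = deque((seed, 0) for seed in seeds)
    while queue:
        entity, depth = queue.popleft()
        if depth == max_hops:
            continue  # do not expand beyond the hop limit
        for p, o in outgoing.get(entity, []):
            kept.append((entity, p, o))
            if o not in visited:
                visited.add(o)
                queue.append((o, depth + 1))
    return kept


# Hypothetical movie-domain example; entities and scores are illustrative only.
triples = [
    ("Inception", "director", "Christopher_Nolan"),
    ("Inception", "starring", "Leonardo_DiCaprio"),
    ("Inception", "country", "United_States"),
    ("Christopher_Nolan", "birthPlace", "London"),
]
relevance = {"director": 0.9, "starring": 0.9, "country": 0.2, "birthPlace": 0.1}

subgraph = extract_domain_subgraph(triples, relevance, seeds=["Inception"])
for triple in subgraph:
    print(triple)
```

In this sketch, edges are followed only from subject to object, and low-scoring relationships such as `country` and `birthPlace` are pruned before traversal, which is what shrinks the number of paths a downstream task such as entity recommendation has to explore.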

Citation Key: 2890