Domain-specific Document Retrieval Framework on Near Real-time Social Health Data

Title: Domain-specific Document Retrieval Framework on Near Real-time Social Health Data
Publication Type: Thesis
Year of Publication: 2015
Authors: Swapnil Soni
Academic Department: Department of Computer Science and Engineering
Degree: M.S.
Number of Pages: 63
Date Published: 05/2015
University: Wright State University
City: Dayton
Thesis Type: M.S. Thesis
Keywords: health informatics, Real-time Data Processing, Semantic Web, Social Health Signals, social media analysis, Text-Mining, twitter
Abstract

With the advent of web search and microblogging, the percentage of Online Health Information Seekers (OHIS) using these services to share and seek health information in real time has increased exponentially. Recently, Twitter has emerged as one of the primary media for sharing and seeking the latest information on a variety of topics, including health. Although Twitter is an excellent information source, identifying useful information in the deluge of tweets is a major challenge. Twitter search is limited to keyword-based techniques for retrieving information for a given query, and the results sometimes do not contain up-to-date (real-time) information. Moreover, Twitter does not use semantics to retrieve results. To address these challenges, we developed a system termed Social Health Signals, which leverages rich domain knowledge to extract relevant and reliable health information from Twitter in near real time. We have used semantics-based techniques to (1) retrieve relevant and reliable health information shared on Twitter in real time, (2) enable question answering, (3) rank results based on relevancy, popularity, and reliability, and (4) group the search results into health categories using domain knowledge (semantic categorization) to enable efficient browsing of the results. Our approach searches Twitter using several distinctive features: triple-pattern-based mining, near real-time retrieval, and search based on the URLs contained in tweets. First, the triple-pattern (subject, predicate, and object) mining technique extracts triple patterns from microblog messages related to chronic health conditions. The triple pattern is defined in the question that the user poses in natural language. Second, to make the system near real-time, the search results are divided into intervals of six hours. Third, in addition to the tweets themselves, we use the content of the URLs mentioned in the tweets as a data source. Finally, the results are ranked by relevancy and popularity so that, at any given time, the most relevant information for the question is displayed, rather than basing the results solely on temporal relevance. Our evaluation focuses on questions related to diabetes, such as "How to control diabetes?", and compares the results with those of Twitter search. To compare our results with Twitter's, we selected reliability, relevancy, and real-time features for the evaluation. We conducted a blind survey to check the relevance of the results, for which we selected three questions dealing with diabetes. To evaluate source reliability, we compared the Google domain PageRank of our top 10 results with that of Twitter's top 10 results. For the real-time feature, we compared the timestamps of the Twitter search results with those of our system's search results.
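
The six-hour intervals and the ranking step described in the abstract can be illustrated with a minimal Python sketch. This is not the thesis implementation: the `Result` fields, the score ranges, the weights in `score`, and the function names are all hypothetical, chosen only to show how results might be bucketed into six-hour windows and then ordered within each window by a combined relevancy, popularity, and reliability score.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime, timezone

WINDOW_HOURS = 6  # interval length stated in the abstract


@dataclass
class Result:
    text: str
    posted_at: datetime   # timezone-aware timestamp of the tweet
    relevancy: float      # hypothetical score in [0, 1]
    popularity: float     # hypothetical score in [0, 1]
    reliability: float    # hypothetical score in [0, 1]


def window_start(ts: datetime) -> datetime:
    """Round a timestamp down to the start of its six-hour window."""
    hour = (ts.hour // WINDOW_HOURS) * WINDOW_HOURS
    return ts.replace(hour=hour, minute=0, second=0, microsecond=0)


def score(r: Result, weights=(0.5, 0.25, 0.25)) -> float:
    """Weighted sum of the three signals; the weights are an assumption."""
    w_rel, w_pop, w_reli = weights
    return w_rel * r.relevancy + w_pop * r.popularity + w_reli * r.reliability


def rank_by_window(results):
    """Group results by six-hour window (newest window first),
    then sort each group by descending combined score."""
    buckets = defaultdict(list)
    for r in results:
        buckets[window_start(r.posted_at)].append(r)
    return {
        start: sorted(items, key=score, reverse=True)
        for start, items in sorted(buckets.items(), reverse=True)
    }
```

Ranking inside each window, rather than across all results, keeps recent content in front while still letting a relevant and reliable result outrank a merely newer one within the same window.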