PREDOSE is the name of the NIH-funded PREscription Drug abuse Online-Surveillance and Epidemiology project (July 2011 - present), an interdisciplinary collaboration between the Ohio Center for Excellence in Knowledge-enabled Computing (Kno.e.sis) and the Center for Treatment, Interventions and Addictions Research (CITAR) at Wright State University. The overall aim of PREDOSE is to develop automated mechanisms for analyzing web forum data related to the illicit use of pharmaceutical opioids.
The non-medical use of pharmaceutical opioids has been identified as one of the fastest growing forms of drug abuse in the U.S. Furthermore, significant increases in the illicit use of pharmaceutical opioids have expanded the pathways to heroin addiction and resulted in escalating rates of accidental overdose deaths. To design effective and responsive prevention and policy measures, public health professionals require timely and reliable information on new and emerging drug trends. Although existing epidemiological data systems provide critically important information about drug abuse trends, they are often time-lagged. There is therefore a need for epidemiological sources that could complement existing drug trend monitoring systems and enhance their capacity for early identification of new and emerging trends. The World Wide Web (Web) has been identified as one of the leading data sources for detecting patterns and changes in the non-medical use of pharmaceutical and other illicit drugs. Many Web 2.0-empowered social platforms, including Web forums, provide venues for individuals to freely share their experiences, post questions, and offer comments about different drugs.
This project aims to address this critical need for relevant and timely information by pursuing two (2) specific goals:
Problem: Historically, qualitative research in drug abuse intervention programs has been characterized by manual data collection, initiated through interactive interview sessions with individual addicts or groups of addicts. The transcribed interviews (audio-to-text) obtained from this process are typically annotated by researchers with themes that emerge from the interview sessions, a process called qualitative coding. Various tools, such as NVivo, have been developed to facilitate this annotation process. Such tools commonly provide researchers with additional functionality, including search, retrieval, and various levels of data analysis. However, the interactive approach does not scale: the manual effort it requires is overwhelming. Furthermore, to effectively process the large volume of heterogeneous Web-based data, the field requires a highly automated way of accessing and processing Web data.
Proposed Solution: Researchers at the Kno.e.sis Center at Wright State University have successfully applied Semantic Web, Machine Learning, and Natural Language Processing techniques to automatically extract knowledge from structured biomedical text. Substantial progress has also been made in using these (and other) techniques to understand the content and identify social perceptions of informal text on MySpace, Facebook, and Twitter, through metadata extraction and spatio-temporal and thematic analysis (i.e., semantic analysis). These cutting-edge information processing techniques, with appropriate adaptations, can now be exploited to fit the needs of public health and drug abuse research on conversational and informal text, such as that found in web forums.
The overall research plan has three (3) distinct stages. The first stage is Data Collection, intended as an alternative to manually conducted interviews. It operates under the assumption that the kind of information gathered in interview sessions is also expressed in online forums, so data crawling software can be used to collect qualitative data from web sources instead of laborious interviews. The second stage is Automatic Qualitative Coding. Through entity identification, relationship identification, and complete triple extraction, this process aims to capture the semantics of information expressed in the web forum data, with acceptable levels of precision and recall. The complete range of techniques, including rule-based, pattern-based, statistical/probabilistic, and semantics-based analysis, will play a critical role in this stage. The final stage is Data Analysis & Interpretation of the RDF data (i.e., the Drug Abuse Ontology - DAO) produced in stage 2, using existing semantic web tools at Kno.e.sis or new tools to be developed where appropriate.
Stage 1: Data Collection
- Web Site Selection: Web forums selected for the study are chosen based on the following criteria: 1) they allow free discussion of psychoactive drug use; 2) they contain information on illicit pharmaceutical drug use; and 3) they are publicly accessible. Additionally, since it is important that this study collect relevant and timely information, such forums are also expected to be very active, both in terms of number of users and topics of discussion.
- Web Crawling: Various popular crawling and parsing tools (e.g., Apache Nutch, the Jericho HTML Parser) exist for fetching and parsing web data. Periodic crawling is necessary to keep our databases updated with the most recent data published by the selected sources. Standardized web forum software somewhat alleviates the traditional problems involved in mining web data, since our custom crawlers can exploit the predictable structure of web forum sites.
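The way a custom crawler can exploit standardized forum markup can be sketched with Python's standard-library HTML parser. The `postbody` class name below is an assumption for illustration; real forum software (e.g., phpBB, vBulletin) uses its own, but similarly predictable, markup.

```python
from html.parser import HTMLParser

class ForumPostExtractor(HTMLParser):
    """Collect the text of elements whose class matches the forum's post container."""

    def __init__(self, post_class="postbody"):
        super().__init__()
        self.post_class = post_class
        self.posts = []   # extracted post texts
        self._depth = 0   # nesting depth inside the current post container

    def handle_starttag(self, tag, attrs):
        if self._depth:
            self._depth += 1          # nested tag inside a post
        elif self.post_class in (dict(attrs).get("class") or "").split():
            self._depth = 1           # entering a new post container
            self.posts.append("")

    def handle_endtag(self, tag):
        if self._depth:
            self._depth -= 1

    def handle_data(self, data):
        if self._depth:
            self.posts[-1] += data

page = """
<html><body>
  <div class="postbody">first post: oxy 40mg</div>
  <div class="sig">ignore me</div>
  <div class="postbody">second post about <b>heroin</b></div>
</body></html>
"""
parser = ForumPostExtractor()
parser.feed(page)
print(parser.posts)  # -> ['first post: oxy 40mg', 'second post about heroin']
```

Note that void tags such as `<br>` would unbalance the depth counter; a production crawler needs more care than this sketch.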
- Data Cleaning: One of the most challenging problems in dealing with web data is decoding special HTML characters to obtain ASCII text and separating special characters from standard text.
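The entity-decoding step can be sketched with Python's standard library; the tag-stripping and whitespace rules here are illustrative, not the project's actual cleaning pipeline.

```python
import html
import re

def clean_post(raw: str) -> str:
    """Decode HTML character entities and normalize whitespace in a forum post."""
    text = html.unescape(raw)             # e.g. "&amp;" -> "&", "&#39;" -> "'"
    text = re.sub(r"<[^>]+>", " ", text)  # strip any residual HTML tags
    text = re.sub(r"\s+", " ", text)      # collapse runs of whitespace
    return text.strip()

print(clean_post("I took 40mg &amp; felt <b>nothing</b>&#8230;"))
```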
- Location Resolution: Collecting location data is important for spatio-temporal-thematic analysis. It would not be surprising for drug abuse practices involving specific drugs (e.g., heroin) to vary vastly across continents. The most anticipated variations are in drug mixtures; for example, combining heroin and cocaine may be common practice in one region while entirely uncommon in another.
- Informal Text Database: It is necessary to collect and store a wide selection of data for this study. Database tables include users, posts, sources, and locations (city, state, country, continent, zip).
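A minimal sketch of such a database, using SQLite for illustration; the table and column names below are assumptions, not the project's actual schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE source   (id INTEGER PRIMARY KEY, name TEXT, url TEXT);
CREATE TABLE location (id INTEGER PRIMARY KEY, city TEXT, state TEXT,
                       country TEXT, continent TEXT, zip TEXT);
CREATE TABLE user     (id INTEGER PRIMARY KEY, screen_name TEXT,
                       location_id INTEGER REFERENCES location(id));
CREATE TABLE post     (id INTEGER PRIMARY KEY,
                       user_id INTEGER REFERENCES user(id),
                       source_id INTEGER REFERENCES source(id),
                       posted_at TEXT, content TEXT);
""")

# Store one crawled post (illustrative content).
conn.execute(
    "INSERT INTO post (user_id, source_id, posted_at, content) VALUES (?, ?, ?, ?)",
    (1, 1, "2012-01-15", "first time trying bupe, any advice?"),
)
print(conn.execute("SELECT content FROM post").fetchone()[0])
```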
Stage 2: Automatic Qualitative Coding
This is the most challenging aspect of the project. The aim is to use various information extraction techniques to extract triples from web forum data. Such extraction is undertaken in three steps:
- Entity Identification: The most challenging aspect of entity identification in web forum data is the informal nature of the text. Web forum data is characterized by a proliferation of slang terms instead of standard references to known drugs. Fortunately, slang-term-to-drug mappings are available online through various sources (NIDA, ONDCP, Erowid, Urban Dictionary, etc.). We exploit these sources as a starting point for recognizing slang terms that reference known drugs. However, these mappings create the unfortunate side effect of ambiguity: "Oxy" can refer to OxyContin, Generic OxyContin, OxyContin OP, or OxyContin OC. Hence, techniques for slang term disambiguation become necessary. We have so far taken a probabilistic approach to entity disambiguation, since the terms surrounding an ambiguous slang term are often slang themselves and therefore do not help semantics-based approaches that leverage the ontology schema.
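A probabilistic disambiguation step of this kind can be sketched as a tiny naive-Bayes scorer over context words. The sense inventory, priors, and likelihoods below are made-up illustrations, not estimates from project data.

```python
import math

# Toy sense inventory and statistics, for illustration only; real values
# would be estimated from hand-labeled forum posts.
SENSES = {"oxy": ["OxyContin OP", "OxyContin OC"]}
PRIOR = {"OxyContin OP": 0.5, "OxyContin OC": 0.5}
# P(context word | sense): OP tablets gel when crushed, OC tablets do not.
LIKELIHOOD = {
    ("OxyContin OP", "gel"):   0.30,
    ("OxyContin OC", "gel"):   0.02,
    ("OxyContin OP", "crush"): 0.10,
    ("OxyContin OC", "crush"): 0.15,
}

def disambiguate(slang, context_words):
    """Pick the sense with the highest log-posterior given the context words."""
    def score(sense):
        s = math.log(PRIOR.get(sense, 1e-6))
        for w in context_words:
            # Unseen (sense, word) pairs get a small smoothing probability.
            s += math.log(LIKELIHOOD.get((sense, w), 1e-6))
        return s
    return max(SENSES[slang], key=score)

print(disambiguate("oxy", ["crush", "gel"]))  # -> OxyContin OP
```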
- Relationship Extraction: We anticipate that the success of our entity extraction, together with the Drug Abuse Ontology schema, will directly impact relationship extraction. However, alternative relationship extraction techniques have been covered elsewhere and will be adapted where appropriate.
- Triple Extraction: Previous work in the lab (Ramakrishnan C., Mendes P. N., et al.) has successfully implemented rule-based triple extraction on structured biomedical literature. Other work (Thomas C., Mehra P., et al.) has implemented a statistical/probabilistic approach to triple extraction, also on structured text. Such techniques are not likely to apply directly to informal web forum text. Hence, one approach is to translate our informal text into structured text once entities and relationships have been identified. Alternatively, stand-alone pattern-based, probabilistic, and semantics-based techniques can be used to complete triple extraction, depending on the effectiveness of the entity and relationship extraction.
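As an illustration of the stand-alone pattern-based option, the following sketch matches a drug mention followed by a causal cue phrase. The drug lexicon and cue patterns are toy assumptions, not the project's rule set.

```python
import re

# Toy drug lexicon standing in for the slang dictionary / DAO schema.
DRUGS = {"oxycontin", "heroin", "suboxone", "bupe"}

def extract_triples(sentence):
    """Pattern-based extraction of (drug, 'causes', effect-phrase) triples."""
    triples = []
    text = sentence.lower()
    for drug in DRUGS:
        # <drug> followed by a causal cue, then the effect phrase.
        m = re.search(rf"\b{drug}\b\s+(?:made me|gave me|caused)\s+([a-z ]+)", text)
        if m:
            triples.append((drug, "causes", m.group(1).strip()))
    return triples

print(extract_triples("The Suboxone made me nauseous for hours."))
```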
- Drug Abuse Ontology (DAO): The final output of triple extraction is the population of the Drug Abuse Ontology instance base. Together with the DAO schema, we intend to maintain this as a dynamic ontology created from user-generated content (UGC).
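Populating the instance base can be sketched as serializing extracted triples into RDF N-Triples; the namespace IRI below is a placeholder, not the published DAO namespace.

```python
# Placeholder namespace for illustration; not the actual DAO IRI.
DAO = "http://example.org/dao#"

def to_ntriples(triples):
    """Serialize (subject, predicate, object-literal) triples as N-Triples lines."""
    lines = []
    for subj, pred, obj in triples:
        s = f"<{DAO}{subj.replace(' ', '_')}>"
        p = f"<{DAO}{pred}>"
        o = f"\"{obj}\""  # object kept as a plain literal in this sketch
        lines.append(f"{s} {p} {o} .")
    return "\n".join(lines)

print(to_ntriples([("Suboxone", "causes", "nausea")]))
# -> <http://example.org/dao#Suboxone> <http://example.org/dao#causes> "nausea" .
```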
Stage 3: Data Analysis & Interpretation
- Semantic Web Tools: Many tools for data analysis exist at Kno.e.sis, including: 1) Twitris, for spatio-temporal-thematic analysis; 2) Cuebee, for automatic complex query creation over RDF data; and 3) Scooner, for guided navigation of documents annotated with semantic metadata (entities or triples). Once the DAO has been created, the data can easily be infused into any of these tools to support analysis. Alternatively, new tools can be created on demand.
- Spatio-Temporal-Thematic Analysis: Discussion on integrating web forum data into Twitris has already begun. Owing to the use of the slang term dictionary, qualitative researchers will be able to observe posts containing easily identifiable, unambiguous references to known drugs in various locations.
This project is sponsored by the National Institutes of Health (NIH) Grant No. R21 DA030571-01A1 awarded to the Ohio Center for Excellence in Knowledge-enabled Computing (Kno.e.sis) and the Center for Treatment, Interventions and Addictions Research (CITAR) titled “A Study of Social Web Data on Buprenorphine Abuse using Semantic Web Technology.” Any opinions, findings, conclusions or recommendations expressed in this material are those of the investigator(s) and do not necessarily reflect the views of the National Institutes of Health.
Contact: Delroy Cameron