Please check out a comprehensive tutorial from our group at the WWW 2011 conference:
Citizen Sensor Data Mining, Social Media Analytics and Development Centric Web Applications
For further reading on Citizen Sensing research, see Computing for Human Experience: Semantics-Empowered Sensors, Services, and Social Computing on the Ubiquitous Web
Over the last few years, there has been a growing public fascination with 'social media' and its role in modern society. At the heart of this fascination is the ability for users to create and share content via a variety of platforms such as blogs, micro-blogs, collaborative wikis, multimedia sharing sites, social networking sites etc. Our research primarily focuses on the analysis of various aspects of User-Generated Content (UGC) that are central to understanding inter-personal communication on social media. More recently, our interdisciplinary collaboration is studying People-Content-Network analysis. The objective of our work on semantic content analysis is to bring structure and organization to unstructured chatter on social media centered around the following questions:
On one hand, the social context surrounding the production, consumption, and sharing of user-generated content has opened several opportunities for enriching user interaction with content. On the other hand, this same social aspect to content production has introduced new challenges in terms of the content's informal nature.
User-generated textual content in social media has unique characteristics that set it apart from the traditional content we find in news or scientific articles. Due to social media's personal and interactive communication format, user-generated content is inherently less formal and unmediated. Off-topic discussions are common, making it difficult to automatically identify context. Content is often fragmented, doesn't always follow English grammar rules, and relies heavily on domain- or demographic-specific slang, abbreviations, and entity variations (using "skik3" for "SideKick 3", for example). Some user-generated content is also terse by nature, such as in Twitter posts, which leaves minimal clues for automatically identifying context. All of these factors make the process of automatically identifying what a social media snippet is actually about much harder.
Moreover, the recent popularity of the Web and online social media has revolutionized human interaction. As the Internet has become an integral part of human life, the Web now hosts user interactions at an unprecedented scale. This shift in human social dynamics requires research, because it is not clear how well traditional research methods in the social sciences apply at this new scale of social interaction. We therefore need an interdisciplinary approach in which Computer Science methods, building on insights from existing theories in other disciplines (Social Science, Psychology, Linguistics, etc.), enable quantitative analysis of this massive social data and help us understand the evolving human social dynamics.
The current phase of our work is identifying micro-level variables from the People, Content and Network dimensions of a social network, which will ultimately lead to an understanding of the macro-level behavior of social phenomena such as information dissemination, influence spread, community evolution and sustainability, and coordination in communities. It will also allow us to validate the classical theories of the Social Sciences in this digital realm.
Our work in this space spans eight closely-aligned projects:
Depending on the context, we have studied, analyzed and evaluated a variety of content from Twitter, Facebook, MySpace and Wikipedia, and we have used a variety of relevant background knowledge to advance traditional computational techniques.
Our work in identifying what people talk about on social media focuses on recognizing named entities, with particular focus on a class of entities called Cultural Entities - those that refer to artifacts of culture, for example, names of movies, TV shows, songs and book titles. In addition to referring to multiple real-world entities (e.g. "The Lord of the Rings" can refer to multiple instances of movies, different video games and a number of novels), cultural entities are particularly hard to extract because of their use of fragments from everyday language.
In recent work with Amir Padovitz at Microsoft Research, we explored a feature-based approach to improve the accuracy of existing named entity classifiers in identifying such cultural entities. We hypothesized that knowing how hard it is to extract an entity is useful for learning better entity classifiers. With such a measure, entity extractors become "complexity aware", i.e., they can learn to respond differently to signals depending on the entity's extraction difficulty. We proposed and developed an unsupervised algorithm to extract this prior using graph-based spreading activation and clustering techniques. We evaluated the approach on identifying movie named entities in informal weblog posts and found overwhelming evidence that this new prior improves extraction accuracy, supporting our hypothesis about engineering 'complexity aware' classifiers.
Another of our investigations into identifying cultural entities, conducted with researchers at IBM Almaden, used MusicBrainz, a rich source of domain knowledge about music entities and their relationships (encoded in RDF), to annotate artist and track/album mentions in UGC from MySpace music forums [ISWC09a]. In this work, we showed that eliminating parts of the domain model, using constraints implied in the content and metadata from the domain model, effectively reduces entity disambiguation scenarios and improves spotting precision. For example, a comment such as 'Saw you last night in Denver' indicates that the artist is still alive, allowing us to rule out parts of the ontology that mention artists who are not. We also showed that simple Machine Learning classifiers built over such pruned models, and learning over a combination of feature types (Natural Language features, domain-related words such as music, song and concert, and sentiment expression features), yielded better results than using any one feature type alone.
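The pruning idea can be illustrated with a small Python sketch. This is not the system described in [ISWC09a]: the `Artist` record, its `active` flag, the `LIVE_CUES` phrases and the artist names are all hypothetical stand-ins for constraints that the actual work derives from MusicBrainz and from the content itself.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Artist:
    name: str
    active: bool  # hypothetical flag: the artist is alive and still performing


# Hypothetical cue phrases that suggest the commenter recently saw the artist live.
LIVE_CUES = ("saw you", "see you at", "your show last")


def implies_active_artist(comment: str) -> bool:
    """Return True if the comment contains a cue implying a living, performing artist."""
    text = comment.lower()
    return any(cue in text for cue in LIVE_CUES)


def prune_candidates(comment: str, candidates: list) -> list:
    """Drop candidate artists that contradict a constraint implied by the comment."""
    if implies_active_artist(comment):
        return [artist for artist in candidates if artist.active]
    return list(candidates)


candidates = [Artist("Artist A", active=True), Artist("Artist B", active=False)]
remaining = prune_candidates("Saw you last night in Denver", candidates)
```

With fewer candidates left after pruning, a downstream spotter has fewer ambiguous matches to resolve, which is the effect the paragraph above attributes to the constrained domain model.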
In other related efforts, we worked on providing spatio-temporal-thematic summaries of chatter on Twitter using contextual information from the social medium [WISE09]. As part of a targeted content delivery platform, we used an information theory based algorithm to eliminate off-topic chatter in user-generated content on MySpace and Facebook forums to detect the main topic of discussion [WI09]. All of these studies highlighted pertinent challenges that informal UGC brings to text analytics.
In collaboration with Prof. Marti Hearst at UC Berkeley, we analyzed how men and women present themselves in their online dating profiles [ICWSM09]. We studied language usage in three steps: first, quantifying the use of words from the linguistic, personal and psychological categories of the Linguistic Inquiry and Word Count (LIWC) dictionary; next, applying a multivariate statistical approach called Exploratory Factor Analysis to identify systematic co-occurrence patterns among LIWC variables; and finally, grouping user profiles on the basis of their shared multi-dimensional features to compare and contrast self-presentation strategies.
We found interesting results in language usage by the gender groups that deviated from the results of past studies of the general speech and writing habits of men and women. In particular, we found that men used a higher proportion of tentative words (e.g., 'maybe', 'could' and 'perhaps'), a class of words typically attributed to feminine discourse.
We also found many similarities in actual word usage, more so in open-class word categories (like the affect and verb groups). We believe that self-expression in online dating profiles may tend towards attempted homophily, given the tendency to 'imitate and impress' in courtship.
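The first step, quantifying category usage, can be sketched as follows. The LIWC dictionary itself is not reproduced here; `TOY_LEXICON` is a hypothetical two-category stand-in, and the function name is ours. The factor analysis and profile grouping steps are not shown.

```python
import re
from collections import Counter

# Hypothetical miniature lexicon; the real LIWC dictionary has many more
# categories and words.
TOY_LEXICON = {
    "tentative": {"maybe", "could", "perhaps"},
    "pronoun": {"i", "we", "you"},
}


def category_proportions(text, lexicon):
    """Return, per category, the fraction of tokens in `text` that belong to it."""
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return {category: 0.0 for category in lexicon}
    counts = Counter()
    for token in tokens:
        for category, words in lexicon.items():
            if token in words:
                counts[category] += 1
    return {category: counts[category] / len(tokens) for category in lexicon}


profile = "Perhaps we could meet"
proportions = category_proportions(profile, TOY_LEXICON)
```

Each profile then becomes a vector of category proportions, which is the kind of input a factor analysis over LIWC variables would operate on.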
Users post a large number of messages on social media every day, and an important task for any application that responds to a message is to identify its underlying intent. Our work on identifying the intents behind user posts targets the monetization of user activity on social networks [WI09]. Unlike in web search, the presence of an entity alone does not classify intent accurately. Any of these intentions could occur with a product X: 'i am thinking of getting X' (transactional); 'i like my new X' (information sharing); and 'what do you think about X' (information seeking). Our approach to automatic identification of intents relied on 'action patterns', i.e., patterns of words surrounding entity X. Using a set of seed 'action patterns' indicating intent, we developed a minimally supervised bootstrapping algorithm that learns new intent-revealing patterns from an un-annotated corpus of 10K user posts from MySpace. The intent tendencies of new patterns are computed using their semantic similarity (based on the communicative functions of words from the LIWC dictionary) and distributional similarity with the seed patterns.
As part of a targeted content delivery application [WI09], we found that the newly learned patterns were effective in identifying monetizable posts, i.e., those with information seeking and transactional intents. We also found that users were 8 times more likely to click on ads that were generated from the monetizable footprints they left on the network (in wall posts and forum messages) than on ads generated from their profile information (hobbies, activities, etc.). The example below shows a user post and the contextual ads generated for it using Google AdSense when the content has off-topic keywords (upper half). The bottom half shows the ads after off-topic keywords have been eliminated.
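A minimal Python sketch of the 'action pattern' idea follows. It uses the three example patterns above as seeds, but it swaps in plain Jaccard word overlap for the LIWC-based semantic and distributional similarity used in the actual work, and it omits the bootstrapping loop. The function names and the `window` parameter are our own assumptions.

```python
import re

# Seed patterns taken from the examples in the text; X marks the entity slot.
SEED_PATTERNS = {
    "transactional": ["i am thinking of getting X"],
    "information sharing": ["i like my new X"],
    "information seeking": ["what do you think about X"],
}


def action_pattern(post, entity, window=5):
    """Return the words around `entity` in `post`, with the entity replaced by X."""
    tokens = re.findall(r"[a-z0-9']+", post.lower())
    entity_tokens = re.findall(r"[a-z0-9']+", entity.lower())
    n = len(entity_tokens)
    for i in range(len(tokens) - n + 1):
        if tokens[i:i + n] == entity_tokens:
            left = tokens[max(0, i - window):i]
            right = tokens[i + n:i + n + window]
            return left + ["X"] + right
    return None  # entity not mentioned in the post


def jaccard(a, b):
    """Word-set overlap; a stand-in for the similarity measures in the text."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0


def closest_intent(pattern):
    """Return the (intent, score) of the seed pattern most similar to `pattern`."""
    best_intent, best_score = None, 0.0
    for intent, seeds in SEED_PATTERNS.items():
        for seed in seeds:
            score = jaccard(pattern, seed.split())
            if score > best_score:
                best_intent, best_score = intent, score
    return best_intent, best_score


pattern = action_pattern("I am thinking of getting a Sidekick 3 soon", "Sidekick 3")
intent, score = closest_intent(pattern)
```

In a bootstrapping setting, patterns that score highly against the seeds would be added to the seed set and the process repeated over the un-annotated corpus.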
Figure 1. Difference in the targeted ads after eliminating off-topic keywords
Understanding information, influence or popularity propagation in online social networks is a challenging problem with several contributing factors such as the timeliness of the information, the participants responsible for sharing the information, the structure of the network etc.
We believe that there is a three-dimensional dynamic at play in how information propagates through a network: the people involved (passionate advocate or objective observer), the content being propagated (fact-sharing or emotionally charged) and the connections between the people all play a role in how information spreads. Understanding these micro-level variables and their interactions will shed light on macro-level consequences, e.g., political decisions or consumer behaviors.
In a recent study of tweeting practices [ICWSM10], we analyzed viral content on Twitter, i.e., tweets that were passed around or retweeted the most in a given period of time.
We focused on topical tweets generated by communities that gather on Twitter around real-world events. Our study includes popular events of the year 2009 -- the Iran Election, the Health Care Reform debate and the International Semantic Web Conference (ISWC). These events were selected to represent varied characteristics in terms of social significance, the populations they attracted, the time periods and lengths of time they spanned, and the variety of Twitter activity they generated.
We analyzed the 250 most viral tweets, which were identified through n-gram analysis of the tweet content. For each tweet, we gathered all instances of the tweet in the dataset and plotted the retweet connections between the authors of the tweets as a directed network. For example, if user A retweets user B, an edge is drawn from node B to node A, indicating the direction of information diffusion. We found two clear patterns in analyzing the networks.
Figure 2. Graphs showing sparse (A) and dense (B) RT networks and their corresponding follower graphs for 'call
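The edge convention used for the retweet networks (an edge from B to A when A retweets B) can be sketched in Python as below. The input format, a list of `(retweeter, original_author)` pairs, and the density measure used to contrast sparse and dense networks are our assumptions, not details taken from [ICWSM10].

```python
from collections import defaultdict


def build_rt_network(retweets):
    """Build a directed graph as {author: set of users who retweeted that author}.

    Each input pair is (retweeter, original_author); the edge points from the
    original author to the retweeter, following the direction of diffusion.
    """
    graph = defaultdict(set)
    for retweeter, original_author in retweets:
        graph[original_author].add(retweeter)
    return dict(graph)


def density(graph):
    """Fraction of possible directed edges present (0.0 for fewer than 2 nodes)."""
    nodes = set(graph)
    for targets in graph.values():
        nodes |= targets
    n = len(nodes)
    edges = sum(len(targets) for targets in graph.values())
    return edges / (n * (n - 1)) if n > 1 else 0.0


# Hypothetical data: A and C retweet B, then D retweets A.
retweets = [("A", "B"), ("C", "B"), ("D", "A")]
network = build_rt_network(retweets)
network_density = density(network)
```

A density close to zero corresponds to a sparse network in which few retweet paths are traceable, while higher values indicate many explicit retweet links among the same set of authors.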
The patterns were consistent across the events, even though the events varied in the populations they attracted and the goals of their communities. This categorization is certainly not exhaustive, but it suggests an important finding: the content being tweeted plays a key role in what an explicit retweet network will look like and, in many cases, whether it will be traceable at all. We believe that the suggestive relationship between the tweet type and its retweet pattern will contribute to the study of link-based diffusion models.
In another study, we explored the dynamics of user engagement in social media [SoME'11,WWW'11]. The shift in information consumption in today's social media demands effective methods for understanding the new forms of user engagement, the factors that affect them, and the fundamental reasons behind them. We performed an exploratory analysis on Twitter to understand the dynamics of user engagement by studying what attracts a user to participate in discussions on a topic. We identified various factors that might affect user engagement, ranging from content properties and network topology to user characteristics on the social network, and used them to predict user joining behavior. Rather than studying these factors separately, as is traditionally done, we organized them in our framework, People-Content-Network Analysis (PCNA), which is designed mainly to enable understanding of human social dynamics on the Web. We performed experiments on various Twitter user communities formed around topics from diverse domains, with varied social significance, duration and spread. Our findings suggest that the predictive capabilities of content, user and network features vary greatly, which motivates incorporating all of these factors in user engagement analysis and studying the dynamics of user engagement with the PCNA framework. Our study also reveals a correlation between the type of event behind a discussion topic and the impact of the user engagement factors.
Trust relationships occur naturally in many diverse contexts such as e-commerce, social interactions, social networks, ad hoc mobile networks, distributed systems, decision-support systems, the (semantic) sensor web, emergency response scenarios, etc. As the connections and interactions between humans and/or machines (collectively called agents) evolve, and as the agents providing content and services become increasingly removed from the agents that consume them, miscreants attempt to corrupt, subvert or attack existing infrastructure. This in turn calls for support for robust trust inference (e.g., gleaning, aggregation, propagation) and update (also called trust management). Unfortunately, there is neither a universal notion of trust that is applicable to all domains nor a clear explication of its semantics in many situations. Because Web, social networking and sensor sources often provide complementary and overlapping information about an activity or event that is critical for overall situational awareness, there is a unique need to develop an understanding of, and techniques for, managing trust across all these information channels. Currently, we are pursuing research on trust and trustworthiness issues in interpersonal, social, and sensor networks, with the aim of unifying and integrating them to exploit their complementary strengths.
Social media serves as a platform for people to speak their mind more freely, which leads to a growing volume of opinionated data that can be used by: (1) individuals, as suggestions and recommendations; (2) companies, for making marketing strategies and other decisions; (3) governments, for monitoring social phenomena, being aware of potentially dangerous situations, etc.
As a popular social media service, Twitter provides a convenient and instant way for people to share their opinions through tweets anytime, anywhere. To make better use of opinionated tweets, we need methods for automatically identifying them and classifying them according to their sentiment polarity. As an initial attempt, we address the problem of topical sentiment analysis of tweets and propose a novel approach to constructing a domain- and context-aware sentiment lexicon for this task. The proposed technique has been evaluated on the movie domain, and the results show that it significantly outperforms several baseline methods. We are currently extending the evaluation to multiple domains.
Check out our Twitris Wiki page and a short summary of Twitris' Capabilities.
Twitris 2.0 is a Semantic Web application that facilitates understanding of social perceptions through semantics-based processing of massive amounts of event-centric data. Twitris 2.0 addresses challenges in large-scale processing of social data while preserving spatio-temporal-thematic properties. Twitris 2.0 also covers context-based semantic integration of multiple Web resources and exposes semantically enriched social data to the public domain. Semantic Web technologies enable the system's integration and analysis abilities.
The emergence of microblogging platforms such as Twitter, FriendFeed, etc. has revolutionized how unfiltered, real-time information is disseminated and consumed by citizens. Twitter has therefore emerged as the preeminent medium for sharing citizen-sensor observations, as was demonstrated in a variety of situations ranging from the Mumbai terrorist attack to the Iran elections.
While the decentralized information diffusion model offered by Twitter has gained momentum and created avenues for experiential data sharing, the millions of observations shared through tweets create significant information overload. In many cases it becomes nearly impossible to make sense of the information around a topic of interest. This problem is further compounded by the fact that tweets increasingly integrate other social networking sites (Flickr, TwitPic) and general Web content (news, Wikipedia, blogs) through embedded links and metadata. Given this data deluge, it can be extremely challenging to analyze the numerous social signals carried by tweets and associated content to find out what is being said about an event (thematic), where (spatial) and when (temporal), how key concerns (topics of discussion) are changing over a period of time, and whether there are regional differences in the opinions on a given topic.
In response to this growing data deluge, we have developed Twitris (currently Twitris 2.0) with the vision of performing semantics-empowered analysis of a broad variety of social media content. Specifically, Twitris aims to capture semantics (i.e., meaning and understanding) along spatial, temporal and thematic dimensions, together with user intentions and sentiments, networking behavior (user interaction patterns and features such as information diffusion and centrality) and other information present in social media. Semantic Web technologies enable its core integration, analysis and data/knowledge sharing abilities. Twitris 2.0 focuses only on content-centric analysis, leveraging the relevant Semantic Web technologies, background knowledge, languages and tools where appropriate.
Twitris 2.0 is a Semantic Social Web approach to detect social signals by analyzing massive, event-centric data through:
a. Analysis of casual text with spatio-temporal-thematic (STT) bias, to extract event descriptors.
b. Capturing semantics from contexts associated with tweets.
c. Use of deep semantics (using automatically created domain models) to understand the meaning of standard event descriptors.
d. Use of shallow semantics (semantically annotated entities) for knowledge discovery and representation.
e. Exposure of processed social data to the public domain, complying with Semantic Web standards.
f. Semantic Integration of multiple external Web resources (news, articles, images and videos) utilizing the semantic similarity between contexts.
Twitris 2.0 is developed as a multi-layered system where each component acts as part of a pipeline. Here is a Functional Overview of Twitris.
The system is currently being used for a number of People-Content-Network study experiments and is being extended to integrate with SMS and other Web data used by a number of widely deployed open source projects. These include applications used by non-governmental organizations (NGOs) in developing countries for crisis management (in particular, Ushahidi.org, eMoksha.org and Kiirti.org). Twitris 2.0 is being extended with Twarql technology for limited real-time support and is being adapted to a cloud platform for much higher scalability.
Twitris is part of a larger research agenda on Computing for Human Experience at the Ohio Center of Excellence in Knowledge-Enabled Computing (Kno.e.sis) at Wright State University, Dayton, Ohio (other key themes include semantics-enriched services computing and the sensor Web).